Data-driven models are revolutionizing science and industry. Collecting, streaming, processing, and validating the data behind them requires scalable systems. This course is an introduction to “big” data engineering in which students gain hands-on experience building and deploying realistic data-intensive systems. It covers streaming, data cleaning, relational data modeling and SQL, and machine learning model training. A core theme of the course is “scale”: we will discuss the theory and practice of programming with large external datasets that cannot fit in main memory on a single machine. The course consists of bi-weekly programming assignments, a midterm examination, and a final examination.
This course is intended to be an overview of the design and implementation of data-intensive systems for non-majors. The content is organized into lectures (L0-L22) and practica (P0-P6).
Number | Topic | Video |
L0 | What is Data Engineering? | Link |
L1 | What is “Big Data”? | Link |
L2 | Perspectives on Data | Link |
L3 | A Survey of Big Data Infrastructure | Link |
— | Data Storage and Encoding | — |
L4 | Dictionary Encoding | Link |
L5 | Fixed and Variable Length Codes | Link |
L6 | Huffman Coding | Link |
L7 | Hashing | Link |
L8 | Physical Design | Link |
P0 | Financial Time Series Alignment | Link |
— | Information Retrieval | — |
L9 | Information Retrieval | Link |
L10 | Text Retrieval | Link |
P1 | String Similarity Metrics | Link |
L11 | Approximate Similarity Search | Link |
P2 | Inverted Index Search | Link |
L12 | Knowledge Bases | Link |
L13 | Joins | Link1 & Link2 |
P3 | Geospatial Joins | Link |
— | Basic Systems Techniques | — |
L14 | Iterators and Streaming | Link |
L15 | Data Flow Model | Link |
L16 | Aggregation | Link |
P4 | Python Query Engine | Link |
L17 | Out-of-Core Algorithms | Link |
— | Distributed Systems | — |
L18 | Shared Storage vs. Shared Nothing | Link1 & Link2 |
L19 | More on Distributed Systems | Link |
L20 | Data Shuffling | Link |
P6 | Apache Spark | Link |
L21 | Wrapping Up | Link |