Introduction to Data Engineering

Data-driven models are revolutionizing science and industry, and scalable systems are needed to collect, stream, process, and validate the data behind them. This course is an introduction to “big” data engineering in which students will receive hands-on experience building and deploying realistic data-intensive systems. It will cover streaming, data cleaning, relational data modeling and SQL, and machine learning model training. A core theme of the course is “scale”, and we will discuss the theory and practice of programming with large external datasets that cannot fit in main memory on a single machine. The course will consist of bi-weekly programming assignments, a midterm examination, and a final.

This course is intended to be an overview of the design and implementation of data-intensive systems for non-majors. The content is organized into lectures (L0-L21) and practica (P0-P6).

Number	Topic	Video
L0	What is Data Engineering?	Link
L1	What is “Big Data”?	Link
L2	Perspectives on Data	Link
L3	A Survey of Big Data Infrastructure	Link
—	Data Storage and Encoding	—
L4	Dictionary Encoding	Link
L5	Fixed and Variable Length Codes	Link
L6	Huffman Coding	Link
L7	Hashing	Link
L8	Physical Design	Link
P0	Financial Time Series Alignment	Link
—	Information Retrieval	—
L9	Information Retrieval	Link
L10	Text Retrieval	Link
P1	String Similarity Metrics	Link
L11	Approximate Similarity Search	Link
P2	Inverted Index Search	Link
L12	Knowledge Bases	Link
L13	Joins	Link1 & Link2
P3	Geospatial Joins	Link
—	Basic Systems Techniques	—
L14	Iterators and Streaming	Link
L15	Data Flow Model	Link
L16	Aggregation	Link
P4	Python Query Engine	Link
L17	Out-of-Core Algorithms	Link
—	Distributed Systems	—
L18	Shared Storage vs. Shared Nothing	Link1 & Link2
L19	More on Distributed Systems	Link
L20	Data Shuffling	Link
P6	Apache Spark	Link
L21	Wrapping Up	Link