Introduction to Data Engineering

Data-driven models are revolutionizing science and industry, and scalable systems are needed to collect, stream, process, and validate the data behind them. This course is an introduction to “big” data engineering in which students receive hands-on experience building and deploying realistic data-intensive systems. It covers streaming, data cleaning, relational data modeling and SQL, and machine learning model training. A core theme of the course is “scale”: we will discuss the theory and the practice of programming with large external datasets that cannot fit in main memory on a single machine. The course consists of bi-weekly programming assignments, a midterm examination, and a final.
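
To give a flavor of what programming with data that does not fit in main memory can look like, here is a minimal Python sketch that computes an average in a single pass, holding only one row at a time. It is an illustration under stated assumptions, not course material: the helper names `read_values` and `streaming_mean`, the `price` column, and the tiny in-memory sample are all hypothetical.

```python
import csv
import io
from typing import Iterable, Iterator, TextIO


def read_values(stream: TextIO, column: str) -> Iterator[float]:
    """Yield one numeric value per row without loading the whole file."""
    for row in csv.DictReader(stream):
        yield float(row[column])


def streaming_mean(values: Iterable[float]) -> float:
    """Compute the mean in one pass using constant memory."""
    count = 0
    total = 0.0
    for value in values:
        count += 1
        total += value
    if count == 0:
        raise ValueError("no values to average")
    return total / count


# Hypothetical stand-in for a file too large to load at once.
sample = io.StringIO("price\n10.0\n20.0\n30.0\n")
mean_price = streaming_mean(read_values(sample, "price"))
```

Because `read_values` is a generator, the same code would work unchanged on an open file handle of any size; only the running count and total are kept in memory.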

This course is intended to be an overview of the design and implementation of data-intensive systems for non-majors. The content is organized into lectures (L0-L22) and practica (P0-P6).

L0: What is Data Engineering? (Link)
L1: What is “Big Data”? (Link)
L2: Perspectives on Data (Link)
L3: A Survey of Big Data Infrastructure (Link)

Data Storage and Encoding
L4: Dictionary Encoding (Link)
L5: Fixed and Variable Length Codes (Link)
L6: Huffman Coding (Link)
L8: Physical Design (Link)
P0: Financial Time Series Alignment (Link)

Information Retrieval
L9: Information Retrieval (Link)
L10: Text Retrieval (Link)
P1: String Similarity Metrics (Link)
L11: Approximate Similarity Search (Link)
P2: Inverted Index Search (Link)
L12: Knowledge Bases (Link)
L13: Joins (Link1 & Link2)
P3: Geospatial Joins (Link)

Basic Systems Techniques
L14: Iterators and Streaming (Link)
L15: Data Flow Model (Link)
P4: Python Query Engine (Link)
L17: Out-of-Core Algorithms (Link)

Distributed Systems
L18: Shared Storage vs. Shared Nothing (Link1 & Link2)
L19: More on Distributed Systems (Link)
L20: Data Shuffling (Link)
P6: Apache Spark (Link)
L21: Wrapping Up (Link)