Introduction to Data Engineering

Course Description: Data-driven models are revolutionizing science and industry. Scalable systems are needed to collect, stream, process, and validate data at scale. This course is an introduction to “big” data engineering where students will receive hands-on experience building and deploying realistic data-intensive systems. It will cover streaming, data cleaning, relational data modeling and SQL, and machine learning model training. A core theme of the course is “scale”, and we will discuss the theory and the practice of programming with large external datasets that cannot fit in main memory on a single machine. The course will consist of bi-weekly programming assignments, a midterm examination, and a final.

Location: MWF 9:30-10:20 SHFE 203

Office Hours: MW 4:30-5:30 243 JCL (Sanjay)

Office Hours (TA): Wed 11-12 (Rose), Thurs 9:30-10:30 (Will) both in 259 JCL

Grading: Quizzes (10%), Homework (20%), Midterm (30%), Final (40%). The exam schedule is listed below:

  • Midterm (6:30 pm-8:30 pm May 10)
  • Final (10:30am-12:30pm June 12)
  • For any conflicts, a makeup exam will be scheduled prior to these times. It is your responsibility to coordinate this well in advance.

Late Policy: Late work receives 0%. Reasonable exceptions will be considered, including family emergencies and illness.

Official Communication: The TA(s) and Instructor WILL NOT respond to personal emails. Please communicate through Piazza, either with a public post if the question is of general interest or with a private message.

Date  Topic                                         Assignments/Exams
4/1   Course Introduction (pdf)
4/3   Iterators (pdf)
4/5   Operators (pdf) (submission instructions)     HW0
4/8   Composing Operators (pdf)
4/10  Main-Memory Aggregation (pdf)
4/12  Out-of-core algorithms (pdf)
4/15  Out-of-core cont’d / Hash Join (pdf)          HW1
4/17  In-Class Quiz
4/19  Parallelism (pdf)                             HW0 due
4/22  Parallelism Cont’d (pdf)
4/24  SQL I (pdf) (queries.sql) (Quiz 1 solutions)
4/26  SQL II (pdf)
4/29  SQL III (pdf)                                 HW2, HW1 due
5/1   Text Retrieval (pdf)
5/3   Text Retrieval (pdf)
5/6   Transitive Closure (pdf)
5/8   No Class
5/10  ML Systems I                                  Midterm
5/13  ML Systems II                                 HW2 due
5/15  Integrity Constraints I (pdf) (wiki)
5/17  Integrity Constraints II
5/20  Integrity Constraints III (Quiz Solutions)    HW3
5/22  ETL/Data Extraction I (pdf)
5/24  Privacy I (pdf)
5/29  Privacy II                                    HW4 out
5/31  Approximation I
6/3   Approximation II                              HW3 due
6/5   Approximation III
6/12  Final 10:30am-12:30pm                         HW4 due