Data-driven models are revolutionizing science and industry. This course covers computational methods for structuring and analyzing data to facilitate decision-making. We will cover algorithms for transforming and matching data; hypothesis testing and statistical validation; and bias and error in real-world datasets. A core theme of the course is “generalization”: ensuring that the insights gleaned from data are predictive of future phenomena. The course will include bi-weekly programming assignments, a midterm examination, and a final examination.
Time and Place: MWF 930a–1020a RO 11
Exams: (MT) Nov 6, 930a-1020a; (Final) Dec 11, 1030a-1230p
Quizzes: Periodic take-home quizzes announced in class and posted to the website.
Office Hours: Sanjay MWF 1030a-1130a (243 JCL), Xi 130-230 (TBD), Qiming 230-330 (205 JCL)
Grading: 0.2*Midterm + 0.2*Quizzes + 0.4*Final + 0.2*Homework
Non-Letter Grade: If applicable, students who DO NOT want a letter grade must say so by Nov 27. They are still expected to complete all the assignments to pass.
The course will be divided into four modules of roughly 2-3 weeks each. Each module studies a data science problem in detail (both the math and the programming!) and culminates in a programming assignment. The skills learned in each module are cumulative, and notes will be periodically posted below.
Module 1. Opinion Polling
Public opinion polls play an important role in politics, marketing, and economics. The first module uses opinion polling as a gentle introduction to the course, teaching both programming skills (using Python for data analysis) and analytical skills (basic descriptive statistics).
|Probability/Random Variables||(L1, L2, Self Study Code 1, L3, Code 2)|
|Bias in Sampling Processes|
|Python Bias Simulation Assignment|
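To give a flavor of this module, here is a minimal sketch of a sampling-bias simulation. It is not the assignment itself: the population, the true support level, and the response rates are all made-up assumptions chosen for illustration. The idea is that when the chance of answering a poll depends on the answer, the poll average drifts away from the true value, no matter how many polls are run.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical population: 1 = supports a proposal, 0 = does not.
TRUE_SUPPORT = 0.5
# Assumed response rates (not real polling data): supporters answer more often.
RESPONSE_RATE = {1: 0.6, 0: 0.3}

population = [1 if random.random() < TRUE_SUPPORT else 0 for _ in range(50_000)]

def simple_random_poll(pop, n):
    """Estimate support from a uniform random sample of size n."""
    return statistics.mean(random.sample(pop, n))

def biased_poll(pop, n):
    """Estimate support when the chance of responding depends on the answer."""
    responses = []
    while len(responses) < n:
        person = random.choice(pop)
        if random.random() < RESPONSE_RATE[person]:
            responses.append(person)
    return statistics.mean(responses)

# Repeat each kind of poll many times and compare the averages.
unbiased = [simple_random_poll(population, 1000) for _ in range(100)]
biased = [biased_poll(population, 1000) for _ in range(100)]

print("true support:       ", round(statistics.mean(population), 3))
print("random-sample polls:", round(statistics.mean(unbiased), 3))
print("biased polls:       ", round(statistics.mean(biased), 3))
```

Under these assumed rates, the biased polls should settle near 0.5·0.6 / (0.5·0.6 + 0.5·0.3) ≈ 0.67 rather than the true 0.5, while the random-sample polls stay close to the truth.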
Module 2. Hypothesis Testing
Students will learn how to design experiments, test for significance, and interpret/present data with Python.
|Introduction to Pandas|
|Descriptive Statistics and Aggregation|
|Hypothesis Testing and Significance|
|Failures of Data-Driven Approaches|
|Permutation Test Assignment|
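As an illustration of the kind of test this module builds toward, the sketch below implements a two-sided permutation test for a difference in means using only the standard library. The function name `permutation_test`, its parameters, and the measurements are hypothetical; the actual assignment may be structured differently.

```python
import random

def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns the observed difference and an estimated p-value.
    """
    rng = random.Random(seed)
    n_a = len(group_a)
    observed = sum(group_a) / n_a - sum(group_b) / len(group_b)
    pooled = list(group_a) + list(group_b)
    extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the group labels are interchangeable,
        # so shuffle the pooled values and re-split them.
        rng.shuffle(pooled)
        diff = sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / (len(pooled) - n_a)
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimate away from exactly zero.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value

# Made-up measurements, for illustration only.
treatment = [12.1, 11.8, 13.4, 12.9, 13.0, 12.5]
control = [11.2, 11.5, 11.9, 11.1, 12.0, 11.4]
diff, p = permutation_test(treatment, control)
print(f"observed difference = {diff:.3f}, p-value = {p:.4f}")
```

A small p-value says that a difference this large rarely arises from relabeling alone. It does not say how large or how important the effect is.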
Module 3. Forecasting
Forecasting is the process of making predictions of the future based on past data. This module will teach the basic process of forecasting, when it works, and how to evaluate a model’s efficacy. We will conclude by illustrating the connections between forecasting and modern Artificial Intelligence.
|Introduction to Numpy|
|Training and Testing|
|Linear Models (why are they so common?)|
|Introduction to scikit-learn|
|Election Prediction Assignment|
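The train/test idea in this module can be sketched in a few lines. The example below fits a one-feature linear model by ordinary least squares on a training split and reports error on held-out points. It deliberately uses only the standard library to stay self-contained; in the course the same steps would more likely be written with NumPy and scikit-learn. The data is synthetic and the helper names `fit_line` and `mean_squared_error` are hypothetical.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mean_squared_error(xs, ys, slope, intercept):
    """Average squared gap between the line's predictions and the targets."""
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(0)
# Synthetic data: y = 2x + 1 plus Gaussian noise (made up for illustration).
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [2 * x + 1 + rng.gauss(0, 1) for x in xs]

# Hold out the last 25% of points; the model never sees them during fitting.
split = int(0.75 * len(xs))
train_x, test_x = xs[:split], xs[split:]
train_y, test_y = ys[:split], ys[split:]

slope, intercept = fit_line(train_x, train_y)
train_mse = mean_squared_error(train_x, train_y, slope, intercept)
test_mse = mean_squared_error(test_x, test_y, slope, intercept)

print("slope, intercept:", round(slope, 2), round(intercept, 2))
print("train MSE:", round(train_mse, 3))
print("test MSE: ", round(test_mse, 3))
```

The test error is the number that speaks to generalization: a model is only useful for forecasting if its error on data it was not fit to stays close to its training error.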
Module 4. Data Integration
Data integration involves combining data residing in different sources and providing users with a unified view of them. This module will focus on the problem of record linkage, where differing rows in multiple datasets refer to the same real-world entity. We will study algorithms to efficiently resolve these differences.
|Data and Schema Integration|
|Naive Matching and Jaccard Similarity|
|String Matching and Edit Distance|
|Precision, Recall, Approximations|
|Amazon vs. Google Challenge Assignment|
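The topics in this module fit together as a small pipeline, sketched below under stated assumptions: word-level Jaccard similarity, Levenshtein edit distance, naive all-pairs matching with a threshold, and precision/recall against known matches. The product names, the 0.6 threshold, and the helper names are invented for illustration and are not taken from the assignment data.

```python
def jaccard(a, b):
    """Jaccard similarity of two strings, compared as sets of lowercase words."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)

def edit_distance(s, t):
    """Levenshtein distance via dynamic programming, keeping one row at a time."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                # delete from s
                            curr[j - 1] + 1,            # insert into s
                            prev[j - 1] + (cs != ct)))  # substitute or keep
        prev = curr
    return prev[-1]

def precision_recall(predicted, actual):
    """Precision and recall of predicted match pairs against true match pairs."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall

# Made-up records from two hypothetical catalogs.
left = ["apple ipod nano 8gb", "sony headphones mdr"]
right = ["ipod nano 8gb apple", "sony mdr headphone", "canon camera"]
true_matches = {(0, 0), (1, 1)}

# Naive matching: compare every pair and keep those above a threshold.
THRESHOLD = 0.6  # assumed value; lowering it trades precision for recall
predicted = {(i, j)
             for i, a in enumerate(left)
             for j, b in enumerate(right)
             if jaccard(a, b) >= THRESHOLD}

print("predicted matches:", sorted(predicted))
print("precision, recall:", precision_recall(predicted, true_matches))
print("edit distance:", edit_distance("headphones", "headphone"))
```

Here word-level Jaccard misses the second pair because "headphones" and "headphone" count as different words, even though their edit distance is small. The all-pairs loop is also quadratic in the number of records, which is why approximations matter at scale.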