Data Science For Computer Scientists CMSC 21800

Data-driven models are revolutionizing science and industry. This course covers computational methods for structuring and analyzing data to facilitate decision-making. We will cover algorithms for transforming and matching data; hypothesis testing and statistical validation; and bias and error in real-world datasets. A core theme of the course is “generalization”: ensuring that the insights gleaned from data are predictive of future phenomena. The course will include bi-weekly programming assignments, a midterm examination, and a final.

Course Policies

Time and Place: MWF 930a–1020a RO 11

Exams: (MT) Nov 6, 930a-1020a; (Final) Dec 11, 1030a-1230p

Quizzes: Periodic take-home quizzes, announced in class and posted to the website.

Office Hours: Sanjay: MWF 1030-1130 (243 JCL), Xi: Tu 330-430 (205 JCL), Qiming: F 230-330 (205 JCL)

Grading: 0.3*Midterm + 0.4*Final + 0.3*Homework

Non-Letter Grade: If applicable, students who do NOT want a letter grade must say so by Nov 27. They are still expected to complete all the assignments to pass.

Practice Tests: (Midterm 1a, Midterm 1a Solutions)

Course Structure

The course will be divided into four modules of roughly 2-3 weeks each. Each module studies a data science problem in detail (both the math and the programming!) and culminates in a programming assignment. The skills learned in each module will be cumulative, and notes will be periodically posted below.

Module 1. Opinion Polling

Public opinion polls play an important role in politics, marketing, and economics. The first module uses opinion polling as a gentle introduction to the course, teaching both programming skills (using Python for data analysis) and analytical skills (basic descriptive statistics).

Topic Notes
Introduction (L0)
Probability/Random Variables (L1, L2, Self Study Code 1, L3, Code 2)
Sampling Statistics (L1, L2)
Bias in Sampling Processes (L1, Example)
Python Polling Assignment (Assignment, Errata, Solution)
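
As a rough illustration of the sampling statistics in this module, the sketch below estimates support for a candidate from a simulated poll and attaches a normal-approximation margin of error. The names `poll_estimate` and `TRUE_SUPPORT`, the 52% support level, and the sample size are all assumptions made up for this example; they are not taken from the course notes or the assignment.

```python
import math
import random


def poll_estimate(responses, z=1.96):
    """Return (proportion, margin_of_error) for a list of 0/1 responses.

    Uses the normal approximation: z * sqrt(p * (1 - p) / n).
    z = 1.96 corresponds to a roughly 95% confidence level.
    """
    n = len(responses)
    p_hat = sum(responses) / n
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, margin


# Hypothetical population: 52% support (an assumption for this sketch).
TRUE_SUPPORT = 0.52
random.seed(0)

# Simulate a simple random sample of 1000 respondents.
sample = [1 if random.random() < TRUE_SUPPORT else 0 for _ in range(1000)]

p_hat, margin = poll_estimate(sample)
print(f"estimated support: {p_hat:.3f} +/- {margin:.3f}")
```

The margin shrinks with the square root of the sample size, so quadrupling the number of respondents only halves it. The formula also assumes an unbiased simple random sample, which is exactly what the "Bias in Sampling Processes" topic calls into question.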

Module 2. Hypothesis Testing

Students will learn how to design experiments, test for significance, and interpret/present data with Python.

Topic Notes
Introduction to Pandas (L1)
Hypothesis Testing and Significance (Reading 1, Code 1, L1, L2)
Failures of Data-Driven Approaches (Reading 2, Reading 3, L1)
Exploratory Data Analysis (Code 1, L1)
Correlation (L1)
Hypothesis Test Assignment (Assignment)
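
One way to make "testing for significance" concrete is a permutation test, sketched below in plain Python. This is only one possible test and is not necessarily the method used in the lectures or the assignment; the function name `permutation_test` and the two measurement groups are hypothetical.

```python
import random


def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns the fraction of random relabelings whose absolute mean
    difference is at least as large as the observed one (an estimated
    p-value under the null hypothesis that the labels do not matter).
    """
    rng = random.Random(seed)
    size_a = len(group_a)
    observed = abs(sum(group_a) / size_a - sum(group_b) / len(group_b))

    pooled = list(group_a) + list(group_b)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        new_a, new_b = pooled[:size_a], pooled[size_a:]
        diff = abs(sum(new_a) / len(new_a) - sum(new_b) / len(new_b))
        if diff >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_permutations


# Hypothetical measurements from a treatment and a control group.
treatment = [10, 11, 12, 13, 14, 15]
control = [1, 2, 3, 4, 5, 6]
p_value = permutation_test(treatment, control)
print(f"estimated p-value: {p_value:.4f}")
```

A small p-value says the observed difference would be unusual if group membership were irrelevant. It does not measure how large or how important the effect is.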

Module 3. Forecasting

Forecasting is the process of making predictions of the future based on past data. This module will teach the basic process of forecasting, when it works, and how to evaluate a model’s efficacy. We will conclude by illustrating the connections between forecasting and modern Artificial Intelligence.

Topic Notes
Introduction to Feature Engineering (L1, L2)
Rules, Training, and Testing
Regression
Classification
Game of Thrones Prediction Assignment
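
The train-then-test idea behind evaluating a forecast can be sketched as follows: fit a model on earlier data, then measure its error on later data it has not seen. The synthetic series, the 40/10 split, and the helper names `fit_line` and `mean_squared_error` are assumptions for illustration only, and simple least-squares regression stands in for whatever models the module actually covers.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_squared_error(xs, ys, slope, intercept):
    """Average squared gap between predictions and observed values."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Synthetic series: a linear trend plus Gaussian noise (made-up data).
rng = random.Random(0)
days = list(range(50))
values = [2.0 * t + 1.0 + rng.gauss(0, 1) for t in days]

# Train on the past, test on the future: no shuffling for time-ordered data.
split = 40
train_x, test_x = days[:split], days[split:]
train_y, test_y = values[:split], values[split:]

slope, intercept = fit_line(train_x, train_y)
train_mse = mean_squared_error(train_x, train_y, slope, intercept)
test_mse = mean_squared_error(test_x, test_y, slope, intercept)
print(f"slope={slope:.2f} intercept={intercept:.2f}")
print(f"train MSE={train_mse:.2f} test MSE={test_mse:.2f}")
```

Comparing the training error with the held-out error is a basic check on generalization: a model that does well only on the data it was fitted to has not learned anything predictive.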

Module 4. Data Integration

Data integration involves combining data residing in different sources and providing users with a unified view of them. This module will focus on the problem of record linkage, where differing rows in multiple datasets refer to the same real-world entity. We will study algorithms to efficiently resolve these differences.

Topic Notes
Data and Schema Integration
Naive Matching and Jaccard Similarity
MinHash
String Matching and Edit Distance
Transitive Closure
Precision, Recall, Approximations
Amazon vs. Google Challenge Assignment
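
A minimal sketch of naive matching with Jaccard similarity, assuming records are plain product-name strings compared as sets of lowercase word tokens. The sample records, the 0.5 threshold, and the helper names `tokens`, `jaccard`, and `match_records` are hypothetical and do not come from the assignment data.

```python
import re


def tokens(text):
    """Lowercase a string and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity |A intersect B| / |A union B| of two sets."""
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets are identical
    return len(a & b) / len(a | b)


def match_records(left, right, threshold=0.5):
    """Naive all-pairs matching: compare every left row to every right row."""
    matches = []
    for i, l_row in enumerate(left):
        for j, r_row in enumerate(right):
            score = jaccard(tokens(l_row), tokens(r_row))
            if score >= threshold:
                matches.append((i, j, score))
    return matches


# Made-up product listings from two hypothetical catalogs.
catalog_a = ["Apple iPod Nano 8GB Silver", "Canon PowerShot Digital Camera"]
catalog_b = ["apple ipod nano 8gb (silver)", "Microsoft Office Home Edition"]

matches = match_records(catalog_a, catalog_b)
print(matches)
```

The all-pairs loop costs one comparison per pair of rows, which grows quadratically with the size of the datasets. That cost is the motivation for approximate techniques such as MinHash, listed in the topics above.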