Data-driven models are revolutionizing science and industry. This course covers computational methods for structuring and analyzing data to facilitate decision-making. We will cover algorithms for transforming and matching data; hypothesis testing and statistical validation; and bias and error in real-world datasets. A core theme of the course is “generalization”: ensuring that the insights gleaned from data are predictive of future phenomena.

### Course Structure

**(UPDATED!!!)** For Fall 2020, this course will be an online course with recorded lectures (streamed live MWF at 9:30am). On Mondays and Fridays, the lecture will be a conceptual lecture introducing theory and concepts. On Wednesdays, we will work together on a problem set or coding example during the allotted lecture time.

**Recommended Reading**: [Naked Statistics](https://www.amazon.com/Naked-Statistics-Stripping-Dread-Data/dp/039334777X/) and [The Art of Statistics](https://www.amazon.com/Art-Statistics-How-Learn-Data/dp/1541618513/)

**Exams and Grading**: There will be 3 take-home exams and a quarter-long final course project. Grades are computed as `0.3*Exam1 + 0.3*Exam2 + 0.2*Exam3 + 0.2*Project`.

### Lectures

The main lectures of the course introduce core topics in data science and show how to connect theoretical concepts in statistics with real-world data analysis problems.

The recent pandemic illustrates just how hard it is to quantify and measure real-world phenomena. This lecture describes the types of data we may measure, populations (and their samples), and different types of biases that may enter the data collection process.

Many real-world datasets are collected over a period of time. This lecture describes how to model and represent time-series data.

A Latin phrase that means “from the smaller to the bigger”. What do we learn about a population from concise summary statistics, and when can these insights be misleading? This lecture covers measures of the “typical” (mean, median, mode) and measures of spread (variance, inter-quartile range, and support).

Why can very small samples predict so much about a population? This lecture describes the key results of the central limit theorem and explains how to quantify the error in a sample estimate.
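A minimal simulation sketch of this idea, using only the Python standard library and a made-up skewed population (exponential with mean 1): the spread of the sample mean shrinks roughly as sigma divided by the square root of the sample size, just as the central limit theorem predicts.

```python
import random
import statistics

random.seed(0)

def sample_mean(n):
    # Mean of n independent draws from a skewed population (Exp(1)).
    return statistics.fmean(random.expovariate(1.0) for _ in range(n))

def std_error_of_mean(n, trials=2000):
    # Empirical standard error: spread of the sample mean across trials.
    means = [sample_mean(n) for _ in range(trials)]
    return statistics.stdev(means)

se_small = std_error_of_mean(25)    # n = 25
se_large = std_error_of_mean(400)   # n = 400, a 16x larger sample

# A 16x larger sample should shrink the error by roughly 4x.
print(se_small, se_large)
```

Even though the underlying population is far from normal, the distribution of the sample mean tightens predictably as the sample grows.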

Recently, it seems like opinion polls get everything wrong. This lecture describes how systematic biases in sampling can skew estimates and discusses stratified sampling techniques.

Correlation does not imply causation. Just because one variable can predict another does not mean there is a cause-and-effect relationship between them. This lecture defines correlation and differentiates this concept from causation.
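The classic way this arises is a hidden confounder. The sketch below uses entirely made-up data: daily temperature drives both ice cream sales and swimming accidents, so the two observed variables correlate strongly even though neither causes the other.

```python
import random

random.seed(1)
# Hidden confounder: daily temperature (Celsius).
temp = [random.uniform(10, 35) for _ in range(500)]
# Both observed variables depend on temperature, not on each other.
ice_cream = [2.0 * t + random.gauss(0, 5) for t in temp]
accidents = [0.5 * t + random.gauss(0, 2) for t in temp]

def pearson(xs, ys):
    # Pearson correlation coefficient, computed from first principles.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

r = pearson(ice_cream, accidents)
print(r)  # strong positive correlation, with no causal link between them
```

The correlation is real and would support prediction, but intervening on ice cream sales would do nothing to accident rates.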

A randomized controlled trial is the primary method for determining causation. This lecture describes hypothesis testing, significance, and the assumptions needed to demonstrate causation beyond a reasonable doubt.
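One simple way to make hypothesis testing concrete is a permutation test, sketched below on made-up control and treatment measurements: if the group labels were truly interchangeable (the null hypothesis), how often would a random relabeling produce a difference as large as the one observed?

```python
import random
import statistics

random.seed(3)
# Hypothetical outcome measurements for two groups of six subjects.
control = [4.8, 5.1, 5.0, 4.9, 5.2, 4.7]
treatment = [5.9, 6.1, 5.8, 6.3, 6.0, 5.7]

observed = statistics.mean(treatment) - statistics.mean(control)

# Permutation test: shuffle the pooled data and recompute the
# difference in means under random group assignments.
pooled = control + treatment
trials, count = 10_000, 0
for _ in range(trials):
    random.shuffle(pooled)
    diff = statistics.mean(pooled[6:]) - statistics.mean(pooled[:6])
    if diff >= observed:
        count += 1

p_value = count / trials
print(observed, p_value)  # a tiny p-value: the difference is significant
```

A small p-value says the observed gap is very unlikely under random assignment alone; the randomization in the trial design is what licenses the causal interpretation.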

A single anomalous measurement can affect measures of significance or confidence. This lecture describes how to make measures of correlation and causation more robust to outliers. We also describe the “principle of similar confidence”, where only like measurements should be compared against each other.
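A quick sketch of robustness with made-up sensor readings: a single corrupted measurement drags the mean far from the truth, while the median (a robust statistic) barely moves.

```python
import statistics

# Seven well-behaved readings, then the same data plus one bad reading.
clean = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7]
with_outlier = clean + [1000.0]  # a single anomalous measurement

mean_shift = abs(statistics.mean(with_outlier) - statistics.mean(clean))
median_shift = abs(statistics.median(with_outlier) - statistics.median(clean))
print(mean_shift, median_shift)  # the mean jumps; the median is stable
```

This is why robust summaries (median, inter-quartile range) are often preferred when outliers are a realistic possibility.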

This lecture covers multiple hypothesis testing, selective hypothesis testing, and other pitfalls of data-driven methodology.
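The core arithmetic behind the multiple-testing problem fits in a few lines. With m independent tests at level alpha, the chance of at least one false positive is 1 - (1 - alpha)^m; a Bonferroni correction (testing at alpha/m) caps the family-wise error rate back near alpha.

```python
# Family-wise error rate for m independent tests at level alpha.
alpha, m = 0.05, 20

p_any_false_positive = 1 - (1 - alpha) ** m          # uncorrected
p_bonferroni = 1 - (1 - alpha / m) ** m              # Bonferroni-corrected
print(p_any_false_positive, p_bonferroni)
```

With 20 tests at the conventional 5% level, there is roughly a 64% chance of at least one spurious "discovery" even when every null hypothesis is true.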

Suppose you find yourself stranded on a desert island and need to eat the local fruits to survive. Some of them make you sick and some don’t. How do you predict which ones are dangerous?

Prediction requires turning data into numerical vectors, called featurization. The way that we choose to featurize data can have profound effects on what our models learn.

What explains the fragility of machine learning approaches? This lecture describes overfitting, underfitting, concept drift, and other common failures of machine learning.

How does Alexa understand language so well? In this lecture, we describe some of the breakthroughs in natural language understanding that allow us to build complex, conversational systems.

Can a machine learn to “see”? This lecture describes how once seemingly intractable problems in computer vision have been solved over the last 10 years.

The recent breakthroughs in learning language, vision, and speech have been made by deep neural networks. How and why do they work, and what still confuses us about their effectiveness?

Machine learning does not solve everything and classical data structures play an important role in the analysis and presentation of data.

What do we do when our data cannot fit in main memory? This lecture surveys the world of Big Data and explains how the techniques learned in class can be scaled up.

John Maynard Keynes posed the following thought experiment: you are playing poker with the Archbishop of Canterbury. He tells you that he will deal the first hand, and miraculously he receives a royal flush. The probability of receiving such a hand is exceedingly small. Did he cheat?
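The probability in question is easy to compute exactly: there are 4 possible royal flushes (one per suit) out of all five-card hands.

```python
from math import comb

# 4 royal flushes out of C(52, 5) equally likely five-card hands.
p_royal = 4 / comb(52, 5)
print(p_royal)  # about 1 in 649,740
```

The puzzle, of course, is not the arithmetic but how a rare observation should be weighed against our prior beliefs about the dealer.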

### Practica

Each of the “practicum” lectures illustrates a coding example or a problem set that reinforces the concepts learned in class.

### Supplementary Lectures

These are supplementary lectures that are designed to cover topics you may have learned in prerequisite classes.

A random experiment is any process whose outcome is not known beforehand. This lecture talks about random experiments, outcomes, events, and how to interpret them.

A probability space assigns relative likelihoods to resultant events from a random experiment. This lecture describes the axioms of probability and the concept of an “algebra” of events.

How likely is one event given that another is known to have happened? Probability spaces are naturally subdividable, and we describe the concepts of conditioning and independence.

We often assign numerical values to the outcomes of random experiments; these are called random variables. This lecture describes discrete and continuous random variables and distributions.

Suppose we repeatedly and independently run a random experiment; how do we characterize the long-term behavior of the process? The expected value of a random variable characterizes where the long-term average of independent trials of a random experiment converges.
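A two-line simulation makes this concrete: the long-run average of repeated fair die rolls converges to the expected value E[X] = (1 + 2 + ... + 6) / 6 = 3.5.

```python
import random

random.seed(2)
# 100,000 independent rolls of a fair six-sided die.
rolls = [random.randint(1, 6) for _ in range(100_000)]
long_run_average = sum(rolls) / len(rolls)
print(long_run_average)  # close to the expected value 3.5
```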

Random experiments are often coupled with each other, where the result of one gives us some information about another. This lecture overviews quantifying how informative an observation is. We describe correlation between random variables, the variance of random variables, and a concept called entropy.

Pandas is a Python framework for manipulating tabular data. This lecture describes the basics of working with Pandas including loading, filtering, and constructing datasets in Pandas.
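A minimal sketch of these basics on a made-up table: constructing a DataFrame, filtering rows with a boolean mask, and adding a derived column.

```python
import pandas as pd

# A small, made-up dataset of city temperatures.
df = pd.DataFrame({
    "city": ["Chicago", "Boston", "Chicago", "Austin"],
    "temp_f": [31, 28, 35, 72],
})

cold = df[df["temp_f"] < 40]                 # filter rows with a boolean mask
df["temp_c"] = (df["temp_f"] - 32) * 5 / 9   # add a derived column
print(len(cold), df["temp_c"].round(1).tolist())
```

In practice, `df` would usually come from `pd.read_csv` or a similar loader rather than an inline dictionary.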

This lecture describes how to do basic data analysis in Pandas. We start by describing how to aggregate and group data in Pandas, and then extend this with a deep dive into linking and merging different datasets.
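A sketch of both operations on hypothetical data: aggregating per-store sales with `groupby`, then linking the totals to a second table with `merge` (an inner join on a shared key).

```python
import pandas as pd

# Made-up transaction and reference tables sharing the "store" key.
sales = pd.DataFrame({
    "store": ["A", "A", "B", "B", "B"],
    "amount": [10, 20, 5, 15, 10],
})
stores = pd.DataFrame({
    "store": ["A", "B"],
    "region": ["East", "West"],
})

# Aggregate: total sales per store.
totals = sales.groupby("store", as_index=False)["amount"].sum()
# Link: attach each store's region via an inner join.
joined = totals.merge(stores, on="store", how="inner")
print(joined)
```

The `how` parameter controls what happens to keys present in only one table; inner joins silently drop them, which is a common source of surprise when linking datasets.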

Presenting results is an important part of data science. This lecture describes how to use the matplotlib library to visualize results.
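A minimal matplotlib sketch with made-up data, using the non-interactive Agg backend so it runs in a script: build a figure, plot a labeled line, and annotate the axes.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripted use
import matplotlib.pyplot as plt

xs = list(range(10))
ys = [x ** 2 for x in xs]

fig, ax = plt.subplots()
ax.plot(xs, ys, marker="o", label="x^2")  # line plot with point markers
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
fig.savefig("squares.png")  # write the figure to disk

n_points = len(ax.lines[0].get_xdata())
```

Labeling axes and legends is not optional polish; a plot that cannot be read without the surrounding text fails at presenting results.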

We overview tools for working with numerical data organized into vectors, matrices, and arrays.
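A small NumPy sketch of the core idea: arithmetic is vectorized over whole arrays, and matrix operations use the same objects.

```python
import numpy as np

v = np.array([1.0, 2.0, 3.0])
M = np.eye(3) * 2      # 3x3 diagonal matrix with 2s on the diagonal
scaled = M @ v         # matrix-vector product, elementwise doubling here
print(scaled.tolist(), v.mean())
```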