Data Science For Computer Scientists CMSC 21800

This course is an introduction to using computational tools to derive insights from data. The course is roughly broken into three topics:

  • Measurement. How do we accurately measure real-world phenomena?
  • Model-based thinking. How do we predict or forecast unobserved phenomena?
  • Reliability and Scale. How do we build data-driven computing applications?

Read my welcome email before you do anything else!

Course Structure

Lectures: For Fall 2020, this course will be an online course with recorded lectures (live at MWF 9:10am). On Mondays and Fridays, the lecture will be a conceptual lecture introducing theory and concepts. On Wednesdays, we will work together on a problem set or coding example in the allotted lecture time.

Zoom link: https://uchicago.zoom.us/j/93614402820?pwd=NmlmcXZWdEdmMml3Y1hzZjViQXFzdz09

Slack Link (for course projects and logistics): https://join.slack.com/t/cs-ezq4738/shared_invite/zt-hn95c2mv-LpcbB49MvDrwQdWimrBmRQ

Recommended Reading: Naked Statistics https://www.amazon.com/Naked-Statistics-Stripping-Dread-Data/dp/039334777X/, The Art of Statistics https://www.amazon.com/Art-Statistics-How-Learn-Data/dp/1541618513/

Exams and Grading: There will be 3 take-home exams and a quarter-long final course project. Grading is as follows: 0.25*Exam1 + 0.25*Exam2 + 0.25*Exam3 + 0.25*Project

  • Exam 1. Friday October 16th 9:30am – Sunday October 18th 11:59pm
  • Exam 2. Friday November 6th 9:30am – Sunday November 8th 11:59pm
  • Exam 3. Wednesday December 2nd 9:30am – Friday December 4th 11:59pm
  • Final Project. Due December 10th 11:59pm (details) REVISED!!
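
As a quick illustration of the weighting above, here is a minimal sketch of the grade computation. The 0-100 scale, the function name `course_grade`, and the example scores are assumptions made for illustration; only the four equal weights come from the formula.

```python
# Weights copied from the grading formula; everything else here is hypothetical.
WEIGHTS = {"Exam1": 0.25, "Exam2": 0.25, "Exam3": 0.25, "Project": 0.25}


def course_grade(scores):
    """Return the weighted sum of component scores (assumed to be on a 0-100 scale)."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Made-up scores, not real course data.
example_scores = {"Exam1": 80, "Exam2": 90, "Exam3": 70, "Project": 100}
print(course_grade(example_scores))  # 85.0
```

Because the four weights are equal, the result is simply the mean of the four components.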

Office hours: 11am-12pm Tuesdays: https://uchicago.zoom.us/j/93614402820?pwd=NmlmcXZWdEdmMml3Y1hzZjViQXFzdz09

Lectures

The main lectures of the course introduce core topics in data science and show how to connect theoretical concepts in statistics with real-world data analysis problems.

L0. “How to Track a Wildfire”
Recent events illustrate just how hard it is to quantify and measure real-world phenomena. In this lecture, we discuss the subjectivity of data analysis and what data can and can’t do for you. Watch Slides
L1. “Why am I wrong?”
This lecture describes the types of data we may measure, populations (and their samples), and different types of biases that may enter the data collection process. Watch
L2. “Much Ado About Sampling”
This lecture will discuss how we mathematically model “simple” sampling processes and what kinds of insights we can gain from such models. Watch Slides
L3. “The Unreasonable (In)Effectiveness of Opinion Polls”
Recently, it seems like opinion polls are so wrong about everything. This lecture describes how systematic sampling can bias estimates and discusses stratified sampling techniques. Watch Slides
L4. “A minore ad maius”
From the smaller to the bigger. We describe how different aggregations of population data can be compared and what those comparisons mean. Watch Slides
L5a. “Beyond a Reasonable Doubt”
A randomized controlled trial is the primary method for determining causation. This lecture describes hypothesis testing, significance, and the assumptions needed to demonstrate causation beyond a reasonable doubt. Watch Slides
L5b. “Beyond a Reasonable Doubt”
A randomized controlled trial is the primary method for determining causation. This lecture describes hypothesis testing, significance, and the assumptions needed to demonstrate causation beyond a reasonable doubt. Watch Slides
L6. “Does Whiskey Cure Diabetes?”
Correlation does not imply causation. Just because one variable can predict another doesn’t mean there is a cause-and-effect relationship between them. This lecture defines correlation and differentiates this concept from causation. Watch Slides
L7. “Real Estate in Jeff Bezos’s Neighborhood”
A single anomalous measurement can affect measures of significance or confidence. This lecture describes how to make measures of correlation and causation more robust to outliers. We also describe the “principle of similar confidence”, where only like measurements should be compared against each other. Watch Slides
L8a. “Why this isn’t a stats class”
This lecture introduces using computational techniques to solve statistical estimation problems. Watch Slides
L8b. “All Models Are Wrong, Some Are Useful”
This lecture introduces using computational techniques to solve statistical estimation problems. Watch Slides
L9. “If the glove fits…”
This lecture covers model fitting and the bias-variance tradeoff. Watch Slides
L9b. “Exceptions to the Rule”
This lecture covers prediction rules and the bias-variance tradeoff from a different perspective. Watch Slides
L10. “Easy and Hard Predictions”
In this lecture, we describe how we measure the success of machine learning models. Watch Slides
L11. “Features of Success”
Two lectures on k-means clustering and principal component analysis. Watch Lecture 1 Slides Slides Watch Lecture 2
L12. “Herding Cats”
Can a machine learn to “see”? This lecture describes how once seemingly intractable problems in computer vision have been solved over the last 10 years. Watch Slides
L13. “Alexa, What Time is It?”
How does Alexa understand language so well? In this lecture, we describe some of the breakthroughs in natural language understanding that allow us to build complex, conversational systems. Watch Slides
L14. “The Mystery of Machine Learning”
The recent breakthroughs in learning language, vision, and speech have been made by machine learning. How and why do they work, and what still confuses us about their effectiveness? We also spend time on the dangers of relying too much on predictive models. Watch Slides
L15. “Why Your Data Structures Class Still Matters”
Machine learning does not solve everything, and classical data structures play an important role in the analysis and presentation of data. Watch Slides
L16. “Review”

Practica

Each of the “practicum” lectures illustrates a coding example or a problem set that reinforces the concepts learned in class.

P1. Non-Response Bias in Opinion Polls Assignment Watch (UChicago Only) Solution
P3. Practical Issues With Correlation Assignment + Solutions Watch (UChicago Only)
P4. How to Simulate an Election Assignment + Solutions Watch (UChicago Only)
P6. Image Classification Assignment Watch (UChicago Only)
P7. Feature and Model Drift
P7. Final Review

Supplementary Lectures

These supplementary lectures cover topics you may have learned in prerequisite classes.
S1. Random Experiments
A random experiment is any process where the outcome is not known beforehand. This lecture talks about random experiments, outcomes, events, and how to interpret them.
Watch
S2. Probability Spaces
A probability space assigns relative likelihoods to resultant events from a random experiment. This lecture describes the axioms of probability and the concept of an “algebra” of events.
Watch
S3. Conditional Probability
How likely is one event knowing that another definitely happens? Probability spaces are naturally subdividable and we describe the concepts of conditioning and independence.
Watch
S4. Random Variables
We often assign numerical values to the outcomes of random experiments; these assignments are called random variables. This lecture describes discrete and continuous random variables and distributions.
Watch
S5. Expectation
If we repeatedly and independently run a random experiment, how do we characterize the long-term behavior of the process? The expected value of a random variable is the value to which the long-term average of independent trials converges.
Watch
S6. Covariance and Variance
Random experiments are often coupled with each other, where the result of one gives us some information about another. This lecture gives an overview of how to quantify how informative one observation is about another. We describe the correlation between random variables and the variance of a random variable.
Watch
S7. Introduction to Pandas
Pandas is a Python library for manipulating tabular data. This lecture describes the basics of working with Pandas, including loading, filtering, and constructing datasets.
Notes Watch
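
The loading, filtering, and constructing steps named in S7 might look like the following minimal sketch. The poll data, column names, and the 450-respondent cutoff are made up for illustration and are not taken from the lecture notes. An in-memory string stands in for a CSV file on disk.

```python
import io

import pandas as pd

# Constructing: build a small DataFrame from a dict (hypothetical poll data).
polls = pd.DataFrame({
    "state": ["IL", "IL", "WI", "WI"],
    "respondents": [400, 600, 500, 300],
    "support": [0.52, 0.48, 0.51, 0.47],
})

# Loading: read_csv accepts a path or a file-like object; the StringIO
# buffer here is a stand-in for a real file.
csv_text = "state,respondents,support\nMI,450,0.50\n"
more_polls = pd.read_csv(io.StringIO(csv_text))

# Stack the two tables and renumber the rows from zero.
all_polls = pd.concat([polls, more_polls], ignore_index=True)

# Filtering: a boolean mask keeps only rows meeting the (arbitrary) cutoff.
large = all_polls[all_polls["respondents"] >= 450]

# A simple aggregate over the filtered rows: mean support per state.
mean_support = large.groupby("state")["support"].mean()
print(large)
print(mean_support)
```

Boolean-mask filtering returns a new DataFrame and leaves `all_polls` unchanged, which makes it easy to compare filtered and unfiltered results side by side.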