Data Science For Computer Scientists CMSC 21800

Data-driven models are revolutionizing science and industry. This course covers computational methods for structuring and analyzing data to facilitate decision-making. We will cover algorithms for transforming and matching data; hypothesis testing and statistical validation; and bias and error in real-world datasets. A core theme of the course is “generalization”: ensuring that the insights gleaned from data are predictive of future phenomena.

Course Structure

(UPDATED!!!) For Fall 2020, this course will be taught online, with lectures delivered live (MWF 9:30am) and recorded. On Mondays and Fridays, the lecture will be a conceptual lecture introducing theory and concepts. On Wednesdays, we will work together on a problem set or coding example in the allotted lecture time.

Recommended Reading: Naked Statistics https://www.amazon.com/Naked-Statistics-Stripping-Dread-Data/dp/039334777X/, The Art of Statistics https://www.amazon.com/Art-Statistics-How-Learn-Data/dp/1541618513/

Exams and Grading: There will be 3 take-home exams and a quarter-long final course project. Grading is as follows: 0.3*Exam1 + 0.3*Exam2 + 0.2*Exam3 + 0.2*Project

Lectures

The main lectures of the course introduce core topics in data science and show how to connect theoretical concepts in statistics with real-world data analysis problems.

L1. “How to Track a Pandemic”
The recent pandemic illustrates just how hard it is to quantify and measure real-world phenomena. This lecture describes the types of data we may measure, populations (and their samples), and different types of biases that may enter the data collection process.
L2. “Mixed Signals”
Many real-world datasets are collected over a period of time. This lecture describes how to model and represent time-series data.
L3. “Ad minore a maius”
A latin phrase that means “from the smaller to the bigger”. What can do we learn about a population from concise summary statistics and when can these insights be misleading? This lecture talks about measures of “typical” (mean, median, mode), and measures of spread (variance, inter-quartile range, and support).
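As a preview of the computations involved, here is a minimal sketch with numpy (the sample below is made up purely for illustration):

    import numpy as np
    from collections import Counter

    # A small, made-up sample of measurements.
    x = np.array([2.0, 3.0, 3.0, 4.0, 5.0, 9.0])

    print("mean:    ", np.mean(x))                        # "typical": mean
    print("median:  ", np.median(x))                      # "typical": median
    print("mode:    ", Counter(x).most_common(1)[0][0])   # "typical": mode
    q1, q3 = np.percentile(x, [25, 75])
    print("variance:", np.var(x))                         # spread: variance
    print("IQR:     ", q3 - q1)                           # spread: inter-quartile range
    print("support: ", (x.min(), x.max()))                # spread: observed range
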
L4. “The Unreasonable Effectiveness of Opinion Polls”
Why can very small samples predict so much about a population? This lecture describes the key results of the central limit theorem and explains how to quantify the error in a sample estimate.
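To make the effect concrete, here is a minimal simulation sketch (the skewed “population” below is synthetic and purely illustrative) showing that the spread of sample means shrinks like 1/sqrt(n):

    import numpy as np

    rng = np.random.default_rng(0)
    population = rng.exponential(scale=10.0, size=1_000_000)   # a skewed synthetic population

    n = 400                                                    # a small poll-sized sample
    sample_means = [rng.choice(population, size=n).mean() for _ in range(2_000)]

    # The central limit theorem: sample means are approximately normal, with
    # standard deviation (the "standard error") close to sigma / sqrt(n).
    print("std of sample means:", np.std(sample_means))
    print("sigma / sqrt(n):    ", population.std() / np.sqrt(n))
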
L5. “The Unreasonable In-Effectiveness of Opinion Polls”
Recently, it seems like opinion polls get everything wrong. This lecture describes how systematic errors in the sampling process can bias estimates and discusses stratified sampling techniques.
L6. “Does Whiskey Cure Diabetes?”
Correlation does not imply causation. Just because one variable can predict another does not mean there is a cause-and-effect relationship between them. This lecture defines correlation and differentiates this concept from causation.
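For instance, a minimal sketch of measuring correlation with numpy (the variables below are synthetic and purely illustrative):

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(size=500)
    y = 2.0 * x + rng.normal(size=500)    # y is predictable from x ...
    z = rng.normal(size=500)              # ... while z is unrelated to x

    # Pearson correlation: near 1 for (x, y), near 0 for (x, z).
    print(np.corrcoef(x, y)[0, 1])
    print(np.corrcoef(x, z)[0, 1])
    # A strong correlation by itself says nothing about *why* y tracks x.
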
L7. “Beyond a Reasonable Doubt”
A randomized controlled trial is the primary method for determining causation. This lecture describes hypothesis testing, significance, and the assumptions needed to demonstrate causation beyond a reasonable doubt.
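As a preview of what a significance test looks like in code, here is a minimal two-sample sketch with scipy (the treatment and control groups below are simulated, not real trial data):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    control   = rng.normal(loc=0.0, scale=1.0, size=200)   # no effect
    treatment = rng.normal(loc=0.3, scale=1.0, size=200)   # small true effect

    # A two-sample t-test asks: how surprising is this difference in means
    # if the treatment actually had no effect (the null hypothesis)?
    t_stat, p_value = stats.ttest_ind(treatment, control)
    print("t =", t_stat, "p =", p_value)                    # small p-value => significant
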
L8. “Real Estate in Jeff Bezos’s Neighborhood”
A single anomalous measurement can affect measures of significance or confidence. This lecture describes how to make measures of correlation and causation more robust to outliers. We also describe the “principle of similar confidence”, where only like measurements should be compared against each other.
L9. “How to Be Insignificant”
This lecture covers multiple hypothesis testing, selective hypothesis testing, and other pitfalls of data-driven methodology.
L10. “All Models Are Wrong, Some Are Useful”
Suppose you find yourself stranded on a desert island and need to eat the local fruits to survive. Some of them make you sick and some don’t. How do you predict which ones are dangerous?
L11. “The Features of Success”
Prediction requires turning data into numerical vectors, called featurization. The way that we choose to featurize data can have profound effects on what our models learn.
L12. “Superstition and Other Dangers of Machine Learning”
What explains the fragility of machine learning approaches? This lecture describes overfitting, underfitting, concept drift and other common failures of machine learning.
L13. “Alexa, What Time is It?”
How does Alexa understand language so well? In this lecture, we describe some of the breakthroughs in natural language understanding that allow us to build complex, conversational systems.
L14. “Herding Cats”
Can a machine learn to “see”? This lecture describes how once seemingly intractable problems in computer vision have been solved over the last 10 years.
L15. “The Mystery of Neural Networks”
The recent breakthroughs in learning language, vision, and speech have been made by deep neural networks. How and why do they work, and what still confuses us about their effectiveness?
L16. “Why Your Data Structures Class Still Matters”
Machine learning does not solve everything and classical data structures play an important role in the analysis and presentation of data.
L17. “Big Data in an Hour”
What do we do when our data cannot fit in main memory? This lecture surveys the world of Big Data and explains how the techniques learned in class can be scaled up.
L18. “Cheating Archbishops”
John Maynard Keynes posed the following thought experiment: you are playing poker with the Archbishop of Canterbury. He tells you that he will deal the first hand, and, miraculously, he receives a royal flush. The probability of receiving such a hand is exceedingly small. Did he cheat?

Practica

Each of the “practicum” lectures works through a coding example or a problem set that reinforces the concepts learned in class.

P1. A Review of Probability and Statistics
P2. Linking Datasets Using Pandas
P3. Non-Response Bias in Opinion Polls
P4. Mortality in The Titanic
P5. Particle Physics
P6. Bike Sharing
P7. Word2Vec and Language Modeling
P8. A Python Search Engine

Supplementary Lectures

These are supplementary lectures designed to cover topics you may have learned in prerequisite classes.
S1. Random Experiments
A random experiment is any process whose outcome is not known beforehand. This lecture talks about random experiments, outcomes, events, and how to interpret them.
S2. Probability Spaces
A probability space assigns relative likelihoods to resultant events from a random experiment. This lecture describes the axioms of probability and the concept of an “algebra” of events.
S3. Conditional Probability
How likely is one event given that another has definitely happened? Probability spaces can be naturally subdivided, and we describe the concepts of conditioning and independence.
S4. Random Variables
We often assign numerical values to the outcomes of random experiments; these are called random variables. This lecture describes discrete and continuous random variables and distributions.
S5. Expectation
Suppose we repeatedly and independently run a random experiment: how do we characterize the long-term behavior of the process? The expected value of a random variable characterizes the value to which the long-term average of independent trials converges.
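For example, a minimal simulation sketch (rolling a fair six-sided die, whose expected value is 3.5):

    import numpy as np

    rng = np.random.default_rng(3)
    rolls = rng.integers(1, 7, size=100_000)                 # independent trials

    # The running average of the rolls drifts toward the expected value 3.5.
    running_avg = np.cumsum(rolls) / np.arange(1, len(rolls) + 1)
    print(running_avg[9], running_avg[999], running_avg[-1])
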
S6. Covariance and Variance and Entropy
Random experiments are often coupled with each other, where the result of one gives us some information about another. This lecture covers how to quantify how informative an observation is. We describe correlation between random variables, the variance of random variables, and a concept called entropy.
S7. Introduction to Pandas
Pandas is a Python framework for manipulating tabular data. This lecture describes the basics of working with Pandas including loading, filtering, and constructing datasets in Pandas.
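A minimal sketch of these basics (the file name and column names below are hypothetical):

    import pandas as pd

    # Construct a DataFrame by hand ...
    df = pd.DataFrame({"name": ["Ann", "Bob", "Cat"], "age": [34, 29, 41]})

    # ... or load one from disk (hypothetical file).
    # df = pd.read_csv("survey.csv")

    # Filter rows and select columns.
    over_30 = df[df["age"] > 30]
    print(over_30[["name", "age"]])
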
S8. Aggregation, Joins, and Merging in Pandas
This lecture describes how to do basic data analysis in Pandas. We start by describing how to aggregate and group data in Pandas, and then extend this with a deep dive into linking and merging different datasets.
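A minimal sketch of grouping and merging (the two tables below are hypothetical):

    import pandas as pd

    orders = pd.DataFrame({"customer": ["Ann", "Bob", "Ann"], "amount": [10.0, 25.0, 5.0]})
    cities = pd.DataFrame({"customer": ["Ann", "Bob"], "city": ["Chicago", "Detroit"]})

    # Aggregate: total amount spent per customer.
    totals = orders.groupby("customer", as_index=False)["amount"].sum()

    # Merge: link the aggregate with a second dataset on a shared key.
    print(totals.merge(cities, on="customer", how="left"))
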
S9. Visualization
Presenting results is an important part of data science. This lecture describes how to use the matplotlib library to visualize results.
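A minimal sketch of a labeled plot (the data below are synthetic):

    import numpy as np
    import matplotlib.pyplot as plt

    x = np.linspace(0, 10, 100)
    plt.plot(x, np.sin(x), label="signal")
    plt.scatter(x[::10], np.sin(x[::10]) + 0.1, label="noisy samples")
    plt.xlabel("time")
    plt.ylabel("measurement")
    plt.legend()
    plt.show()
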
S10. Introduction to numpy
We overview tools for working with numerical data organized into vectors, matrices, and arrays.
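A minimal sketch of vectors and matrices in numpy:

    import numpy as np

    v = np.array([1.0, 2.0, 3.0])     # a vector
    A = 2.0 * np.eye(3)               # a 3x3 matrix

    print(A @ v)                      # matrix-vector product
    print(v.mean(), v.std())          # reductions
    print((v > 1.5).sum())            # boolean masks and counting
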
S11. Introduction to sklearn
This lecture describes the basic usage of the popular Python machine learning library, sklearn.
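A minimal fit/predict sketch (the data below are synthetic and purely illustrative):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(4)
    X = rng.normal(size=(200, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)                  # a synthetic labeling rule

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = LogisticRegression().fit(X_train, y_train)
    print("held-out accuracy:", model.score(X_test, y_test))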