CS8 Weekly Study Guide: Week 1
Tuesday, January 17 - Friday, January 20
Reading (my notes below):
● For Wednesday: CIT 1.1, 1.2, and 1.3
● For Friday: CIT Chapter 2
Live Lecture Notes Lecture 1
A. What is Data Science?
a. Drawing useful conclusions from data using computation.
b. Identifying patterns in information
i. Uses visualizations
c. Checking whether patterns are reliable
i. Uses randomization
d. Making informed guesses
i. Uses machine learning
B. Come to lecture weeks 3 through 14!!!!!!! Or do a final project if you already did CS and Stats.
Live Lecture Notes Lecture 2: Cause and Effect
A. Connector Courses!
B. Causality: not so easy to establish!
a. Basic Language:
i. Individuals, study subjects, participants, units: in the chocolate study, it was “European
ii. Each individual gets a treatment (chocolate consumption)
iii. There was an outcome (heart disease)
b. The word “relation”: in statistics, it’s called association: any relation whatsoever
c. How do you establish association?
ii. Our heart disease data: Since the definition of association is loose and the numbers are
different (12 vs 17.4), this data shows an association by the definition above. If you
plotted the data points, you would see an association.
d. Establishing Causality
i. Harder than association
ii. The study did NOT establish causality
e. Cholera Example
i. Miasmatism: most of them did not have causal information, and just assumed the
ii. John Snow: looked at patterns in who was dying and came to doubt miasmas. People breathed
the same air, but not all of them got sick. He thought the problem was water instead.
1. His approach: draw a map
2. Plot deaths on the map.
3. Found that deaths were clustered around the Broad Street pump, but he made sure
to look at other pumps.
4. He did house-by-house interviews to make sure.
5. Explain seeming “outliers”
6. He now has an association. But how does he establish causation?
7. They took the handle off the pump, and people stopped dying.
8. He needed a comparison of death rates between clean and dirty water.
9. In the area served by BOTH water companies, the houses supplied by each
company were in the same neighborhoods: there was no other difference between
those houses, which is very important!
1. Treatment Group
2. Control Group
3. Groups must be similar except for treatment. Otherwise, you are not sure what
caused the effect.
4. Make sure to compare proportions and not absolute counts
iv. If you can’t perform an experiment, then you have to do an observational study. If
so, there might be hidden variables called confounding factors. A confounding factor is
a difference between the groups other than the treatment.
1. You will be able to calculate the chance that your two groups are different and
account for that
2. Randomized Controlled Experiment
vi. Watch out for the placebo effect. Make sure people don’t know which
group they’re in. But sometimes this is hard.
vii. Make sure people are assigned to groups completely at random. That’s pretty
challenging: even YOU may not know when the assignment isn’t random.
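My own illustration (not from lecture): a minimal Python sketch of two ideas from the notes above, assigning individuals to treatment and control groups at random, and comparing proportions rather than absolute counts. The function names, the 20 numbered individuals, and all the counts are hypothetical and made up for illustration. They are not data from the chocolate or cholera studies.

```python
import random


def assign_groups(individuals, seed=None):
    """Randomly split individuals into a treatment group and a control group.

    Shuffling first means neither the experimenter nor the participants
    choose who lands in which group.
    """
    rng = random.Random(seed)  # a seed is only used here to make the demo repeatable
    shuffled = list(individuals)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def outcome_rate(num_with_outcome, group_size):
    """Proportion of a group that had the outcome (e.g. got sick)."""
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    return num_with_outcome / group_size


# Hypothetical study with 20 individuals, labeled 1 through 20.
treatment, control = assign_groups(range(1, 21), seed=0)

# Made-up counts: both groups have 30 cases, but the group sizes differ,
# so the rates are very different even though the counts are equal.
rate_small_group = outcome_rate(30, 100)
rate_large_group = outcome_rate(30, 1000)

print("treatment:", sorted(treatment))
print("control:  ", sorted(control))
print("rate in group of 100: ", rate_small_group)
print("rate in group of 1000:", rate_large_group)
```

The two groups have the same absolute count (30), but one rate is ten times the other, which is why the notes say to compare proportions.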
CIT Reading Quick Summary Notes:
1. Chapter 1: Data Science (Estimated Read time: 20-30 min)
a. 1.1 Introduction:
i. Data science = drawing conclusions from data using computation
ii. Use computation and randomization
iii. 1.1.1 Computational Tools
1. Install Anaconda
iv. 1.1.2 Statistical Techniques
1. Core inferential problems are from stats: testing hypotheses, estimating
confidence, and predicting unknown quantities.
2. Data science extends statistics.
3. With computers, we can use resampling, take into account lots of information, and
operate with few assumptions/conditions
b. 1.2 Why Data Science?
i. Making decisions from partial data & uncertain outcomes
ii. Large-scale data an