MIS 0855 Lecture Notes - Lecture 2: George E. P. Box, Data Science, Decision Tree Learning
Science and Data Science: What is data science?
• Compare it to the definition of science: knowledge about or study of the natural world
based on facts learned through experiments and observation
• What makes knowledge actionable? Why is that a goal? How does big data facilitate
this?
o Actionable – needs to project into the future, needs to be generalizable and
robust
• First: Statistics
o What is statistics? – Statistics studies data in terms of collection, analysis,
interpretation, presentation, and organization
o It helps us to answer these questions:
▪ What patterns are there in my data?
▪ What is the chance that an event will occur?
▪ Which patterns are significant?
▪ What is a high-level summary of my data?
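As a rough illustration of two of these questions (the summary and the chance of an event), here is a minimal sketch using Python's standard `statistics` module. The `orders` list and the "more than 14 orders" event are invented for the example and are not from the lecture.

```python
import statistics

# Hypothetical sample: daily order counts invented for illustration.
orders = [12, 15, 11, 30, 14, 13, 16, 12, 15, 14]

# "What is a high-level summary of my data?"
summary = {
    "mean": statistics.mean(orders),
    "median": statistics.median(orders),
    "stdev": statistics.stdev(orders),  # sample standard deviation
}

# "What is the chance that an event will occur?"
# Empirical estimate: the share of days with more than 14 orders.
busy_days = sum(1 for n in orders if n > 14)
p_busy = busy_days / len(orders)

print(summary)
print(f"Estimated P(orders > 14) = {p_busy:.2f}")
```

The single day with 30 orders pulls the mean above the median, which is one reason a summary usually reports both.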
• Now: Big Data & Machine Learning
o What is machine learning (ML)? – ML gives computers the ability to learn
without being explicitly programmed
o A computer program is said to learn from experience E with respect to some task
T and some performance measure P, if its performance on T, as measured by P,
improves with experience E.
▪ T: playing checkers
▪ P: percentage of games won against an arbitrary opponent
▪ E: playing practice games against itself
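The T / P / E definition can be made concrete with a much simpler task than checkers. The sketch below swaps in a toy task (labeling integers as low or high); the example sequence, `learn_threshold`, and `accuracy` are all hypothetical names made up for illustration. The point is only that P, measured on T, goes up as E grows.

```python
# Task T: label an integer 0..99 as "high" (True) when it is at least 50.
def true_label(x):
    return x >= 50

# Experience E: labeled examples, revealed a few at a time (invented sequence).
examples = [0, 80, 20, 70, 40, 60, 48, 52, 49, 50]

def learn_threshold(seen):
    """Place the cut halfway between the largest 'low' and smallest 'high' example."""
    lows = [x for x in seen if not true_label(x)]
    highs = [x for x in seen if true_label(x)]
    return (max(lows) + min(highs)) / 2

# Performance measure P: accuracy over every integer from 0 to 99.
def accuracy(threshold):
    correct = sum(1 for x in range(100) if (x >= threshold) == true_label(x))
    return correct / 100

scores = []
for n in range(2, len(examples) + 1, 2):
    threshold = learn_threshold(examples[:n])
    scores.append(accuracy(threshold))
    print(f"E = {n:2d} examples -> threshold {threshold:5.1f}, P = {scores[-1]:.2f}")
```

Nothing in the program states the rule "cut at 50"; the learner only gets closer to it as it sees more labeled examples.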
• Statistics vs. ML (Breiman, 2001)
o Input x → Nature → Output y
o Why analyze data? – To predict or extract information
o Statistics: input x → linear regression, logistic regression, Cox model → output y
o ML: input x → unknowns → output y
▪ Figure out unknowns with Decision Trees or Neural Nets
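A small sketch of the contrast above, under stated assumptions: the data are a toy step function invented for illustration, the "statistics" side assumes a straight-line model fit by least squares, and the "ML" side is a one-split decision tree (a stump) written by hand rather than taken from any library. It is not Breiman's own example.

```python
# Toy data invented for illustration: y jumps from 0 to 1 at x = 5.
xs = list(range(10))
ys = [0.0] * 5 + [1.0] * 5

def mse(predict):
    """Mean squared error of a prediction function on the toy data."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Data-modeling approach: assume y = a + b * x and fit by least squares.
mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
     / sum((x - mean_x) ** 2 for x in xs))
a = mean_y - b * mean_x
linear_mse = mse(lambda x: a + b * x)

# Algorithmic approach: a one-split decision tree that tries every cut
# point and keeps the one with the lowest error.
def fit_stump():
    best = None
    for cut in [x + 0.5 for x in xs[:-1]]:
        left = [y for x, y in zip(xs, ys) if x < cut]
        right = [y for x, y in zip(xs, ys) if x >= cut]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        err = mse(lambda x: left_mean if x < cut else right_mean)
        if best is None or err < best[0]:
            best = (err, cut, left_mean, right_mean)
    return best

stump_mse, cut, left_mean, right_mean = fit_stump()
print(f"linear model: y = {a:.3f} + {b:.3f} * x, MSE = {linear_mse:.4f}")
print(f"stump: split at x = {cut}, MSE = {stump_mse:.4f}")
```

On this data the stump wins because the assumed straight line is the wrong shape. On data that really are linear, the comparison would go the other way.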
• The dangers of (big data) analytics
o It’s easy to find what’s not really there
o The direction of causality can be tricky
o Dirty data is everywhere
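The first danger in this list can be demonstrated with pure noise. The sketch below is an illustration under assumptions chosen for the example (10 data points, 500 candidate variables, a fixed seed): every series is random, yet searching through enough of them turns up one that looks strongly correlated with the target.

```python
import random

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

rng = random.Random(0)  # fixed seed so the run is repeatable
n_points, n_candidates = 10, 500

# The target and every candidate are independent random noise.
target = [rng.gauss(0, 1) for _ in range(n_points)]
candidates = [[rng.gauss(0, 1) for _ in range(n_points)]
              for _ in range(n_candidates)]

correlations = [pearson(target, c) for c in candidates]
strongest = max(correlations, key=abs)

print(f"Strongest correlation among {n_candidates} random series: {strongest:+.2f}")
```

No relationship exists here by construction, so the "pattern" comes entirely from looking at many variables with few observations.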
• So… Start with a hypothesis
o The testable predictions from an idea with an underlying rationale that makes
sense