STATS 10 Lecture Notes - Lecture 2: Data Set, Sample Size Determination, Scientific Control
Chapter 1: Introduction to Data
Data
● Data is info, measurements, or observations recorded or collected
○ More than just numbers → context is everything
○ Does not have to be numbers at all (e.g. images and sound files are also data; think Shazam)
■ Statisticians generally try to recode non-number data into numbers in order to analyze it
○ Data set (dataset): collection of data
■ Note: data is plural but can be used as singular; consistency is key; the same goes for data set vs. dataset
Populations and Samples
● Population: entire collection of objects of interest; can be people, things, or even groups of things
○ Usually very large and thus difficult (or impossible) to observe/collect data on directly
● Sample: portion of a population of interest
○ Usually taken to measure certain characteristics of a population; we typically have data on a sample, not on the whole population
● Sample size: number of objects of interest in the sample, usually denoted by n
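The population/sample/sample size definitions above can be sketched in a few lines of Python. This is a minimal illustration, not part of the lecture: the population values are made up, and drawing the sample by simple random sampling (random.sample) is an assumption, since the notes do not say how a sample is chosen.

```python
import random

random.seed(10)  # fixed seed so the sketch is repeatable

# Hypothetical population: heights (cm) of 1,000 objects of interest.
population = [round(random.gauss(170, 10), 1) for _ in range(1000)]

# Sample: a portion of the population.
# Assumption: simple random sampling without replacement.
n = 25  # sample size
sample = random.sample(population, n)

print(len(population))  # 1000 (population size)
print(len(sample))      # 25   (sample size n)
```

In practice the full population list is usually not available; only the sample is.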
Observations and Variables
● Observation (observable unit): set of data collected on an object of interest
● Variable: any piece of info (e.g. characteristic, number, quantity) that can be measured or counted on an object of interest (examples: gender, height, color, area)
Data Tables
● Almost always, datasets are stored as a data table
○ Each observation is a row in the data table (horizontal)
○ Each variable is a column in the data table (vertical)
○ # of rows = sample size (and sample size = # of observations)
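As a rough sketch of the row/column layout described above, a data table can be written in Python as a list of dictionaries. The names and values here are hypothetical and only illustrate the structure.

```python
# Hypothetical data table: each dict is one observation (row),
# each key is one variable (column).
data_table = [
    {"name": "Ana", "age": 19, "height_cm": 165.0, "eye_color": "brown"},
    {"name": "Ben", "age": 21, "height_cm": 180.5, "eye_color": "blue"},
    {"name": "Cai", "age": 20, "height_cm": 172.3, "eye_color": "brown"},
]

n = len(data_table)               # number of rows = sample size
variables = list(data_table[0])   # column names = variables
ages = [row["age"] for row in data_table]  # one column, read top to bottom

print(n)          # 3
print(variables)  # ['name', 'age', 'height_cm', 'eye_color']
print(ages)       # [19, 21, 20]
```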
Two Types of Variables
● Numeric (or quantitative) variables describe quantities of the objects of interest; values are numbers
○ Tells us “how much”/“how many”; e.g. age, height, temperature, GPA, weight
● Categorical (or qualitative) variables describe qualities of the objects of interest; values are categories
○ Tells us “what type” or “what kind”; e.g. sex, hair color, class subject, eye color
○ If a categorical variable contains a unique value for each individual (e.g. name), it is often called a unique identifier; name is not a unique identifier if two subjects share the same name (they cannot be told apart)
● WARNING: it is not always obvious whether a variable is numeric or categorical just from whether its values are numbers; it is important to consider what the values represent in context!
○ Some variables w/ numerical values are categorical (e.g. area codes)
■ Some datasets code categorical variables as numbers (e.g. Yes=0 and No=1)
○ Some numeric variables can be recoded as categorical variables
■ age=numeric, but age range=categorical
● CONTEXT IS KEY!: most important aspect of data but is often overlooked
○ Who/what are the observational units? What variables were measured? How, and in what units of measurement? Who collected the data? Where? When? Why?
○ The relevance, strength, and reliability of a data set depend on the answers to these questions
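The warning about numeric vs. categorical variables can be illustrated with a short Python sketch. The age cutoffs, the area codes, and the helper name age_range are hypothetical choices for illustration; the Yes=0/No=1 coding is the one used in the notes.

```python
def age_range(age):
    """Recode a numeric age into a categorical age range (hypothetical cutoffs)."""
    if age < 18:
        return "under 18"
    elif age < 30:
        return "18-29"
    return "30 or older"

# age = numeric, age range = categorical
ages = [17, 19, 24, 35]
age_ranges = [age_range(a) for a in ages]
mean_age = sum(ages) / len(ages)  # arithmetic on a numeric variable is meaningful

# Categorical variable coded as numbers: decode to see the categories.
codes = {0: "Yes", 1: "No"}
responses = [0, 1, 1, 0]
labels = [codes[r] for r in responses]

# Area codes look numeric but are labels; storing them as strings
# avoids accidentally averaging them.
area_codes = [str(c) for c in [310, 213, 310]]

print(age_ranges)  # ['under 18', '18-29', '18-29', '30 or older']
print(mean_age)    # 23.75
print(labels)      # ['Yes', 'No', 'No', 'Yes']
```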
Organize It
● Organizing/displaying data is important in understanding what our data tells us
● w/ categorical variables, we’re usually interested in how often a particular category occurs in our sample
○ Frequency of a value = number of times the value is observed in a data set (e.g. a category observed 27 times has frequency 27)
■ One way to display frequency is w/ a frequency table
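A frequency table like the one mentioned above can be sketched with Python's collections.Counter. The eye-color values are made-up sample data, not from the lecture.

```python
from collections import Counter

# Hypothetical categorical variable observed on a sample of n = 6
eye_colors = ["brown", "blue", "brown", "green", "brown", "blue"]

# Frequency = number of times each category is observed
frequency = Counter(eye_colors)

# Print a simple frequency table, most common category first
print(f"{'Eye color':<10} {'Frequency':>9}")
for category, count in frequency.most_common():
    print(f"{category:<10} {count:>9}")
```

The frequencies always add up to the sample size, which is a quick check that no observation was missed.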
find more resources at oneclass.com