
PSYC 3000 Lecture Notes - Lecture 2: Exploratory Data Analysis, Scatter Plot, Replot


Department
Psychology
Course Code
PSYC 3000
Professor
Bruce Hutcheon
Lecture
2

PSYC 3000 Exploratory Data Analysis
Overview:
- Exploratory data analysis seeks to validate data & look for interesting patterns
o Characterize & compress data, look for problems in data
o Detect patterns & suggest models
o Doesn’t test hypotheses
o Results often never published
o Nonetheless crucial; before proceeding to hypothesis test, should always carry out
some sort of exploratory analysis
Exploratory data analysis:
- Tools for exploring patterns in data:
o Graphical tools: line plots, frequency histograms, scatter plots
- #s summarizing distributions: mean, mode, median, etc.
- Graphical displays of numeric results: boxplots, plots w/ error bars
Graphic tools:
- Following plots useful for assessing patterns in data, can all be generated in SPSS:
o “run-series” plots, scatter plots, histograms
- These plots directly display distribution of your data; other plots (e.g., boxplots, plots
w/ error bars) only display numeric summaries of your data
Run-series plots:
- Horizontal axis = time or sequence #; vertical axis shows value of each observation
- Plot shows 57 successively measured scores in self-report depression inventory (*graph
in lecture slide*)
o Note: “T-scores” = standardized scores w/ mean = 50 & SD = 10
- 42nd point (*circled in graph in lecture slide*) = obvious outlier
o Plot doesn’t tell why value of this observation = so much lower than any other
- Sometimes undetected change in experimental conditions results in whole group of
outliers
- For instance, depression scores for participants 40 to 50 have diff mean from rest of data
(*plot in lecture slide*)
o These observations may have been calculated incorrectly, or something else may
be going on
- Couldn’t conduct t-test on these data b/c they aren’t homogeneous (i.e., there is more
than 1 distribution present)
o Run series plot has revealed that data don’t satisfy assumptions of test
- Remember: can detect either single outliers or many outliers deriving from single mistake
- Can detect sudden changes in variance of data
find more resources at oneclass.com
- Plot at right doesn’t tell you why variance = unstable, but certainly indicates there may be
problem; perhaps 1 of experimenters who calculated depression scores used incorrect
formula (*plot in lecture slide*)
- Important that data points should be plotted in same sequence they were gathered; allows
you to see sudden changes in data values
- Ignoring sequence info can cause trouble; here are same data as before, but plotted in
random order; possible problem in data much less visible (*plot in lecture slide*)
- Remember:
o Run-series plots show all your data; don’t leave out data points or summarize them,
so they show you everything
o Can reveal problems in your experimental setup
o When creating run-series plots, always retain order in which data were gathered
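
The point about retaining collection order can be sketched numerically. This is a minimal illustration only, not the procedure used in lecture (which relies on SPSS plots): the scores below are made up, and `rolling_sd` and the window size of 6 are assumptions chosen for the example. It computes SD over consecutive windows, once in collection order & once after shuffling.

```python
import random
import statistics

def rolling_sd(values, window):
    """Sample SD of each consecutive window, keeping the given order."""
    return [
        statistics.stdev(values[i:i + window])
        for i in range(len(values) - window + 1)
    ]

# Hypothetical T-scores (made up): 30 stable observations,
# then 27 observations with a much larger spread.
stable = [49, 51] * 15
unstable = [40, 60] * 13 + [40]
scores = stable + unstable  # order of collection retained

ordered_sd = rolling_sd(scores, window=6)

# Same data in random order: the change in variance is no longer
# confined to one stretch of the series, so it is harder to spot.
shuffled = scores[:]
random.Random(0).shuffle(shuffled)
shuffled_sd = rolling_sd(shuffled, window=6)

print("ordered:  first window SD %.2f, last window SD %.2f"
      % (ordered_sd[0], ordered_sd[-1]))
print("shuffled: first window SD %.2f, last window SD %.2f"
      % (shuffled_sd[0], shuffled_sd[-1]))
```

In collection order, the windowed SD jumps where the spread changes; plotting `scores` against their index would show the same thing visually.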
Scatter plots:
- Another graphical tool of exploratory data analysis
- Like run-series plots, show all data points in data set (other types of graphical display
may only show summaries of your data)
- Unlike run-series plots, each point represents measurements of 2 diff variables on same
participant
- For instance, plot shows birth weights & lengths of newborn babies (*in lecture slide*)
o Clustering along diagonal indicates these 2 variables aren’t independent
o Infant indicated by circled data point doesn’t appear to obey overall length/weight
relationship for infants
o This infant may genuinely have this length/weight relationship, but also possible
that mistake has been made in recording data, so we should take closer look
before proceeding w/ other analyses
o Note: this point wouldn’t have been identified as problem w/o using scatter plot
since, for this infant, neither its length nor its weight has particularly unusual value
o This is the special virtue of scatter plots: reveal unusual relationships b/w
variables rather than just unusual values
- Remember:
o Show all data in dataset for 2 variables
o Can detect unusual combos of values of 2 variables (even when individual values
aren’t outside their usual bounds)
o Run-series plots can reveal outlying data values; scatter plots can reveal outlying
relationships
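
The idea of an "outlying relationship" can be sketched in code. This is an illustration under stated assumptions, not the method from lecture: the infant data are made up, the helper names (`z_scores`, `line_residuals`) are hypothetical, and the cutoff of 2 SDs is an arbitrary choice. The sketch flags a point whose length & weight are each unremarkable but whose distance from the least-squares line is large.

```python
import statistics

def z_scores(values):
    """Standardize values using the sample mean and sample SD."""
    m = statistics.mean(values)
    s = statistics.stdev(values)
    return [(v - m) / s for v in values]

def line_residuals(x, y):
    """Residuals of y from the least-squares line y = a + b*x."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    a = my - b * mx
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

# Hypothetical newborn data (made up): length in cm, weight in g.
# The last infant is short-ish but heavy-ish: an unusual combination.
length = [46, 47, 48, 49, 50, 51, 52, 53, 54, 48]
weight = [2600, 2800, 3000, 3200, 3400, 3600, 3800, 4000, 4200, 3900]

z_len = z_scores(length)
z_wt = z_scores(weight)
z_res = z_scores(line_residuals(length, weight))

# Flag points with an unusual relationship but ordinary individual values.
flagged = [
    i for i in range(len(length))
    if abs(z_res[i]) > 2 and abs(z_len[i]) < 2 and abs(z_wt[i]) < 2
]
print("flagged indices:", flagged)
```

A check on each variable separately would pass this infant; only looking at the 2 variables together (as a scatter plot does) singles it out.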
Histograms:
- In contrast to run-series & scatter plots, process data before showing them; don’t show all
data, but show data summary
- Ex. Reaction times in David Howell’s replication of Sternberg’s memory experiment
(times needed to say if # flashed on screen occurred in memorized list) (*in lecture
slide*)
- Essential properties of histogram:
1. Bins all same size so that counts in each bin can be compared
2. At least interval data on X axis; otherwise how would you know when bins are same
size?
3. Vertical axis always begins at 0
- Remember: don’t show “true” distribution of values in data; you would have to show all
data to see true distribution
- Show distribution of bin occupancies; this only approximates true distribution of values,
& means that distribution you see in histogram depends on nature of bins (& not just on
nature of data)
- For instance, look what happens when you re-plot same eyeblink data using diff bin
widths
- Using narrower bins certainly gives more detail, but broad properties lost (*graphs in
lecture slide*)
- Hint for getting right # of bins: if you find that many of bars in your histogram =
low-occupancy bins (counts of 1 or 2), your bins = too small (*in lecture slide*)
- Widen bins, so that many observations can accumulate in each one, then you’ll see
distribution of values more clearly
- Can use to spot non-normal distributions which may pose problems for subsequent
statistical tests
- Ex. distribution skewed & has outlier on far right (*in lecture slide*)
o In research, distribution like this would cause you to scrutinize data closely to find
out if they’re really suitable for t-test
- Remember:
o Allow you to visualize distribution of values in dataset
o When you view histogram, not looking at all data; looking at data after they’re
gathered into bins
o Diff choices for bins may produce diff looking distributions
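
How bin width changes what a histogram shows can be sketched as follows. The reaction times are made up (not Howell’s data), and `bin_counts` and `share_low_occupancy` are hypothetical helpers written for this example. The same 20 values are binned w/ 10 ms & 50 ms bins, then checked against the low-occupancy hint above.

```python
def bin_counts(values, start, width, n_bins):
    """Count values in n_bins equal-width bins; bin k covers
    [start + k*width, start + (k+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        k = int((v - start) // width)
        if 0 <= k < n_bins:
            counts[k] += 1
    return counts

def share_low_occupancy(counts):
    """Fraction of non-empty bins holding only 1 or 2 observations."""
    occupied = [c for c in counts if c > 0]
    return sum(1 for c in occupied if c <= 2) / len(occupied)

# Hypothetical reaction times in ms (made up for illustration).
times = [412, 430, 447, 455, 468, 471, 480, 492, 503, 510,
         515, 528, 534, 541, 556, 563, 578, 590, 612, 655]

narrow = bin_counts(times, start=400, width=10, n_bins=26)  # 400-660 ms
wide = bin_counts(times, start=400, width=50, n_bins=6)     # 400-700 ms

print("narrow bins:", narrow)
print("wide bins:  ", wide)
print("low-occupancy share, narrow: %.2f" % share_low_occupancy(narrow))
print("low-occupancy share, wide:   %.2f" % share_low_occupancy(wide))
```

With 10 ms bins nearly every bar has a count of 1 or 2, so overall shape is hard to see; with 50 ms bins the counts accumulate & the central peak and right tail become visible.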
Numeric Tools:
- Used to characterize essential properties of distributions for ordinal, interval, or ratio
variables
- Each has strengths & weaknesses w/ which you should be familiar
Mean & Standard Deviation:
- 1st: for interval or ratio variables, we can create highly compact description of
distribution by specifying location of its centre together w/ 1 or more indications of how
observations vary around this centre
- If distribution relatively symmetric & w/o severe outliers, mean & SD used
- Doing this allows us to concentrate on essential properties of distribution while
neglecting details about where each data point falls
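
A short sketch of why mean & SD suit relatively symmetric distributions w/o severe outliers. The scores are made up and `summarize` is a hypothetical helper; the median is included only as a comparison point. Adding a single extreme value moves the mean & inflates the SD, while the median stays put.

```python
import statistics

def summarize(data):
    """Compact description of a distribution: centre and spread."""
    return {
        "mean": statistics.mean(data),
        "sd": statistics.stdev(data),   # sample SD (n - 1 denominator)
        "median": statistics.median(data),
    }

# Hypothetical scores (made up): roughly symmetric around 50.
symmetric = [46, 48, 49, 50, 50, 51, 52, 54]
# Same scores plus 1 severe outlier.
with_outlier = symmetric + [95]

for label, data in [("symmetric", symmetric), ("with outlier", with_outlier)]:
    s = summarize(data)
    print("%-13s mean = %.1f, SD = %.2f, median = %.1f"
          % (label, s["mean"], s["sd"], s["median"]))
```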