Bruce Hutcheon

Lecture 2

PSYC 3000 – Exploratory Data Analysis

Overview:

- Exploratory data analysis seeks to validate data & look for interesting patterns

o Characterize & compress data, look for problems in data

o Detect patterns & suggest models

o Doesn’t test hypotheses

o Results often never published

o Nonetheless crucial; before proceeding to hypothesis test, should always carry out

some sort of exploratory analysis

Exploratory data analysis:

- Tools for exploring patterns in data:

o Graphical tools – line plots, frequency histograms, scatter plots

o #s summarizing distributions – mean, mode, median, etc.

o Graphical displays of numeric results – boxplots, plots w/ error bars

Graphic tools:

- Following plots useful for assessing patterns in data, can all be generated in SPSS:

o “run-series” plots, scatter plots, histograms

- These plots directly display distribution of your data; other plots only display numeric

summaries of your data

Run-series plots:

- Horizontal axis = time or sequence #; vertical axis shows value of each observation

- Plot shows 57 successively measured scores in self-report depression inventory (*graph

in lecture slide*)

o Note: “T-scores” = standardized scores w/ m = 50 & s = 10

- 42nd point (*circled in graph in lecture slide*) = obvious outlier

o Doesn’t tell why value of this observation = so much lower than any other

- Sometimes undetected change in experimental conditions results in whole group of

outliers

- For instance, depression scores for participants 40 to 50 have diff mean from rest of data

(*plot in lecture slide*)

o These observations may have been calculated incorrectly, or something else may

be going on

- Couldn’t conduct t-test on these data b/c they aren’t homogeneous (i.e., there is more

than 1 distribution present)

o Run series plot has revealed that data don’t satisfy assumptions of test

- Remember: can detect either single outliers or many outliers deriving from single mistake

- Can detect sudden changes in variance of data

find more resources at oneclass.com


- Plot at right doesn’t tell you why variance = unstable, but certainly indicates there may be

problem – perhaps 1 of experimenters who calculated depression scores used incorrect

formula (*plot in lecture slide*)

- Important that data points be plotted in same sequence they were gathered; allows

you to see sudden changes in data values

- Ignoring sequence info can cause trouble; here are same data as before, but plotted in

random order – possible problem in data much less visible (*plot in lecture slide*)

- Remember:

o Run-series plots show all your data; they don’t leave out data points or summarize them,

so they show you everything

o Can reveal problems in your experimental setup

o When creating run-series plots, always retain order in which data were gathered
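
The run-series ideas above can be sketched in a few lines of Python. This is only an illustration, not the SPSS procedure used in lecture: the scores are made up, the helper names (`flag_outliers`, `block_means`, `text_run_series`) are hypothetical, and the 3-SD cutoff is an assumed rule of thumb. It shows (a) a single outlier standing out when scores are listed in collection order, and (b) a shifted group of scores that is obvious in the original order but much less visible once the order is scrambled.

```python
from statistics import mean, stdev


def flag_outliers(scores, z_cut=3.0):
    """Return 1-based sequence #s of scores more than z_cut SDs from the mean."""
    m, s = mean(scores), stdev(scores)
    return [i for i, x in enumerate(scores, start=1) if abs(x - m) > z_cut * s]


def block_means(scores, size):
    """Mean of each consecutive block of `size` observations, in the given order."""
    return [mean(scores[i:i + size]) for i in range(0, len(scores), size)]


def text_run_series(scores, lo=0, hi=100, width=50):
    """One text row per observation: sequence # on the left, '*' placed by score."""
    rows = []
    for i, x in enumerate(scores, start=1):
        pos = round((x - lo) / (hi - lo) * (width - 1))
        pos = max(0, min(width - 1, pos))  # clamp scores outside [lo, hi]
        rows.append(f"{i:3d} |" + " " * pos + "*")
    return rows


# Hypothetical T-scores (made up), listed in the order they were gathered.
scores = [52, 48, 55, 47, 51, 49, 53, 50, 46, 54, 12, 51, 48, 52, 50, 49]
print("\n".join(text_run_series(scores)))
outliers = flag_outliers(scores)  # assumed cutoff: 3 SDs from the mean
print("possible outliers at sequence #:", outliers)

# Hypothetical scores where observations 7 to 12 sit about 10 points lower.
shifted = [50, 52, 48, 51, 49, 50, 40, 41, 39, 42, 40, 38, 51, 49, 50, 52, 48, 50]
ordered_means = block_means(shifted, 6)

# Same values with the collection order discarded (deterministic reshuffle).
scrambled = shifted[::3] + shifted[1::3] + shifted[2::3]
scrambled_means = block_means(scrambled, 6)

print("block means, original order: ", ordered_means)
print("block means, scrambled order:", scrambled_means)
```

In the original order the middle block mean drops sharply; after scrambling, the three block means are close together, so the group shift is easy to miss. As in the notes, neither check says why the values are unusual, only that they deserve a closer look.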

Scatter plots:

- Another graphical tool of exploratory data analysis

- Like run-series plots, show all data points in data set (other types of graphical display

may only show summaries of your data)

- Unlike run-series plots, each point represents measurements of 2 diff variables on same

participant

- For instance, plot shows birth weights & lengths of newborn babies (*in lecture slide*)

o Clustering along diagonal indicates these 2 variables aren’t independent

o Infant indicated by circled data point doesn’t appear to obey overall length/weight

relationship for infants

o This infant may genuinely have this length/weight relationship, but also possible

that mistake has been made in recording data, so we should take closer look

before proceeding w/ other analyses

o Note: this point wouldn’t have been identified as problem w/o using scatter plot

since, for this infant, neither its length nor its weight has a particularly unusual value

o This is the special virtue of scatter plots: reveal unusual relationships b/w

variables rather than just unusual values

- Remember:

o Show all data in dataset for 2 variables

o Can detect unusual combos of values of 2 variables (even when individual values

aren’t outside their usual bounds)

o Run-series plots can reveal outlying data values; scatter plots can reveal outlying

relationships
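
A rough numeric analogue of what the scatter plot reveals, assuming a straight-line length/weight relationship. The data are made up, the helper names (`fit_line`, `z_scores`) are hypothetical, and the cutoff of 2 SDs is an arbitrary choice for illustration. The last infant has an ordinary length and an ordinary weight, yet its distance from the fitted line is far larger than anyone else's.

```python
from statistics import mean, stdev


def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y = a + b*x."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx
    return my - b * mx, b


def z_scores(values):
    """Standardize values using the sample mean and SD."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


# Hypothetical newborns (made up): length in cm, weight in g.
lengths = [46, 47, 48, 49, 50, 51, 52, 53, 54, 52]
weights = [2600, 2800, 3000, 3200, 3400, 3600, 3800, 4000, 4200, 2900]

# Looked at one variable at a time, nobody is unusual (assumed cutoff |z| > 2).
unusual_length = [i for i, z in enumerate(z_scores(lengths), start=1) if abs(z) > 2]
unusual_weight = [i for i, z in enumerate(z_scores(weights), start=1) if abs(z) > 2]

# Looked at jointly: residual = observed weight minus weight predicted from length.
a, b = fit_line(lengths, weights)
residuals = [w - (a + b * x) for x, w in zip(lengths, weights)]
unusual_pair = [i for i, z in enumerate(z_scores(residuals), start=1) if abs(z) > 2]

print("unusual lengths:", unusual_length)
print("unusual weights:", unusual_weight)
print("unusual length/weight combos:", unusual_pair)
```

The flagged infant would still need to be checked against the original records; the residual only says the combination is unusual, not whether it is a recording mistake.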

Histograms:

- In contrast to run-series & scatter plots, process data before showing them; don’t show all

data, but show data summary

- Ex. Reaction times in David Howell’s replication of Sternberg’s memory experiment

(times needed to say if # flashed on screen occurred in memorized list) (*in lecture

slide*)


- Essential properties of histogram:

1. Bins all same size – so that counts in each bin can be compared

2. At least interval data on X axis – otherwise how would you know when bins are same

size?

3. Vertical axis always begins at 0

- Remember: don’t show “true” distribution of values in data; you would have to show all

data to see true distribution

- Show distribution of bin occupancies; this only approximates true distribution of values,

& means that distribution you see in histogram depends on nature of bins (& not just on

nature of data)

- For instance, look what happens when you re-plot same eyeblink data using diff bin

widths

- Using narrower bins certainly gives more detail, but broad properties lost (*graphs in

lecture slide*)

- Hint for getting right # of bins: if you find that many of bars in your histogram =

low-occupancy bins (counts of 1 or 2), your bins = too small (*in lecture slide*)

- Widen bins, so that many observations can accumulate in each one, then you’ll see

distribution of values more clearly

- Can use to spot non-normal distributions which may pose problems for subsequent

statistical tests

- Distribution skewed & has outlier on far right (*in lecture slide*)

o In research, distribution like this would cause you to scrutinize data closely to find

out if they’re really suitable for t-test

- Remember:

o Allow you to visualize distribution of values in dataset

o When you view histogram, not looking at all data; looking at data after they’re

gathered into bins

o Diff choices for bins may produce diff looking distributions
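
To make the bin-width point concrete, here is a minimal sketch that bins the same values twice using equal-width bins. The values are made up (not Howell's data), and the helper names (`bin_counts`, `low_occupancy_bars`, `text_histogram`) are hypothetical. Bars start at 0 because each bar is just a count.

```python
def bin_counts(values, start, width, n_bins):
    """Counts for n_bins equal-width bins [start + k*width, start + (k+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        k = int((v - start) // width)
        if 0 <= k < n_bins:
            counts[k] += 1
    return counts


def low_occupancy_bars(counts):
    """Return (# of bars with a count of 1 or 2, # of non-empty bars)."""
    bars = [c for c in counts if c > 0]
    return sum(1 for c in bars if c <= 2), len(bars)


def text_histogram(counts, start, width):
    """One text row per bin: lower edge, upper edge, then one '#' per observation."""
    rows = []
    for k, c in enumerate(counts):
        lo = start + k * width
        rows.append(f"{lo:3d}-{lo + width:<3d} | " + "#" * c)
    return rows


# Hypothetical reaction times (made up, arbitrary units).
rts = [38, 41, 42, 44, 45, 46, 47, 48, 48, 49, 50, 51, 51, 52, 53, 54,
       55, 56, 58, 59, 61, 63, 66, 70, 74, 95]

wide = bin_counts(rts, start=30, width=10, n_bins=7)
narrow = bin_counts(rts, start=30, width=2, n_bins=35)

print("\n".join(text_histogram(wide, 30, 10)))
print("wide bins:   low-occupancy bars / non-empty bars =", low_occupancy_bars(wide))
print("narrow bins: low-occupancy bars / non-empty bars =", low_occupancy_bars(narrow))
```

With the narrow bins nearly every bar holds only 1 or 2 observations, which is the warning sign described above; with the wide bins the skew and the isolated value on the far right are easier to see. Both histograms come from the same data, so the difference is due to the bins alone.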

Numeric Tools:

- Used to characterize essential properties of distributions for ordinal, interval, or ratio

variables

- Each has strengths & weaknesses w/ which you should be familiar

Mean & Standard Deviation:

- 1st: for interval or ratio variables, we can create highly compact description of

distribution by specifying location of its centre together w/ 1 or more indications of how

observations vary around this centre

- If distribution relatively symmetric & w/o severe outliers, mean & SD used

- Doing this allows us to concentrate on essential properties of distribution while

neglecting details about where each data point falls
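
A small sketch of why the "relatively symmetric & w/o severe outliers" condition matters, using made-up scores and a hypothetical `summarize` helper. The median is included only as a comparison point. Adding one extreme value drags the mean upward and inflates the SD, so the pair no longer describes where most observations fall.

```python
from statistics import mean, median, stdev


def summarize(values):
    """Compact description of a distribution: centre (mean) and spread (sample SD)."""
    return {"n": len(values), "mean": mean(values),
            "sd": stdev(values), "median": median(values)}


# Hypothetical scores (made up): roughly symmetric around 50.
symmetric = [46, 48, 49, 50, 50, 51, 52, 54]

# Same scores plus one severe outlier.
with_outlier = symmetric + [120]

clean = summarize(symmetric)
contaminated = summarize(with_outlier)

for label, s in (("symmetric", clean), ("with outlier", contaminated)):
    print(f"{label:13s} n={s['n']}  mean={s['mean']:.2f}  "
          f"sd={s['sd']:.2f}  median={s['median']}")
```

For the symmetric scores, mean and SD together give a fair picture while ignoring where each individual point falls. With the outlier added, the median stays at 50 but the mean and SD change substantially.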
