Data Analysis Mid-Semester Notes
Lecture 1: Introduction to Statistics
Statistics: processing and analysing data
Descriptive: collecting, presenting and characterising
Inferential: use sample to draw conclusions about the population
Population: the whole set of items or individuals of interest
Sample: subset of the population
Parameter: numerical measure that describes a characteristic of a population
Statistic: numerical measure that describes a characteristic of a sample
Primary source: collect yourself or internally
Secondary source: buy data/external source
Collecting Data
Important Sources
1. Data distributed by organisation or individual
2. Designed experiment
3. Survey
4. Observational study
Data
Categorical
Nominal: no order
Ordinal: order to categories
Numerical
Discrete (countable values, e.g. 1, 2, 3) OR continuous (any value within a range, e.g. 1.2713…)
Interval (no true zero, so ratios are not meaningful) OR ratio (true zero, so ratio comparisons are meaningful)
Time series (time element) OR cross sectional (one point in time)
Graphing/tables
Categorical Data
Summary table
Bar graph (good for frequency comparisons)
Pie graph (good for proportions)
Christina Meyers BSB 123 Data Analysis
Lecture 2: Presenting data in tables and charts & Introduction to descriptive measures
Numerical Data
Ordered array: ordered from smallest to largest
Frequency distribution: summary table in which data is arranged into numerically ordered classes
Histogram: graph of continuous data in a frequency distribution (no gaps)
Class intervals and boundaries
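$\text{Range} = \text{max} - \text{min}$
$\text{Class width} = \dfrac{\text{range}}{\text{no. of classes}}$
$\text{Class mid-point} = \dfrac{\text{lower boundary} + \text{upper boundary}}{2}$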
Rule of thumb: Usually at least 5 classes, but no more than 15
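The class width and mid-point calculations can be sketched in Python as below. This is only an illustration: the function name `frequency_distribution` and the data values are hypothetical, and the choice to place a value that falls on a shared boundary into the upper class is an assumption (the notes do not state a convention).

```python
def frequency_distribution(values, num_classes):
    """Group values into equal-width classes.

    Returns one (lower, upper, midpoint, count) tuple per class.
    """
    low, high = min(values), max(values)
    width = (high - low) / num_classes      # class width = range / no. of classes
    rows = []
    for k in range(num_classes):
        lower = low + k * width
        upper = lower + width
        midpoint = (lower + upper) / 2      # class mid-point
        rows.append([lower, upper, midpoint, 0])
    for v in values:
        # Assumed convention: a boundary value goes to the upper class,
        # except the maximum, which stays in the last class.
        k = min(int((v - low) / width), num_classes - 1)
        rows[k][3] += 1
    return [tuple(r) for r in rows]


data = [12, 15, 17, 21, 22, 25, 28, 30, 33, 37, 41, 42]  # hypothetical values
table = frequency_distribution(data, 5)
for lower, upper, midpoint, count in table:
    print(f"{lower:5.1f} to {upper:5.1f}  mid-point {midpoint:5.1f}  frequency {count}")
```

In practice the computed width is often rounded up to a convenient number before the classes are drawn.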
Two Variables: Bivariate Data
x: Independent variable
y: dependent variable
Does (dependent variable, y) depend on (independent, x)?
Categorical vs categorical
Contingency table (allows for cross-tabulation of data)
Clustered bar chart/stacked bar chart (converts contingency table into graphical form)
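As a rough illustration of cross-tabulation, the sketch below counts how often each pair of categories occurs and arranges the counts in a contingency table. The records and category names are hypothetical.

```python
from collections import Counter

# Hypothetical survey records: (gender, preferred payment method)
records = [
    ("F", "card"), ("F", "cash"), ("M", "card"),
    ("M", "card"), ("F", "card"), ("M", "cash"),
]

counts = Counter(records)                   # frequency of each (row, column) pair
rows = sorted({r for r, _ in records})
cols = sorted({c for _, c in records})

# Contingency table as a nested dict: table[row][column] -> count
table = {r: {c: counts[(r, c)] for c in cols} for r in rows}

print("      " + "  ".join(f"{c:>5}" for c in cols))
for r in rows:
    print(f"{r:>5} " + "  ".join(f"{table[r][c]:>5}" for c in cols))
```

Each row of this table corresponds to one cluster (or one stacked bar) in the matching chart.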
Numerical vs numerical
Scatter plot (x,y pairs of data)
Line chart (time series – time is the independent variable, against a dependent variable)
Numerical vs categorical
Pivot table
Positive relationship: as x increases, y increases
Negative relationship: as x increases, y decreases
No relationship: random movements of x and y
Central tendency
Mean: average (sum of values divided by the no. of values)
$\bar{x} = \dfrac{\sum_{i=1}^{n} x_i}{n}$
Median: middle number of an ordered array
$\text{Median position} = \dfrac{n + 1}{2}$
Rule 1: If the number of values (n) is even, the median is the average of the two middle ranked values
Rule 2: If the number of values (n) is odd, the median is the middle ranked value
Mode: most frequently observed value (there may be no mode, or several modes, e.g. two modes in bimodal data)
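The three measures of central tendency can be checked on a small dataset. The values below are hypothetical, and the median is computed by hand using the even/odd rules so the steps are visible; `statistics.multimode` from the standard library returns every mode.

```python
import statistics

data = [3, 7, 7, 2, 9, 4]               # hypothetical sample, n = 6 (even)

mean = sum(data) / len(data)            # sum of values divided by the no. of values

ordered = sorted(data)                  # ordered array: [2, 3, 4, 7, 7, 9]
n = len(ordered)
if n % 2 == 0:
    # Rule 1: even n, average the two middle ranked values
    median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2
else:
    # Rule 2: odd n, take the middle ranked value
    median = ordered[n // 2]

modes = statistics.multimode(data)      # list of all most frequent values

print(mean, median, modes)
```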
Quartiles
Split the ranked data into 4 segments, with an equal number of values per segment
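Each of the four segments holds 25% of the values. The quartiles mark the boundaries between segments, at these ranked positions:
$Q_1 \text{ position} = \dfrac{n + 1}{4}$
$Q_2 \text{ position} = \dfrac{n + 1}{2}$ (Median)
$Q_3 \text{ position} = \dfrac{3(n + 1)}{4}$
The fourth segment runs from Q3 to the maximum value.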
Rules
Rule 1: If the result is an integer, then the quartile is equal to the ranked value
Rule 2: If the result is a fractional half, then the quartile is equal to the mean of the corresponding
ranked values
Rule 3: If the result is neither an integer nor a fractional half, round the result to the nearest integer
and select that ranked value
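The three rules can be sketched as a small function. The name `quartile` and the two datasets are hypothetical; positions are 1-based ranks in the ordered array, computed as q(n+1)/4.

```python
def quartile(ordered, q):
    """Return quartile q (1, 2 or 3) of an already ordered list."""
    n = len(ordered)
    pos = q * (n + 1) / 4                   # 1-based ranked position
    whole = int(pos)
    if pos == whole:
        # Rule 1: integer position, take that ranked value
        return ordered[whole - 1]
    if pos - whole == 0.5:
        # Rule 2: fractional half, average the two neighbouring ranked values
        return (ordered[whole - 1] + ordered[whole]) / 2
    # Rule 3: otherwise round to the nearest integer position
    return ordered[round(pos) - 1]


odd_n = [11, 12, 13, 16, 16, 17, 18, 21, 22]          # n = 9
even_n = [29, 31, 35, 39, 39, 40, 43, 44, 44, 52]     # n = 10

print([quartile(odd_n, q) for q in (1, 2, 3)])
print([quartile(even_n, q) for q in (1, 2, 3)])
```

Other textbooks and software packages use slightly different quartile conventions, so results from a spreadsheet or library may not match these rules exactly.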
Box and whisker plot
Lecture 3: Numerical descriptive measures
Variation
$\text{Range} = \text{max} - \text{min}$
Outliers can distort the range, making it a misleading measure of spread
Interquartile range
$IQR = Q_3 - Q_1$
Ignores extreme values
Variance: Measure of variation based on squared deviations from the mean
Population Variance
$\sigma^2 = \dfrac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$
Sample Variance
$S^2 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
Standard Deviation
Describes the variation of a single variable
Population Standard Deviation
$\sigma = \sqrt{\sigma^2}$
Sample Standard Deviation
$S = \sqrt{S^2}$
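The only difference between the population and sample formulas is the divisor (N versus n - 1). A minimal sketch with hypothetical data, computing both by hand and comparing against Python's standard `statistics` module:

```python
import math
import statistics

data = [4, 8, 6, 5, 7]                  # hypothetical values
n = len(data)
mean = sum(data) / n                    # 6.0

# Sum of squared deviations from the mean: 4 + 4 + 0 + 1 + 1 = 10
squared_devs = sum((x - mean) ** 2 for x in data)

pop_var = squared_devs / n              # treat data as the whole population
samp_var = squared_devs / (n - 1)       # treat data as a sample

pop_sd = math.sqrt(pop_var)             # standard deviation = square root of variance
samp_sd = math.sqrt(samp_var)

print(pop_var, samp_var, pop_sd, samp_sd)
```

The sample version is always slightly larger, because dividing by n - 1 compensates for estimating the mean from the same data.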
Coefficient of Variation: Relative measure of variation, the standard deviation divided by the mean,
multiplied by 100%
$CV = \left(\dfrac{S}{\bar{x}}\right) \times 100\%$
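Because the coefficient of variation is a percentage, it allows variation to be compared across variables measured in different units. A brief sketch, with a hypothetical helper `coefficient_of_variation` and made-up data:

```python
import statistics

def coefficient_of_variation(sample):
    """Sample standard deviation as a percentage of the sample mean."""
    return statistics.stdev(sample) / statistics.mean(sample) * 100


# Hypothetical measurements in different units
heights_cm = [160, 170, 180]
weights_kg = [60, 70, 80]

cv_height = coefficient_of_variation(heights_cm)
cv_weight = coefficient_of_variation(weights_kg)

print(f"Height CV: {cv_height:.1f}%")
print(f"Weight CV: {cv_weight:.1f}%")
```

Both samples have the same standard deviation here, but the weights vary more relative to their mean.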
Covariance
Measures the direction of the linear relationship between two numerical variables, i.e. positive or negative
$\text{Cov}(X,Y) = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
Correlation
Measures the direction and strength of the linear relationship between two numerical variables
$r_{XY} = \dfrac{\text{Cov}(X,Y)}{S_X S_Y}$
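The sample covariance and the correlation coefficient can be computed step by step as in the sketch below. The function names and the data are hypothetical; the point is that dividing the covariance by both standard deviations gives a unit-free value between -1 and 1.

```python
import math

def sample_covariance(xs, ys):
    """Sum of products of paired deviations, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

def sample_sd(values):
    """Sample standard deviation (divisor n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))

def correlation(xs, ys):
    """Covariance scaled by both standard deviations."""
    return sample_covariance(xs, ys) / (sample_sd(xs) * sample_sd(ys))


xs = [1, 2, 3, 4, 5]                    # hypothetical paired data
ys = [2, 4, 5, 4, 5]

cov = sample_covariance(xs, ys)
r = correlation(xs, ys)
print(f"Cov = {cov:.3f}, r = {r:.3f}")
```

A positive covariance only shows the direction of the relationship. The size of r shows how close the points lie to a straight line.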
Z Score
The difference between a given observation and the mean, divided by the standard deviation
$Z = \dfrac{x - \bar{x}}{S}$
A Z score above 3 or below -3 is considered an outlier.
An outlier can cause numerical measures to be distorted, resulting in misleading overall trends
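The outlier check can be sketched as follows. The function names, the data and the default threshold parameter are hypothetical, and the sample standard deviation is assumed.

```python
import statistics

def z_scores(sample):
    """Z score of every observation, using the sample mean and sample SD."""
    mean = statistics.mean(sample)
    sd = statistics.stdev(sample)
    return [(x - mean) / sd for x in sample]

def outliers(sample, threshold=3.0):
    """Observations whose Z score is above +threshold or below -threshold."""
    return [x for x, z in zip(sample, z_scores(sample)) if abs(z) > threshold]


# Hypothetical data: fourteen typical values and one extreme value
data = [9, 10, 10, 11, 10, 9, 11, 10, 10, 11, 9, 10, 11, 10, 40]

for x, z in zip(data, z_scores(data)):
    print(f"{x:3d}  Z = {z:6.2f}")
print("Outliers:", outliers(data))
```

With very small samples no observation can reach a Z score of 3, because the extreme value inflates the standard deviation it is measured against, so this rule is only useful for reasonably sized datasets.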
Shape
Left-skewed (negative skew): longer left tail
Right-skewed (positive skew): longer right tail
Lecture 4: Simple Linear Regression and Introduction to Probabili