LS280 - Chapter 1 2014-02-07 1:47 PM
Statistics - the science and practice of developing human knowledge through
the use of empirical data.
• Based on statistical theory, which uses probability theory to
estimate population values.
Mathematical Statistics - study of how to create the statistical methods using
Applied Statistics - the practice of developing knowledge by using statistical
methods to analyze data and make inferences about the population from
which the data came.
Conducting Research in the social world:
• First, the theory that underlies the phenomenon is considered
• Next, the researcher develops the research question regarding the
situation of interest.
• Hypotheses are developed to test the theory in light of research
• Following hypotheses development, research design involves
determining things such as how concepts are measured, type of
data needed, how the data will be collected, and the required sample size
• After data is collected, data analysis is conducted and hypotheses
• Researcher draws conclusions about his/her hypotheses and
• When quantitative research methods are used, there is a tendency
to consider statistics only in the data analysis stage.
• Statistics should be considered from the research question
development stage to the results and conclusion stage.
Population versus the Sample
• Population is the total number of individuals, objects or items you
are interested in
• Sample is a subset of the population
• Representative of the Population: when we randomly select
individuals from a population, each person has an equal chance of
being in the sample.
Descriptive Statistics versus Inferential Statistics
• Describe a phenomenon about society or people - descriptive
o Explains how frequently something of interest occurs in the
observations you make
o Ex: how many books the average Canadian reads per year
o Describe how often things occur
• Interested in knowing if there is a relationship - inferential
o Generalize the results found in a sample to the entire
population of interest
o Ex: overall mental health decreases as the number of
alcoholic drinks consumed per week increases
Ethics and Statistics
• We must present our data and findings in an honest and
• Four major areas of responsibility
o Responsibility to society
o Responsibility to Employers and Clients
o Responsibility to Other Statistical Practitioners
• Variable - a phenomenon of interest that can take on different
values and can be measured
Independent versus Dependent Variables
• Dependent variable - changes as a result of the change in an
• Independent variable - variable that is hypothesized or suggested
to influence the dependent variable
Nominal Level of Measurement
• Nominal - level of measurement when the difference within the
variable is just a name or a symbol.
o Ex: bus #101 and bus #202 - qualitative categories
• Respondent - person being observed in the research study, for example
one who completes a survey
Ordinal Level of Measurement
• Ordinal - when the answers to the qualitative categories or
attributes have some order to them
o Looking for answers we can see the order of the respondent’s
o Ex: three runners competing in a race for medals - the rank of the
medals shows only the order of performance.
Interval Level of Measurement
• Interval and ratio variables are nearly identical; they differ only in
whether the variable has an absolute 0 value.
• Interval refers to a space between things: there is an equal
difference between the levels of the variable, but the variable itself
does not have an absolute 0 value.
Ratio Level of Measurement
• When the intervals between the values are equal and comparable
and the zero value means the absence of something.
• Ex: age is a ratio level variable.
LS280 - Chapter 2
Empirical data is gathered from objects or participants of a research study.
Examples include data gathered on cultural norms, stock fluctuations, and
gender differences in academic performance.
Social science disciplines gather some form of empirical data from the real
world. After it is gathered, the data needs to be organized and presented
in a manner that can provide summary information about the phenomena of
interest.
Measurement and Coding
• Nominal - Gender (Male = 1, Female = 2)
• Ordinal - Age (20-25 = 1, 26-35 = 2)
• Interval - Satisfied with life (Strongly agree = 1, Disagree = 2)
• Ratio - internet hours (actual number of hours)
• When we assess a variable with a specific level of measurement we
say that it produces a specific type of data
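Coding can be pictured as a lookup from each answer to its numeric code. The sketch below is illustrative only: the names CODEBOOK and code_response are hypothetical, and the codes are the ones from the examples above. Ratio variables such as internet hours need no codebook because the actual number is recorded.

```python
# Hypothetical codebook built from the coding examples in these notes
CODEBOOK = {
    "gender": {"Male": 1, "Female": 2},      # nominal
    "age_group": {"20-25": 1, "26-35": 2},   # ordinal
}

def code_response(variable, answer):
    """Translate a raw survey answer into its numeric code."""
    return CODEBOOK[variable][answer]

# Two made-up survey answers
responses = [("gender", "Female"), ("age_group", "20-25")]
coded = [code_response(var, ans) for var, ans in responses]
print(coded)  # [2, 1]
```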
Frequency Distributions and Tables
• Nominal and Ordinal variables have qualitative categories as
potential values whereas interval and ratio variables have
• Given that a variable can have different values, we can count its
frequency - number of observations of a specific value within a
• Example: gender has two possible categories of male and female so
if there are 46 male and 54 female, there are two values with
• Frequency Distribution is the summary of values of a variable based
on the frequencies in which they occur. We look at how the values
of the variable are distributed across all of the cases in the data
• Relative frequency is a comparative measure of the proportion of
observed values to the total number of responses within a variable.
It provides a proportion or fraction of one occurrence relative to
Equation: relative frequency = f/n; where
f = frequency of specific responses and
n = total number of responses
• Percentage frequency also provides a useful way of displaying the
frequency of data. It is expressed as a percentage value and is
calculated as: % frequency = (f/n) x 100; where
f = frequency of responses
n = total number of responses within the variable
• Relative frequencies are written as a decimal
• Cumulative percentage frequency gives percentage of observations
up to the end of a specific value.
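The frequency measures above can be computed together. This is a minimal sketch, not a prescribed method: the function name frequency_table is made up for illustration, and the data reuses the 46 male / 54 female example with the codes Male = 1, Female = 2.

```python
from collections import Counter

def frequency_table(values):
    """Return rows of (value, f, relative f, % f, cumulative % f)."""
    counts = Counter(values)
    n = len(values)
    rows = []
    cumulative = 0
    for value in sorted(counts):
        f = counts[value]
        cumulative += f
        # relative frequency = f/n; percentage frequency = (f/n) x 100
        rows.append((value, f, f / n, f * 100 / n, cumulative * 100 / n))
    return rows

# 46 male (coded 1) and 54 female (coded 2)
gender = [1] * 46 + [2] * 54
for row in frequency_table(gender):
    print(row)
# (1, 46, 0.46, 46.0, 46.0)
# (2, 54, 0.54, 54.0, 100.0)
```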
Simple Frequency Tables for Nominal and Ordinal Data
• Simple frequency table displays frequency distribution of one
variable at a time - nominal, ordinal, interval, or ratio
• List possible values the variable can have in one column
• Record number of times that each value occurs in another column
• Frequency tables are easy to create for nominal and ordinal
variables because there is a limited range of potential values
Simple Frequency Tables for Interval and Ratio Data
• When there are too many values on which to report frequencies,
the table becomes less useful as a device to communicate summary
information about the data
• Use class intervals to create frequency tables for interval and ratio
variables that have a large range of potential values
• Class intervals - values that are combined into a single group for a
frequency table. Each has a class width (its range) and starting and
ending values, which are the class limits.
• Class intervals must be exhaustive including the entire range of the
data and mutually exclusive meaning that the class widths are
unique enough that the observed value can only be placed into one
• Step 1: Determine the range of the data (largest-smallest value)
• Step 2: Determine width and number of class intervals
o Divide range of the data by number of class intervals
o Intervals should be of equal width
• Step 3: Determine Class Boundaries
o With continuous data, having gaps may cause problems when
we have values falling between them.
o To calculate class boundary, subtract 0.50 from lower class
limit and add 0.50 to the upper class limit for each interval.
o Unlike class limits, boundaries have no gap separating adjacent
intervals because they are continuous
• Step 4: Determine Each Class Interval Midpoint
o Midpoint is the average value of the class interval - often
used as a rough estimate of the average case in each interval.
o Add the lower and upper limits and divide by 2
• In putting the frequency table together: use the class intervals,
record the number of observations that fall between the class limits
of each interval
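The four steps can be sketched in code. This is a rough illustration that assumes whole-number data; the function name class_intervals and the sample ages are hypothetical. One deliberate deviation: the width is computed from (range + 1) rather than the range alone, so that intervals with inclusive whole-number limits still cover the largest value.

```python
import math

def class_intervals(data, k):
    """Build k equal-width class intervals for whole-number data."""
    lo, hi = min(data), max(data)
    data_range = hi - lo                       # Step 1: range
    # Step 2: width (assumption: +1 so inclusive limits cover the maximum)
    width = math.ceil((data_range + 1) / k)
    table = []
    for i in range(k):
        lower = lo + i * width                 # class limits
        upper = lower + width - 1
        boundaries = (lower - 0.5, upper + 0.5)   # Step 3: boundaries
        midpoint = (lower + upper) / 2            # Step 4: midpoint
        f = sum(lower <= x <= upper for x in data)
        table.append({"limits": (lower, upper), "boundaries": boundaries,
                      "midpoint": midpoint, "f": f})
    return table

# Made-up ages, grouped into 4 class intervals
ages = [20, 22, 25, 31, 34, 38, 41, 45, 47, 52, 58, 59]
for row in class_intervals(ages, 4):
    print(row["limits"], row["boundaries"], row["midpoint"], row["f"])
```

Because the intervals are exhaustive and mutually exclusive, the frequencies add up to the number of observations.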
Cross-Tabulations for Nominal, Ordinal, Interval and Ratio Data
• Cross tabulations display a summary of the distribution of two or
• Cross-tabs allow you to observe how frequency distribution of one
variable relates to that of one or more other variables
• Tabulates the frequencies by categories or class intervals of the
variables being compared
• Cross-tab includes any combination of nominal, ordinal, interval and
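A cross-tab can be built by counting pairs of values. The sketch below is one possible approach with made-up data; the function name cross_tab is hypothetical.

```python
from collections import Counter

def cross_tab(row_var, col_var):
    """Count how often each pair of (row value, column value) occurs."""
    counts = Counter(zip(row_var, col_var))
    row_values = sorted(set(row_var))
    col_values = sorted(set(col_var))
    return {r: {c: counts[(r, c)] for c in col_values} for r in row_values}

# Made-up responses: one gender and one age group per respondent
gender = ["M", "F", "F", "M", "F", "M"]
age_group = ["20-25", "20-25", "26-35", "26-35", "26-35", "20-25"]

table = cross_tab(gender, age_group)
for row_value, row in table.items():
    print(row_value, row)
# F {'20-25': 1, '26-35': 2}
# M {'20-25': 2, '26-35': 1}
```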
Comparing the Distribution of Frequencies
• When data spans different time periods, we can calculate change
from one year to another and report this change as a percentage
• P = ((f2 - f1)/f1) x 100
where f1 = frequency of a specific response at time 1
f2 = frequency of a specific response at time 2
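A small sketch of the percentage change calculation. The function name and the example frequencies are hypothetical. The parentheses make explicit that the difference between the two frequencies is divided by the time 1 frequency before multiplying by 100.

```python
def percentage_change(f_time1, f_time2):
    """Percentage change in a frequency from time 1 to time 2."""
    if f_time1 == 0:
        raise ValueError("time 1 frequency must be non-zero")
    return (f_time2 - f_time1) / f_time1 * 100

# Made-up example: a response chosen 200 times in year 1 and 250 in year 2
print(percentage_change(200, 250))  # 25.0
```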
• Comparison of two values of a variable based on their frequency
• Ratio = fv1/fv2
where fv1 = frequency of first value to be compared
fv2 = frequency of second value to be compared
Rates
• Ratios are useful when the values being compared are in the same
• When you need to compare values where other factors affect those
values, rates are more useful.
• Rate = (number of events for the population of interest / total
size of the population of interest) x 10,000
o Multiply by 10,000 to avoid small decimals (per 10,000
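Ratios and rates can be sketched as two small functions. The names and example numbers are hypothetical; the per parameter defaults to 10,000 as in the notes.

```python
def ratio(fv1, fv2):
    """Ratio of the frequency of one value to the frequency of another."""
    return fv1 / fv2

def rate(events, population, per=10_000):
    """Number of events per `per` members of the population of interest."""
    return events * per / population

# Made-up examples
print(round(ratio(54, 46), 2))  # 1.17
print(rate(150, 300_000))       # 5.0
```

A rate of 5.0 here reads as 5 events per 10,000 people, which is easier to compare across populations of different sizes than the raw count of 150.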
Pie Charts and Bar Charts, for Nominal and Ordinal Data
• Pie chart displays distribution of a variable out of 100 percent
where 100 percent represents the entire pie
• Frequency or percentage frequency may be used in constructing the
• Pie charts are useful for nominal and ordinal variables
• Bar chart displays the frequency of a variable with the variable
categories along the x-axis and the variable frequencies on the y-axis
Frequency Polygon and Cumulative Percentage Frequency Polygon, for
Interval and Ratio Level Data
• Frequency Polygon is a line graph of the frequency distribution of
interval or ratio data and is constructed by placing class intervals on
the x-axis and frequencies on y-axis.
• Frequency polygon can be used to compare the distribution of a
variable across groups of respondents
o Example: for the question "at what age did you begin drinking
alcohol?", compare the frequency distributions (male vs female) of
the age at which respondents stated they began drinking alcohol.
• Cumulative frequency polygon graphs the cumulative frequency
column in a frequency polygon - used for comparing frequency of a
variable across groups
Histograms
• Histogram is a plot of the frequency of interval or ratio data.
• Histograms are useful for representing interval and ratio variables
because they show continuous nature of data without necessarily
creating class intervals.
• Class intervals were constructed to reduce the variable from a ratio
level to four classes.
Stem and Leaf Plots
• Provides a graphical representation of the frequency of interval or
• Drawback of a histogram is that actual values of the data are not
shown; a stem and leaf plot keeps them visible
• In a stem and leaf plot, the shape of the leaves corresponds to the
shape of the histogram
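A stem and leaf plot can be sketched in a few lines. This illustration assumes non-negative whole numbers, uses the tens digit as the stem and the ones digit as the leaf, and runs on made-up data; the function name is hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group each value's last digit (leaf) under its leading digits (stem)."""
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        plot[stem].append(leaf)
    return dict(plot)

# Made-up data
data = [24, 12, 37, 21, 15, 24]
for stem, leaves in stem_and_leaf(data).items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
# 1 | 2 5
# 2 | 1 4 4
# 3 | 7
```

The length of each row of leaves plays the role of the bar height in a histogram.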
• Graphical summary of data based on percentiles
• Box represents the distribution of the data between the 25th and
75th percentiles
• The light line in the middle represents the median (50th percentile)
and the lines coming out of the box extend to the lowest and
highest value in the data which provides the range
LS280 - Chapter 1 Synopsis
Chapter 1 Summary
• Three Branches of Statistics
o Description - more prevalent in daily life.
o Association - degree to which we can calculate the extent to
which patterns move together and are associated
o Inference - ability to take information based on smaller
convenient samples and generalize it to the entire population
• Populations vs. Samples
o Sampling Error - an error or assumption of error (refers to
things which are random or improbable). Randomness causes
problems in interpretation.
Assume it happens all the time
Build an estimate into our calculations
o Independent and dependent variables
Independent variables INFLUENCE the dependent
variables. Usually ‘x’
Dependent variables are outcome variables that we feel
are interesting. Chosen to see what influences them.
The independent variables explain the dependent
o Scope conditions
• Units of Analysis - represents what we gather the data on
o Couples etc
• Measurement Levels
o Ratio
LS280 - Chapter 3
Measures of Central Tendency
• Which value lies in the middle of a distribution.
• Mode - value that occurs with greatest frequency.
o When there is too much data, construct a frequency table in
ascending order by frequency
o One mode - unimodal
• Median - middle point in the distribution that separates the upper
and lower 50 percent of values.
o Numbers have to be put in ascending order
o If there is an odd number of values, the median is the value at
position (n+1)/2
o If there is an even number of values, the middle point is the
average of the two middle values (positions n/2 and n/2 + 1)
• Mean - mean or average score in the distribution
o To obtain the mean score, you add up all of the x scores and
divide the total by the sample size (n).
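The three measures can be sketched as follows, using made-up scores. The helper median_by_position is hypothetical; for an even number of values it uses the usual convention of averaging the two middle values. statistics.multimode from the Python standard library returns every value tied for the greatest frequency.

```python
import statistics

def median_by_position(values):
    """Median found by position after sorting in ascending order."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # odd n: the value at position (n+1)/2, counting from 1
        return ordered[(n + 1) // 2 - 1]
    # even n (assumed convention): average of the two middle values
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

scores = [2, 3, 3, 5, 7, 8, 14]
print(statistics.multimode(scores))   # [3]
print(median_by_position(scores))     # 5
print(sum(scores) / len(scores))      # 6.0
```

Note how the extreme value 14 pulls the mean (6.0) above the median (5).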
When to use Mean, Median and Mode as Measures of Central Tendency
• Mode for nominal data
o Categories in a nominal level of measurement are qualitative
labels and as such the differences within the variables are
really just a name or symbol, they are not numeric.
o Mode tells you which is the most commonly occurring
• Median for ordinal data
o Because data for ordinal level variables have some order to them, the
value that sits at the center of the distribution is meaningful.
• Mean for Interval/Ratio
o The responses have an order to them (e.g., first, second,
third) and there are equal intervals between the responses
o However, if the data is denser on one end of the distribution
than the other, there is skewness in which case Median is
o The mean is sensitive to extreme values (also referred to as
outliers), whereas the median is not because the mean uses
the actual values in its estimation, whereas the median only
counts the number of values. Use mean and median
Measures of Dispersion
• Describe the variability in the data.
• Variability is the extent to which the data varies from its mean -
tells us how spread out the data is across the range of values
o percentage frequency and cumulative percentage frequency
columns from the frequency table
o percentiles are percentages of frequencies indicating the
percentage of scores that fall within a given area.
• Calculating the percentile
o place the numbers in order from lowest to highest, then
number them in order of their position.
o To determine an unknown score for a known (or desired)
percentile you multiply the desired percentile by the sample
size
o If you want to know the score for the 50th percentile,
multiply the desired percentile (0.50) by the sample size - the
resulting value represents the position of the number
o To determine the percentile for a known score, divide the
position number of the known score by the sample size
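The position method for percentiles can be sketched as below. The function names and scores are made up. Rounding a fractional position up to the next whole position is an assumption of this sketch; statistical software often interpolates instead, so results may differ slightly.

```python
import math

def score_at_percentile(values, p):
    """Score at percentile p (given as a decimal, e.g. 0.50)."""
    ordered = sorted(values)
    # position = percentile x sample size (assumption: round up fractions)
    position = max(math.ceil(p * len(ordered)), 1)
    return ordered[position - 1]

def percentile_of_score(values, score):
    """Percentile of a known score: its position divided by the sample size."""
    ordered = sorted(values)
    position = ordered.index(score) + 1   # position of first occurrence
    return position * 100 / len(ordered)

scores = [55, 60, 62, 68, 70, 71, 75, 80, 88, 93]
print(score_at_percentile(scores, 0.50))  # 70
print(percentile_of_score(scores, 80))    # 80.0
```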
• The Range
o the range is the value of the largest observation minus the
value of the smallest observation
o The interquartile range creates quartiles (25th, 50th, and
75th percentiles) and removes the first and fourth quartiles
from the calculation of its range, so it considers only the middle 50
percent of the data (25th percentile to the 75th percentile).
• Calculating Interquartile Range:
o calculate the 25th and 75th percentiles
o Since the interquartile range only considers the values
between the 25th and the 75th percentile, the range can be
obtained by subtracting the value in the 25th percentile from
the value in the 75th percentile
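A sketch of the range and interquartile range on made-up scores. The function names are hypothetical, and percentiles are located by position (percentile x sample size, rounded up), which is an assumption; other percentile conventions give slightly different quartiles.

```python
import math

def value_at_percentile(ordered, p):
    """Value at percentile p in an already sorted list (position method)."""
    position = max(math.ceil(p * len(ordered)), 1)
    return ordered[position - 1]

def data_range(values):
    """Largest observation minus smallest observation."""
    return max(values) - min(values)

def interquartile_range(values):
    """75th percentile value minus 25th percentile value."""
    ordered = sorted(values)
    q1 = value_at_percentile(ordered, 0.25)
    q3 = value_at_percentile(ordered, 0.75)
    return q3 - q1

scores = [55, 60, 62, 68, 70, 71, 75, 80, 88, 93]
print(data_range(scores))           # 38
print(interquartile_range(scores))  # 18
```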
• The Variance
o extent to which the data varies from its mean
o If we subtract the mean from each observation, we get the
deviation, meaning how far the observation deviates from its
mean
o Square each deviation value and sum them up
o divide the sum of squares of the deviations by the sample
size minus 1 - average variability within the data
• Standard Deviation
o average amount, measured in standard units, in which the
data scores vary (positively and negatively) from the mean.
o most commonly used measure of dispersion because it allows
the researcher to report the deviation in the units of
measurement used in the study
o Higher standard deviation
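The variance and standard deviation steps can be sketched as follows, using made-up data. The function names are hypothetical. The standard deviation is the square root of the variance, which returns the measure to the original units of measurement.

```python
import math

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    deviations = [x - mean for x in values]   # observation minus mean
    return sum(d ** 2 for d in deviations) / (n - 1)

def sample_std(values):
    """Square root of the sample variance."""
    return math.sqrt(sample_variance(values))

data = [2, 4, 4, 4, 5, 5, 7, 9]   # mean is 5, sum of squares is 32
print(round(sample_variance(data), 3))  # 4.571
print(round(sample_std(data), 3))       # 2.138
```

The results should agree with statistics.variance and statistics.stdev from the Python standard library, which also divide by n - 1.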