# BSB 123-Study Notes

Queensland University of Technology

Management and Human Resources

BSB123

All

Spring

BSB 123 – DATA ANALYSIS NOTES
LECTURE 1: INTRODUCTION TO STATISTICS
INTRODUCTION TO EXCEL
LECTURE NOTES
Key Definitions
• Population: members of a group about which you want to draw a conclusion ie. all of what you are
interested in
• Sample: portion or subset of the population selected for analysis
• Parameter: numerical measure which describes a characteristic of a population ie. average based on
population data
• Statistic: numerical measure that describes a characteristic of a sample ie. average based on a sample. It
is more practical and is used more than parameters
Population vs. Sample
• Measures used to describe a population are called parameters
• Measures used/computed from sample data are called statistics
• Size of the sample has a big impact on the result
• Analysing the sample allows you to say something about a larger group but there is always a chance for
error
Population and Sample Data
• A sample is a selection of measurements (subset) from all measurements (population)
• Purpose of analysing a sample is to make a statistical inference
• A note on notation:
- Greek letters (µ, σ) are used for population parameters, with upper-case N for the population size
- Roman letters (x̄, s, n) are used for sample statistics
*Note: each measure therefore has two formulas, one for the population and one for the sample*
Types of Data
• Data does not have to be numerical
• There are two types of data:
- categorical/qualitative
- numerical/quantitative
• Numerical data is measured on a natural number scale
• Categorical data can only be named or categorised. It can be further categorised:
- nominal: no natural/implied order
- ordinal: there is an implied order
Further Classifications of Numerical Data
• Continuous or discrete
- continuous: Can take on any real number
Infinite number of items eg. time
- discrete: Countable number of responses
Finite number of items (looking at integers)
Note: whilst there cannot be half a person there can be half a shoe size
• Interval or ratio
- interval: Difference between measurements (no true 0) eg. temperature
Jessica King BSB123 – Data Analysis Semester 1, 2009 2
It is untrue to say size 10 is double a size 5 (shoe sizes)
Uses discrete data
- ratio: Differences between measurements where the true 0 exists
It is true to say $100 is double $50
Uses continuous data
• Time Series or Cross Sectional
- time series: Data collected through time (look for trends)
- cross sectional: Collected for a certain point in time
TEXTBOOK NOTES – Chapter 1: Presenting and Describing Information
Definitions
• Variables: characteristics of items or individuals
• Data: observed values of variables
• Population: consists of all the members of a group about which you want to draw a conclusion. Two
factors need to be specified when defining a population:
• the entity (eg. people or vehicles)
• the boundary (eg. those registered to vote or registered in QLD for road use)
• Sample: the portion of the population selected for analysis. The people or vehicles in the sample
represent a portion, or subset, of the people or vehicles comprising the population.
• Parameter: numerical measure that describes a characteristic of a population
• Statistic: numerical measure that describes a characteristic of a sample
• Categorical variables: yield categorical responses, such as yes or no answers. Categorical responses can
also yield more than one possible response.
• Continuous variables: produce numerical responses that arise from a measuring process. The more
precise the measuring device used, the greater the likelihood of detecting small differences in
measurements and therefore having more precise data.
• Descriptive statistics: focuses on collecting, summarising and presenting a set of data
• Discrete variables: produce numerical responses that arise from a counting process.
• Focus group: a market research tool which is used to elicit unstructured responses to open-ended
questions.
• Inferential statistics: uses sample data to draw conclusions about a population.
• Interval scale: ordered scale in which the difference between measurements is a meaningful quantity but
does not involve a true zero point.
• Nominal scale: classifies data into various distinct categories in which no ranking is implied. Nominal
scaling is the weakest form of measurement because you cannot specify any ranking across the various
categories.
• Numerical variables: yield numerical responses, such as your height in centimetres. There are two types
of numerical variables: discrete and continuous.
• Ordinal Scale: classifies data into distinct categories in which ranking is implied ie. things are ranked in
order of satisfaction level. Ordinal scaling is a stronger form of measurement than nominal scaling
because an observed value classified into one category possesses more of a property than does an
observed value classified into another category. However, ordinal scaling is still relatively weak because
the scale does not account for the amount of the differences between the categories. Ordering only
implies which category is greater or preferred – not by how much.
• Operational definition: a universally accepted meaning that is clear to all associated with an analysis
• Ratio scale: an ordered scale in which the difference between the measurements involves a true zero
point eg. weight, length, age or salary.
Basic Concepts of Statistics Chapter Summary
• Statistics examines ways to process and analyse data and provides procedures to collect and
transform data in ways that are useful to business decision-makers
• Identifying the most appropriate source of data is a critical aspect of statistical analysis.
• Data from a categorical variable are measured on a nominal scale or on an ordinal scale.
• Data from a numerical variable are measured on an interval or ratio scale.
• Data measured on an interval scale or on a ratio scale constitute the highest levels of measurement.
They are stronger forms of measurement than an ordinal scale because you can determine not only
which observed value is the largest but also by how much.
LECTURE 2: PRESENTING DATA IN TABLES AND CHARTS
LECTURE NOTES
Presenting data – what to do with information – allows it to be seen visually
Tables and Charts for Categorical Data
Categorical Data
- Summary Table
- Graphing Data: Bar Charts, Pie Charts
The Summary Table/Frequency Table
• Choose key points of the information and communicate these on the table
• Example: Gender; M, F, M, M, F, F, M, F, M, F, F ...
Gender Tally Frequency
M IIII ... 570
F II ... 430
Total 1000
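A summary table like the one above can be built by counting each category. The sketch below is only an illustration: the `responses` list is a hypothetical stand-in containing just the first few values shown in the example (the full 1000 responses are not given in the notes), and `summary_table` is a made-up helper name.

```python
from collections import Counter

# Hypothetical data: only the first few responses listed in the notes.
responses = ["M", "F", "M", "M", "F", "F", "M", "F", "M", "F", "F"]

def summary_table(values):
    """Return {category: (frequency, percentage)} for categorical data."""
    counts = Counter(values)
    total = sum(counts.values())
    return {cat: (freq, 100 * freq / total) for cat, freq in counts.items()}

table = summary_table(responses)
for category, (freq, pct) in sorted(table.items()):
    print(f"{category:<8}{freq:>5}{pct:>8.1f}%")
print(f"{'Total':<8}{len(responses):>5}")
```

The percentages column is what a pie chart would display, while the frequency column suits a bar chart.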
Bar Chart
• Best for discussing the best/worst amount, preferred option etc.
• Each category is discrete and separate
• In presentation, always use an appropriate title/bar heading
• In Excel, use absolute references eg. =B2/$B$6, so that the second cell reference does not change when the formula is copied
Pie Chart
• Useful for graphing percentages - % are rounded to the nearest whole percentage
• More useful for discussing portions
• Not for use with more than 6-8 categories – as it is not visually effective and doesn’t correctly display the
information figures.
Tables and Charts for Numerical Data
Numerical Data
- Ordered Array: Stem and Leaf Display
- Frequency Distributions and Cumulative Distributions: Histogram, Polygon, Ogive
The Ordered Array
• A sequence of data in rank order provides signals of variability within the range and may help identify
outliers
• Abnormal values/outliers- figures which are too small or too large compared to the other results. They
can have a significant impact on the result through distorting the answer/end result. These figures can be
removed if they are too significant.
Stem and Leaf Diagram
• Quick and simple way to see distribution details in a data set
• Separate sorted values into groups (stem) and values within each group (leaves)
• Data may be heavily skewed ie. more of a certain result than others and the distribution can be either
symmetrical or non-symmetrical
• Round off numbers when they have more than 2 digits. Only ever leave one unit – results in the loss of some
detail eg. Slide 11
Tabulating Numerical Data: Frequency Distributions
• A summary table in which data is arranged into numerically ordered classes or intervals
• A summary/condensed version of numerical data – condenses raw material and makes it more useful i.e.
allows for quick visual interpretation
• Example: Ages; 41, 39, 21, 35, 65, 54 ...
Age Frequency
21 1
22 0
23 1
... ...
65 0
This is not a good/accurate/feasible representation, therefore group responses into CLASSES.
Grouped in classes:
Age Frequency
20-<30 5
30-<40 10
40-<50 12
Class Intervals and Boundaries
• Data belongs to one and only one class ie. there are no overlaps and all have the same width/interval
• Width determined by: range (max – min) / desired number of class groupings
• At least 5 but not more than 15 groups i.e. less data = smaller number of groups
• Round up interval width to get desirable end points
Cumulative Frequency
• Adding as you go down in the frequency/percentage column
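The class-width rule and the cumulative column can be sketched in a few lines. This is an illustration only: the `ages` list is hypothetical (the notes show only the first few values), and the starting point of 20 and width of 10 are assumed choices that give the tidy 20-<30, 30-<40 style classes used above.

```python
import math
from itertools import accumulate

# Hypothetical ages; the notes only list the first few values.
ages = [41, 39, 21, 35, 65, 54, 23, 47, 58, 33, 29, 44]

def class_width(values, n_classes):
    """Range (max - min) divided by the desired number of classes, rounded up."""
    return math.ceil((max(values) - min(values)) / n_classes)

def frequency_distribution(values, start, width, n_classes):
    """Count values in the classes [start, start+width), [start+width, ...)."""
    freqs = [0] * n_classes
    for v in values:
        index = (v - start) // width
        if not 0 <= index < n_classes:
            raise ValueError(f"{v} falls outside the classes")
        freqs[index] += 1
    return freqs

raw_width = class_width(ages, 5)   # (65 - 21) / 5 = 8.8, rounded up to 9
width = 10                         # assumed: rounded further for tidy end points
freqs = frequency_distribution(ages, start=20, width=width, n_classes=5)
cumulative = list(accumulate(freqs))  # adding as you go down the column

for i, (f, c) in enumerate(zip(freqs, cumulative)):
    lower = 20 + i * width
    print(f"{lower}-<{lower + width}: frequency {f}, cumulative {c}")
```

Because each value maps to exactly one index, every value belongs to one and only one class, and the last cumulative figure always equals the number of values.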
The Histogram
• Graph of data in a frequency distribution
• Only for continuous data, whilst bar charts are for discrete data
• Boundaries are shown on the x axis and frequency or percentage is shown on the y axis.
• Class ranges over the whole amount ie. there are no gaps between the bars
• X axis is like a number line
• Collectively exhaustive
• Excel does not do histograms automatically ie. change the setting to ensure there are no gaps!
Cross-tabulations
• Graphs with 2 variables ie. type of cola preferred and ethnic background
• Contingency table: allows for cross-tabulation of data – data values or percentage figures can be used (both
rows and columns)
• Clustered and stacked bar charts: converts information from contingency table into graphical form. Also
referred to as a side-by-side bar chart
Scatter Diagrams
• Used to examine possible relationships between two numerical variables
• One variable is measured on the vertical/y axis – the dependent variable and the other is measured on the
horizontal/x axis – the independent variable
• See whether variation in the independent variable affects the dependent variable eg. Ice cream sales and temperature
– Do ice cream sales (y) depend on temperature (x)?
• Note: if all the data in a series go upwards – it is a positive relationship
If all the data in a series go downwards – it is a negative relationship
If the data in a series are all random – no relationship exists
• In Excel, the column on the left defaults to be x whilst the column on the right defaults to be y
Time Series Plot
• Used to study patterns in values of a variable over time
• Time ALWAYS goes on the x axis
Guidelines for good graphs
• Vertical axis scale should begin at zero
• Graph should contain a title
• Use a scale – properly labelled and scaled
• Use the simplest graph for the data given
• Avoid unnecessary things eg a key
• Do not distort data ie. frequency/quantity
• Amount of data should be proportional to the area/volume
TEXT BOOK NOTES: Chapter 2: Presenting Data in Tables and Charts
Definitions
Summary table: gives the frequency, proportion or percentage of the data in each category which allows
differences to be seen between the categories
Bar Chart: represented by a bar, the length of which indicates the proportion, frequency or percentage of values
falling into that category
Pie Chart: a circle, used to represent the total, which is divided into slices, each slice representing a category
Ordered Array: sorting the raw data in order of magnitude – from smallest to largest
Stem-and-leaf Displays: a quick and easy way of displaying numerical data easily by dividing into groups
(stems) such that the values within each group (the leaves) branch out on the right of each row.
Frequency Distributions: a summary table in which the data is arranged into numerically ordered classes or
intervals
Cumulative Distributions: gives the percentage of values that are less than a certain value
Histogram: a grouped frequency, relative frequency or percentage distribution can be graphically represented as
a histogram. The horizontal axis is divided into intervals corresponding to the classes
Contingency Tables: (or cross-classification tables) presents the data for two categorical variables. The row
contains the categories of one variable and the columns the categories of the other variable.
Side-by-Side Bar Charts: a useful way to display the results of cross classification data/to make comparisons
between data
Scatter Diagram: used when examining the relationship between two numerical variables to obtain a picture of
the relationship
Time-series plot: used to study patterns in the values of a variable over time
What type of chart should you use?
• Selection of a chart depends on your intention.
• If it is the comparison of categories which is most important, use a bar chart
• If observing the portion of the whole that lies in a particular category is most important, use a pie chart
Frequency Distributions
• Ordered arrays and stem-and-leaf plots are of limited use when we have a large quantity of data or the
data is highly variable
• Allows you to condense a set of data
• Select an appropriate number of classes and a suitable class width. Classes should be exhaustive and
mutually exclusive, so that any data value belongs only to one class
• The number of classes chosen depends on the amount of data – a small number of classes for small
amounts of data and a larger number of classes for larger amounts of data
• If there are too few classes, we lose too much information and if there are too many classes, the data are
not condensed enough
• Each class should be of equal width, determined by: Class width = range/number of classes
• The centre of each class, called the class mid-point, is halfway between the lower boundary and the
upper boundary
Relative Frequency Distributions and Percentage Distributions
• A relative frequency distribution is obtained by dividing the frequency in each class by the total number
of values
• From this, a percentage distribution can be obtained by multiplying each relative frequency by 100%
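As a small worked sketch of these two steps, the class frequencies 5, 10 and 12 from the grouped age example in the lecture notes are reused below. Treating those three classes as the whole distribution is an assumption made only for illustration, and `relative_frequencies` is a hypothetical helper name.

```python
def relative_frequencies(frequencies):
    """Divide each class frequency by the total number of values."""
    total = sum(frequencies)
    return [f / total for f in frequencies]

# Class frequencies for 20-<30, 30-<40, 40-<50 (assumed to be the full table).
frequencies = [5, 10, 12]

relative = relative_frequencies(frequencies)
percentages = [100 * r for r in relative]   # multiply by 100%

for f, r, p in zip(frequencies, relative, percentages):
    print(f"frequency {f:>2}  relative {r:.3f}  percentage {p:.1f}%")
```

The relative frequencies always sum to 1 and the percentages to 100%, which is a quick check on the arithmetic.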
Scatter Diagrams
• When examining the relationship between two numerical variables, we can use a scatter diagram to
obtain a picture of a possible relationship.
• You plot one variable, the independent variable on the horizontal or x axis and the other variable, the
dependent variable on the vertical or y axis.
Time-Series Plots
• Used to study patterns in the values of a variable over time.
• Displays the time period on the horizontal axis and the variable of interest on the vertical axis
Summary Table
• Tabulating, organising and graphically presenting the values of a variable
- Numerical data: ordered array, stem and leaf display, frequency distribution, relative frequency distribution, cumulative percentage distribution, histogram
- Categorical data: summary table, bar chart, pie chart
• Graphically presenting the relationship between two variables
- Numerical data: scatter diagram, time series plot
- Categorical data: contingency table, side-by-side bar chart
LECTURE 3: NUMERICAL DESCRIPTIVE MEASURES
LECTURE NOTES
Describing Data
Measures of Central Tendency
• Mean/average
• Median (centre of data)
• Mode
Arithmetic Mean/Average
• Most common measure of central tendency
• Affected by extreme values (outliers)
• Formula: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
• Mean example: 2, 4, 5, 9, so $x_1 = 2$, $x_2 = 4$, $x_3 = 5$, $x_4 = 9$ and $\bar{x} = 20/4 = 5$
Median
• The middle number/value
• Not affected by extreme values and is therefore a better representation of central tendency
• Median position = (n+1)/2 (this gives the ranked position, not the median value itself)
• Median example:
o n = 5: median position = (5+1)/2 = 3, ie. the 3rd value in the ordered array
o Data = 2, 3, 4, 5, 6, 10
Median position = (6+1)/2 = 3.5. As there is no 3.5th value, find the average of the two
middle values ie. (4 + 5)/2 = 4.5
• If n is odd, the position will always be a whole number
• If n is even, the position will always be a half value
Mode
• Value that occurs most/is most frequent
• Not affected by extreme values
• There may be no single mode answer for a given data set – unlike mean and median
• There may however, be several modes:
o 2 modes = bimodal
o 3 modes = trimodal
o 4+ modes = calculation may be irrelevant
• Mode example: 1, 1, 1, 100, 101, 102, 104, 109, 110 mode = 1
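The three measures of central tendency can be checked against the examples in these notes using Python's standard `statistics` module. This is just a sketch for checking the hand calculations, not part of the lecture material.

```python
import statistics

# Mode example from the notes: the mode is unaffected by the large values.
data = [1, 1, 1, 100, 101, 102, 104, 109, 110]

mean = statistics.mean(data)        # pulled upwards by the large values
median = statistics.median(data)    # the 5th of the 9 ordered values
modes = statistics.multimode(data)  # list, since there can be several modes

print("mean:", round(mean, 2))
print("median:", median)
print("mode(s):", modes)

# The earlier worked examples, for comparison.
print(statistics.mean([2, 4, 5, 9]))           # mean example
print(statistics.median([2, 3, 4, 5, 6, 10]))  # even n: average of 4 and 5
```

`multimode` returns every value tied for most frequent, so a bimodal data set gives a list of two values rather than an error.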
Review Example
• House prices – talk about median not average
• Mean: provides you with a summary but the outlier distorts the results. Therefore, median is the best
measure!
Quartiles
• Indication of how data is spread over the range
• Splits ascending order data into 4 segments with an equal number of values per segment
• Second quartile = the same as the median
• Q1 and Q3 = measures of non-central tendency
• Not as useful for data spread over a small range
• Quartile positions (where n = the number of observed values):
o Q1 = (n+1)/4
o Q2 = (n+1)/2
o Q3 = 3(n+1)/4
Measures of Variation
Variation
Measures: Range, IQR, Variance, Standard Deviation, Coefficient of Variation
Gives information on the spread or variability of data values
• Example:
A. 4, 5, 6 x̄ = 5
B. 0, 5, 10 x̄ = 5
C. -10, 5, 20 x̄ = 5
The mean alone doesn’t reveal what is truly going on – there is significant variation in the data
Range
• Difference between the largest and the smallest values in a set of data
• Range = $X_{largest} - X_{smallest}$
• Ignores data distribution
• Sensitive to outliers
Interquartile Range (IQR)
• A resistant summary measure – do we need to look at the full spread?
• Eliminates outlier problems by removing high/low calculations
• IQR = Q3 – Q1 ie. it measures the spread of the middle 50% of the data
• Less misleading than range calculations
• Resistant to changes in extreme values
Measures of Dispersion
• Sets of data may have the same mean but may have a different/wider/smaller dispersion of data across
the range
Quantifying Dispersion
• For example, 10 data points give 10 distances (deviations) from the mean: some negative (below the mean) and some positive (above the mean)
• Added together, the negative and positive distances cancel each other out
Therefore square the distances (to make them all positive) and work with the squared deviations
Mean Squared Deviation/Variance
• Calculate each deviation; square them all (including the positive results) and calculate the average
• Note: this is not a true average but an approximate value
• When finding the average (using sample figures), divide by n – 1 (the number of observations minus 1) rather than n, to adjust for the bias of the sample statistic
Variance
$S^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
Standard Deviation
• Most commonly used measure of variation
• Shows variation about the mean
$S = \sqrt{S^2}$
Measures of Dispersion
Measuring Dispersion
• Small standard deviation = average distance from a point to the mean is smaller
• Large standard deviation = average distance from a point to the mean is larger
Variance and Standard Deviation
Advantages
• Each value in the data set is used in the calculation
• Values far from the mean are given extra weight as the mean deviations are squared
Disadvantages
• Sensitive to outliers
• Measure of absolute NOT relative variation
Example:
a. S = 10
b. S = 100
Cannot say there is more dispersion in B than A unless there is more information
When comparing data with different means, the coefficient of variation NOT the standard deviation must be
used.
Coefficient of Variation
• Measures relative variation ie. shows variation relative to the mean
$CV = \frac{S}{\bar{X}} \times 100\%$
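The sample variance, standard deviation and coefficient of variation can be sketched directly from their formulas. The three data sets are A, B and C from the variation example earlier in this lecture, which all share a mean of 5; the function names are illustrative only.

```python
import math

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

def sample_std(values):
    """Square root of the sample variance."""
    return math.sqrt(sample_variance(values))

def coefficient_of_variation(values):
    """Standard deviation relative to the mean, as a percentage."""
    mean = sum(values) / len(values)
    return sample_std(values) / mean * 100

data_sets = {"A": [4, 5, 6], "B": [0, 5, 10], "C": [-10, 5, 20]}
for name, values in data_sets.items():
    print(name,
          "variance:", sample_variance(values),
          "std dev:", sample_std(values),
          "CV %:", coefficient_of_variation(values))
```

All three sets have the same mean, yet the standard deviation grows from A to C, which is exactly the variation the mean alone hides.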
Z Scores
• The difference between a given observation and the mean divided by the standard deviation
• A negative score would indicate the value is below the mean
• Eg. A score of 2.0 means that a value is 2.0 standard deviations away from the mean
$Z = \frac{X - \bar{X}}{S}$
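A minimal sketch of the Z score calculation follows, using the standard `statistics` module for the mean and the sample standard deviation. The data set 0, 5, 10 is reused from the variation example; `z_scores` is a hypothetical helper name.

```python
import statistics

def z_scores(values):
    """(value - mean) / sample standard deviation, for every value."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)   # sample standard deviation (n - 1)
    return [(x - mean) / s for x in values]

values = [0, 5, 10]                # mean 5, standard deviation 5
scores = z_scores(values)
for x, z in zip(values, scores):
    position = "below" if z < 0 else "above" if z > 0 else "at"
    print(f"value {x}: z = {z:+.1f} ({position} the mean)")
```

A negative score flags a value below the mean, and unusually large scores in either direction point to possible outliers.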
Shape of a Distribution
• Describes how data are distributed ie. either symmetrical or skewed
• Example: x̄ = 34/4 = 8.5
Median = 10.5
Mean < median and therefore left-skewed
Box-and-Whisker Plots
• Box and central line are centred between the end points if data is symmetric around the mean
• Data can still be skewed
Everything so far has been based on analysing only one variable.
Bivariate data measures (two variables)
• Calculating the relationship between 2 variables ie. as x increases so too does y
Sample Covariance
• Measures the strength between two numerical variables
• Can be either a positive or a negative answer
• Only concerned with the direction of the relationship
$\mathrm{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$
Sample Coefficient of Correlation (r)
• Measures the relative strength of the linear relationship between two variables
$r = \frac{\mathrm{cov}(X, Y)}{S_X S_Y}$
Features of r (correlation)
• Invariant to units of measure
• Ranges ALWAYS between -1 and 1
o Closer to -1, stronger negative linear relationship
o Closer to +1, stronger positive linear relationship
o Closer to 0, weaker linear relationship
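The covariance and correlation formulas can be sketched as below. The temperature and ice cream sales figures are hypothetical numbers made up for illustration (the notes mention the example but give no data), and the function names are not from the lecture.

```python
import math

def sample_covariance(xs, ys):
    """Sum of products of deviations from each mean, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def sample_correlation(xs, ys):
    """Covariance divided by the two sample standard deviations."""
    s_x = math.sqrt(sample_covariance(xs, xs))  # cov(X, X) is the variance of X
    s_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (s_x * s_y)

# Hypothetical data: temperature (x) and ice cream sales (y).
temperature = [20, 25, 30, 35]
sales = [110, 150, 180, 240]

cov = sample_covariance(temperature, sales)
r = sample_correlation(temperature, sales)
print("covariance:", cov)
print("correlation:", round(r, 3))

# r is invariant to units: converting to Fahrenheit leaves it unchanged.
fahrenheit = [t * 1.8 + 32 for t in temperature]
print("correlation (Fahrenheit):", round(sample_correlation(fahrenheit, sales), 3))
```

The covariance changes when the units change, but r does not, which is why r is the better measure of relative strength.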
TEXT BOOK NOTES: Chapter 3: Numerical Descriptive Measures
Definitions
Variation: measures the spread or dispersion of values in a data set
Range: difference between the highest and lowest value
Box-and-whisker plot: a graphical representation of the data based on the five-number summary
Mean: The sum of the values divided by the number of values
Median: the middle value in a set of data that has been ordered from lowest to highest value
Mode: value in a data set that appears most frequently
Interquartile range: the difference between the third and first quartiles in a data set
Coefficient of Variation: relative measure of variation, expressed as a percentage rather than in terms of the
units of particular data
Z Scores: measures of relative standing that take into consideration both the mean and the standard deviation
Covariance: measure of the strength and direction of the linear relationship between two numerical variables
Mean
• Most common measure of central tendency
• Uses all the data values and can be calculated exactly
• The sum of the values divided by the number of values
Median
• The middle value in a set of data that has been ordered from lowest to highest value
• Value that partitions or splits an ordered set of data into two equal parts
• Not affected by extreme values
• 50% of values are equal to or smaller than the median and 50% of values are equal to or larger than the
median
• Calculate the median by two rules:
o If there is an odd number of values, the median is the middle ranked value
o If there is an even number of values, the median is the mean of the two middle ranked values
Mode
• Value in a data set that appears most frequently
• Extreme values do not affect the mode
• More variable from sample to sample than either the mean or median.
• There can be no mode or there can be several modes for a set of data
Quartiles
• Divide a set of data into quarters
• First quartile, Q1, divides the lower 25% of values from the other 75%
• Second quartile, Q2, is the median or 50% point
• Third quartile, Q3, has 75% of values below it and 25% of values above it
• Use the following rules to calculate quartiles:
o If the result is an integer, then the quartile is equal to the ranked value
o If the result is a fractional half, then the quartile is equal to the mean of the corresponding ranked
values.
o If the result is neither an integer nor a fractional half, round the result to the nearest integer and
select that ranked value
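The three rules above can be sketched as a small function. It assumes the quartile positions (n+1)/4, (n+1)/2 and 3(n+1)/4 from the lecture notes, and it deliberately follows these textbook rules rather than the interpolation methods used by spreadsheet or library quartile functions, which can give slightly different answers. The name `quartile` is illustrative.

```python
def quartile(values, q):
    """Return quartile q (1, 2 or 3) using the ranked-position rules."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two values")
    position = q * (n + 1) / 4        # 1-based rank in the ordered array
    whole = int(position)
    fraction = position - whole
    if fraction == 0:
        # Rule 1: an integer position selects that ranked value.
        return ordered[whole - 1]
    if fraction == 0.5:
        # Rule 2: a fractional half averages the two neighbouring ranked values.
        return (ordered[whole - 1] + ordered[whole]) / 2
    # Rule 3: otherwise round the position to the nearest integer.
    return ordered[int(position + 0.5) - 1]

data = [2, 3, 4, 5, 6, 10]            # median example from the lecture notes
q1, q2, q3 = (quartile(data, q) for q in (1, 2, 3))
print("Q1:", q1, "Q2:", q2, "Q3:", q3, "IQR:", q3 - q1)
```

Here the positions are 1.75, 3.5 and 5.25, so Q1 and Q3 use the rounding rule while Q2 uses the fractional-half rule and matches the median found earlier.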
Range
• Equal to the largest value minus the smallest value
• Measures only the total spread of the data
• It is based only on the two extreme values and ignores all others.
• It does not take into account how the data is distributed and does not indicate whether the values are
evenly distributed
• Range is distorted by very high or very low values
Interquartile Range
• Difference between the third and first quartiles in a data set
• More meaningful measure of variation than the range because it ignores extreme values by finding the
range of the middle 50% of the ordered array of data
Summary measures such as the median, Q1 and Q3 and the IQR which are not influenced by extreme values
are called resistant measures.
Variance and Standard Deviation
• Measures the average scatter around the mean
• Sum of squares: squaring the deviations from the mean before summing.
• Because the sum of squares is a sum of squared differences, it can never be negative, so neither the
variance nor the standard deviation can ever be negative. For a data set, the variance and the standard
deviation will usually be positive and will only be zero if there is no variation ie. all values are equal
Sample Variance – Definition Formula
• Sum of the squared deviations from the sample mean divided by the sample size minus one.
Sample Standard Deviation – Definition Formula
• The square root of the sample variance
Sample Variance – Calculation Formula
• The sum of the squared deviations from the mean divided by the sample size minus one.
The following summarises the characteristics of the range, IQR, variance and standard deviation
• The more spread out or dispersed, the data, the larger the range, IQR, variance and standard deviation
• The more concentrated or homogeneous, the data, the smaller the range, IQR, variance and standard
deviation
• If the values are all the same (ie. no variation), the range, IQR, variance and standard deviation will all
equal zero
• None of the measures of variation can ever be negative.
Coefficient of Variation
• A relative measure of variation, expressed as a percentage rather than in terms of the units of particular
data
• Measures the scatter in the data relative to the mean
• Useful when comparing two or more sets of data that are measured by different units or when the scale
of the data sets is substantially different
• Is equal to the standard deviation divided by the mean, multiplied by 100%
Z Scores
• Measures of relative standing that take into consideration both the mean and the standard deviation
• Represents the distance between a given observation and the mean expressed in standard deviations
• An extreme value or outlier will have a large z score, either positive or negative – therefore useful in
identifying these values
Shape
• A distribution is symmetrical if the lower and upper halves of the graph are mirror images of each other
• If the distribution is not symmetrical, it may be skewed.
o Skewed to the right or positively skewed if there is a long tail to the right indicating that there are
relatively few large data values and more smaller values
o Skewed to the left or negatively skewed if there is a long tail to the left indicating that there are
relatively few small data values and more larger data values
• For most continuous distributions it can be said:
o Mean < median, the distribution is likely to be negative or left skewed
o Mean = median, the distribution is symmetrical
o Mean > median, the distribution is likely to be positive or right skewed
Box-and-whisker Plots
• Provides a graphical representation of the data based on the five-number summary
• Shows the range, IQR and quartiles
• The distribution of a box-and-whisker plot is mirrored in its graphical shape
Covariance
• Measure of the strength and direction of the linear relationship between two numerical variables
• Positive value indicates a positive linear relationship between the two variables and a negative value
indicates a negative relationship
• A value of zero indicates that there is no linear relationship between the variables
The Sample Covariance
$\mathrm{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$
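The Sample Covariance – Calculation Formula
• Because the covariance can have any value, it is difficult to use it as a measure of the relative strength of a linear relationship. Dividing it by the sample standard deviations of X and Y gives the coefficient of correlation:
$r = \frac{\mathrm{cov}(X, Y)}{S_X S_Y}$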
Coefficient of Correlation
• Measures the relative strength of a linear relationship between two variables
• Values of the coefficient of correlation range from -1 for a perfect negative linear correlation to +1 for a
perfect positive linear correlation.
• Perfect means that if the points are plotted in a scatter diagram, all the points will lie on a straight line.
• Sample coefficient of correlation, r, can be calculated
The sample coefficient of correlation
• The sample covariance divided by the sample standard deviations of X and Y
LECTURE 4a: SIMPLE LINEAR REGRESSION
LECTURE NOTES
Bivariate Data
• When the two variables are both numerical – what can be done?
- graph using a scatter plot
- covariance (determines the direction of the relationship, if any) and correlation (determines the
strength of the relationship, if any)
- Regression: explain and measure the relationship between two numerical variables ie. a small
increase in x = a large increase in y
- In graphs there may be some correlation but the values may be different in terms of their slope.
Regression – fit a line to the data to show the relationship between x and y
Regression
• Explains and measures the relationship between two variables – aids prediction
• Gives an equation of a straight line to represent the relationship between x and y
Simple Linear Regression
• Determines a relationship between dependent variable (y) and independent variable (x)
• Relationship is causal – changes in y are assumed to be caused by changes in x
• Note: for a straight line, there can only be one x (independent variable)
Straight Lines
• Y = mx + c
Where c = the intercept (y when x = 0)
m = slope/gradient
To determine the line of best fit:
• The best fit line will be the one that minimises the sum of squared errors
• e = errors
Simple Linear Regression Equation
• provides an estimate of the population regression line
• $\hat{y} = b_0 + b_1 x_1$
where $\hat{y}$ = estimated y value
$b_0$ = intercept (c)
$b_1$ = slope (m)
$x_1$ = x value
• note: use lower case for sample, use upper case for population
• In excel – regression feature is part of the data analysis tool pack – regression option
• r (as reported in the regression output) = absolute value of correlation – only tells magnitude not the direction
Ie. r = 0.9, but is it positive or negative? The sign of r matches the sign of the slope:
When m = +, then r = +
When m = -, then r = -
• Intercept = value for c
• Together these allow you to find a line of best fit
Interpreting the coefficients (m and c)
• Adds to an analysis of a situation
• Don’t attempt to interpret data outside the range of the data given
• Allows you to make a prediction with a certain x value
Note
1. intercept coefficient may not be interpretable
2. Slope interpretation is only valid within the data range.
TEXT BOOK NOTES: Chapter 12: Type of Regression Models
Definitions
Simple linear regression: a single numerical independent variable X is used to predict the numerical dependent
variable Y
Y intercept: represents the mean value of Y when X = 0
Relevant range: the range of observed values of the independent variable; only consider this range when making predictions
Regression analysis allows you to develop a model to predict the values of a numerical variable based on the
values of one or more other variables
The Least-Squares Method
• The most common approach to find $b_0$ and $b_1$ is the method of least squares. This method minimises the
sum of the squared differences between the actual values (Y) and the predicted values using the simple
linear regression equation
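A minimal sketch of the least-squares calculation is given below. It uses the standard closed-form results for a single independent variable (slope = sum of cross-deviations divided by sum of squared x deviations, intercept = mean of y minus slope times mean of x), which these notes do not spell out. The temperature and sales figures are hypothetical, and the `predict` helper with its range check is an illustrative way of respecting the relevant range.

```python
def least_squares(xs, ys):
    """Return (b0, b1) minimising the sum of squared errors for y = b0 + b1*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    ss_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    b1 = ss_xy / ss_x              # slope (m)
    b0 = mean_y - b1 * mean_x      # intercept (c)
    return b0, b1

def predict(x, b0, b1, xs):
    """Predict y, refusing values of x outside the relevant range."""
    if not min(xs) <= x <= max(xs):
        raise ValueError("x is outside the range of the observed data")
    return b0 + b1 * x

# Hypothetical data: temperature (x) and ice cream sales (y).
temperature = [20, 25, 30, 35]
sales = [110, 150, 180, 240]

b0, b1 = least_squares(temperature, sales)
print(f"y-hat = {b0:.1f} + {b1:.1f}x")
print("predicted sales at 28 degrees:", round(predict(28, b0, b1, temperature), 1))
```

The intercept here is negative, which illustrates the note that the intercept coefficient may not be interpretable: zero degrees lies well outside the observed range.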
LECTURE 4b: SIMPLE LINEAR REGRESSION
LECTURE NOTES
Probability is the language of uncertainty, which allows us to make inferences or decisions based upon an
assessment of the likelihood of different outcomes
Terminology
• Experiment: process through which an observation or measurement is obtained. Implies that the outcome
of each experiment is unknown until it has occurred
• Sample space: list of all possible outcomes – outcome is unknown until experiment is completed
• Simple event: any of the individual outcomes associated with the random experiment
• Event: collection of simple events
Example 1
Experiment = toss a coin
S = {H, T} = 2 outcomes
x       H     T
P (x)   1/2   1/2
∑ P (x) = 1
Example 2
Experiment = roll a die, eg. events
A = {1, 2, 3}
B = {2, 4, 6}
C = {5, 6}
P (A) and P (B)
0 ≤ P (A) ≤ 1
0 = 0%/impossible, 1 = 100%/certain
Interpreting Probabilities
• Chance that an outcome is achieved in a particular experiment
• Notation = P (A)
• Probability ranges between 0 (0%) and 1 (100%)
• Sum of probability of all simple events must equal 1
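The two rules above (each probability lies between 0 and 1, and the simple events sum to 1) can be checked with a short sketch. It assumes a fair coin and a fair die, so every simple event is equally likely; the helper names equally_likely and prob are illustrative.

```python
from fractions import Fraction

def equally_likely(sample_space):
    """Give each simple event the same probability, 1 / (number of outcomes)."""
    p = Fraction(1, len(sample_space))
    return {outcome: p for outcome in sample_space}

def prob(event, distribution):
    """P(event) = sum of the probabilities of its simple events."""
    return sum(distribution[outcome] for outcome in event)

coin = equally_likely(["H", "T"])
die = equally_likely([1, 2, 3, 4, 5, 6])

print(sum(coin.values()))      # 1
print(sum(die.values()))       # 1
print(prob({1, 2, 3}, die))    # 1/2, event A from Example 2
print(prob({5, 6}, die))       # 1/3, event C from Example 2
```

Fractions are used instead of decimals so that the sums come out to exactly 1.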
Assessing Probabilities
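A priori classical probability
• Based on prior knowledge eg. rolling a die (1/6)
Empirical classical probability
• Based on observed data or experimentation to record possibilities eg. P (A) = N_A / N,
where N_A = number of observations in which A occurred and N = total number of observations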
Subjective probability
• Based on individual judgement about probability of occurrence
Calculating and Combining Probabilities
• Probabilities to be calculated include: marginal, joint, conditional, complement, either/or
• Contingency table can be highly useful
Marginal Probability
• Probability of a single outcome
• Eg. P (coke) = 120/200 = 0.6
Complement Probability
• Event A not occurring is written Ā or A’, with probability P (Ā) or P (A’)
• Complement is the list of outcomes associated with that event not occurring
• P (A’) = 1 – P (A)
As P of the whole sample space = 1
Therefore, P (A) + P (A’) = 1
Therefore, P (A’) = 1 – P (A)
Joint Probabilities (intersection)
• Intersection of two events is the event that both A and B occur
• Denoted as A ∩ B
• Probability of joint events: P (A ∩ B), where ∩ = and
Mutually exclusive events
• Events cannot occur together ie. P (A ∩ B) = 0
• Eg. male and female are mutually exclusive
Collectively exhaustive
• One of the events in the set must occur
• Set of events covers the entire space eg. male or female
Either/or (The Union)
• Union of events A and B is the event that A occurs, or B occurs, or both occur
• P (A ∪ B) where ∪ = union (everything in A or B)
• Need to use the addition rule:
P (A ∪ B) = P (A) + P (B) - P (A ∩ B)
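A small sketch of the complement, intersection and union rules, using the die events A, B and C from Example 2 and assuming a fair six-sided die. The helper p is an illustrative name, not part of the lecture.

```python
from fractions import Fraction

SAMPLE_SPACE = frozenset({1, 2, 3, 4, 5, 6})

def p(event):
    """Probability of an event on a fair die: outcomes in event / total outcomes."""
    return Fraction(len(event & SAMPLE_SPACE), len(SAMPLE_SPACE))

A = frozenset({1, 2, 3})
B = frozenset({2, 4, 6})
C = frozenset({5, 6})

# Complement: everything in the sample space that is not in A
print(p(SAMPLE_SPACE - A), 1 - p(A))        # 1/2 1/2

# Intersection: A and B share only the outcome 2
print(p(A & B))                             # 1/6

# Addition rule: subtract the overlap so it is not counted twice
print(p(A | B), p(A) + p(B) - p(A & B))     # 5/6 5/6

# A and C cannot occur together, so they are mutually exclusive
print(p(A & C))                             # 0

# A, B and C together cover the whole sample space (collectively exhaustive)
print((A | B | C) == SAMPLE_SPACE)          # True
```

Without subtracting P (A ∩ B), the outcome 2 would be counted once in P (A) and again in P (B).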
TEXT BOOK NOTES: Chapter 4 Basic Probability
Definitions
Probability: numerical value that represents the chance, likelihood or possibility that a particular event will
occur
Impossible event: An event that has no chance of occurring has a probability of 0
Certain event: An event that is certain to occur has a probability of 1
Venn diagram: another way to present a sample space – graphically represents the various events as unions and
intersections of circles
Simple probability: refers to the probability of occurrence of a simple event, P(A).
Joint probability: refers to the probability of an occurrence involving two or more events
Conditional probability: refers to the probability of event A given information about the occurrence of another
event, B
Decision tree: alternative to a contingency table or a Venn diagram
Statistical independence: When the outcome of one event does not affect the probability of occurrence of
another event
Basic Probability Concepts
• Probability is a numerical value that represents the chance, likelihood or possibility that a particular
event will occur
• An event that has no chance of occurring has a probability of 0
• An event that is certain to occur has a probability of 1
• There are three approaches to assigning a probability to an event:
- a priori classical probability
- empirical classical probability
- subjective probability
A priori classical probability
• The probability of success is based on prior knowledge of the process involved.
• Each outcome is equally likely and the chance of occurrence is given by:
Probability of occurrence = X / T
Where X = number of ways in which the event occurs
And T = total number of possible outcomes
Empirical classical probability
• Outcomes are based on observed data, not prior knowledge of a process
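The difference between the a priori and empirical approaches can be sketched as follows. The a priori value uses X / T directly. The empirical value is estimated from simulated die rolls, which stand in for observed data here (an assumption made only for illustration); the function names and the seed are hypothetical.

```python
import random
from fractions import Fraction

def a_priori(event, sample_space):
    """X / T: ways the event can occur over the total number of possible outcomes."""
    return Fraction(len(event), len(sample_space))

def empirical(event, sample_space, trials, seed=0):
    """Relative frequency of the event over a number of simulated trials."""
    rng = random.Random(seed)      # fixed seed so the run is repeatable
    outcomes = list(sample_space)
    hits = sum(1 for _ in range(trials) if rng.choice(outcomes) in event)
    return hits / trials

die = [1, 2, 3, 4, 5, 6]
six = {6}

print(a_priori(six, die))             # 1/6
print(empirical(six, die, 60000))     # close to 0.1667, but not exact
```

The empirical figure moves closer to the a priori figure as the number of trials grows, but there is always some sampling error.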
Subjective probability
• Differs from the other two approaches because a subjective probability differs from person to person
• Useful in making decisions in situations where the above cannot be used
Events and Sample Spaces
• Each possible outcome of a variable is referred to as an event
• A simple event is described by a single characteristic
• A joint event is an event that has two or more characteristics
• The complement of event A (written A’) includes all simple events that are not included in the event A
• The collection of all the possible simple events is called the sample space
Contingency Tables and Venn Diagrams
• Contingency table = table of cross-classification
• Venn diagram is another way to present a sample space – graphically represents the various events as
unions and intersections of circles
• The area contained within both circle A and circle B (centre area) is the intersection of A and B (A ∩ B), since
it contains all the outcomes that are in both event A and B.
• The total area of the two circles is the union of A and B (A ∪ B) and contains all outcomes in event A
and/or B
• The area in the diagram outside A ∪ B contains outcomes that are neither in event A nor event B
Simple (Marginal) Probability
• Refers to the probability of occurrence of a simple event, P(A).
• Example:
P (Planned to purchase) = number who planned to purchase / total number of households = 250/1000 = 0.25
Thus, there is a 25% likelihood that a household planned to make the purchase
• Also called marginal probability, because you obtain the total number of successes from the appropriate
margin of the contingency table
Joint Probability
• Refers to the probability of an occurrence involving two or more events eg. probability of getting a head
on the first toss of a coin and a head on the second toss
• Example:
P (planned to purchase and actually purchased) = number who planned to purchase and actually purchased / total number of respondents
= 200/1000 = 0.2
• Marginal probability of an event
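The simple and joint probability examples above can be reproduced with a short sketch. Only the three counts quoted in the notes (1000 households, 250 who planned to purchase, 200 who planned and actually purchased) are used; the variable and function names are illustrative.

```python
TOTAL_HOUSEHOLDS = 1000
PLANNED = 250                  # margin total: planned to purchase
PLANNED_AND_PURCHASED = 200    # cell count: planned and actually purchased

def probability(count, total):
    """Proportion of the total that falls in the given count."""
    return count / total

# Simple (marginal) probability uses a margin total of the contingency table
p_planned = probability(PLANNED, TOTAL_HOUSEHOLDS)

# Joint probability uses a single cell of the table
p_planned_and_purchased = probability(PLANNED_AND_PURCHASED, TOTAL_HOUSEHOLDS)

# The remaining cell in the "planned" row follows by subtraction
planned_not_purchased = PLANNED - PLANNED_AND_PURCHASED
p_planned_not_purchased = probability(planned_not_purchased, TOTAL_HOUSEHOLDS)

print(p_planned)                  # 0.25
print(p_planned_and_purchased)    # 0.2
print(p_planned_not_purchased)    # 0.05
```

The two joint probabilities in the "planned" row add up to the marginal probability for that row (0.2 + 0.05 = 0.25).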
