University of Waterloo, Statistics, STAT 231 (Conrad, Spring): Midterm

Chapter 1: INTRODUCTION
• Empirical study: one in which we learn by observation or experiment
o Deals with collections of individual units – populations and processes
o To know more about a process, examine a sample of units generated by the process
• Statistical sciences deal with the study of variability in processes/populations, and with good ways to
collect/analyze data about such processes
• Population: collection of units; populations are static (defined at one moment in time)
• Process: mechanism by which units are produced; processes usually occur over time
• Variates: characteristics of the units, e.g. measured quantities (blood pressure), discrete quantities (number of
damaged pixels in a monitor), categorical quantities (color), complex quantities (image)
• Attributes: functions of the variates over the whole population; these are what we are interested in
• Data: values of the variates for a sample of units in population/process
o Can be provided by existing source
o Sample surveys: select a representative sample of units from population & determine variates
o Observational studies: differ from sample surveys in that the population of interest is
infinite/conceptual
o Experiments: experimenter intervenes and changes or sets the values of variates
• 2 classes of summaries: graphical and numerical
Numerical Summaries
• Definition 1 – sample percentiles/quantiles: the pth quantile (AKA the 100pth percentile) is a value q(p)
such that a fraction p of the y values in the data set are less than or equal to q(p)
• q(0.5), q(0.25), and q(0.75) are called the median, lower quartile, and upper quartile respectively.
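1) Measures of location
Average (mean): $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$
Expectation, variance, and covariance rules (a, b, c, d are constants):
$E(aX) = aE(X)$, $E(X + a) = E(X) + a$, $E(aX + b) = aE(X) + b$
$\mathrm{Var}(aX + b) = a^2\,\mathrm{Var}(X)$
$\mathrm{Cov}(X, Y) = E(XY) - E(X)E(Y)$; if X and Y are independent, $\mathrm{Cov}(X, Y) = 0$ (but not vice versa)
$\mathrm{Corr}(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}$, $\mathrm{Cov}(aX + b, cY + d) = ac\,\mathrm{Cov}(X, Y)$, $\mathrm{Cov}(X, X) = \mathrm{Var}(X)$
Median ($\hat{m}$): the middle value when n is odd; the average of the 2 middle values when n is even. It is not as
affected by extreme values as the mean.
Mode: value of y which appears in the sample most frequently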
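The linear-transformation rules for E, Var, and Cov can be checked numerically. The sketch below treats a small list of paired values as if it were the whole distribution (each pair equally likely); the data and the constants a, b, c, d are made up for illustration.

```python
from statistics import fmean

def var(xs):
    """Var(X) = E[(X - E(X))^2], treating the list as the whole distribution."""
    m = fmean(xs)
    return fmean([(x - m) ** 2 for x in xs])

def cov(xs, ys):
    """Cov(X, Y) = E(XY) - E(X)E(Y) for equally likely pairs (x_i, y_i)."""
    return fmean([x * y for x, y in zip(xs, ys)]) - fmean(xs) * fmean(ys)

# Hypothetical paired data and constants, for illustration only.
x = [1.0, 2.0, 4.0, 7.0]
y = [3.0, 1.0, 5.0, 6.0]
a, b, c, d = 2.0, -1.0, 3.0, 4.0

ax_b = [a * xi + b for xi in x]   # values of aX + b
cy_d = [c * yi + d for yi in y]   # values of cY + d

lhs_mean, rhs_mean = fmean(ax_b), a * fmean(x) + b          # E(aX + b) = aE(X) + b
lhs_var, rhs_var = var(ax_b), a ** 2 * var(x)               # Var(aX + b) = a^2 Var(X)
lhs_cov, rhs_cov = cov(ax_b, cy_d), a * c * cov(x, y)       # Cov(aX+b, cY+d) = ac Cov(X, Y)

print(lhs_mean, rhs_mean)
print(lhs_var, rhs_var)
print(lhs_cov, rhs_cov)
```

Each printed pair should agree up to floating-point rounding; the shifts b and d drop out of the variance and covariance.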
2) Measures of scale or dispersion
Sample variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2$
Sample standard deviation: $s = \sqrt{s^2}$
Range: $\max(y_i) - \min(y_i)$
Interquartile range IQR: q(0.75) – q(0.25)
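As a rough illustration, the location and scale summaries above can be computed with Python's standard `statistics` module. The sample is hypothetical. Quantile interpolation conventions differ between textbooks and software, so the quartiles (and hence the IQR) from `statistics.quantiles` with `method="inclusive"` are one reasonable choice, not necessarily the course's convention.

```python
import statistics

# Hypothetical sample, for illustration only.
y = [2, 4, 4, 5, 7, 9, 11]

mean = statistics.fmean(y)          # (1/n) * sum of y_i
median = statistics.median(y)       # middle value (n is odd here)
mode = statistics.mode(y)           # most frequent value
s2 = statistics.variance(y)         # sample variance, divides by n - 1
s = statistics.stdev(y)             # square root of the sample variance
data_range = max(y) - min(y)

# Quartiles under the "inclusive" interpolation rule (an assumed convention).
q1, q2, q3 = statistics.quantiles(y, n=4, method="inclusive")
iqr = q3 - q1

print(mean, median, mode, s2, s, data_range, iqr)
```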
3) Measures of Shape
Skewness: measure of the lack of symmetry in a distribution. Positive values mean the right tail is longer in the
histogram, and hence the data are right skewed; skewness = 0 means a symmetric curve.
Kurtosis: extent to which data is prone to very large/small observations. Since $(y_i - \bar{y})^4 \ge 0$,
kurtosis is never negative. Values > 3 indicate heavier tails (& a more peaked centre) than the Normal
distribution. Small kurtosis is like a flat hill. Also, kurtosis is usually calculated for symmetric data.
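The notes do not show the formulas at this point, so the sketch below assumes the common moment-based definitions: skewness is the third central moment divided by the second central moment to the power 3/2, and kurtosis is the fourth central moment divided by the squared second central moment, with all moments using a 1/n divisor. The function names and data are illustrative.

```python
from statistics import fmean

def central_moment(y, k):
    """(1/n) * sum of (y_i - ybar)^k."""
    ybar = fmean(y)
    return fmean([(yi - ybar) ** k for yi in y])

def skewness(y):
    # Assumed definition: m3 / m2^(3/2)
    return central_moment(y, 3) / central_moment(y, 2) ** 1.5

def kurtosis(y):
    # Assumed definition: m4 / m2^2 (about 3 for Normal data)
    return central_moment(y, 4) / central_moment(y, 2) ** 2

symmetric = [1, 2, 3, 4, 5]           # hypothetical symmetric sample
right_skewed = [1, 1, 2, 2, 3, 10]    # hypothetical sample with a long right tail

print(skewness(symmetric), kurtosis(symmetric))
print(skewness(right_skewed), kurtosis(right_skewed))
```

The symmetric sample gives skewness 0, while the sample with one large value gives positive skewness, matching the description above.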
Graphical Summaries
• Histogram: graph in which a rectangle is placed about each interval. The height of the rectangle is chosen
so that the area is proportional to the frequency of that interval.
1) Standard histogram: intervals are of equal length. Height of the rectangle is the frequency.
2) Relative frequency histogram: intervals may not be of equal length. Height of rectangle for that
interval is chosen so the area = frequency/n (relative frequency for interval). Density marks the
vertical axis, and so the sum of the areas of rectangles in the histogram equals 1.
• Empirical CDF (ECDF): first, order the values of y from least to greatest. Then form a step function that
rises by 1/n at each observed value.
o Plot of ECDF does not show the shape of the distribution as clearly as the histogram
o The proportion of y-values in any given interval (a, b] is F(b) – F(a)
o Allows us to determine the pth quantile, or 100pth percentile
• Boxplots (box and whisker plots): good when the number of groups is large, or sample sizes within groups
are small; a good way to display data side by side
o The box spans the IQR from q(0.25) to q(0.75), with a line at the median; a small IQR means lower variability
o The line on the left (or bottom) is placed at the smallest observed data value that is larger than the
value q(0.25) – (1.5*IQR)
o The line on the right/top is placed at the largest data value smaller than q(0.75) + (1.5*IQR)
• Scatterplot: good at showing data/relationship on 2 or more variates for each unit in the sample
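The quantities behind these plots can be computed without any plotting library. The sketch below builds an ECDF, the boxplot whisker positions, and relative frequency histogram heights. The helper names, the data, the bin edges, and the "inclusive" quartile convention are all assumptions made for illustration.

```python
from bisect import bisect_right
from statistics import quantiles

def ecdf(y):
    """Return the step function F_hat, where F_hat(t) = fraction of values <= t."""
    ys = sorted(y)
    n = len(ys)
    return lambda t: bisect_right(ys, t) / n

def whiskers(y):
    """Whisker ends: the most extreme data values strictly inside the 1.5*IQR fences."""
    q1, _, q3 = quantiles(y, n=4, method="inclusive")  # assumed quartile convention
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    lower = min(v for v in y if v > lo_fence)
    upper = max(v for v in y if v < hi_fence)
    return lower, upper

def density_heights(y, edges):
    """Rectangle heights for bins (edges[k], edges[k+1]] so that area = relative frequency."""
    F = ecdf(y)
    return [(F(right) - F(left)) / (right - left)
            for left, right in zip(edges, edges[1:])]

# Hypothetical sample with one unusually large value.
y = [2, 4, 4, 5, 7, 9, 11, 30]
edges = [0, 5, 10, 30]  # unequal bin widths, chosen for illustration

F = ecdf(y)
prop_2_to_5 = F(5) - F(2)          # proportion of values in (2, 5]
low, high = whiskers(y)
heights = density_heights(y, edges)
total_area = sum(h * (r - l) for h, l, r in zip(heights, edges, edges[1:]))

print(prop_2_to_5, (low, high), heights, total_area)
```

Here the value 30 lies beyond the upper fence, so the upper whisker stops at 11, and the rectangle areas add to 1 even though the bins have unequal widths.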
SUM tricks
• $\sum_{i=1}^{n} i = \frac{n(n+1)}{2}$
• $\sum_{i=1}^{n} i^2 = \frac{n(n+1)(2n+1)}{6}$
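A quick brute-force check of the two closed forms, as a sanity test rather than a proof. The function names are illustrative.

```python
def sum_first_n(n):
    """Closed form for 1 + 2 + ... + n."""
    return n * (n + 1) // 2

def sum_squares_first_n(n):
    """Closed form for 1^2 + 2^2 + ... + n^2."""
    return n * (n + 1) * (2 * n + 1) // 6

# Compare each closed form with the direct sum for n = 0, ..., 199.
all_match = all(
    sum_first_n(n) == sum(range(1, n + 1))
    and sum_squares_first_n(n) == sum(i * i for i in range(1, n + 1))
    for n in range(200)
)
print(all_match)
```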
Probability Distributions and Statistical Models
• Model depends on distribution of variate values in population (histogram) and selection procedure
• The average $\bar{y}$ corresponds to $\mu$, the expected value of Y
• The sample median $\hat{m}$ corresponds to the population median m.
o For continuous distributions, m is the solution of F(m) = 0.5, where $F(y) = P(Y \le y)$ is the cdf of Y
o For discrete distributions, m is a point chosen such that $P(Y \le m) \ge \frac{1}{2}$ and $P(Y \ge m) \ge \frac{1}{2}$
• The sample standard deviation s corresponds to $\sigma$, the s.d. of Y, where $\sigma^2 = E[(Y - \mu)^2]$
• The histogram (with y-axis on the density scale) corresponds to the probability density function of Y
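One way to see this correspondence is to simulate a large sample from a known model and compare the sample summaries with the model quantities. The sketch below assumes an Exponential model with mean θ = 2, for which μ = θ, σ = θ, and the median solves F(m) = 0.5, giving m = θ ln 2. The seed, sample size, and θ are arbitrary illustrative choices.

```python
import math
import random
import statistics

random.seed(231)      # fixed seed so the run is reproducible
theta = 2.0           # assumed model: Exponential with mean theta
n = 50_000

# random.expovariate takes the rate 1/theta, not the mean.
y = [random.expovariate(1 / theta) for _ in range(n)]

sample_mean = statistics.fmean(y)
sample_median = statistics.median(y)
sample_sd = statistics.stdev(y)

mu = theta                    # E(Y)
m = theta * math.log(2)       # solves 1 - exp(-m/theta) = 0.5
sigma = theta                 # sd of Y

print(sample_mean, mu)
print(sample_median, m)
print(sample_sd, sigma)
```

With a sample this large, each sample summary should land close to its model counterpart, though never exactly on it.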
Data Analysis and Statistical Inference
• 2 broad aspects of the analysis and interpretation of data
1) Descriptive statistics: portrayal of data in numerical/graphical ways to show features of interest
2) Statistical inference: use the data obtained in the study of a process/population to draw general
conclusions about the process/population; a form of inductive inference (specific to general)
• 3 main types of problems
1) Estimation problems: estimate one or more attributes of a process/population
2) Prediction problems: use the data to predict a future value for a process variate or a unit to be selected
from the population
3) Hypothesis testing problems: use the data to assess the truth of some question/hypothesis
Chapter 2: MODEL FITTING, MAXIMUM LIKELIHOOD ESTIMATION, MODEL CHECKING
• Statistical model: mathematical model that incorporates probability in some way
• Be clear about what the target population/process is, and how the variables being considered are defined
and measured; the response variate is caused/explained by the explanatory variate
• choice of a model is driven by combination of these 3 factors:
1) background knowledge/assumptions about population/process which lead to certain distributions
2) past experience with data from population/process, which showed that certain dist are suitable
3) current data set, against which models can be assessed
• Suppose that for some discrete r.v. Y, we consider a family whose probability function depends on the
parameter $\theta$: $P(Y = y; \theta) = f(y; \theta)$ for $y \in A$, where A is a countable/discrete set of real numbers,
the range of Y. To apply the model to a specific problem, we need a value for $\theta$; choosing it is called fitting
the model, or estimating the value of θ.
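• Binomial Distribution: $Y \sim \mathrm{Binomial}(n, \theta)$; $0 < \theta < 1$; $E(Y) = n\theta$; $\mathrm{Var}(Y) = n\theta(1 - \theta)$
• Poisson Distribution: $Y \sim \mathrm{Poisson}(\theta)$; $\theta > 0$; $E(Y) = \theta = \mathrm{Var}(Y)$
• Exponential Distribution: $Y \sim \mathrm{Exponential}(\theta)$; $\theta > 0$; $E(Y) = \theta$; $\mathrm{Var}(Y) = \theta^2$
The CDF is $F(y) = 1 - e^{-y/\theta}$ for $y \ge 0$
• Gaussian Distribution (Normal): $Y \sim G(\mu, \sigma)$ or $Y \sim N(\mu, \sigma^2)$; we use Normal
o $\mu$ and $\sigma^2$ are parameters, with $-\infty < \mu < \infty$ and $\sigma > 0$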
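The moments of the discrete families can be checked directly from their probability functions. The sketch below uses the standard Binomial and Poisson probability functions and the Exponential CDF with mean θ. The parameter values are arbitrary, and the Poisson sum is truncated at y = 99, which leaves a negligible tail for θ = 4.

```python
import math

def binomial_pmf(y, n, theta):
    """P(Y = y) for Y ~ Binomial(n, theta)."""
    return math.comb(n, y) * theta ** y * (1 - theta) ** (n - y)

def poisson_pmf(y, theta):
    """P(Y = y) for Y ~ Poisson(theta)."""
    return math.exp(-theta) * theta ** y / math.factorial(y)

def exponential_cdf(y, theta):
    """F(y) = 1 - exp(-y/theta) for y >= 0, where theta is the mean."""
    return 1 - math.exp(-y / theta) if y >= 0 else 0.0

def mean_var(pmf, support):
    """E(Y) and Var(Y) computed by summing over the given support."""
    mean = sum(y * pmf(y) for y in support)
    var = sum((y - mean) ** 2 * pmf(y) for y in support)
    return mean, var

# Illustrative parameter values only.
bin_mean, bin_var = mean_var(lambda y: binomial_pmf(y, 10, 0.3), range(11))
poi_mean, poi_var = mean_var(lambda y: poisson_pmf(y, 4.0), range(100))
half = exponential_cdf(2.0 * math.log(2), 2.0)   # CDF at the median theta*ln(2)

print(bin_mean, bin_var)   # compare with n*theta and n*theta*(1 - theta)
print(poi_mean, poi_var)   # compare with theta and theta
print(half)
```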
