


STAT 231

Chapter 1: INTRODUCTION
• Empirical study: one in which we learn by observation or experiment
  o Deals with collections of individual units: populations and processes
  o To learn more about a process, examine a sample of units generated by the process
• Statistical sciences deal with the study of variability in processes/populations, and with good ways to collect and analyze data about them
• Population: collection of units → populations are static (defined at one moment in time)
• Process: mechanism by which units are produced → usually occurs over time
• Variates: characteristics of the units → measured quantities (blood pressure), discrete quantities (number of damaged pixels in a monitor), categorical quantities (color), complex quantities (image)
• Attributes: functions of the variates over the whole population → what we are interested in
• Data: values of the variates for a sample of units in the population/process
  o Can be provided by an existing source
  o Sample surveys: select a representative sample of units from the population and determine their variates
  o Observational studies: differ from sample surveys in that the population of interest is infinite/conceptual
  o Experiments: the experimenter intervenes and changes or sets the values of variates
• 2 classes of summaries: graphical and numerical

Numerical Summaries
• Definition 1, sample percentiles/quantiles: the pth quantile (AKA the 100pth percentile) is a value q(p) such that a fraction p of the y values in the data set are less than or equal to q(p)
• q(0.5), q(0.25), and q(0.75) are called the median, lower quartile, and upper quartile respectively.
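Definition 1 leaves open which value to pick when several qualify. The sketch below assumes one simple convention: q(p) is the smallest observed value such that at least a fraction p of the data are less than or equal to it. The function name `sample_quantile` and the data are hypothetical. Statistical software often interpolates between observations instead, and for even n this convention does not average the two middle values the way the median defined in these notes does.

```python
import math

def sample_quantile(ys, p):
    """Smallest observed value q with at least a fraction p of ys <= q.

    This is one assumed convention, not the only valid one.
    """
    if not ys:
        raise ValueError("need at least one observation")
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    ordered = sorted(ys)
    # Number of observations that must be <= q(p).
    k = math.ceil(p * len(ordered))
    return ordered[k - 1]

# Hypothetical sample with n = 9 (odd, so q(0.5) is the middle value).
data = [3, 1, 4, 1, 5, 9, 2, 6, 5]
median = sample_quantile(data, 0.5)
lower_quartile = sample_quantile(data, 0.25)
upper_quartile = sample_quantile(data, 0.75)
iqr = upper_quartile - lower_quartile
```

With an odd sample size the value returned for p = 0.5 is the middle observation, which matches the median as defined in these notes.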
1) Measures of location
  o Average (mean): $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$
  o $E(aX) = aE(X)$
  o $E(X + a) = E(X) + a$
  o $E(aX + b) = aE(X) + b$
  o $\mathrm{Var}(aX + b) = a^2\,\mathrm{Var}(X)$
  o $\mathrm{Cov}(X, Y) = E(XY) - E(X)E(Y)$
  o If X and Y are independent, $\mathrm{Cov}(X, Y) = 0$ (but not vice versa)
  o $\mathrm{Corr}(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}$
  o $\mathrm{Cov}(aX + b, cY + d) = ac\,\mathrm{Cov}(X, Y)$
  o $\mathrm{Cov}(X, X) = \mathrm{Var}(X)$
  o Median ($\hat{m}$): the middle value when n is odd; the average of the 2 middle values when n is even → not as affected by extreme values as the mean
  o Mode: value of y which appears in the sample most frequently
2) Measures of scale or dispersion
  o Sample variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2$
  o Sample standard deviation: $s = \sqrt{s^2}$
  o Range: $\max(y_i) - \min(y_i)$
  o Interquartile range (IQR): q(0.75) - q(0.25)
3) Measures of shape
  o Skewness: measure of the lack of symmetry in a distribution. Positive values mean the right tail is longer in the histogram, and hence the data are right skewed → skewness = 0 means a symmetric curve
  o Kurtosis: extent to which the data are prone to very large/small observations. Since $(y_i - \bar{y})^4 \ge 0$, kurtosis is always positive. Values > 3 indicate heavier tails (and a more peaked centre) than the Normal distribution. Small kurtosis is like a flat hill. Usually calculated for symmetric data.

Graphical Summaries
• Histogram: graph in which a rectangle is placed above each interval. The height of the rectangle is chosen so that its area is proportional to the frequency of that interval.
  1) Standard histogram: intervals are of equal length. The height of each rectangle is the frequency.
  2) Relative frequency histogram: intervals may not be of equal length. The height of the rectangle for an interval is chosen so that its area = frequency/n (the relative frequency for the interval). The vertical axis is on the density scale, so the areas of the rectangles sum to 1.
• ECDF (empirical CDF): first, order the values of y from least to greatest. Then form a step function.
  o The plot of the ECDF does not show the shape of the distribution as clearly as the histogram
  o Shows the proportion of y-values in any given interval (a, b]: $\hat{F}(b) - \hat{F}(a)$
  o Allows us to determine the pth quantile, or 100pth percentile
• Boxplots (box and whisker plots): good when the number of groups is large, or sample sizes within groups are small → good way to display data side by side
  o The box spans the IQR, from q(0.25) to q(0.75), with the median marked inside → small IQR = lower variability
  o The line on the left (or bottom) is placed at the smallest observed data value that is larger than q(0.25) - 1.5·IQR
  o The line on the right (or top) is placed at the largest observed data value that is smaller than q(0.75) + 1.5·IQR
• Scatterplot: good at showing the relationship between 2 or more variates measured on each unit in the sample

SUM tricks
• $\sum_{i=1}^{n} i = \frac{n(n+1)}{2}$
• $\sum_{i=1}^{n} i^2 = \frac{n(n+1)(2n+1)}{6}$

Probability Distributions and Statistical Models
• The model depends on the distribution of variate values in the population (histogram) and on the selection procedure
• The sample average $\bar{y}$ corresponds to $\mu$, the expected value of Y
• The sample median $\hat{m}$ corresponds to the population median m
  o For continuous distributions, m is the solution of F(m) = 0.5, where $F(y) = P(Y \le y)$ is the cdf of Y
  o For discrete distributions, m is a point chosen such that $P(Y \le m) \ge \frac{1}{2}$ and $P(Y \ge m) \ge \frac{1}{2}$
• The sample standard deviation s corresponds to $\sigma$, the s.d. of Y, where $\sigma^2 = E[(Y - \mu)^2]$
• The histogram (with the y-axis on the density scale) corresponds to the probability density function of Y

Data Analysis and Statistical Inference
• 2 broad aspects of the analysis and interpretation of data
  1) Descriptive statistics: portrayal of data in numerical/graphical ways to show features of interest
  2) Statistical inference: use the data obtained in the study of a process/population to draw general conclusions about that process/population → a form of inductive inference → specific to general
• 3 main types of problems
  1) Estimation problems: estimate one or more attributes of a process/population
  2) Prediction problems: use the data to predict a future value for a process variate or for a unit to be selected from the population
  3) Hypothesis testing problems: use the data to assess the truth of some question/hypothesis

Chapter 2: MODEL FITTING, MAXIMUM LIKELIHOOD ESTIMATION, MODEL CHECKING
• Statistical model: mathematical model that incorporates probability in some way
• Be clear about what the target population/process is, and how the variables being considered are defined and measured → the response variate is caused/explained by the explanatory variate
• The choice of a model is driven by a combination of these 3 factors:
  1) background knowledge/assumptions about the population/process which lead to certain distributions
  2) past experience with data from the population/process, which showed that certain distributions are suitable
  3) the current data set, against which models can be assessed
• Suppose that for some discrete r.v. Y, we consider a family of distributions whose probability function depends on the parameter θ: $f(y; \theta) = P(Y = y; \theta)$ for $y \in A$, where A is a countable/discrete set of real numbers, the range of Y. To apply the model to a specific problem, we need a value for θ → this is called fitting the model, or estimating the value of θ.
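The text at this point does not say how a value for θ is chosen. As a hedged illustration only, the sketch below takes the maximum likelihood idea from the chapter title and assumes a single Binomial(n, θ) observation y: it evaluates the log-likelihood on a grid and keeps the θ that maximizes it. The function names, the grid search, and the data (y = 12 successes in n = 40 trials) are all hypothetical and are not taken from the notes.

```python
import math

def binomial_log_likelihood(theta, y, n):
    """Log-likelihood of theta for one Binomial(n, theta) observation y.

    The binomial coefficient is dropped because it does not depend on theta.
    """
    return y * math.log(theta) + (n - y) * math.log(1 - theta)

def fit_theta_by_grid(y, n, steps=1000):
    """Return the grid point in (0, 1) with the largest log-likelihood.

    A grid search is used only to keep the illustration simple.
    """
    grid = [k / steps for k in range(1, steps)]
    return max(grid, key=lambda t: binomial_log_likelihood(t, y, n))

# Hypothetical data: 12 successes in 40 trials.
theta_hat = fit_theta_by_grid(y=12, n=40)
```

For the binomial model the likelihood is maximized at the sample proportion y/n, so the grid result can be checked against 12/40 = 0.3.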
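• Binomial Distribution: $Y \sim \mathrm{Binomial}(n, \theta)$; $0 < \theta < 1$
• Poisson Distribution: $Y \sim \mathrm{Poisson}(\theta)$; $\theta > 0$; $E(Y) = \theta = \mathrm{Var}(Y)$
• Exponential Distribution: $Y \sim \mathrm{Exponential}(\theta)$; $E(Y) = \theta$; $\mathrm{Var}(Y) = \theta^2$. The CDF is $F(y) = 1 - e^{-y/\theta}$ for $y > 0$
• Gaussian Distribution (Normal): $Y \sim G(\mu, \sigma)$ or $Y \sim N(\mu, \sigma^2)$ → we use normal
  o $\mu$ and $\sigma^2$ are parameters, with $-\infty < \mu < \infty$ and $\sigma > 0$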