STAT MIDTERM 1 NOTES.docx

13 Pages

Department
Statistics
Course
STAT 231
Professor
Conrad
Semester
Spring

Description
Chapter 1: INTRODUCTION
• Empirical study: one in which we learn by observation or experiment
  o Deals with collections of individual units: populations and processes
  o To learn more about a process, examine a sample of units generated by the process
• Statistical sciences deal with the study of variability in processes/populations, and with good ways to collect and analyze data about such processes
• Population: collection of units → populations are static (defined at one moment in time)
• Process: mechanism by which units are produced → usually occurs over time
• Variates: characteristics of the units → measured quantities (blood pressure), discrete quantities (number of damaged pixels in a monitor), categorical quantities (color), complex quantities (image)
• Attributes: functions of the variates over the whole population → what we are interested in
• Data: values of the variates for a sample of units in the population/process
  o Can be provided by an existing source
  o Sample surveys: select a representative sample of units from the population and determine the variates
  o Observational studies: differ from sample surveys in that the population of interest is infinite/conceptual
  o Experiments: the experimenter intervenes and changes or sets the values of variates
• 2 classes of summaries: graphical and numerical
Numerical Summaries
• Definition 1 – sample percentiles/quantiles: the pth quantile (AKA the 100pth percentile) is a value q(p) such that a fraction p of the y values in the data set are less than or equal to q(p)
• q(0.5), q(0.25), and q(0.75) are called the median, lower quartile, and upper quartile respectively.
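Definition 1 can be sketched in a few lines of Python. This is only one possible convention, assumed here for illustration: q(p) is taken to be the smallest data value with at least a fraction p of the observations less than or equal to it. It does not average the two middle values for an even sample size, so its median can differ from the "average of 2 middle values" rule used for the median in these notes. The names `sample_quantile` and `data` are hypothetical.

```python
import math


def sample_quantile(ys, p):
    """Smallest data value q such that at least a fraction p of ys are <= q.

    One simple convention among several; textbooks and software often
    interpolate between neighbouring order statistics instead.
    """
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    ordered = sorted(ys)
    n = len(ordered)
    # Number of observations that must lie at or below q(p).
    # The tiny tolerance guards against floating-point round-up in p * n.
    k = max(1, math.ceil(p * n - 1e-9))
    return ordered[k - 1]


data = [7, 1, 5, 3, 9, 4, 8, 2, 6, 10]  # made-up sample

median = sample_quantile(data, 0.5)
lower_quartile = sample_quantile(data, 0.25)
upper_quartile = sample_quantile(data, 0.75)

# Check the defining property for the median: at least half of the
# observations are less than or equal to q(0.5).
share_at_or_below = sum(1 for y in data if y <= median) / len(data)
```

With this convention the quartiles are always observed data values, which makes the defining property easy to verify by counting.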
1) Measures of location
  o Average (mean): $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$
  o E(aX) = aE(X)
  o E(X + a) = E(X) + a
  o E(aX + b) = aE(X) + b
  o $\mathrm{Var}(aX + b) = a^2\,\mathrm{Var}(X)$
  o Cov(X, Y) = E(XY) – E(X)E(Y)
  o If X and Y are independent, Cov(X, Y) = 0 (but not vice versa)
  o $\mathrm{Corr}(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}$
  o Cov(aX + b, cY + d) = ac Cov(X, Y)
  o Cov(X, X) = Var(X)
  o Median $\hat{m}$: the middle value when n is odd; the average of the 2 middle values when n is even → not as affected by extreme values as the mean
  o Mode: value of y which appears in the sample most frequently
2) Measures of scale or dispersion
  o Sample variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (y_i - \bar{y})^2$
  o Sample standard deviation: $s = \sqrt{s^2}$
  o Range: max(y_i) – min(y_i)
  o Interquartile range IQR: q(0.75) – q(0.25)
3) Measures of shape
  o Skewness: measure of the lack of symmetry in a distribution. Positive values mean the right tail is larger in the histogram, and hence the data are right skewed → skewness = 0 means a symmetric curve
  o Kurtosis: extent to which data is prone to very large/small observations. Since $(y_i - \bar{y})^4 \ge 0$, kurtosis is never negative. Values > 3 indicate heavier tails (and a more peaked centre) than the Normal distribution. Small kurtosis is like a flat hill. Also, usually calculated for symmetric data.
Graphical Summaries
• Histogram: graph in which a rectangle is placed above each interval. The height of the rectangle is chosen so that the area is proportional to the frequency of that interval.
  1) Standard histogram: intervals are of equal length. Height of the rectangle is the frequency.
  2) Relative frequency histogram: intervals may not be of equal length. Height of the rectangle for an interval is chosen so that the area = frequency/n (the relative frequency for the interval). The vertical axis is labelled Density, and so the sum of the areas of the rectangles in the histogram equals 1.
• Empirical CDF (ECDF): first, order the values of y from least to greatest. Then form a step function.
  o The plot of the ECDF does not show the shape of the distribution as clearly as the histogram
  o Shows the proportion of y-values in any given interval (a, b]: F(b) – F(a)
  o Allows us to determine the pth quantile, or 100pth percentile
• Boxplots (box and whisker plots): good when the number of groups is large, or sample sizes within groups are small → good way to display data side by side
  o The box spans the IQR, from q(0.25) to q(0.75), with a line at the median → small IQR = lower variability
  o The line on the left (or bottom) is placed at the smallest observed data value that is larger than q(0.25) – (1.5*IQR)
  o The line on the right/top is placed at the largest data value smaller than q(0.75) + (1.5*IQR)
• Scatterplot: good at showing the relationship between 2 or more variates for each unit in the sample
SUM tricks
• $\sum_{i=1}^{n} i = \frac{n(n+1)}{2}$
• $\sum_{i=1}^{n} i^2 = \frac{n(n+1)(2n+1)}{6}$
Probability Distributions and Statistical Models
• Model depends on the distribution of variate values in the population (histogram) and the selection procedure
• The average $\bar{y}$ corresponds to μ, the expected value of Y
• The sample median $\hat{m}$ corresponds to the population median m.
  o For continuous distributions, m is the solution of F(m) = 0.5, where F(y) = P(Y ≤ y) is the cdf of Y
  o For discrete distributions, m is a point chosen such that P(Y ≤ m) ≥ 1/2 and P(Y ≥ m) ≥ 1/2
• The sample standard deviation s corresponds to σ, the s.d.
of Y, where $\sigma^2 = E[(Y - \mu)^2]$
• The histogram (with y-axis on the density scale) corresponds to the probability density function of Y
Data Analysis and Statistical Inference
• 2 broad aspects of the analysis and interpretation of data
  1) Descriptive statistics: portrayal of data in numerical/graphical ways to show features of interest
  2) Statistical inference: use the data obtained in the study of a process/population to draw general conclusions about the process/population → a form of inductive inference → specific to general
• 3 main types of problems
  1) Estimation problems: estimate one or more attributes of a process/population
  2) Prediction problems: use the data to predict a future value for a process variate or a unit to be selected from the population
  3) Hypothesis testing problems: use the data to assess the truth of some question/hypothesis
Chapter 2: MODEL FITTING, MAXIMUM LIKELIHOOD ESTIMATION, MODEL CHECKING
• Statistical model: mathematical model that incorporates probability in some way
• Be clear about what the target population/process is, and how the variables being considered are defined and measured → the response variate is caused/explained by the explanatory variate
• The choice of a model is driven by a combination of these 3 factors:
  1) background knowledge/assumptions about the population/process, which lead to certain distributions
  2) past experience with data from the population/process, which showed that certain distributions are suitable
  3) the current data set, against which models can be assessed
• Suppose that for some discrete r.v. Y, we consider a family whose probability function depends on the parameter θ: P(Y = y; θ) = f(y; θ) for y ∈ A, where A is a countable/discrete set of real numbers, the range of Y. To apply the model to a specific problem, we need a value for θ → this is called fitting the model, or estimating the value of θ.
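• Binomial Distribution: Y ~ Binomial(n, θ); 0 < θ < 1
• Poisson Distribution: Y ~ Poisson(θ); θ > 0; E(Y) = θ = Var(Y)
• Exponential Distribution: Y ~ Exponential(θ); E(Y) = θ; $\mathrm{Var}(Y) = \theta^2$. The CDF is $F(y) = 1 - e^{-y/\theta}$
• Gaussian Distribution (Normal): Y ~ G(μ, σ) or Y ~ $N(\mu, \sigma^2)$ → we use normal
  o μ and σ are parameters, with −∞ < μ < ∞ and σ > 0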
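The correspondence between sample summaries and model quantities (the average with μ, the sample variance with σ², the ECDF with the cdf) can be checked with a small simulation. The sketch below is only an illustration, not part of the course material: it assumes the Exponential(θ) parameterization with mean θ and cdf F(y) = 1 − e^(−y/θ), and the seed, sample size, θ = 2 and the helper names `ecdf` and `exponential_cdf` are arbitrary, hypothetical choices.

```python
import math
import random
import statistics


def ecdf(ys, y):
    """Empirical CDF: proportion of observations less than or equal to y."""
    return sum(1 for v in ys if v <= y) / len(ys)


def exponential_cdf(y, theta):
    """Model CDF F(y) = 1 - exp(-y / theta) for y >= 0, where theta is the mean."""
    return 1.0 - math.exp(-y / theta) if y >= 0 else 0.0


rng = random.Random(231)  # fixed seed so the run is repeatable (arbitrary value)
theta = 2.0               # hypothetical "true" mean of the process

# random.expovariate takes the rate (1 / mean), not the mean itself.
sample = [rng.expovariate(1.0 / theta) for _ in range(20000)]

y_bar = statistics.mean(sample)    # sample mean, estimates E(Y) = theta
s2 = statistics.variance(sample)   # sample variance with divisor n - 1,
                                   # estimates Var(Y) = theta ** 2

# Proportion of observations in (a, b] versus the model probability F(b) - F(a).
a, b = 1.0, 3.0
observed_share = ecdf(sample, b) - ecdf(sample, a)
model_share = exponential_cdf(b, theta) - exponential_cdf(a, theta)

print(f"mean {y_bar:.3f} vs theta {theta}")
print(f"variance {s2:.3f} vs theta^2 {theta ** 2}")
print(f"share in (1, 3]: {observed_share:.3f} vs model {model_share:.3f}")
```

With a large sample the three pairs should agree closely; large disagreements would suggest the exponential model is a poor description of the data.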