applying statistics to a scientific, industrial, or societal problem, it is necessary to begin with a
population or process to be studied. Populations can be diverse, such as "all persons living
in a country" or "every atom composing a crystal". A population can also be composed of
observations of a process at various times, with the data from each observation serving as a
different member of the overall group. Data collected about this kind of "population" constitutes
what is called a time series.
For practical reasons, a chosen subset of the population, called a sample, is studied, as
opposed to compiling data about the entire group (an operation called a census). Once a sample
that is representative of the population is determined, data are collected for the sample members
in an observational or experimental setting. These data can then be subjected to statistical
analysis, serving two related purposes: description and inference.
* Descriptive statistics summarize the data by describing what was observed in the sample,
numerically or graphically. Numerical descriptors include the mean and standard deviation for
continuous data (like heights or weights), while frequency and percentage are more useful
for describing categorical data (like race).
* Inferential statistics uses patterns in the sample data to draw inferences about the population
represented, accounting for randomness. These inferences may take the form of: answering
yes/no questions about the data (hypothesis testing), estimating numerical characteristics of the
data (estimation), describing associations within the data (correlation) and modeling relationships
within the data (for example, using regression analysis). Inference can extend to forecasting,
prediction and estimation of unobserved values either in or associated with the population being
studied; it can include extrapolation and interpolation of time series or spatial data, and can also
include data mining.
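The two purposes above can be illustrated with a minimal sketch. The data here are hypothetical (simulated heights and an arbitrary categorical label), and the confidence interval uses a normal approximation chosen for simplicity; it is one of several possible estimation methods, not a prescribed one.

```python
import math
import random
import statistics
from collections import Counter

random.seed(0)  # fixed seed so the illustration is repeatable

# Hypothetical sample (an assumption for illustration): 200 heights in cm
# as continuous data, plus an arbitrary categorical label per member.
heights = [random.gauss(170, 10) for _ in range(200)]
groups = [random.choice(["A", "B", "C"]) for _ in range(200)]

# Descriptive statistics: summarize what was observed in the sample.
mean_height = statistics.mean(heights)
sd_height = statistics.stdev(heights)  # sample standard deviation
counts = Counter(groups)               # frequency of each category
percentages = {g: 100 * c / len(groups) for g, c in counts.items()}

# Inferential statistics (estimation): an approximate 95% confidence
# interval for the population mean, using a normal approximation.
z = statistics.NormalDist().inv_cdf(0.975)
half_width = z * sd_height / math.sqrt(len(heights))
ci = (mean_height - half_width, mean_height + half_width)

print(f"mean = {mean_height:.1f} cm, sd = {sd_height:.1f} cm")
print("percentages:", {g: round(p, 1) for g, p in sorted(percentages.items())})
print(f"approx. 95% CI for the population mean: {ci[0]:.1f} to {ci[1]:.1f} cm")
```

The first half only describes the sample; the interval at the end is a statement about the population, and its width reflects the randomness of sampling.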
“... it is only the manipulation of uncertainty that interests us. We are not concerned with the
matter that is uncertain. Thus we do not study the mechanism of rain; only whether it will rain.”
Dennis Lindley, "The Philosophy of Statistics", The Statistician (2000).
The concept of correlation is particularly noteworthy for the potential confusion it can cause.
Statistical analysis of a data set often reveals that two variables (properties) of the population
under consideration tend to vary together, as if they were connected. For example, a study of
annual income that also looks at age of death might find that poor people tend to have shorter
lives than affluent people. The two variables are said to be correlated; however, they may or may
not be the cause of one another. The correlation could be caused by a third,
previously unconsidered phenomenon, called a lurking variable or confounding variable. For this
reason, a causal relationship between the two variables cannot be inferred from the
correlation alone. (See Correlation does not imply causation.)
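A small simulation can make the role of a lurking variable concrete. In this sketch, which rests entirely on made-up data, a hidden variable z drives both x and y, and x and y have no direct effect on each other. The helper names `pearson` and `residuals` are hypothetical, written here for illustration only.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def residuals(ys, zs):
    """Residuals of a least-squares straight-line fit of ys on zs."""
    n = len(ys)
    mz = sum(zs) / n
    my = sum(ys) / n
    slope = (sum((z - mz) * (y - my) for z, y in zip(zs, ys))
             / sum((z - mz) ** 2 for z in zs))
    return [(y - my) - slope * (z - mz) for y, z in zip(ys, zs)]


random.seed(1)
n = 5000

# z is the lurking (confounding) variable; x and y each depend on z
# plus independent noise, and do not influence one another.
z = [random.gauss(0, 1) for _ in range(n)]
x = [zi + random.gauss(0, 0.5) for zi in z]
y = [zi + random.gauss(0, 0.5) for zi in z]

r_xy = pearson(x, y)
# Correlation that remains after the linear effect of z is removed
# from both variables (a partial correlation).
r_partial = pearson(residuals(x, z), residuals(y, z))

print(f"correlation of x and y:          {r_xy:.2f}")
print(f"after accounting for z (partial): {r_partial:.2f}")
```

Under these assumptions the raw correlation is strong, while the correlation remaining after accounting for z is close to zero, even though neither x nor y causes the other.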
For a sample to be used as a guide to an entire population, it is important that it truly
represents that overall population. Representative sampling ensures that inferences
and conclusions can be safely extended from the sample to the population as a whole. A major
problem lies in determining the extent to which the sample chosen is actually representative.
Statistics offers methods to estimate and correct for any random trends within the sample and
data collection procedures. There are also methods of experimental design that
can lessen these issues at the outset of a study, strengthening its capability to discern truths
about the population. Statisticians describe stronger methods as more "robust".
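The importance of representativeness can be sketched with a simulated population. Everything below is an assumption made for illustration: the population is a set of hypothetical incomes drawn from a log-normal distribution, and the "convenience" sample deliberately takes only the largest values, to mimic a collection procedure that reaches one part of the population.

```python
import random
import statistics

random.seed(2)

# Hypothetical population: 10,000 right-skewed "incomes".
population = [random.lognormvariate(10, 0.5) for _ in range(10_000)]
pop_mean = statistics.mean(population)

# Simple random sample: every member is equally likely to be chosen.
srs = random.sample(population, 500)

# Unrepresentative sample: only the 500 highest values are observed.
convenience = sorted(population)[-500:]

srs_error = abs(statistics.mean(srs) - pop_mean)
conv_error = abs(statistics.mean(convenience) - pop_mean)

print(f"population mean:           {pop_mean:,.0f}")
print(f"random sample error:       {srs_error:,.0f}")
print(f"convenience sample error:  {conv_error:,.0f}")
```

Both samples have the same size, but only the random one gives an estimate that can reasonably be extended to the whole population; collecting more data the same way would not repair the second sample.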
Randomness is studied using the mathematical discipline of probability theory. Probability is used
in "mathematical statistics" (alternatively, "statistical theory") to study the sampling distributions of
sample statistics and, more generally, the properties of statistical procedures. The use of any