Bias, Variance, and MSE of Estimators
September 4, 2010
We assume that we have iid (independent identically distributed) samples $X^{(1)},\ldots,X^{(n)}$ that follow some
unknown distribution. The task of statistics is to estimate properties of the unknown distribution. In this
note we focus on estimating a parameter of the distribution such as the mean or variance. In some cases
the parameter completely characterizes the distribution and estimating it provides a probability estimate.
In this note, we assume that the parameter is a real vector $\theta \in \mathbb{R}^d$. To estimate it, we use an
estimator, which is a function of our observations $\hat\theta(x^{(1)},\ldots,x^{(n)})$. We follow standard practice and omit
(in notation only) the dependency of the estimator on the samples, i.e. we write $\hat\theta$. However, note that
$\hat\theta = \hat\theta(X^{(1)},\ldots,X^{(n)})$ is a random variable since it is a function of $n$ random variables.
A desirable property of an estimator is that it is correct on average. That is, if there are repeated
samplings of $n$ samples $X^{(1)},\ldots,X^{(n)}$, the estimator $\hat\theta(X^{(1)},\ldots,X^{(n)})$ will have, on average, the correct
value. Such estimators are called unbiased.
Definition 1. The bias of $\hat\theta$ is $\mathrm{Bias}(\hat\theta) = E(\hat\theta) - \theta$. If it is 0, the estimator $\hat\theta$ is said to be unbiased.
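As an illustration of Definition 1, the following sketch approximates the bias of two variance estimators by averaging over many repeated samplings. The Gaussian distribution, the sample size, and the helper names `variance_divide_by_n`, `draw_gaussian`, and `estimate_bias` are assumptions made for this example only; they do not come from the note.

```python
import random
import statistics

def variance_divide_by_n(xs):
    """Variance estimator that divides by n (known to be biased)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def draw_gaussian(rng):
    """Hypothetical data source: Gaussian with mean 1 and variance 4."""
    return rng.gauss(1.0, 2.0)

def estimate_bias(estimator, sampler, true_value, n, repetitions, rng):
    """Approximate E(estimator) - true_value over repeated samplings of size n."""
    estimates = [
        estimator([sampler(rng) for _ in range(n)]) for _ in range(repetitions)
    ]
    return sum(estimates) / repetitions - true_value

rng = random.Random(0)   # fixed seed so the run is reproducible
sigma2 = 4.0             # true variance of the assumed distribution
n = 5                    # samples per repetition

bias_divide_by_n = estimate_bias(variance_divide_by_n, draw_gaussian, sigma2, n, 20000, rng)
# statistics.variance divides by n - 1 and is unbiased for iid samples.
bias_divide_by_n_minus_1 = estimate_bias(statistics.variance, draw_gaussian, sigma2, n, 20000, rng)

print("divide by n:    ", bias_divide_by_n)
print("divide by n - 1:", bias_divide_by_n_minus_1)
```

In theory the divide-by-$n$ estimator has bias $-\sigma^2/n$ (here $-0.8$), while the divide-by-$(n-1)$ estimator has bias 0; the simulated values should be close to these, up to Monte Carlo error.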
There are, however, more important performance characterizations for an estimator than just being
unbiased. The mean squared error is perhaps the most important of them. It captures the error that the estimator
makes. However, since the estimator is a random variable, we need to average over its distribution, thus capturing the
average performance over many repeated samplings of $X^{(1)},\ldots,X^{(n)}$.
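Definition 2. The mean squared error (MSE) of an estimator is $E(\|\hat\theta - \theta\|^2) = E\big(\sum_{j=1}^d (\hat\theta_j - \theta_j)^2\big)$.
The MSE decomposes as
$$E(\|\hat\theta - \theta\|^2) = \mathrm{trace}(\mathrm{Var}(\hat\theta)) + \|\mathrm{Bias}(\hat\theta)\|^2.$$
Note that $\mathrm{Var}(\hat\theta)$ is the covariance matrix of $\hat\theta$ and so its trace is $\sum_{j=1}^d \mathrm{Var}(\hat\theta_j)$.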
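Proof. Since the MSE equals $\sum_{j=1}^d E((\hat\theta_j - \theta_j)^2)$, it is sufficient to prove, for a scalar $\theta$, that
$E((\hat\theta - \theta)^2) = \mathrm{Var}(\hat\theta) + \mathrm{Bias}^2(\hat\theta)$:
\begin{align*}
E((\hat\theta - \theta)^2) &= E\big(((\hat\theta - E(\hat\theta)) + (E(\hat\theta) - \theta))^2\big) \\
&= E\big\{(\hat\theta - E(\hat\theta))^2 + (E(\hat\theta) - \theta)^2 + 2(\hat\theta - E(\hat\theta))(E(\hat\theta) - \theta)\big\} \\
&= \mathrm{Var}(\hat\theta) + \mathrm{Bias}^2(\hat\theta) + 2E\big((\hat\theta - E(\hat\theta))(E(\hat\theta) - \theta)\big) \\
&= \mathrm{Var}(\hat\theta) + \mathrm{Bias}^2(\hat\theta) + 2E\big(\hat\theta E(\hat\theta) - (E(\hat\theta))^2 - \theta\hat\theta + E(\hat\theta)\theta\big) \\
&= \mathrm{Var}(\hat\theta) + \mathrm{Bias}^2(\hat\theta) + 2\big((E(\hat\theta))^2 - (E(\hat\theta))^2 - \theta E(\hat\theta) + \theta E(\hat\theta)\big) \\
&= \mathrm{Var}(\hat\theta) + \mathrm{Bias}^2(\hat\theta).
\end{align*}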
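The decomposition can also be checked numerically. The sketch below simulates a deliberately biased vector estimator (0.8 times the coordinate-wise sample mean) and compares the simulated MSE with the sum of the trace of the variance and the squared norm of the bias. The choice of estimator, the Gaussian data, and the helper names `sample_mean` and `mse_decomposition` are assumptions for illustration. Because the same algebra as in the proof applies to empirical averages, the two sides agree up to floating-point error when the variance is computed by dividing by the number of repetitions.

```python
import random

def sample_mean(samples):
    """Coordinate-wise mean of a list of d-dimensional points."""
    n = len(samples)
    return [sum(column) / n for column in zip(*samples)]

def mse_decomposition(estimates, theta):
    """Return (mse, trace_var, bias_sq), all averaged over the given estimates."""
    r = len(estimates)
    d = len(theta)
    mean_est = [sum(e[j] for e in estimates) / r for j in range(d)]
    # Average squared Euclidean distance between estimate and true parameter.
    mse = sum(sum((e[j] - theta[j]) ** 2 for j in range(d)) for e in estimates) / r
    # Sum of the per-coordinate variances (trace of the covariance matrix).
    trace_var = sum(
        sum((e[j] - mean_est[j]) ** 2 for e in estimates) / r for j in range(d)
    )
    # Squared Euclidean norm of the bias vector.
    bias_sq = sum((mean_est[j] - theta[j]) ** 2 for j in range(d))
    return mse, trace_var, bias_sq

rng = random.Random(1)       # fixed seed so the run is reproducible
theta = [1.0, -2.0]          # assumed true parameter (the mean vector)
n, repetitions = 10, 5000

estimates = []
for _ in range(repetitions):
    # n iid samples with independent unit-variance Gaussian coordinates.
    samples = [[rng.gauss(t, 1.0) for t in theta] for _ in range(n)]
    estimates.append([0.8 * m for m in sample_mean(samples)])

mse, trace_var, bias_sq = mse_decomposition(estimates, theta)
print("MSE:                  ", mse)
print("trace(Var) + ||Bias||^2:", trace_var + bias_sq)
```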
Since the MSE decomposes into the sum of the squared bias and the variance of the estimator, both quantities are
important and need to be as small as possible to achieve good estimation performance. It is common to
trade off some increase in bias for a larger decrease in the variance, and vice versa.
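A small worked example of this trade-off, under assumptions not made in the note: take a scalar parameter $\theta = E(X)$ with $\mathrm{Var}(X) = \sigma^2$ and consider the shrunken sample mean $c\bar{X}$ for a constant $c$. Its variance is $c^2\sigma^2/n$ and its bias is $(c-1)\theta$, so the MSE is $c^2\sigma^2/n + (c-1)^2\theta^2$. The function name `shrunk_mean_mse` and the numeric values are hypothetical.

```python
def shrunk_mean_mse(c, theta, sigma2, n):
    """Variance, squared bias, and MSE of the estimator c * (sample mean)."""
    variance = c ** 2 * sigma2 / n
    bias = (c - 1.0) * theta
    return variance, bias ** 2, variance + bias ** 2

theta, sigma2, n = 1.0, 4.0, 10   # assumed true mean, variance, sample size

for c in (1.0, 0.9, 0.8, 0.7, 0.5):
    variance, bias_sq, mse = shrunk_mean_mse(c, theta, sigma2, n)
    print(f"c={c:.1f}  variance={variance:.4f}  bias^2={bias_sq:.4f}  MSE={mse:.4f}")

# Setting the derivative of the MSE with respect to c to zero gives this minimizer.
best_c = theta ** 2 / (theta ** 2 + sigma2 / n)
mse_unbiased = shrunk_mean_mse(1.0, theta, sigma2, n)[2]
mse_best = shrunk_mean_mse(best_c, theta, sigma2, n)[2]
print("best c:", best_c, " MSE at best c:", mse_best, " MSE at c=1:", mse_unbiased)
```

With these values, a moderate amount of shrinkage introduces bias but lowers the MSE below that of the unbiased choice $c = 1$. The minimizing $c$ depends on the unknown $\theta$, so this is an illustration of the trade-off rather than a usable estimator.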
Note that here and in the sequel all expectations are with respect to $X^{(1)},\ldots,X^{(n)}$.
Two important special cases are the mean $\bar{X} = \frac{1}{n}\sum_{i=1}^n X^{(i)}$, which estimates the vector $E(X)$, and