Chapter 4 – Topic 4A – Basic Concepts of Validity
In the previous chapter, regardless of the method used, the assessment of reliability invariably boils down
to a reliability coefficient. Put simply, the validity of a test is the extent to which it measures what it
claims to measure. The most fundamental and important characteristic of a test is validity – reliability is
important too, but only insofar as it constrains validity (to the extent that a test is unreliable, it cannot be
valid). Reliability is a necessary but not a sufficient precursor of validity. Test validation is a
developmental process that begins with test construction and continues indefinitely.
Validity: A Definition
Definition of validity: A test is valid to the extent that inferences made from it are appropriate,
meaningful, and useful. Note that a test score per se is meaningless until the examiner draws inferences
from it based on the test manual or other research findings. Validity reflects an evolutionary,
research-based judgement of how adequately a test measures the attribute it was designed to measure.
Consequently, the validity of tests is not easily captured by neat statistical summaries but is instead
characterized on a continuum from weak to acceptable to strong.
The three categories of validity are: Content, Criterion-related, and Construct. The use of these
labels does not imply that there are distinct types of validity or that a specific validity procedure is best for
one test use and not another. It is stressed that validity is a unitary concept determined by the extent to
which a test measures what it purports to measure.
Content validity is determined by the degree to which the questions, tasks, or items on a test are
representative of the universe of behaviour the test was designed to sample (it is no more than a sampling
issue). If the sample (specific items on the test) is representative of the population (all possible items),
then the test possesses content validity.
Content validity is a useful concept when a great deal is known about the variable that the
researcher wishes to measure. However, test developers must take care to specify the relevant universe of
responses as well. Content validity is more difficult to assure when the test measures an ill-defined trait
(for example, anxiety; in this case a panel of experts would judge the content validity of the questions).
Quantification of Content Validity
When a panel of judges rates the questions for content validity, picture a 2x2 table that crosses
two judges' ratings of each question. Only one cell (D) counts the questions that both judges agree are
very relevant, so the coefficient of content validity is D/(A+B+C+D), where A, B, and C are the counts
in the other three cells. Such a coefficient does not by itself establish the validity of a test;
quantification of content validity is no substitute for careful selection of items.
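The D/(A+B+C+D) calculation can be sketched as follows. This is only an illustration: the 4-point relevance scale, the cutoff of 3 for "very relevant", and the sample ratings are assumptions made for the example, not details given in these notes.

```python
def content_validity(judge1, judge2, cutoff=3):
    """Return D / (A + B + C + D) for two judges' item ratings.

    D is the number of items that BOTH judges rate at or above `cutoff`
    (the cutoff is an assumption for illustration). A + B + C + D is
    simply the total number of items rated.
    """
    if len(judge1) != len(judge2) or not judge1:
        raise ValueError("ratings must be non-empty and of equal length")
    both_relevant = sum(
        1 for r1, r2 in zip(judge1, judge2) if r1 >= cutoff and r2 >= cutoff
    )
    return both_relevant / len(judge1)


# Hypothetical ratings for five items on a 1-4 relevance scale.
judge1 = [4, 3, 2, 4, 1]
judge2 = [3, 4, 4, 3, 1]
coefficient = content_validity(judge1, judge2)
print(coefficient)
```

Here three of the five items are rated very relevant by both judges, so the coefficient is 3/5.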
Face validity is not really a form of validity at all. A test has face validity if it looks valid to test
users, examiners, and especially the examinees. It is crucial that a test possess face validity from a public
relations standpoint but this is not to be confused with objective validity.
Criterion-related validity is demonstrated when a test is shown to be effective in estimating an
examinee’s performance on some outcome measure. The variable of interest is the outcome measure,
called a criterion. The test score is useful only insofar as it provides a basis for accurate prediction of the
criterion.
Two different approaches to validity evidence are subsumed under the heading of criterion-related
validity. In concurrent validity, the criterion measures are obtained at approximately the same time as the
test scores (for example, a psychiatric diagnosis estimated from a paper-and-pencil task). In predictive
validity, the criterion measures are obtained in the future, usually months or years after the test scores
are obtained (for example, college entrance exams and the later GPA of the students).
Characteristics of a Good Criterion
Remember that a criterion is any outcome measure against which a test is validated. The criterion
itself must be reliable if it is to be a useful index of what the test measures. The validity coefficient (Rxy)
is the correlation between the test and the criterion measure. Its theoretical upper limit is set by the
reliability of the test (Rxx) and of the criterion (Ryy): Rxy <= square root (Rxx*Ryy). Hence, if the
reliability of either the test or the criterion is low, the validity coefficient is also limited.
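As a small worked illustration of the square-root relationship above, read here as the highest validity coefficient that the two reliabilities allow, the sketch below computes that ceiling. The function name and the reliability values are hypothetical.

```python
import math


def max_validity(r_xx, r_yy):
    """Ceiling on the validity coefficient: sqrt(Rxx * Ryy).

    r_xx is the reliability of the test and r_yy the reliability of the
    criterion; both are assumed to lie between 0 and 1.
    """
    if not (0.0 <= r_xx <= 1.0 and 0.0 <= r_yy <= 1.0):
        raise ValueError("reliabilities must lie between 0 and 1")
    return math.sqrt(r_xx * r_yy)


# Hypothetical reliabilities: a fairly reliable test, a weaker criterion.
ceiling = max_validity(0.81, 0.64)
print(round(ceiling, 2))
```

With these made-up values the ceiling is about 0.72, so even a test that tracks the criterion as well as possible could not show a validity coefficient above that figure.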
A criterion must also be free of contamination from the test itself. If the test contains the same
questions as the criterion measure, then the correlation between the two will be inflated and this source of
error is referred to as criterion contamination (also possible when the criterion consists of ratings from
In a concurrent validation study, test scores and criterion information are obtained
simultaneously. Concurrent evidence of test validity is usually desirable for achievement tests, tests used
for licensing or certification, and diagnostic clinical tests. An evaluation of concurrent validity indicates
the extent to which test scores accurately estimate an individual’s present position on the relevant
criterion. A test with demonstrated concurrent validity provides a shortcut for obtaining information that
might otherwise require the extended investment of professional time.
Correlations between a new test and existing tests are often cited as evidence of concurrent
validity. Using old tests to validate a new test has a catch-22 quality, but it is appropriate if two conditions
are met: 1) the criterion (existing) tests must have been validated through correlations with appropriate
nontest behavioural data and 2) the instrument being validated must measure the same construct as the
criterion tests.
In a predictive validity study, test scores are used to estimate outcome measures obtained at a
later date. Such studies are particularly relevant for entrance examinations and employment tests, where
the question is who is likely to succeed in a future endeavour.
When tests are used for purposes of prediction, it is necessary to develop a regression equation. A
regression equation describes the best-fitting straight line for estimating the criterion from the test.
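A minimal sketch of such a regression equation, fitted by ordinary least squares, is shown below. The test scores and GPA values are invented for illustration, and `fit_line` is a hypothetical helper, not a procedure taken from these notes.

```python
def fit_line(test_scores, criterion_scores):
    """Least-squares slope and intercept for predicting the criterion."""
    n = len(test_scores)
    if n != len(criterion_scores) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(test_scores) / n
    mean_y = sum(criterion_scores) / n
    # Sum of cross-products and sum of squared deviations of the test scores.
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(test_scores, criterion_scores))
    sxx = sum((x - mean_x) ** 2 for x in test_scores)
    if sxx == 0:
        raise ValueError("test scores must not all be identical")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical entrance-exam scores and later GPAs for five students.
test_scores = [40, 50, 60, 70, 80]
gpa = [2.0, 2.4, 3.1, 3.3, 3.7]
slope, intercept = fit_line(test_scores, gpa)

# Predicted criterion score (GPA) for a new examinee who scores 65.
predicted = intercept + slope * 65
print(round(slope, 3), round(intercept, 2), round(predicted, 3))
```

The fitted line has the form predicted criterion = intercept + slope * test score; for these made-up data that is roughly 0.32 + 0.043 * score.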
Validity Coefficient and the Standard Error of the Estimate
In the hypothetical case where Rxy is 1.00, the test would possess perfect validity and allow for
flawless prediction. There is no general answer to how high a validity coefficient should be but we can
look at the standard error of estimate.
The standard error of estimate (SEest) is the margin of error to be expected in the predicted
criterion score. SEest and SEM both help gauge margins of error – SEM indicates the margin of
measurement error caused by unreliability of the test, whereas SEest indicates the margin of prediction error
caused by the imperfect validity of the test.
The SEest helps answer the fundamental question “How accurately can criterion performance be
predicted from test scores?”
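These notes do not state the formula. One standard form is SEest = SDy * square root (1 - Rxy^2), where SDy is the standard deviation of the criterion scores. The sketch below assumes that form and uses made-up values.

```python
import math


def standard_error_of_estimate(sd_criterion, r_xy):
    """SEest = SDy * sqrt(1 - Rxy**2), the assumed standard form."""
    if sd_criterion < 0 or not -1.0 <= r_xy <= 1.0:
        raise ValueError("need SDy >= 0 and -1 <= Rxy <= 1")
    return sd_criterion * math.sqrt(1.0 - r_xy ** 2)


# Hypothetical values: criterion SD of 0.5 GPA points, validity of 0.60.
se_est = standard_error_of_estimate(0.5, 0.6)
print(round(se_est, 2))

# With perfect validity (Rxy = 1.00) there is no prediction error at all.
perfect = standard_error_of_estimate(0.5, 1.0)
```

In this example the margin of prediction error is about 0.4 GPA points. As Rxy falls toward zero, SEest approaches SDy, meaning the test adds nothing to the prediction.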
Decision Theory Applied to Psychological Tests
Proponents of decision theory stress that the purpose of psychological testing is not measurement
per se but measurement in the service of decision making. The link between testing and decision making
is nowhere more obvious than in the context of predictive validation studies.
Take a single test as an example. The proportion of examinees predicted to succeed (pass) is referred to
as the selection ratio. Think of a 2x2 table again (those predicted to succeed/fail crossed with those who
did succeed/fail). If a test has good predictive validity, then most persons predicted to succeed will
succeed and most persons predicted to fail will fail. False positives are those predicted to succeed who
fail, and false negatives are those predicted to fail who succeed. Correct predictions are hits and
incorrect predictions are misses, so hit rate = hits/(hits + misses).
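The 2x2 tally can be sketched as below, treating the selection ratio as the proportion of examinees predicted to succeed. The function name and the predictions and outcomes are hypothetical.

```python
def decision_summary(predicted_success, actual_success):
    """Tally the 2x2 prediction table and derive the summary rates."""
    if len(predicted_success) != len(actual_success) or not predicted_success:
        raise ValueError("inputs must be non-empty and of equal length")
    counts = {"true_pos": 0, "false_pos": 0, "false_neg": 0, "true_neg": 0}
    for predicted, actual in zip(predicted_success, actual_success):
        if predicted and actual:
            counts["true_pos"] += 1    # predicted to succeed, succeeded (hit)
        elif predicted and not actual:
            counts["false_pos"] += 1   # predicted to succeed, failed (miss)
        elif not predicted and actual:
            counts["false_neg"] += 1   # predicted to fail, succeeded (miss)
        else:
            counts["true_neg"] += 1    # predicted to fail, failed (hit)
    total = len(predicted_success)
    hits = counts["true_pos"] + counts["true_neg"]
    counts["hit_rate"] = hits / total  # hits / (hits + misses)
    counts["selection_ratio"] = (counts["true_pos"] + counts["false_pos"]) / total
    return counts


# Hypothetical predictions and outcomes for eight examinees.
predicted = [True, True, True, False, False, False, True, False]
actual = [True, True, False, False, False, True, True, False]
summary = decision_summary(predicted, actual)
print(summary)
```

For these invented data there are six hits and two misses (one false positive and one false negative), giving a hit rate of 0.75 and a selection ratio of 0.5.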
Proponents of decision theory make two fundamental assumptions about the use of selection tests:
1. The value of various outcomes to the institution can be expressed in terms of a common
utility scale (one scale is the profit and loss scale)
2. In institutional selection decisions, the most generally useful strategy is one that maximizes
the average gain on the utility scale over similar decisions. Maximization is the fundamental
A construct is a theoretical, intangible quality or trait in which individuals differ. Examples of
constructs include leadership ability, overcontrolled hostility, depression, and intelligence. A test
designed to measure a construct must estimate the existence of an inferred, underlying characteristic
based on a limited sample of behaviour. Construct validity refers to the appropriateness of these
inferences about the underlying construct. All psychological constructs possess two characteristics in