Chapter 5: Measurement Concepts
Reliability of Measures
•Reliability refers to the consistency or stability of a measure of behavior. A reliable
measure of a psychological variable such as intelligence will yield the same result
each time you administer the intelligence test to the same person. The test would be
unreliable if it measured the same person as average one week, low the next and
bright the next. Put simply, a reliable measure does not fluctuate from one reading
to the next.
•A more formal way of understanding reliability is to use the concepts of true score
and measurement error. Any measure that you make can be thought of as
comprising two components: 1) a true score, which is the real score on the
variable and 2) measurement error. An unreliable measure of intelligence
contains considerable measurement error and so does not provide an accurate
indication of an individual’s true intelligence.
•In contrast, a reliable measure of intelligence – one that contains little measurement
error – will yield an identical (or nearly identical) intelligence score each time the
same individual is measured.
•To illustrate the concept of reliability further, imagine that you know someone
whose “true” intelligence score is 100. Now suppose that you administer an
unreliable intelligence test to this person each week for a year. Now suppose that
you test another friend who also has a true intelligence score of 100; however, this
time you administer a highly reliable test. What might your data look like? In each
case, the average score is 100. However, scores on the unreliable test range from 85
to 115, whereas scores on the reliable test range from 97 to 103. The measurement
error in the unreliable test is revealed in the greater variability shown by the person
who took the unreliable test.
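•The thought experiment above can be sketched as a small simulation. This is only an illustration: it assumes that measurement error behaves like random noise drawn from a normal distribution, and the error sizes (standard deviations of 5 and 1) and the function name simulate_scores are hypothetical choices, not values from the text.

```python
import random
import statistics

def simulate_scores(true_score, error_sd, n_weeks=52, seed=0):
    """Return weekly observed scores: true score plus random measurement error.

    Assumption: error is normally distributed with mean 0 and the given SD.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [true_score + rng.gauss(0, error_sd) for _ in range(n_weeks)]

# Hypothetical error sizes: large for the unreliable test, small for the reliable one.
unreliable = simulate_scores(true_score=100, error_sd=5)
reliable = simulate_scores(true_score=100, error_sd=1)

for name, scores in [("unreliable", unreliable), ("reliable", reliable)]:
    print(name,
          "mean:", round(statistics.mean(scores), 1),
          "SD:", round(statistics.stdev(scores), 1),
          "range:", round(min(scores)), "to", round(max(scores)))
```

Both sets of scores center near the true score of 100; the unreliable test differs only in how widely its readings scatter around that value.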
•Researchers cannot use unreliable measures to systematically study variables or the
relationships among variables. Trying to study behavior using unreliable measures
is a waste of time because the results will be unstable and unable to be replicated.
•Reliability is most likely to be achieved when researchers use careful measurement
procedures. It might mean paying close attention to the way questions are phrased
or the way recording electrodes are placed on the body to measure physiological
•We can assess the stability of measures using correlation coefficients. There are
several ways of calculating correlation coefficients; the most common correlation
coefficient when discussing reliability is the Pearson product-moment
correlation coefficient. The Pearson correlation coefficient (symbolized as r) can
range from 0.00 to +1.00 and from 0.00 to -1.00.
•A correlation of 0.00 tells us that the two variables are not related at all. The closer
a correlation is to 1.00, either +1.00 or -1.00, the stronger the relationship. The
positive and negative signs provide information about the direction of the relationship.
When the correlation coefficient is positive, there is a positive linear relationship. A
negative linear relationship is indicated by a minus sign.
•To assess the reliability of a measure, we will need to obtain at least two scores on
the measure from many individuals. If the measure is reliable, the two scores should
be very similar; a Pearson correlation coefficient that relates the two scores should
be a high positive correlation. When you read about reliability, the correlation will
usually be called a reliability coefficient. Let’s examine specific methods of assessing
reliability.
•Test-retest reliability is assessed by measuring the same individuals at two points
in time. For example, the reliability of a test of intelligence could be assessed by
giving the measure to a group of people on one day and again a week later. We
would then have two scores for each person, and a correlation coefficient could be
calculated to determine the relationship between the first test score and the retest
score.
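•A test-retest reliability coefficient is simply a Pearson r computed between the two testing occasions. The sketch below writes the calculation out by hand; the function name pearson_r and the five pairs of scores are made up for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products of deviations, and sums of squared deviations
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Made-up scores for five people tested one week apart (illustration only)
first_test = [95, 110, 102, 88, 120]
retest = [97, 108, 105, 90, 118]

reliability = pearson_r(first_test, retest)
print(f"test-retest reliability coefficient: r = {reliability:.2f}")
```

Because people who scored high the first time also scored high on the retest, the coefficient for these made-up data is a high positive value, which is the pattern expected of a reliable measure.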
•It is difficult to say how high the correlation should be before we accept the measure
as reliable, but for most measures the reliability coefficient should probably be at
•Given that test-retest reliability involves administering the same test twice, the
correlation might be artificially high because the individuals remember how they
responded the first time. Alternate forms reliability is sometimes used to avoid this
problem. Alternate forms reliability involves administering two different forms of
the same test to the same individuals at two points in time.
•Intelligence is a variable that can be expected to stay relatively constant over time;
thus, we expect the test-retest reliability for intelligence to be very high. However,
some variables may be expected to change from one test period to the next. For
example, a mood scale designed to measure a person’s current mood state is a
measure that might easily change from one test period to another and so test-retest
reliability might not be appropriate.
Internal Consistency Reliability
•Most psychological measures are made up of a number of different questions, called
items. Internal consistency reliability is the assessment of reliability using
responses at only one point in time. Because all items measure the same variable,
they should yield similar or consistent results. One indicator of internal consistency
is split-half reliability; this is the correlation of an individual’s total score on one
half of the test with the total score on the other half. The two halves are created
randomly by dividing the items into two parts. The actual calculation of a split-half
reliability coefficient is a bit more complicated because the final measure will
include items from both halves. Thus, the combined measure will have more items
and will be more reliable than either half by itself.
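•The text does not name the adjustment it has in mind. One standard way to estimate the reliability of the full-length measure from the correlation between its halves is the Spearman-Brown formula, r_full = 2r / (1 + r). The sketch below assumes that formula; the function name and the example correlation of .70 are hypothetical.

```python
def spearman_brown(half_r):
    """Estimate full-test reliability from the correlation between two halves.

    Assumption: the Spearman-Brown formula for a test of doubled length,
    r_full = 2 * r_half / (1 + r_half).
    """
    return 2 * half_r / (1 + half_r)

# Hypothetical correlation between the two half-test totals
half_r = 0.70
print(f"half-test r = {half_r:.2f}, corrected = {spearman_brown(half_r):.2f}")
```

Under this assumption, a correlation of .70 between the halves corresponds to an estimated reliability of about .82 for the combined measure, which reflects the point that the longer measure is more reliable than either half.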
•Split-half reliability is relatively straightforward and easy to calculate, even without
a computer. One drawback is that it does not take into account each individual
item’s role in a measure’s reliability. Another internal consistency indicator of
reliability, called Cronbach’s alpha, is based on the individual items. Here the
researcher calculates the correlation of each item with every other item. A large
number of correlation coefficients are produced. The value of alpha is based on the
average of all these inter-item correlations and on the number of items in the
measure. It is also possible to examine the correlation of each
item score with the total score based on all items. Such item-total correlations
and Cronbach’s alpha are very informative because they provide information about
each individual item. Items that do not correlate with the other items can be
eliminated from the measure to increase reliability.
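•The calculations described above can be sketched as follows. The sketch assumes the standardized form of alpha, alpha = k * r_mean / (1 + (k - 1) * r_mean), where k is the number of items and r_mean is the average inter-item correlation; statistical packages often report a variance-based version that can give a somewhat different value. The function names and the responses are made up for illustration.

```python
import math
from itertools import combinations

def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def standardized_alpha(items):
    """items: one list of scores per item, all from the same respondents."""
    k = len(items)
    # Correlate every item with every other item, then average
    pair_rs = [pearson_r(a, b) for a, b in combinations(items, 2)]
    mean_r = sum(pair_rs) / len(pair_rs)
    return k * mean_r / (1 + (k - 1) * mean_r)

def item_total_correlations(items):
    """Correlate each item with the total score (the total includes the item itself)."""
    totals = [sum(person) for person in zip(*items)]
    return [pearson_r(item, totals) for item in items]

# Made-up responses: 4 items answered by 6 people (illustration only)
items = [
    [4, 2, 5, 3, 1, 4],
    [5, 2, 4, 3, 2, 4],
    [4, 1, 5, 2, 1, 5],
    [3, 4, 2, 5, 3, 1],  # written to behave unlike the other three items
]

print("item-total r:", [round(r, 2) for r in item_total_correlations(items)])
print("alpha, all items:", round(standardized_alpha(items), 2))
print("alpha, last item dropped:", round(standardized_alpha(items[:3]), 2))
```

In these made-up data the last item correlates negatively with the total score, and dropping it raises alpha, which illustrates how item-level information can guide the removal of poorly performing items.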
•In some research, raters observe behaviors and make ratings or judgments.
Interrater reliability is the extent to which raters agree in their observations. A
commonly used indicator of interrater reliability is called Cohen’s Kappa.
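•Cohen’s kappa compares the agreement two raters actually show with the agreement expected by chance: kappa = (observed - expected) / (1 - expected). The sketch below is a minimal version for two raters and nominal categories; the function name, the category labels, and the ratings are made up for illustration.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who each categorized the same observations."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must rate the same non-empty set of observations")
    n = len(rater_a)
    # Proportion of observations on which the raters gave the same category
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's category proportions, summed over categories
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    # Note: undefined (division by zero) if both raters use one and the same category throughout
    return (observed - expected) / (1 - expected)

# Made-up ratings of ten observed behaviors (illustration only)
rater_1 = ["aggressive", "aggressive", "neutral", "neutral", "neutral",
           "aggressive", "neutral", "aggressive", "neutral", "neutral"]
rater_2 = ["aggressive", "neutral", "neutral", "neutral", "neutral",
           "aggressive", "neutral", "aggressive", "aggressive", "neutral"]

print(f"kappa = {cohens_kappa(rater_1, rater_2):.2f}")
```

A kappa of 1.00 means perfect agreement, and a kappa of 0.00 means the raters agree no more often than chance would predict.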
Indicators of Construct Validity
•Construct validity refers to the adequacy of the operational definition of variables.
How do we know that a measure is valid? Construct validity information is gathered
through a variety of methods. The simplest way to argue that a measure is valid is
to suggest that the measure appears to accurately assess the intended variable. This