Chapter 1
Because of diversity concerns, a growing number of colleges no longer rely on SAT tests. The percentage of those who use it, however, went from 46 to 60. The LSAT and GRE are the most difficult modern tests.
What a Test Is
Test: measurement device or technique used to quantify behavior or aid in the understanding and prediction of behavior. Tests never give full understanding because they measure only a sample of your behavior and there are always errors associated with them; tests aren’t perfect measures, but they add to the predictions.
Item: specific stimulus to which a person responds overtly (the response can be evaluated or scored, on a scale for example). Items are subject to scientific inquiry because they are what make up the tests.
Psychological test (aka educational test): set of items designed to measure characteristics of human beings that pertain to behavior. They can measure to what extent a person may engage in overt behavior, has engaged in overt behavior, or covert behavior. The main use of these tests is to evaluate individual differences in ability and personality, and they assume that differences shown on the test reflect actual differences among individuals.
Scales: these are what psychologists make up to relate raw scores on test items to some defined theoretical or empirical distribution (this happened because of misinterpretations of scores: e.g. a person can get 75% in a class where everyone else got 99%, or a person can get 75% in a class where everyone else got 20%).
Test scores may be related to traits (e.g. stubbornness or shyness) or to a state (hopelessness…).
Types of Tests
Individual Tests: given to one person at a time.
Group Test: administered to more than one person at a time by a single examiner or test administrator (person giving the test).
Human ability tests: incorporate all of achievement, aptitude and intelligence, because the 3 cannot be clearly separated since they are highly interrelated.
o Achievement tests: measure previous learning; e.g. a spelling achievement test measures how many words you can spell correctly.
o Aptitude tests: measure potential for learning or acquiring a specific skill; e.g. a spelling aptitude test measures how many words you might be able to spell given a certain amount of training.
o Intelligence tests: measure a person’s general potential to solve problems, adapt to changing circumstances, think abstractly and profit from experience.
Personality tests: measure overt and covert dispositions of the individual (e.g. the tendency of a person to behave a certain way in a given situation).
o Structured tests (aka objective tests): provide a statement, of the ‘self-report’ variety, and require the subject to choose between 2 or more alternative responses.
o Projective personality tests (aka unstructured): either the test materials or the required response (or both) are ambiguous. One example is the Rorschach Inkblot Test. These assume that a person’s interpretation of the stimulus reflects her unique characteristics.
Principles, Applications and Issues of Psychological Tests
Reliability: accuracy, dependability, consistency or repeatability of test results. Basically, it’s the degree to which test scores are free from error.
Validity: meaning and usefulness of test results. Basically, it’s the degree to which a certain inference based on a test is appropriate.
Test administration: how the test is being given. Some tests are easy to give, others are not.
Interview: method of gathering information through verbal interaction, such as direct questions.
Historical Perspectives
The Chinese had a sophisticated civil service testing program 4000 years ago. Tests were given every 3rd year to help determine promotion decisions. By the arrival of the Han Dynasty, test batteries (2 or more tests used in conjunction) were common. The British copied them in 1832 through missionary works, for their own civil service employment.
1st line of inquiry (work of Darwin, Galton and Cattell): Testing became more focused on individual differences after Charles Darwin’s publication of the Origin of Species in 1859. Darwin’s theory was that the most adaptive characteristics of species survived and evolved.
Galton (Darwin’s relative) took this idea and applied it to people in his book Hereditary Genius. He focused mainly on sensory and motor differences (reaction time, visual acuity, physical strength…).
2nd line of inquiry (work of Herbart, Weber, Fechner and Wundt): Another area of interest was human consciousness. Herbart explored it with mathematical equations. Weber followed him and tried to determine thresholds (minimum stimulus necessary to activate a sensory system). Fechner then built on that idea and determined that the strength of the sensation grows as the logarithm of the stimulus intensity. Wundt then promoted this work and is known as the founder of the science of psychology. This further led into experimental psychology.
The actual reason that we have tests is not these 2 lines of inquiry; it is the need to classify the mentally and emotionally handicapped. The first test was created at the start of the 1900s in France, by Binet and Simon under government authority. It was expected to be a general intelligence test to classify and help the intellectually subnormal.
Evolution of Intelligence and Standardized Achievement
The Binet-Simon scale had a standardization sample of 50 children. Many people take this for granted, but since the 50 children were rich white ones, the test doesn’t apply fairly to people who do not fit that description (e.g. black, Hispanic, poor…). Binet was aware of this though, so he tried to increase the size and representativeness of the sample. A representative sample is one that comprises individuals similar to those for whom the test is to be used. In 1908, the scale was improved to include more children (200 of them) and it also determined the child’s mental age (measurement of the child’s performance on the test relative to other children of that particular age group).
One of the most important trends in testing: striving towards better tests. The version that reached and remained in the US is Terman’s 1916 version, labeled the Stanford-Binet Intelligence Scale.
Testing became even more popular in WW1 because people wanted to recruit men who were emotionally and intellectually stable. However, the Binet test wasn’t efficient, since large-scale tests were needed, not individual ones. The army then reached out to the APA’s president, Yerkes. He created 2 tests: Army Alpha (for those who can read) and Army Beta (for the illiterate).
Achievement tests became popular then because they were MCQ (versus essays). They were also easy to administer and score since they lack subjectivity. Currently, however, people have returned to favoring written tests.
As a result of the countermovement against tests, the Wechsler-Bellevue Intelligence Scale was created. Unlike the Binet scale, it produced more than 1 score (the modern IQ score), so it showed a pattern of the individual. It was also better because it focused less on language.
Before and after WW2, personality tests also began to blossom. These measured traits (relatively enduring dispositions – tendencies to think, act, or feel in a certain manner – that distinguish individuals). The first personality test was the Woodworth Personal Data Sheet, made of structured true or false questions. The motivation to create personality tests was military screening. The problem with the early personality tests was that they assumed participants never lied, and that they had the same understanding of what the question asked as the administrator. Afterwards, projective tests came to light, but people tended to prefer the Thematic Apperception Test to the Rorschach Inkblot test, believing it to be more scientifically valid. This ended in the 1940’s.
Emergence of New Approaches
In 1943, the Minnesota Multiphasic Personality Inventory (MMPI; another structured personality test) became popular. This is because it emphasized the need for empirical data.
More personality tests were then made, and they were based on the statistical procedure factor analysis (method of finding the minimum number of dimensions – ‘factors’ include characteristics and attributes – to account for a large number of variables). Cattell used this to determine the 16 Personality Factors.
Period of Rapid Changes in Status of Testing
By 1949, clinical psychology students became certified. Testing was the major function of the psychologist, but it was the physician who provided the psychotherapy. Psychologists started to blame ‘testing’ for their secondary role in the health industry, and so testing as a career became more disliked, and less of an option.
Current Environment
Major branches of psychology emerged in the 2000s: forensic, neuropsychological, health, and child. These fields grew around using tests.

Chapter 21
Professional Issues Shaping the Field of Testing
Theoretical Concerns: these focus on reliability (dependability) issues of test results. According to the APA and other research associations, unreliable tests are unstable, and so are meaningless. Basically, the problem with tests is that either they aren’t precise enough to measure determinants of human behavior, or current understanding is not precise enough to make accurate predictions. Tests are no better than the theories they are based on.
o Assumption 1: Saying that a test has reliability means the test results are attributable to a systematic source of variance, and so are in themselves stable. In describing the functioning of a person, psychologists are saying that these functions are stable (even if they’re short-term), regardless of situation or environment. What they measure is something that exists in absolute terms. Psychologists assume that the characteristic itself does not vary, such that if they’re measuring a stable characteristic of the person and it doesn’t come out as stable, it would be because of test error (measurement/instrumental error), not because the person has changed. So, making sure the test has no errors is important. This assumption isn’t entirely correct because researchers cannot fully attribute errors to measurement errors nor to people’s fluctuations.
o Assumption 2: Most tests assume that we can measure human characteristics independently of the context in which they occur, but this has no scientific support.
Adequacy of Tests: Shakow is the father of clinical psychology, and even he admits that we haven’t yet reached our goal of providing objective assessments of psychological functioning or personality. We know we don’t have the perfect tests, but they are adequate for now. How tests are used is determined by law (e.g. if SATs consistently underrepresent blacks, maybe they should be re-examined for bias).
Actuarial vs. Clinical predictions: Several studies showed that actuarial numbers (such as number of arrests, severity of crime…) are better predictors of recidivism than are clinical judgments and tests. Yet other studies showed the opposite finding. Computer usage in testing also raises a few problems: people may take the test lightly, the software may be inappropriate, or clinician involvement may be lacking. Either way, ethical guidelines specify that it’s the clinician’s responsibility to make sure of appropriateness. Also, results showed that most clinicians don’t want to use technology in testing anyway.
Moral Issues Shaping Field of Testing
Human Rights:
o One of these rights is the right to not be tested. Yet the APA specified 3 cases when this can be overridden: testing is mandated by law, informed consent is implied (e.g. if testing is part of educational activities), or the test’s purpose is to evaluate decisional capacity.
o Another right is the right to know one’s test scores and interpretations. It’s important to keep scores secure, but if there is something the public needs to know (such as biases, or that their life is harmful and they should change it), it needs to be told.
o A recent right is the right to know who will have access to the data.
o Another recent one is the right of confidentiality. The APA requires physicians to inform test takers of the confidentiality issues of taking a test online.
o The test interpreters have an ethical obligation to make sure their test takers know their rights.
Labeling: It is common to give labels to people after a diagnosis has been made, but some disorders have become associated with negative stigma (e.g. schizophrenia, AIDS…). Some labels are also self-fulfilling, such that they prevent people from receiving appropriate treatment (e.g. being labelled a chronic schizophrenic). Another effect of labeling is that those who are psychiatrically disturbed are seen as being so because of low control over their bodies. The labeling process makes treatment more difficult by lowering their tolerance for stress.
Invasion of privacy: people started being suspicious of tests, but 2 lines of defense were put forward by Dahlstrom. One
was that psychological tests are so limited that they cannot invade anybody’s privacy. The other was that the notion of invasion of privacy is itself unclear. Knowing private information about a person will only be a problem if it’s used inappropriately. So, participants now have the right to know the limits of confidentiality (e.g. if the results show that the person may cause harm), and that these results may be used in court if necessary, such as when the person is dangerous.
Divided Loyalties: the question is, who is the client – the individual or the industry that ordered the test? For example, a firm hires a psychologist to eliminate potential employees who can’t handle stress well. The psychologist should maintain the secrecy of why specific people aren’t hired, but should also explain to the people why such a decision was made. If the psychologist does tell the person why the decision was made, the person can tell other people, who will then outsmart the test. Right now, this problem is solved by letting the psychologist tell the clients the purpose of the test, and refrain from exposing personal information of clients to the firm.
Responsibilities of test users and constructors: A test that is valid and reliable for one group may not be so for another group. So, test constructors must take this into account. Also, when interpreting results, psychologists have to take into account the characteristics of the person they measured. Test users are also required to know why they’re using the test and the consequences that may arise. Test users must always ask if the test is any good as a measure of the characteristic being measured (is it reliable and valid?) and if the test should be used for the purpose specified (is it ethical?).
Social Issues Shaping the Field of Psychology
Dehumanization: there is always the thought that the computers, and the data sheets we insert into them, are making our life choices; ‘We are not considered human anymore’. Computerized test interpretations threaten our humanity, and so they must be implemented carefully, always open to scrutiny.
Usefulness of Tests: sometimes the theories behind a test may be incorrect, but they still lead to useful predictions (this is like how the sun was believed to revolve around the earth. The formulae created were based on wrong theories but predicted things correctly, so they were useful). Tests don’t need to be perfect, they just need to be useful.
Access to psychological testing sources: fees to run, administer and interpret tests are really high (15,000-25,000). The Wechsler tests are 1,100. Because they are so expensive, it’s hard for people to have access to them. People are now trying to have insurance cover them to make them more accessible, but that will exclude some tests, and so tests will now be chosen based on usefulness.
Current Trends
Proliferation of New Tests: new tests keep getting created because of professional disagreement over the best strategy to measure human characteristics. Another reason is that people are always trying to make tests less biased. A final reason is that people can profit financially from making tests, and so they take advantage of it. Nontraditional tests are being made because they reflect the role of scientific psychology, as well as the attempt to integrate tests into other fields of psychology.
Higher Standards, Improved Technology and Increased Objectivity: The APA published guidelines to be followed by test conductors, so now everything is getting better than before. Better tests are also being created because of the ease with which statistical measures can be performed, due to technology.
Greater public awareness and influence: public awareness led to increased demand for psychological services. The demand is balanced by the tendency towards legislative regulations and policies (e.g. those that restrict using intelligence tests to diagnose retardation). The best product of increased awareness, however, is the extra focus on safeguarding human rights. This has decreased the probability of test misuse and abuse.
Computer and Internet Applications: computers are being used to administer, score and interpret psychological tests. The use of computers extends to all types of tests, including behavioral assessments.
Future Trends
Psychological testing is predicted to continue existing for a long time. Tests are an important part of the psychologist’s career, and they are a profitable industry. The future will also likely see the development of many
more tests (probably better, and based on older ones). The Wechsler and Binet intelligence tests are not likely to remain dominant. One test already rising to challenge them is the Kaufman Assessment Battery for Children. As for structured personality testing, the MMPI-2 seems to be the premier test for years to come. The Rorschach test’s future is uncertain, especially since it’s based on Freud and his theoretical psychometric beliefs. Also, the inkblots of the test are now available online, and so future test takers can see them and understand their interpretations (so this defeats the purpose). The future of the Thematic Apperception Test (TAT) is even more difficult to predict. This is because although it has an incredibly extensive research base, its stimuli are

One thing that will not change is the argumentative nature of psychologists: they will always be debating which is the best test to use.
Today, integration of concepts from experimental cognitive psychology, computer science, neuroscience and psychometrics is rapidly shaping the field. Multimedia computerized tests form the most recent generation of assessment instruments. Computers offer an unlimited scope in developing games (from interactive virtual realities to helping in desensitization with phobias). The computer is going to play an important part in the future of psychology.

Chapter 2
Why We Need Statistics
The outcome of a test is usually represented as a score.
Statistical methods serve 2 purposes in understanding science:
1. Descriptive statistics: methods used to provide a concise description of a collection of quantitative information.
2. Inferential statistics: methods used to make inferences from observations of a small group (sample) to a larger group (population). Statistics can be used to make inferences (logical deductions about events that can’t be observed directly). First comes a period of exploratory data analysis (gather and display cues), then comes confirmatory data analysis (cues are evaluated against rigid statistical rules).
Scales of Measurement: Properties
Measurement: application of rules for assigning numbers to objects. There are 3 important properties:
Magnitude: property of moreness. A scale has this attribute if we can say that one item represents more or less than another. However, there are instances where this is not needed (e.g. if a coach labels teams as team 1, team 2 …, it doesn’t mean one is more than the other; they’re just labels).
Equal Intervals: a scale has this property if the difference between 2 points at any place on the scale has the same meaning as the difference between 2 other points that differ by the same number of scale units (e.g. on an inch measure, 2 and 4 are the same distance apart as 10 and 12; this doesn’t work for IQ measures). A scale with this property has measurement units that fall on the equation y=mx+b (a straight line).
Absolute 0: being on absolute 0 means there’s nothing beyond this that exists. For biology, to say a person has 0 heartbeats is absolute (the person has no heart rate at all), but for psychology, defining an absolute 0 is harder (what does it mean that a person lies at 0 on the shyness scale?).
Scales of Measurement: Types
Nominal scales: not really scales at all; their only purpose is to name objects. They’re used when the information is qualitative (mostly used by social scientists to categorize). This scale has none of the 3 properties mentioned above.
Ordinal Scale: allows you to rank individuals but not say anything about the meaning of the differences between their ranks. This scale has the property of magnitude, but not equal intervals or an absolute zero. For example, scales that rank by height. IQ tests also lie here.
Interval scale: these scales have magnitude and equal interval properties, but not absolute 0. This means that we can’t make any statements about ratios (e.g. the Fahrenheit scale is an interval scale because it has everything but an absolute 0, so we know that 212 is warmer than 22, but we can’t say that 50 is double 25). Celsius scales are also here because, although they have a measure of 0, an absolute 0 means that what we are measuring doesn’t exist (but for Celsius, 0 means it’s just low; temperature still exists).
Ratio Scale: has all 3 properties. Continuing on with
the temperature, the only scale that is ratio is the Kelvin scale. Its 0 means all molecular activity has stopped.
Scales of Measurement: Permissible Operations
Nominal data can only use frequency distributions, but no math. Ordinal data can be manipulated by arithmetic but it’s difficult to interpret (you can know the rank, but not the individual information). For interval, you can speak about differences but not ratios, and for ratio scales, you can use anything.
Frequency Distributions
Frequency distribution: displays scores on a variable to reflect how frequently each value was obtained. Usually, scores are on the horizontal axis, and the vertical axis says how many times each value on the horizontal axis was observed (frequency). Usually (but not always) the distribution is symmetrical. When distributions aren’t symmetrical, they are known as skewed to one side. A distribution is positively skewed if the tail falls off on the higher side of the x-axis. Income is an example of a distribution with a positive skew (some people have really high incomes, and so the tail will be towards the higher side of the x-axis).
A frequency polygon: points are placed on a graph to represent the frequency and they are then connected.
For either frequency distribution or polygon, you need to specify the width of the class interval (intervals that share the same frequency; they’re chosen based on convenience for the researcher).
Percentile Ranks
Percentile Ranks: tell you what percent of people fall below a particular score; the specific case you’re looking at is excluded. To calculate it, you need to determine how many cases fall below the score of interest (B), how many people are in the whole group (N), and then divide the people who have less by the total and multiply by 100: $P_r = \frac{B}{N} \times 100$. For example, finishing 62nd in a race of 50,000 people means 49,938 runners finished behind you, so 49,938/50,000 = 0.9988. That multiplied by 100 is 99.88, so you’re in the 99.88th percentile. The percentile ranking depends on the number of cases you’re looking at.
Percentiles: specific scores or points within a distribution. The difference between percentile and percentile ranking is that percentile is in raw score units. So, if the infant mortality rate is 4.22 out of 1000, and there are 13 countries with worse mortality rates, the percentile is 4.22/1000 and the percentile rank is 72.
Describing Distributions: Mean
Statistics are used to summarize data.
Variable: score that can have different values.
Mean: arithmetic average score in a distribution. Equation: $\bar{X} = \frac{\sum X}{N}$ (instead of $\bar{X}$, use $\mu$ for the population). This means the sum of all values of the variable X divided by the total number of cases in the sample.
Describing Distributions: Standard Deviation
Standard Deviation: approximation of the average deviation around the mean. Equation: $\sigma = \sqrt{\frac{\sum (X - \mu)^2}{N}}$. The variance equation is the same without the square root, i.e. $\sigma^2 = \frac{\sum (X - \mu)^2}{N}$. The difference between standard deviation and variance is that variance is the average squared deviation around the mean, so it is in squared units and is hard to interpret directly. These 2 have many advantages: knowing them allows us to make precise statements about the distribution. The formulas above are for a population. When we want to talk about a sample, instead of $\sigma$ we use $S$, and instead of N we divide by N-1: $S = \sqrt{\frac{\sum (X - \bar{X})^2}{N - 1}}$. We use N-1 to recognize that the variance of a sample is only an estimate of the variance of the population (side note: Roman letters are used for samples and Greek letters are used for populations).
Z Score
Means and standard deviations still don’t convey much information. A Z score transforms data into standardized units that are easier to interpret: $Z = \frac{X_i - \bar{X}}{S}$. Basically, a Z score is the deviation of a score from the mean in standard deviation units. If the score is equal to the mean, Z is 0; if more, Z is +ve; and if less, Z is –ve.
CES-D is a general measure of depression that has been used in epidemiological studies. Scores on this test range from 0-60, with scores of more than 16 meaning that you have clinically significant levels of depression. This test doesn’t have high validity (if you score less than 16, you’re not depressed for sure, but if you’re higher than that, it doesn’t necessarily mean that you are depressed).
Standard Normal Distribution
Symmetrical binomial probability distribution: the x axis has Z scores and the y axis has frequency. Transforming scores into Z scores means that their mean is now 0 and the SD is 1. It also means that we can easily predict proportions of cases as well as the percentile ranks. These methods can only be used when the distribution of scores is normal. Non-normal statistics are referred to as nonparametric statistics.
Stanine: literally a combination of ‘standard nine’.
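As a rough sketch of the mean, standard deviation, Z score and percentile rank definitions in these notes, the following Python uses only the standard library. The score list and the function names (percentile_rank, z_score, normal_percentile) are made up for illustration, and NormalDist only gives a meaningful percentile under the assumption that the scores are normally distributed.

```python
from statistics import NormalDist, mean, pstdev, stdev


def percentile_rank(scores, score):
    """P_r = B / N * 100, where B counts the cases strictly below `score`."""
    below = sum(1 for s in scores if s < score)
    return below / len(scores) * 100


def z_score(score, center, spread):
    """Deviation of `score` from `center`, expressed in units of `spread`."""
    return (score - center) / spread


def normal_percentile(z):
    """Percent of a standard normal distribution that falls below z."""
    return NormalDist().cdf(z) * 100


# Hypothetical scores, made up for illustration only.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

mu = mean(scores)        # sum of X divided by N
sigma = pstdev(scores)   # population SD: divides by N
s = stdev(scores)        # sample SD: divides by N - 1, so it is a bit larger

print("mean:", mu)
print("population SD:", sigma, "sample SD:", round(s, 3))
print("percentile rank of 7:", percentile_rank(scores, 7))
print("Z score of 7:", z_score(7, mu, sigma))
print("percentile for Z = 0.707:", round(normal_percentile(0.707), 1))
```

Keeping the count-based percentile rank separate from the normal-curve percentile is deliberate: the first works for any distribution, while the second depends on the normality assumption.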
Chapter 2
Percentiles and Z Scores
Applying the Z score to exams: assume you got a 60 in the class. The mean was 55.70 and the standard deviation was 6.08. This means your Z score would be (60-55.70)/6.08 = .707. If you look at appendix 1, you’d find that that Z score corresponds to the 76th percentile, and that would mean that out of every 100 students, 76 people scored lower than you. This assumes that the class distribution is normal.
McCall’s T
McCall was a mathematician who tried to develop a system to derive equal units on mental quantities (one set of scores that could be applied to other situations without standardizing the set); instead he developed an alternative to the Z score system. He suggested a random sample of 12-year-olds be tested and their scores obtained. He then converted their raw scores to their percentile equivalents and set the mean (50th percentile) at 50, and the standard deviation at 10 (this is McCall’s T). So, the system is different from Z scores in that the mean is 50 (instead of 0) and the SD is 10 (instead of 1). To convert a Z score to a T: T=10Z+50.
There is nothing special about this transformation; you can manipulate the Z scores in whatever way you like. One test that does this is the SAT. It uses a mean of 500 and SD of 100. Before 1995, the scores were compared to the original group of students who took the test. Afterwards, it was made more accurate by standardizing the tests.
Standardization vs. Normalization: McCall’s T and similar systems are standardizations in that if the results were skewed before, after the linear transformations, they’ll still be skewed. Normalization changes the characteristic of the distribution.
Quartiles and Deciles
Quartiles: divide the percentage scale into 4 groups; the 2nd quartile is the median (half the scores are above and half are below). The interquartile range is the interval of scores bounded by the 25th and 75th percentiles; it represents the middle 50%. The quartiles are Q1, Q2 and Q3.
Deciles: divide the percentages into 10 groups. The deciles are denoted as D9 (point below which there are 90% of the scores), D8, D7…
Stanine system: developed in WW2 for the US Air Force; it converts any set of scores into a transformed scale ranging from 1-9 with a mean of 5 and SD of 2. Its only advantage was that stanines take up only one column on computer cards, but now that we don’t use those cards, we don’t really need the system. Stanines are distributed such that 1 has the bottom 4%, 2 has the next 7%, 3 has the next 12%, 4 has the next 17%, 5 has the next 20%, and it starts going down again (it’s a curve). So to find out which stanine is your own, convert the raw score into a Z score, find which percentile your Z score falls on, and then look at the stanine table to find out which one you’re in.
Norms
Norms (aka mean or 50th percentile): performances by defined groups on particular tests. For a test, norms are based on the distribution of scores obtained by some defined sample of individuals.
Most intelligence tests are transformed to have a norm of 100 and SD of 15.
Controversies:
A troubling issue in psychological testing is that different ethnic groups have different norms. So, if tests were used as screening for employment, overselection would happen (selecting a higher percentage from a particular group than would be expected on the basis of representation; e.g. if 60% of applicants are white, and 75% of those hired are white, there was overselection). To correct this problem (especially with GATB tests in the US), applicants were compared only to norms of their own ethnic group. This eliminated overselection but created the problem that although 2 people may have the same score, their raw scores would be different and so selecting between the 2 would be difficult. The National Academy of Sciences promotes the use of separate testing norms, but civil law declared it illegal to do so.
Age-Related Norms
Some tests have different norms for particular age groups. The IQ tests are like this. So, when applying an IQ test, the tester’s task is to determine the mental age of the person.
Tracking
Experts have discovered that children at the same age level tend to go through different growth patterns (e.g. children who are born small tend to stay small and grow at a slower pace). So, pediatricians who are trying to determine if a child’s height is below or above average need to know more than age; they also need to know the child’s percentile within a given age group. Tracking is the tendency to stay at about the same level relative to one’s peers. The tracking system is useful in medicine, for example to track the health and nutrition of a person. However, when it comes to education, it’s controversial. Some people believe that intellectual growth parallels physical growth, such that people who are originally below their peers will grow intellectually slower. However, others believe that this system discriminates against children.
Criterion-Referenced Tests
Norm-referenced test: compares each person with a norm. People have argued that such tests force competition among people.
Criterion-referenced test: describes the skill, task or knowledge that the test taker can demonstrate. The results of these tests would not be used to make comparisons, but to design an individualized program of instruction. It was popular in the 1990s as a humanistic trend, but many schools showed that there was a large number of students not succeeding, and so Obama declared it a conservative approach and removed it to enforce higher standards. This however raises the problem of the arbitrariness of the cut-off points for passing high-stakes tests.
Within High School Norms for University Admission
The University of California noticed that it has underrepresentation of minorities in its classes, and so decided to reform. Instead of looking at SAT scores
cut-off point, teachers began to ‘teach the test’ (i.e. they focused on teaching only subjects that will be tested, like math and science, and ignored art). Also, it was found that teachers came under so much pressure to ‘produce’ excellent students that they started cheating the system.
Are 4th Graders smarter than 3rd Graders?
In California, the standardized test they use is the STAR. The test reveals indicators of school performance, so it’s a serious test. Results of students from 2005, 2006 and 2007 are graphed, and a trend arises: 3rd graders always perform poorly, and 4th graders always excel. The best explanation is that the tests for 3rd graders may be too hard for them, while the tests for the 4th graders are too easy. This is another problem of standards-based testing based on arbitrary cut-off points.

Chapter 4
History and Theory of Reliability
Errors of measurement: discrepancies between true ability and measurement of ability. In psychology, ‘error’ does not imply that a mistake has been made, but that there will always be inaccuracy. Our goal is to find the magnitude of error and minimize it. Tests with minimal error are considered reliable.
Conceptualization of Error
The issue of error is most prominent in psychology because what we measure is not easily observed. It’s like we have a ‘rubber yardstick’ as our instrument, and what we are trying to measure may be underestimated or overestimated depending on whether the yardstick has stretched or shrunk.
Spearman’s Early Studies
(which would compare students with each other across Spearman was the first to develop reliability assessment
the state), they decided to take the top 4% of students methods, the idea he had taken from De oivre’s
in each high school (so as to compare students only concept of sampling error (a century earlier) and
within each high school). This actually increased the Pearson’s product moment correlation.
underrepresentation and the program was abandoned. Thorndike furthered this idea by introducing several
No Child Left Behind coefficients into the equations. Later, Cronbach
The No Child Left Behind (NCLB) Act was initiated by developed more advance methods for evaluating many
Bush with the justification that each child should sources of error in behavioral research. Currently, the
receive a quality of education. The law required that theories have evolved to include computer technologies
each child be tested with standardized tests and that through the item response theory (IRT).
schools make the results public. This seemed to be a Basics of Test Score Theory
good idea, but it soon became evident that since the Classical test theory has been around for 100 years and
tests are standard based and are based on an arbitrary it assumes that each person has a true score that would be obtained if there were no errors in measurement. Another problem is that this model assumes that
Equation: X=T+E (x is observed score, T is true score, characteristics are constant over time, and variations
and E is error). are assumed to be errors.
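To make the X = T + E equation concrete, here is a minimal Python sketch. It rests on assumptions that are not in the notes: the function name simulate_observed_scores is hypothetical, the true score of 100 and error standard deviation of 5 are made-up illustration values, and errors are drawn from a normal distribution with mean zero. It is a toy simulation, not a procedure for scoring a real test.

```python
import random
import statistics


def simulate_observed_scores(true_score, error_sd, n_administrations, seed=0):
    """Return observed scores X = T + E for repeated administrations.

    Assumption (illustrative only): each error E is drawn from a normal
    distribution with mean 0 and standard deviation error_sd.
    """
    rng = random.Random(seed)
    return [true_score + rng.gauss(0.0, error_sd) for _ in range(n_administrations)]


# Hypothetical values chosen only for illustration.
TRUE_SCORE = 100.0
ERROR_SD = 5.0

observed = simulate_observed_scores(TRUE_SCORE, ERROR_SD, n_administrations=10_000)

# Recover each error term from the model: E = X - T.
errors = [x - TRUE_SCORE for x in observed]

# With purely random errors, the mean observed score should sit close to T,
# and the spread of the errors should sit close to ERROR_SD.
mean_observed = statistics.mean(observed)
sd_of_errors = statistics.pstdev(errors)

print(f"mean observed score: {mean_observed:.2f}")
print(f"standard deviation of errors: {sd_of_errors:.2f}")
```

In this sketch no single observed score equals the true score, but the average over many simulated administrations comes close to it, because random errors in both directions tend to cancel.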
Another assumption is that errors of measurement are random. Systematic errors are acknowledged, but they matter less (e.g. if a carpenter constantly misreads 2cm as 1cm, his wood won't be the right size, but the pieces will all be the same size). The distribution of random errors is bell-shaped, with the center of the distribution representing the true score. The true score may be better estimated by finding the mean of observations from replicated applications.
The basic measure of error is the standard error of measurement (the standard deviation of errors, since we assume that the distribution of random errors will be the same for all people).
Domain Sampling Model
Test scores gain reliability as the number of items increases.
Central concept in classical test theory. It considers the problems created by using a limited number of items to represent a larger, more complicated construct.
Reliability is conceptualized as the ratio of the variance of the observed score on the shorter test to the variance of the long-run true score. We're looking at the error of using a sample of items as opposed to the entire domain (say we want to check your spelling: we should really get you to spell every word in the dictionary, but nobody has time for that, so we get you to spell a sample of words).
Reliability can be estimated from the correlation of the observed test score with the true score. Since we can't practically get the true score, we can instead create many tests by sampling from the same domain and get a normal distribution. We would then compute the correlations between the tests and average them (technically, you should use Fisher's r-to-Z′ transformation, average the Z′ scores, and convert the mean back to a correlation).
The correlation between 2 randomly parallel tests would be expected to be less than the correlation of either test with the true score. Equation: r_1t = √(average r_1j), where r_1t is the correlation of test 1 with the true score and average r_1j is the mean correlation of test 1 with the other randomly parallel tests.
Criticism of Classical Test Theory: It requires the same test items to be administered to each person. For example, an intelligence test has only a small number of items that concentrate on any one individual's ability level (the others may be too hard or too easy), so the reliability for that individual is not very strong.
Item Response Theory
IRT has been developing over the past decade. The computer is used to focus on the range of item difficulty that helps assess an individual's ability level (so basically, you get asked a few questions and if you keep getting them right, the computer asks more difficult questions, and if you get a bunch wrong, the computer drops you to an easier level). This way, a person's skill is sampled fairly. One difficulty: the method requires a bank of items that have been systematically evaluated for difficulty level. Also, considerable effort is needed for test development.
Models of Reliability
The federal government mandates that all tests be reliable before use for employment or education purposes.
Sources of Error
Loud noises in the room
Room temperature isn't suitable
Test-takers may be feeling down
Items on the test might not be representative of the domain (e.g. you can spell 96% of dictionary words, but the 20-item test included 5 items (25%) that you could not spell)
Measuring test reliability:
Time Sampling: Test-Retest Method: used to evaluate the error associated with giving a test at 2 different times. This method is appropriate only when we measure traits that do not change over time (e.g. IQ is treated as constant, but what the Rorschach inkblot test measures is not). It's an easy way to test reliability: just give the test twice and find the correlation between the two sets of scores. You have to watch out for:
o Carryover effect: when the first testing session influences scores from the second testing session. When carryover effects are present, the test-retest correlation overestimates reliability. This isn't really a problem when the change is systematic (as opposed to random). If testing affects all test takers equally, it's not an error.
o Practice effects (a type of carryover effect): some skills improve with practice, so taking the test once may sharpen those skills (such as dexterity skills) and change the second score. This is considered a source of error because practice affects
people differently. Because of these problems, the time interval between tests must be selected and evaluated carefully. If the interval is too short, then we'll have to deal with carryover and practice effects, but if it's too long, then a low correlation has 3 possible explanations (is it because the test isn't reliable, the characteristic actually changes over time, or some combination of both?).
Item Sampling: Parallel Forms Method: Parallel forms reliability compares 2 equivalent forms of a test that measure the same attribute (same difficulty). The Pearson product moment correlation coefficient is used. The method tries to make sure that the test is reliable, such that the error variance is not attributable to the selection of one set of items. The tests can be

between 2 items is 0 when the items measure different things, but when the distribution is skewed, alpha usually gives a value. So now, people are starting to use coefficient omega instead (omega estimates the extent to which all items measure the same underlying trait).
KR20 Formula
Kuder and Richardson advanced reliability assessment by developing methods for evaluating reliability within a single test administration. Their method is better than using the split-half for assessing internal reliability because it simultaneously considers all possible ways of splitting (instead of splitting arbitrarily). KR20 assess