Test and Measurement
Chapter 6: Writing and Evaluating Test Items
Simple guidelines for item writing.
o 1. Define clearly what you want to measure
o 2. Generate an item pool
o 3. Avoid exceptionally long items
o 4. Keep the level of reading difficulty appropriate for those who will
complete the scale.
o 5. Avoid “double-barreled” items that convey two or more ideas at the same time.
o 6. Consider mixing positively and negatively worded items
When writing items, you need to be sensitive to ethnic and cultural differences
o The Dichotomous Format
Offers two alternatives for each item; usually a point is given for the
selection of one of the alternatives.
The most common example is the true-false format.
These tests are simple, they are easy to administer, and they are
easy to score
The problem with true-false items is that students can simply memorize
the material and do very well without ever understanding it.
To be reliable, the test must include many items, because with two
alternatives the chance of guessing the correct answer is 50%.
Many personality tests require responses in a true-false format
because it requires absolute judgment.
o The Polytomous format
The polytomous format resembles the dichotomous format
except that each item has more than two alternatives.
Multiple-choice tests are easy to score, and the probability of
obtaining a correct response by chance is lower than it is for true-
false items. The test can cover a large amount of information in a
relatively short time. Incorrect choices are called distractors.
The reliability of an item is not enhanced by distractors that no one
would ever select.
Studies have shown that it is rare to find items for which more
than three or four distractors operate efficiently.
Three-option multiple-choice items are as good as, if not better
than, items that have more than three alternatives.
A correction for guessing can be applied: corrected score = R - W / (n - 1), where
R = the number of right responses
W = the number of wrong responses
n = the number of choices for each item
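The correction for guessing above can be sketched in a few lines of Python. This is only an illustration of the formula; the function name and the sample numbers are hypothetical and do not come from the chapter.

```python
def corrected_score(right: int, wrong: int, n_choices: int) -> float:
    """Corrected score = R - W / (n - 1)."""
    if n_choices < 2:
        raise ValueError("each item needs at least two choices")
    return right - wrong / (n_choices - 1)


# Hypothetical example: 30 right and 9 wrong on four-choice items.
print(corrected_score(30, 9, 4))  # 30 - 9/3 = 27.0
```

With only two choices per item, every wrong answer cancels one right answer, which reflects the 50% chance of guessing correctly on a true-false item.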
How many times have you narrowed your answer down to two
alternatives but could not figure out which of the two was
correct? In this case, we advise you to guess.
The guessing threshold describes the chances that a low-ability
test taker will obtain each score.
Essay exams can be evaluated using the same principles used for
structured tests. The reliability of the scoring procedure should be
assessed by determining the association between two scores
provided by independent scorers.
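One way to quantify the association between two scorers is a Pearson correlation. The notes do not name a specific statistic, so the choice of Pearson's r here is an assumption, and the scorer data below are made up for illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a scorer gives constant scores")
    return sxy / sqrt(sxx * syy)


# Hypothetical essay marks given by two independent scorers to five essays.
scorer_a = [4, 3, 5, 2, 4]
scorer_b = [5, 3, 4, 2, 4]
print(round(pearson_r(scorer_a, scorer_b), 2))
```

A value near 1 would suggest the two scorers rank the essays similarly; a value near 0 would suggest the scoring procedure is not being applied consistently.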
o The Likert Format
One popular format for attitude and personality scales requires
that a respondent indicate the degree of agreement with a
particular attitudinal question.
A scale using the Likert format consists of items such as ‘I am afraid ...’
Five alternatives are offered, typically ranging from strongly disagree to strongly agree.
Scoring requires that any negatively worded items be reverse
scored and the responses are then summed.
Popular in measurements of attitude.
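The scoring rule for the Likert format (reverse score the negatively worded items, then sum) might look like the sketch below. It assumes responses are coded 1 to 5; the item names and the responses are hypothetical.

```python
def score_likert(responses, reverse_items, n_points=5):
    """Sum Likert responses after reverse scoring negatively worded items.

    responses: dict mapping item name -> response coded 1..n_points
    reverse_items: set of item names that are negatively worded
    """
    total = 0
    for item, value in responses.items():
        if not 1 <= value <= n_points:
            raise ValueError(f"response for {item!r} is outside 1..{n_points}")
        if item in reverse_items:
            # On a 5-point scale: 1 <-> 5, 2 <-> 4, 3 stays 3.
            value = (n_points + 1) - value
        total += value
    return total


# Hypothetical respondent; "q2" is the negatively worded item.
answers = {"q1": 5, "q2": 2, "q3": 4}
print(score_likert(answers, reverse_items={"q2"}))  # 5 + (6 - 2) + 4 = 13
```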
o The Category Format
A technique that is similar to the Likert format but that uses an even
greater number of choices is the category format.
Example: 10-point rating systems.
10-point scales are affected by the groupings of the people or
things being rated.
People will change ratings depending on context
Experiments have shown that this context effect can be avoided if the
endpoints of the scale are clearly defined and the subjects are
frequently reminded of the definitions of the endpoints.
Reliability and validity may be higher if all response options are
clearly labeled, as opposed to just labeling the category extremes.
The number of categories required depends on the fineness of
the discrimination that subjects are willing to make.
Increasing the number of choices beyond nine or so can reduce
reliability because responses may be more likely to include an
element of randomness when there are so many alternatives that
respondents cannot clearly discriminate between them.
The optimum number of categories is between four and seven.
Visual analogue scale: the respondent is given a 100-millimeter
line and asked to place a mark between two well-defined
endpoints. The scales are scored according to the measured distance
from the first endpoint to the mark. This format is popular for
measuring self-rated health, but scoring is time consuming.
Checklists and Q-sorts
o One format common in personality measurement is the adjective checklist.
o The adjective checklist requires subjects either to endorse such adjectives
or not, thus allowing only two choices for each item
o The Q-sort can be used to describe oneself or to provide ratings of others
o A subject is given statements and asked to sort them into nine piles.
o If a statement really hit home, you would place it in pile 9; those that were
not at all descriptive would be placed in pile 1.
o Checklists have fallen out of favor because they are more prone to error
than are formats that require responses to every item.
Item analysis, a general term for a set of methods used to evaluate test items,
is one of the most important aspects of test construction
The optimal difficulty level for items is usually about halfway between 100%
of the respondents getting the item correct and the level of success expected
by chance alone.
The optimum difficulty level for a four-choice item is approximately .625.
To arrive at this value, we take the 100% success level (1.00) and subtract
from it the chance performance level (.25). Then we divide the result by 2 to
find the halfway point and add this value to the expected chance level.
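The halfway-point calculation just described can be checked with a short sketch. It assumes the chance level is 1 divided by the number of choices, which matches the .25 used above for a four-choice item; the function name is hypothetical.

```python
def optimal_difficulty(n_choices: int) -> float:
    """Halfway point between chance performance and 100% success."""
    if n_choices < 2:
        raise ValueError("an item needs at least two choices")
    chance = 1 / n_choices            # e.g. .25 for a four-choice item
    halfway = (1.0 - chance) / 2      # half the distance from chance to 1.00
    return chance + halfway


print(optimal_difficulty(4))  # 0.25 + 0.375 = 0.625
print(optimal_difficulty(2))  # 0.50 + 0.25 = 0.75
```

Under the same assumption, a true-false item has an optimal difficulty of about .75, higher than for a four-choice item because chance performance is already at .50.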
**Steps are on page 171.
For most tests, items in the difficulty range of .30 to .70 tend to maximize
information about the differences among individuals.
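As a rough illustration, item difficulty can be computed as the proportion of respondents who answer an item correctly and then compared with the .30 to .70 range. The proportion-correct definition is assumed here from the way difficulty is described above, and the response data are invented.

```python
def item_difficulties(score_matrix):
    """Proportion of respondents answering each item correctly.

    score_matrix: one row per respondent, one 0/1 entry per item.
    """
    n_respondents = len(score_matrix)
    n_items = len(score_matrix[0])
    return [
        sum(row[j] for row in score_matrix) / n_respondents
        for j in range(n_items)
    ]


def in_target_range(p, low=0.30, high=0.70):
    """True if the difficulty falls in the .30 to .70 range."""
    return low <= p <= high


# Hypothetical data: four respondents, three items (1 = correct, 0 = wrong).
scores = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 1],
]
difficulties = item_difficulties(scores)
print(difficulties)                                # [1.0, 0.5, 0.25]
print([in_target_range(p) for p in difficulties])  # [False, True, False]
```

In this made-up data set, the first item is answered correctly by everyone and the third by only one respondent, so only the second item falls in the range that best separates individuals.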
In constructing a good test, one must also consider human fact