PSYC 3090 Chapter 8 Notes
York University – PSYC 3090 – Krista Trobst
CHAPTER 8 – TEST DEVELOPMENT

Test development: umbrella term for all that goes into the process of creating a test
1. Test conceptualization
a. Early stage in the test development process wherein the idea for a particular test or test revision is first conceived
2. Test construction
a. Stage in the process of test development that entails writing test items as well as formatting items, setting scoring rules, and otherwise designing and building a test
3. Test tryout
a. Stage in the process of test development that entails administering a preliminary version of a test to a representative sample of testtakers under conditions that simulate the conditions under which the final version of the test will be administered
4. Item analysis
a. General term to describe various procedures, usually statistical, designed to explore how individual test items work as compared to other items in the test and in the context of the whole test
5. Test revision
a. Action taken to modify a test's content or format for the purpose of improving the test's effectiveness as a tool of measurement
-revision + tryout are often repeated
-although this sequence is very typical, there are many exceptions to it

Test Conceptualization
-the idea for a test may come after reviewing the literature
-from an emerging social phenomenon
-from the need to assess mastery in an emerging occupation
-whenever a need of any kind arises

Preliminary Questions
-what is the test designed to measure?
-what is the objective of the test?
-is there a need for the test?
-who will use the test? who will take the test?
-etc.

Norm-Referenced vs. Criterion-Referenced Tests
-different approaches to test development + item analysis
-a good item for a norm-referenced test: an item that high scorers on the test as a whole answer correctly and low scorers answer incorrectly
-the same pattern can occur for criterion-referenced items, but it is not what makes the item good
-testtakers need to meet the criterion, not necessarily be first in class, although there are exceptions
-example of a criterion: licensing
-criterion-referenced test development uses a group that has mastered the skill and a group that has not in order to evaluate items

Pilot Work
Pilot work: the preliminary research surrounding the creation of a prototype of the test
-determines how best to measure the targeted construct

Test Construction

Scaling
Scaling: the process of setting rules for assigning numbers in measurement
-L.L. Thurstone is credited with creating methodologically sound scaling methods; he also introduced absolute scaling, a procedure for obtaining a measure of item difficulty across samples of testtakers who vary in ability

Types of Scales
-age-based scale, grade-based scale, stanine scale, unidimensional/multidimensional, comparative or categorical

Scaling Methods
-the testtaker is presumed to have more or less of the characteristic measured by a valid test as a function of the test score
Rating scale: a grouping of words, statements, or symbols on which judgments of the strength of a particular trait, attitude, or emotion are indicated by the testtaker
Summative scale: the final test score is obtained by summing the ratings across all the items
Likert scale: 5 alternative responses (sometimes 7), usually on an agree-disagree or approve-disapprove continuum
-usually reliable
-results in ordinal-level data
-scales can be unidimensional or multidimensional
-unidimensional -> one dimension is presumed to underlie the ratings
-multidimensional -> more than one dimension is thought to guide the responses
Method of paired comparisons: scaling method whereby one of a pair of stimuli is selected according to a rule
Comparative scaling: entails judgments of a stimulus in comparison with every other stimulus on the scale
Categorical scaling: stimuli are placed into one of two or more alternative categories that differ quantitatively with respect to some continuum
Guttman scale: another scaling method that yields ordinal-level measures; items range sequentially from weaker to stronger expressions of the attitude, belief, or feeling being measured
-all respondents who agree with the stronger statements of the attitude will also agree with the milder statements
Scalogram analysis: an item-analysis procedure and approach to test development that involves a graphic mapping of a testtaker's responses
-the goal is an arrangement of items wherein endorsement of one item automatically connotes endorsement of less extreme positions
-this is not always possible
-all of the methods above yield ordinal data
-method of equal-appearing intervals (Thurstone): scaling method used to obtain data that are presumed to be interval in nature
-steps:
1. A reasonably large number of statements reflecting positive and negative attitudes toward something are collected
2. Judges evaluate each statement in terms of how strongly it expresses the attitude; each judge is instructed to rate each statement on a scale as if the scale were interval in nature
3. The mean and standard deviation of the judges' ratings are calculated for each statement
4. Items are selected for inclusion in the final scale based on several criteria, including:
a. the degree to which the item contributes to a comprehensive measurement of the variable in question
b. the test developer's degree of confidence that the items have indeed been sorted into equal intervals
5. The scale is now ready for administration
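Steps 2–5 above can be sketched in code. This is a minimal illustration with hypothetical judge ratings and one assumed selection rule (keep statements on which the judges agreed, i.e., low standard deviation); it is not a standard library routine:

```python
import statistics

# Hypothetical ratings: 5 judges rate each statement on a 1-11 scale
# (treated as if interval); higher = stronger expression of the attitude.
judge_ratings = {
    "Statement A": [2, 1, 2, 3, 2],
    "Statement B": [6, 7, 5, 6, 6],
    "Statement C": [10, 11, 10, 9, 10],
    "Statement D": [3, 9, 1, 11, 6],   # judges disagree widely
}

# Step 3: mean (the item's scale value) and SD for each statement.
stats = {
    s: (statistics.mean(r), statistics.stdev(r))
    for s, r in judge_ratings.items()
}

# Step 4 (assumed criterion): keep items whose judges agreed,
# so the scale values can be trusted.
MAX_SD = 2.0
selected = {s: round(m, 1) for s, (m, sd) in stats.items() if sd <= MAX_SD}

# Step 5: the surviving items, ordered by scale value, form the scale.
for s, value in sorted(selected.items(), key=lambda kv: kv[1]):
    print(f"{s}: scale value {value}")
```

Statement D is dropped because its large standard deviation signals that the judges did not sort it consistently along the continuum.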
-how the scale is used depends on the objectives of the test situation
-this method is a direct estimation method; in contrast to indirect estimation, there is no need to transform the testtaker's responses into some other scale
-the scaling method used depends on many factors

Writing Items
-when writing multiple-choice items, the item pool should contain about twice the number of items the final version will contain
Item pool: reservoir from which items will or will not be drawn for the final version of the test
-comprehensive sampling provides the basis for the content validity of the final version of the test

Item Format
Item format: variables such as the form, plan, structure, arrangement, and layout of individual test items
Selected-response format: requires testtakers to select a response from a set of alternative responses
-3 types of selected-response items:
1. Multiple-choice item: has 3 elements
a. stem
b. correct alternative or option
c. several incorrect alternatives or options
i. referred to as distractors or foils
2. Matching item: testtaker is presented with 2 columns, premises + responses; the testtaker must determine which response matches which premise
3. Binary-choice item: a multiple-choice item that contains only 2 possible responses, usually a true-false item
a. a good binary-choice item contains a single idea, is not excessively long, and is not subject to debate
Constructed-response format: requires testtakers to supply or create the correct answer, not merely select it
-3 types:
1. Completion item: requires the examinee to provide a word or phrase that completes a sentence
a. a good completion item is worded so that the correct answer is specific; it should not be answerable in many different ways
2. Short-answer item: a completion item that is less than 2 paragraphs long
3. Essay item: a test item that requires the testtaker to respond to a question by writing a composition, typically one that demonstrates recall of facts, understanding, analysis, and/or interpretation
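The multiple-choice elements above (stem, correct option, distractors) map naturally onto a small data structure. A minimal sketch with a made-up item, using the cumulative scoring model mentioned later in these notes:

```python
from dataclasses import dataclass

@dataclass
class MultipleChoiceItem:
    stem: str                 # the question itself
    correct: str              # the correct alternative
    distractors: list[str]    # incorrect alternatives (foils)

    def options(self) -> list[str]:
        # On a real test the options would be shuffled.
        return [self.correct] + self.distractors

    def score(self, response: str) -> int:
        # Cumulative model: 1 point for a correct response, else 0.
        return 1 if response == self.correct else 0

# Hypothetical item for illustration
item = MultipleChoiceItem(
    stem="Scalogram analysis is associated with which scale type?",
    correct="Guttman scale",
    distractors=["Likert scale", "Stanine scale", "Grade-based scale"],
)

total = item.score("Guttman scale") + item.score("Likert scale")
print(total)  # 1
```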
-a problem with essay items: they tend to focus on a more limited area than can be covered in the same amount of time with other item types, and there is subjectivity in scoring

Writing Items for Computer Administration
-digital media can store item banks + individualize testing through item branching
Item bank: relatively large and easily accessible collection of test questions
CAT (computerized adaptive testing): an interactive, computer-administered test-taking process wherein the items presented to the testtaker are based on the testtaker's performance on previous items
-tends to reduce floor + ceiling effects
Item branching: the ability of the computer to tailor the content and order of presentation of test items on the basis of responses to previous items
-can be used for achievement + personality tests
-achievement – getting certain items wrong/right -> branch to the next step
-personality – an inconsistent answer -> prompt the testtaker to be more careful next time

Scoring Items
-the model used most often is the cumulative model
Class scoring/category scoring: testtaker responses earn credit toward placement in a particular class or category with other testtakers whose pattern of responses is presumably similar in some way
Ipsative scoring: comparing a testtaker's score on one scale within a test to another scale within that same test
-can only compare intra-individually

Test Tryout
-the test should be tried out on people who are similar in critical respects to the people for whom the test was designed
-for the number of people on whom the test should be tried out, an informal rule of thumb is no fewer than 5 and as many as 10 for each item on the test
-in general, the more subjects the better
-when using factor analysis with too few subjects, phantom factors can appear – mere artifacts of the small sample size
-test tryout should be executed under conditions as similar as possible to the standardized conditions: all instructions, time limits, etc.

What Is a Good Item?
-a good test item is reliable + valid; it also helps discriminate
-i.e., it is answered correctly by high scorers on the test as a whole
Item analysis: the different types of statistical scrutiny that the test data can potentially undergo
-can be quantitative or qualitative

Item Analysis
-the statistical procedures used can become complex
-criteria for the best items differ as a function of the developer's objectives

Item-Difficulty Index
-found by calculating the proportion of the total number of testtakers who answered the item correctly
-can range from 0 to 1
-the larger the item-difficulty index, the easier the item
Item-difficulty index: in achievement or ability testing and other contexts in which responses are keyed correct, a statistic indicating how many testtakers responded correctly to an item; in contexts where responses are not keyed correct, the same statistic may be referred to as an item-endorsement index
Item-endorsement index: in personality assessment and other contexts in which responses are not keyed correct or incorrect, a statistic indicating how many testtakers responded to an item in a particular direction
-an index of the difficulty of the average test item for a particular test can be calculated by averaging the item-difficulty indices for all the test's items
-for maximum discrimination, the optimal average item difficulty is approximately .5, with individual items ranging in difficulty from about .3 to .8
-guessing must be taken into account: the optimal average difficulty should be the midpoint between 1.00 and the chance success proportion
-see page 263
Giveaway item: an item inserted near the beginning of an achievement test
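The indices above can be computed directly. A minimal sketch with a made-up 0/1 response matrix (rows are testtakers, columns are items):

```python
# Rows = testtakers, columns = items; 1 = correct, 0 = incorrect.
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 1, 0, 1],
    [1, 0, 0, 1],
]

n_takers = len(responses)
n_items = len(responses[0])

# Item-difficulty index p: proportion answering each item correctly.
p = [sum(row[j] for row in responses) / n_takers for j in range(n_items)]
print(p)  # [0.8, 0.6, 0.2, 1.0]

# Average difficulty for the whole test (item 4, at p = 1.0, never
# discriminates: everyone gets it right).
avg_p = sum(p) / n_items
print(round(avg_p, 2))  # 0.65

# Optimal average difficulty adjusted for guessing: the midpoint between
# 1.00 and the chance success proportion (here, 4-option multiple choice).
chance = 1 / 4
optimal = (1.0 + chance) / 2
print(optimal)  # 0.625
```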