PSYC37H3

Midterm Notes

Jessica Dere

Chapter 1

Because of diversity concerns, growing numbers of colleges no longer rely on SAT tests. The percentage of those who use it, however, went from 46 to 60.
LSAT and GRE tests are the most difficult modern tests.

What a Test Is
Test: measurement device or technique used to quantify behavior or aid in the understanding and prediction of behavior. A test never gives full understanding because it measures only a sample of your behavior and there are always errors associated with it; tests aren't perfect measures, but they add to the predictions.
Item: specific stimulus to which a person responds overtly (the response can be evaluated or scored, for example on a scale). Items are subject to scientific inquiry because they are what make up the tests.
Psychological test (aka educational test): set of items designed to measure characteristics of human beings that pertain to behavior. They can measure to what extent a person may engage in overt behavior, has engaged in overt behavior, or engages in covert behavior. The main use of these tests is to evaluate individual differences in ability and personality; they assume that differences shown on the test reflect actual differences among individuals.
Scales: these are what psychologists make up to relate raw scores on test items to some defined theoretical or empirical distribution. They exist because raw scores are easy to misinterpret: a person can get 75% in a class where everyone else got 99%, or 75% in a class where everyone else got 20%.
Test scores may be related to traits (e.g. stubbornness or shyness) or to a state (e.g. hopelessness).

Types of Tests
• Individual tests: given to one person at a time.
• Group tests: administered to more than one person at a time by a single examiner or test administrator (the person giving the test).
• Human ability tests: incorporate achievement, aptitude and intelligence, because the 3 cannot be clearly separated since they are highly interrelated.
  o Achievement tests: measure previous learning; e.g. a spelling achievement test measures how many words you can spell correctly.
  o Aptitude tests: measure potential for learning or acquiring a specific skill; e.g. a spelling aptitude test measures how many words you might be able to spell given a certain amount of training.
  o Intelligence tests: measure a person's general potential to solve problems, adapt to changing circumstances, think abstractly and profit from experience.
• Personality tests: measure overt and covert dispositions of the individual (e.g. the tendency of a person to behave a certain way in a given situation).
  o Structured tests (aka objective tests): provide a statement, of the 'self-report' variety, and require the subject to choose between 2 or more alternative responses.
  o Projective personality tests (aka unstructured): either the test materials or the required response (or both) are ambiguous. One example is the Rorschach Inkblot Test. These assume that a person's interpretation of the stimulus reflects his or her unique characteristics.

Principles, Applications and Issues of Psychological Tests
Reliability: accuracy, dependability, consistency or repeatability of test results. Basically, it's the degree to which test scores are free from error.
Validity: meaning and usefulness of test results. Basically, it's the degree to which a certain inference based on a test is appropriate.
Test administration: how the test is being given. Some tests are easy to give, others are not.
Interview: method of gathering information through verbal interaction, such as direct questions.

Historical Perspectives
The Chinese had a sophisticated civil service testing program 4000 years ago. Tests were given every 3rd year to help determine promotion decisions. By the Han Dynasty, test batteries (2 or more tests used in conjunction) were common. The British copied the system in 1832, through missionary works, for their own civil service employment.
1st line of inquiry (work of Darwin, Galton and Cattell): Testing became more focused on individual differences after Charles Darwin's publication of the Origin of Species in 1859. Darwin's theory was that the most adaptive characteristics of species survived and evolved. Galton (Darwin's relative) took this idea and applied it to people in his book Hereditary Genius. He focused mainly on sensory and motor differences (reaction time, visual acuity, physical strength…).
2nd line of inquiry (work of Herbart, Weber, Fechner and Wundt): Another area of interest was human consciousness. Herbart explored it with mathematical equations. Weber followed him and tried to determine thresholds (the minimum stimulus necessary to activate a sensory system). Fechner then built on that idea and determined that the strength of the sensation grows as the logarithm of the stimulus intensity. Wundt then promoted this work and is known as the founder of the science of psychology. This further led into experimental psychology.
The actual reason that we have tests is not these 2 lines of inquiry; it is the need to classify the mentally and emotionally handicapped. The first test was created at the start of the 1900s in France by Binet and Simon, under government authority. It was expected to be a general intelligence test to classify and help the intellectually subnormal.

Evolution of Intelligence and Standardized Achievement
The Binet-Simon scale had a standardization sample of 50 children. Many people take this for granted, but since the 50 children were rich white ones, the test doesn't apply fairly to people who do not fit that description (e.g. black, Hispanic, poor…). Binet was aware of this, so he tried to increase the size and representativeness of the sample. A representative sample is one that comprises individuals similar to those for whom the test is to be used. In 1908, the scale was improved to include more children (200 of them), and it also determined the child's mental age (a measurement of the child's performance on the test relative to other children of that particular age group).
One of the most important trends in testing is the striving towards better tests. The version that reached and remained in the US is Terman's 1916 version, labeled the Stanford-Binet Intelligence Scale.
Testing became even more popular in WW1 because people wanted to recruit men who were emotionally and intellectually stable. However, the Binet test wasn't efficient, since the army needed large-scale tests, not individual ones. The army then reached out to APA's president, Yerkes. He created 2 tests: Army Alpha (for those who can read) and Army Beta (for the illiterate).
Achievement tests became popular then because they were MCQ (versus essays). They were also easy to administer and score since they lack subjectivity. Currently, however, people have returned to favoring written tests.
As a result of the countermovement against tests, the Wechsler-Bellevue Intelligence Scale was created. Unlike the Binet scale, it produced more than 1 score (the modern IQ score), so it showed a pattern for the individual. It was also better because it focused less on language.
Before and after WW2, personality tests also began to blossom. These measured traits (relatively enduring dispositions, i.e. tendencies to think, act, or feel in a certain manner, that distinguish individuals). The first personality test was the Woodworth Personal Data Sheet, and it was made of structured true or false questions. The motivation to create personality tests was military screening. The problem with the early personality tests was that they assumed participants never lied, and that they had the same understanding of what the question asked as the administrator. Afterwards, projective tests came to light, but people tended to prefer the Thematic Apperception Test over the Rorschach Inkblot Test, believing it to be more scientifically valid. This ended in the 1940s.

Emergence of New Approaches
In 1943, the Minnesota Multiphasic Personality Inventory (MMPI; another structured personality test) became popular. This is because it emphasized the need for empirical data.
More personality tests were then made, and they were based on the statistical procedure of factor analysis (a method of finding the minimum number of dimensions, or 'factors', such as characteristics and attributes, needed to account for a large number of variables). Cattell used this to determine the 16 Personality Factors.

Period of Rapid Changes in Status of Testing
By 1949, clinical psychology students became certified. Testing was the major function of the psychologist, but it was the physician who provided the psychotherapy. Psychologists started to blame 'testing' for their secondary role in the health industry, and so testing as a career became more disliked and less of an option.

Current Environment
Major branches of psychology emerged in the 2000s: forensic, neuropsychological, health, and child. These fields grew around using tests.

Chapter 21

Professional Issues Shaping the Field of Testing
• Theoretical Concerns: these focus on reliability (dependability) issues of test results. According to APA and other research associations, unreliable tests are unstable, and so are meaningless. Basically, the problem with tests is that either they aren't precise enough to measure determinants of human behavior, or current understanding is not precise enough to make accurate predictions. Tests are no better than the theories they are based on.
  o Assumption 1: Saying that a test has reliability means the test results are attributable to a systematic source of variance, and so are in themselves stable. In describing the functioning of a person, psychologists are saying that these functions are stable (even if they're short-term), regardless of situation or environment. What they measure is something that exists in absolute terms. Psychologists assume that if they're measuring a stable characteristic of the person and it doesn't come out as stable, it is because of test error (measurement/instrumental error), not because the person has changed. So, making sure the test has no errors is important. This assumption isn't entirely correct, because researchers cannot fully attribute inconsistencies either to measurement error or to people's fluctuations.
  o Assumption 2: Most tests assume that we can measure human characteristics independently of the context in which they occur, but this has no scientific support.
• Adequacy of Tests: Shakow is the father of clinical psychology, and even he admits that we haven't yet reached our goal of providing objective assessments of psychological functioning or personality. We know we don't have perfect tests, but they are adequate for now. How tests are used is determined by law (e.g. if SATs consistently underrepresent blacks, maybe they should be re-examined for bias).
• Actuarial vs. Clinical predictions: Several studies showed that actuarial numbers (such as number of arrests, severity of crime…) are better predictors of recidivism than clinical judgments and tests. Yet other studies showed the opposite finding. Computer usage in testing also raises a few concerns: people may take the test lightly, the software may be inappropriate, or there may be a lack of clinician involvement. Either way, ethical guidelines specify that it's the clinician's responsibility to make sure of appropriateness. Also, results showed that most clinicians don't want to use technology in testing anyway.

Moral Issues Shaping Field of Testing
• Human Rights:
  o One of these rights is the right to not be tested. Yet APA specified 3 cases when this can be vetoed: testing is mandated by law, informed consent is implied (for example, if testing is part of educational activities), or the test's purpose is to evaluate decisional capacity.
  o Another right is the right to know one's test scores and interpretations. It's important to keep scores secure, but if there is something the public needs to know (such as biases, or that their life is harmful and they should change it), it needs to be told.
  o A recent right is the right to know who will have access to the data.
  o Another recent one is the right of confidentiality. APA requires physicians to inform test takers of the confidentiality issues of taking a test online.
  o Test interpreters have an ethical obligation to make sure their test takers know their rights.
• Labeling: It is common to give labels to people after a diagnosis has been made, but some disorders have become associated with negative stigma (e.g. schizophrenia, AIDS…). Some labels are also self-fulfilling, such that they prevent people from receiving appropriate treatment (e.g. being labelled a chronic schizophrenic). Another effect of labeling is the implication that those who are psychiatrically disturbed are so because of low control over their bodies. The labeling process makes treatment more difficult by lowering their tolerance for stress.
• Invasion of privacy: people started being suspicious of tests, but Dahlstrom offered 2 lines of defense. One was that psychological tests are so limited that they cannot invade anybody's privacy.
The other was that the notion of invasion of privacy is itself unclear. Knowing private information about a person is only a problem if it's used inappropriately. So, participants now have the right to know the limits of confidentiality (e.g. if the results show that the person may cause harm), and that these results may be used in court if necessary and dangerous.
• Divided Loyalties: the question is, who is the client: the individual or the industry that ordered the test? For example, a firm hires a psychologist to eliminate potential employees who can't handle stress well. The psychologist should maintain the secrecy of why specific people aren't hired, but should also explain to those people why such a decision was made. If the psychologist does tell the person why the decision was made, the person can tell other people, who will then outsmart the test. Right now, this problem is solved by letting the psychologist tell the clients the purpose of the test, and refrain from exposing personal information of clients to the firm.
• Responsibilities of test users and constructors: A test that is valid and reliable for one group may not be so for another group, so test constructors must take this into account. Also, when interpreting results, psychologists have to take into account the characteristics of the person they measured. Test users are also required to know why they're using the test and the consequences that may arise. Test users must always ask if the test is any good as a measure of the characteristic being measured (is it reliable and valid?) and if the test should be used for the purpose specified (is it ethical?).

Social Issues Shaping the Field of Psychology
• Dehumanization: there is always the thought that computers, and the data sheets we feed into them, are making our life choices; 'We are not considered human anymore'. Computerized test interpretations threaten our humanity, and so they must be implemented carefully and always be open to scrutiny.
• Usefulness of Tests: sometimes the theories behind a test may be incorrect, but they still lead to useful predictions (this is like how the sun was believed to revolve around the earth: the formulae created were based on wrong theories but predicted things correctly, so they were useful). Tests don't need to be perfect, they just need to be useful.
• Access to psychological testing sources: fees to run, administer and interpret test results are really high (15,000-25,000). The Wechsler tests are 1,100. Because they are so expensive, it's hard for people to have access to them. People are now trying to include them in insurance coverage to make them more accessible, but that will exclude some tests, and so tests will now be chosen based on usefulness.

Current Trends
• Proliferation of New Tests: new tests keep getting created because of professional disagreement over the best strategy to measure human characteristics. Another reason is that people are always trying to make tests less biased. A final reason is that people can profit financially from making tests, and so they take advantage of it. Nontraditional tests are being made because they reflect the role of scientific psychology, and because people are trying to integrate tests into other fields of psychology.
• Higher Standards, Improved Technology and Increased Objectivity: The APA published guidelines to be followed by test conductors, so standards are better than before. Better tests are also being created because technology makes statistical measures easy to perform.
• Greater public awareness and influence: public awareness led to increased demand for psychological services. The demand is balanced by the tendency towards legislative regulations and policies (e.g. those that restrict using intelligence tests to diagnose retardation). The best product of increased awareness, however, is the extra focus on safeguarding human rights. This has decreased the probability of test misuse and abuse.
• Computer and Internet Applications: computers are being used to administer, score and interpret psychological tests. The use of computers extends to all types of tests, including behavioral assessments.

Future Trends
Psychological testing is predicted to continue existing for a long time. Tests are an important part of the psychologist's career, and they are a profitable industry. The future will also likely see the development of many more tests (probably better, and based on older ones).
The Wechsler and Binet intelligence tests are not likely to remain dominant. One test already rising to challenge them is the Kaufman Assessment Battery for Children. As for structured personality testing, the MMPI-2 seems to be the premier test for years to come.
The Rorschach test's future is uncertain, especially since it's based on Freud and his theoretical psychometric beliefs. Also, the inkblots of the test are now available online, and so future test takers can see them and understand their interpretations (which defeats the purpose). The future of the Thematic Apperception Test (TAT) is even more difficult to predict. This is because, although it has an incredibly extensive research base, its stimuli are outdated.
One thing that will not change is the argumentative nature of psychologists: they will always be debating which is the best test to use.
Today, integration of concepts from experimental cognitive psychology, computer science, neuroscience and psychometrics is rapidly shaping the field. Multimedia computerized tests form the most recent generation of assessment instruments. Computers offer an unlimited scope in developing games (from interactive virtual realities to helping in desensitization with phobias). The computer is going to play an important part in the future of psychology.

Chapter 2

Why We Need Statistics
The outcome of a test is usually represented as a score. Statistical methods serve 2 purposes in understanding science:
1. Descriptive statistics: methods used to provide a concise description of a collection of quantitative information.
2. Inferential statistics: methods used to make inferences from observations of a small group (sample) to a larger group (population).
Statistics can be used to make inferences (logical deductions about events that can't be observed directly). First comes a period of exploratory data analysis (gather and display cues), then comes confirmatory data analysis (cues are evaluated against rigid statistical rules).

Scales of Measurement: Properties
Measurement: application of rules for assigning numbers to objects. There are 3 important properties:
• Magnitude: property of moreness. A scale has this attribute if we can say that one item represents more or less than another. However, there are instances where this is not needed (e.g. if a coach labels teams as team 1, team 2 …, it doesn't mean any team is more than another; the numbers are just labels).
• Equal Intervals: a scale has this property if the difference between 2 points at any place on the scale has the same meaning as the difference between 2 other points that differ by the same number of scale units (e.g. on an inch measure, 2 and 4 are the same distance apart as 10 and 12; this doesn't work for IQ measures). A scale with this property has measurement units that fall on the equation y=mx+b (a straight line).
• Absolute 0: being at absolute 0 means that nothing of the property being measured exists. For biology, to say a person has 0 heartbeats is absolute (the person has no heart rate at all), but for psychology, defining an absolute 0 is harder (what does it mean that a person lies at 0 on the shyness scale?).

Scales of Measurement: Types
• Nominal scales: not really scales at all; their only purpose is to name objects. They're used when the information is qualitative (mostly used by social scientists to categorize). This scale has none of the 3 properties mentioned above.
• Ordinal scale: allows you to rank individuals but not say anything about the meaning of the differences between their ranks. This scale has the property of magnitude, but not equal intervals or an absolute zero. For example, scales that rank by height. IQ tests also lie here.
• Interval scale: these scales have the magnitude and equal interval properties, but not an absolute 0. This means that we can't make any statements about ratios. The Fahrenheit scale is an interval scale because it has everything but an absolute 0, so we know that 212 is warmer than 22, but we can't say that 50 is twice as warm as 25. The Celsius scale is also here: although it has a value of 0, an absolute 0 means that what we are measuring doesn't exist, whereas 0 on the Celsius scale just means the temperature is low; temperature still exists.
• Ratio scale: has all 3 properties. Continuing with temperature, the only scale that is ratio is the Kelvin scale. Its 0 means all molecular activity has stopped.

Scales of Measurement: Permissible Operations
Nominal data can only be used in frequency distributions, with no math. Ordinal data can be manipulated by arithmetic, but the result is difficult to interpret (you can know the rank, but not the individual information). For interval scales, you can speak about differences but not ratios, and for ratio scales, you can use any operation.

Frequency Distributions
Frequency distribution: displays scores on a variable to reflect how frequently each value was obtained. Usually, scores are on the horizontal axis, and the vertical axis says how many times each value on the horizontal axis was observed (frequency). Usually (but not always) the distribution is symmetrical. When a distribution isn't symmetrical, it is said to be skewed to one side. It's positively skewed if the tail falls off on the higher side of the x-axis. Income is an example of a distribution with a positive skew (some people have really high incomes, and so the tail will be towards the higher side of the x-axis).
Frequency polygon: points are placed on a graph to represent the frequencies and they are then connected.
For either a frequency distribution or a polygon, you need to specify the width of the class interval (intervals that share the same frequency; they're chosen based on convenience for the researcher).

Percentile Ranks
Percentile rank: tells you what percent of scores fall below a particular score; the specific case you're looking at is excluded. To calculate it, you need to determine how many cases fall below the score of interest (B) and how many cases are in the whole group (N), then divide the number below by the total and multiply by 100: $P_r = \frac{B}{N} \times 100$. For example, finishing 62nd in a race of 50,000 people means 50,000 - 62 = 49,938 people finished behind you; 49,938/50,000 = 0.9988, and that multiplied by 100 is 99.88, so you're in the 99.88th percentile. The percentile rank depends on the number of cases you're looking at.
Percentiles: specific scores or points within a distribution. The difference between a percentile and a percentile rank is that a percentile is in raw score units. So, if the infant mortality rate is 4.22 out of 1000, and there are 13 countries with worse mortality rates, the percentile is 4.22/1000 and the percentile rank is 72.

Describing Distributions: Mean
Statistics are used to summarize data.
Variable: score that can have different values.
Mean: arithmetic average score in a distribution. Equation: $\bar{X} = \frac{\sum X}{N}$ (instead of $\bar{X}$, use $\mu$ for the population). This means the sum of all values of the variable X divided by the total sample number.

Describing Distributions: Standard Deviation
Standard deviation: approximation of the average deviation around the mean. Equation: $\sigma = \sqrt{\frac{\sum (X - \mu)^2}{N}}$. The variance equation is the same without the square root (the variance is the standard deviation squared). The difference between the two is that the variance is the average squared deviation around the mean, so it is in squared units and is hard to interpret on its own. These 2 have many advantages: knowing them allows us to make precise statements about the distribution. This is when we're talking about the population. When we want to talk about a sample, instead of using $\sigma$ we use S, and instead of dividing by N, we divide by N-1. We use N-1 to recognize that S of a sample is only an estimate of the variability in the population (side note: Roman letters are used for samples and Greek letters are used for populations).

Z Score
Means and standard deviations still don't convey much information. The Z score transforms data into standardized units that are easier to interpret: $Z = \frac{X - \bar{X}}{S}$. Basically, a Z score is the deviation of a score from the mean in standard deviation units. If the score is equal to the mean, Z is 0; if it is more, Z is positive; and if it is less, Z is negative.
CES-D is a general measure of depression that has been used in epidemiological studies. Scores on this test range from 0-60, with scores above 16 meaning that you have clinically significant levels of depression. This test doesn't have high validity (if you score less than 16, you're not depressed for sure, but if you score higher than that, it doesn't necessarily mean that you are depressed).

Standard Normal Distribution
Symmetrical binomial probability distribution: the x axis has Z scores and the y axis has frequency. Transforming scores into Z scores means that their mean is now 0 and the SD is 1. It also means that we can easily predict proportions of cases as well as the percentile ranks. These methods can only be used when the distribution of scores is normal.
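As a rough illustration of the formulas in this chapter, the sketch below computes a mean, the population and sample standard deviations, a Z score, and a percentile rank. The function names and the small list of scores are made up for the example (they are not from the course material), and the last step assumes the scores are normally distributed.

```python
import math
from statistics import NormalDist


def mean(scores):
    """Arithmetic average: sum of all values divided by how many there are."""
    return sum(scores) / len(scores)


def std_dev(scores, sample=False):
    """Standard deviation around the mean.

    Divides by N for a population and by N - 1 for a sample.
    """
    m = mean(scores)
    sum_of_squares = sum((x - m) ** 2 for x in scores)
    n = len(scores) - 1 if sample else len(scores)
    return math.sqrt(sum_of_squares / n)


def z_score(x, m, sd):
    """Deviation of a score from the mean, in standard deviation units."""
    return (x - m) / sd


def percentile_rank(below, total):
    """Percent of cases that fall below the score of interest (B / N x 100)."""
    return below / total * 100


# Hypothetical scores, chosen only so the arithmetic is easy to follow.
scores = [2, 4, 4, 4, 5, 5, 7, 9]
m = mean(scores)
sd = std_dev(scores)  # population formula (divide by N)
print(m, sd, std_dev(scores, sample=True))

# Z score of the highest score, and the race example from the notes:
# 62nd place out of 50,000 leaves 49,938 runners behind you.
z = z_score(9, m, sd)
print(z, percentile_rank(49_938, 50_000))

# Only if the distribution is normal: proportion of cases below a given Z.
print(NormalDist().cdf(z))
```

The `sample=True` switch mirrors the note above about using S and N-1 when working with a sample rather than the whole population.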
Statistics for non-normal distributions are referred to as nonparametric statistics.

Chapter 2

Percentiles and Z Scores
Applying the Z score to exams: assume you got a 60 in the class. The mean was 55.70 and the standard deviation was 6.08. This means your Z score would be (60 - 55.70)/6.08 = 0.707. If you look at appendix 1, you'd find that this Z score corresponds to the 76th percentile, which would mean that out of every 100 students, 76 scored lower than you. This assumes that the class distribution is normal.

McCall's T
McCall was a mathematician who tried to develop a system to derive equal units on mental quantities (one set of scores that could be applied to other situations without standardizing the set); instead he developed an alternative to the Z score system. He suggested that a random sample of 12-year-olds be tested and their scores obtained. He then converted their raw scores to their percentile equivalents and set the mean (50th percentile) at 50 and the standard deviation at 10 (this system is McCall's T). So, the system differs from Z scores in that the mean is 50 (instead of 0) and the SD is 10 (instead of 1). To convert a Z score to a T: T=10Z+50. There is nothing special about this transformation; you can manipulate the Z scores in whatever way you like. One test that does this is the SAT, which uses a mean of 500 and an SD of 100. Before 1995, the scores were compared to the original group of students who took the test. Afterwards, it was made more accurate by standardizing the tests.

Standardization vs. Normalization
McCall's T and similar systems are standardizations, in that if the results were skewed before the linear transformation, they'll still be skewed after it. Normalization changes the characteristic of the distribution.

Quartiles and Deciles
Quartiles: divide the percentage scale into 4 groups; the 2nd quartile is the median (half the scores are above and half are below). There is also the interquartile range (the interval of scores bounded by the 25th and 75th percentiles; it represents the middle 50%). The quartiles are Q1, Q2 and Q3.
Deciles: divide the percentage scale into 10 groups. The deciles are denoted as D9 (the point below which there are 90% of the scores), D8, D7…
Stanine: literally a combination of 'standard nine'; a system developed in WW2 for the US Air Force. It converts any set of scores into a transformed scale ranging from 1-9 with a mean of 5 and SD of 2. Their only advantage was that on computer cards, they use only one side, but now that we don't use those cards, we don't really need the system. Stanines are distributed such that 1 has the bottom 4%, 2 has the next 7%, 3 has the next 12%, 4 has the next 17%, 5 has the next 20%, and then the percentages go back down (17%, 12%, 7%, 4%), following the normal curve. So to find out which stanine is your own, convert the raw score into a Z score, find which percentile your Z score falls on, and then look at the stanine table to find out which one you're in.

Norms
Norms (aka mean or 50th percentile): performances by defined groups on particular tests. For a test, norms are based on the distribution of scores obtained by some defined sample of individuals.
Most intelligence tests are transformed to have a norm of 100 and SD of 15.
Controversies:
A troubling issue in psychological testing is that different ethnic groups have different norms. So, if tests were used as screening for employment, overselection would happen (selecting a higher percentage from a particular group than would be expected on the basis of representation; e.g. if 60% of applicants are white and 75% of those hired are white, there was overselection). To correct this problem (especially with GATB tests in the US), applicants were compared only to the norms of their own ethnic group. This eliminated overselection but created the problem that although 2 people may have the same score, their raw scores would be different, and so selecting between the 2 would be difficult. The National Academy of Science promotes the use of separate testing norms, but civil law declared it illegal to do so.

Age-Related Norms
Some tests have different norms for particular age groups. The IQ tests are like this. So, when applying an IQ test, the tester's task is to determine the mental age of the person.

Tracking
Experts have discovered that children at the same age level tend to go through different growth patterns (e.g. children who are born small tend to stay small and grow at a slower pace). So, pediatricians who are trying to determine if a child's height is below or above average need to know more than age; they also need to know the child's percentile within a given age group. Tracking is the tendency to stay at about the same level relative to one's peers. The tracking system is useful in medicine, for example to track the health and nutrition of a person. However, when it comes to education, it's controversial. Some people believe that intellectual growth parallels physical growth, such that people who are originally below their peers will grow intellectually slower. However, others believe that this system discriminates against children.

Criterion-Referenced Tests
Norm-referenced test: compares each person with a norm. People have argued that such tests force competition among people.
Criterion-referenced test: describes the skill, task or knowledge that the test taker can demonstrate. The results of these tests would not be used to make comparisons, but to design an individualized program of instruction. It was popular in the 1990s as a humanistic trend, but many schools showed that a large number of students were not succeeding, and so Obama declared it a conservative approach and removed it to enforce higher standards. This, however, raises the problem of the arbitrariness of the cut-off points for passing high-stakes tests.

Within High School Norms for University Admission
The University of California noticed that minorities were underrepresented in its classes, and so decided to reform. Instead of looking at SAT scores (which would compare students with each other across the state), it decided to take the top 4% of students in each high school (so as to compare students only within each high school). This actually increased the underrepresentation, and the program was abandoned.

No Child Left Behind
The No Child Left Behind (NCLB) Act was initiated by Bush with the justification that each child should receive a quality education. The law required that each child be tested with standardized tests and that schools make the results public. This seemed to be a good idea, but it soon became evident that, since the tests are standards-based and rest on an arbitrary cut-off point, teachers began to 'teach the test' (i.e. they focused on teaching only subjects that would be tested, like math and science, and ignored art). Also, it was found that teachers came under so much pressure to 'produce' excellent students that they started cheating the system.

Are 4th Graders Smarter than 3rd Graders?
In California, the standardized test used is the STAR. The test reveals indicators of school performance, so it's a serious test. Results of students from 2005, 2006 and 2007 are graphed, and a trend arises: 3rd graders always perform poorly, and 4th graders always excel. The best explanation is that the tests for 3rd graders may be too hard for them, while the tests for 4th graders are too easy. This is another problem of standards-based testing with arbitrary cut-off points.

Chapter 4

History and Theory of Reliability
Errors of measurement: discrepancies between true ability and measurement of ability. In psychology, 'error' does not imply that a mistake has been made, but that there will always be inaccuracy. Our goal is to find the magnitude of error and minimize it. Tests with minimal error are considered reliable.

Conceptualization of Error
The issue of error is most prominent in psychology because what we measure is not easily observed. It's like we have a 'rubber yardstick' as our instrument, and what we are trying to measure may be underestimated (if the yardstick is shrinking) or overestimated (if it is stretched).

Spearman's Early Studies
Spearman was the first to develop reliability assessment methods, building on De Moivre's concept of sampling error (from a century earlier) and Pearson's product moment correlation.
Thorndike furthered this idea by introducing several coefficients into the equations. Later, Cronbach developed more advanced methods for evaluating many sources of error in behavioral research. Currently, the theories have evolved to include computer technologies through item response theory (IRT).

Basics of Test Score Theory
Classical test theory has been around for 100 years, and it assumes that each person has a true score that would be obtained if there were no errors in measurement. Equation: X=T+E (X is the observed score, T is the true score, and E is the error).
Another assumption is that errors of measurement are random. Systematic errors are considered but they don't really account for much (e.g.

Another problem is that this model assumes that characteristics are constant over time, and variations are assumed to be errors.

Item Response Theory
IRT has been developing over the past decade. The
if a carpenter constantly computer is used to focus on the range of item difficulty misreads 2cm as 1cm, his wood won’t be accurate, but that helps assess an individual’s ability level (so it will all be the same size). The distribution of random basically, you get asked a few questions and if you keep errors is bell-curved, with the center of the distribution getting them right, the computer asks more difficult representing the true score. The true score may be questions, and if you get a bunch wrong, the computer better estimated by finding the mean of observations drops you to an easier level). This way, a person’s skill is from replicated applications. sampled fairly. One difficulty: the method requires a The basic measure of error is standard error of bank of items that have been systematically evaluated measurement (standard deviation of errors, since we for difficulty level. Also, considerable effort is needed assume that the distribution of random errors will be for test development. the same for all people). Models of Reliability Domain Sampling Model Federal government mandates that all tests be reliable Test Scores gain reliability as the number of items before use in employment or education purposes. increases. Sources of Error Central concept in classical test theory. It considers  Loud noises in the room problems created by using a limited number of items to  Room temperature isn’t suitable represent a larger, more complicated construct.  Test-takers may be feeling down Reliability is conceptualized as the ratio of variance of  Items on the test might not be representative of the observed score on the shorter test and the variance of domain (e.g. you can spell 96% of dictionary words, the long-run true score. We’re looking at the error of but the 20-item test included 5 items (20%) that you using a sample of items as opposed to the entire couldn’t spell). domain (we want to check your spelling. 
We should Measuring test-reliability: really get you to spell every word in the dictionary, but  Time Sampling: Test-Retest Method: used to evaluate nobody has time for that, so we get you to spell a error associated with giving a test at 2 times. This is sample of words). only needed when we measure traits that do not Reliability can be estimated from the correlation of the change over time (e.g. IQ is constant, but the observed test score with the true score. Since we can’t Rorschach inkblot test is not). It’s an easy way to test practically get the true score, instead, we can create reliability: just give the test twice and find the many tests by sampling from the same domain and get correlations. You have to watch out for: a normal distribution. We would get correlations o Carryover effect: when first testing session between each test and average them (technically you influences scores from the second testing session. should convert Fishcer’s r to and then the scores are When they are present, the test-retest correlation averaged). overestimates reliability. This isn’t really a problem The correlation between 2 randomly parallel tests when the second session is systemic (as opposed to would be expected to be less than the correlation of random). If testing affects all test takers equally, it’s either test with the score. Equation: r √r average. 1j 1j not an error. Criticism of Classical Test Theory: It requires the same o Practice effects (follow from carryover effect): some test items to be administered to each person. For skills improve with practice, so test takers’ scores example, intelligence has a small number of items that may sharpen those skills (such as dexterity skills). concentrate on an individual’s ability level (some may This is considered an effect because practice affects be too hard/easy), so the reliability is not very strong. people differently. 
Because of these problems, time interval between between 2 items is 0 when the items measure tests must be selected and evaluated carefully. If the different things, but when the distribution is skewed, time is too short, then we’ll have to deal with alpha usually gives a value. So now, people are carryover and practice effects, but if it’s too long, starting to use coefficient omega instead (omega then we have 3 problems (is it because the test isn’t estimates the extent to which all items measure the reliable, the characteristic actually changes overtime, same underlying trait). or some combination of both?). KR 20rmula  Item Sampling: Parallel Forms Method: Parallel forms Kuder and Richardson advanced reliability assessment reliability compares 2 equivalent forms of a test that by developing methods for evaluating reliability within a measure the same attribute (same difficulty). Pearson single test administration. Their method is better than product moment correlation coefficient is used. The using the split-half for assessing internal reliability because it simultaneously considers all possible ways of method tries to make sure that the test is reliable such that the error variance is not attributable to the splitting (instead of arbitrarily splitting). 20 assess selection of one set of items. The tests can be
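Two calculations from these notes, the Z-to-T conversion (T=10Z+50) and the test-retest correlation, can be tried out with a small sketch. The scores below are made-up illustration data, and the function names (z_scores, rescale, pearson_r) are hypothetical helpers, not part of any test publisher's procedure.

```python
from math import sqrt
from statistics import mean, pstdev

def z_scores(raw):
    """Convert raw scores to Z scores (mean 0, SD 1) within this group."""
    m, sd = mean(raw), pstdev(raw)
    return [(x - m) / sd for x in raw]

def rescale(z, new_mean, new_sd):
    """Linear transformation of a Z score, e.g. T = 10Z + 50."""
    return new_sd * z + new_mean

def pearson_r(xs, ys):
    """Pearson product moment correlation between two score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up scores for 5 people tested at two different times.
first_session = [10, 12, 14, 16, 18]
second_session = [11, 12, 15, 15, 19]

# McCall's T: mean 50, SD 10. An SAT-style scale would use rescale(z, 500, 100).
t_scores = [rescale(z, 50, 10) for z in z_scores(first_session)]

# Test-retest reliability estimate: correlate the two sessions.
retest_r = pearson_r(first_session, second_session)

print([round(t, 1) for t in t_scores])
print(round(retest_r, 2))
```

Because the transformation is linear, the T scores keep the shape of the raw distribution; only the mean and SD change, which is the standardization (not normalization) point made in the notes.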