Anyone who’s taken multiple intelligence tests knows how dramatically the scores can vary. For example, some commenters on this blog report gaps of as much as 2 standard deviations (2 SD) between their SAT scores and their Wechsler scores. Imagine if two different stadiometers gave a 2 SD difference in height (that’s over 5 inches!). Why is a level of imprecision that would never be tolerated in the hard sciences handwaved away in psychometrics, and what do we do to fix it?
At first glance, IQ tests seem incredibly reliable, as evidenced by the 0.98 reliability (standard error (SE) of 2 points) reported for the WAIS-IV. But how was this number arrived at? For most subtests, reliability was measured by dividing the subtest in half (odd vs even items), taking the correlation between the two halves, and then correcting that correlation for the full length of the subtest (the Spearman-Brown correction). Once the reliabilities for all the individual subtests are in hand, they are combined into a composite reliability for the entire scale.
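To make the split-half procedure concrete, here is a minimal sketch in plain Python. It assumes item scores arrive as one row per examinee with items in test order. The function names are my own for illustration, not anything from the WAIS-IV manual. The last function uses the standard textbook relation between reliability and the standard error of measurement, SE = SD × sqrt(1 − reliability).

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length lists (assumes non-zero variance)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(sxx * syy)

def spearman_brown(r_half):
    """Step a half-test correlation up to the reliability of the full-length test."""
    return 2 * r_half / (1 + r_half)

def split_half_reliability(item_scores):
    """item_scores: list of rows, one per examinee, items in test order."""
    odd_totals = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even_totals = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    return spearman_brown(pearson(odd_totals, even_totals))

def standard_error_of_measurement(sd, reliability):
    """Standard error implied by a reliability coefficient."""
    return sd * math.sqrt(1 - reliability)

# A reliability of 0.98 on a scale with SD 15 implies an SE of about 2 points.
print(round(standard_error_of_measurement(15, 0.98), 2))
```

Note that this kind of reliability only tells you how consistently the items within the test agree with each other, which is exactly the limitation discussed next.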
But if the subtest-level reliability is calculated by dividing each subtest’s items into odd- and even-numbered items, why not calculate the full-scale IQ reliability by dividing the subtests into odd- and even-numbered subtests? The WAIS-IV might be an extremely reliable measure of how smart you are on the abilities measured by the WAIS-IV, but are the abilities measured by the WAIS-IV a representative sample of all cognitive abilities?
Unlike the WAIS-IV, the original WAIS was arguably a pretty representative sample of human cognition. Although there was some selection bias for subtests that correlated well with other subtests, for the most part Wechsler just wanted a very diverse group of subtests that were easy to administer, fun to take, and provided clinical insights into how people think.
A psychotic mental defective obtained the following scores on the original WAIS (keep in mind that subtest scores have a mean of 10 and an SD of 3, unlike the verbal, performance and full-scale IQs, which have a mean of 100 and an SD of 15):
|Full Scale IQ||67|
So using my favorite standard deviation calculator, we find this person has a mean subtest score of 4.64 with an SD of 2.54. Now because there are 11 subtests, we divide this SD by the square root of 11, which gives a standard error (SE) of 0.77. What that means is that, assuming the 11 WAIS subtests are equivalent to a random sample of all cognitive abilities, this person’s true average level of functioning has about a two-thirds chance of falling between a scaled score of 3.87 and a scaled score of 5.41 (+/- 1 SE). For his age (17) on the original WAIS, this equates to a true IQ range of 61 to 72, implying an SE of 5.5! That is more than twice as high as the SE claimed for the WAIS-IV, which rests on a misleading definition of reliability.
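For anyone who wants to check the arithmetic, here is a small sketch that reproduces it from the summary figures quoted above (mean 4.64, SD 2.54, 11 subtests). The conversion from scaled scores to IQ comes from the original WAIS norm tables for his age, so the 61 to 72 range is simply taken from the text rather than computed; the variable names are mine.

```python
import math

mean_scaled = 4.64   # mean subtest scaled score, from the text
sd_scaled = 2.54     # SD of his subtest scaled scores, from the text
n_subtests = 11      # number of subtests on the original WAIS

# Standard error of the mean: SD divided by the square root of the sample size.
se = sd_scaled / math.sqrt(n_subtests)

# Range covering +/- 1 SE around the mean scaled score.
low = mean_scaled - se
high = mean_scaled + se

# IQ range quoted in the text (looked up in the norms, not computed here).
iq_low, iq_high = 61, 72
iq_se = (iq_high - iq_low) / 2  # half the width of a +/- 1 SE range

print(f"SE = {se:.2f}, scaled-score range = {low:.2f} to {high:.2f}")
print(f"Implied IQ SE = {iq_se}")
```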
We can arguably say with 95% certainty that if the WAIS included every cognitive ability possessed by the human brain, his full-scale IQ would be anywhere from 56 to 78 (+/- 1.96 SE). But that’s a bit like saying his height is anywhere from 5’2″ to 5’6″. Unless this person has an abnormal amount of subtest scatter, it may take an IQ test with over 60 subtests for IQ to have a meaningful reliability as high as height’s.
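The 95% range is the same calculation with a wider multiplier. A quick sketch, assuming the full-scale IQ of 67 and the SE of 5.5 IQ points derived above, and treating the error as roughly normal (which is what the 1.96 multiplier assumes):

```python
full_scale_iq = 67   # from the text
iq_se = 5.5          # SE in IQ points, from the text
z_95 = 1.96          # normal multiplier for a 95% interval

margin = z_95 * iq_se
lower = full_scale_iq - margin
upper = full_scale_iq + margin

print(f"95% range: {lower:.0f} to {upper:.0f}")
```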