The Flynn effect, popularized by James Flynn, refers to the fact that IQ tests supposedly get easier with time. Although by definition the average IQ of American or British (white) people is always 100, the older the IQ test, the easier it is to score 100. Thus to keep the average at 100, tests like the Wechsler must be renormed every 10 years or so, otherwise the average IQ would increase by about 3 points per decade.
Although scholars continue to debate whether the Flynn effect reflects a genuine increase in intelligence (perhaps caused by prenatal nutrition or mental stimulation) or just greater test sophistication caused by modernity, there’s been remarkably little skepticism about the existence of the Flynn effect itself.
Malcolm Gladwell writes:
If an American born in the nineteen-thirties has an I.Q. of 100, the Flynn effect says that his children will have I.Q.s of 108, and his grandchildren I.Q.s of close to 120—more than a standard deviation higher. If we work in the opposite direction, the typical teen-ager of today, with an I.Q. of 100, would have had grandparents with average I.Q.s of 82—seemingly below the threshold necessary to graduate from high school. And, if we go back even farther, the Flynn effect puts the average I.Q.s of the schoolchildren of 1900 at around 70, which is to suggest, bizarrely, that a century ago the United States was populated largely by people who today would be considered mentally retarded.
While few people believe our grandparents were genuinely mentally retarded, it’s taken for granted that they would have scored in the mentally retarded range by today’s standards.
But is this true? I began having doubts over a decade ago when I examined the items on the first Wechsler intelligence scale ever made: the ancient WBI (Wechsler-Bellevue Intelligence Scale). Meticulously normed on New Yorkers in the 1930s, this test remains far and away the most comprehensive look we have at early 20th century white North American intelligence, and while some of the subtests looked easy by today’s standards, others, especially vocabulary, looked harder.
The Kaufman effect
What also struck me was how little instruction, probing or coaching people got when taking the ancient WBI, compared to its modern descendant the WAIS-IV. This matters a lot because the way the Flynn effect is calculated on the Wechsler is by giving a new sample of people both the newest Wechsler and its immediate predecessor, in random order to cancel out practice effects, and then seeing which version they score higher on. If they average 3 points lower on the WAIS-IV normed in 2006 than on the WAIS-III normed in 1995, it’s assumed IQ increased by 3 points in 11 years.
The problem with this method (as Alan Kaufman may have discovered before me) is that the subset of the sample that took the newer version first has a huge advantage on the older version compared to the norming sample of the older test (over and above the practice effect which is controlled for), because the norming sample of the older test was never given coaching and probing.
Statistical artifact
A Promethean once said maybe the Flynn effect is just a statistical artifact of some kind. He never told me what he meant, but it got me thinking:
One problem with how the Flynn effect is calculated on the Wechsler is that it’s assumed that gains over time can be added. For example, it’s assumed that you can add the supposed 7.8 IQ gain from WAIS normings 1953.5-1978 to the 4.2 IQ gain from normings 1978-1995 to the 3.7 IQ gain from normings 1995-2006, for a grand total of 15.7 IQ points from normings 1953.5-2006.
This would make sense if we were talking about an absolute scale like height, but is problematic when talking about a sliding scale like IQ. For example, suppose the raw number of questions correctly answered in 1953.5 was 20 with an SD of 2. By 1953.5 standards, 20 = IQ 100 and every 2 raw points = 15 IQ points above or below 100. Now suppose in 1978, people averaged 22 with an SD of 1. That’s a gain of 15 IQ points by 1953.5 standards. Now suppose in 1995 people averaged 23 with an SD of 2. That’s a gain of 15 IQ points by 1978 standards, since one raw point is a full 1978 SD. Adding the two gains together implies a 30 point gain from 1953.5 to 1995, but by both 1953.5 and 1995 standards (SD of 2 in each case), the 3 raw point difference is only 1.5 SD, or about 23 IQ points.
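The arithmetic in this example can be checked with a short Python sketch. The raw-score means and SDs are the hypothetical ones from the paragraph above, not real WAIS data, and the function name iq_gain is made up for illustration.

```python
def iq_gain(old_mean, new_mean, ref_sd, points_per_sd=15.0):
    """IQ-point gain between two raw-score means, judged by one era's SD."""
    return (new_mean - old_mean) / ref_sd * points_per_sd

# Hypothetical raw-score norms from the example: year -> (mean, SD)
norms = {
    "1953.5": (20.0, 2.0),
    "1978": (22.0, 1.0),
    "1995": (23.0, 2.0),
}

mean_53, sd_53 = norms["1953.5"]
mean_78, sd_78 = norms["1978"]
mean_95, sd_95 = norms["1995"]

# Each step is scored against the SD of the older norming in that pair.
gain_53_to_78 = iq_gain(mean_53, mean_78, ref_sd=sd_53)  # 15 points
gain_78_to_95 = iq_gain(mean_78, mean_95, ref_sd=sd_78)  # 15 points
chained_gain = gain_53_to_78 + gain_78_to_95             # 30 points

# The same 3 raw point difference measured in one step, by either endpoint's SD.
direct_gain_1953_5 = iq_gain(mean_53, mean_95, ref_sd=sd_53)  # 22.5 points
direct_gain_1995 = iq_gain(mean_53, mean_95, ref_sd=sd_95)    # 22.5 points

print("chained:", chained_gain)
print("direct (1953.5 SD):", direct_gain_1953_5)
print("direct (1995 SD):", direct_gain_1995)
```

The chained total and the direct comparison disagree only because the middle norming has a different SD. If all three SDs were equal, adding the gains would be harmless.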
Changing content
Another problem with studying the Flynn effect is that the content of tests like the Wechsler is constantly changing. This is especially problematic when studying long-term trends in general knowledge and vocabulary. If words that are obscure in the 1950s become popular in the 1970s, then people in the 1970s will score high on the 1950s vocabulary test. Meanwhile the 1970s vocabulary test may contain words that don’t become popular until the 1990s. Thus adding the vocabulary gains from the 1950s to the 1970s to the gains from the 1970s to the 1990s might give the false impression that people in the 1990s will do especially well on a 1950s vocabulary test, when in reality, many words from the 1950s may have peaked in the 1970s and be even more obscure in the 1990s than they were in the 1950s.
An ambitious study
Given the Kaufman effect, the statistical artifact, and changing content, I realized the only way to truly understand the Flynn effect is to take the oldest quality IQ test I could find and replicate its original norming on a modern sample.
In 2008 I made it my mission to replicate Wechsler’s 1935-1938 norming of the very first Wechsler scale. Ideally I should have flown to New York where Wechsler had normed his original scale, but if Wechsler could use white New Yorkers as representative of all of white America (WWI IQ tests showed white New Yorkers matched the national white average), I could use white Ontarians as representative of all of white North America (indeed white Americans and white Canadians have virtually the same IQs). The target age group was 20-34 because this was the reference age group Wechsler had used to norm his subtests.
It took over a decade, but I was gradually able to arrange for 15 randomly selected white young adults to take the one-hour test. They were non-staff recruited from about half a dozen fast food locations in lower to upper middle class urban and suburban Ontario. The final sample was not perfectly representative of white North America (they were a bit less educated and much less female), testing conditions were not optimal (environments were sometimes noisy, at least one person had a few beers before testing, and another was literally falling asleep during the test), and 15 people is far too small a sample to draw statistically significant conclusions about 11 different subtests. One man with a conspicuously low score was removed from the sample because he had suffered a stroke.
Nonetheless, the table below shows how whites tested in 2008 to 2019 compared to Wechsler’s 1935-1938 sample, with the last column showing the expected scores of the 21st century sample, obtained by extrapolating the gains James Flynn calculated from 1953.5 to 2006 (see page 240 of his book Are We Getting Smarter?) to the span of the current study: circa 1937 to circa 2013.5.
Note: the 11 subtests were scaled to have a mean of 10 and an SD of 3 in the original young adult norming sample, while the verbal, performance and full-scale IQs were scaled to have a mean of 100 and an SD of 15. Note also that vocabulary is an alternate test, not used to calculate either verbal or full-scale IQ on the WBI. One third of my sample did not take Digit Symbol, so for these subjects, performance and full-scale IQs were calculated via prorating.
Test | Nationally representative sample of young white adults (NY, 1935 to 1938) | Randomish sample of young white adults (2008 to 2019, ON, Canada) | Expected WBI scores in 2008-2019 based on Flynn’s calculated rate of increase |
Information (general knowledge test) | 10 (SD 3) | 8.07 (SD 2.6) | 12.3 |
Similarities (verbal abstract reasoning) | 10 (SD 3) | 12.93 (SD 2.94) | 15.54 |
Arithmetic (mental math) | 10 (SD 3) | 7.2 (SD 3.78) (this subtest contained a unit conversion item that seemed biased against Canadians) | 11.02 |
Vocabulary | 10 (SD 3) | 8.73 (SD 2.6) | 14.95 |
Comprehension (common sense & social judgement) | 10 (SD 3) | 9.33 (SD 3.2) | 13.93 |
Digit Span (attention & rote memory) | 10 (SD 3) | 9.47 (SD 2.23) | 11.46 |
Picture Completion (visual alertness) | 10 (SD 3) | 10.47 (SD 3.16) | 14.52 |
Picture Arrangement (social interpretation) | 10 (SD 3) | 9.8 (SD 2.54) | 13.35 |
Block Design (spatial organization) | 10 (SD 3) | 12.53 (SD 3.07) | 12.91 |
Object Assembly (spatial integration) | 10 (SD 3) | 11.47 (SD 1.77) | 14.06 |
Digit Symbol (rapid eye-hand coordination) | 10 (SD 3) | 10.8 (SD 2.82) (note: only 10 of the 15 subjects took this subtest) | 14.66 |
Verbal IQ | 100 (SD 15) | 99.8 (SD 14.46) | |
Performance IQ | 100 (SD 15) | 106.47 (SD 12.11) | |
Full-scale IQ | 100 (SD 15) | 103.4 (SD 13.63) | 122 |
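To compare the observed and expected columns on one scale, the subtest means can be re-expressed as IQ-equivalent gains over the 1930s baseline. The Python sketch below copies the means from the table and assumes that one subtest SD (3 scaled points) corresponds to 15 IQ points. That is only a unit conversion for comparison, not how Wechsler composite IQs are actually derived, and the names used here are made up for illustration. It also computes the standard error of the full-scale mean from the table’s SD and n = 15, to show how much sampling noise a sample this small carries.

```python
import math

BASELINE = 10.0    # subtest mean in the 1935-1938 norming sample
SUBTEST_SD = 3.0   # subtest SD in the 1935-1938 norming sample
IQ_SD = 15.0       # IQ points per SD

# name -> (observed 2008-2019 mean, expected mean from Flynn's rate), from the table
subtests = {
    "Information": (8.07, 12.3),
    "Similarities": (12.93, 15.54),
    "Arithmetic": (7.2, 11.02),
    "Vocabulary": (8.73, 14.95),
    "Comprehension": (9.33, 13.93),
    "Digit Span": (9.47, 11.46),
    "Picture Completion": (10.47, 14.52),
    "Picture Arrangement": (9.8, 13.35),
    "Block Design": (12.53, 12.91),
    "Object Assembly": (11.47, 14.06),
    "Digit Symbol": (10.8, 14.66),
}

def to_iq_points(scaled_mean):
    """Gain over the 1930s baseline, converted from scaled-score to IQ units."""
    return (scaled_mean - BASELINE) / SUBTEST_SD * IQ_SD

# name -> (observed gain, expected gain), both in IQ-equivalent points
gains = {
    name: (round(to_iq_points(obs), 2), round(to_iq_points(exp), 2))
    for name, (obs, exp) in subtests.items()
}

for name, (obs_gain, exp_gain) in gains.items():
    print(f"{name:20s} observed {obs_gain:+6.2f}   expected {exp_gain:+6.2f}")

# Standard error of the full-scale IQ mean (SD 13.63, n = 15, both from the table)
full_scale_se = 13.63 / math.sqrt(15)
print(f"Full-scale IQ standard error: {full_scale_se:.2f}")
```

On this conversion, Similarities and Block Design show observed gains of roughly 13 to 15 points, while Information, Arithmetic and Vocabulary come out below the 1930s baseline. The standard error of about 3.5 points on the full-scale mean is a reminder that the subtest-level differences are noisier still.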
Conclusion
The Flynn effect is dramatically smaller than we’ve been led to believe, at least on tests of specific information that may become obscure over generations. By contrast certain verbal skills (categorizing) and spatial analysis have indeed increased by amounts comparable with Flynn’s research. It’s unclear if these are nutritional gains caused by increasing brain size, neuroplastic gains caused by cultural stimulation, or mere teaching to the test caused by schooling, computers and brain games.
this article reminds me of this:
no time to talk…
if you wanna play a dirty trick on a woman…
if she’s rich and single at ca 35…
just tell her you have sperm…
fantastic!
rupert brooke has had so many women and [redacted by pp, aug 17, 2019] write biographies…
why?
he’s famous for one poem, for being good looking, for being high class, for being bi, and for dying young…
but not from dying in the great war…
he died from an infected mosquito bite…not bullet wounds.
The Flynn Effect occurs due to the swelling of the middle class, which means greater familiarity with the cultural tools of the test designers. Since people are more exposed to the cultural tools used on the test (due to the growth of the middle class), self-confidence, self-esteem, etc. (pretty much better cognitive/affective preparedness) increased as well, which explains the rise in IQ test scores, proving that IQ scores are “middle-class knowledge scores” and IQ tests are tests of social class and one’s specific knowledge in their class.
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.1002.4245&rep=rep1&type=pdf
3398d781543cd0edcf51f181074f4c3ff35b.pdf
This also jibes well with the psychologist Elaine Castles’ claim that “… intelligence is in fact a cultural construct, specific to a certain time and place.” (Elaine Castles, “Inventing Intelligence: How America Came to Worship IQ”)
Impressive work Pumpkin!
Recruiting in fast food places probably biases the sample downwards to some degree though.
Can you calculate the full-scale IQ without the sub-tests that sound like they depend on era-specific knowledge, like Information and Vocabulary?
I tried to balance things out by recruiting some people from more upper middle class places like Starbucks but had less success.
Vocab is not used to calculate IQ in the WBI but I will try excluding some of the other subtests.
Pumpkin, with a low spatial IQ, can a high matrix reasoning score make up for it?
Fascinating. Thanks for this.
Any comment on reported gains on Raven’s Matrices, which has remained unchanged?
Best
James
Well, the 2 subtests where I found the biggest Flynn effects were Similarities (abstract reasoning) & Block Design (visual analysis). Perhaps the Raven combines both abilities & thus shows a big Flynn effect.
Perhaps the increase in visual abilities is caused by nutrition & the increase in abstraction is a schooling effect
Has the black-white IQ gap in the United States narrowed? A literature review
‘the [black-white IQ gap] is amenable to environmental intervention, regardless of the extent to which it is heritable or genetically determined.’
https://www.researchgate.net/publication/324074343_Has_the_Black-White_IQ_Gap_in_the_United_States_Narrowed_A_Literature_Review
On the Wechsler it’s shrunk a bit for kids but not at all for adults
Source? What do you think of this paper that calls out three “HBDers” (charlatan Gottfredson, charlatan Murray and charlatan Rushton)?
See the very end of this article for sources:
https://pumpkinperson.com/2016/12/05/the-black-white-iq-gap/
AFD-130905-006.pdf
look at page 34. Depressing…
Jordan Peterson made a similar argument about IQ and the US military but not regarding blacks.
Why are all the examples given of white people, why not include the results of blacks, or other races?
Because non-whites were not included in the norming of IQ tests until circa the 1960s or 1970s so if I want to study long-term IQ trends, I have to look only at whites unfortunately.
This study about IQ is so flawed as to be laughable. First, [redacted by pp, aug 19, 2019]. Secondly, the sample size of 15 is ridiculous; Reinhard recommends a min size of 30 and even that has caveats. Thirdly, the “researcher” enlisted (only) 15 customers of fast food restaurants in lower to upper middle class urban and suburban Ontario. I wonder how that biased the “study” straight away? You can guarantee they were a bit less educated, eg turning up having had a few beers. And as for being “much less female” whose fault is that? This took a decade to arrange? Finally the comment about “This subtest contained a unit conversion item that seemed biased against Canadians” is neither explained nor credible as no real test specifically biases a test against a particular nationality. And as for it “seemed”, ie no investigation of the real reason. Very sloppy and laughable “work”. I can only imagine this came from The Onion or a student joke newspaper.
Some data is always better than no data. You realize medical research often reports on case studies of individual patients? Maybe someone who reads this will do a more conclusive study and then this has served its purpose.
The female-male gap is small anyway.
You realize not every country uses the same units to measure things? Presumably it’s a metric/imperial system thing.
Pumpkin, why are Working Memory and Processing Speed tested on IQ tests? I found that training Working Memory does lead to an increase in storage functions, but storage functions don’t really improve processing power. But if that’s the case, what’s the point of WMI? I guess a higher WM allows for more efficient thinking.
“A Promethean once said”
Have you ever met a Grail Society member? A club so exclusive that 1 in 76 billion people would qualify, with 2000 people trying and not one even coming close. There are societies for more dumb-dumb people though, than the Grail. Mensa and Promethean are nothing compared to the other high IQ people that could be out there!