A respected blogger named Emil has responded to my recent post about Harvard students regressing precipitously to the mean when they take an official IQ test. Although some studies have found the SAT only correlates 0.4 with official IQ tests like the WAIS, Emil writes:
The lower values are due to restriction of range, e.g. Frey and Detterman (2004). When corrected, the value goes up to .7-.8 range. Also .54 using ICAR60 (Condon and Revelle, 2014) without correction for reliability or restriction.
While it’s certainly true that the SAT’s correlation with official IQ tests goes way up when you correct for range restriction, I’m not sure how appropriate the correction is here. The point of such corrections is that if a sample has a restricted range of general intelligence (g), but an unrestricted range of non-g variance, then almost by definition, variance in g will have less predictive power than non-g variance, since the latter variance exceeds the former.
However, people who take the SAT, particularly at the same high school, are not just restricted in g; they are also restricted in academic background and test preparation, which likely correlate with SAT scores independently of g. Thus studies that correct for range restriction in g, while ignoring range restriction in non-g variance, may grossly overestimate the SAT’s correlation with IQ in a random sample of all American 17-year-olds.
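For concreteness, here is a minimal sketch of the standard correction being discussed (the Thorndike Case 2 formula for direct selection on the predictor). The observed correlation of .4 and the 2:1 SD ratio are purely illustrative numbers, not figures from either post:

```python
import math

def correct_range_restriction(r_restricted, sd_unrestricted, sd_restricted):
    """Thorndike Case 2 correction for direct range restriction."""
    u = sd_unrestricted / sd_restricted  # ratio of population SD to sample SD
    return r_restricted * u / math.sqrt(1 + r_restricted**2 * (u**2 - 1))

# An observed r of .4 in a sample with half the population's SD
# "corrects" to about .66:
print(round(correct_range_restriction(0.4, 2.0, 1.0), 2))
```

The catch described above is that the formula assumes restriction on the selection variable is the whole story; if non-g influences on SAT scores are restricted too, the corrected value overshoots.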
Emil also notes that the average IQ of the Harvard students in the study I cited might be deflated by an oversampling of social science students, who are less intelligent than STEM students. I definitely agree that STEM students are more intelligent than social science students; however, I’m not sure this would have a significant effect, because most Harvard students are not in STEM, so the non-STEM students would probably be far more representative of the average Harvard undergrad than the STEM students are. However, this needs to be explored in more depth.
Emil then writes:
SAT has an approx. mean of ~500 per subtest, ceiling of 800 and SD of approx. 100. So a 1500 total score is 750+750=(500+250)+(500+250)=(500+2.5sd)+(500+2.5sd), or about 2.5 SD above the mean.
I realize Emil is just doing a rough estimate, but it’s important to note that the verbal and math sections of the SAT are said to correlate only about 0.67, so someone who scored +2.5 SD on each subscale should be about +2.7 SD on the composite (relative to the SAT population, who are already above the U.S. population mean). At least in theory.
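To make the composite arithmetic explicit, here is a minimal sketch assuming unit-variance subtests and the 0.67 verbal-math correlation cited above:

```python
import math

r = 0.67        # assumed verbal-math correlation
z_verbal = 2.5  # +2.5 SD on the verbal section
z_math = 2.5    # +2.5 SD on the math section

# The sum of two correlated standard scores has an SD of sqrt(2 + 2r),
# so the composite z-score is the summed z divided by that SD.
z_composite = (z_verbal + z_math) / math.sqrt(2 + 2 * r)
print(round(z_composite, 2))  # ~2.74: about +2.7 SD among SAT takers
```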
Official stats from the year 2000 (around when the Harvard students in the cited study were tested) showed that the national mean verbal SAT was 505 (SD = 111), the mean math SAT was 514 (SD = 113), and the composite score had a mean of 1019 (SD = 208). Assuming Harvard students have a mean SAT of 1490, they would have scored 2.26 SD higher than the average SAT taker: roughly the top one in 85 SAT takers, and probably around the top one in 255 of all American 17-year-olds (+2.66 SD).
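For anyone who wants to check those figures, a quick sketch assuming normal distributions (the 1490 Harvard mean is the assumption stated above):

```python
from statistics import NormalDist

nd = NormalDist()
mean_composite, sd_composite = 1019, 208  # national SAT composite, year 2000
harvard_sat = 1490                        # assumed Harvard mean SAT

z_takers = (harvard_sat - mean_composite) / sd_composite
rarity_takers = 1 / (1 - nd.cdf(z_takers))
print(round(z_takers, 2), round(rarity_takers))  # ~2.26 SD, ~1 in 85 SAT takers

# Relative to all American 17-year-olds, the post's estimate is +2.66 SD:
print(round(1 / (1 - nd.cdf(2.66))))             # ~1 in 256
```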
Emil then applies the 0.86 test-retest correlation to estimate how SAT takers will score on the WAIS, however this correlation might be way too high because it is based on people taking the same test twice and the SAT and WAIS are not the same test. One’s true score on the SAT will not correlate perfectly with one’s true score on the WAIS.
People who score +2.26 SD above the SAT population on the SAT will average 0.86 × 2.26 SD = 1.94 SD when they take the SAT again, which is the top 2.6% with respect to SAT takers, and the top 0.88% of all 17-year-olds, equivalent to an IQ of 136 (U.S. norms), IQ 134 (U.S. white norms), or IQ 132 (U.S. normal white norms). By contrast, on the WAIS Harvard students average IQ 122 (U.S. normal white norms; corrected for test abbreviation).
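A sketch of that regression step, assuming normality and taking the 0.86 reliability at face value:

```python
from statistics import NormalDist

nd = NormalDist()
reliability = 0.86
z_obtained = 2.26                      # Harvard mean vs. other SAT takers

z_expected = reliability * z_obtained  # expected score on a retest
pct_takers = (1 - nd.cdf(z_expected)) * 100
print(round(z_expected, 2), round(pct_takers, 1))  # ~1.94 SD, top ~2.6%

# Converting the top 0.88% of all 17-year-olds to a deviation IQ (mean 100, SD 15):
print(round(100 + nd.inv_cdf(1 - 0.0088) * 15))    # ~136 (U.S. norms)
```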
In short, the unreliability of the SAT does not seem to explain the severe regression to the mean Harvard students experience when tested on the WAIS.
In short, the unreliability of the SAT does not seem to explain the severe regression to the mean Harvard students experience when tested on the WAIS.
The fact that the study you used used a convenience sample does.
Only if you believe Harvard psychology students are way dumber than the average Harvard student.
There’s no way to know what group the sample group represents generally, be it Harvard psych students or Harvard undergraduates in total. So no, that belief is unnecessary.
So only if you believe Harvard students who volunteer for Harvard IQ studies are significantly dumber than the average Harvard student.
No, only if we believe that the Harvard students who volunteered for THIS study weren’t representative. There’s no way to scientifically generalize the findings to Harvard undergrads.
It’s more like ‘sure, if you think it’s unlikely that a one-off non-representative group could be dumber than the average.’
Is there any rational reason why we should expect the recruitment method in this study to significantly oversample less bright students, other than the reasons mentioned in the post?
Please be precise & clear if you decide to respond.
You’re the one who claims that they are representative.
There are plenty of ways in which a convenience sample can be non-representative. The burden is on you to show that this sample is somehow representative because you are trying to generalize the findings.
Further, my explanation is parsimonious: the sample wasn’t representative, hence the discordance.
Well, it’s a fact that a sample of Harvard students scored much lower on an official IQ test than Harvard students generally scored on the SAT. There are at least 2 possible explanations:
1) SATs are a very imperfect predictor of official IQ scores
2) Students in the study are a very imperfect sample of Harvard undergrad intelligence
I can think of several plausible reasons why 1) might be true (the SAT is more influenced by SES, school quality, and school courses than official IQ tests are), but only one reason why 2) might be true (fewer STEM students partake in psychology studies).
Argument from lack of imagination is no way to meet a burden of proof.
The fliers could have been posted in places where dumber-than-average samples of the student populace congregated, the only people who had time to participate could have been dumber than average, the only people who would take the time to do so for the money may have been dumber than average, etc. etc. etc.
The list of possibilities is endless, which is why generalizing from convenience samples is a bad idea.
Unless you can produce affirmative evidence (beyond incredulity) that the sample was representative, of course.
Swank, you could come up with the same ad hoc excuses to dismiss almost any sample in virtually any study.
Nope. When studies follow random sampling methods, the results can be generalized.
When convenience samples are used, if the researcher can offer evidence that the results are representative (beyond suppositions and incredulity), then the convenience sample may be good info. But usually, convenience samples are meant as pilot studies for further research.
same ad hoc excuses
They aren’t ad hoc excuses. It’s a fundamental defect of convenience samples.
I’ve asked you a few times now to offer evidence that the convenience sample is representative. And you have yet to do so. Do you have any, or not?
Cite one IQ study that is a random sample of the population it measures.
Do you have any evidence or not?
I don’t have to produce anything, and pointing to other research that may share the same flaw doesn’t mean your method here isn’t flawed. Are you trying to say that all IQ research relies on convenience sampling?
Do you have any evidence or not?
The evidence is that the participants were all Harvard students recruited from signup sheets around campus as opposed to one particular part of campus. What evidence do you have that the students were unrepresentative with respect to Harvard cognitive ability?
I don’t have to produce anything,
No one here has to produce anything, but my claim was that the Harvard students underperformed on another IQ test because the SAT is an unrepresentative sample of the brain. Your claim was that they underperformed because the students were an unrepresentative sample of Harvard students. You’re entitled to your opinion, but I see no strong evidence in support of it, and constantly asking me to debunk your claims shows fiercely bad manners.
and pointing to other research that may share the same flaw doesn’t mean your method here isn’t flawed.
I don’t have a method, Swank. I have a fact I’m trying to explain. Why did a sample of Harvard students score much lower on the WAIS than on the SAT? I gave my explanation. You’ve given yours.
Are you trying to say that all IQ research relies on convenience sampling?
Were you born yesterday, or have you just been in a coma for the last 100 years? I’ve seen studies where the performance of a single classroom of children was used to estimate the average IQ of an entire country, and you’re complaining about a study that recruited students from ALL OVER CAMPUS to estimate the average IQ of a single university? Are you really that out of touch with what passes for scientific research? And it’s not just IQ research; it’s virtually all social science research, and medical research for that matter.
Would it be ideal if every study put the names of every single member of a population into a hat, randomly selected a large sample of them, and forced each selected person to participate? Yes. Is that even close to reality? No. Could the sampling of this Harvard study be severely flawed? Yes. But as far as I know, it’s the only study we’ve got.
Swank,
This is an old post, but if you ever check: I think your objections are valid. I’m not a statistics expert or student, but I would expect convenience samples to be frequently unrepresentative. Pumpkin does have a point that it’s realistically impossible to have perfect data, so you need to use what you have. But that doesn’t negate the requirement to discount the results. Unless Pumpkin shows that people who usually volunteer or get paid to take these tests are representative, you shouldn’t assume anything. It’s bad science.
In fact, I suspect (although I have no idea) that people who do take these tests are somewhat different. I can think of a dozen ways that they may be different.
In almost all research you have to make some basic assumptions. You say that’s bad science; okay, so virtually all science is bad then. Maybe most science is bad, which is why old ideas are constantly being overturned, but bad science is better than no science.
Of course you make some basic assumptions. Swank never stated you shouldn’t. He simply said that you need to discount the value of the results. You acknowledge and mention to people the possibly nonrepresentative pool, especially when the assumptions have a heavy influence or are likely to be wrong. In the case here, the convenience sampling might indicate a relationship, and then you administer a more robust test.
Much of what any manager, CEO, or average person decides to do is based upon personal anecdotal evidence, of course. There is no way they can administer a rigorous statistical analysis of every aspect of their daily lives. They see patterns, assume them to be true, and act on them. But they always remind themselves that, given the small sample pool and possible human bias, the perceived pattern might be wrong.
You realize your critique applies to virtually every IQ study ever done right?
Of course it applies to every IQ study, but to varying degrees. And you have to mention it if it’s particularly worrisome. For example, if you tested every single Harvard student, I’d find the data more reliable than data taken only from Harvard students in a specific major, which is what was done here. There is definitely the possibility that the two groups are not the same.
Furthermore, I don’t know if you agree, but every study is suspect given researchers’ temptation to fudge their data. I worked as a student researcher when I was young, and it happened all the time. It was hard to get grant money, and if you didn’t show strong data, you weren’t going to get funded. That is why papers need not only to be peer-reviewed, but the sponsors need to be revealed and parallel research needs to be done. I mention this to emphasize that I don’t assume data to be reliable, and that we need to be careful when drawing conclusions.
One thing that bumpkin and most IQ fetish folks never talk about is the late development phase. Late bloomers are supposedly very intelligent, high-IQ, and often multi-talented.
And this might be true, given the fact that blacks get an early start in elementary school as smarties, but lose out significantly to Whites and Asians in later stages of development.
model mis-specification isn’t considered by pp or Emil.
namely that the distribution of SAT and WAIS scores is not bivariate normal (the model). mis-specification shows up most in the extremes.
given pp’s figures, the SAT and full scale WAIS correlation could still be as high as .89.
But if it’s not a bivariate normal distribution, wouldn’t the correlation be kind of meaningless?
all models are mis-specified pp.
all models are wrong, but some models are useful.
as long as reality is approximately like the model, it’s useful, but deriving the full population correlation from the extremes using the model is always going to give you shit.
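To illustrate with a toy simulation (my own sketch, not from the thread; the fat-tailed noise and every parameter are made up): two variables share a normal common factor but have t-distributed noise, so the joint distribution is not bivariate normal. The full-sample correlation is recovered fine, yet the extreme group regresses far more than the bivariate-normal model predicts.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
g = rng.standard_normal(n)                   # shared factor (normal)
x = g + 0.5 * rng.standard_t(df=3, size=n)   # "SAT": fat-tailed noise
y = g + 0.5 * rng.standard_t(df=3, size=n)   # "WAIS": fat-tailed noise

r = np.corrcoef(x, y)[0, 1]                  # full-population correlation

# Bivariate normality predicts the top 0.1% of x sits at r * z_x on y;
# in this fat-tailed population the group actually lands far lower.
top = x > np.quantile(x, 0.999)
z_x = (x[top].mean() - x.mean()) / x.std()
z_y = (y[top].mean() - y.mean()) / y.std()
print(round(r, 2), round(r * z_x, 2), round(z_y, 2))
```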
the inter-correlations are parameters of the multivariate normal dist.
but the empirical dist…reality…is never parametric.
for example the log of stock returns is approximately normal, and thus one may use this model to price derivatives on stocks.
but it isn’t really normal. it has fat tails for one, and even if it were a parametric dist, its parameters would vary over time. there may be a better/more approximate parametric dist, but it will still be wrong.
it is of course possible via percentile matching to scale scores to fit any distribution you like in one dimension. but in two or more dimensions it may not be.
the percentile to percentile function needn’t even be monotonic.
for example, height percentile and income percentile may increase until a certain point and then decrease as freakishly tall people are discriminated against or have other problems.
Well, as I’ve alluded to, many years ago a Promethean and I were arguing about the average IQ of the (self-made) Forbes 400 and how we could go about predicting it from regression, since economic success is anything but normally distributed. He argued that you simply assign economic success a normalized Z score, so the average Forbes 400 member would have a median money Z score of 5, and multiplying this by the IQ-income correlation of 0.4 would give an IQ Z score of 2 (around IQ 130).
I was very unhappy about this at the time because I felt the average billionaire should be much, much smarter, but I slowly came to accept that IQ 130 was high enough. Still, I felt that forcing extreme incomes to fit a bell curve really underestimates the accomplishments of the super rich, and IQs derived from normalizing income would underestimate their ability; of course, using the actual Z score of their financial success would give ludicrous results.
But of course the actual method used by social scientists when applying statistics to income/wealth data is to do a log transformation.
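A minimal sketch of the two approaches, where the 0.4 correlation and the money Z of +5 come from the exchange above, but the log-scale wealth figures are made-up toy numbers rather than real data:

```python
import math

r = 0.4                                # IQ-income correlation from the post

# Normalized-Z approach: the median Forbes 400 member is assigned money Z = +5.
print(100 + r * 5.0 * 15)              # 130.0, the IQ figure in the post

# Log-transform approach: take the Z score of log wealth instead. Toy numbers:
# suppose log10(net worth) averages 4.7 (about $50,000) with an SD of 1.
z_log = (math.log10(2e9) - 4.7) / 1.0  # a $2 billion fortune: ~+4.6 SD of log wealth
print(round(100 + r * z_log * 15))     # ~128; the answer still hinges on the model
```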
I’ve often wondered what would happen if we tried to predict the average weight of the world’s tallest people from their heights, and the heights of the world’s tallest people from their weights. Height and especially weight are clearly non-Gaussian at the extremes, but would any linear relationship between height and weight extend to the extremes? And if not, would normalizing the Z scores work? I wonder.
for example, height percentile and income percentile may increase until a certain point and then decrease as freakishly tall people are discriminated against or have other problems.
Another reason it may decrease after a certain point is freakishly rich people make money from owning a business rather than getting promoted within a corporation. The latter probably requires more height.
Converting SAT to IQ scores is just a very messy process. You must take into account that students at upper-tier schools like Harvard take the SAT 2-3 times on average (I took it 3 times myself) and are more likely to prep, which probably significantly reduces the g-loading of the test. This also screws up any attempt to convert the reported SAT scores to an IQ score at top schools, because these schools generally only report the average of the best single-sitting scores for test-takers who attend their schools.
The SAT g-loading you cited of around .65 in the previous post is probably about average among SAT test-takers, but is probably somewhat lower for test-takers at top schools and higher for those at lower-tier schools.
I would have assumed picky schools like Harvard would take your average SAT score if you took it multiple times, to avoid getting stuck with mediocre students who only scored high through multiple testing.
But at the same time they may only report the best scores of their students to the media to maximize reputation.
This is an extremely important point you raise, because they were only tested on the WAIS once, so I might not be comparing apples to apples.
The reason schools only report the highest single-sitting scores is because those are the only ones they consider in the applications process. This is called “superscoring”, and has been a staple of college applications for a good long while (however, taking the SAT more than three times looks desperate).