A reader provided a screenshot of his performance on humanbenchmark.com.

The reader states:


humanbenchmark.com, that website where you test your reaction speed, has a wide selection of other psychometric tests, I’d guess a composite score of all the tests would probably have a decently high g-loading. I just want some background info on these tests, if there is any.

As discussed in previous articles in this series, some of the tests (sequence memory, number memory) have their roots in conventional psychometric tests. Tests of reaction time date back to the 19th-century work of Francis Galton, who believed that basic neurological speed predicted intelligence. Unfortunately, Galton’s research was derailed by a lack of reliability (he used only a one-trial measure of reaction time), range restriction (his samples tended to be elite), and improper measures of intelligence with which to relate reaction time (he compared it with school grades, since IQ tests had not yet been invented). As a result, he detected virtually no relationship between reaction time and intellect.

Nearly a century later, Arthur Jensen would revisit Galton’s work, correcting for these problems. He found that when you aggregated many different kinds of reaction time (simple, complex, etc.), measured both by speed and consistency (faster and less variable RTs imply higher intelligence) over many different trials, compared them with measures of IQ (not grades), and corrected for range restriction, the results correlated a potent 0.7 with intelligence.
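To make the range-restriction point concrete, here is a minimal sketch of one standard correction (Thorndike's Case 2 formula). I am not claiming this is the exact procedure Jensen used; the function name and the example numbers are hypothetical and only illustrate how a modest correlation in an elite sample can understate the correlation in the general population.

```python
import math

def correct_range_restriction(r_restricted, sd_ratio):
    """Thorndike Case 2 correction for direct range restriction.

    r_restricted: correlation observed in the restricted (e.g. elite) sample.
    sd_ratio: SD of the selection variable in the full population divided by
              its SD in the restricted sample (1.0 means no restriction).
    """
    u = sd_ratio
    return r_restricted * u / math.sqrt(1 - r_restricted**2 + (r_restricted * u)**2)

# Hypothetical illustration: r = 0.3 in a sample with half the population SD.
print(round(correct_range_restriction(0.3, 2.0), 3))
```

With no restriction (a ratio of 1.0) the formula returns the observed correlation unchanged, which is a quick sanity check on the sketch.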

Unfortunately, the Human Benchmark test uses only simple reaction time (which is much less g-loaded than complex RT), only one type of simple reaction time (an aggregate of several types is more g-loaded), and measures only speed (variability is much more g-loaded), and it does not provide a composite score weighted to maximize g-loading. As a result, on the whole the Human Benchmark tests seem inferior to the game THINKFAST, which a bunch of us played circa 2000. So accurate was THINKFAST that the Prometheus Society considered using it as an entrance requirement, with one internal study finding that one’s physiological limit on THINKFAST correlated a potent 0.7 with SAT scores in one small sample of academically homogeneous people. Having people practice until hitting their physiological limit was a great way to neutralize practice effects because everyone must practice until their progress plateaus.

Sadly, this innovative research petered out when people worried that Thinkfast might give different results depending on the computer. People fantasized about Thinkfast being on a standardized handheld device so scores could be comparable, but in those days, few people imagined we’d one day all have iPhones and iPads.

The reader continues:


I’ve also attached a screenshot of all my average scores, though I’ll note that some scores are inflated since I’ve done all the tests many times and I often don’t bother finishing the test if I do bad. The strange thing about these scores is that by more conventional measures both my verbal IQ and working memory are pretty average, yet I’m able to score above the 99.9 percentile on 2 of these tests. I think this points to the fact that memory is an ability that is much broader than most IQ models would suggest. Like the verbal memory test in particular, I seem to be using a very different part of my brain compared to more typical tests like digit span. I’d also wager that most of the variation in working memory can be explained by chunking/processing abilities rather than raw storage capacity.
Also, what does the strength of the practice effect really say about a test? None of these tests really have a pattern or trick to them, yet for some of them my score has improved a lot from the first time I did them.

This is an extremely important question. In complex cognitive tasks like chess or conventional IQ tests, practice improves performance because we learn strategies, but on elementary cognitive tasks like Human Benchmark and Thinkfast, fewer strategies are possible so one wonders if there’s an increase in raw brain power.

The analogy I make is height vs. muscle. If I repeatedly had my height measured, I might score a bit higher with practice. Not because I was genuinely getting taller, but because I was learning subtle tricks like how to stand straighter. By contrast, if I had my strength measured every day, I’d show more increase, but this increase would not simply be because I acquired tricks to do better (how I position the barbells in my hands) but because of a genuine increase in strength.

So is intelligence more analogous to height or physical strength (the latter being far more malleable)? Is the practice-induced increase in Human Benchmark tests an acquired strategy (even a subconscious one) or a real improvement, and how do we even operationalize the difference?

If practicing elementary cognitive tasks really did improve intelligence we’d expect brain-training games to improve IQ, but apparently they do not. Jordan Peterson explains that the problem is that cognitive practice in one domain does not translate to other ones.

On the other hand, why should anyone expect brain training to transcend domains? When a weight lifter does bicep curls, he doesn’t expect it to make his legs any stronger, so why should someone practicing visual memory expect to see an increase in verbal memory, let alone overall IQ?

But how can we know if we’ve even improved a specific part of intelligence rather than just become more test savvy? We know that weight lifting has improved our strength, and not just our technique, because we can see our muscles getting bigger, so perhaps cognitive training games might make certain brain parts bigger.

The groundbreaking London Taxi Cab study, published in 2000, used MRI technology to compare the brains of experienced taxi cab drivers and bus drivers who drive the city streets of London every day. In contrast to bus drivers, whose driving routes are well-established and unchanging, London taxi drivers undergo extensive training to learn how to navigate to thousands of places within the city. This makes them an ideal group to use to study the effects of spatial experience on brain structure.

The study focused on the hippocampus, which plays a role in facilitating spatial memory in the form of navigation. The MRI revealed that the posterior hippocampi of the taxi drivers were much larger than that of the bus drivers (who served as the control subjects). Even more exciting was that the size of the hippocampus directly correlated with the length of time that someone was a taxi driver–the longer someone drove a taxi, the larger their hippocampus.

The London Taxi Cab Study provides a compelling example of the brain’s neuroplasticity, or ability to reorganize and transform itself as it is exposed to learning and new experiences. Having to constantly learn new routes in the city forced the taxi cab drivers’ brains to create new neural pathways “in response to the need to store an increasingly detailed spatial representation.” These pathways permanently changed the structure and size of the brain, an amazing example of the living brain at work.

Source

Assuming the brains of the taxi drivers actually changed (as opposed to the sample changing because less spatially gifted drivers left the job), it might be possible to increase specific parts of intelligence, but since there are so many different parts, it’s perhaps impossible to ever increase overall intelligence (or overall brain size) by more than a trivial degree. We can improve our overall muscle mass because our muscles are outside our skeleton; by contrast, our brains are inside our cranium, so their growth is constrained. It could be that increasing the size of one part of the brain requires a corresponding decrease in other parts, to keep the overall brain from getting too big for its skull.

My research assistant 150 IQ Ganzir also weighed in on the reader’s questions, writing:

The first aspect of this score profile I noticed is the absence of any huge dips, the 10 on Number Memory notwithstanding, since a tiny change in raw score on that test can dramatically alter your percentile ranking. Given that all of this subject’s scores on the more IQ-like tests are well above average compared even to other HumanBenchmark users, who themselves are undoubtedly self-selected for superior proficiency on these types of tasks, we wouldn’t expect their reaction time to be particularly fast, but it is. Our subject appears to be a jack-of-all-trades, if you will, at these tasks. Simple reaction time has only a weak correlation of about -0.2 to -0.4 with IQ, according to Arthur Jensen on page 229 of The g factor. Note that the correlation is negative because a faster reaction speed implies a lower reaction time.

The commenter mentions: “I’ve also attached a screenshot of all my average scores, though I’ll note that some scores are inflated since I’ve done all the tests many times and I often don’t bother finishing the test if I do bad.” If true, this would indeed cause a statistical upward bias, but I have no idea how to even begin calculating the size of that. However, if the tests are reliable in the statistical sense, meaning they give similar scores with each administration, then the average score increase couldn’t be too large. But, then again, if the commenter was reaching nearly the same score every time, why would they restart on a bad run? High intra-test score variability might indicate executive functioning problems.
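One rough way to put a number on that upward bias is sketched below. It rests on assumptions that are not in the reader's comment: that a person's run-to-run scores are normally distributed around a stable personal mean, and that every run falling below some cutoff is abandoned and never recorded. Under those assumptions the recorded average is the mean of a truncated normal distribution. The function name and the example values are hypothetical.

```python
from statistics import NormalDist

def abandoned_run_bias(within_sd, cutoff_z):
    """Expected inflation of the recorded average when bad runs are dropped.

    within_sd: the person's run-to-run standard deviation (in score units).
    cutoff_z: runs scoring below (personal mean + cutoff_z * within_sd)
              are assumed to be abandoned and never recorded.
    """
    std = NormalDist()  # standard normal, mean 0 and SD 1
    kept_fraction = 1 - std.cdf(cutoff_z)
    # Mean of a normal distribution truncated from below, minus the true mean.
    return within_sd * std.pdf(cutoff_z) / kept_fraction

# Hypothetical illustration: dropping every below-average run.
print(round(abandoned_run_bias(10.0, 0.0), 2))  # noisy test
print(round(abandoned_run_bias(2.0, 0.0), 2))   # reliable test
```

The bias scales directly with the run-to-run standard deviation, which matches the reasoning above: on a reliable test, restarting bad runs cannot inflate the average by much.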

The commenter notes that their verbal IQ and working memory are “pretty average” on other tests, but their score on verbal memory here is so high relative to other HumanBenchmark users that the system just gives it 100th percentile without discriminating further. (I know that it can’t literally be 100th percentile, as I and several other people I know have achieved higher scores.) A possible contributing factor is that HumanBenchmark users may tend to have less than long attention spans, inhibiting performance on this test, on which reaching one’s potential may take quite a while, especially for higher scorers.

Our correspondent also writes: “Like the verbal memory test in particular, I seem to be using a very different part of my brain compared to more typical tests like digit span. I’d also wager that most of the variation in working memory can be explained by chunking/processing abilities rather than raw storage capacity.” Of course, I don’t think it’s possible to determine by introspection which part(s) of the brain you’re using on a given task, but I think I understand the subjective experience described here. As for chunking/processing abilities versus raw storage capacity, I’m not sure what’s implied here. The human brain could be described as a massively parallel computer, and it naturally processes things in chunks. If “chunking” refers to purposely learnt mnemonics, such as the mnemonic major system, then Goodhart’s Law applies here because learnt skills lose their g-loading.

The commenter thus wonders about the continued meaning of their scores: “Also, what does the strength of the practice effect really say about a test? None of these tests really have a pattern or trick to them, yet for some of them my score has improved a lot from the first time I did them.” Unfortunately, without studies of these tests specifically, we can’t know the extent to which Goodhart’s Law applies. Even analyses of seemingly similar tests from mainstream psychometrics wouldn’t be sufficient, since the HumanBenchmark versions are subtly but crucially different. All I can say is that only someone of uncommonly high cognitive capacity could produce this score profile regardless of how much time they spent practicing, and that, with no indication of how rare your scores are compared to the general population, greater precision is currently almost meaningless.

Scores on the “Chimp Test,” or at least the version on HumanBenchmark, are also almost meaningless because unlimited time is allowed to review the digits’ locations before answering, making it less a test of visual working memory and more a test of how long the testee is willing to stare at boxes. Also, most people will probably on average score higher on the HumanBenchmark “Number Memory” test than on the clinical version of the Digit Span test, since the former presents the digits simultaneously and allows a few seconds to mentally review them, whereas, in the latter, each digit is read only once with no opportunity for review.

Finally, the subject’s strong performances on Typing and Aim Trainer make me suspect a background in competitive computer gaming.