Last week commenter Kiwi-Anon left me the following message about the KAMIKAZE which you can take here.
Alright, I’ve just finished my final draft. I added some floor extension items that I know a mathematically challenged kid I tutor can solve, so hopefully everybody should be able to get at least one correct. I think 6-7 correct should be about average among normal people. Looking over the test again, the ceiling is probably not as high as I originally thought, but should still be high enough for this blog – there will probably be a few stray Feynmans who breeze through it, but most won’t hit the ceiling. BTW, it’s probably a good idea to separate out both people who have a math/physics background and people with math competition experience with a demographic questionnaire. Let me know if any of the questions are ambiguous or if any of my answers seem wrong (I have checked them, but I am very sleep deprived right now so there’s a chance I made a mistake). Feel free to rearrange the questions into whatever you feel is a better order of difficulty (they are already loosely in such an order), and of course, you don’t have to keep the silly name for the test. You may want to try the test for yourself to see if my time limit seems right; I want fairly smart people to have at least 9 minutes for the last 3 questions.
Here’s the test, hopefully it meets your criteria:
The Kiwi-Anon Mathematical Intelligence, Knowledge, And Zeal Examination (K.A.M.I.K.A.Z.E.) (hey, acronyms aren’t my strong suit!)
Hey, that’s me!
I guess you didn’t have any issues with any preliminary testing you did? Ganzir told me he left some comments about the test with you, but he didn’t say what they were.
I am interested to see how well scores on this correlate with the WAIS and the SAT math section.
Anyway, hope everyone enjoys the test, and thanks for posting it PP. If anyone finds any problems with the questions on the test, reply to this comment about them.
I thought PP had your e-mail and would forward the comments to you, but he apparently doesn’t and didn’t. I’ll reply to you about it.
It’s a nice initiative, but I some of the questions don’t make any sense: [redacted by pp, 2022-04-19] The fact that this question slipped through suggests low-IQ and low conscientiousness on the part of the maker, both of which convinced me not to do the test.
Which question was it?
The traffic question. It’s too late to send you his comment, but now that I have your email, I’ll send redacted feedback to you in future.
So, did he find an error? That is one of the ones that I am very confident shouldn’t be problematic since I David Wechsler’d it (to coin a euphemism) out of an obscure problem book, and I know it is solvable because I managed to solve it when I first encountered it. I thought it was a fairly fun one. [redacted by pp, 2022-04-19] Maybe the assumptions you need to make in order to solve the question are less obvious to non-mathematicians – do you think it needs to be re-written? Most of the other questions are original (or at least heavily re-written to avoid googleability), I was worried there was a problem with one of the original ones.
Perhaps it would be better to change the wording to [redacted by pp, 2022-04-19] (I thought this was a reasonable assumption anyway).
Too late to change it now anyway. Part of intelligence is the ability to adapt to ambiguity so don’t worry if every question isn’t clear. Mistakes add color.
I solved the question in about five to ten seconds of reflection. Assumptions have to be made, but c’est la vie.
Yes, I suppose someone who is unable to introduce reasonable simplifying assumptions when solving a math problem would probably be unable to apply any of their mathematical knowledge very well in real life anyway. It’s almost like the mathematical equivalent of the common sense type questions on Wechsler Comprehension tests.
but I some of the questions don’t make any sense
This also doesn’t make any sense
Pumpkin where is your response to my question?
Great another test that can tell me not to quit my day job…
i got 3 right 😦
Got 9 but after clearing cookies and re-doing it (and feeding nonsense data in the optional section this time so it can be safely discarded) I got 12.
I made a silly arithmetic mistake on one and a slight conceptual error on another two that resulted in being 1 off the answer. I tend to mess up questions that deal with calculating the number of items within a set boundary. e.g. 1,2,3,4,5. There are 3 items between 1&5 but I’d just do 5 – 1 = 4 if I’m not thinking carefully enough.
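The counting slip described above can be made concrete with a minimal sketch. The function names are hypothetical and chosen only for illustration; the point is that plain subtraction counts the gaps between the endpoints, not the items.

```python
def count_inclusive(low: int, high: int) -> int:
    """Number of integers from low to high, counting both endpoints."""
    return high - low + 1


def count_strictly_between(low: int, high: int) -> int:
    """Number of integers strictly between low and high (endpoints excluded)."""
    return max(high - low - 1, 0)


print(count_inclusive(1, 5))         # counts 1, 2, 3, 4, 5
print(count_strictly_between(1, 5))  # counts 2, 3, 4
print(5 - 1)                         # the tempting shortcut: counts gaps, not items
```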
For research purposes, I submitted the same answers I got right when I took it under the 35-minute time limit.
By the way, if you want to calculate difficulty levels for each question, then in flexiquiz you can go into “analyze”, then change report type to individual responses, and then export it into an excel document. This will show you each individual’s responses for each question (and how many points their response got) in a spreadsheet along with other data. If you calculate the average number of points for a certain question then that will give you a decimal number between 0 and 1 that corresponds to the proportion of people who got the question right. This also allows you to sort individuals by IQ/SAT data, so you can plot IQ with number of correct answers etc.
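For anyone who prefers a script to a spreadsheet, here is a rough sketch of the same calculation. The `responses` rows are made-up placeholder data, and the question labels and row layout are assumptions, not the actual flexiquiz export format.

```python
# Hypothetical export: one row per test taker, one 0/1 point value per question.
responses = [
    {"Q1": 1, "Q2": 1, "Q3": 0},
    {"Q1": 1, "Q2": 0, "Q3": 0},
    {"Q1": 1, "Q2": 1, "Q3": 1},
    {"Q1": 1, "Q2": 1, "Q3": 0},
]


def item_difficulty(rows):
    """Return {question: proportion of takers who answered it correctly}."""
    questions = rows[0].keys()
    return {q: sum(row[q] for row in rows) / len(rows) for q in questions}


def total_scores(rows):
    """Return each taker's number of correct answers, in row order."""
    return [sum(row.values()) for row in rows]


difficulty = item_difficulty(responses)
totals = total_scores(responses)
print(difficulty)
print(totals)
```

The proportions play the role of the column averages described above; the totals could then be plotted against the self-reported IQ/SAT data.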
fascinating! I don’t even know if I have excel but I’m sure pill uses it to keep track of all the licence plates he sees.
I don’t have excel either, but you can do it in google sheets.
Ok so I haven’t showered for a month and a half. I must say I feel great.
you’re so socially retarded
i shower twice a day!
So is this what your blog is all about from now on? Posting other peoples stuff. Why not my stuff? I have great ideas for articles.
For my intensive business classes you have to use a lot of Excel. all finance is done on Excel!
Mug and Pill wouldnt know that despite being in the field because theyre dinosaurs. but the point im trying to make is Excel is a key component of the financial sector and i use it everyday at school.
they teach you how to use Excel extensively and you have to be able to use the application to your advantage as often as possible.
Excel is super important where I work too. It’s a miracle I am where I am without it.
i feel that i feel that.
15/17. Are there any norms available?
I’m looking for jobs in Singapore now. Since the Asians are allowed to be nationalist (they have no danish there), they offer jobs to Singaporeans first before foreignors so it will be very hard getting even interviews.
So I interviewed with Goldman Sachs for a compliance role and didn’t get it. If I was black they would have begged me to come.
If I was black I would at least be a millionaire. And there would be about 25% of women desperate to have sex with me. And the government and media would treat me like some sort of religious figure.
If I was the same height, same IQ and same social class, I would basically be unstoppable if I was black. Some danish producer might even hit me up to be a talk show host.
LOL! Only because Oprah paved the way.
and she was selected by obese white gentiles according to peepee.
they expressed their EGI by favoring a fellow fat person.
the above is also the basis of a question on my test the AUTISM which peepee is afraid to take. sad.
peepee says black men are ugly then she says it’s racism that mel gibson is chosen for his looks. this is why women shouldn’t be allowed to vote.
notice peepee hasn’t claimed that professional sports are “racist”.
Never said Gibson’s success was because of racism, but it is in part because he’s white & the free market valued whiteness
In contrast if Mel Gibson was black, he would not only still have a Hollywood career but he would have to carry around all the oscars he would have won in a wheelbarrow.
Gibson would never have made it in Hollywood in the first place had he been black. Racism was much worse in his days & he built his whole career on his blue eyed white looks.
before mel was a star there was a whole genre of movies called blaxsploitation.
https://en.wikipedia.org/wiki/Shaft_(1971_film)
https://en.wikipedia.org/wiki/Blaxploitation
And?
depp sues for defamation.
depp is another example. he’s not especially good looking or tall but he has a good voice.
PP where are comments
The answer key needs to be updated for two of the probability questions. It seems to accept only percentages and not decimals, even though it does not specify this requirement (i.e. it accepts 50% but not .5). I tested to confirm. Took my 14 down to a 12.
Yes, this caused the super easy Die question to be failed by a shocking number of people. It works if you include a zero, which you’re supposed to do. Don’t know if it makes sense to fix after so many people have already tested.
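One way a grader could avoid this kind of format trap is to normalize every answer to a number before comparing it with the key. This is only a sketch of that idea: `normalize_probability` and `is_correct` are hypothetical helpers, and nothing here describes how flexiquiz actually matches answers.

```python
def normalize_probability(answer: str) -> float:
    """Parse '50%', '.5', '0.5' or '1/2' into a probability as a float."""
    text = answer.strip().replace(" ", "")
    if text.endswith("%"):
        return float(text[:-1]) / 100
    if "/" in text:
        numerator, denominator = text.split("/", 1)
        return float(numerator) / float(denominator)
    return float(text)


def is_correct(answer: str, key: float, tolerance: float = 1e-3) -> bool:
    """Compare a free-text answer with the numeric key, whatever the format."""
    try:
        return abs(normalize_probability(answer) - key) <= tolerance
    except (ValueError, ZeroDivisionError):
        return False  # unparseable input is simply marked wrong


for attempt in ["50%", ".5", "0.5", "1/2", "half"]:
    print(attempt, is_correct(attempt, 0.5))
```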
Worth noting that the difference between you getting a 14 or a 12 is only about 2 IQ points. But yes, I think I will correct it.
fixed
Great test! I enjoyed it!
It would probably be of interest to relax the time constraints and see what happens. Keeping track of how long it took people to answer individual items would likely add insight into the scores and into which particular items people had trouble with. It took me about 2 hours to figure out the microbe question. As soon as you set it up in the right way, the answer follows almost immediately.
Glad you liked it!
I tried to choose or create questions that were easy to do once you have the right perspective, but which were hard to figure out the right perspective for quickly. I figured that would make it so that only exceptionally clear and efficient thinkers would be able to complete it in the time limit, without placing too much emphasis on computational speed. I wanted the test ceiling to be significantly above my own ability, hence I made the time limit quite harsh, as I know that I would not be able to solve all these problems in 35 minutes. I don’t think I’ll do an extended time limit version of the test since I want to focus on getting data on this version to create an accurate norm. Unfortunately I don’t think it would be possible to see how long a person spent on each question since all questions are on the same page.
Hmm, I have reworked question 17 and while my new answer is very similar to my first answer it is now slightly different. [redacted by pp, 2022-04-24]
To prevent the test from being spoiled, I redacted your comment, but before doing so, forwarded it to Ganzir who will hopefully forward it to kiwi anon.
Do you not have my email address PP? I sent you some emails, not sure if they got through
I just found your email. It had been wrongly classified as junk LOL. I’ll respond within 48 hours.
To Psychometric:
I’m not exactly sure what your solution is, but I can guarantee you that there is an exact solution that is expressible within the constraints stipulated by the question. If you write a more detailed explanation of your issues with the problem in this comment section then PP can send them to me.
Thank you kiwianon for your offer! However, I can now see a remarkably easy approach to this question. I am interested whether there might be another solution approach to this problem? I have applied some programming in a spreadsheet environment with some success; perhaps a more formal programming environment would be even better.
Question 17 was the real highlight of the test. Many of the other questions were more or less familiar from other IQ tests.
I think perhaps the answer to question 6 (the ladder question) might be misleading. Apparently the correct answer does not consider the shape that is actually traced out to be correct but instead credits the response which encompasses the entire shape (even the part that is “imaginary”).
Would be of interest if a more thorough analysis of the results could be posted to the site. Possibly a correlation matrix, that showed how each item correlated with every other item (also perhaps the total score). The total scores that were tied might be ordered by which one answered the most difficult questions correctly. Possibly even, if some total scores were lower, yet the mix of questions answered correctly were more difficult then they could be ranked above those test scores that answered more of the easier questions.
So many online tests are fixated on the total score on IQ tests, yet clearly so many of the early items have so little in the way of discriminative power. Clearly, missing question 1 (from some oversight) while answering question 17 would have shown much much higher cognitive ability level, even when they both counted for exactly the same incremental score. An alternative scoring method could be to simply grade testees by the highest theta demonstrated during the test. There would then be an incentive to go directly to the last (presumably most difficult question) and focus on it; instead of working forward through a test testees might work backwards. Yet, the typical and intended order of answering the questions (i.e., from the start) reinforces interpreting total score as of central meaningfulness whereas the most important interpretation from a psychometric perspective is the g loading of the questions.
It is disappointing to me that empty psychometrics is the rule, not the exception, in academic environments: tests then all largely become some great athletic sprint or marathon, often won through sheer brute exertion and not primarily through mind power. During my studies, thinking profoundly about difficult problems was the central interest, not reducing curricula down to bite-size, video-game-like tidbits. One way around such brutish testing could be to self-declare an expected competency level and then apply adaptive testing proximal to the self-stated region. Tests could then be shorter, less stressful and provide better insight into ability level.
Maybe there is another solution, I’m not sure. I don’t know how programming in a spreadsheet would help much; the approach I know about is simple enough to do on pen and paper in a few lines, or possibly even in one’s head.
I don’t have that data yet, but when I get it I’m going to experiment with these sorts of analyses. Ganzir wanted to do Rasch analysis on the responses to the test, which I believe is similar to what you are talking about. But really, I can’t see much point in making serious use of super sophisticated scoring methods on a test like this.
A better way of creating a score for the test (instead of using total score) might be to multiply the probability of a correct answer for each correctly answered question of a testee.
Those testees who correctly answered the 16 questions that at least 1 person answered correctly (i.e., all the questions except question 17) would achieve a score of 1.25752E-6 (as per the latest posted statistics from the test). One might then take the negative base-10 log of this and report the result as 5.9. Those testees that could only answer question 1 would achieve a score of 1 (probability of a correct response = 1), and -log10(1) = 0.
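A minimal sketch of this product scoring follows, using the proportions correct from the statistics posted in this thread. The function name `product_score` and the choice to key questions by number are illustrative assumptions.

```python
import math

# Proportion of test takers answering each question correctly (questions 1-16),
# copied from the statistics posted in this thread.
PROPORTION_CORRECT = {
    1: 1.00, 2: 0.93, 3: 0.86, 4: 0.64, 5: 0.57, 6: 0.57, 7: 0.50, 8: 0.50,
    9: 0.50, 10: 0.36, 11: 0.36, 12: 0.36, 13: 0.21, 14: 0.21, 15: 0.21, 16: 0.14,
}


def product_score(correct_questions):
    """Negative log10 of the product of pass rates of the questions answered correctly."""
    product = math.prod(PROPORTION_CORRECT[q] for q in correct_questions)
    # product <= 1, so its log is <= 0; abs() flips the sign and avoids a "-0.0" result.
    return abs(math.log10(product))


print(round(product_score(range(1, 17)), 2))  # all 16 questions
print(product_score([1]))                     # only the easiest question
print(round(product_score([1, 2]), 3), round(product_score([1, 16]), 3))
```

The last line shows two testees with the same total score (2 correct) receiving different product scores, because one of them solved a much harder item.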
This scoring method would help to overcome the problem that arises when higher ability testees incorrectly answer easy items. When total scores are used, an easy item has just as much weight as the most difficult question even when the probability of correctly answering the question approaches 100%. Such easy questions do not typically provide meaningful information about testees.
This scoring method would also overcome the problem that arises when 2 candidates receive the same total score, though have a different mix of correctly answered questions. With the product method, such candidates would not receive the same score. Would be interesting to see how the product scoring method’s correlation with other tests compared to that of the total score method’s correlation.
Why go to all the trouble? I’m sure there’s a very high correlation between hardest problem solved and number of problems solved. Now you likely eventually reach a point where total score continues to increase without a corresponding increase in hardest problem solved and perhaps that’s a good way of defining a test’s functional ceiling.
Plus I think your ability to quickly work through items you know how to solve without making any careless mistakes gets at another ability that also correlates with g. As does choosing efficient strategies that leave you enough time to solve the hardest problems.
The motivation is to move away from a classical testing perspective and towards an item response theory view. Classical testing applies the brute force approach of having everyone run a marathon with the winner being determined by who sweats the most (though not necessarily rewarding the most thoughtful); it’s not an elegant method to find those with the most brain power. In fact, those with more brain power could be too exhausted by the time they got to the last questions to give them their best effort. For those with the most cognitive ability, most of the earlier questions will have seemed largely superfluous. Yet, this testing paradigm is the go-to standard in education as it acts as an effective tool to exert control over students while largely ignoring the psychometric weaknesses involved. Those who want to win the race are then co-opted to run the same race as everyone else, even when the information value of most of the questions is minimal.
Item response theory is not as interested in marathons: It is interested in finding the best estimate of theta (ability level) without all of the exertion. Using adaptive testing one might only need to ask a few questions to make a good estimate.
In an educational setting this might look like this: when a student is writing a course exam, the first question might be for the student to provide their best estimate of what they think their exam score will be. Who better to ask than the student themselves? They probably have a pretty good idea! One might even ask them how many hours they studied and what their previous grades in the course had been to help them formulate a good estimate. Using adaptive testing with item response theory, one could then ask a question that reflected the student’s estimated exam grade.
If the student got this one question correct, the computer might ask if the student would be satisfied ending the test at that point and being awarded the grade that they suggested reflected their abilities. The student might refuse and ask for a more difficult question for a higher grade (or accept and end the testing). If the answer was wrong, the student could be given an easier question. The questions would continue to become easier until a correct answer was given. Such a testing approach could yield a very accurate estimate of a testee’s ability.
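The procedure just described can be sketched as a small loop. Everything here is a simplifying assumption for illustration: the student is modelled as deterministically solving every item up to some level (a real IRT model would use probabilities), a failed attempt at a harder item leaves the last passed level in place, and the names `adaptive_test`, `can_solve` and `wants_harder` are hypothetical.

```python
def adaptive_test(levels, start_level, can_solve, wants_harder):
    """Return (awarded_level, levels_asked) for one student.

    levels       -- difficulty levels (e.g. grades) sorted from easiest to hardest
    start_level  -- the grade the student predicts for themselves
    can_solve    -- callable(level) -> bool, True if the item at that level is answered correctly
    wants_harder -- callable(level) -> bool, True if the student declines to stop at that level
    """
    idx = levels.index(start_level)
    asked = [levels[idx]]

    if can_solve(levels[idx]):
        awarded = levels[idx]
        # Correct answer: the student may stop here or ask for a harder question.
        while idx + 1 < len(levels) and wants_harder(awarded):
            idx += 1
            asked.append(levels[idx])
            if not can_solve(levels[idx]):
                break  # assumption: the student keeps the last level passed
            awarded = levels[idx]
        return awarded, asked

    # Wrong answer: keep asking easier questions until one is answered correctly.
    while idx > 0:
        idx -= 1
        asked.append(levels[idx])
        if can_solve(levels[idx]):
            return levels[idx], asked
    return None, asked  # no question was answered correctly


def solves_up_to_70(level):
    return level <= 70


levels = [50, 60, 70, 80, 90]
print(adaptive_test(levels, 80, solves_up_to_70, lambda level: False))  # overestimated
print(adaptive_test(levels, 60, solves_up_to_70, lambda level: True))   # underestimated
```

In both runs the student ends at the same level after two or three questions, which is the efficiency argument being made above.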
This adaptive approach using IRT would give a more accurate assessment of ability without having to use the brute strength approach of a speeded test that often saps people of their high level cognitive abilities before they ever reach the maximal theta questions near the end of the test. With these typical tests the reward for answering the most difficult questions can often be quite marginal. All the questions are given the same weight on the final score, so from a cost benefit perspective, a rational test taker is compelled to start from the beginning and work forward no matter the level of their cognitive ability. With the IRT approach there would be no need to answer the easiest items first (or perhaps at all). It would be best for a testee to go directly to a question which was at their maximal ability level and answer it.
As shown below, an alternative approach to avoid empty ability testing would be to consider the marginal -log10(pr(Q)); this would give one estimate of item difficulty. The far right column shows that with a 100% probability of being correct, -log10(pr(Q1)) = 0. If the cumulative sum of -log10(pr(Q)) for all correct answers were used as the score, question 1 would then be given 0 weight (not 1/17th). Notice also that question 16 would be given about 27 times the weight of question 2 (0.853872 / 0.0315171 ≈ 27.1). Changing the weighting in this way would change the testing incentives and strategies of test takers. Starting from the first question and working forward might no longer make sense. The test would now highly reward test takers who went to the questions that they expected would have the best cost-benefit return (with cost measured in time spent).
These insights would probably have greatly strengthened the Kamikaze test. The roughly 5 ability ranges tested (~80%, 50%, 35%, 20% and ~1%) could then be probed more efficiently and deeply to find the true theta level.
Notice that answering question 16 would now have nearly the same test weight (0.854, bottom far right) as answering the first 7 questions (1.080, fifth column from left, row 7 “dice”). In an adaptive test, questions with easier thetas might even be deemed correctly “answered” when questions with more difficult thetas were answered.
Moving to an IRT perspective means that merely aggregating questions with very different g loadings can be avoided in order to have a much better estimate of the testees ability while often using many fewer questions. In such a testing environment, artificial time limits would no longer have the same relevance. One could ask a difficult question and allow candidates time to think carefully about it. 35 minutes could offer a substantial amount of time to demonstrate true creative solving potential.
Notably, with IRT there would then no longer be a high correlation between total number of questions answered and hardest problem solved. In classical testing this high correlation emerges inevitably because, for a given level of attained maximal theta (i.e., ability demonstrated on a test), all previous items (i.e., the easier items before maximal theta is reached) would have a substantial probability of being answered correctly (pr(theta) = 1/(1 + e^(-(theta - b))), i.e., the probability that a testee with ability level theta correctly answers a question of difficulty b), and given the cost-benefit balance these questions should be attempted (because total score is linearly related to the number of items correctly answered and not their g loading). Classical testing incentivizes testees to do busy work and not brain work. There is just too much reward given for answering question 1 and the easy questions and not the harder questions. Changing the incentive structure would likely reveal more of the hidden high theta testees who would otherwise be swamped with low theta questions and would not have enough time to answer the most difficult questions. For lower theta testees, additional time probably would not be as helpful, as they would not have the knowledge base to make the needed conceptual leaps. It is not so much that high ability testees cannot answer the more difficult questions, as that there is not enough reason (or time) to even try.
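To make the logistic formula in the paragraph above concrete, here is a small sketch. The item difficulties are made-up values on an arbitrary logit scale, not estimates from the Kamikaze data, and `p_correct` and `expected_total` are hypothetical names.

```python
import math


def p_correct(theta: float, b: float) -> float:
    """Probability that a testee of ability theta answers an item of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def expected_total(theta: float, difficulties) -> float:
    """Expected number of correct answers across a set of items."""
    return sum(p_correct(theta, b) for b in difficulties)


# Hypothetical difficulties, from very easy (-3) to very hard (+3).
difficulties = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]

for theta in (-1.0, 0.0, 2.0):
    probabilities = [round(p_correct(theta, b), 2) for b in difficulties]
    print(theta, probabilities, round(expected_total(theta, difficulties), 2))
```

An item whose difficulty equals the testee's ability is a coin flip, while items well below that level are answered correctly almost every time, which is why total score and hardest item solved move together under classical scoring.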
 #  Item         Proportion  Product     Cum. sum of    Product       Cum. sum      Marginal
                 correct     from top    log from top   from bottom   from bottom   log
 1  10 percent   1           1           0              0.0000013     5.8860566     0
 2  paint        0.93        0.93        0.0315171      0.0000013     5.8860566     0.0315171
 3  die          0.86        0.7998      0.0970186      0.0000014     5.853872      0.0655015
 4  donuts       0.64        0.511872    0.2908386      0.0000016     5.79588       0.19382
 5  sponge       0.57        0.291767    0.5349638      0.0000025     5.60206       0.2441251
 6  fence        0.57        0.1663072   0.7790889      0.0000043     5.3665315     0.2441251
 7  dice         0.5         0.0831536   1.0801189      0.0000076     5.1191864     0.30103
 8  cards        0.5         0.0415768   1.3811489      0.0000151     4.8210231     0.30103
 9  jellybeans   0.5         0.0207884   1.6821789      0.0000302     4.5199931     0.30103
10  male         0.36        0.0074838   2.1258778      0.0000605     4.2182446     0.4436975
11  book         0.36        0.0026942   2.5695702      0.000168      3.7746907     0.4436975
12  bicycle      0.36        0.0009699   3.013273       0.0004668     3.3308692     0.4436975
13  ladder       0.21        0.0002037   3.691009       0.0012965     2.8872275     0.6777807
14  29 men       0.21        0.0000428   4.3685562      0.006174      2.2094334     0.6777807
15  triangular   0.21        0.000009    5.0457575      0.0294        1.5316527     0.6777807
16  traffic      0.14        0.0000013   5.8860566      0.14          0.853872      0.853872
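For anyone who wants to recompute the log columns of the table, here is a rough sketch that starts from the proportions correct alone. The variable names are illustrative. Because it sums unrounded logs, the cumulative figures can differ from the table in the last digits; the table's cumulative sums appear to have been taken from rounded products.

```python
import math
from itertools import accumulate

# (item label, proportion correct) for questions 1-16, copied from the table.
ITEMS = [
    ("10 percent", 1.00), ("paint", 0.93), ("die", 0.86), ("donuts", 0.64),
    ("sponge", 0.57), ("fence", 0.57), ("dice", 0.50), ("cards", 0.50),
    ("jellybeans", 0.50), ("male", 0.36), ("book", 0.36), ("bicycle", 0.36),
    ("ladder", 0.21), ("29 men", 0.21), ("triangular", 0.21), ("traffic", 0.14),
]

# Marginal weight of each item: -log10(proportion correct).
# abs() is used because the log is never positive here and it avoids printing "-0.0".
marginal = [abs(math.log10(p)) for _, p in ITEMS]
cum_from_top = list(accumulate(marginal))
cum_from_bottom = list(accumulate(reversed(marginal)))[::-1]

for (name, p), m, top, bottom in zip(ITEMS, marginal, cum_from_top, cum_from_bottom):
    print(f"{name:<11} {p:>5.2f} {m:>9.4f} {top:>9.4f} {bottom:>9.4f}")

# Relative weight of the hardest listed item versus the second-easiest one.
weight_ratio = marginal[15] / marginal[1]
print(f"traffic / paint weight ratio: {weight_ratio:.1f}")
```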
I had a lot of fun with this test. The last four questions in particular were very enjoyable. Do you know of other tests that are similar to this? I saw that the original PATMA link from here is dead. Is that being hosted anywhere now? Thanks!
I’m glad you liked it!