Solves maths Olympiad problems, doesn’t grasp place value: What kind of intelligence is GPT-4?
How large language models expose the problems of intelligence benchmarks
My social media feeds have been saturated with visuals that testify to AI’s growing capabilities. Among the most eye-popping is a chart suggesting GPT-4 has leapt past humans across most domains.
GPT-4 has passed the bar exam (placing in the 90th percentile) and, when paired with the Wolfram Alpha plugin, achieves 95% in A Level Maths, prompting Conrad Wolfram to declare ‘game over’ for the subject. The example that stopped me in my tracks is from the evocatively titled ‘Sparks of AGI’ paper, where GPT-4 successfully solves a modified maths Olympiad problem. For the curious, the problem and GPT-4’s full solution are reproduced in the paper.
This is a perfectly valid solution to a genuinely tough problem: creative and meticulously reasoned. The sparks are flying; the hype surrounding AI’s emergent thinking capabilities is surely justified.
Mercurial problem solvers
But there’s more to it. Large language models are the most mercurial of problem solvers. The same models that deliver expert solutions like the one above tie themselves in knots when confronted with the most basic mathematical concepts. I recently got into a protracted argument with Bing when it refused to accept that the 7 in 12.37 represents seven hundredths (it doubled down on its belief that it is, in fact, seven tenths, before bringing our exchange to an abrupt halt). This is a recurring pattern: GPT-4’s predecessor, ChatGPT, was shown to make basic mistakes in every area of maths it was tested on, even as it solved more complex problems within each area.
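Lest there be any doubt about who was right, the disputed digit is settled by a standard place-value expansion:

$$12.37 = 1 \times 10 + 2 \times 1 + 3 \times \frac{1}{10} + 7 \times \frac{1}{100},$$

so the 7 is worth seven hundredths, exactly as any primary curriculum has it.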
What gives? How can a system solve selected Olympiad problems while making mistakes a cogent seven-year-old would baulk at?
The answer may sound counterintuitive: LLMs know too damned much. Having been trained on huge amounts of data (‘Open’AI refuses to disclose how much, but the ‘large’ in large language models should be taken literally), GPT-4 and its kind are able to map problems to similar ones they have seen before.
The authors of the ‘sparks’ paper reject the characterisation that GPT-4 is merely ‘memorising’ its training data (despite evidence to the contrary for both ChatGPT and GPT-4), but they do acknowledge that:
It [GPT-4] can also make very basic mistakes and occasionally produce incoherent output which may be interpreted as a lack of true understanding. Its mathematical knowledge and abilities can depend on the context in a seemingly arbitrary way.
GPT-4 has demonstrated what cognitive psychologists call near transfer: the ability to solve problems situated within a context it is already familiar with. Another admission from the ‘sparks’ paper is that:
changes in the wording of the question can alter the knowledge that the model displays
Subtle changes of wording should not make the difference between serving up a valid solution and spouting nonsense. As long as these behaviours persist, we should be sceptical of claims that GPT-4 demonstrates robust understanding of concepts, a key indicator of which is far transfer: the ability to generalise and adapt what you know to new situations. Plugins like Wolfram Alpha will help LLMs overcome their numerical blind spots, but this alone will not address their gaps in reasoning.
Uprooting the goalposts
This is not just a reflection of where current large language models fall short (time will tell what GPT-4’s successors prove capable of), but of how we choose to measure intelligence. Critics are themselves lambasted for regularly shifting the goalposts for AI, but if the goalposts are the exams of mainstream education, they don’t just need shifting, they need uprooting.
What does it say about educational assessment that AI can clear so many of its benchmarks with such brittle understanding of the underlying concepts? Could it also be true that we have been using the wrong benchmarks to assess human intelligence?
By the standards of mainstream education, GPT-4 is the dream student. It has studied extensively, and goes into its exams in full anticipation of the questions ahead. This is what so much of academic success boils down to.
At a stretch, one might argue that exams reward students for acquiring core knowledge - the so-called building blocks of learning. But exams evaluate little of students’ creative application of that knowledge to new settings (such as those they will face outside the confines of an exam hall). The false narrative we peddle to students about the usefulness of their education to the real world, and to their job prospects, is being emphatically exposed by the performance of LLMs. Even the most diligent student cannot hope to compete with language models on knowledge acquisition.
But assessment doesn’t have to be so limited. François Chollet’s ARC (Abstraction and Reasoning Corpus) framework is a serious attempt to test LLMs on novel problems that do not feature in their training data, and thus to capture something of the robustness and flexibility of their thinking. In each task, the subject is shown a few examples of coloured grids (which computers process as 2D grids of symbols) and is then asked to complete the missing example. The aim is to measure one’s ability to pick up new skills with limited prior experience. These tests, which humans have little trouble with, have so far proven thorny for LLMs.
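To make the format concrete, here is a minimal sketch of an ARC-style task in Python. The dictionary mirrors the JSON layout used in Chollet’s public ARC repository (a handful of ‘train’ input/output pairs plus a ‘test’ input, with grids as 2D arrays of integers, one per colour), but the mirror-image rule itself is an invented toy, far simpler than genuine ARC tasks:

```python
# A minimal ARC-style task. Each task supplies a few 'train' input/output
# grid pairs and a 'test' input whose output the solver must produce.
# Grids are 2D lists of small integers, each integer standing for a colour.
# The transformation here (reflect each grid left-to-right) is a toy rule
# invented purely for illustration.

task = {
    "train": [
        {"input": [[1, 0], [0, 2]], "output": [[0, 1], [2, 0]]},
        {"input": [[3, 3, 0], [0, 4, 0]], "output": [[0, 3, 3], [0, 4, 0]]},
    ],
    "test": [
        {"input": [[5, 0, 0], [0, 6, 7]]},
    ],
}

def solve(grid):
    """Hand-written solver for this one toy task: mirror each row."""
    return [list(reversed(row)) for row in grid]

# Sanity-check the induced rule against the train pairs, then apply to test.
assert all(solve(pair["input"]) == pair["output"] for pair in task["train"])
print(solve(task["test"][0]["input"]))  # [[0, 0, 5], [7, 6, 0]]
```

The crucial design choice is that each task carries its own rule: there is no corpus of similar tasks to memorise, so success depends on inducing the transformation from a few examples alone - far transfer in miniature.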
Testing for far transfer is tough for large language models because of the sheer extent of their prior knowledge. It is actually a more straightforward task in the case of humans, who do not have the means to absorb all the information the internet has to offer. But the idea is the same - educational assessment ought to account for students’ prior knowledge and pose questions that are not merely a rehashing of things they have seen before. Maths challenge papers (of which the Olympiad is at the top end) may be fit for purpose because they are constantly innovating with new question types that students are unlikely to have encountered before - studying past papers and ‘teaching to the test’, while helpful, only goes so far.
One-time written exams may not even be the optimal format. A dialogue is more likely to expose gaps in a student’s thinking (this is true of large language models too: the ‘sparks’ paper shows how GPT-4 goes off the rails, mathematically speaking, within a few prompts). When interviewing prospective Oxford maths undergraduates, I would often meet students whose responses, much like GPT-4’s, flitted between expert solutions and inexplicable lapses in reasoning, suggesting the former were predicated on little more than rote learning. In the space of a typical 25-minute interview, I could develop a reasonable sense of a student’s coherence of thought, their ability to plan and execute lines of attack, and their capacity to respond to prompts in the face of uncertainty. Recruitment exercises, similarly, only serve their purpose if they test for novelty (unless one is hiring for roles rooted in routine, predictable tasks…the very roles that automation has in its sights).
Intelligence is easy to proclaim but hard to define, harder even to measure. Much of the commentary on LLMs has focused on how the benchmarks we use for human intelligence are not appropriate for AI. But it turns out that some of them are of little use to humans either.