Large Language Models’ Emergent Abilities Are a Mirage


Three-digit addition offers an example. In the 2022 BIG-bench study, researchers reported that with fewer parameters, both GPT-3 and another LLM named LaMDA failed to accurately complete addition problems. However, when GPT-3 was trained with 13 billion parameters, its ability changed as if with the flip of a switch: suddenly, it could add, and LaMDA could too, once it reached 68 billion parameters. This suggests that the ability to add emerges at a certain threshold.

But the Stanford researchers point out that the LLMs were judged only on accuracy: either they could do it perfectly, or they couldn’t. So even if an LLM predicted most of the digits correctly, it failed. That didn’t seem right. If you’re calculating 100 plus 278, whose correct sum is 378, then 376 seems like a much more accurate answer than, say, −9.34.

So instead, Koyejo and his collaborators tested the same task using a metric that awards partial credit. “We can ask: How well does it predict the first digit? Then the second? Then the third?” he said.
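To make that contrast concrete, here is a minimal sketch in Python (the function names and the digit-padding choice are illustrative assumptions, not the researchers’ actual scoring code) of all-or-nothing accuracy versus per-digit partial credit on the 100-plus-278 example:

```python
def exact_match(prediction: str, target: str) -> float:
    """All-or-nothing scoring: full credit only if every character matches."""
    return 1.0 if prediction == target else 0.0

def per_digit_credit(prediction: str, target: str) -> float:
    """Partial credit: fraction of positions predicted correctly."""
    # Right-align both strings so corresponding digits line up (illustrative choice).
    width = max(len(prediction), len(target))
    prediction, target = prediction.rjust(width), target.rjust(width)
    matches = sum(p == t for p, t in zip(prediction, target))
    return matches / width

# 100 + 278 = 378; a model answering "376" gets the first two digits right.
print(exact_match("376", "378"))        # 0.0  -> counted as a total failure
print(per_digit_credit("376", "378"))   # ~0.67 -> mostly right
print(per_digit_credit("-9.34", "378")) # 0.0  -> nowhere close
```

Under the first metric a near-miss like 376 counts the same as a wild guess; under the second it earns most of the credit.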

Koyejo credits the idea for the new work to his graduate student Rylan Schaeffer, who, he said, noticed that an LLM’s performance seems to change depending on how its ability is measured. Together with Brando Miranda, another Stanford graduate student, they chose new metrics showing that as parameters increased, the LLMs predicted an increasingly correct sequence of digits in addition problems. This suggests that the ability to add isn’t emergent, in the sense of a sudden, unpredictable jump, but gradual and predictable. With a different measuring stick, they found, emergence vanishes.
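A toy calculation helps show why the measuring stick matters. Under the simplifying assumption that each digit of an answer is predicted independently with the same accuracy (a hypothetical sketch, not the study’s own analysis), a smooth rise in per-digit accuracy still produces a curve that looks like a sudden jump once the whole answer must be exactly right:

```python
# Hypothetical illustration: if per-digit accuracy p improves smoothly with
# model scale, all-or-nothing (exact-match) accuracy on an L-digit answer
# behaves like p**L, which hugs zero and then climbs steeply.
L = 10  # answer length in digits (an illustrative choice)
for p in [0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99]:
    print(f"per-digit accuracy {p:.2f} -> exact-match accuracy {p**L:.3f}")
```

The per-digit numbers change gradually, but the exact-match figures stay near zero until the very end, which is how a gradual improvement can masquerade as a sudden, emergent ability.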

Brando Miranda (left), Sanmi Koyejo, and Rylan Schaeffer (not pictured) have suggested that the “emergent” abilities of large language models are both predictable and gradual.

Courtesy of Kris Brewer; Ananya Navale