Language Models will be Scaffolds

Alex Zhang, Feb 25, 2026.

I have been somewhat convinced since before I started my PhD that the language models we interact with in the (near) future will be what we call scaffolds today. For the first half of this decade, it was generally believed that any work other than improving the base “neural” language model (i.e. an end-to-end neural network model like Opus 4.6 or Qwen3-8B) was a contrarian take on Sutton’s Bitter Lesson that would ultimately fall victim to scale. And this belief has genuinely been a good bet since the invention of the Transformer in 2017: religiously following it is how companies like OpenAI and Anthropic have exploded in valuation. The inability to follow this bet is also what has led to academia’s weakening presence in AI, and to the growing pull of industry labs on ambitious young researchers.

For the second half of this decade, my intuition tells me to bet differently. I’m not saying that scaling is dead; scaling is quite literally the key to everything in a data-driven strategy like deep learning. It’s more that language models are really good now: so good that I suspect existing neural language models are actually severely underutilized. That is, they are much better at general task solving than what we naively use them for. We’ve spent the better part of this decade exhausting every axis of scale we can find (e.g. data, compute, model capacity) in hopes that the neural language models we produce can eke out a few extra points on benchmarks built three years ago, but the obsession over “raw model capability” has ironically left our evaluation metrics completely off. How do you begin to compare scaffolds like Claude Code, Codex, Cursor, and Antigravity with anything other than “vibes”? Rigorous comparisons are missing not because the differences don’t exist, but because we weren’t prepared to measure them.

Another consequence of the “language model purist” view is the conflation of the term “language model” with “neural network”. A language model, as we defined it pre-“Attention is All You Need”, is merely a probabilistic function from text to text. As an example, at the very end of 2025, I released a preprint called “Recursive Language Models”. A common point of confusion is that two-thirds of the title is “Language Model”, when the main proposal of the paper is a task-agnostic scaffold. That paper is a formalized implementation of the theme of this essay, which is that a powerful class of language models with near-infinite input, output, and reasoning context consists of scaffolds around neural language models that can call themselves recursively inside of a REPL. To be blunt, what I am suggesting is that the line between a language model and a scaffold is blurring, and the field is once again open to novel ideas on what these scaffolds should look like.
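
To make the “text to text” framing concrete, here is a minimal Python sketch. It is an illustration under loud assumptions, not the implementation from the preprint: there is no REPL, the “neural” model is a deterministic toy stand-in rather than a probabilistic one, and every name in it (`LanguageModel`, `make_toy_base_model`, `make_recursive_scaffold`) is hypothetical. The only point is that the scaffold has the same type as the model it wraps, so from the outside it is just another language model, one that happens to handle inputs longer than the base model’s window.

```python
from typing import Callable

# A language model in the broad sense: a function from text to text.
# (Real models are probabilistic; this toy stand-in is deterministic.)
LanguageModel = Callable[[str], str]


def make_toy_base_model(max_chars: int) -> LanguageModel:
    """Hypothetical stand-in for a neural LM with a fixed context window."""

    def base_model(prompt: str) -> str:
        # Anything past the window is silently dropped.
        visible = prompt[:max_chars]
        words = visible.split()
        # Toy "task": answer with the longest word the model can see.
        return max(words, key=len) if words else ""

    return base_model


def make_recursive_scaffold(base: LanguageModel, max_chars: int) -> LanguageModel:
    """Wrap `base` so the result is still text -> text, but calls itself
    recursively on pieces of an input that does not fit in the window."""

    def scaffold(prompt: str) -> str:
        words = prompt.split()
        # Base case: the prompt fits, or it cannot be split any further.
        if len(prompt) <= max_chars or len(words) < 2:
            return base(prompt)
        mid = len(words) // 2
        left = scaffold(" ".join(words[:mid]))   # recursive self-call
        right = scaffold(" ".join(words[mid:]))  # recursive self-call
        # Combine the two partial answers with one more base-model call.
        return base(left + " " + right)

    return scaffold


WINDOW = 20  # assumed window size, in characters, for the toy model
base = make_toy_base_model(WINDOW)
scaffolded = make_recursive_scaffold(base, WINDOW)

text = "a bb ccc dddd eeeee ffffff ggggggg hhhhhhhh"
print("base model:", base(text))
print("scaffold:  ", scaffolded(text))
```

On the 43-character `text`, the base model only sees the first 20 characters, while the scaffold sees all of it through recursive calls. Both `base` and `scaffolded` satisfy the same `LanguageModel` type, which is the sense in which the line between the two blurs.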

If you are a researcher in AI, this should be very exciting. The field is generally resistant to “out-there” ideas, but the ability to produce novel, state-of-the-art systems without expensive training is at a peak. What’s even more exciting is that there isn’t “low-hanging fruit” per se (I strongly dislike this term because it implies you should pursue lazy incremental ideas); it’s more that we have once again hit a ripe period where innovative, clever ideas can make a huge impact on the direction of the field. Of course, I will continue to bet that training Recursive Language Models (RLMs) is the way to achieve near-infinite-context LMs and produce a breakthrough in reasoning capabilities for models, but I also firmly believe that there is a plethora of other refinements or alternatives that may prove to be better. Only time will tell if GPT-9-super-high-genius-think ends up being a scaffold, but for now, I’m hopeful for the ideas to come.