After years of dominance by the AI architecture known as the transformer, the search for new architectures has begun.
Transformers underpin OpenAI's video-generating model Sora, and they form the core of text-generating models like Anthropic's Claude, Google's Gemini, and OpenAI's GPT-4o. But they're starting to run into technical hurdles, particularly computational ones.
Transformers aren't especially efficient at processing and analyzing vast amounts of data, at least when running on off-the-shelf hardware. That leads to steep and potentially unsustainable costs, and demand for power keeps rising as companies build out and expand infrastructure to meet transformers' requirements.
This month's promising proposed architecture is test-time training (TTT), developed over the course of a year and a half by researchers at Stanford University, UC San Diego, UC Berkeley, and Meta. The team claims that TTT models can not only process far more data than transformers, but can do so without consuming nearly as much compute.
Hidden state in transformers
One of the fundamental components of transformers is the “hidden state,” which is essentially a long list of data. When a transformer processes something, it adds entries to the hidden state to “remember” what it just processed. For example, if a model is working its way through a book, the hidden state values would be things like representations of words (or parts of words).
“If you think of a transformer as an intelligent entity, the lookup table — its hidden state — is the brain of the transformer,” Yu Sun, a postdoctoral researcher at Stanford University and a contributor to the TTT research, told TechCrunch. “This specialized brain enables well-known transformer capabilities like in-context learning.”
Hidden state is part of what makes transformers so powerful. But it also holds them back. For a transformer to “say” a single word about a book it just read, the model has to scan the entire lookup table—a task that’s as computationally intensive as rereading the entire book.
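To make that cost concrete, here is a toy Python sketch (not the researchers' code) of the hidden state as a growing list: producing each new output means scanning everything stored so far, so the per-step work grows with the length of the input.

```python
def process_tokens(tokens):
    hidden_state = []  # grows by one entry per processed token
    step_costs = []
    for token in tokens:
        hidden_state.append(f"repr({token})")  # "remember" what was just read
        step_costs.append(len(hidden_state))   # emitting the next word means scanning the whole table
    return step_costs

# A 4-token "book" costs 1 + 2 + 3 + 4 scans; total work grows roughly quadratically.
print(process_tokens(["the", "cat", "sat", "down"]))  # [1, 2, 3, 4]
```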
So Sun and his team came up with the idea of replacing the hidden state with a machine learning model — like nested dolls of AI, if you will, a model within a model.
It's a bit technical, but the gist is that, unlike a transformer's lookup table, the internal machine learning model of a TTT model doesn't keep growing as it processes additional data. Instead, it encodes the data it processes into representative variables called weights, which is what keeps TTT models performant. No matter how much data a TTT model processes, the size of its internal model doesn't change.
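A rough, hypothetical sketch of that fixed-size idea: a toy linear "inner model" whose weights absorb each incoming item through one self-supervised gradient step. The dimension, learning rate, and update rule below are illustrative assumptions, not the paper's actual design; the point is only that the stored state never grows, no matter how much data streams through.

```python
import numpy as np

DIM = 8    # toy embedding size (illustrative assumption)
LR = 0.01  # learning rate for the per-item update (illustrative)

def ttt_step(W, x):
    """Fold one input vector into the weights instead of appending it to a list."""
    pred = W @ x                    # inner model's reconstruction of x
    grad = np.outer(pred - x, x)    # gradient of 0.5 * ||W @ x - x||**2 w.r.t. W
    return W - LR * grad            # one self-supervised training step

rng = np.random.default_rng(0)
W = np.zeros((DIM, DIM))            # the inner model's weights: fixed size
for _ in range(10_000):             # stream arbitrarily many items...
    W = ttt_step(W, rng.standard_normal(DIM))
print(W.shape)                      # ...and the stored state is still (8, 8)
```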
Sun believes that future TTT models could efficiently process billions of data points, from words to images to audio recordings to videos. That’s far beyond the capabilities of today’s models.
“Our system can say X words about a book without the computational complexity of rereading the book X times,” Sun said. “Large transformer-based video models, like Sora, can only process 10 seconds of video, because they only have the ‘brain’ of a lookup table. Our ultimate goal is to develop a system that can process long video that resembles the visual experience of human life.”
Doubts about TTT models
Will TTT models eventually outperform transformers? It's possible, but it's too early to tell.
TTT models aren't a drop-in replacement for transformers. And the researchers have so far developed only two small models for their study, which makes it hard to compare TTT as a method against some of the larger transformer implementations in use today.
“I think this is a very interesting innovation, and if the data supports the claims that it provides efficiency gains, that’s great news, but I can’t tell you whether it’s better than existing architectures,” said Mike Cook, a senior lecturer in the Department of Informatics at King’s College London who was not involved in the TTT research. “One of my old professors used to tell me a joke when I was an undergraduate: How do you solve any computer science problem? Add another layer of abstraction. Adding a neural network inside a neural network definitely reminds me of that.”
Regardless, the accelerating pace of research into transformer alternatives indicates a growing recognition of the need for significant progress.
This week, AI startup Mistral launched a model, Codestral Mamba, that is based on another alternative to transformers called state space models (SSMs). State space models, like TTT models, appear to be more computationally efficient than transformers and can scale to larger amounts of data.
AI21 Labs is also exploring SSMs, as is Cartesia, which pioneered some of the first SSMs as well as Codestral Mamba's namesakes, Mamba and Mamba-2.
If these efforts succeed, it could make generative AI much more accessible and widespread than it is now — for better or worse.