"In projecting language back as the model for thought, we lose sight of the tacit embodied understanding that undergirds our intelligence." –Terry Winograd
The recent successes of generative AI models have convinced some that AGI is imminent. While these models appear to capture the essence of human intelligence, they defy even our most basic intuitions about it. They have emerged not because they are thoughtful solutions to the problem of intelligence, but because they scaled effectively on hardware we already had. Seduced by the fruits of scale, some have come to believe that it provides a clear pathway to AGI. The most emblematic case of this is the multimodal approach, in which massive modular networks are optimized for an array of modalities that, taken together, appear general. However, I argue that this strategy is sure to fail in the near term; it will not lead to human-level AGI that can, e.g., perform sensorimotor reasoning, motion planning, and social coordination. Instead of trying to glue modalities together into a patchwork AGI, we should pursue approaches to intelligence that treat embodiment and interaction with the environment as primary, and modality-centered processing as an emergent phenomenon.
Preface: Disembodied definitions of Artificial General Intelligence — emphasis on general — exclude crucial problem spaces that we should expect AGI to be able to solve. A true AGI must be general across all domains. Any complete definition must at least include the ability to solve problems that originate in physical reality, e.g. repairing a car, untying a knot, preparing food, etc. As I will discuss in the next section, what is needed for these problems is a form of intelligence that is fundamentally situated in something like a physical world model. For more discussion, look out for Designing an Intelligence, edited by George Konidaris (MIT Press, forthcoming).
Why We Need the World, and How LLMs Pretend to Understand It
TLDR: I first argue that true AGI needs a physical understanding of the world, as many problems cannot be reduced to problems of symbol manipulation. Some have suggested that LLMs learn a model of the world through next-token prediction, but it is more likely that they learn bags of heuristics for predicting tokens. This leaves them with a superficial understanding of reality and contributes to false impressions of their intelligence.
The most shocking result of the predict-next-token objective is that it yields AI models that reflect a deeply human-like understanding of the world, despite having never observed it like we have. This result has led to confusion about what it means to understand language and even to understand the world — something we have long believed to be a prerequisite for language understanding. One explanation for the capabilities of LLMs comes from an emerging theory suggesting that they induce models of the world through next-token prediction. Proponents of this theory cite the prowess of SOTA LLMs on various benchmarks, the convergence of large models to similar internal representations, and their favorite rendition of the idea that “language mirrors the structure of reality,” a notion that has been espoused at least by Plato, Wittgenstein, Foucault, and Eco. While I’m generally in support of digging up esoteric texts for research inspiration, I’m worried that this metaphor has been taken too literally. Do LLMs really learn implicit models of the world? How could they otherwise be so proficient at language?
One source of evidence in favor of the LLM world-modeling hypothesis is the Othello paper, wherein researchers were able to predict the board state of an Othello game from the hidden states of a transformer model trained on sequences of legal moves. However, there are many issues with generalizing these results to models of natural language. For one, whereas Othello moves can provably be used to deduce the full state of an Othello board, we have no reason to believe that a complete picture of the physical world can be inferred from a linguistic description. What sets the game of Othello apart from many tasks in the physical world is that Othello fundamentally resides in the land of symbols, and is merely implemented with physical tokens to make it easier for humans to play. A full game of Othello can be played with just pen and paper, but one can’t, e.g., sweep a floor, do dishes, or drive a car with just pen and paper. To solve such tasks, you need some physical conception of the world beyond what humans can merely say about it. Whether that conception of the world is encoded in a formal world model or, e.g., a value function is up for debate, but it is clear that many problems in the physical world cannot be fully represented by a system of symbols and solved with mere symbol manipulation.
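Concretely, the methodology behind that result looks roughly like the sketch below (my simplification, with assumed tensor shapes and a hypothetical training loop, not the authors’ code): a small probe is trained to read each square’s state out of the frozen sequence model’s hidden activations, and its success is taken as evidence that the board is encoded internally.

```python
# Probing sketch: recover Othello board state from a frozen sequence model's
# hidden activations. Shapes and hyperparameters are assumptions for illustration.
import torch
import torch.nn as nn

HIDDEN_DIM = 512   # assumed width of the transformer's hidden states
N_SQUARES = 64     # 8x8 Othello board
N_STATES = 3       # each square: empty, black, or white

probe = nn.Linear(HIDDEN_DIM, N_SQUARES * N_STATES)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def probe_step(hidden_states, board_labels):
    """hidden_states: (batch, HIDDEN_DIM) activations from the frozen model.
    board_labels: (batch, N_SQUARES) integer square states derived from the game."""
    logits = probe(hidden_states).view(-1, N_SQUARES, N_STATES)
    loss = loss_fn(logits.permute(0, 2, 1), board_labels)  # classify each square
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```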
Another issue, raised in Melanie Mitchell’s recent piece and supported by this paper, is that generative models can score remarkably well on sequence prediction tasks while failing to learn models of the worlds that generated the sequence data, e.g. by instead learning comprehensive sets of idiosyncratic heuristics. For example, it was pointed out in this blog post that OthelloGPT learned sequence prediction rules that don’t actually hold for all possible Othello games, like “if the token for B4 does not appear before A4 in the input string, then B4 is empty.” While one can argue that it doesn’t matter how a world model predicts the next state of the world, it should raise suspicion when that prediction reflects a better understanding of the training data than of the underlying world that produced it. This, unfortunately, is the central fault of the predict-next-token objective, which seeks only to retain information relevant to predicting the next token. If that can be done with something easier to learn than a world model, it likely will be.
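To make the contrast concrete, here is a toy rendering of that kind of shortcut (my own illustration, not code from the cited post): a prediction keyed on surface patterns in the move string rather than on any simulated board.

```python
def b4_is_empty_heuristic(moves):
    """Shortcut of the kind OthelloGPT appears to learn: 'if the token for B4
    does not appear before A4 in the input string, then B4 is empty.' This
    correlates with much of the training data but is not a rule of Othello;
    it fails, e.g., whenever B4 is played after A4."""
    if "A4" in moves:
        return "B4" not in moves[:moves.index("A4")]
    return "B4" not in moves
```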
To claim without caveat that predicting the effects of earlier symbols on later symbols requires a model of the world like the ones humans generate from perception would be to abuse the “world model” notion. Unless we disagree on what the world is, it should be clear that a true world model can be used to predict the next state of the physical world given a history of states. Similar world models, which predict high-fidelity observations of the physical world, are leveraged in many subfields of AI, including model-based reinforcement learning, task and motion planning in robotics, causal world modeling, and areas of computer vision, to solve problems instantiated in physical reality. LLMs are simply not running physics simulations in their latent next-token calculus when they ask you whether your person, place, or thing is bigger than a breadbox. In fact, I conjecture that the behavior of LLMs is owed not to a learned world model, but to brute-force memorization of incomprehensibly abstract rules governing the behavior of symbols, i.e. a model of syntax.
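For contrast, here is a minimal sketch of what those subfields typically mean by a world model (dimensions and architecture are assumptions for illustration): a learned dynamics function that predicts the next physical state from the current state and an action, which a planner can roll forward to evaluate candidate behaviors.

```python
# Minimal learned dynamics model of the kind used in model-based RL and
# task-and-motion planning. State and action dimensions are placeholders.
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    def __init__(self, state_dim=32, action_dim=8, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, state, action):
        # Predict the next physical state from the current state and action.
        return self.net(torch.cat([state, action], dim=-1))

def rollout(model, state, actions):
    """Simulate a trajectory by repeatedly applying the learned dynamics,
    so a planner can score candidate action sequences before acting."""
    states = [state]
    for action in actions:
        states.append(model(states[-1], action))
    return states
```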
Quick primer:
- Syntax is a subfield of linguistics that studies how words of various grammatical categories (e.g. parts of speech) are arranged together into sentences, which can be parsed into syntax trees. Syntax studies the structure of sentences and the atomic parts of speech that compose them.
- Semantics is another subfield concerned with the literal meaning of sentences, e.g., compiling “I am feeling chilly” into the idea that you are experiencing cold. Semantics boils language down to literal meaning, which is information about the world or human experience.
- Pragmatics studies how physical and conversational context shape speech interactions, like when someone knows to close an ajar window when you tell them “I am feeling chilly.” Pragmatics involves interpreting speech while reasoning about the environment and the intentions and hidden knowledge of other agents.
Without getting too technical, there is intuitive evidence that somewhat separate systems of cognition are responsible for each of these linguistic faculties. Look no further than the human capacity to generate syntactically well-formed sentences that have no semantic meaning, e.g. Chomsky’s famous sentence “Colorless green ideas sleep furiously,” or sentences with well-formed semantics that make no pragmatic sense, e.g. responding merely with “Yes, I can” when asked, “Can you pass the salt?” Crucially, it is the fusion of the disparate cognitive abilities underpinning these faculties that coalesces into human language understanding. For example, there isn’t anything syntactically wrong with the sentence “The fridge is in the apple,” as a syntactic account of “the fridge” and “the apple” would categorize them as noun phrases that can be combined via the production rule S → (NP “is in” NP). However, humans recognize an obvious semantic failure in the sentence, one that becomes apparent after attempting to reconcile its meaning with our understanding of reality: we know that fridges are larger than apples and could not fit inside them.
But what if you had never perceived the real world, yet were still trying to figure out whether the sentence was ill-formed? One solution could be to embed semantic information at the level of syntax, e.g. by inventing new syntactic categories, NP_{the fridge} and NP_{the apple}, and a single new production rule that prevents semantic misuse: S → (NP_{the apple} “is in” NP_{the fridge}). While this strategy would no longer require grounded world knowledge about, e.g., fridges and apples, it would require special grammar rules for every semantically well-formed construction… which is actually possible to learn given a massive corpus of natural language. Crucially, this would not be the same thing as grasping semantics, which in my view is fundamentally about understanding the nature of the world.
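Here is a toy rendering of that trick using NLTK (the grammars are my own illustration, not anything an LLM literally represents): the purely syntactic grammar happily parses “the fridge is in the apple,” while the grammar whose categories smuggle in the semantic restriction does not.

```python
# Two toy grammars: one purely syntactic, one with "semantics" baked into
# its categories so that only the sensible containment relation parses.
import nltk

syntactic = nltk.CFG.fromstring("""
S -> NP 'is' 'in' NP
NP -> 'the' 'fridge' | 'the' 'apple'
""")

lexicalized = nltk.CFG.fromstring("""
S -> NP_APPLE 'is' 'in' NP_FRIDGE
NP_APPLE -> 'the' 'apple'
NP_FRIDGE -> 'the' 'fridge'
""")

sentence = "the fridge is in the apple".split()
print(list(nltk.ChartParser(syntactic).parse(sentence)))    # one parse tree: syntactically fine
print(list(nltk.ChartParser(lexicalized).parse(sentence)))  # no parse: the "semantic" rule blocks it
```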
Finding that LLMs have reduced problems of semantics and pragmatics into syntax would have profound implications for how we should view their intelligence. People often treat language proficiency as a proxy for general intelligence by, e.g., strongly associating pragmatic and semantic understanding with the cognitive abilities that undergird them in humans. For example, someone who appears well-read and graceful in navigating social interactions is likely to score high in traits like sustained attention and theory of mind, which lie closer to measures of raw cognitive ability. In general, these proxies are reasonable for assessing a person’s general intelligence, but not an LLM’s, as the apparent linguistic skills of LLMs could come from entirely separate mechanisms of cognition.
The Bitter Lesson Revisited
TLDR: Sutton’s Bitter Lesson has sometimes been interpreted as meaning that making any assumptions about the structure of AI is a mistake. This is both unproductive and a misinterpretation; it is precisely when humans think deeply about the structure of intelligence that major advancements occur. Despite this, scale maximalists have implicitly suggested that multimodal models can be a structure-agnostic framework for AGI. Ironically, today’s multimodal models contradict Sutton’s Bitter Lesson by making implicit assumptions about the structure of individual modalities and how they should be sewn together. In order to build AGI, we must either think deeply about how to unite existing modalities, or dispense with them altogether in favor of an interactive and embodied cognitive process.

The paradigm that led to the success of LLMs is marked primarily by scale, not efficiency. We have effectively trained a pile of one trillion ants for one billion years to mimic the form and function of a Formula 1 race car; eventually it gets there, but wow was the process inefficient. This analogy nicely captures a debate between structuralists, who want to build things like "wheels" and "axles" into AI systems, and scale maximalists, who want more ants, years, and F1 races to train on. Despite many decades of structuralist study in linguistics, the unstructured approaches of scale maximalism have yielded far better ant-racecars in recent years. This was most notably articulated by Rich Sutton — a recent recipient of the Turing Award along with Andy Barto for their work in Reinforcement Learning — in his piece “The Bitter Lesson.”
[W]e should build in only the meta-methods that can find and capture this arbitrary complexity… Essential to these methods is that they can find good approximations, but the search for them should be by our methods, not by us. We want AI agents that can discover like we can, not which contain what we have discovered. - Rich Sutton
Sutton’s argument is that methods that leverage computational resources will outpace methods that do not, and that any structure for problem-solving built as an inductive bias into AI will hinder it from learning better solutions. This is a compelling argument that I believe has been seriously misinterpreted by some as implying that making any assumptions about structure is a false step. It is, in fact, human intuition that was responsible for many significant advancements in the development of SOTA neural network architectures. For example, Convolutional Neural Networks made an assumption about translation invariance for pattern recognition in images and kickstarted the modern field of deep learning for computer vision; the attention mechanism of Transformers made an assumption about the long-distance relationships between symbols in a sentence that made ChatGPT possible and had nearly everyone drop their RNNs; and 3D Gaussian Splatting made an assumption about the solidity of physical objects that made it more performant than NeRFs. Potentially none of these methodological assumptions apply to the entire domain of possible scenes, images, or token streams, but they do for the specific ones that humans have curated and formed structural intuitions about. Let’s not forget that humans have co-evolved with the environments that these datasets are drawn from.
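As a concrete example of what such an assumption looks like in code, here is scaled dot-product attention reduced to a sketch (shapes assumed, no masking or multiple heads): every token computes an affinity with every other token, baking in the assumption that useful relationships can span arbitrary distances in the sequence.

```python
# Scaled dot-product attention: the Transformer's core inductive bias.
import torch

def attention(Q, K, V):
    """Q, K, V: (batch, seq_len, d) query, key, and value tensors."""
    d = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d ** 0.5   # pairwise token affinities
    weights = torch.softmax(scores, dim=-1)       # each token's mixture over the sequence
    return weights @ V                            # context-weighted values
```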
The real question is how we might heed Sutton’s Bitter Lesson in our development of AGI. The scale maximalist approach worked for LLMs and LVMs (large vision models) because we had natural deposits of text and image data, but an analogous application of scale maximalism to AGI would require forms of embodiment data that we simply don’t have. One solution to this data scarcity issue extends the generative modeling paradigm to multimodal modeling — encompassing language, vision, and action — with the hope that a general intelligence can be built by summing together general models of narrow modalities.
There are multiple issues with this approach. First, there are deep connections between modalities that are unnaturally severed in the multimodal setting, making the problem of concept synthesis ever more difficult. In practice, uniting modalities often involves pre-training a dedicated neural module for each modality and then stitching the modules together in a joint embedding space. In the early days, this was achieved by nudging the embeddings of, e.g., (language, vision, action) tuples to converge to similar latent vectors of meaning, a vast oversimplification of the kinds of relationships that may exist between modalities. One can imagine, e.g., captioning an image at various levels of abstraction, or implementing the same linguistic instruction with different sets of physical actions. Such one-to-many relationships suggest that a contrastive embedding objective is not suitable.
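The objective in question looks roughly like the CLIP-style contrastive loss sketched below (an illustration, not any particular system’s code); note how the diagonal targets hard-code the assumption that each image has exactly one matching caption and vice versa, exactly the one-to-one correspondence that the examples above violate.

```python
# Symmetric contrastive (InfoNCE-style) loss over paired image/text embeddings.
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """img_emb, txt_emb: (batch, d) embeddings of paired images and captions."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.T / temperature   # similarity of every image-caption pair
    targets = torch.arange(len(img_emb))         # i-th caption is "the" match for the i-th image
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2
```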
While modern approaches do not make such stringent assumptions about how modalities should be united, they still universally encode percepts from all modalities (e.g. text, images) into the same latent space. Intuitively, it would seem that such latent spaces could serve as common conceptual ground across modalities, analogous to a space of human concepts. However, these latent spaces do not cogently capture all information relevant to a concept, and instead rely on modality-specific decoders to flesh out important details. The “meaning” of a percept is not in the vector it is encoded as, but in the way relevant decoders process this vector into meaningful outputs. As long as various encoders and decoders are subject to modality-specific training objectives, “meaning” will be decentralized and potentially inconsistent across modalities, especially as a result of pre-training. This is not a recipe for the formation of coherent concepts.
Furthermore, it is not clear that today’s modalities are an appropriate partitioning of the observation and action spaces for an embodied agent. It is not obvious that, e.g., images and text should be represented as separate observation streams, nor text production and motion planning as separate action capabilities. The human capacities for reading, seeing, speaking, and moving are ultimately mediated by overlapping cognitive structures. Making structural assumptions about how modalities ought to be processed is likely to hinder the discovery of more fundamental cognition that is responsible for processing data in all modalities. One solution would be to consolidate unnaturally partitioned modalities into a unified data representation, encouraging networks to learn intelligent processes that generalize across modalities. Intuitively, a model that can understand the visual world as well as humans can — including everything from human writing to traffic signs to visual art — should not make a serious architectural distinction between images and text. Part of the reason VLMs can’t, e.g., count the number of letters in a word is that they can’t see what they are writing.
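One naive way to start collapsing that particular split (my own illustration, not a proposal drawn from the literature) is to render text into pixels so that a single visual encoder sees writing, traffic signs, and natural scenes through the same interface.

```python
# Render text as an image so one perception system handles both "modalities".
from PIL import Image, ImageDraw

def render_text_as_image(text, size=(224, 224)):
    """Return a grayscale image of the text, ready for a vision encoder."""
    canvas = Image.new("L", size, color=255)             # white background
    ImageDraw.Draw(canvas).text((10, 10), text, fill=0)  # black text
    return canvas

sign = render_text_as_image("No parking 8am-6pm")  # a 'sign' the model can literally see
```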
Finally, the learn-from-scale approach trains models to copy the conceptual structure of humans instead of learning the general capability to form novel concepts on their own. Humans have spent hundreds of thousands of years refining concepts and passing them memetically through culture and language. Today’s models are trained only on the end result of this process: the present-day conceptual structures that make it into the corpus. By optimizing for the ultimate products of our intelligence, we have ignored the question of how those products were invented and discovered. Humans have a unique ability to form durable concepts from few examples, ascribe names to them, reason about them analogically, and so on. While the in-context capabilities of today’s models can be impressive, they grow increasingly limited as tasks become more complex and stray further from the training data. The flexibility to form new concepts from experience is a foundational attribute of general intelligence, and we should think carefully about how it arises.
While structure-agnostic scale maximalism has succeeded in producing LLMs and LVMs that pass Turing tests, a multimodal scale-maximalist approach to AGI will not bear similar fruit. Instead of pre-supposing structure in individual modalities, we should design a setting in which modality-specific processing emerges naturally. For example, my recent paper on visual theory of mind saw abstract symbols naturally emerge from communication between image-classifying agents, blurring the lines between text and image processing. Eventually, we should hope to reintegrate as many features of intelligence as possible under the same umbrella. However, it is not clear whether there is genuine commercial viability in such an approach as long as scaling and fine-tuning narrow intelligence models continues to solve commercial use cases.
Conclusion
The overall promise of scale maximalism is that a Frankenstein AGI can be sewn together from general models of narrow domains. I argue that this is extremely unlikely to yield an AGI that feels complete in its intelligence. If we intend to continue reaping the streamlined efficiency of modality-specific processing, we must be intentional about how modalities are united — ideally drawing from human intuition and classical fields of study, e.g. this work from MIT. Alternatively, we can re-formulate learning as an embodied and interactive process in which disparate modalities naturally fuse together. We could do this by, e.g., processing images, text, and video with the same perception system, and producing actions for generating text, manipulating objects, and navigating environments with the same action system. What we will lose in efficiency we will gain in flexible cognitive ability.
In a sense, the most challenging mathematical piece of the AGI puzzle has already been solved: the discovery of universal function approximators. What’s left is to inventory the functions we need and determine how they ought to be arranged into a coherent whole. This is a conceptual problem, not a mathematical one.
Acknowledgements
I would like to thank Lucas Gelfond, Daniel Bashir, George Konidaris, and my father, Joseph Spiegel, for their thoughtful and thorough feedback on this work. Thanks to Alina Pringle for the wonderful illustration made for this piece.
Author Bio
Benjamin is a PhD candidate in Computer Science at Brown University. He is interested in models of language understanding that ground meaning to elements of structured decision-making. For more info see his personal website.
Citation
For attribution in academic contexts or books, please cite this work as
Benjamin A. Spiegel, "AGI Is Not Multimodal", The Gradient, 2025.
@article{spiegel2025agi,
author = {Benjamin A. Spiegel},
title = {AGI Is Not Multimodal},
journal = {The Gradient},
year = {2025},
howpublished = {\url{https://thegradient.pub/agi-is-not-multimodal}},
}
References
Andreas, Jacob. “Language Models, World Models, and Human Model-Building.” Mit.edu, 2024, lingo.csail.mit.edu/blog/world_models/.
Belkin, Mikhail, et al. "Reconciling modern machine-learning practice and the classical bias–variance trade-off." Proceedings of the National Academy of Sciences 116.32 (2019): 15849-15854.
Kerbl, Bernhard, et al. “3D Gaussian Splatting for Real-Time Radiance Field Rendering.” ACM Transactions on Graphics, vol. 42, no. 4, 26 July 2023, pp. 1–14, https://doi.org/10.1145/3592433.
Chomsky, Noam. 1965. Aspects of the theory of syntax. Cambridge, Massachusetts: MIT Press.
Designing an Intelligence. Edited by George Konidaris, MIT Press, 2026.
Emily M. Bender and Alexander Koller. 2020. Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5185–5198, Online. Association for Computational Linguistics.
Eye on AI. “The Mastermind behind GPT-4 and the Future of AI | Ilya Sutskever.” YouTube, 15 Mar. 2023, www.youtube.com/watch?v=SjhIlw3Iffs&list=PLpdlTIkm0-jJ4gJyeLvH1PJCEHp3NAYf4&index=64. Accessed 18 May 2025.
Frank, Michael C. “Bridging the data gap between children and large language models.” Trends in cognitive sciences vol. 27,11 (2023): 990-992. doi:10.1016/j.tics.2023.08.007
Garrett, Caelan Reed, et al. "Integrated Task and Motion Planning." Annual Review of Control, Robotics, and Autonomous Systems 4.1 (2021): 265-293.
Goodhart, C.A.E. (1984). Problems of Monetary Management: The UK Experience. In: Monetary Theory and Practice. Palgrave, London. https://doi.org/10.1007/978-1-349-17295-5_4
Hooker, Sara. The hardware lottery. Commun. ACM 64, 12 (December 2021), 58–65. https://doi.org/10.1145/3467017
Huh, Minyoung, et al. "The Platonic Representation Hypothesis." Forty-first International Conference on Machine Learning. 2024.
Kaplan, Jared, et al. "Scaling laws for neural language models." arXiv preprint arXiv:2001.08361 (2020).
Lake, Brenden M. et al. “Building Machines That Learn and Think like People.” Behavioral and Brain Sciences 40 (2017): e253. Web.
Li, Kenneth, et al. "Emergent world representations: Exploring a sequence model trained on a synthetic task." ICLR (2023).
Luiten, Jonathon, Georgios Kopanas, Bastian Leibe, and Deva Ramanan. "Dynamic 3D Gaussians: Tracking by Persistent Dynamic View Synthesis." 3DV. 2024.
Mao, Jiayuan, Chuang Gan, Pushmeet Kohli, Joshua B. Tenenbaum, and Jiajun Wu. "The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision." International Conference on Learning Representations. 2019.
Mitchell, Melanie. “LLMs and World Models, Part 1.” Substack.com, AI: A Guide for Thinking Humans, 13 Feb. 2025, aiguide.substack.com/p/llms-and-world-models-part-1. Accessed 18 May 2025.
Mu, Norman. “Norman Mu | the Myth of Data Inefficiency in Large Language Models.” Normanmu.com, 14 Feb. 2025, www.normanmu.com/2025/02/14/data-inefficiency-llms.html. Accessed 18 May 2025.
Newell, Allen, and Herbert A. Simon. “Computer Science as Empirical Inquiry: Symbols and Search.” Communications of the ACM, vol. 19, no. 3, 1 Mar. 1976, pp. 113–126, https://doi.org/10.1145/360018.360022.
Peng, Hao, et al. “When Does In-Context Learning Fall Short and Why? A Study on Specification-Heavy Tasks.” ArXiv.org, 2023, arxiv.org/abs/2311.08993.
Spiegel, Benjamin, et al. “Visual Theory of Mind Enables the Invention of Early Writing Systems.” CogSci, 2025, arxiv.org/abs/2502.01568.
Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998.
Vafa, Keyon, et al. "Evaluating the world model implicit in a generative model." Advances in Neural Information Processing Systems 37 (2024): 26941-26975.
Vaswani, Ashish, et al. "Attention Is All You Need." Advances in Neural Information Processing Systems, vol. 30, Curran Associates, Inc., 2017. arXiv:1706.03762.
Winograd, Terry. “Thinking Machines: Can There Be? Are We?” The Boundaries of Humanity: Humans, Animals, Machines, edited by James Sheehan and Morton Sosna, Berkeley: University of California Press, 1991, pp. 198–223.
Wu, Shangda, et al. "Beyond language models: Byte models are digital world simulators." arXiv preprint arXiv:2402.19155 (2024).