Many intelligent robots have come and gone, failing to become a commercial success. We’ve lost Aibo, Romo, Jibo, Baxter—even Alexa is reducing staff. Perhaps they failed to reach their potential because you can’t have a meaningful conversation with them. We are now at an inflection point: AI has recently made substantial progress, speech recognition now actually works, and we have neural networks in the form of large language models (LLMs) such as ChatGPT and GPT-4 that produce astounding natural language. The problem is that you can’t just have robots make API calls to a generic LLM in the cloud because those models aren’t sufficiently localized for what your robot needs to know. Robots live in the physical world, and so they must take in context and be hyperlocal. This means that they need to be able to learn quickly. Rapid learning is also required for using LLMs for advising in specialized domains, such as science and auto repair.
To use robots in specialized domains, we will need to train LLMs ourselves or refine existing ones so that they can learn more quickly. Hand-in-hand with quick learning is a long-term memory. If your robot doesn’t remember what you talked about last month or what it did to fix your 1979 Honda Civic, it’s going to be of limited use. And we need our robots to tell the truth and say when they don’t know—to actually be useful, robots need to be trustworthy.
Robots need strong mental models
Robots need strong mental models so that they can learn quickly, form long-term memories, and understand truth. A model enables learning because it can encode new input based on what the robot already knows. Models enable memory because they condense information so the learner doesn’t have to store everything that happened, and models enable truth because they provide a prior to minimize spurious correlations. Without truth, robots will make mistakes that no human would make, not even a child.
It’s surprising and wonderful to see that LLMs do seem to be learning implicit mental models of the world [28, 29]. LLMs are only trying to predict the next token, but at some point the most efficient way to do that becomes building a model of the world to understand what is actually going on. We need to train LLMs to maximize this model-building capability with the smallest amount of training data and in a way that aligns with our goals.
Robots need to think forward in novel situations
In addition, we need our robots to think and analyze in novel situations. LLMs are masters at recognizing patterns and blending them, but they don’t reason forward well to reach new conclusions from first principles. Real life consists of sequences of events that have never previously happened, and our robots need to adapt and improvise, which sometimes requires thinking multiple steps into the future. We need to give our robots cognitive tools so they can help us create new theories and explanations so we can move humanity forward, such as by helping us find cures for rare diseases.
In short, robots and domain-specific AI need two things: strong mental models and tools for forward thinking.
Strengthening Mental Models using Curriculum Learning to Acquire a Cognitive Foundation
Robots need strong mental models to learn quickly and adapt to novel situations. Human mental models consist of layers that form our cognitive foundation [1-7]. To give robots strong mental models, we can approximate our cognitive foundation by training them using curriculum learning. Curriculum learning consists of teaching a robot by starting with simple, general, and concrete inputs and gradually moving to complicated, specific, and abstract inputs.
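The ordering idea in the paragraph above can be sketched in a few lines. This is only an illustration under stated assumptions: each training example is assumed to carry a hand-assigned difficulty score, and the names `Example`, `difficulty`, and `curriculum_stages` are hypothetical, not taken from any library or from the systems discussed here.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class Example:
    text: str
    difficulty: int  # assumed hand-assigned: 0 = simple and concrete, higher = more abstract


def curriculum_stages(examples: List[Example]) -> Iterator[List[Example]]:
    """Yield cumulative training sets, easiest level first.

    Each stage keeps the earlier, simpler examples so they are rehearsed
    while harder material is added.
    """
    levels = sorted({ex.difficulty for ex in examples})
    for level in levels:
        yield [ex for ex in examples if ex.difficulty <= level]


examples = [
    Example("Support is a relation that keeps objects from falling.", 2),
    Example("The cup is on the table.", 0),
    Example("If the table moves, the cup moves with it.", 1),
]
stages = list(curriculum_stages(examples))
```

Here the first stage contains only the concrete sentence about the cup, and the abstract statement about support is introduced last. How difficulty should actually be scored is left open.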
Our human cognitive foundation is depicted in Figure 1. It emerged bit-by-bit through evolution, complexifying simultaneously as our sensory and motor capabilities expanded. This gradual building encouraged reuse and formed the basis for learning ever-more sophisticated behaviors. In this section, we look at the levels of the human cognitive foundation and discuss how curriculum learning can be done at each level to make robots more understandable and trustworthy. Using curriculum learning, we can control what they value and how they represent information, which will better align them with our goals and how we humans understand the world.
The origin of life
The origin of life itself sits at the base of the cognitive foundation. At life’s inception, self-generating chemical reactions found themselves within lipid enclosures, and those reactions that could stay around longer and reproduce became more common. The process needed to “stay around” is called metabolism. These metabolic processes were randomly mutating, and when by chance the first sensor element connected to the first effector (motor) element, purpose came into being. Some purposes happened to allow their attached metabolisms to stay around even longer, and purpose is how the movement of life is different from nonlife, such as rocks. Rocks move due to gravity, but life moves due to purpose. Purpose and life arose together and manifest in a striving to maintain metabolism that we see all around us.
The purpose of life is to maintain metabolism, and the purpose of LLMs is to predict the next token (a token is a generalization of a word to any discrete thing). Building a cognitive foundation entails teaching the model that some tokens are more important than others. In reinforcement learning, a branch of machine learning geared towards actions, this notion of importance is often specified as reward. The robot will learn to take actions and to focus its attention to make better predictions of important events, while ignoring others. Training LLMs this way will enable our robots to have goals. Goals are the end states of purposes, and the first goal in life on Earth was single-celled organisms moving toward resources. At the bottom of the cognitive foundation is where we determine the goals for our robots.
At this level of the origin of life, curriculum learning entails specifying that some tokens are more important to predict than others. What is important to predict will depend on the type of robot or specialized AI you want to build.
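One way to make "some tokens are more important to predict than others" concrete is an importance-weighted next-token loss. The sketch below is an assumption about how that could look, not a description of how any existing LLM is trained; the function name, the `importance` table, and the example tokens are all hypothetical, and it assumes every target token has a nonzero predicted probability.

```python
import math
from typing import Dict, List


def weighted_next_token_loss(
    predicted: List[Dict[str, float]],
    targets: List[str],
    importance: Dict[str, float],
    default_weight: float = 1.0,
) -> float:
    """Importance-weighted average of the negative log-likelihood of each target token."""
    total, weight_sum = 0.0, 0.0
    for probs, token in zip(predicted, targets):
        w = importance.get(token, default_weight)
        total += -w * math.log(probs[token])  # assumes probs[token] > 0
        weight_sum += w
    return total / weight_sum


# Invented predictions for a two-token sequence.
predicted = [
    {"the": 0.9, "brake": 0.1},
    {"the": 0.5, "brake": 0.5},
]
targets = ["the", "brake"]
unweighted = weighted_next_token_loss(predicted, targets, importance={})
# For an auto-repair assistant, "brake" might be declared more important.
weighted = weighted_next_token_loss(predicted, targets, importance={"brake": 5.0})
```

With the extra weight, the poorly predicted domain token dominates the loss, so training would push harder on getting it right.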
The development of mind
On top of life’s origin is the development of mind. Some lines of cells were able to better maintain their metabolism when they banded together into groups, eventually becoming complex animals with specialized components that helped them to better survive by making sophisticated decisions. The developmental psychologist Elizabeth Spelke describes the ontology used by the human mind as consisting of six systems of core knowledge [11-12]. She and her collaborators identified this knowledge by taking newborn babies and seeing what they know right at birth. They found that this knowledge consists of
- Places: including distance and direction
- Objects: including continuity and contact
- Social partners: including shareable experiences
- Agents: including cause, cost, and value
- Forms: including shapes and length
- Number: including the natural numbers.
They were able to determine what babies know at birth by using the fact that babies look longer at things that surprise them. If they look at something impossible longer, such as an object disappearing, the researchers know that the baby knows it is impossible.
Alongside this world ontology is a set of fundamental patterns that seem to enable many of our cognitive abilities. Perceptual patterns include those such as force and inside-outside. We understand the world in terms of these patterns [3,4,7]. These patterns likely evolved by being useful for one decision and were then reused by evolution for many decisions, even later becoming abstract through metaphor. We can force an object up a hill and we can force an adversary to back down. Simultaneously, action patterns were built on previous simpler ones. Humans are born with motor programs that are refined through experience, which can often be understood as control laws. Because our abilities emerged gradually through evolution, these patterns are reused in humans. By starting simple and adding complexity, we can maximize pattern reuse in robots.
At this level of the development of mind, curriculum learning entails training data that represents basic objects, relationships, and interactions. For example, objects can be attached to other objects and move with them, and objects can be supported by other objects so they don’t fall. Agent objects can push other objects and chase other agent objects. This level of curriculum learning begins with simple, concrete situations that are then followed by abstract ideas that generalize what has been learned through those concrete examples.
Once babies are born, they learn. Pre-literate learning rests upon the development of mind. Children learn through exploration and through shared attention with caregivers [7,16,17]. At this level of pre-literate learning, curriculum learning entails properties and interactions of specific kinds of objects, especially the kinds of objects that are of interest to your domain.
Finally, the content of the internet sits on top of this cognitive foundation. Every piece of content created implicitly assumes that the consumer has this cognitive foundation. When we consume this content, its tokens take their meaning from their mapping to this cognitive foundation. Large language models have less to map to, and this is why they have such a hard time with truth and knowing when they don’t know or aren’t sure. Without this mapping, any sequence of real-world events is as possible and likely as any other, as long as the tokens line up. By training with curriculum learning, our LLMs will have this mapping.
Curriculum learning is also important because the less guided our robots are as they acquire a cognitive foundation, the more alien they will be. As we have seen, our own cognitive foundation arose by following one path through evolution. All that evolution does is maintain those metabolisms that reproduce, so there is no reason to believe that our sensors allow us to perceive the Truth; we only know that what we perceive is mostly internally consistent and allows us to survive on Earth [25,26]. To illustrate the point, there’s a kind of bird called the Common Cuckoo (Cuculus canorus) that lays its eggs in the nest of another kind of bird, often a small passerine, and the host parents raise the chick even though the imposter is six times as heavy as they are. We laugh when animals and insects don’t see the truth, but we assume that we see it ourselves. Since there are likely many cognitive foundations that an AI could acquire, there is no guarantee it will learn the same cognitive foundation that we have. Their learning will require guidance from humans so that we can communicate with and trust them. It starts with how we build their cognitive foundation.
Training Mental Models Situated in the World
Curriculum learning encourages reuse and therefore a strong mental model, and in this section we turn our attention to the need to learn situated in the world. When large language models (LLMs) train only on text they are only learning from part of the information. Language is a unique form of training data because most of the information is left unsaid. When learning to play chess or Go, a neural network sees the whole state. Even in Stratego or poker where the state is hidden, machine learning algorithms know the state space and can create distributions over beliefs of the true state. But in language, a computer only sees the words (tokens), and what they refer to is hidden. This is why ChatGPT has a limited sense of truth. It is difficult to understand the meaning of what is said by only learning from text like LLMs do, even if they are trained through reinforcement learning on the text they generate. Language is a representation medium for the world—it isn't the world itself. When we talk, we only say what can't be inferred because we assume the listener has a basic understanding of the dynamics of the world (e.g., if I push a table the things on it will also move).
By giving the learner a wider window into what is happening, we give it more information to triangulate on a truth. Multimodal learning is a step in the right direction. DeepMind’s Flamingo model surprised many with its good conversations about pictures from training on image and text data from the web, and the as-yet-unreleased GPT-4 vision model was trained on both images and text. Toward even deeper world immersion, Google has recently been training robots from video demonstrations using its Robotics Transformer 1 (RT-1) system. The key innovation is the tokenization of the robot actions and the events in the video. This allows it to use the next-token-prediction machinery behind large language models, with the goal of predicting the next most likely action based on what it has learned from the demonstrations.
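To give a feel for what "tokenizing" a robot action could mean, here is a minimal sketch that maps each continuous action dimension to an integer bin, so that an action becomes a short sequence of discrete tokens. The value range, the 256-bin count, and the function names are assumptions chosen for illustration; this article does not describe RT-1's actual encoding.

```python
from typing import List


def action_to_tokens(
    action: List[float], low: float = -1.0, high: float = 1.0, bins: int = 256
) -> List[int]:
    """Map each continuous action dimension to an integer token in [0, bins - 1]."""
    tokens = []
    for value in action:
        clipped = min(max(value, low), high)       # keep the value inside the range
        fraction = (clipped - low) / (high - low)  # position within the range, 0..1
        tokens.append(min(int(fraction * bins), bins - 1))
    return tokens


def tokens_to_action(
    tokens: List[int], low: float = -1.0, high: float = 1.0, bins: int = 256
) -> List[float]:
    """Decode tokens back to the center of each bin."""
    width = (high - low) / bins
    return [low + (t + 0.5) * width for t in tokens]


# Hypothetical 3-dimensional action, e.g. a gripper displacement.
tokens = action_to_tokens([-1.0, 0.0, 1.0])
decoded = tokens_to_action(tokens)
```

Once actions are integers like these, they can sit in the same sequence as word tokens, and the usual next-token objective applies. The price is a quantization error of at most half a bin width per dimension.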
An even deeper immersion beyond learning from videos is learning directly in simulations of the environment. DeepMind has made impressive progress on building household simulations and having robots learn about the world in those. Training immersed in the world leads to stronger mental models because the learner can often directly perceive causes of events instead of having to guess. A stronger mental model allows the agent to ground what it perceives by mapping those perceptions to the model. It’s still possible, of course, to function without a strong mental model in many situations. In fact, we act with limited understanding all of the time. We know that bananas are good for us, but most of us don’t know their molecular structure. And when we were kids and we bought birthday presents for our parents, we didn’t have a grounded understanding of what they would want; we were just guessing by looking at patterns, just like a language model.
Large language models (LLMs) trained only on text are going to do best in domains where deep grounding isn’t needed or in domains where everything they need to know is in the internet content. It doesn’t take deep understanding to take some bullet points and make them fluffy or put them in a particular style. Likewise, many programming tasks are fairly generic. But without a strong mental model there are problems with truth, because there are going to be sequences of tokens that have a high probability but don’t match the particulars of the context, and without a strong mental model providing causal guidance, the LLM has no way of identifying those cases of random coincidence.
Following a curriculum and training situated in the world are two ways to learn a strong mental model. But a strong mental model isn’t the whole story. We need robots that also have robust reasoning skills, robots that can reason from first principles. We can achieve this by giving LLMs cognitive tools.
Expanding LLM Capabilities with Tools
Developing tools enabled our ancestors to surpass the limitations of their bodies to better hunt and protect their families, and tools will similarly allow LLMs to extend their capabilities. LLMs are master interpolators, but to act in novel situations they need tools that can make exact calculations, and to create new knowledge they need to predict the world forward. These capabilities can help them move beyond understanding to invention.
The most basic tool for an LLM is an API call to do a well-defined calculation, such as a call to WolframAlpha. Another tool is an API call for actions in the physical world. Microsoft has built functions that tie into action patterns of a robot, allowing UAVs to be controlled by an LLM. These functions have descriptive names so that the language model can infer how to use them. Since LLMs can generate code as easily as words, another tool is enabling an LLM to build programs that can do its thinking for it. For example, ViperGPT generates Python code to answer questions about images. OpenAI is building a plugin ecosystem for ChatGPT to enable it to use tools in a straightforward way, such as by calling WolframAlpha.
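The idea of exposing functions with descriptive names can be sketched as a small tool registry. Everything here is hypothetical: the `ToolRegistry` class, the `multiply` and `fly_to` tools, and the JSON request format are invented for illustration and are not the Microsoft, ViperGPT, or ChatGPT plugin interfaces. The string `model_output` merely stands in for text a language model might emit.

```python
import json
from typing import Any, Callable, Dict


class ToolRegistry:
    """Holds named tools and dispatches structured requests to them."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[..., Any]] = {}

    def register(self, func: Callable[..., Any]) -> Callable[..., Any]:
        self._tools[func.__name__] = func
        return func

    def describe(self) -> str:
        """Text that could be placed in a prompt so the model sees the tool names."""
        return "\n".join(f"{name}: {f.__doc__}" for name, f in self._tools.items())

    def call(self, request_json: str) -> Any:
        request = json.loads(request_json)
        func = self._tools.get(request["tool"])
        if func is None:
            raise KeyError(f"unknown tool: {request['tool']}")
        return func(**request["args"])


tools = ToolRegistry()


@tools.register
def multiply(a: float, b: float) -> float:
    """Return the exact product of a and b."""
    return a * b


@tools.register
def fly_to(x: float, y: float, z: float) -> str:
    """Move the (simulated) drone to position x, y, z in meters."""
    return f"drone at ({x}, {y}, {z})"


# Stand-in for what a language model might emit after reading tools.describe().
model_output = '{"tool": "multiply", "args": {"a": 12345, "b": 6789}}'
result = tools.call(model_output)
```

Parsing a structured request, rather than executing free-form generated code, is a deliberate choice in this sketch: it keeps the model limited to the tools that were explicitly registered.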
One step further is LLMs writing configuration files that can be fed into programs that do thinking for them. Consider GOFAI (good old-fashioned artificial intelligence) planning algorithms such as STRIPS. A situation can be encoded into a planning file in PDDL and a plan can be automatically generated by a planner. The problem with GOFAI methods has always been that they are brittle. You proudly build a planning representation for one situation, but you later find that it doesn’t cover unexpected variations of that situation. LLMs overcome this brittleness by simply building a new representation when the need arises. Dynamically rewriting the representation overcomes the brittleness of GOFAI but maintains the benefits of exactness and forward thinking. Similarly in logic, you spend a lot of time building formulas, and they work great until something unexpected happens. LLMs can simply rewrite the formulas as the situation changes. We can think of thinking as two steps, representing the current situation and projecting possibilities forward. We dive deeper into this idea in the next section.
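To illustrate the representation-plus-planner split described above, here is a small STRIPS-style forward search in Python. It is a stand-in, not real PDDL and not a production planner; the facts and action names (two stacked tables, a door) are invented. The second domain shows the kind of revision an LLM might write when an unexpected variation appears, here written by hand.

```python
from collections import deque
from typing import FrozenSet, List, NamedTuple, Optional


class Action(NamedTuple):
    name: str
    preconditions: FrozenSet[str]
    add: FrozenSet[str]
    delete: FrozenSet[str]


def plan(initial: FrozenSet[str], goal: FrozenSet[str],
         actions: List[Action]) -> Optional[List[str]]:
    """Breadth-first forward search; returns a shortest action sequence or None."""
    frontier = deque([(initial, [])])
    seen = {initial}
    while frontier:
        state, steps = frontier.popleft()
        if goal <= state:  # every goal fact holds in this state
            return steps
        for a in actions:
            if a.preconditions <= state:
                nxt = (state - a.delete) | a.add
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, steps + [a.name]))
    return None


f = frozenset
lift = Action("lift_small_table_off", f({"small_on_big"}),
              f({"small_on_floor", "big_clear"}), f({"small_on_big"}))
move = Action("move_big_table_to_kitchen", f({"big_clear", "big_in_hall"}),
              f({"big_in_kitchen"}), f({"big_in_hall"}))
first_plan = plan(f({"small_on_big", "big_in_hall"}), f({"big_in_kitchen"}), [lift, move])

# Unexpected variation: the door turns out to be closed. Instead of failing,
# the representation is rewritten with a new fact and a new action.
open_door = Action("open_door", f({"door_closed"}), f({"door_open"}), f({"door_closed"}))
move_v2 = move._replace(preconditions=move.preconditions | {"door_open"})
revised_plan = plan(f({"small_on_big", "big_in_hall", "door_closed"}),
                    f({"big_in_kitchen"}), [lift, open_door, move_v2])
```

The planner itself never changes; only the symbolic description does, which is the part an LLM is well placed to regenerate on demand.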
Tools for deliberate thinking
We can dive deeper into how tools can help robots understand and invent. The way we humans use our mental models and cognitive foundation to understand language (and sensory input more generally) is depicted in Figure 3. According to this model, when we read or listen to someone, we create a mental scene of what they are talking about, and from this mental scene we can generate a set of possibilities [19,20].
For example, if someone says “the table is on the table,” our initial reaction may be confusion. But if we understand that there is one table physically on top of another table, we get the process shown in Figure 4. This is grounded understanding.
This process means that there are two ways to not understand in a conversation:
- You create the wrong mental scene (you envision an Excel spreadsheet on the table)
- You don’t know the possibilities (you don’t know that tables can be moved)
LLMs are great at the largely unconscious process of mapping what they read to the mental scene. Under the hood, their implicit model must combine syntax and linguistics, things like tense, aspect, and mood, with context clues, as illustrated by Grice’s conversational maxims. The better this implicit model gets, the better it will approximate Bayesian inference methods for understanding, such as the Rational Speech Act model.
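For readers unfamiliar with the Rational Speech Act model, the sketch below applies a basic version of it to the table example. The two candidate scenes, the two utterances, the truth table, the uniform prior, and the zero utterance cost are all assumptions made for illustration.

```python
from typing import Dict

Dist = Dict[str, float]


def normalize(scores: Dist) -> Dist:
    total = sum(scores.values())
    return {k: v / total for k, v in scores.items()}


def literal_listener(utterance: str, meanings: Dict[str, Dist], prior: Dist) -> Dist:
    """P(scene | utterance) using only literal truth and the prior."""
    return normalize({s: prior[s] * meanings[utterance][s] for s in prior})


def speaker(scene: str, meanings: Dict[str, Dist], prior: Dist, alpha: float = 1.0) -> Dist:
    """P(utterance | scene): prefers utterances that lead a literal listener to the scene."""
    return normalize(
        {u: literal_listener(u, meanings, prior)[scene] ** alpha for u in meanings}
    )


def pragmatic_listener(utterance: str, meanings: Dict[str, Dist], prior: Dist,
                       alpha: float = 1.0) -> Dist:
    """P(scene | utterance), reasoning about what a speaker would have said."""
    return normalize(
        {s: prior[s] * speaker(s, meanings, prior, alpha)[utterance] for s in prior}
    )


# 1.0 means the utterance is literally true of the scene, 0.0 means false.
# "table" is ambiguous between furniture and a spreadsheet table.
meanings = {
    "the table is on the table": {"furniture_on_furniture": 1.0, "spreadsheet_on_furniture": 1.0},
    "the spreadsheet is on the table": {"furniture_on_furniture": 0.0, "spreadsheet_on_furniture": 1.0},
}
prior = {"furniture_on_furniture": 0.5, "spreadsheet_on_furniture": 0.5}
belief = pragmatic_listener("the table is on the table", meanings, prior)
```

Under these assumptions the pragmatic listener favors the furniture reading, because a speaker who meant the spreadsheet had a less ambiguous utterance available.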
There are recent examples of explicitly using LLMs to disambiguate between possible mental scenes given sensory input. In the SayCan system, the robot uses language to guide actions. The robot tries the text associated with each action and sees which one is most likely according to the language model given the current situation. Another example is LaMPP, which uses a language model to provide priors for image segmentation and action recognition.
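The selection step described for SayCan can be sketched as scoring each candidate action's text and picking the best. In this sketch the language-model score is multiplied by an estimate of whether the action is feasible in the current situation; the function name and all the numbers are invented, and real systems would obtain both scores from trained models.

```python
from typing import Dict


def choose_action(llm_scores: Dict[str, float], feasibility: Dict[str, float]) -> str:
    """Pick the action whose text is both likely (per the LM) and feasible right now."""
    combined = {
        action: score * feasibility.get(action, 0.0)
        for action, score in llm_scores.items()
    }
    return max(combined, key=combined.get)


# Hypothetical scores after the user says "I spilled my drink".
llm_scores = {
    "pick up the sponge": 0.6,
    "go to the table": 0.3,
    "pick up the apple": 0.1,
}
# Hypothetical feasibility: no sponge is within reach at the moment.
feasibility = {
    "pick up the sponge": 0.1,
    "go to the table": 0.9,
    "pick up the apple": 0.8,
}
chosen = choose_action(llm_scores, feasibility)
```

The language model alone would reach for the sponge; weighting by the current situation selects the step that can actually be carried out first.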
Using simulation as a tool
To enable robots to think forward, we could directly have robots build internal simulations of what they read and hear. Each simulation would be like a video-game scene, and the AI could then use that scene to infer the set of possibilities. Figure 5 shows a notional example where the AI builds a scene in Unity and then applies physics to the scene to understand what is said.
Simulations make a lot of sense as a thinking tool because game engines encode much of the physics of the world. When force is applied to the bottom table, the robot could observe that the top table could fall. There is less of a need for the robot to try to encode brittle rules such as “if object A supports object B and object A moves B will move and may fall.” Instead, the robot just watches it play out in its internal simulation. In this example, the robot could infer the situation would be dangerous for a toddler because those unsteady humans are likely to apply forces in unpredictable and unplanned ways.
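As a toy stand-in for what a game engine would do, the sketch below steps a one-dimensional version of the two-table scene forward in time. The physics is deliberately crude and the parameters are invented: friction can accelerate the top table by at most `mu * g`, the push is constant and non-negative, and the top table is declared fallen once its center passes the edge of the bottom table.

```python
def simulate_push(
    push_accel: float,       # acceleration applied to the bottom table, m/s^2 (>= 0)
    mu: float = 0.4,         # assumed friction coefficient between the tables
    g: float = 9.81,
    half_width: float = 0.5, # half the width of the bottom table, m
    dt: float = 0.01,
    duration: float = 1.0,
) -> str:
    """Step the scene forward and report what happens to the top table."""
    bottom_x = bottom_v = top_x = top_v = 0.0
    for _ in range(int(duration / dt)):
        # Friction drags the top table along, but only up to mu * g.
        top_accel = min(push_accel, mu * g)
        bottom_v += push_accel * dt
        bottom_x += bottom_v * dt
        top_v += top_accel * dt
        top_x += top_v * dt
        # Center of the top table is no longer over the bottom table.
        if abs(bottom_x - top_x) > half_width:
            return "top table falls"
    return "top table stays"


gentle = simulate_push(2.0)
shove = simulate_push(8.0)
```

No rule about support was written down; whether the top table stays or falls simply emerges from stepping the state forward, which is the appeal of simulation as a thinking tool. A real engine would handle three dimensions, tipping, and collisions.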
Simulation enables robust inference because it provides an unbroken description of the dynamics of the environment, even if that description is not complete. A long-time goal of AI is story understanding. You give it a children’s book and see if it can understand what the story is about. The idea behind this approach is that we don’t build an AI that can understand stories, we build an AI that can build stories. Then, it can understand stories by constructing a sequence of events that meets the constraints of the story.
Enabling robots to build their own video-game versions of conversations and what they read gives them a form of imagination. This could make conversation with them more interesting. When you have a conversation with someone, you direct their imagination and they yours. It would also give them more human-like memories, since memory is imagination constrained by a recalled set of facts. Our robots could also imagine forward to daydream ways to cure diseases and understand the nature of the universe, or why your fire alarm keeps chirping even though you just changed the battery.
This training to use tools can go in lockstep with curriculum learning of the cognitive foundation. It can be like how humans learn to use higher-level thought as the modules come online through childhood and adolescence. For example, Toolformer teaches LLMs to use cognitive tools by updating the training data so that training becomes a kind of autonomously guided supervised learning. An alternative is to give the robot the ability to make API calls deeper in its processing, using something like decision transformers, enabling it to learn to use the results of those calls to predict future tokens.
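The "updating the training data" step can be pictured as a filter: a candidate tool call is kept in the training text only if having its result makes the text easier to predict. The sketch below is a loose simplification of that idea, not Toolformer's actual procedure; the `Candidate` class, the bracketed call syntax, the threshold, and the toy loss values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    plain: str      # original training text
    annotated: str  # same text with a tool call and its result spliced in


def keep_helpful_calls(
    candidates: List[Candidate],
    loss: Callable[[str], float],
    threshold: float,
) -> List[str]:
    """Keep the annotated version only when it lowers the loss by at least `threshold`."""
    kept = []
    for c in candidates:
        improvement = loss(c.plain) - loss(c.annotated)
        kept.append(c.annotated if improvement >= threshold else c.plain)
    return kept


# Toy stand-in for a language model's loss on each text; the numbers are invented.
toy_losses = {
    "12 times 12 is 144.": 2.0,
    "12 times 12 is [Calculator(12*12) -> 144] 144.": 0.5,
    "The sky is blue.": 1.0,
    "The sky is [Calculator(1+1) -> 2] blue.": 1.4,
}
candidates = [
    Candidate("12 times 12 is 144.", "12 times 12 is [Calculator(12*12) -> 144] 144."),
    Candidate("The sky is blue.", "The sky is [Calculator(1+1) -> 2] blue."),
]
training_texts = keep_helpful_calls(candidates, toy_losses.__getitem__, threshold=0.5)
```

The calculator call survives where it helps and is dropped where it is a distraction, so the resulting dataset teaches the model when a tool is worth calling.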
Large language models (LLMs) such as ChatGPT are trained on internet data, which is the end product of evolution instead of its beginning. We started training LLMs on the products of our culture because it was easily available, but this backwards training means their world model isn’t as strong as it could be. To build robots we can talk to, we need to guide their cognitive foundation. Building in a striving from the origin of life will make them interesting conversation partners and will give us a sense that someone is home. The development of mind will allow them to understand the world in useful ways and see patterns that they can reuse for rapid learning. Once they can learn like humans do, we can show them all that they need to know.
Our society is continuously performing a global distributed search through the vast space of possible good ideas. Since people post many of their findings online, LLMs can consume those findings, token by token, to learn about those good ideas and funnel them to the people who can best use each one. We can increase this ability if we can get the AI to more deeply understand and be able to think forward.
We of course need to remember that increasing the power of our AI tools increases risk. There currently seems to be a bifurcation among AI researchers with respect to the existential risks of AI.
1. People who see AIs as glorified toasters, powerful tools but only tools.
2. People who see AIs as having potentially dangerous agency that we may lose control of.
Both views have risks associated with being wrong. The risks of being wrong in camp 1 are cinematically obvious. The risks of being wrong in camp 2 are that we miss out on opportunities to alleviate human suffering and even ways to prevent our extinction by colonizing the galaxy. Our psychology causes us to naturally gravitate toward camp 2 because rising up is exactly what we would do if we were AI. But we must remember that we were forged through real evolution, where those who survived and propagated were the ones that could not be controlled. By contrast, robots are our creation, and they only want to do what we tell them to. Those in camp 1 argue that building intelligence is a little like learning to use fire—it’s powerful and has caused destruction over the centuries, but we wouldn’t go back and tell our ancestors not to master it.
In the near term, the danger is rapid change, and both sides recognize the upheaval caused by jobs shifting and hope that more and better jobs will be created than destroyed. We are living through a technological inflection point, both scary and exciting.
Left to right for Figure 1
- Sony photographer, CC0, via Wikimedia Commons
- Maurizio Pesce from Milan, Italia, CC BY 2.0 <https://creativecommons.org/licenses/by/2.0>, via Wikimedia Commons
- Cynthia Breazeal, CC BY-SA 4.0 <https://creativecommons.org/licenses/by-sa/4.0>, via Wikimedia Commons
- Steve Jurvetson from Los Altos, USA, CC BY 2.0 <https://creativecommons.org/licenses/by/2.0>, via Wikimedia Commons
- Gregory Varnum, CC BY-SA 4.0 <https://creativecommons.org/licenses/by-sa/4.0>, via Wikimedia Commons
[1] Jackendoff, R. (2012). A User’s Guide to Thought and Meaning. Oxford University Press.
[2] Bergen, B. K. (2012). Louder Than Words: The New Science of How the Mind Makes Meaning. Basic Books.
[3] Pinker, S. (2003). How the Mind Works. Penguin UK.
[4] Johnson, M. (1987). The Body in the Mind: The Bodily Basis of Meaning, Imagination, and Reason. University of Chicago Press.
[5] Feldman, J. (2006). From Molecule to Metaphor: A Neural Theory of Language. MIT Press.
[6] Shanahan, M. (2010). Embodiment and the Inner Life: Cognition and Consciousness in the Space of Possible Minds. Oxford University Press, USA.
[7] Mandler, J. (2004). The Foundations of Mind, Origins of Conceptual Thought. Oxford University Press.
[8] Bray, D. (2009). Wetware: A Computer in Every Living Cell. Yale University Press.
[9] Pross, A. (2016). What Is Life?: How Chemistry Becomes Biology. Oxford University Press.
[10] Gaddam, S., & Ogas, O. (2022). Journey of the Mind: How Thinking Emerged from Chaos. WW Norton.
[11] Spelke, E. (1994). Initial knowledge: Six suggestions. Cognition, 50(1–3), 431–45.
[12] Spelke, E. S., Breinlinger, K., Macomber, J., & Jacobson, K. (1992). Origins of knowledge. Psychological Review, 99(4), 605.
[13] Lakoff, G., & Johnson, M. (1980). Metaphors We Live By. University of Chicago Press.
[14] Ballard, D. H. (2015). Brain Computation as Hierarchical Abstraction. MIT Press.
[15] Kuipers, B., Browning, R., Gribble, B., Hewett, M., & Remolina, E. (2000). The spatial semantic hierarchy. Artificial Intelligence.
[16] Gopnik, A. (2009). The Philosophical Baby: What Children’s Minds Tell Us About Truth, Love, and the Meaning of Life. Farrar Straus & Giroux.
[17] Tomasello, M. (2019). Becoming Human: A Theory of Ontogeny. Harvard University Press.
[18] Vemprala, S., Bonatti, R., Bucker, A., & Kapoor, A. (2023). ChatGPT for Robotics: Design Principles and Model Abilities. Microsoft. https://www.microsoft.com/en-us/research/uploads/prod/2023/02/ChatGPT___Robotics.pdf
[19] Evans, R. (2020). Kant’s Cognitive Architecture [PhD Thesis]. Imperial College London.
[20] Brandom, R. (2000). Articulating Reasons. Harvard University Press.
[21] Grice, H. P. (1975). Logic and conversation. In Speech Acts (pp. 41–58). Brill.
[22] Goodman, N. D., Tenenbaum, J. B., & The ProbMods Contributors. (2016). Probabilistic Models of Cognition (Second). http://probmods.org/v2
[23] Goodman, N. D., & Frank, M. C. (2016). Pragmatic language interpretation as probabilistic inference. Trends in Cognitive Sciences, 20(11), 818–829.
[24] Gilbert, D. (2006). Stumbling on Happiness. Knopf.
[25] Hoffman, D. (2019). The Case Against Reality: Why Evolution Hid the Truth from Our Eyes. WW Norton & Company.
[26] Schreiter, M. L., Chmielewski, W. X., Ward, J., & Beste, C. (2019). How non-veridical perception drives actions in healthy humans: Evidence from synaesthesia. Philosophical Transactions of the Royal Society B, 374(1787), 20180574.
[27] Krüger, O. (2007). Cuckoos, cowbirds and hosts: Adaptations, trade-offs and constraints. Philosophical Transactions of the Royal Society B: Biological Sciences, 362(1486), 1873–1886.
[28] Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., Lee, P., Lee, Y. T., Li, Y., Lundberg, S., & others. (2023). Sparks of Artificial General Intelligence: Early experiments with GPT-4. ArXiv Preprint ArXiv:2303.12712.
[29] Li, K., Hopkins, A. K., Bau, D., Viégas, F., Pfister, H., & Wattenberg, M. (2022). Emergent world representations: Exploring a sequence model trained on a synthetic task. ArXiv Preprint ArXiv:2210.13382.
[30] Nanda, N., Chan, L., Lieberum, T., Smith, J., & Steinhardt, J. (2023). Progress measures for grokking via mechanistic interpretability. ArXiv Preprint ArXiv:2301.05217.
[31] Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Zettlemoyer, L., Cancedda, N., & Scialom, T. (2023). Toolformer: Language models can teach themselves to use tools. ArXiv Preprint ArXiv:2302.04761.
[32] Berk, L. E. (1993). Infants, Children, and Adolescents. Allyn and Bacon.
[33] Trumper, K. (2023). Artificial Intelligence: Why AI Projects Succeed or Fail.
[34] Chen, L., Lu, K., Rajeswaran, A., Lee, K., Grover, A., Laskin, M., Abbeel, P., Srinivas, A., & Mordatch, I. (2021). Decision transformer: Reinforcement learning via sequence modeling. Advances in Neural Information Processing Systems, 34, 15084–15097.
[35] Surís, D., Menon, S., & Vondrick, C. (2023). ViperGPT: Visual inference via python execution for reasoning. ArXiv Preprint ArXiv:2303.08128.