Reflections on Foundation Models

This piece was originally published on the Stanford CRFM and Stanford HAI blogs.

Recently, we released our report on foundation models, launched the Stanford Center for Research on Foundation Models (CRFM) as part of the Stanford Institute for Human-Centered AI (HAI), and hosted a workshop to foster community-wide dialogue. Our work received an array of responses from a broad range of perspectives; some folks graciously shared their commentaries with us. We see open discourse as necessary for forging the right norms, best practices, and broader ecosystem around foundation models. In this blog post, we talk through why we believe these models are so important and clarify several points in relation to the community response. In addition, we support and encourage further community discussion of these complex issues; feel free to reach out at [email protected].

Foundation models

We define foundation models as models trained on broad data (generally using self-supervision at scale) that can be adapted to a wide range of downstream tasks. These models, which are based on standard ideas in transfer learning and recent advances in deep learning and computer systems applied at a very large scale, demonstrate surprising emergent capabilities and substantially improve performance on a wide range of downstream tasks. Given this potential, we see foundation models as the subject of a growing paradigm shift, where many AI systems across domains will directly build upon or heavily integrate foundation models. Foundation models incentivize homogenization: the same few models are repeatedly reused as the basis for many applications. Such consolidation is a double-edged sword: centralization allows us to concentrate and amortize our efforts (e.g., to improve robustness, to reduce bias) on a small collection of models that can be repeatedly applied across applications to reap these benefits (akin to societal infrastructure), but centralization also pinpoints these models as singular points of failure that can radiate harms (e.g., security risks, inequities) to countless downstream applications.

In our appraisal of the status quo, we see clear evidence of such homogenization: within the research community, large language models like BERT are pervasively adopted in the NLP community and foundation models are similarly being researched in other areas (e.g., computer vision, reinforcement learning, protein folding, music, speech, organic molecules). Simultaneously, in industry, several startups are heavily prioritizing these models (e.g., Hugging Face, Anthropic, AI21 Labs, Cohere, Aleph Alpha) and big tech companies like Google, Facebook, and Microsoft are increasingly investing in developing and using these models in products that impact billions of people.

With this increasing scale of application in mind, we emphasize that foundation models present clear and significant societal risks, both in their current implementation and their fundamental premise. In addition, the resources required to train these models have reduced accessibility and exclude the majority of the community, which lacks those resources, from directly shaping their development.

This context informs and motivates our efforts at CRFM. In our report, we center the responsible development of foundation models with specific focuses on inequity, misuse, environmental impact, legal frameworks, ethics of scale, and economic consequences. These societal considerations further inform our discussion of the technical underpinnings of these models (data, architectures, objectives, systems, evaluation, theory), their implications for areas of AI (robotics, vision, reasoning), and their applications for various disciplines (law, healthcare, education, biomedicine). We also outline how existing practices should change: protocols for data management, respect for privacy, standard evaluation paradigms, mechanisms for intervention and recourse to combat injustice, and the broader norms that govern foundation models. In general, we believe concerted action during this formative period will shape how foundation models are developed, who controls this development, and how foundation models will affect the broader ecosystem and impact society.

Below, we discuss specific points that have been the subject of discussion.

The current trajectory for foundation models is not inevitable. Our efforts are galvanized by our belief that significant change is necessary in both model development (e.g., adoption of data practices that respect the rights and dignity of data subjects as opposed to indiscriminate scraping) and the broader ecosystem (e.g., increasing access to these models, decreasing centralization of power surrounding these models in big tech). We especially highlight academia as a key entity in shaping this space, noting “the furious pace of technological progress and the entrenchment [of foundation models] due to centralization raise powerful concerns that demand the attention of humanists and social scientists in addition to technologists” (pg. 9-10).

Foundation models can (and increasingly should) be grounded. “Perception, interaction, acting in a physical 4D world, acquiring models of commonsense physics, theories of mind, and acquiring language grounded in this world are important components of AI” (Malik, 2021) that all involve or require grounding. Given the importance of grounding, we draw attention to existing efforts that successfully build foundation models with various forms of grounding such as CLIP, which is trained on image-text pairs on the Internet, and MERLOT, which is trained on YouTube videos with transcribed speech. Further, grounding can also be achieved via integrating foundation models with other approaches in AI: PIGLeT is a recent example that grounds linguistic form encoded using a pretrained language model with a neural model of physical dynamics. Foundation models are not just large language models; foundation models can also be trained using images, video, and other sensory and knowledge base data—and some already are. In our report, we underscore grounding as critical to how foundation models will evolve in computer vision (Section 2.2) and robotics (Section 2.3).

Emphasizing the role of people. At present, the development of foundation models generally fails to center people, and development itself is largely closed off to all but a small collection of high-resourced actors. In our report and broader approach, we aim to identify and uplift the role of people throughout the ecosystem that situates foundation models. People create the data that underpins foundation models, develop foundation models, adapt foundation models for specific applications, and interact with the resulting applications. More extensively, we highlight data subjects, creators, curators, and managers (Section 4.6), foundation model providers (Section 4.1), downstream application developers (Sections 2.5 and 4.3), hardware and software developers involved in co-design (Section 4.5), malicious actors (Section 5.2), marginalized populations (Section 5.1), and domain experts, patients, litigants, and students (Section 3), among others. At the workshop, we especially emphasized how diversity (e.g., representational diversity, institutional diversity, and disciplinary diversity) is necessary (but not sufficient) for centering people to a far greater extent than they currently are in the development of foundation models. While our study is about foundation models, it is thus not our intention to center them in a way that is at odds with human-centric approaches.

Supporting diverse research. Diversity in many senses, including methods and approach, is essential for a healthy research community. The community at CRFM reflects this: we come to the topic of foundation models with different disciplinary backgrounds, and all of us also pursue research that is orthogonal, complementary, or contradictory to foundation models. Indeed, many of us have been, or remain, skeptical about various aspects of these models. Consequently, we believe that drawing attention to foundation models need not displace or deprioritize other types of research. In fact, we believe that foundation models are not only compatible with many other research topics, but also that foundation models may enable new breakthroughs in these areas.

For example, while foundation models are very much “bottom-up” in that structure emerges from the data, methods such as causal networks, probabilistic programs, and formal systems are “top-down”, imposing strong structure. While these two paradigms might seem incompatible, we think that these approaches are rather synergistic. Inference in top-down approaches is generally computationally difficult due to having to solve an inverse problem, but foundation models could provide a fast proposal that aids in inference (Section 2.4). By analogy to Daniel Kahneman’s System 1 and 2, foundation models may provide a (very good) implementation of fast, automatic, surface-level reasoning (System 1) that can be integrated with other approaches for slow, analytic, deliberative reasoning (System 2).
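To make the "fast proposal" idea concrete, here is a minimal sketch under stated assumptions; it is not the method described in the report. The top-down model is a toy one chosen for illustration (a latent z drawn from a standard normal, observed with Gaussian noise), and the "fast proposal" is a fixed Gaussian with hand-picked parameters standing in for what a learned model might output. All function and variable names are hypothetical. The sketch uses self-normalized importance sampling and compares a proposal drawn blindly from the prior against one that is already close to the posterior.

```python
import math
import random


def log_normal_pdf(value, mean, std):
    """Log density of a univariate Gaussian."""
    return -0.5 * math.log(2 * math.pi * std * std) - (value - mean) ** 2 / (2 * std * std)


def log_joint(z, x, noise_std):
    """Toy top-down model (an assumption): z ~ N(0, 1), x | z ~ N(z, noise_std^2)."""
    return log_normal_pdf(z, 0.0, 1.0) + log_normal_pdf(x, z, noise_std)


def importance_estimate(x, noise_std, proposal_mean, proposal_std, num_samples, rng):
    """Self-normalized importance sampling estimate of E[z | x].

    Returns the estimate and the effective sample size (ESS), which
    measures how many of the drawn samples carry meaningful weight.
    """
    samples = [rng.gauss(proposal_mean, proposal_std) for _ in range(num_samples)]
    log_weights = [
        log_joint(z, x, noise_std) - log_normal_pdf(z, proposal_mean, proposal_std)
        for z in samples
    ]
    # Subtract the maximum before exponentiating for numerical stability.
    max_log_weight = max(log_weights)
    weights = [math.exp(lw - max_log_weight) for lw in log_weights]
    total = sum(weights)
    posterior_mean = sum(w * z for w, z in zip(weights, samples)) / total
    ess = total ** 2 / sum(w * w for w in weights)
    return posterior_mean, ess


rng = random.Random(0)
x_obs, noise_std = 2.0, 0.5

# Closed-form posterior mean for this conjugate toy model, used as a reference.
exact_mean = x_obs / (1 + noise_std ** 2)

# Baseline: propose from the prior, ignoring the observation.
prior_mean, prior_ess = importance_estimate(x_obs, noise_std, 0.0, 1.0, 2000, rng)

# "Fast proposal": a hand-picked Gaussian near the posterior, standing in
# for the output of a learned model (hypothetical values, not a real model).
fast_mean, fast_ess = importance_estimate(x_obs, noise_std, 1.5, 0.6, 2000, rng)

print(f"exact posterior mean: {exact_mean:.3f}")
print(f"prior proposal:  mean={prior_mean:.3f}, ESS={prior_ess:.0f}")
print(f"fast proposal:   mean={fast_mean:.3f}, ESS={fast_ess:.0f}")
```

Both runs target the same posterior, but the proposal that starts near the answer wastes far fewer samples, which is the sense in which a good guess can speed up inference in a structured model.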

Summary: Our objective, in both the report and the broader research agenda at CRFM, is to provide a measured perspective on foundation models that recognizes their strengths and weaknesses. By drawing attention to these models, we seek to highlight the dramatic success and rapid adoption of these models and simultaneously their existing deficiencies, enduring limitations, and reasons for societal concern. Our aim is to help shape a better future for how these models are developed, deployed, and come to impact the broader ecosystem they are situated in.

Naming

The name “foundation model” has also drawn significant attention; given this attention, we want to provide context on how the name came about. We began by surveying existing terms (e.g., “(large) language model”, “self-supervised model”, “pretrained model”). Of these terms, we found several did not identify the correct class or characteristics of models: “(large) language model” was too narrow given our focus was not only language; “self-supervised model” was too specific to the training objective; and “pretrained model” suggested that the noteworthy action all happened after “pretraining”. In general, we found most terms did not convey the care with which we felt these models should be built.

Generally unsatisfied with existing terms, we considered a vast array of new terms that emphasize different dimensions of these models (e.g., base model, broadbase model, inframodel, platform model, task-agnostic model, polymodel, pluripotent model, generalist model, universal model, multi-purpose model). After several weeks of debate, including an explicit tournament to contrast different name candidates that included everyone in the center weighing in, we settled on “foundation model”. In choosing this term, we take “foundation” to designate the function of these models: a foundation is built first and it alone is fundamentally unfinished, requiring (possibly substantial) subsequent building to be useful. “Foundation” also conveys the gravity of building durable, robust, and reliable bedrock through deliberate and judicious action. This aligns with our belief that it is critical for the community to be able to audit, evaluate, and critique these foundations rather than permitting them to be built unchecked and uninspected. That is, foundations are neither good by default nor should be assumed to be good; existing foundation models provide shaky foundations in significant ways. No name is perfect; even within the center, “foundation model” is not everyone’s favorite, but we do feel “foundation model” appropriately communicates the nature of these models.

Foundation models are neither “foundational” nor the foundations of AI. We deliberately chose “foundation” rather than “foundational”, because we found that “foundational” implied that these models provide fundamental principles in a way that “foundation” does not. For example, we can readily say “shaky foundations”, whereas “shaky and foundational” is unidiomatic. While many people currently find “foundational model” more natural to say, “foundation model” is grammatically a well-formed noun compound (incidentally parallel to “language model”). Further, “foundation” describes the (role of the) model and not AI; we neither claim nor believe that foundation models alone are the foundation of AI, but instead note they are “only one component (though an increasingly important component) of an AI system” (see Section 1.2; pg. 7-9).

Foundation models in relation to large language models (LLMs). Foundation models are a strict superset of LLMs, though the most salient foundation models currently are LLMs (e.g., GPT-3). The terms highlight distinct properties: “foundation model” emphasizes the function of these models as foundations for downstream applications, whereas “large language model” emphasizes the manner in which these artifacts are produced (i.e., large textual corpora, large model size, “language modeling” objectives). Akin to how deep learning was popularized in computer vision (e.g., ImageNet, AlexNet, ResNet) but now extends beyond it, foundation models emerged in NLP with LLMs, but foundation models that are not LLMs exist for many other modalities, including images, code, proteins, speech, and molecules, as well as multimodal models (see pg. 5).

By using the term “foundation model”, there is a potential risk that pertinent history and critique related to LLMs may be lessened or even erased. We do not believe this justifies using LLM where it is not appropriate, but instead requires that this context be highlighted in relation to foundation models. With that said, we see the name “foundation model” as intensifying the severity of existing critiques. By invoking the term “foundation”, we pinpoint that many critiques are of outsized significance due to the fact that these models currently function as foundations. (We emphasize this in our discussion of inequity and ethics of scale in Sections 5.1 and 5.6.) Consequently, we feel the term “foundation model” makes explicit the importance of this critique in a way that “large language model” largely leaves opaque. Importantly, “foundation model” is not intended to replace LLM, pretrained model, or self-supervised model in all contexts. We envision “foundation model” being used to designate the broader class of models and their function, whereas LM may be used for describing models related to language, LLM to further emphasize scale, pretrained model to center adaptation (e.g., fine-tuning) to downstream tasks, and self-supervised model to emphasize the training objective/process.

Summary: The term “foundation model” emphasizes the function these models (are intended to) serve. It is the term we will use to describe this broad class of models and we hope others will use this term in referencing these models and our work but, ultimately, we hope folks will engage with the substance of our work more deeply beyond the name.

Conclusion

Foundation models are a charged yet increasingly important topic. At CRFM, our philosophy is to embrace the open discussions and debate of these models; we believe this discourse is critical to identifying and creating a path forward. We thank the community for their response and critique of our work and would like to invite anyone with a perspective not represented here to reach out and start a dialogue.

Acknowledgements

We would like to thank Su Lin Blodgett, Ernest Davis, Michael Madaio, Jitendra Malik, Gary Marcus, Girish Sastry, and Jacob Steinhardt for writing extended commentaries on our work. In addition, we would like to thank Maneesh Agrawala, Shyamal Buch, Dallas Card, Katie Creel, Chelsea Finn, Irena Gao, Sidd Karamcheti, Pang Wei Koh, Mina Lee, Fei-Fei Li, Shana Lynch, Christopher Manning, Peter Norvig, Laurel Orr, Shibani Santurkar, and Alex Tamkin for their comments on this post as well as Hugh Zhang for graciously offering to share this post with the Gradient community.