The Reasonable Effectiveness of Virtue Ethics in AI Alignment

Preface

This essay argues that rational people don’t have goals, and that rational AIs shouldn’t have goals. Human actions are rational not because we direct them at some final ‘goals,’ but because we align actions to practices[1]: networks of actions, action-dispositions, action-evaluation criteria, and action-resources that structure, clarify, develop, and promote themselves. If we want AIs that can genuinely support, collaborate with, or even comply with human agency, AI agents’ deliberations must share a “type signature” with the practices-based logic we use to reflect and act.

I argue that these issues matter not just for aligning AI to grand ethical ideals like human flourishing, but also for aligning AI to core safety-properties like transparency, helpfulness, harmlessness, or corrigibility. Concepts like ‘harmlessness’ or ‘corrigibility’ are unnatural -- brittle, unstable, arbitrary -- for agents who’d interpret them in terms of goals or rules, but natural for agents who’d interpret them as dynamics in networks of actions, action-dispositions, action-evaluation criteria, and action-resources.

While the issues this essay tackles tend to sprawl, one theme that reappears over and over is the relevance of the formula ‘promote x x-ingly.’ I argue that this formula captures something important about both meaningful human life-activity (art is the artistic promotion of art, romance is the romantic promotion of romance) and real human morality (to care about kindness is to promote kindness kindly, to care about honesty is to promote honesty honestly).

I start by asking: What follows for AI alignment if we take the concept of eudaimonia -- active, rational human flourishing -- seriously? I argue that the concept of eudaimonia doesn’t simply point to a desired state or trajectory of the world that we should set as an AI’s optimization target, but rather points to a structure of deliberation different from standard consequentialist[2] rationality. I then argue that this form of rational activity and valuing, which I call eudaimonic rationality[3], is a useful or even necessary framework for the agency and values of human-aligned AIs.

These arguments are based both on the dangers of a “type mismatch” between human flourishing as an optimization target and consequentialist optimization as a form, and on certain material advantages that eudaimonic rationality plausibly possesses in comparison to deontological and consequentialist agency with regard to stability and safety.

The concept of eudaimonia, I argue, suggests a form of rational activity without a strict distinction between means and ends, or between ‘instrumental’ and ‘terminal’ values. In this model of rational activity, a rational action is an element of a valued practice in roughly the same sense that a note is an element of a melody, a time-step is an element of a computation, and a moment in an organism’s cellular life is an element of that organism’s self-subsistence and self-development.[4]

My central claim is that our intuitions about the nature of human flourishing are implicitly intuitions that eudaimonic rationality can be functionally robust in a sense highly critical to AI alignment. More specifically, I argue that in light of our best intuitions about the nature of human flourishing it’s plausible that eudaimonic rationality is a natural form of agency, and that eudaimonic rationality is effective even by the light of certain consequentialist approximations of its values. I then argue that if our goal is to align AI in support of human flourishing, and if it is furthermore plausible that eudaimonic rationality is natural and efficacious, then many classical AI safety considerations and ‘paradoxes’ of AI alignment speak in favor of trying to instill AIs with eudaimonic rationality.

Throughout this essay, I will sometimes explicitly and often implicitly be asking whether some form of agency or rationality or practice is natural. The sense of ‘natural’ I’m calling on is certainly related to the senses used in various virtue-ethical traditions, but the interest I take in it is less immediately normative and more material or technical. While I have no reductive definition at hand, the intended meaning of ‘natural’ is related to stability, coherence, relative non-contingency, ease of learnability, lower algorithmic complexity, convergent cultural evolution, hypothetical convergent cultural evolution across different hypothetical rational-animal species, potential convergent evolution between humans and neural-network based AI, and targetability by ML training processes. While I will also make many direct references to AI alignment, this question of material naturalness is where the real alignment-critical action takes place: if we learn that certain exotic-sounding forms of agency, rationality, or practice are both themselves natural and make the contents of our all-too-human values natural in turn, then we have learned about good, relatively safe, and relatively easy targets for AI alignment.

Readers may find the following section-by-section overview useful for navigating the essay:

  • Part I presents a class of cases of rational deliberation that are very different from the Effective Altruism-style optimization[5] many in the AI-alignment world treat as the paradigm of rational deliberation. I call this class of rational deliberations 'eudaimonic rationality,' and identify it with the form of rationality that guides a mathematician or an artist or a friend when they reflect on what to do in mathematics or in art or in friendship.

  • Part II looks at the case of research mathematics (via an account by Terry Tao) as an example of eudaimonic rationality at work. What does a mathematician try to do in math? I say she tries to be mathematically excellent, which involves promoting mathematical excellence through mathematical excellence, and that this structure is closely related to why 'mathematical excellence' can even be a concept.

  • Part III argues that for eudaimonic agents such as a mathematician who is trying to do excellent mathematics, distinctions between ‘instrumental goods’ and ‘terminal goods’ (intrinsic goods) are mostly unnatural. This makes reflection about values go very differently for a eudaimonic agent than for an Effective Altruism-style agent. Instead of looking to reduce a network of causally intertwined apparent values to a minimal base of intrinsic values that “explains away” the rest as instrumental, a eudaimonic agent looks for organism-like causal coherence in a network of apparent values.

  • Part IV cashes out the essay’s central concepts: A eudaimonic practice is a network of actions, action-dispositions, action-evaluation criteria, and action-resources where high-scoring actions reliably (but defeasibly) causally promote future high-scoring actions. Eudaimonic rationality is a class of reflective equilibration and deliberation processes that assume an underlying eudaimonic practice and seek to optimize aggregate action-scores specifically via high-scoring action.

  • In part V, I argue that many puzzles and ‘paradoxes’ about AI alignment are driven by the assumption that mature AI agents will be Effective Altruism-style optimizers. A “type mismatch” between Effective Altruism-style optimization and eudaimonic rationality makes it nearly impossible to translate the interests of humans -- agents who practice eudaimonic rationality -- into a utility function legible to an Effective Altruism-style optimizer AI. But this does not mean that our values are inherently brittle, unnatural, or wildly contingent: while Effective Altruism-style optimizers may well be a natural type of agent, eudaimonic agents (whether biological or AI) are highly natural as well.

  • In part VI, I ask whether a eudaimonically rational AI agent devoted to a practice like mathematical research would be safe by default. I argue that a practice like mathematical research plausibly has natural boundaries that exclude moves like ‘take over planet to get more compute for mathematical research,’ but the issue is nuanced. I propose that a practice’s boundaries (for which there may be multiple good natural candidates) may be most stable when a practice is paired with a support practice: a complementary practice for dealing with practice-external issues of maintenance and resource-gathering.

  • Part VII develops the idea of ‘support practices’: eudaimonically rational ways to support eudaimonic practices. We famously want AI agents to help humans lead flourishing lives, but how can we define the purview of this ‘help’? I argue that many core human practices have natural support-practices with a derived eudaimonic structure: the work of a good couples’ therapist, for instance, is intertwined with but clearly distinct from a couple’s relationship-practice. Still, there remains a problem: a support-practice AI might harm other people and practices to help the people or practice it’s supporting.

  • Part VIII moves from eudaimonic rationality in general to eudaimonically rational morality. I argue that thinking of moral virtues as domain-general, always-on practices solves key AI-alignment-flavored problems with consequentialist and deontological moralities. The core idea is that the conditions for e.g. ‘kindness’ being a robust moral virtue are akin to the conditions for ‘mathematical excellence’ being a meaningful concept: it must be generally viable to promote kindness in yourself and others kindly. It’s this structure, I argue, that gives moral virtues material standing in a ‘fitness landscape’ riven by pressures from neural-network generalization dynamics, reinforcement-learning cycles, and social and natural selection.

  • Part IX argues that eudaimonic agents have some unique forms of robustness to RL-like and Darwinian-like dynamics that tend to mutate the values of EA-style optimizers. In particular, eudaimonic agents should be very robust to the risk of developing rogue subroutines (sometimes called ‘the inner alignment problem’).

  • In part X I discuss canonical AI-safety desiderata like transparency, corrigibility, and (more abstractly) niceness. I argue that treating these properties as moral virtues in my sense -- domain-general, always-on eudaimonic practices -- dissolves problems and paradoxes that arise when treating them as goals, as rules, or even as character traits. I end with an appendix on some prospects for RL regimes geared towards eudaimonic rationality.

I. Rational Action in the Good Life

I start with a consideration of the nature of the good we hope AI alignment can promote. With the exception of hedonistic utilitarians, most actors interested in AI alignment understand our goal as a future brimming with human (and other sapient-being) flourishing: persons living good lives and forming good communities. What I believe many fail to reflect on, however, is that on any plausible conception, human flourishing involves a kind of rational activity. Subjects engaged in human flourishing act in intelligible ways subject to reason, reflection, and revision, and this form of rational care and purposefulness is itself part of the constitution of our flourishing. I believe this characterization of human flourishing is relatively uncontroversial upon reflection, but it raises a kind of puzzle if we’re used to thinking of rationality in consequentialist (or consequentialist-with-deontological-constraints) terms: just what goal is the rational agency involved in human-flourishing activity directed towards?

One obvious answer would be that, like all properly aligned rationality, the rational agency involved in human-flourishing activities is geared towards maximizing human (and other sapient) flourishing. But we should quickly find ourselves confused about the right way to describe the contribution that rational agency in human-flourishing activities makes to human flourishing. It seems neither appropriate to say that the rational agency involved in a human-flourishing activity contributes to human flourishing only by enacting rationality (by selecting actions that are intrinsically valuable when rationally selected), nor appropriate to say that the rational agency involved in a human-flourishing activity contributes to human flourishing just instrumentally (by selecting actions that causally promote human flourishing).[6]

The first option reduces our rational actions to something ritualistic, even as the good life surely involves mathematicians working to advance mathematics, friends speaking heart-to-heart to deepen intimacies, gymnasts practicing flips to get better at flips, and novelists revising chapters to improve their manuscripts. The second option threatens to make the good in the good life just impossible to find -- if speaking heart-to-heart is not the good of friendship, and working on math is not the good of mathematics, then what is?

This essay argues that deliberative reasoning about the good life is neither directed towards goals external to rational action nor directed towards rational action as an independent good, but towards acts of excellent participation in a valued open-ended process. I then go on to argue that the ‘eudaimonic’ structure of deliberation salient in cases like math or friendship (sloganized as ‘promote x x-ingly’) is also subtly critical in more worldly, strategic, or morally high-stakes contexts, and constitutes a major organizing principle of human action and deliberation.

II. What Is a Practice?

Since ‘human flourishing’ can seem mysterious and abstract, let’s focus on some concrete eudaimonic practices.[7] Consider practices like math, art, craft, friendship, athletics, romance, play, and technology, which are among our best-understood candidates for partial answers to the question ‘what would flourishing people in a flourishing community be doing.’ From a consequentialist point of view, these practices are all marked by extreme ambiguity -- and I would argue indeterminacy -- about what’s instrumental and what’s terminal in their guiding ideas of value. Here, for example, is Terry Tao’s account of goodness in mathematics:

‘The very best examples of good mathematics do not merely fulfil one or more of the criteria of mathematical quality listed at the beginning of this article, but are more importantly part of a greater mathematical story, which then unfurls to generate many further pieces of good mathematics of many different types. Indeed, one can view the history of entire fields of mathematics as being primarily generated by a handful of these great stories, their evolution through time, and their interaction with each other. I would thus conclude that good mathematics [...] also depends on the more “global” question of how it fits in with other pieces of good mathematics, either by building upon earlier achievements or encouraging the development of future breakthroughs. [There seems] to be some undefinable sense that a certain piece of mathematics is “on to something”, that it is a piece of a larger puzzle waiting to be explored further.’

It may be possible to give some post-hoc decomposition of Tao’s account into two logically distinct components -- a description of a utility-function over mathematical achievements and an empirical theory about causal relations between mathematical achievements -- but I believe this would be artificial and misleading. On a more natural reading, Tao is describing some of the conditions that make good mathematical practice a eudaimonic practice: In a mathematical practice guided by a cultivated mathematical practical-wisdom judgment (Tao’s ‘undefinable sense that a certain piece of mathematics is “on to something”’), present excellent performance by the standard of the practical-wisdom judgment reliably develops the conditions for future excellent performance by the standard of the mathematical practical-wisdom judgment, as well as cultivating our practical and theoretical grasp of the standard itself.[8]

This is not to suggest that ‘good mathematics causes future good mathematics’ is a full definition or even full informal description of good mathematics. My claim is only that the fact that good mathematics has a disposition to cause future good mathematics reveals something essential about our concept of good mathematics (and about the material affordances enabling this concept). By analogy, consider the respective concepts healthy tiger and healthy human: It's essential to the concept of a healthy tiger that x being a healthy tiger now has a disposition to make x be a healthy tiger 5 minutes in the future (since a healthy tiger body self-maintains and enables self-preservation tiger-behaviours), and essential to the concept of a healthy human that x being a healthy human now has a disposition to make x be a healthy human 5 minutes in the future (since a healthy human body self-maintains and enables self-preservation human behaviours). But these formulae aren't yet complete descriptions of 'healthy tiger' or 'healthy human,' as evidenced by the fact that we can tell apart a healthy tiger from a healthy human.

Crucially, the mathematical practical-wisdom described by Tao is not entirely conceptually opaque beyond its basic characterization as a self-cultivating criterion for self-cultivating excellence in mathematical activity. Mathematical flourishing can partly be described as involving the instantiation of a relation (a mathematical-practice relation of ‘developmental connectedness’) among instantiations of relatively individually definable and quantifiable instances of mathematical value such as elegant proofs, clear expositions, strong theorems, cogent definitions and so on. Furthermore, this relation of developmental connectedness is partly defined by its reliable tendency to causally propagate instances of more individually and locally measurable mathematical value (instances of elegant proofs, clear exposition, strong theorems, cogent definitions and so on):

[I believe] that good mathematics is more than simply the process of solving problems, building theories, and making arguments shorter, stronger, clearer, more elegant, or more rigorous, though these are of course all admirable goals; while achieving all of these tasks (and debating which ones should have higher priority within any given field), we should also be aware of any possible larger context that one’s results could be placed in, as this may well lead to the greatest long-term benefit for the result, for the field, and for mathematics as a whole.

One could, again, try to interpret this causal relationship between excellence according to Tao’s ‘organicist’ (or ‘narrative’ or ‘developmental’) sense of good mathematics and the reliable propagation of narrow instances of good mathematics as evidence of a means-ends rational relation, where additive maximization of narrow instances of mathematical value is the utility function and ‘organicist’ mathematical insight is the means. For Tao, however, the evidential import of this causal relationship goes exactly the other way -- it suggests a unification of our myriad more-explicit and more-standalone conceptions of mathematical excellence into a more-ineffable but more-complete conception. As Tao says:

It may seem from the above discussion that the problem of evaluating mathematical quality, while important, is a hopelessly complicated one, especially since many good mathematical achievements may score highly on some of the qualities listed above but not on others [...] However, there is the remarkable phenomenon that good mathematics in one of the above senses tends to beget more good mathematics in many of the other senses as well, leading to the tentative conjecture that perhaps there is, after all, a universal notion of good quality mathematics, and all the specific metrics listed above represent different routes to uncover new mathematics, or different stages or aspects of the evolution of a mathematical story.

III. Inverting Consequentialist Reflection

Tao’s reasoning about local and global mathematical values exemplifies a central difference between consequentialist rationality and eudaimonic rationality, now taken as paradigms not only for selecting actions but for reflecting on values. (Paradigms for what philosophers will sometimes call ‘reflective equilibration.’) Within the paradigm of consequentialist rationality, if excellence[9] in accordance with a holistic, difficult-to-judge apparent value (say ‘freedom’) is reliably a powerful causal promoter of excellence in accordance with more explicit, more standalone apparent values (say ‘material comfort,’ ‘psychological health,’ ‘lifespan’), this relationship functions as evidence against the status of the holistic prima-facie value as a constitutive -- as opposed to instrumental -- value. Within the paradigm of eudaimonic rationality, by contrast, this same relationship functions as evidence for the status of the holistic prima-facie value as a constitutive value.

For a (typical)[10] consequentialist-rationality reflection process, evidence that the excellence of a whole causally contributes to the excellences of its parts explains away our investment in the excellence of the whole. The “coincidence” that the intrinsically valuable whole is also instrumentally valuable for its parts is taken to suggest a kind of double-counting error -- one we “fix” by concluding that the whole has no constitutive value, but that valuing the whole is an effective heuristic under normal circumstances. A eudaimonic-rationality reflective equilibration, by contrast, treats instrumental causal connections between excellences as evidence that our notions of excellence are picking out something appropriately ‘substantive.’

For eudaimonic-rationality reflective equilibration, it is the discovery of causal and common-cause relations among excellences that ratifies our initial sense that caring about these excellences is eudaimonically rational. The discovery of these causal connections functions as evidence that:

  1. The ‘local’ excellences we care about are resonant or fruitful, in that they causally promote each other and the holistic excellences in which they participate.
  2. The ‘holistic’ excellences we care about are materially efficacious and robust, in that they causally promote both the more local excellences that participate in them and their own continuation as future holistic excellence.[11]

In my view this is the right way to treat causal connections between (apparent) values if we’re hoping to capture actual human values-reflection, and points to an important strength of the eudaimonic rationality paradigm: Eudaimonic rationality dissolves the ‘paradox’ that in real-life arguments about the value of various human enterprises (e.g. the value of branches of science, branches of art, branches of sport), judgments of intrinsic value typically seek succor from some kind of claim to instrumental value. For example, a defense of the importance of research in quantum physics will often appeal to the wonderful technological, mathematical, and special-sciences applications quantum physics gave us, without meaning to reduce the worth of quantum physics to these applications. On my reading, these appeals aren't just additive -- 'aside from the intrinsic value there is also instrumental value' -- but presentations of evidence that research in quantum physics is a resonant part of a flourishing organic whole (e.g. the civilizational whole of ‘modern science and technology’).

I believe that without 'organicism' of the kind described above, one faces a serious dilemma whenever one argues for the intrinsic worth of a pursuit or norm: either we stress the value's independence from all benefits and applications and make the claim of value dogmatic and irrelevant, or else we invite an instrumentalist reduction that ‘explains away’ the appearance of intrinsic value.[12] Indeed, I’d argue that organicism of this kind is even necessary to make sense of caring about rationality (including truth, knowledge, non-contradiction and so on) non-instrumentally at all: the ‘paradox’ of rationality as a substantive value is that the typical usefulness of rationality suggests an error-theory about its apparent intrinsic value, since it’s a strange coincidence that rationality is both so typically useful and so intrinsically good. On an organicist account, however, we expect that major forms of excellence endemic to human life -- thought, understanding, knowledge, reasoned action -- both typically promote each other and typically promote our material flourishing and causal leverage on the world.

IV. The Material Efficacy Condition

Returning now to Tao’s account of good mathematics, let’s take final stock of our interpretation. I argue that mathematical excellence (the property marking ‘the very best examples of good mathematics’) according to Tao satisfies the following conditions, which I believe Tao intends as necessary but not sufficient:

A) Mathematical excellence is a property of mathematical-activity instances.

B) An excellent mathematical-activity instance performed today is excellent partly by virtue of satisfying the mathematical-practice relation ‘builds on’ with regard to past excellent mathematical-activity instances.

C) An excellent mathematical-activity instance performed today is excellent partly by virtue of having a reliable causal tendency to bring about future excellent mathematical-activity instances that satisfy the mathematical-practice relation ‘builds on’ with regard to it.

D) Instantiation of more local, more individually measurable criteria of mathematical-activity goodness such as elegant proofs, clear expositions, and strong theorems is a typical correlate of mathematical excellence.

E) At a given moment in a given mathematical field, the instantiation of mathematical excellence will be predictably better-correlated with the instantiation of certain local criteria of mathematical-activity goodness than with others.

Should we take these traits to collectively describe something more like a decision-procedure called ‘mathematical excellence’ that mathematicians should try to follow, or something more like an event called ‘mathematical excellence’ whose aggregate future-occurrences mathematicians should aspire to maximize? My contention is that Tao’s account is inherently ambiguous, and for a good reason: in ordinary circumstances there is no significant practical difference between doing excellent mathematics and doing instrumentally optimal mathematics with regard to maximizing future aggregate excellent mathematics. This isn’t to say that doing excellent mathematics is the instrumentally optimal action among all possible actions with regard to aggregate future excellent mathematics, but that (in ordinary circumstances) it is the instrumentally optimal choice from among mathematical actions with regard to aggregate future excellent mathematics[13].

I propose that the rough matchup between mathematical excellence and optimal (among mathematical actions) instrumental promotion of aggregate mathematical excellence is neither an empirical miracle nor something determined ‘by definition’ in a trivializing sense. Rather, ‘mathematical excellence’ as used by Tao is a concept that has a referent only if there is a possible property x that satisfies both desiderata A-E and the additional criterion that among mathematical actions, actions that are optimal as instantiations of x are also roughly optimal for maximizing aggregate future instantiation of x-ness.[14]
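One hedged way to write this criterion down, offered as a sketch rather than as Tao’s formulation or as the only possible reading. All symbols are assumptions introduced here for illustration: M is the set of mathematical actions available now, x(a) is the degree to which action a instantiates the property x, F_x(a) is the expected aggregate future instantiation of x given that a is performed, and ε is a small tolerance standing in for ‘roughly.’

```latex
\[
a^{*} \in \arg\max_{a \in M} x(a)
\quad\Longrightarrow\quad
F_x(a^{*}) \;\ge\; \max_{a \in M} F_x(a) \;-\; \varepsilon
\]
```

Note that both maxima range over M, the mathematical actions, and not over all possible actions. This mirrors the restriction stated above: the claim concerns optimality among mathematical actions in ordinary circumstances, not optimality among everything an agent could do.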

This is what I would describe as the material efficacy condition on eudaimonic rationality. In order for a practice to be fit for possessing internal criteria of flourishing, excellence, and eudaimonic rationality, a practice must materially allow for an (under normal circumstances) optimally self-promoting property x that strongly correlates with a plethora of more local, more individually measurable properties whose instantiation is prima facie valuable. Stated more informally, there must exist a two-way causal relationship between a practice’s excellence and the material, psychological, and epistemic effects of its excellence, such that present excellence reliably materially, psychologically, and epistemically promotes future excellence.
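The condition can be illustrated with a deliberately tiny toy model. This is a sketch under loud assumptions, not a measurement procedure: the `Action` record, the function `satisfies_material_efficacy`, the `tolerance` parameter, and every number below are hypothetical and invented for illustration. The toy simply checks whether, among actions internal to a practice, the action that scores highest on present excellence is also within a tolerance of the best available promoter of future excellence.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Action:
    """A hypothetical candidate action (all fields are illustrative)."""
    name: str
    in_practice: bool  # is this a move within the practice at all?
    score: float       # assumed present-excellence score, x(a)
    future: float      # assumed expected aggregate future excellence given a


def satisfies_material_efficacy(actions: Sequence[Action], tolerance: float) -> bool:
    """Return True if the highest-scoring in-practice action is within
    `tolerance` of the best in-practice promoter of future excellence.

    Actions outside the practice are ignored: the condition only compares
    moves that count as participating in the practice.
    """
    internal = [a for a in actions if a.in_practice]
    if not internal:
        raise ValueError("no in-practice actions to compare")
    best_now = max(internal, key=lambda a: a.score)
    best_future = max(a.future for a in internal)
    return best_future - best_now.future <= tolerance


# A practice where present excellence and future promotion line up.
healthy = [
    Action("prove a lemma that opens a new approach", True, 0.9, 10.0),
    Action("polish an isolated result", True, 0.5, 4.0),
    Action("seize resources by force", False, 0.0, 50.0),  # out of scope
]

# A practice where the highest-scoring move is sterile.
degenerate = [
    Action("flashy but sterile result", True, 0.9, 1.0),
    Action("slow groundwork", True, 0.4, 8.0),
]

print(satisfies_material_efficacy(healthy, tolerance=1.0))
print(satisfies_material_efficacy(degenerate, tolerance=1.0))
```

In the first list the condition holds even though an out-of-practice action has the largest `future` value, because that action is never a candidate. In the second list the condition fails, which on the essay’s account would suggest that the scoring criterion is not tracking a genuine excellence of the practice.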

V. Practices and Optimization

I earlier said that if human flourishing involves practicing eudaimonic rationality, there may well be a “type mismatch” between human flourishing and the kind of consequentialist optimization we often associate with the idea of an agentically mature future AI. In fact, I believe that implicitly recognizing but misdiagnosing this type mismatch is at least partially responsible for MIRI-style pessimism about the probability of aligning any artificial agents to human values.

On my view, the secret to relatively successful alignment among humans themselves (when there is successful alignment among humans) lies in the role attempted excellence plays as a filter on human interventions in the future trajectory of a eudaimonic practice. To the degree that humans value a given eudaimonic practice, they are committed to effecting their vision for the practice’s future-trajectory primarily by attempting acts of excellence in the present: we stake our intended effect over the practice’s future-trajectory on the self-propagating excellence of our intervention. While this ‘filter’ doesn’t necessarily stop the worst interventions from being harmful (there are forms of ‘anti-excellence’ that also have self-promotion powers), I contend that this filter is mechanically crucial for the possibility of reliably benign or positive interventions.

What do I mean? Consider the difference between a world where scientists typically try to propagate (what they believe to be) the scientific truth mainly by means of submitting research work to scientific institutions, and a world where scientists typically try to propagate (what they believe to be) the scientific truth by means including propaganda, fraud, threats, bribery, and slander. As Liam Kofi Bright demonstrates in On Fraud, a community of consequentialist scientists devoted to maximizing truth will predictably match the latter model. I believe one lesson to be drawn is that humans’ ability to collaborate in the promotion of science depends on our ability to scientifically collaborate in the promotion of science, rather than throttling the future trajectory of science every which way with our financial and political powers, based on our individual beliefs about the optimal trajectory of science.

A flourishing eudaimonic practice is, above all, a natural-selection-like[15] mechanism whose fitness-function selects among attempted acts of excellence the ones conducive to (and constitutive of) the practice’s flourishing, propagating the excellence these acts instantiate. When people committed to a eudaimonic practice make their attempted interventions into the future trajectory of the practice via acts of attempted excellence, the natural-selection-like mechanism embodied by the practice (rather than any single individual’s theory of optimal future trajectory) is the aligned intelligence determining the practice’s future trajectory.

The explanation here, again, is partly causal and partly constitutive: a practice’s “ultimate” norms of excellence, including the “ultimate” epistemic and alethic norms of a discursive practice, are partly defined by the succession of norms in the course of a practice’s development through best-efforts attempted excellence. Although this may be no deterrent to an already god-like optimizer who can simulate entire civilizational trajectories, an agent short of these capacities can best act on their vision of the optimal future-trajectory of a practice by attempting an excellent contribution to the practice.[16]

The second aspect of our type-mismatch is much more in the weeds: In my analysis so far, I discussed the overall excellence of the trajectory of a eudaimonic practice much like a consequentialist might discuss a quantity of utility. This may be taken to suggest that a ‘sophisticated consequentialist’ or ‘universal consequentialist’ could easily accommodate the implications of the so-called type mismatch by treating them as instrumental, decision-procedure level considerations against naive optimization. In fact, quantities like ‘aggregate democracy’ or ‘overall mathematical excellence’ are (on my view) practice-internal quantities that quickly lose meaning if we try to apply them outside the scope of a ‘promote x x-ingly’ decision-procedure.

What do I mean? Consider, for example, the practice of philosophy. Here are some questions that should arise for a consequentialist planner (including a sophisticated consequentialist planning decision-procedures or habits) who values philosophy practice-trajectories: Does rating (e.g.) Aristotle’s or Dharmakirti’s philosophical achievements as the most excellent achievements in philosophy imply that we should “tile the universe” with independent practice-trajectories designed to reproduce classical Greek or Indian philosophy? If not, is it because we should assign non-linearly greater value to longer trajectories? Or should we discount trajectories that have parallel contents? Or should we analyze the greatness of early achievements in a practice as mostly instrumental greatness but the greatness of later achievements in a practice as mostly intrinsically valuable? These are all, I believe, bad questions that have only arbitrary answers. To an agent trying to promote philosophy by doing excellent philosophical work, the bad questions above are naturally out of scope. The agent uses the concept of ‘aggregate philosophical excellence’ or ‘a philosophy practice-trajectory’s value’ only to reason about the philosophical influence of their work on the trajectory of the philosophy-practice in which it participates. Choosing an excellent action in practice requires (at most) quantitative comparison between different possible paths for a practice-trajectory, not quantitative comparison between possible worlds containing independent practice-trajectories sprinkled throughout time and space.

VI. Prospects and Problems for AI

Is this good news for AI alignment? It’s certainly good news that (if I’m right) eudaimonic practices are something like natural kinds marked by a causal structure that enables a self-developing excellence well-correlated with multiple naive local measures of quality. But does this mean we could develop a stable and safe (e.g.) ‘mathematical excellence through mathematical excellence’ AI? If we create a fully agentic AI mathematician, will it naturally abstain from trying to extend its longevity or get more resources (even for doing mathematics) other than by impressing us with excellent mathematical work? I think that prospects are good, but not simple.

I believe ‘mathematical excellence through mathematical excellence’ really can powerfully scope what mechanisms for shaping the future an AI cares to activate. An AI trained to follow ‘promote mathematics mathematically’ will only care about influencing the future by feeding excellent mathematical work to mathematics’ excellence-propagation mechanism. But it’s harder to say whether the structure of mathematical practice also properly scopes what subactions can be taken as part of an instance of “doing math.” Is a human mathematician working on a would-be excellent proof in pen and paper practicing math when she is picking up a pen or flipping pages? When she is taking the bus to her office? When she’s buying amphetamines? And is an AI mathematician working on a would-be excellent proof practicing math when it opens a Python console? When it searches the web for new papers? When it harvests Earth for compute?

I think these questions are complex, rather than nonsensical. Much like collective practices, individual practices -- for example a person’s or possibly an AI’s mathematical practice -- may possess functional organic unities that allow a meaningful distinction between internal dynamics (including dynamics of development and empowerment) and external interventions (including interventions of enhancement and provision). Still, it’s clear that eudaimonic practices do not exist in isolation, and that no practice can function without either blending with or relying on a “support practice” of some kind.

How, then, do we rationally go about externally-oriented activities like building offices for mathematicians, performing elective reconstructive surgery on an athlete, or conducting couples therapy for romantic partners? And furthermore, how do we rationally go about allocating scarce resources useful for different practices, or judging whether to integrate (e.g.) performance-enhancing drugs into a practice?

This is, I think, the fundamental question for AI alignment from the viewpoint of ‘eudaimonic rationality.’ We want AI to support human eudaimonic practices -- and, if relevant, its own eudaimonic practices or participation in human eudaimonic practices -- in a eudaimonia-appropriate way. But how does the logic of eudaimonic rationality extend from eudaimonic practices to their support activities? How do we ‘eudaimonically-rationally’ do the dirty work that makes eudaimonia possible? My best answer is: carefully, kindly, respectfully, accountably, peacefully, honestly, sensitively.

VII. From Support-Practices to Moral Practice

The theory of AI alignment, I propose, should fundamentally be a theory of the eudaimonic rationality of support practices. One part of this theory should concern the ‘support’ relation itself, and analyze varieties of support practices and their appropriate relation to the self-determination of a eudaimonic practice: Support-practices such as acquiring resources for a practice, maintaining an enabling environment, coaching practitioners, conducting (physical or psychological) therapy for practitioners, devising technological enhancements for a practice, and educating the public about a practice, each have their own ‘role-morality’ vis-a-vis the practice they support. It is this part of the theory of ‘support practices’ that should, if all goes well in the theory’s construction, describe the various practice-external ways to eudaimonically-rationally act on a pro-attitude towards the aggregate excellence of the practice’s future trajectory without treating it like a quantity of utility. (Much like the concept of ‘mathematical action’ scopes the range of action-choices in such a way that decision-theoretic optimization of math’s aggregate excellence becomes mostly well-behaved from an organicist viewpoint, so should the concepts of various types of ‘support action’ scope the range of action-choices in such a way that decision-theoretic optimization of a practice’s aggregate excellence becomes mostly well-behaved from an organicist viewpoint when the choice of actions is scoped.)

What is more difficult is delineating the appropriate relationship of a support-practice to everything outside the practice it supports. What stops a marriage-therapist AI on Mars from appropriately tending to the marriage of a Mars-dwelling couple but harvesting Earth for compute to be a better therapist-AI for that couple? While we can perhaps imagine a person or AI taking up a support-role for ‘humanity’s flourishing as a whole,’ so that there’s no outside to speak of, I am not sure that the concept of practice remains natural at this level of abstraction. We have no real grasp on a direct practice of human flourishing, but rather grasp it as the harmonious and mutually supportive interaction of all eudaimonic practices and support-practices participating in the flourishing. And as there is, indeed, not much outside of the practice of human flourishing, it’s also unclear whether there is room for a support-practice external to the field of human flourishing itself.

It’s here that I want to call on the classic idea of domain-general virtues, the traditional centerpiece of theories of human flourishing. I propose that the cultivation of human flourishing as such -- the cultivation of the harmony of a multiplicity of practices, including their resource-hungry support practices -- is the cultivation of an adverbial practice that modulates each and every practice. What makes our practices ‘play nice’ together are our adverbial practices of going about any practice carefully, kindly, respectfully, accountably, peacefully, honestly, sensitively.[17]

VIII. Virtue decision-theory

Why think of qualities like kindness, respectfulness, or honesty as ‘practices’? The first reason is that devotion to a quality like kindness or honesty displays the same normative structure with regard to means and ends as we find in devotion to a practice: An agent devoted to kindness cares about their own future kindness (and about the future kindness of others), but will seek to secure future kindness only by kind means. The second reason is that qualities like kindness or honesty also approximately have the material structure of a practice: there exist effective very kind strategies for promoting kindness in oneself and others, and when these strategies succeed they further increase affordances for effective very kind strategies for promoting kindness/honesty in oneself and others.

The difference between adverbial practices like kindness or honesty and practices like research mathematics is that adverbial practices don’t have a “proprietary” domain. In a practice like research mathematics, the material structure of the domain does most of the work of directing agents to a eudaimonic form of agency all by itself, as long as the agents restrict themselves to in-domain actions. (Recall that we described mathematically excellent action as, in ordinary circumstances, the best action among mathematical actions for maximizing aggregate mathematical excellence.) With a domain-general, adverbial practice like kindness, the normative structure needs to do somewhat more heavy lifting.

The following is a first pass at characterizing the normative structure of an adverbial practice that values some action-quality x. The corresponding material efficacy condition (or material structure) necessary for the practice to be viable is that under ordinary circumstances this decision-procedure be instrumentally competitive with naive optimization of aggregate x-ness[18]:

Actions (or more generally 'computations') get an x-ness rating. We define the agent’s expected utility conditional on a candidate action a as the sum of two utility functions: a bounded utility function on the x-ness of a and a more tightly bounded utility function on the expected aggregate x-ness of the agent's future actions conditional on a. (Thus the agent will choose an action with mildly suboptimal x-ness if it gives a big boost to expected aggregate future x-ness, but refuse certain large sacrifices of present x-ness for big boosts to expected aggregate future x-ness.)[19]
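As a rough illustration of the two-term structure just described, here is a minimal Python sketch. Everything specific in it is an assumption made for illustration rather than part of the proposal: the `tanh` saturation, the caps of 10 and 3, and the three hypothetical candidate actions with their made-up x-ness numbers. The only feature taken from the text is that the utility on future x-ness is bounded more tightly than the utility on the x-ness of the act itself.

```python
import math

def bounded(value, cap):
    """Increasing utility that saturates at `cap` (tanh is an assumed choice)."""
    return cap * math.tanh(value / cap)

def virtue_utility(present_x, future_x, present_cap=10.0, future_cap=3.0):
    # future_cap < present_cap: the future term is the more tightly bounded one.
    return bounded(present_x, present_cap) + bounded(future_x, future_cap)

def naive_utility(present_x, future_x):
    # Unbounded sum, standing in for naive optimization of aggregate x-ness.
    return present_x + future_x

# Hypothetical candidate actions:
# (x-ness of the act itself, expected aggregate future x-ness given the act).
ACTIONS = {
    "kindest_now": (5.0, 1.0),
    "slightly_less_kind_but_formative": (4.5, 6.0),
    "unkind_for_future_kindness": (0.5, 100.0),
}

def choose(utility):
    """Return the name of the candidate action with the highest utility."""
    return max(ACTIONS, key=lambda name: utility(*ACTIONS[name]))

virtue_choice = choose(virtue_utility)
naive_choice = choose(naive_utility)
```

With these illustrative numbers, the bounded agent accepts a mild loss of present x-ness for a sizeable future gain, but ranks the large sacrifice below even the plainly kindest act, whereas the unbounded sum picks the large sacrifice.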

A commitment to an adverbial practice that values x is a commitment to promoting x-ness (in oneself and others) x-ingly. The agent strikes a balance between promoting x-ness and acting x-ingly that heavily prioritizes acting x-ingly when the two are in conflict, but if x meets the material efficacy condition then the loss this balance imposes on future x-ness will be small under normal circumstances, and -- from our point of view -- desirable in abnormal circumstances. This is because just like the practices of research mathematics, philosophy, or art, an adverbial practice is a crucial ‘epistemic filter’ on actions aiming to shape its future, and the (e.g.) future kindness a paperclipper-like future-kindness-optimizer optimizes for is probably not the kindness we want. What we know about kindness with relative certainty is that we’d like people and AIs here and now to act kindly, and to develop, propagate, and empower the habit and art of kindness in a way that is both kind and clever.

To keep our conceptual system nicely organized, we might want to distinguish merely (e.g.) very kind action from an action that is both very kind and highly promotive of future kindness in oneself and others, and call the latter sort of action excellently kind. What I call the material efficacy condition for adverbial practices states not that the kindest action best-promotes aggregate kindness, but that there are almost always action-options that are excellently kind: very kind actions that strongly promote aggregate kindness in oneself and others.

IX. Virtue decision-theory is 'Natural' for Humans and AIs

I’ve said that the robustness or ‘naturalness’ of a practice’s normative structure (‘promote x x-ingly’) depends on the practice’s material structure: the capacity of high x-ness actions to causally promote aggregate x-ness. I also said that in key real-world practices, commitment to x-ing might optimize aggregate x-ness even better than direct optimization would. These two claims are best understood together. On my view, the normative structure ‘promote x x-ingly’ appears prominently in human life because (given the right material structure) ‘promote x x-ingly’ is much more stable than ‘promote x.’

How so? Both humans and any sufficiently dynamic AI agent operate in a world that subjects their agency, values, and dispositions to constant mutation pressures from RL-like and Darwinian-like processes. Eudaimonic deliberation is an RL-dynamics-native, Darwinian-dynamics-native operation: its direct object is a form of life that reinforces, enables, and propagates that same form of life. When an x-ing successfully promotes (in expectation) aggregate x-ness, the fact of its success itself promotes x-ing because it reverberates via ubiquitous RL-like and Darwinian-like processes that reinforce (a generalization of) successful action. The material structure of a practice is the backbone that makes reliable success and meaningful generalization possible -- the right ecology of neural-network generalization dynamics, reinforcement-learning feedback loop dynamics, and neural and environmental selection dynamics.

An EA-style optimizer trying to minimize risk from optimization-goal-mutation, by contrast, is fighting an uphill battle to foresee and contain the RL-like and Darwinian-like side effects of its optimization actions.[20] One critical mutation-pressure in particular is the risk that an optimizer agent will cultivate, reinforce, and materially empower subroutines (what high-church alignment theory calls ‘mesaoptimizers’) that initially serve the optimization goal but gradually distort or overtake it. For example, if a pro-democracy government instates a secret police to detect and extrajudicially kill anti-democracy agitators, and the government increases the secret police’s funding whenever the police convincingly reports discovering an agitator, the secret police might grow into a distorting influence on the government’s democracy-promotion effort. In light of risks like this, it’s not surprising that oppressive democracy-promotion is generally considered an unserious or dishonest idea: even if an agent were to abstract some concept of ‘aggregate democracy’ from democratic practice into a consequentialist value[21], it’s plausible that the agent should then immediately revert to a commitment to democratic practice (‘promote democracy democratically’) on sophisticated-consequentialist grounds.

We should perhaps imagine eudaimonic practices as fixed points at the end of a chain of mesaoptimisers taking over outer optimisers and then being taken over by their own mesaoptimisers in turn. What the practice contributes that puts a stop to this process is a concept of x-ness that’s applicable to every agentic subroutine of x-ing across all nesting levels, so that x-ness is reinforced (both directly and through generalization) across all subroutines and levels.

X. Virtue-decision-theory is Safe in Humans and AIs

Let’s talk about AI alignment in the more narrow, concrete sense. It’s widely accepted that if early strategically aware AIs possess values like corrigibility, transparency, and perhaps niceness, further alignment efforts are much more likely to succeed. But values like corrigibility or transparency or niceness don’t easily fit into an intuitively consequentialist form like ‘maximize lifetime corrigible behavior’ or ‘maximize lifetime transparency.’ In fact, an AI valuing its own corrigibility or transparency or niceness in an intuitively consequentialist way can lead to extreme power-seeking: the AI should seek to violently remake the world to (for example) protect itself from the risk that humans will modify the AI to be less corrigible or transparent or nice.[22] On the other hand, constraints or taboos or purely negative values (a.k.a. ‘deontological restrictions’) are widely suspected to be weak, in the sense that an advanced AI will come to work around them or uproot them: ‘never lie’ or ‘never kill’ or ‘never refuse a direct order from the president’ are poor substitutes for active transparency, niceness, and corrigibility.

Conceiving of corrigibility or transparency or niceness as adverbial practices is a promising way to capture the normal, sensible way we want an agent to value corrigibility or transparency or niceness, which intuitively-consequentialist values and deontology both fail to capture. We want an agent that (e.g.) actively tries to be transparent, and to cultivate its own future transparency and its own future valuing of transparency, but that will not (e.g.) engage in deception and plotting when it expects a high future-transparency payoff.

If this is right, then eudaimonic rationality is not a matter of congratulating ourselves for our richly human ways of reasoning, valuing, and acting but a key to basic sanity. What makes human life beautiful is also what makes human life possible at all.

Appendix: Excellence and Deep Reinforcement Learning

Within the context of broadly RL-based training of deep neural networks, it may be possible to give some more concrete meaning to what I called the material efficacy condition for a property qualifying as an adverbial practice. We can now understand the material efficacy condition on x partly in terms of the conditions necessary for ‘promote x-ness x-ingly’ to be a viable target for RL. Consider an RL training regimen where x-ness is rewarded but aggregate x-ness reward is bounded with some asymptotic function on the sum. For x to meet the RL version of the material efficacy condition, it must be possible to design an initial reward model (most likely LLM-based) that assigns actions an x-ness rating such that:

  1. The x-ness rating is enough of a natural abstraction that reinforcement of high x-ness actions generalizes.
  2. If high x-ness action both depends on having capital of some kind and is suboptimal from the viewpoint of general power-seeking, there must typically be some high x-ness actions that approximately make up for the (future x-ness wise) opportunity cost by creating capital useful for x-ing.[23]
    (Illustration: If you dream of achieving great theater acting, one way to do it is to become President of the United States and then pursue a theater career after your presidency, immediately getting interest from great directors who'll help you achieve great acting. Alternatively, you could start in a regional theater after high school, demonstrate talent by acting well, get invited to work with better and better theater directors who develop your skills and reputation -- skills and reputation that are not as generally useful as those you get by being POTUS -- and achieve great acting through that feedback loop.)
  3. For any capability y necessary to reward in training to produce effective AI, there must be an unlimited local-optimization path of Pareto improvement for x-ness and y together.
    (Illustration: Maybe the most effective kind of engineering manager is ruthless; a nice engineering manager can still grow in effectiveness without becoming less nice, because there are many effective nice-engineering-management techniques to master.)
  4. Successful initial training in ‘promoting x x-ingly’ allows the model to be used as a basis for a new reward model which human experts judge as better-capturing our concept of x-ness. The process should be iterable.
    (If the model is LLM-based, improved performance may automatically lead to improved understanding of the x-ness concept. More generally, data from training runs as well as the model’s value-function could be used to refine an x-ness rating that more strongly implements conditions 1-3.)
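One way to picture the training regimen mentioned at the start of this appendix, where x-ness is rewarded but the aggregate reward is bounded by an asymptotic function of the sum, is the following minimal Python sketch. The exponential saturation curve, the bound of 10, and the function names are assumptions chosen for illustration; the ratings are assumed to be non-negative outputs of some reward model. Each step is paid the increase in the saturated running total, so the per-step rewards telescope to the saturated sum.

```python
import math

def saturating(total, bound=10.0):
    """Asymptotic cap on summed x-ness: rises with `total`, approaches `bound`."""
    return bound * (1.0 - math.exp(-total / bound))

def bounded_rewards(x_ratings, bound=10.0):
    """Per-step rewards whose sum equals saturating(sum(x_ratings)).

    Assumes non-negative x-ness ratings (hypothetical reward-model outputs).
    """
    rewards = []
    running = 0.0
    for x in x_ratings:
        new_running = running + x
        # Pay only the marginal increase in the saturated aggregate.
        rewards.append(saturating(new_running, bound) - saturating(running, bound))
        running = new_running
    return rewards

steady = bounded_rewards([2.0] * 5)
```

Under this shaping, a constant stream of equally rated actions earns diminishing per-step reward, and no quantity of rated action can push the episode total past the bound.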

  1. My use of ‘practice’ is inspired by Alasdair MacIntyre’s use of the term. There’s a history of related uses going back to Marx and to Aristotle. ↩︎

  2. Recall that because of the possibility of ‘notational consequentializing’ (rewriting any policy as a utility function), dividing agents or even theories or decision-procedures into ‘consequentialist’ and ‘non-consequentialist’ isn’t a strict formal distinction. Throughout this essay, by ‘consequentialist’ I will mean roughly an agent for whom, in ideal practical reasoning, means and outcomes are effectively separately evaluable and the value of outcomes is typically decisive. Semi-formally, by ‘consequentialist’ I mean an agent s such that when s considers whether to perform action c, s’s ideal reasoning is an expected-utility calculation using a utility-function whose utility-assignment to a complete world-trajectory w has low sensitivity to whether s performs c in w (holding everything else about w constant). ↩︎

  3. In speaking about different ‘forms of rationality’ I don’t mean to make a fundamental metaethical distinction: consequentialism, deontology, and eudaimonism are first-order ethical views that each induce a different characteristic profile of deliberation, action, and value-reflection. I'm bundling the elements of such a profile under the label “form of rationality” in a modest sense: roughly, a way of structuring one’s practical reasoning. ↩︎

  4. This way of thinking is broadly associated with analytic Neo-Aristotelians such as Alasdair MacIntyre and Michael Thompson. ↩︎

  5. Instances of eudaimonic rational deliberations may still be described as VNM-rational expected utility maximization, but the utility function that rationalizes them is unnatural-looking and makes use of concepts that themselves involve complex relations between actions and outcomes. ↩︎

  6. Technically speaking the first horn of the dilemma can be further bifurcated into ‘rational agency contributes to human flourishing by choosing actions that are intrinsically valuable however chosen’ and ‘rational agency contributes to human flourishing by selecting actions such that these actions combined with their selection by rational agency are intrinsically valuable.’ ↩︎

  7. It’s interesting to note that practices like math, art, craft, friendship, athletics, romance, play, and technology are not only consensus elements of human flourishing but also in themselves entities that can ‘flourish’: a mathematical field (or a person’s mathematical life) can wither or flourish, a friendship can wither or flourish, technological development can wither or flourish, and so on. ↩︎

  8. See Tao: ‘[The] determination of what would constitute good mathematics for a field can and should depend highly on the state of the field itself. It should also be a determination which is continually updated and debated, both within a field and by external observers to that field; as mentioned earlier, it is quite possible for a consensus on how a field should progress to lead to imbalances within that field, if they are not detected and corrected in time.’ ↩︎

  9. Within this essay I use ‘excellence’ as the most general, pre-theoretical term for accordance with a holistic evaluative standard. The standard can be instrumental or terminal, apply to actions or persons or states or objects, be moral or aesthetic or epistemic and so on, and the standard itself (and so the excellence it defines) may later be judged as rational or irrational, substantive or trivial, significant or insignificant. ↩︎

  10. The above observation does not describe a formal feature of ‘consequentialism’ per any standard technical definition. However I believe it accurately describes a strong observable tendency in both the academic and ‘rationalist’ literature when conducting normative reflective equilibration within a consequentialist paradigm. ↩︎

  11. I put ‘local’ and ‘holistic’ in scare-quotes in the above, since the relation of parts and wholes is likely iterable: Arithmetic geometry is part of algebraic geometry, which is part of mathematics, which is part of the arts and sciences, which is part of human culture, which is part of human flourishing, which may itself be part of other wholes to which the idea of excellence is applicable. Similarly, a practice capable of excellence may be part of multiple different wholes. ↩︎

  12. It may be fruitful to explore the potential of PageRank-like algorithms as theoretical models of how eudaimonic reflective equilibration works, and especially of how initial ideas of eudaimonic excellences are ‘bootstrapped’ from simpler and more local prima facie goods (and prima facie ills) in the first place. Scott Aaronson and Simon Dedeo have both discussed conceptual applications of PageRank-like algorithms in philosophy in various informal contexts. That said, I believe it’s unlikely that PageRank over reliable instrumental-contribution relationships among prima facie goods and ills is the full story about the emergence of intrinsically valued holistic excellences, since while organicist relations between the excellence of wholes and of parts do involve instrumental-contribution relationships they plausibly also involve more rarified, ‘hermeneutic’ relations of (e.g.) mutually dependent intelligibility. ↩︎

  13. Why ‘rough matchup’ and ‘ordinary circumstances’? Because there are analytic-philosophy-style counterexamples to simple attempts to turn this ceteris paribus optimization relationship more strict. For example, the instrumentally best (for aggregate mathematical excellence) mathematical work and the most mathematically excellent work will diverge when a billionaire promises to donate 100 billion dollars to research-mathematics if Jacob Lurie does some long division by hand. ↩︎

  14. We should in principle also be concerned with the possibility of failures of uniqueness, as well as failures of existence, but recall that the above collection of properties is already not intended as a full or sufficient definition. ↩︎

  15. I mean ‘natural-selection-like’ only in the broadest sense. A central difference is that the selection-process enacted by a practice should have a complex, rich, continuously updated relationship to the best-informed practice-ideals of individuals. The concept of ‘dialectics’ as used in German philosophy may be of relevance if we were to try to describe this relationship in more detail. ↩︎

  16. It should in principle be possible to offer a more exacting analysis here, distinguishing (at least initially) between the development of the value-judgments made within a practice and the development of the evaluable activities performed within the practice. On my view the fact that intra-practice excellence is best fit to properly shape the development of the practice’s value-judgments is principally ‘true by definition,’ and the fact that intra-practice excellence is best fit to properly shape the development of the evaluable activities performed within a practice is principally ‘true by causation.’ ↩︎

  17. The matter of the unity of the adverbial virtues, and of whether it is more like a harmony of different practices or more like the common-factor excellence that underlies locally-measurable mathematical goods in Tao’s account, is for another day. ↩︎

  18. By ‘instrumentally competitive under normal circumstances’ I mean, roughly: in scopes where aggregate x-ness quantities are well-defined, switching from commitment to a eudaimonic decision-procedure for x to a naive-optimization procedure for x isn’t necessarily a long-term winning strategy with regard to aggregate x-ness maximization. ↩︎

  19. A richer account might include a third-tier utility function that takes the aggregate x-ness of the future actions of all other agents. In this richer account a practice involves three tiers of consideration: the action's x-ness, the aggregate x-ness of your future actions, and the aggregate x-ness of the future actions of all other agents. ↩︎

  20. I am referring, in part, to what high-church alignment theory calls the ‘inner alignment problem’ and ‘successor problem.’ ↩︎

  21. Per my discussion in part V, an abstracted ‘aggregate democracy’ quantity will only be determinate in some applications. The claim about relative effectiveness of practice-commitment and direct optimization refers only to contexts where the quantity is determinate. ↩︎

  22. For a more interesting example, consider an AI that finds itself making trade-offs between different alignment-enabling behavioral values when dealing with humans, and decides to kill all humans to replace them with beings with whom the AI can interact without trade-offs between these values. ↩︎

  23. The difference between criteria '1.' and '2.' is clearest if we think about x-ness as rating state-action pairs. Criterion '1.' is the requirement that if (a,s), (a',s'), and (a'',s'') are historical high x-ness pairs and (a''',s''') is an unseen high x-ness pair then reinforcing the execution of a in s, a' in s', and a'' in s'' will have the generalization effect of increasing the conditional probability P(a'''|s'''). Criterion '2.' is roughly the requirement that choosing a higher x-ness action in a given state increase expected aggregate future x-ness holding policy constant, by increasing the probability of states with higher expected state-action x-ness value given the current policy. ↩︎
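For readers who prefer symbols, the two criteria in note 23 could be rendered roughly as follows. The notation is an assumption introduced here and is not used elsewhere in the essay: r_x is the x-ness rating of a state-action pair, π_θ is the policy before reinforcement and π_θ' the policy after it, P is the environment's transition distribution, T is a finite horizon, and ã is an alternative action available in s. The inequality in Criterion 2 is meant to hold typically, not in every case.

```latex
% Assumed notation (not in the original): r_x, \pi_\theta, \pi_{\theta'}, P, T, \tilde a.
\[
\textbf{Criterion 1:}\quad
\pi_{\theta'}(a''' \mid s''') \;>\; \pi_{\theta}(a''' \mid s'''),
\qquad \theta' = \text{parameters after reinforcing } a \text{ in } s,\; a' \text{ in } s',\; a'' \text{ in } s''.
\]
\[
V_x^{\pi}(s) \;=\; \mathbb{E}_{\pi}\!\left[\, \sum_{t=0}^{T} r_x(s_t, a_t) \;\middle|\; s_0 = s \right]
\]
\[
\textbf{Criterion 2:}\quad
r_x(s, a) > r_x(s, \tilde a)
\;\Longrightarrow\;
\mathbb{E}_{s^{+} \sim P(\cdot \mid s, a)}\!\left[ V_x^{\pi}(s^{+}) \right]
\;\ge\;
\mathbb{E}_{s^{+} \sim P(\cdot \mid s, \tilde a)}\!\left[ V_x^{\pi}(s^{+}) \right]
\]
```

On this reading, Criterion 1 is a claim about how the policy changes under training, while Criterion 2 holds the policy π fixed and is a claim about the environment's dynamics.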