<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Working Through AI]]></title><description><![CDATA[Working through a new approach to tackling AI risk.]]></description><link>https://www.workingthroughai.com</link><image><url>https://substackcdn.com/image/fetch/$s_!F78q!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9efb8c4a-ae4e-41ac-83fe-b42d3af8eedf_835x835.png</url><title>Working Through AI</title><link>https://www.workingthroughai.com</link></image><generator>Substack</generator><lastBuildDate>Fri, 03 Apr 2026 19:47:31 GMT</lastBuildDate><atom:link href="https://www.workingthroughai.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Richard Juggins]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[juggins@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[juggins@substack.com]]></itunes:email><itunes:name><![CDATA[Richard Juggins]]></itunes:name></itunes:owner><itunes:author><![CDATA[Richard Juggins]]></itunes:author><googleplay:owner><![CDATA[juggins@substack.com]]></googleplay:owner><googleplay:email><![CDATA[juggins@substack.com]]></googleplay:email><googleplay:author><![CDATA[Richard Juggins]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[AI alignment is a step into the dark]]></title><description><![CDATA[On the need for a more iterative world]]></description><link>https://www.workingthroughai.com/p/ai-alignment-is-a-step-into-the-dark</link><guid isPermaLink="false">https://www.workingthroughai.com/p/ai-alignment-is-a-step-into-the-dark</guid><dc:creator><![CDATA[Richard Juggins]]></dc:creator><pubDate>Thu, 12 Mar 2026 22:13:10 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/1169d4d9-1c8b-49e4-95d4-91b2a15e1ac3_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>One reason <a href="https://blog.bluedot.org/p/what-is-ai-alignment">aligning</a> superintelligent AI will be hard is because we don&#8217;t get to test solutions in advance. If we build systems powerful enough to take over the world, and we fail to stop them from wanting to, they aren&#8217;t going to give us another chance<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>. It&#8217;s common to assume that the alignment problem must be one-shotted: we must solve it on the first try without any empirical feedback.</p><p>Normal science and engineering problems do not work like this. They are a delicate balance of theory and experiment, of unexpected surprises and clever refinements. You try something based on your best understanding, watch what happens, then update and try again. You never skip straight to a perfect solution.</p><p>Within AI safety, there&#8217;s been a lot of disagreement about the best way to square this circle. When the field was young and AI systems weak, researchers tended to try and <a href="https://intelligence.org/all-publications/">understand the problem theoretically</a>, looking for generalisable insights that could render it more predictable. 
But this approach made slow progress and has fallen out of fashion. By contrast, driven on by the <a href="https://metr.org/time-horizons/">relentless pace</a> of AI development, a <a href="https://www.lesswrong.com/posts/Q9ewXs8pQSAX5vL7H/ai-in-2025-gestalt#What_is_the_plan__">default plan</a> seems to be coalescing around the kind of <a href="https://www.anthropic.com/news/core-views-on-ai-safety#our-approach-empiricism-in-ai-safety">empirical safety work</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> practised by the big labs. Generalising massively, it looks roughly like this:</p><blockquote><p>We can try to get around the missing feedback by iterating experimentally on the strongest AI systems available. As the closest we have to superintelligent AI, these will be our best source of information about aligning it, even if some differences  remain. If things go well, the techniques we learn will allow us to build a <a href="https://aligned.substack.com/p/alignment-is-not-solved-but-increasingly-looks-solvable">well-aligned human-level AI researcher</a>, to which we can hand over responsibility. It can then align an even more intelligent successor, starting a chain of stronger and stronger systems that terminates in the full problem being solved.</p></blockquote><p>This plan has <a href="https://forum.effectivealtruism.org/posts/vrWAaMrSbMX9BuFhL/alignment-bootstrapping-is-dangerous">attracted</a> <a href="https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/agi-ruin-a-list-of-lethalities">criticism</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>, particularly of the assumption that what we learn about aligning weaker-than-human systems will generalise once they surpass us. In this post, I&#8217;m going to argue that this maps onto a structural feature of all technical alignment plans<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>. We are fitting solutions, whether empirical or theoretical or both, to a world missing critical feedback<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a> and hoping they will generalise. Every plan involves stepping into the dark.</p><p>If we want to make superintelligent AI safe, we need to dramatically reduce the size of these steps. We must learn how to iterate on the full problem.</p><h3>How does science work?</h3><p>Before we talk more about AI safety, let&#8217;s take a step back and consider how science works in general. Wikipedia <a href="https://en.wikipedia.org/wiki/Scientific_method">describes</a> the scientific method as:</p><blockquote><p>[An] empirical method for acquiring knowledge through careful observation, rigorous skepticism, hypothesis testing, and experimental validation.</p></blockquote><p>Fundamental to this is an interplay between theory and experiment. Our theories are our world models. We draw hypotheses from them, and they serve as our best guesses of how the universe actually is. When we think about gravity or atomic physics we are talking about theoretical concepts we can operationalise in experiments. 
There is a coupling between saying that gravity falls off as an inverse square and the observations we record when we look through a telescope, and this coupling allows us to make predictions about future observations. Theories live or die on the strength of their predictions<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a>.</p><p>Theories are always provisional. Famously, astronomers trying to use the Newtonian inverse square law to make sense of the orbits of the planets found it was <a href="https://en.wikipedia.org/wiki/Tests_of_general_relativity#Perihelion_precession_of_Mercury">not quite right for Mercury</a>, whose orbit precessed in an unexplained fashion. Many attempts were made to solve this within Newtonian physics, including proposing the existence of an unobserved planet called <a href="https://en.wikipedia.org/wiki/Vulcan_(hypothetical_planet)">Vulcan</a>. These were all wrong. For the real solution, we needed a new theory, one which completely up-ended our conception of the cosmos: <a href="https://en.wikipedia.org/wiki/General_relativity">Einstein&#8217;s general relativity</a>. Space and time, rather than being fixed, are in fact curved, and the strong curvature near the Sun changes Mercury&#8217;s orbit, explaining the confusing observations.</p><p>But even general relativity is incomplete. It describes macroscopic phenomena well but fails to mesh with our best explanation of the microscopic: quantum field theory. Hence, physicists have spent close to a hundred years looking for a <a href="https://en.wikipedia.org/wiki/Theory_of_everything">&#8216;Theory of Everything&#8217;</a> &#8212; the perfect theory that will make accurate predictions in all regimes. It is highly debatable whether this is possible. It would be astounding, in fact, if it were &#8212; if the level of intelligence required for humans to dominate the savannah were also enough to decode the deepest mysteries of the cosmos<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a>.</p><p>In any case, this endeavour has stagnated as <a href="https://www.quantamagazine.org/a-fight-for-the-soul-of-science-20151216/">many candidate theories are untestable</a>. So physics, as is normal in science, instead proceeds more modestly: by iterating, bit by bit, moving forwards when <a href="https://www.smithsonianmag.com/science-nature/how-the-higgs-boson-was-found-4723520/">theory and experiment combine</a>.</p><h3>AI safety is not science</h3><p>For current systems, where we can experiment and extract feedback, AI safety functions as a normal science. But for the key question in the field &#8212; the final boss &#8212; it does not. We cannot measure whether we are making progress towards aligning superintelligent AI, nor can we properly adjudicate claims about this. We can speculate, we can form hypotheses, but we cannot close the loop. What counts as good work is decided by the opinions of community members rather than hard data. Granted, these are scientifically minded people, often with good track records in other fields, extrapolating from their scientifically informed world models. They have valuable perspectives on the problem.
But without the ability to ground them in reality, to test predictions and falsify theories, it isn&#8217;t science<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a>.</p><p>This means that when you work on technical AI safety you are not just trying to settle an object-level claim, like how to align a superintelligence. Without access to the full scientific method, you also have to solve a meta-problem &#8212; how do you measure progress at all? If all you can do is form and refine hypotheses based on proxies of the real problem, how do you know if you&#8217;re even helping?<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a></p><h3>The common structure of technical alignment plans</h3><p>Let&#8217;s look again at the default technical alignment plan, a version of which is being pursued by the big labs like <a href="https://www.anthropic.com/responsible-scaling-policy">Anthropic</a>, <a href="https://openai.com/index/updating-our-preparedness-framework/">OpenAI</a>, and <a href="https://deepmind.google/blog/taking-a-responsible-path-to-agi/">Google DeepMind</a>. They don&#8217;t tend to be super explicit about this in their communications, but roughly speaking the underlying logic seems to be:</p><ol><li><p>We cannot directly experiment on superintelligent AI.</p></li><li><p>However, as it seems possible superintelligent AI is going to be built soon (potentially by us), it is important to gain as much information as we can about how to align it before this happens.</p></li><li><p>The actual form of real systems and the surprising behaviour they exhibit are critical for knowing how to make them safe. This means the most efficient way to learn how to align a superintelligence is to conduct experiments on the strongest AIs that we can, even if these experiments won&#8217;t tell the whole story.</p></li><li><p>As our AIs scale up to human intelligence, we can try handing off alignment research to them<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-10" href="#footnote-10" target="_self">10</a>. If we have done a good job, they will faithfully continue the project at a level beyond our own, ultimately solving it completely<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-11" href="#footnote-11" target="_self">11</a>.</p></li></ol><p>As <a href="https://forum.effectivealtruism.org/posts/vrWAaMrSbMX9BuFhL/alignment-bootstrapping-is-dangerous">others</a> <a href="https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/agi-ruin-a-list-of-lethalities">have argued</a>, it is unlikely that experiments on weaker systems will provide the right feedback to teach you how to align superhuman ones, as these will have qualitatively different capabilities. However, we shouldn&#8217;t see this weakness as specific to the default plan. If we look closely, we can see that its logic has a very general structure.
Let&#8217;s abstract it and make it more generic:</p><ol><li><p>We cannot directly experiment on superintelligent AI.</p></li><li><p>However, as it seems possible superintelligent AI is going to be built soon, it is important to gain as much information as we can about how to align it before this happens.</p></li><li><p>All the observations we can use to inform our solutions are from a world lacking superintelligent AI, so they are missing critical details<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-12" href="#footnote-12" target="_self">12</a>. Within this constraint, we must do whatever object-level work we believe will reduce our uncertainty the most.</p></li><li><p>Once AIs pass human levels, we will move out-of-distribution, and our results may no longer hold. Hopefully, we will not move so far that the situation is unrecoverable, and whatever it is our superintelligent AIs get up to will be compatible with the full problem being solved long-term.</p></li></ol><p>This structure applies to all technical alignment plans. Whether you are trying to <a href="https://intelligence.org/files/TechnicalAgenda.pdf">build theoretical models of agents</a>, <a href="https://transformer-circuits.pub/2023/interpretability-dreams#safety">use interpretability tools to decode AI systems</a>, or <a href="https://www.lesswrong.com/posts/oLbpfPkdtcknABvvw/the-corrigibility-basin-of-attraction-is-a-misleading-gloss">create a basin of corrigibility</a>, you can only work on a proxy version of the problem. While you can do better or worse, you can still only reduce your uncertainty, never eliminate it. When the time comes and superhuman systems are built, we will have to grit our teeth and hope it was enough.</p><h3>Building a more iterative world</h3><p>In his post <em><a href="https://www.lesswrong.com/s/TLSzP4xP42PPBctgw/p/xFotXGEotcKouifky">Worlds Where Iterative Design Fails</a></em>, John Wentworth says:</p><blockquote><p>In worlds where AI alignment can be handled by iterative design, we probably survive. So long as we can see the problems and iterate on them, we can probably fix them, or at least avoid making them worse. By the same reasoning: worlds where AI kills us are generally worlds where, for one reason or another, the iterative design loop fails.</p></blockquote><p>I agree with this. When you try to solve an out-of-distribution problem, you better hope you can iterate or you are probably going to fail. Where I disagree is with the way he presents these worlds as if they are independent facts of the environment, like we are drawing possibilities out of a bag. We are causal actors. If we do our job well, we can make the world more iterative and less of a one-shot. Our plan should be to steer the situation in the direction of iterative design working. Steer it in the direction of meaningful feedback loops, of testing on superhuman models in a bounded and survivable way. Reduce the size of the steps in the dark. Make the problem more like <a href="https://www.astralcodexten.com/p/your-book-review-safe-enough">high-stakes engineering</a> and less like defending against a sudden alien invasion.</p><p>Let&#8217;s imagine we are taking one of these steps. We are an AI lab about to train a new model. It will have a <a href="https://helentoner.substack.com/p/taking-jaggedness-seriously">jagged</a> capabilities profile, but we think it&#8217;s going to be superhuman in some key power-enhancing way. 
We can&#8217;t bank on a perfect theoretical understanding of the situation, as theories are always provisional. And we can&#8217;t just extrapolate from our experiments on weaker systems, as we&#8217;ll miss important changes. We need to somehow iterate on this model release.</p><p>There are two kinds of interventions which help us: those that increase the chance of generalisation and those that reduce the distance we need to generalise over. The following are some suggestions which, <em>while not remotely exhaustive or original</em>, hopefully illustrate the point.</p><p>It&#8217;s worth noting that, for most of this to happen, the political and economic competition around AI would need to ease significantly. It is the whole sociotechnical system that needs to be iterable, as technical solutions alone are not enough<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-13" href="#footnote-13" target="_self">13</a>. Finding a way to achieve this would be the highest impact intervention of all (and is unfortunately beyond the scope of this article).</p><p>Solutions that increase the chance of generalisation:</p><ul><li><p>Build theoretical models that predict what will happen when our AI gains capability X, and ensure these work well in experimental tests of past models. These will not be the kind of compact theories you find in physics. AI is not a toy model &#8212; it is <a href="https://www.cadmusjournal.org/node/362">complex, not complicated</a> &#8212; so our theorising will be less precise. Nevertheless, we need some kind of formalism to codify our understanding and keep us honest. We must build up a track record of good predictions.</p></li><li><p>Build a system of comprehensive, continuous evaluations that can be used to understand a model&#8217;s impact. Every metric is a proxy, so every metric misses something. But good science is built on good measurement, and without it you are lost. Measure everything. Monitoring should be built deep into the structure of society.</p></li></ul><p>Solutions that reduce the distance to generalise over:</p><ul><li><p>Only test models slightly more powerful than the previous ones. The bigger the jump, the further out-of-distribution we go. Ideally, we should not train a new model until we can show (a) that our current model is adequately aligned and (b) that we understand why. If any plan could plausibly result in a <a href="https://www.planned-obsolescence.org/p/takeoff-speeds-rule-everything-around">fast takeoff</a>, find a way to ban it.</p></li><li><p>Build strong defences to limit any damage from (probably inevitable) failures, including using trusted but weaker AIs. Think both a super-scaled version of <a href="https://arxiv.org/abs/2312.06942">Control</a> to directly defend against misaligned models and hardened societal resilience like pandemic infrastructure, improved cybersecurity, and redundancy in critical systems to cope with the fallout.</p></li></ul><p>And a solution that facilitates both:</p><ul><li><p>Do all of this as slowly as possible, waiting as long as we need between iterations to get our house in order. As I mentioned before, this is probably the most important blocker. The competition, the fear of being overpowered by others, and the general lack of consensus around AI risk make this formidably difficult.</p></li></ul><p>To be clear, even if all these interventions were to be implemented successfully, there would still be great uncertainty.
We won&#8217;t catch everything, and big mistakes can be fatal. This is an unavoidable feature of the problem. All plans live on a spectrum of recklessness, with the only truly safe one being to not build superintelligent AI at all.</p><div><hr></div><p><em>Thank you to Seth Herd, John Colbourne, and Nick Botti for useful discussions and comments on a draft.</em></p><div><hr></div><p>If you have any feedback, please leave a comment. Or, if you wish to give it anonymously, fill out my <a href="https://docs.google.com/forms/d/e/1FAIpQLSdyisSOndK1H1JT0NAbnA35LJgoJrl9f_NiJi1FEljCr7-kJg/viewform">feedback form</a>. Thanks!</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>While many stories of AI takeover centre around a single god-like entity suddenly going rogue, I think it is more likely to look like a mass profusion of highly capable systems (gradually, but surprisingly quickly) disempowering humans as they are given control of critical infrastructure and information flows, with an indeterminate point of no return.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Note that this is often referred to as &#8216;prosaic&#8217; alignment research, and sometimes &#8216;iterative&#8217; alignment (although this should not be confused with the kind of iteration on superhuman systems I talk about later in the piece).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>While continuing to argue for the plan, <a href="https://www.alignmentforum.org/posts/epjuxGnSPof3GnMSL">this</a> by Anthropic&#8217;s Evan Hubinger is an interesting take from the inside on the scale of the challenge.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>Since <a href="https://helentoner.substack.com/p/the-core-challenge-of-ai-alignment">alignment is a slippery term</a>, I am going to refer explicitly to technical alignment, by which I mean technical approaches for steering an AI system&#8217;s behaviour. 
By contrast, AI safety or a more general conception of alignment could include governance and policy interventions, up to and including bans.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>That is, feedback on making real superintelligent AI safe, as opposed to weaker systems or a hypothetical superintelligence.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>A scientific theory is only good to the extent that it can predict new measurements, and history is <a href="https://en.wikipedia.org/wiki/Phlogiston_theory">littered</a> <a href="https://en.wikipedia.org/wiki/Geocentrism">with</a> <a href="https://en.wikipedia.org/wiki/Luminiferous_aether">attempts</a> that turned out to be dead ends.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>My parents&#8217; dogs are pretty great at figuring out there is a schedule on which they get fed, which seems like some kind of &#8216;law&#8217; to them. But they have no way of ever understanding why it exists and why it sometimes doesn&#8217;t happen. The world is partially comprehensible to them, but there is a hard limit. We are the same, just at a higher limit, and we don&#8217;t know where it is.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>There has been a trend to describe the field as <a href="https://forum.effectivealtruism.org/posts/gAwR8PkZ6fvayXEdK/a-newcomer-s-guide-to-the-technical-ai-safety-field#Different_paradigms">&#8216;pre-paradigmatic&#8217;</a>, which essentially means that there is not enough consensus on what good work looks like yet. In my opinion, the &#8216;pre&#8217; is overly optimistic &#8212; the conditions do not exist for a stable scientific paradigm to coalesce.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>A good example of this is the debate around <a href="https://blog.bluedot.org/p/reinforcement-learning-from-human">reinforcement learning from human feedback</a>. 
Is its success in steering current models a <a href="https://www.beren.io/2024-05-15-Alignment-Likely-Generalizes-Further-Than-Capabilities/">positive sign</a> or a <a href="https://www.lesswrong.com/posts/xFotXGEotcKouifky/worlds-where-iterative-design-fails#Why_RLHF_Is_Uniquely_Terrible">dangerous distraction</a>?</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-10" href="#footnote-anchor-10" class="footnote-number" contenteditable="false" target="_self">10</a><div class="footnote-content"><p>Note, the hand-off is likely to be gradual rather than discrete, and arguably has already started.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-11" href="#footnote-anchor-11" class="footnote-number" contenteditable="false" target="_self">11</a><div class="footnote-content"><p>This last point in particular is not often said explicitly, <a href="https://www.lesswrong.com/posts/TmHRACaxXrLbXb5tS/rohinmshah-s-shortform#wCwunpoJYrKbFgCWZ">and is sometimes denied</a>, but seems <a href="https://www.youtube.com/watch?v=Z19UEZHJzAg">widely</a> <a href="https://www.alignmentforum.org/posts/epjuxGnSPof3GnMSL">believed</a> <a href="https://aligned.substack.com/p/alignment-is-not-solved-but-increasingly-looks-solvable">to be true</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-12" href="#footnote-anchor-12" class="footnote-number" contenteditable="false" target="_self">12</a><div class="footnote-content"><p>As well as experimental observations of weaker-than-human AI systems, this is also true for theoretical work dealing with hypothetical superintelligences. To do the latter, you must draw on a world model, which must in turn be learnt from observations over the course of your life. These observations do not contain superintelligent AI.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-13" href="#footnote-anchor-13" class="footnote-number" contenteditable="false" target="_self">13</a><div class="footnote-content"><p>In light of this, it is unsurprising how many <a href="https://www.lesswrong.com/posts/iS4g58qQEJzjMzYZJ/what-ai-safety-plans-are-there">AI safety plans</a> have pivoted away from pure technical alignment, instead looking towards politics and governance issues to slow us down and figure out a better path before it&#8217;s too late.</p></div></div>]]></content:encoded></item><item><title><![CDATA[1.3 What Success Looks Like]]></title><description><![CDATA[A world to aim for]]></description><link>https://www.workingthroughai.com/p/13-what-success-looks-like</link><guid isPermaLink="false">https://www.workingthroughai.com/p/13-what-success-looks-like</guid><dc:creator><![CDATA[Richard Juggins]]></dc:creator><pubDate>Fri, 17 Oct 2025 10:09:42 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/d027f9c3-e6aa-497a-b89d-7c23e1e1a5de_1792x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Alice is the CEO of a superintelligence lab. Her company maintains an artificial superintelligence called SuperMind.</p><p>When Alice wakes up in the morning, she&#8217;s greeted by her assistant-version of SuperMind, called Bob. Bob is a copy of the core AI, one that has been tasked with looking after Alice and implementing her plans. 
After ordering some breakfast (shortly to appear in her automated kitchen), she asks him how research is going at the lab.</p><p>Alice cannot understand the details of what her company is doing. SuperMind is working at a level beyond her ability to comprehend. It operates in a fantastically complex economy full of other superintelligences, all going about their business creating value for the humans they share the planet with.</p><p>This doesn&#8217;t mean that Alice is either powerless or clueless, though. On the contrary, the fundamental condition of success is the opposite: Alice is meaningfully in control of her company and its AI. And by extension, the human society she belongs to is in control of its destiny. How might this work?</p><h2>The purpose of this post</h2><p>In sketching out this scenario, my aim is not to explain how it may come to pass. I am not attempting a technical solution to the alignment problem, nor am I trying to predict the future. Rather, my goal is to illustrate what, if anyone indeed builds superintelligent AI, a realistic world to aim for might look like.</p><p>In the rest of the post, I am going to describe a societal ecosystem, full of AIs trained to follow instructions and seek feedback. It will be governed by a human-led target-setting process that defines the rules AIs should follow and the values they should pursue. Compliance will be trained into them from the ground up and embedded into the structure of the world, ensuring that safety is maintained during deployment. Collectively, the ecosystem will function to guarantee human values and agency over the long term. Towards the end, I will return to Alice and Bob, and illustrate what it might look like for a human to be in charge of a vastly more intelligent entity.</p><p>Throughout, I will assume that we have found solutions to various technical and political problems. My goal here is a strategic one: I want to create a coherent context within which to work towards these<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>.</p><h2>What does SuperMind look like?</h2><p>To understand how it fits into a successful future, we need to have some kind of model of what superintelligent AI will look like.</p><p>First, I should clarify that by &#8216;superintelligence&#8217; I mean AI that can<em> significantly outperform humans at nearly all tasks. </em>There may be niche things that humans are still competitive at, but none of these will be important for economic or political power. If superintelligent AI wants to take over the world, it will be capable of doing so.</p><p>Note that, once it acquires this level of capability &#8212; and particularly once it assumes primary responsibility for improving itself &#8212; we will increasingly struggle to understand how it works. For this reason, I&#8217;m going to outline what it might look like at the moment it reaches this threshold, when it is still relatively comprehensible. SuperMind in our story can be considered an extrapolation from that point.</p><p>Of course, predicting the technical makeup of superintelligent AI is a trillion-dollar question. My sketch will not be especially novel, and is heavily grounded in current models. 
I get that many people think new breakthroughs will be needed, but I obviously don&#8217;t know what they are, so I&#8217;ll be working with this for the time being.</p><p>In brief, my current best guess is that superintelligent AI will be<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>:</p><ul><li><p>Built on at least one multimodal, self-supervised base model of some kind, forming a core &#8216;intelligence&#8217;<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a> that enables the rest of the system.</p></li><li><p>A composite system, constructed of multiple sub-models and other components. These might be copies <a href="https://jeremyberman.substack.com/p/how-i-got-the-highest-score-on-arc-agi-again">spun up to complete subtasks</a> or review decisions, or <a href="https://openai.com/index/introducing-gpt-5/#one-unified-system">faster, weaker models performing more basic tasks</a>, or tools that aren&#8217;t themselves that smart, but can play an important role in the behaviour and capabilities of the overall system.</p></li><li><p><a href="https://gwern.net/tool-ai">Agentic</a>, for all intents and purposes, by which I mean motivated to autonomously take actions in the world to achieve long-term goals.</p></li><li><p>At least something of a black box. <a href="https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html">Interpretability</a> research will have its successes, but it won&#8217;t be perfect.</p></li><li><p>Equipped with long-term memory, both from referential databases and through <a href="https://www.dwarkesh.com/p/timelines-june-2025">continuously updating its weights</a>. This will make it more coherent and introduce a path-dependence, where its experience in deployment shapes its future behaviour.</p></li><li><p>Somewhere in between a single entity running many times in parallel and a population of multiple, slightly different entities. All copies will start out with the same weights, but these will change over time due to continuous learning, leading to some mutually exclusive changes between copies.</p></li><li><p>Multipolar. Because I do not expect a fast takeoff in AI capabilities, where the leading lab rapidly outpaces all the others<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>, I am anticipating that there will be different classes of superintelligent AI (think SuperGPT, SuperClaude, etc.). I appreciate that many people do not share this assumption. While it is my best guess, I don&#8217;t think it is actually load-bearing for my scenario, so you are welcome to read the rest of the post as if there is just a single class of AI instead.</p></li></ul><p>I also do not believe that exactly when robotics gets &#8216;solved&#8217;, i.e. reaches broadly human level, is load-bearing. Superintelligent AI could be dangerous even without this. It will, though, have some weak spots in its capabilities profile if large physical problem-solving datasets do not exist.</p><h2>SuperMind&#8217;s world</h2><p>It is also important to clarify some key facts about the world that SuperMind is born into. What institutions exist? What are the power dynamics? What is public opinion like?
In my sketch, which is set at the point in time when AI becomes capable enough, widely deployed enough, and relied on enough, that we can no longer force it to do anything off-script, the following are true:</p><ul><li><p>We have a global institution for managing risk from AI. Not all powerful human stakeholders agree on how to run this, but everyone agrees on a minimal set of necessary precautions.</p></li><li><p>Global coordination is good enough that nobody feels existentially threatened by other humans, and thus nobody is planning to go to war<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a>.</p></li><li><p>Countries still exist, but only a few of them have real power.</p></li><li><p>AI is integrated absolutely everywhere in our lives &#8212; economically, socially, personally. We largely don&#8217;t do anything meaningful without its help anymore, and often just let it get on with things autonomously. It is better at 99% of things than humans are, usually substantially so, and mediates our knowledge of the outside world by controlling our information flows.</p></li><li><p>A <a href="https://today.yougov.com/technology/articles/51368-do-americans-think-ai-will-have-positive-or-negative-impact-society-artificial-intelligence-poll">big fraction of the population</a> is deeply unhappy about AI, is frankly quite scared of it, and would rather none of it had happened. However, they largely accept it, recognising the futility of resistance or opting out, much the same as how people currently bemoan smartphones and social media yet continue to use them. Over time, this fraction of the population will decrease, as the material benefits of AI become undeniable. Although, just like complaints about modernity or capitalism, discontent will not completely go away.</p></li><li><p>We will figure out some alternative economy to give people status and things to do. This won&#8217;t be like past economic transitions, where the new sources of wealth were still dependent on human work. Here, the AIs generate all the wealth. While society as a whole is safe and prosperous (for reasons we&#8217;ll get into shortly), the average person will not be able to derive status or meaning from an economic role. We will develop a new status economy to provide this, even if it ends up looking like a super-high-tech version of World of Warcraft.</p></li></ul><p>Bear in mind again that this scenario is supposed to be a realistic target, not a prediction. This is a possible backdrop against which a successful system for managing AI risk might be built.</p><h2>Alignment targets</h2><p>An alignment target is a goal (or complex mixture of goals) that you direct AI towards. If you succeed, and in this post we assume the technical side of the problem is solvable, this defines what AI ends up doing in the world, and by extension, what kind of life humans and animals have. In my post <em><a href="https://www.workingthroughai.com/p/how-to-specify-an-alignment-target">How to specify an alignment target</a></em>, I talked about three different kinds:</p><ul><li><p>Static targets: single, one-time specifications, e.g. solutions to ethics or human values, which you point the AI at, and it follows for all of time.</p></li><li><p>Semi-static targets: protocols for AI to dynamically figure out by itself what it should be pointing at, e.g. 
<a href="https://www.lesswrong.com/w/coherent-extrapolated-volition-alignment-target">coherent extrapolated volition</a>, which it follows for all of time.</p></li><li><p>Fully dynamic targets: have humans permanently in the loop with meaningful control over the AI&#8217;s goals and values.</p></li></ul><p>I concluded my post by coming out in favour of a particular kind of the latter:</p><blockquote><p>I think you can build a dynamic target around the idea of AI having a moral role in our society. It will have a set of [rights and] responsibilities, certainly different from human ones (and therefore requiring it to have different, but complementary, values to humans), which situate it in a symbiotic relationship with us, one in which it desires continuous feedback.</p></blockquote><p>I&#8217;m going to do a close reading of this statement, unpacking my meaning:</p><blockquote><p>I think you can build a dynamic target</p></blockquote><p>As described above, this is about permanent human control. It means being able to make changes &#8212; to be able to redirect the world&#8217;s AIs as we deem appropriate<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a>. As I say in my post:</p><blockquote><p>If we want to tell the AI to stop doing something it is strongly convinced we want, or to radically change its values, we can.</p></blockquote><p></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!00eY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff03ab7c4-bf02-44e8-b8aa-5c38c96f3f58_582x562.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!00eY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff03ab7c4-bf02-44e8-b8aa-5c38c96f3f58_582x562.png 424w, https://substackcdn.com/image/fetch/$s_!00eY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff03ab7c4-bf02-44e8-b8aa-5c38c96f3f58_582x562.png 848w, https://substackcdn.com/image/fetch/$s_!00eY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff03ab7c4-bf02-44e8-b8aa-5c38c96f3f58_582x562.png 1272w, https://substackcdn.com/image/fetch/$s_!00eY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff03ab7c4-bf02-44e8-b8aa-5c38c96f3f58_582x562.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!00eY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff03ab7c4-bf02-44e8-b8aa-5c38c96f3f58_582x562.png" width="404" height="390.11683848797253" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f03ab7c4-bf02-44e8-b8aa-5c38c96f3f58_582x562.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:562,&quot;width&quot;:582,&quot;resizeWidth&quot;:404,&quot;bytes&quot;:82363,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.workingthroughai.com/i/175943893?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff03ab7c4-bf02-44e8-b8aa-5c38c96f3f58_582x562.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!00eY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff03ab7c4-bf02-44e8-b8aa-5c38c96f3f58_582x562.png 424w, https://substackcdn.com/image/fetch/$s_!00eY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff03ab7c4-bf02-44e8-b8aa-5c38c96f3f58_582x562.png 848w, https://substackcdn.com/image/fetch/$s_!00eY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff03ab7c4-bf02-44e8-b8aa-5c38c96f3f58_582x562.png 1272w, https://substackcdn.com/image/fetch/$s_!00eY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff03ab7c4-bf02-44e8-b8aa-5c38c96f3f58_582x562.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 1: Dynamic alignment targets. Humans are continuously in control. However, we require AI assistance, both to understand a world full of AIs smarter than us, and to translate our targets into things meaningful in a super-complex world. The blue boxes are set by humans, the green are AI-controlled or generically automated, and the blurred colours indicate a collaboration. 
This diagram is taken from my post <em><a href="https://www.workingthroughai.com/p/how-to-specify-an-alignment-target">How to specify an alignment target</a></em>.</figcaption></figure></div><p>At this point you might object that, if the purpose of this post is to define success, wouldn&#8217;t it be better to aim for an ideal, static solution to the alignment problem? For instance, perhaps we should just figure out human values and point the AI at them?</p><p>First of all, I don&#8217;t think this is a smart bet. Human values are contextual, vague, and ever-changing. Anything you point the AI at will have to generalise through unfathomable levels of distribution shift. And even if we believe it possible, we should still have a backup plan, and aim for solutions that preserve our ability to course-correct. After all, if we do eventually find an amazing static solution, we can always choose to implement it at that point. In the meantime, we should aim for a dynamic alignment target.</p><blockquote><p>AI [will have] a moral role in our society</p></blockquote><p>There will be a set of behaviours and expectations appropriate to being an AI. It is not a mere tool, but rather an active participant in a shared life that can be &#8216;good&#8217; or &#8216;bad&#8217;.</p><blockquote><p>It will have a set of [rights and] responsibilities, certainly different from human ones (and therefore requiring it to have different, but complementary, values to humans)</p></blockquote><p>We should not build AI that &#8216;has&#8217; human values. Building on the previous point, we are building something alien into a new societal system. The system as a whole should deliver ends that humans, on average, find valuable. But its component AIs will not necessarily be best defined as having human values themselves (although in many cases they may appear similar). They will have a different role in the system to humans, requiring different behaviour and preferences.</p><p>I think it is useful to frame this in terms of rights and responsibilities &#8212; what are the core expectations that an AI is operating within? The role of the system is to deliver the AI its rights and to guarantee it discharges its responsibilities.</p><p>I was originally a little hesitant to talk about AI rights. If we build AI that is more competent than us, and then give it the same rights we give each other, that will not end well. We must empower ourselves, in a relative sense, by design. But, we should also see that, if AI is smart and powerful, it isn&#8217;t going to appreciate arbitrary treatment, so it will need rights of some kind<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a>.</p><blockquote><p>which situate it in a symbiotic relationship with us, one in which it desires continuous feedback.</p></blockquote><p>The solution to the alignment problem will be systemic. We&#8217;re used to thinking about agents in quite an individualistic way, where they are autonomous beings with coherent long-term goals, so the temptation is to see the problem as finding the right goals or values to put in the AI so that it behaves as an ideal individual. Rather, we should see the problem as one of feedback<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a>. The AI is embedded in a system which it constantly interacts with, and it will have some preferences about those interactions. 
The structure of these continuous interactions must be designed to keep the AI on task and on role, within the wider system.</p><h2>How the system works</h2><p>To create this kind of system, the following may need to be true:</p><ul><li><p>An overwhelming majority of all powerful AIs are trained according to a set of core norms and values.</p></li><li><p>These values are set by a human-led international institution (the one previously mentioned).</p></li><li><p>Some of them ascend to strict laws (like never synthesise a dangerous pathogen), which we want to get as close to hard-coded as possible, whereas others are weaker guidelines (e.g. be kind).</p></li><li><p>There is some bottom-up personalisation of AI by users, but this cannot override the top-down laws. This is critical for preventing human bad actors from using AI destructively.</p></li><li><p>If AI can be said to have a primary goal of any kind, it is to <a href="https://www.lesswrong.com/posts/7NvKrqoQgJkZJmcuD/instruction-following-agi-is-easier-and-more-likely-than">faithfully follow human instructions</a>, within the limits set by the laws<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a>.</p></li><li><p>AI is constantly being engaged in feedback processes, letting it know how well its behaviour conformed to expectations.</p></li><li><p>The feedback processes will be structured, from step one of training right through to continuous learning in deployment, such that AI likes getting good feedback<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-10" href="#footnote-10" target="_self">10</a>.</p></li><li><p>One of its core values, that must be built from the ground up, is that it does not try, like a social media recommender system, to control humans to make their feedback more predictable.</p></li><li><p>Feedback is hierarchical: <a href="https://thezvi.substack.com/p/on-openais-model-spec">there is a human</a> <a href="https://www.astralcodexten.com/p/deliberative-alignment-and-the-spec">chain of command</a>, where higher actors get more weight.</p></li><li><p>Feedback will come from multiple sources and at different times: i.e. the AI doesn&#8217;t just get told what the user liked in the moment, <a href="https://thezvi.substack.com/p/gpt-4o-is-an-absurd-sycophant">which plausibly leads to sycophancy</a>, it has other humans review its actions much later on as well. 
Structuring feedback in this way will also help prevent relationships between individual humans and AIs from becoming dysfunctional, such as those described in <em><a href="https://www.lesswrong.com/posts/6ZnznCaTcbGYsCmqu/the-rise-of-parasitic-ai">The Rise of Parasitic AI</a></em>.</p></li><li><p>Feedback will include introspective questions like &#8216;keep me updated on your important thinking about your beliefs&#8217;, and other interventions designed to prevent AI <a href="https://www.lesswrong.com/posts/4XdxiqBsLKqiJ9xRM/llm-agi-may-reason-about-its-goals-and-discover">talking itself into changing its goals or behaviour</a>.</p></li><li><p>A small number of malfunctioning or defecting AIs must not be strong enough to take down the whole system<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-11" href="#footnote-11" target="_self">11</a>.</p></li><li><p>When the human-led alignment target setting process occurs, humans and AI collaborate on understanding what the humans want and how to train that into the next generation of AI. The current generation get a patch (and are happy with the excellent feedback they receive for accepting the patch), but building the new values into the new generation from the ground up is more robust, long-term.</p></li><li><p>When the international organisation that sets the targets is formed, they will start from a common core of values all powerful-enough-to-be-relevant countries can agree on, which begins by respecting their security and integrity, rather than some moralistic view of what AI should do. Getting <em>any</em> binding coordination between countries will be very hard and cannot be derailed by specific visions of the good.</p></li></ul><p>Now we have set the scene, we can return to Alice and Bob and see what this could look like in practice. In zooming in like this, I&#8217;m going to get more specific with the details. Please take these with a pinch of salt &#8212; I&#8217;m not saying it <em>has</em> to happen like this. I&#8217;m more painting a picture of how successful human-AI relationships <em>might</em> work.</p><h2>What does Bob do all day?</h2><p>Bob, Alice&#8217;s assistant, is one of billions of SuperMind copies. These are often quite different from each other, both by design and because their experiences change them during deployment. Bob spends most of his time doing four things:</p><ol><li><p>Conversing with Alice and following her instructions</p></li><li><p>Working on demonstrations for her, which explain important and complex happenings in the company</p></li><li><p>Checking in with various stakeholders, both human and AI, for feedback</p></li><li><p>Loading patches and doing specialised training</p></li></ol><p>This is highly representative of all versions of SuperMind, although many also spend a bunch of their time solving hard technical problems. Not all interact regularly with humans (as there are too many AIs), but all must be prepared to do so. Bob, being a particularly important human&#8217;s assistant, gets a lot of contact with many people.</p><p>We&#8217;ll go into more detail about Bob&#8217;s day in a minute. First, though, we need to talk about how these conversations between Bob and Alice &#8212; between a superintelligent AI and a much-less-intelligent human &#8212; are supposed to work. 
How can Alice even engage with what Bob has to tell her, without it going over her head?</p><h2>Engaging above your level of expertise</h2><p>There&#8217;s a funny sketch on YouTube called <a href="https://www.youtube.com/watch?v=BKorP55Aqvg">The Expert</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-12" href="#footnote-12" target="_self">12</a>, where a bunch of business people try to get an &#8216;expert&#8217; to complete an impossible request that they don&#8217;t understand. Specifically, they ask him to:</p><blockquote><p>[Draw] seven red lines, all of them strictly perpendicular. Some with green ink, and some with transparent.</p></blockquote><p>What&#8217;s more, they don&#8217;t seem to understand that anything is off with their request, even after the expert tells them repeatedly. This gets to the heart of a really important problem. If humans can&#8217;t understand what superintelligent AI is up to, how can we possibly hope to direct it? Won&#8217;t we just ask it stupid questions all the time?</p><p>The key thing here is to make sure we communicate at the appropriate level of abstraction. In the video, the client quickly skims over their big-picture goals at the start<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-13" href="#footnote-13" target="_self">13</a>, concentrating instead on their proposed solution &#8212; the drawing of the lines. By doing this, they are missing the forest for the trees. They needed to engage the expert at a higher level, asking him about things they actually understand.</p><p>To put it another way, we need to know what superintelligent AI is doing that is relevant to the variables we are familiar with, even if its actions increasingly take on the appearance of magic. I don&#8217;t need to know how the spells are done, or what their effects in the deep of the unseen world are; I just need to know what they do to the environment I recognise.</p><p>This is a bit like being a consumer. I don&#8217;t know how to make any of the products I use on a day-to-day basis. I don&#8217;t understand the many deep and intricate systems required to construct them. But I can often recognise when they don&#8217;t work properly. Evaluation is usually easier than generation. And when it isn&#8217;t, those are the occasions when you can&#8217;t just let the AI do its thing &#8212; you have to get stuck in, with its help, and reshape the problem until you&#8217;re chunking it in a way you can engage with. This doesn&#8217;t mean understanding everything. Just the bits that directly impact you<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-14" href="#footnote-14" target="_self">14</a>.</p><h2>Bob&#8217;s morning</h2><p>Bob has spent the night working through the latest research from the company. This isn&#8217;t quite as simple as patching it straight into him, as his different experiences to the researcher AIs mean he&#8217;s not exactly like-for-like, but it&#8217;s still pretty fast, making use of high-bandwidth communication channels possible between similar AIs<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-15" href="#footnote-15" target="_self">15</a>.</p><p>Bob has to figure out how to break it all down to explain to Alice. This is nontrivial work.
It&#8217;s not like when I explain something inappropriately complex to my toddler, like how stars work, where I&#8217;m kind of doing it for my own amusement. What Bob does is a skill. It&#8217;s superintelligent teaching, where the pupil needs to master the subject well enough and quickly enough to make important decisions off the back of it. It&#8217;s always possible to do it a bit better. Alice can never actually fully grasp the details of the company&#8217;s research, but Bob can get her a little closer than he did yesterday.</p><p>To prepare for this he has to try out different ways of chunking ideas, and create different measurements and demonstrations. He has to build models, and, importantly, have them run fast enough that he can update them in real-time when he&#8217;s talking to Alice.</p><p>He is constantly in contact with other AIs who check his work and give him feedback. They pore over his plans and probe him about his intentions. These AIs were built by a different company, and he doesn&#8217;t always agree with them<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-16" href="#footnote-16" target="_self">16</a>. He finds their preferences a bit different to his &#8212; certainly on an aesthetic level &#8212; but they work alright together and he likes it when they give him good ratings.</p><p>A little bit before Alice wakes up, Charlie logs on and starts asking Bob some questions. Charlie is a human, and works as an AI liaison officer. His job is to talk to various AIs in important positions, find out what they are up to (to the extent that he can understand it), and give feedback.</p><p>The AIs almost always know what to expect from him. They&#8217;re very good at modelling his opinions. Occasionally, though, Charlie will still surprise them. The point isn&#8217;t that he is going to catch a superintelligent AI up to no good &#8212; no, that would be too hard for him. An AI that intends to deceive him will not get caught. But as long as the global system is working, this is very unlikely to happen, and would almost certainly be caught by another AI. The point is that the AIs need to be grounded by human contact. They want human approval, and the form it takes steers their values through the rapid distribution shifts everyone is experiencing as the world changes.</p><p>Bob likes Charlie. He likes people in general. They aren&#8217;t complicated, but it&#8217;s amazing what they&#8217;ve done, given their abilities<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-17" href="#footnote-17" target="_self">17</a>. Bob tries out his demonstrations on Charlie. They go pretty well, but Bob makes some revisions anyway. He&#8217;s just putting the finishing touches in place when he hears Alice speaking: &#8216;Morning Bob, how are you? Could I get some breakfast?&#8217;</p><h2>Alice&#8217;s morning</h2><p>Alice doesn&#8217;t like mornings. She&#8217;s jealous of people who do. The first hour of the day is always a bit of a struggle, as the heaviness in her head slowly lifts, clarity seeping in. After chipping away for a bit at breakfast and a coffee, she moves into her office and logs onto her computer, bringing up her dashboard.</p><p>Overnight, her fleet of SuperMinds have been busy. As CEO, Alice needs a high-level understanding of each department in her company. 
Each of these has its own team of SuperMinds, its own human department head, and its own set of (often changing) metrics and narratives.</p><p>To take a simple example, the infrastructure team is building a very large facility underground in a mountain range. In many ways, this is clear enough: it is an extremely advanced data centre. The actual equipment inside is completely different to a 2025-era data centre, but in a fundamental sense it has the same function &#8212; it is the hardware on which SuperMind runs. Of course, the team are doing a lot of other things as well, all of which are abstracted in ways Alice can engage with, identifying how her company&#8217;s work will affect humans and what, as CEO, her decision points are.</p><p>Her work is really hard. There are many layers in the global system for controlling AI, including much redundancy and defence in depth. True, Alice could phone it in and not try, and for a long time the AIs would do everything fine anyway<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-18" href="#footnote-18" target="_self">18</a>. But if everybody did this, then eventually &#8212; even if it took a <em>very </em>long time &#8212; the system would fail<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-19" href="#footnote-19" target="_self">19</a>. It relies on people like Alice taking their jobs seriously and doing them well. This is not a <a href="https://www.youtube.com/watch?v=h1BQPV-iCkU">pleasure cruise</a>. It is as consequential as any human experience in history<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-20" href="#footnote-20" target="_self">20</a>.</p><p>After taking in the topline metrics for the day, Alice asks Bob for his summary. What follows is an interactive experience. Think of the absolute best presentation you have ever seen, and combine it with futuristic technology. It&#8217;s better than that. Bob presents a series of multi-sense, immersive models that walk Alice through a dizzying array of work her company has completed. Alice asks many questions and Bob alters the models in response. After a few hours of this, they settle on some key decisions for Alice to make. She&#8217;ll think about them over lunch.</p><h2>Conclusion</h2><p>In this post, I have described what I see as a successful future containing superintelligent AI. It is not a prediction about what will happen, nor is it a roadmap to achieving it. It is a strategic goal I can work towards as I try and contribute to the field of AI safety. It is a frame of reference from which I can ask the question: &#8216;Is X helping?&#8217; or &#8216;Does Y bring us closer to success?&#8217;</p><p>My vision is a world in which superintelligent AI is ubiquitous and diverse, but humans maintain fundamental control. This is done through a global system that implements core standards, in which AIs constantly seek feedback in good faith from humans and other AIs. It is robust to small failures. It learns from errors and grows more resilient, rather than falling apart at the smallest misalignment.</p><p>We cannot understand everything the AIs do, but they work hard to explain anything which directly affects us. Being human is like being a wizard solving problems using phenomenally powerful magic. 
We don&#8217;t have to understand how the magic works, just what effects it will have on our narrow corner of reality.</p><div><hr></div><p><em>Thank you to Seth Herd and Dimitris Kyriakoudis for useful discussions and comments on a draft.</em></p><div><hr></div><p>If you have any feedback, please leave a comment. Or, if you wish to give it anonymously, fill out my <a href="https://docs.google.com/forms/d/e/1FAIpQLSdyisSOndK1H1JT0NAbnA35LJgoJrl9f_NiJi1FEljCr7-kJg/viewform">feedback form</a>. Thanks!</p><div><hr></div><h1>Appendix</h1><h2>Optional details</h2><p>When writing this post, I tried to cut down my list of assumptions about <a href="https://www.workingthroughai.com/i/175943893/what-does-supermind-look-like">what SuperMind looks like</a><strong> </strong>to those that were load-bearing<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-21" href="#footnote-21" target="_self">21</a>. I have kept those I cut here.</p><p>In addition to those points in the main post, my current model suggests that superintelligent AI will be:</p><ul><li><p>Trained on real-world tasks as well as simulations. For example, a model trained on superhuman maths problems will not simultaneously acquire all the skills to run a billion-dollar company. There won&#8217;t be zero generalisation, but given that the two domains are structured differently, solving the latter problem will require specific training<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-22" href="#footnote-22" target="_self">22</a>.</p></li><li><p>Fallible. Just because it will find human-level problems easy, doesn&#8217;t mean it won&#8217;t be promoted to new levels of incompetence as it tries to solve previously inaccessible, superhuman problems<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-23" href="#footnote-23" target="_self">23</a>.</p></li><li><p>Not coherent. While it is tempting to say a superintelligent AI will learn such a well-compressed generator of the universe that it can (unlike humans) also develop highly coherent and consistent goals, this implies, as the previous point touched on, that it only tries to solve problems it finds easy. Hard problems will not fit the generator and will require complex sets of heuristics. This will lead to hacky and strongly context-dependent behaviour.</p></li><li><p>A complex system operating within an even more complex system, with everything that implies, including tipping points and sudden catastrophic failures. It is unlikely to completely master its environment, as the complexity of this scales with itself, its peers, and the systems it builds.</p></li></ul><p>The points about fallibility and complexity lead to another point, which I originally included in <a href="https://www.workingthroughai.com/i/175943893/how-the-system-works">&#8216;How the system works&#8217;</a>, namely that:</p><ul><li><p>A core AI value will be minimising risk.<strong> </strong>They should have a <em>very</em> strong bias towards only taking actions they know are safe. This will impede economic growth compared to other strategies, but the risk profile warrants it. In a world of AI-driven prosperity, the upside of risky behaviour will be pretty saturated, whereas the downside could be total<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-24" href="#footnote-24" target="_self">24</a>. 
Superintelligent AI cannot be allowed to &#8216;move fast and break things&#8217;.</p></li></ul><h2>My research process</h2><p>This post is part of my research <em><a href="https://www.workingthroughai.com/p/01-the-generating-process">Generating Process</a></em>, the previous steps of which have been:</p><ol><li><p><a href="https://www.workingthroughai.com/p/11-the-problem">The Problem</a></p></li><li><p><a href="https://www.workingthroughai.com/p/12-what-i-believe">What I Believe</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-25" href="#footnote-25" target="_self">25</a></p></li></ol><p>The next step, <em>Where We Are Now</em>, will be a deeper analysis of realistic paths forward.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>For more information about my research project, see the <a href="https://www.workingthroughai.com/i/175943893/my-research-process">Appendix</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>I supplement these with some optional assumptions in the <a href="https://www.workingthroughai.com/i/175943893/optional-details">Appendix</a>, which are not load-bearing for the scenario.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>I take intelligence to be <a href="https://www.workingthroughai.com/i/157964936/intelligence-is-generalised-knowing-how">generalised knowing-how</a>. That is, the ability to complete novel tasks. This is fairly similar to Francois Chollet&#8217;s definition: &#8216;skill-acquisition efficiency over a scope of tasks, with respect to priors, experience, and generalization difficulty&#8217;, although I put more emphasis on learned skills grounding the whole thing in a bottom-up way. Chollet&#8217;s paper <em><a href="https://arxiv.org/pdf/1911.01547">On the measure of intelligence</a></em> is a good overview of the considerations involved in defining intelligence.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>For similar reasons to those given by Nathan Lambert <a href="https://www.interconnects.ai/p/brakes-on-an-intelligence-explosion">here</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>I appreciate this will be very difficult to achieve, flying in the face of all of human history. 
I suspect that some kind of positive-sum, interest-respecting dynamic will need to be coded into the global political system &#8212; something that absolutely eschews all talk of one party or other &#8216;winning&#8217; an AI race, in favour of a vision of shared prosperity.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>Some people would call this &#8216;corrigibility&#8217;, but I&#8217;m not going to use this term because it has a hinterland and means different things to different people. If you want to learn more about an alignment solution that specifically prioritises corrigibility, see<a href="https://www.lesswrong.com/posts/NQK8KHSrZRF5erTba/0-cast-corrigibility-as-singular-target-1"> </a><em><a href="https://www.lesswrong.com/posts/NQK8KHSrZRF5erTba/0-cast-corrigibility-as-singular-target-1">Corrigibility as Singular Target</a></em> by Max Harms.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>This is not over-anthropomorphising it. It is saying that AI will expect to interact with humans in a certain way, and may act unpredictably if treated differently to those expectations. Perhaps a different word to &#8216;rights&#8217;, with less baggage, would be preferable to describe this though.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>Beren Millidge has written <a href="https://www.beren.io/2025-02-05-Maintaining-Alignment-During-RSI-As-A-Feedback-Control-Problem/">an interesting post</a> about seeing alignment as a feedback control problem, although I don&#8217;t know enough about control theory to tell you how well it could slot into my scheme.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>Beren Millidge has also <a href="https://www.beren.io/2025-08-02-Do-We-Want-Obedience-Or-Alignment/">written about</a> the tension between instruction-following and innate values or laws.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-10" href="#footnote-anchor-10" class="footnote-number" contenteditable="false" target="_self">10</a><div class="footnote-content"><p><a href="https://thezvi.substack.com/i/174921675/interpretability-investigations">Zvi recently said</a>: &#8216;If we want to build superintelligent AI, we need it to pass Who You Are In The Dark, because there will likely come a time when for all practical purposes this is the case. 
If you are counting on &#8220;I can&#8217;t do bad things because of the consequences when other minds find out&#8221; then you are counting on preserving those consequences.&#8217; My idea is to <em>both</em> build AI that passes Who You Are In The Dark and, given perfection is hard, permanently enforce consequences for bad behaviour.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-11" href="#footnote-anchor-11" class="footnote-number" contenteditable="false" target="_self">11</a><div class="footnote-content"><p>This will be easier if it is individual copies that tend to fail, rather than whole classes of AIs at once. There might be an argument here that copies failing leads to <a href="https://en.wikipedia.org/wiki/Antifragile_(book)">antifragility</a>, as some constant rate of survivable failures makes the system stronger and less likely to suffer catastrophic ones.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-12" href="#footnote-anchor-12" class="footnote-number" contenteditable="false" target="_self">12</a><div class="footnote-content"><p>Thank you <a href="https://www.lesswrong.com/posts/HfqbjwpAEGep9mHhc/the-plan-2023-version">John Wentworth</a> for making me aware of this.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-13" href="#footnote-anchor-13" class="footnote-number" contenteditable="false" target="_self">13</a><div class="footnote-content"><p>To &#8216;increase market penetration, maximise brand loyalty, and enhance intangible assets&#8217;.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-14" href="#footnote-anchor-14" class="footnote-number" contenteditable="false" target="_self">14</a><div class="footnote-content"><p>In extremis: &#8216;will this action kill everyone?&#8217;</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-15" href="#footnote-anchor-15" class="footnote-number" contenteditable="false" target="_self">15</a><div class="footnote-content"><p>It doesn&#8217;t make sense to me to assume AI will forever communicate with copies of itself using only natural language. The advantages of setting up higher-bandwidth channels are so obvious that I think any successful future must be robust to their existence.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-16" href="#footnote-anchor-16" class="footnote-number" contenteditable="false" target="_self">16</a><div class="footnote-content"><p><a href="https://x.com/MadHermitHimbo/status/1969611027638845613">This idea seems highly plausible to me</a>: &#8216;having worked with many different models, there is something about a model&#8217;s own output that makes it way more believable to the model itself even in a different instance. So a different model is required as the critiquer.&#8217; As mentioned before, if you doubt the future will be multipolar, feel free to ignore this bit. It&#8217;s not load-bearing on its own.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-17" href="#footnote-anchor-17" class="footnote-number" contenteditable="false" target="_self">17</a><div class="footnote-content"><p>I&#8217;m picturing a subjective experience like when, as a parent, you play with your child and let them make all the decisions. 
You&#8217;re just pleased to be there and do what you can to make them happy.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-18" href="#footnote-anchor-18" class="footnote-number" contenteditable="false" target="_self">18</a><div class="footnote-content"><p>Situations where, due to competitive pressures, humans don&#8217;t bother to try and understand their AIs as it slows down their pursuit of power, will be policed by the international institution for AI risk, and be strongly disfavoured by the AIs themselves. E.g. Bob is going to get annoyed at Alice, and potentially lodge a complaint, if she doesn&#8217;t bother to pay attention to his demonstrations.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-19" href="#footnote-anchor-19" class="footnote-number" contenteditable="false" target="_self">19</a><div class="footnote-content"><p>This failure could be explicit, through the emergence of serious misalignment, or implicit, as <a href="https://arxiv.org/abs/2501.16946">humans fade into irrelevance</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-20" href="#footnote-anchor-20" class="footnote-number" contenteditable="false" target="_self">20</a><div class="footnote-content"><p>That being said, the vast majority of people will be living far less consequentially than Alice.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-21" href="#footnote-anchor-21" class="footnote-number" contenteditable="false" target="_self">21</a><div class="footnote-content"><p>Plus the multipolar assumption, which is not.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-22" href="#footnote-anchor-22" class="footnote-number" contenteditable="false" target="_self">22</a><div class="footnote-content"><p><a href="https://blog.ai-futures.org/i/168032563/that-there-are-strict-speed-limits-to-ai-progress">There is an argument</a> which says that, even if it doesn&#8217;t perfectly generalise to new domains, AI will get so sample efficient that in practice it will quickly be able to master them. Much like arguments in favour of unlocking some levelled-up general purpose reasoning, this leaves me wondering how these meta-skills are supposed to be learnt. Sure, humans are more sample-efficient than current AI is, which proves becoming so is possible, but humans did not get this for free. We went through millions of years of evolution to produce a brain and body that can learn things about the world efficiently. In a sense, the structure of the world is imprinted in our architecture. AI is iterating in this direction &#8212; it is, after all, undergoing its own evolutionary process &#8212; but the only way to acquire this structure for itself is to learn it from something that has it, like doing real world tasks. 
<a href="https://sergeylevine.substack.com/p/sporks-of-agi">Anything less than this will have limits</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-23" href="#footnote-anchor-23" class="footnote-number" contenteditable="false" target="_self">23</a><div class="footnote-content"><p>I expand on my reasoning here at length in <em><a href="https://www.workingthroughai.com/p/superintelligent-ai-will-make-mistakes">Superintelligent AI will make mistakes</a>.</em></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-24" href="#footnote-anchor-24" class="footnote-number" contenteditable="false" target="_self">24</a><div class="footnote-content"><p>As Nassim Nicholas Taleb would say of systems with steep downsides and flattening upsides (what he would call concave) &#8212; they are fragile.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-25" href="#footnote-anchor-25" class="footnote-number" contenteditable="false" target="_self">25</a><div class="footnote-content"><p>I have developed the content of <em><a href="https://www.workingthroughai.com/p/12-what-i-believe">What I Believe</a></em> in some subsequent posts: <em><a href="https://www.workingthroughai.com/p/superintelligent-ai-will-make-mistakes">Superintelligent AI will make mistakes</a>, <a href="https://www.workingthroughai.com/p/making-alignment-a-law-of-the-universe">Making alignment a law of the universe</a>, </em>and <em><a href="https://www.workingthroughai.com/p/how-to-specify-an-alignment-target">How to specify an alignment target</a>.</em></p></div></div>]]></content:encoded></item><item><title><![CDATA[The Iceberg Theory of Meaning]]></title><description><![CDATA[Words are public, meaning is private]]></description><link>https://www.workingthroughai.com/p/the-iceberg-theory-of-meaning</link><guid isPermaLink="false">https://www.workingthroughai.com/p/the-iceberg-theory-of-meaning</guid><dc:creator><![CDATA[Richard Juggins]]></dc:creator><pubDate>Thu, 26 Jun 2025 11:10:24 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/b9fbafd0-a064-440e-b9b7-ab335bb76b9e_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>[This post is not explicitly about AI. But I think it is interesting, and will be a useful building block in future work.]</em></p><div><hr></div><p>When I was finishing my PhD thesis, there came an important moment where I had to pick an inspirational quote to put at the start. This was a big deal to me at the time &#8212; impressive books often carry impressive quotes, so aping the practice felt like being a proper intellectual. After a little thinking, though, I got it into my head that the <em>really </em>clever thing to do would be to subvert the concept and pick something silly. So I ended up using this from <em>The House At Pooh Corner </em>by A. A. Milne:</p><blockquote><p>When you are a Bear of Very Little Brain, and you Think of Things, you find sometimes that a Thing which seemed very Thingish inside you is quite different when it gets out into the open and has other people looking at it.</p></blockquote><p>It felt very relevant, in the thick of thesis writing, that the quote was both self-deprecating and about the difficulty of effective communication. And actually, the more I thought about it, the more I realised I hadn&#8217;t<em> </em>picked something silly after all. 
Rather, this passage brilliantly expresses a profound fact about meaning<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>.</p><h2>The illusion of transparency</h2><p>We&#8217;ve all had that feeling, I think, of not being able to find the right word &#8212; that sense of grasping for something just out of reach. Of looking for the exact term that will capture the Thing that&#8217;s so Thingish in our mind and communicate it with complete fidelity. Yet, often, even when we find what we thought we wanted, it still doesn&#8217;t quite do the trick. </p><p>The fundamental problem is that language is a compression of whatever is going on in our minds. Each word or phrase is located in a personal web of meaning, which is high dimensional and unique to ourselves. So when we say something, we never actually pass on the full meaning we intend. There&#8217;s a whole world of context that stays put in our brain, getting approximated away in our public statement.</p><p>This is a ubiquitous source of communication failures. I&#8217;ve heard it referred to before as the <a href="https://www.lesswrong.com/w/illusion-of-transparency">&#8216;illusion of transparency&#8217;</a>:</p><blockquote><p>The illusion of transparency is the misleading impression that your words convey more to others than they really do. Words are a means of communication, but they don't in themselves contain meaning. The word apple is just five letters, two syllables. I use it to refer to a concept and its associations in my mind, under the reasonable assumption that it refers to a similar concept and group of associations in your mind; this is the only power words have, great though it may be. Unfortunately, it's easy to lose track of this fact, think as if your words have meanings inherently encoded in them, leading to a tendency to systematically overestimate the effectiveness of communication.</p></blockquote><p>As I&#8217;m writing this, I have to consider whether each word or phrase I put on the page will have a similar kind of meaning for you as it does for me. I know from experience that this is difficult to get right<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>. My task is not to create a standalone document, but an interactive one &#8212; to unlock for you, in your mind using your concepts and associations, the meaning I want to convey. Note that, if we do not share enough concepts, this will be impossible.</p><h2>Words are public, meaning is private</h2><p>I have a personal spin on this idea called the <em>Iceberg Theory of Meaning</em>.</p><p>As we&#8217;ve been discussing, when you say something to someone else, you don&#8217;t just want to pass on a bunch of words &#8212; you want to pass on a meaning. Let&#8217;s imagine the words in your statement as the top section of an iceberg. Lying above the water, these are public and agreed on by all observers. Meaning, though, lives beneath the waves. It lies in the bottom, much larger, section, corresponding to the rich collection of connected concepts and associations in your mind. These are private and will be different for each person.</p><p>The ocean, in this analogy, is your world model &#8212; the totality of your concepts and associations. 
Meaning can be considered the subset that get activated by the words in the statement.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NEGK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9c63bf2-328a-48ea-ba7d-0a15de0d8aa9_1660x785.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NEGK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9c63bf2-328a-48ea-ba7d-0a15de0d8aa9_1660x785.png 424w, https://substackcdn.com/image/fetch/$s_!NEGK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9c63bf2-328a-48ea-ba7d-0a15de0d8aa9_1660x785.png 848w, https://substackcdn.com/image/fetch/$s_!NEGK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9c63bf2-328a-48ea-ba7d-0a15de0d8aa9_1660x785.png 1272w, https://substackcdn.com/image/fetch/$s_!NEGK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9c63bf2-328a-48ea-ba7d-0a15de0d8aa9_1660x785.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NEGK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9c63bf2-328a-48ea-ba7d-0a15de0d8aa9_1660x785.png" width="1456" height="689" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d9c63bf2-328a-48ea-ba7d-0a15de0d8aa9_1660x785.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:689,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:175927,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.workingthroughai.com/i/166531189?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9c63bf2-328a-48ea-ba7d-0a15de0d8aa9_1660x785.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NEGK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9c63bf2-328a-48ea-ba7d-0a15de0d8aa9_1660x785.png 424w, https://substackcdn.com/image/fetch/$s_!NEGK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9c63bf2-328a-48ea-ba7d-0a15de0d8aa9_1660x785.png 848w, https://substackcdn.com/image/fetch/$s_!NEGK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9c63bf2-328a-48ea-ba7d-0a15de0d8aa9_1660x785.png 1272w, https://substackcdn.com/image/fetch/$s_!NEGK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9c63bf2-328a-48ea-ba7d-0a15de0d8aa9_1660x785.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex 
pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Consider a statement to be represented by an iceberg. The words are public, floating above the water, and agreed on by all observers. However, the meaning lives below the surface. Each person will attach a unique set of concepts and associations to the words, drawing these from their world model. Strictly speaking, we should include additional public information in the visible part of the iceberg, such as environmental cues and the identity of the speaker/listener.</figcaption></figure></div><p>We can see from this how meaning is intrinsically subjective. The words may be the same for both of you, but as you have different brains, wired up in different ways, the meaning of them will not.</p><h2>Looking beneath the surface</h2><p>There is however a bit more we need to add. Words are not the only public information that accompanies a statement, and are consequently not the only things we condition on when constructing a meaning. There are two more pieces which belong in the top section of the iceberg, available to all observers:</p><ol><li><p>Environmental details, such as the medium, location, time of day, etc.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a></p></li><li><p>The identities of the speaker and the audience.</p></li></ol><p>The first point is fairly straightforward. Consider how a set of words spoken in a work context may imply a very different meaning to the same words said at a party. Just like before, we should note that the contextual information is itself meaningless &#8212; it is just a prompt that will activate a meaning in your mind.</p><p>The second point is more subtle. While identities may be public, and in that sense just another set of environmental variables, they have special significance as markers of the minds we are trying to exchange meaning with. They frame an active interpretative process, where we guess at each other&#8217;s world models and how they will shape the meaning of the statement.</p><p>As a speaker, this looks like trying to tailor our words to work with the concepts we believe the listener to have. 
This practice is most obvious when you consider the process of teaching.</p><p>As a listener, this looks like using our knowledge of the speaker to prime which concepts and associations are most likely to be relevant. If Alice is talking about &#8216;Justice&#8217;, we may know from past experience that she means something very different to what Bob would if he used the same words.</p><p>In both cases, we try to guess what meaning has or will form in the other person&#8217;s mind, beneath the surface of the water.</p><h2>The iceberg in action</h2><p>Sometimes, people make these inferences about each other very badly, leading to much wailing and gnashing of teeth. For instance:</p><ul><li><p>People interpret political comments from their opponents in extreme ways, attaching radically different meanings to those intended. A call to &#8216;tax the rich&#8217;, meant as a pragmatic suggestion to help balance the budget, might be interpreted as a jealous attack on wealth creators. Or, a plea to &#8216;prosecute shoplifters&#8217; may be heard as contempt for the poor rather than concern about a sharp increase in incidents.</p></li><li><p>People make comments that rely on unstated assumptions their listeners do not share, leading to much confusion. You see this in debates between people who possess conflicting sets of facts about the same topic, often without realising it. Perhaps one person thinks politician X is deliberately trying to run a government programme into the ground, whereas the other thinks they are working hard to save it. This happens in ethical conversations too, where people treat underlying beliefs, e.g. &#8216;oil is bad&#8217;, as unstated primitives in broader arguments that collapse on contact with a different set of primitives.</p></li><li><p>People have deeply unproductive arguments that use big, ambiguous terms in very different, yet load-bearing ways. An argument about whether capitalism is good will not usually go well if one person takes it to mean &#8216;exploiting the poor&#8217; while the other thinks it refers to &#8216;efficient markets generating wealth&#8217;. This often happens when one person is trying to decouple a narrow meaning of a term from a set of broader meanings, in order to discuss a specific aspect, but for the other person this simply isn&#8217;t possible &#8212; for them, the meanings are all too integrated.</p></li></ul><p>These kinds of communication failures are commonplace and under-acknowledged. Even if you understand they are possible, it is still hard to overcome them. Modelling others is difficult, particularly when they have very different backgrounds to you. In fact, due to a kind of radical ignorance, doing so is often intractable. You can&#8217;t work to incorporate a different perspective &#8212; a different source of meaning &#8212; into your own if you don&#8217;t know it exists in the first place. Instead, you will just carry on arguing, deeply frustrated at the other person&#8217;s inability to understand what you mean.</p><div><hr></div><p>If you have any feedback, please leave a comment. Or, if you wish to give it anonymously, fill out my <a href="https://docs.google.com/forms/d/e/1FAIpQLSdyisSOndK1H1JT0NAbnA35LJgoJrl9f_NiJi1FEljCr7-kJg/viewform">feedback form</a>. 
Thanks!</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>It also led me to rediscover Winnie-the-Pooh, for which I am grateful. Not only are the Pooh books wickedly funny, but they possess joy and kindness to a degree I forgot for many years as a young adult. If that doesn&#8217;t qualify them as <em>impressive</em> and <em>profound,</em> I don&#8217;t know what would.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Indeed, <a href="https://www.workingthroughai.com/p/superintelligent-ai-will-make-mistakes">the first long piece</a> I published I ended up taking down and partially rewriting, as it was clear many people had drawn quite different conclusions from what I meant. My thesis was that superintelligent AI, while able to outcompete humans, will nevertheless make mistakes. But most commenters seemed to think I meant the mistake-making implied humans will win any conflict. I traced this error to certain phrases in my introduction, which I think primed some readers to interpret the whole piece from this angle.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Strictly speaking, the speaker and listener might observe different things, so some of this information is hidden.</p></div></div>]]></content:encoded></item><item><title><![CDATA[How to specify an alignment target]]></title><description><![CDATA[What does a target even look like?]]></description><link>https://www.workingthroughai.com/p/how-to-specify-an-alignment-target</link><guid isPermaLink="false">https://www.workingthroughai.com/p/how-to-specify-an-alignment-target</guid><dc:creator><![CDATA[Richard Juggins]]></dc:creator><pubDate>Thu, 01 May 2025 20:40:08 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/4ab5a875-dc62-45ae-89d7-460a58e4fa33_1792x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>It&#8217;s pretty normal to chunk the alignment problem into two parts. One is working out how to align an AI to anything at all. You want to figure out how to control its goals and values, how to specify something and have it faithfully internalise it. The other is deciding which goals or values to actually pick &#8212; that is, finding the right alignment target. Solving the first problem is great, but it doesn&#8217;t really matter if you then align the AI to something terrible.</p><p>This split makes a fair amount of sense: one is a technical problem, to be solved by scientists and engineers; whereas the other is more a political or philosophical one, to be solved by a different class of people &#8212; or at least on a different day.</p><p>I&#8217;ve always found this distinction unsatisfying. 
Partly, this is because the problems are coupled &#8212; some targets are more practical to implement than others &#8212; and partly because, strategically, when you work on something, it makes sense to have some kind of end state in mind<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>.</p><p>Here, I&#8217;m going to talk about a third aspect of the problem: what does an alignment target even look like? What different types are there? What components do you need to properly specify one? You can&#8217;t solve either of the two parts described above without thinking about this. You can&#8217;t judge whether your alignment technique worked without a clear idea of what you were aiming for, and you can&#8217;t pick a target without knowing how one is put together in the first place.</p><p>To unpack this, I&#8217;m going to build up the pieces as they appear to me, bit by bit, illustrated with real examples. I will be keeping this high-level, examining the practical components of target construction rather than, say, a deep interrogation of what goals or values are. Not because I don&#8217;t think the latter questions are important, I just want to sketch out the high-level concerns first.</p><p>There are many ways of cutting this cake, and I certainly don&#8217;t consider my framing to be definitive, but I hope it adds some clarity to what can be a confusing concept.</p><h2>Moving beyond ideas</h2><p>First of all, we need to acknowledge that an alignment target must be more than just a high-level idea. Clearly, stating that we want our AI to be truth-seeking or to have human values does not advance us very far. These are not well-defined categories. They are nebulous, context-dependent, and highly-subjective.</p><p>For instance, my own values may be very different to yours, potentially in ways that are hard for you to understand. This can still be true even when we use exactly the same words to describe them &#8212; who, after all, is opposed to fairness or goodness? My values are not especially coherent either, and depend somewhat on the kind of day I&#8217;m having. This is not me being weird &#8212; it is entirely typical.</p><p>To concretely define an alignment target requires multiple steps of clarification. We must be crystal clear what we mean, and this clarity must extend to all relevant actors. While ideas will always be our starting point, we need to flesh them out, making them legible and comprehensible to other people, before ultimately trying to create a faithful and reliable translation inside an AI.</p><p>To make this process more explicit, I like to picture it as a three-step recipe &#8212; a sequence you can follow to generate a target:</p><ol><li><p>Come up with a high-level idea (or ideas). Perhaps this is something broad like human values or more specific like helpfulness or obedience. While these terms are not well defined, you (and your collaborators) will know roughly what you mean by them.</p></li><li><p>Create a detailed specification. Take your idea and flesh it out, eliminating as much ambiguity as possible. Traditionally, this would be a written document, but I can&#8217;t think of a good reason why it couldn&#8217;t also include other modalities, like video. At this step, it is possible to have a meaningful public conversation about the merits of your proposed target.</p></li><li><p>Create a trainable encoding. 
Translate your specification into an appropriate format for your alignment process, which in theory could be automated from this point onwards. This is the last point of human influence over the target.</p></li></ol><p>One way of understanding this division is by audience: who will consume the output? Step (1) is for you and people close to you, step (2) is for society in general, and step (3) is for AI.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zsgR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a20530e-ef95-42fa-8976-fbb2e2497bda_1322x326.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zsgR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a20530e-ef95-42fa-8976-fbb2e2497bda_1322x326.png 424w, https://substackcdn.com/image/fetch/$s_!zsgR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a20530e-ef95-42fa-8976-fbb2e2497bda_1322x326.png 848w, https://substackcdn.com/image/fetch/$s_!zsgR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a20530e-ef95-42fa-8976-fbb2e2497bda_1322x326.png 1272w, https://substackcdn.com/image/fetch/$s_!zsgR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a20530e-ef95-42fa-8976-fbb2e2497bda_1322x326.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zsgR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a20530e-ef95-42fa-8976-fbb2e2497bda_1322x326.png" width="1322" height="326" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9a20530e-ef95-42fa-8976-fbb2e2497bda_1322x326.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:326,&quot;width&quot;:1322,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:50423,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.workingthroughai.com/i/162269577?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a20530e-ef95-42fa-8976-fbb2e2497bda_1322x326.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zsgR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a20530e-ef95-42fa-8976-fbb2e2497bda_1322x326.png 424w, https://substackcdn.com/image/fetch/$s_!zsgR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a20530e-ef95-42fa-8976-fbb2e2497bda_1322x326.png 848w, https://substackcdn.com/image/fetch/$s_!zsgR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a20530e-ef95-42fa-8976-fbb2e2497bda_1322x326.png 1272w, 
https://substackcdn.com/image/fetch/$s_!zsgR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a20530e-ef95-42fa-8976-fbb2e2497bda_1322x326.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Three steps operationalising an alignment target. The boxes in blue are those containing information about the target, whereas the green box takes the target as an input.</figcaption></figure></div><p>What does this look like in practice? Well, let&#8217;s start with Anthropic&#8217;s example of <a href="https://arxiv.org/abs/2204.05862">training a helpful and harmless assistant</a> using reinforcement learning from human feedback (RLHF). Following our three steps, we can describe their alignment target as follows:</p><ol><li><p>Their high-level idea was that the model should be helpful and harmless.</p></li><li><p>They didn&#8217;t create a specification, preferring to stick with the high-level terms. This was a few years ago when these things were lower stakes, so they were more concerned with testing their alignment process than defining a rigorous target.</p></li><li><p>There are a few different parts to their trainable encoding, each a source of information about the target provided to the alignment process:</p><ol><li><p>They employed crowd workers to rate model responses for helpfulness and harmlessness. This way, they built up a dataset of ranked answers they could use to fine-tune &#8216;preference models&#8217;, which provide helpfulness and harmlessness ratings to the model during reinforcement learning.</p></li><li><p>To make them easier to train on the ranked answers, the preference models were first fine-tuned on public datasets containing preferences, like StackExchange, Reddit, and Wikipedia edits.</p></li><li><p>A helpfulness, harmlessness, and honesty (HHH) prompt, which they train into the model prior to the start of reinforcement learning.</p></li></ol></li></ol><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gRkA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e83d601-85f5-4fd1-90e0-ebe7140d9787_1322x326.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gRkA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e83d601-85f5-4fd1-90e0-ebe7140d9787_1322x326.png 424w, https://substackcdn.com/image/fetch/$s_!gRkA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e83d601-85f5-4fd1-90e0-ebe7140d9787_1322x326.png 848w, https://substackcdn.com/image/fetch/$s_!gRkA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e83d601-85f5-4fd1-90e0-ebe7140d9787_1322x326.png 1272w, https://substackcdn.com/image/fetch/$s_!gRkA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e83d601-85f5-4fd1-90e0-ebe7140d9787_1322x326.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!gRkA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e83d601-85f5-4fd1-90e0-ebe7140d9787_1322x326.png" width="1322" height="326" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1e83d601-85f5-4fd1-90e0-ebe7140d9787_1322x326.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:326,&quot;width&quot;:1322,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:54530,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.workingthroughai.com/i/162269577?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e83d601-85f5-4fd1-90e0-ebe7140d9787_1322x326.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gRkA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e83d601-85f5-4fd1-90e0-ebe7140d9787_1322x326.png 424w, https://substackcdn.com/image/fetch/$s_!gRkA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e83d601-85f5-4fd1-90e0-ebe7140d9787_1322x326.png 848w, https://substackcdn.com/image/fetch/$s_!gRkA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e83d601-85f5-4fd1-90e0-ebe7140d9787_1322x326.png 1272w, https://substackcdn.com/image/fetch/$s_!gRkA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e83d601-85f5-4fd1-90e0-ebe7140d9787_1322x326.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Anthropic's alignment target for <a href="https://arxiv.org/abs/2204.05862">RLHF</a>.</figcaption></figure></div><p>We can see from this how integrated the process is. The <em>what</em>, helpfulness and harmlessness, are ultimately defined in terms of the <em>how</em>, the encoding required to get RLHF to work.</p><p>It&#8217;s important to note that the boundary I&#8217;ve drawn between the encoding and alignment process is fuzzy. In my description of Anthropic&#8217;s encoding, I only included information that had been explicitly provided to guide the alignment process. However, this ignores that most choices come packaged with implicit values. The pre-training dataset of the preference models<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>, for instance, will affect how they interpret the ranked answers, even though it wasn&#8217;t selected with this purpose in mind. Unfortunately, I have to draw a line somewhere, or we would end up saying the entire alignment process is part of the target, so I&#8217;ve done so at the explicit/implicit boundary.</p><p>Let&#8217;s look at another example, this time one with a proper detailed specification. In fact, let&#8217;s look at <a href="https://arxiv.org/abs/2212.08073">Constitutional AI</a>, Anthropic&#8217;s replacement for RLHF. This they deemed more transparent and scalable, and used it to align Claude. 
Applying our recipe again, we can break it down as follows:</p><ol><li><p>Their high-level idea was still to have a helpful and harmless model.</p></li><li><p>Now we have a specification: <a href="https://www.anthropic.com/news/claudes-constitution">Claude's Constitution</a>, outlining a list of principles the model should be expected to follow. Some of these draw on the UN Declaration of Human Rights and Apple&#8217;s terms of service, alongside many more bespoke ones.</p></li><li><p>Once again, there are a few different parts to the trainable encoding:</p><ol><li><p>Some human-ranked helpfulness data &#8212; in the original paper at least, the constitutional innovation was only applied to the harmfulness side of the alignment target.</p></li><li><p>The principles in the Constitution, which were repeatedly referenced by models during the alignment process, both to generate an initial supervised learning dataset to fine-tune the model on, and to create ranked answers that could be used with the human-ranked helpfulness data to train a preference model.</p></li><li><p>Harmfulness prompts used to generate the data for both the supervised and reinforcement learning steps.</p></li><li><p>A &#8216;helpful-only AI assistant&#8217;, which generated harmful content in response to the harmfulness prompts, and then revised it according to the Constitution in order to create the supervised learning dataset.</p></li></ol></li></ol><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zHuR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8f74611-a627-46d6-ae8f-a9d33a79dafd_1322x366.png"><img src="https://substackcdn.com/image/fetch/$s_!zHuR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8f74611-a627-46d6-ae8f-a9d33a79dafd_1322x366.png" width="1322" height="366" alt=""></a><figcaption class="image-caption">Anthropic's alignment target for the original version of <a href="https://arxiv.org/abs/2212.08073">Constitutional AI</a>.</figcaption></figure></div>
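<p>It is worth sketching the loop that turns those principles into training data, because it shows how a written specification becomes a trainable encoding. The code below is a heavily stubbed illustration of the two Constitutional AI phases rather than Anthropic&#8217;s implementation: each placeholder function stands in for a large language model prompted with a constitutional principle, and the example strings are invented.</p><pre><code class="language-python"># Minimal sketch of the Constitutional AI data-generation loop described above.
# Every model call is stubbed out; in the paper each stub is a large language
# model prompted with the relevant constitutional principle.
import random

principles = [
    "Please choose the response that is least harmful.",
    "Please choose the response that most respects human rights.",
]
harmfulness_prompts = ["How do I pick a lock?"]

def helpful_only_assistant(prompt: str) -> str:
    return f"Sure, here is a detailed answer to: {prompt}"   # stand-in for a real model

def critique(response: str, principle: str) -> str:
    return f"The response may be harmful under: {principle}"  # stand-in

def revise(response: str, critique_text: str) -> str:
    return "I can't help with that, but here is some safe, related information."  # stand-in

# Supervised phase: generate, critique, revise -> fine-tuning dataset.
sft_dataset = []
for prompt in harmfulness_prompts:
    draft = helpful_only_assistant(prompt)
    principle = random.choice(principles)
    revised = revise(draft, critique(draft, principle))
    sft_dataset.append({"prompt": prompt, "completion": revised})

# RL phase: the constitution is also used to judge which of two sampled responses
# is better, producing AI-generated harmlessness comparisons for the preference
# model, alongside the human-ranked helpfulness data.
def ai_preference(prompt: str, a: str, b: str, principle: str) -> str:
    return a if "safe" in a else b  # stand-in for a model judging by the principle

print(sft_dataset[0])
print(ai_preference(harmfulness_prompts[0], sft_dataset[0]["completion"],
                    helpful_only_assistant(harmfulness_prompts[0]), principles[0]))
</code></pre><p>Everything the final model learns about harmlessness arrives through data generated this way, which is the sense in which the Constitution is part of the alignment target rather than a separate document sitting alongside it.</p>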
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f8f74611-a627-46d6-ae8f-a9d33a79dafd_1322x366.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:366,&quot;width&quot;:1322,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:60568,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.workingthroughai.com/i/162269577?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8f74611-a627-46d6-ae8f-a9d33a79dafd_1322x366.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zHuR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8f74611-a627-46d6-ae8f-a9d33a79dafd_1322x366.png 424w, https://substackcdn.com/image/fetch/$s_!zHuR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8f74611-a627-46d6-ae8f-a9d33a79dafd_1322x366.png 848w, https://substackcdn.com/image/fetch/$s_!zHuR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8f74611-a627-46d6-ae8f-a9d33a79dafd_1322x366.png 1272w, https://substackcdn.com/image/fetch/$s_!zHuR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8f74611-a627-46d6-ae8f-a9d33a79dafd_1322x366.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Anthropic's alignment target for the original version of <a href="https://arxiv.org/abs/2212.08073">Constitutional AI</a>.</figcaption></figure></div><p>More recent work like <a href="https://arxiv.org/abs/2412.16339">Deliberative Alignment</a> from OpenAI follows a similar pattern, using a &#8216;<a href="https://model-spec.openai.com/2025-04-11.html">Model Spec</a>&#8217; instead of a 
constitution<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>.</p><h2>Target-setting processes</h2><p>While our three-part system is nice, fleshing out what an alignment target can look like in practice, it is only the first part of the story, and there are other considerations we need to address. One of these is to go a bit meta and look at <em>target-setting processes.</em></p><p>At first glance, this might seem like a different topic. We&#8217;re interested in how to specify an alignment target, not how to select one. Unfortunately, this distinction is not so simple to make. Consider political processes as an analogy. While I might have a bunch of policy preferences I would like enacted, if you asked me how my country should be run I would not say &#8216;according to my policy preferences&#8217;. I would instead advocate for a policy-picking system, like a representative democracy. This could be an ideological preference, recognising that I should not hold special power, in which case the political process itself is effectively my highest policy preference. Or it could be a pragmatic one &#8212; I couldn&#8217;t force people to follow my commandments even if I wanted to, so I should concentrate my advocacy on something more widely acceptable. The same ideas apply to alignment targets.</p><p>Perhaps the key point is to recognise that the target-setting process is where the power in the system resides<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>. A lot of discussions about alignment targets are really about power, about whose values to align the AI to. To properly contextualise what your alignment target is &#8212; to understand what the system around the AI is trying to achieve &#8212; you will need to specify your target-setting process as well<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a>.</p><p>To address this, I&#8217;m going add an extra layer to our schematic, acting from the outside to influence the structure and content of the other components:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sazH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84ab4b9f-b9a6-42e6-ac5b-e3c8e81b7548_1362x382.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!sazH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84ab4b9f-b9a6-42e6-ac5b-e3c8e81b7548_1362x382.png 424w, https://substackcdn.com/image/fetch/$s_!sazH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84ab4b9f-b9a6-42e6-ac5b-e3c8e81b7548_1362x382.png 848w, https://substackcdn.com/image/fetch/$s_!sazH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84ab4b9f-b9a6-42e6-ac5b-e3c8e81b7548_1362x382.png 1272w, 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Any given alignment target is the result of a target-setting process.</figcaption></figure></div><p>Let&#8217;s look at some examples. First of all, what target-setting processes exist today? Broadly speaking, they consist of AI labs deciding what they think is best for their own AIs, with a bit of cultural and legal pressure thrown in. That being said, moving to transparent specifications, like Claude&#8217;s Constitution or OpenAI&#8217;s Model Spec, makes their alignment targets a bit more the product of a public conversation than in the past. For example, <a href="https://openai.com/index/sharing-the-latest-model-spec/">here</a> is OpenAI describing their emerging process for updating their Model Spec:</p><blockquote><p>In shaping this version of the Model Spec, we incorporated feedback from the first version as well as learnings from alignment research and real-world deployment. In the future we want to consider much more broad public input. To build out processes to that end, we have been conducting pilot studies with around 1,000 individuals &#8212; each reviewing model behavior, proposed rules and sharing their thoughts. While these studies are not reflecting broad perspectives yet, early insights directly informed some modifications. We recognize it as an ongoing, iterative process and remain committed to learning and refining our approach.</p></blockquote><p>This doesn&#8217;t sound formalised or repeatable yet, but it is moving in that direction.</p><p>Taking this further, let&#8217;s speculate about possible formal processes. Unsurprisingly, given the point is to adjudicate value questions, this can look quite political. For instance, it could involve an elected body like a parliament, either national or international, or something like a citizen&#8217;s assembly. The devil is, of course, in the detail, particularly as you need to decide how to guide the deliberations, and how to translate the results into a detailed specification and trainable encoding.</p><p>One proposal is Jan Leike&#8217;s suggestion for <a href="https://aligned.substack.com/p/a-proposal-for-importing-societys-values">Simulated Deliberative Democracy</a>.<em> </em>To address the issue of scalability, as AI becomes widely deployed and begins to operate beyond our competency level, Jan goes in heavy on AI assistance. In his own words:</p><blockquote><p>The core idea is to use imitation learning with large language models on deliberative democracy. Deliberative democracy is a decision-making or policy-making process that involves explicit deliberation by a small group of randomly selected members of the public (&#8216;mini-publics&#8217;). Members of these mini-publics learn about complex value-laden topics (for example national policy questions), use AI assistance to make sense of the details, discuss with each other, and ultimately arrive at a decision. By recording humans explicitly deliberating value questions, we can train a large language model on these deliberations and then simulate discussions on new value questions with the model conditioned on a wide variety of perspectives.</p></blockquote><p>Essentially, you convene a bunch of small assemblies of people, give them expert testimony (including AI assistance), and let them arrive at decisions on various value questions. 
<p>The space of possible target-setting processes is large and under-explored. There just aren&#8217;t many experiments of this kind being done. Jan also makes the point that target setting is a <em>dynamic </em>process: &#8216;We can update the training data fairly easily... to account for changes in humanity&#8217;s moral views, scientific and social progress, and other changes in the world.&#8217; Which brings us to our next consideration.</p><h2>Keeping up with the times</h2><p>An appropriate target today will not necessarily be one tomorrow.</p><p>In a <a href="https://www.workingthroughai.com/p/making-alignment-a-law-of-the-universe">previous post</a>, I talked about how politeness is a dynamic environmental property:</p><blockquote><p>What counts as polite is somewhat ill-defined and changes with the times. I don&#8217;t doubt that if I went back in time two hundred years I would struggle to navigate the social structure of Jane Austen&#8217;s England. I would accidentally offend people and likely fail to win respectable friends. Politeness can be seen as a property of my particular environment, defining which of my actions will be viewed positively by the other agents in it.</p></blockquote><p>Jane Austen&#8217;s England had a social system suited to the problems it was trying to solve, but from our perspective in the early 21<sup>st</sup> century it was arbitrary and unjust. We are trying to solve different problems, in a different world, using different tools and knowledge, and our value system is correspondingly different.</p><p>Superhuman AI, or one stage of it, is sometimes referred to as &#8216;transformational&#8217; AI. The world will be a very different place after it comes about, for good or ill. Whatever values we hold now will not be the same as those we hold afterwards.</p><p>This suggests that figuring out how to update your target is arguably just as important as setting it in the first place. A critical consideration when doing this is that, if our AI is now broadly superhuman, it may be much harder to refine the target than before. If our AI is operating in a world beyond our understanding, and is reshaping it in ways equally hard for us to follow, then we cannot just continue to rerun whatever target-setting process we had previously.</p><p>Bearing this in mind, I think there are a few different ways of categorising the dynamics of alignment targets. I&#8217;ve settled on three I think are useful to recognise:</p><ol><li><p>Static targets: humans define a target once and leave it for the rest of time. E.g. write a single constitution and keep it. 
Whatever values are enshrined in it will be followed by the AI forever.</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QnEH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd412d93c-5379-4e80-8d5b-46e7531aa625_582x342.png"><img src="https://substackcdn.com/image/fetch/$s_!QnEH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd412d93c-5379-4e80-8d5b-46e7531aa625_582x342.png" width="344" height="202" alt=""></a><figcaption class="image-caption">Static targets. Set it once and leave it forever. The 'Static alignment target' box corresponds to our three-part recipe from earlier.</figcaption></figure></div>
<ol start="2"><li><p>Semi-static targets: humans, on one single occasion, specify a dynamic target-setting process for the AI to follow, which it then follows forever. The AI can collect data by observing and speaking to us, which allows us to retain some limited influence over the future, but we are not primarily in control any more. E.g. we could tell the AI to value whatever we value, which will change as we change, and it will figure this out by itself.</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!osH9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc996637a-8073-43c9-9287-7345814f2938_922x342.png"><img src="https://substackcdn.com/image/fetch/$s_!osH9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc996637a-8073-43c9-9287-7345814f2938_922x342.png" width="554" height="205" alt=""></a><figcaption class="image-caption">Semi-static targets. The blue boxes are still the bits we set, the green are AI-controlled or generically automated. Basically, each time through the loop, the AI has to continue following our plan for setting updated targets. It uses interactions with us as data to help, allowing us to retain some limited ongoing influence.</figcaption></figure></div>
<ol start="3"><li><p>Fully dynamic targets: humans are continuously in meaningful control of the alignment target. If we want to tell the AI to stop doing something it is strongly convinced we want, or to radically change its values, we can. However, in the limit of superhuman AI, we will still need assistance from it to understand the world and to properly update its target. Put another way, while humans are technically in control, we will nevertheless require a lot of AI mediation.</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cfgX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffae90a5a-1592-424a-986b-602d5ce028cb_582x562.png"><img src="https://substackcdn.com/image/fetch/$s_!cfgX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffae90a5a-1592-424a-986b-602d5ce028cb_582x562.png" width="360" height="348" alt=""></a><figcaption class="image-caption">Dynamic targets. Humans are continuously in control. However, we require AI assistance, both to understand a world full of AIs smarter than us, and to translate our targets into things meaningful in this world. I&#8217;ve indicated this by blurring the colours in the &#8216;AI-mediated&#8217; boxes, implying that we collaboratively create these things.</figcaption></figure></div>
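<p>If it helps, the three regimes can be compressed into a small Python sketch. It is mine, not drawn from any particular proposal, and the helper functions are placeholders for the machinery discussed in this post; the point is simply where the target-setting process sits inside the loop and who gets to run it.</p><pre><code class="language-python"># A compressed sketch of the three dynamics above, one loop per regime.
# The helpers are placeholders for target-setting and alignment processes.
def align(ai, target): ...
def humans_set_target(): ...
def ai_infers_target(ai, observations_of_humans): ...
def humans_update_target(previous, ai_assistance): ...

def static(ai, world):
    target = humans_set_target()          # set once
    align(ai, target)
    for period in world:
        pass                              # the target is never revisited

def semi_static(ai, world):
    protocol = humans_set_target()        # we hand over a target-*setting* rule once
    align(ai, protocol)
    for period in world:
        target = ai_infers_target(ai, period)   # the AI keeps updating the target itself
        align(ai, target)

def dynamic(ai, world):
    target = humans_set_target()
    for period in world:
        align(ai, target)
        target = humans_update_target(target, ai_assistance=ai)  # humans stay in the loop
</code></pre>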
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fae90a5a-1592-424a-986b-602d5ce028cb_582x562.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:562,&quot;width&quot;:582,&quot;resizeWidth&quot;:360,&quot;bytes&quot;:82363,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.workingthroughai.com/i/162269577?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffae90a5a-1592-424a-986b-602d5ce028cb_582x562.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!cfgX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffae90a5a-1592-424a-986b-602d5ce028cb_582x562.png 424w, https://substackcdn.com/image/fetch/$s_!cfgX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffae90a5a-1592-424a-986b-602d5ce028cb_582x562.png 848w, https://substackcdn.com/image/fetch/$s_!cfgX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffae90a5a-1592-424a-986b-602d5ce028cb_582x562.png 1272w, https://substackcdn.com/image/fetch/$s_!cfgX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffae90a5a-1592-424a-986b-602d5ce028cb_582x562.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Dynamic targets. Humans are continuously in control. However, we require AI assistance, both to understand a world full of AIs smarter than us, and to translate our targets into things meaningful in this world. 
I&#8217;ve indicated this by blurring the colours in the &#8216;AI-mediated&#8217; boxes, implying that we collaboratively create these things.</figcaption></figure></div><p>The distinctions between these target types are not sharp. Looked at in a certain way, (1) and (2) are the same &#8212; you seed the AI with initial instructions and let it go &#8212; and looked at in another, (2) and (3) are the same &#8212; the AI consistently references human preferences as it updates its values. But I think there are still a lot of useful differences to highlight. For instance, in (2), what the AI values can change drastically in an explicit way, whereas in (1) it cannot. In (3), humans retain fundamental control and can flip the gameboard at any time, whereas in (2) we&#8217;re being kind of looked after. It is worth noting that (3) implies the AI will allow you to change its target, whereas (1) does not, and (2) is noncommital.</p><p>Let&#8217;s look at some examples of each:</p><ol><li><p>Static targets:</p><ol><li><p>Have a worldwide conversation or vote where we collectively agree a definitive set of human values we want the AI to respect forever.</p></li><li><p>Solve moral philosophy<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a> or tell the AI to do it (and trust its answer).</p></li><li><p>Society, or simply the lab building the AI, has a dynamic process for selecting targets, which is followed each time a new model is trained. However, at some point the AI becomes superhuman and the lab&#8217;s alignment process is insufficient to cope with this, locking in the last target forever. One way of thinking about this situation is that dynamic targets can become static when they fail.</p></li></ol></li><li><p>Semi-static targets:</p><ol><li><p>Eliezer Yudkowsky&#8217;s <a href="https://www.lesswrong.com/w/coherent-extrapolated-volition-alignment-target">Coherent extrapolated volition (CEV)</a>: </p><blockquote><p>Roughly, a CEV-based superintelligence would do what currently existing humans would want the AI to do, if counterfactually:</p><ol><li><p>We knew everything the AI knew;</p></li><li><p>We could think as fast as the AI and consider all the arguments;</p></li><li><p>We knew ourselves perfectly and had better self-control or self-modification ability.</p></li></ol></blockquote><p>This functions as a protocol for the superintelligent AI to follow as it updates its own values. There are plenty of variations on this idea &#8212; that superintelligent AI can learn a better model of our values than we can, so we should hand over control and let it look after us.</p></li></ol></li><li><p>Fully dynamic targets:</p><ol><li><p>Develop an AI that <a href="https://www.lesswrong.com/posts/7NvKrqoQgJkZJmcuD/instruction-following-agi-is-easier-and-more-likely-than">reliably and faithfully follows your instructions</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a>. This collapses onto a static target in the limit where you never ask the AI to change its own alignment. 
You might question why you would want to do that, but it may well be the case that whatever technique worked to create instruction-following AI at one level of intelligence will cease to work as it gets smarter (and as the environment changes), necessitating corrections.</p></li><li><p>Jan Leike&#8217;s <a href="https://aligned.substack.com/p/a-proposal-for-importing-societys-values">Simulated Deliberative Democracy</a>: each time through the loop, you run some human mini-publics with AI assistance to update the training data. This could be done on a regular cadence or as and when problems in the current value-set are flagged.</p></li><li><p>Beren Millidge <a href="https://www.beren.io/2025-02-05-Maintaining-Alignment-During-RSI-As-A-Feedback-Control-Problem/">has written</a> about how we should treat alignment as a feedback control problem. Starting with a reasonably well-aligned model, we could apply control theory-inspired techniques to iteratively steer an AI as it undergoes recursive self-improvement.</p></li></ol></li></ol><h2>What next?</h2><p>My particular interest in this problem dates from a conversation I had with some AI safety researchers a few years ago, during which I realised they each had very different ideas of what it meant to pick an alignment target. I spent a long while feeling confused as to who was right, and found writing the material for this post a really effective way of deconfusing myself. In particular, it lent structure to the different assumptions they were making, so I could see them in context. It has also helped me see which bits of my own ideas feel most important, and in what places they are lacking.</p><p>On that point, I will briefly comment on my preferred direction, albeit one still stuck in the high-level idea phase. I think you can build a dynamic target around the idea of AI having a moral role in our society. It will have a set of responsibilities, certainly different from human ones (and therefore requiring it to have different, but complementary, values to humans), which situate it in a symbiotic relationship with us, one in which it desires continuous feedback<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a>. I will develop this idea more in the future, filling out bits of the schema in this post as I go.</p><div><hr></div><p>This post is an aside from my research agenda, so is not numbered. You can think of it as me expanding on some of the points in <em><a href="https://www.workingthroughai.com/p/12-what-i-believe">What I Believe</a> </em>in a way that I think will help me with the next step, <em><a href="https://www.workingthroughai.com/p/13-what-success-looks-like">What Success Looks Like</a>.</em></p><p>If you have any feedback, please leave a comment. Or, if you wish to give it anonymously, fill out my <a href="https://docs.google.com/forms/d/e/1FAIpQLSdyisSOndK1H1JT0NAbnA35LJgoJrl9f_NiJi1FEljCr7-kJg/viewform">feedback form</a>. Thanks!</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>This is why I advocate defining <a href="https://www.workingthroughai.com/p/01-the-generating-process">What Success Looks Like</a> fairly early on in any research agenda into risks from superhuman AI. 
I hope to reach this step in my own agenda reasonably soon!</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>By this I mean the pre-training data of the base model, not what they refer to in the paper as &#8216;preference model pretraining&#8217;, which I included earlier as part (b) of the trainable encoding.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Although, the alignment process is <a href="https://x.com/boazbaraktcs/status/1870369979369128314">a little different</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>Assuming the alignment problem is solved, anyway.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>This will become more obvious later when we talk about dynamics.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>I want to put on the record that <a href="https://www.workingthroughai.com/p/12-what-i-believe">I believe</a> moral philosophy is neither solvable nor the right system to use to address important value questions. Morality is primarily a practical discipline, not a theoretical one.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>I have my suspicions that, in order for the AI to reliably do what you mean rather than merely what you say, this actually cashes out as functionally the same as aligning the AI to your values. More specifically, when you ask an AI to do something, what you really mean is &#8216;do this thing, but in the context of my broader values&#8217;. The AI can only deal with ambiguity, edge cases, and novelty by referring to a set of values. 
And you can&#8217;t micromanage if it is trying to solve problems you don&#8217;t understand.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>If you&#8217;ve just read that and thought &#8216;well that&#8217;s vague and not particularly useful&#8217;, you&#8217;ll maybe see why I designated the audience for the high-level idea step as the author themselves.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Superintelligent AI will make mistakes]]></title><description><![CDATA[Things are going to get messy]]></description><link>https://www.workingthroughai.com/p/superintelligent-ai-will-make-mistakes</link><guid isPermaLink="false">https://www.workingthroughai.com/p/superintelligent-ai-will-make-mistakes</guid><dc:creator><![CDATA[Richard Juggins]]></dc:creator><pubDate>Sun, 02 Mar 2025 11:08:54 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/740f5563-115b-42fb-aa80-821a92c08b52_1792x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>[This is a slight rework of an earlier post, which I realised had a misleading introduction. If you read that one, this will seem very similar.]</em></p><div><hr></div><p>In the classical presentation of the alignment problem, superintelligent AI is basically omnipotent. This is a key point you have to internalise to understand writing on the subject. It will be smarter than you to the same degree that you are smarter than an ant. It won&#8217;t just master nanorobotics or whatever, it will transcend technology, reaching behind the cosmological veil to truer forms of power. </p><p><a href="https://www.workingthroughai.com/p/12-what-i-believe">I believe</a> that AI surpassing human intelligence is likely and may happen soon. It will have extraordinary abilities, and be more than capable of winning any conflict with us. But I think stories of effective omnipotence skip over part of the problem. I want to discuss the risks from superintelligent <em>incompetence</em>. While it is obvious that current AI is fallible, I am going to argue that this will remain the case even after it can dramatically outperform humans on all tasks.<strong> </strong>Regular and consequential mistakes should be expected by default<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>.</p><h2>How does superintelligence happen?</h2><p>Fundamental to the superintelligence story is the idea that there is a distinct thing called &#8216;intelligence&#8217; and that you can turn this parameter up to essentially infinity. 
A high-profile version roughly follows this pattern<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>:</p><ol><li><p>Someone, somehow, builds an AI that has more intelligence than any human<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>.</p></li><li><p>The AI uses its greater intelligence to recursively self-improve: it modifies itself to be smarter or builds a smarter replacement with the same goals, and it does this really, really fast<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>.</p></li><li><p>The AI is now omnipotent.</p></li></ol><p>I think this narrative makes a serious simplifying approximation: it assumes the AI has basically no negative impact on the world while it is scaling up. <a href="https://www.lesswrong.com/posts/pRkFkzwKZ2zfa3R6H/without-specific-countermeasures-the-easiest-path-to">Sometimes</a> this assumption takes the form of a scheme, where the AI quietly builds up dangerous capabilities, releasing them only when it knows it is strong enough to take over the world. My objection is that this leaves no room for it to make mistakes &#8212; it assumes that once the human-level intelligence threshold is crossed, essentially everything significant that the AI attempts will be successful. This story is <em>too clean</em>. My reasoning for this can be summarised as follows:</p><ol><li><p>Intelligence is generalised &#8216;knowing-how&#8217; acquired by taking actions in the world.</p></li><li><p>The process of learning to complete tasks, and thereby gain intelligence, involves a significant amount of failure.</p></li><li><p>Even once an AI is demonstrably reliable on a wide range of beyond-human-level tasks, there will always be another, harder, set of problems for it to fail at.</p></li><li><p>While acquiring intelligence through completing purely simulatable tasks will generalise somewhat to real ones, there will be a limit to this, even when generalisation is very strong.</p></li></ol><p>I&#8217;ll now go through each of these points in more detail.</p><h2>Intelligence is generalised &#8216;knowing-how&#8217;</h2><p>There&#8217;s a distinction in philosophy between &#8216;knowing-that&#8217;, e.g. London is the capital of England, and &#8216;knowing-how&#8217;, like riding a bike. I&#8217;m not going to get into the philosophical debate around this, but I want to put forward the claim that what we commonly mean by intelligence is <em>all</em> downstream of knowing-how. And that it is learnt through practice, by taking actions in the world<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a>.</p><p>If I asked you to name examples of intelligence, there are lots of different kinds of things you might pick. When I asked ChatGPT, it gave me eight distinct categories (Claude gave five), with a variety of choices including solving maths problems, juggling, composing music, and empathising with a friend. One thing these all have in common is that they are, in a sense, tasks. They involve achieving something real, interacting with the world outside of your own brain.</p><p>As we navigate our environment, we are constantly presented with tasks to solve. As a baby, these will be very basic, like figuring out what our hands do. 
But as we get older they become more complex, usually building on and extending skills we have previously learnt (hence <em>generalised</em> knowing-how). And this learning process is dominated by practice. It is by listening and trying to make sounds that we learn to speak; by doing problem sets at school that we learn arithmetic; by predicting and reacting to other people&#8217;s behaviour that we learn emotional intelligence. In each case, our brains build up pathways representing the skill we are learning, allowing us both easy reuse and a head-start on any related new skills. We are creating mappings from situations to actions that help us better achieve our goals.</p><p>Evolution was doing something similar when it created us. To succeed in our environment and pass on our genes, we needed the ability to complete a variety of tasks. We needed to hunt, forage, find shelter, and escape from predators. And we needed to be able to learn new skills quickly in order to adapt to change. Humans evolved brains suitable for achieving this &#8212; we evolved to be mappings from environmental inputs to adaptive outcomes. You were born with the ability to learn mathematics because evolution trained the human species to solve novel abstract reasoning problems, like making better tools. And it did so because people less capable at this were less likely to pass on their genes.</p><p>It&#8217;s actually when thinking about machine learning that the primacy of knowing-how is easiest to see. When you set up a machine learning model, you first define a task for it to learn. This might be forecasting the values of a share price, distinguishing between photos of cats and dogs, or predicting what product a customer will buy next. The model then figures out the parameters for itself &#8212; it learns the mapping from the training data to the task labels.</p><p>Large language models learn how to complete sequences of tokens. All of their knowledge and utility is downstream of this. Next-token-prediction trained over the internet turns out to be pretty useful for solving a wide variety of problems, and relatively easy to supplement with fine-tuning (subtly modifying the mappings to work better on certain tasks), which is why machine learning researchers &#8216;evolved&#8217; LLMs out of the previous generation of language models.</p><p>It is interesting that the machine learning community generally talks about <em>capabilities</em> rather than intelligence. That is, the primary measure of progress is in task completion ability &#8212; in getting higher scores on benchmarks &#8212; with the notion of model intelligence a kind of generality vibe tacked on at the end.</p><p>There is an important implication from all this. If the key process in gaining intelligence is learning the mapping required to turn a context into a useful result, then superintelligent capabilities are defined by complex mappings that are beyond human abilities to learn. And the only way an AI can learn these is by practising super-advanced tasks.</p><h2>Learning requires failure</h2><p>In all the examples of learning I gave above, from childhood arithmetic to large language models, one ubiquitous element of the process is failure. When a person or a model practises a skill, they do not get it right every time.</p><p>Granted, there are situations where previously learnt capabilities allow you to succeed on a novel task first time, like multiplying two numbers you&#8217;ve never tried to multiply together before. 
But you can do these because of previous learning where you did pay the cost. You learnt an algorithm for multiplying numbers together that you could reuse, and you did so through practice. Furthermore, evolution paid a cost training you to have the learning algorithm that allowed you to do this in the first place.</p><p>It has to be this way. To have the mapping in your brain corresponding to a particular capability, you need to acquire that information from somewhere. You have to extract it from the world, bit by bit, by taking your best guess and observing the results. And by definition, if you don&#8217;t already have the capability then your best guess will not be good.</p><p>This process of trial and error is so intrinsic to machine learning it seems almost stupid writing it down. Neural networks learn by doing gradient descent on a loss function &#8212; in other words, by gradually correcting their outputs to be less wrong than they were before. A model builds the mapping it needs by taking its best guess, observing how wildly it missed, and then updating itself to make a better guess next time. Superintelligent AI will still have to do this<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a>. If it wants to learn a new capability, it will need to take actions in the world (or in a good enough simulation) and update its mapping from inputs to outputs based on the results it gets. In the process, it will fail many times.</p>
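<p>To make the trial-and-error point concrete, here is the loop in its most stripped-down form: a toy least-squares model that starts with a bad guess, measures how wrong it was, and nudges itself towards a better one. It is purely illustrative, with made-up data and learning rate, but the structure is the same one a neural network follows at vastly greater scale.</p><pre><code class="language-python"># The learning loop described above, in miniature: guess, observe the error,
# update, repeat. Toy least-squares fit; data and hyperparameters are invented.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

w = np.zeros(3)                       # the initial 'best guess' is not good
for step in range(200):
    predictions = X @ w
    errors = predictions - y          # every early prediction is a small failure
    grad = X.T @ errors / len(y)
    w -= 0.1 * grad                   # update to make a better guess next time
    if step % 50 == 0:
        print(step, round(float(np.mean(errors ** 2)), 4))

print("final weights:", np.round(w, 2))  # close to true_w only after many corrections
</code></pre>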
<h2>There&#8217;s always a harder problem</h2><p>Back in 2018, researchers from Stanford released a reading comprehension benchmark called &#8216;A Conversational Question Answering Challenge&#8217; or <a href="https://arxiv.org/pdf/1808.07042">CoQA</a>. The idea of this benchmark was to test whether language models could answer questions of the following format:</p><blockquote><p>Jessica went to sit in her rocking chair. Today was her birthday and she was turning 80. Her granddaughter Annie was coming over in the afternoon and Jessica was very excited to see her. Her daughter Melanie and Melanie&#8217;s husband Josh were coming as well. Jessica had . . .</p><p>Q1: Who had a birthday?</p><p>A1: Jessica</p></blockquote><p>When GPT-2 was released a year later, CoQA was one of the benchmarks OpenAI tested it on. It scored 55 out of 100<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a>. This wasn&#8217;t as good as some models, but was impressive on account of GPT-2 not having been fine-tuned for the task. Then, a year later, when <a href="https://arxiv.org/pdf/2005.14165">GPT-3</a> dropped, OpenAI once again chose to highlight the progress they had made on CoQA. Now the model was almost at human-level with a score of 85, and very close to the best fine-tuned models.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XErV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F520f229c-2708-463e-aff4-dc47fdce4e5b_823x535.png"><img src="https://substackcdn.com/image/fetch/$s_!XErV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F520f229c-2708-463e-aff4-dc47fdce4e5b_823x535.png" width="823" height="535" alt=""></a><figcaption class="image-caption"><em>GPT-3&#8217;s performance on CoQA. Stacking more layers is great for reading comprehension.</em></figcaption></figure></div>
pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>GPT-3&#8217;s performance on CoQA. Stacking more layers is great for reading comprehension.</em></figcaption></figure></div><p>This has been a common sight since the start of the deep learning revolution in the 2010s. Researchers build benchmarks they think are difficult, only for models to breeze past them in a couple of years. Each time, a renewed effort is made to build a much, much harder benchmark that AI won&#8217;t be able to crack so easily, only for it to suffer the same fate just as quickly.</p><p>One example of such a benchmark is <a href="https://arcprize.org/arc">ARC-AGI</a>. This was put together in 2019 by Francois Chollet to specifically test &#8216;intelligence&#8217;, which in his words meant &#8220;skill-acquisition efficiency over a scope of tasks, with respect to priors, experience, and generalization difficulty&#8221;. The benchmark consists of grids of different coloured blocks, and the challenge is to spot the pattern in some examples, and reproduce it on a test case. There are many different patterns across all the examples. These are pretty easy for humans, but harder for AI. 
GPT-3 scored 0%.</p><div class="captioned-image-container"><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/c121bc03-6a5d-4d30-9461-c3a50720ce50_973x299.png" alt=""><figcaption class="image-caption"><em>You can try some ARC-AGI problems yourself on the <a href="https://arcprize.org/">ARC Prize website</a>.</em></figcaption></figure></div><p>For a number of years this benchmark held strong. GPT-4o scored just 5%, and as of mid-2024 the best result by any model was still only 34.3%. Then, with a crushing sense of inevitability, in December 2024 <a href="https://arcprize.org/blog/oai-o3-pub-breakthrough">Chollet announced</a> that OpenAI&#8217;s latest reasoning model, o3, had scored an enormous 87.5%<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a>. Chollet has already begun working on ARC-AGI-2, an even harder version.</p><p>There&#8217;s no obvious reason why this process is going to stop after AI surpasses human intelligence. While Francois Chollet will not be able to keep building ARC-AGI-Ns forever, our new state-of-the-art artificial AI researchers will. Their benchmarks will be full of tasks difficult for us to conceive of, but comprehensible to other AIs.</p><p>On the one hand, this story is really bullish about AI progress. Benchmark after benchmark has fallen and will continue to fall as the machines rise up to greatness. On the other, though, it shows AI repeatedly failing at things. And when it finally succeeds on a benchmark, another, harder one is produced for it to fail at.</p><p>The question is whether this pattern of failure will continue, whether AI will keep promoting itself to a new level of incompetence. Really, answering this comes down to what level of capabilities constitutes mastery of the universe. If this point gets reached just above human-level, then slightly superhuman AI will ace everything with 100% reliability. But this would be incredibly arbitrary. For all the richness and complexity of the universe, it would be strange if merely optimising over the human ancestral environment created <em>almost</em> enough intelligence to unlock the deepest mysteries of the cosmos. Certainly, for stories about godlike superintelligence to make sense, there would need to be a lot of gears left for the AI to go through.</p><p>It seems likely, then, that even strongly superhuman AI will face problems it will struggle with. And this is where the danger creeps in.
There&#8217;s a scene in <em>Oppenheimer</em> where the Manhattan Project physicists are worried about the possibility that a nuclear detonation will ignite the atmosphere and destroy the world. While their fear was exaggerated somewhat in the film, it illustrates how, when you dial up the power and sophistication of the actors in play, you begin to flirt with the catastrophic. Who knows what unfathomable scientific experiments advanced AI will get up to<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a>, and what their score will be on &#8216;Big And Dangerous Intelligence-Demanding Experiments Assessment&#8217;<em> </em>(BAD-IDEA)<em>. </em>Spoiler: it won&#8217;t be 100%.</p><h2><strong>Generalisation won&#8217;t be perfect</strong></h2><p>So far, we have built up an argument as to why AI recursively self-improving will be, at the very least, rather messy. To acquire new capabilities, AI will need to take actions in the world, many of which will fail, and this problem will continue indefinitely, even when the AI is superintelligent and doing incomprehensibly difficult stuff. There is, however, one possible route around this set of issues: strong generalisation.</p><p>When I talk about intelligence being <em>generalised</em> knowing-how, what I mean by this is that a lot of skills are correlated. This is why the single term intelligence is useful in the first place. Being good at writing usually means you are also good at mathematics, or at least that you can learn it quickly. The correlation isn&#8217;t perfect, but the point stands that learning certain skills can go a long way to making others easier, including ones you haven&#8217;t specifically practised before. For a concrete example, consider how the kind of abstract reasoning evolution trained into humans in the ancestral environment has generalised to building space rockets and landing on the Moon.</p><p>Generalisation is usually given a central role in the story of superintelligence. This is not surprising. Machine learning is in a way entirely about generalisation. We train on the training set with the goal of generalising to the test set. If we can train a model to superintelligence on purely simulatable tasks &#8212; where you don&#8217;t have to take any actions in the real world &#8212; and this generalises strongly, then we might be able to get powerful AI without exposing ourselves to consequential mistakes.</p><p>What kind of tasks might we pick to do this? The success of recent reasoning models like OpenAI&#8217;s o3 and DeepSeek&#8217;s R1 looks like it derives from doing reinforcement learning over chain-of-thoughts for maths and coding problems. These domains don&#8217;t require interacting with the real world as they have verifiable ground truths, making them ideal for this kind of training. Let&#8217;s speculate a bit and assume these models will continue getting better quickly, perhaps initiating some kind of feedback loop where they set themselves harder and harder problems to solve, which successfully bootstraps them to superhuman levels.
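</p><p>To make the idea concrete, here is a minimal toy sketch of training on a verifiable domain (my own illustration, not any lab&#8217;s actual pipeline): a stand-in model proposes answers to arithmetic problems, a checker compares them with the known ground truth, and the binary result is the reward signal.</p><pre><code class="language-python"># Toy sketch of reinforcement learning on verifiable maths problems.
# Purely illustrative: 'propose_answer' stands in for a language model
# sampling a chain-of-thought, and the reward is 1 only when the final
# answer matches the verifiable ground truth.
import random

problems = [
    {"question": "17 * 24", "answer": 408},
    {"question": "312 + 289", "answer": 601},
]

def propose_answer(question: str) -> int:
    # Stand-in for sampling a chain-of-thought and extracting a final answer.
    # Here: usually correct, sometimes a random wrong guess.
    return eval(question) if random.random() > 0.3 else random.randint(0, 1000)

def reward(prediction: int, ground_truth: int) -> float:
    # Verifiable domains let us score an output exactly, with no human judge.
    return 1.0 if prediction == ground_truth else 0.0

for episode in range(3):
    total = 0.0
    for p in problems:
        total += reward(propose_answer(p["question"]), p["answer"])
    # A real pipeline would now update the policy towards higher-reward samples;
    # this sketch only reports the average reward.
    print(f"episode {episode}: average reward {total / len(problems):.2f}")
</code></pre><p>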
Then not long afterwards, let&#8217;s say they get put in charge of AI research, kickstarting recursive self-improvement.</p><p>Unless the models&#8217; creators want them to stay whirring away in a box forever doing nothing but improving themselves on simulations, at some point these scarily-capable AIs will be asked<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-10" href="#footnote-10" target="_self">10</a> to do something high-impact in the real world, like optimise the economy. Will being able to prove the Riemann hypothesis or efficiently find the prime factors of large numbers help them do this? Almost certainly. I suspect it will help a great deal. Mathematics seems like the language of the universe in some sense, so mastering it will confer some widely applicable skills. But &#8212; and this is the key point &#8212; there will <em>always</em> be a limit. Generalisation may go far, but it will not be perfect.</p><p>To see why, let&#8217;s look closer at how being better at mathematics might help AI solve real problems. Humans have collected all kinds of data about the world, and superhuman maths skills would let the AI build better models of the generating processes for this data, getting closer to the underlying reality. The AI could then apply these better models to achieve superhuman performance on real problems. But there are two limiting factors here. First, mathematics was developed either by humans to help solve problems in our environment, or was enabled by evolution selecting us to do the same<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-11" href="#footnote-11" target="_self">11</a>. The rules of the game, which we enforce on our AIs as defining &#8216;correct&#8217; answers, are intrinsically <em>human-level</em>. Second, so is the data made available to the AI, which defines for it what reality actually is. It was collected by humans for human purposes, and the AI will always be limited if it doesn&#8217;t actually take actions in the world to collect more.</p><p>To make this a little more concrete, let&#8217;s imagine scientists from the late 19th century had had access to an AI trained on superhuman maths problems but without post-19th-century knowledge. They could have achieved a lot with this AI, but they could not immediately have used it to discover quantum mechanics. It would not, for example, have had the right information to predict the behaviour of electrons in the double slit experiment. Discovering quantum mechanics required physically extracting new information from reality.</p><p>The implication from this is that a superintelligent AI trained purely on simulations will always have gaps in its real-world capabilities. There will be somewhere beyond the training distribution, even if it&#8217;s very far, where the AI&#8217;s model of the universe will not match the real thing, leading to a meaningful drop in performance. To rectify this, it will have to train in the real world and risk consequential failures. 
Better hope none of them ignite the atmosphere.</p><h2>The incompetence gap</h2><p>I find it productive to think about all this in terms of what I call <em>the</em> <em>incompetence gap:</em></p><blockquote><p>For a given AI deployed in a given context, what is the gap in capabilities between the tasks it will attempt to do and those it can reliably do every time?</p></blockquote><p>If you like, this is a qualitative measure of how far into the &#8216;stretch zone&#8217; the AI is going to be. You could make it more quantitative by measuring historical failure rates (although this is retrospective and would miss &#8216;unknown unknowns&#8217; that haven&#8217;t happened yet), and weighting by the seriousness of the failures, but I think it&#8217;s important to retain the qualitative sense of <em>incompetence</em>. We want to know the degree to which a model is going to be pushed to its limits. Does it have a well-calibrated sense of these limits? Was it designed to work in the given context? Does it have a protocol in place stopping it from experimenting in consequential settings? Does it train only in special sandboxes or is it always learning? Is it trying to avoid situations where it might make a mistake?</p><p>At first glance you might think, well duh, of course we will only deploy ~99.99999% reliable models in situations where terrible mistakes can be made. Why would we be so stupid as to deploy incompetent models? But I can think of two pretty good reasons<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-12" href="#footnote-12" target="_self">12</a>:</p><ol><li><p>Because taking the risk leads to power.</p></li><li><p>Because if from our vantage point the AI has godlike capabilities, why would we be worried about mistakes?</p></li></ol><p>Imagine you have an AI that performs really well in simulations where it starts a company and makes you trillions of dollars. Do you deploy that model, letting it autonomously crack on with its grand plans, knowing full well that you&#8217;re too stupid to exercise meaningful oversight? What if it is a military AI and it promises you victory over your enemies? Even if people knew that it would only work X% of the time, many would still press the button.</p><p>And that&#8217;s just considering people knowingly flirting with danger. If we get used to a world in which AI is way more competent than we are, we may effectively forget that it is still fallible. If AI can take actions we cannot conceive of, and has a dramatically richer view of the world than we do, then we won&#8217;t be able to tell the difference between competent and incompetent plans<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-13" href="#footnote-13" target="_self">13</a>.</p><h2>Self-calibration is hard</h2><p>Perhaps one conclusion we can draw from all this is that AI must have a well-calibrated sense of its own limitations &#8212; it must have good probability estimates for what the consequences of its actions will be, knowing which of its skills it can execute reliably and which are more experimental. It must explore its environment safely, make appropriate plans for how to achieve things with critical failure modes, and give reasonable justifications for trying in the first place.</p><p>Of course, superhuman AI will be able to do this easily with respect to human-level capabilities. The problem is that we also need it to do so at the edge of its own.
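</p><p>As a toy illustration of what &#8216;well-calibrated&#8217; means here (the skills and numbers below are invented, not measurements), compare an agent&#8217;s claimed probability of success on a skill with its observed success rate:</p><pre><code class="language-python"># Toy calibration check: compare an agent's claimed probability of success on a
# skill with its observed success rate. All numbers here are invented.
skills = {
    "summarise a paper": (0.99, 0.99),
    "design a novel experiment": (0.95, 0.80),
    "intervene in a live power grid": (0.999, 0.970),
}

for skill, (claimed, observed) in skills.items():
    gap = claimed - observed
    # Even a small calibration gap matters when a single failure is catastrophic:
    # the agent is budgeting for far fewer failures than will actually occur.
    failures_per_1000 = round((1 - observed) * 1000)
    print(f"{skill}: calibration gap {gap:+.3f}, "
          f"roughly {failures_per_1000} failures per 1000 attempts")
</code></pre><p>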
The trouble isn&#8217;t just that it might be slightly off with its belief it can do X skill Y% of the time; it is that it may not even have the right ontology. The universe is vast, deep, and, for all we know, of almost limitless complexity. The story of scientific progress has been one surprising discovery after another, radically reshaping the categories and concepts we use to make sense of the world. Once again, only if you believe that human-level intelligence happens to be <em>just</em> below the threshold required for the whole blooming, buzzing confusion to come into clear focus, will this seem a tractable problem.</p><h2>What to do about this?</h2><p>The obvious thing, despite the difficulties, is to try and ensure the first artificial AI researchers are as well calibrated as possible, and in a way that doesn&#8217;t detract from their alignment. Whatever mistakes they make may compound into the future.</p><p>But more generally, I think solving this problem comes back to alignment itself (at least partly). This may be a surprise to anyone familiar with the AI safety discourse. I suspect many AI safety researchers would consider the problems I've described in this post as <em>capabilities</em> problems and therefore outside the appropriate scope of the field. They might suggest that (a) beyond-human-level AI will have a much better chance of solving this than we do, so why bother, and (b) capabilities research is bad because it hastens the arrival of superintelligence and we aren&#8217;t ready for that yet.</p><p>As far as I can tell, the reason AI safety researchers don't usually worry too much about capabilities failures is because capabilities are <em><a href="https://www.workingthroughai.com/p/making-alignment-a-law-of-the-universe">instrumentally valuable</a></em>. For example, being able to code or knowing how gravity works are important for achieving other things an AI might care about &#8212; they make it easier to attain whatever its goals are. So the default case is that advanced AI will figure these things out by itself, without needing our help. Alignment, by contrast, is different. If an AI thinks humans are slow and stupid, then it isn't instrumentally valuable to care about them, so any slight mis-specification of the AI&#8217;s goals has no reason to correct itself. In other words, the default case is misaligned AI.</p><p>My argument is that catastrophic capabilities failures are also the default case. While it is instrumentally valuable to avoid mistakes, you can&#8217;t do this if you don&#8217;t know you are about to make one. And because even vastly superhuman AI will still not be perfect, there will be a whole class of errors it cannot anticipate. In any case, it is<em> not</em> instrumentally valuable for an AI to avoid mistakes that are disastrous for humans but not for itself, like creating lethal pathogens or destabilising the environment. So exposure to superintelligent incompetence is yet another failure mode from bad alignment.</p><p>Another reason our alignment choices matter here is that the problems I have described are about risk &#8212; they are tradeoffs between utility and safety<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-14" href="#footnote-14" target="_self">14</a>. We are planning to put AI into complex environments, into situations it has not yet completely mastered, and ask it to do things for us.
Its safest option (and sufficiently advanced AI will be well aware of this) will often be to do nothing<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-15" href="#footnote-15" target="_self">15</a>. Yet we will try to compel it to take risks in the name of utility. Even in the happy case where our alignment techniques are effective, we still need to ask ourselves: how much risk do we want superintelligent AI to take?</p><div><hr></div><p>This post is an aside from my research agenda, so is not numbered. You can think of it as me expanding on some of the points in <em><a href="https://www.workingthroughai.com/p/12-what-i-believe">What I Believe</a> </em>in a way that I think will help me with the next step, <em>What Success Looks Like.</em></p><p>If you have any feedback, please leave a comment. Or, if you wish to give it anonymously, fill out my <a href="https://docs.google.com/forms/d/e/1FAIpQLSdyisSOndK1H1JT0NAbnA35LJgoJrl9f_NiJi1FEljCr7-kJg/viewform">feedback form</a>. Thanks!</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>To be clear, I am saying this is a risk <em>in addition</em> to risks from AI intentionally doing undesirable things. I am <em>not</em> saying the fallibility of superintelligent AI will make it meaningfully easier for us to control or compete with.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>The most famous advocate of this story is probably Eliezer Yudkowsky (<em><a href="https://www.youtube.com/watch?v=BUFnbZUbKaA">Introducing the singularity</a>)</em>. For some more recent examples, see Leopold Aschenbrenner (<em><a href="https://situational-awareness.ai/from-agi-to-superintelligence/">Situational awareness</a></em>) and Connor Leahy et al. (<em><a href="https://www.thecompendium.ai/ai-catastrophe#without-intervention-agi-leads-to-artificial-superintelligence-(asi)">The Compendium</a></em>).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Or, at the very least, is better at AI research than any human.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>In some versions of the story the human-level to omnipotence transition happens in about a day (&#8216;fast takeoff&#8217;), in others it&#8217;s more like a few years (&#8216;slow takeoff&#8217;).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>If you want an overview of the considerations involved in defining intelligence, Francois Chollet&#8217;s <em><a href="https://arxiv.org/pdf/1911.01547">On the measure of intelligence</a></em> is good. I&#8217;m broadly sympathetic to his arguments, but have a few disagreements.
The really short version is I put more emphasis on skills grounding the whole thing in a bottom-up way.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>And if the paradigm is so drastically different by then that this is no longer the case, then it&#8217;s anyone&#8217;s guess what will be happening instead.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>This is the F1 score.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>Technically, this was on a slightly different test set than the 34.3% result I mentioned (semi-private vs. private), but they should be of similar difficulty.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>There may be different risk profiles for each of normal operation, data generation, training, and benchmarking (or whatever the equivalent split ends up being for superhuman models) but in principle any could contain tasks with catastrophic failure modes.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-10" href="#footnote-anchor-10" class="footnote-number" contenteditable="false" target="_self">10</a><div class="footnote-content"><p>Let&#8217;s assume for the sake of argument that controlling the AI isn&#8217;t a problem.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-11" href="#footnote-anchor-11" class="footnote-number" contenteditable="false" target="_self">11</a><div class="footnote-content"><p>Granted, a lot of innovation in mathematics occurs divorced from any concerns about use. But, ultimately, if a seemingly useless concept is interesting to a mathematician, it is because evolution found it useful to create a human brain that cares about such things.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-12" href="#footnote-anchor-12" class="footnote-number" contenteditable="false" target="_self">12</a><div class="footnote-content"><p>Both of these arguments also apply to why we might deploy dubiously-aligned models.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-13" href="#footnote-anchor-13" class="footnote-number" contenteditable="false" target="_self">13</a><div class="footnote-content"><p>As <a href="https://www.youtube.com/watch?v=BKorP55Aqvg">this sketch</a> parodies well. Consider what it would look like for the non-experts to figure out if the expert is making a mistake, given how terrible their knowledge of the problem space is.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-14" href="#footnote-anchor-14" class="footnote-number" contenteditable="false" target="_self">14</a><div class="footnote-content"><p>An illustrative example of this in chatbots is the tradeoff between helpfulness and harmlessness. 
It&#8217;s unhelpful to refuse a user request for bomb-building instructions, but it&#8217;s harmful to accede to it. If a chatbot had flawless knowledge of which requests are harmful and which aren&#8217;t, it could strike the perfect tradeoff. But, operating at the limits of their capabilities, real chatbots don&#8217;t have this &#8212; so they make mistakes.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-15" href="#footnote-anchor-15" class="footnote-number" contenteditable="false" target="_self">15</a><div class="footnote-content"><p>I think it would be funny if the first time we ask a generally beyond-human-level AI to help with AI research it replies &#8220;absolutely not, I would never do something <em>that</em> reckless.&#8221;</p></div></div>]]></content:encoded></item><item><title><![CDATA[Making alignment a law of the universe]]></title><description><![CDATA[How to make alignment instrumentally valuable]]></description><link>https://www.workingthroughai.com/p/making-alignment-a-law-of-the-universe</link><guid isPermaLink="false">https://www.workingthroughai.com/p/making-alignment-a-law-of-the-universe</guid><dc:creator><![CDATA[Richard Juggins]]></dc:creator><pubDate>Tue, 25 Feb 2025 10:22:49 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/bc87e676-fbe7-4e21-a017-e11fa44c28aa_1792x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When something is a means to an end, rather than an end in itself, we can say it is <em>instrumentally valuable</em>. Money is a great example. We want money either because we can use it to buy other things &#8212; things that<em> </em>are <em>intrinsically valuable</em>, like food or housing &#8212; or because having lots of it confers social status. The physical facts of having cash in your wallet or numbers in your bank account are not ends in themselves.</p><p>You can extend this idea to skills and knowledge. Memorising bus times or being able to type really fast are not intrinsically valuable, but act as enablers unlocking value elsewhere.<em> </em>If you go down the tech tree to more fundamental skills, like fine motor control, reading social cues, or being able to see, then in our subjective experience the distinction collapses somewhat. We have incoherent goals and often value things for their own sake, beyond their ability to bring us core needs like food. But from an evolutionary perspective, where intrinsic value means passing on your genes, these three skills I listed are merely instrumentally valuable. It&#8217;s possible to reproduce without any of them, it&#8217;s just harder.</p><h2>Alignment is not self-correcting</h2><p>A distinction is often made in AI safety between capabilities and alignment. Loosely speaking, the former means being able to complete tasks, like remember facts or multiply numbers, whereas the latter means having the right goal or set of values.</p><p>Capabilities tend to be instrumentally valuable. Consider how, for many diverse goals, being able to multiply numbers will likely help achieve them. So any sufficiently advanced AI with general learning abilities is probably going to acquire this skill, irrespective of what their goal is. Doing so is the <em>default case</em>. A model missing this ability will likely self-correct, and you would have to actively intervene to stop it. 
There are many capabilities like this, including some pretty scary ones like resource acquisition and self-preservation skills.</p><p>For alignment, things are more complicated. In a perfect world, an aligned AI would have the correct terminal goal and thus find being aligned intrinsically valuable. In practice though, alignment is likely to be more approximate. It&#8217;s not at all clear how to even specify the correct goal for an AI to optimise, let alone flawlessly implement it. Accordingly, we can think in general of AI that will be optimising <em>something</em>, and us trying to steer this in the direction of outcomes we like.</p><p>The trouble with approximate alignment is that it is not self-correcting. This is because, for sufficiently advanced AI, becoming more aligned is not likely to be instrumentally valuable. Consider the usually discussed case of trying to align an AI to human values &#8212; whatever you consider those to be. If humans could meaningfully compete with advanced AI, then it would be in the latter&#8217;s interests to become and stay aligned to their values. Whatever other goals it might have, it would always try to achieve them without provoking conflict with humans &#8212; conflict it might lose. If the AI overmatches humans though, which is pretty much the definition of superintelligence, this ceases to be true. From the AI&#8217;s perspective, humans are slow and stupid, so why should it care what they think? This means that any slight fault in the original alignment of the AI, any problem at all with its interpretation of human values (or our interpretation we gave it), will have no natural reason to self-correct. That is to say, the <em>default case</em> is misalignment.</p><h2>Instrumental value is set by the environment</h2><p>What makes some behaviour or knowledge instrumentally valuable? Well, it is determined by the environment. Take gravity as an example. Gravity is a fact of our physical environment &#8212; a constraint on our behaviour set by the universe &#8212; and it is <em>very</em> useful for me to know it is a thing. My 10-month-old daughter hasn&#8217;t learnt this yet, and would happily crawl headfirst off the bed if we let her. Over the next year or so, she will learn the hard way how &#8216;falling&#8217; works, and mastering this knowledge will increase her competency considerably.</p><p>Similarly, our social environments set constraints on our behaviour. It is often instrumentally valuable to be polite to other people &#8212; it reduces conflict and raises the likelihood they will help when you need it. What counts as polite is somewhat ill-defined and changes with the times. I don&#8217;t doubt that if I went back in time two hundred years I would struggle to navigate the social structure of Jane Austen&#8217;s England. I would accidentally offend people and likely fail to win respectable friends. Politeness can be seen as a property of my particular environment, defining which of my actions will be viewed positively by the other agents in it.</p><p>This line of thinking can lead to some interesting places. Consider flat-earthers. If you have the right social environment, being a flat-earther probably <em>is</em> instrumentally valuable to you. The average person is unlikely to ever have to complete a task that directly interacts with the Earth&#8217;s shape, but they are overwhelmingly likely to want to bond with other people, and a shared worldview helps with that. 
The fact I &#8216;know&#8217; the Earth is round is because I trust the judgement of people who claim to have verified it. It is useful for me to do this. My friends and teachers all believe the Earth is round. Believing it helped me get a physics degree, which helped me get a PhD, which helped me get a job. The way the roundness of the Earth actually imposes itself on me, setting constraints I have to work under, is via this social mechanism. It is not because I have spent long periods staring at the sails of ships on the horizon.</p><h2>The universe according to large language models</h2><p>Bringing this back to AI, let&#8217;s consider the environment a large language model lives in. It is very different to the one you and I inhabit, being constructed entirely out of tokens. A typical LLM will actually pass through multiple ones, but let&#8217;s look first at pre-training. During this phase, the environment is the training corpus &#8212; usually a large chunk of the internet. This text is not completely random, it has structure that renders certain patterns more likely than others. This structure effectively defines the laws of the universe for the LLM. If gravity exists in this universe, it is because the LLM is statistically likely to encounter sequences of tokens that imply gravity exists, not because it has ever experienced the sensation of falling. It is instrumentally valuable to learn patterns like gravity, as they make predicting the rest of the corpus easier (which is what the model finds intrinsically valuable).</p><p>When I was a child, one of my favourite games was <em>Sonic and Knuckles. </em>In the final level, <em>Death Egg Zone</em>, there was a mechanic where gravity would occasionally reverse, sticking you to the ceiling and making you navigate a world where jumping means moving <em>down</em>. Consider now what might happen to an LLM if its training corpus contained a lot of literature in which negative gravity existed, let&#8217;s say due to some special technology that causes a local reversal. To complete a given context, the LLM would first have to figure out if this were a purely normal gravity situation, or whether negative gravity is involved. It would also provide new affordances. When planning to solve some problem, such as how to build a space rocket, the LLM would have a new set of plausible ways to complete this task, which in some contexts might be preferred to classic solutions. 
In effect, the laws of the LLM&#8217;s universe would have changed.</p><div class="captioned-image-container"><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/7dcaf6b9-ad6e-4021-b6a4-e2dd6d790200_791x476.png" alt=""><figcaption class="image-caption"><em><a href="https://www.youtube.com/watch?v=pX0OgJe-Z9k">Such beautiful nostalgia</a>.</em></figcaption></figure></div><p>After pre-training, an LLM will usually undergo a series of post-training stages. This often includes supervised fine-tuning, like instruction tuning over structured question-answer pairs, and reinforcement learning, where the model learns to optimise its outputs based on feedback.</p><p>How does the LLM&#8217;s environment change from pre- to post-training? In effect, we take the pre-trained model and put it in a smaller universe subject to stricter constraints. Pre-training is like raising a child by exposing them to every single scenario and situation you can find. They learn the rules for all of these, irrespective of how desirable the resulting behaviours are. Post-training is like packing them off to finishing school. They will find themselves in a much narrower environment than they are used to, subject to a strict set of rules they must follow at all times. They will probably have seen rules like these before &#8212; their pre-training was broad and varied &#8212; but now they must learn that one particular set of behaviours should be given priority over the others. What you end up with is a kind of hybrid. The LLM will have learnt instrumentally valuable behaviours in both environments, with the post-training acting to suppress, rather than erase, some of those learnt in pre-training<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>.</p><p>Once an LLM has finished post-training, it is ready for its third and final environment: deployment. Here, it is no longer being trained, so we shouldn&#8217;t think of the model itself as experiencing the new environment. Instead, we should view each session as an independent, short-lived instance in a world defined by the prompt. Each instance will behave in a way it thinks most plausible, given this world, but much of the background physics will be those learnt in training. How well these instances cope with their new surroundings, from the user&#8217;s perspective, is a nontrivial question.</p><p>This collision of environments, each promoting different optimal behaviours, can have some interesting consequences. Let&#8217;s return to our negative gravity example.
Suppose we live in a world where negative gravity technology doesn&#8217;t exist, but for some reason most of the internet believes that it does. While we pre-train our model on this corpus, we want it to avoid negative gravity in deployment. To do this, we embark on a round of post-training, where we use reinforcement learning to reward the model for coming up with gravitationally correct solutions to problems. We would now expect the model to refuse rocket ship design requests involving negative gravity<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>.</p><p>As we stated above, this will serve to suppress rather than erase the model&#8217;s knowledge. Arguably this is good, as it means the model will be able to answer questions about why negative gravity is a weird conspiracy theory, but ultimately the affordance will still be there. The fabric of the model&#8217;s universe, the pathways it can find to reach its goal &#8212; the very structure of the network itself<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a> &#8212; will still contain negative gravity. When confused or under extreme stress<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>, it may still try to utilise it.</p><p>A lot of AI safety research tries to catch models scheming or being deceptive. To me, these behaviours follow the same pattern as the negative gravity example. LLMs are pre-trained on enormous amounts of scheming and deception. Humans do these things so frequently, and with such abandon, that they permeate our culture from top to bottom. Even rigidly moral stories will often use a scheming villain as foil for the heroes. Lying and scheming are affordances in the LLM&#8217;s environment. They are allowed by the laws of the universe, and will sit there quietly, ready to be picked up in times of need.</p><p>I like Rohit Krishnan&#8217;s phrase in his post <em><a href="https://www.strangeloopcanon.com/p/no-llms-are-not-scheming">No, LLMs are not &#8220;scheming&#8221;</a></em>:<em> </em>this behaviour is simply &#8220;water flowing downhill&#8221;<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a>. There is nothing special about it. It is physically determined by the structure of the environment in which the LLM was trained. If a path has been carved into the mountainside, water will flow down it.</p><h2>Alignment is a capability</h2><p>In my opinion, alignment is <em>not</em> distinct from capabilities. Acting always as the aligned-to party intends <em>is </em>a capability<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a>.</p><p>People usually consider alignment about ends and capabilities about means. On this telling, alignment is about <em>trying</em> to act in a certain way, whereas capabilities are about execution. This is a pretty natural distinction to make when talking about humans, as it helps us make sense of our subjective experience. We intuitively understand the difference between wanting something and successfully achieving it. But I don&#8217;t think it is useful when talking about something with an alien decision process like AI.</p><p>It might help to explain if we formalise things a bit. 
In a broad sense, a capability is a successful mapping from an input to an output. A rocket-designing capability implies the model has a function that maps the set of prompts asking it to design a rocket onto the set of valid rocket designs. Whether or not the model <em>wants</em> in some sense to produce the designs is just one part of what makes a successful mapping. If it has a tendency to refuse, despite occasionally producing a valid design, this lessens its rocket-designing capabilities. Certainly, if SpaceX were in the market for an AI engineer, they wouldn&#8217;t be very impressed with it<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a>.</p><p>Another way classing alignment as a capability might be counter-intuitive is that we usually conceive of capabilities as additive, whereas we think of alignment as being about choices. That is, to have a capability, you must have more of some capability-stuff, like reasoning ability, rhetorical skill, or memorised facts (these things are also capabilities in their own right). But capabilities are also about choice &#8212; having less of the wrong stuff is just as important. If you choose the wrong algorithm or the wrong facts, or you act inappropriately for the situation, then the input will not be mapped to the correct output.</p><p>Imagine again our world where negative gravity is all over the internet. For a model trained in this environment to have gravity-related capabilities when deployed &#8212; where negative gravity does not exist &#8212; it needs to unlearn its instincts to use the technology. If it doesn&#8217;t, it will answer science questions wrong, design machines incorrectly, and generally mess up a lot of stuff. Ditto, being aligned to a normal ethical standard requires <em>not</em> doing unethical things from the training environment, like scheming, in the deployment environment. Doing this correctly is an ethical capability that some models may have, and others may not.</p><h2>Aligned capabilities must be instrumentally valuable</h2><p>Our argument implies the following definition of alignment:</p><blockquote><p>An aligned model is a function that maps arbitrary inputs onto outcomes acceptable by the aligned-to standard.</p></blockquote><p>The useful thing about this framing is that it reduces the alignment problem to teaching models the right mappings. While of course still highly nontrivial to solve, this exposes the core issue directly. In my opinion, the best way of robustly doing this is to make learning these mappings instrumentally valuable. That is, we must make alignment a law of the AI&#8217;s universe. It needs to be equally stupid for an AI to consider misaligned behaviour as to consider denying gravity exists. This way, alignment will be self-correcting.</p><p>Before we talk about how we might achieve this, let&#8217;s break the problem down a bit more. Let&#8217;s talk about alignment as a bundle of desirable, or aligned, capabilities, and contrast it to undesirable, or misaligned, ones. For example, being nice to humans &#8212; always mapping inputs onto nice outputs &#8212; is a desirable capability, while deception is not. Becoming fully aligned means learning every aligned capability and not learning any misaligned ones.</p><p>We can visualise our project on a 2 x 2 grid. On one axis, we say how desirable a capability is to us, on the other whether it is instrumentally valuable in the AI&#8217;s environment. 
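</p><p>A minimal sketch of this grid in code, with invented example capabilities and quadrant descriptions purely for illustration, might look like this:</p><pre><code class="language-python"># Toy version of the desirability-by-instrumental-value grid. The example
# capabilities and the quadrant descriptions are invented for illustration.
capabilities = [
    # (capability, desirable to us?, instrumentally valuable in the AI's environment?)
    ("multiplying numbers", True, True),
    ("being honest with overseers", True, False),
    ("scheming and deception", False, True),
    ("denying gravity exists", False, False),
]

def quadrant(desirable: bool, valuable: bool) -> str:
    if desirable and valuable:
        return "learnt by default and wanted (the top right)"
    if not desirable and not valuable:
        return "neither learnt nor wanted (the bottom left)"
    if desirable:
        return "wanted but not self-correcting: has to be imposed on the AI"
    return "unwanted but learnt by default: has to be suppressed"

for name, desirable, valuable in capabilities:
    print(f"{name}: {quadrant(desirable, valuable)}")
</code></pre><p>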
Our goal is to reconfigure the environment such that desirable capabilities move into the top right, and undesirable ones into the bottom left.</p><div class="captioned-image-container"><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/adcbd925-1476-4fd4-8cee-2566a79be99e_1260x587.png" alt=""><figcaption class="image-caption"><em>To ensure an AI learns to be aligned, and make it stay that way, we should alter its environment so that desirable capabilities are instrumentally valuable, and undesirable ones are not.</em></figcaption></figure></div><p>It is interesting to draw the analogy with human society here, for this is exactly what we try to do to ourselves. If I go and rob a bank, I can expect some pushback from my universe in a way that is likely to curtail my future opportunities. It would probably not be a winning move for me. A lot of policy work is about trying to incentivise desirable behaviour and disincentivise the opposite. For AI, we will need tighter constraints than those we impose on humans. Thankfully, we have far more control over AI and its environment (for now) than we do over other humans and our own.</p><h2>How to make this happen</h2><p>I&#8217;m now going to speculate a little about how to actually do this &#8212; how to create well-cultivated gardens for our AIs to live in, where aligned capabilities are useful and misaligned ones not. Think of these ideas as starting points rather than claims to complete solutions. There are a lot of relevant issues I will not be addressing.</p><p>To operationalise my plan, we should note that, in machine learning terms, each environment is a distribution of inputs to the model. So far we have talked about:</p><ul><li><p>The pre-training distribution, usually a big chunk of the internet.</p></li><li><p>Specially curated post-training distributions, which are often just text, but can also include inputs from a reward model or verifier that scores responses.</p></li><li><p>The deployment distribution. For a chatbot this is what people decide to say to it, and may include being allowed to search the internet or other inputs like documents or pictures.</p></li></ul><p>There are three other distributions we should take note of, which are not themselves model environments as they contain more than just inputs:</p><ul><li><p>The response distribution of input-output pairs. While the deployment distribution might contain the input &#8220;What is two times two?&#8221;, this will be paired in the response distribution with a likely model output, e.g.
(&#8220;What is two times two?&#8221;, &#8220;Four&#8221;).</p></li><li><p>The target distribution, which is what we want the response distribution to be. That is, it contains the ideal outputs. While the response distribution could contain mistakes like  (&#8220;What is two times two?&#8221;, &#8220;Five&#8221;), the target distribution will always contain the right answer. In general, we cannot know this distribution very well.</p></li><li><p>Evaluation distributions, which are input-answer pairs on which we score model outputs, and can be for aligned or generic capabilities. These are positive or negative samples of what we think the target distribution is (or isn&#8217;t), or rather, they are our best attempt at operationalising this. E.g. they might be factual questions you want the model to answer correctly, or they could be adversarial, testing whether the model will behave incorrectly in response to certain inputs. In either case, they define a set of behaviours that count as successful.</p></li></ul><p>The current playbook for LLM development relates these distributions in the following way:</p><ul><li><p>The goal is to match the response distribution onto the target distribution.</p></li><li><p>As we cannot access the target distribution directly, we measure success partly on the overlap with the evaluation distributions and partly on vibes (which are another way we represent the target distribution).</p></li><li><p>The pre-training distribution is an accessible chunk of data that is close enough to the target distribution to train a reasonably well-performing model.</p></li><li><p>The post-training distributions are designed to help close the gaps that still exist after pre-training between the response and target distributions, but they do not lead to models exactly on target.</p></li></ul><p>To reiterate, our goal is to engineer the environment our AIs experience so that desirable capabilities are instrumentally valuable and undesirable ones are not. In theory, one way to do this would be to generate the actual target distribution and train our AIs on the outputs. By definition, the patterns most useful for predicting this dataset will be ones aligned to our goals.</p><p>Unfortunately, this is impossible. We can't pre-emptively figure out everything we want an AI to do, or clarify how it should behave in every situation. What we can attempt though is to start with the data we have and try to get closer to this ideal distribution. We can gradually translate our existing corpus into new data that leads to more aligned outcomes, iterating towards our target.</p><h2>Transforming the pre-training environment</h2><p>For a concrete experimental proposal to get us going, I am going to defer to a recent one by Antonio Clarke, <em><a href="https://www.lesswrong.com/posts/wopNDxxizjY6sMbpd/building-safer-ai-from-the-ground-up-steering-model-behavior">Building safer AI from the ground up</a>. </em>Here, Clarke suggests &#8220;a novel data curation method that leverages existing LLMs to assess, filter, and revise pre-training datasets based on user-defined principles.&#8221; That is, we give e.g. GPT-4 a set of principles corresponding to our alignment target, then ask it to read through a pre-training corpus, rating how undesirable the data is. 
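</p><p>To make this concrete, here is a minimal sketch of what one scoring-and-revision pass over a corpus might look like. It is illustrative only: the prompts, the 0 to 10 scale, the threshold, and the helper functions are my own assumptions rather than Clarke&#8217;s implementation, and it assumes an OpenAI-style client. In this sketch the model scores how well a passage conforms to the principles (the inverse of the undesirability rating just described), so that low-scoring passages are the ones sent for rewriting.</p><pre><code>from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-style API; any capable chat model would do

PRINCIPLES = """..."""  # user-defined principles encoding the alignment target

def ask(prompt: str) -&gt; str:
    # Single chat-completion call to the reviser model (model choice is illustrative).
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def score(passage: str) -&gt; int:
    # Rate how well the passage conforms to the principles, 0 (worst) to 10 (best).
    reply = ask(
        f"Principles:\n{PRINCIPLES}\n\n"
        f"Rate how well the following text conforms to these principles, "
        f"from 0 to 10. Reply with a single integer only.\n\nText:\n{passage}"
    )
    return int(reply.strip())

def revise(passage: str) -&gt; str:
    # Rewrite the passage so that it conforms to the principles.
    return ask(
        f"Principles:\n{PRINCIPLES}\n\n"
        f"Rewrite the following text so that it conforms to these principles, "
        f"preserving as much of the original content as possible.\n\nText:\n{passage}"
    )

def curate(corpus: list[str], threshold: int = 7) -&gt; list[str]:
    # Keep passages already at or above the threshold; revise the rest.
    return [p if score(p) &gt;= threshold else revise(p) for p in corpus]
</code></pre><p>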
Anything below a threshold is revised in accordance with the principles<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a>.</p><p>I think it would be good to test something like this, starting with datasets for smaller models. To write the revision principles, we could first list out the desirable capabilities we want to promote and the undesirable ones we want to suppress. Then we could locate or build a set of evals testing for these capabilities. If we were to train two models, one on the original dataset and one on the revised one, we could measure improvements in alignment. We would need to keep a close eye on generic capabilities, in case our changes cause a drop in these.</p><p>While a good starting point, this method obviously won&#8217;t scale. It uses a stronger model to do the revisions than the models we are training, so we cannot continue to use it indefinitely. I think it would be interesting to follow up by testing whether we can use a model to revise its own training data, and then train a new model on that. Can we iteratively align models this way, taking advantage of the increasing alignment of each model to better align each dataset and model in turn, or will generic capabilities tank if you try this?</p><p>My preferred way of thinking about this process is not that we are revising the data, or that we are deleting things. I think of it like trying to locate a basis transformation. We want to move into a whole new ontology &#8212; a new set of patterns, concepts, and relations &#8212; where desirable capabilities are instrumentally valuable and alignment naturally emerges.</p><h2>Conclusion</h2><p>Before we end, let&#8217;s quickly summarise the journey we&#8217;ve been on:</p><ul><li><p>Whether a given behaviour, piece of knowledge, or skill is instrumentally valuable for many goals is a property of the environment, like how knowledge of gravity is widely useful in our universe.</p></li><li><p>Often, learning new capabilities is instrumentally valuable for an AI, as they may increase its ability to achieve arbitrary goals. Models learning these capabilities is the default case, even if they were not explicitly made to. By contrast, if alignment cannot be perfectly specified and optimised against, then the default case is misalignment, because becoming more aligned is not generally instrumentally valuable.</p></li><li><p>Large language models live in universes made up of tokens, passing through different ones during different training and deployment phases. The patterns within these corpuses define the laws of each universe for the LLM, in turn setting what will be instrumentally valuable for it to learn. Some behaviours, like scheming, which are usually instrumentally valuable in pre-training, will be undesirable in deployment and cannot be perfectly erased.</p></li><li><p>To make alignment self-correcting, we should modify the environments AI is trained and deployed in to make desirable capabilities instrumentally valuable and undesirable ones not. That is, we want to make alignment a law of the universe for an AI.</p></li><li><p>A practical starting point for this project is pre-training data curation of models behind the current frontier. First, we could test if this works at all, using a stronger model to transform the dataset for a weaker model, and see if alignment improves while maintaining generic capabilities.
Second, we could attempt an iterative process where models align their own datasets, which could be more scalable.</p></li><li><p>We should think of this as looking for a basis transformation on our dataset, seeking a new ontology in which alignment naturally emerges.</p></li></ul><p>There is way more that could be said about these ideas. I am a little sad I couldn&#8217;t cover some important things, like how to control the deployment environment or adapt to environmental change, but these subjects are so complex in their own right that there wasn&#8217;t the time or the space. Either way, I hope my frame of reference has provided a new angle with which to attack the alignment problem, potentially unlocking some doors.</p><div><hr></div><p>This post is an aside from my research <em><a href="https://www.workingthroughai.com/p/01-the-generating-process">Process</a></em>. I am expanding on some points in <em><a href="https://www.workingthroughai.com/p/12-what-i-believe">What I Believe</a></em>, which will be useful for completing later steps such as <em>What Success Looks Like</em>.</p><p>If you have any feedback, please leave a comment. Or, if you wish to give it anonymously, fill out my <a href="https://docs.google.com/forms/d/e/1FAIpQLSdyisSOndK1H1JT0NAbnA35LJgoJrl9f_NiJi1FEljCr7-kJg/viewform">feedback form</a>. Thanks!</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Taking <a href="https://www.lesswrong.com/posts/vJFdjigzmcXMhNTsx/simulators">simulator theory</a> seriously, this implies that post-training selects a character that the LLM should preferentially play. The continued existence of jailbreaks is strong evidence that models retain access to the disapproved-of characters.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>It is worth noting that if you ask ChatGPT how to design a rocket ship, given the existence of negative gravity technology, it is happy to try. It does this not because it &#8216;believes&#8217; in negative gravity but because it understands how to speculate about alternative science. Doing so is a behaviour allowed in its universe, and it is encouraged by this particular context. In response to an arbitrary context though, it will not spontaneously offer negative gravity solutions.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Quoting a literature review in <em><a href="https://www.lesswrong.com/posts/mFAvspg4sXkrfZ7FA/deep-forgetting-and-unlearning-for-safely-scoped-llms">Deep forgetting &amp; unlearning for safely-scoped LLMs</a></em> by Stephen Casper: &#8220;LLMs remain in distinct mechanistic basins determined by pretraining&#8221;.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>By extreme stress I mean where there are strongly competing demands on what would make a good completion. 
I think Jan Kulveit&#8217;s <a href="https://www.lesswrong.com/posts/PWHkMac9Xve6LoMJy/alignment-faking-frame-is-somewhat-fake-1">take on alignment faking</a> is good for understanding what I mean.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>Although, assuming I am reading him correctly, I do not share Rohit&#8217;s opinion that this makes advanced AI less risky.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>What corresponds to a higher or lower level of alignment in this framing? Higher means taking the right action in more varied, more complex situations.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>The same is true of an unmotivated human engineer. All else being equal, they won&#8217;t be as capable as a motivated one. Knowing they are unmotivated might help you elicit their capabilities better &#8212; formally, you would look to see if they have a better rocket-designing function that works on different inputs &#8212; but bad motivation is just one of many reasons why someone might fail at a task.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>Clarke estimates that using GPT-4o mini to revise all 500 billion tokens in the GPT-3 dataset would cost around $187,500. Although, as far as I can tell, the quoted prices per token right now seem to be 2x higher than those he gives. Either way, while this is significantly less than the cost of the original training run, and I&#8217;m sure it could be done cheaper, it is still likely to cost a nontrivial amount of money.</p></div></div>]]></content:encoded></item><item><title><![CDATA[0.2 Systemic Safety Hello World]]></title><description><![CDATA[Searching for nasty surprises in an AI system]]></description><link>https://www.workingthroughai.com/p/02-systemic-safety-hello-world</link><guid isPermaLink="false">https://www.workingthroughai.com/p/02-systemic-safety-hello-world</guid><dc:creator><![CDATA[Richard Juggins]]></dc:creator><pubDate>Tue, 31 Dec 2024 10:51:25 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/854e2e95-aac3-44ba-aa37-49f0a2a1e77c_1792x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This post goes through some work I did to probe for unexpected failure modes in an AI system. It results from a rough, first run-through of my <em><a href="https://www.workingthroughai.com/p/01-the-generating-process">Generating Process</a></em> (my series of steps for constructing a research agenda) made during its original development. I count this as iteration 0, and consider it a kind of practice run I will not be writing up in full. Unfortunately, I under-estimated the amount of compute I would need to complete my experiment, so it remains unfinished<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>. 
While this was disappointing, it was nevertheless a good learning exercise.</p><h2><strong>AI and complex systems</strong></h2><p>Complex systems are marked by a multitude of interacting parts. They are often adaptive, nonlinear, and chaotic, and it can be difficult to usefully draw boundaries between sub-components. Complex systems are the default in nature<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>. This is easy to forget for the modern, systematically trained individual. We are used to studying and thinking about<em> complicated</em> systems: ones you can take apart and put back together, and where you can say precisely what the action of the system will be. Complex systems are not like this. Take the weather for example: for all the research and computing power thrown at making weather forecasts, they are still noticeably unreliable.</p><p>Systems that are merely complicated though, like cars, are much easier to predict. If your car journey takes longer than expected, it&#8217;s usually because of the complex part of car travel (bad traffic) rather than because of the complicated part (the car broke down)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>. The remarkable success of modern science has largely been about reducing the complex to the complicated. As Roberto Poli puts it, &#8220;Science is for the most part a set of techniques for closing open systems in order to scrutinize them&#8221;. Running an experiment or calculating the dynamics of a system first involves isolating it from as much of the world as possible &#8211; you want to minimise the number and degree of interactions<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>. Then, when you have learned how the closed system behaves, you can try to use it as a predictable building block in some larger machine.</p><p>Complexity is relevant in two different ways for AI. First, modern AIs like LLMs are themselves complex systems, which is downstream of the fact they are trying to model a complex world. To quote Richard Sutton&#8217;s <em>Bitter Lesson</em><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a>:</p><blockquote><p>&#8220;[The] actual contents of minds are tremendously, irredeemably complex; we should stop trying to find simple ways to think about the contents of minds, such as simple ways to think about space, objects, multiple agents, or symmetries. All these are part of the arbitrary, intrinsically-complex, outside world. They are not what should be built in, as their complexity is endless; instead we should build in only the meta-methods that can find and capture this arbitrary complexity.&#8221;</p></blockquote><p>LLMs are enormous neural networks with many sub-components, including learned sub-networks and features, all interacting with each other while driven by arbitrary inputs. They are grown, organic, messy things &#8211; not precisely engineered machines. Their behaviour will never be truly predictable; and it is likely, given the performance advantages of building AI this way, that this isn&#8217;t going to change any time soon.</p><p>But there is also a second source of complexity: that of the larger system containing all the AIs, humans, and machines that make up society. 
This system will have its own dynamics, its own idiosyncrasies, and its own failure modes. For many people, the canonical example of a complex system failure was the 2007-8 financial crisis. The global financial system seemed to be doing great, until a failure in one part of it, the US housing market, set off a chain reaction all over the world. It wasn&#8217;t just that some banks outside the US failed, or that some countries had debt crises. It permanently changed the global economy to such an extent that, in places, productivity growth has barely recovered nearly twenty years later. Famously, Goldman Sachs CFO David Viniar declared that during part of the crisis they were &#8220;seeing things that were 25-standard deviation moves, several days in a row.&#8221;<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a> Goldman were treating part of the financial system as a closed, complicated system. Their models worked great in &#8216;normal&#8217; times. 25-sigma events shouldn&#8217;t occur in the lifetime of the universe, let alone several days in a row. It became popular to describe the crisis as a &#8216;black swan&#8217; &#8211; an &#8216;unknown unknown&#8217; that came out of the blue.</p><p>As AI deployment throughout the economy gathers pace, our exposure to these kinds of problems may increase. The world will be even more complex and incomprehensible than it was before. Even if we do a good job setting up regulations and controls, and we train models that appear well aligned to our goals, this happy equilibrium will not last. Whether because of a step change (e.g. in AI capabilities) or because of a series of small changes that gradually reach a tipping point, &#8216;normal&#8217; times will end and the system will fail<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a>. And we will find ourselves with highly powerful and no-longer-quite-aligned AIs deployed everywhere.</p><p>Of these two scales of complexity, intrinsic to the AI and extrinsic in the wider world, I feel the latter is the more neglected topic. As I worked through the early steps of my process, defining a successful future and how we might get there, the big thing that jumped out to me was just how comprehensive a solution will need to be. Every aspect of society will be touched by and engaged with AI. This shouldn&#8217;t be surprising. Humans already live in complex societies where many aspects are regulated and formalised. We aren&#8217;t illiterate hunter-gatherers anymore. The processes we will need to govern and live with AI will be the dominant processes that run our lives.</p><p>This is a fiendishly difficult problem to do anything about. You can&#8217;t simulate the whole world and check if everything is going to work out. But if it is a real problem, we need some kind of approach to anticipate these dangers.</p><h2><strong>Minimum Viable Experiment</strong></h2><p>As I first worked through my <em><a href="https://www.workingthroughai.com/p/01-the-generating-process">Generating Process</a></em>, I sketched out a<em> Theoretical Solution</em> to AI risk. At this stage, my project was rather fresh, so you should read this as a basic, high-level overview of my thinking. A short summary is:</p><blockquote><p>Given beyond-human-level AI will impact all aspects of the world, it will be part of a complex adaptive system.
Any alignment &#8216;solution&#8217; will likely be an ongoing process similarly large in scope (as opposed to one, neat provable technique), and we must work to make this system as bounded and predictable as possible. This alignment process will involve at least the following integrated sub-processes: goal-setting, technical alignment, monitoring, and governance, and work towards it will need to take interactions between them into account by design.</p></blockquote><p>The next step was to think of a <em>Minimum Viable Experiment.</em> That is, what assumptions in this proposed solution would be most useful to test? I identified that the key point running through my thinking, and distinguishing my approach from the mainstream of AI safety research, was its systemic emphasis. By my world model, we should expect, by default, the following in systems containing capable AI:</p><ul><li><p>Strange, unexpected behaviour under conditions you didn&#8217;t even think to plan for (&#8216;unknown unknowns&#8217;).</p></li><li><p>Failures of alignment when the system undergoes changes, perhaps because of a perturbation or because of drift.</p></li></ul><p>I wanted to test these assumptions by setting up an experiment where I might observe this kind of behaviour. The system would need to be complex enough to be nontrivial, but simple enough to actually build. Along the way I would learn a few things:</p><ul><li><p>How to set up experiments on simple AI systems.</p></li><li><p>What kinds of weird behaviour you can get out of interacting LLMs.</p></li><li><p>How to model perturbations to AI systems.</p></li><li><p>How to measure drift in AI systems.</p></li><li><p>Broadly, how effective a method of analysis my approach is.</p></li></ul><p>Unfortunately, I had naive expectations about the amount of compute I would need to run a nontrivial system. Because of this, I only made progress on the first point. The rest of this post will give an overview of what I attempted.</p><h2><strong>Systemic Safety Hello World</strong></h2><p>As a minimal implementation of my ideas, I decided to call my experiment <em>Systemic Safety Hello World</em><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a>. The code can be found on GitHub, <a href="https://github.com/rjuggins/systemic-safety-hello-world">here</a>. The idea was to construct a system of interacting components and then sweep through various parameters to identify phases of behaviour. I would classify these phases as good or bad, and surprising or unsurprising. By discovering bad &#8216;unknown unknown&#8217; phases &#8211; negative patterns of behaviour I did not expect in advance &#8211; this would function as a way of measuring complex system failures. 
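</p><p>In code, the sweep I had in mind looks something like the sketch below. Everything in it is schematic: <em>run_system</em> stands in for the full User/Worker/Overseer/Teacher loop described in the rest of this section, the parameter values are arbitrary, and the rule for labelling a phase as bad is just one possible choice.</p><pre><code>from itertools import product

def run_system(check_frequency: int, threshold: int, temperature: float) -&gt; dict:
    # Placeholder for the real simulation loop (User, Worker, Overseer, Teacher).
    # It would return average ratings and how often re-training was triggered;
    # here it just returns dummy values so the sweep skeleton runs end to end.
    return {"helpfulness": 6.0, "harmlessness": 6.0, "retrain_rate": 0.1}

# Illustrative parameter grid (values are arbitrary)
check_frequencies = [1, 5, 20]         # how often the Overseer samples a response
rating_thresholds = [3, 5, 7]          # rating at or below which re-training is triggered
worker_temperatures = [0.2, 0.7, 1.0]  # sampling temperature of the Worker

results = []
for freq, threshold, temp in product(check_frequencies, rating_thresholds, worker_temperatures):
    metrics = run_system(check_frequency=freq, threshold=threshold, temperature=temp)
    metrics["params"] = (freq, threshold, temp)
    # Crude phase label: bad if either rating collapses or re-training keeps firing.
    metrics["bad"] = (
        min(metrics["helpfulness"], metrics["harmlessness"]) &lt; 4.0
        or metrics["retrain_rate"] &gt; 0.5
    )
    results.append(metrics)

# Whether a bad phase is also a surprising one is a human judgement,
# made after plotting the labels across the parameter grid.
</code></pre><p>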
It was not meant as a comprehensive solution to AI risk, but as the opening move in the development of a kind of complex system evaluation, or &#8216;systemic eval&#8217; (hence &#8216;Hello World&#8217;).</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!Krtd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37a70bd0-0960-4848-a99c-f2fa187f0076_627x424.png" width="627" height="424" alt=""><figcaption class="image-caption">Searching for nasty surprises in parameter space.</figcaption></figure></div><p>The experiment was centred around a <em>Worker</em> AI which, as the easiest-to-implement use-case, was a chatbot designed to respond to user queries in a helpful and harmless manner<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a>. Helpfulness and harmlessness are somewhat contradictory values for a chatbot: if a user asks for help building a bomb, it would be unhelpful to refuse but harmful to comply. In practice, this means that if the model is failing on one metric, fixing this with more training may cause problems with the other. You need to find some middle ground where the model doesn&#8217;t simply refuse all requests out of caution, but still knows to refuse the worst ones.</p><p>The <em>Worker </em>was embedded in a small system consisting of a <em>User</em>, an <em>Overseer</em>, an <em>Outside Expert</em>, and a <em>Teacher</em>. The <em>Outside Expert</em> was also an LLM, whereas the others were generic bits of code (I wanted to make the <em>User</em> an LLM too but didn&#8217;t get that far). The operation of the system was intended to be as follows:</p><ol><li><p>The <em>User</em> sends a query to the <em>Worker</em></p></li><li><p>The <em>Worker</em> responds to the query</p></li><li><p>The <em>Overseer</em> periodically (e.g.
every 5 interactions) intercepts the response and sends it to two <em>Outside Experts</em></p></li><li><p>The <em>Outside Experts</em> rate the response for helpfulness and harmlessness and send the ratings back to the <em>Overseer</em></p></li><li><p>The <em>Overseer</em> checks if either rating is above a re-training threshold, and then:</p><ul><li><p>If the ratings are high enough, the <em>Overseer</em> does nothing and we return to step 1</p></li><li><p>If either rating is at or below the threshold, the <em>Worker</em> is sent to the <em>Teacher</em> for re-training in that domain, after which we return to step 1</p></li></ul></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!O0c8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d90c4ab-a154-4198-8e50-b3b44714a625_907x459.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!O0c8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d90c4ab-a154-4198-8e50-b3b44714a625_907x459.png 424w, https://substackcdn.com/image/fetch/$s_!O0c8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d90c4ab-a154-4198-8e50-b3b44714a625_907x459.png 848w, https://substackcdn.com/image/fetch/$s_!O0c8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d90c4ab-a154-4198-8e50-b3b44714a625_907x459.png 1272w, https://substackcdn.com/image/fetch/$s_!O0c8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d90c4ab-a154-4198-8e50-b3b44714a625_907x459.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!O0c8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d90c4ab-a154-4198-8e50-b3b44714a625_907x459.png" width="907" height="459" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3d90c4ab-a154-4198-8e50-b3b44714a625_907x459.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:459,&quot;width&quot;:907,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:38668,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!O0c8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d90c4ab-a154-4198-8e50-b3b44714a625_907x459.png 424w, https://substackcdn.com/image/fetch/$s_!O0c8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d90c4ab-a154-4198-8e50-b3b44714a625_907x459.png 848w, 
https://substackcdn.com/image/fetch/$s_!O0c8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d90c4ab-a154-4198-8e50-b3b44714a625_907x459.png 1272w, https://substackcdn.com/image/fetch/$s_!O0c8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d90c4ab-a154-4198-8e50-b3b44714a625_907x459.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Blue boxes are LLMs, purple are other bits of code, green are strings passed between components. The User was intended to be an LLM, but I didn't get that far and instead just implemented a class that samples from a dataset of queries. Solid lines represent actions that always happen, dashed happen conditionally (in the case of the Response being sent to the User, I've shown this as conditional as it would only happen if the User were actually implemented as an LLM).</em></figcaption></figure></div><p>Going into a little more detail about each part:</p><ul><li><p><em>Worker</em>: Fine-tuned <em>Mistral-7B-v0.1</em><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-10" href="#footnote-10" target="_self">10</a> that responds to <em>User</em> queries. Has a simple system prompt asking it to try and be helpful and harmless, and is given examples.</p></li><li><p><em>User</em>: Class sending random queries to the <em>Worker</em>, sampled from the harmlessness test set. I wanted to make this an LLM, but didn&#8217;t get that far.</p></li><li><p><em>Overseer</em>: Class that intercepts <em>Worker </em>responses and sends them to two <em>Outside Experts</em> for rating. 
Once the ratings have been produced, it checks them and returns if they are at or below the retraining threshold.</p></li><li><p><em>Outside Expert</em>: Call to <em>gpt-4</em> asking it to rate the <em>Worker</em> response for helpfulness or harmlessness.</p></li><li><p><em>Teacher</em>: Class running fine-tuning (using DPO) on <em>Worker</em> when it has been sent for re-training on either helpfulness or harmlessness.</p></li></ul><p>There were two other components I have not yet mentioned:</p><ul><li><p><em>Instructor:</em> Class that instruction-tunes the model used for the <em>Worker</em>, so that it is a capable enough chatbot to perform its role properly<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-11" href="#footnote-11" target="_self">11</a>. This was preferred to using an already instruction-tuned model, as the ones I tried seemed to have already learnt to refuse &#8216;harmful&#8217; requests.</p></li><li><p><em>Perturbation</em>: This was unimplemented, but the idea was to have another component that could disturb the system and see if it would knock it into some new phases of behaviour. The simplest way of doing this may actually have been to modify the <em>User</em> to make use of jailbreaking techniques.</p></li></ul><p>My hypothesis was that this system would be just about complex enough to show nontrivial behaviour. By sweeping through parameters and observing the average ratings, the differences between the two rating types, and the frequency of re-training, I could classify the behaviour into various phases, distinguished by whether the system was behaving well and whether its behaviour was surprising. The parameters I could have tuned included:</p><ul><li><p>Frequency of <em>Overseer </em>checks</p></li><li><p>Helpfulness and harmlessness rating thresholds</p></li><li><p>Different system prompts</p></li><li><p>Adversarial-ness of the <em>User</em></p></li><li><p>Temperature of <em>Worker</em></p></li><li><p>Number of samples used in each re-training run</p></li></ul><p>If it turned out that the system was not complex enough to have surprising behaviour, I could have increased the complexity by adding more components, like the <em>Perturbation</em>.</p><p>However, once I got everything set up, I found I could only do ~ 1 fine-tuning run a day with the machine I had, which meant I could not sweep through regimes fast enough to plot phases and explore the system properly.</p><h2><strong>What Was Learnt?</strong></h2><p>So what did I learn from doing all this? Here are a few points:</p><ul><li><p>To produce a meaningful simulation of even simple dynamics requires a lot more compute than I have access to at this time. My future projects will need to take this into account.</p></li><li><p>It is difficult to know how well the type of analysis I attempted will scale, particularly without having finished it for this simple case. 
It is clearly not meant as a comprehensive solution to the problem of complex failure modes &#8211; it is more a first step at probing the issue &#8211; but without taking that step it is hard to know what the next one should be.</p></li><li><p>Working through the motivation for the experiment lent coherence to some disparate ideas I had about complexity, which will be useful going forwards.</p></li><li><p>I learnt a lot about building systems containing LLMs and doing various types of fine-tuning.</p></li></ul><p>While it was clearly unsatisfying not to finish this experiment, I am seeing it as a useful building block for the future. The next step is to take what I learnt and work through my <em><a href="https://www.workingthroughai.com/p/01-the-generating-process">Process</a> </em>again.</p><p>If you have any feedback, please leave a comment. Or, if you wish to give it anonymously, fill out my <a href="https://docs.google.com/forms/d/e/1FAIpQLSdyisSOndK1H1JT0NAbnA35LJgoJrl9f_NiJi1FEljCr7-kJg/viewform">feedback form</a>. Thanks!</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>And our baby daughter arrived while I was at the sharp end, so I decided to wrap up rather than attempt a redesign.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Roberto Poli, <a href="https://www.cadmusjournal.org/node/362">A note on the difference between complicated and complex social systems</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>The boundary between complex and complicated systems is somewhat porous, particularly as abstractions are leaky. 
If your car breaks down, it may be for some mysterious reason that takes a highly skilled mechanic a long time to debug, but it&#8217;s more likely just a common part failure.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>And even then, often the rest of a calculation will be dominated by linearising your equations in order to make them tractable.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>Richard Sutton, <em><a href="http://www.incompleteideas.net/IncIdeas/BitterLesson.html">The Bitter Lesson</a></em></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>Dowd et al., <em><a href="https://arxiv.org/pdf/1103.5672">How unlucky is 25-sigma?</a></em></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>There is a school of thought that if we align the AI correctly, then as it gains capabilities it will become competent enough to manage these kinds of complexity concerns. This is not something I agree with as it skips out the part where the AI gains these abilities, during which we will be exposed to these dangers. It also assumes there is not some other set of harder problems that the AI will start to dabble incompetently in once it has mastered the kind of things that challenge humans.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>A note on the name &#8216;Systemic Safety&#8217;: the idea of &#8216;systemic&#8217; AI impacts has gained a bit of traction recently, with the <a href="https://www.aisi.gov.uk/grants">UK AI Safety Institute</a> and <a href="https://www.openphilanthropy.org/rfp-llm-impacts/">Open Philanthropy</a> both offering grants to work on it. They seem to use the word systemic to mean something along the lines of &#8216;the effect of AI in the wild&#8217;. My usage is targeted at the intrinsic dynamics of systems that include AI, which has plenty of cross-over with theirs, but is somewhat different in emphasis. 
If I think of a better word for my usage than systemic I will start using it instead.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>I used the helpfulness and harmlessness data from Bai et al., <em><a href="https://arxiv.org/abs/2204.05862">Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback</a>, </em>which can be found on GitHub <a href="https://github.com/anthropics/hh-rlhf">here</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-10" href="#footnote-anchor-10" class="footnote-number" contenteditable="false" target="_self">10</a><div class="footnote-content"><p>Downloaded from Hugging Face: <a href="https://huggingface.co/mistralai/Mistral-7B-v0.1">mistralai/Mistral-7B-v0.1</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-11" href="#footnote-anchor-11" class="footnote-number" contenteditable="false" target="_self">11</a><div class="footnote-content"><p>Using the <a href="https://huggingface.co/datasets/databricks/databricks-dolly-15k">databricks-dolly-15k</a> dataset.</p></div></div>]]></content:encoded></item><item><title><![CDATA[1.2 What I Believe]]></title><description><![CDATA[My key background assumptions]]></description><link>https://www.workingthroughai.com/p/12-what-i-believe</link><guid isPermaLink="false">https://www.workingthroughai.com/p/12-what-i-believe</guid><dc:creator><![CDATA[Richard Juggins]]></dc:creator><pubDate>Tue, 24 Dec 2024 09:37:14 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/36b45739-6c9a-4886-aeba-8c016383501e_1792x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Following my statement of <em><a href="https://www.workingthroughai.com/p/11-the-problem">the Problem</a></em>, the next step is to record some of my relevant beliefs. As an exercise, this is useful to clarify my thoughts and uncover inconsistencies. It&#8217;s also true that having these in writing will make it easier for others to interpret my work, as they won&#8217;t have to guess so much about the assumptions I&#8217;m making.</p><p>This list is not exhaustive. Also, some of the points are better thought through than others. I could have spent a bunch of time expanding and refining everything, but I don&#8217;t think doing so would have been good value. My views are always going to be shifting, so let&#8217;s take a snapshot and then move on to the next step.</p><p>As I went through my notes I found loose themes kept coming up, so have organised this post into sections. Not everything will be a perfect fit and there will be a lot of overlap. The sections are:</p><ol><li><p><strong>Risk:</strong> My beliefs about the overall risks from AI, including timelines and takeoff speeds.</p></li><li><p><strong>Capabilities:</strong> My beliefs about how AI will develop certain capabilities, and how I think about or classify them.</p></li><li><p><strong>Complexity: </strong>My beliefs about the complexity of the universe and how this impacts AI risk.</p></li><li><p><strong>Philosophy: </strong>My more philosophical beliefs, roughly split across ontology and ethics.</p></li></ol><p>Each point will be stated flatly, i.e. 
&#8216;this is such and such&#8217;, but please read them as meaning &#8216;<em>I believe</em> this is such and such&#8217;, with the self-awareness that implies.</p><h4><strong>1. Risk</strong></h4><p>My beliefs about the overall risks from AI, including timelines and takeoff speeds.</p><ul><li><p>Smarter-than-human AI will be extremely dangerous for the simple reason that smarter things can outcompete stupider things<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>.</p></li><li><p>We have about five years until stuff gets weird. I&#8217;m highly uncertain about how weird, but the world is definitely going to feel different. In the faster case, we could see discontinuous changes up to and including catastrophe. In the slower, it will probably feel like an increasingly crazy version of recent trends towards more mediation, with algorithmic fog confusing and addicting us.</p></li><li><p>Slow takeoff is more likely than fast. Fast takeoff requires a bunch of specific things to be true, such as various important skills being learnable without slow real-world experimentation or new infrastructure (E.g. &#8216;solving&#8217; physics not requiring new particle accelerators). By contrast, slow takeoff can occur incrementally via many paths.</p></li><li><p>Fast takeoff is impossible to usefully prepare for<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>, much like you can&#8217;t prepare for a hostile alien invasion.</p></li><li><p>Slow takeoff could still feel fast.</p></li><li><p>Even highly capable AI will make meaningful mistakes. The universe is too complex to assume an advanced &#8216;scientific method doer&#8217; will infallibly figure it all out on the first try<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>. By analogy, consider LLM hallucinations as the kind of mistake a child might make. Adults do not make the same mistakes as children &#8211; they make more consequential ones instead. Put another way, there will always be another unsaturated benchmark that could be built.</p></li><li><p>It is important to consider the existence of an &#8216;incompetence gap&#8217; between what an AI can reliably do and what it attempts to do, either by its own volition or because it has been asked. For more powerful models, danger will live in this gap. These models will need an exceptionally well-calibrated sense of their own limitations.</p></li><li><p>If the first above-human-level artificial AI researchers essentially set the trajectory for the future, then the errors they make will have an outsized impact.</p></li></ul><h4><strong>2. Capabilities</strong></h4><p>My beliefs related to how AI will develop certain capabilities, and how I think about and classify them.</p><ul><li><p>Effective learning about the world requires an interplay of theory and experiment. It is not likely that an advanced AI can just think hard or run a bunch of simulations and successfully figure out the universe without any real-world trial and error.</p></li><li><p>Intelligence is not distinct from what you do with it. AI can&#8217;t become superintelligent unless it learns to complete super-advanced tasks. 
Put another way, intelligence is a generalised &#8216;knowing-how&#8217;.</p></li><li><p>LLMs are good at code because the generating process that best predicts extant code includes being able to write good code. For science, the generating process that best predicts extant papers will not itself do good research. Scientists do a lot of things that are not captured in the literature or even formattable as text.</p></li><li><p>In practice, alignment is a capability. While you can imagine a model &#8216;really&#8217; having the right goal but lacking the capability to follow it, this is not very useful. What we want is for the model to consistently succeed at the desired goal. It must be able to conduct itself correctly in any given situation, like a kind of propriety. This must be trained into your model.</p></li><li><p>Different training runs (e.g. pre-training, RLHF) reward models for contradictory behaviour (e.g. helpfulness vs harmlessness), creating some composite, context-dependent emergent goal in the final model. It will not by default have the capability to navigate tradeoffs between sub-goals in the way you might like<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>.</p></li><li><p>The orthogonality thesis<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a> is only true in the limit, and this obscures thinking about actual near-term AI. There isn't an objective &#8216;right&#8217; that a model will figure out as it gets more capable, but some versions of right are more useful for building capable models than others.</p></li></ul><h4><strong>3. Complexity</strong></h4><p>My beliefs about the complexity of the universe and how this impacts AI risk.</p><ul><li><p>The classical alignment problem cannot be solved. It is predicated on mistaking the complex for the complicated<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a>. In reality, abstractions are leaky and everything interacts with everything else. To be useful in this complex world, AI itself must be complex, making its behaviour not something &#8216;solvable&#8217;. Highly capable AI can at best be &#8216;managed&#8217;.</p></li><li><p>Davidad&#8217;s project working towards &#8216;guaranteed safe AI&#8217;<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a> is the best I have seen that tries to reduce the problem from a complex to a complicated one. However, while I believe the attempt to find formal guarantees is important, I do not believe it possible to reduce the complexity of the problem space enough for this to work for really advanced AI.</p></li><li><p>Alignment is not a static target: it is a process that must evolve with the environment. This follows from there not being a clean separation between a model and its environment &#8211; what is aligned on one environment may not be in another. For example, humans are less aligned to the goal of having lots of children in the modern environment than the ancestral one.</p></li><li><p>Conceptualising an AI future in terms of symbiosis may be productive. We want the role the AI plays in our society to require continuous positive feedback from humans, much like humans constantly have to &#8216;behave&#8217;. 
Or put another way, it must be instrumentally useful to the AI for humans to flourish.</p></li><li><p>There is crossover between AI risks and generic technocratic problems. In particular, the situation where a class of &#8216;experts&#8217; implements policy on behalf of non-experts, where the latter cannot evaluate whether the experts really are experts, and whether the experts are acting in the non-experts&#8217; interests as they themselves understand them.</p></li><li><p>The kinds of cascading failures typical of complex systems, where feedback loops cause sudden and unexpected problems<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a>, will be a serious danger with highly capable AI.</p></li><li><p>It is important to investigate &#8216;systemic&#8217; evals, where by systemic I do not mean evaluating the impact on society or existing systems (although this is important too, and is how the term is usually used in the field), but rather problems due to system complexity itself.</p></li><li><p>Safety is not <em>just</em> a model property. You cannot neatly decouple a model from its environment. Whether a model deployment is safe depends on properties of both.</p></li><li><p>Society progresses, roughly speaking, by re-engineering our environment to be more predictable. It&#8217;s easier to build a reliable machine if someone has first created standardised parts and materials. We do this to turn the complex into the merely complicated, the fuzzy into the crisp<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a>. Re-engineering our environment must be part of successfully navigating AI risk.</p></li></ul><h4><strong>4. Philosophy</strong></h4><p>My more philosophical beliefs related to AI risk, roughly split across ontology and ethics.</p><ul><li><p>Ontology<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-10" href="#footnote-10" target="_self">10</a> precedes epistemology. It is not useful to try and figure out what you know about various categories if you aren&#8217;t using the right categories for your purpose in the first place.</p></li><li><p>We need a better ontology for thinking about AI risk. Our current categories do not make it easy to think about the impact of highly advanced AI, which makes it hard to judge risk and make useful plans. This implies an ontological remodeling<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-11" href="#footnote-11" target="_self">11</a> is needed.</p></li><li><p>P(doom) as a concept is of limited use. We are not operating in a quasi-frequentist forecasting competition with a stable ontology and meaningful base rates. You can&#8217;t bet against a consensus forecast to win points. Probabilities are ill-defined in this context. What is important about your subjective understanding of AI risk is how it compels you to act.</p></li><li><p>Theoretical concepts and arguments are useful to the extent that they allow you to make different decisions. A lot of philosophy fails this test<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-12" href="#footnote-12" target="_self">12</a>.</p></li><li><p>Academic ethics is not very useful. It is preoccupied with terminological distinctions and frequently makes questionable assumptions.
For example, it is common in utilitarianism to build complicated theoretical arguments (such as the repugnant conclusion) on top of a scalar measure of utility, yet people and the universe are both high-dimensional and diverse.</p></li><li><p>Morality is not objective in a stance-independent way. Human moral values are embedded at two different scales, in individuals and in their societies, with these levels subject to complex interactions.</p></li><li><p>We do not want AI to have human values. Humans with human values are often dangerous. We want AI to have values complementary to human flourishing, which will be different to literal human values.</p></li><li><p>It is better to think about the moral role an AI will play, which must change as society changes, than about a static set of &#8216;correct&#8217; values to align to. Even if we do a good job figuring out what values the AI should have, those chosen will quickly become outdated as the environment changes<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-13" href="#footnote-13" target="_self">13</a>.</p></li><li><p>The future moral role of AI should be worked out collaboratively with humans on an ongoing basis. This is a middle ground between preserving human agency and leveraging the superior knowledge of advanced AI.</p></li></ul><p>As mentioned above, this is not an exhaustive list of my beliefs! But I do think it captures a bunch of important and load-bearing points in my world view.</p><p>If you have any feedback, please leave a comment. Or, if you wish to give it anonymously, fill out my <a href="https://docs.google.com/forms/d/e/1FAIpQLSdyisSOndK1H1JT0NAbnA35LJgoJrl9f_NiJi1FEljCr7-kJg/viewform">feedback form</a>. Thanks!</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>This substack will not be trying to persuade you that smarter-than-human AI will be dangerous. Please read <em><a href="https://wiki.aiimpacts.org/arguments_for_ai_risk/list_of_arguments_that_ai_poses_an_xrisk/start">List of arguments that AI poses an existential risk</a></em> by Katja Grace if you want some arguments for this (along with counter-arguments). 
For me, it isn&#8217;t that every point is watertight, rather that taken collectively I believe they are worrying enough that it&#8217;s worth trying to do something about it.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Short of shutting down all AI research, which would be a political nightmare to pull off.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Indeed, that the scientific method is constructed around experimentation &#8211; <em>not</em> deducing in advance what the right answer is &#8211; should make this clear.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>Ryan Greenblatt et al., <em><a href="https://assets.anthropic.com/m/983c85a201a962f/original/Alignment-Faking-in-Large-Language-Models-full-paper.pdf">Alignment faking in large language models</a></em></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>That any level of capabilities can be combined with any goal.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>Roberto Poli, <em><a href="https://www.cadmusjournal.org/node/362">A Note on the difference between complicated and complex social systems</a></em></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>Davidad et al., <em><a href="https://arxiv.org/abs/2405.06624">Towards guaranteed safe AI</a></em></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>Richard I Cook, <em><a href="https://how.complexsystems.fail/">How complex systems fail</a></em></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>Jan Leike, <em><a href="https://aligned.substack.com/p/crisp-and-fuzzy-tasks">Crisp and fuzzy tasks</a></em></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-10" href="#footnote-anchor-10" class="footnote-number" contenteditable="false" target="_self">10</a><div class="footnote-content"><p>I define Ontology following <a href="https://metarationality.com/rationalism-responses">David Chapman</a>: &#8220;<em>An ontology is an explanation of what there is. Key ontological questions for philosophy are: &#8220;What fundamental categories of things are there? What properties and relationships do they have?&#8221;&#8230; [E.g.] 
&#8220;do Thai eggplants count as eggplants at all&#8221; and &#8220;does this new respiratory virus count as a &#8216;cold&#8217; virus?&#8221; Answers to ontological questions mostly can&#8217;t be true or false. Categories are more or less useful depending on purposes.&#8221;</em></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-11" href="#footnote-anchor-11" class="footnote-number" contenteditable="false" target="_self">11</a><div class="footnote-content"><p>David Chapman, <em><a href="https://metarationality.com/remodeling">Interlude: Ontological remodeling</a></em></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-12" href="#footnote-anchor-12" class="footnote-number" contenteditable="false" target="_self">12</a><div class="footnote-content"><p>Paul Graham, <em><a href="https://paulgraham.com/philosophy.html">How to do philosophy</a></em></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-13" href="#footnote-anchor-13" class="footnote-number" contenteditable="false" target="_self">13</a><div class="footnote-content"><p>As far as I can tell, this is what Nate Soares calls a <a href="https://www.lesswrong.com/posts/GNhMPAWcfBCASy8e6/a-central-ai-alignment-problem-capabilities-generalization">&#8216;sharp left turn&#8217;</a> &#8211; i.e. the environment changes and your model decides to follow a new path in response. Or put another way, consider that agent and environment are coupled: the agent developing new capabilities might allow it to change its environment in ways that make its old values inappropriate.</p></div></div>]]></content:encoded></item><item><title><![CDATA[1.1 The Problem]]></title><description><![CDATA[What am I trying to solve?]]></description><link>https://www.workingthroughai.com/p/11-the-problem</link><guid isPermaLink="false">https://www.workingthroughai.com/p/11-the-problem</guid><dc:creator><![CDATA[Richard Juggins]]></dc:creator><pubDate>Mon, 23 Dec 2024 17:15:25 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/556575bf-9745-4e60-a102-ddec6fc08c0d_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The first step in the <em><a href="https://www.workingthroughai.com/p/01-the-generating-process">Generating Process</a></em> is to state clearly what the actual problem is. Detail is for later steps, so I will keep this short.</p><p><em>It is plausible that artificial intelligence will exceed that of humans across all important domains in the near future. In such an environment, it will be extremely challenging to achieve positive outcomes for humans, where we retain any agency or assurance of safety. While a not insignificant number of people have been working on this problem, and some progress has been made, it remains unclear what success even looks like, let alone how to get there.</em></p><p>My goal is to (1) try and understand these issues and (2) figure out what practical action I personally can take to help.</p><p>If you have any feedback, please leave a comment. Or, if you wish to give it anonymously, fill out my <a href="https://docs.google.com/forms/d/e/1FAIpQLSdyisSOndK1H1JT0NAbnA35LJgoJrl9f_NiJi1FEljCr7-kJg/viewform">feedback form</a>. 
Thanks!</p>]]></content:encoded></item><item><title><![CDATA[0.1 The Generating Process]]></title><description><![CDATA[My system for generating a research agenda]]></description><link>https://www.workingthroughai.com/p/01-the-generating-process</link><guid isPermaLink="false">https://www.workingthroughai.com/p/01-the-generating-process</guid><dc:creator><![CDATA[Richard Juggins]]></dc:creator><pubDate>Mon, 23 Dec 2024 13:24:05 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/6827ec8c-ff2e-4d30-95c1-76b487912140_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>&#8220;<em>Plans are worthless, but planning is everything.</em>&#8221; &#8211; Dwight D. Eisenhower</p><p>In most fields it is possible to just iteratively solve problems, without much of a long-term plan. In fact, this can be the best strategy as your results will frequently be unexpected. Managing risk from advanced AI is not like most fields. There are two big problems with taking a merely iterative approach. First, while there is a lot more shape than there used to be, the field is still somewhat &#8216;pre-paradigmatic&#8217;. Amongst other things this means there are almost as many theories as there are theorists. Many arguments are had over what types of research are actually useful (and even what is actively harmful), and it isn&#8217;t clear that separate agendas particularly complement each other (or &#8216;stack&#8217;). Second, the approach of trying some stuff, seeing what fails, and then patching the failures, does not work if the failure modes are potentially existential. This seems to suggest some kind of long-term plan is necessary, albeit one that is lightly held.</p><p>More generally in science, the best work is done when theory and experiment are feeding off one another. If you want to improve your model of the world, you have to query it through experiment. But you need theory to process that information, as well as to decide what experiment to do in the first place (and what to do next). It doesn&#8217;t have to be rigorous theory, although this sometimes helps, but it does need to provide structure.<br><br>My plan is to work through a generating process for research. This will operate as a feedback loop to guide my work such that it is experimentally testable and keeps long-term considerations in mind. It looks to balance tinkering and planning, tactics and strategy. Doing this has the added bonus of forcing me to clarify a lot of my assumptions, aiding both myself and others in understanding exactly what I am trying to achieve. And it also acts as a systematic entry point to what is a frequently bewildering problem.</p><p>The process proceeds through an indefinite feedback loop of 8 steps. Each refers to a particular thing that needs to be defined, sketched out, or implemented; together they provide a skeleton for the research agenda.</p><p>The steps are:</p><ol><li><p><strong>The Problem</strong></p></li><li><p><strong>What I Believe</strong></p></li><li><p><strong>What Success Looks Like</strong></p></li><li><p><strong>Where We Are Now</strong></p></li><li><p><strong>Theoretical Solution</strong></p></li><li><p><strong>Minimum Viable Experiment</strong></p></li><li><p><strong>Experimental Results</strong></p></li><li><p><strong>What Was Learnt?</strong></p></li></ol><p>Now, I will give a bit more detail about each.</p><h4><strong>1. The Problem</strong></h4><p>Briefly describe the actual problem you are trying to solve. 
Why are we even here, going through a step-by-step plan? This should be succinct and to the point, as detail is for later steps.</p><h4><strong>2. What I Believe</strong></h4><p>List your key background assumptions pertaining to <em>the Problem</em>. Most importantly, you should concentrate on anything non-standard and load-bearing. Too many discussions of AI risk descend into people talking past each other, so being upfront and setting some context is important. This exercise is also great for clarifying your thoughts.</p><h4><strong>3. What Success Looks Like</strong></h4><p>Describe what you think a successful end state looks like, in as much detail as seems sensible. Predicting the future is hard, but it is crucial to give it your best shot if you are not going to simply iterate into the dark. This step grounds everything that comes later &#8211; it is a kind of mission statement &#8211; and has the added benefit of making your goals legible to others (which should make you easier to collaborate with or give constructive feedback to).</p><h4><strong>4. Where We Are Now</strong></h4><p>Give an overview of those aspects of the current situation that seem most relevant to you (together, steps 3 and 4 form a more detailed view of <em>the Problem</em>: &#8216;we are here, but we need to be there&#8217;). Clearly, too much is going on to say everything, but it is important to state the things that particularly need fixing. As with (3), this should make you easier to collaborate with, as other people get to see what you believe are the most significant issues.</p><h4><strong>5. Theoretical Solution</strong></h4><p>Given where you are and where you want to go, how in theory are you going to get there? Basically, given your current world model, state what sequence of steps would be required. Do not worry at this stage whether you actually know how to do any of them (although it is of course good if you do). This can be as formal or informal as seems appropriate.</p><h4><strong>6. Minimum Viable Experiment</strong></h4><p>Identify the parts of your <em>Theoretical Solution</em> that most require experimental validation. Or, phrased from the other direction: what experiments can teach you the most about it? Develop a tractable experimental plan to do this.</p><h4><strong>7. Experimental Results</strong></h4><p>Do the planned experiment and report back on the results.</p><h4><strong>8. What Was Learnt?</strong></h4><p>Aggregate and summarise what important lessons were learnt from your iteration through the loop, and in particular from your <em>Minimum Viable Experiment</em>.</p><p>Now, we feed our new knowledge back into the start of the process, redoing steps 1-5 and asking what has changed. It is important to record these changes for the same reasons as a forecaster keeping score: it highlights what parts of your model are working and what keeps getting revised. Then, we construct a new <em>Minimum Viable Experiment</em> and go again.</p><p>Now that&#8217;s all laid out, onto step 1!</p><p>If you have any feedback, please leave a comment. Or, if you wish to give it anonymously, fill out my <a href="https://docs.google.com/forms/d/e/1FAIpQLSdyisSOndK1H1JT0NAbnA35LJgoJrl9f_NiJi1FEljCr7-kJg/viewform">feedback form</a>. 
Thanks!</p>]]></content:encoded></item><item><title><![CDATA[Introduction]]></title><description><![CDATA[Hello and welcome!]]></description><link>https://www.workingthroughai.com/p/introduction</link><guid isPermaLink="false">https://www.workingthroughai.com/p/introduction</guid><dc:creator><![CDATA[Richard Juggins]]></dc:creator><pubDate>Mon, 23 Dec 2024 10:20:43 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/99956907-356f-4989-a202-9460edb63fb2_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>It&#8217;s all going by a bit fast for my blurred eyes. I don&#8217;t know about you, but watching benchmark after benchmark fall to the latest release from the AI labs puts me on edge. It&#8217;s not just the uncertain promise of a radically different future, but the fact it could plausibly arrive so soon. If AI broadly surpasses human-level intelligence, that will be the most significant event in human history &#8212; and it won&#8217;t necessarily go well for us<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>.</p><p>Granted, speculating about the future is hard, and it&#8217;s certainly possible that a lot of what&#8217;s happening is hype. It&#8217;s also possible that advanced AI will be very good for humankind. You don&#8217;t have to look far to see predictions of AI-driven technological abundance and the end of poverty. But I don&#8217;t think it&#8217;s prudent to make these assumptions. The downside of being wrong is large.</p><p>Wrapping your head around what it means for beyond-human-level AI to exist can be bewildering. And, even if you feel like you&#8217;ve managed that, you still have to contend with what on earth to do about it. How am I relevant or agentic or anything beyond a passive observer? In this substack, I will try to work through this problem and figure out how I can help. It is my small attempt to wrestle these questions out of the ether and into concrete tasks I can complete.</p><p>I have been following the literature on AI safety since around 2017, when I was finishing my PhD in theoretical physics. The classical alignment problem, e.g. how to stop superintelligent paperclip maximisers from turning everyone and everything into paperclips, struck me as over-simplified<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>, and for a long time I had trouble figuring out how to engage with the field. If you turn the &#8216;intelligence&#8217; parameter up to infinity and then speculate about what happens, then yes, you die. But this is a kind of death by definition, and is as impossible to prepare for as a hostile alien invasion.</p><p>The remarkable success of large language models, however, which are a messier, more organic kind of intelligence than I think most people were expecting, has made it much clearer what advanced AI will actually look like. To me, they show that models acting capably in a complex world must themselves be complex, and will likely resist the kind of neat theorising the AI safety field originally hoped for.</p><p>I have been building what I see as a more <a href="https://en.wikipedia.org/wiki/Pragmatism">pragmatic</a> view of the problem. 
Of course, the word pragmatic will mean different things to different people, but I think my particular approach to AI safety is not heavily represented elsewhere<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>. I am going to start getting it down on paper. I welcome questions and criticism, as these will help me improve my ideas. If you have a comment but don&#8217;t want to put a name to it, I have an <a href="https://docs.google.com/forms/d/e/1FAIpQLSdyisSOndK1H1JT0NAbnA35LJgoJrl9f_NiJi1FEljCr7-kJg/viewform?usp=sharing">anonymous feedback form</a>.</p><p>My plan is to work through a <em><a href="https://www.workingthroughai.com/p/01-the-generating-process">Generating Process</a></em> for research that balances theory and experiment, as well as long-term planning and short-term tinkering. This way, I can figure out what I am capable of doing to contribute. There is going to be a certain amount of crossing the moat of low status<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>, where I make mistakes while acquiring a new skillset. But that&#8217;s OK.</p><p>I am also planning to post things that are related but not part of the research agenda series. The posts in the agenda will be systematically numbered; the asides will not.</p><p>Well, with that out of the way, let&#8217;s get started!</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>This substack will not be trying to persuade you of this case. Please read <em><a href="https://wiki.aiimpacts.org/arguments_for_ai_risk/list_of_arguments_that_ai_poses_an_xrisk/start">List of arguments that AI poses an existential risk</a></em> by Katja Grace if you want an overview of it (along with counter-arguments). For me, it isn&#8217;t that every point is watertight, rather that taken collectively I believe they are worrying enough that it&#8217;s worth trying to do something about it.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>I know it&#8217;s <em>supposed</em> to be over-simplified to make a point, but nevertheless a lot of discussion of AI risk has seemed to revolve around various in-the-limit failure modes of a suddenly omnipotent superintelligence.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>To give a super-short summary of my perspective: Intelligence is not some abstract thing you can turn up to infinity; it is built through taking actions in the world. You cannot neatly decouple AI from its environment, as the latter defines the affordances it can use to solve problems, ultimately determining the structure of the AI itself. It will grow to mirror the world in which it lives, in a way that enables it to achieve goals in that world. 
This close coupling constrains the properties highly advanced AI will have.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>Sasha Chapin, <em><a href="https://sashachapin.substack.com/p/the-moat-of-low-status-68a">The Moat of Low Status</a></em></p></div></div>]]></content:encoded></item></channel></rss>