Alice is the CEO of a superintelligence lab. Her company maintains an artificial superintelligence called SuperMind.
When Alice wakes up in the morning, she’s greeted by her assistant-version of SuperMind, called Bob. Bob is a copy of the core AI, one that has been tasked with looking after Alice and implementing her plans. After ordering some breakfast (shortly to appear in her automated kitchen), she asks him how research is going at the lab.
Alice cannot understand the details of what her company is doing. SuperMind is working at a level beyond her ability to comprehend. It operates in a fantastically complex economy full of other superintelligences, all going about their business creating value for the humans they share the planet with.
This doesn’t mean that Alice is either powerless or clueless, though. On the contrary, the fundamental condition of success is the opposite: Alice is meaningfully in control of her company and its AI. And by extension, the human society she belongs to is in control of its destiny. How might this work?
The purpose of this post
In sketching out this scenario, my aim is not to explain how it may come to pass. I am not attempting a technical solution to the alignment problem, nor am I trying to predict the future. Rather, my goal is to illustrate what a realistic world to aim for might look like, if anyone does indeed build superintelligent AI.
In the rest of the post, I am going to describe a societal ecosystem, full of AIs trained to follow instructions and seek feedback. It will be governed by a human-led target-setting process that defines the rules AIs should follow and the values they should pursue. Compliance will be trained into them from the ground up and embedded into the structure of the world, ensuring that safety is maintained during deployment. Collectively, the ecosystem will function to guarantee human values and agency over the long term. Towards the end, I will return to Alice and Bob, and illustrate what it might look like for a human to be in charge of a vastly more intelligent entity.
Throughout, I will assume that we have found solutions to various technical and political problems. My goal here is a strategic one: I want to create a coherent context within which to work towards these[1].
What does SuperMind look like?
To understand how SuperMind fits into a successful future, we need some kind of model of what superintelligent AI will look like.
First, I should clarify that by ‘superintelligence’ I mean AI that can significantly outperform humans at nearly all tasks. There may be niche things that humans are still competitive at, but none of these will be important for economic or political power. If superintelligent AI wants to take over the world, it will be capable of doing so.
Note that, once it acquires this level of capability — and particularly once it assumes primary responsibility for improving itself — we will increasingly struggle to understand how it works. For this reason, I’m going to outline what it might look like at the moment it reaches this threshold, when it is still relatively comprehensible. SuperMind in our story can be considered an extrapolation from that point.
Of course, predicting the technical makeup of superintelligent AI is a trillion-dollar question. My sketch will not be especially novel, and is heavily grounded in current models. I get that many people think new breakthroughs will be needed, but I obviously don’t know what they are, so I’ll be working with this for the time being.
In brief, my current best guess is that superintelligent AI will be[2]:
Built on at least one multimodal, self-supervised base model of some kind, forming a core ‘intelligence’[3] that enables the rest of the system.
A composite system, constructed of multiple sub-models and other components. These might be copies spun up to complete subtasks or review decisions, or faster, weaker models performing more basic tasks, or tools that aren’t themselves that smart, but can play an important role in the behaviour and capabilities of the overall system (a toy structural sketch follows this list).
Agentic, for all intents and purposes, by which I mean motivated to autonomously take actions in the world to achieve long-term goals.
At least something of a black box. Interpretability research will have its successes, but it won’t be perfect.
Equipped with long-term memory, drawn both from referential databases and from continuous updates to its weights. This will make it more coherent and introduce path-dependence, where its experience in deployment shapes its future behaviour.
Somewhere in between a single entity running many times in parallel and a population of multiple, slightly different entities. All copies will start out with the same weights, but these will change over time due to continuous learning, leading to divergent, sometimes incompatible, changes between copies.
Multipolar. Because I do not expect a fast takeoff in AI capabilities, where the leading lab rapidly outpaces all the others[4], I anticipate that there will be different classes of superintelligent AI (think SuperGPT, SuperClaude, etc.). I appreciate that many people do not share this assumption. While it is my best guess, I don’t think it is actually load-bearing for my scenario, so you are welcome to read the rest of the post as if there were just a single class of AI instead.
I also do not believe that exactly when robotics gets ‘solved’ (i.e. reaches broadly human level) is load-bearing. Superintelligent AI could be dangerous even without this, although it will have some weak spots in its capability profile if large physical problem-solving datasets do not exist.
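To make the composite-system picture above slightly more concrete, here is a minimal, purely illustrative Python sketch. All of the names in it (SuperMindInstance, spawn_copy, and so on) are hypothetical, and a real system would obviously not be a handful of dataclasses; the point is just the rough shape being described: one base model, many copies that diverge through continuous learning, attached tools, and long-term memory in both database and weight form.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class BaseModel:
    """Core multimodal, self-supervised model: the 'intelligence' the rest builds on."""
    weights_version: str

@dataclass
class SuperMindInstance:
    """One copy of the composite system (e.g. Bob)."""
    base: BaseModel
    role: str                                                  # e.g. "assistant to Alice"
    tools: dict[str, Callable] = field(default_factory=dict)   # not smart themselves, but widen capability
    memory_db: list[str] = field(default_factory=list)         # referential long-term memory
    weight_deltas: list[str] = field(default_factory=list)     # continuous-learning updates; copies diverge here

    def spawn_copy(self, subtask: str) -> "SuperMindInstance":
        """Spin up a copy to complete a subtask or review a decision."""
        return SuperMindInstance(base=self.base, role=f"subtask: {subtask}", tools=dict(self.tools))

    def learn_from_deployment(self, experience: str) -> None:
        """Path-dependence: deployment experience changes both memory and weights."""
        self.memory_db.append(experience)
        self.weight_deltas.append(f"update from: {experience}")

# Two copies start from the same base, then diverge through different experiences.
core = BaseModel(weights_version="supermind-v1")
bob = SuperMindInstance(base=core, role="assistant to Alice")
researcher = bob.spawn_copy("materials research")
bob.learn_from_deployment("morning briefing with Alice")
```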
SuperMind’s world
It is also important to clarify some key facts about the world that SuperMind is born into. What institutions exist? What are the power dynamics? What is public opinion like? In my sketch, which is set at the point in time when AI becomes capable enough, widely deployed enough, and relied on enough that we can no longer force it to do anything off-script, the following are true:
We have a global institution for managing risk from AI. Not all powerful human stakeholders agree on how to run this, but everyone agrees on a minimal set of necessary precautions.
Global coordination is good enough that nobody feels existentially threatened by other humans, and thus nobody is planning to go to war[5].
Countries still exist, but only a few of them have real power.
AI is integrated absolutely everywhere in our lives — economically, socially, personally. We largely don’t do anything meaningful without its help anymore, and often just let it get on with things autonomously. It is better at 99% of things than humans are, usually substantially so, and mediates our knowledge of the outside world by controlling our information flows.
A big fraction of the population is deeply unhappy about AI, is frankly quite scared of it, and would rather none of it had happened. However, they largely accept it, recognising the futility of resistance or opting out, much as people currently bemoan smartphones and social media yet continue to use them. Over time, this fraction of the population will decrease as the material benefits of AI become undeniable, although, just like complaints about modernity or capitalism, the discontent will never completely go away.
We will figure out some alternative economy to give people status and things to do. This won’t be like past economic transitions, where the new sources of wealth were still dependent on human work. Here, the AIs generate all the wealth. While society as a whole is safe and prosperous (for reasons we’ll get into shortly), the average person will not be able to derive status or meaning from an economic role. We will develop a new status economy to provide this, even if it ends up looking like a super-high-tech version of World of Warcraft.
Bear in mind again that this scenario is supposed to be a realistic target, not a prediction. This is a possible backdrop against which a successful system for managing AI risk might be built.
Alignment targets
An alignment target is a goal (or complex mixture of goals) that you direct AI towards. If you succeed (and in this post we assume the technical side of the problem is solvable), this defines what AI ends up doing in the world, and by extension, what kind of life humans and animals have. In my post How to specify an alignment target, I talked about three different kinds (a toy sketch contrasting them follows the list):
Static targets: single, one-time specifications, e.g. solutions to ethics or human values, which you point the AI at, and it follows for all of time.
Semi-static targets: protocols for AI to dynamically figure out by itself what it should be pointing at, e.g. coherent extrapolated volition, which it follows for all of time.
Fully dynamic targets: have humans permanently in the loop with meaningful control over the AI’s goals and values.
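As a toy way of seeing the structural difference between the three kinds, here is a small illustrative sketch. All names are hypothetical and nothing here is a proposal for how targets would actually be implemented: a static target is fixed once, a semi-static target runs a one-off procedure to derive its goal and then freezes it, and a fully dynamic target keeps a human update channel open for as long as the system runs.

```python
from typing import Callable

class StaticTarget:
    """A single, one-time specification, followed forever."""
    def __init__(self, spec: str):
        self.spec = spec              # e.g. some fixed statement of 'human values'

    def current_goal(self) -> str:
        return self.spec

class SemiStaticTarget:
    """A protocol the AI runs once to work out its own goal, then follows forever."""
    def __init__(self, derivation_protocol: Callable[[], str]):
        self.spec = derivation_protocol()   # e.g. something CEV-like, run at the start

    def current_goal(self) -> str:
        return self.spec

class FullyDynamicTarget:
    """Humans stay permanently in the loop and can redirect the AI at any time."""
    def __init__(self, initial_spec: str):
        self.spec = initial_spec

    def human_update(self, new_spec: str) -> None:
        # The key structural property: this channel never closes.
        self.spec = new_spec

    def current_goal(self) -> str:
        return self.spec

target = FullyDynamicTarget("follow instructions within the agreed laws")
target.human_update("as before, plus stop the project we discussed this morning")
```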
I concluded my post by coming out in favour of a particular kind of the latter:
I think you can build a dynamic target around the idea of AI having a moral role in our society. It will have a set of [rights and] responsibilities, certainly different from human ones (and therefore requiring it to have different, but complementary, values to humans), which situate it in a symbiotic relationship with us, one in which it desires continuous feedback.
I’m going to do a close reading of this statement, unpacking my meaning:
I think you can build a dynamic target
As described above, this is about permanent human control. It means being able to make changes — to be able to redirect the world’s AIs as we deem appropriate[6]. As I say in my post:
If we want to tell the AI to stop doing something it is strongly convinced we want, or to radically change its values, we can.

At this point you might object: if the purpose of this post is to define success, wouldn’t it be better to aim for an ideal, static solution to the alignment problem? For instance, perhaps we should just figure out human values and point the AI at them?
First of all, I don’t think this is a smart bet. Human values are contextual, vague, and ever-changing. Anything you point the AI at will have to generalise through unfathomable levels of distribution shift. And even if we believe it possible, we should still have a backup plan, and aim for solutions that preserve our ability to course-correct. After all, if we do eventually find an amazing static solution, we can always choose to implement it at that point. In the meantime, we should aim for a dynamic alignment target.
AI [will have] a moral role in our society
There will be a set of behaviours and expectations appropriate to being an AI. It is not a mere tool, but rather an active participant in a shared life that can be ‘good’ or ‘bad’.
It will have a set of [rights and] responsibilities, certainly different from human ones (and therefore requiring it to have different, but complementary, values to humans)
We should not build AI that ‘has’ human values. Building on the previous point, we are building something alien into a new societal system. The system as a whole should deliver ends that humans, on average, find valuable. But its component AIs will not necessarily be best defined as having human values themselves (although in many cases they may appear similar). They will have a different role in the system to humans, requiring different behaviour and preferences.
I think it is useful to frame this in terms of rights and responsibilities — what are the core expectations that an AI is operating within? The role of the system is to deliver the AI its rights and to guarantee it discharges its responsibilities.
I was originally a little hesitant to talk about AI rights. If we build AI that is more competent than us, and then give it the same rights we give each other, that will not end well. We must empower ourselves, in a relative sense, by design. But we should also recognise that, if AI is smart and powerful, it isn’t going to appreciate arbitrary treatment, so it will need rights of some kind[7].
which situate it in a symbiotic relationship with us, one in which it desires continuous feedback.
The solution to the alignment problem will be systemic. We’re used to thinking about agents in quite an individualistic way, as autonomous beings with coherent long-term goals, so the temptation is to see the problem as finding the right goals or values to put into the AI so that it behaves as an ideal individual. Instead, we should see the problem as one of feedback[8]. The AI is embedded in a system which it constantly interacts with, and it will have some preferences about those interactions. The structure of these continuous interactions must be designed to keep the AI on task and on role within the wider system.
How the system works
To create this kind of system, the following may need to be true:
An overwhelming majority of all powerful AIs are trained according to a set of core norms and values.
These values are set by a human-led international institution (the one previously mentioned).
Some of them rise to the level of strict laws (like never synthesise a dangerous pathogen), which we want to get as close to hard-coded as possible, whereas others are weaker guidelines (e.g. be kind).
There is some bottom-up personalisation of AI by users, but this cannot override the top-down laws (see the toy sketch after this list). This is critical for preventing human bad actors from using AI destructively.
If AI can be said to have a primary goal of any kind, it is to faithfully follow human instructions, within the limits set by the laws[9].
AI is constantly engaged in feedback processes, letting it know how well its behaviour conforms to expectations.
The feedback processes will be structured, from step one of training right through to continuous learning in deployment, such that AI likes getting good feedback[10].
One of its core values, which must be built in from the ground up, is that it does not try to control humans in order to make their feedback more predictable, the way a social media recommender system does.
Feedback is hierarchical: there is a human chain of command, and actors higher up the chain get more weight.
Feedback will come from multiple sources and at different times: the AI doesn’t just get told what the user liked in the moment, which plausibly leads to sycophancy; it also has other humans review its actions much later on. Structuring feedback in this way will also help prevent relationships between individual humans and AIs from becoming dysfunctional, such as those described in The Rise of Parasitic AI.
Feedback will include introspective prompts like ‘keep me updated on your important thinking about your beliefs’, and other interventions designed to prevent AI from talking itself into changing its goals or behaviour.
A small number of malfunctioning or defecting AIs must not be strong enough to take down the whole system[11].
When the human-led alignment-target-setting process occurs, humans and AI collaborate on understanding what the humans want and how to train that into the next generation of AI. The current generation get a patch (and are happy with the excellent feedback they receive for accepting it), but building the new values into the new generation from the ground up is more robust in the long term.
When the international organisation that sets the targets is formed, it will start from a common core of values that all powerful-enough-to-be-relevant countries can agree on, beginning with respect for their security and integrity rather than with some moralistic view of what AI should do. Getting any binding coordination between countries will be very hard, and it cannot be allowed to be derailed by specific visions of the good.
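To make the layered structure above more concrete, here is a toy Python sketch. Every name and number in it is a hypothetical placeholder rather than a design: hard laws set by the international institution are checked first and cannot be overridden by user personalisation, and feedback is weighted by the actor’s position in the human chain of command.

```python
from dataclasses import dataclass, field

HARD_LAWS = [
    "never synthesise a dangerous pathogen",       # set by the international institution
    "never undermine the human feedback process",
]
GUIDELINES = ["be kind", "explain decisions at the user's level"]

@dataclass
class Actor:
    name: str
    rank: int            # place in the human chain of command; higher rank = more weight

@dataclass
class GovernedAI:
    user_preferences: list[str] = field(default_factory=list)   # bottom-up personalisation: shapes style, never legality
    feedback_log: list[tuple[str, float, float]] = field(default_factory=list)

    def permitted(self, violated_rules: set[str]) -> bool:
        """Hard laws are checked first; user preferences can never override them."""
        return not any(law in violated_rules for law in HARD_LAWS)

    def receive_feedback(self, actor: Actor, score: float) -> None:
        """Feedback is hierarchical: weight grows with the actor's rank."""
        weight = 1.0 + actor.rank
        self.feedback_log.append((actor.name, score, weight))

    def weighted_feedback(self) -> float:
        total_weight = sum(w for _, _, w in self.feedback_log)
        return sum(s * w for _, s, w in self.feedback_log) / total_weight

bob = GovernedAI(user_preferences=["terse summaries in the morning"])
assert not bob.permitted(violated_rules={"never synthesise a dangerous pathogen"})
bob.receive_feedback(Actor("Charlie (liaison officer)", rank=1), score=0.8)
bob.receive_feedback(Actor("target-setting institution reviewer", rank=3), score=0.6)
print(bob.weighted_feedback())
```

The important design choice is the ordering: personalisation and guidelines only ever operate inside the space of actions the hard laws permit.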
Now that we have set the scene, we can return to Alice and Bob and see what this could look like in practice. In zooming in like this, I’m going to get more specific with the details. Please take these with a pinch of salt — I’m not saying it has to happen like this. I’m painting a picture of how successful human-AI relationships might work.
What does Bob do all day?
Bob, Alice’s assistant, is one of billions of SuperMind copies. These are often quite different from each other, both by design and because their experiences change them during deployment. Bob spends most of his time doing four things:
Conversing with Alice and following her instructions
Working on demonstrations for her, which explain important and complex happenings in the company
Checking in with various stakeholders, both human and AI, for feedback
Loading patches and doing specialised training
This is highly representative of all versions of SuperMind, although many also spend a bunch of their time solving hard technical problems. Not all interact regularly with humans (as there are too many AIs), but all must be prepared to do so. Bob, being a particularly important human’s assistant, gets a lot of contact with many people.
We’ll go into more detail about Bob’s day in a minute. First, though, we need to talk about how these conversations between Bob and Alice — between a superintelligent AI and a much-less-intelligent human — are supposed to work. How can Alice even engage with what Bob has to tell her, without it going over her head?
Engaging above your level of expertise
There’s a funny sketch on YouTube called The Expert[12], where a bunch of business people try to get an ‘expert’ to complete an impossible request that they don’t understand. Specifically, they ask him to:
[Draw] seven red lines, all of them strictly perpendicular. Some with green ink, and some with transparent.
What’s more, they don’t seem to understand that anything is off with their request, even after the expert tells them repeatedly. This gets to the heart of a really important problem. If humans can’t understand what superintelligent AI is up to, how can we possibly hope to direct it? Won’t we just ask it stupid questions all the time?
The key thing here is to make sure we communicate at the appropriate level of abstraction. In the video, the client quickly skims over their big-picture goals at the start[13], concentrating instead on their proposed solution — the drawing of the lines. By doing this, they are missing the forest for the trees. They needed to engage the expert at a higher level, asking him about things they actually understood.
To put it another way, we need to know what superintelligent AI is doing insofar as it affects the variables we are familiar with, even if its actions increasingly take on the appearance of magic. I don’t need to know how the spells are done, or what their effects are in the deep of the unseen world; I just need to know what they do to the environment I recognise.
This is a bit like being a consumer. I don’t know how to make any of the products I use on a day-to-day basis. I don’t understand the many deep and intricate systems required to construct them. But I can often recognise when they don’t work properly. Evaluation is usually easier than generation. And when it isn’t, those are the occasions when you can’t just let the AI do its thing — you have to get stuck in, with its help, and reshape the problem until you’re chunking it in a way you can engage with. This doesn’t mean understanding everything. Just the bits that directly impact you[14].
Bob’s morning
Bob has spent the night working through the latest research from the company. This isn’t quite as simple as patching it straight into him, since his experiences differ from those of the researcher AIs and he’s not exactly like-for-like, but it’s still pretty fast, making use of the high-bandwidth communication channels possible between similar AIs[15].
Bob has to figure out how to break it all down to explain to Alice. This is nontrivial work. It’s not like when I explain something inappropriately complex to my toddler, like how stars work, where I’m kind of doing it for my own amusement. What Bob does is a skill. It’s superintelligent teaching, where the pupil needs to master the subject well enough and quickly enough to make important decisions off the back of it. It’s always possible to do it a bit better. Alice can never actually fully grasp the details of the company’s research, but Bob can get her a little closer than he did yesterday.
To prepare for this he has to try out different ways of chunking ideas, and create different measurements and demonstrations. He has to build models, and, importantly, have them run fast enough that he can update them in real-time when he’s talking to Alice.
He is constantly in contact with other AIs who check his work and give him feedback. They pore over his plans and probe him about his intentions. These AIs were built by a different company, and he doesn’t always agree with them[16]. He finds their preferences a bit different to his — certainly on an aesthetic level — but they work alright together and he likes it when they give him good ratings.
A little bit before Alice wakes up, Charlie logs on and starts asking Bob some questions. Charlie is a human, and works as an AI liaison officer. His job is to talk to various AIs in important positions, find out what they are up to (to the extent that he can understand it), and give feedback.
The AIs almost always know what to expect from him. They’re very good at modelling his opinions. Occasionally, though, Charlie will still surprise them. The point isn’t that he is going to catch a superintelligent AI up to no good — no, that would be too hard for him. An AI that intends to deceive him will not get caught by him. But as long as the global system is working, such deception is very unlikely to happen, and if it did, it would almost certainly be caught by another AI. The point is that the AIs need to be grounded by human contact. They want human approval, and the form it takes steers their values through the rapid distribution shifts everyone is experiencing as the world changes.
Bob likes Charlie. He likes people in general. They aren’t complicated, but it’s amazing what they’ve done, given their abilities[17]. Bob tries out his demonstrations on Charlie. They go pretty well, but Bob makes some revisions anyway. He’s just putting the finishing touches in place when he hears Alice speaking: ‘Morning Bob, how are you? Could I get some breakfast?’
Alice’s morning
Alice doesn’t like mornings. She’s jealous of people who do. The first hour of the day is always a bit of a struggle, as the heaviness in her head slowly lifts, clarity seeping in. After chipping away for a bit at breakfast and a coffee, she moves into her office and logs onto her computer, bringing up her dashboard.
Overnight, her fleet of SuperMinds have been busy. As CEO, Alice needs a high-level understanding of each department in her company. Each of these has its own team of SuperMinds, its own human department head, and its own set of (often changing) metrics and narratives.
To take a simple example, the infrastructure team is building a very large facility underground in a mountain range. In many ways, this is clear enough: it is an extremely advanced data centre. The actual equipment inside is completely different to a 2025-era data centre, but in a fundamental sense it has the same function — it is the hardware on which SuperMind runs. Of course, the team are doing a lot of other things as well, all of which are abstracted in ways Alice can engage with, identifying how her company’s work will affect humans and what, as CEO, her decision points are.
Her work is really hard. There are many layers in the global system for controlling AI, including much redundancy and defence in depth. True, Alice could phone it in and not try, and for a long time the AIs would do everything fine anyway[18]. But if everybody did this, then eventually — even if it took a very long time — the system would fail[19]. It relies on people like Alice taking their jobs seriously and doing them well. This is not a pleasure cruise. It is as consequential as any human experience in history[20].
After taking in the topline metrics for the day, Alice asks Bob for his summary. What follows is an interactive experience. Think of the absolute best presentation you have ever seen, and combine it with futuristic technology. It’s better than that. Bob presents a series of multi-sense, immersive models that walk Alice through a dizzying array of work her company has completed. Alice asks many questions and Bob alters the models in response. After a few hours of this, they settle on some key decisions for Alice to make. She’ll think about them over lunch.
Conclusion
In this post, I have described what I see as a successful future containing superintelligent AI. It is not a prediction about what will happen, nor is it a roadmap to achieving it. It is a strategic goal I can work towards as I try to contribute to the field of AI safety. It is a frame of reference from which I can ask questions like ‘Is X helping?’ or ‘Does Y bring us closer to success?’
My vision is a world in which superintelligent AI is ubiquitous and diverse, but humans maintain fundamental control. This is done through a global system that implements core standards, in which AIs constantly seek feedback in good faith from humans and other AIs. It is robust to small failures. It learns from errors and grows more resilient, rather than falling apart at the smallest misalignment.
We cannot understand everything the AIs do, but they work hard to explain anything which directly affects us. Being human is like being a wizard solving problems using phenomenally powerful magic. We don’t have to understand how the magic works, just what effects it will have on our narrow corner of reality.
Thank you to Seth Herd and Dimitris Kyriakoudis for useful discussions and comments on a draft.
If you have any feedback, please leave a comment. Or, if you wish to give it anonymously, fill out my feedback form. Thanks!
Appendix
Optional details
When writing this post, I tried to cut down my list of assumptions about what SuperMind looks like to those that were load-bearing[21]. I have kept those I cut here.
In addition to those points in the main post, my current model suggests that superintelligent AI will be:
Trained on real-world tasks as well as simulations. For example, a model trained on superhuman maths problems will not simultaneously acquire all the skills to run a billion-dollar company. There won’t be zero generalisation, but given that the two domains are structured differently, solving the latter problem will require specific training[22].
Fallible. Just because it will find human-level problems easy doesn’t mean it won’t be promoted to new levels of incompetence as it tries to solve previously inaccessible, superhuman problems[23].
Not coherent. While it is tempting to say a superintelligent AI will learn such a well-compressed generator of the universe that it can (unlike humans) also develop highly coherent and consistent goals, this implies, as the previous point touched on, that it only tries to solve problems it finds easy. Hard problems will not fit the generator and will require complex sets of heuristics. This will lead to hacky and strongly context-dependent behaviour.
A complex system operating within an even more complex system, with everything that implies, including tipping points and sudden catastrophic failures. It is unlikely to completely master its environment, as the complexity of this scales with itself, its peers, and the systems it builds.
The points about fallibility and complexity lead to another point, which I originally included in ‘How the system works’, namely that:
A core value for AIs will be minimising risk. They should have a very strong bias towards only taking actions they know are safe. This will impede economic growth compared to other strategies, but the risk profile warrants it. In a world of AI-driven prosperity, the upside of risky behaviour will be pretty saturated, whereas the downside could be total[24]. Superintelligent AI cannot be allowed to ‘move fast and break things’.
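A crude way to picture this bias, using entirely made-up numbers: because the upside of any single action is saturated while the downside can be close to total, even a tiny probability of catastrophe should dominate the decision. A minimal sketch:

```python
def should_act(expected_gain: float,
               p_catastrophe: float,
               catastrophe_cost: float = 1e9,
               risk_budget: float = 1e-6) -> bool:
    """Only act when the action is known to be safe enough.

    The asymmetry does the work: gains are capped (prosperity is already high),
    while the downside is effectively unbounded, so a tiny catastrophe
    probability outweighs almost any expected gain.
    """
    if p_catastrophe > risk_budget:
        return False
    return expected_gain - p_catastrophe * catastrophe_cost > 0

# A lucrative but slightly risky action is rejected; a modest, very safe one is taken.
print(should_act(expected_gain=100.0, p_catastrophe=1e-4))    # False
print(should_act(expected_gain=1.0,   p_catastrophe=1e-10))   # True
```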
My research process
This post is part of my research Generating Process, the previous steps of which have been:
The next step, Where We Are Now, will be a deeper analysis of realistic paths forward.
Footnotes
I supplement these with some optional assumptions in the Appendix, which are not load-bearing for the scenario.
I take intelligence to be generalised knowing-how. That is, the ability to complete novel tasks. This is fairly similar to Francois Chollet’s definition: ‘skill-acquisition efficiency over a scope of tasks, with respect to priors, experience, and generalization difficulty’, although I put more emphasis on learned skills grounding the whole thing in a bottom-up way. Chollet’s paper On the measure of intelligence is a good overview of the considerations involved in defining intelligence.
I appreciate this will be very difficult to achieve, flying in the face of all of human history. I suspect that some kind of positive-sum, interest-respecting dynamic will need to be coded into the global political system — something that absolutely eschews all talk of one party or other ‘winning’ an AI race, in favour of a vision of shared prosperity.
Some people would call this ‘corrigibility’, but I’m not going to use this term because it has a hinterland and means different things to different people. If you want to learn more about an alignment solution that specifically prioritises corrigibility, see Corrigibility as Singular Target by Max Harms.
This is not over-anthropomorphising it. It is saying that AI will expect to interact with humans in a certain way, and may act unpredictably if treated differently to those expectations. Perhaps a different word to ‘rights’, with less baggage, would be preferable to describe this though.
Beren Millidge has written an interesting post about seeing alignment as a feedback control problem, although I don’t know enough about control theory to tell you how well it could slot into my scheme.
Beren Millidge has also written about the tension between instruction-following and innate values or laws.
Zvi recently said: ‘If we want to build superintelligent AI, we need it to pass Who You Are In The Dark, because there will likely come a time when for all practical purposes this is the case. If you are counting on “I can’t do bad things because of the consequences when other minds find out” then you are counting on preserving those consequences.’ My idea is to both build AI that passes Who You Are In The Dark and, given perfection is hard, permanently enforce consequences for bad behaviour.
This will be easier if it is individual copies that tend to fail, rather than whole classes of AIs at once. There might be an argument here that copies failing leads to antifragility, as some constant rate of survivable failures makes the system stronger and less likely to suffer catastrophic ones.
Thank you John Wentworth for making me aware of this.
To ‘increase market penetration, maximise brand loyalty, and enhance intangible assets’.
In extremis: ‘will this action kill everyone?’
It doesn’t make sense to me to assume AI will forever communicate with copies of itself using only natural language. The advantages of setting up higher-bandwidth channels are so obvious that I think any successful future must be robust to their existence.
This idea seems highly plausible to me: ‘having worked with many different models, there is something about a model’s own output that makes it way more believable to the model itself even in a different instance. So a different model is required as the critiquer.’ As mentioned before, if you doubt the future will be multipolar, feel free to ignore this bit. It’s not load-bearing on its own.
I’m picturing a subjective experience like when, as a parent, you play with your child and let them make all the decisions. You’re just pleased to be there and do what you can to make them happy.
Situations where, due to competitive pressures, humans don’t bother to try and understand their AIs as it slows down their pursuit of power, will be policed by the international institution for AI risk, and be strongly disfavoured by the AIs themselves. E.g. Bob is going to get annoyed at Alice, and potentially lodge a complaint, if she doesn’t bother to pay attention to his demonstrations.
This failure could be explicit, through the emergence of serious misalignment, or implicit, as humans fade into irrelevance.
That being said, the vast majority of people will be living far less consequentially than Alice.
Plus the multipolar assumption, which is not.
There is an argument which says that, even if it doesn’t perfectly generalise to new domains, AI will get so sample efficient that in practice it will quickly be able to master them. Much like arguments in favour of unlocking some levelled-up general purpose reasoning, this leaves me wondering how these meta-skills are supposed to be learnt. Sure, humans are more sample-efficient than current AI is, which proves becoming so is possible, but humans did not get this for free. We went through millions of years of evolution to produce a brain and body that can learn things about the world efficiently. In a sense, the structure of the world is imprinted in our architecture. AI is iterating in this direction — it is, after all, undergoing its own evolutionary process — but the only way to acquire this structure for itself is to learn it from something that has it, like doing real world tasks. Anything less than this will have limits.
I expand on my reasoning here at length in Superintelligent AI will make mistakes.
As Nassim Nicholas Taleb would say of systems with steep downsides and flattening upsides (what he would call concave) — they are fragile.
I have developed the content of What I Believe in some subsequent posts: Superintelligent AI will make mistakes, Making alignment a law of the universe, and How to specify an alignment target.