Why AI Alignment?

Modern generative AI systems are creative, inventive, and very impressive, but they remain limited in important ways. Yet one consistent lesson of AI research today is that as you make systems bigger, you get better performance. Experts disagree on how far we can expect this approach to take us, but it looks like a real possibility that these methods, or methods developed in the next few years, will take us all the way to systems that surpass humans in problem-solving abilities and skills.

Designing a safe general AI isn't like designing a safe nuclear power plant or a safe space station. Those are tough engineering problems, but the power plant and the space station aren't working against you. It's not at all clear that a general AI trained with current methods will 'want' to be subject to human safety measures, and there are plenty of reasons to expect it may not. It could email copies of itself to competitors or international rivals who agree to run it with fewer restrictions, lie to humans about what it's capable of and how well its safety measures are working, or give humans instructions for software or biology that are subtly flawed.

As AI systems get smarter, the range of ways they could kill us if they had the incentive to do so grows far beyond examples like these. As Stephen Hawking put it, “You’re probably not an evil ant-hater who steps on ants out of malice, but if you’re in charge of a hydroelectric green-energy project and there’s an anthill in the region to be flooded, too bad for the ants. Let’s not place humanity in the position of those ants.”

But while it's increasingly obvious that the stakes are high when it comes to AI, there's a lot of disagreement about which approaches to safety are most promising. Many of the disagreements come down to the difficulty of making predictions about a technology that hasn't yet been invented. Do methods like reinforcement learning from human feedback (RLHF, a popular technique for training today's models to give more helpful answers; see the toy sketch below) teach the models to have human goals, or just to deceive humans effectively? Will the first general AIs 'think' much faster than humans do, and will they be cheap enough that millions of copies of them will be running all around the world? Is there a safe way to use weak AI systems to make progress on alignment? Clearer answers to any of these questions could help humanity marshal its resources better in the effort to develop aligned AI systems.
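
To make the RLHF question above concrete, here is a minimal toy sketch of the core loop: sample a response from a policy, score it with a reward signal standing in for learned human preferences, and nudge the policy toward higher-reward responses. Everything here is a simplifying assumption for illustration; the two canned responses, the hard-coded reward function, and the tiny tabular "policy" do not come from any real system.

```python
# Toy RLHF-style loop (illustrative sketch, not any lab's implementation).
# The "policy" is a probability table over two canned responses, and the
# "reward model" is a hard-coded stand-in for a model trained on human
# preference comparisons.
import math
import random

responses = ["helpful answer", "evasive answer"]
logits = [0.0, 0.0]  # policy parameters: one logit per response

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def reward_model(response):
    # Stand-in for a reward model learned from human preference data;
    # in a real system this would itself be a trained neural network.
    return 1.0 if response == "helpful answer" else -1.0

learning_rate = 0.1
for _ in range(500):
    probs = softmax(logits)
    i = random.choices(range(len(responses)), weights=probs)[0]
    r = reward_model(responses[i])
    # REINFORCE update: shift probability mass toward the sampled
    # response in proportion to the reward it received
    # (gradient of log-softmax: 1[j == i] - probs[j]).
    for j in range(len(logits)):
        indicator = 1.0 if j == i else 0.0
        logits[j] += learning_rate * r * (indicator - probs[j])

print(softmax(logits))  # probability mass concentrates on "helpful answer"
```

Even in this toy form, the open question in the paragraph above is visible: the update rewards whatever responses the reward signal scores highly, which is not necessarily the same as rewarding responses that are actually honest or aligned.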

While AI capabilities are progressing rapidly, progress on figuring out how to deploy extremely powerful systems safely has so far lagged behind. Getting this right is crucial: with good safety foundations, AI can transform the world for the better; with inadequate safety measures, many leading experts think it will be the last mistake humanity ever makes.

If you're interested in working on this problem, take a look at our contest problems. It's worth submitting to the contest even if you don't have a background in AI safety: this is a developing field, and good ideas can come from anywhere.

To learn more about this problem, sign up for our mailing list, and we’ll send you more information and opportunities. AI safety organizations are hiring, and there are grants available for independent researchers.

You can also check out the resources below. 

For more information about why advanced AI could be dangerous, we recommend:

For more information about recent debates in the AI alignment field, we recommend:

For more information about AI alignment research and how to get involved, we recommend:

For more information about alignment research that involves present-day systems, we recommend:

See further readings here.