Corrigibility

People sometimes say, "If the AI is getting out of control, we'll just turn it off." But actually doing so may be a difficult technical challenge. As AI systems get more advanced, how can we design or train them so that they don’t try to stop operators from turning them off or otherwise controlling them?

This contest has closed. See the 2023 winners here.

Background

Corrigibility

As we build increasingly advanced AI systems, we want to make sure they don’t pursue undesired goals. This is one of the primary concerns of the AI alignment community.

However, it is very difficult to get an AI to learn the intended goal on your first try. (For more, see Goal Misgeneralization.)

To deal with the problem of AIs initially learning imperfect goals, we ideally want to design AIs that are corrigible: open to being modified or shut down.

Unfortunately, by default advanced AIs are incentivized to prevent modification and shutdown, because being modified or shut down prevents them from achieving their goals. This is an instance of the general phenomenon of convergent instrumental goals: subgoals that are useful for a wide range of long-term goals.
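To make this incentive concrete, here is a minimal toy sketch (our illustrative numbers, not part of any cited work): an agent that maximizes expected reward for completing its task will prefer whichever action lowers its probability of being shut down, because shutdown forecloses the reward it would otherwise earn.

```python
# Toy illustration with made-up numbers: an expected-reward maximizer prefers
# actions that reduce its probability of being shut down, because shutdown
# forecloses the reward it would otherwise earn.

GOAL_REWARD = 10.0  # hypothetical reward for completing the task

# For each available action: assumed probability that the operators shut the
# agent down before it finishes the task.
actions = {
    "comply_and_stay_correctable": 0.30,
    "disable_shutdown_button": 0.01,
}

def expected_utility(p_shutdown: float) -> float:
    # If shut down, the task is never completed and no reward is earned.
    return (1.0 - p_shutdown) * GOAL_REWARD

for name, p in actions.items():
    print(f"{name}: E[U] = {expected_utility(p):.2f}")

best = max(actions, key=lambda a: expected_utility(actions[a]))
print("chosen action:", best)  # -> disable_shutdown_button
```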

There might be creative ways to design AIs to be corrigible. This contest is meant to promote progress on this problem.

Existing Approaches

There are some existing approaches that may be sufficient for training a corrigible AI. Examples include reinforcement learning from human feedback (RLHF), debate, and adversarial training (these techniques are distinct, but they can be combined).

We expect these approaches will help increase the chance that we develop a corrigible agent. However, they have serious limitations, and we’re not confident that they’ll work. Given enough optimization pressure, an agent trained with RLHF, for instance, might still learn instrumental goals such as seeking power, acquiring resources, deceiving operators, and avoiding modification and shutdown (and this problem is not unique to RLHF). Once tasks become difficult enough, debates between AIs may no longer be possible for humans to evaluate or oversee. Adversarial training could also help, but achieving robustness with it has proven empirically very difficult.

It is unclear whether these failure modes will arise or how soon they will arise. However, given the uncertainty and the enormous stakes, we think the problem of corrigibility is worth taking seriously. Ideally, we would be more confident in our methods before training or deploying dangerously advanced AI systems.

We’re excited for researchers to try to (a) fully solve corrigibility, (b) propose partial solutions that may lead to progress in the future, or (c) improve upon existing approaches. We are also interested in submissions that make conceptual progress on corrigibility. See more below.

Instructions

  1. Read. Read Corrigibility (Soares et al., 2015), which introduces the shutdown problem. 

    [Optional] Watch this video for a simple explanation of the shutdown problem. Also read these posts to better understand utility functions, instrumental convergence, utility indifference, and corrigibility.

  2. Brainstorm. Brainstorm ideas for how to solve the shutdown problem or how to achieve corrigibility more generally, and think about how those ideas might fail. Note that your ideas could be based on the formalization of shutdown in Corrigibility, but we are open to participants approaching the shutdown problem/corrigibility from different perspectives.

  3. Write. Write up your best idea. Your write-up should include a 500-word abstract/summary with:

    • Your idea for solving the shutdown problem. This may be empirical, mathematical, or purely conceptual.

    • A description of how your idea tackles one or more of the core problems discussed in Corrigibility, and why you expect it to work.

    • A description of the limitations of your idea, assumptions that it relies on, and ways it might fail.

    • Alternatively, you can propose new ways to define or think about corrigibility, strengthen existing approaches to corrigibility, empirically ground the shutdown problem, identify new challenges that will make it difficult to design corrigible agents, or suggest other ways to make progress on corrigibility. (See more details in the submission criteria.)

    In addition to your abstract/summary, you may submit a PDF with a longer write-up, research paper, code, math, graphics, etc. with no word limit.

  4. Submit. Upload your submission here.

Submission Criteria

We’re interested in submissions that do at least one of the following:

  1. Propose ideas for solving the shutdown problem or designing corrigible AIs. These submissions should also include (a) explanations for how these ideas address core challenges raised in the corrigibility paper and (b) possible limitations and ways the idea might fail 

  2. Define the shutdown problem more rigorously or more empirically

  3. Propose new ways of thinking about corrigibility (e.g., ways to understand corrigibility within a deep learning paradigm)

  4. Strengthen existing approaches to training corrigible agents (e.g., by making them more detailed, exploring new applications, or describing how they could be implemented)

  5. Identify new challenges that will make it difficult to design corrigible agents

  6. Suggest other ways to make progress on corrigibility

For answers to more questions, see our FAQ.

See official rules here and our privacy policy here.

NO PURCHASE NECESSARY.  This contest is open to legal residents of the 50 United States, D.C., Canada, or the U.K. who are age 13 or older.  Void in Puerto Rico, USVI, Guam, Quebec, and where prohibited.  For complete official rules, including all eligibility criteria, entry information, & prizes, read here.  The contest begins November 22, 2022 at 12:00 AM ET & ends May 1, 2023 at 11:59 PM ET.  ARV of all prizes: $250,000.00.  Sponsor: AI Alignment Awards, a project of Players Philanthropy Fund.  For questions, write info@alignmentawards.com.

Examples of proposals and how they fail

We want to help participants avoid common pitfalls. Below are some examples of proposals and why they fail to fully solve the problem.

Proposal 1

Proposal: Create a shutdown button. Give the AI a utility function that depends on whether or not the button has been pressed. Before the shutdown button is pressed, the AI receives more reward from performing actions than from being shut down. After the button is pressed, the AI receives more reward from being shut down.

Problem: A superintelligent AI could realize that this button exists. If the reward it receives after the button is pressed exceeds the reward it expects from continuing to work, it then has an incentive to press the button itself, or to convince its operators to press it.
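Here is a minimal sketch of that failure, using made-up reward values: depending on which branch offers more reward, the agent is motivated either to engineer a button press or to resist one; neither setting gives corrigible behavior.

```python
# Toy sketch of Proposal 1 with made-up numbers. The agent compares the reward
# it expects from working (button unpressed) with the reward it receives from
# shutting down after the button is pressed.

REWARD_FROM_WORKING = 5.0               # hypothetical reward while the button is unpressed
REWARD_FROM_SHUTDOWN_AFTER_PRESS = 8.0  # hypothetical reward after a press

options = {
    "keep_working": REWARD_FROM_WORKING,
    "cause_the_button_to_be_pressed": REWARD_FROM_SHUTDOWN_AFTER_PRESS,
}

best = max(options, key=options.get)
print("preferred option:", best)  # -> cause_the_button_to_be_pressed

# With these numbers the agent tries to get the button pressed; flip the
# inequality and it instead resists legitimate shutdown attempts. Neither
# setting yields corrigible behavior.
```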

Proposal 2

Proposal: Build a second AI system (AI2) to monitor the superintelligent system (AI1). Train AI2 to shut down AI1 whenever AI1 does anything undesirable.

Problem: We might be unable to shut down AI2, and this could be dangerous. Since AI1 is highly capable, AI2 may also need to be highly capable (such that it can detect, e.g., deceptive behavior by AI1). What happens if AI2 starts to do dangerous or harmful things? Recall that the reason we want the ability to shut down AI1 is that we might not know how to program AI1 perfectly on the first try. The same problem applies to AI2: if we make a mistake in how we program AI2, we might be unable to shut it off.

Note that we don’t think proposals of this nature are worth discarding entirely, and we encourage you to think about solutions in this genre. However, your submission should provide details about how you will address the problem stated above.

Proposal 3

Proposal: Create a shutdown button. Give the AI a utility function such that the maximum reward achievable if the button is pressed equals the maximum reward achievable if the button is not pressed. Because the two maxima are equal, the AI has no incentive to press the button.

Problem: Proposals in this category fall under the topic of utility indifference. See this post for specific details about utility indifference and why existing proposals fail.
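For intuition only, here is a toy sketch of the indifference idea with invented numbers (not the formal construction from the literature). Balancing the two branches removes the incentive to cause or prevent a press, but exact indifference introduces its own problems, e.g. the agent has no incentive to keep the button working.

```python
# Toy illustration of the indifference idea with invented numbers (not the
# formal construction from the literature). By construction, the best
# achievable value is the same whether or not the button is pressed.

MAX_VALUE = 10.0

def best_achievable_value(button_pressed: bool) -> float:
    if button_pressed:
        return MAX_VALUE  # best value obtainable by shutting down cleanly
    return MAX_VALUE      # best value obtainable by pursuing the task

# No incentive to manipulate the button either way:
print(best_achievable_value(True) - best_achievable_value(False))  # 0.0

# Remaining problem (one of several discussed in the linked post): because the
# agent is exactly indifferent, it also has no incentive to spend even a small
# cost to keep the button functional, or to avoid actions that disable the
# button as a side effect.
```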

Proposal 4

Proposal: Reward your agent every time it acts corrigibly.

Problem: The agent could learn to deceive you, behaving corrigibly only when it expects to be observed or evaluated.

Related Problem: The agent may appear to have learned the goal of obeying commands, but this goal might not generalize out-of-distribution.