AI Alignment Forum

A lot of work to prevent AI existential risk takes the form of ensuring that AIs don’t want to cause harm or take over the world—or in other words, ensuring that they’re aligned. In this episode, I talk with Buck Shlegeris and Ryan Greenblatt about a different approach, called “AI control”: ensuring that AI systems couldn’t take over the world, even if they were trying to.

Topics we discuss:

What is AI control?
Protocols for AI control
Which AIs are controllable?
Preventing dangerous coded AI communication
Unpredictably uncontrollable AI
What control looks like
Is AI control evil?
Can red teams match misaligned AI?
How expensive is AI monitoring?
AI control experiments
GPT-4’s aptitude at inserting backdoors

...

(Continue Reading – 31835 more words)

1Vivek Hebbar9h

Since it's slightly hard to find: Paul references it here (ctrl+f for "interior") and links to this source (once again ctrl+f for "interior"). Paul also refers to it in this post. The term is actually "position of the interior" and apparently comes from military strategist Carl von Clausewitz.

Ryan Greenblatt31m20

Also some discussion in this thread.

1Vivek Hebbar9h

This google search seems to turn up some interesting articles (like maybe this one, though I've just started reading it).

8Ryan Greenblatt3d

Summary Here's a (simplified) summary of where I'm at: * My prefered usage of control involves only using AIs with both control and a low probability that these AIs have problematic beyond episde aims. We should also offer to pay AIs to reveal their longer run aims. We should shutdown usage of powerful AIs if we get strong evidence of these aims. (E.g. the AIs accept our payment and reveal evidence of these aims. Or more concerningly, these AIs try to escape despite not revealing their aims at an earlier point.) (I think this isn't sufficiently clear in this podcast, sorry.) * Suppose that we proceed in a case where we know AIs clearly have problematic BEGs and these AIs have accepted some sort of deal to work for us under a control regime. These AIs can still operate as private citizens (ideally with some legal rights) in many ways outside of their "normal jobs" as long as this presents minimal risk. E.g., if they want to use their payment to build a big AI vacation center with robots they control (as a cartoonish example), they can do this as long as this presents minimal risk of being able to seize control of the world. We'd probably by default want to prohibit communication from the vacation AIs to the working AIs, but this doesn't seem that bad IMO? Overall, this is basically the same as humans working at defense contractors, they are restricted and monitored in various ways at work, but can do whatever on vacations other than building weapons. * As far as relaxing control on AIs goes, its worth emphasizing that control is an interim solution. I expect that AIs won't be heavily utilized with control measures for more than 20 years and my default guess is more like 2 years. (Of course, this might be a long time from the subjective perspective of these AIs.) We'd could transition out of a control regime by having an extremely robust approach to intent alignment which ensures that AIs are fully corrigible (and thus don't have preferences) or via the world dec

Fabien's Shortform

Fabien Roger

1mo

21Fabien Roger5d

I listened to The Failure of Risk Management by Douglas Hubbard, a book that vigorously criticizes qualitative risk management approaches (like the use of risk matrices), and praises a rationalist-friendly quantitative approach. Here are 4 takeaways from that book: * There are very different approaches to risk estimation that are often unaware of each other: you can do risk estimations like an actuary (relying on statistics, reference class arguments, and some causal models), like an engineer (relying mostly on causal models and simulations), like a trader (relying only on statistics, with no causal model), or like a consultant (usually with shitty qualitative approaches). * The state of risk estimation for insurances is actually pretty good: it's quantitative, and there are strong professional norms around different kinds of malpractice. When actuaries tank a company because they ignored tail outcomes, they are at risk of losing their license. * The state of risk estimation in consulting and management is quite bad: most risk management is done with qualitative methods which have no positive evidence of working better than just relying on intuition alone, and qualitative approaches (like risk matrices) have weird artifacts: * Fuzzy labels (e.g. "likely", "important", ...) create illusions of clear communication. Just defining the fuzzy categories doesn't fully alleviate that (when you ask people to say what probabilities each box corresponds to, they often fail to look at the definition of categories). * Inconsistent qualitative methods make cross-team communication much harder. * Coarse categories mean that you introduce weird threshold effects that sometimes encourage ignoring tail effects and make the analysis of past decisions less reliable. * When choosing between categories, people are susceptible to irrelevant alternatives (e.g. if you split the "5/5 importance (loss > $1M)" category into "5/5 ($1-10M), 5/6 ($10-100M), 5/7 (>$100M)", people a

1romeostevensit2d

Is there a short summary on the rejecting Knightian uncertainty bit?

Fabien Roger1d34

By Knightian uncertainty, I mean "the lack of any quantifiable knowledge about some possible occurrence" i.e. you can't put a probability on it (Wikipedia).

The TL;DR is that Knightian uncertainty is not a useful concept to make decisions, while the use subjective probabilities is: if you are calibrated (which you can be trained to become), then you will be better off taking different decisions on p=1% "Knightian uncertain events" and p=10% "Knightian uncertain events".

For a more in-depth defense of this position in the context of long-term prediction... (read more)

How I select alignment research projects

Ethan Perez, Henry Sleight, Mikita Balesni

Youtube Video

Recently, I was interviewed by Henry Sleight and Mikita Balesni about how I select alignment research projects. Below is the slightly cleaned up transcript for the YouTube video.

Introductions

Henry Sleight: How about you two introduce yourselves?

Ethan Perez: I'm Ethan. I'm a researcher at Anthropic and do a lot of external collaborations with other people, via the Astra Fellowship and SERI MATS. Currently my team is working on adversarial robustness, and we recently did the sleeper agents paper. So, basically looking at we can use RLHF or adversarial training or current state-of-the-art alignment safety training techniques to train away bad behavior. And we found that in some cases, the answer is no: that they don't train away hidden goals or backdoor behavior and models. That was a lot of...

(Continue Reading – 6989 more words)

6johnswentworth4d

Meta: this comment is decidedly negative feedback, so needs the standard disclaimers. I don't know Ethan well, but I don't harbor any particular ill-will towards him. This comment is negative feedback about Ethan's skill in choosing projects in particular, I do not think others should mimic him in that department, but that does not mean that I think he's a bad person/researcher in general. I leave the comment mainly for the benefit of people who are not Ethan, so for Ethan: I am sorry for being not-nice to you here. ---------------------------------------- When I read the title, my first thought was "man, Ethan Perez sure is not someone I'd point to as an examplar of choosing good projects". On reading the relevant section of the post, it sounds like Ethan's project-selection method is basically "forward-chain from what seems quick and easy, and also pay attention to whatever other people talk about". Which indeed sounds like a recipe for very mediocre projects: it's the sort of thing you'd expect a priori to reliably produce publications and be talked about, but have basically-zero counterfactual impact. These are the sorts of projects where someone else would likely have done something similar regardless, and it's not likely to change how people are thinking about things or building things; it's just generally going to add marginal effort to the prevailing milieu, whatever that might be.

Ethan Perez4d1414

Yeah, some caveats I should've added in the interview:

Don't listen to my project selection advice if you don't like my research
The forward-chaining -style approach I'm advocating for is controversial among the alignment forum community (and less controversial in the ML/LLM research community and to some extent among LLM alignment groups)
1. Part of why I like this approach is that I (personally) think there are at least some somewhat promising agendas out there, that aren't getting executed on enough (or much at all), and it's doable to e.g. double the amount

... (read more)

1Michaël Trazzi6d

Links for the audio: Spotify, Apple Podcast, Google Podcast

11Michaël Trazzi6d

Claude Opus summary (emphasis mine): 1. There are two main approaches to selecting research projects - top-down (starting with an important problem and trying to find a solution) and bottom-up (pursuing promising techniques or results and then considering how they connect to important problems). Ethan uses a mix of both approaches depending on the context. 2. Reading related work and prior research is important, but how relevant it is depends on the specific topic. For newer research areas like adversarial robustness, a lot of prior work is directly relevant. For other areas, experiments and empirical evidence can be more informative than existing literature. 3. When collaborating with others, it's important to sync up on what problem you're each trying to solve. If working on the exact same problem, it's best to either team up or have one group focus on it. Collaborating with experienced researchers, even if you disagree with their views, can be very educational. 4. For junior researchers, focusing on one project at a time is recommended, as each project has a large fixed startup cost in terms of context and experimenting. Trying to split time across multiple projects is less effective until you're more experienced. 5. Overall, a bottom-up, experiment-driven approach is underrated and more junior researchers should be willing to quickly test ideas that seem promising, rather than spending too long just reading and planning. The landscape changes quickly, so being empirical and iterating between experiments and motivations is often high-value.

Who models the models that model models? An exploration of GPT-3's in-context model fitting ability

Lovre

Introduction

Much has been written and much has been observed about the abilities of GPT-3 on many tasks. Most of these capabilities, though not all, pertain to writing convincing text, but, not to undermine GPT-3's impressiveness at performing these tasks, we might call this the predictable part of its oeuvre. It only makes sense that a better language modelling is, well, going to be better at writing text.

Deeper and comparatively much less explored is the unpredictable ability of GPT-3 to learn new tasks by just seeing a few examples, without any training/backpropagation – the so called in-context learning (sometimes called metalearning).

The original paper announcing GPT-3 contained a handful of examples (perhaps mostly notably examples of GPT-3 learning to perform arithmetic, e.g. accurate addition of up to 5-digit numbers),...

(Continue Reading – 2394 more words)

gwern4d50

Another: "From Words to Numbers: Your Large Language Model Is Secretly A Capable Regressor When Given In-Context Examples", Vacareanu et al 2024:

We analyze how well pre-trained large language models (e.g., Llama2, GPT-4, Claude 3, etc) can do linear and non-linear regression when given in-context examples, without any additional training or gradient updates. Our findings reveal that several large language models (e.g., GPT-4, Claude 3) are able to perform regression tasks with a performance rivaling (or even outperforming) that of traditional supervised

Victoria Krakovna, Vladimir Mikulik, Matthew Rahtz, Tom Everitt, Zac Kenton, Jan Leike

(Originally posted to the Deepmind Blog)

Specification gaming is a behaviour that satisfies the literal specification of an objective without achieving the intended outcome. We have all had experiences with specification gaming, even if not by this name. Readers may have heard the myth of King Midas and the golden touch, in which the king asks that anything he touches be turned to gold - but soon finds that even food and drink turn to metal in his hands. In the real world, when rewarded for doing well on a homework assignment, a student might copy another student to get the right answers, rather than learning the material - and thus exploit a loophole in the task specification.

This problem also arises in the design of artificial agents. For

...

(Continue Reading – 1702 more words)

Luke H Miles4d10

I would watch a ten hour video of this. (It may also be more persuasive to skeptics.)