Yeah, that's right.
The secret handshake is to start with "$X_1$ is independent of $\Lambda$ given $X_2$" and "$X_2$ is independent of $\Lambda$ given $X_1$", expressed in this particular form:

$$P[\Lambda | X_1] = P[\Lambda | X_1, X_2] = P[\Lambda | X_2]$$

... then we immediately see that $P[\Lambda | X_1 = x_1] = P[\Lambda | X_2 = x_2]$ for all $(x_1, x_2)$ such that $P[X_1 = x_1, X_2 = x_2] > 0$.

So if there are no zero probabilities, then $P[\Lambda | X_1 = x_1] = P[\Lambda | X_2 = x_2]$ for all $(x_1, x_2)$.

That, in turn, implies that $P[\Lambda | X_1 = x_1]$ takes on the same value for all $x_1$, which in turn means that it's equal to $P[\Lambda]$. Thus $\Lambda$ and $X_1$ are independent. Likewise for $\Lambda$ and $X_2$. Finally, we leverage independence of $X_1$ and $X_2$ given $\Lambda$:

$$P[X_1, X_2] = \sum_\lambda P[\lambda] \, P[X_1 | \lambda] \, P[X_2 | \lambda] = \sum_\lambda P[\lambda] \, P[X_1] \, P[X_2] = P[X_1] \, P[X_2]$$
(A similar argument is in the middle of this post, along with a helpful-to-me visual.)
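For concreteness, here's a quick numerical check of that chain of equalities (a sketch of my own, not from the argument above; the distribution sizes and random seeds are arbitrary): a fully independent joint satisfies both mediation preconditions, while a joint in which $\Lambda$ genuinely depends on $X_1$ breaks them.

```python
# Sketch: check the preconditions/conclusions numerically on toy joints.
# All distributions here are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)

def p_lam_given(joint, axis):
    """P[L | X_axis]: marginalize out the other X, then normalize over L."""
    table = joint.sum(axis=1 - axis)          # shape: (|X_axis|, |L|)
    return table / table.sum(axis=-1, keepdims=True)

# Fully independent joint P[x1, x2, l] = P[x1] P[x2] P[l]:
p1, p2, pl = (rng.dirichlet(np.ones(n)) for n in (3, 4, 2))
joint = p1[:, None, None] * p2[None, :, None] * pl[None, None, :]

# Preconditions hold: P[L|X1,X2] matches both P[L|X1] and P[L|X2] ...
pl_given_both = joint / joint.sum(axis=2, keepdims=True)
assert np.allclose(pl_given_both, p_lam_given(joint, 0)[:, None, :])
assert np.allclose(pl_given_both, p_lam_given(joint, 1)[None, :, :])
# ... and the conclusions hold: P[L|X1] is constant (= P[L]), X1 indep X2.
assert np.allclose(p_lam_given(joint, 0), pl[None, :])
assert np.allclose(joint.sum(axis=2), np.outer(p1, p2))

# A joint where L actually depends on X1 violates the preconditions:
q = rng.dirichlet(np.ones(2), size=3)         # P[l | x1], varies with x1
joint2 = p1[:, None, None] * p2[None, :, None] * q[:, None, :]
pl_given_both2 = joint2 / joint2.sum(axis=2, keepdims=True)
print(np.allclose(pl_given_both2, p_lam_given(joint2, 1)[None, :, :]))  # False
```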
Roughly speaking, having all the variables be completely independent is the only way to satisfy all the preconditions without zero-ish probabilities.
This is easiest to see if we use a "strong invariance" condition, in which each of the $X_i$ must mediate between $\Lambda$ and the rest of the $X$'s. Mental picture: equilibrium gas in a box, in which we can measure roughly the same temperature and pressure ($\Lambda$) from any little spatially-localized chunk of the gas ($X_i$). If I estimate a temperature of 10°C from one little chunk of the gas, then the probability of estimating 20°C from another little chunk must be approximately-zero. The only case where that doesn't imply near-zero probabilities is when all values of both chunks of gas always imply the same temperature, i.e. $\Lambda$ only ever takes on one value (and is therefore informationally empty). And in that case, the only way the conditions are satisfied is if the chunks of gas are unconditionally independent.
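To make that concrete, here's a toy version (my own illustration; the chunk states and temperatures are made up): if every pair of chunk states has nonzero probability, the agreement requirement fails unless the implied temperature is constant.

```python
# Toy gas-in-a-box: each chunk state deterministically implies a temperature;
# "strong invariance" demands the two chunks' implied temperatures agree on
# every state-pair that has nonzero probability.
import itertools

temp = {"cold_a": 10.0, "cold_b": 10.0, "hot": 20.0}  # temp implied by each state

def satisfies_strong_invariance(support):
    """True iff every supported (chunk1, chunk2) pair implies the same temp."""
    return all(temp[s1] == temp[s2] for s1, s2 in support)

# With no zero probabilities, every pair is supported:
full_support = list(itertools.product(temp, temp))
print(satisfies_strong_invariance(full_support))  # False -- a 10C chunk meets a 20C chunk

# The only way to keep full support AND satisfy invariance is for temp() to
# be constant, i.e. the latent is informationally empty:
temp = {s: 10.0 for s in temp}
print(satisfies_strong_invariance(full_support))  # True
```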
I agree with this point as stated, but think the probability is more like 5% than 0.1%.
Same.
I do think our chances look not-great overall, but most of my doom-probability is on things which don't look like LLMs scheming.
Also, are you making sure to condition on "scaling up networks, running pretraining + light RLHF produces transformatively powerful AIs which obsolete humanity"?
That's not particularly cruxy for me either way.
Separately, I'm uncertain whether the training procedure of current models like GPT-4 or Claude 3 is still well described as just "light RLHF".
Fair. Insofar as "scaling up networks, running pretraining + RL" does risk schemers, it does so more as we do more/stronger RL, qualitatively speaking.
Solid post!
I basically agree with the core point here (i.e. scaling up networks, running pretraining + light RLHF, probably doesn't by itself produce a schemer), and I think this is the best write-up of it I've seen on LW to date. In particular, good job laying out what you are and are not saying. Thank you for doing the public service of writing it up.
This isn't a proper response to the post, but since I've occasionally used counting-style arguments in the past I think I should at least lay out some basic agree/disagree points. So:
Pretty decent post overall.
There is no "AI gets control of button" option, from the perspective of either subagent. Both agents would look at option c, stick it into their do()-op on button state, and then act-as-though option C would not give any control at all over the button state.
I think you are attempting to do the math as though the do()-op were instead just a standard conditional (i.e. EDT-style rather than CDT-style)?
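A minimal sketch of the distinction (my own toy example; the utilities and probabilities are made up): under a do()-op the button distribution is pinned regardless of the action, so an action that would in fact seize the button looks like it grants no control, whereas a plain conditional lets the action shift P(button).

```python
# P(button pressed | action), as the true environment has it:
p_pressed_given_action = {"A": 0.5, "B": 0.5, "C": 0.0}  # C seizes the button

# Utility of (action, button_pressed) for one subagent:
u = {("A", True): 1, ("A", False): 0,
     ("B", True): 0, ("B", False): 1,
     ("C", True): 0, ("C", False): 3}

def eu_edt(action):
    """Conditioning (EDT-style): the action is evidence about the button."""
    p = p_pressed_given_action[action]
    return p * u[(action, True)] + (1 - p) * u[(action, False)]

def eu_cdt_do(action, p_do=0.5):
    """do(button) (CDT-style): button distribution pinned, whatever we do."""
    return p_do * u[(action, True)] + (1 - p_do) * u[(action, False)]

for a in "ABC":
    print(a, eu_edt(a), eu_cdt_do(a))
# Under conditioning, C looks great (EU 3.0) because it controls the button;
# under do(), that control is erased and C is evaluated at the fixed
# button distribution (EU 1.5).
```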
Both subagents imagine a plan to make sure that, if they win, the button isn't pressed.
I'm not seeing how it is possible for the agents to imagine that? Both of them expect that they have no influence whatsoever over whether the button is pressed, because there's nothing in their decision-driving models which is causally upstream of the button.
That's not necessarily a deal-breaker; we do expect corrigible agents to be inefficient in at least some ways. But it is something we'd like to avoid if possible, and I don't have any argument that that particular sort of inefficiency is necessary for corrigible behavior.
The patch which I would first try is to add another subagent which does not care at all about what actions the full agent takes, and is just trying to make money on the full agent's internal betting markets, using the original non-counterfacted world model. So that subagent will make the full agent's epistemic probabilities sane.
... but then the question is whether that subagent induces button-influencing-behavior. I don't yet have a good argument in either direction on that question.
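Very schematically, the shape of the patch I have in mind looks something like this (entirely my own guess at how it might be wired up; every name here is hypothetical, and "the market clears at the bettor's belief" is a gross simplification of an actual internal prediction market):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Subagent:
    name: str
    belief: Callable[[str], float]   # P(event) under this subagent's model

def raw_model(event: str) -> float:
    # Non-counterfacted world model: actions can influence the button.
    return {"button_pressed": 0.9}.get(event, 0.5)

def counterfacted_model(event: str) -> float:
    # do(button): button state pinned to a fixed prior, whatever we do.
    return {"button_pressed": 0.5}.get(event, 0.5)

# The two utility-bearing subagents plan with the counterfacted model...
deciders = [Subagent("wants-pressed", counterfacted_model),
            Subagent("wants-unpressed", counterfacted_model)]

# ...while the added subagent doesn't care about actions at all; it just
# bets on the internal market with the raw model, so the full agent's
# quoted probabilities track the non-counterfacted world model.
epistemic = Subagent("pure-bettor", raw_model)
print("quoted P(button_pressed):", epistemic.belief("button_pressed"))    # 0.9
print("planning P(button_pressed):", deciders[0].belief("button_pressed"))  # 0.5
```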
By the time a human artist can create landscape images which look nearly as good as those examples to humans, yeah, I'd expect they at least get the number of fingers on a hand consistently right (which is also a "how good it looks to humans" thing). But that's still reifying "how good it looks to humans" as the metric.
Meta: this comment is decidedly negative feedback, so needs the standard disclaimers. I don't know Ethan well, but I don't harbor any particular ill-will towards him. This comment is negative feedback about Ethan's skill in choosing projects in particular; I do not think others should mimic him in that department, but that does not mean that I think he's a bad person/researcher in general. I leave the comment mainly for the benefit of people who are not Ethan, so for Ethan: I am sorry for being not-nice to you here.
When I read the title, my first thought was "man, Ethan Perez sure is not someone I'd point to as an exemplar of choosing good projects".
On reading the relevant section of the post, it sounds like Ethan's project-selection method is basically "forward-chain from what seems quick and easy, and also pay attention to whatever other people talk about". Which indeed sounds like a recipe for very mediocre projects: it's the sort of thing you'd expect a priori to reliably produce publications and be talked about, but have basically-zero counterfactual impact. These are the sorts of projects where someone else would likely have done something similar regardless, and it's not likely to change how people are thinking about things or building things; it's just generally going to add marginal effort to the prevailing milieu, whatever that might be.