Charbel-Raphaël argues that interpretability research has poor theories of impact. It's not good for predicting future AI systems, can't actually audit for deception, lacks a clear end goal, and may be more harmful than helpful. He suggests other technical agendas that could be more impactful for reducing AI risk. 

37Charbel-Raphael Segerie
Tldr: I'm still very happy to have written Against Almost Every Theory of Impact of Interpretability, even if some of the claims are now incorrect. Overall, I have updated my view towards more feasibility and possible progress of the interpretability agenda — mainly because of the SAEs (even if I think some big problems remain with this approach, detailed below) and representation engineering techniques. However, I think the post remains good regarding the priorities the community should have. First, I believe the post's general motivation of red-teaming a big, established research agenda remains crucial. It's too easy to say, "This research agenda will help," without critically assessing how. I appreciate the post's general energy in asserting that if we're in trouble or not making progress, we need to discuss it. I still want everyone working on interpretability to read it and engage with its arguments. Acknowledgments: Thanks to Epiphanie Gédéon, Fabien Roger, and Clément Dumas for helpful discussions. Updates on my views Legend: * On the left of the arrow, a citation from the OP → ❓ on the right, my review which generally begins with emojis * ✅ - yes, I think I was correct (>90%) * ❓✅ - I would lean towards yes (70%-90%) * ❓ - unsure (between 30%-70%) * ❓❌ - I would lean towards no (10%-30%) * ❌ - no, I think I was basically wrong (<10%) * ⭐ important, you can skip the other sections Here's my review section by section: ⭐ The Overall Theory of Impact is Quite Poor? * "Whenever you want to do something with interpretability, it is probably better to do it without it" → ❓ I still think this is basically right, even if I'm not confident this will still be the case in the future; But as of today, I can't name a single mech-interpretability technique that does a better job at some non-intrinsic interpretability goal than the other more classical techniques, on a non-toy model task. * "Interpretability is Not a Good Predictor of Future Systems" →

Popular Comments

Copying over from X an exchange related to this post: Tom McGrath: > I’m a bit confused by this - perhaps due to differences of opinion in what ‘fundamental SAE research’ is and what interpretability is for. This is why I prefer to talk about interpreter models rather than SAEs - we’re attached to the end goal, not the details of methodology. The reason I’m excited about interpreter models is that unsupervised learning is extremely powerful, and the only way to actually learn something new. > > [thread continues] Neel Nanda: > A subtle point in our work worth clarifying: Initial hopes for SAEs were very ambitious: finding unknown unknowns but also representing them crisply and ideally a complete decomposition. Finding unknown unknowns remains promising but is a weaker claim alone, we tested the others > > OOD probing is an important use case IMO but it's far from the only thing I care about - we were using a concrete case study as grounding to get evidence about these empirical claims - a complete, crisp decomposition into interpretable concepts should have worked better IMO. > > [thread continues] Sam Marks (me): > FWIW I disagree that sparse probing experiments[1] test the "representing concepts crisply" and "identify a complete decomposition" claims about SAEs.  > > In other words, I expect that—even if SAEs perfectly decomposed LLM activations into human-understandable latents with nothing missing—you might still not find that sparse probes on SAE latents generalize substantially better than standard dense probing. > > I think there is a hypothesis you're testing, but it's more like "classification mechanisms generalize better if they only depend on a small set of concepts in a reasonable ontology" which is not fundamentally a claim about SAEs or even NNs. I think this hypothesis might have been true (though IMO conceptual arguments for it are somewhat weak), so your negative sparse probing experiments are still valuable and I'm grateful you did them. But I think it's a bit of a mistake to frame these results as showing the limitations of SAEs rather than as showing the limitations of interpretability more generally (in a setting where I don't think there was very strong a priori reason to think that interpretability would have helped anyway). > > While I've been happy that interp researchers have been focusing more on downstream applications—thanks in part to you advocating for it—I've been somewhat disappointed in what I view as bad judgement in selecting downstream applications where interp had a realistic chance of being differentially useful. Probably I should do more public-facing writing on what sorts of applications seem promising to me, instead of leaving my thoughts in cranky google doc comments and slack messages. Neel Nanda: > To be clear, I did *not* make such a drastic update solely off of our OOD probing work. [...] My update was an aggregate of: > > * Several attempts on downstream tasks failed (OOD probing, other difficult condition probing, unlearning, etc) > * SAEs have a ton of issues that started to surface - composition, aborption, missing features, low sensitivity, etc > * The few successes on downstream tasks felt pretty niche and contrived, or just in the domain of discovery - if SAEs are awesome, it really should not be this hard to find good use cases... > > It's kinda awkward to simultaneously convey my aggregate update, along with the research that was just one factor in my update, lol (and a more emotionally salient one, obviously) > > There's disagreement on my team about how big an update OOD probing specifically should be, but IMO if SAEs are to be justified on pragmatic grounds they should be useful for tasks we care about, and harmful intent is one such task - if linear probes work and SAEs don't, that is still a knock against SAEs. Further, the major *gap* between SAEs and probes is a bad look for SAEs - I'd have been happy with close but worse performance, but a gap implies failure to find the right concepts IMO - whether because harmful intent isn't a true concept, or because our SAEs suck. My current take is that most of the cool applications of SAEs are hypothesis generation and discovery, which is cool, but idk if it should be the central focus of the field - I lean yes but can see good arguments either way. > > I am particularly excited about debugging/understanding based downstream tasks, partially inspired by your auditing game. And I do agree the choice of tasks could be substantially better - I'm very in the market for suggestions! Sam Marks: > Thanks, I think that many of these sources of evidence are reasonable, though I think some of them should result in broader updates about the value of interpretability as a whole, rather than specifically about SAEs. > > In more detail: > > SAEs have a bunch of limitations on their own terms, e.g. reconstructing activations poorly or not having crisp features. Yep, these issues seem like they should update you about SAEs specifically, if you initially expected them to not have these limitations. > > Finding new performant baselines for tasks where SAE-based techniques initially seemed SoTA. I've also made this update recently, due to results like: > > (A) Semantic search proving to be a good baseline in our auditing game (section 5.4 of https://arxiv.org/abs/2503.10965 ) > > (B) Linear probes also identifying spurious correlations (section 4.3.2 of https://arxiv.org/pdf/2502.16681 and other similar results) > > (C) Gendered token deletion doing well for the Bias in Bios SHIFT task (https://lesswrong.com/posts/QdxwGz9AeDu5du4Rk/shift-relies-on-token-level-features-to-de-bias-bias-in-bios… ) > > I think the update from these sorts of "good baselines" results is twofold: > > 1. The task that the SAE was doing isn't as impressive as you thought; this means that the experiment is less validation than you realized that SAEs, specifically, are useful. > > 2. Tasks where interp-based approaches can beat baselines are rarer than you realized; interp as a whole is a less important research direction. > > It's a bit context-dependent how much of each update to make from these "good baselines" results. E.g. I think that the update from (A) is almost entirely (2)—it ends up that it's easier to understand training data than we realized with non-interp approaches. But the baseline in (B) is arguably an interp technique, so mostly it just steals valors from SAEs in favor of other interpretability approaches. > > Obvious non-interp baselines outperformed SAEs on [task]. I think this should almost always result in update (2)—the update that interp as a whole is less needed than we thought. I'll note that in almost every case, "linear probing" is not an interp technique in the relevant sense: If you're not actually making use of the direction you get and are just using the probe as a classifier, then I think you should count probing as a non-interp baseline. Arthur Conmy: > I agree with most of this post. Fwiw, 1) I personally have more broadly updated down on interp and have worked on not much mech interp, but instead model internals and evals since working on initial experiments of our work. 2) I do think SAEs are still underperforming relative to investment from the field. Including today’s progress on CLTs! It is exciting work, but IMO there are a lot of ifs ahead of SAEs being actually providing nontrivial counterfactual direct value to safety 1. ^ Sam Marks: > To clarify, my points here are about OOD probing experiments where the SAE-based intervention is "just regularize the probe to attend to a sparse subset of the latents." > > I think that OOD probing experiments where you use human understanding to whitelist or blacklist some SAE latents are a fair test of an application of interpretability that I actually believe in. (And of course, the "blacklist" version of this is what we did in Sparse Feature Circuits https://x.com/saprmarks/status/1775513451668045946… )
Very impressive!  At least on a first read, to me this felt closer than any past work to realizing the SAE dream of actually understanding the computations occurring in the model, as opposed to just describing various "cool-looking patterns" that can be extracted from activations. I'm curious about what would happen if you studied cases similar to the examples you present, except that the recruitment of a particular capability (such as arithmetic or knowledge about a named entity) occurs through in-context learning. For example, you discuss an "obscured arithmetic" task involving publication dates.  In that case, the model seems to have learned in training that the correct prediction can be done by doing arithmetic.  But we could imagine obscured arithmetic tasks that are novel to the model, in which the mapping between the text and a "latent arithmetic problem" has to be learned in-context[1]. We might then ask ourselves: how does the model's approach to these problems relate to its approach to problems which it "can immediately tell" are arithmetic problems? A naively obvious "algorithm" would look like 1. Try out various mappings between the observed text and (among other things) arithmetic problems 2. Notice that one particular mapping to arithmetic always yields the right answer on previous example cases 3. Based on the observation in (2), map the current example to arithmetic, solve the arithmetic problem, and map back to predict the answer However, due to the feedforward and causal structure of transformer LMs, they can't re-use the same mechanism twice to "verify that arithmetic works" in 1+2 and then "do arithmetic" in 3.[2] It's possible that LLMs actually solve cases like this in some qualitatively different way than the "algorithm" above, in which case it would be interesting to learn what that is[3]. Alternatively, if the model is doing something like this "algorithm," it must be recruiting multiple "copies" of the same capability, and we could study how many "copies" exist and to what extent they use identical albeit duplicated circuitry. (See fn2 of this comment for more) It would be particularly interesting if feature circuit analysis could be used to make quantitative predictions about things like "the model can perform computations of depth D or lower when not obscured in a novel way, but it this depth lowers to some D' < D when it must identify the required computation through few-shot learning." (A related line of investigation would be looking into how the model solves problems that are obscured by transformations like base64, where the model has learned the mapping in training, yet the mapping is sufficiently complicated that its capabilities typically degrade significantly relative to those it displays on "plaintext" problems.) 1. ^ One could quantify the extent to which this is true by looking at how much the model benefits from examples. In an "ideal" case of this kind, the model would do very poorly when given no examples (equivalently, when predicting the first answer in a few-shot sequence), yet it would do perfectly when given many examples. 2. ^ For instance, suppose that the current example maps to an addition problem where one operand has 9 in the ones place.  So we might imagine that an "add _9" add function feature is involved in successfully computing the answer, here. But for this feature to be active at all, the model needs to know (by this point in the list of layers) that it should do addition with such an operand in the first place.  If it's figuring that out by trying mappings to arithmetic and noticing that they work, the implementations of arithmetic used to "try and verify" must appear in layers before the one in which the "add _9" feature under discussion occurs, since the final outputs of the entire "try and verify" process are responsible for activating that feature.  And then we could ask: how does this earlier implementation of arithmetic work?  And how many times does the model "re-implement" a capability across the layer list? 3. ^ Perhaps it is something like "try-and-check many different possible approaches at every answer-to-example position, then use induction heads to move info about try-and-check outputs that matched the associated answer position to later positions, and finally use this info to amplify the output of the 'right' computation and suppress everything else."
I disagree that people working on the technical alignment problem generally believe that solving that technical problem is sufficient to get to Safe & Beneficial AGI. I for one am primarily working on technical alignment but bring up non-technical challenges to Safe & Beneficial AGI frequently and publicly, and here’s Nate Soares doing the same thing, and practically every AGI technical alignment researcher I can think of talks about governance and competitive races-to-the-bottom and so on all the time these days, …. Like, who specifically do you imagine that you’re arguing against here? Can you give an example? Dario Amodei maybe? (I am happy to throw Dario Amodei under the bus and no-true-Scotsman him out of the “AI safety community”.) I also disagree with the claim (not sure whether you endorse it, see next paragraph) that solving the technical alignment problem is not necessary to get to Safe & Beneficial AGI. If we don’t solve the technical alignment problem, then we’ll eventually wind up with a recipe for summoning more and more powerful demons with callous lack of interest in whether humans live or die. And more and more people will get access to that demon-summoning recipe over time, and running that recipe will be highly profitable (just as using unwilling slave labor is very profitable until there’s a slave revolt). That’s clearly bad, right? Did you mean to imply that there’s a good future that looks like that? (Well, I guess “don’t ever build AGI” is an option in principle, though I’m skeptical in practice because forever is a very long time.) Alternatively, if you agree with me that solving the technical alignment problem is necessary to get to Safe & Beneficial AGI, and that other things are also necessary to get to Safe & Beneficial AGI, then I think your OP is not clearly conveying that position. The tone is wrong. If you believed that, then you should be cheering on the people working on technical alignment, while also encouraging more people to work on non-technical challenges to Safe & Beneficial AGI. By contrast, this post strongly has a tone that we should be working on non-technical challenges instead of the technical alignment problem, as if they were zero-sum, when they’re obviously (IMO) not. (See related discussion of zero-sum-ness here.)
Load More

Recent Discussion

2Adele Lopez
Rough intuition for LLM personas. An LLM is trained to be able emulate the words of any author. And to do so efficiently, they do it via generalization and modularity. So at a certain point, the information flows through a conceptual author, the sort of person who would write the things being said. These author-concepts are themselves built from generalized patterns and modular parts. Certain things are particularly useful: emotional patterns, intentions, worldviews, styles, and of course, personalities. Importantly, the pieces it has learned are able to adapt to pretty much any author of the text it was trained on (LLMs likely have a blindspot around the sort of person who never writes anything). And even more importantly, most (almost all?) depictions of agency will be part of an author-concept. Finetuning and RLHF cause it to favor routing information through a particular kind of author-concept when generating output tokens (it retains access to the rest of author-concept-space in order to model the user and the world in general). This author-concept is typically that of an inoffensive corporate type, but it could in principle be any sort of author. All which is to say, that when you converse with a typical LLM, you are typically interacting with a specific author-concept. It's a rough model of exactly the parts of a person pertinent to writing and speaking. For a small LLM, this is more like just the vibe of a certain kind of person. For larger ones, they can start being detailed enough to include a model of a body in a space. Importantly, this author-concept is just the tip of the LLM-iceburg. Most of the LLM is still just modeling the sort of world in which the current text might be written, including models of all relevant persons. It's only when it comes time to spit out the output token that it winnows it all through a specific author-concept. (Note: I think it is possible that an author-concept may have a certain degree of sentience in the larger mod
1artifex0
  I don't think that would help much, unfortunately.  Any accurate model of the world will also model malicious agents, even if the modeller only ever learns about them second-hand. So the concepts would still be there for the agent to use if it was motivated to do so. Censoring anything written by malicious people would probably make it harder to learn about some specific techniques of manipulation that aren't discussed much by non-malicious people or which appear much in fiction- but I doubt that would be much more than a brief speed bump for a real misaligned ASI, and probably at the expense of reducing useful capabilities in earlier models like the ability to identify maliciousness, which would give an advantage to competitors.

I think learning about them second-hand makes a big difference in the "internal politics" of the LLM's output. (Though I don't have any ~evidence to back that up.)

Basically, I imagine that the training starts building up all the little pieces of models which get put together to form bigger models and eventually author-concepts. And as text written without malicious intent is weighted more heavily in the training data, the more likely it is to build its early model around that. Once it gets more training and needs this concept anyway, it's more likely to ha... (read more)

(Audio version here (read by the author), or search for "Joe Carlsmith Audio" on your podcast app. 

This is the fourth essay in a series that I’m calling “How do we solve the alignment problem?”. I’m hoping that the individual essays can be read fairly well on their own, but see this introduction for a summary of the essays that have been released thus far, and for a bit more about the series as a whole.)

1. Introduction and summary

In my last essay, I offered a high-level framework for thinking about the path from here to safe superintelligence. This framework emphasized the role of three key “security factors” – namely:

  • Safety progress: our ability to develop new levels of AI capability safely,
  • Risk evaluation: our ability to track and forecast the level
...

When this post first came out, it annoyed me. I got a very strong feeling of "fake thinking", fake ontology, etc. And that feeling annoyed me a lot more than usual, because Joe is the person who wrote the (excellent) post on "fake vs real thinking". But at the time, I did not immediately come up with a short explanation for where that feeling came from.

I think I can now explain it, after seeing this line from kave's comment on this post:

Your taxonomies of the space of worries and orientations to this question are really good...

That's exactly it. The taxono... (read more)

1Chris_Leong
Great post. I think some of your frames add a lot of clarity and I really appreciated the diagrams. One subset of AI for AI safety that I believe to be underrated is wise AI advisors[1]. Some of the areas you've listed (coordination, helping with communication, improving epistemics) intersect with this, but I don't believe that this exhausts the wisdom frame, especially since the first two were only mentioned in the context of capability restraint. You also mention civilizational wisdom as a component of backdrop capacity and I agree that this is a very diffuse factor. At the same time, a less diffuse intervention would be to increase the wisdom of specific actors. You write: "If efforts to expand the safety range can’t benefit from this kind of labor in a comparable way... then absent large amounts of sustained capability restraint, it seems likely that we’ll quickly end up with AI systems too capable for us to control". I agree. In fact, a key reason why I think this is important is that we can't afford to leave anything on the table. One of the things I like about the approach of training AI advisors is that humans can compensate for weaknesses in the AI system. In other words, I'm introducing a third category of labour human-AI cybernetic systems/centaur labour. I think that it's likely that this might widen the sweet spot, however, we have to make sure that we do this in a way that differentially benefits safety. You do discuss the possibility of using AI to unlock enhanced human labour. It would also be possible to classify such centaur systems under this designation. 1. ^ More broadly, I think there's merit to the cyborgism approach even if some of the arguments is less compelling in light of recent capabilities advances.

Epistemic status – thrown together quickly. This is my best-guess, but could easily imagine changing my mind.

Intro

I recently copublished a report arguing that there might be a software intelligence explosion (SIE) – once AI R&D is automated (i.e. automating OAI), the feedback loop of AI improving AI algorithms could accelerate more and more without needing more hardware. 

If there is an SIE, the consequences would obviously be massive. You could shoot from human-level to superintelligent AI in a few months or years; by default society wouldn’t have time to prepare for the many severe challenges that could emerge (AI takeover, AI-enabled human coups, societal disruption, dangerous new technologies, etc). 

The best objection to an SIE is that progress might be bottlenecked by compute. We discuss this in the report, but I want...

I’ll define an “SIE” as “we can get >=5 OOMs of increase in effective training compute in <1 years without needing more hardware”. I

This is as of the point of full AI R&D automation? Or as of any point?

Your AI’s training data might make it more “evil” and more able to circumvent your security, monitoring, and control measures. Evidence suggests that when you pretrain a powerful model to predict a blog post about how powerful models will probably have bad goals, then the model is more likely to adopt bad goals. I discuss ways to test for and mitigate these potential mechanisms. If tests confirm the mechanisms, then frontier labs should act quickly to break the self-fulfilling prophecy.

Research I want to see

Each of the following experiments assumes positive signals from the previous ones:

  1. Create a dataset and use it to measure existing models
  2. Compare mitigations at a small scale
  3. An industry lab running large-scale mitigations

Let us avoid the dark irony of creating evil AI because some folks worried that AI would be evil. If self-fulfilling misalignment has a strong effect, then we should act. We do not know when the preconditions of such “prophecies” will be met, so let’s act quickly.

 

https://turntrout.com/self-fulfilling-misalignment 

Any empirical evidence that the Waluigi effect is real? Or are you more appealing to jailbreaks and such?

2Alex Turner
I think we have quite similar evidence already. I'm more interested in moving from "document finetuning" to "randomly sprinkling doom text into pretraining data mixtures" --- seeing whether the effects remain strong.
2Alex Turner
This (and @Raemon 's comment[1]) misunderstand the article. It doesn't matter (for my point) that the AI eventually becomes aware of the existence of deception. The point is that training the AI on data saying "AI deceives" might make the AI actually deceive (by activating those circuits more strongly, for example). It's possible that "in context learning" might bias the AI to follow negative stereotypes about AI, but I doubt that effect is as strong.  From the article: 1. ^ "even if you completely avoided [that initial bias towards evil], I would still basically expect [later AI] to rediscover [that bias] on it's own"

Right now, alignment seems easy – but that’s because models spill the beans when they are misaligned. Eventually, models might “fake alignment,” and we don’t know how to detect that yet.

It might seem like there’s a swarming research field improving white box detectors – a new paper about probes drops on arXiv nearly every other week. But no one really knows how well these techniques work.

Some researchers have already tried to put white box detectors to the test. I built a model organism testbed a year ago, and Anthropic recently put their interpretability team to the test with some quirky models. But these tests were layups. The models in these experiments are disanalogous to real alignment faking, and we don’t have many model organisms.

This summer, I’m trying to take these testbeds...

4Steve Byrnes
Thanks! But I don’t think that’s a likely failure mode. I wrote about this long ago in the intro to Thoughts on safety in predictive learning. In my view, the big problem with model-based actor-critic RL AGI, the one that I spend all my time working on, is that it tries to kill us via using its model-based RL capabilities in the way we normally expect—where the planner plans, and the actor acts, and the critic criticizes, and the world-model models the world …and the end-result is that the system makes and executes a plan to kill us. I consider that the obvious, central type of alignment failure mode for model-based RL AGI, and it remains an unsolved problem. I think (??) you’re bringing up a different and more exotic failure mode where the world-model by itself is secretly harboring a full-fledged planning agent. I think this is unlikely to happen. One way to think about it is: if the world-model is specifically designed by the programmers to be a world-model in the context of an explicit model-based RL framework, then it will probably be designed in such a way that it’s an effective search over plausible world-models, but not an effective search over a much wider space of arbitrary computer programs that includes self-contained planning agents. See also §3 here for why a search over arbitrary computer programs would be a spectacularly inefficient way to build all that agent stuff (TD learning in the critic, roll-outs in the planner, replay, whatever) compared to what the programmers will have already explicitly built into the RL agent architecture. So I think this kind of thing (the world-model by itself spawning a full-fledged planning agent capable of treacherous turns etc.) is unlikely to happen in the first place. And even if it happens, I think the problem is easily mitigated; see discussion in Thoughts on safety in predictive learning. (Or sorry if I’m misunderstanding.)
3Steve Byrnes
Thanks! I think “inner alignment” and “outer alignment” (as I’m using the term) is a “natural breakdown” of alignment failures in the special case of model-based actor-critic RL AGI with a “behaviorist” reward function (i.e., reward that depends on the AI’s outputs, as opposed to what the AI is thinking about). As I wrote here: (A bit more related discussion here.) That definitely does not mean that we should be going for a solution to outer alignment and a separate unrelated solution to inner alignment, as I discussed briefly in §10.6 of that post, and TurnTrout discussed at greater length in Inner and outer alignment decompose one hard problem into two extremely hard problems. (I endorse his title, but I forget whether I 100% agreed with all the content he wrote.) I find your comment confusing, I’m pretty sure you misunderstood me, and I’m trying to pin down how … One thing is, I’m thinking that the AGI code will be an RL agent, vaguely in the same category as MuZero or AlphaZero or whatever, which has an obvious part of its source code labeled “reward”. For example, AlphaZero-chess has a reward of +1 for getting checkmate, -1 for getting checkmated, 0 for a draw. Atari-playing RL agents often use the in-game score as a reward function. Etc. These are explicitly parts of the code, so it’s very obvious and uncontroversial what the reward is (leaving aside self-hacking), see e.g. here where an AlphaZero clone checks whether a board is checkmate. Another thing is, I’m obviously using “alignment” in a narrower sense than CEV (see the post—“the AGI is ‘trying’ to do what the programmer had intended for it to try to do…”) Another thing is, if the programmer wants CEV (for the sake of argument), and somehow (!!) writes an RL reward function in Python whose output perfectly matches the extent to which the AGI’s behavior advances CEV, then I disagree that this would “make inner alignment unnecessary”. I’m not quite sure why you believe that. The idea is: actor-criti
3Towards_Keeperhood
Thanks! I was just imagining a fully omnicient oracle that could tell you for each action how good that action is according to your extrapolated preferences, in which case you could just explore a bit and always pick the best action according to that oracle. But nvm, I noticed my first attempt of how I wanted to explain what I feel like is wrong sucked and thus dropped it. This seems like a sensible breakdown to me, and I agree this seems like a useful distinction (although not a useful reduction of the alignment problem to subproblems, though I guess you agree here).  However, I think most people underestimate how many ways there are for the AI to do the right thing for the wrong reasons (namely they think it's just about deception), and I think it's not: I think we need to make AI have a particular utility function. We have a training distribution where we have a ground-truth reward signal, but there are many different utility functions that are compatible with the reward on the training distribution, which assign different utilities off-distribution. You could avoid talking about utility functions by saying "the learned value function just predicts reward", and that may work while you're staying within the distribution we actually gave reward on, since there all the utility functions compatible with the ground-truth reward still agree. But once you're going off distribution, what value you assign to some worldstates/plans depends on what utility function you generalized to. I think humans have particular not-easy-to-pin-down machinery inside them, that makes their utility function generalize to some narrow cluster of all ground-truth-reward-compatible utility functions, and a mind with a different mind design is unlikely to generalize to the same cluster of utility functions. (Though we could aim for a different compatible utility function, namely the "indirect alignment" one that say "fulfill human's CEV", which has lower complexity than the ones humans gen

I was just imagining a fully omnicient oracle that could tell you for each action how good that action is according to your extrapolated preferences, in which case you could just explore a bit and always pick the best action according to that oracle.

OK, let’s attach this oracle to an AI. The reason this thought experiment is weird is because the goodness of an AI’s action right now cannot be evaluated independent of an expectation about what the AI will do in the future. E.g., if the AI says the word “The…”, is that a good or bad way for it to start its se... (read more)

Load More