
I often talk to people who think that if frontier models were egregiously misaligned and powerful enough to pose an existential threat, you could get AI developers to slow down or undeploy models by producing evidence of their misalignment. I'm not so sure. As an extreme thought experiment, I’ll argue this could be hard even if you caught your AI red-handed trying to escape.

Imagine you're running an AI lab at the point where your AIs are able to automate almost all intellectual labor; the AIs are now mostly being deployed internally to do AI R&D. (If you want a concrete picture here, I'm imagining that there are 10 million parallel instances, running at 10x human speed, working 24/7. See e.g. similar calculations here). And suppose (as I...

De-facto you won't be able to prove that what was going on was scheming or whether the model was just "role-playing"

Why not? There is a very real and important difference between role-playing and systematic scheming. The former isn't ultimately scary or systematic and won't reliably lead to AI takeover; the latter will. If it is indeed systematic scheming, you should be able to generate evidence of that empirically. It will persist in a variety of circumstances where there seems to be an opportunity to seize power, and it won't go away when you change unrelat...

Ryan Greenblatt
It sounds as though you're imagining that we can proliferate the one case in which we caught the AI into many cases which can be well understood as independent (rather than basically just being small variations). This deep understanding presumably requires a relatively solid accounting of all the cases in which models defect? At least an accounting sufficient to very credibly indicate that the model is overall misaligned against motivated counterarguments. I think this might be hard to get by default.
Maybe? Just because we've caught a model in production doesn't mean that we can consistently catch schemers without a huge amount of runtime compute (and possibly human labor). A full deployment massively subsidizes your ability to find cases where an AI does egregiously bad things in a credibly dangerous way. And indeed, AIs would want to only defect in cases where they could actually cause problems. So, it might be even harder to go from "we caught this one model" to "XYZ training process consistently results in misalignment". (If we had a robust check for misalignment, we could iterate or train against it.)
----------------------------------------
Personally, I feel OK (65% 60% chance?) about convincing the key people within 1 month if:
* There is a solid plan in advance for how to study AIs we catch.
* There is a large effort to study this once we catch the AI. (E.g. >20% of AI lab research resources are put on this.)
But this isn't exactly amazing...
Raymond Arnold
This seems technically true but I wanna flag the argument "it seems really hard to be confident that you have robust enough checks that training against them is good, instead of bad (because it trains the AI to hide better)".
Ryan Greenblatt
Agreed, I really should have said "or possibly even train against it". I think SGD is likely to be much worse than best-of-N over a bunch of variations on the training scheme where the variations are intended to plausibly reduce the chance of scheming. Of course, if you are worried about scheming emerging throughout training, then you need N full training runs, which is very pricy!

Crossposted by habryka with Sam's permission. Expect lower probability for Sam to respond to comments here than if he had posted it (he said he'll be traveling a bunch in the coming weeks, so might not have time to respond to anything). 


This piece reflects my current best guess at the major goals that Anthropic (or another similarly positioned AI developer) will need to accomplish to have things go well with the development of broadly superhuman AI. Given my role and background, it’s disproportionately focused on technical research and on averting emerging catastrophic risks.

For context, I lead a technical AI safety research group at Anthropic, and that group has a pretty broad and long-term mandate, so I spend a lot of time thinking about what kind of safety...

Ryan Greenblatt
AIs which aren't qualitatively much smarter than humans seem plausible to use reasonably effectively while keeping risk decently low (though still unacceptably risky in objective/absolute terms). Keeping risk low seems to require substantial effort, though it seems maybe achievable. Even with token effort, I think risk is "only" around 25% with such AIs because default methods likely avoid egregious misalignment (perhaps 30% chance of egregious misalignment with token effort and then some chance you get lucky for a 25% chance of risk overall).
Then given this, I have two objections to the story you seem to present:
* AIs which aren't qualitatively smarter than humans seem very useful and with some US government support could suffice to prevent proliferation. (Both greatly reduce the cost of non-proliferation while also substantially increasing willingness to pay with demos etc.)
* Plans that don't involve US government support while building crazy weapons/defense with wildly superhuman AIs involve committing massive crimes and I think we should have a policy against this.
Another way to put this is that the story for needing much smarter AIs is presumably that you need to build crazy weapons/defenses to defend against someone else's crazily powerful AI. Building insane weapons/defenses requires US government consent (unless you're committing massive crimes, which seems like a bad idea). Thus, you might as well go all the way to preventing much smarter AIs from being built (by anyone) for a while, which seems possible with some US government support and the use of these human-ish level AIs.

(Responding in a consolidated way just to this comment.)

Ok, got it. I don't think the US government will be able and willing to coordinate and enforce a worldwide moratorium on superhuman TAI development, if we get to just-barely TAI, at least not without plans that leverage that just-barely TAI in unsafe ways which violate the safety invariants of this plan. It might become more willing than it is now (though I'm not hugely optimistic), but I currently don't think as an institution it's capable of executing on that kind of plan and don't see w...

Ryan Greenblatt
My proposal would roughly be that the US government (in collaboration with allies etc) enforces that no one builds AIs which are qualitatively smarter than humans, and this should be the default plan. (This might be doable without government support via coordination between multiple labs, but I basically doubt it.) There could be multiple AI projects backed by the US+allies or just one; either could be workable in principle, though multiple seems tricky.
Ryan Greenblatt
TBC, I don't think there are plausible alternatives to at least some US government involvement which don't require committing a bunch of massive crimes. I have a policy against committing or recommending committing massive crimes.
Alex Turner
I quite appreciated Sam Bowman's recent Checklist: What Succeeding at AI Safety Will Involve. However, one bit stuck out: I don't see why we need to "perfectly" and "fully" solve "the" core challenges of alignment (as if that's a thing that anyone knows exists). Uncharitably, it seems like many people (and I'm not mostly thinking of Sam here) have their empirically grounded models of "prosaic" AI, and then there's the "real" alignment regime where they toss out most of their prosaic models and rely on plausible but untested memes repeated from the early days of LessWrong. Alignment started making a whole lot more sense to me when I thought in mechanistic detail about how RL+predictive training might create a general intelligence. By thinking in that detail, my risk models can grow along with my ML knowledge.

I haven't read the Shard Theory work in comprehensive detail. But, fwiw, I've read at least a fair amount of your arguments here and not seen anything that bridged the gap between "motivations are made of shards that are contextually activated" and "we don't need to worry about Goodhart and misgeneralization of human values at extreme levels of optimization."

I've heard you make this basic argument several times, and my sense is you're pretty frustrated that people still don't seem to have "heard" it properly, or something. 

I did feel compelled by your a...

Previously: Predictions for shard theory mechanistic interpretability results 

Locally retargeting the search by modifying a single activation. We found a residual channel halfway through a maze-solving network. When we set one of the channel activations to +5.5, the agent often navigates to the maze location (shown above in red) implied by that positive activation. This allows limited on-the-fly redirection of the net's goals.

(The red dot is not part of the image observed by the network, it just represents the modified activation. Also, this GIF is selected to look cool. Our simple technique often works, but it isn't effortless, and some dot locations are harder to steer towards.)
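The single-activation edit described above can be illustrated with a toy, pure-Python sketch. It is not the authors' code or their network: `toy_first_half`, `toy_second_half`, `STEERING_CHANNEL`, and the 3x3 grid are hypothetical stand-ins, and the toy "policy head" simply heads for the largest value in one channel. The only part taken from the post is the idea of overwriting one channel activation with +5.5 partway through the forward pass.

```python
from typing import Callable, List, Tuple

Grid = List[List[float]]      # one channel's spatial activations
Activations = List[Grid]      # indexed as [channel][row][col]


def patch_activation(acts: Activations, channel: int, row: int, col: int,
                     value: float = 5.5) -> Activations:
    """Return a copy of `acts` with a single entry of one channel overwritten."""
    patched = [[r[:] for r in grid] for grid in acts]  # copy, leave input intact
    patched[channel][row][col] = value
    return patched


def run_patched(first_half: Callable[[Grid], Activations],
                second_half: Callable[[Activations], Tuple[int, int]],
                observation: Grid, channel: int, row: int, col: int,
                value: float = 5.5) -> Tuple[int, int]:
    """Run the first half, edit one activation, then run the second half."""
    acts = first_half(observation)
    return second_half(patch_activation(acts, channel, row, col, value))


# --- Hypothetical toy network, for illustration only ---
STEERING_CHANNEL = 1  # assumed index of the channel the toy head reads


def toy_first_half(observation: Grid) -> Activations:
    # Channel 0 copies the observation; channel 1 starts at all zeros.
    return [[r[:] for r in observation],
            [[0.0] * len(r) for r in observation]]


def toy_second_half(acts: Activations) -> Tuple[int, int]:
    # Toy head: target the location of the largest steering-channel value.
    grid = acts[STEERING_CHANNEL]
    cells = ((v, (i, j)) for i, r in enumerate(grid) for j, v in enumerate(r))
    return max(cells, key=lambda t: t[0])[1]


observation = [[0.0] * 3 for _ in range(3)]
unpatched_target = toy_second_half(toy_first_half(observation))
patched_target = run_patched(toy_first_half, toy_second_half,
                             observation, STEERING_CHANNEL, 2, 1)
```

In this toy the patched target always matches the edited location, which is far cleaner than the real result: the post reports that the technique often works but that some locations are harder to steer towards.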

TL;DR: We algebraically modified the net's runtime goals without finetuning. We also found (what we think is) a "motivational API" deep in...

Alex Turner
Often people talk about policies getting "selected for" on the basis of maximizing reward. Then, inductive biases serve as "tie breakers" among the reward-maximizing policies. This perspective A) makes it harder to understand and describe what this network is actually implementing, and B) mispredicts what happens. Consider the setting where the cheese (the goal) was randomly spawned in the top-right 5x5. If reward were really lexicographically important --- taking first priority over inductive biases -- then this setting would train agents which always go to the cheese (because going to the top-right corner often doesn't lead to reward).  But that's not what happens! This post repeatedly demonstrates that the mouse doesn't reliably go to the cheese or the top-right corner. The original goal misgeneralization paper was trying to argue that if multiple "goals" lead to reward maximization on the training distribution, then we don't know which will be learned. This much was true for the 1x1 setting, where the cheese was always in the top-right square -- and so the policy just learned to go to that square (not to the cheese). However, it's not true that "go to the top-right 5x5" is a goal which maximizes training reward in the 5x5 setting! Go to the top right 5x5... and then what? Going to that corner doesn't mean the mouse hit the cheese. What happens next?[1] If you demand precision and don't let yourself say "it's basically just going to the corner during training" -- if you ask yourself, "what goal, precisely, has this policy learned?" -- you'll be forced to conclude that the network didn't learn a goal that was "compatible with training." The network learned multiple goals ("shards") which activate more strongly in different situations (e.g. near the cheese vs near the corner). And the learned goals do not all individually maximize reward (e.g. going to the corner does not max reward). In this way, shard theory offers a unified and principled perspective which

Often people talk about policies getting "selected for" on the basis of maximizing reward. Then, inductive biases serve as "tie breakers" among the reward-maximizing policies.

Does anyone do this? Under this model the data-memorizing model would basically always win out, which I've never really seen anyone predict. Seems clear that inductive biases do more than tie-breaking.

AI systems up to some high level of intelligence plausibly need to know exactly where they are in space-time in order for deception/"scheming" to make sense as a strategy.
This is because they need to know:
1) what sort of oversight they are subject to 
2) what effects their actions will have on the real world

(side note: Acausal trade might break this argument)

There are a number of informal proposals to keep AI systems selectively ignorant of (1) and (2) in order to prevent deception.  Those proposals seem very promising to flesh out; I'm not aware of any rigorous work doing so, however.  Are you?

Answer by Daniel Kokotajlo
I know of no rigorous proposals. The general challenge such proposals face is that if you are relying on fooling your AGI about something to keep control over it, and it's constantly and rapidly getting smarter and wiser... that's a recipe for your scheme to fail suddenly and silently (when it stops being fooled), which is a recipe for disaster. Another type of proposal relies on making it actually true that it might be in a simulation--or to put it more precisely perhaps, making it actually the case that future aligned superintelligences will make simulations so accurate that even a baby superintelligence can't tell the difference. However, two can play at that game; more generally this just becomes a special case of acausal trade stuff which will be wild and confusing and very important once AIs are smart enough to take it seriously.
David Scott Krueger
Not necessarily fooling it, just keeping it ignorant.  I think such schemes can plausibly scale to very high levels of capabilities, perhaps indefinitely, since intelligence doesn't give one the ability to create information from thin air...

Are you describing something that would fit within my 'Another type of proposal...' category?

The mistakes can (somewhat) be expressed in the language of Bayesian rationalism by doing two things:

  1. Talking about partial hypotheses rather than full hypotheses. You can't have a prior over partial hypotheses, because several of them can be true at once (though you can still assign them credences and update those credences according to evidence).
  2. Talking about models with degrees of truth rather than just hypotheses with degrees of likelihood. E.g. when using a binary conception of truth, general relativity is definitely false because it's inconsistent wit
...
