
Recommended Sequences

AGI safety from first principles
Embedded Agency
2022 MIRI Alignment Discussion

Recent Discussion

In our engagements with governments, AI safety institutes, and frontier AI developers, we found the concept of the “evaluation gap” (short: ‘evals gap’) helpful to communicate the current state of the art and what is needed for the field to move towards more robust evaluations. In this post, we briefly explain the concept and its implications. For the purpose of this post, “evals” specifically refer to safety evaluations of frontier models.

Evals have become a prominent tool underpinning governance frameworks and AI safety mechanisms. Given that, we are concerned that policymakers and industry players both (i) overestimate the number of currently available high-quality evals and (ii) underestimate the time it takes to develop them. In our experience, available evals are not sufficient (in quality and quantity) to robustly...
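For concreteness, here is a minimal sketch of what a single safety eval amounts to in code (all names here are hypothetical stand-ins, not any particular organization's harness): a fixed suite of graded tasks is run against the model, and the aggregate score is compared against a pre-registered threshold. Part of the gap is that high-quality task suites of this kind are scarce and slow to build.

```python
# Minimal, hypothetical sketch of a safety eval harness (stand-in names only).
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalTask:
    prompt: str
    grade: Callable[[str], float]  # maps a model response to a score in [0, 1]

def run_eval(query_model: Callable[[str], str], tasks: list[EvalTask]) -> float:
    """Run every task against the model and return the mean score."""
    scores = [task.grade(query_model(task.prompt)) for task in tasks]
    return sum(scores) / len(scores)

# Example wiring (everything below is a stand-in):
# if run_eval(query_model, dangerous_capability_tasks) > PRE_REGISTERED_THRESHOLD:
#     escalate_to_mitigations()
```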

jbash
... and yet... Why are you sure that effective "evals" can exist even in principle? I think I'm seeing a "we really want this, therefore it must be possible" shift here.

Yeah, it's not a watertight argument, and it's somewhat based on my current interpretation of past progress and of projects in the making.

1. Intuitively, I would say that, for the problems we're facing in evals, a ton of progress is bottlenecked by running fairly simple experiments and iterating fast. A reasonable part of it feels very parallelizable, and the skill required is quite reachable for many people. 
2. Most evals questions feel like we have a decent number of "obvious things" to try, and since we have very tight feedback loops, making progress feels qu... (read more)

The cleanest argument that current-day AI models will not cause a catastrophe is probably that they lack the capability to do so.  However, as capabilities improve, we’ll need new tools for ensuring that AI models won’t cause a catastrophe even if we can’t rule out the capability. Anthropic’s Responsible Scaling Policy (RSP) categorizes levels of risk of AI systems into different AI Safety Levels (ASL), and each level has associated commitments aimed at mitigating the risks. Some of these commitments take the form of affirmative safety cases, which are structured arguments that the system is safe to deploy in a given environment. Unfortunately, it is not yet obvious how to make a safety case to rule out certain threats that arise once AIs have sophisticated strategic abilities. The goal

...
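As a rough illustration of what "structured argument" means here (a sketch only; the excerpt does not specify Anthropic's actual safety-case format), a safety case can be modeled as a tree of claims, each backed by evidence or by sub-claims:

```python
# Hypothetical sketch of a safety case as a claims/evidence tree
# (illustrative structure, not Anthropic's actual format).
from dataclasses import dataclass, field

@dataclass
class Claim:
    statement: str
    evidence: list[str] = field(default_factory=list)
    subclaims: list["Claim"] = field(default_factory=list)

    def supported(self) -> bool:
        """A claim holds if it has some support and all of its sub-claims hold."""
        return bool(self.evidence or self.subclaims) and all(
            c.supported() for c in self.subclaims
        )

top = Claim(
    "The model is safe to deploy in this environment",
    subclaims=[
        Claim("It lacks the capability to cause catastrophe",
              evidence=["dangerous-capability evals below threshold"]),
        Claim("Remaining risks are mitigated by deployment safeguards",
              evidence=["monitoring and access controls in place"]),
    ],
)
print(top.supported())  # True
```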

If we can construct fake honeypots that are indistinguishable - to the policy - from real opportunities, and do this at scale, then that could suffice: we could drive the policy's P(deployment | great opportunity) very low. (This relies on how many such high-stakes opportunities we can create and how many the model will see in deployment.)

Yes totally. (This is impractical if you think that the model has a very large number of opportunities to attack, e.g. if every programming task it's given is an opportunity to attack.)
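To make the dependence in that parenthetical concrete (a back-of-the-envelope sketch with made-up numbers, not anything from the original exchange): if honeypots really are indistinguishable from genuine opportunities, then from the policy's perspective the chance that an apparent opportunity is a real deployment one is just the fraction of all such opportunities that are real, so the honeypot count has to scale with the number of real opportunities the model expects to see.

```python
# Back-of-the-envelope sketch with made-up numbers: how many indistinguishable
# honeypots does it take to drive down P(deployment | great opportunity)?

def p_real(n_real_opportunities: int, n_honeypots: int) -> float:
    """Posterior that an apparent opportunity is a real deployment one,
    assuming honeypots are indistinguishable from real opportunities."""
    return n_real_opportunities / (n_real_opportunities + n_honeypots)

# If the model expects ~100 genuine high-stakes opportunities in deployment,
# ~10,000 honeypots push its posterior down to 1%:
print(p_real(100, 9_900))        # 0.01
# But if every programming task counts as an opportunity (say a million),
# the same honeypot budget barely moves it, which is the impracticality
# flagged in the reply above:
print(p_real(1_000_000, 9_900))  # ~0.99
```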

If we can construct fake attacks that are indistinguishable -

... (read more)
Buck Shlegeris
I don't think this makes any sense. How are you hoping to get the model to attack except by telling it that it's in a control evaluation and you want it to attack? It seems that you are definitely going to have to handle the sandbagging.
Vladimir Mikulik
FWIW I agree with you and wouldn't put it the way it is in Roger's post. Not sure what Roger would say in response.
Daniel Kokotajlo
Good point re 2. Re 1, meh, still seems like a meta-argument to me, because when I roll out my mental simulations of the ways the future could go, it really does seem like my If... condition obtaining would cut out about half of the loss-of-control ones.

Re 3, point by point:

1. AISIs existing vs. not: Less important; I feel like this changes my p(doom) by more like 10-20% rather than 50%.
2. Big names coming out: idk, this also feels like maybe 10-20% rather than 50%.
3. I think Anthropic winning the race would be a 40% thing maybe, but being a runner-up doesn't help so much; but yeah, p(anthropicwins) has gradually gone up over the last three years...
4. Trump winning seems like a smaller deal to me.
5. Ditto for Elon.
6. Not sure how to think about logical updates, but yeah, probably this should have swung my credence around more than it did.
7. ? This was on the mainline path basically and it happened roughly on schedule.
8. Takeoff speeds matter a ton; I've made various updates but nothing big and confident enough to swing my credence by 50% or anywhere close. Hmm. But yeah, I agree that takeoff speeds matter more.
9. Picture here hasn't changed much in three years.
10. Ditto.

OK, so I think I directionally agree that my p(doom) should have been oscillating more than it in fact did over the last three years (if I take my own estimates seriously). However, I don't go nearly as far as you; most of the things you listed are either (a) imo less important, or (b) things I didn't actually change my mind on over the last three years, such that even though they are very important my p(doom) shouldn't have been changing much.

I agree with everything except the last sentence -- my claim took this into account; I was specifically imagining something like this playing out and thinking 'yep, seems like this kills about half of the loss-of-control worlds.'

I agree that's a stronger claim than I was making. However, part of my view here is that the weaker claim I did make has a
Thomas Kwa
What's the most important technical question in AI safety right now?

In terms of developing better misalignment risk countermeasures, I think the most important questions are probably:

  • How to evaluate whether models should be trusted or untrusted: currently I don't have a good answer, and this is bottlenecking the efforts to write concrete control proposals (see the sketch below).
  • How AI control should interact with AI security tools inside labs.

More generally:

... (read more)
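To illustrate why the trusted/untrusted call in the first bullet above is load-bearing for control proposals, here is a rough sketch (hypothetical function names, loosely in the spirit of trusted-monitoring protocols rather than any specific proposal): an untrusted model's outputs are gated by a weaker trusted monitor, with the most suspicious ones audited and replaced by a trusted model's output.

```python
# Hypothetical sketch of a trusted-monitoring control protocol
# (stand-in names; not any specific lab's proposal).

def control_protocol(task: str,
                     untrusted_generate,   # capable model, possibly scheming
                     trusted_monitor,      # weaker trusted model, returns suspicion in [0, 1]
                     trusted_generate,     # weaker trusted fallback model
                     audit,                # expensive human review
                     suspicion_threshold: float = 0.9) -> str:
    """Use the capable model, but gate its output on a trusted monitor."""
    output = untrusted_generate(task)
    suspicion = trusted_monitor(task, output)
    if suspicion >= suspicion_threshold:
        # Spend limited audit capacity on the most suspicious outputs and
        # fall back to the trusted (weaker) model in the meantime.
        audit(task, output)
        return trusted_generate(task)
    return output
```

Whether a given model can safely play the `trusted_monitor` or `trusted_generate` role is exactly the evaluation question the bullet points at.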
This is a linkpost for https://www.thecompendium.ai/

We (Connor Leahy, Gabriel Alfour, Chris Scammell, Andrea Miotti, Adam Shimi) have just published The Compendium, which brings together in a single place the most important arguments that drive our models of the AGI race, and what we need to do to avoid catastrophe.

We felt that something like this has been missing from the AI conversation. Most of these points have been shared before, but never brought together into a single "comprehensive worldview" doc. We've tried our best to fill this gap, and welcome feedback and debate about the arguments. The Compendium is a living document, and we'll keep updating it as we learn more and change our minds.

We would appreciate your feedback, whether or not you agree with us:

  • If you do agree with us, please point out where you
...

Typo addressed in the latest patch!

Adam Shimi
Now addressed in the latest patch!
Adam Shimi
Now addressed in the latest patch!
Adam Shimi
Thanks for the comment! We have indeed gotten feedback from multiple people that this part didn't feel detailed enough (although we got this much more from very technical readers than from non-technical ones), and we are working on improving the arguments.

This is an excerpt from the Introduction section of a book-length project that was kicked off as a response to the framing of the essay competition on the Automation of Wisdom and Philosophy. Many seemingly unrelated threads open in this post; they will come together by the end of the overall sequence.

If you don't like abstractness, the first few sections may be especially hard going. 

 

Generalization

This sequence is a new story of generalization.

The usual story of progress in generalization, such as in a very general theory, is via the uncovering of deep laws. Distilling the real patterns, without any messy artefacts. Finding the necessities and universals that can handle wide classes rather than being limited to particularities. The crisp, noncontingent abstractions. It is about opening black boxes. Articulating mind-independent, rigorous results, with no ambiguity and high...
