"More information about the dangerous capability evaluations we did with GPT-4 and Claude." by Beth Barnes

TYPE III AUDIO (All episodes) · TYPE III AUDIO

March 21, 2023 · 14m 16s



Show Notes

This is a linkpost for https://evals.alignment.org/blog/2023-03-18-update-on-recent-evals/

[Written more for a general-public audience than for an Alignment Forum audience. We're working on a more thorough technical report.]

We believe that sufficiently capable AI systems could pose very large risks to the world. We don’t think today’s systems are capable enough to pose these sorts of risks, but this situation could change quickly, so it’s important to monitor the risks consistently. For this reason, ARC is partnering with leading AI labs such as Anthropic and OpenAI as a third-party evaluator to assess potentially dangerous capabilities of today’s state-of-the-art ML models. The dangerous capability we are focusing on is the ability to autonomously gain resources and evade human oversight.

We attempt to elicit models’ capabilities in a controlled environment, with researchers in the loop for anything that could be dangerous, to understand what might go wrong before models are deployed. We think that future highly capable models should undergo similar “red team” evaluations for dangerous capabilities before they are deployed or scaled up, and we hope more teams building cutting-edge ML systems will adopt this approach. The testing we’ve done so far is insufficient for many reasons, but we hope that the rigor of evaluations will scale up as AI systems become more capable.

Original article:
https://www.lesswrong.com/posts/4Gt42jX7RiaNaxCwP/more-information-about-the-dangerous-capability-evaluations

Narrated for LessWrong by TYPE III AUDIO.

