"Discussion with Nate Soares on a key alignment difficulty" by Holden Karnofsky

TYPE III AUDIO (All episodes) · TYPE III AUDIO

April 5, 2023 · 39m 43s

Show Notes

In late 2022, Nate Soares gave some feedback on my Cold Takes series on AI risk (shared as drafts at that point), stating that I hadn't discussed what he sees as one of the key difficulties of AI alignment.

I wanted to understand the difficulty he was pointing to, so the two of us had an extended Slack exchange, and I then wrote up a summary of the exchange that we iterated on until we were both reasonably happy with its characterization of the difficulty and our disagreement. My short summary is:

Nate thinks there are deep reasons that training an AI to do needle-moving scientific research (including alignment) would be dangerous. The overwhelmingly likely result of such a training attempt (by default, i.e., in the absence of specific countermeasures that there are currently few ideas for) would be the AI taking on a dangerous degree of convergent instrumental subgoals while not internalizing important safety/corrigibility properties enough.
I think this is possible, but much less likely than Nate thinks under at least some imaginable training processes.

Original article:
https://www.lesswrong.com/posts/iy2o4nQj9DnQD7Yhj/discussion-with-nate-soares-on-a-key-alignment-difficulty

Narrated for LessWrong by TYPE III AUDIO.
