Interpretability in the wild and other papers

TYPE III AUDIO (All episodes) · TYPE III AUDIO

April 6, 2023 · 5m 4s

Show Notes

---
client: t3a
feed_id: ai_safety_abstracts
narrator: ai
---

This episode covers 3 abstracts:

Active reward learning from multiple teachers - Peter Barnett et al.
Conditioning Predictive Models: Risks and Strategies - Hubinger et al.
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small - Kevin Wang et al.
