
Interpretability in the wild and other papers
TYPE III AUDIO (All episodes) · TYPE III AUDIO
April 6, 2023 · 5m 4s
Show Notes
---
client: t3a
feed_id: ai_safety_abstracts
narrator: ai
---
This episode covers 3 abstracts:
- Active Reward Learning from Multiple Teachers - Peter Barnett et al.
- Conditioning Predictive Models: Risks and Strategies - Evan Hubinger et al.
- Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small - Kevin Wang et al.