Building eval systems that improve your AI product

Lenny's Reads

September 9, 2025 · 21m 41s


Show Notes

If you’re a premium subscriber, add the private feed to your podcast app at https://add.lennysreads.com

In this episode, we dive into the fast-emerging discipline of AI evaluation with Hamel Husain and Shreya Shankar, creators of AI Evals for Engineers & PMs, the #1 highest-grossing course on Maven.

After training 2,000+ PMs and engineers across 500+ companies, Hamel and Shreya reveal the complete playbook for building evaluations that actually improve your AI product: moving beyond vanity dashboards to a system that drives continuous improvement.

In this episode, you’ll learn:

• Why most AI eval dashboards fail to deliver real product improvements

• How to use error analysis to uncover your product’s most critical failure modes

• The role of a “principal domain expert” in setting a consistent quality bar

• Techniques for transforming messy error notes into a clean taxonomy of failures

• When to use code-based checks vs. LLM-as-a-judge evaluators

• How to build trust in your evals with human-labeled ground-truth datasets

• Why binary pass/fail labels outperform Likert scales in practice

• Evaluation strategies for complex systems: multi-turn conversations, RAG pipelines, and agentic workflows

• How CI safety nets and production monitoring work together to create a flywheel of continuous product improvement

References:

• Read the newsletter: https://www.lennysnewsletter.com/p/building-eval-systems-that-improve

• AI Evals for Engineers & PMs: https://maven.com/parlance-labs/evals

• A Field Guide to Rapidly Improving AI Products: https://hamel.dev/blog/posts/field-guide/

• Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences: https://arxiv.org/abs/2404.12272

• Aman Khan: https://www.linkedin.com/in/amanberkeley/

• Anthropic: https://www.anthropic.com/

• Arize Phoenix: https://phoenix.arize.com/

• Braintrust: https://www.braintrust.dev/

• Beyond vibe checks: A PM’s complete guide to evals: https://www.lennysnewsletter.com/p/beyond-vibe-checks-a-pms-complete

• Frequently Asked Questions (And Answers) About AI Evals: https://hamel.dev/blog/posts/evals-faq/

• Hamel Husain: https://www.linkedin.com/in/hamelhusain/

• LangSmith: https://smith.langchain.com/

• Not Dead Yet: On RAG: https://hamel.dev/notes/llm/rag/not_dead.html

• OpenAI: https://openai.com/

• Shreya Shankar: https://www.linkedin.com/in/shrshnk/

Listen:

• YouTube: https://www.youtube.com/@lennysreads

• Apple: https://podcasts.apple.com/us/podcast/lennys-reads/id1810314693

• Spotify: https://open.spotify.com/show/0IIunA06qMtrcQLfypTooj

• Newsletter: https://www.lennysnewsletter.com/subscribe

Follow Lenny:

• Twitter/X: https://twitter.com/lennysan

• LinkedIn: https://www.linkedin.com/in/lennyrachitsky/

• Podcast: https://www.youtube.com/@lennyspodcast

Subscribe

• Substack: https://lennysreads.com/

About

Welcome to Lenny's Reads, where every week you’ll find a fresh audio version of my newsletter about building product, driving growth, and accelerating your career, read to you by the soothing voice of Lennybot.



To hear more, visit www.lennysnewsletter.com