Lenny's Podcast: Product | Career | Growth

Why AI evals are the hottest new skill for product builders | Hamel Husain & Shreya Shankar (creators of the #1 eval course)

September 25, 2025

Key Takeaways

  • Evals are a systematic way to measure and improve AI applications, moving beyond subjective 'vibe checks' to provide actionable feedback for iteration and experimentation. 
  • The process of building effective evals begins with error analysis, where product builders manually review application traces to identify and note down issues, a crucial step that LLMs cannot fully automate at this stage. 
  • LLM-as-judge evals offer a scalable way to automate the evaluation of specific, narrowly defined failure modes by providing a binary pass/fail output, but require careful prompt engineering and alignment with human judgment to ensure reliability. 
  • Evals are the new PRDs for AI products, serving as a dynamic and constantly running specification that evolves with product development and data insights. 
  • Effective AI product development requires a systematic approach to error analysis and continuous evaluation, moving beyond intuition ('vibes') to data-driven improvements. 
  • While AI can assist in the evaluation process, human oversight and critical thinking remain essential for interpreting data, identifying nuanced failure modes, and ensuring alignment with product goals. 

Segments

Defining AI Evals
(00:04:42)
  • Key Takeaway: Evals are a systematic method for measuring and improving AI applications, akin to data analytics for LLMs.
  • Summary: Evals provide a structured approach to analyze AI application performance, moving beyond subjective assessments to enable data-driven improvements. This process involves looking at data systematically and creating metrics to track progress and guide development.
Error Analysis Process
(00:12:34)
  • Key Takeaway: The initial step in building evals is error analysis, involving manual review of application traces to identify and document issues.
  • Summary: Product builders examine logs (traces) of AI interactions to pinpoint what’s going wrong. This manual note-taking, even if informal, is crucial for understanding user experience flaws and identifying areas for improvement, forming the foundation for more systematic testing.
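The review loop described above can be sketched in a few lines. This is a minimal illustration, not a prescribed tool: the trace format, example data, and the note structure are all hypothetical, and in practice a human reads each trace and writes the note.

```python
# Sketch of an error-analysis pass over logged traces (illustrative data).
traces = [
    {"id": 1, "user": "Find flights to Tokyo", "output": "Here are hotels in Tokyo..."},
    {"id": 2, "user": "Cancel my booking", "output": "Your booking is canceled."},
]

notes = []  # freeform "open coding" notes, one per problematic trace
for trace in traces:
    # A human reviewer reads each trace; this condition stands in for
    # one reviewer's judgment on one failure they happened to spot.
    if "hotels" in trace["output"] and "flights" in trace["user"].lower():
        notes.append({"trace_id": trace["id"],
                      "note": "answered about hotels when user asked for flights"})

print(f"{len(notes)} issue(s) noted out of {len(traces)} traces")
```

The point is the artifact, not the code: a pile of trace-linked notes, however informal, is the raw material for everything that follows.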
Open Coding and Axial Codes
(00:25:12)
  • Key Takeaway: Open coding involves freeform note-taking on errors, which can then be synthesized into axial codes (categories of failure modes) using LLMs.
  • Summary: After identifying issues through open coding, LLMs can help categorize these notes into broader themes or failure modes (axial codes). This synthesis transforms raw observations into structured data, making it easier to identify prevalent problems within the AI application.
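Once each freeform note has been mapped to an axial code (by an LLM or a human), tallying prevalence is straightforward. The notes and category labels below are illustrative; the mapping step itself is where the LLM assists.

```python
from collections import Counter

# Open-coding notes gathered during manual trace review (illustrative data).
open_codes = [
    "answered about hotels when user asked for flights",
    "ignored the user's date range",
    "recommended hotels instead of flights",
    "made up a confirmation number",
    "ignored requested travel dates",
]

# An LLM (or a human) assigns each freeform note to an axial code, i.e. a
# broader failure-mode category. This list stands in for that step.
axial_labels = ["wrong_intent", "ignored_constraint", "wrong_intent",
                "hallucination", "ignored_constraint"]

# Tally how prevalent each failure mode is, to prioritize what to fix first.
prevalence = Counter(axial_labels)
print(prevalence.most_common())
```

The counts turn scattered observations into a ranked list of problems, which is what makes prioritization possible.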
Code-Based vs. LLM-as-Judge Evals
(00:48:04)
  • Key Takeaway: Code-based evals use programmatic checks, while LLM-as-judge evals leverage LLMs to evaluate specific, complex failure modes with binary outputs.
  • Summary: Code-based evals are suitable for straightforward checks like output format, offering cost-effectiveness. LLM-as-judge evals are employed for more nuanced issues, where an LLM acts as a judge to provide a pass/fail assessment on a narrowly defined problem, enabling automated evaluation of complex behaviors.
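The two styles can be contrasted in a short sketch. The format check is real, runnable code; the judge prompt wording and the `call_llm` client are hypothetical stand-ins for whatever model API the team uses, and in practice the prompt is iterated on until the judge agrees with human labels.

```python
import json

def check_output_is_json(output: str) -> bool:
    """Code-based eval: a cheap, deterministic check of output format."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

# LLM-as-judge eval: targets one narrowly defined failure mode and asks
# for a binary verdict. Prompt wording is illustrative, not canonical.
JUDGE_PROMPT = """You are evaluating a travel assistant's response.
Failure mode: the assistant answers about the wrong product
(e.g. hotels when the user asked about flights).

User message: {user}
Assistant response: {output}

Did the response address the product the user actually asked about?
Answer with exactly one word: PASS or FAIL."""

def judge(user: str, output: str, call_llm) -> bool:
    """Return True iff the (hypothetical) LLM client answers PASS."""
    verdict = call_llm(JUDGE_PROMPT.format(user=user, output=output))
    return verdict.strip().upper() == "PASS"

print(check_output_is_json('{"city": "Tokyo"}'))  # True
print(check_output_is_json("not json"))           # False
```

Keeping the judge's output binary, as the hosts stress, is what makes it usable as a metric: a pass rate over traces, rather than an unanchored 1-to-5 score.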
Validating LLM Judges
(00:56:58)
  • Key Takeaway: LLM-as-judge evals must be rigorously validated against human judgment to ensure accuracy and maintain trust in the evaluation metrics.
  • Summary: It is critical to test LLM judges against human assessments, focusing on specific error types and using metrics beyond simple agreement percentage. Analyzing misclassifications (false positives/negatives) helps refine the LLM judge’s prompt and improve its alignment with desired outcomes.
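Breaking misclassifications out by type can be done with a simple confusion matrix over a labeled sample. The labels below are illustrative; the point is that two judges with identical raw agreement can have very different error profiles.

```python
# Validate an LLM judge against human labels: raw agreement can hide
# systematic errors, so split misclassifications by type.
# True = pass, False = fail (illustrative labels for 10 traces).
human = [True, True, True, True, False, False, True, False, True, True]
llm   = [True, True, False, True, False, True,  True, False, True, True]

tp = sum(h and j for h, j in zip(human, llm))          # both say pass
tn = sum(not h and not j for h, j in zip(human, llm))  # both say fail
fp = sum(not h and j for h, j in zip(human, llm))      # judge passes a true failure
fn = sum(h and not j for h, j in zip(human, llm))      # judge fails a true pass

agreement = (tp + tn) / len(human)
tpr = tp / (tp + fn)  # how often the judge accepts genuinely good outputs
tnr = tn / (tn + fp)  # how often the judge catches genuine failures

print(f"agreement={agreement:.0%}  TPR={tpr:.0%}  TNR={tnr:.0%}")
```

Here agreement is 80%, but the judge misses a third of the genuine failures (TNR 67%), which is exactly the kind of gap that prompt refinement on the judge should target.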
Evals as Dynamic PRDs
(01:00:28)
  • Key Takeaway: Evals, particularly LLM-as-judge prompts, function as living, constantly running Product Requirements Documents (PRDs) for AI products, continuously specifying desired agent behavior and evolving with data.
  • Summary: The detailed, specific instructions within an LLM-as-judge prompt define the expected behavior and quality standards for an AI product, and because these checks run continuously in production, they act as a living specification rather than a static document. The process lets product managers uncover new expectations and failure modes from data, driving iterative improvements that can feed back into the original PRD. This dynamic nature reflects the reality that desired product behavior is often discovered through usage and data analysis, not just upfront planning.
Criteria Drift in Evals
(01:03:03)
  • Key Takeaway: Human evaluators’ criteria for good and bad AI outputs change as they review more data, highlighting the challenge of upfront rubric definition.
  • Summary: Research indicates that evaluators’ opinions on AI output quality evolve during the review process, and they often only identify failure modes after encountering unexpected outputs. This ‘criteria drift’ means that defining perfect rubrics upfront is difficult, even for experienced professionals. It underscores the need for iterative evaluation and a flexible approach to defining success metrics in AI development.
Number of Evals Needed
(01:05:09)
  • Key Takeaway: Only a few targeted evals are typically necessary, focusing on persistent failure modes that cannot be resolved by prompt engineering alone.
  • Summary: The number of necessary evals is surprisingly small, usually between four and seven, as many issues can be addressed by refining the agent’s prompt. Evals should be reserved for the ‘pesky’ problems that persist despite prompt adjustments. Prioritization is key, focusing on the most critical or risky failure modes for the business.
Evals Beyond Initial Setup
(01:07:41)
  • Key Takeaway: Evals should be integrated into unit tests and online monitoring to ensure continuous improvement and maintain product quality.
  • Summary: After initial setup, evals should be operationalized through unit tests and online monitoring dashboards to track performance and drive ongoing product improvements. This systematic approach provides a sharp sense of application performance and acts as a competitive moat for AI products. The artifacts created during the eval process are valuable for repeated use and continuous enhancement.
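Operationalizing evals as a recurring monitoring pass can be as simple as sampling recent traces and reporting a pass rate per eval. Everything in this sketch is illustrative: the trace format, the two example eval functions, and the data; in a real system the traces would come from production logs and the report would feed a dashboard or alert.

```python
import random

# Two automated evals, each targeting one failure mode found during
# error analysis (both checks are illustrative placeholders).
def no_hallucinated_ids(trace):
    return "confirmation #" not in trace["output"].lower()

def stays_on_topic(trace):
    return "hotel" not in trace["output"].lower() or "hotel" in trace["user"].lower()

EVALS = {"no_hallucinated_ids": no_hallucinated_ids,
         "stays_on_topic": stays_on_topic}

def weekly_report(traces, sample_size=100, seed=0):
    """Sample recent traces and compute a pass rate for each eval."""
    sample = random.Random(seed).sample(traces, min(sample_size, len(traces)))
    return {name: sum(ev(t) for t in sample) / len(sample)
            for name, ev in EVALS.items()}

traces = [{"user": "book a flight", "output": "Flight booked."},
          {"user": "book a flight", "output": "Here is a hotel. Confirmation #X9."}]
print(weekly_report(traces))
```

A script like this run on a schedule is the "30 minutes per week" mode of operation the hosts describe later: the upfront work is in discovering the failure modes, not in keeping the checks running.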
The Evals Debate Nuances
(01:10:03)
  • Key Takeaway: Misconceptions about evals stem from rigid definitions and negative experiences with poorly implemented systems, leading to unnecessary debate.
  • Summary: The debate around evals often arises from narrow definitions, such as equating them solely with unit tests or data analysis, and from individuals having negative experiences with poorly executed eval processes. Many successful AI products, including coding agents, implicitly rely on systematic evaluation, even if not explicitly labeled as such. Understanding the nuance is crucial to avoid dismissing the value of structured evaluation.
Coding Agents and Evals
(01:11:51)
  • Key Takeaway: Coding agents’ unique developer-centric nature allows for shorter feedback loops and less reliance on traditional evals due to inherent dogfooding and domain expertise.
  • Summary: Coding agents are a special case where developers are both the users and domain experts, enabling a form of ‘dogfooding’ that can shorten the eval cycle. The direct visibility into generated code allows for immediate assessment of quality. This contrasts with other AI products where users may not possess the same level of expertise or tolerance for errors, necessitating more robust and explicit eval processes.
Evals vs. A/B Tests
(01:16:24)
  • Key Takeaway: A/B tests are a form of evals, but effective A/B testing should be informed by prior error analysis, not just hypothetical product requirements.
  • Summary: A/B tests are considered a subset of evals, involving systematic measurement and comparison of metrics. However, A/B tests are most effective when they are powered by insights gained from error analysis, rather than being based solely on hypothetical assumptions about product improvements. This ensures that the tests are addressing actual issues identified in the data.
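When the metric being compared is a binary pass rate from an eval, the standard two-proportion z-test applies. This is a generic statistics sketch, not a method from the episode, and the counts are made up for illustration.

```python
from math import sqrt, erf

def two_proportion_z(pass_a, n_a, pass_b, n_b):
    """Two-sided two-proportion z-test on pass rates from two variants."""
    p_a, p_b = pass_a / n_a, pass_b / n_b
    p = (pass_a + pass_b) / (n_a + n_b)           # pooled pass rate
    se = sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))  # pooled standard error
    z = (p_b - p_a) / se
    # Normal CDF via the error function; two-sided p-value.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Illustrative counts: variant B's prompt change lifts the pass rate
# on a judge-based eval from 78% to 83% over 1,000 traces each.
z, p_value = two_proportion_z(pass_a=780, n_a=1000, pass_b=830, n_b=1000)
print(f"z={z:.2f}  p={p_value:.4f}")
```

The error-analysis step is what supplies a hypothesis worth testing; the statistics only tell you whether the observed lift on that metric is likely real.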
Common Evals Misconceptions
(01:24:13)
  • Key Takeaway: Key misconceptions include believing tools can fully automate evals, underestimating the power of direct data analysis, and assuming a single correct method for evals.
  • Summary: A common misconception is that AI tools can entirely replace human effort in evals, when in reality, human involvement in error analysis is crucial. Many also underestimate the profound insights gained from simply looking at individual data traces. Furthermore, there isn’t one ‘correct’ way to conduct evals; the approach should be tailored to the product’s stage and available resources, though all effective methods involve error analysis.
Tips for Eval Success
(01:26:37)
  • Key Takeaway: Focus on actionable improvement over perfection, and leverage LLMs to organize thoughts and enhance the eval process without replacing human judgment.
  • Summary: The primary goal of evals is to actionably improve the product, not to achieve perfect evaluation metrics. Embracing imperfection and focusing on finding areas for improvement is key. LLMs can be valuable tools for organizing thoughts, refining product requirements based on error analysis, and presenting information more effectively, but they should augment, not replace, human critical thinking.
Time Investment in Evals
(01:30:38)
  • Key Takeaway: The initial setup for evals requires a significant one-time investment, but ongoing maintenance can be as little as 30 minutes per week.
  • Summary: Establishing an effective eval system involves an upfront investment of several days for initial error analysis and setup. However, once integrated into automated processes like unit tests or weekly scripts, the ongoing time commitment is minimal. This makes evals a highly scalable and efficient activity for continuous product improvement.
Course Content and Perks
(01:34:00)
  • Key Takeaway: The comprehensive AI evals course covers the full lifecycle of error analysis, automated evaluation, and application improvement, with unique perks like a detailed book and an AI assistant.
  • Summary: The course delves into building interfaces for error analysis, cost optimization for LLM usage, and creating a flywheel for continuous product improvement. Students receive a 160-page detailed guide and gain access to an AI assistant trained on all course materials, providing personalized support. The curriculum aims to equip participants with the skills to build robust and profitable AI products.
Lightning Round Insights
(01:38:01)
  • Key Takeaway: Key recommendations include diverse reading, embracing AI coding tools, and adopting mottos focused on continuous learning and empathy.
  • Summary: Recommended reading spans fiction like ‘Pachinko’ and non-fiction on business strategy, alongside foundational AI textbooks. AI-assisted coding tools like Cursor and Claude Code are highlighted for their potential to boost productivity. Life mottos emphasize continuous learning, a beginner’s mindset, and understanding opposing viewpoints to foster collaboration and shared progress in AI development.