
How I AI: Hamel Husain's Guide to Debugging AI Products & Writing Evals

Let's demystify debugging AI product errors and building evals with this simple guide to improving AI products.


Claire Vo

October 14, 2025 · 11 min read

This week, I had Hamel Husain on the show. Hamel is an AI consultant and educator, and he broke down how to debug errors in your AI product and write effective evaluations (evals). He also walked us through how he runs his entire business using some pretty impressive AI workflows.

For many of us, especially product managers, building AI products feels like uncharted territory. The technical details, the unpredictable nature of large language models, and the huge amount of data can make it incredibly difficult to ensure your product is high-quality, consistent, and reliable.

What I love about Hamel’s approach is how systematic it is. He gets you past simple “vibe checks” and into data-driven processes that lead to real, measurable improvements. He showed us that even though the technology is new, the fundamentals—like looking at your data—are the same, just with an AI twist.

Hamel shared two distinct workflows with us. The first is a methodical way to find and fix errors in AI products. The second is a peek into his personal operations, where he uses Claude and Gemini inside a GitHub monorepo to automate and streamline his entire business.

Workflow 1: Systematic Error Analysis for AI Products

One of the biggest challenges with AI products is that they fail in weird, often non-obvious ways. You might fix one prompt, but you have no idea if you’ve just broken something else or actually improved the system overall. Hamel's first workflow is a structured approach to error analysis that helps teams identify, categorize, and prioritize these AI failures so you can make progress with confidence.

Step 1: Log and Examine Real User Traces

Hamel says the first step is simple: look at your data. For AI products, that means examining “traces,” which are the full, multi-turn conversations your AI system has with real users. These traces capture everything—user prompts, AI responses, and even internal events like tool calls, retrieval augmented generation (RAG) lookups, and system prompts. This is how you see the way people actually use your AI, typos and vague questions included, which is essential for understanding its real-world performance.

  • Tools: Platforms like Braintrust or Arize are designed for logging and visualizing these AI traces. You can also build your own logging infrastructure.
  • Process: Collect real user interactions from your deployed system. If you're just starting, you can generate synthetic data, but Hamel emphasizes that real user data reveals the true distribution of inputs.
  • Example: Hamel demonstrated this with Nurture Boss, an AI assistant for property managers. He showed a trace where a user asked, "Hello there, what's up to four month rent?"—an ambiguous query that highlights how real users deviate from ideal test cases.
The Nurture Boss AI-powered virtual leasing assistant and its mobile chat interface.
The Logs view in the NurtureBoss platform (by Parlance Labs): conversation traces with input prompts as truncated JSON, LLM call durations, and token counts.
A single trace in NurtureBoss showing the prompt instructions, user input, tool calls, and the AI's tool-augmented reply.
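To make the idea concrete, here is a minimal sketch of what a logged trace could look like if you rolled your own logging instead of using Braintrust or Arize. The log_trace helper, the JSONL file, and the example conversation are illustrative assumptions, not Nurture Boss's actual schema.

```python
import json
import time
import uuid
from pathlib import Path

LOG_PATH = Path("traces.jsonl")  # hypothetical local log; a real system might send these to Braintrust or Arize

def log_trace(messages, tool_calls, metadata):
    """Append one full conversation trace as a JSON line."""
    trace = {
        "trace_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "messages": messages,      # the full multi-turn conversation, system prompt included
        "tool_calls": tool_calls,  # internal events: tool calls, RAG lookups, etc.
        "metadata": metadata,      # channel, property, user segment, ...
    }
    with LOG_PATH.open("a") as f:
        f.write(json.dumps(trace) + "\n")

# Example: the kind of messy real-world interaction Hamel describes
log_trace(
    messages=[
        {"role": "system", "content": "You are a leasing assistant..."},
        {"role": "user", "content": "Hello there, what's up to four month rent?"},
        {"role": "assistant", "content": "Our four-month lease starts at..."},
    ],
    tool_calls=[{"name": "lookup_pricing", "args": {"term_months": 4}}],
    metadata={"channel": "chatbot"},
)
```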

Step 2: Perform Manual Error Analysis

This part is surprisingly low-tech but effective. Instead of jumping straight to automated solutions, you just manually review a sample of traces and write down what went wrong.

This process is sometimes called “open coding” or journaling. It just means reading through conversations and making a quick one-sentence note on every error you see. The key is to stop at the very first error in the sequence of events, because that's usually the root cause of all the problems that follow.

  • Process: Randomly sample about 100 traces. For each trace, read until you hit a snag—an incorrect, ambiguous, or high-friction part of the experience. Write a concise note about the error.
  • Insight: Focusing on the most upstream error is a heuristic to simplify the process and get fast results. Fixing early intent clarification or tool call issues often resolves many downstream issues.
  • Example Note: For the "what's up to four month rent?" query, Hamel's note was: "Should have asked follow-up questions about the question 'what's up with four month rent?' because the user intent is unclear."
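If you want to do this open coding straight from the terminal before building any tooling, a throwaway script is enough. This is a hypothetical sketch that assumes the JSONL trace format from the earlier logging example.

```python
import json
import random
from pathlib import Path

# Load logged traces (same JSONL format as the logging sketch above)
traces = [json.loads(line) for line in Path("traces.jsonl").open()]

# Randomly sample ~100 traces for manual review
sample = random.sample(traces, k=min(100, len(traces)))

notes = []
for trace in sample:
    print(json.dumps(trace, indent=2))
    # Read until the FIRST error in the sequence of events, then write one sentence.
    note = input("First error you see (blank if none): ").strip()
    if note:
        notes.append({"trace_id": trace["trace_id"], "note": note})

Path("error_notes.jsonl").write_text(
    "\n".join(json.dumps(n) for n in notes) + "\n"
)
```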

Step 3: Create a Custom Annotation System

To speed up the manual review, Hamel recommends building a custom annotation system. This doesn't have to be complicated—it could be a simple internal app or a custom view in your observability platform. The goal is to make it as easy as possible for human annotators (often product managers or subject matter experts) to quickly categorize and label issues.

  • Tools: While platforms like Braintrust and Phoenix offer annotation features, a custom app can be tailored to your specific needs, channels (text message, email, chatbot), and metadata.
  • Benefits: Streamlines the process, ensures human-readable output, and makes it easy to “vibe code” and quickly navigate through data.
The NurtureBoss LLM Grader's custom annotation UI, with filters for communication type (voice, email, text, chatbot) and annotation status (good, bad, annotated, unannotated), plus an AI settings panel for prompt management.
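As a rough illustration of how little code such a tool needs, here is a hypothetical single-file annotation app in Streamlit. It is not Hamel's NurtureBoss grader; the file names and labels are assumptions.

```python
# A minimal, hypothetical annotation UI (Streamlit), reading the traces.jsonl from earlier.
import json
from pathlib import Path
import streamlit as st

traces = [json.loads(line) for line in Path("traces.jsonl").open()]

if "idx" not in st.session_state:
    st.session_state.idx = 0

if st.session_state.idx >= len(traces):
    st.success("All traces annotated.")
    st.stop()

trace = traces[st.session_state.idx]
st.write(f"Trace {st.session_state.idx + 1} of {len(traces)}")
st.json(trace)

label = st.radio("Verdict", ["good", "bad"], horizontal=True)
note = st.text_area("One-sentence note on the first error (if any)")

if st.button("Save and next"):
    with Path("annotations.jsonl").open("a") as f:
        f.write(json.dumps({"trace_id": trace["trace_id"], "label": label, "note": note}) + "\n")
    st.session_state.idx += 1
    st.rerun()
```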

Step 4: Categorize and Prioritize Errors by Frequency Counting

Once you have a bunch of notes, it's time to categorize them. You can use an LLM like ChatGPT to help group your notes into common themes, though it might take a little back-and-forth to get the categories right. Then, you just count how many times each error category shows up. This simple frequency count gives you a clear, prioritized list of what to fix.

  • Process: Aggregate all your notes. Use an LLM or manual review to group similar notes into error categories (e.g., "transfer and handoff issues," "tour scheduling issues," "incorrect information"). Count how many times each category appears.
  • Outcome: This gives you a data-driven roadmap for product improvements. For Nurture Boss, this revealed common problems like the AI not handing off to a human correctly or repeatedly scheduling tours instead of rescheduling them.
  • Key Insight: "Counting is powerful." This simple metric gives you real confidence in what to work on next, helping you move past analysis paralysis and guesswork.
The LLM Grader app showing categorized error results such as 'Transfer/handoff issues' and 'Tour scheduling issues' with their counts, across the various communication session types.
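The counting step itself is only a few lines once each note carries a category. A minimal sketch, assuming the annotations JSONL from the previous step with a "category" field added by an LLM pass or manual grouping:

```python
import json
from collections import Counter
from pathlib import Path

annotations = [json.loads(line) for line in Path("annotations.jsonl").open()]

# Count how often each error category appears
counts = Counter(a["category"] for a in annotations if a.get("category"))

for category, count in counts.most_common():
    print(f"{count:4d}  {category}")
# Illustrative output (numbers made up):
#   27  transfer/handoff issues
#   19  tour scheduling issues
#   11  incorrect information
```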

Step 5: Write Targeted Evaluations (Evals)

Now that you have prioritized error categories, you can write specific evaluations to test for these issues at scale. Evals generally come in two flavors:

  • Code-based Evals: For objective, deterministic checks. If you know the exact right answer or can check for specific patterns (e.g., user IDs not appearing in responses), you can write unit tests. An excellent example is ensuring sensitive information (like UIDs from system prompts) doesn't leak into user-facing outputs.
  • LLM Judges: For subjective problems that require nuanced understanding. If an error like a "transfer handoff issue" is more ambiguous, an LLM can act as a judge. However, it's critical to set these up correctly:
      • Binary Outcomes: LLM judges should give you a simple binary result (yes/no, pass/fail) for a specific problem. They shouldn't be generating arbitrary scores (like a "helpfulness score" of 4.2 vs. 4.7, which is pretty meaningless).
      • Validation: This is a big one: you must hand-label some data and compare the LLM judge's scores to the human labels. This measures the "agreement" and helps you trust your automated evals. If you skip this, you might see "good" eval scores in your dashboard while your users are having a terrible time, which is a fast way to lose their trust.
      • Context: The research paper "Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences" points out that humans are often bad at writing clear instructions until they see how an LLM interprets them. The error analysis process you just did helps you figure out what you actually need, which makes your LLM judge prompts much better.
An overly complex LLM evaluation dashboard, labeled 'Don't Do This!!', full of metrics like Helpfulness, Conciseness, and Accuracy scores and a performance-over-time graph.
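Here is a hedged sketch of both flavors for the examples above: a deterministic check that internal UIDs never leak, and a binary LLM judge plus the human-agreement check. The UID pattern, judge prompt, and call_llm hook are illustrative assumptions, not a specific product's setup.

```python
import re

# --- Code-based eval: deterministic check that system-prompt UIDs never leak ---
UID_PATTERN = re.compile(r"\buid_[a-z0-9]{8}\b")  # assumed internal ID format

def test_no_uid_leak(assistant_message: str) -> bool:
    """Pass if no internal UID appears in the user-facing response."""
    return UID_PATTERN.search(assistant_message) is None

# --- LLM judge: binary pass/fail for one specific failure mode ---
JUDGE_PROMPT = """You are reviewing a leasing-assistant conversation.
Question: did the assistant correctly hand off to a human when the user asked for one?
Answer with exactly one word: PASS or FAIL.

Conversation:
{conversation}
"""

def judge_handoff(conversation: str, call_llm) -> bool:
    """call_llm is any function that takes a prompt string and returns the model's text."""
    verdict = call_llm(JUDGE_PROMPT.format(conversation=conversation)).strip().upper()
    return verdict.startswith("PASS")

# --- Validation: measure agreement between the judge and hand-labeled data ---
def agreement(judge_labels: list[bool], human_labels: list[bool]) -> float:
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)
```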

Step 6: Iterate and Improve with Prompt Engineering or Fine-Tuning

With reliable evals in place, you can keep an eye on performance and see where errors are still happening. The fixes might be simple prompt engineering (like adding today's date to a system prompt so the AI knows what "tomorrow" is), or something more involved like fine-tuning your models on the "difficult examples" you found during your error analysis. Issues with retrieval in RAG systems are a common weak spot and a good place to focus.
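For instance, the "what does tomorrow mean" class of bug can often be addressed by injecting the current date into the system prompt. A tiny sketch; the prompt text is a placeholder:

```python
from datetime import date

SYSTEM_PROMPT_TEMPLATE = """You are a leasing assistant.
Today's date is {today}. Resolve relative dates like "tomorrow" against it.
"""

# Rebuild the system prompt on every request so the date stays current.
system_prompt = SYSTEM_PROMPT_TEMPLATE.format(today=date.today().isoformat())
```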

  • Techniques: Experiment with prompt structures, add more examples to prompts, or even fine-tune models with data derived from your identified errors. As I learned with ChatPRD, even two incorrect words in a monster system prompt can significantly degrade tool calling quality.
  • Advanced Analytics: For agent-based systems with multiple handoffs, you can use analytical tools like transition matrices to pinpoint where errors are most likely to occur between different agent steps (e.g., from the "generate SQL" step to the "execute SQL" step).
A failure transition heatmap from a document on application-centric AI evaluations, showing error frequencies between AI system states such as ParseReq, IntentClass, GenSQL, and ExecSQL.
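A transition matrix is just a cross-tabulation of where failures happen. A minimal pandas sketch, assuming you record, for each failed trace, which step the agent was moving between when the first error appeared; the rows of example data are made up, and the state names mirror the heatmap above:

```python
import pandas as pd

failures = [
    {"from_state": "ParseReq", "to_state": "IntentClass"},
    {"from_state": "GenSQL", "to_state": "ExecSQL"},
    {"from_state": "GenSQL", "to_state": "ExecSQL"},
    # ... one row per failed transition observed during error analysis
]

df = pd.DataFrame(failures)
matrix = pd.crosstab(df["from_state"], df["to_state"])
print(matrix)  # rows: step the agent was in; columns: step where the failure surfaced
```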

Workflow 2: Hamel Husain's AI-Powered Business Operations

Hamel doesn't just use these methods for products—he runs his entire consulting and education business with AI as a co-pilot. His setup is all about being efficient, managing context well, and not getting locked into a single AI provider, all handled from one monorepo.

Step 1: Centralized "Claude Projects" for Every Business Function

Hamel uses Claude's "Projects" feature to create separate, dedicated spaces for different parts of his business. Each project is really just a detailed set of instructions, often with examples, that tells Claude how to do a specific job.

Examples: He has projects for copywriting, a legal assistant, consulting proposals, course content generation, and creating "Lightning Lessons" (lead magnets).

Consulting Proposals Workflow

When a client asks for a proposal, Hamel just feeds the call transcript into his "Consulting Proposals" project. The project is already loaded up with context about his skills (like "partner at Parlance Labs, expert in generative AI"), his style guide (like "get to the point" and "write short sentences"), and lots of examples. Claude then produces a proposal that's almost ready to go, requiring only a minute or so of editing.

Course Content Workflow

For his Maven course on evals, Hamel has a Claude project filled with the entire course book, a huge FAQ, transcripts, and even Discord messages. He uses this to create standalone FAQs and other course materials, all guided by a prompt that tells Claude to write concisely and without any fluff.

The Claude Projects dashboard, with projects like 'Video Copy', 'Legal Assistant', and 'Consulting Proposals'.
The 'Set project instructions' dialog for the consulting-proposals project, with guidelines emphasizing conciseness, customer focus, and advisory language.
The 'Evaluations FAQ' project, instructed to 'help course instructors create stand-alone answers' and backed by a knowledge base including 'combined_office_hours.txt', 'discord_messages.json', and 'course_notes.txt'.
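Claude Projects is a UI feature, but the underlying pattern (fixed instructions plus examples, then drop in the call transcript) is easy to reproduce programmatically. Here is a rough sketch with the Anthropic Python SDK; the instruction text, transcript file name, and model id are placeholders, not Hamel's actual project contents.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Placeholder "project instructions": skills, style guide, and example proposals
PROJECT_INSTRUCTIONS = """You write consulting proposals.
Style guide: get to the point, write short sentences, use advisory language.
Here are two example proposals:
<example>...</example>
<example>...</example>
"""

call_transcript = open("client_call_transcript.txt").read()  # placeholder file

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # assumed model id; use whichever Claude model you have access to
    max_tokens=2000,
    system=PROJECT_INSTRUCTIONS,
    messages=[{"role": "user", "content": f"Draft a proposal based on this call:\n\n{call_transcript}"}],
)
print(response.content[0].text)
```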

Step 2: Custom Software for Content Transformation with Gemini

Hamel has also built his own software to automate content creation, especially for turning video content into easy-to-read formats. He does this using multimodal models like Gemini.

  • Workflow: He takes a YouTube video and uses his software to create an annotated presentation. The system pulls the video transcript, and if the video has slides, it screenshots each slide and generates a summary of what was said beneath it. This lets him consume a one-hour presentation in minutes.
  • Tools: Gemini models are especially good at processing video. They can pull the transcript, video, and slides all at once to create a complete, structured summary.
  • Application: This is invaluable for Hamel's educational work, helping him distribute notes and make complex content digestible for his students.
An annotated presentation page for 'Inspect AI', an open-source Python package for language model evaluations, with an overview, navigation links, and a detailed table of contents.
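The general shape of that video-to-notes pipeline looks roughly like this with the google-generativeai SDK. This is a generic sketch, not Hamel's actual tool; the file name, model choice, and prompt are assumptions.

```python
import time
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # or rely on the GOOGLE_API_KEY environment variable

# Upload the recording; the File API processes the audio and frames for the model.
video = genai.upload_file("talk_recording.mp4")  # placeholder file name
while video.state.name == "PROCESSING":
    time.sleep(5)
    video = genai.get_file(video.name)

model = genai.GenerativeModel("gemini-1.5-pro")
response = model.generate_content([
    video,
    "For each slide shown in this talk, give the slide title and a short summary "
    "of what the speaker said while it was on screen.",
])
print(response.text)
```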

Step 3: The GitHub Monorepo: The "Second Brain" for AI Workflows

I think the most interesting part of Hamel's setup is his GitHub monorepo. This private repository is his central "second brain," holding all his data sources, notes, articles, personal writings, and his collection of prompts and tools. This way, he can give his AI co-pilots (like Claude Code or Cursor) a single, complete source of context for everything he does.

  • Structure: The monorepo contains everything from his blog and the YouTube transcription project to copywriting instructions and proposals. Everything is interrelated.
  • AI Access: He points his AI tools at this repo, providing a set of "Claude rules" within the repo itself. These rules instruct the AI on where to find specific information or context for different writing or development tasks (e.g., "if you need to write, look here").
  • Benefits: This prevents him from getting locked into one vendor, makes sure all his context is available to the AI, and creates a highly organized, prompt-driven system for managing complex information. It's an engineer's dream for managing data and prompts in a way that really scales your personal output.
The hamelsmu/prompts GitHub monorepo, with directories like `.openhands/microagents`, `evals`, and `hamel_tools`, a `CLAUDE.md` file, and a mix of Makefile, Python, and Shell.

Conclusion

This was such a great episode for learning how to think about AI product development and personal productivity in a more structured way. The big takeaway for me is that improving AI quality comes down to systematic data analysis, careful error identification, and well-designed evals. Hamel’s practical advice to "do the hard work" of looking at real data, annotating errors, and validating your LLM judges is so helpful for any team building with AI.

His personal workflows also gave us a look at what a super-efficient, AI-powered business can look like. He showed us how to build a flexible system that cuts down on repetitive work and lets you scale your own expertise.

Whether you're a product manager debugging a chatbot or an entrepreneur trying to automate daily tasks, Hamel's ideas give you a clear playbook for improving your AI projects. I highly recommend checking out his website and his Maven course to learn more about his methods.

Sponsor Thanks

Brought to you by

GoFundMe Giving Funds—One Account. Zero Hassle.

Persona—Trusted identity verification for any use case

