
How I AI: Hamel Husain's Guide to Debugging AI Products & Writing Evals

Let's demystify debugging AI product errors and building evals with this simple guide to improving AI products.


Claire Vo

October 14, 2025 · 11 min read

This week, I had Hamel Husain on the show. Hamel is an AI consultant and educator, and he broke down how to debug errors in your AI product and write effective evaluations (evals). He also walked us through how he runs his entire business using some pretty impressive AI workflows.

For many of us, especially product managers, building AI products feels like uncharted territory. The technical details, the unpredictable nature of large language models, and the huge amount of data can make it incredibly difficult to ensure your product is high-quality, consistent, and reliable.

What I love about Hamel’s approach is how systematic it is. He gets you past simple “vibe checks” and into data-driven processes that lead to real, measurable improvements. He showed us that even though the technology is new, the fundamentals—like looking at your data—are the same, just with an AI twist.

Hamel shared two distinct workflows with us. The first is a methodical way to find and fix errors in AI products. The second is a peek into his personal operations, where he uses Claude and Gemini inside a GitHub monorepo to automate and streamline his entire business.

Workflow 1: Systematic Error Analysis for AI Products

One of the biggest challenges with AI products is that they fail in weird, often non-obvious ways. You might fix one prompt, but you have no idea if you’ve just broken something else or actually improved the system overall. Hamel's first workflow is a structured approach to error analysis that helps teams identify, categorize, and prioritize these AI failures so you can make progress with confidence.

Step 1: Log and Examine Real User Traces

Hamel says the first step is simple: look at your data. For AI products, that means examining “traces,” which are the full, multi-turn conversations your AI system has with real users. These traces capture everything—user prompts, AI responses, and even internal events like tool calls, retrieval augmented generation (RAG) lookups, and system prompts. This is how you see the way people actually use your AI, typos and vague questions included, which is essential for understanding its real-world performance.

  • Tools: Platforms like Braintrust or Arize are designed for logging and visualizing these AI traces. You can also build your own logging infrastructure.
  • Process: Collect real user interactions from your deployed system. If you're just starting, you can generate synthetic data, but Hamel emphasizes that real user data reveals the true distribution of inputs.
  • Example: Hamel demonstrated this with Nurture Boss, an AI assistant for property managers. He showed a trace where a user asked, "Hello there, what's up to four month rent?"—an ambiguous query that highlights how real users deviate from ideal test cases.
The Nurture Boss AI-powered virtual leasing assistant and its mobile chat interface.
The Logs view in the NurtureBoss platform (by Parlance Labs): conversation traces with input prompts as truncated JSON, LLM call durations, and token counts.
A single trace in NurtureBoss showing the prompt instructions, user input, tool calls, and the AI's tool-augmented reply.
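To make the idea concrete, here is a minimal sketch of what a logged trace could look like if you rolled your own logging instead of using Braintrust or Arize. The log_trace helper, the JSONL file, and the example conversation are illustrative assumptions, not Nurture Boss's actual schema.

```python
import json
import time
import uuid
from pathlib import Path

LOG_PATH = Path("traces.jsonl")  # hypothetical local log; a real system might send these to Braintrust or Arize

def log_trace(messages, tool_calls, metadata):
    """Append one full conversation trace as a JSON line."""
    trace = {
        "trace_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "messages": messages,      # the full multi-turn conversation, system prompt included
        "tool_calls": tool_calls,  # internal events: tool calls, RAG lookups, etc.
        "metadata": metadata,      # channel, property, user segment, ...
    }
    with LOG_PATH.open("a") as f:
        f.write(json.dumps(trace) + "\n")

# Example: the kind of messy real-world interaction Hamel describes
log_trace(
    messages=[
        {"role": "system", "content": "You are a leasing assistant..."},
        {"role": "user", "content": "Hello there, what's up to four month rent?"},
        {"role": "assistant", "content": "Our four-month lease starts at..."},
    ],
    tool_calls=[{"name": "lookup_pricing", "args": {"term_months": 4}}],
    metadata={"channel": "chatbot"},
)
```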

Step 2: Perform Manual Error Analysis

This part is surprisingly low-tech but effective. Instead of jumping straight to automated solutions, you just manually review a sample of traces and write down what went wrong.

This process is sometimes called “open coding” or journaling. It just means reading through conversations and making a quick one-sentence note on every error you see. The key is to stop at the very first error in the sequence of events, because that's usually the root cause of all the problems that follow.

  • Process: Randomly sample about 100 traces. For each trace, read until you hit a snag—an incorrect, ambiguous, or high-friction part of the experience. Write a concise note about the error.
  • Insight: Focusing on the most upstream error is a heuristic to simplify the process and get fast results. Fixing early intent clarification or tool call issues often resolves many downstream issues.
  • Example Note: For the "what's up to four month rent?" query, Hamel's note was: "Should have asked follow-up questions about the question 'what's up with four month rent?' because the user intent is unclear."
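If you want to do this open coding straight from the terminal before building any tooling, a throwaway script is enough. This is a hypothetical sketch that assumes the JSONL trace format from the earlier logging example.

```python
import json
import random
from pathlib import Path

# Load logged traces (same JSONL format as the logging sketch above)
traces = [json.loads(line) for line in Path("traces.jsonl").open()]

# Randomly sample ~100 traces for manual review
sample = random.sample(traces, k=min(100, len(traces)))

notes = []
for trace in sample:
    print(json.dumps(trace, indent=2))
    # Read until the FIRST error in the sequence of events, then write one sentence.
    note = input("First error you see (blank if none): ").strip()
    if note:
        notes.append({"trace_id": trace["trace_id"], "note": note})

Path("error_notes.jsonl").write_text(
    "\n".join(json.dumps(n) for n in notes) + "\n"
)
```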

Step 3: Create a Custom Annotation System

To speed up the manual review, Hamel recommends building a custom annotation system. This doesn't have to be complicated—it could be a simple internal app or a custom view in your observability platform. The goal is to make it as easy as possible for human annotators (often product managers or subject matter experts) to quickly categorize and label issues.

  • Tools: While platforms like Braintrust and Phoenix offer annotation features, a custom app can be tailored to your specific needs, channels (text message, email, chatbot), and metadata.
  • Benefits: Streamlines the process, ensures human-readable output, and makes it easy to “vibe code” and quickly navigate through data.
The NurtureBoss LLM Grader's custom annotation UI, with filters for communication type (voice, email, text, chatbot) and annotation status (good, bad, annotated, unannotated), plus an AI settings panel for prompt management.
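As a rough illustration of how little code such a tool needs, here is a hypothetical single-file annotation app in Streamlit. It is not Hamel's NurtureBoss grader; the file names and labels are assumptions.

```python
# A minimal, hypothetical annotation UI (Streamlit), reading the traces.jsonl from earlier.
import json
from pathlib import Path
import streamlit as st

traces = [json.loads(line) for line in Path("traces.jsonl").open()]

if "idx" not in st.session_state:
    st.session_state.idx = 0

if st.session_state.idx >= len(traces):
    st.success("All traces annotated.")
    st.stop()

trace = traces[st.session_state.idx]
st.write(f"Trace {st.session_state.idx + 1} of {len(traces)}")
st.json(trace)

label = st.radio("Verdict", ["good", "bad"], horizontal=True)
note = st.text_area("One-sentence note on the first error (if any)")

if st.button("Save and next"):
    with Path("annotations.jsonl").open("a") as f:
        f.write(json.dumps({"trace_id": trace["trace_id"], "label": label, "note": note}) + "\n")
    st.session_state.idx += 1
    st.rerun()
```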

Step 4: Categorize and Prioritize Errors by Frequency Counting

Once you have a bunch of notes, it's time to categorize them. You can use an LLM like ChatGPT to help group your notes into common themes, though it might take a little back-and-forth to get the categories right. Then, you just count how many times each error category shows up. This simple frequency count gives you a clear, prioritized list of what to fix.

  • Process: Aggregate all your notes. Use an LLM or manual review to group similar notes into error categories (e.g., "transfer and handoff issues," "tour scheduling issues," "incorrect information"). Count how many times each category appears.
  • Outcome: This gives you a data-driven roadmap for product improvements. For Nurture Boss, this revealed common problems like the AI not handing off to a human correctly or repeatedly scheduling tours instead of rescheduling them.
  • Key Insight: "Counting is powerful." This simple metric gives you real confidence in what to work on next, helping you move past analysis paralysis and guesswork.
The LLM Grader app showing categorized error results such as 'Transfer/handoff issues' and 'Tour scheduling issues' with their counts, across the various communication session types.
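The counting step itself is only a few lines once each note carries a category. A minimal sketch, assuming the annotations JSONL from the previous step with a "category" field added by an LLM pass or manual grouping:

```python
import json
from collections import Counter
from pathlib import Path

annotations = [json.loads(line) for line in Path("annotations.jsonl").open()]

# Count how often each error category appears
counts = Counter(a["category"] for a in annotations if a.get("category"))

for category, count in counts.most_common():
    print(f"{count:4d}  {category}")
# Illustrative output (numbers made up):
#   27  transfer/handoff issues
#   19  tour scheduling issues
#   11  incorrect information
```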

Step 5: Write Targeted Evaluations (Evals)

Now that you have prioritized error categories, you can write specific evaluations to test for these issues at scale. Evals generally come in two flavors:

  • Code-based Evals: For objective, deterministic checks. If you know the exact right answer or can check for specific patterns (e.g., user IDs not appearing in responses), you can write unit tests. An excellent example is ensuring sensitive information (like UIDs from system prompts) doesn't leak into user-facing outputs.
  • LLM Judges: For subjective problems that require nuanced understanding. If an error like a "transfer handoff issue" is more ambiguous, an LLM can act as a judge. However, it's critical to set these up correctly:
      • Binary Outcomes: LLM judges should give you a simple binary result (yes/no, pass/fail) for a specific problem. They shouldn't be generating arbitrary scores (like a "helpfulness score" of 4.2 vs. 4.7, which is pretty meaningless).
      • Validation: This is a big one: you must hand-label some data and compare the LLM judge's scores to the human labels. This measures the "agreement" and helps you trust your automated evals. If you skip this, you might see "good" eval scores in your dashboard while your users are having a terrible time, which is a fast way to lose their trust.
      • Context: The research paper "Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences" points out that humans are often bad at writing clear instructions until they see how an LLM interprets them. The error analysis process you just did helps you figure out what you actually need, which makes your LLM judge prompts much better.
An overly complex LLM evaluation dashboard, labeled 'Don't Do This!!', full of metrics like Helpfulness, Conciseness, and Accuracy scores and a performance-over-time graph.
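Here is a hedged sketch of both flavors for the examples above: a deterministic check that internal UIDs never leak, and a binary LLM judge plus the human-agreement check. The UID pattern, judge prompt, and call_llm hook are illustrative assumptions, not a specific product's setup.

```python
import re

# --- Code-based eval: deterministic check that system-prompt UIDs never leak ---
UID_PATTERN = re.compile(r"\buid_[a-z0-9]{8}\b")  # assumed internal ID format

def test_no_uid_leak(assistant_message: str) -> bool:
    """Pass if no internal UID appears in the user-facing response."""
    return UID_PATTERN.search(assistant_message) is None

# --- LLM judge: binary pass/fail for one specific failure mode ---
JUDGE_PROMPT = """You are reviewing a leasing-assistant conversation.
Question: did the assistant correctly hand off to a human when the user asked for one?
Answer with exactly one word: PASS or FAIL.

Conversation:
{conversation}
"""

def judge_handoff(conversation: str, call_llm) -> bool:
    """call_llm is any function that takes a prompt string and returns the model's text."""
    verdict = call_llm(JUDGE_PROMPT.format(conversation=conversation)).strip().upper()
    return verdict.startswith("PASS")

# --- Validation: measure agreement between the judge and hand-labeled data ---
def agreement(judge_labels: list[bool], human_labels: list[bool]) -> float:
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)
```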

Step 6: Iterate and Improve with Prompt Engineering or Fine-Tuning

With reliable evals in place, you can keep an eye on performance and see where errors are still happening. The fixes might be simple prompt engineering (like adding today's date to a system prompt so the AI knows what "tomorrow" is), or something more involved like fine-tuning your models on the "difficult examples" you found during your error analysis. Issues with retrieval in RAG systems are a common weak spot and a good place to focus.
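For instance, the "what does tomorrow mean" class of bug can often be addressed by injecting the current date into the system prompt. A tiny sketch; the prompt text is a placeholder:

```python
from datetime import date

SYSTEM_PROMPT_TEMPLATE = """You are a leasing assistant.
Today's date is {today}. Resolve relative dates like "tomorrow" against it.
"""

# Rebuild the system prompt on every request so the date stays current.
system_prompt = SYSTEM_PROMPT_TEMPLATE.format(today=date.today().isoformat())
```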

  • Techniques: Experiment with prompt structures, add more examples to prompts, or even fine-tune models with data derived from your identified errors. As I learned with ChatPRD, even two incorrect words in a monster system prompt can significantly degrade tool calling quality.
  • Advanced Analytics: For agent-based systems with multiple handoffs, you can use analytical tools like transition matrices to pinpoint where errors are most likely to occur between different agent steps (e.g., from the "generate SQL" step to the "execute SQL" step).
A failure transition heatmap from a document on application-centric AI evaluations, showing error frequencies between AI system states such as ParseReq, IntentClass, GenSQL, and ExecSQL.
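A transition matrix is just a cross-tabulation of where failures happen. A minimal pandas sketch, assuming you record, for each failed trace, which step the agent was moving between when the first error appeared; the rows of example data are made up, and the state names mirror the heatmap above:

```python
import pandas as pd

failures = [
    {"from_state": "ParseReq", "to_state": "IntentClass"},
    {"from_state": "GenSQL", "to_state": "ExecSQL"},
    {"from_state": "GenSQL", "to_state": "ExecSQL"},
    # ... one row per failed transition observed during error analysis
]

df = pd.DataFrame(failures)
matrix = pd.crosstab(df["from_state"], df["to_state"])
print(matrix)  # rows: step the agent was in; columns: step where the failure surfaced
```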

Workflow 2: Hamel Husain's AI-Powered Business Operations

Hamel doesn't just use these methods for products—he runs his entire consulting and education business with AI as a co-pilot. His setup is all about being efficient, managing context well, and not getting locked into a single AI provider, all handled from one monorepo.

Step 1: Centralized "Claude Projects" for Every Business Function

Hamel uses Claude's "Projects" feature to create separate, dedicated spaces for different parts of his business. Each project is really just a detailed set of instructions, often with examples, that tells Claude how to do a specific job.

Examples: He has projects for copywriting, a legal assistant, consulting proposals, course content generation, and creating "Lightning Lessons" (lead magnets).

Consulting Proposals Workflow

When a client asks for a proposal, Hamel just feeds the call transcript into his "Consulting Proposals" project. The project is already loaded up with context about his skills (like "partner at Parlance Labs, expert in generative AI"), his style guide (like "get to the point" and "write short sentences"), and lots of examples. Claude then produces a proposal that's almost ready to go, requiring only a minute or so of editing.

Course Content Workflow

For his Maven course on evals, Hamel has a Claude project filled with the entire course book, a huge FAQ, transcripts, and even Discord messages. He uses this to create standalone FAQs and other course materials, all guided by a prompt that tells Claude to write concisely and without any fluff.

The Claude Projects dashboard, with projects like 'Video Copy', 'Legal Assistant', and 'Consulting Proposals'.
The 'Set project instructions' dialog for the consulting-proposals project, with guidelines emphasizing conciseness, customer focus, and advisory language.
The 'Evaluations FAQ' project, instructed to 'help course instructors create stand-alone answers' and backed by a knowledge base including 'combined_office_hours.txt', 'discord_messages.json', and 'course_notes.txt'.
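Claude Projects is a UI feature, but the underlying pattern (fixed instructions plus examples, then drop in the call transcript) is easy to reproduce programmatically. Here is a rough sketch with the Anthropic Python SDK; the instruction text, transcript file name, and model id are placeholders, not Hamel's actual project contents.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Placeholder "project instructions": skills, style guide, and example proposals
PROJECT_INSTRUCTIONS = """You write consulting proposals.
Style guide: get to the point, write short sentences, use advisory language.
Here are two example proposals:
<example>...</example>
<example>...</example>
"""

call_transcript = open("client_call_transcript.txt").read()  # placeholder file

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # assumed model id; use whichever Claude model you have access to
    max_tokens=2000,
    system=PROJECT_INSTRUCTIONS,
    messages=[{"role": "user", "content": f"Draft a proposal based on this call:\n\n{call_transcript}"}],
)
print(response.content[0].text)
```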

Step 2: Custom Software for Content Transformation with Gemini

Hamel has also built his own software to automate content creation, especially for turning video content into easy-to-read formats. He does this using multimodal models like Gemini.

  • Workflow: He takes a YouTube video and uses his software to create an annotated presentation. The system pulls the video transcript, and if the video has slides, it screenshots each slide and generates a summary of what was said beneath it. This lets him consume a one-hour presentation in minutes.
  • Tools: Gemini models are especially good at processing video. They can pull the transcript, video, and slides all at once to create a complete, structured summary.
  • Application: This is invaluable for Hamel's educational work, helping him distribute notes and make complex content digestible for his students.
An annotated presentation page for 'Inspect AI', an open-source Python package for language model evaluations, with an overview, navigation links, and a detailed table of contents.
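The general shape of that video-to-notes pipeline looks roughly like this with the google-generativeai SDK. This is a generic sketch, not Hamel's actual tool; the file name, model choice, and prompt are assumptions.

```python
import time
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # or rely on the GOOGLE_API_KEY environment variable

# Upload the recording; the File API processes the audio and frames for the model.
video = genai.upload_file("talk_recording.mp4")  # placeholder file name
while video.state.name == "PROCESSING":
    time.sleep(5)
    video = genai.get_file(video.name)

model = genai.GenerativeModel("gemini-1.5-pro")
response = model.generate_content([
    video,
    "For each slide shown in this talk, give the slide title and a short summary "
    "of what the speaker said while it was on screen.",
])
print(response.text)
```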

Step 3: The GitHub Monorepo: The "Second Brain" for AI Workflows

I think the most interesting part of Hamel's setup is his GitHub monorepo. This private repository is his central "second brain," holding all his data sources, notes, articles, personal writings, and his collection of prompts and tools. This way, he can give his AI co-pilots (like Claude Code or Cursor) a single, complete source of context for everything he does.

  • Structure: The monorepo contains everything from his blog and the YouTube transcription project to copywriting instructions and proposals. Everything is interrelated.
  • AI Access: He points his AI tools at this repo, providing a set of "Claude rules" within the repo itself. These rules instruct the AI on where to find specific information or context for different writing or development tasks (e.g., "if you need to write, look here").
  • Benefits: This prevents him from getting locked into one vendor, makes sure all his context is available to the AI, and creates a highly organized, prompt-driven system for managing complex information. It's an engineer's dream for managing data and prompts in a way that really scales your personal output.
The hamelsmu/prompts GitHub monorepo, with directories like `.openhands/microagents`, `evals`, and `hamel_tools`, a `CLAUDE.md` file, and a mix of Makefile, Python, and Shell.

Conclusion

This was such a great episode for learning how to think about AI product development and personal productivity in a more structured way. The big takeaway for me is that improving AI quality comes down to systematic data analysis, careful error identification, and well-designed evals. Hamel’s practical advice to "do the hard work" of looking at real data, annotating errors, and validating your LLM judges is so helpful for any team building with AI.

His personal workflows also gave us a look at what a super-efficient, AI-powered business can look like. He showed us how to build a flexible system that cuts down on repetitive work and lets you scale your own expertise.

Whether you're a product manager debugging a chatbot or an entrepreneur trying to automate daily tasks, Hamel's ideas give you a clear playbook for improving your AI projects. I highly recommend checking out his website and his Maven course to learn more about his methods.

Sponsor Thanks

Brought to you by

GoFundMe Giving Funds—One Account. Zero Hassle.

Persona—Trusted identity verification for any use case

