
How to Systematically Analyze and Debug Errors in AI Products

A structured, data-driven workflow for improving AI product quality by logging real user traces, manually analyzing errors, creating targeted evaluations (evals), and iterating on prompts or models.

From How I AI

How I AI: Hamel Husain's Guide to Debugging AI Products & Writing Evals

with Claire Vo

Tools Used

ChatGPT

OpenAI's conversational AI assistant

Step-by-Step Guide

1

Log and Examine Real User Traces

Collect and review 'traces'—full, multi-turn conversations your AI has with real users. This reveals how people actually use your product, including typos and ambiguous questions, which is crucial for understanding real-world performance. Use platforms like Braintrust or Arize for this.
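
The sketch below shows one way a trace record might be captured locally, assuming a simple JSONL log; the field names (trace_id, messages, tool_calls, metadata) are illustrative, not a Braintrust or Arize schema.

```python
# Minimal sketch: log one record per full, multi-turn conversation.
import json, uuid, datetime, pathlib

def log_trace(messages, tool_calls=None, metadata=None, log_dir="traces"):
    """Append one complete conversation to a local JSONL file for later review."""
    record = {
        "trace_id": str(uuid.uuid4()),
        "logged_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "messages": messages,        # [{"role": "user"/"assistant", "content": ...}, ...]
        "tool_calls": tool_calls or [],
        "metadata": metadata or {},  # e.g. model version, feature, user segment
    }
    path = pathlib.Path(log_dir)
    path.mkdir(exist_ok=True)
    with open(path / "traces.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")
    return record["trace_id"]

# Example: log the conversation exactly as the user typed it, typos and all.
log_trace(
    messages=[
        {"role": "user", "content": "whats up with four month rent??"},
        {"role": "assistant", "content": "Your lease shows a four-month rent credit..."},
    ],
    metadata={"model": "gpt-4o", "feature": "lease_qa"},
)
```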

2

Perform Manual Error Analysis

Manually review a sample of traces (e.g., 100) and write a one-sentence note for the *very first error* you find in each conversation. This 'open coding' process helps identify the root cause of failures quickly.

Prompt:
Example error note: "Should have asked a follow-up question about 'What's up with four-month rent?' because the user's intent is unclear."
Pro Tip: Focusing on the most upstream error is a powerful heuristic. Fixing early intent clarification or tool call issues often resolves many downstream problems.
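
Here is a minimal sketch of the open-coding pass, assuming the JSONL trace format from the logging sketch above: sample roughly 100 traces and record one free-text note per conversation.

```python
# Open coding sketch: review a sample of traces and note the first error in each.
import json, random

def open_code(trace_path="traces/traces.jsonl", notes_path="error_notes.jsonl", n=100):
    with open(trace_path) as f:
        traces = [json.loads(line) for line in f]
    sample = random.sample(traces, min(n, len(traces)))
    with open(notes_path, "a") as out:
        for trace in sample:
            for msg in trace["messages"]:
                print(f"{msg['role']}: {msg['content']}")
            note = input("First error in this trace (one sentence, blank if none): ").strip()
            out.write(json.dumps({"trace_id": trace["trace_id"], "note": note}) + "\n")

if __name__ == "__main__":
    open_code()
```
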
3

Create a Custom Annotation System

To speed up manual review, build a simple internal app or custom view in your observability platform. The goal is to make it easy for annotators (like product managers) to quickly categorize and label issues.
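
An annotation view can be as small as a single-file web app. The sketch below uses Flask purely for illustration (the source does not prescribe a framework); the label set and file names are assumptions.

```python
# Minimal annotation app sketch: show one trace at a time, record a label and a note.
from flask import Flask, request, redirect
import json

app = Flask(__name__)
TRACES = [json.loads(line) for line in open("traces/traces.jsonl")]
LABELS = ["unclear_intent", "wrong_tool_call", "hallucination", "formatting", "no_error"]

@app.route("/annotate/<int:i>", methods=["GET", "POST"])
def annotate(i):
    if request.method == "POST":
        with open("annotations.jsonl", "a") as f:
            f.write(json.dumps({
                "trace_id": TRACES[i]["trace_id"],
                "label": request.form["label"],
                "note": request.form.get("note", ""),
            }) + "\n")
        return redirect(f"/annotate/{i + 1}")  # advance to the next trace
    convo = "<br>".join(f"<b>{m['role']}</b>: {m['content']}" for m in TRACES[i]["messages"])
    buttons = "".join(f'<button name="label" value="{l}">{l}</button>' for l in LABELS)
    return f'<p>{convo}</p><form method="post"><input name="note" placeholder="one-sentence note">{buttons}</form>'

if __name__ == "__main__":
    app.run(port=5000, debug=True)
```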

4

Categorize and Prioritize Errors

Group your manual error notes into common themes or categories, either manually or with help from an LLM like ChatGPT. Count the frequency of each category to create a clear, data-driven, and prioritized list of what to fix first.
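
An LLM can help merge near-duplicate labels first, but the prioritization itself is just a frequency count. A minimal sketch, assuming the annotations.jsonl file from the previous sketch:

```python
# Turn annotation labels into a prioritized, data-driven fix list.
import json
from collections import Counter

with open("annotations.jsonl") as f:
    labels = [json.loads(line)["label"] for line in f]

counts = Counter(label for label in labels if label != "no_error")
for label, n in counts.most_common():
    print(f"{label}: {n} occurrences ({n / len(labels):.0%} of annotated traces)")
```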

5

Write Targeted Evaluations (Evals)

Based on your prioritized error categories, write specific evals. Use code-based evals for objective checks (like data leaks) and LLM judges for subjective issues. Crucially, validate your LLM judges against human-labeled data to ensure they are accurate and produce simple binary (pass/fail) outcomes.
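
The sketch below illustrates both flavors: a code-based check for an objective failure (a hypothetical internal-ID leak) and a helper that measures how often a binary LLM judge agrees with human pass/fail labels. judge_fn and the human-labeled set are assumed inputs, not part of any specific library.

```python
# Code-based eval plus LLM-judge validation against human labels.
import re

INTERNAL_ID = re.compile(r"\bcust_[0-9]{8}\b")  # hypothetical internal ID format

def eval_no_data_leak(assistant_reply: str) -> bool:
    """Code-based eval: pass only if no internal customer ID appears in the reply."""
    return INTERNAL_ID.search(assistant_reply) is None

def judge_agreement(judge_fn, labeled_examples):
    """Fraction of traces where the binary LLM judge matches the human pass/fail label."""
    hits = sum(judge_fn(ex["trace"]) == ex["human_label"] for ex in labeled_examples)
    return hits / len(labeled_examples)

# Example: only trust the judge once its agreement with humans is high enough.
# agreement = judge_agreement(my_llm_judge, human_labeled_set)
# assert agreement >= 0.9, "Judge disagrees with humans too often to be useful"
```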

6

Iterate and Improve

With reliable evals in place, you can now confidently make changes. Use prompt engineering, fine-tuning on difficult examples, or improving your RAG system to address the identified errors and measure the impact with your evals.
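
A minimal sketch of the measurement loop: run the same eval suite against two prompt versions and compare pass rates. run_app, EVALS, and test_cases are placeholders for your own pipeline, not names from the source.

```python
# Compare eval pass rates before and after a prompt change.
def pass_rate(prompt_version, test_cases, evals, run_app):
    passed = 0
    for case in test_cases:
        reply = run_app(prompt_version, case["input"])
        passed += all(check(reply) for check in evals)  # a case passes only if every eval passes
    return passed / len(test_cases)

# before = pass_rate("prompt_v1", test_cases, EVALS, run_app)
# after = pass_rate("prompt_v2", test_cases, EVALS, run_app)
# print(f"pass rate: {before:.0%} -> {after:.0%}")
```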
