Systematic Error Analysis for LLM Systems: Moving Beyond Random Testing
A comprehensive guide to systematic error analysis for LLM systems using dimensional testing and qualitative coding methods
Your natural language to SQL system works perfectly in demos. Then users start complaining. "Show me sales from last quarter" returns nothing. "Find high-value transactions" pulls up everything except what they wanted.
Most teams test LLM systems like traditional software: write test cases, throw random queries at it, fix obvious bugs, ship it. But LLM failures aren't random—they follow patterns.
Your system might handle precise technical queries but consistently fail when business analysts ask vague questions about time periods. It might process simple filters perfectly but break when non-technical users combine aggregation with complex conditions.
Random testing finds obvious problems. It doesn't find the systematic failure modes that actually hurt your users.
Systematic Test Coverage with Dimensions
Instead of random testing, you need systematic coverage. This means identifying the key ways your system can vary and testing combinations deliberately.
Dimensions are the axes along which your system's behavior varies. For a natural language to SQL system, three critical dimensions are:
- Query Type: filter, aggregation, join, complex
- User Persona: business analyst, developer, non-technical user
- Request Clarity: precise, ambiguous, incomplete
Each dimension captures one way things can go wrong. Query type affects SQL complexity. User persona influences how requests are phrased. Request clarity determines how much the system must infer.
Tuples are specific combinations: (Filter, Business Analyst, Ambiguous) or (Join, Developer, Precise). Instead of random queries, you systematically generate examples for each meaningful combination.
This gives you comprehensive coverage of failure modes rather than hoping random examples catch problems.
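A few lines of Python make the enumeration concrete. As a minimal sketch, the dimension values come straight from the list above, and itertools.product yields every combination:

```python
from itertools import product

# The three dimensions defined above; each value is one point on that axis.
QUERY_TYPES = ["filter", "aggregation", "join", "complex"]
PERSONAS = ["business analyst", "developer", "non-technical user"]
CLARITY = ["precise", "ambiguous", "incomplete"]

# Every tuple is one combination to cover: 4 * 3 * 3 = 36 in total.
tuples = list(product(QUERY_TYPES, PERSONAS, CLARITY))

for query_type, persona, clarity in tuples[:3]:
    print(f"({query_type}, {persona}, {clarity})")
```

Thirty-six tuples is small enough to cover exhaustively, which is exactly the point: coverage becomes a checklist rather than a hope.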
The Three-Stage Discovery Process
Once you have systematic test data, you need a way to find patterns in the failures. Qualitative research gives us a proven method: open coding, axial coding, and taxonomy creation.
Open Coding: Run your test queries and label everything you notice. Stay descriptive, not interpretive. "Missing WHERE clause for date range" not "poor date handling." Capture all observations without organizing them yet.
Axial Coding: Group related observations. "Missing WHERE clause," "ignored time constraint," and "default date range used" all cluster into a "Missing Filters" pattern.
Taxonomy Creation: Organize patterns into a hierarchy. Missing Filters becomes a subcategory under Query Logic, alongside Wrong Aggregation and Incorrect Joins.
This bottom-up approach discovers actual failure patterns from your data rather than imposing assumptions about what should go wrong.
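In code terms, each stage is just a reshaping of the same observations. The labels below reuse the "Missing Filters" example above; the structure, not the specific names, is what matters:

```python
# Stage 1 - open coding: flat, descriptive labels, one per observation.
open_codes = [
    "missing WHERE clause for date range",
    "ignored time constraint",
    "default date range used",
]

# Stage 2 - axial coding: related labels grouped under a named pattern.
axial_groups = {"Missing Filters": open_codes}

# Stage 3 - taxonomy: patterns nested under higher-level categories.
taxonomy = {"Query Logic": {"Missing Filters": axial_groups["Missing Filters"]}}
```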
Complete Workflow: NL-to-SQL Error Analysis
Step 1: Define Dimensions and Generate Test Data
For our NL-to-SQL system, we use three dimensions:
- Query Type: filter, aggregation, join, complex
- User Persona: business analyst, developer, non-technical user
- Request Clarity: precise, ambiguous, incomplete
Generate tuples like (Filter, Business Analyst, Ambiguous) → "Show me sales from last quarter" or (Join, Developer, Precise) → "Get customer names with order totals from January 2024."
Create 100 realistic queries covering all meaningful combinations. With 4 × 3 × 3 = 36 tuples, that works out to roughly three queries per tuple.
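Hand-writing 100 queries is feasible, but you can also draft them with an LLM and edit the results by hand. In this sketch, which continues from the tuple enumeration earlier, call_llm is a stand-in for whatever client you use, and the prompt wording is only a starting point:

```python
def generate_query(query_type: str, persona: str, clarity: str) -> str:
    """Draft one natural-language request for a single tuple."""
    prompt = (
        f"Write one natural-language database request that a {persona} "
        f"might type. It should require a {query_type} query and be "
        f"{clarity} in its wording. Return only the request."
    )
    return call_llm(prompt)  # stand-in for your LLM client of choice

# Roughly three drafts per tuple yields ~100 test cases for 36 tuples.
dataset = [
    {"tuple": t, "query": generate_query(*t)}
    for t in tuples
    for _ in range(3)
]
```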
Step 2: Run Tests and Open Code
Execute each query through your system. For every failure, write descriptive labels (see the sketch after this list):
- "Missing date constraint in WHERE clause"
- "Used COUNT instead of SUM for totals"
- "Joined wrong table for customer data"
- "Generated syntactically correct but logically wrong SQL"
Step 3: Find Patterns with Axial Coding
Group related observations (see the sketch after this list):
- Missing Filters: date constraints, status conditions ignored
- Wrong Aggregation: COUNT/SUM/AVG confusion
- Schema Errors: invalid columns, wrong tables
- Logic Errors: correct syntax, wrong business meaning
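The grouping itself is a human judgment call, but recording it as an explicit lookup table keeps it auditable. The mapping below is illustrative, built from the open codes in Step 2:

```python
from collections import defaultdict

# Built by hand while reviewing labels; every open code gets a pattern.
AXIAL_MAP = {
    "Missing date constraint in WHERE clause": "Missing Filters",
    "Used COUNT instead of SUM for totals": "Wrong Aggregation",
    "Joined wrong table for customer data": "Schema Errors",
    "Generated syntactically correct but logically wrong SQL": "Logic Errors",
}

patterns = defaultdict(list)
for record in failures:
    for code in record.open_codes:
        patterns[AXIAL_MAP.get(code, "Unclassified")].append(record)
```

Anything landing in "Unclassified" is a signal to revisit your open codes or add a new pattern.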
Step 4: Create Taxonomy and Quantify
Build a hierarchical structure from your patterns:
SQL Generation Failures
├── Schema Understanding
│   ├── Invalid Columns (18% of failures)
│   └── Wrong Tables (12% of failures)
├── Query Logic
│   ├── Missing Filters (35% of failures)
│   ├── Wrong Aggregation (22% of failures)
│   └── Incorrect Joins (8% of failures)
└── Syntax Issues
    ├── Malformed SQL (3% of failures)
    └── Type Mismatches (2% of failures)
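The percentages above fall out of simple counting. A Counter over the axial patterns from Step 3 produces the same breakdown for your own failure set:

```python
from collections import Counter

# Count how many labeled failures fall under each axial pattern.
counts = Counter(
    AXIAL_MAP.get(code, "Unclassified")
    for record in failures
    for code in record.open_codes
)

total = sum(counts.values())
for pattern, n in counts.most_common():  # largest categories first
    print(f"{pattern}: {n / total:.0%} of failures")
```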
Step 5: Prioritize and Fix
Focus on highest-impact categories first. Missing Filters at 35% gets immediate attention. Schema errors, at a combined 30% (invalid columns plus wrong tables), need systematic fixes like better column-name matching.
This data-driven approach tells you exactly where to invest engineering effort rather than guessing at improvements.
Why This Systematic Approach Works
Traditional QA catches obvious bugs but misses systematic patterns. Random testing might find that your system fails on "complex joins" but won't tell you it specifically fails when non-technical users request complex joins with ambiguous time constraints.
The systematic approach gives you:
Comprehensive Coverage: Instead of hoping random examples catch edge cases, you systematically test all meaningful combinations of variables that affect system behavior.
Data-Driven Insights: Patterns emerge from actual failures rather than assumptions about what might go wrong. You discover that 35% of failures are missing filters, not the SQL syntax errors you expected.
Actionable Priorities: Quantified failure modes tell you exactly where to focus engineering effort for maximum impact.
Reproducible Process: The methodology works across different LLM systems and domains, not just NL-to-SQL.
This transforms error analysis from reactive bug hunting into proactive system improvement based on real usage patterns.