
Lesson 44: Troubleshoot — Common Problems & Fixes

Day 44 of 50 · ~10 min · Phase 5: Advanced Patterns


Your Goal

You'll encounter real problems when using Claude Code at scale. This lesson teaches diagnosis and recovery for three common scenarios.


Scenario 1: Subagent Returns Garbage Summary

The Problem

You spawn a subagent to review security vulnerabilities in your codebase:

Main: "Review all database queries for SQL injection vulnerabilities"

[Subagent runs for 5 minutes]

Subagent returns:
"Found some issues. Database.js uses variables. Check it."

The summary is vague, incomplete, and unhelpful. You can't act on it. You don't know:

  • How many issues?
  • Which ones are critical?
  • How to fix them?
  • Are there false positives?

Diagnosis

Ask yourself:

  1. Was the task autonomous enough? ("Review for SQL injection" is concrete. ✓)
  2. Did Claude have the right context? (You didn't give the subagent your codebase map. ✗)
  3. Was the task too broad? ("All database queries" might be 50 files. Subagent got lost. ✗)
  4. Did you specify output format? (No. Subagent improvised. ✗)

Fix

Restart the subagent with better instructions:

Main: "Review all database queries in src/db/ for SQL injection.

Critical information:
- We use parameterized queries (prepared statements)
- Look for: raw string concatenation, user input in query strings
- Output format:
  1. Files analyzed: [count]
  2. Findings:
     - [file.js line N]: [issue type] - [brief description]
  3. Risk level: LOW / MEDIUM / HIGH
  4. Recommended action: [specific fix]

Be thorough and specific. Show code examples."

Better summary:

Files analyzed: 8
Findings:
- auth.js line 42: Direct string concatenation in query
  Code: const q = `SELECT * FROM users WHERE id = ${userId}`
  Fix: Use parameterized query: db.query('SELECT * FROM users WHERE id = ?', [userId])

Risk level: HIGH (SQL injection in authentication)
Recommended action: Refactor auth.js to use prepared statements
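Once you have specified an output format, you can also enforce it mechanically before acting on a summary. A minimal sketch, assuming the section headings from the example above; the `SUMMARY` text here is an illustrative stand-in for real subagent output:

```shell
# Reject a subagent summary that is missing any required section.
# SUMMARY is a stand-in for real output (note: no "Recommended action" line).
SUMMARY='Files analyzed: 8
Findings:
- auth.js line 42: Direct string concatenation in query
Risk level: HIGH'

missing=""
for section in "Files analyzed:" "Findings:" "Risk level:" "Recommended action:"; do
  grep -qF "$section" <<< "$SUMMARY" || missing="$missing$section "
done

if [ -n "$missing" ]; then
  echo "summary rejected, missing: $missing"
fi
```

A check like this turns "the summary looks vague" into a concrete, automatable rejection step.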

Concept Reinforced

Subagent quality depends on:

  • Clear task: Not vague, not too broad
  • Specific output format: Tell Claude exactly how to structure results
  • Context provided: CLAUDE.md, architecture map, examples of good/bad code
  • Autonomy: Can the task be completed without asking you questions?

Mental model: Briefing a subagent is like writing requirements for a contractor. Vague requirements = vague results. Specific requirements = usable results.


Scenario 2: Headless Mode Runs Infinitely

The Problem

You set up a GitHub Actions workflow to auto-format code:

- name: Auto-format code
  run: |
    claude --headless --task "format code with prettier"

The workflow starts. Minutes pass. No completion. 30 minutes later, GitHub cancels the job.

error: Job exceeded maximum execution time

Claude is stuck in a loop.

Diagnosis

Common causes:

  1. Claude can't complete the task — It keeps finding things to format, never stops
  2. Circular dependency — Formatting triggers another format step
  3. No explicit termination condition — Claude never knows when it's done
  4. Insufficient permissions — Claude is blocked and retrying endlessly

Check the logs:

# GitHub Actions logs show:
Found 50 files to format...
Formatted 10 files...
Formatted 10 files...  # Repeated, not progressing
Formatted 10 files...

Claude is retrying the same files.
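You can confirm the loop mechanically from a captured log rather than eyeballing it. A quick sketch over the excerpt above (`uniq -c` counts adjacent duplicate lines, so a count above 1 means the same progress line repeated back to back):

```shell
# Flag any progress line that repeats back to back — a sign the run is not advancing.
loops=$(uniq -c <<'EOF' | awk '$1 > 1 {print "possible loop:", $2, $3, $4}'
Found 50 files to format...
Formatted 10 files...
Formatted 10 files...
Formatted 10 files...
EOF
)
echo "$loops"
```

In CI, piping the job log through the same filter gives you an early, scriptable loop alarm.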

Fix

Rewrite the task with explicit bounds:

- name: Auto-format code
  run: |
    claude --headless \
      --task "format code with prettier" \
      --max-iterations 1 \
      --timeout 300  # 5 minutes max
      
    # Explicit success check
    if [ $? -eq 0 ]; then
      git add .
      git commit -m "chore: auto-format"
    else
      echo "Format task failed or timed out"
      exit 1
    fi

Better version — give Claude a specific list:

# Get files that need formatting
FILES=$(git diff --name-only origin/main...HEAD | grep -E '\.(js|ts)$' | head -20)

claude --headless \
  --task "format exactly these files with prettier, no other changes: $FILES" \
  --timeout 300
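One edge worth guarding in the snippet above: if the `git diff` matches nothing, you hand the tool an empty file list. A sketch of the guard, using a hard-coded stand-in for the real `git diff` output so the logic is easy to follow:

```shell
# Guard against an empty file list before launching a headless run.
# FILES would normally come from:
#   git diff --name-only origin/main...HEAD | grep -E '\.(js|ts)$' | head -20
FILES=""   # stand-in: pretend no JS/TS files changed on this branch

if [ -z "$FILES" ]; then
  echo "no JS/TS changes; skipping format task"
else
  echo "would format: $FILES"
fi
```

Skipping early is cheaper and safer than letting an unbounded task improvise on an empty input.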

Concept Reinforced

Preventing infinite loops:

  1. Explicit scope — Tell Claude exactly which files/lines, not "all code"
  2. Termination condition — prefer a bounded task ("Format these 5 files") over an open-ended one ("Format until all files pass linting")
  3. Timeout protection — Set max execution time (5-10 min for typical tasks)
  4. Idempotency — Task should produce the same result if run twice
  5. Logging — Make Claude report progress so you can spot loops early
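If your runner or CLI lacks a built-in timeout flag, GNU coreutils `timeout` provides the same guardrail around any command. A sketch, with `sleep 5` standing in for a long-running headless task:

```shell
# timeout kills the command after the given duration and exits with status 124.
# `sleep 5` is a stand-in for a headless task that might run forever.
timeout 1 sleep 5
status=$?
if [ "$status" -eq 124 ]; then
  echo "task exceeded its time budget; treat as failed"
fi
```

Wrapping every unattended invocation this way guarantees an upper bound even when the task itself misbehaves.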

Mental model: Headless Claude needs guardrails. An interactive Claude can ask for help. A headless Claude must have clear boundaries, or it spins forever.


Scenario 3: Cost Spirals Out of Control

The Problem

You set up subagents to explore a large codebase daily, plus a few on-demand runs during development. One month in, your bill is $270.

"$270 a month for one workflow? Multiply that across projects and it's going to bankrupt us."

Diagnosis

Why does cost spiral?

  1. Large codebase + fresh context — Each subagent reads the entire codebase (100K tokens each)
  2. Multiple subagents — You spawned 5 subagents × 100K tokens = 500K tokens/run
  3. Frequent runs — Running daily means high recurrence
  4. No context reuse — Each subagent starts from scratch

Cost breakdown:

5 subagents × 100K tokens input = 500K tokens
500K input tokens × $0.003 per 1K = $1.50
5 subagents × 10K tokens output = 50K tokens  
50K output tokens × $0.015 per 1K = $0.75
Cost per run: $2.25
Runs per day: 1
Runs per month: 30
Monthly cost: $67.50

But wait... you also run it on demand during development.
3 extra runs per day × 30 days = 90 extra runs
90 × $2.25 = $202.50

Monthly total: $270

That matches your bill!
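The arithmetic above is easy to sanity-check in a one-liner, using the rates quoted in this lesson ($0.003 per 1K input tokens, $0.015 per 1K output):

```shell
# Recompute the cost breakdown: 5 subagents, 30 scheduled + 90 on-demand runs/month.
cost=$(awk 'BEGIN {
  input   = 500 * 0.003        # 500K input tokens, priced per 1K
  output  = 50  * 0.015        # 50K output tokens, priced per 1K
  per_run = input + output
  runs    = 30 + 90            # scheduled + on-demand runs per month
  printf "per run: $%.2f, monthly: $%.2f", per_run, runs * per_run
}')
echo "$cost"
```

Keeping the calculation in a script means you can re-run it whenever token counts, rates, or frequency change.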

Fix

Option 1: Reuse context (best)

Instead of spawning 5 independent subagents:

Main session reads codebase once (100K tokens)
Main session spawns 5 subagents with: "Here's the codebase. Do your specific analysis."

Saves: 400K input tokens per run
New cost: $1.05/run (100K input = $0.30 + 50K output = $0.75) instead of $2.25/run

Option 2: Reduce scope

# Instead of analyzing entire codebase:
Main: "Analyze only src/api/ and src/db/"

Saves: 60% of tokens (only reading relevant code)

Option 3: Use cheaper model

claude --model haiku "analyze codebase"
# Haiku costs a fraction of Sonnet's per-token price (check current pricing)

Option 4: Reduce frequency

# Instead of daily: run weekly
# Instead of 5 subagents: run 2
# Instead of full codebase: run on PRs only

Combined fix:

Before:
- 5 subagents daily
- Full codebase
- Sonnet model
- Cost: $270/month

After:
- 2 subagents weekly (on PRs only)
- Only changed files
- Haiku model
- Cost: $15/month

Savings: 95%

Concept Reinforced

Cost optimization strategy:

  1. Measure first — Understand where costs come from
  2. Reuse context — Main session reads once, subagents reference it
  3. Reduce scope — Analyze only what changed, not the whole codebase
  4. Right-size models — Use Haiku for simple tasks, Sonnet for complex ones
  5. Reduce frequency — Weekly is cheaper than daily; on-demand is cheaper than scheduled
  6. Monitor continuously — Track costs weekly and set alerts

Mental model: Cost is a dial you can turn down. Start simple, measure, then optimize. Most teams can cut costs 70% by being strategic without sacrificing quality.


Quick Reference: Common Problems

| Problem | Diagnosis | Quick Fix |
|---|---|---|
| Subagent returns vague results | No output format specified | Specify exact format: "Output: [file] [line] [issue]" |
| Headless mode hangs | Circular dependency or no termination condition | Add timeout, explicit task bounds, iteration limits |
| Costs explode | Full codebase reads, too many subagents, too frequent | Reduce scope, reuse context, use Haiku, reduce frequency |
| Tests fail inconsistently | Flaky tests, not Claude's problem | Fix tests; add determinism and isolation |
| Permission denied errors | CLAUDE.md missing or too restrictive | Review permissions, allow what you need |
| Large codebase confusion | No architecture map | Create CLAUDE.md with directory structure and key files |
| Safety concerns | Don't trust headless Claude | Use --print to preview, test in staging first |
| Slow performance | Too much context, too many files | Use Glob/Grep to read selectively, split into subagents |

Your Troubleshooting Checklist

When something goes wrong with Claude Code:

1. Is it actually Claude Code, or something else?

  • Check error messages carefully
  • Run with --verbose flag for debug output

2. Check permissions (CLAUDE.md)

  • Can Claude Code read the files it needs?
  • Can it write/edit?
  • Are there restrictions?

3. Check the task definition

  • Is it clear and specific?
  • Is it autonomous (doesn't require human feedback)?
  • Does it have bounds (time, scope, format)?

4. Check context

  • Is Claude reading too much?
  • Is it missing important information?
  • Should you provide CLAUDE.md or architecture map?

5. Check the model/cost

  • Are you using the right model?
  • Can you optimize token usage?
  • Is frequency/scope reasonable?

6. Isolate the problem

  • Run a simpler version of the task
  • Try in interactive mode first
  • Test with a small subset of code

Final Thought

The most common mistakes are:

  1. Underspecifying tasks — "Analyze the code" vs. "Check src/api/ for SQL injection and report findings as a list"
  2. Not using --print before headless — Always preview before running unsupervised
  3. No permissions boundary — Always define what Claude can and cannot do in CLAUDE.md
  4. Not measuring costs — You can't optimize what you don't measure

Fix these four, and 80% of problems disappear.

