# Claude Code Session Log Analysis

Analysis of session logs from `~/.claude/projects/-home-me-git-moss/`.

## Session Overview

- Main session: `6cb78e3b-873f-4a1f-aebe-c98d701052e3.jsonl` (59 MB)
- Subagents: 15+ agent files (200-500 KB each)

## Message Types
| Type | Count |
|---|---|
| assistant | 8,512 |
| user | 3,918 |
| file-history-snapshot | 319 |
| queue-operation | 110 |
| system | 41 |
| summary | 1 |
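
These counts come from a single pass over the JSONL file. A minimal sketch, assuming each line is a JSON object with a top-level `type` field, as observed in these logs:

```python
# Count top-level entry types in a Claude Code session log (sketch; assumes
# each JSONL line is a JSON object with a "type" field).
import json
from collections import Counter
from pathlib import Path

def count_message_types(log_path: str) -> Counter:
    counts: Counter = Counter()
    for line in Path(log_path).expanduser().read_text().splitlines():
        if line.strip():
            counts[json.loads(line).get("type", "unknown")] += 1
    return counts

print(count_message_types(
    "~/.claude/projects/-home-me-git-moss/"
    "6cb78e3b-873f-4a1f-aebe-c98d701052e3.jsonl"
))
```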
## Tool Usage
| Tool | Calls | Errors | Success Rate |
|---|---|---|---|
| Bash | 1,372 | 234 | 83% |
| Edit | 990 | 7 | 99% |
| Read | 644 | 4 | 99% |
| TodoWrite | 286 | 0 | 100% |
| Write | 253 | 9 | 96% |
| Grep | 221 | 0 | 100% |
| Glob | 40 | 0 | 100% |
| Task | 12 | 0 | 100% |
| Other | 35 | 3 | 91% |
Total: 3,853 tool calls, 257 errors (93% success rate)
Note: "File not read yet" errors are attributed to Write (the tool that checks), not Edit.
## Bash Command Patterns

Most common command prefixes:

- `uv` (550): running tests, Python scripts, package management
- `ruff` (331): linting and formatting
- `git` (314): version control
- `grep` (41): text search (despite the Grep tool!)
- `nix` (12): development environment
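
A sketch of the prefix tally, assuming Bash `tool_use` blocks carry the command string under `input["command"]`:

```python
# Tally the first token of each Bash tool invocation (sketch; duplicates
# from streaming chunks are skipped by block id).
import json
from collections import Counter
from pathlib import Path

def bash_prefixes(log_path: str) -> Counter:
    seen: set[str] = set()
    prefixes: Counter = Counter()
    for line in Path(log_path).expanduser().read_text().splitlines():
        if not line.strip():
            continue
        content = (json.loads(line).get("message") or {}).get("content") or []
        if not isinstance(content, list):
            continue
        for block in content:
            if (isinstance(block, dict) and block.get("type") == "tool_use"
                    and block.get("name") == "Bash" and block.get("id") not in seen):
                seen.add(block["id"])
                command = block.get("input", {}).get("command", "").strip()
                if command:
                    prefixes[command.split()[0]] += 1
    return prefixes
```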
## Error Analysis

### Bash Errors (234 total)
Exit codes:
- Exit code 1: 222 (general failure)
- Exit code 2: 7 (misuse)
- Exit code 124: 4 (timeout)
- Exit code 127: 1 (command not found)
Common error categories:
- Test failures (pytest): Test assertions failing during development
- Lint errors (ruff): Code style violations caught during check
- Build errors: Missing dependencies, import errors
- LSP server tests: Specific test file with persistent failures
### File Operation Errors (20 total)
| Error | Count | Cause |
|---|---|---|
| File not read yet | 9 | Write/Edit before Read |
| File does not exist | 4 | Read/Write to missing file |
| Found N matches | 3 | Edit string not unique |
| String not found | 2 | Edit target changed |
| File modified since read | 2 | Linter changed file |
### Other Errors (3 total)
- 2: User plan refinements (interactive editing, not rejections)
- 1: KillShell on already-completed shell
### User Rejections

The two "user doesn't want to proceed" errors were actually plan refinements: the user was editing and improving proposed plans via Claude Code's interactive mode, not rejecting work outright.
## Observations

### Patterns

- High Bash usage for testing: `uv run pytest` runs frequently, which is expected during development
- Linting integrated into workflow: `ruff check` and `ruff format` called 331 times
- Git operations frequent: 314 git commands for commits and status checks
- TodoWrite heavily used: 286 calls, showing good task-tracking discipline
### Potential Inefficiencies

- Using `grep` in Bash (41 calls) when the Grep tool exists and is optimized for this
- File read-before-write errors (9 cases): could be addressed with better tool sequencing
- Non-unique Edit strings (3 cases): edit operations need more surrounding context
- Low parallel tool usage: 99.95% of turns have only one content block; independent operations could be parallelized
### Turn Patterns
| Content Blocks | Count |
|---|---|
| 1 | 8,562 |
| 2 | 2 |
| 3 | 2 |
Almost all turns are sequential (1 tool call at a time). There's significant opportunity to parallelize independent operations like:
- Reading multiple files at once
- Running `git status` + `git diff` in parallel
- Multiple Grep searches
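
For reference, a sketch of how the distribution above can be computed: group streaming chunks by `requestId` and keep the maximum number of content blocks seen for each turn:

```python
# Content-block distribution per assistant turn (sketch; streaming chunks
# duplicate entries, so we take the max per requestId).
import json
from collections import Counter, defaultdict
from pathlib import Path

def blocks_per_turn(log_path: str) -> Counter:
    max_blocks: dict[str, int] = defaultdict(int)
    for line in Path(log_path).expanduser().read_text().splitlines():
        if not line.strip():
            continue
        entry = json.loads(line)
        if entry.get("type") != "assistant":
            continue
        rid = entry.get("requestId")
        content = (entry.get("message") or {}).get("content") or []
        if rid and isinstance(content, list):
            max_blocks[rid] = max(max_blocks[rid], len(content))
    return Counter(max_blocks.values())
```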
### What's Working Well
- Edit success rate 99%: Very reliable file editing
- Read success rate 99%: File reading rarely fails
- TodoWrite 100% success: Task tracking working flawlessly
- Subagent usage: Only 12 Task calls, showing restraint in spawning subagents
- Subagents are focused: Mostly use Read/Grep/Glob for exploration, minimal Bash
## Subagent Patterns

15 subagents were spawned; their tool usage shows they're used for research:
- Read: Primary tool (11-27 calls per agent)
- Grep: Secondary (1-7 calls)
- Glob: File finding (1-4 calls)
- Bash: Minimal (0-10 calls, usually for running code)
Subagents don't edit files - they gather information for the main session.
## Recommendations

### For Claude Code Users
- Prefer Grep tool over bash grep: More integrated, better error handling
- Always Read before Write/Edit: Avoid the 9 "not read yet" errors
- Provide more context in Edit: Avoid non-unique string matches
- Parallelize independent operations: Read multiple files, run multiple commands in one turn
### For This Project

- LSP tests need attention: multiple failures in `test_lsp_server.py`
- Consider log analysis tooling: this manual analysis could be automated as a `moss` command
## Future Work

Could build a `moss analyze-logs` command that:
- Parses Claude Code session logs
- Computes tool success rates
- Identifies error patterns
- Suggests workflow improvements
- Estimates token costs
This would help users understand their agent interaction patterns.
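
A hypothetical skeleton for such a command; the subcommand name, flags, and report format are illustrative, not an existing moss interface. It reuses `tool_stats` from the sketch in the Tool Usage section, assumed here to live in a hypothetical `moss.loganalysis` module:

```python
# Hypothetical `moss analyze-logs` skeleton; names and flags are illustrative.
import argparse

from moss.loganalysis import tool_stats  # hypothetical module (sketch above)

def main() -> None:
    parser = argparse.ArgumentParser(prog="moss analyze-logs")
    parser.add_argument("log", help="path to a Claude Code session .jsonl")
    parser.add_argument("--top", type=int, default=10,
                        help="show the N most-used tools")
    args = parser.parse_args()

    stats = tool_stats(args.log)
    ranked = sorted(stats.items(), key=lambda kv: -kv[1]["calls"])
    for tool, s in ranked[:args.top]:
        rate = 100 * (1 - s["errors"] / s["calls"])
        print(f"{tool:<12} {s['calls']:>5} calls  {rate:5.1f}% success")

if __name__ == "__main__":
    main()
```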
## Token Usage
Note: log entries are duplicated per streaming chunk. Numbers below group by unique `requestId`.
| Metric | Count |
|---|---|
| Unique API Calls | 3,716 |
| New Input Tokens | 33K |
| Cache Creation | 7.4M |
| Cache Read | 362M |
| Output Tokens | 1.45M |
| Total Context Processed | 372M |
### Context Size Distribution
| Metric | Tokens |
|---|---|
| Minimum context | 19K |
| Maximum context | 156K |
| Average context | 100K |
Context grew from ~19K to ~156K tokens as the session progressed over 31 hours.
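
A sketch of that computation: for each unique `requestId`, the context size is the sum of new input, cache-creation, and cache-read tokens (field names as observed in these logs):

```python
# Per-call context size = new input + cache creation + cache read (sketch).
import json
from pathlib import Path

def context_sizes(log_path: str) -> list[int]:
    per_request: dict[str, int] = {}
    for line in Path(log_path).expanduser().read_text().splitlines():
        if not line.strip():
            continue
        entry = json.loads(line)
        usage = (entry.get("message") or {}).get("usage")
        rid = entry.get("requestId")
        if not (usage and rid):
            continue
        context = (usage.get("input_tokens", 0)
                   + usage.get("cache_creation_input_tokens", 0)
                   + usage.get("cache_read_input_tokens", 0))
        per_request[rid] = max(per_request.get(rid, 0), context)
    return sorted(per_request.values())

sizes = context_sizes("session.jsonl")
print(f"min {sizes[0]:,}  max {sizes[-1]:,}  avg {sum(sizes) // len(sizes):,}")
```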
### Cost Estimate (Opus 4.5 API rates)

Opus 4.5: $5/M input, $25/M output. Cache reads are billed at a 90% discount; cache writes at a 25% premium.
| Component | Tokens | Rate | Cost |
|---|---|---|---|
| Cache read | 362M | $0.50/M | $181 |
| Cache create | 7.4M | $6.25/M | $46 |
| New input | 33K | $5/M | $0.17 |
| Output | 1.45M | $25/M | $36 |
| **Total** | | | ~$263 |
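
The table's arithmetic as a small helper (token counts in millions, rates in $/M; cache reads billed at 10% of the input rate, writes at 125%):

```python
def estimate_cost(cache_read_m: float, cache_create_m: float,
                  new_input_m: float, output_m: float,
                  input_rate: float, output_rate: float) -> float:
    """Token counts in millions; rates in $/M. Cache read = 10% of the input
    rate, cache write = 125% of the input rate."""
    return (cache_read_m * input_rate * 0.10
            + cache_create_m * input_rate * 1.25
            + new_input_m * input_rate
            + output_m * output_rate)

# Opus 4.5 over this session's totals:
print(round(estimate_cost(362, 7.4, 0.033, 1.45, input_rate=5, output_rate=25)))
# -> 264 (the table rounds each component first, giving ~$263)
```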
Note: a Claude Max subscription ($100-200/mo) covers Claude Code usage at a flat rate (within plan limits), so actual user cost depends on subscription tier, not raw API rates.
## Cost Comparison: Alternative Models & Paradigms
Baseline numbers from this session:
- 3,716 API calls
- ~370M input tokens processed (97% from cache)
- 1.45M output tokens
- Average context: 100K tokens/call
### a) Gemini Flash ($0.50/M input, $3/M output, 90% auto-cache)
| Component | Tokens | Rate | Cost |
|---|---|---|---|
| Cached input | 359M | $0.05/M | $18 |
| Non-cached input | 11M | $0.50/M | $6 |
| Output | 1.45M | $3/M | $4 |
| **Total** | | | ~$28 |
9x cheaper than Opus ($263 → $28). Most of the savings come from the 10x cheaper cache reads ($181 → $18), with cheaper output ($36 → $4) second.
### b) Moss Paradigm (conservative 10x context reduction)
Target: 10K tokens/call instead of 100K (conservative; vision is 1-5K).
| Metric | Current | Moss |
|---|---|---|
| Context/call | 100K | 10K |
| Total input | 370M | 37M |
| Output (est. 30% reduction) | 1.45M | 1.0M |
**Moss + Opus 4.5:**
| Component | Tokens | Rate | Cost |
|---|---|---|---|
| Cached input | 35.9M | $0.50/M | $18 |
| Cache create | 740K | $6.25/M | $5 |
| New input | 360K | $5/M | $2 |
| Output | 1.0M | $25/M | $25 |
| **Total** | | | ~$50 |
5x cheaper than current ($263 → $50).
**Moss + Gemini Flash:**
| Component | Tokens | Rate | Cost |
|---|---|---|---|
| Cached input | 33.3M | $0.05/M | $2 |
| Non-cached input | 3.7M | $0.50/M | $2 |
| Output | 1.0M | $3/M | $3 |
| **Total** | | | ~$7 |
38x cheaper than current Opus ($263 → $7).
### Summary
| Configuration | Cost | vs Current |
|---|---|---|
| Opus 4.5 (current) | $263 | baseline |
| Gemini Flash (same paradigm) | $28 | 9x cheaper |
| Opus 4.5 + Moss | $50 | 5x cheaper |
| Gemini Flash + Moss | $7 | 38x cheaper |
The moss paradigm's value compounds with cheaper models. Reduced context means:
- Less to cache/transmit
- More focused model attention (better quality)
- Faster response times
## Data Deduplication Note

The raw log has 8,500+ assistant entries because each streaming chunk is logged separately. Grouping by `requestId` and taking max values gives the correct per-call totals.
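
Concretely, a sketch of that grouping, assuming the usage fields observed here; naively summing rows would overcount heavily:

```python
# Deduplicate streaming chunks: per requestId, take the max of each counter.
import json
from collections import defaultdict
from pathlib import Path

FIELDS = ("input_tokens", "cache_creation_input_tokens",
          "cache_read_input_tokens", "output_tokens")

def usage_by_request(log_path: str) -> dict[str, dict[str, int]]:
    totals: dict[str, dict[str, int]] = defaultdict(
        lambda: dict.fromkeys(FIELDS, 0))
    for line in Path(log_path).expanduser().read_text().splitlines():
        if not line.strip():
            continue
        entry = json.loads(line)
        usage = (entry.get("message") or {}).get("usage")
        rid = entry.get("requestId")
        if usage and rid:
            for field in FIELDS:
                totals[rid][field] = max(totals[rid][field],
                                         usage.get(field, 0))
    return dict(totals)
```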
## Architectural Insights

### The Real Problem: Context Accumulation
The chatlog-as-context paradigm is fundamentally flawed:
- Context grew 19K → 156K tokens over 31 hours
- 99% of tokens are re-sent unchanged every turn
- "Lost in the middle" - models degrade at high context
- Signal drowns in noise
### Why All Agentic Tools Do This

Every tool (Claude Code, Cursor, Copilot, etc.) uses a similar approach, not because it's optimal, but because:
- LLM APIs are stateless - must send context each call
- Summarization risks losing critical details
- RAG retrieval might miss something
- "Good enough" for short sessions
- Engineering complexity / risk aversion
### The Better Approach (Moss Vision)

**For code:** a Merkle tree of structural views (sketched after the list below)
- Hash AST/modules/deps
- Know what changed instantly
- Invalidate cached summaries surgically
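
A minimal content-level sketch of the Merkle idea; real moss would hash structural views (ASTs, module graphs, dependencies) rather than raw bytes, but the invalidation logic is the same:

```python
# Merkle-style tree hashing: a subtree's hash changes iff something in it
# changed, so cached summaries can be invalidated surgically.
import hashlib
from pathlib import Path

def tree_hash(root: Path, hashes: dict[Path, str]) -> str:
    """Fill `hashes` bottom-up: files hash their bytes, directories hash
    the names and hashes of their children."""
    if root.is_file():
        digest = hashlib.sha256(root.read_bytes()).hexdigest()
    else:
        children = sorted(p for p in root.iterdir()
                          if not p.name.startswith("."))
        digest = hashlib.sha256("".join(
            f"{p.name}:{tree_hash(p, hashes)}" for p in children
        ).encode()).hexdigest()
    hashes[root] = digest
    return digest

def changed_paths(old: dict[Path, str], new: dict[Path, str]) -> set[Path]:
    """Subtrees whose hash differs are exactly the ones to re-summarize."""
    return {path for path, digest in new.items() if old.get(path) != digest}

# Usage: before = {}; tree_hash(Path("src"), before); ...edit files...
# after = {}; tree_hash(Path("src"), after); changed_paths(before, after)
```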
**For chat:** don't keep history; instead assemble a compiled context (sketched below) from:
- Task state: "implementing feature X"
- Retrieved memories: RAG for relevant past learnings
- Recent turns: last 2-3 exchanges only
- Compiled context: skeleton/deps, not raw files
**Target:** 1-10K tokens instead of 100K+
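
A hypothetical sketch of assembling such a compiled context under a token budget; the inputs (task state, retrieved memories, recent turns, code skeleton) stand in for whatever moss would actually produce:

```python
def build_context(task_state: str, memories: list[str], recent_turns: list[str],
                  skeleton: str, budget_tokens: int = 10_000) -> str:
    """Compiled context: task, retrieved learnings, last turns, code skeleton."""
    parts = [
        "## Task\n" + task_state,
        "## Relevant memories\n" + "\n".join(f"- {m}" for m in memories[:5]),
        "## Recent turns\n" + "\n\n".join(recent_turns[-3:]),
        "## Code skeleton\n" + skeleton,
    ]
    context = "\n\n".join(parts)
    # Rough budget check; ~4 characters per token is a common heuristic.
    if len(context) > budget_tokens * 4:
        raise ValueError("compiled context over budget; tighten retrieval")
    return context
```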
### Why This Should Be More Reliable

Curated context isn't just cheaper; it's better:
- Model focuses on relevant signal
- No "lost in the middle" degradation
- Less noise = fewer hallucinations
- Faster iteration cycles
*Analysis performed on a session from Dec 17-18, 2025.*