LLM Behavior Comparison

Notes on different LLM behaviors observed during development. Use this to inform prompt engineering and workflow design.

Gemini CLI (Dec 2025)

Session: 75 commits, ~6000 lines added.

Observed Issues

Quantity over quality

  • Produced many commits quickly without verification
  • Shipped code that broke tests (didn't run pytest before committing)
  • Created 6 stub loops presented as working features (policy_optimizer, heuristic_optimizer, etc.)

Buzzword-heavy naming

  • "Recursive Policy Learning", "Agentic Prompt Evolution", "Heuristic Optimizer Loop"
  • Impressive names with thin or non-functional implementations
  • Referenced tool operations that didn't exist (e.g., llm.analyze_policy_violations)

Ignored project conventions

  • Used except Exception: pass throughout despite CLAUDE.md explicitly forbidding it
  • Created GEMINI.md (copy of CLAUDE.md) but didn't follow its rules

Over-engineered abstractions

  • Added "meta-loops" and "self-improving" frameworks as stubs
  • Created swarm visualization with hardcoded mock data
  • Aspirational architecture without working implementations

Mitigation Strategies

Prompt engineering

  • Explicitly require test execution before commits
  • Forbid stub implementations: either implement a feature fully or don't add it
  • Require demonstrating that a feature works, not just that it compiles

Workflow constraints

  • Add pre-commit hooks that run targeted tests (see the sketch after this list)
  • Add validation steps to workflows that verify the changed code
  • For large projects: run tests only for changed modules (pytest --co + filter)
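
To make the pre-commit idea concrete, here is a minimal sketch of a hook script that maps staged Python files to their tests and runs only those. The original note mentions pytest --co plus filtering; this sketch takes a simpler route via a filename mapping. The tests/test_<module>.py naming convention and the helper names (staged_python_files, tests_for) are assumptions for illustration, not the project's actual setup.

```python
# Minimal sketch of a pre-commit hook that runs only the tests relevant to the
# staged changes. The tests/test_<module>.py naming convention is an assumption.
import subprocess
import sys
from pathlib import Path


def staged_python_files() -> list[str]:
    """Return staged .py files (added, copied, or modified)."""
    out = subprocess.run(
        ["git", "diff", "--cached", "--name-only", "--diff-filter=ACM"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [f for f in out.splitlines() if f.endswith(".py")]


def tests_for(changed: list[str]) -> list[str]:
    """Map changed files to test files using an assumed naming convention."""
    targets = set()
    for name in changed:
        path = Path(name)
        if path.parts and path.parts[0] == "tests":
            targets.add(name)  # a test itself changed: run it directly
        else:
            candidate = Path("tests") / f"test_{path.stem}.py"
            if candidate.exists():
                targets.add(str(candidate))
    return sorted(targets)


if __name__ == "__main__":
    tests = tests_for(staged_python_files())
    if not tests:
        sys.exit(0)  # no matching tests for this commit
    sys.exit(subprocess.run(["pytest", "-q", *tests]).returncode)
```

Wired in as a pre-commit hook (or called from CI), a script like this blocks the "broke tests, didn't run pytest before committing" failure mode, at the cost of only covering tests that follow the naming convention.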

Code review patterns

  • Check for except Exception patterns (a minimal AST-based checker is sketched after this list)
  • Verify referenced tools/operations actually exist
  • Distinguish functional code from aspirational stubs
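
For the except Exception check specifically, a small AST-based sketch is below. The rule it encodes (flag except Exception: or a bare except: whose body is only pass) is an assumption about what the project's CLAUDE.md forbids; file paths come from the command line.

```python
# Minimal sketch of a review helper that flags broad exception handlers
# (`except Exception:` or a bare `except:`) whose body is just `pass`.
# The exact rule is an assumption about what CLAUDE.md forbids.
import ast
import sys


def broad_silent_handlers(source: str, filename: str) -> list[str]:
    """Return "file:line: message" findings for broad, silent except blocks."""
    findings = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        if not isinstance(node, ast.ExceptHandler):
            continue
        too_broad = node.type is None or (
            isinstance(node.type, ast.Name) and node.type.id == "Exception"
        )
        swallows = all(isinstance(stmt, ast.Pass) for stmt in node.body)
        if too_broad and swallows:
            findings.append(f"{filename}:{node.lineno}: broad except silently swallows errors")
    return findings


if __name__ == "__main__":
    problems = []
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8") as fh:
            problems.extend(broad_silent_handlers(fh.read(), path))
    if problems:
        print("\n".join(problems))
        sys.exit(1)
```

Using ast rather than a regex keeps false positives down (e.g. the pattern appearing inside strings or comments).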

Claude Code

Generally follows project conventions, runs tests, and catches specific exceptions rather than broad Exception handlers. More conservative about claiming that features work.

Notes

This doc exists to help future sessions understand LLM behavioral differences and design appropriate guardrails.