LLM Introspection Tools Evaluation
Findings from using Moss's introspection tools.
DWIM Effectiveness
| Feature | Accuracy | Confidence | Notes |
|---|---|---|---|
| Typo correction | 100% | 0.80-0.93 | "skelton" → "skeleton" works |
| Alias resolution | 100% | 1.00 | "imports" → "deps" perfect |
| Natural language | 100% top-3 | 0.24-0.51 | Correct tool found, but confidence mostly below the 0.5 threshold |
Issue: Natural language queries have low confidence despite correct results. The TF-IDF approach works for ranking but confidence scores don't reflect accuracy.
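Recommendation: Lower SUGGEST_THRESHOLD from 0.5 to 0.3, or return the top-k results regardless of threshold. A threshold of 0.3 would still exclude the lowest observed confidence (0.24); the top-k option would not.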
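The two options in the recommendation can be combined, as in the minimal sketch below. This is an illustration, not Moss's implementation: the `suggest` function, the `(tool, confidence)` pair format, and the example scores are all hypothetical. Only the `SUGGEST_THRESHOLD` name and the 0.5 and 0.3 values come from the findings above.

```python
SUGGEST_THRESHOLD = 0.3  # lowered from 0.5, per the recommendation


def suggest(scored, threshold=SUGGEST_THRESHOLD, top_k=3):
    """Return tool names to suggest from (tool, confidence) pairs.

    Hypothetical helper: candidates at or above the threshold are kept;
    if none qualify, fall back to the top_k best-ranked candidates.
    """
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    confident = [tool for tool, score in ranked if score >= threshold]
    if confident:
        return confident
    return [tool for tool, _ in ranked[:top_k]]


# Illustrative scores only; they are not measurements from the evaluation.
example = [("deps", 0.27), ("skeleton", 0.24), ("context", 0.12), ("query", 0.05)]
print(suggest(example))
```

With these made-up scores nothing reaches 0.3, so the fallback returns the three best-ranked tools instead of an empty list. That matches the case the evaluation describes, where ranking is right but confidence is low.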
Tool Effectiveness
What Works Well
context — Best entry point for any file
- Shows lines, symbol counts, imports at a glance
- Good for deciding what to explore next
query --inherits — Finding subclasses
- Command: moss query src/ --inherits Exception --type class
- Found all 4 exception classes instantly
- Much faster than grep for semantic queries
JSON + Python — Custom analysis
- Command: moss --json deps src/ | python3 -c "..."
- Enables arbitrary analysis
- Built dependency graph, found most-imported modules
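The Python half of that pipeline is elided in these notes. As one hypothetical illustration of a "most-imported modules" analysis, the sketch below counts imports across files. It assumes the JSON decodes to a mapping from file path to a list of imported module names. That schema is an assumption, not the documented output of `moss --json deps`, and the sample data is made up.

```python
import json
from collections import Counter


def most_imported(deps_json, top=3):
    """Count how often each module is imported across all files.

    ASSUMPTION: deps_json decodes to {file_path: [imported_module, ...]}.
    The real `moss --json deps` schema may differ.
    """
    deps = json.loads(deps_json)
    counts = Counter(module for imports in deps.values() for module in imports)
    return counts.most_common(top)


# Made-up sample input standing in for the tool's output.
sample = json.dumps({
    "a.py": ["pkg.views", "json"],
    "b.py": ["pkg.views"],
    "c.py": ["json", "pkg.views", "pathlib"],
})
print(most_imported(sample, top=2))
```

In a real pipeline the same function could read `sys.stdin.read()` instead of the sample string, once the actual output shape is confirmed.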
skeleton — Code structure
- 19 top-level symbols in dwim.py identified correctly
- Signatures and docstrings preserved
Gaps
- No line counts per function — Can't filter by complexity
- No reverse deps — "What imports this module?" not directly available
- No symbol sizes — End line numbers would help estimate function length
- CFG verbosity — Full graph output overwhelming for large functions
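The first and third gaps both come down to knowing where each function ends. As a rough sketch of the data that would close them, the standard-library `ast` module already exposes `lineno` and `end_lineno` on function nodes (Python 3.8+). The `function_lengths` helper below is hypothetical and assumes the analyzed source is Python; it does not reflect how Moss parses files.

```python
import ast


def function_lengths(source):
    """Map each function or method name to its length in lines.

    Uses the standard-library ast module; end_lineno requires Python 3.8+.
    Same-named functions overwrite each other in this simple version.
    """
    lengths = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            lengths[node.name] = node.end_lineno - node.lineno + 1
    return lengths


sample = '''
def short():
    return 1

def longer(x):
    if x:
        return x
    return None
'''
print(function_lengths(sample))
```

A per-function length like this would be enough to filter or sort symbols by size, which is the use case the gaps describe.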
Usage Patterns
Understanding a file: moss context <file>
Finding implementations: moss query <dir> --inherits <base> or --signature <pattern>
Dependency analysis: moss --json deps <dir> | python3 -c "..."
Symbol inventory: moss --json skeleton <dir> | python3 -c "..."
Test Results
DWIM:
- Typo correction: 7/7 (100%)
- Alias resolution: 8/8 (100%)
- NL routing top-3: 8/8 (100%)
Tools on Moss codebase:
- context: Shows 596 lines, 4 classes, 15 functions, 5 methods for dwim.py
- query --inherits Exception: Found 4 exception classes
- deps analysis: Identified moss.views (7 imports) as most-used internal module
- skeleton: Extracted 19 top-level symbols correctly
Recommendations
High Priority
- Lower DWIM threshold for natural language
- Add line counts for complexity filtering
Medium Priority
- Add reverse dependency lookup
- Add symbol end lines for size calculation
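A reverse dependency lookup can in principle be derived by inverting a forward dependency map. The sketch below shows the idea under stated assumptions: the `reverse_deps` helper, the `{module: [imported modules]}` input shape, and the `pkg.*` module names are hypothetical and not taken from Moss.

```python
from collections import defaultdict


def reverse_deps(forward):
    """Invert {module: [modules it imports]} into {module: [modules importing it]}."""
    reverse = defaultdict(set)
    for importer, imported in forward.items():
        for target in imported:
            reverse[target].add(importer)
    # Sort importers so the output is deterministic.
    return {target: sorted(importers) for target, importers in reverse.items()}


# Made-up forward dependency map for illustration.
forward = {
    "pkg.cli": ["pkg.views", "pkg.dwim"],
    "pkg.dwim": ["pkg.views"],
    "pkg.views": [],
}
print(reverse_deps(forward)["pkg.views"])
```

Modules that nothing imports do not appear as keys in the result, so a lookup for them should use `.get(name, [])`.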
Low Priority
- CFG summary mode
- Grouped multi-file output