LLM Introspection Tools Evaluation

Findings from using Moss's introspection tools.

DWIM Effectiveness

Feature	Accuracy	Confidence	Notes
Typo correction	100%	0.80-0.93	"skelton" → "skeleton" works
Alias resolution	100%	1.00	"imports" → "deps" perfect
Natural language	100% top-3	0.24-0.51	Correct tool found, but confidence below threshold

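Issue: Natural language queries score only 0.24-0.51, a range that barely reaches the 0.5 suggestion threshold, even though the correct tool appeared in the top 3 for every test query. TF-IDF ranks the candidates well, but its raw scores are not calibrated as confidence values, so they understate how accurate the results are.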

Recommendation: Lower SUGGEST_THRESHOLD from 0.5 to 0.3, or use top-k results regardless of threshold.
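
The two options in the recommendation can be combined: accept anything above the lowered threshold, and fall back to the top-k candidates when nothing clears it. The following is a minimal sketch under stated assumptions, not Moss's actual implementation. The function name `select_suggestions`, the `(tool, score)` tuple shape, and the sample scores are all hypothetical; only `SUGGEST_THRESHOLD` and the values 0.5 and 0.3 come from the findings above.

```python
from typing import List, Tuple

# Lowered from 0.5 to 0.3, as recommended above.
SUGGEST_THRESHOLD = 0.3


def select_suggestions(
    scored: List[Tuple[str, float]],
    threshold: float = SUGGEST_THRESHOLD,
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Return candidates at or above the threshold, best first.

    If no candidate clears the threshold, fall back to the top_k
    highest-ranked candidates instead of returning nothing.
    """
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    confident = [pair for pair in ranked if pair[1] >= threshold]
    return confident if confident else ranked[:top_k]


# Hypothetical scores for two natural language queries.
low_scores = [("context", 0.08), ("skeleton", 0.24), ("query", 0.05), ("deps", 0.12)]
mixed_scores = [("deps", 0.41), ("context", 0.22), ("skeleton", 0.10)]

print(select_suggestions(low_scores))    # nothing clears 0.3, so the top 3 are returned
print(select_suggestions(mixed_scores))  # only "deps" clears 0.3
```

The fallback keeps the ranking useful even when scores are poorly calibrated, which matches the observation that ranking accuracy is high while confidence values are low.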

Tool Effectiveness

What Works Well

context — Best entry point for any file

Shows lines, symbol counts, imports at a glance
Good for deciding what to explore next

query --inherits — Finding subclasses

moss query src/ --inherits Exception --type class found all 4 exception classes instantly
Much faster than grep for semantic queries such as inheritance, which plain text search cannot express directly

JSON + Python — Custom analysis

moss --json deps src/ | python3 -c "..." enables arbitrary analysis
Built dependency graph, found most-imported modules
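
As an illustration of this kind of post-processing, the sketch below counts how many modules import each target. The JSON shape is an assumption (a mapping from module name to a list of imported module names); the real output of `moss --json deps` may be structured differently and would need adapting. The helper name `most_imported` and the `example_*` module names are hypothetical.

```python
import json
from collections import Counter
from typing import List, Tuple


def most_imported(deps_json: str, n: int = 5) -> List[Tuple[str, int]]:
    """Count how many modules import each target and return the n most common.

    Assumes deps_json decodes to {"module": ["imported_module", ...], ...}.
    """
    deps = json.loads(deps_json)
    counts = Counter(
        target
        for targets in deps.values()
        for target in set(targets)  # count each importer at most once per target
    )
    return counts.most_common(n)


# Hypothetical input in the assumed shape.
sample = json.dumps({
    "example_a": ["moss.views", "example_c"],
    "example_b": ["moss.views"],
    "example_c": [],
})

print(most_imported(sample))
```

In a pipeline, the same function could read from standard input with `sys.stdin.read()` in place of the sample string.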

skeleton — Code structure

19 top-level symbols in dwim.py identified correctly
Signatures and docstrings preserved

Gaps

No line counts per function — Can't filter by complexity
No reverse deps — "What imports this module?" not directly available
No symbol sizes — End line numbers would help estimate function length
CFG verbosity — Full graph output overwhelming for large functions
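
Until reverse dependencies are available directly, the "What imports this module?" question can be answered by inverting the forward dependency data. This is a sketch only: it assumes the forward data has already been loaded into a dict mapping each module to the modules it imports, which may not match Moss's actual JSON layout, and the name `reverse_deps` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def reverse_deps(deps: Dict[str, Iterable[str]]) -> Dict[str, List[str]]:
    """Invert a module -> imports mapping into target -> importers."""
    importers = defaultdict(set)
    for module, imports in deps.items():
        for target in imports:
            importers[target].add(module)
    # Sort for stable, readable output.
    return {target: sorted(mods) for target, mods in importers.items()}


# Hypothetical forward dependencies.
forward = {
    "example_a": ["example_c"],
    "example_b": ["example_c", "example_a"],
    "example_c": [],
}

print(reverse_deps(forward))
```

Modules that nothing imports (here `example_b`) do not appear as keys in the result, which is itself a quick way to spot entry points or unused code.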

Usage Patterns

Understanding a file: moss context <file>

Finding implementations: moss query <dir> --inherits <base> or --signature <pattern>

Dependency analysis: moss --json deps <dir> | python3 -c "..."

Symbol inventory: moss --json skeleton <dir> | python3 -c "..."

Test Results

DWIM:
- Typo correction: 7/7 (100%)
- Alias resolution: 8/8 (100%)
- NL routing top-3: 8/8 (100%)

Tools on Moss codebase:
- context: Shows 596 lines, 4 classes, 15 functions, 5 methods for dwim.py
- query --inherits Exception: Found 4 exception classes
- deps analysis: Identified moss.views (7 imports) as most-used internal module
- skeleton: Extracted 19 top-level symbols correctly

Recommendations

High Priority

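Lower the DWIM SUGGEST_THRESHOLD for natural language queries (0.5 to 0.3), or return top-k results regardless of threshold
Add per-function line counts to enable complexity filtering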

Medium Priority

Add reverse dependency lookup
Add symbol end lines for size calculation
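
To show why end lines are enough for size calculation, the sketch below derives function lengths from start and end line numbers using Python's standard `ast` module. It is a stand-in for what the tools could report, not a description of how Moss works; the helper name `function_lengths` is hypothetical.

```python
import ast
from typing import Dict


def function_lengths(source: str) -> Dict[str, int]:
    """Map each function name to its length in lines (end line - start line + 1).

    Functions that share a name (for example, methods on different classes)
    overwrite each other in this simple version.
    """
    tree = ast.parse(source)
    lengths = {}
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # end_lineno is available on AST nodes from Python 3.8 onward.
            lengths[node.name] = node.end_lineno - node.lineno + 1
    return lengths


sample_source = (
    "def short():\n"
    "    return 1\n"
    "\n"
    "def longer(x):\n"
    "    y = x + 1\n"
    "    return y * 2\n"
)

print(function_lengths(sample_source))  # {'short': 2, 'longer': 3}
```

With lengths available, filtering by size becomes a one-line comprehension, for example keeping only functions longer than some chosen limit.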

Low Priority

CFG summary mode
Grouped multi-file output
