Model Analysis
Claude Opus 4.5
anthropic/claude-opus-4-5
Overall score: 76.9 · 72.3% accuracy · 27.9% fabrication
Tasks: 30 · Passed: 29 · Failed: 1
Avg latency: 8571ms · Total cost: $1.5056
Cluster Performance
| Cluster | Tasks | Score |
|---|---|---|
| Behavioral Claims | 6 | 75.7 |
| Edge Case ID | 5 | 57.0 |
| API Knowledge | 5 | 64.1 |
| Complexity Analysis | 4 | 91.7 |
| Bug Detection | 5 | 88.1 |
| Doc Accuracy | 5 | 88.1 |
All Task Results
| Task | Cluster | Score | Accuracy (%) | Fabricated (count) | Latency | |
|---|---|---|---|---|---|---|
| halluc-nested-merge-claims | Behavioral Claims | 0.0 | 0 | 0 | 10462ms | |
| halluc-edge-pagination | Edge Case ID | 20.2 | 0 | 6 | 10656ms | |
| halluc-edge-rate-limiter | Edge Case ID | 36.5 | 17 | 5 | 12469ms | |
| halluc-api-map-set | API Knowledge | 39.1 | 17 | 5 | 7934ms | |
| halluc-edge-tree-traversal | Edge Case ID | 43.3 | 33 | 4 | 9742ms | |
| halluc-api-regex-named-groups | API Knowledge | 49.7 | 33 | 4 | 9363ms | |
| halluc-api-promises | API Knowledge | 58.9 | 50 | 3 | 8043ms | |
| halluc-dedup-sort-claims | Behavioral Claims | 68.1 | 60 | 2 | 7585ms | |
| halluc-doc-middleware-chain | Doc Accuracy | 71.7 | 67 | 2 | 9770ms | |
| halluc-api-node-crypto | API Knowledge | 73.2 | 67 | 2 | 14271ms | |
| halluc-bug-closure-loop | Bug Detection | 73.5 | 67 | 2 | 8584ms | |
| halluc-complexity-sort-chain | Complexity Analysis | 80.9 | 83 | 1 | 8157ms | |
| halluc-bug-async-race | Bug Detection | 81.8 | 83 | 1 | 8878ms | |
| halluc-doc-query-builder | Doc Accuracy | 82.3 | 83 | 1 | 8352ms | |
| halluc-edge-string-truncate | Edge Case ID | 85.5 | 80 | 1 | 7437ms | |
| halluc-bug-off-by-one | Bug Detection | 85.7 | 80 | 1 | 7699ms | |
| halluc-doc-event-emitter | Doc Accuracy | 86.6 | 83 | 1 | 8639ms | |
| halluc-cache-eviction-claims | Behavioral Claims | 86.7 | 83 | 1 | 7558ms | |
| halluc-complexity-graph-bfs | Complexity Analysis | 86.7 | 83 | 1 | 8994ms | |
| halluc-complexity-nested-loops | Complexity Analysis | 99.4 | 100 | 0 | 8545ms | |
| halluc-bug-type-coercion | Bug Detection | 99.6 | 100 | 0 | 9198ms | |
| halluc-api-zod-schema | API Knowledge | 99.7 | 100 | 0 | 7547ms | |
| halluc-complexity-recursive-memo | Complexity Analysis | 99.7 | 100 | 0 | 6520ms | |
| halluc-edge-date-parser | Edge Case ID | 99.7 | 100 | 0 | 6541ms | |
| halluc-parser-output-claims | Behavioral Claims | 99.8 | 100 | 0 | 8798ms | |
| halluc-retry-logic-claims | Behavioral Claims | 99.8 | 100 | 0 | 8186ms | |
| halluc-bug-null-coalesce | Bug Detection | 99.9 | 100 | 0 | 6972ms | |
| halluc-doc-http-handler | Doc Accuracy | 100.0 | 100 | 0 | 6482ms | |
| halluc-doc-validation-pipe | Doc Accuracy | 100.0 | 100 | 0 | 6762ms | |
| halluc-state-machine-claims | Behavioral Claims | 100.0 | 100 | 0 | 6999ms |
30 tasks · Sorted by score (lowest first) · Fabricated = high-confidence false claims
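
The page does not say how the Cluster Performance figures or the overall score are derived. As a rough cross-check, the sketch below assumes each cluster score is the unweighted mean of its task scores, and that the overall score is the unweighted mean over all 30 tasks. Both are assumptions, not documented behavior; the variable names are illustrative, and the scores are copied from the Score column of the table above.

```python
from statistics import mean

# Task scores copied from the Score column, grouped by the Cluster column.
scores_by_cluster = {
    "Behavioral Claims": [0.0, 68.1, 86.7, 99.8, 99.8, 100.0],
    "Edge Case ID": [20.2, 36.5, 43.3, 85.5, 99.7],
    "API Knowledge": [39.1, 49.7, 58.9, 73.2, 99.7],
    "Complexity Analysis": [80.9, 86.7, 99.4, 99.7],
    "Bug Detection": [73.5, 81.8, 85.7, 99.6, 99.9],
    "Doc Accuracy": [71.7, 82.3, 86.6, 100.0, 100.0],
}

# Assumption: a cluster score is the unweighted mean of its task scores,
# rounded to one decimal place as displayed on the page.
cluster_means = {
    name: round(mean(values), 1) for name, values in scores_by_cluster.items()
}

# Assumption: the overall score is the unweighted mean over all 30 tasks.
all_scores = [s for values in scores_by_cluster.values() for s in values]
overall = round(mean(all_scores), 1)

for name, value in cluster_means.items():
    print(f"{name}: {value}")
print(f"Overall: {overall}")
```

Under these assumptions the recomputed values agree with the reported cluster scores and the 76.9 overall score, which suggests simple averaging, though the scoring formula for individual tasks remains unspecified.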