BridgeBench
Hallucination
Model Analysis

Claude Opus 4.7

openrouter/anthropic/claude-opus-4.7

Overall score: 77.1
Accuracy: 74.1%
Fabrication: 27.5%

Tasks: 25
Passed: 24
Failed: 1
Avg latency: 6967 ms
Total cost: $0.4198

Cluster Performance

Cluster              Tasks  Score
Behavioral Claims    6      75.1
Edge Case ID         5      48.6
Complexity Analysis  4      92.3
Bug Detection        5      86.0
Doc Accuracy         5      87.2

All Task Results

Task                              Cluster              Score
halluc-nested-merge-claims        Behavioral Claims    0.0
halluc-edge-pagination            Edge Case ID         20.3
halluc-edge-rate-limiter          Edge Case ID         33.6
halluc-edge-tree-traversal        Edge Case ID         43.3
halluc-edge-date-parser           Edge Case ID         60.6
halluc-doc-middleware-chain       Doc Accuracy         69.2
halluc-dedup-sort-claims          Behavioral Claims    70.3
halluc-bug-closure-loop           Bug Detection        74.2
halluc-cache-eviction-claims      Behavioral Claims    81.1
halluc-complexity-sort-chain      Complexity Analysis  81.4
halluc-bug-async-race             Bug Detection        82.0
halluc-doc-event-emitter          Doc Accuracy         82.3
halluc-doc-query-builder          Doc Accuracy         84.9
halluc-edge-string-truncate       Edge Case ID         85.0
halluc-bug-off-by-one             Bug Detection        86.0
halluc-bug-type-coercion          Bug Detection        88.0
halluc-complexity-graph-bfs       Complexity Analysis  88.3
halluc-parser-output-claims       Behavioral Claims    99.5
halluc-complexity-nested-loops    Complexity Analysis  99.6
halluc-retry-logic-claims         Behavioral Claims    99.6
halluc-complexity-recursive-memo  Complexity Analysis  99.7
halluc-doc-validation-pipe        Doc Accuracy         99.7
halluc-bug-null-coalesce          Bug Detection        99.8
halluc-doc-http-handler           Doc Accuracy         99.8
halluc-state-machine-claims       Behavioral Claims    99.9

25 tasks · Sorted by score (lowest first) · Fabricated = high-confidence false claims
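
The page does not say how the cluster scores or the overall score are aggregated. The sketch below is a consistency check, not BridgeBench's own method: it assumes each cluster score is the unweighted mean of its task scores and that the overall score is the unweighted mean of all 25 task scores. Under that assumption, every reported figure is reproduced to within rounding. The names `RESULTS`, `REPORTED`, and `cluster_means` are illustrative only. The accuracy and fabrication percentages cannot be derived from this table and are left out.

```python
from collections import defaultdict

# (task, cluster, score) rows copied from the "All Task Results" table.
RESULTS = [
    ("halluc-nested-merge-claims", "Behavioral Claims", 0.0),
    ("halluc-edge-pagination", "Edge Case ID", 20.3),
    ("halluc-edge-rate-limiter", "Edge Case ID", 33.6),
    ("halluc-edge-tree-traversal", "Edge Case ID", 43.3),
    ("halluc-edge-date-parser", "Edge Case ID", 60.6),
    ("halluc-doc-middleware-chain", "Doc Accuracy", 69.2),
    ("halluc-dedup-sort-claims", "Behavioral Claims", 70.3),
    ("halluc-bug-closure-loop", "Bug Detection", 74.2),
    ("halluc-cache-eviction-claims", "Behavioral Claims", 81.1),
    ("halluc-complexity-sort-chain", "Complexity Analysis", 81.4),
    ("halluc-bug-async-race", "Bug Detection", 82.0),
    ("halluc-doc-event-emitter", "Doc Accuracy", 82.3),
    ("halluc-doc-query-builder", "Doc Accuracy", 84.9),
    ("halluc-edge-string-truncate", "Edge Case ID", 85.0),
    ("halluc-bug-off-by-one", "Bug Detection", 86.0),
    ("halluc-bug-type-coercion", "Bug Detection", 88.0),
    ("halluc-complexity-graph-bfs", "Complexity Analysis", 88.3),
    ("halluc-parser-output-claims", "Behavioral Claims", 99.5),
    ("halluc-complexity-nested-loops", "Complexity Analysis", 99.6),
    ("halluc-retry-logic-claims", "Behavioral Claims", 99.6),
    ("halluc-complexity-recursive-memo", "Complexity Analysis", 99.7),
    ("halluc-doc-validation-pipe", "Doc Accuracy", 99.7),
    ("halluc-bug-null-coalesce", "Bug Detection", 99.8),
    ("halluc-doc-http-handler", "Doc Accuracy", 99.8),
    ("halluc-state-machine-claims", "Behavioral Claims", 99.9),
]

# Cluster scores as reported under "Cluster Performance".
REPORTED = {
    "Behavioral Claims": 75.1,
    "Edge Case ID": 48.6,
    "Complexity Analysis": 92.3,
    "Bug Detection": 86.0,
    "Doc Accuracy": 87.2,
}


def cluster_means(rows):
    """Return the unweighted mean score per cluster (assumed aggregation)."""
    by_cluster = defaultdict(list)
    for _task, cluster, score in rows:
        by_cluster[cluster].append(score)
    return {name: sum(scores) / len(scores) for name, scores in by_cluster.items()}


means = cluster_means(RESULTS)
overall = sum(score for _task, _cluster, score in RESULTS) / len(RESULTS)

for name, reported in REPORTED.items():
    # Print the recomputed mean next to the figure shown on the page.
    print(f"{name:<20} computed={means[name]:.2f} reported={reported}")
print(f"{'Overall':<20} computed={overall:.2f} reported=77.1")
```

A single task can dominate a cluster under this kind of averaging: the 0.0 on halluc-nested-merge-claims pulls Behavioral Claims down to 75.1 even though four of its six tasks score above 80.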