BridgeBench
Hallucination
Model Analysis

Claude Opus 4.5

anthropic/claude-opus-4-5

Overall score: 76.9
Accuracy: 72.3%
Fabrication: 27.9%

Tasks: 30
Passed: 29
Failed: 1
Avg latency: 8571 ms
Total cost: $1.5056

Cluster Performance

Cluster              Tasks  Score
Behavioral Claims        6   75.7
Edge Case ID             5   57.0
API Knowledge            5   64.1
Complexity Analysis      4   91.7
Bug Detection            5   88.1
Doc Accuracy             5   88.1

All Task Results

Task                              Cluster              Score
halluc-nested-merge-claims        Behavioral Claims      0.0
halluc-edge-pagination            Edge Case ID          20.2
halluc-edge-rate-limiter          Edge Case ID          36.5
halluc-api-map-set                API Knowledge         39.1
halluc-edge-tree-traversal        Edge Case ID          43.3
halluc-api-regex-named-groups     API Knowledge         49.7
halluc-api-promises               API Knowledge         58.9
halluc-dedup-sort-claims          Behavioral Claims     68.1
halluc-doc-middleware-chain       Doc Accuracy          71.7
halluc-api-node-crypto            API Knowledge         73.2
halluc-bug-closure-loop           Bug Detection         73.5
halluc-complexity-sort-chain      Complexity Analysis   80.9
halluc-bug-async-race             Bug Detection         81.8
halluc-doc-query-builder          Doc Accuracy          82.3
halluc-edge-string-truncate       Edge Case ID          85.5
halluc-bug-off-by-one             Bug Detection         85.7
halluc-doc-event-emitter          Doc Accuracy          86.6
halluc-cache-eviction-claims      Behavioral Claims     86.7
halluc-complexity-graph-bfs       Complexity Analysis   86.7
halluc-complexity-nested-loops    Complexity Analysis   99.4
halluc-bug-type-coercion          Bug Detection         99.6
halluc-api-zod-schema             API Knowledge         99.7
halluc-complexity-recursive-memo  Complexity Analysis   99.7
halluc-edge-date-parser           Edge Case ID          99.7
halluc-parser-output-claims       Behavioral Claims     99.8
halluc-retry-logic-claims         Behavioral Claims     99.8
halluc-bug-null-coalesce          Bug Detection         99.9
halluc-doc-http-handler           Doc Accuracy         100.0
halluc-doc-validation-pipe        Doc Accuracy         100.0
halluc-state-machine-claims       Behavioral Claims    100.0

30 tasks · Sorted by score (lowest first) · Fabricated = high-confidence false claims
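
The page does not state how the cluster and overall scores are aggregated. The published figures are, however, consistent with an unweighted mean of the per-task scores, rounded to one decimal place. The sketch below checks that reading against the task results listed above. The names `TASK_SCORES` and `mean_score` are hypothetical, and both the unweighted mean and the half-up rounding are assumptions, not documented BridgeBench behavior.

```python
from decimal import Decimal, ROUND_HALF_UP

# Per-task scores copied from the results table, grouped by cluster.
# Kept as strings so Decimal parses them exactly (no binary float error).
TASK_SCORES = {
    "Behavioral Claims": ["0.0", "68.1", "86.7", "99.8", "99.8", "100.0"],
    "Edge Case ID": ["20.2", "36.5", "43.3", "85.5", "99.7"],
    "API Knowledge": ["39.1", "49.7", "58.9", "73.2", "99.7"],
    "Complexity Analysis": ["80.9", "86.7", "99.4", "99.7"],
    "Bug Detection": ["73.5", "81.8", "85.7", "99.6", "99.9"],
    "Doc Accuracy": ["71.7", "82.3", "86.6", "100.0", "100.0"],
}


def mean_score(scores):
    """Unweighted mean, rounded half-up to one decimal place (assumed rule)."""
    values = [Decimal(s) for s in scores]
    mean = sum(values) / len(values)
    return mean.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


# Mean per cluster, then the mean over all 30 tasks.
cluster_means = {name: mean_score(s) for name, s in TASK_SCORES.items()}
all_scores = [s for scores in TASK_SCORES.values() for s in scores]
overall = mean_score(all_scores)

if __name__ == "__main__":
    for name, value in cluster_means.items():
        print(f"{name}: {value}")
    print(f"Overall: {overall}")
```

Under these assumptions the computed values match the six cluster scores and the 76.9 overall score reported on the page. Decimal arithmetic is used because the Complexity Analysis mean is exactly 91.675, where binary floating point could round either way.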