BridgeBench · Debugging · Model Analysis

Claude Opus 4.7 (openrouter/anthropic/claude-opus-4.7)

Overall score: 86.2
Repro: 100.0% · Regress: 100.0% · Diagnose: 21.5%

Tasks: 10
Passed: 10
Failed: 0
Avg latency: 6004 ms
Total cost: $0.1636

Cluster Performance

Test Failure (2 tasks): 89.4
Runtime Exception (2 tasks): 85.6
Incorrect Output (1 task): 87.3
Async Timing (1 task): 87.3
State Mutation (2 tasks): 84.4
Regression After Refactor (2 tasks): 84.6

All Task Results

Task (ID) · Cluster · Score
Loop Closure Capture Bug (debug-closure-loop-v2) · Test Failure · 84.0
Deep Clone Shared Reference Bug (debug-deep-clone-v2) · State Mutation · 84.0
Cycle Detection False Positive Regression (debug-graph-cycle-v2) · Regression After Refactor · 84.1
Default Merge Mutation Bug (debug-object-mutation-v2) · State Mutation · 84.8
LRU Access Order Regression (debug-lru-eviction-v2) · Regression After Refactor · 85.0
Flatten Object Null Base Case Bug (debug-recursion-base-case-v2) · Runtime Exception · 85.1
JSON Parser Escape and Whitespace Bugs (debug-json-parser-v2) · Runtime Exception · 86.0
Broken Batch Promise Chain (debug-promise-chain-v2) · Async Timing · 87.3
Sliding Window Rate Limiter (debug-rate-limiter-v2) · Incorrect Output · 87.3
Safe Number Parsing Coercion Bug (debug-type-coercion-v2) · Test Failure · 94.8

10 tasks · visible repro, hidden bug, and regression scoring
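The cluster scores and the overall score shown above match plain arithmetic means of the per-task scores. The page does not state how it aggregates them, so the sketch below rests on two assumptions: means are unweighted, and results are rounded half-up to one decimal. It recomputes the displayed figures from the displayed task scores. The names `TASKS`, `mean_1dp`, `cluster_means`, and `overall_score` are hypothetical. The sketch uses `Decimal` because binary floating point can round the 85.55 midpoint down to 85.5 instead of 85.6.

```python
from collections import defaultdict
from decimal import Decimal, ROUND_HALF_UP

# Per-task scores as displayed in the results table (task id -> (cluster, score)).
# Scores are kept as strings so Decimal parses them exactly.
TASKS = {
    "debug-closure-loop-v2": ("Test Failure", "84.0"),
    "debug-deep-clone-v2": ("State Mutation", "84.0"),
    "debug-graph-cycle-v2": ("Regression After Refactor", "84.1"),
    "debug-object-mutation-v2": ("State Mutation", "84.8"),
    "debug-lru-eviction-v2": ("Regression After Refactor", "85.0"),
    "debug-recursion-base-case-v2": ("Runtime Exception", "85.1"),
    "debug-json-parser-v2": ("Runtime Exception", "86.0"),
    "debug-promise-chain-v2": ("Async Timing", "87.3"),
    "debug-rate-limiter-v2": ("Incorrect Output", "87.3"),
    "debug-type-coercion-v2": ("Test Failure", "94.8"),
}

ONE_DECIMAL = Decimal("0.1")


def mean_1dp(scores):
    """Arithmetic mean, rounded half-up to one decimal place (assumed rule)."""
    total = sum(scores, Decimal(0))
    return (total / len(scores)).quantize(ONE_DECIMAL, rounding=ROUND_HALF_UP)


def cluster_means(tasks):
    """Group task scores by cluster and average each group."""
    groups = defaultdict(list)
    for cluster, score in tasks.values():
        groups[cluster].append(Decimal(score))
    return {cluster: mean_1dp(scores) for cluster, scores in groups.items()}


def overall_score(tasks):
    """Unweighted mean over all tasks (assumed; not a mean over clusters)."""
    return mean_1dp([Decimal(score) for _, score in tasks.values()])


if __name__ == "__main__":
    for cluster, mean in cluster_means(TASKS).items():
        print(f"{cluster}: {mean}")
    print(f"overall: {overall_score(TASKS)}")
```

Under these assumptions, all six cluster values and the overall 86.2 are reproduced. Averaging the six cluster scores instead gives 86.4 (518.6 / 6). The displayed overall score therefore fits a per-task mean rather than a per-cluster one. Because the task scores are themselves rounded for display, this is a consistency check, not a statement of BridgeBench's actual scoring code.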