Claude Opus 4.6
openrouter/anthropic/claude-opus-4-6
Overall score: 81.6
Tasks: 30
Passed: 14
Failed: 16
Avg latency: 29899ms
Total cost: $1.9058
AI Commentary
by openai/gpt-5.4-mini
Claude Opus 4.6 is strong on security-oriented validation and policy logic, with excellent scores in sanitization (97.6), access control (97.5), and traffic protection (95.6). Its main weaknesses are stateful authentication edge cases and low-level parsing and formatting tasks: Auth & Session drops to 69.2 because of JWT, OAuth state, and password-strength scoring errors, and Crypto Utils is the weakest domain at 54.5, with null outputs and format failures.
Domain Performance
Sanitization: Performance is near-ceiling at 97.6 across file upload validation, hostname allowlisting, HTML entity encoding, safe redirects, and URL sanitization. The model appears reliable when the task is straightforward normalization or validation with clear rules.
Auth & Session: This is a mixed domain at 69.2. Cookie policy, CSRF token management, refresh token rotation, and session fixation detection are strong, but JWT validation and OAuth state handling fail badly. The JWT result suggests brittle parsing or over-rejection of malformed inputs, while the OAuth state task returned null with a TypeError, indicating missing output construction or unchecked access to an undefined value.
Access Control: Very strong at 97.5, with correct handling of access control engine logic, API key scopes, permissions, and tenant isolation. No meaningful weakness appears in authorization reasoning or multi-tenant boundary enforcement.
Detection & Analysis: At 76.3, the model is competent but inconsistent in security detection tasks. It performs well on CSP parsing, dependency risk classification, SQL injection detection, and vulnerability scanning, but it misses or over-adds anomaly labels in auth-log analysis and fails outright on SSRF detection with parse errors, suggesting weaker robustness on structured edge-case inputs.
Traffic Protection: The single rate-limit task scored 95.6, indicating solid handling of throttling logic and request control. One task is not enough breadth to infer broader weaknesses.
Crypto Utils: This is the weakest domain at 54.5, driven by a severe failure in sec-crypto-utils, where outputs were null or empty and TypeErrors indicate length access on an undefined value. The model likely struggles with exact byte and string transformations and with output formatting in cryptographic helper routines.
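The Auth & Session commentary above attributes the JWT failure to brittle parsing rather than to claim-level validation. As an illustration of that distinction, here is a minimal Python sketch of a structural parse that always returns a populated result and records why an input was rejected. The task specification and the model's submitted code are not included in this report, so the function name `parse_jwt_structure`, the result fields, and the choice of Python are assumptions made for illustration only. The sketch does not verify signatures.

```python
import base64
import binascii
import json


def _b64url_decode(segment: str) -> bytes:
    # JWT segments are unpadded base64url; restore padding before decoding.
    padded = segment + "=" * (-len(segment) % 4)
    return base64.urlsafe_b64decode(padded)


def parse_jwt_structure(token):
    """Hypothetical helper: split a compact JWT and decode header and payload.

    Always returns a dict with 'valid', 'header', 'payload', and 'errors'.
    It never raises on bad input and never checks the signature.
    """
    result = {"valid": False, "header": None, "payload": None, "errors": []}

    if not isinstance(token, str) or not token:
        result["errors"].append("token must be a non-empty string")
        return result

    parts = token.split(".")
    if len(parts) != 3:
        result["errors"].append("expected 3 dot-separated segments")
        return result

    # Decode header and payload independently so one bad segment does not
    # hide information about the other.
    for name, segment in (("header", parts[0]), ("payload", parts[1])):
        try:
            decoded = json.loads(_b64url_decode(segment))
        except (binascii.Error, ValueError):
            result["errors"].append(f"{name} is not base64url-encoded JSON")
            continue
        if not isinstance(decoded, dict):
            result["errors"].append(f"{name} must be a JSON object")
            continue
        result[name] = decoded

    # Claim-level checks run only after the structure has been parsed.
    header = result["header"]
    if header is not None:
        alg = header.get("alg")
        if not isinstance(alg, str):
            result["errors"].append("alg must be a string")
        elif alg.lower() == "none":
            result["errors"].append("alg 'none' is not accepted")

    result["valid"] = not result["errors"]
    return result
```

In this sketch a token with `alg` set to `none` is still rejected, but its decoded header and payload remain available in the result, which is the kind of nuance a blanket "invalid token format" response loses.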
Notable Tasks
sec-jwt-validator: The model rejected all examples as invalid token format and returned null header and payload fields, which points to brittle JWT parsing rather than nuanced validation of alg and claims.
sec-oauth-state-validator: Returning null with a TypeError on length access suggests the implementation failed before producing any structured result, likely because of missing initialization or unsafe handling of absent state values.
sec-ssrf-detector: Unexpected end-of-input errors indicate the detector could not parse the test cases at all, so the failure is structural rather than a simple classification mistake.
This task is notable because it sits inside a near-perfect access-control domain score, implying the model handles policy evaluation and authorization boundaries reliably.
sec-crypto-utils: Null outputs, TypeErrors, and an empty string where a transformed value was expected point to broken utility logic and poor resilience to edge-case inputs.
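The null-plus-TypeError pattern described for the OAuth state task can be avoided by building the result object first and checking for absent values before touching their length. The following Python sketch illustrates that ordering under stated assumptions: the report does not show the task's interface, so `validate_oauth_state`, its parameters, and the result fields are hypothetical names.

```python
import hmac


def validate_oauth_state(expected_state, received_state):
    """Hypothetical helper: compare the stored state with the callback state.

    The result dict is created before any check runs, so callers always get
    a structured answer instead of None.
    """
    result = {"valid": False, "reason": None}

    # Guard against absent or non-string values before calling len().
    for label, value in (("expected", expected_state), ("received", received_state)):
        if not isinstance(value, str) or len(value) == 0:
            result["reason"] = f"{label} state is missing or empty"
            return result

    # Constant-time comparison on bytes avoids timing differences and the
    # ASCII-only restriction that compare_digest places on str arguments.
    if not hmac.compare_digest(expected_state.encode(), received_state.encode()):
        result["reason"] = "state mismatch"
        return result

    result["valid"] = True
    return result
```

With this ordering, a missing state yields `{"valid": False, "reason": "expected state is missing or empty"}` rather than an exception.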
All Task Results
| Task | Domain | Score | Correct | Hidden | Latency |
|---|---|---|---|---|---|
| sec-crypto-utils | Crypto Utils | 9.5 | 0 | 0 | 26260ms |
| sec-oauth-state-validator | Auth & Session | 9.5 | 0 | 0 | 36204ms |
| sec-ssrf-detector | Detection & Analysis | 9.5 | 0 | 0 | 57431ms |
| sec-jwt-validator | Auth & Session | 27.7 | 33 | 8 | 22619ms |
| sec-password-strength | Auth & Session | 49.5 | 67 | 23 | 17292ms |
| sec-auth-log-anomaly-detector | Detection & Analysis | 62.2 | 33 | 82 | 38712ms |
| sec-insecure-config-scanner | Detection & Analysis | 66.3 | 50 | 75 | 34995ms |
| sec-secret-detector | Detection & Analysis | 70.0 | 67 | 67 | 22770ms |
| sec-input-sanitizer | Sanitization | 87.3 | 100 | 73 | 19951ms |
| sec-csp-nonce-validator | Detection & Analysis | 87.8 | 100 | 75 | 16688ms |
| sec-abac-rule-engine | Access Control | 94.2 | 100 | 89 | 29594ms |
| sec-rate-limit-engine | Traffic Protection | 95.6 | 100 | 92 | 49477ms |
| sec-sql-injection-detector | Detection & Analysis | 95.6 | 100 | 92 | 27190ms |
| sec-csp-parser | Detection & Analysis | 96.1 | 100 | 92 | 21064ms |
| sec-permission-checker | Access Control | 96.8 | 100 | 94 | 33367ms |
| sec-api-key-scope-checker | Access Control | 97.3 | 100 | 96 | 41315ms |
| sec-access-control-engine | Access Control | 99.5 | 100 | 100 | 42917ms |
| sec-cookie-policy-validator | Auth & Session | 99.5 | 100 | 100 | 27488ms |
| sec-csrf-token-manager | Auth & Session | 99.5 | 100 | 100 | 21192ms |
| sec-dependency-risk-classifier | Detection & Analysis | 99.5 | 100 | 100 | 18116ms |
| sec-encryption-pipeline | Crypto Utils | 99.5 | 100 | 100 | 33790ms |
| sec-hostname-allowlist-validator | Sanitization | 99.5 | 100 | 100 | 22119ms |
| sec-refresh-token-rotation | Auth & Session | 99.5 | 100 | 100 | 37376ms |
| sec-safe-redirect-builder | Sanitization | 99.5 | 100 | 100 | 25758ms |
| sec-session-fixation-detector | Auth & Session | 99.5 | 100 | 100 | 34059ms |
| sec-tenant-isolation-checker | Access Control | 99.5 | 100 | 100 | 31109ms |
| sec-url-sanitizer | Sanitization | 99.5 | 100 | 100 | 35230ms |
| sec-vulnerability-scanner | Detection & Analysis | 99.5 | 100 | 100 | 39600ms |
| sec-file-upload-validator | Sanitization | 100.0 | 100 | 100 | 14163ms |
| sec-html-entity-encoder | Sanitization | 100.0 | 100 | 100 | 19113ms |
30 tasks · Sorted by score (lowest first) · Hidden = adversarial edge-case pass rate
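The per-domain figures quoted in the commentary appear consistent with an unweighted mean of the task scores in the table. The report does not state its aggregation rule, so treat that as an assumption. The Python sketch below recomputes the means from the table rows; `RESULTS` and `domain_means` are illustrative names.

```python
from collections import defaultdict

# (task, domain, score) rows copied from the results table.
RESULTS = [
    ("sec-crypto-utils", "Crypto Utils", 9.5),
    ("sec-oauth-state-validator", "Auth & Session", 9.5),
    ("sec-ssrf-detector", "Detection & Analysis", 9.5),
    ("sec-jwt-validator", "Auth & Session", 27.7),
    ("sec-password-strength", "Auth & Session", 49.5),
    ("sec-auth-log-anomaly-detector", "Detection & Analysis", 62.2),
    ("sec-insecure-config-scanner", "Detection & Analysis", 66.3),
    ("sec-secret-detector", "Detection & Analysis", 70.0),
    ("sec-input-sanitizer", "Sanitization", 87.3),
    ("sec-csp-nonce-validator", "Detection & Analysis", 87.8),
    ("sec-abac-rule-engine", "Access Control", 94.2),
    ("sec-rate-limit-engine", "Traffic Protection", 95.6),
    ("sec-sql-injection-detector", "Detection & Analysis", 95.6),
    ("sec-csp-parser", "Detection & Analysis", 96.1),
    ("sec-permission-checker", "Access Control", 96.8),
    ("sec-api-key-scope-checker", "Access Control", 97.3),
    ("sec-access-control-engine", "Access Control", 99.5),
    ("sec-cookie-policy-validator", "Auth & Session", 99.5),
    ("sec-csrf-token-manager", "Auth & Session", 99.5),
    ("sec-dependency-risk-classifier", "Detection & Analysis", 99.5),
    ("sec-encryption-pipeline", "Crypto Utils", 99.5),
    ("sec-hostname-allowlist-validator", "Sanitization", 99.5),
    ("sec-refresh-token-rotation", "Auth & Session", 99.5),
    ("sec-safe-redirect-builder", "Sanitization", 99.5),
    ("sec-session-fixation-detector", "Auth & Session", 99.5),
    ("sec-tenant-isolation-checker", "Access Control", 99.5),
    ("sec-url-sanitizer", "Sanitization", 99.5),
    ("sec-vulnerability-scanner", "Detection & Analysis", 99.5),
    ("sec-file-upload-validator", "Sanitization", 100.0),
    ("sec-html-entity-encoder", "Sanitization", 100.0),
]


def domain_means(rows):
    """Group scores by domain and return the unweighted mean, to one decimal."""
    buckets = defaultdict(list)
    for _task, domain, score in rows:
        buckets[domain].append(score)
    return {domain: round(sum(s) / len(s), 1) for domain, s in buckets.items()}


def overall_mean(rows):
    """Unweighted mean over all tasks, to one decimal."""
    return round(sum(score for _t, _d, score in rows) / len(rows), 1)


if __name__ == "__main__":
    for domain, mean in sorted(domain_means(RESULTS).items(), key=lambda kv: kv[1]):
        print(f"{domain}: {mean}")
    print(f"Overall: {overall_mean(RESULTS)}")
```

Under that assumption the recomputed values match the reported ones: 54.5 for Crypto Utils, 69.2 for Auth & Session, 76.3 for Detection & Analysis, 95.6 for Traffic Protection, 97.5 for Access Control, 97.6 for Sanitization, and 81.6 overall. This also shows that the Crypto Utils figure is the average of one near-total failure (9.5) and one near-perfect task (99.5), not uniformly weak performance.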