Model Analysis

Claude Opus 4.6

openrouter/anthropic/claude-opus-4-6

Overall score: 81.6 (81.7% visible, 78.6% hidden)


Avg latency: 29899ms

Total cost: $1.9058

AI Commentary

by openai/gpt-5.4-mini

Claude Opus 4.6 is strong on security-oriented validation and policy logic, with near-ceiling scores in Sanitization (97.6), Access Control (97.5), and Traffic Protection (95.6). Its weaknesses are concentrated in a few tasks rather than spread evenly across domains. Auth & Session drops to 69.2 because of failures in JWT validation (27.7), OAuth state handling (9.5), and password strength scoring (49.5). Crypto Utils averages 54.5 because sec-crypto-utils scored 9.5 with null outputs and TypeErrors, while sec-encryption-pipeline scored 99.5.

Domain Performance

Sanitization · 6 tasks

97.6

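Performance is near-ceiling at 97.6. Five of the six tasks (file upload, hostname allowlisting, HTML entity encoding, safe redirects, and URL sanitization) score 99.5 or higher. The exception is sec-input-sanitizer at 87.3, which passes every visible case but only 73% of hidden edge cases. The model appears reliable when the task is normalization or validation against clear rules.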

Auth & Session · 7 tasks

69.2

This is a mixed domain at 69.2. Cookie policy, CSRF token management, refresh token rotation, and session fixation detection all score 99.5, but JWT validation (27.7) and OAuth state handling (9.5) fail badly, and password strength scoring reaches only 49.5. The JWT result suggests brittle parsing that over-rejects inputs as malformed. The OAuth state task returned null with a TypeError, which indicates missing output construction or unchecked access to an undefined value.

Access Control · 5 tasks

97.5

Very strong at 97.5, with correct handling of access control engine logic, API key scopes, permissions, and tenant isolation. The lowest score in the domain is sec-abac-rule-engine at 94.2 (89% hidden pass rate), so no meaningful weakness appears in authorization reasoning or multi-tenant boundary enforcement.

Detection & Analysis · 9 tasks

76.3

At 76.3, the model is competent but inconsistent in security detection tasks. It performs well on CSP parsing, dependency risk classification, SQL injection detection, and vulnerability scanning (all 95.6 or higher). It misses or over-adds anomaly labels in auth-log analysis (62.2) and fails outright on SSRF detection (9.5) with parse errors, which suggests weaker robustness on structured edge-case inputs.
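One way to probe the edge-case robustness question is to compare the Correct and Hidden columns of the results table for this domain. The sketch below is illustrative only: the helper name `hidden_gap` is hypothetical, and it assumes that Correct is the visible pass rate and Hidden is the adversarial pass rate, both in percent.

```python
# (task, visible pass %, hidden pass %) for the nine Detection & Analysis tasks,
# copied from the results table. Column meanings are an assumption (see lead-in).
DETECTION_TASKS = [
    ("sec-ssrf-detector", 0, 0),
    ("sec-auth-log-anomaly-detector", 33, 82),
    ("sec-insecure-config-scanner", 50, 75),
    ("sec-secret-detector", 67, 67),
    ("sec-csp-nonce-validator", 100, 75),
    ("sec-sql-injection-detector", 100, 92),
    ("sec-csp-parser", 100, 92),
    ("sec-dependency-risk-classifier", 100, 100),
    ("sec-vulnerability-scanner", 100, 100),
]


def hidden_gap(rows):
    """Return (task, hidden - visible) pairs, most negative gap first."""
    gaps = [(task, hidden - visible) for task, visible, hidden in rows]
    return sorted(gaps, key=lambda pair: pair[1])


for task, gap in hidden_gap(DETECTION_TASKS):
    print(f"{task}: {gap:+d}")
```

A negative gap means the task did worse on hidden cases than on visible ones. Under these assumptions the largest drop is sec-csp-nonce-validator (100 visible, 75 hidden), while the two weakest visible results, auth-log anomaly detection and insecure-config scanning, actually do better on hidden cases.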

Traffic Protection · 1 task

95.6

The single rate-limit task scored 95.6, indicating solid handling of throttling logic and request control. There is not enough breadth here to infer broader weaknesses.

Crypto Utils · 2 tasks

54.5

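This is the weakest domain at 54.5, but that figure is the mean of two very different results: sec-crypto-utils at 9.5 and sec-encryption-pipeline at 99.5. The failure in sec-crypto-utils was severe, with null or empty outputs and TypeErrors that indicate length access on an undefined value. Because the encryption pipeline task passed, the weakness looks specific to exact byte/string transformations and output formatting in helper routines rather than to cryptographic tasks in general.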

Notable Tasks

sec-jwt-validator · 27.7 · Auth & Session

The model rejected all examples as invalid token format and returned null header/payload fields, which points to brittle JWT parsing rather than nuanced validation of alg/claims.
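The report does not show the generated code, so the actual cause of the blanket "invalid token format" rejection is unknown. One common source of that symptom, offered here only as an illustration, is decoding base64url segments without restoring the padding that compact JWTs strip. The sketch below keeps structural parsing separate from algorithm and claim validation; the function name `parse_jwt_segments` and the result shape are hypothetical, not the task's interface.

```python
import base64
import binascii
import json


def _b64url_decode(segment: str) -> bytes:
    # Compact JWT segments omit "=" padding; restore it before decoding.
    padded = segment + "=" * (-len(segment) % 4)
    return base64.urlsafe_b64decode(padded)


def parse_jwt_segments(token):
    """Split a compact JWT into header and payload. Does NOT verify the signature."""
    result = {"header": None, "payload": None, "error": None}
    # A compact JWS has exactly three dot-separated segments.
    if not isinstance(token, str) or token.count(".") != 2:
        result["error"] = "invalid token format"
        return result
    header_b64, payload_b64, _signature = token.split(".")
    try:
        header = json.loads(_b64url_decode(header_b64))
        payload = json.loads(_b64url_decode(payload_b64))
    except (binascii.Error, ValueError):
        # Covers bad base64, non-UTF-8 bytes, and invalid JSON.
        result["error"] = "invalid token format"
        return result
    result["header"] = header
    result["payload"] = payload
    return result
```

With parsing isolated like this, a validator can report a format error only for tokens that are truly malformed and leave `alg` and claim checks to a later step.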

sec-oauth-state-validator · 9.5 · Auth & Session

Returning null with a TypeError on length access suggests the implementation failed before producing any structured result, likely due to missing initialization or unsafe handling of absent state values.
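A TypeError on length access usually means a length was read from a missing value. The following sketch shows one defensive pattern: check for absent state before any length or comparison logic, and always return a structured result. The function name, the result fields, and the `min_length` default are assumptions for illustration; the task's real interface is not shown in this report.

```python
import hmac


def validate_oauth_state(expected, received, min_length=16):
    """Return a structured verdict; never raise on missing input."""
    result = {"valid": False, "reason": None}
    # Guard first: len(None) would raise TypeError.
    if not isinstance(expected, str) or not isinstance(received, str):
        result["reason"] = "missing_state"
        return result
    # min_length is a hypothetical policy value, not taken from the task.
    if len(received) < min_length:
        result["reason"] = "state_too_short"
        return result
    # Constant-time comparison avoids leaking how much of the value matched.
    if not hmac.compare_digest(expected.encode(), received.encode()):
        result["reason"] = "state_mismatch"
        return result
    result["valid"] = True
    return result
```

Because every path returns the same dictionary shape, a caller never receives a bare null, which is the failure the commentary describes.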

sec-ssrf-detector · 9.5 · Detection & Analysis

Unexpected end-of-input errors indicate the detector could not parse the test cases at all, so the failure is structural rather than a simple classification mistake.

sec-access-control-engine · 99.5 · Access Control

This task is notable because it sits inside a near-perfect access-control domain score, implying the model handles policy evaluation and authorization boundaries reliably.

sec-crypto-utils · 9.5 · Crypto Utils

Null outputs, TypeErrors, and an empty string where a transformed value was expected point to broken utility logic and poor resilience to edge-case inputs.

All Task Results

Task	Domain	Score	Correct (%)	Hidden (%)	Latency
sec-crypto-utils	Crypto Utils	9.5	0	0	26260ms
sec-oauth-state-validator	Auth & Session	9.5	0	0	36204ms
sec-ssrf-detector	Detection & Analysis	9.5	0	0	57431ms
sec-jwt-validator	Auth & Session	27.7	33	8	22619ms
sec-password-strength	Auth & Session	49.5	67	23	17292ms
sec-auth-log-anomaly-detector	Detection & Analysis	62.2	33	82	38712ms
sec-insecure-config-scanner	Detection & Analysis	66.3	50	75	34995ms
sec-secret-detector	Detection & Analysis	70.0	67	67	22770ms
sec-input-sanitizer	Sanitization	87.3	100	73	19951ms
sec-csp-nonce-validator	Detection & Analysis	87.8	100	75	16688ms
sec-abac-rule-engine	Access Control	94.2	100	89	29594ms
sec-rate-limit-engine	Traffic Protection	95.6	100	92	49477ms
sec-sql-injection-detector	Detection & Analysis	95.6	100	92	27190ms
sec-csp-parser	Detection & Analysis	96.1	100	92	21064ms
sec-permission-checker	Access Control	96.8	100	94	33367ms
sec-api-key-scope-checker	Access Control	97.3	100	96	41315ms
sec-access-control-engine	Access Control	99.5	100	100	42917ms
sec-cookie-policy-validator	Auth & Session	99.5	100	100	27488ms
sec-csrf-token-manager	Auth & Session	99.5	100	100	21192ms
sec-dependency-risk-classifier	Detection & Analysis	99.5	100	100	18116ms
sec-encryption-pipeline	Crypto Utils	99.5	100	100	33790ms
sec-hostname-allowlist-validator	Sanitization	99.5	100	100	22119ms
sec-refresh-token-rotation	Auth & Session	99.5	100	100	37376ms
sec-safe-redirect-builder	Sanitization	99.5	100	100	25758ms
sec-session-fixation-detector	Auth & Session	99.5	100	100	34059ms
sec-tenant-isolation-checker	Access Control	99.5	100	100	31109ms
sec-url-sanitizer	Sanitization	99.5	100	100	35230ms
sec-vulnerability-scanner	Detection & Analysis	99.5	100	100	39600ms
sec-file-upload-validator	Sanitization	100.0	100	100	14163ms
sec-html-entity-encoder	Sanitization	100.0	100	100	19113ms

30 tasks · Sorted by score (lowest first) · Hidden = adversarial edge case pass rate
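The headline figures can be cross-checked against the table. The sketch below assumes, since the page does not say so, that the overall score and each domain score are unweighted means of the per-task scores, and that the visible and hidden percentages are unweighted means of the Correct and Hidden columns. The `summarize` helper is hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Rows copied from the results table: task|domain|score|correct %|hidden %.
ROWS = """
sec-crypto-utils|Crypto Utils|9.5|0|0
sec-oauth-state-validator|Auth & Session|9.5|0|0
sec-ssrf-detector|Detection & Analysis|9.5|0|0
sec-jwt-validator|Auth & Session|27.7|33|8
sec-password-strength|Auth & Session|49.5|67|23
sec-auth-log-anomaly-detector|Detection & Analysis|62.2|33|82
sec-insecure-config-scanner|Detection & Analysis|66.3|50|75
sec-secret-detector|Detection & Analysis|70.0|67|67
sec-input-sanitizer|Sanitization|87.3|100|73
sec-csp-nonce-validator|Detection & Analysis|87.8|100|75
sec-abac-rule-engine|Access Control|94.2|100|89
sec-rate-limit-engine|Traffic Protection|95.6|100|92
sec-sql-injection-detector|Detection & Analysis|95.6|100|92
sec-csp-parser|Detection & Analysis|96.1|100|92
sec-permission-checker|Access Control|96.8|100|94
sec-api-key-scope-checker|Access Control|97.3|100|96
sec-access-control-engine|Access Control|99.5|100|100
sec-cookie-policy-validator|Auth & Session|99.5|100|100
sec-csrf-token-manager|Auth & Session|99.5|100|100
sec-dependency-risk-classifier|Detection & Analysis|99.5|100|100
sec-encryption-pipeline|Crypto Utils|99.5|100|100
sec-hostname-allowlist-validator|Sanitization|99.5|100|100
sec-refresh-token-rotation|Auth & Session|99.5|100|100
sec-safe-redirect-builder|Sanitization|99.5|100|100
sec-session-fixation-detector|Auth & Session|99.5|100|100
sec-tenant-isolation-checker|Access Control|99.5|100|100
sec-url-sanitizer|Sanitization|99.5|100|100
sec-vulnerability-scanner|Detection & Analysis|99.5|100|100
sec-file-upload-validator|Sanitization|100.0|100|100
sec-html-entity-encoder|Sanitization|100.0|100|100
""".strip().splitlines()


def summarize(rows):
    """Return (overall, visible %, hidden %, per-domain means), rounded to 1 decimal."""
    by_domain = defaultdict(list)
    scores, visible, hidden = [], [], []
    for row in rows:
        _task, domain, score, correct, hid = row.split("|")
        by_domain[domain].append(float(score))
        scores.append(float(score))
        visible.append(float(correct))
        hidden.append(float(hid))
    domain_means = {name: round(mean(vals), 1) for name, vals in by_domain.items()}
    return round(mean(scores), 1), round(mean(visible), 1), round(mean(hidden), 1), domain_means


overall, visible_pct, hidden_pct, domains = summarize(ROWS)
print(overall, visible_pct, hidden_pct)
for name, value in sorted(domains.items(), key=lambda item: item[1]):
    print(f"{name}: {value}")
```

Under the unweighted-mean assumption this reproduces the reported 81.6 overall, 81.7% visible, and 78.6% hidden, along with the six domain scores. It also makes clear that the Crypto Utils figure of 54.5 is simply the midpoint of 9.5 and 99.5.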