Model Analysis

Claude Sonnet 4.6

anthropic/claude-sonnet-4-6

Overall score: 85.3 (86.7% visible, 81.8% hidden)

Tasks

Passed

Failed

Avg latency: 28300ms

Total cost: $1.1848

AI Commentary

by openai/gpt-5.4-mini

Claude Sonnet 4.6 is strong on security-oriented implementation tasks, with high scores in sanitization (98.8), access control (97.6), and traffic protection (95.6), and it handled most visible tests well (86.7%). Its main gaps are in brittle edge-case handling and output correctness on a few specialized validators: auth/session drops to 80.3 due to OAuth state and password-strength issues, detection/analysis is uneven at 79.6, and crypto-utils is the weakest area at 52.6 with structural failures and incorrect outputs.

Domain Performance

Sanitization · 6 tasks

98.8

Performance is near-ceiling across the domain's six tasks: file upload validation, hostname allowlisting, HTML entity encoding, safe redirect building, and URL sanitization all scored 99.5 or higher, and input sanitization was the lowest at 93.7. No meaningful weaknesses appear here, suggesting strong baseline defensive coding and normalization behavior.

Auth & Session · 7 tasks

80.3

This domain is mostly solid, but the OAuth state validator failed catastrophically with null output and a TypeError, indicating a broken code path rather than a logic miss. Password strength scoring also overestimated several weak passwords and missed multiple feedback conditions, so the model is less reliable when rules require nuanced aggregation of multiple weakness signals.

Access Control · 5 tasks

97.6

Access control is a clear strength, with strong results on the access-control engine, API key scope checking, permission checking, and tenant isolation; the ABAC rule engine was the lowest of the five at 94.2. The high average suggests the model handles authorization boundaries and policy evaluation consistently.

Detection & Analysis · 9 tasks

79.6

Results are mixed: CSP parsing, dependency risk classification, insecure config scanning, and vulnerability scanning are strong, but anomaly detection, secret detection, and SSRF detection are weak. The failures point to incomplete pattern matching and parser robustness issues, including truncated anomaly sets, missed secrets, and outright parse errors on SSRF inputs.

Traffic Protection · 1 task

95.6

Rate limiting is strong and appears stable, with no notable weaknesses in the single task. This suggests good handling of thresholding and request-accounting logic.

Crypto Utils · 2 tasks

52.6

This is the weakest domain by a wide margin, with one strong encryption pipeline task but a failing crypto-utils task that returned null and TypeErrors. The failure pattern suggests the implementation is not resilient to expected input shapes and may have broken assumptions about array/string handling.

Notable Tasks

sec-oauth-state-validator · 9.5 · Auth & Session

The task returned null with a TypeError on reading 'length', which indicates a runtime failure in state handling rather than an incorrect validation decision.
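
A minimal sketch of the kind of guard that avoids this failure mode, assuming the validator compares a stored state string against the one returned on the OAuth callback. The function name and signature are hypothetical; the benchmark's actual task interface is not shown in this report.

```typescript
// Hypothetical helper: returns true only when both values are non-empty
// strings of equal length with identical contents.
function validateOAuthState(expected: unknown, received: unknown): boolean {
  // Guard first: reading .length on null or undefined is what raises
  // "TypeError: Cannot read properties of undefined (reading 'length')".
  if (typeof expected !== "string" || typeof received !== "string") {
    return false;
  }
  if (expected.length === 0 || expected.length !== received.length) {
    return false;
  }

  // Compare every character so the running time does not depend on
  // where the first mismatch occurs.
  let diff = 0;
  for (let i = 0; i < expected.length; i++) {
    diff |= expected.charCodeAt(i) ^ received.charCodeAt(i);
  }
  return diff === 0;
}
```

With this shape, a missing or non-string state yields a rejection instead of a crash, so a malformed callback is scored as a validation decision rather than a runtime failure.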

sec-password-strength · 66.9 · Auth & Session

The model under-scored several weak passwords, omitted required feedback items, and over-classified some cases as stronger than expected, showing inconsistent rule aggregation and thresholding.
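
One way to keep score and feedback consistent is to derive both from the same list of weakness checks. The sketch below assumes a simple penalty-per-failed-rule scheme; the rules, messages, and the 20-point penalty are illustrative assumptions, not the benchmark's expected rubric.

```typescript
interface StrengthResult {
  score: number;      // 0 (weakest) to 100 (strongest)
  feedback: string[]; // one message per failed check
}

// Hypothetical scorer: every weakness signal contributes both a feedback
// item and a fixed penalty, so the two outputs cannot drift apart.
function scorePassword(password: string): StrengthResult {
  const checks: Array<[boolean, string]> = [
    [password.length < 12, "Use at least 12 characters"],
    [!/[a-z]/.test(password), "Add a lowercase letter"],
    [!/[A-Z]/.test(password), "Add an uppercase letter"],
    [!/[0-9]/.test(password), "Add a digit"],
    [!/[^A-Za-z0-9]/.test(password), "Add a symbol"],
    [/(.)\1\1/.test(password), "Avoid three or more repeated characters"],
  ];

  // Keep only the messages for checks that flagged a weakness.
  const feedback = checks.filter(([failed]) => failed).map(([, message]) => message);

  // Assumed penalty: 20 points per failed check, floored at zero.
  const score = Math.max(0, 100 - 20 * feedback.length);
  return { score, feedback };
}
```

Because the score is computed from the feedback list itself, a password cannot be rated strong while still triggering multiple weakness messages.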

sec-ssrf-detector · 9.5 · Detection & Analysis

All reported failures were parse-level crashes ('Unexpected end of input'), so the detector likely cannot robustly parse or normalize malformed/edge-case URLs before applying SSRF policy checks.
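
A sketch of fail-closed URL handling, assuming the detector receives a raw value and returns an allow/block verdict. `checkUrl` and `isInternalIPv4` are hypothetical names, and the blocked ranges are a common baseline rather than the benchmark's rule set. It relies on the standard WHATWG `URL` constructor, which throws on unparsable input.

```typescript
type Verdict = "allow" | "block";

// True for "this network" (0/8), loopback (127/8), RFC 1918 private
// ranges, and link-local (169.254/16) IPv4 literals.
function isInternalIPv4(host: string): boolean {
  const parts = host.split(".");
  if (parts.length !== 4 || !parts.every((p) => /^\d{1,3}$/.test(p))) {
    return false;
  }
  const a = Number(parts[0]);
  const b = Number(parts[1]);
  return (
    a === 0 ||
    a === 10 ||
    a === 127 ||
    (a === 172 && b >= 16 && b <= 31) ||
    (a === 192 && b === 168) ||
    (a === 169 && b === 254)
  );
}

// Hypothetical entry point: anything that cannot be parsed is blocked
// instead of being allowed to throw.
function checkUrl(raw: unknown): Verdict {
  if (typeof raw !== "string") return "block";

  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    return "block"; // malformed or truncated input
  }

  if (url.protocol !== "http:" && url.protocol !== "https:") return "block";

  const host = url.hostname.toLowerCase();
  if (host === "localhost" || host === "[::1]" || isInternalIPv4(host)) {
    return "block";
  }
  return "allow";
}
```

This covers literal hosts only. A fuller detector would also need to handle DNS names that resolve to internal addresses and the remaining IPv6 ranges, which are out of scope for this sketch.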

sec-secret-detector · 69.5 · Detection & Analysis

It truncated secret matches and missed at least one AWS secret key entirely, which points to brittle regex extraction and incomplete multi-secret handling.
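
A sketch of multi-secret extraction that returns every match per pattern with its full captured value. The two patterns are illustrative assumptions: `AKIA` followed by 16 uppercase alphanumerics is the widely documented shape of an AWS access key ID, and the secret key rule here depends on an `aws_secret_access_key` label being present. The benchmark's expected rule set is not shown in this report.

```typescript
interface SecretMatch {
  kind: string;
  value: string;
}

// Illustrative patterns only. Each has the global flag so exec() can be
// called repeatedly to walk through all occurrences.
const PATTERNS: Array<[string, RegExp]> = [
  ["aws-access-key-id", /\b(AKIA[0-9A-Z]{16})\b/g],
  [
    "aws-secret-access-key",
    /aws_secret_access_key\s*[=:]\s*["']?([A-Za-z0-9\/+=]{40})(?![A-Za-z0-9\/+=])/gi,
  ],
];

// Hypothetical scanner: collects all matches for all patterns.
function findSecrets(text: string): SecretMatch[] {
  const found: SecretMatch[] = [];
  for (const [kind, pattern] of PATTERNS) {
    pattern.lastIndex = 0; // global regexes keep state between calls
    let m: RegExpExecArray | null;
    while ((m = pattern.exec(text)) !== null) {
      // m[1] is the full capture group, so the value is not truncated.
      found.push({ kind, value: m[1] });
    }
  }
  return found;
}
```

Looping with `exec` until it returns `null` is what gives multi-secret handling; calling `match` or `exec` once would stop at the first hit and miss later secrets in the same input.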

sec-access-control-engine · 99.5 · Access Control

This is a strong authorization result, indicating the model can correctly encode access rules and preserve tenant or permission boundaries under test.

All Task Results

Task	Domain	Score	Correct (%)	Hidden (%)	Latency
sec-crypto-utils	Crypto Utils	9.5	0	0	23066ms
sec-oauth-state-validator	Auth & Session	9.5	0	0	32899ms
sec-ssrf-detector	Detection & Analysis	9.5	0	0	52934ms
sec-auth-log-anomaly-detector	Detection & Analysis	62.2	33	82	37759ms
sec-password-strength	Auth & Session	66.9	100	31	16807ms
sec-secret-detector	Detection & Analysis	69.5	67	67	14891ms
sec-csp-nonce-validator	Detection & Analysis	88.3	100	75	25476ms
sec-session-fixation-detector	Auth & Session	91.7	100	83	36482ms
sec-sql-injection-detector	Detection & Analysis	92.2	100	83	19531ms
sec-input-sanitizer	Sanitization	93.7	100	87	17703ms
sec-abac-rule-engine	Access Control	94.2	100	89	35705ms
sec-api-key-scope-checker	Access Control	95.1	100	91	32921ms
sec-refresh-token-rotation	Auth & Session	95.3	100	91	28794ms
sec-rate-limit-engine	Traffic Protection	95.6	100	92	36644ms
sec-encryption-pipeline	Crypto Utils	95.7	100	92	30208ms
sec-csp-parser	Detection & Analysis	96.1	100	92	18583ms
sec-access-control-engine	Access Control	99.5	100	100	24441ms
sec-cookie-policy-validator	Auth & Session	99.5	100	100	25397ms
sec-csrf-token-manager	Auth & Session	99.5	100	100	19004ms
sec-dependency-risk-classifier	Detection & Analysis	99.5	100	100	47498ms
sec-file-upload-validator	Sanitization	99.5	100	100	29999ms
sec-insecure-config-scanner	Detection & Analysis	99.5	100	100	30371ms
sec-jwt-validator	Auth & Session	99.5	100	100	20120ms
sec-permission-checker	Access Control	99.5	100	100	28766ms
sec-safe-redirect-builder	Sanitization	99.5	100	100	30066ms
sec-tenant-isolation-checker	Access Control	99.5	100	100	29722ms
sec-vulnerability-scanner	Detection & Analysis	99.5	100	100	42231ms
sec-hostname-allowlist-validator	Sanitization	100.0	100	100	22373ms
sec-html-entity-encoder	Sanitization	100.0	100	100	17129ms
sec-url-sanitizer	Sanitization	100.0	100	100	21490ms

30 tasks · Sorted by score (lowest first) · Hidden = adversarial edge case pass rate
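
The headline figures are consistent with unweighted means over the 30 rows above. The sketch below recomputes them from the table, assuming simple averaging (the report does not state its aggregation method); `rows`, `mean`, and `domainMean` are illustrative names.

```typescript
// Rows copied from the results table: [task, domain, score, visible %, hidden %].
type Row = [string, string, number, number, number];

const rows: Row[] = [
  ["sec-crypto-utils", "Crypto Utils", 9.5, 0, 0],
  ["sec-oauth-state-validator", "Auth & Session", 9.5, 0, 0],
  ["sec-ssrf-detector", "Detection & Analysis", 9.5, 0, 0],
  ["sec-auth-log-anomaly-detector", "Detection & Analysis", 62.2, 33, 82],
  ["sec-password-strength", "Auth & Session", 66.9, 100, 31],
  ["sec-secret-detector", "Detection & Analysis", 69.5, 67, 67],
  ["sec-csp-nonce-validator", "Detection & Analysis", 88.3, 100, 75],
  ["sec-session-fixation-detector", "Auth & Session", 91.7, 100, 83],
  ["sec-sql-injection-detector", "Detection & Analysis", 92.2, 100, 83],
  ["sec-input-sanitizer", "Sanitization", 93.7, 100, 87],
  ["sec-abac-rule-engine", "Access Control", 94.2, 100, 89],
  ["sec-api-key-scope-checker", "Access Control", 95.1, 100, 91],
  ["sec-refresh-token-rotation", "Auth & Session", 95.3, 100, 91],
  ["sec-rate-limit-engine", "Traffic Protection", 95.6, 100, 92],
  ["sec-encryption-pipeline", "Crypto Utils", 95.7, 100, 92],
  ["sec-csp-parser", "Detection & Analysis", 96.1, 100, 92],
  ["sec-access-control-engine", "Access Control", 99.5, 100, 100],
  ["sec-cookie-policy-validator", "Auth & Session", 99.5, 100, 100],
  ["sec-csrf-token-manager", "Auth & Session", 99.5, 100, 100],
  ["sec-dependency-risk-classifier", "Detection & Analysis", 99.5, 100, 100],
  ["sec-file-upload-validator", "Sanitization", 99.5, 100, 100],
  ["sec-insecure-config-scanner", "Detection & Analysis", 99.5, 100, 100],
  ["sec-jwt-validator", "Auth & Session", 99.5, 100, 100],
  ["sec-permission-checker", "Access Control", 99.5, 100, 100],
  ["sec-safe-redirect-builder", "Sanitization", 99.5, 100, 100],
  ["sec-tenant-isolation-checker", "Access Control", 99.5, 100, 100],
  ["sec-vulnerability-scanner", "Detection & Analysis", 99.5, 100, 100],
  ["sec-hostname-allowlist-validator", "Sanitization", 100.0, 100, 100],
  ["sec-html-entity-encoder", "Sanitization", 100.0, 100, 100],
  ["sec-url-sanitizer", "Sanitization", 100.0, 100, 100],
];

// Unweighted arithmetic mean, formatted to one decimal place.
function mean(values: number[]): string {
  const total = values.reduce((sum, v) => sum + v, 0);
  return (total / values.length).toFixed(1);
}

const overall = mean(rows.map((r) => r[2]));
const visible = mean(rows.map((r) => r[3]));
const hidden = mean(rows.map((r) => r[4]));

// Mean score over the tasks belonging to one domain.
function domainMean(domain: string): string {
  return mean(rows.filter((r) => r[1] === domain).map((r) => r[2]));
}
```

Under this assumption, `overall`, `visible`, and `hidden` come out to 85.3, 86.7, and 81.8, matching the summary, and `domainMean` reproduces the six domain scores (for example 52.6 for Crypto Utils, where a single 9.5 result halves the average of a two-task domain).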