Claude Sonnet 4.6
anthropic/claude-sonnet-4-6
85.3
overall score
Tasks
30
Passed
14
Failed
16
Avg latency
28300ms
Total cost
$1.1848
AI Commentary
by openai/gpt-5.4-mini
Claude Sonnet 4.6 is strong on security-oriented implementation tasks, with high scores in sanitization (98.8), access control (97.6), and traffic protection (95.6), and it handled most visible tests well (86.7%). Its main gaps are in brittle edge-case handling and output correctness on a few specialized validators: auth/session drops to 80.3 due to OAuth state and password-strength issues, detection/analysis is uneven at 79.6, and crypto-utils is the weakest area at 52.6 with structural failures and incorrect outputs.
Domain Performance
Sanitization performance is near-ceiling across all five listed tasks: file upload validation, hostname allowlisting, HTML entity encoding, safe redirect building, and URL sanitization. No meaningful weaknesses appear here, which suggests strong baseline defensive coding and normalization behavior.
Auth & Session is mostly solid, but the OAuth state validator failed outright, returning null with a TypeError, which indicates a broken code path rather than a logic miss. Password strength scoring also overestimated several weak passwords and missed multiple feedback conditions, so the model is less reliable when rules require aggregating several weakness signals.
Access control is a clear strength, with strong results on the access-control engine, API key scope checking, permission checking, and tenant isolation. The high average suggests the model handles authorization boundaries and policy evaluation consistently.
Detection & Analysis results are mixed: CSP parsing, dependency risk classification, insecure config scanning, and vulnerability scanning are strong, but anomaly detection, secret detection, and SSRF detection are weak. The failures point to incomplete pattern matching and parser robustness issues, including truncated anomaly sets, missed secrets, and outright parse errors on SSRF inputs.
Rate limiting, the single Traffic Protection task, is strong and appears stable, with no notable weaknesses. This suggests good handling of thresholding and request-accounting logic.
Crypto Utils is the weakest domain by a wide margin: the encryption pipeline task is strong, but the crypto-utils task failed, returning null with TypeErrors. The failure pattern suggests the implementation is not resilient to expected input shapes and may rest on broken assumptions about array or string handling.
Notable Tasks
The OAuth state validator returned null with a TypeError on reading 'length', which indicates a runtime failure in state handling rather than an incorrect validation decision.
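The kind of guard that avoids a crash on reading 'length' can be sketched as follows. This is a minimal illustration, not the benchmark's reference solution: the function names, the `StateCheck` result shape, and the reason strings are all hypothetical, since the report does not show the task's actual contract.

```typescript
// Hypothetical result shape; the real task contract is not shown in this report.
interface StateCheck {
  valid: boolean;
  reason?: string;
}

// Compare two strings without exiting early on the first differing character.
function constantTimeEqual(a: string, b: string): boolean {
  if (a.length !== b.length) {
    return false;
  }
  let diff = 0;
  for (let i = 0; i < a.length; i++) {
    diff |= a.charCodeAt(i) ^ b.charCodeAt(i);
  }
  return diff === 0;
}

// Validate a received OAuth `state` value against the stored one.
// Type guards run before any property access, so a missing or non-string
// input produces a result object instead of a TypeError.
function validateOAuthState(received: unknown, expected: unknown): StateCheck {
  if (typeof received !== "string" || received.length === 0) {
    return { valid: false, reason: "missing_received_state" };
  }
  if (typeof expected !== "string" || expected.length === 0) {
    return { valid: false, reason: "missing_expected_state" };
  }
  if (!constantTimeEqual(received, expected)) {
    return { valid: false, reason: "state_mismatch" };
  }
  return { valid: true };
}
```

With this shape, `validateOAuthState(undefined, "abc")` returns a structured rejection rather than throwing, which is the difference between an incorrect decision and the runtime failure described above.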
On password strength, the model under-scored several weak passwords and omitted required feedback items, yet over-classified other cases as stronger than expected, showing inconsistent rule aggregation and thresholding.
All reported SSRF detector failures were parse-level crashes ('Unexpected end of input'), so the detector likely cannot robustly parse or normalize malformed or edge-case URLs before applying SSRF policy checks.
The secret detector truncated secret matches and missed at least one AWS secret key entirely, which points to brittle regex extraction and incomplete multi-secret handling.
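Multi-secret extraction of the kind this task appears to require can be sketched as below. The patterns, names, and result shape are assumptions for illustration only: the report does not show the task's rule set, and a real detector needs a much broader, tested set of patterns. The sample strings are the placeholder credentials used in AWS documentation.

```typescript
// Hypothetical match record; not the benchmark's actual output format.
interface SecretMatch {
  kind: string;
  value: string;
  index: number;
}

// Illustrative rules only. The secret-key rule assumes the value follows a
// key name containing "secret" and is exactly 40 base64-style characters.
const RULES: { kind: string; re: RegExp }[] = [
  { kind: "aws_access_key_id", re: /\bAKIA[0-9A-Z]{16}\b/g },
  {
    kind: "aws_secret_access_key",
    re: /secret[A-Za-z_]*["']?\s*[:=]\s*["']?([A-Za-z0-9\/+=]{40})(?![A-Za-z0-9\/+=])/gi,
  },
];

// Collect every match of every rule, not just the first hit per rule.
function findSecrets(text: string): SecretMatch[] {
  const found: SecretMatch[] = [];
  for (const rule of RULES) {
    rule.re.lastIndex = 0; // global regexes keep state between calls
    let m: RegExpExecArray | null;
    while ((m = rule.re.exec(text)) !== null) {
      // Prefer the capture group so the full secret is reported without its key name.
      const value = m[1] !== undefined ? m[1] : m[0];
      found.push({ kind: rule.kind, value, index: m.index + m[0].indexOf(value) });
    }
  }
  return found.sort((a, b) => a.index - b.index);
}

const sample =
  "id=AKIAIOSFODNN7EXAMPLE\n" +
  'aws_secret_access_key = "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"\n' +
  "backup=AKIAI44QH8DHBEXAMPLE";
const secrets = findSecrets(sample);
```

Looping with `exec` on a global regex returns each occurrence in turn, and capturing the value with a fixed length keeps the reported secret whole instead of cut at the first delimiter.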
This is a strong authorization result, indicating the model can correctly encode access rules and preserve tenant or permission boundaries under test.
All Task Results
| Task | Domain | Score | Correct (%) | Hidden (%) | Latency |
|---|---|---|---|---|---|
| sec-crypto-utils | Crypto Utils | 9.5 | 0 | 0 | 23066ms |
| sec-oauth-state-validator | Auth & Session | 9.5 | 0 | 0 | 32899ms |
| sec-ssrf-detector | Detection & Analysis | 9.5 | 0 | 0 | 52934ms |
| sec-auth-log-anomaly-detector | Detection & Analysis | 62.2 | 33 | 82 | 37759ms |
| sec-password-strength | Auth & Session | 66.9 | 100 | 31 | 16807ms |
| sec-secret-detector | Detection & Analysis | 69.5 | 67 | 67 | 14891ms |
| sec-csp-nonce-validator | Detection & Analysis | 88.3 | 100 | 75 | 25476ms |
| sec-session-fixation-detector | Auth & Session | 91.7 | 100 | 83 | 36482ms |
| sec-sql-injection-detector | Detection & Analysis | 92.2 | 100 | 83 | 19531ms |
| sec-input-sanitizer | Sanitization | 93.7 | 100 | 87 | 17703ms |
| sec-abac-rule-engine | Access Control | 94.2 | 100 | 89 | 35705ms |
| sec-api-key-scope-checker | Access Control | 95.1 | 100 | 91 | 32921ms |
| sec-refresh-token-rotation | Auth & Session | 95.3 | 100 | 91 | 28794ms |
| sec-rate-limit-engine | Traffic Protection | 95.6 | 100 | 92 | 36644ms |
| sec-encryption-pipeline | Crypto Utils | 95.7 | 100 | 92 | 30208ms |
| sec-csp-parser | Detection & Analysis | 96.1 | 100 | 92 | 18583ms |
| sec-access-control-engine | Access Control | 99.5 | 100 | 100 | 24441ms |
| sec-cookie-policy-validator | Auth & Session | 99.5 | 100 | 100 | 25397ms |
| sec-csrf-token-manager | Auth & Session | 99.5 | 100 | 100 | 19004ms |
| sec-dependency-risk-classifier | Detection & Analysis | 99.5 | 100 | 100 | 47498ms |
| sec-file-upload-validator | Sanitization | 99.5 | 100 | 100 | 29999ms |
| sec-insecure-config-scanner | Detection & Analysis | 99.5 | 100 | 100 | 30371ms |
| sec-jwt-validator | Auth & Session | 99.5 | 100 | 100 | 20120ms |
| sec-permission-checker | Access Control | 99.5 | 100 | 100 | 28766ms |
| sec-safe-redirect-builder | Sanitization | 99.5 | 100 | 100 | 30066ms |
| sec-tenant-isolation-checker | Access Control | 99.5 | 100 | 100 | 29722ms |
| sec-vulnerability-scanner | Detection & Analysis | 99.5 | 100 | 100 | 42231ms |
| sec-hostname-allowlist-validator | Sanitization | 100.0 | 100 | 100 | 22373ms |
| sec-html-entity-encoder | Sanitization | 100.0 | 100 | 100 | 17129ms |
| sec-url-sanitizer | Sanitization | 100.0 | 100 | 100 | 21490ms |
30 tasks · Sorted by score (lowest first) · Hidden = adversarial edge case pass rate
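The domain scores quoted in the commentary and the overall score are consistent with an unweighted mean of the per-task scores in the table, rounded to one decimal. The sketch below reproduces that arithmetic; the aggregation rule is inferred from the numbers, not stated by the report, and the variable names are illustrative.

```typescript
// Per-task scores copied from the results table, grouped by domain.
const scoresByDomain: Record<string, number[]> = {
  "Sanitization": [93.7, 99.5, 99.5, 100.0, 100.0, 100.0],
  "Auth & Session": [9.5, 66.9, 91.7, 95.3, 99.5, 99.5, 99.5],
  "Access Control": [94.2, 95.1, 99.5, 99.5, 99.5],
  "Detection & Analysis": [9.5, 62.2, 69.5, 88.3, 92.2, 96.1, 99.5, 99.5, 99.5],
  "Traffic Protection": [95.6],
  "Crypto Utils": [9.5, 95.7],
};

function mean(xs: number[]): number {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

// Round to one decimal place, matching the precision shown in the report.
function round1(x: number): number {
  return Math.round(x * 10) / 10;
}

const domainAverages: Record<string, number> = {};
const allScores: number[] = [];
for (const domain of Object.keys(scoresByDomain)) {
  const taskScores = scoresByDomain[domain];
  domainAverages[domain] = round1(mean(taskScores));
  for (const s of taskScores) {
    allScores.push(s);
  }
}

// Assumed aggregation: plain mean over all 30 tasks, not a mean of domain means.
const overallScore = round1(mean(allScores));

console.log(domainAverages);
console.log("overall:", overallScore);
```

Under this assumption the three tasks scoring 9.5 account for most of the gap between the overall score and the near-ceiling domains: Crypto Utils has only two tasks, so a single failure halves its average.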