The False-Positive Tax Is Worse Than It Looks

When Semgrep announced Semgrep Multimodal earlier this year, the headline numbers were "50% fewer false positives" and "8x more true positives" relative to rules-only SAST. That is a real improvement, and the underlying technique — using an LLM as a second-opinion classifier over rule-based candidates — is a sensible architecture for developer workflows where findings are triaged asynchronously.

It is also, structurally, not enough. Cutting false positives by half still leaves a pentester or AppSec engineer manually triaging every finding in the report. The dominant cost in a real engagement is not detection; it is the human hours spent confirming that each alert corresponds to an actually exploitable condition. A 50% FP reduction moves that tax from "unacceptable" to "painful." It does not move it to "zero."

There is a second problem with second-opinion classification: the verifier is the same kind of system that produced the candidate. If the first-pass classifier hallucinated a reflected-XSS from context where the sink is HTML-escaped, a second LLM looking at the same tokens is surprisingly likely to agree. Two model passes over identical context correlate errors more than they cancel them. The verifier needs to obtain information the classifier did not have — ideally the actual runtime behavior of the application when the payload is sent.

That is the premise behind SILENTCHAIN's Phase 2 Active Verification. Instead of asking a second model to reclassify, we take the candidate finding, synthesize payloads, inject them into a real HTTP request in a sandbox, observe the response, and only then ask an LLM to judge whether the response actually demonstrates exploitation. No exploit path, no finding.

The core claim. A finding that cannot be driven to exploitation in a sandbox should not be in a report. The alternative — shipping statistically likely findings — is what trains pentesters to stop trusting scanner output in the first place.

Phase 2 Architecture: Seven Steps From Candidate to Verified Finding

Phase 2 runs after the first-pass AI classifier has produced a list of candidate findings from HTTP transactions. In SILENTCHAIN Enterprise it lives in src/silentchain/scanning/phase2.py and is gated behind the phase2_enabled config flag. The pipeline looks like this:

Candidate → WAF Check → Param ID → AI Strategy → Inject + Send → Heuristic → AI Verify → VERIFIED

Any step can short-circuit to REJECTED — WAF blocks, no parameter match, payload not reflected, heuristics fail, or AI confidence below 70%.

Concretely, the top-level entry point has this signature:

async def verify_finding(
    finding: Finding,
    original_transaction: HttpTransaction,
    engine: "ScanEngine",
) -> Phase2Result | None:
    ...

A None return means the candidate did not verify — the finding is downgraded or discarded. A Phase2Result return means we successfully injected a payload, observed an expected indicator in the response, and the AI verifier confirmed the behavior demonstrates the vulnerability class. That result is attached to the finding as evidence, with the exact payload, parameter, and response-length delta recorded.
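
The exact shape of Phase2Result is internal to the engine, but a minimal sketch of the evidence it carries, assuming a plain dataclass with hypothetical field names, looks like this:

from dataclasses import dataclass

@dataclass
class Phase2Result:
    """Illustrative sketch of the evidence attached to a verified finding.

    Field names are hypothetical; the real class in scanning/phase2.py may differ.
    """
    verified: bool                # AI verifier agreed the behavior demonstrates the class
    confidence: int               # verifier confidence, accepted only at >= 70
    parameter: str                # the parameter the payload was injected into
    payload: str                  # the exact payload that produced the indicator
    matched_indicator: str        # response substring that satisfied the predicate
    response_length_delta: float  # relative change in body length vs. the baseline response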

Step 1: WAF Status Short-Circuit

Before spending an AI round trip on strategy generation, Phase 2 consults the per-hostname WAF detection cache. If the target is behind a WAF that has already blocked enough requests to cross the configured threshold, Phase 2 either switches to evasion-aware payload sets or skips verification and annotates the finding as "unverified — WAF blocking." The WAF state is tracked as engine statistics so the user sees exactly how many candidates were skipped for this reason.
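
A minimal sketch of that short-circuit, assuming a per-hostname block counter and illustrative names and threshold (the real cache and configured threshold live in the engine):

def waf_gate(hostname: str, waf_block_counts: dict[str, int],
             threshold: int = 5, evasion_enabled: bool = True) -> str:
    """Decide how Phase 2 should proceed for this host. Illustrative sketch only."""
    if waf_block_counts.get(hostname, 0) < threshold:
        return "verify"          # WAF not blocking enough requests to matter yet
    if evasion_enabled:
        return "verify_evasive"  # switch to the evasion-aware payload set
    return "skip"                # annotate the finding as "unverified - WAF blocking"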

Step 2: Parameter Identification

The first-pass classifier usually names the vulnerable parameter in its finding detail text ("the id query parameter is vulnerable to SQL injection"). Phase 2 parses that text against the actual request's query, body, and header parameters, and builds an ordered list of parameters to fuzz. If no match is found, Phase 2 falls back to fuzzing every user-controlled parameter in the request — slower but complete. This matters because the candidate-writing LLM sometimes refers to parameters by plain-English aliases ("the user ID") that only align with the real param name after fuzzy matching.
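
A minimal sketch of that matching step, assuming the finding detail is free text and the request parameters are available as a flat name list (hypothetical names, not the shipped implementation):

import re
from difflib import SequenceMatcher

def identify_parameters(finding_detail: str, request_params: list[str]) -> list[str]:
    """Order request parameters by how strongly the finding text points at them."""
    text = finding_detail.lower()
    tokens = re.findall(r"[a-z0-9_]+", text)
    scored = []
    for name in request_params:
        if re.search(rf"\b{re.escape(name.lower())}\b", text):
            score = 1.0  # the finding names the parameter outright
        else:
            # fuzzy fallback for plain-English aliases ("the user ID" vs. "user_id")
            score = max((SequenceMatcher(None, name.lower(), t).ratio() for t in tokens),
                        default=0.0)
        scored.append((score, name))
    matched = [name for score, name in sorted(scored, reverse=True) if score >= 0.8]
    return matched or request_params  # no confident match: fall back to fuzzing everything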

Step 3: AI Strategy Prompt — Payloads + Expected Indicators

This is the first of two AI calls inside Phase 2, and the more interesting one. The prompt (build_phase2_prompt() in ai/prompts.py) asks the model to behave as a payload engineer: given the vulnerability class, the parameter context, and RAG-retrieved exploit snippets, produce a strategy object:

{
  "payloads": [
    "' OR '1'='1-- -",
    "'; WAITFOR DELAY '0:0:5'-- -",
    "admin' AND 1=CONVERT(int, @@version)-- -"
  ],
  "expected_indicators": [
    "SQL syntax",
    "mysql_fetch",
    "ODBC SQL Server Driver",
    "sqlite3.OperationalError"
  ],
  "verification_method": "response_analysis"
}

The critical field is expected_indicators. This is what converts a verification question ("did this work?") into a concrete, testable predicate ("does the response body contain one of these substrings after we inject the payload?"). The LLM is producing evaluation criteria before it sees the result, which eliminates the post-hoc rationalization failure mode where a model looks at a response and reverse-engineers a reason to call it vulnerable.

The prompt includes vulnerability-class-specific guidance. For XSS the model is reminded to produce payloads that break out of the enclosing HTML context, not just reflect inside it. For SSTI it is told to choose expressions with a deterministic computed result ({{7*7}} → look for 49). For command injection it is told to target markers that are unmistakably system output (uid=, gid=, root directory listings). These class-specific criteria are the difference between a predicate that fires on harmless reflections and one that fires only on actual exploitation.
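
To make that concrete, here is an illustrative (and deliberately non-exhaustive) mapping of the kind of class-specific payload and success-marker pairs the guidance encodes. The values echo the examples above; they are not the shipped knowledge base:

# Hypothetical illustration of class-specific "unmistakable exploitation" markers.
CLASS_MARKERS = {
    "ssti": {
        "payload": "{{7*7}}",
        "success_markers": ["49"],            # deterministic computed result
    },
    "command_injection": {
        "payload": "; id",
        "success_markers": ["uid=", "gid="],  # unmistakably system output
    },
    "xss": {
        "payload": '"><svg onload=alert(1)>',
        "success_markers": ['"><svg onload=alert(1)>'],  # breakout must survive unescaped
    },
}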

Step 4: Hybrid Payload Merge

The AI-generated payloads are merged with a curated payload set from the knowledge base, deduplicated, and ordered curated-first. The curated set is there as a safety net: if the model returns bad JSON or gets clever and proposes something nonsensical, we still fuzz with payloads that have a known track record against this vulnerability class. A graceful-degradation branch handles the case where AI strategy generation fails entirely — Phase 2 proceeds with curated-only payloads rather than aborting verification.

Why hybrid beats pure-AI or pure-curated. Curated payloads are reliable but static — they miss novel bypasses. AI payloads are creative but unreliable — they sometimes hallucinate syntax. Running curated first gives us a verified baseline within a few requests; the AI payloads then catch the WAF-evasion and technology-specific cases the curated set misses.
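
A minimal sketch of the merge, assuming both sources yield plain payload strings and graceful degradation when the AI strategy call fails (function name and signature are illustrative):

def merge_payloads(curated: list[str], ai_generated: list[str] | None,
                   max_payloads: int | None = None) -> list[str]:
    """Curated payloads first, then AI payloads, deduplicated while preserving order.

    If AI strategy generation failed entirely (ai_generated is None), Phase 2
    degrades gracefully to the curated-only set instead of aborting.
    """
    merged: list[str] = []
    seen: set[str] = set()
    for payload in curated + (ai_generated or []):
        if payload not in seen:
            seen.add(payload)
            merged.append(payload)
    return merged[:max_payloads] if max_payloads else merged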

Step 5: Inject and Send

For each (parameter, payload) pair, Phase 2 clones the original HTTP transaction, substitutes the payload into the target parameter, and sends it through the engine's HTTP client. One pre-send optimization worth calling out is the canary reflection test (see Gotcha 2 below): a unique marker value is injected first, and if it never appears in the response, Phase 2 skips the reflection-dependent payloads for that parameter instead of burning requests on them. A sketch of the injection step follows.
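
This is a minimal sketch of the inject-and-send step, assuming the transaction object exposes query, body, and header parameters as dicts and the engine exposes an async HTTP client; the real interfaces may differ:

import copy

async def inject_and_send(engine, original, parameter: str, payload: str):
    """Clone the original transaction, swap one parameter's value for the payload,
    and replay it through the engine's HTTP client. Illustrative sketch only."""
    txn = copy.deepcopy(original)
    if parameter in txn.query_params:
        txn.query_params[parameter] = payload
    elif parameter in txn.body_params:
        txn.body_params[parameter] = payload
    else:
        txn.headers[parameter] = payload  # header-based candidates
    return await engine.http_client.send(txn)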

Step 6: The Heuristic Cascade

After each injected request comes back, Phase 2 runs a four-tier heuristic check (scanning/heuristics.py). If any tier matches, we escalate to the AI verifier. If none match, the (parameter, payload) pair is discarded and we move on. The tiers, in order:

  1. Payload reflection. The injected payload (case-insensitive, length > 3 to avoid generic substring collisions) appears verbatim in the response body. Strongest signal for XSS, HTML injection, and template injection.
  2. Expected-indicator match. Any string from the AI's expected_indicators array appears in the response. This is the primary signal for SQL injection, command injection, SSTI, and LFI — response contents the attacker's payload should have produced.
  3. Error-signature match. Class-specific error patterns — mysql_fetch, sqlite3.OperationalError, Warning: include(), Traceback (most recent call last) — compiled per vulnerability class. Captures the case where the payload triggered an error even if the response does not contain the expected success indicator.
  4. Length deviation. Response body length differs from the baseline by more than 30%, and the baseline was larger than 200 bytes. Weakest signal, used mainly as a tiebreaker for findings where the first three tiers are ambiguous.

A key property of the cascade: it is deterministic, cheap, and runs before the AI verifier. We do not pay for a second LLM call on the overwhelming majority of candidates that a human would look at and immediately dismiss. Only candidates that cross the heuristic bar get promoted to the AI verification step.
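
A condensed sketch of the cascade, with the thresholds quoted above hard-coded and all names illustrative:

def heuristic_cascade(payload: str, expected_indicators: list[str],
                      error_signatures: list[str],
                      response_body: str, baseline_body: str) -> str | None:
    """Return the name of the first tier that fires, or None to discard the pair."""
    body = response_body.lower()

    # Tier 1: verbatim payload reflection (case-insensitive, short payloads excluded)
    if len(payload) > 3 and payload.lower() in body:
        return "reflection"

    # Tier 2: any AI-predicted success indicator present in the response
    if any(ind.lower() in body for ind in expected_indicators):
        return "expected_indicator"

    # Tier 3: class-specific error signatures (mysql_fetch, Traceback, ...)
    if any(sig.lower() in body for sig in error_signatures):
        return "error_signature"

    # Tier 4: body length deviates by more than 30% from a baseline larger than 200 bytes
    if len(baseline_body) > 200:
        delta = abs(len(response_body) - len(baseline_body)) / len(baseline_body)
        if delta > 0.30:
            return "length_deviation"

    return None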

Step 7: AI Verification — Judging the Response

When the heuristic cascade fires, Phase 2 makes its second AI call (build_verification_prompt()). The prompt is deliberately narrow. It does not ask "is this vulnerable?" — that invites the model to confabulate. It asks class-specific questions with yes/no answers, mirroring the criteria from the strategy step: did the SSTI expression's computed result appear in the response, did the XSS payload survive unescaped into a context that executes, does the response contain database errors or rows the baseline did not.

The model returns a structured object:

{
  "verified": true,
  "confidence": 85,
  "analysis": "The injected payload ' OR '1'='1 returned the full user table instead of a single row, and the response body is 170x larger than the baseline. This is consistent with a successful boolean-based SQL injection."
}

Phase 2 accepts verified: true only when confidence ≥ 70. The threshold is not arbitrary: it is a deliberate filter against the long tail of "maybe" responses the verifier produces when the heuristic fired on a borderline case. In our benchmark runs, lowering the threshold to 50 doubled the Phase 2 acceptance rate but introduced three false-positive verifications across the test corpus. Seventy holds the line.
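
The acceptance check itself is small. A sketch, with field names mirroring the JSON example above and the constant mapping to phase2_confidence_threshold:

PHASE2_CONFIDENCE_THRESHOLD = 70  # default value of phase2_confidence_threshold

def passes_threshold(verdict: dict) -> bool:
    """Accept a verification only if the model said verified AND cleared the confidence floor."""
    return bool(verdict.get("verified")) and verdict.get("confidence", 0) >= PHASE2_CONFIDENCE_THRESHOLD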

Benchmark: E2E Run Against python.testinvicti.com

We ran SILENTCHAIN Enterprise end-to-end against http://python.testinvicti.com, a deliberately vulnerable Django target maintained by Invicti. The run used Katana crawl at depth 2, Phase 2 active verification enabled, and the RAG engine providing retrieved context for both strategy generation and verification prompts. Environment details:

The final report contained 36 findings after Phase 2: 8 Medium, 24 Low, 4 Informational, 0 errors, with one finding carrying a phase2_verified=true evidence record. That raw count is not the interesting number. The interesting number is what Phase 2 removed before the findings reached the report at all.

| Stage | Candidate count | Delta | Notes |
| --- | --- | --- | --- |
| First-pass classifier output | 60 candidates | | High/medium/low confidence mixed |
| After Phase 2 verification | 36 findings | −24 (−40%) | 24 candidates downgraded or discarded |
| High-confidence candidates killed | ~40% of that tier | | Mostly reflected-XSS in HTML-escaped sinks |
| Phase 2 verified w/ evidence | 1 finding | | Payload, param, response delta attached |
| Errors / crashes | 0 | | Graceful degradation on all AI failures |

The ~40% kill rate on high-confidence candidates is the headline figure, and it is worth dwelling on. These were not low-confidence "maybes" from the first-pass classifier — they were findings the classifier would have shipped to the report with high confidence under any scanner architecture that lacks active verification. The majority of the killed cases were reflected-XSS candidates where the parameter reflected the payload but the sink was inside a context that HTML-escaped it. The canary reflection test caught some of them in step 5; the rest died when the expected-indicator check failed to find the unescaped breakout string in the response. A second-opinion LLM classifier looking at the same request/response pair would have had a non-trivial chance of agreeing with the first classifier — these are exactly the cases where static analysis of the context is ambiguous. The injection disambiguated it.

Caveat on the benchmark. This is one target, one run, one day. We are publishing these numbers because the architectural story is defensible and the kill-rate magnitude is stable across our internal test corpus — not because one E2E run against a deliberately vulnerable Django app is a comprehensive benchmark. A broader head-to-head against OWASP Benchmark and VulnBench is in progress and will be published separately.

Why "Active" Beats "Second Opinion," Architecturally

The structural difference between Phase 2 and second-opinion LLM classification is the information available at verification time. A second-opinion classifier sees exactly what the first classifier saw: the original request, the original response, and some static context. It is constrained to the same input distribution. When the first model makes a mistake because the input is genuinely ambiguous, the second model looking at the same ambiguous input has correlated error.

Phase 2 breaks that correlation by generating new inputs. The strategy prompt produces a payload that is materially different from anything in the original request. The injection step sends the modified request to the real application. The response is new information that neither model had access to at candidate-generation time. The verification decision is based on observed behavior under a controlled stimulus, not on a re-read of the original tokens.

In information-theoretic terms: second-opinion classification draws from the same source of bits. Active verification opens a new channel. You can ensemble two models on the same input and they will correlate; you cannot correlate a static classifier with a runtime observation, because the runtime observation is independent evidence. This is why Phase 2's ~40% kill rate on high-confidence candidates is not just the effect of a stricter threshold — it is a qualitatively different filter.

The Gotchas We Hit Building This

Gotcha 1: The Verifier Will Rationalize If You Let It

Our earliest verification prompt asked the model a free-form question: "Based on the request, the payload, and the response, is this finding a true positive?" The model happily answered yes to almost everything. The fix was to force class-specific predicates into the prompt and require the model to cite a specific response substring as justification. A verdict with no substring citation is treated as verified: false regardless of the confidence number.
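
A sketch of that guard, assuming a hypothetical cited_substring field in the verifier's JSON (the real field name may differ):

def enforce_citation(verdict: dict, response_body: str) -> dict:
    """Downgrade any verdict that cannot cite a substring actually present in the response."""
    citation = verdict.get("cited_substring", "")  # hypothetical field name
    if not citation or citation not in response_body:
        return {**verdict, "verified": False}      # treated as unverified regardless of confidence
    return verdict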

Gotcha 2: Canary Reflection Is Not Sufficient for XSS

We originally used canary reflection as a standalone XSS verification signal. It gave us false positives everywhere reflection was encoded or wrapped in an attribute context that would not execute JavaScript. Canary reflection is now a cheap pre-filter to decide whether to spend more requests on real payloads, not a verification in itself. Real XSS verification requires an unescaped breakout-context payload to survive into the response and the AI verifier to confirm the surrounding HTML context permits execution.

Gotcha 3: Hybrid Payload Order Matters

Running AI-generated payloads first looked elegant but wasted requests when the model produced plausible-but-broken syntax. Running curated first costs nothing extra if a curated payload succeeds (we short-circuit on first match) and ensures that a bad AI response cannot make Phase 2 worse than the curated-only baseline. The resulting rule: never let AI degrade below a static baseline. Curated is the floor. AI is the ceiling.

Gotcha 4: RAG Context Lifts Both Prompts Noticeably

We initially ran Phase 2 without the RAG engine — raw prompts, no retrieved context. Turning on RAG retrieval for the strategy prompt improved Phase 2 acceptance rates on real vulnerabilities (fewer payloads needed per verified finding) and turning it on for the verification prompt improved precision (fewer borderline heuristic matches escalated to false verifications). The retrieved context is usually a small number of Exploit-DB snippets or SecLists payloads for the vulnerability class, feeding the model verified ground truth to compare the observed response against.

Where Second-Opinion Classification Still Makes Sense

To be fair to the other approach: active verification is not always the right architecture. For SAST running on unexecuted source code in CI, there is no runtime to probe, no response to observe, and injection is not an option. A second-opinion LLM classifier layered on rule-based findings is genuinely the right shape for that workflow. Semgrep Multimodal's 50% FP reduction is a real and useful result for pre-commit pipelines where "flag fewer false alarms to developers" is the actual goal.

The story changes for pentest and security-audit workflows where the output has to hold up in a client meeting. In those contexts, a triage queue of probably-true findings is not the deliverable. The deliverable is a list of vulnerabilities, each with a proof-of-concept, each reproducible on demand. That is the workflow Phase 2 was built for, and that is where the architectural difference between "classify again" and "exploit it" becomes the whole game.

Our SILENTCHAIN SOURCE edition faces this exact boundary on the SAST side. It addresses it by taking findings into a Docker sandbox, writing a PoC script, and executing the PoC against the instrumented code. If the PoC runs and produces the expected effect, the finding stays. If it fails, the finding is downgraded or discarded. Same principle, different substrate: exploit the candidate or drop it.

Try It Yourself

Phase 2 active verification ships with SILENTCHAIN Professional (Burp Suite extension, Jython) and SILENTCHAIN Enterprise (standalone Python 3, FastAPI + WebSocket). The Community edition uses the same first-pass AI classification but omits Phase 2 — by design, so that Community stays a simple drop-in Burp extension with no sandbox orchestration overhead.

To run the same pipeline we benchmarked:

# SILENTCHAIN Enterprise (standalone)
pip install silentchain-enterprise
silentchain scan \
  --url http://your-target.local \
  --phase2-enabled \
  --rag-enabled \
  --rag-api-url http://localhost:8000

# SILENTCHAIN Pro (Burp extension)
# Load silentchain-pro-v1.2.11.py in Burp Extender,
# enable "Phase 2 Active Verification" in the config tab,
# and point the RAG client at your local RAG engine.

The configuration fields that map to the architecture described above: phase2_enabled, phase2_confidence_threshold (default 70), phase2_max_payloads_per_finding, rag_enabled, rag_api_url, rag_top_k, and rag_min_relevance.
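
For reference, the same fields as a flat key-value sketch. Only the 70 is a documented default; the other values are placeholders, and the snippet is illustrative rather than a shipped config file:

phase2_config = {
    "phase2_enabled": True,
    "phase2_confidence_threshold": 70,       # documented default
    "phase2_max_payloads_per_finding": 20,   # placeholder value
    "rag_enabled": True,
    "rag_api_url": "http://localhost:8000",
    "rag_top_k": 5,                          # placeholder value
    "rag_min_relevance": 0.6,                # placeholder value
}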

Stop Triaging False Positives. Start Reading Verified Findings.

SILENTCHAIN Professional and Enterprise ship Phase 2 active verification out of the box. Every finding in the report has a payload, a parameter, and a response delta attached as evidence. No Phase 2 match, no finding. Period.
