Building a RAG Pipeline for Vulnerability Detection

Why Standard LLM Prompting Falls Short for Vuln Detection

The naive approach to AI vulnerability scanning is straightforward: take an HTTP request/response pair or a block of source code, send it to an LLM with a prompt like "find security vulnerabilities," and parse the output. This works surprisingly well for demos. It fails in production for three structural reasons.

First, LLMs have no access to ground truth. When an LLM reports that a parameter is vulnerable to CVE-2025-29814, it is generating a plausible-looking CVE number from statistical patterns — not looking it up in the National Vulnerability Database. The model has no mechanism to verify that the CVE exists, applies to the detected software version, or matches the observed behavior.

Second, context windows are stateless. Scan request number 47 has no knowledge of what was found in requests 1 through 46. The model cannot correlate an open admin panel discovered earlier with an SSRF it finds later. Every analysis is an isolated event.

Third, payload knowledge is frozen at training time. The model knows about ' OR 1=1-- because it appeared in thousands of training documents. It does not know about the WAF bypass payload your team discovered last week that works against Cloudflare with a specific Content-Type manipulation. Training data is months or years stale; real vulnerability landscapes shift daily.

RAG addresses all three failures by injecting verified, current, retrievable knowledge into the prompt before the LLM generates its analysis. The model stops guessing and starts citing.

Designing the Knowledge Base Schema

The knowledge base is the foundation of any RAG pipeline. For vulnerability detection, the schema must accommodate fundamentally different types of security data while maintaining a consistent structure for retrieval and ranking.

Document Types

A production security knowledge base needs at minimum five categories of documents:

Exploits: Proof-of-concept code from Exploit-DB, Nuclei templates, and curated payload collections. These are the ground truth — actual code that demonstrates a vulnerability is exploitable.
CWE definitions: The Common Weakness Enumeration provides standardized descriptions, detection heuristics, and remediation guidance. Each CWE entry is chunked into sub-documents (description, consequences, detection methods, mitigations) for targeted retrieval.
CVE records: National Vulnerability Database entries with affected software versions, CVSS scores, and references. These let the scanner cross-reference detected technology stacks against known vulnerabilities.
Payloads: Categorized attack strings from SecLists, PayloadsAllTheThings, and internal collections. Organized by vulnerability class (XSS, SQLi, LFI, SSTI, XXE) and tagged with target context (WAF type, backend technology).
Scan results: Previously verified findings from your own scanning operations. These are the highest-value documents because they represent confirmed vulnerabilities in your actual environment.

Chunking Strategies for Security Data

Standard RAG chunking strategies — split on paragraph boundaries, keep chunks under 512 tokens — do not work for security content. A SQL injection payload split across two chunks is useless. An exploit's setup code separated from its payload defeats the purpose of retrieval.

Security-specific chunking rules:

Exploits stay intact. A proof-of-concept is a single atomic unit. If it exceeds the chunk size limit, it gets its own oversized document rather than being split.
CWE entries split by section. Description, detection, and remediation are independently retrievable. A query about detecting XSS should not pull back remediation text.
Payloads are one-per-document. Each payload is a single chunk with metadata tags (category, target WAF, backend, effectiveness rating).
CVE records split by severity. Critical and high-severity CVEs get individual documents. Medium and low CVEs can be grouped by affected product to reduce index size.

Metadata Fields

Every document in the knowledge base carries structured metadata that drives retrieval ranking:

{
  "source": "exploit-db",          // Origin: exploit-db, cwe, nvd, seclists, scan-result
  "severity": "high",              // Critical, High, Medium, Low, Info
  "confidence": "confirmed",       // Confirmed, Probable, Theoretical
  "hit_count": 14,                 // Times this doc informed a verified finding
  "category": "sqli",             // Vulnerability class
  "last_verified": "2026-03-15",  // When this was last confirmed useful
  "target_context": {
    "waf": "cloudflare",
    "backend": "express",
    "framework": "react"
  }
}

The hit_count field is particularly important. Documents that have previously contributed to verified findings accumulate higher hit counts, making them more likely to surface in future retrievals. This is the mechanism through which the knowledge base learns from operational use.

Vector Embeddings for Security Content

Why General-Purpose Embeddings Miss Security Semantics

Standard embedding models (like those trained primarily on general web text) encode semantic similarity based on natural language meaning. In that vector space, "SQL injection" is close to "database query" but might also be close to "SQL tutorial" or "database administration." For security retrieval, we need UNION SELECT to be close to SQL injection, and <script>alert(1)</script> to be close to reflected XSS — relationships that general-purpose models may encode weakly or not at all.

Embedding Model Selection

After testing multiple embedding models against security-specific retrieval benchmarks, nomic-embed-text provides the best balance of quality, speed, and local deployability. It runs locally via Ollama with no external API dependency, produces 768-dimensional embeddings, and handles code-mixed content (natural language + payload syntax) better than models trained exclusively on prose.

The critical advantage of local embeddings is zero data exfiltration risk. Security scan data — HTTP traffic, source code, vulnerability findings — never leaves the host machine. For security tooling, this is a non-negotiable requirement for most organizations.

ChromaDB as the Vector Store

ChromaDB serves as the vector store for several practical reasons: it embeds natively with Python, supports metadata filtering alongside vector similarity search, persists to disk with SQLite backing, and requires no external infrastructure. A single ChromaDB instance can hold 100,000+ security documents with sub-second query latency on commodity hardware.

The collection schema maps directly to our document types:

collection = chroma_client.get_or_create_collection(
    name="security_knowledge",
    metadata={"hnsw:space": "cosine"},
    embedding_function=ollama_embedder
)

The Retrieval Pipeline

Query Construction from HTTP Traffic and Source Code

Raw HTTP traffic and source code are not good retrieval queries. A 50KB HTTP response dumped into a vector search will match everything and nothing. The retrieval pipeline must extract security-relevant signals and construct targeted queries from them.

For DAST (HTTP traffic analysis), the query is constructed from:

Response headers indicating technology stack (X-Powered-By, Server)
Input reflection points (parameters echoed in the response body)
Error signatures (stack traces, SQL error messages, debug output)
The suspected vulnerability class from initial heuristic analysis

For SAST (source code analysis), the query draws from:

Identified sinks (database queries, file operations, command execution)
Data flow from user input to sink
Framework and language context
Missing sanitization or validation patterns

Raw Input → Signal Extraction → Query Construction → Vector Search → Top-K Docs

Top-K Selection and Minimum Relevance Thresholds

Retrieving too many documents floods the LLM context window with noise. Retrieving too few misses critical context. In practice, top_k=5 with a minimum relevance threshold of 0.65 (cosine similarity) provides the best results for security analysis.

Documents below the relevance threshold are discarded even if fewer than K results remain. It is better to give the LLM three highly relevant documents than five documents where two are tangentially related. Irrelevant context actively degrades LLM output quality — the model will attempt to incorporate every retrieved document, even unhelpful ones.

Source Priority Weighting

Not all knowledge base documents carry equal authority. The retrieval pipeline applies a priority multiplier to the raw similarity score:

Verified scan results (1.5x) — Confirmed by a human analyst
Exploit-DB PoCs (1.3x) — Working proof-of-concept code
NVD/CVE records (1.2x) — Authoritative vulnerability data
CWE definitions (1.0x) — Baseline reference material
Theoretical payloads (0.8x) — Unverified, from public collections

A verified exploit that scored 0.70 similarity outranks a theoretical payload at 0.85 after weighting (0.70 x 1.5 = 1.05 vs 0.85 x 0.8 = 0.68). This ensures the LLM receives the most operationally useful context, not just the most textually similar.

Prompt Engineering with Retrieved Context

System Prompt Design

The system prompt defines the LLM's analytical framework. For RAG-augmented vulnerability detection, the system prompt must accomplish three things: establish the security expert persona, define the output schema, and — critically — constrain the model to cite retrieved sources.

You are a security vulnerability analyst. Analyze the provided
HTTP traffic or source code for security vulnerabilities.

IMPORTANT: Base your analysis ONLY on the provided knowledge
base context. For every finding, cite the specific KB document
that supports it. If no KB document supports a suspected
vulnerability, classify it as "Needs Verification" rather
than reporting it as confirmed.

Do not invent CVE numbers. Only reference CVEs that appear
in the provided KB context.

The instruction to avoid inventing CVEs is load-bearing. Without it, models will generate plausible CVE identifiers 15-20% of the time. With the constraint plus retrieved NVD data, hallucinated CVEs drop to under 2%.

Injecting KB Documents into the Analysis Prompt

Retrieved documents are formatted as a structured context block inserted between the system prompt and the analysis target:

=== KNOWLEDGE BASE CONTEXT ===

[KB-1] Source: exploit-db | Severity: high | Verified: yes
Title: SQL Injection in Express.js with unsanitized req.query
Payload: ' UNION SELECT username,password FROM users--
...

[KB-2] Source: cwe | ID: CWE-89
Detection: Look for string concatenation in SQL query construction
...

=== END KNOWLEDGE BASE CONTEXT ===

Analyze the following HTTP exchange for vulnerabilities:

Each document is labeled with a reference ID ([KB-1], [KB-2]) so the LLM can cite specific sources in its output. The structured format with explicit Source and Severity fields helps the model weight the documents appropriately.

The Feedback Loop

A static knowledge base degrades over time. New vulnerabilities emerge, old payloads stop working, and the security landscape shifts. The feedback loop is what transforms a RAG pipeline from a static retrieval system into a continuously learning one.

Verified Findings Boost Related Documents

When a security analyst marks a finding as a true positive, the pipeline triggers a feedback cycle:

The KB documents that were retrieved for that finding have their hit_count incremented
Documents with higher hit counts receive stronger priority weighting in future retrievals
Related documents (same CWE, same vulnerability class, same target technology) get a smaller secondary boost

Conversely, when a finding is marked as a false positive, the retrieved documents receive a negative signal. They are not deleted — they may be valid for other targets — but their effective ranking for that specific technology context decreases.

Auto-Ingestion of Confirmed Vulnerabilities

Findings verified with "Certain" or "Firm" confidence are automatically ingested back into the knowledge base as new documents. These scan-result documents include the full context: target URL, technology stack, the payload that worked, the evidence that confirmed the vulnerability, and the analyst's verification notes.

This creates a compounding knowledge effect. The scanner that has run 500 scans against your infrastructure has a knowledge base enriched with hundreds of verified findings specific to your technology stack. It knows which payloads bypass your WAF, which endpoints have been historically vulnerable, and which frameworks your team uses. A fresh installation cannot match this accumulated operational intelligence.

Hit Count Decay for Stale Documents

Documents that were useful six months ago may no longer be relevant. The pipeline applies a time-decay function to hit counts: a document verified last week has its full hit count applied, while a document last verified six months ago has its effective score reduced. This prevents the knowledge base from becoming anchored to outdated vulnerability patterns while still retaining the underlying information for edge cases.

The net effect: Every scan is both an analysis operation and a training operation. The knowledge base after 1,000 scans is a fundamentally different (and better) artifact than the one you started with.

Production Considerations

HNSW Index Corruption and File-Based Locking

ChromaDB uses HNSW (Hierarchical Navigable Small World) indexes for approximate nearest neighbor search. These indexes are not safe for concurrent writes. Two scan processes writing to the same ChromaDB collection simultaneously can corrupt the HNSW graph, causing silent retrieval failures — queries return results, but the results are wrong because the index topology is broken.

The solution is a file-based lock that serializes write operations:

LOCK_FILE = "/tmp/rag_engine.lock"

def acquire_write_lock():
    lock_fd = open(LOCK_FILE, "w")
    fcntl.flock(lock_fd, fcntl.LOCK_EX)
    return lock_fd

def release_write_lock(lock_fd):
    fcntl.flock(lock_fd, fcntl.LOCK_UN)
    lock_fd.close()

Read operations (retrieval queries) do not require the lock and can proceed concurrently. Only ingestion, feedback updates, and document deletions acquire the exclusive lock.

Concurrent Scan Safety

Multiple scans running in parallel must each get consistent retrieval results without interfering with each other. The architecture handles this through read-write separation: scan processes only read from the knowledge base during analysis. Write operations (feedback, auto-ingestion) are queued and processed asynchronously by a single writer process, eliminating contention entirely.

Knowledge Base Size vs. Retrieval Latency

ChromaDB with HNSW indexing provides sub-linear retrieval time, but the constant factors matter at scale. Empirical measurements from production use:

10,000 documents: ~15ms per query (negligible impact on scan time)
50,000 documents: ~45ms per query (still negligible)
100,000 documents: ~120ms per query (noticeable but acceptable)
250,000+ documents: ~300ms+ per query (consider partitioning by document type)

For knowledge bases exceeding 200K documents, partitioning into separate ChromaDB collections by document type (exploits, CWEs, payloads, scan results) and querying only the relevant collection for each analysis phase reduces latency back to the sub-50ms range while maintaining the full corpus.

Build on a Production RAG Pipeline

SILENTCHAIN AI ships with a fully integrated RAG Knowledge Engine backed by 75,000+ security documents, feedback loops, and concurrent scan safety built in. Skip the infrastructure work and start scanning with grounded AI analysis today.

Get Started Free

← Back to all posts