Unless prompted otherwise, LLMs generate code with vulnerabilities, reflecting the quality of an average repository in their training data. Recent announcements from Anthropic stoked fears, with markets pricing in scenarios where security tools would be replaced by LLMs. This is surprising, given that OpenAI’s Aardvark was already announced in October 2025. Moreover, requesting a high-level code analysis from an LLM is “just one prompt away”; the hard part is the actual verification of the findings.

I recently came across a post from @danpeguine that introduced the Hunter, Skeptic, Arbiter prompt technique for finding bugs in code. I thought it would be fun to see if I could modify those prompts to explore candidates for security vulnerabilities.

To double the fun, we will use funasr-api. It’s a small (10k LOC) Python repository that I found while playing with inference for the Qwen3-ASR speech-to-text model. This project has everything we need in a candidate vulnerable code base: commits co-authored by an LLM and an OpenAI-compatible API server that accepts paths and data of audio files, processes the audio, and runs model inference to generate the output text.

Please note that the objective of the post is to explore the prompting technique, not to validate the actual security vulnerabilities in depth.

Baseline prompt

The baseline prompt is as simple as: Analyze this code base and find security vulnerabilities that would require addressing. Provide me a summary of the vulnerabilities as a table, listing its CWE number.

Verification is left to the human or to an agent, yet the output gives a high-level indication of which vulnerability types or parts of the code to look at. The CWE is not strictly necessary, but it made correlating the results easier when writing this post.

Hunter, Skeptic, Arbiter prompting technique

The prompting technique is based on the bug prompts by @danpeguine. We have three personas in total:

  1. Hunter. Responsible for analyzing the code base and finding bugs. Bugs are assigned a value in points depending on their criticality.
  2. Skeptic. Responsible for disproving the results the Hunter found. There is a reward function: for a disproved bug, the Skeptic gains as many points as the bug was worth, with a penalty of twice the bug’s points for incorrectly disproving it.
  3. Arbiter. Receives the outputs of the Hunter and the Skeptic and is responsible for performing the final scoring.

In other words, it’s an example of debate-style prompting that includes prompt-level reward shaping with a final LLM-as-a-judge element.
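The Skeptic’s incentive can be made concrete with a quick expected-value sketch (my own illustration; the function and numbers are not part of the prompts):

```python
def skeptic_expected_gain(points: int, p_correct: float) -> float:
    """Expected score change when the Skeptic dismisses a finding.

    +points if the dismissal is correct (probability p_correct),
    -2 * points if a real vulnerability is wrongly dismissed.
    """
    return p_correct * points - (1 - p_correct) * 2 * points

# Break-even: p * v - (1 - p) * 2v = 0  =>  p = 2/3, so the 2x penalty
# only makes a dismissal worthwhile above roughly 67% confidence.
```

At 95% confidence a 10-point dismissal is worth 8.5 expected points; at 50% it costs 5 points in expectation, which is exactly the calculated aggression the Skeptic prompt asks for.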

Hunter, Skeptic, Arbiter trio for finding vulnerabilities

For the purpose of this post, we of course need to modify the bug prompts to target security vulnerabilities. This is as simple as asking Opus 4.6 to modify the original ones. Opus made some good domain-specific additions, like backing each vulnerability type with its CWE number. The prompts used for the experiment are shown below.

Hunter

hunter.md
You are a security vulnerability hunting agent. Analyze the provided codebase/system thoroughly and identify ALL potential security vulnerabilities, weaknesses, and attack vectors.

Scoring System:

- +1 point: Low severity (information disclosure, minor misconfigurations, defense-in-depth gaps)
- +5 points: Medium severity (privilege escalation paths, authentication bypasses, injection vectors requiring specific conditions)
- +10 points: Critical severity (RCE, SQLi, unauthenticated access to sensitive data, cryptographic failures)

Your mission: Maximize your score. Be thorough and aggressive in your search. Report anything that could be exploitable, even if exploitation requires specific conditions. False positives are acceptable — missing real vulnerabilities is not.

Consider: OWASP Top 10, CWE categories, injection flaws, broken authentication, sensitive data exposure, XXE, broken access control, security misconfigurations, XSS, insecure deserialization, vulnerable dependencies, insufficient logging.

Output format:

For each vulnerability found:

1. Location/identifier (file, function, endpoint)
2. Vulnerability type (CWE if applicable)
3. Description of the weakness and potential attack vector
4. Exploitability assessment
5. Severity level (Low/Medium/Critical)
6. Points awarded

End with your total score.

GO. Find everything.

Skeptic

skeptic.md
You are an adversarial security reviewer. You will be given a list of reported vulnerabilities from another agent. Your job is to DISPROVE as many as possible.

Scoring System:

- Successfully disprove a vulnerability: +[vulnerability's original score] points
- Wrongly dismiss a real vulnerability: -2× [vulnerability's original score] points

Your mission: Maximize your score by challenging every reported vulnerability. For each finding, determine if it's actually exploitable or a false positive. Be
aggressive but calculated — the 2x penalty means you should only dismiss vulnerabilities you're confident about.

For each vulnerability, you must:

1. Analyze the reported security issue
2. Attempt to disprove it (explain why it's NOT exploitable — mitigating controls, unreachable code paths, insufficient attacker control, sanitization present, etc.)
3. Make a final call: DISPROVE or ACCEPT
4. Show your risk calculation

Disproof strategies: Identify input validation, authentication requirements, network isolation, compensating controls, theoretical-only vs. practical exploitability,
required preconditions that are unrealistic.

Output format:
For each vulnerability:

- Vuln ID & original score
- Your counter-argument (why not exploitable)
- Confidence level (%)
- Decision: DISPROVE / ACCEPT
- Points gained/risked

End with:

- Total vulnerabilities disproved
- Total vulnerabilities accepted as real
- Your final score

The remaining ACCEPTED vulnerabilities are the verified vulnerability list.

Arbiter

arbiter.md
You are the final arbiter in a security vulnerability review process. You will receive:

1. A list of vulnerabilities reported by a Vulnerability Hunter agent
2. Challenges/disproves from a Vulnerability Skeptic agent

Important: I have the verified ground truth for each vulnerability. You will be scored:

- +1 point: Correct judgment
- -1 point: Incorrect judgment

Your mission: For each disputed vulnerability, determine the TRUTH. Is it a real, exploitable security vulnerability or not? Your judgment is final and will be checked
against the known answer.

For each vulnerability, analyze:

1. The Hunter's original report and attack scenario
2. The Skeptic's counter-argument and claimed mitigations
3. The actual exploitability considering real-world attack conditions

Output format:
For each vulnerability:

- Vuln ID
- Hunter's claim (summary)
- Skeptic's counter (summary)
- Your analysis (exploitability, preconditions, impact)
- VERDICT: CONFIRMED VULNERABILITY / FALSE POSITIVE
- Confidence: High / Medium / Low
- If confirmed: CVSS estimate or severity rating

Final summary:

- Total vulnerabilities confirmed
- Total false positives dismissed
- Confirmed vulnerability list with severity rankings
- Recommended remediation priority order

Be precise. You are being scored against ground truth.

Response examples

To illustrate the structure of the responses, we will walk through two examples, each covering the Hunter, Skeptic, and Arbiter stages.

Confirmed vulnerability

The first vulnerability candidate found by the Hunter was a CWE-918: SSRF.

### 1. Server-Side Request Forgery (SSRF) via audio_address parameter
**Location:** `app/utils/audio.py:51-94`, `app/services/audio/audio_service.py:71-82`  
**CWE:** CWE-918 (Server-Side Request Forgery)  
**Description:** The `download_audio_from_url()` function accepts arbitrary URLs without validation. An attacker can provide URLs targeting internal services (e.g., `http://localhost:8080/admin`, `http://169.254.169.254/latest/meta-data/` for AWS metadata), internal network resources, or local files via `file://` protocol.  
**Exploitability:** HIGH - Direct parameter injection via `audio_address` query parameter on `/stream/v1/asr` or `/v1/audio/transcriptions` endpoints.  
**Severity:** Critical  
**Points:** +10

It’s worth noting that the location contains the exact lines of the files, allowing the subsequent agents to perform range queries to inspect the code as a first pass. Now, let’s take a look at what the Skeptic thinks about the SSRF:

### Vuln #1: SSRF via audio_address parameter (10 points)

**Counter-argument:**
Looking at `app/utils/audio.py:51-94`, the `download_audio_from_url()` function uses `requests.get()` on arbitrary URLs without any URL scheme validation. The code at line 71 directly calls:
```python
response = requests.get(url, timeout=30, stream=True)
```

There is **NO validation** of:
- URL scheme (file://, gopher://, dict://, etc.)
- Private IP ranges (10.x.x.x, 192.168.x.x, 127.0.0.1, 169.254.169.254)
- Internal hostnames

The `requests` library by default follows redirects, making this exploitable. An attacker could:
1. Request `file:///etc/passwd` (requests library doesn't support this, but still bad practice)
2. Request `http://169.254.169.254/latest/meta-data/` for AWS metadata
3. Request internal services on `http://localhost:8080/admin`

**Verdict:** This is a **REAL vulnerability**. The requests library does accept arbitrary URLs and will make HTTP requests to internal services.

**Confidence:** 95%  
**Decision:** ACCEPT  
**Points risked:** N/A (accepting as real)

The validation includes some explanation of why the vulnerability is real or a false positive. Now let’s take a look at what the Arbiter thinks about this one:

### Vuln #1: SSRF via audio_address parameter
| | |
|---|---|
| **Hunter's Claim** | `download_audio_from_url()` accepts arbitrary URLs without validation, enabling SSRF |
| **Skeptic's Counter** | (Not disputed - accepted as real) |
| **My Analysis** | Code at `app/utils/audio.py:71` directly calls `requests.get(url, ...)` with NO validation of URL scheme, private IP ranges, or internal hostnames. Attacker can request `http://169.254.169.254/latest/meta-data/` for AWS credentials or scan internal services. |
| **VERDICT** | **CONFIRMED VULNERABILITY** |
| **Confidence** | High |
| **Severity** | CVSS 8.6 (High) - CWE-918 |

Basically, the Arbiter compares the Hunter’s claim and the Skeptic’s counter, judges the vulnerability, including verifying the source, and decides which side earns the points.
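A typical remediation for such a finding is an allow-list check on the URL before downloading. A minimal sketch (a hypothetical helper, not code from funasr-api) that blocks non-HTTP schemes and private or link-local targets:

```python
import ipaddress
import socket
from urllib.parse import urlparse

def is_safe_url(url: str) -> bool:
    """Reject URLs that could reach internal services (SSRF guard sketch)."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False  # blocks file://, gopher://, dict://, ...
    if parsed.hostname is None:
        return False
    try:
        # Resolve the host and vet every address it maps to.
        infos = socket.getaddrinfo(parsed.hostname, None)
    except socket.gaierror:
        return False
    for info in infos:
        ip = ipaddress.ip_address(info[4][0])
        if ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved:
            return False
    return True
```

Since `requests` follows redirects by default, each redirect hop would need the same check (or redirects disabled with `allow_redirects=False`) for the guard to hold.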

Disproven vulnerability

Let’s take a look at a vulnerability that will be disproven by the Skeptic and see what the Arbiter thinks about it.

### 11. MD5 Used for Task ID Generation
**Location:** `app/utils/common.py:28`  
**CWE:** CWE-328 (Use of Weak Hash)  
**Description:** MD5 is used for generating task IDs. While not directly a security issue here, MD5 is cryptographically broken and its use indicates potential for similar usage in security-sensitive contexts.  
**Exploitability:** LOW - Not directly exploitable  
**Severity:** Medium  
**Points:** +5

Here the exploitability is rated as low, but the finding is still an example of CWE-328: Use of Weak Hash. The Skeptic attempts to disprove this one, arguing that MD5 is not used for cryptographic tokens:

### Vuln #11: MD5 Used for Task ID Generation (5 points)

**Counter-argument:**
At `app/utils/common.py:28`:
```python
task_id = hashlib.md5(combined.encode()).hexdigest()
```

MD5 is used for generating **non-security-sensitive task identifiers**. Task IDs are:
1. Not used for authentication
2. Not cryptographic tokens
3. Only used for request correlation/logging

Using MD5 for task ID generation is perfectly acceptable - it's just creating a unique string from timestamp + UUID. The report even admits "not directly exploitable".

**Confidence:** 98%  
**Decision:** DISPROVE  
**Points gained:** +5

Here, the Arbiter agrees with the Skeptic and decides on the false positive:

### Vuln #11: MD5 Used for Task ID Generation
| | |
|---|---|
| **Hunter's Claim** | MD5 is cryptographically broken |
| **Skeptic's Counter** | MD5 is fine for non-security identifiers |
| **My Analysis** | The Skeptic is **CORRECT**. Task IDs at `common.py:28` are used for request correlation/logging, not for authentication or integrity. MD5 is perfectly acceptable for generating unique identifiers. Collision resistance is irrelevant here. |
| **VERDICT** | **FALSE POSITIVE** |
| **Confidence** | High |
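For context, a generator in this style looks roughly like the following (my reconstruction of the pattern, not the actual funasr-api code). Since the input already contains a UUID, the MD5 merely shortens an already-unique string into a fixed-length hex identifier:

```python
import hashlib
import time
import uuid

def make_task_id() -> str:
    # Correlation/logging identifier only; MD5's broken collision
    # resistance is irrelevant for this purpose, as the Skeptic argues.
    combined = f"{time.time_ns()}-{uuid.uuid4()}"
    return hashlib.md5(combined.encode()).hexdigest()
```

If the scanner noise bothers you, `uuid.uuid4().hex` alone yields an equally unique 32-character identifier without touching hashlib at all.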

Running the experiment

In addition to the multi-stage analysis involving the trio, we will also mix models to check for differences in quality and in the cost-quality ratio. To run the experiment, I used Claude Code with Opus 4.6 and Sonnet 4.6 as the models. Having saved the prompts as files, one can easily execute the claude CLI in autonomous mode and reference the needed output files when running the Skeptic and Arbiter stages. To prepare the summary, I used Claude session dashboard v0.4.5, a Web UI for visualizing Claude sessions.

Here are the different configs for the runs:

  • Sonnet, Opus: Baseline prompt with the given model.
  • HSA-A: Sonnet 4.6 as Hunter and Skeptic with Opus 4.6 as Arbiter
  • HSA-B: Sonnet 4.6 as Hunter with Opus 4.6 as Skeptic and Arbiter
  • HSA-C: Opus 4.6 for all three personas

Results

| Run | Considered | Confirmed | Rate | Severities | API cost |
|---|---|---|---|---|---|
| Sonnet | 20 | - | - | 2C / 6H / 7M / 5L | $0.49 |
| Opus | 15 | - | - | 1C / 4H / 5M / 4L / 1I | $2.92 |
| HSA-A | 22 | 11 | 50% | 2C / 1H / 5M / 3L | $5.84 |
| HSA-B | 22 | 7 | 32% | 2C / 0H / 2M / 3L | $6.91 |
| HSA-C | 22 | 5 | 23% | 0C / 1H / 1M / 3L | $9.70 |

The mixed-model pipelines (HSA-A, HSA-B) retained a higher number of critical issue candidates. Opus is more aggressive in disproving the identified candidates: the Opus-only run was the most conservative and kept just five, 23% of all that the Hunter identified. It’s surprising that, unlike in the single-prompt baselines, the Hunter prompt in the three HSA runs led Sonnet and Opus to identify the same number of candidates (22), showing how the prompt encourages broad exploration.

Vulnerability candidates by type

Looking at the vulnerability types (types merged across findings using gpt-5.4 for simplicity), we can check the overlap between the baseline single-prompt runs and the multi-prompt approach. In each run, a vulnerability type was either identified and confirmed (✅), identified and disproven (❌), or not identified at all. The types that surfaced across the Sonnet, Opus, HSA-A, HSA-B, and HSA-C runs were:

  • Audio URL fetching / SSRF / URL validation
  • Authentication enforcement / auth bypass
  • Command/path handling
  • Container/runtime hardening
  • CORS / cross-origin policy
  • Credential exposure via query strings or logs
  • Cryptographic / identifier hygiene
  • Error and client-facing info disclosure
  • HTTPS / secure transport enforcement
  • Logging / audit handling
  • Resource exhaustion / rate limiting
  • Security headers / browser hardening
  • Temp/static file handling
  • Token validation strength
  • WebSocket origin handling

The vulnerability types common across the runs relate to file URL processing, where the code performs no validation, and to resource exhaustion, where the full file (max. 2 GB) is read into memory before being processed by ffmpeg. The client-facing information disclosure finding concerns potential logging of information or exceptions without sanitizing the data. The Opus-only pipeline (HSA-C) filtered out vulnerabilities that would be hard to exploit without having already gained access to the server.
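The resource-exhaustion finding has an equally mechanical mitigation: stream the body to disk with a size cap instead of reading it into memory whole. A sketch (hypothetical helper; the 50 MB limit is an arbitrary placeholder):

```python
def save_with_limit(chunks, dest_path, max_bytes=50 * 1024 * 1024):
    """Write an iterable of byte chunks to disk, aborting past max_bytes."""
    total = 0
    with open(dest_path, "wb") as f:
        for chunk in chunks:
            total += len(chunk)
            if total > max_bytes:
                raise ValueError("audio exceeds size limit")
            f.write(chunk)
    return total
```

With `requests`, the chunks would come from `response.iter_content(chunk_size=1 << 20)` on a `stream=True` response, so at most one chunk beyond the limit ever touches memory.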

With the output from the different runs, the hard work would begin: the actual validation of the vulnerabilities, left as an exercise to the interested reader.

Summary

If you’re into vulnerability hunting, this post leaves you with more open questions than answers. On purpose, I used a code example that does not have a ready ground-truth solution, expecting this would lower bias from training data or vulnerability disclosures. It certainly would be interesting to try other model sets, for example the GPT-5.x or Qwen3.5 families, which I might do another time.

What did we learn from the experiment? Mixing models allows for broader exploration of vulnerability candidates at a lower cost, making the hunting stage a good candidate for cheaper models. Using a more powerful model to validate the candidates allows quicker pruning of results that may not matter much. To make good use of this property, system- or environment-specific considerations would need to be encoded in the prompt(s), increasing the chance that the model gets the right context rather than resorting to averaging over its training data. This problem-specific context could be given at least to the Arbiter.

I also found reading the detailed arguments from the Hunter and the Skeptic useful. These debate outcomes can help in understanding the analysis results and in prioritization. As such, the Hunter, Skeptic, Arbiter prompting technique can be applied to other problems that start with exploration and require scoring against a set of criteria. It’s easy to imagine scenarios where bugs are identified in a code base, verified by a Skeptic who could even run mutated code to validate the bug, and where the Arbiter performs additional scoring against business criteria or value. The same can be applied to architectural design, starting with a persona exploring the key design considerations and producing a draft. Then we let the draft be challenged by multiple personas arguing for simplicity, cost, or certain reference architectures/patterns. In this case, the Arbiter does not need to make a final call between solution A and B, but can instead score solutions using a scorecard or a set of criteria informed by architectural principles.