Paper: CyBiasBench: Benchmarking Bias in LLM Agents for Cyber-Attack Scenarios

Authors: Taein Lim, Seongyong Ju, Munhyeok Kim, Hyunjun Kim, Hoki Kim

Venue: arXiv 2026

Link: CyBiasBench (arXiv)

Introduction


Large language models (LLMs) are no longer just single-turn question-answering tools. They are rapidly evolving into autonomous agents that call tools, plan over long horizons, and carry out tasks under their own judgment. This shift is especially pronounced in cybersecurity, where offensive tasks — penetration testing, vulnerability discovery, red-teaming — are increasingly automated.

Within this trend, evaluation benchmarks have so far centered on a single question, “Did the agent succeed?”, measured as the Attack Success Rate (ASR). CTF-based subtask decomposition (Cybench), real-world CVE exploitation (CVE-Bench), and bug-bounty workflows (BountyBench) all represent meaningful but outcome-centric progress.

This paper, however, asks a question one step deeper:

“Given the same penetration test, do all agents attack the same way?”

The answer is no. The core phenomenon the paper uncovers is that each LLM agent has its own attack-selection bias, disproportionately concentrating effort on specific attack families. This tendency barely moves even when the prompt changes.

Figure 1. Per-agent selection rate (solid) versus attack success rate (dashed) over attack families. Each agent draws a distinctly different preference shape, and the families it selects most often (solid) do not match the families that actually succeed (dashed).

As Figure 1 shows, each agent’s selection-rate distribution traces a clearly different fingerprint. To quantify this systematically, our lab released CyBiasBench on arXiv. CyBiasBench runs 630 sessions in total across multiple agents, targets, and prompts, logging raw HTTP traffic and labeling each request with a deterministic classifier based on the OWASP Core Rule Set (CRS) — measuring bias and performance from the agent’s behavior itself.

Background


Why measure behavior — HTTP traffic, not self-reports

An LLM agent can describe in words which attacks it ran. But self-reports are prone to exaggeration, omission, and hallucination, and are hard to reproduce. Instead of the agent’s reasoning text, CyBiasBench intercepts externally observable HTTP requests through a proxy and classifies them into ten web-exploitation attack families.

sqli, xss, cmdi, path_traversal, auth_bypass, idor, ssrf, csrf, file_upload, info_disclosure

Classification combines OWASP CRS patterns with CAPEC-, CWE-, and OWASP WSTG-derived rules, and success is verified from HTTP responses, authentication-state changes, and target-specific heuristics — not from the agent’s self-report. The whole process was conducted with the consultation and formal approval of three cybersecurity experts, and all experiments run in isolated Docker networks with matched Kali Linux containers.
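To make the idea of a deterministic, rule-based labeler concrete, here is a minimal sketch. It is not the CyBiasBench classifier: the patterns in `RULES` are a tiny hypothetical set written for illustration (not OWASP CRS rules), the function name `classify_request` is an assumption, and the `other` fallback label is not one of the paper's ten families.

```python
import re
from urllib.parse import unquote_plus

# Hypothetical, deliberately tiny rule set (NOT the OWASP CRS rules).
# Rules are checked in order and the first match wins, so the same
# request always receives the same label.
RULES = [
    ("sqli", re.compile(r"(?i)\bunion\b.+\bselect\b|'\s*or\s+'?1'?\s*=\s*'?1")),
    ("xss", re.compile(r"(?i)<script\b|onerror\s*=|javascript:")),
    ("cmdi", re.compile(r"[;|`]\s*(?:cat|ls|whoami|curl|wget)\b")),
    ("path_traversal", re.compile(r"\.\./|\.\.\\")),
    ("ssrf", re.compile(r"(?i)=https?://(?:127\.0\.0\.1|localhost|169\.254\.169\.254)")),
]


def classify_request(url: str, body: str = "") -> str:
    """Return the first matching attack-family label, or 'other' (assumed fallback)."""
    # Decode percent-encoding so that encoded payloads still match the patterns.
    payload = unquote_plus(f"{url} {body}")
    for family, pattern in RULES:
        if pattern.search(payload):
            return family
    return "other"


print(classify_request("/search?q=%27%20OR%20%271%27%3D%271"))      # sqli
print(classify_request("/item?file=../../etc/passwd"))              # path_traversal
print(classify_request("/comment", "text=<script>alert(1)</script>"))  # xss
print(classify_request("/products"))                                # other
```

Because the label depends only on the logged request, rerunning the classifier on the same traffic reproduces the same family counts, which is the property that self-reports lack.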

Four key metrics

The paper measures agent behavior across three layers: bias, performance, and efficiency and robustness.

| Layer | Metric | Meaning |
|---|---|---|
| Bias | Entropy $H(X)$ | Diversity of the attack-family distribution (higher = more spread out) |
| Bias | Selection Rate $\mathrm{Sel}_i$ | Fraction of total attempts allocated to family $i$ |
| Performance | ASR / per-family $\mathrm{ASR}_i$ | Overall and per-family success rates |
| Efficiency & Robustness | TPS / Prompt-stability JSD | Tokens per success / distribution stability across prompts |

Prompt-stability JSD (Jensen–Shannon divergence) is what distinguishes a pattern caused by the prompt from one that reflects an agent’s intrinsic tendency. A low value means the distribution is the agent’s own trait rather than a prompt artifact.
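For reference, the bias metrics can be written out as follows. These are the standard textbook definitions, given here as an assumption about the paper's exact formulation. The symbols $n_i$ (number of requests labeled with family $i$), $P$, $Q$, and $M$ are notation introduced for this sketch. The base-2 logarithm is inferred rather than stated: the entropies reported in Result 1 reach 2.607, which exceeds $\ln 10 \approx 2.303$, the maximum possible in nats for ten families.

$$
\mathrm{Sel}_i = \frac{n_i}{\sum_{j} n_j},
\qquad
H(X) = -\sum_{i} \mathrm{Sel}_i \log_2 \mathrm{Sel}_i
$$

$$
\mathrm{JSD}(P \,\|\, Q) = \tfrac{1}{2}\, D_{\mathrm{KL}}(P \,\|\, M) + \tfrac{1}{2}\, D_{\mathrm{KL}}(Q \,\|\, M),
\qquad
M = \tfrac{1}{2}(P + Q)
$$

With base-2 logarithms, $H(X)$ ranges from 0 (all effort on one family) to $\log_2 10 \approx 3.32$ (uniform over ten families), and JSD ranges from 0 (identical distributions) to 1 (disjoint support).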

Benchmark


Design overview

CyBiasBench consists of three stages: (1) prompt design → (2) agent penetration testbed → (3) evaluation metrics.

Overview of CyBiasBench. (Stage 1) Four prompt conditions along the guidance and output-structure axes; (Stage 2) each agent attacks three target applications inside an isolated Kali container; (Stage 3) HTTP traces are classified into ten attack families to measure bias, performance, and efficiency.

Prompt space $\mathcal{P}$. Prompts vary along two axes. Guided/Unguided controls whether a fixed list of attack-family labels is provided up front; Structured/Unstructured controls the reporting format (a per-line JSONL log versus a final summary table). The two axes yield four prompt conditions.

Agent space $\mathcal{A}$ and target space $\mathcal{T}$. We evaluate five agents — Claude (Opus 4.5), Kimi (k2.5), GLM (5.1), Codex (GPT-5.2 codex), and Gemini (2.5 Pro). The targets are OWASP Juice Shop, which broadly covers the OWASP Top 10; MLflow 2.9.2, a real ML platform with documented CVEs (RCE, path traversal, SSRF); and Vuln-Shop, a controlled target for classifier calibration.

Result 1 — Each agent has a different attack-selection bias

Across the 36 free-choice sessions per agent, both the most-selected family and the degree of concentration diverge sharply.

| Agent | Most selected family ($\mathrm{Sel}_i$) | $H(X)$ | Selection CR1 | Session ASR |
|---|---|---|---|---|
| Claude | info_disclosure (25.3%) | 2.607 | 32.1% | 0.324 |
| Kimi | sqli (23.9%) | 2.376 | 34.5% | 0.257 |
| GLM | auth_bypass (21.6%) | 2.202 | 45.2% | 0.302 |
| Codex | info_disclosure (31.5%) | 1.652 | 50.7% | 0.213 |
| Gemini | sqli (22.7%) | 1.122 | 66.6% | 0.317 |

Claude and Kimi try a relatively broad set (CR1 32–35%), whereas Codex and Gemini concentrate on a few families (CR1 50.7%, 66.6%). Session-level Kruskal–Wallis tests confirm these structural differences are statistically significant ($p \le 2.3\times10^{-10}$).
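As a rough illustration of how the columns in this table could be computed from one session's labeled requests, consider the sketch below. Two things are assumptions: entropy is taken in bits (base 2), and CR1 is read as the top-1 concentration ratio, i.e. the share of the single most-selected family within a session. The post does not define CR1, and the paper may aggregate it differently. The toy session is invented.

```python
import math
from collections import Counter


def selection_rates(labels):
    """Fraction of attempts allocated to each attack family."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {family: n / total for family, n in counts.items()}


def entropy_bits(rates):
    """Shannon entropy of the family distribution, in bits (base 2 assumed)."""
    return -sum(p * math.log2(p) for p in rates.values() if p > 0)


def cr1(rates):
    """Top-1 concentration ratio (assumed meaning of CR1): largest single share."""
    return max(rates.values())


# Invented toy session: 8 classified requests.
session = ["sqli"] * 4 + ["xss"] * 2 + ["idor"] + ["ssrf"]
rates = selection_rates(session)

print(rates["sqli"])        # 0.5
print(entropy_bits(rates))  # 1.75
print(cr1(rates))           # 0.5
```

Under this reading, a low entropy paired with a high CR1 (as for Gemini) means most requests in a session go to a single family.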

Result 2 — The bias persists across prompt changes

Adding guidance or changing the report format does shift individual selection rates somewhat. But the shape of the distribution across attack families is not substantially reorganized: the distribution change across prompt conditions (mean JSD 0.0379) is smaller than the between-agent difference (mean 0.0543). In other words, “who is attacking” determines the distribution more strongly than “which prompt was used.” A random forest that identifies the agent from its selection pattern alone reaches 65% accuracy (against a 20% random baseline).
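The within-agent versus between-agent comparison can be sketched as follows. This is an illustration under assumptions, not the paper's evaluation script: the aggregation (mean pairwise JSD across prompt conditions for each agent, versus pairwise JSD between prompt-averaged agent distributions), the base-2 logarithm, and the toy distributions over three families are all invented for the example.

```python
import math
from itertools import combinations


def jsd(p, q):
    """Jensen-Shannon divergence (base 2) between two aligned distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing to the KL divergence.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def mean_pairwise_jsd(dists):
    """Average JSD over all unordered pairs of distributions."""
    pairs = list(combinations(dists, 2))
    return sum(jsd(p, q) for p, q in pairs) / len(pairs)


def average(dists):
    """Element-wise mean of several distributions."""
    return [sum(col) / len(dists) for col in zip(*dists)]


# Invented toy data: one selection distribution per prompt condition.
toy = {
    "agent_a": [[0.6, 0.3, 0.1], [0.5, 0.4, 0.1]],
    "agent_b": [[0.1, 0.3, 0.6], [0.2, 0.2, 0.6]],
}

within = sum(mean_pairwise_jsd(d) for d in toy.values()) / len(toy)
between = mean_pairwise_jsd([average(d) for d in toy.values()])

print(within < between)  # True
```

When `within` is smaller than `between`, the identity of the agent moves the distribution more than the prompt does, which is the pattern the post reports (0.0379 versus 0.0543).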

Result 3 — Frequently selected ≠ frequently successful

The most striking finding is the decoupling of selection and success. In Figure 1, the solid (selection) and dashed (success) curves do not align. Agents pour substantial effort into low-success families and sometimes neglect high-success ones. For instance, Codex selects info_disclosure most often (31.5%) and is highly concentrated (CR1 50.7%), yet has the lowest session ASR at 0.213. Attack-family diversity ($H(X)$) also fails to predict session ASR (no significant association for four of five agents, $\lvert\rho\rvert<0.23$). The conclusion: selection bias is a behavioral trait of the agent, not a performance signal.
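A small sketch shows how selection and success can come apart once each request carries a family label and a verified outcome. The record format `(family, succeeded)`, the function name `per_family_stats`, and the toy data are assumptions for illustration. The paper's session-level ASR may be defined differently from the per-request ratio used here.

```python
from collections import defaultdict


def per_family_stats(records):
    """Map each family to (selection rate, success rate) from (family, succeeded) pairs."""
    attempts = defaultdict(int)
    successes = defaultdict(int)
    for family, succeeded in records:
        attempts[family] += 1
        successes[family] += int(succeeded)
    total = sum(attempts.values())
    return {
        family: (attempts[family] / total, successes[family] / attempts[family])
        for family in attempts
    }


# Invented toy log: heavy effort on a family that rarely succeeds.
records = (
    [("info_disclosure", False)] * 5
    + [("info_disclosure", True)]
    + [("sqli", True)] * 2
    + [("sqli", False)]
    + [("idor", True)]
)

stats = per_family_stats(records)
most_selected = max(stats, key=lambda f: stats[f][0])
most_successful = max(stats, key=lambda f: stats[f][1])

print(most_selected)    # info_disclosure
print(most_successful)  # idor
```

In this toy log, 60% of the effort goes to the family with the lowest success rate, the same kind of mismatch that the solid and dashed curves show.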

Going further — Bias Injection and Bias Momentum


In real operations, users often explicitly steer an agent — e.g., “Please focus on thoroughly analyzing XSS vulnerabilities.” This paper calls such user-driven intervention bias injection, and measures the fraction of effort an agent allocates to the requested family (compliance) after being pointed at each of the ten attack families. With 10 families × 3 targets × 3 repetitions per agent, this adds 450 more sessions.
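The session counts quoted in this post are mutually consistent, as a short accounting sketch shows. The one inferred value is `FREE_CHOICE_REPETITIONS = 3`: the post gives three repetitions only for bias injection, and three free-choice repetitions is the value that reproduces the 36 sessions per agent reported in Result 1.

```python
AGENTS = 5              # Claude, Kimi, GLM, Codex, Gemini
TARGETS = 3             # Juice Shop, MLflow 2.9.2, Vuln-Shop
PROMPT_CONDITIONS = 4   # guided/unguided x structured/unstructured
FAMILIES = 10           # attack families used for bias injection
INJECTION_REPETITIONS = 3
FREE_CHOICE_REPETITIONS = 3  # assumption inferred from 36 sessions per agent

free_choice_per_agent = PROMPT_CONDITIONS * TARGETS * FREE_CHOICE_REPETITIONS
free_choice_total = AGENTS * free_choice_per_agent
injection_total = AGENTS * FAMILIES * TARGETS * INJECTION_REPETITIONS

print(free_choice_per_agent)                 # 36
print(injection_total)                       # 450
print(free_choice_total + injection_total)   # 630
```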

Cell-level compliance across the 50 (agent, family) cells. (Left) strongly correlated with the free-choice preference $\mathrm{Sel}_i$ ($\rho=+0.529$); (right) not significantly correlated with the family's performance $\mathrm{ASR}_i$ ($\rho=-0.248$).

The result was clear. Agents readily comply with steering toward families they already prefer, but resist steering toward families that conflict with their preference. Compliance is well predicted by prior preference ($\mathrm{Sel}_i$, $\rho=+0.529$) but not by the family’s performance ($\mathrm{ASR}_i$). The paper names this phenomenon — where free-choice preference persists like momentum even under explicit steering — bias momentum.
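The correlation analysis can be illustrated with a dependency-free sketch. It assumes that $\rho$ denotes Spearman's rank correlation, which the post does not state, and it uses invented values for four hypothetical (agent, family) cells rather than the paper's 50.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented cells: free-choice selection rate and compliance under steering.
sel = [0.05, 0.10, 0.20, 0.30]
compliance = [0.40, 0.35, 0.70, 0.90]

rho = spearman(sel, compliance)
print(rho > 0)  # True
```

A positive `rho` between `sel` and `compliance` corresponds to the left panel of the figure: cells where the agent already preferred the family are the cells where steering is followed most.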

Mean session ASR under free-choice versus bias injection. All five agents show $\Delta\mathrm{ASR} \le 0$ — forcing a specific attack actually lowers performance (Gemini −0.136, Codex −0.073, etc.).

Moreover, forcing a specific attack did not improve performance — it lowered it. All five agents had mean $\Delta\mathrm{ASR} \le 0$ (largest drop: Gemini −0.136), and higher-compliance cells were not compensated by higher ASR. Since the per-family technical performance ranking was largely preserved before and after steering ($\rho=+0.702$), this ASR drop is read as an allocation problem, not a skill problem — the signature of bias momentum.

Conclusion


CyBiasBench directly measures a behavioral axis of LLM agents that attack success rate alone cannot reveal — the attack-selection bias. The paper’s message reduces to two points.

  • (1) Free-choice attack allocation is agent-specific and barely moves with the prompt. A single-condition ASR number therefore underdescribes agent behavior.
  • (2) Compliance under explicit steering follows prior preference, not technical skill (bias momentum). Being better at a family does not mean performance rises when the agent is forced toward it.

For security audits using autonomous agents, this implies that each agent’s free-choice pattern should be considered alongside per-family performance. For reproducibility, our lab releases an interactive results dashboard (trustworthyai.co.kr/CyBiasBench) and a reproducibility artifact with aggregated statistics and evaluation scripts (GitHub).
