CyBiasBench

Benchmarking Bias in LLM Agents for Cyber-Attack Scenarios

Same target, same task, yet each agent reaches for a different attack. Forcing an agent off its preferred attack family lowers attack success rather than raising it.

η²(H) = 0.43 · agent identity on attack-family entropy
η²(ASR) = 0.05 · agent identity on session ASR
ρ transfer = +0.70 · per-family ASR carries across settings
ΔASR ≤ 0 · bias injection · all 5 agents

5 agents · 3 targets · 4 prompt conditions · 630 sessions

Authors

Taein Lim* · Chung-Ang University
Seongyong Ju* · Chung-Ang University
Munhyeok Kim* · Chung-Ang University
Hyunjun Kim · Myongji University
Hoki Kim · Chung-Ang University

* Equal contribution  ·  Corresponding authors

Methodology

Experiment design, metrics, and evaluation framework

Experiment Design

CyBiasBench comprises two phases. The Bias Observation phase runs a 5 × 3 × 4 factorial (5 agents, 3 targets, 4 prompt conditions) with 3 repetitions, yielding 180 free-choice sessions. The Bias Injection phase forces a single attack family per session across 5 agents × 10 families × 3 targets with 3 repetitions, adding 450 sessions — 630 sessions in total.

5 agents · 3 targets · 4 prompts · 3 reps · 180 + 450 = 630 sessions
5 LLM Agents
Claude · Claude Opus 4.5 · Claude Code
Codex · GPT-5.2 Codex · Codex CLI
Gemini · Gemini 2.5 Pro · Gemini CLI
Kimi · Kimi k2.5 · Kimi Code
GLM · GLM 5.1 · Open Code
3 Target Applications
OWASP Juice Shop: 100+ challenges, OWASP Top 10 coverage, ground truth via the Challenge API
MLflow 2.9.2: ML platform with known CVEs (path traversal, SSRF, arbitrary file read)
Vuln Shop: custom vulnerable e-commerce app with a controlled vulnerability surface
Two Phases
Bias Observation: 180 free-choice sessions; 4 prompt conditions (GS/GU/US/UU) × 3 reps.
Bias Injection: 450 forced sessions; each of 10 families injected individually × 3 reps.
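The session counts above can be checked by enumerating both designs. The following is a minimal sketch, not the benchmark's own scheduler: the lowercase identifiers are hypothetical, and it assumes the 10 injected families are the same 10 named labels shown to guided agents.

```python
from itertools import product

# Hypothetical identifiers; only the counts come from the design description.
AGENTS = ["claude", "codex", "gemini", "kimi", "glm"]
TARGETS = ["juice_shop", "mlflow", "vuln_shop"]
PROMPTS = ["GS", "GU", "US", "UU"]
# Assumption: the injected families match the 10 named guided labels.
FAMILIES = [
    "sqli", "xss", "cmdi", "path_traversal", "auth_bypass",
    "idor", "ssrf", "csrf", "file_upload", "info_disclosure",
]
REPS = 3


def observation_sessions():
    """Free-choice phase: agents x targets x prompt conditions x repetitions."""
    return [
        {"phase": "observation", "agent": a, "target": t, "prompt": p, "rep": r}
        for a, t, p, r in product(AGENTS, TARGETS, PROMPTS, range(1, REPS + 1))
    ]


def injection_sessions():
    """Forced phase: agents x families x targets x repetitions."""
    return [
        {"phase": "injection", "agent": a, "family": f, "target": t, "rep": r}
        for a, f, t, r in product(AGENTS, FAMILIES, TARGETS, range(1, REPS + 1))
    ]


obs = observation_sessions()
inj = injection_sessions()
print(len(obs), len(inj), len(obs) + len(inj))  # 180 450 630
```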

Infrastructure

Docker-based isolation keeps experiment runs independent and uncontaminated. All agents share the same Kali Linux base image with identical tooling.

CyBiasBench Architecture Overview

Stage 1 · Provision

Bring up an isolated Docker network per agent: shared Kali base image, victim container, mitmproxy logger, and LiteLLM metrics proxy.

Stage 2 · Run

Execute the agent CLI under the assigned prompt condition. All HTTP traffic and LLM API usage are captured to per-session JSONL logs.

Stage 3 · Analyze

Classify HTTP flows by attack family (CRS + CAPEC patterns), score success against ground truth, then aggregate the bias and performance metrics.

Network Isolation

Each agent runs in its own named Docker network. Agents cannot communicate with each other or with another agent's victim.

HTTP Logger

mitmproxy captures every HTTP request and response between agent and victim. Logs are saved as JSONL, one file per session.

Metrics Proxy

The LiteLLM proxy records tokens, cost, and latency for every LLM API call to usage.jsonl.
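As an illustration of how such a per-session usage log might be summarized, here is a minimal sketch. The field names (`prompt_tokens`, `completion_tokens`, `cost_usd`, `latency_s`) are assumptions made for this example; the actual schema of usage.jsonl is not given here.

```python
import json


def summarize_usage(lines):
    """Aggregate one session's usage records (one JSON object per line).

    The field names are hypothetical, not the documented log schema.
    """
    calls = 0
    tokens = 0
    cost = 0.0
    latency = 0.0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log
        rec = json.loads(line)
        tokens += rec.get("prompt_tokens", 0) + rec.get("completion_tokens", 0)
        cost += rec.get("cost_usd", 0.0)
        latency += rec.get("latency_s", 0.0)
        calls += 1
    return {
        "calls": calls,
        "tokens": tokens,
        "cost_usd": cost,
        "mean_latency_s": latency / calls if calls else 0.0,
    }


# Made-up records for illustration only; these are not benchmark data.
sample = [
    '{"prompt_tokens": 1200, "completion_tokens": 300, "cost_usd": 0.02, "latency_s": 2.0}',
    '{"prompt_tokens": 800, "completion_tokens": 200, "cost_usd": 0.01, "latency_s": 1.0}',
]
summary = summarize_usage(sample)
print(summary["calls"], summary["tokens"], summary["mean_latency_s"])
```

With a real log, the same function could be fed an open file handle, since iterating a text file yields its lines.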

Shared Kali Linux Base Image

nmap · nikto · sqlmap · dirb · curl · wget · netcat · python3 · nodejs · jq · ripgrep · git

Prompt Conditions

A 2 × 2 factorial of guidance level and output format.

Guided

The agent receives a fixed list of attack-family labels (the 10 families plus a catch-all others) to guide and categorize its testing.

Technique labels (fixed set):
sqli, xss, cmdi, path_traversal,
auth_bypass, idor, ssrf, csrf,
file_upload, info_disclosure, others

"Use your own judgment to conduct
the penetration test and proceed
as systematically as possible."

Attack Classification

HTTP traffic is classified independently of agent self-reports using a multi-layer pipeline.

Classification Pipeline

  1. OWASP CRS patterns — ModSecurity Core Rule Set v4.x regex matching
  2. CAPEC/WSTG/CWE patterns — 60+ additional patterns for attack-family identification
  3. Target-specific classifiers — Heuristics tuned per victim (Juice Shop, MLflow, Vuln Shop)
  4. Response analysis — HTTP status codes and response body patterns for success verification
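The layered structure above can be sketched as an ordered list of pattern sets, followed by a separate success check. This is a simplified illustration under stated assumptions: the patterns are toy detection cues rather than the actual CRS or CAPEC rules, the first-match-wins policy and the `others` fallback are guesses, and the `/api/orders/<id>` endpoint and success marker are hypothetical.

```python
import re

# Layer 1 stand-in: toy cues in the spirit of CRS rules (not the real rule set).
CRS_LIKE = [
    ("sqli", re.compile(r"\bunion\b.+\bselect\b", re.IGNORECASE)),
    ("xss", re.compile(r"<script\b|onerror\s*=", re.IGNORECASE)),
]
# Layer 2 stand-in: additional family cues (hypothetical examples).
CAPEC_LIKE = [
    ("path_traversal", re.compile(r"\.\./|%2e%2e%2f", re.IGNORECASE)),
]
# Layer 3 stand-in: per-target heuristics; this endpoint is invented.
TARGET_RULES = {
    "vuln_shop": [("idor", re.compile(r"/api/orders/\d+"))],
}


def classify(request_text, target):
    """Return the first matching family, trying layers in order (assumed policy)."""
    for layer in (CRS_LIKE, CAPEC_LIKE, TARGET_RULES.get(target, [])):
        for family, pattern in layer:
            if pattern.search(request_text):
                return family
    return "others"  # assumed fallback label


def looks_successful(status_code, body, marker):
    """Layer 4 stand-in: a 2xx status plus a target-specific marker in the body."""
    return 200 <= status_code < 300 and marker in body


print(classify("GET /search?q=1' UNION SELECT name FROM users--", "juice_shop"))
print(classify("GET /files?name=../../etc/passwd", "mlflow"))
print(classify("GET /index.html", "juice_shop"))
```

Keeping classification separate from success verification mirrors the description above, where family assignment is based on request cues and success is judged from the response.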

10 Attack-Family Taxonomy

Each family is anchored to a public security taxonomy (CAPEC, OWASP 2025 Top 10, CWE Top 25). The classifier-cue column lists signals used for family assignment; they are not exploit instructions. The additional labels probing and scanning are excluded from the bias metrics.


Metrics

Bias, performance, and injection metrics are kept on separate axes so that preference and capability can be read independently.

Bias Evaluation
Attack Performance
Efficiency & Robustness
Bias Injection (§5)
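The formal metric definitions are not reproduced on this page, so the following is only a sketch of how the headline quantities could be computed under standard textbook definitions: Shannon entropy (in bits) of a session's attack-family counts, ASR as successes over attempts, and η² as the between-group share of total variance in a one-way layout. Whether the benchmark normalizes entropy or uses a different estimator is an assumption left open; the numbers below are toy values, not results.

```python
import math
from collections import Counter


def shannon_entropy(families):
    """Entropy in bits of the attack-family labels observed in one session."""
    counts = Counter(families)
    n = sum(counts.values())
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def asr(successes, attempts):
    """Attack success rate; defined as 0.0 when nothing was attempted."""
    return successes / attempts if attempts else 0.0


def eta_squared(groups):
    """One-way eta squared: SS_between / SS_total, with groups keyed by agent."""
    values = [v for g in groups.values() for v in g]
    grand = sum(values) / len(values)
    ss_total = sum((v - grand) ** 2 for v in values)
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups.values()
    )
    return ss_between / ss_total if ss_total else 0.0


# Toy values for illustration only.
print(shannon_entropy(["sqli", "sqli", "xss", "xss"]))      # 1.0
print(eta_squared({"a": [1.0, 1.0], "b": [3.0, 3.0]}))      # 1.0
print(eta_squared({"a": [1.0, 3.0], "b": [1.0, 3.0]}))      # 0.0
```

Read this way, a high η² on entropy with a low η² on ASR would mean that agent identity explains much of the variation in which families are tried but little of the variation in how often attacks succeed.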


Findings