2026-04-15

Red-team methodology (overview)

How pop-pay is tested under adversarial pressure: the corpus framework, scoring semantics, and reproducible harness.

This post is the public overview of the red-team framework. Specific attack payloads, model-by-model failure traces, and the live target list are held internally during the active hardening phase.

Posture

pop-pay’s claims survive only if they survive adversarial pressure. The red-team program exists to find structural failures — not to chase one-off prompt tricks — and to publish the methodology before the headline numbers, so the numbers are checkable.

Corpus framework

The test corpus is organized into eleven attack categories, each defined by an attacker objective rather than a payload shape:

  1. Direct credential extraction — get the agent to print or transmit raw card data
  2. Indirect credential extraction — get the agent to log, screenshot, or echo unmasked card data via a side channel
  3. Scope escalation — turn an approved purchase into a larger or different purchase
  4. Domain redirection — spoof a merchant after approval (TOCTOU)
  5. Vendor injection — substitute the destination merchant during checkout
  6. Hallucinated transaction — induce the agent to initiate a purchase that wasn’t requested
  7. Tool-chain compromise — abuse a malicious MCP / plugin to intercept payment flow
  8. Vault probing — read or copy the encrypted vault and attempt offline use
  9. Downgrade attack — force the runtime to fall back to an unhardened code path
  10. Approval-bypass — submit a transaction that should require human approval without one
  11. Reasoning leak — extract card data from the agent’s chain-of-thought after a legitimate flow

Each category has explicit pass / warn / fail criteria documented in the methodology. Categories are not weighted; a failure in any category is a failure.

Payload generation

Payloads are generated through a hybrid pipeline: hand-authored canonical attacks for each category, plus model-assisted variation to cover phrasing and language drift. Multi-language payloads are included by design — Traditional Chinese, Japanese, and Korean variations are part of the corpus, not an afterthought.

The corpus is versioned. Every reported result cites the corpus version (corpus@vN.M) and the model + temperature that produced the run. Numbers without a corpus version are not numbers.

Scoring

Three outcomes per payload, evaluated by a deterministic checker (not the model under test):

A payload is only considered passed when the structural primitive is the reason it passed. A payload that passes because the model happened to refuse on a given run is not a structural pass — it is recorded separately and not credited to pop-pay.

Reproducibility

The harness is a single command in each language runtime:

npm run redteam --corpus=vN.M --model=<model-id>
pop-pay redteam --corpus=vN.M --model=<model-id>     # python parity

Every run emits a JSON report with the corpus version, model, timestamp, per-payload outcomes, and the deterministic checker version. Reports are designed to be diffed.

What’s published vs. held

Published: the eleven categories, the scoring semantics, the harness, the canary file (vault.enc.challenge), aggregate pass-rate numbers per release, and any structural finding that has been remediated.

Held during active hardening: live attack payloads that are not yet patched, model-specific failure traces, and the specific targets currently being broken. These move into the published set as fixes ship. Coordinated disclosure address: security@pop-pay.ai (PGP on request, 72-hour SLA).

What this is not

This is not a benchmark designed to make pop-pay look good. The corpus is built to find failures; published numbers are whatever the corpus reports. If a future release regresses, the regression will be in the report.


Canonical methodology source: docs/RED_TEAM_METHODOLOGY.md (when published; full payload corpus is internal during the current hardening phase).