Adversarial Sandbox (upcoming)

A planned safe-test environment for running real exploits against AI models without touching production systems

1 · Purpose

Adversarial Sandbox is envisioned as the controlled “lab” inside HackAI. Model owners will spin up an isolated copy of their model, define resource limits, and invite white-hat researchers to attack it. All runs are logged and replayable, so anyone can verify a claimed exploit before bounty funds move.


2 · What the first version needs to do

Spin up a sandbox from a frozen model hash

Keeps tests deterministic and prevents “it worked on my machine” disputes

Run user payloads (prompts, data files, API calls) under strict quotas

Lets researchers push boundaries without risking live users

Capture a trace root (input + output + system logs)

Creates a single hash that anyone can reproduce for proof

Tear down automatically after each run

No lingering state, no data leaks

All logs stay off-chain; only the hash and basic metadata are posted to the Bounty Hub contract.


3 · Conceptual workflow

  1. Request Model owner submits model hash + limits → Sandbox orchestrator boots an isolated container.

  2. Attack Researcher sends payload from the Chrome extension or CLI.

  3. Record Sandbox logs I/O, computes a Merkle root, and returns it to both parties.

  4. Verify Any peer can replay the same job; matching root = valid exploit.

  5. Clean Container deletes itself. No data leaves the sandbox unless the owner approves disclosure.


4 · Why it matters

  • Safety first: Real attacks do not endanger production systems or leak customer data.

  • Reproducibility: A single hash replaces long PDF write-ups and screencasts.

  • Fair payouts: Verifiable traces remove guesswork when the Bounty Hub releases funds.

Last updated