Safety-Critical OpenEnv Benchmark

RecallTrace OpenEnv

A real-world supply-chain recall benchmark where agents must trace contaminated lots, follow relabeled inventory lineage, inspect evidence, and quarantine only the unsafe stock.
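As a rough illustration of what "following relabeled inventory lineage" can involve, the sketch below walks a relabel graph breadth-first from a contaminated source lot. The lot IDs, the `relabel_edges` structure, and the function name `trace_affected_lots` are hypothetical and are not taken from the RecallTrace environment itself. Which of the reached lots are actually unsafe would still depend on the evidence an agent inspects.

```python
from collections import deque


def trace_affected_lots(source_lot, relabel_edges):
    """Return every lot ID reachable from source_lot via relabel edges.

    relabel_edges is a list of (old_lot_id, new_lot_id) pairs (hypothetical format).
    """
    # Build an adjacency map from each lot to the lots it was relabeled into.
    children = {}
    for parent, child in relabel_edges:
        children.setdefault(parent, []).append(child)

    # Breadth-first walk so each lot is visited once, even with shared lineage.
    affected = {source_lot}
    queue = deque([source_lot])
    while queue:
        lot = queue.popleft()
        for child in children.get(lot, []):
            if child not in affected:
                affected.add(child)
                queue.append(child)
    return affected


# Hypothetical lineage: LOT-A was relabeled to LOT-B, then LOT-B to LOT-C.
# LOT-X and LOT-Y belong to an unrelated chain and should not be reached.
edges = [("LOT-A", "LOT-B"), ("LOT-B", "LOT-C"), ("LOT-X", "LOT-Y")]
print(sorted(trace_affected_lots("LOT-A", edges)))  # ['LOT-A', 'LOT-B', 'LOT-C']
```

A breadth-first walk is used here only because it is simple and visits each lot once; the benchmark may represent lineage differently.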

OpenEnv compliant | Deterministic grading | 3 escalating tasks | Precision containment

Average baseline: 0.9677

Hard task focus: Mixed safe/unsafe inventory

Judging edge: Operational realism over toy mechanics

Task Runner

Choose a task and run the deterministic baseline to inspect the full trajectory.

Task level

Scoreboard

Live summary of the current task and the multi-task baseline run.

Current score: -

Steps taken: -

Status: Ready

Average over all tasks: -

Run all tasks to compare easy, medium, and hard performance.

Episode Output

Visual baseline trajectory, readable action summaries, and final grading highlights.

Reward Curve

Run a task to render the reward trajectory.

Final Outcome

Readable scoring highlights will appear here.

Run a task to populate the episode trajectory.

Judge Lens

Real-world utility

Models a safety-critical recall workflow that QA, operations, and supply-chain teams actually perform.

Frontier challenge

The hard task forces precision containment of mixed safe and unsafe stock under partial observability.

Benchmark quality

Deterministic graders evaluate precision, coverage, investigation depth, and efficiency with reproducible scores.
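To make the precision and coverage criteria concrete, here is a minimal sketch of how the two could be computed from the set of lots an agent quarantined and the set that was truly unsafe. The function name `containment_metrics` and the lot IDs are hypothetical. The benchmark's actual grader formulas and weights, and its investigation-depth and efficiency terms, are not shown in this page and are not modeled here.

```python
def containment_metrics(quarantined, unsafe):
    """Return (precision, coverage) for a quarantine decision.

    precision: share of quarantined lots that were truly unsafe.
    coverage:  share of truly unsafe lots that were quarantined.
    """
    quarantined, unsafe = set(quarantined), set(unsafe)
    hits = quarantined & unsafe

    # Assumed conventions: an empty quarantine scores 0.0 precision,
    # and coverage is 1.0 when there was nothing unsafe to find.
    precision = len(hits) / len(quarantined) if quarantined else 0.0
    coverage = len(hits) / len(unsafe) if unsafe else 1.0
    return precision, coverage


# Hypothetical episode: the agent caught both unsafe lots but also
# quarantined one safe lot, so coverage is full and precision drops.
precision, coverage = containment_metrics(
    quarantined={"LOT-B", "LOT-C", "LOT-Y"},
    unsafe={"LOT-B", "LOT-C"},
)
print(round(precision, 4), coverage)  # 0.6667 1.0
```

Because the computation uses only set operations on fixed inputs, repeated runs give identical values, which is the property a deterministic grader relies on.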

Project Hub

Health endpoint | Reset endpoint | Task catalog JSON | GitHub source | Space files | Docker runtime | OpenEnv ecosystem