Task Runner
Choose a task and run the deterministic baseline to inspect the full trajectory.
Scoreboard
Live summary of the current task and the multi-task baseline run.
Episode Output
Visual baseline trajectory, readable action summaries, and final grading highlights.
Reward Curve
Final Outcome
Judge Lens
Models a safety-critical recall workflow that QA, operations, and supply-chain teams actually perform.
The hard task demands precise containment: unsafe stock must be isolated without pulling the safe stock mixed in with it, all under partial observability.
Deterministic graders evaluate precision, coverage, investigation depth, and efficiency with reproducible scores.
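
As a rough illustration of what a deterministic grader over these four dimensions might look like, here is a minimal sketch. The function name `grade_episode`, its parameters, the `GradeReport` fields, and every formula are assumptions made for this example, not the project's actual grading code. Each score is a pure function of the episode record, so the same trajectory always produces the same numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GradeReport:
    """Hypothetical container for the four graded dimensions, each in [0, 1]."""
    precision: float
    coverage: float
    investigation_depth: float
    efficiency: float


def grade_episode(quarantined, unsafe, inspected, all_lots, steps_used, max_steps):
    """Score one episode with no randomness (all formulas are assumed)."""
    quarantined, unsafe = set(quarantined), set(unsafe)
    inspected, all_lots = set(inspected), set(all_lots)

    # Unsafe lots that the agent actually contained.
    hits = len(quarantined & unsafe)

    # Precision: share of quarantined lots that were truly unsafe.
    precision = hits / len(quarantined) if quarantined else 0.0
    # Coverage: share of unsafe lots that ended up quarantined.
    coverage = hits / len(unsafe) if unsafe else 1.0
    # Investigation depth: share of known lots the agent inspected.
    depth = len(inspected & all_lots) / len(all_lots) if all_lots else 0.0
    # Efficiency: unused fraction of the step budget, floored at zero.
    efficiency = max(0.0, 1.0 - steps_used / max_steps) if max_steps > 0 else 0.0

    return GradeReport(
        precision=round(precision, 4),
        coverage=round(coverage, 4),
        investigation_depth=round(depth, 4),
        efficiency=round(efficiency, 4),
    )


if __name__ == "__main__":
    report = grade_episode(
        quarantined={"A", "B", "C"},       # lot C is safe, so precision drops
        unsafe={"A", "B", "D", "E"},       # lots D and E were missed
        inspected={"A", "B", "C", "D"},
        all_lots=set("ABCDEFGH"),
        steps_used=6,
        max_steps=24,
    )
    print(report)
```

Reporting the dimensions separately, rather than folding them into one number, keeps the trade-off visible: quarantining everything would maximize coverage but collapse precision, which is the failure mode the hard task is meant to expose.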