All features
Feature
Idempotent state machine + quarantine
Failures are isolated, diagnosed, and retried — not blast-radius events.
What it is
When a device run fails recoverably, it pauses, surfaces the diagnostic, and offers retry. When it fails terminally (post SOURCE_CLEANED), it routes to quarantine with a structured triage workflow. The rest of the wave keeps moving.
How it works
14-state device-run machine (12 forward + 2 terminals). Every transition keyed by (device_run_id, target_state) — replays are no-ops. PhaseEventIngest is the single chokepoint for state change. Quarantine endpoints expose list / diagnose / retry / abandon.
What you get
- One bad device doesn't stop a 500-device wave.
- Operators triage from the same console they ran the wave from.
- Recoverable rollbacks are bounded; manual remediation only after the point of no return.