Skip to content
All features

Feature

Idempotent state machine + quarantine

Failures are isolated, diagnosed, and retried — not blast-radius events.

What it is

When a device run fails recoverably, it pauses, surfaces the diagnostic, and offers retry. When it fails terminally (post SOURCE_CLEANED), it routes to quarantine with a structured triage workflow. The rest of the wave keeps moving.

How it works

14-state device-run machine (12 forward + 2 terminals). Every transition keyed by (device_run_id, target_state) — replays are no-ops. PhaseEventIngest is the single chokepoint for state change. Quarantine endpoints expose list / diagnose / retry / abandon.

What you get

  • One bad device doesn't stop a 500-device wave.
  • Operators triage from the same console they ran the wave from.
  • Recoverable rollbacks are bounded; manual remediation only after the point of no return.

See this feature running.