Checkpoints

Human and Agent decision points in workflows

En bref

Les checkpoints sont comme les points de sauvegarde dans un jeu video : le workflow met en pause l'execution a des moments strategiques pour vous permettre de verifier ce qui s'est passe et de decider si vous voulez continuer. Vous pouvez approuver, rejeter, ou modifier les parametres avant que l'action critique ne soit executee. C'est votre filet de securite.

Points cles :

  • Points de pause pour revision humaine (HIL) ou par agent (AIL)
  • Controle des actions critiques ou irreversibles
  • Prevention des erreurs couteuses
  • Flexibilite d'ajuster le workflow en cours d'execution

Analogie : Sauvegarde de jeu video - Avant un boss difficile, le jeu sauvegarde. Si vous echouez, vous revenez au checkpoint au lieu de tout recommencer.

What Are Checkpoints?

Checkpoints are points in a workflow where execution pauses for review or decision-making. They provide control over automated workflows, ensuring critical actions are verified before proceeding.

DAG Observability

DAG Resilience

HIL (Human-in-the-Loop)

Human-in-the-Loop checkpoints pause for human review and approval.

When to Use HIL

Scenario Example
Destructive operations Deleting files, dropping tables
External actions Sending emails, creating issues
Cost implications API calls with billing
Sensitive data Accessing credentials, PII
Compliance requirements Audit trails, approvals

HIL Flow

┌─────────────────────────────────────────────────────────────────┐
│                         HIL Checkpoint                           │
│                                                                  │
│  Workflow executes Task A, Task B...                            │
│                                                                  │
│  ─────────────────────────────────────────────────              │
│  ⏸️  PAUSED: Human approval required                             │
│                                                                  │
│  Action: Delete 47 files from /data/archive                     │
│                                                                  │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐             │
│  │  ✓ Approve  │  │  ✗ Reject   │  │  ✏️ Modify  │             │
│  └─────────────┘  └─────────────┘  └─────────────┘             │
│  ─────────────────────────────────────────────────              │
│                                                                  │
│  On Approve: Continue to Task C                                 │
│  On Reject: Stop workflow, mark as cancelled                    │
│  On Modify: Adjust parameters, then continue                    │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

HIL Information Provided

When a HIL checkpoint triggers, the human sees:

Information Purpose
Task description What will happen
Parameters Specific values being used
Context Previous task results
Risk level Severity indicator
Alternatives Other options available

AIL (Agent-in-the-Loop)

Agent-in-the-Loop checkpoints delegate decisions to an AI agent rather than a human.

When to Use AIL

Scenario Example
Quality decisions Is this output good enough?
Routing logic Which path should we take?
Error recovery Should we retry or abort?
Dynamic adjustments Modify parameters based on results

AIL Flow

┌─────────────────────────────────────────────────────────────────┐
│                         AIL Checkpoint                           │
│                                                                  │
│  Workflow executes Task A...                                    │
│                                                                  │
│  ─────────────────────────────────────────────────              │
│  🤖 AGENT REVIEW                                                 │
│                                                                  │
│  Task A output: { status: "partial", items: 15, errors: 2 }     │
│                                                                  │
│  Agent analyzes:                                                │
│    • 15 items processed successfully                            │
│    • 2 errors encountered                                       │
│    • Error rate: 13%                                            │
│                                                                  │
│  Agent decides:                                                 │
│    "Error rate acceptable. Proceeding with successful items."   │
│                                                                  │
│  ─────────────────────────────────────────────────              │
│                                                                  │
│  Continue to Task B with 15 items                               │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

AIL Capabilities

An AIL checkpoint can:

Action Description
Approve/Reject Binary decision on continuing
Modify parameters Adjust next task's inputs
Add tasks Insert new tasks dynamically
Skip tasks Remove unnecessary steps
Replan Restructure remaining workflow

Combining HIL and AIL

Complex workflows can use both:

┌─────────────────────────────────────────────────────────────────┐
│  Workflow: Automated Report Generation                          │
│                                                                  │
│  [Fetch Data] ──▶ 🤖 AIL: Validate data quality                 │
│        │                                                        │
│        ▼                                                        │
│  [Generate Report] ──▶ 🤖 AIL: Check formatting                 │
│        │                                                        │
│        ▼                                                        │
│  [Send to Stakeholders] ◀── ⏸️ HIL: Approve before sending      │
│                                                                  │
│  AIL handles routine validation                                 │
│  HIL ensures human oversight for external actions               │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

When to Use

HIL: Irreversible actions, external impact, compliance, sensitive data, high stakes.

AIL: Routine context-dependent decisions, speed important, recoverable errors, objective criteria.

None: Read-only, easily reversible, tested workflow, low risk.

Checkpoint Configuration

Checkpoints are configured per task:

Task: delete_files
  checkpoint:
    type: HIL
    message: "About to delete {count} files"
    risk_level: high
    timeout: 3600  (1 hour to respond)

Task: validate_output
  checkpoint:
    type: AIL
    prompt: "Is this output acceptable?"
    fallback: reject  (if agent fails)

Exemple concret : Publication de contenu

Workflow: Publish Blog Post

Layer 0: Fetch draft + Check images + Spell check
Layer 1: Generate HTML, optimize images

🤖 AIL: Quality check (alt text, SEO, links)

Layer 2: Generate preview

⏸️ HIL: Editorial Review
  Preview URL + Post details
  Options: ✓ Approve  ✗ Reject  ✏️ Edit
  Decision: Approved (changed publish time)

Layer 3: Publish to CMS + RSS + Tweet + Sitemap

🤖 AIL: Post-publish verification
  Checks: Post live? RSS ok? Tweet sent?
  If fail → Rollback + Alert ops

Layer 4: Send notifications, update dashboard

ANALOGIE JEU VIDEO (Dark Souls) :

AIL Checkpoint (auto):
  Avant donjon: "Assez de potions? Arme reparee?"

HIL Checkpoint (humain):
  Porte du boss: SAUVEGARDE
  Vous decidez: Continuer? Revenir plus tard?
  Si mort → Retour au checkpoint

AIL Checkpoint (verif):
  Apres boss: Butin obtenu?

Cas reels :

E-commerce: Mise a jour prix (500 produits, -20%)
  🤖 AIL: Verif calculs
  ⏸️ HIL: Impact -$45k → Approuver?
  Update DB + Clear cache

Database Migration:
  🤖 AIL: Backup integrity
  Test sur staging
  🤖 AIL: Staging validation
  ⏸️ HIL: Migrate prod? (Risque HIGH, 15min downtime)
  🤖 AIL: Post-migration checks → Auto-rollback si fail

Regles:
  HIL: Irreversible, impact financier, jugement humain
  AIL: Criteres objectifs, technique, rapide
  Aucun: Lecture seule, reversible, risque faible

Next