Skip to content

Workflow Lifecycle

A workflow in Ruvon goes through a series of states from creation to completion. Understanding this lifecycle helps you predict behavior, debug issues, and design robust workflows.

Lifecycle States

Every workflow exists in one of these states:

PENDING_ASYNC → WAITING_HUMAN_INPUT → PENDING_SUB_WORKFLOW
     ↓                 ↓                      ↓
   ACTIVE ←────────────┴──────────────────────┘
     ├─→ COMPLETED (success)
     ├─→ FAILED (unhandled exception)
     ├─→ FAILED_ROLLED_BACK (saga compensation completed)
     ├─→ FAILED_WORKER_CRASH (zombie detection)
     ├─→ FAILED_CHILD_WORKFLOW (sub-workflow failed)
     └─→ CANCELLED (user-initiated)

Let's explore each state and what it means.

State Definitions

PENDING_ASYNC

Meaning: An async step has been dispatched to a worker and is executing in the background.

Triggers: - Workflow executes an ASYNC type step - Execution provider (e.g., Celery) dispatches task to worker - Workflow saves state and returns control to caller

What's happening: - Worker process is executing the step function - Main workflow process is idle (not blocked) - State is persisted to database

Next state: - ACTIVE when worker completes the task and reports back - FAILED if worker encounters an unhandled exception

Duration: Seconds to hours, depending on step logic

Example scenario: Sending 10,000 emails in an async step

WAITING_HUMAN_INPUT

Meaning: Workflow is paused, waiting for external input (human approval, API callback, etc.).

Triggers: - Step function raises WorkflowPauseDirective - Typically used for approvals, manual review, or external events

What's happening: - Workflow is effectively "suspended" - No worker is actively processing - Waiting for external system or user to call resume_workflow()

Next state: - ACTIVE when resume_workflow() is called with required input - CANCELLED if user cancels the workflow

Duration: Minutes to days, depending on business process

Example scenario: Waiting for manager approval on a large purchase order

PENDING_SUB_WORKFLOW

Meaning: Workflow has spawned a child workflow and is waiting for it to complete.

Triggers: - Step function raises StartSubWorkflowDirective - Parent workflow creates child workflow - Parent saves state and returns control

What's happening: - Child workflow is executing independently - Parent is idle, checking child status periodically - Child can be in any state (ACTIVE, WAITING_HUMAN_INPUT, etc.)

Next state: - ACTIVE when child completes successfully - WAITING_CHILD_HUMAN_INPUT if child pauses for input - FAILED_CHILD_WORKFLOW if child fails

Duration: Varies based on child workflow complexity

Example scenario: Order workflow spawns inventory allocation sub-workflow

ACTIVE

Meaning: Workflow is actively executing or ready to execute the next step.

Triggers: - Workflow is created with create_workflow() - Async task completes - Human input is provided via resume_workflow() - Sub-workflow completes - Any transition that makes the workflow ready to proceed

What's happening: - Workflow is in a runnable state - Next call to next_step(user_input={}) will process a step - Step may execute immediately or transition to another state

Next state: - PENDING_ASYNC if next step is async - WAITING_HUMAN_INPUT if next step pauses - PENDING_SUB_WORKFLOW if next step spawns child - COMPLETED if no more steps - FAILED on unhandled exception

Duration: Milliseconds to seconds per step

Example scenario: Processing an order through validation, payment, and fulfillment steps

COMPLETED

Meaning: Workflow has successfully executed all steps.

Triggers: - Final step completes without errors - No more steps to execute (end of workflow reached)

What's happening: - All steps executed successfully - Final state is persisted - Workflow is immutable (no further execution)

Next state: Terminal state (no further transitions)

Duration: Permanent

Example scenario: Order shipped, payment settled, customer notified

FAILED

Meaning: An unhandled exception occurred during step execution.

Triggers: - Step function raises an exception (not a directive) - Validation error on input or output - Network timeout, database error, etc.

What's happening: - Exception is logged to audit log - Workflow state captured at point of failure - Error message and stack trace stored

Next state: - ACTIVE if user calls retry_workflow() - Otherwise terminal (manual intervention required)

Duration: Permanent unless retried

Example scenario: Payment gateway timeout during charge step

FAILED_ROLLED_BACK

Meaning: Workflow failed and saga compensation successfully executed.

Triggers: - Saga mode is enabled (workflow.enable_saga_mode()) - Step fails during execution - All compensatable steps execute in reverse order - All compensations complete successfully

What's happening: - Failed step identified - Compensation functions execute in reverse (LIFO) - Each compensation attempts to undo its step's side effects - All compensations succeed

Next state: Terminal state (workflow cannot continue)

Duration: Permanent

Example scenario: Payment charged, inventory allocated, then shipping fails. Compensation releases inventory and refunds payment.

FAILED_WORKER_CRASH

Meaning: Worker process crashed while executing an async step, detected by zombie scanner.

Triggers: - Worker sends heartbeats while processing step - Worker crashes (OOM, hardware failure, SIGKILL) - Heartbeat stops updating - Zombie scanner detects stale heartbeat (threshold: 120s default)

What's happening: - Workflow stuck in PENDING_ASYNC state - No worker is actually processing the step - Zombie scanner marks workflow as crashed

Next state: - ACTIVE if user calls retry_workflow(from_step=...) - Otherwise terminal (manual intervention required)

Duration: Permanent unless retried

Example scenario: Celery worker killed by OOM killer while processing large file upload

FAILED_CHILD_WORKFLOW

Meaning: A child workflow failed, causing the parent to fail.

Triggers: - Parent spawns child via StartSubWorkflowDirective - Child workflow encounters unhandled exception - Child transitions to FAILED or FAILED_ROLLED_BACK - Parent checks child status and propagates failure

What's happening: - Child workflow is in failed state - Parent workflow cannot proceed (child's result is required) - Parent transitions to failed state

Next state: Terminal state (both parent and child failed)

Duration: Permanent

Example scenario: Order workflow spawns payment workflow, payment fails due to invalid card

CANCELLED

Meaning: User explicitly cancelled the workflow.

Triggers: - User calls cancel_workflow(workflow_id) - Optional: User provides cancellation reason

What's happening: - Workflow execution halted immediately - If async step was running, worker continues but result is ignored - Cancellation reason logged to audit log

Next state: Terminal state (workflow cannot continue)

Duration: Permanent

Example scenario: Customer cancels order before it ships

State Transition Examples

Happy Path: Simple Workflow

1. create_workflow("OrderProcessing", data)
   → State: ACTIVE

2. next_step(user_input={})  # Validate_Order (STANDARD step)
   → Executes inline
   → State: ACTIVE (ready for next step)

3. next_step(user_input={})  # Charge_Payment (STANDARD step)
   → Executes inline
   → State: ACTIVE

4. next_step(user_input={})  # Ship_Order (STANDARD step)
   → Executes inline
   → No more steps
   → State: COMPLETED

Async Step Workflow

1. create_workflow("EmailCampaign", data)
   → State: ACTIVE

2. next_step(user_input={})  # Send_Emails (ASYNC step)
   → Dispatch to Celery worker
   → State: PENDING_ASYNC

   [Worker executes Send_Emails in background]

3. Worker completes task, calls complete_async_task()
   → State: ACTIVE

4. next_step(user_input={})  # Track_Results (STANDARD step)
   → Executes inline
   → State: COMPLETED

Human-in-the-Loop Workflow

1. create_workflow("LoanApproval", data)
   → State: ACTIVE

2. next_step(user_input={})  # Calculate_Risk (STANDARD step)
   → Executes inline, calculates risk score
   → State: ACTIVE

3. next_step(user_input={})  # Manager_Approval (STANDARD step)
   → Raises WorkflowPauseDirective(result={"awaiting_approval": True})
   → State: WAITING_HUMAN_INPUT

   [Hours/days pass while manager reviews]

4. resume_workflow(workflow_id, input={"approved": True})
   → State: ACTIVE

5. next_step(user_input={})  # Disburse_Loan (STANDARD step)
   → Executes inline
   → State: COMPLETED

Sub-Workflow Example

1. create_workflow("OrderProcessing", data)
   → State: ACTIVE

2. next_step(user_input={})  # Allocate_Inventory (triggers sub-workflow)
   → Raises StartSubWorkflowDirective(workflow_type="InventoryAllocation")
   → Creates child workflow
   → State: PENDING_SUB_WORKFLOW

3. [Child workflow executes independently]
   → Child state: ACTIVE → PENDING_ASYNC → ACTIVE → COMPLETED

4. Parent checks child status, finds COMPLETED
   → Merges child results into parent state
   → State: ACTIVE

5. next_step(user_input={})  # Ship_Order
   → State: COMPLETED

Saga Compensation Example

1. create_workflow("BookingWorkflow", data)
   → Enable saga mode
   → State: ACTIVE

2. next_step(user_input={})  # Reserve_Flight (compensatable)
   → Executes inline, reserves seat
   → State: ACTIVE

3. next_step(user_input={})  # Reserve_Hotel (compensatable)
   → Executes inline, reserves room
   → State: ACTIVE

4. next_step(user_input={})  # Charge_Payment (compensatable)
   → Fails! Card declined
   → Saga triggered
   → State: Executing compensation...

5. Compensation: Cancel_Payment (no-op, payment never charged)
6. Compensation: Release_Hotel (cancels hotel reservation)
7. Compensation: Cancel_Flight (releases flight seat)

   → State: FAILED_ROLLED_BACK

Persistent vs Transient State

Persistent State

Saved to database via PersistenceProvider.save_workflow(): - Workflow ID - Workflow type and version - Current status - Current step name - Workflow state (Pydantic model serialized to JSON) - Created/updated timestamps - Owner ID, data region (multi-tenancy) - Definition snapshot (for version isolation)

When saved: - After every step execution - Before async dispatch - On pause, cancel, or failure

Transient State

Exists only in memory during execution: - Step function local variables - Intermediate calculation results - Loop iteration state (current index, etc.)

Why separate: Persistent state is durable across restarts, transient state is ephemeral and reconstructed on resume.

Lifecycle Hooks (via WorkflowObserver)

The WorkflowObserver provider allows you to hook into lifecycle events:

class MyObserver(WorkflowObserver):
    def on_workflow_started(self, workflow_id, workflow_type):
        # Called when workflow created
        pass

    def on_workflow_status_changed(self, workflow_id, old_status, new_status):
        # Called on every state transition
        if new_status == "COMPLETED":
            send_notification(workflow_id)

    def on_step_executed(self, workflow_id, step_name, result):
        # Called after each step
        record_metric(step_name, duration=result.get('duration'))

    def on_workflow_failed(self, workflow_id, error):
        # Called on failure
        log_error(workflow_id, error)
        alert_ops_team(workflow_id)

Querying Workflow State

The persistence provider exposes methods to query workflows by state:

# List all active workflows
active_workflows = await persistence.list_workflows(status="ACTIVE")

# List workflows waiting for human input
paused_workflows = await persistence.list_workflows(status="WAITING_HUMAN_INPUT")

# List failed workflows for manual review
failed_workflows = await persistence.list_workflows(
    status="FAILED",
    limit=100
)

What's Next

Now that you understand the lifecycle: - State Management - How state flows through the lifecycle - Saga Pattern - Compensation on failure - Zombie Recovery - Detecting and recovering from worker crashes