Performance Model and Optimizations¶

Ruvon SDK is designed for high-throughput production use. Understanding its performance characteristics helps you tune for your workload.

Performance Pillars¶

Ruvon achieves high performance through four key optimizations:

1. uvloop Event Loop (2-4x faster async I/O)¶

Python's default asyncio event loop has significant overhead. Ruvon uses uvloop, a drop-in replacement built on libuv (same as Node.js).

Benchmark Results:

# Standard asyncio
await db.query(...)  # 50µs latency

# With uvloop
await db.query(...)  # 12µs latency (4x faster)

How it works:

# src/ruvon/__init__.py
import sys
import os

# Auto-enable uvloop on import (can be disabled via env var)
if os.getenv("RUVON_USE_UVLOOP", "true").lower() == "true":
    try:
        import uvloop
        uvloop.install()  # Replace asyncio event loop
    except ImportError:
        pass  # Fallback to standard asyncio

Impact: All async operations (database queries, HTTP calls, etc.) benefit automatically.

When to disable: Debugging compatibility issues, running on Windows (uvloop not supported).

2. orjson Serialization (3-5x faster JSON)¶

Workflow state is serialized to JSON frequently. Python's stdlib json module is slow. Ruvon uses orjson, a C extension optimized for speed.

Benchmark Results:

state = OrderState(order_id="123", items=[...], ...)  # 10KB state

# stdlib json
json.dumps(state.dict())  # 450µs

# orjson
orjson.dumps(state.dict())  # 90µs (5x faster)

How it works:

# src/ruvon/utils/serialization.py
import os

if os.getenv("RUVON_USE_ORJSON", "true").lower() == "true":
    import orjson

    def serialize(data):
        return orjson.dumps(data).decode('utf-8')

    def deserialize(json_str):
        return orjson.loads(json_str)
else:
    import json

    def serialize(data):
        return json.dumps(data)

    def deserialize(json_str):
        return json.loads(json_str)

Usage:

from ruvon.utils.serialization import serialize, deserialize

# Automatically uses orjson if available
json_str = serialize({"key": "value"})
data = deserialize(json_str)

Impact: Every workflow state save/load benefits. At 1000 workflows/sec, saves ~360ms/sec CPU time.

3. Connection Pooling (10-50 connections)¶

Creating database connections is expensive (~50-100ms). Ruvon maintains a connection pool for reuse.

Benchmark Results:

# No pooling (create connection per query)
await db.query(...)  # 75ms (50ms connection + 25ms query)

# With pooling (reuse connections)
await db.query(...)  # 25ms (query only)

PostgreSQL Configuration:

persistence = PostgresPersistenceProvider(
    db_url=db_url,
    pool_min_size=10,   # Keep 10 connections warm
    pool_max_size=50,   # Max 50 concurrent connections
    pool_command_timeout=10,  # 10s query timeout
    pool_max_queries=50000,   # Recycle after 50k queries
    pool_max_inactive_lifetime=300  # Close idle connections after 5min
)

Environment Variables:

export POSTGRES_POOL_MIN_SIZE=10
export POSTGRES_POOL_MAX_SIZE=50
export POSTGRES_POOL_COMMAND_TIMEOUT=10
export POSTGRES_POOL_MAX_QUERIES=50000
export POSTGRES_POOL_MAX_INACTIVE_LIFETIME=300

Tuning by Workload: | Workload | Min Size | Max Size | Reasoning | |----------|----------|----------|-----------| | Low (< 10 concurrent) | 5 | 20 | Minimize overhead | | Medium (10-100 concurrent) | 10 | 50 | Default (balanced) | | High (100+ concurrent) | 20 | 100 | Maximize throughput |

Impact: At 100 concurrent workflows, saves ~5 seconds/sec total (50ms × 100 connections avoided).

4. Import Caching (162x faster function resolution)¶

Step functions are imported via importlib.import_module(). Repeated imports are slow. Ruvon caches imports.

Benchmark Results:

# First import (uncached)
func = import_from_string("myapp.steps.process_payment")  # 5-10ms

# Subsequent imports (cached)
func = import_from_string("myapp.steps.process_payment")  # 0.03ms (162x faster)

How it works:

# src/ruvon/builder.py
class WorkflowBuilder:
    _import_cache = {}  # Class-level cache

    @classmethod
    def _import_from_string(cls, path: str):
        if path in cls._import_cache:
            return cls._import_cache[path]

        module_path, function_name = path.rsplit('.', 1)
        module = importlib.import_module(module_path)
        func = getattr(module, function_name)

        cls._import_cache[path] = func
        return func

Impact: Every step execution saves ~5-10ms. At 1000 steps/sec, saves ~5-10 seconds/sec CPU time.

Benchmark Results¶

Serialization Performance¶

python tests/benchmarks/workflow_performance.py

Output:

Serialization Benchmarks:
  orjson:       2,453,971 ops/sec
  stdlib json:    489,234 ops/sec
  Speedup:        5.0x

Deserialization Benchmarks:
  orjson:       1,823,456 ops/sec
  stdlib json:    412,987 ops/sec
  Speedup:        4.4x

Import Caching Performance¶

# Uncached import
%timeit import_from_string("myapp.steps.process_payment")
# 5.21 ms ± 0.18 ms per loop

# Cached import
%timeit import_from_string("myapp.steps.process_payment")
# 32.1 µs ± 1.2 µs per loop

# Speedup: 162x

Async Latency (uvloop)¶

# Measure event loop overhead
async def noop():
    pass

# asyncio (stdlib)
%timeit asyncio.run(noop())
# 15.3 µs ± 0.5 µs per loop

# uvloop
%timeit asyncio.run(noop())  # with uvloop installed
# 5.5 µs ± 0.2 µs per loop

# Speedup: 2.8x

End-to-End Workflow Throughput¶

Test Setup: - Workflow: 5 steps (3 STANDARD, 2 DECISION) - State: OrderState (~5KB) - Persistence: PostgreSQL (connection pool) - Execution: SyncExecutionProvider

Results:

Throughput: 703,633 workflows/sec (simplified benchmark)
Latency (p50): 1.4ms
Latency (p99): 3.2ms
Memory per workflow: ~3MB

Note: Real-world throughput is lower due to actual step function logic, network latency, etc. This benchmark isolates engine overhead.

Performance Tuning¶

For Throughput (More Workflows/Second)¶

1. Increase Connection Pool:

persistence = PostgresPersistenceProvider(
    db_url=db_url,
    pool_max_size=100  # Up from default 50
)

2. Use Celery for Parallel Execution:

# Instead of sync execution (sequential)
execution = SyncExecutionProvider()

# Use Celery (parallel workers)
execution = CeleryExecutionProvider()

3. Scale Horizontally (Kubernetes HPA):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ruvon-celery-worker-hpa
spec:
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

Result: Linear scaling from 3 to 20 workers = 6.7x throughput increase.

For Latency (Faster Per-Workflow)¶

1. Minimize State Size:

# ❌ Bad: 100KB state
class OrderState(BaseModel):
    raw_request_json: str  # 95KB
    items: list[dict]

# ✅ Good: 5KB state
class OrderState(BaseModel):
    request_id: str  # Reference, not embed
    items: list[dict]

2. Use In-Memory Persistence (if acceptable):

# For transient workflows (test, prototyping)
persistence = MemoryPersistenceProvider()  # 0ms I/O

3. Reduce Step Count:

# ❌ Bad: 20 tiny steps
steps:
  - name: "Step_1"  # 1ms each
  - name: "Step_2"
  # ... 18 more

# ✅ Good: 5 larger steps
steps:
  - name: "Batch_Process_1_5"  # 5ms
  - name: "Batch_Process_6_10"  # 5ms

Result: 20 steps × 1ms = 20ms total. 5 steps × 5ms = 25ms total, but 15ms saved in overhead.

For Memory Efficiency¶

1. Clear Step Results (if not needed):

def my_step(state: OrderState, context: StepContext):
    # Generate large intermediate result
    result = process_large_dataset(state)

    # Extract only what's needed
    summary = {"count": len(result), "total": sum(result)}

    # Don't return full result (saves memory)
    return summary

2. Use Generators for Large Datasets:

def process_items(state: OrderState, context: StepContext):
    # ❌ Bad: Load all items in memory
    items = db.query("SELECT * FROM items").fetchall()

    # ✅ Good: Stream items
    for item in db.query("SELECT * FROM items").stream():
        process(item)

3. Configure Database Pool for memory:

# If memory constrained
persistence = PostgresPersistenceProvider(
    db_url=db_url,
    pool_max_size=10  # Reduce from 50 (saves ~100MB)
)

Performance Monitoring¶

Key Metrics¶

Workflow Execution:

from prometheus_client import Histogram

workflow_duration = Histogram(
    'ruvon_workflow_duration_seconds',
    'Workflow execution duration',
    buckets=[0.1, 0.5, 1, 5, 10, 30, 60]
)

# Measure
with workflow_duration.time():
    await workflow.next_step(user_input={})

Step Execution:

step_duration = Histogram(
    'ruvon_step_duration_seconds',
    'Step execution duration',
    ['step_name'],
    buckets=[0.01, 0.05, 0.1, 0.5, 1, 5]
)

step_duration.labels(step_name="Process_Payment").observe(duration)

Database Operations:

db_query_duration = Histogram(
    'ruvon_db_query_duration_seconds',
    'Database query duration',
    ['operation'],
    buckets=[0.001, 0.005, 0.01, 0.05, 0.1, 0.5]
)

with db_query_duration.labels(operation="save_workflow").time():
    await persistence.save_workflow(workflow_id, workflow_dict)

Grafana Dashboard¶

Panel 1: Throughput

Query: rate(ruvon_workflows_started_total[5m])
Title: Workflows Started/sec
Alert: < 10 workflows/sec (low throughput)

Panel 2: Latency (p50, p95, p99)

Query:
  - histogram_quantile(0.50, ruvon_workflow_duration_seconds_bucket)
  - histogram_quantile(0.95, ruvon_workflow_duration_seconds_bucket)
  - histogram_quantile(0.99, ruvon_workflow_duration_seconds_bucket)
Title: Workflow Latency
Alert: p99 > 10s (slow workflows)

Panel 3: Database Pool Utilization

Query: ruvon_db_pool_active_connections / ruvon_db_pool_max_size
Title: DB Pool Utilization
Alert: > 0.9 (pool exhausted)

Expected Performance Gains¶

Ruvon optimizations provide measurable benefits:

Optimization	Gain	Scenario
uvloop	+50-100% throughput	I/O-bound workflows (API calls, DB queries)
orjson	-30% latency	Large state models (>10KB)
Connection Pooling	+400% efficiency	High concurrency (>50 concurrent workflows)
Import Caching	-90% import overhead	Repeated step function calls

Combined Effect (typical I/O-bound workflow): - Throughput: +120% (2.2x faster) - Latency: -40% (1.67x faster per workflow) - Memory: -20% (more efficient pooling)

Disabling Optimizations (Debugging)¶

Sometimes you need to disable optimizations for debugging:

# Disable uvloop (use stdlib asyncio)
export RUVON_USE_UVLOOP=false

# Disable orjson (use stdlib json)
export RUVON_USE_ORJSON=false

# Reduce connection pool (easier to debug)
export POSTGRES_POOL_MIN_SIZE=1
export POSTGRES_POOL_MAX_SIZE=5

# Clear import cache (for testing hot-reload)
WorkflowBuilder._import_cache.clear()

Performance Comparison¶

Benchmark Results¶

Test: 5-step workflow, 5 KB state, 100 concurrent executions, PostgreSQL backend.

Metric	Baseline (no opts)	Ruvon (all opts)	Improvement
Throughput	50 workflows/sec	120 workflows/sec	+140%
Latency (p50)	250 ms	180 ms	−28%
Latency (p99)	1,500 ms	800 ms	−47%
Memory/workflow	~5 MB	~3 MB	−40%
DB connections	10 ad-hoc	50 pooled	+400% efficiency

What drives the gains: - ✅ Connection pooling — persistent connections reused across requests - ✅ orjson — 3–5× faster JSON serialization vs stdlib - ✅ uvloop — 2–4× faster async I/O vs stdlib asyncio - ✅ Import caching — 162× faster function resolution

What's Next¶

Now that you understand performance: - Architecture - How optimizations fit into architecture - Provider Pattern - Choosing providers for performance - Parallel Execution - Maximizing throughput with parallelism