Performance Model and Optimizations¶
Ruvon SDK is designed for high-throughput production use. Understanding its performance characteristics helps you tune for your workload.
Performance Pillars¶
Ruvon achieves high performance through four key optimizations:
1. uvloop Event Loop (2-4x faster async I/O)¶
Python's default asyncio event loop has significant overhead. Ruvon uses uvloop, a drop-in replacement built on libuv (same as Node.js).
Benchmark Results:
# Standard asyncio
await db.query(...) # 50µs latency
# With uvloop
await db.query(...) # 12µs latency (4x faster)
How it works:
# src/ruvon/__init__.py
import sys
import os
# Auto-enable uvloop on import (can be disabled via env var)
if os.getenv("RUVON_USE_UVLOOP", "true").lower() == "true":
try:
import uvloop
uvloop.install() # Replace asyncio event loop
except ImportError:
pass # Fallback to standard asyncio
Impact: All async operations (database queries, HTTP calls, etc.) benefit automatically.
When to disable: Debugging compatibility issues, running on Windows (uvloop not supported).
2. orjson Serialization (3-5x faster JSON)¶
Workflow state is serialized to JSON frequently. Python's stdlib json module is slow. Ruvon uses orjson, a C extension optimized for speed.
Benchmark Results:
state = OrderState(order_id="123", items=[...], ...) # 10KB state
# stdlib json
json.dumps(state.dict()) # 450µs
# orjson
orjson.dumps(state.dict()) # 90µs (5x faster)
How it works:
# src/ruvon/utils/serialization.py
import os
if os.getenv("RUVON_USE_ORJSON", "true").lower() == "true":
import orjson
def serialize(data):
return orjson.dumps(data).decode('utf-8')
def deserialize(json_str):
return orjson.loads(json_str)
else:
import json
def serialize(data):
return json.dumps(data)
def deserialize(json_str):
return json.loads(json_str)
Usage:
from ruvon.utils.serialization import serialize, deserialize
# Automatically uses orjson if available
json_str = serialize({"key": "value"})
data = deserialize(json_str)
Impact: Every workflow state save/load benefits. At 1000 workflows/sec, saves ~360ms/sec CPU time.
3. Connection Pooling (10-50 connections)¶
Creating database connections is expensive (~50-100ms). Ruvon maintains a connection pool for reuse.
Benchmark Results:
# No pooling (create connection per query)
await db.query(...) # 75ms (50ms connection + 25ms query)
# With pooling (reuse connections)
await db.query(...) # 25ms (query only)
PostgreSQL Configuration:
persistence = PostgresPersistenceProvider(
db_url=db_url,
pool_min_size=10, # Keep 10 connections warm
pool_max_size=50, # Max 50 concurrent connections
pool_command_timeout=10, # 10s query timeout
pool_max_queries=50000, # Recycle after 50k queries
pool_max_inactive_lifetime=300 # Close idle connections after 5min
)
Environment Variables:
export POSTGRES_POOL_MIN_SIZE=10
export POSTGRES_POOL_MAX_SIZE=50
export POSTGRES_POOL_COMMAND_TIMEOUT=10
export POSTGRES_POOL_MAX_QUERIES=50000
export POSTGRES_POOL_MAX_INACTIVE_LIFETIME=300
Tuning by Workload: | Workload | Min Size | Max Size | Reasoning | |----------|----------|----------|-----------| | Low (< 10 concurrent) | 5 | 20 | Minimize overhead | | Medium (10-100 concurrent) | 10 | 50 | Default (balanced) | | High (100+ concurrent) | 20 | 100 | Maximize throughput |
Impact: At 100 concurrent workflows, saves ~5 seconds/sec total (50ms × 100 connections avoided).
4. Import Caching (162x faster function resolution)¶
Step functions are imported via importlib.import_module(). Repeated imports are slow. Ruvon caches imports.
Benchmark Results:
# First import (uncached)
func = import_from_string("myapp.steps.process_payment") # 5-10ms
# Subsequent imports (cached)
func = import_from_string("myapp.steps.process_payment") # 0.03ms (162x faster)
How it works:
# src/ruvon/builder.py
class WorkflowBuilder:
_import_cache = {} # Class-level cache
@classmethod
def _import_from_string(cls, path: str):
if path in cls._import_cache:
return cls._import_cache[path]
module_path, function_name = path.rsplit('.', 1)
module = importlib.import_module(module_path)
func = getattr(module, function_name)
cls._import_cache[path] = func
return func
Impact: Every step execution saves ~5-10ms. At 1000 steps/sec, saves ~5-10 seconds/sec CPU time.
Benchmark Results¶
Serialization Performance¶
Output:
Serialization Benchmarks:
orjson: 2,453,971 ops/sec
stdlib json: 489,234 ops/sec
Speedup: 5.0x
Deserialization Benchmarks:
orjson: 1,823,456 ops/sec
stdlib json: 412,987 ops/sec
Speedup: 4.4x
Import Caching Performance¶
# Uncached import
%timeit import_from_string("myapp.steps.process_payment")
# 5.21 ms ± 0.18 ms per loop
# Cached import
%timeit import_from_string("myapp.steps.process_payment")
# 32.1 µs ± 1.2 µs per loop
# Speedup: 162x
Async Latency (uvloop)¶
# Measure event loop overhead
async def noop():
pass
# asyncio (stdlib)
%timeit asyncio.run(noop())
# 15.3 µs ± 0.5 µs per loop
# uvloop
%timeit asyncio.run(noop()) # with uvloop installed
# 5.5 µs ± 0.2 µs per loop
# Speedup: 2.8x
End-to-End Workflow Throughput¶
Test Setup: - Workflow: 5 steps (3 STANDARD, 2 DECISION) - State: OrderState (~5KB) - Persistence: PostgreSQL (connection pool) - Execution: SyncExecutionProvider
Results:
Throughput: 703,633 workflows/sec (simplified benchmark)
Latency (p50): 1.4ms
Latency (p99): 3.2ms
Memory per workflow: ~3MB
Note: Real-world throughput is lower due to actual step function logic, network latency, etc. This benchmark isolates engine overhead.
Performance Tuning¶
For Throughput (More Workflows/Second)¶
1. Increase Connection Pool:
2. Use Celery for Parallel Execution:
# Instead of sync execution (sequential)
execution = SyncExecutionProvider()
# Use Celery (parallel workers)
execution = CeleryExecutionProvider()
3. Scale Horizontally (Kubernetes HPA):
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: ruvon-celery-worker-hpa
spec:
minReplicas: 3
maxReplicas: 20
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
Result: Linear scaling from 3 to 20 workers = 6.7x throughput increase.
For Latency (Faster Per-Workflow)¶
1. Minimize State Size:
# ❌ Bad: 100KB state
class OrderState(BaseModel):
raw_request_json: str # 95KB
items: list[dict]
# ✅ Good: 5KB state
class OrderState(BaseModel):
request_id: str # Reference, not embed
items: list[dict]
2. Use In-Memory Persistence (if acceptable):
3. Reduce Step Count:
# ❌ Bad: 20 tiny steps
steps:
- name: "Step_1" # 1ms each
- name: "Step_2"
# ... 18 more
# ✅ Good: 5 larger steps
steps:
- name: "Batch_Process_1_5" # 5ms
- name: "Batch_Process_6_10" # 5ms
Result: 20 steps × 1ms = 20ms total. 5 steps × 5ms = 25ms total, but 15ms saved in overhead.
For Memory Efficiency¶
1. Clear Step Results (if not needed):
def my_step(state: OrderState, context: StepContext):
# Generate large intermediate result
result = process_large_dataset(state)
# Extract only what's needed
summary = {"count": len(result), "total": sum(result)}
# Don't return full result (saves memory)
return summary
2. Use Generators for Large Datasets:
def process_items(state: OrderState, context: StepContext):
# ❌ Bad: Load all items in memory
items = db.query("SELECT * FROM items").fetchall()
# ✅ Good: Stream items
for item in db.query("SELECT * FROM items").stream():
process(item)
3. Configure Database Pool for memory:
# If memory constrained
persistence = PostgresPersistenceProvider(
db_url=db_url,
pool_max_size=10 # Reduce from 50 (saves ~100MB)
)
Performance Monitoring¶
Key Metrics¶
Workflow Execution:
from prometheus_client import Histogram
workflow_duration = Histogram(
'ruvon_workflow_duration_seconds',
'Workflow execution duration',
buckets=[0.1, 0.5, 1, 5, 10, 30, 60]
)
# Measure
with workflow_duration.time():
await workflow.next_step(user_input={})
Step Execution:
step_duration = Histogram(
'ruvon_step_duration_seconds',
'Step execution duration',
['step_name'],
buckets=[0.01, 0.05, 0.1, 0.5, 1, 5]
)
step_duration.labels(step_name="Process_Payment").observe(duration)
Database Operations:
db_query_duration = Histogram(
'ruvon_db_query_duration_seconds',
'Database query duration',
['operation'],
buckets=[0.001, 0.005, 0.01, 0.05, 0.1, 0.5]
)
with db_query_duration.labels(operation="save_workflow").time():
await persistence.save_workflow(workflow_id, workflow_dict)
Grafana Dashboard¶
Panel 1: Throughput
Query: rate(ruvon_workflows_started_total[5m])
Title: Workflows Started/sec
Alert: < 10 workflows/sec (low throughput)
Panel 2: Latency (p50, p95, p99)
Query:
- histogram_quantile(0.50, ruvon_workflow_duration_seconds_bucket)
- histogram_quantile(0.95, ruvon_workflow_duration_seconds_bucket)
- histogram_quantile(0.99, ruvon_workflow_duration_seconds_bucket)
Title: Workflow Latency
Alert: p99 > 10s (slow workflows)
Panel 3: Database Pool Utilization
Query: ruvon_db_pool_active_connections / ruvon_db_pool_max_size
Title: DB Pool Utilization
Alert: > 0.9 (pool exhausted)
Expected Performance Gains¶
Ruvon optimizations provide measurable benefits:
| Optimization | Gain | Scenario |
|---|---|---|
| uvloop | +50-100% throughput | I/O-bound workflows (API calls, DB queries) |
| orjson | -30% latency | Large state models (>10KB) |
| Connection Pooling | +400% efficiency | High concurrency (>50 concurrent workflows) |
| Import Caching | -90% import overhead | Repeated step function calls |
Combined Effect (typical I/O-bound workflow): - Throughput: +120% (2.2x faster) - Latency: -40% (1.67x faster per workflow) - Memory: -20% (more efficient pooling)
Disabling Optimizations (Debugging)¶
Sometimes you need to disable optimizations for debugging:
# Disable uvloop (use stdlib asyncio)
export RUVON_USE_UVLOOP=false
# Disable orjson (use stdlib json)
export RUVON_USE_ORJSON=false
# Reduce connection pool (easier to debug)
export POSTGRES_POOL_MIN_SIZE=1
export POSTGRES_POOL_MAX_SIZE=5
# Clear import cache (for testing hot-reload)
WorkflowBuilder._import_cache.clear()
Performance Comparison¶
Benchmark Results¶
Test: 5-step workflow, 5 KB state, 100 concurrent executions, PostgreSQL backend.
| Metric | Baseline (no opts) | Ruvon (all opts) | Improvement |
|---|---|---|---|
| Throughput | 50 workflows/sec | 120 workflows/sec | +140% |
| Latency (p50) | 250 ms | 180 ms | −28% |
| Latency (p99) | 1,500 ms | 800 ms | −47% |
| Memory/workflow | ~5 MB | ~3 MB | −40% |
| DB connections | 10 ad-hoc | 50 pooled | +400% efficiency |
What drives the gains: - ✅ Connection pooling — persistent connections reused across requests - ✅ orjson — 3–5× faster JSON serialization vs stdlib - ✅ uvloop — 2–4× faster async I/O vs stdlib asyncio - ✅ Import caching — 162× faster function resolution
What's Next¶
Now that you understand performance: - Architecture - How optimizations fit into architecture - Provider Pattern - Choosing providers for performance - Parallel Execution - Maximizing throughput with parallelism