# Fault Tolerance

## Overview

Laminar provides exactly-once processing semantics through checkpointing and automatic recovery.

## Checkpointing
Checkpoints are consistent snapshots of pipeline state that enable recovery.
### Checkpoint Components
| Component | Description | Storage |
|---|---|---|
| Source Offsets | Position in each source partition | Object Storage |
| Operator State | Aggregation buffers, window state, join state | Object Storage |
| Metadata | Checkpoint ID, timestamp, status | PostgreSQL |
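
For orientation, the metadata component might be modeled like the sketch below; the field and status names are illustrative assumptions, not Laminar's actual PostgreSQL schema.

```rust
// Illustrative only: the shape of a checkpoint-metadata record as described
// in the table above. Field names are assumptions, not Laminar's schema.
pub enum CheckpointStatus {
    InProgress,
    Complete,
    Failed,
}

pub struct CheckpointMetadata {
    pub checkpoint_id: u64,       // monotonically increasing checkpoint ID
    pub triggered_at_ms: i64,     // wall-clock timestamp of the trigger
    pub status: CheckpointStatus, // lifecycle state stored in PostgreSQL
}
```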
### Checkpoint Process

Each checkpoint proceeds through the following lifecycle (a minimal sketch of the barrier-handling step follows the list):
1. **Trigger**: The coordinator initiates a checkpoint at the configured interval
2. **Barrier**: Checkpoint barriers flow through the pipeline
3. **Snapshot**: Each operator snapshots its state when the barrier arrives
4. **Upload**: State snapshots are uploaded to object storage
5. **Commit**: The checkpoint is marked complete when all operators finish
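
The barrier step is the key to consistency: an operator applies records as usual until the barrier arrives, snapshots exactly at that point, then forwards the barrier downstream. Here is a minimal, self-contained sketch of that logic; `Event`, `OperatorState`, and `ObjectStore` are illustrative stand-ins, not Laminar's API.

```rust
use std::collections::HashMap;

/// Stand-in for whatever an operator keeps between records.
#[derive(Default, Clone)]
struct OperatorState {
    counts: HashMap<String, u64>,
}

impl OperatorState {
    fn apply(&mut self, key: &str) {
        *self.counts.entry(key.to_string()).or_insert(0) += 1;
    }
}

/// Stand-in for the object-storage client used to persist snapshots.
struct ObjectStore;
impl ObjectStore {
    fn upload(&self, checkpoint_id: u64, snapshot: &OperatorState) {
        println!("uploaded {} keys for checkpoint {}", snapshot.counts.len(), checkpoint_id);
    }
}

enum Event {
    Record(String),
    Barrier { checkpoint_id: u64 },
}

/// Process one event: records mutate state; a barrier triggers a snapshot.
fn on_event(ev: Event, state: &mut OperatorState, store: &ObjectStore) {
    match ev {
        Event::Record(key) => state.apply(&key),
        Event::Barrier { checkpoint_id } => {
            // Snapshot exactly at the barrier, so every operator captures a
            // state consistent with the same set of input records.
            let snapshot = state.clone();
            store.upload(checkpoint_id, &snapshot);
            // A real operator would now forward the barrier downstream.
        }
    }
}

fn main() {
    let store = ObjectStore;
    let mut state = OperatorState::default();
    for ev in [
        Event::Record("user-1".into()),
        Event::Record("user-2".into()),
        Event::Barrier { checkpoint_id: 1 },
    ] {
        on_event(ev, &mut state, &store);
    }
}
```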
### Checkpoint Configuration

Configure checkpointing behavior in your pipeline settings:
| Setting | Description | Default |
|---|---|---|
| `checkpointIntervalMs` | Time between checkpoints | 60000 (1 min) |
| `checkpointTimeoutMs` | Maximum time for a checkpoint to complete | 600000 (10 min) |
| `minPauseBetweenCheckpoints` | Minimum gap between checkpoints, in ms | 0 |
| `maxConcurrentCheckpoints` | Number of in-flight checkpoints | 1 |
```json
{
  "checkpointing": {
    "checkpointIntervalMs": 60000,
    "checkpointTimeoutMs": 600000
  }
}
```

## Exactly-Once Semantics
Laminar achieves exactly-once processing through coordinated checkpointing.
### How It Works
1. **Source Tracking**: Record source offsets at each checkpoint
2. **State Snapshot**: Capture all operator state
3. **Sink Transactions**: Coordinate with transactional sinks
4. **Atomic Commit**: All components commit together (see the sketch below)
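
A common way to realize this coordination, and a reasonable mental model here, is a two-phase commit tied to the checkpoint: sinks stage their pending output, the checkpoint is made durable, and only then do sinks finalize. The sketch below illustrates that ordering with invented types (`TransactionalSink`, `persist_checkpoint`); it is not Laminar's internal interface.

```rust
/// Illustrative sink interface; not Laminar's actual trait.
trait TransactionalSink {
    fn pre_commit(&mut self, checkpoint_id: u64) -> Result<(), String>;
    fn commit(&mut self, checkpoint_id: u64) -> Result<(), String>;
}

/// Stand-in for durably recording the checkpoint (offsets + state metadata).
fn persist_checkpoint(checkpoint_id: u64) -> Result<(), String> {
    println!("checkpoint {checkpoint_id} persisted");
    Ok(())
}

/// The ordering that yields exactly-once output:
/// 1. every sink stages (pre-commits) its pending writes,
/// 2. the checkpoint itself is made durable,
/// 3. sinks finalize their transactions.
/// A crash before step 2 aborts the staged writes and replays from the
/// previous checkpoint; after step 2, recovery re-issues the commits.
fn complete_checkpoint(
    checkpoint_id: u64,
    sinks: &mut [Box<dyn TransactionalSink>],
) -> Result<(), String> {
    for sink in sinks.iter_mut() {
        sink.pre_commit(checkpoint_id)?;
    }
    persist_checkpoint(checkpoint_id)?;
    for sink in sinks.iter_mut() {
        sink.commit(checkpoint_id)?;
    }
    Ok(())
}
```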
### Sink Semantics
| Sink Type | Semantics | Mechanism |
|---|---|---|
| Kafka | Exactly-once | Transactional producer |
| Iceberg | Exactly-once | Atomic table commits |
| StdOut | At-least-once | No deduplication |
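
For Kafka, the mechanism is the standard transactional producer. The example below is plain rdkafka usage in Rust showing that mechanism in isolation; the broker address and `transactional.id` are placeholders, and this is not Laminar's sink implementation.

```rust
use std::time::Duration;

use rdkafka::config::ClientConfig;
use rdkafka::producer::{BaseProducer, BaseRecord, Producer};

fn produce_transactionally() -> Result<(), rdkafka::error::KafkaError> {
    let producer: BaseProducer = ClientConfig::new()
        .set("bootstrap.servers", "localhost:9092")
        // A stable transactional.id lets the broker fence zombie producers.
        .set("transactional.id", "laminar-sink-0")
        .create()?;

    producer.init_transactions(Duration::from_secs(10))?;
    producer.begin_transaction()?;
    producer
        .send(BaseRecord::to("out-topic").key("k").payload("v"))
        .map_err(|(e, _)| e)?;
    // Commit atomically publishes everything sent since begin_transaction;
    // consumers with isolation.level=read_committed never see aborted writes.
    producer.commit_transaction(Duration::from_secs(10))?;
    Ok(())
}
```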
## Failure Recovery
When a failure occurs, Laminar automatically recovers from the last successful checkpoint.
### Failure Detection

When Laminar detects a failure, it initiates the recovery process described below.
### Recovery Process

Recovery proceeds through the following steps (a condensed sketch follows the list):
1. **Identify Checkpoint**: Find the most recent completed checkpoint
2. **Load Metadata**: Retrieve checkpoint metadata from PostgreSQL
3. **Download State**: Fetch state snapshots from object storage
4. **Restore Operators**: Each operator restores its state
5. **Reset Sources**: Seek sources to the checkpointed offsets
6. **Resume Processing**: Continue processing from the checkpoint position
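
A condensed sketch of this sequence; all types here are stand-ins for the metadata store, object storage, and pipeline, not Laminar's internals.

```rust
use std::collections::HashMap;

struct Checkpoint {
    id: u64,
    source_offsets: HashMap<String, u64>, // partition -> offset
}

struct MetadataDb;
impl MetadataDb {
    /// Steps 1-2: find the newest checkpoint marked complete in PostgreSQL.
    fn latest_completed(&self) -> Checkpoint {
        Checkpoint { id: 42, source_offsets: HashMap::from([("topic-0".into(), 1337)]) }
    }
}

struct ObjectStore;
impl ObjectStore {
    /// Step 3: fetch an operator's snapshot from object storage.
    fn download(&self, checkpoint_id: u64, operator: &str) -> Vec<u8> {
        println!("downloading snapshot of {operator} for checkpoint {checkpoint_id}");
        Vec::new()
    }
}

fn recover(db: &MetadataDb, store: &ObjectStore, operators: &[&str]) {
    let ckpt = db.latest_completed();
    // Step 4: each operator restores from its snapshot.
    for op in operators {
        let _snapshot = store.download(ckpt.id, op);
        // a real pipeline would call op.restore(_snapshot) here
    }
    // Step 5: seek each source back to its checkpointed offset.
    for (partition, offset) in &ckpt.source_offsets {
        println!("seeking {partition} to offset {offset}");
    }
    // Step 6: resume. Records after the checkpoint are re-processed, which is
    // why transactional sinks are needed for end-to-end exactly-once.
}

fn main() {
    recover(&MetadataDb, &ObjectStore, &["window-agg", "join"]);
}
```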
## State Backend

Laminar uses a Parquet-based state backend for checkpoint storage.

### Parquet Backend
| Property | Description |
|---|---|
| Format | Apache Parquet with Zstd compression |
| Storage | S3, GCS, Azure Blob, or local filesystem |
| State Tables | GlobalKeyedTable, ExpiringTimeKeyTable |
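
To make the format concrete, the standalone example below writes a small keyed-state snapshot as Zstd-compressed Parquet using the arrow and parquet crates; the two-column layout is invented for illustration and is not the schema of Laminar's state tables.

```rust
use std::fs::File;
use std::sync::Arc;

use arrow::array::{ArrayRef, StringArray, UInt64Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;
use parquet::basic::{Compression, ZstdLevel};
use parquet::file::properties::WriterProperties;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Invented layout: one row per key of a keyed aggregation.
    let schema = Arc::new(Schema::new(vec![
        Field::new("key", DataType::Utf8, false),
        Field::new("count", DataType::UInt64, false),
    ]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![
            Arc::new(StringArray::from(vec!["user-1", "user-2"])) as ArrayRef,
            Arc::new(UInt64Array::from(vec![17u64, 42])) as ArrayRef,
        ],
    )?;

    // Zstd compression, matching the backend description above.
    let props = WriterProperties::builder()
        .set_compression(Compression::ZSTD(ZstdLevel::try_new(3)?))
        .build();

    let file = File::create("snapshot.parquet")?;
    let mut writer = ArrowWriter::try_new(file, schema, Some(props))?;
    writer.write(&batch)?;
    writer.close()?; // finalizes the Parquet footer
    Ok(())
}
```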
### Supported Storage Providers

- **AWS S3**: primary cloud storage
- **Google Cloud Storage**: GCS buckets
- **Azure Blob Storage**: ABFS/HTTPS
- **Cloudflare R2**: S3-compatible
- **Local filesystem**: for development and testing
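
A backend location is typically configured as a URL. The snippet below follows the configuration style shown earlier, but the `stateBackend` and `storageUrl` key names are hypothetical; consult your Laminar release for the actual setting names.

```json
{
  "stateBackend": {
    "storageUrl": "s3://my-bucket/laminar/checkpoints"
  }
}
```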
## Monitoring Recovery

Track checkpoint and recovery metrics:
| Metric | Description |
|---|---|
| `checkpoint_duration_ms` | Time to complete a checkpoint |
| `checkpoint_size_bytes` | Size of checkpoint data |
| `checkpoint_failures` | Number of failed checkpoints |
| `recovery_time_ms` | Time to recover from a failure |
| `records_reprocessed` | Records replayed after recovery |
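
How these metrics are exported is deployment-specific and not specified here. Purely as a shape reference, the sketch below records two of them using the Rust `metrics` facade crate; Laminar's actual instrumentation may differ.

```rust
use std::time::Instant;

use metrics::{counter, histogram};

// Illustrative only: recording two of the metrics listed above with the
// `metrics` facade crate. Laminar's actual instrumentation may differ.
fn checkpoint_and_report(run_checkpoint: impl Fn() -> Result<(), ()>) {
    let started = Instant::now();
    match run_checkpoint() {
        Ok(()) => {
            histogram!("checkpoint_duration_ms").record(started.elapsed().as_millis() as f64);
        }
        Err(()) => {
            counter!("checkpoint_failures").increment(1);
        }
    }
}
```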