Work in Progress: This page is under development.

Fault Tolerance

Laminar provides exactly-once processing semantics through checkpointing and automatic recovery mechanisms.

Overview

Laminar's fault tolerance rests on two mechanisms, described below: periodic checkpoints that capture a consistent snapshot of source offsets and operator state, and automatic recovery that restores the most recent completed checkpoint and resumes processing after a failure.

Checkpointing

Checkpoints are consistent snapshots of pipeline state that enable recovery.

Checkpoint Components

Component      | Description                                   | Storage
Source Offsets | Position in each source partition             | Object Storage
Operator State | Aggregation buffers, window state, join state | Object Storage
Metadata       | Checkpoint ID, timestamp, status              | PostgreSQL
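
As a rough mental model, a completed checkpoint is a metadata record plus a set of objects in the state store. The sketch below is illustrative only; the field names, partition labels, and object keys are assumptions, not Laminar's actual schema.

from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class CheckpointMetadata:
    # Stored in PostgreSQL: identity, timing, and status of the checkpoint.
    checkpoint_id: int
    created_at: datetime
    status: str                                                    # e.g. "IN_PROGRESS", "COMPLETED"
    # Referenced from object storage: source offsets and operator state snapshots.
    source_offsets: dict[str, int] = field(default_factory=dict)   # partition -> offset
    state_objects: list[str] = field(default_factory=list)         # object-store keys

ckpt = CheckpointMetadata(
    checkpoint_id=42,
    created_at=datetime.now(timezone.utc),
    status="COMPLETED",
    source_offsets={"orders-0": 1200, "orders-1": 987},
    state_objects=["checkpoints/42/window_state.parquet"],
)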

Checkpoint Process

Checkpoint Lifecycle

  1. Trigger: Coordinator initiates checkpoint at configured interval
  2. Barrier: Checkpoint barriers flow through the pipeline
  3. Snapshot: Each operator snapshots its state when barrier arrives
  4. Upload: State snapshots are uploaded to object storage
  5. Commit: Checkpoint is marked complete when all operators finish
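
The barrier-alignment part of this lifecycle (steps 2-4) can be sketched as a simplified operator loop. This is a conceptual illustration only; the input/output channels and the snapshot_state / upload_snapshot callables are assumptions, not Laminar's API.

BARRIER = object()  # sentinel marking a checkpoint barrier in the record stream

def run_until_checkpoint(inputs, output, process, snapshot_state, upload_snapshot):
    """Barrier alignment for one operator and one checkpoint: process records
    until a barrier has arrived on every input queue, then snapshot local
    state, upload it, and forward the barrier downstream."""
    pending = set(range(len(inputs)))        # inputs that still owe a barrier
    while pending:
        # A real operator would poll inputs concurrently; sequential reads
        # keep the sketch short.
        for i in list(pending):
            item = inputs[i].get()
            if item is BARRIER:
                pending.discard(i)           # stop draining this input until aligned
            else:
                output.put(process(item))    # normal record processing
    upload_snapshot(snapshot_state())        # steps 3-4: snapshot and upload
    output.put(BARRIER)                      # step 2 continues: barrier flows downstream

In this model, a barrier is injected at the sources when the coordinator triggers the checkpoint (step 1), and the checkpoint is committed once every operator has uploaded its snapshot (step 5).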

Checkpoint Configuration

Configure checkpointing behavior in your pipeline settings.

Setting                    | Description                        | Default
checkpointIntervalMs       | Time between checkpoints           | 60000 (1 min)
checkpointTimeoutMs        | Max time for checkpoint completion | 600000 (10 min)
minPauseBetweenCheckpoints | Minimum gap between checkpoints    | 0
maxConcurrentCheckpoints   | Number of in-flight checkpoints    | 1

For example:

{
  "checkpointing": {
    "checkpointIntervalMs": 60000,
    "checkpointTimeoutMs": 600000
  }
}

Exactly-Once Semantics

Laminar achieves exactly-once processing through coordinated checkpointing.

How It Works

  1. Source Tracking: Record source offsets at checkpoint
  2. State Snapshot: Capture all operator state
  3. Sink Transactions: Coordinate with transactional sinks
  4. Atomic Commit: All components commit together
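
The coordination can be sketched as a two-phase commit driven by the checkpoint. Everything below (sources, operators, sinks, metadata_store) is a hypothetical interface used to illustrate the ordering, not Laminar's API.

def commit_checkpoint(checkpoint_id, sources, operators, sinks, metadata_store):
    """Illustrative exactly-once checkpoint commit."""
    # 1. Source tracking: record where each source partition is.
    offsets = {s.name: s.current_offset() for s in sources}

    # 2. State snapshot: capture and upload every operator's state.
    state_keys = [op.upload_snapshot(checkpoint_id) for op in operators]

    # 3. Sink transactions: ask each transactional sink to pre-commit
    #    (flush buffered output without making it visible yet).
    for sink in sinks:
        sink.pre_commit(checkpoint_id)

    # 4. Atomic commit: persist the checkpoint record, then finalize the sink
    #    transactions. If the pipeline crashes before the record is written,
    #    the pre-committed output is aborted on recovery, so nothing is lost
    #    or duplicated.
    metadata_store.write_completed(checkpoint_id, offsets, state_keys)
    for sink in sinks:
        sink.commit(checkpoint_id)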

Sink Semantics

Sink Type | Semantics     | Mechanism
Kafka     | Exactly-once  | Transactional producer
Iceberg   | Exactly-once  | Atomic table commits
StdOut    | At-least-once | No deduplication
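
To make the Kafka row concrete, the snippet below shows the transactional-producer pattern using the confluent_kafka client. It illustrates the general mechanism rather than Laminar's internal sink code; the broker address, topic, and transactional.id are placeholders.

from confluent_kafka import Producer, KafkaException

producer = Producer({
    "bootstrap.servers": "localhost:9092",
    "transactional.id": "laminar-sink-example",   # placeholder id
})
producer.init_transactions()

def write_batch(records):
    """Write a batch atomically: either every record becomes visible to
    read_committed consumers, or none do."""
    producer.begin_transaction()
    try:
        for key, value in records:
            producer.produce("output-topic", key=key, value=value)
        producer.commit_transaction()
    except KafkaException:
        producer.abort_transaction()
        raise

write_batch([("order-1", b"processed"), ("order-2", b"processed")])

In a checkpointing pipeline, the commit is typically tied to checkpoint completion, which is what turns replay after a recovery into exactly-once output.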

Failure Recovery

When a failure occurs, Laminar automatically recovers from the last successful checkpoint.

Failure Detection

Recovery Process

Recovery Steps

  1. Identify Checkpoint: Find the most recent completed checkpoint
  2. Load Metadata: Retrieve checkpoint metadata from PostgreSQL
  3. Download State: Fetch state snapshots from object storage
  4. Restore Operators: Each operator restores its state
  5. Reset Sources: Seek sources to checkpointed offsets
  6. Resume Processing: Continue processing from checkpoint position
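
These steps map onto a short driver routine. The sketch below is hypothetical: metadata_store, object_store, operators, and sources stand in for Laminar's internals purely to show the order of operations.

def recover_pipeline(metadata_store, object_store, operators, sources):
    """Restore the most recent completed checkpoint and prepare to resume."""
    # Steps 1-2: identify the latest completed checkpoint and load its metadata.
    ckpt = metadata_store.latest_completed_checkpoint()
    if ckpt is None:
        raise RuntimeError("no completed checkpoint; start the pipeline fresh")

    # Steps 3-4: download each operator's snapshot and restore its state.
    for op in operators:
        snapshot = object_store.download(ckpt.state_key(op.name))
        op.restore(snapshot)

    # Step 5: reset every source partition to its checkpointed offset.
    for source in sources:
        source.seek(ckpt.offsets[source.name])

    # Step 6: resume processing from the checkpoint position; records replayed
    # after the checkpoint are kept exactly-once by the transactional sinks.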

State Backend

Laminar uses a Parquet-based state backend for checkpoint storage.

Parquet Backend

Property     | Description
Format       | Apache Parquet with Zstd compression
Storage      | S3, GCS, Azure Blob, or local filesystem
State Tables | GlobalKeyedTable, ExpiringTimeKeyTable
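
Because snapshot files are ordinary Parquet, they can be written and inspected with stock tooling. The snippet below shows the general pattern with pyarrow; the column layout is illustrative, not Laminar's on-disk schema.

import pyarrow as pa
import pyarrow.parquet as pq

# Example keyed state: one row per key, as a columnar Arrow table.
state = pa.table({
    "key": ["user-1", "user-2"],
    "count": [17, 3],
    "updated_at_ms": [1710000000000, 1710000050000],
})

# Write the snapshot with Zstd compression, then read it back.
pq.write_table(state, "state-snapshot.parquet", compression="zstd")
print(pq.read_table("state-snapshot.parquet").to_pydict())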

Supported Storage Providers

  • AWS S3 - Primary cloud storage
  • Google Cloud Storage - GCS buckets
  • Azure Blob Storage - ABFS/HTTPS
  • Cloudflare R2 - S3-compatible
  • Local Filesystem - For development/testing
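
With an Arrow-style filesystem abstraction, these providers differ mainly in the URI scheme of the checkpoint location. The schemes below follow pyarrow's conventions and are shown as examples; they are not a definitive list of what Laminar accepts.

from pyarrow import fs

# FileSystem.from_uri returns (filesystem, path-within-filesystem).
examples = [
    "s3://my-bucket/laminar/checkpoints",     # AWS S3 and S3-compatible stores such as R2
    "gs://my-bucket/laminar/checkpoints",     # Google Cloud Storage
    "file:///var/lib/laminar/checkpoints",    # local filesystem for development/testing
]
for uri in examples:
    filesystem, path = fs.FileSystem.from_uri(uri)
    print(type(filesystem).__name__, path)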

Monitoring Recovery

Track checkpoint and recovery metrics.

Metric                 | Description
checkpoint_duration_ms | Time to complete checkpoint
checkpoint_size_bytes  | Size of checkpoint data
checkpoint_failures    | Number of failed checkpoints
recovery_time_ms       | Time to recover from failure
records_reprocessed    | Records replayed after recovery
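
If these metrics are scraped into Prometheus (an assumption about your monitoring setup, not something this page specifies), they can be read back over Prometheus' standard HTTP query API. The metric names below come from the table above; the server address is a placeholder.

import requests

PROMETHEUS_URL = "http://localhost:9090"   # placeholder Prometheus address

def latest_value(metric):
    """Fetch the most recent sample for a metric via /api/v1/query."""
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": metric})
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    return float(results[0]["value"][1]) if results else None

failures = latest_value("checkpoint_failures")
if failures and failures > 0:
    print(f"warning: {failures:.0f} failed checkpoints")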