Storage

LiteJoin uses SQLite as its primary data store, with optional tiered storage for long-term retention.

SQLite Sharding

Data is distributed across multiple SQLite databases using FNV hashing on the message key:
```yaml
storage:
  shard_count: 8       # Number of SQLite shards
  data_dir: "./data"   # Directory for shard files
  reader_pool_size: 4  # Reader connections per shard
```
Each shard is a separate `.db` file in WAL mode with a single writer connection and a configurable pool of reader connections.

| Setting | Default | Description |
| --- | --- | --- |
| `shard_count` | `8` | Number of SQLite database shards. More shards = better write parallelism. |
| `data_dir` | `./data` | Directory for all data files. |
| `reader_pool_size` | `4` | Read connections per shard for concurrent queries. |
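The key-to-shard mapping can be sketched in Go using the standard library's FNV hash. The source only says "FNV hashing", so the FNV-1a variant and the `shardFor` helper name are assumptions for illustration, not LiteJoin's actual code:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor maps a message key to one of shardCount SQLite shards.
// Hypothetical helper: LiteJoin's real function may differ, but any
// deterministic hash-mod scheme distributes keys the same way.
func shardFor(key string, shardCount uint32) uint32 {
	h := fnv.New32a() // FNV-1a, assumed variant
	h.Write([]byte(key))
	return h.Sum32() % shardCount
}

func main() {
	for _, k := range []string{"order-1", "order-2", "user-42"} {
		fmt.Printf("%s -> shard %d\n", k, shardFor(k, 8))
	}
}
```

Because the mapping is deterministic, all writes and reads for a given key always land on the same shard, which is what lets each shard get by with a single writer connection.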

Retention

Data older than the retention duration is periodically deleted:
```yaml
retention:
  duration: 24h        # Keep data for 24 hours
  clean_interval: 1m   # Check every minute
```
When tiered storage is disabled, deleted data is permanently lost. Set retention based on your downstream query needs.

Tiered Storage (Optional)

When enabled, LiteJoin compacts expired data into Parquet files before deleting from SQLite. These files are queryable via an embedded DuckDB instance and can optionally be uploaded to cloud storage.

How It Works

SQLite (hot) → Compactor → Parquet (warm) → Uploader → Cloud Storage (cold)
  1. Retention fires — rows older than the TTL are eligible for compaction.
  2. Compactor reads rows from SQLite, writes them to Parquet files with Snappy compression.
  3. Rows are deleted from SQLite, reclaiming space.
  4. DuckDB queries Parquet files for historical data.
  5. Uploader (optional) copies Parquet files to S3/GCS/Azure Blob Storage.

Configuration

```yaml
storage:
  archive:
    enabled: true
    compaction_interval: 1m
    target_file_size: 128MB
    compression: snappy          # snappy | zstd | none
    duckdb_memory_limit: 256MB
    local_retention: 168h        # Keep local Parquet for 7 days

    cloud:
      enabled: false
      provider: s3               # s3 | gcs | azure
      bucket: my-litejoin-archive
      prefix: litejoin/
      region: us-east-1
      upload_concurrency: 4
```

Archive Config Reference

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `enabled` | bool | `false` | Enable tiered storage. |
| `compaction_interval` | duration | `1m` | How often compaction runs. |
| `target_file_size` | string | `128MB` | Target Parquet file size. |
| `compression` | string | `snappy` | Parquet compression codec. |
| `duckdb_memory_limit` | string | `256MB` | Max memory for DuckDB queries. |
| `duckdb_threads` | int | `0` | DuckDB threads. `0` = match GOMAXPROCS. |
| `local_retention` | duration | `168h` | How long to keep local Parquet files. |

Cloud Config Reference

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `cloud.enabled` | bool | `false` | Enable cloud upload. |
| `cloud.provider` | string | – | `s3`, `gcs`, or `azure`. |
| `cloud.bucket` | string | – | Bucket name. |
| `cloud.prefix` | string | – | Key prefix within bucket. |
| `cloud.region` | string | – | Cloud region. |
| `cloud.upload_concurrency` | int | `4` | Parallel upload workers. |
| `cloud.upload_timeout` | duration | `5m` | Per-file upload timeout. |

Data Lifecycle Example

Given `retention.duration: 1h` and `archive.local_retention: 168h`:

| Time | Tier | State |
| --- | --- | --- |
| t=0 | SQLite | Written, available for real-time joins |
| t=1h | Parquet (local) | Compacted from SQLite, queryable via DuckDB |
| t=1h+30s | Parquet + Cloud | Uploaded to S3 (if enabled) |
| t=7d | Cloud only | Local Parquet evicted |
| t=∞ | Cloud | Retained indefinitely |

Querying Historical Data

Historical data is queryable via the Snapshot API. When a `from` parameter extends beyond the retention window, the snapshot handler automatically queries Parquet files via DuckDB.

The hot path (real-time joins) has zero overhead from tiered storage; DuckDB is only used for historical queries.
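Under the hood, such a historical lookup is a DuckDB scan over the archived files. The sketch below only builds the SQL string; the `ts` column and archive path are assumptions, while `read_parquet` with a glob pattern is standard DuckDB syntax:

```go
package main

import "fmt"

// historicalQuery builds the kind of read_parquet scan DuckDB would run when
// a snapshot request reaches past the SQLite retention window. Column name,
// path layout, and this helper are hypothetical, not LiteJoin's actual code.
func historicalQuery(archiveDir string, fromUnix, toUnix int64) string {
	return fmt.Sprintf(
		"SELECT * FROM read_parquet('%s/*.parquet') WHERE ts >= %d AND ts < %d",
		archiveDir, fromUnix, toUnix)
}

func main() {
	fmt.Println(historicalQuery("./data/archive", 1700000000, 1700003600))
}
```

Because DuckDB reads the Parquet files directly, no data is copied back into SQLite to serve historical queries.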