Storage

LiteJoin uses SQLite as its primary data store, with optional tiered storage for long-term retention.

SQLite Sharding

Data is distributed across multiple SQLite databases using FNV hashing on the message key:
```yaml
storage:
  shard_count: 8       # Number of SQLite shards
  data_dir: "./data"   # Directory for shard files
  reader_pool_size: 4  # Reader connections per shard
```
Each shard is a separate `.db` file in WAL mode with a single writer connection and a configurable pool of reader connections.

| Setting | Default | Description |
| --- | --- | --- |
| `shard_count` | `8` | Number of SQLite database shards. More shards = better write parallelism. |
| `data_dir` | `./data` | Directory for all data files. |
| `reader_pool_size` | `4` | Read connections per shard for concurrent queries. |
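The key-to-shard mapping can be sketched in Go using the standard library's FNV hash. The source only says "FNV hashing", so the FNV-1a variant and the `shardFor` helper name are assumptions for illustration, not LiteJoin's actual code:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor maps a message key to one of shardCount SQLite shards.
// Hypothetical helper: LiteJoin's real function may differ, but any
// deterministic hash-mod scheme distributes keys the same way.
func shardFor(key string, shardCount uint32) uint32 {
	h := fnv.New32a() // FNV-1a, assumed variant
	h.Write([]byte(key))
	return h.Sum32() % shardCount
}

func main() {
	for _, k := range []string{"order-1", "order-2", "user-42"} {
		fmt.Printf("%s -> shard %d\n", k, shardFor(k, 8))
	}
}
```

Because the mapping is deterministic, all writes and reads for a given key always land on the same shard, which is what lets each shard get by with a single writer connection.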

Retention

Data older than the retention duration is periodically deleted:
```yaml
retention:
  duration: 24h        # Keep data for 24 hours
  clean_interval: 1m   # Check every minute
```
When tiered storage is disabled, deleted data is permanently lost. Set retention based on your downstream query needs.

Tiered Storage (Optional)

When enabled, LiteJoin compacts expired data into Parquet files before deleting from SQLite. These files are queryable via an embedded DuckDB instance and can optionally be uploaded to cloud storage.

How It Works

SQLite (hot) → Compactor → Parquet (warm) → Uploader → Cloud Storage (cold)
  1. Retention fires — rows older than the TTL are eligible for compaction.
  2. Compactor reads rows from SQLite, writes them to Parquet files with Snappy compression.
  3. Rows are deleted from SQLite, reclaiming space.
  4. DuckDB queries Parquet files for historical data.
  5. Uploader (optional) copies Parquet files to S3/GCS/Azure Blob Storage.

Configuration

```yaml
storage:
  archive:
    enabled: true
    compaction_interval: 1m
    target_file_size: 128MB
    compression: snappy          # snappy | zstd | none
    duckdb_memory_limit: 256MB
    local_retention: 168h        # Keep local Parquet for 7 days

    cloud:
      enabled: false
      provider: s3               # s3 | gcs | azure
      bucket: my-litejoin-archive
      prefix: litejoin/
      region: us-east-1
      upload_concurrency: 4
```

Archive Config Reference

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `enabled` | bool | `false` | Enable tiered storage. |
| `compaction_interval` | duration | `1m` | How often compaction runs. |
| `target_file_size` | string | `128MB` | Target Parquet file size. |
| `compression` | string | `snappy` | Parquet compression codec. |
| `duckdb_memory_limit` | string | `256MB` | Max memory for DuckDB queries. |
| `duckdb_threads` | int | `0` | DuckDB threads. `0` = match GOMAXPROCS. |
| `local_retention` | duration | `168h` | How long to keep local Parquet files. |

Cloud Config Reference

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `cloud.enabled` | bool | `false` | Enable cloud upload. |
| `cloud.provider` | string | – | `s3`, `gcs`, or `azure`. |
| `cloud.bucket` | string | – | Bucket name. |
| `cloud.prefix` | string | – | Key prefix within bucket. |
| `cloud.region` | string | – | Cloud region. |
| `cloud.upload_concurrency` | int | `4` | Parallel upload workers. |
| `cloud.upload_timeout` | duration | `5m` | Per-file upload timeout. |

Data Lifecycle Example

Given `retention.duration: 1h` and `archive.local_retention: 168h`:

| Time | Tier | State |
| --- | --- | --- |
| t=0 | SQLite | Written, available for real-time joins |
| t=1h | Parquet (local) | Compacted from SQLite, queryable via DuckDB |
| t=1h+30s | Parquet + Cloud | Uploaded to S3 (if enabled) |
| t=7d | Cloud only | Local Parquet evicted |
| t=∞ | Cloud | Retained indefinitely |

Querying Historical Data

Historical data is queryable via the Snapshot API. When a `from` parameter extends beyond the retention window, the snapshot handler automatically queries Parquet files via DuckDB.

The hot path (real-time joins) has zero overhead from tiered storage; DuckDB is only used for historical queries.
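Under the hood, such a historical lookup is a DuckDB scan over the archived files. The sketch below only builds the SQL string; the `ts` column and archive path are assumptions, while `read_parquet` with a glob pattern is standard DuckDB syntax:

```go
package main

import "fmt"

// historicalQuery builds the kind of read_parquet scan DuckDB would run when
// a snapshot request reaches past the SQLite retention window. Column name,
// path layout, and this helper are hypothetical, not LiteJoin's actual code.
func historicalQuery(archiveDir string, fromUnix, toUnix int64) string {
	return fmt.Sprintf(
		"SELECT * FROM read_parquet('%s/*.parquet') WHERE ts >= %d AND ts < %d",
		archiveDir, fromUnix, toUnix)
}

func main() {
	fmt.Println(historicalQuery("./data/archive", 1700000000, 1700003600))
}
```

Because DuckDB reads the Parquet files directly, no data is copied back into SQLite to serve historical queries.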