
Configuration Reference

LiteJoin is configured via a YAML file, typically named litejoin.yaml. This page documents every configuration field.

Top-Level Structure

sources: []      # Data ingestion sources
sinks: []        # Output destinations
storage: {}      # SQLite storage settings
joins: []        # Real-time join queries
windows: []      # Time-based aggregations
retention: {}    # Data retention policy
writer: {}       # Write batching settings
joiner: {}       # Join engine settings
windower: {}     # Window engine settings
delivery: {}     # Delivery guarantee settings

Sources

sources:
  - type: api | http | kafka
    name: "unique-name"
    topic: "topic-name"        # For api sources
    topics: ["topic1"]         # For http/kafka sources
    config: {}                 # Source-specific config
    api: {}                    # API source config (type: api only)

HTTP Source Config

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| config.addr | string | required | Listen address (e.g., :8080). |

Kafka Source Config

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| config.brokers | string | required | Comma-separated broker addresses. |
| config.group_id | string | required | Consumer group ID. |
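
As a sketch, a Kafka source consuming two topics from a local broker might look like the following (the source name, topic names, broker address, and group ID are placeholders):

```yaml
sources:
  - type: kafka
    name: orders-kafka              # placeholder name
    topics: ["orders", "refunds"]   # topics to consume
    config:
      brokers: "localhost:9092"     # comma-separated broker addresses
      group_id: "litejoin-orders"   # consumer group ID
```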

API Source Config

See API Source for the complete api: block reference.

Sinks

sinks:
  - type: http | kafka | sse | sqlite
    name: "unique-name"
    config: {}

HTTP Sink

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| config.url | string | required | Webhook URL. |
| config.timeout | string | 30s | Request timeout. |

Kafka Sink

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| config.brokers | string | required | Broker addresses. |
| config.topic | string | required | Target topic. |
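
A minimal Kafka sink sketch, with a placeholder sink name, broker address, and topic:

```yaml
sinks:
  - type: kafka
    name: enriched-out            # placeholder name
    config:
      brokers: "localhost:9092"   # placeholder broker address
      topic: "enriched-events"    # placeholder target topic
```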

SSE Sink

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| config.addr | string | required | Listen address. |

SQLite Sink

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| config.path | string | required | Database file path. |

Storage

storage:
  shard_count: 8
  data_dir: "./data"
  reader_pool_size: 4
  archive:
    enabled: false
    compaction_interval: 1m
    target_file_size: 128MB
    compression: snappy
    duckdb_memory_limit: 256MB
    duckdb_threads: 0
    local_retention: 168h
    cloud:
      enabled: false
      provider: s3
      bucket: ""
      prefix: ""
      region: ""
      upload_concurrency: 4
      upload_timeout: 5m

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| shard_count | int | 8 | Number of SQLite shards. |
| data_dir | string | ./data | Data directory. |
| reader_pool_size | int | 4 | Reader connections per shard. |
See Storage for archive configuration.

Joins

joins:
  - name: "join-name"
    query: |
      SELECT ...
    sink: "sink-name"
    key_column: "column"     # Optional
    result_key: "alias"      # Optional

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| name | string | yes | Unique join name. |
| query | string | yes | SQL query. |
| sink | string | yes | Target sink name. |
| key_column | string | no | Column for result grouping key. |
| result_key | string | no | Alias for result key in output. |
See Joins for examples.
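
As an illustrative sketch of the optional key_column and result_key fields, the join below groups results by an order_id column and emits the key under the alias order. The topic names, column names, and payload structure are hypothetical:

```yaml
joins:
  - name: order-enrichment
    query: |
      SELECT
        o.key AS order_id,
        o.payload AS order_data
      FROM orders o
      WHERE o.timestamp > (strftime('%s', 'now') - 60)
    sink: webhook
    key_column: order_id   # group results by this query column
    result_key: order      # alias for the key in the output
```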

Windows

windows:
  - name: "window-name"
    type: tumbling | sliding | session
    size: 5m               # tumbling, sliding
    slide: 1m              # sliding only
    gap: 30m               # session only
    topic: "topic-name"
    query: |
      SELECT ...
    sink: "sink-name"

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| name | string | yes | Unique window name. |
| type | string | yes | tumbling, sliding, or session. |
| size | duration | tumbling/sliding | Window size. |
| slide | duration | sliding | Slide interval (must be ≤ size). |
| gap | duration | session | Inactivity gap to close session. |
| topic | string | yes | Topic to aggregate. |
| query | string | yes | SQL aggregation query. |
| sink | string | yes | Target sink name. |
See Windows for examples.
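
A minimal tumbling-window sketch, counting events per minute (the window name, topic, sink, and aggregation query are placeholders):

```yaml
windows:
  - name: charges-per-minute   # placeholder name
    type: tumbling
    size: 1m                   # one window per minute
    topic: charges             # topic to aggregate
    query: |
      SELECT COUNT(*) AS charge_count
      FROM charges
    sink: dashboard
```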

Retention

retention:
  duration: 24h
  clean_interval: 1m

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| duration | duration | 24h | Delete data older than this. |
| clean_interval | duration | 1m | How often the cleaner runs. |

Writer

writer:
  flush_interval: 10ms
  batch_size: 1000

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| flush_interval | duration | 10ms | Time between batch flushes. |
| batch_size | int | 1000 | Max messages per batch. |

Delivery

delivery:
  guarantee: best_effort | at_least_once
  dlq:
    path: "./data/dlq.db"
    retry_interval: 30s
    max_retries: 0
    backoff_multiplier: 2.0
    max_backoff: 5m
    ttl: 72h
    max_size_mb: 500
    cleanup_interval: 1h
See Delivery Guarantees for details.

Environment Variables

All string values support ${ENV_VAR} expansion, resolved at startup:
sources:
  - name: stripe
    type: api
    api:
      url: "https://api.stripe.com/v1/charges"
      headers:
        Authorization: "Bearer ${STRIPE_SECRET_KEY}"
Never commit secrets directly in config files. Use environment variables for all sensitive values.

Complete Example

sources:
  - name: stripe_charges
    type: api
    topic: charges
    api:
      url: "https://api.stripe.com/v1/charges?limit=100"
      interval: 10s
      key_path: "id"
      response_path: "data"
      headers:
        Authorization: "Bearer ${STRIPE_SECRET_KEY}"
      watermark:
        strategy: cursor
        path: "data.@last.id"
        param: "starting_after"

  - name: http_events
    type: http
    topics: []
    config:
      addr: ":8080"

sinks:
  - type: sse
    name: dashboard
    config:
      addr: ":9100"

  - type: http
    name: webhook
    config:
      url: "http://localhost:9000/webhook"

storage:
  shard_count: 8
  data_dir: ./data

writer:
  flush_interval: 10ms
  batch_size: 1000

retention:
  duration: 24h
  clean_interval: 1m

joins:
  - name: charge-enrichment
    query: |
      SELECT
        c.key as charge_id,
        c.payload as charge_data
      FROM charges c
      WHERE c.timestamp > (strftime('%s', 'now') - 60)
    sink: dashboard

delivery:
  guarantee: at_least_once
  dlq:
    retry_interval: 30s
    max_backoff: 5m
    ttl: 72h