Data Services
The data stack is the heart of DataHub.local. It provides a complete, end-to-end data platform: ingestion, streaming, storage, transformation, querying, and visualization — all running on the homelab cluster.
Stack Overview
flowchart LR
subgraph Ingestion
Airflow["Apache Airflow\n(Orchestration)"]
Spark["Apache Spark\n(Batch Processing)"]
Redpanda["Redpanda\n(Event Streaming)"]
end
subgraph Storage
Garage["Garage S3\n(Object Store / Data Lake)"]
PG["PostgreSQL\n(Relational DB)"]
Nessie["Project Nessie\n(Iceberg Catalog)"]
Valkey["Valkey\n(In-Memory Cache)"]
end
subgraph Query & Viz
Trino["Trino\n(SQL Query Engine)"]
Superset["Apache Superset\n(Dashboards)"]
end
Airflow -->|"trigger"| Spark
Airflow -->|"trigger"| Redpanda
Spark -->|"write"| Garage
Redpanda -->|"consume"| Spark
Garage -->|"Iceberg tables"| Nessie
Nessie -->|"catalog"| Trino
Garage -->|"query"| Trino
PG -->|"query"| Trino
Trino -->|"SQL results"| Superset
Airflow -->|"metadata"| PG
Services
Apache Airflow
Airflow is the orchestration backbone. All data pipeline schedules — ETL jobs, Spark submissions, API pulls — are defined as DAGs in Python and synced via GitSync from the datahub-local-workflows repository. Components: API server, DAG processor, scheduler, triggerer, worker, and a CloudNative PostgreSQL backend.
Trino
Trino is the query layer of the data lakehouse. It federates queries across all data sources, so no ETL is needed to join a PostgreSQL table with an Iceberg table stored in Garage S3. Connectors: Iceberg (via Nessie), PostgreSQL, and Hive; data in Garage S3 is read through the Iceberg and Hive connectors rather than a standalone S3 connector. Used for ad-hoc analytics and as the backend for Superset dashboards.
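As a rough illustration of the federation described above, the sketch below builds a single query that joins a PostgreSQL table with an Iceberg table. The catalog names (`postgresql`, `iceberg`), schemas, tables, columns, host, and user are hypothetical placeholders, not values from this deployment. The optional `run_query` helper assumes the `trino` Python client is installed.

```python
def build_federated_query(pg_catalog="postgresql", lake_catalog="iceberg"):
    """Return SQL that joins a PostgreSQL table with an Iceberg table.

    Trino addresses tables as catalog.schema.table, so a cross-source
    join is ordinary SQL. All names below are hypothetical.
    """
    return (
        "SELECT c.customer_id, c.name, sum(o.amount) AS total_spent\n"
        f"FROM {pg_catalog}.public.customers AS c\n"
        f"JOIN {lake_catalog}.sales.orders AS o\n"
        "  ON o.customer_id = c.customer_id\n"
        "GROUP BY c.customer_id, c.name\n"
        "ORDER BY total_spent DESC\n"
        "LIMIT 10"
    )


def run_query(sql, host="trino.example.internal", port=8080, user="analyst"):
    """Execute SQL through the `trino` client (assumed to be installed)."""
    # Imported lazily so the query builder works without the client.
    from trino.dbapi import connect

    conn = connect(host=host, port=port, user=user)
    cur = conn.cursor()
    cur.execute(sql)
    return cur.fetchall()


if __name__ == "__main__":
    print(build_federated_query())
```

The same SQL could be pasted into Superset's SQL Lab, since Superset reaches the lakehouse through Trino.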
Apache Superset
Superset provides a self-hosted alternative to Tableau or Looker. Charts, dashboards, and SQL Lab queries connect through the Trino query layer, giving access to the full data lakehouse without leaving the browser. Async workers and a Valkey (Redis-compatible) cache handle heavy query loads.
Redpanda
Redpanda is a Kafka API-compatible streaming platform shipped as a single binary, used here as a leaner replacement for Kafka. It streams events between services, ingests data in real time, and serves as the backbone for any streaming pipeline. The included Console UI makes it easy to inspect topics and consumer groups. Deployed as a 3-broker cluster.
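Because Redpanda speaks the Kafka protocol, a standard Kafka client should be able to publish to it. The following is a minimal sketch, assuming the `kafka-python` package is installed; the broker addresses and the idea of JSON-encoded events are illustrative assumptions, not settings from this cluster.

```python
import json

# Hypothetical broker addresses for the 3-broker cluster.
BOOTSTRAP_SERVERS = [
    "redpanda-0.example.internal:9092",
    "redpanda-1.example.internal:9092",
    "redpanda-2.example.internal:9092",
]


def encode_event(event):
    """Serialize an event dict to compact, key-sorted UTF-8 JSON bytes."""
    return json.dumps(event, separators=(",", ":"), sort_keys=True).encode("utf-8")


def publish(topic, events, servers=BOOTSTRAP_SERVERS):
    """Send events to a topic using kafka-python (assumed to be installed)."""
    # Imported lazily: only needed when actually publishing.
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers=servers,
        value_serializer=encode_event,
    )
    for event in events:
        producer.send(topic, value=event)
    producer.flush()  # block until buffered messages are sent
    producer.close()
```

Listing all three brokers as bootstrap servers lets the client connect even when one of them is unavailable.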
Project Nessie
Nessie acts as the catalog for all Iceberg tables stored in Garage S3. It enables table versioning — you can create branches of your data, experiment with transformations, and merge changes back, just like Git. Trino and Spark both connect to Nessie as their Iceberg catalog; PostgreSQL serves as the persistence backend.
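The branch, experiment, and merge cycle described above can be sketched as a list of Spark SQL statements. This assumes the Nessie Spark SQL extensions are enabled on the Spark session; the catalog name `nessie`, the branch name, and the `sales.orders` table are hypothetical.

```python
def branch_workflow(catalog="nessie", branch="experiment", base="main"):
    """Return Spark SQL statements for a branch, change, merge cycle."""
    return [
        # Create an isolated branch starting from the base reference.
        f"CREATE BRANCH IF NOT EXISTS {branch} IN {catalog} FROM {base}",
        # Point the session at the branch so later writes land there.
        f"USE REFERENCE {branch} IN {catalog}",
        # Hypothetical change, visible only on the branch until merged.
        f"DELETE FROM {catalog}.sales.orders WHERE amount < 0",
        # Publish the change to the base reference.
        f"MERGE BRANCH {branch} INTO {base} IN {catalog}",
    ]


def run_workflow(spark, statements):
    """Run each statement on an existing SparkSession-like object."""
    for statement in statements:
        spark.sql(statement)
```

Readers of `main` would keep seeing the old table state until the merge statement runs.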
CloudNative PostgreSQL (CNPG)
CNPG is a Kubernetes-native PostgreSQL operator with automatic failover, streaming replication, and backup integration. Multiple PgBouncer connection poolers ensure efficient connection management. The cluster serves as the metadata store for Airflow, Superset config, and the Nessie catalog.
Garage (S3-compatible Object Storage)
Garage is a lightweight, distributed object storage system. It stores all Iceberg table data, Spark outputs, ML model artifacts, and Velero backups. Compatible with any S3 client — boto3, Spark S3A connector, ArgoCD artifacts, etc. Deployed as a 3-node cluster with a web UI.
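To show what "compatible with any S3 client" could look like in practice, here is a small boto3 sketch. The endpoint URL, credentials, and bucket name are placeholders, and the `garage` region is an assumption that must match whatever region the Garage configuration actually declares. `list_keys` assumes `boto3` is installed.

```python
def garage_client_kwargs(endpoint_url, access_key, secret_key, region="garage"):
    """Build keyword arguments for boto3.client("s3", ...)."""
    return {
        "endpoint_url": endpoint_url,  # point the client at Garage, not AWS
        "aws_access_key_id": access_key,
        "aws_secret_access_key": secret_key,
        "region_name": region,  # assumption: must match Garage's configured region
    }


def list_keys(bucket, prefix="", **client_kwargs):
    """List object keys under a prefix using boto3 (assumed to be installed)."""
    # Imported lazily so the kwargs helper works without boto3.
    import boto3

    s3 = boto3.client("s3", **client_kwargs)
    paginator = s3.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys
```

A call might look like `list_keys("lakehouse", prefix="warehouse/", **garage_client_kwargs(url, key, secret))`, where `lakehouse` is a hypothetical bucket.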
Custom Helm Chart
The garage-helm chart was developed as part of this project and is published open source.
Apache Spark
Spark runs as on-demand jobs via the Spark Operator. Airflow DAGs submit SparkApplication resources for heavy batch processing. The Spark S3A connector writes Iceberg-formatted output directly to Garage, with Nessie managing the catalog metadata.
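The sketch below builds a SparkApplication manifest as a plain Python dict, roughly the kind of resource an Airflow task could submit to the cluster. It is not the project's actual chart output: the namespace, image, service account, Spark version, resource sizes, endpoints, and warehouse path are all hypothetical, and the `sparkConf` keys assume the Iceberg and Nessie runtime jars are present in the image.

```python
import json


def spark_application(name, main_file, image, namespace="spark",
                      nessie_uri="http://nessie.example.internal:19120/api/v2",
                      s3_endpoint="http://garage.example.internal:3900",
                      warehouse="s3a://lakehouse/warehouse"):
    """Return a SparkApplication manifest as a dict (all defaults hypothetical)."""
    spark_conf = {
        # Register an Iceberg catalog named "nessie" backed by Nessie.
        "spark.sql.catalog.nessie": "org.apache.iceberg.spark.SparkCatalog",
        "spark.sql.catalog.nessie.catalog-impl": "org.apache.iceberg.nessie.NessieCatalog",
        "spark.sql.catalog.nessie.uri": nessie_uri,
        "spark.sql.catalog.nessie.ref": "main",
        "spark.sql.catalog.nessie.warehouse": warehouse,
        # Route S3A traffic to the Garage endpoint with path-style URLs.
        "spark.hadoop.fs.s3a.endpoint": s3_endpoint,
        "spark.hadoop.fs.s3a.path.style.access": "true",
    }
    return {
        "apiVersion": "sparkoperator.k8s.io/v1beta2",
        "kind": "SparkApplication",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "type": "Python",
            "mode": "cluster",
            "image": image,
            "mainApplicationFile": main_file,
            "sparkVersion": "3.5.1",  # hypothetical version
            "sparkConf": spark_conf,
            "driver": {"cores": 1, "memory": "1g", "serviceAccount": "spark"},
            "executor": {"instances": 2, "cores": 1, "memory": "2g"},
        },
    }


if __name__ == "__main__":
    app = spark_application("daily-orders", "s3a://jobs/daily_orders.py",
                            "example/spark:latest")
    print(json.dumps(app, indent=2))
```

Keeping the manifest as a dict makes it easy to serialize to JSON or YAML before handing it to whichever submission mechanism the DAG uses.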
Custom Helm Chart
The spark-apps-helm chart simplifies deploying SparkApplication resources with shared defaults.
Valkey
Valkey is an open-source, Redis-compatible fork created after the Redis license change. It serves as the cache and Celery broker for Superset's async query workers, providing fast in-memory storage without licensing concerns.
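As an illustration of the cache and broker roles described above, the fragment below shows how a `superset_config.py` might point Superset at Valkey. The host name, port, and database numbers are hypothetical, and the deployment's real configuration may differ.

```python
# Sketch of superset_config.py settings; host and database numbers are hypothetical.
VALKEY_HOST = "valkey.example.internal"
VALKEY_PORT = 6379


def valkey_url(db):
    """Build a connection URL; Valkey speaks the Redis protocol, hence redis://."""
    return f"redis://{VALKEY_HOST}:{VALKEY_PORT}/{db}"


class CeleryConfig:
    # Celery broker and result backend for Superset's async query workers.
    broker_url = valkey_url(0)
    result_backend = valkey_url(1)


CELERY_CONFIG = CeleryConfig

# Cache for Superset metadata and chart data.
CACHE_CONFIG = {
    "CACHE_TYPE": "RedisCache",
    "CACHE_DEFAULT_TIMEOUT": 300,  # seconds
    "CACHE_KEY_PREFIX": "superset_",
    "CACHE_REDIS_URL": valkey_url(2),
}
```

Using separate database numbers for the broker, result backend, and cache keeps their keys from colliding on a single Valkey instance.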
S3-GDrive Gateway
The S3-GDrive Gateway bridges personal Google Drive storage into the data lake. It syncs files, such as Google Sheets exports and Drive documents, into Garage S3 so they can be queried via Trino or processed by Airflow DAGs. This enables personal data workflows without manual export steps.