Monitoring & Observability
The monitoring stack provides full-stack observability: metrics, logs, alerts, and Kubernetes-level automation. Every service in the cluster exposes Prometheus metrics and ships its logs to Loki, and each is covered by Prometheus alerting rules whose alerts are routed through AlertManager.
Stack Overview
flowchart TB
subgraph Sources
Pods["All Pods\n(metrics endpoints)"]
Nodes["Nodes\n(node-exporter)"]
K8sAPI["Kubernetes API\n(kube-state-metrics)"]
Speedtest["Speedtest\n(internet monitoring)"]
end
subgraph Collection
Prometheus["Prometheus\n(metrics scrape)"]
Promtail["Promtail\n(log shipper, per node)"]
end
subgraph Storage
Loki["Loki\n(log storage)"]
PromDB["Prometheus TSDB\n(metric storage)"]
end
subgraph Visualization
Grafana["Grafana\n(dashboards)"]
end
subgraph Alerting
AlertManager["AlertManager"]
Robusta["Robusta\n(enrichment + automation)"]
Notify["Notifications\n(Slack, email, etc.)"]
end
Pods --> Prometheus
Nodes --> Prometheus
K8sAPI --> Prometheus
Speedtest --> Prometheus
Pods -->|"stdout/stderr"| Promtail
Promtail --> Loki
Prometheus --> PromDB
PromDB --> Grafana
Loki --> Grafana
PromDB --> AlertManager
AlertManager --> Robusta
Robusta --> Notify
Services
Prometheus + AlertManager
Prometheus scrapes metrics from every service that exposes a /metrics endpoint, as well as from node-exporter (7 nodes) and kube-state-metrics. AlertManager routes firing alerts through configurable receivers. All scrape targets are auto-discovered via ServiceMonitor resources.
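As a rough illustration of the ServiceMonitor-based discovery described above, the sketch below builds a minimal ServiceMonitor manifest as a Python dict and prints it as JSON, which `kubectl` accepts. The resource name, namespace, `app` label, and port name are hypothetical placeholders, not values taken from this cluster.

```python
import json


def service_monitor(name, namespace, app_label, port_name="metrics",
                    path="/metrics", interval="30s"):
    """Build a minimal ServiceMonitor manifest as a plain dict."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "ServiceMonitor",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Select Services carrying this label (hypothetical key and value).
            "selector": {"matchLabels": {"app": app_label}},
            # Scrape the named Service port at the given path and interval.
            "endpoints": [
                {"port": port_name, "path": path, "interval": interval}
            ],
        },
    }


if __name__ == "__main__":
    manifest = service_monitor("example-app", "default", "example-app")
    print(json.dumps(manifest, indent=2))
```

Depending on how the Prometheus Operator is configured, the ServiceMonitor may also need metadata labels that match the operator's `serviceMonitorSelector` before it is picked up.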
Custom exporters running:
- node-exporter-textfiles: custom metrics collected via shell scripts, exposed in the Prometheus textfile format (custom open-source project)
- speedtest-exporter: periodic internet speed test results as metrics
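For readers unfamiliar with the textfile format, here is a minimal sketch of what a collector script has to produce. The project itself uses shell scripts; this Python version only illustrates the output format and an atomic write. The metric name, label, and file name in the test-style usage are hypothetical, and label values are not escaped here.

```python
import os
import tempfile


def render_gauge(name, help_text, samples):
    """Render one gauge in the Prometheus text exposition format.

    samples is a list of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        if labels:
            body = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{body}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


def write_atomically(directory, filename, text):
    """Write to a temp file, then rename, so a scrape never reads a partial file."""
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(text)
    # mkstemp creates the file as 0600; relax it so node-exporter can read it.
    os.chmod(tmp_path, 0o644)
    os.replace(tmp_path, os.path.join(directory, filename))
```

A script would call `render_gauge(...)` for each metric and pass the joined text to `write_atomically`, using a file name ending in `.prom` inside the directory that node-exporter's textfile collector is configured to read.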
Grafana
Grafana provides dashboards for every layer of the stack, with SSO via OAuth2 / Dex OIDC. Data sources: Prometheus (metrics) and Loki (logs).
- Cluster overview — node CPU, memory, network, disk (from kube-prometheus-stack defaults)
- Data services — Airflow, Trino, Redpanda, PostgreSQL, Spark custom dashboards
- Application metrics — per-namespace resource usage
- Internet performance — Speedtest results over time
Loki + Promtail
Promtail runs on every node as a DaemonSet, tailing all pod log files and shipping them to Loki with namespace, pod, and container labels. Loki stores logs in Garage (S3-compatible object storage) for long-term retention, and all logs are queryable from Grafana using LogQL. Components: Loki (2 pods + gateway), Promtail (DaemonSet, 1 per node).
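To make the label-based querying concrete, the following sketch builds a LogQL query from the labels Promtail attaches (namespace, pod, container) and a URL for Loki's `query_range` HTTP endpoint. The base URL and the example namespace and container names are hypothetical, and the helper does no escaping of quotes inside values.

```python
from urllib.parse import urlencode


def logql_selector(namespace, pod=None, container=None, contains=None):
    """Build a LogQL query from namespace, pod, and container labels."""
    matchers = [f'namespace="{namespace}"']
    if pod:
        matchers.append(f'pod="{pod}"')
    if container:
        matchers.append(f'container="{container}"')
    query = "{" + ", ".join(matchers) + "}"
    if contains:
        # Line filter: keep only log lines containing this substring.
        query += f' |= "{contains}"'
    return query


def query_range_url(base_url, query, limit=100):
    """Build a URL for Loki's /loki/api/v1/query_range endpoint."""
    params = urlencode({"query": query, "limit": limit})
    return f"{base_url}/loki/api/v1/query_range?{params}"


if __name__ == "__main__":
    q = logql_selector("airflow", container="scheduler", contains="error")
    print(q)
    print(query_range_url("http://loki-gateway.example", q, limit=50))
```

The same query string can be pasted into Grafana's Explore view with the Loki data source selected.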
Robusta
Robusta acts as a smart AlertManager webhook receiver. When an alert fires, Robusta:
- Enriches it — attaches pod logs, recent events, resource graphs automatically
- Routes it — sends enriched notifications to Slack/Teams/email with all context
- Can remediate — configured playbooks can automatically restart pods, scale deployments, or run diagnostic commands
Because every notification arrives with this context attached, responders spend less time collecting logs and events by hand, which helps reduce alert fatigue.
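The sketch below shows the kind of JSON payload AlertManager posts to a webhook receiver and how a receiver might condense it into one line per alert. It is not Robusta's implementation, only an illustration of the input such a receiver works from; the receiver name and the alert values in `SAMPLE` are made up for the example.

```python
def summarize_webhook(payload):
    """Return one summary line per alert in an AlertManager webhook payload."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        lines.append(
            "[{status}] {name} ns={ns} pod={pod}: {summary}".format(
                status=alert.get("status", "unknown"),
                name=labels.get("alertname", "?"),
                ns=labels.get("namespace", "-"),
                pod=labels.get("pod", "-"),
                summary=annotations.get("summary", ""),
            )
        )
    return lines


# Illustrative payload; all values are invented for this example.
SAMPLE = {
    "version": "4",
    "status": "firing",
    "receiver": "robusta",
    "alerts": [
        {
            "status": "firing",
            "labels": {
                "alertname": "KubePodCrashLooping",
                "namespace": "data",
                "pod": "trino-worker-0",
            },
            "annotations": {"summary": "Pod is crash looping."},
            "startsAt": "2024-01-01T00:00:00Z",
        }
    ],
}

if __name__ == "__main__":
    for line in summarize_webhook(SAMPLE):
        print(line)
```

An enrichment step would then use the `namespace` and `pod` labels to look up logs and events for the affected workload.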
Speedtest Exporter
Runs Speedtest CLI periodically and exposes download speed, upload speed, ping, and jitter as Prometheus metrics. Grafana dashboards visualize internet performance trends over time — useful for detecting ISP issues or home network degradation before they affect cluster services.
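A minimal sketch of the conversion step follows. The source does not say which Speedtest CLI or output format the exporter uses, so this assumes a JSON result shaped like the Ookla CLI's `--format=json` output, where `bandwidth` is reported in bytes per second. The metric names are hypothetical.

```python
def speedtest_to_samples(result):
    """Map a Speedtest JSON result to metric name/value pairs.

    Assumes Ookla-style fields: bandwidth in bytes/s, latency and jitter in ms.
    """
    return {
        "speedtest_download_bits_per_second": result["download"]["bandwidth"] * 8,
        "speedtest_upload_bits_per_second": result["upload"]["bandwidth"] * 8,
        "speedtest_ping_latency_milliseconds": result["ping"]["latency"],
        "speedtest_jitter_latency_milliseconds": result["ping"]["jitter"],
    }


def render_samples(samples):
    """Render samples as label-free lines in the Prometheus text format."""
    return "".join(f"{name} {value}\n" for name, value in samples.items())


# Illustrative result; the numbers are invented for this example.
SAMPLE_RESULT = {
    "ping": {"latency": 12.5, "jitter": 1.5},
    "download": {"bandwidth": 12500000},
    "upload": {"bandwidth": 2500000},
}

if __name__ == "__main__":
    print(render_samples(speedtest_to_samples(SAMPLE_RESULT)), end="")
```

Converting to bits per second at export time keeps the Grafana panels free of unit arithmetic.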