Monitoring & Observability

The monitoring stack covers metrics, logs, alerting, and Kubernetes-level automation. Every service in the cluster exposes Prometheus metrics, has its logs shipped to Loki by Promtail, and is covered by Prometheus alerting rules whose alerts are routed through AlertManager.

Stack Overview

flowchart TB
    subgraph Sources
        Pods["All Pods\n(metrics endpoints)"]
        Nodes["Nodes\n(node-exporter)"]
        K8sAPI["Kubernetes API\n(kube-state-metrics)"]
        Speedtest["Speedtest\n(internet monitoring)"]
    end

    subgraph Collection
        Prometheus["Prometheus\n(metrics scrape)"]
        Promtail["Promtail\n(log shipper, per node)"]
    end

    subgraph Storage
        Loki["Loki\n(log storage)"]
        PromDB["Prometheus TSDB\n(metric storage)"]
    end

    subgraph Visualization
        Grafana["Grafana\n(dashboards)"]
    end

    subgraph Alerting
        AlertManager["AlertManager"]
        Robusta["Robusta\n(enrichment + automation)"]
        Notify["Notifications\n(Slack, email, etc.)"]
    end

    Pods --> Prometheus
    Nodes --> Prometheus
    K8sAPI --> Prometheus
    Speedtest --> Prometheus
    Pods -->|"stdout/stderr"| Promtail
    Promtail --> Loki
    Prometheus --> PromDB
    PromDB --> Grafana
    Loki --> Grafana
    Prometheus -->|"firing alerts"| AlertManager
    AlertManager --> Robusta
    Robusta --> Notify

Services

Prometheus + AlertManager

Tags: monitoring, metrics, alerting, time-series

Prometheus scrapes every service that exposes a /metrics endpoint, plus node-exporter (on all 7 nodes) and kube-state-metrics. Scrape targets are discovered automatically through ServiceMonitor resources. AlertManager routes firing alerts to configurable receivers.
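To make ServiceMonitor-based discovery concrete, here is a minimal sketch that builds a ServiceMonitor manifest as a Python dict and prints it as JSON (which kubectl accepts). The service name, namespace, label key, port name, and scrape interval are illustrative assumptions, not values taken from this cluster.

```python
import json


def service_monitor(name, namespace, app_label, port_name="metrics",
                    path="/metrics", interval="30s"):
    """Build a Prometheus Operator ServiceMonitor manifest as a plain dict.

    All argument values are hypothetical; adapt them to the Service being scraped.
    """
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "ServiceMonitor",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Match Services carrying this label (assumed label key).
            "selector": {"matchLabels": {"app.kubernetes.io/name": app_label}},
            # Only look for matching Services in this namespace.
            "namespaceSelector": {"matchNames": [namespace]},
            # 'port' refers to a named port on the Service, not a number.
            "endpoints": [{"port": port_name, "path": path, "interval": interval}],
        },
    }


if __name__ == "__main__":
    print(json.dumps(service_monitor("trino", "data", "trino"), indent=2))
```

Depending on how the Prometheus Operator is configured, the ServiceMonitor may also need metadata labels that match the operator's serviceMonitorSelector before Prometheus picks it up.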

Custom exporters running:

node-exporter-textfiles: shell scripts collect custom metrics and write them in the Prometheus text format for node-exporter's textfile collector (custom open-source project)
speedtest-exporter: runs periodic internet speed tests and exposes the results as metrics
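The textfile approach can be sketched as follows. The project described above uses shell scripts; this Python version is only an illustration of the same idea, and the metric name, labels, and file name are hypothetical. It renders one metric in the Prometheus text exposition format and writes it atomically, so a scrape never reads a half-written file.

```python
import os
import tempfile


def _escape(value):
    """Escape a label value for the Prometheus text exposition format."""
    return str(value).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metric(name, value, help_text, labels=None, metric_type="gauge"):
    """Return HELP, TYPE, and sample lines for a single metric."""
    label_str = ""
    if labels:
        pairs = ",".join(f'{k}="{_escape(v)}"' for k, v in sorted(labels.items()))
        label_str = "{" + pairs + "}"
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} {metric_type}\n"
        f"{name}{label_str} {value}\n"
    )


def write_textfile(directory, filename, body):
    """Write body to directory/filename via a temp file and an atomic rename."""
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(body)
    # mkstemp creates the file as 0600; relax it so another user can read it.
    os.chmod(tmp_path, 0o644)
    os.replace(tmp_path, os.path.join(directory, filename))
```

The target directory would be the one passed to node-exporter's `--collector.textfile.directory` flag, and the file name needs a `.prom` extension for the collector to read it.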

Grafana

Tags: visualization, dashboards, observability

Grafana provides dashboards for every layer of the stack, with SSO via OAuth2 / OIDC through Dex. Its data sources are Prometheus (metrics) and Loki (logs). Dashboards:

Cluster overview: node CPU, memory, network, and disk (kube-prometheus-stack defaults)
Data services: custom dashboards for Airflow, Trino, Redpanda, PostgreSQL, and Spark
Application metrics: per-namespace resource usage
Internet performance: Speedtest results over time

Loki + Promtail

Tags: logging, log-aggregation, observability

Promtail runs as a DaemonSet, one pod per node. It tails every pod's log files and ships them to Loki, labelled with namespace, pod, and container. Loki persists the logs in Garage (S3-compatible object storage) for long-term retention, and they can be queried from Grafana using LogQL.

Components: Loki (2 pods plus a gateway) and Promtail (DaemonSet, 1 pod per node).
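As an illustration of how the namespace, pod, and container labels are used in LogQL, the sketch below builds a stream selector and a URL for Loki's `query_range` HTTP endpoint. The base URL, label values, and filter string are assumptions made for the example; the code only constructs the request and does not send it.

```python
from urllib.parse import urlencode


def logql_selector(**labels):
    """Build a LogQL stream selector such as {namespace="x", pod="y"}."""
    parts = []
    for key, value in sorted(labels.items()):
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        parts.append(f'{key}="{escaped}"')
    return "{" + ", ".join(parts) + "}"


def query_range_url(base_url, query, start_ns, end_ns, limit=100):
    """Return the URL for a Loki range query; times are Unix nanoseconds."""
    params = urlencode(
        {"query": query, "start": start_ns, "end": end_ns, "limit": limit}
    )
    return f"{base_url}/loki/api/v1/query_range?{params}"


if __name__ == "__main__":
    # Lines from one container that contain the substring "error".
    query = logql_selector(namespace="airflow", container="scheduler") + ' |= "error"'
    # Hypothetical in-cluster address of the Loki gateway.
    print(query_range_url("http://loki-gateway.monitoring.svc", query,
                          1700000000000000000, 1700003600000000000))
```

The same query string can be pasted into Grafana's Explore view with the Loki data source selected.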

Robusta

Tags: alerting, kubernetes, automation, incident-response

Robusta receives alerts from AlertManager as a webhook receiver. When an alert fires, Robusta:

Enriches it: automatically attaches pod logs, recent events, and resource graphs
Routes it: sends the enriched notification to Slack, Teams, or email with all context
Can remediate it: configured playbooks can restart pods, scale deployments, or run diagnostic commands

Because each notification arrives with its context already attached, responders do not have to gather logs and events by hand, which helps reduce alert fatigue.
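To show what a webhook receiver has to work with, the sketch below parses an AlertManager webhook payload and produces a one-line summary per alert. It is not Robusta's implementation; the sample alert, namespace, pod name, and receiver name are made up, and only a subset of the payload fields is shown.

```python
import json


def summarize_alert(alert):
    """Turn one alert from the webhook payload into a short text line."""
    labels = alert.get("labels", {})
    annotations = alert.get("annotations", {})
    return "[{status}] {name} in {ns}/{pod}: {summary}".format(
        status=alert.get("status", "unknown").upper(),
        name=labels.get("alertname", "unnamed"),
        ns=labels.get("namespace", "-"),
        pod=labels.get("pod", "-"),
        summary=annotations.get("summary", "no summary"),
    )


def handle_webhook(body):
    """Parse a webhook request body (JSON text) and summarize every alert."""
    payload = json.loads(body)
    return [summarize_alert(alert) for alert in payload.get("alerts", [])]


# Hypothetical payload, trimmed to the fields used above.
SAMPLE_BODY = json.dumps({
    "version": "4",
    "status": "firing",
    "receiver": "robusta",
    "alerts": [{
        "status": "firing",
        "labels": {"alertname": "KubePodCrashLooping",
                   "namespace": "data", "pod": "trino-worker-0"},
        "annotations": {"summary": "Pod is crash looping"},
        "startsAt": "2024-01-01T00:00:00Z",
    }],
})

if __name__ == "__main__":
    for line in handle_webhook(SAMPLE_BODY):
        print(line)
```

The namespace and pod labels are what make enrichment possible: with them, a receiver knows which pod's logs and events to fetch before forwarding the notification.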

Speedtest Exporter

Tags: monitoring, network, metrics, performance

The exporter runs the Speedtest CLI periodically and exposes download speed, upload speed, ping, and jitter as Prometheus metrics. Grafana dashboards chart these values over time, which helps detect ISP issues or home network degradation before they affect cluster services.
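A rough sketch of the conversion step, from a speed test result to Prometheus gauges, might look like the following. It assumes the CLI emits JSON with `ping.latency` and `ping.jitter` in milliseconds and `download.bandwidth` / `upload.bandwidth` in bytes per second; the metric names are hypothetical and are not necessarily the ones the deployed exporter uses.

```python
import json


def speedtest_to_metrics(raw_json):
    """Map a speed test JSON result to metric name -> value (assumed schema)."""
    result = json.loads(raw_json)
    return {
        # Bandwidth is assumed to be bytes/second; convert to bits/second.
        "speedtest_download_bits_per_second": result["download"]["bandwidth"] * 8,
        "speedtest_upload_bits_per_second": result["upload"]["bandwidth"] * 8,
        "speedtest_ping_latency_milliseconds": result["ping"]["latency"],
        "speedtest_jitter_milliseconds": result["ping"]["jitter"],
    }


def to_exposition(metrics):
    """Render the metrics as gauges in the Prometheus text format."""
    return "".join(
        f"# TYPE {name} gauge\n{name} {value}\n"
        for name, value in sorted(metrics.items())
    )


if __name__ == "__main__":
    sample = ('{"ping": {"latency": 12.5, "jitter": 1.2}, '
              '"download": {"bandwidth": 12500000}, '
              '"upload": {"bandwidth": 2500000}}')
    print(to_exposition(speedtest_to_metrics(sample)), end="")
```

Gauges fit here because each run replaces the previous reading rather than accumulating, so a dashboard can plot the raw series directly.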