# Log Search (Loki + Grafana)
Persistent log aggregation for the production server. Logs survive container redeploys, blue/green flips, and Loki/Grafana restarts. Retention: 30 days.
## What problem this solves
Before this stack: every container redeploy wiped the previous container's stdout. `docker logs metadata-rest-blue` would only show output from the current incarnation; if you wanted to debug something that happened before yesterday's deploy, the logs were gone. Cross-blue/green correlation meant running `docker logs` against two containers separately and squinting.
After: logs are persisted on a host volume, queryable through a single Grafana UI, and labelled by container, service, and deploy-color so blue↔green is one filter away.
## Deployment
The Loki/Promtail/Grafana stack is wired into the standard deploy playbook — there's no separate "set up logging" step. Every `ansible-playbook -i inventory.yml playbook-deploy.yml` run:

- Rsyncs `monitoring/*.yml` configs from the local repo to `/opt/schemastack/monitoring/` on the server.
- Idempotently brings up `loki`, `promtail`, `grafana` (no-op if already running).
- Force-recreates them if any config file changed (registered via `monitoring_config.changed`, same pattern Traefik uses).
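A rough sketch of what those playbook tasks could look like — task names, paths, and the module choices here are illustrative, not the actual playbook:

```yaml
# Illustrative only — mirrors the pattern described above, not the real playbook.
- name: Sync monitoring configs to the server
  ansible.posix.synchronize:
    src: monitoring/
    dest: /opt/schemastack/monitoring/
  register: monitoring_config

- name: Bring up Loki / Promtail / Grafana
  community.docker.docker_compose_v2:
    project_src: /opt/schemastack
    services: [loki, promtail, grafana]
    state: present
    # Force-recreate only when a config file actually changed
    recreate: "{{ 'always' if monitoring_config.changed else 'auto' }}"
```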
First deploy after the stack lands pulls the three Docker images (~200 MB total, one-time), creates the loki-data and grafana-data named volumes, and starts the containers. Subsequent deploys are no-ops for this stack unless you've edited a config file.
GRAFANA_ADMIN_PASSWORD lives in .env.production.local (vault-encrypted). The compose file uses Docker Compose's strict-required syntax — ${GRAFANA_ADMIN_PASSWORD:?...} — so if the var ever goes missing the deploy fails fast rather than booting Grafana with a default password.
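In the compose file that looks roughly like this (the service excerpt is illustrative; `GF_SECURITY_ADMIN_PASSWORD` is Grafana's standard admin-password env var):

```yaml
# ${VAR:?msg} makes compose abort if the var is unset or empty
grafana:
  image: grafana/grafana-oss:latest
  environment:
    GF_SECURITY_ADMIN_USER: admin
    GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_ADMIN_PASSWORD:?GRAFANA_ADMIN_PASSWORD must be set in .env.production.local}
```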
## Architecture
```
┌── Hetzner CAX21 ───────────────────────────────────────┐
│                                                        │
│  All app containers (metadata-rest, consumer-worker,   │
│  processor, workspace-api, traefik, postgres, …)       │
│        │ stdout / stderr                               │
│        ▼                                               │
│  Promtail container                                    │
│   ├─ scrapes Docker socket every 5s                    │
│   ├─ tags lines with container, image, deploy_color    │
│   └─ pushes to Loki at http://loki:3100                │
│        │                                               │
│        ▼                                               │
│  Loki container                                        │
│   ├─ tsdb indices on /loki/index                       │
│   ├─ filesystem chunks on /loki/chunks                 │
│   ├─ docker volume `loki-data` mounted at /loki        │
│   └─ compactor enforces 30-day retention               │
│        │                                               │
│        ▼                                               │
│  Grafana container                                     │
│   ├─ datasource auto-provisioned to Loki               │
│   ├─ docker volume `grafana-data` for dashboards/users │
│   └─ binds to 127.0.0.1:3000 (SSH tunnel only)         │
└────────────────────────────────────────────────────────┘
```

All three containers are on the internal Docker network. None expose a public port; only Grafana exposes anything at all, and it's localhost-bound (same pattern as Dozzle).
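In compose terms, the exposure model might look like this — the network name and service details are illustrative; the point is that only Grafana publishes a port, and only on loopback:

```yaml
services:
  loki:
    networks: [internal]        # reachable as http://loki:3100 inside the network only
  promtail:
    networks: [internal]
  grafana:
    networks: [internal]
    ports:
      - "127.0.0.1:3000:3000"   # loopback-only publish — SSH tunnel required from outside

networks:
  internal:                     # name is an assumption; the stack just needs a shared network
```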
## Persistence model
| What | Where | Survives container redeploy? |
|---|---|---|
| Log chunks | Docker volume `loki-data` (`/var/lib/docker/volumes/...`) | ✅ |
| Loki indices | Same volume | ✅ |
| Grafana dashboards / saved queries / users | Docker volume `grafana-data` | ✅ |
| Promtail position file | `/tmp` inside container | ❌ — but Loki dedupes on re-ingestion, so worst case is brief log replay on restart |
Named Docker volumes survive `docker compose down`/`up` and container recreation. They're only destroyed by an explicit `docker volume rm`. Backing them up isn't part of the pg-backup pipeline — these are operational data, not user data, and a 30-day rolling window is fine to lose.
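For reference, the volume wiring in compose is roughly the following (mount paths follow the diagram above and Grafana's standard data directory; the actual file may differ):

```yaml
# Named volumes are declared top-level and survive `docker compose down`
services:
  loki:
    volumes:
      - loki-data:/loki                  # chunks + indices
  grafana:
    volumes:
      - grafana-data:/var/lib/grafana    # dashboards, users, saved queries

volumes:
  loki-data:
  grafana-data:
```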
## Access
Same SSH tunnel pattern as Dozzle. From a local machine:
```bash
ssh -L 3000:localhost:3000 deploy@<hetzner-ip>
```

Then open http://localhost:3000 in a browser.
Combine multiple tunnels in one command if you want both Dozzle and Grafana:
```bash
ssh -L 3000:localhost:3000 -L 8888:localhost:8888 deploy@<hetzner-ip>
```

Login: `admin` + the password from `.env.production.local` (`GRAFANA_ADMIN_PASSWORD`). The password is generated at vault-encrypt time and rotated by editing the vault and restarting the grafana container — Grafana picks up env-var changes on restart.
The first time you log in, Grafana will prompt to change the admin password. You can either change it (then update vault to match) or skip the prompt — env-var login still works.
## Querying
Grafana → Explore → Loki datasource (default) → LogQL.
### Common queries
```logql
# All logs from the active blue deployment of metadata-rest
{container="metadata-rest-blue"}

# Errors across all services (case-insensitive match)
{job="docker"} |~ "(?i)error"

# 5xx Traefik responses
{container="traefik"} | json | OriginStatus >= 500

# Login attempts from a specific IP (from Traefik access logs)
{container="traefik"} | json | RequestPath="/api/auth/login" | line_format "{{.ClientHost}} → {{.OriginStatus}}"

# Cross-blue/green view (deploy_color label)
{service="consumer-worker"}                       # both blue + green
{service="consumer-worker", deploy_color="blue"}  # active slot only
```

### Useful filters
Promtail attaches these labels automatically:
| Label | Source | Example |
|---|---|---|
| `container` | Docker container name | `metadata-rest-blue` |
| `service` | `com.docker.compose.service` label | `metadata-rest-blue` |
| `deploy_color` | Custom `deploy.color` label on app containers | `blue`, `green` |
| `stream` | stdout vs stderr | `stdout` |
Cardinality is intentionally kept low — don't add per-request IDs or timestamps as labels. Put those in the log line itself and grep with LogQL filter expressions.
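A condensed sketch of the Promtail scrape config that could produce these labels — the `__meta_docker_*` source labels are real docker_sd meta labels, but the actual rules live in monitoring/promtail-config.yml and may differ:

```yaml
scrape_configs:
  - job_name: docker
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 5s
    relabel_configs:
      - source_labels: ['__meta_docker_container_name']
        regex: '/(.*)'                 # strip the leading slash Docker adds to names
        target_label: container
      - source_labels: ['__meta_docker_container_label_com_docker_compose_service']
        target_label: service
      - source_labels: ['__meta_docker_container_label_deploy_color']
        target_label: deploy_color
```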
## Retention and disk usage
Configured in `monitoring/loki-config.yml`:

```yaml
limits_config:
  retention_period: 720h        # 30 days

compactor:
  retention_enabled: true
  compaction_interval: 10m
  retention_delete_delay: 2h
```

The compactor runs every 10 minutes, deleting chunks older than the retention period. Expect ~5–50 MB/day at current traffic, so the `loki-data` volume settles at ~150 MB to 1.5 GB at steady state — well within Hetzner CAX21 disk headroom.
If you ever need to expand retention, edit the config file, redeploy with the playbook (the `monitoring_config.changed` register triggers a Loki recreate), and the new value applies going forward. Expanding retention can't resurrect chunks that were already deleted, and shrinking retention deletes existing data older than the new window.
## Operational notes
### Why Promtail drops entries older than 1 hour, and why Loki's `max_chunk_age` is 4h
Docker's service discovery hands Promtail the full stdout history of every container on first attach — for long-running containers like RabbitMQ or Postgres that's months of logs. Loki has two gates that reject most of it:
- Distributor rejects entries older than `reject_old_samples_max_age` (we set this to match the 30-day retention window).
- Ingester rejects entries more than ~(`max_chunk_age` − `chunk_idle_period`) behind the stream head once any recent entry has been accepted. This is by design — Loki streams are append-only with bounded out-of-order tolerance.
The two settings have to be aligned; otherwise you get errors at the seam. Our current values:
| Setting | Value | Where |
|---|---|---|
| Promtail `drop.older_than` | 1h | edge filter — anything older never leaves Promtail |
| Loki `max_chunk_age` | 4h | upper bound on open chunk lifetime |
| Loki `chunk_idle_period` | 30m | idle close threshold |
| Effective ingester window | ~3h 30m | derived from above |
Loki's window must be wider than Promtail's drop threshold; otherwise slightly-stale batches that Promtail still considers "fresh" get rejected by Loki. The 1h-vs-4h gap gives plenty of buffer for batch latency and clock skew.
Memory cost of the larger max_chunk_age is negligible at our log volume — chunks flush on size before they hit the age limit anyway.
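Roughly how those values might appear in the two config files — the Promtail `drop` stage and the Loki `limits_config`/`ingester` keys are real options; the exact placement in our files is illustrative:

```yaml
# monitoring/promtail-config.yml — edge filter
pipeline_stages:
  - drop:
      older_than: 1h                     # anything older never leaves Promtail

# monitoring/loki-config.yml — ingestion gates
limits_config:
  reject_old_samples: true
  reject_old_samples_max_age: 720h       # matches the 30-day retention window
ingester:
  max_chunk_age: 4h
  chunk_idle_period: 30m
```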
Practical implication: log search starts fresh from when the Loki stack is deployed. You won't be able to look back at logs from before that moment. Going forward, full 30-day visibility.
### Promtail missed log lines after Loki restart
Loki dedupes on chunk hash, so Promtail re-pushing a tail of recent logs is harmless. The position file in /tmp resetting only causes a brief replay window. Don't bother persisting it.
### Grafana is sluggish
Grafana's container memory limit is 192 MB. If queries on large time ranges stall, the limit is the first thing to look at — bump it in docker-compose.yml, restart the container. Loki itself is the more common bottleneck; its 384 MB limit is generous for our log volume.
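The limits referenced above would typically sit on the services in the compose file, along these lines (values from the text; key placement is illustrative):

```yaml
# docker-compose.yml excerpt — per-container memory caps
grafana:
  mem_limit: 192m    # bump this first if large time-range queries stall
loki:
  mem_limit: 384m    # generous for current log volume
```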
## Why not Grafana Cloud
Grafana Cloud's free tier covers 50 GB ingestion/month and 14d retention — easily within our envelope. We chose self-hosted because:
- We're already running everything else self-hosted; one more container is a smaller cognitive load than a new SaaS account.
- No external dependency for log lookup. SSH tunnel works whenever the box is up; Grafana Cloud requires the public internet plus their service being up.
- Zero vendor lock-in on log queries (LogQL is the same either way, but no re-onboarding cost if Grafana Cloud changes pricing).
The trade-off: logs die with the box. Per the architecture decision elsewhere, that's acceptable at this scale — backups protect data, not logs.
## Security posture
- The `127.0.0.1:3000` binding means Grafana is unreachable from the public internet. Only ports 80/443 (Traefik) and 22 (SSH) are exposed on the Hetzner firewall.
- Anyone with SSH access (the `deploy` user) can reach Grafana via the tunnel; same trust boundary as Dozzle.
- Grafana's `GF_USERS_ALLOW_SIGN_UP=false` and `GF_AUTH_ANONYMOUS_ENABLED=false` mean only the admin account exists. Add additional Grafana users via the UI if needed (not currently used).
- Anonymous telemetry is off (`GF_ANALYTICS_REPORTING_ENABLED=false`) — Grafana doesn't phone home.
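As a compose excerpt, those settings look roughly like this — all three env vars are standard Grafana settings; the surrounding service definition is illustrative:

```yaml
grafana:
  environment:
    GF_USERS_ALLOW_SIGN_UP: "false"            # no self-service account creation
    GF_AUTH_ANONYMOUS_ENABLED: "false"         # no anonymous viewing
    GF_ANALYTICS_REPORTING_ENABLED: "false"    # no usage telemetry sent to Grafana Labs
```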
## Related
- `monitoring/loki-config.yml` — Loki server config (retention, storage paths).
- `monitoring/promtail-config.yml` — scrape config + label rules.
- `monitoring/grafana-datasources.yml` — provisioning file that auto-wires the Loki datasource on first Grafana startup.
- Dozzle is not going away — it's still useful for live tailing a single container without leaving the terminal. Grafana wins for searching, time-range queries, and cross-container correlation.
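For orientation, a minimal datasource provisioning file of the kind `monitoring/grafana-datasources.yml` describes typically looks like this (the filename is ours; the schema is Grafana's standard provisioning format):

```yaml
# Mounted into /etc/grafana/provisioning/datasources/ — applied on Grafana startup
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    isDefault: true
```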