# Log Search (Loki + Grafana)
Persistent log aggregation for the production server. Logs survive container redeploys, blue/green flips, and Loki/Grafana restarts. Retention: 30 days.
## What problem this solves
Before this stack: every container redeploy wiped the previous container's stdout. `docker logs metadata-rest-blue` would only show output from the current incarnation; if you wanted to debug something that happened before yesterday's deploy, the logs were gone. Cross-blue/green correlation meant running `docker logs` against two containers separately and squinting.
After: logs are persisted on a host volume, queryable through a single Grafana UI, and labelled by container, service, and deploy-color so blue↔green is one filter away.
## Deployment
The Loki/Promtail/Grafana stack is wired into the standard deploy playbook — there's no separate "set up logging" step. Every `ansible-playbook -i inventory.yml playbook-deploy.yml` run:

- Rsyncs `monitoring/*.yml` configs from the local repo to `/opt/schemastack/monitoring/` on the server.
- Idempotently brings up `loki`, `promtail`, `grafana` (no-op if already running).
- Force-recreates them if any config file changed (registered via `monitoring_config.changed`, same pattern Traefik uses).
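A rough sketch of what those playbook tasks could look like — task names, paths, and the module choices here are illustrative, not the actual playbook:

```yaml
# Illustrative only — mirrors the pattern described above, not the real playbook.
- name: Sync monitoring configs to the server
  ansible.posix.synchronize:
    src: monitoring/
    dest: /opt/schemastack/monitoring/
  register: monitoring_config

- name: Bring up Loki / Promtail / Grafana
  community.docker.docker_compose_v2:
    project_src: /opt/schemastack
    services: [loki, promtail, grafana]
    state: present
    # Force-recreate only when a config file actually changed
    recreate: "{{ 'always' if monitoring_config.changed else 'auto' }}"
```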
First deploy after the stack lands pulls the three Docker images (~200 MB total, one-time), creates the loki-data and grafana-data named volumes, and starts the containers. Subsequent deploys are no-ops for this stack unless you've edited a config file.
GRAFANA_ADMIN_PASSWORD lives in .env.production.local (vault-encrypted). The compose file uses Docker Compose's strict-required syntax — ${GRAFANA_ADMIN_PASSWORD:?...} — so if the var ever goes missing the deploy fails fast rather than booting Grafana with a default password.
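In the compose file that looks roughly like this (the service excerpt is illustrative; `GF_SECURITY_ADMIN_PASSWORD` is Grafana's standard admin-password env var):

```yaml
# ${VAR:?msg} makes compose abort if the var is unset or empty
grafana:
  image: grafana/grafana-oss:latest
  environment:
    GF_SECURITY_ADMIN_USER: admin
    GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_ADMIN_PASSWORD:?GRAFANA_ADMIN_PASSWORD must be set in .env.production.local}
```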
## Architecture
```
┌── Hetzner CAX21 ───────────────────────────────────────┐
│                                                        │
│  All app containers (metadata-rest, consumer-worker,   │
│  processor, workspace-api, traefik, postgres, …)       │
│        │ stdout / stderr                               │
│        ▼                                               │
│  Promtail container                                    │
│   ├─ scrapes Docker socket every 5s                    │
│   ├─ tags lines with container, image, deploy_color    │
│   └─ pushes to Loki at http://loki:3100                │
│        │                                               │
│        ▼                                               │
│  Loki container                                        │
│   ├─ tsdb indices on /loki/index                       │
│   ├─ filesystem chunks on /loki/chunks                 │
│   ├─ docker volume `loki-data` mounted at /loki        │
│   └─ compactor enforces 30-day retention               │
│        │                                               │
│        ▼                                               │
│  Grafana container                                     │
│   ├─ datasource auto-provisioned to Loki               │
│   ├─ docker volume `grafana-data` for dashboards/users │
│   └─ binds to 127.0.0.1:3000 (SSH tunnel only)         │
└────────────────────────────────────────────────────────┘
```

All three containers are on the internal Docker network. None expose a public port; only Grafana exposes anything at all, and it's localhost-bound (same pattern as Dozzle).
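In compose terms, the exposure model might look like this — the network name and service details are illustrative; the point is that only Grafana publishes a port, and only on loopback:

```yaml
services:
  loki:
    networks: [internal]        # reachable as http://loki:3100 inside the network only
  promtail:
    networks: [internal]
  grafana:
    networks: [internal]
    ports:
      - "127.0.0.1:3000:3000"   # loopback-only publish — SSH tunnel required from outside

networks:
  internal:                     # name is an assumption; the stack just needs a shared network
```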
## Persistence model
| What | Where | Survives container redeploy? |
|---|---|---|
| Log chunks | Docker volume `loki-data` (`/var/lib/docker/volumes/...`) | ✅ |
| Loki indices | Same volume | ✅ |
| Grafana dashboards / saved queries / users | Docker volume `grafana-data` | ✅ |
| Promtail position file | `/tmp` inside container | ❌ — but Loki dedupes on re-ingestion, so worst case is brief log replay on restart |
Named Docker volumes survive `docker compose down`/`up` and container recreation. They're only destroyed by an explicit `docker volume rm`. Backing them up isn't part of the pg-backup pipeline — these are operational data, not user data, and a 30-day rolling window is fine to lose.
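For reference, the volume wiring in compose is roughly the following (mount paths follow the diagram above and Grafana's standard data directory; the actual file may differ):

```yaml
# Named volumes are declared top-level and survive `docker compose down`
services:
  loki:
    volumes:
      - loki-data:/loki                  # chunks + indices
  grafana:
    volumes:
      - grafana-data:/var/lib/grafana    # dashboards, users, saved queries

volumes:
  loki-data:
  grafana-data:
```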
## Access
Same SSH tunnel pattern as Dozzle. From a local machine:
```bash
ssh -L 3000:localhost:3000 deploy@<hetzner-ip>
```

Then open http://localhost:3000 in a browser.
Combine multiple tunnels in one command if you want both Dozzle and Grafana:
```bash
ssh -L 3000:localhost:3000 -L 8888:localhost:8888 deploy@<hetzner-ip>
```

Login: `admin` + the password from `.env.production.local` (`GRAFANA_ADMIN_PASSWORD`). The password is generated at vault-encrypt time and rotated by editing the vault and restarting the grafana container — Grafana picks up env-var changes on restart.
The first time you log in, Grafana will prompt to change the admin password. You can either change it (then update vault to match) or skip the prompt — env-var login still works.
## Querying
Grafana → Explore → Loki datasource (default) → LogQL.
### Common queries
```logql
# All logs from the active blue deployment of metadata-rest
{container="metadata-rest-blue"}

# Errors across all services (case-insensitive match)
{job="docker"} |~ "(?i)error"

# 5xx Traefik responses
{container="traefik"} | json | OriginStatus >= 500

# Login attempts from a specific IP (from Traefik access logs)
{container="traefik"} | json | RequestPath="/api/auth/login" | line_format "{{.ClientHost}} → {{.OriginStatus}}"

# Cross-blue/green view (deploy_color label)
{service="consumer-worker"}                       # both blue + green
{service="consumer-worker", deploy_color="blue"}  # active slot only
```

### Useful filters
Promtail attaches these labels automatically:
| Label | Source | Example |
|---|---|---|
| `container` | Docker container name | `metadata-rest-blue` |
| `service` | `com.docker.compose.service` label | `metadata-rest-blue` |
| `deploy_color` | Custom `deploy.color` label on app containers | `blue`, `green` |
| `stream` | stdout vs stderr | `stdout` |
Cardinality is intentionally kept low — don't add per-request IDs or timestamps as labels. Put those in the log line itself and grep with LogQL filter expressions.
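A condensed sketch of the Promtail scrape config that could produce these labels — the `__meta_docker_*` source labels are real docker_sd meta labels, but the actual rules live in monitoring/promtail-config.yml and may differ:

```yaml
scrape_configs:
  - job_name: docker
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 5s
    relabel_configs:
      - source_labels: ['__meta_docker_container_name']
        regex: '/(.*)'                 # strip the leading slash Docker adds to names
        target_label: container
      - source_labels: ['__meta_docker_container_label_com_docker_compose_service']
        target_label: service
      - source_labels: ['__meta_docker_container_label_deploy_color']
        target_label: deploy_color
```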
## Retention and disk usage
Configured in `monitoring/loki-config.yml`:

```yaml
limits_config:
  retention_period: 720h        # 30 days

compactor:
  retention_enabled: true
  compaction_interval: 10m
  retention_delete_delay: 2h
```

The compactor runs every 10 minutes, deleting chunks older than the retention period. Expect ~5–50 MB/day at current traffic, so the `loki-data` volume settles at ~150 MB to 1.5 GB at steady state — well within Hetzner CAX21 disk headroom.
If you ever need to expand retention, edit the config file, redeploy with the playbook (the `monitoring_config.changed` register triggers a Loki recreate), and the new value applies going forward. Expanding retention can't resurrect chunks that were already deleted, and shrinking retention deletes existing data older than the new window.
## Operational notes
### Why Promtail drops entries older than 1 hour, and why Loki's `max_chunk_age` is 4h
Docker's service discovery hands Promtail the full stdout history of every container on first attach — for long-running containers like RabbitMQ or Postgres that's months of logs. Loki has two gates that reject most of it:
- Distributor rejects entries older than `reject_old_samples_max_age` (we set this to match the 30-day retention window).
- Ingester rejects entries more than ~(`max_chunk_age` − `chunk_idle_period`) behind the stream head once any recent entry has been accepted. This is by design — Loki streams are append-only with bounded out-of-order tolerance.
The two settings have to be aligned; otherwise you get errors at the seam. Our current values:
| Setting | Value | Where |
|---|---|---|
| Promtail `drop.older_than` | 1h | edge filter — anything older never leaves Promtail |
| Loki `max_chunk_age` | 4h | upper bound on open chunk lifetime |
| Loki `chunk_idle_period` | 30m | idle close threshold |
| Effective ingester window | ~3h 30m | derived from above |
Loki's window must be wider than Promtail's drop threshold; otherwise slightly-stale batches that Promtail still considers "fresh" get rejected by Loki. The 1h-vs-4h gap gives plenty of buffer for batch latency and clock skew.
Memory cost of the larger max_chunk_age is negligible at our log volume — chunks flush on size before they hit the age limit anyway.
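Roughly how those values might appear in the two config files — the Promtail `drop` stage and the Loki `limits_config`/`ingester` keys are real options; the exact placement in our files is illustrative:

```yaml
# monitoring/promtail-config.yml — edge filter
pipeline_stages:
  - drop:
      older_than: 1h                     # anything older never leaves Promtail

# monitoring/loki-config.yml — ingestion gates
limits_config:
  reject_old_samples: true
  reject_old_samples_max_age: 720h       # matches the 30-day retention window
ingester:
  max_chunk_age: 4h
  chunk_idle_period: 30m
```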
Practical implication: log search starts fresh from when the Loki stack is deployed. You won't be able to look back at logs from before that moment. Going forward, full 30-day visibility.
### Promtail missed log lines after Loki restart
Loki dedupes on chunk hash, so Promtail re-pushing a tail of recent logs is harmless. The position file in /tmp resetting only causes a brief replay window. Don't bother persisting it.
### Grafana is sluggish
Grafana's container memory limit is 192 MB. If queries on large time ranges stall, the limit is the first thing to look at — bump it in docker-compose.yml, restart the container. Loki itself is the more common bottleneck; its 384 MB limit is generous for our log volume.
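The limits referenced above would typically sit on the services in the compose file, along these lines (values from the text; key placement is illustrative):

```yaml
# docker-compose.yml excerpt — per-container memory caps
grafana:
  mem_limit: 192m    # bump this first if large time-range queries stall
loki:
  mem_limit: 384m    # generous for current log volume
```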
## Why not Grafana Cloud
Grafana Cloud's free tier covers 50 GB ingestion/month and 14d retention — easily within our envelope. We chose self-hosted because:
- We're already running everything else self-hosted; one more container is a smaller cognitive load than a new SaaS account.
- No external dependency for log lookup. SSH tunnel works whenever the box is up; Grafana Cloud requires the public internet plus their service being up.
- Zero vendor lock-in on log queries (LogQL is the same either way, but no re-onboarding cost if Grafana Cloud changes pricing).
The trade-off: logs die with the box. Per the architecture decision elsewhere, that's acceptable at this scale — backups protect data, not logs.
## Security posture
- The `127.0.0.1:3000` binding means Grafana is unreachable from the public internet. Only ports 80/443 (Traefik) and 22 (SSH) are exposed on the Hetzner firewall.
- Anyone with SSH access (the `deploy` user) can reach Grafana via the tunnel; same trust boundary as Dozzle.
- Grafana's `GF_USERS_ALLOW_SIGN_UP=false` and `GF_AUTH_ANONYMOUS_ENABLED=false` mean only the admin account exists. Add additional Grafana users via the UI if needed (not currently used).
- Anonymous telemetry is off (`GF_ANALYTICS_REPORTING_ENABLED=false`) — Grafana doesn't phone home.
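As a compose excerpt, those settings look roughly like this — all three env vars are standard Grafana settings; the surrounding service definition is illustrative:

```yaml
grafana:
  environment:
    GF_USERS_ALLOW_SIGN_UP: "false"            # no self-service account creation
    GF_AUTH_ANONYMOUS_ENABLED: "false"         # no anonymous viewing
    GF_ANALYTICS_REPORTING_ENABLED: "false"    # no usage telemetry sent to Grafana Labs
```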
## Related
- `monitoring/loki-config.yml` — Loki server config (retention, storage paths).
- `monitoring/promtail-config.yml` — scrape config + label rules.
- `monitoring/grafana-datasources.yml` — provisioning file that auto-wires the Loki datasource on first Grafana startup.
- Dozzle is not going away — it's still useful for live tailing a single container without leaving the terminal. Grafana wins for searching, time-range queries, and cross-container correlation.
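For orientation, a minimal datasource provisioning file of the kind `monitoring/grafana-datasources.yml` describes typically looks like this (the filename is ours; the schema is Grafana's standard provisioning format):

```yaml
# Mounted into /etc/grafana/provisioning/datasources/ — applied on Grafana startup
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    isDefault: true
```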