# Backups
Operational backups for the Hetzner-hosted Postgres instances. Customer workspace databases are explicitly out of scope — see "Scope" below.
## Scope
### What gets backed up
A single `pg-backup` sidecar container running on the production server dumps three Postgres databases:
| Database | Host (inside Docker network) | What it contains |
|---|---|---|
| `metadata` | `postgres:5432` | Org/user/auth, entity metadata, view config, subscriptions, all platform state |
| `website` | `postgres-website:5432` | Public website data — leads, contact form submissions |
| `demo` | `postgres-demo:5432` | The demo workspace's user data |
### What does NOT get backed up
- **Customer workspace databases.** This is by design, not omission. The product's no-vendor-lock-in promise (`etc/business/vendor-lock-in-strategy.md`) means each customer connects their own Postgres. We don't have the credentials, can't reach the host, and shouldn't be holding their data. Customers are advised to back up their own DBs.
- **Uploaded files (S3).** Per-workspace S3 buckets are owned and configured by the workspace admin; their lifecycle is the customer's call.
- **RabbitMQ state.** Messages are designed to be idempotent and reproducible — losing the queue means lost in-flight work, not lost data.
## Architecture
```
┌────────────────────────────────────────────────────────┐
│ Hetzner CAX21 (production server)                      │
│                                                        │
│  ┌──────────────┐  ┌──────────────────┐  ┌──────────┐  │
│  │ postgres     │  │ postgres-website │  │ postgres-│  │
│  │ (metadata)   │  │                  │  │ demo     │  │
│  └──────┬───────┘  └────────┬─────────┘  └────┬─────┘  │
│         │                   │                 │        │
│         │ pg_dump | gzip    │                 │        │
│         └─────────┬─────────┴─────────────────┘        │
│                   ▼                                    │
│          ┌──────────────────┐                          │
│          │    pg-backup     │  ← sidecar container,    │
│          │    container     │    bash loop, every 6 h  │
│          └────────┬─────────┘                          │
└───────────────────┼────────────────────────────────────┘
                    │
                    ▼  aws s3 cp - (streaming)
       ┌───────────────────────────┐
       │  s3://schemastack-prod/   │
       │    hetzner-prod/          │
       │      metadata/2026-04-26/ │
       │      website/2026-04-26/  │
       │      demo/2026-04-26/     │
       └───────────────────────────┘
                    │
                    ▼  S3 lifecycle policy
                       (auto-expire after 7d)
```

### How the sidecar works
- `Dockerfile.pg-backup` — `postgres:15-alpine` + AWS CLI.
- `pg-backup/entrypoint.sh` — runs the backup once on container start, then sleeps `BACKUP_INTERVAL_SECONDS` (currently 21600 = 6 h) in a loop. Container restart → backup runs immediately.
- `pg-backup/backup.sh` — for each of the three DBs, streams `pg_dump | gzip | aws s3 cp -` directly to S3 (sketched after this list). No local file is ever written, so disk usage stays at zero regardless of dump size.
- Connectivity comes from being on the `internal` Docker network alongside the Postgres containers (which are otherwise localhost-only).
- The container has a bounded memory limit; it doesn't materialise dumps in RAM either — `pg_dump | gzip` is a stream pipeline.
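For orientation, here is a minimal sketch of the per-database pipeline. It is illustrative rather than a copy of the real `backup.sh`: the function name and password handling are assumptions, while hosts, users, and the S3 key layout follow the table above and the log excerpt under "Verifying backups are healthy".

```bash
#!/usr/bin/env bash
# Illustrative sketch only; not the actual pg-backup/backup.sh.
# Assumes BACKUP_S3_BUCKET / BACKUP_S3_PREFIX are set (see "Configuration")
# and that per-DB passwords arrive via the environment, e.g. PGPASSWORD.
set -euo pipefail

backup_db() {
  local name=$1 host=$2 user=$3 db=$4
  local day stamp key
  day=$(date -u +%Y-%m-%d)
  stamp=$(date -u +%Y-%m-%dT%H-%M-%SZ)
  key="${BACKUP_S3_PREFIX}/${name}/${day}/${name}-${stamp}.sql.gz"
  echo "[backup] ${name}: dumping ${user}@${host}:5432/${db} -> s3://${BACKUP_S3_BUCKET}/${key}"
  # One continuous stream: dump -> compress -> upload. No temp file and no
  # whole-dump buffering, so disk and memory stay flat regardless of size.
  pg_dump -h "$host" -U "$user" -d "$db" \
    | gzip \
    | aws s3 cp - "s3://${BACKUP_S3_BUCKET}/${key}"
}

backup_db metadata postgres         schemastack schemastack
backup_db website  postgres-website website     website
backup_db demo     postgres-demo    postgres    postgres
```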
### Why a sidecar instead of a host cron job
Three reasons:
- **Reachable.** The Postgres containers expose their ports only on the Docker bridge (and `localhost` of the host for the website DB). A container on the same Docker network reaches them directly without poking ports open to the host.
- **Reproducible.** The whole backup environment is in version control. Setup on a fresh server is `docker compose up`; nothing to install, configure, or remember.
- **Self-healing.** `restart: unless-stopped` plus the inner `while true` loop (sketched below) means the backup keeps running across host reboots, container crashes, and network blips. A host cron has more places to silently fail.
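The loop itself is small enough to sketch in full. This is a simplified reconstruction of `pg-backup/entrypoint.sh`, matching the log lines shown under "Verifying backups are healthy"; the failure branch and the `/backup.sh` path are assumptions:

```bash
#!/usr/bin/env bash
# Simplified reconstruction of pg-backup/entrypoint.sh (path illustrative).
# restart: unless-stopped revives this process after crashes and reboots;
# the loop covers the steady state in between.
set -u
INTERVAL="${BACKUP_INTERVAL_SECONDS:-21600}"
echo "[entrypoint] starting; backup will run immediately, then every ${INTERVAL}s"
while true; do
  if /backup.sh; then
    echo "[entrypoint] backup ok; sleeping ${INTERVAL}s"
  else
    # A failed run is logged, never fatal; the schedule survives it.
    echo "[entrypoint] backup FAILED; sleeping ${INTERVAL}s" >&2
  fi
  sleep "${INTERVAL}"
done
```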
## Configuration
Set in the `.env.production` template; real values live in the vault-encrypted `.env.production.local`:
| Variable | Default | Purpose |
|---|---|---|
| `BACKUP_S3_BUCKET` | `CHANGE_ME` | S3 bucket name |
| `BACKUP_S3_PREFIX` | `hetzner-prod` | Prefix under the bucket — distinguishes `hetzner-prod` from any future origins (e.g. a sandbox) |
| `BACKUP_AWS_REGION` | `eu-central-1` | Standard region for European data residency |
| `BACKUP_AWS_ACCESS_KEY_ID` | `CHANGE_ME` | IAM user with `s3:PutObject` on the bucket only |
| `BACKUP_AWS_SECRET_ACCESS_KEY` | `CHANGE_ME` | Matching secret |
| `BACKUP_INTERVAL_SECONDS` | `21600` | Interval between runs. 21600 = 6 h (4 dumps/day). Lower for testing. |
Postgres credentials are reused from the application's existing environment vars (`POSTGRES_PASSWORD`, `POSTGRES_WEBSITE_PASSWORD`, etc.) — no separate backup-only DB user.
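Putting the defaults together, the committed template amounts to something like this (illustrative; real secrets only ever go in `.env.production.local`):

```bash
# .env.production template; values per the table above.
BACKUP_S3_BUCKET=CHANGE_ME
BACKUP_S3_PREFIX=hetzner-prod
BACKUP_AWS_REGION=eu-central-1
BACKUP_AWS_ACCESS_KEY_ID=CHANGE_ME
BACKUP_AWS_SECRET_ACCESS_KEY=CHANGE_ME
# 21600 s = 6 h; lower for testing
BACKUP_INTERVAL_SECONDS=21600
```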
## Retention — S3 lifecycle policy
The sidecar does no rotation itself. S3 enforces retention via a bucket-level lifecycle policy. Current policy:
```json
{
  "Rules": [
    {
      "ID": "schemastack-backups-7d",
      "Status": "Enabled",
      "Filter": { "Prefix": "hetzner-prod/" },
      "Expiration": { "Days": 7 }
    }
  ]
}
```

At 6-hourly cadence × 7 days × 3 databases = 84 objects in the bucket at steady state. All small (gzipped `pg_dump` output). Storage cost is negligible.
### Why 7 days
- Plenty of granularity to roll back to "this morning before the bad migration" or "last night's pre-deploy state."
- Long enough to absorb an alerting gap — even if backup failures go undetected for a couple of days, the most recent good dump is still around.
- Short enough that GDPR/personal-data-deletion requests don't have a tail in the backup bucket lasting months.
If you need longer retention, switch to a tiered policy (e.g. keep 30 days, transition to Glacier-IR after 7 to drop cost). At current dump sizes the storage saving doesn't justify the operational complexity.
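For concreteness, such a tiered policy would look roughly like this. It is hypothetical and not applied anywhere; `GLACIER_IR` is the S3 storage class for Glacier Instant Retrieval:

```bash
# Hypothetical tiered lifecycle, NOT the current policy: 30-day total
# retention, transitioning to Glacier Instant Retrieval after 7 days.
# Apply it with the put-bucket-lifecycle-configuration call shown below.
cat > backup-lifecycle-tiered.json <<'EOF'
{
  "Rules": [
    {
      "ID": "schemastack-backups-tiered-30d",
      "Status": "Enabled",
      "Filter": { "Prefix": "hetzner-prod/" },
      "Transitions": [{ "Days": 7, "StorageClass": "GLACIER_IR" }],
      "Expiration": { "Days": 30 }
    }
  ]
}
EOF
```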
### Applying or updating the policy
```bash
# From schemastack-deployment/, with appropriate AWS creds:
aws s3api put-bucket-lifecycle-configuration \
--bucket schemastack-prod \
--lifecycle-configuration file://backup-lifecycle.json
# Verify
aws s3api get-bucket-lifecycle-configuration --bucket schemastack-prod
```

S3 lifecycle deletions happen asynchronously — expect objects to disappear within ~24 h of the cutoff, not at exactly the 7-day mark.
## Verifying backups are healthy
### Recent log output
The best view is Grafana (LogQL: `{container="docker-pg-backup-1"}`). Logs persist 30 days, so you can audit any past run, not just the live tail. See Log Search for the SSH-tunnel command.
Live tail via Dozzle (http://localhost:8888) is the alternative when you only care about right-now state. A healthy run looks like:
```
[entrypoint] starting; backup will run immediately, then every 21600s
[backup] === run started at 2026-04-26T16-29-58Z ===
[backup] metadata: dumping schemastack@postgres:5432/schemastack → s3://...
[backup] metadata: OK in 1s
[backup] website: dumping website@postgres-website:5432/website → s3://...
[backup] website: OK in 1s
[backup] demo: dumping postgres@postgres-demo:5432/postgres → s3://...
[backup] demo: OK in 2s
[backup] === run completed in 4s ===
[entrypoint] backup ok; sleeping 21600s
```

### From S3 directly
```bash
aws s3 ls s3://schemastack-prod/hetzner-prod/metadata/ --recursive | tail -5
# Expect: an object dated within the last 6 hours.
```

A missing object after >6 hours means the sidecar isn't running or is failing silently.
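If you want to script that check, something along these lines works. It is a hypothetical helper, not part of the repo, and assumes GNU `date` plus credentials that can list the bucket (the backup IAM user itself has `s3:PutObject` only):

```bash
#!/usr/bin/env bash
# Hypothetical freshness check: non-zero exit if the newest metadata dump
# is older than one backup interval (6 h).
set -euo pipefail

latest=$(aws s3 ls s3://schemastack-prod/hetzner-prod/metadata/ --recursive \
  | sort | tail -1 | awk '{print $1 " " $2}')
age=$(( $(date +%s) - $(date -d "$latest" +%s) ))

if (( age > 21600 )); then
  echo "STALE: newest metadata dump is ${age}s old" >&2
  exit 1
fi
echo "OK: newest metadata dump is ${age}s old"
```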
## Locking semantics
`pg_dump` (default mode) takes `ACCESS SHARE` locks on every table — the most permissive table lock there is.
| Operation | Blocks? |
|---|---|
| `SELECT` / `INSERT` / `UPDATE` / `DELETE` | ❌ Does not block |
| `ALTER TABLE`, `DROP TABLE`, `TRUNCATE`, `REINDEX`, `VACUUM FULL`, `CLUSTER` | ✅ Blocked until the dump completes |
Each dump currently completes in 1–2 seconds per database, so the DDL-collision window is tiny. A cadence shorter than 6 h would multiply that contention without buying meaningful RPO improvement.

`pg_dump` uses snapshot isolation (`REPEATABLE READ` by default), so the dump is consistent without holding write locks for its duration.
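To watch this live you can list the dump's relation locks while a run is in flight. This spot check assumes the container and credential names used in the restore section below; expect every row to show `AccessShareLock`:

```bash
# While a dump is running, inspect its locks from inside the postgres
# container. pg_dump identifies itself via application_name.
docker exec -i postgres psql -U schemastack -d schemastack -c "
  SELECT l.relation::regclass AS table_name, l.mode, l.granted
  FROM pg_locks l
  JOIN pg_stat_activity a ON a.pid = l.pid
  WHERE a.application_name = 'pg_dump'
    AND l.locktype = 'relation';"
```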
## Restoring
### Production restore (metadata DB)
This is the one to drill periodically — "do we know how to bring metadata back" is the existential question.
```bash
# 1. SSH to the production server
ssh deploy@<hetzner-ip>
# 2. Identify the dump to restore (latest, or a specific timestamp)
aws s3 ls s3://schemastack-prod/hetzner-prod/metadata/ --recursive | tail -5
# 3. Stop services that write to the metadata DB
cd /opt/schemastack
docker compose stop metadata-rest-blue consumer-worker-blue \
metadata-rest-green consumer-worker-green \
processor-blue processor-green
# 4. Drop and recreate the schema (DESTRUCTIVE)
docker exec -it postgres psql -U schemastack -d schemastack \
-c "DROP SCHEMA public CASCADE; CREATE SCHEMA public;"
# 5. Stream the dump back
aws s3 cp s3://schemastack-prod/hetzner-prod/metadata/2026-04-26/metadata-2026-04-26T06-00-00Z.sql.gz - \
| gunzip \
| docker exec -i postgres psql -U schemastack -d schemastack
# 6. Restart services in order — metadata-rest first (runs Liquibase, but with restored schema it should be a no-op)
docker compose up -d metadata-rest-blue
# ... wait for healthcheck ...
docker compose up -d consumer-worker-blue processor-blue
```

### Sandbox restore (recommended for any non-trivial restore)
Don't restore production blind. Spin up a Postgres locally, restore there, inspect:
```bash
docker run --rm -d --name pg-restore-sandbox \
-e POSTGRES_PASSWORD=test -p 5544:5432 postgres:15
aws s3 cp s3://schemastack-prod/hetzner-prod/metadata/2026-04-26/...sql.gz - \
| gunzip \
| psql postgresql://postgres:test@localhost:5544/postgres
```

Then connect a separate Quarkus dev instance to that Postgres (via `QUARKUS_DATASOURCE_JDBC_URL`) and verify what came back before touching production.
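Assuming a Maven-based Quarkus service checkout (an assumption about the build setup, not something this doc specifies), that connection is just environment overrides:

```bash
# Point a local Quarkus dev instance at the sandbox restore. The env var
# names are Quarkus's standard config mappings; values match the sandbox
# container started above.
QUARKUS_DATASOURCE_JDBC_URL=jdbc:postgresql://localhost:5544/postgres \
QUARKUS_DATASOURCE_USERNAME=postgres \
QUARKUS_DATASOURCE_PASSWORD=test \
./mvnw quarkus:dev
```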
## Known gaps
In priority order:
- **No automated alerting.** A backup failure goes unnoticed until someone checks Grafana or Dozzle logs. Manual smoke-checking is the workaround until alerting is wired up (an open audit item). 30-day Loki retention means a missed run leaves a clear paper trail when you do go looking — but nothing pushes a notification.
- **No tested restore-time SLA.** We know how to restore in principle; nobody has timed an end-to-end restore at production data size.
- **No off-region copy.** A region-wide AWS outage in `eu-central-1` would lose backups. Cross-region replication on the bucket is one CLI call away — not done because the failure mode (a single AWS region disappearing for hours or days) is rare and recoverable.
- **Customer workspace DBs.** Not ours to back up, but customers should be advised to set up their own. Worth promoting in onboarding docs.
## Why 6-hourly, not hourly
The pricing-tiers doc mentions "hourly automated backups" as a future Enterprise-tier feature. That's a different mechanism (likely WAL archiving or logical replication, not `pg_dump` every hour). Don't conflate the two: this 6-hourly job is the operational baseline for our own infrastructure, not the customer-facing backup product.
## Related runbooks
- `schemastack-deployment/BACKUPS.md` — full operator runbook with one-time setup (S3 bucket creation, IAM user, lifecycle JSON, container provisioning) and the canonical restore procedure. The dev doc you're reading explains the architecture and rationale; BACKUPS.md is the "what commands do I run" guide.
- `schemastack-deployment/CERTS.md` — Origin Cert rotation; a separate concern but adjacent in the operations folder.