# Backups
Operational backups for the Hetzner-hosted Postgres instances. Customer workspace databases are explicitly out of scope — see "Scope" below.
## Scope
### What gets backed up
A single `pg-backup` sidecar container running on the production server dumps three Postgres databases:
| Database | Host (inside Docker network) | What it contains |
|---|---|---|
| `metadata` | `postgres:5432` | Org/user/auth, entity metadata, view config, subscriptions, all platform state |
| `website` | `postgres-website:5432` | Public website data — leads, contact form submissions |
| `demo` | `postgres-demo:5432` | The demo workspace's user data |
### What does NOT get backed up
- **Customer workspace databases.** This is by design, not omission. The product's no-vendor-lock-in promise (`etc/business/vendor-lock-in-strategy.md`) means each customer connects their own Postgres. We don't have the credentials, can't reach the host, and shouldn't be holding their data. Customers are advised to back up their own DBs.
- **Uploaded files (S3).** Per-workspace S3 buckets are owned and configured by the workspace admin; their lifecycle is the customer's call.
- **RabbitMQ state.** Messages are designed to be idempotent and reproducible — losing the queue means lost in-flight work, not lost data.
## Architecture
```
┌────────────────────────────────────────────────────────┐
│ Hetzner CAX21 (production server)                      │
│                                                        │
│  ┌──────────────┐  ┌──────────────────┐  ┌──────────┐  │
│  │ postgres     │  │ postgres-website │  │ postgres-│  │
│  │ (metadata)   │  │                  │  │ demo     │  │
│  └──────┬───────┘  └────────┬─────────┘  └────┬─────┘  │
│         │                   │                 │        │
│         │ pg_dump | gzip    │                 │        │
│         └─────────┬─────────┴─────────────────┘        │
│                   ▼                                    │
│          ┌──────────────────┐                          │
│          │    pg-backup     │  ← sidecar container,    │
│          │    container     │    bash loop, every 6 h  │
│          └────────┬─────────┘                          │
└───────────────────┼────────────────────────────────────┘
                    │
                    ▼  aws s3 cp - (streaming)
       ┌───────────────────────────┐
       │  s3://schemastack-prod/   │
       │    hetzner-prod/          │
       │      metadata/2026-04-26/ │
       │      website/2026-04-26/  │
       │      demo/2026-04-26/     │
       └───────────────────────────┘
                    │
                    ▼  S3 lifecycle policy
                       (auto-expire after 7d)
```

### How the sidecar works
- `Dockerfile.pg-backup` — `postgres:15-alpine` + AWS CLI.
- `pg-backup/entrypoint.sh` — runs the backup once on container start, then sleeps `BACKUP_INTERVAL_SECONDS` (currently 21600 = 6 h) in a loop. Container restart → backup runs immediately.
- `pg-backup/backup.sh` — for each of the three DBs, streams `pg_dump | gzip | aws s3 cp -` directly to S3 (sketched after this list). No local file is ever written, so disk usage stays at zero regardless of dump size.
- Connectivity comes from being on the `internal` Docker network alongside the Postgres containers (which are otherwise localhost-only).
- The container has a bounded memory limit; it doesn't materialise dumps in RAM either — `pg_dump | gzip` is a stream pipeline.
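For orientation, here is a minimal sketch of the per-database pipeline. It is illustrative rather than a copy of the real `backup.sh`: the function name and password handling are assumptions, while hosts, users, and the S3 key layout follow the table above and the log excerpt under "Verifying backups are healthy".

```bash
#!/usr/bin/env bash
# Illustrative sketch only; not the actual pg-backup/backup.sh.
# Assumes BACKUP_S3_BUCKET / BACKUP_S3_PREFIX are set (see "Configuration")
# and that per-DB passwords arrive via the environment, e.g. PGPASSWORD.
set -euo pipefail

backup_db() {
  local name=$1 host=$2 user=$3 db=$4
  local day stamp key
  day=$(date -u +%Y-%m-%d)
  stamp=$(date -u +%Y-%m-%dT%H-%M-%SZ)
  key="${BACKUP_S3_PREFIX}/${name}/${day}/${name}-${stamp}.sql.gz"
  echo "[backup] ${name}: dumping ${user}@${host}:5432/${db} -> s3://${BACKUP_S3_BUCKET}/${key}"
  # One continuous stream: dump -> compress -> upload. No temp file and no
  # whole-dump buffering, so disk and memory stay flat regardless of size.
  pg_dump -h "$host" -U "$user" -d "$db" \
    | gzip \
    | aws s3 cp - "s3://${BACKUP_S3_BUCKET}/${key}"
}

backup_db metadata postgres         schemastack schemastack
backup_db website  postgres-website website     website
backup_db demo     postgres-demo    postgres    postgres
```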
### Why a sidecar instead of a host cron job
Three reasons:
- **Reachable.** The Postgres containers expose their ports only on the Docker bridge (and `localhost` of the host for the website DB). A container on the same Docker network reaches them directly without poking ports open to the host.
- **Reproducible.** The whole backup environment is in version control. Setup on a fresh server is `docker compose up`; nothing to install, configure, or remember.
- **Self-healing.** `restart: unless-stopped` plus the inner `while true` loop (sketched below) means the backup keeps running across host reboots, container crashes, and network blips. A host cron has more places to silently fail.
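The loop itself is small enough to sketch in full. This is a simplified reconstruction of `pg-backup/entrypoint.sh`, matching the log lines shown under "Verifying backups are healthy"; the failure branch and the `/backup.sh` path are assumptions:

```bash
#!/usr/bin/env bash
# Simplified reconstruction of pg-backup/entrypoint.sh (path illustrative).
# restart: unless-stopped revives this process after crashes and reboots;
# the loop covers the steady state in between.
set -u
INTERVAL="${BACKUP_INTERVAL_SECONDS:-21600}"
echo "[entrypoint] starting; backup will run immediately, then every ${INTERVAL}s"
while true; do
  if /backup.sh; then
    echo "[entrypoint] backup ok; sleeping ${INTERVAL}s"
  else
    # A failed run is logged, never fatal; the schedule survives it.
    echo "[entrypoint] backup FAILED; sleeping ${INTERVAL}s" >&2
  fi
  sleep "${INTERVAL}"
done
```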
## Configuration
Set in the `.env.production` template; real values live in the vault-encrypted `.env.production.local`:
| Variable | Default | Purpose |
|---|---|---|
| `BACKUP_S3_BUCKET` | `CHANGE_ME` | S3 bucket name |
| `BACKUP_S3_PREFIX` | `hetzner-prod` | Prefix under the bucket — distinguishes `hetzner-prod` from any future origins (e.g. a sandbox) |
| `BACKUP_AWS_REGION` | `eu-central-1` | Standard region for European data residency |
| `BACKUP_AWS_ACCESS_KEY_ID` | `CHANGE_ME` | IAM user with `s3:PutObject` on the bucket only |
| `BACKUP_AWS_SECRET_ACCESS_KEY` | `CHANGE_ME` | Matching secret |
| `BACKUP_INTERVAL_SECONDS` | `21600` | Interval between runs. 21600 = 6 h (4 dumps/day). Lower for testing. |
Postgres credentials are reused from the application's existing environment vars (`POSTGRES_PASSWORD`, `POSTGRES_WEBSITE_PASSWORD`, etc.) — no separate backup-only DB user.
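Putting the defaults together, the committed template amounts to something like this (illustrative; real secrets only ever go in `.env.production.local`):

```bash
# .env.production template; values per the table above.
BACKUP_S3_BUCKET=CHANGE_ME
BACKUP_S3_PREFIX=hetzner-prod
BACKUP_AWS_REGION=eu-central-1
BACKUP_AWS_ACCESS_KEY_ID=CHANGE_ME
BACKUP_AWS_SECRET_ACCESS_KEY=CHANGE_ME
# 21600 s = 6 h; lower for testing
BACKUP_INTERVAL_SECONDS=21600
```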
## Retention — S3 lifecycle policy
The sidecar does no rotation itself. S3 enforces retention via a bucket-level lifecycle policy. Current policy:
```json
{
  "Rules": [
    {
      "ID": "schemastack-backups-7d",
      "Status": "Enabled",
      "Filter": { "Prefix": "hetzner-prod/" },
      "Expiration": { "Days": 7 }
    }
  ]
}
```

At 6-hourly cadence × 7 days × 3 databases = 84 objects in the bucket at steady state. All small (gzipped `pg_dump` output). Storage cost is negligible.
### Why 7 days
- Plenty of granularity to roll back to "this morning before the bad migration" or "last night's pre-deploy state."
- Long enough to absorb an alerting gap — even if backup failures go undetected for a couple of days, the most recent good dump is still around.
- Short enough that GDPR/personal-data-deletion requests don't have a tail in the backup bucket lasting months.
If you need longer retention, switch to a tiered policy (e.g. keep 30 days, transition to Glacier-IR after 7 to drop cost). At current dump sizes the storage saving doesn't justify the operational complexity.
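For concreteness, such a tiered policy would look roughly like this. It is hypothetical and not applied anywhere; `GLACIER_IR` is the S3 storage class for Glacier Instant Retrieval:

```bash
# Hypothetical tiered lifecycle, NOT the current policy: 30-day total
# retention, transitioning to Glacier Instant Retrieval after 7 days.
# Apply it with the put-bucket-lifecycle-configuration call shown below.
cat > backup-lifecycle-tiered.json <<'EOF'
{
  "Rules": [
    {
      "ID": "schemastack-backups-tiered-30d",
      "Status": "Enabled",
      "Filter": { "Prefix": "hetzner-prod/" },
      "Transitions": [{ "Days": 7, "StorageClass": "GLACIER_IR" }],
      "Expiration": { "Days": 30 }
    }
  ]
}
EOF
```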
### Applying or updating the policy
```bash
# From schemastack-deployment/, with appropriate AWS creds:
aws s3api put-bucket-lifecycle-configuration \
--bucket schemastack-prod \
--lifecycle-configuration file://backup-lifecycle.json
# Verify
aws s3api get-bucket-lifecycle-configuration --bucket schemastack-prod
```

S3 lifecycle deletions happen asynchronously — expect objects to disappear within ~24 h of the cutoff, not at exactly the 7-day mark.
## Verifying backups are healthy
### Recent log output
The best view is Grafana (LogQL: `{container="docker-pg-backup-1"}`). Logs persist 30 days, so you can audit any past run, not just the live tail. See Log Search for the SSH-tunnel command.
Live tail via Dozzle (http://localhost:8888) is the alternative when you only care about right-now state. A healthy run looks like:
```
[entrypoint] starting; backup will run immediately, then every 21600s
[backup] === run started at 2026-04-26T16-29-58Z ===
[backup] metadata: dumping schemastack@postgres:5432/schemastack → s3://...
[backup] metadata: OK in 1s
[backup] website: dumping website@postgres-website:5432/website → s3://...
[backup] website: OK in 1s
[backup] demo: dumping postgres@postgres-demo:5432/postgres → s3://...
[backup] demo: OK in 2s
[backup] === run completed in 4s ===
[entrypoint] backup ok; sleeping 21600s
```

### From S3 directly
```bash
aws s3 ls s3://schemastack-prod/hetzner-prod/metadata/ --recursive | tail -5
# Expect: an object dated within the last 6 hours.
```

A missing object after >6 hours means the sidecar isn't running or is failing silently.
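If you want to script that check, something along these lines works. It is a hypothetical helper, not part of the repo, and assumes GNU `date` plus credentials that can list the bucket (the backup IAM user itself has `s3:PutObject` only):

```bash
#!/usr/bin/env bash
# Hypothetical freshness check: non-zero exit if the newest metadata dump
# is older than one backup interval (6 h).
set -euo pipefail

latest=$(aws s3 ls s3://schemastack-prod/hetzner-prod/metadata/ --recursive \
  | sort | tail -1 | awk '{print $1 " " $2}')
age=$(( $(date +%s) - $(date -d "$latest" +%s) ))

if (( age > 21600 )); then
  echo "STALE: newest metadata dump is ${age}s old" >&2
  exit 1
fi
echo "OK: newest metadata dump is ${age}s old"
```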
## Locking semantics
`pg_dump` (default mode) takes `ACCESS SHARE` locks on every table — the most permissive table lock there is.
| Operation | Blocks? |
|---|---|
| `SELECT` / `INSERT` / `UPDATE` / `DELETE` | ❌ Does not block |
| `ALTER TABLE`, `DROP TABLE`, `TRUNCATE`, `REINDEX`, `VACUUM FULL`, `CLUSTER` | ✅ Blocked until the dump completes |
Each dump currently completes in 1–2 seconds per database, so the DDL-collision window is tiny. A cadence shorter than 6 h would multiply that contention without buying meaningful RPO improvement.

`pg_dump` uses snapshot isolation (`REPEATABLE READ` by default), so the dump is consistent without holding write locks for its duration.
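To watch this live you can list the dump's relation locks while a run is in flight. This spot check assumes the container and credential names used in the restore section below; expect every row to show `AccessShareLock`:

```bash
# While a dump is running, inspect its locks from inside the postgres
# container. pg_dump identifies itself via application_name.
docker exec -i postgres psql -U schemastack -d schemastack -c "
  SELECT l.relation::regclass AS table_name, l.mode, l.granted
  FROM pg_locks l
  JOIN pg_stat_activity a ON a.pid = l.pid
  WHERE a.application_name = 'pg_dump'
    AND l.locktype = 'relation';"
```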
## Restoring
### Production restore (metadata DB)
This is the one to drill periodically — "do we know how to bring metadata back" is the existential question.
```bash
# 1. SSH to the production server
ssh deploy@<hetzner-ip>
# 2. Identify the dump to restore (latest, or a specific timestamp)
aws s3 ls s3://schemastack-prod/hetzner-prod/metadata/ --recursive | tail -5
# 3. Stop services that write to the metadata DB
cd /opt/schemastack
docker compose stop metadata-rest-blue consumer-worker-blue \
metadata-rest-green consumer-worker-green \
processor-blue processor-green
# 4. Drop and recreate the schema (DESTRUCTIVE)
docker exec -it postgres psql -U schemastack -d schemastack \
-c "DROP SCHEMA public CASCADE; CREATE SCHEMA public;"
# 5. Stream the dump back
aws s3 cp s3://schemastack-prod/hetzner-prod/metadata/2026-04-26/metadata-2026-04-26T06-00-00Z.sql.gz - \
| gunzip \
| docker exec -i postgres psql -U schemastack -d schemastack
# 6. Restart services in order — metadata-rest first (runs Liquibase, but with restored schema it should be a no-op)
docker compose up -d metadata-rest-blue
# ... wait for healthcheck ...
docker compose up -d consumer-worker-blue processor-blue
```

### Sandbox restore (recommended for any non-trivial restore)
Don't restore production blind. Spin up a Postgres locally, restore there, inspect:
```bash
docker run --rm -d --name pg-restore-sandbox \
-e POSTGRES_PASSWORD=test -p 5544:5432 postgres:15
aws s3 cp s3://schemastack-prod/hetzner-prod/metadata/2026-04-26/...sql.gz - \
| gunzip \
| psql postgresql://postgres:test@localhost:5544/postgres
```

Then connect a separate Quarkus dev instance to that Postgres (via `QUARKUS_DATASOURCE_JDBC_URL`) and verify what came back before touching production.
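Assuming a Maven-based Quarkus service checkout (an assumption about the build setup, not something this doc specifies), that connection is just environment overrides:

```bash
# Point a local Quarkus dev instance at the sandbox restore. The env var
# names are Quarkus's standard config mappings; values match the sandbox
# container started above.
QUARKUS_DATASOURCE_JDBC_URL=jdbc:postgresql://localhost:5544/postgres \
QUARKUS_DATASOURCE_USERNAME=postgres \
QUARKUS_DATASOURCE_PASSWORD=test \
./mvnw quarkus:dev
```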
## Known gaps
In priority order:
- **No automated alerting.** A backup failure goes unnoticed until someone checks Grafana or Dozzle logs. Manual smoke-checking is the workaround until alerting is wired up (an open audit item). 30-day Loki retention means a missed run leaves a clear paper trail when you do go looking — but nothing pushes a notification.
- **No tested restore-time SLA.** We know how to restore in principle; nobody has timed an end-to-end restore at production data size.
- **No off-region copy.** A region-wide AWS outage in `eu-central-1` would lose backups. Cross-region replication on the bucket is one CLI call away — not done because the failure mode (a single AWS region disappearing for hours or days) is rare and recoverable.
- **Customer workspace DBs.** Not ours to back up, but customers should be advised to set up their own. Worth promoting in onboarding docs.
## Why 6-hourly, not hourly
The pricing-tiers doc mentions "hourly automated backups" as a future Enterprise-tier feature. That's a different mechanism (likely WAL archiving or logical replication, not `pg_dump` every hour). Don't conflate the two: this 6-hourly job is the operational baseline for our own infrastructure, not the customer-facing backup product.
## Related runbooks
- `schemastack-deployment/BACKUPS.md` — full operator runbook with one-time setup (S3 bucket creation, IAM user, lifecycle JSON, container provisioning) and the canonical restore procedure. The dev doc you're reading explains the architecture and rationale; BACKUPS.md is the "what commands do I run" guide.
- `schemastack-deployment/CERTS.md` — Origin Cert rotation; a separate concern but adjacent in the operations folder.