Backup & DR runbook — RapidValue IGA (internal)

RTO / RPO commitments per tier

Tier	RPO (data loss)	RTO (downtime)	Backup mechanism	Retention
Trial	≤ 24h	Best-effort	RDS PITR only	7 days
Starter	≤ 24h	≤ 8h	RDS PITR + tenant export on-demand	7 days
Growth	≤ 4h	≤ 4h	RDS PITR + daily S3 pgdump	30 days
IVIP Visibility	≤ 4h	≤ 4h	Same as Growth	30 days
Enterprise	≤ 1h	≤ 1h	RDS PITR + S3 pgdump + cross-region replica	90 days (30d Standard + 60d IA)

These are the list commitments. The cost calculator's SLA model accounts for what we need infra-wise to honor each one. Don't sell Enterprise RPO/RTO at Growth pricing — see cost calculator.
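For post-incident reviews, the table can be encoded as data and checked mechanically. The following is a minimal sketch, not the cost calculator's actual SLA model; the names `Commitment`, `COMMITMENTS`, and `breaches` are hypothetical, and the values are copied from the tier table above.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class Commitment:
    rpo: timedelta
    rto: Optional[timedelta]  # None means best-effort (no committed RTO)


# Values copied from the tier table; names here are illustrative only.
COMMITMENTS = {
    "Trial": Commitment(timedelta(hours=24), None),
    "Starter": Commitment(timedelta(hours=24), timedelta(hours=8)),
    "Growth": Commitment(timedelta(hours=4), timedelta(hours=4)),
    "IVIP Visibility": Commitment(timedelta(hours=4), timedelta(hours=4)),
    "Enterprise": Commitment(timedelta(hours=1), timedelta(hours=1)),
}


def breaches(tier: str, data_loss: timedelta, downtime: timedelta) -> list[str]:
    """Return which commitments ("RPO", "RTO") an incident exceeded."""
    commitment = COMMITMENTS[tier]
    exceeded = []
    if data_loss > commitment.rpo:
        exceeded.append("RPO")
    if commitment.rto is not None and downtime > commitment.rto:
        exceeded.append("RTO")
    return exceeded
```

A best-effort RTO (Trial) never reports an RTO breach, which matches the table but is a modelling choice worth confirming.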

S3 backup bucket layout

All backups land under s3://{BACKUPS_BUCKET}/ with per-prefix lifecycle policies (see infra/terraform/backups.tf):

Prefix	Purpose	Lifecycle
`exports/`	Per-tenant GDPR Art. 20 export (on-demand, customer-pull)	7 days
`pgdump/`	Daily logical pg_dump of full DB	90 days (Standard-IA after 30d)
`vault/`	Vault snapshots (VACUUM INTO + decrypt-verify)	90 days (Standard-IA after 30d)
`tier-backups/business/`	Per-tenant daily backup for Growth/IVIP customers	30 days
`tier-backups/enterprise/`	Per-tenant daily backup for Enterprise customers	90 days (Standard-IA after 30d)
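When auditing the bucket, it can help to compute what state an object should be in for its age. This is a rough sketch built from the table above, not from infra/terraform/backups.tf; `LIFECYCLE` and `expected_state` are hypothetical names, and S3 applies lifecycle rules asynchronously, so objects near a boundary may lag by a day or so.

```python
# prefix -> (expiry_days, days_until_standard_ia or None), per the table above
LIFECYCLE = {
    "exports/": (7, None),
    "pgdump/": (90, 30),
    "vault/": (90, 30),
    "tier-backups/business/": (30, None),
    "tier-backups/enterprise/": (90, 30),
}


def expected_state(key: str, age_days: int) -> str:
    """Return "STANDARD", "STANDARD_IA" or "EXPIRED" for an object of this age."""
    for prefix, (expiry_days, ia_after_days) in LIFECYCLE.items():
        if key.startswith(prefix):
            if age_days >= expiry_days:
                return "EXPIRED"
            if ia_after_days is not None and age_days >= ia_after_days:
                return "STANDARD_IA"
            return "STANDARD"
    raise ValueError(f"no lifecycle rule covers {key!r}")
```

An object that still exists well past its expiry age, or a key that raises `ValueError`, suggests the lifecycle policy and this table have drifted apart.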

Restore procedures

Scenario 1 — Single tenant restore (data corruption / accidental delete)

    # 1. Identify the latest good per-tenant backup
    #    (use tier-backups/enterprise/ for Enterprise tenants)
    aws s3 ls s3://iga-backups-prod/tier-backups/business/<tenant_id>/

    # 2. Spin up an ephemeral RDS via point-in-time restore (read-only verification).
    #    Set --restore-time to a moment before the corruption occurred.
    RESTORE_ID="iga-restore-verify-$(date +%s)"
    aws rds restore-db-instance-to-point-in-time \
      --source-db-instance-identifier iga-prod-postgres \
      --target-db-instance-identifier "$RESTORE_ID" \
      --restore-time 2026-05-30T08:00:00Z \
      --db-instance-class db.t3.medium \
      --no-multi-az
    aws rds wait db-instance-available --db-instance-identifier "$RESTORE_ID"

    # 3. Connect, validate the tenant data shape, copy back to prod
    docker compose exec backend python scripts/restore_tenant.py \
      --source-host=<ephemeral-rds-endpoint> \
      --target-host=iga-prod-postgres.xxx.rds.amazonaws.com \
      --tenant-id=<id>

    # 4. Verify audit-chain integrity post-restore
    curl -s https://app.rapidvalue.eu/api/v1/audit/verify-chain
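Picking "the latest good" backup in step 1 means picking the newest object written at or before the last moment the tenant data was known to be intact. A small sketch of that selection, assuming the listing has already been parsed into (key, last-modified) pairs; `latest_good_backup` is a hypothetical helper, not part of scripts/restore_tenant.py.

```python
from datetime import datetime
from typing import Iterable, Optional, Tuple


def latest_good_backup(
    objects: Iterable[Tuple[str, datetime]], last_good: datetime
) -> Optional[str]:
    """Return the key of the newest object modified at or before last_good.

    objects: (key, last_modified) pairs, e.g. parsed from an S3 listing.
    Returns None when every backup postdates the last known-good moment.
    """
    candidates = [(modified, key) for key, modified in objects if modified <= last_good]
    if not candidates:
        return None
    # max() compares the timestamp first, so the newest candidate wins.
    return max(candidates)[1]
```

If this returns None, the tenant has no usable per-tenant backup inside the retention window and the point-in-time restore in step 2 is the only source.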

Scenario 2 — Full RDS restore (catastrophic)

    # 1. Use RDS console or CLI to restore from the latest automated snapshot,
    #    then wait until the new instance is available
    aws rds restore-db-instance-from-db-snapshot \
      --db-instance-identifier iga-prod-postgres-restored \
      --db-snapshot-identifier rds:iga-prod-postgres-2026-05-30-03-00 \
      --db-instance-class db.t3.medium \
      --multi-az
    aws rds wait db-instance-available \
      --db-instance-identifier iga-prod-postgres-restored

    # 2. Update Secrets Manager with new endpoint
    aws secretsmanager update-secret \
      --secret-id iga/prod/db-endpoint \
      --secret-string "iga-prod-postgres-restored.xxx.rds.amazonaws.com"

    # 3. Restart backend containers to pick up new endpoint
    ssh ec2-user@app.rapidvalue.eu "docker compose restart backend"

    # 4. Verify deep health
    curl https://app.rapidvalue.eu/health/deep
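Choosing "the latest automated snapshot" by eye is error-prone under pressure. The sketch below picks it from a list of snapshot identifiers; the identifier pattern is inferred from the single example above (`rds:<instance>-YYYY-MM-DD-HH-MM`) and is an assumption, and `latest_snapshot` is a hypothetical helper.

```python
import re
from datetime import datetime
from typing import Iterable, Optional

# Pattern inferred from the example identifier in this runbook; an assumption.
_SNAPSHOT_RE = re.compile(
    r"^rds:(?P<instance>.+)-(?P<ts>\d{4}-\d{2}-\d{2}-\d{2}-\d{2})$"
)


def latest_snapshot(ids: Iterable[str], instance: str) -> Optional[str]:
    """Return the newest automated snapshot id for `instance`, or None."""
    best = None
    for snapshot_id in ids:
        match = _SNAPSHOT_RE.match(snapshot_id)
        if not match or match.group("instance") != instance:
            continue  # manual snapshot, or a snapshot of another instance
        taken_at = datetime.strptime(match.group("ts"), "%Y-%m-%d-%H-%M")
        if best is None or taken_at > best[0]:
            best = (taken_at, snapshot_id)
    return best[1] if best else None
```

Matching on the exact instance name keeps snapshots of an earlier `iga-prod-postgres-restored` instance from being picked up by mistake.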

Scenario 3 — Tenant export (GDPR Art. 20 portability)

Customers can self-serve via the tenant settings page. Alternatively, the vendor can trigger the export:

    curl -X POST https://app.rapidvalue.eu/api/v1/platform/tenants/<id>/export \
      -H "X-Platform-Admin-Key: <secret>" \
      -H "Content-Type: application/json"
    # Returns a 24h presigned S3 URL to a tar.gz of JSONL-per-table

The export includes audit_event, since customers have a right to their full audit trail. The only flow that treats audit_event differently is tenant hard-delete, which leaves those rows in place (CLAUDE.md durable rule #1).
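Before handing an export to a customer, it is worth a quick sanity check that the archive opens and that audit_event is present. A minimal sketch, assuming each table is stored as a member named `<table>.jsonl` inside the tar.gz (the member naming is an assumption, and `count_export_rows` is a hypothetical helper):

```python
import json
import tarfile
from typing import BinaryIO, Dict


def count_export_rows(archive: BinaryIO) -> Dict[str, int]:
    """Count JSONL rows per table in an export tar.gz.

    Assumes members are named <table>.jsonl (an assumption about the layout).
    Raises ValueError if any non-empty line is not valid JSON.
    """
    counts: Dict[str, int] = {}
    with tarfile.open(fileobj=archive, mode="r:gz") as tar:
        for member in tar:
            if not member.isfile() or not member.name.endswith(".jsonl"):
                continue
            table = member.name.rsplit("/", 1)[-1][: -len(".jsonl")]
            rows = 0
            for line in tar.extractfile(member):
                if line.strip():
                    json.loads(line)  # validates the row; result is discarded
                    rows += 1
            counts[table] = rows
    return counts
```

A result without an `audit_event` key would contradict the rule above and should block delivery of the export.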

Monthly restore-test drill

Per CLAUDE.md SLA invariants: a backup that isn't tested is a backup that doesn't work. The drill is scheduled via .github/workflows/rds-restore-drill.yml:

Frequency: monthly (1st Monday at 04:00 UTC)
Procedure: spin up db.t3.medium ephemeral RDS from latest snapshot → run validation script → tear down
Validation: row-count delta < 0.5% vs source, audit-chain intact, sample tenant data accessible
Cost: ~$0.50 per drill (30 min ephemeral RDS)
Failure path: drill failure → Slack alert + GitHub issue auto-opened. Don't dismiss; investigate.
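The row-count part of the validation can be sketched as below. It is an illustration, not the workflow's actual validation script: it assumes the 0.5% threshold applies per table (the rule above does not say whether it is per table or aggregate), and `row_count_failures` is a hypothetical name.

```python
from typing import Dict, List

MAX_DELTA = 0.005  # the 0.5% threshold from the validation rule above


def row_count_failures(source: Dict[str, int], restored: Dict[str, int]) -> List[str]:
    """Return one message per table whose restored row count is off by >= 0.5%."""
    failures = []
    for table, expected in source.items():
        actual = restored.get(table)
        if actual is None:
            failures.append(f"{table}: missing in restore")
            continue
        if expected:
            delta = abs(actual - expected) / expected
        else:
            # An empty source table only matches an empty restored table.
            delta = 0.0 if actual == 0 else 1.0
        if delta >= MAX_DELTA:
            failures.append(f"{table}: delta {delta:.2%}")
    return failures
```

Some delta is expected because the snapshot lags the live source, which is presumably why the rule allows a tolerance rather than requiring exact equality.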

Cross-region DR (Enterprise tier)

For 99.99% SLA customers — workflow lives in .github/workflows/rds-cross-region-copy.yml:

Primary region: eu-west-1 (Dublin)
DR region: eu-central-1 (Frankfurt)
Mechanism: S3 CRR for pgdumps + automated RDS snapshot copy nightly
Failover RTO: 1 hour (manual DNS cutover + RDS promotion)
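Failover RPO: ≤ 24h (nightly snapshot lag). This is looser than the ≤ 1h Enterprise RPO in the tier table; the nightly copy alone cannot meet ≤ 1h after a full-region loss.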

Cost: ~$30-60/mo per tenant on Enterprise tier (cross-region storage + replica). Already in the cost calculator under "Cross-region warm RDS standby".
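The failover RPO follows directly from the copy schedule: anything written after the last snapshot that reached the DR region is lost. A small sketch of that arithmetic, with hypothetical function names and the 24h default taken from the figure above:

```python
from datetime import datetime, timedelta


def failover_data_loss(last_copy: datetime, failure: datetime) -> timedelta:
    """Writes made after last_copy are lost when failing over to the DR copy."""
    if failure < last_copy:
        raise ValueError("failure predates the last copied snapshot")
    return failure - last_copy


def within_rpo(
    last_copy: datetime, failure: datetime, rpo: timedelta = timedelta(hours=24)
) -> bool:
    """True if the data-loss window stays within the given RPO."""
    return failover_data_loss(last_copy, failure) <= rpo
```

With a nightly copy, a failure just before the next copy loses almost a full day of writes, so a missed or delayed copy run pushes the window past 24h.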

Audit-immutability invariant

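Per CLAUDE.md security rule #1 + migration 0139: audit_event is DB-immutable via PostgreSQL BEFORE UPDATE and BEFORE DELETE triggers (audit_event_no_update, audit_event_no_delete). Any ORM code that tries to update or delete an audit row hits the trigger, and the request fails.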

This invariant survives restores because the trigger is part of the schema. After any restore:

    docker compose exec backend psql -c "
      SELECT tgname FROM pg_trigger
      WHERE tgrelid = 'audit_event'::regclass
        AND tgname LIKE 'audit_event%';
    "
    # Should return audit_event_no_update + audit_event_no_delete

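If a trigger is missing post-restore, re-apply via alembic upgrade 0139; the migration is idempotent. Alembic only runs revisions that alembic_version does not already record, so check alembic current first: if the restored database already reports 0139 or later, the upgrade command will not re-run the migration.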
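In an automated post-restore check, the query result can be compared against the two expected trigger names. A minimal sketch; `missing_audit_triggers` is a hypothetical helper, and the expected names come from the check above.

```python
from typing import Iterable, Set

# Trigger names the post-restore query is expected to return.
EXPECTED_TRIGGERS = frozenset({"audit_event_no_update", "audit_event_no_delete"})


def missing_audit_triggers(tgnames: Iterable[str]) -> Set[str]:
    """Return expected trigger names absent from the pg_trigger query result."""
    return set(EXPECTED_TRIGGERS) - set(tgnames)
```

An empty set means the immutability invariant survived the restore; anything else should fail the check before traffic is switched back.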

Hard-delete tenant (GDPR Art. 17)

Per CLAUDE.md durable rule #2: typed-confirm + OFFBOARDING-gated. Flow:

1. Tenant status → OFFBOARDING (admin action via UI)
2. Wait 30 days (retention window)
3. Vendor confirms with typed input: POST /platform/tenants/{id}/delete with confirm set to the exact tenant_id
4. Service does FK-cycle auto-break + topological delete (~12 tables)
5. audit_event rows are preserved under GDPR Art. 17(3)(b) (compliance retention ground)
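The gating conditions in this flow (OFFBOARDING status, the 30-day wait, and the typed confirmation) can be expressed as one pre-flight check. This is a sketch under assumptions, not the service's implementation: `hard_delete_blockers` and its parameters are hypothetical, and it assumes the 30 days are counted from the moment the tenant entered OFFBOARDING.

```python
from datetime import datetime, timedelta
from typing import List, Optional

RETENTION_WINDOW = timedelta(days=30)


def hard_delete_blockers(
    status: str,
    offboarding_since: Optional[datetime],
    confirm: str,
    tenant_id: str,
    now: datetime,
) -> List[str]:
    """Return the reasons a hard-delete must be refused; empty means allowed."""
    blockers = []
    if status != "OFFBOARDING":
        blockers.append("tenant is not in OFFBOARDING")
    elif offboarding_since is None or now - offboarding_since < RETENTION_WINDOW:
        blockers.append("30-day retention window has not elapsed")
    # Typed confirmation must match exactly: no trimming, no case folding.
    if confirm != tenant_id:
        blockers.append("typed confirmation does not match tenant_id")
    return blockers
```

Returning every blocker rather than stopping at the first gives the operator the full picture in a single response.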

Hard-delete is irreversible. Always export first (Art. 20), and keep the customer's written deletion request on file.
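For the FK-cycle auto-break and topological delete in the flow above, the ordering problem looks roughly like the sketch below: tables that reference others are deleted first, and a cycle is resolved by dropping one edge that is known to be safe to null out. This illustrates the technique with the standard-library `graphlib` and is not the service's code; `delete_order`, `fk_parents`, and `breakable` are hypothetical names.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Set, Tuple


def _break_one(cycle: List[str], edges: Dict[str, Set[str]],
               breakable: Set[Tuple[str, str]]) -> bool:
    """Remove one breakable (child, parent) FK edge on the cycle, if any."""
    for a, b in zip(cycle, cycle[1:]):
        for child, parent in ((a, b), (b, a)):
            if (child, parent) in breakable and parent in edges.get(child, set()):
                edges[child].discard(parent)
                return True
    return False


def delete_order(fk_parents: Dict[str, Set[str]],
                 breakable: Set[Tuple[str, str]]) -> List[str]:
    """Order tables so referencing (child) tables are deleted before parents.

    fk_parents maps a child table to the tables it references.
    breakable lists (child, parent) FK edges that may be nulled to break a cycle.
    Raises CycleError if a cycle contains no breakable edge.
    """
    edges = {child: set(parents) for child, parents in fk_parents.items()}
    while True:
        # TopologicalSorter takes node -> predecessors. A parent's predecessors
        # are its children, so children are emitted (deleted) first.
        graph: Dict[str, Set[str]] = {}
        for child, parents in edges.items():
            graph.setdefault(child, set())
            for parent in parents:
                graph.setdefault(parent, set()).add(child)
        try:
            return list(TopologicalSorter(graph).static_order())
        except CycleError as exc:
            if not _break_one(exc.args[1], edges, breakable):
                raise
```

Since audit_event rows are preserved, that table would simply be left out of `fk_parents` in this model.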