RTO / RPO commitments per tier

| Tier | RPO (data loss) | RTO (downtime) | Backup mechanism | Retention |
| --- | --- | --- | --- | --- |
| Trial | ≤ 24h | Best-effort | RDS PITR only | 7 days |
| Starter | ≤ 24h | ≤ 8h | RDS PITR + tenant export on-demand | 7 days |
| Growth | ≤ 4h | ≤ 4h | RDS PITR + daily S3 pgdump | 30 days |
| IVIP Visibility | ≤ 4h | ≤ 4h | Same as Growth | 30 days |
| Enterprise | ≤ 1h | ≤ 1h | RDS PITR + S3 pgdump + cross-region replica | 90 days (30d Standard + 60d IA) |

These are the list commitments. The cost calculator's SLA model accounts for what we need infra-wise to honor each one. Don't sell Enterprise RPO/RTO at Growth pricing; see the cost calculator.
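To make the table easier to check against after an incident, here is a minimal sketch that encodes the commitments and tests a measured recovery against them. The `TierCommitment` class and `met_commitment` function are hypothetical names for illustration, not part of the codebase; "Best-effort" is modelled as no RTO limit.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TierCommitment:
    rpo_hours: float            # maximum tolerated data loss
    rto_hours: Optional[float]  # maximum downtime; None means best-effort


# Values copied from the tier table; the structure itself is an assumption.
COMMITMENTS = {
    "Trial": TierCommitment(24, None),
    "Starter": TierCommitment(24, 8),
    "Growth": TierCommitment(4, 4),
    "IVIP Visibility": TierCommitment(4, 4),
    "Enterprise": TierCommitment(1, 1),
}


def met_commitment(tier: str, data_loss_hours: float, downtime_hours: float) -> bool:
    """Return True if a measured recovery stayed within the tier's RPO and RTO."""
    c = COMMITMENTS[tier]
    rpo_ok = data_loss_hours <= c.rpo_hours
    rto_ok = c.rto_hours is None or downtime_hours <= c.rto_hours
    return rpo_ok and rto_ok
```

For example, `met_commitment("Enterprise", 2, 0.5)` is false because two hours of data loss exceeds the one-hour RPO, even though the downtime was within the RTO.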

S3 backup bucket layout

All backups land under s3://{BACKUPS_BUCKET}/ with per-prefix lifecycle policies (see infra/terraform/backups.tf):

| Prefix | Purpose | Lifecycle |
| --- | --- | --- |
| exports/ | Per-tenant GDPR Art. 20 export (on-demand, customer-pull) | 7 days |
| pgdump/ | Daily logical pg_dump of full DB | 90 days (Standard-IA after 30d) |
| vault/ | Vault snapshots (VACUUM INTO + decrypt-verify) | 90 days (Standard-IA after 30d) |
| tier-backups/business/ | Per-tenant daily backup for Growth/IVIP customers | 30 days |
| tier-backups/enterprise/ | Per-tenant daily backup for Enterprise customers | 90 days (Standard-IA after 30d) |
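When hunting for an old backup it helps to know whether an object should still exist and in which storage class. The sketch below mirrors the lifecycle table in plain Python; it is an approximation for reasoning only (S3 applies lifecycle rules asynchronously, so real transitions can lag), and `storage_state` and the example keys are hypothetical.

```python
from typing import Optional

# (prefix, expiry_days, days_until_standard_ia) copied from the lifecycle table.
# None means the prefix never transitions to Standard-IA before it expires.
LIFECYCLE: list[tuple[str, int, Optional[int]]] = [
    ("tier-backups/enterprise/", 90, 30),
    ("tier-backups/business/", 30, None),
    ("exports/", 7, None),
    ("pgdump/", 90, 30),
    ("vault/", 90, 30),
]


def storage_state(key: str, age_days: int) -> str:
    """Return the expected state of an object: STANDARD, STANDARD_IA or EXPIRED."""
    for prefix, expiry_days, ia_after in LIFECYCLE:
        if key.startswith(prefix):
            if age_days >= expiry_days:
                return "EXPIRED"
            if ia_after is not None and age_days >= ia_after:
                return "STANDARD_IA"
            return "STANDARD"
    raise ValueError(f"no lifecycle rule covers key: {key}")
```

A 45-day-old object under `pgdump/` would be expected in Standard-IA, while an 8-day-old object under `exports/` should already be gone.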

Restore procedures

Scenario 1 — Single tenant restore (data corruption / accidental delete)

    # 1. Identify the latest good per-tenant backup
    aws s3 ls s3://iga-backups-prod/tier-backups/business/<tenant_id>/

    # 2. Spin up an ephemeral RDS instance via point-in-time restore (read-only verification)
    aws rds restore-db-instance-to-point-in-time \
      --source-db-instance-identifier iga-prod-postgres \
      --target-db-instance-identifier iga-restore-verify-$(date +%s) \
      --restore-time 2026-05-30T08:00:00Z \
      --db-instance-class db.t3.medium \
      --no-multi-az

    # 3. Connect, validate the tenant data shape, copy back to prod
    docker compose exec backend python scripts/restore_tenant.py \
      --source-host=<ephemeral-rds-endpoint> \
      --target-host=iga-prod-postgres.xxx.rds.amazonaws.com \
      --tenant-id=<id>

    # 4. Verify audit-chain integrity post-restore
    curl -s https://app.rapidvalue.eu/api/v1/audit/verify-chain
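Step 1 means picking the newest backup taken before the corruption happened. As a rough sketch, the helper below parses the text printed by `aws s3 ls` (date, time, size, key per line) and returns the latest key older than a cutoff. The function name and the file names in the example are hypothetical, and it assumes the listing timestamps and the cutoff are in the same timezone.

```python
from datetime import datetime
from typing import Optional


def latest_backup_before(ls_output: str, cutoff: datetime) -> Optional[str]:
    """Return the key of the newest object listed before `cutoff`, or None."""
    best: Optional[tuple[datetime, str]] = None
    for line in ls_output.splitlines():
        parts = line.split(None, 3)
        if len(parts) != 4:
            continue  # skips blank lines and "PRE subdir/" entries
        date_s, time_s, _size, key = parts
        try:
            ts = datetime.strptime(f"{date_s} {time_s}", "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # not an object line
        if ts < cutoff and (best is None or ts > best[0]):
            best = (ts, key)
    return best[1] if best else None
```

Pass the time the corruption was first observed as `cutoff`, then use the returned key (or a `--restore-time` slightly before the cutoff) in the following steps.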

Scenario 2 — Full RDS restore (catastrophic)

    # 1. Use RDS console or CLI to restore from the latest automated snapshot
    aws rds restore-db-instance-from-db-snapshot \
      --db-instance-identifier iga-prod-postgres-restored \
      --db-snapshot-identifier rds:iga-prod-postgres-2026-05-30-03-00 \
      --db-instance-class db.t3.medium \
      --multi-az

    # 2. Update Secrets Manager with new endpoint
    aws secretsmanager update-secret \
      --secret-id iga/prod/db-endpoint \
      --secret-string "iga-prod-postgres-restored.xxx.rds.amazonaws.com"

    # 3. Restart backend containers to pick up new endpoint
    ssh ec2-user@app.rapidvalue.eu "docker compose restart backend"

    # 4. Verify deep health
    curl https://app.rapidvalue.eu/health/deep

Scenario 3 — Tenant export (GDPR Art. 20 portability)

Customers can self-serve via the tenant settings page. For a vendor-initiated export:

    curl -X POST https://app.rapidvalue.eu/api/v1/platform/tenants/<id>/export \
      -H "X-Platform-Admin-Key: <secret>" \
      -H "Content-Type: application/json"
    # Returns a 24h presigned S3 URL to a tar.gz of JSONL-per-table

The export includes audit_event rows, because the customer has a right to their full audit trail. audit_event is excluded only from the hard-delete tenant flow, per CLAUDE.md durable rule #1.
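To sanity-check an export before handing it to a customer, something like the following can count rows per table in the downloaded archive. This is a sketch that assumes each table is stored as a member named `<table>.jsonl` inside the tar.gz; the actual member naming is not specified here, and `rows_per_table` is a hypothetical helper.

```python
import io
import json
import tarfile


def rows_per_table(archive_bytes: bytes) -> dict[str, int]:
    """Count JSONL rows per table in an export archive held in memory."""
    counts: dict[str, int] = {}
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as tar:
        for member in tar:
            if not member.isfile() or not member.name.endswith(".jsonl"):
                continue
            # Assumed convention: file name (without directories) is the table name.
            table = member.name.rsplit("/", 1)[-1][: -len(".jsonl")]
            fh = tar.extractfile(member)
            rows = 0
            for raw in fh:
                if raw.strip():
                    json.loads(raw)  # raises if a line is not valid JSON
                    rows += 1
            counts[table] = rows
    return counts
```

A result that lacks an `audit_event` entry would indicate the export is incomplete.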

Monthly restore-test drill

Per CLAUDE.md SLA invariants: a backup that isn't tested is a backup that doesn't work. The drill is scheduled via .github/workflows/rds-restore-drill.yml:

  • Frequency: monthly (1st Monday at 04:00 UTC)
  • Procedure: spin up db.t3.medium ephemeral RDS from latest snapshot → run validation script → tear down
  • Validation: row-count delta < 0.5% vs source, audit-chain intact, sample tenant data accessible
  • Cost: ~$0.50 per drill (30 min ephemeral RDS)
  • Failure path: drill failure → Slack alert + GitHub issue auto-opened. Don't dismiss; investigate.
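Two parts of the drill are easy to get subtly wrong: the "1st Monday" schedule and the row-count threshold. The sketch below shows one way to express both; it is an illustration, not the contents of the actual validation script. A plain cron expression typically cannot say "first Monday" by itself, so one option is to trigger every Monday and exit early unless `is_first_monday` holds. Function names are hypothetical.

```python
from datetime import date

MAX_ROW_DELTA = 0.005  # 0.5%, from the validation bullet


def is_first_monday(d: date) -> bool:
    """True when `d` is the first Monday of its month."""
    return d.weekday() == 0 and d.day <= 7


def row_count_failures(source: dict[str, int], restored: dict[str, int]) -> list[str]:
    """Return tables whose restored row count is missing or off by >= 0.5%."""
    failures = []
    for table, expected in source.items():
        actual = restored.get(table)
        if actual is None:
            failures.append(table)  # table absent from the restored instance
            continue
        delta = abs(expected - actual) / max(expected, 1)
        if delta >= MAX_ROW_DELTA:
            failures.append(table)
    return failures
```

A non-empty result from `row_count_failures` should fail the drill and follow the failure path above.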

Cross-region DR (Enterprise tier)

For 99.99% SLA customers — workflow lives in .github/workflows/rds-cross-region-copy.yml:

  • Primary region: eu-west-1 (Dublin)
  • DR region: eu-central-1 (Frankfurt)
  • Mechanism: S3 CRR for pgdumps + automated RDS snapshot copy nightly
  • Failover RTO: 1 hour (manual DNS cutover + RDS promotion)
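  • Failover RPO: ≤ 24h (nightly snapshot lag; this applies to regional failover only and is looser than the ≤ 1h Enterprise RPO in the tier table)

Cost: ~$30-60/mo per tenant on Enterprise tier (cross-region storage + replica). Already in the cost calculator under "Cross-region warm RDS standby".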
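Because the failover RPO depends entirely on the nightly copy succeeding, a freshness check on the newest snapshot in the DR region is a cheap safeguard. A minimal sketch, assuming the timestamp of the latest copied snapshot has already been fetched (for example from the RDS API) as a timezone-aware datetime; `dr_copy_is_fresh` is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone

FAILOVER_RPO = timedelta(hours=24)  # from the "Failover RPO" bullet


def dr_copy_is_fresh(last_copy_at: datetime, now: datetime) -> bool:
    """True if the newest DR-region snapshot is within the failover RPO."""
    return now - last_copy_at <= FAILOVER_RPO


if __name__ == "__main__":
    # Illustrative timestamp only; in practice this comes from the DR region.
    last = datetime(2026, 5, 30, 3, 0, tzinfo=timezone.utc)
    print(dr_copy_is_fresh(last, datetime.now(timezone.utc)))
```

If this returns false, a failover at that moment would lose more than 24 hours of data.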

Audit-immutability invariant

Per CLAUDE.md security rule #1 + migration 0139: audit_event is DB-immutable via a PostgreSQL BEFORE UPDATE OR DELETE trigger. Any ORM code that tries to mutate an audit row is rejected by the database, and the request fails.

This invariant survives restores because the trigger is part of the schema. After any restore:

    docker compose exec backend psql -c "
      SELECT tgname FROM pg_trigger
      WHERE tgrelid = 'audit_event'::regclass
        AND tgname LIKE 'audit_event%';
    "
    # Should return audit_event_no_update + audit_event_no_delete

If the trigger is missing post-restore, re-apply via alembic upgrade 0139 — it's idempotent.
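For an automated post-restore check, the psql output can be compared against the two expected trigger names. This is a small sketch that parses the text printed by the query above; `missing_audit_triggers` is a hypothetical helper, and it relies on every trigger name starting with `audit_event`, as the LIKE filter already ensures.

```python
EXPECTED_TRIGGERS = frozenset({"audit_event_no_update", "audit_event_no_delete"})


def missing_audit_triggers(psql_output: str) -> set[str]:
    """Return the expected trigger names that do not appear in the psql output."""
    found = {
        line.strip()
        for line in psql_output.splitlines()
        if line.strip().startswith("audit_event")  # skips header, rule, row count
    }
    return set(EXPECTED_TRIGGERS) - found
```

An empty set means both triggers are present; anything else should block the restored instance from taking traffic.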

Hard-delete tenant (GDPR Art. 17)

Per CLAUDE.md durable rule #2: typed-confirm + OFFBOARDING-gated. Flow:

  1. Tenant status → OFFBOARDING (admin action via UI)
  2. Wait 30 days (retention window)
  3. Vendor confirms with typed input: POST /platform/tenants/{id}/delete with confirm set to the exact tenant_id
  4. Service does FK-cycle auto-break + topological delete (~12 tables)
  5. audit_event rows are preserved under GDPR Art. 17(3)(b) (compliance retention ground)

Hard-delete is irreversible. Always export first (Art. 20). Get the customer's deletion request in writing and keep it on record.
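The gate in steps 1 to 3 and the delete ordering in step 4 can be sketched as follows. This is an illustration under stated assumptions, not the service's actual code: the function names and table names are hypothetical, the FK graph is passed in by hand, and "breaking" a cycle here only drops the edge from the ordering (a real implementation would also have to null or defer the corresponding foreign key). audit_event is left out of the graph because its rows are preserved.

```python
from datetime import datetime, timedelta
from graphlib import CycleError, TopologicalSorter
from typing import Optional

RETENTION_WINDOW = timedelta(days=30)


def may_hard_delete(status: str, offboarding_since: Optional[datetime],
                    confirm: str, tenant_id: str, now: datetime) -> bool:
    """OFFBOARDING status, 30 days elapsed, and an exact typed confirmation."""
    return (
        status == "OFFBOARDING"
        and offboarding_since is not None
        and now - offboarding_since >= RETENTION_WINDOW
        and confirm == tenant_id
    )


def delete_order(fk_children: dict[str, set[str]]) -> tuple[list[str], list[tuple[str, str]]]:
    """Order tables so referencing (child) tables are deleted before parents.

    `fk_children` maps a table to the tables holding foreign keys to it.
    Returns (order, broken_edges); each broken edge is (child, parent).
    """
    graph = {table: set(children) for table, children in fk_children.items()}
    broken: list[tuple[str, str]] = []
    while True:
        try:
            return list(TopologicalSorter(graph).static_order()), broken
        except CycleError as exc:
            a, b = exc.args[1][0], exc.args[1][1]  # two adjacent nodes in the cycle
            if a in graph.get(b, set()):
                graph[b].discard(a)
                broken.append((a, b))
            else:
                graph[a].discard(b)
                broken.append((b, a))
```

With `{"tenant": {"user", "role"}, "user": {"role"}, "role": {"user"}}`, the user/role cycle is broken once and `tenant` comes last in the returned order.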