On-call runbook

Incident playbook

P1 Backend not starting / lifespan loop

Symptom: /health/deep returns 503 with {"db": "drift"} or {"alembic": "not-at-head"}. Site shows generic 503.

Likely causes: failed alembic auto-upgrade on boot (incompatible migration), DB connection blocked, missing env-var.

Check logs immediately:

docker compose logs backend --tail=300 | grep -E "alembic|lifespan|ERROR"

Identify the failing migration: look for "FAILED: …" in alembic output. Note the revision ID.

Manual downgrade if migration is reversible:

docker compose exec backend alembic downgrade -1

Or skip + roll back deploy: revert the offending commit on main, force-push, re-deploy.

git revert <sha> --no-edit git push origin main

For URL-encoded password issues (interpolation errors): per CLAUDE.md security rule #7, set cfg.file_config._interpolation = configparser.Interpolation() in app/main.py lifespan. Usually already in place; check it wasn't removed.

P1 RDS at capacity / connection-pool exhaustion

Symptom: backend logs show QueuePool limit reached. Wide 500s. RDS CloudWatch shows DB connection-count near max.

Root cause: too many tenants on one cluster, or a runaway scheduler sweep holding connections. Default db.t3.small RDS = ~80 max connections; 5 connections/tenant pool = 16 active concurrent tenants tops.

Check what's holding connections:

docker compose exec backend psql -c " SELECT pid, state, query_start, query FROM pg_stat_activity WHERE state = 'active' ORDER BY query_start ASC LIMIT 20; "

Kill long-running queries if safe:

SELECT pg_terminate_backend(<pid>);

Short-term mitigation: bump RDS instance class via Terraform (db_instance_class var) — typically goes t3.small → t3.medium with ~10 min downtime.

Permanent fix: split tenants across a new RDS cluster. See cost calculator for tenants-per-cluster guidance (15-25 paying tenants per db.t3.medium).

P2 Vault key rotation needed

Symptom: IGA_VAULT_MASTER_KEY compromised, expiring, or moving environments.

Per CLAUDE.md security rule #4: production guards require this to be set, and vault.db + key must live on separate disks/secret stores.

Read the runbook: backend/docs/VAULT_OPERATIONS.md has the full procedure.

Backup first: backend/scripts/vault_backup.py does atomic VACUUM INTO + decrypt-verify with current key.

Rotate via re-encrypt: run rotation script with both old + new keys. The script re-encrypts every secret row.

Update Secrets Manager: new key value goes to AWS Secrets Manager iga/{env}/vault-master-key. Backend reads at startup.

Verify: restart backend, check that connectors can still resolve their auth_config (look for VaultResolver errors in logs).

P2 Wizard 500 / unknown waypoint

Symptom: POST /api/v1/onboarding/sessions returns 500. Backend logs show invalid input value for enum waypoint_id.

Root cause: Python WaypointId enum has values the Postgres enum doesn't. This was exactly the 2026-05-30 incident — see migration 0165.

Diff Python enum vs DB:

# From the backend container: docker compose exec backend python -c " from app.domain.onboarding.models import WaypointId print([w.value for w in WaypointId]) " # Compare against DB enum: docker compose exec backend psql -c " SELECT unnest(enum_range(NULL::waypoint_id))::text; "

Add missing values via alembic migration: create new revision, op.execute("ALTER TYPE waypoint_id ADD VALUE IF NOT EXISTS '...'"). Postgres 12+ allows ADD VALUE inside a transaction.

Deploy + restart: lifespan auto-upgrade applies the migration.

P2 Tenant DB / data corruption suspected

Symptom: customer reports missing identities / weird recon-diff state / failed grants that should succeed.

Use audit-replay (read-only forensic tool) per CLAUDE.md rule:

curl https://staging.app.rapidvalue.eu/api/v1/platform/tenants/<id>/audit-replay?from=2026-05-30T00:00&to=2026-05-30T18:00

Verify audit-chain integrity:

curl https://staging.app.rapidvalue.eu/api/v1/audit/verify-chain

Returns 200 if chain_hash is intact. If broken: a non-service code path inserted into audit_event. Investigate.

Restore from backup if confirmed corruption: see backup & DR runbook.

P3 Cert sweep stuck / not firing

Symptom: scheduled certifications not generating tasks; CertCampaign shows 0 tasks.

Check scheduler-loop logs: docker compose logs backend | grep "_maybe_sweep\|cert"

Verify scheduler is running per tenant. Sometimes a single tenant fails and others continue — look for the tenant_id stack-trace.

Manual trigger: POST /api/v1/certification/sweep

P3 Frontend chunk-load errors / blank page

Symptom: user sees blank page or "ChunkLoadError" in console after deploy.

Root cause: service worker caching old chunk hashes that no longer exist after deploy. CLAUDE.md mentions Cache-Control: no-cache, no-store on index.html — verify it's applied.

Tell affected users to hard-refresh (Ctrl+Shift+R) or clear site data.

Long-term: ensure nginx config has the no-cache header on index.html.

Quick ssh + diagnostics

Incident playbook

P1 Backend not starting / lifespan loop

P1 RDS at capacity / connection-pool exhaustion

P2 Vault key rotation needed

P2 Wizard 500 / unknown waypoint

P2 Tenant DB / data corruption suspected

P3 Cert sweep stuck / not firing

P3 Frontend chunk-load errors / blank page

Deploy rollback

Escalation