Before you touch anything in production: note the timestamp, screenshot the alert and dashboard state, and announce in the #incident-coordination Slack channel. Recovery actions without a paper trail are how we lost the rapidvalue.eu zone migration.

Quick ssh + diagnostics

# SSH to staging EC2 (key in 1Password vault "iga-staging-ssh")
ssh ec2-user@staging.app.rapidvalue.eu
ssh ec2-user@app.rapidvalue.eu   # prod: use sparingly
# Inspect running containers
docker compose ps
docker compose logs backend --tail=200
docker compose logs frontend --tail=50
# Backend health canary (returns 503 on DB drift / alembic-not-at-head)
curl -s https://staging.app.rapidvalue.eu/health/deep | jq .
# Connect to RDS (read-only first!)
docker compose exec backend python -c "from app.db import engine; ..."

Incident playbook

P1 Backend not starting / lifespan loop

Symptom: /health/deep returns 503 with {"db": "drift"} or {"alembic": "not-at-head"}. Site shows generic 503.

Likely causes: failed alembic auto-upgrade on boot (incompatible migration), DB connection blocked, missing env-var.

1. Check logs immediately:
    docker compose logs backend --tail=300 | grep -E "alembic|lifespan|ERROR"
2. Identify the failing migration: look for "FAILED: …" in the alembic output and note the revision ID.
3. Manual downgrade, if the migration is reversible. Run alembic current first to confirm which revision the DB is actually at; a migration that failed inside a transaction may never have been applied.
    docker compose exec backend alembic downgrade -1
4. Or skip it and roll back the deploy: revert the offending commit on main, push, and re-deploy. A revert is a new commit, so no force-push is needed.
    git revert <sha> --no-edit
    git push origin main
5. For URL-encoded password issues (interpolation errors): per CLAUDE.md security rule #7, set cfg.file_config._interpolation = configparser.Interpolation() in the app/main.py lifespan. This is usually already in place; check that it wasn't removed.
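
The interpolation error from step 5 can be reproduced with the standard library alone. The sketch below is illustrative, not the code in app/main.py: the URL, host, and section name are made up, and a plain ConfigParser stands in for Alembic's cfg.file_config (assumed here to behave like one).

```python
import configparser

# Hypothetical URL: the password "p@ss" is URL-encoded as "p%40ss".
INI_TEXT = """
[alembic]
sqlalchemy.url = postgresql://app:p%40ss@db.example.internal/app
"""


def load_parser() -> configparser.ConfigParser:
    """Parse the INI text with ConfigParser's default BasicInterpolation."""
    parser = configparser.ConfigParser()
    parser.read_string(INI_TEXT)
    return parser


def read_url(parser: configparser.ConfigParser) -> str:
    """Fetch the URL; this is the call where interpolation is applied."""
    return parser.get("alembic", "sqlalchemy.url")


def read_url_default() -> str:
    """Fails: '%4' is not valid '%(name)s' syntax for BasicInterpolation."""
    return read_url(load_parser())


def read_url_without_interpolation() -> str:
    """Same parser with the no-op interpolation from rule #7 swapped in."""
    parser = load_parser()
    # Interpolation() is the do-nothing base class; _interpolation is private.
    parser._interpolation = configparser.Interpolation()
    return read_url(parser)
```

Calling read_url_default() raises configparser.InterpolationSyntaxError, while read_url_without_interpolation() returns the URL with %40 intact. If the backend log shows that exception class on boot, the rule #7 line is the first thing to check.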

P1 RDS at capacity / connection-pool exhaustion

Symptom: backend logs show QueuePool limit reached. Wide 500s. RDS CloudWatch shows DB connection-count near max.

Root cause: too many tenants on one cluster, or a runaway scheduler sweep holding connections. Default db.t3.small RDS = ~80 max connections; 5 connections/tenant pool = 16 active concurrent tenants tops.
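
The capacity arithmetic above, as a small sketch for re-checking it against other instance sizes. The function name and the reserved parameter (connections held back for admin or monitoring sessions) are assumptions for illustration; the 80 and 5 come from the figures above.

```python
def max_concurrent_tenants(max_connections: int, pool_size: int, reserved: int = 0) -> int:
    """Upper bound on tenants that can each hold a full connection pool.

    max_connections: the instance's connection limit (~80 per the text above)
    pool_size:       connections per tenant pool (5 per the text above)
    reserved:        hypothetical headroom for admin/monitoring sessions
    """
    if pool_size <= 0:
        raise ValueError("pool_size must be positive")
    usable = max(max_connections - reserved, 0)
    return usable // pool_size


print(max_concurrent_tenants(80, 5))               # 16, matching the figure above
print(max_concurrent_tenants(80, 5, reserved=10))  # 14 once headroom is held back
```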

1. Check what's holding connections:
    docker compose exec backend psql -c "
      SELECT pid, state, query_start, query
      FROM pg_stat_activity
      WHERE state = 'active'
      ORDER BY query_start ASC
      LIMIT 20;
    "
   Sessions in state 'idle in transaction' also hold connections; widen the filter to state <> 'idle' if the active list looks short.
2. Kill long-running queries if safe:
    SELECT pg_terminate_backend(<pid>);
3. Short-term mitigation: bump the RDS instance class via Terraform (db_instance_class var), typically t3.small → t3.medium, with ~10 min downtime.
4. Permanent fix: split tenants across a new RDS cluster. See the cost calculator for tenants-per-cluster guidance (15-25 paying tenants per db.t3.medium).
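
For steps 1 and 2, it helps to agree on what "long-running" means before terminating anything. A minimal sketch, assuming the pg_stat_activity rows have already been fetched as dicts with the columns from the query above; the 5-minute threshold and the function name are placeholders, not a team policy.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, List


def long_running(
    rows: List[Dict[str, Any]],
    now: datetime,
    threshold: timedelta = timedelta(minutes=5),  # placeholder threshold
) -> List[Dict[str, Any]]:
    """Return active sessions whose query started more than `threshold` ago,
    oldest first. These are candidates for review, not an automatic kill list."""
    stale = [
        row
        for row in rows
        if row["state"] == "active"
        and row["query_start"] is not None
        and now - row["query_start"] > threshold
    ]
    return sorted(stale, key=lambda row: row["query_start"])


now = datetime(2026, 5, 30, 12, 0, tzinfo=timezone.utc)
rows = [
    {"pid": 101, "state": "active", "query_start": now - timedelta(minutes=42), "query": "..."},
    {"pid": 102, "state": "active", "query_start": now - timedelta(seconds=3), "query": "..."},
    {"pid": 103, "state": "idle", "query_start": now - timedelta(hours=2), "query": "..."},
]
print([row["pid"] for row in long_running(rows, now)])  # [101]
```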

P2 Vault key rotation needed

Symptom: IGA_VAULT_MASTER_KEY compromised, expiring, or moving environments.

Per CLAUDE.md security rule #4: production guards require this to be set, and vault.db + key must live on separate disks/secret stores.

1. Read the runbook: backend/docs/VAULT_OPERATIONS.md has the full procedure.
2. Back up first: backend/scripts/vault_backup.py does an atomic VACUUM INTO + decrypt-verify with the current key.
3. Rotate via re-encrypt: run the rotation script with both old + new keys. The script re-encrypts every secret row.
4. Update Secrets Manager: the new key value goes to AWS Secrets Manager at iga/{env}/vault-master-key. The backend reads it at startup.
5. Verify: restart the backend and check that connectors can still resolve their auth_config (look for VaultResolver errors in the logs).

P2 Wizard 500 / unknown waypoint

Symptom: POST /api/v1/onboarding/sessions returns 500. Backend logs show invalid input value for enum waypoint_id.

Root cause: Python WaypointId enum has values the Postgres enum doesn't. This was exactly the 2026-05-30 incident — see migration 0165.

1. Diff Python enum vs DB:
    # From the backend container:
    docker compose exec backend python -c "from app.domain.onboarding.models import WaypointId; print([w.value for w in WaypointId])"
    # Compare against DB enum:
    docker compose exec backend psql -c "SELECT unnest(enum_range(NULL::waypoint_id))::text;"
2. Add missing values via alembic migration: create a new revision with op.execute("ALTER TYPE waypoint_id ADD VALUE IF NOT EXISTS '...'"). Postgres 12+ allows ADD VALUE inside a transaction, but the new value cannot be used until that transaction commits.
3. Deploy + restart: lifespan auto-upgrade applies the migration.
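
The diff in step 1 and the statements for step 2 can be derived mechanically. This is a sketch under assumptions: the WaypointId class below is a stand-in with invented members (the real one lives in app.domain.onboarding.models), and both helper names are hypothetical.

```python
from enum import Enum
from typing import Iterable, List, Type


class WaypointId(str, Enum):
    """Stand-in for the real enum; these member values are invented."""
    WELCOME = "welcome"
    CONNECT_SOURCE = "connect_source"
    REVIEW = "review"


def missing_enum_values(py_enum: Type[Enum], db_values: Iterable[str]) -> List[str]:
    """Values defined in Python but absent from the Postgres enum, in
    Python declaration order."""
    known = set(db_values)
    return [member.value for member in py_enum if member.value not in known]


def add_value_statements(type_name: str, values: Iterable[str]) -> List[str]:
    """Build one ALTER TYPE statement per missing value, suitable for
    passing to op.execute() in a new alembic revision."""
    statements = []
    for value in values:
        quoted = value.replace("'", "''")  # escape single quotes for SQL
        statements.append(
            f"ALTER TYPE {type_name} ADD VALUE IF NOT EXISTS '{quoted}'"
        )
    return statements


db_values = ["welcome", "review"]  # e.g. pasted from the psql output in step 1
missing = missing_enum_values(WaypointId, db_values)
for statement in add_value_statements("waypoint_id", missing):
    print(statement)
```

With the sample data this prints a single statement for 'connect_source'. Values present in the DB but not in Python are ignored on purpose, since Postgres cannot drop enum values with ALTER TYPE.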

P2 Tenant DB / data corruption suspected

Symptom: customer reports missing identities / weird recon-diff state / failed grants that should succeed.

1. Use audit-replay (read-only forensic tool) per CLAUDE.md rule. Quote the URL so the shell does not treat & as a background operator:
    curl "https://staging.app.rapidvalue.eu/api/v1/platform/tenants/<id>/audit-replay?from=2026-05-30T00:00&to=2026-05-30T18:00"
2. Verify audit-chain integrity (-i prints the HTTP status line):
    curl -i https://staging.app.rapidvalue.eu/api/v1/audit/verify-chain
   Returns 200 if chain_hash is intact. If it is broken, a non-service code path inserted into audit_event. Investigate.
3. Restore from backup if corruption is confirmed: see the backup & DR runbook.
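
To make the step 2 check concrete, here is a generic hash-chain sketch. It is not the service's actual scheme: it assumes each event's chain_hash is SHA-256 over the previous chain_hash plus a canonical JSON payload, and every name other than chain_hash is invented for illustration.

```python
import hashlib
import json
from typing import Any, Dict, List, Optional

GENESIS = "0" * 64  # assumed starting value for the first event


def compute_chain_hash(prev_hash: str, payload: Dict[str, Any]) -> str:
    """SHA-256 over the previous hash plus a canonical JSON payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_event(chain: List[Dict[str, Any]], payload: Dict[str, Any]) -> None:
    """What a well-behaved service path would do on insert."""
    prev = chain[-1]["chain_hash"] if chain else GENESIS
    chain.append({"payload": payload, "chain_hash": compute_chain_hash(prev, payload)})


def first_broken_index(chain: List[Dict[str, Any]]) -> Optional[int]:
    """Index of the first event whose stored hash does not match, or None."""
    prev = GENESIS
    for index, event in enumerate(chain):
        if event["chain_hash"] != compute_chain_hash(prev, event["payload"]):
            return index
        prev = event["chain_hash"]
    return None


chain: List[Dict[str, Any]] = []
for action in ("grant", "revoke", "grant"):
    append_event(chain, {"action": action})
print(first_broken_index(chain))  # None: chain intact

chain[1]["payload"]["action"] = "tampered"  # simulate a direct row edit
print(first_broken_index(chain))  # 1
```

The index of the first mismatch narrows the time window to hand to audit-replay in step 1.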

P3 Cert sweep stuck / not firing

Symptom: scheduled certifications not generating tasks; CertCampaign shows 0 tasks.

1. Check scheduler-loop logs:
    docker compose logs backend | grep -E "_maybe_sweep|cert"
2. Verify the scheduler is running per tenant. Sometimes a single tenant fails while the others continue; look for the stack trace that carries that tenant_id.
3. Manual trigger: POST /api/v1/certification/sweep

P3 Frontend chunk-load errors / blank page

Symptom: user sees blank page or "ChunkLoadError" in console after deploy.

Root cause: service worker caching old chunk hashes that no longer exist after deploy. CLAUDE.md mentions Cache-Control: no-cache, no-store on index.html — verify it's applied.

1. Tell affected users to hard-refresh (Ctrl+Shift+R) or clear site data.
2. Long-term: ensure the nginx config sets the no-cache header on index.html.
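
A quick way to verify the header once you have its value (for example, copied from the response headers for index.html). This is a minimal sketch; the function names are hypothetical, and it treats the header as correct only when both directives named above are present.

```python
from typing import Set


def cache_directives(header_value: str) -> Set[str]:
    """Split a Cache-Control value into lowercase directive names,
    dropping any '=value' part (e.g. 'max-age=60' becomes 'max-age')."""
    return {
        part.strip().lower().split("=", 1)[0]
        for part in header_value.split(",")
        if part.strip()
    }


def index_html_is_uncached(header_value: str) -> bool:
    """True when both no-cache and no-store are present."""
    directives = cache_directives(header_value)
    return "no-cache" in directives and "no-store" in directives


print(index_html_is_uncached("no-cache, no-store"))        # True
print(index_html_is_uncached("public, max-age=31536000"))  # False
```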

Deploy rollback

If a deploy went sideways:

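# 1. Identify the last good commit
git log --oneline -10
# 2. Revert (don't reset: main is shared)
git revert <bad-sha> --no-edit
git push origin main        # auto-deploys to staging
git push origin main:prod   # auto-deploys to prod (only after staging validation)
# 3. Or manually re-deploy a known-good commit via workflow_dispatch.
#    gh's --ref takes a branch or tag name, not a bare SHA, so tag the commit first
#    (the tag name "rollback-target" is only an example).
git tag rollback-target <good-sha>
git push origin rollback-target
gh workflow run deploy-staging.yml --ref rollback-target
gh workflow run deploy-ec2.yml --ref rollback-target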

Escalation

If you can't resolve in 30 minutes:

  • P1 prod incident: page the founder (PagerDuty rotation). Post the timeline in the #incident-coordination Slack channel.
  • RDS / AWS infra: AWS Business Support ticket (response < 1h). Account/role in 1Password.
  • Customer impact: tell customer via status page (statuspage.rapidvalue.eu) + email primary contact. Don't wait until resolved.
  • Security incident: stop everything, document evidence, contact cyber insurance line. Don't try to fix anything until forensic snapshot is captured.