Skip to main content

Context

ADR-0014 (0015-studio-multi-tenant-saas-aligned-svc-access) commits Studio to a single shared Postgres database with app-layer tenant isolation (every tenant-scoped query carries an explicit tenant_id filter; no row-level security). That model is only safe if the schema itself — table layout, indexes, default values, the tenant_id columns themselves — is managed deliberately, with the same rigor as code.

We need an answer to four questions for every schema change Studio ever makes:

  1. What is the source of truth for the schema? — not a snapshot in someone's psql session, not the result of a CREATE TABLE someone hand-typed against dev.
  2. How do we know which version of the schema is in dev / qa / prod? — and how do we know they match the version of the application code that's currently serving requests there.
  3. How do we apply a schema change safely? — without dropping requests, without a window in which the deployed code expects a column that doesn't exist yet (or vice versa), and without manual psql sessions in production.
  4. How do we keep the runtime application's identity from accidentally being able to alter the schema? — i.e. how do we enforce ADR-0014's app-layer-only model at the Postgres permission level, not just by convention.

Specific constraints:

  • The Cloud SQL instance is private-IP only (network layer + VPC peering — no public IP). Migrations cannot be run from a developer laptop or a vanilla GitHub-hosted runner; they must originate from inside the VPC.
  • Runtime: Python + SQLAlchemy 2.0 async + asyncpg on Cloud Run (chosen separately as part of Phase 2 of the Studio plan).
  • Deploys are continuous and image-based: every merge to a deploy branch publishes a new container image SHA; the api Cloud Run service rolls forward to that SHA.
  • We have already shipped (as of 2026-05-04) the IaC primitives this ADR depends on:
    • cloud-run-migrate layer — privileged migrator service account + Cloud Run Job (iac-studio#18).
    • cloud-sql-bootstrap extension — app schema, migrator is cloudsqlsuperuser, api is granted DML via ALTER DEFAULT PRIVILEGES, api.search_path = app, public (iac-studio#19).

Decision

Adopt Alembic as Studio's exclusive Postgres schema-migration tool, and execute Alembic migrations from a Cloud Run Job that the deploy pipeline runs before flipping the api service to the new image.

Concretely:

  1. Alembic (the SQLAlchemy-native schema migration framework) lives in svc-studio under alembic/ with a versions/ directory of forward-only migration scripts. Migrations are written as Python files using the Alembic op.* API (or raw SQL when needed for Postgres-specific features). Each migration has a deterministic revision id and an explicit down_revision linking it to its parent.

  2. The migrator runs from inside the same image as the api service. svc-studio's Dockerfile installs alembic as a runtime dependency (not dev-only), copies alembic.ini and alembic/ to the image, and the same SHA is pushed to Artifact Registry for both the api service and the migration job. One artifact, one SHA, one pipeline.

  3. Migrations execute as a Cloud Run Job (studio-{env}-migrate) using a dedicated service account (studio-{env}-migrate@). The job is provisioned by the iac-studio cloud-run-migrate layer with a placeholder image and command = ["alembic"], args = ["upgrade", "head"]. svc-studio CI overrides the image at execute time:

    gcloud run jobs update studio-{env}-migrate --image …:<sha>
    gcloud run jobs execute studio-{env}-migrate --wait

    Only after the job exits 0 does the deploy proceed to:

    gcloud run services update studio-{env}-api --image …:<sha>

    A failed migration aborts the deploy; the api service stays on the previous (compatible) revision.

  4. Privilege separation at the Postgres level. The migrator Postgres role has cloudsqlsuperuser (DDL + GRANT). The api Postgres role has only DML (SELECT/INSERT/UPDATE/DELETE) on the app schema, granted automatically by ALTER DEFAULT PRIVILEGES FOR ROLE migrator IN SCHEMA app. Application code therefore cannot run CREATE TABLE, ALTER TABLE, DROP …, or any DDL — even if a SQL injection were to occur, the worst possible outcome is data corruption within the existing schema, not schema modification.

  5. Forward-only, expand-contract discipline. Migrations are never reversed in production by running alembic downgrade. Schema changes are designed so the previous-revision app code and the next-revision app code can both run against the new schema for at least one deploy:

    • Adding a column → nullable or with default; backfill in a separate migration; only after old code is gone, mark NOT NULL.
    • Removing a column → stop reading it in code first, deploy, then drop in a later migration.
    • Renaming a column → add new column, dual-write, backfill, switch reads, drop old column. Four deploys minimum.
  6. Autogenerate is a draft, not a commit. alembic revision --autogenerate -m "…" is allowed for bootstrapping a migration file, but every generated migration is reviewed by hand: index choices, column types (especially timestamps + numerics), enum changes, foreign-key cascades, and the absence of unintentional op.drop_* calls.

Consequences

  • Single source of truth for the schema. The alembic/versions/ directory is the authoritative history. The DB's alembic_version table identifies exactly which migration is applied where. No drift between environments is possible without it being detectable by re-running alembic current against the DB.

  • Code and schema deploy atomically. Because the migration job uses the same image SHA as the api service and runs first, there is no window in which the deployed code expects schema state that hasn't been applied. This eliminates an entire category of incident (the "deploy went out, half the requests are 500ing because the new code references a column that doesn't exist yet" class).

  • Application code cannot accidentally migrate the schema. A misconfigured ORM, a debug endpoint, a third-party library doing CREATE INDEX on first use, a successful SQL injection — none of them can alter the schema, because the api role lacks DDL privileges at the Postgres permission level. The only way to migrate the schema is to push a commit through CI. This is the keystone of ADR-0014 in operational practice.

  • Forward-only discipline costs developer time. A column rename is four deploys, not one. This is intentional: it is also four deploys with zero risk of dropping requests. Teams that have learned this discipline (Heroku release-phase pattern, Stripe, GitHub) report it pays for itself within months of the first avoided incident.

  • Migrations that are too slow for the deploy gate must be split. The studio-{env}-migrate job has a 600s (10min) timeout. Migrations that exceed this (large table rewrites, long CREATE INDEX outside CONCURRENTLY) are an architecture smell and must be broken into a non-blocking concurrent migration plus a verification step.

  • Adds a Python dependency Studio doesn't strictly need at runtime. alembic is on the api service's runtime PATH so the same image can run as the migration job. ~1 MB image overhead and one extra entry in pyproject.toml. Acceptable.

  • The migrator service account has cloudsqlsuperuser. This is a high-value identity. It is only used by the Cloud Run Job — never assumed by a human, never assigned to another service, never granted to anything outside the cloud-run-migrate layer. Its key material does not exist (Cloud Run Jobs use Workload Identity, not key files). This is enforced by IAM at the project level.

  • Bootstrap of a new environment (e.g. qa) requires running alembic upgrade head once before the first api deploy can succeed. The deploy pipeline does this automatically (the migration job runs every deploy, and is idempotent on a brand-new database — it just applies all migrations from scratch).

Alternatives considered

  • Hand-written SQL files in cloud-sql-bootstrap. Rejected: there is no notion of "which file has been applied"; re-runs would fail or duplicate; no rollback semantics; cannot be autogenerated from model changes; pushes the entire schema history into the IaC repo where it will rot and be ignored.

  • Base.metadata.create_all() / "create tables on startup". Rejected: causes race conditions across multiple Cloud Run instances starting at the same time, has no notion of versions, cannot evolve a schema (only create it from scratch), and forces application code to run with DDL privileges (violates ADR-0014's app-layer-only model).

  • Django migrations. Rejected: Studio is not on Django; pulling in Django ORM as a sidecar to manage schema for a non-Django service is absurd.

  • SQLModel / Tortoise migrations. Rejected: SQLModel's migration story is essentially "use Alembic anyway." Tortoise has its own migrator (aerich) but is a much smaller ecosystem and ties us to Tortoise ORM, which we are not using.

  • Prisma / Atlas / Skeema (or any non-Python schema tool). Rejected: adds a non-Python toolchain to a Python service. New binary in the Docker image, new mental model for the team, two languages to keep in sync. The win (declarative schema) is real but does not outweigh the cost for a single-service-single-DB setup. Worth re-evaluating if Studio ever spans many DBs.

  • Migrations as a gcloud sql import step from CI. Rejected: requires Cloud SQL imports of arbitrary SQL files, which need the Cloud SQL Auth Proxy or public IP. Our instance has no public IP. Doesn't solve the versioning problem either.

  • Migrations executed by the api service on startup (singleton lock pattern). Rejected: requires the api service to run with DDL privileges, defeating ADR-0014's privilege-separation goal. Also couples migration timing to revision rollouts in a way that makes failed migrations harder to roll back.

  • No deploy-time gate; migrations are a separate manual step. Rejected: the entire incident class this ADR is designed to prevent assumes the migration is always up-to-date with the deployed code. A human-in-the-loop step will be skipped, deferred, or done in the wrong order eventually. The cost is one extra Cloud Run Job execution per deploy (~30s) — trivial vs. the on-call shift it pays for.

References

  • ADR-0015 — Studio multi-tenant SaaS aligned with svc-access (the app-layer-isolation model this ADR enforces operationally)
  • ADR-0016 — Studio repo split (defines svc-studio + iac-studio as separate repos)
  • iac-studio#17 — roles/cloudsql.client on the api SA
  • iac-studio#18 — cloud-run-migrate layer (migrator SA + Cloud Run Job)
  • iac-studio#19 — cloud-sql-bootstrap extension (app schema + ALTER DEFAULT PRIVILEGES)
  • Alembic documentation
  • Heroku release phase — the deploy-time-migration pattern this ADR adopts
  • Postgres ALTER DEFAULT PRIVILEGES