Context
ADR-0014 (0015-studio-multi-tenant-saas-aligned-svc-access) commits Studio to a single shared Postgres database with app-layer tenant isolation (every tenant-scoped query carries an explicit tenant_id filter; no row-level security). That model is only safe if the schema itself — table layout, indexes, default values, the tenant_id columns themselves — is managed deliberately, with the same rigor as code.
We need an answer to four questions for every schema change Studio ever makes:
- What is the source of truth for the schema? — not a snapshot in someone's psql session, not the result of a
CREATE TABLEsomeone hand-typed against dev. - How do we know which version of the schema is in dev / qa / prod? — and how do we know they match the version of the application code that's currently serving requests there.
- How do we apply a schema change safely? — without dropping requests, without a window in which the deployed code expects a column that doesn't exist yet (or vice versa), and without manual psql sessions in production.
- How do we keep the runtime application's identity from accidentally being able to alter the schema? — i.e. how do we enforce ADR-0014's app-layer-only model at the Postgres permission level, not just by convention.
Specific constraints:
- The Cloud SQL instance is private-IP only (network layer + VPC peering — no public IP). Migrations cannot be run from a developer laptop or a vanilla GitHub-hosted runner; they must originate from inside the VPC.
- Runtime: Python + SQLAlchemy 2.0 async + asyncpg on Cloud Run (chosen separately as part of Phase 2 of the Studio plan).
- Deploys are continuous and image-based: every merge to a deploy branch publishes a new container image SHA; the api Cloud Run service rolls forward to that SHA.
- We have already shipped (as of 2026-05-04) the IaC primitives this ADR depends on:
cloud-run-migratelayer — privileged migrator service account + Cloud Run Job (iac-studio#18).cloud-sql-bootstrapextension —appschema,migratoriscloudsqlsuperuser,apiis granted DML viaALTER DEFAULT PRIVILEGES,api.search_path = app, public(iac-studio#19).
Decision
Adopt Alembic as Studio's exclusive Postgres schema-migration tool, and execute Alembic migrations from a Cloud Run Job that the deploy pipeline runs before flipping the api service to the new image.
Concretely:
-
Alembic (the SQLAlchemy-native schema migration framework) lives in
svc-studiounderalembic/with aversions/directory of forward-only migration scripts. Migrations are written as Python files using the Alembicop.*API (or raw SQL when needed for Postgres-specific features). Each migration has a deterministic revision id and an explicitdown_revisionlinking it to its parent. -
The migrator runs from inside the same image as the api service.
svc-studio's Dockerfile installsalembicas a runtime dependency (not dev-only), copiesalembic.iniandalembic/to the image, and the same SHA is pushed to Artifact Registry for both the api service and the migration job. One artifact, one SHA, one pipeline. -
Migrations execute as a Cloud Run Job (
studio-{env}-migrate) using a dedicated service account (studio-{env}-migrate@). The job is provisioned by the iac-studiocloud-run-migratelayer with a placeholder image andcommand = ["alembic"], args = ["upgrade", "head"]. svc-studio CI overrides the image at execute time:gcloud run jobs update studio-{env}-migrate --image …:<sha>gcloud run jobs execute studio-{env}-migrate --waitOnly after the job exits 0 does the deploy proceed to:
gcloud run services update studio-{env}-api --image …:<sha>A failed migration aborts the deploy; the api service stays on the previous (compatible) revision.
-
Privilege separation at the Postgres level. The
migratorPostgres role hascloudsqlsuperuser(DDL + GRANT). TheapiPostgres role has only DML (SELECT/INSERT/UPDATE/DELETE) on theappschema, granted automatically byALTER DEFAULT PRIVILEGES FOR ROLE migrator IN SCHEMA app. Application code therefore cannot runCREATE TABLE,ALTER TABLE,DROP …, or any DDL — even if a SQL injection were to occur, the worst possible outcome is data corruption within the existing schema, not schema modification. -
Forward-only, expand-contract discipline. Migrations are never reversed in production by running
alembic downgrade. Schema changes are designed so the previous-revision app code and the next-revision app code can both run against the new schema for at least one deploy:- Adding a column → nullable or with default; backfill in a separate migration; only after old code is gone, mark NOT NULL.
- Removing a column → stop reading it in code first, deploy, then drop in a later migration.
- Renaming a column → add new column, dual-write, backfill, switch reads, drop old column. Four deploys minimum.
-
Autogenerate is a draft, not a commit.
alembic revision --autogenerate -m "…"is allowed for bootstrapping a migration file, but every generated migration is reviewed by hand: index choices, column types (especially timestamps + numerics), enum changes, foreign-key cascades, and the absence of unintentionalop.drop_*calls.
Consequences
-
Single source of truth for the schema. The
alembic/versions/directory is the authoritative history. The DB'salembic_versiontable identifies exactly which migration is applied where. No drift between environments is possible without it being detectable by re-runningalembic currentagainst the DB. -
Code and schema deploy atomically. Because the migration job uses the same image SHA as the api service and runs first, there is no window in which the deployed code expects schema state that hasn't been applied. This eliminates an entire category of incident (the "deploy went out, half the requests are 500ing because the new code references a column that doesn't exist yet" class).
-
Application code cannot accidentally migrate the schema. A misconfigured ORM, a debug endpoint, a third-party library doing
CREATE INDEXon first use, a successful SQL injection — none of them can alter the schema, because the api role lacks DDL privileges at the Postgres permission level. The only way to migrate the schema is to push a commit through CI. This is the keystone of ADR-0014 in operational practice. -
Forward-only discipline costs developer time. A column rename is four deploys, not one. This is intentional: it is also four deploys with zero risk of dropping requests. Teams that have learned this discipline (Heroku release-phase pattern, Stripe, GitHub) report it pays for itself within months of the first avoided incident.
-
Migrations that are too slow for the deploy gate must be split. The
studio-{env}-migratejob has a 600s (10min) timeout. Migrations that exceed this (large table rewrites, longCREATE INDEXoutsideCONCURRENTLY) are an architecture smell and must be broken into a non-blocking concurrent migration plus a verification step. -
Adds a Python dependency Studio doesn't strictly need at runtime.
alembicis on the api service's runtime PATH so the same image can run as the migration job. ~1 MB image overhead and one extra entry inpyproject.toml. Acceptable. -
The migrator service account has cloudsqlsuperuser. This is a high-value identity. It is only used by the Cloud Run Job — never assumed by a human, never assigned to another service, never granted to anything outside the cloud-run-migrate layer. Its key material does not exist (Cloud Run Jobs use Workload Identity, not key files). This is enforced by IAM at the project level.
-
Bootstrap of a new environment (e.g. qa) requires running
alembic upgrade headonce before the first api deploy can succeed. The deploy pipeline does this automatically (the migration job runs every deploy, and is idempotent on a brand-new database — it just applies all migrations from scratch).
Alternatives considered
-
Hand-written SQL files in
cloud-sql-bootstrap. Rejected: there is no notion of "which file has been applied"; re-runs would fail or duplicate; no rollback semantics; cannot be autogenerated from model changes; pushes the entire schema history into the IaC repo where it will rot and be ignored. -
Base.metadata.create_all()/ "create tables on startup". Rejected: causes race conditions across multiple Cloud Run instances starting at the same time, has no notion of versions, cannot evolve a schema (only create it from scratch), and forces application code to run with DDL privileges (violates ADR-0014's app-layer-only model). -
Django migrations. Rejected: Studio is not on Django; pulling in Django ORM as a sidecar to manage schema for a non-Django service is absurd.
-
SQLModel / Tortoise migrations. Rejected: SQLModel's migration story is essentially "use Alembic anyway." Tortoise has its own migrator (
aerich) but is a much smaller ecosystem and ties us to Tortoise ORM, which we are not using. -
Prisma / Atlas / Skeema (or any non-Python schema tool). Rejected: adds a non-Python toolchain to a Python service. New binary in the Docker image, new mental model for the team, two languages to keep in sync. The win (declarative schema) is real but does not outweigh the cost for a single-service-single-DB setup. Worth re-evaluating if Studio ever spans many DBs.
-
Migrations as a
gcloud sql importstep from CI. Rejected: requires Cloud SQL imports of arbitrary SQL files, which need the Cloud SQL Auth Proxy or public IP. Our instance has no public IP. Doesn't solve the versioning problem either. -
Migrations executed by the api service on startup (singleton lock pattern). Rejected: requires the api service to run with DDL privileges, defeating ADR-0014's privilege-separation goal. Also couples migration timing to revision rollouts in a way that makes failed migrations harder to roll back.
-
No deploy-time gate; migrations are a separate manual step. Rejected: the entire incident class this ADR is designed to prevent assumes the migration is always up-to-date with the deployed code. A human-in-the-loop step will be skipped, deferred, or done in the wrong order eventually. The cost is one extra Cloud Run Job execution per deploy (~30s) — trivial vs. the on-call shift it pays for.
References
- ADR-0015 — Studio multi-tenant SaaS aligned with svc-access (the app-layer-isolation model this ADR enforces operationally)
- ADR-0016 — Studio repo split (defines svc-studio + iac-studio as separate repos)
- iac-studio#17 —
roles/cloudsql.clienton the api SA - iac-studio#18 — cloud-run-migrate layer (migrator SA + Cloud Run Job)
- iac-studio#19 — cloud-sql-bootstrap extension (app schema + ALTER DEFAULT PRIVILEGES)
- Alembic documentation
- Heroku release phase — the deploy-time-migration pattern this ADR adopts
- Postgres ALTER DEFAULT PRIVILEGES