Apply updates
Who this is for: a Mastio operator whose deploy boots with a pending-updates warning, or who wants to understand how Cullis ships cross-cutting PKI/protocol fixes without silently breaking agents.
Prerequisites
- A reasonably recent Mastio image (
docker compose -p cullis-mastio exec mcp-proxy printenv CULLIS_MASTIO_VERSION). On the bundle,./deploy.sh --upgrade <version>keeps you current. - Admin secret available
- A fresh backup of the deploy — see Disaster recovery
curl,jq
What a framework update is
Some bugs need more than a code change. Consider the Org CA that Cullis v0.1 emitted with BasicConstraints(pathLen=0) — a git pull + rebuild patched the generator, but every Mastio already bootstrapped kept the broken CA in its database. External verifiers (compliance auditors, stdlib x509 libraries) reject those chains.
A framework update is a Python migration class that ships alongside the code fix. The Mastio discovers it at boot, inserts a row in the pending_updates table, and — if the migration is marked critical and affects enrollment methods the proxy actively uses — halts signing until the operator applies it. The admin then calls one endpoint to run the migration, which mutates state idempotently and records a rollback snapshot.
Three surfaces:
- Registry —
mcp_proxy/updates/migrations/holds migration classes. Each migration declares itsmigration_id,migration_type,criticality,affects_enrollments, and adescriptionstring that shows up verbatim in the dashboard. - Boot detector — runs after Mastio identity bootstrap. Every registered migration’s
check()runs; pending rows insert withstatus='pending'; thecullis_pending_updates_total{status}gauge refreshes; acriticalmigration whoseaffects_enrollmentsintersects the live enrollment types flips the sign-halt flag. - Admin apply/rollback —
POST /proxy/updates/<id>/applyrunsup(), writes a snapshot tomigration_state_backups, marks the rowapplied.POST /proxy/updates/<id>/rollbackrestores from the snapshot.
Detect pending updates
/healthz
curl -sk https://localhost:9443/healthz | jq
Example output:
{
"status": "ok",
"warnings": ["org_ca_legacy_pathlen_zero"],
"pending_updates": 1
}
A clean Mastio omits both warnings and pending_updates.
Prometheus gauge
cullis_pending_updates_total{status="pending"} 1
cullis_pending_updates_total{status="applied"} 0
cullis_pending_updates_total{status="rolled_back"} 0
Alert on cullis_pending_updates_total{status="pending"} > 0 for more than 24 hours in production.
Dashboard + admin API
The dashboard exposes the pending list at https://mastio.example.com/proxy/updates — JSON list is available at /proxy/updates/api:
ADMIN_SECRET="$(grep ^MCP_PROXY_ADMIN_SECRET proxy.env | cut -d= -f2)"
curl -sk https://localhost:9443/proxy/updates/api \
-H "X-Admin-Secret: $ADMIN_SECRET" | jq
Example:
[
{
"migration_id": "2026-04-23-org-ca-pathlen-1",
"description": "Rotate the Org CA to BasicConstraints(pathLen=1) preserving every agent's public key — repairs proxies whose Org CA was generated before PR #284 fixed the pathLen=0 bug.",
"migration_type": "cert-schema",
"criticality": "critical",
"affects_enrollments": ["connector"],
"status": "pending",
"detected_at": "2026-04-23T19:04:21Z"
}
]
A note on affects_enrollments: the label "connector" is the internal enrollment-type tag for Mastio-managed enrollment (the Mastio holds the Org CA private key and mints agent certs itself). It is NOT a reference to the Connector binary, which is no longer part of the public repo. BYOCA-enrolled agents are flagged separately because their cert material lives outside the Mastio.
Sign halt
If the boot detector sees a migration with criticality == "critical" and affects_enrollments that overlaps your deploy’s active enrollment types, it engages a sign halt: every signing call raises RuntimeError: signing halted — pending migration {id}. Agents cannot mint new DPoP-bound tokens until you apply the update.
This is deliberate. The alternative is issuing tokens external verifiers will reject for reasons you can’t diagnose from logs.
Indicators of a sign halt in progress:
ERROR cullis_proxy_sign_halt_pending_migrationin the logs with themigration_id/healthzreturns 200 but/v1/auth/tokenand every signing call fails fast with a human-readable503cullis_mastio_sign_haltedgauge = 1
Apply an update
Review the migration first. Read the description, check the PR linked in the release notes, confirm the rollback snapshot behavior matches your recovery tolerance.
From the admin API
ADMIN_SECRET="$(grep ^MCP_PROXY_ADMIN_SECRET proxy.env | cut -d= -f2)"
MASTIO="https://localhost:9443"
MIGRATION_ID="2026-04-23-org-ca-pathlen-1"
curl -sk -X POST "$MASTIO/proxy/updates/$MIGRATION_ID/apply" \
-H "X-Admin-Secret: $ADMIN_SECRET" \
-H "Content-Type: application/json" \
-d '{"confirm": true}'
Expected:
{"migration_id": "2026-04-23-org-ca-pathlen-1", "status": "applied", "applied_at": "2026-04-23T19:08:12Z"}
The migration runs idempotently: check() short-circuits to no-op if the state is already fixed. Re-applying is safe but the backup snapshot is overwritten.
From the dashboard
https://mastio.example.com/proxy/updates lists every registered migration with description, criticality badge, and two buttons (Apply / View rollback plan). Same outcome as the API call above.
Roll back
Every successful apply writes a snapshot to migration_state_backups keyed by migration_id. Roll back with:
curl -sk -X POST "$MASTIO/proxy/updates/$MIGRATION_ID/rollback" \
-H "X-Admin-Secret: $ADMIN_SECRET" \
-H "Content-Type: application/json" \
-d '{"confirm": true}'
Expected: {"migration_id": "...", "status": "rolled_back", "rolled_back_at": "..."}.
A second rollback for the same migration_id fails with 404 no snapshot — state has already moved. The row moves to status='rolled_back'; the boot detector won’t re-propose it to pending. If you want the migration back in pending, drop the row manually.
SQLite (default):
docker compose -p cullis-mastio exec mcp-proxy sqlite3 /data/mcp_proxy.db \
"DELETE FROM pending_updates WHERE migration_id = '$MIGRATION_ID';"
Postgres (opt-in):
psql -h "$PG_HOST" -U cullis -d cullis -c \
"DELETE FROM pending_updates WHERE migration_id = '$MIGRATION_ID';"
The detector will re-insert it on next boot if check() still returns True.
Worked example — Org CA pathLen=0
Migration id: 2026-04-23-org-ca-pathlen-1.
What it does
- Detects
BasicConstraints(pathLen=0)on the Org CA root - Generates a fresh RSA-4096 Org CA with
pathLen=1, inheritingnotAfterfrom the old CA - Re-signs every agent’s leaf certificate, preserving subject, public key, SAN, and validity
- Assigns fresh 128-bit leaf serials (RFC 5280 §4.1.2.2) to avoid stale-cache verifier conflicts
- Writes the pre-apply state to
migration_state_backupsbefore mutating anything
Why agents don’t re-enroll
The migration preserves agent public keys. Agents keep signing with the private keys they already hold. Only the cert chain changes — the SDK reloads the Org CA bundle on the next handshake without any user-visible step.
What is explicitly out of scope
- BYOCA agents. The org holds their private keys, not the Mastio, so the auto-migrator can’t re-sign leaves. BYOCA operators re-run
enroll_via_byocaagainst the new Org CA. Tracked separately. - Expired Org CA. The migration refuses to run if the current CA’s
notAfteris in the past — the repair would extend expiry silently. Use the operator-driven Org CA rotation instead (see Rotate keys § 2).
Verify after apply
cullis_pending_updates_total{status="pending"}= 0 (for this migration)/healthzdropsorg_ca_legacy_pathlen_zerofrom thewarningsarraycullis_mastio_sign_halted= 0- One agent end-to-end (mint a token + make a call) succeeds against the new chain
Troubleshoot
Apply returns 409 already applied
: The row is already status='applied'. Read detected_at + applied_at — someone else applied it (check the audit log, filter event_type=admin.update_applied). No action needed.
Apply returns 412 sign halt mismatch
: Another migration is engaged on the halt flag. Apply that one first, or drop it explicitly; GET /proxy/updates/api?status=pending lists them in order.
up() raises mid-apply
: The snapshot was written before the mutation — rollback is safe. Call POST /proxy/updates/{id}/rollback, inspect the logs (cullis_proxy_update_apply_failed with a traceback), then either fix the environment and retry, or escalate the migration as a defect.
I want to disable a migration I consider incompatible
: Set status='rolled_back' directly in pending_updates via SQL (commands above). The detector respects non-pending rows. This is a last resort — the migration shipped as critical for a reason, and the halt will re-engage on any subsequent detected_at if you delete the row.
Next
- Rotate keys — manual key rotation when a framework update isn’t the right tool
- Disaster recovery — take a backup before applying any update
- Audit export — confirm apply events landed in the hash chain
- Runbook § monitoring — where to point Prometheus at the new gauge