Metadata plane outage

Incident Report for Chalk

Postmortem

A database migration was executed against the metadata plane db that dropped & recreated an important index. Expectation was that because:

(1) this was transactional

(2) this is a very small table (~10s of rows)

the op would complete ~instantly and there would be no production impact.

Empirically, a lock was held on a critical table for 16 minutes as a consequence, causing queries against the db to hang, causing connections to spike, causing connection acquisition to hit a hard limit and stop.

Remediation: we’re adding blocking CI step to reject any PR that creates an index without “concurrently” – the issue was noted in code review, and but we believed was not relevant due to size of table.

Most critical traffic (i.e. auth token exchanges) hits caches that sit in front of this table, so outage was limited to low-volume endpoints. Most impact was limited to offline query scheduling. Online query traffic does not route via this server, so was unimpacted.

Posted Aug 15, 2025 - 18:49 UTC

Resolved

Connection pool saturation in metadata plane; instantaneous doubling of conn count. Investigating root cause, but resolved.

Impact on job scheduling & dashboard; online query traffic & auth token exchange unimpacted.
Posted Aug 15, 2025 - 18:27 UTC

Investigating

Seeing 504s on api.chalk.ai; investigating.
Posted Aug 15, 2025 - 18:08 UTC