Metadata plane outage

Incident Report for Chalk

Postmortem

A database migration was executed against the metadata plane db that dropped & recreated an important index. Expectation was that because:

(1) this was transactional

(2) this is a very small table (~10s of rows)

the op would complete ~instantly and there would be no production impact.

‌

Empirically, a lock was held on a critical table for 16 minutes as a consequence, causing queries against the db to hang, causing connections to spike, causing connection acquisition to hit a hard limit and stop.

‌

Remediation: we’re adding blocking CI step to reject any PR that creates an index without “concurrently” – the issue was noted in code review, and but we believed was not relevant due to size of table.

‌

Most critical traffic (i.e. auth token exchanges) hits caches that sit in front of this table, so outage was limited to low-volume endpoints. Most impact was limited to offline query scheduling. Online query traffic does not route via this server, so was unimpacted.

Posted Aug 15, 2025 - 18:49 UTC

Resolved

Connection pool saturation in metadata plane; instantaneous doubling of conn count. Investigating root cause, but resolved.

Impact on job scheduling & dashboard; online query traffic & auth token exchange unimpacted.

Posted Aug 15, 2025 - 18:27 UTC

Investigating

Seeing 504s on api.chalk.ai; investigating.

Posted Aug 15, 2025 - 18:08 UTC