A database migration was executed against the metadata plane db that dropped & recreated an important index. Expectation was that because:
(1) this was transactional
(2) this is a very small table (~10s of rows)
the op would complete ~instantly and there would be no production impact.
Empirically, a lock was held on a critical table for 16 minutes as a consequence, causing queries against the db to hang, causing connections to spike, causing connection acquisition to hit a hard limit and stop.
Remediation: we’re adding a blocking CI step to reject any PR that creates an index without “concurrently” – the issue was noted in code review, but we believed it was not relevant due to size of table.
Most critical traffic (i.e. auth token exchanges) hits caches that sit in front of this table, so outage was limited to low-volume endpoints. Most impact was limited to offline query scheduling. Online query traffic does not route via this server, so was unimpacted.
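A sketch of the CI check described in the remediation, as an added example that is not the team's actual code. It assumes PostgreSQL syntax and plain .sql migration files passed as arguments, and it goes slightly beyond the stated rule by also flagging DROP INDEX, since the migration here dropped and recreated the index. The sample migration and its table and index names are constructed.

```python
import re
import sys

# CREATE [UNIQUE] INDEX / DROP INDEX, with an optional CONCURRENTLY right after.
INDEX_STMT = re.compile(
    r"\b(CREATE\s+(?:UNIQUE\s+)?INDEX|DROP\s+INDEX)\b(\s+CONCURRENTLY\b)?",
    re.IGNORECASE,
)

def strip_sql_comments(sql):
    """Blank out /* */ and -- comments, keeping newlines so line numbers hold."""
    sql = re.sub(r"/\*.*?\*/", lambda m: re.sub(r"[^\n]", " ", m.group()),
                 sql, flags=re.S)
    return re.sub(r"--[^\n]*", "", sql)

def find_blocking_index_ops(sql):
    """Return (line, statement) for every index operation lacking CONCURRENTLY."""
    cleaned = strip_sql_comments(sql)
    findings = []
    for m in INDEX_STMT.finditer(cleaned):
        if m.group(2) is None:
            line = cleaned.count("\n", 0, m.start()) + 1
            findings.append((line, " ".join(m.group(1).upper().split())))
    return findings

# Constructed sample migration (hypothetical names).
SAMPLE_MIGRATION = """\
-- constructed sample, not from the incident
-- DROP INDEX old_idx;  (commented out, must not be flagged)
DROP INDEX tokens_owner_idx;
CREATE UNIQUE INDEX tokens_owner_idx ON tokens (owner_id);
CREATE INDEX CONCURRENTLY tokens_exp_idx ON tokens (expires_at);
"""

if __name__ == "__main__":
    failed = False
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8") as fh:
            for line, kind in find_blocking_index_ops(fh.read()):
                print(f"{path}:{line}: {kind} without CONCURRENTLY")
                failed = True
    # A non-zero exit code makes the CI step blocking.
    sys.exit(1 if failed else 0)
```

PostgreSQL refuses CREATE INDEX CONCURRENTLY inside a transaction block, so a migration that passes this check also has to run non-transactionally.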