Metadata plane partial outage impacting offline query scheduling & ui access

Incident Report for Chalk

Postmortem

Chalk has automated processes that update core infrastructure passwords on a periodic basis.

At approximately 11:45pm PT, a scheduling error caused two instances of the process to run simultaneously. This caused the ‘old’ password to be dropped two times, leading to some api server instances being unable to connect to their database.

We’ve resolved the configuration error that permitted two instances of the job to run.

Monitors alerted our on-call engineer; the issue was root-cased and resolved via a restart of api server instances running with the ‘old’ password.

Posted Apr 02, 2026 - 07:18 UTC

Resolved

From 11:50 PT->12:04am PT a significant percentage of metadata plane traffic returned 500 response codes with db login failures; this impacted:

- offline query triggers
- offline query status polling
- scheduled resolver execution
- very rarely, auth token exchange
- UI access

Root cause was an error in an automated password rotation process which dropped an 'old' password too early.
Posted Apr 02, 2026 - 07:17 UTC
This incident affected: Query API, Inbound Webhook API, and Control Plane.