Description
Currently (at least in Druid 28; I'm unsure whether this has changed since), when an invalid ingestion spec is received, the supervisor is deleted. As a result, a bug in ingestion spec generation or submission can quickly turn into a considerable outage. Instead, if an invalid supervisor spec is submitted, Druid should keep the supervisor running with its previous config and log the error.
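To illustrate the requested behavior, here is a minimal sketch of "validate before replace". It is not Druid code: `SupervisorSpec`, `SupervisorRegistry`, `validate`, and the snake_case field names are hypothetical stand-ins, and the only rule checked is the min/max task count one from my case.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class SupervisorSpec:
    # Hypothetical, heavily simplified stand-in for a supervisor spec.
    supervisor_id: str
    task_count_min: int
    task_count_max: int


def validate(spec: SupervisorSpec) -> None:
    # Only one illustrative rule; a real spec has many more checks.
    if spec.task_count_min > spec.task_count_max:
        raise ValueError(
            f"taskCountMin ({spec.task_count_min}) is greater than "
            f"taskCountMax ({spec.task_count_max})"
        )


class SupervisorRegistry:
    # Hypothetical holder of the currently active spec per supervisor.
    def __init__(self) -> None:
        self._active: Dict[str, SupervisorSpec] = {}
        self.errors: List[str] = []

    def submit(self, spec: SupervisorSpec) -> bool:
        # Validate first; only a valid spec may replace the active one.
        try:
            validate(spec)
        except ValueError as exc:
            # Record the error and leave the previous spec untouched.
            self.errors.append(f"{spec.supervisor_id}: {exc}")
            return False
        self._active[spec.supervisor_id] = spec
        return True

    def get(self, supervisor_id: str) -> Optional[SupervisorSpec]:
        return self._active.get(supervisor_id)
```

With this shape, submitting a spec whose minimum exceeds its maximum returns `False` and records an error, while `get` still returns the last valid spec, so the supervisor stays visible to monitoring.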
Motivation
I accidentally caused an overnight outage of our Druid ingestion due to a bug in our supervisor config generation. It submitted a config that did not pass validation (taskCountMin was greater than taskCountMax). Druid caught the error, but then deleted the supervisor because it had no valid configuration. Our alerting didn't catch it because the entire supervisor was deleted, so there was no data reporting that it was down. This outage would have been prevented if Druid didn't destroy supervisors on bad config updates.