Summary

On November 18th, 2023, we experienced two instances of degraded performance. The first one started at 12:13:00 UTC and ended at 12:40:00 UTC, and the second one started at 19:07:00 UTC and ended at 20:21:00 UTC.

A bug in the way we apply rate limits at ingestion was the root cause of both of these instances. Each of our workspaces is assigned a maximum number of requests per second, usually determined by the plan the workspace is subscribed to. When a workspace exceeds that requests-per-second limit, and once a leaky bucket has filled (to allow for spikes), ingestion for that workspace is switched to a low-priority queue. This ensures that all of our customers get a fair share of processing time and timely delivery of the messages in the normal-priority queue.
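As a rough illustration of that decision, here is a minimal sketch of a per-workspace leaky bucket that picks the ingestion queue; the class, field names, and burst handling below are hypothetical stand-ins, not our production code.

```ts
// Minimal sketch of the per-workspace routing decision (illustrative only).
type Priority = "normal" | "low";

interface WorkspacePlan {
  maxRequestsPerSecond: number; // limit assigned by the workspace's plan
  burstCapacity: number;        // extra headroom to absorb short spikes
}

class LeakyBucket {
  private level = 0;
  private lastDrain = Date.now();

  constructor(private plan: WorkspacePlan) {}

  // Decide which queue the next request for this workspace should go to.
  admit(): Priority {
    const now = Date.now();
    const elapsedSeconds = (now - this.lastDrain) / 1000;

    // Drain the bucket at the rate allowed by the plan.
    this.level = Math.max(
      0,
      this.level - elapsedSeconds * this.plan.maxRequestsPerSecond
    );
    this.lastDrain = now;
    this.level += 1;

    // Under the limit (plus burst headroom): stay on the normal-priority queue.
    if (this.level <= this.plan.maxRequestsPerSecond + this.plan.burstCapacity) {
      return "normal";
    }
    // Sustained overage: switch this workspace's ingestion to the low-priority queue.
    return "low";
  }
}
```

In production this counter state cannot live in the memory of a single worker, which is why it is backed by Durable Objects, as described next.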

Our rate limiter runs on our service provider's infrastructure, Cloudflare. It is implemented using Durable Objects, a technology that lets us maintain a counter-style mechanism in a highly distributed environment.
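For readers unfamiliar with Durable Objects, here is a simplified counter-style object, loosely following Cloudflare's documented pattern; our real limiter carries more logic, and the class and storage key names here are placeholders.

```ts
// Simplified counter-style Durable Object (placeholder names, not our real limiter).
export class WorkspaceRateCounter {
  constructor(private state: DurableObjectState) {}

  async fetch(request: Request): Promise<Response> {
    // One Durable Object instance per workspace, so each count is naturally isolated.
    let count = (await this.state.storage.get<number>("count")) ?? 0;
    count += 1;
    await this.state.storage.put("count", count);

    return new Response(JSON.stringify({ count }), {
      headers: { "content-type": "application/json" },
    });
  }
}
```

The property that matters for this incident is that every ingested request translates into one call to the sending workspace's single Durable Object instance.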

In both of the above instances, a single workspace was sending requests at a rate of about 25,000 requests per second. Our ingestion system is designed to handle that kind of volume. However, Durable Objects have a limit of about 1,000 requests per second per instance, and in our case an instance corresponds to a single workspace. Because every request we receive triggers a call to that workspace's Durable Object, only about 1,000 of those 25,000 requests per second could be evaluated by the rate limiter and routed to the low-priority queue as they should be; the calls for the remaining roughly 24,000 requests per second failed because the Durable Object was overloaded, and those requests ended up in the normal-priority queue instead of the low one.
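To make the failure mode concrete, here is a sketch of the fallback the ingestion path effectively had before the fix; callRateLimiter and the other names are hypothetical stand-ins for our actual code.

```ts
// Sketch of the pre-fix behaviour (hypothetical names).
type Priority = "normal" | "low";

// Stand-in for the call to the workspace's rate-limiting Durable Object.
declare function callRateLimiter(workspaceId: string, req: Request): Promise<Priority>;

async function resolvePriorityBeforeFix(workspaceId: string, req: Request): Promise<Priority> {
  try {
    // Works only while the Durable Object keeps up (about 1,000 requests per second).
    return await callRateLimiter(workspaceId, req);
  } catch {
    // Bug: when the Durable Object was overloaded, the error path fell back to
    // the normal-priority queue, so roughly 24,000 of the workspace's 25,000
    // requests per second landed there.
    return "normal";
  }
}
```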

This resulted in our normal-priority queue being overwhelmed with requests from that single workspace. Consequently, the workers processing that queue had to work through the backlog of requests before getting to requests from other workspaces, which is exactly what our priority system was designed to prevent. Given the size of the backlog, limits on our database connection capacity and CPU bottlenecks prevented us from consuming the queue any faster. P99 ingest-to-first-delivery-attempt latency peaked at 22.7 minutes during the first occurrence (12:13:00 to 12:40:00 UTC) and at 63.3 minutes during the second occurrence (19:07:00 to 20:21:00 UTC).

Upon identifying the root cause, we issued a fix that changes a request's priority to low whenever the call to the rate limiter fails with an error identified as an overload (too many requests, more than roughly 1,000 per second). In effect, this works because any request beyond 1,000 in a single second would have been classified as low priority anyway.
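In terms of the sketch above, the fix amounts to treating the overload error as a signal instead of letting it fall through to the normal-priority queue; isOverloadError and the other names remain illustrative.

```ts
// Sketch of the fixed behaviour (hypothetical names, mirroring the earlier sketch).
type Priority = "normal" | "low";

declare function callRateLimiter(workspaceId: string, req: Request): Promise<Priority>;
// Stand-in for detecting the "instance over ~1,000 requests/second" failure.
declare function isOverloadError(err: unknown): boolean;

async function resolvePriority(workspaceId: string, req: Request): Promise<Priority> {
  try {
    return await callRateLimiter(workspaceId, req);
  } catch (err) {
    if (isOverloadError(err)) {
      // The Durable Object is only saturated when the workspace is far above
      // its plan limit, so defaulting to low priority is correct.
      return "low";
    }
    // Unrelated failures still surface as errors.
    throw err;
  }
}
```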

This resolves the issue: isolation between workspaces is maintained (each workspace that could burst has its own Durable Object instance), and requests above Cloudflare's per-instance limit are correctly tagged as low priority, as they should be.

Timeline

Lessons Learned

What went wrong

What went well