Summary

On November 3rd, 2023, we discovered an issue where a small percentage of payload data was missing from both of our storage systems, BigTable and ClickHouse. We first became aware of the issue when an alert fired indicating that an event was stuck in the DELIVERING state, which is meant to be temporary.

Upon examination, we discovered that the payload never made it to our storage system, yet the event was still delivered successfully because we also have a cache in between, with a 10-minute TTL. The primary purpose of this cache is to reduce latency. Our priority at this point was to stop this from happening again as quickly as possible.

Our system relies heavily on message queues (Google PubSub in this instance). When we receive a request, it enters our ingestion queue and gets processed, and a message is then published to another queue, which handles the upload to our other stores. We discovered that the main code path would acknowledge the payload message (causing it to be deleted) even if the publish operation to the upload queue failed. However, since payloads are also stored in our cache (Redis), the vast majority of them were still delivered successfully. Events for which those 2 conditions held true did NOT get delivered:

0.00039% of all of our payloads were impacted by this issue between October 21st 2023 and November 3rd 2023. 0.00002% of the total payloads were never delivered. The remaining 0.00037% were delivered to their destination at least once.

A series of events led to this unfortunate situation:

The end result was that we lost some of our payloads.
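To make the acknowledgement bug described above concrete, here is a minimal sketch in Python. It is not our production code and does not use the real Google PubSub client: `FakeMessage`, `FakeUploadQueue`, `PublishError`, `handle_buggy` and `handle_fixed` are hypothetical stand-ins invented for illustration, and a plain dict stands in for the Redis cache. The first handler acknowledges the message even when the publish to the upload queue fails; the second shows one common way to avoid that, assuming the queue redelivers messages that are not acknowledged.

```python
class PublishError(Exception):
    """Raised by the hypothetical upload queue when a publish fails."""


class FakeMessage:
    """Stand-in for a queue message; records whether it was acked or nacked."""

    def __init__(self, data):
        self.data = data
        self.acked = False
        self.nacked = False

    def ack(self):
        self.acked = True

    def nack(self):
        self.nacked = True


class FakeUploadQueue:
    """Stand-in for the upload queue; can be told to fail every publish."""

    def __init__(self, fail=False):
        self.fail = fail
        self.published = []

    def publish(self, data):
        if self.fail:
            raise PublishError("upload queue unavailable")
        self.published.append(data)


def handle_buggy(message, cache, upload_queue):
    """Acks unconditionally, so a failed publish silently drops the upload."""
    cache[message.data["id"]] = message.data  # short-lived copy for delivery
    try:
        upload_queue.publish(message.data)
    except PublishError:
        pass  # failure is swallowed; nothing will retry the upload
    message.ack()  # the message is deleted from the ingestion queue regardless


def handle_fixed(message, cache, upload_queue):
    """Acks only after the publish succeeded; otherwise asks for redelivery."""
    cache[message.data["id"]] = message.data
    try:
        upload_queue.publish(message.data)
    except PublishError:
        message.nack()  # leave the message on the queue so it is retried
        return False
    message.ack()
    return True
```

With the buggy handler, the payload lives only in the cache after a failed publish, so delivery still works until the TTL expires and the payload is never written to long-term storage. With the second handler, a failed publish leaves the message unacknowledged, at the cost of possible duplicate processing on retry.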

Timeline

Lessons Learned

What went wrong

Where we got lucky