Summary
On November 3rd 2023, we discovered an issue where a small percentage of payload data was missing from both of our storage systems, BigTable and ClickHouse. We first became aware of the issue when an alert fired indicating that an event was stuck in the DELIVERING state, which should only ever be a temporary state.
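A stuck-state check like the one that fired here can be sketched as follows. This is a minimal illustration, not the real alerting code: the event shape, field names, and the 30-minute threshold are all assumptions.

```python
from datetime import datetime, timedelta, timezone

# Assumed threshold: DELIVERING should be transient, so anything in that
# state for longer than this is flagged for investigation.
STUCK_THRESHOLD = timedelta(minutes=30)

def find_stuck_events(events, now=None):
    """Return events still in DELIVERING past the threshold."""
    now = now or datetime.now(timezone.utc)
    return [
        e for e in events
        if e["state"] == "DELIVERING" and now - e["updated_at"] > STUCK_THRESHOLD
    ]

now = datetime.now(timezone.utc)
events = [
    {"id": "evt_1", "state": "DELIVERING", "updated_at": now - timedelta(hours=2)},
    {"id": "evt_2", "state": "DELIVERED",  "updated_at": now - timedelta(hours=2)},
    {"id": "evt_3", "state": "DELIVERING", "updated_at": now - timedelta(minutes=5)},
]
stuck = find_stuck_events(events, now)
print([e["id"] for e in stuck])  # ['evt_1']
```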
Upon examination, we discovered that the payload never made it to our storage systems, yet the event was still delivered because we also keep a cache (Redis) in between with a 10-minute TTL, whose primary purpose is to reduce latency. Our immediate focus was to stop the loss as quickly as possible. Our system relies heavily on message queues (Google PubSub in this instance): when we receive a request, it enters our ingestion queue, gets processed, and a message is then published to another queue that handles the upload to our other stores. We discovered that the main code path would acknowledge the ingestion message (which in turn deletes it) even if the publish operation to the upload queue failed. Because payloads are also stored in the cache, the vast majority were still delivered successfully. Only events for which both of the following conditions held true were NOT delivered:
- The destination had more than 10 minutes of back pressure
- Publish operation for payload upload failed
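The flawed path described above can be sketched like this. All names (`Message`, `publish_with_retry`, the in-memory stand-ins for Redis and PubSub) are hypothetical, chosen only to illustrate the failure mode:

```python
class Message:
    """Stand-in for a PubSub ingestion message."""
    def __init__(self, payload_id, payload):
        self.payload_id = payload_id
        self.payload = payload
        self.acked = False

    def ack(self):
        self.acked = True  # acking deletes the message from the queue

cache = {}                    # stands in for Redis (10-minute TTL in production)
upload_queue = []             # stands in for the payload-upload PubSub topic
publish_should_fail = False   # toggled to simulate a PubSub outage / back pressure

def try_publish(payload):
    if publish_should_fail:
        return False
    upload_queue.append(payload)
    return True

def publish_with_retry(payload, attempts=3):
    """The publish itself is retried, but can still exhaust all attempts."""
    return any(try_publish(payload) for _ in range(attempts))

def handle_ingestion_message(message):
    cache[message.payload_id] = message.payload  # masks most failures for 10 min
    ok = publish_with_retry(message.payload)
    # BUG: the message is acknowledged even when every publish attempt failed,
    # so the payload survives only in the cache and is lost once the TTL expires.
    message.ack()
    return ok

msg = Message("evt_1", {"data": 42})
publish_should_fail = True
handle_ingestion_message(msg)
print(msg.acked, "evt_1" in cache, len(upload_queue))  # True True 0
```

With the publish failing, the payload exists only in the cache: delivery succeeds for 10 minutes (masking the bug), and is lost afterwards.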
Between October 21st and November 3rd 2023, 0.00039% of all of our payloads were impacted by this issue. 0.00002% of the total payloads were never delivered; the remaining 0.00037% were delivered to their destination at least once.
A series of events led to this unfortunate situation:
- We recently switched our backend store for payloads. Previously, our payloads were stored in our main PostgreSQL database. Over a period of about a month we wrote to both stores simultaneously, gradually switched reads over to Google BigTable, and finally stopped writing to PostgreSQL on October 19th.
- On October 20th, we modified part of our ingestion system to take transformation execution out of a database transaction. In doing so, we introduced a pattern with commit operations (to the database) and post-commit operations, and we treated the payload upload's publish as a post-commit operation. While the publish itself is retried, a post-commit operation failure still results in the message being acknowledged. Therefore, when a publish ultimately failed after all retries, the ingestion message was wrongfully acknowledged anyway.
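The commit/post-commit pattern and its fix can be sketched as follows. This is a simplified model under assumed names, not the production code: the fix is to surface a post-commit publish failure by leaving the message unacknowledged (nack) so PubSub redelivers it, rather than acknowledging unconditionally.

```python
class Message:
    """Stand-in for a PubSub ingestion message."""
    def __init__(self, payload):
        self.payload = payload
        self.acked = False
        self.nacked = False

    def ack(self):
        self.acked = True

    def nack(self):
        self.nacked = True  # PubSub redelivers unacked/nacked messages

def run_post_commit(message, publish):
    """Run the post-commit publish; ack only on success."""
    ok = publish(message.payload)
    if ok:
        message.ack()
    else:
        # Fix: instead of swallowing the failure and acking anyway, nack so
        # the message is redelivered and the whole flow is retried end to end.
        message.nack()
    return ok

msg = Message({"data": 1})
run_post_commit(msg, lambda payload: False)  # simulate exhausted publish retries
print(msg.acked, msg.nacked)  # False True
```

The trade-off is at-least-once processing: a redelivered message may be handled twice, so the post-commit operations need to be idempotent.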
The end result was the loss of some of our payloads.
Timeline
- October 19th 2023: Switch over to saving payloads to BigTable only (stop saving in PostgreSQL)
- October 21st 2023, 01:54 UTC: Released the transformation fix
- November 3rd 2023, 13:00 UTC: Detected the first instance of missing payloads
- November 3rd 2023, 20:26 UTC: Released a fix to mitigate the issue
Lessons Learned
What went wrong
- It took us 13 days to detect the issue. It was largely hidden because the failure frequency was low and, when it did happen, the payload was usually still in the cache, so the first delivery attempt succeeded.
- We had no other source to fall back on
Where we got lucky