On August 24th, 2024, at 02:00 UTC, we started having an issue where some events were either missing from the dashboard or shown with the wrong status. This affected reporting (the views in the dashboard), not event delivery itself.
Hookdeck's architecture is inspired by event sourcing design principles. State changes produce events, which are then captured in different systems to perform different actions. One of these systems is our ClickHouse database, where event state changes are synced and processed to produce a history. That data is then used to display the events list, request list, histograms, and metrics, among other things.
With ClickHouse specifically, we use the VersionedCollapsingMergeTree engine. ClickHouse is designed to ingest data at a high rate and in large batches, and we use the Kafka Connect ClickHouse sink connector to get data from our event processing engine into the database.
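For illustration, here is a minimal sketch of what such a table can look like. The column names (`event_id`, `status`, `created_at`, `sign`, `version`) are placeholders for the example, not our actual schema.

```sql
-- Hypothetical events table using VersionedCollapsingMergeTree.
-- Each state change is written as a new row; a -1 "sign" row cancels the
-- previous state, and the version column lets ClickHouse collapse rows
-- correctly even when they arrive out of order.
CREATE TABLE events
(
    event_id   UUID,
    status     LowCardinality(String),
    created_at DateTime,
    sign       Int8,    -- +1 = state row, -1 = cancels the matching earlier row
    version    UInt32   -- incremented on every state change of an event
)
ENGINE = VersionedCollapsingMergeTree(sign, version)
PARTITION BY toDate(created_at)  -- daily partitions, as the table originally had
ORDER BY event_id;
```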
ClickHouse is designed to insert data in "parts." A part is a physical file on a disk representing the data from an insert operation, typically containing many rows. ClickHouse tables can optionally be partitioned, meaning data is logically grouped based on a specified criterion or criteria. Within each partition, there can be multiple parts, and these parts are merged in the background into larger parts to optimize query performance. This merging process is managed by the MergeTree engine.
For queries to be efficient, they should access as few parts as possible. Therefore, the merging process must keep up with the incoming data rate. If ClickHouse's merge engine cannot merge parts faster than new parts are created, the number of parts grows, which can lead to performance degradation.
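One way to watch this pressure building is to count active parts per partition in the `system.parts` table (the table name below refers to the hypothetical schema above):

```sql
-- Active (not yet merged away) parts per partition for a single table.
-- A steadily growing count means merges are not keeping up with inserts.
SELECT
    partition,
    count()   AS active_parts,
    sum(rows) AS total_rows
FROM system.parts
WHERE database = currentDatabase()
  AND table = 'events'
  AND active
GROUP BY partition
ORDER BY active_parts DESC
LIMIT 20;
```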
In our case, the incident was triggered by ClickHouse's inability to keep up with the rate of incoming parts. As the number of parts continued to rise, it eventually reached the configured maximum for the table in question. Once that limit was reached, ClickHouse began dropping parts to protect the server from being overwhelmed, which caused the incident we observed. The rows contained in a dropped part were never inserted: a new event would not appear in the dashboard at all, and an updated event would continue to show its old state.
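The thresholds involved are MergeTree settings. We aren't reproducing our exact values here, but the server defaults (which individual tables can override in their SETTINGS clause) can be inspected like this:

```sql
-- Part-count thresholds: inserts are first delayed, then rejected outright
-- once a partition (or the table as a whole) accumulates too many parts.
SELECT name, value, changed
FROM system.merge_tree_settings
WHERE name IN ('parts_to_delay_insert', 'parts_to_throw_insert', 'max_parts_in_total');
```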
The incident started with some data missing but degraded over the course of the day. It's difficult to know exactly how many of those updates we were missing, but we estimate that at the peak we were missing about 50-60% of the data.
At that point, we attempted to restore the table from the same data that we continuously sink into Google Cloud Storage, but we ran into a data compatibility issue that we decided not to spend time resolving. Instead, we decided to snapshot the Postgres table through a Debezium connector, an approach we had used successfully before. We loaded the data into a Kafka topic; however, the snapshotting itself led to another incident.
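For context, a restore like this can be sketched with ClickHouse's `s3` table function, since GCS exposes an S3-compatible endpoint; the bucket path and credentials below are placeholders, and the exact mechanism we used isn't detailed here.

```sql
-- Hypothetical restore: read archived Avro files from a GCS bucket through
-- the s3 table function (using GCS's S3-compatible endpoint and HMAC keys)
-- and re-insert them into the events table.
INSERT INTO events
SELECT *
FROM s3(
    'https://storage.googleapis.com/<bucket>/events/*.avro',
    '<hmac_access_key_id>',
    '<hmac_secret>',
    'Avro'
);
```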
Some research and calls with the ClickHouse team led us to recreate the table partitioned by month instead of by day. On August 27th, 2024, at 20:30 UTC, we created a new events table with a monthly partition and began reloading the data. On August 30th at 04:41 UTC, the events table was fully reloaded and backfilled.
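In DDL terms, the change is only to the PARTITION BY expression; a sketch using the hypothetical schema from earlier:

```sql
-- Same table as before, but with one partition per month instead of per day.
CREATE TABLE events_v2
(
    event_id   UUID,
    status     LowCardinality(String),
    created_at DateTime,
    sign       Int8,
    version    UInt32
)
ENGINE = VersionedCollapsingMergeTree(sign, version)
PARTITION BY toYYYYMM(created_at)  -- monthly partitions
ORDER BY event_id;

-- Backfill from whatever source is available (old table, GCS archive, or the
-- Debezium snapshot topic), then cut ingestion and queries over to events_v2.
INSERT INTO events_v2 SELECT * FROM events;
```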
The part count was lower after the reload, but queries were still slow. On September 3rd, 2024, after researching and experimenting with different strategies, we tuned the Kafka connector to insert larger batches of data less frequently. This resolved our issues by bringing the part count down to a much lower level.
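The batching itself is tuned on the Kafka Connect side, typically through connector-level consumer overrides such as `consumer.override.fetch.min.bytes` and `consumer.override.fetch.max.wait.ms` (the exact settings and values are an assumption here, not a record of our configuration). With the `part_log` system table enabled on the server, a query like this shows whether such a change is working, i.e. fewer new parts per hour with more rows each:

```sql
-- New parts created per hour and their average size for the events table.
-- After increasing batch sizes, expect fewer parts with more rows per part.
SELECT
    toStartOfHour(event_time) AS hour,
    count()                   AS parts_created,
    round(avg(rows))          AS avg_rows_per_part
FROM system.part_log
WHERE event_type = 'NewPart'
  AND database = currentDatabase()
  AND table = 'events'
  AND event_time > now() - INTERVAL 1 DAY
GROUP BY hour
ORDER BY hour;
```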
August 24 2024 02:00 UTC: Outage detected & incident started
August 24 2024 02:00 UTC - August 27 2024 10:00 UTC: Data integrity degraded. Internal investigation pointed to data making it into other ClickHouse tables, but not the one queried for the dashboard
August 26 2024 07:49 UTC: Found errors related to parts being dropped in the target table; confirmed this as the root cause of our issues
August 27 2024 10:24 UTC: Engaged with ClickHouse support
August 27 2024 13:00 UTC: Attempted to reload table data from GCS
August 27 2024 20:30 UTC: Re-established new data streaming. Data from that date on showing as accurate. Old data still missing
August 27 2024 22:46 UTC: Switched to a snapshotting strategy due to data compatibility issues between the Avro files and ClickHouse
August 27 2024 19:45 UTC: Kicked off PostgreSQL events table snapshotting