Summary

On August 28th, 2024, Hookdeck experienced an incident that caused severely degraded performance. Delivery latency (the time between when Hookdeck receives a request and issues the first attempt, when there is no rate limit and the project is within its throughput limit) began increasing on August 28th at 09:54 UTC, reaching 50 seconds. Starting at 12:40 UTC, latency climbed further, peaking at about 1.5 hours on our normal priority queue.

Our engineering team observed a higher-than-normal CPU I/O wait on our database servers (60%). After trying to identify the root cause internally, we opened a ticket with our provider at 11:00 UTC, and after some back and forth on the RCA, we escalated the issue for further diagnostics at 18:57 UTC. After several hours of diagnostics, our provider confirmed that the issue was caused by their disk autoscaling feature, which kicked in 5 times over the course of August 28th (more on why it kicked in below). Our provider confirmed that increasing disk size dynamically adds a new disk (the original disk is then extended onto it using LVM), and that the cloud provider can't guarantee that the provisioned disk is of NVMe type; some of the disks that were added were SCSI disks. The combined effect of striping and slower disk types degraded disk I/O performance severely, to the point where it crippled our ability to process requests and events in our queues. Neither the technical implementation of the provider's auto-scaling feature nor the limitation that autoscaling may not provision an NVMe disk is documented by the provider. The provider's engineering and ops teams then worked on determining the best course of action.
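
For illustration, the mix of backing devices behind an LVM volume can be surfaced with standard Linux tooling. The sketch below is not the provider's or our diagnostics, just a general-purpose way to see whether a volume's underlying disks are NVMe or SCSI and whether they are rotational:

```python
# Rough sketch: list block devices with their transport type (nvme, sas, sata, ...)
# and rotational flag, to spot a logical volume striped across mixed disk types.
# Purely illustrative; not the provider's or Hookdeck's actual tooling.
import subprocess


def print_block_devices() -> None:
    result = subprocess.run(
        ["lsblk", "--output", "NAME,TYPE,TRAN,ROTA,SIZE"],
        capture_output=True,
        text=True,
        check=True,
    )
    print(result.stdout)


if __name__ == "__main__":
    print_block_devices()
```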

To mitigate some of the effects of the generalized database slowdown, we made the tough decision to stop processing the low-priority traffic queue on August 28th at 15:06 UTC; that is the queue that handles requests from projects exceeding their plan's throughput. We also suspected that the issue could be related to specific projects that we had identified as having especially high latency, so on August 28th at 13:30 UTC we diverted some of those sources to yet another queue (the manual queue) in an attempt to stabilize things. We re-enabled and disabled the low-priority queue throughout the day as load permitted (see timeline); it was disabled intermittently for about 25 hours, from August 28th at 09:10 UTC to August 29th at 10:10 UTC. As for the manual queue, once we realized that those workspaces weren't the cause of the issue, we routed its traffic back to our normal priority queue and processed its backlog, which was cleared on August 29th at 10:56 UTC.

Once we established the root cause with our provider and they determined the best course of action, we kicked off an online database restore to rebuild our database servers (master and failover replica) with a dedicated NVMe disk on August 29th at 00:55 UTC. That process was completed on August 29th at 14:56 UTC, taking ~14 hours.

By August 29th at 15:20 UTC, we had cleared our normal priority backlog and brought delivery latency back to baseline (2 seconds p99). We then proceeded to process our low-priority backlog and the backlog of the projects that had been diverted to our manual queue.

After the fact, we realized that we had already started seeing performance degradation with our provider on August 19th and had opened a ticket with them at 10:10 UTC. We also realized that the disk had dynamically scaled on August 17th at 13:47 UTC. The ticket was promptly acknowledged and actioned, but it did not lead to any conclusions on why we were experiencing I/O performance degradation.

The processing of the low-priority queue's backlog was in progress when, on August 30th at 00:37 UTC, we experienced another issue: our database ran out of space to store multixact member IDs. Because we were inserting data at a very high rate to catch up with our backlog, PostgreSQL auto-vacuuming fell behind, which led to a more aggressive auto-vacuum kicking in to prevent multixact ID wraparound. xid and mxid are internal PostgreSQL identifiers stored in 32-bit integers; since this is a finite space, they wrap around (the counter resets) every ~4 billion transactions. Auto-vacuuming releases old xids and mxids via a process called "freezing". PostgreSQL keeps track of the oldest xid and mxid so that when wraparound occurs (a normal part of database operation), the database is never left referencing new transactions from old ones. This is normally handled by routine auto-vacuuming, but if the database gets close to the limit, aggressive auto-vacuuming kicks in and some writes are denied, which is what happened in this instance. The write failures started on August 30th at 00:37 UTC and impacted about 80% of writes. As auto-vacuuming progressed, fewer and fewer writes failed, until the issue fully resolved 2 hours later at 02:37 UTC, when auto-vacuuming completed its run to freeze mxids. The low-priority backlog was fully processed on August 30th at 14:43 UTC.
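
As a rough illustration of the signal involved, the xid and mxid age of each database can be polled and compared against PostgreSQL's freeze thresholds (the default autovacuum_multixact_freeze_max_age is 400 million). This is a minimal sketch assuming psycopg2 and a PG_DSN environment variable; the warning threshold is illustrative, not our actual alerting:

```python
# Sketch of an xid/mxid age check. Assumes psycopg2 and a PG_DSN environment
# variable; the warning threshold is illustrative, not Hookdeck's real alerting.
import os

import psycopg2

# Aggressive (wraparound-prevention) autovacuum starts once a database's
# multixact age passes autovacuum_multixact_freeze_max_age (400M by default).
WARN_THRESHOLD = 300_000_000

QUERY = """
SELECT datname,
       age(datfrozenxid)    AS xid_age,
       mxid_age(datminmxid) AS mxid_age
FROM pg_database
ORDER BY mxid_age DESC;
"""


def main() -> None:
    with psycopg2.connect(os.environ["PG_DSN"]) as conn:
        with conn.cursor() as cur:
            cur.execute(QUERY)
            for datname, xid_age, mxid_age in cur.fetchall():
                status = "WARN" if max(xid_age, mxid_age) > WARN_THRESHOLD else "ok"
                print(f"{status:4} {datname}: xid_age={xid_age:,} mxid_age={mxid_age:,}")


if __name__ == "__main__":
    main()
```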

Cause of initial disk autoscaling

On August 23rd, we experienced an issue in which most events were either displaying an incorrect status or were missing. This issue's root cause analysis and details are documented in Missing events and incorrect events status in dashboard.

To resolve this issue, we had to snapshot a table over our Debezium Kafka connector, a process that involves creating a replication slot on PostgreSQL and pausing it until the data in the table has finished streaming to Kafka. This causes an accumulation of WAL files that require disk space, which led to our disks dynamically autoscaling. The operation was expected to trigger auto-scaling; what was not expected was the performance impact of the auto-scaling.
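
For reference, the WAL retained behind a replication slot (such as the one Debezium creates during a snapshot) can be watched while the snapshot runs. A minimal sketch assuming psycopg2 and PostgreSQL 10+; the PG_DSN environment variable is a placeholder, not our configuration:

```python
# Sketch: report how much WAL each replication slot is holding back.
# Assumes psycopg2 and a PG_DSN environment variable; illustrative only.
import os

import psycopg2

QUERY = """
SELECT slot_name,
       active,
       pg_size_pretty(
           pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
       ) AS retained_wal
FROM pg_replication_slots;
"""


def main() -> None:
    with psycopg2.connect(os.environ["PG_DSN"]) as conn:
        with conn.cursor() as cur:
            cur.execute(QUERY)
            for slot_name, active, retained_wal in cur.fetchall():
                print(f"{slot_name}: active={active} retained_wal={retained_wal}")


if __name__ == "__main__":
    main()
```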

Timeline

August 27 2024 19:45 UTC: Kicked off PostgreSQL events table snapshotting

August 28 2024 00:03 UTC: Database disk autoscaling occurred

August 28 2024 01:44 UTC: Database disk autoscaling occurred

August 28 2024 04:00 UTC: Database disk autoscaling occurred

August 28 2024 06:57 UTC: Database disk autoscaling occurred

August 28 2024 09:54 UTC: Disk I/O slowdown started occurring

August 28 2024 11:00 UTC: Opened ticket with provider and started RCA

August 28 2024 11:03 UTC: Database disk autoscaling occurred