Going to preface this whole thing with two notes:
- I'm not really a good writer, nor do I care that much. This is meant to be a creative outlet.
- This is not AI slop, so..
The other week I ran into a gnarly (for me) cascading failure while on call at my job. It started with an alert that a health check had failed for a scheduled ETL job that essentially syncs two tables in two separate databases. We look at the rows sent to us that we have no record of and create them, we look at the rows we do have record of and update them, and then we look at the rows we have that aren't in the file sent over to us and delete them.
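Roughly, the sync looks something like this. This is a hand-wavy sketch, not the real job; the helper methods (create_row, update_row, delete_row) and the row shapes are made up:

```ruby
# Hypothetical sketch of the sync. Both hashes are keyed by a shared
# external id: incoming_rows from their file, our_rows from our table.
incoming_ids = incoming_rows.keys
our_ids      = our_rows.keys

# Rows they sent that we have no record of: create them.
(incoming_ids - our_ids).each { |id| create_row(incoming_rows[id]) }

# Rows we both have: update ours if anything differs.
(incoming_ids & our_ids).each do |id|
  update_row(our_rows[id], incoming_rows[id]) if our_rows[id] != incoming_rows[id]
end

# Rows we have that aren't in their file: delete them.
(our_ids - incoming_ids).each { |id| delete_row(our_rows[id]) }
```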
What could go wrong?
The Initial Failure
I started off with logs and saw the job was running, but not to completion. I got nowhere for a few minutes, so I started looking through recent PRs and found a recent database migration that added some new columns. After watching the failing job's behavior, I pieced together that because of the new columns, the update operation now thought the diff of rows to update was.. the entire table. Their file has values for all of the columns, and we were missing values for a few of them in every row.
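To make that concrete, here's a toy version of the comparison with a made-up column:

```ruby
# Their row has a value for the newly added column; ours is nil until
# the backfill, so every single row in the table looks "changed".
theirs = { id: 1, name: "a", region: "us" } # region: new column
ours   = { id: 1, name: "a", region: nil }

theirs == ours # => false, for every row
```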
Next came a backfill to shrink the diff so that the job would run to completion, but the next problem had already been festering for 24 hours.
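The backfill itself was just filling in the new columns, something along these lines (assuming ActiveRecord; the model and column names are invented):

```ruby
# Hypothetical backfill: copy the new columns' values in from the file
# so the update diff shrinks back down to rows that actually changed.
incoming_rows.each do |id, row|
  Record.where(external_id: id).update_all(region: row[:region])
end
```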
The Cascade
The backfill was fine, but I noticed logs indicating this job was running way more often than expected. The job was actually running in a Kafka consumer, so looking at our queue I could clearly see more messages than expected on the main topic, some of them days old. I had a pretty tough time figuring out why they were there, and noticed that the retry queue was getting bigger and bigger as well. After some toil and grasping at straws, it finally dawned on me that the offsets just weren't committing. Finally, I had a thread to pull.
This turned out to be pretty easy to confirm by loosening a log search across all of our consumer pods and just looking for any error. Thankfully I also had a few error traces to start from, and I got to sifting through our libraries. The source code of the ruby-kafka gem gave clarity to the whole situation.
The job had a 5 second timeout, at which point the consumer would get kicked out of the consumer group, but that consumer was still running. Upon kicking the consumer out, the coordinator would also enqueue retries (5 of them lol!). Eventually the kicked-out consumer would finish and try to commit its offset, but the group coordinator no longer recognized that member, so the commit did nothing. At-least-once guarantees bit us in the ass. A new consumer would get added to the group, and rinse and repeat. As time went on these topics grew, and we kept trying to process various snapshots of this database out of order and in parallel, when the system was designed to run every 6 hours, in order.
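The shape of the problem, as a minimal ruby-kafka sketch (the broker, topic, group, and run_etl helper are all invented):

```ruby
require "kafka"

kafka    = Kafka.new(["broker1:9092"], client_id: "etl-consumer")
consumer = kafka.consumer(group_id: "etl-sync")
consumer.subscribe("table-snapshots")

consumer.each_message do |message|
  # Hypothetical long-running ETL; takes far longer than the timeout.
  run_etl(message.value)
  # By the time this block returns, the coordinator has already evicted
  # this member and rebalanced, so committing the offset does nothing.
  # A fresh consumer picks up the same message: at-least-once in action.
end
```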
The Fix
I moved the ETL to a Sidekiq worker with no retries, we reset the offsets, and all was fine. Thankfully, this ETL ensures that our DB is synced with the other party's as long as it runs to completion. There was also a gnarly query that was hanging, which was a fun separate problem to solve. Funny enough, an engineer I have a lot of respect for wrote it. Even funnier, I approved it. Perhaps I had too much respect.
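The fix, sketched out (the worker name and arguments are made up):

```ruby
require "sidekiq"

class EtlSyncWorker
  include Sidekiq::Worker
  sidekiq_options retry: false # rerunning out of order is worse than waiting

  def perform(snapshot_id)
    run_etl(snapshot_id) # hypothetical; can take as long as it needs
  end
end

# The consumer now just hands off work, so the block returns in
# milliseconds and offsets commit like they're supposed to.
consumer.each_message do |message|
  EtlSyncWorker.perform_async(message.value)
end
```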
Takeaways
- The ruby-kafka gem we used is no longer maintained, and quite old. The decisions it makes about message delivery guarantees should probably be surfaced in our internal libraries.. maybe. Idk.. not everyone understands the consequences of these kinds of decisions.
- You just should not do ETL within a consumer process. If I'm not mistaken, Kafka is typically used for high-throughput, low-latency message streaming, and I don't think I'm that far off here. I ultimately deferred the ETL to a Sidekiq worker, which is a much better way to handle non-blocking, long-running work that you want to guarantee runs to completion.
- It really felt like the debt collector came knocking. A lot of smelly shit surfaced all at once. In my experience, engineers really like to talk about debt, but the accumulated debt is seldom considered high-interest, and it seldom bites you in the ass. Maybe that's a new concept.. "what is the interest rate of this debt?"