The Worst .NET Bug I've Ever Fixed

A 70+ hour journey into fixing a ~10-year-old Akka.NET bug.

July 2024 was pretty rough for me and the Petabridge crew from a workload standpoint. Our office got hit with a hurricane, and my entire family and I were out sick with COVID in the middle of it. Fun times.

But the worst of it was a pair of truly nasty Akka.NET issues that both turned out to be the same bug: duplicate shards and duplicate cluster singletons.

I personally spent 70+ hours investigating and eventually solving this bug, starting on June 27th.

If you’re an Akka.NET user and don’t want to read any further, just upgrade to Akka.NET v1.5.27 or later - the bugs are fixed in those versions of the software. Every earlier version, dating back to when Akka.Cluster.Tools was first introduced in 2015, has this bug. Please upgrade now.

Otherwise, read on.

Split Brains

The real kicker was the first issue: duplicate shards. That’s a severe bug for Akka.NET users because it violates the most important data consistency principle we have for using actors to manage stateful applications with Akka.Cluster.Sharding: all state for the same entity must be owned by a single instance of an actor!

Well, if we have two instances of that actor - we now have a split brain, which means two competing, concurrent versions of the truth for what should be a single source of truth!

I’ve made a video, “Split Brains Explained”, that goes into some detail on the nature of split brains, the harm they cause, and how Akka.NET natively deals with them for things like cluster formation - so watch that if you want more background on the spirit of the problem in the first place.

For this story, all you need to know is that having two competing versions of the truth can be pretty disruptive, destructive, and even potentially dangerous for something that was never designed to have more than a single “truth.”

Imagine an IoT system where your entity actors are processing MQTT events - each of these events has identifiers pertaining to a single, physical device. You want a single actor to model this device as a “digital twin” and raise alerts / alarms when that device observes something outside of its thresholds. Well, if we have two competing copies of that digital twin - false alarms might be raised non-stop OR a real alarm might never go off!
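
To make that concrete, here’s a minimal sketch of what such a digital-twin entity actor could look like in Akka.NET. The DeviceReading / DeviceAlarm message types and the threshold are hypothetical, purely for illustration:

```csharp
using Akka.Actor;

// Hypothetical reading reported by a physical device (names are illustrative).
public sealed record DeviceReading(string DeviceId, double Temperature);
public sealed record DeviceAlarm(string DeviceId, double Temperature);

// Digital twin for a single device - meant to exist exactly once per DeviceId.
public sealed class DeviceTwin : ReceiveActor
{
    private const double AlarmThreshold = 85.0; // illustrative threshold

    public DeviceTwin()
    {
        Receive<DeviceReading>(reading =>
        {
            if (reading.Temperature > AlarmThreshold)
            {
                // If a duplicate twin exists, this alarm logic runs in two places
                // with two different views of the device's history.
                Context.System.EventStream.Publish(new DeviceAlarm(reading.DeviceId, reading.Temperature));
            }
        });
    }
}
```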

That’s why it was so critical to fix this issue - data consistency matters.

Duplicate Shards and Duplicate Cluster Singletons

I shot and edited a 90-minute video detailing our investigation into resolving the bug itself, “The Hardest .NET Bug I’ve Ever Fixed” - it’s very long and very detailed, and hopefully mildly entertaining and educational.

However, a quick summary of what we found and how we found it:

  • For an issue of this nature, F5 step-through debugging, unit testing, and integration testing are all automatic losers - even with Akka.NET’s multi-node testing runner (MNTR) capabilities. We needed a much more automated and observable approach in order to have any hope of successfully reproducing this issue AND understanding how the problem occurs.
  • Automated smoke testing using Docker + Kubernetes with full OpenTelemetry tracing, metrics, log aggregation, and alerts would give us the best chance to find this bug.
  • We designed a shared test reproduction library with a standardized Akka.Hosting configuration and full Phobos telemetry for Akka.NET enabled. This allowed us to parameterize multiple versions of our reproduction experiments running in parallel, observe each of them independently, and use Phobos’ detailed configuration to carefully tune out the noise in the tracing system so we could make sense of what was happening without sifting through reams of irrelevant data. (A rough sketch of this kind of setup appears after this list.)
  • We wrote custom telemetry that allowed a single actor in the cluster to aggregate all of the actor starts / stops occurring in every other Akka.NET process and flag any duplicates that lived for longer than ~10 seconds or so. We created alerts around these detection signals so a reproduction sample running in our test lab environment could proactively tell us “hey, there’s something to look at here.” We also used Petabridge.Cmd as a second method of verifying the reproduction, so we could be certain we weren’t wasting our time chasing ghosts. (A simplified sketch of this duplicate detector also appears after the list.)
  • After a couple of weeks of running process-of-elimination on theories that weren’t ultimately correct, we concluded that the duplicate sharding bug is most likely caused by the ShardCoordinator, an Akka.Cluster.Tools.Singleton actor, running as two parallel instances - specifically, this could only occur while one instance was terminating and another was being promoted.
  • We re-designed our experiment to include a “cluster shuffler” - designed to force the ShardCoordinator to move from node to node until we started observing duplicates, at which point the shuffling would stop. If all of the duplicates got cleaned up as a result of node shutdowns or actor terminations, the shuffler would resume. We also, and this was a key detail, designed our experiment such that whenever a node began exiting (always the ShardCoordinator node) every other node in the cluster would begin aggressively messaging the sharding system, with the goal of producing duplicate ShardHomeAllocation instances.
  • We were able to successfully produce stable duplicates in our test lab environment on the first attempt after these changes were made.
  • Using the traces we captured from Phobos, we verified that the problem was definitely two instances of the ShardCoordinator running concurrently, each performing their own separate ShardHomeAllocations for the same Shard independently from each other.
  • Using even more tracing data from Phobos, we were able to isolate the problem to the ClusterSingletonManager’s Younger --> BecomingOldest transition code. Skip to 1:26:57 in “The Hardest .NET Bug I’ve Ever Fixed” for details.
  • Old code that’s been in place since 2015 was ultimately responsible for this bug - the ClusterSingletonManager isn’t hardened against cases where previous “oldest” nodes in the cluster are rapidly restarted with the same hostname and port as their previous incarnation. There’s a 20s configurable delay between being signaled that the oldest node has been MemberRemoved from the cluster and actually removing it from the ClusterSingletonManager’s “who is oldest?” list. This caused the wrong node to get messaged during the hand-off (a brand new, younger node, rather than the one currently hosting the singleton) and could result in duplicates if the restart process took less than 20 seconds. This problem was likely to occur in environments where rolling restarts are easily and efficiently automated - Kubernetes being a good example.
  • We adjusted the ClusterSingletonManager to trust the data sent to it during the OldestMemberChanged event, which now includes the address of the most recent oldest node, rather than whatever is at the front of its “oldest members” list.
  • We tested about 7000-8000 ShardCoordinator movements in our smoke testing environment and were no longer able to reproduce the bug - normally we could reproduce it after 10-12 movements.
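
For reference, here’s a rough sketch of the kind of shared Akka.Hosting setup a reproduction node might use (mentioned in the list above). The Phobos and OpenTelemetry wiring is omitted, the exact option shapes can vary by Akka.Hosting version, and the DeviceTwin / DeviceMessageExtractor types are the illustrative ones from earlier - not our actual test code:

```csharp
using Akka.Actor;
using Akka.Cluster.Hosting;
using Akka.Cluster.Sharding;
using Akka.Hosting;
using Microsoft.Extensions.Hosting;

var host = Host.CreateDefaultBuilder(args)
    .ConfigureServices(services =>
    {
        services.AddAkka("repro-cluster", builder =>
        {
            builder
                .WithRemoting("0.0.0.0", 4053)
                .WithClustering(new ClusterOptions { Roles = new[] { "repro" } })
                // One shard region per node; DeviceTwin is the entity actor sketched earlier.
                .WithShardRegion<DeviceTwin>(
                    "device",
                    entityId => Props.Create(() => new DeviceTwin()),
                    new DeviceMessageExtractor(),
                    new ShardOptions { Role = "repro" });
            // Phobos / OpenTelemetry tracing, metrics, and log export would be wired up here.
        });
    })
    .Build();

await host.RunAsync();

// Routes messages to entities by device id; 50 shards is an arbitrary choice.
public sealed class DeviceMessageExtractor : HashCodeMessageExtractor
{
    public DeviceMessageExtractor() : base(50) { }

    public override string EntityId(object message)
        => message is DeviceReading reading ? reading.DeviceId : null;
}
```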

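And here’s a simplified sketch of the duplicate-detection idea described in the list - the EntityStarted / EntityStopped notifications are hypothetical stand-ins for the custom telemetry we actually emitted, and the real version raised alerts rather than just logging:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Akka.Actor;
using Akka.Event;

// Hypothetical notifications sent to the aggregator whenever an entity actor starts or stops on any node.
public sealed record EntityStarted(string EntityId, Address Node, DateTime StartedAtUtc);
public sealed record EntityStopped(string EntityId, Address Node);

// Single aggregator actor: flags entities alive on more than one node for longer
// than ~10 seconds, which should never happen under Akka.Cluster.Sharding.
public sealed class DuplicateDetector : ReceiveActor, IWithTimers
{
    private static readonly TimeSpan GracePeriod = TimeSpan.FromSeconds(10);
    private readonly Dictionary<string, Dictionary<Address, DateTime>> _live = new();
    private readonly ILoggingAdapter _log = Context.GetLogger();

    public ITimerScheduler Timers { get; set; }

    public DuplicateDetector()
    {
        Receive<EntityStarted>(e =>
        {
            if (!_live.TryGetValue(e.EntityId, out var nodes))
                _live[e.EntityId] = nodes = new Dictionary<Address, DateTime>();
            nodes[e.Node] = e.StartedAtUtc;
        });

        Receive<EntityStopped>(e =>
        {
            if (_live.TryGetValue(e.EntityId, out var nodes))
            {
                nodes.Remove(e.Node);
                if (nodes.Count == 0) _live.Remove(e.EntityId);
            }
        });

        Receive<CheckDuplicates>(_ =>
        {
            var now = DateTime.UtcNow;
            foreach (var (entityId, nodes) in _live.Where(kvp => kvp.Value.Count > 1))
            {
                // Only alert once every copy has outlived the grace period, so
                // normal hand-off churn doesn't produce false positives.
                if (nodes.Values.All(startedAt => now - startedAt > GracePeriod))
                    _log.Error("Duplicate entity detected: {0} is running on [{1}]",
                        entityId, string.Join(", ", nodes.Keys));
            }
        });
    }

    protected override void PreStart()
        => Timers.StartPeriodicTimer("check-duplicates", new CheckDuplicates(), GracePeriod);

    private sealed class CheckDuplicates { }
}
```
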
Conclusions

The biggest takeaway for me, from all of this, is how valuable smoke testing can be - it’s “set it and forget it,” like using a slow cooker instead of cooking on the stove. I don’t have to stand there and constantly manage it - it runs itself, tells me when there’s a problem, and lets me use the extensive telemetry data we gathered to understand what happened.

Phobos’ ability to trace /system actors - which is configurable and not enabled by default - broke this case open for us and made it possible to find this ancient bug. I strongly recommend you buy a license if you’re using Akka.NET in any kind of serious capacity.

And don’t forget - upgrade to Akka.NET v1.5.27 or later!

Written by Aaron Stannard on August 1, 2024