Real World Akka.NET Clustering: State Machines

Using Akka.Cluster and State Machines to Build Reactive Systems

Petabridge recently launched its first cloud service, Sdkbin, and it’s been a great opportunity for our team to use the Akka.NET tools, patterns, and practices that we’ve been recommending to our customers for years.

In this blog post I’m going to walk through one of the most universal real-world Akka.NET actor patterns: the state machine.

Finite State Machines: a Primer

One of the major drawbacks of web programming is that due to its stateless and non-local affinity, many programming patterns that were commonplace in embedded, desktop, and client programming have lost their importance in everyday programming. These “lost patterns” tend to be stateful ones - and none more so than the “finite state machine” or “FSM” for shorthand.

Finite State Machine concept

The idea behind a finite state machine is simple but powerful:

  1. For a given type of entity modeled in your software system it can exist in one of many possible states - but it can only occupy a single state at a time;
  2. An entity usually has some “state data” objects that provide contextual information for the current unit of work in process - i.e. an Order entity might be in an Open state and have an OrderData object that contains a unique orderId, customerId, and so on;
  3. As the state machine processes events both its state and its state data can change - in particular, the state machine can transition from one state to another in response to 1 or more events being processed; and
  4. States can be re-entrant - an entity can return to its original state in the event of a timeout, error, or based on the type of event it processed.

When Do You Need a State Machine?

If you can model an operation using an atomic transaction, you probably do not need a state machine.

State machines are most useful when you need more than one discrete event to arrive in order to invoke some kind of reaction in your system.

A good example from Sdkbin: triggering a series of “onboarding” email notifications once a new organization gets created.

Sdkbin customer onboarding state machine

In this state machine, we need to do the following:

  1. Notify the creator of the organization that they can and should invite some members of their team to their organization on Sdkbin;
  2. Repeatedly remind the organization owner every 5 days that they need to invite users, up until some termination point (i.e. after 3 attempts;)
  3. If the owner does invite some users to their organization, then we need to remind each individual user to accept those invitations once every 5 days; and
  4. Once all of the users accept or decline their invitations, we the state machine transitions to a “complete” state where it can be shut down.

This type of design is very easy to model using Akka.NET actors, thanks to switchable actor behaviors:

NOTE: we’re just going to show a concise version of this code for the sake of brevity.

Message Protocol (partial)

public static class OrganizationProtocol{
    public sealed class SendOnboardingEmail{
        public static readonly SendOnboardingEmail Instance = new SendOnboardingEmail();
        private SendOnboardingEmail(){}
    }

    public sealed class SendReminders{
        public SendReminders(int count = 0)
        {
            ReminderCount = count;
        }
        public int ReminderCount {get;}

        public SendReminders NextTry(int increment = 1){
            return new SendReminders(ReminderCount + increment);
        }
    }

    public sealed class Timeout{
        public static readonly Timeout Instance = new Timeout();
        private Timeout(){}
    }

    public sealed class MemberInvited{
        public MemberInvited(string emailAddress){
            EmailAddress = emailAddress;
        }

        public string EmailAddress { get; }
    }
}

Actor State Implementation

// Making data structure mutable since it's internal-only
public sealed class OrganizationOnboardState{
    public string OrganizationId { get; set; }

    public int NumOfMembers {get; set; }

    public ImmutableHashSet<int> NotificationsSent { get; set; }

    public ImmutableHashSet<string> UsersInvited { get; set; }
}

Actor Implementation

public sealed class OrganizationReminderActor : ReceiveActor, IWithUnboundedStash, IWithTimers
{
    public OrganizationOnboardState State { get; }
    private int _reminderCount = 0;

    public IStash Stash { get;set; }
    public ITimerScheduler Timers { get; set; }

    public OrganizationReminderActor(string organizationId)
    {
        State = new OrganizationOnboardState{ OrganizationId = organizationId, NumOfMembers = 1 }
    }        

    public void OrganizationCreated(){
        Receive<SendOnboardingEmail>(_ => {
            // invoke method to send prompt to organization owner
        });

        // buffer any messages we can't currently process in the stash
        ReceiveAny(_ => Stash.Stash()); 
    }

    public void TransitionToInviteMembersSent(){
        State.NotificationsSent.Add(1); // add the id of the first email campaign
        Stash.UnstashAll(); // dump messages back into mailbox for FIFO processing
        Become(InviteMembersSent); // change behavior

        // time out this stage after 5 days
        Timers.StartSingleTimer("invite-timeout", Timeout.Instance, TimeSpan.FromDays(5.0d));
        Self.Tell(new SendReminders(_reminderCount));
    }

    public void InviteMembersSent(){
        Receive<SendReminders>(s => {
            // select and send campaign based on reminder count
        });

        Receive<MemberInvited>(m => {
            State.UsersInvited.Add(m.EmailAddress);
            // change state to "MembersInvited"
        });

        Receive<Timeout>(_ => {
            TransitionToInviteMembersSent(); // resend everything
        });
    }
}

It’s easy to model a finite state machine using a ReceiveActor, a PersistentReceiveActor, and finally we also have a formal FSM actor type defined in Akka.NET as well. But the real-world power of state machines occurs when you need to model a large number of concurrent entities simultaneously - this is where clustering, Cluster.Sharding, and FSMs come together.

State Machines at Scale

We were able to use a finite state machine to take a complex situation, successfully onboarding a new organization onto Sdkbin, and model it into a comprehensible unit of work via actors. That in itself is a win - but what about what we have to deal with in the real-world: onboarding an arbitrarily large number of organizations all at once and in different stages?

Finite state machines will get the job done there too - and in this case, we’ll accomplish it by using a common architectural pattern in Akka.NET:

  1. Using Akka.Cluster.Sharding to allocate a single state machine per-organization;
  2. Persisting state machine state using Akka.Persistence to a durable datastore, such as SQL Server or Azure Table Storage; and
  3. Routing all of our external event notifications (i.e. invites being accepted) back through Akka.Cluster.Sharding.

Akka.NET state machines distributed via Akka.Cluster.Sharding

What does this architecture help us do?

  1. It takes a single, programmable model (the state machine) for managing a complex series of interactions and makes it reproducible for an any number of concurrent instances;
  2. Through Akka.Cluster.Sharding, it evenly distributes this workload across all of the nodes in our cluster - even if the cluster topology changes;
  3. Akka.Cluster.Sharding guarantees that there will be at most one copy of a given entity (and its state machine) throughout the cluster, even on nodes that we currently cannot reach as a result of a network partition, so we can rest assured that we have a single source of truth for each entity; and
  4. Akka.Persistence gives us a model for being able to transfer and recover state in the event of a change in cluster topology, process crash, or re-balancing of the shards.

What this architecture does not currently do:

  1. Does not guarantee delivery of all messages - there is still a chance that some in-flight messages can be lost;
  2. Does not persist the scheduled timeouts that our state machine actors use - we wouldn’t want to keep all of these entity actors alive for five days while they sat there and did nothing; we’d probably want to use a durable scheduler such as Akka.Quartz.Actor to “wake up” these entity actors after an idle period of five days. That way we can keep our memory footprint limited to just the actors who are doing live work right now, rather than all of the possible actors who could be alive.

State machines are a natural way to model all sorts of complex, concurrent interactions in an easily testable, understandable, and scalable manner. I hope you’ve found this helpful and please leave a comment if you have any questions!

If you liked this post, you can share it with your followers or follow us on Twitter!
Written by Aaron Stannard on November 2, 2020

 

 

Observe and Monitor Your Akka.NET Applications with Phobos

Did you know that Phobos can automatically instrument your Akka.NET applications with OpenTelemetry?

Click here to learn more.