Akka.NET Internals: How Akka.Remote Connections Work

Today, we’re going to begin introducing the world to the internals of Akka.NET. These are the advanced technical topics and explain what is actually going on “under the hood” to make the magic happen.

We begin our exploration with Akka.Remote.

What Is Akka.remote?

Akka.Remote is one of the most important modules in the entire framework. Akka.Remote is the module which actually enables a big percentage of the awesome feature of Akka.NET, such as:

location transparency (not having to care which process—local or remote—an actor lives in)
ability to scale out via configuration, instead of code
Akka.Cluster and highly-available clustered systems (built on top of Akka.Cluster)
ability to remotely deploy actors from one machine to another

In short: understanding Akka.Remote is a powerful tool to have at your disposal for building distributed systems in Akka.NET.

Akka.NET Remote Connections Explained

The most important piece to understand about Akka.Remote is how remote connections work, and what is the topology of actors responsible for managing remote connections.

So, we created our first teaching video to explain how remote connections work! We hope you like it. The video covers all of the below, and more:

Akka.Remote transports
TCP and UDP remote connection
How to enable Akka.Remote via Configuration
Key terminology
Connecting to Remote Systems
Akka Handshake Protocol
Why do we bother?
Remoting System Actors
Writing/Reading with Endpoints
Association Lifecycle

Have a watch!

Can’t see the video? Click here to watch it on YouTube.

The full video transcript is below.

The slides from the video are available here.

Want to Go Deep on Akka.remote?

Want to really learn how Akka.Remote works, and how you can fully utilize its power to build resilient and scalable systems?

Come take our Akka.Remote advanced training course, which will go into much more depth and teach you how the internals really work, and how to put this all into practice in your systems.

The next Akka.Remote training session is in just a few weeks. We’ll see you there!

Have Questions?

Curios about something you saw in the video? Something unclear? Post your question or comment below and we’ll get you an answer.

Transcript

Hello everyone, my name is Aaron Stannard, I’m the Chief Technology Officer of Petabridge. A company that provides training support and consulting for our Akka.NET and today this is gonna be one of the first to our internals videos, we explains how some of the components built into our Akka.NET actually work under the hood and we’re gonna talk about Akka.Remote how it actually establish its connections to remote ActionSystems. So for those you aren’t familiar, Akka.Remote is the module that powers all the networking capabilities in the Akka.Cluster and lot of the other modules actually use.

[0:30] So understanding our Akka.Remote works is pretty fundamental to building distributed systems of top of it. So we’re gonna start by taking a look at some configuration options for two systems were gonna have talk to each other, so here’s a sample configuration. We have two systems, system one and system two and we’re going have them use Akka.Remote to communicate to each other. So, the first thing we need to do to enable remote communication is make sure that both systems set their actor reference provider to be the Akka.Remote

[1:00] remote actor ref provider. This action ref provider wanna get initialized what boots up, all the remoting capabilities we actually need. Support the actor systems to have this enable in order to communicate. Next, were gonna go to open the remoting section in this configuration file and were gonna specify some options for the Helios.TCP transport. So Helios is a reactive socket programming library that Akka.Remote references and enables both TCP and UDP communication.

[1:30] In this example we’re going to enable the Helios TCP transport and system one will connect a TCP socket on port 8080 on local host and system two will bind into a socket on port 8081 on local host. Talking about how sockets workers a little beyond the scope of this video but we’ll do some Helios internals videos in the future too. So start by taking a look at some other terminology that Akka.Remote uses internally, so first is the concept of a transport. A transport in the context

[2:00] of Akka.Remote is a literal network transport so it’ll use TCP connection is the default transport that most applications use, however, you can have multiple active transport so you can have a user-defined transport that uses a protocol it’s not supported by Helios or you could even have a Helios TCP and a Helios UTP transport active at the same time. But the one important distinction to make is you can only have one instance in a TCP transport open anytime.

[2:30] Next we have the concept of an Endpoint and Endpoint is a specific address binding for a transport so we bound ActorSystem one to local host port 8080 that created inbound Endpoint for that TCP transport, and in order for Endpoint to exists, we need to have a valid transport in order to that. Next we have the concept of Association. You think when Association as a connection between two Endpoints, one Endpoint

[3:00] on one ActiveSystem and another Endpoint on a second ActorSystem. And these actor, these ActorSystem associations depend on having a valid inbound Endpoint and a valid outbound Endpoint. We’ll talk a little bit more about what those terms mean a second and finally we have the notion of an association handle. Association handle is a service provider interface between the network transport and the actors in your ActorSystems actually receive those messages on the network so the Association handle is used by the systems actors

[3:30] built in Akka.Remote, your actors that you define as an end-user don’t see the Association handle or really any of these components at all outside up the configuration that were created earlier. So take a look at what the initial state have an ActorSystem might look like, so the initial state of our ActorSystems we have two independent systems that each occupy their own process, they might boot up around roughly the same time and the first thing we gonna do once the ActorSystem initializes

[4:00] its actor reference provider is there gonna open all their configured transports on their … reports is the reopening inbound TCP endpoint and port 8080 for ActorSystem one and open inbound TCP Endpoint on port 8081 for ActorSystem two and this transport will began listening for incoming Associations. So this point, neither ActorSystems connected to each other but they both have the ability to receive inbound communication attempts from one ActorSystems

[4:30] or the other. The cycle bit about how we actually connect to ActorSystem. So we want to connect ActorSystem we have either send a message or try to deploy an actor to the remote ActorSystem we want to connect to and we do that by using a fully articulated actor path just like in the example on the slide. So we’re all familiar path part on the right, they represent actors position on the hierarchy. So let’s talk about the left hand

[5:00] side of the actor path. So we have the name with the ActorSystem which is pretty obvious since its records but was less obvious is what happens to the Akka protocol on the set enabling transport. So if we turn on the Helios TCP transport and we want to communicate to a remote system that also using TCP, we add .TCP extension to the Akka protocol the very point. So all those actors can be addressed to via the protocol

[5:30] Akka.TCP if we enable UDP that might be Akka.UDP and then we have the address which follows the name of the ActorSystems, the address represents the inbound Endpoint to which that ActorSystem is listening to its protocol. So for trying to connect ActorSystem to a full actor path might look like Akka.TCP://MySystems@localhost:8081/users/actorName1.

[6:00] So let’s talk about the distinction between inbound and outbound Endpoints, so system 1 attempts to send a message to system 2. System 1 will create an outbound endpoint and system 2 will receive this message in port 8081 and will receive it on its inbound Endpoint, so the importance between these two distinctions is that an outbound Endpoint begins in a different state in the Association process than inbound Endpoint does because the inbound Endpoint

[6:30] already knows that the other Endpoint and the other another connection is there and listening so that is something that would have the bear in mind we start talking about the Akka protocol a little bit. Let’s go on walk through the Association process, so system 1 and system 2 are both up running, they have their TCB transport bound to 8080 and 8081 respectively and we need to go on creating outbound endpoint from one of these systems to the other. So system 1 is going did attempt to send message to

[7:00] system 2 over TCP on local host 8081 and in order to do this, we need to create a new outbound Endpoint. So what will actually under the hood with Helios TCP transport as you open a new TCP connection on a random Freeport number the operating system provide. So we gonna open up a new TCP connection and we send a handshake message over to system 2 was the TCP connection is open on the other side system to receive the handshake message

[7:30] from local host random is that random is the number of the outbound port on the outbound socket and we will accept this inbound Association and will send an Associate message back to system 1 on the outbound Endpoint and a connection will established at this point. So let’s walk through the details in the Akka handshake protocol a little bit more detail, so on either side of the handshake protocol we have two state machines and these are modeled using what’s called the protocol state

[8:00] actor on Akka.NET and this part of the system actors that are invisible to use an end-user. So we have is two state machines and system 1 during this process we just walked through, well user-defined actor attempt to send message to system 2 over TCP, says the first step in creating a handshake is someone attempts to send a message to some ActorSystem. So system 1 will check to see if there’s an available Endpoint a system 2 already we have a open Association

[8:30] begin already to communicate with the system 2 and we don’t need to go to the handshake process. So since no Endpoint available we need to create one, so we’re gonna create a brand new protocol state actor and we’re gonna put it into the closed state on the outbound side connection. So once the actual transporters was able to open a socket connections to stand to we’re going to shift from being in the closed state into the weak handshake state on system 1 and will send an associate message

[9:00] over the network to system 2. System 2 already has a socket connection open and waiting but has in creating a protocol state actor yet. The endpoint manager and actor we’re gonna talk a lot receives the associate message and checks to see if it’s quarantine in system 1. We’ll talk about a little bit more about what means in a minute but it has to quarantine the node, we can accept the associate attempts. So system 2 will create its protocol state actor initially a weak handshake state

[9:30] that has already knows that system 1 has an outbound protocol state actor that exist, pace up the fact this association message was sent from it, so when we check to see if the quarantine exists and if it doesn’t then the inbounds either the connection can shift into an open state which mean that its able to receive and send messages from user-defined actors to system 1. So it’ll send an associate message back to system 1 in order to complete the handshake. System 1 will check to see ever has in quarantine

[10:00] system 2 and quarantine it can happen if something goes wrong during this handshake process but if there hasn’t been issue it will accept the associate message from system 2 and it’ll complete the handshake process. And now the connection open on both sides of the protocol. So why do we bother with this handshake process. There is a two part explanation to this, first severe familiar with the TCP protocol, you know that there’s already a handshake put into that. So why do we have the

[10:30] need to this again and reason is because we need to guarantee connection-oriented behavior over anytime to transport including connectionless ones like the UDP protocol and the reason why we need a connection-oriented behavior is because actor references depend on being able to guarantee that they existed as some point in the past in order to be created. So the only way we can really do that reliably is with connection-oriented behavior, we’re able to tell any

[11:00] given time that both and the connection are active in receiving messages. So the handshake protocol is able to establish the other random both Association really does exist. Then we have a system of a heartbeats they sent back and forth over the network after the association is made as able to guarantee that connection is still open. We’ve been on systems that part connection-oriented by QTP. So that’s why, we’re not gonna talk very much about the heartbeat system in this particular video

[11:30] but we will address it at some point in the future. Okay, so let’s talk about the remoting system actors that powers that Akka.Remote, so first we have a system Guardian, you may notice that every time you create a user-defined actor whatever the applications the path begins with slash as user that’s because all your top level user-defined actors are actually children of the user Guardian. Well, all your top level were a system actors that built in are the children of the system Guardian,

[12:00] same idea but these are actors built in our Akka.NET. So the first remoting system actors want talk about the transport supervisor. These actors responsible for supervising the underlying that network transport and restarting it in the … a critical failure there and the transport supervisor creates an Akka protocol manager. The Akka protocol manager’s job is to create protocol state actors for every Endpoint that has attempted to be created on top of that transport. So every

[12:30] inbound and outbound Endpoint there exist one protocol state actor for the transport and the Akka protocol manager’s name includes the scheme on the underlying transport. So if we were to have a TCP transport and the Akka protocol manager will be called AkkaProtocolManager.TCP and then the UID. UID represents the idea of transport if in the event that it needs to restart so every time that your ID changes affected we’re connecting to either a new … with the

[13:00] transport and that factors into somebody decisions that are made by the endpoint manager the next … will talk about. The Endpoint manager is effectively the most important actor I Akka.Remote. And we’re gonna spend the next couple slides going more detail on what it does but the important thing you need to know was Endpoint managers responsible for the creation of Endpoint. So to recreate outbound Endpoint, the first actor we create is what’s called a reliable Endpoint writer. This is an actor that has

[13:30] some built-in behavior for being able to guarantee deliver of the messages over the network. So its responsible for making sure that all the messages needed to create an Association actually do arrive at their destination with some moderate amount of effort retry first. Then there is Endpoint writer. This is the actor that actually represents the ride a bull and an Endpoint, so any messages you want flushed onto the network are delivered via an Endpoint writer. And then there is an Endpoint reader which as you might have guessed reads

[14:00] messages from the network, so both endpoint writer and Endpoint reader communicate directly with the association handle that allows us to link directed to the transport through talk a little bit more about all these actors next. We also have the RemoteDaemon. The RemoteDaemon is responsible for the remote deployment actors, so ActorSystem 1 actually has the ability to serialized props for an actor and deploy ActorSystem 2, as long as ActorSystem 2 were actually implements the class

[14:30] that is specified a part about deployment. So the remote game is responsible for creating remote actor references and for killing them in the event that the network disassociates. So the RemoteDaemon is pretty specialize actor, we’re not gonna talk about much more this, this presentation but we might do a subsequent video on it for people who are really interested. The next we have is the RemoteWatcher which take the Deathwatch capabilities that are well defined in Akka and extends them so they can work across the network. So a

[15:00] actor on ActorSystem 1 can perceive Deathwatch notifications for an actor on ActorSystem 2 and in the event that the network goes down it will receive a Deathwatch notification. So okay, so as far as we’re concerned that actors dead, we can’t guarantee its existence anymore. So that’s where those two actors do. So we’re gonna talk the Endpoint manager a little bit more detail now. The Endpoint manager has a lot of responsibilities. The first is that is responsible for starting all the underlying transports,

[15:30] so even though the transport supervisors supervise them from that point onward at the very beginning of the ActorSystem boot process, one that Akka.remote is enabled the employment under the actor actually starts all those transports and binds them to their sockets and sets up the initial inbound Endpoint. The endpoint managers also responsible for applying policies in each Endpoint. So if we have a ActorSystem then downed at some point in the past,

[16:00] we may not want to connect to it again because there might be something that’s wrong with that machine or that process and we might require that system reboot before we can reconnect with. We’ll talk a little bit more about policies on the next line and endpoint manager also the first responder to all inbound Association attempts, so anytime someone attempts establish a connection to any given ActorSystem the endpoint manager makes a decision based of its policies whether we’re going to allow the Association attempt or not.

[16:30] And once the Association attempt is allowed through hands out to the Akka protocol manager. The employment manager is also responsible for creating the Endpoint writer and a reliable delivery supervisor and as we recall from a previous slide. The endpoint writers what’s actually creates the Endpoint reader to. So the initial right side of the transport … the initial right side up the Endpoint which is what we need to complete the handshake process is actually created by this actor,

[17:00] so the Endpoint manager is also ultimately responsible for supervising these actors. So in the event that an Endpoint, this Associates it’s also the endpoint manager’s job to figure out what to do about it. So that the Endpoint manager’s responsibilities in the nutshell. Let’s talk a little bit about its policies, so the first ball see wanna be one is the policy of pass which means that we’re Okay. We can connect to this transport, we can accept disassociation from this remote system.

[17:30] Next policy is a gated policy and this happens when a node shuts down so its usually a plan determination at least to some extent means that there is a clean shutdown attempt and we will block all connections to reconnect with its node to or retry window elapses. So that retry windows can be configured but … five seconds by default. So this is to prevent us from trying rapidly reconnect to a system we know won’t be available for a little while.

[18:00] As the purpose of giving a connection. A quarantine is a little different you really want to avoid the quarantine connection that means that during Association the node somehow failed to complete the handshake process. So there’s something effectively wrong with the transport or wrong with that machine or there’s something going on with the application is using it that were require the application must be restarted with the new you ID before will accept a connection from it. We want establish outbound connections to quarantine node but will

[18:30] accepted in inbound connections like quarantine node why ideas is different is that means the ActorSystem is rebooted and the quarantine window will lapse after a period of time will disable the quarantine have enough time passes not can also be configured to, was as you see with the process of interacting in reading and writing Endpoints looks like. So … about writing to endpoints, so the first thing that happens is a user-defined actor sends want to send a message to a remote actor reference

[19:00] or to actor selection sits on a remote system or maybe one of the remote deploying actor which case that’s really going to be done as a message this address to the RemoteDaemon on the machine. That’ll be transparent to you then where writer receive that user-defined message, a little serialize that message and write it to the Association handle. So we use a combination of a Google protocol buffers … with some important meta data and then we have a little blob of binary … actually contains your user-defined message. So the

[19:30] meta-data that’s included on that issue lies message includes things like which actor this is gonna be delivered to you, what’s the path a bit and so forth. And then the Association handle actually writes that serialize message the network transport. So that’s what the running to that Endpoint looks like. So reading an Endpoints as you might imagine works the opposite way. A message arrives in the network and the Association handle on the underlying transfer will push that message into the appropriate Endpoint reader. So this is the reactive system, we don’t have to pull for

[20:00] messages, they’re pushed into us. So the endpoint reader will receive that message and all the serialize the message and find the recipient so ago and figure out which actor should receive this message and have no actor can receive that message let’s say it’s a actor selection for an actor that doesn’t actually exist on the system then we’ll send that message to dead letters. Was it does find a valid recipient though it to use a special class column message dispatcher to actually send that message to the recipient.

[20:30] So that’s what we’re reading and writing from endpoints looks like. Let’s talk about the process up disassociation which is how remote communication between two actors systems ends. So we have an open association between system 1 and system 2 in this example. System 2 crashes hasn’t unplanned termination, so … the system 2 crashes in the process is terminated a restarted, the association is automatically closed because the socket connection that was powering the transport is now closed itself.

[21:00] So the transport disconnects no more messages will be received and when that transport closes on the system 1 side of the equation it will create disassociation notification for the Endpoint manager. So to propagate that message the Endpoint manager and the Endpoint manager will kill all remote deployed ActorSystems to, does this by communicating with RemoteDaemon and any further outbound messages that are intended for system to will be automatically delivered that letters.

[21:30] Now the endpoint policy that the Endpoint manager will apply the system 2 going forward will depend on the manner in which this crashed occurred. Unplanned termination what result negated policy, we’re gonna wait for that system come up more time to connect to it again at some period of time but system 2 is a plan termination will go ahead and buffer our connection attempts for five seconds by default, some configurable … of time. And in event that we attempt to reconnect the systems 2. So we disassociates then,

[22:00] then we might quarantine that connection until the process had chance to reboot. So that’s where the disassociation process looks like. So that’s it for this video. Thank you all for watching and if you wanna learn more about Akka.Remote, how to use it on your own applications, we encourage you to check out Petabridge Virtual Training. So these are half a day long live webinars with myself in Andrew and will teach you some ins and outs of that Akka.Remote so you can better use it in your own applications. So check out the URL on the bottom of the screen and give it a try. Thanks for watching.

[22:35]

If you liked this post, you can share it with your followers or follow us on Twitter!

Written by Andrew Skotzko on May 6, 2015

Read more about:
Akka.NET
Case Studies
Videos

Observe and Monitor Your Akka.NET Applications with Phobos

Did you know that Phobos can automatically instrument your Akka.NET applications with OpenTelemetry?

Click here to learn more.