.NET Performance Optimization: Deferred Allocations
How We Accelerated Phobos 2.5's Throughput by 161%
We just finished shipping Phobos 2.5 and it’s a massive performance upgrade over previous versions of Phobos.
For those that are not aware: Phobos is our commerical OpenTelemetry add-on for Akka.NET.
This past summer we posted about Phobos 2.4’s performance being 62% faster than Phobos 2.3.1 Phobos 2.5 is 161% faster than Phobos 2.3.1 - and in this post we’re going to share the generalized .NET performance optimization technique we’ve been using to accomplish these improvements: deferred allocations.
Problem Context
Imagine you have a performance-critical hotpath in your application, such as an Akka.NET actor or an ASP.NET Controller - we ideally want to keep the latency in this critical path as low as possible in order to maximize responsiveness and per-process throughput.
But, we are also given a secondary requirement - maybe we have to add logging or OpenTelemetry tracing here for observability purposes, or maybe we have to push some data points into an internal-facing analytics / reporting system for internal stakeholders.
Implementing that secondary requirement along the critical path is going to increase our processing time at the expense of our mission-critical processing and ultimately, our end-users. Deferring allocations to outside the critical path is how we can avoid this problem.
Let’s consider what we did with Phobos 2.5 and .NET’s OpenTelemetry pipeline as an example.
Phobos 2.3.1 Instrumentation Code and Performance Impact
Phobos installs inside your ActorSystem
and injects some of the following telemetry code into all of the actors therein:
// Allocation 1: Span creation (unavoidable allocation)
TelemetrySpan sp = Tracer.StartActiveSpan(underlyingMsg.GetType().GetOperationName(),
SpanKind.Consumer, parentContext: context.Value, InitialAttributes, startTime: startTime);
// Allocation 2: Adding attributes + stringifying data – 2 collection allocs + 2 strings
sp.SetAttribute(MsgSenderTagName, (Sender ?? ActorRefs.Nobody).Path.ToStringWithAddress())
.SetAttribute(MessageTagName, underlyingMsg.GetType().ToCleanName());
if (startTimeUtcTicks != null)
{
// Allocation 3: Adding 1 event – 1 collection alloc
sp.AddEvent("waiting", startTime);
}
// Allocation 4: Adding 1 event – 1 collection alloc, 1 SpanAttribute alloc
var attributes = new SpanAttributes();
// Allocation 5: 1 MAJOR stringification operation
attributes.Add("content", underlyingMsg.ToString());
sp.AddEvent("message", attributes);
This code uses the OpenTelemetry.Api
package to collect TelemetrySpan
s each time an actor processes a message (although this can be configured) and is the most expensive part of Phobos due to the large number of allocations and the string
-ification of ActorPath
s and the messages themselves.
To give you a general idea of the peformance impact, here are the numbers when running Phobos 2.3.1:
Metric | Units / s | Max / s | Average / s | Min / s | StdDev / s |
---|---|---|---|---|---|
TotalCollections [Gen0] | collections | 42.11 | 38.22 | 36.28 | 2.32 |
TotalCollections [Gen1] | collections | 2.59 | 2.12 | 1.81 | 0.31 |
TotalCollections [Gen2] | collections | 1.11 | 0.79 | 0.73 | 0.11 |
[Counter] MessageReceived | operations | 1,263,277.84 | 1,145,856.34 | 1,088,397.58 | 70,098.66 |
By comparison, when Phobos isn’t installed this same sample runs at about 6.2m messages per second for a single actor. So the default performance of an Akka.NET actor is roughly 1/6th as fast when Phobos is installed versus when it’s not - and the only discernable difference is whether or not telemetry is enabled.
How Deferred Allocation Can Work
In both Phobos 2.4 and 2.5, we gradually improved performance in this hotpath through the extensive use of deferred allocation - what does that mean?
- Hotpath - we have a hotpath, the actor’s message-handling routine. This code is performing latency-sensitive work and has to execute as quickly as possible for best results.
- Telemetry - users want and need telemetry to help understand how their software performs, and while this data is important it’s not as crucial or urgent to the real-time operation of the software.
That last sentence is key: “telemetry is not as crucial or urgent.” This means we can try to change when the telemetry is fully allocated and expanded so it happens asynchronously from our hotpath.
The allocations are still going to happen, but now they’ll happen in such a way that it’ll only impact the OpenTelemetry export pipeline - which is less urgent and important than the rest of our software.
Phobos 2.5.0 Instrumentation Code and Performance Impact
We changed our telemetry code in Phobos 2.5.0 to look like the following, in order to displace allocations from inside our actors to the OpenTelemetry export pipeline instead (which runs out of band:)
// Deferral 1: Lumps tags into struct, defers stringification. 0 allocs.
var tags = InitialTags.WithMessageTypeAndSender(underlyingMsg, Sender?.Path);
var perfSpan = ActivitySource.StartActivity(underlyingMsg.GetType().GetOperationName(),
ActivityKind.Consumer, context ?? default, tags);
if(perfSpan == null) return default;
// ensure that we are the current span
Activity.Current = perfSpan;
if (startTimeUtcTicks != null)
{
// Adds 1 node in linked list inside Activity
perfSpan.AddEvent(new ActivityEvent("waiting", startTime));
}
// Deferral 2: Adds 1 node in linked list inside Activity + ActivityTagsCollection
// but IEnumerable is not expanded until export pipeline
perfSpan.AddEvent(new ActivityEvent("message",
tags: new ActivityTagsCollection(CreateSpanAttributes(underlyingMsg))));
return perfSpan;
static IEnumerable<KeyValuePair<string, object>> CreateSpanAttributes(object msg)
{
// Deferral 3: message is not stringified until enumeration (export pipeline)
yield return new KeyValuePair<string, object>("content", msg);
}
So what’s happening here?
For starters, we’ve switched from using the OpenTelemetery.Api package to System.Diagnostics.Activity - the former being a wrapper over the latter, in order to access some APIs that make deferred allocation possible.
The ActivitySource.StartActivity
method accepts an IEnumerable<KeyValuePair<string, object>>
argument for all of the “tags” we want to apply to this Activity
. Our tags
data type implements IEnumerable<KeyValuePair<string, object>>
using a custom struct
that we’ll show later. This is beneficial to performance for two reasons:
- The
IEnumerable
doesn’t get iterated over until theActivity
is about to be exported - so we save on list / array allocations there; - The
object
will eventually get rendered into astring
- which means we can defer the stringification of the tag’s value until theActivity
is in the export pipeline.
Let’s take a look at the struct
we use to do this:
internal readonly struct BuiltInTags : IEnumerable<KeyValuePair<string, object>>
{
public string ActorPath { get; }
public string ActorType { get; }
public object MessageType { get; }
public ActorPath MsgSender { get; }
public IEnumerator<KeyValuePair<string, object>> GetEnumerator(){
yield return new KeyValuePair<string, object>(ActorPathTagName, ActorPath);
yield return new KeyValuePair<string, object>(ActorTypeTagName, ActorType);
// defer stringifying the message's type until we hit export pipeline
yield return new KeyValuePair<string, object>(MessageTagName, FormatMessageType(MessageType));
// defer stringifying the message's sender until we hit export pipeline
yield return new KeyValuePair<string, object>(MsgSenderTagName, FormatSenderPath(MsgSender));
}
IEnumerator IEnumerable.GetEnumerator()
{
return GetEnumerator();
}
}
Eliminating a collection allocation by using this readonly struct
saves us some allocations, but the real benefit is deferring the object
to string
conversion until this Activity
is getting exported by the OpenTelemetry tracing pipeline.
The other major area where we benefit from deferred stringification is the static
local function which emits an IEnumerable<KeyValuePair<string, object>>
for our message
event - this uses the exact same technique as the previous code we looked at.
// Deferral 2: Adds 1 node in linked list inside Activity + ActivityTagsCollection
// but IEnumerable is not expanded until export pipeline
perfSpan.AddEvent(new ActivityEvent("message",
tags: new ActivityTagsCollection(CreateSpanAttributes(underlyingMsg))));
return perfSpan;
static IEnumerable<KeyValuePair<string, object>> CreateSpanAttributes(object msg)
{
// Deferral 3: message is not stringified until enumeration (export pipeline)
yield return new KeyValuePair<string, object>("content", msg);
}
Same idea - the attributes for this trace event don’t get rendered into string
representations until we hit the trace export pipeline, therefore we can avoid that expensive stringification operation from occurring inside our hotpath.
Let’s take a look at some memory allocation data from Benchmark.NET to drive this point home:
Phobos 2.3.1 Memory Allocation
Method | MessageKind | Mean | Error | StdDev | Req/sec | Gen0 | |
---|---|---|---|---|---|---|---|
Allocated | |||||||
CreateNewRootSpan | Primitive | 112.6 ns | 2.18 ns | 5.64 ns | 8,880,473.05 | 0.0246 | 232 B |
CreateChildSpan | Primitive | 112.9 ns | 2.24 ns | 3.42 ns | 8,858,208.46 | 0.0246 | 232 B |
CreateChildSpanWithBaggage | Primitive | 113.0 ns | 2.28 ns | 2.14 ns | 8,852,716.94 | 0.0246 | 232 B |
CreateNewRootSpan | Class | 159.5 ns | 3.21 ns | 8.52 ns | 6,270,690.97 | 0.0348 | 328 B |
CreateChildSpan | Class | 165.4 ns | 3.54 ns | 10.45 ns | 6,044,475.13 | 0.0348 | 328 B |
CreateChildSpanWithBaggage | Class | 162.2 ns | 3.24 ns | 6.31 ns | 6,165,487.70 | 0.0348 | 328 B |
CreateNewRootSpan | Record | 201.4 ns | 2.40 ns | 2.13 ns | 4,965,186.73 | 0.0730 | 688 B |
CreateChildSpan | Record | 211.1 ns | 4.00 ns | 3.74 ns | 4,737,334.16 | 0.0730 | 688 B |
CreateChildSpanWithBaggage | Record | 208.1 ns | 3.60 ns | 3.37 ns | 4,805,978.43 | 0.0730 | 688 B |
Phobos 2.5.0 Memory Allocation
Method | MessageKind | Mean | Error | StdDev | Req/sec | Gen0 | Allocated |
---|---|---|---|---|---|---|---|
CreateNewRootSpan | Primitive | 59.97 ns | 0.637 ns | 0.565 ns | 16,676,180.98 | 0.0050 | 48 B |
CreateChildSpan | Primitive | 67.24 ns | 0.943 ns | 0.836 ns | 14,871,155.56 | 0.0050 | 48 B |
CreateChildSpanWithBaggage | Primitive | 77.87 ns | 0.871 ns | 0.814 ns | 12,842,151.57 | 0.0050 | 48 B |
CreateNewRootSpan | Class | 59.70 ns | 0.695 ns | 0.616 ns | 16,749,521.86 | 0.0050 | 48 B |
CreateChildSpan | Class | 63.69 ns | 0.715 ns | 0.669 ns | 15,701,818.74 | 0.0050 | 48 B |
CreateChildSpanWithBaggage | Class | 71.53 ns | 0.650 ns | 0.576 ns | 13,979,808.57 | 0.0050 | 48 B |
CreateNewRootSpan | Record | 56.55 ns | 0.258 ns | 0.229 ns | 17,684,262.41 | 0.0051 | 48 B |
CreateChildSpan | Record | 64.35 ns | 0.369 ns | 0.288 ns | 15,538,845.63 | 0.0050 | 48 B |
CreateChildSpanWithBaggage | Record | 72.41 ns | 0.655 ns | 0.613 ns | 13,809,634.69 | 0.0050 | 48 B |
Memory allocation figures changed quite a bit depending on how the underlying message type was implemented - record
types, for instance, use a StringBuilder
to pretty-print all of the properties and fields each time object.ToString()
is called, hence the high memory footprint.
By deferring this string
-ification we’re able to normalize the memory allocation down from something as high as 688 bytes to a flat 48 bytes per message.
Phobos 2.5.0 Throughput
From a throughput perspective, these deferred allocation techniques reduced Phobos’ latency / increased its throughput by 161%.
Metric | Units / s | Max / s | Average / s | Min / s | StdDev / s |
---|---|---|---|---|---|
TotalCollections [Gen0] | collections | 20.43 | 19.65 | 18.86 | 0.48 |
TotalCollections [Gen1] | collections | 7.04 | 5.83 | 4.96 | 0.65 |
TotalCollections [Gen2] | collections | 3.06 | 2.68 | 1.98 | 0.47 |
[Counter] MessageReceived | operations | 3,064,477.84 | 2,994,416.62 | 2,891,043.79 | 53,773.16 |
Roughly 3 million messages per second versus 1.1 million - this is all without changing the underlying data our telemetry collects. We’re just collecting it more efficiently.
Things to Bear in Mind
Deferred allocation techniques, such as deferred enumeration and stringification, are powerful methods for making existing code run much faster without having to compromise on requirements. Our benchmark results demonstrate this quite clearly.
However, there are some important trade-offs to bear in mind with this technique:
- Deferred stringification relies on asychrony - if the output is critical to your hotpath, then you might not be able to use this technique effectively.
- There’s a chance that errors with object expansion / allocation might not get discovered until after it’s too late to fix it due to this technique - deferred allocation prioritizes availability over consistency, essentially. That trade needs to align with your requirements for the data being deferred.
- The underyling consumer of the data needs to support deferred allocation in order for this to work - OpenTelemetry, System.Diagnostics.Activity, Microsoft.Extensions.Logging, and Akka.Event are all examples of APIs that support deferred allocation techniques.
- Read more about:
- Akka.NET
- Case Studies
- Videos
Observe and Monitor Your Akka.NET Applications with Phobos
Did you know that Phobos can automatically instrument your Akka.NET applications with OpenTelemetry?
Click here to learn more.