Beyond simple benchmarks

A practical guide to optimizing code

danielmarbach | daniel.marbach@particular.net | Daniel Marbach

[SimpleJob]
[MemoryDiagnoser]
public class StringJoinBenchmarks {

  [Benchmark]
  public string StringJoin() {
    return string.Join(", ", Enumerable.Range(0, 10).Select(i => i.ToString()));
  }

  [Benchmark]
  public string StringBuilder() {
    var sb = new StringBuilder();
    for (int i = 0; i < 10; i++)
    {
        sb.Append(i);
        sb.Append(", ");
    }

    return sb.ToString(0, sb.Length - 2);
  }

  [Benchmark]
  public string ValueStringBuilder() {
    var separator = new ReadOnlySpan<char>(new char[] { ',', ' ' });
    using var sb = new ValueStringBuilder(stackalloc char[30]);
    for (int i = 0; i < 10; i++)
    {
        sb.Append(i);
        sb.Append(separator);
    }
    }

    return sb.AsSpan(0, sb.Length - 2).ToString();
  }
}

"simple"

"We were able to see Azure Compute cost reduction of up to 50% per month, on average we observed 24% monthly cost reduction after migrating to .NET 6. The reduction in cores reduced Azure spend by 24%."

Microsoft Teams’ Infrastructure and Azure Communication Services’ Journey to .NET 6

Performance Aware

Bear Aware

Be curious....
Understand The Context

How is this code going to be executed at scale, and what would the memory characteristics be (gut feeling)?
Is there simple low-hanging fruit I can pick to accelerate this code?
Are there things I can move away from the hot path by simply restructuring my code a bit?
What part is under my control and what isn't really?
What optimizations can I apply, and when should I stop?

The performance loop

Profile at least CPU and memory using a profiling harness
Improve parts of the hot path
Benchmark and compare
Profile improvements again with the harness and make adjustments where necessary
Ship and focus your attention on other parts

NServiceBus

go.particular.net/techorama-be-2024-quickstart

NServiceBus

Pipeline

public class RequestCultureMiddleware {
    private readonly RequestDelegate _next;

    public RequestCultureMiddleware(RequestDelegate next) {
        _next = next;
    }

    public async Task InvokeAsync(HttpContext context) {
        // Do work that does something before
        await _next(context);
        // Do work that does something after
    }
}

ASP.NET Core Middleware

public class Behavior : Behavior<IIncomingLogicalMessageContext> {
    public override async Task Invoke(IIncomingLogicalMessageContext context, Func<Task> next) {
        // Do work that does something before
        await next();
        // Do work that does something after
    }
}

Behaviors
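
Both the middleware and the behavior follow the same shape: do some work, call next, do more work. A minimal sketch of how such a chain could be wired, assuming hypothetical IStep, LoggingStep and MiniPipeline types (none of these are NServiceBus or ASP.NET Core APIs):

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical stand-in for a pipeline step; not the NServiceBus Behavior<T> type.
public interface IStep
{
    Task Invoke(List<string> log, Func<Task> next);
}

// A step that records when it runs relative to the rest of the chain.
public sealed class LoggingStep : IStep
{
    private readonly string name;

    public LoggingStep(string name) => this.name = name;

    public async Task Invoke(List<string> log, Func<Task> next)
    {
        log.Add($"{name}: before");
        await next();
        log.Add($"{name}: after");
    }
}

public static class MiniPipeline
{
    // Invokes the step at `index` and hands it a delegate that continues with the next step.
    public static Task Run(IReadOnlyList<IStep> steps, List<string> log, int index = 0)
    {
        if (index == steps.Count)
        {
            return Task.CompletedTask; // end of the chain
        }

        return steps[index].Invoke(log, () => Run(steps, log, index + 1));
    }
}
```

Running two steps named A and B records "A: before", "B: before", "B: after", "A: after". This naive version allocates a closure per step on every invocation, which is the kind of cost a memory profile of a pipeline tends to surface.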

Profiling the pipeline

The harness

var endpointConfiguration = new EndpointConfiguration("PublishSample");
endpointConfiguration.UseSerialization<JsonSerializer>();
var transport = endpointConfiguration.UseTransport<MsmqTransport>();
transport.Routing().RegisterPublisher(typeof(MyEvent), "PublishSample");
endpointConfiguration.UsePersistence<InMemoryPersistence>();
endpointConfiguration.EnableInstallers();
endpointConfiguration.SendFailedMessagesTo("error");

var endpointInstance = await Endpoint.Start(endpointConfiguration);

Console.WriteLine("Attach the profiler and hit <enter>.");
Console.ReadLine();

var tasks = new List<Task>(1000);
for (int i = 0; i < 1000; i++)
{
    tasks.Add(endpointInstance.Publish(new MyEvent()));
}
await Task.WhenAll(tasks);

Console.WriteLine("Publish 1000 done. Get a snapshot");
Console.ReadLine();

public class MyEventHandler : IHandleMessages<MyEvent> {
    public Task Handle(MyEvent message, IMessageHandlerContext context)
    {
        Console.WriteLine("Event received");
        return Task.CompletedTask;
    }
}

Compiled and executed in Release mode
Runs for a few seconds and keeps overhead minimal
Disables Tiered JIT
<TieredCompilation>false</TieredCompilation>
Emits full symbols
<DebugType>pdbonly</DebugType>
<DebugSymbols>true</DebugSymbols>

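The harness settings listed above could be collected in the project file roughly as follows. This is a sketch of a config fragment only; the rest of the project file is assumed to exist already, and the harness would be started with `dotnet run -c Release` to get the Release build.

```xml
<PropertyGroup>
  <!-- Disable tiered JIT for the profiling harness -->
  <TieredCompilation>false</TieredCompilation>
  <!-- Emit symbols so the profiler can resolve method names -->
  <DebugType>pdbonly</DebugType>
  <DebugSymbols>true</DebugSymbols>
</PropertyGroup>
```
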
Profiling the pipeline

Memory Characteristics: Publish, Receive, BehaviorChain, Context matters

CPU Characteristics: Publish, Receive

Benchmarking the pipeline

Copy and paste relevant code
Adjust it to the bare essentials to create a controllable environment

Extract Code

Trim down to the relevant behaviors
Replace the dependency injection container with directly created instances of the relevant classes
Replace I/O operations with completed tasks

Extract Code
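
One way to picture the last two points is a hand-wired object graph with a stubbed I/O boundary. The sketch below assumes hypothetical IDispatchMessages, NoOpDispatcher and Sender types; they are illustrative names, not types from NServiceBus.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical abstraction over the I/O boundary (for example a transport).
public interface IDispatchMessages
{
    Task Dispatch(ReadOnlyMemory<byte> body);
}

// Benchmark-only stub: no network or disk access, just an already completed task.
public sealed class NoOpDispatcher : IDispatchMessages
{
    public int Calls { get; private set; }

    public Task Dispatch(ReadOnlyMemory<byte> body)
    {
        Calls++;
        return Task.CompletedTask;
    }
}

// The code under test is constructed by hand instead of being resolved from a container.
public sealed class Sender
{
    private readonly IDispatchMessages dispatcher;

    public Sender(IDispatchMessages dispatcher) => this.dispatcher = dispatcher;

    public Task Send(byte[] payload) => dispatcher.Dispatch(payload);
}
```

With `new Sender(new NoOpDispatcher())` the benchmark measures only the code between the caller and the I/O boundary, so network latency or container resolution cost does not distort the numbers.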

Benchmarking the pipeline

Get started with small steps
Culture change takes time
Make changes gradually

Performance Culture

[ShortRunJob]
[MemoryDiagnoser]
public class PipelineExecution {

    [Params(10, 20, 40)]
    public int PipelineDepth { get; set; }


    [GlobalSetup]
    public void SetUp()  {
        behaviorContext = new BehaviorContext();

        pipelineModificationsBeforeOptimizations = new PipelineModifications();
        for (int i = 0; i < PipelineDepth; i++)
        {
            pipelineModificationsBeforeOptimizations.Additions.Add(RegisterStep.Create(i.ToString(),
                typeof(BaseLineBehavior), i.ToString(), b => new BaseLineBehavior()));
        }

        pipelineModificationsAfterOptimizations = new PipelineModifications();
        for (int i = 0; i < PipelineDepth; i++)
        {
            pipelineModificationsAfterOptimizations.Additions.Add(RegisterStep.Create(i.ToString(),
                typeof(BehaviorOptimization), i.ToString(), b => new BehaviorOptimization()));
        }

        pipelineBeforeOptimizations = new BaseLinePipeline<IBehaviorContext>(null, new SettingsHolder(),
            pipelineModificationsBeforeOptimizations);
        pipelineAfterOptimizations = new PipelineOptimization<IBehaviorContext>(null, new SettingsHolder(),
            pipelineModificationsAfterOptimizations);
    }

    [Benchmark(Baseline = true)]
    public async Task Before() {
        await pipelineBeforeOptimizations.Invoke(behaviorContext);
    }

    [Benchmark]
    public async Task After() {
        await pipelineAfterOptimizations.Invoke(behaviorContext);
    }
}

Benchmarking the pipeline

Single Responsibility Principle
No side effects
Prevents dead code elimination
Delegates heavy lifting to the framework
Is explicit
- No implicit casting
- No var
Avoid running any other resource-heavy processes while benchmarking

practices

Benchmarking the pipeline

Benchmarking is really hard

BenchmarkDotNet will protect you from the common pitfalls because it does all the dirty work for you

[ShortRunJob]
[MemoryDiagnoser]
public class Step1_PipelineWarmup {
    // rest almost the same

    [Benchmark(Baseline = true)]
    public BaseLinePipeline<IBehaviorContext> Before() {
        var pipelineBeforeOptimizations = new BaseLinePipeline<IBehaviorContext>(null, new SettingsHolder(),
            pipelineModificationsBeforeOptimizations);
        return pipelineBeforeOptimizations;
    }

    [Benchmark]
    public PipelineOptimization<IBehaviorContext> After() {
        var pipelineAfterOptimizations = new PipelineOptimization<IBehaviorContext>(null, new SettingsHolder(),
            pipelineModificationsAfterOptimizations);
        return pipelineAfterOptimizations;
    }
}

Benchmarking the pipeline

[ShortRunJob]
[MemoryDiagnoser]
public class Step2_PipelineException {
    [GlobalSetup]
    public void SetUp() {
        ...
        var stepId = PipelineDepth + 1;
        pipelineModificationsBeforeOptimizations.Additions.Add(RegisterStep.Create(stepId.ToString(), typeof(Throwing), "1", b => new Throwing()));

        ...
        pipelineModificationsAfterOptimizations.Additions.Add(RegisterStep.Create(stepId.ToString(), typeof(Throwing), "1", b => new Throwing()));

        pipelineBeforeOptimizations = new Step1.PipelineOptimization<IBehaviorContext>(null, new SettingsHolder(),
            pipelineModificationsBeforeOptimizations);
        pipelineAfterOptimizations = new PipelineOptimization<IBehaviorContext>(null, new SettingsHolder(),
            pipelineModificationsAfterOptimizations);
    }

    [Benchmark(Baseline = true)]
    public async Task Before() {
        try
        {
            await pipelineBeforeOptimizations.Invoke(behaviorContext).ConfigureAwait(false);
        }
        catch (InvalidOperationException)
        {
        }
    }

    [Benchmark]
    public async Task After() {
        try
        {
            await pipelineAfterOptimizations.Invoke(behaviorContext).ConfigureAwait(false);
        }
        catch (InvalidOperationException)
        {
        }
    }
    
    class Throwing : Behavior<IBehaviorContext> {
        public override Task Invoke(IBehaviorContext context, Func<Task> next)
        {
            throw new InvalidOperationException();
        }
    }
}

Benchmarking the pipeline

Profiling the pipeline (again)

Memory Characteristics, Before and After: Publish, Receive

oh look, there is nothing 😌

CPU Characteristics: Publish, Receive

Getting lower on the stack

The harness

await using var serviceBusClient = new ServiceBusClient(connectionString);

await using var sender = serviceBusClient.CreateSender(destination);
var messages = new List<ServiceBusMessage>(1000);
for (int i = 0; i < 1000; i++) {
    messages.Add(new ServiceBusMessage(UTF8.GetBytes($"Deep Dive {i} Deep Dive {i} Deep Dive {i} Deep Dive {i} Deep Dive {i} Deep Dive {i}")));

    if (i % 100 == 0) {
        await sender.SendMessagesAsync(messages);
        messages.Clear();
    }
}

await sender.SendMessagesAsync(messages);

WriteLine("Messages sent");
Console.WriteLine("Take snapshot");
Console.ReadLine();

var countDownEvent = new CountdownEvent(1000);

var processorOptions = new ServiceBusProcessorOptions {
    AutoCompleteMessages = true,
    MaxConcurrentCalls = 100,
    MaxAutoLockRenewalDuration = TimeSpan.FromMinutes(10),
    ReceiveMode = ServiceBusReceiveMode.PeekLock,
};

await using var receiver = serviceBusClient.CreateProcessor(destination, processorOptions);
receiver.ProcessMessageAsync += async messageEventArgs => {
    var message = messageEventArgs.Message;
    await Out.WriteLineAsync(
        $"Received message with '{message.MessageId}' and content '{UTF8.GetString(message.Body)}' / binary {message.Body}");
    countDownEvent.Signal();
};
// rest omitted
await receiver.StartProcessingAsync();

countDownEvent.Wait();

Console.WriteLine("Take snapshot");
Console.ReadLine();

await receiver.StopProcessingAsync();

Getting lower on the stack

Memory Characteristics

Preventing regressions

C:\Projects\performance\src\tools\ResultsComparer> dotnet run --base "C:\results\before" --diff "C:\results\after" --threshold 2%

Guidance Preventing Regressions
ResultsComparer Tool

C:\Projects\performance\src\benchmarks\micro> dotnet run -c Release -f net8.0 \
    --artifacts "C:\results\before"

C:\Projects\performance\src\benchmarks\micro> dotnet run -c Release -f net8.0 \
    --artifacts "C:\results\after"
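
To make the idea of a threshold concrete, here is a deliberately naive sketch of a percentage check between two mean timings. The RegressionCheck type and its parameters are hypothetical, and this is not a description of how the ResultsComparer tool decides internally.

```csharp
using System;

public static class RegressionCheck
{
    // Returns true when `afterMeanNs` is slower than `beforeMeanNs` by more than
    // `thresholdPercent` (for example 2 tolerates a slowdown of up to 2%).
    public static bool IsRegression(double beforeMeanNs, double afterMeanNs, double thresholdPercent)
    {
        if (beforeMeanNs <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(beforeMeanNs));
        }

        double changePercent = (afterMeanNs - beforeMeanNs) / beforeMeanNs * 100.0;
        return changePercent > thresholdPercent;
    }
}
```

A single pair of means says little on a noisy machine, so a check like this is only meaningful when both runs come from the same, otherwise idle environment.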

"CPU-bound benchmarks are much more stable than Memory/Disk-bound benchmarks, but the average performance levels still can be up to
three times different across builds."

Andrey Akinshin - Performance stability of GitHub Actions
