`DistributedApplicationTestingBuilder` frustrations #7126

afscrome · 2025-01-15T22:44:06Z


Been using DistributedApplicationTestingBuilder a bit now, and when it works well it's amazing. But when it doesn't work, it's a painful and frustrating experience.

Some of these I have captured in issues already logged in this repo, but I thought it useful to put this in one place to inform #7057

Logging in Test Frameworks isn't obvious

The docs don't do a great job of explaining how to configure logging, without which you have essentially no details on what's going on, particularly in CI (see dotnet/docs-aspire#2096). Doc improvements would help here, perhaps with some additions to the default templates.

Even with that, all resources get dumped into the same output, so if you're troubleshooting why FOO didn't start up, you've got to filter through a lot of noisy logs to find out. (And if other services make use of FOO without using WaitFor, those may well be spamming the log with errors because FOO isn't fully available, adding even more noise to filter through.)

I also wonder if there is value in a helper method which writes resource logs to a directory, with each resource/replica getting its own file. This directory could then be published as a build artifact, providing a way to more easily view the logs of individual resources. It would probably be combined with some test framework infrastructure to give each test its own directory, possibly only publishing for failed tests. Aspire doesn't need to handle all the test-framework-specific issues, but a LogToDirectory building block could be useful.

DistributedApplicationTestingBuilder could also do with some targeted log levels, or add its own log entries based on Aspire events. E.g. default Aspire.Hosting.ApplicationModel.ResourceNotificationService to Debug level to log state changes (or subscribe to events and publish them to a logger). Ditto for health checks, including pass / fail (along with details of why they failed). (This overlaps with the "Hard to see current state of test host" section below.)
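Until defaults like that exist, a single category can be turned up from the test itself. The following is a minimal sketch, not a recommended configuration: `Projects.MyApp_AppHost` is a hypothetical AppHost project name, and it assumes (as the paragraph above implies) that state changes are logged at Debug under the `Aspire.Hosting.ApplicationModel.ResourceNotificationService` category.

```csharp
using Aspire.Hosting.Testing;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Logging;

// Hypothetical AppHost project reference; replace with your own.
var builder = await DistributedApplicationTestingBuilder.CreateAsync<Projects.MyApp_AppHost>();

builder.Services.AddLogging(logging =>
{
    // Keep everything else at the usual level to avoid extra noise...
    logging.SetMinimumLevel(LogLevel.Information);

    // ...and raise only the one category of interest (assumed category name, see above).
    logging.AddFilter(
        "Aspire.Hosting.ApplicationModel.ResourceNotificationService",
        LogLevel.Debug);
});

await using var app = await builder.BuildAsync();
await app.StartAsync().WaitAsync(TimeSpan.FromSeconds(30));
```

Whether these entries actually show up in test output still depends on the test framework and on which logging providers are registered.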

Timeouts leave you hanging

There are many failure scenarios in Aspire which will hang forever. When you have the dashboard, this works great - you can see what hasn't started and use the tools in the dashboard to browse and filter logs to understand why the bits that failed did so. DistributedApplicationTestingBuilder doesn't include the dashboard (and if it did, the dashboard wouldn't be usable in CI).

The current pattern to avoid this is:

using var cancellationTokenSource = new CancellationTokenSource(TimeSpan.FromSeconds(30));
await _app.StartAsync(cancellationTokenSource.Token);
// or
await _app.StartAsync().WaitAsync(TimeSpan.FromSeconds(30));

While this fixes the immediate problem of infinite hanging, all it does is result in the error System.TimeoutException : The operation has timed out. That doesn't give you any help in working out where to look for the root cause, leaving you hanging in a different way...

One thing that I think could help here is to have native timeout support on waiting methods that can provide targeted errors when things fail.

app.StartAsync

Resources FOO, BAR and BAZ failed to start within TIMEOUT
Or
BeforeStart event did not complete within TIMEOUT

Aside: I think it's surprising that StartAsync hangs at all - I now know enough about Aspire internals that I understand why, but it does feel like the delays are due to a leaky abstraction rather than making sense. But as long as StartAsync can timeout, we can at least clarify what failed.

resourceNotificationService.WaitForResourceAsync / resourceNotificationService.WaitForCompletion

FOO failed to reach X state within TIMEOUT. It ended up in the Y state

(Possibly also including a short snippet of the last 10 lines of stdout from the service)

resourceNotificationService.WaitForResourceHealthyAsync:

Foo failed to become healthy within TIMEOUT.

HealthCheck1 (Healthy)

HealthCheck2 (Degraded) - {healthReport.Description}

There are definitely further improvements / tuning that could be done on these error messages. Treat these as a starting point that is many times better than The operation has timed out.

These could possibly be implemented directly on ResourceNotificationService, although it could make more sense to give DistributedApplicationTestingBuilder its own version of these, optimised for test scenarios.
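The targeted-timeout idea in this section can be approximated in user code today. A minimal sketch follows, assuming a hypothetical extension method named `WaitWithContextAsync` (it is not an Aspire API) built only on `Task.WaitAsync`. The caller supplies a name for the operation and, optionally, a callback that summarises whatever state the test has collected.

```csharp
using System;
using System.Threading.Tasks;

public static class TimeoutExtensions
{
    // Hypothetical helper: behaves like Task.WaitAsync(timeout), but the
    // TimeoutException names the operation and can include a state summary.
    public static async Task WaitWithContextAsync(
        this Task task,
        TimeSpan timeout,
        string operation,
        Func<string>? describeState = null)
    {
        try
        {
            await task.WaitAsync(timeout);
        }
        // Only rewrite timeouts raised by the wait itself, not a
        // TimeoutException thrown by the task being awaited.
        catch (TimeoutException ex) when (!task.IsCompleted)
        {
            var message = $"{operation} did not complete within {timeout}.";
            if (describeState is not null)
            {
                message += " " + describeState();
            }

            throw new TimeoutException(message, ex);
        }
    }
}
```

Usage might look like `await app.StartAsync().WaitWithContextAsync(TimeSpan.FromSeconds(30), "app.StartAsync", () => summary)`, where `summary` is whatever the test knows about resource states at that point. This only improves the message; it cannot tell which resource is responsible unless the callback can.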

Hard to see current state of test host

Again, another area that falls down due to the dashboard not being available.

There is a lot of good information in the ResourceEvents published by ResourceNotificationService, but you have to know they are there, and know how to subscribe via WatchAsync to receive them. It would be really helpful if this data could be more easily / obviously accessible within the test host.

For local dev, this could be done by a ResourceStates property, benefiting from some of the DebuggerDisplay work done on those fields - see #5632 (comment)

I don't think #6795 is sufficient to fix this issue as the data is still somewhat hidden behind WatchAsync. I'd expect to be able to reach the current state through something I can navigate in the locals / watch windows (i.e. a synchronous method / property), without having to go through a full-blown subscription.

I've had several test failures which I was only able to solve by using a crude state dumper like the following. I know enough about Aspire to know I can get this data out of ResourceNotificationService, but this data feels too useful to not be more visible.

using System.Collections.Concurrent;
using Aspire.Hosting;
using Aspire.Hosting.ApplicationModel;
using Microsoft.Extensions.DependencyInjection;

internal class StateDumper
{
   // Written by the background watcher and read by DumpState, so it needs to be thread-safe
   private readonly ConcurrentDictionary<(IResource Resource, string InstanceId), ResourceEvent> _resourceEvents = new();

   // This should be initialised in between `builder.BuildAsync()` and `app.StartAsync()`
   public StateDumper(DistributedApplication application)
   {
      //TODO: Should get a proper CTS and implement IAsyncDisposable
      var resourceEvents = application.Services.GetRequiredService<ResourceNotificationService>().WatchAsync(default);
      _ = Task.Run(async () =>
      {
         await foreach (var evt in resourceEvents)
         {
            // Keep only the latest event for each resource instance
            _resourceEvents[(evt.Resource, evt.ResourceId)] = evt;
         }
      });
   }

   public void DumpState()
   {
      var instancesByResource = _resourceEvents.GroupBy(x => x.Key.Resource);
      foreach (var resource in instancesByResource)
      {
         Console.WriteLine("=============================");
         Console.WriteLine($"RESOURCE: {resource.Key.Name} ({resource.Key.GetType().Name})");
         foreach (var instance in resource)
         {
            var snapshot = instance.Value.Snapshot;
            Console.WriteLine("-----------------------------");
            Console.WriteLine($"INSTANCE: {instance.Value.ResourceId} ({snapshot.ResourceType})");
            Console.WriteLine("-----------------------------");
            Console.WriteLine($"STATE: {snapshot.State?.Text ?? "UNKNOWN"}");
            Console.WriteLine($"EXIT CODE: {snapshot.ExitCode}");
            Console.WriteLine($"HEALTH STATUS: {snapshot.HealthStatus}");
            Console.WriteLine($"CREATED: {snapshot.CreationTimeStamp}");
            Console.WriteLine($"START: {snapshot.StartTimeStamp}");
            Console.WriteLine($"STOP: {snapshot.StopTimeStamp}");
            Console.WriteLine("HEALTH REPORTS:");
            foreach (var report in snapshot.HealthReports)
            {
               Console.WriteLine($"- {report.Name}: {report.Status} - {report.Description}, {report.ExceptionText}");
            }
            Console.WriteLine("PROPERTIES");
            foreach (var prop in snapshot.Properties)
            {
               // Mask sensitive values so they don't leak into test output
               var value = prop.IsSensitive ? "****" : prop.Value;
               Console.WriteLine($"- {prop.Name}: {value}");
            }
            Console.WriteLine("URLS");
            foreach (var url in snapshot.Urls)
            {
               Console.WriteLine($"- {url.Name}: {url.Url}");
            }
            Console.WriteLine("ENVIRONMENT VARIABLES");
            foreach (var envVar in snapshot.EnvironmentVariables)
            {
               Console.WriteLine($"- {envVar.Name}: {envVar.Value}");
            }
            Console.WriteLine("VOLUMES");
            foreach (var volume in snapshot.Volumes)
            {
               Console.WriteLine($"- {volume.Target}: {volume.Source}");
            }
         }
         Console.WriteLine("=============================");
      }
   }
}

davidfowl · 2025-01-16T05:16:43Z

Maintainer

@ReubenBond is working on improvements to testing for 9.1, channel your frustration 😄 (it seems #5878 is important for testing as well).

> Even with that, all resources get dumped into the same output, so if you're troubleshooting why FOO didn't start up, you've got to filter through a lot of noisy logs to find out. (And if other services make use of FOO without using WaitFor, those may well be spamming the log with errors because FOO isn't fully available, adding even more noise to filter through.)

I feel this while debugging our own flaky tests.

@afscrome See #7131 to see what our trace logs look like for resources.

7 replies

DamianEdwards Jan 18, 2025
Maintainer

> DistributedApplicationTestingBuilder could also do with some targeted log levels, or add its own log entries based on Aspire events. E.g. default Aspire.Hosting.ApplicationModel.ResourceNotificationService to Debug level to log state changes (or subscribe to events and publish them to a logger). Ditto for health checks, including pass / fail (along with details of why they failed). (This overlaps with the "Hard to see current state of test host" section below.)

I'm in two minds about having DistributedApplicationTestingBuilder change logging defaults. AFAIK WebApplicationFactory has similar behavior that requires one to manually configure the logging to "turn it up" for testing scenarios, so our behavior is consistent with that. On the other hand, I agree that without turning up the logging it's basically impossible to diagnose what's going on in a test failure. In the samples repo I have a custom factory method for creating IDistributedApplicationTestingBuilder instances that does what's required to basically max out logging. The other challenge is the different behavior between testing frameworks and runners with regards to what logs are captured during a test run.

As a first step we could consider adding a new helper method to the Aspire test project templates that creates the builder and configures it for max logging, and of course change the test to use it. This also gives a single place to easily change how the builder is created for all tests. Changing the defaults of DistributedApplicationTestingBuilder to always configure for max logging seems a bit excessive but it's possible my mind could be changed on that.

> I also wonder if there is value in a helper method which writes resource logs to a directory, with each resource/replica getting its own file. This directory could then be published as a build artifact, providing a way to more easily view the logs of individual resources. It would probably be combined with some test framework infrastructure to give each test its own directory, possibly only publishing for failed tests. Aspire doesn't need to handle all the test-framework-specific issues, but a LogToDirectory building block could be useful.

This really feels like it doesn't belong in Aspire internals honestly. We forward all resource logs through ILoggerFactory so sending logs from certain categories to different outputs is something that should already be possible with the existing features there, e.g. providers and filters.
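The provider-based routing described here might look roughly like the sketch below: a custom `ILoggerProvider` that writes each logger category to its own file under one directory. `DirectoryLoggerProvider` and `FileLogger` are hypothetical names, not part of Aspire or Microsoft.Extensions.Logging, and the sketch makes no assumption about which categories resource logs are forwarded under.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using Microsoft.Extensions.Logging;

// Hypothetical provider: one log file per logger category.
public sealed class DirectoryLoggerProvider : ILoggerProvider
{
    private readonly string _directory;
    private readonly ConcurrentDictionary<string, FileLogger> _loggers = new();

    public DirectoryLoggerProvider(string directory)
    {
        _directory = directory;
        Directory.CreateDirectory(directory);
    }

    public ILogger CreateLogger(string categoryName) =>
        _loggers.GetOrAdd(categoryName,
            name => new FileLogger(Path.Combine(_directory, Sanitize(name) + ".log")));

    public void Dispose() { }

    // Replace characters that are not valid in file names.
    private static string Sanitize(string name) =>
        string.Join("_", name.Split(Path.GetInvalidFileNameChars()));

    private sealed class FileLogger : ILogger
    {
        private readonly string _path;
        private readonly object _gate = new();

        public FileLogger(string path) => _path = path;

        public IDisposable? BeginScope<TState>(TState state) where TState : notnull => null;

        public bool IsEnabled(LogLevel logLevel) => logLevel != LogLevel.None;

        public void Log<TState>(LogLevel logLevel, EventId eventId, TState state,
            Exception? exception, Func<TState, Exception?, string> formatter)
        {
            if (!IsEnabled(logLevel)) return;

            var line = $"{DateTimeOffset.UtcNow:O} [{logLevel}] {formatter(state, exception)}";
            if (exception is not null) line += Environment.NewLine + exception;

            // Serialise writes from concurrent callers of the same category.
            lock (_gate)
            {
                File.AppendAllText(_path, line + Environment.NewLine);
            }
        }
    }
}
```

It could be registered with `builder.Services.AddLogging(logging => logging.AddProvider(new DirectoryLoggerProvider(logDirectory)))`, with `AddFilter` calls narrowing which categories reach it. Choosing a per-test directory and publishing it as a build artifact remains a test-framework concern.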

> There is a lot of good information in the ResourceEvents published by ResourceNotificationService, but you have to know they are there, and know how to subscribe via WatchAsync to receive them. It would be really helpful if this data could be more easily / obviously accessible within the test host.

Not sure how to approach this one honestly. I like the idea of being able to easily inspect resource notifications and state via the usual debugger tools just by virtue of navigating from the DistributedApplication instance in the context of a test, but it's not immediately clear to me what the right approach is. For logging we have logging fakes which make it easy to capture logs for inspection and assertion (which I use in the samples tests to verify no errors are emitted). Perhaps a similar approach could be taken for resource notifications, such that one line could set up a subscriber that makes it easy to inspect the resource notifications as current state, e.g. by capturing to a local variable.

afscrome Jan 20, 2025
Author

> AFAIK WebApplicationFactory has similar behavior that requires one to manually configure the logging to "turn it up" for testing scenarios so our behavior is consistent with that.

I guess I look at this the other way - I see it from the perspective that DistributedApplicationTestingBuilder has turned off the dashboard functionality, so you could view this as restoring lightweight dashboard functionality.

In light of this, one way you could view these issues is that whilst they mainly show up through DistributedApplicationTestingBuilder, they actually affect any use of DistributedApplicationBuilder when DistributedApplicationOptions.DisableDashboard is set. And so this is less about adding special capabilities to DistributedApplicationTestingBuilder, and more about adding some capabilities to the core to better allow access to diagnostics data when the dashboard UI is not enabled / available.

> changing the defaults of DistributedApplicationTestingBuilder to always configure for max logging

I agree, I don't want to change all logging to max - I already have a problem of too much logging, so I don't want to add even more noise. I'd rather see very specific logs targeted. Off the top of my head:

Resource State changes
Health Check failures

You could get similar effects without changing the logging level, e.g. the test host could have its own subscriber to resource state changes and emit its own log entries at INFO level so state changes do show up in the default configuration. Or perhaps the core of Aspire needs to differentiate some log levels when being run without the dashboard, e.g. resource state changes at Trace make perfect sense with the dashboard, but perhaps they should be at a higher level when the dashboard is not present. (As really this issue isn't specific to the test host, it's specific to any scenario without the dashboard.)
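The "own subscriber" idea could be sketched as follows, kept independent of Aspire types so it stands alone. `StateChangeLog` and `WriteTransitionsAsync` are hypothetical names. The input is a stream of (resource id, state) updates, and only genuine transitions are forwarded to a sink, which could be a logger call at Information level.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class StateChangeLog
{
    // Hypothetical helper: forwards one line per real state transition and
    // drops repeated updates that leave a resource's state unchanged.
    public static async Task WriteTransitionsAsync(
        IAsyncEnumerable<(string ResourceId, string? State)> updates,
        Action<string> sink)
    {
        var lastState = new Dictionary<string, string?>();

        await foreach (var (resourceId, state) in updates)
        {
            if (lastState.TryGetValue(resourceId, out var previous) && previous == state)
            {
                continue; // same state as last time, nothing to report
            }

            lastState[resourceId] = state;
            sink($"{resourceId}: {previous ?? "(none)"} -> {state ?? "UNKNOWN"}");
        }
    }
}
```

In an Aspire test, the updates could come from an adapter that iterates `ResourceNotificationService.WatchAsync` and yields `(evt.ResourceId, evt.Snapshot.State?.Text)`, the same members the StateDumper in the original post reads. That adapter is left out here because it depends on the Aspire packages.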

> This really feels like it doesn't belong in Aspire internals honestly

Yeah, I agree the test-framework-specific functionality doesn't belong in Aspire, but there may be some building blocks needed to enable this functionality. I'm planning on exploring that this week and will report back with what I find.

> Perhaps a similar approach could be taken for resource notifications, such that one line could set up a subscriber that makes it easy to inspect the resource notifications as current state, e.g. by capturing to a local variable.

My only concern here is that this requires you to a) know you should be capturing these things and b) have it turned on before the issue happens. It's not so much a problem for CI, where you don't have live access to the state, but it is problematic locally. Especially as a lot of these issues can be racy and not easily reproducible: if I've got the debugger to break on a timeout exception, I want to be able to go straight in and see high-level state, rather than needing to turn on verbose logging and then hope I can reproduce the issue next time around. In #5632 I've had a play around with exposing resource events through a debugger display attribute. I'm not convinced that's quite the right approach, but having the resource event data accessible just by hovering over the distributed application is incredibly powerful. This is something else I'm hoping to explore some more. Whilst useful for local dev, it's far from the full story as it doesn't help in CI.

davidfowl Jan 22, 2025
Maintainer

I feel now like we need to do 3 things:

Make the resource logging, telemetry capture and resource notifications log to the console. This is our lowest common denominator.
Make it possible to use the dashboard during test runs.
Make it possible to load a test run into the dashboard after the fact to visualize the traces.

PS: I've been debugging tests recently and it is absolutely maddening to look at the console output.

cc @leslierichardson95 @JamesNK

DamianEdwards Jan 22, 2025
Maintainer

> Make the resource logging details and robust to the console. This is our lowest common denominator.

Can you fix the typo in this as I'm having trouble actually understanding what the "to do" for this one is 😄

The others, I agree.

davidfowl Jan 22, 2025
Maintainer

Fixed
