Disabling monitoring of system actors has no effect on Live Actors and Actor Start/Stop panels #86

Closed
object opened this issue Jan 16, 2025 · 26 comments
Labels
monitoring 📈 Monitoring issues

@object

object commented Jan 16, 2025

Finally we can see our customized F# actor type names (thanks for fixing that so quickly), so our actor types now show up in the "Live Actors by Type" and "Actor Starts and Stops" panels. But the panels mostly show system actors, even though monitoring of them is disabled with the following code:

        builder
            .SetMonitorUserActors(true)
            .SetMonitorSystemActors(false)
            .SetMonitorEventStream(true)
            .SetMonitorMailboxDepth(false)

For example, the "Live Actors by Type" panel looks like this:

Akka.Remote.Transport.ProtocolStateActor 112
Akka.Remote.EndpointReader 112
Akka.Remote.EndpointWriter 112
Akka.Remote.ReliableDeliverySupervisor 110
upload-globalconnect-file 71
upload-globalconnect-file-eventpublisher 71
upload-mediaset-controller 52
Akka.Cluster.Sharding.Shard 30

Only 3 of the 8 actor types are our own actors; the rest are system actors that we have no interest in collecting metrics for.
Shouldn't SetMonitorSystemActors(false) disable their monitoring so they don't appear in the dashboards? If not, would it be possible to support monitoring filters, so we can configure which actor types should be excluded?

@Aaronontheweb
Member

Hi @object, it looks like the issue here is that Akka.NET itself controls how the ActorStarted (et al.) metrics are emitted. The MonitorSystemActors setting has historically only been used to control Phobos's own emissions. We can just filter those on the Phobos side so you don't see any of them in your charts either. I'll get on that today.

@Aaronontheweb Aaronontheweb added the monitoring 📈 Monitoring issues label Jan 16, 2025
@Aaronontheweb
Member

This is fixed in our latest dev build, but we're going to test it in our lab first.

@Aaronontheweb
Member

Did some testing of this and it still needs more time to cook: we're going to have to make a patch to Akka.NET in order to have this work consistently. In this particular instance, shutting off all /system actor metrics also suppressed metrics for /system actors that we do want metrics for, such as Akka.Cluster.Sharding and its entities.

@Aaronontheweb Aaronontheweb reopened this Jan 20, 2025
@Aaronontheweb
Member

Aaronontheweb commented Jan 20, 2025

We basically need to push the decision about which metrics get recorded to the edges (the actors themselves), not the accumulator we use to produce the metric.

@Aaronontheweb
Member

This issue has been properly resolved in Phobos 2.8.1: https://phobos.petabridge.com/articles/releases/RELEASE_NOTES.html

@object
Author

object commented Jan 22, 2025

@Aaronontheweb Thanks for the quick update. I have rebuilt our services with Akka 1.5.36 and Phobos 2.8.1. Some of the system actors listed earlier no longer appear in the dashboard, but other actors from the Akka namespace are still there. Here's what I see now:

Akka.Persistence.Journal.AsyncWriteJournal+Resequencer 1127618
Akka.Persistence.Extras.PersistenceSupervisor 113 (this one is part of the Akka Extras packages)
Akka.Cluster.Sharding.ShardRegion 28
Akka.Cluster.Sharding.Shard 21
Akka.Cluster.Tools.PublishSubscribe.Internal.Topic 15

How is an actor classified as a system actor? By namespace or from a predefined list? Could the decision be made based on the namespace?

@Aaronontheweb
Member

It's classified by where the actor sits in the hierarchy: actors that fall under the /system hierarchy have both metrics and tracing disabled by default. The exceptions are some of the Akka.Cluster.Sharding actors and the DistributedPubSub actors.

We can probably disable metrics on those by default too. Normally we treat the sharding / DistributedPubSub actors like /user actors because users have requested traceability there, but we do have the ability to tell them to "trace like /user actors but do metrics like /system actors".
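To make the hierarchy-based rule concrete, here is a minimal F# sketch of how such a path-based decision could look. It is not Phobos's actual code or API: `MonitoringPolicy`, `classify`, the `shardingMetrics` flag and the exact root paths (`/system/sharding`, `/system/distributedPubSubMediator`) are all assumptions made for illustration. Passing `true` mirrors the current default described above (sharding and DistributedPubSub actors treated like /user actors); passing `false` gives the "trace like /user actors but do metrics like /system actors" variant.

```fsharp
open System

/// Whether an actor should be traced and/or emit metrics (hypothetical type).
type MonitoringPolicy = { Trace: bool; Metrics: bool }

/// True when `path` is `root` itself or one of its descendants.
let isUnder (root: string) (path: string) =
    path = root || path.StartsWith(root + "/", StringComparison.Ordinal)

/// ASSUMED roots of the /system actors that get /user-style tracing.
/// The real list used by Phobos may differ.
let tracedSystemRoots = [ "/system/sharding"; "/system/distributedPubSubMediator" ]

/// Hypothetical path-based classification, not the Phobos implementation.
/// `shardingMetrics` is an illustrative opt-in flag, not a real Phobos setting.
let classify (shardingMetrics: bool) (path: string) : MonitoringPolicy =
    if isUnder "/user" path then
        // /user actors: full tracing and metrics.
        { Trace = true; Metrics = true }
    elif tracedSystemRoots |> List.exists (fun root -> isUnder root path) then
        // Trace like /user actors, emit metrics only when opted in.
        { Trace = true; Metrics = shardingMetrics }
    else
        // Every other /system actor: neither tracing nor metrics.
        { Trace = false; Metrics = false }

// Example usage with made-up actor paths.
let examplePaths =
    [ "/user/upload-mediaset-controller"
      "/system/sharding/files/3/entity-42"
      "/system/some-internal-actor" ]

for path in examplePaths do
    printfn "%s -> %A" path (classify false path)
```

Matching on whole path segments in `isUnder`, rather than a raw string prefix, keeps a hypothetical `/username` actor from being mistaken for a descendant of `/user`.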

@Aaronontheweb
Member

The one on your list that interests me most is Akka.Persistence.Journal.AsyncWriteJournal+Resequencer; that makes me wonder if it's actually an Akka.NET bug.

@Aaronontheweb
Member

Well I'll be damned, that is an Akka.NET bug for certain: https://github.com/akkadotnet/akka.net/blob/759e93f55f66e79d8356f6ad1526144126d0b479/src/core/Akka.Persistence/Journal/AsyncWriteJournal.cs#L84

How on earth did you end up with 100k of them though?

@object
Author

object commented Jan 22, 2025

It's the first time I've seen Akka.Persistence.Journal.AsyncWriteJournal+Resequencer in that list, and I would have noticed it earlier given its huge number of instances. That must certainly be a bug; it can't be that high. We don't have that many live actors, and only a fraction of them are persistent.

@Aaronontheweb
Member

> It's the first time I've seen Akka.Persistence.Journal.AsyncWriteJournal+Resequencer in that list, and I would have noticed it earlier given its huge number of instances. That must certainly be a bug; it can't be that high. We don't have that many live actors, and only a fraction of them are persistent.

Well, Phobos is doing its job here then, letting us know about real problems with the framework itself 😂

@Aaronontheweb
Member

@object with previous versions of Phobos, did you see a huge number of journal actors too, or just resequencers?

@object
Author

object commented Jan 22, 2025

Using "/system" to classify an actor as a system actor makes sense, even though it would be nice to have a way to exclude cluster sharding actors. Perhaps a separate flag, e.g. SetMonitorShardManagementActors? But that's not crucial.

@object
Author

object commented Jan 22, 2025

> @object with previous versions of Phobos, did you see a huge number of journal actors too, or just resequencers?

It's the first time I've seen such a huge number here. Usually it's at most a couple of hundred actors of a given type, which is correct. Our system is about media file management, not stock trading. During peak times we may have a few thousand files handled simultaneously; most of the time it's just hundreds.

@Aaronontheweb
Member

I'm wondering if there's a journal implementation that creates huge numbers of them during ActorSystem recovery, OR if the journal is restarting rapidly and spamming a bunch of them.

@Aaronontheweb
Member

Filed an issue here: akkadotnet/akka.net#7480 - which Akka.Persistence journal are you using?

@object
Author

object commented Jan 22, 2025

It's SQL Server. But I'm seeing a huge number of exceptions after upgrading. Checking what's going on.

@object
Author

object commented Jan 22, 2025

Could not load file or assembly 'Microsoft.Bcl.AsyncInterfaces, Version=8.0.0.0, Culture=neutral, PublicKeyToken=cc7b13ffcd2ddd51'. The system cannot find the file specified.

This is exactly the same error I saw after trying to update Akka.Management to version 1.5.35. I filed a separate bug for that:

akkadotnet/Akka.Management#3079

But now it occurs in Akka.Persistence.Sql.Journal.SqlWriteJournal.

And it may explain the absurd number of instances. I will roll back to the previous version.

@object
Author

object commented Jan 22, 2025

@Aaronontheweb
I believe it was caused by updating Akka.Persistence.Sql from 1.5.30 to 1.5.35, but I have no clue why that suddenly caused a problem with Microsoft.Bcl.AsyncInterfaces. As I already said, the same thing happened with Akka.Management when I upgraded it from 1.5.33 to 1.5.35.

@Aaronontheweb
Member

I think we've gotten to the bottom of this - some of our packages are not dual targeted:

  1. Akka.Persistence.Sql
  2. Akka.Discovery.Azure

Our BCL upgrade to 8.x is creating problems for these packages. We're going to fix this by shipping a dual-targeted version.

@object
Author

object commented Jan 22, 2025

Rolling back helped. No exceptions and Akka.Persistence.Journal.AsyncWriteJournal+Resequencer disappeared from the list.

@Aaronontheweb
Member

It looks like the issue was the BCL upgrade we did only for the .NET Standard packages. We'll address this in a new release of Akka.NET, but for now we can also fix it by pushing some package upgrades and ensuring that they are dual-targeted.

@object
Author

object commented Jan 22, 2025

I see. That explains it. Was it only two packages (Akka.Persistence.Sql and Akka.Discovery.Azure) that were affected?

@Aaronontheweb
Member

> I see. That explains it. Was it only two packages (Akka.Persistence.Sql and Akka.Discovery.Azure) that were affected?

Those are the first two we've found so far, but basically any package that isn't dual-targeted has this issue. The other thing I'll do is downgrade the BCL version back to 6.0.

@Aaronontheweb
Member

This will stop the Resequencer spam in the future: akkadotnet/akka.net#7481

@Aaronontheweb
Member

The DLL hell issue will be resolved via akkadotnet/akka.net#7482


2 participants