
Add worker pool for log pushers #1499

Open · wants to merge 3 commits into main
Conversation

@jefchien (Contributor) commented Jan 13, 2025

Description of the issue

One of the bottlenecks for agent throughput is that each target (log group/stream) has a single-threaded pusher that blocks until it has successfully sent the request or has run out of retries. This limits the number of requests that can be made and caps the size and rate of growth of the tailed log file.

Description of changes

By supporting concurrency in the sender, the agent is able to prepare and send batches more efficiently.

  • Moved pusher.go into a pusher package and split the queue/batch functionality from the sender.
  • Added a simple WorkerPool that queues up tasks, which are picked up by the first available worker.
  • Added a Concurrency field. If configured, the CloudWatch Logs output will use a shared pool of workers to send PutLogEvents (PLE) requests.
  • Changed perEventHeaderBytes from 200 back to 26. The value was arbitrarily changed as part of Add New Metrics For Where Customer Are Using The Agent #913 and does not match the PLE specification it was meant to be taken from.

The maximum batch size is 1,048,576 bytes. This size is calculated as the sum of all event messages in UTF-8, plus 26 bytes for each log event.
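The batch accounting described above can be sketched as follows. The constants come from the PLE limits quoted in the description; the helper names (`eventSize`, `fits`) are illustrative, not the PR's actual API.

```go
package main

import "fmt"

const (
	maxBatchSizeBytes   = 1048576 // PutLogEvents limit: 1 MiB per request
	perEventHeaderBytes = 26      // fixed per-event overhead defined by the PLE API
)

// eventSize returns the bytes an event contributes to a batch:
// the UTF-8 length of the message plus the fixed 26-byte header.
func eventSize(message string) int {
	return len(message) + perEventHeaderBytes
}

// fits reports whether adding the event would keep the batch within the limit.
func fits(currentBatchBytes int, message string) bool {
	return currentBatchBytes+eventSize(message) <= maxBatchSizeBytes
}

func main() {
	fmt.Println(eventSize("hello"))     // 5 + 26 = 31
	fmt.Println(fits(1048550, "hello")) // false: would exceed 1,048,576
}
```

With a 200-byte header, batches would fill far earlier than the API actually requires, which is why reverting to 26 matters for throughput.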

License

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Tests

Added unit tests.

One example from the unit tests for comparison is TestPusher. With 50 ms of latency and 100,000 events, the sender pool configuration with 5 workers took about one-third of the time of the single-threaded sender.

    --- PASS: TestPusher/WithSender (0.67s)
    --- PASS: TestPusher/WithSenderPool (0.23s)

Benchmarks in progress.

Requirements

Before committing the code, please complete the following steps.

  1. Run make fmt and make fmt-sh
  2. Run make lint

@jefchien jefchien requested a review from a team as a code owner January 13, 2025 15:20
Split up queue/batch from sender.
translator/tocwconfig/sampleConfig/log_filter.conf (outdated, resolved)
For each configured target (log group/stream), the output plugin maintains a queue for log events that it batches.
Once each batch is full or the flush interval is reached, the current batch is sent using the PutLogEvents API to Amazon CloudWatch.

When concurrency is enabled, the pusher uses a shared worker pool to allow multiple concurrent sends.
Contributor:

why isn't this default behavior? Default of like min(4, # of cores)

Are we considering it experimental until we get some real world feedback?

@jefchien (Author) replied Jan 13, 2025:

Yeah, that's the current plan.

    workerCount atomic.Int32
    wg          sync.WaitGroup
    stopCh      chan struct{}
    stopped     atomic.Bool
Contributor:

my initial thought is mixing atomics and channels is kind of weird

Author:

Do you mean the stopped and the stopCh? I think you're right. I can probably remove the atomic.Bool and just use the channel. I initially just had stopped and didn't remove it once I added the channel.
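The simplification discussed here can be sketched as follows: once shutdown is signaled by closing `stopCh`, a non-blocking receive makes the stopped state observable from any goroutine, so the separate `atomic.Bool` is redundant. The type and method names are hypothetical.

```go
package main

import "fmt"

// stopper signals shutdown by closing a channel. A closed channel is
// observable by every goroutine, replacing a separate atomic.Bool flag.
type stopper struct {
	stopCh chan struct{}
}

func newStopper() *stopper {
	return &stopper{stopCh: make(chan struct{})}
}

// Stop signals shutdown; safe to call once.
func (s *stopper) Stop() {
	close(s.stopCh)
}

// Stopped reports whether Stop has been called, via a non-blocking receive.
func (s *stopper) Stopped() bool {
	select {
	case <-s.stopCh:
		return true
	default:
		return false
	}
}

func main() {
	s := newStopper()
	fmt.Println(s.Stopped()) // false
	s.Stop()
	fmt.Println(s.Stopped()) // true
}
```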

    client := c.createClient(logThrottleRetryer)
    agent.UsageFlags().SetValue(agent.FlagRegionType, c.RegionType)
    agent.UsageFlags().SetValue(agent.FlagMode, c.Mode)
    if containerInsightsRegexp.MatchString(t.Group) {
Contributor:

I thought container insights doesn't send through cloudwatch output plugin? The agent only uses emf?

Contributor:

Oh I see you just moved it from createClient()...

Author:

Historically, we had functionality where if the log group matched the regex (^/aws/.*containerinsights/.*/(performance|prometheus)$), then we'd count it as container insights. I don't think this path is used anymore, but we can verify that and clean it up in a separate PR.
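The historical matching behavior described in the reply can be demonstrated with the regex quoted above; the sample log group names are made up for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// containerInsightsRegexp is the historical pattern quoted in the review:
// log groups matching it were counted as Container Insights traffic.
var containerInsightsRegexp = regexp.MustCompile(`^/aws/.*containerinsights/.*/(performance|prometheus)$`)

func main() {
	fmt.Println(containerInsightsRegexp.MatchString("/aws/containerinsights/my-cluster/performance")) // true
	fmt.Println(containerInsightsRegexp.MatchString("/aws/lambda/my-function"))                       // false
}
```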

        go p.worker()
    }

    func (p *workerPool) worker() {
Contributor:

super nit: worker() func doesn't have a comment

Author:

I'll add one. I need to clean up the comments.

    )

    type Queue interface {
        AddEvent(e logs.LogEvent)
Contributor:

Can you provide an example of when we would add a blocking event vs non blocking?

Author:

This is also legacy functionality. The AddNonBlockingEvent was used for EMF logs. We've moved that over to OTEL, but don't know if there are any customers using TOML to still send EMF logs this way.
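The difference between the two adds can be sketched with a buffered channel: a blocking add applies backpressure to the producer, while a non-blocking add drops the event when the queue is full. The `eventQueue` type and `LogEvent` struct here are simplified stand-ins, not the agent's actual types.

```go
package main

import "fmt"

type LogEvent struct{ Message string }

type eventQueue struct {
	ch chan LogEvent
}

// AddEvent blocks until the queue has room, applying backpressure to the
// caller (e.g. the file tailer).
func (q *eventQueue) AddEvent(e LogEvent) {
	q.ch <- e
}

// AddNonBlockingEvent enqueues if there is room and otherwise drops the
// event, so latency-sensitive producers (historically, EMF) never stall.
func (q *eventQueue) AddNonBlockingEvent(e LogEvent) bool {
	select {
	case q.ch <- e:
		return true
	default:
		return false
	}
}

func main() {
	q := &eventQueue{ch: make(chan LogEvent, 1)}
	q.AddEvent(LogEvent{"first"})                          // fills the buffer
	fmt.Println(q.AddNonBlockingEvent(LogEvent{"second"})) // false: queue full
}
```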

    })

    if err == nil {
        m.logger.Debugf("successfully created log stream %v", t.Stream)
Contributor:

shouldn't we log the error instead?

Author:

The error is nil here, so there's nothing to log.

    "concurrency": {
        "description": "The number of concurrent workers available for cloudwatch logs export",
        "type": "integer",
        "minimum": 1
Contributor:

should we define a max?

Author:

We didn't for the X-Ray concurrency field:

    "concurrency": {
        "description": "Maximum number of concurrent calls to AWS X-Ray to upload documents",
        "type": "integer",
        "minimum": 1
    },
