perf: fix for cluster aggregation performance #631
base: master
@@ -9,8 +9,11 @@
  */

 const Registry = require('./registry');
-const { Grouper } = require('./util');
+const { Grouper, hashObject } = require('./util');
 const { aggregators } = require('./metricAggregators');
+const fs = require('fs');
+const path = require('path');
+const os = require('os');
 // We need to lazy-load the 'cluster' module as some application servers -
 // namely Passenger - crash when it is imported.
 let cluster = () => {
@@ -175,19 +178,28 @@ function addListeners() {
       request.done(new Error(message.error));
       return;
     }

-    message.metrics.forEach(registry => request.responses.push(registry));
-    request.pending--;
-
-    if (request.pending === 0) {
-      // finalize
-      requests.delete(message.requestId);
-      clearTimeout(request.errorTimeout);
-
-      const registry = AggregatorRegistry.aggregate(request.responses);
-      const promString = registry.metrics();
-      request.done(null, promString);
-    }
+    fs.readFile(message.filename, 'utf8', (err, data) => {
+      if (err) {
+        request.done(err);
+        return;
+      } else {
+        const metrics = JSON.parse(data);
+        metrics.forEach(registry => request.responses.push(registry));
+        fs.unlink(message.filename, e => {
+          if (e)
+            console.error(`Error deleting file ${message.filename}:`, e);
+        });
+        request.pending--;
+        if (request.pending === 0) {
+          // finalize
+          requests.delete(message.requestId);
+          clearTimeout(request.errorTimeout);
+          const registry = AggregatorRegistry.aggregate(request.responses);
+          const promString = registry.metrics();
+          request.done(null, promString);
+        }
+      }
+    });
   }
 });
 }
@@ -198,10 +210,32 @@ function addListeners() {
   if (message.type === GET_METRICS_REQ) {
     Promise.all(registries.map(r => r.getMetricsAsJSON()))
       .then(metrics => {
-        process.send({
-          type: GET_METRICS_RES,
-          requestId: message.requestId,
-          metrics,
-        });
+        metrics.forEach(registry => {
+          registry.forEach(value => {
+            const hash = hashObject(value);
+            const key = `${value.metricName}_${hash}`;
+            value.hash = key;
+          });
+        });
+        // adding request id in file path to handle concurrency
+        const filename = path.join(
+          os.tmpdir(),
+          `metrics-${process.pid}-${message.requestId}.json`,
+        );
+        fs.writeFile(filename, JSON.stringify(metrics), err => {
+          if (err) {
+            process.send({
+              type: GET_METRICS_RES,
+              requestId: message.requestId,
+              error: err.message,
+            });
+          } else {
+            process.send({
+              type: GET_METRICS_RES,
+              requestId: message.requestId,
+              filename,
+            });
+          }
+        });
       })
       .catch(error => {

Comment on lines +220 to +225:

Reviewer: I'm really hesitant to involve files, since that suddenly involves the page cache, the file system, more syscalls, and possibly other surprises like filesystem permissions, running out of disk space, and cleanup. How much of an improvement did this yield? Did you benchmark before or after Node.js v18.6.0, which has this optimization? How much disk I/O load did your system have when benchmarking? I think IPC is "supposed" to be faster than involving the filesystem.

PR author: The benchmarks I performed are on our own application; the screenshots of the results are shared in the linked issue.

PR author: In these benchmarks (issue link) the scrape interval was every 5 seconds and the throughput on the app was about 1100-1200 RPS.

PR author: Hi @zbjornson @SimenB, we went live in production with these changes about 3 months ago and everything is working fine for us. We are not seeing any additional latency since this change. We serve more than 100,000 RPS at peak for this service. If required, I can share the production results as well.
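
The thread above hinges on whether IPC or a temp file is actually the bottleneck. Below is a minimal, hypothetical micro-benchmark sketch of how that could be measured in isolation; the payload shape and size, file naming, and message types are all assumptions for illustration, not prom-client internals.

```js
// bench-ipc-vs-file.js -- hypothetical micro-benchmark, not part of this PR.
// Round-trips one large metrics-like payload per mode: once over IPC
// (process.send) and once via a temp file plus a small IPC notification.
const cluster = require('cluster');
const fs = require('fs');
const os = require('os');
const path = require('path');

if (cluster.isPrimary) {
  const worker = cluster.fork();
  let t0;
  worker.on('message', msg => {
    if (msg.type === 'ready') {
      t0 = Date.now();
      worker.send({ type: 'ipc' });
    } else if (msg.type === 'ipc') {
      console.log(`IPC round trip: ${Date.now() - t0} ms (${msg.metrics.length} metrics)`);
      t0 = Date.now();
      worker.send({ type: 'file' });
    } else if (msg.type === 'file') {
      const metrics = JSON.parse(fs.readFileSync(msg.filename, 'utf8'));
      fs.unlinkSync(msg.filename);
      console.log(`File round trip: ${Date.now() - t0} ms (${metrics.length} metrics)`);
      worker.kill();
    }
  });
} else {
  // Fake payload roughly shaped like getMetricsAsJSON() output.
  const payload = Array.from({ length: 5000 }, (_, i) => ({
    name: `metric_${i % 50}`,
    values: [{ value: Math.random(), labels: { pid: process.pid, route: `/r/${i % 100}` } }],
  }));
  process.on('message', msg => {
    if (msg.type === 'ipc') {
      // Whole payload crosses the IPC channel.
      process.send({ type: 'ipc', metrics: payload });
    } else if (msg.type === 'file') {
      // Payload goes to a temp file; only the filename crosses IPC.
      const filename = path.join(os.tmpdir(), `bench-${process.pid}.json`);
      fs.writeFile(filename, JSON.stringify(payload), () => {
        process.send({ type: 'file', filename });
      });
    }
  });
  process.send({ type: 'ready' });
}
```

Running something like this on the same hardware, Node.js version, and disk-load conditions as the application benchmarks would help separate the transport cost from the application's own load.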
About IPC

Could this be a customizable option that keeps the current behavior as the default? Users could then choose between file-based and IPC communication, and there would be no breaking change.

Instead of the worker sending a fresh filename for every request, could each worker keep a permanent file (named with its pid) and use IPC only to notify the master to read it? The first line of the file could carry the requestId for reconciliation, or fs.watchFile could be used instead of a notification. A rough sketch of this idea follows below.

Instead of a file, could a Unix socket or a named pipe (where available) give better performance?
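
A rough, hypothetical sketch of the permanent-file-per-worker suggestion. The message type string and field names mimic the shapes used in this PR but are illustrative only, and the worker side assumes it runs as a cluster worker with an IPC channel.

```js
// Hypothetical sketch of the "permanent file per worker" suggestion.
// Each worker owns one stable temp file; the first line carries the requestId
// so the master can reconcile which scrape the contents belong to.
const fs = require('fs');
const os = require('os');
const path = require('path');

// Worker side: write metrics to the worker's own file, notify the master over IPC.
const workerFile = path.join(os.tmpdir(), `prom-metrics-${process.pid}.json`);

function publishMetrics(requestId, metrics) {
  const body = `${requestId}\n${JSON.stringify(metrics)}`;
  fs.writeFile(workerFile, body, err => {
    if (err) {
      process.send({ type: 'prom-client:getMetricsRes', requestId, error: err.message });
    } else {
      // Only a tiny notification crosses the IPC channel.
      process.send({ type: 'prom-client:getMetricsRes', requestId, filename: workerFile });
    }
  });
}

// Master side: read the worker's file and check the requestId on the first line.
function readWorkerMetrics(filename, expectedRequestId, done) {
  fs.readFile(filename, 'utf8', (err, data) => {
    if (err) return done(err);
    const newline = data.indexOf('\n');
    const requestId = data.slice(0, newline);
    if (requestId !== String(expectedRequestId)) {
      // Stale contents from an earlier scrape; the caller can retry or ignore.
      return done(new Error(`stale metrics file for request ${expectedRequestId}`));
    }
    done(null, JSON.parse(data.slice(newline + 1)));
  });
}

module.exports = { publishMetrics, readWorkerMetrics };
```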
About hashing

The master process should only be used to start and restart workers and to aggregate metrics, so it should not need much CPU compared to the workers that handle the actual workload. Moving the hashing into the workers adds a little extra load to every worker at each scrape, so with many workers and a short scrape interval performance could suffer. So for this feature as well, could prom-client be configurable to hash either in the master or distributed across the workers, keeping the current behavior as the default so there is no breaking change? A hypothetical sketch of such an option is below.
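
This is a sketch only: the option name `hashInWorkers` and the sha1-based `hashObject` stand-in are assumptions for illustration, not existing prom-client APIs, and prom-client's real hashObject derives its key differently.

```js
// Sketch of an opt-in flag deciding where per-value hashes are computed.
// Default (false) preserves the current behavior, where the master derives
// grouping keys during aggregation; true precomputes them in the worker.
const crypto = require('crypto');

// Stand-in for prom-client's internal hashObject(); the real one differs.
function hashObject(labels) {
  return crypto.createHash('sha1').update(JSON.stringify(labels)).digest('hex');
}

function prepareWorkerMetrics(metrics, options = { hashInWorkers: false }) {
  if (!options.hashInWorkers) return metrics; // default: the master keeps doing the hashing
  for (const metric of metrics) {
    for (const value of metric.values || []) {
      // Precompute the grouping key that the master would otherwise derive
      // for every value on every scrape.
      value.hash = `${metric.name}_${hashObject(value.labels)}`;
    }
  }
  return metrics;
}

module.exports = { prepareWorkerMetrics };
```

The worker's GET_METRICS_REQ handler would call prepareWorkerMetrics(metrics, options) just before serializing its response, so the default path stays identical to today's behavior.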