-
Notifications
You must be signed in to change notification settings - Fork 101
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle duplicated metrics gracefully #79
Comments
Damn! When this happens, are there scrape errors? |
Indeed.
How can I tell if there are scrape errors? I've only looked in the graphite_exporter log file, but did not see anything out of the ordinary. I'll try and look in the prometheus log files tomorrow.
Looks normal AFAIKT. Hard to tell if metrics are missing from
Nothing out of the ordinary
Did not try with the debug flag. Can do that tomorrow as well. |
On the Prometheus server, the |
Also noticed this:
I've never seen these messages before and I don't know what they mean. |
To be clear, those are coming from the exporter, not Prometheus? |
The log entries are from the exporter (if I understood the question correctly). They came after about 3 hours of runtime. |
That's probably a client_golang bug then, @beorn7 I can see this happening if you have a gzipped response, and then there's a problem encoding a metric family. |
If there is enough "evidence", then I suggest removing the release for now. |
The question is what's causing the underlying error. Could you add some extra logging around that bit of code? |
I was able to poll the
I've seen different metric names and labels reported, and different number of errors from time to time Ideally, this should not be written to the |
The error there is expected and correct from an exposition standpoint. The question is what's causing it inside the graphite exporter. |
Still, some error lines in the logfile when this happens would have been nice. BTW, I enabling debug level logging, but could not find anything useful in the log. |
The error happens at scrape time, so the exporter main doesn't have access
to the error. Logging from within the library would be good but I don't
know what a good interface for this would be.
The error you posted is input and mapping dependent. Could you share your
mapping configuration, and any inputs that could get mapped to these metric
names? I suspect we are keeping something in the exporter keyed on
different Graphite metric names that then maps down to the same Prometheus
time series.
…On Thu, Feb 28, 2019, 21:18 Paul Nyheim ***@***.***> wrote:
Still, some error lines in the logfile when this happens would have been
nice.
BTW, I enabling debug level logging, but could not find anything useful in
the log.
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#79 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAICBgHOv74zP_6GY1VqT5z1oOdvIAOxks5vSDmDgaJpZM4bWzE8>
.
|
This likely happened before but was silently handled (in an undefined way)
by the client library. We came from an ancient version, and it has grown a
lot more validations since then.
…On Thu, Feb 28, 2019, 21:31 Matthias Rampke ***@***.***> wrote:
The error happens at scrape time, so the exporter main doesn't have access
to the error. Logging from within the library would be good but I don't
know what a good interface for this would be.
The error you posted is input and mapping dependent. Could you share your
mapping configuration, and any inputs that could get mapped to these metric
names? I suspect we are keeping something in the exporter keyed on
different Graphite metric names that then maps down to the same Prometheus
time series.
On Thu, Feb 28, 2019, 21:18 Paul Nyheim ***@***.***> wrote:
> Still, some error lines in the logfile when this happens would have been
> nice.
>
> BTW, I enabling debug level logging, but could not find anything useful
> in the log.
>
> —
> You are receiving this because you commented.
>
>
> Reply to this email directly, view it on GitHub
> <#79 (comment)>,
> or mute the thread
> <https://github.com/notifications/unsubscribe-auth/AAICBgHOv74zP_6GY1VqT5z1oOdvIAOxks5vSDmDgaJpZM4bWzE8>
> .
>
|
Sorry this looks like my carelessness during bumping up client_golang. FYI a similar error happened when running e2e test, hence this fix 3d09334. Indeed this is a behaviour change in client_golang. I feel it could be addressed by not creating new metric every time in Lines 192 to 201 in 3d09334
originalName (like in statsd_exporter).
However I do have a question in mind and I'm not very familiar with the graphite protocol: if two metric with the same name and labels arrives between two prometheus scrape, the current behaviour will be exporter only export the later value (since they are Gauge). Is this expected? |
Yes, that's the semantics of the exporter and Prometheus. It's a little
more complex in real Graphite because one of them could come with an older
timestamp but we don't support that kind of backfilling. We only export the
latest received sample.
…On Mon, Mar 4, 2019, 07:24 Wangchong Zhou ***@***.***> wrote:
Sorry this looks like my carelessness during bumping up client_golang. FYI
a similar error happened when running e2e test, hence this fix 3d09334
<3d09334>.
Indeed this is a behaviour change in client_golang. I feel it could be
addressed by not creating new metric every time in
https://github.com/prometheus/graphite_exporter/blob/3d093347bac809689ebd31a344a4fbe9a10c9b35/main.go#L192-L201,
instead we can cache the metric instance for same originalName (like in
statsd_exporter).
However I do have a question in mind and I'm not very familiar with the
graphite protocol: if two metric with the same name and labels arrives
between two prometheus scrape, the current behaviour will be exporter only
export the later value (since they are Gauge). Is this expected?
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#79 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAICBnzV63kbxFAG3Hs7a9v0MthdE6T_ks5vTLwZgaJpZM4bWzE8>
.
|
Sorry for late follow-up. I think @brian-brazil is right here. If the encoding fails somewhere in the middle, we cannot send an HTTP error anymore. I'll think about a way to detect this and handle better. |
Further update from my own testing / findings to date:
I assume because we're avoiding the client exporer error By extending the window of time I can avoid it with our data's behavior. Certainly if logging debug output from the prometheus library is possible, it would be handy in the future to debug this stuff easier, as I have no real way of knowing what is tripping up the exporter client library. I think above @fffonion is onto something, when suggesting this above:
It seems I can get a working pipeline but only 2x the duration I set with |
I don't think keeping persistent metric instances around is the best way. It fixes the issue incidentally because it forces us to dedup, but we can also dedup and then generate a constmetric. At least as a direct fix for the issue at hand, what if we add straight-up deduplication to this loop? Basically, remember which That would also allow logging the duplicated samples, without relying on being able to instrument the client library. As an alternative approach, just above that loop we're copying samples out of the |
As you perform the deduplicate loop, can you look at the age of the entry vs what is already staged for passing into the Prometheus client exporter duplicate name wise? Design wise the goal is to ensure 'newer' entries of more than one duplicate found always end up on top as what is exported, a kind of 'bubbles up' sortation wise during that decision making. After all records are iterated over, only the 'newest' entry wins for passing up then. This feature could be an 'arg' like For us right now, our plan for production today sadly is to set a very long retention time like Hope this helps. Happy to test anything out on my side, let me know thx. |
I don't understand why a longer expiry helps? |
In our case we're receiving data from field devices in part, which have sporadic connectivity. Think of it as a IoT type use case in part. So we're getting a mixture of 'new' and also sometimes older graphite data. It is all mixed together. The majority of our data is 'fresh' - but it appears to be the sporadic elements that pose the problem for us. We really only care about 'latest' data, but this is a pinch that contrasts how say graphite vs prometheus handles this particular situation. We find once that mixed set of older/newer data hits most of it is dropped by graphite-exporter (presumably the prometheus exporter library angry about duplicates?). I can't get any logging from the scrape exporter layer to confirm or understand my pinch surface area, which seems to be a limitation of how the prometheus export client is being used presently. From looking at source, and increasing the expiration time via From a high level strategy, we're at a point now where Graphite isn't capable of scaling, and we're migrating to Prometheus. This is a two step approach, first replacing Graphite's role as is by using graphite-exporter to get that data in Prometheus, and then the slower refactoring and implementation of data collection fixtures (push vs poll etc) with the legacy graphite approach perhaps entirely being displaced in time. Is there anything else I can gather or do to help out? |
So just to close out, we did much more investigation on our end and found that several origin data points graphite data was being 'duplicated' for prior teams convenience in the past - those two feeds were being sent into our single graphite-exporter - and when faced with those duplicate points at times (or delayed x vs y arrival times), graphite-exporter was 'doing its job'. The issue above still stands, that any duplicates HTTP/503 version 0.5.0 - so we'll use 0.4.2 until fixed. Of course not having logging output from its prometheus client exporter made finding the root cause hard, so a +1 on if that can ever be exposed logging wise. I think the concept or idea of a Thanks again for your consideration and efforts to date! |
Thank you for the update!
I think deduplication need not be optional, the failure mode when ingesting
duplicate data is too bad. I don't see when you would _ever_ want to accept
a duplicate (and then fail on scrape).
…On Tue, Mar 12, 2019 at 6:28 PM rtkgjacobs ***@***.***> wrote:
So just to close out, we did much more investigation on our end and found
that several origin data points graphite data was being 'duplicated' for
prior teams convenience in the past - those two feeds were being sent into
our single graphite-exporter - and when faced with those duplicate points
at times (or delayed x vs y arrival times), graphite-exporter was 'doing
its job'. The issue above still stands, that any duplicates 503 version
0.5.0 - so we'll use 0.4.2 until fixed. Of course not having logging output
from its prometheus client exporter made finding the root cause hard, so a
+1 on if that can ever be exposed logging wise.
I think the concept or idea of a --graphite.dedupe makes sense as a
safety flag or legacy use case where someone wants to firehose some legacy
data in and not spend the immense effort to ensure its 100% clean, but
we've found our biggest pinch getting this to prod. It's also forcing a
better design pattern where we scrape from major sources vs trying to use a
single point ingress (which is not very prometheus-proper i'll agree) and
likely run a single graphite-exporter far too hot.
Thanks again for your consideration and efforts to date!
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#79 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAICBqKBffybWzRN8jzCFBLZtxHcK6Pgks5vV_G5gaJpZM4bWzE8>
.
|
I am seeing this issue with Apache Storm and the built-in graphite reporter (https://storm.apache.org/releases/current/javadocs/org/apache/storm/metrics2/reporters/GraphiteStormReporter.html). With quick debugging it seems that parallelism in Apache Storm causes duplicate metrics to be sent to Graphite exporter (each have the same key):
I would be happy by either being able to ignore duplicates or group by with operator. At the moment the behaviour is quite strict and causes graphite-exporter to go into failure state when duplicates are found. (tested with docker tag master and latest) |
if anyone wants to pick this up, please do. as for implementation, one way out (that I would prefer long-term) could be to extract the relevant parts of the statsd exporter as we already do for the mapper itself. Logically, the graphite exporter does pretty much the same as the statsd exporter, except it only accepts gauges and the line format is slightly different. they have diverged a long time ago, and in the end, we keep fixing the same issues independently in both. Strategically, I would like to factor out all the reusable parts (listener, mapper[done], collector, …) into separate packages within the statsd exporter repo, and then only plug them together with a different line parser here. |
There are more thoughts on what I have in mind for the split here prometheus/statsd_exporter#234 (comment) |
Summary:
The exporter remembers metrics by Graphite name. It is possible to get into situations where multiple samples (graphite metrics) result in the same Prometheus metric.
At the moment, this makes the exporter catastrophically fail (return error 500 on
/metrics
). Instead, it should discard the older sample and use the new sample's value.Original issue description:
We tried using v0.5.0 in our dev env today, upgrading from 0.4.2.
After the upgrade, metrics seem to be dropping out regularly, as can be clearly seen in this grafana dashboard.
So something is definitively wrong here.
There are no changes to the environment or config between 0.4.2 and 0.5.0, and we can not see anything abnormal in the graphite_exporter logfiles.
This looks like a critical bug to us.
It should be noted that we have seen similar behavior before where we had UDP packet drops, but this was resolved by tuning the UDP receive buffer as explained here: https://nav.uninett.no/doc/4.3/howto/tuning_graphite.html
The text was updated successfully, but these errors were encountered: