Perl script blocking itself on multiple icinga events #8

Open
ronindesign opened this issue Feb 18, 2016 · 10 comments

@ronindesign

ronindesign commented Feb 18, 2016

When Icinga triggers multiple issues, the NotificationCommand "notify-service-by-pagerduty" fires multiple times.
One of the calls succeeds and takes the lock on the file /tmp/pagerduty_icinga/lockfile.
All of the other instances of notify-service-by-pagerduty fail, with their shell script exiting on the following error:

/var/log/icinga/icinga.log:

[2016-02-18 13:22:38 -0800] warning/PluginNotificationTask: Notification command for object 'celli.sports-it.com!apt' (PID: 15295, arguments: 'sh' '-c' '/usr/local/bin/pagerduty_icinga.pl enqueue -f pd_nagios_object=service') terminated with exit code 11, output: pagerduty_icinga[15297]: flock /tmp/pagerduty_icinga/lockfile failed: Resource temporarily unavailable
Resource temporarily unavailable at /usr/local/bin/pagerduty_icinga.pl line 221.

/var/log/syslog:

pagerduty_icinga[15297]: flock /tmp/pagerduty_icinga/lockfile failed: Resource temporarily unavailable

This happens because each Icinga event triggers an enqueue on pagerduty_icinga.pl, which internally calls (or tries to call) the method 'lock_and_flush_queue'. Only one instance gets the lock; the others fail to acquire it and exit with the error above.
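
The error text ("Resource temporarily unavailable") is what a non-blocking lock attempt reports when another process already holds the lock. As a rough illustration only, the same behaviour can be reproduced outside the Perl script with flock(1) from util-linux. The function name and the temporary lock file below are made up for the demo and are not part of pagerduty_icinga.pl:

```shell
#!/bin/sh
# Reproduce the contention with flock(1) from util-linux.
# The lock file is a throwaway temp file, not the real pagerduty_icinga lock.
demo_lock_contention() {
    lock="$(mktemp)"
    # First process: take the lock and hold it for 3 seconds.
    flock "$lock" sleep 3 &
    holder=$!
    sleep 1   # give the holder time to acquire the lock
    # Second process: -n means "do not wait"; it fails at once if the lock is held.
    if flock -n "$lock" true; then
        echo "lock acquired"
    else
        echo "lock busy, giving up"
    fi
    wait "$holder"
    rm -f "$lock"
}
```

Calling `demo_lock_contention` should report the lock as busy for the second attempt, which mirrors what the extra notify-service-by-pagerduty instances run into.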

This is not a fatal issue. If I have my cron job set up correctly, the remaining entries will be sent 1 minute later, when 'pagerduty_icinga.pl flush' is called.
However, this is still not ideal. The pagerduty_icinga.pl enqueue process should either only enqueue (without attempting a flush, and thus blocking itself), or it should implement some kind of timeout / keepalive option in the Perl script for the 'lock_and_flush_queue' section.

These processes finish almost immediately, so a keepalive would only need to last a few seconds, after which the calls could still be allowed to fail. That would give a small buffer in which multiple calls could be made in quick succession.
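
One way to approximate that short wait without touching the Perl script is a small wrapper that serializes the calls. This is only a sketch under assumptions: it relies on flock(1) from util-linux, and the wrapper lock path, the 10-second wait, and the function name are all hypothetical. It deliberately uses a separate lock file, because holding the script's own /tmp/pagerduty_icinga/lockfile from the wrapper would make the script's internal lock attempt fail.

```shell
#!/bin/sh
# Hypothetical wrapper: queue up concurrent notification commands behind a
# separate lock so only one pagerduty_icinga.pl instance runs at a time.
# WRAPPER_LOCK and WAIT_SECONDS are assumptions, not options of the script.
WRAPPER_LOCK="${WRAPPER_LOCK:-/tmp/pagerduty_icinga_wrapper.lock}"
WAIT_SECONDS="${WAIT_SECONDS:-10}"

run_serialized() {
    # Wait up to WAIT_SECONDS for the wrapper lock, then run the given command.
    # If the wait times out, flock exits non-zero and the command is not run.
    flock -w "$WAIT_SECONDS" "$WRAPPER_LOCK" "$@"
}

# Example (command line taken from the log above):
# run_serialized /usr/local/bin/pagerduty_icinga.pl enqueue -f pd_nagios_object=service
```

The cron-driven 'pagerduty_icinga.pl flush' would presumably need to go through the same wrapper, otherwise it can still collide with an enqueue call on the script's own lock.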

@oryagel

oryagel commented Jun 4, 2017

I'm having the same issue.
I added a cron job as well to trigger pagerduty_icinga.pl flush, but I still have cases where incidents are not created due to the same locking.
In the syslog I can see the following errors: flock /tmp/pagerduty_icinga/lockfile failed: Resource temporarily unavailable

@ronindesign
Author

No edits to this repo in a couple of years, so I doubt this will get a fix. I personally never resolved this.

@oryagel

oryagel commented Jun 4, 2017

Thanks. I also opened a ticket with PD, let's wait for their thoughts. Did the cron job workaround work for you, or did you still get these locks?

I find it strange that these two products don't have a better integration.

@ronindesign
Author

So I worked on this over a year ago, but if I remember correctly:

Using the cron job doesn't fix the lock issue (or the resulting errors); they still happen regardless.
For me, a lock conflict simply meant the rest of the queue was delayed by 1 minute (until the cron job runs again).
Otherwise, all entries were sent and nothing was missed; some were just delayed.

I hope that makes sense? I can elaborate further if needed.

@oryagel

oryagel commented Jun 25, 2017

PagerDuty are working on a new integration using the Nagios agent. It should be out shortly.

@ChrisHeerschap

So has the new integration using the nagios agent been released?

@ronindesign
Author

Looks like they have Nagios integration now. I did a quick search and came up with:
https://www.pagerduty.com/docs/guides/nagios-perl-integration-guide/
https://www.pagerduty.com/docs/guides/nagios-integration-guide/

@ChrisHeerschap

I found that first one as well, but it references https://github.com/PagerDuty/pagerduty-nagios-pl which says:

Latest commit 6fecda3 on Jul 28, 2014

That's even older than this repo.

Looks like it might be https://github.com/PagerDuty/pdagent-integrations ... wonder how much of a pain it'll be to make that work with Icinga2.

@ronindesign
Author

ronindesign commented Mar 13, 2018

Looks like there is already some icinga2 support built in:
PagerDuty/pdagent-integrations@5f675d9
PagerDuty/pdagent-integrations#23
https://www.pagerduty.com/docs/guides/icinga2-integration-guide/

EDIT: added some links

@lpossamai

lpossamai commented Jan 9, 2019

I am also having this issue. Just like @ronindesign mentioned. Was anybody able to find a solution for this?

UPDATE: Wrong permissions on folder /tmp/pagerduty_icinga.

chown nagios:nagios -R /tmp/pagerduty_icinga fixed it.
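
Before changing ownership, it may help to confirm that permissions are really the problem. The sketch below is a hypothetical helper (the function name is made up); it only checks whether the user running it can write to the queue directory, so it should be run as the same user that executes the notification command, assumed here to be nagios as in the chown command above.

```shell
#!/bin/sh
# Report whether the current user can create files in the queue directory.
# The default path comes from the log lines in this thread.
check_queue_dir() {
    dir="${1:-/tmp/pagerduty_icinga}"
    if [ ! -d "$dir" ]; then
        echo "missing: $dir"
        return 1
    fi
    if [ -w "$dir" ] && [ -x "$dir" ]; then
        echo "ok: $(id -un) can write to $dir"
    else
        echo "not writable by $(id -un): $dir"
        ls -ld "$dir"   # show the current owner and mode
        return 1
    fi
}
```

If the check fails for that user while the directory is owned by someone else (for example root, after a manual test run), the chown above is the likely fix.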
