Replace reboot-notifier cron email notification with an Icinga check
I'd love to receive less email, and to have our monitoring dashboard be the place to go to know what we have to do on our systems. Also, the Icingaweb2 dashboard always knows the current state of things, and has a concept of state transition, as opposed to a set of emails.
#2 Updated by bertagaz almost 2 years ago
- Assignee changed from bertagaz to intrigeri
- QA Check changed from Info Needed to Dev Needed
bertagaz, what do you think? Please reassign to me for implementation if you agree, I'll use it as a way to test the fancy new monitoring setup doc.
Why not? That will probably not solve the emails problem, as Icinga2 will send some too, unless you intend to disable notifications for that. But I like the "one place to rule them all" idea. :) Hope the doc won't be too fuzzy.
- Status changed from Confirmed to In Progress
- % Done changed from 0 to 10
Here's a plan:
- use the check_file_age plugin with its ignore-missing option, pointing it to the flag file (/run/reboot-required), and tweaking the --critical-age settings passed to the check_file_age plugin (to start with I'll make it so that reboot needed = warning as soon as detected, and reboot needed becomes critical after 48 hours)
- test that this new check works as intended
- disable email notification from
- if we still receive too much email from Icinga about this (I think I was wrong when I wrote "Icinga2 sends email only on state change" above), tweak the notification settings for this check
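The state mapping described in the first step of the plan can be sketched as follows. This is only a model of the intended policy, not the check_file_age plugin itself: the function names, the constants and the strict "older than" comparison are assumptions made for illustration.

```python
import os
import time
from typing import Optional

# Conventional monitoring plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2

# Assumption taken from the plan: critical once the flag is older than 48 hours.
CRITICAL_AGE_SECONDS = 48 * 3600


def flag_file_age(path: str = "/run/reboot-required",
                  now: Optional[float] = None) -> Optional[float]:
    """Return the age of the flag file in seconds, or None if it is missing."""
    try:
        mtime = os.stat(path).st_mtime
    except FileNotFoundError:
        return None
    current = time.time() if now is None else now
    return current - mtime


def reboot_check_state(flag_age_seconds: Optional[float],
                       critical_age: float = CRITICAL_AGE_SECONDS) -> int:
    """Map the flag file's age to a state.

    Missing file  -> OK (no reboot needed; the "ignore missing" behaviour).
    File present  -> WARNING as soon as it is detected.
    Older than critical_age -> CRITICAL.
    """
    if flag_age_seconds is None:
        return OK
    if flag_age_seconds > critical_age:
        return CRITICAL
    return WARNING
```

For example, reboot_check_state(flag_file_age()) gives the state for the local machine under these assumptions.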
- % Done changed from 20 to 30
disable email notification from
if we still receive too much email from Icinga about this (I think I was wrong when I wrote "Icinga2 sends email only on state change" above), tweak the notification settings for this check
By default, we'll receive 1 email/day from Icinga2 for each service that needs rebooting (just like with reboot-notifier), but that'll start only 48h after the reboot need is identified. So the sysadmin on duty now has a good chance to reboot systems, or to acknowledge the problem if they have a good reason to postpone the reboots, before Icinga2 starts spamming the whole team. If we want to change the notification rate, we need to add a new notification type to templates/monitoring/notifications.conf.erb, with conditionals on a custom vars.$something we could set in services.
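To make the timing concrete, here is a rough sketch of when emails would be expected for one unrebooted, unacknowledged system. It assumes that only the critical state triggers notifications and that they repeat exactly every 24 hours; the function and constant names are hypothetical, and real Icinga2 timing also depends on check intervals.

```python
from datetime import datetime, timedelta
from typing import List

# Assumptions based on the comment above, not read from any configuration.
CRITICAL_AFTER = timedelta(hours=48)   # check turns critical 48h after detection
RENOTIFY_EVERY = timedelta(days=1)     # then one email per day


def expected_notifications(detected_at: datetime, until: datetime) -> List[datetime]:
    """List the times an email would be expected between detection and `until`,
    assuming nobody reboots the system or acknowledges the problem."""
    times = []
    t = detected_at + CRITICAL_AFTER
    while t <= until:
        times.append(t)
        t += RENOTIFY_EVERY
    return times
```

Under these assumptions nothing is sent during the first 48 hours, which is the window the sysadmin on duty has to reboot or acknowledge.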
- Assignee changed from intrigeri to groente
- % Done changed from 30 to 50
- QA Check set to Ready for QA
/run/reboot-required on bridge.lizard in order to test that the check works. I'll wait 48h to make sure it switches to critical in due time.
It did switch to critical after 48h, we got a notification about it, and then one of us (who might not have followed the discussion here and thus was perhaps not aware it was part of an experiment) rebooted that VM, so the check switched back to normal, which is expected.
So I think we're done here. If we ever want to fine-tune the notification rate, see #11598#note-11.