Monitor if isobuilders systems are running fine
We have had periods where our isobuilders gradually all went down because a branch was triggering the OOM killer during its build.
We should use our monitoring system to check, using systemd and/or anything else, whether the isobuilder systems are running fine, so that we know when we have to restart them or their jenkins-slave service.
- Status changed from Confirmed to In Progress
- Assignee changed from bertagaz to intrigeri
- % Done changed from 0 to 50
- QA Check set to Ready for QA
- Feature Branch set to puppet-tails:feature/11858-monitor-systemd
pynagsystemd sounds like a good candidate. I'll give this one a try.
I've committed everything in the dedicated branch, merged it into master and deployed that. We now have a systemd check on all agents, as we discussed in #13582. To test it, find a check that will run soon and make one service fail on the related host (e.g. by misconfiguring and restarting it so that it fails to start). You will then see an alert in icinga2 about this service failing.
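For reference, here is a minimal sketch of what such a systemd check does. It is not the actual pynagsystemd code; the function names are hypothetical, and it only assumes the standard Nagios/Icinga plugin exit codes (0 = OK, 2 = CRITICAL, 3 = UNKNOWN) and the output of systemctl list-units.

```python
import subprocess

# Standard Nagios/Icinga plugin exit codes.
OK, CRITICAL, UNKNOWN = 0, 2, 3


def parse_failed_units(systemctl_output):
    """Extract unit names from `systemctl list-units --state=failed` output.

    Expects the --no-legend format: one unit per line, name in the first
    column. A leading bullet marker (printed without --plain) is skipped.
    """
    units = []
    for line in systemctl_output.splitlines():
        fields = line.split()
        if fields and fields[0] == "\u25cf":
            fields = fields[1:]
        if fields:
            units.append(fields[0])
    return units


def evaluate(units):
    """Map a list of failed units to a (exit code, status line) pair."""
    if units:
        return CRITICAL, "SYSTEMD CRITICAL: failed units: " + ", ".join(units)
    return OK, "SYSTEMD OK: no failed units"


def main():
    """Hypothetical plugin entry point: query systemd and report."""
    try:
        result = subprocess.run(
            ["systemctl", "list-units", "--state=failed",
             "--no-legend", "--plain"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError) as exc:
        print(f"SYSTEMD UNKNOWN: {exc}")
        return UNKNOWN
    code, message = evaluate(parse_failed_units(result.stdout))
    print(message)
    return code
```

A plugin script would end with sys.exit(main()). Note that this only reports units that systemd itself considers failed; a service that systemd believes started successfully is not caught.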
- Assignee changed from groente to bertagaz
- QA Check changed from Ready for QA to Dev Needed
As reported by groente today, apparently this does not work for the jenkins-slave service, which is precisely the one that made us create this ticket in the first place.
Reproduced again: isotester4 was offline in Jenkins for ~1.5 days but the jenkins-slave service was seen as successfully started by systemd.
The error reported was "Error: Invalid or corrupt jarfile /var/run/jenkins/slave.jar". So I guess this ticket should be blocked by a new one about making the jenkins-slave service able to report its state reliably.
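One possible direction for that new ticket, sketched under assumptions: since a JAR is a ZIP archive, a small check could verify that the jarfile is readable before the agent is started, or as an extra monitoring check. The function names below are hypothetical, and where this would be wired in (the service's start script, a systemd ExecStartPre=, or an icinga2 check) is left open.

```python
import zipfile

# Standard Nagios/Icinga plugin exit codes.
OK, CRITICAL = 0, 2


def jar_is_usable(path):
    """Return True if `path` exists and is an intact ZIP/JAR archive."""
    try:
        with zipfile.ZipFile(path) as jar:
            # testzip() returns the first corrupt member, or None if all
            # members pass their CRC check.
            return jar.testzip() is None
    except (OSError, zipfile.BadZipFile):
        # Missing, unreadable, or not a ZIP archive at all.
        return False


def check_jar(path):
    """Map the jarfile state to a (exit code, status line) pair."""
    if jar_is_usable(path):
        return OK, f"JAR OK: {path} is readable"
    return CRITICAL, f"JAR CRITICAL: {path} is missing or corrupt"
```

For example, check_jar("/var/run/jenkins/slave.jar") would return a CRITICAL status in the situation above. This only covers the corrupt jarfile case seen here, not every way the agent can end up offline in Jenkins.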