
Bug #11858

Monitor if isobuilders systems are running fine

Added by bertagaz about 1 year ago. Updated 14 days ago.

Status:
In Progress
Priority:
Normal
Assignee:
Category:
Infrastructure
Target version:
Start date:
10/03/2016
Due date:
% Done:

50%

QA Check:
Dev Needed
Feature Branch:
puppet-tails:feature/11858-monitor-systemd
Type of work:
Sysadmin
Blueprint:
Starter:
Yes
Affected tool:

Description

We have seen cases where our isobuilders all slowly went down when a branch triggered the OOM killer during its build.

We should use our monitoring system, relying on systemd and/or anything else, to check whether the isobuilder systems are running fine, so that we know when we have to restart them or their jenkins-slave service.


Related issues

Related to Tails - Bug #11632: ISO builds from branch that need more RAM can break all our Jenkins isobuilders without us being notified Resolved 08/11/2016
Related to Tails - Bug #12009: Jenkins ISO builders are highly unreliable Resolved 12/01/2016
Related to Tails - Bug #13582: Monitoring bridge Duplicate 08/04/2017
Blocks Tails - Feature #13242: Core work 2017Q4: Sysadmin (Maintain our already existing services) Confirmed 06/29/2017

History

#1 Updated by intrigeri about 1 year ago

  • Assignee set to bertagaz

(Assuming that's what you meant given you've set a target version.)

#2 Updated by intrigeri about 1 year ago

  • Related to Bug #11632: ISO builds from branch that need more RAM can break all our Jenkins isobuilders without us being notified added

#3 Updated by bertagaz about 1 year ago

  • Target version changed from Tails_2.7 to Tails_2.9.1

#4 Updated by intrigeri about 1 year ago

  • Related to Bug #12009: Jenkins ISO builders are highly unreliable added

#5 Updated by anonym 12 months ago

  • Target version changed from Tails_2.9.1 to Tails 2.10

#6 Updated by anonym 11 months ago

  • Target version changed from Tails 2.10 to Tails_2.11

#7 Updated by bertagaz 9 months ago

  • Target version changed from Tails_2.11 to Tails_2.12

#8 Updated by bertagaz 9 months ago

  • Target version changed from Tails_2.12 to Tails_3.0

#9 Updated by bertagaz 8 months ago

  • Target version changed from Tails_3.0 to Tails_3.1

#10 Updated by bertagaz 7 months ago

  • Target version changed from Tails_3.1 to Tails_3.2

#11 Updated by intrigeri 6 months ago

  • Blocks Feature #13233: Core work 2017Q3: Sysadmin (Maintain our already existing services) added

#12 Updated by groente 4 months ago

  • Starter set to Yes

A simple check whether

systemctl --quiet is-failed \*

returns 0 (in which case something is wrong) should do the trick, both for the isobuilders and for #13582.
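groente's suggestion could be wired up roughly as follows. This is a minimal sketch, not the deployed check: the function name `check_failed_units` and the Nagios/Icinga-style exit codes (0 = OK, 2 = CRITICAL) are my assumptions, not something from puppet-tails.

```shell
# Hypothetical wrapper around groente's idea. Note the inverted logic:
# `systemctl is-failed PATTERN` exits 0 when at least one matching unit
# is in the "failed" state, and non-zero when none is.
check_failed_units() {
    if systemctl --quiet is-failed '*'; then
        echo "CRITICAL: at least one systemd unit has failed"
        return 2
    fi
    echo "OK: no failed systemd units"
    return 0
}
```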

#13 Updated by groente 4 months ago

#14 Updated by intrigeri 4 months ago

systemctl is-system-running might do exactly what we want.
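For reference, `systemctl is-system-running` prints the overall system state ("running" when all units started cleanly, "degraded" when at least one unit has failed, plus transient states such as "starting") and exits 0 only for "running". A sketch of how a check could map that onto monitoring exit codes; the function name and the exact state-to-severity mapping are assumptions of mine:

```shell
# Hypothetical check mapping the overall systemd state to
# Nagios-style exit codes (0 = OK, 1 = WARNING, 2 = CRITICAL).
check_system_state() {
    state=$(systemctl is-system-running 2>/dev/null)
    case "$state" in
        running)  echo "OK: system state is running"; return 0 ;;
        degraded) echo "CRITICAL: system state is degraded"; return 2 ;;
        *)        echo "WARNING: system state is ${state:-unknown}"; return 1 ;;
    esac
}
```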

#15 Updated by bertagaz 3 months ago

  • Target version changed from Tails_3.2 to Tails_3.3

#16 Updated by bertagaz 3 months ago

  • Blocks deleted (Feature #13233: Core work 2017Q3: Sysadmin (Maintain our already existing services))

#17 Updated by bertagaz 3 months ago

  • Blocks Feature #13242: Core work 2017Q4: Sysadmin (Maintain our already existing services) added

#18 Updated by bertagaz about 2 months ago

pynagsystemd sounds like a good candidate. I'll give this one a try.

#19 Updated by bertagaz about 2 months ago

  • Status changed from Confirmed to In Progress
  • Assignee changed from bertagaz to intrigeri
  • % Done changed from 0 to 50
  • QA Check set to Ready for QA
  • Feature Branch set to puppet-tails:feature/11858-monitor-systemd

bertagaz wrote:

pynagsystemd sounds like a good candidate. I'll give this one a try.

I've committed everything in the dedicated branch, merged it into master and deployed it. We now have a systemd check on all agents, as we discussed in #13582. To test it, just find one check that will be run soon, and set one service as failing on the related host (e.g. by misconfiguring it and restarting it so that it fails to start). You'll then see an alert in icinga2 about this service failing.

#20 Updated by intrigeri about 2 months ago

  • Assignee changed from intrigeri to groente

(As per "Shifts for 2018Q1 + intrigeri's involvement in the sysadmin team".)

#21 Updated by intrigeri about 1 month ago

As reported by groente today, apparently this does not work for the jenkins-slave service, which is precisely the one that made us create this ticket in the first place.

#22 Updated by anonym 27 days ago

  • Target version changed from Tails_3.3 to Tails_3.5

#23 Updated by intrigeri 14 days ago

  • Assignee changed from groente to bertagaz
  • QA Check changed from Ready for QA to Dev Needed

intrigeri wrote:

As reported by groente today, apparently this does not work for the jenkins-slave service, which is precisely the one that made us create this ticket in the first place.

Reproduced again: isotester4 was offline in Jenkins for ~1.5 days, but the jenkins-slave service was seen as successfully started by systemd. jenkins-slave.log said Error: Invalid or corrupt jarfile /var/run/jenkins/slave.jar. So I guess this ticket should be blocked by a new one about making the jenkins-slave service report its state reliably.
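One way such a failure could be surfaced to systemd (and hence to the new check) would be a pre-start sanity check on the jar, e.g. as an ExecStartPre= step for the jenkins-slave unit, so a truncated or corrupt download makes the unit fail visibly. This is only a sketch under that assumption; the function name is hypothetical:

```shell
# Hypothetical sanity check for the downloaded slave.jar. A jar is a
# zip archive, so a plausible file is non-empty and starts with the
# zip magic bytes "PK". Returning non-zero from an ExecStartPre= step
# would put the unit into the "failed" state that monitoring sees.
jar_looks_valid() {
    jar="$1"
    [ -s "$jar" ] || return 1          # missing or empty file
    [ "$(head -c 2 "$jar")" = "PK" ]   # zip/jar magic number
}
```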
