Feature #8652

Feature #5734: Monitor servers

Feature #9484: Deploy the monitoring setup to production

Evaluate how the initial monitoring setup behaves and adjust things accordingly

Added by intrigeri over 3 years ago. Updated about 2 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: Infrastructure
Target version:
Start date: 01/09/2015
Due date:
% Done: 100%
QA Check: Pass
Feature Branch:
Type of work: Sysadmin
Blueprint:
Starter:
Affected tool:

Description

This includes evaluating how much disk space it costs us to save downtime information.


Related issues

Blocked by Tails - Feature #8650: Configure monitoring for the most critical services Resolved 01/09/2015

History

#1 Updated by intrigeri over 3 years ago

  • Blocks Feature #8653: Configure monitoring for other high-priority services added

#2 Updated by intrigeri about 3 years ago

  • Blocks deleted (Feature #8653: Configure monitoring for other high-priority services)

#3 Updated by intrigeri about 3 years ago

  • Assignee changed from bertagaz to Dr_Whax
  • Target version changed from Tails_1.8 to Tails_1.5
  • Parent task changed from #5734 to #9482

#5 Updated by intrigeri about 3 years ago

  • Blocked by Feature #8650: Configure monitoring for the most critical services added

#6 Updated by intrigeri about 3 years ago

  • Blocks Feature #8653: Configure monitoring for other high-priority services added

#7 Updated by intrigeri about 3 years ago

  • Target version changed from Tails_1.5 to Tails_1.6

#8 Updated by bertagaz almost 3 years ago

  • Target version changed from Tails_1.6 to Tails_1.7

#9 Updated by Dr_Whax almost 3 years ago

  • Target version changed from Tails_1.7 to Tails_1.8

Since we should test it out and I'm traveling soon, maybe we should postpone this a bit.

#10 Updated by intrigeri almost 3 years ago

  • Description updated (diff)

#11 Updated by intrigeri over 2 years ago

  • Assignee changed from Dr_Whax to bertagaz
  • Target version changed from Tails_1.8 to Tails_2.0

#12 Updated by bertagaz over 2 years ago

  • Target version changed from Tails_2.0 to Tails_2.2

Postponing this part of the monitoring setup, as it is unlikely to be done by the previously planned deadline.

#14 Updated by bertagaz over 2 years ago

  • Target version changed from Tails_2.2 to Tails_2.3

#15 Updated by bertagaz about 2 years ago

  • Status changed from Confirmed to In Progress
  • % Done changed from 0 to 20

I guess this starts with the release of notifications, and thus has started with #8651.

#16 Updated by bertagaz about 2 years ago

  • Parent task changed from #9482 to #9484

To me this is a subtask of #9484

#17 Updated by bertagaz about 2 years ago

I wanted to roll back this last parenting change. I initially wanted to reorganize a bunch of ticket parents to clarify the difference between the prototype and the production state, which IMO the Redmine parenting hierarchy hardly represents. But I get 'Parent task is invalid' errors while reparenting other tickets, and now I can't reparent this ticket back...

#18 Updated by bertagaz about 2 years ago

Apart from the notifications, on the check side it seems the whisperback one is not reliable, probably because of Tor network troubles reaching the hidden service.

This is the main source of problems we can see in the history: apt-snapshot-disk has been resolved, and autotest-s* were "artificial" tests of the notifications.

This is actually the main cause of notification spam: false positives caused by network problems. Even the HTTP checks are quiet at the moment (no notifications), with only occasional false positives (mostly the 10-second timeout being reached).

I'm unsure how to solve the whisperback one though. The hidden service network overhead seems to ensure we'll have false positives. I tried raising the {check,retry}_interval to higher values (c=20m and r=10m), but it didn't lead to fewer notifications. I'll investigate a bit more on this side.
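
A note on why longer intervals alone may not help, as a rough sketch rather than a diagnosis: assuming Icinga 2's usual soft/hard state handling, a problem notification is sent only after `max_check_attempts` consecutive failed checks, which is a separate setting from the two intervals changed above. The function below is hypothetical and assumes independent failures with a made-up failure rate, which Tor network trouble probably does not satisfy.

```python
def hard_failure_probability(p_fail: float, max_check_attempts: int) -> float:
    """Chance that max_check_attempts consecutive checks all fail.

    Assumes each check fails independently with probability p_fail
    (an illustrative assumption, not a measured value).
    """
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must be between 0 and 1")
    if max_check_attempts < 1:
        raise ValueError("max_check_attempts must be at least 1")
    return p_fail ** max_check_attempts


if __name__ == "__main__":
    # Hypothetical 20% per-check failure rate over the hidden service.
    for attempts in (1, 3, 5):
        print(attempts, hard_failure_probability(0.2, attempts))
```

Under these assumptions, each extra required attempt multiplies the chance of a false hard failure by `p_fail`; whether that holds for the whisperback check would need testing.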

We could also instead check locally that everything is running fine, with (yet to be defined) commands run from the agent on whisperback.li. We'd lose the network view, which OTOH is probably the main source of problems. But this may require a far more elaborate homebrew plugin, depending on what we would define as a correct local check...
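
To make the local-check idea concrete, here is a minimal sketch of what such a homebrew plugin could look like, assuming the "correct local check" is simply "the service accepts TCP connections on localhost". That definition, the function name, and any port passed to it are assumptions; the ticket does not define the check. The exit codes follow the usual Nagios-compatible plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
import socket

# Exit codes used by Nagios-compatible monitoring plugins.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_local_port(host: str, port: int, timeout: float = 5.0):
    """Try a TCP connection and return (exit_code, message).

    Hypothetical check: it only proves something is listening,
    not that the service behind the port works end to end.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return OK, f"OK - {host}:{port} accepts connections"
    except OSError as exc:
        # Covers refused connections and timeouts alike.
        return CRITICAL, f"CRITICAL - {host}:{port} unreachable: {exc}"
```

A plugin entry point would print the message and exit with the returned code. Since this runs on the host itself, it does not exercise Tor at all, which is exactly the trade-off described above.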

#19 Updated by bertagaz about 2 years ago

  • Target version changed from Tails_2.3 to Tails_2.4

#20 Updated by intrigeri about 2 years ago

  • Blocks deleted (Feature #8653: Configure monitoring for other high-priority services)

#21 Updated by bertagaz about 2 years ago

  • Target version changed from Tails_2.4 to Tails_2.5

#22 Updated by bertagaz about 2 years ago

  • Assignee changed from bertagaz to intrigeri
  • % Done changed from 20 to 50
  • QA Check set to Ready for QA

After about a month and a half in production, my conclusion is that it's working quite well. It has already pointed us to issues to fix, is useful during sysadmin shifts to spot things that need care, and doesn't mailbomb us. So I'd be in favor of closing this ticket.

#23 Updated by intrigeri about 2 years ago

  • Assignee changed from intrigeri to bertagaz
  • QA Check changed from Ready for QA to Dev Needed

After about a month and a half in production, my conclusion is that it's working quite well. It has already pointed us to issues to fix, is useful during sysadmin shifts to spot things that need care, and doesn't mailbomb us.

Yes!

So I'd be in favor of closing this ticket.

Yes… once the task that this ticket's description mentions has been done.

#24 Updated by bertagaz about 2 years ago

  • Assignee changed from bertagaz to intrigeri
  • QA Check changed from Dev Needed to Ready for QA

intrigeri wrote:

So I'd be in favor of closing this ticket.

Yes… once the task that this ticket's description mentions has been done.

We don't keep that much data, so I don't think disk space is an issue:

~# du -sh /var/lib/mysql/ /var/lib/icinga2
100M    /var/lib/mysql/
1.1M    /var/lib/icinga2

#25 Updated by intrigeri about 2 years ago

  • Status changed from In Progress to Resolved
  • QA Check changed from Ready for QA to Pass

We don't keep that much data, so I don't think disk space is an issue:

I doubt that this really includes "downtime information", since IIRC we had actually disabled storing such data. To clarify, what I meant when this was added to this ticket's description is: data that would allow us to compute per-service availability stats some day, i.e. when we want to identify unreliable services. But whatever, at this point I don't think it's a must, so let's forget about it.
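
For illustration only, the kind of per-service availability stat mentioned here could be computed from a history of state changes roughly as follows. The input format (time-ordered `(timestamp, is_up)` pairs) is a hypothetical simplification, not the schema Icinga 2 uses in its database.

```python
from typing import List, Tuple


def availability(transitions: List[Tuple[float, bool]], end: float) -> float:
    """Fraction of time a service was up between the first record and `end`.

    `transitions` is a time-ordered list of (timestamp, is_up) state
    changes; each state is assumed to last until the next record.
    """
    if not transitions:
        raise ValueError("need at least one state record")
    start = transitions[0][0]
    if end <= start:
        raise ValueError("end must be after the first record")
    up_time = 0.0
    # Pair each record with the time of the next one (or `end` for the last).
    next_times = [t for t, _ in transitions[1:]] + [end]
    for (t, is_up), t_next in zip(transitions, next_times):
        if is_up:
            up_time += t_next - t
    return up_time / (end - start)
```

Ranking services by this value over a chosen period would be one way to identify the unreliable ones, provided the state history is actually retained.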

#26 Updated by intrigeri about 2 years ago

  • Assignee deleted (intrigeri)
  • % Done changed from 50 to 100

#28 Updated by bertagaz about 2 years ago

intrigeri wrote:

I doubt that this really includes "downtime information", since IIRC we had actually disabled storing such data. To clarify, what I meant when this was added to this ticket's description is: data that would allow us to compute per-service availability stats some day, i.e. when we want to identify unreliable services.

It's stored in the MySQL database, so the data is not that large, as I reported, and we're good on this.
