Bug #11562

Monitor servers from the htpdate pools

Added by bertagaz over 1 year ago. Updated 4 days ago.

Status: Confirmed
Priority: Normal
Assignee:
Category: Time synchronization
Target version:
Start date: 07/14/2016
Due date:
% Done: 0%
QA Check:
Feature Branch:
Type of work: Sysadmin
Blueprint:
Easy:
Affected tool:

Description

While tackling #10494, it came up that some of the HTTP servers in the htpdate pools were buggy. This affects whether Tails boots correctly and whether our test suite runs reliably. We should monitor whether these servers are up and answering correctly to the curl requests made by htpdate, to ensure this service is reliable.
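
As an illustration of what "answering correctly" could mean, here is a minimal Python sketch that fetches a server's HTTP `Date` header and checks that it parses and is close to a reference clock. This is not htpdate's actual logic: the function names, the use of a HEAD request and the `MAX_OFFSET_SECONDS` threshold are all assumptions made for the example.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional
from urllib.request import Request, urlopen

# Hypothetical threshold: how far a server clock may drift before we flag it.
MAX_OFFSET_SECONDS = 3600


def fetch_date_header(url: str, timeout: float = 10.0) -> Optional[str]:
    """Return the raw Date header from a HEAD request, or None on any failure."""
    try:
        with urlopen(Request(url, method="HEAD"), timeout=timeout) as response:
            return response.headers.get("Date")
    except (OSError, ValueError):
        # Covers DNS, connection, TLS and HTTP errors, plus malformed URLs.
        return None


def date_header_offset(date_header: str, local_now: datetime) -> Optional[float]:
    """Return (server time - local time) in seconds, or None if unparsable."""
    try:
        server_time = parsedate_to_datetime(date_header)
    except (TypeError, ValueError):
        return None
    if server_time.tzinfo is None:
        # A naive result means no usable zone was given; assume UTC.
        server_time = server_time.replace(tzinfo=timezone.utc)
    return (server_time - local_now).total_seconds()


def is_healthy(date_header: Optional[str], local_now: datetime) -> bool:
    """A server counts as healthy if its Date header parses and is close to ours."""
    if not date_header:
        return False
    offset = date_header_offset(date_header, local_now)
    return offset is not None and abs(offset) <= MAX_OFFSET_SECONDS
```

A check run would call `is_healthy(fetch_date_header(url), datetime.now(timezone.utc))` for each server URL in the pools and record the result, rather than alerting on it.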


Related issues

Related to Tails - Bug #13472: Replace www.centos.org in htpdate pools Resolved 07/15/2017
Blocks Tails - Bug #10495: The 'the time has synced' step is fragile In Progress 11/06/2015
Blocks Tails - Feature #13242: Core work 2017Q4: Sysadmin (Maintain our already existing services) Confirmed 06/29/2017

History

#1 Updated by intrigeri over 1 year ago

Excellent idea!

The consequences of a failing check will likely need to be different from what we do for our own services: we can't fix the web servers that are in the HTP pools; all we can do is drop them from the pool in the next Tails release. So, what matters here is aggregated availability stats, rather than real-time up/down status info.

Email notifications would be useless noise, and as a sysadmin I'd rather not see info about such failures on our dashboard's "Current Incidents" page, if possible: sysadmins' duty does not include maintaining the HTP pools we use, and I don't want to train myself to ignore incidents.

But the RM (or the Foundations team?) needs to regularly check, e.g. at the beginning of each release cycle, if some servers in the pool are too unreliable, so that they can be replaced. How can they be given access to the aggregated availability stats they need to do this job? The easier their task, the greater the chances that it'll actually be done regularly.
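
To make "aggregated availability stats" concrete, here is a small Python sketch of the kind of summary that could be produced once per release cycle. The input format (a sequence of `(server, ok)` check results), the function names and the 95% threshold are assumptions for illustration, not an existing tool.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def availability_by_server(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Return the fraction of successful checks for each server."""
    totals: Dict[str, int] = defaultdict(int)
    successes: Dict[str, int] = defaultdict(int)
    for server, ok in results:
        totals[server] += 1
        if ok:
            successes[server] += 1
    return {server: successes[server] / totals[server] for server in totals}


def unreliable_servers(
    results: Iterable[Tuple[str, bool]], threshold: float = 0.95
) -> List[str]:
    """List servers whose availability is below the (hypothetical) threshold."""
    stats = availability_by_server(results)
    return sorted(server for server, ratio in stats.items() if ratio < threshold)
```

The output of `unreliable_servers()` is the short list of replacement candidates, which is all someone reviewing the pools needs to look at.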

#2 Updated by anonym about 1 year ago

  • Target version changed from Tails_2.6 to Tails_2.7

#3 Updated by bertagaz about 1 year ago

  • Target version changed from Tails_2.7 to Tails_2.9.1

#4 Updated by anonym 11 months ago

  • Target version changed from Tails_2.9.1 to Tails 2.10

#5 Updated by intrigeri 11 months ago

  • Target version changed from Tails 2.10 to Tails_2.11

#6 Updated by bertagaz 9 months ago

  • Target version changed from Tails_2.11 to Tails_2.12

#7 Updated by bertagaz 9 months ago

  • Target version changed from Tails_2.12 to Tails_3.0

#8 Updated by intrigeri 8 months ago

  • Type of work changed from Code to Sysadmin

#9 Updated by bertagaz 6 months ago

  • Target version changed from Tails_3.0 to Tails_3.1

#10 Updated by bertagaz 6 months ago

  • Target version changed from Tails_3.1 to Tails_3.2

#11 Updated by intrigeri 5 months ago

  • Blocks Feature #13233: Core work 2017Q3: Sysadmin (Maintain our already existing services) added

#12 Updated by intrigeri 4 months ago

  • Blocks Bug #10495: The 'the time has synced' step is fragile added

#13 Updated by bertagaz 4 months ago

  • Related to Bug #13472: Replace www.centos.org in htpdate pools added

#14 Updated by bertagaz 2 months ago

  • Target version changed from Tails_3.2 to Tails_3.3

#15 Updated by bertagaz 2 months ago

  • Blocks deleted (Feature #13233: Core work 2017Q3: Sysadmin (Maintain our already existing services))

#16 Updated by bertagaz 2 months ago

  • Blocks Feature #13242: Core work 2017Q4: Sysadmin (Maintain our already existing services) added

#17 Updated by bertagaz about 2 months ago

  • Target version changed from Tails_3.3 to Tails_3.4

#18 Updated by bertagaz 27 days ago

One idea about this: with #13541 and the merge of the feature/13541-save-more-data-on-htpdate-or-tor-failures branch, we're now collecting htpdate logs each time there's such a failure on our isotesters. We could gather these files and use them as a source to output statistics about server failures. That would give an overview closer to server failures in an almost real Tails context, rather than using basic URL fetching or coding some simulation of htpdate's behavior (depending on how we want to test these servers).

intrigeri wrote:

But the RM (or the Foundations team?) needs to regularly check, e.g. at the beginning of each release cycle, if some servers in the pool are too unreliable, so that they can be replaced. How can they be given access to the aggregated availability stats they need to do this job? The easier their task, the greater the chances that it'll actually be done regularly.

Then maybe there are different options:

  • It could be accessible through a web page, which could be hosted on www.lizard. That could even be the start of a status.t.b.o page, where we output such information and also publicly output Jenkins build statuses. Or maybe it could be combined with other types of stats on a metrics.t.b.o page?
  • Given the people we're talking about, and the impact it has on our test suite in Jenkins, maybe the tails-ci list is a good recipient. We could send email notifications there.

#19 Updated by bertagaz 4 days ago

  • Target version changed from Tails_3.4 to Tails_3.5
