Bug #10494

Retry htpdate when it fails

Added by anonym about 2 years ago. Updated 4 days ago.

Status:
In Progress
Priority:
Normal
Assignee:
Category:
Time synchronization
Target version:
Start date:
07/17/2016
Due date:
% Done:

100%

QA Check:
Dev Needed
Feature Branch:
bugfix/10494-retry-curl-in-htpdate
Type of work:
Code
Blueprint:
Easy:
Affected tool:

Description

After #9516 our automated test suite has started reporting a lot of failures of this kind:

    ./features/step_definitions/common_steps.rb:344:in `/^the time has synced$/'

because the number of Tor bootstrapping issues (which generally happen right before the point where the above error may appear) has decreased.

Just like with #9516, this is an actual problem in Tails, not just a test suite issue. We should add some retry logic around htpdate, possibly with a retry_for()-based shell wrapper. Alternatively, we could implement that kind of retry logic (retries + timeouts) in htpdate's Perl code.
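A minimal sketch of what such a wrapper could look like; the helper below only approximates the retry_for() function mentioned above, and the real Tails helper may differ:

```shell
#!/bin/sh
# Illustrative retry helper, assuming a retry_for()-style interface:
# keep re-running a command until it succeeds or a time budget expires.
retry_for() {
    timeout="$1"; shift                     # total time budget, in seconds
    deadline=$(( $(date +%s) + timeout ))
    until "$@"; do
        [ "$(date +%s)" -ge "$deadline" ] && return 1
        sleep 1                             # brief pause between attempts
    done
}

# Hypothetical usage: give htpdate (stubbed here with 'true') a 60 s budget.
retry_for 60 true && echo "time sync eventually succeeded"
```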


Subtasks

Bug #11575: Reintroduce and fix CapabilityBoundingSet in htpdate.service Rejected


Related issues

Related to Tails - Bug #11577: Fix htpdate pool and timeout Resolved 07/18/2016
Blocks Tails - Bug #10495: The 'the time has synced' step is fragile In Progress 11/06/2015
Blocked by Tails - Bug #11574: Investigate security consequences of restarting htpdate until it succeeds Rejected 07/17/2016

Associated revisions

Revision 6cc9a8cc (diff)
Added by bertagaz over 1 year ago

Raise a bit $allowed_per_pool_failure_ratio.

Refs: #10494

Revision 60e22421 (diff)
Added by bertagaz over 1 year ago

Revert "Raise a bit $allowed_per_pool_failure_ratio."

The previous ratio makes sense, and the problem doesn't really come from
this setting.

This reverts commit 6cc9a8cce3a2ccd2a60b3a1b423bf54ad3b05ec7.

Refs: #10494

Revision c6782127 (diff)
Added by bertagaz over 1 year ago

Use a shell wrapper to retry_tor() htpdate.

Refs: #10494

Revision d4bb25b0 (diff)
Added by bertagaz over 1 year ago

Remove CapabilityBoundingSet from htpdate service.

Otherwise the shell wrapper fails to send the NEWNYM signal to Tor, with a
permission denied error when opening the Tor control cookie.

Refs: #10494

Revision f7b65871 (diff)
Added by bertagaz over 1 year ago

Add a scenario to test htpdate restart on failure.

Refs #10494

Revision 5917e197 (diff)
Added by bertagaz over 1 year ago

Replace broken urls in HTP_POOL_PAL.

Chavez.indymedia.org is down, so it is replaced by espiv.net. The sarava.org URL
needs the www. prefix prepended to work correctly.

Refs: #10494

Revision 87cda086 (diff)
Added by bertagaz over 1 year ago

Replace www.rsa.com by www.rackspace.com in HTP_POOL_FOE.

It seems that this website does not like Tor clients much, and returns
500 to such HTTP requests. Let's use a more robust website.

Refs: #10494

Revision 37e19b99 (diff)
Added by bertagaz over 1 year ago

Replace broken urls in HTP_POOL_PAL.

Chavez.indymedia.org is down, so it is replaced by espiv.net. The sarava.org URL
needs the www. prefix prepended to work correctly.

Refs: #10494

Revision fa1ec1d0 (diff)
Added by bertagaz over 1 year ago

Replace www.rsa.com by www.rackspace.com in HTP_POOL_FOE.

It seems that this website does not like Tor clients much, and returns
500 to such HTTP requests. Let's use a more robust website.

Refs: #10494

Revision 2e10750b (diff)
Added by bertagaz over 1 year ago

More relevant variable name.

Refs: #10494

Revision e38ec9eb (diff)
Added by bertagaz over 1 year ago

Wording.

Refs: #10494

Revision c15b2c54 (diff)
Added by bertagaz over 1 year ago

Move retry_htpdate outside of $PATH.

Refs: #10494

Revision c4b04388 (diff)
Added by bertagaz over 1 year ago

Replace connect-timeout CURL option with max-time.

Refs: #10494, #11577

Revision 969b78e5 (diff)
Added by bertagaz 3 months ago

Add the ability to retry the curl invocation when fetching a URL fails (refs: #10494).

It's also possible to specify that a new Tor circuit must be requested
between each retry.

Revision cb295055
Added by anonym 3 months ago

Merge remote-tracking branch 'origin/bugfix/10494-htpdate-fix-date-header-regexp' into stable

Refs: #10494

Revision 5ee783bd (diff)
Added by bertagaz 3 months ago

Htpdate: fix newnym command syntax.

Better to use a list for command arguments when using system().

Refs: #10494

Revision db1a49fa (diff)
Added by bertagaz 3 months ago

Htpdate: do not send a newnym request or claim we'll retry when we're not going to.

Refs: #10494

Revision 283014a4 (diff)
Added by bertagaz 3 months ago

Htpdate: more explicit debug message.

Explain for which $url no file could be downloaded. Will help to debug
and maybe parse the logs.

Refs: #10494

History

#4 Updated by anonym about 2 years ago

  • Blocks Bug #10495: The 'the time has synced' step is fragile added

#5 Updated by anonym about 2 years ago

  • Related to Feature #9516: Restart Tor if bootstrapping stalls for too long added

#6 Updated by anonym about 2 years ago

  • Assignee set to kytv
  • Target version set to Tails_1.8

#7 Updated by intrigeri almost 2 years ago

On feature/jessie it might be simple to do that with tweaks to the systemd unit file (config/chroot_local-includes/lib/systemd/system/htpdate.service). I think this should be investigated before producing more Wheezy-area code.

#8 Updated by intrigeri almost 2 years ago

  • Target version changed from Tails_1.8 to Tails_2.0

Better postpone this and focus on your more urgent tasks :)

#9 Updated by intrigeri almost 2 years ago

It would be fine to postpone this. In any case, please prioritize your SponsorsM4 stuff (Icedove) higher.

#10 Updated by intrigeri almost 2 years ago

  • Target version changed from Tails_2.0 to Tails_2.2

I think it's too late to change such risky bits of our code base now for 2.0 => postponing.

#11 Updated by intrigeri almost 2 years ago

  • Target version changed from Tails_2.2 to Tails_2.4

intrigeri wrote:

Better postpone this and focus on your more urgent tasks :)

I think this is still valid.

#12 Updated by bertagaz over 1 year ago

  • Assignee changed from kytv to bertagaz

#13 Updated by anonym over 1 year ago

  • Assignee changed from bertagaz to anonym

Hopefully Chutney (#9521) will fix the tordate parts.

#14 Updated by anonym over 1 year ago

  • Related to Bug #10238: Investigate why Tor sometimes is slow at or even fails bootstrapping added

#15 Updated by anonym over 1 year ago

  • Assignee changed from anonym to bertagaz

Oops! Wrong ticket! :S

#16 Updated by anonym over 1 year ago

  • Related to deleted (Bug #10238: Investigate why Tor sometimes is slow at or even fails bootstrapping)

#17 Updated by bertagaz over 1 year ago

  • Target version changed from Tails_2.4 to Tails_2.5

#18 Updated by bertagaz over 1 year ago

  • Status changed from Confirmed to In Progress

intrigeri wrote:

On feature/jessie it might be simple to do that with tweaks to the systemd unit file (config/chroot_local-includes/lib/systemd/system/htpdate.service). I think this should be investigated before producing more Wheezy-area code.

I've seen this recently while running the #6309 test. It seems to be one of the most important transient network problems left.

I've tried to figure out how to implement a restart with the systemd unit file when the service doesn't start correctly, but it seems we can't use Restart=on-failure with services of Type=oneshot. So I'm unsure how to deal with that using systemd. Maybe another service that waits for some time to see whether the success file appears, and restarts htpdate if it doesn't after a certain time? Any other idea?
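For reference, the constraint described in the comment can be summarised in the unit file itself. This is only a sketch, not the actual Tails unit, and the ExecStart path is an assumption:

```ini
# Sketch of the htpdate.service combination discussed above. At the time,
# systemd refused Restart= on Type=oneshot units, so the commented-out
# line below could not be used:
[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/local/sbin/htpdate
# Restart=on-failure   <- rejected by systemd for Type=oneshot services
```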

#19 Updated by bertagaz over 1 year ago

bertagaz wrote:

intrigeri wrote:

On feature/jessie it might be simple to do that with tweaks to the systemd unit file (config/chroot_local-includes/lib/systemd/system/htpdate.service). I think this should be investigated before producing more Wheezy-area code.

I've seen this recently while running the #6309 test. It seems to be one of the most important transient network problems left.

I've tried to figure out how to implement a restart with the systemd unit file when the service doesn't start correctly, but it seems we can't use Restart=on-failure with services of Type=oneshot. So I'm unsure how to deal with that using systemd. Maybe another service that waits for some time to see whether the success file appears, and restarts htpdate if it doesn't after a certain time? Any other idea?

OTOH I wonder if using this systemd restarting workaround is not just an ugly hack around a problem that lies in the htpdate code itself.

I've done some manual tests of htpdate in a Tails 2.4 without network.

Here's how I saw it running:

  • Send a first request to one http server of each pool.
  • Wait two minutes.
  • Send a new request to one new http server of each pool.
  • Wait two minutes.
  • ...

It does this loop 5 times (until allowed_per_pool_failure_ratio is reached), effectively running for 10 minutes each time I ran it.

Now in the test suite, we wait five minutes for it to complete, which means htpdate effectively only has time to send 2 batches of requests to gather the time, which is not much on a not-so-reliable network.

We could use the --connect-timeout option of curl to make sure htpdate sends requests a bit more often, and does not wait two minutes for an answer to each request.

If we follow the test suite limit and want htpdate to run for 5 minutes max (or do we want htpdate to run indefinitely until it succeeds?), with a connection timeout of 30 seconds, it means we'd have to make it retry new servers 10 times. This could be reached by raising allowed_per_pool_failure_ratio and/or by adding new servers to the pools.

I'm testing a branch locally with --connect-timeout 30 and allowed_per_pool_failure_ratio set to 0.5 just to see if it fails less often on my shitty network. Will push it if it seems conclusive so that it gets some testing in Jenkins too.
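The time-budget arithmetic above can be sketched as follows; the numbers are taken from the comment, and nothing here is part of htpdate itself:

```shell
#!/bin/sh
# With the test suite's 5-minute budget and a 30-second curl
# --connect-timeout, htpdate could try up to 10 servers per pool,
# instead of the ~2 batches allowed by the old 2-minute waits.
budget=300        # test suite limit: 5 minutes, in seconds
per_request=30    # proposed curl --connect-timeout value, in seconds
echo "attempts per pool: $(( budget / per_request ))"
```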

#20 Updated by bertagaz over 1 year ago

  • % Done changed from 0 to 20
  • Feature Branch set to bugfix/10494-retry-htpdate

bertagaz wrote:

I'm testing a branch locally with --connect-timeout 30 and allowed_per_pool_failure_ratio set to 0.5 just to see if it fails less often on my shitty network. Will push it if it seems conclusive so that it gets some testing in Jenkins too.

Did not see htpdate errors at home anymore while running this branch, so pushing to see how it behaves in Jenkins, even if I couldn't find an occurrence of this error there.

#21 Updated by bertagaz over 1 year ago

bertagaz wrote:

bertagaz wrote:

I'm testing a branch locally with --connect-timeout 30 and allowed_per_pool_failure_ratio set to 0.5 just to see if it fails less often on my shitty network. Will push it if it seems conclusive so that it gets some testing in Jenkins too.

Did not see htpdate errors at home anymore while running this branch, so pushing to see how it behaves in Jenkins, even if I couldn't find an occurrence of this error there.

10 runs in Jenkins without any time syncing failure. I've pushed a bugfix/10494-retry-htpdate-with-more-fragile-test branch which includes more online scenarios (#10444, #10441 and #10496). The current branch does not have that many of them; let's use bait on the bug.

I would be interested to know what others think about these new settings.

#22 Updated by bertagaz over 1 year ago

  • Assignee changed from bertagaz to intrigeri
  • QA Check set to Info Needed

bertagaz wrote:

10 runs in Jenkins without any time syncing failure. I've pushed a bugfix/10494-retry-htpdate-with-more-fragile-test branch which includes more online scenarios (#10444, #10441 and #10496). The current branch does not have that many of them; let's use bait on the bug.

And then it failed (here, and twice here, at least)... So the timeout trick does not seem to resolve the bug entirely; still, the reasoning behind it makes sense to me, and at home I've seen it help when the network was in bad shape. So the following is still true:

I would be interested to know what others think about these new settings.

But in the end, it seems we'll still have to fall back on the proposed implementation of this ticket: restart htpdate if it fails. Three options are possible to implement it:

  1. in htpdate perl's code
  2. using systemd features
  3. using a shell wrapper to the service

The first one is out of bounds for me (but maybe not for others?).
I've done some research on the second, and it seems systemd auto-restarting a service on failure is only possible for simple service types. I'm unsure about this and may need guidance.
The third option sounds like the easiest to me.

Assigning to intrigeri, as he has more insight into systemd and may have input to help here. I'll keep digging into this option for now.

#23 Updated by bertagaz over 1 year ago

  • Assignee changed from intrigeri to bertagaz
  • % Done changed from 20 to 30
  • QA Check deleted (Info Needed)

bertagaz wrote:

I've done some research on the second, and it seems systemd auto-restarting a service on failure is only possible for simple service types. I'm unsure about this and may need guidance.
The third option sounds like the easiest to me.

Assigning to intrigeri, as he has more insight into systemd and may have input to help here. I'll keep digging into this option for now.

I did some more research and try-outs on the second option (systemd). Switching to a service of type simple does not really work to handle the restarts through systemd (the Restart=on-failure service option). For it to work, it seems we'd have to remove the RemainAfterExit option of the htpdate service, but that breaks it. So I don't think that's the way to go.

So I've pushed a shell wrapper to the branch (and merged it into the bugfix/10494-retry-htpdate-with-more-fragile-test branch). I've also reverted the commit that changed allowed_per_pool_failure_ratio, as I don't think that's the source of the problem. I've kept the connect-timeout change though; I think it still makes sense.

One thing I didn't quite get is why the CapabilityBoundingSet option of the service was preventing the wrapper from reading the Tor controller auth cookie. I had to remove this option from the service file so that the wrapper was able to send the NEWNYM signal. Maybe someone has a clue?

Works pretty well at home; let's see how it works in Jenkins anyway.
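A rough sketch of the wrapper's control flow. The names retry_htpdate, max_attempts and the newnym stub below are illustrative, not the actual branch contents; the real wrapper signals NEWNYM over the Tor control port, which is the step that ran into CapabilityBoundingSet:

```shell
#!/bin/sh
# Illustrative sketch of a retry-htpdate wrapper; names are hypothetical.
newnym() {
    # Placeholder: the real wrapper sends SIGNAL NEWNYM over the
    # Tor control port, the operation blocked by CapabilityBoundingSet.
    echo "requesting a new Tor circuit" >&2
}

retry_htpdate() {
    max_attempts="$1"; shift
    attempt=1
    while ! "$@"; do
        [ "$attempt" -ge "$max_attempts" ] && return 1
        newnym
        attempt=$((attempt + 1))
    done
    echo "htpdate succeeded on attempt $attempt"
}

# Demo run: 'true' stands in for the real htpdate invocation.
retry_htpdate 3 true
```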

#24 Updated by bertagaz over 1 year ago

Pushed a scenario testing the restarts of htpdate.

#25 Updated by bertagaz over 1 year ago

I've run 120 times this scenario:

Given I have started Tails from DVD without network and logged in
When the network is plugged
And Tor has built a circuit
Then the time has synced

With an ISO that has the htpdate restarting feature, but not the --connect-timeout option to curl. It failed only twice. Not bad. I added a patch that counts and reports the number of times htpdate is restarted, and I've seen it happen (up to 4 restarts) and save some runs.

I'm now running the same with an ISO that also has the --connect-timeout trick. Will report on completion.

#27 Updated by intrigeri over 1 year ago

  • Related to deleted (Feature #9516: Restart Tor if bootstrapping stalls for too long)

#28 Updated by intrigeri over 1 year ago

  • Blocked by Bug #10441: Synaptic test is fragile added

#29 Updated by intrigeri over 1 year ago

  • Blocked by deleted (Bug #10441: Synaptic test is fragile)

#30 Updated by bertagaz over 1 year ago

  • Assignee changed from bertagaz to anonym
  • % Done changed from 30 to 50
  • QA Check set to Ready for QA

I've fixed 3 of the URLs in 2 pools that were buggy. I guess it'll help.

Setting this ticket as RfQA, as its purpose is implemented already, and from my testing it works quite well and certainly improves the situation.

bertagaz wrote:

With an ISO that has the htpdate restarting feature, but not the --connect-timeout option to curl. It failed only twice. Not bad. I added a patch that counts and reports the number of times htpdate is restarted, and I've seen it happen (up to 4 restarts) and save some runs.

I'm now running the same with an ISO that also has the --connect-timeout trick. Will report on completion.

I'll report my researches regarding the test suite on #10495, which is the canonical place for that.

#31 Updated by bertagaz over 1 year ago

bertagaz wrote:

I've fixed 3 of the urls in 2 pools that were buggy. I guess it'll help.

Note that I've created #11562, as I believe we should prevent this from happening.

#32 Updated by bertagaz over 1 year ago

  • Related to Feature #9477: Write regression tests and tests for new features added

#33 Updated by bertagaz over 1 year ago

  • Related to deleted (Feature #9477: Write regression tests and tests for new features)

#34 Updated by bertagaz over 1 year ago

  • Related to Feature #9477: Write regression tests and tests for new features added

#35 Updated by intrigeri over 1 year ago

  • Assignee changed from anonym to bertagaz
  • QA Check changed from Ready for QA to Info Needed

I've fixed 3 of the URLs in 2 pools that were buggy. I guess it'll help.

Great!

Setting this ticket as RfQA, as its purpose is implemented already, and from my testing it works quite well and certainly improves the situation.

Now I'm slightly confused. It looks like the test we have in place has identified a bug in Tails (buggy htpdate pools configuration), that was causing some part of the htpdate failures we've seen. So I can't help but ask:

  • is the proposed workaround (retrying htpdate when it fails) needed, given a correct htpdate pool config?
  • isn't the proposed workaround likely to hide such configuration problems in the future? (I know that we have #11562 now, but still.)

Also, sorry to raise these concerns this late, but I'm worried about the consequences on the security design of the whole thing. I don't remember the details, but IIRC we had --allowed_per_pool_failure_ratio for a reason; what are the consequences of the proposed change, in terms of increased power for an attacker who blocks htpdate connections to selected servers in these pools, until we use the ones he controls? I didn't do the maths, but at first glance it seems to me that we might be essentially killing the "if there are too many that fail for any given pool, e.g. because of failed certificate checking or being unreachable, the pool is considered to be potentially compromised" check. I'm not saying all this is bad, but at the very least I'd like to see the design doc updated accordingly, and some indication that the security impact of this change has been thought through.

In passing: MAX_RESTARTS seems to actually specify the number of attempts, so its name is misleading.

About d4bb25b: I'd be sad to drop the protection entirely, after putting some effort myself into setting it up. I'd rather see the list of needed (=> allowed) capabilities updated. Why not just add the other capabilities you need?

retry_htpdate should not be in the $PATH, as it's clearly not meant to be used by hand.

With all that in mind, how about we take a shortcut and cherry-pick the htpdate configuration bugfix into a dedicated branch based on stable, and merge it ASAP (I can review+merge)? This should 1. improve UX in the real world as soon as 2.5 is out; 2. help answer my question above, about whether we still need a workaround after we've fixed the config, thanks to more Jenkins data; 3. give us time to think about the security consequences, without being in a hurry. What do you think?

#36 Updated by intrigeri over 1 year ago

intrigeri wrote:

With all that in mind, how about we take a shortcut and cherry-pick the htpdate configuration bugfix into a dedicated branch based on stable, and merge it ASAP (I can review+merge)?

IMO that bugfix branch should also have the lowered --connect-timeout, because it seems to make a lot of sense, regardless of restarting htpdate N times or not (in particular given the 5 vs. 10 minutes thing you discovered). Keep in mind that tails.git is not the canonical location for the htpdate script, though: it lives in our htp.git repo, and from there we copy new versions to tails.git when needed.

#37 Updated by bertagaz over 1 year ago

intrigeri wrote:

Now I'm slightly confused. It looks like the test we have in place has identified a bug in Tails (buggy htpdate pools configuration), that was causing some part of the htpdate failures we've seen. So I can't help but ask:

  • is the proposed workaround (retrying htpdate when it fails) needed, given a correct htpdate pool config?

Yes, I think it still makes sense. I don't think the failures are only due to the 3 faulty HTTP servers in the pool (even if that's certainly the reason for some of them), but also to Tor bootstrapping problems, or a crappy network. These htpdate requests happen very early in the Tails boot process, at a moment when it's possible that Tor isn't really ready yet.

  • isn't the proposed workaround likely to hide such configuration problems in the future? (I know that we have #11562 now, but still.)

Yes, but we can't say that without it we would notice the problem either. In fact, we already didn't. That's why I created #11562.

Also, sorry to raise these concerns this late, but I'm worried about the consequences on the security design of the whole thing. I don't remember the details, but IIRC we had --allowed_per_pool_failure_ratio for a reason; what are the consequences of the proposed change, in terms of increased power for an attacker who blocks htpdate connections to selected servers in these pools, until we use the ones he controls? I didn't do the maths, but at first glance it seems to me that we might be essentially killing the "if there are too many that fail for any given pool, e.g. because of failed certificate checking or being unreachable, the pool is considered to be potentially compromised" check. I'm not saying all this is bad, but at the very least I'd like to see the design doc updated accordingly, and some indication that the security impact of this change has been thought through.

Yes, I had this question in mind; that's why I asked for input about it (at the worst moment, sadly). On this point I admit I'm not sure. Maybe the pools are a bit short (5 tries per pool before dying is not much) and we should add more URLs to each. But then we introduce more maintenance trouble. I'll give more thought to it, but my sole brain might not be enough to get a real answer. ;)

In passing: MAX_RESTARTS seems to actually specify the number of attempts, so its name is misleading.

Right, I noticed that but didn't correct the variable name.

About d4bb25b: I'd be sad to drop the protection entirely, after putting some effort myself into setting it up. I'd rather see the list of needed (=> allowed) capabilities updated. Why not just add the other capabilities you need?

Well, I've tried to add some to the setting, but it didn't change anything. I think I even tried to add CAP_SYS_ADMIN IIRC. I didn't find which capability was needed here. Help on this would be welcome.

retry_htpdate should not be in the $PATH, as it's clearly not meant to be used by hand.

Good point!

With all that in mind, how about we take a shortcut and cherry-pick the htpdate configuration bugfix into a dedicated branch based on stable, and merge it ASAP (I can review+merge)? This should 1. improve UX in the real world as soon as 2.5 is out; 2. help answer my question above, about whether we still need a workaround after we've fixed the config, thanks to more Jenkins data; 3. give us time to think about the security consequences, without being in a hurry. What do you think?

I can certainly create another branch for that (though I don't know how to cherry-pick these commits without changing their commit IDs, any idea?). But as you've seen in my last note on #10495, I've already seen failures despite the pool fix.

#38 Updated by intrigeri over 1 year ago

  • is the proposed workaround (retrying htpdate when it fails) needed, given a correct htpdate pool config?

Yes, I think it still makes sense. I don't think the failures are only due to the 3 faulty HTTP servers in the pool (even if that's certainly the reason for some of them), but also to Tor bootstrapping problems, or a crappy network.

I have the same gut feeling (except for the "Tor bootstrapping problems", which I think should not be an issue -- see below -- and even more so on Jenkins with chutney), but now I'd like to wait a bit and see more data about it. We'll have it once we have merged the subset of this branch that we're discussing below :)

These htpdate requests happen very early in the Tails boot process, at a moment when it's possible that Tor isn't really ready yet.

I'm surprised, as I'm pretty sure we've set it up so that htpdate is started after wait_for_working_tor. If you really see that happen, then that's a serious bug. Can you please clarify?

[...] at the very least I'd like to see the design doc updated accordingly, and some indication that the security impact of this change has been thought through.

Yes, I had this question in mind; that's why I asked for input about it (at the worst moment, sadly). On this point I admit I'm not sure. Maybe the pools are a bit short (5 tries per pool before dying is not much) and we should add more URLs to each. But then we introduce more maintenance trouble. I'll give more thought to it, but my sole brain might not be enough to get a real answer. ;)

OK, so for now we're blocking on the security aspect. I would advise creating a dedicated Research subtask, and postponing the whole thing (except the bits discussed below) to 2.6, so that it's clear what the big blocker is.

In passing: MAX_RESTARTS seems to actually specify the number of attempts, so its name is misleading.

Right, I noticed that but didn't correct the variable name.

I suspect that fixing it now would eat less time than the book-keeping needed to remember that we'll have to fix it later if we take this branch eventually. YMMV :)

About d4bb25b: I'd be sad to drop the protection entirely, after putting some effort myself into setting it up. I'd rather see the list of needed (=> allowed) capabilities updated. Why not just add the other capabilities you need?

Well, I've tried to add some to the setting, but it didn't change anything. I think I even tried to add CAP_SYS_ADMIN IIRC. I didn't find which capability was needed here. Help on this would be welcome.

I don't remember how I came up with the initial list exactly, but I certainly did not guess, so there's surely a way to find out what exact operation is blocked, and from this to infer what capability is needed to unblock it. I'm sorry I don't remember how :/ … but that's the way I would do it. I'd say subtask++, and don't bother until we know we really want to do that restart dance (i.e. blocked by the security aspects).

I can certainly create another branch for that

Yes, please. I'd like to see the HTP pool fixed in 2.5, and it's quite clear to me that this htpdate restart work won't make it into 2.5: we need time for the security discussion, and it's disputable if the change qualifies for a point-release.

(though I don't know how to cherry-pick this commits without changing their commit ID, any idea?).

You can't, given how these IDs are computed (from data that includes the parent commit ID).

But as you've seen in my last note on #10495, I've already seen failures despite the pool fix.

We don't have to fix all problems in July :)

#39 Updated by bertagaz over 1 year ago

  • Blocked by Bug #11574: Investigate security consequences of restarting htpdate until it succeeds added

#40 Updated by bertagaz over 1 year ago

  • Assignee changed from bertagaz to intrigeri
  • QA Check changed from Info Needed to Ready for QA

intrigeri wrote:

These htpdate requests happen very early in the Tails boot process, at a moment when it's possible that Tor isn't really ready yet.

I'm surprised, as I'm pretty sure we've set it up so that htpdate is started after wait_for_working_tor. If you really see that happen, then that's a serious bug. Can you please clarify?

Well, it seems to me that even if Tor reports having 100% bootstrapped, it still has difficulty providing a functional circuit right after that. But I don't have data or proof; it's more of a feeling, and it would need more debugging to see if there's really a problem here.

OK, so for now we're blocking on the security aspect. I would advise creating a dedicated Research subtask, and postponing the whole thing (except the bits discussed below) to 2.6, so that it's clear what the big blocker is.

So that's #11574.

In passing: MAX_RESTARTS seems to actually specify the number of attempts, so its name is misleading.

Right, I noticed that but didn't correct the variable name.

I suspect that fixing it now would eat less time than the book-keeping needed to remember that we'll have to fix it later if we take this branch eventually. YMMV :)

2e10750

About d4bb25b: I'd be sad to drop the protection entirely, after putting some effort myself into setting it up. I'd rather see the list of needed (=> allowed) capabilities updated. Why not just add the other capabilities you need?

Well, I've tried to add some to the setting, but it didn't change anything. I think I even tried to add CAP_SYS_ADMIN IIRC. I didn't find which capability was needed here. Help on this would be welcome.

I don't remember how I came up with the initial list exactly, but I certainly did not guess, so there's surely a way to find out what exact operation is blocked, and from this to infer what capability is needed to unblock it. I'm sorry I don't remember how :/ … but that's the way I would do it. I'd say subtask++, and don't bother until we know we really want to do that restart dance (i.e. blocked by the security aspects).

Created #11575 then.

I can certainly create another branch for that

Yes, please. I'd like to see the HTP pool fixed in 2.5, and it's quite clear to me that this htpdate restart work won't make it into 2.5: we need time for the security discussion, and it's disputable if the change qualifies for a point-release.

(though I don't know how to cherry-pick these commits without changing their commit IDs, any idea?).

You can't, given how these IDs are computed (from data that includes the parent commit ID).

Right. Pushed bugfix/10494-fix-htpdate-pools

So setting this ticket RfQA, but that's only for this branch. Once merged, this ticket shouldn't be closed. Assigning to our 2.5 RM.

But as you've seen in my last note on #10495, I've already seen failures despite the pool fix.

We don't have to fix all problems in July :)

Seen one of these failures in a new batch of 100 runs, btw.

retry_htpdate should not be in the $PATH, as it's clearly not meant to be used by hand.

c15b2c5

IMO that bugfix branch should also have the lowered --connect-timeout, because it seems to make a lot of sense, regardless of restarting htpdate N times or not (in particular given the 5 vs. 10 minutes thing you discovered).

Agreed, imported in the previously mentioned new branch.

Keep in mind that tails.git is not the canonical location for the htpdate script, though: it lives in our htp.git repo, and from there we copy new versions to tails.git when needed.

Didn't remember that. Pushed the timeout change (and fixed an old forgotten one) in this repo.

#41 Updated by bertagaz over 1 year ago

  • Target version changed from Tails_2.5 to Tails_2.6

#42 Updated by bertagaz over 1 year ago

intrigeri wrote:

intrigeri wrote:

With all that in mind, how about we take a shortcut and cherry-pick the htpdate configuration bugfix into a dedicated branch based on stable, and merge it ASAP (I can review+merge)?

IMO that bugfix branch should also have the lowered --connect-timeout, because it seems to make a lot of sense, regardless of restarting htpdate N times or not (in particular given the 5 vs. 10 minutes thing you discovered).

Note that with this timeout patch, a run of htpdate now takes only 2 min 30 sec (30 sec * 5 tries per pool), so we're below the 5-minute delay of this try_for.
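The arithmetic above can be checked with a quick sketch (the 30-second timeout and 5 tries per pool are the values quoted in this comment):

```shell
# Worst-case duration of one htpdate run, per this comment:
# each curl attempt gives up after --connect-timeout seconds,
# and each pool is tried a fixed number of times.
connect_timeout=30   # seconds, the lowered --connect-timeout
tries_per_pool=5     # attempts per pool

total=$((connect_timeout * tries_per_pool))
echo "${total} seconds"   # 150 seconds = 2 min 30 sec, under the 5-minute try_for
```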

#43 Updated by intrigeri over 1 year ago

  • Related to Bug #11577: Fix htpdate pool and timeout added

#44 Updated by intrigeri over 1 year ago

Right. Pushed bugfix/10494-fix-htpdate-pools So setting this ticket RfQA, but that's only for this branch. Once merged, this ticket shouldn't be closed. Assigning to our 2.5 RM.

Thanks! I've created #11577 to track that other branch, otherwise I'll become insane trying to follow the Redmine semantics at play.

#46 Updated by intrigeri over 1 year ago

  • Assignee changed from intrigeri to bertagaz
  • QA Check changed from Ready for QA to Dev Needed

#47 Updated by bertagaz over 1 year ago

Now I'm starting to think that rather than using a shell wrapper, we should implement the retrying at the curl level. That would simply cut short the 'killing the $allowed_per_pool_failure_ratio feature' debate: if we retry each pool's HTTP servers several times, we wouldn't pick new ones. I don't know why I didn't think about that before, maybe a mix of an "oh-my-god-it's-Perl" feeling and not having considered the security implications of a shell wrapper. But I'm not sure that it would be the most robust solution.

I'll push a new branch implementing the retry this way, so that we'll start to have some Jenkins stats about such change to decide after 2.5 is out.
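A curl-level retry might look something like the dry-run sketch below. This is only an illustration, not the branch's actual code: the `build_curl_cmd` helper is hypothetical and the flag values are made up, but `--retry` and `--retry-delay` are real curl options that re-request the same URL on transient failures, so the server list stays unchanged.

```shell
# Hypothetical sketch: retry at the curl level instead of restarting
# htpdate. curl's built-in --retry re-requests the *same* URL on
# transient failures, so no new pool servers get picked and
# $allowed_per_pool_failure_ratio keeps its meaning.
# We only print the command (dry run), to keep the sketch network-free.
build_curl_cmd() {
    url="$1"
    echo "curl --silent --head --connect-timeout 30 --retry 3 --retry-delay 5 $url"
}

build_curl_cmd "https://example.com/"
```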

#48 Updated by bertagaz over 1 year ago

  • Related to deleted (Feature #9477: Write regression tests and tests for new features)

#49 Updated by anonym about 1 year ago

  • Target version changed from Tails_2.6 to Tails_2.7

#50 Updated by bertagaz about 1 year ago

bertagaz wrote:

I'll push a new branch implementing the retry this way, so that we'll start to have some Jenkins stats about such change to decide after 2.5 is out.

In the end I didn't do it: after some live discussions, the best way we found to implement this was to do it inside htpdate itself. It should call the curl command several times with the same webserver URIs, and request NEWNYMs in between.

This way it won't try more web servers than it does now, only try the same list multiple times, hence avoiding raising the question from #11574.

This also means the systemd unit changes in htpdate.service that were to be reverted by #11575 are unnecessary, as they won't be merged anyway.

Next is to implement a retry loop in /usr/local/sbin/htpdate around the curl invocation.
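The planned loop could be sketched roughly like this (hypothetical shell version for illustration; the real change lands in htpdate's Perl, and `tor_control_newnym` is an assumed helper name standing in for the Tails shell library call that sends NEWNYM to Tor's control port):

```shell
# Hypothetical sketch of the planned retry loop: re-run the same
# command (e.g. the curl invocation) against the same web server list,
# requesting a new Tor circuit (NEWNYM) between attempts, instead of
# trying more servers.
retry_with_newnym() {
    max_tries="$1"; shift
    i=1
    while [ "$i" -le "$max_tries" ]; do
        if "$@"; then
            return 0
        fi
        # tor_control_newnym is an assumed helper wrapping the Tails
        # shell library's "SIGNAL NEWNYM" on Tor's control port.
        tor_control_newnym || true
        i=$((i + 1))
    done
    return 1
}
```

It would be called as e.g. `retry_with_newnym 3 curl ...`, returning as soon as one attempt succeeds.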

#51 Updated by intrigeri about 1 year ago

This way it won't try more web servers than it does now, only try the same list multiple times, hence avoiding raising the question from #11574.

Sounds better indeed!

#52 Updated by intrigeri about 1 year ago

Next is to implement a retry loop in /usr/local/sbin/htpdate around the curl invocation.

These changes might be too invasive to go into a minor release. I suggest you avoid this question entirely by postponing to 2.8 and focusing on non-internal SponsorS tasks during the 2.7 cycle.

#53 Updated by bertagaz about 1 year ago

  • Target version changed from Tails_2.7 to Tails_2.9.1

intrigeri wrote:

These changes might be too invasive to go into a minor release. I suggest you avoid this question entirely by postponing to 2.8 and focusing on non-internal SponsorS tasks during the 2.7 cycle.

Agree!

#54 Updated by intrigeri 12 months ago

  • Target version changed from Tails_2.9.1 to Tails 2.10

As said above: "These changes might be too invasive to go into a minor release".

#55 Updated by anonym 10 months ago

  • Target version changed from Tails 2.10 to Tails_2.11

#56 Updated by bertagaz 9 months ago

  • Target version changed from Tails_2.11 to Tails_2.12

As said above: "These changes might be too invasive to go into a minor release".

#58 Updated by bertagaz 9 months ago

  • Target version changed from Tails_2.12 to Tails_3.0

#59 Updated by bertagaz 6 months ago

  • Target version changed from Tails_3.0 to Tails_3.1

#60 Updated by bertagaz 6 months ago

  • Target version changed from Tails_3.1 to Tails_3.2

#61 Updated by bertagaz 3 months ago

  • Feature Branch changed from bugfix/10494-retry-htpdate to bugfix/10494-htpdate-fix-date-header-regexp

bertagaz wrote:

Next is to implement a loop system in /usr/local/sbin/htpdate around curl invocation.

I've been spending some hours implementing that (given the failure ratio in the test suite), and have something working that is close to being pushed.

But while doing so, I discovered another problem, seeing how often we get 'Could not find any date header' errors in the htpdate log.

Looking at the headers the web servers send, I noticed that some of them do not use the first-letter-uppercased convention for HTTP headers, so the regexp in newestDateHeader eventually fails. I've pushed a tiny commit aeb903a6fe8422c3beb677110c56f04b28c6108b that just fixes that (hoping it does not hurt regexp laws). Here htpdate hasn't failed since then, so it should help greatly. But let's see how it behaves in Jenkins first.
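The problem can be illustrated with a case-insensitive match (a shell sketch with an invented server response; the actual fix is in htpdate's Perl regexp, where the equivalent is the /i modifier):

```shell
# Some servers send "date:" or "DATE:" instead of "Date:", so a
# case-sensitive match on the Date header eventually fails.
# grep -i matches the header regardless of capitalization, much like
# adding the /i modifier to the Perl regexp in newestDateHeader.
printf 'HTTP/1.1 200 OK\r\ndate: Mon, 17 Jul 2017 12:00:00 GMT\r\n' \
    | grep -i '^date:'
```

This is consistent with HTTP itself, which defines header field names as case-insensitive.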

I'll push the retry implementation in another branch later, but hopefully the above change will make it much less important.

#62 Updated by bertagaz 3 months ago

bertagaz wrote:

I'll push the retry implementation in another branch later, but hopefully the above change will make it much less important.

Pushed in the bugfix/10494-retry-curl-in-htpdate branch. The way to request a new Tor circuit is very Tails-specific though, as it uses the Tails shell library to send the NEWNYM command. But I don't know if other projects are using this software, and we seem to be the upstream of the Perl version. This functionality is deactivated by default.

#63 Updated by bertagaz 3 months ago

  • Assignee changed from bertagaz to anonym
  • QA Check changed from Dev Needed to Ready for QA

bertagaz wrote:

Looking at the headers the web servers send, I noticed that some of them do not use the first-letter-uppercased convention for HTTP headers, so the regexp in newestDateHeader eventually fails. I've pushed a tiny commit aeb903a6fe8422c3beb677110c56f04b28c6108b that just fixes that (hoping it does not hurt regexp laws). Here htpdate hasn't failed since then, so it should help greatly. But let's see how it behaves in Jenkins first.

Ok, so Jenkins ran that 15 times (https://jenkins.tails.boum.org/job/test_Tails_ISO_bugfix-10494-htpdate-fix-date-header-regexp/ and https://jenkins.tails.boum.org/job/test_Tails_ISO_bugfix-10494-13541-htpdate-fix-date-header-regexp/) without any htpdate error, which statistically should have happened, judging by how often it happens on other branches. So it confirms my finding.

So I'm flagging the bugfix/10494-htpdate-fix-date-header-regexp branch RfQA for the 3.2 RM. Also note that if you merge it, you have to merge the same branch in the htp.git repo we have on immerda.

Once merged, don't close this ticket yet, but reassign it to me so that I'll keep tracking this problem and see if we need the other retry branch. Hope so, I spent some time hacking Perl... :)

#64 Updated by bertagaz 3 months ago

  • Assignee changed from anonym to bertagaz
  • QA Check changed from Ready for QA to Dev Needed
  • Feature Branch changed from bugfix/10494-htpdate-fix-date-header-regexp to bugfix/10494-retry-curl-in-htpdate

bertagaz wrote:

Once merged, don't close this ticket yet, but reassign it to me so that I'll keep tracking this problem and see if we need the other retry branch. Hope so, I spent some time hacking Perl... :)

So it has been fixed and merged. Now let's see if we ever see time syncing errors again and need to retry curl or not.

#65 Updated by intrigeri 3 months ago

bertagaz wrote:

bertagaz wrote:

Once merged, don't close this ticket yet, but reassign it to me so that I'll keep tracking this problem and see if we need the other retry branch. Hope so, I spent some time hacking Perl... :)

So it has been fixed and merged. Now let's see if we ever see time syncing errors again and need to retry curl or not.

Please ensure the corresponding change is merged in htp.git too.

#66 Updated by bertagaz 2 months ago

  • Target version changed from Tails_3.2 to Tails_3.3

#67 Updated by bertagaz about 2 months ago

  • Target version changed from Tails_3.3 to Tails_3.4

#68 Updated by bertagaz 20 days ago

  • QA Check changed from Dev Needed to Ready for QA

Flagging this ticket RfQA to test #12633.

#69 Updated by bertagaz 20 days ago

  • QA Check changed from Ready for QA to Dev Needed

Unflagging.

#70 Updated by bertagaz 4 days ago

  • Target version changed from Tails_3.4 to Tails_3.5
