Project

General

Profile

Feature #9516

Feature #9515: Improve test suite robustness vs transient network errors

Restart Tor if bootstrapping stalls for too long

Added by anonym over 2 years ago. Updated about 2 years ago.

Status:
Resolved
Priority:
High
Assignee:
-
Category:
Test suite
Target version:
Start date:
06/02/2015
Due date:
% Done:

100%

QA Check:
Pass
Feature Branch:
kytv/feature/9516-restart-tor-on-bootstrap-failure
Type of work:
Code
Blueprint:
Starter:
Affected tool:

Description

This is both w.r.t. the initial Tor bootstrap, and after that, when we reconnect to the Tor network again after restoring a snapshot. We may have to retry multiple times in a row.

clock_gettime_monotonic.patch View (2.83 KB) anonym, 10/13/2015 08:22 AM


Related issues

Related to Tails - Bug #9478: How to deal with transient network errors in the test suite? Resolved 05/27/2015
Related to Tails - Bug #10238: Investigate why Tor sometimes is slow at or even fails bootstrapping Rejected 09/23/2015
Related to Tails - Bug #10537: Fix restart-tor and tor-has-bootstrapped semantics vs. test suite on Jessie Resolved 11/12/2015

Associated revisions

Revision 7f798ea5 (diff)
Added by kytv over 2 years ago

Restart Tor if bootstrapping stalls for more than 20 seconds

Will-fix: #9516

Revision 400a4323 (diff)
Added by kytv over 2 years ago

Start tor in the test suite using the 'restart-tor' script

Will-fix: #9516

Revision 1ed0a6a2 (diff)
Added by kytv about 2 years ago

Restart Tor if bootstrapping stalls for more than 20 seconds

Will-fix: #9516

Revision 14ae9bda (diff)
Added by kytv about 2 years ago

Start tor in the test suite using the 'restart-tor' script

restart-tor will block until bootstrapping is done (or failed) but we
want to detect the failure in wait_until_tor_is_working() in the test
suite, which we run right after invoking restart-tor here.

Will-fix: #9516

Revision d30225ff (diff)
Added by anonym about 2 years ago

Don't use restart-tor when using a bridge.

See added comment why we cannot do that. Also, while we're at it,
let's move some code from 20-time.sh into here since that's where it
belongs.

Will-fix: #9516

Revision 430651c8
Added by anonym about 2 years ago

Merge remote-tracking branch 'kytv/feature/9516-restart-tor-on-bootstrap-failure' into feature/9516-restart-tor-on-bootstrap-failur

Fix-committed: #9516

Revision 7a19e96e
Added by anonym about 2 years ago

Merge remote-tracking branch 'kytv/feature/9516-restart-tor-on-bootstrap-failure' into devel

Fix-committed: #9516

Revision 729baa2b (diff)
Added by anonym 9 months ago

Stop restarting tor if bootstrapping stalls.

Preliminary tests indicate that tor might have fixed the issues we
used to experience with the bootstrap process stalling and requiring a
restart to kickstart it.

Refs: #10238, #9516
Will-fix: #12411

History

#1 Updated by anonym over 2 years ago

Since we always have to run wait_until_tor_is_working() when Tor is connecting to the network, it may be a good place to fix this. The following might be a good starting point (tl;dr: restart Tor when the bootstrap progress has stalled for more than 15 seconds):

--- a/features/support/helpers/misc_helpers.rb
+++ b/features/support/helpers/misc_helpers.rb
@@ -75,8 +75,23 @@ rescue unique_timeout_exception => e
 end

 def wait_until_tor_is_working
-  try_for(270) { @vm.execute(
-    '. /usr/local/lib/tails-shell-library/tor.sh; tor_is_working').success? }
+  bootstrap_progress = 0
+  last_bootstrap_change = Time.now
+  try_for(270) do
+    new_bootstrap_progress = @vm.execute_successfully('. /usr/local/lib/tails-shell-library/tor.sh; tor_bootstrap_progress').stdout.to_i
+    if new_bootstrap_progress == 100
+      return
+    elsif new_bootstrap_progress > bootstrap_progress
+      bootstrap_progress = new_bootstrap_progress
+      last_bootstrap_change = Time.now
+    elsif Time.now - last_bootstrap_change > 15
+      STDERR.puts "Tor bootstrap stalled at #{bootstrap_progress}% for 15 seconds, restarting Tor..." if $config["DEBUG"]
+      @vm.execute_successfully('service tor restart')
+      bootstrap_progress = 0
+      last_bootstrap_change = Time.now
+    end
+    false
+  end
 end

 def convert_bytes_mod(unit)

I've had to use this myself when running the test suite in some places.

#3 Updated by anonym over 2 years ago

  • Related to Bug #9478: How to deal with transient network errors in the test suite? added

#4 Updated by intrigeri over 2 years ago

  • Related to deleted (Bug #9478: How to deal with transient network errors in the test suite?)

#6 Updated by intrigeri over 2 years ago

  • Related to Bug #9478: How to deal with transient network errors in the test suite? added

#8 Updated by kytv over 2 years ago

  • Status changed from Confirmed to In Progress
  • Assignee changed from kytv to anonym
  • QA Check set to Info Needed

anonym wrote:

Since we always have to run wait_until_tor_is_working() when Tor is connecting to the network, it may be a good place to fix this. The following might be a good starting point (tl;dr: restart Tor when the bootstrap progress has stalled for more than 15 seconds):
[...]

What if we do this outside of the test suite, e.g. for all users in /usr/local/sbin/restart-tor or the nm hooks?

I converted/translated your ruby code into shell and am building an ISO for testing.

#9 Updated by anonym over 2 years ago

  • Assignee changed from anonym to kytv
  • QA Check changed from Info Needed to Dev Needed

This may be a good idea since it will help users that experience some types of bootstrap issues.

Also, I noted that my hack in wait_until_tor_is_working() can cause a race condition; if it just decided to restart Tor, but Tor bootstraps right after and our NetworkManager hook noticed that Tor is working and proceeds, then htpdate will fail since the SOCKSPort isn't there. Quite unlikely, but yeah.

If we do this from restart-tor (another option would be the Tor init script) we need to always use it when (re)starting Tor, so we have some places where it needs to be introduced: git grep "service tor \(re\)\?start" -- config/ features/

#10 Updated by kytv over 2 years ago

All right. If this ends up being deemed good, excellent. :) I'm quite pleased with how it's working thus far, but I'm still testing and not quite ready to submit it for review.

#11 Updated by kytv over 2 years ago

  • Assignee changed from kytv to intrigeri
  • % Done changed from 0 to 30
  • QA Check changed from Dev Needed to Ready for QA
  • Feature Branch set to kytv:test/9516-tor-bootstrap

I've been using an ISO built from this branch for more than a week while working on the other tickets in this group (#9517, #9518). I've also run the entire test suite a couple of times with an ISO built from this branch as well. Tor has not failed to bootstrap and while running with --debug it's been obvious that the changes in this branch have helped to ensure that Tor is ready more quickly than without it.

I'm confident that it will improve things for both testers and users.

#12 Updated by kytv over 2 years ago

  • Assignee changed from intrigeri to kytv
  • Target version changed from Tails_1.4.1 to Tails_1.5
  • QA Check changed from Ready for QA to Dev Needed

I'll take this one back to do a bit more testing.

#13 Updated by kytv over 2 years ago

  • % Done changed from 30 to 0
  • Feature Branch deleted (kytv:test/9516-tor-bootstrap)

Upon further testing I see that there are problems which are more frequently evident in the time_syncing.feature when using this branch.

#14 Updated by anonym over 2 years ago

kytv wrote:

Upon further testing I see that there are problems which are more frequently evident in the time_syncing.feature when using this branch.

Can you re-push kytv:test/9516-tor-bootstrap? I can probably provide some help with any tordate-related issues.

#15 Updated by intrigeri over 2 years ago

So, is the current theory to do that in the test suite only, on directly in Tails itself? If the latter, please update the ticket's description and category.

#16 Updated by kytv over 2 years ago

  • Target version changed from Tails_1.5 to Tails_1.6

#17 Updated by anonym over 2 years ago

anonym wrote:

kytv wrote:

Upon further testing I see that there are problems which are more frequently evident in the time_syncing.feature when using this branch.

Can you re-push kytv:test/9516-tor-bootstrap? I can probably provide some help with any tordate-related issues.

Ping?

#18 Updated by anonym over 2 years ago

anonym wrote:

Also, I noted that my hack in wait_until_tor_is_working() can cause a race condition; if it just decided to restart Tor, but Tor bootstraps right after and our NetworkManager hook noticed that Tor is working and proceeds, then htpdate will fail since the SOCKSPort isn't there. Quite unlikely, but yeah.

I believe we'd have similar race if implemented in Tails proper. We'd need to ensure that we block other things from starting to use Tor until the new bootstrap check has succeeded and we won't restart Tor. For the htpdate case above it could be implemented by making wait_for_working_tor (which is used to "queue" its start in 20-time.sh) also wait until this new bootstrap checker is done. However, there may be other (potentially future) places where this will cause issues. Not sure if there's any good general solution. Starting Tor without a SocksPort and only adding it once the bootstrap check is done? Sounds crazy.

#19 Updated by kytv over 2 years ago

  • Feature Branch set to kytv:test/9516-tor-bootstrap

Here's my WIP branch.

#20 Updated by kytv over 2 years ago

  • Assignee changed from kytv to anonym
  • QA Check changed from Dev Needed to Ready for QA

…and assigning to anonym for his perusal / comments / suggestions.

#21 Updated by kytv over 2 years ago

#22 Updated by kytv over 2 years ago

#23 Updated by anonym over 2 years ago

  • Assignee changed from anonym to kytv
  • QA Check changed from Ready for QA to Dev Needed

kytv wrote:

…and assigning to anonym for his perusal / comments / suggestions.

From commit 7f798ea:

+try_for 270 maybe_restart_tor

Maybe we should check this return status and show a helpful notification on failure? Previously we haven't done anything, we've just continue and "optimistically" hoped stuff would work (logging stuff only to syslog, which isn't very user friendly), but while we're thinking about these things, perhaps we can improve that? We could say, please try reconnecting to the network (á Window-style). :)

From commit 0101de3;

don't use restart-tor when using a bridge

We discussed this obscure commit over IRC. I pushed something that I think explains things better, and improves a few other things, in commit: e0c7c1fc in the branch named the same as yours (test/9516-tor-bootstrap) in Tails Git repo.

Does it make sense?

#24 Updated by anonym over 2 years ago

  • % Done changed from 0 to 60

#25 Updated by kytv over 2 years ago

  • % Done changed from 60 to 0

anonym wrote:

Does it make sense?

Absolutely.

#26 Updated by kytv over 2 years ago

  • % Done changed from 0 to 60

#27 Updated by anonym over 2 years ago

I still think there is a minimal risk for some crazy races with this
vs tails-upgrade-checker-wrapper and tails-security-check-wrapper
imagine this crazy scenario:
restart-tor just times out so it wants to restart Tor
but right after that Tor bootstraps, but before restart-tor restarts tor
one of these wrappers notices this, via /usr/local/sbin/tor-has-bootstrapped
and they start fetching something using Tor's SocksPort
and then restart-tor restarts Tor
=> the wrapper fails
all that must happen in a split second, so the "race window" is very tiny and unlikely
those wrappers are started via xdg
it would be better if they were started via NM hooks too
so blocking is ensured until Tor actually is working
I think we can bring much more sense to this using systemd
once we're based on Jessie

#28 Updated by intrigeri over 2 years ago

those wrappers are started via xdg
it would be better if they were started via NM hooks too
so blocking is ensured until Tor actually is working
I think we can bring much more sense to this using systemd
once we're based on Jessie

IIRC I've already ported those services to systemd on feature/jessie.

#29 Updated by kytv about 2 years ago

  • Target version changed from Tails_1.6 to Tails_1.7

#30 Updated by intrigeri about 2 years ago

  • Priority changed from Normal to High

The deployment this is blocking is happening.

#31 Updated by anonym about 2 years ago

  • Related to Bug #10238: Investigate why Tor sometimes is slow at or even fails bootstrapping added

#32 Updated by anonym about 2 years ago

I suggest adding the following debug info to help us debug bootstrapping issues if they continue happening with this branch:

--- a/features/support/helpers/misc_helpers.rb
+++ b/features/support/helpers/misc_helpers.rb
@@ -120,6 +120,14 @@ end
 def wait_until_tor_is_working
   try_for(270) { $vm.execute(
     '. /usr/local/lib/tails-shell-library/tor.sh; tor_is_working').success? }
+rescue Timeout::Error => e
+  c = $vm.execute("grep restart-tor /var/log/syslog")
+  if c.success?
+    debug_log("From syslog:\n" + c.stdout.sub(/^/, "  ")
+  else
+    debug_log("Nothing was syslog:ed about 'restart-tor'")
+  end
+  raise e
 end

 def convert_bytes_mod(unit)

WARNING! The above code has not been tested, but I think the intention is pretty clear.

#33 Updated by kytv about 2 years ago

  • Assignee changed from kytv to anonym
  • % Done changed from 60 to 70
  • QA Check changed from Dev Needed to Ready for QA
  • Feature Branch changed from kytv:test/9516-tor-bootstrap to kytv:feature/9516-restart-tor-on-bootstrap-failure

#34 Updated by kytv about 2 years ago

anonym wrote:

I suggest adding the following debug info

Added, cheers. :)

#35 Updated by anonym about 2 years ago

  • Assignee changed from anonym to kytv
  • QA Check changed from Ready for QA to Dev Needed

I haven't reviewed your latest branch state, but I just realized the following while doing something unrelated: in post_snapshot_restore_hook we do not have to do $vm.spawn("restart-vidalia") any more; now when we run restart-tor, it will start vidalia for us. In fact, us running it as well probably introduces some funny races. Could you have a look at this until I return to this branch in a few days?

#36 Updated by kytv about 2 years ago

  • Assignee changed from kytv to anonym
  • QA Check changed from Dev Needed to Ready for QA

restart-tor will call restart-vidalia if vidalia is already running. That's actually been the case for a while now.

a3b52392 (Tails developers 2012-11-15 16:56:26 +0100 57) if pgrep "\<vidalia\>" >/dev/null; then
05676de6 (Tails developers 2014-03-03 12:48:29 +0100 58)     killall -SIGKILL vidalia
d84bce20 (Tails developers 2012-11-17 16:05:09 +0100 59)     # Since Tor just restarted we wait for a while until the
d84bce20 (Tails developers 2012-11-17 16:05:09 +0100 60)     # ControlPort hopefully is up.
11faf45f (anonym           2015-05-03 16:29:07 +0200 61)     counter=0
d84bce20 (Tails developers 2012-11-17 16:05:09 +0100 62)     until [ "${counter}" -ge 10 ] || nc -z localhost 9051 2>/dev/null; do
d84bce20 (Tails developers 2012-11-17 16:05:09 +0100 63)         sleep 1
d84bce20 (Tails developers 2012-11-17 16:05:09 +0100 64)         counter=$((${counter}+1))
d84bce20 (Tails developers 2012-11-17 16:05:09 +0100 65)     done
a3b52392 (Tails developers 2012-11-15 16:56:26 +0100 66)     restart-vidalia
a3b52392 (Tails developers 2012-11-15 16:56:26 +0100 67) fi

But indeed it seems to me that we don't need to do it ourselves in the test suite. Changes pushed.

#37 Updated by kytv about 2 years ago

#38 Updated by kytv about 2 years ago

#39 Updated by anonym about 2 years ago

#40 Updated by anonym about 2 years ago

  • Status changed from In Progress to Fix committed
  • % Done changed from 70 to 100

#41 Updated by anonym about 2 years ago

  • Assignee deleted (anonym)
  • QA Check changed from Ready for QA to Pass

This branch definitely improves the situation for me.

It doesn't help in bridge mode, but in my experience there's less bootstrapping issues with bridges. Any way, I opened #10338 for that.

#42 Updated by intrigeri about 2 years ago

(Moved to #6094 since I don't see this bug with current devel. Sorry for the noise.)

#43 Updated by anonym about 2 years ago

  • File clock_gettime_monotonic.patch View added
  • Status changed from Fix committed to In Progress
  • Assignee set to kytv
  • % Done changed from 100 to 70
  • QA Check changed from Pass to Dev Needed

intrigeri wrote:

(Moved to #6094 since I don't see this bug with current devel. Sorry for the noise.)

I actually think you were on to something. References: #6094#note-60 and #6094#note-62

I think restart-tor is buggy when the clock is incorrect. Specifically, it uses the system clock (via date) both explicitly in maybe_restart_tor(), and implicitly in try_for(). Hence, if "tordate" (i.e. config/chroot_local-includes/etc/NetworkManager/dispatcher.d/20-time.sh) changes the time while we still run try_for 270 maybe_restart_tor in restart-tor, all sorts of craziness can happen.

We need a monotonic clock to implement these things in a robust manner vs time changes. A drop-in monotonic replacement for date +%s would be perl -w -MTime::HiRes=clock_gettime,CLOCK_MONOTONIC -E 'say clock_gettime(CLOCK_MONOTONIC)' | sed 's/\..*$//'. See the attached patch for a complete (but untested) fix! Thoughts?

(That said, try_for is still ugly since it will not timeout if something inside never returns, and it has the issue with "overtime". Perhaps we can implement it in perl too, using some timer/timeout facility + shell evaluation? But that would just be a bonus.)

#44 Updated by kytv about 2 years ago

anonym wrote:

We need a monotonic clock to implement these things in a robust manner vs time changes. A drop-in monotonic replacement for date +%s would be perl -w -MTime::HiRes=clock_gettime,CLOCK_MONOTONIC -E 'say clock_gettime(CLOCK_MONOTONIC)' | sed 's/\..*$//'. See the attached patch for a complete (but untested) fix! Thoughts?

Testing

#45 Updated by kytv about 2 years ago

  • Assignee changed from kytv to anonym
  • QA Check changed from Dev Needed to Ready for QA
  • Feature Branch changed from kytv:feature/9516-restart-tor-on-bootstrap-failure to kytv/feature/9516-restart-tor-on-bootstrap-failure

After a fair amount of testing I think the supplied patch is OK.

#46 Updated by anonym about 2 years ago

  • Status changed from In Progress to Fix committed
  • % Done changed from 70 to 100

#47 Updated by anonym about 2 years ago

  • Assignee deleted (anonym)
  • QA Check changed from Ready for QA to Pass

#48 Updated by anonym about 2 years ago

  • Status changed from Fix committed to Resolved

#49 Updated by anonym about 2 years ago

  • Related to Bug #10494: Retry htpdate when it fails added

#50 Updated by intrigeri about 2 years ago

  • Related to Bug #10537: Fix restart-tor and tor-has-bootstrapped semantics vs. test suite on Jessie added

#51 Updated by intrigeri over 1 year ago

  • Related to deleted (Bug #10494: Retry htpdate when it fails)

Also available in: Atom PDF