Bug #10238

Investigate why Tor sometimes is slow at or even fails bootstrapping

Added by anonym almost 2 years ago. Updated 5 months ago.

Status:
Rejected
Priority:
Elevated
Assignee:
-
Category:
-
Target version:
Start date:
09/23/2015
Due date:
% Done:

0%

QA Check:
Feature Branch:
feature/tor-bootstrap-failure-stats-for-10238
Type of work:
Research
Blueprint:
Easy:
Affected tool:

Description

I.e. why do we need something like #9516 (which clearly isn't a proper solution)? This is currently a huge issue for test suite robustness and, I imagine, frustrating for Tails users.

The expected outcome of this ticket is to report an excellent bug to the Tor developers.


Related issues

Related to Tails - Feature #9516: Restart Tor if bootstrapping stalls for too long Resolved 06/02/2015
Related to Tails - Feature #12411: Stop restarting tor if bootstrapping stalls Resolved 03/29/2017
Related to Tails - Bug #13541: Tor still sometimes fails to bootstrap in the test suite In Progress 07/31/2017

Associated revisions

Revision acb8baa3 (diff)
Added by anonym over 1 year ago

Save Tor's log when bootstrapping fails.

This will be useful when debugging #10238.

Refs: #10238

Revision 729baa2b (diff)
Added by anonym 6 months ago

Stop restarting tor if bootstrapping stalls.

Preliminary tests indicate that tor might have fixed the issues we
used to experience with the bootstrap process stalling and requiring a
restart to kickstart it.

Refs: #10238, #9516
Will-fix: #12411

History

#1 Updated by anonym almost 2 years ago

  • Related to Feature #9516: Restart Tor if bootstrapping stalls for too long added

#2 Updated by intrigeri almost 2 years ago

I suggest first trying to reproduce with 0.2.7.x.

#3 Updated by intrigeri almost 2 years ago

  • Description updated (diff)

#4 Updated by anonym almost 2 years ago

  • Blocks Bug #10497: wait_until_tor_is_working helper is fragile added

#5 Updated by anonym almost 2 years ago

intrigeri gave me a tip that perhaps Tor bug #17772 is related to this. I built Tor 0.2.7.5 with the fix applied and ran the following scenario 50 times for the patched vs vanilla Tor versions (atop feature/jessie, FWIW):

  Scenario: Tor bootstrapping
    Given I have started Tails from DVD without network and logged in
    When the network is plugged
    Then Tor bootstraps successfully

I gathered some data regarding restart-tor but could see no relevant difference in how many times it had to restart Tor while bootstrapping (due to "stalling"), or the time it took. In fact, the patched Tor behaved a tiny bit worse, but I attribute that to expected variance.

I did the same comparison for re-bootstrapping Tor after restoring from a snapshot, i.e.:

  Scenario: Tor re-bootstrapping
    I have started Tails from DVD and logged in and the network is connected

but the measurements showed the same thing: no noticeable difference.

#6 Updated by anonym almost 2 years ago

FWIW, I have some Git history showing that I definitely was aware of this issue on 2015-08-27 (and I am 99% sure I was aware of it at least two weeks before that), and that we experienced it with Tor 0.2.5.x but not with Tor 0.2.4.x. However, given some discussion on tails-dev@ it seems what I experienced back then was much worse than what I see now, and I do remember that back then it could vary wildly, seemingly dependent on the time of day. Anyway, let's assume it's the same issue.

Let's look at some Tor release dates (according to the Git tags) in order to have some basis for bisecting when the issue appeared:

  • tor-0.2.5.1-alpha: 2013-10-01
  • tor-0.2.5.2-alpha: 2014-02-13
  • tor-0.2.5.3-alpha: 2014-03-22
  • tor-0.2.5.4-alpha: 2014-04-25
  • tor-0.2.5.5-alpha: 2014-06-18
  • tor-0.2.5.6-alpha: 2014-07-28

We can disregard Tor releases after those, since I know I had seen the bootstrapping issues before they came out.

So we started installing Tor 0.2.5.x (specifically 0.2.5.5-alpha, given the above dates) in the feature/tor-0.2.5.x-alpha branch with commit c8d137a on 2014-07-06. Now here's something: I vaguely remember working on feature/tor-0.2.5.x-alpha without experiencing the issue, and then it suddenly kicked in. That could mean the issue was introduced in 0.2.5.6-alpha. This evidence is weak, though, and obviously some deeper investigation must be done.

#7 Updated by anonym almost 2 years ago

Perhaps related, please have a look at this Tor log excerpt for when Tor is restarted after an initial, successful bootstrap:

Dec 10 04:22:23.000 [notice] Tor 0.2.7.5 (git-6184c873e90d93b2) opening new log file.
Dec 10 04:22:23.732 [notice] Tor v0.2.7.5 (git-6184c873e90d93b2) running on Linux with Libevent 2.0.21-stable, OpenSSL 1.0.1k and Zlib 1.2.8.
Dec 10 04:22:23.732 [notice] Tor can't help you if you use it wrong! Learn how to be safe at https://www.torproject.org/download/download#warning
Dec 10 04:22:23.733 [notice] Read configuration file "/usr/share/tor/tor-service-defaults-torrc".
Dec 10 04:22:23.733 [notice] Read configuration file "/etc/tor/torrc".
Dec 10 04:22:23.735 [notice] Opening Socks listener on 127.0.0.1:9050
Dec 10 04:22:23.735 [notice] Opening Socks listener on 127.0.0.1:9061
Dec 10 04:22:23.736 [notice] Opening Socks listener on 127.0.0.1:9062
Dec 10 04:22:23.736 [notice] Opening Socks listener on 127.0.0.1:9150
Dec 10 04:22:23.736 [notice] Opening DNS listener on 127.0.0.1:5353
Dec 10 04:22:23.736 [notice] Opening Transparent pf/netfilter listener on 127.0.0.1:9040
Dec 10 04:22:23.736 [notice] Opening Control listener on 127.0.0.1:9051
Dec 10 04:22:23.737 [notice] Opening Control listener on /var/run/tor/control
Dec 10 04:22:23.000 [notice] Parsing GEOIP IPv4 file /usr/share/tor/geoip.
Dec 10 04:22:23.000 [notice] Parsing GEOIP IPv6 file /usr/share/tor/geoip6.
Dec 10 04:22:33.000 [notice] Bootstrapped 0%: Starting
Dec 10 04:22:34.000 [notice] Bootstrapped 80%: Connecting to the Tor network
Dec 10 04:22:34.000 [notice] Signaled readiness to systemd
Dec 10 04:22:34.000 [notice] Bootstrapped 85%: Finishing handshake with first hop
Dec 10 04:22:34.000 [notice] Bootstrapped 90%: Establishing a Tor circuit
Dec 10 04:22:34.000 [notice] Tor has successfully opened a circuit. Looks like client functionality is working.
Dec 10 04:22:34.000 [notice] Bootstrapped 100%: Done

Notice how Tor waits for exactly 10 seconds before starting to bootstrap (0%). That seems wasteful, and it probably doesn't play well with the current 20-second timeout we have for restarting Tor (in restart-tor) when there is no bootstrap progress.
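The 10-second gap can be read straight out of a log like the one above; here is a minimal sketch (not part of our tooling -- the log lines are copied from the excerpt above, and real use would read Tor's actual log file):

```python
# Measure the delay between Tor finishing its GEOIP parsing and the
# first "Bootstrapped 0%" entry, using the log excerpt quoted above.
from datetime import datetime

LOG = """\
Dec 10 04:22:23.000 [notice] Parsing GEOIP IPv4 file /usr/share/tor/geoip.
Dec 10 04:22:23.000 [notice] Parsing GEOIP IPv6 file /usr/share/tor/geoip6.
Dec 10 04:22:33.000 [notice] Bootstrapped 0%: Starting
"""

def timestamp(line):
    # Tor log lines start with e.g. "Dec 10 04:22:23.000" (19 characters).
    return datetime.strptime(line[:19], "%b %d %H:%M:%S.%f")

lines = LOG.splitlines()
geoip_done = timestamp([l for l in lines if "GEOIP IPv6" in l][-1])
bootstrap_start = timestamp([l for l in lines if "Bootstrapped 0%" in l][0])
delay = (bootstrap_start - geoip_done).total_seconds()
print(delay)  # → 10.0
```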

#8 Updated by anonym almost 2 years ago

  • % Done changed from 0 to 10

Interesting. Disabling the seccomp sandbox eliminates this 10 second delay. Hmm. While that is an issue in itself, I dare say I may be on to something more, i.e. the root of these bootstrapping issues.

Of course, we didn't enable the sandbox until early this year, not in the summer of 2014 when I started experiencing issues. But perhaps it was early this year when the rest of you started experiencing these bootstrapping issues? After all, the issues I had in 2014 were much worse than what I have now (perhaps it was my ISP, as I theorized at some point?), so maybe these are two different things?

Also, I've noticed that the bootstrapping issues seem not to occur in bridge mode, where the sandboxing is disabled. Investigating further...

#9 Updated by intrigeri almost 2 years ago

Hint: Jenkins can give you stats with a different ISP than yours (e.g. push a branch that reverts flagging all this stuff as fragile and disables Sandboxing, or whatever is useful :)

#10 Updated by anonym almost 2 years ago

intrigeri wrote:

Hint: Jenkins can give you stats with a different ISP than yours (e.g. push a branch that reverts flagging all this stuff as fragile and disables Sandboxing, or whatever is useful :)

The emphasis on "*was*" in that parenthesis is crucial, because I do not think I have the same issue any more...

#11 Updated by anonym almost 2 years ago

Here's a brain dump and stats dump of an experiment I did using the automated test suite:

I've run 50 scenarios each of the following:

  • Tor's seccomp Sandbox enabled
    • Boot Tails from scratch and let Tor bootstrap
    • Restore from a snapshot where Tor has bootstrapped, let Tor re-bootstrap
  • Tor's seccomp Sandbox disabled
    • Boot Tails from scratch and let Tor bootstrap
    • Restore from a snapshot where Tor has bootstrapped, let Tor re-bootstrap
Notably, I also:
  • disabled our restart-tor-on-20-second-timeout thing in restart-tor to see the default Tor behaviour
  • reset the iptables counters to be able to see the network activity (I'll be looking at the first rule of the INPUT and OUTPUT chains only, assuming Tor is the only process using the network, which should be the case -- we are not even using the ControlPort at this stage)
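The counter check in the last bullet can be sketched like this; the chain listing below is a made-up sample (real output comes from `iptables -L <chain> -v -n -x` and needs root):

```python
# Parse packet/byte counters from `iptables -L <chain> -v -n -x` output;
# the experiment reads the first rule of the INPUT and OUTPUT chains.
# SAMPLE is a fabricated listing for illustration, not real output.
SAMPLE = """\
Chain INPUT (policy DROP 0 packets, 0 bytes)
    pkts      bytes target     prot opt in     out     source               destination
       0        0 ACCEPT     all  --  *      *       0.0.0.0/0            0.0.0.0/0            state RELATED,ESTABLISHED
"""

def first_rule_counters(listing):
    # Skip the chain header and the column header; the first rule
    # is the third line, with packets and bytes as the first fields.
    fields = listing.splitlines()[2].split()
    return int(fields[0]), int(fields[1])  # (packets, bytes)

pkts, nbytes = first_rule_counters(SAMPLE)
print(pkts, nbytes)  # → 0 0
```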

Some observations:

Tor's seccomp sandbox

The seccomp sandboxing seems to affect the time bootstraps take -- looking at the average time spent in wait_until_tor_is_working(), it was ~7 seconds longer when enabled.

When sandboxing is enabled the Tor log looks just like in the one I posted above (#10238#note-7) -- Tor does nothing for the first 10 seconds, then it starts bootstrapping. I monitored the iptables counters, and indeed, nothing was received on the network during those 10 seconds, but stuff was sent. Perhaps the sandbox prevents packets from being read for 10 seconds and hence also delaying the bootstrapping log entries? It's the same both for the initial bootstrap and subsequent rebootstraps.

I have no good explanation for why the average time is bumped by just 7 seconds, and not 10. Anyway, sandboxing has a clear issue, and I'll report this, even if it's not the full solution.

Variance in bootstrap times

Interestingly, there's quite a bit less variance when *re*bootstrapping: without sandboxing it took 6 seconds every time except three runs where it took 4, 5 and 9; with sandboxing it took 14 every time except three runs where it took 13, 13 and 29.
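For reference, those re-bootstrap numbers can be turned into summary statistics; a sketch assuming 50 runs per case, with the common value filled in for the non-outlier runs:

```python
# Rough summary statistics for the re-bootstrap timings quoted above,
# assuming 50 runs per case (47 at the common value plus 3 outliers).
from statistics import mean, pstdev

no_sandbox = [6] * 47 + [4, 5, 9]      # seconds, sandbox disabled
sandbox    = [14] * 47 + [13, 13, 29]  # seconds, sandbox enabled

print(round(mean(sandbox) - mean(no_sandbox), 2))  # → 8.26
print(round(pstdev(no_sandbox), 2), round(pstdev(sandbox), 2))
```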

For initial bootstraps there's more variance. Without sandboxing it actually took <10 seconds only half of the time, and <20 seconds in all but a few outliers: 72, 131, 136 and 235 seconds. With sandboxing the numbers looked very similar, but with roughly 8 seconds added.

From the log I can see that for the 72-second run it stalled at:

Bootstrapped 10%: Finishing handshake with directory server

for 65 seconds, then it got a few:
channel_tls_process_certs_cell(): Got some good certificates from 194.109.206.212:443: Authenticated it.

and bootstrapping finished in a few seconds.

For the 131, 136 and 235 second runs there were no instances of good old:

Tried for 120 seconds to get a connection to [scrubbed]. Giving up. (waiting for circuit)

it was just an exact 120-second delay between some of the bootstrap progress entries (e.g. 62% and 68%). For the 235-second run I also saw
No circuits are opened. Relaxed timeout for circuit 2 (a General-purpose client 1-hop circuit in state doing handshakes with channel state open) to 60000ms. However, it appears the circuit has timed out anyway. 0 guards are live.

after ~60 seconds of inactivity, which is new.

Failures

Only one scenario out of the total 200 failed, so I will ignore that as a statistical fluke. Clearly, bootstrapping works pretty well without us restarting Tor like we do now. In fact, it's completely conceivable that restarting makes the situation worse if a slow directory mirror/authority is used.

A new run that doesn't look like any of the above

While writing this I reran this case:
  • Tor's seccomp Sandbox enabled
    • Restore from a snapshot where Tor has bootstrapped, let Tor re-bootstrap

and now it, of course, behaves very differently from the runs above. The average time increased by 100 seconds, and most scenarios have this in the Tor log:

Tried for 120 seconds to get a connection to amnesia:0. Giving up. (waiting for circuit)

Yay... (note that "amnesia:0" thing (I have SafeLogging 0), which seems odd -- any ideas?). Several scenarios failed, which is highly unlike the previous runs. Anyway, this horrendous run is, I guess, why we introduced the timeout-restart thing in restart-tor. This is the big bug with Tor's bootstrapping that we need to identify and solve.

BTW, I started this run two minutes past a new hour, so I'm wondering if it could have something to do with the consensus having just been refreshed. Also, right after this run I started a new one that worked perfectly again, except for the last four scenarios, which all took 120 seconds longer than normal (and we had not yet crossed into a new hour at that point, BTW).

Perhaps this new run invalidates a lot of what I wrote above, except the 10 seconds of idling that sandboxing introduces.

#12 Updated by anonym almost 2 years ago

When sandboxing is enabled the Tor log looks just like in the one I posted above (#10238#note-7) -- Tor does nothing for the first 10 seconds, then it starts bootstrapping. I monitored the iptables counters, and indeed, nothing was received on the network during those 10 seconds, but stuff was sent.

First let me make clear that my iptables counter-based approach is pretty fault-prone, since I can only look at the state (RELATED,ESTABLISHED) ACCEPT rules for INPUT and OUTPUT. What I can see is that something is sent (counters are non-zero on OUTPUT) but nothing is received (counters are zero on INPUT). It could, however, be some other application that is responsible, or it could be Tor but with local (loopback) traffic rather than an attempt to contact the Tor network. Indeed, it looks like the latter:

Here's the debug level log for when Tor is seemingly idling when the sandbox is enabled:

Dec 12 14:56:57.000 [notice] Parsing GEOIP IPv4 file /usr/share/tor/geoip.
Dec 12 14:56:58.000 [notice] Parsing GEOIP IPv6 file /usr/share/tor/geoip6.
Dec 12 14:56:58.000 [info] crypto_global_init(): NOT using OpenSSL engine support.
Dec 12 14:56:58.000 [info] evaluate_evp_for_aes(): This version of OpenSSL has a known-good EVP counter-mode implementation. Using it.
Dec 12 14:56:58.000 [info] sandbox_getaddrinfo(): (Sandbox) getaddrinfo succeeded.
Dec 12 14:57:08.000 [info] sandbox_getaddrinfo(): (Sandbox) getaddrinfo failed.
Dec 12 14:57:08.000 [info] sandbox_getaddrinfo(): (Sandbox) getaddrinfo succeeded.
Dec 12 14:57:08.000 [notice] Bootstrapped 0%: Starting

Notice the 10 seconds of idling starting at 14:56:58. When the idling stops, Tor logs that a sandboxed getaddrinfo has failed. I ran tcpdump and learned that Tor tried to resolve the amnesia hostname and, given our DNS setup, used its own DNSPort (5353) for that, which maybe isn't so sane.
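The pre-resolution step is conceptually similar to calling getaddrinfo up front and reporting success or failure, like the "(Sandbox) getaddrinfo succeeded/failed" lines in the log above; a hedged sketch in plain Python, not Tor's actual sandbox_getaddrinfo() code:

```python
# Resolve a hostname up front and report the outcome, mirroring the
# "(Sandbox) getaddrinfo succeeded/failed" log lines. Illustration
# only -- this is not Tor's sandbox_getaddrinfo() implementation.
import socket

def try_resolve(hostname):
    """Return the IPv4 addresses for hostname, or [] on failure."""
    try:
        infos = socket.getaddrinfo(hostname, None, socket.AF_INET)
        return sorted({info[4][0] for info in infos})
    except socket.gaierror:
        return []

print(try_resolve("localhost"))  # typically ['127.0.0.1']
```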

Strangely enough, the idleness can be cured by doing either of:

  • disabling the firewall, or
  • configuring a working DNS on the LAN in /etc/resolv.conf and allowing its use in the firewall

but the getaddrinfo failure remains in the log. The first case (disabling the firewall) is strange, because the debian-tor user has no restrictions on the (IPv4) network. Well, it's blocked from LAN DNS, but in that case no LAN DNS is configured. I wondered if the tor process still hadn't changed its uid, so I added a rule giving full access to root, but that didn't help. Nothing is logged as "Dropped" by our firewall, though... hm. Also, I tried disabling AppArmor in case access to /etc/hosts was blocked, but it didn't change anything.

Without the sandbox, I see no DNS packets and no getaddrinfo failures logged. So it seems Tor has issues when the sandbox is used and there is either no system resolver or the resolver is Tor's own DNSPort; it seems the sandbox itself needs to resolve that hostname.

#13 Updated by anonym over 1 year ago

  • Priority changed from Normal to Elevated
  • Target version set to Tails_2.3

I'll find a way to easily reproduce this in a Jessie VM (at least outside Tails, which does too many crazy things in this area) and report an upstream bug on the Tor tracker.

#14 Updated by anonym over 1 year ago

  • Related to Bug #10494: Retry htpdate when it fails added

#15 Updated by anonym over 1 year ago

  • Related to deleted (Bug #10494: Retry htpdate when it fails)

#16 Updated by anonym over 1 year ago

anonym wrote:

I'll find a way to easily reproduce this in a Jessie VM (at least outside Tails, which does too many crazy things in this area) and report an upstream bug on the Tor tracker.

Done and the cause is identified. Hopefully we'll get a fix backported.

#17 Updated by intrigeri over 1 year ago

Done and the cause is identified.

Wow, amazing!

#18 Updated by anonym over 1 year ago

  • Target version changed from Tails_2.3 to Tails_2.4

#19 Updated by intrigeri over 1 year ago

  • Status changed from Confirmed to In Progress

What this ticket was about has been done a couple months ago, and so it can be closed, no?

#20 Updated by anonym over 1 year ago

  • Target version changed from Tails_2.4 to Tails_2.6

intrigeri wrote:

What this ticket was about has been done a couple months ago, and so it can be closed, no?

Nope. I've found and fixed one issue (Tor bug #18548), but there is still something unreliable about it; see #10238#Variance-in-bootstrap-times and #10238#Failures. Sadly, the fixed issue is only about delay, not these crazy problems. :(

Anyway, let's not kid ourselves; I won't have time for any more of this for a while => postponing. Still, the Chutney stuff seems to have improved the test suite side of this quite significantly.

#21 Updated by intrigeri over 1 year ago

Nope. I've found and fixed one issue [...] but there is still something unreliable about it

So for the second one, "The expected outcome of this ticket is to report an excellent bug to the Tor developers" is not done, right? Sorry, I missed that part :)

#22 Updated by intrigeri over 1 year ago

  • Blocks deleted (Bug #10497: wait_until_tor_is_working helper is fragile)

#23 Updated by anonym over 1 year ago

  • % Done changed from 10 to 20
  • Feature Branch set to feature/tor-bootstrap-failure-stats-for-10238

Pushed (untested) feature/tor-bootstrap-failure-stats-for-10238 which I hope can help us collect some data when Tor fails to bootstrap.

#24 Updated by bertagaz about 1 year ago

anonym wrote:

Pushed (untested) feature/tor-bootstrap-failure-stats-for-10238 which I hope can help us collect some data when Tor fails to bootstrap.

Sadly it disappeared, so we lost the stats, but I've refreshed the branch and pushed it again to start collecting stats again. I think I'll add a patch to save the Chutney logs too, and maybe one to get more detailed logs from the Tails Tor client.

#25 Updated by anonym about 1 year ago

  • Target version changed from Tails_2.6 to Tails_2.9.1

#26 Updated by anonym 9 months ago

  • Target version changed from Tails_2.9.1 to Tails 2.10

#27 Updated by anonym 9 months ago

  • Target version changed from Tails 2.10 to Tails_2.12

I'll try to get help from the pros during the March 2017 Tor dev meeting, which I intend to attend.

#28 Updated by intrigeri 6 months ago

  • Related to Feature #12411: Stop restarting tor if bootstrapping stalls added

#29 Updated by anonym 5 months ago

  • Status changed from In Progress to Rejected
  • Assignee deleted (anonym)
  • % Done changed from 20 to 0

I cannot reproduce this any more, so I'll optimistically close this ticket.

#30 Updated by bertagaz about 2 months ago

  • Related to Bug #13541: Tor still sometimes fails to bootstrap in the test suite added
