Bug #15162

Remove rsync.torproject.org from the mirrors synchronization chain

Added by intrigeri 9 months ago. Updated about 1 month ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Infrastructure
Target version:
Start date:
01/10/2018
Due date:
% Done:

100%

QA Check:
Pass
Feature Branch:
Type of work:
Communicate
Blueprint:
Starter:
Affected tool:

Description

Context: mirrors in our download pool rsync our data from rsync.torproject.org, which itself rsyncs it from rsync.lizard, which is not publicly available.

As we've noticed in the last 2 days, the rsync.torproject.org service lacks a maintainer, and the people who maintain related services don't have our needs in mind. We've been lucky that the breakage we experienced has been resolved promptly, thanks to weasel making time for it on the spot, but if weasel had been AFK on holiday we would have been in a really bad situation regarding the release.

I think the reasons why we did not want to serve our data over rsync to mirrors are obsolete and I propose we remove it from the loop, make our own rsync server publicly accessible and have mirror operators migrate their cronjob from rsync.tpo to it.

Advantages:

  • Our rsync sync' chain is simplified.
  • We control the entire rsync mirror sync chain. As a consequence, improvements like #11152 are easier to implement.
  • We don't have to maintain the rsync.torproject.org service ourselves: as weasel told us yesterday, it needs a maintainer if we want it to remain up (like any TPO service as per their new policy).
  • This gives us most of #15159 for free: we already monitor that our rsync server works.

Cons (real or potential):

  • More bandwidth costs for lizard hosting (that is sponsored by Tor):
    • Legit usage: currently we have 40 mirrors so each year that's 40 * number of ISOs we publish * 1.2 GB. lizard has pushed 124 GiB since it was rebooted 4 days ago so I think the impact is totally negligible => non-issue.
    • Abusive usage: we already make a bunch of ISO images available publicly over HTTPS from lizard (https://nightly.tails.boum.org/) so whoever wants to pull tons of data from that system can already do it => non-issue.
  • Upload bandwidth usage peaks when publishing a new ISO: lizard will need to upload quite some data in the hour that follows the time when we add a new ISO to our rsync server. Assuming that's 2 GiB per mirror (ISO + a couple IUKs), with 40 mirrors, that's more than what our 100 Mbps link can sustain. Potential consequences:
    • Our link is saturated and other services suffer. If this happens we can cap the upload bandwidth of the rsync daemon (or VM, whichever is easiest).
    • It will happen that a mirror runs a second instance of the rsync pull cronjob while the previous one has not finished yet. Oops! I don't know how rsync handles this. We should wrap the documented rsync cronjob with flock and have mirror operators apply this change at the same time as they switch the rsync server URL in that same cronjob. I'm tempted to document a systemd.timer unit as well since they're not affected by such double-run issues, but that's bonus and not necessary.

I volunteer to implement the needed changes (sysadmin, doc) and to coordinate this migration with mirror operators.

Once this is completed and we're happy with how it works, we'll ask the tpo admins to:

  1. disable our component on their rsync server
  2. adjust their rsync client cronjob to work just like every other mirror's

… and then we can add their mirror to the pool.


Related issues

Related to Tails - Feature #15159: Monitoring of our mirrors' ability to sync Rejected 01/09/2018
Related to Tails - Bug #15687: Have lizard plugged to a Gbps switch Resolved 06/27/2018
Blocks Tails - Bug #11152: Have SSL on our rsync communications with mirrors Confirmed 02/21/2016

Associated revisions

Revision 8e21f6d8 (diff)
Added by intrigeri 5 months ago

Add a level of DNS indirection between mirrors and the rsync server they pull from (refs: #15162).

This will allow us to control when we switch from rsync.torproject.org
(that mirrors.rsync.tails.boum.org currently points to) to rsync.tails.boum.org.

Revision 0f98900d (diff)
Added by intrigeri 5 months ago

Ensure mirrors don't run more than one rsync command concurrently (refs: #15162)

History

#1 Updated by intrigeri 9 months ago

  • Assignee set to u
  • Target version set to Tails_3.5

Let's make a decision quickly (hence target version = 3.5) and then I can proceed and start the migration (that will take time), aiming at completing it in the next few months.

I'd like input from the mirrors & sysadmins teams, and ideally I'd like anonym's opinion. Please reassign to one of the other watchers who did not comment yet once you did :)

#2 Updated by intrigeri 9 months ago

  • Related to Feature #15159: Monitoring of our mirrors' ability to sync added

#3 Updated by intrigeri 9 months ago

  • Related to Bug #11152: Have SSL on our rsync communications with mirrors added

#4 Updated by u 9 months ago

  • Assignee changed from u to anonym

Sounds like a great plan to me.

If the rsync server could be located elsewhere, might this mitigate the link saturation issue?

I'm happy to help with coordinating this with mirror operators.

@anonym, please reassign to @groente, @sajolida, @bertagaz.

#5 Updated by intrigeri 9 months ago

If the rsync server could be located elsewhere, might this mitigate the link saturation issue?

That's a possibility but identifying this "elsewhere", finding a suitable arrangement and then setting up technical things there has a cost that might not be worth the benefits. I mean, if we're ready to do that, we can as well take responsibility for the rsync service hosted at torproject.org and we're done (except in this case we don't get a number of the benefits of the change I've proposed).

Also, sorry I forgot to mention it but the limitation is actually caused by the network switch lizard is plugged on: currently we're plugged on a 100Mbps switch but there's a possibility to move to a 1Gbps one. Last time I talked with SeaCCP folks it was more complicated than just moving a network cable, but IIRC they mentioned that if we ever need more than 100Mbps this option could be considered.

I'm happy to help with coordinating this with mirror operators.

:)

#6 Updated by anonym 8 months ago

  • Target version changed from Tails_3.5 to Tails_3.6

#7 Updated by anonym 8 months ago

  • Assignee changed from anonym to bertagaz
  • QA Check set to Info Needed

intrigeri wrote:

  • Upload bandwidth usage peaks when publishing a new ISO: lizard will need to upload quite some data in the hour that follows the time when we add a new ISO to our rsync server. Assuming that's 2 GiB per mirror (ISO + a couple IUKs), with 40 mirrors, that's more than what our 100 Mbps link can sustain. Potential consequences:

I am a bit concerned:

40 * ~"ISOs + IUKs"  / 100 mbit/second =
40 * 2*1024**3 bytes / (100*1024**2 / 8) bytes/second =
6553.6 seconds =
almost two hours

which OTOH isn't so much worse than the delay imposed by first having to sync to archive.tpo like we currently do. I feel like we need to test this to know how this will work in practice, and I suppose an RC like 3.6~rc1 is the ideal time to battle test this. How do you feel about that, bert?
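For reference, the estimate above can be reproduced with a short Python sketch. The function name and parameters are illustrative, not from the ticket. It follows the same binary-unit convention as the calculation above (GiB and Mibit) and assumes the link is used entirely by the mirror sync, with no protocol overhead.

```python
def sync_seconds(mirrors, gib_per_mirror, link_mibit_per_s):
    """Time needed to push gib_per_mirror GiB to each mirror over one shared link."""
    total_bytes = mirrors * gib_per_mirror * 1024**3
    bytes_per_second = link_mibit_per_s * 1024**2 / 8
    return total_bytes / bytes_per_second


# Figures from the ticket: 40 mirrors, ~2 GiB each, 100 Mbps uplink.
seconds = sync_seconds(40, 2, 100)
print(f"{seconds:.1f} s = {seconds / 3600:.2f} h")
```

This yields the same 6553.6 seconds as the manual calculation. Changing the last argument gives a rough idea of the effect of a bandwidth cap or a faster link.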

#8 Updated by intrigeri 7 months ago

  • Description updated (diff)
  • Assignee changed from bertagaz to groente
  • Priority changed from Normal to Elevated

anonym wrote:

I am a bit concerned:

I forgot to mention: if our 100Mbps uplink really becomes a problem, it's an option to upgrade to a 1Gbps one but IIRC it requires quite some work from our friends at Riseup.

I feel like we need to test this to know how this will work in practice and I suppose an RC like 3.6~rc1 is the ideal time to battle test this.

FYI, there's no cheaper way of testing this than to implement the change in production and we can't revert it easily+quickly: implementing this change will take some sysadmin + doc time and then about a month to track mirror operators until they've all adjusted their config. Reverting this change (if the test is not successful) will take about a month as well until mirror operators have adjusted their config again. Given the final release is generally published less than 2 weeks after the RC, a RC is not a particularly good time to battle test this, and actually there's no particularly good time because 1+1 = 2 months does not fit between two stable releases. So all in all, the cost of testing is so high that I don't think we can sensibly block our decision on testing results: if we decide to go ahead with this plan, we'd better be confident it will work.

How do you feel about that, bert?

We've already got input from anonym, intrigeri and u. I also want input from sajolida and another sysadmin (but not necessarily bertagaz, especially since he's too busy elsewhere to be responsive here). groente?

#9 Updated by intrigeri 7 months ago

(groente, please reassign to sajolida once you're done commenting :)

#10 Updated by groente 7 months ago

  • Assignee changed from groente to sajolida

I'm a bit uncomfortable with the idea of deliberately saturating lizard's uplink, but if we cap the rsync VM to say 70mbit, I can live with the extra hour it will take to sync everything. So, on the condition of capping the VM bandwidth, I'm good with removing torprojects from the chain and distributing straight from rsync.lizard.

#11 Updated by intrigeri 7 months ago

I'm a bit uncomfortable with the idea of deliberately saturating lizard's uplink, but if we cap the rsync VM to say 70mbit, I can live with the extra hour it will take to sync everything. So, on the condition of capping the VM bandwidth, I'm good with removing torprojects from the chain and distributing straight from rsync.lizard.

Makes sense to me!

#12 Updated by sajolida 6 months ago

  • Target version changed from Tails_3.6 to Tails_3.7

#13 Updated by intrigeri 6 months ago

We're having issues again during the 3.6.1 release with rsync.tpo so I'm even more convinced we should go ahead with this idea.

Note for later: we need to wrap with flock/lckdo the rsync cronjob we ask mirror operators to use. Otherwise, if their download does not complete by the time a second rsync is started concurrently, the second one will delete the temporary file where the first rsync is downloading. If this situation happens it's likely that many mirrors are affected at the same time, preventing each other from ever completing the sync'.
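A minimal sketch of the locking idea, written in Python for illustration only. It mimics what `flock -n` does for a shell cronjob: take a non-blocking exclusive lock and skip the run if a previous run still holds it. The function name, the lock path and the rsync arguments in the comment are hypothetical; the command documented for mirror operators may differ.

```python
import fcntl
import subprocess


def run_exclusively(lock_path, command):
    """Run command unless another run holds the lock.

    Returns the command's exit code, or None if the run was skipped.
    """
    with open(lock_path, "w") as lock_file:
        try:
            # Non-blocking exclusive lock: fail at once instead of queueing.
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return None
        # The lock is released when lock_file is closed at the end of the block.
        return subprocess.run(command).returncode


# Hypothetical usage from a cronjob (server, module and paths are placeholders):
# run_exclusively("/var/lock/tails-mirror.lock",
#                 ["rsync", "-rt", "--delete", "rsync://<server>/<module>/", "/srv/mirror/"])
```

Skipping the second run, rather than waiting for the lock, keeps slow syncs from piling up; the next scheduled run picks up whatever is still missing.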

#14 Updated by geb 6 months ago

intrigeri wrote:

FYI, there's no cheaper way of testing this than to implement the change in production and we can't revert it easily+quickly: implementing this change will take some sysadmin + doc time and then about a month to track mirror operators until they've all adjusted their config. Reverting this change (if the test is not successful) will take about a month as well until mirror operators have adjusted their config again

If I may, I would like to suggest using a generic DNS name, like rsync.tails.boum.org (which already seems to point to lizard), instead of pointing directly to a given host. Using a DNS CNAME would make it possible to quickly redirect to any given host, as long as the path is kept consistent. It would help to revert the change quickly and avoid the need to synchronize with mirror operators in case of a further change.

If you choose to consider this idea, it may be relevant to use two DNS names:
- one dedicated to the main source of releases (currently rsync.lizard.t.b.o / rsync.t.b.o)
- one dedicated to mirrors for pulling (currently rsync.torproject.org, could be mirrors.rsync.t.b.o)
This would keep flexibility and avoid hardcoding that mirrors should fetch their updates from the main source of releases, especially considering that some mirrors already offer an rsync service that could be leveraged to reduce the traffic going directly to the main source of releases.
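To illustrate why the extra DNS name helps, here is a toy Python model of CNAME indirection. It does not query real DNS; the dictionary stands in for the zone, and the function name is made up for the example.

```python
def resolve(name, cname_records):
    """Follow CNAME records until reaching a name that has none."""
    seen = set()
    while name in cname_records:
        if name in seen:
            raise ValueError(f"CNAME loop at {name}")
        seen.add(name)
        name = cname_records[name]
    return name


# Mirrors always pull from the alias; only its target changes.
records = {"mirrors.rsync.tails.boum.org": "rsync.torproject.org"}
before = resolve("mirrors.rsync.tails.boum.org", records)

# Switching (or rolling back) is a single record update on our side.
records["mirrors.rsync.tails.boum.org"] = "rsync.tails.boum.org"
after = resolve("mirrors.rsync.tails.boum.org", records)

print(before, "->", after)
```

In this model, the mirror operators' cronjob never changes after the initial migration: the switch and any rollback happen only in the record that the alias points to.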

#15 Updated by sajolida 5 months ago

  • Assignee changed from sajolida to intrigeri
  • QA Check deleted (Info Needed)

I don't have much to add...

#16 Updated by intrigeri 5 months ago

  • Status changed from Confirmed to In Progress
  • Assignee changed from intrigeri to u
  • % Done changed from 0 to 20
  • Type of work changed from Discuss to Communicate

I've:

  • capped the outbound bandwidth used by the rsync VM: virsh domiftune rsync 52:54:00:cb:cb:04 --outbound 70000,80000,256 --config (tested with lower values, works fine)
  • made our rsync daemon serve our data publicly via the "amnesia-archive" module (same name as on rsync.tpo for a smoother transition and easy rollback using geb's idea)
  • had mirrors.rsync.tails.b.o be a CNAME to rsync.torproject.org (geb's idea again)
  • adjusted our mirroring doc and Puppet class to point to mirrors.rsync.tails.b.o instead of rsync.tpo
  • added locking (flock/lckdo) to the command in our mirroring doc and Puppet class

Next steps:

  1. ask mirror operators to apply these changes to their production config; we can't know when they've applied them except if they tell us so we need to track their answers; then wait until enough mirrors have migrated and at some point disable mirrors that have not migrated
  2. at some well chosen time, press the big red button: point the mirrors.rsync.tails.b.o CNAME to rsync.tails.b.o (lizard)
  3. battle test this e.g. for a RC (or by uploading whatever big file whenever we want)
  4. ask SeaCCP if lizard can be plugged on a 1Gbps switch

u, you mentioned earlier that you could help with the coordination with mirror operators. Can you handle the first next step listed above? If yes, then reassign to me once you're done and I'll handle the next steps. If not, just reassign to me and I'll handle it.

#17 Updated by intrigeri 5 months ago

intrigeri wrote:

Next steps:
[…]
ask SeaCCP if lizard can be plugged on a 1Gbps switch

Done!

#18 Updated by bertagaz 5 months ago

  • Target version changed from Tails_3.7 to Tails_3.8

#19 Updated by intrigeri 4 months ago

intrigeri wrote:

intrigeri wrote:

Next steps:
[…]
ask SeaCCP if lizard can be plugged on a 1Gbps switch

Done!

So this looks doable. I'm discussing the timeline with SeaCCP folks.

#20 Updated by intrigeri 4 months ago

intrigeri wrote:

u, you mentioned earlier that you could help with the coordination with mirror operators. Can you handle the first next step listed above? If yes, then reassign to me once you're done and I'll handle the next steps. If not, just reassign to me and I'll handle it.

Ping? I'm fine with handling it myself, just let me know if you can't handle this soonish :)

#21 Updated by u 4 months ago

intrigeri wrote:

intrigeri wrote:

u, you mentioned earlier that you could help with the coordination with mirror operators. Can you handle the first next step listed above? If yes, then reassign to me once you're done and I'll handle the next steps. If not, just reassign to me and I'll handle it.

Ping? I'm fine with handling it myself, just let me know if you can't handle this soonish :)

Sorry, this somehow got lost from my list. I've sent out the email, and we can possibly keep track of replies directly in mirrors.json, using some sort of keyword in the "notes" line, for example "rsync switched". Or something else if you have a better idea.

#22 Updated by u 4 months ago

  • Assignee changed from u to intrigeri

#23 Updated by intrigeri 4 months ago

I've sent out the email,

Excellent!

and we can possibly keep track of replies directly in mirrors.json, using some sort of keyword in the "notes" line, for example "rsync switched".

Sounds good to me, let's do it this way (I don't think it's worth adding a field to the JSON schema for this so piggy-backing on "notes" is probably the best we can do).
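A sketch of how such a keyword could be used to list which mirrors still need a ping. Only the "notes" field and the "rsync switched" keyword come from this thread; the top-level "mirrors" list, the "url_prefix" key and the sample entries are assumptions made for the example.

```python
import json

KEYWORD = "rsync switched"


def split_by_migration(mirrors_json_text, keyword=KEYWORD):
    """Return (switched, pending) lists of mirror URL prefixes."""
    data = json.loads(mirrors_json_text)
    switched, pending = [], []
    for mirror in data["mirrors"]:
        # A mirror without a "notes" field counts as not migrated yet.
        target = switched if keyword in mirror.get("notes", "") else pending
        target.append(mirror["url_prefix"])
    return switched, pending


# Made-up sample data, not real mirrors.
sample = """
{"mirrors": [
  {"url_prefix": "https://mirror-a.example.org/tails", "notes": "rsync switched"},
  {"url_prefix": "https://mirror-b.example.org/tails", "notes": "contact: ops"},
  {"url_prefix": "https://mirror-c.example.org/tails"}
]}
"""
switched, pending = split_by_migration(sample)
print(f"{len(switched)} switched, {len(pending)} still to ping")
```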

#24 Updated by intrigeri 4 months ago

11 mirrors have switched already :) I'll update our metadata in batches in the next few weeks and will ping (or suggest you ping) the remaining ones then. I hope we'll be done in time for the 3.8 release (even if it implies disabling a few mirrors: those that can't apply such a small change within a few weeks will cause other problems later along the way so I won't be sad).

#25 Updated by intrigeri 3 months ago

  • % Done changed from 20 to 30

So, almost 3 weeks after our request, 28 mirrors out of the 40 enabled ones have updated their cronjob and told us. I suspect more mirror operators did it without telling us. I've pinged the remaining 12 mirrors (except kernel.org that is tracked elsewhere). I'll wait ~10 more days and then I'll disable those that have not updated yet, let them know, and re-enable them if they apply the change in the 2 weeks that follow that last call.

Next step is in ~10 days, which is when I plan to upload the tentative 3.8 ISO. So depending on progress and availability I might switch the DNS to our own rsync server before I upload the ISO or not. We'll see.

#26 Updated by intrigeri 3 months ago

intrigeri wrote:

So, almost 3 weeks after our request, 28 mirrors out of the 40 enabled ones have updated their cronjob and told us. I suspect more mirror operators did it without telling us. I've pinged the remaining 12 mirrors (except kernel.org that is tracked elsewhere). I'll wait ~10 more days

So, one month after our initial email, and 10 days after a ping, 33 mirrors out of 40 have updated their rsync cronjob and told us about it. Next step is:

I'll disable those that have not updated yet, let them know, and re-enable them if they apply the change in the 2 weeks that follow that last call.

On it!

Next step is in ~10 days, which is when I plan to upload the tentative 3.8 ISO. So depending on progress and availability I might switch the DNS to our own rsync server before I upload the ISO or not. We'll see.

I'll probably try it out.

#27 Updated by intrigeri 3 months ago

intrigeri wrote:

I'll disable those that have not updated yet, let them know

On it!

Done.

#28 Updated by intrigeri 3 months ago

  • Description updated (diff)

#29 Updated by intrigeri 3 months ago

  • % Done changed from 30 to 40

Pointed the CNAME to our own rsync server. Mirrors started to download the 3.8 ISO from there. Fixed the outbound cap that I had got wrong (units mistake, duh).
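For context, libvirt documents the average and peak values of `virsh domiftune` in kilobytes per second (and burst in kilobytes), not in bits. The ticket does not record the corrected values, so the sketch below only shows the conversion for the 70 Mbit/s cap discussed earlier. The helper name is hypothetical and it assumes decimal units (1 Mbit = 1000 kbit, 1 kilobyte = 8 kbit).

```python
def mbit_to_kilobytes_per_s(mbit_per_s):
    """Convert a rate in megabits/s to kilobytes/s (decimal units assumed)."""
    return mbit_per_s * 1000 / 8


average = mbit_to_kilobytes_per_s(70)
print(f"70 Mbit/s is about {average:.0f} kilobytes/s")
```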

#30 Updated by intrigeri 3 months ago

  • Target version changed from Tails_3.8 to Tails_3.9

FTR it took ~2 hours for all mirrors to pick up the new ISO. IUKs were not added to the rsync server yet.

That's all I do here during this cycle => see you in the 3.9 area!

#31 Updated by intrigeri 3 months ago

  • Description updated (diff)

#32 Updated by intrigeri 3 months ago

So, this worked fine during the 3.8 release process. One mirror operator (geb) told me his cronjob raised a few "max connection (10) reached" errors but nobody else complained. If other mirror operators complain I'll raise the "max connection" setting: the upside will be no/fewer errors raised to mirror operators, the downside will be higher latency until a couple of mirrors get the ISO first => some delay in our release process between the upload and when we can start doing the manual tests.
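The trade-off can be put in rough numbers. The sketch below assumes the capped uplink is shared equally between all concurrent rsync connections, which is a simplification. The 1.2 GB ISO size and the 70 Mbit/s cap are figures quoted earlier in this ticket (the cap actually in place may differ), and the function name is made up for the example.

```python
def seconds_until_first_copy(iso_bytes, cap_bit_per_s, connections):
    """Time before the first mirror holds a full ISO, with equal bandwidth sharing."""
    per_connection_bit_per_s = cap_bit_per_s / connections
    return iso_bytes * 8 / per_connection_bit_per_s


for limit in (10, 20):
    minutes = seconds_until_first_copy(1.2e9, 70e6, limit) / 60
    print(f"max connections = {limit}: ~{minutes:.0f} min before any mirror has the ISO")
```

Under these assumptions, doubling the connection limit doubles the wait before the first mirrors are complete, which is the delay before manual tests can start.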

I've updated our internal doc (mirrors.git) and the manual test suite.

Next steps:

  • communicate with tpo admins as documented in the ticket description: I'll track this here
  • follow-up on the "plug lizard into a 1Gbps switch" topic: I'll track this on #15687

#33 Updated by intrigeri 3 months ago

  • Related to Bug #15687: Have lizard plugged to a Gbps switch added

#34 Updated by intrigeri 3 months ago

  • Related to deleted (Bug #11152: Have SSL on our rsync communications with mirrors)

#35 Updated by intrigeri 3 months ago

  • Blocks Bug #11152: Have SSL on our rsync communications with mirrors added

#36 Updated by intrigeri 3 months ago

  • Priority changed from Elevated to Normal
  • % Done changed from 40 to 50

intrigeri wrote:

Next steps:

  • communicate with tpo admins as documented in the ticket description: I'll track this here

I've reached out to the tpo admins, requesting these config changes.

The important part of the work was done => downgrading priority.

#37 Updated by intrigeri 3 months ago

  • Assignee changed from intrigeri to u
  • QA Check set to Ready for QA

  • communicate with tpo admins as documented in the ticket description: I'll track this here

I've reached out to the tpo admins, requesting these config changes.

They applied the requested change very quickly and I've thus added their mirror to the pool.

So I think we're done here. u, can you please take a look and see if I missed something?

#38 Updated by u about 1 month ago

  • QA Check changed from Ready for QA to Pass

LGTM! :)

#39 Updated by u about 1 month ago

  • Status changed from In Progress to Resolved
  • % Done changed from 50 to 100
