Bug #11830

Tagged APT snapshots' backup is impractical

Added by intrigeri about 1 year ago. Updated 27 days ago.

Status:
In Progress
Priority:
Normal
Assignee:
Category:
Infrastructure
Target version:
Start date:
09/23/2016
Due date:
% Done:

80%

QA Check:
Ready for QA
Feature Branch:
Type of work:
Sysadmin
Blueprint:
Starter:
Affected tool:

Description

For each release we currently add about 6-7GB of data to backup, which is painful when running backups on a poor Internet connection. I think we should investigate deduplication:

  • either in the source filesystem itself, which has the advantage of saving storage space on lizard;
    • using hardlink-based deduplication tools should work, e.g. http://jak-linux.org/projects/hardlink/, which we already use during our ISO build process
    • using a filesystem that deduplicates data would not help on the backup side (unless we use tools specific to that filesystem to back up our data); and last time I checked, no such filesystem was ready for production use on Linux
  • or in the backup process itself, e.g. using bup instead of rdiff-backup
    • bup supports pull-style backups (see bup-on(1))
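To illustrate the hardlink approach, here is a toy sketch using only coreutils (file names and sizes are made up; hardlink(1) automates finding the equal files, while here one pair is linked by hand):

```shell
#!/bin/sh
set -e

# Two snapshot directories containing an identical .deb (toy stand-ins).
work=$(mktemp -d)
mkdir -p "$work/2.5" "$work/2.6"
head -c 1048576 /dev/zero > "$work/2.5/foo_1.0_amd64.deb"
cp "$work/2.5/foo_1.0_amd64.deb" "$work/2.6/foo_1.0_amd64.deb"

# Before: two inodes, roughly 2 MiB on disk.
du -sk "$work"

# Deduplicate by replacing the copy with a hardlink -- what hardlink(1)
# does for every group of equal files it finds.
ln -f "$work/2.5/foo_1.0_amd64.deb" "$work/2.6/foo_1.0_amd64.deb"

# After: one inode with link count 2, roughly 1 MiB on disk.
du -sk "$work"
stat --format='%i %h' "$work/2.5/foo_1.0_amd64.deb" "$work/2.6/foo_1.0_amd64.deb"

rm -r "$work"
```

Since both paths now point at one inode, any backup tool that preserves hardlinks only has to transfer and store that data once.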

Related issues

Blocks Tails - Feature #13242: Core work 2017Q4: Sysadmin (Maintain our already existing services) Confirmed 06/29/2017

History

#1 Updated by intrigeri about 1 year ago

  • Description updated (diff)

#2 Updated by intrigeri about 1 year ago

  • Status changed from Confirmed to In Progress
  • Assignee changed from intrigeri to bertagaz
  • Target version changed from 284 to Tails_2.7
  • % Done changed from 0 to 10
  • QA Check set to Info Needed

The hardlink(1)-based solution seems to be pretty efficient on our current 31GB tagged snapshots repo:

reprepro-tagged-snapshots@apt:~$ time hardlink --dry-run --ignore-time /srv/apt-snapshots/tagged/repositories/
Mode:     dry-run
Files:    38410
Linked:   29650 files
Compared: 0 xattrs
Compared: 41484 files
Saved:    22.34 GiB
Duration: 77.18 seconds

... and the amount of space saved will only grow as we add releases. I'm tempted to simply go this way and be done with it. Given it's super easy to run this command via cron, I'm setting a closer target version (this can be done very quickly if we want).
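For reference, the cron entry could be as simple as the following (the schedule shown is hypothetical, not the deployed one; the repository path and --ignore-time flag are taken from the dry run above, minus --dry-run):

```
# m h dom mon dow  command
17 4 * * *  hardlink --ignore-time /srv/apt-snapshots/tagged/repositories/
```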

bertagaz, what do you think?

#3 Updated by intrigeri about 1 year ago

  • Type of work changed from Research to Sysadmin

#4 Updated by bertagaz about 1 year ago

  • Assignee changed from bertagaz to intrigeri

intrigeri wrote:

bertagaz, what do you think?

That sounds pretty nice, and I agree that's probably the best path to follow. Do you have an idea how much it improves the backup time?

#5 Updated by intrigeri about 1 year ago

  • QA Check changed from Info Needed to Dev Needed

That sounds pretty nice, and I agree that's probably the best path to follow.

OK, I'll go ahead then.

Do you have an idea how much it improves the backup time?

  • a full backup of our current tagged snapshot repo should require transferring 9GB instead of 31GB
  • next incremental backup of our tagged snapshot repo should require transferring only unique files that were added, instead of 6-7GB

#6 Updated by intrigeri about 1 year ago

  • % Done changed from 10 to 50
  • QA Check deleted (Dev Needed)

Deployed, let's see what happens the first time the cronjob runs (in a couple hours).

#7 Updated by intrigeri about 1 year ago

As expected:

$ sudo du -csh /srv/apt-snapshots/tagged/repositories/*
5.4G    /srv/apt-snapshots/tagged/repositories/2.4
131M    /srv/apt-snapshots/tagged/repositories/2.4-rc1
577M    /srv/apt-snapshots/tagged/repositories/2.5
1.1G    /srv/apt-snapshots/tagged/repositories/2.6
615M    /srv/apt-snapshots/tagged/repositories/2.6-rc1
4.0K    /srv/apt-snapshots/tagged/repositories/robots.txt
7.8G    total

I'll now try building an ISO that uses one of these tagged snapshots.

#8 Updated by intrigeri about 1 year ago

  • Status changed from In Progress to Resolved
  • Assignee deleted (intrigeri)
  • % Done changed from 50 to 100

The build system managed to download most .deb's and then I killed it. So it works!

#9 Updated by intrigeri 9 months ago

  • Status changed from Resolved to In Progress
  • Assignee set to bertagaz
  • Target version changed from Tails_2.7 to Tails_2.12
  • % Done changed from 100 to 10
  • QA Check set to Info Needed

Argh, the actual consequence of this problem is still with us. Apparently rdiff-backup ignores hardlinks, and as a result the same data is downloaded and stored N times (each tagged repo directory in my backup store takes multiple GB). Sorry I didn't notice this earlier. bertagaz, can you confirm? (I'd like to make sure this is not due to some weirdness of my own system.)

#11 Updated by intrigeri 8 months ago

  • Target version changed from Tails_2.12 to Tails_3.0~rc1

You're on duty next week, so you should be able to answer my question by the end of the month.

#12 Updated by intrigeri 7 months ago

  • Status changed from In Progress to Resolved
  • Assignee deleted (bertagaz)
  • % Done changed from 10 to 100
  • QA Check deleted (Info Needed)

intrigeri wrote:

Argh, the actual consequence of this problem is still with us. Apparently rdiff-backup ignores hardlinks, and as a result the same data is downloaded and stored N times (each tagged repo directory in my backup store takes multiple GB).

I've looked closer and actually I was wrong: each tagged snapshot takes exactly as much space locally as on apt.lizard, and I've verified with stat --format='%i' that .deb's with the same name/version are de-duplicated via hardlinks locally as well. So everything works as expected here :)
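A whole-tree version of that check can count distinct inodes against file paths; a sketch on a toy stand-in for the snapshot tree (directory names are made up):

```shell
#!/bin/sh
set -e

# Toy snapshot tree: three paths, but only two distinct files on disk.
repo=$(mktemp -d)
mkdir -p "$repo/3.0" "$repo/3.0.1"
echo 'payload A' > "$repo/3.0/a.deb"
echo 'payload B' > "$repo/3.0/b.deb"
ln "$repo/3.0/a.deb" "$repo/3.0.1/a.deb"   # de-duplicated copy

paths=$(find "$repo" -type f | wc -l)
inodes=$(find "$repo" -type f -printf '%i\n' | sort -u | wc -l)
echo "$paths paths, $inodes inodes"   # prints: 3 paths, 2 inodes

rm -r "$repo"
```

If the two counts were equal, no deduplication would be in effect.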

#13 Updated by intrigeri 5 months ago

  • Status changed from Resolved to In Progress
  • Assignee set to intrigeri
  • Target version changed from Tails_3.0~rc1 to Tails_3.2
  • % Done changed from 100 to 80
  • QA Check set to Ready for QA

I'm back here again :/ It seems that we're still transferring much more data than we could. I think I know why; hardlink(1) says:

       -O or --keep-oldest
              Among equal files, keep the oldest file (least recent  modification  time).
              By  default, the newest file is kept. If --maximize or --minimize is speci‐
              fied, the link count has a higher precedence than the time of modification.

The way I understand this, "by default, the newest file is kept" implies that in practice, files that were already in the tagged repo for Tails version N-1 become hardlinks to the same files in the tagged repo for Tails version N once it is out. And indeed, the tagged repo for 3.0 is 4.7GB, while those for 3.0~betaN and 3.0~rcN are all 2.1GB or smaller. If I'm not mistaken, this implies that when performing a backup we re-download these duplicated files, and the copy we already had becomes a hardlink to the new copy.

This feels wrong, and I think the --keep-oldest option should avoid this problem. I've made this change and deployed it, but it does not take effect immediately: it only impacts newly duplicated files, so we'll only know how it went after the 3.1 release. I'll thus evaluate the outcome during the 3.2 cycle.

As a data point, what we have now is:

4.7G    3.0
225M    3.0.1
3.6G    3.0-alpha1
2.1G    3.0-beta1
1.5G    3.0-beta2
545M    3.0-beta3
1.2G    3.0-beta4
771M    3.0-rc1
47M     3.0-rc2

#14 Updated by intrigeri 5 months ago

  • Blocks Feature #13233: Core work 2017Q3: Sysadmin (Maintain our already existing services) added

#15 Updated by intrigeri 4 months ago

  • Assignee changed from intrigeri to bertagaz

We now have:

5.1G    3.0
225M    3.0.1
4.0G    3.0-alpha1
2.3G    3.0-beta1
1.6G    3.0-beta2
546M    3.0-beta3
1.2G    3.0-beta4
771M    3.0-rc1
47M     3.0-rc2
725M    3.1

The fact the 3.1 snapshot is small seems to indicate that the problem has indeed been fixed. bertagaz, please confirm this while doing the backups during your current sysadmin shift.

#16 Updated by anonym 3 months ago

  • Target version changed from Tails_3.2 to Tails_3.3

#17 Updated by intrigeri 2 months ago

  • Blocks Feature #13242: Core work 2017Q4: Sysadmin (Maintain our already existing services) added

#18 Updated by intrigeri 2 months ago

  • Blocks deleted (Feature #13233: Core work 2017Q3: Sysadmin (Maintain our already existing services))

#19 Updated by anonym 27 days ago

  • Target version changed from Tails_3.3 to Tails_3.5
