Bug #12595

Feature #5630: Reproducible builds

Not enough space in /var/lib/jenkins on isobuilders

Added by intrigeri 2 months ago. Updated 19 days ago.

Status: In Progress
Priority: Normal
Assignee: bertagaz
Category: Continuous Integration
Target version: Tails_3.2
Start date: 05/25/2017
Due date:
% Done: 0%
QA Check: Dev Needed
Blueprint:
Feature Branch:
Easy:
Type of work: Sysadmin
Affected tool:

Description

This ticket is about two issues:

  • Short term: noise on our monitoring dashboard panel/notifications. Under normal conditions we use up to 20GB out of the 23GB there, which exceeds our monitoring thresholds.
  • Medium term: we lack disk space in /var/lib/jenkins on isobuilders. Once #12576 is done we'll be hosting multiple baseboxes in /var/lib/jenkins/.vagrant.d/, so we will need even more space there; this disk space issue is currently blocking such performance optimizations. See #12531#note-23 and follow-ups, where the research about how many baseboxes we need to store there is being worked on.
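For reference, the short-term monitoring noise boils down to a usage-percentage threshold check flapping near its limits. A minimal sketch of that kind of check (the thresholds and function are illustrative, not our actual Icinga2 configuration):

```shell
# Hypothetical sketch of a disk-usage threshold check like the one that
# generates the dashboard noise. Thresholds are illustrative only.
usage_state() {
    used_gb=$1
    total_gb=$2
    pct=$((used_gb * 100 / total_gb))
    if [ "$pct" -ge 90 ]; then
        echo "CRITICAL ${pct}%"
    elif [ "$pct" -ge 80 ]; then
        echo "WARNING ${pct}%"
    else
        echo "OK ${pct}%"
    fi
}

usage_state 20 23   # normal-conditions usage from the description
```

With 20GB used out of 23GB we sit at ~86%, permanently inside the warning band, which is why the check keeps flapping instead of reporting a stable OK.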

Related issues

Related to Tails - Feature #12002: Estimate hardware cost of reproducible builds in Jenkins Resolved 11/28/2016
Blocks Tails - Feature #12576: Have Jenkins use basebox:clean_old instead of basebox:clean_all In Progress 05/22/2017
Blocked by Tails - Bug #13425: Upgrade lizard's storage (2017 edition) In Progress 07/05/2017

History

#1 Updated by intrigeri 2 months ago

  • Blocks Feature #12002: Estimate hardware cost of reproducible builds in Jenkins added

#2 Updated by intrigeri 2 months ago

  • Related to Bug #12574: isobuilders system_disks check keeps switching between OK and WARNING since the switch to Vagrant added

#3 Updated by bertagaz 2 months ago

  • Blocks Feature #12576: Have Jenkins use basebox:clean_old instead of basebox:clean_all added

#4 Updated by intrigeri 2 months ago

Blocks Feature #12576: Have Jenkins use basebox:clean_old instead of basebox:clean_all added

I don't understand why, but perhaps it's not important.

Let's just fix it and I'll shut up :)

#5 Updated by bertagaz 2 months ago

intrigeri wrote:

Blocks Feature #12576: Have Jenkins use basebox:clean_old instead of basebox:clean_all added

I don't understand why, but perhaps it's not important.

Because at the moment we don't have enough space to build a new basebox AND host several baseboxes in /var/lib/jenkins/.vagrant.d/, which will probably happen if we switch to basebox:clean_old.

#6 Updated by intrigeri 2 months ago

Because at the moment we don't have enough space to build a new basebox AND host several baseboxes in /var/lib/jenkins/.vagrant.d/, which will probably happen if we switch to basebox:clean_old.

Wow, interesting! If that means we're going to store multiple baseboxes both in /var/lib/libvirt/images and in /var/lib/jenkins/.vagrant.d, then perhaps it's a problem ⇒ file a ticket about it and discuss potential solutions with anonym (before allocating disk space specifically to accommodate these two sets of mostly-duplicated data).

Meta: I have almost no clue how the whole thing works, so perhaps there's a very good reason to do this; if so, forget what I said, sorry.

#7 Updated by intrigeri about 2 months ago

  • Blocks deleted (Feature #12576: Have Jenkins use basebox:clean_old instead of basebox:clean_all)

#8 Updated by intrigeri about 2 months ago

  • Blocks Feature #12576: Have Jenkins use basebox:clean_old instead of basebox:clean_all added

#9 Updated by bertagaz about 2 months ago

intrigeri wrote:

Wow, interesting! If that means we're going to store multiple baseboxes both in /var/lib/libvirt/images and in /var/lib/jenkins/.vagrant.d, then perhaps it's a problem ⇒ file a ticket about it and discuss potential solutions with anonym (before allocating disk space specifically to accommodate these two sets of mostly-duplicated data).

As I get it, we delete the baseboxes in /var/lib/libvirt/images through the use of the forcecleanup option. I think that if we end up having several in this directory, it's because of a build failure, which prevents rake from executing the related clean_up_builder_vms function. I'll track the builds over the next days to see if/why this situation happens.

#10 Updated by bertagaz about 2 months ago

  • Assignee changed from bertagaz to anonym
  • QA Check set to Info Needed

bertagaz wrote:

intrigeri wrote:

Wow, interesting! If that means we're going to store multiple baseboxes both in /var/lib/libvirt/images and in /var/lib/jenkins/.vagrant.d, then perhaps it's a problem ⇒ file a ticket about it and discuss potential solutions with anonym (before allocating disk space specifically to accommodate these two sets of mostly-duplicated data).

As I get it, we delete the baseboxes in /var/lib/libvirt/images through the use of the forcecleanup option. I think that if we end up having several in this directory, it's because of a build failure, which prevents rake from executing the related clean_up_builder_vms function. I'll track the builds over the next days to see if/why this situation happens.

Hmmm, assigning to anonym, because I just checked and there's still at least one volume in the libvirt default storage after a build is finished (e.g. tails-builder-amd64-jessie-20170524-8cc1ccbade_vagrant_box_image_0.img). And indeed I don't see code in the Rakefile that takes care of removing such volumes, only the ones that have the same name as the vagrant VM. Is that expected? I guess we could simply remove it too, since we store it in ~/.vagrant.d/?

#11 Updated by intrigeri about 2 months ago

  • Subject changed from isobuilders jenkins-data-disk check keeps switching between OK and WARNING since the switch to Vagrant to isobuilders jenkins-data-disk check keeps switching between OK, WARNING and CRITICAL since the switch to Vagrant

#12 Updated by intrigeri about 2 months ago

I'm concerned we'll become quickly confused (and it'll be hard to maintain Redmine ticket semantics) if we discuss the clean up process of /var/lib/libvirt/images/ on a ticket that's about something else entirely, i.e. /var/lib/jenkins, so please, anonym:

  • address the "are we really going to store essentially the same data twice on each isobuilder?" question here;
  • answer questions that are specific to the libvirt storage pool GC process on #12599, rather than here.

Thanks :)

#13 Updated by anonym about 2 months ago

bertagaz wrote:

bertagaz wrote:

intrigeri wrote:

Wow, interesting! If that means we're going to store multiple baseboxes both in /var/lib/libvirt/images and in /var/lib/jenkins/.vagrant.d, then perhaps it's a problem ⇒ file a ticket about it and discuss potential solutions with anonym (before allocating disk space specifically to accommodate these two sets of mostly-duplicated data).

As I get it, we delete the baseboxes in /var/lib/libvirt/images through the use of the forcecleanup option. I think that if we end up having several in this directory, it's because of a build failure, which prevents rake from executing the related clean_up_builder_vms function. I'll track the builds over the next days to see if/why this situation happens.

Hmmm, assigning to anonym, because I just checked and there's still at least one volume in the libvirt default storage after a build is finished (e.g. tails-builder-amd64-jessie-20170524-8cc1ccbade_vagrant_box_image_0.img).

Ah, that is right; the file

~/.vagrant.d/boxes/${BOX_NAME}/0/libvirt/box.img

is copied to

/var/lib/libvirt/images/${BOX_NAME}_vagrant_box_image_0.img

by Vagrant whenever it sets up a domain using that base box (unless it already exists). This is redundant.

And indeed I don't see code in the Rakefile that takes care of removing such volumes, only the ones that have the same name as the vagrant VM.

There is code for removing such volumes in clean_up_basebox(), so we only do it when removing base boxes.

Is that expected?

Yes. I had underestimated how much of a problem using more disk space was, so I didn't think this would matter.

I guess we could simply remove it too, if we store it in ~/.vagrant.d/?

Yup, Vagrant will make the copy if needed. So I guess we can do this cleanup by default, since a ~800 MiB disk copy shouldn't increase the build time too much for non-Jenkins users. Just to be sure we're in sync, I implemented the change I believe to be safe (but I haven't tested it!) on the feature/12599 branch (69faba0c1d5e7b517616103f8e1c14528bdb55e8). What do you think?
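The naming scheme described above can be sketched as a tiny helper; note that the virsh invocation in the comment is an assumption about what the cleanup on feature/12599 might look like, not the actual code from that branch:

```shell
# Derive the name of the redundant copy that vagrant-libvirt places in the
# default storage pool for a given base box, as described above.
box_volume_name() {
    echo "${1}_vagrant_box_image_0.img"
}

# Hypothetical cleanup after a build, using the derived name:
#   virsh --connect qemu:///system vol-delete \
#       --pool default "$(box_volume_name "$BOX_NAME")"
# Vagrant will re-copy box.img from ~/.vagrant.d the next time it sets up
# a domain using that base box, at the cost of the ~800 MiB copy.

box_volume_name "tails-builder-amd64-jessie-20170524-8cc1ccbade"
```

This matches the leftover volume bertagaz spotted in note #10.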

#14 Updated by intrigeri about 2 months ago

Yes. I had underestimated how much of a problem using more disk space was, so I didn't think this would matter.

Meta: I'm constantly advocating against spending substantial engineering time on issues that can trivially be solved with a bit more hardware. This nagging on my part probably contributed to creating the situation we're trying to fix here. Now, there are caveats I want to clarify, to try and fix the confusion I may have created:

  • Alas, our hardware doesn't grow magically on demand. So ideally, new requirements need to be roughly evaluated & communicated so that whatever hardware purchase & installation a change needs can happen before we deploy stuff… and break bits of our infra (#12002). But granted, often we go through several iterations and design changes, and it's simply impossible to accurately evaluate the final requirements in advance. The only realistic solution I can think of to avoid this chicken-and-egg problem, as long as we're managing the bare metal our stuff runs on, is to do infra development in a setup that looks very much like the production one, but is not the production one.
  • Sometimes only a little bit of software engineering effort is enough to avoid raising hardware requirements, and it's worth spending this time instead of upgrading hardware (which has a cost in terms of sysadmin work). Seeing your fix for #12599, I guess that ticket falls into this category :)

#15 Updated by anonym about 2 months ago

  • Assignee changed from anonym to bertagaz
  • QA Check deleted (Info Needed)

[Reassigning back to bert now that the question he had for me is answered.]

#16 Updated by intrigeri about 2 months ago

  • Subject changed from isobuilders jenkins-data-disk check keeps switching between OK, WARNING and CRITICAL since the switch to Vagrant to Not enough space in /var/lib/jenkins on isobuilders
  • Description updated (diff)

Clarified the scope of this ticket so we don't merely track the short-term monitoring issue.

#17 Updated by intrigeri about 2 months ago

  • Blocks deleted (Feature #12002: Estimate hardware cost of reproducible builds in Jenkins)

#18 Updated by intrigeri about 2 months ago

  • Related to deleted (Bug #12574: isobuilders system_disks check keeps switching between OK and WARNING since the switch to Vagrant)

#19 Updated by intrigeri about 2 months ago

  • Related to Feature #12002: Estimate hardware cost of reproducible builds in Jenkins added

#20 Updated by intrigeri about 2 months ago

  • Target version changed from Tails_3.0 to Tails_3.1

I'll build the 3.0 ISO in two days, so let's not make potentially disruptive changes to our infra at this point.

#21 Updated by bertagaz 20 days ago

  • Status changed from Confirmed to In Progress
  • Assignee changed from bertagaz to intrigeri
  • QA Check set to Info Needed

Now that 3.0 is out and #12002 is over, I propose we add 7G to each isobuilder's /var/lib/jenkins. That way they would go up to 30G, which should hold a bunch of baseboxes, probably enough for the time being. We'll still have 100G left to allocate wherever needed while we wait for #11806. If we do that, we should be able to tackle #12576.
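For the record, the growth itself would presumably be a routine online LVM resize on the host plus a filesystem grow in each guest; the VG/LV and guest device names below are hypothetical placeholders, not lizard's actual layout:

```shell
# Hypothetical sketch of the proposed resize, per isobuilder.
# Actual VG/LV names on the host and the device name inside the
# guest will differ.
#
#   lvextend --size +7G /dev/mainvg/isobuilder1-jenkins-data  # on the host
#   resize2fs /dev/vdb                                        # in the guest, online
#
# Sanity arithmetic for the proposal: 23G today + 7G extra.
old_gb=23
add_gb=7
new_gb=$((old_gb + add_gb))
echo "new /var/lib/jenkins size: ${new_gb}G"
```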

#22 Updated by intrigeri 20 days ago

  • Assignee changed from intrigeri to bertagaz
  • QA Check changed from Info Needed to Dev Needed

Now that 3.0 is out and #12002 is over, I propose we add 7G to each isobuilder's /var/lib/jenkins. That way they would go up to 30G, which should hold a bunch of baseboxes, probably enough for the time being. We'll still have 100G left to allocate wherever needed while we wait for #11806.

We need these 100G for other matters (ever-growing data for services that existed before vagrant-libvirt); we've been struggling with disk space in a painful way for a couple of months "thanks" to the timing of the vagrant-libvirt deployment vs. hardware planning, and I've grown tired of this situation already, so: no, sorry; I really don't want to make things even worse. Let's deal with the storage upgrade that the vagrant-libvirt stuff requires before allocating even more space to it.

#23 Updated by bertagaz 20 days ago

  • Blocked by Bug #13425: Upgrade lizard's storage (2017 edition) added

#24 Updated by bertagaz 20 days ago

intrigeri wrote:

We need these 100G for other matters (ever-growing data for services that existed before vagrant-libvirt); we've been struggling with disk space in a painful way for a couple of months "thanks" to the timing of the vagrant-libvirt deployment vs. hardware planning, and I've grown tired of this situation already, so: no, sorry; I really don't want to make things even worse. Let's deal with the storage upgrade that the vagrant-libvirt stuff requires before allocating even more space to it.

Having a look at the spreadsheet, I'm not sure I see which services will require that much data. I thought taking 30G and leaving 100G free was affordable while we purchase more HDDs. But I get that you're upset. Let's wait then.

#25 Updated by intrigeri 19 days ago

  • Target version changed from Tails_3.1 to Tails_3.2

(Please focus on making builds robust again first, and postpone the performance improvements. Sorry we could not discuss this at the CI team meeting today.)
