Bug #11583

Bug #10288: Fix newly identified issues to make our test suite more robust and faster

UEFI boot tests fail on Jenkins

Added by intrigeri almost 2 years ago. Updated about 1 year ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
Test suite
Target version:
Start date:
07/21/2016
Due date:
% Done:

100%

QA Check:
Pass
Feature Branch:
test/11583-uefi-boot-is-fragile-stretch
Type of work:
Research
Blueprint:
Starter:
Affected tool:

Description

Once #10720 is worked around, "Booting Tails from a USB drive in UEFI mode" always fails with a black screen.


Related issues

Related to Tails - Bug #12141: UEFI boot on QEMU is broken since 2.10~rc1 Resolved 01/13/2017
Blocked by Tails - Bug #11588: Sometimes fails to boot from USB on Jenkins with I/O errors Resolved 07/22/2016

Associated revisions

Revision 3b0abd2b (diff)
Added by intrigeri almost 2 years ago

Revert "Test suite: mark UEFI boot test as fragile." (refs: #11583)

This reverts commit 4a61e393327d8b0fe49e836288ab446dd057cf68.

Revision 437d5689 (diff)
Added by intrigeri almost 2 years ago

Revert "Test suite: mark UEFI boot test as fragile." (refs: #11583)

This reverts commit 4a61e393327d8b0fe49e836288ab446dd057cf68.

Revision e49a07a0 (diff)
Added by intrigeri almost 2 years ago

Revert "Test suite: mark UEFI boot test as fragile." (refs: #11583)

Revision 9539ae88
Added by anonym about 1 year ago

Merge remote-tracking branch 'origin/test/11583-uefi-boot-is-fragile-stretch' into feature/stretch

Fix-committed: #11583

History

#1 Updated by intrigeri almost 2 years ago

  • Feature Branch set to test/11583-uefi-boot-is-fragile

Flagged as fragile.

#2 Updated by intrigeri almost 2 years ago

  • Assignee set to intrigeri
  • Target version set to Tails_2.6

Random idea: check if AppArmor blocks access to the OVMF firmware.

#3 Updated by intrigeri almost 2 years ago

intrigeri wrote:

Random idea: check if AppArmor blocks access to the OVMF firmware.

It doesn't.

#4 Updated by bertagaz almost 2 years ago

From what I saw while testing #10777, it seems like the firmware is not very reliable. It sometimes fails to boot, with different symptoms (black screen, freezes on the boot device list screen, ...). I'll report in more depth later. The OVMF doc says it can have trouble booting when the KVM feature is used on recent QEMU. Using the -no-kvm option is said to help (that's a "may").

The Stretch OVMF package is much more up to date, so I'm testing with it at home, and I installed it on isotester6 as a job with a UEFI scenario was involved. There's also a qemu-efi package in Stretch, with the edk2 EFI bootloader. It could be another candidate if OVMF is not reliable enough.

#5 Updated by intrigeri almost 2 years ago

Thanks for looking into this!

The Stretch OVMF package is much more up to date, so I'm testing with it at home, and I installed it on isotester6 as a job with a UEFI scenario was involved.

OK, let's try this indeed!

However: let's please not do any such thing without encoding it in Puppet, especially when no ticket is tracking the clean-up step: I'd like our Puppet recipes to remain an accurate description of the current state of our systems.

There's also a qemu-efi package in Stretch, with the edk2 EFI bootloader. It could be another candidate if OVMF is not reliable enough.

Great! Seems worth a try. Note that it's built from the ovmf source package as well, and the package description doesn't make it very clear how this firmware differs from the one shipped in the ovmf binary package (presumably it has fewer features, e.g. no Secure Boot support).

Another debugging step I want to take is to verify if we are experiencing a mere display issue, or a more serious problem with that UEFI firmware (e.g. I would drop anything that expects to see the boot menu, let it timeout and boot, and see if Tails boots as a result).

(And here again, I find it strange that this problem happens on Jenkins, while I've never seen it elsewhere.)

#6 Updated by intrigeri almost 2 years ago

  • Blocked by Bug #11588: Sometimes fails to boot from USB on Jenkins with I/O errors added

#7 Updated by bertagaz almost 2 years ago

intrigeri wrote:

The Stretch OVMF package is much more up to date, so I'm testing with it at home, and I installed it on isotester6 as a job with a UEFI scenario was involved.

OK, let's try this indeed!

I got some errors at home, but have not yet checked whether they are the same as with the previous OVMF version. I will report on that later too.

However: let's please not do any such thing without encoding it in Puppet, especially when no ticket is tracking the clean-up step: I'd like our Puppet recipes to remain an accurate description of the current state of our systems.

Yeah, that's bad I know. I wanted to give it a quick try, and thought I would note this by-hand change on this ticket.

There's also a qemu-efi package in Stretch, with the edk2 EFI bootloader. It could be another candidate if OVMF is not reliable enough.

Great! Seems worth a try. Note that it's built from the ovmf source package as well, and the package description doesn't make it very clear how this firmware differs from the one shipped in the ovmf binary package (presumably it has fewer features, e.g. no Secure Boot support).

From what I understood from the upstream sources, both firmwares share the same repo. I'm not sure what the difference is either; maybe just the integrated features and compile options.

Another debugging step I want to take is to verify if we are experiencing a mere display issue, or a more serious problem with that UEFI firmware (e.g. I would drop anything that expects to see the boot menu, let it timeout and boot, and see if Tails boots as a result).

I can do that at home once my 50 runs of two scenarios are over.

(And here again, I find it strange that this problem happens on Jenkins, while I've never seen it elsewhere.)

I've seen it several times at home, along with other weird behaviors of this bootloader. More on that soon.

#8 Updated by intrigeri almost 2 years ago

bertagaz wrote:

Yeah, that's bad I know.

Then I've just reverted it. (I have zero faith in leaving a note on a ticket as a good reminder for such things -- especially such a ticket that might not get fixed quickly, and then comments may accumulate, and that note may easily be lost deeeep in there; besides, adding a pinning entry with Puppet is cheap.)

#9 Updated by bertagaz almost 2 years ago

So, I've run these two scenarios while testing #10777:

  Scenario: Legacy boot
    Given I have started Tails without network from a USB drive without a persistent partition \
      and stopped at Tails Greeter's login screen
    And I log in to a new session
    Then Tails is running from USB drive "__internal" 

  Scenario: UEFI boot
    Given I have started Tails without network from a USB drive without a persistent partition \
      and stopped at Tails Greeter's login screen
    Then I power off the computer
    Given the computer is set to boot in UEFI mode
    When I start Tails from USB drive "__internal" with network unplugged and I login
    Then Tails is running from USB drive "__internal" 
    And Tails has started in UEFI mode

I ran the above scenarios 50 times with the Jessie OVMF; for the UEFI scenario, this resulted in:

  • 4 failures of type: After leaving the bootloader setup screen, UEFI never goes on to boot syslinux; it keeps displaying the device probing list, and gets killed by a timeout.
  • 3 failures of type: Goes up to the kernel command line and types the additional boot options, then doesn't seem to hit enter or anything, and freezes. After 10 minutes, the timeout hits and the VM is rebooted. It then goes up to the syslinux screen, but the kernel command line doesn't seem to be opened, and Tails starts after the 3-second syslinux timeout.
  • 1 failure of type: Shows the Tianocore logo screen, then switches to a black screen until it reaches a timeout.
  • 1 failure of type: Seems to freeze on the Tianocore logo screen, but finally goes on, passes the bootloader setup menu, and starts booting from the devices. Probes the usual two first, but then ends up in the UEFI shell.

I also did 50 runs of the above scenarios with OVMF from Stretch, which resulted in:

  • 2 failures of type: Shows the Tianocore logo, passes the bootloader setup menu, then displays a non-blinking cursor on a black screen until it reaches a timeout.
  • 1 failure of type: Goes on up to typing the options on the kernel command line, sits 5 minutes on this screen, then shows a kernel panic with the message "Initramfs unpacking failed: XZ-compressed data is corrupt".
  • 1 failure of type: Seems to freeze on the Tianocore logo screen, but finally goes on booting from the devices. Boots the usual two first, but then ends up in the UEFI shell.
  • 1 failure of type: Freezes on the Tianocore logo screen and gets killed by the timeout.

So no big improvement with the more recent OVMF, it seems; there are still some bugs. There are some patterns in these failures.
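To put the two tallies side by side, assuming each is out of 50 runs of the UEFI scenario: 9 failures (4+3+1+1) with the Jessie OVMF and 5 failures (2+1+1+1) with the Stretch OVMF. The sketch below is only an illustration and was not run as part of this ticket. It computes both failure rates and a two-sided Fisher exact test with the Python standard library; the function names are made up for this example.

```python
from math import comb

# Tallies taken from the two bullet lists above.
JESSIE_FAILURES, JESSIE_RUNS = 4 + 3 + 1 + 1, 50
STRETCH_FAILURES, STRETCH_RUNS = 2 + 1 + 1 + 1, 50


def failure_rate(failures, runs):
    """Fraction of runs that failed."""
    return failures / runs


def fisher_exact_two_sided(fail_a, runs_a, fail_b, runs_b):
    """Two-sided Fisher exact test p-value for a 2x2 failure table."""
    total_runs = runs_a + runs_b
    total_fail = fail_a + fail_b
    total_pass = total_runs - total_fail

    def weight(k):
        # Unnormalised hypergeometric weight of seeing k failures in group A.
        return comb(total_fail, k) * comb(total_pass, runs_a - k)

    lowest = max(0, runs_a - total_pass)
    highest = min(total_fail, runs_a)
    observed = weight(fail_a)
    # Sum every outcome that is at most as likely as the observed one.
    extreme = sum(
        weight(k) for k in range(lowest, highest + 1) if weight(k) <= observed
    )
    return extreme / comb(total_runs, runs_a)


if __name__ == "__main__":
    print("Jessie OVMF failure rate:", failure_rate(JESSIE_FAILURES, JESSIE_RUNS))
    print("Stretch OVMF failure rate:", failure_rate(STRETCH_FAILURES, STRETCH_RUNS))
    print("Fisher exact p-value:", fisher_exact_two_sided(
        JESSIE_FAILURES, JESSIE_RUNS, STRETCH_FAILURES, STRETCH_RUNS))
```

With samples this small, the test does not show a significant difference at the usual 0.05 level, which fits the "no big improvement" reading.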

I tried once with the qemu-efi firmware, but it didn't boot at all and only showed a black screen indefinitely. Maybe I've misconfigured it.

Note that Plymouth is broken in UEFI mode, so a possible short-term workaround could be to wait for the "Loading, please wait..." message to be displayed, and reboot if it is not displayed after a certain time, thanks to #10777.
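The wait-then-reboot idea can be sketched as a small retry loop. This is not the test suite's code (nor its language): every name below is hypothetical, and checking the screen for the "Loading, please wait..." message is abstracted into a callable that the caller supplies.

```python
import time


def boot_with_retries(start_vm, boot_message_visible, reset_vm,
                      timeout=120.0, max_attempts=3, poll_interval=1.0,
                      clock=time.monotonic, sleep=time.sleep):
    """Start the VM and wait for the boot message, resetting on timeout.

    Returns the number of the attempt that succeeded, or raises
    TimeoutError once max_attempts have all timed out.
    All callables are hypothetical hooks provided by the caller.
    """
    start_vm()
    for attempt in range(1, max_attempts + 1):
        deadline = clock() + timeout
        while clock() < deadline:
            if boot_message_visible():
                return attempt
            sleep(poll_interval)
        # The message never showed up: reset and try again, unless this
        # was the last allowed attempt.
        if attempt < max_attempts:
            reset_vm()
    raise TimeoutError(f"boot message not seen after {max_attempts} attempts")
```

The clock and sleep functions are parameters so the loop can be exercised without real delays.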

The OVMF documentation says it's possible to save debug logs from the firmware on the host. I have a patch that sets up the necessary QEMU options to dump them and save them on failure with the other artifacts. But the Debian package uses the "-b RELEASE" compile option, which deactivates them. I'll build a Debian package with debugging enabled to gather these stats from my runs at home. This could be interesting if we ever want to report this upstream or ask them for help.
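For reference, the OVMF README describes capturing the firmware debug output through QEMU's debug console on I/O port 0x402. The sketch below is not the patch mentioned above. It only shows what those options look like as an argument list, plus a hypothetical helper that copies the log next to the other artifacts when a scenario fails. The helper names and paths are assumptions, and however the test suite actually launches QEMU, the options would have to be passed through that mechanism.

```python
import shutil
from pathlib import Path


def ovmf_debug_args(log_path):
    """QEMU arguments that write OVMF debug output to log_path.

    The options follow the OVMF README; a RELEASE firmware build emits
    no debug output, so the log stays empty with such a build.
    """
    return [
        "-debugcon", f"file:{log_path}",
        "-global", "isa-debugcon.iobase=0x402",
    ]


def keep_log_on_failure(log_path, artifacts_dir, failed):
    """Copy the firmware log into artifacts_dir if the scenario failed.

    Hypothetical helper: returns the path of the copy, or None when
    nothing was saved.
    """
    log = Path(log_path)
    if not failed or not log.exists():
        return None
    target_dir = Path(artifacts_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(log, target_dir / log.name))
```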

#10 Updated by intrigeri almost 2 years ago

Note: the topic branch has test/11588-usb-on-jenkins merged in, so it's affected by any improvement or regression documented on #11588 (e.g. currently: "crashes during memory erasure on shutdown, but with #10733 merged on top it seems to be fine"; once I've confirmed that #10733 fixes that later today, the fix will flow into the topic branch for this ticket as well).

#11 Updated by intrigeri almost 2 years ago

  • Assignee deleted (intrigeri)
  • Target version deleted (Tails_2.6)
  • Feature Branch changed from test/11583-uefi-boot-is-fragile to wip/test/11583-uefi-boot-is-fragile

bertagaz wrote:

  • 3 failures of type: Goes up to the kernel command line and types the additional boot options, then doesn't seem to hit enter or anything, and freezes. After 10 minutes, the timeout hits and the VM is rebooted. It then goes up to the syslinux screen, but the kernel command line doesn't seem to be opened, and Tails starts after the 3-second syslinux timeout.
  • 1 failure of type: Goes on up to typing the options on the kernel command line, sits 5 minutes on this screen, then shows a kernel panic with the message "Initramfs unpacking failed: XZ-compressed data is corrupt".

These failure modes look suspiciously like the ones described on #11588. I suspect that a number of the other failures you've seen also share the same root cause. IMO we should fix #11588 before we try to make any sense of this very ticket: as long as we have fragile USB mass storage device emulation, we can't possibly have robust UEFI boot off USB, and we have no way to know for sure if the problems we see are specific to UEFI or not. This is just a single scenario, that passes reliably enough for 2 of our usual RMs (so is not a problem at release time) => let's not spend too much time on it now, we have much higher-impact places to work on in this test suite.

#12 Updated by bertagaz almost 2 years ago

  • Assignee set to intrigeri
  • Target version set to Tails_2.6

intrigeri wrote:

bertagaz wrote:

  • 3 failures of type: Goes up to the kernel command line and types the additional boot options, then doesn't seem to hit enter or anything, and freezes. After 10 minutes, the timeout hits and the VM is rebooted. It then goes up to the syslinux screen, but the kernel command line doesn't seem to be opened, and Tails starts after the 3-second syslinux timeout.
  • 1 failure of type: Goes on up to typing the options on the kernel command line, sits 5 minutes on this screen, then shows a kernel panic with the message "Initramfs unpacking failed: XZ-compressed data is corrupt".

These failure modes look suspiciously like the ones described on #11588. I suspect that a number of the other failures you've seen also share the same root cause. IMO we should fix #11588 before we try to make any sense of this very ticket: as long as we have fragile USB mass storage device emulation, we can't possibly have robust UEFI boot off USB, and we have no way to know for sure if the problems we see are specific to UEFI or not. This is just a single scenario, that passes reliably enough for 2 of our usual RMs (so is not a problem at release time) => let's not spend too much time on it now, we have much higher-impact places to work on in this test suite.

Interesting. Your reasoning makes sense, let's see what results you get with #11588.

#13 Updated by intrigeri almost 2 years ago

  • Assignee deleted (intrigeri)
  • Target version deleted (Tails_2.6)

#14 Updated by intrigeri over 1 year ago

  • Assignee set to intrigeri
  • Target version set to Tails_2.7
  • Feature Branch changed from wip/test/11583-uefi-boot-is-fragile to test/11583-uefi-boot-is-fragile

Refreshed the branch, adding to Jenkins. I'll have a look in a week or so & see if it's still broken (there are some slim chances that #11588 has fixed it).

#15 Updated by intrigeri over 1 year ago

Last 3 runs fail with "Boot Failed. EFI Floppy" followed by failed attempts at doing PXE. I can't reproduce this failure locally (sid). Running it again now that #10777 is fixed (who knows).

#16 Updated by intrigeri over 1 year ago

intrigeri wrote:

Last 3 runs fail with "Boot Failed. EFI Floppy" followed by failed attempts at doing PXE. I can't reproduce this failure locally (sid). Running it again now that #10777 is fixed (who knows).

Same problem even with the branch for #10777 (except we now go through the UEFI firmware setup a few times, and sometimes see "Boot Failed. EFI Floppy", and sometimes a black screen with a cursor).

#17 Updated by intrigeri over 1 year ago

  • Assignee deleted (intrigeri)
  • Target version deleted (Tails_2.7)

#18 Updated by anonym over 1 year ago

I just refreshed this branch for gathering data for #12141. We'll see.

#19 Updated by anonym over 1 year ago

  • Related to Bug #12141: UEFI boot on QEMU is broken since 2.10~rc1 added

#20 Updated by anonym over 1 year ago

  • Assignee set to anonym

I just pushed a new branch called test/11583-uefi-boot-is-fragile-stretch (note the -stretch suffix) to see whether this problem remains on Stretch. I guess the problem will remain, but let's see. I'll take over the ticket until then.

#21 Updated by intrigeri about 1 year ago

  • Feature Branch changed from test/11583-uefi-boot-is-fragile to test/11583-uefi-boot-is-fragile-stretch

anonym wrote:

I just pushed a new branch called test/11583-uefi-boot-is-fragile-stretch (note the -stretch suffix) to see whether this problem remains on Stretch. I guess the problem will remain, but let's see. I'll take over the ticket until then.

Updated and pushed it again. You might want to set a target version so you look at the results before the job is garbage collected :)

#22 Updated by intrigeri about 1 year ago

... and hopefully #12511 will fix it, who knows :)

#23 Updated by intrigeri about 1 year ago

  • Status changed from Confirmed to In Progress
  • Target version set to Tails_3.0~rc1
  • % Done changed from 0 to 50
  • QA Check set to Ready for QA

#24 Updated by anonym about 1 year ago

  • Status changed from In Progress to Fix committed
  • Assignee deleted (anonym)
  • % Done changed from 50 to 100
  • QA Check changed from Ready for QA to Pass

intrigeri wrote:

It now passes: https://jenkins.tails.boum.org/job/test_Tails_ISO_test-11583-uefi-boot-is-fragile-stretch/1/cucumberTestReport/installing-tails-to-a-usb-drive/booting-tails-from-a-usb-drive-in-uefi-mode/ so please review and merge :)

Excellent! It also passes for me locally now (IIRC it has been broken for me for the past few months). Merged!

#25 Updated by intrigeri about 1 year ago

  • Status changed from Fix committed to Resolved
