Bug #12142

The nec-xhci virtual USB controller + tails-persistence-setup causes a VM freeze on Debian Stretch or newer hosts

Added by anonym 9 months ago. Updated 23 days ago.

Status:
In Progress
Priority:
Normal
Assignee:
Category:
Test suite
Target version:
Start date:
01/13/2017
Due date:
% Done:

30%

QA Check:
Feature Branch:
Type of work:
Communicate
Blueprint:
Easy:
Affected tool:

Description

While running the tests for Tails 2.10~rc1 I needed this:

--- a/features/domains/default.xml
+++ b/features/domains/default.xml
@@ -18,7 +18,9 @@
   <on_crash>restart</on_crash>
   <devices>
     <emulator>/usr/bin/qemu-system-x86_64</emulator>
-    <controller type='usb' index='0' model='nec-xhci'/>
+    <controller type='usb' index='0' model='ich9-ehci1'/>
+    <controller type='usb' index='0' model='ich9-uhci1'/>
+    <controller type='ide' index='0'/>
     <controller type='sata' index='0'/>
     <controller type='virtio-serial' index='0'/>
     <interface type='network'>

Otherwise the testing VM sort of froze -- some things worked, some not. It seemed like the USB controller got messed up. Indeed, I was tail:ing the journal, and some sort of USB controller reset was logged at the exact moment stuff stopped working.


Related issues

Related to Tails - Bug #11588: Sometimes fails to boot from USB on Jenkins with I/O errors Resolved 07/22/2016
Blocks Tails - Feature #13240: Core work 2017Q4: Test suite maintenance Confirmed 06/29/2017

History

#1 Updated by intrigeri 9 months ago

I could try reproducing this if it helps.

Note that we switched to nec-xhci for a reason, so let's make sure we don't break what it fixed while fixing the problem you saw.

#2 Updated by anonym 9 months ago

  • Assignee changed from anonym to intrigeri
  • QA Check set to Info Needed

This comment was completely wrong. IGNORE!

#3 Updated by anonym 9 months ago

Uh... completely ignore the previous comment.

intrigeri wrote:

I could try reproducing this if it helps.

That'd be great! Running the Booting Tails from a USB drive without a persistent partition and creating one scenario should be enough.

#4 Updated by intrigeri 9 months ago

  • Assignee changed from intrigeri to anonym
  • QA Check deleted (Info Needed)

Reproduced :/

#5 Updated by intrigeri 9 months ago

I guess the next step is to identify whether this is the result of a change in Tails, in Debian (guest), or in Debian (host).

#6 Updated by anonym 9 months ago

When I reproduced it this time, the VM remained fully functional, but the persistence setup still froze.

This is what I get in the journal:

amnesia kernel: usb 2-2: reset SuperSpeed USB device number 2 using xhci_hcd
amnesia kernel: usb 2-2: reset SuperSpeed USB device number 2 using xhci_hcd
amnesia kernel: usb 2-2: reset SuperSpeed USB device number 2 using xhci_hcd
amnesia kernel: usb 2-2: reset SuperSpeed USB device number 2 using xhci_hcd
amnesia kernel: usb 2-2: reset SuperSpeed USB device number 2 using xhci_hcd
amnesia kernel: usb 2-2: reset SuperSpeed USB device number 2 using xhci_hcd
amnesia kernel: sd 8:0:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_ABORT driverbyte=DRIVER_OK
amnesia kernel: sd 8:0:0:0: [sda] tag#0 CDB: Write(10) 2a 00 00 66 49 86 00 08 00 00
amnesia kernel: blk_update_request: I/O error, dev sda, sector 6703494
amnesia kernel: Buffer I/O error on dev dm-0, logical block 1573254, lost async page write
amnesia kernel: Buffer I/O error on dev dm-0, logical block 1573255, lost async page write
amnesia kernel: Buffer I/O error on dev dm-0, logical block 1573256, lost async page write
amnesia kernel: Buffer I/O error on dev dm-0, logical block 1573257, lost async page write
amnesia kernel: Buffer I/O error on dev dm-0, logical block 1573258, lost async page write
amnesia kernel: Buffer I/O error on dev dm-0, logical block 1573259, lost async page write
amnesia kernel: Buffer I/O error on dev dm-0, logical block 1573260, lost async page write
amnesia kernel: Buffer I/O error on dev dm-0, logical block 1573261, lost async page write
amnesia kernel: Buffer I/O error on dev dm-0, logical block 1573262, lost async page write
amnesia kernel: Buffer I/O error on dev dm-0, logical block 1573263, lost async page write
amnesia kernel: usb 2-2: reset SuperSpeed USB device number 2 using xhci_hcd
amnesia kernel: usb 2-2: reset SuperSpeed USB device number 2 using xhci_hcd
amnesia kernel: usb 2-2: reset SuperSpeed USB device number 2 using xhci_hcd
amnesia kernel: usb 2-2: reset SuperSpeed USB device number 2 using xhci_hcd
amnesia kernel: INFO: task mkfs.ext4:7244 blocked for more than 120 seconds.
amnesia kernel:       Tainted: G           OE   4.8.0-0.bpo.2-amd64 #1
amnesia kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
amnesia kernel: mkfs.ext4       D ffffa13dcc726000     0  7244   6078 0x20020000
amnesia kernel:  ffffa13dcc433080 ffffa13dff0e0080 0000000000000102 000000000000000e
amnesia kernel:  ffffa13dd5f44000 ffffa13dd5f43d08 7fffffffffffffff ffffffff8f9f0060
amnesia kernel:  ffffa13dd5f43d88 0007ffffffffffff ffffffff8f9ef7d1 0000000000000000
amnesia kernel: Call Trace:
amnesia kernel:  [<ffffffff8f9f0060>] ? bit_wait_timeout+0xa0/0xa0
amnesia kernel:  [<ffffffff8f9ef7d1>] ? schedule+0x31/0x80
amnesia kernel:  [<ffffffff8f9f2d8c>] ? schedule_timeout+0x21c/0x3c0
amnesia kernel:  [<ffffffff8f59010d>] ? pagevec_lookup_tag+0x1d/0x30
amnesia kernel:  [<ffffffff8f9f0060>] ? bit_wait_timeout+0xa0/0xa0
amnesia kernel:  [<ffffffff8f9eefe4>] ? io_schedule_timeout+0xb4/0x130
amnesia kernel:  [<ffffffff8f4bfdc6>] ? prepare_to_wait+0x56/0x80
amnesia kernel:  [<ffffffff8f9f0077>] ? bit_wait_io+0x17/0x60
amnesia kernel:  [<ffffffff8f9efb5c>] ? __wait_on_bit+0x5c/0x90
amnesia kernel:  [<ffffffff8f57fe1e>] ? find_get_pages_tag+0x15e/0x300
amnesia kernel:  [<ffffffff8f57ea94>] ? wait_on_page_bit+0xc4/0xe0
amnesia kernel:  [<ffffffff8f4c0130>] ? autoremove_wake_function+0x40/0x40
amnesia kernel:  [<ffffffff8f57eb87>] ? __filemap_fdatawait_range+0xd7/0x150
amnesia kernel:  [<ffffffff8f57ec0f>] ? filemap_fdatawait_range+0xf/0x30
amnesia kernel:  [<ffffffff8f581799>] ? filemap_write_and_wait_range+0x49/0x70
amnesia kernel:  [<ffffffff8f63f2b6>] ? blkdev_fsync+0x16/0x40
amnesia kernel:  [<ffffffff8f6383e8>] ? do_fsync+0x38/0x60
amnesia kernel:  [<ffffffff8f63865c>] ? SyS_fsync+0xc/0x10
amnesia kernel:  [<ffffffff8f403cf8>] ? do_int80_syscall_32+0x58/0x160
amnesia kernel:  [<ffffffff8f9f5473>] ? entry_INT80_compat+0x33/0x40
amnesia kernel: usb 2-2: reset SuperSpeed USB device number 2 using xhci_hcd
amnesia kernel: usb 2-2: reset SuperSpeed USB device number 2 using xhci_hcd
amnesia kernel: sd 8:0:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_ABORT driverbyte=DRIVER_OK
amnesia kernel: sd 8:0:0:0: [sda] tag#0 CDB: Write(10) 2a 00 00 66 51 86 00 08 00 00
amnesia kernel: blk_update_request: I/O error, dev sda, sector 6705542
amnesia kernel: buffer_io_error: 2038 callbacks suppressed
amnesia kernel: Buffer I/O error on dev dm-0, logical block 1575302, lost async page write
amnesia kernel: Buffer I/O error on dev dm-0, logical block 1575303, lost async page write
amnesia kernel: Buffer I/O error on dev dm-0, logical block 1575304, lost async page write
amnesia kernel: Buffer I/O error on dev dm-0, logical block 1575305, lost async page write
amnesia kernel: Buffer I/O error on dev dm-0, logical block 1575306, lost async page write
amnesia kernel: Buffer I/O error on dev dm-0, logical block 1575307, lost async page write
amnesia kernel: Buffer I/O error on dev dm-0, logical block 1575308, lost async page write
amnesia kernel: Buffer I/O error on dev dm-0, logical block 1575309, lost async page write
amnesia kernel: Buffer I/O error on dev dm-0, logical block 1575310, lost async page write
amnesia kernel: Buffer I/O error on dev dm-0, logical block 1575311, lost async page write
amnesia kernel: usb 2-2: reset SuperSpeed USB device number 2 using xhci_hcd
amnesia kernel: usb 2-2: reset SuperSpeed USB device number 2 using xhci_hcd
amnesia kernel: usb 2-2: reset SuperSpeed USB device number 2 using xhci_hcd

#7 Updated by anonym 9 months ago

intrigeri wrote:

I guess the next step is to identify whether this is the result of a change in Tails, in Debian (guest), or in Debian (host).

I think "Debian (host)" because I can reproduce with Tails 2.9.1, but I did not have this issue when working on that release ~1 month ago.

#8 Updated by anonym 9 months ago

  • Related to Bug #11588: Sometimes fails to boot from USB on Jenkins with I/O errors added

#9 Updated by anonym 9 months ago

  • Assignee changed from anonym to intrigeri
  • QA Check set to Info Needed

Downgrading to qemu-system-x86 1:2.7+dfsg-3+b1 fixes the issue for me. Can you please confirm?

#10 Updated by intrigeri 9 months ago

  • Assignee changed from intrigeri to anonym

Downgrading to qemu-system-x86 1:2.7+dfsg-3+b1 fixes the issue for me. Can you please confirm?

Confirmed.

FWIW hw/usb/hcd-xhci.c has changed quite a bit in v2.7.0..v2.8.0. It's tempting to try reverting these changes and see what happens.

#11 Updated by anonym 9 months ago

intrigeri wrote:

Downgrading to qemu-system-x86 1:2.7+dfsg-3+b1 fixes the issue for me. Can you please confirm?

Confirmed.

FWIW hw/usb/hcd-xhci.c has changed quite a bit in v2.7.0..v2.8.0.

I saw that too ("135 insertions(+), 103 deletions(-)" for the record).

It's tempting to try reverting these changes and see what happens.

Sure, but it seems like a lot of work. I'll give git-bisect a try.

If I fail, I suspect we'll have to live with this issue for a while, so what about patching the domain with the fix above (i.e. switching USB controller) if we detect qemu-system-x86 > 2.7.0? And make it part of #11739 (or similar) to find a proper solution.
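A minimal sketch of the version check proposed above, written in Python purely for illustration (it is not code from the Tails test suite). The helper names, the `LAST_GOOD` constant and the `1:2.8+dfsg-1` version string in the usage note are assumptions; only the controller model names and the `1:2.7+dfsg-3+b1` version come from this ticket. It compares the major and minor upstream version only, so any 2.7.x release would count as good.

```python
import re

# Assumption: 2.7 is the last QEMU series that works with nec-xhci here.
LAST_GOOD = (2, 7)


def upstream_version(debian_version):
    """Extract (major, minor) from a Debian version like '1:2.7+dfsg-3+b1'."""
    # Drop the optional epoch ("1:") before looking at the upstream part.
    without_epoch = debian_version.split(":", 1)[-1]
    match = re.match(r"(\d+)\.(\d+)", without_epoch)
    if match is None:
        raise ValueError(f"unrecognised version: {debian_version!r}")
    return int(match.group(1)), int(match.group(2))


def usb_controller_models(debian_version):
    """Pick the USB controller models to put in the domain XML."""
    if upstream_version(debian_version) > LAST_GOOD:
        # Workaround from the ticket description: avoid nec-xhci.
        return ["ich9-ehci1", "ich9-uhci1"]
    return ["nec-xhci"]
```

With these definitions, `usb_controller_models("1:2.7+dfsg-3+b1")` keeps `nec-xhci`, while a hypothetical `1:2.8+dfsg-1` package would select the two ich9 controllers.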

#12 Updated by intrigeri 9 months ago

If I fail, I suspect we'll have to live with this issue for a while, so what about patching the domain with the fix above (i.e. switching USB controller) if we detect qemu-system-x86 > 2.7.0?

Yes, if it doesn't reintroduce the bugs we fixed by moving to nec-xhci.

#13 Updated by anonym 9 months ago

anonym wrote:

intrigeri wrote:

It's tempting to try reverting these changes and see what happens.

Sure, but it seems like a lot of work. I'll give git-bisect a try.

Reverting the 10 commits made to hw/usb/hcd-xhci.c indeed fixes the issue for me! So, now I am bisecting (which will take a while...).
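For readers unfamiliar with bisection, here is a rough Python sketch of the search that git-bisect automates. It assumes the commits are ordered oldest to newest, that the newest one is known to be bad, and that every commit after the first bad one is bad too. The commit labels and the `is_bad` callback are hypothetical stand-ins for "build QEMU at this commit and run the scenario".

```python
def first_bad(commits, is_bad):
    """Return the first commit for which is_bad() is true.

    Assumes commits are ordered oldest to newest, the last one is bad,
    and badness never disappears once introduced.
    """
    lo, hi = 0, len(commits) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid        # the culprit is mid or something older
        else:
            lo = mid + 1    # the culprit is strictly newer than mid
    return commits[lo]


# Hypothetical example: ten commits, the seventh introduced the bug.
commits = [f"c{i}" for i in range(1, 11)]
tested = []


def is_bad(commit):
    tested.append(commit)
    return int(commit[1:]) >= 7


culprit = first_bad(commits, is_bad)
```

For ten commits this needs at most four builds and test runs, which is why bisecting is cheaper than testing each revert one by one, even if every build takes a while.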

intrigeri wrote:

If I fail, I suspect we'll have to live with this issue for a while, so what about patching the domain with the fix above (i.e. switching USB controller) if we detect qemu-system-x86 > 2.7.0?

Yes, if it doesn't reintroduce the bugs we fixed by moving to nec-xhci.

According to #11588 it was introduced to fix USB-related issues on Jenkins, no mention of problems in anyone's local setup. Let's put an EOL of this workaround by reverting it in the branch for #11739.

#14 Updated by anonym 9 months ago

  • Status changed from Confirmed to In Progress
  • % Done changed from 0 to 20
  • QA Check deleted (Info Needed)

The bad commit is:

05f43d44e4 xhci: limit the number of link trbs we are willing to process

which is the fix for CVE-2016-8576.

I fixed it with:

--- a/hw/usb/hcd-xhci.c
+++ b/hw/usb/hcd-xhci.c
@@ -53,7 +53,7 @@
  * to the specs when it gets them */
 #define ER_FULL_HACK

-#define TRB_LINK_LIMIT  4
+#define TRB_LINK_LIMIT  32

 #define LEN_CAP         0x40
 #define LEN_OPER        (0x400 + 0x10 * MAXPORTS)

My reasoning is that, for cycle detection in general, allowing at most 4 levels of links seems like a pretty low number. I know nothing about low-level USB stuff like the Transfer Ring, but I still feel like this bump is safe: a higher number should only mean a performance hit when encountering cycles, and at that point we have a fatal error anyway, so who cares?
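The trade-off can be illustrated with a toy Python model. This is a simplified sketch, not QEMU's implementation: the ring layout, the `fetch` function and the entry tuples are all assumptions made for illustration. It shows that a link limit only bounds how long a cycle is followed before giving up, while a limit that is too low also rejects a legitimate chain of links.

```python
LINK = "link"


def fetch(ring, start, link_limit):
    """Return the index of the first non-link entry reachable from start.

    Returns None if more than link_limit link entries had to be followed,
    which is how this toy model reports a suspected cycle.
    """
    index = start
    links_followed = 0
    while True:
        entry = ring[index]
        if entry[0] != LINK:
            return index
        links_followed += 1
        if links_followed > link_limit:
            return None
        index = entry[1]  # a link entry points at another ring index


# A legitimate chain: six links in a row, then a data entry at index 6.
chain = [(LINK, i + 1) for i in range(6)] + [("data",)]

# A real cycle: two links pointing at each other.
cycle = [(LINK, 1), (LINK, 0)]

low_limit_result = fetch(chain, 0, 4)    # gives up on a valid chain
high_limit_result = fetch(chain, 0, 32)  # reaches the data entry
cycle_result = fetch(cycle, 0, 32)       # still detects the cycle
```

In this model, raising the limit from 4 to 32 lets the six-link chain resolve to index 6, and the cycle is still rejected after a bounded number of steps.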

I've reported it to Debian: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=851694

I've also sent the patch to the QEMU devel mailing list, but it hasn't appeared in their archive yet.

As for now, we still need the workaround until Debian ships a fixed package by applying my patch or something similar, or by waiting for a fix from upstream. I'm wondering if it's worth adding the workaround, or if we should hope for a speedy reaction from the Debian QEMU team. I think I'll wait a few days and see if anything happens.

#15 Updated by intrigeri 9 months ago

As for now, we still need the workaround until Debian ships a fixed package by applying my patch or something similar, or by waiting for a fix from upstream. I'm wondering if it's worth adding the workaround, or if we should hope for a speedy reaction from the Debian QEMU team. I think I'll wait a few days and see if anything happens.

Sounds great! The few affected people should be able to apply your patch, or to downgrade QEMU, so there's no big rush.

#16 Updated by anonym 9 months ago

anonym wrote:

I've also sent the patch to the QEMU devel mailing list, but it hasn't appeared in their archive yet.

#17 Updated by anonym 9 months ago

  • Target version changed from Tails 2.10 to Tails_2.11

#18 Updated by intrigeri 8 months ago

What's the status of the patches sent upstream? (Maybe time to ping them?) It would be really nice if this was fixed in time for Stretch, otherwise we'll waste quite some time dealing with the fallout.

#19 Updated by anonym 8 months ago

  • Assignee changed from anonym to intrigeri
  • QA Check set to Info Needed

intrigeri wrote:

What's the status of the patches sent upstream?

  • libvirt upstream: nothing.
  • Debian BTS: the package maintainer asked me to resolve this upstream (which was already in progress).

(Maybe time to ping them?)

Yes! But first I want to know better what we can expect here.

It would be really nice if this was fixed in time for Stretch, otherwise we'll waste quite some time dealing with the fallout.

So I'm pretty sure Stretch will keep qemu 2.8, so at best we'll have the Debian package apply the patch. Since it is an extremely simple bugfix I guess it could be introduced even after Stretch is released, right? I'm just wondering what to write in my ping to the libvirt maintainers so they get the urgency. Or maybe there is no urgency vs Stretch?

#20 Updated by intrigeri 8 months ago

  • Assignee changed from intrigeri to anonym
  • QA Check changed from Info Needed to Dev Needed

(Maybe time to ping them?)

Yes!

:)

But first I want to know better what we can expect here.

Sure.

It would be really nice if this was fixed in time for Stretch, otherwise we'll waste quite some time dealing with the fallout.

So I'm pretty sure Stretch will keep qemu 2.8, so at best we'll have the Debian package apply the patch. Since it is an extremely simple bugfix I guess it could be introduced even after Stretch is released, right?

It might be an option (that could work or not), but really we should try hard to avoid relying on it IMO: the earlier it's fixed, the more chances it has to reach Stretch, the less work it causes for everyone involved, and the fewer risks it creates for users.

I'm just wondering what to write in my ping to the libvirt maintainers so they get the urgency. Or maybe there is no urgency vs Stretch?

Please communicate that we really need to see this fixed in time for Stretch, which has been frozen for a month already: we can't rely on fixing it later.

#21 Updated by anonym 8 months ago

  • Target version changed from Tails_2.11 to Tails_2.12

#22 Updated by intrigeri 6 months ago

Chances are that Stretch gets QEMU 2.8.1, so if our fix goes in 2.8.2 it might go into Stretch as well.

#23 Updated by anonym 6 months ago

  • Target version changed from Tails_2.12 to Tails_3.0

#24 Updated by intrigeri 6 months ago

Target version changed from Tails_2.12 to Tails_3.0

IMO our last chance to get this fixed in Stretch (and save developers & sysadmins some pain in the next 2 years) requires pinging upstream really soon. Do you need help writing this email? Feel free to reassign to me for 3.0~rc1 and I'll nag them, I don't find it hard to do personally.

#25 Updated by anonym 6 months ago

intrigeri wrote:

Target version changed from Tails_2.12 to Tails_3.0

IMO our last chance to get this fixed in Stretch (and save developers & sysadmins some pain in the next 2 years) requires pinging upstream really soon. Do you need help writing this email? Feel free to reassign to me for 3.0~rc1 and I'll nag them, I don't find it hard to do personally.

Sorry, I know I'm bad at pinging! :/ I have tried twice now to create a simple reproducer to post there as a sort of ping, but failed. Now I sent a ping to the list, plus Cc:ed the committer who introduced this potential regression, asking if my patch makes sense.

#26 Updated by intrigeri 5 months ago

  • Target version changed from Tails_3.0 to Tails_3.0~rc1

#27 Updated by anonym 5 months ago

  • Target version changed from Tails_3.0~rc1 to Tails_3.1

Pinged again.

#28 Updated by anonym 4 months ago

  • Blocks Feature #13239: Core work 2017Q3: Test suite maintenance added

#29 Updated by intrigeri 3 months ago

  • Subject changed from The nec-xhci virtual USB controller + tails-persistence-setup causes a VM freeze on Debian Sid hosts to The nec-xhci virtual USB controller + tails-persistence-setup causes a VM freeze on Debian Stretch or newer hosts

#30 Updated by anonym 3 months ago

  • Target version changed from Tails_3.1 to Tails_3.2

#31 Updated by intrigeri about 1 month ago

  • Target version changed from Tails_3.2 to Tails_3.3

#32 Updated by intrigeri 23 days ago

The exact same change that you submitted upstream was independently applied in commit 99f9aeba5d461f79c9ce73f968ba0feb77fc1f5a. Let's say it's good news.

Next steps:

  1. test this to confirm the bug is fixed: I could not do that as our test suite fails earlier with QEMU 2.10 (Call to virDomainAttachDeviceFlags failed: internal error: unable to execute QEMU command 'device_add': Property 'usb-storage.drive' can't find value 'drive-usb-disk0' (Guestfs::Error)), I'll report this in a dedicated ticket that'll block this one
  2. propose to the Debian maintainers that they backport the fix to Stretch

#33 Updated by intrigeri 23 days ago

  • % Done changed from 20 to 30
  • QA Check deleted (Dev Needed)
  • Type of work changed from Code to Communicate

intrigeri wrote:

  1. test this to confirm the bug is fixed: I could not do that as our test suite fails earlier with QEMU 2.10 (Call to virDomainAttachDeviceFlags failed: internal error: unable to execute QEMU command 'device_add': Property 'usb-storage.drive' can't find value 'drive-usb-disk0' (Guestfs::Error)), I'll report this in a dedicated ticket that'll block this one

Reported #14719 and tested "manually" outside of our test suite: it works! :)

  1. propose to the Debian maintainers that they backport the fix to Stretch

Done. I'll let you take this back from there.

#34 Updated by intrigeri 17 days ago

  • Blocks Feature #13240: Core work 2017Q4: Test suite maintenance added

#35 Updated by intrigeri 17 days ago

  • Blocks deleted (Feature #13239: Core work 2017Q3: Test suite maintenance)
