action #137564 (closed)
test fails sporadically in Autoyast iSCSI - cannot boot from ROM, '/usr/share/qemu/ipxe.lkrn' file may need update
Added by tinawang123 about 1 year ago. Updated about 1 year ago.
Description
Motivation
Related failed job: https://openqa.suse.de/tests/12413982#step/bootloader_start/3
All the failed jobs are related to iSCSI.
The qemu command line sets: -boot once=d
But the machine cannot boot from the DVD.
Updated by rfan1 about 1 year ago
Some findings from my side:
Based on the openQA results, the passed jobs ran on worker2, which uses qemu 6.2.x, whereas the failed jobs ran on worker30, which uses qemu 7.1.x.
I am not sure whether the new qemu version makes a difference. I will investigate further and ask the qe-tools and qe-virtualization teams for help.
Updated by okurz about 1 year ago
Let's follow up with that hypothesis. We don't have many machines left on Leap 15.4 but still some. So I triggered
for i in {001..010}; do for class in sapworker1 openqaworker14 worker30 worker40; do openqa-clone-job --within-instance openqa.suse.de/t12413982 INCLUDE_MODULES=prepare_profile,bootloader_start _GROUP=0 BUILD=poo137564 TEST+=-poo137564-$class-$i WORKER_CLASS=$class; done; done
https://openqa.suse.de/tests/overview?version=15-SP6&distri=sle&build=poo137564
openqaworker14 still runs Leap 15.4. Regarding the qemu versions:
sudo salt --no-color -L 'sapworker1.qe.nue2.suse.org,openqaworker14.qa.suse.cz,worker30.oqa.prg2.suse.org,worker40.oqa.prg2.suse.org' cmd.run 'rpm -q qemu'
shows
worker40.oqa.prg2.suse.org:
qemu-7.1.0-150500.49.6.1.x86_64
worker30.oqa.prg2.suse.org:
qemu-7.1.0-150500.49.6.1.x86_64
sapworker1.qe.nue2.suse.org:
qemu-7.1.0-150500.49.6.1.x86_64
openqaworker14.qa.suse.cz:
qemu-6.2.0-150400.37.20.1.x86_64
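To make the split between qemu versions easier to see, the salt output above can be grouped by major version. This is only a minimal sketch: the helper names `qemu_versions` and `group_by_major` are made up for illustration, and the input is simply the output pasted above.

```python
import re

# Output of the salt `rpm -q qemu` call, copied from the comment above.
SALT_OUTPUT = """\
worker40.oqa.prg2.suse.org:
    qemu-7.1.0-150500.49.6.1.x86_64
worker30.oqa.prg2.suse.org:
    qemu-7.1.0-150500.49.6.1.x86_64
sapworker1.qe.nue2.suse.org:
    qemu-7.1.0-150500.49.6.1.x86_64
openqaworker14.qa.suse.cz:
    qemu-6.2.0-150400.37.20.1.x86_64
"""


def qemu_versions(salt_output):
    """Map each minion name to the upstream qemu version it reports."""
    versions = {}
    host = None
    for line in salt_output.splitlines():
        line = line.strip()
        if line.endswith(":"):
            # A line ending in ':' names the minion the next line belongs to.
            host = line[:-1]
            continue
        match = re.match(r"qemu-(\d+\.\d+\.\d+)-", line)
        if host and match:
            versions[host] = match.group(1)
    return versions


def group_by_major(versions):
    """Group minion names by the major qemu version (hypothetical helper)."""
    groups = {}
    for host, version in sorted(versions.items()):
        groups.setdefault(version.split(".")[0], []).append(host)
    return groups


if __name__ == "__main__":
    for major, hosts in group_by_major(qemu_versions(SALT_OUTPUT)).items():
        print(f"qemu {major}.x: {', '.join(hosts)}")
```

With the pasted output, only openqaworker14 ends up in the qemu 6 group; the other three worker classes all report qemu 7.1.0.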
https://openqa.suse.de/tests/overview?version=15-SP6&distri=sle&build=poo137564 shows clearly that in 30/30 runs on qemu-7.1.0 booting from CDROM failed. Also the logs show an error, e.g. see https://openqa.suse.de/tests/12417758/logfile?filename=autoinst-log.txt
[2023-10-07T13:19:15.557786+02:00] [debug] [pid:95172] QEMU: KVM internal error. Suberror: 1
[2023-10-07T13:19:15.557879+02:00] [debug] [pid:95172] QEMU: emulation failure
[2023-10-07T13:19:15.557930+02:00] [debug] [pid:95172] QEMU: EAX=00000000 EBX=00000000 ECX=00001000 EDX=00001000
[2023-10-07T13:19:15.557983+02:00] [debug] [pid:95172] QEMU: ESI=00011020 EDI=000903ab EBP=00009cbc ESP=00002fbe
[2023-10-07T13:19:15.558028+02:00] [debug] [pid:95172] QEMU: EIP=00000392 EFL=00000006 [-----P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
[2023-10-07T13:19:15.558072+02:00] [debug] [pid:95172] QEMU: ES =0000 00000000 0000ffff 00009300 DPL=0 DS16 [-WA]
[2023-10-07T13:19:15.558110+02:00] [debug] [pid:95172] QEMU: CS =0008 00010200 ffffffff 00809b00 DPL=0 CS16 [-RA]
[2023-10-07T13:19:15.558169+02:00] [debug] [pid:95172] QEMU: SS =9cbc 0009cbc0 0000ffff 00009300 DPL=0 DS16 [-WA]
[2023-10-07T13:19:15.558233+02:00] [debug] [pid:95172] QEMU: DS =1020 00010200 0000ffff 00009300 DPL=0 DS16 [-WA]
[2023-10-07T13:19:15.558290+02:00] [debug] [pid:95172] QEMU: FS =1000 00010000 0000ffff 00009300 DPL=0 DS16 [-WA]
[2023-10-07T13:19:15.558330+02:00] [debug] [pid:95172] QEMU: GS =1000 00010000 0000ffff 00009300 DPL=0 DS16 [-WA]
[2023-10-07T13:19:15.558378+02:00] [debug] [pid:95172] QEMU: LDT=0000 00000000 0000ffff 00008200 DPL=0 LDT
[2023-10-07T13:19:15.558422+02:00] [debug] [pid:95172] QEMU: TR =0000 00000000 0000ffff 00008b00 DPL=0 TSS32-busy
[2023-10-07T13:19:15.558464+02:00] [debug] [pid:95172] QEMU: GDT= 0009fb84 0000001f
[2023-10-07T13:19:15.558509+02:00] [debug] [pid:95172] QEMU: IDT= 00000000 000003ff
[2023-10-07T13:19:15.558554+02:00] [debug] [pid:95172] QEMU: CR0=00000011 CR2=00000000 CR3=00000000 CR4=00000000
[2023-10-07T13:19:15.558591+02:00] [debug] [pid:95172] QEMU: DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
[2023-10-07T13:19:15.558636+02:00] [debug] [pid:95172] QEMU: DR6=00000000ffff0ff0 DR7=0000000000000400
[2023-10-07T13:19:15.558675+02:00] [debug] [pid:95172] QEMU: EFER=0000000000000000
[2023-10-07T13:19:15.558713+02:00] [debug] [pid:95172] QEMU: Code=6e 74 69 6e 75 65 0a 00 0a 50 61 79 6c 6f 61 64 20 69 6e 61 <63> 63 65 73 73 69 62 6c 65 20 2d 20 63 61
[2023-10-07T13:19:15.558855+02:00] [debug] [pid:95172] QEMU: 6e 6e 6f 74 20 63 6f 6e 74 69 6e 75 65 0a 00 67
The successful jobs like https://openqa.suse.de/tests/12417727/logfile?filename=autoinst-log.txt do not show this. In 10/10 jobs on qemu-6 on https://openqa.suse.de/tests/overview?version=15-SP6&distri=sle&build=poo137564 no problem is visible.
Other successful jobs on Leap 15.5 workers with qemu do not show this problem, e.g. see https://openqa.suse.de/tests/12409619/logfile?filename=autoinst-log.txt . I suggest you look further into this qemu/kvm error and why the referenced test scenario seems to trigger it while other qemu-x86_64 jobs do not.
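As a side note on the register dump above: the bytes on the two "Code=" lines are printable ASCII, so decoding them may give a hint. The sketch below only decodes the bytes copied from the log; the function names are hypothetical, and reading the `<63>` marker as the byte at the current instruction pointer is an assumption about how qemu formats this dump.

```python
# The two "Code=" lines from autoinst-log.txt, with the log prefix removed.
CODE_DUMP = (
    "6e 74 69 6e 75 65 0a 00 0a 50 61 79 6c 6f 61 64 20 69 6e 61 <63> "
    "63 65 73 73 69 62 6c 65 20 2d 20 63 61 "
    "6e 6e 6f 74 20 63 6f 6e 74 69 6e 75 65 0a 00 67"
)


def decode_code_dump(dump):
    """Turn the space-separated hex bytes into an ASCII string."""
    tokens = dump.replace("<", "").replace(">", "").split()
    return bytes(int(token, 16) for token in tokens).decode("ascii")


def marked_offset(dump):
    """Return the index of the byte wrapped in <..>, or None if absent."""
    for index, token in enumerate(dump.split()):
        if token.startswith("<"):
            return index
    return None


if __name__ == "__main__":
    text = decode_code_dump(CODE_DUMP)
    # repr() keeps the embedded newlines and NUL bytes visible.
    print(repr(text))
    print("marked byte:", repr(text[marked_offset(CODE_DUMP)]))
```

The decoded bytes contain the text "Payload inaccessible - cannot continue", which suggests the dump shows a message string rather than regular instructions. Whether that message comes from the ipxe.lkrn image is not confirmed by the log itself.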
Updated by leli about 1 year ago
- Status changed from New to Workable
- Target version set to Current
Updated by rfan1 about 1 year ago
- Status changed from Workable to In Progress
Thanks a lot @okurz for the kind update!
After investigating more, I found a clue: the system seems to hang at the iPXE ROM and fails to proceed to boot from the CD-ROM. For this iSCSI installation test, we use the ipxe.lkrn file to start the kernel boot, hook the iSCSI LUN, and then install the system. The current ipxe.lkrn file might be out of date and need an update:
-kernel /usr/share/qemu/ipxe.lkrn -append dhcp && sanhook iscsi:xx.xx.xx.xx::3260:1:iqn.2016-02.openqa.de:for.openqa
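For readers unfamiliar with the `sanhook` argument, the fields of the iSCSI URI can be split as sketched below. This assumes the `iscsi:<servername>:<protocol>:<port>:<LUN>:<targetname>` layout described in the iPXE documentation; the `parse_iscsi_uri` helper is hypothetical and does not handle bracketed IPv6 addresses.

```python
from typing import NamedTuple


class IscsiUri(NamedTuple):
    server: str
    protocol: str  # empty means "use the default"
    port: str
    lun: str
    target: str


def parse_iscsi_uri(uri):
    """Split an iscsi: URI into its five fields (illustrative helper)."""
    # The target name (an IQN) may itself contain ':', so split at most
    # five times and keep the remainder as the target.
    parts = uri.split(":", 5)
    if len(parts) != 6 or parts[0] != "iscsi":
        raise ValueError(f"unexpected iSCSI URI: {uri!r}")
    return IscsiUri(*parts[1:])


if __name__ == "__main__":
    # Address is masked in the ticket and kept masked here.
    uri = "iscsi:xx.xx.xx.xx::3260:1:iqn.2016-02.openqa.de:for.openqa"
    print(parse_iscsi_uri(uri))
```

For the URI in the command above, this yields an empty protocol field, port 3260, LUN 1, and the target name iqn.2016-02.openqa.de:for.openqa.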
I rebuilt the /usr/share/qemu/ipxe.lkrn file following the steps described at https://ipxe.org/download and replaced the one on worker30:
# git clone https://github.com/ipxe/ipxe.git
# cd ipxe/src
# make
The test case now passes with the new ipxe.lkrn file: https://openqa.suse.de/tests/12417826
-rw-r--r-- 1 root root 339007 Oct 7 16:04 ipxe.lkrn [new one]
-rw-r--r-- 1 root root 310511 Oct 7 16:08 ipxe.lkrn.orig [old one]
However, I don't think I am following the right process to update or replace the ipxe.lkrn file. IMO this file should be maintained by salt or some other openQA process.
Can the tools team shed some light on this?
In summary, after updating the ipxe.lkrn file, the issue is gone on worker30 with qemu version 7.1.x.
Updated by rfan1 about 1 year ago
- Subject changed from test fails sporadically in Autoyast iSCSI - cannot boot from ROM to test fails sporadically in Autoyast iSCSI - cannot boot from ROM, 'ipxe.lkrn' file may need update
Updated by rfan1 about 1 year ago
- Subject changed from test fails sporadically in Autoyast iSCSI - cannot boot from ROM, 'ipxe.lkrn' file may need update to test fails sporadically in Autoyast iSCSI - cannot boot from ROM, '/usr/share/qemu/ipxe.lkrn' file may need update
Updated by rfan1 about 1 year ago
I found that the package 'ipxe-bootimgs' already provides a built binary file; let me try with it.
Updated by rfan1 about 1 year ago
Well, the test passed with the binary file from the package ipxe-bootimgs:
https://openqa.suse.de/tests/12455790
Let me post a PR/MR later to use this new file.
Updated by okurz about 1 year ago
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1016 merged and deployed.
Updated by rfan1 about 1 year ago
- Status changed from In Progress to Feedback
The PR and MR are both merged. Let me check the next openQA run; I may have to wait for the next openQA deployment to make sure all changes take effect.
Updated by rfan1 about 1 year ago
- Status changed from Feedback to Resolved
With kind help from the tools team, the issue is fixed now.