action #25982
[sle][functional][sporadic][hard][u][s390x] test fails in bootloader - "Please make sure your installation medium is available.""
Status: closed
100% done
Description
Observation
openQA test in scenario sle-15-Installer-DVD-s390x-skip_registration+workaround_modules@zkvm fails in
bootloader_zkvm
Message in autoinst-log.txt:
no/incomplete answer.
Please make sure your installation medium is available.
Reproducible
Fails since (at least) Build 300.1 (current job)
Acceptance criteria
AC1: DONE: get an idea about the root cause
AC2: DONE: gather statistics - it's pretty sporadic
AC3: DONE: Come up with a fix or a workaround to mitigate the issue -> better debug options with this PR, real workaround is to restart the job
AC4: DONE: Implement post_fail_hook to bootloader_s390 and bootloader_zkvm to gather e.g. linuxrc logs (/var/log/linuxrc) -> see Pull Request
Expected result
Last good: 299.1 (or more recent)
Problem
Happens in about 5/100 cases (see comments), most likely for the first tests of a new build, shortly after a new build hits openQA. Retriggering helps here.
- REJECTED H1: Jobs are scheduled before the build was actually completely synced to osd -> E1-1 Check the logs in osd:/var/log/openqa_rsync.log and the job's log to see if the job is triggered before syncing is complete
- REJECTED H2: Something about caching over the network?
- H3: Many jobs starting at the same time cause an overload, so the installer gets a negative reply when trying to access the FTP server ->
- E3-1 Trigger many jobs in parallel for an existing build, e.g. the previous build when the test queue for s390x is empty
- E3-2 Check whether there are any logs on osd regarding the FTP server (or Apache logs) that show a denied request
- ACCEPTED H4: The network adapter cannot be set up correctly and therefore linuxrc falls back to DHCP, which is not set up for the s390 subnet
Further details
Always latest result in this scenario: latest
Workaround
Retrigger jobs. They should be able to find the repo after retriggering.
Updated by okurz about 7 years ago
- Subject changed from [sle][functional]test fails in bootloader_zkvm - "no/incomplete answer. Please make sure your installation medium is available." to [sle][functional][sporadic]test fails in bootloader_zkvm - "no/incomplete answer. Please make sure your installation medium is available."
- Priority changed from Normal to Low
retriggered test succeeded: https://openqa.suse.de/tests/1209567#
Updated by okurz almost 7 years ago
- Subject changed from [sle][functional][sporadic]test fails in bootloader_zkvm - "no/incomplete answer. Please make sure your installation medium is available." to [sle][functional][sporadic][hard]test fails in bootloader_zkvm - "no/incomplete answer. Please make sure your installation medium is available."
- Status changed from New to Rejected
- Assignee set to okurz
Looking back in the history it happened exactly twice in 50 jobs, so 2/50. The second instance is https://openqa.suse.de/tests/1212170#step/bootloader_zkvm/26 in build 302.1. Since then it has been quite stable and the issue was not seen again. Not sure if we need to do anything about it. Closing in the hope that whenever we see it again, someone will find this issue and reopen it.
Updated by mgriessmeier almost 7 years ago
- Status changed from Rejected to New
:( twice in Build 437.3
https://openqa.suse.de/tests/1441590
https://openqa.suse.de/tests/1441662
Updated by riafarov almost 7 years ago
- Target version changed from Milestone 13 to Milestone 14
Updated by mgriessmeier almost 7 years ago
- Subject changed from [sle][functional][sporadic][hard]test fails in bootloader_zkvm - "no/incomplete answer. Please make sure your installation medium is available." to [sle][functional][sporadic][hard][research][u] test fails in bootloader_zkvm - "no/incomplete answer. Please make sure your installation medium is available."
- Description updated (diff)
- Status changed from New to Workable
Updated by mgriessmeier almost 7 years ago
- Subject changed from [sle][functional][sporadic][hard][research][u] test fails in bootloader_zkvm - "no/incomplete answer. Please make sure your installation medium is available." to [sle][functional][sporadic][hard][research][u][s390x] test fails in bootloader - "Please make sure your installation medium is available.""
my guess is that somehow the test starts before the media is fully synced.
I remember seeing this from time to time shortly after a new build hits openQA,
but a restart doesn't show it again.
Updated by okurz almost 7 years ago
- Subject changed from [sle][functional][sporadic][hard][research][u][s390x] test fails in bootloader - "Please make sure your installation medium is available."" to [sle][functional][sporadic][hard][u][s390x] test fails in bootloader - "Please make sure your installation medium is available.""
- Description updated (diff)
- Assignee deleted (okurz)
Yes, I can confirm this hypothesis. Two jobs failed on this for build 489.1 - I think they were actually zVM ones - and both succeeded in finding the installation medium after being retriggered.
Adapted the description with hypotheses and experiments to conduct.
Updated by mgriessmeier almost 7 years ago
- Due date changed from 2018-03-13 to 2018-03-27
- Target version changed from Milestone 14 to Milestone 15
Updated by okurz almost 7 years ago
- Due date deleted (2018-03-27)
- Target version changed from Milestone 15 to Milestone 17
Updated by okurz almost 7 years ago
- Priority changed from Normal to High
Updated by okurz almost 7 years ago
- Due date set to 2018-04-10
- Target version changed from Milestone 17 to Milestone 15
This issue is too annoying to keep around untouched for complete milestone periods.
Updated by okurz almost 7 years ago
- Due date changed from 2018-04-10 to 2018-03-27
adding late to S13 as we have capacity
Updated by mloviska almost 7 years ago
Isn't the problem that VLAN was not added on NIC after the creation of network config?
enc1: network config created
8021q: adding VLAN 0 to HW filter on device enc1
Starting hardware detection... ok
(If a driver is not working for you, try booting with brokenmodules=driver_name.)
Virtio Storage 0
drivers: virtio_blk*
Virtio Ethernet Card 0
drivers: virtio_net*
enc1: network config created
Sending DHCP request to enc1...
no/incomplete answer.
sle-15-Installer-DVD-s390x-Build300.1-create_hdd_minimal_base+sdk@zkvm
enc1: network config created
8021q: adding VLAN 0 to HW filter on device enc1
Starting hardware detection... ok
(If a driver is not working for you, try booting with brokenmodules=driver_name.)
Virtio Storage 0
drivers: virtio_blk*
Virtio Ethernet Card 0
drivers: virtio_net*
Loading Installation System (1/6) - 0% 1% 2% 3% 4% 5% 6% 7% 8% 9% 10% 11% 12% 13% 14% 15% 16% 17% 18% 19% 20% 21% 22% 23% 24% 25% 26% 27% 28% 29% 30% 31% 32% 33% 34% 35% 36% 37% 38% 39% 40% 41% 42% 43% 44% 45% 46% 47% 48% 49% 50% 51% 52% 53% 54% 55% 56% 57% 58% 59% 60% 61% 62% 63% 64% 65% 66% 67% 68% 69% 70% 71% 72% 73% 74% 75% 76% 77% 78% 79% 80% 81% 82% 83% 84% 85% 86% 87% 88% 89% 90% 91% 92% 93% 94% 95% 96% 97% 98% 99%100%
Loading Installation System (2/6) - 0% 1% 2% 3% 4% 5% 6% 7% 8% 9% 10% 11% 12% 13% 14% 15% 16% 17% 18% 19% 20% 21% 22% 23% 24% 25% 26% 27% 28% 29% 30% 31% 32% 33% 34% 35% 36% 37% 38% 39% 40% 41% 42% 43% 44% 45% 46% 47% 48% 49% 50% 51% 52% 53% 54% 55% 56% 57% 58% 59% 60% 61% 62% 63% 64% 65% 66% 67% 68% 69% 70% 71% 72% 73% 74% 75% 76% 77% 78% 79% 80% 81% 82% 83% 84% 85% 86% 87% 88% 89% 90% 91% 92% 93% 94% 95% 96% 97% 98% 99%100%
Loading Installation System (3/6) - 0% 1% 2% 3% 4% 5% 6% 7% 8% 9% 10% 11% 12% 13% 14% 15% 16% 17% 18% 19% 20% 21% 22% 23% 24% 25% 26% 27% 28% 29% 30% 31% 32% 33% 34% 35% 36% 37% 38% 39% 40% 41% 42% 43% 44% 45% 46% 47% 48% 49% 50% 51% 52% 53% 54% 55% 56% 57% 58% 59% 60% 61% 62% 63% 64% 65% 66% 67% 68% 69% 70% 71% 72% 73% 74% 75% 76% 77% 78% 79% 80% 81% 82% 83% 84% 85% 86% 87% 88% 89% 90% 91% 92% 93% 94% 95% 96% 97% 98% 99%100%
Loading Installation System (4/6) - 0% 2% 3% 5% 6% 7% 8% 9% 10% 11% 12% 13% 14% 15% 17% 18% 19% 20% 21% 22% 23% 24% 25% 26% 27% 28% 30% 31% 32% 33% 34% 35% 36% 37% 38% 39% 40% 42% 43% 44% 45% 46% 47% 48% 49% 50% 51% 52% 53% 55% 56% 57% 58% 59% 60% 61% 62% 63% 64% 65% 67% 68% 69% 70% 71% 72% 73% 74% 75% 76% 77% 78% 80% 81% 82% 83% 84% 85% 86% 87% 88% 89% 90% 92% 93% 94% 95% 96% 97% 98% 99%100%
Loading Installation System (5/6) - 0% 14% 31% 44% 61% 78% 94%100%
Loading Installation System (6/6) - - missing (optional)
BAD PASSWORD: it is based on a dictionary word
Updated by mgriessmeier almost 7 years ago
- Status changed from Workable to In Progress
- Assignee set to mgriessmeier
latest occurrence in Build 522.1
mloviska wrote:
Isn't the problem that VLAN was not added on NIC after the creation of network config?
could be related - maybe it takes too long? - I'll check this
Updated by mgriessmeier almost 7 years ago
mloviska wrote:
Isn't the problem that VLAN was not added on NIC after the creation of network config?
enc1: network config created 8021q: adding VLAN 0 to HW filter on device enc1 Starting hardware detection... ok (If a driver is not working for you, try booting with brokenmodules=driver_name.) Virtio Storage 0 drivers: virtio_blk* Virtio Ethernet Card 0 drivers: virtio_net* enc1: network config created Sending DHCP request to enc1... no/incomplete answer.
This might apply to z/VM, but the logs above are from KVM, where we don't use DHCP.
Updated by mgriessmeier almost 7 years ago
it's really hard to debug this - o.s.d doesn't provide logs for the FTP server,
and since it's hardly reproducible this might be a waste of time.
Regarding a workaround, I'm thinking of checking for this message and simply retrying to boot the installation system - which has one culprit: in order to test this we probably need a reliable reproducer, or do it blindly on the production system...
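For illustration, such a retry around the boot step could look roughly like this. This is only a sketch against the os-autoinst testapi (wait_serial, record_info), not an actual implementation; start_installer() stands in for the existing zKVM boot code and is hypothetical:

use strict;
use warnings;
use testapi;

sub boot_installer_with_retry {
    my ($retries) = @_;
    $retries //= 2;
    for my $attempt (1 .. $retries) {
        # hypothetical: existing code that kicks off the installer on the zKVM guest
        start_installer();
        # wait for either normal progress or the known failure message on the serial line
        my $matched = wait_serial(qr/Loading Installation System|installation medium is available/, 600);
        return 1 if $matched && $matched =~ /Loading Installation System/;
        record_info('retry', "installation medium not found, retrying boot ($attempt/$retries)");
    }
    die "installer did not find the installation medium after $retries attempts";
}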
Updated by okurz almost 7 years ago
mgriessmeier wrote:
it's really hard to debug this - o.s.d doesn't provide logs for the FTP server,
and since it's hardly reproducible this might be a waste of time.
Regarding a workaround, I'm thinking of checking for this message and simply retrying to boot the installation system - which has one culprit: in order to test this we probably need a reliable reproducer, or do it blindly on the production system...
I think you mean s/culprit/pitfall/ or drawback :)
I would be ok with testing this out in production. It is limited to s390x tests; we will either see errors we introduced quite early, or see soon that the old problem is not fixed, so whatever is the fastest way to make sure this does not impact our testing - go for it. However, I would like to see this issue fully understood by us, even if we have the mentioned workaround in place.
Updated by mgriessmeier almost 7 years ago
failing example in RC2: https://openqa.suse.de/tests/1563925/
Updated by mgriessmeier almost 7 years ago
- Status changed from In Progress to Workable
After talking to Steffen Winterfeldt, who is the linuxrc maintainer, I found out that this behaviour is most likely due to linuxrc failing to set up a network device and falling back to DHCP, which is not configured for the s390 subnet.
To find out why it sometimes fails to set up a network adapter, we would need to collect linuxrc logs.
I already tried to implement a post_fail_hook for bootloader_s390 and bootloader_zkvm, but have failed so far.
I don't think I can deliver on this during this sprint.
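For reference, such a post_fail_hook could look roughly like the sketch below. It is not the code from the later PR; the 'install-shell' console name and the /var/log/linuxrc path are assumptions taken from this ticket, and upload_logs can of course fail when the network itself is the broken part, which is why the log is also dumped to the serial console:

use strict;
use warnings;
use testapi;

sub post_fail_hook {
    my ($self) = @_;
    # switch to a shell inside the installation system, if one is reachable;
    # 'install-shell' is an assumed console name here
    select_console 'install-shell';
    # dump the linuxrc log to the serial device so it ends up in the serial log
    # even when uploading over the network is impossible
    script_run "cat /var/log/linuxrc > /dev/$serialdev";
    # wicked status may explain why the interface did not come up
    script_run "wicked ifstatus --verbose all > /dev/$serialdev";
    # best effort only: upload_logs needs working networking, which may be the
    # very thing that is broken in this failure
    upload_logs('/var/log/linuxrc', failok => 1);
}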
Updated by mgriessmeier almost 7 years ago
- Status changed from Workable to In Progress
Updated by mgriessmeier almost 7 years ago
- Due date changed from 2018-03-27 to 2018-04-10
Updated by mgriessmeier almost 7 years ago
- Description updated (diff)
- Status changed from In Progress to Feedback
Updated by mgriessmeier almost 7 years ago
- Status changed from Feedback to Resolved
- % Done changed from 0 to 100
The PR is merged - hopefully we will get some good debug output out of this
Updated by okurz almost 7 years ago
- Status changed from Resolved to Feedback
Sorry but I object to your reasoning regarding AC3 that "the real workaround is to retrigger the job". We do not even have a bug report about this issue and I want to stay with the original AC3 as stated "Come up with a fix or a workaround to mitigate the issue" … within the job, not on the level of humans retriggering a job.
Updated by mgriessmeier almost 7 years ago
okurz wrote:
Sorry but I object to your reasoning regarding AC3 that "the real workaround is to retrigger the job". We do not even have a bug report about this issue and I want to stay with the original AC3 as stated "Come up with a fix or a workaround to mitigate the issue" … within the job, not on the level of humans retriggering a job.
was worth a try ;)
Updated by mgriessmeier over 6 years ago
Linuxrc and wickedd logs from failing test:
LINUXRC:
18:27:47 <2>: update_device_list(0)
18:27:47 <2>: rescanning devices
18:27:47 <2>: device not found (err = 0): *label/OEMDRV
18:27:47 <2>: url mount: trying disk:/?device=*label/OEMDRV (path = /)
18:27:47 <2>: url_setup_device: disk:/?device=*label/OEMDRV
18:27:47 <2>: Failed to download AutoYaST file.
18:27:47 <2>: dud url: disk:/?device=*usb*&all=1&quiet=1
18:27:47 <4>: url = disk:/?device=*usb*&all=1&quiet=1
18:27:47 <2>: Reading driver update: disk:/?device=*usb*&all=1&quiet=1
18:27:47 <2>: update_device_list(0)
18:27:47 <2>: rescanning devices
18:27:47 <2>: device not found (err = 0): *usb*
18:27:47 <2>: url mount: trying disk:/?device=*usb* (path = /)
18:27:47 <2>: url_setup_device: disk:/?device=*usb*
18:27:47 <2>: dud url: disk:/?device=*label/OEMDRV&quiet=1
18:27:47 <4>: url = disk:/?device=*label/OEMDRV&quiet=1
18:27:47 <2>: Reading driver update: disk:/?device=*label/OEMDRV&quiet=1
18:27:49 <4>: wicked ifup eth0
18:28:19 <4>: exec: wicked ifup eth0 = 158
18:28:19 <4>: stdout + stderr:
eth0 device-not-running
18:28:20 <4>: net_update_state:
18:28:20 <4>: lo: up,
18:28:20 <4>: eth0: device-not-running
18:28:20 <4>:
18:28:20 <2>: eth0 network setup failed
18:28:20 <2>: url mount: trying ftp://openqa/SLE-15-Installer-DVD-s390x-Build538.1-Media1?device=lo (path = SLE-15-Installer-DVD-s390x-Build538.1-Media1)
18:28:20 <2>: (Loopback network interface)
18:28:20 <2>: url_setup_device: ftp://openqa/SLE-15-Installer-DVD-s390x-Build538.1-Media1?device=lo
18:28:20 <2>: setup_interface: lo ignored
18:28:20 <2>: device not found (err = 1):
18:28:20 <2>: repository: not found
WICKED:
wickedd-nanny[336]: ::: ni_managed_device_progress(eth0) target(network-up [network-up..max]), transition(device-down => device-exists)
wickedd-nanny[336]: ::: ni_managed_device_progress(eth0) target(network-up [network-up..max]), transition(device-exists => device-ready)
wickedd-nanny[336]: ::: ni_managed_device_progress(eth0) target(network-up [network-up..max]), transition(device-ready => device-setup)
wickedd-nanny[336]: ::: ni_managed_device_progress(eth0) target(network-up [network-up..max]), transition(device-setup => protocols-up)
wickedd[328]: Error: ni_process_reap: process 1175 has not exited yet; now doing a blocking waitpid()
wickedd-nanny[336]: ::: ni_managed_device_progress(eth0) target(network-up [network-up..max]), transition(protocols-up => firewall-up)
wickedd[328]: ::: eth0: getting ethtool.eee options failed: Operation not supported
wickedd[328]: ::: eth0: getting ethtool.coalesce options failed: Operation not supported
wickedd[328]: ::: ni_system_interface_link_change(eth0)
wickedd[328]: ::: bringing up eth0
wickedd[328]: ::: eth0: getting ethtool.eee options failed: Operation not supported
wickedd[328]: ::: eth0: getting ethtool.coalesce options failed: Operation not supported
wickedd[328]: ::: eth0: getting ethtool.eee options failed: Operation not supported
wickedd[328]: ::: eth0: getting ethtool.coalesce options failed: Operation not supported
wickedd-nanny[336]: ::: ni_managed_device_progress(eth0) target(network-up [network-up..max]), transition(firewall-up => device-up)
wickedd-nanny[336]: ::: ni_managed_device_progress(eth0) target(network-up [network-up..max]), transition(device-up => link-up)
wickedd-nanny[336]: ::: ni_managed_device_progress(eth0) target(network-up [network-up..max]), transition(link-up => link-authenticated)
wickedd-nanny[336]: ::: ni_managed_device_progress(eth0) target(network-up [network-up..max]), transition(link-authenticated => lldp-up)
wickedd[328]: ::: eth0: received ipv4:static lease update in state granted
wickedd[328]: ::: eth0: adjusting mtu for ipv4:static lease in state applying: success [after 0m0.0s]
wickedd[328]: ::: eth0: applying addresses for ipv4:static lease in state applying: deferred [since 0m0.1s]
wickedd[328]: Error: eth0: IPv4 duplicate address 10.161.155.147 detected (in use by 02:00:29:00:00:30)!
wickedd[328]: ::: eth0: applying addresses for ipv4:static lease in state applying: success [after 0m0.330s]
wickedd[328]: ::: eth0: verifying addresses for ipv4:static lease in state applying: success [after 0m0.0s]
wickedd[328]: ::: ipv4: adding new static:eth0 lease route ipv4 0.0.0.0/0 via 10.161.159.254 dev eth0 type unicast table main scope universe protocol boot
wickedd[328]: ::: __ni_rtnl_send_newroute(create ipv4 0.0.0.0/0 via 10.161.159.254 dev eth0 type unicast table main scope universe protocol boot)
wickedd[328]: ::: netlink reports error -101
wickedd[328]: Error: __ni_rtnl_send_newroute(ipv4 0.0.0.0/0 via 10.161.159.254 dev eth0 type unicast table main scope universe protocol boot): ni_nl_talk failed [Unspecific failure]
wickedd[328]: ::: eth0: applying routes for ipv4:static lease in state applying: failure [after 0m0.0s]
wickedd[328]: ::: eth0: updater for lease ipv4:static in state applying finished: failure [0m0.387s]
wickedd[328]: ::: eth0: getting ethtool.eee options failed: Operation not supported
wickedd[328]: ::: eth0: getting ethtool.coalesce options failed: Operation not supported
wickedd[328]: ::: eth0: getting ethtool.eee options failed: Operation not supported
wickedd[328]: ::: eth0: getting ethtool.coalesce options failed: Operation not supported
Updated by mgriessmeier over 6 years ago
I guess that's the issue here ;)
wickedd[328]: Error: eth0: IPv4 duplicate address 10.161.155.147 detected (in use by 02:00:29:00:00:30)!
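As a quick cross-check, such a duplicate address can be confirmed from any other machine in the same subnet using iputils' arping in duplicate-address-detection mode. A rough sketch, with the interface name and IP taken from the log above:

use strict;
use warnings;

# values taken from the wicked log above
my $ip  = '10.161.155.147';
my $dev = 'eth0';

# arping -D (duplicate address detection) exits 0 when no reply is received,
# i.e. the address looks free; it needs root or CAP_NET_RAW
my $rc = system('arping', '-D', '-I', $dev, '-c', '2', $ip);
if ($rc == 0) {
    print "$ip looks free on $dev\n";
} else {
    print "$ip is already in use on $dev (or arping failed)\n";
}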
Updated by okurz over 6 years ago
that's cool. I assume that is another s390x z/VM instance?
Updated by mgriessmeier over 6 years ago
might be - I'm currently investigating with gerhard
Updated by mgriessmeier over 6 years ago
The issue for z/VM is solved - there was an old instance still hanging around which we had used as an interim worker in production during the storage outage...
We reused the old IPs to keep things simple...
Unfortunately this means that the issue on zKVM/SUSE KVM, which this ticket was initially created for, still exists...
I didn't see it on KVM for a long time... which means we need to wait for another occurrence to get the debug output that was implemented there...
I'd suggest closing this ticket and creating a new one as a follow-up when the issue occurs again.
WDYT olli?
Updated by mloviska over 6 years ago
Issue appeared again.
https://openqa.suse.de/tests/1650784#step/bootloader_zkvm/26