Project

General

Profile

Actions

action #25982

closed

[sle][functional][sporadic][hard][u][s390x] test fails in bootloader - "Please make sure your installation medium is available.""

Added by okurz over 6 years ago. Updated almost 6 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Bugs in existing tests
Start date:
2017-10-11
Due date:
2018-04-10
% Done:

100%

Estimated time:
Difficulty:
hard

Description

Observation

openQA test in scenario sle-15-Installer-DVD-s390x-skip_registration+workaround_modules@zkvm fails in
bootloader_zkvm

Message in autoinst-log.txt:

no/incomplete answer.
Please make sure your installation medium is available.

Reproducible

Fails since (at least) Build 300.1 (current job)

Acceptance criteria

AC1: DONE: get an idea about the root cause
AC2: DONE: gather statistics - it's pretty sporadic
AC3: DONE: Come up with a fix or a workaround to mitigate the issue -> better debug options with this PR, real workaround is to restart the job
AC4: DONE: Implement post_fail_hook to bootloader_s390 and bootloader_zkvm to gather e.g. linuxrc logs (/var/log/linuxrc) -> see Pull Request

Expected result

Last good: 299.1 (or more recent)

Problem

Happens in about 5/100 cases (see comments) and most likely for the first tests in a new build, shortly after a new build hits openQA. Retriggering helps here.

  • REJECTED H1: Jobs are scheduled before the build was actually completely synced to osd -> E1-1 Check logs of osd:/var/log/openqa_rsync.log and the jobs log to see if the job is triggered before syncing is complete
  • REJECTED H2: Something about caching over network?
  • H3: Many jobs starting at the same time cause an overload causing the installer to get a negative reply trying to access the FTP server ->

    • E3-1 Trigger many jobs in parallel for an existing build, e.g. the previous build when the test queue for s390x is empty
    • E3-2 Check if there are any logs on osd regarding the ftp server to see if there was a deny or apache logs?#
  • ACCEPTED H4: Network adapter cannot be set up correctly and therefor linuxrc falls back to dhcp which is not set up for s390 subnet

Further details

Always latest result in this scenario: latest

Workaround

Retrigger jobs. They should be able to find the repo after retriggering.

Actions #1

Updated by okurz over 6 years ago

  • Subject changed from [sle][functional]test fails in bootloader_zkvm - "no/incomplete answer. Please make sure your installation medium is available." to [sle][functional][sporadic]test fails in bootloader_zkvm - "no/incomplete answer. Please make sure your installation medium is available."
  • Priority changed from Normal to Low

retriggered test succeeeded: https://openqa.suse.de/tests/1209567#

Actions #2

Updated by okurz over 6 years ago

  • Target version set to Milestone 13
Actions #3

Updated by okurz about 6 years ago

  • Subject changed from [sle][functional][sporadic]test fails in bootloader_zkvm - "no/incomplete answer. Please make sure your installation medium is available." to [sle][functional][sporadic][hard]test fails in bootloader_zkvm - "no/incomplete answer. Please make sure your installation medium is available."
  • Status changed from New to Rejected
  • Assignee set to okurz

Looking back in history it happened exactly twice in 50 jobs so 2/50. The second instance is https://openqa.suse.de/tests/1212170#step/bootloader_zkvm/26 in build 302.1 . Since then quite stable and not seen again. Not sure if we need to do something about it. Closing in the hope whenever we see it again someone will find this issue again and reopen.

Actions #4

Updated by mgriessmeier about 6 years ago

  • Status changed from Rejected to New
Actions #5

Updated by mgriessmeier about 6 years ago

  • Priority changed from Low to Normal
Actions #6

Updated by riafarov about 6 years ago

  • Target version changed from Milestone 13 to Milestone 14
Actions #7

Updated by okurz about 6 years ago

  • Due date set to 2018-03-13
Actions #8

Updated by mgriessmeier about 6 years ago

  • Subject changed from [sle][functional][sporadic][hard]test fails in bootloader_zkvm - "no/incomplete answer. Please make sure your installation medium is available." to [sle][functional][sporadic][hard][research][u] test fails in bootloader_zkvm - "no/incomplete answer. Please make sure your installation medium is available."
  • Description updated (diff)
  • Status changed from New to Workable
Actions #9

Updated by mgriessmeier about 6 years ago

  • Subject changed from [sle][functional][sporadic][hard][research][u] test fails in bootloader_zkvm - "no/incomplete answer. Please make sure your installation medium is available." to [sle][functional][sporadic][hard][research][u][s390x] test fails in bootloader - "Please make sure your installation medium is available.""

my guess is that somehow the test starts before the media is fully synced

I remember to see this from time to time shortly after a new build hits openQA
but restart doesn't show it again

Actions #10

Updated by okurz about 6 years ago

  • Subject changed from [sle][functional][sporadic][hard][research][u][s390x] test fails in bootloader - "Please make sure your installation medium is available."" to [sle][functional][sporadic][hard][u][s390x] test fails in bootloader - "Please make sure your installation medium is available.""
  • Description updated (diff)
  • Assignee deleted (okurz)

yes, I can confirm this hypothesis. Two jobs failed on this for build 489.1 - I think they were actually zVM ones - and both succeeded to find the installation medium after retriggering.

Adapted description with hypotheses and experiments to conduct

Actions #11

Updated by mgriessmeier about 6 years ago

  • Due date changed from 2018-03-13 to 2018-03-27
  • Target version changed from Milestone 14 to Milestone 15
Actions #12

Updated by okurz about 6 years ago

  • Due date deleted (2018-03-27)
  • Target version changed from Milestone 15 to Milestone 17
Actions #14

Updated by okurz about 6 years ago

  • Due date set to 2018-04-10
  • Target version changed from Milestone 17 to Milestone 15

that issue is too annoying to keep around untouched for complete milestone periods.

Actions #15

Updated by okurz about 6 years ago

  • Due date changed from 2018-04-10 to 2018-03-27

adding late to S13 as we have capacity

Actions #16

Updated by mloviska about 6 years ago

Isn't the problem that VLAN was not added on NIC after the creation of network config?

enc1: network config created
 8021q: adding VLAN 0 to HW filter on device enc1

Starting hardware detection... ok
 (If a driver is not working for you, try booting with brokenmodules=driver_name.)

 Virtio Storage 0
   drivers: virtio_blk*
 Virtio Ethernet Card 0
   drivers: virtio_net*
 enc1: network config created
 Sending DHCP request to enc1...
 no/incomplete answer.

sle-15-Installer-DVD-s390x-Build300.1-create_hdd_minimal_base+sdk@zkvm

enc1: network config created
 8021q: adding VLAN 0 to HW filter on device enc1

Starting hardware detection... ok
 (If a driver is not working for you, try booting with brokenmodules=driver_name.)

 Virtio Storage 0
   drivers: virtio_blk*
 Virtio Ethernet Card 0
   drivers: virtio_net*
 Loading Installation System (1/6) -            0%  1%  2%  3%  4%  5%  6%  7%  8%  9% 10% 11% 12% 13% 14% 15% 16% 17% 18% 19% 20% 21% 22% 23% 24% 25% 26% 27% 28% 29% 30% 31% 32% 33% 34% 35% 36% 37% 38% 39% 40% 41% 42% 43% 44% 45% 46% 47% 48% 49% 50% 51% 52% 53% 54% 55% 56% 57% 58% 59% 60% 61% 62% 63% 64% 65% 66% 67% 68% 69% 70% 71% 72% 73% 74% 75% 76% 77% 78% 79% 80% 81% 82% 83% 84% 85% 86% 87% 88% 89% 90% 91% 92% 93% 94% 95% 96% 97% 98% 99%100%
 Loading Installation System (2/6) -            0%  1%  2%  3%  4%  5%  6%  7%  8%  9% 10% 11% 12% 13% 14% 15% 16% 17% 18% 19% 20% 21% 22% 23% 24% 25% 26% 27% 28% 29% 30% 31% 32% 33% 34% 35% 36% 37% 38% 39% 40% 41% 42% 43% 44% 45% 46% 47% 48% 49% 50% 51% 52% 53% 54% 55% 56% 57% 58% 59% 60% 61% 62% 63% 64% 65% 66% 67% 68% 69% 70% 71% 72% 73% 74% 75% 76% 77% 78% 79% 80% 81% 82% 83% 84% 85% 86% 87% 88% 89% 90% 91% 92% 93% 94% 95% 96% 97% 98% 99%100%
 Loading Installation System (3/6) -            0%  1%  2%  3%  4%  5%  6%  7%  8%  9% 10% 11% 12% 13% 14% 15% 16% 17% 18% 19% 20% 21% 22% 23% 24% 25% 26% 27% 28% 29% 30% 31% 32% 33% 34% 35% 36% 37% 38% 39% 40% 41% 42% 43% 44% 45% 46% 47% 48% 49% 50% 51% 52% 53% 54% 55% 56% 57% 58% 59% 60% 61% 62% 63% 64% 65% 66% 67% 68% 69% 70% 71% 72% 73% 74% 75% 76% 77% 78% 79% 80% 81% 82% 83% 84% 85% 86% 87% 88% 89% 90% 91% 92% 93% 94% 95% 96% 97% 98% 99%100%
 Loading Installation System (4/6) -            0%  2%  3%  5%  6%  7%  8%  9% 10% 11% 12% 13% 14% 15% 17% 18% 19% 20% 21% 22% 23% 24% 25% 26% 27% 28% 30% 31% 32% 33% 34% 35% 36% 37% 38% 39% 40% 42% 43% 44% 45% 46% 47% 48% 49% 50% 51% 52% 53% 55% 56% 57% 58% 59% 60% 61% 62% 63% 64% 65% 67% 68% 69% 70% 71% 72% 73% 74% 75% 76% 77% 78% 80% 81% 82% 83% 84% 85% 86% 87% 88% 89% 90% 92% 93% 94% 95% 96% 97% 98% 99%100%
 Loading Installation System (5/6) -            0% 14% 31% 44% 61% 78% 94%100%
 Loading Installation System (6/6) -           - missing (optional)
 BAD PASSWORD: it is based on a dictionary word
Actions #17

Updated by mgriessmeier about 6 years ago

  • Status changed from Workable to In Progress
  • Assignee set to mgriessmeier

latest occurence in Build 522.1

mloviska wrote:

Isn't the problem that VLAN was not added on NIC after the creation of network config?

could be related - maybe it takes too long? - I'll check this

Actions #18

Updated by mgriessmeier about 6 years ago

mloviska wrote:

Isn't the problem that VLAN was not added on NIC after the creation of network config?

enc1: network config created
 8021q: adding VLAN 0 to HW filter on device enc1

Starting hardware detection... ok
 (If a driver is not working for you, try booting with brokenmodules=driver_name.)

 Virtio Storage 0
   drivers: virtio_blk*
 Virtio Ethernet Card 0
   drivers: virtio_net*
 enc1: network config created
 Sending DHCP request to enc1...
 no/incomplete answer.

This might apply for z/VM, but the logs above are from KVM and there we don't use dhcp

Actions #19

Updated by mgriessmeier about 6 years ago

it's really hard to debug this - o.s.d doesn't provide logs for ftp server
and since it's hardly reproducible this might be a waste of time

regarding a workaround, I'm thinking of checking for this message and simply retry booting the installation system - which has one culprit: in order to test this probably we need a reliable reproducer or do it blindly on the production system...

Actions #20

Updated by okurz about 6 years ago

mgriessmeier wrote:

it's really hard to debug this - o.s.d doesn't provide logs for ftp server
and since it's hardly reproducible this might be a waste of time

regarding a workaround, I'm thinking of checking for this message and simply retry booting the installation system - which has one culprit: in order to test this probably we need a reliable reproducer or do it blindly on the production system...

I think you mean s/culprit/pitfall/ or drawback :)

I would be ok to test this out in production. It is limited to s390x tests, we will see errors we produced quite early or the old problem not fixed soon so whatever is the fastest way to make sure this does not impact our testing - go for it. However, I would like to see this issue fully understood by us even though we might have this mentioned workaround in place.

Actions #21

Updated by mgriessmeier about 6 years ago

failing example in RC2: https://openqa.suse.de/tests/1563925/

Actions #22

Updated by mgriessmeier about 6 years ago

  • Status changed from In Progress to Workable

After talking to steffen winterfeldt, who is the linuxrc maintainer, I found out that this behaviour is most likely due to linuxrc failing to set up a network device and falling back to dhcp, which is not configured for the s390 subnet.
to find out why it sometimes fails to set up a network adapter, we would need to collect linuxrc logs

I already tried to implement a post_fail_hook for bootloader_s390 and bootloader_zkvm, but failed so far.
I don't think that I can deliver on this during this sprint.

Actions #23

Updated by mgriessmeier about 6 years ago

  • Description updated (diff)
Actions #24

Updated by mgriessmeier about 6 years ago

  • Status changed from Workable to In Progress
Actions #25

Updated by mgriessmeier about 6 years ago

  • Due date changed from 2018-03-27 to 2018-04-10
Actions #26

Updated by cwh about 6 years ago

  • Difficulty set to hard
Actions #28

Updated by mgriessmeier about 6 years ago

  • Status changed from Feedback to Resolved
  • % Done changed from 0 to 100

PR is merged - hopefully we will get some good debug out of this

Actions #29

Updated by okurz almost 6 years ago

  • Status changed from Resolved to Feedback

Sorry but I object to your reasoning regarding AC3 that "the real workaround is to retrigger the job". We do not even have a bug report about this issue and I want to stay with the original AC3 as stated "Come up with a fix or a workaround to mitigate the issue" … within the job, not on the level of humans retriggering a job.

Actions #30

Updated by mgriessmeier almost 6 years ago

okurz wrote:

Sorry but I object to your reasoning regarding AC3 that "the real workaround is to retrigger the job". We do not even have a bug report about this issue and I want to stay with the original AC3 as stated "Come up with a fix or a workaround to mitigate the issue" … within the job, not on the level of humans retriggering a job.

was worth a try ;)

Actions #31

Updated by mgriessmeier almost 6 years ago

Linuxrc and wickedd logs from failing test:


LINUXRC:
    18:27:47 <2>: update_device_list(0)                                             
    18:27:47 <2>: rescanning devices                                                
    18:27:47 <2>: device not found (err = 0): *label/OEMDRV                         
    18:27:47 <2>: url mount: trying disk:/?device=*label/OEMDRV (path = /)          
    18:27:47 <2>: url_setup_device: disk:/?device=*label/OEMDRV                     
    18:27:47 <2>: Failed to download AutoYaST file.                                 
    18:27:47 <2>: dud url: disk:/?device=*usb*&all=1&quiet=1                        
    18:27:47 <4>: url = disk:/?device=*usb*&all=1&quiet=1                           
    18:27:47 <2>: Reading driver update: disk:/?device=*usb*&all=1&quiet=1          
    18:27:47 <2>: update_device_list(0)                                             
    18:27:47 <2>: rescanning devices                                                
    18:27:47 <2>: device not found (err = 0): *usb*                                 
    18:27:47 <2>: url mount: trying disk:/?device=*usb* (path = /)                  
    18:27:47 <2>: url_setup_device: disk:/?device=*usb*                             
    18:27:47 <2>: dud url: disk:/?device=*label/OEMDRV&quiet=1                      
    18:27:47 <4>: url = disk:/?device=*label/OEMDRV&quiet=1                         
    18:27:47 <2>: Reading driver update: disk:/?device=*label/OEMDRV&quiet=1        
    18:27:49 <4>: wicked ifup eth0                                                  
    18:28:19 <4>: exec: wicked ifup eth0 = 158                                      
    18:28:19 <4>: stdout + stderr:                                                  
    eth0            device-not-running                                              
    18:28:20 <4>: net_update_state:                                                 
    18:28:20 <4>: lo: up,                                                           
    18:28:20 <4>: eth0: device-not-running                                          
    18:28:20 <4>:                                                                   
    18:28:20 <2>: eth0 network setup failed                                         
    18:28:20 <2>: url mount: trying ftp://openqa/SLE-15-Installer-DVD-s390x-Build538
    .1-Media1?device=lo (path = SLE-15-Installer-DVD-s390x-Build538.1-Media1)       
    18:28:20 <2>: (Loopback network interface)                                      
    18:28:20 <2>: url_setup_device: ftp://openqa/SLE-15-Installer-DVD-s390x-Build538
    .1-Media1?device=lo                                                             
    18:28:20 <2>: setup_interface: lo ignored                                       
    18:28:20 <2>: device not found (err = 1):                                       
    18:28:20 <2>: repository: not found                      


WICKED:
    wickedd-nanny[336]: ::: ni_managed_device_progress(eth0) target(network-up [network-up..max]), transition(device-down => device-exists)                                                                        
    wickedd-nanny[336]: ::: ni_managed_device_progress(eth0) target(network-up [network-up..max]), transition(device-exists => device-ready)                                                                       
    wickedd-nanny[336]: ::: ni_managed_device_progress(eth0) target(network-up [network-up..max]), transition(device-ready => device-setup)                                                                        
    wickedd-nanny[336]: ::: ni_managed_device_progress(eth0) target(network-up [network-up..max]), transition(device-setup => protocols-up)                                                                        
    wickedd[328]: Error: ni_process_reap: process 1175 has not exited yet; now doing a blocking waitpid()                          
    wickedd-nanny[336]: ::: ni_managed_device_prog ess(eth0) target(network-up [network-up..max]), transition(protocols-up => firewall-up)                                                                         
    wickedd[328]: ::: eth0: getting ethtool.eee options failed: Operation not supported                                            
    wickedd[328]: ::: eth0: getting ethtool.coalesce options failed: Operation not supported                                       
    wickedd[328]: ::: ni_system_interface_link_change(eth0)                                                                        
    wickedd[328]: ::: bringing up eth0             
    wickedd[328]: ::: eth0: getting ethtool.eee options failed: Operation not supported                                            
    wickedd[328]: ::: eth0: getting ethtool.coalesce options failed: Operation not supported                                       
    wickedd[328]: ::: eth0: getting ethtool.eee options failed: Operation not supported                                            
    wickedd[328]: ::: eth0: getting ethtool.coalesce options failed: Operation not supported                                       
    wickedd-nanny[336]: ::: ni_managed_device_progress(eth0) target(network-up [network-up..max]), transition(firewall-up => device-up)                                                                            
    wickedd-nanny[336]: ::: ni_managed_device_progress(eth0) target(network-up [network-up..max]), transition(device-up => link-up)
    wickedd-nanny[336]: ::: ni_managed_device_progress(eth0) target(network-up [network-up..max]), transition(link-up => link-authenticated)                                                                       
    wickedd-nanny[336]: ::: ni_managed_device_progress(eth0) target(network-up [network-up..max]), transition(link-authenticated =>lldp-up)                                                                       
    wickedd[328]: ::: eth0: received ipv4:static lease update in state granted                                                     
    wickedd[328]: ::: eth0: adjusting mtu for ipv4:static lease in state applying: success [after 0m0.0s]                          
    wickedd[328]: ::: eth0: applying addresses for ipv4:static lease in state applying: deferred [since 0m0.1s]                    
    wickedd[328]: Error: eth0: IPv4 duplicate address 10.161.155.147 detected (in use by 02:00:29:00:00:30)!                       
    wickedd[328]: ::: eth0: applying addresses for ipv4:static lease in state applying: success [after 0m0.330s]                   
    wickedd[328]: ::: eth0: verifying adressses foripv4:static lease in state applying: success [after 0m0.0s]                    
    wickedd[328]: ::: ipv4: adding new static:eth0 lease route ipv4 0.0.0.0/0 via 10.161.159.254 dev eth0 type unicast table main scope universe protocol boot                                                     
    wickedd[328]: ::: __ni_rtnl_send_newroute(create ipv4 0.0.0.0/0 via 10.161.159.254 dev eth0 type unicast table main scope universe protocol boot)                                                              
    wickedd[328]: ::: netlink reports error -101   
    wickedd[328]: Error: __ni_rtnl_send_newroute(ipv4 0.0.0.0/0 via 10.161.159.254 dev eth0 type unicast table main scope universe protocol boot): ni_nl_talk failed [Unspecific failure]                          
    wickedd[328]: ::: eth0: applying routes for ipv4:static lease in state applying: failure [after 0m0.0s]                        
    wickedd[328]: ::: eth0: updater for lease ipv4:static in state applying finished: failure [0m0.387s]                           
    wickedd[328]: ::: eth0: getting ethtool.eee options failed: Operation not supported                                            
    wickedd[328]: ::: eth0: getting ethtool.coalesce options failed: Operation not supported                                       
    wickedd[328]: ::: eth0: getting ethtool.eee options failed: Operation not supported                                            
    wickedd[328]: ::: eth0: getting ethtool.coalesce options failed: Operation not supported                                                              
Actions #32

Updated by mgriessmeier almost 6 years ago

I guess that's the issue here ;)
wickedd[328]: Error: eth0: IPv4 duplicate address 10.161.155.147 detected (in use by 02:00:29:00:00:30)!

Actions #33

Updated by okurz almost 6 years ago

that's cool. I assume that is another s390x z/VM instance?

Actions #34

Updated by mgriessmeier almost 6 years ago

might be - I'm currently investigating with gerhard

Actions #35

Updated by mgriessmeier almost 6 years ago

Issue for z/VM is solved - there was an old instance still hanging around which we used as interim worker in production during the storage outage...
We reused the old ips to keep things simple...

unfortunately this means that the issue on zkvm/suse kvm, where this issue was initially created for, still exists...
I didn't see it on kvm for a long time... which means we need to wait for another occurence to get the debug output which was implemented there...

I'd suggest to close this ticket and create a new one as followup when the issue occurs again.
WDYT olli?

Actions #36

Updated by okurz almost 6 years ago

  • Status changed from Feedback to Resolved

Agreed, good job!

Actions

Also available in: Atom PDF