action #26044

[functional][sle][s390x][u][infrastructure][sporadic][hard] test fails in bootloader_zkvm because IPs could be duplicate when jobs are started at the same time

Added by nicksinger almost 4 years ago. Updated over 3 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Bugs in existing tests
Target version:
SUSE QA - Milestone 17
Start date:
2017-10-13
Due date:
2018-06-19
% Done:

0%

Estimated time:
Difficulty:

Description

Observation

openQA test in scenario sle-15-Installer-DVD-s390x-btrfs@zkvm and many more fail in
bootloader_zkvm

The problem is that in rare cases, when we boot an existing HDD image, it can still have the same IP address as a VM running on another worker (e.g. the worker where the qcow image was created).
We need a better way to change the IP addresses, or even delete them from a created qcow image and set the correct new one when the image is booted.

Workaround

  • Retrigger the job

Reproducible

Fails since (at least) Build 303.1 in many different test suites. Always the same few characters missing.


Related issues

Related to openQA Tests - action #36745: [openqa][sle][functional][u][s390x][zkvm][kernel] Broken boot due "Test died: no candidate needle with tag(s) 'password-prompt' matched" - Resolved - 2018-06-04

History

#1 Updated by mgriessmeier almost 4 years ago

The "URL shown cut off" thingy is probably not the root cause - this is just because of character limitation there.

most likely it is/was a temporary network issue:

Sending DHCP request to enc1...
no/incomplete answer.

I've restarted all jobs failing in bootloader_zkvm and will monitor them

#2 Updated by mgriessmeier almost 4 years ago

  • Status changed from New to In Progress
  • Assignee set to mgriessmeier

#3 Updated by mgriessmeier almost 4 years ago

  • Status changed from In Progress to Resolved

none of the 5 jobs I've retriggered is showing the issue, so I guess it was related to some network issue and probably also related to the poweroff
https://openqa.suse.de/tests/overview?distri=sle&version=15&build=303.1&groupid=110&arch=s390x&failed_modules=bootloader_zkvm

What we could consider is a better user feedback, but IMO Test died: yast didn't start is already sufficient here
please reopen if seeing this again...

#4 Updated by zluo over 3 years ago

  • Status changed from Resolved to Workable

this issue happened again:
https://openqa.suse.de/tests/1630483#step/bootloader_zkvm/26

re-opened this ticket.

#5 Updated by okurz over 3 years ago

  • Subject changed from [functional][sle][s390] test fails in bootloader_zkvm because FTP url is cut off to [functional][sle][s390][u][fast] test fails in bootloader_zkvm because FTP url is cut off
  • Due date set to 2018-04-24
  • Target version set to Milestone 15

#6 Updated by mgriessmeier over 3 years ago

it's not about the "cut off FTP URL"; this is just a display issue.
The actual reason is that there is a small time frame in which jobs starting in parallel can end up with duplicate IP addresses, because of the network rewrite we do in qcow images on s390 kvm.

right now the workaround for this is to restart the job

#7 Updated by mgriessmeier over 3 years ago

  • Subject changed from [functional][sle][s390][u][fast] test fails in bootloader_zkvm because FTP url is cut off to [functional][sle][s390][u][fast] test fails in bootloader_zkvm because IPs could be duplicate when jobs are started at the same time
  • Description updated (diff)
  • Priority changed from Urgent to High

#8 Updated by mgriessmeier over 3 years ago

  • Subject changed from [functional][sle][s390][u][fast] test fails in bootloader_zkvm because IPs could be duplicate when jobs are started at the same time to [functional][sle][s390][u][infrastructure][sporadic] test fails in bootloader_zkvm because IPs could be duplicate when jobs are started at the same time
  • Description updated (diff)
  • Due date changed from 2018-04-24 to 2018-05-08
  • Target version changed from Milestone 15 to Milestone 16

moving to next sprint, no capacity left and issue is not urgent as a workaround exists

#10 Updated by okurz over 3 years ago

That seems to be becoming more annoying:

Can we put in a check that the network was restarted without any error output and fail otherwise?
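
A minimal sketch of such a check, assuming it is acceptable to treat any stderr output as a failure. The helper name run_clean_or_fail is hypothetical and not part of the existing tests; it runs a command, captures its error output, and fails if the exit status is non-zero or anything was written to stderr.

```shell
#!/bin/bash
# Hypothetical helper (not from the test distribution):
# run a command and treat a non-zero exit status OR any stderr output as failure.
run_clean_or_fail() {
    local errfile status=0
    errfile=$(mktemp) || return 1
    # Capture stderr in a temporary file; remember the exit status.
    "$@" 2>"$errfile" || status=$?
    if [ "$status" -ne 0 ] || [ -s "$errfile" ]; then
        echo "command failed (exit $status): $*" >&2
        cat "$errfile" >&2
        rm -f "$errfile"
        return 1
    fi
    rm -f "$errfile"
    return 0
}
```

It could be wrapped around whatever command restarts the network in the SUT, for example "run_clean_or_fail systemctl restart wicked || exit 1". That command is only an assumption here; the ticket does not say how the restart is triggered.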

#11 Updated by okurz over 3 years ago

I am not sure if the following shows the same problem but it looked interesting at least. I triggered 10 jobs with:

$ for i in {1..10} ; do openqa_clone_job_osd --skip-chained-deps 1668775 TEST=okurz_bsc#1091186_$i BUILD=598.1:bsc1091186 _GROUP="Test Development: SLE 15" ; done

Results on
https://openqa.suse.de/tests/overview?version=15&build=598.1%3Absc1091186&distri=sle&groupid=96

and soon after the triggering 4 were running, four failed in "boot_to_desktop". This looks related.

Suggestions:

  • Crosscheck the IP-addresses assigned to each machine
  • Crosscheck if there is output about duplicate IP address or not
  • Crosscheck timestamps
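
The first cross-check could be sketched roughly as follows, assuming the job-to-IP assignments have already been collected as "job IP" lines (one pair per line). The function name find_duplicate_ips and the input format are assumptions for illustration; how the pairs are extracted from the job logs is not covered.

```shell
#!/bin/bash
# Hypothetical helper: read "job ip" pairs on stdin and print every IP
# that is used by more than one job, followed by the jobs using it.
find_duplicate_ips() {
    awk '
        NF >= 2 {
            count[$2]++
            # Append the job to the space-separated list kept per IP.
            jobs[$2] = (jobs[$2] == "" ? $1 : jobs[$2] " " $1)
        }
        END {
            for (ip in count)
                if (count[ip] > 1)
                    print ip ": " jobs[ip]
        }
    ' | sort
}
```

An empty result means no IP was shared between the listed jobs; comparing the timestamps of the affected jobs would still have to be done separately.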

#12 Updated by mgriessmeier over 3 years ago

  • Status changed from Workable to In Progress

#13 Updated by mgriessmeier over 3 years ago

okurz wrote:

I am not sure if the following shows the same problem but it looked interesting at least. I triggered 10 jobs with:

$ for i in {1..10} ; do openqa_clone_job_osd --skip-chained-deps 1668775 TEST=okurz_bsc#1091186_$i BUILD=598.1:bsc1091186 _GROUP="Test Development: SLE 15" ; done

Results on
https://openqa.suse.de/tests/overview?version=15&build=598.1%3Absc1091186&distri=sle&groupid=96

and soon after the triggering 4 were running, four failed in "boot_to_desktop". This looks related.

Suggestions:

  • Crosscheck the IP-addresses assigned to each machine
  • Crosscheck if there is output about duplicate IP address or not
  • Crosscheck timestamps

3/4 are failing due to duplicate IP-Addresses
I'm currently working on a list of IP-Assignments for each and every worker and the IPs which are used by the tests
so suggesting to carry over to next Sprint

#14 Updated by mgriessmeier over 3 years ago

  • Due date changed from 2018-05-08 to 2018-05-22

#15 Updated by okurz over 3 years ago

mgriessmeier wrote:

I'm currently working on a list of IP-Assignments for each and every worker and the IPs which are used by the tests

sounds like you are implementing a DHCP server ;)

#16 Updated by riafarov over 3 years ago

  • Subject changed from [functional][sle][s390][u][infrastructure][sporadic] test fails in bootloader_zkvm because IPs could be duplicate when jobs are started at the same time to [functional][sle][s390][u][infrastructure][sporadic][hard] test fails in bootloader_zkvm because IPs could be duplicate when jobs are started at the same time

#17 Updated by mloviska over 3 years ago

Duplicate address issue over here as well. https://openqa.suse.de/tests/1682086/file/autoinst-log.txt
Interesting that it did not fail in boot_to_desktop or bootloader_zkvm but in kdump_and_crash during reboot
https://openqa.suse.de/tests/1682086#step/kdump_and_crash/35

#18 Updated by mgriessmeier over 3 years ago

mloviska wrote:

Duplicate address issue over here as well. https://openqa.suse.de/tests/1682086/file/autoinst-log.txt

but that message appeared during boot_to_desktop, weird that it could proceed...

Interesting that it did not fail in boot_to_desktop or bootloader_zkvm but in kdump_and_crash during reboot
https://openqa.suse.de/tests/1682086#step/kdump_and_crash/35

that seems to be a different issue to be honest

#20 Updated by nicksinger over 3 years ago

We were able to build a reproduction setup on s390p8. Our approach to circumvent this is as follows:

  1. Boot the image which was created previously by a different job
  2. Break in GRUB2 and add systemd.unit=rescue.target to the kernel cmdline
    • This ensures wicked does not start before we (re)configure the network interface configuration
    • Using systemd.mask=wicked did not work for us since we could not unmask the service after we're done re-configuring
  3. Edit /etc/sysconfig/network/ifcfg-* to change the desired values, save, close
  4. Exit and continue the boot by typing exit into the systemd rescue shell

These steps ensure that the system never acquires its configured IPs when it boots up for a second time (since we can't ensure that the initial creation-worker hasn't already started another installation with the same IP).
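
Step 3 could be scripted instead of edited by hand. The following is only a sketch, assuming GNU sed and the usual KEY='value' layout of the ifcfg files; the helper name set_ifcfg_value is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: set KEY='VALUE' in an ifcfg-style file.
# Replaces an existing KEY= line, or appends one if the key is missing.
# Assumes GNU sed (-i) and that VALUE contains neither '|' nor '&'.
set_ifcfg_value() {
    local file=$1 key=$2 value=$3
    if grep -q "^${key}=" "$file"; then
        # '|' is used as the delimiter so values containing '/' are safe.
        sed -i "s|^${key}=.*|${key}='${value}'|" "$file"
    else
        printf "%s='%s'\n" "$key" "$value" >> "$file"
    fi
}
```

From the rescue shell this would be called once per value to change, e.g. "set_ifcfg_value /etc/sysconfig/network/ifcfg-eth0 IPADDR 10.161.145.13"; the file name and address here are only illustrative.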

#21 Updated by okurz over 3 years ago

That sounds like we are not really testing the normal boot flow anymore. I don't understand. Can we not just use a proper DHCP setup?

#22 Updated by mgriessmeier over 3 years ago

okurz wrote:

Can we not just use a proper DHCP setup?

needs to be clarified with SUSE-IT

#23 Updated by mgriessmeier over 3 years ago

  • Due date changed from 2018-05-22 to 2018-06-05

#24 Updated by mgriessmeier over 3 years ago

mgriessmeier wrote:

okurz wrote:

Can we not just use a proper DHCP setup?

needs to be clarified with SUSE-IT

SUSE-IT was on a training in the last two days - I will push this forward today

#25 Updated by mgriessmeier over 3 years ago

opened infra ticket: https://infra.nue.suse.com/Ticket/Display.html?id=113739&results=0862777de3082b5a89e00b7b04bf54f7

Next steps to take:

  • test if changing "ifcfg=*=" to dhcp works
  • adapt backend to write the mac address which is assigned to the worker into the xml file when generating the guest vm
  • adapt workerconf:
    • remove static IP assignment
    • add MAC address for each worker

Mac address mapping as follows:

zKVM // s390pb

s390kvm003 - 10.161.145.3 - 52:54:00:9d:f3:02
s390kvm004 - 10.161.145.4 - 52:54:00:f0:7c:91
s390kvm005 - 10.161.145.5 - 52:54:00:67:03:b0


SUSE KVM // s390p8

s390kvm013 - 10.161.145.13 - 52:54:00:9f:6b:f6
s390kvm014 - 10.161.145.14 - 52:54:00:d3:26:39
s390kvm015 - 10.161.145.15 - 52:54:00:14:ee:2e
s390kvm016 - 10.161.145.16 - 52:54:00:2f:2b:94
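
To illustrate the second step (writing the worker's MAC address into the guest XML), here is a rough sketch that looks up the MAC from the mapping above and prints a libvirt interface fragment. The function names are hypothetical, and the interface type and source network are placeholders; the actual XML generated by the backend is not shown in this ticket.

```shell
#!/bin/bash
# MAC address per worker, taken from the mapping in this comment.
mac_for_worker() {
    case "$1" in
        s390kvm003) echo "52:54:00:9d:f3:02" ;;
        s390kvm004) echo "52:54:00:f0:7c:91" ;;
        s390kvm005) echo "52:54:00:67:03:b0" ;;
        s390kvm013) echo "52:54:00:9f:6b:f6" ;;
        s390kvm014) echo "52:54:00:d3:26:39" ;;
        s390kvm015) echo "52:54:00:14:ee:2e" ;;
        s390kvm016) echo "52:54:00:2f:2b:94" ;;
        *) echo "unknown worker: $1" >&2; return 1 ;;
    esac
}

# Hypothetical helper: print an <interface> fragment with the worker's MAC.
# type='network' and network='default' are placeholders (assumptions).
interface_xml_for_worker() {
    local mac
    mac=$(mac_for_worker "$1") || return 1
    printf "<interface type='network'>\n  <mac address='%s'/>\n  <source network='default'/>\n</interface>\n" "$mac"
}
```

With a fixed MAC per worker, the DHCP server can hand out the matching address from the table, so the static IP assignment in the worker configuration is no longer needed.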

#26 Updated by mgriessmeier over 3 years ago

PR created: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/5150

needs testing for multiple scenarios especially upgrades and extratests/create_hdd

#27 Updated by mgriessmeier over 3 years ago

to add: "SetHostname=0" to Machine definitions

#28 Updated by mgriessmeier over 3 years ago

Unfortunately I've spotted again the black screens in boot_to_desktop when I was trying to verify my PR on Upgrade tests...
so this needs still a bit more investigation...
also the black screens (e.g. ssh to SUT not possible) appeared more frequently now on kernel tests... (https://progress.opensuse.org/issues/36745)

#29 Updated by mgriessmeier over 3 years ago

  • Related to action #36745: [openqa][sle][functional][u][s390x][zkvm][kernel] Broken boot due "Test died: no candidate needle with tag(s) 'password-prompt' matched" added

#30 Updated by mgriessmeier over 3 years ago

  • Due date changed from 2018-06-05 to 2018-06-19
  • Target version changed from Milestone 16 to Milestone 17

working on this in upcoming sprint

#31 Updated by mgriessmeier over 3 years ago

soo... the PR is pretty much done,
all relevant scenarios have been tested with it

but.... all the SP3 s390x qcows need to be either newly created with the DHCP config or modified by manually mounting them and replacing the corresponding file, e.g.

for i in *.qcow; do qemu-nbd -c /dev/nbd0 "$i"; mount /dev/nbd0p2 /mnt/; cp "$modified_ifcfg" /mnt/etc/sysconfig/network/ifcfg-eth0; umount /mnt/; qemu-nbd -d /dev/nbd0; done

That's the reason why I will not finish this today, because "never deploy on friday" ;)

#32 Updated by mgriessmeier over 3 years ago

  • Status changed from In Progress to Feedback

Pull requests got merged, Machine definitions got updated

retriggered job for 12-SP4 to check if it works properly on o.s.d

I've already modified the qcows to match DHCP config, but didn't upload them yet

#33 Updated by mgriessmeier over 3 years ago

  • Status changed from Feedback to Resolved

retriggered jobs for SLE12SP4:

btrfs
create_hdd_textmode
extratests_in_textmode failing in gpg, but the boot_to_desktop is working fine

also moved the adjusted SLE12SP3 and older qcows to o.s.d again

closing as resolved - if there are any issues regarding the old qcows which are used for upgrades, please let me know

#34 Updated by okurz over 3 years ago

  • Target version changed from Milestone 17 to Milestone 17
