action #26044
closed[functional][sle][s390x][u][infrastructure][sporadic][hard] test fails in bootloader_zkvm because IPs could be duplicate when jobs are started at the same time
0%
Description
Observation
openQA test in scenario sle-15-Installer-DVD-s390x-btrfs@zkvm and many more fail in
bootloader_zkvm
The problem here is that in rare cases, when we boot an existing HDD image, it can still have the same IP as a VM running on another worker (e.g. the worker where the qcow was created).
We need a better way to change the IP addresses, or even to delete them from a created qcow and set the new, correct one when the image gets booted.
Workaround
- Retrigger the job
Reproducible
Fails since (at least) Build 303.1 in many different test suites. Always the same few characters missing.
Updated by mgriessmeier over 7 years ago
The "URL shown cut off" thingy is probably not the root cause - this is just because of character limitation there.
most likely it is/was a temporary network issue:
Sending DHCP request to enc1...
no/incomplete answer.
I've restarted all jobs failing in bootloader_zkvm
and will monitor them
Updated by mgriessmeier over 7 years ago
- Status changed from New to In Progress
- Assignee set to mgriessmeier
Updated by mgriessmeier over 7 years ago
- Status changed from In Progress to Resolved
none of the 5 jobs I've retriggered is showing the issue, so I guess it was related to some network issue and probably also related to the poweroff
https://openqa.suse.de/tests/overview?distri=sle&version=15&build=303.1&groupid=110&arch=s390x&failed_modules=bootloader_zkvm
What we could consider is a better user feedback, but IMO Test died: yast didn't start
is already sufficient here
please reopen if seeing this again...
Updated by zluo almost 7 years ago
- Status changed from Resolved to Workable
this issue happened again:
https://openqa.suse.de/tests/1630483#step/bootloader_zkvm/26
re-opened this ticket.
Updated by okurz almost 7 years ago
- Subject changed from [functional][sle][s390] test fails in bootloader_zkvm because FTP url is cut off to [functional][sle][s390][u][fast] test fails in bootloader_zkvm because FTP url is cut off
- Due date set to 2018-04-24
- Target version set to Milestone 15
Updated by mgriessmeier almost 7 years ago
it's not about the "cut off FTP URL", this is just a display issue
The reason is a small time frame in which jobs starting in parallel can end up with duplicate IP addresses, because of the network rewrite we do in qcow images on s390 kvm
right now the workaround for this is to restart the job
Updated by mgriessmeier almost 7 years ago
- Subject changed from [functional][sle][s390][u][fast] test fails in bootloader_zkvm because FTP url is cut off to [functional][sle][s390][u][fast] test fails in bootloader_zkvm because IPs could be duplicate when jobs are started at the same time
- Description updated (diff)
- Priority changed from Urgent to High
Updated by mgriessmeier almost 7 years ago
- Subject changed from [functional][sle][s390][u][fast] test fails in bootloader_zkvm because IPs could be duplicate when jobs are started at the same time to [functional][sle][s390][u][infrastructure][sporadic] test fails in bootloader_zkvm because IPs could be duplicate when jobs are started at the same time
- Description updated (diff)
- Due date changed from 2018-04-24 to 2018-05-08
- Target version changed from Milestone 15 to Milestone 16
moving to next sprint, no capacity left and issue is not urgent as a workaround exists
Updated by mloviska almost 7 years ago
Another occurrence:
sle-15-Installer-DVD-s390x-Build589.1-textmode+role_textmode@s390x-kvm-sle12
Updated by okurz almost 7 years ago
That seems to be becoming more annoying:
- https://openqa.suse.de/tests/1656731/file/serial0.txt
- https://openqa.suse.de/tests/1656787/file/serial0.txt
Can we put in a check that the network was restarted without any error output and fail otherwise?
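A minimal sketch of such a check, assuming the output of the network restart has been captured as text. The function name is hypothetical, and the patterns are only the messages quoted in this ticket plus a generic `Error:` prefix, so they may need extending:

```shell
# Hypothetical helper: succeeds only if the captured output of a network
# restart contains none of the known error messages.
network_output_ok() {
    local output="$1"
    # Patterns are assumptions taken from the messages seen in this ticket.
    if printf '%s\n' "$output" | grep -qiE 'duplicate address|no/incomplete answer|^Error:'; then
        return 1
    fi
    return 0
}
```

A test module could then fail early with something like `network_output_ok "$captured_output" || exit 1`, where `$captured_output` is whatever the restart command printed.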
Updated by okurz almost 7 years ago
I am not sure if the following shows the same problem but it looked interesting at least. I triggered 10 jobs with:
$ for i in {1..10} ; do openqa_clone_job_osd --skip-chained-deps 1668775 TEST=okurz_bsc#1091186_$i BUILD=598.1:bsc1091186 _GROUP="Test Development: SLE 15" ; done
Results on
https://openqa.suse.de/tests/overview?version=15&build=598.1%3Absc1091186&distri=sle&groupid=96
and soon after the triggering 4 were running, four failed in "boot_to_desktop". This looks related.
Suggestions:
- Crosscheck the IP-addresses assigned to each machine
- Crosscheck if there is output about duplicate IP address or not
- Crosscheck timestamps
Updated by mgriessmeier almost 7 years ago
- Status changed from Workable to In Progress
Updated by mgriessmeier almost 7 years ago
okurz wrote:
I am not sure if the following shows the same problem but it looked interesting at least. I triggered 10 jobs with:
$ for i in {1..10} ; do openqa_clone_job_osd --skip-chained-deps 1668775 TEST=okurz_bsc#1091186_$i BUILD=598.1:bsc1091186 _GROUP="Test Development: SLE 15" ; done
Results on
https://openqa.suse.de/tests/overview?version=15&build=598.1%3Absc1091186&distri=sle&groupid=96
and soon after the triggering 4 were running, four failed in "boot_to_desktop". This looks related.
Suggestions:
- Crosscheck the IP-addresses assigned to each machine
- Crosscheck if there is output about duplicate IP address or not
- Crosscheck timestamps
3/4 are failing due to duplicate IP-Addresses
I'm currently working on a list of IP-Assignments for each and every worker and the IPs which are used by the tests
so suggesting to carry over to next Sprint
Updated by mgriessmeier almost 7 years ago
- Due date changed from 2018-05-08 to 2018-05-22
Updated by okurz almost 7 years ago
mgriessmeier wrote:
I'm currently working on a list of IP-Assignments for each and every worker and the IPs which are used by the tests
sounds like you are implementing a DHCP server ;)
Updated by riafarov almost 7 years ago
- Subject changed from [functional][sle][s390][u][infrastructure][sporadic] test fails in bootloader_zkvm because IPs could be duplicate when jobs are started at the same time to [functional][sle][s390][u][infrastructure][sporadic][hard] test fails in bootloader_zkvm because IPs could be duplicate when jobs are started at the same time
Updated by mloviska almost 7 years ago
Duplicate address issue over here as well. https://openqa.suse.de/tests/1682086/file/autoinst-log.txt
Interesting that it did not fail in boot_to_desktop or bootloader_zkvm but in kdump_and_crash during reboot
https://openqa.suse.de/tests/1682086#step/kdump_and_crash/35
Updated by mgriessmeier almost 7 years ago
mloviska wrote:
Duplicate address issue over here as well. https://openqa.suse.de/tests/1682086/file/autoinst-log.txt
but that message appeared during boot_to_desktop, weird that it could proceed...
Interesting that it did not fail in boot_to_desktop or bootloader_zkvm but in kdump_and_crash during reboot
https://openqa.suse.de/tests/1682086#step/kdump_and_crash/35
that seems to be a different issue to be honest
Updated by mloviska almost 7 years ago
- Subject changed from [functional][sle][s390][u][infrastructure][sporadic][hard] test fails in bootloader_zkvm because IPs could be duplicate when jobs are started at the same time to [functional][sle][s390x][u][infrastructure][sporadic][hard] test fails in bootloader_zkvm because IPs could be duplicate when jobs are started at the same time
Error: eth0: IPv4 duplicate address 10.161.145.16 detected (in use by 52:54:00:e0:b4:d6)!
- sle-15-Installer-DVD-s390x-Build628.5-addon-module-ftp@s390x-kvm-sle12
- sle-15-Installer-DVD-s390x-Build628.5-allpatterns@s390x-kvm-sle12
- sle-15-Installer-DVD-s390x-Build628.5-btrfs@s390x-kvm-sle12
- sle-15-Installer-DVD-s390x-Build628.5-btrfs+warnings@s390x-kvm-sle12
- sle-15-Installer-DVD-s390x-Build628.5-create_hdd_gnome@s390x-kvm-sle12
- sle-15-Installer-DVD-s390x-Build628.5-ext4@s390x-kvm-sle12
- sle-15-Installer-DVD-s390x-Build628.5-lvm-encrypt-separate-boot@s390x-kvm-sle12
- sle-15-Installer-DVD-s390x-Build628.5-minimal+base@s390x-kvm-sle12
- sle-15-Installer-DVD-s390x-Build628.5-minimal+role_minimal@s390x-kvm-sle12
- sle-15-Installer-DVD-s390x-Build628.5-skip_registration@s390x-kvm-sle1
- sle-15-Installer-DVD-s390x-Build628.5-skip_registration+workaround_modules@s390x-kvm-sle12
- sle-15-Installer-DVD-s390x-Build628.5-xfs@s390x-kvm-sle12
- sle-15-Installer-DVD-s390x-Build628.5-yast_no_self_update@s390x-kvm-sle12
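To crosscheck which address collided and who was holding it, the message above can be pulled out of a log. This is only a sketch: the function name is hypothetical and the pattern assumes the exact message format quoted above.

```shell
# Hypothetical helper: reads log text on stdin and prints "<ip> <mac>" for
# every duplicate-address message it finds. Other lines are ignored.
extract_duplicates() {
    sed -nE 's/.*IPv4 duplicate address ([0-9.]+) detected \(in use by ([0-9a-fA-F:]+)\).*/\1 \2/p'
}
```

For example, `extract_duplicates < serial0.txt` on a downloaded serial log would list each colliding IP together with the MAC address that was already using it.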
Updated by nicksinger almost 7 years ago
We could build a reproduction setup on s390p8. Our approach to circumvent this is the following:
- Boot the image which was created previously by a different job
- Break in GRUB2 and add systemd.unit=rescue.target to the kernel cmdline
  - This ensures wicked does not start before we (re)configure the network interface configuration
  - Using systemd.mask=wicked did not work for us since we could not unmask the service after we're done re-configuring
- Edit /etc/sysconfig/network/ifcfg-* to change the desired values, save, close
- Exit and continue the boot by typing exit into the systemd rescue shell
These steps ensure that the system never acquires its configured IPs when it boots up for a second time (since we can't ensure that the initial creation-worker hasn't already started another installation with the same IP).
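The "edit ifcfg-*" step could be scripted roughly as below. This is a sketch under assumptions: the function name is hypothetical, and it assumes the file carries a single IPADDR= line, as in a typical static ifcfg file.

```shell
# Hypothetical helper: replaces the IPADDR line of an ifcfg file.
# Arguments: path to the ifcfg file, new IP address.
set_ifcfg_ipaddr() {
    local file="$1" new_ip="$2" tmp
    tmp=$(mktemp) || return 1
    # Rewrite into a temp file first, then copy back so the original
    # file keeps its ownership and permissions.
    sed -E "s|^IPADDR=.*|IPADDR='${new_ip}'|" "$file" > "$tmp" \
        && cat "$tmp" > "$file"
    local rc=$?
    rm -f "$tmp"
    return $rc
}
```

In the rescue shell this would be called as e.g. `set_ifcfg_ipaddr /etc/sysconfig/network/ifcfg-eth0 10.161.145.13` before typing exit.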
Updated by okurz over 6 years ago
That sounds like we are not really testing the normal boot flow anymore. I don't understand. Can we not just use a proper DHCP setup?
Updated by mgriessmeier over 6 years ago
okurz wrote:
Can we not just use a proper DHCP setup?
needs to be clarified with SUSE-IT
Updated by mgriessmeier over 6 years ago
- Due date changed from 2018-05-22 to 2018-06-05
Updated by mgriessmeier over 6 years ago
mgriessmeier wrote:
okurz wrote:
Can we not just use a proper DHCP setup?
needs to be clarified with SUSE-IT
SUSE-IT was on a training in the last two days - I will push this forward today
Updated by mgriessmeier over 6 years ago
opened infra ticket: https://infra.nue.suse.com/Ticket/Display.html?id=113739&results=0862777de3082b5a89e00b7b04bf54f7
Next steps to take:
- test if changing "ifcfg=*=" to dhcp works
- adapt backend to write the mac address which is assigned to the worker into the xml file when generating the guest vm
- adapt workerconf:
- remove static IP assignment
- add MAC address for each worker
Mac address mapping as follows:
zKVM // s390pb
s390kvm003 - 10.161.145.3 - 52:54:00:9d:f3:02
s390kvm004 - 10.161.145.4 - 52:54:00:f0:7c:91
s390kvm005 - 10.161.145.5 - 52:54:00:67:03:b0
SUSE KVM // s390p8
s390kvm013 - 10.161.145.13 - 52:54:00:9f:6b:f6
s390kvm014 - 10.161.145.14 - 52:54:00:d3:26:39
s390kvm015 - 10.161.145.15 - 52:54:00:14:ee:2e
s390kvm016 - 10.161.145.16 - 52:54:00:2f:2b:94
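As an illustration of the "write the MAC address into the xml file" step, the mapping above can be turned into a small lookup. The function names are hypothetical, and the `<mac address='...'/>` element is the libvirt domain XML form; how the backend actually generates the guest XML is not shown in this ticket.

```shell
# Hypothetical lookup built from the mapping above: prints the MAC address
# assigned to a worker VM, or fails for an unknown name.
worker_mac() {
    case "$1" in
        s390kvm003) echo "52:54:00:9d:f3:02" ;;
        s390kvm004) echo "52:54:00:f0:7c:91" ;;
        s390kvm005) echo "52:54:00:67:03:b0" ;;
        s390kvm013) echo "52:54:00:9f:6b:f6" ;;
        s390kvm014) echo "52:54:00:d3:26:39" ;;
        s390kvm015) echo "52:54:00:14:ee:2e" ;;
        s390kvm016) echo "52:54:00:2f:2b:94" ;;
        *) return 1 ;;
    esac
}

# Prints a libvirt-style <mac> element for the given worker VM.
worker_mac_xml() {
    local mac
    mac=$(worker_mac "$1") || return 1
    printf "<mac address='%s'/>\n" "$mac"
}
```

With a fixed MAC per worker VM, the DHCP server can always hand out the same IP to that VM, so the image itself no longer needs to carry a static address.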
Updated by mgriessmeier over 6 years ago
PR created: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/5150
needs testing for multiple scenarios especially upgrades and extratests/create_hdd
Updated by mgriessmeier over 6 years ago
to add: "SetHostname=0" to Machine definitions
Updated by mgriessmeier over 6 years ago
Unfortunately I've spotted again the black screens in boot_to_desktop when I was trying to verify my PR on Upgrade tests...
so this needs still a bit more investigation...
also the black screens (e.g. ssh to SUT not possible) appeared more frequently now on kernel tests... (https://progress.opensuse.org/issues/36745)
Updated by mgriessmeier over 6 years ago
- Related to action #36745: [openqa][sle][functional][u][s390x][zkvm][kernel] Broken boot due "Test died: no candidate needle with tag(s) 'password-prompt' matched" added
Updated by mgriessmeier over 6 years ago
- Due date changed from 2018-06-05 to 2018-06-19
- Target version changed from Milestone 16 to Milestone 17
working on this in upcoming sprint
Updated by mgriessmeier over 6 years ago
soo... the PR is pretty much done,
all relevant scenarios have been tested with it
but.... all the sp3 s390x qcows need to be either created anew with the dhcp config, or modified by manually mounting them and replacing the corresponding file with e.g.
for i in *.qcow; do qemu-nbd -c /dev/nbd0 "$i"; mount /dev/nbd0p2 /mnt/; cp "$modified_ifcfg" /mnt/etc/sysconfig/network/ifcfg-eth0; umount /mnt/; qemu-nbd -d /dev/nbd0; done
That's the reason why I will not finish this today, because "never deploy on friday" ;)
Updated by mgriessmeier over 6 years ago
- Status changed from In Progress to Feedback
Pull requests got merged, Machine definitions got updated
retriggered job for 12-SP4 to check if it works properly on o.s.d
I've already modified the qcows to match DHCP config, but didn't upload them yet
Updated by mgriessmeier over 6 years ago
- Status changed from Feedback to Resolved
retriggered jobs for SLE12SP4:
btrfs
create_hdd_textmode
extratests_in_textmode failing in gpg, but the boot_to_desktop is working fine
also move adjusted SLE12SP3 and older qcows to o.s.d again
closing as resolved - if there are any issues regarding the old qcows which are used for upgrades, please let me know
Updated by okurz over 6 years ago
- Target version changed from Milestone 17 to Milestone 17