action #26044: [functional][sle][s390x][u][infrastructure][sporadic][hard] test fails in bootloader_zkvm because IPs could be duplicate when jobs are started at the same time - openQA Tests - openSUSE Project Management Tool

action #26044

closed

[functional][sle][s390x][u][infrastructure][sporadic][hard] test fails in bootloader_zkvm because IPs could be duplicate when jobs are started at the same time

Added by nicksinger over 6 years ago. Updated about 6 years ago.

Status:

Resolved

Priority:

High

Assignee:

mgriessmeier

Category:

Bugs in existing tests

Target version:

SUSE QA - Milestone 17

Start date:

2017-10-13

Due date:

2018-06-19

Description

Observation¶

openQA test in scenario sle-15-Installer-DVD-s390x-btrfs@zkvm and many more fail in
bootloader_zkvm

The problem is that, in rare cases, a booted hdd image still has the same IP address as a VM running on another worker (e.g. the worker where the qcow image was created).
We need a better way to change the IP addresses, or even delete them from a created qcow image and set the new, correct address when the image is booted.

Workaround¶

Retrigger the job

Reproducible¶

Fails since (at least) Build 303.1 in many different test suites. Always the same few characters missing.

Updated by mgriessmeier over 6 years ago

The "URL shown cut off" thingy is probably not the root cause - this is just because of character limitation there.

most likely it is/was a temporary network issue:

Sending DHCP request to enc1...
no/incomplete answer.

I've restarted all jobs failing in bootloader_zkvm and will monitor them

Updated by mgriessmeier over 6 years ago

Status changed from New to In Progress
Assignee set to mgriessmeier

Updated by mgriessmeier over 6 years ago

Status changed from In Progress to Resolved

none of the 5 jobs I've retriggered is showing the issue, so I guess it was related to some network issue and probably also related to the poweroff
https://openqa.suse.de/tests/overview?distri=sle&version=15&build=303.1&groupid=110&arch=s390x&failed_modules=bootloader_zkvm

What we could consider is better user feedback, but IMO "Test died: yast didn't start" is already sufficient here
please reopen if seeing this again...

Updated by zluo about 6 years ago

Status changed from Resolved to Workable

this issue happened again:
https://openqa.suse.de/tests/1630483#step/bootloader_zkvm/26

re-opened this ticket.

Updated by okurz about 6 years ago

Subject changed from [functional][sle][s390] test fails in bootloader_zkvm because FTP url is cut off to [functional][sle][s390][u][fast] test fails in bootloader_zkvm because FTP url is cut off
Due date set to 2018-04-24
Target version set to Milestone 15

Updated by mgriessmeier about 6 years ago

It's not about the "cut off FTP URL"; that is just a display issue.
The actual cause: when jobs start in parallel, there is a small time window in which duplicate IP addresses can occur, because of the network rewrite we do in qcow images on s390 kvm.

right now the workaround for this is to restart the job

Updated by mgriessmeier about 6 years ago

Subject changed from [functional][sle][s390][u][fast] test fails in bootloader_zkvm because FTP url is cut off to [functional][sle][s390][u][fast] test fails in bootloader_zkvm because IPs could be duplicate when jobs are started at the same time
Description updated (diff)
Priority changed from Urgent to High

Updated by mgriessmeier about 6 years ago

Subject changed from [functional][sle][s390][u][fast] test fails in bootloader_zkvm because IPs could be duplicate when jobs are started at the same time to [functional][sle][s390][u][infrastructure][sporadic] test fails in bootloader_zkvm because IPs could be duplicate when jobs are started at the same time
Description updated (diff)
Due date changed from 2018-04-24 to 2018-05-08
Target version changed from Milestone 15 to Milestone 16

moving to next sprint, no capacity left and issue is not urgent as a workaround exists

Updated by mloviska about 6 years ago

Another occurrence:
sle-15-Installer-DVD-s390x-Build589.1-textmode+role_textmode@s390x-kvm-sle12

#10

Updated by okurz about 6 years ago

That seems to be becoming more annoying:

Can we put in a check that the network was restarted without any error output and fail otherwise?

#11

Updated by okurz about 6 years ago

I am not sure if the following shows the same problem but it looked interesting at least. I triggered 10 jobs with:

$ for i in {1..10} ; do openqa_clone_job_osd --skip-chained-deps 1668775 TEST=okurz_bsc#1091186_$i BUILD=598.1:bsc1091186 _GROUP="Test Development: SLE 15" ; done

Results on
https://openqa.suse.de/tests/overview?version=15&build=598.1%3Absc1091186&distri=sle&groupid=96

and soon after the triggering 4 were running, four failed in "boot_to_desktop". This looks related.

Suggestions:

Crosscheck the IP-addresses assigned to each machine
Crosscheck if there is output about duplicate IP address or not
Crosscheck timestamps
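
The first crosscheck could be sketched roughly as below. This is only an illustration, not what openQA does: the job names and the snapshot of assignments are hypothetical, and the addresses are just taken from the 10.161.145.x range seen in this ticket.

```python
from collections import defaultdict


def find_duplicate_ips(assignments):
    """Return {ip: sorted machines} for every IP claimed by more than one machine."""
    by_ip = defaultdict(list)
    for machine, ip in assignments.items():
        by_ip[ip].append(machine)
    return {ip: sorted(machines) for ip, machines in by_ip.items() if len(machines) > 1}


# Hypothetical snapshot of jobs running at the same time (job name -> IP in use).
running = {
    "job_1": "10.161.145.13",
    "job_2": "10.161.145.14",
    "job_3": "10.161.145.13",
}

print(find_duplicate_ips(running))  # {'10.161.145.13': ['job_1', 'job_3']}
```

Combining such a snapshot with the job start timestamps would show whether the clashes only happen for jobs started within the same short window.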

#12

Updated by mgriessmeier about 6 years ago

Status changed from Workable to In Progress

#13

Updated by mgriessmeier about 6 years ago

okurz wrote:

I am not sure if the following shows the same problem but it looked interesting at least. I triggered 10 jobs with:
$ for i in {1..10} ; do openqa_clone_job_osd --skip-chained-deps 1668775 TEST=okurz_bsc#1091186_$i BUILD=598.1:bsc1091186 _GROUP="Test Development: SLE 15" ; done
Results on
https://openqa.suse.de/tests/overview?version=15&build=598.1%3Absc1091186&distri=sle&groupid=96

and soon after the triggering 4 were running, four failed in "boot_to_desktop". This looks related.

Suggestions:

Crosscheck the IP-addresses assigned to each machine

Crosscheck if there is output about duplicate IP address or not

Crosscheck timestamps

3/4 are failing due to duplicate IP-Addresses
I'm currently working on a list of IP-Assignments for each and every worker and the IPs which are used by the tests
so suggesting to carry over to next Sprint

#14

Updated by mgriessmeier about 6 years ago

Due date changed from 2018-05-08 to 2018-05-22

#15

Updated by okurz about 6 years ago

mgriessmeier wrote:

I'm currently working on a list of IP-Assignments for each and every worker and the IPs which are used by the tests

sounds like you are implementing a DHCP server ;)

#16

Updated by riafarov about 6 years ago

Subject changed from [functional][sle][s390][u][infrastructure][sporadic] test fails in bootloader_zkvm because IPs could be duplicate when jobs are started at the same time to [functional][sle][s390][u][infrastructure][sporadic][hard] test fails in bootloader_zkvm because IPs could be duplicate when jobs are started at the same time

#17

Updated by mloviska about 6 years ago

Duplicate address issue over here as well. https://openqa.suse.de/tests/1682086/file/autoinst-log.txt
Interesting that it did not fail in boot_to_desktop or bootloader_zkvm but in kdump_and_crash during reboot
https://openqa.suse.de/tests/1682086#step/kdump_and_crash/35

#18

Updated by mgriessmeier about 6 years ago

mloviska wrote:

Duplicate address issue over here as well. https://openqa.suse.de/tests/1682086/file/autoinst-log.txt

but that message appeared during boot_to_desktop, weird that it could proceed...

Interesting that it did not fail in boot_to_desktop or bootloader_zkvm but in kdump_and_crash during reboot
https://openqa.suse.de/tests/1682086#step/kdump_and_crash/35

that seems to be a different issue to be honest

#19

Updated by mloviska about 6 years ago

Subject changed from [functional][sle][s390][u][infrastructure][sporadic][hard] test fails in bootloader_zkvm because IPs could be duplicate when jobs are started at the same time to [functional][sle][s390x][u][infrastructure][sporadic][hard] test fails in bootloader_zkvm because IPs could be duplicate when jobs are started at the same time

Error: eth0: IPv4 duplicate address 10.161.145.16 detected (in use by 52:54:00:e0:b4:d6)!
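
A check for this message could look roughly like the sketch below, e.g. to fail a job early when the network comes up with a clashing address. The regular expression is derived only from the single log line quoted above, so it is an assumption that other occurrences use the same wording; the function name is made up for illustration.

```python
import re

# Pattern modelled on the one log line quoted in this ticket (assumed format).
DUPLICATE_RE = re.compile(
    r"(?P<iface>\S+): IPv4 duplicate address (?P<ip>\d{1,3}(?:\.\d{1,3}){3}) "
    r"detected \(in use by (?P<mac>(?:[0-9a-f]{2}:){5}[0-9a-f]{2})\)",
    re.IGNORECASE,
)


def find_duplicate_address_reports(log_text):
    """Return one dict (iface, ip, mac) per duplicate-address message in log_text."""
    return [match.groupdict() for match in DUPLICATE_RE.finditer(log_text)]


log = "Error: eth0: IPv4 duplicate address 10.161.145.16 detected (in use by 52:54:00:e0:b4:d6)!"
for report in find_duplicate_address_reports(log):
    print(report["iface"], report["ip"], report["mac"])
```

An empty result means no such message was found, which by itself does not prove the network came up correctly.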

#20

Updated by nicksinger about 6 years ago

We could build a reproducer setup on s390p8. Our approach to circumvent this is as follows:

Boot the image which was created previously by a different job
Break in GRUB2 and add systemd.unit=rescue.target to the kernel cmdline
- This ensures wicked does not start before we (re)write the network interface configuration
- Using systemd.mask=wicked did not work for us since we could not unmask the service after we're done re-configuring
Edit /etc/sysconfig/network/ifcfg-* to change the desired values, save, close
Exit and continue the boot by typing exit into the systemd rescue shell

These steps ensure that the system never acquires its configured IPs when it boots up for a second time (since we can't ensure that the initial creation-worker hasn't already started another installation with the same IP).
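
The "edit /etc/sysconfig/network/ifcfg-*" step could be scripted along these lines instead of being done by hand. This is a minimal sketch that only rewrites the text of such a file; the helper name and the sample file content are hypothetical, and it assumes the simple KEY='value' layout of sysconfig files.

```python
def set_ifcfg_values(ifcfg_text, updates):
    """Return ifcfg_text with the given KEY='value' settings replaced or appended."""
    remaining = dict(updates)
    out = []
    for line in ifcfg_text.splitlines():
        key = line.split("=", 1)[0].strip()
        is_setting = "=" in line and not line.lstrip().startswith("#")
        if is_setting and key in remaining:
            # Replace the existing setting with the new value.
            out.append(f"{key}='{remaining.pop(key)}'")
        else:
            out.append(line)
    # Settings that were not present yet are appended at the end.
    for key, value in remaining.items():
        out.append(f"{key}='{value}'")
    return "\n".join(out) + "\n"


# Hypothetical content of an ifcfg file left behind by the job that created the image.
original = "BOOTPROTO='static'\nIPADDR='10.161.145.13'\nSTARTMODE='auto'\n"
print(set_ifcfg_values(original, {"IPADDR": "10.161.145.14"}))
```

The same helper could set BOOTPROTO to dhcp, which is the direction the later comments in this ticket take.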

#21

Updated by okurz about 6 years ago

That sounds like we are not really testing the normal boot flow anymore. I don't understand. Can we not just use a proper DHCP setup?

#22

Updated by mgriessmeier about 6 years ago

okurz wrote:

Can we not just use a proper DHCP setup?

needs to be clarified with SUSE-IT

#23

Updated by mgriessmeier about 6 years ago

Due date changed from 2018-05-22 to 2018-06-05

#24

Updated by mgriessmeier about 6 years ago

mgriessmeier wrote:

okurz wrote:

Can we not just use a proper DHCP setup?

needs to be clarified with SUSE-IT

SUSE-IT was on a training in the last two days - I will push this forward today

#25

Updated by mgriessmeier about 6 years ago

opened infra ticket: https://infra.nue.suse.com/Ticket/Display.html?id=113739&results=0862777de3082b5a89e00b7b04bf54f7

Next steps to take:

test if changing "ifcfg=*=" to dhcp works
adapt backend to write the mac address which is assigned to the worker into the xml file when generating the guest vm
adapt workerconf:
- remove static IP assignment
- add MAC address for each worker

Mac address mapping as follows:

zKVM // s390pb

s390kvm003 - 10.161.145.3 - 52:54:00:9d:f3:02
s390kvm004 - 10.161.145.4 - 52:54:00:f0:7c:91
s390kvm005 - 10.161.145.5 - 52:54:00:67:03:b0


SUSE KVM // s390p8

s390kvm013 - 10.161.145.13 - 52:54:00:9f:6b:f6
s390kvm014 - 10.161.145.14 - 52:54:00:d3:26:39
s390kvm015 - 10.161.145.15 - 52:54:00:14:ee:2e
s390kvm016 - 10.161.145.16 - 52:54:00:2f:2b:94
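
Writing the worker's MAC address into the guest XML could look roughly like this. It is a sketch only and not the actual backend code: the function name and the default interface type are assumptions, and a real libvirt interface definition needs further elements (source, model) that are left out here.

```python
import xml.etree.ElementTree as ET

# Worker -> MAC address, taken from the mapping listed above.
WORKER_MACS = {
    "s390kvm003": "52:54:00:9d:f3:02",
    "s390kvm004": "52:54:00:f0:7c:91",
    "s390kvm005": "52:54:00:67:03:b0",
    "s390kvm013": "52:54:00:9f:6b:f6",
    "s390kvm014": "52:54:00:d3:26:39",
    "s390kvm015": "52:54:00:14:ee:2e",
    "s390kvm016": "52:54:00:2f:2b:94",
}


def interface_xml(worker, interface_type="direct"):
    """Build an <interface> fragment carrying the MAC address assigned to worker.

    interface_type is an assumption; use whatever the real guest definition needs.
    """
    iface = ET.Element("interface", {"type": interface_type})
    ET.SubElement(iface, "mac", {"address": WORKER_MACS[worker]})
    return ET.tostring(iface, encoding="unicode")


print(interface_xml("s390kvm016"))
```

With a fixed MAC per worker, the DHCP server can hand out the matching address from the list above, so the static IP no longer has to be stored in the image.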

#26

Updated by mgriessmeier about 6 years ago

PR created: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/5150

needs testing for multiple scenarios especially upgrades and extratests/create_hdd

#27

Updated by mgriessmeier about 6 years ago

to add: "SetHostname=0" to Machine definitions

#28

Updated by mgriessmeier about 6 years ago

Unfortunately I've again spotted the black screens in boot_to_desktop while trying to verify my PR on upgrade tests...
so this still needs a bit more investigation...
also the black screens (e.g. ssh to SUT not possible) now appear more frequently on kernel tests... (https://progress.opensuse.org/issues/36745)

#29

Updated by mgriessmeier about 6 years ago

Related to action #36745: [openqa][sle][functional][u][s390x][zkvm][kernel] Broken boot due "Test died: no candidate needle with tag(s) 'password-prompt' matched" added

#30

Updated by mgriessmeier about 6 years ago

Due date changed from 2018-06-05 to 2018-06-19
Target version changed from Milestone 16 to Milestone 17

working on this in upcoming sprint

#31

Updated by mgriessmeier about 6 years ago

soo... the PR is pretty much done,
all relevant scenarios have been tested with it

but.... all the SP3 s390x qcows need to be either newly created with the DHCP config or modified by manually mounting them and changing the corresponding file, e.g. with

for i in *.qcow; do qemu-nbd -c /dev/nbd0 "$i"; mount /dev/nbd0p2 /mnt/; cp "$modified_ifcfg" /mnt/etc/sysconfig/network/ifcfg-eth0; umount /mnt/; qemu-nbd -d /dev/nbd0; done

That's the reason why I will not finish this today, because "never deploy on friday" ;)

#32

Updated by mgriessmeier about 6 years ago

Status changed from In Progress to Feedback

Pull requests got merged, Machine definitions got updated

retriggered job for 12-SP4 to check if it works properly on o.s.d

I've already modified the qcows to match DHCP config, but didn't upload them yet

#33

Updated by mgriessmeier about 6 years ago

Status changed from Feedback to Resolved

retriggered jobs for SLE12SP4:

btrfs
create_hdd_textmode
extratests_in_textmode failing in gpg, but the boot_to_desktop is working fine

also move adjusted SLE12SP3 and older qcows to o.s.d again

closing as resolved - if there are any issues regarding the old qcows which are used for upgrades, please let me know

#34

Updated by okurz about 6 years ago

Target version changed from Milestone 17 to Milestone 17
