action #113528
[qe-core] test fails in bootloader_zkvm - performance degradation in the s390 network is causing serial console to be unreliable (and killing jobs slowly)
Status: closed
Description
Observation
openQA test in scenario sle-15-SP3-Server-DVD-Updates-s390x-mru-install-minimal-with-addons@s390x-kvm-sle12 fails in
bootloader_zkvm
Test suite description
Testsuite maintained at https://gitlab.suse.de/qa-maintenance/qam-openqa-yml.
Reproducible
Fails since (at least) Build 20220712-1
Expected result
Last good: 20220711-1 (or more recent)
Suggestions
Copy a worker config from grenache to openqaworker2 as Matthi suggested, give it a different name, and clone a job using the new worker class, then monitor it.
Updated by rfan1 over 2 years ago
- Subject changed from [qe-core]test fails in bootloader_zkvm, some performance issue with worker class 's390x-kvm-sle12'? to [qe-core]test fails in bootloader_zkvm, any network performance issue with worker class 's390x-kvm-sle12'?
It seems we had some issues downloading the 'initrd' file, especially on SUT 'zp18'.
Updated by rfan1 over 2 years ago
I tried adding some extra wait time (a longer assert_screen timeout):
diff --git a/tests/installation/bootloader_zkvm.pm b/tests/installation/bootloader_zkvm.pm
index 76c5a5c8b..b4ce9446f 100644
--- a/tests/installation/bootloader_zkvm.pm
+++ b/tests/installation/bootloader_zkvm.pm
@@ -50,7 +50,7 @@ sub set_svirt_domain_elements {
# show this on screen and make sure that kernel and initrd are actually saved
enter_cmd "wget $repo/boot/s390x/initrd -O $zkvm_img_path/$name.initrd";
- assert_screen "initrd-saved";
+ assert_screen "initrd-saved", 90;
enter_cmd "wget $repo/boot/s390x/linux -O $zkvm_img_path/$name.kernel";
assert_screen "kernel-saved";
}
However, it seems later steps were still slow: https://openqa.suse.de/tests/9121775#step/bootloader_zkvm/37
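An alternative sketch (ZKVM_DOWNLOAD_TIMEOUT is a hypothetical job setting, not something the test currently reads) would be to make the timeout tunable per job instead of hard-coding 90 seconds, so slow runs can be retried with a larger value without another code change:

# Sketch only - drop-in for the same spot in set_svirt_domain_elements;
# $repo, $zkvm_img_path and $name come from the surrounding sub, and
# ZKVM_DOWNLOAD_TIMEOUT is a hypothetical setting defaulting to 90 seconds.
my $download_timeout = get_var('ZKVM_DOWNLOAD_TIMEOUT', 90);
enter_cmd "wget $repo/boot/s390x/initrd -O $zkvm_img_path/$name.initrd";
assert_screen 'initrd-saved', $download_timeout;
enter_cmd "wget $repo/boot/s390x/linux -O $zkvm_img_path/$name.kernel";
assert_screen 'kernel-saved', $download_timeout;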
Updated by mgriessmeier over 2 years ago
FWIW: a manual installation on s390zp18 works just fine, without any noticeable performance impact.
Updated by mgriessmeier over 2 years ago
So the issue looks to me like the SUT is either not sending data over serial to openQA, or openQA cannot receive this data.
When checking manually, the serial output openQA is waiting for is present.
Updated by okurz over 2 years ago
- Priority changed from Normal to Urgent
Bumping to urgent as it's affecting QA SLE maintenance tests, e.g. https://openqa.suse.de/tests/9113784#next_previous
Updated by jlausuch over 2 years ago
It also affects the SLE Micro 5.3 Alpha 2 release, as we can't do validation on this arch.
Updated by geor over 2 years ago
From my side, after playing around, it looks like there is a significant delay in getting the serial output; the most success I have had was with this job, and the latest successful result is this one.
It does look like something goes through sometimes, but I really cannot determine any factor that affects the success rate.
Anyway, I'll check a bit more tomorrow, hoping that it fixes itself.
Updated by rfan1 over 2 years ago
https://openqa.suse.de/tests/9125700/logfile?filename=autoinst-log.txt
The issue is still there.
[2022-07-14T03:26:50.910940+02:00] [debug] no match: 6.9s, best candidate: bootloader_zkvm-initrd-saved-20160607 (0.00)
[2022-07-14T03:26:51.981234+02:00] [debug] no match: 5.9s, best candidate: bootloader_zkvm-initrd-saved-20160607 (0.00)
[2022-07-14T03:26:52.913105+02:00] [debug] no match: 4.9s, best candidate: bootloader_zkvm-initrd-saved-20160607 (0.00)
[2022-07-14T03:26:53.914215+02:00] [debug] no match: 3.9s, best candidate: bootloader_zkvm-initrd-saved-20160607 (0.00)
[2022-07-14T03:26:54.915226+02:00] [debug] no match: 2.9s, best candidate: bootloader_zkvm-initrd-saved-20160607 (0.00)
[2022-07-14T03:26:55.916294+02:00] [debug] no match: 1.9s, best candidate: bootloader_zkvm-initrd-saved-20160607 (0.00)
[2022-07-14T03:26:56.986691+02:00] [debug] no match: 0.9s, best candidate: bootloader_zkvm-initrd-saved-20160607 (0.00)
[2022-07-14T03:26:57.987640+02:00] [debug] no match: -0.1s, best candidate: bootloader_zkvm-initrd-saved-20160607 (0.00)
[2022-07-14T03:26:58.247167+02:00] [debug] >>> testapi::_check_backend_response: match=initrd-saved timed out after 30 (assert_screen)
[2022-07-14T03:26:58.314293+02:00] [info] ::: basetest::runtest: # Test died: no candidate needle with tag(s) 'initrd-saved' matched
[2022-07-14T03:26:58.315794+02:00] [debug] tests/installation/bootloader_zkvm.pm:96 called testapi::select_console
[2022-07-14T03:26:58.315920+02:00] [debug] <<< testapi::select_console(testapi_console="svirt")
[2022-07-14T03:26:58.321021+02:00] [debug] Establishing VNC connection to localhost:47135
XIO: fatal IO error 11 (Resource temporarily unavailable) on X server ":35331"
after 28809 requests (28652 known processed) with 0 events remaining.
xterm: fatal IO error 11 (Resource temporarily unavailable) or KillClient on X server ":35331"
[2022-07-14T03:26:58.333696+02:00] [info] ::: backend::driver::__ANON__: Driver backend collected unknown process with pid 491857 and exit status: 1
[2022-07-14T03:26:58.334038+02:00] [info] ::: backend::driver::__ANON__: Driver backend collected unknown process with pid 491866 and exit status: 0
[2022-07-14T03:26:58.334248+02:00] [info] ::: backend::driver::__ANON__: Driver backend collected unknown process with pid 491859 and exit status: 84
Updated by szarate over 2 years ago
https://openqa.suse.de/tests/9126241/logfile?filename=autoinst-log.txt
Somebody changed the machine definition of s390x-kvm-sle12:
Settings key '_now_in_workerconf_NETDEV' is invalid at /usr/lib/os-autoinst/bmwqemu.pm line 90.
bmwqemu::load_vars() called at /usr/bin/isotovideo line 395
at /usr/lib/os-autoinst/bmwqemu.pm line 313.
bmwqemu::tiedvars::STORE(bmwqemu::tiedvars=HASH(0x1003d7bbe78), "_now_in_workerconf_NETDEV", "eth2") called at /usr/lib/os-autoinst/bmwqemu.pm line 90
bmwqemu::load_vars() called at /usr/bin/isotovideo line 395
But that wasn't the issue.
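For context, a rough illustration of why the key is rejected (the exact validation regex is an assumption, not quoted from bmwqemu.pm): settings keys are expected to be upper-case identifiers, so the renamed '_now_in_workerconf_NETDEV' fails the check while a plain NETDEV would pass.

# Rough illustration only - the regex is an assumed approximation of the
# bmwqemu.pm settings-key validation, not the actual code.
use strict;
use warnings;

for my $key ('_now_in_workerconf_NETDEV', 'NETDEV') {
    printf "%-27s => %s\n", $key, ($key =~ /^[A-Z0-9_]+$/ ? 'accepted' : 'rejected');
}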
Updated by szarate over 2 years ago
- Related to coordination #58184: [saga][epic][use case] full version control awareness within openQA added
Updated by szarate over 2 years ago
- Sprint set to QE-Core: July Sprint (Jul 06 - Aug 03)
- Tags changed from qe-core-coverage, bugbusters to bugbusters
- Subject changed from [qe-core]test fails in bootloader_zkvm, any network performance issue with worker class 's390x-kvm-sle12'? to [qe-core] test fails in bootloader_zkvm - invalid test setting defined for s390x-kvm-sle12
- Status changed from New to Feedback
- Assignee set to szarate
[6692719.947536] Out of memory: Kill process 152100 (qemu-system-s39) score 120 or sacrifice child
[6692719.947558] Killed process 152100 (qemu-system-s39) total-vm:8013396kB, anon-rss:4188940kB, file-rss:2864kB, shmem-rss:0kB
Just rebooted the machine, I hope that fixes the issue?...
Updated by mgriessmeier over 2 years ago
To add my findings here about the apparently lost serial output, which I can reproduce locally:
Following the approach openQA uses to grab the serial output, I ran ssh -t root@s390zp18 virsh console openQA-SUT-7-113528.
When I send data to the SUT via echo "foobar" > /dev/pts/<pty>,
it appears on the ssh'ed serial output.
This works fine when I do it from my notebook or a worker outside the .qa.suse.de subnet. However, from grenache-1 and other machines in the qa subnet, I observed that the output doesn't get through or is delayed.
Furthermore, I noticed that my ssh connection to grenache-1 is hanging from time to time, so I did some investigation into network performance.
mtr showed that there is another network performance issue between 10.162.x.x and 10.161.x.x (which is the s390 network), most likely related to the serial output redirection (since it is done over ssh):
from grenache-1 to s390zp[18,19] I got packet loss of about 50%
from grenache-1 to z13router it was about 20%
from openqaworker2.suse.de or my notebook to the 10.161.x.x subnet everything was fine
Furthermore, I noticed that qanet.qa.suse.de showed 100% packet loss to 10.161.x.x.
I pinged Gerhard; he is also looking into this, and I will create an enginfra ticket later on if we cannot solve it in a timely manner.
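The measurements above could be reproduced with something like the following (a sketch only; the host list, domain suffix and mtr options are assumptions, not taken from this ticket):

# Sketch of the packet-loss measurement described above - run it from
# grenache-1, openqaworker2 etc. and compare the Loss% column per hop.
use strict;
use warnings;

for my $host (qw(s390zp18.suse.de s390zp19.suse.de z13router.suse.de)) {
    print "--- mtr to $host ---\n";
    # -r: report mode, -n: no DNS resolution, -c 100: 100 probes per hop
    system('mtr', '-r', '-n', '-c', '100', $host) == 0
        or warn "mtr to $host exited non-zero\n";
}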
Updated by szarate over 2 years ago
- Subject changed from [qe-core] test fails in bootloader_zkvm - invalid test setting defined for s390x-kvm-sle12 to [qe-core] test fails in bootloader_zkvm - performance degradation in the s390 network is causing serial console to be unreliable (and killing jobs slowly)
mgriessmeier wrote:
To add my findings here about the apparently lost serial output, which I can reproduce locally:
Following the approach openQA uses to grab the serial output, I ran
ssh -t root@s390zp18 virsh console openQA-SUT-7-113528
When I send data to the SUT via echo "foobar" > /dev/pts/<pty>,
it appears on the ssh'ed serial output.
This works fine when I do it from my notebook or a worker outside the .qa.suse.de subnet. However, from grenache-1 and other machines in the qa subnet, I observed that the output doesn't get through or is delayed.
Furthermore, I noticed that my ssh connection to grenache-1 is hanging from time to time, so I did some investigation into network performance.
I tend to have similar hangs connecting to phobos.suse.de too today.
mtr showed that there is another network performance issue between 10.162.x.x and 10.161.x.x (which is the s390 network), most likely related to the serial output redirection (since it is done over ssh):
from grenache-1 to s390zp[18,19] I got packet loss of about 50%
from grenache-1 to z13router it was about 20%
from openqaworker2.suse.de or my notebook to the 10.161.x.x subnet everything was fine
Furthermore, I noticed that qanet.qa.suse.de showed 100% packet loss to 10.161.x.x.
Indeed, I got the same issues when trying from phobos.
Updated by szarate over 2 years ago
- Description updated (diff)
next thing to try:
Copy a worker config from grenache to openqaworker2 as Matthi suggested, give it a different name, and clone a job using the new worker class, then monitor it, as sketched below.
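A sketch of that last step (the worker class name s390x-kvm-sle12-ow2 is a made-up example; 9121775 is one of the failing jobs linked above): clone the job with WORKER_CLASS overridden and watch where it gets picked up.

# Sketch only: clone one of the failing jobs onto the copied worker config by
# overriding WORKER_CLASS (the class name below is a made-up example), then
# monitor the cloned job on openqa.suse.de.
use strict;
use warnings;

my @cmd = ('openqa-clone-job', '--within-instance', 'https://openqa.suse.de',
    '9121775', 'WORKER_CLASS=s390x-kvm-sle12-ow2');
system(@cmd) == 0 or die "openqa-clone-job failed: $?\n";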
Updated by mgriessmeier over 2 years ago
jira-sd tracking ticket: https://sd.suse.com/servicedesk/customer/portal/1/SD-92689
Updated by okurz over 2 years ago
@mgriessmeier do you think something in https://monitor.qa.suse.de/d/WDgrenache-1/worker-dashboard-grenache-1?orgId=1&refresh=1m&viewPanel=65099 should show this packet loss as well?
Updated by szarate over 2 years ago
- Target version set to QE-Core: Ready
With https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/420 and https://openqa.suse.de/tests/9128415 we're now sure that it's network stuff...
Things I see where we can cover ourselves:
The svirt backend needs to be more aware of the host where it is running:
- detect out-of-memory conditions better (see the sketch below)
- detect problems between the jumphost and the system itself: for instance, a system failing to boot with zero data on the serial console (see below) can be a good hint of this problem
Connected to domain openQA-SUT-5
Escape character is ^]
The libvirt hosts should use NFS to copy the assets, instead of the assets being downloaded and then copied.
Also, right now tests are still run very inefficiently, with unused assets being transferred back and forth. E.g. qcow images from https://openqa.suse.de/tests/9128415#downloads are transferred from osd to the worker and then to the s390x host.
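As a rough illustration of the out-of-memory point above (a sketch only, not existing backend code; the host name and threshold are examples): check available memory on the libvirt host before defining another domain and fail early with a clear message, instead of letting the kernel OOM-killer take out qemu mid-job.

# Sketch only (host name and threshold are examples): query MemAvailable on
# the libvirt host and abort early if there is clearly not enough memory for
# another guest, instead of letting the OOM-killer kill qemu later.
use strict;
use warnings;

my $host   = 's390zp18.suse.de';
my $min_kb = 8 * 1024 * 1024;    # require roughly 8 GiB free for the guest
my ($avail_kb) = qx{ssh root\@$host cat /proc/meminfo} =~ /MemAvailable:\s+(\d+)/;
die "could not read MemAvailable on $host\n" unless defined $avail_kb;
die "only ${avail_kb} kB available on $host, refusing to start another guest\n"
    if $avail_kb < $min_kb;
print "memory check on $host passed (${avail_kb} kB available)\n";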
Updated by mgriessmeier over 2 years ago
okurz wrote:
@mgriessmeier do you think something in https://monitor.qa.suse.de/d/WDgrenache-1/worker-dashboard-grenache-1?orgId=1&refresh=1m&viewPanel=65099 should show this packet loss as well?
@okurz, actually it does, I just realized. It's barely visible when all graphs are selected, but when you select only zp18 and zp19 you can see it pretty clearly - very interesting though that s390zp17 and s390zp11 are not showing it...
Updated by szarate over 2 years ago
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/422 has been merged, let's see how everything responds: http://stats.openqa-monitor.qa.suse.de/goto/WwiTxWg4z?orgId=1
[2022-07-15T12:27:30.327108+02:00] [debug] [run_ssh_cmd(qemu-img create '/var/lib/libvirt/images/openQA-SUT-7a.img' -f qcow2 -b '/var/lib/libvirt/images//SLES-15-SP4-s390x-minimal_installed_for_LTP.qcow2' 42949672960)] stderr:
qemu-img: /var/lib/libvirt/images/openQA-SUT-7a.img: Failed to get "write" lock
Is another process using the image [/var/lib/libvirt/images/openQA-SUT-7a.img]?
https://openqa.suse.de/tests/9132833/logfile?filename=autoinst-log.txt - now let's see whether there is another worker using the same VIRSH_INSTANCE number.
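To hunt for the collision on the libvirt host, something like the following could be run there (a sketch using standard virsh/lsof invocations; only the image path is taken from the log above):

# Quick diagnostic sketch to run on the libvirt host: list defined domains and
# any process still holding the disk image that failed to get the write lock.
use strict;
use warnings;

my $image = '/var/lib/libvirt/images/openQA-SUT-7a.img';
for my $cmd ('virsh list --all', "lsof $image") {
    print "\$ $cmd\n";
    system($cmd);    # lsof exits non-zero if nothing holds the file open
}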
Updated by szarate over 2 years ago
- Status changed from Feedback to In Progress
For the most part, it looks healthy http://stats.openqa-monitor.qa.suse.de/goto/zuZY-WgVk?orgId=1
Updated by szarate over 2 years ago
- Due date set to 2022-07-18
- Status changed from In Progress to Feedback
Matthias was using the instance 7 on the machine that failed... we can call this done :), will leave in feedback until Monday
Updated by okurz over 2 years ago
szarate wrote:
Matthias was using the instance 7 on the machine that failed... we can call this done :), will leave in feedback until Monday
Please allow some time for people to review merge requests in salt-pillars. https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/422 was created by you, approved by you, merged by you. Can be done in case of emergencies but then I suggest to not call this ticket done without waiting for feedback from others for a reasonable time. I suggest to keep this ticket open until workers are back on grenache-1 or a different solution has been found. But right now we have multiple worker instances on grenache-1 with no useful worker class. Of course we need to understand and fix first the underlying issue, that is https://sd.suse.com/servicedesk/customer/portal/1/SD-92689, right?
Updated by szarate over 2 years ago
- Related to action #107062: Multiple failures due to network issues added
Updated by szarate over 2 years ago
okurz wrote:
szarate wrote:
Matthias was using the instance 7 on the machine that failed... we can call this done :), will leave in feedback until Monday
Please allow some time for people to review merge requests in salt-pillars. https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/422 was created by you, approved by you, merged by you. Can be done in case of emergencies but then I suggest to not call this ticket done without waiting for feedback from others for a reasonable time.
I'll keep it open until the end of the sprint.
I suggest to keep this ticket open until workers are back on grenache-1 or a different solution has been found.
I take it that moving the workers to openqaworker2 is a good solution; if you disagree, create a ticket to move the workers back to grenache :)
But right now we have multiple worker instances on grenache-1 with no useful worker class.
When creating the ticket above, mention that the openQA worker distribution has to be improved and rebalanced in the worker config too.
Of course we need to understand and fix first the underlying issue, that is https://sd.suse.com/servicedesk/customer/portal/1/SD-92689, right?
I guess that's what #107062 is for
Updated by szarate over 2 years ago
Please allow some time for people to review merge requests in salt-pillars. https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/422 was created by you, approved by you, merged by you. Can be done in case of emergencies but then I suggest to not call this ticket done without waiting for feedback from others for a reasonable time.
Sure, people had plenty of time to review in such an urgent situation :); another story would be if I had merged right after creating the PR.
Updated by okurz over 2 years ago
- Copied to action #113701: [qe-core] Move workers back to grenache added
Updated by okurz over 2 years ago
szarate wrote:
okurz wrote:
szarate wrote:
Matthias was using the instance 7 on the machine that failed... we can call this done :), will leave in feedback until Monday
Please allow some time for people to review merge requests in salt-pillars. https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/422 was created by you, approved by you, merged by you. Can be done in case of emergencies but then I suggest to not call this ticket done without waiting for feedback from others for a reasonable time.
I'll keep it open until the end of the sprint.
Well, I meant the merge request. I assume with "it" you mean this ticket. Anyway, see further below regarding any followup work.
I suggest to keep this ticket open until workers are back on grenache-1 or a different solution has been found.
I take it that moving the workers to openqaworker2 is a good solution; if you disagree, create a ticket to move the workers back to grenache :)
Sure. I think it's not a good idea to spread out the rollback actions to separate tickets, as it's still the same original issue and moving to another worker is a temporary workaround, but I created a ticket if it helps your planning: #113701
But right now we have multiple worker instances on grenache-1 with no useful worker class.
When creating the ticket above, mention that the openQA worker distribution has to be improved and rebalanced in the worker config too.
Done, included with #113701#note-2
Of course we need to understand and fix first the underlying issue, that is https://sd.suse.com/servicedesk/customer/portal/1/SD-92689, right?
I guess that's what #107062 is for
I don't mind if you want to use #107062 for tracking or this ticket or the newly created #113701, please just make sure that https://sd.suse.com/servicedesk/customer/portal/1/SD-92689 is tracked and as soon as resolved #113701 is followed up with.
Updated by szarate over 2 years ago
- Status changed from Feedback to Resolved
After monitoring the situation a bit, I don't see anything else to do here.
Updated by amanzini about 2 years ago
- Related to action #119233: [security] boot_to_desktop test fails on specific s390 workers due to closed SSH port added