action #113528


[qe-core] test fails in bootloader_zkvm - performance degradation in the s390 network is causing serial console to be unreliable (and killing jobs slowly)

Added by rfan1 over 2 years ago. Updated over 2 years ago.

Status:
Resolved
Priority:
Immediate
Assignee:
Category:
Bugs in existing tests
Start date:
2022-07-13
Due date:
2022-07-18
% Done:

0%

Estimated time:
Difficulty:
Sprint:
QE-Core: July Sprint (Jul 06 - Aug 03)

Description

Observation

openQA test in scenario sle-15-SP3-Server-DVD-Updates-s390x-mru-install-minimal-with-addons@s390x-kvm-sle12 fails in
bootloader_zkvm

Test suite description

Testsuite maintained at https://gitlab.suse.de/qa-maintenance/qam-openqa-yml.

Reproducible

Fails since (at least) Build 20220712-1

Expected result

Last good: 20220711-1 (or more recent)

Suggestions

Copy a worker config from grenache to openqaworker2 as Matthi suggested, give it a different name, clone a job using the new worker class, and monitor.


Related issues: 4 (3 open, 1 closed)

Related to openQA Project (public) - coordination #58184: [saga][epic][use case] full version control awareness within openQA (Blocked, okurz, 2019-08-01)

Related to openQA Tests (public) - action #107062: Multiple failures due to network issues (Feedback, jstehlik, 2021-09-27)

Related to openQA Tests (public) - action #119233: [security] boot_to_desktop test fails on specific s390 workers due to closed SSH port (Closed)

Copied to openQA Infrastructure (public) - action #113701: [qe-core] Move workers back to grenache (New)

Actions #1

Updated by rfan1 over 2 years ago

  • Subject changed from [qe-core]test fails in bootloader_zkvm, some performance issue with worker class 's390x-kvm-sle12'? to [qe-core]test fails in bootloader_zkvm, any network performance issue with worker class 's390x-kvm-sle12'?

It seems we had some issues downloading the 'initrd' file, especially on SUT 'zp18'.

Actions #2

Updated by rfan1 over 2 years ago

I tried to add some wait time:

diff --git a/tests/installation/bootloader_zkvm.pm b/tests/installation/bootloader_zkvm.pm
index 76c5a5c8b..b4ce9446f 100644
--- a/tests/installation/bootloader_zkvm.pm
+++ b/tests/installation/bootloader_zkvm.pm
@@ -50,7 +50,7 @@ sub set_svirt_domain_elements {

         # show this on screen and make sure that kernel and initrd are actually saved
         enter_cmd "wget $repo/boot/s390x/initrd -O $zkvm_img_path/$name.initrd";
-        assert_screen "initrd-saved";
+        assert_screen "initrd-saved", 90;
         enter_cmd "wget $repo/boot/s390x/linux -O $zkvm_img_path/$name.kernel";
         assert_screen "kernel-saved";
     }

However, it seems later tests were still slow: https://openqa.suse.de/tests/9121775#step/bootloader_zkvm/37
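
For illustration, a minimal sketch (not from the ticket) of a manual check that could separate a genuinely slow download from a slow serial console; the repo URL is a placeholder and the host name s390zp18 is taken from the later comments:

    # time the initrd download directly on the zKVM host, bypassing the serial console
    REPO="..."   # placeholder: set to the repo URL the job actually uses
    ssh root@s390zp18 "time wget -O /dev/null $REPO/boot/s390x/initrd"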

Actions #3

Updated by mgriessmeier over 2 years ago

fwiw: manual installation on s390zp18 works just fine and without any noticeable performance impacts

Actions #4

Updated by mgriessmeier over 2 years ago

So the issue looks to me like the SUT is either not sending data over serial to openQA, or openQA cannot receive this data.
When checking manually, the serial output openQA is waiting for is present.

Actions #5

Updated by okurz over 2 years ago

  • Priority changed from Normal to Urgent

Bumping to urgent as it's affecting QA SLE maintenance tests, e.g. https://openqa.suse.de/tests/9113784#next_previous

Actions #6

Updated by jlausuch over 2 years ago

It also affects the SLE Micro 5.3 Alpha 2 release, as we can't do validation on this arch.

Actions #7

Updated by geor over 2 years ago

From my side, after playing around it looks like there is a significant delay in getting the serial output; the most success I have had was with this job, and the latest successful result is this one.
It does look like something goes through sometimes, but I really cannot determine any factor that affects the success rate.
Anyway, I'll check a bit more tomorrow, hoping that it fixes itself.

Actions #8

Updated by rfan1 over 2 years ago

https://openqa.suse.de/tests/9125700/logfile?filename=autoinst-log.txt

The issue is still there.

[2022-07-14T03:26:50.910940+02:00] [debug] no match: 6.9s, best candidate: bootloader_zkvm-initrd-saved-20160607 (0.00)
[2022-07-14T03:26:51.981234+02:00] [debug] no match: 5.9s, best candidate: bootloader_zkvm-initrd-saved-20160607 (0.00)
[2022-07-14T03:26:52.913105+02:00] [debug] no match: 4.9s, best candidate: bootloader_zkvm-initrd-saved-20160607 (0.00)
[2022-07-14T03:26:53.914215+02:00] [debug] no match: 3.9s, best candidate: bootloader_zkvm-initrd-saved-20160607 (0.00)
[2022-07-14T03:26:54.915226+02:00] [debug] no match: 2.9s, best candidate: bootloader_zkvm-initrd-saved-20160607 (0.00)
[2022-07-14T03:26:55.916294+02:00] [debug] no match: 1.9s, best candidate: bootloader_zkvm-initrd-saved-20160607 (0.00)
[2022-07-14T03:26:56.986691+02:00] [debug] no match: 0.9s, best candidate: bootloader_zkvm-initrd-saved-20160607 (0.00)
[2022-07-14T03:26:57.987640+02:00] [debug] no match: -0.1s, best candidate: bootloader_zkvm-initrd-saved-20160607 (0.00)
[2022-07-14T03:26:58.247167+02:00] [debug] >>> testapi::_check_backend_response: match=initrd-saved timed out after 30 (assert_screen)
[2022-07-14T03:26:58.314293+02:00] [info] ::: basetest::runtest: # Test died: no candidate needle with tag(s) 'initrd-saved' matched
[2022-07-14T03:26:58.315794+02:00] [debug] tests/installation/bootloader_zkvm.pm:96 called testapi::select_console
[2022-07-14T03:26:58.315920+02:00] [debug] <<< testapi::select_console(testapi_console="svirt")
[2022-07-14T03:26:58.321021+02:00] [debug] Establishing VNC connection to localhost:47135
XIO:  fatal IO error 11 (Resource temporarily unavailable) on X server ":35331"
      after 28809 requests (28652 known processed) with 0 events remaining.
xterm: fatal IO error 11 (Resource temporarily unavailable) or KillClient on X server ":35331"
[2022-07-14T03:26:58.333696+02:00] [info] ::: backend::driver::__ANON__: Driver backend collected unknown process with pid 491857 and exit status: 1
[2022-07-14T03:26:58.334038+02:00] [info] ::: backend::driver::__ANON__: Driver backend collected unknown process with pid 491866 and exit status: 0
[2022-07-14T03:26:58.334248+02:00] [info] ::: backend::driver::__ANON__: Driver backend collected unknown process with pid 491859 and exit status: 84
Actions #9

Updated by szarate over 2 years ago

https://openqa.suse.de/tests/9126241/logfile?filename=autoinst-log.txt

Somebody changed the machine definition of s390x-kvm-sle12:

Settings key '_now_in_workerconf_NETDEV' is invalid at /usr/lib/os-autoinst/bmwqemu.pm line 90.
    bmwqemu::load_vars() called at /usr/bin/isotovideo line 395
 at /usr/lib/os-autoinst/bmwqemu.pm line 313.
    bmwqemu::tiedvars::STORE(bmwqemu::tiedvars=HASH(0x1003d7bbe78), "_now_in_workerconf_NETDEV", "eth2") called at /usr/lib/os-autoinst/bmwqemu.pm line 90
    bmwqemu::load_vars() called at /usr/bin/isotovideo line 395

But that wasn't the issue.

Actions #10

Updated by szarate over 2 years ago

  • Priority changed from Urgent to Immediate
Actions #11

Updated by szarate over 2 years ago

  • Related to coordination #58184: [saga][epic][use case] full version control awareness within openQA added
Actions #12

Updated by szarate over 2 years ago

  • Sprint set to QE-Core: July Sprint (Jul 06 - Aug 03)
  • Tags changed from qe-core-coverage, bugbusters to bugbusters
  • Subject changed from [qe-core]test fails in bootloader_zkvm, any network performance issue with worker class 's390x-kvm-sle12'? to [qe-core] test fails in bootloader_zkvm - invalid test setting defined for s390x-kvm-sle12
  • Status changed from New to Feedback
  • Assignee set to szarate
[6692719.947536] Out of memory: Kill process 152100 (qemu-system-s39) score 120 or sacrifice child
[6692719.947558] Killed process 152100 (qemu-system-s39) total-vm:8013396kB, anon-rss:4188940kB, file-rss:2864kB, shmem-rss:0kB

Just rebooted the machine, I hope that fixes the issue?...
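
For illustration, a minimal sketch (not from the ticket) of how OOM-killer activity and leftover qemu processes could be checked on the zKVM host before resorting to a reboot; the host name is an assumption:

    # look for OOM-killer activity and the processes it killed
    ssh root@s390zp18 "dmesg -T | grep -iE 'out of memory|killed process'"
    # check current memory pressure and any leftover qemu processes
    ssh root@s390zp18 "free -m; ps aux | grep [q]emu-system"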

Actions #13

Updated by mgriessmeier over 2 years ago

To add my findings here about the apparently lost serial output, which I can reproduce locally:

Following the approach openQA uses to grab the serial output, I ran ssh -t root@s390zp18 virsh console openQA-SUT-7-113528.
When I send data to the SUT via echo "foobar" > /dev/pts/<pty>, it appears on the ssh'ed serial output.
This works fine when I do it from my notebook or a worker outside the .qa.suse.de subnet. However, from grenache-1 and other machines in the qa subnet, I observed that the output doesn't get through or is delayed.
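
Spelled out as commands, the reproduction looks roughly like this (a sketch; using virsh ttyconsole to find the pty is an assumption, the rest is taken from the comment above):

    # terminal 1: follow the serial console the way openQA does
    ssh -t root@s390zp18 virsh console openQA-SUT-7-113528

    # terminal 2: on the zKVM host, find the domain's pty and write to it
    ssh root@s390zp18
    virsh ttyconsole openQA-SUT-7-113528   # prints e.g. /dev/pts/3
    echo "foobar" > /dev/pts/3             # should show up in terminal 1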

Furthermore, I noticed that my ssh connection to grenache-1 hangs from time to time, so I did some investigation into network performance.

mtr showed that there seems to be another network performance issue between 10.162.x.x and 10.161.x.x (which is the s390 network), most likely related to the serial output redirection (because it is done over ssh):

For grenache-1 to s390zp[18,19] I got packet loss of about 50%.
For grenache-1 to z13router it was about 20%.
For openqaworker2.suse.de or my notebook to the 10.161.x.x subnet, everything was fine.
Furthermore, I noticed that qanet.qa.suse.de showed 100% packet loss to 10.161.x.x.

I pinged Gerhard and he is also looking into this; I will create an EngInfra ticket later on if we cannot solve it in a timely manner.
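
For reference, a sketch of how such packet-loss numbers can be gathered with mtr; the exact options and host names are assumptions, as the comment does not show the invocation:

    # run from grenache-1, openqaworker2, qanet, ... and compare the loss column
    mtr --report --report-wide --report-cycles 100 s390zp18
    mtr --report --report-wide --report-cycles 100 s390zp19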

Actions #14

Updated by szarate over 2 years ago

  • Subject changed from [qe-core] test fails in bootloader_zkvm - invalid test setting defined for s390x-kvm-sle12 to [qe-core] test fails in bootloader_zkvm - performance degradation in the s390 network is causing serial console to be unreliable (and killing jobs slowly)

mgriessmeier wrote:

To add my findings here about the apparently lost serial output, which I can reproduce locally:

Following the approach openQA uses to grab the serial output, I ran ssh -t root@s390zp18 virsh console openQA-SUT-7-113528.
When I send data to the SUT via echo "foobar" > /dev/pts/<pty>, it appears on the ssh'ed serial output.
This works fine when I do it from my notebook or a worker outside the .qa.suse.de subnet. However, from grenache-1 and other machines in the qa subnet, I observed that the output doesn't get through or is delayed.

Furthermore, I noticed that my ssh connection to grenache-1 hangs from time to time, so I did some investigation into network performance.

I have been getting similar hangs connecting to phobos.suse.de today too.

mtr showed that there seems to be another network performance issue between 10.162.x.x and 10.161.x.x (which is the s390 network), most likely related to the serial output redirection (because it is done over ssh):

For grenache-1 to s390zp[18,19] I got packet loss of about 50%.
For grenache-1 to z13router it was about 20%.
For openqaworker2.suse.de or my notebook to the 10.161.x.x subnet, everything was fine.
Furthermore, I noticed that qanet.qa.suse.de showed 100% packet loss to 10.161.x.x.

Indeed, I got the same issues when trying from phobos.

Actions #15

Updated by szarate over 2 years ago

  • Description updated (diff)

Next thing to try:
Copy a worker config from grenache to openqaworker2 as Matthi suggested, give it a different name, clone a job using the new worker class, and monitor.
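
For illustration, a rough sketch of what that clone could look like, assuming the new worker class gets a hypothetical name like s390x-kvm-sle12-ow2 and reusing one of the failed jobs referenced above:

    # clone one of the affected jobs onto the new worker class and watch the result
    openqa-clone-job --within-instance https://openqa.suse.de/tests/9121775 \
        WORKER_CLASS=s390x-kvm-sle12-ow2 BUILD=poo113528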

Actions #18

Updated by szarate over 2 years ago

  • Target version set to QE-Core: Ready

With https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/420 and https://openqa.suse.de/tests/9128415 we're now sure that it's a network issue...

Things I see where we can cover ourselves:

The svirt backend needs to be more aware of the host it is running on:

  • detect out-of-memory conditions better
  • problems between the jumphost and the system itself: for instance, a system failing to boot with zero data in the serial console (see below) could be a good hint of this problem

    Connected to domain openQA-SUT-5
    Escape character is ^]
    
  • libvirt hosts should use NFS to copy the assets, instead of the assets being downloaded and then copied

Also, right now tests still run very inefficiently, with unused assets being transferred back and forth. E.g. the qcow images from https://openqa.suse.de/tests/9128415#downloads are transferred from OSD to the worker and then to the s390x host.
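
As a rough sketch of the NFS idea mentioned above (the paths follow the default openQA share layout and are assumptions, as is the share being exported over NFS from OSD):

    # on the s390x libvirt host: mount the openQA factory share once ...
    mount -t nfs openqa.suse.de:/var/lib/openqa/share /var/lib/openqa/share
    # ... so assets can be copied locally instead of travelling osd -> worker -> libvirt host
    cp /var/lib/openqa/share/factory/hdd/SLES-15-SP4-s390x-minimal_installed_for_LTP.qcow2 \
        /var/lib/libvirt/images/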

Actions #19

Updated by mgriessmeier over 2 years ago

okurz wrote:

@mgriessmeier do you think something in https://monitor.qa.suse.de/d/WDgrenache-1/worker-dashboard-grenache-1?orgId=1&refresh=1m&viewPanel=65099 should show this packet loss as well?

@okurz, actually it does, I just realized. It's barely visible when all graphs are selected, but when you select only zp18 and zp19 you can see it pretty clearly. Very interesting though that s390zp17 and s390zp11 are not showing it...

Actions #20

Updated by szarate over 2 years ago

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/422 has been merged, let's see how everything responds: http://stats.openqa-monitor.qa.suse.de/goto/WwiTxWg4z?orgId=1

[2022-07-15T12:27:30.327108+02:00] [debug] [run_ssh_cmd(qemu-img create '/var/lib/libvirt/images/openQA-SUT-7a.img' -f qcow2 -b '/var/lib/libvirt/images//SLES-15-SP4-s390x-minimal_installed_for_LTP.qcow2' 42949672960)] stderr:
  qemu-img: /var/lib/libvirt/images/openQA-SUT-7a.img: Failed to get "write" lock
  Is another process using the image [/var/lib/libvirt/images/openQA-SUT-7a.img]?

https://openqa.suse.de/tests/9132833/logfile?filename=autoinst-log.txt
Now let's see whether another worker is using the same VIRSH_INSTANCE number.
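
A sketch of how one could check on the libvirt host which domain or process still holds the image; the host name is an assumption, the image path is taken from the log above:

    # list all defined/running domains and see who has the image open
    ssh root@s390zp18 "virsh list --all"
    ssh root@s390zp18 "lsof /var/lib/libvirt/images/openQA-SUT-7a.img"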

Actions #21

Updated by szarate over 2 years ago

  • Status changed from Feedback to In Progress
Actions #22

Updated by szarate over 2 years ago

  • Due date set to 2022-07-18
  • Status changed from In Progress to Feedback

Matthias was using the instance 7 on the machine that failed... we can call this done :), will leave in feedback until Monday

Actions #23

Updated by okurz over 2 years ago

szarate wrote:

Matthias was using the instance 7 on the machine that failed... we can call this done :), will leave in feedback until Monday

Please allow some time for people to review merge requests in salt-pillars. https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/422 was created by you, approved by you, merged by you. Can be done in case of emergencies but then I suggest to not call this ticket done without waiting for feedback from others for a reasonable time. I suggest to keep this ticket open until workers are back on grenache-1 or a different solution has been found. But right now we have multiple worker instances on grenache-1 with no useful worker class. Of course we need to understand and fix first the underlying issue, that is https://sd.suse.com/servicedesk/customer/portal/1/SD-92689, right?

Actions #24

Updated by szarate over 2 years ago

  • Related to action #107062: Multiple failures due to network issues added
Actions #25

Updated by szarate over 2 years ago

okurz wrote:

szarate wrote:

Matthias was using the instance 7 on the machine that failed... we can call this done :), will leave in feedback until Monday

Please allow some time for people to review merge requests in salt-pillars. https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/422 was created by you, approved by you, merged by you. Can be done in case of emergencies but then I suggest to not call this ticket done without waiting for feedback from others for a reasonable time.

I'll keep it open until the end of the sprint.

I suggest to keep this ticket open until workers are back on grenache-1 or a different solution has been found.

I take it that moving the workers to openqaworker2 is a good solution; if you disagree, create a ticket to move the workers back to grenache :)

But right now we have multiple worker instances on grenache-1 with no useful worker class.

When creating the ticket above, mention that the openQA worker distribution has to be improved and rebalanced in the worker config too.

Of course we need to understand and fix first the underlying issue, that is https://sd.suse.com/servicedesk/customer/portal/1/SD-92689, right?

I guess that's what #107062 is for

Actions #26

Updated by szarate over 2 years ago

Please allow some time for people to review merge requests in salt-pillars. https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/422 was created by you, approved by you, merged by you. Can be done in case of emergencies but then I suggest to not call this ticket done without waiting for feedback from others for a reasonable time.

Sure, people had plenty of time to review in such an urgent situation :). It would be another story if I had merged right after creating the PR.

Actions #27

Updated by okurz over 2 years ago

  • Copied to action #113701: [qe-core] Move workers back to grenache added
Actions #28

Updated by okurz over 2 years ago

szarate wrote:

okurz wrote:

szarate wrote:

Matthias was using the instance 7 on the machine that failed... we can call this done :), will leave in feedback until Monday

Please allow some time for people to review merge requests in salt-pillars. https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/422 was created by you, approved by you, merged by you. Can be done in case of emergencies but then I suggest to not call this ticket done without waiting for feedback from others for a reasonable time.

I'll keep it open until the end of the sprint.

Well, I meant the merge request. I assume with "it" you mean this ticket. Anyway, see further below regarding any followup work.

I suggest to keep this ticket open until workers are back on grenache-1 or a different solution has been found.

I take it that moving the workers to openqaworker2 is a good solution; if you disagree, create a ticket to move the workers back to grenache :)

Sure. I think it's not a good idea to spread the rollback actions out into separate tickets, as it's still the same original issue and moving to another worker is a temporary workaround, but I created a ticket if it helps your planning: #113701

But right now we have multiple worker instances on grenache-1 with no useful worker class.

When creating the ticket above, mention that the openQA worker distribution has to be improved and rebalanced in the worker config too.

Done, included with #113701#note-2

Of course we need to understand and fix first the underlying issue, that is https://sd.suse.com/servicedesk/customer/portal/1/SD-92689, right?

I guess that's what #107062 is for

I don't mind whether you use #107062, this ticket, or the newly created #113701 for tracking; please just make sure that https://sd.suse.com/servicedesk/customer/portal/1/SD-92689 is tracked and that #113701 is followed up on as soon as it is resolved.

Actions #29

Updated by szarate over 2 years ago

  • Status changed from Feedback to Resolved

After monitoring the situation a bit, I don't see anything else to do here.

Actions #30

Updated by amanzini about 2 years ago

  • Related to action #119233: [security] boot_to_desktop test fails on specific s390 workers due to closed SSH port added