action #113528

closed

[qe-core] test fails in bootloader_zkvm - performance degradation in the s390 network is causing serial console to be unreliable (and killing jobs slowly)

Added by rfan1 almost 2 years ago. Updated almost 2 years ago.

Status: Resolved
Priority: Immediate
Assignee:
Category: Bugs in existing tests
Target version:
Start date: 2022-07-13
Due date: 2022-07-18
% Done: 0%
Estimated time:
Difficulty:
Sprint: QE-Core: July Sprint (Jul 06 - Aug 03)

Description

Observation

openQA test in scenario sle-15-SP3-Server-DVD-Updates-s390x-mru-install-minimal-with-addons@s390x-kvm-sle12 fails in bootloader_zkvm.

Test suite description

Testsuite maintained at https://gitlab.suse.de/qa-maintenance/qam-openqa-yml.

Reproducible

Fails since (at least) Build 20220712-1

Expected result

Last good: 20220711-1 (or more recent)

Suggestions

Copy a worker config from grenache to openqaworker2 as Matthias suggested, give it a different worker class name, then clone a job using the new worker class and monitor it.


Related issues (4: 3 open, 1 closed)

  • Related to openQA Project - coordination #58184: [saga][epic][use case] full version control awareness within openQA (Blocked, assignee: okurz, 2018-03-23 – 2024-06-26)
  • Related to openQA Tests - action #107062: Multiple failures due to network issues (Feedback, assignee: jstehlik, 2021-09-27)
  • Related to openQA Tests - action #119233: [security] boot_to_desktop test fails on specific s390 workers due to closed SSH port (Closed)
  • Copied to openQA Infrastructure - action #113701: [qe-core] Move workers back to grenache (New)
Actions #1

Updated by rfan1 almost 2 years ago

  • Subject changed from [qe-core]test fails in bootloader_zkvm, some performance issue with worker class 's390x-kvm-sle12'? to [qe-core]test fails in bootloader_zkvm, any network performance issue with worker class 's390x-kvm-sle12'?

It seems we had some issues downloading the 'initrd' file, especially on SUT 'zp18'.

Actions #2

Updated by rfan1 almost 2 years ago

I tried to add some wait time:

diff --git a/tests/installation/bootloader_zkvm.pm b/tests/installation/bootloader_zkvm.pm
index 76c5a5c8b..b4ce9446f 100644
--- a/tests/installation/bootloader_zkvm.pm
+++ b/tests/installation/bootloader_zkvm.pm
@@ -50,7 +50,7 @@ sub set_svirt_domain_elements {

         # show this on screen and make sure that kernel and initrd are actually saved
         enter_cmd "wget $repo/boot/s390x/initrd -O $zkvm_img_path/$name.initrd";
-        assert_screen "initrd-saved";
+        assert_screen "initrd-saved", 90;
         enter_cmd "wget $repo/boot/s390x/linux -O $zkvm_img_path/$name.kernel";
         assert_screen "kernel-saved";
     }

However, it seems later tests were still slow https://openqa.suse.de/tests/9121775#step/bootloader_zkvm/37

Actions #3

Updated by mgriessmeier almost 2 years ago

FWIW: manual installation on s390zp18 works just fine and without any noticeable performance impact.

Actions #4

Updated by mgriessmeier almost 2 years ago

So the issue looks to me like the SUT is either not sending data over serial to openQA, or openQA cannot receive that data.
When checked manually, the serial output openQA is waiting for is present.

Actions #5

Updated by okurz almost 2 years ago

  • Priority changed from Normal to Urgent

Bumping to urgent as it's affecting QA SLE maintenance tests, e.g. https://openqa.suse.de/tests/9113784#next_previous

Actions #6

Updated by jlausuch almost 2 years ago

It also affects SLE Micro 5.3 Alpha 2 release, as we can't do validation on this arch.

Actions #7

Updated by geor almost 2 years ago

From my side, after playing around, it looks like there is a significant delay in getting the serial output; the most success I have had was with this job, and the latest successful result is this one.
It does look like something goes through sometimes, but I really cannot determine any factor that affects the success rate.
Anyway, I'll check a bit more tomorrow, hoping that it fixes itself.

Actions #8

Updated by rfan1 almost 2 years ago

https://openqa.suse.de/tests/9125700/logfile?filename=autoinst-log.txt

The issue is still there.

[2022-07-14T03:26:50.910940+02:00] [debug] no match: 6.9s, best candidate: bootloader_zkvm-initrd-saved-20160607 (0.00)
[2022-07-14T03:26:51.981234+02:00] [debug] no match: 5.9s, best candidate: bootloader_zkvm-initrd-saved-20160607 (0.00)
[2022-07-14T03:26:52.913105+02:00] [debug] no match: 4.9s, best candidate: bootloader_zkvm-initrd-saved-20160607 (0.00)
[2022-07-14T03:26:53.914215+02:00] [debug] no match: 3.9s, best candidate: bootloader_zkvm-initrd-saved-20160607 (0.00)
[2022-07-14T03:26:54.915226+02:00] [debug] no match: 2.9s, best candidate: bootloader_zkvm-initrd-saved-20160607 (0.00)
[2022-07-14T03:26:55.916294+02:00] [debug] no match: 1.9s, best candidate: bootloader_zkvm-initrd-saved-20160607 (0.00)
[2022-07-14T03:26:56.986691+02:00] [debug] no match: 0.9s, best candidate: bootloader_zkvm-initrd-saved-20160607 (0.00)
[2022-07-14T03:26:57.987640+02:00] [debug] no match: -0.1s, best candidate: bootloader_zkvm-initrd-saved-20160607 (0.00)
[2022-07-14T03:26:58.247167+02:00] [debug] >>> testapi::_check_backend_response: match=initrd-saved timed out after 30 (assert_screen)
[2022-07-14T03:26:58.314293+02:00] [info] ::: basetest::runtest: # Test died: no candidate needle with tag(s) 'initrd-saved' matched
[2022-07-14T03:26:58.315794+02:00] [debug] tests/installation/bootloader_zkvm.pm:96 called testapi::select_console
[2022-07-14T03:26:58.315920+02:00] [debug] <<< testapi::select_console(testapi_console="svirt")
[2022-07-14T03:26:58.321021+02:00] [debug] Establishing VNC connection to localhost:47135
XIO:  fatal IO error 11 (Resource temporarily unavailable) on X server ":35331"
      after 28809 requests (28652 known processed) with 0 events remaining.
xterm: fatal IO error 11 (Resource temporarily unavailable) or KillClient on X server ":35331"
[2022-07-14T03:26:58.333696+02:00] [info] ::: backend::driver::__ANON__: Driver backend collected unknown process with pid 491857 and exit status: 1
[2022-07-14T03:26:58.334038+02:00] [info] ::: backend::driver::__ANON__: Driver backend collected unknown process with pid 491866 and exit status: 0
[2022-07-14T03:26:58.334248+02:00] [info] ::: backend::driver::__ANON__: Driver backend collected unknown process with pid 491859 and exit status: 84
Actions #9

Updated by szarate almost 2 years ago

https://openqa.suse.de/tests/9126241/logfile?filename=autoinst-log.txt

Somebody changed the machine definition of s390x-kvm-sle12:

Settings key '_now_in_workerconf_NETDEV' is invalid at /usr/lib/os-autoinst/bmwqemu.pm line 90.
    bmwqemu::load_vars() called at /usr/bin/isotovideo line 395
 at /usr/lib/os-autoinst/bmwqemu.pm line 313.
    bmwqemu::tiedvars::STORE(bmwqemu::tiedvars=HASH(0x1003d7bbe78), "_now_in_workerconf_NETDEV", "eth2") called at /usr/lib/os-autoinst/bmwqemu.pm line 90
    bmwqemu::load_vars() called at /usr/bin/isotovideo line 395

but that wasn't the issue

Actions #10

Updated by szarate almost 2 years ago

  • Priority changed from Urgent to Immediate
Actions #11

Updated by szarate almost 2 years ago

  • Related to coordination #58184: [saga][epic][use case] full version control awareness within openQA added
Actions #12

Updated by szarate almost 2 years ago

  • Sprint set to QE-Core: July Sprint (Jul 06 - Aug 03)
  • Tags changed from qe-core-coverage, bugbusters to bugbusters
  • Subject changed from [qe-core]test fails in bootloader_zkvm, any network performance issue with worker class 's390x-kvm-sle12'? to [qe-core] test fails in bootloader_zkvm - invalid test setting defined for s390x-kvm-sle12
  • Status changed from New to Feedback
  • Assignee set to szarate
[6692719.947536] Out of memory: Kill process 152100 (qemu-system-s39) score 120 or sacrifice child
[6692719.947558] Killed process 152100 (qemu-system-s39) total-vm:8013396kB, anon-rss:4188940kB, file-rss:2864kB, shmem-rss:0kB

Just rebooted the machine; I hope that fixes the issue...

Actions #13

Updated by mgriessmeier almost 2 years ago

To add my findings here about the apparently lost serial output, which I can reproduce locally:

Following the approach openQA uses to grab the serial output, I ran ssh -t root@s390zp18 virsh console openQA-SUT-7-113528.
When I send data to the SUT via echo "foobar" > /dev/pts/<pty>, it appears on the ssh'ed serial output.
This works fine when I do it from my notebook or a worker outside the .qa.suse.de subnet. However, from grenache-1 and other machines in the qa subnet, I observed that the output doesn't get through or is delayed.

Furthermore, I noticed that my ssh connection to grenache-1 hangs from time to time, so I did some investigation into network performance.

mtr showed that there seems to be another network performance issue between 10.162.x.x and 10.161.x.x (the s390 network), which is most likely related to the serial output redirection (since it is done over ssh):

  • from grenache-1 to s390zp[18,19] I got packet loss of about 50%
  • from grenache-1 to z13router it was about 20%
  • from openqaworker2.suse.de or my notebook to the 10.161.x.x subnet, everything was fine
  • furthermore, qanet.qa.suse.de showed 100% packet loss to 10.161.x.x

I pinged Gerhard, who is also looking into this, and I will create an enginfra ticket later on if we cannot solve it in a timely manner.
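The packet-loss figures above can also be collected non-interactively with mtr's report mode (e.g. mtr --report --report-cycles 100 s390zp18) and parsed for monitoring. The snippet below shows the parsing step only; the hop names match the ticket, but the report text is illustrative sample data, not a capture from grenache-1:

```shell
# Sample `mtr --report` output; loss figures here are illustrative, not a
# real capture from grenache-1. A live run would be:
#   mtr --report --report-cycles 100 s390zp18
report='HOST: grenache-1                  Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 10.162.0.1                 0.0%   100    0.4   0.5   0.3   1.2   0.1
  2.|-- z13router                 20.0%   100    2.1   2.4   1.8   9.7   0.9
  3.|-- s390zp18                  50.0%   100    3.0   3.5   2.6  12.4   1.3'

# Extract the loss percentage of the final hop (3rd column of the last line).
loss=$(printf '%s\n' "$report" | awk 'END { gsub("%", "", $3); print $3 }')
echo "final-hop loss: ${loss}%"
```

On a live run, a final-hop value around 50% would confirm the grenache-1 to s390zp18 figure reported above.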

Actions #14

Updated by szarate almost 2 years ago

  • Subject changed from [qe-core] test fails in bootloader_zkvm - invalid test setting defined for s390x-kvm-sle12 to [qe-core] test fails in bootloader_zkvm - performance degradation in the s390 network is causing serial console to be unreliable (and killing jobs slowly)

mgriessmeier wrote:

> To add my findings here about the apparently lost serial output, which I can reproduce locally:
>
> Following the approach openQA uses to grab the serial output, I ran ssh -t root@s390zp18 virsh console openQA-SUT-7-113528.
> When I send data to the SUT via echo "foobar" > /dev/pts/<pty>, it appears on the ssh'ed serial output.
> This works fine when I do it from my notebook or a worker outside the .qa.suse.de subnet. However, from grenache-1 and other machines in the qa subnet, I observed that the output doesn't get through or is delayed.
>
> Furthermore, I noticed that my ssh connection to grenache-1 hangs from time to time, so I did some investigation into network performance.

I tend to get similar hangs connecting to phobos.suse.de today too.

> mtr showed that there seems to be another network performance issue between 10.162.x.x and 10.161.x.x (the s390 network), which is most likely related to the serial output redirection (since it is done over ssh):
>
> • from grenache-1 to s390zp[18,19] I got packet loss of about 50%
> • from grenache-1 to z13router it was about 20%
> • from openqaworker2.suse.de or my notebook to the 10.161.x.x subnet, everything was fine
> • furthermore, qanet.qa.suse.de showed 100% packet loss to 10.161.x.x

Indeed, I got the same issues when trying from phobos.

Actions #15

Updated by szarate almost 2 years ago

  • Description updated (diff)

next thing to try:
Copy a worker config from grenache to openqaworker2 as Matthias suggested, give it a different worker class name, then clone a job using the new worker class and monitor it.
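As a sketch of that clone step, openqa-clone-job can retrigger one of the failing jobs against the new worker class. The worker class name below is a hypothetical example (the real name was chosen later); the job id is one of the failing jobs referenced in this ticket. Shown as a dry run that only prints the command:

```shell
# Hypothetical worker class for the temporary openqaworker2 instances;
# the real name used is not recorded in this ticket.
NEW_CLASS="s390x-kvm-sle12-ow2"
JOB=9121775  # one of the failing jobs linked above

# Build the clone command; printed here as a dry run instead of executing it.
CMD="openqa-clone-job --from https://openqa.suse.de $JOB WORKER_CLASS=$NEW_CLASS"
echo "$CMD"
```

Running the printed command would schedule a clone of the job that only the new worker class picks up, which is exactly what we need for monitoring.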

Actions #18

Updated by szarate almost 2 years ago

  • Target version set to QE-Core: Ready

With https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/420 and https://openqa.suse.de/tests/9128415 we're now sure that it's network stuff...

Things I see where we can cover ourselves: the svirt backend needs to be more aware of the host it is running on:

  • detect out-of-memory conditions better
  • detect problems between the jumphost and the system itself: for instance, a system failing to boot with zero data on the serial console (see below) could be a good hint of this problem

    Connected to domain openQA-SUT-5
    Escape character is ^]

  • libvirt hosts should use NFS to copy the assets, instead of the assets being downloaded and then copied over

Also, right now tests still run very inefficiently, with unused assets being transferred back and forth. E.g. the qcow images from https://openqa.suse.de/tests/9128415#downloads are transferred from osd to the worker and then to the s390x host.
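A minimal sketch of that NFS-based asset copy, assuming the standard openQA factory share layout is mounted on the libvirt host (the mount path is an assumption; the host name and image name come from this ticket's logs). Shown as a dry run that only prints the command:

```shell
# Assumption: the openQA factory share is NFS-mounted on the libvirt host
# at the standard path; adjust to the actual mount point.
NFS_ASSETS=/var/lib/openqa/share/factory/hdd
IMG=SLES-15-SP4-s390x-minimal_installed_for_LTP.qcow2

# Copy straight from the share on the libvirt host instead of routing the
# image through the worker; dry run, only prints the command.
CMD="ssh root@s390zp18 cp $NFS_ASSETS/$IMG /var/lib/libvirt/images/"
echo "$CMD"
```

The point of the design is that the image crosses the network once (NFS server to libvirt host) instead of twice (osd to worker, worker to s390x host).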

Actions #19

Updated by mgriessmeier almost 2 years ago

okurz wrote:

> @mgriessmeier do you think something in https://monitor.qa.suse.de/d/WDgrenache-1/worker-dashboard-grenache-1?orgId=1&refresh=1m&viewPanel=65099 should show this packet loss as well?

@okurz, actually it does, I just realized: it's barely visible when all graphs are selected, but when you select only zp18 and zp19 you can see it pretty clearly. Very interesting though that s390zp17 and s390zp11 are not showing it...

Actions #20

Updated by szarate almost 2 years ago

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/422 has been merged; let's see how everything responds: http://stats.openqa-monitor.qa.suse.de/goto/WwiTxWg4z?orgId=1

[2022-07-15T12:27:30.327108+02:00] [debug] [run_ssh_cmd(qemu-img create '/var/lib/libvirt/images/openQA-SUT-7a.img' -f qcow2 -b '/var/lib/libvirt/images//SLES-15-SP4-s390x-minimal_installed_for_LTP.qcow2' 42949672960)] stderr:
  qemu-img: /var/lib/libvirt/images/openQA-SUT-7a.img: Failed to get "write" lock
  Is another process using the image [/var/lib/libvirt/images/openQA-SUT-7a.img]?

https://openqa.suse.de/tests/9132833/logfile?filename=autoinst-log.txt - now let's see where another worker is using the same VIRSH_INSTANCE number.
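One way to spot such a collision is to check which worker instance sections claim the same VIRSH_INSTANCE in the workers.ini-style configuration. The fragment below is a made-up illustration of the idea, not the actual grenache-1 or openqaworker2 config:

```shell
# Illustrative workers.ini-style fragment; the real config lives in
# /etc/openqa/workers.ini on each worker host and differs from this.
cfg='[global]
WORKER_CLASS = s390x-kvm-sle12

[7]
VIRSH_INSTANCE = 7

[8]
VIRSH_INSTANCE = 8'

# Print the instance section(s) that claim VIRSH_INSTANCE 7; two workers
# printing the same section number across hosts would explain the lock clash.
match=$(printf '%s\n' "$cfg" | awk -v n=7 '
  /^\[/ { section = $0 }
  $1 == "VIRSH_INSTANCE" && $3 == n { print section }')
echo "$match"
```

Run against the configs of both worker hosts, duplicate matches would point at the two jobs fighting over the same openQA-SUT-7a.img write lock.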

Actions #21

Updated by szarate almost 2 years ago

  • Status changed from Feedback to In Progress
Actions #22

Updated by szarate almost 2 years ago

  • Due date set to 2022-07-18
  • Status changed from In Progress to Feedback

Matthias was using instance 7 on the machine that failed... we can call this done :). I will leave this in feedback until Monday.

Actions #23

Updated by okurz almost 2 years ago

szarate wrote:

> Matthias was using the instance 7 on the machine that failed... we can call this done :), will leave in feedback until Monday

Please allow some time for people to review merge requests in salt-pillars. https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/422 was created by you, approved by you, and merged by you. That can be done in case of emergencies, but then I suggest not calling this ticket done without waiting a reasonable time for feedback from others. I suggest keeping this ticket open until the workers are back on grenache-1 or a different solution has been found. But right now we have multiple worker instances on grenache-1 with no useful worker class. Of course we first need to understand and fix the underlying issue, which is https://sd.suse.com/servicedesk/customer/portal/1/SD-92689, right?

Actions #24

Updated by szarate almost 2 years ago

  • Related to action #107062: Multiple failures due to network issues added
Actions #25

Updated by szarate almost 2 years ago

okurz wrote:

> szarate wrote:
>
> > Matthias was using the instance 7 on the machine that failed... we can call this done :), will leave in feedback until Monday
>
> Please allow some time for people to review merge requests in salt-pillars. https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/422 was created by you, approved by you, merged by you. Can be done in case of emergencies but then I suggest to not call this ticket done without waiting for feedback from others for a reasonable time.

I'll keep it open until the end of the sprint.

> I suggest to keep this ticket open until workers are back on grenache-1 or a different solution has been found.

I take it moving the workers to openqaworker2 is a good solution; if you disagree, create a ticket to move the workers back to grenache :)

> But right now we have multiple worker instances on grenache-1 with no useful worker class.

When creating the ticket above, mention that the openQA worker distribution has to be improved and rebalanced in the worker config too.

> Of course we need to understand and fix first the underlying issue, that is https://sd.suse.com/servicedesk/customer/portal/1/SD-92689, right?

I guess that's what #107062 is for.

Actions #26

Updated by szarate almost 2 years ago

> Please allow some time for people to review merge requests in salt-pillars. https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/422 was created by you, approved by you, merged by you. Can be done in case of emergencies but then I suggest to not call this ticket done without waiting for feedback from others for a reasonable time.

Sure, people had plenty of time to review in such an urgent situation :); another story would be if I had merged right after creating the MR.

Actions #27

Updated by okurz almost 2 years ago

  • Copied to action #113701: [qe-core] Move workers back to grenache added
Actions #28

Updated by okurz almost 2 years ago

szarate wrote:

> okurz wrote:
>
> > szarate wrote:
> >
> > > Matthias was using the instance 7 on the machine that failed... we can call this done :), will leave in feedback until Monday
> >
> > Please allow some time for people to review merge requests in salt-pillars. https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/422 was created by you, approved by you, merged by you. Can be done in case of emergencies but then I suggest to not call this ticket done without waiting for feedback from others for a reasonable time.
>
> I'll keep it open until the end of the sprint.

Well, I meant the merge request; I assume that with "it" you mean this ticket. Anyway, see further below regarding any follow-up work.

> > I suggest to keep this ticket open until workers are back on grenache-1 or a different solution has been found.
>
> I take it moving the workers to openqaworker2 is a good solution; if you disagree, create a ticket to move the workers back to grenache :)

Sure. I think it's not a good idea to spread the rollback actions out into separate tickets, as it's still the same original issue and moving to another worker is a temporary workaround, but I created a ticket if it helps your planning: #113701

> > But right now we have multiple worker instances on grenache-1 with no useful worker class.
>
> When creating the ticket above, mention that the openQA worker distribution has to be improved and rebalanced in the worker config too.

Done, included in #113701#note-2

> > Of course we need to understand and fix first the underlying issue, that is https://sd.suse.com/servicedesk/customer/portal/1/SD-92689, right?
>
> I guess that's what #107062 is for

I don't mind whether you use #107062, this ticket, or the newly created #113701 for tracking; please just make sure that https://sd.suse.com/servicedesk/customer/portal/1/SD-92689 is tracked and that #113701 is followed up on as soon as it is resolved.

Actions #29

Updated by szarate almost 2 years ago

  • Status changed from Feedback to Resolved

After monitoring the situation a bit, I don't see anything else to do here.

Actions #30

Updated by amanzini over 1 year ago

  • Related to action #119233: [security] boot_to_desktop test fails on specific s390 workers due to closed SSH port added