action #116287

[qe-core][s390x] SSH serial terminal connection issues on s390x workers

Added by MDoucha 3 months ago. Updated 3 days ago.

Status:
Feedback
Priority:
Normal
Assignee:
Category:
Bugs in existing tests
Target version:
Start date:
2022-09-06
Due date:
2023-01-20
% Done:

0%

Estimated time:
Difficulty:

Description

s390x livepatch tests had a lot of installation failures this month due to SSH serial terminal connection failures. Interestingly, the connection failures seem to happen around the same module step. The serial_terminal.txt output appears to be out of sync with the terminal because part of the commands and their output is missing even though it is listed in the update_kernel module details. The dmesg output in serial0.txt often (but not always) shows a key exchange SSH error followed by output from a completely different job:

Welcome to SUSE Linux Enterprise Server 15 SP2  (s390x) - Kernel 5.3.18-24.83-default (ttysclp0).

eth0: 10.161.145.86 fe80::5054:ff:fe84:f877


susetest login: root

Password: 
Last login: Mon Sep  5 10:18:10 from 10.160.0.147
susetest:~ # systemctl is-active network
active
susetest:~ # systemctl is-active sshd
active
susetest:~ # 2022-09-05T10:25:03.604370-04:00 susetest sshd[4272]: error: kex_exchange_identification: Connection closed by remote host
2022-09-05T10:25:04.844743-04:00 susetest sshd[4273]: error: kex_exchange_identification: Connection closed by remote host
[  107.444474] LTP: starting DI000 (dirty)

[  107.445525] LTP: starting DS000 (dio_sparse)

[  107.466125] LTP: starting abort01

[  107.758318] LTP: starting accept01

12-SP4: https://openqa.suse.de/tests/9438804#step/update_kernel/337
15-SP2: https://openqa.suse.de/tests/9457752#step/update_kernel/337
15-SP3: https://openqa.suse.de/tests/9458645#step/update_kernel/337
15-SP4: https://openqa.suse.de/tests/9455666#step/update_kernel/199

I could not find any such connection failure on SLE-12SP5. Other SLE releases don't support s390x livepatches, and KOTD tests don't show this kind of issue. This looks like a kernel bug, but I'd like an s390x expert to look at this before I create a Bugzilla ticket. And of course this has exposed logging issues in openQA.


Related issues

Related to openQA Infrastructure - action #113701: [qe-core] Move workers back to grenache

History

#1 Updated by MDoucha 3 months ago

Also note that I could not find any examples of this issue when we used grenache as the intermediary worker. This issue started happening after the switch to openqaworker2.

#2 Updated by cdywan 3 months ago

MDoucha wrote:

serial_terminal.txt output appears to be out of sync with the terminal because part of the commands and output is missing even though it's listed in the update_kernel module details

This sounds like the serial terminal is not updated correctly

The dmesg output in serial0.txt often (but not always) shows some key exchange SSH error followed by output from a completely different job:

If somehow the serial terminal from a different job gets connected, it could explain the former

This looks like a kernel bug but I'd like an s390x expert to look at this before I create a Bugzilla ticket.

Maybe we need a separate ticket, but let's see who can help us identify what the issue is first.

#3 Updated by rfan1 3 months ago

Hello cdywan @MDoucha

I believe the issue was caused by my tests in https://openqa.suse.de/tests/overview?groupid=220.

I published some qcow2 images with a static IP in the AutoYaST configuration files, which led to lots of issues. I have switched to DHCP since Sep 8th.

Sorry about this!

#4 Updated by cdywan 3 months ago

  • Project changed from openQA Infrastructure to openQA Tests
  • Category set to Bugs in existing tests
  • Status changed from New to Feedback
  • Assignee set to cdywan
  • Target version set to Ready

Okay. So with the update this should resolve itself. Let's see.

#5 Updated by openqa_review 2 months ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: migration_online_verify_sle15sp2_ltss_ha_alpha_node02
https://openqa.suse.de/tests/9516680#step/setup_hosts_and_luns/1

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.

#6 Updated by cdywan about 2 months ago

Pinged rfan1 in chat

#7 Updated by cdywan about 2 months ago

openqa_review wrote:

This bug is still referenced in a failing openQA test: migration_online_verify_sle15sp2_ltss_ha_alpha_node02
https://openqa.suse.de/tests/9516680#step/setup_hosts_and_luns/1

Richard confirmed the DNS configuration issues have been resolved.

I see this in serial0.txt:

kex_exchange_identification: Connection closed by remote host
2022-09-16T00:44:58.436522-04:00 alpha-node02 sshd[1560]: error: kex_exchange_identification: Connection closed by remote host

Followed shortly after by this in autoinst-log.txt:

[2022-09-16T06:48:17.182729+02:00] [info] ::: basetest::runtest: # Test died: command 'mount -t nfs 1c119.qa.suse.de:/srv/nfs/ha_ss /support_fs' timed out at /usr/lib/os-autoinst/testapi.pm line 919.

#8 Updated by okurz about 2 months ago

  • Related to action #113701: [qe-core] Move workers back to grenache added

#9 Updated by okurz about 2 months ago

@MDoucha rfan1 we have a monitoring dashboard https://monitor.qa.suse.de/d/EML0bpuGk/monitoring?orgId=1&viewPanel=4 that shows potential packet loss in our network between different hosts. Please take a look to see if it can help you in this specific case.

Also please be aware of #113701, within the scope of the qe-core team, to move openQA workers back from openqaworker2 to grenache-1. This would likely help you. You should also be able to improve the test code related to the SSH connections to improve stability.

#10 Updated by MDoucha about 1 month ago

The s390x SSH serial terminal issues continue in the latest batch of livepatches. Given that these issues keep happening at the exact same point in the test, it is unlikely that they are caused by DHCP or any other network issue. So could someone please look at any system logs in the s390x SSH setup that are not available through the openQA web API?
https://openqa.suse.de/tests/9732748#step/update_kernel/338
https://openqa.suse.de/tests/9731811#step/update_kernel/337
https://openqa.suse.de/tests/9731442#step/update_kernel/337

#11 Updated by MDoucha about 1 month ago

  • Status changed from Feedback to New

#12 Updated by okurz about 1 month ago

  • Subject changed from SSH serial terminal connection issues on s390x workers to [qe-core][s390x] SSH serial terminal connection issues on s390x workers
  • Assignee deleted (cdywan)
  • Priority changed from Normal to High
  • Target version changed from Ready to future

Looking at
https://monitor.qa.suse.de/d/EML0bpuGk/monitoring?orgId=1&viewPanel=4&from=1665746933519&to=1665779040310
for https://openqa.suse.de/tests/9732748#step/update_kernel/338 I can't find any significant packet drops.

@qe-core, as maintainers of the s390x hypervisor hosts, please take over and look into the suggestions from #116287#note-10

#13 Updated by MDoucha 13 days ago

  • Priority changed from High to Urgent

The issue is still present in the current s390x livepatch batch. Could someone finally check the s390x SSH logs so that I can report the bug into Bugzilla?
https://openqa.suse.de/tests/9939455#step/update_kernel/337
https://openqa.suse.de/tests/9939473#step/update_kernel/337

#14 Updated by rfan1 13 days ago

MDoucha wrote:

The issue is still present in the current s390x livepatch batch. Could someone finally check the s390x SSH logs so that I can report the bug into Bugzilla?
https://openqa.suse.de/tests/9939455#step/update_kernel/337
https://openqa.suse.de/tests/9939473#step/update_kernel/337

Hello @MDoucha

I noticed the job failed with: Test died: command 'lsinitrd /boot/initrd-5.3.18-24.99-default | grep patch' timed out at /usr/lib/os-autoinst/testapi.pm line 969.

Since the lsinitrd command prints many lines to the serial terminal, can you please increase the timeout for https://openqa.suse.de/tests/9939473/modules/update_kernel/steps/1/src?

 script_run("lsinitrd /boot/initrd-$kver-default | grep patch");

#15 Updated by MDoucha 12 days ago

rfan1 wrote:

I noticed the job failed with: Test died: command 'lsinitrd /boot/initrd-5.3.18-24.99-default | grep patch' timed out at /usr/lib/os-autoinst/testapi.pm line 969.

Since the lsinitrd command prints many lines to the serial terminal, can you please increase the timeout for https://openqa.suse.de/tests/9939473/modules/update_kernel/steps/1/src?

 script_run("lsinitrd /boot/initrd-$kver-default | grep patch");

In a successful job, the lsinitrd command takes 4 seconds. This is not a timeout issue, the SSH connection between openqaworker2 and the s390x host failed.

#16 Updated by rfan1 10 days ago

MDoucha wrote:

rfan1 wrote:

I noticed the job failed with: Test died: command 'lsinitrd /boot/initrd-5.3.18-24.99-default | grep patch' timed out at /usr/lib/os-autoinst/testapi.pm line 969.

Since the lsinitrd command prints many lines to the serial terminal, can you please increase the timeout for https://openqa.suse.de/tests/9939473/modules/update_kernel/steps/1/src?

 script_run("lsinitrd /boot/initrd-$kver-default | grep patch");

In a successful job, the lsinitrd command takes 4 seconds. This is not a timeout issue, the SSH connection between openqaworker2 and the s390x host failed.

I tried to investigate this issue further: https://openqa.suse.de/tests/9964578#step/update_kernel/337

It seems only part of the serial output ("kernel-livepatch") was shown there, and then the job failed due to no match of the return code marker "plS8Z-$?-".

However, I can see everything worked fine in the serial terminal:
https://openqa.suse.de/tests/9964578/logfile?filename=serial_terminal.txt

# lsinitrd /boot/initrd-5.3.18-24.99-default | grep patch; echo plS8Z-$?-
kernel-livepatch
-rwxr-xr-x   1 root     root          467 Aug  1 10:04 lib/dracut/hooks/pre-pivot/99-kernel-livepatch.sh
drwxr-xr-x   1 root     root            0 Nov 15 11:16 lib/modules/5.3.18-24.99-default/livepatch
-rw-r--r--   1 root     root       283936 Nov 11 09:13 lib/modules/5.3.18-24.99-default/livepatch/livepatch-16-150200_2_2.ko
plS8Z-0-

So I don't think the SSH connection between the worker and the s390x SUT is lost :)

So, I would still suggest adding some timeout for the cmd below (I don't know if it will fix the issue, but it's worth a try):
script_run("lsinitrd /boot/initrd-$kver-default | grep patch");
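For context, the "plS8Z-$?-" line in the serial excerpt is the completion marker os-autoinst appends after a serial command: it echoes a unique per-job string together with the command's exit status and then waits for that marker on the serial terminal. A minimal plain-shell sketch of the pattern (the marker value and the grep input here are stand-ins for illustration, not the real test code):

```shell
#!/bin/sh
# Sketch of the serial-terminal completion check used by os-autoinst:
# run the command, then echo a unique marker with its exit status embedded.
marker="plS8Z"                        # stand-in for the per-job random marker
echo "kernel-livepatch" | grep patch  # stand-in for the lsinitrd pipeline
echo "${marker}-$?-"                  # the harness waits for e.g. "plS8Z-0-"
```

If that marker line never appears in the captured serial output, the harness times out even though the command itself may have completed on the SUT, which would match what serial_terminal.txt shows here.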

#17 Updated by szarate 9 days ago

  • Status changed from New to Feedback

Martin, can you also prepend the time command? If this is taking long enough that script_run is failing, then the problem might actually be in the storage (fwiw, the storage on s390 is remote and should be fast enough that this doesn't happen). I would say anything over 5 seconds is too much.

time (lsinitrd /boot/initrd | grep zfs); echo $?

real    0m0.416s
user    0m0.278s
sys 0m0.311s
1

#18 Updated by szarate 9 days ago

  • Sprint set to QE-Core: November Sprint (Nov 09 - Dec 07)
  • Tags changed from bug, kernel to bug, kernel, bugbusters
  • Assignee set to szarate

#19 Updated by MDoucha 4 days ago

szarate wrote:

Martin, can you also prepend the time command? If this is taking long enough that script_run is failing, then the problem might actually be in the storage (fwiw, the storage on s390 is remote and should be fast enough that this doesn't happen). I would say anything over 5 seconds is too much.

I can, but all livepatch incidents got released last week so I'll have to wait for a new batch in December. Until then I have nothing to test with.

#20 Updated by szarate 3 days ago

  • Due date set to 2023-01-20
  • Priority changed from Urgent to Normal

MDoucha wrote:

szarate wrote:

Martin, can you also prepend the time command? If this is taking long enough that script_run is failing, then the problem might actually be in the storage (fwiw, the storage on s390 is remote and should be fast enough that this doesn't happen). I would say anything over 5 seconds is too much.

I can, but all livepatch incidents got released last week so I'll have to wait for a new batch in December. Until then I have nothing to test with.

sure thing :), changing prio for now
