action #116287
closed[qe-core][s390x] SSH serial terminal connection issues on s390x workers
0%
Description
s390x livepatch tests had a lot of installation failures this month due to SSH serial terminal connection failures. Interestingly enough, the connection failures seem to happen around the same module step. serial_terminal.txt output appears to be out of sync with the terminal because part of the commands and output is missing even though it's listed in the update_kernel module details. The dmesg output in serial0.txt often (but not always) shows some key exchange SSH error followed by output from a completely different job:
Welcome to SUSE Linux Enterprise Server 15 SP2 (s390x) - Kernel 5.3.18-24.83-default (ttysclp0).
eth0: 10.161.145.86 fe80::5054:ff:fe84:f877
susetest login: root
Password:
Last login: Mon Sep 5 10:18:10 from 10.160.0.147
susetest:~ #(B systemctl is-active network
active
susetest:~ #(B systemctl is-active sshd
active
susetest:~ #(B 2022-09-05T10:25:03.604370-04:00 susetest sshd[4272]: error: kex_exchange_identification: Connection closed by remote host
2022-09-05T10:25:04.844743-04:00 susetest sshd[4273]: error: kex_exchange_identification: Connection closed by remote host
[ 107.444474] LTP: starting DI000 (dirty)
[ 107.445525] LTP: starting DS000 (dio_sparse)
[ 107.466125] LTP: starting abort01
[ 107.758318] LTP: starting accept01
12-SP4: https://openqa.suse.de/tests/9438804#step/update_kernel/337
15-SP2: https://openqa.suse.de/tests/9457752#step/update_kernel/337
15-SP3: https://openqa.suse.de/tests/9458645#step/update_kernel/337
15-SP4: https://openqa.suse.de/tests/9455666#step/update_kernel/199
I could not find any such connection failure on SLE-12SP5. Other SLE releases don't support s390x livepatches and KOTD tests don't show this kind of issue. This looks like a kernel bug but I'd like an s390x expert to look at this before I create a Bugzilla ticket. And of course this has exposed logging issues in OpenQA.
Updated by MDoucha about 2 years ago
Also note, I could not find any examples of this issue when we used grenache as the intermediary worker. This issue started happening after the switch to openqaworker2.
Updated by livdywan about 2 years ago
MDoucha wrote:
serial_terminal.txt output appears to be out of sync with the terminal because part of the commands and output is missing even though it's listed in the update_kernel module details
This sounds like the serial terminal is not updated correctly
The dmesg output in serial0.txt often (but not always) shows some key exchange SSH error followed by output from a completely different job:
If somehow the serial terminal from a different job gets connected it could explain the former
This looks like a kernel bug but I'd like an s390x expert to look at this before I create a Bugzilla ticket.
Maybe we need a separate ticket, but let's see who can help us identify what the issue is first.
Updated by rfan1 about 2 years ago
Hello @cdywan @MDoucha
I believe the issue should be caused by my tests in https://openqa.suse.de/tests/overview?groupid=220.
I published some qcow2 images with static ip
in autoyast configuration files. and leaded to lots of issues. I have switched to dhcp
from Sep 8th.
Sorry about this!
Updated by livdywan about 2 years ago
- Project changed from openQA Infrastructure to openQA Tests
- Category set to Bugs in existing tests
- Status changed from New to Feedback
- Assignee set to livdywan
- Target version set to Ready
Okay. So with the updated this should resolve itself. Let'see
Updated by openqa_review about 2 years ago
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: migration_online_verify_sle15sp2_ltss_ha_alpha_node02
https://openqa.suse.de/tests/9516680#step/setup_hosts_and_luns/1
To prevent further reminder comments one of the following options should be followed:
- The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
- The openQA job group is moved to "Released" or "EOL" (End-of-Life)
- The bugref in the openQA scenario is removed or replaced, e.g.
label:wontfix:boo1234
Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.
Updated by livdywan about 2 years ago
openqa_review wrote:
This bug is still referenced in a failing openQA test: migration_online_verify_sle15sp2_ltss_ha_alpha_node02
https://openqa.suse.de/tests/9516680#step/setup_hosts_and_luns/1
Richard confirmed the DNS configuration issues have been resolved.
I see this in serial0.txt:
kex_exchange_identification: Connection closed by remote host
2022-09-16T00:44:58.436522-04:00 alpha-node02 sshd[1560]: error: kex_exchange_identification: Connection closed by remote host
Followed shortly after by this in autoinst-log.txt)::
[2022-09-16T06:48:17.182729+02:00] [info] ::: basetest::runtest: # Test died: command 'mount -t nfs 1c119.qa.suse.de:/srv/nfs/ha_ss /support_fs' timed out at /usr/lib/os-autoinst/testapi.pm line 919.
Updated by okurz about 2 years ago
- Related to action #113701: [qe-core] Move workers back to grenache added
Updated by okurz about 2 years ago
@MDoucha @rfan1 we have a monitoring dashboard https://monitor.qa.suse.de/d/EML0bpuGk/monitoring?orgId=1&viewPanel=4 that shows potential packet losses in our network between different hosts. Please take a look if it can help you in this special case.
Also please aware about #113701 within the scope of qe-core team to move back openQA workers from openqaworker2 to grenache-1. This would likely help you. Also you should be able to improve the test code related to the ssh connections to improve stability.
Updated by MDoucha almost 2 years ago
The s390x SSH serial terminal issues continue in the latest batch of livepatches. Given that these issues keep happening at the exact same point in the test, it was unlikely to be caused by DHCP or any other network issue. So could someone please look at any sytem logs in the s390x SSH setup that are not available through OpenQA webAPI?
https://openqa.suse.de/tests/9732748#step/update_kernel/338
https://openqa.suse.de/tests/9731811#step/update_kernel/337
https://openqa.suse.de/tests/9731442#step/update_kernel/337
Updated by okurz almost 2 years ago
- Subject changed from SSH serial terminal connection issues on s390x workers to [qe-core][s390x] SSH serial terminal connection issues on s390x workers
- Assignee deleted (
livdywan) - Priority changed from Normal to High
- Target version changed from Ready to future
Looking at
https://monitor.qa.suse.de/d/EML0bpuGk/monitoring?orgId=1&viewPanel=4&from=1665746933519&to=1665779040310
for https://openqa.suse.de/tests/9732748#step/update_kernel/338 I can't find any significant packet drops.
@qe-core as maintainers of the s390x hypervisor hosts please take over to look into the suggestions from #116287#note-10
Updated by MDoucha almost 2 years ago
- Priority changed from High to Urgent
The issue is still present in the current s390x livepatch batch. Could someone finally check the s390x SSH logs so that I can report the bug into Bugzilla?
https://openqa.suse.de/tests/9939455#step/update_kernel/337
https://openqa.suse.de/tests/9939473#step/update_kernel/337
Updated by rfan1 almost 2 years ago
MDoucha wrote:
The issue is still present in the current s390x livepatch batch. Could someone finally check the s390x SSH logs so that I can report the bug into Bugzilla?
https://openqa.suse.de/tests/9939455#step/update_kernel/337
https://openqa.suse.de/tests/9939473#step/update_kernel/337
Hello @MDoucha,
I noticed the job failed at Test died: command 'lsinitrd /boot/initrd-5.3.18-24.99-default | grep patch' timed out at /usr/lib/os-autoinst/testapi.pm line 969.
Since lsinitrd
command will print many lines in serial terminal, can you please increase some timeout for https://openqa.suse.de/tests/9939473/modules/update_kernel/steps/1/src?
script_run("lsinitrd /boot/initrd-$kver-default | grep patch");
Updated by MDoucha almost 2 years ago
rfan1 wrote:
I noticed the job failed at
Test died: command 'lsinitrd /boot/initrd-5.3.18-24.99-default | grep patch' timed out at /usr/lib/os-autoinst/testapi.pm line 969.
Since
lsinitrd
command will print many lines in serial terminal, can you please increase some timeout for https://openqa.suse.de/tests/9939473/modules/update_kernel/steps/1/src?script_run("lsinitrd /boot/initrd-$kver-default | grep patch");
In a successful job, the lsinitrd command takes 4 seconds. This is not a timeout issue, the SSH connection between openqaworker2 and the s390x host failed.
Updated by rfan1 almost 2 years ago
MDoucha wrote:
rfan1 wrote:
I noticed the job failed at
Test died: command 'lsinitrd /boot/initrd-5.3.18-24.99-default | grep patch' timed out at /usr/lib/os-autoinst/testapi.pm line 969.
Since
lsinitrd
command will print many lines in serial terminal, can you please increase some timeout for https://openqa.suse.de/tests/9939473/modules/update_kernel/steps/1/src?script_run("lsinitrd /boot/initrd-$kver-default | grep patch");
In a successful job, the lsinitrd command takes 4 seconds. This is not a timeout issue, the SSH connection between openqaworker2 and the s390x host failed.
I tried to investigate more on this issue, https://openqa.suse.de/tests/9964578#step/update_kernel/337:
Seems only part of the serial outputs [“kernel-livepatch"] were shown there, and then job failed due to no match of return code "plS8Z-$?-"
However, I can see everything worked fine in serial terminal:
https://openqa.suse.de/tests/9964578/logfile?filename=serial_terminal.txt
# lsinitrd /boot/initrd-5.3.18-24.99-default | grep patch; echo plS8Z-$?-
kernel-livepatch
-rwxr-xr-x 1 root root 467 Aug 1 10:04 lib/dracut/hooks/pre-pivot/99-kernel-livepatch.sh
drwxr-xr-x 1 root root 0 Nov 15 11:16 lib/modules/5.3.18-24.99-default/livepatch
-rw-r--r-- 1 root root 283936 Nov 11 09:13 lib/modules/5.3.18-24.99-default/livepatch/livepatch-16-150200_2_2.ko
plS8Z-0-
Then I don't think the ssh connection between worker and s390x SUT is lost :)
So, I would still suggest adding sometime out for cmd below [I don't know if it can fix the issue, but worth to have a try]
script_run("lsinitrd /boot/initrd-$kver-default | grep patch");
Updated by szarate almost 2 years ago
- Status changed from New to Feedback
Martin, can you also prepend the time
command? if this is taking long enough that script_run is failing, then the problem might be actually on the storage (fwiw, the storage on s390 is remote, amd should be fast enough so that this doesn´t happen), I would say anything over 5 second, is too much.
time (lsinitrd /boot/initrd | grep zfs); echo $?
real 0m0.416s
user 0m0.278s
sys 0m0.311s
1
Updated by szarate almost 2 years ago
- Sprint set to QE-Core: November Sprint (Nov 09 - Dec 07)
- Tags changed from bug, kernel to bug, kernel, bugbusters
- Assignee set to szarate
Updated by MDoucha almost 2 years ago
szarate wrote:
Martin, can you also prepend the
time
command? if this is taking long enough that script_run is failing, then the problem might be actually on the storage (fwiw, the storage on s390 is remote, amd should be fast enough so that this doesn´t happen), I would say anything over 5 second, is too much.
I can, but all livepatch incidents got released last week so I'll have to wait for a new batch in December. Until then I have nothing to test with.
Updated by szarate almost 2 years ago
- Due date set to 2023-01-20
- Priority changed from Urgent to Normal
MDoucha wrote:
szarate wrote:
Martin, can you also prepend the
time
command? if this is taking long enough that script_run is failing, then the problem might be actually on the storage (fwiw, the storage on s390 is remote, amd should be fast enough so that this doesn´t happen), I would say anything over 5 second, is too much.I can, but all livepatch incidents got released last week so I'll have to wait for a new batch in December. Until then I have nothing to test with.
sure thing :), changing prio for now
Updated by szarate almost 2 years ago
- Sprint changed from QE-Core: November Sprint (Nov 09 - Dec 07) to QE-Core: December Sprint (Dec 07 - Jan 04)
Updated by szarate almost 2 years ago
- Sprint changed from QE-Core: December Sprint (Dec 07 - Jan 11) to QE-Core: January Sprint (Jan 11 - Feb 08)
Updated by szarate almost 2 years ago
- Sprint changed from QE-Core: January Sprint (Jan 11 - Feb 08) to QE-Core: December Sprint (Dec 07 - Jan 11)
Updated by slo-gin over 1 year ago
This ticket is 10 days after the due-date. Please consider closing this ticket or move the due-date accordingly.
Updated by MDoucha over 1 year ago
- Status changed from Feedback to Rejected
It turns out that the timeouts are caused by stress tests which should get terminated when the livepatch installation finishes but instead they kept running and slowed other processes down to a crawl. This PR should fix it:
https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/16489