action #116287

[qe-core][s390x] SSH serial terminal connection issues on s390x workers

Added by MDoucha 3 months ago. Updated 3 days ago.

Status:
Feedback
Priority:
Normal
Assignee:
Category:
Bugs in existing tests
Target version:
Start date:
2022-09-06
Due date:
2023-01-20
% Done:

0%

Estimated time:
Difficulty:

Description

s390x livepatch tests had a lot of installation failures this month due to SSH serial terminal connection failures. Interestingly, the connection failures seem to happen around the same module step. The serial_terminal.txt output appears to be out of sync with the terminal because part of the commands and their output is missing even though it is listed in the update_kernel module details. The dmesg output in serial0.txt often (but not always) shows a key exchange SSH error followed by output from a completely different job:

Welcome to SUSE Linux Enterprise Server 15 SP2  (s390x) - Kernel 5.3.18-24.83-default (ttysclp0).

eth0: 10.161.145.86 fe80::5054:ff:fe84:f877


susetest login: root

Password: 
Last login: Mon Sep  5 10:18:10 from 10.160.0.147
susetest:~ # systemctl is-active network
active
susetest:~ # systemctl is-active sshd
active
susetest:~ # 2022-09-05T10:25:03.604370-04:00 susetest sshd[4272]: error: kex_exchange_identification: Connection closed by remote host
2022-09-05T10:25:04.844743-04:00 susetest sshd[4273]: error: kex_exchange_identification: Connection closed by remote host
[  107.444474] LTP: starting DI000 (dirty)

[  107.445525] LTP: starting DS000 (dio_sparse)

[  107.466125] LTP: starting abort01

[  107.758318] LTP: starting accept01

12-SP4: https://openqa.suse.de/tests/9438804#step/update_kernel/337
15-SP2: https://openqa.suse.de/tests/9457752#step/update_kernel/337
15-SP3: https://openqa.suse.de/tests/9458645#step/update_kernel/337
15-SP4: https://openqa.suse.de/tests/9455666#step/update_kernel/199

I could not find any such connection failure on SLE-12SP5. Other SLE releases don't support s390x livepatches, and KOTD tests don't show this kind of issue. This looks like a kernel bug, but I'd like an s390x expert to look at this before I create a Bugzilla ticket. And of course this has exposed logging issues in openQA.


Related issues

Related to openQA Infrastructure - action #113701: [qe-core] Move workers back to grenache

History

#1 Updated by MDoucha 3 months ago

Also note that I could not find any examples of this issue when we used grenache as the intermediary worker. This issue started happening after the switch to openqaworker2.

#2 Updated by cdywan 3 months ago

MDoucha wrote:

serial_terminal.txt output appears to be out of sync with the terminal because part of the commands and output is missing even though it's listed in the update_kernel module details

This sounds like the serial terminal is not updated correctly

The dmesg output in serial0.txt often (but not always) shows some key exchange SSH error followed by output from a completely different job:

If somehow the serial terminal from a different job gets connected, it could explain the former

This looks like a kernel bug but I'd like an s390x expert to look at this before I create a Bugzilla ticket.

Maybe we need a separate ticket, but let's see who can help us identify what the issue is first.

#3 Updated by rfan1 3 months ago

Hello cdywan @MDoucha

I believe the issue was caused by my tests in https://openqa.suse.de/tests/overview?groupid=220.

I published some qcow2 images with a static IP in the AutoYaST configuration files, which led to lots of issues. I have switched to DHCP since Sep 8th.

Sorry about this!

#4 Updated by cdywan 3 months ago

  • Project changed from openQA Infrastructure to openQA Tests
  • Category set to Bugs in existing tests
  • Status changed from New to Feedback
  • Assignee set to cdywan
  • Target version set to Ready

Okay. So with the update this should resolve itself. Let's see.

#5 Updated by openqa_review 2 months ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: migration_online_verify_sle15sp2_ltss_ha_alpha_node02
https://openqa.suse.de/tests/9516680#step/setup_hosts_and_luns/1

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.

#6 Updated by cdywan about 2 months ago

Pinged rfan1 in chat

#7 Updated by cdywan about 2 months ago

openqa_review wrote:

This bug is still referenced in a failing openQA test: migration_online_verify_sle15sp2_ltss_ha_alpha_node02
https://openqa.suse.de/tests/9516680#step/setup_hosts_and_luns/1

Richard confirmed the DNS configuration issues have been resolved.

I see this in serial0.txt:

kex_exchange_identification: Connection closed by remote host
2022-09-16T00:44:58.436522-04:00 alpha-node02 sshd[1560]: error: kex_exchange_identification: Connection closed by remote host

Followed shortly after by this in autoinst-log.txt:

[2022-09-16T06:48:17.182729+02:00] [info] ::: basetest::runtest: # Test died: command 'mount -t nfs 1c119.qa.suse.de:/srv/nfs/ha_ss /support_fs' timed out at /usr/lib/os-autoinst/testapi.pm line 919.

#8 Updated by okurz about 2 months ago

  • Related to action #113701: [qe-core] Move workers back to grenache added

#9 Updated by okurz about 2 months ago

@MDoucha rfan1 we have a monitoring dashboard https://monitor.qa.suse.de/d/EML0bpuGk/monitoring?orgId=1&viewPanel=4 that shows potential packet loss in our network between different hosts. Please take a look to see if it can help you in this specific case.

Also please be aware of #113701, within the scope of the qe-core team, to move openQA workers back from openqaworker2 to grenache-1. This would likely help you. You should also be able to improve the test code related to the SSH connections to improve stability.

#10 Updated by MDoucha about 1 month ago

The s390x SSH serial terminal issues continue in the latest batch of livepatches. Given that these issues keep happening at the exact same point in the test, it is unlikely that they are caused by DHCP or any other network issue. So could someone please look at any system logs in the s390x SSH setup that are not available through the openQA web API?
https://openqa.suse.de/tests/9732748#step/update_kernel/338
https://openqa.suse.de/tests/9731811#step/update_kernel/337
https://openqa.suse.de/tests/9731442#step/update_kernel/337

#11 Updated by MDoucha about 1 month ago

  • Status changed from Feedback to New

#12 Updated by okurz about 1 month ago

  • Subject changed from SSH serial terminal connection issues on s390x workers to [qe-core][s390x] SSH serial terminal connection issues on s390x workers
  • Assignee deleted (cdywan)
  • Priority changed from Normal to High
  • Target version changed from Ready to future

Looking at
https://monitor.qa.suse.de/d/EML0bpuGk/monitoring?orgId=1&viewPanel=4&from=1665746933519&to=1665779040310
for https://openqa.suse.de/tests/9732748#step/update_kernel/338 I can't find any significant packet drops.

@qe-core, as maintainers of the s390x hypervisor hosts, please take over and look into the suggestions from #116287#note-10

#13 Updated by MDoucha 13 days ago

  • Priority changed from High to Urgent

The issue is still present in the current s390x livepatch batch. Could someone finally check the s390x SSH logs so that I can report the bug into Bugzilla?
https://openqa.suse.de/tests/9939455#step/update_kernel/337
https://openqa.suse.de/tests/9939473#step/update_kernel/337

#14 Updated by rfan1 13 days ago

MDoucha wrote:

The issue is still present in the current s390x livepatch batch. Could someone finally check the s390x SSH logs so that I can report the bug into Bugzilla?
https://openqa.suse.de/tests/9939455#step/update_kernel/337
https://openqa.suse.de/tests/9939473#step/update_kernel/337

Hello @MDoucha

I noticed the job failed with: Test died: command 'lsinitrd /boot/initrd-5.3.18-24.99-default | grep patch' timed out at /usr/lib/os-autoinst/testapi.pm line 969.

Since the lsinitrd command prints many lines to the serial terminal, can you please increase the timeout for https://openqa.suse.de/tests/9939473/modules/update_kernel/steps/1/src?

 script_run("lsinitrd /boot/initrd-$kver-default | grep patch");

#15 Updated by MDoucha 12 days ago

rfan1 wrote:

I noticed the job failed with: Test died: command 'lsinitrd /boot/initrd-5.3.18-24.99-default | grep patch' timed out at /usr/lib/os-autoinst/testapi.pm line 969.

Since the lsinitrd command prints many lines to the serial terminal, can you please increase the timeout for https://openqa.suse.de/tests/9939473/modules/update_kernel/steps/1/src?

 script_run("lsinitrd /boot/initrd-$kver-default | grep patch");

In a successful job, the lsinitrd command takes 4 seconds. This is not a timeout issue, the SSH connection between openqaworker2 and the s390x host failed.

#16 Updated by rfan1 10 days ago

MDoucha wrote:

rfan1 wrote:

I noticed the job failed with: Test died: command 'lsinitrd /boot/initrd-5.3.18-24.99-default | grep patch' timed out at /usr/lib/os-autoinst/testapi.pm line 969.

Since the lsinitrd command prints many lines to the serial terminal, can you please increase the timeout for https://openqa.suse.de/tests/9939473/modules/update_kernel/steps/1/src?

 script_run("lsinitrd /boot/initrd-$kver-default | grep patch");

In a successful job, the lsinitrd command takes 4 seconds. This is not a timeout issue, the SSH connection between openqaworker2 and the s390x host failed.

I tried to investigate this issue further: https://openqa.suse.de/tests/9964578#step/update_kernel/337

It seems only part of the serial output ("kernel-livepatch") was shown there, and then the job failed due to no match of the return code marker "plS8Z-$?-".

However, I can see everything worked fine in the serial terminal:
https://openqa.suse.de/tests/9964578/logfile?filename=serial_terminal.txt

# lsinitrd /boot/initrd-5.3.18-24.99-default | grep patch; echo plS8Z-$?-
kernel-livepatch
-rwxr-xr-x   1 root     root          467 Aug  1 10:04 lib/dracut/hooks/pre-pivot/99-kernel-livepatch.sh
drwxr-xr-x   1 root     root            0 Nov 15 11:16 lib/modules/5.3.18-24.99-default/livepatch
-rw-r--r--   1 root     root       283936 Nov 11 09:13 lib/modules/5.3.18-24.99-default/livepatch/livepatch-16-150200_2_2.ko
plS8Z-0-

So I don't think the SSH connection between the worker and the s390x SUT is lost :)

So, I would still suggest adding some timeout for the cmd below (I don't know if it will fix the issue, but it's worth a try):
script_run("lsinitrd /boot/initrd-$kver-default | grep patch");
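For context, the "plS8Z-$?-" line in the serial excerpt is the completion marker os-autoinst appends after a serial command: it echoes a unique per-job string together with the command's exit status and then waits for that marker on the serial terminal. A minimal plain-shell sketch of the pattern (the marker value and the grep input here are stand-ins for illustration, not the real test code):

```shell
#!/bin/sh
# Sketch of the serial-terminal completion check used by os-autoinst:
# run the command, then echo a unique marker with its exit status embedded.
marker="plS8Z"                        # stand-in for the per-job random marker
echo "kernel-livepatch" | grep patch  # stand-in for the lsinitrd pipeline
echo "${marker}-$?-"                  # the harness waits for e.g. "plS8Z-0-"
```

If that marker line never appears in the captured serial output, the harness times out even though the command itself may have completed on the SUT, which would match what serial_terminal.txt shows here.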

#17 Updated by szarate 9 days ago

  • Status changed from New to Feedback

Martin, can you also prepend the time command? If this is taking long enough that script_run is failing, then the problem might actually be in the storage (fwiw, the storage on s390 is remote and should be fast enough that this doesn't happen). I would say anything over 5 seconds is too much.

time (lsinitrd /boot/initrd | grep zfs); echo $?

real    0m0.416s
user    0m0.278s
sys 0m0.311s
1

#18 Updated by szarate 9 days ago

  • Sprint set to QE-Core: November Sprint (Nov 09 - Dec 07)
  • Tags changed from bug, kernel to bug, kernel, bugbusters
  • Assignee set to szarate

#19 Updated by MDoucha 4 days ago

szarate wrote:

Martin, can you also prepend the time command? If this is taking long enough that script_run is failing, then the problem might actually be in the storage (fwiw, the storage on s390 is remote and should be fast enough that this doesn't happen). I would say anything over 5 seconds is too much.

I can, but all livepatch incidents got released last week so I'll have to wait for a new batch in December. Until then I have nothing to test with.

#20 Updated by szarate 3 days ago

  • Due date set to 2023-01-20
  • Priority changed from Urgent to Normal

MDoucha wrote:

szarate wrote:

Martin, can you also prepend the time command? If this is taking long enough that script_run is failing, then the problem might actually be in the storage (fwiw, the storage on s390 is remote and should be fast enough that this doesn't happen). I would say anything over 5 seconds is too much.

I can, but all livepatch incidents got released last week so I'll have to wait for a new batch in December. Until then I have nothing to test with.

sure thing :), changing prio for now
