Broken SSH serial console
Since 2019-10-30, OpenQA tests which print large amounts of output into serial console randomly fail because serial console output gets stuck in a buffer and
wait_serial() times out waiting for test end marker which has been already printed (as seen on VNC screenshots right before the timeout in some jobs). Affected backends: svirt, ipmi
Also reproduced on another OpenQA instance: http://openqa.qam.suse.cz/tests/3472#step/boot_ltp/22
#1 Updated by okurz about 3 years ago
hm, I am not aware of any intended recent changes regarding the serial port. I suggest you try to fill the ticket with the points from the template in https://progress.opensuse.org/projects/openqav3/wiki#Defects and bisect based on the changed involved components, e.g. os-autoinst and openQA, what has changed.
#2 Updated by mkittler about 3 years ago
Maybe this is caused by https://github.com/os-autoinst/os-autoinst/pull/1239. I'll have a look whether I can reproduce it locally.
#3 Updated by acarvajal about 3 years ago
Adding this that was discussed in the chat that could be helpful to debug:
1) Noticed with build 0172@0369 of SLES+HA 12-SP5 on 04.11.2019 that a cluster test (crm script run health) was timing out on s390x over svirt: https://openqa.suse.de/tests/3556626#step/check_logs/2
2) Issue seemed the same: stack trace is being sent over the serial terminal and this does not complete; test never gets the regexp expected by wait_serial() and times out.
3) Same test with the previous build (0170@0368) on 22.10.2019 not only worked, but the test was able to capture a longer stack trace over the serial terminal: https://openqa.suse.de/tests/3529875#step/check_logs/2
4) Re-triggering the tests for build 0172@0369 and monitoring the live view and live log showed the stack trace getting stuck in the serial terminal; moreover, connecting via SSH to both nodes and running a command with an echo to /dev/ttysclp0 while the trace was stuck in the serial terminal, seemed to unstuck the output. Result were successful tests https://openqa.suse.de/tests/3557376 & https://openqa.suse.de/tests/3557377, but with extra texts seen in the serial terminal log produced by the manual echo command.
#4 Updated by mkittler about 3 years ago
- Subject changed from [infra] Broken SSH serial console to Broken SSH serial console
Seems like cloning the jobs from the ticket description isn't trivial. I ran into the problem that HDD images and other assets are missing. I'll try the jobs mentioned by acarvajal but that's yet another
WORKER_CLASS I need to setup locally by borrowing an svirt host from OSD.
Since this happens also on your own openQA instance I think this is likely a bug in os-autoinst and not the infrastructure (hence moving to openQA project). Maybe we need to call
read() in check_serial again so it doesn't rely on being called again to read the remaining bytes in case that there are more than 4048 bytes? When writing the code I assumed that the IO loop via
IO::Select would call the
check_serial function as long as there are bytes to read anyways so I've just added a single
#5 Updated by mkittler about 3 years ago
- Assignee set to mkittler
- Target version set to Current Sprint
That change might fix it: https://github.com/os-autoinst/os-autoinst/pull/1258
I only tested this with 'sle-12-SP5-Server-DVD-s390x-Build0366-installer_extended@s390x-kvm-sle12' so far because I can not run the tests to reproduce the issue locally.
I've been assigning myself but I actually don't know how to reproduce the issue. So help would be welcome. Maybe somebody can test the change/fix?
#7 Updated by okurz about 3 years ago
Testing on openqaworker2:
okurz@openqaworker2:~> md5sum /usr/lib/os-autoinst/backend/baseclass.pm 68ba3ed78bb9cd1e2f95b960bde15c53 /usr/lib/os-autoinst/backend/baseclass.pm okurz@openqaworker2:~> sudo wget https://raw.githubusercontent.com/os-autoinst/os-autoinst/d81c82015d08bf2bb2782a294eb742941ae29f62/backend/baseclass.pm -O /usr/lib/os-autoinst/backend/baseclass.pm --2019-11-07 21:54:47-- https://raw.githubusercontent.com/os-autoinst/os-autoinst/d81c82015d08bf2bb2782a294eb742941ae29f62/backend/baseclass.pm Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 220.127.116.11 Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|18.104.22.168|:443... connected. HTTP request sent, awaiting response... 200 OK Length: 39142 (38K) [text/plain] Saving to: ‘/usr/lib/os-autoinst/backend/baseclass.pm’ /usr/lib/os-autoinst/backend/baseclass.pm 100%[===============================================================================================================================>] 38.22K 207KB/s in 0.2s 2019-11-07 21:54:48 (207 KB/s) - ‘/usr/lib/os-autoinst/backend/baseclass.pm’ saved [39142/39142] okurz@openqaworker2:~> md5sum /usr/lib/os-autoinst/backend/baseclass.pm 3d7b9976d45600819452ed04872d3c39 /usr/lib/os-autoinst/backend/baseclass.pm
Triggered a test https://openqa.suse.de/tests/3568649# which passed so a good sign. second https://openqa.suse.de/tests/3568675# passed. third https://openqa.suse.de/tests/3568676 passed. I also monitored the CPU usage while the jobs were running and it did not reach 100%.
mkittler IMHO you can consider your change as a fix for the regression. We can keep the hotpatch on openqaworker2 for the time being. I recommend you crosscheck some other scenarios for validation before we merge the PR, e.g. other backends also using the method, different scenarios, e.g. an installation one, etc.
#9 Updated by MDoucha about 3 years ago
Seems like cloning the jobs from the ticket description isn't trivial. I ran into the problem that HDD images and other assets are missing.
To clone LTP jobs from OSD, you need to first clone the
install_ltp job for the same build (for the job in issue description, that would be this one: https://openqa.suse.de/tests/3566745). Then it'll work.
#10 Updated by mkittler about 3 years ago
- Project changed from openQA Project to openQA Infrastructure
- Category deleted (
- Status changed from In Progress to New
- Priority changed from High to Normal
@MDoucha I actually tried to clone https://openqa.suse.de/tests/3566745 yesterday but it failed with
[2019-11-07T17:16:19.0305 CET] [error] [pid:26949] HDD_1 handling Cannot find HDD_1 asset hdd/SLES-15-SP1-s390x-minimal_installed_for_LTP.qcow2!
#12 Updated by MDoucha about 3 years ago
@MDoucha I actually tried to clone https://openqa.suse.de/tests/3566745 yesterday but it failed with[2019-11-07T17:16:19.0305 CET] [error] [pid:26949] HDD_1 handling Cannot find HDD_1 asset hdd/SLES-15-SP1-s390x-minimal_installed_for_LTP.qcow2!
You can download it manually from the "Logs & Assets" tab just fine.
#14 Updated by okurz about 3 years ago
- Status changed from Feedback to Resolved
was deployed this morning to OSD: https://gitlab.suse.de/openqa/osd-deployment/-/jobs/138811
tests seem to be good: https://openqa.suse.de/tests/3586587