action #59190
Broken SSH serial console (closed)
Added by MDoucha about 5 years ago. Updated about 5 years ago.
Description
Since 2019-10-30, openQA tests which print large amounts of output to the serial console randomly fail: the serial console output gets stuck in a buffer and wait_serial()
times out waiting for the test end marker, even though the marker has already been printed (as seen on VNC screenshots taken right before the timeout in some jobs). Affected backends: svirt, ipmi
Example: https://openqa.suse.de/tests/3566747#step/trace_sched01/2
Also reproduced on another OpenQA instance: http://openqa.qam.suse.cz/tests/3472#step/boot_ltp/22
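For context, the affected tests follow the usual os-autoinst pattern of echoing an end marker to the serial device and then blocking in wait_serial() until that marker shows up. A minimal sketch of that pattern, assuming the standard testapi helpers (wait_serial, type_string, $serialdev); the workload command, marker text and timeout are made up for illustration:

use testapi;

sub run {
    # Run the workload and print a unique end marker to the serial device
    # once it finishes (command and marker are hypothetical).
    type_string("runltp -f sched; echo TEST-FINISHED-\$? > /dev/$serialdev\n");

    # With the buffering bug, this times out even though the marker is
    # already visible on the VNC screenshot of the console.
    wait_serial(qr/TEST-FINISHED-\d+/, 300)
        or die 'test end marker never arrived on the serial console';
}

1;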
Updated by okurz about 5 years ago
hm, I am not aware of any intended recent changes regarding the serial port. I suggest you fill the ticket with the points from the template in https://progress.opensuse.org/projects/openqav3/wiki#Defects and bisect over the involved components, e.g. os-autoinst and openQA, to find out what has changed.
Updated by mkittler about 5 years ago
Maybe this is caused by https://github.com/os-autoinst/os-autoinst/pull/1239. I'll have a look whether I can reproduce it locally.
Updated by acarvajal about 5 years ago
Adding some details that were discussed in chat and could be helpful for debugging:
1) Noticed with build 0172@0369 of SLES+HA 12-SP5 on 04.11.2019 that a cluster test (crm script run health) was timing out on s390x over svirt: https://openqa.suse.de/tests/3556626#step/check_logs/2
2) The issue seemed to be the same: a stack trace is sent over the serial terminal but never completes; the test never sees the regexp expected by wait_serial() and times out.
3) Same test with the previous build (0170@0368) on 22.10.2019 not only worked, but the test was able to capture a longer stack trace over the serial terminal: https://openqa.suse.de/tests/3529875#step/check_logs/2
4) Re-triggering the tests for build 0172@0369 and monitoring the live view and live log showed the stack trace getting stuck in the serial terminal. Moreover, connecting via SSH to both nodes and running a command with an echo to /dev/ttysclp0 while the trace was stuck seemed to unstick the output. The result was two successful tests, https://openqa.suse.de/tests/3557376 & https://openqa.suse.de/tests/3557377, but with extra text in the serial terminal log produced by the manual echo command.
Updated by mkittler about 5 years ago
- Subject changed from [infra] Broken SSH serial console to Broken SSH serial console
Seems like cloning the jobs from the ticket description isn't trivial. I ran into the problem that HDD images and other assets are missing. I'll try the jobs mentioned by @acarvajal, but that's yet another WORKER_CLASS I need to set up locally by borrowing an svirt host from OSD.
Since this also happens on your own openQA instance, I think this is likely a bug in os-autoinst and not in the infrastructure (hence moving to the openQA project). Maybe we need to call read() in check_serial again so it doesn't rely on being called another time to read the remaining bytes in case there are more than 4048 bytes? When writing the code I assumed that the IO loop via IO::Select would call the check_serial function as long as there are bytes to read anyway, so I just added a single read() call.
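To make that suspicion concrete, here is an illustration of the difference between the two approaches. This is not the actual os-autoinst code; the handle, the callback and the buffer size constant are assumptions.

use strict;
use warnings;
use Errno qw(EAGAIN EWOULDBLOCK);

use constant BUFFER_SIZE => 4096;    # assumed; the exact value does not matter here

# One read per readiness notification: anything beyond one buffer stays
# queued until the select loop happens to report the handle readable again.
sub check_serial_single_read {
    my ($fh, $consume) = @_;
    my $read = sysread($fh, my $buffer, BUFFER_SIZE);
    $consume->($buffer) if $read;
    return $read;
}

# Drain loop: keep reading from the non-blocking handle until it is empty,
# so a single readiness notification is enough to flush everything.
sub check_serial_drain {
    my ($fh, $consume) = @_;
    while (1) {
        my $read = sysread($fh, my $buffer, BUFFER_SIZE);
        if (!defined $read) {
            last if $! == EAGAIN || $! == EWOULDBLOCK;    # nothing left for now
            die "read error: $!";
        }
        last if $read == 0;    # EOF
        $consume->($buffer);
    }
    return 1;
}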
Updated by mkittler about 5 years ago
- Assignee set to mkittler
- Target version set to Current Sprint
That change might fix it: https://github.com/os-autoinst/os-autoinst/pull/1258
I only tested this with 'sle-12-SP5-Server-DVD-s390x-Build0366-installer_extended@s390x-kvm-sle12' so far because I cannot run the tests that reproduce the issue locally.
I have assigned myself, but I actually don't know how to reproduce the issue, so help would be welcome. Maybe somebody can test the change/fix?
Updated by okurz about 5 years ago
- Project changed from openQA Infrastructure (public) to openQA Project (public)
- Category set to Regressions/Crashes
Updated by okurz about 5 years ago
Testing on openqaworker2:
okurz@openqaworker2:~> md5sum /usr/lib/os-autoinst/backend/baseclass.pm
68ba3ed78bb9cd1e2f95b960bde15c53 /usr/lib/os-autoinst/backend/baseclass.pm
okurz@openqaworker2:~> sudo wget https://raw.githubusercontent.com/os-autoinst/os-autoinst/d81c82015d08bf2bb2782a294eb742941ae29f62/backend/baseclass.pm -O /usr/lib/os-autoinst/backend/baseclass.pm
--2019-11-07 21:54:47-- https://raw.githubusercontent.com/os-autoinst/os-autoinst/d81c82015d08bf2bb2782a294eb742941ae29f62/backend/baseclass.pm
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.40.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.40.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 39142 (38K) [text/plain]
Saving to: ‘/usr/lib/os-autoinst/backend/baseclass.pm’
/usr/lib/os-autoinst/backend/baseclass.pm 100%[===============================================================================================================================>] 38.22K 207KB/s in 0.2s
2019-11-07 21:54:48 (207 KB/s) - ‘/usr/lib/os-autoinst/backend/baseclass.pm’ saved [39142/39142]
okurz@openqaworker2:~> md5sum /usr/lib/os-autoinst/backend/baseclass.pm
3d7b9976d45600819452ed04872d3c39 /usr/lib/os-autoinst/backend/baseclass.pm
Triggered a test, https://openqa.suse.de/tests/3568649#, which passed, so that's a good sign. A second one, https://openqa.suse.de/tests/3568675#, passed. A third one, https://openqa.suse.de/tests/3568676, passed as well. I also monitored the CPU usage while the jobs were running and it did not reach 100%.
@mkittler IMHO you can consider your change a fix for the regression. We can keep the hotpatch on openqaworker2 for the time being. I recommend you cross-check some other scenarios for validation before we merge the PR, e.g. other backends that also use this method and different test types such as an installation scenario.
Updated by okurz about 5 years ago
- Status changed from New to In Progress
- Priority changed from Normal to High
Updated by MDoucha about 5 years ago
mkittler wrote:
Seems like cloning the jobs from the ticket description isn't trivial. I ran into the problem that HDD images and other assets are missing.
To clone LTP jobs from OSD, you first need to clone the install_ltp job for the same build (for the job in the issue description, that would be this one: https://openqa.suse.de/tests/3566745). Then it'll work.
Updated by mkittler about 5 years ago
- Project changed from openQA Project (public) to openQA Infrastructure (public)
- Category deleted (Regressions/Crashes)
- Status changed from In Progress to New
- Priority changed from High to Normal
@MDoucha I actually tried to clone https://openqa.suse.de/tests/3566745 yesterday but it failed with
[2019-11-07T17:16:19.0305 CET] [error] [pid:26949] HDD_1 handling Cannot find HDD_1 asset hdd/SLES-15-SP1-s390x-minimal_installed_for_LTP.qcow2!
Updated by mkittler about 5 years ago
- Project changed from openQA Infrastructure (public) to openQA Project (public)
- Category set to Regressions/Crashes
- Status changed from New to In Progress
Updated by MDoucha about 5 years ago
mkittler wrote:
@MDoucha I actually tried to clone https://openqa.suse.de/tests/3566745 yesterday but it failed with
[2019-11-07T17:16:19.0305 CET] [error] [pid:26949] HDD_1 handling Cannot find HDD_1 asset hdd/SLES-15-SP1-s390x-minimal_installed_for_LTP.qcow2!
You can download it manually from the "Logs & Assets" tab just fine.
Updated by mkittler about 5 years ago
- Status changed from In Progress to Feedback
Apparently the fix works on openqaworker2. Since the change hasn't been deployed on every worker I'll keep it in feedback for now.
Updated by okurz about 5 years ago
- Status changed from Feedback to Resolved
The fix was deployed to OSD this morning: https://gitlab.suse.de/openqa/osd-deployment/-/jobs/138811
Tests look good: https://openqa.suse.de/tests/3586587
Updated by pcervinka about 5 years ago
- Related to action #59226: [kernel][ltp][spvm][sporadic] test fails in boot_ltp - cat /tmp/ver_linux_before.txt timeout added
Updated by mkittler about 5 years ago
- Target version changed from Current Sprint to Done