action #59190

closed

Broken SSH serial console

Added by MDoucha about 5 years ago. Updated about 5 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2019-11-07
Due date:
% Done: 0%
Estimated time:

Description

Since 2019-10-30, openQA tests which print large amounts of output to the serial console randomly fail: the serial console output gets stuck in a buffer and wait_serial() times out waiting for a test end marker that has already been printed (as seen on VNC screenshots right before the timeout in some jobs). Affected backends: svirt, ipmi.

Example: https://openqa.suse.de/tests/3566747#step/trace_sched01/2

Also reproduced on another openQA instance: http://openqa.qam.suse.cz/tests/3472#step/boot_ltp/22
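
For context, the failing pattern boils down to something like the following testapi usage (a minimal sketch; the command, marker and timeout are illustrative and not taken from the jobs above):

# the test prints a large amount of output followed by an end marker to the
# serial device ($serialdev, e.g. ttysclp0 on s390x)
type_string "runltp -f sched; echo LTP-TEST-DONE > /dev/$serialdev\n";
# wait_serial() then times out even though the marker is already visible on the
# VNC screenshot taken right before the timeout
wait_serial('LTP-TEST-DONE', 300) || die 'end marker never arrived over the SSH serial console';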


Related issues: 1 (0 open, 1 closed)

Related to openQA Tests (public) - action #59226: [kernel]ltp][spvm][sporadic] test fails in boot_ltp - cat /tmp/ver_linux_before.txt timeout (Resolved, pcervinka, 2019-11-08)

Actions #1

Updated by okurz about 5 years ago

Hm, I am not aware of any intended recent changes regarding the serial port. I suggest you fill the ticket with the points from the template in https://progress.opensuse.org/projects/openqav3/wiki#Defects and bisect based on the involved components that changed, e.g. os-autoinst and openQA.

Actions #2

Updated by mkittler about 5 years ago

Maybe this is caused by https://github.com/os-autoinst/os-autoinst/pull/1239. I'll have a look whether I can reproduce it locally.

Actions #3

Updated by acarvajal about 5 years ago

Adding some details that were discussed in chat and could be helpful for debugging:

1) Noticed with build 0172@0369 of SLES+HA 12-SP5 on 04.11.2019 that a cluster test (crm script run health) was timing out on s390x over svirt: https://openqa.suse.de/tests/3556626#step/check_logs/2

2) The issue seemed the same: a stack trace is being sent over the serial terminal but never completes; the test never receives the regexp expected by wait_serial() and times out.

3) Same test with the previous build (0170@0368) on 22.10.2019 not only worked, but the test was able to capture a longer stack trace over the serial terminal: https://openqa.suse.de/tests/3529875#step/check_logs/2

4) Re-triggering the tests for build 0172@0369 and monitoring the live view and live log showed the stack trace getting stuck in the serial terminal. Moreover, connecting via SSH to both nodes and running a command with an echo to /dev/ttysclp0 while the trace was stuck in the serial terminal seemed to unstick the output (a rough sketch of that nudge follows below). The result was two successful tests, https://openqa.suse.de/tests/3557376 & https://openqa.suse.de/tests/3557377, but with extra text in the serial terminal log produced by the manual echo command.
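
The manual nudge from 4) amounts to something like the following (only a sketch: the node names are placeholders, the echoed text is arbitrary, and /dev/ttysclp0 is the serial device used by these s390x jobs):

# run from any machine with root SSH access to both cluster nodes
for my $node ('node01', 'node02') {    # hypothetical host names
    system('ssh', "root\@$node", 'echo MANUAL-SERIAL-NUDGE > /dev/ttysclp0');
}

The echoed text then also shows up in the serial terminal log, which explains the extra text mentioned above.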

Actions #4

Updated by mkittler about 5 years ago

  • Subject changed from [infra] Broken SSH serial console to Broken SSH serial console

Seems like cloning the jobs from the ticket description isn't trivial. I ran into the problem that HDD images and other assets are missing. I'll try the jobs mentioned by @acarvajal, but that's yet another WORKER_CLASS I need to set up locally by borrowing an svirt host from OSD.

Since this also happens on your own openQA instance, I think this is likely a bug in os-autoinst and not in the infrastructure (hence moving to the openQA project). Maybe we need to call read() in check_serial again so it doesn't rely on being called another time to read the remaining bytes in case there are more than 4048 bytes? When writing the code I assumed that the IO loop via IO::Select would call the check_serial function as long as there are bytes to read anyway, so I just added a single read() call.
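
To illustrate the hypothesis (this is only a sketch of the idea, not the actual check_serial code in os-autoinst; the chunk size and helper name are made up):

# drain everything that is currently readable instead of relying on the
# IO::Select loop to call us again for leftover bytes
use constant READ_CHUNK => 4096;    # illustrative chunk size

sub drain_serial_socket {
    my ($fh, $handle_data) = @_;
    while (1) {
        my $bytes = sysread($fh, my $buf, READ_CHUNK);
        last unless defined $bytes && $bytes > 0;    # error or EOF
        $handle_data->($buf);
        last if $bytes < READ_CHUNK;                 # short read: nothing left buffered right now
    }
}

If only a single read happens per select() wakeup while more than one chunk is pending, the rest stays in the buffer until something else makes the socket readable again, which would match the output being stuck until the manual echo described above.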

Actions #5

Updated by mkittler about 5 years ago

  • Assignee set to mkittler
  • Target version set to Current Sprint

That change might fix it: https://github.com/os-autoinst/os-autoinst/pull/1258

I only tested this with 'sle-12-SP5-Server-DVD-s390x-Build0366-installer_extended@s390x-kvm-sle12' so far because I cannot run the tests that reproduce the issue locally.

I've assigned myself, but I actually don't know how to reproduce the issue, so help would be welcome. Maybe somebody can test the change/fix?

Actions #6

Updated by okurz about 5 years ago

  • Project changed from openQA Infrastructure (public) to openQA Project (public)
  • Category set to Regressions/Crashes
Actions #7

Updated by okurz about 5 years ago

Testing on openqaworker2:

okurz@openqaworker2:~> md5sum /usr/lib/os-autoinst/backend/baseclass.pm 
68ba3ed78bb9cd1e2f95b960bde15c53  /usr/lib/os-autoinst/backend/baseclass.pm
okurz@openqaworker2:~> sudo wget https://raw.githubusercontent.com/os-autoinst/os-autoinst/d81c82015d08bf2bb2782a294eb742941ae29f62/backend/baseclass.pm -O /usr/lib/os-autoinst/backend/baseclass.pm
--2019-11-07 21:54:47--  https://raw.githubusercontent.com/os-autoinst/os-autoinst/d81c82015d08bf2bb2782a294eb742941ae29f62/backend/baseclass.pm
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.40.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.40.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 39142 (38K) [text/plain]
Saving to: ‘/usr/lib/os-autoinst/backend/baseclass.pm’

/usr/lib/os-autoinst/backend/baseclass.pm                100%[===============================================================================================================================>]  38.22K   207KB/s    in 0.2s    

2019-11-07 21:54:48 (207 KB/s) - ‘/usr/lib/os-autoinst/backend/baseclass.pm’ saved [39142/39142]

okurz@openqaworker2:~> md5sum /usr/lib/os-autoinst/backend/baseclass.pm 
3d7b9976d45600819452ed04872d3c39  /usr/lib/os-autoinst/backend/baseclass.pm

Triggered a test https://openqa.suse.de/tests/3568649# which passed, so that's a good sign. A second one, https://openqa.suse.de/tests/3568675#, passed, and a third, https://openqa.suse.de/tests/3568676, passed as well. I also monitored the CPU usage while the jobs were running and it did not reach 100%.

@mkittler IMHO you can consider your change a fix for the regression. We can keep the hotpatch on openqaworker2 for the time being. I recommend you crosscheck some other scenarios for validation before we merge the PR, e.g. other backends that also use this method, and different scenarios such as an installation one.

Actions #8

Updated by okurz about 5 years ago

  • Status changed from New to In Progress
  • Priority changed from Normal to High
Actions #9

Updated by MDoucha about 5 years ago

mkittler wrote:

Seems like cloning the jobs from the ticket description isn't trivial. I ran into the problem that HDD images and other assets are missing.

To clone LTP jobs from OSD, you need to first clone the install_ltp job for the same build (for the job in issue description, that would be this one: https://openqa.suse.de/tests/3566745). Then it'll work.

Actions #10

Updated by mkittler about 5 years ago

  • Project changed from openQA Project (public) to openQA Infrastructure (public)
  • Category deleted (Regressions/Crashes)
  • Status changed from In Progress to New
  • Priority changed from High to Normal

@MDoucha I actually tried to clone https://openqa.suse.de/tests/3566745 yesterday but it failed with

[2019-11-07T17:16:19.0305 CET] [error] [pid:26949] HDD_1 handling Cannot find HDD_1 asset hdd/SLES-15-SP1-s390x-minimal_installed_for_LTP.qcow2!
Actions #11

Updated by mkittler about 5 years ago

  • Project changed from openQA Infrastructure (public) to openQA Project (public)
  • Category set to Regressions/Crashes
  • Status changed from New to In Progress
Actions #12

Updated by MDoucha about 5 years ago

mkittler wrote:

@MDoucha I actually tried to clone https://openqa.suse.de/tests/3566745 yesterday but it failed with

[2019-11-07T17:16:19.0305 CET] [error] [pid:26949] HDD_1 handling Cannot find HDD_1 asset hdd/SLES-15-SP1-s390x-minimal_installed_for_LTP.qcow2!

You can download it manually from the "Logs & Assets" tab just fine.

Actions #13

Updated by mkittler about 5 years ago

  • Status changed from In Progress to Feedback

Apparently the fix works on openqaworker2. Since the change hasn't been deployed on every worker I'll keep it in feedback for now.

Actions #14

Updated by okurz about 5 years ago

  • Status changed from Feedback to Resolved
Actions #15

Updated by pcervinka about 5 years ago

  • Related to action #59226: [kernel]ltp][spvm][sporadic] test fails in boot_ltp - cat /tmp/ver_linux_before.txt timeout added
Actions #16

Updated by mkittler about 5 years ago

  • Target version changed from Current Sprint to Done