Project

General

Profile

Actions

action #60815

closed

Broken SSH serial console (again)

Added by MDoucha almost 5 years ago. Updated almost 5 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2019-12-09
Due date:
% Done:

0%

Estimated time:

Description

Since Wednesday last week, SSH serial output is garbled or incomplete. This problem blocks all testing on s390. Examples:
https://openqa.suse.de/tests/3680691#step/boot_ltp/14

# wait_serial expected: 'active'

# Result:

Last login: Mon Dec  9 03:45:42 from grenache-1.qa.suse.de
s390kvm014:~ #(B systemctl is-acetwork
activating
s390kvm014:~ #(B 

https://openqa.suse.de/tests/3680889#step/update_kernel/9

# wait_serial expected: qr/Welcome to SUSE Linux Enterprise .*\(s390x\)/u

# Result:
<snip>
Welcometo SUSE Lux Enterprseerer 1 SP4 s3x) ernl 4.12.14-95.6-default (ttysclp0).


 login: 
Actions #1

Updated by coolo almost 5 years ago

git log 3973b078..91c754a2 is quite a lot, nothing obvious but quite a bit about ssh, serial and rpc

Actions #2

Updated by coolo almost 5 years ago

  • Project changed from openQA Infrastructure to openQA Project
  • Subject changed from [OpenQA] Broken SSH serial console (again) to Broken SSH serial console (again)
  • Category set to Regressions/Crashes
  • Assignee deleted (okurz)
  • Target version set to Ready
Actions #3

Updated by coolo almost 5 years ago

https://openqa.suse.de/tests/3680889#step/update_kernel/9 contains a lot of text and most look fine - and then it ends with

[  OK  ] 
[  OK  ] Reached target Multi-User System.
[  OK  ] 
         Starting Update UTMP about System Runlevel Changes...[  OK  ] Started Update UTMP about System Runlevel Changes.

Welcometo SUSE Lux Enterprseerer 1 SP4 s3x) ernl 4.12.14-95.6-default (ttysclp0).


 login: 

This looks like 2 processes are writing independently to each other into the console (file).

Actions #4

Updated by mkittler almost 5 years ago

I've just had a look at the Git log.

@cfconrad Maybe change for stopping the serial console in all error cases (beside LIBSSH2_ERROR_EAGAIN) is wrong after all?

     # read from SSH channel (receiving extended data channel as well via `$chan->ext_data('merge')`)
     my $chan = $self->{serial_chan};
     my $buffer;
-    my $bytes_read      = $chan->read($buffer, 4096);
-    my $could_read_once = defined $bytes_read;
-    while (defined $bytes_read) {
+    while (defined(my $bytes_read = $chan->read($buffer, 4096))) {
         return 1 unless $bytes_read > 0;
         print $buffer;
         open(my $serial, '>>', $self->{serialfile});
@@ -1277,9 +1275,10 @@ sub check_ssh_serial {
     }

     my ($error_code, $error_name, $error_string) = $ssh->error;
-    return 1 if $could_read_once && $error_code == LIBSSH2_ERROR_EAGAIN;
+    return 1 if $error_code == LIBSSH2_ERROR_EAGAIN;

-    bmwqemu::diag("svirt serial: unable to read: $error_string (error code: $error_code)");
+    bmwqemu::diag("ssh serial: unable to read: $error_string (error code: $error_code) - closing connection");
+    $self->stop_ssh_serial();
     return 1;
 }

But ssh serial: unable to read: is not in the log so if that's causing the problem it would be a logic bug before in the code before (which I don't see).

I'm also wondering about this change:

-    $self->{socket_fd}    = $socket_fd;
+    $self->{fd_read}      = $fd_read;
+    $self->{fd_write}     = $fd_write // $fd_read;

Possibly assigning $fd_read to $self->{fd_write} looks wrong. I confused that with $self->{select_read} and $self->{select_write}.

Actions #6

Updated by mkittler almost 5 years ago

  • Status changed from New to In Progress
  • Assignee set to cfconrad
  • Target version changed from Ready to Current Sprint

Now I'm feeling stupid for not seeing this obvious logic bug because I have actually suspected that there is a bug in exactly that place (see my last comment).

@cfconrad I'm assigning you since you've provided the fix.

Actions #7

Updated by cfconrad almost 5 years ago

  • Status changed from In Progress to Feedback

Now I'm feeling stupid for not seeing this obvious logic bug because I have actually suspected that there is a bug in exactly that place (see my last comment).

I don't feel better, as I introduced it :/

Put it to Feedback as we waiting for more results from productive instance. PR was merged, thx!

Actions #8

Updated by okurz almost 5 years ago

I selectively installed the new version of os-autoinst on grenache-1.qa and retriggered the failure from the ticket description as https://openqa.suse.de/tests/3684656

EDIT: looks good. Have also installed the new os-autoinst on all other workers. We can now retrigger all jobs that failed for the same reason on osd.

Actions #9

Updated by cfconrad almost 5 years ago

+1, thx. Let me know if I can do something.

Actions #10

Updated by okurz almost 5 years ago

retrigger tests that fail because of this problem :)

Actions #11

Updated by cfconrad almost 5 years ago

ok, spend some time to walk through the webfrontend to search and retrigger tests.
Didn't note all of them, I think there were some more but these two are in the list:

https://openqa.suse.de/t3684922 - install_ltp+sle+Server-DVD-Incidents-Kernel => uploading
https://openqa.suse.de/t3684940 - install_ltp+sle+Server-DVD-Incidents-Kernel => running

Actions #12

Updated by cfconrad almost 5 years ago

  • Status changed from Feedback to Resolved
Actions

Also available in: Atom PDF