action #60815: Broken SSH serial console (again) - openQA Project (public) - openSUSE Project Management Tool

Actions

Copy link

action #60815

closed

Broken SSH serial console (again)

Added by MDoucha over 5 years ago. Updated over 5 years ago.

Status:

Resolved

Priority:

Urgent

Assignee:

cfconrad

Category:

Regressions/Crashes

Target version:

Current Sprint

Start date:

2019-12-09

Due date:

% Done:

Estimated time:

Description

Since Wednesday last week, SSH serial output is garbled or incomplete. This problem blocks all testing on s390. Examples:
https://openqa.suse.de/tests/3680691#step/boot_ltp/14

# wait_serial expected: 'active'

# Result:

Last login: Mon Dec  9 03:45:42 from grenache-1.qa.suse.de
[1ms390kvm014:~ #[m(B systemctl is-acetwork
activating
[1ms390kvm014:~ #[m(B

https://openqa.suse.de/tests/3680889#step/update_kernel/9

# wait_serial expected: qr/Welcome to SUSE Linux Enterprise .*\(s390x\)/u

# Result:
<snip>
Welcometo SUSE Lux Enterprseerer 1 SP4 s3x) ernl 4.12.14-95.6-default (ttysclp0).


 login:

Actions

Copy link

Updated by coolo over 5 years ago

git log 3973b078..91c754a2 is quite a lot, nothing obvious but quite a bit about ssh, serial and rpc

Actions

Copy link

Updated by coolo over 5 years ago

Project changed from openQA Infrastructure (public) to openQA Project (public)
Subject changed from [OpenQA] Broken SSH serial console (again) to Broken SSH serial console (again)
Category set to Regressions/Crashes
Assignee deleted (~~okurz~~)
Target version set to Ready

Actions

Copy link

Updated by coolo over 5 years ago

https://openqa.suse.de/tests/3680889#step/update_kernel/9 contains a lot of text and most look fine - and then it ends with

[  OK  ] 
[  OK  ] Reached target Multi-User System.
[  OK  ] 
         Starting Update UTMP about System Runlevel Changes...[  OK  ] Started Update UTMP about System Runlevel Changes.

Welcometo SUSE Lux Enterprseerer 1 SP4 s3x) ernl 4.12.14-95.6-default (ttysclp0).


 login:

This looks like 2 processes are writing independently to each other into the console (file).

Actions

Copy link

Updated by mkittler over 5 years ago

I've just had a look at the Git log.

@cfconrad Maybe change for stopping the serial console in all error cases (beside LIBSSH2_ERROR_EAGAIN) is wrong after all?

     # read from SSH channel (receiving extended data channel as well via `$chan->ext_data('merge')`)
     my $chan = $self->{serial_chan};
     my $buffer;
-    my $bytes_read      = $chan->read($buffer, 4096);
-    my $could_read_once = defined $bytes_read;
-    while (defined $bytes_read) {
+    while (defined(my $bytes_read = $chan->read($buffer, 4096))) {
         return 1 unless $bytes_read > 0;
         print $buffer;
         open(my $serial, '>>', $self->{serialfile});
@@ -1277,9 +1275,10 @@ sub check_ssh_serial {
     }
 
     my ($error_code, $error_name, $error_string) = $ssh->error;
-    return 1 if $could_read_once && $error_code == LIBSSH2_ERROR_EAGAIN;
+    return 1 if $error_code == LIBSSH2_ERROR_EAGAIN;
 
-    bmwqemu::diag("svirt serial: unable to read: $error_string (error code: $error_code)");
+    bmwqemu::diag("ssh serial: unable to read: $error_string (error code: $error_code) - closing connection");
+    $self->stop_ssh_serial();
     return 1;
 }

But ssh serial: unable to read: is not in the log so if that's causing the problem it would be a logic bug before in the code before (which I don't see).

I'm also wondering about this change:

-    $self->{socket_fd}    = $socket_fd;
+    $self->{fd_read}      = $fd_read;
+    $self->{fd_write}     = $fd_write // $fd_read;

~~Possibly assigning $fd_read to $self->{fd_write} looks wrong.~~ I confused that with $self->{select_read} and $self->{select_write}.

Actions

Copy link

Updated by cfconrad over 5 years ago

I think the issue is cause by https://github.com/os-autoinst/os-autoinst/pull/1298 and fixed with https://github.com/os-autoinst/os-autoinst/pull/1319

Actions

Copy link

Updated by mkittler over 5 years ago

Status changed from New to In Progress
Assignee set to cfconrad
Target version changed from Ready to Current Sprint

Now I'm feeling stupid for not seeing this obvious logic bug because I have actually suspected that there is a bug in exactly that place (see my last comment).

@cfconrad I'm assigning you since you've provided the fix.

Actions

Copy link

Updated by cfconrad over 5 years ago

Status changed from In Progress to Feedback

Now I'm feeling stupid for not seeing this obvious logic bug because I have actually suspected that there is a bug in exactly that place (see my last comment).

I don't feel better, as I introduced it :/

Put it to Feedback as we waiting for more results from productive instance. PR was merged, thx!

Actions

Copy link

Updated by okurz over 5 years ago

I selectively installed the new version of os-autoinst on grenache-1.qa and retriggered the failure from the ticket description as https://openqa.suse.de/tests/3684656

EDIT: looks good. Have also installed the new os-autoinst on all other workers. We can now retrigger all jobs that failed for the same reason on osd.

Actions

Copy link

Updated by cfconrad over 5 years ago

+1, thx. Let me know if I can do something.

Actions

Copy link

#10

Updated by okurz over 5 years ago

retrigger tests that fail because of this problem :)

Actions

Copy link

#11

Updated by cfconrad over 5 years ago

ok, spend some time to walk through the webfrontend to search and retrigger tests.
Didn't note all of them, I think there were some more but these two are in the list:

https://openqa.suse.de/t3684922 - install_ltp+sle+Server-DVD-Incidents-Kernel => uploading
https://openqa.suse.de/t3684940 - install_ltp+sle+Server-DVD-Incidents-Kernel => running

Actions

Copy link

#12

Updated by cfconrad over 5 years ago

Status changed from Feedback to Resolved

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

QA (public) » openQA Project (public)

Tags

Custom queries

action #60815

Broken SSH serial console (again)

Updated by coolo over 5 years ago

Updated by coolo over 5 years ago

Updated by coolo over 5 years ago

Updated by mkittler over 5 years ago

Updated by cfconrad over 5 years ago

Updated by mkittler over 5 years ago

Updated by cfconrad over 5 years ago

Updated by okurz over 5 years ago

Updated by cfconrad over 5 years ago

Updated by okurz over 5 years ago

Updated by cfconrad over 5 years ago

Updated by cfconrad over 5 years ago