action #60815
closedBroken SSH serial console (again)
0%
Description
Since Wednesday last week, SSH serial output is garbled or incomplete. This problem blocks all testing on s390. Examples:
https://openqa.suse.de/tests/3680691#step/boot_ltp/14
# wait_serial expected: 'active'
# Result:
Last login: Mon Dec 9 03:45:42 from grenache-1.qa.suse.de
[1ms390kvm014:~ #[m(B systemctl is-acetwork
activating
[1ms390kvm014:~ #[m(B
https://openqa.suse.de/tests/3680889#step/update_kernel/9
# wait_serial expected: qr/Welcome to SUSE Linux Enterprise .*\(s390x\)/u
# Result:
<snip>
Welcometo SUSE Lux Enterprseerer 1 SP4 s3x) ernl 4.12.14-95.6-default (ttysclp0).
login:
Updated by coolo almost 5 years ago
git log 3973b078..91c754a2
is quite a lot, nothing obvious but quite a bit about ssh, serial and rpc
Updated by coolo almost 5 years ago
- Project changed from openQA Infrastructure to openQA Project
- Subject changed from [OpenQA] Broken SSH serial console (again) to Broken SSH serial console (again)
- Category set to Regressions/Crashes
- Assignee deleted (
okurz) - Target version set to Ready
Updated by coolo almost 5 years ago
https://openqa.suse.de/tests/3680889#step/update_kernel/9 contains a lot of text and most look fine - and then it ends with
[ OK ]
[ OK ] Reached target Multi-User System.
[ OK ]
Starting Update UTMP about System Runlevel Changes...[ OK ] Started Update UTMP about System Runlevel Changes.
Welcometo SUSE Lux Enterprseerer 1 SP4 s3x) ernl 4.12.14-95.6-default (ttysclp0).
login:
This looks like 2 processes are writing independently to each other into the console (file).
Updated by mkittler almost 5 years ago
I've just had a look at the Git log.
@cfconrad Maybe change for stopping the serial console in all error cases (beside LIBSSH2_ERROR_EAGAIN
) is wrong after all?
# read from SSH channel (receiving extended data channel as well via `$chan->ext_data('merge')`)
my $chan = $self->{serial_chan};
my $buffer;
- my $bytes_read = $chan->read($buffer, 4096);
- my $could_read_once = defined $bytes_read;
- while (defined $bytes_read) {
+ while (defined(my $bytes_read = $chan->read($buffer, 4096))) {
return 1 unless $bytes_read > 0;
print $buffer;
open(my $serial, '>>', $self->{serialfile});
@@ -1277,9 +1275,10 @@ sub check_ssh_serial {
}
my ($error_code, $error_name, $error_string) = $ssh->error;
- return 1 if $could_read_once && $error_code == LIBSSH2_ERROR_EAGAIN;
+ return 1 if $error_code == LIBSSH2_ERROR_EAGAIN;
- bmwqemu::diag("svirt serial: unable to read: $error_string (error code: $error_code)");
+ bmwqemu::diag("ssh serial: unable to read: $error_string (error code: $error_code) - closing connection");
+ $self->stop_ssh_serial();
return 1;
}
But ssh serial: unable to read:
is not in the log so if that's causing the problem it would be a logic bug before in the code before (which I don't see).
I'm also wondering about this change:
- $self->{socket_fd} = $socket_fd;
+ $self->{fd_read} = $fd_read;
+ $self->{fd_write} = $fd_write // $fd_read;
Possibly assigning $fd_read
to $self->{fd_write}
looks wrong. I confused that with $self->{select_read}
and $self->{select_write}
.
Updated by cfconrad almost 5 years ago
I think the issue is cause by https://github.com/os-autoinst/os-autoinst/pull/1298 and fixed with https://github.com/os-autoinst/os-autoinst/pull/1319
Updated by mkittler almost 5 years ago
- Status changed from New to In Progress
- Assignee set to cfconrad
- Target version changed from Ready to Current Sprint
Now I'm feeling stupid for not seeing this obvious logic bug because I have actually suspected that there is a bug in exactly that place (see my last comment).
@cfconrad I'm assigning you since you've provided the fix.
Updated by cfconrad almost 5 years ago
- Status changed from In Progress to Feedback
Now I'm feeling stupid for not seeing this obvious logic bug because I have actually suspected that there is a bug in exactly that place (see my last comment).
I don't feel better, as I introduced it :/
Put it to Feedback as we waiting for more results from productive instance. PR was merged, thx!
Updated by okurz almost 5 years ago
I selectively installed the new version of os-autoinst on grenache-1.qa and retriggered the failure from the ticket description as https://openqa.suse.de/tests/3684656
EDIT: looks good. Have also installed the new os-autoinst on all other workers. We can now retrigger all jobs that failed for the same reason on osd.
Updated by cfconrad almost 5 years ago
+1, thx. Let me know if I can do something.
Updated by okurz almost 5 years ago
retrigger tests that fail because of this problem :)
Updated by cfconrad almost 5 years ago
ok, spend some time to walk through the webfrontend to search and retrigger tests.
Didn't note all of them, I think there were some more but these two are in the list:
https://openqa.suse.de/t3684922 - install_ltp+sle+Server-DVD-Incidents-Kernel => uploading
https://openqa.suse.de/t3684940 - install_ltp+sle+Server-DVD-Incidents-Kernel => running
Updated by cfconrad almost 5 years ago
- Status changed from Feedback to Resolved