Project

General

Profile

action #60416

"(undefined value as a symbol reference.*serial_screen|svirt serial: unable to read)" Better feedback for "The console .* is not responding (half-open socket?)", very big log files with repetition of log message

Added by okurz almost 2 years ago. Updated almost 2 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
Concrete Bugs
Target version:
-
Start date:
2019-11-28
Due date:
% Done:

0%

Estimated time:
Difficulty:

Description

Observation

openQA test in scenario sle-15-SP2-Online-x86_64-prj2_host_upgrade_sles15sp1_to_developing_xen@64bit-ipmi incompletes in
login_console
with a text popup with content "Test died: unexpected response {} at /usr/lib/os-autoinst/testapi.pm line 359."
which is already better than just incomplete but the important part of the log

[2019-11-28T05:03:27.163 CET] [debug] /var/lib/openqa/cache/openqa.suse.de/tests/sle/tests/installation/reboot_after_installation.pm:45 called testapi::     send_key
[2019-11-28T05:03:27.163 CET] [debug] <<< testapi::send_key(key='alt-o', wait_screen_change=0, do_wait=0)
[2019-11-28T05:03:27.970 CET] [debug] svirt serial: unable to read: transport read (error code: -43)
… (near endless continuation of the same message causing a 40MB logfile)
[2019-11-28T05:06:13.474 CET] [debug] svirt serial: unable to read: transport read (error code: -43)
[2019-11-28T05:06:13.476 CET] [debug] Backend process died, backend errors are reported below in the following lines:
The console 'installation' is not responding (half-open socket?). Make sure the console is reachable or disable stall detection on expected disconnects with '$console->disable_vnc_stalls', for example in case of intended machine shutdown

is not shown there. The problem is of the known and usual kind: The system is rebooted before the corresponding connected console is properly disconnected.

Expected result

  • At least the line about "The console 'installation' is not responding" should show in the details page
  • There should not be 40MB worth of repetition of "svirt serial: unable to read: transport read (error code: -43)"

Suggestions

Maybe we can apply something similar to https://github.com/os-autoinst/os-autoinst/pull/1262

Further details

Always latest result in this scenario: latest


Related issues

Related to openQA Project - action #59267: test incompletes in installation_overview - when trying to switch to root-ssh?Resolved2019-11-10

Related to openQA Tests - action #60725: [kernel]Can't use an undefined value as a symbol reference at /usr/lib/os-autoinst/consoles/serial_screen.pm line 106.Resolved2019-12-05

History

#1 Updated by okurz almost 2 years ago

  • Related to action #59267: test incompletes in installation_overview - when trying to switch to root-ssh? added

#2 Updated by okurz almost 2 years ago

  • Project changed from openQA Tests to openQA Project
  • Subject changed from test fails in login_console to Better feedback for "The console .* is not responding (half-open socket?)"
  • Category changed from Bugs in existing tests to Feature requests

#3 Updated by okurz almost 2 years ago

  • Subject changed from Better feedback for "The console .* is not responding (half-open socket?)" to Better feedback for "The console .* is not responding (half-open socket?)", very big log files with repetition of "svirt serial: unable to read: transport read (error code: -43)"
  • Category changed from Feature requests to Concrete Bugs
  • Status changed from New to Workable
  • Assignee set to cfconrad
  • Priority changed from Normal to Urgent

cfconrad I have the suspicion that https://github.com/os-autoinst/os-autoinst/pull/1166 caused this issue with the super-big logfiles as this seems to occur more often now. Can you please look into this as soon as possible? We can also revert your change again for now if you prefer.

Btw, I found this issue while monitoring incompletes. The output of daily monitoring and alerting jobs can be seen in e.g. https://gitlab.suse.de/openqa/auto-review/-/jobs/144344

#4 Updated by cfconrad almost 2 years ago

ok, I'm on it.

#5 Updated by cfconrad almost 2 years ago

I don't understand our check_ssh_serial() error handling.

When I read this https://www.libssh2.org/libssh2_channel_read_ex.html which is called in https://github.com/rkitover/net-ssh2/blob/master/SSH2.xs#L1854 , which looks like the read implementation of NET::SSH2, we need special handling for LIBSSH2_ERROR_EAGAIN but every other error on read() is like a connection lost.

I'm not sure, why we do not handle connection lost here properly?

#6 Updated by cfconrad almost 2 years ago

Agreed with okurz to use Worker grenache-1:10 for debugging.

systemctl mask openqa-worker@10
systemctl stop openqa-worker@10

Next time I would try with --now --runtime

This job was "canceled" when stopping the worker https://openqa.suse.de/tests/3641819.

#7 Updated by cfconrad almost 2 years ago

As one run is quite time consuming I got a second ipmi host (tails)

ssh grenache-1 -- systemctl mask --now --runtime openqa-worker@9

#8 Updated by mkittler almost 2 years ago

cfconrad I'm the last who touched the error handling within that function to prevent the svirt backend's high CPU usage. Before my changes there was not error handling at all so I thought it would be a good idea to log the errors at least. Apparently that's not sufficient and even made things worse considering the huge log files we get now. It would likely be the best to remove the file descriptor from the select to give up on reading from that connection. We could also try to reconnect if that makes sense.

#9 Updated by cfconrad almost 2 years ago

mkittler yes, I think this will be a good idea.

I'm able to reproduce this behavior in my environment. Currently it was just 1 of 3 runs, while one run cost +2h :/

#10 Updated by cfconrad almost 2 years ago

  • Status changed from Workable to In Progress

#11 Updated by cfconrad almost 2 years ago

Enabled tails, as there are waiting jobs for him:

ssh grenache-1 -- sudo systemctl unmask --runtime openqa-worker@9

Keep grenache-1:10 for some more debugging.

#12 Updated by waynechen55 almost 2 years ago

cfconrad wrote:

Enabled tails, as there are waiting jobs for him:

ssh grenache-1 -- sudo systemctl unmask --runtime openqa-worker@9

Keep grenache-1:10 for some more debugging.

Anything special to grenache-1:10 ? Could you help share more info about this grenache-1:10 ?

#13 Updated by cfconrad almost 2 years ago

No there is nothing special with grenache-1:10, but when I block this worker, I'm able to use the (ipmi config)[[https://gitlab.suse.de/openqa/salt-pillars-openqa/blob/master/openqa/workerconf.sls#L585]] in my worker.
And then I'm actually able to tests this.

#14 Updated by okurz almost 2 years ago

https://github.com/os-autoinst/os-autoinst/pull/1298 is deployed to both o3 and osd now and it seems we do not have this problem showing up in logs anymore. Unless you plan further work here please resolve.

#15 Updated by okurz almost 2 years ago

however we seem to have a follow-up problem which I assume to be related to your work as you are the only one working on consoles recently: There are some incompletes , e.g. https://openqa.suse.de/tests/3665709/file/autoinst-log.txt showing:

[2019-12-05T10:21:37.567 CET] [debug] /var/lib/openqa/cache/openqa.suse.de/tests/sle/tests/kernel/boot_ltp.pm:45 called opensusebasetest::select_serial_terminal
[2019-12-05T10:21:37.567 CET] [debug] <<< testapi::type_string(text='
')
[2019-12-05T10:21:37.568 CET] [debug] <<< consoles::serial_screen::type_string(cmd='backend_type_string', json_cmd_token='twSkAPBc', text='
')
[2019-12-05T10:21:37.568 CET] [debug] Backend process died, backend errors are reported below in the following lines:
Can't use an undefined value as a symbol reference at /usr/lib/os-autoinst/consoles/serial_screen.pm line 106.

as the ticket is still "Urgent" this also applies here :)

#16 Updated by okurz almost 2 years ago

  • Subject changed from Better feedback for "The console .* is not responding (half-open socket?)", very big log files with repetition of "svirt serial: unable to read: transport read (error code: -43)" to "(undefined value as a symbol reference.*serial_screen|svirt serial: unable to read)" Better feedback for "The console .* is not responding (half-open socket?)", very big log files with repetition of log message

#17 Updated by cfconrad almost 2 years ago

I also have created a ticket to it. As this is actually a regression of https://progress.opensuse.org/issues/55883 PR: https://github.com/os-autoinst/os-autoinst/pull/1274

#19 Updated by cfconrad almost 2 years ago

  • Related to action #60725: [kernel]Can't use an undefined value as a symbol reference at /usr/lib/os-autoinst/consoles/serial_screen.pm line 106. added

#20 Updated by cfconrad almost 2 years ago

  • Status changed from In Progress to Resolved

I would set this error now to resolved.
The error from https://progress.opensuse.org/issues/60416#note-15 is not related to this. And I will track it in https://progress.opensuse.org/issues/60725

Enable worker grenache-1:10

ssh grenache-1 -- sudo systemctl unmask openqa-worker@10
ssh grenache-1 -- sudo systemctl start openqa-worker@10

Also available in: Atom PDF