action #60416
closed
"(undefined value as a symbol reference.*serial_screen|svirt serial: unable to read)" Better feedback for "The console .* is not responding (half-open socket?)", very big log files with repetition of log message
Added by okurz almost 5 years ago.
Updated almost 5 years ago.
Category:
Regressions/Crashes
Description
Observation¶
openQA test in scenario sle-15-SP2-Online-x86_64-prj2_host_upgrade_sles15sp1_to_developing_xen@64bit-ipmi incompletes in
login_console
with a text popup with content "Test died: unexpected response {} at /usr/lib/os-autoinst/testapi.pm line 359."
which is already better than just incomplete but the important part of the log
[2019-11-28T05:03:27.163 CET] [debug] /var/lib/openqa/cache/openqa.suse.de/tests/sle/tests/installation/reboot_after_installation.pm:45 called testapi:: send_key
[2019-11-28T05:03:27.163 CET] [debug] <<< testapi::send_key(key='alt-o', wait_screen_change=0, do_wait=0)
[2019-11-28T05:03:27.970 CET] [debug] svirt serial: unable to read: transport read (error code: -43)
… (near endless continuation of the same message causing a 40MB logfile)
[2019-11-28T05:06:13.474 CET] [debug] svirt serial: unable to read: transport read (error code: -43)
[2019-11-28T05:06:13.476 CET] [debug] Backend process died, backend errors are reported below in the following lines:
The console 'installation' is not responding (half-open socket?). Make sure the console is reachable or disable stall detection on expected disconnects with '$console->disable_vnc_stalls', for example in case of intended machine shutdown
is not shown there. The problem is of the known and usual kind: The system is rebooted before the corresponding connected console is properly disconnected.
Expected result¶
- At least the line about "The console 'installation' is not responding" should show in the details page
- There should not be 40MB worth of repetition of "svirt serial: unable to read: transport read (error code: -43)"
Suggestions¶
Maybe we can apply something similar to https://github.com/os-autoinst/os-autoinst/pull/1262
Further details¶
Always latest result in this scenario: latest
- Related to action #59267: test incompletes in installation_overview - when trying to switch to root-ssh? added
- Project changed from openQA Tests to openQA Project
- Subject changed from test fails in login_console to Better feedback for "The console .* is not responding (half-open socket?)"
- Category changed from Bugs in existing tests to Feature requests
- Subject changed from Better feedback for "The console .* is not responding (half-open socket?)" to Better feedback for "The console .* is not responding (half-open socket?)", very big log files with repetition of "svirt serial: unable to read: transport read (error code: -43)"
- Category changed from Feature requests to Regressions/Crashes
- Status changed from New to Workable
- Assignee set to cfconrad
- Priority changed from Normal to Urgent
Agreed with okurz to use Worker grenache-1:10
for debugging.
systemctl mask openqa-worker@10
systemctl stop openqa-worker@10
Next time I would try with --now --runtime
This job was "canceled" when stopping the worker https://openqa.suse.de/tests/3641819.
As one run is quite time consuming I got a second ipmi host (tails)
ssh grenache-1 -- systemctl mask --now --runtime openqa-worker@9
@cfconrad I'm the last who touched the error handling within that function to prevent the svirt backend's high CPU usage. Before my changes there was not error handling at all so I thought it would be a good idea to log the errors at least. Apparently that's not sufficient and even made things worse considering the huge log files we get now. It would likely be the best to remove the file descriptor from the select to give up on reading from that connection. We could also try to reconnect if that makes sense.
@mkittler yes, I think this will be a good idea.
I'm able to reproduce this behavior in my environment. Currently it was just 1 of 3 runs, while one run cost +2h :/
- Status changed from Workable to In Progress
Enabled tails, as there are waiting jobs for him:
ssh grenache-1 -- sudo systemctl unmask --runtime openqa-worker@9
Keep grenache-1:10
for some more debugging.
cfconrad wrote:
Enabled tails, as there are waiting jobs for him:
ssh grenache-1 -- sudo systemctl unmask --runtime openqa-worker@9
Keep grenache-1:10
for some more debugging.
Anything special to grenache-1:10 ? Could you help share more info about this grenache-1:10 ?
No there is nothing special with grenache-1:10
, but when I block this worker, I'm able to use the (ipmi config)[[https://gitlab.suse.de/openqa/salt-pillars-openqa/blob/master/openqa/workerconf.sls#L585]] in my worker.
And then I'm actually able to tests this.
however we seem to have a follow-up problem which I assume to be related to your work as you are the only one working on consoles recently: There are some incompletes , e.g. https://openqa.suse.de/tests/3665709/file/autoinst-log.txt showing:
[2019-12-05T10:21:37.567 CET] [debug] /var/lib/openqa/cache/openqa.suse.de/tests/sle/tests/kernel/boot_ltp.pm:45 called opensusebasetest::select_serial_terminal
[2019-12-05T10:21:37.567 CET] [debug] <<< testapi::type_string(text='
')
[2019-12-05T10:21:37.568 CET] [debug] <<< consoles::serial_screen::type_string(cmd='backend_type_string', json_cmd_token='twSkAPBc', text='
')
[2019-12-05T10:21:37.568 CET] [debug] Backend process died, backend errors are reported below in the following lines:
Can't use an undefined value as a symbol reference at /usr/lib/os-autoinst/consoles/serial_screen.pm line 106.
as the ticket is still "Urgent" this also applies here :)
- Subject changed from Better feedback for "The console .* is not responding (half-open socket?)", very big log files with repetition of "svirt serial: unable to read: transport read (error code: -43)" to "(undefined value as a symbol reference.*serial_screen|svirt serial: unable to read)" Better feedback for "The console .* is not responding (half-open socket?)", very big log files with repetition of log message
- Related to action #60725: [kernel]Can't use an undefined value as a symbol reference at /usr/lib/os-autoinst/consoles/serial_screen.pm line 106. added
- Status changed from In Progress to Resolved
Also available in: Atom
PDF