isotovideo fails to terminate cleanly, message "isotovideo: unable to inform websocket clients about stopping command server: Request timeout", regression from 4cd4af2b
I have seen the same error message when starting isotovideo locally as well as within openQA jobs. From a local run of isotovideo only:
[2019-10-04T14:55:04.861 CEST] [debug] done with autotest process
[2019-10-04T14:55:04.862 CEST] [debug] killing command server 10808 because test execution ended
[2019-10-04T14:55:04.862 CEST] [debug] isotovideo: informing websocket clients before stopping command server: http://127.0.0.1:20013/3HbEX_vD1nTThFUd/broadcast
[2019-10-04T14:55:19.877 CEST] [debug] isotovideo: unable to inform websocket clients about stopping command server: Request timeout at /home/okurz/local/os-autoinst/os-autoinst/isotovideo line 175.
[2019-10-04T14:55:21.425 CEST] [debug] commands process exited: 0
See the (wasted) 17s caused by the timeout. In this case there simply are no websocket clients to inform, so the whole operation should not even be attempted.
This is actually only expected to fail and show this message when the job is aborted by the worker.
When the test execution ends regularly the broadcast should work (and do nothing if there are no websocket clients). That it is not working here is indeed a bug. Maybe the command server is blocked at that moment, waiting for a response from isotovideo? So isotovideo and the command server would be waiting for each other?
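The mutual-wait situation suspected above can be illustrated with a minimal sketch (in Python for brevity; the actual project is Perl, and the names here are made up for illustration). Each side has just sent a synchronous request to the other and blocks on the reply; since neither side is servicing requests while blocked, no reply ever arrives and only a timeout breaks the deadlock:

```python
import queue
import threading

def peer(name, reply_queue, results):
    # This side has sent a synchronous request to the other and now blocks
    # waiting for the reply. The other side is doing exactly the same, so
    # no reply ever arrives: a mutual-wait deadlock broken only by timeout.
    try:
        results[name] = reply_queue.get(timeout=0.5)
    except queue.Empty:
        results[name] = "timed out"

results = {}
threads = [
    threading.Thread(target=peer, args=("isotovideo", queue.Queue(), results)),
    threading.Thread(target=peer, args=("command server", queue.Queue(), results)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)
```

Both sides report a timeout, which matches the 15 s gap before the "Request timeout" line in the log above.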
Since I've been dealing with that code again today, it reminded me of this issue. I don't think it is easy to reproduce because it only happens sometimes, right?
To answer my own questions: the code takes care to prevent the deadlock situation I've described, but it looks like it doesn't do that in all places.
Note that this log line can also simply be caused by the web UI (the liveviewhandler in particular) being unresponsive. Considering the recent performance problems on OSD this might even have been the case. Then the problem is actually correctly logged at this place. We could lower the timeout, though, since it is not a big deal if this request fails.
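Since the broadcast is purely best-effort, the wait for it can be capped tightly. A minimal sketch of the idea (again Python rather than the project's Perl, with made-up names; the real request would go through the HTTP client's request timeout instead):

```python
import threading
import time

done = threading.Event()

def broadcast():
    # Stand-in for the HTTP request to the liveviewhandler; an unresponsive
    # web UI simply never answers.
    time.sleep(30)
    done.set()

threading.Thread(target=broadcast, daemon=True).start()
start = time.monotonic()
informed = done.wait(timeout=1)  # cap the best-effort wait at one second
elapsed = time.monotonic() - start
print(f"informed={informed}, waited {elapsed:.1f}s")
```

With a one-second cap, an unresponsive web UI costs at most one second of the shutdown instead of the 15 s seen in the log.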
- Status changed from New to In Progress
- Assignee set to mkittler
- Target version set to Current Sprint
Well, yes, the original error is gone. However, the shutdown is still far from speedy, e.g. see:
[2019-10-17T23:10:23.857 CEST] [debug] no match: 79.0s, best candidate: grub2-2-20131030 (0.00)
^C[2019-10-17T23:10:24.058 CEST] [debug] signalhandler got INT
can_read received kill signal at ../../myjsonrpc.pm line 91.
[2019-10-17T23:10:24.059 CEST] [debug] killing backend process 10637
[2019-10-17T23:10:24.549 CEST] [debug] WARNING: check_asserted_screen took 0.68 seconds for 32 candidate needles - make your needles more specific
[2019-10-17T23:10:24.549 CEST] [debug] no match: 77.8s, best candidate: grub2-2-20131030 (0.00)
[2019-10-17T23:10:24.575 CEST] [debug] QEMU: qemu-system-x86_64: terminating on signal 2
[2019-10-17T23:10:24.576 CEST] [info] ::: OpenQA::Qemu::Proc::save_state: Saving QEMU state to qemu_state.json
[2019-10-17T23:10:24.581 CEST] [debug] sending magic and exit
Unexpected end of data 0
[2019-10-17T23:10:26.062 CEST] [debug] done with backend process
[2019-10-17T23:10:26.062 CEST] [debug] killing command server 10627 because test execution ended through exception
[2019-10-17T23:10:28.063 CEST] [debug] done with command server
[2019-10-17T23:10:28.063 CEST] [debug] killing autotest process 10630
[2019-10-17T23:10:30.063 CEST] [debug] done with autotest process 10623: EXIT 0
6s just to kill some processes. I think we can still improve this. Also the "Unexpected end of data 0" message isn't that nice: the end of data should very much be expected when the processes are asked to shut down :) If you prefer we can handle this in a new ticket, though.
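One way the 6s could shrink: the log shows the backend, command server, and autotest process being killed one after another, roughly 2s apiece. Signalling all children first and only then collecting their exits would bound the total by the slowest child rather than the sum. A hedged sketch of that pattern (Python for illustration, not the project's actual shutdown code; the three sleeping children are stand-ins):

```python
import subprocess
import sys
import time

# Three stand-ins for the backend, command server, and autotest processes,
# each of which would otherwise run for 30 seconds.
children = [
    subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
    for _ in range(3)
]
time.sleep(0.5)  # give the children a moment to start

start = time.monotonic()
# Signal everything first, then collect the exits: the total cost is the
# slowest single child, not the sum of per-child waits.
for child in children:
    child.terminate()
for child in children:
    child.wait(timeout=10)
elapsed = time.monotonic() - start
print(f"all children reaped in {elapsed:.2f}s")
```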