action #57707: isotovideo fails to terminate cleanly, message "isotovideo: unable to inform websocket clients about stopping command server: Request timeout", regression from 4cd4af2b - openQA Project (public) - openSUSE Project Management Tool

Actions

Copy link

action #57707

closed

isotovideo fails to terminate cleanly, message "isotovideo: unable to inform websocket clients about stopping command server: Request timeout", regression from 4cd4af2b

Added by okurz over 5 years ago. Updated over 5 years ago.

Status:

Resolved

Priority:

Normal

Assignee:

mkittler

Category:

Regressions/Crashes

Target version:

Done

Start date:

2019-10-04

Due date:

% Done:

Estimated time:

Description

Observation¶

I have seen the same error message when starting isotovideo locally as well as within openQA jobs. From a local run of isotovideo only:

[2019-10-04T14:55:04.861 CEST] [debug] done with autotest process
[2019-10-04T14:55:04.862 CEST] [debug] killing command server 10808 because test execution ended
[2019-10-04T14:55:04.862 CEST] [debug] isotovideo: informing websocket clients before stopping command server: http://127.0.0.1:20013/3HbEX_vD1nTThFUd/broadcast
[2019-10-04T14:55:19.877 CEST] [debug] isotovideo: unable to inform websocket clients about stopping command server: Request timeout at /home/okurz/local/os-autoinst/os-autoinst/isotovideo line 175.

[2019-10-04T14:55:21.425 CEST] [debug] commands process exited: 0

See the (wasted) time of 17s due to the timeout. In this case there simply are no websocket clients to inform so the whole operation should not even be tried.

Related issues 3 (0 open — 3 closed)

Actions

Copy link

Updated by okurz over 5 years ago

Related to action #45191: developer mode: error message just when clicking "Cancel job" added

Actions

Copy link

Updated by mkittler over 5 years ago

This is actually only expected to fail and show this message when the job is aborted by the worker.

When the test execution just ends regularly the broadcast should work (and do nothing if there are no web socket clients). That it is not working here is a bug indeed. Maybe the command server is blocked in that moment waiting for a response from isotovideo? So isotovideo and the command server are waiting for each other?

Actions

Copy link

Updated by okurz over 5 years ago

This should be trivial to reproduce and debug as it only involves isotovideo locally.

Actions

Copy link

Updated by mkittler over 5 years ago

Since I've been dealing with that code today again it reminded me of this issue. I don't think it is easy to reproduce because it only happens sometimes, right?

To answer my own questions: The code takes care to prevent the deadlock situation I've described but it looks like that it wouldn't do that in all places.

Actions

Copy link

Updated by mkittler over 5 years ago

Note that this log line can also simply be caused by the web UI (the liveviewhandler in particular) being unresponsive. Considering recent performance problems on OSD this might have even been the case. Then the problem is actually correctly logged at this place. We could lower the timeout, though. (If this request fails is not a big deal.)

Actions

Copy link

Updated by okurz over 5 years ago

Please read what I wrote carefully. The issue is reproducible every time when calling only isotovideo locally, no web UI available or involved. This has nothing to do with OSD performance problems due to upload in chunks taking long because of a monitoring plugin.

Actions

Copy link

Updated by mkittler over 5 years ago

Status changed from New to In Progress
Assignee set to mkittler
Target version set to Current Sprint

PR: https://github.com/os-autoinst/os-autoinst/pull/1222

Actions

Copy link

Updated by okurz over 5 years ago

Status changed from In Progress to Feedback

merged. So I checked locally with isotovideo (no openQA). I suggest you crosscheck openQA jobs on o3 after tomorrow morning's automatic deployment. Ok?

Actions

Copy link

Updated by mkittler over 5 years ago

Status changed from Feedback to Resolved
Target version changed from Current Sprint to Done

It seems to work.

Actions

Copy link

#10

Updated by okurz over 5 years ago

well, yes, the original error is gone. However the shutdown is still far from speedy, e.g. see

[2019-10-17T23:10:23.857 CEST] [debug] no match: 79.0s, best candidate: grub2-2-20131030 (0.00)
^C[2019-10-17T23:10:24.058 CEST] [debug] signalhandler got INT
can_read received kill signal at ../../myjsonrpc.pm line 91.
[2019-10-17T23:10:24.059 CEST] [debug] killing backend process 10637
[2019-10-17T23:10:24.549 CEST] [debug] WARNING: check_asserted_screen took 0.68 seconds for 32 candidate needles - make your needles more specific
[2019-10-17T23:10:24.549 CEST] [debug] no match: 77.8s, best candidate: grub2-2-20131030 (0.00)
[2019-10-17T23:10:24.575 CEST] [debug] QEMU: qemu-system-x86_64: terminating on signal 2
[2019-10-17T23:10:24.576 CEST] [info] ::: OpenQA::Qemu::Proc::save_state: Saving QEMU state to qemu_state.json
[2019-10-17T23:10:24.581 CEST] [debug] sending magic and exit
Unexpected end of data 0
[2019-10-17T23:10:26.062 CEST] [debug] done with backend process
[2019-10-17T23:10:26.062 CEST] [debug] killing command server 10627 because test execution ended through exception
[2019-10-17T23:10:28.063 CEST] [debug] done with command server
[2019-10-17T23:10:28.063 CEST] [debug] killing autotest process 10630
[2019-10-17T23:10:30.063 CEST] [debug] done with autotest process
10623: EXIT 0

6s just to kill some processes. I think we can still improve this. Also the "Unexpected end of data 0" isn't that nice. The "end of data" should very much be expected when the processes are asked to shutdown :) If you prefer we can handle this in a new ticket though.

Actions

Copy link

#11