Project

General

Profile

Actions

action #57707

closed

isotovideo fails to terminate cleanly, message "isotovideo: unable to inform websocket clients about stopping command server: Request timeout", regression from 4cd4af2b

Added by okurz over 4 years ago. Updated over 4 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2019-10-04
Due date:
% Done:

0%

Estimated time:

Description

Observation

I have seen the same error message when starting isotovideo locally as well as within openQA jobs. From a local run of isotovideo only:

[2019-10-04T14:55:04.861 CEST] [debug] done with autotest process
[2019-10-04T14:55:04.862 CEST] [debug] killing command server 10808 because test execution ended
[2019-10-04T14:55:04.862 CEST] [debug] isotovideo: informing websocket clients before stopping command server: http://127.0.0.1:20013/3HbEX_vD1nTThFUd/broadcast
[2019-10-04T14:55:19.877 CEST] [debug] isotovideo: unable to inform websocket clients about stopping command server: Request timeout at /home/okurz/local/os-autoinst/os-autoinst/isotovideo line 175.

[2019-10-04T14:55:21.425 CEST] [debug] commands process exited: 0

See the (wasted) time of 17s due to the timeout. In this case there simply are no websocket clients to inform so the whole operation should not even be tried.


Related issues 3 (0 open3 closed)

Related to openQA Project - action #45191: developer mode: error message just when clicking "Cancel job"Resolvedmkittler2018-11-22

Actions
Related to openQA Project - coordination #65118: [epic] multimachine test fails with symptoms "websocket refusing connection" and other unclear reasonsResolvedokurz2020-04-012020-09-30

Actions
Copied to openQA Project - action #58379: isotovideo is slow to shutdown / error messages on proper shutdownResolvedokurz2019-10-042020-04-14

Actions
Actions #1

Updated by okurz over 4 years ago

  • Related to action #45191: developer mode: error message just when clicking "Cancel job" added
Actions #2

Updated by mkittler over 4 years ago

This is actually only expected to fail and show this message when the job is aborted by the worker.

When the test execution just ends regularly the broadcast should work (and do nothing if there are no web socket clients). That it is not working here is a bug indeed. Maybe the command server is blocked in that moment waiting for a response from isotovideo? So isotovideo and the command server are waiting for each other?

Actions #3

Updated by okurz over 4 years ago

This should be trivial to reproduce and debug as it only involves isotovideo locally.

Actions #4

Updated by mkittler over 4 years ago

Since I've been dealing with that code today again it reminded me of this issue. I don't think it is easy to reproduce because it only happens sometimes, right?

To answer my own questions: The code takes care to prevent the deadlock situation I've described but it looks like that it wouldn't do that in all places.

Actions #5

Updated by mkittler over 4 years ago

Note that this log line can also simply be caused by the web UI (the liveviewhandler in particular) being unresponsive. Considering recent performance problems on OSD this might have even been the case. Then the problem is actually correctly logged at this place. We could lower the timeout, though. (If this request fails is not a big deal.)

Actions #6

Updated by okurz over 4 years ago

Please read what I wrote carefully. The issue is reproducible every time when calling only isotovideo locally, no web UI available or involved. This has nothing to do with OSD performance problems due to upload in chunks taking long because of a monitoring plugin.

Actions #7

Updated by mkittler over 4 years ago

  • Status changed from New to In Progress
  • Assignee set to mkittler
  • Target version set to Current Sprint
Actions #8

Updated by okurz over 4 years ago

  • Status changed from In Progress to Feedback

merged. So I checked locally with isotovideo (no openQA). I suggest you crosscheck openQA jobs on o3 after tomorrow morning's automatic deployment. Ok?

Actions #9

Updated by mkittler over 4 years ago

  • Status changed from Feedback to Resolved
  • Target version changed from Current Sprint to Done

It seems to work.

Actions #10

Updated by okurz over 4 years ago

well, yes, the original error is gone. However the shutdown is still far from speedy, e.g. see

[2019-10-17T23:10:23.857 CEST] [debug] no match: 79.0s, best candidate: grub2-2-20131030 (0.00)
^C[2019-10-17T23:10:24.058 CEST] [debug] signalhandler got INT
can_read received kill signal at ../../myjsonrpc.pm line 91.
[2019-10-17T23:10:24.059 CEST] [debug] killing backend process 10637
[2019-10-17T23:10:24.549 CEST] [debug] WARNING: check_asserted_screen took 0.68 seconds for 32 candidate needles - make your needles more specific
[2019-10-17T23:10:24.549 CEST] [debug] no match: 77.8s, best candidate: grub2-2-20131030 (0.00)
[2019-10-17T23:10:24.575 CEST] [debug] QEMU: qemu-system-x86_64: terminating on signal 2
[2019-10-17T23:10:24.576 CEST] [info] ::: OpenQA::Qemu::Proc::save_state: Saving QEMU state to qemu_state.json
[2019-10-17T23:10:24.581 CEST] [debug] sending magic and exit
Unexpected end of data 0
[2019-10-17T23:10:26.062 CEST] [debug] done with backend process
[2019-10-17T23:10:26.062 CEST] [debug] killing command server 10627 because test execution ended through exception
[2019-10-17T23:10:28.063 CEST] [debug] done with command server
[2019-10-17T23:10:28.063 CEST] [debug] killing autotest process 10630
[2019-10-17T23:10:30.063 CEST] [debug] done with autotest process
10623: EXIT 0

6s just to kill some processes. I think we can still improve this. Also the "Unexpected end of data 0" isn't that nice. The "end of data" should very much be expected when the processes are asked to shutdown :) If you prefer we can handle this in a new ticket though.

Actions #11

Updated by mkittler over 4 years ago

Yes, it is a different delay so better make a new ticket.

Actions #12

Updated by okurz over 4 years ago

  • Copied to action #58379: isotovideo is slow to shutdown / error messages on proper shutdown added
Actions #13

Updated by okurz over 3 years ago

  • Related to coordination #65118: [epic] multimachine test fails with symptoms "websocket refusing connection" and other unclear reasons added
Actions

Also available in: Atom PDF