Project

General

Profile

Actions

action #57620

closed

job is incomplete if websockets server (or webui?) is unreachable for a minute, e.g. during upgrade

Added by okurz over 4 years ago. Updated about 4 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Regressions/Crashes
Target version:
-
Start date:
2019-10-02
Due date:
2020-04-09
% Done:

0%

Estimated time:

Description

Motivation

After the weekly deployment of OSD I found during the restart of the openQA services that many jobs continued running however jobs that were in state "uploading" were aborted as incomplete, e.g.
https://openqa.suse.de/tests/3425424
I assume that before "worker refactoring in 2019" that behaviour was more resilient and less or no jobs would incomplete in this situation. In any case jobs during upload should retry uploading and the uploads should be accepted over an web UI (or websockets) restart and outage of some seconds.

Further details

logs

autoinst-log.txt and worker-log.txt of the job are attached.

Journal entries on OSD from the restart time window:

Oct 02 11:09:11 openqa systemd[1]: Stopping Handler for live view in openQA's web UI...
Oct 02 11:09:11 openqa openqa-livehandler[9119]: Server available at http://127.0.0.1:9528
Oct 02 11:09:11 openqa openqa-livehandler[9119]: Server available at http://[::1]:9528
Oct 02 11:09:11 openqa systemd[1]: Stopping The openQA web UI...
Oct 02 11:09:11 openqa systemd[1]: Stopping The openQA daemon for various background tasks like cleanup and saving needles...
Oct 02 11:09:11 openqa systemd[1]: Stopped Enqueues an audit event cleanup task for the openQA database every day..
Oct 02 11:09:11 openqa systemd[1]: Stopping Enqueues an audit event cleanup task for the openQA database every day..
Oct 02 11:09:11 openqa systemd[1]: Started Enqueues an audit event cleanup task for the openQA database every day..
Oct 02 11:09:11 openqa systemd[1]: Stopped Handler for live view in openQA's web UI.
Oct 02 11:09:11 openqa systemd[1]: Started Handler for live view in openQA's web UI.
Oct 02 11:09:11 openqa systemd[1]: Stopped The openQA web UI.
Oct 02 11:09:11 openqa systemd[1]: Stopping The openQA WebSockets server...
Oct 02 11:09:13 openqa systemd[1]: Stopped The openQA WebSockets server.
Oct 02 11:09:13 openqa systemd[1]: Stopping The openQA Scheduler...
Oct 02 11:09:13 openqa openqa-scheduler[27611]: Warning: unable to close filehandle GEN5 properly: No space left on device during global destruction.
Oct 02 11:09:13 openqa systemd[1]: Stopped The openQA Scheduler.
Oct 02 11:09:13 openqa systemd[1]: Started The openQA Scheduler.
Oct 02 11:09:13 openqa systemd[1]: Started The openQA WebSockets server.
Oct 02 11:09:13 openqa systemd[1]: Started The openQA web UI.
Oct 02 11:09:15 openqa openqa-livehandler[17859]: BEGIN WORK
Oct 02 11:09:15 openqa openqa-livehandler[17859]: COMMIT
Oct 02 11:09:15 openqa systemd[1]: Stopped The openQA daemon for various background tasks like cleanup and saving needles.
Oct 02 11:09:15 openqa systemd[1]: Started The openQA daemon for various background tasks like cleanup and saving needles.
Oct 02 11:09:16 openqa openqa[17862]: BEGIN WORK
Oct 02 11:09:16 openqa openqa[17862]: COMMIT
Oct 02 11:09:17 openqa openqa[17862]: Server available at http://127.0.0.1:9526
Oct 02 11:09:17 openqa openqa[17862]: Server available at http://[::1]:9526

chat log

This was also discussed in https://chat.suse.de/group/openqa-dev?msg=ckosYAkuaob5coXCE , from there:

mkittler:The worker log shows multiple problems:

  • The job is stopped during the upload phase. Specifically, while still uploading results for the first module isotovideo has already been exited. That makes the log look strange on the first look but it actually shouldn't be a problem.
  • OSD was not reachable for some time during the upload (proxy error / service unavailable) [okurz: this was during the restart of services from osd deployment]. However, it was only unreachable for less than a minutes so the worker didn't give up.
  • In the end of the upload OSD was unreachable again also affecting the WS connection.
  • It is strange that the job is set to done in the middle of the upload:
Okt 02 09:57:26 openqaworker8 worker[18624]: [debug] [pid:8935] Uploading artefact icmp4-uni-basic05-6.txt
Okt 02 09:57:26 openqaworker8 worker[18624]: [debug] [pid:8935] Uploading artefact icmp4-uni-basic05-7.icmp4-uni-basic05.txt
Okt 02 09:57:26 openqaworker8 worker[18624]: [info] [pid:9219] Uploading serial_terminal.txt
Okt 02 09:57:27 openqaworker8 worker[18624]: [info] [pid:9219] Uploading worker-log.txt
Okt 02 09:57:28 openqaworker8 worker[18624]: [debug] [pid:18624] Setting job 3425424 to done
Okt 02 09:57:28 openqaworker8 worker[18624]: [debug] [pid:18624] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3425424/status
Okt 02 09:57:29 openqaworker8 worker[18624]: [debug] [pid:8935] Uploading artefact LTP_net_stress.ipsec_icmp_icmp4-uni-basic05.txt
Okt 02 09:57:30 openqaworker8 worker[18624]: [debug] [pid:8935] Uploading artefact icmp4-uni-basic05-8.txt
Okt 02 09:57:30 openqaworker8 worker[18624]: [debug] [pid:9395] Optimizing /var/lib/openqa/pool/8/testresults/shutdown_ltp-57.png
Okt 02 09:57:30 openqaworker8 worker[18624]: [debug] [pid:9395] Uploading artefact shutdown_ltp-57.png as 88ab69ccbd5248aabe46d7afcef7c3f0
Okt 02 09:57:32 openqaworker8 worker[18624]: [debug] [pid:9395] Optimizing /var/lib/openqa/pool/8/testresults/.thumbs/shutdown_ltp-57.png

So there were upload problems caused by OSD being unavailable and maybe there's also a logic error in the worker code because it shouldn't set the job to done in the middle of the upload. That maybe has not worked anyways and therefore the job has been incompleted by the web UI in the end.
okurz: well, the web UI was restarted due to upgrade, many jobs that were "running" were still continuing. So I guess the problem is "job is incomplete if websockets server (or webui?) is unreachable for a minute, e.g. during upgrade". reopen a ticket or will you create a new one?
mkittler: If the web UI is offline long enough the job is actually expected to be incomplete. So that's actually expected. I'm more wondering about the worker itself because setting the job to done in the middle of the upload seems wrong. But if I don't look in the journal log but just at https://openqa.suse.de/tests/3425424/file/worker-log.txt there are lines missing. E.g. the REST call to set the job to done is not logged here. Maybe it didn't happen after all?
"long enough" is defined by the worker's patience for the web UI before stopping a job. It is also defined by the the web UIs patience to wait for a worker which doesn't respond. The worker has a retry limit for REST calls and waits a certain amount of time between them. If you look at the code you will find the exact numbers. For the web UI there's a variable defining a threshold. I guess it was 2 minutes.


Files

autoinst-log.txt (3.97 KB) autoinst-log.txt okurz, 2019-10-02 11:38
worker-log.txt (23.7 KB) worker-log.txt okurz, 2019-10-02 11:38

Related issues 2 (0 open2 closed)

Related to openQA Project - action #41027: worker disconnects during cleanupResolvedkraih2018-09-14

Actions
Related to openQA Project - coordination #62420: [epic] Distinguish all types of incompletesResolvedokurz2018-12-12

Actions
Actions

Also available in: Atom PDF