Project

General

Profile

Actions

action #57620

closed

job is incomplete if websockets server (or webui?) is unreachable for a minute, e.g. during upgrade

Added by okurz over 4 years ago. Updated about 4 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Regressions/Crashes
Target version:
-
Start date:
2019-10-02
Due date:
2020-04-09
% Done:

0%

Estimated time:

Description

Motivation

After the weekly deployment of OSD I found during the restart of the openQA services that many jobs continued running however jobs that were in state "uploading" were aborted as incomplete, e.g.
https://openqa.suse.de/tests/3425424
I assume that before "worker refactoring in 2019" that behaviour was more resilient and less or no jobs would incomplete in this situation. In any case jobs during upload should retry uploading and the uploads should be accepted over an web UI (or websockets) restart and outage of some seconds.

Further details

logs

autoinst-log.txt and worker-log.txt of the job are attached.

Journal entries on OSD from the restart time window:

Oct 02 11:09:11 openqa systemd[1]: Stopping Handler for live view in openQA's web UI...
Oct 02 11:09:11 openqa openqa-livehandler[9119]: Server available at http://127.0.0.1:9528
Oct 02 11:09:11 openqa openqa-livehandler[9119]: Server available at http://[::1]:9528
Oct 02 11:09:11 openqa systemd[1]: Stopping The openQA web UI...
Oct 02 11:09:11 openqa systemd[1]: Stopping The openQA daemon for various background tasks like cleanup and saving needles...
Oct 02 11:09:11 openqa systemd[1]: Stopped Enqueues an audit event cleanup task for the openQA database every day..
Oct 02 11:09:11 openqa systemd[1]: Stopping Enqueues an audit event cleanup task for the openQA database every day..
Oct 02 11:09:11 openqa systemd[1]: Started Enqueues an audit event cleanup task for the openQA database every day..
Oct 02 11:09:11 openqa systemd[1]: Stopped Handler for live view in openQA's web UI.
Oct 02 11:09:11 openqa systemd[1]: Started Handler for live view in openQA's web UI.
Oct 02 11:09:11 openqa systemd[1]: Stopped The openQA web UI.
Oct 02 11:09:11 openqa systemd[1]: Stopping The openQA WebSockets server...
Oct 02 11:09:13 openqa systemd[1]: Stopped The openQA WebSockets server.
Oct 02 11:09:13 openqa systemd[1]: Stopping The openQA Scheduler...
Oct 02 11:09:13 openqa openqa-scheduler[27611]: Warning: unable to close filehandle GEN5 properly: No space left on device during global destruction.
Oct 02 11:09:13 openqa systemd[1]: Stopped The openQA Scheduler.
Oct 02 11:09:13 openqa systemd[1]: Started The openQA Scheduler.
Oct 02 11:09:13 openqa systemd[1]: Started The openQA WebSockets server.
Oct 02 11:09:13 openqa systemd[1]: Started The openQA web UI.
Oct 02 11:09:15 openqa openqa-livehandler[17859]: BEGIN WORK
Oct 02 11:09:15 openqa openqa-livehandler[17859]: COMMIT
Oct 02 11:09:15 openqa systemd[1]: Stopped The openQA daemon for various background tasks like cleanup and saving needles.
Oct 02 11:09:15 openqa systemd[1]: Started The openQA daemon for various background tasks like cleanup and saving needles.
Oct 02 11:09:16 openqa openqa[17862]: BEGIN WORK
Oct 02 11:09:16 openqa openqa[17862]: COMMIT
Oct 02 11:09:17 openqa openqa[17862]: Server available at http://127.0.0.1:9526
Oct 02 11:09:17 openqa openqa[17862]: Server available at http://[::1]:9526

chat log

This was also discussed in https://chat.suse.de/group/openqa-dev?msg=ckosYAkuaob5coXCE , from there:

mkittler:The worker log shows multiple problems:

  • The job is stopped during the upload phase. Specifically, while still uploading results for the first module isotovideo has already been exited. That makes the log look strange on the first look but it actually shouldn't be a problem.
  • OSD was not reachable for some time during the upload (proxy error / service unavailable) [okurz: this was during the restart of services from osd deployment]. However, it was only unreachable for less than a minutes so the worker didn't give up.
  • In the end of the upload OSD was unreachable again also affecting the WS connection.
  • It is strange that the job is set to done in the middle of the upload:
Okt 02 09:57:26 openqaworker8 worker[18624]: [debug] [pid:8935] Uploading artefact icmp4-uni-basic05-6.txt
Okt 02 09:57:26 openqaworker8 worker[18624]: [debug] [pid:8935] Uploading artefact icmp4-uni-basic05-7.icmp4-uni-basic05.txt
Okt 02 09:57:26 openqaworker8 worker[18624]: [info] [pid:9219] Uploading serial_terminal.txt
Okt 02 09:57:27 openqaworker8 worker[18624]: [info] [pid:9219] Uploading worker-log.txt
Okt 02 09:57:28 openqaworker8 worker[18624]: [debug] [pid:18624] Setting job 3425424 to done
Okt 02 09:57:28 openqaworker8 worker[18624]: [debug] [pid:18624] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3425424/status
Okt 02 09:57:29 openqaworker8 worker[18624]: [debug] [pid:8935] Uploading artefact LTP_net_stress.ipsec_icmp_icmp4-uni-basic05.txt
Okt 02 09:57:30 openqaworker8 worker[18624]: [debug] [pid:8935] Uploading artefact icmp4-uni-basic05-8.txt
Okt 02 09:57:30 openqaworker8 worker[18624]: [debug] [pid:9395] Optimizing /var/lib/openqa/pool/8/testresults/shutdown_ltp-57.png
Okt 02 09:57:30 openqaworker8 worker[18624]: [debug] [pid:9395] Uploading artefact shutdown_ltp-57.png as 88ab69ccbd5248aabe46d7afcef7c3f0
Okt 02 09:57:32 openqaworker8 worker[18624]: [debug] [pid:9395] Optimizing /var/lib/openqa/pool/8/testresults/.thumbs/shutdown_ltp-57.png

So there were upload problems caused by OSD being unavailable and maybe there's also a logic error in the worker code because it shouldn't set the job to done in the middle of the upload. That maybe has not worked anyways and therefore the job has been incompleted by the web UI in the end.
okurz: well, the web UI was restarted due to upgrade, many jobs that were "running" were still continuing. So I guess the problem is "job is incomplete if websockets server (or webui?) is unreachable for a minute, e.g. during upgrade". reopen a ticket or will you create a new one?
mkittler: If the web UI is offline long enough the job is actually expected to be incomplete. So that's actually expected. I'm more wondering about the worker itself because setting the job to done in the middle of the upload seems wrong. But if I don't look in the journal log but just at https://openqa.suse.de/tests/3425424/file/worker-log.txt there are lines missing. E.g. the REST call to set the job to done is not logged here. Maybe it didn't happen after all?
"long enough" is defined by the worker's patience for the web UI before stopping a job. It is also defined by the the web UIs patience to wait for a worker which doesn't respond. The worker has a retry limit for REST calls and waits a certain amount of time between them. If you look at the code you will find the exact numbers. For the web UI there's a variable defining a threshold. I guess it was 2 minutes.


Files

autoinst-log.txt (3.97 KB) autoinst-log.txt okurz, 2019-10-02 11:38
worker-log.txt (23.7 KB) worker-log.txt okurz, 2019-10-02 11:38

Related issues 2 (0 open2 closed)

Related to openQA Project - action #41027: worker disconnects during cleanupResolvedkraih2018-09-14

Actions
Related to openQA Project - coordination #62420: [epic] Distinguish all types of incompletesResolvedokurz2018-12-12

Actions
Actions #1

Updated by okurz over 4 years ago

  • Related to action #41027: worker disconnects during cleanup added
Actions #2

Updated by okurz over 4 years ago

  • Status changed from New to Rejected
  • Assignee set to okurz

wait, I think I got confused. The job I mentioned was actually not from the time I restarted, it's from 1h earlier when /srv was full. While there is a connection I do not think we want to do anything about this. It looks like the incident at 09:57 left jobs in a limbo until we restarted services at 11:09

Actions #3

Updated by okurz over 4 years ago

  • Description updated (diff)
Actions #4

Updated by okurz over 4 years ago

  • Status changed from Rejected to Workable
  • Assignee deleted (okurz)

ok, found the issue that I originally thought of again. E.g. https://openqa.suse.de/tests/3482101 incompleted at the time the weekly deployment of OSD was running: https://gitlab.suse.de/openqa/osd-deployment/pipelines/39158 , in particular https://gitlab.suse.de/openqa/osd-deployment/-/jobs/128318 at the time 2019-10-16T06:56:18Z . The journal entry from the worker:

Oct 16 08:54:01 malbec worker[24170]: [debug] [pid:112492] Uploading artefact install_service-411.txt
Oct 16 08:54:01 malbec worker[24170]: [debug] [pid:112492] Uploading artefact install_service-415.txt
Oct 16 08:54:01 malbec worker[24170]: [debug] [pid:112492] Uploading artefact install_service-416.txt
Oct 16 08:54:11 malbec worker[24170]: [debug] [pid:24170] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3482101/status
Oct 16 08:54:11 malbec worker[24170]: [debug] [pid:112826] Optimizing /var/lib/openqa/pool/1/testresults/reboot_to_upgrade-3.png
Oct 16 08:54:11 malbec worker[24170]: [debug] [pid:112826] Uploading artefact reboot_to_upgrade-3.png as e560b7fb17cf54773cb5dfc9fa786769
Oct 16 08:54:11 malbec worker[24170]: [debug] [pid:112826] Optimizing /var/lib/openqa/pool/1/testresults/.thumbs/reboot_to_upgrade-3.png
Oct 16 08:54:11 malbec worker[24170]: [debug] [pid:112826] Uploading artefact reboot_to_upgrade-3.png as e560b7fb17cf54773cb5dfc9fa786769
Oct 16 08:54:11 malbec worker[24170]: [debug] [pid:112826] Uploading artefact reboot_to_upgrade-4.txt
Oct 16 08:54:21 malbec worker[24170]: [debug] [pid:24170] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3482101/status
Oct 16 08:54:32 malbec worker[24170]: [debug] [pid:24170] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3482101/status
Oct 16 08:54:42 malbec worker[24170]: [debug] [pid:24170] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3482101/status
Oct 16 08:54:52 malbec worker[24170]: [debug] [pid:24170] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3482101/status
Oct 16 08:55:02 malbec worker[24170]: [debug] [pid:24170] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3482101/status
Oct 16 08:55:07 malbec worker[24170]: [error] [pid:24170] Websocket connection to http://openqa.suse.de/api/v1/ws/885 finished by remote side with code 1006, no reason - trying again in 10 seconds
Oct 16 08:55:12 malbec worker[24170]: [debug] [pid:24170] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3482101/status
Oct 16 08:55:17 malbec worker[24170]: [info] [pid:24170] Registering with openQA openqa.suse.de
Oct 16 08:55:18 malbec worker[24170]: [info] [pid:24170] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/885
Oct 16 08:55:22 malbec worker[24170]: [debug] [pid:24170] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3482101/status
Oct 16 08:55:23 malbec worker[24170]: [info] [pid:24170] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 885
Oct 16 08:55:35 malbec worker[24170]: [debug] [pid:24170] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3482101/status
Oct 16 08:55:45 malbec worker[24170]: [debug] [pid:24170] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3482101/status
Oct 16 08:55:55 malbec worker[24170]: [debug] [pid:24170] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3482101/status
Oct 16 08:56:05 malbec worker[24170]: [debug] [pid:24170] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3482101/status
Oct 16 08:56:15 malbec worker[24170]: [info] [pid:24170] Quit due to signal TERM
Oct 16 08:56:15 malbec worker[24170]: [debug] [pid:24170] Stopping job 3482101 from openqa.suse.de: 03482101-sle-15-SP2-Regression-on-Migration-from-SLE15-SPX-to-SLE15-SP2-ppc64le-Build58.1-011_offline_sles15_pscc_basesys-sr>
Oct 16 08:56:15 malbec worker[24170]: [debug] [pid:24170] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3482101/status
Oct 16 08:56:15 malbec systemd[1]: Stopping openQA Worker #1...
Oct 16 08:56:15 malbec worker[24170]: [info] [pid:24170] Trying to stop job gracefully by announcing it to command server via http://localhost:20013/Pn14jGkI2KHHrjpX/broadcast
Oct 16 08:56:17 malbec worker[24170]: [info] [pid:24170] Isotovideo exit status: 0
Oct 16 08:56:17 malbec worker[24170]: [info] [pid:24170] +++ worker notes +++
Oct 16 08:56:17 malbec worker[24170]: [info] [pid:24170] End time: 2019-10-16 06:56:17
Oct 16 08:56:17 malbec worker[24170]: [info] [pid:24170] Result: quit
Oct 16 08:56:17 malbec worker[24170]: [debug] [pid:24170] Job 3482101 stopped as incomplete
Oct 16 08:56:17 malbec worker[24170]: [debug] [pid:24170] duplicating job 3482101
Oct 16 08:56:17 malbec worker[24170]: [debug] [pid:24170] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3482101/duplicate?dup_type_auto=1
Oct 16 08:56:17 malbec worker[24170]: [warn] [pid:24170] Can't open /var/lib/openqa/pool/1/testresults/result-welcome.json for result upload - likely isotovideo could not be started or failed early. Error message: No such fi>
Oct 16 08:56:17 malbec worker[24170]: [debug] [pid:24170] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3482101/status
Oct 16 08:56:17 malbec worker[24170]: [debug] [pid:113654] Optimizing /var/lib/openqa/pool/1/testresults/version_switch_upgrade_target-9.png
Oct 16 08:56:17 malbec worker[24170]: [debug] [pid:113654] Uploading artefact version_switch_upgrade_target-9.png as 45fb1de860b299cf0091d6a49d9bf16a
Oct 16 08:56:17 malbec worker[24170]: [debug] [pid:113654] Optimizing /var/lib/openqa/pool/1/testresults/.thumbs/version_switch_upgrade_target-9.png
Oct 16 08:56:17 malbec worker[24170]: [debug] [pid:113654] Uploading artefact version_switch_upgrade_target-9.png as 45fb1de860b299cf0091d6a49d9bf16a
Oct 16 08:56:18 malbec worker[24170]: [debug] [pid:113654] Uploading artefact version_switch_upgrade_target-1.txt
Oct 16 08:56:18 malbec worker[24170]: [debug] [pid:113654] Uploading artefact isosize-1.txt
Oct 16 08:56:18 malbec worker[24170]: [debug] [pid:24170] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3482101/set_done?result=incomplete
Oct 16 08:56:18 malbec worker[24170]: [debug] [pid:24170] Job 3482101 from openqa.suse.de finished - reason: quit
Oct 16 08:56:18 malbec systemd[1]: Stopped openQA Worker #1.
Oct 16 08:56:18 malbec systemd[1]: Starting openQA Worker #1...
Oct 16 08:56:18 malbec systemd[1]: Started openQA Worker #1.

I can see that on "Oct 16 08:55:07" the websocket server was unreachable – due to the restart after upgrade on osd (confirmed by logs) – and the worker properly reconnected. At "Oct 16 08:56:15" we see that the worker was asked to terminate, probably due the upgrade of the worker at this time. Subsequently the processes shut down what looks like in an orderly procedure including upload attempts of some files. Unfortunately the more important log files "worker-log.txt" and "autoinst-log.txt" are not uploaded. Also there is a warning about a missing "result-welcome.json" file. After the restart of the worker service the next job is started after 9 minutes which makes one think we could have made better use of that time ;)

I have the following ideas for improvement:

  1. There should be no warning about a missing result-welcome.json file
  2. Eagerly try to upload worker-log.txt and autoinst-log.txt as more important files, e.g. try to uploading on TERM with a reasonable timeout, abort the upload if it is taking too long
  3. Try to upload missing files after worker (service) restart
  4. Switch from workers within systemd services on bare metal machines to a container orchestrator and keep old versions of the openQA worker running in pods as long as test jobs run and only start new versions of the openQA worker and backend in new pods
Actions #5

Updated by mkittler over 4 years ago

  1. Ok, but there should generally be warnings when isotovideo produced inconsistent results.
  2. Of course the worker log should also include the upload so it make sense to upload it as last file. But yes, in the TERM case we should still try to upload these files.
  3. That makes sense. We need to make sure that the web UI side still accepts those updates, though.
  4. That's way to complicated in my opinion. If you wanted to achieve the same you could simply put a file into the pool directory. If it is present the worker restarts itself before starting the next job.
Actions #6

Updated by okurz over 4 years ago

https://openqa.suse.de/tests/3648107 is an instance of a related problem. The job recorded no logs but the arm worker openqaworker-arm-2 was really broken yesterday evening and eventually was force power cycled by another automatic action. In this specific case we do not have logs available as the pool directory are on a partition that is recreated on reboot. So I guess in this case we really can not repair the incomplete without logs.

Actions #7

Updated by okurz over 4 years ago

same problem happened again today. arm-2 seems to have severe problems (again). doesn't react to ping nor ipmi sol, caused many incompletes already. will power cycle.

Actions #8

Updated by okurz over 4 years ago

Actions #9

Updated by okurz about 4 years ago

  • Status changed from Workable to Feedback
  • Assignee set to okurz

someone ("root") just rebooted o3 and that caused some incompletes without automatic restarts, e.g. https://openqa.opensuse.org/tests/1221625 showing:

Reason: api failure: Failed to register at http://openqa1-opensuse - 503 response: <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN"> <html><head> <title>503 Service Unavailable</title> </head><body> <h1>Service Unavailable</h1> <p>The server is temporarily unable to service your request due to maintenance downtime or capacity problems. Please try again later.</p> <p>Additionally, a 503 Service Unavailable error was encountered while trying to use an ErrorDocument to handle the request.</p> <hr> <address>Apache/2.4.33 (Linux/SUSE) Server at openqa1-opensuse Port 80</address> </body></html>

and from the worker journal:

Apr 02 08:26:06 openqaworker4 worker[1956]: [info] isotovideo has been started (PID: 15614)
Apr 02 08:28:20 openqaworker4 worker[1956]: [warn] Websocket connection to http://openqa1-opensuse/api/v1/ws/206 finished by remote side with code 1006, no reason - trying again in 10 seconds
Apr 02 08:28:42 openqaworker4 worker[1956]: [error] REST-API error (POST http://openqa1-opensuse/api/v1/jobs/1221625/status): Connection error: Can't connect: Temporary failure in name resolution (remaining tries: 2)
Apr 02 08:28:42 openqaworker4 worker[1956]: [info] Registering with openQA http://openqa1-opensuse
Apr 02 08:28:52 openqaworker4 worker[1956]: [warn] Failed to register at http://openqa1-opensuse - connection error: Can't connect: Temporary failure in name resolution - trying again in 10 seconds
Apr 02 08:29:02 openqaworker4 worker[1956]: [error] REST-API error (POST http://openqa1-opensuse/api/v1/jobs/1221625/status): Connection error: Can't connect: Temporary failure in name resolution (remaining tries: 1)
Apr 02 08:29:02 openqaworker4 worker[1956]: [info] Registering with openQA http://openqa1-opensuse
Apr 02 08:29:12 openqaworker4 worker[1956]: [warn] Failed to register at http://openqa1-opensuse - connection error: Can't connect: Temporary failure in name resolution - trying again in 10 seconds
Apr 02 08:29:22 openqaworker4 worker[1956]: [error] REST-API error (POST http://openqa1-opensuse/api/v1/jobs/1221625/status): Connection error: Can't connect: Temporary failure in name resolution (remaining tries: 0)
Apr 02 08:29:22 openqaworker4 worker[1956]: [error] Aborting job because web UI doesn't accept new images anymore (likely considers this job dead)
Apr 02 08:29:32 openqaworker4 worker[1956]: [error] REST-API error (POST http://openqa1-opensuse/api/v1/jobs/1221625/status): Connection error: Can't connect: Temporary failure in name resolution (remaining tries: 2)
Apr 02 08:29:32 openqaworker4 worker[1956]: [info] Registering with openQA http://openqa1-opensuse
Apr 02 08:29:42 openqaworker4 worker[1956]: [warn] Failed to register at http://openqa1-opensuse - connection error: Can't connect: Temporary failure in name resolution - trying again in 10 seconds
Apr 02 08:29:52 openqaworker4 worker[1956]: [error] REST-API error (POST http://openqa1-opensuse/api/v1/jobs/1221625/status): Connection error: Can't connect: Temporary failure in name resolution (remaining tries: 2)
Apr 02 08:30:02 openqaworker4 worker[1956]: [error] REST-API error (POST http://openqa1-opensuse/api/v1/jobs/1221625/status): Connection error: Can't connect: Temporary failure in name resolution (remaining tries: 1)
Apr 02 08:30:02 openqaworker4 worker[1956]: [info] Registering with openQA http://openqa1-opensuse
Apr 02 08:30:12 openqaworker4 worker[1956]: [warn] Failed to register at http://openqa1-opensuse - connection error: Can't connect: Temporary failure in name resolution - trying again in 10 seconds
Apr 02 08:30:22 openqaworker4 worker[1956]: [error] REST-API error (POST http://openqa1-opensuse/api/v1/jobs/1221625/status): Connection error: Can't connect: Temporary failure in name resolution (remaining tries: 1)
Apr 02 08:30:32 openqaworker4 worker[1956]: [error] REST-API error (POST http://openqa1-opensuse/api/v1/jobs/1221625/status): Connection error: Can't connect: Temporary failure in name resolution (remaining tries: 0)
Apr 02 08:30:32 openqaworker4 worker[1956]: [info] Trying to stop job gracefully by announcing it to command server via http://localhost:20173/6I_RZtzLIWYC2tg2/broadcast
Apr 02 08:30:32 openqaworker4 worker[1956]: [info] Registering with openQA http://openqa1-opensuse
Apr 02 08:30:42 openqaworker4 worker[1956]: [warn] Failed to register at http://openqa1-opensuse - connection error: Can't connect: Temporary failure in name resolution - trying again in 10 seconds
Apr 02 08:30:52 openqaworker4 worker[1956]: [error] REST-API error (POST http://openqa1-opensuse/api/v1/jobs/1221625/status): Connection error: Can't connect: Temporary failure in name resolution (remaining tries: 0)
Apr 02 08:30:52 openqaworker4 worker[1956]: [error] Unable to make final image uploads. Maybe the web UI considers this job already dead.
Apr 02 08:30:52 openqaworker4 worker[1956]: [info] Registering with openQA http://openqa1-opensuse
Apr 02 08:30:57 openqaworker4 worker[1956]: [warn] Failed to register at http://openqa1-opensuse - 503 response: <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
Apr 02 08:30:57 openqaworker4 worker[1956]: <html><head>
Apr 02 08:30:57 openqaworker4 worker[1956]: <title>503 Service Unavailable</title>
Apr 02 08:30:57 openqaworker4 worker[1956]: </head><body>
Apr 02 08:30:57 openqaworker4 worker[1956]: <h1>Service Unavailable</h1>
Apr 02 08:30:57 openqaworker4 worker[1956]: <p>The server is temporarily unable to service your
Apr 02 08:30:57 openqaworker4 worker[1956]: request due to maintenance downtime or capacity
Apr 02 08:30:57 openqaworker4 worker[1956]: problems. Please try again later.</p>
Apr 02 08:30:57 openqaworker4 worker[1956]: <p>Additionally, a 503 Service Unavailable
Apr 02 08:30:57 openqaworker4 worker[1956]: error was encountered while trying to use an ErrorDocument to handle the request.</p>
Apr 02 08:30:57 openqaworker4 worker[1956]: <hr>
Apr 02 08:30:57 openqaworker4 worker[1956]: <address>Apache/2.4.33 (Linux/SUSE) Server at openqa1-opensuse Port 80</address>
Apr 02 08:30:57 openqaworker4 worker[1956]: </body></html>
Apr 02 08:30:57 openqaworker4 worker[1956]:  - trying again in 10 seconds
Apr 02 08:30:57 openqaworker4 worker[1956]: [info] Unable to stop the command server gracefully:
Apr 02 08:30:57 openqaworker4 worker[1956]: [info] Command server likely not reachable at all
Apr 02 08:30:59 openqaworker4 worker[1956]: [info] Isotovideo exit status: 0
Apr 02 08:31:06 openqaworker4 worker[1956]: [info] +++ worker notes +++
Apr 02 08:31:06 openqaworker4 worker[1956]: [info] End time: 2020-04-02 06:31:06
Apr 02 08:31:06 openqaworker4 worker[1956]: [info] Result: api-failure
Apr 02 08:31:07 openqaworker4 worker[1956]: [info] Registering with openQA http://openqa1-opensuse
Apr 02 08:31:25 openqaworker4 worker[1956]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/206

maybe we can just try more often to connect from worker to web UI host:

https://github.com/os-autoinst/openQA/pull/2895

Actions #10

Updated by okurz about 4 years ago

  • Due date set to 2020-04-09

merged. We can await the next automatic nightly deployment of o3 and osd and then also check for reboot of o3 if jobs do not end up as non-retriggered incompletes.

Actions #11

Updated by mkittler about 4 years ago

Note that the ticket title says "if websockets server (or webui?) is unreachable". The PR only concerns the web UI and also not everything (e.g. registration). However, when I remember correctly it should nevertheless be sufficient because when the web socket connection drops the only thing that will happen is that the worker tries to re-register and to establish the connection again. If any of this fails there will be another attempt. This might continue endlessly without affecting the current job. It would still make sense to test locally whether that's actually happening.

Actions #12

Updated by mkittler about 4 years ago

Out of curiosity I've tested this locally and it works (with a completely unreachable web UI and ws server). It attempts to re-register every 10 seconds but keeps running the job. With that high number of retries it is definitely possible to keep web UI and ws server offline for several minutes. In the final uploading phase the missing test modules are uploaded again so there are also no results missing.

Actions #13

Updated by okurz about 4 years ago

thanks for verification. I will wait some more days and see how this fares on o3.

Actions #14

Updated by okurz about 4 years ago

  • Status changed from Feedback to Resolved

No problem observed on o3 and I hope the original idea is accomplished here.

Actions

Also available in: Atom PDF