Project

General

Profile

Actions

action #54680

closed

Worker re-registration after websocket disconnects causes incompletes

Added by mkittler almost 5 years ago. Updated over 4 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2019-07-26
Due date:
% Done:

0%

Estimated time:

Description

Jul 26 11:36:05 openqaworker8 worker[64627]: [debug] [pid:64627] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3136372/status
Jul 26 11:36:06 openqaworker8 worker[64627]: [debug] [pid:64627] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3136372/status
Jul 26 11:36:07 openqaworker8 worker[64627]: [error] [pid:64627] Websocket connection to http://openqa.suse.de/api/v1/ws/1275 finished by remote side with code 1006, no reason - trying again in 10 seconds
Jul 26 11:36:07 openqaworker8 worker[64627]: [debug] [pid:64627] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3136372/status
Jul 26 11:36:08 openqaworker8 worker[64627]: [debug] [pid:64627] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3136372/status
Jul 26 11:36:09 openqaworker8 worker[64627]: [debug] [pid:64627] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3136372/status
Jul 26 11:36:10 openqaworker8 worker[64627]: [debug] [pid:64627] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3136372/status
Jul 26 11:36:11 openqaworker8 worker[64627]: [debug] [pid:64627] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3136372/status
Jul 26 11:36:12 openqaworker8 worker[64627]: [debug] [pid:64627] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3136372/status
Jul 26 11:36:13 openqaworker8 worker[64627]: [debug] [pid:64627] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3136372/status
Jul 26 11:36:14 openqaworker8 worker[64627]: [debug] [pid:64627] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3136372/status
Jul 26 11:36:15 openqaworker8 worker[64627]: [debug] [pid:64627] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3136372/status
Jul 26 11:36:17 openqaworker8 worker[64627]: [debug] [pid:64627] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3136372/status
Jul 26 11:36:17 openqaworker8 worker[64627]: [info] [pid:64627] Registering with openQA openqa.suse.de
Jul 26 11:36:17 openqaworker8 worker[64627]: [info] [pid:64627] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/1275
Jul 26 11:36:17 openqaworker8 worker[64627]: [info] [pid:64627] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 1275
Jul 26 11:36:18 openqaworker8 worker[64627]: [debug] [pid:64627] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3136372/status
Jul 26 11:36:18 openqaworker8 worker[64627]: [error] [pid:64627] 400 response: Got status update for job 3136372 with unexpected worker ID 1275 (expected none) (remaining tries: 2)
Jul 26 11:36:23 openqaworker8 worker[64627]: [error] [pid:64627] 400 response: Got status update for job 3136372 with unexpected worker ID 1275 (expected none) (remaining tries: 1)
Jul 26 11:36:28 openqaworker8 worker[64627]: [error] [pid:64627] 400 response: Got status update for job 3136372 with unexpected worker ID 1275 (expected none) (remaining tries: 0)
Jul 26 11:36:28 openqaworker8 worker[64627]: [debug] [pid:64627] Stopping job 3136372 from openqa.suse.de: 03136372-sle-15-SP1-Installer-DVD-x86_64-Build228.2-zluo-poo_50336@64bit - reason: api-failure
Jul 26 11:36:28 openqaworker8 worker[64627]: [error] [pid:64627] Aborting job because web UI doesn't accept new images anymore (likely considers this job dead)
Jul 26 11:36:28 openqaworker8 worker[64627]: [debug] [pid:64627] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3136372/status
Jul 26 11:36:28 openqaworker8 worker[64627]: [error] [pid:64627] 400 response: Got status update for job 3136372 with unexpected worker ID 1275 (expected none) (remaining tries: 2)
Jul 26 11:36:29 openqaworker8 worker[64627]: [warn] [pid:64627] Can't open /var/lib/openqa/pool/1/testresults/result-consoletest_setup.json for result upload - likely isotovideo could not be started or failed early. Error message: No su>
Jul 26 11:36:29 openqaworker8 worker[64627]: [debug] [pid:64627] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3136372/status
Jul 26 11:36:29 openqaworker8 worker[64627]: [error] [pid:64627] 400 response: Got status update for job 3136372 with unexpected worker ID 1275 (expected none) (remaining tries: 2)
Jul 26 11:36:33 openqaworker8 worker[64627]: [error] [pid:64627] 400 response: Got status update for job 3136372 with unexpected worker ID 1275 (expected none) (remaining tries: 1)
Jul 26 11:36:34 openqaworker8 worker[64627]: [error] [pid:64627] 400 response: Got status update for job 3136372 with unexpected worker ID 1275 (expected none) (remaining tries: 1)
Jul 26 11:36:38 openqaworker8 worker[64627]: [error] [pid:64627] 400 response: Got status update for job 3136372 with unexpected worker ID 1275 (expected none) (remaining tries: 0)
Jul 26 11:36:38 openqaworker8 worker[64627]: [info] [pid:64627] Trying to stop job gracefully by announcing it to command server via http://localhost:20013/2rO1xssSrlFHWKfa/broadcast
Jul 26 11:36:42 openqaworker8 worker[64627]: [info] [pid:64627] Isotovideo exit status: 1
Jul 26 11:36:47 openqaworker8 worker[64627]: [info] [pid:64627] +++ worker notes +++
Jul 26 11:36:47 openqaworker8 worker[64627]: [info] [pid:64627] End time: 2019-07-26 09:36:47
Jul 26 11:36:47 openqaworker8 worker[64627]: [info] [pid:64627] Result: api-failure
Jul 26 11:36:47 openqaworker8 worker[64627]: [debug] [pid:64627] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3136372/set_done?result=incomplete
Jul 26 11:36:47 openqaworker8 worker[64627]: [error] [pid:64627] 400 response: Got status update for job 3136372 with unexpected worker ID 1275 (expected none) (remaining tries: 0)
Jul 26 11:36:47 openqaworker8 worker[64627]: [error] [pid:64627] Unable to make final image uploads. Maybe the web UI considers this job already dead.
Jul 26 11:36:47 openqaworker8 worker[64627]: [info] [pid:64627] Registering with openQA openqa.suse.de
Jul 26 11:36:47 openqaworker8 worker[64627]: [info] [pid:64627] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/1275
Jul 26 11:36:47 openqaworker8 worker[64627]: [debug] [pid:64627] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3136372/set_done
Jul 26 11:36:47 openqaworker8 worker[64627]: [debug] [pid:64627] Websocket connection to http://openqa.suse.de/api/v1/ws/1275 finished from our side.
Jul 26 11:36:47 openqaworker8 worker[64627]: [info] [pid:64627] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 1275
Jul 26 11:36:47 openqaworker8 worker[64627]: [debug] [pid:64627] Job 3136372 from openqa.suse.de finished - reason: api-failure
Jul 26 11:36:47 openqaworker8 worker[64627]: [debug] [pid:64627] Cleaning up for next job
Jul 26 11:38:53 openqaworker8 worker[64627]: [warn] [pid:64627] Ignoring WS message from openqa.suse.de with type livelog_stop and job ID 3136372 (currently not executing a job):
Jul 26 11:38:53 openqaworker8 worker[64627]: { jobid => 3136372, type => "livelog_stop" }
Jul 26 11:38:54 openqaworker8 worker[64627]: [warn] [pid:64627] Ignoring WS message from openqa.suse.de with type livelog_stop and job ID 3136372 (currently not executing a job):
Jul 26 11:38:54 openqaworker8 worker[64627]: { jobid => 3136372, type => "livelog_stop" }
Jul 26 11:38:54 openqaworker8 worker[64627]: [warn] [pid:64627] Ignoring WS message from openqa.suse.de with type livelog_stop and job ID 3136372 (currently not executing a job):
Jul 26 11:38:54 openqaworker8 worker[64627]: { jobid => 3136372, type => "livelog_stop" }
Jul 26 11:38:55 openqaworker8 worker[64627]: [warn] [pid:64627] Ignoring WS message from openqa.suse.de with type livelog_stop and job ID 3136372 (currently not executing a job):
Jul 26 11:38:55 openqaworker8 worker[64627]: { jobid => 3136372, type => "livelog_stop" }
Jul 26 11:38:57 openqaworker8 worker[64627]: [warn] [pid:64627] Ignoring WS message from openqa.suse.de with type livelog_stop and job ID 3136372 (currently not executing a job):
Jul 26 11:38:57 openqaworker8 worker[64627]: { jobid => 3136372, type => "livelog_stop" }
Jul 26 11:39:06 openqaworker8 worker[64627]: [warn] [pid:64627] Ignoring WS message from openqa.suse.de with type livelog_stop and job ID 3136372 (currently not executing a job):
Jul 26 11:39:06 openqaworker8 worker[64627]: { jobid => 3136372, type => "livelog_stop" }
Jul 26 11:39:06 openqaworker8 worker[64627]: [warn] [pid:64627] Ignoring WS message from openqa.suse.de with type livelog_stop and job ID 3136372 (currently not executing a job):
Jul 26 11:39:06 openqaworker8 worker[64627]: { jobid => 3136372, type => "livelog_stop" }

To simplify the worker/web UI communication and to prevent having job in 'running' state which are no longer actually worked on the worker now tries to re-register after loosing its web socket connection or after too many fatal API errors.

Unfortunately the re-registration in the middle of the job execution leads to the behavior above where the web UI rejects any further status updates and in consequence the worker abandons the job. Interestingly the websocket server still tries to send messages for that job.

I suppose the web UI should not instantly refuse any further updates when the worker registers again. However, we still need to take care not to have any jobs in perpetual 'running' state. It would also be nice if the websocket server would behave accordingly if the web UI gives up on a job although the worker seems to be good at ignoring those web socket messages.

Actions

Also available in: Atom PDF