Project

General

Profile

action #48422

Workers stay in inconsistent job relationship

Added by coolo over 1 year ago. Updated 11 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Concrete Bugs
Target version:
Start date:
2019-02-26
Due date:
% Done:

0%

Estimated time:
Difficulty:
Duration:

Description

This is clearly related to the stuck workers, but I'm creating an extra ticket, so we can see if there are other problems.

this is grenache-1:12

Feb 25 14:22:49 grenache-1 worker[162749]: [debug] [pid:162749] POST http://openqa.suse.de/api/v1/jobs/2495918/status
Feb 25 14:22:59 grenache-1 worker[162749]: [debug] [pid:162749] updating status
Feb 25 14:22:59 grenache-1 worker[162749]: [debug] [pid:162749] POST http://openqa.suse.de/api/v1/jobs/2495918/status
Feb 25 14:23:09 grenache-1 worker[162749]: [debug] [pid:162749] updating status
Feb 25 14:27:17 grenache-1 worker[162749]: [debug] [pid:162749] POST http://openqa.suse.de/api/v1/jobs/2495918/status
Feb 25 14:27:17 grenache-1 worker[162749]: [debug] [pid:162749] Refusing 'grab_job', we are already performing 2495918
Feb 25 14:27:17 grenache-1 worker[162749]: [debug] [pid:162749] Sending worker status to openqa.chell.qa.suse.de (workerstatus timer)
Feb 25 14:27:17 grenache-1 worker[162749]: [debug] [pid:162749] Sending worker status to openqa.suse.de (workerstatus timer)
Feb 25 14:27:17 grenache-1 worker[162749]: [debug] [pid:162749] Population for openqa.chell.qa.suse.de is 43
Feb 25 14:27:17 grenache-1 worker[162749]: [debug] [pid:162749] ## changing timer workerstatus-openqa.chell.qa.suse.de
Feb 25 14:27:17 grenache-1 worker[162749]: [debug] [pid:162749] ## removing timer workerstatus-openqa.chell.qa.suse.de
Feb 25 14:27:17 grenache-1 worker[162749]: [debug] [pid:162749] ## adding timer workerstatus-openqa.chell.qa.suse.de 64.66
Feb 25 14:27:17 grenache-1 worker[162749]: [debug] [pid:162749] Population for openqa.suse.de is 352
Feb 25 14:27:17 grenache-1 worker[162749]: [debug] [pid:162749] ## changing timer workerstatus-openqa.suse.de
Feb 25 14:27:17 grenache-1 worker[162749]: [debug] [pid:162749] ## removing timer workerstatus-openqa.suse.de
Feb 25 14:27:17 grenache-1 worker[162749]: [debug] [pid:162749] ## adding timer workerstatus-openqa.suse.de 100.00
Feb 25 14:27:17 grenache-1 worker[162749]: [error] [pid:162749] 400 response: Bad Request (remaining tries: 2)
Feb 25 14:27:17 grenache-1 worker[162749]: [debug] [pid:162749] ## adding timer  5
Feb 25 14:27:19 grenache-1 worker[162749]: [info] [pid:162749] Isotovideo exit status: 1
Feb 25 14:27:19 grenache-1 worker[162749]: [debug] [pid:162749] stop_job died
Feb 25 14:27:19 grenache-1 worker[162749]: [debug] [pid:162749] ## removing timer update_status
Feb 25 14:27:19 grenache-1 worker[162749]: [debug] [pid:162749] ## removing timer job_timeout
Feb 25 14:27:19 grenache-1 worker[162749]: [debug] [pid:162749] postpone stopping until ongoing status update is concluded
Feb 25 14:27:20 grenache-1 worker[162749]: [debug] [pid:162749] postpone stopping until ongoing status update is concluded
Feb 25 14:27:21 grenache-1 worker[162749]: [debug] [pid:162749] postpone stopping until ongoing status update is concluded
Feb 25 14:27:22 grenache-1 worker[162749]: [debug] [pid:162749] postpone stopping until ongoing status update is concluded
Feb 25 14:27:22 grenache-1 worker[162749]: [error] [pid:162749] 400 response: Bad Request (remaining tries: 1)
Feb 25 14:27:22 grenache-1 worker[162749]: [debug] [pid:162749] ## adding timer  5
Feb 25 14:27:23 grenache-1 worker[162749]: [debug] [pid:162749] postpone stopping until ongoing status update is concluded
Feb 25 14:27:24 grenache-1 worker[162749]: [debug] [pid:162749] postpone stopping until ongoing status update is concluded
Feb 25 14:27:25 grenache-1 worker[162749]: [debug] [pid:162749] postpone stopping until ongoing status update is concluded
Feb 25 14:27:26 grenache-1 worker[162749]: [debug] [pid:162749] postpone stopping until ongoing status update is concluded
Feb 25 14:27:27 grenache-1 worker[162749]: [debug] [pid:162749] postpone stopping until ongoing status update is concluded
Feb 25 14:27:27 grenache-1 worker[162749]: [error] [pid:162749] 400 response: Bad Request (remaining tries: 0)
Feb 25 14:27:27 grenache-1 worker[162749]: [debug] [pid:162749] stop_job called after job has already been asked to stop (reason: api-failure)
Feb 25 14:27:27 grenache-1 worker[162749]: [error] [pid:162749] Job aborted because web UI doesn't accept updates anymore (likely considers this job dead)
Feb 25 14:27:27 grenache-1 worker[162749]: [debug] [pid:162749] stop_job called after job has already been asked to stop (reason: api-failure)
Feb 25 14:27:28 grenache-1 worker[162749]: [debug] [pid:162749] POST http://openqa.suse.de/api/v1/jobs/2495918/status
Feb 25 14:27:28 grenache-1 worker[162749]: [error] [pid:162749] 400 response: Bad Request (remaining tries: 2)
Feb 25 14:27:28 grenache-1 worker[162749]: [debug] [pid:162749] ## adding timer  5
Feb 25 14:27:33 grenache-1 worker[162749]: [error] [pid:162749] 400 response: Bad Request (remaining tries: 1)
Feb 25 14:27:33 grenache-1 worker[162749]: [debug] [pid:162749] ## adding timer  5
Feb 25 14:27:38 grenache-1 worker[162749]: [error] [pid:162749] 400 response: Bad Request (remaining tries: 0)
Feb 25 14:27:38 grenache-1 worker[162749]: [debug] [pid:162749] stop_job called after job has already been asked to stop (reason: api-failure)
Feb 25 14:27:38 grenache-1 worker[162749]: [info] [pid:162749] trying to stop job gracefully by announcing it to command server via http://localhost:20123/eAiGNZ93WVaDT9VF/broadcast
Feb 25 14:27:38 grenache-1 worker[162749]: [error] [pid:162749] unable to stop the command server gracefully:
Feb 25 14:27:38 grenache-1 worker[162749]: [error] [pid:162749] command server likely not reachable at all
Feb 25 14:27:38 grenache-1 worker[162749]: [info] [pid:162749] +++ worker notes +++
Feb 25 14:27:38 grenache-1 worker[162749]: [info] [pid:162749] end time: 2019-02-25 13:27:38
Feb 25 14:27:38 grenache-1 worker[162749]: [info] [pid:162749] result: died
Feb 25 14:27:38 grenache-1 worker[162749]: [info] [pid:162749] uploading video.ogv
Feb 25 14:27:39 grenache-1 worker[162749]: [error] [pid:162749] ERROR video.ogv: 404 response: Not Found
Feb 25 14:27:39 grenache-1 worker[162749]: [debug] [pid:162749] job 2495918 incomplete
Feb 25 14:27:39 grenache-1 worker[162749]: can't open /var/lib/openqa/pool/12/testresults/result-svirt_upload_assets.json: No such file or directory at /usr/share/openqa/script/../lib/OpenQA/Worker/Jobs.pm line 973.
Feb 25 14:27:39 grenache-1 worker[162749]: [debug] [pid:162749] POST http://openqa.suse.de/api/v1/jobs/2495918/status
Feb 25 14:27:39 grenache-1 worker[162749]: [error] [pid:162749] 400 response: Bad Request (remaining tries: 2)
Feb 25 14:27:39 grenache-1 worker[162749]: [debug] [pid:162749] ## adding timer  5
Feb 25 14:27:44 grenache-1 worker[162749]: [error] [pid:162749] 400 response: Bad Request (remaining tries: 1)
Feb 25 14:27:44 grenache-1 worker[162749]: [debug] [pid:162749] ## adding timer  5
Feb 25 14:27:49 grenache-1 worker[162749]: [error] [pid:162749] 400 response: Bad Request (remaining tries: 0)
Feb 25 14:27:49 grenache-1 worker[162749]: [debug] [pid:162749] stop_job called after job has already been asked to stop (reason: api-failure)
Feb 25 14:27:49 grenache-1 worker[162749]: [error] [pid:162749] Job aborted because web UI doesn't accept updates anymore (likely considers this job dead)
Feb 25 14:27:49 grenache-1 worker[162749]: [debug] [pid:162749] stop_job called after job has already been asked to stop (reason: api-failure)
Feb 25 14:28:21 grenache-1 worker[162749]: [debug] [pid:162749] Sending worker status to openqa.chell.qa.suse.de (workerstatus timer)

After this the worker is lost, because it's marked as free for the scheduler, but the WS service won't send anything:

[2019-02-26T10:00:45.0020 CET] [debug] [pid:27889] Received from worker "1243" worker_status message "$VAR1 = {
          'status' => 'working',
          'type' => 'worker_status',
          'job' => {
                     'id' => 2495918,
                     'assigned_worker_id' => '1243',
                     't_started' => '',
                     'name' => 'sle-15-SP1-Installer-DVD-s390x-Build177.1-create_hdd_ha_textmode@s390x-kvm-sle12',

...

[2019-02-26T10:00:45.0023 CET] [debug] [pid:27889] Updating worker seen from worker_status
[2019-02-26T10:00:45.0029 CET] [debug] [pid:27889] Found Job(2496347) in DB from worker_status update sent by Worker(1243)
[2019-02-26T10:00:45.0029 CET] [debug] [pid:27889] Received request has id: 2495918
[2019-02-26T10:00:45.0035 CET] [debug] [pid:27889] Worker(1243) for Job(2496347) has token JyckI5bhlcgXNUSg
[2019-02-26T10:00:45.0035 CET] [debug] [pid:27889] Received request has token: eAiGNZ93WVaDT9VF
[2019-02-26T10:00:45.0037 CET] [debug] [pid:27889] Possibly worker 1243 should be freed.
[2019-02-26T10:00:45.0048 CET] [debug] [pid:27889] Job '2496347' reset to 'scheduled' state

Related issues

Is duplicate of openQA Project - action #42980: job stayed in assigned but is deadRejected2018-10-27

History

#1 Updated by mkittler over 1 year ago

  • Status changed from New to In Progress
  • Assignee set to mkittler
  • Target version set to Current Sprint - kernel

Let's go though it step-by-step. Everything looks fine until Refusing 'grab_job', we are already performing 2495918:

Feb 25 14:22:59 grenache-1 worker[162749]: [debug] [pid:162749] POST http://openqa.suse.de/api/v1/jobs/2495918/status
Feb 25 14:23:09 grenache-1 worker[162749]: [debug] [pid:162749] updating status
Feb 25 14:27:17 grenache-1 worker[162749]: [debug] [pid:162749] POST http://openqa.suse.de/api/v1/jobs/2495918/status
Feb 25 14:27:17 grenache-1 worker[162749]: [debug] [pid:162749] Refusing 'grab_job', we are already performing 2495918

The worker is most likely right to say that it is performing job 2495918. That it received the 'grab_job' looks like a failure on the remote side.

Then the worker tries to send its status to all web UI hosts. Not wrong, right?

Feb 25 14:27:17 grenache-1 worker[162749]: [debug] [pid:162749] Sending worker status to openqa.chell.qa.suse.de (workerstatus timer)
Feb 25 14:27:17 grenache-1 worker[162749]: [debug] [pid:162749] Sending worker status to openqa.suse.de (workerstatus timer)
Feb 25 14:27:17 grenache-1 worker[162749]: [debug] [pid:162749] Population for openqa.chell.qa.suse.de is 43
Feb 25 14:27:17 grenache-1 worker[162749]: [debug] [pid:162749] ## changing timer workerstatus-openqa.chell.qa.suse.de
Feb 25 14:27:17 grenache-1 worker[162749]: [debug] [pid:162749] ## removing timer workerstatus-openqa.chell.qa.suse.de
Feb 25 14:27:17 grenache-1 worker[162749]: [debug] [pid:162749] ## adding timer workerstatus-openqa.chell.qa.suse.de 64.66
Feb 25 14:27:17 grenache-1 worker[162749]: [debug] [pid:162749] Population for openqa.suse.de is 352
Feb 25 14:27:17 grenache-1 worker[162749]: [debug] [pid:162749] ## changing timer workerstatus-openqa.suse.de
Feb 25 14:27:17 grenache-1 worker[162749]: [debug] [pid:162749] ## removing timer workerstatus-openqa.suse.de
Feb 25 14:27:17 grenache-1 worker[162749]: [debug] [pid:162749] ## adding timer workerstatus-openqa.suse.de 100.00
Feb 25 14:27:17 grenache-1 worker[162749]: [error] [pid:162749] 400 response: Bad Request (remaining tries: 2)
Feb 25 14:27:17 grenache-1 worker[162749]: [debug] [pid:162749] ## adding timer  5

Then it fails on some request with error 400 but it is not clear what the request was. But likely just some of the status updates.

When isotovideo exits the worker correctly postpones handling that until the previously started status update is concluded (with an error after running out of retry-attempts):

Feb 25 14:27:19 grenache-1 worker[162749]: [info] [pid:162749] Isotovideo exit status: 1
Feb 25 14:27:19 grenache-1 worker[162749]: [debug] [pid:162749] stop_job died
Feb 25 14:27:19 grenache-1 worker[162749]: [debug] [pid:162749] ## removing timer update_status
Feb 25 14:27:19 grenache-1 worker[162749]: [debug] [pid:162749] ## removing timer job_timeout
Feb 25 14:27:19 grenache-1 worker[162749]: [debug] [pid:162749] postpone stopping until ongoing status update is concluded
Feb 25 14:27:20 grenache-1 worker[162749]: [debug] [pid:162749] postpone stopping until ongoing status update is concluded
Feb 25 14:27:21 grenache-1 worker[162749]: [debug] [pid:162749] postpone stopping until ongoing status update is concluded
Feb 25 14:27:22 grenache-1 worker[162749]: [debug] [pid:162749] postpone stopping until ongoing status update is concluded
Feb 25 14:27:22 grenache-1 worker[162749]: [error] [pid:162749] 400 response: Bad Request (remaining tries: 1)
Feb 25 14:27:22 grenache-1 worker[162749]: [debug] [pid:162749] ## adding timer  5
Feb 25 14:27:23 grenache-1 worker[162749]: [debug] [pid:162749] postpone stopping until ongoing status update is concluded
Feb 25 14:27:24 grenache-1 worker[162749]: [debug] [pid:162749] postpone stopping until ongoing status update is concluded
Feb 25 14:27:25 grenache-1 worker[162749]: [debug] [pid:162749] postpone stopping until ongoing status update is concluded
Feb 25 14:27:26 grenache-1 worker[162749]: [debug] [pid:162749] postpone stopping until ongoing status update is concluded
Feb 25 14:27:27 grenache-1 worker[162749]: [debug] [pid:162749] postpone stopping until ongoing status update is concluded
Feb 25 14:27:27 grenache-1 worker[162749]: [error] [pid:162749] 400 response: Bad Request (remaining tries: 0)

So far nothing really weird or unexpected happened (from the worker-side). Then then get multiple errors:

Feb 25 14:27:27 grenache-1 worker[162749]: [debug] [pid:162749] stop_job called after job has already been asked to stop (reason: api-failure)
Feb 25 14:27:27 grenache-1 worker[162749]: [error] [pid:162749] Job aborted because web UI doesn't accept updates anymore (likely considers this job dead)
Feb 25 14:27:27 grenache-1 worker[162749]: [debug] [pid:162749] stop_job called after job has already been asked to stop (reason: api-failure)
Feb 25 14:27:28 grenache-1 worker[162749]: [debug] [pid:162749] POST http://openqa.suse.de/api/v1/jobs/2495918/status
Feb 25 14:27:28 grenache-1 worker[162749]: [error] [pid:162749] 400 response: Bad Request (remaining tries: 2)
Feb 25 14:27:28 grenache-1 worker[162749]: [debug] [pid:162749] ## adding timer  5
Feb 25 14:27:33 grenache-1 worker[162749]: [error] [pid:162749] 400 response: Bad Request (remaining tries: 1)
Feb 25 14:27:33 grenache-1 worker[162749]: [debug] [pid:162749] ## adding timer  5
Feb 25 14:27:38 grenache-1 worker[162749]: [error] [pid:162749] 400 response: Bad Request (remaining tries: 0)
Feb 25 14:27:38 grenache-1 worker[162749]: [debug] [pid:162749] stop_job called after job has already been asked to stop (reason: api-failure)

But that's still ok and does not mean that handling the job being stopped is messed. The worker just ignores the failing status update because the only way to handle it is stopping the job which is already done anyways.

Stopping the job is actually proceeded. That's why it also tries to inform the command server which obviously is already down at this point:

Feb 25 14:27:38 grenache-1 worker[162749]: [info] [pid:162749] trying to stop job gracefully by announcing it to command server via http://localhost:20123/eAiGNZ93WVaDT9VF/broadcast
Feb 25 14:27:38 grenache-1 worker[162749]: [error] [pid:162749] unable to stop the command server gracefully:
Feb 25 14:27:38 grenache-1 worker[162749]: [error] [pid:162749] command server likely not reachable at all

This is not a problem at all because that is only important to happen if the job was cancelled through the worker itself. However, we maybe want to prevent that code from being executed in the case isotovideo exits on its own so that error goes away.

Likely the real issue starts when it tries to upload the OGV file:

Feb 25 14:27:38 grenache-1 worker[162749]: [info] [pid:162749] result: died
Feb 25 14:27:38 grenache-1 worker[162749]: [info] [pid:162749] uploading video.ogv
Feb 25 14:27:39 grenache-1 worker[162749]: [error] [pid:162749] ERROR video.ogv: 404 response: Not Found
Feb 25 14:27:39 grenache-1 worker[162749]: [debug] [pid:162749] job 2495918 incomplete
Feb 25 14:27:39 grenache-1 worker[162749]: can't open /var/lib/openqa/pool/12/testresults/result-svirt_upload_assets.json: No such file or directory at /usr/share/openqa/script/../lib/OpenQA/Worker/Jobs.pm line 973.
Feb 25 14:27:39 grenache-1 worker[162749]: [debug] [pid:162749] POST http://openqa.suse.de/api/v1/jobs/2495918/status
Feb 25 14:27:39 grenache-1 worker[162749]: [error] [pid:162749] 400 response: Bad Request (remaining tries: 2)
Feb 25 14:27:39 grenache-1 worker[162749]: [debug] [pid:162749] ## adding timer  5
Feb 25 14:27:44 grenache-1 worker[162749]: [error] [pid:162749] 400 response: Bad Request (remaining tries: 1)
Feb 25 14:27:44 grenache-1 worker[162749]: [debug] [pid:162749] ## adding timer  5
Feb 25 14:27:49 grenache-1 worker[162749]: [error] [pid:162749] 400 response: Bad Request (remaining tries: 0)

Apparently that fails but this failure isn't handled well and leads into the inconsistent state. Then also the post to update the job status fails (again).

"Side-problems" are the unexpected job grab event in the beginning and the announcement to the command server.

#2 Updated by mkittler over 1 year ago

  • Target version changed from Current Sprint - kernel to Current Sprint

#3 Updated by mkittler over 1 year ago

PR for the worker-side: https://github.com/os-autoinst/openQA/pull/2000

This PR improves the error handling on the worker-side. It is still not 100 % clear what actually caused the error. Maybe the web UI just considered the job already dead (like the log messages already indicate)? I guess I will also have to check the web UI log but it is likely hard to find the corresponding lines for the particular requests.

This also leaves the reason for the Refusing 'grab_job', we are already performing 2495918 still unknown.

#4 Updated by mkittler over 1 year ago

  • Is duplicate of action #42980: job stayed in assigned but is dead added

#5 Updated by mkittler over 1 year ago

  • Status changed from In Progress to Rejected

#6 Updated by mkittler over 1 year ago

  • Status changed from Rejected to In Progress

closed the wrong ticket

#7 Updated by okurz about 1 year ago

  • Category set to Concrete Bugs

#8 Updated by mkittler 11 months ago

  • Status changed from In Progress to Resolved

I haven't seen this particular case anymore since restructuring the worker. If the same problem would still be present the worker log would look different anyways.

The weird log messages like

Feb 25 14:27:19 grenache-1 worker[162749]: [debug] [pid:162749] postpone stopping until ongoing status update is concluded
Feb 25 14:27:20 grenache-1 worker[162749]: [debug] [pid:162749] postpone stopping until ongoing status update is concluded
Feb 25 14:27:21 grenache-1 worker[162749]: [debug] [pid:162749] postpone stopping until ongoing status update is concluded
...

and

Feb 25 14:27:49 grenache-1 worker[162749]: [debug] [pid:162749] stop_job called after job has already been asked to stop (reason: api-failure)
Feb 25 14:27:49 grenache-1 worker[162749]: [error] [pid:162749] Job aborted because web UI doesn't accept updates anymore (likely considers this job dead)
Feb 25 14:27:49 grenache-1 worker[162749]: [debug] [pid:162749] stop_job called after job has already been asked to stop (reason: api-failure)

should be definitely gone by now.

Also available in: Atom PDF