Project

General

Profile

Actions

action #48422

closed

Workers stay in inconsistent job relationship

Added by coolo over 5 years ago. Updated almost 5 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2019-02-26
Due date:
% Done:

0%

Estimated time:

Description

This is clearly related to the stuck workers, but I'm creating an extra ticket, so we can see if there are other problems.

this is grenache-1:12

Feb 25 14:22:49 grenache-1 worker[162749]: [debug] [pid:162749] POST http://openqa.suse.de/api/v1/jobs/2495918/status
Feb 25 14:22:59 grenache-1 worker[162749]: [debug] [pid:162749] updating status
Feb 25 14:22:59 grenache-1 worker[162749]: [debug] [pid:162749] POST http://openqa.suse.de/api/v1/jobs/2495918/status
Feb 25 14:23:09 grenache-1 worker[162749]: [debug] [pid:162749] updating status
Feb 25 14:27:17 grenache-1 worker[162749]: [debug] [pid:162749] POST http://openqa.suse.de/api/v1/jobs/2495918/status
Feb 25 14:27:17 grenache-1 worker[162749]: [debug] [pid:162749] Refusing 'grab_job', we are already performing 2495918
Feb 25 14:27:17 grenache-1 worker[162749]: [debug] [pid:162749] Sending worker status to openqa.chell.qa.suse.de (workerstatus timer)
Feb 25 14:27:17 grenache-1 worker[162749]: [debug] [pid:162749] Sending worker status to openqa.suse.de (workerstatus timer)
Feb 25 14:27:17 grenache-1 worker[162749]: [debug] [pid:162749] Population for openqa.chell.qa.suse.de is 43
Feb 25 14:27:17 grenache-1 worker[162749]: [debug] [pid:162749] ## changing timer workerstatus-openqa.chell.qa.suse.de
Feb 25 14:27:17 grenache-1 worker[162749]: [debug] [pid:162749] ## removing timer workerstatus-openqa.chell.qa.suse.de
Feb 25 14:27:17 grenache-1 worker[162749]: [debug] [pid:162749] ## adding timer workerstatus-openqa.chell.qa.suse.de 64.66
Feb 25 14:27:17 grenache-1 worker[162749]: [debug] [pid:162749] Population for openqa.suse.de is 352
Feb 25 14:27:17 grenache-1 worker[162749]: [debug] [pid:162749] ## changing timer workerstatus-openqa.suse.de
Feb 25 14:27:17 grenache-1 worker[162749]: [debug] [pid:162749] ## removing timer workerstatus-openqa.suse.de
Feb 25 14:27:17 grenache-1 worker[162749]: [debug] [pid:162749] ## adding timer workerstatus-openqa.suse.de 100.00
Feb 25 14:27:17 grenache-1 worker[162749]: [error] [pid:162749] 400 response: Bad Request (remaining tries: 2)
Feb 25 14:27:17 grenache-1 worker[162749]: [debug] [pid:162749] ## adding timer  5
Feb 25 14:27:19 grenache-1 worker[162749]: [info] [pid:162749] Isotovideo exit status: 1
Feb 25 14:27:19 grenache-1 worker[162749]: [debug] [pid:162749] stop_job died
Feb 25 14:27:19 grenache-1 worker[162749]: [debug] [pid:162749] ## removing timer update_status
Feb 25 14:27:19 grenache-1 worker[162749]: [debug] [pid:162749] ## removing timer job_timeout
Feb 25 14:27:19 grenache-1 worker[162749]: [debug] [pid:162749] postpone stopping until ongoing status update is concluded
Feb 25 14:27:20 grenache-1 worker[162749]: [debug] [pid:162749] postpone stopping until ongoing status update is concluded
Feb 25 14:27:21 grenache-1 worker[162749]: [debug] [pid:162749] postpone stopping until ongoing status update is concluded
Feb 25 14:27:22 grenache-1 worker[162749]: [debug] [pid:162749] postpone stopping until ongoing status update is concluded
Feb 25 14:27:22 grenache-1 worker[162749]: [error] [pid:162749] 400 response: Bad Request (remaining tries: 1)
Feb 25 14:27:22 grenache-1 worker[162749]: [debug] [pid:162749] ## adding timer  5
Feb 25 14:27:23 grenache-1 worker[162749]: [debug] [pid:162749] postpone stopping until ongoing status update is concluded
Feb 25 14:27:24 grenache-1 worker[162749]: [debug] [pid:162749] postpone stopping until ongoing status update is concluded
Feb 25 14:27:25 grenache-1 worker[162749]: [debug] [pid:162749] postpone stopping until ongoing status update is concluded
Feb 25 14:27:26 grenache-1 worker[162749]: [debug] [pid:162749] postpone stopping until ongoing status update is concluded
Feb 25 14:27:27 grenache-1 worker[162749]: [debug] [pid:162749] postpone stopping until ongoing status update is concluded
Feb 25 14:27:27 grenache-1 worker[162749]: [error] [pid:162749] 400 response: Bad Request (remaining tries: 0)
Feb 25 14:27:27 grenache-1 worker[162749]: [debug] [pid:162749] stop_job called after job has already been asked to stop (reason: api-failure)
Feb 25 14:27:27 grenache-1 worker[162749]: [error] [pid:162749] Job aborted because web UI doesn't accept updates anymore (likely considers this job dead)
Feb 25 14:27:27 grenache-1 worker[162749]: [debug] [pid:162749] stop_job called after job has already been asked to stop (reason: api-failure)
Feb 25 14:27:28 grenache-1 worker[162749]: [debug] [pid:162749] POST http://openqa.suse.de/api/v1/jobs/2495918/status
Feb 25 14:27:28 grenache-1 worker[162749]: [error] [pid:162749] 400 response: Bad Request (remaining tries: 2)
Feb 25 14:27:28 grenache-1 worker[162749]: [debug] [pid:162749] ## adding timer  5
Feb 25 14:27:33 grenache-1 worker[162749]: [error] [pid:162749] 400 response: Bad Request (remaining tries: 1)
Feb 25 14:27:33 grenache-1 worker[162749]: [debug] [pid:162749] ## adding timer  5
Feb 25 14:27:38 grenache-1 worker[162749]: [error] [pid:162749] 400 response: Bad Request (remaining tries: 0)
Feb 25 14:27:38 grenache-1 worker[162749]: [debug] [pid:162749] stop_job called after job has already been asked to stop (reason: api-failure)
Feb 25 14:27:38 grenache-1 worker[162749]: [info] [pid:162749] trying to stop job gracefully by announcing it to command server via http://localhost:20123/eAiGNZ93WVaDT9VF/broadcast
Feb 25 14:27:38 grenache-1 worker[162749]: [error] [pid:162749] unable to stop the command server gracefully:
Feb 25 14:27:38 grenache-1 worker[162749]: [error] [pid:162749] command server likely not reachable at all
Feb 25 14:27:38 grenache-1 worker[162749]: [info] [pid:162749] +++ worker notes +++
Feb 25 14:27:38 grenache-1 worker[162749]: [info] [pid:162749] end time: 2019-02-25 13:27:38
Feb 25 14:27:38 grenache-1 worker[162749]: [info] [pid:162749] result: died
Feb 25 14:27:38 grenache-1 worker[162749]: [info] [pid:162749] uploading video.ogv
Feb 25 14:27:39 grenache-1 worker[162749]: [error] [pid:162749] ERROR video.ogv: 404 response: Not Found
Feb 25 14:27:39 grenache-1 worker[162749]: [debug] [pid:162749] job 2495918 incomplete
Feb 25 14:27:39 grenache-1 worker[162749]: can't open /var/lib/openqa/pool/12/testresults/result-svirt_upload_assets.json: No such file or directory at /usr/share/openqa/script/../lib/OpenQA/Worker/Jobs.pm line 973.
Feb 25 14:27:39 grenache-1 worker[162749]: [debug] [pid:162749] POST http://openqa.suse.de/api/v1/jobs/2495918/status
Feb 25 14:27:39 grenache-1 worker[162749]: [error] [pid:162749] 400 response: Bad Request (remaining tries: 2)
Feb 25 14:27:39 grenache-1 worker[162749]: [debug] [pid:162749] ## adding timer  5
Feb 25 14:27:44 grenache-1 worker[162749]: [error] [pid:162749] 400 response: Bad Request (remaining tries: 1)
Feb 25 14:27:44 grenache-1 worker[162749]: [debug] [pid:162749] ## adding timer  5
Feb 25 14:27:49 grenache-1 worker[162749]: [error] [pid:162749] 400 response: Bad Request (remaining tries: 0)
Feb 25 14:27:49 grenache-1 worker[162749]: [debug] [pid:162749] stop_job called after job has already been asked to stop (reason: api-failure)
Feb 25 14:27:49 grenache-1 worker[162749]: [error] [pid:162749] Job aborted because web UI doesn't accept updates anymore (likely considers this job dead)
Feb 25 14:27:49 grenache-1 worker[162749]: [debug] [pid:162749] stop_job called after job has already been asked to stop (reason: api-failure)
Feb 25 14:28:21 grenache-1 worker[162749]: [debug] [pid:162749] Sending worker status to openqa.chell.qa.suse.de (workerstatus timer)

After this the worker is lost, because it's marked as free for the scheduler, but the WS service won't send anything:

[2019-02-26T10:00:45.0020 CET] [debug] [pid:27889] Received from worker "1243" worker_status message "$VAR1 = {
          'status' => 'working',
          'type' => 'worker_status',
          'job' => {
                     'id' => 2495918,
                     'assigned_worker_id' => '1243',
                     't_started' => '',
                     'name' => 'sle-15-SP1-Installer-DVD-s390x-Build177.1-create_hdd_ha_textmode@s390x-kvm-sle12',

...

[2019-02-26T10:00:45.0023 CET] [debug] [pid:27889] Updating worker seen from worker_status
[2019-02-26T10:00:45.0029 CET] [debug] [pid:27889] Found Job(2496347) in DB from worker_status update sent by Worker(1243)
[2019-02-26T10:00:45.0029 CET] [debug] [pid:27889] Received request has id: 2495918
[2019-02-26T10:00:45.0035 CET] [debug] [pid:27889] Worker(1243) for Job(2496347) has token JyckI5bhlcgXNUSg
[2019-02-26T10:00:45.0035 CET] [debug] [pid:27889] Received request has token: eAiGNZ93WVaDT9VF
[2019-02-26T10:00:45.0037 CET] [debug] [pid:27889] Possibly worker 1243 should be freed.
[2019-02-26T10:00:45.0048 CET] [debug] [pid:27889] Job '2496347' reset to 'scheduled' state

Related issues 1 (0 open1 closed)

Is duplicate of openQA Project - action #42980: job stayed in assigned but is deadRejectedmkittler2018-10-27

Actions
Actions

Also available in: Atom PDF