action #40004

closed

worker continues to work on job which it as well as the webui considers dead

Added by okurz over 6 years ago. Updated about 6 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2018-08-20
Due date:
% Done: 0%
Estimated time:

Description

Observation

From o3:/var/log/openqa

[2018-08-20T09:17:48.0733 UTC] [info] Got artefact for job with no worker assigned (maybe running job already considered dead): 738408
[2018-08-20T09:17:48.0770 UTC] [info] Got artefact for job with no worker assigned (maybe running job already considered dead): 738408
[2018-08-20T09:17:49.0404 UTC] [info] Got artefact for job with no worker assigned (maybe running job already considered dead): 738409
[2018-08-20T09:17:49.0492 UTC] [info] Got artefact for job with no worker assigned (maybe running job already considered dead): 738409
[2018-08-20T09:17:50.0164 UTC] [info] Got artefact for job with no worker assigned (maybe running job already considered dead): 738408
[2018-08-20T09:17:50.0203 UTC] [info] Got artefact for job with no worker assigned (maybe running job already considered dead): 738408
[2018-08-20T09:17:50.0524 UTC] [info] Got artefact for job with no worker assigned (maybe running job already considered dead): 738408
[2018-08-20T09:17:50.0558 UTC] [info] Got artefact for job with no worker assigned (maybe running job already considered dead): 738408

https://openqa.opensuse.org/tests/738408 reveals it was openqaworker4:11 which shows in its log from the same time period:

Aug 20 11:07:03 openqaworker4 worker[30027]: [error] Job aborted because web UI doesn't accept new images anymore (likely considers this job dead)
Aug 20 11:07:33 openqaworker4 worker[30027]: [error] Job aborted because web UI doesn't accept new images anymore (likely considers this job dead)
Aug 20 11:07:53 openqaworker4 worker[30027]: [error] Job aborted because web UI doesn't accept new images anymore (likely considers this job dead)
Aug 20 11:08:03 openqaworker4 worker[30027]: [error] Job aborted because web UI doesn't accept new images anymore (likely considers this job dead)
Aug 20 11:09:17 openqaworker4 worker[30027]: [error] Job aborted because web UI doesn't accept new images anymore (likely considers this job dead)

Problem

It looks like both webui and worker agree that the job should not be worked on but the worker still does not stop. What gives?


Related issues 2 (0 open, 2 closed)

Related to openQA Project (public) - action #39743: [o3][tools] o3 unusable, often responds with 504 Gateway Time-out (Resolved, okurz, 2018-08-15)

Related to openQA Project (public) - action #39833: [tools] When a worker is abruptly killed, jobs get blocked - CACHE: Being downloaded by another worker, sleeping (Resolved, EDiGiacinto, 2018-08-16)

Actions #1

Updated by okurz over 6 years ago

  • Related to action #39743: [o3][tools] o3 unusable, often responds with 504 Gateway Time-out added
Actions #2

Updated by EDiGiacinto over 6 years ago

I think this likely happens when messages are lost between the two sides, especially if the webui is intermittently returning 504s. It still needs to be handled somehow, as it looks like it could be avoided.
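
To illustrate one way lost messages could be tolerated, here is a minimal Python sketch (not openQA code) of a bounded retry on 504 responses. `send_with_retry` and the `send` callable are hypothetical names. The sketch assumes that a 504 is transient and that any other status is a real answer from the web UI.

```python
from typing import Callable, Tuple


def send_with_retry(send: Callable[[], int], max_attempts: int = 3) -> Tuple[int, int]:
    """Call send() until it returns a status other than 504 or attempts run out.

    `send` is a hypothetical callable that performs one request and returns
    the HTTP status code. Returns (last_status, attempts_used).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    status = 504
    for attempt in range(1, max_attempts + 1):
        status = send()
        if status != 504:
            # Assumption: anything other than a gateway timeout is a real answer.
            return status, attempt
    # Every attempt timed out; the caller has to decide what to do next.
    return status, max_attempts
```

A caller would still need to decide what to do when every attempt times out, which is where the worker and the web UI can end up disagreeing about the job state.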

Actions #3

Updated by szarate over 6 years ago

  • Target version set to Ready
Actions #4

Updated by szarate over 6 years ago

One condition for this to happen (there may be others) is the worker taking too long to sync tests. The webUI then decides to kill the job and restart it, but the job is not killed because the event might not be handled on the worker side. This can especially happen while downloading an asset, where the kill (TERM) is simply not handled by the code that does the download (one more reason to extract the caching code).
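
A minimal sketch of what honouring a stop request inside the download could look like, written in Python rather than taken from the actual worker code. `download_chunks` and `stop_requested` are hypothetical names. The sketch assumes that the asset arrives chunk by chunk and that a stop request sets a flag.

```python
import threading
from typing import Iterable, Tuple


def download_chunks(chunks: Iterable[bytes],
                    stop_requested: threading.Event) -> Tuple[bytes, bool]:
    """Collect chunks until the source is exhausted or a stop is requested.

    Returns (data_received_so_far, completed). Both names are hypothetical;
    the point is only that the flag is checked between chunks.
    """
    received = bytearray()
    for chunk in chunks:
        if stop_requested.is_set():
            # Stop was requested: give up instead of finishing the download.
            return bytes(received), False
        received.extend(chunk)
    return bytes(received), True
```

With a check like this the download is abandoned at the next chunk boundary instead of running to completion for a job that has already been restarted elsewhere.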

Actions #5

Updated by szarate over 6 years ago

  • Related to action #39833: [tools] When a worker is abruptly killed, jobs get blocked - CACHE: Being downloaded by another worker, sleeping added
Actions #6

Updated by coolo about 6 years ago

  • Target version changed from Ready to Current Sprint

Note: this ticket is only about the worker aborting the job if it knows the job's results are ignored (as the webui considered the worker dead).
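
As an illustration of the behaviour described in this note, here is a small Python sketch, not the actual worker implementation. `JobRun`, `run_uploads` and the `upload` callback are hypothetical. The sketch assumes that the web UI's refusal of a result can be reduced to a boolean.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class JobRun:
    """Hypothetical record of what happened while working on one job."""
    job_id: int
    uploaded: List[str] = field(default_factory=list)
    aborted: bool = False
    abort_reason: str = ""


def run_uploads(job_id: int,
                artefacts: Iterable[str],
                upload: Callable[[int, str], bool]) -> JobRun:
    """Upload artefacts one by one and stop at the first rejection.

    `upload` is a hypothetical callback returning False when the web UI
    no longer accepts results for the job.
    """
    run = JobRun(job_id)
    for name in artefacts:
        if not upload(job_id, name):
            run.aborted = True
            run.abort_reason = "web UI no longer accepts results for this job"
            break  # stop working instead of only logging and carrying on
        run.uploaded.append(name)
    return run
```

The relevant part is the `break`: the first rejection ends the work, whereas the worker log above shows the same abort message repeated over several minutes.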

Actions #7

Updated by mkittler about 6 years ago

  • Assignee set to mkittler
Actions #8

Updated by mkittler about 6 years ago

  • Status changed from New to In Progress
Actions #9

Updated by mkittler about 6 years ago

  • Status changed from In Progress to Feedback

The PR is merged; let's see how it behaves in production now.

Actions #10

Updated by mkittler about 6 years ago

  • Status changed from Feedback to Resolved

Re-open if my change is not sufficient in production.

Actions #11

Updated by coolo about 6 years ago

  • Target version changed from Current Sprint to Done