action #40004

worker continues to work on a job which it, as well as the webui, considers dead

Added by okurz over 1 year ago. Updated about 1 year ago.

Status: Resolved
Start date: 20/08/2018
Priority: Normal
Due date:
Assignee: mkittler
% Done: 0%
Category: Concrete Bugs
Target version: Done
Difficulty:
Duration:

Description

Observation

From o3:/var/log/openqa

[2018-08-20T09:17:48.0733 UTC] [info] Got artefact for job with no worker assigned (maybe running job already considered dead): 738408
[2018-08-20T09:17:48.0770 UTC] [info] Got artefact for job with no worker assigned (maybe running job already considered dead): 738408
[2018-08-20T09:17:49.0404 UTC] [info] Got artefact for job with no worker assigned (maybe running job already considered dead): 738409
[2018-08-20T09:17:49.0492 UTC] [info] Got artefact for job with no worker assigned (maybe running job already considered dead): 738409
[2018-08-20T09:17:50.0164 UTC] [info] Got artefact for job with no worker assigned (maybe running job already considered dead): 738408
[2018-08-20T09:17:50.0203 UTC] [info] Got artefact for job with no worker assigned (maybe running job already considered dead): 738408
[2018-08-20T09:17:50.0524 UTC] [info] Got artefact for job with no worker assigned (maybe running job already considered dead): 738408
[2018-08-20T09:17:50.0558 UTC] [info] Got artefact for job with no worker assigned (maybe running job already considered dead): 738408

https://openqa.opensuse.org/tests/738408 reveals it was openqaworker4:11 which shows in its log from the same time period:

Aug 20 11:07:03 openqaworker4 worker[30027]: [error] Job aborted because web UI doesn't accept new images anymore (likely considers this job dead)
Aug 20 11:07:33 openqaworker4 worker[30027]: [error] Job aborted because web UI doesn't accept new images anymore (likely considers this job dead)
Aug 20 11:07:53 openqaworker4 worker[30027]: [error] Job aborted because web UI doesn't accept new images anymore (likely considers this job dead)
Aug 20 11:08:03 openqaworker4 worker[30027]: [error] Job aborted because web UI doesn't accept new images anymore (likely considers this job dead)
Aug 20 11:09:17 openqaworker4 worker[30027]: [error] Job aborted because web UI doesn't accept new images anymore (likely considers this job dead)

Problem

It looks like both webui and worker agree that the job should not be worked on but the worker still does not stop. What gives?


Related issues

Related to openQA Project - action #39743: [o3][tools] o3 unusable, often responds with 504 Gateway ... Resolved 15/08/2018
Related to openQA Project - action #39833: [tools] When a worker is abruptly killed, jobs get blocke... Resolved 16/08/2018

History

#1 Updated by okurz over 1 year ago

  • Related to action #39743: [o3][tools] o3 unusable, often responds with 504 Gateway Time-out added

#2 Updated by EDiGiacinto over 1 year ago

I think this likely happens when messages are lost between the two sides, especially if the webui is giving intermittent 504s. It still needs to be handled somehow, as it looks like it could be avoided.

#3 Updated by szarate over 1 year ago

  • Target version set to Ready

#4 Updated by szarate over 1 year ago

A condition for this to happen (or at least one of them) is the worker taking too long to sync tests: the webUI decides to kill the job and restart it, but the job is not killed because the event might not be handled on the worker side. This can happen especially while downloading an asset, where the kill/term request is simply not handled by the code that does the download (one more reason to extract the caching code).
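To illustrate the download case, here is a minimal sketch of a chunked copy that checks a stop flag between chunks, so a kill request arriving mid-download is noticed. This is not the openQA worker code (which is Perl); `download_with_abort`, `DownloadAborted` and `stop_requested` are hypothetical names chosen for the example.

```python
import io
import threading


class DownloadAborted(Exception):
    """Raised when a stop request arrives while a download is in progress."""


def download_with_abort(source, sink, stop_requested, chunk_size=64 * 1024):
    """Copy source to sink in chunks, checking the stop flag before each chunk.

    source and sink are file-like objects; stop_requested is a
    threading.Event that whoever handles the kill request would set.
    Returns the number of bytes copied.
    """
    copied = 0
    while True:
        # Hypothetical hook: the kill handler only needs to set this event.
        if stop_requested.is_set():
            raise DownloadAborted(f"stopped after {copied} bytes")
        chunk = source.read(chunk_size)
        if not chunk:
            return copied
        sink.write(chunk)
        copied += len(chunk)


if __name__ == "__main__":
    stop = threading.Event()
    src, dst = io.BytesIO(b"x" * 10), io.BytesIO()
    print(download_with_abort(src, dst, stop, chunk_size=4))
```

The point of the design is only that the long-running step gives the stop request a chance to be seen; a blocking single-call download would not.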

#5 Updated by szarate over 1 year ago

  • Related to action #39833: [tools] When a worker is abruptly killed, jobs get blocked - CACHE: Being downloaded by another worker, sleeping added

#6 Updated by coolo over 1 year ago

  • Target version changed from Ready to Current Sprint

Note: this ticket is only about the worker aborting the job if it knows the job's results are ignored (as the webui considered the worker dead).
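A rough sketch of the behaviour asked for here, assuming the worker can tell from an upload response that the web UI no longer accepts results: once an upload is rejected, the job is stopped instead of only logging the error and carrying on. `JobRunner` and the `upload` callable are hypothetical and do not reflect the actual worker API or the merged change.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Optional


@dataclass
class JobRunner:
    # Hypothetical callable: returns True if the web UI accepted the artefact.
    upload: Callable[[str], bool]
    stopped: bool = False
    stop_reason: Optional[str] = None
    uploaded: List[str] = field(default_factory=list)

    def stop(self, reason: str) -> None:
        """Mark the job as stopped so no further work is done on it."""
        self.stopped = True
        self.stop_reason = reason

    def run(self, artefacts: Iterable[str]) -> List[str]:
        """Upload artefacts in order; stop the job on the first rejection."""
        for name in artefacts:
            if self.stopped:
                break
            if not self.upload(name):
                # The web UI ignores our results, so continuing is pointless.
                self.stop("web UI does not accept results anymore")
                break
            self.uploaded.append(name)
        return self.uploaded


if __name__ == "__main__":
    accepted = {"a.png", "b.png"}
    runner = JobRunner(upload=lambda name: name in accepted)
    print(runner.run(["a.png", "b.png", "c.png", "d.png"]), runner.stop_reason)
```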

#7 Updated by mkittler over 1 year ago

  • Assignee set to mkittler

#8 Updated by mkittler over 1 year ago

  • Status changed from New to In Progress

#9 Updated by mkittler over 1 year ago

  • Status changed from In Progress to Feedback

PR is merged, let's see how it behaves now in production

#10 Updated by mkittler over 1 year ago

  • Status changed from Feedback to Resolved

Re-open if my change is not sufficient in production.

#11 Updated by coolo about 1 year ago

  • Target version changed from Current Sprint to Done
