action #40004
closed
worker continues to work on job which it as well as the webui considers dead
Description
Observation
From o3:/var/log/openqa
[2018-08-20T09:17:48.0733 UTC] [info] Got artefact for job with no worker assigned (maybe running job already considered dead): 738408
[2018-08-20T09:17:48.0770 UTC] [info] Got artefact for job with no worker assigned (maybe running job already considered dead): 738408
[2018-08-20T09:17:49.0404 UTC] [info] Got artefact for job with no worker assigned (maybe running job already considered dead): 738409
[2018-08-20T09:17:49.0492 UTC] [info] Got artefact for job with no worker assigned (maybe running job already considered dead): 738409
[2018-08-20T09:17:50.0164 UTC] [info] Got artefact for job with no worker assigned (maybe running job already considered dead): 738408
[2018-08-20T09:17:50.0203 UTC] [info] Got artefact for job with no worker assigned (maybe running job already considered dead): 738408
[2018-08-20T09:17:50.0524 UTC] [info] Got artefact for job with no worker assigned (maybe running job already considered dead): 738408
[2018-08-20T09:17:50.0558 UTC] [info] Got artefact for job with no worker assigned (maybe running job already considered dead): 738408
https://openqa.opensuse.org/tests/738408 reveals it was openqaworker4:11 which shows in its log from the same time period:
Aug 20 11:07:03 openqaworker4 worker[30027]: [error] Job aborted because web UI doesn't accept new images anymore (likely considers this job dead)
Aug 20 11:07:33 openqaworker4 worker[30027]: [error] Job aborted because web UI doesn't accept new images anymore (likely considers this job dead)
Aug 20 11:07:53 openqaworker4 worker[30027]: [error] Job aborted because web UI doesn't accept new images anymore (likely considers this job dead)
Aug 20 11:08:03 openqaworker4 worker[30027]: [error] Job aborted because web UI doesn't accept new images anymore (likely considers this job dead)
Aug 20 11:09:17 openqaworker4 worker[30027]: [error] Job aborted because web UI doesn't accept new images anymore (likely considers this job dead)
Problem
It looks like both webui and worker agree that the job should not be worked on but the worker still does not stop. What gives?
Updated by okurz over 6 years ago
- Related to action #39743: [o3][tools] o3 unusable, often responds with 504 Gateway Time-out added
Updated by EDiGiacinto over 6 years ago
I think this is likely caused by messages getting lost between the two sides, especially when the webui intermittently returns 504. Still, this needs to be handled somehow, as it looks like it could be avoided.
Updated by szarate over 6 years ago
A condition for this to happen (or at least one of them) is the worker taking too long to sync tests, resulting in the webUI deciding to kill the job and restart it, but the job not being killed because the event might not be able to be handled there. This could especially happen when downloading an asset, where the kill term is simply not handled by the code that does the download (one more reason to extract the caching code).
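To illustrate the download case described above, here is a minimal sketch in Python (not the worker's actual code) of a download loop that checks for a stop request between chunks, so a kill arriving mid-download is not silently ignored. The names `fetch_asset`, `DownloadAborted` and `stop_requested` are hypothetical, and the `chunks` iterable merely stands in for the network stream.

```python
import threading
from typing import Iterable


class DownloadAborted(Exception):
    """Raised when a stop request arrives while an asset is being fetched."""


def fetch_asset(chunks: Iterable[bytes], stop_requested: threading.Event) -> bytes:
    # `chunks` is a stand-in for the stream of a (hypothetical) asset download.
    received = bytearray()
    for chunk in chunks:
        # Check before storing each chunk, so a kill request is noticed
        # after at most one more chunk has arrived.
        if stop_requested.is_set():
            raise DownloadAborted(f"stopped after {len(received)} bytes")
        received.extend(chunk)
    return bytes(received)
```

Whatever handles the kill event would only need to call `stop_requested.set()`; the download then ends at the next chunk boundary instead of running to completion.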
Updated by szarate over 6 years ago
- Related to action #39833: [tools] When a worker is abruptly killed, jobs get blocked - CACHE: Being downloaded by another worker, sleeping added
Updated by coolo about 6 years ago
- Target version changed from Ready to Current Sprint
Note: this ticket is only about the worker aborting the job if it knows the job's results are ignored (as the webui considered the worker dead).
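As a rough illustration of that scope (a sketch under assumptions, not the change that was merged), the worker could stop the job as soon as the web UI rejects an artefact instead of logging the error and carrying on. `JobRunner` and the `upload` callable are hypothetical; `upload` is assumed to return `False` when the web UI no longer accepts results for the job.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class JobRunner:
    # Hypothetical callable: returns False when the web UI rejects the
    # artefact, e.g. because it already considers the job dead.
    upload: Callable[[str], bool]
    aborted: bool = False
    uploaded: List[str] = field(default_factory=list)

    def run(self, artefacts: Iterable[str]) -> bool:
        for artefact in artefacts:
            if not self.upload(artefact):
                # The results are ignored anyway, so stop working on the
                # job instead of retrying on every upload.
                self.aborted = True
                break
            self.uploaded.append(artefact)
        return not self.aborted
```

With this shape a single rejection ends the job, so the repeated "Job aborted" messages seen in the worker log would occur at most once per job.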
Updated by mkittler about 6 years ago
- Status changed from New to In Progress
Updated by mkittler about 6 years ago
- Status changed from In Progress to Feedback
PR is merged, let's see how it behaves now in production
Updated by mkittler about 6 years ago
- Status changed from Feedback to Resolved
Re-open if my change is not sufficient in production.
Updated by coolo about 6 years ago
- Target version changed from Current Sprint to Done