action #55328
closedjob is considered incomplete by openQA but worker still pushes updates so that "job is not considered dead"
Description
Observation¶
https://openqa.suse.de/tests/3230497 is already incomplete since 3h but osd reports in /var/log/openqa:
[2019-08-09T23:26:00.0802 CEST] [info] Got status update for job 3230497 with unexpected worker ID 617 (expected no updates anymore, job is done with result incomplete)
[2019-08-09T23:26:05.0837 CEST] [info] Got status update for job 3230497 with unexpected worker ID 617 (expected no updates anymore, job is done with result incomplete)
[2019-08-09T23:26:10.0879 CEST] [info] Got status update for job 3230497 with unexpected worker ID 617 (expected no updates anymore, job is done with result incomplete)
[2019-08-09T23:26:15.0910 CEST] [info] Got status update for job 3230497 with unexpected worker ID 617 (expected no updates anymore, job is done with result incomplete)
[…]
and on openqaworker9 I can see the job still running:
openqaworker9:/var/lib/openqa/pool/18 # tail -f autoinst-log.txt worker-log.txt
==> autoinst-log.txt <==
[2019-08-09T14:33:58.0183 CEST] [info] [pid:14937] +++ setup notes +++
[2019-08-09T14:33:58.0183 CEST] [info] [pid:14937] Start time: 2019-08-09 12:33:58
[2019-08-09T14:33:58.0183 CEST] [info] [pid:14937] Running on openqaworker9:18 (Linux 4.12.14-lp151.28.7-default #1 SMP Mon Jun 17 16:36:38 UTC 2019 (f8a1872) x86_64)
[2019-08-09T14:33:58.0201 CEST] [debug] [pid:14937] Downloading SLES-15-SP1-x86_64-20190809-2@64bit-minimal_with_sdkGM_installed.qcow2 - request sent to Cache Service.
==> worker-log.txt <==
[2019-08-09T23:32:59.0222 CEST] [debug] [pid:14937] Updating status so job 3230497 is not considered dead.
[2019-08-09T23:32:59.0222 CEST] [debug] [pid:14937] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3230497/status
[2019-08-09T23:33:04.0262 CEST] [debug] [pid:14937] Updating status so job 3230497 is not considered dead.
[2019-08-09T23:33:04.0263 CEST] [debug] [pid:14937] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3230497/status
[2019-08-09T23:33:09.0295 CEST] [debug] [pid:14937] Updating status so job 3230497 is not considered dead.
[2019-08-09T23:33:09.0296 CEST] [debug] [pid:14937] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3230497/status
[…]
so the job never properly started but was probably trying to download a file from the cache service. Maybe the cache service failed to download the file currently and never responded back to the worker. The worker is probably still waiting for the response from the caching server and during that time updating the webui with no change in status.
Further details¶
I have provided worker-log.txt and autoinst-log.txt from the still running job in https://w3.nue.suse.com/~okurz/poo55238/openqaworker9_instance18_still_updating_but_osd_considers_incomplete.txt and https://w3.nue.suse.com/~okurz/poo55238/openqaworker9_instance18_still_updating_but_osd_considers_incomplete-autoinst-log.txt
Updated by okurz over 5 years ago
- Copied from action #55238: jobs with high amount of log files, thumbnails, test results are incompleted but the job continues with upload attempts added
Updated by okurz over 5 years ago
applying workaround, same as in #55238#note-6
load has reduced to around 4.0 so way better. But I doubt I made this with just this workaround on a single job or so.
Updated by kraih over 5 years ago
- Status changed from New to In Progress
- Assignee set to kraih
Updated by kraih over 5 years ago
This PR should help a lot with the problem. https://github.com/os-autoinst/openQA/pull/2270
There might still be related issues though, such as failed artefact uploads now making tests fail that should have passed. Need to investigate.
Updated by okurz over 5 years ago
- Priority changed from Normal to High
The current situation what it seems to have caused up to now:
- More test modules reported with "none" status (probably reported by above PR)
- Assets which fail to download because they don't exist are not reported in the worker log and less obvious than in before in autoinst-log.txt . This confused at least one person now as could be seen from discussion in #testing (RC). The error reporting should be at least improved again as we had in before which is now probably more complicated because of the "asynchronous download". See https://openqa.suse.de/tests/3323025/file/autoinst-log.txt for the missing information about the "not found asset" whereas https://openqa.suse.de/tests/3323025/file/worker-log.txt has it. This is probably caused by an earlier code change as https://openqa.suse.de/tests/3264111#downloads from 16 days ago seems to show the same problem. EDIT: Hm, maybe there is something else weird or confusing: https://openqa.suse.de/tests/3320973/file/autoinst-log.txt from "just now" shows the 404 about the asset not found in the autoinst-log.txt
EDIT2: https://openqa.suse.de/tests/3323089#downloads looks interesting. It has "autoinst-log.txt" and "worker-log.txt" but both files seem to be empty. @JERiveraMoya just reported this as #56447
EDIT3: I thought https://github.com/os-autoinst/openQA/pull/2270 is already merged but not yet deployed. Maybe it actually helps with the second problem observed above
Updated by okurz over 5 years ago
- Related to action #56447: openqaworker-arm-2 is out-of-space on /was: openQA on osd fails with empty logs added
Updated by coolo over 5 years ago
- Status changed from In Progress to Resolved
- Target version changed from Current Sprint to Done
Updated by okurz over 5 years ago
For reference, we assume that https://github.com/os-autoinst/openQA/pull/2270 solved it