action #55238
jobs with high amount of log files, thumbnails, test results are incompleted but the job continues with upload attempts
Description
Observation
https://openqa.suse.de/tests/3225843 is incomplete; at the time of writing it has already been incomplete for 23 minutes. osd reports in /var/log/openqa:
[2019-08-08T14:33:55.0517 CEST] [info] Got artefact for job with no worker assigned (maybe running job already considered dead): 3225843
[2019-08-08T14:33:55.0744 CEST] [info] Got artefact for job with no worker assigned (maybe running job already considered dead): 3225843
[2019-08-08T14:33:56.0441 CEST] [info] Got artefact for job with no worker assigned (maybe running job already considered dead): 3225843
[2019-08-08T14:33:56.0699 CEST] [info] Got artefact for job with no worker assigned (maybe running job already considered dead): 3225843
[2019-08-08T14:33:57.0285 CEST] [info] Got artefact for job with no worker assigned (maybe running job already considered dead): 3225843
[2019-08-08T14:33:57.0506 CEST] [info] Got artefact for job with no worker assigned (maybe running job already considered dead): 3225843
[…]
On openqaworker8 I can see that the job is actually still active and happily spamming osd with more logs:
openqaworker8:/var/lib/openqa/pool/4 # tail -f autoinst-log.txt worker-log.txt
==> autoinst-log.txt <==
last frame
[2019-08-08T13:44:33.969 CEST] [info] ::: OpenQA::Qemu::Proc::save_state: Saving QEMU state to qemu_state.json
[2019-08-08T13:44:33.971 CEST] [debug] QEMU: qemu-system-x86_64: terminating on signal 15 from pid 14865 (/usr/bin/isotovideo: backen)
[2019-08-08T13:44:33.972 CEST] [debug] sending magic and exit
[2019-08-08T13:44:33.972 CEST] [debug] received magic close
[2019-08-08T13:44:33.989 CEST] [debug] backend process exited: 0
[2019-08-08T13:44:34.990 CEST] [debug] killing backend process 14865
[2019-08-08T13:44:34.990 CEST] [debug] done with backend process
14405: EXIT 0
[2019-08-08T13:44:35.0060 CEST] [info] [pid:15638] Isotovideo exit status: 0
==> worker-log.txt <==
[2019-08-08T14:35:07.0115 CEST] [debug] [pid:15638] Uploading pthread_kill_7-1-8.txt
[2019-08-08T14:35:07.0385 CEST] [debug] [pid:15638] Uploading pthread_kill_7-1-9.txt
[2019-08-08T14:35:07.0660 CEST] [debug] [pid:15638] Uploading pthread_kill_7-1-10.txt
[2019-08-08T14:35:16.0179 CEST] [debug] [pid:15638] Uploading pthread_kill_8-1-1.txt
[2019-08-08T14:35:16.0821 CEST] [debug] [pid:15638] Uploading pthread_kill_8-1-2.txt
[…]
Suggestions
- Have this as a stress test :) For example "a job tries to upload 10k log files with, say, 1 kB each" (see the sketch after this list)
- If the worker was asked to stop, make sure it actually stops, or keep considering the job as running on the server side as long as it actually is running
- Prevent jobs from even trying to upload too many log files (or bundle them together to reduce their number)
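A minimal sketch of how such a stress scenario could be set up on the worker side; the ulogs directory name and the file count are assumptions for illustration only:
# hypothetical stress setup: create 10k log files of exactly 1 KiB each so the
# worker has to upload all of them at the end of the job
mkdir -p ulogs                                  # directory name is an assumption
for i in $(seq 1 10000); do
    head -c 1024 /dev/zero | tr '\0' 'x' > "ulogs/stress_$i.txt"
done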
Further details
I have provided worker-log.txt and autoinst-log.txt from the still running job on
https://w3.nue.suse.com/~okurz/poo55238/
Updated by okurz over 5 years ago
- Copied from action #54902: openQA on osd fails at "incomplete" status when uploading, "502 response: Proxy Error" added
Updated by pvorel over 5 years ago
- Subject changed from LTP jobs with endless amount of logs are killing openQA to [ltp][kernel] LTP jobs with endless amount of logs are killing openQA
Updated by okurz over 5 years ago
- Subject changed from [ltp][kernel] LTP jobs with endless amount of logs are killing openQA to jobs with high amount of log files are incompleted but the job continues with upload attempts
- Description updated (diff)
As discussed with pvorel in IRC, #55283 was created to prevent this situation from the test side for now. This ticket can then focus on the openQA side.
Updated by okurz over 5 years ago
- Subject changed from jobs with high amount of log files are incompleted but the job continues with upload attempts to jobs with high amount of log files, thumbnails, test results are incompleted but the job continues with upload attempts
I observed that a job can also be declared dead while uploading the thumbnails or text files from test results, see https://w3.nue.suse.com/%7Eokurz/poo55238/openqaworker9_instance3_declared_dead_while_uploading_thumbnails.txt and https://w3.nue.suse.com/%7Eokurz/poo55238/openqaworker9_instance3_declared_dead_while_uploading_thumbnails-autoinst-log.txt (from https://openqa.suse.de/tests/3229595).
As a short-term mitigation I am trying to tame the zombies. On OSD I run:
while true; do bash -ex ~okurz/bin/openqa-kill_worker_systemd_jobs_considered_dead && sleep 60; done
with the script from https://github.com/os-autoinst/scripts/pull/4
Updated by mkittler over 5 years ago
The worker is actually supposed to stop after the first failure, and according to the log the web UI actually rejects the uploads here. I am trying to reproduce this locally because just by looking at the code I cannot figure out why the error handling doesn't work.
Updated by mkittler over 5 years ago
I have changed the mocking of _upload_log_file and _upload_asset within t/24-worker-job.t to simulate a failure. The worker aborts the job in this case. So the messed-up error handling lies deeper in the upload code itself and the recent worker restructuring is unlikely to have caused this.
See PR https://github.com/os-autoinst/openQA/pull/2250 for the test.
Updated by okurz over 5 years ago
- Copied to action #55328: job is considered incomplete by openQA but worker still pushes updates so that "job is not considered dead" added
Updated by okurz over 5 years ago
- Status changed from New to Workable
As @kraih prepared https://github.com/os-autoinst/openQA/pull/2268 I am stressing the staging instance a bit. In a shell session in one terminal I am running
sudo stress-ng --cpu 1 --io 10 --hdd 10 --vm 1 --vm-bytes 1G --timeout 3600s --metrics-brief
which is causing a load of about 30. In another shell I am triggering some jobs:
for i in {1..10000}; do openqa-clone-job --skip-download --skip-chained-deps --from https://openqa.suse.de --host http://localhost 3255132 _QUIET_SCRIPT_CALLS=0 ; done
EDIT: This really managed to put big stress on the tests, as can be seen in http://openqa-staging-1.qa.suse.de/tests . http://openqa-staging-1.qa.suse.de/tests/912 is an example of an incomplete job without any logs, http://openqa-staging-1.qa.suse.de/tests/910 is an example of an incomplete job that got half-way through but then also ended up without logs, http://openqa-staging-1.qa.suse.de/tests/899 is a passed job that was lucky, and http://openqa-staging-1.qa.suse.de/tests/888 could be the most interesting as it is one of the rare cases which seems complete, with all modules and all logs there, but the job is still regarded as incomplete. There you have your stress test :)
I stopped stress-ng as it seems the ltp_openposix scenario is good enough stress ;)
Updated by okurz about 5 years ago
- Related to action #55904: /status updates are too heavy with external results added
Updated by coolo about 5 years ago
- Priority changed from High to Normal
Right now this is not high prio; it is possibly even resolved due to recent changes.
Updated by okurz about 5 years ago
- Status changed from Workable to Blocked
- Assignee set to okurz
I assume https://github.com/os-autoinst/openQA/pull/2270 was the main PR in this direction by making the worker use asynchronous upload as much as possible. Also we recently disabled the mojo monitoring plugin which helped a lot. Further work is also going on in #55904. We can certainly revisit the remaining points but probably we should wait for #55904 first.
The manual tests we have conducted on staging based on the ltp syscall test were helpful enough. Nothing we want to run on PRs of course, unless we can mock the expensive steps. Potentially we could run more expensive tests as part of openQA-in-openQA, however we would need to be very careful to have really stable results. Stable means it would fail only if it points to reproducible issues that we want to fix; if e.g. a disk is slow the job would just take longer but shouldn't fail. This way eventually we can have openQA "cloud-ready" again ;)
I would be perfectly ok with you setting the ticket to "Resolved". However we need to ensure that on any next incident we actually use the potential of real tests to tell us where to put our effort.
We could also think about bunching or truncating results, e.g. if there are more than $limit results the rest would simply be truncated (rather than the job becoming incomplete or taking forever), or we could create a packed archive on the worker side and upload just the archive (and optionally unpack it on the web UI host).
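A rough sketch of the archive idea on the worker side; the pool path is taken from the log above, the ulogs directory is an assumption, and the final upload step is left out because the exact API call would be an assumption too:
# hypothetical bundling step: one archive instead of thousands of small result files
cd /var/lib/openqa/pool/4 || exit 1
tar czf ulogs-bundle.tar.gz ulogs/              # assumes extra logs live in ulogs/
# a single upload of ulogs-bundle.tar.gz would then replace the per-file uploads;
# the web UI host could unpack it again on the receiving side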
Updated by pvorel about 5 years ago
Just a note: I understand that this must be fixed, but please don't slow down uploading - we have noticed that LTP tests on the ipmi backend are already terribly slow due to uploads like:
[2019-09-24T23:21:24.0481 CEST] [debug] [pid:520421] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3398899/ status
Sorry if this is not related; see #57323 if interested.
Updated by kraih about 5 years ago
It's hard to say for certain that this kind of problem won't ever appear again. But we can now run 10 parallel LTP tests on the staging machines with a single webui without problems, so it should at least be a lot less likely.
Updated by okurz about 5 years ago
- Target version set to future
Updated by okurz about 5 years ago
How about the following idea: We already upload thumbnails and module results while the job is running, but AFAIK for other log files we only do this after the job is done. Can we upload the log files during execution? Another topic could be the worker and autoinst logs, which are both not complete until the job is done, but we could still upload parts of them and append the missing parts later with an intelligent upload.
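A very rough sketch of such an incremental "append only what is new" upload for a growing log; job_is_running is a placeholder and the actual upload call is omitted because no such openQA API is confirmed here:
# hypothetical incremental upload loop for a log that grows while the job runs
OFFSET=0
while job_is_running; do                        # placeholder for a real liveness check
    SIZE=$(stat -c %s autoinst-log.txt)
    if [ "$SIZE" -gt "$OFFSET" ]; then
        # extract only the bytes appended since the last iteration
        tail -c +"$((OFFSET + 1))" autoinst-log.txt > chunk.txt
        # upload chunk.txt and let the server append it to the stored log (call omitted)
        OFFSET=$SIZE
    fi
    sleep 30
done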
Updated by mkittler about 5 years ago
I think that generally makes sense but I'm not sure whether it will help with this concrete issue, e.g. if the majority of log files is generated at the end of one particular job module. Note that so far the other results are not uploaded continuously while the job is running but only on job module boundaries.
Updated by okurz over 4 years ago
- Status changed from Blocked to New
- Assignee deleted (okurz)
- Priority changed from Normal to Low
I am not sure if we have had changes in the meantime that would actually help with this, but I think it is less important by now. Also, we are blocked by other tickets which we regard as low prio and do not track on the QA tools team backlog either, so I will put the ticket back to "New" to not need to track it anymore for now.