action #55238

jobs with high amount of log files, thumbnails, test results are incompleted but the job continues with upload attempts

Added by okurz 8 months ago. Updated 4 months ago.

Status: Blocked
Start date: 31/07/2019
Priority: Normal
Due date:
Assignee: okurz
% Done: 0%
Category: Concrete Bugs
Target version: QA - future
Difficulty:
Duration:

Description

Observation

https://openqa.suse.de/tests/3225843# is incomplete. At the time of writing it has already been incomplete for 23 minutes. osd reports in /var/log/openqa:

[2019-08-08T14:33:55.0517 CEST] [info] Got artefact for job with no worker assigned (maybe running job already considered dead): 3225843
[2019-08-08T14:33:55.0744 CEST] [info] Got artefact for job with no worker assigned (maybe running job already considered dead): 3225843
[2019-08-08T14:33:56.0441 CEST] [info] Got artefact for job with no worker assigned (maybe running job already considered dead): 3225843
[2019-08-08T14:33:56.0699 CEST] [info] Got artefact for job with no worker assigned (maybe running job already considered dead): 3225843
[2019-08-08T14:33:57.0285 CEST] [info] Got artefact for job with no worker assigned (maybe running job already considered dead): 3225843
[2019-08-08T14:33:57.0506 CEST] [info] Got artefact for job with no worker assigned (maybe running job already considered dead): 3225843
[…]

On openqaworker8 I can see that the job is actually still active and happily spamming osd with more logs:

openqaworker8:/var/lib/openqa/pool/4 # tail -f autoinst-log.txt worker-log.txt 
==> autoinst-log.txt <==
last frame
[2019-08-08T13:44:33.969 CEST] [info] ::: OpenQA::Qemu::Proc::save_state: Saving QEMU state to qemu_state.json
[2019-08-08T13:44:33.971 CEST] [debug] QEMU: qemu-system-x86_64: terminating on signal 15 from pid 14865 (/usr/bin/isotovideo: backen)
[2019-08-08T13:44:33.972 CEST] [debug] sending magic and exit
[2019-08-08T13:44:33.972 CEST] [debug] received magic close
[2019-08-08T13:44:33.989 CEST] [debug] backend process exited: 0
[2019-08-08T13:44:34.990 CEST] [debug] killing backend process 14865
[2019-08-08T13:44:34.990 CEST] [debug] done with backend process
14405: EXIT 0
[2019-08-08T13:44:35.0060 CEST] [info] [pid:15638] Isotovideo exit status: 0

==> worker-log.txt <==
[2019-08-08T14:35:07.0115 CEST] [debug] [pid:15638] Uploading pthread_kill_7-1-8.txt
[2019-08-08T14:35:07.0385 CEST] [debug] [pid:15638] Uploading pthread_kill_7-1-9.txt
[2019-08-08T14:35:07.0660 CEST] [debug] [pid:15638] Uploading pthread_kill_7-1-10.txt
[2019-08-08T14:35:16.0179 CEST] [debug] [pid:15638] Uploading pthread_kill_8-1-1.txt
[2019-08-08T14:35:16.0821 CEST] [debug] [pid:15638] Uploading pthread_kill_8-1-2.txt
[…]

Suggestions

  • Have this as a stress test :) For example "job tries to upload 10k log files with, say, 1k bytes each"
  • If the worker was asked to stop, make sure it actually stops; otherwise the server should consider the job running for as long as it actually is running
  • Prevent jobs from even trying to upload an excessive number of log files (or bundle them together to reduce the count)
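The first suggestion could be set up roughly like this. This is only a sketch of the test data generation, not an actual openQA test; the directory path and file naming are illustrative assumptions:

```shell
# Minimal sketch of the suggested stress setup: create 10k log files of
# about 1 KiB each, which a stress job could then try to upload.
# The path and file names are illustrative only.
dir=/tmp/poo55238-stress
mkdir -p "$dir"
for i in $(seq 1 10000); do
    # pad each line to 1024 characters so every file is ~1 KiB
    printf '%-1024s\n' "stress log file $i" > "$dir/log-$i.txt"
done
ls "$dir" | wc -l   # 10000
```

A test job pointing its extra-logs directory at such a tree should reproduce the upload flood described in the observation.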

Further details

I have provided worker-log.txt and autoinst-log.txt from the still running job on
https://w3.nue.suse.com/~okurz/poo55238/


Related issues

Related to openQA Project - action #55904: /status updates are too heavy with external results New 23/08/2019
Copied from openQA Project - action #54902: openQA on osd fails at "incomplete" status when uploading... Resolved 31/07/2019
Copied to openQA Tests - action #55283: [ltp][kernel] LTP jobs with endless amount of logs are ki... New 31/07/2019
Copied to openQA Project - action #55328: job is considered incomplete by openQA but worker still p... Resolved 31/07/2019

History

#1 Updated by okurz 8 months ago

  • Copied from action #54902: openQA on osd fails at "incomplete" status when uploading, "502 response: Proxy Error" added

#2 Updated by okurz 8 months ago

  • Description updated (diff)

#3 Updated by pvorel 8 months ago

  • Subject changed from LTP jobs with endless amount of logs are killing openQA to [ltp][kernel] LTP jobs with endless amount of logs are killing openQA

#4 Updated by okurz 8 months ago

  • Copied to action #55283: [ltp][kernel] LTP jobs with endless amount of logs are killing openQA added

#5 Updated by okurz 8 months ago

  • Subject changed from [ltp][kernel] LTP jobs with endless amount of logs are killing openQA to jobs with high amount of log files are incompleted but the job continues with upload attempts
  • Description updated (diff)

As discussed with pvorel on IRC, #55283 was created to prevent this situation for now from the test side. This ticket can then focus on the openQA side.

#6 Updated by okurz 8 months ago

  • Subject changed from jobs with high amount of log files are incompleted but the job continues with upload attempts to jobs with high amount of log files, thumbnails, test results are incompleted but the job continues with upload attempts

I observed that a job can also be declared dead while uploading the thumbnails or text files from test results; see https://w3.nue.suse.com/%7Eokurz/poo55238/openqaworker9_instance3_declared_dead_while_uploading_thumbnails.txt and https://w3.nue.suse.com/%7Eokurz/poo55238/openqaworker9_instance3_declared_dead_while_uploading_thumbnails-autoinst-log.txt, both from https://openqa.suse.de/tests/3229595

As a short-term mitigation I am trying to tame the zombies. On OSD I run:

while true; do bash -ex ~okurz/bin/openqa-kill_worker_systemd_jobs_considered_dead && sleep 60; done

with the script from https://github.com/os-autoinst/scripts/pull/4

#7 Updated by mkittler 8 months ago

The worker is actually supposed to stop after the first failure, and according to the log the web UI actually rejects the uploads here. I'm trying to reproduce this locally because just by looking at the code I cannot figure out why the error handling doesn't work.

#8 Updated by mkittler 8 months ago

I have changed the mocking of _upload_log_file and _upload_asset within t/24-worker-job.t to simulate a failure. The worker aborts the job in this case, so the broken error handling lies deeper in the upload code itself and the recent worker restructuring is unlikely to have caused this.

See PR https://github.com/os-autoinst/openQA/pull/2250 for the test.

#9 Updated by okurz 8 months ago

  • Copied to action #55328: job is considered incomplete by openQA but worker still pushes updates so that "job is not considered dead" added

#10 Updated by okurz 8 months ago

  • Status changed from New to Workable

As @kraih prepared https://github.com/os-autoinst/openQA/pull/2268 I am stressing the staging instance a bit. In a shell session in one terminal I am running

sudo stress-ng --cpu 1 --io 10 --hdd 10 --vm 1 --vm-bytes 1G --timeout 3600s --metrics-brief

which causes a load of about 30. In another shell I am triggering some jobs:

for i in {1..10000}; do openqa-clone-job --skip-download --skip-chained-deps --from https://openqa.suse.de --host http://localhost 3255132 _QUIET_SCRIPT_CALLS=0 ; done

EDIT: This really managed to put big stress on the tests, as can be seen in http://openqa-staging-1.qa.suse.de/tests . http://openqa-staging-1.qa.suse.de/tests/912 is an example of an incomplete job without any logs; http://openqa-staging-1.qa.suse.de/tests/910 is an example of an incomplete job that got halfway through but then also ended up without logs; http://openqa-staging-1.qa.suse.de/tests/899 is a passed job that got lucky; http://openqa-staging-1.qa.suse.de/tests/888 could be the most interesting, as it is one of the rare cases that seems complete, with all modules and all logs there, but the job is still regarded as incomplete. There you have your stress test :)

I stopped stress-ng, as it seems the ltp_openposix scenario is stressful enough on its own ;)

#11 Updated by okurz 6 months ago

  • Related to action #55904: /status updates are too heavy with external results added

#12 Updated by coolo 6 months ago

  • Priority changed from High to Normal

Not high priority right now; possibly even resolved by recent changes.

#13 Updated by okurz 6 months ago

  • Status changed from Workable to Blocked
  • Assignee set to okurz

I assume https://github.com/os-autoinst/openQA/pull/2270 was the main PR in this direction, making the worker use asynchronous uploads as much as possible. We also recently disabled the mojo monitoring plugin, which helped a lot. Further work is also going on in #55904. We can certainly revisit the remaining points, but we should probably wait for #55904 first.

The manual tests we have conducted on staging based on the ltp syscall test were helpful enough. Nothing we want to run on PRs, of course, unless we can mock the expensive steps. Potentially we could run more expensive tests as part of openQA-in-openQA; however, we would need to be very careful to have really stable results. Stable means it would fail only if it points to reproducible issues that we want to fix. If e.g. a disk is slow, the job would just take longer but shouldn't fail. This way we can eventually have openQA "cloud-ready" again ;)

I would be perfectly OK with you setting the ticket to "Resolved". However, on any next incident we need to make sure we use actual tests to tell us where to put our effort.

We could also think about bunching or truncating results, e.g. if there are more than $limit results they would simply be truncated (rather than the job incompleting or taking forever), or create a packed archive on the worker side and upload the archive (and optionally unpack it on the web UI host).
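The archive idea could look roughly like this on the worker side. This is a hypothetical sketch under assumptions: the pool path, the $limit threshold, and the archive name are illustrative, not existing openQA worker code or API:

```shell
# Hypothetical sketch: if a job produced more than $limit extra log
# files, pack them into one archive instead of uploading each file
# individually. Paths and the limit are illustrative only.
pool=/tmp/poo55238-pool    # stand-in for /var/lib/openqa/pool/<n>
limit=100
mkdir -p "$pool/ulogs"
for i in $(seq 1 250); do echo "log $i" > "$pool/ulogs/extra-$i.txt"; done

count=$(ls "$pool/ulogs" | wc -l)
if [ "$count" -gt "$limit" ]; then
    # one upload of the archive instead of $count individual uploads
    tar -czf "$pool/ulogs-bundle.tar.gz" -C "$pool" ulogs
fi
tar -tzf "$pool/ulogs-bundle.tar.gz" | wc -l   # 251 (directory entry + 250 files)
```

This would turn thousands of REST-API calls into a single upload, at the cost of the web UI host having to unpack (or serve) the archive.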

#14 Updated by pvorel 6 months ago

Just a note: I understand that this must be fixed, but please don't slow down uploading. We've noticed that LTP tests on the ipmi backend are already terribly slow due to uploads like:

[2019-09-24T23:21:24.0481 CEST] [debug] [pid:520421] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3398899/status

Sorry if this is not related; see #57323 if interested.

#15 Updated by kraih 6 months ago

It's hard to say for certain that this kind of problem won't ever appear again, but we can now run 10 parallel LTP tests on the staging machines against a single web UI without problems, so it should at least be a lot less likely.

#16 Updated by okurz 6 months ago

  • Target version set to future

@pvorel there is no intent to slow down uploading. Currently no further work is planned in this ticket until #55904 is done. This ticket should really focus on jobs incompleting during upload while still continuing in the background; #55904 can try to cover performance.

#17 Updated by okurz 4 months ago

How about the following idea: we already upload thumbnails and module results while the job is running, but AFAIK for other log files we only do this after the job is done. Can we upload the log files during execution? Another topic could be the worker and autoinst logs, which are both not complete until the job is done, but we could still upload parts of them and append the missing parts later with an intelligent upload.
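The "intelligent upload" idea could be sketched as follows: remember how many bytes of the log were already sent, and on the next cycle send only the newly appended tail. File names and the offset bookkeeping here are hypothetical illustrations, not existing worker code:

```shell
# Hypothetical sketch of incremental log upload: track the byte offset
# already sent and extract only the newly appended part next time.
log=/tmp/poo55238-autoinst-log.txt
offset_file=/tmp/poo55238-autoinst-log.offset

printf 'first chunk\n' > "$log"
wc -c < "$log" > "$offset_file"       # pretend these 12 bytes were uploaded

printf 'second chunk\n' >> "$log"     # the job keeps appending

# next upload cycle: send only the bytes after the stored offset
offset=$(cat "$offset_file")
tail -c +"$((offset + 1))" "$log"     # prints "second chunk"
```

The extracted tail would then be POSTed as an append, so the server-side copy stays current without re-uploading the whole file each time.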

#18 Updated by mkittler 4 months ago

I think that generally makes sense, but I'm not sure whether it will help with this concrete issue, e.g. if the majority of log files are generated at the end of one particular job module. Note that so far the other results are not uploaded continuously while the job is running but only at job module boundaries.
