Project

General

Profile

Actions

action #90152

closed

coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

coordination #80908: [epic] Continuous deployment (package upgrade or config update) without interrupting currently running openQA jobs

module results missing on quick job (on auto-restarting worker)

Added by okurz about 3 years ago. Updated almost 3 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2021-03-16
Due date:
% Done:

0%

Estimated time:

Description

Observation

From https://chat.suse.de/channel/testing?msg=dc9fhyCdJ3ZM7QHGF

https://openqa.suse.de/tests/5673240/ is a job that turned out "failed" with no module results present. However the system journal on the worker host shows that module results are uploaded:

Mar 15 16:58:53 openqaworker9 worker[1438]: [info] [pid:1438] Finished to rsync tests
Mar 15 16:58:53 openqaworker9 worker[1438]: [info] [pid:1438] Preparing cgroup to start isotovideo
Mar 15 16:58:53 openqaworker9 worker[1438]: [info] [pid:1438] Using cgroup /sys/fs/cgroup/systemd/openqa.slice/openqa-worker.slice/openqa-worker-auto-restart@18.service/5673240
Mar 15 16:58:53 openqaworker9 worker[1438]: [info] [pid:1438] Starting isotovideo container
Mar 15 16:58:53 openqaworker9 worker[1438]: [debug] [pid:1438] Registered process:1751
Mar 15 16:58:53 openqaworker9 worker[1438]: [info] [pid:1751] 1751: WORKING 5673240
Mar 15 16:58:53 openqaworker9 worker[1438]: [info] [pid:1438] isotovideo has been started (PID: 1751)
Mar 15 16:58:53 openqaworker9 worker[1438]: [debug] [pid:1751] +++ worker notes +++
Mar 15 16:58:53 openqaworker9 worker[1438]: [debug] [pid:1438] Running job 5673240 from openqa.suse.de: 05673240-carwos-dev-qemu-x86_64-Build361378-carwos_master-carwos-selinux@64bit.
Mar 15 16:58:53 openqaworker9 worker[1438]: [debug] [pid:1438] REST-API call: POST http://openqa.suse.de/api/v1/jobs/5673240/status
Mar 15 16:59:03 openqaworker9 worker[1438]: [debug] [pid:1438] REST-API call: POST http://openqa.suse.de/api/v1/jobs/5673240/status
Mar 15 16:59:13 openqaworker9 worker[1438]: [debug] [pid:1438] REST-API call: POST http://openqa.suse.de/api/v1/jobs/5673240/status
Mar 15 16:59:24 openqaworker9 worker[1438]: [debug] [pid:1438] REST-API call: POST http://openqa.suse.de/api/v1/jobs/5673240/status
Mar 15 16:59:32 openqaworker9 worker[1438]: [info] [pid:1438] Isotovideo exit status: 0
Mar 15 16:59:32 openqaworker9 worker[1438]: [debug] [pid:1438] Stopping job 5673240 from openqa.suse.de: 05673240-carwos-dev-qemu-x86_64-Build361378-carwos_master-carwos-selinux@64bit - reason: done
Mar 15 16:59:32 openqaworker9 worker[1438]: [debug] [pid:1438] REST-API call: POST http://openqa.suse.de/api/v1/jobs/5673240/status
Mar 15 16:59:32 openqaworker9 worker[1438]: [info] [pid:1438] +++ worker notes +++
Mar 15 16:59:32 openqaworker9 worker[1438]: [info] [pid:1438] End time: 2021-03-15 15:59:32
Mar 15 16:59:32 openqaworker9 worker[1438]: [info] [pid:1438] Result: done
Mar 15 16:59:32 openqaworker9 worker[1438]: [info] [pid:2666] Uploading vars.json
Mar 15 16:59:32 openqaworker9 worker[1438]: [info] [pid:2666] Uploading autoinst-log.txt
Mar 15 16:59:32 openqaworker9 worker[1438]: [info] [pid:2666] Uploading worker-log.txt
Mar 15 16:59:32 openqaworker9 worker[1438]: [info] [pid:2666] Uploading serial0.txt
Mar 15 16:59:32 openqaworker9 worker[1438]: [info] [pid:2666] Uploading video_time.vtt
Mar 15 16:59:32 openqaworker9 worker[1438]: [info] [pid:2666] Uploading serial_terminal.txt
Mar 15 16:59:32 openqaworker9 worker[1438]: [debug] [pid:1438] Setting job 5673240 to done
Mar 15 16:59:32 openqaworker9 worker[1438]: [debug] [pid:1438] REST-API call: POST http://openqa.suse.de/api/v1/jobs/5673240/status
Mar 15 16:59:32 openqaworker9 worker[1438]: [debug] [pid:2675] Optimizing /var/lib/openqa/pool/18/testresults/selinux-2.png
Mar 15 16:59:32 openqaworker9 worker[1438]: [debug] [pid:2675] Uploading artefact selinux-2.png as bc3973abbdd2dd9e0c507ddcd875c26a
Mar 15 16:59:33 openqaworker9 worker[1438]: [debug] [pid:2675] Optimizing /var/lib/openqa/pool/18/testresults/.thumbs/selinux-2.png
Mar 15 16:59:33 openqaworker9 worker[1438]: [debug] [pid:2675] Uploading artefact selinux-2.png as bc3973abbdd2dd9e0c507ddcd875c26a
Mar 15 16:59:33 openqaworker9 worker[1438]: [debug] [pid:2675] Uploading artefact selinux-30.txt
Mar 15 16:59:33 openqaworker9 worker[1438]: [debug] [pid:2675] Uploading artefact selinux-21.txt
…
Mar 15 16:59:34 openqaworker9 worker[1438]: [debug] [pid:2675] Uploading artefact selinux-20.txt
Mar 15 16:59:34 openqaworker9 worker[1438]: [debug] [pid:2675] Uploading artefact selinux-29.txt
Mar 15 16:59:34 openqaworker9 worker[1438]: [debug] [pid:2675] Uploading artefact selinux-18.txt
Mar 15 16:59:34 openqaworker9 worker[1438]: [debug] [pid:1438] REST-API call: POST http://openqa.suse.de/api/v1/jobs/5673240/set_done?worker_id=617
Mar 15 16:59:34 openqaworker9 worker[1438]: [debug] [pid:1438] Job 5673240 from openqa.suse.de finished - reason: done
Mar 15 16:59:34 openqaworker9 worker[1438]: [debug] [pid:1438] Cleaning up for next job
Mar 15 16:59:34 openqaworker9 worker[1438]: [debug] [pid:1438] Informing openqa.suse.de that we are going offline

Problem

I have the suspicion that the worker uploads everything correctly but the auto-restarting worker terminates after the job is done before the openQA webUI is done processing all data


Related issues 3 (0 open3 closed)

Related to openQA Project - action #58826: Result not rendered in detail view on short (e.g. <10s) test-modules, if job is still runningResolvedtinita2019-10-29

Actions
Related to openQA Project - action #91461: Test is missing webui results and fail despite all tests passedResolvedmkittler2021-04-212021-05-06

Actions
Related to openQA Project - action #91902: Tests incomplete with reason "Failed modules: …"Resolvedmkittler2021-04-282021-05-13

Actions
Actions #1

Updated by okurz about 3 years ago

  • Parent task set to #80908
Actions #2

Updated by mkittler about 3 years ago

  • Assignee set to mkittler
Actions #3

Updated by mkittler about 3 years ago

I have the suspicion that the worker uploads everything correctly but the auto-restarting worker terminates after the job is done before the openQA webUI is done processing all data

That's very strange. There's actually code to ensure that all result uploads have been concluded before proceeding with marking the job as done. And we also even try to do one more "final" upload before marking the job as done to ensure everything is uploaded in the end (in case there were connection issues in the middle). This final upload is also waited for of course. Note that this is supposed to work independently of whether the worker terminates itself at the end of a job because we want to keep the job in the state "uploading" as long as uploads are ongoing. If we would fail to do so uploads would likely even fail in the case the worker wouldn't terminate itself because uploads should be prevented if the job has been declared done (and hence its current worker is unset).

Actions #4

Updated by mkittler about 3 years ago

This is the relevant worker log without anything elided:

Mär 15 16:58:53 openqaworker9 worker[1438]: [info] [pid:1751] 1751: WORKING 5673240
Mär 15 16:58:53 openqaworker9 worker[1438]: [info] [pid:1438] isotovideo has been started (PID: 1751)
Mär 15 16:58:53 openqaworker9 worker[1438]: [debug] [pid:1751] +++ worker notes +++
Mär 15 16:58:53 openqaworker9 worker[1438]: [debug] [pid:1438] Running job 5673240 from openqa.suse.de: 05673240-carwos-dev-qemu-x86_64-Build361378-carwos_master-carwos-selinux@64bit.
Mär 15 16:58:53 openqaworker9 worker[1438]: [debug] [pid:1438] REST-API call: POST http://openqa.suse.de/api/v1/jobs/5673240/status
Mär 15 16:59:03 openqaworker9 worker[1438]: [debug] [pid:1438] REST-API call: POST http://openqa.suse.de/api/v1/jobs/5673240/status
Mär 15 16:59:13 openqaworker9 worker[1438]: [debug] [pid:1438] REST-API call: POST http://openqa.suse.de/api/v1/jobs/5673240/status
Mär 15 16:59:24 openqaworker9 worker[1438]: [debug] [pid:1438] REST-API call: POST http://openqa.suse.de/api/v1/jobs/5673240/status
Mär 15 16:59:32 openqaworker9 worker[1438]: [info] [pid:1438] Isotovideo exit status: 0
Mär 15 16:59:32 openqaworker9 worker[1438]: [debug] [pid:1438] Stopping job 5673240 from openqa.suse.de: 05673240-carwos-dev-qemu-x86_64-Build361378-carwos_master-carwos-selinux@64bit - reason: done
Mär 15 16:59:32 openqaworker9 worker[1438]: [debug] [pid:1438] REST-API call: POST http://openqa.suse.de/api/v1/jobs/5673240/status
Mär 15 16:59:32 openqaworker9 worker[1438]: [info] [pid:1438] +++ worker notes +++
Mär 15 16:59:32 openqaworker9 worker[1438]: [info] [pid:1438] End time: 2021-03-15 15:59:32
Mär 15 16:59:32 openqaworker9 worker[1438]: [info] [pid:1438] Result: done
Mär 15 16:59:32 openqaworker9 worker[1438]: [info] [pid:2666] Uploading vars.json
Mär 15 16:59:32 openqaworker9 worker[1438]: [info] [pid:2666] Uploading autoinst-log.txt
Mär 15 16:59:32 openqaworker9 worker[1438]: [info] [pid:2666] Uploading worker-log.txt
Mär 15 16:59:32 openqaworker9 worker[1438]: [info] [pid:2666] Uploading serial0.txt
Mär 15 16:59:32 openqaworker9 worker[1438]: [info] [pid:2666] Uploading video_time.vtt
Mär 15 16:59:32 openqaworker9 worker[1438]: [info] [pid:2666] Uploading serial_terminal.txt
Mär 15 16:59:32 openqaworker9 worker[1438]: [debug] [pid:1438] Setting job 5673240 to done
Mär 15 16:59:32 openqaworker9 worker[1438]: [debug] [pid:1438] REST-API call: POST http://openqa.suse.de/api/v1/jobs/5673240/status
Mär 15 16:59:32 openqaworker9 worker[1438]: [debug] [pid:2675] Optimizing /var/lib/openqa/pool/18/testresults/selinux-2.png
Mär 15 16:59:32 openqaworker9 worker[1438]: [debug] [pid:2675] Uploading artefact selinux-2.png as bc3973abbdd2dd9e0c507ddcd875c26a
Mär 15 16:59:33 openqaworker9 worker[1438]: [debug] [pid:2675] Optimizing /var/lib/openqa/pool/18/testresults/.thumbs/selinux-2.png
Mär 15 16:59:33 openqaworker9 worker[1438]: [debug] [pid:2675] Uploading artefact selinux-2.png as bc3973abbdd2dd9e0c507ddcd875c26a
Mär 15 16:59:33 openqaworker9 worker[1438]: [debug] [pid:2675] Uploading artefact selinux-30.txt
Mär 15 16:59:33 openqaworker9 worker[1438]: [debug] [pid:2675] Uploading artefact selinux-21.txt
Mär 15 16:59:33 openqaworker9 worker[1438]: [debug] [pid:2675] Uploading artefact selinux-13.txt
Mär 15 16:59:33 openqaworker9 worker[1438]: [debug] [pid:2675] Uploading artefact selinux-23.txt
Mär 15 16:59:33 openqaworker9 worker[1438]: [debug] [pid:2675] Uploading artefact selinux-28.txt
Mär 15 16:59:33 openqaworker9 worker[1438]: [debug] [pid:2675] Uploading artefact selinux-6.txt
Mär 15 16:59:33 openqaworker9 worker[1438]: [debug] [pid:2675] Uploading artefact selinux-12.txt
Mär 15 16:59:33 openqaworker9 worker[1438]: [debug] [pid:2675] Uploading artefact selinux-24.txt
Mär 15 16:59:33 openqaworker9 worker[1438]: [debug] [pid:2675] Uploading artefact selinux-31.txt
Mär 15 16:59:33 openqaworker9 worker[1438]: [debug] [pid:2675] Uploading artefact selinux-17.txt
Mär 15 16:59:33 openqaworker9 worker[1438]: [debug] [pid:2675] Uploading artefact selinux-8.txt
Mär 15 16:59:33 openqaworker9 worker[1438]: [debug] [pid:2675] Uploading artefact selinux-22.txt
Mär 15 16:59:33 openqaworker9 worker[1438]: [debug] [pid:2675] Uploading artefact selinux-11.txt
Mär 15 16:59:33 openqaworker9 worker[1438]: [debug] [pid:2675] Uploading artefact selinux-27.txt
Mär 15 16:59:33 openqaworker9 worker[1438]: [debug] [pid:2675] Uploading artefact selinux-4.txt
Mär 15 16:59:33 openqaworker9 worker[1438]: [debug] [pid:2675] Uploading artefact selinux-3.txt
Mär 15 16:59:33 openqaworker9 worker[1438]: [debug] [pid:2675] Uploading artefact selinux-19.txt
Mär 15 16:59:33 openqaworker9 worker[1438]: [debug] [pid:2675] Uploading artefact selinux-25.txt
Mär 15 16:59:33 openqaworker9 worker[1438]: [debug] [pid:2675] Uploading artefact selinux-14.txt
Mär 15 16:59:33 openqaworker9 worker[1438]: [debug] [pid:2675] Uploading artefact selinux-9.txt
Mär 15 16:59:33 openqaworker9 worker[1438]: [debug] [pid:2675] Uploading artefact selinux-26.txt
Mär 15 16:59:34 openqaworker9 worker[1438]: [debug] [pid:2675] Uploading artefact selinux-7.txt
Mär 15 16:59:34 openqaworker9 worker[1438]: [debug] [pid:2675] Uploading artefact selinux-32.txt
Mär 15 16:59:34 openqaworker9 worker[1438]: [debug] [pid:2675] Uploading artefact selinux-15.txt
Mär 15 16:59:34 openqaworker9 worker[1438]: [debug] [pid:2675] Uploading artefact selinux-10.txt
Mär 15 16:59:34 openqaworker9 worker[1438]: [debug] [pid:2675] Uploading artefact selinux-5.txt
Mär 15 16:59:34 openqaworker9 worker[1438]: [debug] [pid:2675] Uploading artefact selinux-16.txt
Mär 15 16:59:34 openqaworker9 worker[1438]: [debug] [pid:2675] Uploading artefact selinux-20.txt
Mär 15 16:59:34 openqaworker9 worker[1438]: [debug] [pid:2675] Uploading artefact selinux-29.txt
Mär 15 16:59:34 openqaworker9 worker[1438]: [debug] [pid:2675] Uploading artefact selinux-18.txt
Mär 15 16:59:34 openqaworker9 worker[1438]: [debug] [pid:1438] REST-API call: POST http://openqa.suse.de/api/v1/jobs/5673240/set_done?worker_id=617
Mär 15 16:59:34 openqaworker9 worker[1438]: [debug] [pid:1438] Job 5673240 from openqa.suse.de finished - reason: done
Mär 15 16:59:34 openqaworker9 worker[1438]: [debug] [pid:1438] Cleaning up for next job
Mär 15 16:59:34 openqaworker9 worker[1438]: [debug] [pid:1438] Informing openqa.suse.de that we are going offline
Mär 15 16:59:34 openqaworker9 systemd[1]: openqa-worker-auto-restart@18.service: Service RestartSec=100ms expired, scheduling restart.
Mär 15 16:59:34 openqaworker9 systemd[1]: Stopped openQA Worker #18.

That Setting job 5673240 to done appears before Uploading artefact … lines seems to contradict the statement from my last comment but it actually isn't because Setting job … to done just corresponds to the descicion to mark the job as done but then the worker still does the upload and only proceeds with actually marking the job as done after that. The log also shows that this job really relies on the final upload but like I said in the previous comment: If the final upload is actually still ongoing in the background at the time of REST-API call: POST and later it wouldn't only be a problem for auto-restarting workers but also generally. (Not only for the reason stated in the last comment but also because the Cleaning up for next job would interfere.)

I nevertheless checked the code but it looks good:

    if ($reason eq WORKER_SR_DONE) {
        log_debug("Setting job $job_id to done");
        return $self->_upload_results(sub { $callback->(); });
    }

I also checked _upload_results and its internal functions and it doesn't seem that there's a way the callback is invoked to early.

Unfortunately there are no traces in the web UI log for 5673240 (except some accesses of its page later on). I can only check out the results directory. It looks like at least some files are there:

martchus@openqa:/var/lib/openqa/testresults/05673/05673240-carwos-dev-qemu-x86_64-Build361378-carwos_master-carwos-selinux@64bit> l
insgesamt 324
drwxr-xr-x   4 geekotest nogroup  4096 15. Mär 16:59 ./
drwxr-xr-x 641 geekotest nogroup 77824 16. Mär 09:15 ../
-rw-r--r--   1 geekotest nogroup 59476 15. Mär 16:59 autoinst-log.txt
-rw-r--r--   1 geekotest nogroup    59 15. Mär 16:59 selinux-10.txt
-rw-r--r--   1 geekotest nogroup    43 15. Mär 16:59 selinux-11.txt
-rw-r--r--   1 geekotest nogroup   118 15. Mär 16:59 selinux-12.txt
-rw-r--r--   1 geekotest nogroup    81 15. Mär 16:59 selinux-13.txt
-rw-r--r--   1 geekotest nogroup    43 15. Mär 16:59 selinux-14.txt
-rw-r--r--   1 geekotest nogroup   135 15. Mär 16:59 selinux-15.txt
-rw-r--r--   1 geekotest nogroup    59 15. Mär 16:59 selinux-16.txt
-rw-r--r--   1 geekotest nogroup    43 15. Mär 16:59 selinux-17.txt
-rw-r--r--   1 geekotest nogroup    57 15. Mär 16:59 selinux-18.txt
-rw-r--r--   1 geekotest nogroup   151 15. Mär 16:59 selinux-19.txt
-rw-r--r--   1 geekotest nogroup    43 15. Mär 16:59 selinux-20.txt
-rw-r--r--   1 geekotest nogroup    71 15. Mär 16:59 selinux-21.txt
-rw-r--r--   1 geekotest nogroup    55 15. Mär 16:59 selinux-22.txt
-rw-r--r--   1 geekotest nogroup    43 15. Mär 16:59 selinux-23.txt
-rw-r--r--   1 geekotest nogroup   201 15. Mär 16:59 selinux-24.txt
-rw-r--r--   1 geekotest nogroup   104 15. Mär 16:59 selinux-25.txt
-rw-r--r--   1 geekotest nogroup    74 15. Mär 16:59 selinux-26.txt
-rw-r--r--   1 geekotest nogroup    43 15. Mär 16:59 selinux-27.txt
-rw-r--r--   1 geekotest nogroup   135 15. Mär 16:59 selinux-28.txt
-rw-r--r--   1 geekotest nogroup   117 15. Mär 16:59 selinux-29.txt
-rw-r--r--   1 geekotest nogroup    43 15. Mär 16:59 selinux-30.txt
-rw-r--r--   1 geekotest nogroup   342 15. Mär 16:59 selinux-31.txt
-rw-r--r--   1 geekotest nogroup   110 15. Mär 16:59 selinux-32.txt
-rw-r--r--   1 geekotest nogroup   120 15. Mär 16:59 selinux-3.txt
-rw-r--r--   1 geekotest nogroup    71 15. Mär 16:59 selinux-4.txt
-rw-r--r--   1 geekotest nogroup    71 15. Mär 16:59 selinux-5.txt
-rw-r--r--   1 geekotest nogroup   132 15. Mär 16:59 selinux-6.txt
-rw-r--r--   1 geekotest nogroup    57 15. Mär 16:59 selinux-7.txt
-rw-r--r--   1 geekotest nogroup    43 15. Mär 16:59 selinux-8.txt
-rw-r--r--   1 geekotest nogroup   135 15. Mär 16:59 selinux-9.txt
-rw-r--r--   1 geekotest nogroup 28003 15. Mär 16:59 serial0.txt
-rw-r--r--   1 geekotest nogroup  5822 15. Mär 16:59 serial_terminal.txt
drwxr-xr-x   2 geekotest nogroup    10 15. Mär 16:59 .thumbs/
drwxr-xr-x   2 geekotest nogroup    10 15. Mär 16:59 ulogs/
-rw-r--r--   1 geekotest nogroup  2153 15. Mär 16:59 vars.json
-rw-r--r--   1 geekotest nogroup     7 15. Mär 16:59 video_time.vtt
-rw-r--r--   1 geekotest nogroup  4010 15. Mär 16:59 worker-log.txt
martchus@openqa:/var/lib/openqa/testresults/05673/05673240-carwos-dev-qemu-x86_64-Build361378-carwos_master-carwos-selinux@64bit> cat selinux-3.txt
# wait_serial expected: "boot:"
# Result:

SYSLINUX 4.04 0x9f68dc20 Copyright (C) 1994-2011 H. Peter Anvin et al
boot:
Actions #5

Updated by mkittler about 3 years ago

I still don't know why this would happen. All I have so far is a PR for extending the test and improving the logging: https://github.com/os-autoinst/openQA/pull/3791

Actions #6

Updated by mkittler about 3 years ago

  • Status changed from New to In Progress
Actions #7

Updated by openqa_review about 3 years ago

  • Due date set to 2021-03-31

Setting due date based on mean cycle time of SUSE QE Tools

Actions #8

Updated by mkittler about 3 years ago

  • Status changed from In Progress to Feedback

Did we encounter more jobs like this one? I'm currently out of ideas.

Actions #9

Updated by okurz about 3 years ago

@coolo can you share more examples for https://progress.opensuse.org/issues/90152#change-392579 please based on your carwos tests?

Actions #11

Updated by mkittler about 3 years ago

This job looks indeed identical to the first occurrence. I've checked the log on the worker host but I'm not pasting it here because it is really the same.


By the way, this is the worker upload log of a good job, just for comparison:

Mär 17 16:53:27 openqaworker6 worker[25577]: [info] [pid:25577] Finished to rsync tests
Mär 17 16:53:27 openqaworker6 worker[25577]: [info] [pid:25577] Preparing cgroup to start isotovideo
Mär 17 16:53:27 openqaworker6 worker[25577]: [info] [pid:25577] Using cgroup /sys/fs/cgroup/systemd/openqa.slice/openqa-worker.slice/openqa-worker-auto-restart@10.service/5687533
Mär 17 16:53:27 openqaworker6 worker[25577]: [info] [pid:25577] Starting isotovideo container
Mär 17 16:53:27 openqaworker6 worker[25577]: [debug] [pid:25577] Registered process:31388
Mär 17 16:53:27 openqaworker6 worker[25577]: [info] [pid:25577] isotovideo has been started (PID: 31388)
Mär 17 16:53:27 openqaworker6 worker[25577]: [debug] [pid:25577] Running job 5687533 from openqa.suse.de: 05687533-carwos-dev-qemu-x86_64-Build363609-vgautam_master-carwos-selinux@64bit.
Mär 17 16:53:27 openqaworker6 worker[25577]: [info] [pid:31388] 31388: WORKING 5687533
Mär 17 16:53:27 openqaworker6 worker[25577]: [debug] [pid:31388] +++ worker notes +++
Mär 17 16:53:27 openqaworker6 worker[25577]: [debug] [pid:25577] REST-API call: POST http://openqa.suse.de/api/v1/jobs/5687533/status
Mär 17 16:53:37 openqaworker6 worker[25577]: [debug] [pid:25577] REST-API call: POST http://openqa.suse.de/api/v1/jobs/5687533/status
Mär 17 16:53:47 openqaworker6 worker[25577]: [debug] [pid:25577] REST-API call: POST http://openqa.suse.de/api/v1/jobs/5687533/status
Mär 17 16:53:57 openqaworker6 worker[25577]: [debug] [pid:25577] REST-API call: POST http://openqa.suse.de/api/v1/jobs/5687533/status
Mär 17 16:54:07 openqaworker6 worker[25577]: [debug] [pid:25577] REST-API call: POST http://openqa.suse.de/api/v1/jobs/5687533/status
Mär 17 16:54:17 openqaworker6 worker[25577]: [debug] [pid:25577] REST-API call: POST http://openqa.suse.de/api/v1/jobs/5687533/status
Mär 17 16:54:20 openqaworker6 worker[25577]: [info] [pid:25577] Isotovideo exit status: 0
Mär 17 16:54:20 openqaworker6 worker[25577]: [debug] [pid:25577] Stopping job 5687533 from openqa.suse.de: 05687533-carwos-dev-qemu-x86_64-Build363609-vgautam_master-carwos-selinux@64bit - reason: done
Mär 17 16:54:20 openqaworker6 worker[25577]: [debug] [pid:25577] REST-API call: POST http://openqa.suse.de/api/v1/jobs/5687533/status
Mär 17 16:54:20 openqaworker6 worker[25577]: [info] [pid:25577] +++ worker notes +++
Mär 17 16:54:20 openqaworker6 worker[25577]: [info] [pid:25577] End time: 2021-03-17 15:54:20
Mär 17 16:54:20 openqaworker6 worker[25577]: [info] [pid:25577] Result: done
Mär 17 16:54:20 openqaworker6 worker[25577]: [info] [pid:33242] Uploading vars.json
Mär 17 16:54:20 openqaworker6 worker[25577]: [info] [pid:33242] Uploading autoinst-log.txt
Mär 17 16:54:20 openqaworker6 worker[25577]: [info] [pid:33242] Uploading worker-log.txt
Mär 17 16:54:20 openqaworker6 worker[25577]: [info] [pid:33242] Uploading serial0.txt
Mär 17 16:54:20 openqaworker6 worker[25577]: [info] [pid:33242] Uploading video_time.vtt
Mär 17 16:54:20 openqaworker6 worker[25577]: [info] [pid:33242] Uploading serial_terminal.txt
Mär 17 16:54:20 openqaworker6 worker[25577]: [debug] [pid:25577] Setting job 5687533 to done
Mär 17 16:54:20 openqaworker6 worker[25577]: [debug] [pid:25577] REST-API call: POST http://openqa.suse.de/api/v1/jobs/5687533/status
Mär 17 16:54:21 openqaworker6 worker[25577]: [debug] [pid:33244] Uploading artefact selinux-29.txt
Mär 17 16:54:21 openqaworker6 worker[25577]: [debug] [pid:33244] Uploading artefact selinux-15.txt
Mär 17 16:54:21 openqaworker6 worker[25577]: [debug] [pid:33244] Uploading artefact selinux-17.txt
Mär 17 16:54:21 openqaworker6 worker[25577]: [debug] [pid:33244] Uploading artefact selinux-12.txt
Mär 17 16:54:21 openqaworker6 worker[25577]: [debug] [pid:33244] Uploading artefact selinux-7.txt
Mär 17 16:54:21 openqaworker6 worker[25577]: [debug] [pid:33244] Uploading artefact selinux-11.txt
Mär 17 16:54:21 openqaworker6 worker[25577]: [debug] [pid:33244] Uploading artefact selinux-25.txt
Mär 17 16:54:21 openqaworker6 worker[25577]: [debug] [pid:33244] Uploading artefact selinux-23.txt
Mär 17 16:54:21 openqaworker6 worker[25577]: [debug] [pid:33244] Uploading artefact selinux-21.txt
Mär 17 16:54:21 openqaworker6 worker[25577]: [debug] [pid:33244] Uploading artefact selinux-19.txt
Mär 17 16:54:21 openqaworker6 worker[25577]: [debug] [pid:33244] Uploading artefact selinux-18.txt
Mär 17 16:54:21 openqaworker6 worker[25577]: [debug] [pid:33244] Uploading artefact selinux-28.txt
Mär 17 16:54:21 openqaworker6 worker[25577]: [debug] [pid:33244] Uploading artefact selinux-32.txt
Mär 17 16:54:21 openqaworker6 worker[25577]: [debug] [pid:33244] Uploading artefact selinux-30.txt
Mär 17 16:54:21 openqaworker6 worker[25577]: [debug] [pid:33244] Uploading artefact selinux-16.txt
Mär 17 16:54:21 openqaworker6 worker[25577]: [debug] [pid:33244] Uploading artefact selinux-20.txt
Mär 17 16:54:21 openqaworker6 worker[25577]: [debug] [pid:33244] Uploading artefact selinux-9.txt
Mär 17 16:54:21 openqaworker6 worker[25577]: [debug] [pid:33244] Uploading artefact selinux-24.txt
Mär 17 16:54:21 openqaworker6 worker[25577]: [debug] [pid:33244] Uploading artefact selinux-14.txt
Mär 17 16:54:21 openqaworker6 worker[25577]: [debug] [pid:33244] Uploading artefact selinux-27.txt
Mär 17 16:54:21 openqaworker6 worker[25577]: [debug] [pid:33244] Uploading artefact selinux-4.txt
Mär 17 16:54:21 openqaworker6 worker[25577]: [debug] [pid:33244] Uploading artefact selinux-3.txt
Mär 17 16:54:21 openqaworker6 worker[25577]: [debug] [pid:33244] Uploading artefact selinux-13.txt
Mär 17 16:54:21 openqaworker6 worker[25577]: [debug] [pid:33244] Uploading artefact selinux-10.txt
Mär 17 16:54:21 openqaworker6 worker[25577]: [debug] [pid:33244] Uploading artefact selinux-8.txt
Mär 17 16:54:21 openqaworker6 worker[25577]: [debug] [pid:33244] Uploading artefact selinux-5.txt
Mär 17 16:54:21 openqaworker6 worker[25577]: [debug] [pid:33244] Uploading artefact selinux-22.txt
Mär 17 16:54:21 openqaworker6 worker[25577]: [debug] [pid:33244] Uploading artefact selinux-31.txt
Mär 17 16:54:21 openqaworker6 worker[25577]: [debug] [pid:33244] Uploading artefact selinux-26.txt
Mär 17 16:54:21 openqaworker6 worker[25577]: [debug] [pid:33244] Uploading artefact selinux-6.txt
Mär 17 16:54:21 openqaworker6 worker[25577]: [debug] [pid:25577] REST-API call: POST http://openqa.suse.de/api/v1/jobs/5687533/set_done?worker_id=972
Mär 17 16:54:21 openqaworker6 worker[25577]: [debug] [pid:25577] Job 5687533 from openqa.suse.de finished - reason: done
Mär 17 16:54:21 openqaworker6 worker[25577]: [debug] [pid:25577] Cleaning up for next job
Mär 17 16:54:22 openqaworker6 worker[25577]: [debug] [pid:25577] Informing openqa.suse.de that we are going offline
Mär 17 16:54:22 openqaworker6 systemd[1]: openqa-worker-auto-restart@10.service: Service RestartSec=100ms expired, scheduling restart.
Mär 17 16:54:22 openqaworker6 systemd[1]: Stopped openQA Worker #10.

However, I also fail to see a significant difference to the ones with missing results.


The good thing is that these jobs can actually be cloned locally. However, so far I couldn't reproduce a bad case.

Actions #12

Updated by mkittler about 3 years ago

I couldn't reproduce it locally. I tried it several times and of course with an auto-restarting worker and with cleanup enabled. Maybe there's also something wrong on the web UI side? Unfortunately I couldn't also find anything related to the 2nd occurrence in the web UI or GRU log as well.

Maybe also simply an error occurred but the error handling is broken and therefore the problem isn't apparent within the logs? It would be possible considering the test coverage of the upload code is unknown (as it runs in a forked process and coverage data is not collected for it).

Actions #13

Updated by mkittler about 3 years ago

I have the suspicion that the worker uploads everything correctly but the auto-restarting worker terminates after the job is done before the openQA webUI is done processing all data

If I change the web UI to require 10 seconds for every job status and artifact update and to handle only 2 requests at a time one can easily follow the very slow upload. The upload is still happening successfully long after isotovideo has existed. The worker does not disconnect/unregister before the upload is done. In conclusion this suspicion can not be correct.


Maybe also simply an error occurred but the error handling is broken and therefore the problem isn't apparent within the logs?

If I change the web UI to return 500 on artifact creation I get [warn] [pid:6061] Upload attempts remaining: 2/5 for vars.json, in 5 seconds so the error handling should generally be ok.


If I comment-out $self->update_module($name, $result->{$name}, \%known_image, \%known_files); in the web UI controller I get a similar behavior: This step results are missing but from the point of the worker everything looks ok. That seems a bit contrived but judging by the code of update_module I suppose this might be what's happening when the module creation didn't work in the first place:

sub update_module {
…
    my $mod = $self->modules->find({name => $name}, {order_by => {-desc => 't_updated'}, rows => 1});
    return undef unless $mod;
Actions #14

Updated by mkittler about 3 years ago

If I comment-out $self->insert_test_modules($status->{test_order}) if $status->{test_order}; on the web UI side I get exactly the same behavior. I assume either the worker or the web UI silently fail to post/receive the test order. Or isotovideo did not create test_order.json or autoinst-status.json so the worker had no chance of posting the test order in the first place.

It is also problematic that the web UI silently throws away uploaded result if there's no corresponding job module (see the code block mentioned in my last comment). However, I'm not sure what ramifications treating this as an error would have.

Actions #15

Updated by mkittler about 3 years ago

PR for improving the error handling (see last paragraph of previous comment): https://github.com/os-autoinst/openQA/pull/3813

Actions #16

Updated by livdywan about 3 years ago

  • Due date changed from 2021-03-31 to 2021-04-09

mkittler wrote:

PR for improving the error handling (see last paragraph of previous comment): https://github.com/os-autoinst/openQA/pull/3813

The PR got merged. Is there more to be done?

Actions #17

Updated by mkittler about 3 years ago

Yes, like fixing the actual problem. However, I don't have any clue so far.

The most recent job I could currently find is https://openqa.suse.de/tests/5742623.

Actions #18

Updated by okurz about 3 years ago

As discussed, when we don't see the problem again I suggest you update the due-date to a later time, e.g. two weeks from now as reminder, then check again or close the ticket with the improved error handling added.

You could also do a SQL query for "failed but no modules" or use auto-review to try to catch these results with result "failed", reason "no reason" but maybe a matching expression in the worker-log.txt

Actions #19

Updated by mkittler about 3 years ago

I've now been checking with an SQL query:

openqa=# select concat('https://openqa.suse.de/tests/', id), t_finished from jobs where result = 'failed' and reason is null and (select count(id) from job_modules where job_modules.job_id = jobs.id) = 0 order by t_created desc limit 25;
                concat                |     t_finished      
--------------------------------------+---------------------
 https://openqa.suse.de/tests/5790499 | 2021-04-08 07:33:07
 https://openqa.suse.de/tests/5790494 | 2021-04-08 07:33:09
 https://openqa.suse.de/tests/5790467 | 2021-04-08 06:57:39
 https://openqa.suse.de/tests/5790463 | 2021-04-08 06:35:16
 https://openqa.suse.de/tests/5790462 | 2021-04-08 06:34:58
 https://openqa.suse.de/tests/5790461 | 2021-04-08 06:34:33
 https://openqa.suse.de/tests/5742623 | 2021-03-30 12:48:45
 https://openqa.suse.de/tests/5720797 | 2021-03-24 18:20:21
 https://openqa.suse.de/tests/5719571 | 2021-03-24 11:59:04
 https://openqa.suse.de/tests/5705535 | 2021-03-20 05:00:44
 https://openqa.suse.de/tests/5699763 | 2021-03-19 17:20:04
 https://openqa.suse.de/tests/5687434 | 2021-03-17 14:04:14
 https://openqa.suse.de/tests/5673240 | 2021-03-15 15:59:34
 https://openqa.suse.de/tests/5671556 | 2021-03-15 10:39:19
 https://openqa.suse.de/tests/5648501 | 2021-03-11 10:01:22
 https://openqa.suse.de/tests/5618573 | 2021-03-07 16:17:20
 https://openqa.suse.de/tests/5515496 | 2021-02-24 10:10:45
 https://openqa.suse.de/tests/4723195 | 2020-09-23 08:24:42
 https://openqa.suse.de/tests/1669762 | 2018-05-04 10:10:58
 https://openqa.suse.de/tests/844881  | 2017-03-30 13:58:13
 https://openqa.suse.de/tests/535650  | 2016-08-31 10:47:52
 https://openqa.suse.de/tests/535629  | 2016-08-31 11:57:48
 https://openqa.suse.de/tests/535628  | 2016-08-31 12:08:38
 https://openqa.suse.de/tests/535627  | 2016-08-31 11:51:15
 https://openqa.suse.de/tests/535626  | 2016-08-31 11:44:42

So https://openqa.suse.de/tests/5742623 is still the most recent one. The other ones failed because no tests could be loaded.

Actions #20

Updated by livdywan about 3 years ago

  • Due date changed from 2021-04-09 to 2021-04-16
Actions #21

Updated by okurz about 3 years ago

  • Status changed from Feedback to Resolved

as you stated that you could not reproduce we might have a Heisenbug that is not visible after changing the log :)

Please anyone reopen if we encounter the issue again

Actions #22

Updated by coolo about 3 years ago

  • Status changed from Resolved to New
Actions #23

Updated by okurz about 3 years ago

thanks, that will work :)

Actions #24

Updated by mkittler almost 3 years ago

This one looks now a bit different:

[2021-04-16T09:28:51.0825 CEST] [debug] [pid:12269] Updating status so job 5832654 is not considered dead.
[2021-04-16T09:28:51.0826 CEST] [debug] [pid:12269] REST-API call: POST http://openqa.suse.de/api/v1/jobs/5832654/status
[2021-04-16T09:28:51.0964 CEST] [debug] [pid:12269] Linked asset "/var/lib/openqa/cache/openqa.suse.de/carwos-carwos_master-391975.qcow2" to "/var/lib/openqa/pool/14/carwos-carwos_master-391975.qcow2"
[2021-04-16T09:28:56.0980 CEST] [debug] [pid:12269] Updating status so job 5832654 is not considered dead.
[2021-04-16T09:28:56.0980 CEST] [debug] [pid:12269] REST-API call: POST http://openqa.suse.de/api/v1/jobs/5832654/status
[2021-04-16T09:29:02.0115 CEST] [debug] [pid:12269] Updating status so job 5832654 is not considered dead.
[2021-04-16T09:29:02.0116 CEST] [debug] [pid:12269] REST-API call: POST http://openqa.suse.de/api/v1/jobs/5832654/status
[2021-04-16T09:29:02.0284 CEST] [info] [pid:12269] Preparing cgroup to start isotovideo
[2021-04-16T09:29:02.0285 CEST] [info] [pid:12269] Using cgroup /sys/fs/cgroup/systemd/openqa.slice/openqa-worker.slice/openqa-worker-auto-restart@14.service/5832654
[2021-04-16T09:29:02.0285 CEST] [info] [pid:12269] Starting isotovideo container
[2021-04-16T09:29:02.0288 CEST] [debug] [pid:12269] Registered process:12438
[2021-04-16T09:29:02.0289 CEST] [info] [pid:12438] 12438: WORKING 5832654
[2021-04-16T09:29:02.0289 CEST] [info] [pid:12269] isotovideo has been started (PID: 12438)
[2021-04-16T09:29:02.0289 CEST] [debug] [pid:12269] Running job 5832654 from openqa.suse.de: 05832654-carwos-dev-qemu-x86_64-Build391975-carwos_master-carwos-selinux@64bit.
[2021-04-16T09:29:02.0291 CEST] [debug] [pid:12269] REST-API call: POST http://openqa.suse.de/api/v1/jobs/5832654/status
[2021-04-16T09:29:02.0447 CEST] [debug] [pid:12269] Upload concluded (no current module)
[2021-04-16T09:29:12.0454 CEST] [debug] [pid:12269] REST-API call: POST http://openqa.suse.de/api/v1/jobs/5832654/status
[2021-04-16T09:29:12.0533 CEST] [debug] [pid:12269] Upload concluded (no current module)
[2021-04-16T09:29:22.0544 CEST] [debug] [pid:12269] REST-API call: POST http://openqa.suse.de/api/v1/jobs/5832654/status
[2021-04-16T09:29:22.0601 CEST] [debug] [pid:12269] Upload concluded (no current module)
[2021-04-16T09:29:31.0120 CEST] [debug] [pid:12269] Stopping job 5832654 from openqa.suse.de: 05832654-carwos-dev-qemu-x86_64-Build391975-carwos_master-carwos-selinux@64bit - reason: done
[2021-04-16T09:29:31.0120 CEST] [debug] [pid:12269] REST-API call: POST http://openqa.suse.de/api/v1/jobs/5832654/status
[2021-04-16T09:29:31.0206 CEST] [info] [pid:12897] Uploading vars.json
[2021-04-16T09:29:31.0306 CEST] [info] [pid:12897] Uploading autoinst-log.txt
[2021-04-16T09:29:31.0389 CEST] [info] [pid:12897] Uploading worker-log.txt

That's it. There are no Uploading artefact … logs messages at all. I was actually hoping these uploads would now fail with a better error message at least. Luckily the journal on the worker has the error messages I was looking for:

Apr 16 09:29:02 openqaworker2 worker[12269]: [info] [pid:12269] Finished to rsync tests
Apr 16 09:29:02 openqaworker2 worker[12269]: [info] [pid:12269] Preparing cgroup to start isotovideo
Apr 16 09:29:02 openqaworker2 worker[12269]: [info] [pid:12269] Using cgroup /sys/fs/cgroup/systemd/openqa.slice/openqa-worker.slice/openqa-worker-auto-restart@14.service/5832654
Apr 16 09:29:02 openqaworker2 worker[12269]: [info] [pid:12269] Starting isotovideo container
Apr 16 09:29:02 openqaworker2 worker[12269]: [debug] [pid:12269] Registered process:12438
Apr 16 09:29:02 openqaworker2 worker[12269]: [info] [pid:12438] 12438: WORKING 5832654
Apr 16 09:29:02 openqaworker2 worker[12269]: [info] [pid:12269] isotovideo has been started (PID: 12438)
Apr 16 09:29:02 openqaworker2 worker[12269]: [debug] [pid:12438] +++ worker notes +++
Apr 16 09:29:02 openqaworker2 worker[12269]: [debug] [pid:12269] Running job 5832654 from openqa.suse.de: 05832654-carwos-dev-qemu-x86_64-Build391975-carwos_master-carwos-selinux@64bit.
Apr 16 09:29:02 openqaworker2 worker[12269]: [debug] [pid:12269] REST-API call: POST http://openqa.suse.de/api/v1/jobs/5832654/status
Apr 16 09:29:02 openqaworker2 worker[12269]: [debug] [pid:12269] Upload concluded (no current module)
Apr 16 09:29:12 openqaworker2 worker[12269]: [debug] [pid:12269] REST-API call: POST http://openqa.suse.de/api/v1/jobs/5832654/status
Apr 16 09:29:12 openqaworker2 worker[12269]: [debug] [pid:12269] Upload concluded (no current module)
Apr 16 09:29:22 openqaworker2 worker[12269]: [debug] [pid:12269] REST-API call: POST http://openqa.suse.de/api/v1/jobs/5832654/status
Apr 16 09:29:22 openqaworker2 worker[12269]: [debug] [pid:12269] Upload concluded (no current module)
Apr 16 09:29:31 openqaworker2 worker[12269]: [info] [pid:12269] Isotovideo exit status: 0
Apr 16 09:29:31 openqaworker2 worker[12269]: [debug] [pid:12269] Stopping job 5832654 from openqa.suse.de: 05832654-carwos-dev-qemu-x86_64-Build391975-carwos_master-carwos-selinux@64bit - reason: done
Apr 16 09:29:31 openqaworker2 worker[12269]: [debug] [pid:12269] REST-API call: POST http://openqa.suse.de/api/v1/jobs/5832654/status
Apr 16 09:29:31 openqaworker2 worker[12269]: [info] [pid:12269] +++ worker notes +++
Apr 16 09:29:31 openqaworker2 worker[12269]: [info] [pid:12269] End time: 2021-04-16 07:29:31
Apr 16 09:29:31 openqaworker2 worker[12269]: [info] [pid:12269] Result: done
Apr 16 09:29:31 openqaworker2 worker[12269]: [info] [pid:12897] Uploading vars.json
Apr 16 09:29:31 openqaworker2 worker[12269]: [info] [pid:12897] Uploading autoinst-log.txt
Apr 16 09:29:31 openqaworker2 worker[12269]: [info] [pid:12897] Uploading worker-log.txt
Apr 16 09:29:31 openqaworker2 worker[12269]: [info] [pid:12897] Uploading serial0.txt
Apr 16 09:29:31 openqaworker2 worker[12269]: [info] [pid:12897] Uploading video_time.vtt
Apr 16 09:29:31 openqaworker2 worker[12269]: [info] [pid:12897] Uploading serial_terminal.txt
Apr 16 09:29:31 openqaworker2 worker[12269]: [debug] [pid:12269] Considering job 5832654 as regularly done
Apr 16 09:29:31 openqaworker2 worker[12269]: [debug] [pid:12269] REST-API call: POST http://openqa.suse.de/api/v1/jobs/5832654/status
Apr 16 09:29:31 openqaworker2 worker[12269]: [error] [pid:12269] REST-API error (POST http://openqa.suse.de/api/v1/jobs/5832654/status): 400 response: Failed modules: selinux (remaining tries: 0)
Apr 16 09:29:31 openqaworker2 worker[12269]: [error] [pid:12269] Unable to make final image uploads. Maybe the web UI considers this job already dead.
Apr 16 09:29:31 openqaworker2 worker[12269]: [debug] [pid:12269] Upload concluded (no current module)
Apr 16 09:29:31 openqaworker2 worker[12269]: [debug] [pid:12269] REST-API call: POST http://openqa.suse.de/api/v1/jobs/5832654/set_done?worker_id=1259
Apr 16 09:29:31 openqaworker2 worker[12269]: [debug] [pid:12269] Job 5832654 from openqa.suse.de finished - reason: done
Apr 16 09:29:31 openqaworker2 worker[12269]: [debug] [pid:12269] Cleaning up for next job
Apr 16 09:29:31 openqaworker2 worker[12269]: [debug] [pid:12269] Informing openqa.suse.de that we are going offline

So it really runs into the expected error condition which means sending the test order didn't work and therefore the upload of the results of the modules doesn't work as well. With the new logging we see that the worker thinks until the very end that there's no "current module". Either the worker fails to read the test order or the test order isn't written correctly in the first place.

Actions #25

Updated by coolo almost 3 years ago

Again: https://openqa.suse.de/tests/5838063

I have my doubts that this is related to auto-restarting workers btw

Actions #26

Updated by okurz almost 3 years ago

  • Due date changed from 2021-04-16 to 2021-04-30
  • Status changed from New to In Progress
Actions #27

Updated by mkittler almost 3 years ago

Log of that job, basically the same as before:

…
Apr 17 08:45:13 openqaworker5 worker[10669]: [debug] [pid:10669] REST-API call: POST http://openqa.suse.de/api/v1/jobs/5838063/status
Apr 17 08:45:13 openqaworker5 worker[10669]: [debug] [pid:10669] Upload concluded (no current module)
Apr 17 08:45:23 openqaworker5 worker[10669]: [debug] [pid:10669] REST-API call: POST http://openqa.suse.de/api/v1/jobs/5838063/status
Apr 17 08:45:23 openqaworker5 worker[10669]: [debug] [pid:10669] Upload concluded (no current module)
Apr 17 08:45:33 openqaworker5 worker[10669]: [info] [pid:10669] Isotovideo exit status: 0
Apr 17 08:45:33 openqaworker5 worker[10669]: [debug] [pid:10669] Stopping job 5838063 from openqa.suse.de: 05838063-carwos-dev-qemu-x86_64-Build394621-carwos_master-carwos-selinux@64bit - reason: done
Apr 17 08:45:33 openqaworker5 worker[10669]: [debug] [pid:10669] REST-API call: POST http://openqa.suse.de/api/v1/jobs/5838063/status
Apr 17 08:45:33 openqaworker5 worker[10669]: [info] [pid:10669] +++ worker notes +++
Apr 17 08:45:33 openqaworker5 worker[10669]: [info] [pid:10669] End time: 2021-04-17 06:45:33
Apr 17 08:45:33 openqaworker5 worker[10669]: [info] [pid:10669] Result: done
Apr 17 08:45:33 openqaworker5 worker[10669]: [info] [pid:38344] Uploading vars.json
Apr 17 08:45:33 openqaworker5 worker[10669]: [info] [pid:38344] Uploading autoinst-log.txt
Apr 17 08:45:33 openqaworker5 worker[10669]: [info] [pid:38344] Uploading worker-log.txt
Apr 17 08:45:33 openqaworker5 worker[10669]: [info] [pid:38344] Uploading serial0.txt
Apr 17 08:45:33 openqaworker5 worker[10669]: [info] [pid:38344] Uploading video_time.vtt
Apr 17 08:45:33 openqaworker5 worker[10669]: [info] [pid:38344] Uploading serial_terminal.txt
Apr 17 08:45:33 openqaworker5 worker[10669]: [debug] [pid:10669] Considering job 5838063 as regularly done
Apr 17 08:45:33 openqaworker5 worker[10669]: [debug] [pid:10669] REST-API call: POST http://openqa.suse.de/api/v1/jobs/5838063/status
Apr 17 08:45:33 openqaworker5 worker[10669]: [error] [pid:10669] REST-API error (POST http://openqa.suse.de/api/v1/jobs/5838063/status): 400 response: Failed modules: selinux (remaining tries: 0)
Apr 17 08:45:33 openqaworker5 worker[10669]: [error] [pid:10669] Unable to make final image uploads. Maybe the web UI considers this job already dead.
Apr 17 08:45:33 openqaworker5 worker[10669]: [debug] [pid:10669] Upload concluded (no current module)
Apr 17 08:45:33 openqaworker5 worker[10669]: [debug] [pid:10669] REST-API call: POST http://openqa.suse.de/api/v1/jobs/5838063/set_done?worker_id=985
Apr 17 08:45:33 openqaworker5 worker[10669]: [debug] [pid:10669] Job 5838063 from openqa.suse.de finished - reason: done
Apr 17 08:45:33 openqaworker5 worker[10669]: [debug] [pid:10669] Cleaning up for next job
Apr 17 08:45:33 openqaworker5 worker[10669]: [debug] [pid:10669] Informing openqa.suse.de that we are going offline

And yes, this has nothing to do with the auto-restart feature.

To follow up on my last comment. If the test order wouldn't have been written by isotovideo in the first place the log would look like this:

[debug] [pid:31550] Stopping job 1916 from http://localhost:9526: 00001916-carwos-dev-qemu-x86_64-Build394621-carwos_master-carwos-selinux@64bit - reason: died
[debug] [pid:31550] REST-API call: POST http://localhost:9526/api/v1/jobs/1916/status
[info] [pid:31550] +++ worker notes +++
[info] [pid:31550] End time: 2021-04-19 10:21:17
[info] [pid:31550] Result: died
[info] [pid:32067] Uploading vars.json
[info] [pid:32067] Uploading autoinst-log.txt
[info] [pid:32067] Uploading worker-log.txt
[debug] [pid:31550] Job 1916 stopped as incomplete
[debug] [pid:31550] Unable to read test_order.json: Can't open file "/hdd/openqa-devel/openqa/pool/1/testresults/test_order.json": No such file or directory at /hdd/openqa-devel/repos/openQA/script/../lib/OpenQA/Worker/Job.pm line 1154.

[debug] [pid:31550] Upload concluded (no current module)
[debug] [pid:31550] REST-API call: POST http://localhost:9526/api/v1/jobs/1916/set_done?reason=isotovideo+died%3A+Can%27t+init+needles+from+%2Fvar%2Flib%2Fopenqa%2Fcache%2Fopenqa.suse.de%2Ftests%2Fsle%2Fproducts%2Fsle%2Fneedles+at+%2Fhdd%2Fopenqa-devel%2Frepos%2Fos-autoinst%2Fneedle.pm+line+317%2C+%3C%24fh%3E+line+4.&result=incomplete&worker_id=20
[debug] [pid:31550] Job 1916 from http://localhost:9526 finished - reason: died

But there's no Unable to read test_order.json. So likely the worker silently fails with another problem when trying to upload the test order and isotovideo is not at fault.

Actions #28

Updated by mkittler almost 3 years ago

Or isotovideo fails to write autoinst-status.json. If I locally change it not to write this file I get the exact same behavior as in these jobs from production. There should definitely a warning for this problem on the worker log.

Actions #29

Updated by mkittler almost 3 years ago

Possible fix if the theory from the last comment is correct: https://github.com/os-autoinst/openQA/pull/3849
(Instead of showing a warning I simply made it deal with the fact that the status file is missing. At the end the information from that file is irrelevant anyways.)

Actions #30

Updated by okurz almost 3 years ago

  • Related to action #58826: Result not rendered in detail view on short (e.g. <10s) test-modules, if job is still running added
Actions #31

Updated by okurz almost 3 years ago

  • Related to action #91461: Test is missing webui results and fail despite all tests passed added
Actions #32

Updated by mkittler almost 3 years ago

Actions #33

Updated by mkittler almost 3 years ago

  • Status changed from In Progress to Feedback

Because we had to roll back the potential fix is not deployed on OSD. It'll be deployed tomorrow. Let's see whether it helps.

Actions #34

Updated by okurz almost 3 years ago

I found a simpler SQL query. As the jobs table has the module count included we should not need to query the job_modules table having quicker results:

select concat('https://openqa.suse.de/tests/', id), t_finished from jobs where result = 'failed' and passed_module_count = 0 and failed_module_count = 0 and softfailed_module_count = 0 and skipped_module_count = 0 order by t_created desc limit 25;

where the last job with such symptoms was https://openqa.suse.de/tests/5838063 from 2021-04-17 06:45:33
so 10 days ago. I would consider the ticket done then, again, until we see a problem like this again. But I think maybe the above is a good safety check we can build in as internal check or at least grafana monitoring alert. How about making a job incomplete in case some validation checks like regarding the module count fail?

Actions #35

Updated by mkittler almost 3 years ago

Oh, right. I was even the one introducing the columns on job level :-)

How about making a job incomplete in case some validation checks like regarding the module count fail?

We already mark jobs failing due to compilation errors or asset download errors as incomplete. So marking jobs without modules generally as incomplete would be in-line with that.

Actions #36

Updated by okurz almost 3 years ago

We already mark jobs failing due to compilation errors or asset download errors as incomplete. So marking jobs without modules generally as incomplete would be in-line with that.

Right. And with an explicit reason we can distinguish :)

Actions #37

Updated by mkittler almost 3 years ago

  • Status changed from Feedback to In Progress
Actions #38

Updated by okurz almost 3 years ago

  • Related to action #91902: Tests incomplete with reason "Failed modules: …" added
Actions #39

Updated by mkittler almost 3 years ago

  • Status changed from In Progress to Feedback

I've checked again and there's no new job matching the query. The PR to consider these jobs incomplete has been merged as well.

Actions #40

Updated by okurz almost 3 years ago

  • Due date deleted (2021-04-30)
  • Status changed from Feedback to Resolved

trying again

okurz wrote:

as you stated that you could not reproduce we might have a Heisenbug that is not visible after changing the log :)

Please anyone reopen if we encounter the issue again

Actions

Also available in: Atom PDF