action #55328

job is considered incomplete by openQA but worker still pushes updates so that "job is not considered dead"

Added by okurz 7 months ago. Updated 5 months ago.

Status:ResolvedStart date:31/07/2019
Priority:NormalDue date:
Assignee:kraih% Done:

0%

Category:Concrete Bugs
Target version:Done
Difficulty:
Duration:

Description

Observation

https://openqa.suse.de/tests/3230497 is already incomplete since 3h but osd reports in /var/log/openqa:

[2019-08-09T23:26:00.0802 CEST] [info] Got status update for job 3230497 with unexpected worker ID 617 (expected no updates anymore, job is done with result incomplete)
[2019-08-09T23:26:05.0837 CEST] [info] Got status update for job 3230497 with unexpected worker ID 617 (expected no updates anymore, job is done with result incomplete)
[2019-08-09T23:26:10.0879 CEST] [info] Got status update for job 3230497 with unexpected worker ID 617 (expected no updates anymore, job is done with result incomplete)
[2019-08-09T23:26:15.0910 CEST] [info] Got status update for job 3230497 with unexpected worker ID 617 (expected no updates anymore, job is done with result incomplete)
[…]

and on openqaworker9 I can see the job still running:

openqaworker9:/var/lib/openqa/pool/18 # tail -f autoinst-log.txt worker-log.txt 
==> autoinst-log.txt <==
[2019-08-09T14:33:58.0183 CEST] [info] [pid:14937] +++ setup notes +++
[2019-08-09T14:33:58.0183 CEST] [info] [pid:14937] Start time: 2019-08-09 12:33:58
[2019-08-09T14:33:58.0183 CEST] [info] [pid:14937] Running on openqaworker9:18 (Linux 4.12.14-lp151.28.7-default #1 SMP Mon Jun 17 16:36:38 UTC 2019 (f8a1872) x86_64)
[2019-08-09T14:33:58.0201 CEST] [debug] [pid:14937] Downloading SLES-15-SP1-x86_64-20190809-2@64bit-minimal_with_sdkGM_installed.qcow2 - request sent to Cache Service.

==> worker-log.txt <==
[2019-08-09T23:32:59.0222 CEST] [debug] [pid:14937] Updating status so job 3230497 is not considered dead.
[2019-08-09T23:32:59.0222 CEST] [debug] [pid:14937] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3230497/status
[2019-08-09T23:33:04.0262 CEST] [debug] [pid:14937] Updating status so job 3230497 is not considered dead.
[2019-08-09T23:33:04.0263 CEST] [debug] [pid:14937] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3230497/status
[2019-08-09T23:33:09.0295 CEST] [debug] [pid:14937] Updating status so job 3230497 is not considered dead.
[2019-08-09T23:33:09.0296 CEST] [debug] [pid:14937] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3230497/status
[…]

so the job never properly started but was probably trying to download a file from the cache service. Maybe the cache service failed to download the file currently and never responded back to the worker. The worker is probably still waiting for the response from the caching server and during that time updating the webui with no change in status.

Further details

I have provided worker-log.txt and autoinst-log.txt from the still running job in https://w3.nue.suse.com/~okurz/poo55238/openqaworker9_instance18_still_updating_but_osd_considers_incomplete.txt and https://w3.nue.suse.com/~okurz/poo55238/openqaworker9_instance18_still_updating_but_osd_considers_incomplete-autoinst-log.txt


Related issues

Related to openQA Project - action #56447: openqaworker-arm-2 is out-of-space on /was: openQA on osd... Resolved 31/07/2019
Copied from openQA Project - action #55238: jobs with high amount of log files, thumbnails, test resu... Blocked 31/07/2019

History

#1 Updated by okurz 7 months ago

  • Copied from action #55238: jobs with high amount of log files, thumbnails, test results are incompleted but the job continues with upload attempts added

#2 Updated by okurz 7 months ago

applying workaround, same as in #55238#note-6

load has reduced to around 4.0 so way better. But I doubt I made this with just this workaround on a single job or so.

#3 Updated by kraih 6 months ago

  • Status changed from New to In Progress
  • Assignee set to kraih

#4 Updated by coolo 6 months ago

  • Target version set to Current Sprint

#5 Updated by kraih 6 months ago

This PR should help a lot with the problem. https://github.com/os-autoinst/openQA/pull/2270

There might still be related issues though, such as failed artefact uploads now making tests fail that should have passed. Need to investigate.

#6 Updated by okurz 6 months ago

  • Priority changed from Normal to High

The current situation what it seems to have caused up to now:

EDIT2: https://openqa.suse.de/tests/3323089#downloads looks interesting. It has "autoinst-log.txt" and "worker-log.txt" but both files seem to be empty. @jeriveramoya just reported this as #56447

EDIT3: I thought https://github.com/os-autoinst/openQA/pull/2270 is already merged but not yet deployed. Maybe it actually helps with the second problem observed above

#7 Updated by okurz 6 months ago

  • Priority changed from High to Normal

#8 Updated by okurz 6 months ago

  • Related to action #56447: openqaworker-arm-2 is out-of-space on /was: openQA on osd fails with empty logs added

#9 Updated by coolo 5 months ago

  • Status changed from In Progress to Resolved
  • Target version changed from Current Sprint to Done

#10 Updated by okurz 5 months ago

For reference, we assume that https://github.com/os-autoinst/openQA/pull/2270 solved it

Also available in: Atom PDF