Project

General

Profile

Actions

action #55328

closed

job is considered incomplete by openQA but worker still pushes updates so that "job is not considered dead"

Added by okurz over 4 years ago. Updated over 4 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2019-07-31
Due date:
% Done:

0%

Estimated time:

Description

Observation

https://openqa.suse.de/tests/3230497 is already incomplete since 3h but osd reports in /var/log/openqa:

[2019-08-09T23:26:00.0802 CEST] [info] Got status update for job 3230497 with unexpected worker ID 617 (expected no updates anymore, job is done with result incomplete)
[2019-08-09T23:26:05.0837 CEST] [info] Got status update for job 3230497 with unexpected worker ID 617 (expected no updates anymore, job is done with result incomplete)
[2019-08-09T23:26:10.0879 CEST] [info] Got status update for job 3230497 with unexpected worker ID 617 (expected no updates anymore, job is done with result incomplete)
[2019-08-09T23:26:15.0910 CEST] [info] Got status update for job 3230497 with unexpected worker ID 617 (expected no updates anymore, job is done with result incomplete)
[…]

and on openqaworker9 I can see the job still running:

openqaworker9:/var/lib/openqa/pool/18 # tail -f autoinst-log.txt worker-log.txt 
==> autoinst-log.txt <==
[2019-08-09T14:33:58.0183 CEST] [info] [pid:14937] +++ setup notes +++
[2019-08-09T14:33:58.0183 CEST] [info] [pid:14937] Start time: 2019-08-09 12:33:58
[2019-08-09T14:33:58.0183 CEST] [info] [pid:14937] Running on openqaworker9:18 (Linux 4.12.14-lp151.28.7-default #1 SMP Mon Jun 17 16:36:38 UTC 2019 (f8a1872) x86_64)
[2019-08-09T14:33:58.0201 CEST] [debug] [pid:14937] Downloading SLES-15-SP1-x86_64-20190809-2@64bit-minimal_with_sdkGM_installed.qcow2 - request sent to Cache Service.

==> worker-log.txt <==
[2019-08-09T23:32:59.0222 CEST] [debug] [pid:14937] Updating status so job 3230497 is not considered dead.
[2019-08-09T23:32:59.0222 CEST] [debug] [pid:14937] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3230497/status
[2019-08-09T23:33:04.0262 CEST] [debug] [pid:14937] Updating status so job 3230497 is not considered dead.
[2019-08-09T23:33:04.0263 CEST] [debug] [pid:14937] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3230497/status
[2019-08-09T23:33:09.0295 CEST] [debug] [pid:14937] Updating status so job 3230497 is not considered dead.
[2019-08-09T23:33:09.0296 CEST] [debug] [pid:14937] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3230497/status
[…]

so the job never properly started but was probably trying to download a file from the cache service. Maybe the cache service failed to download the file currently and never responded back to the worker. The worker is probably still waiting for the response from the caching server and during that time updating the webui with no change in status.

Further details

I have provided worker-log.txt and autoinst-log.txt from the still running job in https://w3.nue.suse.com/~okurz/poo55238/openqaworker9_instance18_still_updating_but_osd_considers_incomplete.txt and https://w3.nue.suse.com/~okurz/poo55238/openqaworker9_instance18_still_updating_but_osd_considers_incomplete-autoinst-log.txt


Related issues 2 (1 open1 closed)

Related to openQA Project - action #56447: openqaworker-arm-2 is out-of-space on /was: openQA on osd fails with empty logsResolvedokurz2019-07-31

Actions
Copied from openQA Project - action #55238: jobs with high amount of log files, thumbnails, test results are incompleted but the job continues with upload attemptsNew2019-07-31

Actions
Actions #1

Updated by okurz over 4 years ago

  • Copied from action #55238: jobs with high amount of log files, thumbnails, test results are incompleted but the job continues with upload attempts added
Actions #2

Updated by okurz over 4 years ago

applying workaround, same as in #55238#note-6

load has reduced to around 4.0 so way better. But I doubt I made this with just this workaround on a single job or so.

Actions #3

Updated by kraih over 4 years ago

  • Status changed from New to In Progress
  • Assignee set to kraih
Actions #4

Updated by coolo over 4 years ago

  • Target version set to Current Sprint
Actions #5

Updated by kraih over 4 years ago

This PR should help a lot with the problem. https://github.com/os-autoinst/openQA/pull/2270

There might still be related issues though, such as failed artefact uploads now making tests fail that should have passed. Need to investigate.

Actions #6

Updated by okurz over 4 years ago

  • Priority changed from Normal to High

The current situation what it seems to have caused up to now:

EDIT2: https://openqa.suse.de/tests/3323089#downloads looks interesting. It has "autoinst-log.txt" and "worker-log.txt" but both files seem to be empty. @JERiveraMoya just reported this as #56447

EDIT3: I thought https://github.com/os-autoinst/openQA/pull/2270 is already merged but not yet deployed. Maybe it actually helps with the second problem observed above

Actions #7

Updated by okurz over 4 years ago

  • Priority changed from High to Normal
Actions #8

Updated by okurz over 4 years ago

  • Related to action #56447: openqaworker-arm-2 is out-of-space on /was: openQA on osd fails with empty logs added
Actions #9

Updated by coolo over 4 years ago

  • Status changed from In Progress to Resolved
  • Target version changed from Current Sprint to Done
Actions #10

Updated by okurz over 4 years ago

For reference, we assume that https://github.com/os-autoinst/openQA/pull/2270 solved it

Actions

Also available in: Atom PDF