action #62237

many incompletes with just "setup failure" and no further information

Added by okurz over 1 year ago. Updated over 1 year ago.

Concrete Bugs
First mention in : "bunch of incompletes on o.s.d: [2020-01-17T11:29:56.0370 CET] [error] [pid:47664] Unable to setup job 3801895: Cache service not reachable: Inactivity timeout".

The jobs themselves do not show much detail, e.g. shows just:

[2020-01-17T10:56:50.0830 CET] [info] [pid:110583] +++ setup notes +++
[2020-01-17T10:56:50.0830 CET] [info] [pid:110583] Start time: 2020-01-17 09:56:50
[2020-01-17T10:56:50.0830 CET] [info] [pid:110583] Running on QA-Power8-5-kvm:6 (Linux 4.12.14-lp151.27-default #1 SMP Fri May 10 14:13:15 UTC 2019 (862c838) ppc64le)
[2020-01-17T11:01:50.0997 CET] [info] [pid:110583] +++ worker notes +++
[2020-01-17T11:01:50.0998 CET] [info] [pid:110583] End time: 2020-01-17 10:01:50
[2020-01-17T11:01:50.0998 CET] [info] [pid:110583] Result: setup failure
[2020-01-17T11:01:51.0002 CET] [info] [pid:21796] Uploading autoinst-log.txt
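As a rough aid for reading such logs, the following sketch pulls the start time, end time and result out of the notes above and computes how long the job sat in setup. The regular expression and the helper name `summarize` are assumptions made for illustration, not part of openQA; the log lines are copied from the excerpt above.

```python
import re
from datetime import datetime

# Three lines copied from the job log excerpt in this ticket.
LOG = """\
[2020-01-17T10:56:50.0830 CET] [info] [pid:110583] Start time: 2020-01-17 09:56:50
[2020-01-17T11:01:50.0998 CET] [info] [pid:110583] End time: 2020-01-17 10:01:50
[2020-01-17T11:01:50.0998 CET] [info] [pid:110583] Result: setup failure
"""

# Hypothetical pattern: matches the "key: value" part after the pid field.
FIELD = re.compile(r"\] (Start time|End time|Result): (.+)$")


def summarize(log_text):
    """Return (result, seconds between start and end) for one job log."""
    fields = {}
    for line in log_text.splitlines():
        match = FIELD.search(line)
        if match:
            fields[match.group(1)] = match.group(2).strip()
    fmt = "%Y-%m-%d %H:%M:%S"
    start = datetime.strptime(fields["Start time"], fmt)
    end = datetime.strptime(fields["End time"], fmt)
    return fields["Result"], (end - start).total_seconds()


if __name__ == "__main__":
    print(summarize(LOG))
```

For the excerpt above this reports a "setup failure" after 300 seconds, i.e. the job spent five minutes in setup before giving up, and the log itself gives no reason.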

osukup, could it be that the QAM openQA triggered 4k jobs on osd within the last hour? Take a look at . On I checked, and these all seem to be distinct builds, so not retriggers of the same one.

We have 6k jobs scheduled on osd. That is higher than I expected, a new record actually.


We discussed this in more detail internally in and have identified more.

openqa-worker-cacheservice is started with -w 4 -c 1 and the client uses keep-alive connections, so we exhaust the connections of the cache service by "blocking" them with keep-alives. Heavy load on a machine with more than 4 workers active would trigger that. As a quick fix we can remove the -c 1 option. created.
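The exhaustion described above can be illustrated with a toy model. This is only a sketch under stated assumptions: it assumes that -w 4 -c 1 results in four connection slots in total (four server processes with one connection each), and it models a keep-alive client as one that never releases its slot. The class and function names are hypothetical and the short timeout stands in for the real inactivity timeout.

```python
import threading


class CacheServiceModel:
    """Toy model of a service with workers * clients connection slots."""

    def __init__(self, workers, clients_per_worker):
        self.capacity = workers * clients_per_worker
        self._slots = threading.BoundedSemaphore(self.capacity)

    def connect(self, timeout=0.05):
        # True if a free slot was obtained before the timeout expired.
        return self._slots.acquire(timeout=timeout)

    def close(self):
        # A client that does not use keep-alive would call this after
        # each request and free its slot for the next client.
        self._slots.release()


def count_timeouts(workers, clients_per_worker, keepalive_clients):
    """Count clients that cannot connect while others hold keep-alives."""
    service = CacheServiceModel(workers, clients_per_worker)
    timeouts = 0
    for _ in range(keepalive_clients):
        # Keep-alive: the slot is taken and never released.
        if not service.connect():
            timeouts += 1
    return timeouts


if __name__ == "__main__":
    print(count_timeouts(4, 1, 6))     # more clients than slots
    print(count_timeouts(4, 1000, 6))  # a much larger per-worker limit
```

With four slots and six keep-alive clients, two clients time out in this model; with a much larger per-worker limit none do, which is the intent behind dropping the -c 1 option.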

Related issues

Copied to openQA Project - coordination #62420: [epic] Distinguish all types of incompletes, Blocked, 2018-12-12


#1 Updated by okurz over 1 year ago

Updated monitoring to apply less smoothing before alerting and submitting as

The more important issue I see right now is that is really missing the relevant information. I don't even know how to retrigger all the incompletes.

#2 Updated by kraih over 1 year ago

The PR with the fix has been merged.

#3 Updated by okurz over 1 year ago

#4 Updated by okurz over 1 year ago

  • Status changed from In Progress to Resolved
  • Target version changed from Current Sprint to Done

OK, I split out the rest into a more specific feature in #62420.

Another thing we managed is to monitor incompletes better, e.g. in #62048

so we can resolve this ticket.
