action #62237

closed

many incompletes with just "setup failure" and no further information

Added by okurz over 4 years ago. Updated over 4 years ago.

Status: Resolved
Priority: Urgent
Assignee: -
Category: Regressions/Crashes
Target version: -
Start date: 2020-01-17
Due date: -
% Done: 0%
Estimated time: -

Description

Observation

First mentioned in https://chat.suse.de/channel/testing?msg=oGraqZcxoBfQC8o2S : "bunch of incompletes on o.s.d: [2020-01-17T11:29:56.0370 CET] [error] [pid:47664] Unable to setup job 3801895: Cache service not reachable: Inactivity timeout".

The jobs themselves show few details, e.g. https://openqa.suse.de/tests/3795872 shows just:

[2020-01-17T10:56:50.0830 CET] [info] [pid:110583] +++ setup notes +++
[2020-01-17T10:56:50.0830 CET] [info] [pid:110583] Start time: 2020-01-17 09:56:50
[2020-01-17T10:56:50.0830 CET] [info] [pid:110583] Running on QA-Power8-5-kvm:6 (Linux 4.12.14-lp151.27-default #1 SMP Fri May 10 14:13:15 UTC 2019 (862c838) ppc64le)
[2020-01-17T11:01:50.0997 CET] [info] [pid:110583] +++ worker notes +++
[2020-01-17T11:01:50.0998 CET] [info] [pid:110583] End time: 2020-01-17 10:01:50
[2020-01-17T11:01:50.0998 CET] [info] [pid:110583] Result: setup failure
[2020-01-17T11:01:51.0002 CET] [info] [pid:21796] Uploading autoinst-log.txt
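
To make the little information in such a log easier to compare across jobs, the result and the time spent in setup can be pulled out of the worker notes. The following is only a sketch: the function name `summarize` is hypothetical, the log lines are copied from the excerpt above, and the line format is assumed to stay as shown there.

```python
import re
from datetime import datetime

# Three lines copied from the autoinst-log.txt excerpt of job 3795872.
LOG = """\
[2020-01-17T10:56:50.0830 CET] [info] [pid:110583] Start time: 2020-01-17 09:56:50
[2020-01-17T11:01:50.0998 CET] [info] [pid:110583] End time: 2020-01-17 10:01:50
[2020-01-17T11:01:50.0998 CET] [info] [pid:110583] Result: setup failure
"""

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def summarize(log_text):
    """Return (result, elapsed_seconds) taken from the notes in a job log.

    Hypothetical helper: fields that are missing come back as None.
    """
    fields = {}
    for key in ("Start time", "End time", "Result"):
        # Match "<key>: <value>" after the closing bracket of the pid field.
        match = re.search(rf"\] {key}: (.+)$", log_text, re.MULTILINE)
        fields[key] = match.group(1).strip() if match else None

    elapsed = None
    if fields["Start time"] and fields["End time"]:
        start = datetime.strptime(fields["Start time"], TIME_FORMAT)
        end = datetime.strptime(fields["End time"], TIME_FORMAT)
        elapsed = (end - start).total_seconds()
    return fields["Result"], elapsed


result, elapsed = summarize(LOG)
print(result, elapsed)
```

For the job above this reports "setup failure" after 300 seconds, i.e. the job sat in setup for exactly five minutes before it was given up, without any further reason in the log.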

@osukup could it be that the QAM openQA triggered 4k jobs on osd within the last hour? Take a look at https://stats.openqa-monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=1579184194465&to=1579263005276&panelId=11&fullscreen . I checked on https://openqa.suse.de/tests and these all seem to be distinct builds, so not retriggers of the same jobs.

We have 6k jobs scheduled on osd. That's higher than I expected, a new record actually.

Problem

We discussed this in more detail internally in
https://chat.suse.de/group/openqa-dev?msg=zwguNpEH9jucatfkQ
and identified more details.

openqa-worker-cacheservice is started with -w 4 -c 1 and the client uses keep-alive connections, so we exhaust the connections of the cache service by "blocking" them with keep-alives. Heavy load on a machine with more than 4 active workers would trigger that. As a quick fix we can remove the -c 1 option.
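
The arithmetic behind that can be sketched as follows. This is a simplified model, not the cache service code: it assumes -w is the number of server processes and -c the number of concurrent connections each of them accepts, and that a keep-alive connection occupies its slot for as long as the client keeps it open. The function name `connection_slots`, the 6 active workers, and the limit of 100 are hypothetical illustration values.

```python
def connection_slots(server_workers, clients_per_worker, keepalive_clients):
    """Model how many keep-alive clients get a connection slot.

    Returns (capacity, served, waiting). Clients counted as "waiting" get
    no slot and would eventually run into an inactivity timeout.
    """
    capacity = server_workers * clients_per_worker
    served = min(capacity, keepalive_clients)
    waiting = keepalive_clients - served
    return capacity, served, waiting


# Started with "-w 4 -c 1": only 4 slots, all held by keep-alive clients.
print(connection_slots(4, 1, keepalive_clients=6))

# With a higher per-process limit (100 is an arbitrary example value,
# not a documented default) the same 6 clients all fit.
print(connection_slots(4, 100, keepalive_clients=6))
```

In this model the first call leaves 2 of the 6 clients without a slot, which matches the observation that more than 4 active workers on one machine trigger the failure.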

https://github.com/os-autoinst/openQA/pull/2671 created.


Related issues: 1 (0 open, 1 closed)

Copied to openQA Project - coordination #62420: [epic] Distinguish all types of incompletes (Resolved, okurz, 2018-12-12)
