action #62237

closed

many incompletes with just "setup failure" and no further information

Added by okurz almost 5 years ago. Updated almost 5 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2020-01-17
Due date:
% Done: 0%
Estimated time:

Description

Observation

First mention in https://chat.suse.de/channel/testing?msg=oGraqZcxoBfQC8o2S : "bunch of incompletes on o.s.d: [2020-01-17T11:29:56.0370 CET] [error] [pid:47664] Unable to setup job 3801895: Cache service not reachable: Inactivity timeout".

The jobs themselves do not show much detail, e.g. https://openqa.suse.de/tests/3795872 shows just

[2020-01-17T10:56:50.0830 CET] [info] [pid:110583] +++ setup notes +++
[2020-01-17T10:56:50.0830 CET] [info] [pid:110583] Start time: 2020-01-17 09:56:50
[2020-01-17T10:56:50.0830 CET] [info] [pid:110583] Running on QA-Power8-5-kvm:6 (Linux 4.12.14-lp151.27-default #1 SMP Fri May 10 14:13:15 UTC 2019 (862c838) ppc64le)
[2020-01-17T11:01:50.0997 CET] [info] [pid:110583] +++ worker notes +++
[2020-01-17T11:01:50.0998 CET] [info] [pid:110583] End time: 2020-01-17 10:01:50
[2020-01-17T11:01:50.0998 CET] [info] [pid:110583] Result: setup failure
[2020-01-17T11:01:51.0002 CET] [info] [pid:21796] Uploading autoinst-log.txt

@osukup could it be that the QAM openQA triggered 4k jobs on osd within the last hour? Take a look at https://stats.openqa-monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=1579184194465&to=1579263005276&panelId=11&fullscreen . I checked https://openqa.suse.de/tests and these all seem to be distinct builds, so not retriggers of the same jobs.

We have 6k jobs scheduled on osd. That's higher than I expected, a new record actually.

Problem

We discussed this in more detail internally in
https://chat.suse.de/group/openqa-dev?msg=zwguNpEH9jucatfkQ
and have identified more.

openqa-worker-cacheservice is started with -w 4 -c 1 and the client uses keep-alive connections, so we exhaust the connections of the cache service by "blocking" them with keep-alives. Heavy load on a machine with more than 4 workers active would trigger that. As a quick fix we can remove the -c 1 option.
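
To illustrate the suspected mechanism, here is a minimal sketch in Python. It is a toy model, not openQA or cache service code. It assumes that -w 4 -c 1 gives four connection slots in total, and that a keep-alive connection holds its slot for some time after the request has finished. The function name, parameters and all timing values are made up for illustration.

```python
import heapq


def clients_timing_out(slots, clients, request_time, keep_alive, timeout):
    """Return ids of clients that never get a connection slot in time.

    Toy model: all clients connect at t=0 and are handled in id order.
    A connection occupies its slot for request_time + keep_alive seconds.
    A client gives up if no slot becomes free within `timeout` seconds.
    """
    # Min-heap holding the time at which each slot becomes free again.
    free_at = [0.0] * slots
    heapq.heapify(free_at)

    timed_out = []
    for client in range(clients):
        earliest = heapq.heappop(free_at)
        if earliest > timeout:
            # No slot frees up before the client's timeout expires.
            timed_out.append(client)
            heapq.heappush(free_at, earliest)  # slot stays as it was
        else:
            # Client gets the slot and holds it, including keep-alive time.
            heapq.heappush(free_at, earliest + request_time + keep_alive)
    return timed_out


if __name__ == "__main__":
    # Hypothetical numbers: 4 slots, 6 workers on one machine.
    print(clients_timing_out(4, 6, request_time=1.0, keep_alive=0.0, timeout=5.0))
    print(clients_timing_out(4, 6, request_time=1.0, keep_alive=30.0, timeout=5.0))
```

With these made-up numbers, no client times out when connections are released right after the request, while clients 4 and 5 time out once the first four connections are held open by keep-alive. That matches the observation that only machines with more than 4 active workers would hit the problem.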

https://github.com/os-autoinst/openQA/pull/2671 created.


Related issues 1 (0 open, 1 closed)

Copied to openQA Project - coordination #62420: [epic] Distinguish all types of incompletes (Resolved, okurz, 2018-12-12)

Actions #1

Updated by okurz almost 5 years ago

Updated monitoring to apply less smoothing before alerting, submitted as https://gitlab.suse.de/openqa/salt-states-openqa/merge_requests/256

The more important issue I see right now is that https://openqa.suse.de/tests/3796966/file/autoinst-log.txt is really missing the relevant information. I don't even know how to retrigger all the incompletes.

Actions #2

Updated by kraih almost 5 years ago

The PR with the fix has been merged. https://github.com/os-autoinst/openQA/pull/2671

Actions #3

Updated by okurz almost 5 years ago

Actions #4

Updated by okurz almost 5 years ago

  • Status changed from In Progress to Resolved
  • Target version changed from Current Sprint to Done

OK, split out the rest as a more specific feature request in #62420.

We also managed to monitor incompletes better, e.g. in #62048, so we can resolve this ticket.
