many incompletes with just "setup failure" and no further information
First mention in https://chat.suse.de/channel/testing?msg=oGraqZcxoBfQC8o2S : "bunch of incompletes on o.s.d: [2020-01-17T11:29:56.0370 CET] [error] [pid:47664] Unable to setup job 3801895: Cache service not reachable: Inactivity timeout".
The jobs themselves do not show much detail, e.g. https://openqa.suse.de/tests/3795872 shows just
[2020-01-17T10:56:50.0830 CET] [info] [pid:110583] +++ setup notes +++
[2020-01-17T10:56:50.0830 CET] [info] [pid:110583] Start time: 2020-01-17 09:56:50
[2020-01-17T10:56:50.0830 CET] [info] [pid:110583] Running on QA-Power8-5-kvm:6 (Linux 4.12.14-lp151.27-default #1 SMP Fri May 10 14:13:15 UTC 2019 (862c838) ppc64le)
[2020-01-17T11:01:50.0997 CET] [info] [pid:110583] +++ worker notes +++
[2020-01-17T11:01:50.0998 CET] [info] [pid:110583] End time: 2020-01-17 10:01:50
[2020-01-17T11:01:50.0998 CET] [info] [pid:110583] Result: setup failure
[2020-01-17T11:01:51.0002 CET] [info] [pid:21796] Uploading autoinst-log.txt
@osukup could it be that the QAM openQA triggered 4k jobs on osd within the last hour? Take a look at https://stats.openqa-monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=1579184194465&to=1579263005276&panelId=11&fullscreen . I checked on https://openqa.suse.de/tests and it seems these are all distinct builds, so they are not retriggers of the same jobs.
We have 6k jobs scheduled on osd. That's higher than I expected it should be, a new record actually.
We discussed this a bit more in detail internally in
and have identified more.
openqa-worker-cacheservice is started with -w 4 -c 1 and the client uses keep-alive connections, so we exhaust the connections of the cache service by "blocking" them with keep-alives. Heavy load on a machine with more than 4 workers active would trigger that. As a quick fix we can remove the -c 1 option.
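To illustrate the exhaustion described above, here is a minimal sketch. It assumes that -w is the number of server worker processes and -c the maximum number of concurrent connections per process, as in Mojolicious' prefork command, and that each openQA worker holds one keep-alive connection. The function names and the value 1000 for "no -c 1" are illustrative assumptions, not taken from the cache service code.

```python
def connection_slots(server_processes: int, clients_per_process: int) -> int:
    """Total connections the service can hold open at the same time (assumed model)."""
    return server_processes * clients_per_process


def split_clients(server_processes: int, clients_per_process: int,
                  keepalive_clients: int) -> tuple[int, int]:
    """Return (served, waiting) when every client keeps its connection open."""
    slots = connection_slots(server_processes, clients_per_process)
    served = min(keepalive_clients, slots)
    # Clients beyond the available slots wait until they hit their inactivity timeout.
    return served, keepalive_clients - served


if __name__ == "__main__":
    # "-w 4 -c 1": four slots in total, so a fifth active openQA worker has to wait.
    for active in (4, 5, 8):
        served, waiting = split_clients(4, 1, active)
        print(f"-w 4 -c 1, {active} active workers: {served} served, {waiting} waiting")

    # Without "-c 1" the per-process limit is much higher (1000 is an assumed value).
    served, waiting = split_clients(4, 1000, 8)
    print(f"-w 4 (no -c 1), 8 active workers: {served} served, {waiting} waiting")
```

In this model any openQA worker beyond the fourth never gets a connection while the others keep theirs alive, which would match the "Cache service not reachable: Inactivity timeout" error quoted above.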
#1 Updated by okurz about 1 month ago
Updated monitoring to apply less smoothing before alerting and submitted it as https://gitlab.suse.de/openqa/salt-states-openqa/merge_requests/256
The more important issue I see right now is that https://openqa.suse.de/tests/3796966/file/autoinst-log.txt is really missing the relevant information. I don't even know how to retrigger all the incompletes.
#4 Updated by okurz about 1 month ago
- Status changed from In Progress to Resolved
- Target version changed from Current Sprint to Done