action #96557 (closed)
jobs run into MAX_SETUP_TIME, one hour between 'Downloading' and 'Download processed' and no useful output in between auto_review:"timeout: setup exceeded MAX_SETUP_TIME":retry
Description
Observation
https://openqa.suse.de/tests/6645417 is "timeout_exceeded". Log details:
[2021-08-04T10:33:49.0665 CEST] [info] [pid:16838] Downloading SLE-15-SP2-Installer-DVD-x86_64-GM-DVD1.iso, request #5221 sent to Cache Service
[2021-08-04T11:26:20.0708 CEST] [info] [pid:16838] Download of SLE-15-SP2-Installer-DVD-x86_64-GM-DVD1.iso processed:
Asset "SLE-15-SP2-Installer-DVD-x86_64-GM-DVD1.iso" was downloaded by #5196, details are therefore unavailable here
[2021-08-04T11:26:20.0712 CEST] [info] [pid:16838] Rsync from 'rsync://openqa.suse.de/tests' to '/var/lib/openqa/cache/openqa.suse.de', request #5327 sent to Cache Service
[2021-08-04T11:28:20.0151 CEST] [info] [pid:16838] +++ worker notes +++
[2021-08-04T11:28:20.0151 CEST] [info] [pid:16838] End time: 2021-08-04 09:28:20
[2021-08-04T11:28:20.0151 CEST] [info] [pid:16838] Result: timeout
[2021-08-04T11:28:20.0160 CEST] [info] [pid:23579] Uploading autoinst-log.txt
This was on openqaworker6:15. Related systemd journal logs:
- openqa-worker-cacheservice:
Aug 04 08:31:58 openqaworker6 openqa-workercache-daemon[26298]: [26379] [i] Worker 26379 started
Aug 04 08:31:58 openqaworker6 openqa-workercache-daemon[26298]: [26298] [i] Creating process id file "/tmp/prefork.pid"
Aug 04 08:31:58 openqaworker6 openqa-workercache-daemon[26298]: [26298] [i] Worker 26379 stopped
Aug 04 08:33:56 openqaworker6 openqa-workercache-daemon[26298]: [26298] [i] Stopping worker 26377 gracefully (80 seconds)
Aug 04 08:33:57 openqaworker6 openqa-workercache-daemon[26298]: [26377] [i] Worker 26377 started
Aug 04 08:49:40 openqaworker6 openqa-workercache-daemon[26298]: [26298] [i] Worker 26377 stopped
Aug 04 08:49:40 openqaworker6 openqa-workercache-daemon[26298]: [26298] [i] Stopping worker 26376 gracefully (80 seconds)
Aug 04 08:49:42 openqaworker6 openqa-workercache-daemon[26298]: [26376] [i] Worker 26376 started
Aug 04 08:51:05 openqaworker6 openqa-workercache-daemon[26298]: [26298] [i] Worker 26376 stopped
Aug 04 08:51:05 openqaworker6 openqa-workercache-daemon[26298]: [26298] [i] Stopping worker 26378 gracefully (80 seconds)
Aug 04 08:51:06 openqaworker6 openqa-workercache-daemon[26298]: [26378] [i] Worker 26378 started
Aug 04 09:55:17 openqaworker6 openqa-workercache-daemon[26298]: [26298] [i] Worker 26378 stopped
Aug 04 09:55:17 openqaworker6 openqa-workercache-daemon[26298]: [26298] [i] Stopping worker 1568 gracefully (80 seconds)
Aug 04 09:55:21 openqaworker6 openqa-workercache-daemon[26298]: [1568] [i] Worker 1568 started
Aug 04 09:59:14 openqaworker6 openqa-workercache-daemon[26298]: [26298] [i] Worker 1568 stopped
Aug 04 09:59:14 openqaworker6 openqa-workercache-daemon[26298]: [26298] [i] Stopping worker 1389 gracefully (80 seconds)
Aug 04 09:59:15 openqaworker6 openqa-workercache-daemon[26298]: [1389] [i] Worker 1389 started
Aug 04 10:14:16 openqaworker6 openqa-workercache-daemon[26298]: [26298] [i] Worker 1389 stopped
Aug 04 10:14:16 openqaworker6 openqa-workercache-daemon[26298]: [26298] [i] Stopping worker 3852 gracefully (80 seconds)
Aug 04 10:14:19 openqaworker6 openqa-workercache-daemon[26298]: [3852] [i] Worker 3852 started
Aug 04 10:19:26 openqaworker6 openqa-workercache-daemon[26298]: [4587] [i] Worker 4587 started
Aug 04 10:19:26 openqaworker6 openqa-workercache-daemon[26298]: [26298] [i] Worker 3852 stopped
Aug 04 10:19:26 openqaworker6 openqa-workercache-daemon[26298]: [26298] [i] Worker 4587 stopped
- openqa-worker-cacheservice-minion:
Aug 04 10:29:04 openqaworker6 openqa-worker-cacheservice-minion[26299]: [16882] [i] [#5188] Downloading: "SLE-15-SP1-Installer-DVD-x86_64-GM-DVD1.iso"
Aug 04 10:29:14 openqaworker6 openqa-worker-cacheservice-minion[26299]: [16884] [i] [#5190] Sync: "rsync://openqa.suse.de/tests" to "/var/lib/openqa/cache/openqa.suse.de"
Aug 04 10:29:14 openqaworker6 openqa-worker-cacheservice-minion[26299]: [16884] [i] [#5190] Calling: rsync -avHP --timeout 1800 rsync://openqa.suse.de/tests/ --delete /var/>
Aug 04 10:29:39 openqaworker6 openqa-worker-cacheservice-minion[26299]: [16894] [i] [#5192] Downloading: "publiccloud_15sp3_Azure_gen2_BYOS_Updates.qcow2"
Aug 04 10:29:44 openqaworker6 openqa-worker-cacheservice-minion[26299]: [16895] [i] [#5193] Sync: "rsync://openqa.suse.de/tests" to "/var/lib/openqa/cache/openqa.suse.de"
Aug 04 10:29:44 openqaworker6 openqa-worker-cacheservice-minion[26299]: [16895] [i] [#5193] Calling: rsync -avHP --timeout 1800 rsync://openqa.suse.de/tests/ --delete /var/>
Aug 04 10:29:54 openqaworker6 openqa-worker-cacheservice-minion[26299]: [16904] [i] [#5195] Downloading: "SLED-15-SP2-x86_64-Installtest.qcow2"
Aug 04 10:32:54 openqaworker6 openqa-worker-cacheservice-minion[26299]: [16788] [i] [#5170] Cache size of "/var/lib/openqa/cache" is 47GiB, with limit 50GiB
Aug 04 10:32:54 openqaworker6 openqa-worker-cacheservice-minion[26299]: [16788] [i] [#5170] Downloading "SLES-15-SP2-x86_64-mru-install-minimal-with-addons-Build20210804-1->
Aug 04 10:32:59 openqaworker6 openqa-worker-cacheservice-minion[26299]: [17198] [i] [#5196] Downloading: "SLE-15-SP2-Installer-DVD-x86_64-GM-DVD1.iso"
Aug 04 10:33:09 openqaworker6 openqa-worker-cacheservice-minion[26299]: [16904] [i] [#5195] Cache size of "/var/lib/openqa/cache" is 49GiB, with limit 50GiB
Aug 04 10:33:09 openqaworker6 openqa-worker-cacheservice-minion[26299]: [16904] [i] [#5195] Downloading "SLED-15-SP2-x86_64-Installtest.qcow2" from "http://openqa.suse.de/t>
Aug 04 10:33:14 openqaworker6 openqa-worker-cacheservice-minion[26299]: [17212] [i] [#5197] Downloading: "SLE-15-SP1-Installer-DVD-x86_64-GM-DVD1.iso"
Aug 04 10:33:19 openqaworker6 openqa-worker-cacheservice-minion[26299]: [17215] [i] [#5198] Sync: "rsync://openqa.suse.de/tests" to "/var/lib/openqa/cache/openqa.suse.de"
…
Aug 04 10:34:49 openqaworker6 openqa-worker-cacheservice-minion[26299]: [17802] [i] [#5213] Downloading: "sle-15-SP1-x86_64-Build:19472:aws-cli-with-hpc.qcow2
…
Aug 04 11:26:18 openqaworker6 openqa-worker-cacheservice-minion[26299]: [17198] [i] [#5196] Size of "/var/lib/openqa/cache/openqa.suse.de/SLE-15-SP2-Installer-DVD-x86_64-GM>
Aug 04 11:26:18 openqaworker6 openqa-worker-cacheservice-minion[26299]: [17198] [i] [#5196] Download error 598, waiting 5 seconds for next try (4 remaining)
Aug 04 11:26:18 openqaworker6 openqa-worker-cacheservice-minion[26299]: [17198] [i] [#5196] Downloading "SLE-15-SP2-Installer-DVD-x86_64-GM-DVD1.iso" from "http://openqa.su>
Aug 04 11:26:20 openqaworker6 openqa-worker-cacheservice-minion[26299]: [23427] [i] [#5306] Sync: "rsync://openqa.suse.de/tests" to "/var/lib/openqa/cache/openqa.suse.de"
Aug 04 11:26:20 openqaworker6 openqa-worker-cacheservice-minion[26299]: [23427] [i] [#5306] Calling: rsync -avHP --timeout 1800 rsync://openqa.suse.de/tests/ --delete /var/>
…
To me it looks like the same asset is downloaded multiple times, which can happen if the cache size limit is exceeded. Somehow the openQA job sees a response 53 minutes later, and there is "waiting 5 seconds for next try (4 remaining)" in the cache service log, but then nothing related anymore.
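The cache service evicts least-recently-used assets once /var/lib/openqa/cache grows beyond its configured limit (the 50 GiB visible in the minion log above), which is what makes repeated downloads of the same asset possible under load. A minimal sketch for checking and raising that limit on a worker host, assuming the CACHELIMIT key (value in GiB) in the [global] section of /etc/openqa/workers.ini and our usual cache service unit names:

# show the current cache configuration on the worker host
grep -E 'CACHEDIRECTORY|CACHELIMIT' /etc/openqa/workers.ini
# raise the limit, e.g. add/edit "CACHELIMIT = 100" under [global],
# then restart the cache service so it takes effect
sudo systemctl restart openqa-worker-cacheservice openqa-worker-cacheservice-minion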
Steps to reproduce
Find jobs referencing this ticket with the help of
https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label,
i.e. call: openqa-query-for-job-label poo#96557
Updated by okurz over 3 years ago
- Subject changed from jobs run into MAX_SETUP_TIME, one hour between 'Downloading' and 'Download processed' and no useful output in between auto_review:"(?s)timeout: setup exceeded MAX_SETUP_TIME.*details are therefore unavailable here":retry to jobs run into MAX_SETUP_TIME, one hour between 'Downloading' and 'Download processed' and no useful output in between auto_review:"(?s)details are therefore unavailable here.*timeout: setup exceeded MAX_SETUP_TIME":retry
Updated by okurz over 3 years ago
- Description updated (diff)
hurhaj has found four examples, all on openqaworker5; the original issue is from openqaworker6:
https://openqa.suse.de/tests/6645577, https://openqa.suse.de/tests/6643847, https://openqa.suse.de/tests/6643846, https://openqa.suse.de/tests/6645230
Updated by okurz over 3 years ago
- Priority changed from Urgent to Immediate
I am bumping https://progress.opensuse.org/issues/96557 up to immediate. https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?viewPanel=14&orgId=1&from=1628004156924&to=1628079287359 shows a clear problem over the last hours.
I have triggered a reboot of openqaworker5+6 where most problems have been observed. Not sure if this makes a difference.
I have also called
for i in 5 6; do host=openqa.suse.de result=timeout_exceeded WORKER=openqaworker$i bash -ex ./openqa-advanced-retrigger-jobs; done
to retrigger all "timeout_exceeded" jobs from both openqaworker5+6.
Updated by mkittler over 3 years ago
I have triggered a reboot of openqaworker5+6 where most problems have been observed. Not sure if this makes a difference.
The workers are now quite busy downloading assets. The download rates aren't too bad and they don't look stuck but the Minion jobs are still piling up¹. It would have been good to check whether the workers are stuck (e.g. waiting for a Minion lock) before restarting them.
¹ Maybe that's normal since the asset cache is just being built from scratch right now? The workers openqaworker5 and openqaworker6 have 41 and 26 slots respectively, so when they're back online it isn't very surprising to see them instantly loaded with many asset downloads. When introducing MAX_SETUP_TIME, we might have set a limit that is too low (not considering that parallel jobs running on the same worker share resources here).
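For reference, a minimal sketch of what bumping the limit on a single worker host could look like, assuming MAX_SETUP_TIME (in seconds) is honored when set in the [global] section of /etc/openqa/workers.ini (presumably what the salt SR mentioned below does fleet-wide):

# /etc/openqa/workers.ini on the worker host:
#   [global]
#   MAX_SETUP_TIME = 7200   # two hours instead of the one-hour default
# restart the slots so they pick up the new setting, e.g. for 41 slots:
sudo systemctl restart openqa-worker-auto-restart@{1..41}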
Updated by mkittler over 3 years ago
- Status changed from New to In Progress
Judging by the output of
select id, result, reason, (select host from workers where id = assigned_worker_id) as worker from jobs where reason like '%setup exceeded MAX_SETUP_TIME%' and t_finished >= '2021-08-04T00:00:00' order by id;
openqaworker3 also produced a few of those jobs.
The query also shows that the problem started to re-emerge on openqaworker5 and 6 after rebooting them, see the gap between the timestamps (which are UTC):
6647186 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker6 | 2021-08-04 11:43:23
6647203 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker6 | 2021-08-04 11:48:14
6647206 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker6 | 2021-08-04 11:48:15
6647227 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker6 | 2021-08-04 11:48:14
6647235 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker6 | 2021-08-04 11:48:28
6647244 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker6 | 2021-08-04 11:49:05
6647496 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 13:06:18
6647776 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 13:06:19
6647782 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 13:06:17
6647861 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker6 | 2021-08-04 13:06:16
6647862 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 13:06:17
6647863 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker6 | 2021-08-04 13:06:26
6647866 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 13:06:19
6647868 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker6 | 2021-08-04 13:06:23
So the reboot helped to reproduce the problem. Apparently it is happening consistently on those workers when the asset cache needs to be built up (after restarting or when scheduling many jobs with new assets).
Maybe I'll just bump MAX_SETUP_TIME on those workers for now.
Updated by mkittler over 3 years ago
SR to increase MAX_SETUP_TIME for worker hosts with many slots: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/536
The problem occurred on openqaworker3 which has 19 slots, so I just applied it to every worker with 15 or more slots to be safe.
Looks like the last job is
6647906 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker6 | 2021-08-04 13:06:27
which timed out one hour after the reboot. Let's see how many jobs exceed the timeout on the next full hour after the reboot.
Updated by mkittler over 3 years ago
A few more statistics from the Minion dashboard of openqaworker5 after the reboot:
- Syncing tests (rsync -avHP --timeout 1800 rsync://openqa.suse.de/tests/ --delete /var/lib/openqa/cache/openqa.suse.de/tests/) took 43 minutes.
- Some downloads take quite long, e.g. downloading SLE-15-SP2-Installer-DVD-x86_64-GM-DVD1.iso took 29 minutes and sle-rhel82.qcow2 31 minutes.
- The majority of downloads don't take that long, but since they pile up they still take very long to be processed, e.g. a download created one hour ago has only now been running for 10 minutes. The corresponding job has just run into the timeout. Note that this does not abort the asset download (for better or worse).
Updated by mkittler over 3 years ago
Let's see how many jobs exceed the timeout on the next full hour after the reboot.
6647906 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker6 | 2021-08-04 13:06:27
6647991 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker6 | 2021-08-04 14:07:19
6648500 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker6 | 2021-08-04 14:07:19
6647754 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 14:07:20
6647762 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker6 | 2021-08-04 14:07:21
6647996 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker6 | 2021-08-04 14:07:21
6648365 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 14:07:23
6647693 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 14:07:24
6648384 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker6 | 2021-08-04 14:07:24
6647993 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker6 | 2021-08-04 14:07:25
6648011 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 14:07:25
6643681 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 14:07:25
6648495 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker6 | 2021-08-04 14:07:26
6648380 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 14:07:26
6648435 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 14:07:27
6647739 | parallel_restarted | timeout: setup exceeded MAX_SETUP_TIME | openqaworker6 | 2021-08-04 14:07:28
6648014 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 14:07:28
6647692 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker6 | 2021-08-04 14:07:28
6648371 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 14:07:29
6648378 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 14:07:29
6647670 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 14:07:29
6648139 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker6 | 2021-08-04 14:07:29
6648018 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 14:07:29
6648368 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker6 | 2021-08-04 14:07:29
6648013 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker6 | 2021-08-04 14:07:30
6643682 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 14:07:30
6647981 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 14:07:30
6648016 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 14:07:30
6648019 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker6 | 2021-08-04 14:07:30
6647709 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 14:07:30
6648350 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker6 | 2021-08-04 14:07:30
6648121 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 14:07:30
6643682 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 14:07:30
6648350 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker6 | 2021-08-04 14:07:30
6647709 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 14:07:30
6648499 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 15:07:45
6648524 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 15:07:46
6648554 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 15:07:46
6648608 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 15:07:46
6648520 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 15:07:46
6648496 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 15:07:47
6649930 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 15:07:48
6648525 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 15:07:48
6643637 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 15:07:49
6648563 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 15:07:49
6643763 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 15:07:50
6648609 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 15:07:50
6643636 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 15:07:50
6648634 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 15:07:50
6648603 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 15:07:50
6648531 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 15:07:51
6648518 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 15:07:51
6648548 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 15:07:51
6643758 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 15:07:51
6643762 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 15:07:51
6643638 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 15:07:52
6649952 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 15:07:52
6648568 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 15:07:52
6648532 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 15:07:53
6648516 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker6 | 2021-08-04 15:07:53
6648612 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 15:07:53
6649931 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 15:07:55
6643760 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 15:07:56
6643639 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 15:07:57
6649945 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker6 | 2021-08-04 15:07:57
6648503 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 15:07:57
6648510 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 15:07:57
6648512 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker6 | 2021-08-04 15:07:58
6648551 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 15:08:00
Updated by okurz over 3 years ago
- Subject changed from jobs run into MAX_SETUP_TIME, one hour between 'Downloading' and 'Download processed' and no useful output in between auto_review:"(?s)details are therefore unavailable here.*timeout: setup exceeded MAX_SETUP_TIME":retry to jobs run into MAX_SETUP_TIME, one hour between 'Downloading' and 'Download processed' and no useful output in between auto_review:"timeout: setup exceeded MAX_SETUP_TIME":retry
Generalizing the regex to catch more related jobs
Updated by okurz over 3 years ago
Somehow it still looks like openqaworker5+6 in particular are very error-prone.
SQL on OSD:
openqa=> select count(jobs.id),host from jobs left join workers on workers.id = jobs.assigned_worker_id where jobs.result = 'timeout_exceeded' and jobs.t_finished >= '2021-08-04' group by host order by count desc;
count | host
-------+----------------
221 | openqaworker5
75 | openqaworker6
9 | openqaworker3
4 | openqaworker8
1 | openqaworker10
1 | openqaworker2
(6 rows)
I expected more incomplete jobs, but it's still mainly w5+6.
Updated by mkittler over 3 years ago
SR to remove load from these workers: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/338/diffs
Updated by mkittler over 3 years ago
On other workers the average time of asset download jobs is significantly lower. It could be luck (and openqaworker5 really has extraordinarily many slots), but it looks more like openqaworker5 and 6 are downloading more slowly. (There's no easy way to query the average download speed for asset downloads.)
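One rough way to get at least average job durations (if not download speeds) would be to query the cache service's Minion backend directly on a worker; a sketch, assuming the queue lives in the SQLite file /var/lib/openqa/cache/cache.sqlite and that the minion_jobs timestamp columns parse with strftime (both assumptions, so take the numbers with a grain of salt):

sudo sqlite3 /var/lib/openqa/cache/cache.sqlite \
  "select task, count(*) as jobs,
          avg(strftime('%s', finished) - strftime('%s', started)) as avg_seconds
   from minion_jobs where state = 'finished' group by task;"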
Note that when only looking at jobs where the setup timeout has been exceeded (and not at all timeouts), it looks like only openqaworker3, 5 and 6 are affected:
openqa=> select count(jobs.id) as job_count, (select count(id) from workers as w where w.host = workers.host and t_updated > (timezone('UTC', now()) - interval '1 hour')) as worker_slots, host from jobs left join workers on workers.id = jobs.assigned_worker_id where reason like '%setup exceeded MAX_SETUP_TIME%' and jobs.t_finished >= '2021-08-04' group by host order by job_count desc;
job_count | worker_slots | host
-----------+--------------+---------------
224 | 41 | openqaworker5
79 | 29 | openqaworker6
10 | 19 | openqaworker3
(3 rows)
Considering all jobs of this year doesn't really change the query results:
openqa=> select count(jobs.id) as job_count, (select count(id) from workers as w where w.host = workers.host and t_updated > (timezone('UTC', now()) - interval '1 hour')) as worker_slots, host from jobs left join workers on workers.id = jobs.assigned_worker_id where reason like '%setup exceeded MAX_SETUP_TIME%' and jobs.t_finished >= '2021-01-01' group by host order by job_count desc;
job_count | worker_slots | host
-----------+--------------+---------------
224 | 41 | openqaworker5
80 | 29 | openqaworker6
49 | 49 | grenache-1
18 | 4 | malbec
10 | 19 | openqaworker3
So this is a relatively new problem. Although it deteriorated significantly today, I had already encountered one failure 7 days ago on openqaworker6, see https://progress.opensuse.org/issues/95299#note-15. (The jobs on malbec are likely caused by rsync being stuck, which is a different problem and has already been fixed, see https://progress.opensuse.org/issues/95299#note-4. Not sure about grenache-1. It also has many worker slots, so possibly the same issue after all?)
Updated by openqa_review over 3 years ago
- Due date set to 2021-08-19
Setting due date based on mean cycle time of SUSE QE Tools
Updated by livdywan over 3 years ago
mkittler wrote:
SR to remove load from these workers: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/338/diffs
openqa-reload-worker-auto-restart@1 failed on openqaworker5 this morning; I have restarted it manually now.
Updated by mkittler over 3 years ago
After https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/338/diffs was merged (and should have been effective) there were still a few jobs:
openqa=> select count(jobs.id) as job_count, (select count(id) from workers as w where w.host = workers.host and t_updated > (timezone('UTC', now()) - interval '1 hour')) as worker_slots, host from jobs left join workers on workers.id = jobs.assigned_worker_id where reason like '%setup exceeded MAX_SETUP_TIME%' and jobs.t_finished >= '2021-08-04T19:15:00' group by host order by job_count desc;
job_count | worker_slots | host
-----------+--------------+---------------
11 | 41 | openqaworker5
2 | 29 | openqaworker6
Likely it took a while for the piled-up asset download jobs to be processed. (There were still jobs scheduled because the SR kept a few special worker classes; I thought those few jobs shouldn't be too much load. I assume that's also correct now that the piled-up asset download jobs have been processed. Also, some user temporarily enabled worker classes again to test a few jobs.)
Not sure what to do with these workers. They're likely good enough to run only a limited number of jobs but I don't know how I can bring them fully back.
I'm also wondering whether #96554 might be related after all. It is about /dev/vdc which is used to store (and serve) assets. If this disk is slow it would of course also slow down the asset downloads of workers. That wouldn't explain why especially openqaworker5 and 6 are affected (and 3 a little bit).
Of course workers with many slots would be affected more easily, which could be an explanation. However, it doesn't really make sense because we should then have seen the issue on openqaworker2 as well (not considering grenache-1 here as it "only" runs ppc jobs):
openqa=> select host, count(id) as online_slots from workers as w where t_updated > (timezone('UTC', now()) - interval '1 hour') group by host order by online_slots desc;
host | online_slots
---------------------+--------------
grenache-1 | 49
openqaworker5 | 41
openqaworker2 | 34
openqaworker6 | 29
openqaworker-arm-2 | 20
openqaworker-arm-3 | 20
openqaworker3 | 19
openqaworker13 | 16
openqaworker9 | 16
openqaworker8 | 16
openqaworker11 | 10
openqaworker10 | 10
openqaworker-arm-1 | 10
powerqaworker-qam-1 | 8
QA-Power8-5-kvm | 8
QA-Power8-4-kvm | 8
malbec | 4
automotive-3 | 1
(18 rows)
Updated by mkittler over 3 years ago
@cdywan It looks like someone messed with the worker services because the regular openqa-worker@1 service (as opposed to openqa-worker-auto-restart@1) was running:
martchus@openqaworker5:~> sudo systemctl status openqa-worker@1
● openqa-worker@1.service - openQA Worker #1
Loaded: loaded (/usr/lib/systemd/system/openqa-worker@.service; enabled-runtime; vendor preset: disabled)
Drop-In: /etc/systemd/system/openqa-worker@.service.d
└─20-nvme-autoformat.conf
Active: inactive (dead) since Thu 2021-08-05 07:10:46 CEST; 4h 59min ago
Process: 38598 ExecStart=/usr/share/openqa/script/worker --instance 1 (code=exited, status=0/SUCCESS)
Process: 38597 ExecStartPre=/usr/bin/install -d -m 0755 -o _openqa-worker /var/lib/openqa/pool/1 (code=exited, status=0/SUCCESS)
Main PID: 38598 (code=exited, status=0/SUCCESS)
Aug 05 07:10:46 openqaworker5 worker[38598]: [debug] [pid:38598] REST-API call: POST http://openqa.suse.de/api/v1/jobs/6652982/duplicate?dup_type_auto=1
Aug 05 07:10:46 openqaworker5 worker[38598]: [debug] [pid:38598] Unable to read result-coredump_collect.json: Can't open file "/var/lib/openqa/pool/1/testresults/result-coredump_collect.json": No such file or directory at /usr/share/openqa/script/../lib/OpenQA/Worker/Job.pm line 1098.
Aug 05 07:10:46 openqaworker5 worker[38598]: [debug] [pid:38598] REST-API call: POST http://openqa.suse.de/api/v1/jobs/6652982/status
Aug 05 07:10:46 openqaworker5 worker[38598]: [error] [pid:38598] REST-API error (POST http://openqa.suse.de/api/v1/jobs/6652982/status): 400 response: Got status update for job 6652982 with unexpected worker ID 957 (expected no updates anymore, job is done with result incomplete) (remaining tries: 0)
Aug 05 07:10:46 openqaworker5 worker[38598]: [error] [pid:38598] Unable to make final image uploads. Maybe the web UI considers this job already dead.
Aug 05 07:10:46 openqaworker5 worker[38598]: [debug] [pid:38598] Upload concluded (at bind)
Aug 05 07:10:46 openqaworker5 worker[38598]: [debug] [pid:38598] REST-API call: POST http://openqa.suse.de/api/v1/jobs/6652982/set_done?reason=quit%3A+worker+has+been+stopped+or+restarted&result=incomplete&worker_id=957
Aug 05 07:10:46 openqaworker5 worker[38598]: [debug] [pid:38598] Job 6652982 from openqa.suse.de finished - reason: quit
Aug 05 07:10:46 openqaworker5 worker[38598]: [debug] [pid:38598] Informing openqa.suse.de that we are going offline
Aug 05 07:10:46 openqaworker5 systemd[1]: Stopped openQA Worker #1.
This was likely the same user who temporarily changed the worker classes to do some tests. So I wouldn't worry about it.
Updated by mkittler over 3 years ago
That openqaworker2 wasn't affected could be explained by the fact that fewer jobs ran on that machine. On the other hand, the number of different assets on openqaworker2 wasn't lower. The following table's last column shows the number of assets followed by the number of jobs per worker host:
openqa=> select host, count(id) as online_slots, (select array[count(distinct asset_id), count(distinct id)] from jobs join jobs_assets on jobs.id = jobs_assets.job_id where assigned_worker_id = any(array_agg(w.id)) and t_finished >= '2021-08-04T00:00:00' and t_finished <= '2021-08-04T18:00:00') as recent_assets_and_jobs from workers as w where t_updated > (timezone('UTC', now()) - interval '1 hour') group by host order by recent_assets_and_jobs desc;
host | online_slots | recent_assets_and_jobs
---------------------+--------------+------------------------
openqaworker13 | 16 | {529,587}
openqaworker9 | 16 | {412,496}
openqaworker8 | 16 | {389,519}
openqaworker3 | 19 | {371,587}
openqaworker10 | 10 | {339,324}
grenache-1 | 49 | {314,329}
openqaworker2 | 34 | {268,176}
openqaworker6 | 29 | {229,706}
openqaworker5 | 41 | {203,675}
powerqaworker-qam-1 | 8 | {171,171}
openqaworker-arm-3 | 20 | {130,339}
QA-Power8-4-kvm | 8 | {130,140}
openqaworker-arm-2 | 20 | {129,320}
QA-Power8-5-kvm | 8 | {125,138}
openqaworker-arm-1 | 10 | {101,157}
malbec | 4 | {89,84}
openqaworker11 | 10 | {0,0}
automotive-3 | 1 | {0,0}
(18 rows)
Updated by mkittler over 3 years ago
When taking the asset size (and not just count) into account the picture looks quite different:
openqa=> select host, count(id) as online_slots, (select array[((select sum(size) from assets where id = any(array_agg(distinct jobs_assets.asset_id))) / 1024 / 1024 / 1024), count(distinct id)] from jobs join jobs_assets on jobs.id = jobs_assets.job_id where assigned_worker_id = any(array_agg(w.id)) and t_finished >= '2021-08-04T00:00:00' and t_finished <= '2021-08-04T18:00:00') as recent_asset_size_in_gb_and_job_count from workers as w where t_updated > (timezone('UTC', now()) - interval '1 hour') group by host order by recent_asset_size_in_gb_and_job_count desc;
host | online_slots | recent_asset_size_in_gb_and_job_count
---------------------+--------------+---------------------------------------
openqaworker11 | 10 | {NULL,0}
automotive-3 | 1 | {NULL,0}
openqaworker5 | 41 | {529.2056695101782705,675}
openqaworker3 | 19 | {525.3903578147292139,587}
openqaworker6 | 29 | {514.1747308215126416,706}
openqaworker13 | 16 | {509.4812797503545859,587}
openqaworker8 | 16 | {504.1433167262002852,519}
openqaworker9 | 16 | {454.4808612000197217,496}
openqaworker10 | 10 | {381.9807668616995244,324}
openqaworker2 | 34 | {301.2545210225507637,176}
grenache-1 | 49 | {259.5087282918393613,329}
openqaworker-arm-2 | 20 | {164.0323526887223145,320}
openqaworker-arm-3 | 20 | {152.4092571316286943,339}
openqaworker-arm-1 | 10 | {128.0922723710536953,157}
powerqaworker-qam-1 | 8 | {109.7296858420595547,171}
QA-Power8-4-kvm | 8 | {98.2997177951037881,140}
malbec | 4 | {96.5083871735259912,84}
QA-Power8-5-kvm | 8 | {91.5260982112959023,138}
(18 rows)
That would actually explain why we see the problems in particular on openqaworker5, 6 and 3: they have the highest number of worker slots and jobs with the biggest assets assigned.
The query got quite complex, though. So it would be great if someone took a second look at whether the data is actually correct (before jumping to conclusions based on it).
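As a cross-check with less nesting, the same window can be aggregated with plain joins; a sketch to run in the same psql session as above (note it intentionally sums asset sizes with duplicates across jobs, unlike the distinct-asset query, which makes it a useful sanity comparison rather than a replacement):

psql openqa <<'SQL'
select w.host,
       count(distinct j.id) as jobs,
       count(distinct a.id) as assets,
       sum(a.size) / 1024 / 1024 / 1024 as asset_gb_incl_duplicates
from jobs j
join workers w on w.id = j.assigned_worker_id
join jobs_assets ja on ja.job_id = j.id
join assets a on a.id = ja.asset_id
where j.t_finished >= '2021-08-04T00:00:00' and j.t_finished <= '2021-08-04T18:00:00'
group by w.host order by asset_gb_incl_duplicates desc;
SQL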
Updated by ilausuch over 3 years ago
One idea that came up in a conversation:
- We could use a distributed object store for the assets, such as https://wiki.openstack.org/wiki/Swift or https://www.suse.com/products/suse-enterprise-storage/
Updated by mkittler over 3 years ago
Comment by @MDoucha on the chat:
The OSD monitor dashboard shows some kind of network problem for openqaworker5 and openqaworker6. Both have a high number of "passive opens" TCP handshake issues throughout the whole day yesterday. openqaworker8 has a comparable number of worker slots and network traffic but no such problem.
openqaworker6 (network issues): https://stats.openqa-monitor.qa.suse.de/d/WDopenqaworker6/worker-dashboard-openqaworker6?viewPanel=65094&orgId=1&refresh=1m&from=now-2d&to=now
openqaworker8 (no issues): https://stats.openqa-monitor.qa.suse.de/d/WDopenqaworker8/worker-dashboard-openqaworker8?viewPanel=65094&orgId=1&refresh=1m&from=now-2d&to=now
So this could be the culprit as well.
For now I have the following ideas:
1. Check the impact of these "passive opens" TCP handshakes. This is likely just a symptom of having downloads continuously ongoing.
2. Reduce the number of worker slots on these workers as a first mitigation. This is actually done by https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/338/diffs because a few special worker classes are still configured. As shown by the queries in subsequent comments the workers still execute jobs, so I haven't filed yet another workaround SR.
3. Let workers declare themselves as broken if asset downloads are piling up (similar to how we already do it when the cache service is not available at all). I've created a separate ticket (#96623) for this.
4. Abort inactive Minion jobs for asset downloads when the openQA job has been cancelled anyway (e.g. after the timeout has been exceeded). This would avoid starting asset downloads for jobs which are not relevant anymore. Likely that's not really required when implementing 3., though. I've created a separate ticket (#96684) for this.
5. If it really turns out to be a scalability issue we need to look into providing mirrors for assets (so not all workers download assets directly from OSD). That's what the comment from @ilausuch is about.
Updated by mkittler over 3 years ago
@nicksinger tested the network performance between OSD and openqaworker5 and 6 with iperf and it doesn't look bad (for both IPv4 and IPv6). S.M.A.R.T. also looks good on both hosts. He also observed packet loss on openqaworker6, but only over IPv4. Maybe there are general networking issues.
There are only 5 "passive open" TCP connections. That matches the number of Minion workers we use for asset downloads. However, I suspect this is rather a symptom of having asset downloads continuously running than the cause of the problem.
The few remaining worker classes which are still configured for these workers don't seem to overload them:
openqa=> select count(jobs.id) as job_count, (select count(id) from workers as w where w.host = workers.host and t_updated > (timezone('UTC', now()) - interval '1 hour')) as worker_slots, host from jobs left join workers on workers.id = jobs.assigned_worker_id where reason like '%setup exceeded MAX_SETUP_TIME%' and jobs.t_finished >= '2021-08-05T00:00:00' group by host order by job_count desc;
job_count | worker_slots | host
-----------+--------------+------
(0 rows)
But they still got some jobs done today:
openqa=> select host, count(id) as online_slots, (select array[((select sum(size) from assets where id = any(array_agg(distinct jobs_assets.asset_id))) / 1024 / 1024 / 1024), count(distinct id)] from jobs join jobs_assets on jobs.id = jobs_assets.job_id where assigned_worker_id = any(array_agg(w.id)) and t_finished >= '2021-08-05T00:00:00') as recent_asset_size_in_gb_and_job_count from workers as w where t_updated > (timezone('UTC', now()) - interval '1 hour') group by host order by recent_asset_size_in_gb_and_job_count desc;
host | online_slots | recent_asset_size_in_gb_and_job_count
---------------------+--------------+---------------------------------------
openqaworker11 | 10 | {NULL,0}
automotive-3 | 1 | {NULL,0}
openqaworker3 | 19 | {501.8125675506889824,463}
openqaworker8 | 16 | {420.8258606726303672,511}
openqaworker9 | 16 | {417.8395846914500049,463}
openqaworker13 | 16 | {357.7716910494491484,438}
openqaworker10 | 10 | {315.0581799233332256,251}
openqaworker2 | 34 | {218.7443862790241816,159}
openqaworker5 | 41 | {205.9085173439234541,198}
openqaworker6 | 29 | {155.4249267578125000,101}
grenache-1 | 49 | {151.8009881842881484,219}
openqaworker-arm-2 | 20 | {130.5520288934931182,214}
openqaworker-arm-3 | 20 | {117.5937674688175293,254}
openqaworker-arm-1 | 10 | {117.1310680918395518,138}
powerqaworker-qam-1 | 8 | {70.9400719152763486,48}
QA-Power8-4-kvm | 8 | {67.0431841108947988,50}
QA-Power8-5-kvm | 8 | {62.7192087164148691,47}
malbec | 4 | {51.0415565958246592,34}
(18 rows)
Today the overall load (in terms of number of jobs and size of the assets) is also lower than yesterday (compared to #note-19).
But before, the workers were able to handle more. These are the stats for the day before yesterday and the days before that (in descending order):
openqa=> select host, count(id) as online_slots, (select array[((select sum(size) from assets where id = any(array_agg(distinct jobs_assets.asset_id))) / 1024 / 1024 / 1024), count(distinct id)] from jobs join jobs_assets on jobs.id = jobs_assets.job_id where assigned_worker_id = any(array_agg(w.id)) and t_finished >= '2021-08-03T00:00:00' and t_finished < '2021-08-04T00:00:00') as recent_asset_size_in_gb_and_job_count from workers as w where t_updated > (timezone('UTC', now()) - interval '1 hour') group by host order by recent_asset_size_in_gb_and_job_count desc;
host | online_slots | recent_asset_size_in_gb_and_job_count
---------------------+--------------+---------------------------------------
openqaworker11 | 10 | {NULL,0}
openqaworker5 | 41 | {564.0345656182616992,997}
openqaworker6 | 29 | {513.9413869706913857,816}
openqaworker3 | 19 | {444.0452355770394229,624}
openqaworker8 | 16 | {355.8525378452613926,628}
openqaworker13 | 16 | {324.7379645807668594,500}
openqaworker9 | 16 | {316.6136079821735664,579}
openqaworker10 | 10 | {242.6821001125499629,366}
grenache-1 | 49 | {197.3507730877026943,285}
openqaworker2 | 34 | {192.5563597902655605,232}
QA-Power8-4-kvm | 8 | {121.9976354017853740,84}
powerqaworker-qam-1 | 8 | {111.7647066665813350,78}
QA-Power8-5-kvm | 8 | {97.0567456847056768,97}
malbec | 4 | {96.7560659423470498,52}
openqaworker-arm-2 | 20 | {76.5491055669263008,309}
openqaworker-arm-1 | 10 | {67.7944853128865361,165}
automotive-3 | 1 | {0.000000155530869960784912,1}
(17 rows)
openqa=> select host, count(id) as online_slots, (select array[((select sum(size) from assets where id = any(array_agg(distinct jobs_assets.asset_id))) / 1024 / 1024 / 1024), count(distinct id)] from jobs join jobs_assets on jobs.id = jobs_assets.job_id where assigned_worker_id = any(array_agg(w.id)) and t_finished >= '2021-08-02T00:00:00' and t_finished < '2021-08-03T00:00:00') as recent_asset_size_in_gb_and_job_count from workers as w where t_updated > (timezone('UTC', now()) - interval '1 hour') group by host order by recent_asset_size_in_gb_and_job_count desc;
host | online_slots | recent_asset_size_in_gb_and_job_count
---------------------+--------------+---------------------------------------
openqaworker11 | 10 | {NULL,0}
automotive-3 | 1 | {NULL,0}
openqaworker5 | 41 | {454.5740667823702148,988}
openqaworker6 | 29 | {408.9359143218025586,781}
openqaworker3 | 19 | {369.7522765416652012,629}
openqaworker8 | 16 | {302.8272499749436953,677}
openqaworker13 | 16 | {255.5371335381641963,561}
openqaworker9 | 16 | {251.7375126304104902,594}
openqaworker10 | 10 | {209.3847685232758525,410}
openqaworker2 | 34 | {177.5325910160317998,181}
grenache-1 | 49 | {172.0760751683265020,306}
QA-Power8-4-kvm | 8 | {74.7613717345520850,77}
powerqaworker-qam-1 | 8 | {67.8156854547560215,80}
QA-Power8-5-kvm | 8 | {67.0418267175555225,65}
malbec | 4 | {66.2995940865948799,38}
openqaworker-arm-1 | 10 | {61.5950581943616270,135}
openqaworker-arm-2 | 20 | {55.9928215481340889,284}
(17 rows)
openqa=> select host, count(id) as online_slots, (select array[((select sum(size) from assets where id = any(array_agg(distinct jobs_assets.asset_id))) / 1024 / 1024 / 1024), count(distinct id)] from jobs join jobs_assets on jobs.id = jobs_assets.job_id where assigned_worker_id = any(array_agg(w.id)) and t_finished >= '2021-08-01T00:00:00' and t_finished < '2021-08-02T00:00:00') as recent_asset_size_in_gb_and_job_count from workers as w where t_updated > (timezone('UTC', now()) - interval '1 hour') group by host order by recent_asset_size_in_gb_and_job_count desc;
host | online_slots | recent_asset_size_in_gb_and_job_count
---------------------+--------------+---------------------------------------
openqaworker11 | 10 | {NULL,0}
automotive-3 | 1 | {NULL,0}
openqaworker5 | 41 | {292.3224924663081738,377}
openqaworker6 | 29 | {280.0997966835275273,338}
openqaworker3 | 19 | {243.8785007568076230,222}
openqaworker13 | 16 | {119.8393164910376074,169}
openqaworker9 | 16 | {111.7239799592644023,210}
openqaworker8 | 16 | {97.6359134484082510,223}
openqaworker10 | 10 | {72.9029493164271113,133}
openqaworker2 | 34 | {52.4286611760035156,45}
powerqaworker-qam-1 | 8 | {31.4961270811036230,10}
grenache-1 | 49 | {30.9796929555013779,168}
openqaworker-arm-2 | 20 | {22.3183024693280459,39}
openqaworker-arm-1 | 10 | {20.9576080394908789,15}
QA-Power8-5-kvm | 8 | {16.6074772160500283,8}
QA-Power8-4-kvm | 8 | {10.3167545357719063,12}
malbec | 4 | {9.4528544545173645,2}
(17 rows)
openqa=> select host, count(id) as online_slots, (select array[((select sum(size) from assets where id = any(array_agg(distinct jobs_assets.asset_id))) / 1024 / 1024 / 1024), count(distinct id)] from jobs join jobs_assets on jobs.id = jobs_assets.job_id where assigned_worker_id = any(array_agg(w.id)) and t_finished >= '2021-07-30T00:00:00' and t_finished < '2021-08-31T00:00:00') as recent_asset_size_in_gb_and_job_count from workers as w where t_updated > (timezone('UTC', now()) - interval '1 hour') group by host order by recent_asset_size_in_gb_and_job_count desc;
host | online_slots | recent_asset_size_in_gb_and_job_count
---------------------+--------------+---------------------------------------
openqaworker11 | 10 | {NULL,0}
openqaworker8 | 16 | {1501.5539881587028506,4055}
openqaworker13 | 16 | {1482.9567158073186875,3524}
openqaworker3 | 19 | {1405.1550275972113037,3965}
openqaworker9 | 16 | {1363.2696598181501006,3535}
openqaworker5 | 41 | {1161.9894029386341572,5248}
openqaworker10 | 10 | {1140.3377235047519209,2253}
openqaworker6 | 29 | {1118.6922541307285400,4544}
openqaworker2 | 34 | {806.7370675625279521,1175}
grenache-1 | 49 | {625.7008608793839834,1850}
openqaworker-arm-2 | 20 | {325.4278003461658955,1667}
openqaworker-arm-1 | 10 | {289.5065780291333770,891}
QA-Power8-4-kvm | 8 | {239.9003790505230430,532}
powerqaworker-qam-1 | 8 | {225.4872193448245527,590}
malbec | 4 | {216.3586414456367490,298}
QA-Power8-5-kvm | 8 | {195.2722742008045293,495}
automotive-3 | 1 | {0.00045777391642332077,117}
(17 rows)
openqa=> select host, count(id) as online_slots, (select array[((select sum(size) from assets where id = any(array_agg(distinct jobs_assets.asset_id))) / 1024 / 1024 / 1024), count(distinct id)] from jobs join jobs_assets on jobs.id = jobs_assets.job_id where assigned_worker_id = any(array_agg(w.id)) and t_finished >= '2021-07-29T00:00:00' and t_finished < '2021-08-30T00:00:00') as recent_asset_size_in_gb_and_job_count from workers as w where t_updated > (timezone('UTC', now()) - interval '1 hour') group by host order by recent_asset_size_in_gb_and_job_count desc;
host | online_slots | recent_asset_size_in_gb_and_job_count
---------------------+--------------+---------------------------------------
openqaworker11 | 10 | {NULL,0}
openqaworker8 | 16 | {1564.1907451339066025,4874}
openqaworker13 | 16 | {1527.8008457580581279,4121}
openqaworker3 | 19 | {1454.0253826659172822,4707}
openqaworker9 | 16 | {1420.8450045976787852,4253}
openqaworker5 | 41 | {1254.6822126023471357,6388}
openqaworker10 | 10 | {1199.0791805796325205,2715}
openqaworker6 | 29 | {1195.0145842516794775,5654}
openqaworker2 | 34 | {877.3261965857818701,1464}
grenache-1 | 49 | {726.2261848598718643,2345}
openqaworker-arm-2 | 20 | {394.4984608776867393,2447}
QA-Power8-4-kvm | 8 | {374.7659980067983242,729}
openqaworker-arm-1 | 10 | {369.4309996552765371,1296}
QA-Power8-5-kvm | 8 | {336.1131262471899414,666}
powerqaworker-qam-1 | 8 | {332.0527856061235049,848}
malbec | 4 | {296.4981770850718018,416}
automotive-3 | 1 | {0.00068451650440692902,173}
(17 rows)
So something has definitely deteriorated.
Updated by mkittler over 3 years ago
- Related to action #96554: Mitigate on-going disk I/O alerts size:M added
Updated by mkittler over 3 years ago
The packet loss and ping times when using IPv4 to ping OSD are really quite high on openqaworker6:
martchus@openqaworker6:~> ping -4 openqa.suse.de
PING openqa.suse.de (10.160.0.207) 56(84) bytes of data.
64 bytes from salt (10.160.0.207): icmp_seq=7 ttl=64 time=2.92 ms
64 bytes from salt (10.160.0.207): icmp_seq=11 ttl=64 time=3.08 ms
64 bytes from salt (10.160.0.207): icmp_seq=15 ttl=64 time=2.51 ms
64 bytes from salt (10.160.0.207): icmp_seq=21 ttl=64 time=2.94 ms
64 bytes from salt (10.160.0.207): icmp_seq=23 ttl=64 time=3.00 ms
64 bytes from salt (10.160.0.207): icmp_seq=26 ttl=64 time=2.77 ms
64 bytes from salt (10.160.0.207): icmp_seq=30 ttl=64 time=2.86 ms
64 bytes from salt (10.160.0.207): icmp_seq=32 ttl=64 time=3.85 ms
64 bytes from salt (10.160.0.207): icmp_seq=37 ttl=64 time=2.99 ms
64 bytes from salt (10.160.0.207): icmp_seq=41 ttl=64 time=2.87 ms
64 bytes from salt (10.160.0.207): icmp_seq=44 ttl=64 time=2.36 ms
64 bytes from salt (10.160.0.207): icmp_seq=51 ttl=64 time=3.40 ms
64 bytes from salt (10.160.0.207): icmp_seq=53 ttl=64 time=3.04 ms
64 bytes from salt (10.160.0.207): icmp_seq=54 ttl=64 time=2.87 ms
64 bytes from salt (10.160.0.207): icmp_seq=55 ttl=64 time=2.80 ms
64 bytes from salt (10.160.0.207): icmp_seq=61 ttl=64 time=3.00 ms
64 bytes from salt (10.160.0.207): icmp_seq=63 ttl=64 time=3.26 ms
64 bytes from salt (10.160.0.207): icmp_seq=64 ttl=64 time=3.06 ms
64 bytes from salt (10.160.0.207): icmp_seq=71 ttl=64 time=2.93 ms
64 bytes from salt (10.160.0.207): icmp_seq=78 ttl=64 time=2.83 ms
64 bytes from salt (10.160.0.207): icmp_seq=80 ttl=64 time=2.92 ms
64 bytes from salt (10.160.0.207): icmp_seq=86 ttl=64 time=3.03 ms
64 bytes from salt (10.160.0.207): icmp_seq=88 ttl=64 time=3.02 ms
64 bytes from salt (10.160.0.207): icmp_seq=89 ttl=64 time=2.90 ms
64 bytes from salt (10.160.0.207): icmp_seq=91 ttl=64 time=2.95 ms
64 bytes from salt (10.160.0.207): icmp_seq=93 ttl=64 time=2.87 ms
64 bytes from salt (10.160.0.207): icmp_seq=94 ttl=64 time=2.75 ms
64 bytes from salt (10.160.0.207): icmp_seq=101 ttl=64 time=2.90 ms
64 bytes from salt (10.160.0.207): icmp_seq=102 ttl=64 time=2.70 ms
64 bytes from salt (10.160.0.207): icmp_seq=108 ttl=64 time=3.23 ms
64 bytes from salt (10.160.0.207): icmp_seq=111 ttl=64 time=2.77 ms
64 bytes from salt (10.160.0.207): icmp_seq=113 ttl=64 time=3.10 ms
64 bytes from salt (10.160.0.207): icmp_seq=117 ttl=64 time=3.10 ms
64 bytes from salt (10.160.0.207): icmp_seq=118 ttl=64 time=2.67 ms
^C
--- openqa.suse.de ping statistics ---
118 packets transmitted, 34 received, 71% packet loss, time 118069ms
rtt min/avg/max/mdev = 2.363/2.953/3.858/0.256 ms
martchus@openqaworker6:~> ping -6 openqa.suse.de
PING openqa.suse.de(openqa.suse.de (2620:113:80c0:8080:10:160:0:207)) 56 data bytes
64 bytes from openqa.suse.de (2620:113:80c0:8080:10:160:0:207): icmp_seq=1 ttl=64 time=0.177 ms
64 bytes from openqa.suse.de (2620:113:80c0:8080:10:160:0:207): icmp_seq=2 ttl=64 time=1.64 ms
64 bytes from openqa.suse.de (2620:113:80c0:8080:10:160:0:207): icmp_seq=3 ttl=64 time=0.244 ms
64 bytes from openqa.suse.de (2620:113:80c0:8080:10:160:0:207): icmp_seq=4 ttl=64 time=0.268 ms
64 bytes from openqa.suse.de (2620:113:80c0:8080:10:160:0:207): icmp_seq=5 ttl=64 time=0.199 ms
64 bytes from openqa.suse.de (2620:113:80c0:8080:10:160:0:207): icmp_seq=6 ttl=64 time=0.411 ms
64 bytes from openqa.suse.de (2620:113:80c0:8080:10:160:0:207): icmp_seq=7 ttl=64 time=0.686 ms
64 bytes from openqa.suse.de (2620:113:80c0:8080:10:160:0:207): icmp_seq=8 ttl=64 time=0.569 ms
64 bytes from openqa.suse.de (2620:113:80c0:8080:10:160:0:207): icmp_seq=9 ttl=64 time=0.706 ms
64 bytes from openqa.suse.de (2620:113:80c0:8080:10:160:0:207): icmp_seq=10 ttl=64 time=0.843 ms
64 bytes from openqa.suse.de (2620:113:80c0:8080:10:160:0:207): icmp_seq=11 ttl=64 time=0.437 ms
64 bytes from openqa.suse.de (2620:113:80c0:8080:10:160:0:207): icmp_seq=12 ttl=64 time=0.868 ms
64 bytes from openqa.suse.de (2620:113:80c0:8080:10:160:0:207): icmp_seq=13 ttl=64 time=0.749 ms
64 bytes from openqa.suse.de (2620:113:80c0:8080:10:160:0:207): icmp_seq=14 ttl=64 time=0.680 ms
64 bytes from openqa.suse.de (2620:113:80c0:8080:10:160:0:207): icmp_seq=15 ttl=64 time=0.614 ms
64 bytes from openqa.suse.de (2620:113:80c0:8080:10:160:0:207): icmp_seq=16 ttl=64 time=0.747 ms
64 bytes from openqa.suse.de (2620:113:80c0:8080:10:160:0:207): icmp_seq=17 ttl=64 time=0.627 ms
64 bytes from openqa.suse.de (2620:113:80c0:8080:10:160:0:207): icmp_seq=18 ttl=64 time=0.703 ms
64 bytes from openqa.suse.de (2620:113:80c0:8080:10:160:0:207): icmp_seq=19 ttl=64 time=0.806 ms
64 bytes from openqa.suse.de (2620:113:80c0:8080:10:160:0:207): icmp_seq=20 ttl=64 time=0.669 ms
64 bytes from openqa.suse.de (2620:113:80c0:8080:10:160:0:207): icmp_seq=21 ttl=64 time=0.801 ms
64 bytes from openqa.suse.de (2620:113:80c0:8080:10:160:0:207): icmp_seq=22 ttl=64 time=0.661 ms
64 bytes from openqa.suse.de (2620:113:80c0:8080:10:160:0:207): icmp_seq=23 ttl=64 time=0.717 ms
64 bytes from openqa.suse.de (2620:113:80c0:8080:10:160:0:207): icmp_seq=24 ttl=64 time=0.866 ms
64 bytes from openqa.suse.de (2620:113:80c0:8080:10:160:0:207): icmp_seq=25 ttl=64 time=1.06 ms
64 bytes from openqa.suse.de (2620:113:80c0:8080:10:160:0:207): icmp_seq=26 ttl=64 time=0.730 ms
64 bytes from openqa.suse.de (2620:113:80c0:8080:10:160:0:207): icmp_seq=27 ttl=64 time=0.387 ms
64 bytes from openqa.suse.de (2620:113:80c0:8080:10:160:0:207): icmp_seq=28 ttl=64 time=0.686 ms
64 bytes from openqa.suse.de (2620:113:80c0:8080:10:160:0:207): icmp_seq=29 ttl=64 time=0.810 ms
64 bytes from openqa.suse.de (2620:113:80c0:8080:10:160:0:207): icmp_seq=30 ttl=64 time=0.606 ms
64 bytes from openqa.suse.de (2620:113:80c0:8080:10:160:0:207): icmp_seq=31 ttl=64 time=0.576 ms
64 bytes from openqa.suse.de (2620:113:80c0:8080:10:160:0:207): icmp_seq=32 ttl=64 time=0.599 ms
^C
--- openqa.suse.de ping statistics ---
32 packets transmitted, 32 received, 0% packet loss, time 31684ms
rtt min/avg/max/mdev = 0.177/0.660/1.640/0.271 ms
When pinging wotan or an external host (via IPv4 or IPv6) the packet loss does not occur. However, when pinging OSD from another worker (including openqaworker5) the problem is also not reproducible.
The connection speed seems fine, though:
martchus@openqaworker6:~> iperf3 -4 -c openqa.suse.de -i 1 -t 30
Connecting to host openqa.suse.de, port 5201
[ 5] local 10.160.1.100 port 50254 connected to 10.160.0.207 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 68.9 MBytes 578 Mbits/sec 438 62.2 KBytes
[ 5] 1.00-2.00 sec 78.3 MBytes 657 Mbits/sec 495 130 KBytes
[ 5] 2.00-3.00 sec 73.3 MBytes 615 Mbits/sec 521 49.5 KBytes
[ 5] 3.00-4.00 sec 77.8 MBytes 653 Mbits/sec 511 106 KBytes
[ 5] 4.00-5.00 sec 76.2 MBytes 640 Mbits/sec 711 67.9 KBytes
[ 5] 5.00-6.00 sec 75.3 MBytes 632 Mbits/sec 353 148 KBytes
[ 5] 6.00-7.00 sec 78.4 MBytes 658 Mbits/sec 733 50.9 KBytes
[ 5] 7.00-8.00 sec 69.2 MBytes 580 Mbits/sec 476 62.2 KBytes
[ 5] 8.00-9.00 sec 70.7 MBytes 593 Mbits/sec 450 89.1 KBytes
[ 5] 9.00-10.00 sec 72.9 MBytes 611 Mbits/sec 482 276 KBytes
[ 5] 10.00-11.00 sec 75.8 MBytes 635 Mbits/sec 401 79.2 KBytes
[ 5] 11.00-12.00 sec 80.6 MBytes 676 Mbits/sec 739 66.5 KBytes
[ 5] 12.00-13.00 sec 74.9 MBytes 629 Mbits/sec 857 272 KBytes
[ 5] 13.00-14.00 sec 75.8 MBytes 635 Mbits/sec 552 50.9 KBytes
[ 5] 14.00-15.00 sec 76.4 MBytes 641 Mbits/sec 586 129 KBytes
[ 5] 15.00-16.00 sec 74.8 MBytes 628 Mbits/sec 403 93.3 KBytes
[ 5] 16.00-17.00 sec 72.7 MBytes 610 Mbits/sec 541 147 KBytes
[ 5] 17.00-18.00 sec 63.9 MBytes 536 Mbits/sec 344 79.2 KBytes
[ 5] 18.00-19.00 sec 70.0 MBytes 587 Mbits/sec 249 52.3 KBytes
[ 5] 19.00-20.00 sec 79.4 MBytes 666 Mbits/sec 559 93.3 KBytes
[ 5] 20.00-21.00 sec 78.0 MBytes 655 Mbits/sec 376 91.9 KBytes
[ 5] 21.00-22.00 sec 77.7 MBytes 652 Mbits/sec 594 274 KBytes
[ 5] 22.00-23.00 sec 71.6 MBytes 601 Mbits/sec 546 158 KBytes
[ 5] 23.00-24.00 sec 66.9 MBytes 561 Mbits/sec 279 89.1 KBytes
[ 5] 24.00-25.00 sec 68.5 MBytes 575 Mbits/sec 444 67.9 KBytes
[ 5] 25.00-26.00 sec 75.0 MBytes 629 Mbits/sec 628 99.0 KBytes
[ 5] 26.00-27.00 sec 83.0 MBytes 696 Mbits/sec 239 175 KBytes
[ 5] 27.00-28.00 sec 75.8 MBytes 636 Mbits/sec 143 100 KBytes
[ 5] 28.00-29.00 sec 76.5 MBytes 642 Mbits/sec 308 112 KBytes
[ 5] 29.00-30.00 sec 55.6 MBytes 466 Mbits/sec 154 90.5 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-30.00 sec 2.16 GBytes 619 Mbits/sec 14112 sender
[ 5] 0.00-30.00 sec 2.16 GBytes 619 Mbits/sec receiver
iperf Done.
It looks the same with IPv6. I've also tested it on openqaworker5 (had many incomplete jobs) and openqaworker8 (had no incomplete jobs) and the figures are slightly higher (sometimes reaching the 800 Mbits/sec mark).
With the download speed ruled out, that leaves /dev/vdc on OSD being too slow as the cause. So I've tried downloading an asset specifically:
martchus@openqaworker5:~> wget -4 'https://openqa.suse.de/tests/6598846/asset/iso/SLE-15-SP4-Online-x86_64-Build18.1-Media1.iso'
--2021-08-06 12:53:37-- https://openqa.suse.de/tests/6598846/asset/iso/SLE-15-SP4-Online-x86_64-Build18.1-Media1.iso
Resolving openqa.suse.de (openqa.suse.de)… 10.160.0.207
Connecting to openqa.suse.de (openqa.suse.de)|10.160.0.207|:443 … connected.
HTTP request sent, awaiting response … 302 Found
Location: /assets/iso/SLE-15-SP4-Online-x86_64-Build18.1-Media1.iso [following]
--2021-08-06 12:53:37-- https://openqa.suse.de/assets/iso/SLE-15-SP4-Online-x86_64-Build18.1-Media1.iso
Reusing existing connection to openqa.suse.de:443.
HTTP request sent, awaiting response … 200 OK
Length: 412090368 (393M) [application/x-iso9660-image]
Saving to: 'SLE-15-SP4-Online-x86_64-Build18.1-Media1.iso'
SLE-15-SP4-Online-x86_64-Build18.1-Media1.iso 100%[===================================>] 393.00M 16.9MB/s in 20s
2021-08-06 12:53:57 (20.1 MB/s) - 'SLE-15-SP4-Online-x86_64-Build18.1-Media1.iso' saved [412090368/412090368]
I've tried it multiple times in a row. At first I got 20.1 MB/s to 22.9 MB/s, which already doesn't seem too bad to me. The speed also matches the I/O traffic I observed yesterday via htop when looking at the relevant Minion processes. I also did some tests on the "good" worker openqaworker8 where I got 42.8 MB/s to 47.7 MB/s. That's more than double. However, after downloading the asset several times on both hosts (interleaved) I reached a very high download speed of around 60 MB/s on both workers. So I assume the download isn't really slower on "bad" hosts like openqaworker5 compared to "good" hosts like openqaworker8, and it is mainly a matter of how fast OSD can serve the asset, which highly depends on caching.
I've also tested the download on openqaworker6 with wget -4 … and, despite the packet loss shown by ping -4 …, the speed also reaches around 60 MB/s.
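The packet loss itself is easy to quantify; a quick sketch (100 probes is an arbitrary sample size):
# -q suppresses per-probe output; the summary line reports the loss percentage
ping -4 -c 100 -q openqa.suse.de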
So /dev/vdc is not slow per se either. The values measured on OSD itself (while other processes likely access /dev/vdc as well) also don't look bad:
martchus@openqa:~> sudo hdparm -Tt /dev/vdc
/dev/vdc:
Timing cached reads: 5682 MB in 2.00 seconds = 2844.42 MB/sec
Timing buffered disk reads: 276 MB in 3.01 seconds = 91.66 MB/sec
However, when many worker slots are hammering it concurrently, the performance likely degrades a lot (which affects the workers with a high asset load the most).
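hdparm only exercises a single sequential reader, so it cannot show that effect. A concurrent read load could be approximated with fio; this is only a sketch, with the job count, block size and runtime as arbitrary assumptions:
# 16 parallel 1M random readers for 30 s; --readonly prevents any writes to the device
fio --name=concurrent-read --filename=/dev/vdc --readonly --direct=1 \
    --ioengine=libaio --rw=randread --bs=1M --numjobs=16 --iodepth=4 \
    --runtime=30 --time_based --group_reporting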
Either way, these figures don't give me much to base an infra ticket on. I'm not sure what to do next.
Updated by mkittler over 3 years ago
Now there are 3 more jobs:
openqa=> select count(jobs.id) as job_count, (select count(id) from workers as w where w.host = workers.host and t_updated > (timezone('UTC', now()) - interval '1 hour')) as worker_slots, host from jobs left join workers on workers.id = jobs.assigned_worker_id where reason like '%setup exceeded MAX_SETUP_TIME%' and jobs.t_finished >= '2021-08-05T00:00:00' group by host order by job_count desc;
job_count | worker_slots | host
-----------+--------------+--------------------
1 | 10 | openqaworker-arm-1
1 | 19 | openqaworker3
1 | 29 | openqaworker6
(3 rows)
openqa=> select id, result, reason, (select host from workers where id = assigned_worker_id) as worker from jobs where reason like '%setup exceeded MAX_SETUP_TIME%' and t_finished >= '2021-08-05T00:00:00' order by id;
id | result | reason | worker
---------+------------------+----------------------------------------+--------------------
6654913 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker3
6656551 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker-arm-1
6656603 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker6
(3 rows)
Looks like syncing tests/needles took too long in these cases, e.g. it took an hour on openqaworker-arm-1 and openqaworker3. Nothing was stuck here, it was just slow.
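For what it's worth, such a sync can also be timed manually outside the cache service; a sketch, assuming the usual rsync module and cache location (paths may differ per worker):
# hypothetical manual timing of the tests/needles sync
time rsync -a rsync://openqa.suse.de/tests /var/lib/openqa/cache/openqa.suse.de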
Updated by mkittler over 3 years ago
There are also cache-service-related coredumps on those hosts:
martchus@openqaworker6:~> sudo coredumpctl info 3775
PID: 3775 (openqa-workerca)
UID: 486 (_openqa-worker)
GID: 65534 (nogroup)
Signal: 3 (QUIT)
Timestamp: Wed 2021-08-04 15:58:37 CEST (2 days ago)
Command Line: /usr/bin/perl /usr/share/openqa/script/openqa-workercache prefork -m production -i 100 -H 400 -w 4 -G 80
Executable: /usr/bin/perl
Control Group: /system.slice/openqa-worker-cacheservice.service
Unit: openqa-worker-cacheservice.service
Slice: system.slice
Boot ID: 0cbc03c70c1b4c3594d42f7307ef8c7f
Machine ID: c8fa1ae99ae4d5730af2d97658d3edd9
Hostname: openqaworker6
Storage: /var/lib/systemd/coredump/core.openqa-workerca.486.0cbc03c70c1b4c3594d42f7307ef8c7f.3775.1628085517000000.lz4
Message: Process 3775 (openqa-workerca) of user 486 dumped core.
Stack trace of thread 3775:
#0 0x0000556350927344 Perl_hv_common (perl)
#1 0x00005563508b30cf n/a (perl)
#2 0x00005563508b3bf6 Perl_gv_fetchmeth_pvn (perl)
#3 0x00005563509398d3 n/a (perl)
#4 0x000055635093a2f8 Perl_sv_clear (perl)
#5 0x000055635093a7b1 Perl_sv_free2 (perl)
#6 0x000055635096aa67 Perl_free_tmps (perl)
#7 0x00005563508ad5d9 perl_run (perl)
#8 0x0000556350884ef2 main (perl)
#9 0x00007efe102e534a __libc_start_main (libc.so.6)
#10 0x0000556350884f3a _start (perl)
martchus@openqaworker5:~> sudo coredumpctl info 38544
PID: 38544 (openqa-workerca)
UID: 486 (_openqa-worker)
GID: 65534 (nogroup)
Signal: 3 (QUIT)
Timestamp: Wed 2021-08-04 22:21:07 CEST (1 day 19h ago)
Command Line: /usr/bin/perl /usr/share/openqa/script/openqa-workercache prefork -m production -i 100 -H 400 -w 4 -G 80
Executable: /usr/bin/perl
Control Group: /system.slice/openqa-worker-cacheservice.service
Unit: openqa-worker-cacheservice.service
Slice: system.slice
Boot ID: 549d9e3d0d0e4ac08b8aef53a6bcfb98
Machine ID: 9b632e64195547add4173e3c5878a024
Hostname: openqaworker5
Storage: /var/lib/systemd/coredump/core.openqa-workerca.486.549d9e3d0d0e4ac08b8aef53a6bcfb98.38544.1628108467000000.lz4
Message: Process 38544 (openqa-workerca) of user 486 dumped core.
Stack trace of thread 38544:
#0 0x0000557588e1703b Perl_sv_clear (perl)
#1 0x0000557588e177b1 Perl_sv_free2 (perl)
#2 0x0000557588e47a67 Perl_free_tmps (perl)
#3 0x0000557588d8a5d9 perl_run (perl)
#4 0x0000557588d61ef2 main (perl)
#5 0x00007f1c98f3434a __libc_start_main (libc.so.6)
#6 0x0000557588d61f3a _start (perl)
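To get more than the raw frames above, the dump can be loaded into a debugger directly from the systemd journal; a sketch using the PID of the second dump:
# opens the matching core file and executable in gdb; running "bt full" inside
# gdb then prints the complete C-level backtrace of the perl interpreter
sudo coredumpctl gdb 38544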
Updated by livdywan over 3 years ago
- Related to action #96623: Let workers declare themselves as broken if asset downloads are piling up size:M added
Updated by mkittler over 3 years ago
Since jobs are piling up on OSD, and since with the currently reduced number of jobs on openqaworker5 and 6 no further jobs are running into MAX_SETUP_TIME, I've created https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/339/diffs to increase the number of jobs we run on those workers again. The number of slots will still be reduced compared to what we usually had.
Updated by mkittler over 3 years ago
The SR has been merged and the worker slots are in use now. The query
select id, result, reason, (select host from workers where id = assigned_worker_id) as worker from jobs where reason like '%setup exceeded MAX_SETUP_TIME%' and t_finished >= '2021-08-06T11:00:00' order by id;
still gives no results and the Minion dashboard looks good as well. So it doesn't look like it was too much; possibly we can enable even more worker slots.
Updated by mkittler over 3 years ago
This problem is possibly related to a recent VLAN change in the QA lab which created a network loop. That should be resolved by now. I also don't see the packet loss anymore (when pinging OSD from openqaworker6 via IPv4).
To check we could get back to the usual number of worker slots: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/341
Updated by vpelcak over 3 years ago
Hello.
This issue seems to be pretty severe and took ~5 days to resolve.
Do you have any post-mortem analysis? What needs to be done not only to prevent the issue but also resolve it faster?
Updated by mkittler over 3 years ago
- Related to action #96684: Abort asset download via the cache service when related job runs into a timeout (or is otherwise cancelled) size:M added
Updated by livdywan over 3 years ago
vpelcak wrote:
This issue seems to be pretty severe and took ~5 days to resolve.
Do you have any post-mortem analysis? What needs to be done not only to prevent the issue but also resolve it faster?
We've been discussing this daily and I'm not sure we know what the issue is or was, or whether it's fixed. And please keep in mind that we had no alerts within this period due to infra issues. I appreciate your concern, but I think the team is handling this sensibly.
Updated by mkittler over 3 years ago
And the issue isn't even completely resolved. I'm still checking for those incompletes now that the workers are back. So far it looks good, though.
Making progress with this issue took long because all I could do was measure the performance and load of these workers, and with the information I could gather it was just not easily possible to pin down the problem, e.g. well enough to file an infra ticket. Only after asking why the QA lab had to be taken offline did it become clear that the networking problems (not caused by work of the tools team) had been the culprit. I guess better communication about the work in the QA labs, or at least about the resulting problems, would have helped.
Note that, after all, actions outside the tools team both caused and resolved the problem (if it is actually resolved now; I'm still monitoring as mentioned). So likely Infra (or whoever was involved) should also think about what they can improve. I was merely looking at symptoms and, while the problem was apparent, could reduce the number of incompletes by manually reducing the number of worker slots. This went quite smoothly via salt, besides one problem I've filed a ticket about (#96683). I've also already filed some tickets with ideas for improving openQA so it could cope better with situations like this one (#96623, #96684).
Updated by mkittler over 3 years ago
A new occurrence but this time on openqaworker10: https://openqa.suse.de/tests/6789948
I've checked the Minion dashboard. Looks like the download of http://openqa.suse.de/tests/6789948/asset/iso/SLE-15-SP2-Installer-DVD-x86_64-GM-DVD1.iso took almost an hour with one retry in the middle. The download eventually succeeded but it was already too late.
Updated by mkittler over 3 years ago
A few more occurrences happened 12 hours ago, this time again on openqaworker5 waiting for tests to sync:
openqa=> select id, result, t_finished, reason, (select host from workers where id = assigned_worker_id) as worker from jobs where reason like '%setup exceeded MAX_SETUP_TIME%' and t_finished >= '2021-08-06T11:00:00' order by t_finished asc;
id | result | t_finished | reason | worker
---------+------------------+---------------------+----------------------------------------+----------------
6789948 | timeout_exceeded | 2021-08-10 10:44:41 | timeout: setup exceeded MAX_SETUP_TIME | openqaworker10
6796355 | timeout_exceeded | 2021-08-10 22:31:18 | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5
6796465 | timeout_exceeded | 2021-08-10 22:58:18 | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5
6796613 | timeout_exceeded | 2021-08-10 23:15:50 | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5
6796612 | timeout_exceeded | 2021-08-10 23:15:50 | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5
(5 rows)
Considering we've increased MAX_SETUP_TIME from one hour to two hours for workers with many slots (like openqaworker5), we're not where we were before.
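(For illustration: MAX_SETUP_TIME is evaluated by the worker in seconds, so the increase boils down to a settings entry like the one below; where exactly it is set, e.g. on the machine definitions, is an assumption here.)
# hypothetical settings entry; 7200 s = the two hours mentioned above
MAX_SETUP_TIME=7200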
Btw, the syncing of tests took max. 9 minutes here and nothing was stuck. So everything was just slow, and syncing tests is merely the last part of the setup, which then exceeded the timeout.
Currently there are 40 inactive download jobs, and one of the active ones has already taken 24 minutes so far (SLE-15-SP3-Full-x86_64-GM-Media1.iso). Among these 40 inactive download jobs there are 11 to cache tests. Maybe these should have a higher priority.
EDIT: The 40 jobs could actually be processed in time before another test exceeded MAX_SETUP_TIME.
Updated by mkittler over 3 years ago
PR for adjusting the prio as mentioned in the last paragraph of the previous comment: https://github.com/os-autoinst/openQA/pull/4118
Updated by livdywan over 3 years ago
- Related to coordination #96447: [epic] Failed systemd services and job age alerts added
Updated by mkittler over 3 years ago
- Priority changed from Immediate to High
#96710 is the most problematic issue right now and the remaining jobs exceeding MAX_SETUP_TIME might still be a symptom of the high number of incompletes we see due to that other issue. The numbers are also relatively small (compared to #96710), e.g. the query from the previous comment now "only" returns 40 rows.
Updated by mkittler over 3 years ago
- Status changed from In Progress to Resolved
It looks like this problem has been resolved now. The query still only returns 40 rows. I suppose it helped that the network loop has been fixed and that #96710 has at least been worked around on all production workers.
We've also been following up with https://progress.opensuse.org/issues/96623 which has already been deployed. So I would leave further improvements of openQA itself (to handle such a situation better) to follow-up tickets (which have already been created, see previous comments).
Updated by okurz about 3 years ago
- Related to action #98307: Many jobs in o3 fail with timeout_exceeded on openqaworker1 auto_review:"timeout: setup exceeded MAX_SETUP_TIME":retry size:M added
Updated by okurz about 3 years ago
- Related to coordination #99579: [epic][retro] Follow-up to "Published QCOW images appear to be uncompressed" added