action #96557

closed

jobs run into MAX_SETUP_TIME, one hour between 'Downloading' and 'Download processed' and no useful output in between auto_review:"timeout: setup exceeded MAX_SETUP_TIME":retry

Added by okurz over 2 years ago. Updated over 2 years ago.

Status: Resolved
Priority: High
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2021-08-04
Due date: 2021-08-19
% Done: 0%
Estimated time:

Description

Observation

https://openqa.suse.de/tests/6645417 is "timeout_exceeded". Log details:

[2021-08-04T10:33:49.0665 CEST] [info] [pid:16838] Downloading SLE-15-SP2-Installer-DVD-x86_64-GM-DVD1.iso, request #5221 sent to Cache Service
[2021-08-04T11:26:20.0708 CEST] [info] [pid:16838] Download of SLE-15-SP2-Installer-DVD-x86_64-GM-DVD1.iso processed:
Asset "SLE-15-SP2-Installer-DVD-x86_64-GM-DVD1.iso" was downloaded by #5196, details are therefore unavailable here
[2021-08-04T11:26:20.0712 CEST] [info] [pid:16838] Rsync from 'rsync://openqa.suse.de/tests' to '/var/lib/openqa/cache/openqa.suse.de', request #5327 sent to Cache Service
[2021-08-04T11:28:20.0151 CEST] [info] [pid:16838] +++ worker notes +++
[2021-08-04T11:28:20.0151 CEST] [info] [pid:16838] End time: 2021-08-04 09:28:20
[2021-08-04T11:28:20.0151 CEST] [info] [pid:16838] Result: timeout
[2021-08-04T11:28:20.0160 CEST] [info] [pid:23579] Uploading autoinst-log.txt

This was on openqaworker6:15. Related systemd journal logs:

  • openqa-worker-cacheservice:
Aug 04 08:31:58 openqaworker6 openqa-workercache-daemon[26298]: [26379] [i] Worker 26379 started
Aug 04 08:31:58 openqaworker6 openqa-workercache-daemon[26298]: [26298] [i] Creating process id file "/tmp/prefork.pid"
Aug 04 08:31:58 openqaworker6 openqa-workercache-daemon[26298]: [26298] [i] Worker 26379 stopped
Aug 04 08:33:56 openqaworker6 openqa-workercache-daemon[26298]: [26298] [i] Stopping worker 26377 gracefully (80 seconds)
Aug 04 08:33:57 openqaworker6 openqa-workercache-daemon[26298]: [26377] [i] Worker 26377 started
Aug 04 08:49:40 openqaworker6 openqa-workercache-daemon[26298]: [26298] [i] Worker 26377 stopped
Aug 04 08:49:40 openqaworker6 openqa-workercache-daemon[26298]: [26298] [i] Stopping worker 26376 gracefully (80 seconds)
Aug 04 08:49:42 openqaworker6 openqa-workercache-daemon[26298]: [26376] [i] Worker 26376 started
Aug 04 08:51:05 openqaworker6 openqa-workercache-daemon[26298]: [26298] [i] Worker 26376 stopped
Aug 04 08:51:05 openqaworker6 openqa-workercache-daemon[26298]: [26298] [i] Stopping worker 26378 gracefully (80 seconds)
Aug 04 08:51:06 openqaworker6 openqa-workercache-daemon[26298]: [26378] [i] Worker 26378 started
Aug 04 09:55:17 openqaworker6 openqa-workercache-daemon[26298]: [26298] [i] Worker 26378 stopped
Aug 04 09:55:17 openqaworker6 openqa-workercache-daemon[26298]: [26298] [i] Stopping worker 1568 gracefully (80 seconds)
Aug 04 09:55:21 openqaworker6 openqa-workercache-daemon[26298]: [1568] [i] Worker 1568 started
Aug 04 09:59:14 openqaworker6 openqa-workercache-daemon[26298]: [26298] [i] Worker 1568 stopped
Aug 04 09:59:14 openqaworker6 openqa-workercache-daemon[26298]: [26298] [i] Stopping worker 1389 gracefully (80 seconds)
Aug 04 09:59:15 openqaworker6 openqa-workercache-daemon[26298]: [1389] [i] Worker 1389 started
Aug 04 10:14:16 openqaworker6 openqa-workercache-daemon[26298]: [26298] [i] Worker 1389 stopped
Aug 04 10:14:16 openqaworker6 openqa-workercache-daemon[26298]: [26298] [i] Stopping worker 3852 gracefully (80 seconds)
Aug 04 10:14:19 openqaworker6 openqa-workercache-daemon[26298]: [3852] [i] Worker 3852 started
Aug 04 10:19:26 openqaworker6 openqa-workercache-daemon[26298]: [4587] [i] Worker 4587 started
Aug 04 10:19:26 openqaworker6 openqa-workercache-daemon[26298]: [26298] [i] Worker 3852 stopped
Aug 04 10:19:26 openqaworker6 openqa-workercache-daemon[26298]: [26298] [i] Worker 4587 stopped
  • openqa-worker-cacheservice-minion:
Aug 04 10:29:04 openqaworker6 openqa-worker-cacheservice-minion[26299]: [16882] [i] [#5188] Downloading: "SLE-15-SP1-Installer-DVD-x86_64-GM-DVD1.iso"
Aug 04 10:29:14 openqaworker6 openqa-worker-cacheservice-minion[26299]: [16884] [i] [#5190] Sync: "rsync://openqa.suse.de/tests" to "/var/lib/openqa/cache/openqa.suse.de"
Aug 04 10:29:14 openqaworker6 openqa-worker-cacheservice-minion[26299]: [16884] [i] [#5190] Calling: rsync -avHP --timeout 1800 rsync://openqa.suse.de/tests/ --delete /var/>
Aug 04 10:29:39 openqaworker6 openqa-worker-cacheservice-minion[26299]: [16894] [i] [#5192] Downloading: "publiccloud_15sp3_Azure_gen2_BYOS_Updates.qcow2"
Aug 04 10:29:44 openqaworker6 openqa-worker-cacheservice-minion[26299]: [16895] [i] [#5193] Sync: "rsync://openqa.suse.de/tests" to "/var/lib/openqa/cache/openqa.suse.de"
Aug 04 10:29:44 openqaworker6 openqa-worker-cacheservice-minion[26299]: [16895] [i] [#5193] Calling: rsync -avHP --timeout 1800 rsync://openqa.suse.de/tests/ --delete /var/>
Aug 04 10:29:54 openqaworker6 openqa-worker-cacheservice-minion[26299]: [16904] [i] [#5195] Downloading: "SLED-15-SP2-x86_64-Installtest.qcow2"
Aug 04 10:32:54 openqaworker6 openqa-worker-cacheservice-minion[26299]: [16788] [i] [#5170] Cache size of "/var/lib/openqa/cache" is 47GiB, with limit 50GiB
Aug 04 10:32:54 openqaworker6 openqa-worker-cacheservice-minion[26299]: [16788] [i] [#5170] Downloading "SLES-15-SP2-x86_64-mru-install-minimal-with-addons-Build20210804-1->
Aug 04 10:32:59 openqaworker6 openqa-worker-cacheservice-minion[26299]: [17198] [i] [#5196] Downloading: "SLE-15-SP2-Installer-DVD-x86_64-GM-DVD1.iso"
Aug 04 10:33:09 openqaworker6 openqa-worker-cacheservice-minion[26299]: [16904] [i] [#5195] Cache size of "/var/lib/openqa/cache" is 49GiB, with limit 50GiB
Aug 04 10:33:09 openqaworker6 openqa-worker-cacheservice-minion[26299]: [16904] [i] [#5195] Downloading "SLED-15-SP2-x86_64-Installtest.qcow2" from "http://openqa.suse.de/t>
Aug 04 10:33:14 openqaworker6 openqa-worker-cacheservice-minion[26299]: [17212] [i] [#5197] Downloading: "SLE-15-SP1-Installer-DVD-x86_64-GM-DVD1.iso"
Aug 04 10:33:19 openqaworker6 openqa-worker-cacheservice-minion[26299]: [17215] [i] [#5198] Sync: "rsync://openqa.suse.de/tests" to "/var/lib/openqa/cache/openqa.suse.de"
…
Aug 04 10:34:49 openqaworker6 openqa-worker-cacheservice-minion[26299]: [17802] [i] [#5213] Downloading: "sle-15-SP1-x86_64-Build:19472:aws-cli-with-hpc.qcow2
…
Aug 04 11:26:18 openqaworker6 openqa-worker-cacheservice-minion[26299]: [17198] [i] [#5196] Size of "/var/lib/openqa/cache/openqa.suse.de/SLE-15-SP2-Installer-DVD-x86_64-GM>
Aug 04 11:26:18 openqaworker6 openqa-worker-cacheservice-minion[26299]: [17198] [i] [#5196] Download error 598, waiting 5 seconds for next try (4 remaining)
Aug 04 11:26:18 openqaworker6 openqa-worker-cacheservice-minion[26299]: [17198] [i] [#5196] Downloading "SLE-15-SP2-Installer-DVD-x86_64-GM-DVD1.iso" from "http://openqa.su>
Aug 04 11:26:20 openqaworker6 openqa-worker-cacheservice-minion[26299]: [23427] [i] [#5306] Sync: "rsync://openqa.suse.de/tests" to "/var/lib/openqa/cache/openqa.suse.de"
Aug 04 11:26:20 openqaworker6 openqa-worker-cacheservice-minion[26299]: [23427] [i] [#5306] Calling: rsync -avHP --timeout 1800 rsync://openqa.suse.de/tests/ --delete /var/>
…

To me it looks like the same asset is downloaded multiple times, which can happen if the cache size is exceeded. Somehow the openQA job only sees a response 53 minutes later, and there is "waiting 5 seconds for next try (4 remaining)" but then nothing related anymore.
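
To spot this pattern on a worker host directly, the cache service journal can be grepped for those download retries; a minimal sketch (the unit name is taken from the logs above, the pattern from the retry message):

journalctl -u openqa-worker-cacheservice-minion | grep 'Download error 598'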

Steps to reproduce

Find jobs referencing this ticket with the help of
https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label,
e.g. call openqa-query-for-job-label poo#96557 (see the sketch below).
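
For example (a sketch, assuming the script honors the same host=... environment variable convention as openqa-advanced-retrigger-jobs used further down in this ticket):

curl -sO https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label
chmod +x openqa-query-for-job-label
host=openqa.suse.de ./openqa-query-for-job-label poo#96557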


Related issues (6: 0 open, 6 closed)

  • Related to openQA Infrastructure - action #96554: Mitigate on-going disk I/O alerts size:M (Resolved, mkittler, 2021-08-04)
  • Related to openQA Project - action #96623: Let workers declare themselves as broken if asset downloads are piling up size:M (Resolved, dheidler, 2021-08-06)
  • Related to openQA Project - action #96684: Abort asset download via the cache service when related job runs into a timeout (or is otherwise cancelled) size:M (Rejected, mkittler, 2021-08-09)
  • Related to openQA Infrastructure - coordination #96447: [epic] Failed systemd services and job age alerts (Resolved, okurz, 2021-08-04)
  • Related to openQA Infrastructure - action #98307: Many jobs in o3 fail with timeout_exceeded on openqaworker1 auto_review:"timeout: setup exceeded MAX_SETUP_TIME":retry size:M (Resolved, mkittler, 2021-09-08)
  • Related to openQA Project - coordination #99579: [epic][retro] Follow-up to "Published QCOW images appear to be uncompressed" (Resolved, okurz, 2021-10-01)

Actions #1

Updated by okurz over 2 years ago

  • Subject changed from jobs run into MAX_SETUP_TIME, one hour between 'Downloading' and 'Download processed' and no useful output in between auto_review:"(?s)timeout: setup exceeded MAX_SETUP_TIME.*details are therefore unavailable here":retry to jobs run into MAX_SETUP_TIME, one hour between 'Downloading' and 'Download processed' and no useful output in between auto_review:"(?s)details are therefore unavailable here.*timeout: setup exceeded MAX_SETUP_TIME":retry
Actions #2

Updated by okurz over 2 years ago

  • Description updated (diff)

hurhaj has found four examples, all on openqaworker5; the original issue is from openqaworker6.

https://openqa.suse.de/tests/6645577, https://openqa.suse.de/tests/6643847, https://openqa.suse.de/tests/6643846, https://openqa.suse.de/tests/6645230

Actions #3

Updated by okurz over 2 years ago

  • Priority changed from Urgent to Immediate

I am bumping https://progress.opensuse.org/issues/96557 up to immediate. https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?viewPanel=14&orgId=1&from=1628004156924&to=1628079287359 shows a clear problem over the last hours.
I have triggered a reboot of openqaworker5+6 where most problems have been observed. Not sure if this makes a difference.

I have also called

for i in 5 6; do host=openqa.suse.de result=timeout_exceeded WORKER=openqaworker$i bash -ex ./openqa-advanced-retrigger-jobs; done

to retrigger all "timeout_exceeded" jobs from both openqaworker5+6.

Actions #4

Updated by mkittler over 2 years ago

  • Assignee set to mkittler
Actions #5

Updated by mkittler over 2 years ago

okurz wrote:

I have triggered a reboot of openqaworker5+6 where most problems have been observed. Not sure if this makes a difference.

The workers are now quite busy downloading assets. The download rates aren't too bad and they don't look stuck, but the Minion jobs are still piling up¹. It would have been good to check whether the workers were stuck (e.g. waiting for a Minion lock) before restarting them.


¹ Maybe that's normal since the asset cache is just being built from scratch right now? The workers openqaworker5 and openqaworker6 have 41 and 26 slots respectively, so when they're back online it isn't very surprising to see them instantly loaded with many asset downloads. When introducing MAX_SETUP_TIME we might have set too low a limit (not considering that parallel jobs running on the same worker share resources here).

Actions #6

Updated by mkittler over 2 years ago

  • Status changed from New to In Progress

Judging by the output of

select id, result, reason, (select host from workers where id = assigned_worker_id) as worker from jobs where reason like '%setup exceeded MAX_SETUP_TIME%' and t_finished >= '2021-08-04T00:00:00' order by id;

openqaworker3 also produced a few of those jobs.

The query also shows that the problem started to re-emerge on openqaworker5 and 6 after rebooting them, see the gap between the timestamps (which are UTC):

 6647186 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker6 | 2021-08-04 11:43:23
 6647203 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker6 | 2021-08-04 11:48:14
 6647206 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker6 | 2021-08-04 11:48:15
 6647227 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker6 | 2021-08-04 11:48:14
 6647235 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker6 | 2021-08-04 11:48:28
 6647244 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker6 | 2021-08-04 11:49:05
 6647496 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 13:06:18
 6647776 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 13:06:19
 6647782 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 13:06:17
 6647861 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker6 | 2021-08-04 13:06:16
 6647862 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 13:06:17
 6647863 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker6 | 2021-08-04 13:06:26
 6647866 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 13:06:19
 6647868 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker6 | 2021-08-04 13:06:23

So the reboot helped to reproduce the problem. Apparently it happens consistently on those workers whenever the asset cache needs to be built up (after restarting, or when scheduling many jobs with new assets).

Maybe I'll just bump MAX_SETUP_TIME on those workers for now.
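
For illustration, a hypothetical way to bump it, assuming MAX_SETUP_TIME can be overridden in the [global] section of /etc/openqa/workers.ini (the actual change went through salt, see the SR in the next comment):

# /etc/openqa/workers.ini (hypothetical override; value in seconds, default is one hour)
[global]
MAX_SETUP_TIME = 7200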

Actions #7

Updated by mkittler over 2 years ago

SR to increase MAX_SETUP_TIME for worker hosts with many slots: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/536
The problem occurred on openqaworker3, which has 19 slots, so I just applied it to every worker with 15 or more slots to be safe.


Looks like the last job is 6647906 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker6 | 2021-08-04 13:06:27, which timed out one hour after the reboot. Let's see how many jobs exceed the timeout on the next full hour after the reboot.

Actions #8

Updated by mkittler over 2 years ago

A few more statistics from the Minion dashboard of openqaworker5 after the reboot:

  • Syncing tests (rsync -avHP --timeout 1800 rsync://openqa.suse.de/tests/ --delete /var/lib/openqa/cache/openqa.suse.de/tests/) took 43 minutes.
  • Some downloads take quite long, e.g. downloading SLE-15-SP2-Installer-DVD-x86_64-GM-DVD1.iso took 29 minutes and sle-rhel82.qcow2 31 minutes.
  • The majority of downloads don't take that long, but since they pile up they still take very long to be processed, e.g. a download created one hour ago has only been running for 10 minutes now. The corresponding job has just run into the timeout. Note that this does not abort the asset download (for better or worse). See the sketch below for a quick way to inspect the pile-up.
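
Such a pile-up can also be checked on a worker host without the web dashboard, e.g. (a sketch, assuming the cache service listens on its default port 9530 and exposes the usual Minion admin routes):

curl -s http://localhost:9530/minion/stats
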
Actions #9

Updated by mkittler over 2 years ago

Let's see how many jobs exceed the timeout on the next full hour after the reboot.

 6647906 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker6 | 2021-08-04 13:06:27
 6647991 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker6 | 2021-08-04 14:07:19
 6648500 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker6 | 2021-08-04 14:07:19
 6647754 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 14:07:20
 6647762 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker6 | 2021-08-04 14:07:21
 6647996 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker6 | 2021-08-04 14:07:21
 6648365 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 14:07:23
 6647693 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 14:07:24
 6648384 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker6 | 2021-08-04 14:07:24
 6647993 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker6 | 2021-08-04 14:07:25
 6648011 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 14:07:25
 6643681 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 14:07:25
 6648495 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker6 | 2021-08-04 14:07:26
 6648380 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 14:07:26
 6648435 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 14:07:27
 6647739 | parallel_restarted | timeout: setup exceeded MAX_SETUP_TIME | openqaworker6 | 2021-08-04 14:07:28
 6648014 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 14:07:28
 6647692 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker6 | 2021-08-04 14:07:28
 6648371 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 14:07:29
 6648378 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 14:07:29
 6647670 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 14:07:29
 6648139 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker6 | 2021-08-04 14:07:29
 6648018 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 14:07:29
 6648368 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker6 | 2021-08-04 14:07:29
 6648013 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker6 | 2021-08-04 14:07:30
 6643682 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 14:07:30
 6647981 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 14:07:30
 6648016 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 14:07:30
 6648019 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker6 | 2021-08-04 14:07:30
 6647709 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 14:07:30
 6648350 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker6 | 2021-08-04 14:07:30
 6648121 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 14:07:30
 6648499 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 15:07:45
 6648524 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 15:07:46
 6648554 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 15:07:46
 6648608 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 15:07:46
 6648520 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 15:07:46
 6648496 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 15:07:47
 6649930 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 15:07:48
 6648525 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 15:07:48
 6643637 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 15:07:49
 6648563 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 15:07:49
 6643763 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 15:07:50
 6648609 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 15:07:50
 6643636 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 15:07:50
 6648634 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 15:07:50
 6648603 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 15:07:50
 6648531 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 15:07:51
 6648518 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 15:07:51
 6648548 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 15:07:51
 6643758 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 15:07:51
 6643762 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 15:07:51
 6643638 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 15:07:52
 6649952 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 15:07:52
 6648568 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 15:07:52
 6648532 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 15:07:53
 6648516 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker6 | 2021-08-04 15:07:53
 6648612 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 15:07:53
 6649931 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 15:07:55
 6643760 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 15:07:56
 6643639 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 15:07:57
 6649945 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker6 | 2021-08-04 15:07:57
 6648503 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 15:07:57
 6648510 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 15:07:57
 6648512 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker6 | 2021-08-04 15:07:58
 6648551 | timeout_exceeded   | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5 | 2021-08-04 15:08:00
Actions #10

Updated by okurz over 2 years ago

  • Subject changed from jobs run into MAX_SETUP_TIME, one hour between 'Downloading' and 'Download processed' and no useful output in between auto_review:"(?s)details are therefore unavailable here.*timeout: setup exceeded MAX_SETUP_TIME":retry to jobs run into MAX_SETUP_TIME, one hour between 'Downloading' and 'Download processed' and no useful output in between auto_review:"timeout: setup exceeded MAX_SETUP_TIME":retry

Generalizing the regex to catch more related jobs

Actions #11

Updated by okurz over 2 years ago

Somehow it still looks like openqaworker5+6 in particular are very error-prone.

SQL on OSD:

openqa=> select count(jobs.id),host from jobs left join workers on workers.id = jobs.assigned_worker_id where jobs.result = 'timeout_exceeded' and jobs.t_finished >= '2021-08-04' group by host order by count desc;
 count |      host      
-------+----------------
   221 | openqaworker5
    75 | openqaworker6
     9 | openqaworker3
     4 | openqaworker8
     1 | openqaworker10
     1 | openqaworker2
(6 rows)

I expected more incomplete jobs, but still it's mainly openqaworker5+6.
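
A companion query (my sketch, not run in the ticket) would show the incompletes per host for comparison:

openqa=> select count(jobs.id), host from jobs left join workers on workers.id = jobs.assigned_worker_id where jobs.result = 'incomplete' and jobs.t_finished >= '2021-08-04' group by host order by count desc;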

Actions #13

Updated by mkittler over 2 years ago

On other workers the average time of asset download jobs is significantly lower. It could be luck (and openqaworker5 really does have extraordinarily many slots), but it looks more like openqaworker5 and 6 are downloading more slowly. (There's no easy way to query the average download speed for asset downloads.)


Note that when only looking for jobs where the setup timeout has been exceeded (and not for all timeouts), it looks like only openqaworker3, 5 and 6 are affected:

openqa=> select count(jobs.id) as job_count, (select count(id) from workers as w where w.host = workers.host and t_updated > (timezone('UTC', now()) - interval '1 hour')) as worker_slots, host from jobs left join workers on workers.id = jobs.assigned_worker_id where reason like '%setup exceeded MAX_SETUP_TIME%' and jobs.t_finished >= '2021-08-04' group by host order by job_count desc;
 job_count | worker_slots |     host      
-----------+--------------+---------------
       224 |           41 | openqaworker5
        79 |           29 | openqaworker6
        10 |           19 | openqaworker3
(3 rows)

Considering all jobs of this year does not really change the query results:

openqa=> select count(jobs.id) as job_count, (select count(id) from workers as w where w.host = workers.host and t_updated > (timezone('UTC', now()) - interval '1 hour')) as worker_slots, host from jobs left join workers on workers.id = jobs.assigned_worker_id where reason like '%setup exceeded MAX_SETUP_TIME%' and jobs.t_finished >= '2021-01-01' group by host order by job_count desc;
 job_count | worker_slots |     host      
-----------+--------------+---------------
       224 |           41 | openqaworker5
        80 |           29 | openqaworker6
        49 |           49 | grenache-1
        18 |            4 | malbec
        10 |           19 | openqaworker3

So this is a relatively new problem. Although it deteriorated significantly today, I had already encountered one failure 7 days ago on openqaworker6, see https://progress.opensuse.org/issues/95299#note-15. (The jobs on malbec are likely caused by rsync being stuck, which is a different problem and has already been fixed, see https://progress.opensuse.org/issues/95299#note-4. Not sure about grenache-1. It also has many worker slots, so possibly the same issue after all?)

Actions #14

Updated by openqa_review over 2 years ago

  • Due date set to 2021-08-19

Setting due date based on mean cycle time of SUSE QE Tools

Actions #15

Updated by livdywan over 2 years ago

mkittler wrote:

SR to remove load from these workers: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/338/diffs

openqa-reload-worker-auto-restart@1 failed on openqaworker5 this morning; I have now restarted it manually.

Actions #16

Updated by mkittler over 2 years ago

After https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/338/diffs was merged (and should have been effective) there were still a few jobs:

openqa=> select count(jobs.id) as job_count, (select count(id) from workers as w where w.host = workers.host and t_updated > (timezone('UTC', now()) - interval '1 hour')) as worker_slots, host from jobs left join workers on workers.id = jobs.assigned_worker_id where reason like '%setup exceeded MAX_SETUP_TIME%' and jobs.t_finished >= '2021-08-04T19:15:00' group by host order by job_count desc;
 job_count | worker_slots |     host      
-----------+--------------+---------------
        11 |           41 | openqaworker5
         2 |           29 | openqaworker6

Likely it took a while for the piled-up asset download jobs to be processed. (There were still jobs scheduled because the SR also kept a few special worker classes, as I thought these few jobs shouldn't be too much load. I assume that's also correct now that the piled-up asset download jobs have been processed. Also, some user temporarily enabled worker classes again to test a few jobs.)


Not sure what to do with these workers. They're likely good enough to run only a limited number of jobs but I don't know how I can bring them fully back.


I'm also wondering whether #96554 might be related after all. It is about /dev/vdc, which is used to store (and serve) assets. If this disk is slow it would of course also slow down the workers' asset downloads. That wouldn't explain why especially openqaworker5 and 6 are affected (and 3 a little bit), though.
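
One way to check that hypothesis (a sketch, not done in this ticket; requires the sysstat package on OSD) would be to watch the utilization of that disk while asset downloads are running:

iostat -x vdc 5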

Of course workers with many slots would be affected more easily, which could be an explanation. However, it doesn't really make sense because then we should have seen the issue on openqaworker2 as well (not considering grenache-1 here as it "only" runs ppc jobs):

openqa=> select host, count(id) as online_slots from workers as w where t_updated > (timezone('UTC', now()) - interval '1 hour') group by host order by online_slots desc;
        host         | online_slots 
---------------------+--------------
 grenache-1          |           49
 openqaworker5       |           41
 openqaworker2       |           34
 openqaworker6       |           29
 openqaworker-arm-2  |           20
 openqaworker-arm-3  |           20
 openqaworker3       |           19
 openqaworker13      |           16
 openqaworker9       |           16
 openqaworker8       |           16
 openqaworker11      |           10
 openqaworker10      |           10
 openqaworker-arm-1  |           10
 powerqaworker-qam-1 |            8
 QA-Power8-5-kvm     |            8
 QA-Power8-4-kvm     |            8
 malbec              |            4
 automotive-3        |            1
(18 rows)
Actions #17

Updated by mkittler over 2 years ago

@cdywan It looks like someone messed with the worker services because the regular openqa-worker@1 service (as opposed to openqa-worker-auto-restart@1) was running:

martchus@openqaworker5:~> sudo systemctl status openqa-worker@1
● openqa-worker@1.service - openQA Worker #1
   Loaded: loaded (/usr/lib/systemd/system/openqa-worker@.service; enabled-runtime; vendor preset: disabled)
  Drop-In: /etc/systemd/system/openqa-worker@.service.d
           └─20-nvme-autoformat.conf
   Active: inactive (dead) since Thu 2021-08-05 07:10:46 CEST; 4h 59min ago
  Process: 38598 ExecStart=/usr/share/openqa/script/worker --instance 1 (code=exited, status=0/SUCCESS)
  Process: 38597 ExecStartPre=/usr/bin/install -d -m 0755 -o _openqa-worker /var/lib/openqa/pool/1 (code=exited, status=0/SUCCESS)
 Main PID: 38598 (code=exited, status=0/SUCCESS)

Aug 05 07:10:46 openqaworker5 worker[38598]: [debug] [pid:38598] REST-API call: POST http://openqa.suse.de/api/v1/jobs/6652982/duplicate?dup_type_auto=1
Aug 05 07:10:46 openqaworker5 worker[38598]: [debug] [pid:38598] Unable to read result-coredump_collect.json: Can't open file "/var/lib/openqa/pool/1/testresults/result-coredump_collect.json": No such file or directory at /usr/share/openqa/script/../lib/OpenQA/Worker/Job.pm line 1098.
Aug 05 07:10:46 openqaworker5 worker[38598]: [debug] [pid:38598] REST-API call: POST http://openqa.suse.de/api/v1/jobs/6652982/status
Aug 05 07:10:46 openqaworker5 worker[38598]: [error] [pid:38598] REST-API error (POST http://openqa.suse.de/api/v1/jobs/6652982/status): 400 response: Got status update for job 6652982 with unexpected worker ID 957 (expected no updates anymore, job is done with result incomplete) (remaining tries: 0)
Aug 05 07:10:46 openqaworker5 worker[38598]: [error] [pid:38598] Unable to make final image uploads. Maybe the web UI considers this job already dead.
Aug 05 07:10:46 openqaworker5 worker[38598]: [debug] [pid:38598] Upload concluded (at bind)
Aug 05 07:10:46 openqaworker5 worker[38598]: [debug] [pid:38598] REST-API call: POST http://openqa.suse.de/api/v1/jobs/6652982/set_done?reason=quit%3A+worker+has+been+stopped+or+restarted&result=incomplete&worker_id=957
Aug 05 07:10:46 openqaworker5 worker[38598]: [debug] [pid:38598] Job 6652982 from openqa.suse.de finished - reason: quit
Aug 05 07:10:46 openqaworker5 worker[38598]: [debug] [pid:38598] Informing openqa.suse.de that we are going offline
Aug 05 07:10:46 openqaworker5 systemd[1]: Stopped openQA Worker #1.

This was likely the same user who temporarily changed the worker classes to do some tests. So I wouldn't worry about it.

Actions #18

Updated by mkittler over 2 years ago

That openqaworker2 wasn't affected could be explained by the fact that fewer jobs ran on the machine. On the other hand, the number of different assets on openqaworker2 wasn't any lower. The following table's last column shows the number of assets followed by the number of jobs per worker host:

openqa=> select host, count(id) as online_slots, (select array[count(distinct asset_id), count(distinct id)] from jobs join jobs_assets on jobs.id = jobs_assets.job_id where assigned_worker_id = any(array_agg(w.id)) and t_finished >= '2021-08-04T00:00:00' and t_finished <= '2021-08-04T18:00:00') as recent_assets_and_jobs from workers as w where t_updated > (timezone('UTC', now()) - interval '1 hour') group by host order by recent_assets_and_jobs desc;
        host         | online_slots | recent_assets_and_jobs 
---------------------+--------------+------------------------
 openqaworker13      |           16 | {529,587}
 openqaworker9       |           16 | {412,496}
 openqaworker8       |           16 | {389,519}
 openqaworker3       |           19 | {371,587}
 openqaworker10      |           10 | {339,324}
 grenache-1          |           49 | {314,329}
 openqaworker2       |           34 | {268,176}
 openqaworker6       |           29 | {229,706}
 openqaworker5       |           41 | {203,675}
 powerqaworker-qam-1 |            8 | {171,171}
 openqaworker-arm-3  |           20 | {130,339}
 QA-Power8-4-kvm     |            8 | {130,140}
 openqaworker-arm-2  |           20 | {129,320}
 QA-Power8-5-kvm     |            8 | {125,138}
 openqaworker-arm-1  |           10 | {101,157}
 malbec              |            4 | {89,84}
 openqaworker11      |           10 | {0,0}
 automotive-3        |            1 | {0,0}
(18 rows)
Actions #19

Updated by mkittler over 2 years ago

When taking the asset size (and not just the count) into account, the picture looks quite different:

openqa=> select host, count(id) as online_slots, (select array[((select sum(size) from assets where id = any(array_agg(distinct jobs_assets.asset_id))) / 1024 / 1024 / 1024), count(distinct id)] from jobs join jobs_assets on jobs.id = jobs_assets.job_id where assigned_worker_id = any(array_agg(w.id)) and t_finished >= '2021-08-04T00:00:00' and t_finished <= '2021-08-04T18:00:00') as recent_asset_size_in_gb_and_job_count from workers as w where t_updated > (timezone('UTC', now()) - interval '1 hour') group by host order by recent_asset_size_in_gb_and_job_count desc;
        host         | online_slots | recent_asset_size_in_gb_and_job_count 
---------------------+--------------+---------------------------------------
 openqaworker11      |           10 | {NULL,0}
 automotive-3        |            1 | {NULL,0}
 openqaworker5       |           41 | {529.2056695101782705,675}
 openqaworker3       |           19 | {525.3903578147292139,587}
 openqaworker6       |           29 | {514.1747308215126416,706}
 openqaworker13      |           16 | {509.4812797503545859,587}
 openqaworker8       |           16 | {504.1433167262002852,519}
 openqaworker9       |           16 | {454.4808612000197217,496}
 openqaworker10      |           10 | {381.9807668616995244,324}
 openqaworker2       |           34 | {301.2545210225507637,176}
 grenache-1          |           49 | {259.5087282918393613,329}
 openqaworker-arm-2  |           20 | {164.0323526887223145,320}
 openqaworker-arm-3  |           20 | {152.4092571316286943,339}
 openqaworker-arm-1  |           10 | {128.0922723710536953,157}
 powerqaworker-qam-1 |            8 | {109.7296858420595547,171}
 QA-Power8-4-kvm     |            8 | {98.2997177951037881,140}
 malbec              |            4 | {96.5083871735259912,84}
 QA-Power8-5-kvm     |            8 | {91.5260982112959023,138}
(18 rows)

That would actually explain why we see the problems in particular on openqaworker5, 6 and 3: they have the most worker slots and the jobs with the biggest assets assigned.

The query got quite complex, though. So it would be great if someone took a second look at whether the data is actually correct (before jumping to conclusions based on it).
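
As a starting point for such a review, a much simpler query (my sketch, not run in the ticket) should roughly reproduce the job counts in the last column; note that the original additionally joins jobs_assets, so jobs without any assets are counted differently:

openqa=> select host, count(distinct jobs.id) as job_count from jobs join workers on workers.id = jobs.assigned_worker_id where t_finished >= '2021-08-04T00:00:00' and t_finished <= '2021-08-04T18:00:00' group by host order by job_count desc;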

Actions #20

Updated by ilausuch over 2 years ago

One idea raised in a conversation: provide mirrors for assets so that not all workers download directly from OSD (see point 5 in the next comment).

Actions #21

Updated by mkittler over 2 years ago

Comment by @MDoucha on the chat:

The OSD monitor dashboard shows some kind of network problem for openqaworker5 and openqaworker6. Both have a high number of "passive opens" TCP handshake issues throughout the whole day yesterday. openqaworker8 has a comparable number of worker slots and network traffic but no such problem.
openqaworker6 (network issues): https://stats.openqa-monitor.qa.suse.de/d/WDopenqaworker6/worker-dashboard-openqaworker6?viewPanel=65094&orgId=1&refresh=1m&from=now-2d&to=now
openqaworker8 (no issues): https://stats.openqa-monitor.qa.suse.de/d/WDopenqaworker8/worker-dashboard-openqaworker8?viewPanel=65094&orgId=1&refresh=1m&from=now-2d&to=now

So this could be the culprit as well.


For now I have the following ideas:

  1. Check the impact of these "passive opens" TCP handshakes (see the sketch after this list). This is likely just a symptom of having downloads continuously ongoing.
  2. Reduce the number of worker slots on these workers as a first mitigation. This is actually done by https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/338/diffs because a few special worker classes are still configured. As shown by the queries in subsequent comments the workers still execute jobs, so I didn't file yet another workaround SR.
  3. Let workers declare themselves as broken if asset downloads are piling up (similar to how we already do it when the cache service is not available at all). I've created a separate ticket (#96623) for this.
  4. Abort inactive Minion jobs for asset downloads when the corresponding openQA job has been cancelled anyway (e.g. after the timeout has been exceeded). This would avoid starting asset downloads for jobs which are not relevant anymore. Likely that's not really required when implementing 3., though. I've created a separate ticket (#96684) for this.
  5. If it really turns out to be a scalability issue we need to look into providing mirrors for assets (so not all workers download assets directly from OSD). That's what the comment from @ilausuch is about.
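
Regarding point 1, the raw TCP counters behind that dashboard panel can be read directly on the affected hosts, e.g. (a sketch, not from this ticket; the exact counter wording varies between kernel versions):

netstat -s | grep -i passive
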
Actions #22

Updated by mkittler over 2 years ago

@nicksinger tested the network performance between OSD and openqaworker5 and 6 with iperf and it doesn't look bad (for both IPv4 and IPv6). S.M.A.R.T. also looks good on both hosts. He also observed packet loss on openqaworker6, but only over IPv4. Maybe there are general networking issues.


The "passive open" TCP connections are only 5. That's matching the number of Minion workers we use for asset downloads. However, I suspect this is rather a symptom of having continuously asset downloads running rather than the cause of the problem.


The few remaining worker classes which are still configured for these workers don't seem to overload them:

openqa=> select count(jobs.id) as job_count, (select count(id) from workers as w where w.host = workers.host and t_updated > (timezone('UTC', now()) - interval '1 hour')) as worker_slots, host from jobs left join workers on workers.id = jobs.assigned_worker_id where reason like '%setup exceeded MAX_SETUP_TIME%' and jobs.t_finished >= '2021-08-05T00:00:00' group by host order by job_count desc;
 job_count | worker_slots | host 
-----------+--------------+------
(0 rows)

But they still got some jobs done today:

openqa=> select host, count(id) as online_slots, (select array[((select sum(size) from assets where id = any(array_agg(distinct jobs_assets.asset_id))) / 1024 / 1024 / 1024), count(distinct id)] from jobs join jobs_assets on jobs.id = jobs_assets.job_id where assigned_worker_id = any(array_agg(w.id)) and t_finished >= '2021-08-05T00:00:00') as recent_asset_size_in_gb_and_job_count from workers as w where t_updated > (timezone('UTC', now()) - interval '1 hour') group by host order by recent_asset_size_in_gb_and_job_count desc;
        host         | online_slots | recent_asset_size_in_gb_and_job_count 
---------------------+--------------+---------------------------------------
 openqaworker11      |           10 | {NULL,0}
 automotive-3        |            1 | {NULL,0}
 openqaworker3       |           19 | {501.8125675506889824,463}
 openqaworker8       |           16 | {420.8258606726303672,511}
 openqaworker9       |           16 | {417.8395846914500049,463}
 openqaworker13      |           16 | {357.7716910494491484,438}
 openqaworker10      |           10 | {315.0581799233332256,251}
 openqaworker2       |           34 | {218.7443862790241816,159}
 openqaworker5       |           41 | {205.9085173439234541,198}
 openqaworker6       |           29 | {155.4249267578125000,101}
 grenache-1          |           49 | {151.8009881842881484,219}
 openqaworker-arm-2  |           20 | {130.5520288934931182,214}
 openqaworker-arm-3  |           20 | {117.5937674688175293,254}
 openqaworker-arm-1  |           10 | {117.1310680918395518,138}
 powerqaworker-qam-1 |            8 | {70.9400719152763486,48}
 QA-Power8-4-kvm     |            8 | {67.0431841108947988,50}
 QA-Power8-5-kvm     |            8 | {62.7192087164148691,47}
 malbec              |            4 | {51.0415565958246592,34}
(18 rows)

Today the overall load (in terms of number of jobs and size of the assets) is also lower than yesterday (compared to #note-19).

But before, the workers were able to handle more. These are the stats for the day before yesterday and the days before that (in descending order):

openqa=> select host, count(id) as online_slots, (select array[((select sum(size) from assets where id = any(array_agg(distinct jobs_assets.asset_id))) / 1024 / 1024 / 1024), count(distinct id)] from jobs join jobs_assets on jobs.id = jobs_assets.job_id where assigned_worker_id = any(array_agg(w.id)) and t_finished >= '2021-08-03T00:00:00' and t_finished < '2021-08-04T00:00:00') as recent_asset_size_in_gb_and_job_count from workers as w where t_updated > (timezone('UTC', now()) - interval '1 hour') group by host order by recent_asset_size_in_gb_and_job_count desc;
        host         | online_slots | recent_asset_size_in_gb_and_job_count 
---------------------+--------------+---------------------------------------
 openqaworker11      |           10 | {NULL,0}
 openqaworker5       |           41 | {564.0345656182616992,997}
 openqaworker6       |           29 | {513.9413869706913857,816}
 openqaworker3       |           19 | {444.0452355770394229,624}
 openqaworker8       |           16 | {355.8525378452613926,628}
 openqaworker13      |           16 | {324.7379645807668594,500}
 openqaworker9       |           16 | {316.6136079821735664,579}
 openqaworker10      |           10 | {242.6821001125499629,366}
 grenache-1          |           49 | {197.3507730877026943,285}
 openqaworker2       |           34 | {192.5563597902655605,232}
 QA-Power8-4-kvm     |            8 | {121.9976354017853740,84}
 powerqaworker-qam-1 |            8 | {111.7647066665813350,78}
 QA-Power8-5-kvm     |            8 | {97.0567456847056768,97}
 malbec              |            4 | {96.7560659423470498,52}
 openqaworker-arm-2  |           20 | {76.5491055669263008,309}
 openqaworker-arm-1  |           10 | {67.7944853128865361,165}
 automotive-3        |            1 | {0.000000155530869960784912,1}
(17 rows)
openqa=> select host, count(id) as online_slots, (select array[((select sum(size) from assets where id = any(array_agg(distinct jobs_assets.asset_id))) / 1024 / 1024 / 1024), count(distinct id)] from jobs join jobs_assets on jobs.id = jobs_assets.job_id where assigned_worker_id = any(array_agg(w.id)) and t_finished >= '2021-08-02T00:00:00' and t_finished < '2021-08-03T00:00:00') as recent_asset_size_in_gb_and_job_count from workers as w where t_updated > (timezone('UTC', now()) - interval '1 hour') group by host order by recent_asset_size_in_gb_and_job_count desc;
        host         | online_slots | recent_asset_size_in_gb_and_job_count 
---------------------+--------------+---------------------------------------
 openqaworker11      |           10 | {NULL,0}
 automotive-3        |            1 | {NULL,0}
 openqaworker5       |           41 | {454.5740667823702148,988}
 openqaworker6       |           29 | {408.9359143218025586,781}
 openqaworker3       |           19 | {369.7522765416652012,629}
 openqaworker8       |           16 | {302.8272499749436953,677}
 openqaworker13      |           16 | {255.5371335381641963,561}
 openqaworker9       |           16 | {251.7375126304104902,594}
 openqaworker10      |           10 | {209.3847685232758525,410}
 openqaworker2       |           34 | {177.5325910160317998,181}
 grenache-1          |           49 | {172.0760751683265020,306}
 QA-Power8-4-kvm     |            8 | {74.7613717345520850,77}
 powerqaworker-qam-1 |            8 | {67.8156854547560215,80}
 QA-Power8-5-kvm     |            8 | {67.0418267175555225,65}
 malbec              |            4 | {66.2995940865948799,38}
 openqaworker-arm-1  |           10 | {61.5950581943616270,135}
 openqaworker-arm-2  |           20 | {55.9928215481340889,284}
(17 rows)
openqa=> select host, count(id) as online_slots, (select array[((select sum(size) from assets where id = any(array_agg(distinct jobs_assets.asset_id))) / 1024 / 1024 / 1024), count(distinct id)] from jobs join jobs_assets on jobs.id = jobs_assets.job_id where assigned_worker_id = any(array_agg(w.id)) and t_finished >= '2021-08-01T00:00:00' and t_finished < '2021-08-02T00:00:00') as recent_asset_size_in_gb_and_job_count from workers as w where t_updated > (timezone('UTC', now()) - interval '1 hour') group by host order by recent_asset_size_in_gb_and_job_count desc;
        host         | online_slots | recent_asset_size_in_gb_and_job_count 
---------------------+--------------+---------------------------------------
 openqaworker11      |           10 | {NULL,0}
 automotive-3        |            1 | {NULL,0}
 openqaworker5       |           41 | {292.3224924663081738,377}
 openqaworker6       |           29 | {280.0997966835275273,338}
 openqaworker3       |           19 | {243.8785007568076230,222}
 openqaworker13      |           16 | {119.8393164910376074,169}
 openqaworker9       |           16 | {111.7239799592644023,210}
 openqaworker8       |           16 | {97.6359134484082510,223}
 openqaworker10      |           10 | {72.9029493164271113,133}
 openqaworker2       |           34 | {52.4286611760035156,45}
 powerqaworker-qam-1 |            8 | {31.4961270811036230,10}
 grenache-1          |           49 | {30.9796929555013779,168}
 openqaworker-arm-2  |           20 | {22.3183024693280459,39}
 openqaworker-arm-1  |           10 | {20.9576080394908789,15}
 QA-Power8-5-kvm     |            8 | {16.6074772160500283,8}
 QA-Power8-4-kvm     |            8 | {10.3167545357719063,12}
 malbec              |            4 | {9.4528544545173645,2}
(17 rows)
openqa=> select host, count(id) as online_slots, (select array[((select sum(size) from assets where id = any(array_agg(distinct jobs_assets.asset_id))) / 1024 / 1024 / 1024), count(distinct id)] from jobs join jobs_assets on jobs.id = jobs_assets.job_id where assigned_worker_id = any(array_agg(w.id)) and t_finished >= '2021-07-30T00:00:00' and t_finished < '2021-08-31T00:00:00') as recent_asset_size_in_gb_and_job_count from workers as w where t_updated > (timezone('UTC', now()) - interval '1 hour') group by host order by recent_asset_size_in_gb_and_job_count desc;
        host         | online_slots | recent_asset_size_in_gb_and_job_count 
---------------------+--------------+---------------------------------------
 openqaworker11      |           10 | {NULL,0}
 openqaworker8       |           16 | {1501.5539881587028506,4055}
 openqaworker13      |           16 | {1482.9567158073186875,3524}
 openqaworker3       |           19 | {1405.1550275972113037,3965}
 openqaworker9       |           16 | {1363.2696598181501006,3535}
 openqaworker5       |           41 | {1161.9894029386341572,5248}
 openqaworker10      |           10 | {1140.3377235047519209,2253}
 openqaworker6       |           29 | {1118.6922541307285400,4544}
 openqaworker2       |           34 | {806.7370675625279521,1175}
 grenache-1          |           49 | {625.7008608793839834,1850}
 openqaworker-arm-2  |           20 | {325.4278003461658955,1667}
 openqaworker-arm-1  |           10 | {289.5065780291333770,891}
 QA-Power8-4-kvm     |            8 | {239.9003790505230430,532}
 powerqaworker-qam-1 |            8 | {225.4872193448245527,590}
 malbec              |            4 | {216.3586414456367490,298}
 QA-Power8-5-kvm     |            8 | {195.2722742008045293,495}
 automotive-3        |            1 | {0.00045777391642332077,117}
(17 rows)
openqa=> select host, count(id) as online_slots, (select array[((select sum(size) from assets where id = any(array_agg(distinct jobs_assets.asset_id))) / 1024 / 1024 / 1024), count(distinct id)] from jobs join jobs_assets on jobs.id = jobs_assets.job_id where assigned_worker_id = any(array_agg(w.id)) and t_finished >= '2021-07-29T00:00:00' and t_finished < '2021-08-30T00:00:00') as recent_asset_size_in_gb_and_job_count from workers as w where t_updated > (timezone('UTC', now()) - interval '1 hour') group by host order by recent_asset_size_in_gb_and_job_count desc;
        host         | online_slots | recent_asset_size_in_gb_and_job_count 
---------------------+--------------+---------------------------------------
 openqaworker11      |           10 | {NULL,0}
 openqaworker8       |           16 | {1564.1907451339066025,4874}
 openqaworker13      |           16 | {1527.8008457580581279,4121}
 openqaworker3       |           19 | {1454.0253826659172822,4707}
 openqaworker9       |           16 | {1420.8450045976787852,4253}
 openqaworker5       |           41 | {1254.6822126023471357,6388}
 openqaworker10      |           10 | {1199.0791805796325205,2715}
 openqaworker6       |           29 | {1195.0145842516794775,5654}
 openqaworker2       |           34 | {877.3261965857818701,1464}
 grenache-1          |           49 | {726.2261848598718643,2345}
 openqaworker-arm-2  |           20 | {394.4984608776867393,2447}
 QA-Power8-4-kvm     |            8 | {374.7659980067983242,729}
 openqaworker-arm-1  |           10 | {369.4309996552765371,1296}
 QA-Power8-5-kvm     |            8 | {336.1131262471899414,666}
 powerqaworker-qam-1 |            8 | {332.0527856061235049,848}
 malbec              |            4 | {296.4981770850718018,416}
 automotive-3        |            1 | {0.00068451650440692902,173}
(17 rows)

So something has definitely deteriorated.

Actions #23

Updated by mkittler over 2 years ago

  • Related to action #96554: Mitigate on-going disk I/O alerts size:M added
Actions #24

Updated by mkittler over 2 years ago

The packet loss and ping times when using IPv4 to ping OSD are really quite high on openqaworker6:

martchus@openqaworker6:~> ping -4 openqa.suse.de
PING openqa.suse.de (10.160.0.207) 56(84) bytes of data.
64 bytes from salt (10.160.0.207): icmp_seq=7 ttl=64 time=2.92 ms
64 bytes from salt (10.160.0.207): icmp_seq=11 ttl=64 time=3.08 ms
64 bytes from salt (10.160.0.207): icmp_seq=15 ttl=64 time=2.51 ms
64 bytes from salt (10.160.0.207): icmp_seq=21 ttl=64 time=2.94 ms
64 bytes from salt (10.160.0.207): icmp_seq=23 ttl=64 time=3.00 ms
64 bytes from salt (10.160.0.207): icmp_seq=26 ttl=64 time=2.77 ms
64 bytes from salt (10.160.0.207): icmp_seq=30 ttl=64 time=2.86 ms
64 bytes from salt (10.160.0.207): icmp_seq=32 ttl=64 time=3.85 ms
64 bytes from salt (10.160.0.207): icmp_seq=37 ttl=64 time=2.99 ms
64 bytes from salt (10.160.0.207): icmp_seq=41 ttl=64 time=2.87 ms
64 bytes from salt (10.160.0.207): icmp_seq=44 ttl=64 time=2.36 ms
64 bytes from salt (10.160.0.207): icmp_seq=51 ttl=64 time=3.40 ms
64 bytes from salt (10.160.0.207): icmp_seq=53 ttl=64 time=3.04 ms
64 bytes from salt (10.160.0.207): icmp_seq=54 ttl=64 time=2.87 ms
64 bytes from salt (10.160.0.207): icmp_seq=55 ttl=64 time=2.80 ms
64 bytes from salt (10.160.0.207): icmp_seq=61 ttl=64 time=3.00 ms
64 bytes from salt (10.160.0.207): icmp_seq=63 ttl=64 time=3.26 ms
64 bytes from salt (10.160.0.207): icmp_seq=64 ttl=64 time=3.06 ms
64 bytes from salt (10.160.0.207): icmp_seq=71 ttl=64 time=2.93 ms
64 bytes from salt (10.160.0.207): icmp_seq=78 ttl=64 time=2.83 ms
64 bytes from salt (10.160.0.207): icmp_seq=80 ttl=64 time=2.92 ms
64 bytes from salt (10.160.0.207): icmp_seq=86 ttl=64 time=3.03 ms
64 bytes from salt (10.160.0.207): icmp_seq=88 ttl=64 time=3.02 ms
64 bytes from salt (10.160.0.207): icmp_seq=89 ttl=64 time=2.90 ms
64 bytes from salt (10.160.0.207): icmp_seq=91 ttl=64 time=2.95 ms
64 bytes from salt (10.160.0.207): icmp_seq=93 ttl=64 time=2.87 ms
64 bytes from salt (10.160.0.207): icmp_seq=94 ttl=64 time=2.75 ms
64 bytes from salt (10.160.0.207): icmp_seq=101 ttl=64 time=2.90 ms
64 bytes from salt (10.160.0.207): icmp_seq=102 ttl=64 time=2.70 ms
64 bytes from salt (10.160.0.207): icmp_seq=108 ttl=64 time=3.23 ms
64 bytes from salt (10.160.0.207): icmp_seq=111 ttl=64 time=2.77 ms
64 bytes from salt (10.160.0.207): icmp_seq=113 ttl=64 time=3.10 ms
64 bytes from salt (10.160.0.207): icmp_seq=117 ttl=64 time=3.10 ms
64 bytes from salt (10.160.0.207): icmp_seq=118 ttl=64 time=2.67 ms
^C
--- openqa.suse.de ping statistics ---
118 packets transmitted, 34 received, 71% packet loss, time 118069ms
rtt min/avg/max/mdev = 2.363/2.953/3.858/0.256 ms
martchus@openqaworker6:~> ping -6 openqa.suse.de
PING openqa.suse.de(openqa.suse.de (2620:113:80c0:8080:10:160:0:207)) 56 data bytes
64 bytes from openqa.suse.de (2620:113:80c0:8080:10:160:0:207): icmp_seq=1 ttl=64 time=0.177 ms
64 bytes from openqa.suse.de (2620:113:80c0:8080:10:160:0:207): icmp_seq=2 ttl=64 time=1.64 ms
64 bytes from openqa.suse.de (2620:113:80c0:8080:10:160:0:207): icmp_seq=3 ttl=64 time=0.244 ms
64 bytes from openqa.suse.de (2620:113:80c0:8080:10:160:0:207): icmp_seq=4 ttl=64 time=0.268 ms
64 bytes from openqa.suse.de (2620:113:80c0:8080:10:160:0:207): icmp_seq=5 ttl=64 time=0.199 ms
64 bytes from openqa.suse.de (2620:113:80c0:8080:10:160:0:207): icmp_seq=6 ttl=64 time=0.411 ms
64 bytes from openqa.suse.de (2620:113:80c0:8080:10:160:0:207): icmp_seq=7 ttl=64 time=0.686 ms
64 bytes from openqa.suse.de (2620:113:80c0:8080:10:160:0:207): icmp_seq=8 ttl=64 time=0.569 ms
64 bytes from openqa.suse.de (2620:113:80c0:8080:10:160:0:207): icmp_seq=9 ttl=64 time=0.706 ms
64 bytes from openqa.suse.de (2620:113:80c0:8080:10:160:0:207): icmp_seq=10 ttl=64 time=0.843 ms
64 bytes from openqa.suse.de (2620:113:80c0:8080:10:160:0:207): icmp_seq=11 ttl=64 time=0.437 ms
64 bytes from openqa.suse.de (2620:113:80c0:8080:10:160:0:207): icmp_seq=12 ttl=64 time=0.868 ms
64 bytes from openqa.suse.de (2620:113:80c0:8080:10:160:0:207): icmp_seq=13 ttl=64 time=0.749 ms
64 bytes from openqa.suse.de (2620:113:80c0:8080:10:160:0:207): icmp_seq=14 ttl=64 time=0.680 ms
64 bytes from openqa.suse.de (2620:113:80c0:8080:10:160:0:207): icmp_seq=15 ttl=64 time=0.614 ms
64 bytes from openqa.suse.de (2620:113:80c0:8080:10:160:0:207): icmp_seq=16 ttl=64 time=0.747 ms
64 bytes from openqa.suse.de (2620:113:80c0:8080:10:160:0:207): icmp_seq=17 ttl=64 time=0.627 ms
64 bytes from openqa.suse.de (2620:113:80c0:8080:10:160:0:207): icmp_seq=18 ttl=64 time=0.703 ms
64 bytes from openqa.suse.de (2620:113:80c0:8080:10:160:0:207): icmp_seq=19 ttl=64 time=0.806 ms
64 bytes from openqa.suse.de (2620:113:80c0:8080:10:160:0:207): icmp_seq=20 ttl=64 time=0.669 ms
64 bytes from openqa.suse.de (2620:113:80c0:8080:10:160:0:207): icmp_seq=21 ttl=64 time=0.801 ms
64 bytes from openqa.suse.de (2620:113:80c0:8080:10:160:0:207): icmp_seq=22 ttl=64 time=0.661 ms
64 bytes from openqa.suse.de (2620:113:80c0:8080:10:160:0:207): icmp_seq=23 ttl=64 time=0.717 ms
64 bytes from openqa.suse.de (2620:113:80c0:8080:10:160:0:207): icmp_seq=24 ttl=64 time=0.866 ms
64 bytes from openqa.suse.de (2620:113:80c0:8080:10:160:0:207): icmp_seq=25 ttl=64 time=1.06 ms
64 bytes from openqa.suse.de (2620:113:80c0:8080:10:160:0:207): icmp_seq=26 ttl=64 time=0.730 ms
64 bytes from openqa.suse.de (2620:113:80c0:8080:10:160:0:207): icmp_seq=27 ttl=64 time=0.387 ms
64 bytes from openqa.suse.de (2620:113:80c0:8080:10:160:0:207): icmp_seq=28 ttl=64 time=0.686 ms
64 bytes from openqa.suse.de (2620:113:80c0:8080:10:160:0:207): icmp_seq=29 ttl=64 time=0.810 ms
64 bytes from openqa.suse.de (2620:113:80c0:8080:10:160:0:207): icmp_seq=30 ttl=64 time=0.606 ms
64 bytes from openqa.suse.de (2620:113:80c0:8080:10:160:0:207): icmp_seq=31 ttl=64 time=0.576 ms
64 bytes from openqa.suse.de (2620:113:80c0:8080:10:160:0:207): icmp_seq=32 ttl=64 time=0.599 ms
^C
--- openqa.suse.de ping statistics ---
32 packets transmitted, 32 received, 0% packet loss, time 31684ms
rtt min/avg/max/mdev = 0.177/0.660/1.640/0.271 ms

When pinging wotan or an external host (via IPv4 or 6) the packet loss does not occur. However, when pinging OSD from another worker (including openqaworker5) the problem is also not reproducible.
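
To localize where on the path the IPv4 loss occurs, a traceroute-style probe could be run next (a sketch, not part of this ticket; requires mtr):

mtr -4 -rw -c 100 openqa.suse.de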

The connection speed seems fine, though:

martchus@openqaworker6:~> iperf3 -4 -c openqa.suse.de -i 1 -t 30
Connecting to host openqa.suse.de, port 5201
[  5] local 10.160.1.100 port 50254 connected to 10.160.0.207 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  68.9 MBytes   578 Mbits/sec  438   62.2 KBytes       
[  5]   1.00-2.00   sec  78.3 MBytes   657 Mbits/sec  495    130 KBytes       
[  5]   2.00-3.00   sec  73.3 MBytes   615 Mbits/sec  521   49.5 KBytes       
[  5]   3.00-4.00   sec  77.8 MBytes   653 Mbits/sec  511    106 KBytes       
[  5]   4.00-5.00   sec  76.2 MBytes   640 Mbits/sec  711   67.9 KBytes       
[  5]   5.00-6.00   sec  75.3 MBytes   632 Mbits/sec  353    148 KBytes       
[  5]   6.00-7.00   sec  78.4 MBytes   658 Mbits/sec  733   50.9 KBytes       
[  5]   7.00-8.00   sec  69.2 MBytes   580 Mbits/sec  476   62.2 KBytes       
[  5]   8.00-9.00   sec  70.7 MBytes   593 Mbits/sec  450   89.1 KBytes       
[  5]   9.00-10.00  sec  72.9 MBytes   611 Mbits/sec  482    276 KBytes       
[  5]  10.00-11.00  sec  75.8 MBytes   635 Mbits/sec  401   79.2 KBytes       
[  5]  11.00-12.00  sec  80.6 MBytes   676 Mbits/sec  739   66.5 KBytes       
[  5]  12.00-13.00  sec  74.9 MBytes   629 Mbits/sec  857    272 KBytes       
[  5]  13.00-14.00  sec  75.8 MBytes   635 Mbits/sec  552   50.9 KBytes       
[  5]  14.00-15.00  sec  76.4 MBytes   641 Mbits/sec  586    129 KBytes       
[  5]  15.00-16.00  sec  74.8 MBytes   628 Mbits/sec  403   93.3 KBytes       
[  5]  16.00-17.00  sec  72.7 MBytes   610 Mbits/sec  541    147 KBytes       
[  5]  17.00-18.00  sec  63.9 MBytes   536 Mbits/sec  344   79.2 KBytes       
[  5]  18.00-19.00  sec  70.0 MBytes   587 Mbits/sec  249   52.3 KBytes       
[  5]  19.00-20.00  sec  79.4 MBytes   666 Mbits/sec  559   93.3 KBytes       
[  5]  20.00-21.00  sec  78.0 MBytes   655 Mbits/sec  376   91.9 KBytes       
[  5]  21.00-22.00  sec  77.7 MBytes   652 Mbits/sec  594    274 KBytes       
[  5]  22.00-23.00  sec  71.6 MBytes   601 Mbits/sec  546    158 KBytes       
[  5]  23.00-24.00  sec  66.9 MBytes   561 Mbits/sec  279   89.1 KBytes       
[  5]  24.00-25.00  sec  68.5 MBytes   575 Mbits/sec  444   67.9 KBytes       
[  5]  25.00-26.00  sec  75.0 MBytes   629 Mbits/sec  628   99.0 KBytes       
[  5]  26.00-27.00  sec  83.0 MBytes   696 Mbits/sec  239    175 KBytes       
[  5]  27.00-28.00  sec  75.8 MBytes   636 Mbits/sec  143    100 KBytes       
[  5]  28.00-29.00  sec  76.5 MBytes   642 Mbits/sec  308    112 KBytes       
[  5]  29.00-30.00  sec  55.6 MBytes   466 Mbits/sec  154   90.5 KBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-30.00  sec  2.16 GBytes   619 Mbits/sec  14112             sender
[  5]   0.00-30.00  sec  2.16 GBytes   619 Mbits/sec                  receiver

iperf Done.

It looks the same with IPv6. I've also tested it on openqaworker5 (which had many incomplete jobs) and openqaworker8 (which had none); there the figures are slightly higher (sometimes reaching the 800 Mbit/s mark).
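
The high retransmission count in the sender summary above (14112 over 30 s) might also be worth comparing across workers. iperf3 can additionally test the reverse direction and parallel streams, which would show whether the retransmissions are direction- or congestion-specific; a sketch using standard iperf3 flags:

martchus@openqaworker6:~> iperf3 -4 -c openqa.suse.de -t 30 -R    # reverse: OSD sends, worker receives
martchus@openqaworker6:~> iperf3 -4 -c openqa.suse.de -t 30 -P 4  # four parallel streams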


With the connection speed ruled out, that leaves /dev/vdc on OSD being too slow as a possible cause. So I've tried downloading an asset specifically:

martchus@openqaworker5:~> wget -4 'https://openqa.suse.de/tests/6598846/asset/iso/SLE-15-SP4-Online-x86_64-Build18.1-Media1.iso'
--2021-08-06 12:53:37--  https://openqa.suse.de/tests/6598846/asset/iso/SLE-15-SP4-Online-x86_64-Build18.1-Media1.iso
Resolving openqa.suse.de (openqa.suse.de)… 10.160.0.207
Connecting to openqa.suse.de (openqa.suse.de)|10.160.0.207|:443 … connected.
HTTP request sent, awaiting response … 302 Found
Location: /assets/iso/SLE-15-SP4-Online-x86_64-Build18.1-Media1.iso [following]
--2021-08-06 12:53:37--  https://openqa.suse.de/assets/iso/SLE-15-SP4-Online-x86_64-Build18.1-Media1.iso
Reusing existing connection to openqa.suse.de:443.
HTTP request sent, awaiting response … 200 OK
Length: 412090368 (393M) [application/x-iso9660-image]
Saving to: 'SLE-15-SP4-Online-x86_64-Build18.1-Media1.iso'

SLE-15-SP4-Online-x86_64-Build18.1-Media1.iso                 100%[===================================================================>] 393.00M  16.9MB/s    in 20s

2021-08-06 12:53:57 (20.1 MB/s) - 'SLE-15-SP4-Online-x86_64-Build18.1-Media1.iso' saved [412090368/412090368]

I've tried it multiple times in a row. At first I got 20.1 MB/s to 22.9 MB/s, which already doesn't seem too bad to me. The speed also matches the I/O traffic I observed yesterday via htop when looking at the relevant Minion processes. I also did some tests on the "good" worker openqaworker8, where I got 42.8 MB/s to 47.7 MB/s, more than double. However, after downloading the asset several times on both hosts (interleaved) I reached a very high download speed of around 60 MB/s on both workers. So I assume the download isn't really slower on "bad" hosts like openqaworker5 than on "good" hosts like openqaworker8; it is mainly a matter of how fast OSD can serve the asset, which depends heavily on caching.

I've also tested the download on openqaworker6 with wget -4 … and despite the packet loss shown by ping -4 … the speed also reaches around 60 MB/s.
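
For a more repeatable comparison that avoids writing to local disk, the asset could be fetched straight to /dev/null with curl reporting the average speed; a sketch (-L follows the 302 redirect seen above):

martchus@openqaworker5:~> for i in 1 2 3; do curl -4 -sL -o /dev/null -w '%{speed_download} bytes/s\n' 'https://openqa.suse.de/tests/6598846/asset/iso/SLE-15-SP4-Online-x86_64-Build18.1-Media1.iso'; done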

So /dev/vdc is not slow per se either. The values measured on OSD itself (while other processes likely access /dev/vdc as well) don't look bad either:

martchus@openqa:~> sudo hdparm -Tt /dev/vdc 

/dev/vdc:
 Timing cached reads:   5682 MB in  2.00 seconds = 2844.42 MB/sec
 Timing buffered disk reads: 276 MB in  3.01 seconds =  91.66 MB/sec

However, when many worker slots hammer it concurrently the performance likely degrades considerably (which affects the workers with a high asset load the most).
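
hdparm only measures a single sequential reader; to approximate many worker slots downloading at once, something like fio with several concurrent readers could be run read-only against the device. A sketch, assuming fio is installed (numjobs models the concurrency):

martchus@openqa:~> sudo fio --name=concurrent-read --filename=/dev/vdc --readonly --rw=randread --bs=1M --direct=1 --numjobs=16 --time_based --runtime=60 --group_reporting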


These figures don't give us much to file an infra ticket with. Not sure what to do next.

Actions #25

Updated by mkittler over 2 years ago

Now there are 3 more jobs:

openqa=> select count(jobs.id) as job_count, (select count(id) from workers as w where w.host = workers.host and t_updated > (timezone('UTC', now()) - interval '1 hour')) as worker_slots, host from jobs left join workers on workers.id = jobs.assigned_worker_id where reason like '%setup exceeded MAX_SETUP_TIME%' and jobs.t_finished >= '2021-08-05T00:00:00' group by host order by job_count desc;
 job_count | worker_slots |        host        
-----------+--------------+--------------------
         1 |           10 | openqaworker-arm-1
         1 |           19 | openqaworker3
         1 |           29 | openqaworker6
(3 rows)
openqa=> select id, result, reason, (select host from workers where id = assigned_worker_id) as worker from jobs where reason like '%setup exceeded MAX_SETUP_TIME%' and t_finished >= '2021-08-05T00:00:00' order by id;
   id    |      result      |                 reason                 |       worker       
---------+------------------+----------------------------------------+--------------------
 6654913 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker3
 6656551 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker-arm-1
 6656603 | timeout_exceeded | timeout: setup exceeded MAX_SETUP_TIME | openqaworker6
(3 rows)

Looks like syncing tests/needles took too long in these cases, e.g. it took an hour on openqaworker-arm-1 and openqaworker3. Nothing was stuck here; it was just slow.
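
To see whether such cases are trending up or down, the same filter can be aggregated per day (a sketch using only the columns already queried above):

openqa=> select date_trunc('day', t_finished) as day, count(*) as job_count from jobs where reason like '%setup exceeded MAX_SETUP_TIME%' group by day order by day;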

Actions #26

Updated by mkittler over 2 years ago

There are also cache service related coredumps on those hosts:

martchus@openqaworker6:~> sudo coredumpctl info 3775
           PID: 3775 (openqa-workerca)
           UID: 486 (_openqa-worker)
           GID: 65534 (nogroup)
        Signal: 3 (QUIT)
     Timestamp: Wed 2021-08-04 15:58:37 CEST (2 days ago)
  Command Line: /usr/bin/perl /usr/share/openqa/script/openqa-workercache prefork -m production -i 100 -H 400 -w 4 -G 80
    Executable: /usr/bin/perl
 Control Group: /system.slice/openqa-worker-cacheservice.service
          Unit: openqa-worker-cacheservice.service
         Slice: system.slice
       Boot ID: 0cbc03c70c1b4c3594d42f7307ef8c7f
    Machine ID: c8fa1ae99ae4d5730af2d97658d3edd9
      Hostname: openqaworker6
       Storage: /var/lib/systemd/coredump/core.openqa-workerca.486.0cbc03c70c1b4c3594d42f7307ef8c7f.3775.1628085517000000.lz4
       Message: Process 3775 (openqa-workerca) of user 486 dumped core.

                Stack trace of thread 3775:
                #0  0x0000556350927344 Perl_hv_common (perl)
                #1  0x00005563508b30cf n/a (perl)
                #2  0x00005563508b3bf6 Perl_gv_fetchmeth_pvn (perl)
                #3  0x00005563509398d3 n/a (perl)
                #4  0x000055635093a2f8 Perl_sv_clear (perl)
                #5  0x000055635093a7b1 Perl_sv_free2 (perl)
                #6  0x000055635096aa67 Perl_free_tmps (perl)
                #7  0x00005563508ad5d9 perl_run (perl)
                #8  0x0000556350884ef2 main (perl)
                #9  0x00007efe102e534a __libc_start_main (libc.so.6)
                #10 0x0000556350884f3a _start (perl)
martchus@openqaworker5:~> sudo coredumpctl info 38544
           PID: 38544 (openqa-workerca)
           UID: 486 (_openqa-worker)
           GID: 65534 (nogroup)
        Signal: 3 (QUIT)
     Timestamp: Wed 2021-08-04 22:21:07 CEST (1 day 19h ago)
  Command Line: /usr/bin/perl /usr/share/openqa/script/openqa-workercache prefork -m production -i 100 -H 400 -w 4 -G 80
    Executable: /usr/bin/perl
 Control Group: /system.slice/openqa-worker-cacheservice.service
          Unit: openqa-worker-cacheservice.service
         Slice: system.slice
       Boot ID: 549d9e3d0d0e4ac08b8aef53a6bcfb98
    Machine ID: 9b632e64195547add4173e3c5878a024
      Hostname: openqaworker5
       Storage: /var/lib/systemd/coredump/core.openqa-workerca.486.549d9e3d0d0e4ac08b8aef53a6bcfb98.38544.1628108467000000.lz4
       Message: Process 38544 (openqa-workerca) of user 486 dumped core.

                Stack trace of thread 38544:
                #0  0x0000557588e1703b Perl_sv_clear (perl)
                #1  0x0000557588e177b1 Perl_sv_free2 (perl)
                #2  0x0000557588e47a67 Perl_free_tmps (perl)
                #3  0x0000557588d8a5d9 perl_run (perl)
                #4  0x0000557588d61ef2 main (perl)
                #5  0x00007f1c98f3434a __libc_start_main (libc.so.6)
                #6  0x0000557588d61f3a _start (perl)
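
Notably, both dumps were triggered by signal 3 (QUIT), which the Mojolicious prefork manager uses to stop workers gracefully, so these might be stop-related rather than spontaneous crashes. For better-resolved frames than the "n/a" entries above, the dump can be opened in gdb directly (assuming matching perl debuginfo packages are installed):

martchus@openqaworker6:~> sudo coredumpctl gdb 3775
(gdb) bt full
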
Actions #27

Updated by livdywan over 2 years ago

  • Related to action #96623: Let workers declare themselves as broken if asset downloads are piling up size:M added
Actions #28

Updated by mkittler over 2 years ago

Since jobs are piling up on OSD, and since with the currently reduced number of jobs on openqaworker5 and 6 no further jobs are running into MAX_SETUP_TIME, I've created https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/339/diffs to increase the number of jobs we run on those workers again. The number of slots will still be reduced compared to what we usually had.
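
Once such a pillar change is merged it still has to be applied on the affected minions; a minimal sketch of doing that from the salt master (the minion IDs used here are assumptions about how these workers are matched):

martchus@openqa:~> sudo salt -L 'openqaworker5.suse.de,openqaworker6.suse.de' state.apply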

Actions #29

Updated by mkittler over 2 years ago

The SR has been merged and the worker slots are in use now. select id, result, reason, (select host from workers where id = assigned_worker_id) as worker from jobs where reason like '%setup exceeded MAX_SETUP_TIME%' and t_finished >= '2021-08-06T11:00:00' order by id; still returns no results, and the Minion dashboard looks good as well. So it doesn't look like it was too much; possibly we can enable even more worker slots.

Actions #30

Updated by mkittler over 2 years ago

This problem is possibly related to a recent VLAN change in the QA lab which created a network loop. That should be resolved by now. I also don't see the packet loss anymore (when pinging OSD from openqaworker6 via IPv4).

To check we could get back to the usual number of worker slots: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/341

Actions #31

Updated by vpelcak over 2 years ago

Hello.

This issue seems to be pretty severe and took ~5 days to resolve.
Do you have any post-mortem analysis? What needs to be done not only to prevent the issue but also resolve it faster?

Actions #32

Updated by mkittler over 2 years ago

  • Related to action #96684: Abort asset download via the cache service when related job runs into a timeout (or is otherwise cancelled) size:M added
Actions #33

Updated by livdywan over 2 years ago

vpelcak wrote:

This issue seems to be pretty severe and took ~5 days to resolve.
Do you have any post-mortem analysis? What needs to be done not only to prevent the issue but also resolve it faster?

We've been discussing this daily, and I'm not sure we know yet what the issue is or was, or whether it's fixed. Please also keep in mind that we had no alerts within this period due to infra issues. I appreciate your concern, but I think the team is handling this sensibly.

Actions #34

Updated by mkittler over 2 years ago

And the issue isn't even completely resolved. I'm still checking for those incompletes now that the workers are back. So far it looks good, though.

Making progress with this issue took long because all I could do was measure the performance and load of these workers, and with the information I could gather it was just not possible to pin the problem down precisely enough, e.g. to file an Infra ticket. Only after asking why the QA lab had to be taken offline did it become clear that the networking problems (not caused by work of the tools team) had been the culprit. I guess better communication about the work in the QA labs, or at least about the resulting problems, would have helped.

Note that, after all, actions outside the tools team both caused and resolved the problem (if it is actually resolved now; I'm still monitoring as mentioned). So Infra (or whoever was involved) should likely also think about what they can improve. I was merely looking at symptoms and could reduce the number of incompletes while the problem was apparent by manually reducing the number of worker slots. This went quite smoothly via salt, apart from one problem I've filed a ticket about (#96683). I've also already filed some tickets with ideas for improving openQA so it can cope better with situations like this one (#96623, #96684).

Actions #35

Updated by mkittler over 2 years ago

A new occurrence but this time on openqaworker10: https://openqa.suse.de/tests/6789948

I've checked the Minion dashboard. Looks like the download of http://openqa.suse.de/tests/6789948/asset/iso/SLE-15-SP2-Installer-DVD-x86_64-GM-DVD1.iso took almost an hour with one retry in the middle. The download eventually succeeded but it was already too late.
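
Besides the dashboard, the cache service queue can also be inspected on the worker itself; a sketch assuming the default cache location and the table layout of Minion's SQLite backend (both the path and the schema are assumptions here):

martchus@openqaworker10:~> sudo sqlite3 /var/lib/openqa/cache/cache.sqlite "select id, task, state, retries, created, finished from minion_jobs order by id desc limit 10;"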

Actions #36

Updated by mkittler over 2 years ago

A few more occurrences happened 12 hours ago, this time again on openqaworker5 waiting for tests to sync:

openqa=> select id, result, t_finished, reason, (select host from workers where id = assigned_worker_id) as worker from jobs where reason like '%setup exceeded MAX_SETUP_TIME%' and t_finished >= '2021-08-06T11:00:00' order by t_finished asc;
   id    |      result      |     t_finished      |                 reason                 |     worker     
---------+------------------+---------------------+----------------------------------------+----------------
 6789948 | timeout_exceeded | 2021-08-10 10:44:41 | timeout: setup exceeded MAX_SETUP_TIME | openqaworker10
 6796355 | timeout_exceeded | 2021-08-10 22:31:18 | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5
 6796465 | timeout_exceeded | 2021-08-10 22:58:18 | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5
 6796613 | timeout_exceeded | 2021-08-10 23:15:50 | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5
 6796612 | timeout_exceeded | 2021-08-10 23:15:50 | timeout: setup exceeded MAX_SETUP_TIME | openqaworker5
(5 rows)

Considering we've increased MAX_SETUP_TIME from one hour to two hours for workers with many slots (like openqaworker5), we're not back to where we were before.

Btw, syncing the tests took at most 9 minutes here and nothing was stuck. So everything was just slow, and syncing tests is simply the last part of the setup, which is where the timeout was exceeded.

Currently there are 40 inactive download jobs, and one of the active ones has already taken 24 minutes so far (SLE-15-SP3-Full-x86_64-GM-Media1.iso). Among these 40 inactive download jobs, 11 are for caching tests. Maybe those should have a higher priority.

EDIT: The 40 jobs could actually be processed in time before another test exceeded MAX_SETUP_TIME.

Actions #37

Updated by mkittler over 2 years ago

PR for adjusting the prio as mentioned in the last paragraph of the previous comment: https://github.com/os-autoinst/openQA/pull/4118

Actions #38

Updated by livdywan over 2 years ago

Actions #39

Updated by mkittler over 2 years ago

  • Priority changed from Immediate to High

#96710 is the most problematic issue right now, and the remaining jobs exceeding MAX_SETUP_TIME might still be a symptom of the high number of incompletes we see due to that other issue. The numbers are also relatively small (compared to #96710), e.g. the query from the previous comment now "only" returns 40 rows.

Actions #40

Updated by mkittler over 2 years ago

  • Status changed from In Progress to Resolved

It looks like this problem has been resolved now. The query still only returns 40 rows. I suppose it helped that the network loop has been fixed and that #96710 has at least been worked around on all production workers.

We've also been following up with https://progress.opensuse.org/issues/96623, which has already been deployed. So I would leave further improvements of openQA itself (to handle such a situation better) to the follow-up tickets which have already been created (see previous comments).

Actions #41

Updated by okurz over 2 years ago

  • Related to action #98307: Many jobs in o3 fail with timeout_exceeded on openqaworker1 auto_review:"timeout: setup exceeded MAX_SETUP_TIME":retry size:M added
Actions #42

Updated by okurz over 2 years ago

  • Related to coordination #99579: [epic][retro] Follow-up to "Published QCOW images appear to be uncompressed" added