action #154345
closed
Incomplete jobs (not restarted) of last 24h alert Salt
From Grafana FIRING:1:
- DONE Add a silence
- View dashboard
- View panel
Rollback steps
- DONE Remove silence from
Updated by livdywan about 1 year ago
- 470 at its peak in the middle of the European night.
- shows 8 incompletes right now, Grafana says 317.
Updated by mkittler about 1 year ago
The list of recent incompletes is really dominated by asset download failures:
openqa=> select count(id), substring(reason from 0 for 60) as reason_substr from jobs where t_finished >= '2024-01-22T00:00:00' and result = 'incomplete' and clone_id is null group by reason_substr order by count(id) desc;
count | reason_substr
78 | asset failure: Failed to download sle-micro-6.0-x86_64-10.1
76 | asset failure: Failed to download sle-micro-6.0-aarch64-10.
58 | asset failure: Failed to download SLES15-SP5-Minimal-VM.x86
40 | asset failure: Failed to download dev_tools.dud to /var/lib
39 | tests died: unable to load, check the log for the c
38 | asset failure: Failed to download SLES-15-SP6-x86_64-Build4
20 | asset failure: Failed to download sle-15-SP6-x86_64-45.1-gn
20 | asset failure: Failed to download sle-15-SP6-x86_64-40.1-te
15 | asset failure: Failed to download SLE-15-SP6-Full-aarch64-B
15 | backend died: QMP command migrate failed: GenericError; Sta
12 | tests died: unable to load tests/network/samba/samba_adcli.
12 | asset failure: Failed to download sle-15-SP6-ppc64le-45.1-g
11 | backend died: QEMU terminated before QMP connection could b
10 | asset failure: Failed to download sle-15-SP4-x86_64-2024012
10 | asset failure: Failed to download sle-15-SP6-aarch64-Build4
10 | tests died: unable to load tests/yast2_gui/yast2_bootloader
10 | asset failure: Failed to download sle-15-SP6-x86_64-39.1-gn
9 | asset failure: Failed to download SLE-Micro.x86_64-6.0-Base
9 | asset failure: Failed to download autoyast-SLES-12SP5-x86_6
8 | asset failure: Failed to download sle-15-SP6-aarch64-45.1-g
8 | asset failure: Failed to download sle-15-SP5-x86_64-120.11-
8 | asset failure: Failed to download SLE-15-SP6-ppc64le-Build4
8 | asset failure: Failed to download sle-15-SP5-ppc64le-Build1
Maybe that's due to me reducing asset storage limits. The 40 jobs about dev_tools.dud might be related to
There were also 39 incompletes due to errors when loading the schedule. These are often syntax errors, but when I had a look at some of those I found only incompletes due to YAML_SCHEDULE file not found: 'sle/lib/../schedule/security/oscap_stig.yaml'. That may be an error case we can distinguish from syntax errors, making those jobs failures instead.
In any case I would just wait and see whether the trend of declining figures continues.
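To back up the claim that asset download failures dominate, the rows of the query output above can be summed per category. The following is only a minimal sketch: it covers just the rows listed in this comment (not the whole jobs table), and it assumes that the text before the first colon of the truncated reason is a usable category name. The function name is made up for illustration.

```python
from collections import Counter

# Rows copied from the query output above as (count, truncated reason).
ROWS = [
    (78, "asset failure: Failed to download sle-micro-6.0-x86_64-10.1"),
    (76, "asset failure: Failed to download sle-micro-6.0-aarch64-10."),
    (58, "asset failure: Failed to download SLES15-SP5-Minimal-VM.x86"),
    (40, "asset failure: Failed to download dev_tools.dud to /var/lib"),
    (39, "tests died: unable to load, check the log for the c"),
    (38, "asset failure: Failed to download SLES-15-SP6-x86_64-Build4"),
    (20, "asset failure: Failed to download sle-15-SP6-x86_64-45.1-gn"),
    (20, "asset failure: Failed to download sle-15-SP6-x86_64-40.1-te"),
    (15, "asset failure: Failed to download SLE-15-SP6-Full-aarch64-B"),
    (15, "backend died: QMP command migrate failed: GenericError; Sta"),
    (12, "tests died: unable to load tests/network/samba/samba_adcli."),
    (12, "asset failure: Failed to download sle-15-SP6-ppc64le-45.1-g"),
    (11, "backend died: QEMU terminated before QMP connection could b"),
    (10, "asset failure: Failed to download sle-15-SP4-x86_64-2024012"),
    (10, "asset failure: Failed to download sle-15-SP6-aarch64-Build4"),
    (10, "tests died: unable to load tests/yast2_gui/yast2_bootloader"),
    (10, "asset failure: Failed to download sle-15-SP6-x86_64-39.1-gn"),
    (9, "asset failure: Failed to download SLE-Micro.x86_64-6.0-Base"),
    (9, "asset failure: Failed to download autoyast-SLES-12SP5-x86_6"),
    (8, "asset failure: Failed to download sle-15-SP6-aarch64-45.1-g"),
    (8, "asset failure: Failed to download sle-15-SP5-x86_64-120.11-"),
    (8, "asset failure: Failed to download SLE-15-SP6-ppc64le-Build4"),
    (8, "asset failure: Failed to download sle-15-SP5-ppc64le-Build1"),
]


def totals_by_category(rows):
    """Sum job counts per category (hypothetical helper).

    The category is assumed to be the text before the first colon.
    """
    totals = Counter()
    for count, reason in rows:
        category = reason.split(":", 1)[0].strip()
        totals[category] += count
    return totals


totals = totals_by_category(ROWS)
for category, count in totals.most_common():
    print(f"{category}: {count}")
```

For the rows shown, this gives 437 asset failures, 61 "tests died" and 26 "backend died" incompletes.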
Updated by okurz about 1 year ago
- Description updated (diff)
- Priority changed from Urgent to High
mkittler and I looked into this. Currently there is no alert condition. The biggest problem was a syntax error, a missing "%" in a variable, which caused missing dependencies among jobs. That has meanwhile been fixed in the test suite, likely by yosun. The number of incomplete jobs has already decreased sufficiently, so I am reducing the priority accordingly. Removed the silence again.
Updated by mkittler about 1 year ago
- Most assets were missing due to a typo in a job dependency. It already seems to be fixed, but I also mentioned it in the chat.
- Maybe
is a victim of our asset cleanup or maybe just missing.
- Some investigation jobs were missing assets. This is because they were about a very old (2 months old) last good build and the asset simply didn't exist anymore.
Updated by mkittler about 1 year ago
- Status changed from In Progress to Resolved
Turning the error about YAML_SCHEDULE into a failure would probably not be the best idea. We could emit a more specific reason for that but this exception is happening within the test distribution so I'm not looking further into it right now.
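Just to illustrate what a "more specific reason" could mean, here is a minimal sketch that tells the missing-schedule case apart from other load errors by matching on the message text. It is not how openQA or the test distribution handles this; the function name and the returned labels are hypothetical, and only the "YAML_SCHEDULE file not found" text is taken from the jobs mentioned earlier.

```python
def classify_load_error(message: str) -> str:
    """Return a hypothetical, more specific label for a schedule-loading error."""
    # The marker text is taken from the incompletes seen in this ticket.
    if "YAML_SCHEDULE file not found" in message:
        return "schedule file not found"
    # Everything else stays in the generic bucket (e.g. real syntax errors).
    return "unable to load"


missing = classify_load_error(
    "YAML_SCHEDULE file not found: 'sle/lib/../schedule/security/oscap_stig.yaml'"
)
other = classify_load_error("unable to load tests/network/samba/samba_adcli.")
print(missing)
print(other)
```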
With that I would actually close the ticket.
Updated by jbaier_cz 2 months ago
- Copied to action #174586: Incomplete jobs (not restarted) of last 24h alert Salt added