action #154345
closed
Incomplete jobs (not restarted) of last 24h alert Salt
0%
Description
Observation
From Grafana FIRING:1:
B0=312
Suggestions
- DONE Add a silence http://stats.openqa-monitor.qa.suse.de/alerting/silence/new?alertmanager=grafana&matcher=alertname%3DIncomplete+jobs+%28not+restarted%29+of+last+24h+alert&matcher=grafana_folder%3DSalt&matcher=rule_uid%3DcXo2cmBVk&orgId=1
- View dashboard http://stats.openqa-monitor.qa.suse.de/d/nRDab3Jiz?orgId=1
- View panel http://stats.openqa-monitor.qa.suse.de/d/nRDab3Jiz?orgId=1&viewPanel=1
Rollback steps
- DONE Remove silence from https://stats.openqa-monitor.qa.suse.de/alerting/silences
Updated by livdywan 11 months ago
- 470 at its peak in the middle of the European night.
- https://openqa.suse.de/tests?resultfilter=Incomplete shows 8 incompletes right now, whereas Grafana says 317.
Updated by mkittler 11 months ago
The list of recent incompletes is really dominated by asset download failures:
openqa=> select count(id), substring(reason from 0 for 60) as reason_substr from jobs where t_finished >= '2024-01-22T00:00:00' and result = 'incomplete' and clone_id is null group by reason_substr order by count(id) desc;
 count | reason_substr
-------+-------------------------------------------------------------
    78 | asset failure: Failed to download sle-micro-6.0-x86_64-10.1
    76 | asset failure: Failed to download sle-micro-6.0-aarch64-10.
    58 | asset failure: Failed to download SLES15-SP5-Minimal-VM.x86
    40 | asset failure: Failed to download dev_tools.dud to /var/lib
    39 | tests died: unable to load main.pm, check the log for the c
    38 | asset failure: Failed to download SLES-15-SP6-x86_64-Build4
    20 | asset failure: Failed to download sle-15-SP6-x86_64-45.1-gn
    20 | asset failure: Failed to download sle-15-SP6-x86_64-40.1-te
    15 | asset failure: Failed to download SLE-15-SP6-Full-aarch64-B
    15 | backend died: QMP command migrate failed: GenericError; Sta
    12 | tests died: unable to load tests/network/samba/samba_adcli.
    12 | asset failure: Failed to download sle-15-SP6-ppc64le-45.1-g
    11 | backend died: QEMU terminated before QMP connection could b
    10 | asset failure: Failed to download sle-15-SP4-x86_64-2024012
    10 | asset failure: Failed to download sle-15-SP6-aarch64-Build4
    10 | tests died: unable to load tests/yast2_gui/yast2_bootloader
    10 | asset failure: Failed to download sle-15-SP6-x86_64-39.1-gn
     9 | asset failure: Failed to download SLE-Micro.x86_64-6.0-Base
     9 | asset failure: Failed to download autoyast-SLES-12SP5-x86_6
     8 | asset failure: Failed to download sle-15-SP6-aarch64-45.1-g
     8 | asset failure: Failed to download sle-15-SP5-x86_64-120.11-
     8 | asset failure: Failed to download SLE-15-SP6-ppc64le-Build4
     8 | asset failure: Failed to download sle-15-SP5-ppc64le-Build1
…
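For anyone who wants to reproduce this kind of breakdown outside of psql, here is a rough Python sketch of the same grouping. It assumes the job rows are already available as dicts with `result`, `clone_id` and `reason` keys (the function name and the sample rows are made up for illustration, and the `t_finished` filter is left out). Note that `substring(reason from 0 for 60)` yields at most 59 characters in PostgreSQL because position 0 precedes the first character, which matches the width of the rows above, so the sketch truncates to 59.

```python
from collections import Counter


def count_by_reason_prefix(jobs, prefix_len=59):
    """Count incomplete, non-restarted jobs per truncated reason.

    Mirrors the GROUP BY in the SQL query: only jobs with result
    'incomplete' and no clone (clone_id is None) are considered.
    A missing reason is treated as an empty string here, which is a
    simplification compared to SQL NULL handling.
    """
    counts = Counter(
        (job["reason"] or "")[:prefix_len]
        for job in jobs
        if job["result"] == "incomplete" and job["clone_id"] is None
    )
    # most_common() sorts by count, highest first, like ORDER BY count DESC
    return counts.most_common()


# Made-up sample rows, only to exercise the function.
sample_jobs = [
    {"result": "incomplete", "clone_id": None,
     "reason": "asset failure: Failed to download a.qcow2"},
    {"result": "incomplete", "clone_id": None,
     "reason": "asset failure: Failed to download a.qcow2"},
    {"result": "incomplete", "clone_id": 42,  # restarted, so skipped
     "reason": "asset failure: Failed to download a.qcow2"},
    {"result": "incomplete", "clone_id": None,
     "reason": "tests died: unable to load main.pm"},
    {"result": "failed", "clone_id": None, "reason": None},
]

for reason, count in count_by_reason_prefix(sample_jobs):
    print(f"{count:5d} | {reason}")
```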
Maybe that's due to me reducing asset storage limits. The 40 jobs about dev_tools.dud might be related to https://suse.slack.com/archives/C02CANHLANP/p1706249706124219.
There were also 39 incompletes due to errors when loading the schedule. These are often syntax errors but when I had a look at some of those I found only incompletes due to YAML_SCHEDULE file not found: 'sle/lib/../schedule/security/oscap_stig.yaml'. That's maybe an error case we can distinguish from syntax errors and make those jobs failures instead.
In any case I would just wait and see whether the trend of declining figures continues.
Updated by okurz 11 months ago
- Description updated (diff)
- Priority changed from Urgent to High
mkittler and I looked into this. Currently there is no alert condition. The biggest problem was a syntax error with a missing "%" in a variable, which caused missing dependencies among jobs. That has meanwhile been fixed in the testsuite, likely by yosun. The number of incomplete jobs has already decreased sufficiently, so I am reducing the priority accordingly. I removed the silence again.
Updated by mkittler 11 months ago
- Most assets were missing due to a typo in a job dependency. It seems to be fixed already, but I also mentioned it in the chat.
- Maybe SLES15-SP5-Minimal-VM.x86_64-VMware-Build4.2.23.vmdk.xz is a victim of our asset cleanup, or maybe it is just missing.
- Some investigation jobs were missing assets. This is because they were about a very old (2 months old) last good build and the asset simply didn't exist anymore.
Updated by mkittler 11 months ago
- Status changed from In Progress to Resolved
Turning the error about YAML_SCHEDULE into a failure would probably not be the best idea. We could emit a more specific reason for that but this exception is happening within the test distribution so I'm not looking further into it right now.
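For reference, a more specific reason could be derived roughly like this. This is only a sketch: the function name, the idea of scanning some log text, and the format of the refined reason are all hypothetical and do not correspond to existing openQA or test distribution code. Only the "YAML_SCHEDULE file not found" message and the "unable to load main.pm" reason prefix are taken from the observations above.

```python
import re

# Message as observed in the affected jobs; the quoted part is the schedule path.
SCHEDULE_NOT_FOUND = re.compile(
    r"YAML_SCHEDULE file not found: '(?P<path>[^']+)'"
)

# Prefix of the generic reason reported for schedule loading errors.
GENERIC_PREFIX = "tests died: unable to load main.pm"


def refine_reason(reason, log_text):
    """Return a more specific reason for missing schedule files.

    Hypothetical helper: any reason that is not the generic main.pm
    loading error, or whose log does not mention a missing
    YAML_SCHEDULE file, is returned unchanged.
    """
    if not reason.startswith(GENERIC_PREFIX):
        return reason
    match = SCHEDULE_NOT_FOUND.search(log_text)
    if match is None:
        return reason
    return f"tests died: YAML_SCHEDULE file not found: {match.group('path')}"


print(refine_reason(
    GENERIC_PREFIX,
    "YAML_SCHEDULE file not found: 'sle/lib/../schedule/security/oscap_stig.yaml'",
))
```

The jobs would still be incompletes, only the reason would allow telling this case apart from real syntax errors in queries like the one above.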
With that I would actually close the ticket.