action #154345

closed

Incomplete jobs (not restarted) of last 24h alert Salt

Added by livdywan 10 months ago. Updated 10 months ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:

Actions #1

Updated by livdywan 10 months ago

Actions #2

Updated by mkittler 10 months ago

  • Assignee set to mkittler
Actions #3

Updated by mkittler 10 months ago

  • Status changed from New to In Progress
Actions #4

Updated by mkittler 10 months ago

The list of recent incompletes is really dominated by asset download failures:

openqa=> select count(id), substring(reason from 0 for 60) as reason_substr from jobs where t_finished >= '2024-01-22T00:00:00' and result = 'incomplete' and clone_id is null group by reason_substr order by count(id) desc;
 count |                        reason_substr                        
-------+-------------------------------------------------------------
    78 | asset failure: Failed to download sle-micro-6.0-x86_64-10.1
    76 | asset failure: Failed to download sle-micro-6.0-aarch64-10.
    58 | asset failure: Failed to download SLES15-SP5-Minimal-VM.x86
    40 | asset failure: Failed to download dev_tools.dud to /var/lib
    39 | tests died: unable to load main.pm, check the log for the c
    38 | asset failure: Failed to download SLES-15-SP6-x86_64-Build4
    20 | asset failure: Failed to download sle-15-SP6-x86_64-45.1-gn
    20 | asset failure: Failed to download sle-15-SP6-x86_64-40.1-te
    15 | asset failure: Failed to download SLE-15-SP6-Full-aarch64-B
    15 | backend died: QMP command migrate failed: GenericError; Sta
    12 | tests died: unable to load tests/network/samba/samba_adcli.
    12 | asset failure: Failed to download sle-15-SP6-ppc64le-45.1-g
    11 | backend died: QEMU terminated before QMP connection could b
    10 | asset failure: Failed to download sle-15-SP4-x86_64-2024012
    10 | asset failure: Failed to download sle-15-SP6-aarch64-Build4
    10 | tests died: unable to load tests/yast2_gui/yast2_bootloader
    10 | asset failure: Failed to download sle-15-SP6-x86_64-39.1-gn
     9 | asset failure: Failed to download SLE-Micro.x86_64-6.0-Base
     9 | asset failure: Failed to download autoyast-SLES-12SP5-x86_6
     8 | asset failure: Failed to download sle-15-SP6-aarch64-45.1-g
     8 | asset failure: Failed to download sle-15-SP5-x86_64-120.11-
     8 | asset failure: Failed to download SLE-15-SP6-ppc64le-Build4
     8 | asset failure: Failed to download sle-15-SP5-ppc64le-Build1
…

Maybe that's due to me reducing asset storage limits. The 40 jobs about dev_tools.dud might be related to https://suse.slack.com/archives/C02CANHLANP/p1706249706124219.
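
To see how strongly one category dominates, the grouped counts can be summed per reason prefix. Below is a minimal sketch in Python. It assumes the rows were copied by hand from the query output above (only the first six rows are included) and that the category is the text before the first colon. `category_totals` is a hypothetical helper, not part of openQA.

```python
from collections import Counter

# First six rows of the query output above: (count, truncated reason).
ROWS = [
    (78, "asset failure: Failed to download sle-micro-6.0-x86_64-10.1"),
    (76, "asset failure: Failed to download sle-micro-6.0-aarch64-10."),
    (58, "asset failure: Failed to download SLES15-SP5-Minimal-VM.x86"),
    (40, "asset failure: Failed to download dev_tools.dud to /var/lib"),
    (39, "tests died: unable to load main.pm, check the log for the c"),
    (38, "asset failure: Failed to download SLES-15-SP6-x86_64-Build4"),
]


def category_totals(rows):
    """Sum job counts per reason category (text before the first colon)."""
    totals = Counter()
    for count, reason in rows:
        category = reason.split(":", 1)[0].strip()
        totals[category] += count
    return totals


if __name__ == "__main__":
    # Print categories from most to least frequent.
    for category, total in category_totals(ROWS).most_common():
        print(f"{total:5d}  {category}")
```

Because the query output is truncated, totals computed this way only cover the rows that are shown.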

There were also 39 incompletes due to errors when loading the schedule. These are often syntax errors, but when I looked at some of them I only found incompletes due to YAML_SCHEDULE file not found: 'sle/lib/../schedule/security/oscap_stig.yaml'. That may be an error case we can distinguish from syntax errors so that those jobs are marked as failures instead.
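
As a rough illustration of how the two cases might be told apart, here is a sketch in Python that scans log text for the message quoted above. It assumes the log contains that message verbatim. `classify_schedule_error` and the category labels are hypothetical and not part of openQA or the test distribution.

```python
import re

# Assumption: the job log contains the message exactly as quoted in this comment.
MISSING_SCHEDULE = re.compile(r"YAML_SCHEDULE file not found: '(?P<path>[^']+)'")


def classify_schedule_error(log_text):
    """Return (label, path): a missing schedule file or some other load error."""
    match = MISSING_SCHEDULE.search(log_text)
    if match:
        return "missing schedule file", match.group("path")
    # Anything else (for example a syntax error) is not analysed further here.
    return "other load error", None
```

Called with the message from the jobs above, this returns the label "missing schedule file" together with the path 'sle/lib/../schedule/security/oscap_stig.yaml'.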

In any case I would just wait and see whether the trend of declining figures continues.

Actions #5

Updated by okurz 10 months ago

  • Description updated (diff)
  • Priority changed from Urgent to High

mkittler and I looked into this. There is currently no alert condition. The biggest problem was a syntax error, a missing "%" in a variable, which caused missing dependencies among jobs. That has since been fixed in the test suite, likely by yosun. The number of incomplete jobs has already decreased sufficiently, so I am reducing the priority accordingly. I removed the silence again.
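
A typo of this kind could be caught with a simple check before jobs are scheduled. The sketch below is in Python and assumes that variables in settings are written as `%NAME%`, so a value with an odd number of "%" signs is suspicious. The ticket does not show the actual setting, so the setting names and values here are made up for illustration, and `unbalanced_percent_settings` is a hypothetical helper.

```python
def unbalanced_percent_settings(settings):
    """Return the names of settings whose value has an odd number of '%' signs."""
    return sorted(
        name
        for name, value in settings.items()
        if str(value).count("%") % 2 == 1
    )


if __name__ == "__main__":
    # Hypothetical settings; the second placeholder in HDD_1 lacks its closing "%".
    example = {
        "HDD_1": "sle-%VERSION%-%ARCH.qcow2",
        "ISO": "SLE-%VERSION%.iso",
    }
    print(unbalanced_percent_settings(example))
```

The check is deliberately crude: it flags any odd count and would also flag a legitimate literal "%" in a value.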

Actions #6

Updated by mkittler 10 months ago

  • Most assets were missing due to a typo in a job dependency. It seems to be fixed already, but I also mentioned it in the chat.
  • Maybe SLES15-SP5-Minimal-VM.x86_64-VMware-Build4.2.23.vmdk.xz is a victim of our asset cleanup, or maybe it is just missing.
  • Some investigation jobs were missing assets. This is because they were about a very old (2-month-old) last good build and the asset simply didn't exist anymore.
Actions #7

Updated by mkittler 10 months ago

  • Status changed from In Progress to Resolved

Turning the error about YAML_SCHEDULE into a failure would probably not be the best idea. We could emit a more specific reason for that but this exception is happening within the test distribution so I'm not looking further into it right now.

With that I would actually close the ticket.
