action #154345: Incomplete jobs (not restarted) of last 24h alert Salt - openQA Infrastructure (public) - openSUSE Project Management Tool

action #154345

closed

Incomplete jobs (not restarted) of last 24h alert Salt

Added by livdywan over 1 year ago. Updated over 1 year ago.

Status:

Resolved

Priority:

High

Assignee:

mkittler

Category:

Target version:

openQA Project (public) - Ready

Start date:

Due date:

% Done:

Estimated time:

Tags:

reactive work

Description

Observation

From Grafana [FIRING:1] (Incomplete jobs (not restarted) of last 24h alert Salt cXo2cmBVk):

  B0=312

Suggestions

Rollback steps

DONE Remove silence from https://stats.openqa-monitor.qa.suse.de/alerting/silences

Related issues 1 (0 open — 1 closed)

Updated by livdywan over 1 year ago

The count peaked at 470 in the middle of the European night.
https://openqa.suse.de/tests?resultfilter=Incomplete shows 8 incompletes right now, while Grafana reports 317.

Updated by mkittler over 1 year ago

Assignee set to mkittler

Updated by mkittler over 1 year ago

Status changed from New to In Progress

Updated by mkittler over 1 year ago

The list of recent incompletes is really dominated by asset download failures:

openqa=> select count(id), substring(reason from 0 for 60) as reason_substr from jobs where t_finished >= '2024-01-22T00:00:00' and result = 'incomplete' and clone_id is null group by reason_substr order by count(id) desc;
 count |                        reason_substr                        
-------+-------------------------------------------------------------
    78 | asset failure: Failed to download sle-micro-6.0-x86_64-10.1
    76 | asset failure: Failed to download sle-micro-6.0-aarch64-10.
    58 | asset failure: Failed to download SLES15-SP5-Minimal-VM.x86
    40 | asset failure: Failed to download dev_tools.dud to /var/lib
    39 | tests died: unable to load main.pm, check the log for the c
    38 | asset failure: Failed to download SLES-15-SP6-x86_64-Build4
    20 | asset failure: Failed to download sle-15-SP6-x86_64-45.1-gn
    20 | asset failure: Failed to download sle-15-SP6-x86_64-40.1-te
    15 | asset failure: Failed to download SLE-15-SP6-Full-aarch64-B
    15 | backend died: QMP command migrate failed: GenericError; Sta
    12 | tests died: unable to load tests/network/samba/samba_adcli.
    12 | asset failure: Failed to download sle-15-SP6-ppc64le-45.1-g
    11 | backend died: QEMU terminated before QMP connection could b
    10 | asset failure: Failed to download sle-15-SP4-x86_64-2024012
    10 | asset failure: Failed to download sle-15-SP6-aarch64-Build4
    10 | tests died: unable to load tests/yast2_gui/yast2_bootloader
    10 | asset failure: Failed to download sle-15-SP6-x86_64-39.1-gn
     9 | asset failure: Failed to download SLE-Micro.x86_64-6.0-Base
     9 | asset failure: Failed to download autoyast-SLES-12SP5-x86_6
     8 | asset failure: Failed to download sle-15-SP6-aarch64-45.1-g
     8 | asset failure: Failed to download sle-15-SP5-x86_64-120.11-
     8 | asset failure: Failed to download SLE-15-SP6-ppc64le-Build4
     8 | asset failure: Failed to download sle-15-SP5-ppc64le-Build1
…
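
To gauge how much each failure class contributes, the visible rows can be grouped by the prefix before the first colon. The following is a minimal Python sketch: the counts are copied from the rows shown above, the listing is truncated so the totals cover only the visible rows, and the helper name `totals_by_category` is illustrative and not part of openQA.

```python
from collections import Counter

# (count, category prefix) for each visible row of the query output above.
# The listing is truncated ("…"), so these are partial totals only.
VISIBLE_ROWS = [
    (78, "asset failure"), (76, "asset failure"), (58, "asset failure"),
    (40, "asset failure"), (39, "tests died"), (38, "asset failure"),
    (20, "asset failure"), (20, "asset failure"), (15, "asset failure"),
    (15, "backend died"), (12, "tests died"), (12, "asset failure"),
    (11, "backend died"), (10, "asset failure"), (10, "asset failure"),
    (10, "tests died"), (10, "asset failure"), (9, "asset failure"),
    (9, "asset failure"), (8, "asset failure"), (8, "asset failure"),
    (8, "asset failure"), (8, "asset failure"),
]


def totals_by_category(rows):
    """Sum the job counts per category prefix (hypothetical helper)."""
    totals = Counter()
    for count, category in rows:
        totals[category] += count
    return totals


totals = totals_by_category(VISIBLE_ROWS)
for category, count in totals.most_common():
    print(f"{count:5d}  {category}")
```

For the visible rows this gives 437 asset failures, 61 "tests died" and 26 "backend died" incompletes, which matches the observation that asset download failures dominate.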

Maybe that's due to me reducing asset storage limits. The 40 jobs about dev_tools.dud might be related to https://suse.slack.com/archives/C02CANHLANP/p1706249706124219.

There were also 39 incompletes due to errors when loading the schedule. These are often syntax errors, but in the ones I looked at I found only incompletes due to YAML_SCHEDULE file not found: 'sle/lib/../schedule/security/oscap_stig.yaml'. Maybe that is an error case we can distinguish from syntax errors, so that those jobs become failures instead.

In any case I would just wait and see whether the trend of declining figures continues.

Updated by okurz over 1 year ago

Description updated (diff)
Priority changed from Urgent to High

mkittler and I looked into this. The alert condition is currently not met. The biggest problem was a syntax error, a missing "%" in a variable, which caused missing dependencies among jobs. That has meanwhile been fixed in the test suite, likely by yosun. The number of incomplete jobs has already decreased sufficiently, so I am reducing the priority accordingly. I removed the silence again.

Updated by mkittler over 1 year ago

Most assets were missing due to a typo in a job dependency. It seems to be fixed already, but I also mentioned it in the chat.
Maybe SLES15-SP5-Minimal-VM.x86_64-VMware-Build4.2.23.vmdk.xz is a victim of our asset cleanup, or maybe it is just missing.
Some investigation jobs were missing assets. This is because they referred to a very old (2 months old) last good build and the asset simply didn't exist anymore.

Updated by mkittler over 1 year ago

Status changed from In Progress to Resolved

Turning the error about YAML_SCHEDULE into a failure would probably not be the best idea. We could emit a more specific reason for it, but this exception happens within the test distribution, so I'm not looking into it further right now.
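
A more specific reason could, for example, be derived by matching the error text. The following Python sketch is only an illustration under assumptions: the function name `refine_reason` and the returned reason strings are hypothetical. Only the `YAML_SCHEDULE file not found` message is taken from the inspected jobs. It does not describe how openQA or the test distribution currently behaves.

```python
import re

# Message observed in the inspected jobs; the pattern itself is an assumption.
MISSING_SCHEDULE = re.compile(r"YAML_SCHEDULE file not found: '(?P<path>[^']+)'")

# Generic reason, shortened from the truncated text in the query output.
GENERIC_REASON = "tests died: unable to load main.pm"


def refine_reason(error_text: str) -> str:
    """Return a more specific incomplete reason for a missing schedule file.

    Hypothetical helper: any other error (e.g. a syntax error) keeps the
    generic reason.
    """
    match = MISSING_SCHEDULE.search(error_text)
    if match:
        return f"tests died: schedule file not found: {match.group('path')}"
    return GENERIC_REASON


print(refine_reason(
    "YAML_SCHEDULE file not found: 'sle/lib/../schedule/security/oscap_stig.yaml'"
))
```

Jobs would still end up as incomplete in this sketch; only the reason text becomes more specific, so such cases could be told apart from syntax errors in a grouped query.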

With that I would actually close the ticket.

Updated by jbaier_cz 6 months ago

Copied to action #174586: Incomplete jobs (not restarted) of last 24h alert Salt added
