action #137300: [FIRING:1] (Incomplete jobs (not restarted) of last 24h alert Salt size:M - openQA Project (public) - openSUSE Project Management Tool

closed

Added by tinita about 1 year ago. Updated about 1 year ago.

Status:

Resolved

Priority:

High

Assignee:

okurz

Category:

Regressions/Crashes

Target version:

Ready

Start date:

2023-10-02

Tags:

incomplete, alert, osd, infra

Description

Observation¶

Firing [stats.openqa-monitor.qa.suse.de]
Alert: Incomplete jobs (not restarted) of last 24h alert
View alert [stats.openqa-monitor.qa.suse.de]
Values: B0=314
Labels: alertname=Incomplete jobs (not restarted) of last 24h alert

http://stats.openqa-monitor.qa.suse.de/alerting/grafana/cXo2cmBVk/view

Acceptance criteria¶

AC1: Alert is not triggered anymore
AC2: It is known what triggered the alert originally

Suggestions¶

Investigate what happened on September 28th
Run the following query to spot anything obvious in the reasons; grouping by a shortened reason should make patterns easier to see:
select id,test,reason from jobs where result='incomplete' and t_created >= '2023-09-27' and t_created <= '2023-09-29' limit 30;

Related issues 1 (0 open — 1 closed)

Updated by okurz about 1 year ago

Tags set to infra, alert, incomplete, osd
Priority changed from Normal to High
Target version set to Tools - Next

Updated by okurz about 1 year ago

Target version changed from Tools - Next to Ready

Updated by livdywan about 1 year ago

Subject changed from [FIRING:1] (Incomplete jobs (not restarted) of last 24h alert Salt to [FIRING:1] (Incomplete jobs (not restarted) of last 24h alert Salt size:M
Description updated (diff)
Status changed from New to Workable

Updated by okurz about 1 year ago

Related to action #96684: Abort asset download via the cache service when related job runs into a timeout (or is otherwise cancelled) size:M added

Updated by okurz about 1 year ago

Status changed from Workable to Resolved
Assignee set to okurz

openqa=> select count(id),left(reason,80) as r from jobs where result='incomplete' and t_created >= '2023-09-27' and t_created <= '2023-09-29' group by r order by count desc limit 30;
 count |                                        r                                         
-------+----------------------------------------------------------------------------------
   565 | asset failure: Failed to download dev_tools.dud to /var/lib/openqa/cache/openqa.
   221 | cache failure: Cache service queue already full (10)
    55 | asset failure: Failed to download SLES-15-SP4-x86_64-mru-install-desktop-with-ad
    50 | asset failure: Failed to download SLES-15-SP5-x86_64-mru-install-desktop-with-ad
    36 | asset failure: Failed to download SLES-15-SP5-x86_64-mru-install-minimal-with-ad
    35 | asset failure: Failed to download SLES-15-SP4-x86_64-mru-install-minimal-with-ad
    31 | backend died: QEMU exited unexpectedly, see log for details
    29 | cache failure: Failed to download dev_tools.dud to /var/lib/openqa/cache/openqa.
    26 | backend died: runcmd '/usr/bin/qemu-img create -f qcow2 -F qcow2 -b /var/lib/ope
    25 | asset failure: Failed to download SLES-15-SP5-aarch64-mru-install-minimal-with-a
    19 | asset failure: Failed to download SLES-15-SP3-x86_64-mru-install-minimal-with-ad

Since then the alert has not reappeared. The first entry is clearly within the scope of testers, so we don't need to act on it. For the second entry, "Cache service queue already full", we have #96684, which is already in our backlog, so that is good enough.
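
To see how the reasons split at a coarser level, the rows from the query output can be summed per category, taking the text before the first colon as the category. This is only a rough sketch: the function name totals_by_category is made up for illustration, the rows are copied by hand from the (truncated) output shown in this comment, and the totals cover only those rows, not every incomplete job in the time window.

```python
from collections import Counter

# (count, shortened reason) pairs copied from the query output in this comment.
# The reasons are truncated at 80 characters, exactly as the query returned them.
ROWS = [
    (565, "asset failure: Failed to download dev_tools.dud to /var/lib/openqa/cache/openqa."),
    (221, "cache failure: Cache service queue already full (10)"),
    (55, "asset failure: Failed to download SLES-15-SP4-x86_64-mru-install-desktop-with-ad"),
    (50, "asset failure: Failed to download SLES-15-SP5-x86_64-mru-install-desktop-with-ad"),
    (36, "asset failure: Failed to download SLES-15-SP5-x86_64-mru-install-minimal-with-ad"),
    (35, "asset failure: Failed to download SLES-15-SP4-x86_64-mru-install-minimal-with-ad"),
    (31, "backend died: QEMU exited unexpectedly, see log for details"),
    (29, "cache failure: Failed to download dev_tools.dud to /var/lib/openqa/cache/openqa."),
    (26, "backend died: runcmd '/usr/bin/qemu-img create -f qcow2 -F qcow2 -b /var/lib/ope"),
    (25, "asset failure: Failed to download SLES-15-SP5-aarch64-mru-install-minimal-with-a"),
    (19, "asset failure: Failed to download SLES-15-SP3-x86_64-mru-install-minimal-with-ad"),
]


def totals_by_category(rows):
    """Sum the counts per category, where the category is the text before the first colon."""
    totals = Counter()
    for count, reason in rows:
        category = reason.split(":", 1)[0].strip()
        totals[category] += count
    return totals


if __name__ == "__main__":
    # Print the categories from most to least frequent.
    for category, count in totals_by_category(ROWS).most_common():
        print(f"{count:5d}  {category}")
```

For the rows listed here this gives 785 for "asset failure", 250 for "cache failure" and 57 for "backend died", so asset download problems dominate the listed incompletes.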
