action #152560
closed
[alert] Incomplete jobs (not restarted) of last 24h alert Salt
Description
Date: Tue, 12 Dec 2023 16:29:30 +0100
From: Grafana <osd-admins@suse.de>
To: osd-admins@suse.de
Subject: [FIRING:1] (Incomplete jobs (not restarted) of last 24h alert Salt cXo2cmBVk)
https://stats.openqa-monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?viewPanel=17&orgId=1
Updated by okurz about 1 year ago
- Tags set to reactive work
- Priority changed from High to Urgent
"Urgent" as long as that is not mitigated
Updated by mkittler about 1 year ago
Looks like a single type of incomplete is occurring much more often than the others:
openqa=> select count(id), substring(reason from 0 for 70) as reason_substr from jobs where t_finished >= '2023-12-12T13:00:00' and result = 'incomplete' group by reason_substr order by count(id) desc;
count | reason_substr
-------+-----------------------------------------------------------------------
961 | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
180 | asset failure: Failed to download sle-micro-6.0-x86_64-6.1@64bit-smp_
180 | asset failure: Failed to download sle-micro-6.0-aarch64-6.1@aarch64_x
31 | cache failure: Failed to download sle-micro-6.0-aarch64-6.1@aarch64_x
16 | backend died: QEMU terminated before QMP connection could be establis
14 | backend died: QEMU exited unexpectedly, see log for details
12 | backend died: runcmd '/usr/bin/qemu-img create -f qcow2 -F qcow2 -b /
5 | cache failure: Cache service queue already full (10)
4 | asset failure: Failed to download sle-15-SP4-x86_64-20231212-1-gnome_
4 | cache failure: Failed to download sle-micro-6.0-x86_64-6.1@64bit-smp_
4 | tests died: unable to load main.pm, check the log for the cause (e.g.
4 | cache failure: Failed to download SLES-15-SP6-x86_64-mru-install-mini
4 | asset failure: Failed to download SLES-15-SP6-x86_64-Build41.1@64bit-
2 | tests died: unable to load tests/network/samba/samba_adcli.pm, check
2 | asset failure: Failed to download SLES-15-SP4-s390x-Build20231212-1@s
2 | asset failure: Failed to download SLES-15-SP5-s390x-Build20231212-1@s
2 | asset failure: Failed to download SLES-15-SP6-x86_64-mru-install-mini
2 | backend died: QMP command migrate failed: GenericError; State blocked
1 | backend died: qemu-img: Could not open '/var/lib/openqa/pool/2/SLES-1
1 | backend died: qemu-img: Could not open '/var/lib/openqa/pool/21/SLES-
1 | backend died: qemu-img: Could not open '/var/lib/openqa/pool/28/SLES-
1 | backend died: Migrate to file failed, it has been running for more th
1 | backend died: unable to extract assets: runcmd 'nice ionice qemu-img
1 | backend died: Error connecting to VNC server <s390kvm095.oqa.prg2.sus
1 | cache failure: Cache service status error from API: Minion job #27966
1 | backend died: Error connecting to VNC server <s390kvm085.oqa.prg2.sus
1 | backend died: Encoder not accepting data: Broken pipe at /usr/lib/os-
1 | asset failure: Failed to download windows-11-x86_64-22H2@windows_bios
1 | died: terminated prematurely, see log output for details
1 | isotovideo died: unable to handle generated assets: machine not shut
1 | asset failure: Failed to download SLE-Micro.x86_64-6.0-Base-RT-SelfIn
1 | tests died: unable to load tests/installation/verify_secure_boot.pm,
1 | asset failure: Failed to download SLE-Micro.x86_64-6.0-Base-RT-Build6
(33 rows)
Updated by tinita about 1 year ago
- Status changed from New to In Progress
- Assignee set to tinita
Updated by tinita about 1 year ago
If I only look at the ones which haven't been restarted I see a different picture:
openqa=> select count(id), substring(reason from 0 for 70) as reason_substr from jobs where t_finished >= '2023-12-12T13:00:00' and result = 'incomplete' and clone_id is null group by reason_substr order by count(id) desc;
count | reason_substr
-------+-----------------------------------------------------------------------
180 | asset failure: Failed to download sle-micro-6.0-x86_64-6.1@64bit-smp_
180 | asset failure: Failed to download sle-micro-6.0-aarch64-6.1@aarch64_x
16 | backend died: QEMU terminated before QMP connection could be establis
13 | backend died: QEMU exited unexpectedly, see log for details
6 | backend died: runcmd '/usr/bin/qemu-img create -f qcow2 -F qcow2 -b /
4 | asset failure: Failed to download SLES-15-SP6-ppc64le-Build40.1-conta
4 | asset failure: Failed to download SLES-15-SP6-x86_64-Build41.1@64bit-
3 | tests died: unable to load main.pm, check the log for the cause (e.g.
2 | asset failure: Failed to download sle-15-SP4-x86_64-20231212-1-gnome_
2 | asset failure: Failed to download SLES-15-SP6-x86_64-mru-install-mini
2 | backend died: QMP command migrate failed: GenericError; State blocked
1 | tests died: unable to load tests/network/samba/samba_adcli.pm, check
1 | asset failure: Failed to download SLE-Micro.x86_64-6.0-Base-RT-SelfIn
1 | asset failure: Failed to download SLES-15-SP4-s390x-Build20231212-1@s
1 | asset failure: Failed to download SLES-15-SP5-s390x-Build20231212-1@s
1 | asset failure: Failed to download windows-11-x86_64-22H2@windows_bios
1 | backend died: qemu-img: Could not open '/var/lib/openqa/pool/21/SLES-
1 | isotovideo died: unable to handle generated assets: machine not shut
1 | asset failure: Failed to download SLE-Micro.x86_64-6.0-Base-RT-Build6
(19 rows)
e.g. https://openqa.suse.de/tests/13052428
asset failure: Failed to download sle-micro-6.0-x86_64-6.1@64bit-smp_xfstests.qcow2 to /var/lib/openqa/cache/openqa.suse.de/sle-micro-6.0-x86_64-6.1@64bit-smp_xfstests.qcow2
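For comparing the two views outside of psql, the same grouping can be sketched in Python. This is only an illustration: it assumes job rows have already been exported as dicts with the `result`, `reason` and `clone_id` columns used in the queries above, and the sample records are made up.

```python
from collections import Counter

# PostgreSQL's substring(reason from 0 for 70) starts one position before the
# first character, so it effectively yields the first 69 characters.
PREFIX_LEN = 69


def count_incomplete_reasons(jobs, only_unrestarted=False):
    """Count incomplete jobs per truncated reason, most frequent first."""
    counts = Counter()
    for job in jobs:
        if job["result"] != "incomplete":
            continue
        # Mirrors "clone_id is null": keep only jobs that were not restarted.
        if only_unrestarted and job["clone_id"] is not None:
            continue
        counts[(job["reason"] or "")[:PREFIX_LEN]] += 1
    return counts.most_common()


# Made-up sample records, for illustration only.
sample_jobs = [
    {"id": 1, "result": "incomplete", "clone_id": 11,
     "reason": "backend died: Error connecting to VNC server"},
    {"id": 2, "result": "incomplete", "clone_id": 12,
     "reason": "backend died: Error connecting to VNC server"},
    {"id": 3, "result": "incomplete", "clone_id": None,
     "reason": "asset failure: Failed to download example.qcow2"},
    {"id": 4, "result": "passed", "clone_id": None, "reason": None},
]

print(count_incomplete_reasons(sample_jobs))
print(count_incomplete_reasons(sample_jobs, only_unrestarted=True))
```

With the filter enabled, reasons whose jobs were all restarted (like the VNC one in the sample) drop out, which matches the difference between the two tables above.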
Updated by livdywan about 1 year ago
https://openqa.suse.de/minion/jobs?id=9722205 shows:
---
args:
- project: SUSE:ALP:Source:Standard:1.0:Staging:A
attempts: 1
children: []
created: 2023-12-13T08:26:27.537994Z
delayed: 2023-12-13T08:26:27.537994Z
expires: ~
finished: 2023-12-13T08:26:32.131789Z
id: 9722205
lax: 0
notes:
  gru_id: 35952609
  project_lock: 1
parents: []
priority: 100
queue: default
result:
  code: 256
  message: |-
    rsync: [sender] change_dir "/SUSE:/ALP:/Source:/Standard:/1.0:/Staging:/A/images/repo" (in repos) failed: No such file or directory (2)
    rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1835) [Receiver=3.2.3]
retried: ~
retries: 0
started: 2023-12-13T08:26:27.540492Z
state: failed
task: obs_rsync_run
time: 2023-12-13T10:35:28.620235Z
worker: 1545
Updated by tinita about 1 year ago
- Priority changed from Urgent to Normal
No new incompletes that have not been restarted are appearing right now.
Looking at the build: https://openqa.suse.de/tests/overview?version=6.0&build=6.1&groupid=536&distri=sle-micro
There was a job started manually: https://openqa.suse.de/tests/13060819#downloads
And that created the HDD sle-micro-6.0-x86_64-6.1@64bit-smp_xfstests.qcow2
that is required by all the other jobs.
It seems that all is taken care of and jobs are rescheduled. I will check with yosun if there is anything we could do or if it was just an error in scheduling.
Updated by okurz about 1 year ago
Hi, I suggest you also query for related incomplete jobs and ensure they are retriggered accordingly
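A rough sketch of what that could look like, again in Python over exported job rows. The helper names and sample records are hypothetical, and the restart command is an assumption based on the openQA REST route `jobs/<id>/restart` as used via `openqa-cli api`; it should be checked against the installed client before running anything.

```python
def unrestarted_incompletes(jobs, reason_fragment):
    """Return ids of incomplete jobs without a clone whose reason matches."""
    return [
        job["id"]
        for job in jobs
        if job["result"] == "incomplete"
        and job["clone_id"] is None
        and reason_fragment in (job["reason"] or "")
    ]


def restart_commands(job_ids, host="https://openqa.suse.de"):
    """Build (not run) one restart command per job id. Assumed CLI syntax."""
    return [
        f"openqa-cli api --host {host} -X POST jobs/{job_id}/restart"
        for job_id in job_ids
    ]


# Made-up rows, for illustration only.
example_rows = [
    {"id": 101, "result": "incomplete", "clone_id": None,
     "reason": "asset failure: Failed to download example.qcow2"},
    {"id": 102, "result": "incomplete", "clone_id": 202,
     "reason": "asset failure: Failed to download example.qcow2"},
    {"id": 103, "result": "incomplete", "clone_id": None,
     "reason": "backend died: QEMU exited unexpectedly, see log for details"},
]

for command in restart_commands(
    unrestarted_incompletes(example_rows, "asset failure")
):
    print(command)
```

Only printing the commands keeps this a dry run, so the list can be reviewed before anything is retriggered.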
Updated by tinita about 1 year ago
- Related to action #152569: Many incomplete jobs endlessly restarted over several weeks size:M added
Updated by tinita about 1 year ago
- Related to action #152578: Many incompletes with "Error connecting to VNC server <unreal6.qe.nue2.suse.org:...>" size:M added
Updated by tinita about 1 year ago
- Status changed from In Progress to Feedback
It looks like all other related incompletes now have new tests; I think the build has been scheduled again.
Updated by tinita about 1 year ago
- Status changed from Feedback to Resolved
I checked with Yong Sun that we are good here.