action #152560

closed

[alert] Incomplete jobs (not restarted) of last 24h alert Salt

Added by tinita 5 months ago. Updated 5 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2023-12-13
Due date:
% Done: 0%
Estimated time:

Description

Date: Tue, 12 Dec 2023 16:29:30 +0100
From: Grafana <osd-admins@suse.de>
To: osd-admins@suse.de
Subject: [FIRING:1] (Incomplete jobs (not restarted) of last 24h alert Salt cXo2cmBVk)

https://stats.openqa-monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?viewPanel=17&orgId=1


Related issues 2 (0 open, 2 closed)

Related to openQA Project - action #152569: Many incomplete jobs endlessly restarted over several weeks size:M | Resolved | tinita | 2023-12-13 | 2024-01-12

Related to openQA Infrastructure - action #152578: Many incompletes with "Error connecting to VNC server <unreal6.qe.nue2.suse.org:...>" size:M | Resolved | tinita | 2023-12-13

Actions #1

Updated by okurz 5 months ago

  • Tags set to reactive work
  • Priority changed from High to Urgent

"Urgent" as long as that is not mitigated

Actions #2

Updated by mkittler 5 months ago

Looks like there's a single type of incompletes that is occurring much more often compared to the others:

openqa=> select count(id), substring(reason from 0 for 70) as reason_substr from jobs where t_finished >= '2023-12-12T13:00:00' and result = 'incomplete' group by reason_substr order by count(id) desc;
 count |                             reason_substr                             
-------+-----------------------------------------------------------------------
   961 | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
   180 | asset failure: Failed to download sle-micro-6.0-x86_64-6.1@64bit-smp_
   180 | asset failure: Failed to download sle-micro-6.0-aarch64-6.1@aarch64_x
    31 | cache failure: Failed to download sle-micro-6.0-aarch64-6.1@aarch64_x
    16 | backend died: QEMU terminated before QMP connection could be establis
    14 | backend died: QEMU exited unexpectedly, see log for details
    12 | backend died: runcmd '/usr/bin/qemu-img create -f qcow2 -F qcow2 -b /
     5 | cache failure: Cache service queue already full (10)
     4 | asset failure: Failed to download sle-15-SP4-x86_64-20231212-1-gnome_
     4 | cache failure: Failed to download sle-micro-6.0-x86_64-6.1@64bit-smp_
     4 | tests died: unable to load main.pm, check the log for the cause (e.g.
     4 | cache failure: Failed to download SLES-15-SP6-x86_64-mru-install-mini
     4 | asset failure: Failed to download SLES-15-SP6-x86_64-Build41.1@64bit-
     2 | tests died: unable to load tests/network/samba/samba_adcli.pm, check 
     2 | asset failure: Failed to download SLES-15-SP4-s390x-Build20231212-1@s
     2 | asset failure: Failed to download SLES-15-SP5-s390x-Build20231212-1@s
     2 | asset failure: Failed to download SLES-15-SP6-x86_64-mru-install-mini
     2 | backend died: QMP command migrate failed: GenericError; State blocked
     1 | backend died: qemu-img: Could not open '/var/lib/openqa/pool/2/SLES-1
     1 | backend died: qemu-img: Could not open '/var/lib/openqa/pool/21/SLES-
     1 | backend died: qemu-img: Could not open '/var/lib/openqa/pool/28/SLES-
     1 | backend died: Migrate to file failed, it has been running for more th
     1 | backend died: unable to extract assets: runcmd 'nice ionice qemu-img 
     1 | backend died: Error connecting to VNC server <s390kvm095.oqa.prg2.sus
     1 | cache failure: Cache service status error from API: Minion job #27966
     1 | backend died: Error connecting to VNC server <s390kvm085.oqa.prg2.sus
     1 | backend died: Encoder not accepting data: Broken pipe at /usr/lib/os-
     1 | asset failure: Failed to download windows-11-x86_64-22H2@windows_bios
     1 | died: terminated prematurely, see log output for details
     1 | isotovideo died: unable to handle generated assets: machine not shut 
     1 | asset failure: Failed to download SLE-Micro.x86_64-6.0-Base-RT-SelfIn
     1 | tests died: unable to load tests/installation/verify_secure_boot.pm, 
     1 | asset failure: Failed to download SLE-Micro.x86_64-6.0-Base-RT-Build6
(33 rows)
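
For anyone who wants to reproduce this kind of grouping outside of psql, here is a minimal Python sketch. The function name and the sample reasons are made up for illustration and are not data from this ticket. As far as I understand, `substring(reason from 0 for 70)` in PostgreSQL yields at most 69 characters because string positions start at 1, which is why the sketch defaults to a width of 69.

```python
from collections import Counter


def count_by_reason_prefix(reasons, width=69):
    """Count reasons grouped by their first `width` characters, most frequent first."""
    counts = Counter(reason[:width] for reason in reasons)
    # Sort by descending count; ties are ordered alphabetically by prefix.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical sample reasons, only to show the output shape.
sample_reasons = [
    "backend died: QEMU exited unexpectedly, see log for details",
    "backend died: QEMU exited unexpectedly, see log for details",
    "cache failure: Cache service queue already full (10)",
]

for prefix, count in count_by_reason_prefix(sample_reasons):
    print(f"{count:6d} | {prefix}")
```

Truncating before grouping is what lets reasons that differ only in a long file name or path suffix fall into the same bucket.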
Actions #3

Updated by tinita 5 months ago

  • Status changed from New to In Progress
  • Assignee set to tinita
Actions #4

Updated by tinita 5 months ago

If I only look at the ones which haven't been restarted I see a different picture:

openqa=> select count(id), substring(reason from 0 for 70) as reason_substr from jobs where t_finished >= '2023-12-12T13:00:00' and result = 'incomplete' and clone_id is null group by reason_substr order by count(id) desc;
 count |                             reason_substr
-------+-----------------------------------------------------------------------
   180 | asset failure: Failed to download sle-micro-6.0-x86_64-6.1@64bit-smp_
   180 | asset failure: Failed to download sle-micro-6.0-aarch64-6.1@aarch64_x
    16 | backend died: QEMU terminated before QMP connection could be establis
    13 | backend died: QEMU exited unexpectedly, see log for details
     6 | backend died: runcmd '/usr/bin/qemu-img create -f qcow2 -F qcow2 -b /
     4 | asset failure: Failed to download SLES-15-SP6-ppc64le-Build40.1-conta
     4 | asset failure: Failed to download SLES-15-SP6-x86_64-Build41.1@64bit-
     3 | tests died: unable to load main.pm, check the log for the cause (e.g.
     2 | asset failure: Failed to download sle-15-SP4-x86_64-20231212-1-gnome_
     2 | asset failure: Failed to download SLES-15-SP6-x86_64-mru-install-mini
     2 | backend died: QMP command migrate failed: GenericError; State blocked
     1 | tests died: unable to load tests/network/samba/samba_adcli.pm, check
     1 | asset failure: Failed to download SLE-Micro.x86_64-6.0-Base-RT-SelfIn
     1 | asset failure: Failed to download SLES-15-SP4-s390x-Build20231212-1@s
     1 | asset failure: Failed to download SLES-15-SP5-s390x-Build20231212-1@s
     1 | asset failure: Failed to download windows-11-x86_64-22H2@windows_bios
     1 | backend died: qemu-img: Could not open '/var/lib/openqa/pool/21/SLES-
     1 | isotovideo died: unable to handle generated assets: machine not shut
     1 | asset failure: Failed to download SLE-Micro.x86_64-6.0-Base-RT-Build6
(19 rows)

e.g. https://openqa.suse.de/tests/13052428
asset failure: Failed to download sle-micro-6.0-x86_64-6.1@64bit-smp_xfstests.qcow2 to /var/lib/openqa/cache/openqa.suse.de/sle-micro-6.0-x86_64-6.1@64bit-smp_xfstests.qcow2
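
The effect of the extra `clone_id is null` condition can be tried out on a throwaway database. The following is only a sketch under assumptions: it uses an in-memory SQLite database instead of the real PostgreSQL one, the `jobs` table is reduced to the columns the query touches, and all rows are invented. SQLite's `substr(reason, 1, 69)` stands in for the PostgreSQL `substring` call.

```python
import sqlite3


def unrestarted_incompletes(conn, since):
    """Count incomplete jobs finished since `since` that have no clone (not restarted)."""
    query = """
        SELECT count(id) AS n, substr(reason, 1, 69) AS reason_substr
        FROM jobs
        WHERE t_finished >= ? AND result = 'incomplete' AND clone_id IS NULL
        GROUP BY reason_substr
        ORDER BY n DESC, reason_substr
    """
    return conn.execute(query, (since,)).fetchall()


# Simplified, hypothetical stand-in for the openQA jobs table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs (id INTEGER PRIMARY KEY, result TEXT, reason TEXT,"
    " t_finished TEXT, clone_id INTEGER)"
)
conn.executemany(
    "INSERT INTO jobs VALUES (?, ?, ?, ?, ?)",
    [
        (1, "incomplete", "asset failure: Failed to download example.qcow2", "2023-12-12T14:00:00", None),
        (2, "incomplete", "asset failure: Failed to download example.qcow2", "2023-12-12T15:00:00", None),
        # This job was restarted (it has a clone), so the filter drops it.
        (3, "incomplete", "backend died: Error connecting to VNC server", "2023-12-12T15:30:00", 4),
        (4, "passed", None, "2023-12-12T16:00:00", None),
        # Finished before the cutoff used below.
        (5, "incomplete", "backend died: QEMU exited unexpectedly, see log for details", "2023-12-12T12:00:00", None),
    ],
)

for n, reason in unrestarted_incompletes(conn, "2023-12-12T13:00:00"):
    print(f"{n:6d} | {reason}")
```

Jobs that already have a clone disappear from the result, which matches how the VNC errors drop out of the list above once only unrestarted jobs are counted.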

Actions #5

Updated by livdywan 5 months ago

https://openqa.suse.de/minion/jobs?id=9722205 shows:

---
args:
- project: SUSE:ALP:Source:Standard:1.0:Staging:A
attempts: 1
children: []
created: 2023-12-13T08:26:27.537994Z
delayed: 2023-12-13T08:26:27.537994Z
expires: ~
finished: 2023-12-13T08:26:32.131789Z
id: 9722205
lax: 0
notes:
  gru_id: 35952609
  project_lock: 1
parents: []
priority: 100
queue: default
result:
  code: 256
  message: |-
    rsync: [sender] change_dir "/SUSE:/ALP:/Source:/Standard:/1.0:/Staging:/A/images/repo" (in repos) failed: No such file or directory (2)
    rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1835) [Receiver=3.2.3]
retried: ~
retries: 0
started: 2023-12-13T08:26:27.540492Z
state: failed
task: obs_rsync_run
time: 2023-12-13T10:35:28.620235Z
worker: 1545
Actions #6

Updated by tinita 5 months ago

  • Priority changed from Urgent to Normal

No new incomplete jobs that were not restarted are appearing right now.
Looking at the build: https://openqa.suse.de/tests/overview?version=6.0&build=6.1&groupid=536&distri=sle-micro
There was a job started manually: https://openqa.suse.de/tests/13060819#downloads
And that created the HDD sle-micro-6.0-x86_64-6.1@64bit-smp_xfstests.qcow2 that is required by all the other jobs.
It seems that all is taken care of and jobs are rescheduled. I will check with yosun if there is anything we could do or if it was just an error in scheduling.

Actions #7

Updated by okurz 5 months ago

Hi, I suggest you also query for related incomplete jobs and ensure they are retriggered accordingly.

Actions #8

Updated by tinita 5 months ago

  • Related to action #152569: Many incomplete jobs endlessly restarted over several weeks size:M added
Actions #9

Updated by tinita 5 months ago

  • Related to action #152578: Many incompletes with "Error connecting to VNC server <unreal6.qe.nue2.suse.org:...>" size:M added
Actions #10

Updated by tinita 5 months ago

  • Status changed from In Progress to Feedback

It looks like all other related incompletes now have new tests, so I think the build has been scheduled again.

Actions #11

Updated by tinita 5 months ago

  • Status changed from Feedback to Resolved

I checked with Yong Sun that we are good here.
