action #98901: [alert] Incomplete jobs (not restarted) of last 24h alert - openQA Project - openSUSE Project Management Tool

action #98901

closed

[alert] Incomplete jobs (not restarted) of last 24h alert

Added by kodymo about 3 years ago. Updated about 3 years ago.

Status:

Resolved

Priority:

Urgent

Assignee:

mkittler

Category:

Regressions/Crashes

Target version:

Ready

Start date:

2021-09-20

Due date:

% Done:

Estimated time:

Description

There was an alert on Saturday because of too many incomplete jobs.

Link: https://stats.openqa-monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?tab=alert&viewPanel=17&orgId=1&from=1631222416820&to=1632432016820

Did someone look at this or did it sort itself out?

Updated by livdywan about 3 years ago

I'd like to add that this alert was active twice, on Friday at 22:47 CEST and 23:38 CEST, and resolved for the second time on Saturday at 00:07 CEST, just in case the timing matters.

Updated by okurz about 3 years ago

Project changed from openSUSE Release Process to openQA Project
Category set to Regressions/Crashes
Priority changed from Normal to Urgent
Target version set to Ready

Updated by mkittler about 3 years ago

Please move the ticket to the openQA Infrastructure project.

Besides the usual crashes of ARM workers, we've got tons of jobs incompleting with asset failures, for example:

- asset failure: Failed to download publiccloud_tools_0023.qcow2 to /var/lib/openqa/cache/openqa.suse.de/publiccloud_tools_0023.qcow2
- asset failure: Failed to download SLE-15-SP3-Online-x86_64-Build188.15-Media1.iso to /var/lib/openqa/cache/openqa.suse.de/SLE-15-SP3-Online-x86_64-Build188.15-Media1.iso
- asset failure: Failed to download SLE-12-SP5-Server-DVD-ppc64le-GM.qcow2 to /var/lib/openqa/cache/openqa.suse.de/SLE-12-SP5-Server-DVD-ppc64le-GM.qcow2

There are more of the same kind.

There are also many incompletes on openqaworker-arm-2 with backend died: Open vSwitch command 'set_vlan' with arguments 'tap13 39' failed: org.freedesktop.DBus.Error.ServiceUnknown: The name org.opensuse.os_autoinst.switch was not provided by any .service files, so possibly the TAP setup on this worker is broken.

There's also cache failure: Cache service queue already full (10) on our two workers with the most slots, openqaworker5 and openqaworker6, for which we already have #98463.
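For triaging an alert like this, it can help to group the incomplete reasons by their leading category ("asset failure", "backend died", "cache failure"). The following is only a minimal sketch: the sample strings are copied from this comment, while the function names `reason_category` and `count_by_category` are hypothetical and not part of openQA. How the reasons are fetched from openQA is deliberately left out.

```python
from collections import Counter

# Sample reasons copied from this ticket; a real run would read them from
# wherever the incomplete jobs' reasons are recorded (not shown here).
REASONS = [
    "asset failure: Failed to download publiccloud_tools_0023.qcow2 to "
    "/var/lib/openqa/cache/openqa.suse.de/publiccloud_tools_0023.qcow2",
    "asset failure: Failed to download SLE-12-SP5-Server-DVD-ppc64le-GM.qcow2 to "
    "/var/lib/openqa/cache/openqa.suse.de/SLE-12-SP5-Server-DVD-ppc64le-GM.qcow2",
    "backend died: Open vSwitch command 'set_vlan' with arguments 'tap13 39' failed: "
    "org.freedesktop.DBus.Error.ServiceUnknown: The name org.opensuse.os_autoinst.switch "
    "was not provided by any .service files",
    "cache failure: Cache service queue already full (10)",
]


def reason_category(reason: str) -> str:
    """Return the text before the first ': ' (hypothetical helper)."""
    category, sep, _rest = reason.partition(": ")
    # Reasons without a 'category: detail' shape are grouped as 'unknown'.
    return category if sep else "unknown"


def count_by_category(reasons) -> Counter:
    """Count how many reasons fall into each category (hypothetical helper)."""
    return Counter(reason_category(r) for r in reasons)


if __name__ == "__main__":
    for category, count in count_by_category(REASONS).most_common():
        print(f"{count:3d}  {category}")
```

A count like this makes it easier to see which category dominates before deciding which follow-up ticket is relevant.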

Updated by mkittler about 3 years ago

Status changed from New to In Progress
Assignee set to mkittler

I've been asking about the asset failures on #eng-testing.

It looks like the TAP setup on openqaworker-arm-2 fixed itself after the machine was rebooted. I don't think it is worth looking into it any further.

Updated by mkittler about 3 years ago

Status changed from In Progress to Resolved

In the chat Anton said the asset failures were just due to his scheduling script relying on openQA listing only existing assets. He said it won't happen again. I assume #98802 is the relevant ticket.

Since the ARM workers aren't very stable anyway and a reboot seems to have fixed it, I don't think it is worth looking into the TAP setup issues on openqaworker-arm-2.

cache failure: Cache service queue already full (10) is handled by #98463, but those jobs are restarted anyway, so it is actually not related to this alert at all.
