action #98901

closed

[alert] Incomplete jobs (not restarted) of last 24h alert

Added by kodymo over 2 years ago. Updated over 2 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2021-09-20
Due date:
% Done: 0%
Estimated time:

Description

There was an alert on Saturday because of too many incomplete jobs.

Link: https://stats.openqa-monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?tab=alert&viewPanel=17&orgId=1&from=1631222416820&to=1632432016820

Did someone look at this or did it sort itself out?

Actions #1

Updated by livdywan over 2 years ago

I'd like to add that this alert was active twice, on Friday at 22:47 CEST and 23:38 CEST, and resolved for the second time on Saturday at 00:07 CEST, just in case the timing matters.

Actions #2

Updated by okurz over 2 years ago

  • Project changed from openSUSE Release Process to openQA Project
  • Category set to Regressions/Crashes
  • Priority changed from Normal to Urgent
  • Target version set to Ready
Actions #3

Updated by mkittler over 2 years ago

Please move the ticket to the openQA Infrastructure project.


Besides the usual crashes of ARM workers, we've got tons of jobs incompleting with asset failures, for example (there are more):

  • asset failure: Failed to download publiccloud_tools_0023.qcow2 to /var/lib/openqa/cache/openqa.suse.de/publiccloud_tools_0023.qcow2
  • asset failure: Failed to download SLE-15-SP3-Online-x86_64-Build188.15-Media1.iso to /var/lib/openqa/cache/openqa.suse.de/SLE-15-SP3-Online-x86_64-Build188.15-Media1.iso
  • asset failure: Failed to download SLE-12-SP5-Server-DVD-ppc64le-GM.qcow2 to /var/lib/openqa/cache/openqa.suse.de/SLE-12-SP5-Server-DVD-ppc64le-GM.qcow2

There are also many incompletes on openqaworker-arm-2 with backend died: Open vSwitch command 'set_vlan' with arguments 'tap13 39' failed: org.freedesktop.DBus.Error.ServiceUnknown: The name org.opensuse.os_autoinst.switch was not provided by any .service files. So possibly the TAP setup on this worker is broken.

There's also cache failure: Cache service queue already full (10) on openqaworker5 and openqaworker6, our two workers with the most slots, for which we already have #98463.
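To get a quick overview of which kind of failure dominates, the reasons quoted above can be grouped by the text before the first colon. This is only a rough sketch: the list below is a hand-copied sample of the messages from this comment, not real query output, and the helper name reason_category is made up for illustration.

```python
from collections import Counter

# Hand-copied sample of incomplete reasons from this ticket (not real query output).
reasons = [
    "asset failure: Failed to download publiccloud_tools_0023.qcow2 to "
    "/var/lib/openqa/cache/openqa.suse.de/publiccloud_tools_0023.qcow2",
    "asset failure: Failed to download SLE-12-SP5-Server-DVD-ppc64le-GM.qcow2 to "
    "/var/lib/openqa/cache/openqa.suse.de/SLE-12-SP5-Server-DVD-ppc64le-GM.qcow2",
    "backend died: Open vSwitch command 'set_vlan' with arguments 'tap13 39' failed: "
    "org.freedesktop.DBus.Error.ServiceUnknown: The name org.opensuse.os_autoinst.switch "
    "was not provided by any .service files",
    "cache failure: Cache service queue already full (10)",
]


def reason_category(reason: str) -> str:
    """Hypothetical helper: return the text before the first colon, e.g. 'asset failure'."""
    return reason.split(":", 1)[0].strip()


# Count how often each category occurs and print the most frequent first.
counts = Counter(reason_category(r) for r in reasons)
for category, count in counts.most_common():
    print(f"{category}: {count}")
```

Grouping by the prefix keeps worker-specific or asset-specific details out of the way, so the three separate problems (assets, TAP setup, cache queue) show up as three buckets.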

Actions #4

Updated by mkittler over 2 years ago

  • Status changed from New to In Progress
  • Assignee set to mkittler

I've been asking about the asset failures on #eng-testing.

It looks like the TAP setup on openqaworker-arm-2 fixed itself after the machine was rebooted. I don't think it is worth looking into it any further.

Actions #5

Updated by mkittler over 2 years ago

  • Status changed from In Progress to Resolved

In the chat, Anton said the asset failures were just due to his scheduling script relying on openQA listing only existing assets. He said it won't happen again. I assume #98802 is the relevant ticket.

Since the ARM workers aren't very stable anyway and a reboot seems to have fixed it, I don't think it is worth looking into the TAP setup issues on openqaworker-arm-2.

The "cache failure: Cache service queue already full (10)" issue is handled by #98463, but those jobs are restarted anyway, so it is actually not related to this alert at all.
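To illustrate why restarted jobs don't count here, this is a minimal sketch of the "incomplete and not restarted" filter the alert name suggests. The field names result, clone_id and reason are assumptions modeled on how a job record might look, and the job entries and IDs are invented sample data, not taken from the real instance or the actual alert query.

```python
# Invented sample records; field names are assumptions, not a confirmed schema.
jobs = [
    {"id": 1, "result": "incomplete", "clone_id": 4,
     "reason": "cache failure: Cache service queue already full (10)"},
    {"id": 2, "result": "incomplete", "clone_id": None,
     "reason": "asset failure: Failed to download publiccloud_tools_0023.qcow2 to "
               "/var/lib/openqa/cache/openqa.suse.de/publiccloud_tools_0023.qcow2"},
    {"id": 3, "result": "passed", "clone_id": None, "reason": None},
]


def incomplete_not_restarted(jobs):
    """Keep jobs that ended incomplete and have no clone, i.e. were not restarted."""
    return [j for j in jobs if j["result"] == "incomplete" and j["clone_id"] is None]


for job in incomplete_not_restarted(jobs):
    print(job["id"], job["reason"])
```

In this sample only the asset failure remains, because the cache failure job has a clone and is therefore treated as restarted.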
