action #98901
closed
[alert] Incomplete jobs (not restarted) of last 24h alert
Description
There was an alert because of too many incomplete jobs on Saturday.
Did someone look at this, or did it sort itself out?
Updated by livdywan about 3 years ago
I'd like to add that this alert was active twice, on Friday at 22:47 CEST and 23:38 CEST, and resolved for the second time on Saturday at 00:07 CEST, just in case the timing matters.
Updated by okurz about 3 years ago
- Project changed from openSUSE Release Process to openQA Project
- Category set to Regressions/Crashes
- Priority changed from Normal to Urgent
- Target version set to Ready
Updated by mkittler about 3 years ago
Please move the ticket to the openQA Infrastructure project.
Besides the usual crashes of ARM workers we've got tons of jobs incompleting with asset failure: Failed to download publiccloud_tools_0023.qcow2 to /var/lib/openqa/cache/openqa.suse.de/publiccloud_tools_0023.qcow2
and asset failure: Failed to download SLE-15-SP3-Online-x86_64-Build188.15-Media1.iso to /var/lib/openqa/cache/openqa.suse.de/SLE-15-SP3-Online-x86_64-Build188.15-Media1.iso
and asset failure: Failed to download SLE-12-SP5-Server-DVD-ppc64le-GM.qcow2 to /var/lib/openqa/cache/openqa.suse.de/SLE-12-SP5-Server-DVD-ppc64le-GM.qcow2
and more.
There are also many incompletes with backend died: Open vSwitch command 'set_vlan' with arguments 'tap13 39' failed: org.freedesktop.DBus.Error.ServiceUnknown: The name org.opensuse.os_autoinst.switch was not provided by any .service files
on openqaworker-arm-2, so possibly the TAP setup on this worker is broken.
There's also cache failure: Cache service queue already full (10)
on our two workers with the most slots, openqaworker5 and openqaworker6, for which we already have #98463.
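To get a quick overview of which failure types dominate, the reasons quoted above can be grouped by their prefix. The following is only a minimal sketch: the `REASONS` list is a hand-copied sample of the messages in this comment (not real query output), and `category` and `summarize` are hypothetical helper names, not part of openQA.

```python
from collections import Counter

# Hand-copied sample of incomplete-job reasons from this ticket (illustrative only).
REASONS = [
    "asset failure: Failed to download publiccloud_tools_0023.qcow2 to /var/lib/openqa/cache/openqa.suse.de/publiccloud_tools_0023.qcow2",
    "asset failure: Failed to download SLE-15-SP3-Online-x86_64-Build188.15-Media1.iso to /var/lib/openqa/cache/openqa.suse.de/SLE-15-SP3-Online-x86_64-Build188.15-Media1.iso",
    "asset failure: Failed to download SLE-12-SP5-Server-DVD-ppc64le-GM.qcow2 to /var/lib/openqa/cache/openqa.suse.de/SLE-12-SP5-Server-DVD-ppc64le-GM.qcow2",
    "backend died: Open vSwitch command 'set_vlan' with arguments 'tap13 39' failed: org.freedesktop.DBus.Error.ServiceUnknown: The name org.opensuse.os_autoinst.switch was not provided by any .service files",
    "cache failure: Cache service queue already full (10)",
]


def category(reason: str) -> str:
    """Return the text before the first ': ' as the failure category."""
    head, sep, _ = reason.partition(": ")
    return head if sep else "unknown"


def summarize(reasons):
    """Count how many reasons fall into each category."""
    return Counter(category(r) for r in reasons)


if __name__ == "__main__":
    for name, count in summarize(REASONS).most_common():
        print(f"{count:3d}  {name}")
```

With real data, the list would come from wherever the incomplete jobs are recorded; the grouping itself only assumes that each reason starts with a category followed by a colon and a space.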
Updated by mkittler about 3 years ago
- Status changed from New to In Progress
- Assignee set to mkittler
I've been asking about the asset failures on #eng-testing.
It looks like the TAP setup on openqaworker-arm-2 fixed itself after the machine was rebooted. I don't think it is worth looking into it any further.
Updated by mkittler about 3 years ago
- Status changed from In Progress to Resolved
In the chat Anton said the asset failures were just due to his scheduling script relying on openQA listing only existing assets. He said it won't happen again. I assume #98802 is the relevant ticket.
Since the ARM workers aren't very stable anyway and a reboot seems to have fixed it, I don't think it is worth looking into the TAP setup issues on openqaworker-arm-2.
cache failure: Cache service queue already full (10) is handled by #98463, but those jobs are restarted anyway, so it is actually not related to this alert at all.