action #98901
closed
[alert] Incomplete jobs (not restarted) of last 24h alert
Description
There was an alert because of too many incomplete jobs on Saturday.
Did someone look at this, or did it sort itself out?
Added by kodymo over 3 years ago. Updated over 3 years ago.
I'd like to add that this alert fired twice on Friday, at 22:47 CEST and 23:38 CEST, and resolved for the second time on Saturday at 00:07 CEST, just in case the timing matters.
Please move the ticket to the openQA Infrastructure project.
Besides the usual crashes of ARM workers, we've got tons of jobs incompleting with
asset failure: Failed to download publiccloud_tools_0023.qcow2 to /var/lib/openqa/cache/openqa.suse.de/publiccloud_tools_0023.qcow2
asset failure: Failed to download SLE-15-SP3-Online-x86_64-Build188.15-Media1.iso to /var/lib/openqa/cache/openqa.suse.de/SLE-15-SP3-Online-x86_64-Build188.15-Media1.iso
asset failure: Failed to download SLE-12-SP5-Server-DVD-ppc64le-GM.qcow2 to /var/lib/openqa/cache/openqa.suse.de/SLE-12-SP5-Server-DVD-ppc64le-GM.qcow2
and more.
There are also many incompletes on openqaworker-arm-2 with
backend died: Open vSwitch command 'set_vlan' with arguments 'tap13 39' failed: org.freedesktop.DBus.Error.ServiceUnknown: The name org.opensuse.os_autoinst.switch was not provided by any .service files
so possibly the TAP setup on this worker is broken.
There's also
cache failure: Cache service queue already full (10)
on our two workers with the most slots, openqaworker5 and openqaworker6, for which we already have #98463.
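The triage above groups incompletes by the text before the first colon of the reason string. A minimal sketch of that grouping follows, assuming the reasons are available as plain strings; the `REASONS` list is copied from this comment, and the function names are hypothetical, not part of openQA.

```python
from collections import Counter

# Reason strings quoted in this ticket.
REASONS = [
    "asset failure: Failed to download publiccloud_tools_0023.qcow2 to /var/lib/openqa/cache/openqa.suse.de/publiccloud_tools_0023.qcow2",
    "asset failure: Failed to download SLE-15-SP3-Online-x86_64-Build188.15-Media1.iso to /var/lib/openqa/cache/openqa.suse.de/SLE-15-SP3-Online-x86_64-Build188.15-Media1.iso",
    "asset failure: Failed to download SLE-12-SP5-Server-DVD-ppc64le-GM.qcow2 to /var/lib/openqa/cache/openqa.suse.de/SLE-12-SP5-Server-DVD-ppc64le-GM.qcow2",
    "backend died: Open vSwitch command 'set_vlan' with arguments 'tap13 39' failed: org.freedesktop.DBus.Error.ServiceUnknown: The name org.opensuse.os_autoinst.switch was not provided by any .service files",
    "cache failure: Cache service queue already full (10)",
]


def reason_category(reason: str) -> str:
    """Return the part of a reason string before the first colon."""
    return reason.split(":", 1)[0].strip()


def count_categories(reasons):
    """Count how many reasons fall into each category."""
    return Counter(reason_category(r) for r in reasons)


if __name__ == "__main__":
    for category, count in count_categories(REASONS).most_common():
        print(f"{category}: {count}")
```

Splitting only on the first colon keeps nested colons in the message (as in the D-Bus error) from changing the category.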
I've been asking about the asset failures on #eng-testing.
It looks like the TAP setup on openqaworker-arm-2 fixed itself after the machine was rebooted. I don't think it is worth looking into it any further.
In the chat Anton said the asset failures were just due to his scheduling script relying on openQA listing only existing assets. He said it won't happen again. I assume #98802 is the relevant ticket.
Since the ARM workers aren't very stable anyway and a reboot seems to have fixed it, I don't think it is worth looking into the TAP setup issues on openqaworker-arm-2.
cache failure: Cache service queue already full (10)
is handled by #98463, but those jobs are restarted anyway, so it is actually not related to this alert at all.
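To illustrate why restarted jobs do not count towards an alert on incompletes that were not restarted, here is a rough sketch. The job records, their keys (`id`, `reason`, `clone_id`) and the shortened reason strings are assumptions made for this example, not data taken from the instance.

```python
# Hypothetical job records: "clone_id" is assumed to be set once a job
# has been restarted and to be None otherwise.
JOBS = [
    {"id": 1, "reason": "cache failure: Cache service queue already full (10)", "clone_id": 4},
    {"id": 2, "reason": "backend died: Open vSwitch command 'set_vlan' with arguments 'tap13 39' failed", "clone_id": None},
    {"id": 3, "reason": "asset failure: Failed to download publiccloud_tools_0023.qcow2", "clone_id": None},
]


def incomplete_not_restarted(jobs):
    """Keep only the jobs that have no clone, i.e. were not restarted."""
    return [job for job in jobs if job.get("clone_id") is None]


if __name__ == "__main__":
    for job in incomplete_not_restarted(JOBS):
        print(job["id"], job["reason"])
```

Under these assumptions the cache failure job drops out of the count because it was restarted, which matches the reasoning in the comment.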