action #159348
closed
s390x kvm jobs incomplete with auto_review:"cache failure: Failed to send asset request for SLE-Micro-.*Cache service enqueue error 500: Internal Server Error" size:M
0%
Description
Observation
https://openqa.suse.de/tests/14103039 incomplete with auto_review:"cache failure: Failed to send asset request for SLE-Micro-.*Cache service enqueue error 500: Internal Server Error". Similar in multiple other jobs on at least the instance worker40:4. So there seems to be a problem in handling that in the cache service.
https://openqa.suse.de/admin/workers/3090 shows dozens of incomplete jobs with the same reason.
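The auto_review label above reads as a regular expression to be matched against a job's reason. A minimal sketch of that matching, assuming plain regex search semantics: the pattern is copied from this ticket, while the helper `matches_ticket` and both sample reason strings are hypothetical and not taken from a real job.

```python
import re

# Pattern copied verbatim from the auto_review label in this ticket.
AUTO_REVIEW = (
    r"cache failure: Failed to send asset request for SLE-Micro-.*"
    r"Cache service enqueue error 500: Internal Server Error"
)


def matches_ticket(reason: str) -> bool:
    """Return True if a job reason matches the auto_review pattern (hypothetical helper)."""
    return re.search(AUTO_REVIEW, reason) is not None


# Hypothetical reason strings for illustration only, not copied from real jobs.
example_hit = (
    "cache failure: Failed to send asset request for "
    "SLE-Micro-EXAMPLE.qcow2: Cache service enqueue error 500: Internal Server Error"
)
example_miss = "cache failure: Failed to download asset"

print(matches_ticket(example_hit))
print(matches_ticket(example_miss))
```

The `.*` in the middle is what lets one label cover any SLE-Micro asset name, which is why jobs for different images can all be attributed to the same ticket.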
Steps to reproduce
Find jobs referencing this ticket with the help of
https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label ,
call openqa-query-for-job-label 159348
Acceptance criteria
- AC1: No more references to this ticket from openqa-query-for-job-label
Suggestions
- Find out if the issues are specific to the arch or product
- Maybe related to recent changes with regard to git
Updated by nicksinger 7 months ago · Edited
This ticket is about the error 500 in the cache service, right? I ask because repairing the instances will be done in #158170.
Updated by okurz 7 months ago
- Subject changed from s390x kvm jobs incomplete with auto_review:"cache failure: Failed to send asset request for SLE-Micro-.*Cache service enqueue error 500: Internal Server Error" to s390x kvm jobs incomplete with auto_review:"cache failure: Failed to send asset request for SLE-Micro-.*Cache service enqueue error 500: Internal Server Error" size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by mkittler 7 months ago · Edited
- Description updated (diff)
openqa-query-for-job-label 159348
only shows the job already mentioned in the ticket description. So I used select id, t_finished, result, (select host from workers where workers.id = jobs.assigned_worker_id) as host, reason from jobs where reason ilike '%Cache service enqueue error 500: Internal Server Error%' order by t_finished desc;
instead. Notably, all of those jobs ran on worker40. The most recent job is 14103490 from 2024-04-20 23:44:39 and the oldest still relevant one is 14083560 from 2024-04-20 05:00:56. So the problem persisted for many hours and was possibly only resolved by the next reboot on 2024-04-21 03:34. Unfortunately, logs from that timeframe are gone, so I can't tell what was going on. The Minion dashboard also doesn't show any relevant jobs anymore (although the problem was probably not with the job execution anyway but with the Minion web application).
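To put numbers on "many hours", the timestamps quoted above can be subtracted directly. This is only a sketch: the reboot time is given to the minute in this comment, so the seconds are assumed to be `:00`, and the variable names are illustrative.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Timestamps quoted in this comment.
oldest_failure = datetime.strptime("2024-04-20 05:00:56", FMT)  # job 14083560
latest_failure = datetime.strptime("2024-04-20 23:44:39", FMT)  # job 14103490
# Assumption: the reboot is only known to the minute, seconds set to :00.
reboot = datetime.strptime("2024-04-21 03:34:00", FMT)

failure_window = latest_failure - oldest_failure
gap_before_reboot = reboot - latest_failure

print(failure_window)     # 18:43:43
print(gap_before_reboot)  # 3:49:21
```

The failures span roughly 18 hours and 44 minutes, and the last one precedes the reboot by almost four hours. That gap is consistent with the reboot clearing the problem, but it does not prove it.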
Updated by mkittler 7 months ago
- Status changed from In Progress to Resolved
Considering the job history looks good on https://openqa.suse.de/admin/workers/3090 and AC1 is fulfilled, I'm resolving this ticket. If this happens again, we have to be a bit faster (or at least add relevant logs when creating the ticket).