action #182303
open
openQA worker instances blocked by jobs reported as "running" although, according to openQA, the jobs were cancelled/obsoleted long ago
Description
Observation
As found in #181766, since #181175 multiple openQA worker instances on OSD are blocked by jobs reported as "running" although, according to openQA, these jobs were cancelled/obsoleted long ago. For example, https://openqa.suse.de/admin/workers/3408 on petrol:2 is reported as "working" on https://openqa.suse.de/tests/17335290, which was cancelled and finished 21 days ago. mkittler already found that, on registration events by workers, the openQA webUI logs:
openqa-webui-daemon[1041]: [warn] [pid:1041] Unable to incomplete/duplicate or reschedule jobs abandoned by worker 3563: Malformed/unreadable JSON file "/var/lib/openqa/testresults/17335/17335222-sle-15-SP4-Server-DVD-Incidents-Kernel-KOTD-x86_64-Build5.14.21-150400.184.1.ga8db0ef-ltp_net_nfs@64bit/details-nfs03_v40_ip6t.json": malformed JSON string, neither tag, array, object, number, string or atom, at character offset 0 at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/JSON.pm line 37.
Updated by okurz 1 day ago
- Copied from action #181766: [osd][alert] i915 worker instance not working on jobs since weeks (was: "Job age (scheduled) (max) alert possibly due to workers exceeding the configured load threshold") size:S added
Updated by mkittler 1 day ago · Edited
The full error looks like this:
May 13 14:49:09 openqa openqa-webui-daemon[1041]: [warn] [pid:1041] Unable to incomplete/duplicate or reschedule jobs abandoned by worker 3563: Malformed/unreadable JSON file "/var/lib/openqa/testresults/17335/17335222-sle-15-SP4-Server-DVD-Incidents-Kernel-KOTD-x86_64-Build5.14.21-150400.184.1.ga8db0ef-ltp_net_nfs@64bit/details-nfs03_v40_ip6t.json": malformed JSON string, neither tag, array, object, number, string or atom, at character offset 0 at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/JSON.pm line 37.
May 13 14:49:19 openqa openqa-webui-daemon[5513]: [debug] [pid:5513] Duplicating jobs: {
May 13 14:49:19 openqa openqa-webui-daemon[5513]: 17335222 => {
May 13 14:49:19 openqa openqa-webui-daemon[5513]: chained_children => [],
May 13 14:49:19 openqa openqa-webui-daemon[5513]: chained_parents => [17335167],
May 13 14:49:19 openqa openqa-webui-daemon[5513]: directly_chained_children => [],
May 13 14:49:19 openqa openqa-webui-daemon[5513]: directly_chained_parents => [],
May 13 14:49:19 openqa openqa-webui-daemon[5513]: is_parent_or_initial_job => 1,
May 13 14:49:19 openqa openqa-webui-daemon[5513]: ok => 0,
May 13 14:49:19 openqa openqa-webui-daemon[5513]: parallel_children => [],
May 13 14:49:19 openqa openqa-webui-daemon[5513]: parallel_parents => [],
May 13 14:49:19 openqa openqa-webui-daemon[5513]: state => "cancelled",
May 13 14:49:19 openqa openqa-webui-daemon[5513]: },
May 13 14:49:19 openqa openqa-webui-daemon[5513]: }
May 13 14:49:19 openqa openqa-webui-daemon[5513]: [debug] [pid:5513] Job 17335222 duplicated as 17675884
It still lacks a proper backtrace so I'm currently guessing where the error comes from. It must be after the auto_duplicate call.
EDIT: This must happen during the carry over (in done -> carry_over_bugrefs -> … -> _failure_reason -> $m->results(…)).
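To illustrate the suspected failure mode, here is a minimal Python sketch (not the openQA Perl code and not the merged fix). It assumes a details file that is empty, which matches the "character offset 0" in the warning, and shows how one unparseable file can either abort the whole post-processing step or be skipped. The helper names `load_details` and `failure_reason`, the file names, and the step fields are hypothetical and chosen only for this example.

```python
import json
import tempfile
from pathlib import Path


def load_details(path):
    """Return parsed step details, or None if the file is missing or malformed."""
    try:
        return json.loads(Path(path).read_text())
    except (OSError, json.JSONDecodeError):
        # Assumed policy for this sketch: treat unreadable details as "no details"
        # instead of letting the exception abort the caller.
        return None


def failure_reason(details_paths):
    """Hypothetical stand-in for a carry-over helper: collect titles of failed steps."""
    reasons = []
    for path in details_paths:
        details = load_details(path)
        if details is None:
            continue  # skip the broken file and keep processing the rest
        reasons.extend(step["title"] for step in details if step.get("result") == "fail")
    return reasons


with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp, "details-good.json")
    good.write_text(json.dumps([
        {"title": "nfs01", "result": "fail"},
        {"title": "nfs02", "result": "ok"},
    ]))
    empty = Path(tmp, "details-empty.json")
    empty.write_text("")  # empty file: JSON parsing fails at character offset 0

    reasons = failure_reason([empty, good])
    print(reasons)
```

Without the `try`/`except` in `load_details`, the empty file would raise before the good file is read, which is analogous to the abandoned-job handling stopping partway and leaving the worker slot assigned to a stale job.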
Updated by openqa_review about 22 hours ago
- Due date set to 2025-05-28
Setting due date based on mean cycle time of SUSE QE Tools
Updated by mkittler about 10 hours ago
- Status changed from In Progress to Feedback
The PR has been merged. Let's see whether the problem fixes itself after the deployment. (We might have to manually reload worker slots.)
I'm going to re-run the following query to check this after the deployment:
select id, host, instance, job_id, (select result from jobs where jobs.id = workers.job_id) as job_result from workers where (select count(id) from jobs where jobs.id = workers.job_id and not state in ('running', 'assigned', 'uploading', 'setup')) > 0 order by job_id;
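For readers who want to see what the query selects, the following sketch runs it against a tiny in-memory SQLite database. The two-table schema is a heavily simplified assumption (the production database is not SQLite and has many more columns), and all hosts, IDs, states and results in the sample rows are made up for illustration.

```python
import sqlite3

# The query from the comment above, only reflowed for readability.
QUERY = """
select id, host, instance, job_id,
       (select result from jobs where jobs.id = workers.job_id) as job_result
from workers
where (select count(id) from jobs
       where jobs.id = workers.job_id
         and not state in ('running', 'assigned', 'uploading', 'setup')) > 0
order by job_id;
"""

conn = sqlite3.connect(":memory:")
# Assumed minimal schema and sample data, not the real openQA tables.
conn.executescript("""
create table jobs (id integer primary key, state text, result text);
create table workers (id integer primary key, host text, instance integer, job_id integer);
insert into jobs values
    (1, 'running',   'none'),
    (2, 'cancelled', 'user_cancelled'),
    (3, 'done',      'passed');
insert into workers values
    (10, 'hostA', 1, 1),     -- legitimately busy with a running job
    (11, 'hostB', 2, 2),     -- still assigned to a cancelled job
    (12, 'hostC', 1, NULL),  -- idle, no job assigned
    (13, 'hostD', 3, 3);     -- still assigned to a finished job
""")

stuck = conn.execute(QUERY).fetchall()
for row in stuck:
    print(row)
```

In this sample only workers 11 and 13 are returned, because their assigned jobs are no longer in one of the active states. An empty result after the deployment would suggest that no worker slot is still blocked by a stale job.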