action #182303

open

openQA worker instances blocked by jobs reported as "running" although according to openQA the jobs were cancelled/obsoleted long ago

Added by okurz 1 day ago. Updated about 10 hours ago.

Status: Feedback
Priority: Normal
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2025-05-13
Due date: 2025-05-28 (Due in 13 days)
% Done: 0%
Estimated time:

Description

Observation

As found in #181766, since #181175 multiple openQA worker instances on OSD are blocked by jobs reported as "running" although according to openQA those jobs were cancelled/obsoleted long ago, e.g. https://openqa.suse.de/admin/workers/3408 on petrol:2 is reported as "working" on https://openqa.suse.de/tests/17335290, which was cancelled and finished 21 days ago. mkittler already found that the openQA webUI sees registration events by workers accompanied by this warning:

openqa-webui-daemon[1041]: [warn] [pid:1041] Unable to incomplete/duplicate or reschedule jobs abandoned by worker 3563: Malformed/unreadable JSON file "/var/lib/openqa/testresults/17335/17335222-sle-15-SP4-Server-DVD-Incidents-Kernel-KOTD-x86_64-Build5.14.21-150400.184.1.ga8db0ef-ltp_net_nfs@64bit/details-nfs03_v40_ip6t.json": malformed JSON string, neither tag, array, object, number, string or atom, at character offset 0 at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/JSON.pm line 37.

Related issues 1 (0 open, 1 closed)

Copied from openQA Infrastructure (public) - action #181766: [osd][alert] i915 worker instance not working on jobs since weeks (was: "Job age (scheduled) (max) alert possibly due to workers exceeding the configured load threshold") size:S | Resolved | dheidler | 2025-05-27

Actions #1

Updated by okurz 1 day ago

  • Copied from action #181766: [osd][alert] i915 worker instance not working on jobs since weeks (was: "Job age (scheduled) (max) alert possibly due to workers exceeding the configured load threshold") size:S added
Actions #2

Updated by mkittler 1 day ago

  • Status changed from New to In Progress
  • Assignee set to mkittler
Actions #3

Updated by mkittler 1 day ago · Edited

The full error looks like this:

May 13 14:49:09 openqa openqa-webui-daemon[1041]: [warn] [pid:1041] Unable to incomplete/duplicate or reschedule jobs abandoned by worker 3563: Malformed/unreadable JSON file "/var/lib/openqa/testresults/17335/17335222-sle-15-SP4-Server-DVD-Incidents-Kernel-KOTD-x86_64-Build5.14.21-150400.184.1.ga8db0ef-ltp_net_nfs@64bit/details-nfs03_v40_ip6t.json": malformed JSON string, neither tag, array, object, number, string or atom, at character offset 0 at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/JSON.pm line 37.
May 13 14:49:19 openqa openqa-webui-daemon[5513]: [debug] [pid:5513] Duplicating jobs: {
May 13 14:49:19 openqa openqa-webui-daemon[5513]:   17335222 => {
May 13 14:49:19 openqa openqa-webui-daemon[5513]:     chained_children => [],
May 13 14:49:19 openqa openqa-webui-daemon[5513]:     chained_parents => [17335167],
May 13 14:49:19 openqa openqa-webui-daemon[5513]:     directly_chained_children => [],
May 13 14:49:19 openqa openqa-webui-daemon[5513]:     directly_chained_parents => [],
May 13 14:49:19 openqa openqa-webui-daemon[5513]:     is_parent_or_initial_job => 1,
May 13 14:49:19 openqa openqa-webui-daemon[5513]:     ok => 0,
May 13 14:49:19 openqa openqa-webui-daemon[5513]:     parallel_children => [],
May 13 14:49:19 openqa openqa-webui-daemon[5513]:     parallel_parents => [],
May 13 14:49:19 openqa openqa-webui-daemon[5513]:     state => "cancelled",
May 13 14:49:19 openqa openqa-webui-daemon[5513]:   },
May 13 14:49:19 openqa openqa-webui-daemon[5513]: }
May 13 14:49:19 openqa openqa-webui-daemon[5513]: [debug] [pid:5513] Job 17335222 duplicated as 17675884

It still lacks a proper backtrace so I'm currently guessing where the error comes from. It must be after the auto_duplicate call.

EDIT: This must happen during the carry over (in done -> carry_over_bugrefs -> … -> _failure_reason -> $m->results(…)).
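
The message "malformed JSON string ... at character offset 0" is what a JSON parser typically reports for an empty file, so one unreadable details file is enough to abort the whole step that handles abandoned jobs. The following Python sketch is not openQA code (openQA is written in Perl) and does not describe the actual fix; it only illustrates, under that assumption, how a reader could skip broken files instead of failing. The names `load_details` and `failure_reasons` and the "result" field are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def load_details(path):
    """Return the parsed JSON from path, or None if it is unreadable or malformed."""
    try:
        return json.loads(Path(path).read_text())
    except (OSError, ValueError):
        # ValueError covers json.JSONDecodeError and decoding errors
        return None


def failure_reasons(paths):
    """Collect the (hypothetical) 'result' field of every readable details file."""
    reasons = []
    for path in paths:
        details = load_details(path)
        if not isinstance(details, dict):
            continue  # skip the broken file instead of aborting the whole step
        reasons.append(details.get("result"))
    return reasons


def demo():
    """Show the offset-0 parse error and the tolerant reader side by side."""
    with tempfile.TemporaryDirectory() as tmp:
        good = Path(tmp) / "details-good.json"
        empty = Path(tmp) / "details-empty.json"
        good.write_text('{"result": "fail"}')
        empty.write_text("")  # an empty file cannot be parsed at all
        try:
            json.loads(empty.read_text())
            offset = None
        except json.JSONDecodeError as error:
            offset = error.pos  # position at which parsing failed
        return offset, failure_reasons([good, empty])
```

Calling `demo()` returns the offset at which the empty file failed to parse, together with the reasons collected from the files that could still be read.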

Actions #5

Updated by openqa_review about 22 hours ago

  • Due date set to 2025-05-28

Setting due date based on mean cycle time of SUSE QE Tools

Actions #6

Updated by mkittler about 10 hours ago

  • Status changed from In Progress to Feedback

The PR has been merged. Let's see whether the problem fixes itself after the deployment. (We might have to manually reload worker slots.)

I'm going to re-run the following query to check this after the deployment:

select id, host, instance, job_id, (select result from jobs where jobs.id = workers.job_id) as job_result from workers where (select count(id) from jobs where jobs.id = workers.job_id and not state in ('running', 'assigned', 'uploading', 'setup')) > 0 order by job_id;
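
To make the intent of that query easier to follow, here is a minimal sketch that runs it against an in-memory SQLite database. The two-table schema is reduced to the columns the query touches, and the rows (`hostA`, `hostB`, job ids 100 and 101, and their result values) are toy data, not real OSD content. The production database is not SQLite, so this only demonstrates the selection logic: a worker is listed when the job it still references is no longer in an active state.

```python
import sqlite3

# The query from the comment above, only re-wrapped for readability.
QUERY = """
select id, host, instance, job_id,
       (select result from jobs where jobs.id = workers.job_id) as job_result
from workers
where (select count(id) from jobs
       where jobs.id = workers.job_id
         and not state in ('running', 'assigned', 'uploading', 'setup')) > 0
order by job_id;
"""


def find_stale_workers(connection):
    """Return workers whose referenced job is no longer in an active state."""
    return connection.execute(QUERY).fetchall()


def demo():
    """Build a toy schema with hypothetical rows and run the query on it."""
    connection = sqlite3.connect(":memory:")
    try:
        connection.executescript("""
            create table jobs (id integer primary key, state text, result text);
            create table workers (id integer primary key, host text,
                                  instance integer, job_id integer);
            insert into jobs values
                (100, 'running', 'none'),
                (101, 'cancelled', 'user_cancelled');
            insert into workers values
                (1, 'hostA', 1, 100),   -- busy with a running job: fine
                (2, 'hostA', 2, 101),   -- still points at a cancelled job: stale
                (3, 'hostB', 1, null);  -- idle: fine
        """)
        return find_stale_workers(connection)
    finally:
        connection.close()
```

In this toy data set only worker 2 is returned, because it is the only one whose job is outside the states 'running', 'assigned', 'uploading' and 'setup'. An empty result after the deployment would indicate that no worker slot is blocked this way any more.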
