action #169747
closedcoordination #102915: [saga][epic] Automated classification of failures
coordination #166655: [epic] openqa-label-known-issues
Multiple finalize_job_results and hook_script minion jobs per openQA job size:M
Description
Observation
In #166772 I noticed that multiple minion jobs are created for the same openQA job.
The jobs I investigated were all incomplete; I didn't check whether this also happens for passed/failed jobs.
Here is an example:
hook_script:
- https://openqa.opensuse.org/minion/jobs?id=4545183
- https://openqa.opensuse.org/minion/jobs?id=4545190
- https://openqa.opensuse.org/minion/jobs?id=4545193
finalize_job_results:
- https://openqa.opensuse.org/minion/jobs?id=4545182
- https://openqa.opensuse.org/minion/jobs?id=4545187
- https://openqa.opensuse.org/minion/jobs?id=4545191
https://openqa.opensuse.org/tests/4637440
Reason: abandoned: associated worker qa-power8-3:4 re-connected but abandoned the job
Also check
select id, concat('https://openqa.opensuse.org/tests/', args->1), task, started, state from minion_jobs where task = 'hook_script' and created >= '2024-11-10 11:39:00' and created <= '2024-11-12 11:42:00' and notes::varchar like '%hook_rc": 1%' order by started limit 100;
Especially having multiple hook_script jobs for the same job could be problematic.
enqueue_finalize_job_results is called from Jobs->done and Jobs->cancel.
Acceptance Criteria
AC1: At least hook_script minion jobs are not created multiple times for the same openQA job (maybe also finalize_job_results)
Suggestions
- Use database queries to find relevant duplicate Minion jobs and the reason why their openQA jobs incompleted (maybe group by args) to find out
  - If it happens only on incompletes or on all kinds of results
  - If it happens only on those "reconnect" incompletes
  - Then it might be easier to find out which code is calling done multiple times and why
- Ensure the done/cancel functions are only invoking the finalize job if the job hasn't been finalized yet
  - Otherwise, make sure that from the finalize job hook scripts only run once
- Consider adding a check within the hook script itself so it doesn't matter if it is invoked multiple times
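The second suggestion above (only enqueue the finalize job if it hasn't been requested yet) could look roughly like the following sketch. It is written in Python purely for illustration; openQA itself is Perl. FakeMinion, Job, the finalize_enqueued flag and enqueue_finalize_once are all hypothetical names, not openQA or Minion APIs.

```python
from dataclasses import dataclass, field


@dataclass
class FakeMinion:
    """Hypothetical in-memory stand-in for the Minion queue (not the real API)."""
    enqueued: list = field(default_factory=list)

    def enqueue(self, task, args):
        self.enqueued.append((task, tuple(args)))
        return len(self.enqueued)  # pretend this is the Minion job ID


@dataclass
class Job:
    """Hypothetical openQA job carrying a flag that finalization was requested."""
    id: int
    finalize_enqueued: bool = False


def enqueue_finalize_once(minion, job):
    """Enqueue finalize_job_results only if it was not requested before.

    Returns the Minion job ID, or None if the call was a duplicate.
    The check-then-set below is not atomic: concurrent callers could both
    pass the check unless the flag is updated under a lock or transaction.
    """
    if job.finalize_enqueued:
        return None
    job.finalize_enqueued = True
    return minion.enqueue("finalize_job_results", [job.id])


minion = FakeMinion()
job = Job(id=4637440)
first = enqueue_finalize_once(minion, job)   # e.g. triggered via done
second = enqueue_finalize_once(minion, job)  # e.g. a repeated done/cancel
```

With this guard the second call is a no-op, so only one finalize_job_results entry (and hence at most one hook_script) would be created per openQA job, as long as the calls do not race.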
Updated by tinita 2 months ago
- Copied from action #166772: openqa-label-known-issues overrides size:S added
Updated by mkittler about 2 months ago
Looks like this is really just about abandoned jobs because
select count(minion_jobs.id) as minion_job_count, args[0] as openqa_job_id, string_agg(reason, ',') as reasons from minion_jobs join jobs on jobs.id = args[0]::bigint where task = 'finalize_job_results' and not (reason like '%abandoned%') group by args having count(minion_jobs.id) > 1 order by args[0]::bigint desc limit 50;
returns no jobs but
openqa=> select count(minion_jobs.id) as minion_job_count, args[0] as openqa_job_id, string_agg(reason, ',') as reasons from minion_jobs join jobs on jobs.id = args[0]::bigint where task = 'finalize_job_results' group by args having count(minion_jobs.id) > 1 order by args[0]::bigint desc limit 50;
returned many:
minion_job_count | openqa_job_id | reasons
------------------+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
3 | 15998350 | abandoned: associated worker sapworker1:25 has not sent any status updates for too long,abandoned: associated worker sapworker1:25 has not sent any status updates for too long,abandoned: associated worker sapworker1:25 has not sent any status updates for too long
3 | 15998310 | abandoned: associated worker sapworker1:25 has not sent any status updates for too long,abandoned: associated worker sapworker1:25 has not sent any status updates for too long,abandoned: associated worker sapworker1:25 has not sent any status updates for too long
3 | 15997970 | abandoned: associated worker sapworker1:14 has not sent any status updates for too long,abandoned: associated worker sapworker1:14 has not sent any status updates for too long,abandoned: associated worker sapworker1:14 has not sent any status updates for too long
3 | 15997722 | abandoned: associated worker sapworker1:15 has not sent any status updates for too long,abandoned: associated worker sapworker1:15 has not sent any status updates for too long,abandoned: associated worker sapworker1:15 has not sent any status updates for too long
3 | 15997699 | abandoned: associated worker sapworker1:19 has not sent any status updates for too long,abandoned: associated worker sapworker1:19 has not sent any status updates for too long,abandoned: associated worker sapworker1:19 has not sent any status updates for too long
3 | 15997698 | abandoned: associated worker sapworker1:22 has not sent any status updates for too long,abandoned: associated worker sapworker1:22 has not sent any status updates for too long,abandoned: associated worker sapworker1:22 has not sent any status updates for too long
3 | 15997688 | abandoned: associated worker sapworker1:17 has not sent any status updates for too long,abandoned: associated worker sapworker1:17 has not sent any status updates for too long,abandoned: associated worker sapworker1:17 has not sent any status updates for too long
3 | 15997687 | abandoned: associated worker sapworker1:20 has not sent any status updates for too long,abandoned: associated worker sapworker1:20 has not sent any status updates for too long,abandoned: associated worker sapworker1:20 has not sent any status updates for too long
3 | 15997656 | abandoned: associated worker sapworker1:25 has not sent any status updates for too long,abandoned: associated worker sapworker1:25 has not sent any status updates for too long,abandoned: associated worker sapworker1:25 has not sent any status updates for too long
3 | 15997655 | abandoned: associated worker sapworker1:16 has not sent any status updates for too long,abandoned: associated worker sapworker1:16 has not sent any status updates for too long,abandoned: associated worker sapworker1:16 has not sent any status updates for too long
3 | 15997654 | abandoned: associated worker sapworker1:13 has not sent any status updates for too long,abandoned: associated worker sapworker1:13 has not sent any status updates for too long,abandoned: associated worker sapworker1:13 has not sent any status updates for too long
3 | 15997653 | abandoned: associated worker sapworker1:14 has not sent any status updates for too long,abandoned: associated worker sapworker1:14 has not sent any status updates for too long,abandoned: associated worker sapworker1:14 has not sent any status updates for too long
3 | 15997569 | abandoned: associated worker sapworker1:37 has not sent any status updates for too long,abandoned: associated worker sapworker1:37 has not sent any status updates for too long,abandoned: associated worker sapworker1:37 has not sent any status updates for too long
3 | 15997489 | abandoned: associated worker sapworker1:15 has not sent any status updates for too long,abandoned: associated worker sapworker1:15 has not sent any status updates for too long,abandoned: associated worker sapworker1:15 has not sent any status updates for too long
3 | 15997466 | abandoned: associated worker sapworker1:25 has not sent any status updates for too long,abandoned: associated worker sapworker1:25 has not sent any status updates for too long,abandoned: associated worker sapworker1:25 has not sent any status updates for too long
3 | 15997465 | abandoned: associated worker sapworker1:16 has not sent any status updates for too long,abandoned: associated worker sapworker1:16 has not sent any status updates for too long,abandoned: associated worker sapworker1:16 has not sent any status updates for too long
…
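For readers less familiar with the GROUP BY ... HAVING count(...) > 1 pattern used in the queries above, here is a small Python sketch of the same grouping logic. The rows are made-up sample data, not taken from the production database, and find_duplicates is a hypothetical helper.

```python
from collections import defaultdict

# Made-up rows for illustration only: (minion_job_id, task, openqa_job_id)
rows = [
    (1, "finalize_job_results", 100),
    (2, "finalize_job_results", 100),
    (3, "finalize_job_results", 101),
    (4, "hook_script", 100),
]


def find_duplicates(rows, task):
    """Return {openqa_job_id: count} for jobs with more than one Minion job of `task`."""
    counts = defaultdict(int)
    for _minion_id, row_task, openqa_job_id in rows:
        if row_task == task:
            counts[openqa_job_id] += 1
    # Equivalent of "having count(minion_jobs.id) > 1"
    return {job_id: n for job_id, n in counts.items() if n > 1}


duplicates = find_duplicates(rows, "finalize_job_results")
```

In this sample only openQA job 100 has more than one finalize_job_results Minion job, so it is the only one reported.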
Updated by mkittler about 2 months ago
- Status changed from Workable to In Progress
Updated by mkittler about 2 months ago
- Status changed from In Progress to Feedback
PR: https://github.com/os-autoinst/openQA/pull/6068
It is not synchronized but probably good enough in practice.
Updated by mkittler about 2 months ago · Edited
The PR has only been deployed since 2024-11-26T13:59Z, so I'll re-run the SQL query later this week to see whether there are more occurrences. If not, I'll consider this ticket resolved.
Updated by mkittler about 2 months ago · Edited
The query
select count(minion_jobs.id) as minion_job_count, args[0] as openqa_job_id, max(t_finished) as t_finished, string_agg(reason, ',') as reasons from minion_jobs join jobs on jobs.id = args[0]::bigint where task = 'finalize_job_results' group by args having count(minion_jobs.id) > 1 order by max(t_finished) desc limit 50;
shows that the most recent affected job is 16007507 from 2024-11-26T08:13:11Z, which is before the deployment, so there are no new occurrences so far.
Updated by mkittler about 2 months ago
- Status changed from Feedback to Resolved
Still no further occurrences so I'm considering this ticket resolved.