action #169747

closed

coordination #102915: [saga][epic] Automated classification of failures

coordination #166655: [epic] openqa-label-known-issues

Multiple finalize_job_results and hook_script minion jobs per openQA job size:M

Added by tinita about 1 month ago. Updated 22 days ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2024-09-13
Due date:
% Done: 0%
Estimated time:

Description

Observation

In #166772 I noticed that multiple minion jobs are created for the same openQA job.

The jobs I investigated were all incomplete; I did not check whether this also happens for passed or failed jobs.

Here is an example:

hook_script:

finalize_job_results:

https://openqa.opensuse.org/tests/4637440

Reason: abandoned: associated worker qa-power8-3:4 re-connected but abandoned the job

Also check

select id, concat('https://openqa.opensuse.org/tests/', args->1), task, started, state from minion_jobs where task = 'hook_script' and created >= '2024-11-10 11:39:00' and created <= '2024-11-12 11:42:00' and notes::varchar like '%hook_rc": 1%' order by started limit 100;

Especially having multiple hook_script jobs for the same job could be problematic.

enqueue_finalize_job_results is called from Jobs->done and Jobs->cancel.

Acceptance Criteria

AC1: At a minimum, hook_script minion jobs are no longer created multiple times for the same openQA job (possibly the same for finalize_job_results)

Suggestions

  • Use database queries to find the relevant duplicate Minion jobs and the reason why their openQA jobs ended up incomplete (maybe group by args) to find out
    • Whether it happens only on incompletes or on all kinds of results
    • Whether it happens only on those "reconnect" incompletes
    • Then it might be easier to find out which code is calling done multiple times and why
  • Ensure the done/cancel functions only enqueue the finalize job if the job hasn't been finalized yet
  • Otherwise, make sure the finalize job runs hook scripts only once
  • Consider adding a check within the hook script itself so it doesn't matter if it is invoked multiple times
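
The suggestions about done/cancel and about running hook scripts only once both amount to making the finalize step idempotent. Below is a minimal sketch of that idea in Python (openQA itself is written in Perl). `FinalizeQueue` and the simplified `done`/`cancel` functions are hypothetical stand-ins for illustration only, not openQA's or Minion's actual API, and the in-memory set is an assumption that a real implementation would replace with persistent state.

```python
from dataclasses import dataclass, field


@dataclass
class FinalizeQueue:
    """Toy stand-in for the Minion queue (hypothetical, not openQA's real API)."""

    enqueued: list = field(default_factory=list)       # (task, openqa_job_id) pairs
    finalized_jobs: set = field(default_factory=set)   # jobs that already got a finalize task

    def enqueue_finalize_job_results(self, job_id: int) -> bool:
        """Enqueue the finalize task once per job; return False on repeat calls."""
        if job_id in self.finalized_jobs:
            return False  # an earlier done/cancel call already requested finalization
        self.finalized_jobs.add(job_id)
        self.enqueued.append(("finalize_job_results", job_id))
        return True


def done(queue: FinalizeQueue, job_id: int) -> bool:
    """Simplified 'done': only the first call per job enqueues anything."""
    return queue.enqueue_finalize_job_results(job_id)


def cancel(queue: FinalizeQueue, job_id: int) -> bool:
    """Simplified 'cancel': shares the same guard as 'done'."""
    return queue.enqueue_finalize_job_results(job_id)
```

With this guard, calling `done` and `cancel` repeatedly for the same job leaves exactly one `finalize_job_results` entry in `queue.enqueued`, so a hook script triggered from the finalize task would also run only once.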

Related issues 1 (0 open, 1 closed)

Copied from openQA Project (public) - action #166772: openqa-label-known-issues overrides size:S (Resolved, tinita, 2024-09-13)

Actions #1

Updated by tinita about 1 month ago

  • Copied from action #166772: openqa-label-known-issues overrides size:S added
Actions #2

Updated by okurz about 1 month ago

  • Target version set to Ready
Actions #3

Updated by livdywan about 1 month ago

  • Subject changed from Multiple finalize_job_results and hook_script minion jobs per openQA job to Multiple finalize_job_results and hook_script minion jobs per openQA job size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #4

Updated by mkittler about 1 month ago

  • Assignee set to mkittler
Actions #5

Updated by mkittler 26 days ago

It looks like this only affects abandoned jobs, because the query

select count(minion_jobs.id) as minion_job_count, args[0] as openqa_job_id, string_agg(reason, ',') as reasons from minion_jobs join jobs on jobs.id = args[0]::bigint where task = 'finalize_job_results' and not (reason like '%abandoned%') group by args having count(minion_jobs.id) > 1 order by args[0]::bigint desc limit 50;

returns no jobs, whereas the same query without the reason filter

select count(minion_jobs.id) as minion_job_count, args[0] as openqa_job_id, string_agg(reason, ',') as reasons from minion_jobs join jobs on jobs.id = args[0]::bigint where task = 'finalize_job_results' group by args having count(minion_jobs.id) > 1 order by args[0]::bigint desc limit 50;

returns many:

 minion_job_count | openqa_job_id |                                                                                                                                 reasons                                                                                                                                 
------------------+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                3 | 15998350      | abandoned: associated worker sapworker1:25 has not sent any status updates for too long,abandoned: associated worker sapworker1:25 has not sent any status updates for too long,abandoned: associated worker sapworker1:25 has not sent any status updates for too long
                3 | 15998310      | abandoned: associated worker sapworker1:25 has not sent any status updates for too long,abandoned: associated worker sapworker1:25 has not sent any status updates for too long,abandoned: associated worker sapworker1:25 has not sent any status updates for too long
                3 | 15997970      | abandoned: associated worker sapworker1:14 has not sent any status updates for too long,abandoned: associated worker sapworker1:14 has not sent any status updates for too long,abandoned: associated worker sapworker1:14 has not sent any status updates for too long
                3 | 15997722      | abandoned: associated worker sapworker1:15 has not sent any status updates for too long,abandoned: associated worker sapworker1:15 has not sent any status updates for too long,abandoned: associated worker sapworker1:15 has not sent any status updates for too long
                3 | 15997699      | abandoned: associated worker sapworker1:19 has not sent any status updates for too long,abandoned: associated worker sapworker1:19 has not sent any status updates for too long,abandoned: associated worker sapworker1:19 has not sent any status updates for too long
                3 | 15997698      | abandoned: associated worker sapworker1:22 has not sent any status updates for too long,abandoned: associated worker sapworker1:22 has not sent any status updates for too long,abandoned: associated worker sapworker1:22 has not sent any status updates for too long
                3 | 15997688      | abandoned: associated worker sapworker1:17 has not sent any status updates for too long,abandoned: associated worker sapworker1:17 has not sent any status updates for too long,abandoned: associated worker sapworker1:17 has not sent any status updates for too long
                3 | 15997687      | abandoned: associated worker sapworker1:20 has not sent any status updates for too long,abandoned: associated worker sapworker1:20 has not sent any status updates for too long,abandoned: associated worker sapworker1:20 has not sent any status updates for too long
                3 | 15997656      | abandoned: associated worker sapworker1:25 has not sent any status updates for too long,abandoned: associated worker sapworker1:25 has not sent any status updates for too long,abandoned: associated worker sapworker1:25 has not sent any status updates for too long
                3 | 15997655      | abandoned: associated worker sapworker1:16 has not sent any status updates for too long,abandoned: associated worker sapworker1:16 has not sent any status updates for too long,abandoned: associated worker sapworker1:16 has not sent any status updates for too long
                3 | 15997654      | abandoned: associated worker sapworker1:13 has not sent any status updates for too long,abandoned: associated worker sapworker1:13 has not sent any status updates for too long,abandoned: associated worker sapworker1:13 has not sent any status updates for too long
                3 | 15997653      | abandoned: associated worker sapworker1:14 has not sent any status updates for too long,abandoned: associated worker sapworker1:14 has not sent any status updates for too long,abandoned: associated worker sapworker1:14 has not sent any status updates for too long
                3 | 15997569      | abandoned: associated worker sapworker1:37 has not sent any status updates for too long,abandoned: associated worker sapworker1:37 has not sent any status updates for too long,abandoned: associated worker sapworker1:37 has not sent any status updates for too long
                3 | 15997489      | abandoned: associated worker sapworker1:15 has not sent any status updates for too long,abandoned: associated worker sapworker1:15 has not sent any status updates for too long,abandoned: associated worker sapworker1:15 has not sent any status updates for too long
                3 | 15997466      | abandoned: associated worker sapworker1:25 has not sent any status updates for too long,abandoned: associated worker sapworker1:25 has not sent any status updates for too long,abandoned: associated worker sapworker1:25 has not sent any status updates for too long
                3 | 15997465      | abandoned: associated worker sapworker1:16 has not sent any status updates for too long,abandoned: associated worker sapworker1:16 has not sent any status updates for too long,abandoned: associated worker sapworker1:16 has not sent any status updates for too long
…
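
The two queries above differ only in the reason filter; the core step is grouping Minion jobs by their openQA job ID and keeping groups with more than one entry. A minimal Python sketch of that logic follows. The function name and the row layout `(minion_job_id, task, openqa_job_id, reason)` are assumptions made for illustration; this is not code from openQA.

```python
from collections import defaultdict


def find_duplicate_minion_jobs(rows, task="finalize_job_results", exclude_reason=None):
    """Return {openqa_job_id: [reason, ...]} for jobs with more than one Minion job.

    rows: iterable of (minion_job_id, task, openqa_job_id, reason) tuples
          (hypothetical layout mirroring the joined columns of the SQL query).
    exclude_reason: optional substring; rows whose reason contains it are skipped,
          like the NOT (reason LIKE '%...%') condition in the first query.
    """
    reasons_by_job = defaultdict(list)
    for _minion_job_id, row_task, openqa_job_id, reason in rows:
        if row_task != task:
            continue
        if exclude_reason is not None:
            # In SQL, NOT (NULL LIKE ...) evaluates to NULL, so rows with a
            # NULL reason are filtered out as well; mirror that behaviour here.
            if reason is None or exclude_reason in reason:
                continue
        reasons_by_job[openqa_job_id].append(reason)
    # Equivalent of HAVING count(...) > 1
    return {job: reasons for job, reasons in reasons_by_job.items() if len(reasons) > 1}
```

One detail worth keeping in mind when reading the first query: because of SQL's NULL semantics, openQA jobs without any reason are excluded by the filter too, not only those whose reason mentions "abandoned".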
Actions #6

Updated by mkittler 26 days ago

  • Status changed from Workable to In Progress
Actions #7

Updated by mkittler 26 days ago

  • Status changed from In Progress to Feedback

PR: https://github.com/os-autoinst/openQA/pull/6068

The fix is not synchronized, so a race between concurrent callers is still theoretically possible, but it is probably good enough in practice.
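
To illustrate in general terms what synchronization would add to such a guard: a "check, then enqueue" sequence is two steps, so two concurrent callers can both pass the check before either records it. The Python sketch below closes that window with a lock. It is not the code from the PR; the class name is hypothetical, and an in-process lock is only an assumption for illustration, since a multi-process service would need a shared mechanism (for example a database-level lock or constraint) instead.

```python
import threading


class SynchronizedFinalizeGuard:
    """Hypothetical guard that lets exactly one caller enqueue per job ID."""

    def __init__(self):
        self._lock = threading.Lock()
        self._seen = set()     # job IDs that were already enqueued
        self.enqueued = []     # stand-in for the real job queue

    def enqueue_once(self, job_id: int) -> bool:
        # The check and the update happen under one lock, so no second
        # caller can slip in between them.
        with self._lock:
            if job_id in self._seen:
                return False
            self._seen.add(job_id)
            self.enqueued.append(job_id)
            return True
```

Without the lock, the same code would usually still behave correctly, which matches the "good enough in practice" assessment, but it would not be guaranteed under concurrent calls.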

Actions #8

Updated by mkittler 25 days ago · Edited

The PR has only been deployed since 2024-11-26T13:59Z, so I'll re-run the SQL query later this week to see whether there are more occurrences. If not, I'll consider this ticket resolved.

Actions #9

Updated by mkittler 24 days ago · Edited

The query

select count(minion_jobs.id) as minion_job_count, args[0] as openqa_job_id, max(t_finished) as t_finished, string_agg(reason, ',') as reasons from minion_jobs join jobs on jobs.id = args[0]::bigint where task = 'finalize_job_results' group by args having count(minion_jobs.id) > 1 order by max(t_finished) desc limit 50;

shows that the most recent affected job is 16007507 from 2024-11-26T08:13:11Z, which predates the deployment at 2024-11-26T13:59Z, so there have been no new occurrences so far.

Actions #10

Updated by mkittler 22 days ago

  • Status changed from Feedback to Resolved

Still no further occurrences so I'm considering this ticket resolved.
