action #152569 (closed)

Many incomplete jobs endlessly restarted over several weeks size:M

Added by tinita 10 months ago. Updated 9 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Feature requests
Target version:
Start date: 2023-12-13
Due date: 2024-01-12
% Done: 0%
Estimated time:

Description

Observation

When investigating #152560 we noticed that there are also a lot of restarted incomplete jobs like this one:
https://openqa.suse.de/tests/13062217

Reason: backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.org:5901>: IO::Socket::INET: connect: Connection refused

Apparently there is an auto_clone_regex feature that will restart a job directly in openQA if the reason matches a certain regex.
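
For reference, a minimal sketch of how that setting could look in openqa.ini (the section name and the exact value here are assumptions for illustration, not the production configuration):

[global]
# automatically restart incomplete jobs whose reason matches this regex
auto_clone_regex = backend died: .*VNC.*(timeout|timed out|refused)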

But it doesn't make sense to restart the job thousands of times. I couldn't even find the original job (haven't tried the recursion feature yet).

In total I could find over 17k jobs with that error about unreal6.qe.nue2.suse.org since mid-November.

A symptom of having such huge restart/clone-chains is:

Dec 04 14:39:53 openqa openqa-gru[6326]: Deep recursion on subroutine "OpenQA::Schema::Result::Jobs::related_scheduled_product_id" at /usr/share/openqa/script/../lib/OpenQA/Schema/Result/Jobs.pm line 2016.
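
This warning is Perl's built-in deep-recursion check, which fires once the same subroutine is active more than 100 times on the call stack, which happens here because the scheduled product is looked up by walking the clone chain one Perl call per ancestor. A rough sketch of that pattern (the helper origin_of and the hash keys are hypothetical, not the actual openQA code):

# walking a clone chain by recursing in Perl warns once the chain is
# deeper than 100 jobs ("Deep recursion on subroutine ...")
sub related_id_of {
    my ($job) = @_;
    return $job->{scheduled_product_id} if defined $job->{scheduled_product_id};
    my $origin = origin_of($job) or return undef;    # hypothetical lookup of the job this one was cloned from
    return related_id_of($origin);                   # one stack frame per ancestor
}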

Acceptance Criteria

  • AC1: Incomplete jobs are restarted up to n times at most (configurable)

Suggestions

  • Implement a cap/limit on the automatic restarting of incomplete jobs (see the configuration sketch after this list)
  • Search for auto_clone_regex in the code repository to find the relevant starting point
  • Have a look into avoiding the deep recursion as well
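
A rough sketch of what a configurable cap (AC1 and the first suggestion) could look like in openqa.ini, next to the existing regex; the setting name auto_clone_limit and the value are assumptions here, borrowed from the $auto_clone_limit placeholder that appears in the reason text later in this ticket:

[global]
auto_clone_regex = backend died: .*VNC.*(timeout|timed out|refused)
# hypothetical cap: stop auto-restarting after this many consecutive matching incompletes
auto_clone_limit = 20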

Related issues 3 (0 open, 3 closed)

Related to openQA Infrastructure - action #152578: Many incompletes with "Error connecting to VNC server <unreal6.qe.nue2.suse.org:...>" size:M (Resolved, tinita, 2023-12-13)

Related to openQA Project - action #152560: [alert] Incomplete jobs (not restarted) of last 24h alert Salt (Resolved, tinita, 2023-12-13)

Copied to openQA Project - action #153475: Reconsider the formatting of variable-names in the reason field, e.g. "$auto_clone_regex" size:S (Resolved, mkittler)

Actions #1

Updated by tinita 10 months ago

  • Related to action #152578: Many incompletes with "Error connecting to VNC server <unreal6.qe.nue2.suse.org:...>" size:M added
Actions #2

Updated by tinita 10 months ago

  • Related to action #152560: [alert] Incomplete jobs (not restarted) of last 24h alert Salt added
Actions #3

Updated by livdywan 10 months ago

  • Subject changed from Many incomplete jobs endlessly restarted over several weeks to Many incomplete jobs endlessly restarted over several weeks size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #4

Updated by tinita 10 months ago

  • Status changed from Workable to In Progress
  • Assignee set to tinita
Actions #5

Updated by tinita 10 months ago

https://github.com/os-autoinst/openQA/pull/5397 Limit number of auto_clone restarts

Now I am trying to find out whether I can use PostgreSQL recursion in OpenQA::Schema::Result::Jobs::related_scheduled_product_id instead of issuing SQL queries recursively from Perl.
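
A minimal sketch of what such a recursive query could look like, assuming the original job of a clone chain carries the scheduled_product_id and ancestors are reached via jobs.clone_id (the job id and the column usage are illustrative):

with recursive chain as (
      select id, clone_id, scheduled_product_id from jobs where id = 13062217
         union all
      select jobs.id, jobs.clone_id, jobs.scheduled_product_id
        from jobs join chain on jobs.clone_id = chain.id)
 select scheduled_product_id from chain where scheduled_product_id is not null limit 1;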

Actions #6

Updated by tinita 10 months ago

https://github.com/os-autoinst/openQA/pull/5398 Use postgresql recursive feature for getting scheduled product

Actions #7

Updated by openqa_review 10 months ago

  • Due date set to 2023-12-29

Setting due date based on mean cycle time of SUSE QE Tools

Actions #8

Updated by tinita 10 months ago

  • Status changed from In Progress to Feedback

Waiting for feedback on both PRs

Actions #9

Updated by tinita 10 months ago · Edited

It should also be noted that investigation jobs are auto-cloned as well.
The incompletes of the last 2.5 days are exclusively investigation jobs (plus a memtest-poo152578 started by @okurz today):

openqa=> select count(id), test, substring(reason from 0 for 70) as reason_substr from jobs where t_finished >= '2023-12-17T00:00:00' and result = 'incomplete' and reason like '%unreal6%' group by reason_substr, test order by count(id) desc;
 count |                                            test                                            |                             reason_substr                             
-------+--------------------------------------------------------------------------------------------+-----------------------------------------------------------------------
   378 | jeos-base+sdk+desktop:investigate:last_good_tests:f99df70fc4702425fc55668a06d45bc639bf5056 | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
   206 | jeos-filesystem:investigate:retry                                                          | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
   205 | jeos-extratest:investigate:retry                                                           | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
   205 | jeos-kdump:investigate:retry                                                               | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
   204 | jeos-containers-docker:investigate:retry                                                   | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
   107 | memtest:investigate:retry                                                                  | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
   106 | memtest:investigate:last_good_tests:3a3104f2ab3bc31d94191dc20635f191ef914fe2               | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
   105 | jeos-base+sdk+desktop:investigate:retry                                                    | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
   105 | jeos-containers-podman:investigate:retry                                                   | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
   104 | jeos-fs_stress:investigate:retry                                                           | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
   102 | jeos-main:investigate:last_good_tests:62553f401b66a1ec01fa037476113a1a42016150             | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
   100 | jeos-extratest:investigate:last_good_tests:f99df70fc4702425fc55668a06d45bc639bf5056        | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
   100 | jeos-fips:investigate:retry                                                                | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
    10 | memtest-poo152578                                                                          | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
     1 | jeos-extratest:investigate:last_good_build:2.37                                            | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
(15 rows)

So it could be that the original problem is gone (e.g. there was a problem in the test code), but the investigation jobs will continue to be triggered forever.
That's insane.

Actions #10

Updated by tinita 10 months ago

  • Status changed from Feedback to In Progress
Actions #11

Updated by tinita 10 months ago · Edited

As we discussed today, we would only stop the auto_clone if the result is incomplete and its $x direct ancestors are incomplete as well.
Any passed, failed or softfailed result in between "resets" the counter, and auto_clone starts from zero again.
That can of course still result in an endless loop if there is a progress ticket triggering a restart by ticket title via auto_review. We could still set an upper total limit.

An example of how the SQL could look; for demonstration I use a limit of 10.
The first job, 13114452, is user_cancelled, but I mocked it here because I couldn't find an incomplete that fits my example's needs. Its two direct ancestors are both failed, and the older ancestors are incomplete.
The second job has more than 10 direct incomplete ancestors.

openqa=> with recursive orig_id as (
      select (13114452)::bigint as orig_id, -1 as level, ('incomplete')::varchar as orig_result
         union all
      select id as orig_id, orig_id.level + 1 as level, result as orig_result from jobs join orig_id on orig_id.orig_id = jobs.clone_id where (10 < 0 or level < 10))
 select orig_id, level from orig_id where orig_result = 'incomplete' order by level asc;
 orig_id  | level 
----------+-------
 13114452 |    -1
 13089718 |     2         # <----- gap: missing 0 and 1 here
 13089696 |     3
 13089673 |     4
 13089651 |     5
 13089628 |     6
 13089606 |     7
 13089584 |     8
 13089559 |     9
 13089536 |    10
(10 rows)

openqa=> with recursive orig_id as (
      select (13084330)::bigint as orig_id, -1 as level, ('incomplete')::varchar as orig_result
         union all
      select id as orig_id, orig_id.level + 1 as level, result as orig_result from jobs join orig_id on orig_id.orig_id = jobs.clone_id where (10 < 0 or level < 10))
 select orig_id, level from orig_id where orig_result = 'incomplete' order by level asc;
 orig_id  | level 
----------+-------
 13084330 |    -1
 13084317 |     0
 13084302 |     1
 13084286 |     2
 13084271 |     3
 13084257 |     4
 13084244 |     5
 13084229 |     6
 13084213 |     7
 13084198 |     8
 13084184 |     9
 13084171 |    10
(12 rows)

Now I can select the rows in Perl, check if there is a gap in the level column, and stop counting at the gap.
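
A minimal Perl sketch of that gap check, assuming the rows from the query above ordered by level ascending (variable and key names are illustrative, not the actual PR code):

# count consecutive incompletes in the clone chain; a gap in "level" means a
# non-incomplete result in between, so counting stops there
my $consecutive = 0;
my $expected    = -1;    # the job itself is returned with level -1
for my $row (@rows) {
    last if $row->{level} != $expected;
    $consecutive++;
    $expected++;
}
my $incomplete_ancestors = $consecutive - 1;    # exclude the job itself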

Actions #12

Updated by tinita 10 months ago

Another suggestion was to check for the exact same reason in the direct ancestors.
That's currently not possible with what we provide as configuration.
E.g. we configure auto_clone_regex = ...|backend died: .*VNC.*(timeout|timed out|refused)|...
Now the actual error message is
backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.org:5901>: IO::Socket::INET: connect: Connection refused
We could easily check if a job has the exact same error, but the port is always changing, so we would have to mask the port in the original message with \d+. That is also easy to do, but we would need something generic for everything in auto_clone_regex.
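
A small Perl sketch of how the changing port could be masked before comparing reasons (purely illustrative; this is not something the current configuration supports):

# normalize the reason so that otherwise identical VNC errors compare equal,
# e.g. "<unreal6.qe.nue2.suse.org:5901>" becomes "<unreal6.qe.nue2.suse.org:PORT>"
my $masked = $reason =~ s/:\d+>/:PORT>/gr;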

Actions #13

Updated by tinita 10 months ago

Or even easier than the approach from note #11:

openqa=> with recursive orig_id as (
      select (13114452)::bigint as orig_id, -1 as level, ('incomplete')::varchar as orig_result
         union all
      select id as orig_id, orig_id.level + 1 as level, result as orig_result from jobs join orig_id on orig_id.orig_id = jobs.clone_id where result = 'incomplete' and (10 < 0 or level < 10))
 select orig_id, level from orig_id order by level desc;
 orig_id  | level 
----------+-------
 13114452 |    -1
(1 row)

openqa=> with recursive orig_id as (
      select (13084330)::bigint as orig_id, -1 as level, ('incomplete')::varchar as orig_result
         union all
      select id as orig_id, orig_id.level + 1 as level, result as orig_result from jobs join orig_id on orig_id.orig_id = jobs.clone_id where result = 'incomplete' and (10 < 0 or level < 10))
 select orig_id, level from orig_id order by level desc;
 orig_id  | level 
----------+-------
 13084171 |    10
 13084184 |     9
 13084198 |     8
 13084213 |     7
 13084229 |     6
 13084244 |     5
 13084257 |     4
 13084271 |     3
 13084286 |     2
 13084302 |     1
 13084317 |     0
 13084330 |    -1
(12 rows)
Actions #14

Updated by okurz 10 months ago

  • Due date changed from 2023-12-29 to 2024-01-12

Christmas vacation due date bump :)

Actions #15

Updated by tinita 10 months ago

I pushed an update to https://github.com/os-autoinst/openQA/pull/5397

Basically the SQL looks like this now:

with recursive orig_id as (
      select (13084330)::bigint as orig_id, -1 as level, ('incomplete')::varchar as orig_result, ('')::varchar as orig_reason
         union all
      select id as orig_id, orig_id.level + 1 as level, result as orig_result, reason as orig_reason from jobs join orig_id on orig_id.orig_id = jobs.clone_id where result = 'incomplete' and (10 < 0 or level < 10))
 select orig_id, level, orig_reason from orig_id order by level asc;

And then in Perl I can check all the returned reasons against the auto_clone_regex as well.

Still need to update the reason variable.
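
A minimal Perl sketch of how the returned rows could then be evaluated (variable names and the limit handling are illustrative, not the actual code from the PR):

# count how many jobs in the incomplete ancestor chain failed for a reason
# matching the configured auto_clone_regex
my $matching = grep { defined $_->{orig_reason} && $_->{orig_reason} =~ /$auto_clone_regex/ } @rows;

# only auto-restart the current job while we are still below the limit
my $restart_again = $matching < $auto_clone_limit;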

Actions #16

Updated by okurz 10 months ago

  • Priority changed from High to Normal

I checked current jobs on o3 and osd and currently found no jobs running or scheduled on o3 at all, so there are also no problematic indefinite reruns; reducing to "Normal" for now over the vacation period.

Actions #18

Updated by okurz 9 months ago

  • Status changed from In Progress to Workable
Actions #19

Updated by okurz 9 months ago

  • Status changed from Workable to In Progress

https://github.com/os-autoinst/openQA/pull/5397 approved, waiting for CI checks. As we have observed the original problem on OSD, we should await the next OSD deployment planned for tomorrow and then monitor
https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=Online&machine=svirt-xen-hvm&test=memtest&version=15-SP6#next_previous

Actions #20

Updated by tinita 9 months ago

It is currently happening again:

select id, test, substring(reason from 0 for 70) as reason_substr from jobs where t_finished >= '2024-01-10T00:00:00' and result = 'incomplete' and reason like '%unreal6%'  order by id desc;

https://openqa.suse.de/tests/13224772
So after https://github.com/os-autoinst/openQA/pull/5397 is merged, I can try to trigger the osd-deployment (#153325) and, once it is deployed, see if that re-cloning is stopped.

Actions #21

Updated by tinita 9 months ago

  • Status changed from In Progress to Feedback
Actions #22

Updated by tinita 9 months ago

  • Status changed from Feedback to Resolved

We can see that the feature works:
https://openqa.suse.de/tests/13230040
Reason: ... [Not restarting because job has been restarted already $auto_clone_limit times and failed with /$auto_clone_regex/]
Note that the job actually has been restarted anyway; here is the clone: https://openqa.suse.de/tests/13230379
But according to the audit log that restart was done by a human :)

Actions #23

Updated by tinita 9 months ago

We also have the other case btw:
https://openqa.suse.de/tests/13231785
Reason: Cache service queue already full (10) [Auto-restarting because reason matches /$auto_clone_regex/]

Actions #24

Updated by okurz 9 months ago

  • Copied to action #153475: Reconsider the formatting of variable-names in the reason field, e.g. "$auto_clone_regex" size:S added