action #152569
Status: closed
Many incomplete jobs endlessly restarted over several weeks size:M
Description
Observation
When investigating #152560 we noticed that there are also a lot of restarted incomplete jobs like this one:
https://openqa.suse.de/tests/13062217
Reason: backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.org:5901>: IO::Socket::INET: connect: Connection refused
Apparently there is an auto_clone_regex feature that will restart a job directly in openQA if the reason matches a certain regex.
But it doesn't make sense to restart the job thousands of times. I couldn't even find the original job (haven't tried the recursion feature yet).
In total I could find over 17k jobs with that error about unreal6.qe.nue2.suse.org since mid-November.
A symptom of having such huge restart/clone-chains is:
Dec 04 14:39:53 openqa openqa-gru[6326]: Deep recursion on subroutine "OpenQA::Schema::Result::Jobs::related_scheduled_product_id" at /usr/share/openqa/script/../lib/OpenQA/Schema/Result/Jobs.pm line 2016.
Acceptance Criteria
- AC1: Incomplete jobs are restarted up to n times at most (configurable)
Suggestions
- Implement a cap/limit on the automatic restarting of incomplete jobs
- Search for auto_clone_regex in the code repository to find the relevant starting point
- Have a look into avoiding the deep recursion as well
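For orientation, a hedged sketch of how the configuration could look in openqa.ini: auto_clone_regex is the existing key, while auto_clone_limit mirrors the $auto_clone_limit variable quoted in the resolution comment further down; treat the exact key name and default value here as illustrative.

```ini
; Illustrative [global] section of /etc/openqa/openqa.ini (values are examples)
[global]
; restart incompletes whose reason matches this regex
auto_clone_regex = backend died: .*VNC.*(timeout|timed out|refused)
; proposed cap on consecutive automatic restarts (name/default illustrative)
auto_clone_limit = 20
```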
Updated by tinita 8 months ago
- Related to action #152578: Many incompletes with "Error connecting to VNC server <unreal6.qe.nue2.suse.org:...>" size:M added
Updated by tinita 8 months ago
- Related to action #152560: [alert] Incomplete jobs (not restarted) of last 24h alert Salt added
Updated by tinita 8 months ago
https://github.com/os-autoinst/openQA/pull/5397 Limit number of auto_clone restarts
Now trying to find out if I can use postgresql recursion in OpenQA::Schema::Result::Jobs::related_scheduled_product_id
instead of using SQL queries from perl recursively.
Updated by tinita 8 months ago
https://github.com/os-autoinst/openQA/pull/5398 Use postgresql recursive feature for getting scheduled product
Updated by openqa_review 8 months ago
- Due date set to 2023-12-29
Setting due date based on mean cycle time of SUSE QE Tools
Updated by tinita 8 months ago · Edited
It should also be noted that investigation jobs are auto-cloned as well.
The incompletes of the last 2.5 days are exclusively investigation jobs (plus a memtest-poo152578 started by @okurz today):
openqa=> select count(id), test, substring(reason from 0 for 70) as reason_substr from jobs where t_finished >= '2023-12-17T00:00:00' and result = 'incomplete' and reason like '%unreal6%' group by reason_substr, test order by count(id) desc;
count | test | reason_substr
-------+--------------------------------------------------------------------------------------------+-----------------------------------------------------------------------
378 | jeos-base+sdk+desktop:investigate:last_good_tests:f99df70fc4702425fc55668a06d45bc639bf5056 | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
206 | jeos-filesystem:investigate:retry | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
205 | jeos-extratest:investigate:retry | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
205 | jeos-kdump:investigate:retry | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
204 | jeos-containers-docker:investigate:retry | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
107 | memtest:investigate:retry | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
106 | memtest:investigate:last_good_tests:3a3104f2ab3bc31d94191dc20635f191ef914fe2 | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
105 | jeos-base+sdk+desktop:investigate:retry | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
105 | jeos-containers-podman:investigate:retry | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
104 | jeos-fs_stress:investigate:retry | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
102 | jeos-main:investigate:last_good_tests:62553f401b66a1ec01fa037476113a1a42016150 | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
100 | jeos-extratest:investigate:last_good_tests:f99df70fc4702425fc55668a06d45bc639bf5056 | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
100 | jeos-fips:investigate:retry | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
10 | memtest-poo152578 | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
1 | jeos-extratest:investigate:last_good_build:2.37 | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
(15 rows)
So it could be that the original problem is gone (e.g. it was a problem in the test code), but the investigation jobs would continue to be triggered forever.
That's insane.
Updated by tinita 8 months ago · Edited
As we discussed today, we would only stop the auto_clone if the result is incomplete and the last $x direct ancestors are incomplete as well.
Any passed or failed or softfailed results in between will "reset", and auto_clone starts from zero.
That can of course still result in an endless loop if there is a progress ticket triggering a restart by ticket title via auto_review. We could still set an upper total limit.
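The rule described above (count consecutive incomplete direct ancestors, reset on any other result) can be sketched as follows. This is an illustrative Python model, not the Perl code openQA actually uses, and the function names are made up:

```python
# Illustrative sketch of the proposed counting rule (hypothetical names,
# not the actual openQA implementation): walk the clone chain from newest
# to oldest and count *consecutive* incomplete ancestors. Any passed,
# failed, or softfailed ancestor resets the count.

def consecutive_incompletes(ancestor_results):
    """ancestor_results: results of direct ancestors, newest first."""
    count = 0
    for result in ancestor_results:
        if result != 'incomplete':
            break  # a non-incomplete ancestor resets the chain
        count += 1
    return count

def should_auto_clone(job_result, ancestor_results, limit=10):
    """Restart only while the chain of incompletes is below the limit."""
    if job_result != 'incomplete':
        return False
    return consecutive_incompletes(ancestor_results) < limit
```

With limit=10, a job whose last ten ancestors are all incomplete is no longer restarted, while a single passed run anywhere in between starts the count over.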
An example of how the SQL could look. For demonstration I use a limit of 10.
The first job 13114452 is user_cancelled, but I mocked it here because I couldn't find an incomplete that fits my example's needs. Its two ancestors are both failed, and older ancestors are incomplete.
The second job has more than 10 direct incomplete ancestors.
openqa=> with recursive orig_id as (
select (13114452)::bigint as orig_id, -1 as level, ('incomplete')::varchar as orig_result
union all
select id as orig_id, orig_id.level + 1 as level, result as orig_result from jobs join orig_id on orig_id.orig_id = jobs.clone_id where (10 < 0 or level < 10))
select orig_id, level from orig_id where orig_result = 'incomplete' order by level asc;
orig_id | level
----------+-------
13114452 | -1
13089718 | 2 # <----- gap: missing 0 and 1 here
13089696 | 3
13089673 | 4
13089651 | 5
13089628 | 6
13089606 | 7
13089584 | 8
13089559 | 9
13089536 | 10
(10 rows)
openqa=> with recursive orig_id as (
select (13084330)::bigint as orig_id, -1 as level, ('incomplete')::varchar as orig_result
union all
select id as orig_id, orig_id.level + 1 as level, result as orig_result from jobs join orig_id on orig_id.orig_id = jobs.clone_id where (10 < 0 or level < 10))
select orig_id, level from orig_id where orig_result = 'incomplete' order by level asc;
orig_id | level
----------+-------
13084330 | -1
13084317 | 0
13084302 | 1
13084286 | 2
13084271 | 3
13084257 | 4
13084244 | 5
13084229 | 6
13084213 | 7
13084198 | 8
13084184 | 9
13084171 | 10
(12 rows)
Now I can select the rows in Perl and check if there is a gap in the level column and stop counting.
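The gap check could be sketched like this (illustrative Python with a made-up helper name; openQA itself does this in Perl). The query returns the job itself at level -1 and only the incomplete ancestors, so a missing level means a non-incomplete ancestor sat there and counting should stop:

```python
def count_until_gap(levels):
    """levels: level values from the recursive query, with the job itself
    at level -1. Count incomplete ancestors until the first gap."""
    count = 0
    expected = -1
    for level in sorted(levels):
        if level != expected:
            break  # gap: a non-incomplete ancestor was filtered out here
        if level >= 0:  # don't count the job itself
            count += 1
        expected += 1
    return count
```

For the first example above ([-1, 2, 3, ...]) this yields 0 consecutive incomplete ancestors; for the second ([-1, 0, 1, ..., 10]) it yields 11.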
Updated by tinita 8 months ago
Another suggestion was to check for the exact same reason in the direct ancestors.
That's currently not possible with what we provide as configuration.
e.g. we configure auto_clone_regex = ...|backend died: .*VNC.*(timeout|timed out|refused)|...
Now the actual error message is
backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.org:5901>: IO::Socket::INET: connect: Connection refused
We could easily check if a job has the exact same error, but the port is always changing, so we would have to mask the port in the original message with \d+. Also easy to do, but we would need something generic for everything in auto_clone_regex.
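The masking idea could be sketched like this (illustrative Python, an assumed approach rather than the merged implementation; the helper name and placeholder token are made up). Replacing the variable port with a fixed token makes reasons of ancestors comparable for equality:

```python
import re

# Sketch (assumed approach): the VNC port changes on every run, so mask
# ":<digits>" with a fixed token before comparing a job's reason with its
# ancestors' reasons.
def normalize_reason(reason):
    return re.sub(r':\d+', ':<port>', reason)

same = normalize_reason(
    'backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.org:5901>: '
    'IO::Socket::INET: connect: Connection refused'
) == normalize_reason(
    'backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.org:5906>: '
    'IO::Socket::INET: connect: Connection refused'
)
```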
Updated by tinita 8 months ago
Or even easier than note 11:
openqa=> with recursive orig_id as (
select (13114452)::bigint as orig_id, -1 as level, ('incomplete')::varchar as orig_result
union all
select id as orig_id, orig_id.level + 1 as level, result as orig_result from jobs join orig_id on orig_id.orig_id = jobs.clone_id where result = 'incomplete' and (10 < 0 or level < 10))
select orig_id, level from orig_id order by level desc;
orig_id | level
----------+-------
13114452 | -1
(1 row)
openqa=> with recursive orig_id as (
select (13084330)::bigint as orig_id, -1 as level, ('incomplete')::varchar as orig_result
union all
select id as orig_id, orig_id.level + 1 as level, result as orig_result from jobs join orig_id on orig_id.orig_id = jobs.clone_id where result = 'incomplete' and (10 < 0 or level < 10))
select orig_id, level from orig_id order by level desc;
orig_id | level
----------+-------
13084171 | 10
13084184 | 9
13084198 | 8
13084213 | 7
13084229 | 6
13084244 | 5
13084257 | 4
13084271 | 3
13084286 | 2
13084302 | 1
13084317 | 0
13084330 | -1
(12 rows)
Updated by tinita 8 months ago
I pushed an update to https://github.com/os-autoinst/openQA/pull/5397
Basically the SQL looks like this now:
with recursive orig_id as (
select (13084330)::bigint as orig_id, -1 as level, ('incomplete')::varchar as orig_result, ('')::varchar as orig_reason
union all
select id as orig_id, orig_id.level + 1 as level, result as orig_result, reason as orig_reason from jobs join orig_id on orig_id.orig_id = jobs.clone_id where result = 'incomplete' and (10 < 0 or level < 10))
select orig_id, level, orig_reason from orig_id order by level asc;
And then in Perl I can check all the returned reasons for the auto_clone_regex as well.
Still need to update the reason variable.
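That final Perl-side check could be modeled like this (illustrative Python, not the actual implementation; the regex is taken from the auto_clone_regex example earlier in this ticket and the limit of 10 matches the SQL demonstration):

```python
import re

# Illustrative model: the recursive CTE returns (level, reason) rows, with
# the job itself at level -1. Count ancestor reasons matching the configured
# auto_clone_regex and compare against the limit.
AUTO_CLONE_REGEX = re.compile(r'backend died: .*VNC.*(timeout|timed out|refused)')

def restarts_exhausted(rows, limit=10):
    """rows: (level, reason) tuples from the recursive query."""
    matching = sum(1 for level, reason in rows
                   if level >= 0 and AUTO_CLONE_REGEX.search(reason))
    return matching >= limit
```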
Updated by tinita 7 months ago
Ready for review: https://github.com/os-autoinst/openQA/pull/5397
Updated by okurz 7 months ago
- Status changed from Workable to In Progress
https://github.com/os-autoinst/openQA/pull/5397 approved, waiting for CI checks. As we have observed the original problem on OSD, we should await the next OSD deployment planned for tomorrow and then monitor
https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=Online&machine=svirt-xen-hvm&test=memtest&version=15-SP6#next_previous
Updated by tinita 7 months ago
It currently happens again:
select id, test, substring(reason from 0 for 70) as reason_substr from jobs where t_finished >= '2024-01-10T00:00:00' and result = 'incomplete' and reason like '%unreal6%' order by id desc;
https://openqa.suse.de/tests/13224772
So after https://github.com/os-autoinst/openQA/pull/5397 is merged, I can try to trigger osd-deployment (#153325) and if it is deployed, see if that recloning is stopped.
Updated by tinita 7 months ago
- Status changed from Feedback to Resolved
We can see that the feature works:
https://openqa.suse.de/tests/13230040
Reason: ... [Not restarting because job has been restarted already $auto_clone_limit times and failed with /$auto_clone_regex/]
Note that the job actually has been restarted, here's the clone: https://openqa.suse.de/tests/13230379
but according to the audit log that was done by a human :)
Updated by tinita 7 months ago
We also have the other case btw:
https://openqa.suse.de/tests/13231785
Reason: Cache service queue already full (10) [Auto-restarting because reason matches /$auto_clone_regex/]
Updated by okurz 7 months ago
- Copied to action #153475: Reconsider the formatting of variable-names in the reason field, e.g. "$auto_clone_regex" size:S added