action #152569
Status: closed
Many incomplete jobs endlessly restarted over several weeks size:M
Description
Observation
When investigating #152560 we noticed that there are also a lot of restarted incomplete jobs like this one:
https://openqa.suse.de/tests/13062217
Reason: backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.org:5901>: IO::Socket::INET: connect: Connection refused
Apparently there is an auto_clone_regex feature that will restart a job directly in openQA if the reason matches a certain regex.
But it doesn't make sense to restart the job thousands of times. I couldn't even find the original job (haven't tried the recursion feature yet).
In total I could find over 17k jobs with that error about unreal6.qe.nue2.suse.org since mid-November.
A symptom of having such huge restart/clone-chains is:
Dec 04 14:39:53 openqa openqa-gru[6326]: Deep recursion on subroutine "OpenQA::Schema::Result::Jobs::related_scheduled_product_id" at /usr/share/openqa/script/../lib/OpenQA/Schema/Result/Jobs.pm line 2016.
Acceptance Criteria
- AC1: Incomplete jobs are restarted up to n times at most (configurable)
Suggestions
- Implement a cap/limit on the automatic restarting of incomplete jobs
- Search for auto_clone_regex in the code repository to find the relevant starting point
- Have a look into avoiding the deep recursion as well
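For orientation, a hedged sketch of how the configuration could look in openqa.ini: auto_clone_regex is the existing key, while auto_clone_limit mirrors the $auto_clone_limit variable quoted in the resolution comment further down; treat the exact key name and default value here as illustrative.

```ini
; Illustrative [global] section of /etc/openqa/openqa.ini (values are examples)
[global]
; restart incompletes whose reason matches this regex
auto_clone_regex = backend died: .*VNC.*(timeout|timed out|refused)
; proposed cap on consecutive automatic restarts (name/default illustrative)
auto_clone_limit = 20
```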
Updated by tinita 8 months ago
- Related to action #152578: Many incompletes with "Error connecting to VNC server <unreal6.qe.nue2.suse.org:...>" size:M added
Updated by tinita 8 months ago
- Related to action #152560: [alert] Incomplete jobs (not restarted) of last 24h alert Salt added
Updated by tinita 8 months ago
https://github.com/os-autoinst/openQA/pull/5397 Limit number of auto_clone restarts
Now trying to find out if I can use postgresql recursion in OpenQA::Schema::Result::Jobs::related_scheduled_product_id
instead of using SQL queries from perl recursively.
Updated by tinita 8 months ago
https://github.com/os-autoinst/openQA/pull/5398 Use postgresql recursive feature for getting scheduled product
Updated by openqa_review 8 months ago
- Due date set to 2023-12-29
Setting due date based on mean cycle time of SUSE QE Tools
Updated by tinita 8 months ago · Edited
It should also be noted that investigation jobs are auto-cloned as well.
The incompletes of the last 2.5 days are exclusively investigation jobs (plus a memtest-poo152578 started by @okurz today):
openqa=> select count(id), test, substring(reason from 0 for 70) as reason_substr from jobs where t_finished >= '2023-12-17T00:00:00' and result = 'incomplete' and reason like '%unreal6%' group by reason_substr, test order by count(id) desc;
count | test | reason_substr
-------+--------------------------------------------------------------------------------------------+-----------------------------------------------------------------------
378 | jeos-base+sdk+desktop:investigate:last_good_tests:f99df70fc4702425fc55668a06d45bc639bf5056 | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
206 | jeos-filesystem:investigate:retry | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
205 | jeos-extratest:investigate:retry | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
205 | jeos-kdump:investigate:retry | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
204 | jeos-containers-docker:investigate:retry | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
107 | memtest:investigate:retry | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
106 | memtest:investigate:last_good_tests:3a3104f2ab3bc31d94191dc20635f191ef914fe2 | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
105 | jeos-base+sdk+desktop:investigate:retry | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
105 | jeos-containers-podman:investigate:retry | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
104 | jeos-fs_stress:investigate:retry | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
102 | jeos-main:investigate:last_good_tests:62553f401b66a1ec01fa037476113a1a42016150 | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
100 | jeos-extratest:investigate:last_good_tests:f99df70fc4702425fc55668a06d45bc639bf5056 | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
100 | jeos-fips:investigate:retry | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
10 | memtest-poo152578 | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
1 | jeos-extratest:investigate:last_good_build:2.37 | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
(15 rows)
So it could be that the original problem is gone (e.g. it was a problem in the test code), but the investigation jobs would continue to be triggered forever.
That's insane.
Updated by tinita 8 months ago · Edited
As we discussed today, we would only stop the auto_clone if the result is incomplete and the last $x direct ancestors are incomplete as well.
Any passed or failed or softfailed results in between will "reset", and auto_clone starts from zero.
That can of course still result in an endless loop if there is a progress ticket triggering a restart by ticket title via auto_review. We could still set an upper total limit.
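The rule described above (count consecutive incomplete direct ancestors, reset on any other result) can be sketched as follows. This is an illustrative Python model, not the Perl code openQA actually uses, and the function names are made up:

```python
# Illustrative sketch of the proposed counting rule (hypothetical names,
# not the actual openQA implementation): walk the clone chain from newest
# to oldest and count *consecutive* incomplete ancestors. Any passed,
# failed, or softfailed ancestor resets the count.

def consecutive_incompletes(ancestor_results):
    """ancestor_results: results of direct ancestors, newest first."""
    count = 0
    for result in ancestor_results:
        if result != 'incomplete':
            break  # a non-incomplete ancestor resets the chain
        count += 1
    return count

def should_auto_clone(job_result, ancestor_results, limit=10):
    """Restart only while the chain of incompletes is below the limit."""
    if job_result != 'incomplete':
        return False
    return consecutive_incompletes(ancestor_results) < limit
```

With limit=10, a job whose last ten ancestors are all incomplete is no longer restarted, while a single passed run anywhere in between starts the count over.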
An example of how the SQL could look. For demonstration I use a limit of 10.
The first job 13114452 is user_cancelled, but I mocked it here because I couldn't find an incomplete that fits my example's needs. Its two ancestors are both failed, and older ancestors are incomplete.
The second job has more than 10 direct incomplete ancestors.
openqa=> with recursive orig_id as (
select (13114452)::bigint as orig_id, -1 as level, ('incomplete')::varchar as orig_result
union all
select id as orig_id, orig_id.level + 1 as level, result as orig_result from jobs join orig_id on orig_id.orig_id = jobs.clone_id where (10 < 0 or level < 10))
select orig_id, level from orig_id where orig_result = 'incomplete' order by level asc;
orig_id | level
----------+-------
13114452 | -1
13089718 | 2 # <----- gap: missing 0 and 1 here
13089696 | 3
13089673 | 4
13089651 | 5
13089628 | 6
13089606 | 7
13089584 | 8
13089559 | 9
13089536 | 10
(10 rows)
openqa=> with recursive orig_id as (
select (13084330)::bigint as orig_id, -1 as level, ('incomplete')::varchar as orig_result
union all
select id as orig_id, orig_id.level + 1 as level, result as orig_result from jobs join orig_id on orig_id.orig_id = jobs.clone_id where (10 < 0 or level < 10))
select orig_id, level from orig_id where orig_result = 'incomplete' order by level asc;
orig_id | level
----------+-------
13084330 | -1
13084317 | 0
13084302 | 1
13084286 | 2
13084271 | 3
13084257 | 4
13084244 | 5
13084229 | 6
13084213 | 7
13084198 | 8
13084184 | 9
13084171 | 10
(12 rows)
Now I can select the rows in Perl and check if there is a gap in the level column and stop counting.
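The gap check could be sketched like this (illustrative Python with a made-up helper name; openQA itself does this in Perl). The query returns the job itself at level -1 and only the incomplete ancestors, so a missing level means a non-incomplete ancestor sat there and counting should stop:

```python
def count_until_gap(levels):
    """levels: level values from the recursive query, with the job itself
    at level -1. Count incomplete ancestors until the first gap."""
    count = 0
    expected = -1
    for level in sorted(levels):
        if level != expected:
            break  # gap: a non-incomplete ancestor was filtered out here
        if level >= 0:  # don't count the job itself
            count += 1
        expected += 1
    return count
```

For the first example above ([-1, 2, 3, ...]) this yields 0 consecutive incomplete ancestors; for the second ([-1, 0, 1, ..., 10]) it yields 11.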
Updated by tinita 8 months ago
Another suggestion was to check for the exact same reason in the direct ancestors.
That's currently not possible with what we provide as configuration.
e.g. we configure auto_clone_regex = ...|backend died: .*VNC.*(timeout|timed out|refused)|...
Now the actual error message is
backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.org:5901>: IO::Socket::INET: connect: Connection refused
We could easily check if a job has the exact same error, but the port is always changing, so we would have to mask the port in the original message with \d+. Also easy to do, but we would need something generic for everything in auto_clone_regex.
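The masking idea could be sketched like this (illustrative Python, an assumed approach rather than the merged implementation; the helper name and placeholder token are made up). Replacing the variable port with a fixed token makes reasons of ancestors comparable for equality:

```python
import re

# Sketch (assumed approach): the VNC port changes on every run, so mask
# ":<digits>" with a fixed token before comparing a job's reason with its
# ancestors' reasons.
def normalize_reason(reason):
    return re.sub(r':\d+', ':<port>', reason)

same = normalize_reason(
    'backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.org:5901>: '
    'IO::Socket::INET: connect: Connection refused'
) == normalize_reason(
    'backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.org:5906>: '
    'IO::Socket::INET: connect: Connection refused'
)
```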
Updated by tinita 8 months ago
Or even easier than note 11:
openqa=> with recursive orig_id as (
select (13114452)::bigint as orig_id, -1 as level, ('incomplete')::varchar as orig_result
union all
select id as orig_id, orig_id.level + 1 as level, result as orig_result from jobs join orig_id on orig_id.orig_id = jobs.clone_id where result = 'incomplete' and (10 < 0 or level < 10))
select orig_id, level from orig_id order by level desc;
orig_id | level
----------+-------
13114452 | -1
(1 row)
openqa=> with recursive orig_id as (
select (13084330)::bigint as orig_id, -1 as level, ('incomplete')::varchar as orig_result
union all
select id as orig_id, orig_id.level + 1 as level, result as orig_result from jobs join orig_id on orig_id.orig_id = jobs.clone_id where result = 'incomplete' and (10 < 0 or level < 10))
select orig_id, level from orig_id order by level desc;
orig_id | level
----------+-------
13084171 | 10
13084184 | 9
13084198 | 8
13084213 | 7
13084229 | 6
13084244 | 5
13084257 | 4
13084271 | 3
13084286 | 2
13084302 | 1
13084317 | 0
13084330 | -1
(12 rows)
Updated by tinita 8 months ago
I pushed an update to https://github.com/os-autoinst/openQA/pull/5397
Basically the SQL looks like this now:
with recursive orig_id as (
select (13084330)::bigint as orig_id, -1 as level, ('incomplete')::varchar as orig_result, ('')::varchar as orig_reason
union all
select id as orig_id, orig_id.level + 1 as level, result as orig_result, reason as orig_reason from jobs join orig_id on orig_id.orig_id = jobs.clone_id where result = 'incomplete' and (10 < 0 or level < 10))
select orig_id, level, orig_reason from orig_id order by level asc;
And then in Perl I can check all the returned reasons for the auto_clone_regex as well.
Still need to update the reason variable.
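That final Perl-side check could be modeled like this (illustrative Python, not the actual implementation; the regex is taken from the auto_clone_regex example earlier in this ticket and the limit of 10 matches the SQL demonstration):

```python
import re

# Illustrative model: the recursive CTE returns (level, reason) rows, with
# the job itself at level -1. Count ancestor reasons matching the configured
# auto_clone_regex and compare against the limit.
AUTO_CLONE_REGEX = re.compile(r'backend died: .*VNC.*(timeout|timed out|refused)')

def restarts_exhausted(rows, limit=10):
    """rows: (level, reason) tuples from the recursive query."""
    matching = sum(1 for level, reason in rows
                   if level >= 0 and AUTO_CLONE_REGEX.search(reason))
    return matching >= limit
```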
Updated by tinita 7 months ago
Ready for review: https://github.com/os-autoinst/openQA/pull/5397
Updated by okurz 7 months ago
- Status changed from Workable to In Progress
https://github.com/os-autoinst/openQA/pull/5397 approved, waiting for CI checks. As we have observed the original problem on OSD, we should await the next OSD deployment planned for tomorrow and then monitor
https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=Online&machine=svirt-xen-hvm&test=memtest&version=15-SP6#next_previous
Updated by tinita 7 months ago
It currently happens again:
select id, test, substring(reason from 0 for 70) as reason_substr from jobs where t_finished >= '2024-01-10T00:00:00' and result = 'incomplete' and reason like '%unreal6%' order by id desc;
https://openqa.suse.de/tests/13224772
So after https://github.com/os-autoinst/openQA/pull/5397 is merged, I can try to trigger osd-deployment (#153325) and if it is deployed, see if that recloning is stopped.
Updated by tinita 7 months ago
- Status changed from Feedback to Resolved
We can see that the feature works:
https://openqa.suse.de/tests/13230040
Reason: ... [Not restarting because job has been restarted already $auto_clone_limit times and failed with /$auto_clone_regex/]
Note that the job actually has been restarted, here's the clone: https://openqa.suse.de/tests/13230379
but according to the audit log that was done by a human :)
Updated by tinita 7 months ago
We also have the other case btw:
https://openqa.suse.de/tests/13231785
Reason: Cache service queue already full (10) [Auto-restarting because reason matches /$auto_clone_regex/]
Updated by okurz 7 months ago
- Copied to action #153475: Reconsider the formatting of variable-names in the reason field, e.g. "$auto_clone_regex" size:S added