action #152569
closedMany incomplete jobs endlessly restarted over several weeks size:M
Description
Observation¶
When investigating #152560 we noticed that there are also a lot of restarted incomplete jobs like this one:
https://openqa.suse.de/tests/13062217
Reason: backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.org:5901>: IO::Socket::INET: connect: Connection refused
Apparently there is an auto_clone_regex feature that will restart a job directly in openQA if the reason matches a certain regex.
But it doesn't make sense to restart the job thousands of times. I couldn't even find the original job (haven't tried the recursion feature yet).
In total I could find over 17k jobs with that error about unreal6.qe.nue2.suse.org
since mid november.
A symptom of having such huge restart/clone-chains is:
Dec 04 14:39:53 openqa openqa-gru[6326]: Deep recursion on subroutine "OpenQA::Schema::Result::Jobs::related_scheduled_product_id" at /usr/share/openqa/script/../lib/OpenQA/Schema/Result/Jobs.pm line 2016.
Acceptance Criteria¶
- AC1: Incomplete jobs are restarted up to n times at most (configurable)
Suggestions¶
- Implement a cap/limit on the automatic restarting of incomplete jobs
- Search for
auto_clone_regex
in the code repository to find the relevant starting point - Have a look into avoiding the deep recursion as well
Updated by tinita 11 months ago
- Related to action #152578: Many incompletes with "Error connecting to VNC server <unreal6.qe.nue2.suse.org:...>" size:M added
Updated by tinita 11 months ago
- Related to action #152560: [alert] Incomplete jobs (not restarted) of last 24h alert Salt added
Updated by tinita 11 months ago
https://github.com/os-autoinst/openQA/pull/5397 Limit number of auto_clone restarts
Now trying to find out if I can use postgresql recursion in OpenQA::Schema::Result::Jobs::related_scheduled_product_id
instead of using SQL queries from perl recursively.
Updated by tinita 11 months ago
https://github.com/os-autoinst/openQA/pull/5398 Use postgresql recursive feature for getting scheduled product
Updated by openqa_review 11 months ago
- Due date set to 2023-12-29
Setting due date based on mean cycle time of SUSE QE Tools
Updated by tinita 11 months ago · Edited
It should also be noted that also investigation jobs are auto cloned.
The incompletes of the last 2.5 days are exclusively investigations: (plus a memtest-poo152578
started by @okurz today)
openqa=> select count(id), test, substring(reason from 0 for 70) as reason_substr from jobs where t_finished >= '2023-12-17T00:00:00' and result = 'incomplete' and reason like '%unreal6%' group by reason_substr, test order by count(id) desc;
count | test | reason_substr
-------+--------------------------------------------------------------------------------------------+-----------------------------------------------------------------------
378 | jeos-base+sdk+desktop:investigate:last_good_tests:f99df70fc4702425fc55668a06d45bc639bf5056 | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
206 | jeos-filesystem:investigate:retry | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
205 | jeos-extratest:investigate:retry | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
205 | jeos-kdump:investigate:retry | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
204 | jeos-containers-docker:investigate:retry | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
107 | memtest:investigate:retry | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
106 | memtest:investigate:last_good_tests:3a3104f2ab3bc31d94191dc20635f191ef914fe2 | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
105 | jeos-base+sdk+desktop:investigate:retry | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
105 | jeos-containers-podman:investigate:retry | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
104 | jeos-fs_stress:investigate:retry | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
102 | jeos-main:investigate:last_good_tests:62553f401b66a1ec01fa037476113a1a42016150 | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
100 | jeos-extratest:investigate:last_good_tests:f99df70fc4702425fc55668a06d45bc639bf5056 | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
100 | jeos-fips:investigate:retry | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
10 | memtest-poo152578 | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
1 | jeos-extratest:investigate:last_good_build:2.37 | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
(15 rows)
So it could be that the original problem is gone (like there was a problem in the test code), but the investigation jobs will continue to be triggered forever.
That's insane.
Updated by tinita 11 months ago · Edited
As we discussed today, we would only stop the auto_clone if the result is incomplete and the $x direct ancestors are incomplete as well.
Any passed or failed or softfailed results in between will "reset", and auto_clone starts from zero.
That can of course still result in an endless loop if there is a progress ticket triggering a restart by ticket title via auto_review. We could still set an upper total limit.
An example how the SQL could look like. For demonstration I use a limit of 10.
The first job 13114452 is user_cancelled
, but I mocked it here because I couldn't find an incomplete that fits my example's needs. It's two ancestors are both failed
, and older ancestors are incomplete
.
The second job has more than 10 direct incomplete ancestors.
openqa=> with recursive orig_id as (
select (13114452)::bigint as orig_id, -1 as level, ('incomplete')::varchar as orig_result
union all
select id as orig_id, orig_id.level + 1 as level, result as orig_result from jobs join orig_id on orig_id.orig_id = jobs.clone_id where (10 < 0 or level < 10))
select orig_id, level from orig_id where orig_result = 'incomplete' order by level asc;
orig_id | level
----------+-------
13114452 | -1
13089718 | 2 # <----- gap: missing 0 and 1 here
13089696 | 3
13089673 | 4
13089651 | 5
13089628 | 6
13089606 | 7
13089584 | 8
13089559 | 9
13089536 | 10
(10 rows)
openqa=> with recursive orig_id as (
select (13084330)::bigint as orig_id, -1 as level, ('incomplete')::varchar as orig_result
union all
select id as orig_id, orig_id.level + 1 as level, result as orig_result from jobs join orig_id on orig_id.orig_id = jobs.clone_id where (10 < 0 or level < 10))
select orig_id, level from orig_id where orig_result = 'incomplete' order by level asc;
orig_id | level
----------+-------
13084330 | -1
13084317 | 0
13084302 | 1
13084286 | 2
13084271 | 3
13084257 | 4
13084244 | 5
13084229 | 6
13084213 | 7
13084198 | 8
13084184 | 9
13084171 | 10
(12 rows)
Now I can select the rows in perl and check if there is a gap in the level
column and stop counting.
Updated by tinita 11 months ago
Another suggestion was to check for the exact same reason in the direct ancestors.
That's currently not possible with what we provide as configuration.
e.g. we configure auto_clone_regex = ...|backend died: .*VNC.*(timeout|timed out|refused)|...
Now the actual error message is
backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.org:5901>: IO::Socket::INET: connect: Connection refused
We could easily check if a job has the exact same error, but the port is always changing, so we would have to mask the port in the original message with \d+
. Also easy to do, but we would need something generic for everything in auto_clone_regex
.
Updated by tinita 11 months ago
Or even easier than note 11:
openqa=> with recursive orig_id as (
select (13114452)::bigint as orig_id, -1 as level, ('incomplete')::varchar as orig_result
union all
select id as orig_id, orig_id.level + 1 as level, result as orig_result from jobs join orig_id on orig_id.orig_id = jobs.clone_id where result = 'incomplete' and (10 < 0 or level < 10))
select orig_id, level from orig_id order by level desc;
orig_id | level
----------+-------
13114452 | -1
(1 row)
openqa=> with recursive orig_id as (
select (13084330)::bigint as orig_id, -1 as level, ('incomplete')::varchar as orig_result
union all
select id as orig_id, orig_id.level + 1 as level, result as orig_result from jobs join orig_id on orig_id.orig_id = jobs.clone_id where result = 'incomplete' and (10 < 0 or level < 10))
select orig_id, level from orig_id order by level desc;
orig_id | level
----------+-------
13084171 | 10
13084184 | 9
13084198 | 8
13084213 | 7
13084229 | 6
13084244 | 5
13084257 | 4
13084271 | 3
13084286 | 2
13084302 | 1
13084317 | 0
13084330 | -1
(12 rows)
Updated by tinita 11 months ago
I pushed an update to https://github.com/os-autoinst/openQA/pull/5397
Basically the SQL looks like this now:
with recursive orig_id as (
select (13084330)::bigint as orig_id, -1 as level, ('incomplete')::varchar as orig_result, ('')::varchar as orig_reason
union all
select id as orig_id, orig_id.level + 1 as level, result as orig_result, reason as orig_reason from jobs join orig_id on orig_id.orig_id = jobs.clone_id where result = 'incomplete' and (10 < 0 or level < 10))
select orig_id, level, orig_reason from orig_id order by level asc;
And then in perl I can check all the returned reasons for the auto_clone_regex
as well.
Still need to update the reason
variable.
Updated by tinita 11 months ago
Ready for review: https://github.com/os-autoinst/openQA/pull/5397
Updated by okurz 11 months ago
- Status changed from Workable to In Progress
https://github.com/os-autoinst/openQA/pull/5397 approved, waiting for CI checks. As we have observed the original problem on OSD we should await the next OSD deployment planned for tomorrow and then monitor
https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=Online&machine=svirt-xen-hvm&test=memtest&version=15-SP6#next_previous
Updated by tinita 11 months ago
It currently happens again:
select id, test, substring(reason from 0 for 70) as reason_substr from jobs where t_finished >= '2024-01-10T00:00:00' and result = 'incomplete' and reason like '%unreal6%' order by id desc;
https://openqa.suse.de/tests/13224772
So after https://github.com/os-autoinst/openQA/pull/5397 is merged, I can try to trigger osd-deployment (#153325) and if it is deployed, see if that recloning is stopped.
Updated by tinita 11 months ago
- Status changed from Feedback to Resolved
We can see that the feature works:
https://openqa.suse.de/tests/13230040
Reason: ... [Not restarting because job has been restarted already $auto_clone_limit times and failed with /$auto_clone_regex/]
Note that the job actually has been restarted, here's the clone: https://openqa.suse.de/tests/13230379
but according to the audit log that was done by a human :)
Updated by tinita 11 months ago
We also have the other case btw:
https://openqa.suse.de/tests/13231785
Reason: Cache service queue already full (10) [Auto-restarting because reason matches /$auto_clone_regex/]
Updated by okurz 10 months ago
- Copied to action #153475: Reconsider the formatting of variable-names in the reason field, e.g. "$auto_clone_regex" size:S added