action #152569

Updated by livdywan 5 months ago

## Observation 

 When investigating #152560 we noticed that there are also a lot of *restarted* incomplete jobs like this one: 
 https://openqa.suse.de/tests/13062217 
 ``` 
 Reason: backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.org:5901>: IO::Socket::INET: connect: Connection refused 
 ``` 
 Apparently there is an `auto_clone_regex` feature that restarts a job directly in openQA if its reason matches a certain regex. 

 But it doesn't make sense to restart the job thousands of times. I couldn't even find the original job (haven't tried the recursion feature yet). 

 In total I could find over 17k jobs with that error about `unreal6.qe.nue2.suse.org` since mid-November. 

 One symptom of such huge restart/clone chains is: 
 ``` 
 Dec 04 14:39:53 openqa openqa-gru[6326]: Deep recursion on subroutine "OpenQA::Schema::Result::Jobs::related_scheduled_product_id" at /usr/share/openqa/script/../lib/OpenQA/Schema/Result/Jobs.pm line 2016. 
 ``` 
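
 Perl emits this warning once a subroutine has nested more than 100 calls deep, so it is what one would expect if the chain of restarted jobs is walked recursively. As an illustration only (not openQA's actual code), here is a minimal Python sketch of walking such a chain with a loop instead. The `origin_of` mapping and the function name are hypothetical stand-ins for however the jobs are linked in the database:

```python
from typing import Dict, Optional, Tuple


def find_chain_root(origin_of: Dict[int, Optional[int]], job_id: int) -> Tuple[int, int]:
    """Walk from job_id back to the first job of its restart chain.

    origin_of is a hypothetical mapping: job id -> id of the job it was
    cloned from (None for the original job).
    Returns (root_job_id, number_of_restarts_in_between).
    """
    seen = {job_id}
    current = job_id
    restarts = 0
    # A plain loop keeps the call stack flat regardless of chain length.
    while (parent := origin_of.get(current)) is not None:
        if parent in seen:
            # Guard against corrupt data that would otherwise loop forever.
            raise ValueError(f"cycle in clone chain at job {parent}")
        seen.add(parent)
        current = parent
        restarts += 1
    return current, restarts


# Build an artificial chain of 20000 jobs, each cloned from the previous one.
origin_of: Dict[int, Optional[int]] = {1: None}
origin_of.update({i: i - 1 for i in range(2, 20001)})
root, restarts = find_chain_root(origin_of, 20000)
```

 With a chain of this length a naive recursive version would exceed Python's default recursion limit, while the loop only needs memory for the set of visited ids.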

 ## Acceptance Criteria 
 * **AC1**: Incomplete jobs are automatically restarted at most n times (n is configurable) 

 ## Suggestions 
 * Implement a cap/limit on the automatic restarting of incomplete jobs 
 * Search for `auto_clone_regex` in the code repository to find the relevant starting point 
 * Look into avoiding the deep recursion as well
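
 The cap from AC1 could look roughly like the following Python sketch. It is only meant to illustrate the logic, not how openQA implements `auto_clone_regex`. The `Job` class, its `restart_count` field, the `max_restarts` parameter and the example regex are all assumptions made for this example:

```python
import re
from dataclasses import dataclass


@dataclass
class Job:
    id: int
    reason: str
    restart_count: int = 0  # hypothetical: automatic restarts that led to this job


def should_auto_restart(job: Job, auto_clone_regex: str, max_restarts: int) -> bool:
    """Restart only if the reason matches and the cap is not reached yet."""
    if job.restart_count >= max_restarts:
        return False
    return re.search(auto_clone_regex, job.reason) is not None


def simulate(reason: str, auto_clone_regex: str, max_restarts: int) -> Job:
    """Simulate a job that fails with the same reason every time it runs."""
    job = Job(id=1, reason=reason)
    while should_auto_restart(job, auto_clone_regex, max_restarts):
        # The clone inherits the counter from its predecessor, plus one.
        job = Job(id=job.id + 1, reason=reason, restart_count=job.restart_count + 1)
    return job


# Example regex chosen for illustration; the real configured value may differ.
regex = r"Error connecting to VNC server"
reason = (
    "backend died: Error connecting to VNC server "
    "<unreal6.qe.nue2.suse.org:5901>: IO::Socket::INET: connect: Connection refused"
)
last_job = simulate(reason, regex, max_restarts=3)
```

 Carrying the counter along with each clone means the check needs only the current job, so the limit can be enforced without walking the whole chain.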
