action #101283

Updated by okurz almost 2 years ago

## Observation 
 In #100859 mkittler triggered a vacuum of the OSD database, which kept OSD busy and (as expected) caused requests to time out. As a result, jobs ended up incomplete and were not restarted. 

 https://openqa.suse.de/tests/7489935 is one of those cases, with the reason "api failure: 408 response: Request Timeout". There is no autoinst-log.txt and no worker-log.txt, but I could find the corresponding section in the system journal on openqaworker-arm-3: 

 ``` 
 Oct 20 20:34:50 openqaworker-arm-3 worker[5368]: [debug] [pid:5368] REST-API call: POST http://openqa.suse.de/api/v1/jobs/7489935/status 
 Oct 20 20:34:50 openqaworker-arm-3 worker[5368]: [debug] [pid:5368] Upload concluded (at boot_ltp) 
 Oct 20 20:35:00 openqaworker-arm-3 worker[5368]: [debug] [pid:5368] REST-API call: POST http://openqa.suse.de/api/v1/jobs/7489935/status 
 Oct 20 20:35:01 openqaworker-arm-3 worker[5368]: [debug] [pid:5368] Upload concluded (at boot_ltp) 
 Oct 20 20:35:11 openqaworker-arm-3 worker[5368]: [debug] [pid:5368] REST-API call: POST http://openqa.suse.de/api/v1/jobs/7489935/status 
 Oct 20 20:41:45 openqaworker-arm-3 worker[5368]: [error] [pid:5368] REST-API error (POST http://openqa.suse.de/api/v1/jobs/7489935/status): 403 response: ti> 
 Oct 20 20:41:45 openqaworker-arm-3 worker[5368]: [debug] [pid:5368] Stopping job 7489935 from openqa.suse.de: 07489935-sle-15-SP4-Online-aarch64-Build52.1-l> 
 Oct 20 20:41:45 openqaworker-arm-3 worker[5368]: [debug] [pid:5368] REST-API call: POST http://openqa.suse.de/api/v1/jobs/7489935/status 
 Oct 20 20:41:45 openqaworker-arm-3 worker[5368]: [error] [pid:5368] Unable to make final image uploads. Maybe the web UI considers this job already dead. 
 Oct 20 20:41:45 openqaworker-arm-3 worker[5368]: [debug] [pid:5368] Upload concluded (at boot_ltp) 
 Oct 20 20:41:55 openqaworker-arm-3 worker[5368]: [info] [pid:5368] Test schedule has changed, reloading test_order.json 
 Oct 20 20:41:55 openqaworker-arm-3 worker[5368]: [debug] [pid:5368] REST-API call: POST http://openqa.suse.de/api/v1/jobs/7489935/status 
 Oct 20 20:43:35 openqaworker-arm-3 worker[5368]: [info] [pid:5368] Trying to stop job gracefully by announcing it to command server via http://localhost:201> 
 Oct 20 20:43:37 openqaworker-arm-3 worker[5368]: [info] [pid:5368] Isotovideo exit status: 1 
 Oct 20 20:43:37 openqaworker-arm-3 worker[5368]: [info] [pid:5368] +++ worker notes +++ 
 Oct 20 20:43:37 openqaworker-arm-3 worker[5368]: [info] [pid:5368] End time: 2021-10-20 18:43:37 
 Oct 20 20:43:37 openqaworker-arm-3 worker[5368]: [info] [pid:5368] Result: api-failure 
 Oct 20 20:44:05 openqaworker-arm-3 worker[5368]: [error] [pid:5368] REST-API error (POST http://openqa.suse.de/api/v1/jobs/7489935/status): 408 response: Re> 
 Oct 20 20:45:08 openqaworker-arm-3 worker[5368]: [debug] [pid:13464] Optimizing /var/lib/openqa/pool/13/testresults/boot_ltp-3.png 
 Oct 20 20:45:08 openqaworker-arm-3 worker[5368]: [debug] [pid:13464] Uploading artefact boot_ltp-3.png as 8b92b9db6e2ba4f1e75127b955e02b5d 
 Oct 20 20:45:09 openqaworker-arm-3 worker[5368]: [debug] [pid:13464] Optimizing /var/lib/openqa/pool/13/testresults/.thumbs/boot_ltp-3.png 
 Oct 20 20:45:09 openqaworker-arm-3 worker[5368]: [debug] [pid:13464] Uploading artefact boot_ltp-3.png as 8b92b9db6e2ba4f1e75127b955e02b5d 
 … 
 ``` 

 Note the two interesting lines about "REST-API error". 

 ## Acceptance criteria 
 * **AC1:** Incomplete jobs with "408 Request timeout" are automatically retriggered 
 * **AC2:** Jobs running into "408 Request timeout" internally retry for multiple minutes, e.g. long enough to cover 40 minutes of downtime of a central webUI instance 
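
 To illustrate what AC2 asks for, here is a minimal sketch of a time-bounded retry loop. It is not the openQA worker implementation (which is written in Perl); the function name `post_with_retry`, its parameters, the 10-second delay and the choice to retry only on 408 are all assumptions made for illustration. The 40-minute budget is taken from the example in AC2.

```python
import time
from typing import Callable, Tuple

# Assumption: only "408 Request Timeout" is treated as retryable in this sketch.
RETRYABLE_STATUS = {408}


def post_with_retry(
    send: Callable[[], int],
    max_wait_seconds: float = 40 * 60,  # budget from the AC2 example
    delay_seconds: float = 10,          # hypothetical pause between attempts
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[int, int]:
    """Call send() until it returns a non-retryable status or the budget is spent.

    send() is a hypothetical callable that performs one API request and
    returns the HTTP status code. Returns (last_status, attempts).
    """
    deadline = clock() + max_wait_seconds
    attempts = 0
    while True:
        attempts += 1
        status = send()
        if status not in RETRYABLE_STATUS:
            # Success or a non-retryable error: hand it back to the caller.
            return status, attempts
        if clock() + delay_seconds > deadline:
            # Another wait would exceed the budget: give up with the last status.
            return status, attempts
        sleep(delay_seconds)
```

 If the loop gives up while the status is still 408, the caller could mark the job as incomplete with a reason that the automatic retriggering from AC1 can match on.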

 ## Further details 
 Entrance-level issue
