action #69274

Updated by mkittler almost 4 years ago

### observation 

 The worker log looks like this: 

 ``` 
 juil. 23 10:10:06 siodtw01 worker[2608]: [info] Accepting job 1339790 from queue 
 juil. 23 10:10:06 siodtw01 worker[2608]: [error] Unable to accept job 1339790 because the websocket connection to https://openqa.opensuse.org has been lost. 
 juil. 23 10:10:06 siodtw01 worker[2608]: [info] Skipping job 1339792 from queue (parent faild with result api-failure) 
 juil. 23 10:10:06 siodtw01 worker[2608]: [info] Skipping job 1339793 from queue (parent faild with result skipped) 
 juil. 23 10:10:06 siodtw01 worker[2608]: [info] Skipping job 1339791 from queue (parent faild with result skipped) 
 juil. 23 10:10:06 siodtw01 worker[2608]: [info] Skipping job 1339794 from queue (parent faild with result skipped) 
 ``` 

 However, the parent hasn't actually failed. The worker log is indeed full of connection/API errors: 

 ``` 
 [2020-07-23T10:09:18.0050 CEST] [warn] Failed to register at http://192.168.0.28 - connection error: No route to host - trying again in 10 seconds 
 [2020-07-23T10:09:24.0154 CEST] [info] Registering with openQA https://openqa.opensuse.org 
 [2020-07-23T10:09:24.0327 CEST] [info] Establishing ws connection via wss://openqa.opensuse.org/api/v1/ws/258 
 [2020-07-23T10:09:24.0464 CEST] [info] Registered and connected via websockets with openQA host https://openqa.opensuse.org and worker ID 258 
 [2020-07-23T10:09:28.0055 CEST] [info] Registering with openQA http://192.168.0.28 
 [2020-07-23T10:09:31.0170 CEST] [warn] Failed to register at http://192.168.0.28 - connection error: No route to host - trying again in 10 seconds 
 [2020-07-23T10:09:41.0181 CEST] [info] Registering with openQA http://192.168.0.28 
 [2020-07-23T10:09:44.0290 CEST] [warn] Failed to register at http://192.168.0.28 - connection error: No route to host - trying again in 10 seconds 
 ``` 

 However, none of these errors seems to have made the parent fail. Likely an API error happened but was not fatal after all.

 Example job (parent): https://openqa.opensuse.org/tests/1339789 

 ### problems 

 * The openQA worker apparently does not clear the error state as needed and therefore wrongly skips the directly chained job.
 * The subsequent log lines report the result "skipped" instead of "api-failure", which also seems odd.
 * There's a typo in "failed". 
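
One possible reading of the log above can be modelled in a few lines. The sketch below is a toy model in Python, not the actual openQA worker code (which is written in Perl). `run_queue`, `stale_error` and `clear_stale_error` are hypothetical names, and treating the jobs as a simple linear chain is an assumption. It shows how an error state left over from a connection loss could make the first job end with "api-failure" and every later job in the chain be skipped because its parent did not pass.

```python
def run_queue(job_ids, stale_error, clear_stale_error):
    """Return {job_id: result} for a toy chain of directly chained jobs.

    stale_error: error left over from an earlier connection loss (assumed).
    clear_stale_error: whether the toy worker resets that state before
    it starts accepting jobs from the queue.
    """
    # Hypothetical sticky error state; the buggy variant never resets it.
    error = None if clear_stale_error else stale_error
    results = {}
    parent_result = None  # the first job in the chain has no parent here
    for job_id in job_ids:
        if parent_result not in (None, "passed"):
            # A child is skipped whenever its parent did not pass.
            results[job_id] = "skipped"
        elif error is not None:
            # The stale error prevents the job from being accepted.
            results[job_id] = error
        else:
            results[job_id] = "passed"
        parent_result = results[job_id]
    return results


chain = [1339790, 1339792, 1339793]  # job IDs taken from the log excerpt
buggy = run_queue(chain, "api-failure", clear_stale_error=False)
fixed = run_queue(chain, "api-failure", clear_stale_error=True)
```

In this model the first skipped child sees a parent result of "api-failure" and the later ones see "skipped", which would match the pattern of the log lines. Whether the real worker behaves this way is an assumption that needs to be checked against the worker code.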

 ### suggestions 

 * Investigate the worker code. 
 * Try to reproduce the scenario within unit tests.
 * Provide a fix for the problems.
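
As a rough illustration of what such a unit test could check, here is a minimal sketch using Python's `unittest`. It does not use the real openQA worker or its test suite: `FakeConnection` and `ToyWorker` are hypothetical stand-ins, and the rule that a successful accept clears the stale error is the assumed desired behaviour, not something confirmed by the worker code.

```python
import unittest


class FakeConnection:
    """Hypothetical stand-in for the worker's websocket connection."""

    def __init__(self):
        self.connected = True

    def drop(self):
        self.connected = False

    def restore(self):
        self.connected = True


class ToyWorker:
    """Hypothetical worker model; not the real openQA worker API."""

    def __init__(self, connection):
        self.connection = connection
        self.error = None

    def accept(self, job_id):
        if not self.connection.connected:
            # Record the failure, as in the "Unable to accept job" log line.
            self.error = "api-failure"
            return False
        # Assumed desired behaviour: a successful accept clears stale state.
        self.error = None
        return True

    def should_skip_child(self):
        return self.error is not None


class DroppedConnectionTest(unittest.TestCase):
    def test_child_not_skipped_after_recovery(self):
        connection = FakeConnection()
        worker = ToyWorker(connection)

        # Simulate the dropped websocket connection.
        connection.drop()
        self.assertFalse(worker.accept(1339790))
        self.assertTrue(worker.should_skip_child())

        # Once the connection is back, the error state must not linger.
        connection.restore()
        self.assertTrue(worker.accept(1339790))
        self.assertFalse(worker.should_skip_child())
```

Saved to a file, the test can be run with `python -m unittest`. A real test would drive the actual worker with a mocked websocket connection instead of these stand-ins.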

 ### further notes 

 * Judging by the worker code, this problem is specific to directly chained jobs.
 * The problem only happens when the websocket connection is dropped.
 * As a workaround one can restart the jobs.
