action #69274
Updated by mkittler almost 4 years ago
### observation

The worker log looks like this:

```
juil. 23 10:10:06 siodtw01 worker[2608]: [info] Accepting job 1339790 from queue
juil. 23 10:10:06 siodtw01 worker[2608]: [error] Unable to accept job 1339790 because the websocket connection to https://openqa.opensuse.org has been lost.
juil. 23 10:10:06 siodtw01 worker[2608]: [info] Skipping job 1339792 from queue (parent faild with result api-failure)
juil. 23 10:10:06 siodtw01 worker[2608]: [info] Skipping job 1339793 from queue (parent faild with result skipped)
juil. 23 10:10:06 siodtw01 worker[2608]: [info] Skipping job 1339791 from queue (parent faild with result skipped)
juil. 23 10:10:06 siodtw01 worker[2608]: [info] Skipping job 1339794 from queue (parent faild with result skipped)
```

However, the parent hasn't actually failed. The worker log is indeed full of connection/API errors:

```
[2020-07-23T10:09:18.0050 CEST] [warn] Failed to register at http://192.168.0.28 - connection error: No route to host - trying again in 10 seconds
[2020-07-23T10:09:24.0154 CEST] [info] Registering with openQA https://openqa.opensuse.org
[2020-07-23T10:09:24.0327 CEST] [info] Establishing ws connection via wss://openqa.opensuse.org/api/v1/ws/258
[2020-07-23T10:09:24.0464 CEST] [info] Registered and connected via websockets with openQA host https://openqa.opensuse.org and worker ID 258
[2020-07-23T10:09:28.0055 CEST] [info] Registering with openQA http://192.168.0.28
[2020-07-23T10:09:31.0170 CEST] [warn] Failed to register at http://192.168.0.28 - connection error: No route to host - trying again in 10 seconds
[2020-07-23T10:09:41.0181 CEST] [info] Registering with openQA http://192.168.0.28
[2020-07-23T10:09:44.0290 CEST] [warn] Failed to register at http://192.168.0.28 - connection error: No route to host - trying again in 10 seconds
```

However, none of these errors Likely an API error happened but was not fatal after all.

Example job (parent): https://openqa.opensuse.org/tests/1339789

### problems

* The openQA worker apparently does not clear the error state as needed and therefore wrongly skips the directly chained job.
* The further log lines have the result "skipped" and not "api-failure" anymore, which also seems odd.
* There's a typo in "failed" (the log message says "faild").

### suggestions

* Investigate the worker code.
* Try to reproduce the scenario within unit tests.
* Provide a fix for the problems.

### further notes

* Judging by the worker code, this problem is specific to directly chained jobs.
* The problem only happens when the web socket connection is dropped.
* As a workaround, one can restart the jobs.
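
To make the suspected mechanism concrete, here is a minimal Python sketch of a toy worker queue. It is not openQA's worker code (which is written in Perl). The class `ToyWorker`, its methods and the `clear_error_on_reconnect` switch are hypothetical names made up for illustration, and the rule that a skipped job becomes the skip reason for the next one is only an assumption that mirrors the log above. The sketch shows how an error state that survives a reconnect could cause the remaining chained jobs to be skipped, and how a unit test might tell the two behaviours apart.

```python
class ToyWorker:
    """Toy model of a worker processing a queue of chained jobs (hypothetical)."""

    def __init__(self, clear_error_on_reconnect):
        # Hypothetical switch: False models the suspected bug, True the desired fix.
        self.clear_error_on_reconnect = clear_error_on_reconnect
        self.pending_failure = None  # result that causes queued jobs to be skipped
        self.log = []

    def connection_lost(self):
        # A dropped websocket connection is recorded as an API failure.
        self.pending_failure = "api-failure"

    def reconnected(self):
        # The error state is only cleared if the switch is enabled.
        if self.clear_error_on_reconnect:
            self.pending_failure = None

    def process_queue(self, job_ids):
        for job_id in job_ids:
            if self.pending_failure is None:
                self.log.append((job_id, "accepted", None))
                continue
            self.log.append((job_id, "skipped", self.pending_failure))
            # Assumption mirroring the log: later jobs report "skipped" as the reason.
            self.pending_failure = "skipped"


def run_scenario(clear_error_on_reconnect):
    """Lose the connection, reconnect, then process three queued jobs."""
    worker = ToyWorker(clear_error_on_reconnect)
    worker.connection_lost()
    worker.reconnected()
    worker.process_queue([1339792, 1339793, 1339791])
    return worker.log
```

With `run_scenario(False)` every queued job is skipped, the first with reason "api-failure" and the rest with "skipped". With `run_scenario(True)` all three jobs are accepted. A real reproduction would have to exercise the actual worker code in its own unit tests.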