action #69274
Directly chained jobs are accidentally skipped
Description
observation
The worker log looks like this:
juil. 23 10:10:06 siodtw01 worker[2608]: [info] Accepting job 1339790 from queue
juil. 23 10:10:06 siodtw01 worker[2608]: [error] Unable to accept job 1339790 because the websocket connection to https://openqa.opensuse.org has been lost.
juil. 23 10:10:06 siodtw01 worker[2608]: [info] Skipping job 1339792 from queue (parent faild with result api-failure)
juil. 23 10:10:06 siodtw01 worker[2608]: [info] Skipping job 1339793 from queue (parent faild with result skipped)
juil. 23 10:10:06 siodtw01 worker[2608]: [info] Skipping job 1339791 from queue (parent faild with result skipped)
juil. 23 10:10:06 siodtw01 worker[2608]: [info] Skipping job 1339794 from queue (parent faild with result skipped)
However, the parent hasn't actually failed. The worker log is indeed full of connection/API errors:
[2020-07-23T10:09:18.0050 CEST] [warn] Failed to register at http://192.168.0.28 - connection error: No route to host - trying again in 10 seconds
[2020-07-23T10:09:24.0154 CEST] [info] Registering with openQA https://openqa.opensuse.org
[2020-07-23T10:09:24.0327 CEST] [info] Establishing ws connection via wss://openqa.opensuse.org/api/v1/ws/258
[2020-07-23T10:09:24.0464 CEST] [info] Registered and connected via websockets with openQA host https://openqa.opensuse.org and worker ID 258
[2020-07-23T10:09:28.0055 CEST] [info] Registering with openQA http://192.168.0.28
[2020-07-23T10:09:31.0170 CEST] [warn] Failed to register at http://192.168.0.28 - connection error: No route to host - trying again in 10 seconds
[2020-07-23T10:09:41.0181 CEST] [info] Registering with openQA http://192.168.0.28
[2020-07-23T10:09:44.0290 CEST] [warn] Failed to register at http://192.168.0.28 - connection error: No route to host - trying again in 10 seconds
Yet none of these errors was fatal after all.
Example job (parent): https://openqa.opensuse.org/tests/1339789
problems
- The openQA worker wrongly skips the directly chained child jobs.
- The subsequent log lines show the result "skipped" instead of "api-failure", which also seems odd (see the sketch after this list).
- There's a typo in the log message ("faild" instead of "failed").
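To illustrate why the results differ, here is a minimal sketch in Python (the actual worker is written in Perl and structured differently); the job IDs come from the log above, while the chain order and all names are assumptions:

```python
# Minimal model (not the actual openQA worker code) of how a skip cascades
# through a queue of directly chained jobs once the first accept fails.
# The chain order below is an assumption based on the log messages.
chain = [1339790, 1339792, 1339793, 1339791, 1339794]
results = {}

def accept(job_id, websocket_up):
    """Pretend to accept a job; record 'api-failure' if the ws connection is down."""
    if not websocket_up:
        results[job_id] = 'api-failure'
        return False
    results[job_id] = 'accepted'
    return True

# The first job cannot be accepted because the websocket connection is lost ...
if not accept(chain[0], websocket_up=False):
    # ... so every remaining job is skipped because its *direct* parent did not
    # run. Only the first child sees 'api-failure'; the later ones see 'skipped',
    # which matches the log lines quoted above.
    for parent, child in zip(chain, chain[1:]):
        print(f"Skipping job {child} from queue (parent failed with result {results[parent]})")
        results[child] = 'skipped'
```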
suggestions
- Investigate the worker code.
- Try to reproduce the scenario within unit tests.
- Provide a fix for the problems.
further notes
- Judging by the worker code, this problem really only affects directly chained jobs.
- The problem only happens when the web socket connection is dropped.
- As a workaround one can restart the jobs.
Updated by okurz about 4 years ago
I don't understand the severity or the impact of this issue. Also, can you say where and how often this happens and what a workaround could be, so that we can prioritize accordingly?
Updated by ggardet_arm about 4 years ago
okurz wrote:
I don't understand the severity or the impact of this issue. Also, can you say where and how often this happens and what a workaround could be, so that we can prioritize accordingly?
It happens "quite often" on the siodtw01 machine (the workers dealing with my RPi2 and RPi3). Being a remote server, I guess some connection errors are expected (more than on the same local network) and should be handled.
The workaround I have at the moment is to restart the skipped jobs manually.
Updated by mkittler about 4 years ago
- Description updated (diff)
I don't understand the severity or the impact of this issue.
If I had known that, I would have included it in the ticket. I have now had a look at the worker code and I suppose only directly chained children can be affected. It also only happens when the web socket connection has been lost.
The problem is not hard to pin down:
- When the web socket connection is lost at the point where the worker wants to accept the job, it currently logs the error we've observed (Unable to accept job … because the websocket connection to … has been lost.) and then skips all further jobs, as it usually does when an error happens.
- Then the job is assigned to the worker again, but its children are not because they have already been marked as skipped.
- The job gets accepted on this second attempt and passes. However, its children stay skipped. Hence it looks like the worker skipped the children even though the parent job passed.
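To make the described sequence a bit more concrete, here is a rough model of the resulting job states in Python (an assumption about the flow, not actual openQA code):

```python
# Results after the first (failed) accept attempt, as described above.
job_results = {
    1339790: 'api-failure',   # accept failed because the ws connection was lost
    1339792: 'skipped',
    1339793: 'skipped',
    1339791: 'skipped',
    1339794: 'skipped',
}

def reassign_and_run(job_id):
    # Second attempt: the ws connection is back, the job runs and passes.
    job_results[job_id] = 'passed'

# Only the parent is assigned to the worker again; its children already carry a
# final 'skipped' result and are therefore never re-scheduled.
reassign_and_run(1339790)

print(job_results)
# parent 'passed', children still 'skipped' - it looks as if the worker skipped
# the children of a passing job
```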
This raises the question of what the worker should do when it can neither accept nor reject a further job because the ws connection is down. I have to think about it, but maybe there's a very simple solution: just drop all enqueued jobs and otherwise don't do anything. When the worker sends the next status update after recovering the ws connection, it will claim it is idle and the stale job detection is supposed to deal with the jobs as usual. Maybe the stale job detection needs to be adjusted so it treats the directly chained jobs as a unit (and does not clone some of them and re-schedule others).
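A very rough sketch of that idea, purely illustrative Python with made-up names rather than an implemented openQA change:

```python
# Sketch of the suggested handling: when the ws connection is gone, drop the
# locally enqueued jobs without reporting any result and let the server-side
# stale job detection re-schedule them once the worker reports itself as idle.
class WorkerQueue:
    def __init__(self, jobs):
        self.jobs = list(jobs)

    def handle_lost_websocket(self):
        # Instead of reporting 'api-failure'/'skipped' for everything, simply
        # forget the assignments; no results reach the web UI.
        self.jobs.clear()

    def status(self):
        # After the connection recovers the worker claims to be idle, so the
        # stale job detection can deal with the whole directly chained cluster
        # as one unit (assuming it is adjusted accordingly).
        return 'idle' if not self.jobs else 'working'

queue = WorkerQueue([1339790, 1339792, 1339793, 1339791, 1339794])
queue.handle_lost_websocket()
assert queue.status() == 'idle'
```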
Updated by okurz about 4 years ago
- Target version set to Ready
Thank you guys, that should be enough to be able to start on this ticket, and AFAICS the "Normal" prio is correct then.
Updated by okurz about 4 years ago
- Priority changed from Normal to Low
- Target version changed from Ready to future
To be honest, as no one picked this up after 2 months, I regard this as "Low" prio, also because the impact seems to be limited to worker instances that struggle with connection stability in general anyway.