action #69274

open

Directly chained jobs are accidentally skipped

Added by mkittler almost 4 years ago. Updated over 3 years ago.

Status: New
Priority: Low
Assignee: -
Category: Regressions/Crashes
Target version:
Start date: 2020-07-23
Due date:
% Done: 0%
Estimated time:

Description

observation

The worker log looks like this:

juil. 23 10:10:06 siodtw01 worker[2608]: [info] Accepting job 1339790 from queue
juil. 23 10:10:06 siodtw01 worker[2608]: [error] Unable to accept job 1339790 because the websocket connection to https://openqa.opensuse.org has been lost.
juil. 23 10:10:06 siodtw01 worker[2608]: [info] Skipping job 1339792 from queue (parent faild with result api-failure)
juil. 23 10:10:06 siodtw01 worker[2608]: [info] Skipping job 1339793 from queue (parent faild with result skipped)
juil. 23 10:10:06 siodtw01 worker[2608]: [info] Skipping job 1339791 from queue (parent faild with result skipped)
juil. 23 10:10:06 siodtw01 worker[2608]: [info] Skipping job 1339794 from queue (parent faild with result skipped)

However, the parent hasn't actually failed. The worker log is indeed full of connection/API errors:

[2020-07-23T10:09:18.0050 CEST] [warn] Failed to register at http://192.168.0.28 - connection error: No route to host - trying again in 10 seconds
[2020-07-23T10:09:24.0154 CEST] [info] Registering with openQA https://openqa.opensuse.org
[2020-07-23T10:09:24.0327 CEST] [info] Establishing ws connection via wss://openqa.opensuse.org/api/v1/ws/258
[2020-07-23T10:09:24.0464 CEST] [info] Registered and connected via websockets with openQA host https://openqa.opensuse.org and worker ID 258
[2020-07-23T10:09:28.0055 CEST] [info] Registering with openQA http://192.168.0.28
[2020-07-23T10:09:31.0170 CEST] [warn] Failed to register at http://192.168.0.28 - connection error: No route to host - trying again in 10 seconds
[2020-07-23T10:09:41.0181 CEST] [info] Registering with openQA http://192.168.0.28
[2020-07-23T10:09:44.0290 CEST] [warn] Failed to register at http://192.168.0.28 - connection error: No route to host - trying again in 10 seconds

However, none of these errors was fatal after all.

Example job (parent): https://openqa.opensuse.org/tests/1339789

problems

  • The openQA worker wrongly skips the directly chained job.
  • The subsequent log lines report the result "skipped" instead of "api-failure", which also seems odd.
  • The log message misspells "failed" as "faild".
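
One possible explanation for the changing result, sketched as a toy model in Python rather than taken from the actual worker code: if each queued job is skipped with the result of its direct parent, and a skipped job is itself recorded as "skipped", then only the first child ever sees "api-failure". The function name, the data structures and the linear dependency chain below are assumptions made for illustration; the real dependency tree of these jobs is not visible in the log.

```python
from typing import Dict, List, Tuple


def skip_chained_jobs(
    queue: List[int],
    parent_of: Dict[int, int],
    failed_job: int,
    failed_result: str,
) -> List[Tuple[int, str]]:
    """Toy model: skip every queued job whose direct parent has a result.

    Each skipped job is reported with the result of its *direct* parent and
    is itself recorded as "skipped", so only the first child sees the
    original result.
    """
    results = {failed_job: failed_result}
    skipped = []
    for job in queue:
        parent = parent_of.get(job)
        if parent in results:
            skipped.append((job, results[parent]))
            results[job] = "skipped"
    return skipped


# Hypothetical linear chain, using the job IDs in the order they were logged.
QUEUE = [1339792, 1339793, 1339791, 1339794]
PARENT_OF = {1339792: 1339790, 1339793: 1339792, 1339791: 1339793, 1339794: 1339791}

for job, reason in skip_chained_jobs(QUEUE, PARENT_OF, 1339790, "api-failure"):
    print(f"Skipping job {job} from queue (parent failed with result {reason})")
```

Under these assumptions the model yields "api-failure" for the first skipped job and "skipped" for the rest, matching the order seen in the log. Whether the worker really behaves this way still needs to be confirmed against its code.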

suggestions

  • Investigate the worker code.
  • Try to reproduce the scenario within unit tests.
  • Provide a fix for the problems.

further notes

  • Judging by the worker code, this problem is specific to directly chained jobs.
  • The problem only happens when the websocket connection is dropped.
  • As a workaround one can restart the jobs.