Project

General

Profile

action #73375

coordination #39719: [saga][epic] Detect "known failures" and mark jobs as such to make tests more stable, reviewing test results and tracking known issues easier

coordination #62420: [epic] Distinguish all types of incompletes

Job incompletes with reason auto_review:"(?m)api failure$" (and no further details)

Added by mkittler 6 months ago. Updated 6 months ago.

Status:
Workable
Priority:
Low
Assignee:
-
Category:
Concrete Bugs
Target version:
Start date:
2020-10-14
Due date:
% Done:

0%

Estimated time:
Difficulty:

Description

Observation

The real error messages are only found within the worker log (systemd service journal) despite the fact that is was apparently possible to pass "api failure":

Okt 13 17:59:28 openqaworker4 worker[1908]: [info] Running on openqaworker4:7 (Linux 4.12.14-lp151.28.71-default #1 SMP Thu Oct 8 12:40:48 UTC 2020 (dda1832) x86_64)
Okt 13 17:59:28 openqaworker4 worker[1908]: [info] Downloading openSUSE-Jump-15.2.1-KDE-Live-x86_64-Build2.8-Media.iso.sha256, request #127800 sent to Cache Service
Okt 13 17:59:39 openqaworker4 worker[1908]: [info] Download of openSUSE-Jump-15.2.1-KDE-Live-x86_64-Build2.8-Media.iso.sha256 processed:
Okt 13 17:59:39 openqaworker4 worker[1908]: [info] [#127800] Cache size of "/var/lib/openqa/cache" is 49GiB, with limit 50GiB
...
Okt 13 17:59:59 openqaworker4 worker[1908]: [info] [#127801] Downloading "openSUSE-Jump-15.2.1-KDE-Live-x86_64-Build2.8-Media.iso" from "http://openqa1-opensuse/tests/1432537/asset/iso/openSUSE-Jump-15.2.1-KDE-Live-x86_64-Build2.8-Media.iso"
...
Okt 13 19:17:22 openqaworker4 worker[1908]: [info] Uploading autoinst-log.txt
Okt 13 19:17:22 openqaworker4 worker[1908]: [info] Uploading worker-log.txt
Okt 13 19:17:32 openqaworker4 worker[1908]: [error] Error uploading worker-log.txt: connection error: Can't connect: Temporary failure in name resolution
Okt 13 19:17:34 openqaworker4 worker[1908]: [info] Registering with openQA http://openqa1-opensuse
Okt 13 19:17:34 openqaworker4 worker[1908]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/170
Okt 13 19:17:34 openqaworker4 worker[1908]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 170
Okt 13 19:17:34 openqaworker4 worker[1908]: [warn] Websocket connection to http://openqa1-opensuse/api/v1/ws/170 finished by remote side with code 1005, no reason - trying again in 10 seconds
Okt 13 19:17:44 openqaworker4 worker[1908]: [info] Registering with openQA http://openqa1-opensuse
Okt 13 19:17:44 openqaworker4 worker[1908]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/170
Okt 13 19:17:44 openqaworker4 worker[1908]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 170
Okt 13 19:27:34 openqaworker4 worker[1908]: [warn] Websocket connection to http://openqa1-opensuse/api/v1/ws/170 finished by remote side with code 1006, no reason - trying again in 10 seconds
Okt 13 19:27:44 openqaworker4 worker[1908]: [info] Registering with openQA http://openqa1-opensuse
Okt 13 19:27:44 openqaworker4 worker[1908]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/170
Okt 13 19:27:44 openqaworker4 worker[1908]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 170
Okt 13 19:36:05 openqaworker4 worker[1908]: [warn] Websocket connection to http://openqa1-opensuse/api/v1/ws/170 finished by remote side with code 1006, no reason - trying again in 10 seconds
Okt 13 19:36:15 openqaworker4 worker[1908]: [info] Registering with openQA http://openqa1-opensuse
Okt 13 19:36:15 openqaworker4 worker[1908]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/170
Okt 13 19:36:15 openqaworker4 worker[1908]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 170
Okt 13 19:44:25 openqaworker4 worker[1908]: [warn] Websocket connection to http://openqa1-opensuse/api/v1/ws/170 finished by remote side with code 1006, no reason - trying again in 10 seconds
Okt 13 19:44:35 openqaworker4 worker[1908]: [info] Registering with openQA http://openqa1-opensuse
Okt 13 19:44:35 openqaworker4 worker[1908]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/170
Okt 13 19:44:35 openqaworker4 worker[1908]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 170
Okt 13 19:52:55 openqaworker4 worker[1908]: [warn] Websocket connection to http://openqa1-opensuse/api/v1/ws/170 finished by remote side with code 1006, no reason - trying again in 10 seconds
Okt 13 19:53:05 openqaworker4 worker[1908]: [info] Registering with openQA http://openqa1-opensuse
Okt 13 19:53:05 openqaworker4 worker[1908]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/170
Okt 13 19:53:05 openqaworker4 worker[1908]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 170

from https://openqa.opensuse.org/tests/1432537

so it looks like the job ran fine from start till uploading files but failed in uploading worker-log.txt:

Okt 13 19:17:32 openqaworker4 worker[1908]: [error] Error uploading worker-log.txt: connection error: Can't connect: Temporary failure in name resolution

so the network on the worker host disappeared or something. After that the worker tries to reconnect to the webUI (possibly to retry uploading) and does again with pauses of 10 minutes (!) in between.

At the end of the log (what is not seen here) the worker picked up the next job and the previous job is just incomplete.

Steps to reproduce

Use https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label with argument "poo#73375" to find if jobs are still referencing this ticket

Acceptance criteria / Expected behaviour

  • The actual problem, i.e. in this case "connection error: Can't connect: Temporary failure in name resolution" is forwarded into the reason field, not just "api failure"
  • The job is not automatically retriggered in these cases of network problems from the side of worker as the worker can not say if the network comes back (the webui might still want to do that)

Suggestions, further information

  • This ticket already causes auto-review to automatically restart such jobs. Maybe it would make sense if openQA would restart jobs in case of API failures on its own.
  • Of course it would be nice if the worker would pass the exact error message and not just "api failure".
  • Considering the log there might be a bug in the retry logic (although it is intended that the worker re-registers in case of API errors).
  • Look into the worker code how it transfers the reason, call git grep -i 'api.failure', look for the occurence where maybe no details are added the string

Workaround

Ensure the network is working (again) and retrigger according jobs


Related issues

Related to openQA Project - action #73396: job incompletes with auto_review:"setup failure: Failed to rsync tests: exit code 23":retryResolved2020-10-15

History

#1 Updated by okurz 6 months ago

mkittler wrote:

  • This ticket already causes auto-review to automatically restart such jobs. Maybe it would make sense if openQA would restart jobs in case of API failures on its own.

Actually it does not. Only if you add :retry after the regex "..." string.

#2 Updated by okurz 6 months ago

  • Description updated (diff)
  • Category set to Concrete Bugs
  • Target version set to Ready

#3 Updated by okurz 6 months ago

  • Subject changed from Job incompletes with reason auto_review:"(?m)api failure$" (and no further details) to Job incompletes with reason auto_review:"(?m)api failure$":retry (and no further details)
  • Description updated (diff)
  • Status changed from New to Workable
  • Priority changed from Normal to Low
  • Target version changed from Ready to future

made it workable together with mkittler

#4 Updated by okurz 6 months ago

  • Description updated (diff)

#5 Updated by mkittler 6 months ago

  • Subject changed from Job incompletes with reason auto_review:"(?m)api failure$":retry (and no further details) to Job incompletes with reason auto_review:"(?m)api failure$" (and no further details)
  • Description updated (diff)

The job is not automatically retriggered in these cases of network problems from the side of worker as the worker can not say if the network comes back (the webui might still want to do that)

Note that this applies to any API failure which looks like a network issue (but not to other API failures like failed authorization). The job mentioned in the ticket description is such a case because of Temporary failure in name resolution within the log. However, if there would be no further clues in the log and just had the reason "api failure", re-triggering wouldn't generally make sense. So I'm also removing the :retry from the ticket description again.

By the way, I've just seen an incomplete job with api failure: 408 response: Request Timeout. I suppose re-triggering would make sense here. Maybe also for setup failure: Failed to rsync tests: exit code 12. If we don't have already a ticket, it would make sense to add one for the re-triggering when it looks like a temporary network issue and consider all these kinds of errors.

#6 Updated by okurz 6 months ago

  • Related to action #73396: job incompletes with auto_review:"setup failure: Failed to rsync tests: exit code 23":retry added

#7 Updated by okurz 6 months ago

"exit code 12"? According to the rsync manpage that is "Error in rsync protocol data stream". This is closely related to #73396 with the PR https://github.com/os-autoinst/openQA/pull/3476

#8 Updated by okurz 6 months ago

  • Parent task set to #62420

Also available in: Atom PDF