Project

General

Profile

Actions

action #73375

open

coordination #102906: [saga][epic] Increased stability of tests with less "known failures", known incompletes handled automatically within openQA

coordination #102909: [epic] Prevent more incompletes already within os-autoinst or openQA

Job incompletes with reason auto_review:"(?m)api failure$" (and no further details)

Added by mkittler over 3 years ago. Updated over 2 years ago.

Status:
Workable
Priority:
Low
Assignee:
-
Category:
Regressions/Crashes
Target version:
Start date:
2020-10-14
Due date:
% Done:

0%

Estimated time:

Description

Observation

The real error messages are only found within the worker log (systemd service journal) despite the fact that is was apparently possible to pass "api failure":

Okt 13 17:59:28 openqaworker4 worker[1908]: [info] Running on openqaworker4:7 (Linux 4.12.14-lp151.28.71-default #1 SMP Thu Oct 8 12:40:48 UTC 2020 (dda1832) x86_64)
Okt 13 17:59:28 openqaworker4 worker[1908]: [info] Downloading openSUSE-Jump-15.2.1-KDE-Live-x86_64-Build2.8-Media.iso.sha256, request #127800 sent to Cache Service
Okt 13 17:59:39 openqaworker4 worker[1908]: [info] Download of openSUSE-Jump-15.2.1-KDE-Live-x86_64-Build2.8-Media.iso.sha256 processed:
Okt 13 17:59:39 openqaworker4 worker[1908]: [info] [#127800] Cache size of "/var/lib/openqa/cache" is 49GiB, with limit 50GiB
...
Okt 13 17:59:59 openqaworker4 worker[1908]: [info] [#127801] Downloading "openSUSE-Jump-15.2.1-KDE-Live-x86_64-Build2.8-Media.iso" from "http://openqa1-opensuse/tests/1432537/asset/iso/openSUSE-Jump-15.2.1-KDE-Live-x86_64-Build2.8-Media.iso"
...
Okt 13 19:17:22 openqaworker4 worker[1908]: [info] Uploading autoinst-log.txt
Okt 13 19:17:22 openqaworker4 worker[1908]: [info] Uploading worker-log.txt
Okt 13 19:17:32 openqaworker4 worker[1908]: [error] Error uploading worker-log.txt: connection error: Can't connect: Temporary failure in name resolution
Okt 13 19:17:34 openqaworker4 worker[1908]: [info] Registering with openQA http://openqa1-opensuse
Okt 13 19:17:34 openqaworker4 worker[1908]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/170
Okt 13 19:17:34 openqaworker4 worker[1908]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 170
Okt 13 19:17:34 openqaworker4 worker[1908]: [warn] Websocket connection to http://openqa1-opensuse/api/v1/ws/170 finished by remote side with code 1005, no reason - trying again in 10 seconds
Okt 13 19:17:44 openqaworker4 worker[1908]: [info] Registering with openQA http://openqa1-opensuse
Okt 13 19:17:44 openqaworker4 worker[1908]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/170
Okt 13 19:17:44 openqaworker4 worker[1908]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 170
Okt 13 19:27:34 openqaworker4 worker[1908]: [warn] Websocket connection to http://openqa1-opensuse/api/v1/ws/170 finished by remote side with code 1006, no reason - trying again in 10 seconds
Okt 13 19:27:44 openqaworker4 worker[1908]: [info] Registering with openQA http://openqa1-opensuse
Okt 13 19:27:44 openqaworker4 worker[1908]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/170
Okt 13 19:27:44 openqaworker4 worker[1908]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 170
Okt 13 19:36:05 openqaworker4 worker[1908]: [warn] Websocket connection to http://openqa1-opensuse/api/v1/ws/170 finished by remote side with code 1006, no reason - trying again in 10 seconds
Okt 13 19:36:15 openqaworker4 worker[1908]: [info] Registering with openQA http://openqa1-opensuse
Okt 13 19:36:15 openqaworker4 worker[1908]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/170
Okt 13 19:36:15 openqaworker4 worker[1908]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 170
Okt 13 19:44:25 openqaworker4 worker[1908]: [warn] Websocket connection to http://openqa1-opensuse/api/v1/ws/170 finished by remote side with code 1006, no reason - trying again in 10 seconds
Okt 13 19:44:35 openqaworker4 worker[1908]: [info] Registering with openQA http://openqa1-opensuse
Okt 13 19:44:35 openqaworker4 worker[1908]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/170
Okt 13 19:44:35 openqaworker4 worker[1908]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 170
Okt 13 19:52:55 openqaworker4 worker[1908]: [warn] Websocket connection to http://openqa1-opensuse/api/v1/ws/170 finished by remote side with code 1006, no reason - trying again in 10 seconds
Okt 13 19:53:05 openqaworker4 worker[1908]: [info] Registering with openQA http://openqa1-opensuse
Okt 13 19:53:05 openqaworker4 worker[1908]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/170
Okt 13 19:53:05 openqaworker4 worker[1908]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 170

from https://openqa.opensuse.org/tests/1432537

so it looks like the job ran fine from start till uploading files but failed in uploading worker-log.txt:

Okt 13 19:17:32 openqaworker4 worker[1908]: [error] Error uploading worker-log.txt: connection error: Can't connect: Temporary failure in name resolution

so the network on the worker host disappeared or something. After that the worker tries to reconnect to the webUI (possibly to retry uploading) and does again with pauses of 10 minutes (!) in between.

At the end of the log (what is not seen here) the worker picked up the next job and the previous job is just incomplete.

Steps to reproduce

Use https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label with argument "poo#73375" to find if jobs are still referencing this ticket

Acceptance criteria / Expected behaviour

  • The actual problem, i.e. in this case "connection error: Can't connect: Temporary failure in name resolution" is forwarded into the reason field, not just "api failure"
  • The job is not automatically retriggered in these cases of network problems from the side of worker as the worker can not say if the network comes back (the webui might still want to do that)

Suggestions, further information

  • This ticket already causes auto-review to automatically restart such jobs. Maybe it would make sense if openQA would restart jobs in case of API failures on its own.
  • Of course it would be nice if the worker would pass the exact error message and not just "api failure".
  • Considering the log there might be a bug in the retry logic (although it is intended that the worker re-registers in case of API errors).
  • Look into the worker code how it transfers the reason, call git grep -i 'api.failure', look for the occurence where maybe no details are added the string

Workaround

Ensure the network is working (again) and retrigger according jobs


Related issues 1 (0 open1 closed)

Related to openQA Project - action #73396: job incompletes with auto_review:"setup failure: Failed to rsync tests: exit code 23":retryResolvedXiaojing_liu2020-10-15

Actions
Actions

Also available in: Atom PDF