action #73375
Updated by okurz about 4 years ago
## Observation
The real error messages are only found within the worker log (systemd service journal) despite the fact that is was apparently possible to pass "api failure":
```
Okt 13 17:59:28 openqaworker4 worker[1908]: [info] Running on openqaworker4:7 (Linux 4.12.14-lp151.28.71-default #1 SMP Thu Oct 8 12:40:48 UTC 2020 (dda1832) x86_64)
Okt 13 17:59:28 openqaworker4 worker[1908]: [info] Downloading openSUSE-Jump-15.2.1-KDE-Live-x86_64-Build2.8-Media.iso.sha256, request #127800 sent to Cache Service
Okt 13 17:59:39 openqaworker4 worker[1908]: [info] Download of openSUSE-Jump-15.2.1-KDE-Live-x86_64-Build2.8-Media.iso.sha256 processed:
Okt 13 17:59:39 openqaworker4 worker[1908]: [info] [#127800] Cache size of "/var/lib/openqa/cache" is 49GiB, with limit 50GiB
...
Okt 13 17:59:59 openqaworker4 worker[1908]: [info] [#127801] Downloading "openSUSE-Jump-15.2.1-KDE-Live-x86_64-Build2.8-Media.iso" from "http://openqa1-opensuse/tests/1432537/asset/iso/openSUSE-Jump-15.2.1-KDE-Live-x86_64-Build2.8-Media.iso"
...
Okt 13 19:17:22 openqaworker4 worker[1908]: [info] Uploading autoinst-log.txt
Okt 13 19:17:22 openqaworker4 worker[1908]: [info] Uploading worker-log.txt
Okt 13 19:17:32 openqaworker4 worker[1908]: [error] Error uploading worker-log.txt: connection error: Can't connect: Temporary failure in name resolution
Okt 13 19:17:34 openqaworker4 worker[1908]: [info] Registering with openQA http://openqa1-opensuse
Okt 13 19:17:34 openqaworker4 worker[1908]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/170
Okt 13 19:17:34 openqaworker4 worker[1908]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 170
Okt 13 19:17:34 openqaworker4 worker[1908]: [warn] Websocket connection to http://openqa1-opensuse/api/v1/ws/170 finished by remote side with code 1005, no reason - trying again in 10 seconds
Okt 13 19:17:44 openqaworker4 worker[1908]: [info] Registering with openQA http://openqa1-opensuse
Okt 13 19:17:44 openqaworker4 worker[1908]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/170
Okt 13 19:17:44 openqaworker4 worker[1908]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 170
Okt 13 19:27:34 openqaworker4 worker[1908]: [warn] Websocket connection to http://openqa1-opensuse/api/v1/ws/170 finished by remote side with code 1006, no reason - trying again in 10 seconds
Okt 13 19:27:44 openqaworker4 worker[1908]: [info] Registering with openQA http://openqa1-opensuse
Okt 13 19:27:44 openqaworker4 worker[1908]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/170
Okt 13 19:27:44 openqaworker4 worker[1908]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 170
Okt 13 19:36:05 openqaworker4 worker[1908]: [warn] Websocket connection to http://openqa1-opensuse/api/v1/ws/170 finished by remote side with code 1006, no reason - trying again in 10 seconds
Okt 13 19:36:15 openqaworker4 worker[1908]: [info] Registering with openQA http://openqa1-opensuse
Okt 13 19:36:15 openqaworker4 worker[1908]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/170
Okt 13 19:36:15 openqaworker4 worker[1908]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 170
Okt 13 19:44:25 openqaworker4 worker[1908]: [warn] Websocket connection to http://openqa1-opensuse/api/v1/ws/170 finished by remote side with code 1006, no reason - trying again in 10 seconds
Okt 13 19:44:35 openqaworker4 worker[1908]: [info] Registering with openQA http://openqa1-opensuse
Okt 13 19:44:35 openqaworker4 worker[1908]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/170
Okt 13 19:44:35 openqaworker4 worker[1908]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 170
Okt 13 19:52:55 openqaworker4 worker[1908]: [warn] Websocket connection to http://openqa1-opensuse/api/v1/ws/170 finished by remote side with code 1006, no reason - trying again in 10 seconds
Okt 13 19:53:05 openqaworker4 worker[1908]: [info] Registering with openQA http://openqa1-opensuse
Okt 13 19:53:05 openqaworker4 worker[1908]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/170
Okt 13 19:53:05 openqaworker4 worker[1908]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 170
```
from https://openqa.opensuse.org/tests/1432537
so it looks like the job ran fine from start till uploading files but failed in uploading worker-log.txt:
```
Okt 13 19:17:32 openqaworker4 worker[1908]: [error] Error uploading worker-log.txt: connection error: Can't connect: Temporary failure in name resolution
```
so the network on the worker host disappeared or something. After that the worker tries to reconnect to the webUI (possibly to retry uploading) and does again with pauses of 10 minutes (!) in between.
At the end of the log (what is not seen here) the worker picked up the next job and the previous job is just incomplete.
## Steps to reproduce
Use https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label with argument "poo#73375" to find if jobs are still referencing this ticket
## Acceptance criteria / Expected behaviour
* The actual problem, i.e. in this case "connection error: Can't connect: Temporary failure in name resolution" is forwarded into the reason field, not just "api failure"
* The job is not automatically retriggered in these case of network problems from the side of worker as the worker can not say if the network comes back (the webui might still want to do that)
## Suggestions, further information
* This ticket already causes auto-review to automatically restart such jobs. Maybe it would make sense if openQA would restart jobs in case of API failures on its own.
* Of course it would be nice if the worker would pass the exact error message and not just "api failure".
* Considering the log there *might* be a bug in the retry logic (although it is intended that the worker re-registers in case of API errors).
* Look into the worker code how it transfers the reason, call `git grep -i 'api.failure'`, look for the occurence where maybe no details are added the string
## Workaround
Ensure the network is working (again) and retrigger according jobs