Project

General

Profile

action #108091

Most systemd units should not Want= or Require= network.target (bsc#1196359) size:M

Added by coolo 4 months ago. Updated about 1 month ago.

Status:
Resolved
Priority:
Low
Assignee:
Category:
Concrete Bugs
Target version:
Start date:
2022-03-09
Due date:
% Done:

0%

Estimated time:
Difficulty:

Description

Motivation

See https://bugzilla.suse.com/show_bug.cgi?id=1196359#c3

Acceptance criteria

  • AC1: Neither Want= nor Require= network used in service files

Suggestions

  • Remove our uses of it
  • Reboot 5 times afterwards
  • Use staging machines to test it

Related issues

Related to openQA Project - action #78390: Worker is stuck in "broken" state due to unavailable cache service (was: and even continuously fails to (re)connect to some configured web UIs)Resolved2021-01-18

Copied to openQA Project - action #108125: openqa-worker-cacheservice(-minion) can trigger "UNIQUE constraint failed: mojo_migrations.name at OpenQA/CacheService/Model/Cache.pm line 77."Resolved2022-03-09

History

#1 Updated by okurz 4 months ago

  • Target version set to Ready

Hm, weird. I thought we had that done quite some time ago already. Well, I agree, we should fix this. Also in the past that actually helped with some problems

#2 Updated by cdywan 4 months ago

  • Subject changed from Most systemd units should not Want= or Require= network.target (bsc#1196359) to Most systemd units should not Want= or Require= network.target (bsc#1196359) size:M
  • Description updated (diff)
  • Status changed from New to Workable

#3 Updated by okurz 4 months ago

  • Status changed from Workable to In Progress
  • Assignee set to okurz

#4 Updated by okurz 4 months ago

I am trying out on openqaworker11. I removed Wants=network.target from systemd services with vim /usr/lib/systemd/system/openqa-worker-auto-restart@.service /usr/lib/systemd/system/openqa-worker-cacheservice.service /usr/lib/systemd/system/openqa-worker-no-cleanup@.service /usr/lib/systemd/system/openqa-worker@.service

After reboot I see in the log for openqa-worker-auto-restart@1:

Mar 10 11:53:34 openqaworker11 systemd[1]: Started openQA Worker #1.
Mar 10 11:53:39 openqaworker11 worker[2124]: [info] [pid:2124] worker 1:
Mar 10 11:53:39 openqaworker11 worker[2124]:  - config file:           /etc/openqa/workers.ini
Mar 10 11:53:39 openqaworker11 worker[2124]:  - worker hostname:       openqaworker11
Mar 10 11:53:39 openqaworker11 worker[2124]:  - isotovideo version:    23
Mar 10 11:53:39 openqaworker11 worker[2124]:  - websocket API version: 1
Mar 10 11:53:39 openqaworker11 worker[2124]:  - web UI hosts:          openqa.suse.de,openqa-staging-1.qa.suse.de,openqa>
Mar 10 11:53:39 openqaworker11 worker[2124]:  - class:                 qemu_x86_64_debug_poo94783,tap,openqaworker11
Mar 10 11:53:39 openqaworker11 worker[2124]:  - no cleanup:            no
Mar 10 11:53:39 openqaworker11 worker[2124]:  - pool directory:        /var/lib/openqa/pool/1
Mar 10 11:53:39 openqaworker11 worker[2124]: [error] [pid:2124] Worker cache not available: Cache service info error: Co>
Mar 10 11:53:39 openqaworker11 worker[2124]: [info] [pid:2124] CACHE: caching is enabled, setting up /var/lib/openqa/cac>
Mar 10 11:53:39 openqaworker11 worker[2124]: [info] [pid:2124] Project dir for host openqa.suse.de is /var/lib/openqa/sh>
…
Mar 10 11:53:39 openqaworker11 worker[2124]: [warn] [pid:2124] Failed to register at openqa.suse.de - connection error: >
Mar 10 11:53:39 openqaworker11 worker[2124]: [info] [pid:2124] Registering with openQA openqa-staging-1.qa.suse.de

So there is an error , likely because the network is not yet up. We should avoid that as an error and likely also the warnings should be just info.

openqa-worker-cacheservice-minion says:

Mar 10 11:53:34 openqaworker11 systemd[1]: Started OpenQA Worker Cache Service Minion.
Mar 10 11:53:38 openqaworker11 openqa-worker-cacheservice-minion[2119]: [2119] [i] Creating cache directory tree for "/v>
Mar 10 11:53:38 openqaworker11 openqa-worker-cacheservice-minion[2119]: DBD::SQLite::st execute failed: UNIQUE constrain>
Mar 10 11:53:38 openqaworker11 openqa-worker-cacheservice-minion[2119]:  at /usr/share/openqa/script/../lib/OpenQA/Cache>
Mar 10 11:53:38 openqaworker11 systemd[1]: openqa-worker-cacheservice-minion.service: Main process exited, code=exited, >
Mar 10 11:53:38 openqaworker11 systemd[1]: openqa-worker-cacheservice-minion.service: Unit entered failed state.
Mar 10 11:53:38 openqaworker11 systemd[1]: openqa-worker-cacheservice-minion.service: Failed with result 'exit-code'.
Mar 10 11:53:38 openqaworker11 systemd[1]: openqa-worker-cacheservice-minion.service: Service RestartSec=100ms expired, >
Mar 10 11:53:38 openqaworker11 systemd[1]: Stopped OpenQA Worker Cache Service Minion.
Mar 10 11:53:38 openqaworker11 systemd[1]: Started OpenQA Worker Cache Service Minion.

And openqa-worker-cacheservice.service says there is also a problem:

Mar 10 11:53:34 openqaworker11 systemd[1]: Started OpenQA Worker Cache Service.
Mar 10 11:53:38 openqaworker11 openqa-workercache-daemon[2117]: [2117] [i] Creating cache directory tree for "/var/lib/o>
Mar 10 11:53:38 openqaworker11 openqa-workercache-daemon[2117]: [2117] [i] Cache size of "/var/lib/openqa/cache" is 0 By>
Mar 10 11:53:38 openqaworker11 openqa-workercache-daemon[2117]: Can't create listen socket: Address family for hostname >
Mar 10 11:53:38 openqaworker11 systemd[1]: openqa-worker-cacheservice.service: Main process exited, code=exited, status=>
Mar 10 11:53:38 openqaworker11 systemd[1]: openqa-worker-cacheservice.service: Unit entered failed state.
Mar 10 11:53:38 openqaworker11 systemd[1]: openqa-worker-cacheservice.service: Failed with result 'exit-code'.
Mar 10 11:53:38 openqaworker11 systemd[1]: openqa-worker-cacheservice.service: Service RestartSec=100ms expired, schedul>
Mar 10 11:53:38 openqaworker11 systemd[1]: Stopped OpenQA Worker Cache Service.
Mar 10 11:53:38 openqaworker11 systemd[1]: Started OpenQA Worker Cache Service.
Mar 10 11:53:39 openqaworker11 openqa-workercache-daemon[2357]: Web application available at http://127.0.0.1:9530
Mar 10 11:53:39 openqaworker11 openqa-workercache-daemon[2357]: Web application available at http://[::1]:9530
Mar 10 11:53:39 openqaworker11 openqa-workercache-daemon[2357]: [2357] [i] Cache size of "/var/lib/openqa/cache" is 0 By>
Mar 10 11:53:39 openqaworker11 openqa-workercache-daemon[2357]: [2357] [i] Listening at "http://127.0.0.1:9530"
Mar 10 11:53:39 openqaworker11 openqa-workercache-daemon[2357]: [2357] [i] Listening at "http://[::1]:9530"
Mar 10 11:53:39 openqaworker11 openqa-workercache-daemon[2357]: [2357] [i] Manager 2357 started

so also recovering automatically but initially an error which we could avoid. https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/ also explains that with "If you are a developer, instead of wondering what to do about network.target, please just fix your program to be friendly to dynamically changing network configuration." And it says "If you write a server: listen on [::], [::1], 0.0.0.0 and 127.0.0.1 only. These pseudo-addresses are unconditionally available. If you always bind to these addresses you will have code that doesn't have to react to network changes, as all you listen on is catch-all and private addresses.". I thought we do that properly already but maybe the cache service is missing something here?

#5 Updated by okurz 4 months ago

We are running in circles it seems. So https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/ says Wants=network.target is bad, Wants=network-online.target is better, even better would be to "fix the code". In https://github.com/os-autoinst/openQA/pull/1878 in 2018 Wants=network.target was introduced for openqa-worker-cacheservice and there was already a discussion in the PR not agreeing on some points. Then in https://github.com/os-autoinst/openQA/pull/1884 we also discussed what should be right and I suggested "network-online.target" but the PR was closed without direct impact. In https://github.com/os-autoinst/openQA/pull/1896 we changed to network-online.target as suggested in before. However in https://github.com/os-autoinst/openQA/pull/2783 we changed that again to use "network.target" because we also added dependency for "nss-lookup.target remote-fs.target", same as for the webUI which we did in https://github.com/os-autoinst/openQA/pull/2782 to fix an actual problem. For the worker however at this point we just wanted to consolidate and did not check or expect any change in behaviour. Maybe we still need to add network-online.target as dependency for the cache service. An alternative that I would likely prefer is to catch the application start failing in CacheService.pm, e.g. with eval and retry. I guess kraih is against that and looks to systemd to fix that :)

But first for the UNIQUE constraint problem about sqlite. That actually seems to be a problem not related to systemd so I created #108125

Back to the previous problem. For that I am testing by changing the service configurations on openqaworker11 and taking a look into the logs of the relevant services after bootup with

sudo journalctl -b _SYSTEMD_UNIT=openqa-worker-cacheservice-minion.service + _SYSTEMD_UNIT=openqa-worker-auto-restart@1.service + _SYSTEMD_UNIT=openqa-worker-cacheservice.service

#6 Updated by okurz 4 months ago

  • Copied to action #108125: openqa-worker-cacheservice(-minion) can trigger "UNIQUE constraint failed: mojo_migrations.name at OpenQA/CacheService/Model/Cache.pm line 77." added

#7 Updated by openqa_review 4 months ago

  • Due date set to 2022-03-25

Setting due date based on mean cycle time of SUSE QE Tools

#8 Updated by okurz 3 months ago

  • Status changed from In Progress to Feedback

For the sake of simplicity I am waiting first for #108125 to be out of the way

#9 Updated by okurz 3 months ago

I added back Wants=network.target to crosscheck how the log looks like before. I removed openqa-staging-1.qa and openqa-staging-2.qa from /etc/openqa/workers.ini because otherwise the log is filled up easily with

Mar 16 10:55:52 openqaworker11 worker[2103]: [warn] [pid:2103] Failed to register at openqa.suse.de - connection error: Can't connect: Name or service not known - trying aga>
Mar 16 10:55:52 openqaworker11 worker[2103]: [info] [pid:2103] Registering with openQA openqa-staging-1.qa.suse.de
Mar 16 10:55:52 openqaworker11 worker[2103]: [warn] [pid:2103] Failed to register at openqa-staging-1.qa.suse.de - connection error: Can't connect: Name or service not known>
Mar 16 10:55:52 openqaworker11 worker[2103]: [info] [pid:2103] Registering with openQA openqa-staging-2.qa.suse.de
Mar 16 10:55:52 openqaworker11 worker[2103]: [warn] [pid:2103] Failed to register at openqa-staging-2.qa.suse.de - connection error: Can't connect: Name or service not known>
Mar 16 10:56:02 openqaworker11 worker[2103]: [info] [pid:2103] Registering with openQA openqa.suse.de
Mar 16 10:56:02 openqaworker11 worker[2103]: [warn] [pid:2103] Failed to register at openqa.suse.de - connection error: Can't connect: Name or service not known - trying aga>
Mar 16 10:56:02 openqaworker11 worker[2103]: [info] [pid:2103] Registering with openQA openqa-staging-1.qa.suse.de
…

It looks like this for now without openqa-staging-1.qa and openqa-staging-2.qa:

-- Logs begin at Fri 2018-10-05 11:17:53 CEST, end at Wed 2022-03-16 11:03:21 CET. --
Mar 16 10:55:41 openqaworker11 openqa-workercache-daemon[2096]: [2096] [i] Creating cache directory tree for "/var/lib/openqa/cache"
Mar 16 10:55:41 openqaworker11 openqa-worker-cacheservice-minion[2098]: [2098] [i] Creating cache directory tree for "/var/lib/openqa/cache"
Mar 16 10:55:41 openqaworker11 openqa-workercache-daemon[2096]: Web application available at http://127.0.0.1:9530
Mar 16 10:55:41 openqaworker11 openqa-workercache-daemon[2096]: Web application available at http://[::1]:9530
Mar 16 10:55:41 openqaworker11 openqa-workercache-daemon[2096]: [2096] [i] Cache size of "/var/lib/openqa/cache" is 0 Byte, with limit 50 GiB
Mar 16 10:55:41 openqaworker11 openqa-workercache-daemon[2096]: [2096] [i] Listening at "http://127.0.0.1:9530"
Mar 16 10:55:41 openqaworker11 openqa-workercache-daemon[2096]: [2096] [i] Listening at "http://[::1]:9530"
Mar 16 10:55:41 openqaworker11 openqa-workercache-daemon[2096]: [2096] [i] Manager 2096 started
Mar 16 10:55:41 openqaworker11 worker[2103]: [info] [pid:2103] worker 1:
Mar 16 10:55:41 openqaworker11 worker[2103]:  - config file:           /etc/openqa/workers.ini
Mar 16 10:55:41 openqaworker11 worker[2103]:  - worker hostname:       openqaworker11
Mar 16 10:55:41 openqaworker11 worker[2103]:  - isotovideo version:    23
Mar 16 10:55:41 openqaworker11 worker[2103]:  - websocket API version: 1
Mar 16 10:55:41 openqaworker11 worker[2103]:  - web UI hosts:          openqa.suse.de
Mar 16 10:55:41 openqaworker11 worker[2103]:  - class:                 qemu_x86_64_debug_poo94783,tap,openqaworker11
Mar 16 10:55:41 openqaworker11 worker[2103]:  - no cleanup:            no
Mar 16 10:55:41 openqaworker11 worker[2103]:  - pool directory:        /var/lib/openqa/pool/1
Mar 16 10:55:41 openqaworker11 worker[2103]: [info] [pid:2103] CACHE: caching is enabled, setting up /var/lib/openqa/cache/openqa.suse.de
Mar 16 10:55:41 openqaworker11 worker[2103]: [info] [pid:2103] Project dir for host openqa.suse.de is /var/lib/openqa/share
Mar 16 10:55:41 openqaworker11 worker[2103]: [info] [pid:2103] Registering with openQA openqa.suse.de
Mar 16 10:55:42 openqaworker11 worker[2103]: [warn] [pid:2103] Failed to register at openqa.suse.de - connection error: Can't connect: Name or service not known - trying aga>
Mar 16 10:55:45 openqaworker11 openqa-websockets-daemon[2094]: [info] Listening at "http://127.0.0.1:9527"
Mar 16 10:55:45 openqaworker11 openqa-websockets-daemon[2094]: Web application available at http://127.0.0.1:9527
Mar 16 10:55:45 openqaworker11 openqa-websockets-daemon[2094]: [info] Listening at "http://[::1]:9527"
Mar 16 10:55:52 openqaworker11 worker[2103]: [info] [pid:2103] Registering with openQA openqa.suse.de
Mar 16 10:55:52 openqaworker11 worker[2103]: [warn] [pid:2103] Failed to register at openqa.suse.de - connection error: Can't connect: Name or service not known - trying aga>
Mar 16 10:56:02 openqaworker11 worker[2103]: [info] [pid:2103] Registering with openQA openqa.suse.de
Mar 16 10:56:02 openqaworker11 worker[2103]: [warn] [pid:2103] Failed to register at openqa.suse.de - connection error: Can't connect: Name or service not known - trying aga>
Mar 16 10:56:12 openqaworker11 worker[2103]: [info] [pid:2103] Registering with openQA openqa.suse.de
Mar 16 10:56:12 openqaworker11 worker[2103]: [info] [pid:2103] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/1828

or only "warn" or more critical log messages:

Mar 16 10:55:52 openqaworker11 worker[2103]: [warn] [pid:2103] Failed to register at openqa.suse.de - connection error: Can't connect: Name or service not known - trying aga>
Mar 16 10:56:02 openqaworker11 worker[2103]: [warn] [pid:2103] Failed to register at openqa.suse.de - connection error: Can't connect: Name or service not known - trying aga>

which shows up twice before the system succeeds to register. We could consider changing that to "info" level. Now changing again the service and doing some reboot tests:

for i in {1..10}; do echo "### Run $i" && while : ; do ping -c 30 openqaworker11 && break || sleep 1; done && while :; do nc -z openqaworker11 22 && break || sleep 1; done && sleep 30 && ssh openqaworker11 "sudo journalctl -b _SYSTEMD_UNIT=openqa-websockets.service + _SYSTEMD_UNIT=openqa-worker-cacheservice-minion.service + _SYSTEMD_UNIT=openqa-worker-auto-restart@1.service + _SYSTEMD_UNIT=openqa-worker-cacheservice.service ; sudo reboot"; done

#10 Updated by okurz 3 months ago

Result from for i in {1..10}; do echo "### Run $i" && while : ; do ping -c 30 openqaworker11 && break || sleep 1; done && while :; do nc -z openqaworker11 22 && break || sleep 1; done && sleep 30 && ssh openqaworker11 "sudo journalctl -b _SYSTEMD_UNIT=openqa-websockets.service + _SYSTEMD_UNIT=openqa-worker-cacheservice-minion.service + _SYSTEMD_UNIT=openqa-worker-auto-restart@1.service + _SYSTEMD_UNIT=openqa-worker-cacheservice.service ; sudo reboot"; done, only warn or above and resolution messages:

### Run 1
Mar 16 11:15:30 openqaworker11 worker[2104]: [warn] [pid:2104] Failed to register at openqa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds
Mar 16 11:15:50 openqaworker11 worker[2104]: [info] [pid:2104] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 1828
Connection to openqaworker11 closed by remote host.
### Run 2
…
Mar 16 11:24:13 openqaworker11 openqa-workercache-daemon[2109]: Can't create listen socket: Address family for hostname not supported at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 124.
…
Mar 16 11:24:14 openqaworker11 worker[2120]: [error] [pid:2120] Worker cache not available: Cache service info error: Connection refused
…
Mar 16 11:24:14 openqaworker11 worker[2120]: [warn] [pid:2120] Failed to register at openqa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds
…
Mar 16 11:24:34 openqaworker11 worker[2120]: [warn] [pid:2120] Failed to register at openqa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds
Mar 16 11:24:44 openqaworker11 worker[2120]: [info] [pid:2120] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 1828
### Run 3
…
Mar 16 11:30:50 openqaworker11 openqa-workercache-daemon[2049]: Can't create listen socket: Address family for hostname not supported at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 124.
…
Mar 16 11:30:50 openqaworker11 worker[2084]: [error] [pid:2084] Worker cache not available: Cache service info error: Connection refused
…
Mar 16 11:30:50 openqaworker11 worker[2084]: [warn] [pid:2084] Failed to register at openqa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds
…
Mar 16 11:31:10 openqaworker11 worker[2084]: [warn] [pid:2084] Failed to register at openqa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds
…
Mar 16 11:31:20 openqaworker11 worker[2084]: [info] [pid:2084] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 1828
### Run 4
…
Mar 16 11:37:16 openqaworker11 openqa-workercache-daemon[2117]: Can't create listen socket: Address family for hostname not supported at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 124.
…
Mar 16 11:37:17 openqaworker11 worker[2126]: [error] [pid:2126] Worker cache not available: Cache service info error: Connection refused
…
Mar 16 11:37:17 openqaworker11 worker[2126]: [warn] [pid:2126] Failed to register at openqa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds
…
Mar 16 11:37:27 openqaworker11 worker[2126]: [warn] [pid:2126] Failed to register at openqa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds
…
Mar 16 11:37:47 openqaworker11 worker[2126]: [info] [pid:2126] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 1828
### Run 5
…
Mar 16 11:43:53 openqaworker11 openqa-workercache-daemon[2079]: Can't create listen socket: Address family for hostname not supported at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 124.
…
Mar 16 11:43:54 openqaworker11 worker[2087]: [error] [pid:2087] Worker cache not available: Cache service info error: Connection refused
…
Mar 16 11:43:54 openqaworker11 worker[2087]: [warn] [pid:2087] Failed to register at openqa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds
…
Mar 16 11:44:04 openqaworker11 worker[2087]: [warn] [pid:2087] Failed to register at openqa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds
…
Mar 16 11:44:14 openqaworker11 worker[2087]: [warn] [pid:2087] Failed to register at openqa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds
…
Mar 16 11:44:24 openqaworker11 worker[2087]: [info] [pid:2087] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 1828
### Run 6
…
Mar 16 11:50:24 openqaworker11 worker[2108]: [warn] [pid:2108] Failed to register at openqa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds
…
Mar 16 11:50:34 openqaworker11 worker[2108]: [warn] [pid:2108] Failed to register at openqa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds
…
Mar 16 11:50:44 openqaworker11 worker[2108]: [warn] [pid:2108] Failed to register at openqa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds
…
Mar 16 11:50:54 openqaworker11 worker[2108]: [info] [pid:2108] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 1828
### Run 7
…
Mar 16 11:56:48 openqaworker11 worker[2106]: [warn] [pid:2106] Failed to register at openqa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds
…
Mar 16 11:56:58 openqaworker11 worker[2106]: [warn] [pid:2106] Failed to register at openqa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds
…
Mar 16 11:57:08 openqaworker11 worker[2106]: [info] [pid:2106] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 1828
### Run 8
…
Mar 16 12:03:26 openqaworker11 worker[2123]: [warn] [pid:2123] Failed to register at openqa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds
…
Mar 16 12:03:36 openqaworker11 worker[2123]: [warn] [pid:2123] Failed to register at openqa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds
…
Mar 16 12:03:46 openqaworker11 worker[2123]: [warn] [pid:2123] Failed to register at openqa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds
…
Mar 16 12:03:56 openqaworker11 worker[2123]: [info] [pid:2123] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 1828
### Run 9
…
Mar 16 12:09:49 openqaworker11 worker[2069]: [warn] [pid:2069] Failed to register at openqa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds
…
Mar 16 12:10:09 openqaworker11 worker[2069]: [warn] [pid:2069] Failed to register at openqa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds
Mar 16 12:10:20 openqaworker11 worker[2069]: [info] [pid:2069] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 1828
### Run 10
Mar 16 12:16:36 openqaworker11 openqa-worker-cacheservice-minion[2064]: DBD::SQLite::db do failed: disk I/O error at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/SQLite/Transaction.pm line 17.
…
Mar 16 12:16:36 openqaworker11 openqa-workercache-daemon[2054]: Can't create listen socket: Address family for hostname not supported at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 124.
…
Mar 16 12:16:37 openqaworker11 worker[2075]: [error] [pid:2075] Worker cache not available: Cache service info error: Connection refused
…
Mar 16 12:16:37 openqaworker11 worker[2075]: [warn] [pid:2075] Failed to register at openqa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds
…
Mar 16 12:16:47 openqaworker11 worker[2075]: [warn] [pid:2075] Failed to register at openqa.suse.de - connection error: Can't connect: Name or service not known - trying …
Mar 16 12:16:57 openqaworker11 worker[2075]: [warn] [pid:2075] Failed to register at openqa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds
Mar 16 12:17:07 openqaworker11 worker[2075]: [info] [pid:2075] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 1828

#11 Updated by okurz 3 months ago

So observations

  1. The crash with "openqa-workercache-daemon.: Can't create listen socket: Address family for hostname not supported" happens in 5/10 runs, i.e. 50% of the cases. In all those cases but only then the worker reports "worker[.]: [error] [pid:.*] Worker cache not available: Cache service info error: Connection refused" -> We should catch that gracefully but expect to test at least 3 times to be sure we covered all cases as so far it only happens with 50% probability
  2. in 1/10 there is "openqa-worker-cacheservice-minion.*: DBD::SQLite::db do failed: disk I/O error" -> currently not following up with any action on this
  3. In 10/10 cases the worker shows "[warn] [pid:.*] Failed to register at openqa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds" at least once but never more than 3 times -> We could enlarge the time from 10s to 40-60s before outputting a warning. The according code is in https://github.com/os-autoinst/openQA/blob/master/lib/OpenQA/Worker.pm#L696 . We could just set the environment variable OPENQA_WORKER_CONNECT_INTERVAL to 60s but then we would delay all reconnect attempts by that long. I would like it better if we could one one level conduct retries more often but warn about failed attempts if they happen too often but still continue retrying.

Most important is 1. I managed to simulate the error condition more easily with

sudo ip addr del 127.0.0.1/8 dev lo
sudo -u _openqa-worker /usr/share/openqa/script/openqa-workercache-daemon

I am thinking of either

-    my $cmd_return_code = $app->start(@args);
+    do {
+        my $cmd_return_code = eval { $app->start(@args) };
+    } while ($@ =~ /Cannot assign requested address/);

as I prefer EAFP over LBYL. As alternative something like using lower level perl commands to check if we can bind.

#12 Updated by okurz 3 months ago

  • Related to action #78390: Worker is stuck in "broken" state due to unavailable cache service (was: and even continuously fails to (re)connect to some configured web UIs) added

#14 Updated by okurz 3 months ago

  • Due date deleted (2022-03-25)
  • Priority changed from Normal to Low

Created https://github.com/os-autoinst/openQA/pull/4577 for the systemd only changes. I plan to continue with this with low prio to increase test coverage for the cache service startup functionality.

#15 Updated by okurz about 2 months ago

  • Status changed from Feedback to Workable

https://github.com/os-autoinst/openQA/pull/4577 is merged. https://github.com/os-autoinst/openQA/pull/4573 is effectively blocked as there is no agreement on using the loop approach within our cache service implementation. We should look again into the implementation of how some services like apache, nginx, cockpit, mail servers, minidlna, postgresql handle that. I am assuming that maybe wicked has a bug there. We should reconsider the choice in #62567 where I removed the network-online.target dependency. Maybe I should add it back for openqa-worker-cacheservice.service specifically.

#16 Updated by okurz about 2 months ago

Running the script from #108091#note-10 I could directly confirm that the same error still pop up for the cache service

May 04 15:50:28 openqaworker11 openqa-workercache-daemon[2097]: [2097] [i] Cache size of "/var/lib/openqa/cache" is 0 Byte, with limit 50 GiB
May 04 15:50:28 openqaworker11 openqa-workercache-daemon[2097]: [2097] [i] Starting cache service: /usr/share/openqa/script/openqa-workercache prefork -m production -i 100 -H 400 -w 4 -G 80
May 04 15:50:28 openqaworker11 openqa-workercache-daemon[2097]: Can't create listen socket: Address family for hostname not supported at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 124.

Before continuing first upgrading 15.2->15.3

okurz wrote:

We should look again into the implementation of how some services like […] handle that

#17 Updated by okurz about 1 month ago

  • Status changed from Workable to Resolved

so all in all quite different. I assume some services don't actually handle it properly and some others like postgresql handle it gracefully internally.

I don't have good ideas how to properly prevent the error showing up but in most cases nobody needs to care. I give up to get rid of the message. The original problem was resolved long ago anyway.

Also available in: Atom PDF