action #62567

openqa services can fail when network is not up (yet) "Can't create listen socket: Address family for hostname not supported"

Added by okurz over 1 year ago. Updated over 1 year ago.

Status: Resolved
Priority: High
Assignee:
Category: Concrete Bugs
Target version: -
Start date: 2020-01-17
Due date: 2020-03-06
% Done: 0%
Estimated time:
Difficulty:

Description

Observation

On a system where the network setup is not instantaneous, e.g. NetworkManager+DHCP, openQA systemd services that are enabled to start automatically at boot can fail like this:

Jan 22 21:42:29 falafel openqa-scheduler[1282]: Can't create listen socket: Address family for hostname not supported at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 124.
Jan 22 21:42:29 falafel openqa-websockets[1283]: Can't create listen socket: Address family for hostname not supported at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 124.
Jan 22 21:42:31 falafel openqa-livehandler[1248]: Can't create listen socket: Address family for hostname not supported at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 124.
Jan 22 21:42:32 falafel.suse.cz openqa[1284]: Can't create listen socket: Address family for hostname not supported at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 124.

Reproducible

I think the issue is reproducible on any system. It is just more likely to be observed with slow DHCP, unless it is reproduced differently, e.g. on a system without any network.

Problem

Currently the systemd services do not depend on the network being up, only on the network controller stack being initialized.

Expected result: Programs should be designed to work regardless of a ready external network.

Suggestions

  • Check startup of services in an environment where network is not up (yet), e.g. container with removed network
  • Ensure all our network related services start up fine regardless of network state
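A minimal sketch of how such a check could be scripted, assuming the failure shows up in the service log with the exact message quoted in the observation above. The function name has_listen_error is hypothetical and not part of openQA. Whether a container or network namespace without network actually triggers the error is not confirmed in this ticket.

```shell
#!/bin/sh
# Hypothetical helper (not part of openQA): reads log text on stdin and
# succeeds if it contains the listen-socket error quoted in this ticket.
has_listen_error() {
    grep -q "Can't create listen socket: Address family for hostname not supported"
}

# Possible uses, shown as comments because they need a real openQA host:
#
#   # check the current boot's journal of one service
#   journalctl -b -u openqa-scheduler | has_listen_error && echo "error seen"
#
#   # start the scheduler in a fresh network namespace (needs root); the
#   # timeout stops the daemon again in case it starts up fine
#   timeout 30 unshare --net /usr/share/openqa/script/openqa-scheduler \
#       daemon -m production 2>&1 | has_listen_error && echo "error seen"
```

The helper only inspects text, so the same check works on the journal of a real boot as well as on the output of a service started by hand.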

Workaround

As a workaround, the systemd services can wait for the network to be online, as described at https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/ :

# systemctl cat openqa-scheduler
# /usr/lib/systemd/system/openqa-scheduler.service
[Unit]
Description=The openQA Scheduler
After=postgresql.service openqa-setup-db.service
Wants=openqa-setup-db.service

[Service]
User=geekotest
ExecStart=/usr/share/openqa/script/openqa-scheduler daemon -m production
TimeoutStopSec=120

[Install]
WantedBy=multi-user.target

# /etc/systemd/system/openqa-scheduler.service.d/override.conf
[Unit]
After=network-online.target
Wants=network-online.target

The same is necessary in /etc/systemd/system/openqa-livehandler.service.d/override.conf
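The drop-in files for both services could be created with a small script. This is only a sketch: the function name write_network_online_override and its ROOT parameter are hypothetical. ROOT is there so that the function can be tried against a scratch directory first. Note that network-online.target only delays startup if a wait-online service of the network stack in use is enabled, which is assumed here.

```shell
#!/bin/sh
# Hypothetical helper: write the override.conf shown above for one unit.
#   $1 = root directory prefix ("" for the real system, or a scratch dir)
#   $2 = unit name without the .service suffix
write_network_online_override() {
    root=$1
    unit=$2
    dir="$root/etc/systemd/system/$unit.service.d"
    mkdir -p "$dir" || return 1
    cat > "$dir/override.conf" <<'EOF'
[Unit]
After=network-online.target
Wants=network-online.target
EOF
}

# Possible use as root on the openQA host (commented out on purpose):
#
#   for unit in openqa-scheduler openqa-livehandler; do
#       write_network_online_override "" "$unit"
#   done
#   systemctl daemon-reload
```

After the daemon-reload, "systemctl cat openqa-livehandler" should list the drop-in below the packaged unit file, in the same way as the scheduler output above.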

os-autoinst_job1.txt (4.51 KB): Logs from one of first fails, on Tumbleweed (syrianidou_sofia, 2020-01-17 13:15)
pool_folder1.tar.gz (432 KB): test failing in container openQA (syrianidou_sofia, 2020-01-17 13:15)
pool_folder2.tar.gz (171 KB): another test failing in container (syrianidou_sofia, 2020-01-17 13:16)
logs (160 KB): logs (okurz, 2020-02-27 11:38)

Related issues

Related to openQA Project - action #44105: if workercache dies, we get *tons* of incompletes (Resolved, 2018-11-21)

Copied from openQA Project - action #62243: After latest updates, openQA has problematic behavior on Dell Precision 5810 (Resolved, 2020-01-17)

History

#1 Updated by okurz over 1 year ago

  • Copied from action #62243: After latest updates, openQA has problematic behavior on Dell Precision 5810 added

#2 Updated by okurz over 1 year ago

  • Category changed from Feature requests to Concrete Bugs
  • Priority changed from Normal to High

Hm, I suspect that this problem was introduced in the past months, either in our code or in the dependencies.

#3 Updated by kraih over 1 year ago

There have been no upstream changes on the Mojo side in recent months that would seem relevant. Looking at the openQA code, it seems we set the listen address to 127.0.0.1, and have done so for a long time (it was switched from localhost 17 months ago).

#4 Updated by okurz over 1 year ago

  • Description updated (diff)

#5 Updated by okurz over 1 year ago

  • Description updated (diff)
  • Status changed from New to Workable

#6 Updated by okurz over 1 year ago

mkittler and I compared the systemd service dependencies to those of Apache and nginx and found that it is good to rely on nss-lookup.target and maybe also on remote-fs.target.

#7 Updated by okurz over 1 year ago

  • Related to action #44105: if workercache dies, we get *tons* of incompletes added

#8 Updated by okurz over 1 year ago

  • File logs added
  • Status changed from Workable to Feedback
  • Assignee set to okurz

I tried to simulate the error condition with two mocked systemd services, "block.service" and "after-block.service", and then set nscd.service to start after them. The scheduler started before that and was fine. From the original logs on falafel (see attached) I could see that the problem happened when the NIC did not even have a link yet. Based on that I will just suggest depending on nss-lookup.target, the same as apache2.service does.
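To compare the ordering dependencies of two units, as done above with apache2.service, something like the following could be used. It is a sketch only: the function name lists_unit is hypothetical, and the example assumes a systemd version that supports "systemctl show --value".

```shell
#!/bin/sh
# Hypothetical helper: succeeds if unit $1 appears as a whole word in the
# space-separated unit list $2 (as printed by "systemctl show -p After").
lists_unit() {
    wanted=$1
    for u in $2; do
        if [ "$u" = "$wanted" ]; then
            return 0
        fi
    done
    return 1
}

# Possible use on a host with both services installed (commented out):
#
#   for svc in apache2.service openqa-scheduler.service; do
#       after=$(systemctl show -p After --value "$svc")
#       if lists_unit nss-lookup.target "$after"; then
#           echo "$svc is ordered after nss-lookup.target"
#       else
#           echo "$svc is NOT ordered after nss-lookup.target"
#       fi
#   done
```

Comparing whole words avoids false matches on units whose names merely contain the target name.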

https://github.com/os-autoinst/openQA/pull/2782 for the webui related service and https://github.com/os-autoinst/openQA/pull/2783 also including worker if we want to.

EDIT: 2020-02-29: Both PRs merged. Let's wait for feedback from production and users.

#9 Updated by okurz over 1 year ago

  • Due date set to 2020-03-06

#10 Updated by okurz over 1 year ago

  • Status changed from Feedback to Resolved

seems fine, no negative reports received
