action #62567: openqa services can fail when network is not up (yet) "Can't create listen socket: Address family for hostname not supported" - openQA Project (public) - openSUSE Project Management Tool

action #62567

closed

openqa services can fail when network is not up (yet) "Can't create listen socket: Address family for hostname not supported"

Added by okurz about 5 years ago. Updated almost 5 years ago.

Status:

Resolved

Priority:

High

Assignee:

okurz

Category:

Regressions/Crashes

Target version:

Start date:

2020-01-17

Due date:

2020-03-06

% Done:

Estimated time:

Description

Observation

On a system where the network setup is not instantaneous, e.g. NetworkManager+DHCP, openQA systemd services that are enabled to start automatically at boot can fail like this:

Jan 22 21:42:29 falafel openqa-scheduler[1282]: Can't create listen socket: Address family for hostname not supported at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 124.
Jan 22 21:42:29 falafel openqa-websockets[1283]: Can't create listen socket: Address family for hostname not supported at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 124.
Jan 22 21:42:31 falafel openqa-livehandler[1248]: Can't create listen socket: Address family for hostname not supported at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 124.
Jan 22 21:42:32 falafel.suse.cz openqa[1284]: Can't create listen socket: Address family for hostname not supported at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 124.
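
To see which units hit this error on a given boot, the journal can be filtered for the message. This is a minimal sketch that assumes the default short journal format shown above (month, day, time, host, then `name[pid]:`); `failed_listen_units` is a hypothetical helper name, not something shipped with openQA.

```shell
#!/bin/sh
# failed_listen_units: read journal lines on stdin and print the names of
# the units that logged the listen-socket error, one per line, deduplicated.
# Assumes the short journal format, where field 5 is "name[pid]:".
failed_listen_units() {
    grep -F "Can't create listen socket" |
        awk '{ sub(/\[[0-9]+\]:$/, "", $5); print $5 }' |
        LC_ALL=C sort -u
}
```

Possible usage on an affected host: `journalctl -b -o short | failed_listen_units`.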

Reproducible

I think the issue is reproducible on any system. It is just more likely to be observed with slow DHCP, unless it is reproduced in a different way, e.g. on a system without any network.

Problem

Currently the systemd services do not depend on the network being up, only on the network management stack being initialized.

Expected result: The services should start and work regardless of whether an external network is ready.

Suggestions

Check startup of services in an environment where network is not up (yet), e.g. container with removed network
Ensure all our network related services start up fine regardless of network state
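
One way to approximate "an environment where network is not up" without building a container is a fresh network namespace, which contains only a loopback interface in the down state. This is a rough sketch, assuming util-linux `unshare` is installed and unprivileged user namespaces are allowed on the host; `run_without_network` is a hypothetical helper name, and whether this reproduces the exact failure from this ticket has not been verified.

```shell
#!/bin/sh
# run_without_network CMD [ARGS...]
# Run CMD in a new network namespace that has no configured interfaces.
# --map-root-user lets an unprivileged user create the namespace; the
# command then runs as root inside that user namespace only.
run_without_network() {
    unshare --map-root-user --net -- "$@"
}
```

For example, `run_without_network ip -brief addr` should list only `lo`. Starting an openQA daemon this way would still need its configuration and database access, so this is only a starting point for such a check.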

Workaround

As a workaround, the systemd services can wait for the network to be online, as described at https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/:

# systemctl cat openqa-scheduler
# /usr/lib/systemd/system/openqa-scheduler.service
[Unit]
Description=The openQA Scheduler
After=postgresql.service openqa-setup-db.service
Wants=openqa-setup-db.service

[Service]
User=geekotest
ExecStart=/usr/share/openqa/script/openqa-scheduler daemon -m production
TimeoutStopSec=120

[Install]
WantedBy=multi-user.target

# /etc/systemd/system/openqa-scheduler.service.d/override.conf
[Unit]
After=network-online.target
Wants=network-online.target

The same is necessary in /etc/systemd/system/openqa-livehandler.service.d/override.conf
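
The drop-in above can be written for several units in one go. This is a minimal sketch; `write_network_online_override` and its `ROOT` prefix argument are hypothetical conveniences introduced here so the function can be tried against a scratch directory first.

```shell
#!/bin/sh
# write_network_online_override ROOT UNIT...
# Write the network-online drop-in shown above for each UNIT.
# ROOT is a path prefix: "" on a live system, a scratch dir for a dry run.
write_network_online_override() {
    root="$1"
    shift
    for unit in "$@"; do
        dir="${root}/etc/systemd/system/${unit}.service.d"
        mkdir -p "$dir" || return 1
        cat > "${dir}/override.conf" <<'EOF'
[Unit]
After=network-online.target
Wants=network-online.target
EOF
    done
}
```

On a live system this could be run as root as `write_network_online_override "" openqa-scheduler openqa-livehandler`, followed by `systemctl daemon-reload` so that systemd picks up the new drop-ins.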

Files

os-autoinst_job1.txt (4.51 KB): Logs from one of first fails, on Tumbleweed (syrianidou_sofia, 2020-01-17 13:15)
pool_folder1.tar.gz (432 KB): test failing in container openQA (syrianidou_sofia, 2020-01-17 13:15)
pool_folder2.tar.gz (171 KB): another test failing in container (syrianidou_sofia, 2020-01-17 13:16)
logs (160 KB) (okurz, 2020-02-27 11:38)

Related issues 2 (0 open — 2 closed)

Updated by okurz about 5 years ago

Copied from action #62243: After latest updates, openQA has problematic behavior on Dell Precision 5810 added

Updated by okurz about 5 years ago

Category changed from Feature requests to Regressions/Crashes
Priority changed from Normal to High

Hm, I suspect this problem was introduced in the past few months, either in our code or in the dependencies.

Updated by kraih about 5 years ago

There have been no upstream changes on the Mojo side in recent months that would seem relevant. Looking at the openQA code, it seems we set the listen address to 127.0.0.1, and have done so for a long time (it was switched from localhost 17 months ago).

Updated by okurz almost 5 years ago

Description updated (diff)

Updated by okurz almost 5 years ago

Description updated (diff)
Status changed from New to Workable

Updated by okurz almost 5 years ago

mkittler and I compared the systemd service dependencies to those of Apache and nginx and found that it's good to rely on nss-lookup.target and maybe also remote-fs.target.

Updated by okurz almost 5 years ago

Related to action #44105: if workercache dies, we get *tons* of incompletes added

Updated by okurz almost 5 years ago

File logs logs added
Status changed from Workable to Feedback
Assignee set to okurz

I tried to simulate the error condition with two mocked systemd services, "block.service" and "after-block.service", and then setting nscd.service to start after that. The scheduler started before that and was fine. From the original logs on falafel (see attached) I could see that the problem happened when the NIC didn't even have a link yet. Based on that I will just suggest depending on nss-lookup.target, same as apache2.service does.

https://github.com/os-autoinst/openQA/pull/2782 for the webui related service and https://github.com/os-autoinst/openQA/pull/2783 also including worker if we want to.

EDIT: 2020-02-29: Both PRs merged. Let's wait for feedback from production and users.
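
Whether an installed unit is explicitly ordered after a given target such as nss-lookup.target can be checked from the unit text. This is a minimal sketch; `unit_orders_after` is a hypothetical helper name, and it only looks at explicit `After=` lines, not at the implicit or default dependencies that systemd adds.

```shell
#!/bin/sh
# unit_orders_after TARGET
# Read unit file text on stdin (e.g. from "systemctl cat") and succeed
# if any After= line lists TARGET as a whole, space-separated entry.
unit_orders_after() {
    awk -v t="$1" '
        /^After=/ {
            sub(/^After=/, "")
            n = split($0, deps, /[ \t]+/)
            for (i = 1; i <= n; i++)
                if (deps[i] == t) found = 1
        }
        END { exit(found ? 0 : 1) }
    '
}
```

Possible usage: `systemctl cat openqa-scheduler | unit_orders_after nss-lookup.target && echo ordered`. For the effective ordering as systemd computed it, `systemctl show -p After openqa-scheduler` is the more direct check.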

Updated by okurz almost 5 years ago

Due date set to 2020-03-06

Updated by okurz almost 5 years ago

Status changed from Feedback to Resolved

seems fine, no negative reports received
