action #62567

closed

openqa services can fail when network is not up (yet) "Can't create listen socket: Address family for hostname not supported"

Added by okurz over 4 years ago. Updated about 4 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Regressions/Crashes
Target version:
-
Start date:
2020-01-17
Due date:
2020-03-06
% Done:

0%

Estimated time:

Description

Observation

On a system where the network setup is not instantaneous, e.g. NetworkManager+DHCP, openQA systemd services that are enabled to start automatically can fail like this:

Jan 22 21:42:29 falafel openqa-scheduler[1282]: Can't create listen socket: Address family for hostname not supported at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 124.
Jan 22 21:42:29 falafel openqa-websockets[1283]: Can't create listen socket: Address family for hostname not supported at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 124.
Jan 22 21:42:31 falafel openqa-livehandler[1248]: Can't create listen socket: Address family for hostname not supported at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 124.
Jan 22 21:42:32 falafel.suse.cz openqa[1284]: Can't create listen socket: Address family for hostname not supported at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 124.

Reproducible

I think the issue is reproducible on any system; it is just more likely to be observed with slow DHCP. It can probably also be reproduced differently, e.g. on a system without any network.

Problem

Currently the systemd services do not depend on the network being up, only on the network controller stack being initialized.

Expected result: Programs should be designed to work regardless of a ready external network.

Suggestions

  • Check startup of services in an environment where network is not up (yet), e.g. container with removed network
  • Ensure all our network related services start up fine regardless of network state
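
The first suggestion could be tried without a full container by starting a service in an empty network namespace. The following is only a sketch under assumptions: it uses `unshare` from util-linux instead of a container, it needs `ip` from iproute2 and a host that allows unprivileged user namespaces, and `run_without_network` is a hypothetical helper name. Whether this reproduces the exact error from the observation has not been verified here.

```shell
#!/bin/sh
# Sketch: run a command in a fresh network namespace that only has loopback.
# run_without_network is a hypothetical helper, not part of openQA.
run_without_network() {
    # -r maps the current user to root inside a new user namespace,
    # -n creates a new network namespace without any external interface.
    # Loopback is brought up so that 127.0.0.1 stays usable.
    unshare -r -n sh -c 'ip link set lo up && exec "$@"' sh "$@"
}

# Example (command line taken from the unit file in this ticket):
# run_without_network /usr/share/openqa/script/openqa-scheduler daemon -m production
```

Leaving out the `ip link set lo up` step would additionally simulate a system where not even loopback is configured.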

Workaround

As a workaround, the systemd services can wait for the network to be online, as described on https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/ :

# systemctl cat openqa-scheduler
# /usr/lib/systemd/system/openqa-scheduler.service
[Unit]
Description=The openQA Scheduler
After=postgresql.service openqa-setup-db.service
Wants=openqa-setup-db.service

[Service]
User=geekotest
ExecStart=/usr/share/openqa/script/openqa-scheduler daemon -m production
TimeoutStopSec=120

[Install]
WantedBy=multi-user.target

# /etc/systemd/system/openqa-scheduler.service.d/override.conf
[Unit]
After=network-online.target
Wants=network-online.target

The same is necessary in /etc/systemd/system/openqa-livehandler.service.d/override.conf
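
Creating the same drop-in for several services could be scripted roughly as follows. This is a minimal sketch, not something taken from the openQA packages: `write_network_dropin` and the `SYSTEMD_DIR` variable are hypothetical names, and `SYSTEMD_DIR` only exists so the function can be tried against a scratch directory instead of /etc.

```shell
#!/bin/sh
# Sketch: write a drop-in that makes a unit wait for network-online.target.
# write_network_dropin and SYSTEMD_DIR are hypothetical names for illustration.
write_network_dropin() {
    unit="$1"
    dir="${SYSTEMD_DIR:-/etc/systemd/system}/${unit}.service.d"
    mkdir -p "$dir" || return 1
    # Same content as the override.conf shown above.
    cat > "$dir/override.conf" <<'EOF'
[Unit]
After=network-online.target
Wants=network-online.target
EOF
}

# Example, to be run as root:
# for unit in openqa-scheduler openqa-livehandler; do
#     write_network_dropin "$unit"
# done
# systemctl daemon-reload
```

The `systemctl daemon-reload` call is needed so that systemd picks up the new drop-in files before the next start of the services.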


Files

os-autoinst_job1.txt (4.51 KB): Logs from one of first fails, on Tumbleweed (syrianidou_sofia, 2020-01-17 13:15)
pool_folder1.tar.gz (432 KB): test failing in container openQA (syrianidou_sofia, 2020-01-17 13:15)
pool_folder2.tar.gz (171 KB): another test failing in container (syrianidou_sofia, 2020-01-17 13:16)
logs (160 KB) (okurz, 2020-02-27 11:38)

Related issues 2 (0 open, 2 closed)

Related to openQA Project - action #44105: if workercache dies, we get *tons* of incompletes (Resolved, mkittler, 2018-11-21)

Copied from openQA Project - action #62243: After latest updates, openQA has problematic behavior on Dell Precision 5810 (Resolved, okurz, 2020-01-17)

Actions #1

Updated by okurz over 4 years ago

  • Copied from action #62243: After latest updates, openQA has problematic behavior on Dell Precision 5810 added
Actions #2

Updated by okurz over 4 years ago

  • Category changed from Feature requests to Regressions/Crashes
  • Priority changed from Normal to High

Hm, I suspect that this is a problem that was introduced in the past months, either in our code or in the dependencies.

Actions #3

Updated by kraih about 4 years ago

There have been no upstream changes from the Mojo side of things in recent months that would seem relevant. Looking at the openQA code, it seems we set the listen address to 127.0.0.1, and have done so for a long time (it was switched from localhost 17 months ago).

Actions #4

Updated by okurz about 4 years ago

  • Description updated (diff)
Actions #5

Updated by okurz about 4 years ago

  • Description updated (diff)
  • Status changed from New to Workable
Actions #6

Updated by okurz about 4 years ago

mkittler and I compared the systemd service dependencies to those of Apache and nginx and found that it's good to rely on nss-lookup.target and maybe also remote-fs.target.

Actions #7

Updated by okurz about 4 years ago

  • Related to action #44105: if workercache dies, we get *tons* of incompletes added
Actions #8

Updated by okurz about 4 years ago

  • File logs added
  • Status changed from Workable to Feedback
  • Assignee set to okurz

I tried to simulate the error condition with two mocked systemd services "block.service" and "after-block.service" and then setting nscd.service to start after that. The scheduler started before that and was fine. From the original logs on falafel (see attached) I could see that the problem happened when the NIC didn't even have a link yet. Based on that I will just suggest depending on nss-lookup.target, same as apache2.service does.

https://github.com/os-autoinst/openQA/pull/2782 for the webui related services and https://github.com/os-autoinst/openQA/pull/2783 also including the worker if we want to.

EDIT: 2020-02-29: Both PRs merged. Let's wait for feedback from production and users.

Actions #9

Updated by okurz about 4 years ago

  • Due date set to 2020-03-06
Actions #10

Updated by okurz about 4 years ago

  • Status changed from Feedback to Resolved

seems fine, no negative reports received
