action #62567
openqa services can fail when network is not up (yet) "Can't create listen socket: Address family for hostname not supported"
Description
Observation
On a system where the network setup is not instantaneous, e.g. NetworkManager+DHCP, openQA systemd services that are enabled to start automatically can fail like this:
Jan 22 21:42:29 falafel openqa-scheduler[1282]: Can't create listen socket: Address family for hostname not supported at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 124.
Jan 22 21:42:29 falafel openqa-websockets[1283]: Can't create listen socket: Address family for hostname not supported at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 124.
Jan 22 21:42:31 falafel openqa-livehandler[1248]: Can't create listen socket: Address family for hostname not supported at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 124.
Jan 22 21:42:32 falafel.suse.cz openqa[1284]: Can't create listen socket: Address family for hostname not supported at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 124.
Reproducible
I think the issue is reproducible on any system; slow DHCP just makes it more likely to be observed. It can also be reproduced differently, e.g. on a system without any network.
Problem
Currently the systemd services do not depend on the network being up, only on the network controller stack being initialized.
Expected result: Programs should be designed to work regardless of whether the external network is ready.
Suggestions
- Check startup of services in an environment where network is not up (yet), e.g. container with removed network
- Ensure all our network related services start up fine regardless of network state
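A minimal sketch of the first suggestion, assuming a Linux host with util-linux unshare and root privileges: the command runs in a fresh network namespace, which contains only a loopback interface that is still down. The helper name run_without_network is hypothetical, and whether this triggers exactly the same error as on falafel is not confirmed here.

```shell
#!/bin/sh
# Hypothetical helper: run a command without any usable network by placing it
# in a new, empty network namespace (needs root or equivalent privileges).
run_without_network() {
    unshare --net -- "$@"
}
```

For example, as root: run_without_network /usr/share/openqa/script/openqa-scheduler daemon -m production. This starts the scheduler as the calling user rather than geekotest, so it is only meant as a quick check of the listen socket setup.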
Workaround
As a workaround the systemd services can wait for the network being online as described on https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/ :
# systemctl cat openqa-scheduler
# /usr/lib/systemd/system/openqa-scheduler.service
[Unit]
Description=The openQA Scheduler
After=postgresql.service openqa-setup-db.service
Wants=openqa-setup-db.service
[Service]
User=geekotest
ExecStart=/usr/share/openqa/script/openqa-scheduler daemon -m production
TimeoutStopSec=120
[Install]
WantedBy=multi-user.target
# /etc/systemd/system/openqa-scheduler.service.d/override.conf
[Unit]
After=network-online.target
Wants=network-online.target
The same is necessary in /etc/systemd/system/openqa-livehandler.service.d/override.conf
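The two drop-in files can also be written in one go. The following is a sketch only: the function name and the SYSTEMD_DIR variable are made up for this example, the latter so that the result can be inspected in a scratch directory before touching /etc/systemd/system.

```shell
#!/bin/sh
# Sketch: write the network-online drop-in from the workaround for each given
# service name. SYSTEMD_DIR is a hypothetical variable introduced here; it
# defaults to the real systemd configuration directory.
write_network_online_override() {
    systemd_dir="${SYSTEMD_DIR:-/etc/systemd/system}"
    for service in "$@"; do
        dropin_dir="$systemd_dir/$service.service.d"
        mkdir -p "$dropin_dir" || return 1
        # Same content as the override.conf shown above.
        cat > "$dropin_dir/override.conf" <<'EOF'
[Unit]
After=network-online.target
Wants=network-online.target
EOF
    done
}
```

As root: write_network_online_override openqa-scheduler openqa-livehandler, followed by systemctl daemon-reload so that systemd picks up the new drop-ins.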
Files
Updated by okurz almost 5 years ago
- Copied from action #62243: After latest updates, openQA has problematic behavior on Dell Precision 5810 added
Updated by okurz almost 5 years ago
- Category changed from Feature requests to Regressions/Crashes
- Priority changed from Normal to High
hm, I just have the suspicion that this is a problem that was introduced in the past months, either in our code or the dependencies.
Updated by kraih almost 5 years ago
There have been no upstream changes from the Mojo side of things in recent months that would seem relevant. Looking at the openQA code, it seems we set the listen address to 127.0.0.1, and have done so for a long time (it was switched from localhost 17 months ago).
Updated by okurz almost 5 years ago
- Description updated (diff)
- Status changed from New to Workable
Updated by okurz almost 5 years ago
mkittler and I compared the systemd service dependencies to those of Apache and nginx and found that it's good to rely on nss-lookup.target and maybe also remote-fs.target
Updated by okurz almost 5 years ago
- Related to action #44105: if workercache dies, we get *tons* of incompletes added
Updated by okurz almost 5 years ago
I tried to simulate the error condition with two mocked systemd services "block.service" and "after-block.service" and then setting nscd.service to start after that. The scheduler started before that and was fine. From the original logs on falafel (see attached) I could find that the problem happened when the NIC didn't even have a link yet. Based on that I will just suggest depending on nss-lookup.target, same as apache2.service does.
https://github.com/os-autoinst/openQA/pull/2782 for the webui related service and https://github.com/os-autoinst/openQA/pull/2783 also including worker if we want to.
EDIT: 2020-02-29: Both PRs merged. Let's wait for feedback from production and users.
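To check on a deployed host that a service is really ordered after nss-lookup.target, the After= property that systemd reports can be inspected. A small sketch with a hypothetical helper name; it only parses the output of systemctl show and makes no assumption about how the PRs express the dependency.

```shell
#!/bin/sh
# Hypothetical helper: read "After=unit1 unit2 ..." on stdin and succeed only
# if the unit given as first argument is listed exactly.
has_after_dependency() {
    wanted="$1"
    sed -n 's/^After=//p' | tr ' ' '\n' | grep -qxF -- "$wanted"
}
```

Usage: systemctl show --property=After openqa-scheduler.service | has_after_dependency nss-lookup.target && echo "ordered after nss-lookup.target"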
Updated by okurz almost 5 years ago
- Status changed from Feedback to Resolved
seems fine, no negative reports received