action #110494 (closed)
alert: openqaworker5 host up size:M
Added by jbaier_cz over 2 years ago. Updated over 2 years ago.
Description
Observation
It seems that openqaworker5 has been down since early morning: https://stats.openqa-monitor.qa.suse.de/d/WDopenqaworker5/worker-dashboard-openqaworker5?tab=alert&viewPanel=65105&orgId=1&from=1651359305036&to=1651398742983
The host is reachable via IPMI and it is up.
Suggestions
ipmi-openqaworker5-ipmi sol activate allowed login, but the network was down. There is one failed service, "os-autoinst-openvswitch", which failed with "can't parse bridge…"; this could be the cause or just a symptom of another network issue. okurz triggered systemctl start default.target.
Check again if the host is reachable, log in over SoL, check logs, try multiple reboots, and fix any problems in the network config or os-autoinst-openvswitch. We already have multiple levels of retry, e.g. in os-autoinst-openvswitch. Check whether this retry was simply not enough or whether it is a different problem, or add more retrying on the systemd level (a sketch of such a drop-in follows below).
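A minimal sketch of what more retrying on the systemd level could look like, assuming a drop-in for os-autoinst-openvswitch.service (the file name 40-retry.conf and the values are hypothetical):
# hypothetical /etc/systemd/system/os-autoinst-openvswitch.service.d/40-retry.conf
[Unit]
# do not give up once the default start rate limit is hit
StartLimitIntervalSec=0

[Service]
Restart=on-failure
RestartSec=30s
After creating such a drop-in, a systemctl daemon-reload would be needed for it to take effect.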
Updated by jbaier_cz over 2 years ago
- Related to action #110497: Minion influxdb data causing unusual download rates size:M added
Updated by livdywan over 2 years ago
- Subject changed from alert: openqaworker5 host up to alert: openqaworker5 host up size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by mkittler over 2 years ago
- Status changed from Workable to In Progress
- Assignee set to mkittler
Updated by mkittler over 2 years ago
Looks like it just took too long:
-- Reboot --
Mai 01 03:38:52 openqaworker5 systemd[1]: Starting os-autoinst openvswitch helper...
Mai 01 03:39:00 openqaworker5 systemd[1]: Started os-autoinst openvswitch helper.
Mai 01 03:39:01 openqaworker5 os-autoinst-openvswitch[1734]: Waiting for IP on bridge 'br1', 1200s left ...
Mai 01 03:39:02 openqaworker5 os-autoinst-openvswitch[1734]: Waiting for IP on bridge 'br1', 1199s left ...
…
Mai 01 03:59:01 openqaworker5 os-autoinst-openvswitch[1734]: Waiting for IP on bridge 'br1', 3s left ...
Mai 01 03:59:02 openqaworker5 os-autoinst-openvswitch[1734]: Waiting for IP on bridge 'br1', 2s left ...
Mai 01 03:59:03 openqaworker5 os-autoinst-openvswitch[1734]: can't parse bridge local port IP at /usr/lib/os-autoinst/os-autoinst-openvswitch line 66.
Mai 01 03:59:03 openqaworker5 os-autoinst-openvswitch[1734]: Waiting for IP on bridge 'br1', 1s left ...
Mai 01 03:59:03 openqaworker5 systemd[1]: os-autoinst-openvswitch.service: Main process exited, code=exited, status=255/EXCEPTION
Mai 01 03:59:03 openqaworker5 systemd[1]: os-autoinst-openvswitch.service: Failed with result 'exit-code'.
Mai 02 11:25:51 openqaworker5 systemd[1]: Starting os-autoinst openvswitch helper...
Mai 02 11:25:52 openqaworker5 systemd[1]: Started os-autoinst openvswitch helper.
Mai 02 11:25:53 openqaworker5 os-autoinst-openvswitch[3664]: Waiting for IP on bridge 'br1', 1200s left ...
Mai 02 11:25:54 openqaworker5 os-autoinst-openvswitch[3664]: Waiting for IP on bridge 'br1', 1199s left ...
…
Mai 02 11:27:38 openqaworker5 os-autoinst-openvswitch[3664]: Waiting for IP on bridge 'br1', 1095s left ...
Mai 02 11:27:39 openqaworker5 os-autoinst-openvswitch[3664]: Waiting for IP on bridge 'br1', 1094s left ...
Mai 02 12:52:23 openqaworker5 ovs-vsctl[37226]: ovs|00001|vsctl|INFO|Called as ovs-vsctl set port tap5 tag=68 vlan_mode=dot1q-tunnel
Mai 02 12:52:23 openqaworker5 ovs-vsctl[37566]: ovs|00001|vsctl|INFO|Called as ovs-vsctl set port tap2 tag=58 vlan_mode=dot1q-tunnel
Mai 02 12:52:24 openqaworker5 ovs-vsctl[38374]: ovs|00001|vsctl|INFO|Called as ovs-vsctl set port tap38 tag=18 vlan_mode=dot1q-tunnel
Mai 02 12:52:26 openqaworker5 ovs-vsctl[38635]: ovs|00001|vsctl|INFO|Called as ovs-vsctl set port tap29 tag=18 vlan_mode=dot1q-tunnel
Mai 02 12:52:26 openqaworker5 ovs-vsctl[38639]: ovs|00001|vsctl|INFO|Called as ovs-vsctl set port tap1 tag=25 vlan_mode=dot1q-tunnel
Judging by the code, the exception at the end is "normal" if the bridge's IP isn't assigned in time.
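For reference, what the helper waits for can be checked manually; a hedged sketch using the bridge name from the log above:
# does br1 have an IPv4 address assigned? (this is what os-autoinst-openvswitch waits for)
ip -4 addr show dev br1
# do the Open vSwitch bridge and its ports exist at all?
ovs-vsctl show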
The timeout has already been increased:
# /etc/systemd/system/os-autoinst-openvswitch.service.d/30-init-timeout.conf
[Service]
Environment="OS_AUTOINST_OPENVSWITCH_INIT_TIMEOUT=1200"
(There are no further drop-in configs for the service.)
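Whether the drop-in is actually in effect could be double-checked, for example (a sketch, not output captured on the worker):
# list drop-ins that extend packaged units
systemd-delta --type=extended
# show the environment the service is started with, including OS_AUTOINST_OPENVSWITCH_INIT_TIMEOUT
systemctl show os-autoinst-openvswitch.service -p Environment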
The full journal shows that the network apparently took very long to come up, as other services ran into network-related errors as well:
Mai 01 03:40:11 openqaworker5 worker[1850]: [warn] [pid:1850] Failed to register at openqa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds
Mai 01 03:40:11 openqaworker5 worker[1853]: [warn] [pid:1853] Failed to register at openqa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds
Mai 01 03:40:11 openqaworker5 worker[1832]: [warn] [pid:1832] Failed to register at openqa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds
Mai 01 03:40:12 openqaworker5 os-autoinst-openvswitch[1734]: Waiting for IP on bridge 'br1', 1129s left ...
Mai 01 03:40:13 openqaworker5 os-autoinst-openvswitch[1734]: Waiting for IP on bridge 'br1', 1128s left ...
Mai 01 03:40:14 openqaworker5 os-autoinst-openvswitch[1734]: Waiting for IP on bridge 'br1', 1127s left ...
Mai 01 03:40:15 openqaworker5 openqa-worker-cacheservice-minion[2419]: [2419] [i] Cache size of "/var/lib/openqa/cache" is 0 Byte, with limit 50 GiB
Mai 01 03:40:15 openqaworker5 openqa-worker-cacheservice-minion[2419]: [2419] [i] Resetting all leftover locks after restart
Mai 01 03:40:15 openqaworker5 openqa-worker-cacheservice-minion[2419]: [2419] [i] Worker 2419 started
Mai 01 03:40:15 openqaworker5 openqa-worker-cacheservice-minion[2419]: [2419] [i] Worker 2419 stopped
Mai 01 03:40:15 openqaworker5 systemd[1]: openqa-worker-cacheservice-minion.service: Main process exited, code=exited, status=88/n/a
Mai 01 03:40:15 openqaworker5 systemd[1]: openqa-worker-cacheservice-minion.service: Failed with result 'exit-code'.
Mai 01 03:40:15 openqaworker5 systemd[1]: Stopped OpenQA Worker Cache Service Minion.
Mai 01 03:40:15 openqaworker5 systemd[1]: Stopped OpenQA Worker Cache Service.
Mai 01 03:40:15 openqaworker5 systemd[1]: Started OpenQA Worker Cache Service.
Mai 01 03:40:15 openqaworker5 os-autoinst-openvswitch[1734]: Waiting for IP on bridge 'br1', 1126s left ...
Mai 01 03:40:15 openqaworker5 systemd[1]: Started OpenQA Worker Cache Service Minion.
Mai 01 03:40:15 openqaworker5 openqa-workercache-daemon[2426]: [2426] [i] Cache size of "/var/lib/openqa/cache" is 0 Byte, with limit 50 GiB
Mai 01 03:40:15 openqaworker5 openqa-workercache-daemon[2426]: Can't create listen socket: Address family for hostname not supported at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 124.
Mai 01 03:40:15 openqaworker5 systemd[1]: openqa-worker-cacheservice.service: Main process exited, code=exited, status=22/n/a
Mai 01 03:40:15 openqaworker5 systemd[1]: openqa-worker-cacheservice.service: Failed with result 'exit-code'.
Mai 01 03:40:15 openqaworker5 systemd[1]: openqa-worker-cacheservice.service: Scheduled restart job, restart counter is at 15.
…
Mai 01 03:59:02 openqaworker5 worker[1867]: [warn] [pid:1867] Failed to register at openqa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds
Mai 01 03:59:02 openqaworker5 os-autoinst-openvswitch[1734]: Waiting for IP on bridge 'br1', 2s left ...
Mai 01 03:59:02 openqaworker5 openqa-workercache-daemon[5099]: [5099] [i] Cache size of "/var/lib/openqa/cache" is 0 Byte, with limit 50 GiB
Mai 01 03:59:02 openqaworker5 openqa-workercache-daemon[5099]: Can't create listen socket: Address family for hostname not supported at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 124.
Mai 01 03:59:02 openqaworker5 systemd[1]: openqa-worker-cacheservice.service: Main process exited, code=exited, status=22/n/a
Mai 01 03:59:02 openqaworker5 systemd[1]: openqa-worker-cacheservice.service: Failed with result 'exit-code'.
Mai 01 03:59:02 openqaworker5 systemd[1]: openqa-worker-cacheservice.service: Scheduled restart job, restart counter is at 224.
Mai 01 03:59:02 openqaworker5 systemd[1]: Stopping OpenQA Worker Cache Service Minion...
Mai 01 03:59:03 openqaworker5 os-autoinst-openvswitch[1734]: can't parse bridge local port IP at /usr/lib/os-autoinst/os-autoinst-openvswitch line 66.
Mai 01 03:59:03 openqaworker5 os-autoinst-openvswitch[1734]: Waiting for IP on bridge 'br1', 1s left ...
Mai 01 03:59:03 openqaworker5 systemd[1]: os-autoinst-openvswitch.service: Main process exited, code=exited, status=255/EXCEPTION
Mai 01 03:59:03 openqaworker5 systemd[1]: os-autoinst-openvswitch.service: Failed with result 'exit-code'.
Mai 01 03:59:05 openqaworker5 telegraf[1745]: 2022-05-01T01:59:05Z E! [outputs.influxdb] when writing to [http://openqa-monitor.qa.suse.de:8086]: Post "http://openqa-monitor.qa.suse.de:8086/write?db=telegraf": dial tcp: lookup openqa-monitor.qa.suse.de on [::1]:53: read udp [::1]:41586->[::1]:53: read: connection >
Mai 01 03:59:05 openqaworker5 telegraf[1745]: 2022-05-01T01:59:05Z E! [agent] Error writing to outputs.influxdb: could not write any address
Mai 01 03:59:07 openqaworker5 openqa-worker-cacheservice-minion[5100]: [5100] [i] Cache size of "/var/lib/openqa/cache" is 0 Byte, with limit 50 GiB
Mai 01 03:59:07 openqaworker5 openqa-worker-cacheservice-minion[5100]: [5100] [i] Resetting all leftover locks after restart
Mai 01 03:59:07 openqaworker5 openqa-worker-cacheservice-minion[5100]: [5100] [i] Worker 5100 started
Mai 01 03:59:07 openqaworker5 openqa-worker-cacheservice-minion[5100]: [5100] [i] Worker 5100 stopped
Mai 01 03:59:07 openqaworker5 systemd[1]: openqa-worker-cacheservice-minion.service: Main process exited, code=exited, status=176/n/a
Mai 01 03:59:07 openqaworker5 systemd[1]: openqa-worker-cacheservice-minion.service: Failed with result 'exit-code'.
Mai 01 03:59:07 openqaworker5 systemd[1]: Stopped OpenQA Worker Cache Service Minion.
Mai 01 03:59:07 openqaworker5 systemd[1]: Stopped OpenQA Worker Cache Service.
The network was not restored automatically, so a retry within our service wouldn't have helped (just as it didn't help the cache service).
The network was only restored the next day:
Mai 02 11:27:39 openqaworker5 wickedd-dhcp4[3667]: eth0: Request to acquire DHCPv4 lease with UUID 05a46f62-2e0c-0500-690e-00009a000000
Mai 02 11:27:39 openqaworker5 kernel: NET: Registered protocol family 17
Mai 02 11:27:39 openqaworker5 wickedd-dhcp6[1423]: eth0: Request to acquire DHCPv6 lease with UUID 05a46f62-2e0c-0500-690e-00009b000000 in mode auto
Mai 02 11:27:39 openqaworker5 os-autoinst-openvswitch[3664]: Waiting for IP on bridge 'br1', 1094s left ...
Mai 02 11:27:39 openqaworker5 nscd[1333]: 1333 ignored inotify event for `/etc/resolv.conf` (file exists)
Mai 02 11:27:39 openqaworker5 nscd[1333]: 1333 ignored inotify event for `/etc/resolv.conf` (file exists)
Mai 02 11:27:39 openqaworker5 wickedd-dhcp4[3667]: eth0: Committed DHCPv4 lease with address 10.160.1.93 (lease time 64800 sec, renew in 32400 sec, rebind in 56700 sec)
I suppose that happened only because @okurz invoked systemctl start default.target.
So the failing os-autoinst-openvswitch is a mere symptom and further retries wouldn't help. It looks like wicked's startup got stuck due to a dependency issue:
martchus@openqaworker5:~> sudo journalctl --boot -u wicked
-- Logs begin at Thu 2022-04-28 03:31:31 CEST, end at Tue 2022-05-03 17:23:44 CEST. --
Mai 01 03:38:18 openqaworker5 systemd[1]: network-online.target: Found ordering cycle on wicked.service/start
Mai 01 03:38:18 openqaworker5 systemd[1]: network-online.target: Found dependency on wickedd.service/start
Mai 01 03:38:18 openqaworker5 systemd[1]: network-online.target: Found dependency on wickedd-auto4.service/start
Mai 01 03:38:18 openqaworker5 systemd[1]: network-online.target: Found dependency on dbus.service/verify-active
Mai 01 03:38:18 openqaworker5 systemd[1]: network-online.target: Found dependency on basic.target/start
Mai 01 03:38:18 openqaworker5 systemd[1]: network-online.target: Found dependency on sockets.target/start
Mai 01 03:38:18 openqaworker5 systemd[1]: network-online.target: Found dependency on xvnc.socket/start
Mai 01 03:38:18 openqaworker5 systemd[1]: network-online.target: Found dependency on YaST2-Firstboot.service/start
Mai 01 03:38:18 openqaworker5 systemd[1]: network-online.target: Found dependency on YaST2-Second-Stage.service/start
Mai 01 03:38:18 openqaworker5 systemd[1]: network-online.target: Found dependency on systemd-user-sessions.service/start
Mai 01 03:38:18 openqaworker5 systemd[1]: network-online.target: Found dependency on remote-fs.target/start
Mai 01 03:38:18 openqaworker5 systemd[1]: network-online.target: Found dependency on iscsi.service/start
Mai 01 03:38:18 openqaworker5 systemd[1]: network-online.target: Found dependency on network-online.target/start
Mai 01 03:38:18 openqaworker5 systemd[1]: network-online.target: Job wicked.service/start deleted to break ordering cycle starting with network-online.target/start
Mai 02 11:25:52 openqaworker5 systemd[1]: Starting wicked managed network interfaces...
Mai 02 11:26:22 openqaworker5 wicked[3708]: lo up
Mai 02 11:26:22 openqaworker5 wicked[3708]: eth0 device-not-running
Mai 02 11:26:22 openqaworker5 wicked[3708]: ovs-system device-not-running
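To spot such an ordering cycle without waiting for another failed boot, one could for example (a hedged sketch):
# search the current boot's journal for detected ordering cycles
sudo journalctl -b | grep -i 'ordering cycle'
# list what systemd wants to start before a given unit
systemctl list-dependencies --before network-online.target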
Updated by openqa_review over 2 years ago
- Due date set to 2022-05-18
Setting due date based on mean cycle time of SUSE QE Tools
Updated by livdywan over 2 years ago
- This looks to be a dependency issue / race condition
- Can we reduce the prio to High?
- Yes, it is not happening everywhere; not sure how to even reproduce it
- It seems related to the overarching multi-machine setup / how we manage the network
- Not really; none of those units were part of the dependency cycle.
Updated by mkittler over 2 years ago
We're not the only ones with this kind of issue (e.g. https://bugs.launchpad.net/ubuntu/+source/nbd/+bug/1487679, https://forum.siduction.org/index.php?topic=7993.0, https://bugzilla.redhat.com/show_bug.cgi?id=1600823). There's even a SUSE-specific issue (https://www.suse.com/zh-cn/support/kb/doc/?id=000018796), but we are not in that particular situation. (We have an NFS mount but there are no nested mounts within it.)
I'm still trying to figure out which of the services in "our" cycle is the culprit.
Updated by mkittler over 2 years ago
TLDR: I suspect the culprit is Before=xvnc.socket in YaST2-Firstboot.service/YaST2-Second-Stage.service. Note that this is also the only significant difference to my TW system, where I reboot very often and never run into this issue.
This is the list of dependencies systemd walked through until it detected the cycle (same order as in the journal entry):
1. network-online.target: systemd's starting point
2. wicked.service/start because `Before=network-online.target`, makes sense
3. wickedd.service/start because `Before=wicked.service`, makes sense
4. wickedd-auto4.service/start because `Before=wickedd.service wicked.service`, makes sense
5. dbus.service/verify-active because `systemctl list-dependencies --before dbus.service` shows `wickedd.service` (and other wicked services)
but why? maybe because it requires a socket unit (`Requires=dbus.socket`) which in turn requires any of the previous points
it is the same on my TW system (except that the predecessor is `NetworkManager.service`)
6. basic.target/start because `systemctl list-dependencies --before basic.target` shows `wickedd.service` (and other wicked services)
but why? it is *not* due to the NFS mount because it is the same on my TW system (except that the predecessor is `NetworkManager.service`)
7. sockets.target/start because `systemctl list-dependencies --before sockets.target` shows `basic.target`
but why?
8. xvnc.socket/start because `systemctl list-dependencies --before xvnc.socket` shows `sockets.target`
presumably due to `WantedBy=sockets.target`
9. YaST2-Firstboot.service/start because `Before=xvnc.socket` (8.), does it make sense?
10. YaST2-Second-Stage.service/start because `Before=xvnc.socket` (8.), does it make sense?
11. systemd-user-sessions.service/start because `systemctl list-dependencies --before systemd-user-sessions.service` shows `YaST2-Second-Stage.service`
due to `After=systemd-user-sessions.service` in YaST2-Second-Stage.service
12. remote-fs.target/start because `systemctl list-dependencies --before remote-fs.target` shows `systemd-user-sessions.service`
due to `After=remote-fs.target` in systemd-user-sessions.service
13. iscsi.service/start because `Before=remote-fs.target`
14. network-online.target/start because `After=network-online.target` in iscsi.service
- I assume the list is sorted so that a unit further down the list needs to be started before a unit further up in the list.
- I guess the most interesting question then is why remote-fs.target is required by any of the units further up in the list, because it is of course wrong to require this target to be started before network-online.target (while it makes sense that remote-fs.target itself requires to be started after network-online.target).
- I've been comparing this to my TW system (which I reboot daily and sometimes more often but never experienced that problem). It looks very similar except that the wicked services are replaced by NetworkManager.service. There are other differences, though:
  - The YaST2 services (9. and 10.) do not have the ordering Before=xvnc.socket.
  - Actually there are no further differences. I assume it is really the YaST2 services which make the difference.
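The suspected ordering could be confirmed directly on the worker, for instance (a sketch; exact output not captured here):
# print the ordering directives of the YaST2 services
systemctl cat YaST2-Firstboot.service YaST2-Second-Stage.service | grep -E '^(Before|After)='
# show which units xvnc.socket is ordered after (should list the YaST2 services if they declare Before=xvnc.socket)
systemctl show -p After --value xvnc.socket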
Considering
martchus@openqaworker5:~> systemctl list-dependencies --reverse var-lib-openqa-share.mount
var-lib-openqa-share.mount
● └─openqa-worker.target
martchus@openqaworker5:~> systemctl list-dependencies --after var-lib-openqa-share.mount
var-lib-openqa-share.mount
● ├─-.mount
● ├─openqa_nvme_prepare.service
● ├─system.slice
● ├─systemd-journald.socket
● ├─var-lib-openqa-share.automount
● ├─var-lib-openqa.mount
● ├─network-online.target
● │ ├─wicked.service
● │ └─network.target
● │ ├─openvswitch.service
● │ ├─ovs-vswitchd.service
● │ ├─ovsdb-server.service
● │ ├─wicked.service
● │ ├─wickedd-auto4.service
● │ ├─wickedd-dhcp4.service
● │ ├─wickedd-dhcp6.service
● │ ├─wickedd-nanny.service
● │ ├─wickedd.service
● │ └─network-pre.target
● │ └─firewalld.service
● ├─network.target
● │ ├─openvswitch.service
● │ ├─ovs-vswitchd.service
● │ ├─ovsdb-server.service
● │ ├─wicked.service
● │ ├─wickedd-auto4.service
● │ ├─wickedd-dhcp4.service
● │ ├─wickedd-dhcp6.service
● │ ├─wickedd-nanny.service
● │ ├─wickedd.service
● │ └─network-pre.target
● │ └─firewalld.service
● └─remote-fs-pre.target
● ├─dracut-initqueue.service
● ├─iscsid.service
● └─iscsiuio.service
the NFS mount is only required by the worker and the ordering doesn't look cyclic. So I don't think the NFS mount is the culprit.
Updated by mkittler over 2 years ago
If the conclusion from my last comment is right, the problematic package is yast2-installation, developed at https://github.com/yast/yast-installation. I found the PR https://github.com/yast/yast-installation/pull/1041 in that repository and it indeed looks like our issue. (Except that a different unit in the cycle is deleted in their case, but I assume the concrete unit which is deleted to break the cycle is a random pick.) Note that I hadn't found this issue when Googling for the error message because it only contains the error message in the form of a screenshot. (If I had found it, that would have saved me a lot of my own investigation.)
So we need to get yast2-installation updated on our system or fix it via a drop-in config.
Updated by okurz over 2 years ago
mkittler wrote:
So we need to get yast2-installation updated on our system or fix it via a drop-in config.
To receive the proper fix would likely mean waiting for https://build.suse.de/request/show/270972 to be accepted into SLE and then forwarded to Leap. As a workaround you could try the drop-in, or actually roll back yast2-installation to an earlier, unaffected version and lock the package until the SLE update is available.
Updated by mkittler over 2 years ago
- Status changed from In Progress to Feedback
- Priority changed from Urgent to Normal
Considering the change has already been picked onto the SLE-15-SP3 branch (see https://github.com/yast/yast-installation/blob/SLE-15-SP3/package/YaST2-Firstboot.service), I suppose all we have to do is wait to receive the change.
Note that using a drop-in config to fix the dependency only on the systemd level is likely not the best idea because the proper revert also contains other changes. So I've been downgrading our hosts with:
sudo salt '*' cmd.run 'zypper --non-interactive --no-refresh in --oldpackage http://download.opensuse.org/distribution/leap/15.3/repo/oss/noarch/yast2-installation-4.3.37-1.1.noarch.rpm && zypper addlock yast2-installation'
Since the issue has been worked around, I'm lowering the priority.
So I'm waiting for a new version (it should likely show up on http://download.opensuse.org/update/leap/15.3/oss/noarch) and, if it looks good, I'll remove the lock again.
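A hedged sketch of how the arrival of the fixed package could be checked and the workaround reverted once it is there:
# check which yast2-installation versions the repos currently offer
zypper se -s yast2-installation
# once a fixed version is available, drop the lock and update
sudo zypper removelock yast2-installation && sudo zypper --non-interactive up yast2-installation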
Updated by mkittler over 2 years ago
I've just tried upgrading but still got the broken yast2-installation-4.3.50-150300.3.27.1.noarch.
Updated by mkittler over 2 years ago
Apparently we're waiting for https://smelt.suse.de/request/272018.
Updated by mkittler over 2 years ago
- Status changed from Feedback to Blocked
Still no progress. Setting to "Blocked" because the issue is blocked by https://smelt.suse.de/request/272018.
Updated by livdywan over 2 years ago
- Due date changed from 2022-05-18 to 2022-05-27
mkittler wrote:
Still no progress. Setting to "Blocked" because the issue is blocked by https://smelt.suse.de/request/272018.
Ondřej offered to help with that. Either way I'll bump the due date since this is pending the review.
Updated by mkittler over 2 years ago
- Status changed from Blocked to Resolved
The request has been accepted and 4.3.52-150300.3.30.1 is available. It looks good on openqaworker5, so I've removed the locks on all other hosts as well and consider this done.
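For reference, removing the lock fleet-wide could look like this (a sketch mirroring the earlier salt command, not the exact invocation used):
sudo salt '*' cmd.run 'zypper removelock yast2-installation && zypper --non-interactive --no-refresh up yast2-installation'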