action #110497
Updated by livdywan over 2 years ago
## Observation Since early morning the download rate for this worker dropped to zero: https://stats.openqa-monitor.qa.suse.de/d/WDopenqaworker2/worker-dashboard-openqaworker2?tab=alert&viewPanel=65109&orgId=1&from=1651366172156&to=1651391164493 The time frame correlates with the problem on openqaworker5 mentioned in #110494; side note: both servers are in the same rack. Initial investigation shows massive problems with the networking in the mentioned time-frame and then a reboot with more networking issues ``` May 01 03:31:26 openqaworker2 worker[5458]: [error] [pid:5458] REST-API error (POST http://openqa.suse.de/api/v1/jobs/8663359/status): Connection error: Connection refused (remaining tries: 42) May 01 03:31:26 openqaworker2 worker[5539]: [error] [pid:5539] REST-API error (POST http://openqa.suse.de/api/v1/jobs/8663360/status): Connection error: Connection refused (remaining tries: 42) May 01 03:31:30 openqaworker2 systemd[1]: openqa-worker-auto-restart@16.service: State 'stop-sigterm' timed out. Killing. May 01 03:31:30 openqaworker2 systemd[1]: openqa-worker-auto-restart@16.service: Killing process 5539 (worker) with signal SIGKILL. May 01 03:31:30 openqaworker2 systemd[1]: mdcheck_start.service: State 'stop-sigterm' timed out. Killing. May 01 03:31:30 openqaworker2 systemd[1]: mdcheck_start.service: Killing process 26744 (mdcheck) with signal SIGKILL. May 01 03:31:30 openqaworker2 systemd[1]: mdcheck_start.service: Killing process 30336 (sleep) with signal SIGKILL. May 01 03:31:30 openqaworker2 systemd[1]: openqa-worker-auto-restart@10.service: State 'stop-sigterm' timed out. Killing. May 01 03:31:30 openqaworker2 systemd[1]: openqa-worker-auto-restart@10.service: Killing process 23563 (worker) with signal SIGKILL. May 01 03:31:30 openqaworker2 systemd[1]: openqa-worker-auto-restart@9.service: State 'stop-sigterm' timed out. Killing. May 01 03:31:30 openqaworker2 systemd[1]: openqa-worker-auto-restart@9.service: Killing process 5458 (worker) with signal SIGKILL. May 01 03:31:30 openqaworker2 systemd[1]: mdcheck_start.service: Main process exited, code=killed, status=9/KILL May 01 03:31:30 openqaworker2 systemd[1]: mdcheck_start.service: Failed with result 'timeout'. May 01 03:31:30 openqaworker2 systemd[1]: Stopped MD array scrubbing. May 01 03:31:30 openqaworker2 systemd[1]: openqa-worker-auto-restart@9.service: Main process exited, code=killed, status=9/KILL May 01 03:31:30 openqaworker2 systemd[1]: openqa-worker-auto-restart@9.service: Failed with result 'timeout'. May 01 03:31:30 openqaworker2 systemd[1]: Stopped openQA Worker #9. May 01 03:31:30 openqaworker2 systemd[1]: openqa-worker-auto-restart@16.service: Main process exited, code=killed, status=9/KILL May 01 03:31:30 openqaworker2 systemd[1]: openqa-worker-auto-restart@16.service: Failed with result 'timeout'. May 01 03:31:30 openqaworker2 systemd[1]: Stopped openQA Worker #16. May 01 03:31:30 openqaworker2 systemd[1]: openqa-worker-auto-restart@10.service: Main process exited, code=killed, status=9/KILL May 01 03:31:30 openqaworker2 systemd[1]: openqa-worker-auto-restart@10.service: Failed with result 'timeout'. May 01 03:31:30 openqaworker2 systemd[1]: Stopped openQA Worker #10. May 01 03:31:30 openqaworker2 systemd[1]: Removed slice Slice /openqa/worker. May 01 03:31:30 openqaworker2 systemd[1]: Removed slice Slice /openqa. May 01 03:31:30 openqaworker2 systemd[1]: mdcheck_start.timer: Succeeded. May 01 03:31:30 openqaworker2 systemd[1]: Stopped MD array scrubbing. May 01 03:31:30 openqaworker2 systemd[1]: Stopped target Remote File Systems. May 01 03:31:30 openqaworker2 systemd[1]: Stopped target System Time Synchronized. May 01 03:31:30 openqaworker2 systemd[1]: Stopped target System Time Set. May 01 03:31:30 openqaworker2 systemd[1]: Stopping NTP client/server... May 01 03:31:30 openqaworker2 chronyd[2589]: chronyd exiting May 01 03:31:30 openqaworker2 systemd[1]: Stopping Login and scanning of iSCSI devices... May 01 03:31:30 openqaworker2 iscsiadm[30379]: iscsiadm: No matching sessions found May 01 03:31:30 openqaworker2 systemd[1]: chronyd.service: Succeeded. May 01 03:31:30 openqaworker2 systemd[1]: Stopped NTP client/server. May 01 03:31:30 openqaworker2 systemd[1]: Stopped target Host and Network Name Lookups. May 01 03:31:30 openqaworker2 iscsiadm[30382]: iscsiadm: No matching sessions found May 01 03:31:30 openqaworker2 systemd[1]: Stopping Name Service Cache Daemon... May 01 03:31:30 openqaworker2 systemd[1]: iscsi.service: Succeeded. May 01 03:31:30 openqaworker2 systemd[1]: Stopped Login and scanning of iSCSI devices. May 01 03:31:30 openqaworker2 systemd[1]: nscd.service: Succeeded. May 01 03:31:30 openqaworker2 systemd[1]: Stopped Name Service Cache Daemon. May 01 03:31:30 openqaworker2 systemd[1]: Stopped target Network is Online. May 01 03:31:31 openqaworker2 systemd[1]: tgtd.service: State 'stop-sigterm' timed out. Killing. May 01 03:31:32 openqaworker2 kernel: br0: port 1(eth0) entered disabled state May 01 03:31:32 openqaworker2 kernel: device eth0 left promiscuous mode May 01 03:31:32 openqaworker2 kernel: br0: port 1(eth0) entered disabled state May 01 03:31:32 openqaworker2 kernel: device tap0 left promiscuous mode May 01 03:31:32 openqaworker2 kernel: device tap64 left promiscuous mode May 01 03:31:32 openqaworker2 kernel: device tap128 left promiscuous mode May 01 03:31:32 openqaworker2 kernel: device tap1 left promiscuous mode May 01 03:31:32 openqaworker2 kernel: device tap65 left promiscuous mode May 01 03:31:32 openqaworker2 kernel: device tap129 left promiscuous mode May 01 03:31:32 openqaworker2 kernel: device tap2 left promiscuous mode May 01 03:31:32 openqaworker2 kernel: device tap66 left promiscuous mode May 01 03:31:32 openqaworker2 kernel: device tap130 left promiscuous mode May 01 03:31:32 openqaworker2 kernel: device tap3 left promiscuous mode May 01 03:31:32 openqaworker2 kernel: device tap67 left promiscuous mode May 01 03:31:32 openqaworker2 kernel: device tap131 left promiscuous mode May 01 03:31:32 openqaworker2 kernel: device tap4 left promiscuous mode May 01 03:31:32 openqaworker2 kernel: device tap68 left promiscuous mode May 01 03:31:32 openqaworker2 kernel: device tap132 left promiscuous mode May 01 03:31:32 openqaworker2 kernel: device tap5 left promiscuous mode May 01 03:31:31 openqaworker2 wickedd-dhcp4[1563]: br0: Request to release DHCPv4 lease with UUID 17aa6462-1047-0a00-2107-000089000000: rele> May 01 03:31:31 openqaworker2 systemd[1]: tgtd.service: Killing process 2570 (tgtd) with signal SIGKILL. May 01 03:31:32 openqaworker2 kernel: device tap69 left promiscuous mode May 01 03:31:31 openqaworker2 wickedd-dhcp6[1565]: br0: Request to release DHCPv6 lease using UUID 17aa6462-1047-0a00-2107-00008a000000: rel> May 01 03:31:31 openqaworker2 systemd[1]: tgtd.service: Killing process 2571 (tgtd) with signal SIGKILL. May 01 03:31:31 openqaworker2 ovs-vsctl[30609]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl del-port br1 tap0 May 01 03:31:31 openqaworker2 systemd[1]: tgtd.service: Killing process 2680 (tgtd) with signal SIGKILL. May 01 03:31:31 openqaworker2 ovs-vsctl[30613]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl del-port br1 tap64 May 01 03:31:31 openqaworker2 systemd[1]: tgtd.service: Killing process 2684 (n/a) with signal SIGKILL. May 01 03:31:31 openqaworker2 ovs-vsctl[30618]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl del-port br1 tap128 May 01 03:31:31 openqaworker2 systemd[1]: tgtd.service: Main process exited, code=killed, status=9/KILL May 01 03:31:32 openqaworker2 kernel: device tap133 left promiscuous mode May 01 03:31:32 openqaworker2 kernel: device tap6 left promiscuous mode May 01 03:31:32 openqaworker2 kernel: device tap70 left promiscuous mode May 01 03:31:31 openqaworker2 ovs-vsctl[30648]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl del-port br1 tap1 May 01 03:31:31 openqaworker2 systemd[1]: tgtd.service: Failed with result 'timeout'. May 01 03:31:31 openqaworker2 ovs-vsctl[30689]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl del-port br1 tap65 May 01 03:31:31 openqaworker2 systemd[1]: Stopped tgt admin. May 01 03:31:31 openqaworker2 ovs-vsctl[30706]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl del-port br1 tap129 May 01 03:31:31 openqaworker2 systemd[1]: Stopped target Network. May 01 03:31:31 openqaworker2 ovs-vsctl[30746]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl del-port br1 tap2 May 01 03:31:31 openqaworker2 systemd[1]: Stopping wicked managed network interfaces... ... May 01 03:31:36 openqaworker2 firewalld[17694]: ERROR: Calling pre func <bound method Firewall.full_check_config of <class 'firewall.core.fw> May 01 03:31:37 openqaworker2 firewalld[17694]: ERROR: Calling pre func <bound method Firewall.full_check_config of <class 'firewall.core.fw> May 01 03:31:37 openqaworker2 firewalld[17694]: WARNING: NOT_ENABLED: br2 May 01 03:31:38 openqaworker2 firewalld[17694]: ERROR: UNKNOWN_INTERFACE: 'br2' is not in any zone May 01 03:31:38 openqaworker2 firewalld[17694]: WARNING: NOT_ENABLED: br3 ... May 01 03:33:05 openqaworker2 kernel: audit: type=1305 audit(1651368784.572:15253): op=set audit_pid=0 old=1396 auid=4294967295 ses=42949672> May 01 03:33:05 openqaworker2 kernel: audit: type=1131 audit(1651368784.604:15254): pid=1 uid=0 auid=4294967295 ses=4294967295 subj==unconfi> May 01 03:33:05 openqaworker2 kernel: audit: type=1131 audit(1651368784.640:15255): pid=1 uid=0 auid=4294967295 ses=4294967295 subj==unconfi> -- Reboot -- May 01 03:35:52 openqaworker2 kernel: microcode: microcode updated early to revision 0x49, date = 2021-08-11 May 01 03:35:52 openqaworker2 kernel: Linux version 5.3.18-150300.59.63-default (geeko@buildhost) (gcc version 7.5.0 (SUSE Linux)) #1 SMP Tu> May 01 03:35:52 openqaworker2 kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-5.3.18-150300.59.63-default root=UUID=3c729661-cf27-4a69-870d-9> ... May 01 03:37:03 openqaworker2 systemd-udevd[1565]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable. May 01 03:37:03 openqaworker2 systemd-udevd[1565]: br0: Could not generate persistent MAC: No data available May 01 03:37:03 openqaworker2 systemd-udevd[1839]: Using default interface naming scheme 'sle15-sp3'. May 01 03:37:06 openqaworker2 firewalld[1416]: ERROR: Calling pre func <bound method Firewall.full_check_config of <class 'firewall.core.fw.Firewall'>(True, True, True, 'RUNNING', False, 'trusted', {}, [], True, True, True, False, 'off')>(()) failed: check_config_dict() takes 2 positional arguments but 3 were given ... May 01 03:38:01 openqaworker2 worker[2642]: [info] [pid:2642] Registering with openQA openqa.suse.de May 01 03:38:01 openqaworker2 worker[2654]: [warn] [pid:2654] Failed to register at openqa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds May 01 03:38:01 openqaworker2 worker[2650]: [warn] [pid:2650] Failed to register at openqa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds May 01 03:38:01 openqaworker2 worker[2664]: [warn] [pid:2664] Failed to register at openqa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds May 01 03:38:01 openqaworker2 worker[2648]: [info] [pid:2648] Registering with openQA openqa.suse.de May 01 03:38:01 openqaworker2 worker[2658]: [warn] [pid:2658] Failed to register at openqa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds May 01 03:38:01 openqaworker2 worker[2667]: [warn] [pid:2667] Failed to register at openqa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds May 01 03:38:01 openqaworker2 worker[2634]: [warn] [pid:2634] Failed to register at openqa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds May 01 03:38:01 openqaworker2 worker[2641]: [warn] [pid:2641] Failed to register at openqa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds May 01 03:38:01 openqaworker2 worker[2649]: [warn] [pid:2649] Failed to register at openqa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds ``` ## Suggestions * Try to find openQA jobs according to the time frame where a zero or low bandwidth was recorded * The above errors are all within the timeframe where we do a full upgrade and reboot of workers as well as webUI host so maybe we have false positive alerting. * Likely the openQA minion influxdb route actually reports a download speed of true zero if the minion was restarted (e.g. after reboot and system startup) and either no openQA jobs had been running at all or only already cached assets would been used -> Research how the code behaves after minion startup and only using cached assets. * Avoid the reporting of zero if there is no actual download going on (grafana is already configured to keep the last state if there are no new values)