Project

General

Profile

action #110497

Minion influxdb data causing unusual download rates size:M

Added by jbaier_cz about 2 months ago. Updated about 2 months ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Target version:
Start date:
2022-05-01
Due date:
% Done:

0%

Estimated time:
Difficulty:

Description

Observation

Since early morning the download rate for this worker dropped to zero: https://stats.openqa-monitor.qa.suse.de/d/WDopenqaworker2/worker-dashboard-openqaworker2?tab=alert&viewPanel=65109&orgId=1&from=1651366172156&to=1651391164493

The time frame correlates with the problem on openqaworker5 mentioned in #110494; side note: both servers are in the same rack.

Initial investigation shows massive problems with the networking in the mentioned time-frame and then a reboot with more networking issues

May 01 03:31:26 openqaworker2 worker[5458]: [error] [pid:5458] REST-API error (POST http://openqa.suse.de/api/v1/jobs/8663359/status): Connection error: Connection refused (remaining tries: 42)
May 01 03:31:26 openqaworker2 worker[5539]: [error] [pid:5539] REST-API error (POST http://openqa.suse.de/api/v1/jobs/8663360/status): Connection error: Connection refused (remaining tries: 42)
May 01 03:31:30 openqaworker2 systemd[1]: openqa-worker-auto-restart@16.service: State 'stop-sigterm' timed out. Killing.
May 01 03:31:30 openqaworker2 systemd[1]: openqa-worker-auto-restart@16.service: Killing process 5539 (worker) with signal SIGKILL.
May 01 03:31:30 openqaworker2 systemd[1]: mdcheck_start.service: State 'stop-sigterm' timed out. Killing.
May 01 03:31:30 openqaworker2 systemd[1]: mdcheck_start.service: Killing process 26744 (mdcheck) with signal SIGKILL.
May 01 03:31:30 openqaworker2 systemd[1]: mdcheck_start.service: Killing process 30336 (sleep) with signal SIGKILL.
May 01 03:31:30 openqaworker2 systemd[1]: openqa-worker-auto-restart@10.service: State 'stop-sigterm' timed out. Killing.
May 01 03:31:30 openqaworker2 systemd[1]: openqa-worker-auto-restart@10.service: Killing process 23563 (worker) with signal SIGKILL.
May 01 03:31:30 openqaworker2 systemd[1]: openqa-worker-auto-restart@9.service: State 'stop-sigterm' timed out. Killing.
May 01 03:31:30 openqaworker2 systemd[1]: openqa-worker-auto-restart@9.service: Killing process 5458 (worker) with signal SIGKILL.
May 01 03:31:30 openqaworker2 systemd[1]: mdcheck_start.service: Main process exited, code=killed, status=9/KILL
May 01 03:31:30 openqaworker2 systemd[1]: mdcheck_start.service: Failed with result 'timeout'.
May 01 03:31:30 openqaworker2 systemd[1]: Stopped MD array scrubbing.
May 01 03:31:30 openqaworker2 systemd[1]: openqa-worker-auto-restart@9.service: Main process exited, code=killed, status=9/KILL
May 01 03:31:30 openqaworker2 systemd[1]: openqa-worker-auto-restart@9.service: Failed with result 'timeout'.
May 01 03:31:30 openqaworker2 systemd[1]: Stopped openQA Worker #9.
May 01 03:31:30 openqaworker2 systemd[1]: openqa-worker-auto-restart@16.service: Main process exited, code=killed, status=9/KILL
May 01 03:31:30 openqaworker2 systemd[1]: openqa-worker-auto-restart@16.service: Failed with result 'timeout'.
May 01 03:31:30 openqaworker2 systemd[1]: Stopped openQA Worker #16.
May 01 03:31:30 openqaworker2 systemd[1]: openqa-worker-auto-restart@10.service: Main process exited, code=killed, status=9/KILL
May 01 03:31:30 openqaworker2 systemd[1]: openqa-worker-auto-restart@10.service: Failed with result 'timeout'.
May 01 03:31:30 openqaworker2 systemd[1]: Stopped openQA Worker #10.
May 01 03:31:30 openqaworker2 systemd[1]: Removed slice Slice /openqa/worker.
May 01 03:31:30 openqaworker2 systemd[1]: Removed slice Slice /openqa.
May 01 03:31:30 openqaworker2 systemd[1]: mdcheck_start.timer: Succeeded.
May 01 03:31:30 openqaworker2 systemd[1]: Stopped MD array scrubbing.
May 01 03:31:30 openqaworker2 systemd[1]: Stopped target Remote File Systems.
May 01 03:31:30 openqaworker2 systemd[1]: Stopped target System Time Synchronized.
May 01 03:31:30 openqaworker2 systemd[1]: Stopped target System Time Set.
May 01 03:31:30 openqaworker2 systemd[1]: Stopping NTP client/server...
May 01 03:31:30 openqaworker2 chronyd[2589]: chronyd exiting
May 01 03:31:30 openqaworker2 systemd[1]: Stopping Login and scanning of iSCSI devices...
May 01 03:31:30 openqaworker2 iscsiadm[30379]: iscsiadm: No matching sessions found
May 01 03:31:30 openqaworker2 systemd[1]: chronyd.service: Succeeded.
May 01 03:31:30 openqaworker2 systemd[1]: Stopped NTP client/server.
May 01 03:31:30 openqaworker2 systemd[1]: Stopped target Host and Network Name Lookups.
May 01 03:31:30 openqaworker2 iscsiadm[30382]: iscsiadm: No matching sessions found
May 01 03:31:30 openqaworker2 systemd[1]: Stopping Name Service Cache Daemon...
May 01 03:31:30 openqaworker2 systemd[1]: iscsi.service: Succeeded.
May 01 03:31:30 openqaworker2 systemd[1]: Stopped Login and scanning of iSCSI devices.
May 01 03:31:30 openqaworker2 systemd[1]: nscd.service: Succeeded.
May 01 03:31:30 openqaworker2 systemd[1]: Stopped Name Service Cache Daemon.
May 01 03:31:30 openqaworker2 systemd[1]: Stopped target Network is Online.
May 01 03:31:31 openqaworker2 systemd[1]: tgtd.service: State 'stop-sigterm' timed out. Killing.
May 01 03:31:32 openqaworker2 kernel: br0: port 1(eth0) entered disabled state
May 01 03:31:32 openqaworker2 kernel: device eth0 left promiscuous mode
May 01 03:31:32 openqaworker2 kernel: br0: port 1(eth0) entered disabled state
May 01 03:31:32 openqaworker2 kernel: device tap0 left promiscuous mode
May 01 03:31:32 openqaworker2 kernel: device tap64 left promiscuous mode
May 01 03:31:32 openqaworker2 kernel: device tap128 left promiscuous mode
May 01 03:31:32 openqaworker2 kernel: device tap1 left promiscuous mode
May 01 03:31:32 openqaworker2 kernel: device tap65 left promiscuous mode
May 01 03:31:32 openqaworker2 kernel: device tap129 left promiscuous mode
May 01 03:31:32 openqaworker2 kernel: device tap2 left promiscuous mode
May 01 03:31:32 openqaworker2 kernel: device tap66 left promiscuous mode
May 01 03:31:32 openqaworker2 kernel: device tap130 left promiscuous mode
May 01 03:31:32 openqaworker2 kernel: device tap3 left promiscuous mode
May 01 03:31:32 openqaworker2 kernel: device tap67 left promiscuous mode
May 01 03:31:32 openqaworker2 kernel: device tap131 left promiscuous mode
May 01 03:31:32 openqaworker2 kernel: device tap4 left promiscuous mode
May 01 03:31:32 openqaworker2 kernel: device tap68 left promiscuous mode
May 01 03:31:32 openqaworker2 kernel: device tap132 left promiscuous mode
May 01 03:31:32 openqaworker2 kernel: device tap5 left promiscuous mode
May 01 03:31:31 openqaworker2 wickedd-dhcp4[1563]: br0: Request to release DHCPv4 lease with UUID 17aa6462-1047-0a00-2107-000089000000: rele>
May 01 03:31:31 openqaworker2 systemd[1]: tgtd.service: Killing process 2570 (tgtd) with signal SIGKILL.
May 01 03:31:32 openqaworker2 kernel: device tap69 left promiscuous mode
May 01 03:31:31 openqaworker2 wickedd-dhcp6[1565]: br0: Request to release DHCPv6 lease using UUID 17aa6462-1047-0a00-2107-00008a000000: rel>
May 01 03:31:31 openqaworker2 systemd[1]: tgtd.service: Killing process 2571 (tgtd) with signal SIGKILL.
May 01 03:31:31 openqaworker2 ovs-vsctl[30609]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl del-port br1 tap0
May 01 03:31:31 openqaworker2 systemd[1]: tgtd.service: Killing process 2680 (tgtd) with signal SIGKILL.
May 01 03:31:31 openqaworker2 ovs-vsctl[30613]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl del-port br1 tap64
May 01 03:31:31 openqaworker2 systemd[1]: tgtd.service: Killing process 2684 (n/a) with signal SIGKILL.
May 01 03:31:31 openqaworker2 ovs-vsctl[30618]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl del-port br1 tap128
May 01 03:31:31 openqaworker2 systemd[1]: tgtd.service: Main process exited, code=killed, status=9/KILL
May 01 03:31:32 openqaworker2 kernel: device tap133 left promiscuous mode
May 01 03:31:32 openqaworker2 kernel: device tap6 left promiscuous mode
May 01 03:31:32 openqaworker2 kernel: device tap70 left promiscuous mode
May 01 03:31:31 openqaworker2 ovs-vsctl[30648]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl del-port br1 tap1
May 01 03:31:31 openqaworker2 systemd[1]: tgtd.service: Failed with result 'timeout'.
May 01 03:31:31 openqaworker2 ovs-vsctl[30689]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl del-port br1 tap65
May 01 03:31:31 openqaworker2 systemd[1]: Stopped tgt admin.
May 01 03:31:31 openqaworker2 ovs-vsctl[30706]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl del-port br1 tap129
May 01 03:31:31 openqaworker2 systemd[1]: Stopped target Network.
May 01 03:31:31 openqaworker2 ovs-vsctl[30746]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl del-port br1 tap2
May 01 03:31:31 openqaworker2 systemd[1]: Stopping wicked managed network interfaces...
...
May 01 03:31:36 openqaworker2 firewalld[17694]: ERROR: Calling pre func <bound method Firewall.full_check_config of <class 'firewall.core.fw>
May 01 03:31:37 openqaworker2 firewalld[17694]: ERROR: Calling pre func <bound method Firewall.full_check_config of <class 'firewall.core.fw>
May 01 03:31:37 openqaworker2 firewalld[17694]: WARNING: NOT_ENABLED: br2
May 01 03:31:38 openqaworker2 firewalld[17694]: ERROR: UNKNOWN_INTERFACE: 'br2' is not in any zone
May 01 03:31:38 openqaworker2 firewalld[17694]: WARNING: NOT_ENABLED: br3
...
May 01 03:33:05 openqaworker2 kernel: audit: type=1305 audit(1651368784.572:15253): op=set audit_pid=0 old=1396 auid=4294967295 ses=42949672>
May 01 03:33:05 openqaworker2 kernel: audit: type=1131 audit(1651368784.604:15254): pid=1 uid=0 auid=4294967295 ses=4294967295 subj==unconfi>
May 01 03:33:05 openqaworker2 kernel: audit: type=1131 audit(1651368784.640:15255): pid=1 uid=0 auid=4294967295 ses=4294967295 subj==unconfi>
-- Reboot --
May 01 03:35:52 openqaworker2 kernel: microcode: microcode updated early to revision 0x49, date = 2021-08-11
May 01 03:35:52 openqaworker2 kernel: Linux version 5.3.18-150300.59.63-default (geeko@buildhost) (gcc version 7.5.0 (SUSE Linux)) #1 SMP Tu>
May 01 03:35:52 openqaworker2 kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-5.3.18-150300.59.63-default root=UUID=3c729661-cf27-4a69-870d-9>
...
May 01 03:37:03 openqaworker2 systemd-udevd[1565]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
May 01 03:37:03 openqaworker2 systemd-udevd[1565]: br0: Could not generate persistent MAC: No data available
May 01 03:37:03 openqaworker2 systemd-udevd[1839]: Using default interface naming scheme 'sle15-sp3'.
May 01 03:37:06 openqaworker2 firewalld[1416]: ERROR: Calling pre func <bound method Firewall.full_check_config of <class 'firewall.core.fw.Firewall'>(True, True, True, 'RUNNING', False, 'trusted', {}, [], True, True, True, False, 'off')>(()) failed: check_config_dict() takes 2 positional arguments but 3 were given
...
May 01 03:38:01 openqaworker2 worker[2642]: [info] [pid:2642] Registering with openQA openqa.suse.de
May 01 03:38:01 openqaworker2 worker[2654]: [warn] [pid:2654] Failed to register at openqa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds
May 01 03:38:01 openqaworker2 worker[2650]: [warn] [pid:2650] Failed to register at openqa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds
May 01 03:38:01 openqaworker2 worker[2664]: [warn] [pid:2664] Failed to register at openqa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds
May 01 03:38:01 openqaworker2 worker[2648]: [info] [pid:2648] Registering with openQA openqa.suse.de
May 01 03:38:01 openqaworker2 worker[2658]: [warn] [pid:2658] Failed to register at openqa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds
May 01 03:38:01 openqaworker2 worker[2667]: [warn] [pid:2667] Failed to register at openqa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds
May 01 03:38:01 openqaworker2 worker[2634]: [warn] [pid:2634] Failed to register at openqa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds
May 01 03:38:01 openqaworker2 worker[2641]: [warn] [pid:2641] Failed to register at openqa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds
May 01 03:38:01 openqaworker2 worker[2649]: [warn] [pid:2649] Failed to register at openqa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds

Suggestions

  • Try to find openQA jobs according to the time frame where a zero or low bandwidth was recorded
  • The above errors are all within the timeframe where we do a full upgrade and reboot of workers as well as webUI host so maybe we have false positive alerting.
  • Likely the openQA minion influxdb route actually reports a download speed of true zero if the minion was restarted (e.g. after reboot and system startup) and either no openQA jobs had been running at all or only already cached assets would been used -> Research how the code behaves after minion startup and only using cached assets.
  • Avoid the reporting of zero if there is no actual download going on (grafana is already configured to keep the last state if there are no new values)

Related issues

Related to openQA Infrastructure - action #110494: alert: openqaworker5 host up size:MResolved2022-05-012022-05-27

Related to openQA Project - coordination #102951: [epic] Better network performance monitoringResolved2021-11-24

Related to openQA Project - coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetesBlocked2018-06-292022-10-01

History

#1 Updated by jbaier_cz about 2 months ago

  • Related to action #110494: alert: openqaworker5 host up size:M added

#2 Updated by cdywan about 2 months ago

  • Project changed from openQA Infrastructure to openQA Project
  • Subject changed from alert: openqaworker2 unusual download rate to Minion influxdb data causing unusual download rates size:M
  • Description updated (diff)
  • Status changed from New to Workable

#3 Updated by mkittler about 2 months ago

  • Assignee set to mkittler

#4 Updated by mkittler about 2 months ago

  • Status changed from Workable to Feedback

#5 Updated by okurz about 2 months ago

#6 Updated by okurz about 2 months ago

  • Related to coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes added

#7 Updated by okurz about 2 months ago

  • Tags set to reactive work

#8 Updated by mkittler about 2 months ago

The PR has been merged. Let's wait until the next reboot to see how it looks like in Grafana.

#9 Updated by mkittler about 2 months ago

  • Priority changed from Urgent to High

#10 Updated by mkittler about 2 months ago

  • Status changed from Feedback to Resolved

I've checked a few worker dashboards and judging by the state history we haven't gotten any pending/alerting states. So it looks like the issue has been fixed and I'm marking the ticket as resolved. If we see a false alert again nevertheless we can still reopen this ticket.

Also available in: Atom PDF