action #107437

closed

[alert] Recurring "no data" alerts with only few minutes of outages since SUSE Nbg QA labs move size:M

Added by okurz over 2 years ago. Updated over 2 years ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version:
Start date: 2022-02-23
Due date:
% Done: 0%
Estimated time:

Description

Observation

Since the SUSE Nbg QA labs move I am receiving multiple "no data" alert emails that resolve themselves shortly afterwards. At first I suspected our own maintenance work, e.g. when cabling was actually being changed, but by now I think there is another recurring problem, as I doubt that at the times I saw the alerts anyone was working on the network, switches or configuration.

Suggestions

  • Crosscheck network bandwidth between machines in different locations to find out whether monitor.qa.suse.de can receive data with sufficient bandwidth (see the sketch below)
  • Crosscheck monitoring data from the switches for anything excessive
  • Take a look into the logs on monitor.qa for reported problems with receiving data, e.g. in influxdb
  • Take a look into the logs on osd or the workers for telegraf problems writing to monitor.qa and influxdb
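
A minimal sketch for the bandwidth crosscheck suggested above, assuming iperf3 is available (or installable) on both ends; port, duration and stream count are arbitrary choices:

# on monitor.qa.suse.de: start a throughput test server
iperf3 -s -p 5201
# on osd or a worker in another location: measure throughput towards the monitor host
iperf3 -c monitor.qa.suse.de -p 5201 -t 10 -P 4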

journalctl -u telegraf on osd lists:

Feb 24 11:45:15 openqa telegraf[13914]: 2022-02-24T10:45:15Z E! [outputs.influxdb] when writing to [http://openqa-monitor.qa.suse.de:8086]: Post "http://openqa-monitor.qa.suse.de:8086/write?db=telegraf": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Feb 24 11:45:15 openqa telegraf[13914]: 2022-02-24T10:45:15Z E! [agent] Error writing to outputs.influxdb: could not write any address
Feb 24 11:45:20 openqa telegraf[13914]: 2022-02-24T10:45:20Z W! [outputs.influxdb] Metric buffer overflow; 259 metrics have been dropped
Feb 24 11:45:25 openqa telegraf[13914]: 2022-02-24T10:45:25Z E! [outputs.influxdb] when writing to [http://openqa-monitor.qa.suse.de:8086]: Post "http://openqa-monitor.qa.suse.de:8086/write?db=telegraf": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Feb 24 11:45:25 openqa telegraf[13914]: 2022-02-24T10:45:25Z E! [agent] Error writing to outputs.influxdb: could not write any address
Feb 24 11:45:25 openqa telegraf[13914]: 2022-02-24T10:45:25Z W! [outputs.influxdb] Metric buffer overflow; 123 metrics have been dropped
Feb 24 11:45:30 openqa telegraf[13914]: 2022-02-24T10:45:30Z E! [outputs.influxdb] when writing to [http://openqa-monitor.qa.suse.de:8086]: Post "http://openqa-monitor.qa.suse.de:8086/write?db=telegraf": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Feb 24 11:45:30 openqa telegraf[13914]: 2022-02-24T10:45:30Z E! [agent] Error writing to outputs.influxdb: could not write any address

Related issues: 4 (1 open, 3 closed)

Related to openQA Infrastructure - action #102650: Organize labs move to new building and SRV2 size:M (Resolved, nicksinger, 2021-11-18 - 2022-05-27)
Related to openQA Infrastructure - action #107257: [alert][osd] Apache Response Time alert size:M (Resolved, okurz, 2022-02-22)
Related to openQA Infrastructure - action #107515: [Alerting] web UI: Too many Minion job failures alert size:S (Resolved, mkittler, 2022-02-24)
Related to openQA Infrastructure - action #108266: grenache: script_run() commands randomly time out since server room move (New, 2022-03-14)

Actions #1

Updated by okurz over 2 years ago

I am changing more alerts to not alert on "no data": https://gitlab.suse.de/okurz/salt-states-openqa/-/merge_requests/3 . This won't address the original problem though. To me it seems like the network connection between OSD and the new location of monitor.qa.suse.de within the SUSE Nbg SRV2 server room might suffer from reliability problems.
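
For reference, a hedged sketch of the kind of change this involves, assuming the alerts are defined as Grafana classic panel alerts in JSON (the exact files and surrounding fields in salt-states-openqa may differ): switching noDataState away from "no_data" so that short gaps in incoming metrics keep the previous alert state instead of firing:

"noDataState": "keep_state"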

Actions #2

Updated by okurz over 2 years ago

  • Related to action #102650: Organize labs move to new building and SRV2 size:M added
Actions #3

Updated by okurz over 2 years ago

  • Related to action #107257: [alert][osd] Apache Response Time alert size:M added
Actions #4

Updated by nicksinger over 2 years ago

I ran mtr yesterday from OSD to openqa-monitor.qa.suse.de and saw a very small packet loss of 0.3% there. I don't think this should cause the problem. However, the telegraf logs on OSD show:

Feb 24 08:54:42 openqa telegraf[13914]: 2022-02-24T07:54:42Z E! [outputs.influxdb] when writing to [http://openqa-monitor.qa.suse.de:8086]: Post "http://openqa-monitor.qa.suse.de:8086/write?db=telegraf": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Feb 24 08:54:42 openqa telegraf[13914]: 2022-02-24T07:54:42Z E! [agent] Error writing to outputs.influxdb: could not write any address

From all I can find (and what the error message also suggests) this is most likely due to some network problem: https://github.com/influxdata/telegraf/issues/10566#issuecomment-1027974951
But I cannot make sense of what causes it. It started after the move, yet nothing obviously broke. The VM for monitor is not particularly busy, packet loss looks good and DNS resolution from OSD also looks good with 356 ms max. I used the following command to test this:

for i in {1..1000}; do dig monitor.qa.suse.de | grep -i "Query time"; done | cut -d ":" -f 2 | cut -d " " -f 2 | sort -h
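
The packet-loss observation above came from mtr; a minimal sketch of such a check (report mode, 100 cycles is an arbitrary choice):

mtr --report --report-cycles 100 openqa-monitor.qa.suse.de
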
Actions #5

Updated by okurz over 2 years ago

  • Status changed from New to In Progress
  • Assignee set to okurz
Actions #6

Updated by okurz over 2 years ago

  • Related to action #107515: [Alerting] web UI: Too many Minion job failures alert size:S added
Actions #7

Updated by okurz over 2 years ago

  • Subject changed from [alert] Recurring "no data" alerts with only few minutes of outages since SUSE Nbg QA labs move to [alert] Recurring "no data" alerts with only few minutes of outages since SUSE Nbg QA labs move size:M
  • Description updated (diff)
  • Status changed from In Progress to Workable
  • Assignee deleted (okurz)
Actions #8

Updated by okurz over 2 years ago

  • Priority changed from High to Urgent
Actions #9

Updated by okurz over 2 years ago

Running on OSD:

while true; do dd bs=100M count=20 if=/dev/zero | nc -l 42420; done

and a test against multiple endpoints:

for i in backup.qa qanet.qa monitor.qa openqaworker13 qa-power8-5-kvm.qa root@seth-1.qa ; do ssh $i "echo \"### $i\" && timeout 3 nc openqa.suse.de 42420 | dd of=/dev/null" ;done

reveals that qamaster (which also hosts monitor.qa) seems to be slow, while not all machines connected to the network switch qanet15nue are affected:

### backup.qa
672+151 records in
777+1 records out
398200 bytes (398 kB, 389 KiB) copied, 3.00182 s, 133 kB/s
Welcome to qanet - DHCP/DNS server for vlan 12
### qanet.qa
608884+10604 records in
612084+1 records out
313387448 bytes (313 MB) copied, 3.00127 s, 104 MB/s
### monitor.qa
497+108 records in
574+1 records out
293944 bytes (294 kB, 287 KiB) copied, 3.00081 s, 98.0 kB/s
### openqaworker13
469280+11624 records in
474425+1 records out
242905720 bytes (243 MB, 232 MiB) copied, 3.00046 s, 81.0 MB/s
### qa-power8-5-kvm.qa
641934+45647 records in
648489+1 records out
332026752 bytes (332 MB, 317 MiB) copied, 3.00025 s, 111 MB/s
### root@seth-1.qa
432038+10003 records in
435561+1 records out
223007600 bytes (223 MB, 213 MiB) copied, 3.00018 s, 74.3 MB/s

Also qamaster itself is slow.
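
To make the ad-hoc measurement above easier to repeat, the per-host rate could be extracted from the dd summary, e.g. (a sketch using the same hosts and port as above; dd prints its rate on stderr, hence the 2>&1):

for i in backup.qa qanet.qa monitor.qa openqaworker13 qa-power8-5-kvm.qa root@seth-1.qa ; do echo -n "$i: "; ssh "$i" 'timeout 3 nc openqa.suse.de 42420 | dd of=/dev/null' 2>&1 | grep -o '[0-9.]\+ [kMG]B/s'; done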

Actions #10

Updated by okurz over 2 years ago

  • Status changed from Workable to In Progress
  • Assignee set to okurz

Investigated further with nsinger:

w13->grenache is fast in both directions. Until now this points to a problem with qamaster. Hm, seth-1.qa is also fine. And now osd->qanet is fine. So it looks like only qamaster is problematic right now. I already reproduced with qamaster itself, backup.qa and monitor.qa, so it is not just the VMs. According to qanet15nue I find the MAC of qamaster on gi1. The interface status on qanet15nue with show interfaces status GE 1 looks fine. nsinger ran test cable-diagnostics tdr interface GE 1. Now I can't ping the machine anymore at all.
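
For reference, the switch-side commands used in this session on the CLI of qanet15nue (commands as quoted above; annotations added here):

show interfaces status GE 1                  (link state, speed and duplex of the port where qamaster's MAC was learned)
test cable-diagnostics tdr interface GE 1    (run the TDR cable test; the loss of connectivity was noticed right after this)
show cable-diagnostics tdr interface gi1     (read back the TDR result afterwards)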

Actions #11

Updated by okurz over 2 years ago

Nick Singer Yes we connected a new cable and a monitor now
Oliver Kurz you mean a new patch cable?
Nick Singer yes
Nick Singer now the throughput looks also perfectly fine
Nick Singer so either it was port 1 on the switch or the cable itself
Oliver Kurz Is it still on port 1?
Nick Singer no, 14
Oliver Kurz would you like to crosscheck? can you connect back to 1?
Nick Singer jup give me a sec
Nick Singer apparently I killed port 1
Oliver Kurz ping works on gi14 within a second after I see the link up on the switch, but not on gi1
Nick Singer now on port 14 with old cable
Oliver Kurz oh, ok. so not the cable, the port. I see. Can you connect something else to gi1? Maybe the "cable diagnostic mode" is still on for gi1?

my bandwidth check yields 110 MB/s (110 MB/s × 8 ≈ 0.88 GBit/s), as expected for a 1 GBit/s connection minus overhead.

We tried different approaches to access the management interface. nsinger, okurz, mkittler, jbaier all failed with ipmitool, maybe it is misconfigured or disabled in the BMC. https://www.supermicro.com/support/faqs/faq.cfm?faq=28752 suggests "Please try to do the factory default. 1) Log in to Web GUI. 2) Go to Maintenance >> BMC Restore Factory Defaults >> Click on Restore Factory Defaults". That is something we can try later. We have also tried "IPMIView" and the ipmitool from https://www.supermicro.com/en/solutions/management-software/ipmi-utilities but no joy.
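
A sketch of the kind of ipmitool invocation that was attempted; the BMC address, user and password below are placeholders, not the real values:

ipmitool -I lanplus -H <qamaster-bmc-address> -U <user> -P <password> chassis power status
ipmitool -I lanplus -H <qamaster-bmc-address> -U <user> -P <password> sol activate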

Oliver Kurz show cable-diagnostics tdr interface gi1 says "no cable", for "gi2" it says "not tested". Can you connect a cable again please?
Oliver Kurz actually it should work with just a cable in gi1 with loose end

According to https://community.cisco.com/t5/switching/cable-diagnostics-tdr-not-completed/td-p/3035913 the cable diagnostics possibly could not be completed and might be tried again now that I have called show cable-diagnostics tdr interface gi1.

Actions #12

Updated by okurz over 2 years ago

  • Status changed from In Progress to Feedback
  • Priority changed from Urgent to High

https://monitor.qa.suse.de/ looks good again. Currently no failing tests.

Actions #13

Updated by okurz over 2 years ago

  • Status changed from Feedback to Resolved

All looks normal again. Potential future improvements, out of scope here: crosscheck gi1 on qanet15nue and reset the BMC to allow IPMI access to qamaster, but that is not necessarily something for SUSE QE Tools.

Actions #14

Updated by MDoucha over 2 years ago

  • Related to action #108266: grenache: script_run() commands randomly time out since server room move added