action #107437
[alert] Recurring "no data" alerts with only few minutes of outages since SUSE Nbg QA labs move size:M
Description
Observation
Since the QA labs move I have been receiving multiple "no data" alert emails that resolve themselves shortly afterwards. At first I suspected our own maintenance work, e.g. while cabling was actually being changed, but by now I think there is another recurring problem: at the times I have seen the alerts I doubt anyone was working on the network, the switches or the configuration.
Suggestions
- Crosscheck the network bandwidth between different machines in different locations to find out whether monitor.qa.suse.de can receive data with sufficient bandwidth
- Crosscheck the monitoring data from the switches for anything excessive
- Take a look into the logs on monitor.qa for problems reported about receiving data, e.g. in influxdb
- Take a look into the logs on osd or the workers for telegraf problems writing to monitor.qa and influxdb
On osd, journalctl -u telegraf lists:
Feb 24 11:45:15 openqa telegraf[13914]: 2022-02-24T10:45:15Z E! [outputs.influxdb] when writing to [http://openqa-monitor.qa.suse.de:8086]: Post "http://openqa-monitor.qa.suse.de:8086/write?db=telegraf": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Feb 24 11:45:15 openqa telegraf[13914]: 2022-02-24T10:45:15Z E! [agent] Error writing to outputs.influxdb: could not write any address
Feb 24 11:45:20 openqa telegraf[13914]: 2022-02-24T10:45:20Z W! [outputs.influxdb] Metric buffer overflow; 259 metrics have been dropped
Feb 24 11:45:25 openqa telegraf[13914]: 2022-02-24T10:45:25Z E! [outputs.influxdb] when writing to [http://openqa-monitor.qa.suse.de:8086]: Post "http://openqa-monitor.qa.suse.de:8086/write?db=telegraf": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Feb 24 11:45:25 openqa telegraf[13914]: 2022-02-24T10:45:25Z E! [agent] Error writing to outputs.influxdb: could not write any address
Feb 24 11:45:25 openqa telegraf[13914]: 2022-02-24T10:45:25Z W! [outputs.influxdb] Metric buffer overflow; 123 metrics have been dropped
Feb 24 11:45:30 openqa telegraf[13914]: 2022-02-24T10:45:30Z E! [outputs.influxdb] when writing to [http://openqa-monitor.qa.suse.de:8086]: Post "http://openqa-monitor.qa.suse.de:8086/write?db=telegraf": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Feb 24 11:45:30 openqa telegraf[13914]: 2022-02-24T10:45:30Z E! [agent] Error writing to outputs.influxdb: could not write any address
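To narrow down whether those timeouts come from InfluxDB itself or from the network path, one could time a manual write against the same endpoint that telegraf uses. This is only a sketch: the measurement name net_test is made up and the command writes a throwaway point into the telegraf database.
time curl -sS -o /dev/null -w "%{http_code}\n" -X POST "http://openqa-monitor.qa.suse.de:8086/write?db=telegraf" --data-binary "net_test,host=osd value=1"
If this returns HTTP 204 in well under a second even while the alerts fire, the database itself is likely fine and the problem sits on the network path between OSD and monitor.qa.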
Related issues
History
#1
Updated by okurz 4 months ago
I am changing more alerts to not alert on "no data": https://gitlab.suse.de/okurz/salt-states-openqa/-/merge_requests/3 . This won't address the original problem, though. To me it seems like the network connection between OSD and the new location of monitor.qa.suse.de within the SUSE Nbg SRV2 server room might be unreliable.
#2
Updated by okurz 4 months ago
- Related to action #102650: Organize labs move to new building and SRV2 size:M added
#3
Updated by okurz 4 months ago
- Related to action #107257: [alert][osd] Apache Response Time alert size:M added
#4
Updated by nicksinger 4 months ago
I did a mtr yesterday from OSD to openqa-monitor.qa.suse.de and saw a very small packet loss of 0.3% there. I don't think this should cause the problem. However, the telegraf logs on OSD show:
Feb 24 08:54:42 openqa telegraf[13914]: 2022-02-24T07:54:42Z E! [outputs.influxdb] when writing to [http://openqa-monitor.qa.suse.de:8086]: Post "http://openqa-monitor.qa.suse.de:8086/write?db=telegraf": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Feb 24 08:54:42 openqa telegraf[13914]: 2022-02-24T07:54:42Z E! [agent] Error writing to outputs.influxdb: could not write any address
From all I can find (and what the error message also suggests), this is most likely due to some network problem: https://github.com/influxdata/telegraf/issues/10566#issuecomment-1027974951
But I cannot make sense of what causes this. It started after the move but nothing really broke. The VM for monitor is not particularly busy, packet loss looks good, and DNS resolution from OSD also looks good with 356 ms max. I used the following command to test this:
for i in {1..1000}; do dig monitor.qa.suse.de | grep -i "Query time"; done | cut -d ":" -f 2 | cut -d " " -f 2 | sort -h
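A variation of that loop, just as a sketch, that prints the spread directly instead of the full sorted list (same dig-based measurement, only the summarising awk at the end is new):
for i in {1..1000}; do dig +noall +stats monitor.qa.suse.de | awk '/Query time/ {print $4}'; done | sort -n | awk '{v[NR]=$1} END {print "min:", v[1], "ms, median:", v[int((NR+1)/2)], "ms, max:", v[NR], "ms"}'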
#6
Updated by okurz 4 months ago
- Related to action #107515: [Alerting] web UI: Too many Minion job failures alert size:S added
#7
Updated by okurz 4 months ago
- Subject changed from [alert] Recurring "no data" alerts with only few minutes of outages since SUSE Nbg QA labs move to [alert] Recurring "no data" alerts with only few minutes of outages since SUSE Nbg QA labs move size:M
- Description updated (diff)
- Status changed from In Progress to Workable
- Assignee deleted (okurz)
#9
Updated by okurz 4 months ago
Running on OSD
while true; do dd bs=100M count=20 if=/dev/zero | nc -l 42420; done
and a test using multiple endpoints
for i in backup.qa qanet.qa monitor.qa openqaworker13 qa-power8-5-kvm.qa root@seth-1.qa ; do ssh $i "echo \"### $i\" && timeout 3 nc openqa.suse.de 42420 | dd of=/dev/null" ;done
reveals that qamaster (which also hosts monitor.qa) seems to be slow, but not all machines connected to the network switch qanet15nue are affected:
### backup.qa
672+151 records in
777+1 records out
398200 bytes (398 kB, 389 KiB) copied, 3.00182 s, 133 kB/s
Welcome to qanet - DHCP/DNS server for vlan 12
### qanet.qa
608884+10604 records in
612084+1 records out
313387448 bytes (313 MB) copied, 3.00127 s, 104 MB/s
### monitor.qa
497+108 records in
574+1 records out
293944 bytes (294 kB, 287 KiB) copied, 3.00081 s, 98.0 kB/s
### openqaworker13
469280+11624 records in
474425+1 records out
242905720 bytes (243 MB, 232 MiB) copied, 3.00046 s, 81.0 MB/s
### qa-power8-5-kvm.qa
641934+45647 records in
648489+1 records out
332026752 bytes (332 MB, 317 MiB) copied, 3.00025 s, 111 MB/s
### root@seth-1.qa
432038+10003 records in
435561+1 records out
223007600 bytes (223 MB, 213 MiB) copied, 3.00018 s, 74.3 MB/s
Also qamaster itself is slow.
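As a crosscheck that does not depend on dd and nc buffering, the same paths could be measured with iperf3, assuming the package is available on both ends; this is a sketch, not what was actually run here:
# on openqa.suse.de (server side)
iperf3 -s -p 42420
# on the host under test, e.g. monitor.qa
iperf3 -c openqa.suse.de -p 42420 -t 10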
#10
Updated by okurz 4 months ago
- Status changed from Workable to In Progress
- Assignee set to okurz
Investigated further with nsinger:
w13->grenache is fast in both directions, so until now this points to a problem with qamaster. Hm, seth-1.qa is also fine. And now osd->qanet is fine as well. So it looks like only qamaster is problematic right now. I already reproduced with qamaster itself, backup.qa and monitor.qa, so it is not just the VMs. According to qanet15nue I find the MAC of qamaster on gi1. The interface status on qanet15nue with show interfaces status GE 1 looks fine. nsinger ran test cable-diagnostics tdr interface GE 1. Now I can't ping the machine anymore at all.
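A trivial watch loop of the kind one might leave running until the machine answers again; the hostname here is an assumption, replace it with the actual FQDN of qamaster:
until ping -c 1 -W 2 qamaster.qa.suse.de >/dev/null 2>&1; do sleep 5; done; echo "reachable again: $(date)"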
#11
Updated by okurz 4 months ago
Nick Singer Yes we connected a new cable and a monitor now
Oliver Kurz you mean a new patch cable?
Nick Singer yes
Nick Singer now the throughput looks also perfectly fine
Nick Singer so either it was port 1 on the switch or the cable itself
Oliver Kurz Is it still on port 1?
Nick Singer no, 14
Oliver Kurz would you like to crosscheck? can you connect back to 1?
Nick Singer jup give me a sec
Nick Singer apparently I killed port 1
Oliver Kurz ping works on gi14 within a second after I see the link up on the switch, but not on gi1
Nick Singer now on port 14 with old cable
Oliver Kurz oh, ok. so not the cable, the port. I see. Can you connect something else to gi1? Maybe the "cable diagnostic mode" is still on for gi1?
My bandwidth check yields 110 MB/s (~0.88 GBit/s), as expected for a 1 GBit/s connection minus overhead.
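Just to spell out the arithmetic behind that estimate:
echo "$((110 * 8)) Mbit/s"   # 110 MB/s * 8 bit/byte = 880 Mbit/s ≈ 0.88 GBit/s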
We tried different approaches to access the management interface. nsinger, okurz, mkittler, jbaier all failed with ipmitool; maybe it is misconfigured or disabled in the BMC. https://www.supermicro.com/support/faqs/faq.cfm?faq=28752 suggests "Please try to do the factory default. 1) Log in to Web GUI. 2) Go to Maintenance >> BMC Restore Factory Defaults >> Click on Restore Factory Defaults". That is something we can try later. We have also tried "IPMIView" and the ipmitool from https://www.supermicro.com/en/solutions/management-software/ipmi-utilities but no joy.
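For reference, an invocation of the kind that was tried and failed here, with placeholder BMC address and credentials rather than the real values:
ipmitool -I lanplus -H <bmc-address> -U <user> -P <password> chassis power status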
Oliver Kurz show cable-diagnostics tdr interface gi1 says "no cable", for "gi2" it says "not tested". Can you connect a cable again please?
Oliver Kurz actually it should work with just a cable in gi1 with a loose end
According to https://community.cisco.com/t5/switching/cable-diagnostics-tdr-not-completed/td-p/3035913 the cable diagnostics may actually not have completed and could now be retried after I called show cable-diagnostics tdr interface gi1.
#12
Updated by okurz 4 months ago
- Status changed from In Progress to Feedback
- Priority changed from Urgent to High
https://monitor.qa.suse.de/ looks good again. Currently no failing tests.
#14
Updated by MDoucha 3 months ago
- Related to action #108266: grenache: script_run() commands randomly time out since server room move added