action #107437
closed
[alert] Recurring "no data" alerts with only a few minutes of outage since SUSE Nbg QA labs move size:M
Start date: 2022-02-23
Due date:
% Done: 0%
Estimated time:
Description
Observation
Since the QA labs move I have been receiving multiple "no data" alert emails that resolve themselves shortly afterwards. At first I suspected our maintenance work, for example changing the cabling, but by now I think there is another recurring problem: I doubt that someone was working on the network, the switches or the configuration every time I saw the alert.
Suggestions
- Cross-check network bandwidth between machines in different locations to find out whether monitor.qa.suse.de can receive data with sufficient bandwidth
- Cross-check monitoring data from the switches for anything excessive
- Check the logs on monitor.qa for reported problems with receiving data, possibly related to influxdb
- Check the logs on osd or the workers for telegraf problems writing to monitor.qa and influxdb
Running journalctl -u telegraf on osd shows:
Feb 24 11:45:15 openqa telegraf[13914]: 2022-02-24T10:45:15Z E! [outputs.influxdb] when writing to [http://openqa-monitor.qa.suse.de:8086]: Post "http://openqa-monitor.qa.suse.de:8086/write?db=telegraf": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Feb 24 11:45:15 openqa telegraf[13914]: 2022-02-24T10:45:15Z E! [agent] Error writing to outputs.influxdb: could not write any address
Feb 24 11:45:20 openqa telegraf[13914]: 2022-02-24T10:45:20Z W! [outputs.influxdb] Metric buffer overflow; 259 metrics have been dropped
Feb 24 11:45:25 openqa telegraf[13914]: 2022-02-24T10:45:25Z E! [outputs.influxdb] when writing to [http://openqa-monitor.qa.suse.de:8086]: Post "http://openqa-monitor.qa.suse.de:8086/write?db=telegraf": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Feb 24 11:45:25 openqa telegraf[13914]: 2022-02-24T10:45:25Z E! [agent] Error writing to outputs.influxdb: could not write any address
Feb 24 11:45:25 openqa telegraf[13914]: 2022-02-24T10:45:25Z W! [outputs.influxdb] Metric buffer overflow; 123 metrics have been dropped
Feb 24 11:45:30 openqa telegraf[13914]: 2022-02-24T10:45:30Z E! [outputs.influxdb] when writing to [http://openqa-monitor.qa.suse.de:8086]: Post "http://openqa-monitor.qa.suse.de:8086/write?db=telegraf": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Feb 24 11:45:30 openqa telegraf[13914]: 2022-02-24T10:45:30Z E! [agent] Error writing to outputs.influxdb: could not write any address
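To get a rough idea of how often telegraf fails to write and how many metrics are lost, the journal output could be summarized with a small script. The following is only a minimal sketch: the function name summarize and the SAMPLE list are made up for illustration, SAMPLE holds the first three log lines quoted above, and the regular expressions assume the line layout seen in this excerpt.

```python
import re
from collections import Counter

# Telegraf level marker (E!, W!, I!, D!) followed by the bracketed component name.
# Assumption: all relevant lines follow the layout seen in the excerpt above.
LINE_RE = re.compile(r"\b([EWID])! \[([^\]]+)\] (.*)$")
DROPPED_RE = re.compile(r"Metric buffer overflow; (\d+) metrics have been dropped")

# First three lines of the journal excerpt from this ticket.
SAMPLE = [
    'Feb 24 11:45:15 openqa telegraf[13914]: 2022-02-24T10:45:15Z E! [outputs.influxdb] when writing to [http://openqa-monitor.qa.suse.de:8086]: Post "http://openqa-monitor.qa.suse.de:8086/write?db=telegraf": context deadline exceeded (Client.Timeout exceeded while awaiting headers)',
    'Feb 24 11:45:15 openqa telegraf[13914]: 2022-02-24T10:45:15Z E! [agent] Error writing to outputs.influxdb: could not write any address',
    'Feb 24 11:45:20 openqa telegraf[13914]: 2022-02-24T10:45:20Z W! [outputs.influxdb] Metric buffer overflow; 259 metrics have been dropped',
]


def summarize(lines):
    """Count errors, warnings and write timeouts, and sum the dropped metrics."""
    levels = Counter()
    timeouts = 0
    dropped = 0
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # not a telegraf log line in the expected layout
        level, _component, message = match.groups()
        levels[level] += 1
        if "context deadline exceeded" in message:
            timeouts += 1
        drop = DROPPED_RE.search(message)
        if drop:
            dropped += int(drop.group(1))
    return {
        "errors": levels["E"],
        "warnings": levels["W"],
        "timeouts": timeouts,
        "dropped_metrics": dropped,
    }


print(summarize(SAMPLE))
```

For a real check the lines would come from the output of journalctl -u telegraf instead of SAMPLE, for example by passing sys.stdin to summarize. Comparing the counts per time window with the alert timestamps might show whether the "no data" alerts coincide with the write timeouts.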