action #107437

Updated by okurz about 2 years ago

## Observation 

Since the QA labs move I have been receiving multiple alert emails about "no data" that resolve themselves shortly afterwards. At first I suspected our maintenance work, for example when we were actually changing the cabling. By now I think there is a different, recurring problem, because I doubt that someone was working on the network, the switches or the configuration every time I have seen the alert.

## Suggestions
* Cross-check the network bandwidth between machines in different locations to find out whether monitor.qa.suse.de can receive data with sufficient bandwidth
* Cross-check the monitoring data from the switches for anything excessive
* Check the logs on monitor.qa for reported problems with receiving data, possibly related to influxdb
* Check the logs on osd or the workers for problems telegraf has writing to monitor.qa and influxdb
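
As a rough first check for the reachability suggestions above, the following is a minimal sketch that times a single HTTP request against the InfluxDB endpoint telegraf writes to. It assumes InfluxDB 1.x, whose `/ping` endpoint answers with `204 No Content`. The function name `probe` and the default timeout are made up for illustration, and the sketch measures request latency only, not bandwidth.

```python
import http.client
import time
import urllib.request


def probe(url, timeout=5.0):
    """Send one GET request and return (ok, elapsed_seconds).

    ok is True only if a 2xx response arrived within the timeout.
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            ok = 200 <= resp.status < 300
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError, timeouts and refused connections are OSError subclasses
        ok = False
    return ok, time.monotonic() - start
```

For example, calling `probe("http://openqa-monitor.qa.suse.de:8086/ping")` from osd and from a worker in another location could show whether slow or failed responses depend on where the request originates.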

 `journalctl -u telegraf` on osd lists: 

 ``` 
 Feb 24 11:45:15 openqa telegraf[13914]: 2022-02-24T10:45:15Z E! [outputs.influxdb] when writing to [http://openqa-monitor.qa.suse.de:8086]: Post "http://openqa-monitor.qa.suse.de:8086/write?db=telegraf": context deadline exceeded (Client.Timeout exceeded while awaiting headers) 
 Feb 24 11:45:15 openqa telegraf[13914]: 2022-02-24T10:45:15Z E! [agent] Error writing to outputs.influxdb: could not write any address 
 Feb 24 11:45:20 openqa telegraf[13914]: 2022-02-24T10:45:20Z W! [outputs.influxdb] Metric buffer overflow; 259 metrics have been dropped 
 Feb 24 11:45:25 openqa telegraf[13914]: 2022-02-24T10:45:25Z E! [outputs.influxdb] when writing to [http://openqa-monitor.qa.suse.de:8086]: Post "http://openqa-monitor.qa.suse.de:8086/write?db=telegraf": context deadline exceeded (Client.Timeout exceeded while awaiting headers) 
 Feb 24 11:45:25 openqa telegraf[13914]: 2022-02-24T10:45:25Z E! [agent] Error writing to outputs.influxdb: could not write any address 
 Feb 24 11:45:25 openqa telegraf[13914]: 2022-02-24T10:45:25Z W! [outputs.influxdb] Metric buffer overflow; 123 metrics have been dropped 
 Feb 24 11:45:30 openqa telegraf[13914]: 2022-02-24T10:45:30Z E! [outputs.influxdb] when writing to [http://openqa-monitor.qa.suse.de:8086]: Post "http://openqa-monitor.qa.suse.de:8086/write?db=telegraf": context deadline exceeded (Client.Timeout exceeded while awaiting headers) 
 Feb 24 11:45:30 openqa telegraf[13914]: 2022-02-24T10:45:30Z E! [agent] Error writing to outputs.influxdb: could not write any address 
 ```
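
To quantify how often this happens, the log lines above could be summarized with a small script. The following is only a sketch: the regular expressions are derived from the message types shown in this excerpt and may need adjusting for other telegraf versions, and the function name `summarize` is hypothetical.

```python
import re
from collections import Counter

# Patterns derived from the journal excerpt in this ticket (assumption:
# other telegraf versions may word these messages differently).
TIMEOUT_RE = re.compile(
    r"E! \[outputs\.influxdb\] when writing to \[(?P<url>[^\]]+)\]: "
    r".*context deadline exceeded"
)
DROPPED_RE = re.compile(
    r"W! \[outputs\.influxdb\] Metric buffer overflow; "
    r"(?P<count>\d+) metrics have been dropped"
)


def summarize(lines):
    """Return (timeouts per target URL, total number of dropped metrics)."""
    timeouts = Counter()
    dropped = 0
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            timeouts[match.group("url")] += 1
            continue
        match = DROPPED_RE.search(line)
        if match:
            dropped += int(match.group("count"))
    return timeouts, dropped
```

Because `summarize` accepts any iterable of lines, a script could pass it `sys.stdin` and be fed the output of `journalctl -u telegraf` through a pipe.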
