action #107437

closed

[alert] Recurring "no data" alerts with only few minutes of outages since SUSE Nbg QA labs move size:M

Added by okurz almost 3 years ago. Updated almost 3 years ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Start date: 2022-02-23
Due date:
% Done: 0%
Estimated time:

Description

Observation

Since the QA labs move I have been receiving multiple "no data" alert emails that resolve themselves shortly afterwards. At first I suspected our maintenance work, such as changing the cabling, but by now I think there is another recurring problem: I doubt that each time I saw the alert someone was working on the network, the switches or the configuration.

Suggestions

  • Crosscheck network bandwidth between different machines in different locations to find out whether monitor.qa.suse.de can receive data with sufficient bandwidth
  • Crosscheck monitoring data from the switches for anything excessive
  • Look into the logs on monitor.qa for reported problems with receiving data, maybe in influxdb
  • Look into the logs on osd or the workers to see whether telegraf has problems writing to monitor.qa and influxdb
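
A rough first check of the path from osd or a worker to the monitoring host is to time small HTTP requests against the InfluxDB port. The following is only a sketch under assumptions: it assumes the instance at openqa-monitor.qa.suse.de:8086 (the address telegraf writes to) answers the InfluxDB 1.x `/ping` health endpoint, the names `probe`, `sample` and `summarize` are made up for illustration, and it measures request latency and failures, not bandwidth.

```python
import time
import urllib.error
import urllib.request

# Assumption: the InfluxDB 1.x health endpoint is reachable on the same
# host and port that telegraf writes to.
PING_URL = "http://openqa-monitor.qa.suse.de:8086/ping"


def probe(url=PING_URL, timeout=5.0, opener=urllib.request.urlopen):
    """Send one HTTP request and return (ok, seconds).

    ok is False on a timeout, a connection error or a non-2xx status.
    """
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout) as response:
            ok = 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        ok = False
    return ok, time.monotonic() - start


def sample(count, interval, url=PING_URL, timeout=5.0):
    """Run `count` probes, sleeping `interval` seconds after each one."""
    results = []
    for _ in range(count):
        results.append(probe(url, timeout))
        time.sleep(interval)
    return results


def summarize(samples):
    """Condense a list of (ok, seconds) results into a small report."""
    failures = sum(1 for ok, _ in samples if not ok)
    slowest = max((seconds for _, seconds in samples), default=0.0)
    return {"probes": len(samples), "failures": failures, "slowest_s": slowest}
```

Running something like `print(summarize(sample(12, 5.0)))` from several machines in different locations at the same time would show whether failures cluster on one network segment or affect every sender.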

journalctl -u telegraf on osd lists:

Feb 24 11:45:15 openqa telegraf[13914]: 2022-02-24T10:45:15Z E! [outputs.influxdb] when writing to [http://openqa-monitor.qa.suse.de:8086]: Post "http://openqa-monitor.qa.suse.de:8086/write?db=telegraf": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Feb 24 11:45:15 openqa telegraf[13914]: 2022-02-24T10:45:15Z E! [agent] Error writing to outputs.influxdb: could not write any address
Feb 24 11:45:20 openqa telegraf[13914]: 2022-02-24T10:45:20Z W! [outputs.influxdb] Metric buffer overflow; 259 metrics have been dropped
Feb 24 11:45:25 openqa telegraf[13914]: 2022-02-24T10:45:25Z E! [outputs.influxdb] when writing to [http://openqa-monitor.qa.suse.de:8086]: Post "http://openqa-monitor.qa.suse.de:8086/write?db=telegraf": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Feb 24 11:45:25 openqa telegraf[13914]: 2022-02-24T10:45:25Z E! [agent] Error writing to outputs.influxdb: could not write any address
Feb 24 11:45:25 openqa telegraf[13914]: 2022-02-24T10:45:25Z W! [outputs.influxdb] Metric buffer overflow; 123 metrics have been dropped
Feb 24 11:45:30 openqa telegraf[13914]: 2022-02-24T10:45:30Z E! [outputs.influxdb] when writing to [http://openqa-monitor.qa.suse.de:8086]: Post "http://openqa-monitor.qa.suse.de:8086/write?db=telegraf": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Feb 24 11:45:30 openqa telegraf[13914]: 2022-02-24T10:45:30Z E! [agent] Error writing to outputs.influxdb: could not write any address
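
To compare these errors with the times of the "no data" alerts, the journal output can be condensed into per-minute counts. This is a minimal sketch that assumes the log lines keep the format shown above; the function name `tally` and the regular expressions are illustrative and not part of telegraf.

```python
import re
from collections import Counter

# Patterns derived from the log excerpt in this ticket.
TIMEOUT_RE = re.compile(
    r"E! \[outputs\.influxdb\] when writing to .*context deadline exceeded"
)
DROPPED_RE = re.compile(
    r"W! \[outputs\.influxdb\] Metric buffer overflow; (\d+) metrics have been dropped"
)
# Telegraf's own UTC timestamp, e.g. 2022-02-24T10:45:15Z, truncated to the minute.
STAMP_RE = re.compile(r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}):\d{2}Z")


def tally(lines):
    """Count write timeouts and dropped metrics per UTC minute.

    Returns two Counters keyed by 'YYYY-MM-DDTHH:MM': the number of
    timed-out writes and the number of dropped metrics.
    """
    timeouts = Counter()
    dropped = Counter()
    for line in lines:
        stamp = STAMP_RE.search(line)
        if not stamp:
            continue  # skip lines without a telegraf timestamp
        minute = stamp.group(1)
        if TIMEOUT_RE.search(line):
            timeouts[minute] += 1
        match = DROPPED_RE.search(line)
        if match:
            dropped[minute] += int(match.group(1))
    return timeouts, dropped
```

Feeding the function the output of `journalctl -u telegraf`, for example by iterating over `sys.stdin`, gives the minutes with write timeouts, which can then be matched against the alert emails.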

Related issues 4 (1 open, 3 closed)

Related to openQA Infrastructure (public) - action #102650: Organize labs move to new building and SRV2 size:M (Resolved, nicksinger, 2021-11-18, 2022-05-27)
Related to openQA Infrastructure (public) - action #107257: [alert][osd] Apache Response Time alert size:M (Resolved, okurz, 2022-02-22)
Related to openQA Infrastructure (public) - action #107515: [Alerting] web UI: Too many Minion job failures alert size:S (Resolved, mkittler, 2022-02-24)
Related to openQA Infrastructure (public) - action #108266: grenache: script_run() commands randomly time out since server room move (New, 2022-03-14)
