action #130210
closed
[FIRING:1] Packet loss between worker hosts and other hosts alert Salt (2Z025iB4km)
Description
The Grafana alert fired on 31.05. at 07:37 and is still firing as of 01.06. 11:27.
https://stats.openqa-monitor.qa.suse.de/alerting/grafana/2Z025iB4km/view?orgId=1
Updated by mkittler over 1 year ago
- Status changed from New to In Progress
- Assignee set to mkittler
Updated by mkittler over 1 year ago
This is about s390zp14.suse.de not being pingable from openqaworker-arm-3.suse.de and worker3.oqa.suse.de, according to the data in Grafana. I've just connected to those hosts and pinged s390zp14.suse.de manually and it just works, as on OSD (so IPv4 works and IPv6 gives "ping: s390zp14.suse.de: Address family for hostname not supported").
This doesn't look good:
May 28 03:30:00 worker3 telegraf[2443]: 2023-05-28T01:30:00Z E! [inputs.ping] Error in plugin: host s390zl14.suse.de (linuxOne III): /usr/bin/ping: s390zl14.suse.de (linuxOne III): Name or service not known, exit status 2
It looks similar on the arm worker, except that more s390 hosts are affected there. Strangely, the last message is from 2023-05-28.
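In the log line above, the host string handed to ping appears to include the trailing "(linuxOne III)", which cannot resolve as a hostname. The following is a minimal sketch of a purely syntactic check that would flag such an entry; the function name and the TARGETS list are hypothetical and only reuse strings from this ticket. It does not perform any DNS lookup.

```python
import re

# One DNS label: letters, digits and hyphens, no leading or trailing hyphen.
_LABEL = re.compile(r"^(?!-)[A-Za-z0-9-]{1,63}(?<!-)$")


def looks_like_hostname(value: str) -> bool:
    """Return True if value is syntactically plausible as a hostname.

    This is only a syntax check (an assumption of this sketch); it does
    not tell whether the name actually resolves.
    """
    if not value or len(value) > 253:
        return False
    labels = value.rstrip(".").split(".")
    return all(_LABEL.match(label) for label in labels)


# Hypothetical target list, using the strings seen in this ticket.
TARGETS = [
    "s390zp14.suse.de",
    "s390zl14.suse.de (linuxOne III)",
]

for target in TARGETS:
    status = "ok" if looks_like_hostname(target) else "suspicious"
    print(f"{target!r}: {status}")
```

With these two strings, the first passes and the second is flagged because of the spaces and parentheses in its last label.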
When invoking sudo /usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d -test the output looks good, e.g.:
ping,host=worker3,url=s390zp14.suse.de average_response_ms=0.557,maximum_response_ms=0.557,minimum_response_ms=0.557,packets_received=1i,packets_transmitted=1i,percent_packet_loss=0,result_code=0i,standard_deviation_ms=0,ttl=62i 1685617878000000000
Nevertheless, restarting telegraf didn't help. The data in Grafana remains unchanged, even though no error messages are logged anymore.
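The sample from the -test run is in InfluxDB line protocol. To check such output for packet loss without reading it by eye, a minimal sketch could look like the following. The parser is deliberately simplified (it assumes no escaped spaces or commas and no quoted string fields, which holds for the sample above), and the function name is hypothetical.

```python
def parse_ping_line(line: str) -> dict:
    """Split one simple line-protocol record into its parts.

    Assumption: the record has exactly three space-separated sections
    (measurement+tags, fields, timestamp) and no escaped characters.
    """
    head, field_part, timestamp = line.strip().split(" ")
    measurement, *tag_items = head.split(",")
    tags = dict(item.split("=", 1) for item in tag_items)
    fields = {}
    for item in field_part.split(","):
        key, raw = item.split("=", 1)
        # Integer fields carry a trailing "i"; the rest are floats here.
        fields[key] = int(raw[:-1]) if raw.endswith("i") else float(raw)
    return {
        "measurement": measurement,
        "tags": tags,
        "fields": fields,
        "timestamp": int(timestamp),
    }


# The sample line quoted in this ticket.
SAMPLE = (
    "ping,host=worker3,url=s390zp14.suse.de "
    "average_response_ms=0.557,maximum_response_ms=0.557,"
    "minimum_response_ms=0.557,packets_received=1i,packets_transmitted=1i,"
    "percent_packet_loss=0,result_code=0i,standard_deviation_ms=0,ttl=62i "
    "1685617878000000000"
)

record = parse_ping_line(SAMPLE)
loss = record["fields"]["percent_packet_loss"]
print(f"{record['tags']['host']} -> {record['tags']['url']}: {loss}% packet loss")
```

This only confirms what telegraf measures locally in test mode; it says nothing about whether the running service is actually delivering data to Grafana.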
Updated by mkittler over 1 year ago
- Status changed from In Progress to Resolved
The alert is good again. It looks like it just took a moment until new data reached Grafana.
It appears that telegraf got stuck after hitting an error condition several times. At least it didn't report the correct ping data for a while, and after I restarted it and waited a few minutes it worked again. This is the first time I'm seeing this issue, so I suppose we don't need to investigate it further at this point. So I'm resolving this issue.
Updated by mkittler over 1 year ago
- Tags set to alert
- Project changed from 46 to openQA Infrastructure (public)
- Target version set to Ready