action #130210


[FIRING:1] Packet loss between worker hosts and other hosts alert Salt (2Z025iB4km)

Added by dheidler about 1 year ago. Updated about 1 year ago.



The Grafana alert fired on 31.05. at 07:37 and is still firing as of 01.06. 11:27.

Actions #1

Updated by mkittler about 1 year ago

  • Status changed from New to In Progress
  • Assignee set to mkittler
Actions #2

Updated by mkittler about 1 year ago

This is about certain hosts not being pingable from the worker hosts, according to the data in Grafana. I've just connected to those workers and pinged manually, and it works just like on OSD (so IPv4 works and IPv6 gives ping: Address family for hostname not supported).

This doesn't look good:

May 28 03:30:00 worker3 telegraf[2443]: 2023-05-28T01:30:00Z E! [] Error in plugin: host (linuxOne III): /usr/bin/ping: (linuxOne III): Name or service not known, exit status 2

It looks similar on the ARM worker, except that more s390 hosts are affected there. Strangely, the last such message is from 2023-05-28.
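The "Name or service not known" error comes from name resolution failing before any ICMP packet is sent, so it is a different failure mode than actual packet loss. A minimal sketch to distinguish the two on a worker (the target hostname is a placeholder, defaulting to localhost):

```shell
#!/bin/sh
# Distinguish a name-resolution failure from real packet loss.
# "Name or service not known" from /usr/bin/ping means resolution
# failed before any ICMP packet was sent.
target="${1:-localhost}"

if getent hosts "$target" >/dev/null; then
    echo "$target resolves"
    # Resolution works, so any remaining failure is actual packet loss.
    if ping -c1 -W2 "$target" >/dev/null 2>&1; then
        echo "$target is pingable"
    else
        echo "packet loss (or ICMP blocked) towards $target"
    fi
else
    echo "name resolution failed for $target (matches the telegraf error)"
fi
```

Running this with each host listed in the telegraf ping configuration quickly shows whether the plugin error is a DNS problem rather than a network problem.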

When invoking sudo /usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d -test, the output looks good, e.g.:

ping,host=worker3, average_response_ms=0.557,maximum_response_ms=0.557,minimum_response_ms=0.557,packets_received=1i,packets_transmitted=1i,percent_packet_loss=0,result_code=0i,standard_deviation_ms=0,ttl=62i 1685617878000000000
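For reference, the relevant telegraf input is configured roughly as below (a sketch; the actual urls list on the workers differs, and the example target is only a placeholder):

```toml
[[inputs.ping]]
  ## Hosts to ping; names must resolve, otherwise the plugin logs
  ## "Name or service not known" instead of reporting packet loss
  urls = ["openqa.suse.de"]
  count = 1
  ## Use the system ping binary, matching /usr/bin/ping in the log above
  method = "exec"
  binary = "/usr/bin/ping"
```

With method = "exec" the plugin shells out to the configured binary, which is why the error message in the journal is the literal ping(8) error output.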

Nevertheless, restarting telegraf didn't help. The data in Grafana remains unchanged, even though no error messages are logged anymore.

Actions #3

Updated by mkittler about 1 year ago

  • Status changed from In Progress to Resolved

The alert is good again. It looks like it just took a moment until new data reached Grafana.

It appears that telegraf got stuck after hitting the error condition several times: at least it didn't report correct ping data for a while, and after I restarted it and waited a few minutes it worked again. This is the first time I've seen this issue, so I suppose we don't need to investigate it further at this point. I'm resolving this ticket.

Actions #4

Updated by mkittler about 1 year ago

  • Tags set to alert
  • Project changed from 46 to openQA Infrastructure
  • Target version set to Ready
