action #130210

closed

[FIRING:1] Packet loss between worker hosts and other hosts alert Salt (2Z025iB4km)

Added by dheidler 11 months ago. Updated 11 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
Start date:
2023-06-01
Due date:
% Done:

0%

Estimated time:
Tags:

Description

The Grafana alert fired on 31.05. at 07:37 and is still firing as of 01.06. 11:27.

https://stats.openqa-monitor.qa.suse.de/alerting/grafana/2Z025iB4km/view?orgId=1

Actions #1

Updated by mkittler 11 months ago

  • Status changed from New to In Progress
  • Assignee set to mkittler
Actions #2

Updated by mkittler 11 months ago

According to the data in Grafana, this is about s390zp14.suse.de not being pingable from openqaworker-arm-3.suse.de and worker3.oqa.suse.de. I've just connected to those hosts and pinged s390zp14.suse.de manually, and it works just as it does on OSD (IPv4 works and IPv6 gives ping: s390zp14.suse.de: Address family for hostname not supported).

This doesn't look good:

May 28 03:30:00 worker3 telegraf[2443]: 2023-05-28T01:30:00Z E! [inputs.ping] Error in plugin: host s390zl14.suse.de (linuxOne III): /usr/bin/ping: s390zl14.suse.de (linuxOne III): Name or service not known, exit status 2
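In the log line above, the string handed to ping appears to include a trailing annotation ("(linuxOne III)"), which is not a resolvable hostname. As a rough sketch only, a check like the following could flag such entries before they end up in a ping target list. The function name is_valid_hostname is hypothetical and not part of telegraf or our Salt states; the rule it applies is the usual hostname syntax (labels of letters, digits and hyphens, at most 63 characters each).

```python
import re

# One DNS label: 1-63 chars of letters, digits, hyphens; no leading/trailing hyphen.
_LABEL = re.compile(r"^(?!-)[A-Za-z0-9-]{1,63}(?<!-)$")


def is_valid_hostname(name: str) -> bool:
    """Return True if `name` is a syntactically valid hostname (hypothetical helper)."""
    if not name or len(name) > 253:
        return False
    # A single trailing dot (fully qualified form) is tolerated.
    labels = name.rstrip(".").split(".")
    return all(_LABEL.match(label) for label in labels)


# The string from the log fails the check, the bare hostname passes.
for candidate in ("s390zl14.suse.de (linuxOne III)", "s390zl14.suse.de"):
    print(candidate, "->", is_valid_hostname(candidate))
```

This only checks syntax; it says nothing about whether the name actually resolves.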

It looks similar on the arm worker, except that more s390 hosts are affected there. Strangely, the last message is from 2023-05-28.

When invoking sudo /usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d -test the output looks good, e.g.:

ping,host=worker3,url=s390zp14.suse.de average_response_ms=0.557,maximum_response_ms=0.557,minimum_response_ms=0.557,packets_received=1i,packets_transmitted=1i,percent_packet_loss=0,result_code=0i,standard_deviation_ms=0,ttl=62i 1685617878000000000
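To check such a test line without eyeballing it, the output can be parsed in a few lines. This is a minimal sketch, assuming the simple form shown above (no escaped spaces or commas, no quoted string fields); parse_ping_line is a hypothetical helper, not something shipped with telegraf.

```python
def parse_ping_line(line: str) -> dict:
    """Parse one metric line in InfluxDB line protocol (simplified, see assumptions)."""
    # Tolerate an optional leading "> " in front of the metric.
    head, field_part, timestamp = line.strip().lstrip("> ").split(" ")
    measurement, *tag_items = head.split(",")
    tags = dict(item.split("=", 1) for item in tag_items)
    fields = {}
    for item in field_part.split(","):
        key, raw = item.split("=", 1)
        # A trailing "i" marks an integer field; the rest here are floats.
        fields[key] = int(raw[:-1]) if raw.endswith("i") else float(raw)
    return {
        "measurement": measurement,
        "tags": tags,
        "fields": fields,
        "timestamp": int(timestamp),
    }


sample = (
    "ping,host=worker3,url=s390zp14.suse.de "
    "average_response_ms=0.557,maximum_response_ms=0.557,"
    "minimum_response_ms=0.557,packets_received=1i,packets_transmitted=1i,"
    "percent_packet_loss=0,result_code=0i,standard_deviation_ms=0,ttl=62i "
    "1685617878000000000"
)

metric = parse_ping_line(sample)
print(metric["tags"]["url"], metric["fields"]["percent_packet_loss"])
```

For the sample line this reports zero packet loss for s390zp14.suse.de, matching the manual ping.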

Nevertheless, restarting telegraf didn't help. The data in Grafana remains unchanged, even though no error messages are logged anymore.

Actions #3

Updated by mkittler 11 months ago

  • Status changed from In Progress to Resolved

The alert is good again. It looks like it just took a moment until new data reached Grafana.

It appears that telegraf got stuck after hitting an error condition several times. At least it didn't report the correct ping data for a while; after I restarted it and waited a few minutes, it worked again. This is the first time I've seen this issue, so I suppose we don't need to investigate it further at this point. I'm resolving this issue.

Actions #4

Updated by mkittler 11 months ago

  • Tags set to alert
  • Project changed from 46 to openQA Infrastructure
  • Target version set to Ready