action #138527

closed

Zabbix agent on ariel.dmz-prg2.suse.org reported no data for 30m and there is nothing in the journal size:S

Added by livdywan about 1 year ago. Updated about 1 year ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Start date: 2023-07-07
Due date:
% Done: 0%
Estimated time:
Description

Observation

Problem started at 12:50:21 on 2023.10.25
Problem name: Zabbix agent is not available (or nodata for 30m)
Host: ariel.dmz-prg2.suse.org
Severity: Average
Operational data: Up (1)
Original problem ID: 600373209

Checking the journal shows nothing:

sudo journalctl -u zabbix_agentd
-- No entries --

Acceptance criteria

  • AC1: It is understood what caused the Zabbix agent to stop reporting data

Suggestions


Related issues 2 (0 open, 2 closed)

Related to openQA Infrastructure (public) - action #138551: DNS outage of 2023-10-25, e.g. Cron <root@openqa-service> (date; fetch_openqa_bugs)> /tmp/fetch_openqa_bugs_osd.log Max retries exceeded with url size:S (Resolved, livdywan, 2023-10-23)

Related to openQA Infrastructure (public) - action #138545: Munin - minion hook failed - opensuse.org :: openqa.opensuse.org size:S (Resolved, tinita, 2023-11-28)

Actions #1

Updated by livdywan about 1 year ago

Apparently I can't suppress this problem in Zabbix because it isn't showing up anywhere; all I can see are unrelated problems that aren't relevant to us.

Actions #2

Updated by tinita about 1 year ago

This keeps happening regularly, each time for a short period. Right now everything seems fine again; the duration was 35 min.

% grep zabbix-proxy.dmz-prg2.suse.org /var/log/zabbix_agentd.log

  1641:20231022:033032.672 active check configuration update from [zabbix-proxy.dmz-prg2.suse.org:10051] started to fail (cannot resolve [zabbix-proxy.dmz-prg2.suse.org])
  1641:20231022:033132.688 active check configuration update from [zabbix-proxy.dmz-prg2.suse.org:10051] is working again
  1641:20231024:200405.516 active check data upload to [zabbix-proxy.dmz-prg2.suse.org:10051] started to fail ([connect] cannot resolve [zabbix-proxy.dmz-prg2.suse.org])
  1641:20231024:200546.006 active check configuration update from [zabbix-proxy.dmz-prg2.suse.org:10051] started to fail (cannot resolve [zabbix-proxy.dmz-prg2.suse.org])
  1641:20231024:201159.094 active check data upload to [zabbix-proxy.dmz-prg2.suse.org:10051] is working again
  1641:20231024:201246.432 active check configuration update from [zabbix-proxy.dmz-prg2.suse.org:10051] is working again

# The alert seems to be about this timeframe:
  1641:20231025:102049.982 active check data upload to [zabbix-proxy.dmz-prg2.suse.org:10051] started to fail ([connect] cannot resolve [zabbix-proxy.dmz-prg2.suse.org])
  1641:20231025:102052.173 active check configuration update from [zabbix-proxy.dmz-prg2.suse.org:10051] started to fail (cannot resolve [zabbix-proxy.dmz-prg2.suse.org])
  1641:20231025:112646.694 active check data upload to [zabbix-proxy.dmz-prg2.suse.org:10051] is working again
  1641:20231025:112654.706 active check configuration update from [zabbix-proxy.dmz-prg2.suse.org:10051] is working again

  1641:20231025:133801.103 active check data upload to [zabbix-proxy.dmz-prg2.suse.org:10051] started to fail ([connect] cannot connect to [[zabbix-proxy.dmz-prg2.suse.org]:10051]: [4] Interrupted system call)
  1641:20231025:133801.105 active check data upload to [zabbix-proxy.dmz-prg2.suse.org:10051] is working again

So apparently there are intermittent connection problems to the Zabbix proxy.
Not sure how we could debug or improve this...
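
One way to quantify these windows is to pair each "started to fail" line with the next "is working again" line for the same check. The following is only a minimal sketch, assuming the log format matches the excerpt above (PID:YYYYMMDD:HHMMSS.mmm followed by the message); `LINE_RE` and `outage_windows` are hypothetical names for illustration, not part of Zabbix.

```python
import re
from datetime import datetime

# Assumed line format, taken from the excerpt above, e.g.:
#   1641:20231025:102049.982 active check data upload to [host:10051] started to fail (...)
LINE_RE = re.compile(
    r"^\s*\d+:(?P<ts>\d{8}:\d{6})\.\d{3} "
    r"active check (?P<kind>data upload to|configuration update from) "
    r"\[(?P<target>[^\]]+)\] "
    r"(?P<event>started to fail|is working again)"
)


def outage_windows(lines):
    """Pair each 'started to fail' with the next 'is working again'.

    Events are tracked separately per (check kind, target), because data
    uploads and configuration updates fail and recover independently.
    Returns a list of (kind, start, end, duration) tuples.
    """
    open_failures = {}  # (kind, target) -> time the failure started
    windows = []
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # ignore anything that is not an active-check event
        when = datetime.strptime(m["ts"], "%Y%m%d:%H%M%S")
        key = (m["kind"], m["target"])
        if m["event"] == "started to fail":
            # keep the earliest start if failures are logged repeatedly
            open_failures.setdefault(key, when)
        elif key in open_failures:
            start = open_failures.pop(key)
            windows.append((m["kind"], start, when, when - start))
    return windows


if __name__ == "__main__":
    # Path as used in the grep command above
    with open("/var/log/zabbix_agentd.log") as log:
        for kind, start, end, duration in outage_windows(log):
            print(f"{kind}: {start} -> {end} ({duration})")
```

Timestamps are parsed as naive local times exactly as logged, so they may need a timezone offset before comparing them with the alert time shown in Zabbix.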

Actions #3

Updated by jbaier_cz about 1 year ago

cannot resolve [zabbix-proxy.dmz-prg2.suse.org]

Looks like a DNS issue; might be related to #138551.
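
To confirm this from the affected host, one option is to poll name resolution for the proxy and record when it fails. This is just a sketch under assumptions: `can_resolve` and `watch` are hypothetical helpers, port 10051 is taken from the agent log above, and the polling interval is arbitrary. It relies only on Python's standard `socket.getaddrinfo`.

```python
import socket
import time


def can_resolve(host, port=10051, resolver=socket.getaddrinfo):
    """Return (True, sorted addresses) if host resolves, else (False, error text)."""
    try:
        infos = resolver(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        return False, str(exc)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP
    return True, sorted({info[4][0] for info in infos})


def watch(host, interval=60, rounds=5, resolver=socket.getaddrinfo, sleep=time.sleep):
    """Poll resolution `rounds` times and collect (timestamp, ok, detail) tuples."""
    results = []
    for i in range(rounds):
        ok, detail = can_resolve(host, resolver=resolver)
        results.append((time.strftime("%Y-%m-%d %H:%M:%S"), ok, detail))
        if i < rounds - 1:
            sleep(interval)
    return results


if __name__ == "__main__":
    for stamp, ok, detail in watch("zabbix-proxy.dmz-prg2.suse.org"):
        print(stamp, "OK" if ok else "FAIL", detail)
```

If failures recorded this way line up with the "cannot resolve" entries in the agent log, that would support the DNS explanation rather than a problem with the proxy itself.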

Actions #4

Updated by jbaier_cz about 1 year ago

  • Related to action #138551: DNS outage of 2023-10-25, e.g. Cron <root@openqa-service> (date; fetch_openqa_bugs)> /tmp/fetch_openqa_bugs_osd.log Max retries exceeded with url size:S added
Actions #5

Updated by livdywan about 1 year ago

  • Subject changed from Zabbix agent on ariel.dmz-prg2.suse.org reported no data for 30m and there is nothing in the journal to Zabbix agent on ariel.dmz-prg2.suse.org reported no data for 30m and there is nothing in the journal size:S
  • Status changed from New to In Progress
  • Assignee set to livdywan
  • Priority changed from Urgent to High

Maybe related to, or the same as, #138551. Also lowering the priority as we're not seeing this right now. I'll try to confirm the root cause and monitor the situation going forward.

Actions #6

Updated by livdywan about 1 year ago

  • Status changed from In Progress to Feedback

Hasn't come back so far

Actions #7

Updated by livdywan about 1 year ago

  • Related to action #138545: Munin - minion hook failed - opensuse.org :: openqa.opensuse.org size:S added
Actions #8

Updated by livdywan about 1 year ago

  • Status changed from Feedback to Resolved

I suppose we're good here.
