action #138527
closed
Zabbix agent on ariel.dmz-prg2.suse.org reported no data for 30m and there is nothing in the journal size:S
Added by livdywan about 1 year ago.
Updated about 1 year ago.
Description
Observation
Problem started at 12:50:21 on 2023.10.25
Problem name: Zabbix agent is not available (or nodata for 30m)
Host: ariel.dmz-prg2.suse.org
Severity: Average
Operational data: Up (1)
Original problem ID: 600373209
Checking the journal shows nothing:
sudo journalctl -u zabbix_agentd
-- No entries --
Acceptance criteria
- AC1: It is understood what caused the Zabbix agent to stop reporting data
Suggestions
I apparently can't suppress this problem in Zabbix because it isn't showing up anywhere; all I can see are unrelated problems that aren't relevant to us.
This keeps happening regularly for a short timeframe. Right now everything seems fine again; the duration was 35 min.
% grep zabbix-proxy.dmz-prg2.suse.org /var/log/zabbix_agentd.log
1641:20231022:033032.672 active check configuration update from [zabbix-proxy.dmz-prg2.suse.org:10051] started to fail (cannot resolve [zabbix-proxy.dmz-prg2.suse.org])
1641:20231022:033132.688 active check configuration update from [zabbix-proxy.dmz-prg2.suse.org:10051] is working again
1641:20231024:200405.516 active check data upload to [zabbix-proxy.dmz-prg2.suse.org:10051] started to fail ([connect] cannot resolve [zabbix-proxy.dmz-prg2.suse.org])
1641:20231024:200546.006 active check configuration update from [zabbix-proxy.dmz-prg2.suse.org:10051] started to fail (cannot resolve [zabbix-proxy.dmz-prg2.suse.org])
1641:20231024:201159.094 active check data upload to [zabbix-proxy.dmz-prg2.suse.org:10051] is working again
1641:20231024:201246.432 active check configuration update from [zabbix-proxy.dmz-prg2.suse.org:10051] is working again
# The alert seems to be about this timeframe:
1641:20231025:102049.982 active check data upload to [zabbix-proxy.dmz-prg2.suse.org:10051] started to fail ([connect] cannot resolve [zabbix-proxy.dmz-prg2.suse.org])
1641:20231025:102052.173 active check configuration update from [zabbix-proxy.dmz-prg2.suse.org:10051] started to fail (cannot resolve [zabbix-proxy.dmz-prg2.suse.org])
1641:20231025:112646.694 active check data upload to [zabbix-proxy.dmz-prg2.suse.org:10051] is working again
1641:20231025:112654.706 active check configuration update from [zabbix-proxy.dmz-prg2.suse.org:10051] is working again
1641:20231025:133801.103 active check data upload to [zabbix-proxy.dmz-prg2.suse.org:10051] started to fail ([connect] cannot connect to [[zabbix-proxy.dmz-prg2.suse.org]:10051]: [4] Interrupted system call)
1641:20231025:133801.105 active check data upload to [zabbix-proxy.dmz-prg2.suse.org:10051] is working again
So apparently there are connection problems to the Zabbix proxy host from time to time.
Not sure how we could debug or improve this...
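One way to get a clearer picture would be to pair up the "started to fail" and "is working again" entries and compute how long each window lasted. This is only a minimal sketch, assuming the log line format shown in the grep output above; `LINE_RE` and `outages` are hypothetical names, not part of Zabbix.

```python
import re
from datetime import datetime

# Matches agent log lines of the form seen in the grep output, e.g.
# "1641:20231025:102049.982 active check data upload to [host:10051] started to fail (...)"
LINE_RE = re.compile(
    r"^\s*\d+:(?P<ts>\d{8}:\d{6}\.\d{3}) "
    r"active check (?P<kind>data upload to|configuration update from) "
    r"\[(?P<target>[^\]]+)\] "
    r"(?P<state>started to fail|is working again)"
)


def outages(lines):
    """Yield (kind, start, end) for every completed failure window."""
    open_since = {}  # (kind, target) -> timestamp of the first failure
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue
        ts = datetime.strptime(m["ts"], "%Y%m%d:%H%M%S.%f")
        key = (m["kind"], m["target"])
        if m["state"] == "started to fail":
            open_since.setdefault(key, ts)
        elif key in open_since:
            yield m["kind"], open_since.pop(key), ts


# Two lines copied from the grep output in this ticket.
SAMPLE = [
    "1641:20231025:102049.982 active check data upload to [zabbix-proxy.dmz-prg2.suse.org:10051] started to fail ([connect] cannot resolve [zabbix-proxy.dmz-prg2.suse.org])",
    "1641:20231025:112646.694 active check data upload to [zabbix-proxy.dmz-prg2.suse.org:10051] is working again",
]

for kind, start, end in outages(SAMPLE):
    print(f"{kind}: {start} -> {end} ({end - start})")
```

To run it against the real log, pass an open file handle for `/var/log/zabbix_agentd.log` to `outages` instead of `SAMPLE`.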
cannot resolve [zabbix-proxy.dmz-prg2.suse.org]
Looks like a DNS issue; might be related to #138551
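To confirm the DNS theory independently of the agent, a small probe could resolve the proxy name at intervals and record each failure. This is a rough sketch rather than an established tool: `can_resolve` and `probe` are hypothetical helpers, and the host and port are taken from the log lines above. It only uses `socket.getaddrinfo` from the Python standard library.

```python
import socket
import time
from datetime import datetime, timezone


def can_resolve(host, port, resolver=socket.getaddrinfo):
    """Return (True, None) if the name resolves, else (False, error text)."""
    try:
        resolver(host, port)
        return True, None
    except socket.gaierror as exc:
        return False, str(exc)


def probe(host, port, attempts, interval,
          resolver=socket.getaddrinfo, sleep=time.sleep):
    """Resolve host `attempts` times and return a list of (UTC time, error)
    tuples, one for each failed lookup."""
    failures = []
    for i in range(attempts):
        ok, err = can_resolve(host, port, resolver)
        if not ok:
            failures.append((datetime.now(timezone.utc), err))
        if i < attempts - 1:
            sleep(interval)  # wait between lookups, but not after the last
    return failures
```

For example, `probe("zabbix-proxy.dmz-prg2.suse.org", 10051, attempts=60, interval=60)` run on ariel would cover one hour. Failure timestamps from the probe could then be compared with the agent log to see whether they match the "cannot resolve" entries.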
- Related to action #138551: DNS outage of 2023-10-25, e.g. Cron <root@openqa-service> (date; fetch_openqa_bugs)> /tmp/fetch_openqa_bugs_osd.log Max retries exceeded with url size:S added
- Subject changed from Zabbix agent on ariel.dmz-prg2.suse.org reported no data for 30m and there is nothing in the journal to Zabbix agent on ariel.dmz-prg2.suse.org reported no data for 30m and there is nothing in the journal size:S
- Status changed from New to In Progress
- Assignee set to livdywan
- Priority changed from Urgent to High
Maybe related to, or the same as, #138551. Also lowering the priority as we're not seeing this right now. I'll try to confirm the root cause and monitor the situation going forward.
- Status changed from In Progress to Feedback
- Related to action #138545: Munin - minion hook failed - opensuse.org :: openqa.opensuse.org size:S added
- Status changed from Feedback to Resolved
I suppose we're good here.