action #123825

closed

Ensure proper o3 monitoring after shutdown of thruk/icinga by SUSE-IT Eng-Infra

Added by okurz about 1 year ago. Updated 9 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
Start date:
2023-01-31
Due date:
% Done:

0%

Estimated time:

Description

Motivation

See email from Eng-Infra to devel@suse.de:

We are going to shutdown the old monitoring system based on Icinga.
The Icinga monitoring system started in 2013 as replacement of the old Nagios monitoring system.
Over the years it has been evolving and growing. We implemented several new technologies and
approaches there. For example check_mk , Thruk or SUSE HA.
But now the time went forward and monitoring tool also evolved. The opensource community came with new
tools. Its time to say good bye Icinga and thanks for all mails.
The Icinga monitoring system running on the portal thruk.suse.de will be replaced new tool Zabbix
running on machine zabbix.suse.de.
The new tool monitors infrastructure, servers and services in SUSE IT infra team responsibility.
We as team are open to offer this service to other SUSE teams.
The service will be shutdown on 1.2.2023.

With this shutdown we should ensure that monitoring and alerting, in particular of o3, still works. Most likely we need to make sure the corresponding configuration is added to the new instance.

Acceptance criteria

  • AC1: Monitoring data regarding o3 is available for SUSE QE Tools team members
  • AC2: SUSE QE tools team members are alerted in case of critical problems regarding o3
  • AC3: Our documentation mentions/links the new system

Suggestions

  • Clarify about the current situation, if the o3 monitoring was migrated, etc.
  • Get access to the new system
  • Test access for multiple team members
  • Add missing configuration as needed
  • Test alerting
  • Mention/link the new system on our documentation, e.g. progress.opensuse.org/projects/openqav3/wiki/ and progress.opensuse.org/projects/qa/wiki/tools#Common-tasks-for-team-members
Actions #1

Updated by okurz about 1 year ago

  • Due date set to 2023-02-10
  • Status changed from New to Feedback
  • Assignee set to okurz

I asked in https://suse.slack.com/archives/C029APBKLGK/p1675168974431129?thread_ts=1675166806.368939&cid=C029APBKLGK:

The monitoring of openqa.opensuse.org is/was relying on thruk/icinga. So 1. Are the o3 hosts monitored by zabbix now? 2. What about alerting? 3. Where can I see monitoring data? 4. How can we add/change related configuration?

Actions #2

Updated by okurz about 1 year ago

  • Due date changed from 2023-02-10 to 2023-05-26
  • Status changed from Feedback to Blocked
Actions #3

Updated by okurz about 1 year ago

  • Priority changed from High to Normal
Actions #4

Updated by okurz 12 months ago

There have been updates in the ticket and also a conversation in Slack.

Jiri Novak will create a user group "o3" in zabbix. I added a DHCP and hosts entry for a zabbix proxy host within the o3 DMZ.

To add more hosts to the monitoring, we can execute the following as root on each host:

zypper ref
zypper in -y zabbix-agent
sed -i 's/=127.0.0.1/=zabbix-proxy-opensuse.openqanet.opensuse.org/g' /etc/zabbix/zabbix_agentd.conf
sed -i 's|# Include=/usr/local/etc/zabbix_agentd\.conf\.d/\*\.conf|Include=/etc/zabbix/zabbix_agentd.conf.d/*.conf|g' /etc/zabbix/zabbix_agentd.conf 
sed -i "s/=Zabbix server/=$(hostname -f)/g" /etc/zabbix/zabbix_agentd.conf
mkdir -p /etc/zabbix/zabbix_agentd.conf.d
echo "HostMetadataItem=" >/etc/zabbix/zabbix_agentd.conf.d/00-HostMetadata.conf
echo "HostMetadata=o-o3,t-linux" >>/etc/zabbix/zabbix_agentd.conf.d/00-HostMetadata.conf
echo 'UserParameter=net.if.ip4[*],ip -4 addr show dev $1 | grep inet | tr -s " " | cut -d" " -f3' >/etc/zabbix/zabbix_agentd.conf.d/30-TemplateConfigs.conf 
echo 'UserParameter=net.if.ip6[*],ip -6 addr show dev $1 | grep inet | tr -s " " | cut -d" " -f3' >>/etc/zabbix/zabbix_agentd.conf.d/30-TemplateConfigs.conf 
echo 'UserParameter=net.if.mac[*],ip link show $1 | grep link | tr -s " " | cut -d" " -f3' >>/etc/zabbix/zabbix_agentd.conf.d/30-TemplateConfigs.conf 
echo 'UserParameter=sys.os.release,grep PRETTY /etc/os-release | cut -d'"'"'"'"'"' -f2' >>/etc/zabbix/zabbix_agentd.conf.d/30-TemplateConfigs.conf 
echo 'UserParameter=sys.hw.manufacturer,cat /sys/devices/virtual/dmi/id/chassis_vendor' >>/etc/zabbix/zabbix_agentd.conf.d/30-TemplateConfigs.conf 
echo 'UserParameter=sys.hw.uuid,sudo cat /sys/devices/virtual/dmi/id/product_uuid' >>/etc/zabbix/zabbix_agentd.conf.d/30-TemplateConfigs.conf 
echo 'UserParameter=sys.hw.metadata,cat /etc/machine-metadata.yaml 2>/dev/null' >>/etc/zabbix/zabbix_agentd.conf.d/30-TemplateConfigs.conf 
echo 'UserParameter=sys.mount.nfs,mount | grep "type nfs" | cut -d" " -f-3' >>/etc/zabbix/zabbix_agentd.conf.d/30-TemplateConfigs.conf 
echo 'UserParameter=sys.net.listen,ss -tuln $(awk '"'"'{print "( sport < :"$1" or sport > :"$2" )"}'"'"' /proc/sys/net/ipv4/ip_local_port_range) | awk '"'"'BEGIN {col=5} NR==1 {if($4~/Local/){col=4}; next} $col!~/127\.|\[::1\]/{print $col}'"'"' | awk -F: '"'"'!a[$NF]++ {print $NF}'"'"' | sort -n | paste -sd '"'"' '"'"'' >>/etc/zabbix/zabbix_agentd.conf.d/30-TemplateConfigs.conf 
echo 'UserParameter=sys.net.allfqdns,for x in `ip a | grep -E " inet6? " | tr -s " " | cut -d" " -f3 | cut -d"/" -f1`; do host $x | grep pointer | cut -d" " -f5; done | grep -v localhost | sort | uniq | sed '"'"'s/\.$//g'"'"' | paste -sd,' >>/etc/zabbix/zabbix_agentd.conf.d/30-TemplateConfigs.conf
chown -R root:zabbix /etc/zabbix/zabbix_agentd.conf.d/
echo "zabbix ALL=(ALL) NOPASSWD: /usr/bin/cat /sys/devices/virtual/dmi/id/product_uuid" >/etc/sudoers.d/zabbix-agent
systemctl restart zabbix_agentd
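
The UserParameter lines above are plain shell pipelines, so they can be tried without a running agent. The following is a minimal sketch of the filter behind net.if.ip4, fed with canned input instead of a live "ip" call. The helper name extract_inet, the variable ip_sample and the sample address are made up for illustration and are not part of the original commands.

```shell
# Hypothetical helper: the filter part of the net.if.ip4 UserParameter.
# It reads the output of `ip -4 addr show dev <iface>` from stdin.
extract_inet() {
    grep inet | tr -s " " | cut -d" " -f3
}

# Canned sample in the shape printed by `ip -4 addr show dev eth0`.
# The address is an assumption (documentation range), not a real o3 host.
ip_sample='2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    inet 192.0.2.10/24 brd 192.0.2.255 scope global eth0
       valid_lft forever preferred_lft forever'

# Prints the address together with its prefix length: 192.0.2.10/24
printf '%s\n' "$ip_sample" | extract_inet
```

Note that the value keeps the prefix length, because the third space-separated field of the "inet" line is the address in CIDR notation.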

(Jiri Novak) so ariel has the data in zabbix properly now. if you want to add any more hosts like this, you can run the bunch of commands and it will auto-add to zabbix. i'll make you users into web interface in the evening
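
Before running the commands on another host, the three sed substitutions can be tried on a throwaway file. This is only a sketch: the helper rewrite_agent_conf and the host name worker.example.org are hypothetical, and the sample lines assume that the packaged zabbix_agentd.conf still has the stock values "Server=127.0.0.1", "ServerActive=127.0.0.1", "Hostname=Zabbix server" and the commented-out Include line.

```shell
# Hypothetical helper wrapping the three sed calls from the commands above.
# $1: config file, $2: proxy host name, $3: FQDN to use as agent host name.
rewrite_agent_conf() {
    conf=$1
    proxy=$2
    fqdn=$3
    sed -i "s/=127.0.0.1/=${proxy}/g" "$conf"
    sed -i 's|# Include=/usr/local/etc/zabbix_agentd\.conf\.d/\*\.conf|Include=/etc/zabbix/zabbix_agentd.conf.d/*.conf|g' "$conf"
    sed -i "s/=Zabbix server/=${fqdn}/g" "$conf"
}

# Sample file with the assumed stock defaults; nothing under /etc is touched.
conf_sample=$(mktemp)
cat >"$conf_sample" <<'EOF'
Server=127.0.0.1
ServerActive=127.0.0.1
Hostname=Zabbix server
# Include=/usr/local/etc/zabbix_agentd.conf.d/*.conf
EOF

rewrite_agent_conf "$conf_sample" zabbix-proxy-opensuse.openqanet.opensuse.org worker.example.org

# Show the rewritten file, then clean up.
cat "$conf_sample"
rm -f "$conf_sample"
```

If the printed file still contains 127.0.0.1 or "Zabbix server", the defaults on that host differ from the assumption and the sed patterns need adjusting before the agent can register through the proxy.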

Actions #5

Updated by okurz 12 months ago

  • Status changed from Blocked to In Progress

SD ticket was resolved with

users created, set up meta user o3-notifications and gave him email and to send warning and up (let us know if you want that changed)
please have a look on ariel if you can also edit it, i'm not sure with the settings.
if not, please ping Martin Caj to change on all users in "O3 team" on permissins page in user settings from user role to admin role. […]

https://zabbix.suse.de/zabbix.php?action=host.edit&hostid=10855 shows something useful now. It looks like I can configure stuff. I guess that's good enough for now until we have learned more about zabbix.

I suggest that, as a first step, everybody from the team tries to log in. Everybody should see "ariel" under https://zabbix.suse.de/zabbix.php?action=host.view

Actions #6

Updated by okurz 12 months ago

  • Tags changed from reactive work, infra to reactive work, infra, mob

Let's also try to cover a basic zabbix intro tomorrow during the MOB session. I updated https://progress.opensuse.org/projects/qa/wiki/Tools and https://progress.opensuse.org/projects/openqav3/wiki/Wiki#Monitoring with the corresponding information about the new monitoring system.

Actions #7

Updated by okurz 12 months ago

  • Status changed from In Progress to Blocked

https://zabbix.suse.de/ was not reachable during the MOB session. We reported https://sd.suse.com/servicedesk/customer/portal/1/SD-113969 but now zabbix.suse.de seems to be fine. Anyone can test the login and get accustomed to the system while we also track the ticket.

Actions #8

Updated by okurz 9 months ago

  • Due date deleted (2023-05-26)
  • Status changed from Blocked to Resolved

https://sd.suse.com/servicedesk/customer/portal/1/SD-113969 is still open and not answered by 2023-03-03. We gave up on the SD ticket. The zabbix instance is reachable and fine.
