action #18164: [devops][tools] monitoring of openqa worker instances - openQA Infrastructure (public) - openSUSE Project Management Tool

#1

Updated by nicksinger about 8 years ago

Related to action #12912: [tools]monitoring of o3/osd added

#2

Updated by okurz about 8 years ago

The ticket https://infra.nue.suse.com/SelfService/Display.html?id=63262 about monitoring of the workers has just been resolved. I can't log in to icinga right now. Can anyone check?

#3

Updated by RBrownSUSE about 8 years ago

Assignee set to szarate
Priority changed from Normal to High
Target version set to Milestone 8

#4

Updated by okurz almost 8 years ago

Related to action #19564: [tools]worker is unresponsive for three days but reports as online to the webui because of cache database locked? added

#5

Updated by RBrownSUSE almost 8 years ago

Target version changed from Milestone 8 to Milestone 9

#6

Updated by RBrownSUSE almost 8 years ago

Priority changed from High to Normal

#7

Updated by coolo over 7 years ago

Related to deleted (action #12912: [tools]monitoring of o3/osd)

#8

Updated by okurz over 7 years ago

Related to action #12912: [tools]monitoring of o3/osd added

#9

Updated by szarate over 7 years ago

Assignee deleted (~~szarate~~)
Target version changed from Milestone 9 to future

Moving to future, but we should tackle this eventually

#10

Updated by szarate over 7 years ago

Related to action #28355: [tools][bonus][Sprint 201711.2] Worker loop dies during job setup added

#11

Updated by acarvajal over 7 years ago

Status changed from New to In Progress
Assignee set to acarvajal

Discussing with szarate regarding this issue, will focus initially on monitoring the following for each worker:

- That it's reachable
- SSH status
- Core dumps
- Services status (os-autoinst-openvswitch, etc.)

#12

Updated by szarate over 7 years ago

Also salt-minion

#13

Updated by acarvajal about 7 years ago

Discussing with coolo regarding this issue, updated summary:

- Focus initially on one worker, and then replicate to the rest.
- Focus on:
  - That the worker is reachable
  - SSH status
  - salt-minion status
  - openqa services status (os-autoinst-openvswitch, openqa-worker@*, etc.)
  - Core dumps
  - openqa jobs assigned to workers vs scheduled
- Review grafana as an alternate monitoring/report tool
- IT-managed nagios instance could solve some of the metrics/services to focus on
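
The first items on this list (reachability, SSH, service status) could be probed with something as small as the sketch below. It is only an illustration, not what was deployed: the helper names and the host `worker.example` are hypothetical, a TCP connect to port 22 only shows that something is listening there (not that an SSH login works), and `unit_active` checks units on the machine it runs on, so it would have to run on the worker itself.

```python
import socket
import subprocess


def port_open(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def unit_active(unit: str) -> bool:
    """Return True if `systemctl is-active --quiet UNIT` exits with status 0."""
    try:
        result = subprocess.run(
            ["systemctl", "is-active", "--quiet", unit], check=False
        )
    except FileNotFoundError:
        # systemctl is not installed on this machine.
        return False
    return result.returncode == 0


def summarize(checks: dict) -> list:
    """Return the sorted names of all checks that did not pass."""
    return sorted(name for name, ok in checks.items() if not ok)


if __name__ == "__main__":
    # "worker.example" is a placeholder host, not a real worker name.
    checks = {
        "ssh-port": port_open("worker.example", 22),
        "salt-minion": unit_active("salt-minion.service"),
        "os-autoinst-openvswitch": unit_active("os-autoinst-openvswitch.service"),
    }
    print("failed checks:", summarize(checks))
```

The existing icinga/nagios plugins already cover ping and SSH checks, so a script like this would mainly be useful for the openQA-specific services.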

#14

Updated by szarate about 7 years ago

https://wiki.microfocus.net/index.php?title=SUSE-Development/OPS/Services/Monitoring

#15

Updated by nicksinger about 7 years ago

acarvajal wrote:

Review grafana as an alternate monitoring/report tool

grafana and its stack are actually better suited to performance profiling. Newer versions do include some monitoring capabilities, but IMHO it's far more efficient to use icinga/nagios for monitoring.

acarvajal wrote:

IT-managed nagios instance could solve some of the metrics/services to focus on

Definitely the way to go if you really just want to get some kind of monitoring. The basic scenarios in particular, such as ping-check, ssh-check and disk-check, are already covered by the basic modules provided by infra/icinga/nagios.

#16

Updated by acarvajal about 7 years ago

Requesting access to host group 'openqa-suse' on thruk with https://infra.nue.suse.com/SelfService/Display.html?id=109884

#17

Updated by acarvajal about 7 years ago

New ticket created specifically to request access to openqa-suse host group: https://infra.nue.suse.com/Ticket/Display.html?id=110564

#18

Updated by szarate about 7 years ago

Related to action #35290: [tools] again needles could not be pushed from osd to gitlab.suse.de due to "account has been blocked" and apparently no monitoring alert about this was observed added

#19

Updated by acarvajal about 7 years ago

Created 2 subtasks to tackle independently:

(1) The monitoring of the workers via the existing monitoring platform in SUSE
(2) A proof-of-concept with grafana/graphite for performance profiling

#20

Updated by acarvajal about 7 years ago

Also add to monitoring list: Threshold of jobs scheduled vs free workers
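
One possible reading of such a threshold, as a rough sketch: alert when the number of scheduled jobs exceeds some multiple of the free workers. The function name and the default ratio of 2.0 are assumptions made for illustration; the ticket does not define the actual threshold.

```python
def backlog_alert(scheduled_jobs: int, free_workers: int, max_ratio: float = 2.0) -> bool:
    """Return True when scheduled jobs exceed max_ratio times the free workers.

    max_ratio is a hypothetical tuning value, not taken from the ticket.
    """
    if scheduled_jobs < 0 or free_workers < 0:
        raise ValueError("counts must be non-negative")
    if free_workers == 0:
        # With no free workers, any scheduled job counts as a backlog.
        return scheduled_jobs > 0
    return scheduled_jobs / free_workers > max_ratio
```

For example, `backlog_alert(10, 2)` returns True with the default ratio, while `backlog_alert(4, 2)` returns False.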

#21

Updated by okurz almost 7 years ago

Target version changed from future to future

#22

Updated by acarvajal almost 7 years ago

Updated sub-task with systemd timer information and salt proposal: https://progress.opensuse.org/issues/35536

#23

Updated by coolo over 6 years ago

Related to action #40583: Provide job stats for telegraf to poll added

#24

Updated by szarate over 6 years ago

Assignee changed from acarvajal to szarate

We now have http://openqa-monitoring.qa.suse.de, which for the time being points to the grafana instance. Monitoring of workers is to be added there by EOD today, per request of schlad.

#25

Updated by szarate over 6 years ago

Target version changed from future to Current Sprint

Requested anonymous access/special account for monitoring purposes: https://infra.nue.suse.com/Ticket/Display.html?id=121425

#26

Updated by sebchlad over 6 years ago

Related to action #41189: [tools][monitoring] Worker 'reachable' notifications sent form Grafana instance added

#27

Updated by szarate over 6 years ago

One thing to note is that the dashboard is not always reliable. To fix this, two infra tickets need to be resolved, as described in poo#41336, and the get-metrics script has to be moved to collectd so that poo#35536 can move forward and we get better reaction and monitoring capabilities.

#28

Updated by szarate over 6 years ago

Related to deleted (action #41189: [tools][monitoring] Worker 'reachable' notifications sent form Grafana instance)

#29

Updated by szarate over 6 years ago

Related to action #41189: [tools][monitoring] Worker 'reachable' notifications sent form Grafana instance added

#30

Updated by szarate over 6 years ago

infra reported that they can't provide a VM in the current cloud: https://infra.nue.suse.com/Ticket/Display.html?id=121613
Steven mentioned that Max would take a look at this and propose a different solution.

#31

Updated by szarate over 6 years ago

FQDN requested for the current grafana instance: monitoring.openqa.suse.de -> openqa-monitoring.suse.de.
Capacity of the grafana instance stays at ~50GB, but collectd data now takes 3.5GB for all workers (only x86 for the time being, because collectd is not available on other platforms).
Still no ETA on the nagios monitoring ticket.

#32

Updated by okurz over 6 years ago

Subject changed from [tools] monitoring of openqa worker instances to [functional][u][tools] monitoring of openqa worker instances
Target version changed from Current Sprint to Milestone 20

szarate joined qsf-u

#33

Updated by szarate over 6 years ago

Assignee changed from szarate to nicksinger

Thought I passed this to Nick already :)

#34

Updated by szarate over 6 years ago

Project changed from openQA Project (public) to openQA Infrastructure (public)
Subject changed from [functional][u][tools] monitoring of openqa worker instances to [devops][tools] monitoring of openqa worker instances
Category deleted (~~168~~)
Status changed from In Progress to Blocked

Setting to blocked for the time being.
The status is the same as 21 days ago (#31), but collectd data now takes ~300MB per worker, since the CPU plugin was disabled and only load is still enabled.

#35

Updated by okurz over 6 years ago

Target version deleted (~~Milestone 20~~)

removing target version as tools-team does not use milestones

#36

Updated by okurz over 5 years ago

I don't think it's blocked anymore, nicksinger, WDYT?

Thinking of monitoring systemctl --failed. As coolo suggested, one can run commands repeatedly and check their exit status to monitor; small shell snippets in telegraf will do, see the exec input plugin:
https://github.com/influxdata/telegraf/tree/master/plugins/inputs/exec
So run systemctl --failed | wc -l and emit the count as an influxdb value.
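
A rough sketch of what such an exec script could look like, written in Python rather than shell so the parsing can be tested. The measurement name `systemd_failed_units` and the field name `failed` are made up for illustration. Note that a plain `systemctl --failed | wc -l` also counts the header and summary lines, so this sketch passes `--no-legend --plain` to get one line per failed unit.

```python
import subprocess


def count_failed_units(systemctl_output: str) -> int:
    """Count non-empty lines of `systemctl --failed --no-legend --plain` output."""
    return sum(1 for line in systemctl_output.splitlines() if line.strip())


def to_line_protocol(measurement: str, failed: int) -> str:
    """Format one InfluxDB line-protocol point; the `i` suffix marks an integer field."""
    return f"{measurement} failed={failed}i"


def main() -> None:
    # --no-legend drops the header/footer, --plain drops the status glyph,
    # leaving one line per failed unit.
    result = subprocess.run(
        ["systemctl", "--failed", "--no-legend", "--plain"],
        capture_output=True,
        text=True,
        check=False,
    )
    # "systemd_failed_units" is a hypothetical measurement name.
    print(to_line_protocol("systemd_failed_units", count_failed_units(result.stdout)))


if __name__ == "__main__":
    main()
```

Telegraf's exec input would then run the script via its `commands` setting with `data_format = "influx"`; timestamp and host tag are left for telegraf to add.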

#37

Updated by nicksinger over 5 years ago

Status changed from Blocked to Resolved

Yes, we have something now in place: https://stats.openqa-monitor.qa.suse.de
