action #18164

[devops][tools] monitoring of openqa worker instances

Added by nicksinger about 6 years ago. Updated over 3 years ago.

Status: Resolved
Priority: Normal
Assignee:
Target version: -
Start date: 2018-04-25
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)

Description

As already mentioned by okurz in poo#12912, we need proper monitoring of all machines that are important for openQA.
OSD is already in the icinga instance maintained by Infra, so I created this ticket to also keep track of the workers themselves.


Subtasks

action #35533: [tools] Monitoring of openqa worker instances via existing SUSE Infra services (Resolved, okurz)

action #35536: [tools] Performance Profiling of openQA workers & OSD (Rejected, acarvajal)

action #41336: Create a monitoring dashboard for openqa.suse.de (Resolved)

action #41975: Evaluate graphite vs prometheus (Rejected)


Related issues

Related to openQA Project - action #19564: [tools] worker is unresponsive for three days but reports as online to the webui because of cache database locked? (Closed, 2017-06-04)

Related to openQA Project - action #12912: [tools] monitoring of o3/osd (Resolved, 2016-07-28)

Related to openQA Project - action #28355: [tools][bonus][Sprint 201711.2] Worker loop dies during job setup (Resolved, 2017-11-24)

Related to openQA Infrastructure - action #35290: [tools] again needles could not be pushed from osd to gitlab.suse.de due to "account has been blocked" and apparently no monitoring alert about this was observed (Resolved, 2018-04-20)

Related to openQA Project - action #40583: Provide job stats for telegraf to poll (Resolved, 2018-09-04)

Related to openQA Infrastructure - action #41189: [tools][monitoring] Worker 'reachable' notifications sent form Grafana instance (Resolved, 2018-09-18)

History

#1 Updated by nicksinger about 6 years ago

#2 Updated by okurz about 6 years ago

https://infra.nue.suse.com/SelfService/Display.html?id=63262 about monitoring of the workers has just been resolved. I can't log in to icinga right now. Can anyone check?

#3 Updated by RBrownSUSE about 6 years ago

  • Assignee set to szarate
  • Priority changed from Normal to High
  • Target version set to Milestone 8

#4 Updated by okurz almost 6 years ago

  • Related to action #19564: [tools]worker is unresponsive for three days but reports as online to the webui because of cache database locked? added

#5 Updated by RBrownSUSE almost 6 years ago

  • Target version changed from Milestone 8 to Milestone 9

#6 Updated by RBrownSUSE almost 6 years ago

  • Priority changed from High to Normal

#7 Updated by coolo over 5 years ago

  • Related to deleted (action #12912: [tools]monitoring of o3/osd)

#8 Updated by okurz over 5 years ago

#9 Updated by szarate over 5 years ago

  • Assignee deleted (szarate)
  • Target version changed from Milestone 9 to future

Moving to future, but we should tackle this eventually

#10 Updated by szarate over 5 years ago

  • Related to action #28355: [tools][bonus][Sprint 201711.2] Worker loop dies during job setup added

#11 Updated by acarvajal over 5 years ago

  • Status changed from New to In Progress
  • Assignee set to acarvajal

After discussing this issue with szarate, I will focus initially on monitoring the following for each worker:

  • That it's reachable
  • SSH status
  • Core dumps
  • Services status (os-autoinst-openvswitch, etc)

#12 Updated by szarate over 5 years ago

Also salt-minion

#13 Updated by acarvajal over 5 years ago

After discussing this issue with coolo, here is the updated summary:

1) Focus initially on one worker, and then replicate to the rest.
2) Focus on:

  • That the worker is reachable
  • SSH status
  • salt-minion status
  • openqa services status (os-autoinst-openvswitch, openqa-worker@*, etc)
  • Core dumps
  • openqa jobs assigned to workers vs scheduled

3) Review grafana as an alternate monitoring/report tool
4) IT-managed nagios instance could solve some of the metrics/services to focus on
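
A few of the checks from the list above could be sketched roughly like this. It is only a minimal illustration in Python, not what runs on OSD or in the IT-managed nagios: the function names, the timeouts, and the default unit list are assumptions made for the example. The service check has to run on the worker itself, while the SSH and ping checks can run from any monitoring host.

```python
import socket
import subprocess


def port_open(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds (SSH by default)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def host_pings(host: str) -> bool:
    """Return True if a single ICMP echo via the system `ping` gets a reply."""
    try:
        result = subprocess.run(
            ["ping", "-c", "1", "-W", "2", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
    except FileNotFoundError:
        # No ping binary available: treat as "not reachable" in this sketch.
        return False
    return result.returncode == 0


def unit_active(unit: str) -> bool:
    """Return True if `systemctl is-active` reports the local unit as active."""
    try:
        result = subprocess.run(
            ["systemctl", "is-active", "--quiet", unit],
            check=False,
        )
    except FileNotFoundError:
        return False
    return result.returncode == 0


def local_service_report(units=("salt-minion", "os-autoinst-openvswitch")) -> dict:
    """Map each unit name to True/False; meant to be run on the worker itself."""
    return {unit: unit_active(unit) for unit in units}
```

On a worker, `local_service_report()` would return a dictionary of unit name to status, and from a monitoring host `port_open("some-worker.example")` (hypothetical hostname) would cover the SSH check. Core dumps and the jobs-assigned-vs-scheduled comparison are left out because they depend on details not described in this ticket.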

#15 Updated by nicksinger about 5 years ago

acarvajal wrote:

3) Review grafana as an alternate monitoring/report tool

grafana and its stack are actually better suited for performance profiling. Newer versions do include some monitoring capabilities, but IMHO it's way more efficient to use icinga/nagios for monitoring.

acarvajal wrote:

4) IT-managed nagios instance could solve some of the metrics/services to focus on

Definitely the way to go if you really just want to get some kind of monitoring. Especially the basic scenarios like ping-check, ssh-check, disk-check and so on are already covered in the basic modules provided by infra/icinga/nagios.

#16 Updated by acarvajal about 5 years ago

Requesting access to host group 'openqa-suse' on thruk with https://infra.nue.suse.com/SelfService/Display.html?id=109884

#17 Updated by acarvajal about 5 years ago

New ticket created specifically to request access to openqa-suse host group: https://infra.nue.suse.com/Ticket/Display.html?id=110564

#18 Updated by szarate about 5 years ago

  • Related to action #35290: [tools] again needles could not be pushed from osd to gitlab.suse.de due to "account has been blocked" and apparently no monitoring alert about this was observed added

#19 Updated by acarvajal about 5 years ago

Created 2 subtasks to tackle independently:

(1) The monitoring of the workers via the existing monitoring platform in SUSE
(2) A proof-of-concept with grafana/graphite for performance profiling

#20 Updated by acarvajal about 5 years ago

Also add to monitoring list: Threshold of jobs scheduled vs free workers

#21 Updated by okurz almost 5 years ago

  • Target version changed from future to future

#22 Updated by acarvajal almost 5 years ago

Updated sub-task with systemd timer information and salt proposal: https://progress.opensuse.org/issues/35536

#23 Updated by coolo almost 5 years ago

  • Related to action #40583: Provide job stats for telegraf to poll added

#24 Updated by szarate over 4 years ago

  • Assignee changed from acarvajal to szarate

We now have http://openqa-monitoring.qa.suse.de, which for the time being points to the grafana instance. Monitoring of workers is to be added there by EOD today, per request of schlad.

#25 Updated by szarate over 4 years ago

  • Target version changed from future to Current Sprint

Requested anonymous access/special account for monitoring purposes: https://infra.nue.suse.com/Ticket/Display.html?id=121425

#26 Updated by sebchlad over 4 years ago

  • Related to action #41189: [tools][monitoring] Worker 'reachable' notifications sent form Grafana instance added

#27 Updated by szarate over 4 years ago

One thing to note is that the dashboard is not always reliable. To fix this, two infra tickets need to be solved, as described in poo#41336. In addition, the get-metrics script has to be moved to collectd so that poo#35536 can move forward and we get better reaction and monitoring capabilities.

#28 Updated by szarate over 4 years ago

  • Related to deleted (action #41189: [tools][monitoring] Worker 'reachable' notifications sent form Grafana instance)

#29 Updated by szarate over 4 years ago

  • Related to action #41189: [tools][monitoring] Worker 'reachable' notifications sent form Grafana instance added

#30 Updated by szarate over 4 years ago

Infra reported that they can't provide a VM in the current cloud: https://infra.nue.suse.com/Ticket/Display.html?id=121613. Steven mentioned that Max would take a look at this and propose a different solution.

#31 Updated by szarate over 4 years ago

  • fqdn requested for the current grafana instance: monitoring.openqa.suse.de -> openqa-monitoring.suse.de.
  • Capacity of the grafana instance stays at ~50GB but now collectd data takes 3.5GB for all workers (only x86 for the time being due to collectd not being available on other platforms).
  • still no ETA on the nagios monitoring ticket

#32 Updated by okurz over 4 years ago

  • Subject changed from [tools] monitoring of openqa worker instances to [functional][u][tools] monitoring of openqa worker instances
  • Target version changed from Current Sprint to Milestone 20

szarate joined qsf-u

#33 Updated by szarate over 4 years ago

  • Assignee changed from szarate to nicksinger

Thought I passed this to Nick already :)

#34 Updated by szarate over 4 years ago

  • Project changed from openQA Project to openQA Infrastructure
  • Subject changed from [functional][u][tools] monitoring of openqa worker instances to [devops][tools] monitoring of openqa worker instances
  • Category deleted (168)
  • Status changed from In Progress to Blocked

Setting to blocked for the time being.
Status is the same as 21 days ago (#31), but collectd data now takes ~300MB per worker, as the CPU plugin was disabled and only the load plugin is still enabled.

#35 Updated by okurz over 4 years ago

  • Target version deleted (Milestone 20)

removing target version as tools-team does not use milestones

#36 Updated by okurz over 3 years ago

I don't think it's blocked anymore, nicksinger, WDYT?

I'm thinking of monitoring systemctl --failed. As suggested by coolo, one can run commands repeatedly and check their exit status for monitoring; small shell snippets in telegraf will do, see https://github.com/influxdata/telegraf/tree/master/plugins/inputs/exec. So count the lines of systemctl --failed | wc -l and emit the result as an influxdb value.
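
A rough sketch of that idea, written in Python rather than as a shell one-liner, might look like the following. It assumes the telegraf exec input is configured to parse InfluxDB line protocol; the measurement name `systemd_failed` and the field name `count` are made up for this example. It passes `--no-legend --plain` to systemctl because the plain output piped through `wc -l` would also count the header and summary lines.

```python
import subprocess


def count_failed_units(systemctl_output: str) -> int:
    """Count non-empty lines of `systemctl --failed --no-legend --plain` output."""
    return sum(1 for line in systemctl_output.splitlines() if line.strip())


def to_line_protocol(measurement: str, field: str, value: int) -> str:
    """Format one integer field in InfluxDB line protocol (the `i` marks an integer)."""
    return f"{measurement} {field}={value}i"


def main() -> None:
    try:
        result = subprocess.run(
            ["systemctl", "--failed", "--no-legend", "--plain"],
            capture_output=True,
            text=True,
            check=False,
        )
    except FileNotFoundError:
        # No systemctl on this host: emit nothing rather than a misleading zero.
        return
    if result.returncode != 0:
        # systemctl itself failed (e.g. systemd not running): emit nothing.
        return
    # "systemd_failed" and "count" are hypothetical names for this sketch.
    print(to_line_protocol("systemd_failed", "count", count_failed_units(result.stdout)))


if __name__ == "__main__":
    main()
```

Run periodically, the script prints a single line such as `systemd_failed count=0i`, which can then be graphed or alerted on in grafana.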

#37 Updated by nicksinger over 3 years ago

  • Status changed from Blocked to Resolved

Yes, we have something now in place: https://stats.openqa-monitor.qa.suse.de
