action #18164

closed

[devops][tools] monitoring of openqa worker instances

Added by nicksinger over 7 years ago. Updated almost 5 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version: -
Start date: 2018-04-25
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)

Description

As okurz already mentioned in poo#12912, we need proper monitoring of all machines that are important to openQA.
OSD is already covered by the icinga instance maintained by Infra, so I am creating this ticket to also keep track of the workers themselves.


Subtasks 4 (0 open, 4 closed)

action #35533: [tools] Monitoring of openqa worker instances via existing SUSE Infra services | Resolved | okurz | 2018-04-25
action #35536: [tools] Performance Profiling of openQA workers & OSD | Rejected | acarvajal | 2018-04-25
action #41336: Create a monitoring dashboard for openqa.suse.de | Resolved | 2018-09-19
action #41975: Evaluate graphite vs prometheus | Rejected | 2018-10-04

Related issues 6 (0 open, 6 closed)

Related to openQA Project (public) - action #19564: [tools]worker is unresponsive for three days but reports as online to the webui because of cache database locked? | Closed | 2017-06-04
Related to openQA Project (public) - action #12912: [tools]monitoring of o3/osd | Resolved | okurz | 2016-07-28
Related to openQA Project (public) - action #28355: [tools][bonus][Sprint 201711.2] Worker loop dies during job setup | Resolved | EDiGiacinto | 2017-11-24
Related to openQA Infrastructure (public) - action #35290: [tools] again needles could not be pushed from osd to gitlab.suse.de due to "account has been blocked" and apparently no monitoring alert about this was observed | Resolved | okurz | 2018-04-20
Related to openQA Project (public) - action #40583: Provide job stats for telegraf to poll | Resolved | coolo | 2018-09-04
Related to openQA Infrastructure (public) - action #41189: [tools][monitoring] Worker 'reachable' notifications sent form Grafana instance | Resolved | nicksinger | 2018-09-18

Actions #1

Updated by nicksinger over 7 years ago

Actions #2

Updated by okurz over 7 years ago

https://infra.nue.suse.com/SelfService/Display.html?id=63262 about monitoring of the workers has just been resolved. I cannot log in to icinga right now. Can anyone check?

Actions #3

Updated by RBrownSUSE over 7 years ago

  • Assignee set to szarate
  • Priority changed from Normal to High
  • Target version set to Milestone 8
Actions #4

Updated by okurz over 7 years ago

  • Related to action #19564: [tools]worker is unresponsive for three days but reports as online to the webui because of cache database locked? added
Actions #5

Updated by RBrownSUSE over 7 years ago

  • Target version changed from Milestone 8 to Milestone 9
Actions #6

Updated by RBrownSUSE over 7 years ago

  • Priority changed from High to Normal
Actions #7

Updated by coolo about 7 years ago

  • Related to deleted (action #12912: [tools]monitoring of o3/osd)
Actions #8

Updated by okurz about 7 years ago

Actions #9

Updated by szarate about 7 years ago

  • Assignee deleted (szarate)
  • Target version changed from Milestone 9 to future

Moving to future, but we should tackle this eventually

Actions #10

Updated by szarate about 7 years ago

  • Related to action #28355: [tools][bonus][Sprint 201711.2] Worker loop dies during job setup added
Actions #11

Updated by acarvajal almost 7 years ago

  • Status changed from New to In Progress
  • Assignee set to acarvajal

After discussing this issue with szarate, I will focus initially on monitoring the following for each worker:

  • That it's reachable
  • SSH status
  • Core dumps
  • Services status (os-autoinst-openvswitch, etc)
Actions #12

Updated by szarate almost 7 years ago

Also salt-minion

Actions #13

Updated by acarvajal almost 7 years ago

After discussing this issue with coolo, here is the updated summary:

1) Focus initially on one worker, and then replicate to the rest.
2) Focus on:

  • That the worker is reachable
  • SSH status
  • salt-minion status
  • openqa services status (os-autoinst-openvswitch, openqa-worker@*, etc)
  • Core dumps
  • openqa jobs assigned to workers vs scheduled

3) Review grafana as an alternate monitoring/report tool
4) IT-managed nagios instance could solve some of the metrics/services to focus on
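
The per-worker checks in the list above could be sketched roughly as follows. This is only an illustration under assumptions, not what was deployed: the helper names (`port_open`, `unit_states`, `failing_units`) are hypothetical, and the only real tooling relied on is a TCP connect and `systemctl is-active`.

```python
import socket
import subprocess


def port_open(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds (e.g. SSH on 22)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def unit_states(units, run=subprocess.run):
    """Map each systemd unit to the state printed by `systemctl is-active`."""
    states = {}
    for unit in units:
        result = run(
            ["systemctl", "is-active", unit],
            capture_output=True, text=True, check=False,
        )
        # An empty answer is recorded as "unknown" instead of being dropped.
        states[unit] = result.stdout.strip() or "unknown"
    return states


def failing_units(states):
    """Return the sorted units whose state is anything other than 'active'."""
    return sorted(unit for unit, state in states.items() if state != "active")
```

On a worker one might call `failing_units(unit_states(["salt-minion", "os-autoinst-openvswitch", "openqa-worker@1"]))`, where the instance number `1` is an assumption, and call `port_open` from a separate machine to cover reachability and SSH.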

Actions #15

Updated by nicksinger almost 7 years ago

acarvajal wrote:

3) Review grafana as an alternate monitoring/report tool

grafana and its stack are actually better suited to performance profiling. Newer versions do include some monitoring capabilities, but IMHO it's far more efficient to use icinga/nagios for monitoring.

acarvajal wrote:

4) IT-managed nagios instance could solve some of the metrics/services to focus on

Definitely the way to go if you really just want to get some kind of monitoring. Especially the basic scenarios like ping-check, ssh-check, disk-check and so on are already covered in the basic modules provided by infra/icinga/nagios.
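
For context, a check such as the disk-check mentioned above follows the usual Nagios/Icinga plugin convention: print one status line and exit with 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). The sketch below is not the module shipped by infra. The thresholds, function names and message wording are assumptions chosen for illustration.

```python
import shutil

# Conventional Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify_usage(used_percent, warn=80.0, crit=90.0):
    """Translate a disk usage percentage into an exit code and a status line."""
    if used_percent >= crit:
        return CRITICAL, f"DISK CRITICAL - {used_percent:.1f}% used"
    if used_percent >= warn:
        return WARNING, f"DISK WARNING - {used_percent:.1f}% used"
    return OK, f"DISK OK - {used_percent:.1f}% used"


def check_disk(path="/", warn=80.0, crit=90.0):
    """Measure usage of the filesystem containing path and classify it."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - {exc}"
    if usage.total == 0:
        return UNKNOWN, "DISK UNKNOWN - filesystem reports zero size"
    return classify_usage(100.0 * usage.used / usage.total, warn, crit)
```

A plugin script would print the returned message and pass the returned code to `sys.exit`, which is what the monitoring server evaluates.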

Actions #16

Updated by acarvajal over 6 years ago

Requesting access to host group 'openqa-suse' on thruk with https://infra.nue.suse.com/SelfService/Display.html?id=109884

Actions #17

Updated by acarvajal over 6 years ago

New ticket created specifically to request access to openqa-suse host group: https://infra.nue.suse.com/Ticket/Display.html?id=110564

Actions #18

Updated by szarate over 6 years ago

  • Related to action #35290: [tools] again needles could not be pushed from osd to gitlab.suse.de due to "account has been blocked" and apparently no monitoring alert about this was observed added
Actions #19

Updated by acarvajal over 6 years ago

Created 2 subtasks to tackle independently:

(1) The monitoring of the workers via the existing monitoring platform in SUSE
(2) A proof-of-concept with grafana/graphite for performance profiling

Actions #20

Updated by acarvajal over 6 years ago

Also add to monitoring list: Threshold of jobs scheduled vs free workers
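
One possible reading of such a threshold, sketched under assumptions: alert when the number of scheduled jobs per free worker exceeds a limit. The function name, the default limit of 10 and the handling of the zero cases are all hypothetical and are not taken from this ticket.

```python
def backlog_alert(scheduled_jobs, free_workers, max_jobs_per_free_worker=10):
    """Return True when scheduled jobs exceed the allowed ratio to free workers."""
    if scheduled_jobs < 0 or free_workers < 0:
        raise ValueError("counts must be non-negative")
    if scheduled_jobs == 0:
        return False  # nothing is waiting, so there is no backlog
    if free_workers == 0:
        return True  # jobs are waiting and no worker can pick them up
    return scheduled_jobs / free_workers > max_jobs_per_free_worker
```

The two inputs would have to come from the openQA job and worker statistics. Where exactly they are read from is left open here.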

Actions #21

Updated by okurz over 6 years ago

  • Target version changed from future to future
Actions #22

Updated by acarvajal over 6 years ago

Updated sub-task with systemd timer information and salt proposal: https://progress.opensuse.org/issues/35536

Actions #23

Updated by coolo over 6 years ago

  • Related to action #40583: Provide job stats for telegraf to poll added
Actions #24

Updated by szarate over 6 years ago

  • Assignee changed from acarvajal to szarate

We now have http://openqa-monitoring.qa.suse.de, which for the time being points to the grafana instance. Monitoring of workers is to be added there by EOD today, per request of schlad.

Actions #25

Updated by szarate over 6 years ago

  • Target version changed from future to Current Sprint

Requested anonymous access/special account for monitoring purposes: https://infra.nue.suse.com/Ticket/Display.html?id=121425

Actions #26

Updated by sebchlad over 6 years ago

  • Related to action #41189: [tools][monitoring] Worker 'reachable' notifications sent form Grafana instance added
Actions #27

Updated by szarate over 6 years ago

One thing to note is that the dashboard is not always reliable. To fix this, two infra tickets need to be solved, as expressed in poo#41336, and the get-metrics script has to be moved to collectd so that poo#35536 can move forward and we get better reaction and monitoring capabilities.

Actions #28

Updated by szarate about 6 years ago

  • Related to deleted (action #41189: [tools][monitoring] Worker 'reachable' notifications sent form Grafana instance)
Actions #29

Updated by szarate about 6 years ago

  • Related to action #41189: [tools][monitoring] Worker 'reachable' notifications sent form Grafana instance added
Actions #30

Updated by szarate about 6 years ago

Infra reported that they can't provide a VM in the current cloud: https://infra.nue.suse.com/Ticket/Display.html?id=121613. Steven mentioned that Max would take a look at this and propose a different solution.

Actions #31

Updated by szarate about 6 years ago

  • fqdn requested for the current grafana instance: monitoring.openqa.suse.de -> openqa-monitoring.suse.de.
  • Capacity of the grafana instance stays at ~50GB but now collectd data takes 3.5GB for all workers (only x86 for the time being due to collectd not being available on other platforms).
  • still no ETA on the nagios monitoring ticket
Actions #32

Updated by okurz about 6 years ago

  • Subject changed from [tools] monitoring of openqa worker instances to [functional][u][tools] monitoring of openqa worker instances
  • Target version changed from Current Sprint to Milestone 20

szarate joined qsf-u

Actions #33

Updated by szarate about 6 years ago

  • Assignee changed from szarate to nicksinger

Thought I passed this to Nick already :)

Actions #34

Updated by szarate about 6 years ago

  • Project changed from openQA Project (public) to openQA Infrastructure (public)
  • Subject changed from [functional][u][tools] monitoring of openqa worker instances to [devops][tools] monitoring of openqa worker instances
  • Category deleted (168)
  • Status changed from In Progress to Blocked

Setting to blocked for the time being.
Status is the same as 21 days ago (#31), but collectd data is now taking ~300MB per worker, as the CPU plugin was disabled and only load is still enabled.

Actions #35

Updated by okurz almost 6 years ago

  • Target version deleted (Milestone 20)

removing target version as tools-team does not use milestones

Actions #36

Updated by okurz almost 5 years ago

I don't think it's blocked anymore, nicksinger, WDYT?

I am thinking of monitoring systemctl --failed. As suggested by coolo, one can run commands repeatedly and check their exit status to monitor; small shell snippets in telegraf will do, see https://github.com/influxdata/telegraf/tree/master/plugins/inputs/exec. So count the failed units with systemctl --failed | wc -l and emit the result as an influxdb value.
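
A minimal sketch of this idea, written in Python instead of a shell snippet so that the counting can be tested. It assumes telegraf's exec input runs the script with the influx data format and reads one line-protocol point from stdout. The measurement name `systemd_failed` and the field name `failed_units` are made up for illustration. The `--no-legend --plain` flags keep the header and summary lines out of the count, which a bare `wc -l` would include.

```python
import subprocess


def count_failed_units(systemctl_output):
    """Count units in the output of `systemctl --failed --no-legend --plain`."""
    return sum(1 for line in systemctl_output.splitlines() if line.strip())


def to_line_protocol(count, measurement="systemd_failed"):
    """Format the count as one InfluxDB line-protocol point (integer field)."""
    return f"{measurement} failed_units={count}i"


def collect(run=subprocess.run):
    """Run systemctl and return the line telegraf should read from stdout."""
    result = run(
        ["systemctl", "--failed", "--no-legend", "--plain"],
        capture_output=True, text=True, check=False,
    )
    return to_line_protocol(count_failed_units(result.stdout))
```

The script that telegraf executes would simply `print(collect())`. An alert could then fire whenever `failed_units` is greater than zero.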

Actions #37

Updated by nicksinger almost 5 years ago

  • Status changed from Blocked to Resolved

Yes, we now have something in place: https://stats.openqa-monitor.qa.suse.de
