action #18164
[devops][tools] monitoring of openqa worker instances (closed)
Added by nicksinger over 7 years ago. Updated almost 5 years ago.
Done: 100%
Description
As already mentioned by okurz in poo#12912, we need proper monitoring of all important machines related to openQA.
OSD is already in the icinga instance maintained by Infra, so I created this ticket to also keep track of the workers themselves.
Updated by nicksinger over 7 years ago
- Related to action #12912: [tools]monitoring of o3/osd added
Updated by okurz over 7 years ago
https://infra.nue.suse.com/SelfService/Display.html?id=63262 about monitoring of the workers has just been resolved. I fail to log in to icinga right now. Can anyone check?
Updated by RBrownSUSE over 7 years ago
- Assignee set to szarate
- Priority changed from Normal to High
- Target version set to Milestone 8
Updated by okurz over 7 years ago
- Related to action #19564: [tools]worker is unresponsive for three days but reports as online to the webui because of cache database locked? added
Updated by RBrownSUSE over 7 years ago
- Target version changed from Milestone 8 to Milestone 9
Updated by coolo about 7 years ago
- Related to deleted (action #12912: [tools]monitoring of o3/osd)
Updated by okurz about 7 years ago
- Related to action #12912: [tools]monitoring of o3/osd added
Updated by szarate about 7 years ago
- Assignee deleted (szarate)
- Target version changed from Milestone 9 to future
Moving to future, but we should tackle this eventually
Updated by szarate about 7 years ago
- Related to action #28355: [tools][bonus][Sprint 201711.2] Worker loop dies during job setup added
Updated by acarvajal almost 7 years ago
- Status changed from New to In Progress
- Assignee set to acarvajal
Discussing with szarate regarding this issue, will focus initially on monitoring the following for each worker:
- That it's reachable
- SSH status
- Core dumps
- Services status (os-autoinst-openvswitch, etc)
Updated by acarvajal almost 7 years ago
Discussing with coolo regarding this issue, updated summary:
1) Focus initially on one worker, and then replicate to the rest.
2) Focus on:
- That the worker is reachable
- SSH status
- salt-minion status
- openqa services status (os-autoinst-openvswitch, openqa-worker@*, etc)
- Core dumps
- openqa jobs assigned to workers vs scheduled
3) Review grafana as an alternate monitoring/report tool
4) IT-managed nagios instance could solve some of the metrics/services to focus on
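The first few checks in the list above could be sketched as a small per-worker script in nagios-style "OK/CRITICAL" output. This is only an illustration: the worker FQDN and the checked service are placeholders, not actual configuration.

```shell
#!/bin/sh
# Per-worker check sketch for the monitoring list above.
# The worker FQDN below is a hypothetical example.

# report NAME STATUS: print one nagios-style result line
report() {
  if [ "$2" -eq 0 ]; then
    echo "OK - $1"
  else
    echo "CRITICAL - $1"
  fi
}

host="openqaworker1.suse.de"   # hypothetical worker FQDN

# Reachability (ping) and SSH, each with a short timeout.
ping -c1 -W2 "$host" >/dev/null 2>&1
report "ping $host" $?

ssh -o ConnectTimeout=2 -o BatchMode=yes "$host" true >/dev/null 2>&1
report "ssh $host" $?

# Service status via a remote systemctl call.
ssh -o ConnectTimeout=2 -o BatchMode=yes "$host" \
  systemctl is-active --quiet os-autoinst-openvswitch >/dev/null 2>&1
report "os-autoinst-openvswitch on $host" $?
```

Each check only feeds its exit status into a uniform report line, which is roughly the shape an icinga/nagios check plugin would take.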
Updated by nicksinger almost 7 years ago
acarvajal wrote:
3) Review grafana as an alternate monitoring/report tool
grafana and its stack are actually more suited for performance profiling. Newer versions do include some monitoring capabilities, but IMHO it's far more efficient to use icinga/nagios for monitoring.
acarvajal wrote:
4) IT-managed nagios instance could solve some of the metrics/services to focus on
Definitely the way to go if you really just want to get some kind of monitoring. Especially the basic scenarios like ping check, SSH check, disk check and so on are already covered by the basic modules provided by infra/icinga/nagios.
Updated by acarvajal over 6 years ago
Requesting access to host group 'openqa-suse' on thruk with https://infra.nue.suse.com/SelfService/Display.html?id=109884
Updated by acarvajal over 6 years ago
New ticket created specifically to request access to openqa-suse host group: https://infra.nue.suse.com/Ticket/Display.html?id=110564
Updated by szarate over 6 years ago
- Related to action #35290: [tools] again needles could not be pushed from osd to gitlab.suse.de due to "account has been blocked" and apparently no monitoring alert about this was observed added
Updated by acarvajal over 6 years ago
Created 2 subtasks to tackle independently:
(1) The monitoring of the workers via the existing monitoring platform in SUSE
(2) A proof-of-concept with grafana/graphite for performance profiling
Updated by acarvajal over 6 years ago
Also add to monitoring list: Threshold of jobs scheduled vs free workers
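That "scheduled vs. free" threshold could look something like the sketch below. The counts would in practice come from the openQA API (for instance the jobs and workers routes); here they are plain arguments, and the function name and threshold factor are hypothetical.

```shell
#!/bin/sh
# Hypothetical threshold check: scheduled jobs vs. free workers.
# backlog_alert SCHEDULED FREE FACTOR prints OK or an ALERT line.
backlog_alert() {
  scheduled=$1; free=$2; factor=$3
  if [ "$free" -eq 0 ] && [ "$scheduled" -gt 0 ]; then
    echo "ALERT: $scheduled jobs scheduled but no free workers"
  elif [ "$scheduled" -gt $((free * factor)) ]; then
    echo "ALERT: backlog of $scheduled jobs exceeds ${factor}x the $free free workers"
  else
    echo "OK"
  fi
}

backlog_alert 120 10 5   # backlog: 120 > 10*5, triggers an ALERT line
backlog_alert 30 10 5    # within threshold, prints OK
```

Keeping the threshold logic in one small function makes it easy to plug the same check into either nagios or a telegraf exec input later.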
Updated by acarvajal over 6 years ago
Updated sub-task with systemd timer information and salt proposal: https://progress.opensuse.org/issues/35536
Updated by coolo over 6 years ago
- Related to action #40583: Provide job stats for telegraf to poll added
Updated by szarate over 6 years ago
- Assignee changed from acarvajal to szarate
We now have http://openqa-monitoring.qa.suse.de, which for the time being points to the grafana instance; monitoring of workers is to be added there by today EOD per request of schlad.
Updated by szarate over 6 years ago
- Target version changed from future to Current Sprint
Requested anonymous access/special account for monitoring purposes: https://infra.nue.suse.com/Ticket/Display.html?id=121425
Updated by sebchlad over 6 years ago
- Related to action #41189: [tools][monitoring] Worker 'reachable' notifications sent form Grafana instance added
Updated by szarate over 6 years ago
One thing to note is that the dashboard is not always reliable; for this we need two tickets from infra to be solved, as expressed in poo#41336. Also, the get-metrics script has to be moved to collectd to help poo#35536 move forward and to provide better reaction and monitoring capabilities.
Updated by szarate about 6 years ago
- Related to deleted (action #41189: [tools][monitoring] Worker 'reachable' notifications sent form Grafana instance)
Updated by szarate about 6 years ago
- Related to action #41189: [tools][monitoring] Worker 'reachable' notifications sent form Grafana instance added
Updated by szarate about 6 years ago
infra reported that they can't provide a VM in the current cloud: https://infra.nue.suse.com/Ticket/Display.html?id=121613. Steven mentioned that Max would take a look at this and propose a different solution.
Updated by szarate about 6 years ago
- FQDN requested for the current grafana instance: monitoring.openqa.suse.de -> openqa-monitoring.suse.de.
- Capacity of the grafana instance stays at ~50GB, but collectd data now takes 3.5GB for all workers (only x86 for the time being, as collectd is not available on the other platforms).
- Still no ETA on the nagios monitoring ticket.
Updated by okurz about 6 years ago
- Subject changed from [tools] monitoring of openqa worker instances to [functional][u][tools] monitoring of openqa worker instances
- Target version changed from Current Sprint to Milestone 20
szarate joined qsf-u
Updated by szarate about 6 years ago
- Assignee changed from szarate to nicksinger
Thought I passed this to Nick already :)
Updated by szarate about 6 years ago
- Project changed from openQA Project (public) to openQA Infrastructure (public)
- Subject changed from [functional][u][tools] monitoring of openqa worker instances to [devops][tools] monitoring of openqa worker instances
- Category deleted (168)
- Status changed from In Progress to Blocked
Setting to blocked for the time being.
Status is the same as 21 days ago (#31), but collectd data is now taking ~300MB per worker, since the CPU plugin was disabled and only the load plugin is still enabled.
Updated by okurz almost 6 years ago
- Target version deleted (Milestone 20)
removing target version as tools-team does not use milestones
Updated by okurz almost 5 years ago
I don't think it's blocked anymore, nicksinger, WDYT?
Thinking of monitoring systemctl --failed. As suggested by coolo, one can run commands repeatedly and check their exit status to monitor; small shell snippets in telegraf will do: https://github.com/influxdata/telegraf/tree/master/plugins/inputs/exec. So count systemctl --failed | wc -l and emit it as an influxdb value.
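The exec-plugin idea above could be sketched as a tiny script that telegraf runs periodically. The measurement name and script path are hypothetical; the telegraf config fragment in the comment uses the real [[inputs.exec]] plugin keys.

```shell
#!/bin/sh
# Sketch for the telegraf exec input idea above.
# telegraf.conf would reference the script roughly like this:
#   [[inputs.exec]]
#     commands = ["/usr/local/bin/failed_units.sh"]   # hypothetical path
#     data_format = "influx"

# Count failed systemd units; fall back to 0 where systemctl is unavailable.
if command -v systemctl >/dev/null 2>&1; then
  failed=$(systemctl --failed --no-legend 2>/dev/null | wc -l | tr -d ' ')
else
  failed=0
fi

# Emit one InfluxDB line-protocol point with an integer field
# (hypothetical measurement name).
echo "systemd_failed_units count=${failed}i"
```

A value above zero on any worker can then drive a grafana alert, which is the "spit it as influxdb value" approach in one place.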
Updated by nicksinger almost 5 years ago
- Status changed from Blocked to Resolved
Yes, we have something now in place: https://stats.openqa-monitor.qa.suse.de