action #18164: [devops][tools] monitoring of openqa worker instances
Status: closed, 100% done
Description
As already mentioned by okurz in poo#12912, we need proper monitoring of all important machines related to openQA.
OSD is already in the icinga instance maintained by Infra, so I am creating this ticket to also keep track of the workers themselves.
Updated by nicksinger over 7 years ago
- Related to action #12912: [tools]monitoring of o3/osd added
Updated by okurz over 7 years ago
https://infra.nue.suse.com/SelfService/Display.html?id=63262 about monitoring of the workers has just been resolved. I fail to log in to icinga right now. Can anyone check?
Updated by RBrownSUSE over 7 years ago
- Assignee set to szarate
- Priority changed from Normal to High
- Target version set to Milestone 8
Updated by okurz over 7 years ago
- Related to action #19564: [tools]worker is unresponsive for three days but reports as online to the webui because of cache database locked? added
Updated by RBrownSUSE over 7 years ago
- Target version changed from Milestone 8 to Milestone 9
Updated by coolo about 7 years ago
- Related to deleted (action #12912: [tools]monitoring of o3/osd)
Updated by okurz about 7 years ago
- Related to action #12912: [tools]monitoring of o3/osd added
Updated by szarate about 7 years ago
- Assignee deleted (szarate)
- Target version changed from Milestone 9 to future
Moving to future, but we should tackle this eventually
Updated by szarate about 7 years ago
- Related to action #28355: [tools][bonus][Sprint 201711.2] Worker loop dies during job setup added
Updated by acarvajal almost 7 years ago
- Status changed from New to In Progress
- Assignee set to acarvajal
After discussing this issue with szarate, I will focus initially on monitoring the following for each worker:
- That it's reachable
- SSH status
- Core dumps
- Service status (os-autoinst-openvswitch, etc.)
Updated by acarvajal almost 7 years ago
After discussing this issue with coolo, here is the updated summary:
1) Focus initially on one worker, then replicate to the rest.
2) Focus on (a rough shell sketch of these checks follows below):
- That the worker is reachable
- SSH status
- salt-minion status
- openqa service status (os-autoinst-openvswitch, openqa-worker@*, etc.)
- Core dumps
- openqa jobs assigned to workers vs. scheduled
3) Review grafana as an alternative monitoring/reporting tool
4) An IT-managed nagios instance could cover some of the metrics/services to focus on
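A minimal shell sketch of the per-worker checks from point 2), assuming a placeholder worker hostname and that the checks run from a host with SSH access to the workers:

  #!/bin/sh
  # check-worker.sh - rough sketch of the per-worker checks listed above.
  # The worker hostname below is a placeholder, not a real machine name.
  WORKER="${1:-openqaworker-example.suse.de}"

  # Reachability
  if ping -c 1 -W 2 "$WORKER" >/dev/null; then echo "ping: ok"; else echo "ping: FAIL"; fi

  # SSH status
  if ssh -o BatchMode=yes -o ConnectTimeout=5 "$WORKER" true; then echo "ssh: ok"; else echo "ssh: FAIL"; fi

  # salt-minion and os-autoinst-openvswitch status
  ssh "$WORKER" 'systemctl is-active salt-minion os-autoinst-openvswitch'

  # Any failed openqa-worker instances
  ssh "$WORKER" 'systemctl list-units --state=failed --no-legend "openqa-worker@*"'

  # Core dumps collected since yesterday (exits non-zero when none are found)
  ssh "$WORKER" 'coredumpctl list --since=yesterday' || echo "coredumps: none"

The "jobs assigned vs. scheduled" point would be queried from the openQA web UI/API rather than from the worker hosts themselves.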
Updated by nicksinger almost 7 years ago
acarvajal wrote:
3) Review grafana as an alternative monitoring/reporting tool
grafana and its stack are actually more suited to performance profiling. Newer versions do include some monitoring capabilities, but IMHO it is much more efficient to use icinga/nagios for monitoring.
acarvajal wrote:
4) An IT-managed nagios instance could cover some of the metrics/services to focus on
Definitely the way to go if you really just want to get some kind of monitoring in place. The basic scenarios in particular, like ping checks, SSH checks, disk checks and so on, are already covered by the standard modules provided by infra/icinga/nagios.
Updated by acarvajal over 6 years ago
Requesting access to host group 'openqa-suse' on thruk with https://infra.nue.suse.com/SelfService/Display.html?id=109884
Updated by acarvajal over 6 years ago
New ticket created specifically to request access to openqa-suse host group: https://infra.nue.suse.com/Ticket/Display.html?id=110564
Updated by szarate over 6 years ago
- Related to action #35290: [tools] again needles could not be pushed from osd to gitlab.suse.de due to "account has been blocked" and apparently no monitoring alert about this was observed added
Updated by acarvajal over 6 years ago
Created 2 subtasks to tackle independently:
(1) The monitoring of the workers via the existing monitoring platform in SUSE
(2) A proof-of-concept with grafana/graphite for performance profiling
Updated by acarvajal over 6 years ago
Also added to the monitoring list: threshold of jobs scheduled vs. free workers (a rough sketch of how this could be polled follows below).
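A rough sketch of how such a ratio could be polled, assuming the openQA API routes for jobs and workers return "jobs" and "workers" arrays and that jq is available; the hostname and threshold are examples only:

  #!/bin/sh
  # scheduled-vs-workers.sh - sketch for the "jobs scheduled vs free workers" check.
  # HOST and the warning threshold are illustrative assumptions.
  HOST="https://openqa.suse.de"
  scheduled=$(curl -s "$HOST/api/v1/jobs?state=scheduled" | jq '.jobs | length')
  workers=$(curl -s "$HOST/api/v1/workers" | jq '.workers | length')
  echo "scheduled=$scheduled workers=$workers"
  # Example threshold: warn when the backlog exceeds twice the worker count.
  [ "$scheduled" -gt $((workers * 2)) ] && echo "WARNING: scheduling backlog"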
Updated by acarvajal over 6 years ago
Updated sub-task with systemd timer information and salt proposal: https://progress.opensuse.org/issues/35536
Updated by coolo over 6 years ago
- Related to action #40583: Provide job stats for telegraf to poll added
Updated by szarate over 6 years ago
- Assignee changed from acarvajal to szarate
We now have http://openqa-monitoring.qa.suse.de, which for the time being points to the grafana instance. Monitoring of the workers is to be added there by EOD today, per request of schlad.
Updated by szarate over 6 years ago
- Target version changed from future to Current Sprint
Requested anonymous access/special account for monitoring purposes: https://infra.nue.suse.com/Ticket/Display.html?id=121425
Updated by sebchlad over 6 years ago
- Related to action #41189: [tools][monitoring] Worker 'reachable' notifications sent form Grafana instance added
Updated by szarate over 6 years ago
One thing to note is that the dashboard is not always reliable. For this we need two infra tickets to be resolved, as described in poo#41336, and the get-metrics script has to be moved to collectd to help poo#35536 move forward and gain better reaction and monitoring capabilities.
Updated by szarate about 6 years ago
- Related to deleted (action #41189: [tools][monitoring] Worker 'reachable' notifications sent form Grafana instance)
Updated by szarate about 6 years ago
- Related to action #41189: [tools][monitoring] Worker 'reachable' notifications sent form Grafana instance added
Updated by szarate about 6 years ago
Infra reported that they can't provide a VM in the current cloud: https://infra.nue.suse.com/Ticket/Display.html?id=121613. Steven mentioned that Max would take a look at this and propose a different solution.
Updated by szarate about 6 years ago
- FQDN requested for the current grafana instance: monitoring.openqa.suse.de -> openqa-monitoring.suse.de.
- Capacity of the grafana instance stays at ~50GB, but collectd data now takes 3.5GB for all workers (x86 only for the time being, as collectd is not available on the other platforms).
- Still no ETA on the nagios monitoring ticket.
Updated by okurz about 6 years ago
- Subject changed from [tools] monitoring of openqa worker instances to [functional][u][tools] monitoring of openqa worker instances
- Target version changed from Current Sprint to Milestone 20
szarate joined qsf-u
Updated by szarate about 6 years ago
- Assignee changed from szarate to nicksinger
Thought I passed this to Nick already :)
Updated by szarate about 6 years ago
- Project changed from openQA Project (public) to openQA Infrastructure (public)
- Subject changed from [functional][u][tools] monitoring of openqa worker instances to [devops][tools] monitoring of openqa worker instances
- Category deleted (168)
- Status changed from In Progress to Blocked
Setting to blocked for the time being.
The status is the same as 21 days ago (#31), but collectd data now takes ~300MB per worker, as the CPU plugin was disabled and only the load plugin is still enabled.
Updated by okurz almost 6 years ago
- Target version deleted (Milestone 20)
Removing the target version as the tools team does not use milestones.
Updated by okurz almost 5 years ago
I don't think it's blocked anymore, nicksinger, WDYT?
Thinking of monitoring systemctl --failed. As suggested by coolo: one can run commands repeatedly and check their exit status to monitor; small shell snippets in telegraf will do: https://github.com/influxdata/telegraf/tree/master/plugins/inputs/exec. So count systemctl --failed | wc -l and emit it as an influxdb value.
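A minimal sketch of such a snippet, assuming it is wired up through telegraf's exec input; the script name and measurement name are made up for illustration:

  #!/bin/sh
  # count-failed-units.sh - hypothetical helper for telegraf's exec input.
  # Counts failed systemd units and prints the count in influx line protocol.
  failed=$(systemctl --failed --no-legend | wc -l)
  echo "systemd_failed_units count=${failed}i"

  # Referenced from the telegraf configuration roughly like this:
  # [[inputs.exec]]
  #   commands = ["/usr/local/bin/count-failed-units.sh"]
  #   data_format = "influx"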
Updated by nicksinger almost 5 years ago
- Status changed from Blocked to Resolved
Yes, we have something now in place: https://stats.openqa-monitor.qa.suse.de