action #71098
closed
openQA Project (public) - coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes
openqaworker3 down but no alert was raised
Added by okurz over 4 years ago.
Updated almost 4 years ago.
- Priority changed from Urgent to Normal
openqaworker3 was (again) stuck in recovery mode, trying to run openqa_nvme_prepare (or format?). I recovered it using the IPMI SOL console and the machine picked up jobs again.
Regarding the missing alerts, discussed with nsinger: First simple step: Consolidate the worker dashboards to use "Keep Last State" for all panels except a single one that we use to detect whether a machine is completely down. The caveat is that we cannot attach a custom message to the "No Data" alert, so we simply need to know what it means when a "No Data" alert fires. Alternatives: Use a ping from e.g. openqa-monitor to each worker. This should always return data as long as openqa-monitor itself is running. Or use a meta query?
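As a minimal sketch of the ping idea, assuming openqa-monitor runs telegraf (the hostnames and option values below are just examples, not the actual configuration):

```
[[inputs.ping]]
  # Runs on openqa-monitor itself, so data points keep arriving as long as the
  # monitor host is up; "No Data" for a worker then really means the worker is
  # unreachable rather than the collector being down.
  urls = ["openqaworker3.suse.de", "openqaworker2.suse.de"]
  count = 3
```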
okurz wrote:
First simple step: Consolidate the use of worker dashboards to use "Keep Last State" for all panels except for a single one that we will use to detect if a machine is completely down.
If we did that, it would also be activated for openqaworker-arm-1 through …3, which would be annoying because they restart a lot but are also recovered automatically.
As we already use a template for all workers, we could use salt to fill in the "no data" setting in the right position for all workers except the arm workers.
What we did now: Added a new panel "Machine pingable" to the dashboard of each worker and attached an alert to it. But then we realized that this would also raise the alert for the arm workers, which we want to avoid. So probably a better choice is to create another dashboard, based on a template, for each worker except the arm workers. We do not yet have the information about which workers are unstable in salt, except in grafana/automatic_actions.json. So we could also turn grafana/automatic_actions.json into a template with entries for each worker.
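A rough sketch of what such a template could look like; the "workers" pillar and the "alert_on_no_data" key are made up for illustration and do not exist in the repo:

```
{#- Hypothetical pillar "workers" listing every worker host with an optional
    "unstable" flag, e.g.
    workers: {openqaworker3: {}, openqaworker-arm-1: {unstable: true}} -#}
[
{%- for host, settings in salt['pillar.get']('workers', {}).items() %}
  {
    "hostname": "{{ host }}",
    "alert_on_no_data": {{ 'false' if settings.get('unstable', False) else 'true' }}
  }{{ ',' if not loop.last }}
{%- endfor %}
]
```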
Maybe we can use more grafana variables here?
Created a draft in https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/354 and started a discussion in chat about the open points:
<okurz> hi. Do you have an idea how I could in https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/354/diffs#2d8bb0912fa3517d0e03bf383c59a7b32f8229d1_135_135 get grains for each worker based on "nodename"?
<nsinger> hm, not sure if I get your question 100% right but you should be able to use `grains.get("unstable", False)` no?
<nsinger> "based on nodename" is implicit with grains since they get assigned to every host individually
<okurz> well but it's not like we want to evaluate the salt code on each worker. This is the grafana panels we want to generate on monitor.qa
<nsinger> right. I think then the salt mine is the right place for you 😉 Let me dig up an example we already use
<nsinger> so https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/monitoring/grafana.sls#L3 is how to access the grain/attribute. I just don't find the place where we populate it in the mine
<nsinger> right, it was in the pillars: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/salt/mine.sls - I think if you add your unstable grain here then you can access it from the master through the mine. Makes sense?
<okurz> yes. but does `salt['mine.get']('roles:worker', 'nodename', tgt_type='grain')` give me the single value or list?
<nsinger> it should be a list with the salt-id as key and the requested grain(s) as list of values. But you can simply give it a try on the command line: `salt '*' mine.get 'roles:worker' nodename grain`
<okurz> that's a good hint, thx
<okurz> I don't understand https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/salt/mine.sls though.
<nsinger> IIRC it is a function inside the mine called "nodename" which is calling "grains.get" with the parameter "nodename"
<nsinger> which would match `salt['mine.get']('roles:worker', 'nodename', tgt_type='grain')` because the second parameter is the function in the mine to call
<okurz> I see. Would we need to define another function "unstable" just to get the grain "unstable"? I don't understand the difference between this approach here and other places where we directly access grains, like the "role" we have for worker, monitoring and webui
<nsinger> I don't see a place where we do it differently. AFAIK besides the nodename we do not make use of the mine but yeah, if you think it's wrong go play around with it and find a better solution 🙂
<nsinger> I don't say my solution is the way to go, it's just how it worked for me 😉
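For reference, a sketch of what the mine-based approach from the chat could look like; the "unstable" grain and the corresponding mine function are assumptions taken from the discussion, not existing code:

```
# salt/mine.sls (pillar): expose the hypothetical "unstable" grain through the
# mine, analogous to the existing "nodename" function described above
mine_functions:
  unstable:
    - mine_function: grains.get
    - unstable

# openqa/monitoring/grafana.sls (jinja, rendered on the monitor host):
{%- set nodenames = salt['mine.get']('roles:worker', 'nodename', tgt_type='grain') %}
{%- set unstable  = salt['mine.get']('roles:worker', 'unstable', tgt_type='grain') %}
{%- for minion_id, nodename in nodenames.items() %}
{%-   if not unstable.get(minion_id, False) %}
# ... generate the "Machine pingable" panel and its alert for {{ nodename }} here ...
{%-   endif %}
{%- endfor %}
```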
- Status changed from In Progress to Feedback
For now I am hoping for some good ideas from others. Otherwise it would be a lengthy investigation cycle for me to figure out how to iterate over all worker data in a jinja template and insert the corresponding strings based on it.
okurz wrote:
Maybe we can use more grafana variables here?
Today I found out that grafana template variables cannot be used in alert queries. If a query includes e.g. host =~ /^$host$/ instead of host = 'openqa', the alert settings will show the error "Template variables are not supported in alert queries". This seems to be a highly desired feature request for grafana, see https://github.com/grafana/grafana/issues/6557
That's the reason we create other alerts in salt templates.
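For illustration, the difference could look like this; measurement, field and host names are placeholders rather than the real dashboards. A templated query like

```
SELECT mean("ping_ms") FROM "ping" WHERE "host" =~ /^$host$/ AND $timeFilter GROUP BY time(1m)
```

is accepted as a dashboard query but rejected as soon as an alert is attached, whereas the salt/jinja template can generate one panel per worker with the hostname hard-coded:

```
SELECT mean("ping_ms") FROM "ping" WHERE "host" = 'openqaworker3' AND $timeFilter GROUP BY time(1m)
```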
do you mean this is the reason why the worker dashboards use the jinja template managed by salt?
Can you clarify which jinja or salt templates you're referring to? I'm wondering whether I can contribute to the brainstorming but I'm a bit lost :-D
- Status changed from Feedback to In Progress
- Priority changed from Normal to Low
As the above approach isn't leading anywhere, I am trying a different way now.
On the webUI summary dashboard we have an "Average Ping time" panel where we fill non-existent values with "null", hence we do not receive alerts if ping does not return anything. There is an alert attached to that panel, but only for "high ping times". We could either not fill the values and alert on that, or keep filling with null but actually report an exact value of zero (or null?) to alert on. I am creating a new panel on "WIP" for experimentation. This will likely take some time to gather enough results as our hosts are not down that often, so I am also setting the prio to "Low".
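A sketch of the two variants, with placeholder measurement and field names (the real panel may differ): today the query fills gaps with null, so the existing threshold alert never fires for a dead host,

```
SELECT mean("average_response_ms") FROM "ping" WHERE "host" = 'openqaworker3' AND $timeFilter GROUP BY time(1m) fill(null)
```

whereas filling gaps with 0 would let an alert on "value below threshold" (or the explicit "no data" setting) catch a host that stopped responding:

```
SELECT mean("average_response_ms") FROM "ping" WHERE "host" = 'openqaworker3' AND $timeFilter GROUP BY time(1m) fill(0)
```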
- Due date set to 2020-11-30
Setting due date based on mean cycle time of SUSE QE Tools
- Estimated time set to 80142.00 h
- Estimated time deleted (80142.00 h)
- Status changed from In Progress to Feedback
- Parent task set to #80142
The alarm I put in hasn't triggered in the past days even though I can see that qa-power8-5-kvm is currently (still) down. I changed the alert to also trigger explicitly on "no data". Let's see if this triggers. If not, I plan to deliberately power off a host on the weekend, when fewer resources are needed, and check that the alert triggers.
- Related to action #78010: unreliable reboots on openqaworker3, likely due to openqa_nvme_format (was: [alert] PROBLEM Host Alert: openqaworker3.suse.de is DOWN) added
- Status changed from Feedback to Resolved
- Due date deleted (2020-11-30)