action #71098

openQA Project - coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

openqaworker3 down but no alert was raised

Added by okurz 11 months ago. Updated 2 months ago.

Status: Resolved
Priority: Low
Assignee: -
Target version: -
Start date: 2020-09-08
Due date: -
% Done: 0%
Estimated time: -
Description

Observation

dzedro reported in https://chat.suse.de/channel/testing?msg=mq6q7RGxM2jznNAsR that openqaworker3 is down. No alert was raised for this. It seems we do not have a consistent alert for a host simply being down. This should be cross-checked in grafana.


Related issues

Related to openQA Infrastructure - action #78010: unreliable reboots on openqaworker3, likely due do openqa_nvme_format (was: [alert] PROBLEM Host Alert: openqaworker3.suse.de is DOWN) (Resolved, 2020-11-16 to 2021-04-21)

History

#1 Updated by okurz 11 months ago

  • Priority changed from Urgent to Normal

openqaworker3 was (again) stuck in recovery mode, trying to run openqa_nvme_prepare (or format?). I recovered it over the IPMI SOL console and the machine picked up jobs again.

Regarding the missing alerts, discussed with nsinger. First simple step: consolidate the worker dashboards to use "Keep Last State" for all panels except a single one that we use to detect whether a machine is completely down. The caveat is that we cannot give a custom message for the "No Data" alert, so we just need to know what an alerting "No Data" means. Alternatives: ping each worker from e.g. openqa-monitor, which should always return data as long as openqa-monitor itself is running. Or use a meta query?
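The ping alternative could be sketched roughly as follows, assuming telegraf is the collector on openqa-monitor (hostnames and values below are illustrative, not the actual configuration):

```toml
# Hypothetical telegraf snippet on openqa-monitor: actively ping every
# worker, so the "ping" measurement always has data while the monitor
# host itself is up. A down worker then shows up as 100 % packet loss
# instead of as "No Data" on the panel.
[[inputs.ping]]
  urls = ["openqaworker2.suse.de", "openqaworker3.suse.de"]
  count = 3       # packets per collection interval
  timeout = 1.0   # seconds to wait per packet
```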

#2 Updated by okurz 11 months ago

okurz wrote:

First simple step: Consolidate the use of worker dashboards to use "Keep Last State" for all panels except for a single one that we will use to detect if a machine is completely down.

If we did that, the alert would also be activated for openqaworker-arm-1 through …3, which would be annoying as they restart a lot but are also recovered automatically.

As we already use a template for all workers, we could use salt to fill in the "no data" handling for all workers except the arm workers.

What we did now: added a new panel "Machine pingable" to the dashboard of each worker and attached an alert to it. But then we realized that this alert would also fire for the arm workers, which we want to avoid. So probably a better choice is to create another dashboard, based on a template, for each worker except the arm workers. The information about which workers are unstable currently exists only in grafana/automatic_actions.json, not in salt. So what we could also do is turn grafana/automatic_actions.json into a template with entries for each worker.

Maybe we can use more grafana variables here?
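The per-worker "no data" handling could look roughly like this inside the jinja-rendered dashboard template. This is only a sketch; the variable `worker` and the `unstable_hosts` list are assumptions, only the `noDataState` values are standard grafana alert settings:

```jinja
{# Hypothetical sketch: while rendering worker.json.template, choose
   the "no data" behaviour per host; arm workers keep their last state
   instead of alerting. #}
{% set unstable_hosts = ['openqaworker-arm-1', 'openqaworker-arm-2', 'openqaworker-arm-3'] %}
{% if worker in unstable_hosts %}
  "noDataState": "keep_state",
{% else %}
  "noDataState": "alerting",
{% endif %}
```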

#3 Updated by okurz 10 months ago

Created a draft in https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/354
and discussion in chat about open points:

<okurz> hi. Do you have an idea how I could in https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/354/diffs#2d8bb0912fa3517d0e03bf383c59a7b32f8229d1_135_135 get grains for each worker based on "nodename"?
<nsinger> hm, not sure if I get your question 100% right but you should be able to use `grains.get("unstable", False)` no?
<nsinger> "based on nodename" is implicit with grains since they get assigned to every host individually
<okurz> well but it's not like we want to evaluate the salt code on each worker. This is the grafana panels we want to generate on monitor.qa
<nsinger> right. I think then the salt mine is the right place for you 😉 Let me dig up an example we already use
<nsinger> so https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/monitoring/grafana.sls#L3 is how to access the grain/attribute. I just don't find the place where we populate it in the mine
<nsinger> right, it was in the pillars: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/salt/mine.sls - I think if you add your unstable grain here then you can access it from the master through the mine. Makes sense?
<okurz> yes. but does `salt['mine.get']('roles:worker', 'nodename', tgt_type='grain')` give me the single value or list?
<nsinger> it should be a list with the salt-id as key and the requested grain(s) as list of values. But you can simply give it a try on the command line: `salt '*' mine.get 'roles:worker' nodename grain`
<okurz> that's a good hint, thx
<okurz> I don't understand https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/salt/mine.sls though.
<nsinger> IIRC it is a function inside the mine called "nodename" which is calling "grains.get" with the parameter "nodename"
<nsinger> which would match `salt['mine.get']('roles:worker', 'nodename', tgt_type='grain')` because the second parameter is the function in the mine to call
<okurz> I see. Would we need to define another function "unstable" just to get the grain "unstable"? I don't understand the difference between this approach here and other places where we directly access grains, like the "role" we have for worker, monitoring and webui
<nsinger> I don't see a place where we do it differently. AFAIK besides the nodename we do not make use of the mine but yeah, if you think it's wrong go play around with it and find a better solution 🙂
<nsinger> I don't say my solution is the way to go, it's just how it worked for me 😉
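Putting the chat together, the mine-based iteration would look roughly like this in a jinja-rendered state file. This is a sketch only; `mine.get` returning a dict of minion id to mine function result matches nsinger's description above, everything else is an assumption:

```jinja
{# Sketch: iterate over all worker nodenames from the salt mine on the
   master; mine.get returns a mapping of minion id -> mine value. #}
{% for minion_id, nodename in salt['mine.get']('roles:worker', 'nodename', tgt_type='grain').items() %}
  {# render per-worker dashboard/alert config for {{ nodename }} here #}
{% endfor %}
```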

#4 Updated by okurz 10 months ago

  • Status changed from In Progress to Feedback

For now I am hoping for some good ideas from others. Otherwise it would be a lengthy investigation cycle for me to figure out how to iterate over all worker data in the jinja template and insert the according strings.

#5 Updated by okurz 10 months ago

okurz wrote:

Maybe we can use more grafana variables here?

Today I found out that grafana template variables cannot be used in alert queries. If a query includes e.g. host =~ /^$host$/ instead of host = 'openqa', the alert settings show the error "Template variables are not supported in alert queries". This seems to be a highly requested feature in grafana, see https://github.com/grafana/grafana/issues/6557
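To illustrate the limitation with a pair of InfluxQL queries (measurement and field names here are made up for the example):

```sql
-- Accepted in an alert query: a literal host tag value
SELECT mean("used") FROM "memory"
  WHERE host = 'openqa' AND $timeFilter GROUP BY time(1m)

-- Rejected by alert validation ("Template variables are not supported
-- in alert queries"): the same query using a dashboard variable
SELECT mean("used") FROM "memory"
  WHERE host =~ /^$host$/ AND $timeFilter GROUP BY time(1m)
```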

#6 Updated by coolo 10 months ago

That's the reason we create other alerts in salt templates

#7 Updated by okurz 10 months ago

Do you mean that this is the reason why the worker dashboards use the jinja template managed by salt?

#8 Updated by cdywan 9 months ago

Can you clarify which jinja or salt templates you're referring to? I was thinking about whether I can contribute to the brainstorming, but I'm a bit lost :-D

#9 Updated by okurz 9 months ago

Basically the only relevant parts are the file https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/monitoring/grafana.sls , which is a "salt state file" or something like that, and the folder https://gitlab.suse.de/openqa/salt-states-openqa/-/tree/master/openqa/monitoring/grafana/ , which contains grafana dashboard files in JSON format, e.g. https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/monitoring/grafana/automatic_actions.json , plus the jinja template file https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/monitoring/grafana/worker.json.template . The template is rendered within the salt state definitions: variables are replaced and one resulting JSON file per worker host is created on monitor.qa.suse.de, e.g. "openqaworker2.json, openqaworker3.json, …"
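The rendering step described here could be sketched like this in a salt state. This is a simplified, hypothetical version (the worker list would really come from salt data, and the target path is an assumption), only `file.managed` with `template: jinja` and a `context` mapping is standard salt:

```sls
# Hypothetical sketch of the state in grafana.sls: render one dashboard
# JSON file per worker from the shared jinja template.
{% for worker in ['openqaworker2', 'openqaworker3'] %}
/var/lib/grafana/dashboards/{{ worker }}.json:
  file.managed:
    - source: salt://openqa/monitoring/grafana/worker.json.template
    - template: jinja
    - context:
        worker: {{ worker }}
{% endfor %}
```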

#10 Updated by okurz 8 months ago

  • Status changed from Feedback to In Progress
  • Priority changed from Normal to Low

As the above approach isn't leading anywhere I am trying a different way now.

On the webUI summary dashboard we have "Average Ping time" where we fill non-existant values with "null" hence not receiving alerts if ping does not return. There is an alert attached to that panel but only for "high ping times". We could either not fill values and alert on that or keep it at null but actually report an exact value of zero (or null?) as alert. I am creating a new panel on "WIP" for experimentation. This will likely take some time to gather enough results as our hosts are not down that often. So I am also setting the prio to "Low".

#11 Updated by okurz 8 months ago

  • Due date set to 2020-11-30

Setting due date based on mean cycle time of SUSE QE Tools

#12 Updated by okurz 8 months ago

  • Estimated time set to 80142.00 h

#13 Updated by okurz 8 months ago

  • Estimated time deleted (80142.00 h)

#14 Updated by okurz 8 months ago

  • Status changed from In Progress to Feedback
  • Parent task set to #80142

The alarm I put in hasn't triggered in the past days even though I can see that qa-power8-5-kvm is currently (still) down. I changed the alarm to also trigger explicitly on "no data". Let's see if this triggers. If not, I plan to deliberately power off a host on the weekend, when fewer resources are needed, and check that the alert triggers.

#15 Updated by okurz 8 months ago

  • Related to action #78010: unreliable reboots on openqaworker3, likely due do openqa_nvme_format (was: [alert] PROBLEM Host Alert: openqaworker3.suse.de is DOWN) added

#16 Updated by okurz 8 months ago

I could not get my previous approach done yet, but I am now trying another alternative that is simple:
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/412

#17 Updated by okurz 8 months ago

  • Status changed from Feedback to Resolved

I forgot to mention the name in each alert, so I created
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/413

Now live and working as expected.

This is now enabling alerts for all hosts without exception. We can try to work with this and see if it's again too much.

#18 Updated by okurz 2 months ago

  • Due date deleted (2020-11-30)
