openQA Project - coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes
openqaworker3 down but no alert was raised
dzedro reported in https://chat.suse.de/channel/testing?msg=mq6q7RGxM2jznNAsR that openqaworker3 is down. This was not reported by any alert. It seems we do not have a consistent alert for any host simply being down. This should be cross-checked in grafana.
- Priority changed from Urgent to Normal
openqaworker3 was (again) stuck in recovery mode, trying to run openqa_nvme_prepare (or format?). I recovered it using the IPMI SOL console and the machine picked up jobs again.
Regarding the missing alerts, discussed with nsinger. First simple step: consolidate the worker dashboards to use "Keep Last State" for all panels except a single one that we use to detect whether a machine is completely down. The caveat is that we cannot attach a custom message to the "No Data" alert, so we simply need to know what it means when a "No Data" alert fires. Alternatives: ping each worker from e.g. openqa-monitor, which should always return data as long as openqa-monitor itself is running. Or use a meta query?
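The ping-from-the-monitor alternative could be collected with telegraf's standard ping input plugin; a minimal sketch, assuming telegraf runs on the monitoring host (the target hostname below is only illustrative):

```toml
# Hypothetical telegraf snippet on the monitoring host: ping each worker
# so a datapoint exists even when the worker itself reports nothing.
[[inputs.ping]]
  urls = ["openqaworker3.suse.de"]  # illustrative target, one entry per worker
  count = 3                         # packets per collection interval
```

As long as telegraf itself is up, this series always has data, so "No Data" would then unambiguously mean the monitoring host is broken rather than the worker.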
First simple step: Consolidate the use of worker dashboards to use "Keep Last State" for all panels except for a single one that we will use to detect if a machine is completely down.
If we did that, the alert would also be activated for openqaworker-arm-1 through …3, which would be annoying as they restart a lot but are also recovered automatically.
As we already use a template for all workers, we could use salt to fill in the "No Data" alerting for all workers except the arm workers.
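One possible shape for that exclusion, sketched as jinja in the worker dashboard template; the hostname prefix check and loop variables are assumptions for illustration, not taken from the real grafana.sls:

```jinja
{# Hypothetical sketch: emit the "machine down" alert only for
   non-arm workers; arm workers keep their alert-free panels #}
{% for host in salt['mine.get']('roles:worker', 'nodename', tgt_type='grain').values() %}
{% if not host.startswith('openqaworker-arm') %}
  {# ... render the dashboard JSON including the "No Data" alert for {{ host }} ... #}
{% endif %}
{% endfor %}
```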
What we did now: added a new panel "Machine pingable" and put it on the dashboard of each worker, with an alert attached. But then we realized that the alert would also fire for the arm workers, which we want to avoid. So probably a better choice is to create another dashboard, based on a template, for each worker except the arm workers. We do not yet have the information about which worker is unstable in salt, except in grafana/automatic_actions.json. So what we could also do is make a template out of grafana/automatic_actions.json with entries for each worker.
Maybe we can use more grafana variables here?
Created a draft in https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/354
and discussion in chat about open points:
<okurz> hi. Do you have an idea how I could in https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/354/diffs#2d8bb0912fa3517d0e03bf383c59a7b32f8229d1_135_135 get grains for each worker based on "nodename"?
<nsinger> hm, not sure if I get your question 100% right but you should be able to use `grains.get("unstable", False)`, no?
<nsinger> "based on nodename" is implicit with grains since they get assigned to every host individually
<okurz> well, but it's not like we want to evaluate the salt code on each worker. This is the grafana panels we want to generate on monitor.qa
<nsinger> right. I think then the salt mine is the right place for you 😉 Let me dig up an example we already use
<nsinger> so https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/monitoring/grafana.sls#L3 is how to access the grain/attribute. I just don't find the place where we populate it in the mine
<nsinger> right, it was in the pillars: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/salt/mine.sls - I think if you add your unstable grain here then you can access it from the master through the mine. Makes sense?
<okurz> yes. But does `salt['mine.get']('roles:worker', 'nodename', tgt_type='grain')` give me a single value or a list?
<nsinger> it should be a list with the salt-id as key and the requested grain(s) as list of values. But you can simply give it a try on the command line: `salt '*' mine.get 'roles:worker' nodename grain`
<okurz> that's a good hint, thx
<okurz> I don't understand https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/salt/mine.sls though.
<nsinger> IIRC it is a function inside the mine called "nodename" which is calling "grains.get" with the parameter "nodename"
<nsinger> which would match `salt['mine.get']('roles:worker', 'nodename', tgt_type='grain')` because the second parameter is the function in the mine to call
<okurz> I see. Would we need to define another function "unstable" just to get the grain "unstable"?
<okurz> I don't understand the difference between this approach here and other places where we directly access grains, like the "role" we have for worker, monitoring and webui
<nsinger> I don't see a place where we do it differently. AFAIK besides the nodename we do not make use of the mine but yeah, if you think it's wrong go play around with it and find a better solution 🙂
<nsinger> I don't say my solution is the way to go, it's just how it worked for me 😉
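Following nsinger's suggestion, adding a second mine function next to the existing `nodename` one might look like the sketch below. This uses salt's standard `mine_functions` pillar syntax with an aliased function name; the actual content of salt/mine.sls is not reproduced here, so treat this as an assumed shape:

```yaml
# Sketch of salt/mine.sls with an additional "unstable" mine function,
# so the master can query the grain through the mine:
mine_functions:
  nodename:
    - mine_function: grains.get
    - nodename
  unstable:
    - mine_function: grains.get
    - unstable
```

With that in place, the state files on the master could call `salt['mine.get']('roles:worker', 'unstable', tgt_type='grain')` analogously to the existing `nodename` lookup.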
Maybe we can use more grafana variables here?
Today I found out that grafana template variables cannot be used in alert queries. If a query includes e.g. `host =~ /^$host$/` instead of `host = 'openqa'`, the alert settings show the error "Template variables are not supported in alert queries". This seems to be a highly desired feature request on grafana, see https://github.com/grafana/grafana/issues/6557
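For illustration, assuming a telegraf `ping` measurement in InfluxDB (the measurement, field, and tag names here are assumptions, not taken from our dashboards):

```sql
-- Accepted in an alert query: the host is hardcoded
SELECT mean("result_code") FROM "ping"
  WHERE "host" = 'openqaworker3' AND $timeFilter GROUP BY time(1m)

-- Rejected with "Template variables are not supported in alert queries":
-- SELECT ... FROM "ping" WHERE "host" =~ /^$host$/ AND $timeFilter ...
```

This is why templating has to happen one level earlier, in salt, which renders one concrete query per worker instead of relying on a grafana variable.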
Basically the only relevant parts are the file https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/monitoring/grafana.sls , a "salt state file" or something like that, and the folder https://gitlab.suse.de/openqa/salt-states-openqa/-/tree/master/openqa/monitoring/grafana/ , which includes grafana dashboard files in JSON format, e.g. https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/monitoring/grafana/automatic_actions.json , and a jinja template file https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/monitoring/grafana/worker.json.template . The template is rendered within the salt state definitions: variables are replaced and one resulting JSON file per worker host is created on monitor.qa.suse.de, e.g. "openqaworker2.json, openqaworker3.json, …"
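A rough sketch of what that rendering step in grafana.sls amounts to; the state IDs, target path, and worker list here are illustrative placeholders, not copied from the real file:

```jinja
{# Hypothetical sketch: render worker.json.template once per worker host #}
{% for host in ['openqaworker2', 'openqaworker3'] %}
dashboard-{{ host }}:
  file.managed:
    - name: /var/lib/grafana/dashboards/{{ host }}.json
    - source: salt://openqa/monitoring/grafana/worker.json.template
    - template: jinja
    - context:
        worker: {{ host }}
{% endfor %}
```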
- Status changed from Feedback to In Progress
- Priority changed from Normal to Low
As the above approach isn't leading anywhere, I am trying a different way now.
On the webUI summary dashboard we have "Average Ping time", where we fill non-existent values with "null" and hence receive no alerts if ping does not return. There is an alert attached to that panel, but only for "high ping times". We could either not fill missing values and alert on "No Data", or keep filling with null but report an exact value of zero (or null?) as the alert condition. I am creating a new panel on "WIP" for experimentation. This will likely take some time to gather enough results as our hosts are not down that often, so I am also setting the prio to "Low".
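The two fill strategies could look like this in InfluxQL, assuming telegraf's ping plugin as the data source (measurement and field names are assumptions):

```sql
-- Current behaviour (sketch): downtime leaves gaps, so the
-- "high ping times" alert never sees the outage
SELECT mean("average_response_ms") FROM "ping"
  WHERE $timeFilter GROUP BY time(1m) fill(null)

-- Alternative: turn gaps into zeroes so a "value is below threshold"
-- alert condition can fire during an outage
-- SELECT ... GROUP BY time(1m) fill(0)
```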
- Status changed from In Progress to Feedback
- Parent task set to #80142
The alarm I put in hasn't triggered in the past days even though I can see that qa-power8-5-kvm is currently (still) down. I changed the alarm to also trigger explicitly on "No Data". Let's see if this triggers. If not, I plan to deliberately power off a host on the weekend, when fewer resources are needed, and check whether the alert triggers.
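In the dashboard JSON this change corresponds to the alert's `noDataState` setting of grafana's (legacy) alerting; a minimal fragment might look like the following, with the alert name being purely illustrative:

```json
{
  "alert": {
    "name": "Machine pingable alert",
    "noDataState": "alerting",
    "executionErrorState": "keep_state"
  }
}
```

Setting `noDataState` to `"alerting"` makes a missing series itself fire the alert instead of silently keeping the last state.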
I could not get my previous approach to work yet, but I am now trying another, simpler alternative:
- Status changed from Feedback to Resolved
Forgot to mention the name in each alert so created
Now live and working as expected.
This is now enabling alerts for all hosts without exception. We can try to work with this and see if it's again too much.