action #71098
closed
openQA Project (public) - coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes
openqaworker3 down but no alert was raised
Added by okurz over 4 years ago.
Updated almost 4 years ago.
- Priority changed from Urgent to Normal
openqaworker3 was (again) stuck in recovery mode, trying to run openqa_nvme_prepare (or format?). I recovered it using the IPMI SOL console and the machine picked up jobs again.
Regarding the missing alerts, discussed with nsinger: First simple step: Consolidate the worker dashboards to use "Keep Last State" for all panels except a single one that we use to detect whether a machine is completely down. The caveat is that we cannot attach a custom message to the "No Data" alert, so we simply need to know what it means when a "No Data" alert fires. Alternatives: Use a ping from e.g. openqa-monitor to each worker. This should always return data as long as openqa-monitor itself is running. Or use a meta query?
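As a minimal sketch of the ping idea, assuming openqa-monitor runs telegraf (the hostnames and option values below are just examples, not the actual configuration):

```
[[inputs.ping]]
  # Runs on openqa-monitor itself, so data points keep arriving as long as the
  # monitor host is up; "No Data" for a worker then really means the worker is
  # unreachable rather than the collector being down.
  urls = ["openqaworker3.suse.de", "openqaworker2.suse.de"]
  count = 3
```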
okurz wrote:
First simple step: Consolidate the use of worker dashboards to use "Keep Last State" for all panels except for a single one that we will use to detect if a machine is completely down.
If we did that, it would also be activated for openqaworker-arm-1 through …3, which would be annoying because they restart a lot but are also recovered automatically.
As we already use a template for all workers, we could use salt to fill in the "no data" setting in the right position for all workers except the arm workers.
What we did now: Added a new panel "Machine pingable" to the dashboard of each worker and attached an alert to it. But then we realized that this would also raise the alert for the arm workers, which we want to avoid. So probably a better choice is to create another dashboard, based on a template, for each worker except the arm workers. We do not yet have the information about which workers are unstable in salt, except in grafana/automatic_actions.json. So we could also turn grafana/automatic_actions.json into a template with entries for each worker.
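A rough sketch of what such a template could look like; the "workers" pillar and the "alert_on_no_data" key are made up for illustration and do not exist in the repo:

```
{#- Hypothetical pillar "workers" listing every worker host with an optional
    "unstable" flag, e.g.
    workers: {openqaworker3: {}, openqaworker-arm-1: {unstable: true}} -#}
[
{%- for host, settings in salt['pillar.get']('workers', {}).items() %}
  {
    "hostname": "{{ host }}",
    "alert_on_no_data": {{ 'false' if settings.get('unstable', False) else 'true' }}
  }{{ ',' if not loop.last }}
{%- endfor %}
]
```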
Maybe we can use more grafana variables here?
Created a draft in https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/354 and started a discussion in chat about the open points:
<okurz> hi. Do you have an idea how I could in https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/354/diffs#2d8bb0912fa3517d0e03bf383c59a7b32f8229d1_135_135 get grains for each worker based on "nodename"?
<nsinger> hm, not sure if I get your question 100% right but you should be able to use `grains.get("unstable", False)` no?
<nsinger> "based on nodename" is implicit with grains since they get assigned to every host individually
<okurz> well but it's not like we want to evaluate the salt code on each worker. This is the grafana panels we want to generate on monitor.qa
<nsinger> right. I think then the salt mine is the right place for you 😉 Let me dig up an example we already use
<nsinger> so https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/monitoring/grafana.sls#L3 is how to access the grain/attribute. I just don't find the place where we populate it in the mine
<nsinger> right, it was in the pillars: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/salt/mine.sls - I think if you add your unstable grain here then you can access it from the master through the mine. Makes sense?
<okurz> yes. but does `salt['mine.get']('roles:worker', 'nodename', tgt_type='grain')` give me the single value or list?
<nsinger> it should be a list with the salt-id as key and the requested grain(s) as list of values. But you can simply give it a try on the command line: `salt '*' mine.get 'roles:worker' nodename grain`
<okurz> that's a good hint, thx
<okurz> I don't understand https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/salt/mine.sls though.
<nsinger> IIRC it is a function inside the mine called "nodename" which is calling "grains.get" with the parameter "nodename"
<nsinger> which would match `salt['mine.get']('roles:worker', 'nodename', tgt_type='grain')` because the second parameter is the function in the mine to call
<okurz> I see. Would we need to define another function "unstable" just to get the grain "unstable"? I don't understand the difference between this approach here and other places where we directly access grains, like the "role" we have for worker, monitoring and webui
<nsinger> I don't see a place where we do it differently. AFAIK besides the nodename we do not make use of the mine but yeah, if you think it's wrong go play around with it and find a better solution 🙂
<nsinger> I don't say my solution is the way to go, it's just how it worked for me 😉
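For reference, a sketch of what the mine-based approach from the chat could look like; the "unstable" grain and the corresponding mine function are assumptions taken from the discussion, not existing code:

```
# salt/mine.sls (pillar): expose the hypothetical "unstable" grain through the
# mine, analogous to the existing "nodename" function described above
mine_functions:
  unstable:
    - mine_function: grains.get
    - unstable

# openqa/monitoring/grafana.sls (jinja, rendered on the monitor host):
{%- set nodenames = salt['mine.get']('roles:worker', 'nodename', tgt_type='grain') %}
{%- set unstable  = salt['mine.get']('roles:worker', 'unstable', tgt_type='grain') %}
{%- for minion_id, nodename in nodenames.items() %}
{%-   if not unstable.get(minion_id, False) %}
# ... generate the "Machine pingable" panel and its alert for {{ nodename }} here ...
{%-   endif %}
{%- endfor %}
```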
- Status changed from In Progress to Feedback
For now I am hoping for some good ideas from others. Otherwise it would be a lengthy investigation cycle for me to figure out how to iterate over all worker data in a jinja template and insert the corresponding strings based on it.
okurz wrote:
Maybe we can use more grafana variables here?
Today I found out that grafana template variables cannot be used in alert queries. If a query includes e.g. host =~ /^$host$/ instead of host = 'openqa', the alert settings will show the error "Template variables are not supported in alert queries". This seems to be a highly desired feature request for grafana, see https://github.com/grafana/grafana/issues/6557
That's the reason we create other alerts in salt templates.
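For illustration, the difference could look like this; measurement, field and host names are placeholders rather than the real dashboards. A templated query like

```
SELECT mean("ping_ms") FROM "ping" WHERE "host" =~ /^$host$/ AND $timeFilter GROUP BY time(1m)
```

is accepted as a dashboard query but rejected as soon as an alert is attached, whereas the salt/jinja template can generate one panel per worker with the hostname hard-coded:

```
SELECT mean("ping_ms") FROM "ping" WHERE "host" = 'openqaworker3' AND $timeFilter GROUP BY time(1m)
```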
do you mean this is the reason why the worker dashboards use the jinja template managed by salt?
Can you clarify which jinja or salt templates you're referring to? I'm wondering whether I can contribute to the brainstorming but I'm a bit lost :-D
- Status changed from Feedback to In Progress
- Priority changed from Normal to Low
As the above approach isn't leading anywhere, I am trying a different way now.
On the webUI summary dashboard we have an "Average Ping time" panel where we fill non-existent values with "null", hence we do not receive alerts if ping does not return anything. There is an alert attached to that panel, but only for "high ping times". We could either not fill the values and alert on that, or keep filling with null but actually report an exact value of zero (or null?) to alert on. I am creating a new panel on "WIP" for experimentation. This will likely take some time to gather enough results as our hosts are not down that often, so I am also setting the prio to "Low".
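A sketch of the two variants, with placeholder measurement and field names (the real panel may differ): today the query fills gaps with null, so the existing threshold alert never fires for a dead host,

```
SELECT mean("average_response_ms") FROM "ping" WHERE "host" = 'openqaworker3' AND $timeFilter GROUP BY time(1m) fill(null)
```

whereas filling gaps with 0 would let an alert on "value below threshold" (or the explicit "no data" setting) catch a host that stopped responding:

```
SELECT mean("average_response_ms") FROM "ping" WHERE "host" = 'openqaworker3' AND $timeFilter GROUP BY time(1m) fill(0)
```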
- Due date set to 2020-11-30
Setting due date based on mean cycle time of SUSE QE Tools
- Estimated time set to 80142.00 h
- Estimated time deleted (80142.00 h)
- Status changed from In Progress to Feedback
- Parent task set to #80142
The alarm I put in hasn't triggered in the past days even though I can see that qa-power8-5-kvm is currently (still) down. I changed the alert to also trigger explicitly on "no data". Let's see if this triggers. If not, I plan to deliberately power off a host on the weekend, when fewer resources are needed, and check that the alert triggers.
- Related to action #78010: unreliable reboots on openqaworker3, likely due to openqa_nvme_format (was: [alert] PROBLEM Host Alert: openqaworker3.suse.de is DOWN) added
- Status changed from Feedback to Resolved
- Due date deleted (2020-11-30)