action #106832

openQA Project - coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

Monitor masked units on our infrastructure

Added by nicksinger 5 months ago. Updated about 2 months ago.

Status:
Resolved
Priority:
Low
Assignee:
Target version:
Start date:
2022-02-15
Due date:
% Done:

0%

Estimated time:

Description

Motivation

#106666#note-6 raised the valid question of how we become aware of units that have been masked for too long.

Suggestions


Related issues

Related to openQA Infrastructure - action #106666: Improve worker startup in our salt states or "openqa-worker-auto-restart repeatedly failing on grenache-1.qa.suse.de" (Resolved, 2022-02-11)

History

#1 Updated by nicksinger 5 months ago

  • Related to action #106666: Improve worker startup in our salt states or "openqa-worker-auto-restart repeatedly failing on grenache-1.qa.suse.de" added

#2 Updated by okurz 5 months ago

  • Description updated (diff)
  • Assignee set to okurz
  • Target version set to Ready

Thank you. Your suggestion looks good. Let me take a quick look. I have an idea.

#3 Updated by okurz 5 months ago

  • Status changed from New to Feedback

#4 Updated by cdywan 4 months ago

okurz wrote:

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/653

FYI you have outstanding review comments

#5 Updated by okurz 3 months ago

  • Parent task set to #109743

#7 Updated by okurz 2 months ago

I made a mistake in the syntax for calling the shell script, see https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/676 for the fix.

#8 Updated by okurz 2 months ago

I broke the script for the case where there are actually no failed or masked units, fixed in: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/680
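
For illustration only, here is a minimal sketch of one way to count units so that an empty result yields 0 instead of an error. It does not reproduce the actual script or the actual bug from the merge request; the function name `count_units` is hypothetical. It assumes the input comes from `systemctl list-units --no-legend --plain`, which prints one unit per line without the header and footer.

```shell
#!/bin/sh
# Hypothetical helper: count unit lines read from stdin.
# awk always prints a number and exits 0, whereas `grep -c` exits with
# status 1 when nothing matches, which can abort a script run with `set -e`.
count_units() {
    awk 'NF > 0 { n++ } END { print n + 0 }'
}

# Intended use on a real host (not executed here):
#   systemctl list-units --state=masked --no-legend --plain | count_units
```

With no masked units the pipeline prints `0`, so a caller can always treat the output as a number.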

#9 Updated by okurz 2 months ago

Monitoring works now. I found that we have quite a number of masked units:

okurz@openqa:~> sudo salt --no-color --state-output=changes -C 'G@roles:worker' cmd.run 'systemctl list-units --state=masked'
openqaworker3.suse.de:
      UNIT             LOAD   ACTIVE   SUB  DESCRIPTION     
    * apparmor.service masked inactive dead apparmor.service

    LOAD   = Reflects whether the unit definition was properly loaded.
    ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
    SUB    = The low-level unit activation state, values depend on unit type.

    1 loaded units listed.
openqaworker8.suse.de:
      UNIT                     LOAD   ACTIVE   SUB  DESCRIPTION             
    * apparmor.service         masked inactive dead apparmor.service        
    * openqa-worker@1.service  masked inactive dead openqa-worker@1.service 
    * openqa-worker@10.service masked inactive dead openqa-worker@10.service
    * openqa-worker@11.service masked inactive dead openqa-worker@11.service
    * openqa-worker@12.service masked inactive dead openqa-worker@12.service
    * openqa-worker@13.service masked inactive dead openqa-worker@13.service
    * openqa-worker@14.service masked inactive dead openqa-worker@14.service
    * openqa-worker@15.service masked inactive dead openqa-worker@15.service
    * openqa-worker@16.service masked inactive dead openqa-worker@16.service
    * openqa-worker@2.service  masked inactive dead openqa-worker@2.service 
    * openqa-worker@3.service  masked inactive dead openqa-worker@3.service 
    * openqa-worker@4.service  masked inactive dead openqa-worker@4.service 
    * openqa-worker@5.service  masked inactive dead openqa-worker@5.service 
    * openqa-worker@6.service  masked inactive dead openqa-worker@6.service 
    * openqa-worker@7.service  masked inactive dead openqa-worker@7.service 
    * openqa-worker@8.service  masked inactive dead openqa-worker@8.service 
    * openqa-worker@9.service  masked inactive dead openqa-worker@9.service 

    LOAD   = Reflects whether the unit definition was properly loaded.
    ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
    SUB    = The low-level unit activation state, values depend on unit type.

    17 loaded units listed.
openqaworker9.suse.de:
      UNIT             LOAD   ACTIVE   SUB  DESCRIPTION     
    * apparmor.service masked inactive dead apparmor.service

    LOAD   = Reflects whether the unit definition was properly loaded.
    ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
    SUB    = The low-level unit activation state, values depend on unit type.

    1 loaded units listed.
openqaworker6.suse.de:
      UNIT             LOAD   ACTIVE   SUB  DESCRIPTION     
    * apparmor.service masked inactive dead apparmor.service

    LOAD   = Reflects whether the unit definition was properly loaded.
    ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
    SUB    = The low-level unit activation state, values depend on unit type.

    1 loaded units listed.
openqaworker5.suse.de:
      UNIT             LOAD   ACTIVE   SUB  DESCRIPTION     
    * apparmor.service masked inactive dead apparmor.service

    LOAD   = Reflects whether the unit definition was properly loaded.
    ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
    SUB    = The low-level unit activation state, values depend on unit type.

    1 loaded units listed.
openqaworker2.suse.de:
      UNIT                    LOAD   ACTIVE   SUB  DESCRIPTION            
    * apparmor.service        masked inactive dead apparmor.service       
    * openqa-worker@5.service masked inactive dead openqa-worker@5.service

    LOAD   = Reflects whether the unit definition was properly loaded.
    ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
    SUB    = The low-level unit activation state, values depend on unit type.

    2 loaded units listed.
powerqaworker-qam-1.qa.suse.de:
      UNIT                    LOAD   ACTIVE   SUB  DESCRIPTION            
    * apparmor.service        masked inactive dead apparmor.service       
    * openqa-worker@1.service masked inactive dead openqa-worker@1.service
    * openqa-worker@2.service masked inactive dead openqa-worker@2.service
    * openqa-worker@3.service masked inactive dead openqa-worker@3.service
    * openqa-worker@4.service masked inactive dead openqa-worker@4.service
    * openqa-worker@5.service masked inactive dead openqa-worker@5.service
    * openqa-worker@6.service masked inactive dead openqa-worker@6.service

    LOAD   = Reflects whether the unit definition was properly loaded.
    ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
    SUB    = The low-level unit activation state, values depend on unit type.

    7 loaded units listed.
QA-Power8-5-kvm.qa.suse.de:
      UNIT             LOAD   ACTIVE   SUB  DESCRIPTION     
    * apparmor.service masked inactive dead apparmor.service

    LOAD   = Reflects whether the unit definition was properly loaded.
    ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
    SUB    = The low-level unit activation state, values depend on unit type.

    1 loaded units listed.
QA-Power8-4-kvm.qa.suse.de:
      UNIT             LOAD   ACTIVE   SUB  DESCRIPTION     
    * apparmor.service masked inactive dead apparmor.service
    * postfix.service  masked inactive dead postfix.service 

    LOAD   = Reflects whether the unit definition was properly loaded.
    ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
    SUB    = The low-level unit activation state, values depend on unit type.

    2 loaded units listed.
openqaworker14.qa.suse.cz:
      UNIT             LOAD   ACTIVE   SUB  DESCRIPTION     
    * apparmor.service masked inactive dead apparmor.service

    LOAD   = Reflects whether the unit definition was properly loaded.
    ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
    SUB    = The low-level unit activation state, values depend on unit type.

    1 loaded units listed.
openqaworker15.qa.suse.cz:
      UNIT             LOAD   ACTIVE   SUB  DESCRIPTION     
    * apparmor.service masked inactive dead apparmor.service

    LOAD   = Reflects whether the unit definition was properly loaded.
    ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
    SUB    = The low-level unit activation state, values depend on unit type.

    1 loaded units listed.
malbec.arch.suse.de:
      UNIT             LOAD   ACTIVE   SUB  DESCRIPTION     
    * apparmor.service masked inactive dead apparmor.service

    LOAD   = Reflects whether the unit definition was properly loaded.
    ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
    SUB    = The low-level unit activation state, values depend on unit type.

    1 loaded units listed.
openqaworker13.suse.de:
      UNIT             LOAD   ACTIVE   SUB  DESCRIPTION     
    * apparmor.service masked inactive dead apparmor.service

    LOAD   = Reflects whether the unit definition was properly loaded.
    ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
    SUB    = The low-level unit activation state, values depend on unit type.

    1 loaded units listed.
grenache-1.qa.suse.de:
      UNIT                                         LOAD   ACTIVE   SUB  DESCRIPTION                                 
    * apparmor.service                             masked inactive dead apparmor.service                            
    * openqa-reload-worker-auto-restart@10.service masked inactive dead openqa-reload-worker-auto-restart@10.service
    * openqa-reload-worker-auto-restart@21.service masked inactive dead openqa-reload-worker-auto-restart@21.service
    * openqa-reload-worker-auto-restart@22.service masked inactive dead openqa-reload-worker-auto-restart@22.service
    * openqa-reload-worker-auto-restart@23.service masked inactive dead openqa-reload-worker-auto-restart@23.service
    * openqa-reload-worker-auto-restart@25.service masked inactive dead openqa-reload-worker-auto-restart@25.service
    * openqa-reload-worker-auto-restart@27.service masked inactive dead openqa-reload-worker-auto-restart@27.service

    LOAD   = Reflects whether the unit definition was properly loaded.
    ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
    SUB    = The low-level unit activation state, values depend on unit type.

    7 loaded units listed.
openqaworker10.suse.de:
      UNIT             LOAD   ACTIVE   SUB  DESCRIPTION     
    * apparmor.service masked inactive dead apparmor.service

    LOAD   = Reflects whether the unit definition was properly loaded.
    ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
    SUB    = The low-level unit activation state, values depend on unit type.

    1 loaded units listed.
openqaworker-arm-1.suse.de:
      UNIT             LOAD   ACTIVE   SUB  DESCRIPTION     
    * apparmor.service masked inactive dead apparmor.service
    * postfix.service  masked inactive dead postfix.service 

    LOAD   = Reflects whether the unit definition was properly loaded.
    ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
    SUB    = The low-level unit activation state, values depend on unit type.

    2 loaded units listed.
openqaworker-arm-3.suse.de:
      UNIT             LOAD   ACTIVE   SUB  DESCRIPTION     
    * apparmor.service masked inactive dead apparmor.service

    LOAD   = Reflects whether the unit definition was properly loaded.
    ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
    SUB    = The low-level unit activation state, values depend on unit type.

    1 loaded units listed.
openqaworker-arm-2.suse.de:
      UNIT             LOAD   ACTIVE   SUB  DESCRIPTION     
    * apparmor.service masked inactive dead apparmor.service
    * postfix.service  masked inactive dead postfix.service 

    LOAD   = Reflects whether the unit definition was properly loaded.
    ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
    SUB    = The low-level unit activation state, values depend on unit type.

    2 loaded units listed.

Should we add an alert with a threshold of, say, "50"? That seems quite arbitrary, WDYT?
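
To make such output easier to compare across hosts, it could be condensed into one count per minion. The following is only a sketch: the function name `summarize_masked` is hypothetical, and it assumes the salt output format shown above (a host name followed by a colon, then indented unit lines starting with `*`).

```shell
#!/bin/sh
# Hypothetical helper: read salt output on stdin and print
# "<host> <number of masked units>" for each minion, in input order.
summarize_masked() {
    awk '
        # A non-indented line ending in ":" starts a new minion section.
        /^[^ ].*:$/ { host = $0; sub(/:$/, "", host); hosts[++n] = host; count[host] = 0; next }
        # Unit lines look like: "* <unit> masked inactive dead <description>".
        $1 == "*" && $3 == "masked" { count[host]++ }
        END { for (i = 1; i <= n; i++) print hosts[i], count[hosts[i]] }
    '
}

# Intended use (not executed here):
#   sudo salt --no-color -C 'G@roles:worker' cmd.run \
#       'systemctl list-units --state=masked' | summarize_masked
```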

#10 Updated by nicksinger 2 months ago

I think there are two things involved here:

  1. What to do with masked services we expect to be masked because we decided we don't want them
    • apparmor seems to be disabled everywhere. I think these kinds of services should be excluded in the script querying them.
  2. What is a proper threshold for masked services without it being arbitrary
    • One single dashboard for all services over all workers would provide little to no information. Single workers causing problems can trigger the alert all the time, causing alert fatigue. Disabling the alert completely removes our capability to detect "forgotten" masks on all machines
    • One (templated) graph per worker should exist on e.g. the worker-dashboard which we already deploy for each worker
    • With this graph we can have an alert with a threshold of >0 for 1w
    • If a single worker is known to be problematic we can just disable the single alert without losing monitoring on the other machines
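
The ">0 for 1w" rule can be illustrated with a small sketch. This is not how the alert would be configured in the dashboard; it only shows the semantics. It assumes per-worker samples of the form `<epoch_seconds> <masked_count>` in chronological order, and the function name `masked_too_long` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read "<epoch_seconds> <masked_count>" lines on stdin.
# Prints "alert" if the count has stayed above 0 for at least the given
# duration (default: 1 week = 604800 seconds), otherwise "ok".
masked_too_long() {
    duration="${1:-604800}"
    awk -v duration="$duration" '
        NF < 2  { next }                          # ignore blank or malformed lines
        $2 > 0  { if (start == "") start = $1 }   # remember when the streak began
        $2 == 0 { start = "" }                    # any zero sample resets the streak
        { last = $1 }
        END { if (start != "" && last - start >= duration) print "alert"; else print "ok" }
    '
}
```

Because a single sample of 0 resets the streak, a mask that is applied and lifted again within a week never triggers the alert.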

#11 Updated by nicksinger 2 months ago

For postfix, I think this is a perfect example where the alert would have helped us catch that it's still masked. It doesn't seem to cause problems on all workers (so it doesn't need to be excluded), but it's still masked on some workers.

#12 Updated by okurz 2 months ago

OK. I have added an exclude option for the monitoring script in https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/681 , so we could put specific services to exclude into our salt pillars. That could address your point 1.
I am not sure whether trying to handle all masked services within salt pillars is feasible. So far, our already complicated process for taking out individual openQA workers relies on masking systemd services. If we alerted on all masked services that we don't explicitly track as such in git, we would alert ourselves about our own temporary service masking. But the 1-week alerting threshold you suggested is a sane value. OK, let's see what we can do about a templated per-worker graph.
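
As a rough sketch of what such an exclusion could look like (this is not the code from the merge request; the function name `count_masked_excluding` and its argument format are assumptions), the count can skip any unit named in a space-separated list. The input is assumed to come from `systemctl list-units --state=masked --no-legend --plain`, where the unit name is the first field.

```shell
#!/bin/sh
# Hypothetical helper: count unit lines on stdin, skipping units whose name
# appears in the space-separated exclude list passed as the first argument.
count_masked_excluding() {
    excludes="$1"   # e.g. "apparmor.service"
    awk -v excludes="$excludes" '
        BEGIN { n = split(excludes, list, " "); for (i = 1; i <= n; i++) skip[list[i]] = 1 }
        NF > 0 && !($1 in skip) { c++ }
        END { print c + 0 }
    '
}

# Intended use on a real host (not executed here):
#   systemctl list-units --state=masked --no-legend --plain \
#       | count_masked_excluding "apparmor.service"
```

With `apparmor.service` excluded, a worker that additionally has `postfix.service` masked would still report 1, which matches the idea of excluding only services that are masked on purpose.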

#13 Updated by okurz about 2 months ago

  • Status changed from Feedback to Resolved

So we have monitoring for masked services but no alerting. I guess it's ok for now.

#14 Updated by nicksinger about 2 months ago

Do we want to create a follow-up ticket already (I can do this) or just wait for the "OMG! This unit was masked for x years, why did nobody notice it?" moment and react to that? :)

#15 Updated by okurz about 2 months ago

nicksinger wrote:

Do we want to create a follow-up ticket already (I can do this) or just wait for the "OMG! This unit was masked for x years, why did nobody notice it?" moment and react to that? :)

The second.
