action #106832
Monitor masked units on our infrastructure
Description
Motivation
#106666#note-6 raised the valid question of how we become aware of units being masked for too long.
Suggestions
- Use e.g.
  systemctl list-unit-files --state=masked --no-legend
  to figure out which units are currently masked
- Feed this information into our monitoring/grafana. One way to do this would be to extend https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/monitoring/telegraf/scripts/systemd_failed.sh
- Create an appropriate dashboard in grafana with reasonable thresholds for alerting. E.g. don't alert if a service is masked <1w
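A minimal sketch of how the first two suggestions could fit together, assuming the telegraf script is expected to print InfluxDB line protocol. The function name `masked_units_metric`, the measurement name `systemd_masked` and the field names are hypothetical and not taken from systemd_failed.sh:

```shell
#!/bin/sh
# Hypothetical helper: reads the output of
#   systemctl list-unit-files --state=masked --no-legend
# on stdin and prints a single InfluxDB line-protocol record.
# An empty input still yields a valid record with a count of 0.
masked_units_metric() {
    awk '
        # The first column of every non-empty line is the unit name.
        NF > 0 { names = names (n++ ? "," : "") $1 }
        END    { printf "systemd_masked masked=%di,units=\"%s\"\n", n, names }
    '
}

# Possible use on a real host (not executed here):
#   systemctl list-unit-files --state=masked --no-legend | masked_units_metric
```

Reading from stdin keeps the parsing separate from the systemctl call, so the function can be exercised with canned input.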
Updated by nicksinger almost 3 years ago
- Related to action #106666: Improve worker startup in our salt states or "openqa-worker-auto-restart repeatedly failing on grenache-1.qa.suse.de" added
Updated by okurz almost 3 years ago
- Description updated (diff)
- Assignee set to okurz
- Target version set to Ready
Thank you. Your suggestion looks good. Let me take a quick look. I have an idea.
Updated by okurz almost 3 years ago
- Status changed from New to Feedback
Updated by livdywan over 2 years ago
okurz wrote:
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/653
FYI you have outstanding review comments
Updated by okurz over 2 years ago
Updated by okurz over 2 years ago
I made a mistake in the syntax calling the shell script, see https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/676 for the fix.
Updated by okurz over 2 years ago
I broke the script for the case where there are actually no failed or masked units, fixed in: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/680
Updated by okurz over 2 years ago
Monitoring works now. I found that we have quite a number of masked units:
okurz@openqa:~> sudo salt --no-color --state-output=changes -C 'G@roles:worker' cmd.run 'systemctl list-units --state=masked'
openqaworker3.suse.de:
UNIT LOAD ACTIVE SUB DESCRIPTION
* apparmor.service masked inactive dead apparmor.service
LOAD = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB = The low-level unit activation state, values depend on unit type.
1 loaded units listed.
openqaworker8.suse.de:
UNIT LOAD ACTIVE SUB DESCRIPTION
* apparmor.service masked inactive dead apparmor.service
* openqa-worker@1.service masked inactive dead openqa-worker@1.service
* openqa-worker@10.service masked inactive dead openqa-worker@10.service
* openqa-worker@11.service masked inactive dead openqa-worker@11.service
* openqa-worker@12.service masked inactive dead openqa-worker@12.service
* openqa-worker@13.service masked inactive dead openqa-worker@13.service
* openqa-worker@14.service masked inactive dead openqa-worker@14.service
* openqa-worker@15.service masked inactive dead openqa-worker@15.service
* openqa-worker@16.service masked inactive dead openqa-worker@16.service
* openqa-worker@2.service masked inactive dead openqa-worker@2.service
* openqa-worker@3.service masked inactive dead openqa-worker@3.service
* openqa-worker@4.service masked inactive dead openqa-worker@4.service
* openqa-worker@5.service masked inactive dead openqa-worker@5.service
* openqa-worker@6.service masked inactive dead openqa-worker@6.service
* openqa-worker@7.service masked inactive dead openqa-worker@7.service
* openqa-worker@8.service masked inactive dead openqa-worker@8.service
* openqa-worker@9.service masked inactive dead openqa-worker@9.service
LOAD = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB = The low-level unit activation state, values depend on unit type.
17 loaded units listed.
openqaworker9.suse.de:
UNIT LOAD ACTIVE SUB DESCRIPTION
* apparmor.service masked inactive dead apparmor.service
LOAD = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB = The low-level unit activation state, values depend on unit type.
1 loaded units listed.
openqaworker6.suse.de:
UNIT LOAD ACTIVE SUB DESCRIPTION
* apparmor.service masked inactive dead apparmor.service
LOAD = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB = The low-level unit activation state, values depend on unit type.
1 loaded units listed.
openqaworker5.suse.de:
UNIT LOAD ACTIVE SUB DESCRIPTION
* apparmor.service masked inactive dead apparmor.service
LOAD = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB = The low-level unit activation state, values depend on unit type.
1 loaded units listed.
openqaworker2.suse.de:
UNIT LOAD ACTIVE SUB DESCRIPTION
* apparmor.service masked inactive dead apparmor.service
* openqa-worker@5.service masked inactive dead openqa-worker@5.service
LOAD = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB = The low-level unit activation state, values depend on unit type.
2 loaded units listed.
powerqaworker-qam-1.qa.suse.de:
UNIT LOAD ACTIVE SUB DESCRIPTION
* apparmor.service masked inactive dead apparmor.service
* openqa-worker@1.service masked inactive dead openqa-worker@1.service
* openqa-worker@2.service masked inactive dead openqa-worker@2.service
* openqa-worker@3.service masked inactive dead openqa-worker@3.service
* openqa-worker@4.service masked inactive dead openqa-worker@4.service
* openqa-worker@5.service masked inactive dead openqa-worker@5.service
* openqa-worker@6.service masked inactive dead openqa-worker@6.service
LOAD = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB = The low-level unit activation state, values depend on unit type.
7 loaded units listed.
QA-Power8-5-kvm.qa.suse.de:
UNIT LOAD ACTIVE SUB DESCRIPTION
* apparmor.service masked inactive dead apparmor.service
LOAD = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB = The low-level unit activation state, values depend on unit type.
1 loaded units listed.
QA-Power8-4-kvm.qa.suse.de:
UNIT LOAD ACTIVE SUB DESCRIPTION
* apparmor.service masked inactive dead apparmor.service
* postfix.service masked inactive dead postfix.service
LOAD = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB = The low-level unit activation state, values depend on unit type.
2 loaded units listed.
openqaworker14.qa.suse.cz:
UNIT LOAD ACTIVE SUB DESCRIPTION
* apparmor.service masked inactive dead apparmor.service
LOAD = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB = The low-level unit activation state, values depend on unit type.
1 loaded units listed.
openqaworker15.qa.suse.cz:
UNIT LOAD ACTIVE SUB DESCRIPTION
* apparmor.service masked inactive dead apparmor.service
LOAD = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB = The low-level unit activation state, values depend on unit type.
1 loaded units listed.
malbec.arch.suse.de:
UNIT LOAD ACTIVE SUB DESCRIPTION
* apparmor.service masked inactive dead apparmor.service
LOAD = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB = The low-level unit activation state, values depend on unit type.
1 loaded units listed.
openqaworker13.suse.de:
UNIT LOAD ACTIVE SUB DESCRIPTION
* apparmor.service masked inactive dead apparmor.service
LOAD = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB = The low-level unit activation state, values depend on unit type.
1 loaded units listed.
grenache-1.qa.suse.de:
UNIT LOAD ACTIVE SUB DESCRIPTION
* apparmor.service masked inactive dead apparmor.service
* openqa-reload-worker-auto-restart@10.service masked inactive dead openqa-reload-worker-auto-restart@10.service
* openqa-reload-worker-auto-restart@21.service masked inactive dead openqa-reload-worker-auto-restart@21.service
* openqa-reload-worker-auto-restart@22.service masked inactive dead openqa-reload-worker-auto-restart@22.service
* openqa-reload-worker-auto-restart@23.service masked inactive dead openqa-reload-worker-auto-restart@23.service
* openqa-reload-worker-auto-restart@25.service masked inactive dead openqa-reload-worker-auto-restart@25.service
* openqa-reload-worker-auto-restart@27.service masked inactive dead openqa-reload-worker-auto-restart@27.service
LOAD = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB = The low-level unit activation state, values depend on unit type.
7 loaded units listed.
openqaworker10.suse.de:
UNIT LOAD ACTIVE SUB DESCRIPTION
* apparmor.service masked inactive dead apparmor.service
LOAD = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB = The low-level unit activation state, values depend on unit type.
1 loaded units listed.
openqaworker-arm-1.suse.de:
UNIT LOAD ACTIVE SUB DESCRIPTION
* apparmor.service masked inactive dead apparmor.service
* postfix.service masked inactive dead postfix.service
LOAD = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB = The low-level unit activation state, values depend on unit type.
2 loaded units listed.
openqaworker-arm-3.suse.de:
UNIT LOAD ACTIVE SUB DESCRIPTION
* apparmor.service masked inactive dead apparmor.service
LOAD = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB = The low-level unit activation state, values depend on unit type.
1 loaded units listed.
openqaworker-arm-2.suse.de:
UNIT LOAD ACTIVE SUB DESCRIPTION
* apparmor.service masked inactive dead apparmor.service
* postfix.service masked inactive dead postfix.service
LOAD = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB = The low-level unit activation state, values depend on unit type.
2 loaded units listed.
Should we add an alert for a threshold of like "50"? Seems quite arbitrary, WDYT?
Updated by nicksinger over 2 years ago
I think there are two things involved here:
1. What to do with masked services that we expect to be masked because we decided we don't want them
   - apparmor seems to be disabled everywhere. I think these kinds of services should be excluded in the script querying them.
2. What is a proper threshold for masked services without it being arbitrary
   - One single dashboard for all services over all workers would provide little to no information. Single workers causing problems can trigger the alert all the time, causing alert fatigue. Disabling the alert completely removes our ability to detect "forgotten" masks on all machines.
   - One (templated) graph per worker should exist on e.g. the worker-dashboard which we already deploy for each worker
   - With this graph we can have an alert with a threshold of >0 for 1w
   - If a single worker is known to be problematic we can just disable the single alert without losing monitoring on the other machines
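As an illustration of the "masked for longer than 1w" idea outside of Grafana: `systemctl mask` creates a symlink to /dev/null, by default in /etc/systemd/system, so the age of that symlink approximates how long a unit has been masked. The sketch below is an alternative to an alert duration in Grafana, not what the monitoring script does; the function name `long_masked_units` is hypothetical and GNU find is assumed:

```shell
#!/bin/sh
# Hypothetical helper: print the names of units in directory $1 that are
# masked (symlinked to /dev/null) and whose symlink is older than $2 days.
# Relies on GNU find for -lname and -printf.
long_masked_units() {
    find "$1" -maxdepth 1 -type l -lname /dev/null -mtime +"$2" -printf '%f\n'
}

# Possible use on a real host (not executed here):
#   long_masked_units /etc/systemd/system 7
```

Note that `-mtime +7` only matches entries older than seven full 24-hour periods, so a unit masked exactly one week ago is not yet reported.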
Updated by nicksinger over 2 years ago
For postfix I think this is a perfect example where the alert would have helped us catch that it is still masked. It doesn't seem to cause problems on all workers (so it doesn't need to be excluded), but it's still masked on some workers.
Updated by okurz over 2 years ago
OK. I have added an exclude option for the monitoring script in https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/681. So we could put specific services to exclude into our salt pillars. That could address your point 1.
Not sure if trying to handle all masked services within salt pillars is feasible. So far we state that the already complicated process to take out individual openQA workers relies on masking systemd services. If we alerted on all masked services that we don't explicitly track as such in git, then we would alert ourselves about our own temporary service masking. But you stated a sane value with the 1 week alerting threshold. OK, let's see what we can do about a templated per-worker graph.
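To make the exclude idea concrete, here is a rough sketch of filtering out expected masks before counting. It is not the option added in the merge request; the function name `exclude_units` and the regex-based interface are assumptions for illustration:

```shell
#!/bin/sh
# Hypothetical helper: reads unit lines on stdin and drops those matching
# the extended regex in $1. An empty pattern keeps every line.
exclude_units() {
    if [ -z "$1" ]; then
        cat
    else
        # grep exits with status 1 when nothing is left; that is not an error here.
        grep -Ev "$1" || true
    fi
}

# Possible use on a real host (not executed here):
#   systemctl list-unit-files --state=masked --no-legend | exclude_units '^apparmor\.service'
```

The pattern could then come from a salt pillar, so that services such as apparmor, which are masked on purpose everywhere, never show up in the graph.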
Updated by okurz over 2 years ago
- Status changed from Feedback to Resolved
So we have monitoring for masked services but no alerting. I guess it's ok for now.
Updated by nicksinger over 2 years ago
Do we want to create a follow-up ticket already (I can do this) or just wait for the "OMG! This unit was masked for x years, why did nobody notice it?" moment and react to that? :)
Updated by okurz over 2 years ago
nicksinger wrote:
Do we want to create a follow-up ticket already (I can do this) or just wait for the "OMG! This unit was masked for x years, why did nobody notice it?" moment and react to that? :)
The second.