action #178015

open

coordination #161414: [epic] Improved salt based infrastructure management

[false negative] Many failed systemd services but no alert

Added by okurz 4 days ago. Updated about 3 hours ago.

Status: In Progress
Priority: High
Assignee: mkittler
Category: Regressions/Crashes
Start date: 2025-02-27
Due date:
% Done: 0%
Estimated time:

Description

Observation

It often starts innocently, as in https://suse.slack.com/archives/C02CANHLANP/p1740668762857669 where José Fernández asked why a change in os-autoinst-distri-opensuse did not seem to work on aarch64. A few steps down the rabbit hole I found that we have many failed systemd services on various hosts, which https://monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services happily shows along with green hearts, yet there are no related firing alerts although there should be.
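
For reference, a quick way to enumerate the failed units that the dashboard is counting (a minimal sketch, assuming shell access to the salt master; the target glob is just an example):

  # On the salt master: list failed units on all minions
  salt '*' cmd.run 'systemctl --failed --no-legend'
  # Or directly on a single host
  systemctl list-units --state=failed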

Suggestions

  • Check the current alert definitions in Grafana (see the sketch after this list for one way to dump them via the API)
  • Check the git history in https://gitlab.suse.de/openqa/salt-states-openqa and the ticket history for candidate changes that could have introduced the regression
  • Identify the problem, fix it and let the team learn how it came to this
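
For the first suggestion, here is a minimal sketch of dumping the Grafana-managed alert rules via the HTTP API (assuming a service-account token with alerting read access and that this Grafana version exposes the alert provisioning endpoint):

  # Dump Grafana-managed alert rules as JSON; $GRAFANA_TOKEN is an assumption
  curl -s -H "Authorization: Bearer $GRAFANA_TOKEN" \
    https://monitor.qa.suse.de/api/v1/provisioning/alert-rules \
    | jq '.[] | {title, condition}'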

Related issues 1 (0 open, 1 closed)

Related to openQA Infrastructure (public) - action #177318: 2 bare-metal machines are offline on OSD (Resolved, mkittler, 2025-02-17 – 2025-03-15)

Actions #1

Updated by okurz 4 days ago

  • Tracker changed from coordination to action
Actions #2

Updated by mkittler 3 days ago

  • Status changed from New to In Progress
  • Assignee set to mkittler
Actions #3

Updated by mkittler 3 days ago · Edited

Link to Grafana with the relevant time window: https://monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1&from=2025-02-24T09%3A05%3A51.970Z&to=2025-02-28T18%3A50%3A29.508Z&timezone=UTC

The alert's query looks the same when shown in the explore view: https://monitor.qa.suse.de/explore?schemaVersion=1&panes=%7B%227c0%22%3A%7B%22datasource%22%3A%22000000001%22%2C%22queries%22%3A%5B%7B%22refId%22%3A%22A%22%2C%22alias%22%3A%22Sum%20of%20failed%20systemd%20services%22%2C%22groupBy%22%3A%5B%7B%22params%22%3A%5B%22%24__interval%22%5D%2C%22type%22%3A%22time%22%7D%2C%7B%22params%22%3A%5B%22null%22%5D%2C%22type%22%3A%22fill%22%7D%5D%2C%22interval%22%3A%221m%22%2C%22intervalMs%22%3A1000%2C%22maxDataPoints%22%3A43200%2C%22measurement%22%3A%22systemd_failed%22%2C%22orderByTime%22%3A%22ASC%22%2C%22policy%22%3A%22default%22%2C%22resultFormat%22%3A%22time_series%22%2C%22select%22%3A%5B%5B%7B%22params%22%3A%5B%22failed%22%5D%2C%22type%22%3A%22field%22%7D%2C%7B%22params%22%3A%5B%5D%2C%22type%22%3A%22sum%22%7D%5D%5D%2C%22tags%22%3A%5B%7B%22key%22%3A%22host%22%2C%22operator%22%3A%22!%3D%22%2C%22value%22%3A%22openqa%22%7D%5D%2C%22datasource%22%3A%7B%22type%22%3A%22influxdb%22%2C%22uid%22%3A%22000000001%22%7D%7D%5D%2C%22range%22%3A%7B%22from%22%3A%221740491060697%22%2C%22to%22%3A%221740792506518%22%7D%7D%7D&orgId=1

So the alert query seems to be correct. The alert condition also makes sense.
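
Decoded from the explore link, the query behind the panel and the alert corresponds roughly to this InfluxQL (aliased as "Sum of failed systemd services"):

  SELECT sum("failed")
  FROM "systemd_failed"
  WHERE "host" != 'openqa' AND $timeFilter
  GROUP BY time($__interval) fill(null)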

The alert was also recently firing (2025-02-17 04:36:24). It should have been firing much sooner, though: https://monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1&from=2025-02-04T22%3A54%3A39.117Z&to=2025-02-17T14%3A43%3A34.958Z&timezone=UTC
(explore link: https://monitor.qa.suse.de/explore?schemaVersion=1&panes=%7B%227c0%22%3A%7B%22datasource%22%3A%22000000001%22%2C%22queries%22%3A%5B%7B%22refId%22%3A%22A%22%2C%22alias%22%3A%22Sum%20of%20failed%20systemd%20services%22%2C%22groupBy%22%3A%5B%7B%22params%22%3A%5B%22%24__interval%22%5D%2C%22type%22%3A%22time%22%7D%2C%7B%22params%22%3A%5B%22null%22%5D%2C%22type%22%3A%22fill%22%7D%5D%2C%22interval%22%3A%221m%22%2C%22intervalMs%22%3A1000%2C%22maxDataPoints%22%3A43200%2C%22measurement%22%3A%22systemd_failed%22%2C%22orderByTime%22%3A%22ASC%22%2C%22policy%22%3A%22default%22%2C%22resultFormat%22%3A%22time_series%22%2C%22select%22%3A%5B%5B%7B%22params%22%3A%5B%22failed%22%5D%2C%22type%22%3A%22field%22%7D%2C%7B%22params%22%3A%5B%5D%2C%22type%22%3A%22sum%22%7D%5D%5D%2C%22tags%22%3A%5B%7B%22key%22%3A%22host%22%2C%22operator%22%3A%22!%3D%22%2C%22value%22%3A%22openqa%22%7D%5D%2C%22datasource%22%3A%7B%22type%22%3A%22influxdb%22%2C%22uid%22%3A%22000000001%22%7D%7D%5D%2C%22range%22%3A%7B%22from%22%3A%221738714427023%22%2C%22to%22%3A%221739830938673%22%7D%7D%7D&orgId=1)

The weird thing is that there is an "ok" marker without a preceding "firing" marker, and the "ok" marker appears right in the middle of a problematic section where nothing was ok.

I guess the problem is these zero values: https://monitor.qa.suse.de/explore?schemaVersion=1&panes=%7B%227c0%22%3A%7B%22datasource%22%3A%22000000001%22%2C%22queries%22%3A%5B%7B%22refId%22%3A%22A%22%2C%22alias%22%3A%22Sum%20of%20failed%20systemd%20services%22%2C%22groupBy%22%3A%5B%7B%22params%22%3A%5B%22%24__interval%22%5D%2C%22type%22%3A%22time%22%7D%2C%7B%22params%22%3A%5B%22null%22%5D%2C%22type%22%3A%22fill%22%7D%5D%2C%22interval%22%3A%221m%22%2C%22intervalMs%22%3A1000%2C%22maxDataPoints%22%3A43200%2C%22measurement%22%3A%22systemd_failed%22%2C%22orderByTime%22%3A%22ASC%22%2C%22policy%22%3A%22default%22%2C%22resultFormat%22%3A%22time_series%22%2C%22select%22%3A%5B%5B%7B%22type%22%3A%22field%22%2C%22params%22%3A%5B%22failed%22%5D%7D%2C%7B%22type%22%3A%22sum%22%2C%22params%22%3A%5B%5D%7D%5D%5D%2C%22tags%22%3A%5B%7B%22key%22%3A%22host%22%2C%22operator%22%3A%22!%3D%22%2C%22value%22%3A%22openqa%22%7D%5D%2C%22datasource%22%3A%7B%22type%22%3A%22influxdb%22%2C%22uid%22%3A%22000000001%22%7D%7D%2C%7B%22refId%22%3A%22B%22%2C%22datasource%22%3A%7B%22type%22%3A%22influxdb%22%2C%22uid%22%3A%22000000001%22%7D%2C%22query%22%3A%22%22%2C%22rawQuery%22%3Atrue%2C%22resultFormat%22%3A%22time_series%22%7D%5D%2C%22range%22%3A%7B%22from%22%3A%221739274607047%22%2C%22to%22%3A%221739275074610%22%7D%7D%7D&orgId=1

So I guess the averaging we had before https://gitlab.suse.de/openqa/salt-states-openqa/-/commit/4429f893f545dc91c06db4a4db0b5d17ccadb457 made some sense. However, simply reverting that change would bring back the behavior we had before, which was also not desirable.
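
Purely to illustrate that trade-off (not necessarily what the query looked like before the linked commit), smoothing could be done in InfluxQL along these lines; the 30m window and fill(0) are arbitrary choices:

  -- Illustrative only: average the per-minute sums over a longer window so that
  -- isolated zero or missing samples do not immediately reset the alert state
  SELECT mean("sum_failed")
  FROM (
    SELECT sum("failed") AS "sum_failed"
    FROM "systemd_failed"
    WHERE "host" != 'openqa' AND time > now() - 30m
    GROUP BY time(1m) fill(0)
  )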

Actions #4

Updated by mkittler 3 days ago · Edited

  • Status changed from In Progress to Feedback

MR: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1389

EDIT: The change has been deployed. I restarted the grafana service and it looks as expected on the web UI.
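
For reference, the restart amounts to something like this on the monitoring host (assuming the stock grafana-server unit name):

  sudo systemctl restart grafana-server
  systemctl status grafana-server --no-pager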

Actions #5

Updated by okurz 3 days ago

  • Blocks action #177318: 2 bare-metal machines are offline on OSD added
Actions #6

Updated by mkittler 3 days ago

I merged the MR because, considering #177318, it is really quite bad not to have this alert.

Actions #7

Updated by mkittler about 5 hours ago

  • Status changed from Feedback to In Progress

So far there have been no failing systemd services. I'll provoke a failing unit to see whether the alert works.
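
One way to provoke such a failure (a sketch; the transient unit name is made up for illustration):

  # Create a transient service that exits non-zero and therefore ends up in "failed" state
  systemd-run --unit=poo178015-alert-test /bin/false
  # It should then show up here and, after collection, in the dashboard/alert query
  systemctl --failed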

Actions #8

Updated by mkittler about 3 hours ago

This still doesn't work; now the alert cycles between pending and ok. (I provoked openqa-reload-worker-auto-restart@999 to fail constantly on worker33.)

Maybe it makes sense to switch to @nicksinger's approach then: https://stats.openqa-monitor.qa.suse.de/alerting/grafana/beefj548t0a2oc/view

Actions #9

Updated by mkittler about 3 hours ago

  • Blocks deleted (action #177318: 2 bare-metal machines are offline on OSD)
Actions #10

Updated by mkittler about 3 hours ago

  • Related to action #177318: 2 bare-metal machines are offline on OSD added