action #178015 (closed)

Parent: coordination #161414: [epic] Improved salt based infrastructure management

[false negative] Many failed systemd services but no alert has fired size:S

Added by okurz 3 months ago. Updated 12 days ago.

Status: Resolved
Priority: High
Assignee: nicksinger
Category: Regressions/Crashes
Start date: 2025-02-27
Due date:
% Done: 0%
Estimated time:

Description

Observation

It often starts innocently, as in https://suse.slack.com/archives/C02CANHLANP/p1740668762857669, where José Fernández asked why a change in os-autoinst-distri-opensuse did not seem to work on aarch64. A few steps down the rabbit hole I found that we have many failed systemd services on various hosts, which https://monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services happily shows along with green hearts, yet no related alerts are firing even though they should be.
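
A quick way to confirm such failures directly on the hosts (a minimal sketch, assuming the hosts are salt minions reachable from the salt master; the '*' targeting and sudo usage are assumptions):

  # on a single host
  systemctl --failed --no-legend
  # across all salt minions, run from the salt master ('*' targeting is an assumption)
  sudo salt '*' cmd.run 'systemctl --failed --no-legend'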

Acceptance Criteria

  • AC1: It is understood why aarch64 revealed issues with systemd services and follow-up tickets are filed

Suggestions

  • Check current alert definitions in grafana
  • Check the git history of https://gitlab.suse.de/openqa/salt-states-openqa or the ticket history for commits that could have introduced a regression (see the sketch after this list)
  • Identify the problem, fix it, and let the team learn how it came to this
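
One way to hunt for regression candidates in the salt states repository (a sketch; the measurement name systemd_failed is taken from the alert query discussed in the comments below, the rest is plain git usage):

  git clone https://gitlab.suse.de/openqa/salt-states-openqa.git
  cd salt-states-openqa
  # commits whose diffs touch the systemd_failed measurement, e.g. alert or panel definitions
  git log --oneline -S systemd_failed
  # narrow down to the recent past if the list is long
  git log --oneline --since=2025-01-01 -S systemd_failed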

Rollback steps

  • Reset the failed state of openqa-reload-worker-auto-restart@999 on worker33 and run systemctl unmask openqa-worker-auto-restart@999 (see the command sketch below).
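
The rollback boils down to two systemctl calls (a sketch; to be run on worker33, root or sudo assumed):

  # on worker33
  sudo systemctl unmask openqa-worker-auto-restart@999
  sudo systemctl reset-failed openqa-reload-worker-auto-restart@999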

Related issues: 1 (0 open, 1 closed)

Related to openQA Infrastructure (public) - action #177318: 2 bare-metal machines are offline on OSD (Resolved, mkittler, 2025-02-17)

Actions #1

Updated by okurz 3 months ago

  • Tracker changed from coordination to action
Actions #2

Updated by mkittler 3 months ago

  • Status changed from New to In Progress
  • Assignee set to mkittler
Actions #3

Updated by mkittler 3 months ago · Edited

Link to Grafana with the relevant time window: https://monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1&from=2025-02-24T09%3A05%3A51.970Z&to=2025-02-28T18%3A50%3A29.508Z&timezone=UTC

When showing the query of the alert in the explore view it looks the same: https://monitor.qa.suse.de/explore?schemaVersion=1&panes=%7B%227c0%22%3A%7B%22datasource%22%3A%22000000001%22%2C%22queries%22%3A%5B%7B%22refId%22%3A%22A%22%2C%22alias%22%3A%22Sum%20of%20failed%20systemd%20services%22%2C%22groupBy%22%3A%5B%7B%22params%22%3A%5B%22%24__interval%22%5D%2C%22type%22%3A%22time%22%7D%2C%7B%22params%22%3A%5B%22null%22%5D%2C%22type%22%3A%22fill%22%7D%5D%2C%22interval%22%3A%221m%22%2C%22intervalMs%22%3A1000%2C%22maxDataPoints%22%3A43200%2C%22measurement%22%3A%22systemd_failed%22%2C%22orderByTime%22%3A%22ASC%22%2C%22policy%22%3A%22default%22%2C%22resultFormat%22%3A%22time_series%22%2C%22select%22%3A%5B%5B%7B%22params%22%3A%5B%22failed%22%5D%2C%22type%22%3A%22field%22%7D%2C%7B%22params%22%3A%5B%5D%2C%22type%22%3A%22sum%22%7D%5D%5D%2C%22tags%22%3A%5B%7B%22key%22%3A%22host%22%2C%22operator%22%3A%22!%3D%22%2C%22value%22%3A%22openqa%22%7D%5D%2C%22datasource%22%3A%7B%22type%22%3A%22influxdb%22%2C%22uid%22%3A%22000000001%22%7D%7D%5D%2C%22range%22%3A%7B%22from%22%3A%221740491060697%22%2C%22to%22%3A%221740792506518%22%7D%7D%7D&orgId=1

So the alert query seems to be correct. The alert condition also makes sense.
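
For readability, the query encoded in the explore links decodes to roughly the following InfluxQL, shown here wrapped in the InfluxDB 1.x CLI (the database name "telegraf" and the concrete 1h time window are assumptions; Grafana substitutes $timeFilter and $__interval itself):

  # database name and time window are assumptions
  influx -database telegraf -execute \
    "SELECT sum(\"failed\") FROM \"systemd_failed\" WHERE \"host\" != 'openqa' AND time > now() - 1h GROUP BY time(1m) fill(null)"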

The alert was also recently firing (2025-02-17 04:36:24). It should have been firing much sooner, though: https://monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1&from=2025-02-04T22%3A54%3A39.117Z&to=2025-02-17T14%3A43%3A34.958Z&timezone=UTC
(explore link: https://monitor.qa.suse.de/explore?schemaVersion=1&panes=%7B%227c0%22%3A%7B%22datasource%22%3A%22000000001%22%2C%22queries%22%3A%5B%7B%22refId%22%3A%22A%22%2C%22alias%22%3A%22Sum%20of%20failed%20systemd%20services%22%2C%22groupBy%22%3A%5B%7B%22params%22%3A%5B%22%24__interval%22%5D%2C%22type%22%3A%22time%22%7D%2C%7B%22params%22%3A%5B%22null%22%5D%2C%22type%22%3A%22fill%22%7D%5D%2C%22interval%22%3A%221m%22%2C%22intervalMs%22%3A1000%2C%22maxDataPoints%22%3A43200%2C%22measurement%22%3A%22systemd_failed%22%2C%22orderByTime%22%3A%22ASC%22%2C%22policy%22%3A%22default%22%2C%22resultFormat%22%3A%22time_series%22%2C%22select%22%3A%5B%5B%7B%22params%22%3A%5B%22failed%22%5D%2C%22type%22%3A%22field%22%7D%2C%7B%22params%22%3A%5B%5D%2C%22type%22%3A%22sum%22%7D%5D%5D%2C%22tags%22%3A%5B%7B%22key%22%3A%22host%22%2C%22operator%22%3A%22!%3D%22%2C%22value%22%3A%22openqa%22%7D%5D%2C%22datasource%22%3A%7B%22type%22%3A%22influxdb%22%2C%22uid%22%3A%22000000001%22%7D%7D%5D%2C%22range%22%3A%7B%22from%22%3A%221738714427023%22%2C%22to%22%3A%221739830938673%22%7D%7D%7D&orgId=1)

The weird thing is that we have an "ok" marker without a preceding "firing" marker, and the "ok" marker appears right in the middle of a problematic section where nothing was ok.

I guess the problem is these zero values: https://monitor.qa.suse.de/explore?schemaVersion=1&panes=%7B%227c0%22%3A%7B%22datasource%22%3A%22000000001%22%2C%22queries%22%3A%5B%7B%22refId%22%3A%22A%22%2C%22alias%22%3A%22Sum%20of%20failed%20systemd%20services%22%2C%22groupBy%22%3A%5B%7B%22params%22%3A%5B%22%24__interval%22%5D%2C%22type%22%3A%22time%22%7D%2C%7B%22params%22%3A%5B%22null%22%5D%2C%22type%22%3A%22fill%22%7D%5D%2C%22interval%22%3A%221m%22%2C%22intervalMs%22%3A1000%2C%22maxDataPoints%22%3A43200%2C%22measurement%22%3A%22systemd_failed%22%2C%22orderByTime%22%3A%22ASC%22%2C%22policy%22%3A%22default%22%2C%22resultFormat%22%3A%22time_series%22%2C%22select%22%3A%5B%5B%7B%22type%22%3A%22field%22%2C%22params%22%3A%5B%22failed%22%5D%7D%2C%7B%22type%22%3A%22sum%22%2C%22params%22%3A%5B%5D%7D%5D%5D%2C%22tags%22%3A%5B%7B%22key%22%3A%22host%22%2C%22operator%22%3A%22!%3D%22%2C%22value%22%3A%22openqa%22%7D%5D%2C%22datasource%22%3A%7B%22type%22%3A%22influxdb%22%2C%22uid%22%3A%22000000001%22%7D%7D%2C%7B%22refId%22%3A%22B%22%2C%22datasource%22%3A%7B%22type%22%3A%22influxdb%22%2C%22uid%22%3A%22000000001%22%7D%2C%22query%22%3A%22%22%2C%22rawQuery%22%3Atrue%2C%22resultFormat%22%3A%22time_series%22%7D%5D%2C%22range%22%3A%7B%22from%22%3A%221739274607047%22%2C%22to%22%3A%221739275074610%22%7D%7D%7D&orgId=1

So I guess the averaging we had before https://gitlab.suse.de/openqa/salt-states-openqa/-/commit/4429f893f545dc91c06db4a4db0b5d17ccadb457 made some sense. However, just reverting the MR would lead back to the equally undesirable behavior we had before.
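
Purely for illustration (this is not the actual pre-4429f893 query), averaging over a coarser time grouping is one way such isolated zero samples get smoothed out instead of dragging the series to zero for a single evaluation:

  # same assumptions as above regarding database name and time window; 10m grouping is illustrative
  influx -database telegraf -execute \
    "SELECT mean(\"failed\") FROM \"systemd_failed\" WHERE \"host\" != 'openqa' AND time > now() - 1h GROUP BY time(10m) fill(null)"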

Actions #4

Updated by mkittler 3 months ago · Edited

  • Status changed from In Progress to Feedback

MR: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1389

EDIT: The change has been deployed. I restarted the grafana service and it looks as expected on the web UI.
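
For reference, restarting Grafana on the monitoring host is a plain systemd operation (a sketch; the unit name grafana-server is the packaging default and an assumption here):

  # on the monitoring host; unit name is an assumption
  sudo systemctl restart grafana-server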

Actions #5

Updated by okurz 3 months ago

  • Blocks action #177318: 2 bare-metal machines are offline on OSD added
Actions #6

Updated by mkittler 3 months ago

I merged the MR because, considering #177318, it is really quite bad not to have this alert.

Actions #7

Updated by mkittler 3 months ago

  • Status changed from Feedback to In Progress

So far there have been no failing systemd services. I guess I'll provoke a failing unit to see whether the alert works.
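
One way to provoke such a failure, consistent with the rollback steps in the description (a sketch; it assumes the reload unit fails as long as its corresponding worker unit is masked):

  # on worker33; assumption: the reload unit cannot act on a masked worker unit and fails
  sudo systemctl mask openqa-worker-auto-restart@999
  sudo systemctl start openqa-reload-worker-auto-restart@999   # expected to fail and stay in failed state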

Actions #8

Updated by mkittler 3 months ago

This still doesn't work; now the alert cycles between pending and ok. (I provoked openqa-reload-worker-auto-restart@999 to fail constantly on worker33.)

Maybe it makes sense to switch to @nicksinger's approach then: https://stats.openqa-monitor.qa.suse.de/alerting/grafana/beefj548t0a2oc/view

Actions #9

Updated by mkittler 3 months ago

  • Blocks deleted (action #177318: 2 bare-metal machines are offline on OSD)
Actions #10

Updated by mkittler 3 months ago

  • Related to action #177318: 2 bare-metal machines are offline on OSD added
Actions #11

Updated by mkittler 3 months ago

  • Description updated (diff)
  • Assignee changed from mkittler to nicksinger

I discussed this with @nicksinger, who adjusted his approach in parallel. It now works by looking at a time interval of 5 minutes.

My previous attempt to increase the interval/time-grouping of the current alert to 150 seconds turned out to be insufficient.

We decided to go for @nicksinger's change as we now also saw that it actually works.

Actions #12

Updated by openqa_review 3 months ago

  • Due date set to 2025-03-18

Setting due date based on mean cycle time of SUSE QE Tools

Actions #13

Updated by livdywan 3 months ago

  • Subject changed from [false negative] Many failed systemd services but no alert to [false negative] Many failed systemd services but no alert has fired size:S
  • Description updated (diff)
Actions #14

Updated by nicksinger 3 months ago

I created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1393 to replace the single alert with an instantiated one, which can now be seen here: https://monitor.qa.suse.de/alerting/grafana/beefj548t0a2oc/view. Each machine has its own alert instance, as can be seen on https://monitor.qa.suse.de/alerting/grafana/beefj548t0a2oc/view?tab=instances.
The graph shows the old alert at the beginning and, since ~2025-03-05 12:00, the new one: https://monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1&viewPanel=panel-6&from=2025-03-04T09:34:03.000Z&to=now&timezone=UTC - without being flaky or triggering on and off. However, I realized that including the units as a tag might not have been a good idea: without any failing units the contents of the tag change, and with them the alert instance definition. I will remove them again; later we can look into including the failed units in alert mails.
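
A rough sketch of what a per-host query over such a 5-minute window can look like (illustrative only; the actual alert definition lives in the linked MRs, and grouping by the host tag is what yields one alert instance per machine):

  # illustrative only; database name, aggregation function and window phrasing are assumptions
  influx -database telegraf -execute \
    "SELECT max(\"failed\") FROM \"systemd_failed\" WHERE time > now() - 5m GROUP BY \"host\""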

Actions #15

Updated by nicksinger 3 months ago

  • Status changed from In Progress to Resolved

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1396 was merged and deployed (I had to use alerts_to_delete.yaml to redeploy my changes). This should be sufficient for now.

Actions #16

Updated by okurz 12 days ago

  • Due date deleted (2025-03-18)