action #178015 (closed)

Parent: coordination #161414: [epic] Improved salt based infrastructure management

[false negative] Many failed systemd services but no alert has fired size:S

Added by okurz 3 months ago. Updated 12 days ago.

Status: Resolved
Priority: High
Assignee: nicksinger
Category: Regressions/Crashes
Start date: 2025-02-27
Due date:
% Done: 0%
Estimated time:

Description

Observation

It often starts innocently, as in https://suse.slack.com/archives/C02CANHLANP/p1740668762857669, where José Fernández asked why a change in os-autoinst-distri-opensuse did not seem to work on aarch64. A few steps down the rabbit hole I found that we have many failed systemd services on various hosts, which https://monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services happily shows along with green hearts, yet no related alerts are firing even though they should be.
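
A quick way to confirm such failures directly on the hosts (a minimal sketch, assuming the hosts are salt minions reachable from the salt master; the '*' targeting and sudo usage are assumptions):

  # on a single host
  systemctl --failed --no-legend
  # across all salt minions, run from the salt master ('*' targeting is an assumption)
  sudo salt '*' cmd.run 'systemctl --failed --no-legend'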

Acceptance Criteria

  • AC1: It is understood why aarch64 revealed issues with systemd services and follow-up tickets are filed

Suggestions

  • Check current alert definitions in grafana
  • Check the git history of https://gitlab.suse.de/openqa/salt-states-openqa or the ticket history for commits that could have introduced a regression (see the sketch after this list)
  • Identify the problem, fix it, and let the team learn how it came to this
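
One way to hunt for regression candidates in the salt states repository (a sketch; the measurement name systemd_failed is taken from the alert query discussed in the comments below, the rest is plain git usage):

  git clone https://gitlab.suse.de/openqa/salt-states-openqa.git
  cd salt-states-openqa
  # commits whose diffs touch the systemd_failed measurement, e.g. alert or panel definitions
  git log --oneline -S systemd_failed
  # narrow down to the recent past if the list is long
  git log --oneline --since=2025-01-01 -S systemd_failed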

Rollback steps

  • Reset the failed state of openqa-reload-worker-auto-restart@999 on worker33 and run systemctl unmask openqa-worker-auto-restart@999 (see the command sketch below).
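
The rollback boils down to two systemctl calls (a sketch; to be run on worker33, root or sudo assumed):

  # on worker33
  sudo systemctl unmask openqa-worker-auto-restart@999
  sudo systemctl reset-failed openqa-reload-worker-auto-restart@999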

Related issues: 1 (0 open, 1 closed)

Related to openQA Infrastructure (public) - action #177318: 2 bare-metal machines are offline on OSD (Resolved, mkittler, 2025-02-17)

Actions #1

Updated by okurz 3 months ago

  • Tracker changed from coordination to action
Actions #2

Updated by mkittler 3 months ago

  • Status changed from New to In Progress
  • Assignee set to mkittler
Actions #3

Updated by mkittler 3 months ago · Edited

Link to Grafana with the relevant time window: https://monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1&from=2025-02-24T09%3A05%3A51.970Z&to=2025-02-28T18%3A50%3A29.508Z&timezone=UTC

When showing the query of the alert in the explore view it looks the same: https://monitor.qa.suse.de/explore?schemaVersion=1&panes=%7B%227c0%22%3A%7B%22datasource%22%3A%22000000001%22%2C%22queries%22%3A%5B%7B%22refId%22%3A%22A%22%2C%22alias%22%3A%22Sum%20of%20failed%20systemd%20services%22%2C%22groupBy%22%3A%5B%7B%22params%22%3A%5B%22%24__interval%22%5D%2C%22type%22%3A%22time%22%7D%2C%7B%22params%22%3A%5B%22null%22%5D%2C%22type%22%3A%22fill%22%7D%5D%2C%22interval%22%3A%221m%22%2C%22intervalMs%22%3A1000%2C%22maxDataPoints%22%3A43200%2C%22measurement%22%3A%22systemd_failed%22%2C%22orderByTime%22%3A%22ASC%22%2C%22policy%22%3A%22default%22%2C%22resultFormat%22%3A%22time_series%22%2C%22select%22%3A%5B%5B%7B%22params%22%3A%5B%22failed%22%5D%2C%22type%22%3A%22field%22%7D%2C%7B%22params%22%3A%5B%5D%2C%22type%22%3A%22sum%22%7D%5D%5D%2C%22tags%22%3A%5B%7B%22key%22%3A%22host%22%2C%22operator%22%3A%22!%3D%22%2C%22value%22%3A%22openqa%22%7D%5D%2C%22datasource%22%3A%7B%22type%22%3A%22influxdb%22%2C%22uid%22%3A%22000000001%22%7D%7D%5D%2C%22range%22%3A%7B%22from%22%3A%221740491060697%22%2C%22to%22%3A%221740792506518%22%7D%7D%7D&orgId=1

So the alert query seems to be correct. The alert condition also makes sense.
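
For readability, the query encoded in the explore links decodes to roughly the following InfluxQL, shown here wrapped in the InfluxDB 1.x CLI (the database name "telegraf" and the concrete 1h time window are assumptions; Grafana substitutes $timeFilter and $__interval itself):

  # database name and time window are assumptions
  influx -database telegraf -execute \
    "SELECT sum(\"failed\") FROM \"systemd_failed\" WHERE \"host\" != 'openqa' AND time > now() - 1h GROUP BY time(1m) fill(null)"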

The alert was also recently firing (2025-02-17 04:36:24). It should have been firing much sooner, though: https://monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1&from=2025-02-04T22%3A54%3A39.117Z&to=2025-02-17T14%3A43%3A34.958Z&timezone=UTC
(explore link: https://monitor.qa.suse.de/explore?schemaVersion=1&panes=%7B%227c0%22%3A%7B%22datasource%22%3A%22000000001%22%2C%22queries%22%3A%5B%7B%22refId%22%3A%22A%22%2C%22alias%22%3A%22Sum%20of%20failed%20systemd%20services%22%2C%22groupBy%22%3A%5B%7B%22params%22%3A%5B%22%24__interval%22%5D%2C%22type%22%3A%22time%22%7D%2C%7B%22params%22%3A%5B%22null%22%5D%2C%22type%22%3A%22fill%22%7D%5D%2C%22interval%22%3A%221m%22%2C%22intervalMs%22%3A1000%2C%22maxDataPoints%22%3A43200%2C%22measurement%22%3A%22systemd_failed%22%2C%22orderByTime%22%3A%22ASC%22%2C%22policy%22%3A%22default%22%2C%22resultFormat%22%3A%22time_series%22%2C%22select%22%3A%5B%5B%7B%22params%22%3A%5B%22failed%22%5D%2C%22type%22%3A%22field%22%7D%2C%7B%22params%22%3A%5B%5D%2C%22type%22%3A%22sum%22%7D%5D%5D%2C%22tags%22%3A%5B%7B%22key%22%3A%22host%22%2C%22operator%22%3A%22!%3D%22%2C%22value%22%3A%22openqa%22%7D%5D%2C%22datasource%22%3A%7B%22type%22%3A%22influxdb%22%2C%22uid%22%3A%22000000001%22%7D%7D%5D%2C%22range%22%3A%7B%22from%22%3A%221738714427023%22%2C%22to%22%3A%221739830938673%22%7D%7D%7D&orgId=1)

The weird thing is that we have an "ok" marker without a preceding "firing" marker, and the "ok" marker appears right in the middle of a problematic section where nothing was ok.

I guess the problem is these zero values: https://monitor.qa.suse.de/explore?schemaVersion=1&panes=%7B%227c0%22%3A%7B%22datasource%22%3A%22000000001%22%2C%22queries%22%3A%5B%7B%22refId%22%3A%22A%22%2C%22alias%22%3A%22Sum%20of%20failed%20systemd%20services%22%2C%22groupBy%22%3A%5B%7B%22params%22%3A%5B%22%24__interval%22%5D%2C%22type%22%3A%22time%22%7D%2C%7B%22params%22%3A%5B%22null%22%5D%2C%22type%22%3A%22fill%22%7D%5D%2C%22interval%22%3A%221m%22%2C%22intervalMs%22%3A1000%2C%22maxDataPoints%22%3A43200%2C%22measurement%22%3A%22systemd_failed%22%2C%22orderByTime%22%3A%22ASC%22%2C%22policy%22%3A%22default%22%2C%22resultFormat%22%3A%22time_series%22%2C%22select%22%3A%5B%5B%7B%22type%22%3A%22field%22%2C%22params%22%3A%5B%22failed%22%5D%7D%2C%7B%22type%22%3A%22sum%22%2C%22params%22%3A%5B%5D%7D%5D%5D%2C%22tags%22%3A%5B%7B%22key%22%3A%22host%22%2C%22operator%22%3A%22!%3D%22%2C%22value%22%3A%22openqa%22%7D%5D%2C%22datasource%22%3A%7B%22type%22%3A%22influxdb%22%2C%22uid%22%3A%22000000001%22%7D%7D%2C%7B%22refId%22%3A%22B%22%2C%22datasource%22%3A%7B%22type%22%3A%22influxdb%22%2C%22uid%22%3A%22000000001%22%7D%2C%22query%22%3A%22%22%2C%22rawQuery%22%3Atrue%2C%22resultFormat%22%3A%22time_series%22%7D%5D%2C%22range%22%3A%7B%22from%22%3A%221739274607047%22%2C%22to%22%3A%221739275074610%22%7D%7D%7D&orgId=1

So I guess the averaging we had before https://gitlab.suse.de/openqa/salt-states-openqa/-/commit/4429f893f545dc91c06db4a4db0b5d17ccadb457 made some sense. However, just reverting the MR would lead back to the equally undesirable behavior we had before.
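
Purely for illustration (this is not the actual pre-4429f893 query), averaging over a coarser time grouping is one way such isolated zero samples get smoothed out instead of dragging the series to zero for a single evaluation:

  # same assumptions as above regarding database name and time window; 10m grouping is illustrative
  influx -database telegraf -execute \
    "SELECT mean(\"failed\") FROM \"systemd_failed\" WHERE \"host\" != 'openqa' AND time > now() - 1h GROUP BY time(10m) fill(null)"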

Actions #4

Updated by mkittler 3 months ago · Edited

  • Status changed from In Progress to Feedback

MR: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1389

EDIT: The change has been deployed. I restarted the grafana service and it looks as expected on the web UI.
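
For reference, restarting Grafana on the monitoring host is a plain systemd operation (a sketch; the unit name grafana-server is the packaging default and an assumption here):

  # on the monitoring host; unit name is an assumption
  sudo systemctl restart grafana-server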

Actions #5

Updated by okurz 3 months ago

  • Blocks action #177318: 2 bare-metal machines are offline on OSD added
Actions #6

Updated by mkittler 3 months ago

I merged the MR because, considering #177318, it is really quite bad not to have this alert.

Actions #7

Updated by mkittler 3 months ago

  • Status changed from Feedback to In Progress

So far there have been no failing systemd services. I guess I'll provoke a failing unit to see whether the alert works.
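
One way to provoke such a failure, consistent with the rollback steps in the description (a sketch; it assumes the reload unit fails as long as its corresponding worker unit is masked):

  # on worker33; assumption: the reload unit cannot act on a masked worker unit and fails
  sudo systemctl mask openqa-worker-auto-restart@999
  sudo systemctl start openqa-reload-worker-auto-restart@999   # expected to fail and stay in failed state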

Actions #8

Updated by mkittler 3 months ago

This still doesn't work; now the alert cycles between pending and ok. (I provoked openqa-reload-worker-auto-restart@999 to fail constantly on worker33.)

Maybe it makes sense to switch to @nicksinger's approach then: https://stats.openqa-monitor.qa.suse.de/alerting/grafana/beefj548t0a2oc/view

Actions #9

Updated by mkittler 3 months ago

  • Blocks deleted (action #177318: 2 bare-metal machines are offline on OSD)
Actions #10

Updated by mkittler 3 months ago

  • Related to action #177318: 2 bare-metal machines are offline on OSD added
Actions #11

Updated by mkittler 3 months ago

  • Description updated (diff)
  • Assignee changed from mkittler to nicksinger

I discussed this with @nicksinger, who adjusted his approach in parallel. It now works by looking at a time interval of 5 minutes.

My previous attempt to increase the interval/time-grouping of the current alert to 150 seconds turned out to be insufficient.

We decided to go for @nicksinger's change as we now also saw that it actually works.

Actions #12

Updated by openqa_review 3 months ago

  • Due date set to 2025-03-18

Setting due date based on mean cycle time of SUSE QE Tools

Actions #13

Updated by livdywan 3 months ago

  • Subject changed from [false negative] Many failed systemd services but no alert to [false negative] Many failed systemd services but no alert has fired size:S
  • Description updated (diff)
Actions #14

Updated by nicksinger 3 months ago

I created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1393 to replace the single alert with an instantiated one, which can now be seen here: https://monitor.qa.suse.de/alerting/grafana/beefj548t0a2oc/view. Each machine has its own alert instance, as can be seen on https://monitor.qa.suse.de/alerting/grafana/beefj548t0a2oc/view?tab=instances.
The graph shows the old alert at the beginning and, since ~2025-03-05 12:00, the new one: https://monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1&viewPanel=panel-6&from=2025-03-04T09:34:03.000Z&to=now&timezone=UTC - without being flaky or triggering on and off. However, I realized that including the units as a tag might not have been a good idea: without any failing units the contents of the tag change, and with them the alert instance definition. I will remove them again; later we can look into including the failed units in alert mails.
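
A rough sketch of what a per-host query over such a 5-minute window can look like (illustrative only; the actual alert definition lives in the linked MRs, and grouping by the host tag is what yields one alert instance per machine):

  # illustrative only; database name, aggregation function and window phrasing are assumptions
  influx -database telegraf -execute \
    "SELECT max(\"failed\") FROM \"systemd_failed\" WHERE time > now() - 5m GROUP BY \"host\""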

Actions #15

Updated by nicksinger 3 months ago

  • Status changed from In Progress to Resolved

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1396 was merged and deployed (I had to use alerts_to_delete.yaml to redeploy my changes). This should be sufficient for now.

Actions #16

Updated by okurz 12 days ago

  • Due date deleted (2025-03-18)