action #178015

open

coordination #161414: [epic] Improved salt based infrastructure management

[false negative] Many failed systemd services but no alert

Added by okurz 4 days ago. Updated about 6 hours ago.

Status: In Progress
Priority: High
Assignee:
Category: Regressions/Crashes
Start date: 2025-02-27
Due date:
% Done: 0%
Estimated time:

Description

Observation

It often starts innocently, as in https://suse.slack.com/archives/C02CANHLANP/p1740668762857669, when José Fernández asked why a change in os-autoinst-distri-opensuse does not seem to work on aarch64. Some steps further down the rabbit hole I found that we have many failed systemd services on various hosts. https://monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services happily shows them along with green hearts, and there are no related firing alerts though there should be.

Suggestions

  • Check the current alert definitions in Grafana
  • Check our git history in https://gitlab.suse.de/openqa/salt-states-openqa or the ticket history for changes that could have introduced the regression
  • Identify the problem, fix it, and let the team learn how it came to this
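
To cross-check the dashboard against a host while investigating, a minimal sketch along these lines might help. It assumes Python 3.9+ on the host and parses the output of `systemctl list-units --state=failed --plain --no-legend`. The function names and the `alert_firing` flag are hypothetical and not part of salt-states-openqa or the Grafana setup.

```python
import subprocess


def parse_failed_units(listing: str) -> list[str]:
    """Extract unit names from `systemctl list-units --plain --no-legend` output.

    Each non-empty line starts with the unit name, followed by the
    LOAD, ACTIVE, SUB and DESCRIPTION columns.
    """
    units = []
    for line in listing.splitlines():
        fields = line.split()
        if fields:
            units.append(fields[0])
    return units


def failed_units_on_this_host() -> list[str]:
    """Ask systemd on the local host for all units in the failed state."""
    result = subprocess.run(
        ["systemctl", "list-units", "--state=failed", "--plain", "--no-legend"],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_failed_units(result.stdout)


def missing_alert(failed: list[str], alert_firing: bool) -> bool:
    """True for the false negative described here: failures exist, no alert fires.

    `alert_firing` is a hypothetical input; how to obtain it from Grafana
    is left open in this sketch.
    """
    return bool(failed) and not alert_firing
```

Running `failed_units_on_this_host()` on an affected worker and comparing the result with what the dashboard shows would confirm whether the data reaches Grafana and only the alert rule is at fault.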

Rollback steps

  • Reset the failed state of openqa-reload-worker-auto-restart@999 on worker33 and run systemctl unmask openqa-worker-auto-restart@999.
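
The rollback step above can be sketched as a small script, assuming it is run with root privileges on worker33. "Reset the failed state" is taken here to mean `systemctl reset-failed`, which is an interpretation rather than something this ticket spells out. The `rollback` helper and its `dry_run` flag are hypothetical.

```python
import subprocess

# Unit names exactly as written in the rollback step of this ticket.
ROLLBACK_COMMANDS = [
    ["systemctl", "reset-failed", "openqa-reload-worker-auto-restart@999"],
    ["systemctl", "unmask", "openqa-worker-auto-restart@999"],
]


def rollback(dry_run: bool = True) -> list[str]:
    """Return the rollback commands as strings; execute them unless dry_run is set."""
    executed = []
    for argv in ROLLBACK_COMMANDS:
        executed.append(" ".join(argv))
        if not dry_run:
            # check=True raises CalledProcessError if systemctl exits non-zero.
            subprocess.run(argv, check=True)
    return executed
```

Calling `rollback()` only lists the commands; `rollback(dry_run=False)` runs them.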

Related issues 1 (0 open, 1 closed)

Related to openQA Infrastructure (public) - action #177318: 2 bare-metal machines are offline on OSD | Resolved | mkittler | 2025-02-17 | 2025-03-15
