action #73333: Failed systemd services alert (workers) flaky - openQA Infrastructure (public) - openSUSE Project Management Tool

action #73333

closed

Failed systemd services alert (workers) flaky

Added by nicksinger over 4 years ago. Updated over 4 years ago.

Status:

Resolved

Priority:

High

Assignee:

okurz

Category:

Target version:

openQA Project (public) - Ready

Start date:

2020-10-14

Due date:

2020-10-22

% Done:

Estimated time:

Tags:

alert, monitoring, systemd

Description

Observation¶

In the last 12h we had quite a few alerts for failing systemd services on ~~the worker~~ a host. Looking at https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1&panelId=6&fullscreen&edit&tab=alert&from=1602604933324&to=1602655964690 it seems like one service is repeatedly failing and recovering. The alert stated values for systemd_failed.sum between 1.2 and 0.167, which I find rather confusing; they are a result of how we sample the data.
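
A small arithmetic sketch may make the fractional values less surprising. It assumes, purely for illustration, that the alert evaluates the mean of integer failed-unit counts over a window of samples; the sample values and the function name are hypothetical and not taken from the actual panel query.

```python
from statistics import mean


def windowed_mean(samples):
    """Average integer failed-unit counts over one evaluation window."""
    return round(mean(samples), 3)


# Hypothetical window: one unit was in the failed state in 1 of 6 samples.
mostly_ok = [0, 0, 0, 1, 0, 0]
# Hypothetical window: one unit failed in every sample, a second one in 1 of 5.
mostly_failed = [1, 1, 2, 1, 1]

print(windowed_mean(mostly_ok))      # 0.167
print(windowed_mean(mostly_failed))  # 1.2
```

Under that assumption a service that flips between failed and active naturally yields non-integer values, even though the number of failed units at any single point in time is a whole number.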

Expected result¶

The alert does not trigger on the flaky user@486.service on staging-1.qa.suse.de

Suggestions¶

Check on staging-1.qa.suse.de why the service "user@486.service" is failing, e.g. with journalctl -u user@486.service, and fix that, or prevent the alert, e.g. by disabling/masking telegraf on that host.
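
The investigation step could be scripted roughly as follows. This is only a sketch: the helper names, the 12-hour window and the sample listing are assumptions for illustration. The listing is shaped like the output of `systemctl list-units --state=failed --plain --no-legend`, where each line starts with the unit name.

```python
import shlex


def journalctl_command(unit, since="12h ago"):
    """Build the argv for reading a unit's recent journal entries."""
    return ["journalctl", "--no-pager", "--since", since, "-u", unit]


def failed_units(listing):
    """Extract unit names from plain, legend-free `systemctl list-units` output."""
    return [line.split()[0] for line in listing.splitlines() if line.strip()]


# Hypothetical listing with the unit named in this ticket.
sample = "user@486.service loaded failed failed User Manager for UID 486\n"

for unit in failed_units(sample):
    # Print the command instead of running it, so the sketch works anywhere.
    print(shlex.join(journalctl_command(unit)))
```

Printing the command rather than executing it keeps the sketch free of side effects; on the affected host the printed line could be pasted into a shell.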

Updated by okurz over 4 years ago

Target version set to Ready

Updated by okurz over 4 years ago

Category set to Regressions/Crashes

Updated by nicksinger over 4 years ago

The failing service in question is user@486.service on staging-1. We might still need to refine this alert so that it provides better insight.

Updated by okurz over 4 years ago

Tags changed from alert to alert, monitoring, systemd
Project changed from openQA Project (public) to openQA Infrastructure (public)
Description updated (diff)
Category deleted (~~Regressions/Crashes~~)
Status changed from New to Workable
Priority changed from Normal to Urgent

We should prevent the alert from triggering for that specific host to remove the urgency for now. The next step after that would be to improve the information we have available in the alert. However, I would be OK with calling this ticket done once we have resolved the immediate issue, unless someone has a really good idea of how to improve the alert. We probably all agree that it could look better, but wishful thinking does not get us any further either ;)

Updated by okurz over 4 years ago

Due date set to 2020-10-22
Status changed from Workable to Feedback
Assignee set to okurz
Priority changed from Urgent to High

Created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/383 to avoid the confusion caused by the name of the alert and monitoring panel.

I assume that mkittler applied a complete "webui salt state" on that machine and that this even included something like /etc/cron.d/SLES.CRON, which runs some cron jobs that likely fail.

I removed

roles:
  - webui

from /etc/salt/grains, restarted the salt-master service and ran rm /etc/cron.d/SLES.CRON. Setting to feedback and waiting to see whether the alert still triggers.
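
For illustration, the grains edit can be sketched as a small text transformation. The function name and the extra `host` key in the example are hypothetical, and the line-based approach only understands the simple block-list layout shown above; a proper YAML parser would be more robust for real files.

```python
def drop_role(grains_text, role):
    """Remove one entry from a top-level `roles:` block list.

    The `roles:` key itself is dropped when no entries remain.
    """
    lines = grains_text.splitlines()
    result = []
    i = 0
    while i < len(lines):
        if lines[i].rstrip() == "roles:":
            # Collect the list items that belong to this key.
            j = i + 1
            items = []
            while j < len(lines) and lines[j].lstrip().startswith("- "):
                items.append(lines[j])
                j += 1
            kept = [item for item in items if item.strip() != f"- {role}"]
            if kept:
                result.append(lines[i])
                result.extend(kept)
            i = j
        else:
            result.append(lines[i])
            i += 1
    return "\n".join(result) + ("\n" if result else "")


# Hypothetical grains content with a second role that should survive.
before = "host: staging-1\nroles:\n  - webui\n  - worker\n"
print(drop_role(before, "webui"), end="")
```

With only the `webui` entry present, as in this ticket, the function returns an empty string, which matches removing both lines by hand.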

Updated by okurz over 4 years ago

I have another change prepared to cover the suggestions from nsinger for improving the output in case of alerts. However, I cannot push these changes to GitLab because my account seems to be blocked now; reported in https://infra.nue.suse.com/SelfService/Display.html?id=178670

Updated by okurz over 4 years ago

Status changed from Feedback to Resolved

All necessary changes are in and the alert has not triggered again.
