action #166739

closed

Consistent alerts for failed systemd services on o3 size:S

Added by livdywan 6 months ago. Updated 5 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Feature requests
Start date: 2024-09-12
Due date: 2024-10-18
% Done: 0%
Estimated time:
Description

Motivation

There is no consistent monitoring of systemd services on o3. Most errors are ignored or only acted upon when there is a visible impact.

An example of this is the errors in openqa-continuous-update:

Sep 11 03:21:01 ariel openqa-continuous-update[9321]: /usr/share/openqa/script/openqa-check-devel-repo: line 39: echo: write error: Broken pipe
Sep 10 10:44:13 ariel openqa-continuous-update[26983]: Could not refresh the repositories because of errors.
Sep 10 10:44:13 ariel openqa-continuous-update[26983]: Skipping repository 'openQA' because of the above error.
Sep 10 10:39:12 ariel openqa-continuous-update[23326]: Could not refresh the repositories because of errors.
Sep 10 10:39:12 ariel openqa-continuous-update[23326]: Skipping repository 'openQA' because of the above error.
Sep 04 05:16:11 ariel openqa-continuous-update[8892]: /usr/share/openqa/script/openqa-check-devel-repo: line 39: echo: write error: Broken pipe
Sep 02 19:52:54 ariel openqa-continuous-update[21123]: /usr/share/openqa/script/openqa-check-devel-repo: line 39: echo: write error: Broken pipe
Sep 02 00:00:02 ariel openqa-continuous-update[19069]: Could not refresh the repositories because of errors.
Sep 02 00:00:02 ariel openqa-continuous-update[19069]: Skipping repository 'openQA' because of the above error.

My guess is that nobody looked into those errors. I couldn't find any relevant tickets or Slack conversations about them.

Suggestions

  • Use Munin's systemd_status plugin (git)
  • Check whether the plugin lets us get more details about which services actually failed. If that is not feasible, leave it out. Don't implement your own :)
  • Research how systemd usually keeps a record of failures
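
For the research item, a minimal sketch of how the failure state that systemd records could be inspected by hand. This is not a replacement for the Munin plugin and not what was deployed on o3; it only assumes that `systemctl list-units --state=failed --plain --no-legend` prints one unit per line with the unit name in the first column. The function names are hypothetical.

```python
import subprocess
from typing import List


def parse_failed_units(output: str) -> List[str]:
    """Return the unit names from `systemctl list-units --plain --no-legend` output.

    Each non-empty line is expected to look like:
    UNIT LOAD ACTIVE SUB DESCRIPTION...
    """
    units = []
    for line in output.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        units.append(fields[0])
    return units


def list_failed_units() -> List[str]:
    """Ask systemd which units are currently in the 'failed' state."""
    result = subprocess.run(
        ["systemctl", "list-units", "--state=failed", "--plain", "--no-legend"],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_failed_units(result.stdout)
```

Calling `list_failed_units()` on the host would return the names of the currently failed units; the journal entries for a single unit, like the ones quoted in the motivation, can then be read with `journalctl -u <unit>`.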

Related issues 1 (0 open, 1 closed)

Copied from openQA Infrastructure (public) - action #166433: [alert] Waves of emails due to manual changes in /opt/openqa-trigger-from-obs size:S (Resolved, livdywan)
