action #157438: Failed systemd services alert (jenkins-plugins-update, snapper-cleanup) - openQA Infrastructure (public) - openSUSE Project Management Tool

action #157438

closed

Failed systemd services alert (jenkins-plugins-update, snapper-cleanup)

Added by tinita about 1 year ago. Updated about 1 year ago.

Status: Resolved
Priority: High
Assignee: okurz
Category: Regressions/Crashes
Target version: openQA Project (public) - Ready
Start date: 2024-03-18
Tags: infra, reactive work

Description

Observation

Date: Sun, 17 Mar 2024 03:56:33 +0100

1 firing alert instance

Firing: Failed systemd services alert (except openqa.suse.de)
Values: B0=1
Labels:
  alertname: Failed systemd services alert (except openqa.suse.de)
  grafana_folder: Salt
  rule_uid: Uk02cifVkz
Annotations:
  message: Check failed systemd services on hosts with `systemctl --failed`. Hint: Go to parent dashboard https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services to see a list of affected hosts.

2024-03-18 10:27:30
jenkins
jenkins-plugins-update, snapper-cleanup

Related issues 1 (0 open — 1 closed)

Updated by tinita about 1 year ago

Subject changed from Failed systemd services alert (except openqa.suse.de) to Failed systemd services alert (jenkins-plugins-update, snapper-cleanup)
Description updated (diff)

Updated by okurz about 1 year ago

Tags set to infra, reactive work
Priority changed from Normal to Urgent

Updated by okurz about 1 year ago

Assignee set to mkittler

Updated by mkittler about 1 year ago · Edited

-- Boot 97dd97becca043cb99d6b59a09dc12cf --
Mar 18 03:00:00 jenkins systemd[1]: Started Automatically update jenkins plugins..
Mar 18 03:00:00 jenkins systemd[1]: jenkins-plugins-update.service: Main process exited, code=exited, status=1/FAILURE
Mar 18 03:00:00 jenkins systemd[1]: jenkins-plugins-update.service: Failed with result 'exit-code'.

-- Boot 97dd97becca043cb99d6b59a09dc12cf --
Mar 18 03:44:30 jenkins systemd[1]: Started Daily Cleanup of Snapper Snapshots.
Mar 18 03:44:30 jenkins systemd-helper[10964]: running cleanup for 'root'.
Mar 18 03:44:30 jenkins systemd-helper[10964]: running number cleanup for 'root'.
Mar 18 03:44:30 jenkins systemd-helper[10964]: IO Error (.snapshots is not a btrfs subvolume).
Mar 18 03:44:30 jenkins systemd-helper[10964]: number cleanup for 'root' failed.
Mar 18 03:44:30 jenkins systemd-helper[10964]: running timeline cleanup for 'root'.
Mar 18 03:44:30 jenkins systemd-helper[10964]: running empty-pre-post cleanup for 'root'.
Mar 18 03:44:30 jenkins systemd[1]: snapper-cleanup.service: Main process exited, code=exited, status=1/FAILURE
Mar 18 03:44:30 jenkins systemd[1]: snapper-cleanup.service: Failed with result 'exit-code'.

Both problems persist after restarting the services.

Looks like there's a problem with snapshots on that machine, indeed:

martchus@jenkins:~> sudo snapper list
 # | Type   | Pre # | Date | User | Cleanup | Description | Userdata
---+--------+-------+------+------+---------+-------------+---------
0  | single |       |      | root |         | current     |
martchus@jenkins:~> sudo ls -l /.snapshots/
total 0

After a reboot, `sudo ls -l /.snapshots/` shows the expected output again, but `sudo snapper list` hangs. Maybe that is because the cleanup is now running; I'm not sure, as systemd commands hang as well.

Lots of BTRFS warnings `qgroup rescan is already in progress` are being logged.

EDIT: It works again now that the rebalancing is done. Not sure what caused the btrfs filesystem to not be fully mounted. The web service is accessible again.
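
The `snapper list` output above, which contains only snapshot 0 (`current`), is the visible symptom of `/.snapshots` not being available. As a minimal sketch, that check could be scripted as follows. It assumes the table layout shown in this ticket, and the function name `only_current_snapshot` is hypothetical, not part of snapper:

```shell
# Hypothetical helper (not part of snapper): reads `snapper list` output on
# stdin and succeeds (exit 0) when the only snapshot row is number 0.
only_current_snapshot() {
    awk -F'[|]' '
        # First column holds the snapshot number; strip padding and markers.
        { id = $1; gsub(/[^0-9]/, "", id) }
        # Header and separator rows contain no digits and are skipped here.
        id != "" { rows++; if (id != "0") other++ }
        # Succeed only if at least one row was seen and all of them were 0.
        END { exit !(rows > 0 && other == 0) }
    '
}

# Usage on the affected host (not executed here):
#   sudo snapper list | only_current_snapshot && echo "only snapshot 0 listed"
```

A check like this only reports the symptom. It does not tell whether the subvolume is unmounted or the filesystem is still busy.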

Updated by mkittler about 1 year ago

Status changed from New to In Progress

Updated by mkittler about 1 year ago

Status changed from In Progress to Resolved

I'm resolving this ticket because I'm not sure about the root cause and thus wouldn't know what to improve to prevent this from happening again. It only happened once so far anyway.

Updated by okurz about 1 year ago

Status changed from Resolved to New
Assignee deleted (~~mkittler~~)

The same problem happened again, as reported on https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1

 # | Type   | Pre # | Date | User | Cleanup | Description | Userdata
---+--------+-------+------+------+---------+-------------+---------
0  | single |       |      | root |         | current     |

and recovered after reboot.

Updated by okurz about 1 year ago

Status changed from New to In Progress
Assignee set to okurz
Priority changed from Urgent to High

Started `btrfs balance start /` and will check afterwards.

Updated by okurz about 1 year ago

The full balance showed no problems. I have now added a local-only automatic restart of snapper-cleanup on failure, using `systemctl edit snapper-cleanup` to add the following content:

[Service]
Restart=on-failure
RestartSec=10
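
To confirm that an override like this is in effect, the unit's effective settings can be inspected. The following is only a sketch: `systemctl cat` and `systemctl show -p` are standard systemctl subcommands, while the helper name `has_restart_on_failure` is made up for illustration.

```shell
# Hypothetical helper: reads `systemctl show -p Restart ...` output on stdin
# and succeeds (exit 0) if the restart policy is exactly "on-failure".
has_restart_on_failure() {
    grep -qx 'Restart=on-failure'
}

# Usage on the affected host (not executed here):
#   systemctl cat snapper-cleanup.service    # prints the unit file plus drop-ins
#   systemctl show -p Restart -p RestartUSec snapper-cleanup.service \
#       | has_restart_on_failure && echo "restart override active"
```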

#10

Updated by okurz about 1 year ago

Status changed from In Progress to Resolved

#11

Updated by jbaier_cz about 1 year ago

Related to action #158505: Failed systemd services alert for jenkins-plugins-update size:S added
