action #157438

closed

Failed systemd services alert (jenkins-plugins-update, snapper-cleanup)

Added by tinita 9 months ago. Updated 9 months ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Regressions/Crashes
Start date:
2024-03-18
Due date:
% Done:

0%

Estimated time:

Description

Observation

Date: Sun, 17 Mar 2024 03:56:33 +0100

1 firing alert instance

Firing [stats.openqa-monitor.qa.suse.de]
Failed systemd services alert (except openqa.suse.de)
Values
B0=1 
Labels
alertname
Failed systemd services alert (except openqa.suse.de)
grafana_folder
Salt
rule_uid
Uk02cifVkz
Annotations
message
Check failed systemd services on hosts with `systemctl --failed`. Hint: Go to parent dashboard https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services to see a list of affected hosts.

2024-03-18 10:27:30
jenkins
jenkins-plugins-update, snapper-cleanup


Related issues: 1 (0 open, 1 closed)

Related to openQA Infrastructure (public) - action #158505: Failed systemd services alert for jenkins-plugins-update size:S | Resolved | okurz | 2024-03-18

Actions #1

Updated by tinita 9 months ago

  • Subject changed from Failed systemd services alert (except openqa.suse.de) to Failed systemd services alert (jenkins-plugins-update, snapper-cleanup)
  • Description updated (diff)
Actions #2

Updated by okurz 9 months ago

  • Tags set to infra, reactive work
  • Priority changed from Normal to Urgent
Actions #3

Updated by okurz 9 months ago

  • Assignee set to mkittler
Actions #4

Updated by mkittler 9 months ago · Edited

-- Boot 97dd97becca043cb99d6b59a09dc12cf --
Mar 18 03:00:00 jenkins systemd[1]: Started Automatically update jenkins plugins..
Mar 18 03:00:00 jenkins systemd[1]: jenkins-plugins-update.service: Main process exited, code=exited, status=1/FAILURE
Mar 18 03:00:00 jenkins systemd[1]: jenkins-plugins-update.service: Failed with result 'exit-code'.
-- Boot 97dd97becca043cb99d6b59a09dc12cf --
Mar 18 03:44:30 jenkins systemd[1]: Started Daily Cleanup of Snapper Snapshots.
Mar 18 03:44:30 jenkins systemd-helper[10964]: running cleanup for 'root'.
Mar 18 03:44:30 jenkins systemd-helper[10964]: running number cleanup for 'root'.
Mar 18 03:44:30 jenkins systemd-helper[10964]: IO Error (.snapshots is not a btrfs subvolume).
Mar 18 03:44:30 jenkins systemd-helper[10964]: number cleanup for 'root' failed.
Mar 18 03:44:30 jenkins systemd-helper[10964]: running timeline cleanup for 'root'.
Mar 18 03:44:30 jenkins systemd-helper[10964]: running empty-pre-post cleanup for 'root'.
Mar 18 03:44:30 jenkins systemd[1]: snapper-cleanup.service: Main process exited, code=exited, status=1/FAILURE
Mar 18 03:44:30 jenkins systemd[1]: snapper-cleanup.service: Failed with result 'exit-code'.

Both problems persist after restarting the services.
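
The journal excerpts above can be reduced to just the names of the failing units. This is a minimal sketch, assuming journal text in the format shown above is piped on stdin; the function name `failed_units` is hypothetical and not part of any tool used in this ticket.

```shell
#!/bin/sh
# Hypothetical helper: print the unique unit names that logged
# "Failed with result" in journal text read from stdin.
failed_units() {
    sed -n 's/^.*systemd\[[0-9]*\]: \([^ ]*\): Failed with result .*$/\1/p' | sort -u
}

# Example use on the affected host (needs read access to the journal):
#   journalctl -u jenkins-plugins-update -u snapper-cleanup --since "2024-03-18" | failed_units
```

Matching on the "Failed with result" line rather than the "Main process exited" line yields one entry per failed run of a unit, which `sort -u` then collapses.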

Looks like there's a problem with snapshots on that machine, indeed:

martchus@jenkins:~> sudo snapper list
 # | Type   | Pre # | Date | User | Cleanup | Description | Userdata
---+--------+-------+------+------+---------+-------------+---------
0  | single |       |      | root |         | current     |
martchus@jenkins:~> sudo ls -l /.snapshots/
total 0

After a reboot, `sudo ls -l /.snapshots/` shows the expected output again, but `sudo snapper list` hangs. Maybe that is because the cleanup is now running; I'm not sure, as systemd commands hang as well.

Lots of BTRFS warnings "qgroup rescan is already in progress" are being logged.

EDIT: It works again now that the rebalancing is done. Not sure what caused the btrfs filesystem to not be fully mounted. The web service is accessible again.
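
The empty `/.snapshots/` listing and the "not a btrfs subvolume" error are consistent with the snapshots subvolume not being mounted. The following is a minimal sketch of a check for that condition. It assumes `/.snapshots` is a separate btrfs mount point on this host (not confirmed in this ticket); the function name `is_btrfs_mounted` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the given mount point is listed
# as a btrfs mount in the mounts table (default: /proc/self/mounts).
is_btrfs_mounted() {
    mountpoint=$1
    table=${2:-/proc/self/mounts}
    # Field 2 is the mount point, field 3 the filesystem type.
    awk -v mp="$mountpoint" \
        '$2 == mp && $3 == "btrfs" { found = 1 } END { exit !found }' "$table"
}

# Assumption: /.snapshots is expected to be its own btrfs mount here.
if is_btrfs_mounted /.snapshots; then
    echo "/.snapshots is mounted as btrfs"
else
    echo "/.snapshots is not mounted as btrfs"
fi
```

The optional second argument lets the check run against a saved copy of the mounts table instead of the live system.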

Actions #5

Updated by mkittler 9 months ago

  • Status changed from New to In Progress
Actions #6

Updated by mkittler 9 months ago

  • Status changed from In Progress to Resolved

I'm resolving this ticket because I'm not sure about the root cause and thus wouldn't know what to improve to prevent this from happening again. It only happened once so far anyway.

Actions #7

Updated by okurz 9 months ago

  • Status changed from Resolved to New
  • Assignee deleted (mkittler)

The same problem happened again, as reported on https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1

 # | Type   | Pre # | Date | User | Cleanup | Description | Userdata
---+--------+-------+------+------+---------+-------------+---------
0  | single |       |      | root |         | current     |         

and recovered after reboot.

Actions #8

Updated by okurz 9 months ago

  • Status changed from New to In Progress
  • Assignee set to okurz
  • Priority changed from Urgent to High

Started `btrfs balance start /` and will check again afterwards.

Actions #9

Updated by okurz 9 months ago

The full balance ran without problems. I have now added a local-only automatic restart of snapper-cleanup on failure, using `systemctl edit snapper-cleanup` to add the following content:

[Service]
Restart=on-failure
RestartSec=10
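
For reference, the same override could be written without the interactive editor. This is a minimal sketch, assuming the default location that `systemctl edit` uses for drop-ins (`/etc/systemd/system/<unit>.d/override.conf`); the function name `write_restart_dropin` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: write the restart drop-in for a unit below a
# systemd configuration directory (default: /etc/systemd/system).
write_restart_dropin() {
    unit=$1
    base=${2:-/etc/systemd/system}
    dir="$base/$unit.d"
    mkdir -p "$dir" &&
    printf '%s\n' '[Service]' 'Restart=on-failure' 'RestartSec=10' > "$dir/override.conf"
}

# Example use on the affected host (as root):
#   write_restart_dropin snapper-cleanup.service
#   systemctl daemon-reload
#   systemctl cat snapper-cleanup.service   # shows the unit plus its drop-ins
```

Because the change lives only on the host and is not managed through configuration management, it would presumably be lost if the machine is reinstalled.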
Actions #10

Updated by okurz 9 months ago

  • Status changed from In Progress to Resolved
Actions #11

Updated by jbaier_cz 9 months ago

  • Related to action #158505: Failed systemd services alert for jenkins-plugins-update size:S added