action #159186

closed

[alert] Systemd-services alert failing due to unit "rsnapshot@alpha" on host "storage"

Added by mkittler about 2 months ago. Updated about 2 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Feature requests
Target version:
Start date: 2024-04-17
Due date:
% Done: 0%
Estimated time:

Description

Observation

See https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1&from=1713364104162&to=1713367862661. It is not clear why this is coming up now: according to salt-key -L, the storage host is not even in salt anymore.

The number of failing alerts appears to have gone back down to zero, so pausing the alert is probably not useful.

Rollback steps


Related issues 1 (0 open, 1 closed)

Related to QA - action #153742: Move of OSD machine NUE1 to PRG2 - storage.qe.prg2.suse.org (Resolved, okurz, 2024-01-16)

Actions #1

Updated by nicksinger about 2 months ago

  • Related to action #153742: Move of OSD machine NUE1 to PRG2 - storage.qe.prg2.suse.org added
Actions #2

Updated by okurz about 2 months ago

  • Category set to Feature requests
  • Target version set to Ready
Actions #3

Updated by mkittler about 2 months ago

Maybe we can close this ticket and handle it as part of #153742.

Actions #4

Updated by okurz about 2 months ago

  • Status changed from New to Blocked
  • Assignee set to okurz

#153742 needs to be done first. We can check rsnapshot after that.

Actions #5

Updated by livdywan about 2 months ago

Failed systemd services alert triggered:

2024-04-19 15:58:40 storage rsnapshot@alpha 1
Actions #6

Updated by livdywan about 2 months ago

2024-04-22 09:33:20  storage     rsnapshot@alpha     1
Actions #7

Updated by mkittler about 2 months ago

Since progress was down, I couldn't check what is currently being done about this, so I created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1161 when I saw the alert again. In any case, it looks like the storage host cannot connect to both o3 and OSD.

Actions #8

Updated by mkittler about 2 months ago

  • Description updated (diff)

To prevent this from happening again, I disabled services on the storage host; see the added rollback steps.

Actions #9

Updated by okurz about 2 months ago

  • Status changed from Blocked to In Progress

I am running rsnapshot@alpha manually now and have verified the new host key for o3.

Actions #10

Updated by openqa_review about 2 months ago

  • Due date set to 2024-05-10

Setting due date based on mean cycle time of SUSE QE Tools

Actions #11

Updated by okurz about 2 months ago

  • Due date deleted (2024-05-10)
  • Status changed from In Progress to Blocked

The manual run of rsnapshot@alpha finished successfully. I updated the host entry in /etc/salt/minion_id and restarted salt-minion, but apparently salt cannot reach the salt master:
https://sd.suse.com/servicedesk/customer/portal/1/SD-155344

Actions #12

Updated by livdywan about 2 months ago

Unfortunately, it still doesn't look that successful according to the failed systemd services alert:

2024-04-26 14:53:40 storage rsnapshot@alpha 1
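
Alert rows like the ones quoted in this ticket can be compared more easily once they are split into fields. The following is a minimal Python sketch, not part of the monitoring setup. It assumes each row consists of a date, a time, the host, the unit name and a count of failed units, separated by arbitrary whitespace. The names FailedUnit and parse_alert_row are hypothetical.

```python
from datetime import datetime
from typing import NamedTuple


class FailedUnit(NamedTuple):
    """One row of the failed systemd services table (column meaning assumed)."""
    timestamp: datetime
    host: str
    unit: str
    failed: int


def parse_alert_row(row: str) -> FailedUnit:
    # str.split() without arguments splits on any run of whitespace,
    # so rows with single or multiple spaces between columns both work.
    date, time, host, unit, failed = row.split()
    return FailedUnit(
        timestamp=datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S"),
        host=host,
        unit=unit,
        failed=int(failed),
    )


# Rows copied from the comments in this ticket.
rows = [
    "2024-04-19 15:58:40 storage rsnapshot@alpha 1",
    "2024-04-22 09:33:20  storage     rsnapshot@alpha     1",
    "2024-04-26 14:53:40 storage rsnapshot@alpha 1",
]
for row in rows:
    print(parse_alert_row(row))
```

A row with a different number of columns raises a ValueError in the unpacking step, which seems preferable to silently misreading a field.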
Actions #13

Updated by okurz about 2 months ago

Did you set up a silence then?

Actions #14

Updated by okurz about 2 months ago

  • Description updated (diff)
  • Due date set to 2024-05-13
  • Status changed from Blocked to In Progress
Actions #15

Updated by okurz about 2 months ago

I set up a silence and monitored; rsnapshot@alpha was fine again after the last service. Everything is updated with salt. I realized that the host is still on Leap 15.4, so I am doing the distribution upgrade as described in https://progress.opensuse.org/projects/openqav3/wiki/#Distribution-upgrades

Actions #16

Updated by okurz about 2 months ago

  • Due date deleted (2024-05-13)
  • Status changed from In Progress to Resolved