action #177318
coordination #161414: [epic] Improved salt based infrastructure management (closed)
2 bare-metal machines are offline on OSD
Description
bare-metal1 and bare-metal2 can be accessed over IPMI. They were working a day ago but now show as offline on the OSD webUI for unknown reasons. Could you check what happened and bring them back to the worker pool?
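For reference, the power state of such hosts can typically be checked over IPMI with ipmitool; the BMC hostnames and credentials below are placeholders, not the real values for these machines:

ipmitool -I lanplus -H bare-metal1-ipmi.example.net -U ADMIN -P '<password>' chassis power status
ipmitool -I lanplus -H bare-metal2-ipmi.example.net -U ADMIN -P '<password>' chassis power status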
Updated by Julie_CAO 3 days ago
Is this issue related to https://progress.opensuse.org/issues/177892? When shall we expect them back in the pool? Our tests are blocked because these 2 machines are missing.
Updated by okurz 3 days ago
- Priority changed from Normal to High
- Target version changed from future to Ready
Julie_CAO wrote in #note-3:
Is this issue related to https://progress.opensuse.org/issues/177892?
When you link tickets please use the format #<id> so that we have a direct preview of the subject and status of the linked ticket.
No, this is not related to #177892. If the worker instances were unavailable due to a limited number of available connection slots, then the word "limited" would show up in the status column.
When shall we expect them back in the pool? Our tests are blocked because these 2 machines are missing.
Oh, ok. You created the ticket with priority "Normal" so I assumed that this was not immediately blocking. Feel welcome to use "High" for such cases in the future. We should work on #178015 first; the missing alert described there is why we didn't notice that those worker instances are offline when they shouldn't be.
Updated by okurz 3 days ago
- Blocked by action #178015: [false negative] Many failed systemd services but no alert added
Updated by mkittler 3 days ago · Edited
- Status changed from New to In Progress
- Assignee set to mkittler
Those units are masked:
martchus@worker33:~> sudo systemctl status openqa-worker-auto-restart@{16,17}.service
Warning: The unit file, source configuration file or drop-ins of openqa-worker-auto-restart@16.service changed on disk. Run 'systemctl daemon-reload' to reload units.
Warning: The unit file, source configuration file or drop-ins of openqa-worker-auto-restart@17.service changed on disk. Run 'systemctl daemon-reload' to reload units.
○ openqa-worker-auto-restart@16.service
     Loaded: masked (Reason: Unit openqa-worker-auto-restart@16.service is masked.)
    Drop-In: /etc/systemd/system/openqa-worker-auto-restart@.service.d
             └─20-nvme-autoformat.conf, 30-openqa-max-inactive-caching-downloads.conf
     Active: inactive (dead)

○ openqa-worker-auto-restart@17.service
     Loaded: masked (Reason: Unit openqa-worker-auto-restart@17.service is masked.)
    Drop-In: /etc/systemd/system/openqa-worker-auto-restart@.service.d
             └─20-nvme-autoformat.conf, 30-openqa-max-inactive-caching-downloads.conf
     Active: inactive (dead)
That means someone intentionally disabled them. The question is what the intention was.
Judging by the job history on https://openqa.suse.de/admin/workers/2662 and https://openqa.suse.de/admin/workers/2663, those worker slots don't seem to be broken. So I took the liberty of unmasking them.
Note that many other units are masked on this host as well and I don't know why. I will not change that because there probably was and maybe still is a reason for that.
EDIT: I now unmasked everything because the masking wasn't done consistently, which led to failed systemd units.
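For completeness, bringing masked worker slots back usually boils down to the standard systemd steps below; this is only a sketch, the exact commands used here are not recorded in the ticket:

sudo systemctl unmask openqa-worker-auto-restart@{16,17}.service   # remove the mask symlinks pointing to /dev/null
sudo systemctl daemon-reload                                       # pick up the changed unit files (see the warnings above)
sudo systemctl start openqa-worker-auto-restart@{16,17}.service    # start the slots so they register with OSD again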
Updated by openqa_review 3 days ago
- Due date set to 2025-03-15
Setting due date based on mean cycle time of SUSE QE Tools
Updated by Julie_CAO about 17 hours ago
okurz wrote in #note-4:
Julie_CAO wrote in #note-3:
Is this issue related to https://progress.opensuse.org/issues/177892?
When you link tickets please use the format #<id> so that we have a direct preview of the subject and status of the linked ticket.
OK, noted.
No, this is not related to #177892. If the worker instances were unavailable due to a limited number of available connection slots, then the word "limited" would show up in the status column.
Got it, thanks.
When shall we expect them back in the pool? Our tests are blocked because these 2 machines are missing.
Oh, ok. You created the ticket with priority "Normal" so I assumed that this was not immediately blocking. Feel welcome to use "High" for such cases in the future. We should work on #178015 first; the missing alert described there is why we didn't notice that those worker instances are offline when they shouldn't be.
The tests were not blocked when the ticket was created. OK, I will keep the priority in mind for such cases in the future.
Updated by Julie_CAO about 17 hours ago
Now the two machines are back to normal, thank you for fixing them.
Updated by mkittler about 11 hours ago · Edited
- Status changed from In Progress to Feedback
Not sure why those worker slots were disabled, though.
Updated by mkittler about 9 hours ago
- Blocked by deleted (action #178015: [false negative] Many failed systemd services but no alert)
Updated by mkittler about 9 hours ago
- Related to action #178015: [false negative] Many failed systemd services but no alert added
Updated by mkittler about 9 hours ago
- Status changed from Feedback to Resolved
I guess no one remembers why that worker was left in that state, so I'm just considering this resolved now.
Note that I have now created one failing unit on purpose for another ticket (see #178015#note-8), but it will not interfere with any production worker slots.
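As a side note, a deliberately failing unit for such alert testing can be created with a transient service; the unit name below is hypothetical and the actual approach used in #178015#note-8 may differ:

sudo systemd-run --unit=test-alert-failing-unit /bin/false   # transient unit exits non-zero and stays in the "failed" state
systemctl --state=failed list-units                          # verify that it shows up among the failed units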