action #177318

closed

coordination #161414: [epic] Improved salt based infrastructure management

2 bare-metal machines are offline on OSD

Added by Julie_CAO 3 months ago. Updated 3 months ago.

Status:
Resolved
Priority:
High
Assignee:
mkittler
Category:
Regressions/Crashes
Start date:
2025-02-17
Due date:
% Done:

0%

Estimated time:

Description

bare-metal1 and bare-metal2 can be accessed over IPMI. They were working a day ago but now show as offline in the OSD webUI for unknown reasons. Could you check what happened and bring them back to the worker pool?

![](Screenshot from 2025-02-17 14-57-02.png)
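
For context, a minimal sketch of how the IPMI reachability mentioned above could be double-checked (the interface, BMC hostnames and credentials below are placeholders, not taken from this ticket):

# Placeholder BMC addresses and credentials for bare-metal1 and bare-metal2
ipmitool -I lanplus -H <bare-metal1-bmc> -U <user> -P <password> chassis power status
ipmitool -I lanplus -H <bare-metal2-bmc> -U <user> -P <password> chassis power status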


Files

Screenshot from 2025-02-17 14-57-02.png
Related issues 1 (0 open, 1 closed)

Related to openQA Infrastructure (public) - action #178015: [false negative] Many failed systemd services but no alert has fired size:S - Resolved - nicksinger - 2025-02-27

Actions #1

Updated by okurz 3 months ago

  • Tags set to infra, reactive work, osd, bare-metal, prg2
  • Category set to Regressions/Crashes
  • Target version set to Ready
Actions #2

Updated by okurz 3 months ago

  • Target version changed from Ready to future
Actions #3

Updated by Julie_CAO 3 months ago

Is this issue related to https://progress.opensuse.org/issues/177892? When shall we expect them back in the pool? Our tests are blocked by the absence of these 2 machines.

Actions #4

Updated by okurz 3 months ago

  • Priority changed from Normal to High
  • Target version changed from future to Ready

Julie_CAO wrote in #note-3:

> Is this issue related to https://progress.opensuse.org/issues/177892?

When you link tickets please use the format #<id> so that we have a direct preview of the subject and status of the linked ticket.

No, this is not related to #177892. If the worker instances were unavailable due to a limited number of available connection slots, the word "limited" would show up in the status column.

> When shall we expect them back in the pool? Our tests are blocked by the absence of these 2 machines.

Oh, OK. You created the ticket with priority "Normal", so I assumed that this was not immediately blocking. Feel welcome to use "High" for such cases in the future. We should work on #178015 first, as that issue is why we did not notice that those worker instances were offline when they shouldn't have been.

Actions #5

Updated by okurz 3 months ago

  • Blocked by action #178015: [false negative] Many failed systemd services but no alert has fired size:S added
Actions #6

Updated by okurz 3 months ago

  • Parent task set to #161414
Actions #7

Updated by mkittler 3 months ago · Edited

  • Status changed from New to In Progress
  • Assignee set to mkittler

Those units are masked:

martchus@worker33:~> sudo systemctl status openqa-worker-auto-restart@{16,17}.service 
Warning: The unit file, source configuration file or drop-ins of openqa-worker-auto-restart@16.service changed on disk. Run 'systemctl daemon-reload' to reload units.
Warning: The unit file, source configuration file or drop-ins of openqa-worker-auto-restart@17.service changed on disk. Run 'systemctl daemon-reload' to reload units.
○ openqa-worker-auto-restart@16.service
     Loaded: masked (Reason: Unit openqa-worker-auto-restart@16.service is masked.)
    Drop-In: /etc/systemd/system/openqa-worker-auto-restart@.service.d
             └─20-nvme-autoformat.conf, 30-openqa-max-inactive-caching-downloads.conf
     Active: inactive (dead)

○ openqa-worker-auto-restart@17.service
     Loaded: masked (Reason: Unit openqa-worker-auto-restart@17.service is masked.)
    Drop-In: /etc/systemd/system/openqa-worker-auto-restart@.service.d
             └─20-nvme-autoformat.conf, 30-openqa-max-inactive-caching-downloads.conf
     Active: inactive (dead)

That means someone intentionally disabled them. The question is what the intention was.

Judging by the job history on https://openqa.suse.de/admin/workers/2662 and https://openqa.suse.de/admin/workers/2663 those worker slots don't seem to be broken. So I took the liberty of unmasking them.
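
A minimal sketch of the unmasking step, assuming the two instances shown above (the exact commands used are not recorded in this ticket):

# Unmask the two worker slots and reload unit definitions
sudo systemctl unmask openqa-worker-auto-restart@{16,17}.service
sudo systemctl daemon-reload
# Bring the slots back into the worker pool
sudo systemctl start openqa-worker-auto-restart@{16,17}.service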

Note that many other units are masked on this host as well and I don't know why. I will not change that because there probably was and maybe still is a reason for that.

EDIT: I now unmasked everything because the masking wasn't done consistently, leading to failed systemd units.
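
For reference, a quick way to see which units are masked on such a host (a generic systemctl query, not output captured from this ticket):

# List all masked unit files on the host
systemctl list-unit-files --state=masked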

Actions #8

Updated by openqa_review 3 months ago

  • Due date set to 2025-03-15

Setting due date based on mean cycle time of SUSE QE Tools

Actions #9

Updated by Julie_CAO 3 months ago

okurz wrote in #note-4:

> Julie_CAO wrote in #note-3:
>
> > Is this issue related to https://progress.opensuse.org/issues/177892?
>
> When you link tickets please use the format #<id> so that we have a direct preview of the subject and status of the linked ticket.

OK, noted.

> No, this is not related to #177892. If the worker instances were unavailable due to a limited number of available connection slots, the word "limited" would show up in the status column.

Got it, thanks.

> > When shall we expect them back in the pool? Our tests are blocked by the absence of these 2 machines.
>
> Oh, OK. You created the ticket with priority "Normal", so I assumed that this was not immediately blocking. Feel welcome to use "High" for such cases in the future. We should work on #178015 first, as that issue is why we did not notice that those worker instances were offline when they shouldn't have been.

The tests were not blocked when the ticket was created. OK, I will keep the priority in mind in the future.

Actions #10

Updated by Julie_CAO 3 months ago

Now the two machines are back to normal, thank you for fixing them.

Actions #11

Updated by mkittler 3 months ago · Edited

  • Status changed from In Progress to Feedback

Not sure why those worker slots were disabled, though.

Actions #12

Updated by mkittler 3 months ago

  • Blocked by deleted (action #178015: [false negative] Many failed systemd services but no alert has fired size:S)
Actions #13

Updated by mkittler 3 months ago

  • Related to action #178015: [false negative] Many failed systemd services but no alert has fired size:S added
Actions #14

Updated by mkittler 3 months ago

  • Status changed from Feedback to Resolved

I guess no one remembers why that worker was left in that state, so I'm just considering this resolved now.

Note that I have now created one failing unit on purpose for another ticket (see #178015#note-8), but it will not interfere with any production worker slots.
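
As an illustration only (the actual approach is documented in #178015#note-8 and not reproduced here), one hypothetical way to create such a deliberately failing, non-production unit:

# Hypothetical transient one-shot service that fails immediately; the unit name is made up
sudo systemd-run --unit=poo178015-test-failure.service /bin/false
# It should then show up in the list of failed units
systemctl --failed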

Actions #15

Updated by okurz 3 months ago

  • Due date deleted (2025-03-15)