action #177318
coordination #161414: [epic] Improved salt based infrastructure management (closed)
2 bare-metal machines are offline on OSD
Description
bare-metal1 and bare-metal2 can be accessed over IPMI. They were working a day ago but now show as offline on the OSD webUI for unknown reasons. Could you check what happened and bring them back to the worker pool?
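For reference, the power state of such hosts can typically be checked over IPMI with ipmitool; the BMC hostnames and credentials below are placeholders, not the real values for these machines:

ipmitool -I lanplus -H bare-metal1-ipmi.example.net -U ADMIN -P '<password>' chassis power status
ipmitool -I lanplus -H bare-metal2-ipmi.example.net -U ADMIN -P '<password>' chassis power status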
Updated by Julie_CAO 3 days ago
Is this issue related to https://progress.opensuse.org/issues/177892? When shall we expect them back in the pool? Our tests are blocked because these 2 machines are missing.
Updated by okurz 3 days ago
- Priority changed from Normal to High
- Target version changed from future to Ready
Julie_CAO wrote in #note-3:
Is this issue related to https://progress.opensuse.org/issues/177892?
When you link tickets please use the format #<id> so that we have a direct preview of the subject and status of the linked ticket.
No, this is not related to #177892. If the worker instances were unavailable due to a limited number of available connection slots, then the word "limited" would show up in the status column.
When shall we expect them back in the pool? Our tests are blocked because these 2 machines are missing.
Oh, ok. You created the ticket with priority "Normal" so I assumed that this was not immediately blocking. Feel welcome to use "High" for such cases in the future. We should work on #178015 first; the missing alert described there is why we didn't notice that those worker instances are offline when they shouldn't be.
Updated by okurz 3 days ago
- Blocked by action #178015: [false negative] Many failed systemd services but no alert added
Updated by mkittler 3 days ago · Edited
- Status changed from New to In Progress
- Assignee set to mkittler
Those units are masked:
martchus@worker33:~> sudo systemctl status openqa-worker-auto-restart@{16,17}.service
Warning: The unit file, source configuration file or drop-ins of openqa-worker-auto-restart@16.service changed on disk. Run 'systemctl daemon-reload' to reload units.
Warning: The unit file, source configuration file or drop-ins of openqa-worker-auto-restart@17.service changed on disk. Run 'systemctl daemon-reload' to reload units.
○ openqa-worker-auto-restart@16.service
     Loaded: masked (Reason: Unit openqa-worker-auto-restart@16.service is masked.)
    Drop-In: /etc/systemd/system/openqa-worker-auto-restart@.service.d
             └─20-nvme-autoformat.conf, 30-openqa-max-inactive-caching-downloads.conf
     Active: inactive (dead)

○ openqa-worker-auto-restart@17.service
     Loaded: masked (Reason: Unit openqa-worker-auto-restart@17.service is masked.)
    Drop-In: /etc/systemd/system/openqa-worker-auto-restart@.service.d
             └─20-nvme-autoformat.conf, 30-openqa-max-inactive-caching-downloads.conf
     Active: inactive (dead)
That means someone intentionally disabled them. The question is what the intention was.
Judging by the job history on https://openqa.suse.de/admin/workers/2662 and https://openqa.suse.de/admin/workers/2663, those worker slots don't seem to be broken. So I took the liberty of unmasking them.
Note that many other units are masked on this host as well and I don't know why. I will not change that because there probably was and maybe still is a reason for that.
EDIT: I now unmasked everything because the masking wasn't done consistently, which led to failed systemd units.
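For completeness, bringing masked worker slots back usually boils down to the standard systemd steps below; this is only a sketch, the exact commands used here are not recorded in the ticket:

sudo systemctl unmask openqa-worker-auto-restart@{16,17}.service   # remove the mask symlinks pointing to /dev/null
sudo systemctl daemon-reload                                       # pick up the changed unit files (see the warnings above)
sudo systemctl start openqa-worker-auto-restart@{16,17}.service    # start the slots so they register with OSD again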
Updated by openqa_review 3 days ago
- Due date set to 2025-03-15
Setting due date based on mean cycle time of SUSE QE Tools
Updated by Julie_CAO about 17 hours ago
okurz wrote in #note-4:
Julie_CAO wrote in #note-3:
Is this issue related to https://progress.opensuse.org/issues/177892?
When you link tickets please use the format #<id> so that we have a direct preview of the subject and status of the linked ticket.
OK, noted.
No, this is not related to #177892. If the worker instances were unavailable due to a limited number of available connection slots, then the word "limited" would show up in the status column.
Got it, thanks.
When shall we expect them back in the pool? Our tests are blocked because these 2 machines are missing.
Oh, ok. You created the ticket with priority "Normal" so I assumed that this was not immediately blocking. Feel welcome to use "High" for such cases in the future. We should work on #178015 first; the missing alert described there is why we didn't notice that those worker instances are offline when they shouldn't be.
The tests were not blocked when the ticket was created. OK, I will keep the priority in mind for such cases in the future.
Updated by Julie_CAO about 17 hours ago
Now the two machines are back to normal, thank you for fixing them.
Updated by mkittler about 11 hours ago · Edited
- Status changed from In Progress to Feedback
Not sure why those worker slots were disabled, though.
Updated by mkittler about 9 hours ago
- Blocked by deleted (action #178015: [false negative] Many failed systemd services but no alert)
Updated by mkittler about 9 hours ago
- Related to action #178015: [false negative] Many failed systemd services but no alert added
Updated by mkittler about 9 hours ago
- Status changed from Feedback to Resolved
I guess no one remembers why that worker was left in that state, so I'm just considering this resolved now.
Note that I have now created one failing unit on purpose for another ticket (see #178015#note-8), but it will not interfere with any production worker slots.
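As a side note, a deliberately failing unit for such alert testing can be created with a transient service; the unit name below is hypothetical and the actual approach used in #178015#note-8 may differ:

sudo systemd-run --unit=test-alert-failing-unit /bin/false   # transient unit exits non-zero and stays in the "failed" state
systemctl --state=failed list-units                          # verify that it shows up among the failed units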