action #174352
closed
2 ipmi backend baremetal machines in OSD worker pool are offline size:S
Added by Julie_CAO 6 days ago.
Updated 3 days ago.
Category: Regressions/Crashes
Description
Observation
They are bare-metal5.oqa.prg2.suse.org and bare-metal6.oqa.prg2.suse.org.
I verified that bare-metal5 can be accessed via its IPMI web interface (10.146.4.108), but it showed "Offline (graceful disconnect)" on the OSD workers page (https://10.145.10.207/admin/workers).
Is anything wrong with the worker services? Could you take a look?
Suggestions
- IPMI is working and the machine is online, so the problem is probably with w36 itself
- Both control instances are on w36, which is used for experimentation within #162296 -> @dheidler
Related issues: 1 (1 open, 0 closed)
- Tags set to infra, reactive work, osd
- Category set to Regressions/Crashes
- Priority changed from Normal to Urgent
- Target version set to Ready
- Subject changed from 2 ipmi backend baremetal machines in OSD worker pool are offline to 2 ipmi backend baremetal machines in OSD worker pool are offline size:S
- Description updated (diff)
- Status changed from New to Workable
- Assignee set to dheidler
- Status changed from Workable to In Progress
- Priority changed from Urgent to High
I had stopped the workers during the update to avoid failed jobs.
I have now enabled them again:
systemctl unmask --now openqa-worker-auto-restart@{1..63}.service openqa-reload-worker-auto-restart@{1..63}.{service,path}
systemctl enable --now openqa-worker-auto-restart@{1..63}.service openqa-reload-worker-auto-restart@{1..63}.{service,path}
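The two commands above rely on bash brace expansion, so each one touches many systemd units at once. As an illustration only, the small Python sketch below reproduces what these two specific patterns expand to; `expand_units` is a hypothetical helper that handles just the `prefix@{1..63}.{a,b}` shape used here, not general bash brace expansion.

```python
from itertools import product


def expand_units(prefix: str, instances: range, suffixes: list[str]) -> list[str]:
    """Hypothetical helper mimicking bash expansion of prefix@{first..last}.{a,b}."""
    # bash expands the leftmost brace group first, so the instance
    # number varies slowest and the suffix varies fastest
    return [f"{prefix}@{i}.{s}" for i, s in product(instances, suffixes)]


# openqa-worker-auto-restart@{1..63}.service
workers = expand_units("openqa-worker-auto-restart", range(1, 64), ["service"])
# openqa-reload-worker-auto-restart@{1..63}.{service,path}
reloaders = expand_units("openqa-reload-worker-auto-restart", range(1, 64), ["service", "path"])

units = workers + reloaders
print(len(units))      # 189
print(reloaders[:2])   # ['openqa-reload-worker-auto-restart@1.service', 'openqa-reload-worker-auto-restart@1.path']
```

So each command acts on 63 worker services plus 63 reload services and 63 reload path units.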
- Due date set to 2024-12-28
Setting due date based on mean cycle time of SUSE QE Tools
Thanks, I see these 2 workers are online now, and one queued job got assigned to bare-metal5. But it is weird that another pair of jobs, which have been queued longer waiting for these 2 machines, did not get assigned: https://openqa.suse.de/tests/16175049
I left the jobs untouched so you can take a look. If you think there is no problem, I'll cancel and restart them.
dheidler wrote in #note-4:
@Julie_CAO btw, why are you not using https://openqa.suse.de/admin/workers instead of posting a link that uses only the IP address?
My VPN always had DNS resolution problems, or the DNS server had problems, so I am used to mapping commonly used websites to their IP addresses on my local host. I often don't notice that I am posting the IP directly in a ticket or bug report, which is a little misleading. I'll pay more attention :)
- Status changed from In Progress to Blocked
- Status changed from Blocked to Resolved
- Related to action #174448: bare-metal5 and bare-metal6 fail to boot from PXE most times added
Julie_CAO wrote in #note-12:
Hi @dheidler, should I cancel and restart the paired jobs that are still queued? See my comment in https://progress.opensuse.org/issues/174352#note-6
The mentioned jobs are scheduled for WORKER_CLASS=virt-mm-unreal-ipmi and zone-cc, but the only two virt-mm-unreal-ipmi workers are in NUE2, which is not in the CC zone. So there are no workers available that have a matching worker class.
So this is an unrelated issue.
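The reasoning in the note above amounts to a subset check: a job requesting several worker classes can only be picked up by a worker that offers all of them. The Python sketch below is a simplified model under that assumption (not openQA's actual scheduler code); the worker names and their class sets are hypothetical and only serve to show why no worker qualifies.

```python
# Simplified model of worker-class matching (assumption: a worker qualifies
# only if it offers every comma-separated class the job requests).
def worker_matches(job_worker_class: str, worker_classes: set[str]) -> bool:
    required = {c.strip() for c in job_worker_class.split(",") if c.strip()}
    # Subset test: every required class must be present on the worker
    return required <= worker_classes


# WORKER_CLASS of the queued jobs, as described in note-13
job_worker_class = "virt-mm-unreal-ipmi,zone-cc"

# Hypothetical workers and class sets, for illustration only
workers = {
    "nue2-ipmi-worker-1": {"virt-mm-unreal-ipmi"},  # right backend, wrong zone
    "nue2-ipmi-worker-2": {"virt-mm-unreal-ipmi"},
    "cc-worker-1": {"zone-cc", "qemu_x86_64"},      # right zone, wrong backend
}

candidates = [name for name, classes in workers.items()
              if worker_matches(job_worker_class, classes)]
print(candidates)  # []
```

Because neither group of workers provides both classes, the jobs stay scheduled indefinitely, which matches the observation that this is unrelated to the offline workers.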
Yes, you are correct. I mapped the "machine" to the wrong worker class. Thank you for helping me figure it out.