action #174352

closed

2 ipmi backend baremetal machines in OSD worker pool are offline size:S

Added by Julie_CAO 6 days ago. Updated 3 days ago.

Status: Resolved
Priority: High
Assignee:
Category: Regressions/Crashes
Start date: 2024-12-13
Due date: 2024-12-28
% Done: 0%
Estimated time:

Description

Observation

They are bare-metal5.oqa.prg2.suse.org and bare-metal6.oqa.prg2.suse.org.

I verified that bare-metal5 is reachable via its IPMI web interface (10.146.4.108), but the OSD workers page (https://10.145.10.207/admin/workers) shows it as "Offline (graceful disconnect)".

Is anything wrong with the worker services? Could you take a look?

Suggestions

  • IPMI is working and the machine is online so probably a problem with w36 itself
  • Both control instances are on w36 which is used for experimentation within #162296 -> @dheidler

Related issues 1 (1 open, 0 closed)

Related to openQA Infrastructure (public) - action #174448: bare-metal5 and bare-metal6 fail to boot from PXE most times (In Progress, dheidler, 2024-12-16, 2024-12-31)

Actions #1

Updated by okurz 6 days ago

  • Tags set to infra, reactive work, osd
  • Category set to Regressions/Crashes
  • Priority changed from Normal to Urgent
  • Target version set to Ready
Actions #2

Updated by mkittler 6 days ago

  • Subject changed from 2 ipmi backend baremetal machines in OSD worker pool are offline to 2 ipmi backend baremetal machines in OSD worker pool are offline size:S
  • Description updated (diff)
  • Status changed from New to Workable
  • Assignee set to dheidler
Actions #3

Updated by dheidler 6 days ago

  • Status changed from Workable to In Progress
  • Priority changed from Urgent to High

I had stopped the workers during the update to avoid failed jobs.
I have now enabled them again:

systemctl unmask --now openqa-worker-auto-restart@{1..63}.service openqa-reload-worker-auto-restart@{1..63}.{service,path}
systemctl enable --now openqa-worker-auto-restart@{1..63}.service openqa-reload-worker-auto-restart@{1..63}.{service,path}
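
To confirm that the units named above are running again, a check along these lines could be run on the worker host. This is only a sketch: the helper names `worker_units` and `report_inactive` are hypothetical, and the slot count 63 is taken from the commands above.

```shell
# Print the unit names covered by the commands above for slots 1..N,
# one per line (hypothetical helper, not part of openQA).
worker_units() {
    n=$1
    i=1
    while [ "$i" -le "$n" ]; do
        printf '%s\n' \
            "openqa-worker-auto-restart@$i.service" \
            "openqa-reload-worker-auto-restart@$i.service" \
            "openqa-reload-worker-auto-restart@$i.path"
        i=$((i + 1))
    done
}

# List every unit that systemd does not report as active.
# Needs systemd, so this part only makes sense on the worker host itself.
report_inactive() {
    worker_units "$1" | while read -r unit; do
        systemctl is-active --quiet "$unit" || echo "not active: $unit"
    done
}
```

Calling `report_inactive 63` prints nothing if all units are active. Note that an active worker service does not guarantee that the worker shows as online on the web UI, so the workers page is still worth checking.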
Actions #4

Updated by dheidler 6 days ago

@Julie_CAO btw why are you not using https://openqa.suse.de/admin/workers but posting a link only using the IP address?

Actions #5

Updated by openqa_review 5 days ago

  • Due date set to 2024-12-28

Setting due date based on mean cycle time of SUSE QE Tools

Actions #6

Updated by Julie_CAO 3 days ago · Edited

Thanks, I see these 2 workers are online now, and one queued job got assigned to bare-metal5. But it is odd that another pair of jobs did not get assigned, even though they have been queued longer waiting for these 2 machines: https://openqa.suse.de/tests/16175049

I have left the jobs untouched so you can take a look. If you think there is no problem, I'll cancel and restart them.

Actions #7

Updated by Julie_CAO 3 days ago

dheidler wrote in #note-4:

@Julie_CAO btw why are you not using https://openqa.suse.de/admin/workers but posting a link only using the IP address?

My VPN has always had DNS resolution problems, or the DNS server had problems. I am used to mapping commonly used websites on my local host, so I often do not notice when I post the IP directly in a ticket or a bug report. It is a little misleading to some extent. I'll pay more attention :)

Actions #8

Updated by Julie_CAO 3 days ago · Edited

@xguo bare-metal5 failed to boot from PXE: https://openqa.suse.de/tests/16190397
I rebooted it manually and it still did not boot from PXE. Do you know what happened to it?

Actions #9

Updated by dheidler 3 days ago

  • Status changed from In Progress to Blocked

Created https://progress.opensuse.org/issues/174448 as a follow-up on the boot issue.

Actions #10

Updated by dheidler 3 days ago

  • Status changed from Blocked to Resolved
Actions #11

Updated by dheidler 3 days ago

  • Related to action #174448: bare-metal5 and bare-metal6 fail to boot from PXE most times added
Actions #12

Updated by Julie_CAO 3 days ago

Hi @dheidler, should I cancel and restart the paired jobs which are still queued? See my comment in https://progress.opensuse.org/issues/174352#note-6

Actions #13

Updated by dheidler 3 days ago · Edited

Julie_CAO wrote in #note-12:

Hi @dheidler, should I cancel and restart the paired jobs which are still queued? See my comment in https://progress.opensuse.org/issues/174352#note-6

The mentioned jobs are scheduled for WORKER_CLASS=virt-mm-unreal-ipmi and zone-cc, but the only two virt-mm-unreal-ipmi workers are in NUE2, which is not in the CC zone. So there are no workers available that have a matching worker class.
So this is an unrelated issue.
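
The matching rule described here can be sketched as a small shell check, assuming a job is only assigned to a worker that offers every class listed in the job's WORKER_CLASS. The function name `worker_matches` is hypothetical, and the worker class list in the usage example (including `zone-nue2`) is illustrative only, not taken from the real worker configuration.

```shell
# Return 0 if every class in the job's comma-separated WORKER_CLASS
# also appears in the worker's comma-separated class list, else 1.
# Assumes class names contain no spaces or commas.
worker_matches() {
    job_classes=$1
    worker_classes=$2
    old_ifs=$IFS
    IFS=,
    for class in $job_classes; do
        case ",$worker_classes," in
            *",$class,"*) ;;                    # this class is offered
            *) IFS=$old_ifs; return 1 ;;        # a required class is missing
        esac
    done
    IFS=$old_ifs
    return 0
}

# Illustrative usage: the job wants zone-cc, the worker does not offer it.
if worker_matches "virt-mm-unreal-ipmi,zone-cc" "virt-mm-unreal-ipmi,zone-nue2"; then
    echo "worker can take the job"
else
    echo "no matching worker class"
fi
```

With these example values the check fails on `zone-cc`, which mirrors why the queued jobs never got assigned.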

Actions #14

Updated by Julie_CAO 3 days ago

dheidler wrote in #note-13:

The mentioned jobs are scheduled for WORKER_CLASS=virt-mm-unreal-ipmi and zone-cc, but the only two virt-mm-unreal-ipmi workers are in NUE2, which is not in the CC zone. So there are no workers available that have a matching worker class.
So this is an unrelated issue.

Yes, you are correct. I mapped "machine" to the incorrect worker class. Thank you for helping me find the cause.
