action #174352: 2 ipmi backend baremetal machines in OSD worker pool are offline size:S - openQA Infrastructure (public) - openSUSE Project Management Tool

action #174352

closed

2 ipmi backend baremetal machines in OSD worker pool are offline size:S

Added by Julie_CAO 3 months ago. Updated 3 months ago.

Status: Resolved
Priority: High
Assignee: dheidler
Category: Regressions/Crashes
Target version: openQA Project (public) - Ready
Start date: 2024-12-13
Due date: 2024-12-28
% Done:
Estimated time:
Tags: osd, infra, reactive work

Description

Observation

The affected machines are bare-metal5.oqa.prg2.suse.org and bare-metal6.oqa.prg2.suse.org.

I confirmed that bare-metal5 can be accessed via its IPMI web interface (10.146.4.108), but it is shown as "Offline (graceful disconnect)" on the OSD workers page (https://10.145.10.207/admin/workers).

Is anything wrong with the worker services? Could you take a look?

Suggestions

IPMI is working and the machine is online so probably a problem with w36 itself
Both control instances are on w36 which is used for experimentation within #162296 -> @dheidler

Updated by okurz 3 months ago

Tags set to infra, reactive work, osd
Category set to Regressions/Crashes
Priority changed from Normal to Urgent
Target version set to Ready

Updated by mkittler 3 months ago

Subject changed from 2 ipmi backend baremetal machines in OSD worker pool are offline to 2 ipmi backend baremetal machines in OSD worker pool are offline size:S
Description updated (diff)
Status changed from New to Workable
Assignee set to dheidler

Updated by dheidler 3 months ago

Status changed from Workable to In Progress
Priority changed from Urgent to High

I had stopped the workers during the update so that no jobs would fail.
I have now enabled them again:

systemctl unmask --now openqa-worker-auto-restart@{1..63}.service openqa-reload-worker-auto-restart@{1..63}.{service,path}
systemctl enable --now openqa-worker-auto-restart@{1..63}.service openqa-reload-worker-auto-restart@{1..63}.{service,path}
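
The `{1..63}` and `{service,path}` patterns in the two commands above are shell brace expansions, so each command acts on 63 worker units plus 126 reload units. The following is a minimal sketch, not part of the original fix, that prints the same unit names; `list_worker_units` is a hypothetical helper name, and the range is passed as parameters only so it can be tried with a small range first.

```shell
#!/bin/sh
# Hypothetical helper: print the unit names that the brace patterns
# openqa-worker-auto-restart@{N..M}.service and
# openqa-reload-worker-auto-restart@{N..M}.{service,path} expand to.
list_worker_units() {
    first=$1
    last=$2

    # Worker instances, one .service unit per slot.
    i=$first
    while [ "$i" -le "$last" ]; do
        printf '%s\n' "openqa-worker-auto-restart@${i}.service"
        i=$((i + 1))
    done

    # Reload helpers, one .service and one .path unit per slot.
    i=$first
    while [ "$i" -le "$last" ]; do
        printf '%s\n' "openqa-reload-worker-auto-restart@${i}.service" \
                      "openqa-reload-worker-auto-restart@${i}.path"
        i=$((i + 1))
    done
}

# Try it with a small range before using the full 1..63.
list_worker_units 1 2
```

On the worker host, the full list from `list_worker_units 1 63` could presumably be passed to `systemctl is-active` to confirm that the units are running again.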

Updated by dheidler 3 months ago

@Julie_CAO btw why are you not using https://openqa.suse.de/admin/workers but posting a link only using the IP address?

Updated by openqa_review 3 months ago

Due date set to 2024-12-28

Setting due date based on mean cycle time of SUSE QE Tools

Updated by Julie_CAO 3 months ago · Edited

Thanks, I see these 2 workers are online now, and one queued job got assigned to bare-metal5. But it is odd that another pair of jobs did not get assigned, even though they have been queued longer waiting for these 2 machines: https://openqa.suse.de/tests/16175049

I have not touched the jobs so that you can take a look. If you think there is no problem, I'll cancel and restart them.

Updated by Julie_CAO 3 months ago

dheidler wrote in #note-4:

@Julie_CAO btw why are you not using https://openqa.suse.de/admin/workers but posting a link only using the IP address?

My VPN has always had DNS resolution problems, or the DNS server had problems, so I am used to mapping commonly used websites to IP addresses on my local host. I often do not notice that I am posting the IP address directly in a ticket or a bug report, which is somewhat misleading. I'll pay more attention :)

Updated by Julie_CAO 3 months ago · Edited

@xguo bare-metal5 failed to boot from PXE: https://openqa.suse.de/tests/16190397
I rebooted it manually and it still did not boot from PXE. Do you know what happened to it?

Updated by dheidler 3 months ago

Status changed from In Progress to Blocked

Created https://progress.opensuse.org/issues/174448 as a follow-up for the boot issue.

#10

Updated by dheidler 3 months ago

Status changed from Blocked to Resolved

#11

Updated by dheidler 3 months ago

Related to action #174448: bare-metal5 and bare-metal6 fail to boot from PXE most times added

#12

Updated by Julie_CAO 3 months ago

Hi @dheidler, should I cancel and restart the paired jobs which are still queued? See my comment in https://progress.opensuse.org/issues/174352#note-6

#13

Updated by dheidler 3 months ago · Edited

Julie_CAO wrote in #note-12:

Hi @dheidler, should I cancel and restart the paired jobs which are still queued? See my comment in https://progress.opensuse.org/issues/174352#note-6

The mentioned jobs are scheduled for WORKER_CLASS=virt-mm-unreal-ipmi and zone-cc, but the only two virt-mm-unreal-ipmi workers are in NUE2, which is not in the CC zone. So there are no workers available that have a matching worker class.
So this is an unrelated issue.
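
The mismatch described above can be illustrated with a small sketch. It assumes that a job is only assigned to a worker offering every class listed in the job's comma-separated `WORKER_CLASS`; this mirrors the explanation in this comment and is not taken from the openQA scheduler code. `worker_matches` and the worker class `example-nue2-class` are hypothetical names used only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit status 0) only if every class the job
# requests is offered by the worker. Both arguments are comma-separated lists.
worker_matches() {
    job_classes=$1
    worker_classes=$2
    old_ifs=$IFS
    IFS=,
    for wanted in $job_classes; do
        # Wrap both sides in commas so only whole class names match.
        case ",$worker_classes," in
            *",$wanted,"*) ;;
            *) IFS=$old_ifs; return 1 ;;
        esac
    done
    IFS=$old_ifs
    return 0
}

# Job classes as described in this ticket.
job="virt-mm-unreal-ipmi,zone-cc"
# Assumed worker classes: the real list is not given in the ticket.
worker="virt-mm-unreal-ipmi,example-nue2-class"

if worker_matches "$job" "$worker"; then
    echo "match"
else
    echo "no match"
fi
```

Under that assumption the job stays queued: the worker offers `virt-mm-unreal-ipmi` but not `zone-cc`, so the script prints `no match`.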

#14

Updated by Julie_CAO 3 months ago

dheidler wrote in #note-13:

The mentioned jobs are scheduled for WORKER_CLASS=virt-mm-unreal-ipmi and zone-cc, but the only two virt-mm-unreal-ipmi workers are in NUE2, which is not in the CC zone. So there are no workers available that have a matching worker class.
So this is an unrelated issue.

Yes, you are correct. I mapped "machine" to the incorrect worker class. Thank you for helping me find that out.
