action #115094

[tools] test fails in bootloader_start: redcurrant is down size:M

Added by zoecao about 2 months ago. Updated 6 days ago.

Status:
Resolved
Priority:
Low
Assignee:
Target version:
Start date:
2022-08-08
Due date:
% Done:

0%

Estimated time:

Description

Hi, it looks like the PowerVM worker is down. Please help take a look at it, thanks.

Observation

openQA test in scenario sle-15-SP4-Full-ppc64le-sles+sdk+proxy_SCC_via_YaST_ncurses_test@ppc64le-hmc-single-disk fails in
bootloader_start

Test suite description

Add add-on via SCC using YaST module in ncurses.

Reproducible

Fails since (at least) Build 151.1 (current job)

Rollback steps

On grenache-1.qa

systemctl unmask openqa-worker-auto-restart@{21..28}.service openqa-reload-worker-auto-restart@{21..28}.{service,path}
systemctl start openqa-worker-auto-restart@{21..28}.service openqa-reload-worker-auto-restart@{21..28}.path
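As a sketch (assuming Bash brace expansion, as used in the commands above), the unit names those two rollback commands operate on can be previewed without touching systemd:

```shell
# Preview the systemd unit names the rollback commands expand to;
# {21..28} covers worker instances 21 through 28.
printf '%s\n' \
    openqa-worker-auto-restart@{21..28}.service \
    openqa-reload-worker-auto-restart@{21..28}.{service,path}
```

This prints 24 unit names: one worker service plus one reload service and one reload path unit per instance.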

Further details

Always latest result in this scenario: latest
Machine in racktables: https://racktables.suse.de/index.php?page=object&object_id=11354

Acceptance criteria

  • AC1: redcurrant is up and running and able to execute jobs

Workaround

  • Use WORKER_CLASS=spvm_ppc64le
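For illustration only, a job using the workaround would carry settings along these lines (the BACKEND value follows the advice given in the comments below; the exact settings of a given test suite may differ):

```ini
# Hypothetical job settings for the workaround, not a specific test suite
BACKEND=spvm
WORKER_CLASS=spvm_ppc64le
```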

Related issues

Related to openQA Infrastructure - action #116078: Recover o3 worker power8, restore IPMI access size:M (Feedback, 2022-08-31 to 2022-11-02)

Copied to openQA Tests - action #115118: [qe-core] test fails in bootloader_start: redcurrant is down - trigger a job failure when an lpar is not available (Feedback, 2022-08-08)

History

#2 Updated by szarate about 2 months ago

  • Copied to action #115118: [qe-core] test fails in bootloader_start: redcurrant is down - trigger a job failure when an lpar is not available added

#3 Updated by szarate about 2 months ago

  • Project changed from openQA Tests to openQA Infrastructure
  • Subject changed from [qe-core] test fails in bootloader_start: redcurrant is down to [tools] test fails in bootloader_start: redcurrant is down
  • Category deleted (Bugs in existing tests)

As this is related to the physical state of the LPAR, it is more of a tools/infrastructure matter than a core one; taking #115118 as an action item for us, though.

#4 Updated by okurz about 2 months ago

  • Status changed from New to Feedback
  • Target version set to Ready

nicksinger, are you aware of the latest update regarding when to expect SRV2e to be back, in particular redcurrant?

#5 Updated by mkittler about 2 months ago

  • Description updated (diff)

#6 Updated by okurz about 2 months ago

  • Due date set to 2022-10-13
  • Assignee changed from nicksinger to okurz

I masked the according worker instances with

sudo systemctl mask --now openqa-worker-auto-restart@{21..28}

I asked in https://suse.slack.com/archives/C029APBKLGK/p1660212514627429 about current plans regarding SRV2e.

bmwiedemann: AC upgrade investment was declined, because we plan to move out of there "soon" (TM)
okurz: ok, good to know. But we have to develop and test stuff somewhere, in particular ppc64le PVM. Do we have such resources elsewhere?
mmaher: same status as bernhard mentioned, that was the last information we got
mmaher: even onlining just a handful of machines heats the room up to a critical level, we cannot operate there anymore

So I expect this to take longer. Waiting for any update regarding individual machines. Maybe we can also move this machine to NUE-2.2.14? Why don't we have any other PVM-capable machine elsewhere?

#7 Updated by okurz about 2 months ago

  • Description updated (diff)

#8 Updated by okurz about 2 months ago

  • Description updated (diff)

#9 Updated by okurz about 2 months ago

  • Priority changed from Normal to Low

Hint from rfan: http://openqa.nue.suse.com/tests/9305139#step/bootloader/6 shows that it uses 10.162.6.242, which is actually grenache-6.qa.suse.de. So we do have PVM hosts in SRV2 that can be used as an alternative. I am also preparing a change to use DNS entries in the worker config for those instances on grenache.
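The mentioned change (worker instances referring to the host by DNS name rather than raw IP) could look roughly like the following workers.ini fragment. This is a hypothetical sketch, not the actual grenache-1 configuration; the instance number is illustrative, and NOVALINK_HOSTNAME assumes these instances use the spvm (NovaLink) backend:

```ini
# /etc/openqa/workers.ini -- hypothetical sketch, not the real grenache-1 config
[21]
WORKER_CLASS = spvm_ppc64le
# DNS name instead of the raw IP 10.162.6.242
NOVALINK_HOSTNAME = grenache-6.qa.suse.de
```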

For testing I applied the changes manually. I did

openqa-clone-job --skip-chained-deps --within-instance https://openqa.nue.suse.com/tests/9305139 TEST=okurz_test_dns_pvm BUILD= _GROUP=0

Created job #9314969: sle-15-SP5-Online-ppc64le-Build9.1-create_hdd_textmode@ppc64le-spvm -> https://openqa.nue.suse.com/t9314969

#10 Updated by okurz about 2 months ago

  • Description updated (diff)

I missed the reload services; see https://gitlab.suse.de/openqa/salt-states-openqa#remarks-about-the-systemd-units-used-to-start-workers. So I did

sudo systemctl mask --now openqa-worker-auto-restart@{21..28}.service openqa-reload-worker-auto-restart@{21..28}.{service,path}
sudo systemctl reset-failed

Updated the rollback steps.

#11 Updated by okurz about 1 month ago

In case anyone still needs tests on the former worker class, please consider using "WORKER_CLASS=spvm_ppc64le" instead.

#12 Updated by zoecao about 1 month ago

okurz wrote:

In case anyone still needs tests on the former worker class, please consider using "WORKER_CLASS=spvm_ppc64le" instead.

ok, I see, thanks.

#13 Updated by zoecao about 1 month ago

Thank you for checking this. redcurrant is still down, but I will use "WORKER_CLASS=spvm_ppc64le" instead, thanks.

#14 Updated by tjyrinki_suse 25 days ago

We have 100+ jobs waiting for the hmc_ppc64le-1disk WORKER_CLASS (via the ppc64le-hmc-single-disk machine, using the pvm_hmc backend).

Should it come back online at some point soon? We are currently missing testing of several alpha builds of SP5 on PowerVM.

(Or should we really switch to WORKER_CLASS=spvm_ppc64le for ppc64le-hmc-single-disk?)

#15 Updated by okurz 20 days ago

tjyrinki_suse wrote:

We have 100+ jobs waiting for the hmc_ppc64le-1disk WORKER_CLASS (via the ppc64le-hmc-single-disk machine, using the pvm_hmc backend).

Should it come back online at some point soon? We are currently missing testing of several alpha builds of SP5 on PowerVM.

(Or should we really switch to WORKER_CLASS=spvm_ppc64le for ppc64le-hmc-single-disk?)

Yes, switch to BACKEND=spvm to use WORKER_CLASS=spvm_ppc64le. The testing should be equivalent, and the hardware in SRV2-ext is not expected to come back online in the next weeks.

#16 Updated by jstehlik 19 days ago

  • Related to action #116078: Recover o3 worker power8, restore IPMI access size:M added

#17 Updated by tjyrinki_suse 19 days ago

I've tried to do so at https://openqa.suse.de/tests/9468422

The job just dies quickly, so I guess it is not a drop-in replacement to just switch the WORKER_CLASS? Is the error about "Need variable HMC_MACHINE_NAME" the one that stops it? I'm not finding any existing jobs that define the HMC_MACHINE_NAME variable, however, while the job used to run fine with hmc_ppc64le-1disk (https://openqa.suse.de/tests/8751389).

#18 Updated by JERiveraMoya 19 days ago

tjyrinki_suse, I guess you are missing the switch to BACKEND: spvm, i.e.:
https://openqa.suse.de/tests/9417297#settings
We will also try this solution; hopefully it will not overload these workers.

#22 Updated by cdywan 18 days ago

  • Subject changed from [tools] test fails in bootloader_start: redcurrant is down to [tools] test fails in bootloader_start: redcurrant is down size:M
  • Description updated (diff)
  • Status changed from Feedback to Blocked

Supposedly blocked, as redcurrant will stay offline until the AC problem in SRV2e is fixed.

#23 Updated by tjyrinki_suse 12 days ago

JERiveraMoya wrote:

I guess you are missing the switch to BACKEND: spvm, i.e.:
https://openqa.suse.de/tests/9417297#settings
We will also try this solution; hopefully it will not overload these workers.

Tried that now at https://openqa.suse.de/tests/9508575: progress, but it gave "Can't locate backend/svpm.pm in @INC (you may need to install the backend::svpm module)"

Did you have any more success lately?

#24 Updated by okurz 10 days ago

https://suse.slack.com/archives/C03RW44HKFH/p1663332187497139 provides an update:

Current plan for SRV2 and SRV2-Ext: As the AC in SRV2-Ext is known to fail (and cannot/will not be repaired), it should be considered out of use for normal operations. The exception here are PowerPC machines; these are production-grade systems which are known to survive higher temperatures. So we should be able to use the PowerPC machines in SRV2-Ext, as even in the event of an AC failure there, the PPC machines should be fine with the minimal cooling they'll get by opening the doors to SRV2 and installing fans. We have started moving critical machines from SRV2-Ext into SRV2, and we should have some more free space in SRV2. So if you have critical machines in SRV2-Ext, please get in touch to check if they can be relocated into SRV2. Otherwise we're waiting for the FC Basement to become operational (ETA here is mid-November), and will start moving machines after that. Once that move is done we should have enough free space in SRV2 to host all remaining machines there and get back to fully operational.

I asked in https://suse.slack.com/archives/C03RW44HKFH/p1663332470195329?thread_ts=1663332187.497139&cid=C03RW44HKFH

This is great news. Thank you for the update. There is one machine, "redcurrant", that some QA engineers need within the openqa.suse.de infrastructure (see #115094 for details). Can this machine be moved to SRV2, or can this PowerPC machine be used in place in SRV2-ext?

#25 Updated by okurz 10 days ago

  • Due date deleted (2022-10-13)
  • Status changed from Blocked to Workable
  • Assignee deleted (okurz)

According to hreinecke in https://suse.slack.com/archives/C03RW44HKFH/p1663333156535389?thread_ts=1663332187.497139&cid=C03RW44HKFH, the machine should be available again. I could not ping redcurrant-1.qa.suse.de, and right now I don't know how to check the HMC or similar. Somebody else should check this.

EDIT: The machine can be controlled over https://powerhmc1.arch.suse.de/. There we can see that the machine and LPARs are up. We need #116794 as a controlling instance or move openQA worker instances to another place.

#26 Updated by okurz 7 days ago

  • Parent task set to #114379

#27 Updated by nicksinger 6 days ago

  • Status changed from Workable to Resolved
  • Assignee set to nicksinger

I could recover grenache-1 and was able to successfully execute runs on redcurrant: https://openqa.suse.de/tests/9518325#
