action #115094

[tools] test fails in bootloader_start: redcurrant is down size:M

Added by zoecao about 2 months ago. Updated 6 days ago.

Status:
Resolved
Priority:
Low
Assignee:
Target version:
Start date:
2022-08-08
Due date:
% Done:

0%

Estimated time:

Description

Hi, it looks like the PowerVM worker is down. Please help take a look at it, thanks.

Observation

openQA test in scenario sle-15-SP4-Full-ppc64le-sles+sdk+proxy_SCC_via_YaST_ncurses_test@ppc64le-hmc-single-disk fails in
bootloader_start

Test suite description

Add add-on via SCC using YaST module in ncurses.

Reproducible

Fails since (at least) Build 151.1 (current job)

Rollback steps

On grenache-1.qa

systemctl unmask openqa-worker-auto-restart@{21..28}.service openqa-reload-worker-auto-restart@{21..28}.{service,path}
systemctl start openqa-worker-auto-restart@{21..28}.service openqa-reload-worker-auto-restart@{21..28}.path
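As a sketch (assuming Bash brace expansion, as used in the commands above), the unit names those two rollback commands operate on can be previewed without touching systemd:

```shell
# Preview the systemd unit names the rollback commands expand to;
# {21..28} covers worker instances 21 through 28.
printf '%s\n' \
    openqa-worker-auto-restart@{21..28}.service \
    openqa-reload-worker-auto-restart@{21..28}.{service,path}
```

This prints 24 unit names: one worker service plus one reload service and one reload path unit per instance.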

Further details

Always latest result in this scenario: latest
Machine in racktables: https://racktables.suse.de/index.php?page=object&object_id=11354

Acceptance criteria

  • AC1: redcurrant is up and running and able to execute jobs

Workaround

  • Use WORKER_CLASS=spvm_ppc64le
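For illustration only, a job using the workaround would carry settings along these lines (the BACKEND value follows the advice given in the comments below; the exact settings of a given test suite may differ):

```ini
# Hypothetical job settings for the workaround, not a specific test suite
BACKEND=spvm
WORKER_CLASS=spvm_ppc64le
```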

Related issues

Related to openQA Infrastructure - action #116078: Recover o3 worker power8, restore IPMI access size:M (Feedback, 2022-08-31 to 2022-11-02)

Copied to openQA Tests - action #115118: [qe-core] test fails in bootloader_start: redcurrant is down - trigger a job failure when an lpar is not available (Feedback, 2022-08-08)

History

#2 Updated by szarate about 2 months ago

  • Copied to action #115118: [qe-core] test fails in bootloader_start: redcurrant is down - trigger a job failure when an lpar is not available added

#3 Updated by szarate about 2 months ago

  • Project changed from openQA Tests to openQA Infrastructure
  • Subject changed from [qe-core] test fails in bootloader_start: redcurrant is down to [tools] test fails in bootloader_start: redcurrant is down
  • Category deleted (Bugs in existing tests)

As this is related to the physical state of the LPAR, it is more of a tools/infrastructure matter than a core one; taking #115118 as an action item for us, though.

#4 Updated by okurz about 2 months ago

  • Status changed from New to Feedback
  • Target version set to Ready

nicksinger, are you aware of the latest update regarding when to expect SRV2e to be back, in particular redcurrant?

#5 Updated by mkittler about 2 months ago

  • Description updated (diff)

#6 Updated by okurz about 2 months ago

  • Due date set to 2022-10-13
  • Assignee changed from nicksinger to okurz

I masked the according worker instances with

sudo systemctl mask --now openqa-worker-auto-restart@{21..28}

I asked in https://suse.slack.com/archives/C029APBKLGK/p1660212514627429 about current plans regarding SRV2e.

bmwiedemann: AC upgrade investment was declined, because we plan to move out of there "soon" (TM)
okurz: ok, good to know. But we have to develop and test stuff somewhere, in particular ppc64le PVM. Do we have such resources elsewhere?
mmaher: same status as bernhard mentioned, that was the last information we got
mmaher: even onlining just a handful of machines heats the room up to a critical level, we cannot operate there anymore

So I expect this to take longer. Waiting for any update regarding individual machines. Maybe we can also move this machine to NUE-2.2.14? Why don't we have any other PVM-capable machine elsewhere?

#7 Updated by okurz about 2 months ago

  • Description updated (diff)

#8 Updated by okurz about 2 months ago

  • Description updated (diff)

#9 Updated by okurz about 2 months ago

  • Priority changed from Normal to Low

Hint from rfan: http://openqa.nue.suse.com/tests/9305139#step/bootloader/6 shows that it uses 10.162.6.242, which is actually grenache-6.qa.suse.de. So we do have PVM hosts in SRV2 that can be used as an alternative. I am also preparing a change to use DNS entries in the worker config for those instances on grenache.
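The mentioned change (worker instances referring to the host by DNS name rather than raw IP) could look roughly like the following workers.ini fragment. This is a hypothetical sketch, not the actual grenache-1 configuration; the instance number is illustrative, and NOVALINK_HOSTNAME assumes these instances use the spvm (NovaLink) backend:

```ini
# /etc/openqa/workers.ini -- hypothetical sketch, not the real grenache-1 config
[21]
WORKER_CLASS = spvm_ppc64le
# DNS name instead of the raw IP 10.162.6.242
NOVALINK_HOSTNAME = grenache-6.qa.suse.de
```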

For testing I applied the changes manually. I did

openqa-clone-job --skip-chained-deps --within-instance https://openqa.nue.suse.com/tests/9305139 TEST=okurz_test_dns_pvm BUILD= _GROUP=0

Created job #9314969: sle-15-SP5-Online-ppc64le-Build9.1-create_hdd_textmode@ppc64le-spvm -> https://openqa.nue.suse.com/t9314969

#10 Updated by okurz about 2 months ago

  • Description updated (diff)

I missed the reload services; see https://gitlab.suse.de/openqa/salt-states-openqa#remarks-about-the-systemd-units-used-to-start-workers. So I did

sudo systemctl mask --now openqa-worker-auto-restart@{21..28}.service openqa-reload-worker-auto-restart@{21..28}.{service,path}
sudo systemctl reset-failed

Updated the rollback steps.

#11 Updated by okurz about 1 month ago

In case anyone still needs tests on the former worker class, please consider using "WORKER_CLASS=spvm_ppc64le" instead.

#12 Updated by zoecao about 1 month ago

okurz wrote:

In case anyone still needs tests on the former worker class, please consider using "WORKER_CLASS=spvm_ppc64le" instead.

ok, I see, thanks.

#13 Updated by zoecao about 1 month ago

Thank you for checking this. redcurrant is still down, but I will use "WORKER_CLASS=spvm_ppc64le" instead, thanks.

#14 Updated by tjyrinki_suse 25 days ago

We have 100+ jobs waiting for the hmc_ppc64le-1disk WORKER_CLASS (via the ppc64le-hmc-single-disk machine, using the pvm_hmc backend).

Should it come back online at some point soon? We are currently missing testing of several alpha builds of SP5 on PowerVM.

(Or should we really switch to WORKER_CLASS=spvm_ppc64le for ppc64le-hmc-single-disk?)

#15 Updated by okurz 20 days ago

tjyrinki_suse wrote:

We have 100+ jobs waiting for the hmc_ppc64le-1disk WORKER_CLASS (via the ppc64le-hmc-single-disk machine, using the pvm_hmc backend).

Should it come back online at some point soon? We are currently missing testing of several alpha builds of SP5 on PowerVM.

(Or should we really switch to WORKER_CLASS=spvm_ppc64le for ppc64le-hmc-single-disk?)

Yes, switch to BACKEND=spvm to use WORKER_CLASS=spvm_ppc64le. The testing should be equivalent, and the hardware in SRV2-ext is not expected to come back online in the next weeks.

#16 Updated by jstehlik 19 days ago

  • Related to action #116078: Recover o3 worker power8, restore IPMI access size:M added

#17 Updated by tjyrinki_suse 19 days ago

I've tried to do so at https://openqa.suse.de/tests/9468422

The job just dies quickly, so I guess it is not a drop-in replacement to just switch the WORKER_CLASS? Is the error about "Need variable HMC_MACHINE_NAME" the one that stops it? I'm not finding any existing jobs that define the HMC_MACHINE_NAME variable, however, while the job used to run fine with hmc_ppc64le-1disk (https://openqa.suse.de/tests/8751389).

#18 Updated by JERiveraMoya 19 days ago

tjyrinki_suse, I guess you are missing the switch to BACKEND: spvm, i.e.:
https://openqa.suse.de/tests/9417297#settings
We will also try this solution; hopefully it will not overload these workers.

#22 Updated by cdywan 18 days ago

  • Subject changed from [tools] test fails in bootloader_start: redcurrant is down to [tools] test fails in bootloader_start: redcurrant is down size:M
  • Description updated (diff)
  • Status changed from Feedback to Blocked

Supposedly blocked, as redcurrant will stay offline until the AC problem in SRV2e is fixed.

#23 Updated by tjyrinki_suse 12 days ago

JERiveraMoya wrote:

I guess you are missing the switch to BACKEND: spvm, i.e.:
https://openqa.suse.de/tests/9417297#settings
We will also try this solution; hopefully it will not overload these workers.

Tried that now at https://openqa.suse.de/tests/9508575: progress, but it gave "Can't locate backend/svpm.pm in @INC (you may need to install the backend::svpm module)"

Did you have any more success lately?

#24 Updated by okurz 10 days ago

https://suse.slack.com/archives/C03RW44HKFH/p1663332187497139 provides an update:

Current plan for SRV2 and SRV2-Ext: As the AC in SRV2-Ext is known to fail (and cannot/will not be repaired), it should be considered out of use for normal operations. The exception here are PowerPC machines; these are production-grade systems which are known to survive higher temperatures. So we should be able to use the PowerPC machines in SRV2-Ext, as even in the event of an AC failure there, the PPC machines should be fine with the minimal cooling they'll get by opening the doors to SRV2 and installing fans. We have started moving critical machines from SRV2-Ext into SRV2, and we should have some more free space in SRV2. So if you have critical machines in SRV2-Ext, please get in touch to check if they can be relocated into SRV2. Otherwise we're waiting for the FC Basement to become operational (ETA here is mid-November), and will start moving machines after that. Once that move is done we should have enough free space in SRV2 to host all remaining machines there and get back to fully operational.

I asked in https://suse.slack.com/archives/C03RW44HKFH/p1663332470195329?thread_ts=1663332187.497139&cid=C03RW44HKFH

This is great news. Thank you for the update. There is one machine, "redcurrant", that some QA engineers need within the openqa.suse.de infrastructure (see #115094 for details). Can this machine be moved to SRV2, or can this PowerPC machine be used in place in SRV2-ext?

#25 Updated by okurz 10 days ago

  • Due date deleted (2022-10-13)
  • Status changed from Blocked to Workable
  • Assignee deleted (okurz)

According to hreinecke in https://suse.slack.com/archives/C03RW44HKFH/p1663333156535389?thread_ts=1663332187.497139&cid=C03RW44HKFH, the machine should be available again. I could not ping redcurrant-1.qa.suse.de, and right now I don't know how to check the HMC or similar. Somebody else should check this.

EDIT: The machine can be controlled over https://powerhmc1.arch.suse.de/. There we can see that the machine and LPARs are up. We need #116794 as a controlling instance or move openQA worker instances to another place.

#26 Updated by okurz 7 days ago

  • Parent task set to #114379

#27 Updated by nicksinger 6 days ago

  • Status changed from Workable to Resolved
  • Assignee set to nicksinger

I could recover grenache-1 and was able to successfully execute runs on redcurrant: https://openqa.suse.de/tests/9518325#
