action #124685

closed

[qe-tools] Make sure Power8 and Power9 machines can be used with *any* usable HMC (was: move to new powerhmc3.arch.suse.de) size:M

Added by zluo over 1 year ago. Updated over 1 year ago.

Status: Resolved
Priority: Urgent
Assignee: nicksinger
Category: -
Target version: Ready
Start date: 2023-02-16
Due date: -
% Done: 0%
Estimated time: -
Tags: infra

Description

Motivation

powerhmc1 is broken.

Antonio Eisner has installed a new powerhmc as a replacement. To be able to manage our QA-Power8-2-8247-22L-SN1010D5A and Power9 (e.g. shutdown, restart) as before, we need to move both Power machines to powerhmc3.

Power8: QA-Power8-2-8247-22L-SN1010D5A
Power9: redcurrant (novalink)

At the moment we can still use powerhmc2 to trigger network installation, but we are not able to manage redcurrant LPARs. And we don't have control of QA-Power8-2-8247-22L-SN1010D5A at all; only currently running LPARs can be accessed directly.

Acceptance criteria

  • AC1: PVM based openQA jobs pass the bootloader step connecting to the machines over HMC

Suggestions


Related issues: 1 (0 open, 1 closed)

Related to openQA Infrastructure - action #124391: test fails in bootloader_start - The command server of powerhmc1 cannot be connected size:M (Resolved, nicksinger, 2023-02-13)

Actions #1

Updated by okurz over 1 year ago

  • Tags set to infra
  • Target version set to Ready
Actions #2

Updated by okurz over 1 year ago

  • Related to action #124391: test fails in bootloader_start - The command server of powerhmc1 cannot be connected size:M added
Actions #3

Updated by okurz over 1 year ago

  • Description updated (diff)
  • Priority changed from Normal to High

zluo wrote:

powerhmc1 is broken.

Antonio Eisner has installed a new powerhmc as a replacement. To be able to manage our QA Power8 and Power9 (e.g. shutdown, restart) as before, we need to move both Power machines to powerhmc3.

Which one do you mean as the replacement, powerhmc3? That one has already existed for a while. I am receiving somewhat mixed messages. In #124391 it was stated that at first a reboot of powerhmc1.arch.suse.de helped, but I can confirm that powerhmc1.arch.suse.de can't be reached over web or ssh. I guess it shouldn't harm to add a dedicated user account for testing and to add the machines to powerhmc3 in addition to powerhmc1.

@zluo you created the ticket with "Normal" priority, which according to our SLO would mean the expected reaction time is within one year, but I assume this is more urgent? Bumping up to "High" accordingly.

Actions #4

Updated by zluo over 1 year ago

@okurz well, as I understood from Antonio, he has already prepared powerhmc3; whether and how it is running for QA Power machines, I don't know.
Normally a redundant powerhmc should cover this kind of failure. A long time ago it was decided that we shut down our own QA powerhmc server.

Actions #5

Updated by mkittler over 1 year ago

  • Subject changed from [qe-tools] move Power8 and Power9 to new powerhmc3.arch.suse.de to [qe-tools] move Power8 and Power9 to new powerhmc3.arch.suse.de size:M
  • Description updated (diff)
Actions #6

Updated by livdywan over 1 year ago

  • Status changed from New to Workable
Actions #7

Updated by mkittler over 1 year ago

  • Status changed from Workable to In Progress
  • Assignee set to mkittler
Actions #8

Updated by mkittler over 1 year ago

  • Status changed from In Progress to Feedback

I've created a dedicated user for openQA tests on hmc3 and an MR to change the config accordingly: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/496
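
For the record, such a dedicated user would typically be created over the HMC's ssh CLI first. A minimal sketch, assuming the standard HMC mkhmcusr command is available on hmc3 (user name, task role, description and password below are placeholders; exact flags can differ between HMC versions):

# create a restricted HMC user for openQA (placeholder values)
mkhmcusr -u openqa -a hmcoperator -d "openQA test user" --passwd <password>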

Is that enough or do I need to:

Actions #9

Updated by zluo over 1 year ago

mkittler wrote:

I've created a dedicated user for openQA tests on hmc3 and an MR to change the config accordingly: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/496

Is that enough or do I need to:

@mkittler: the old powerhmc1 is gone with all managed machines/LPARs. We need to discover/connect these machines again and manage them on powerhmc3. As I understood it, we will just use our NIS accounts to access powerhmc3 in the future, but openQA may require a dedicated powerhmc user to trigger the connection and run installation steps afterwards.

Actions #10

Updated by zluo over 1 year ago

  • Subject changed from [qe-tools] move Power8 and Power9 to new powerhmc3.arch.suse.de size:M to [qe-tools] move Power8 and Power9 to new powerhmc3.arch.suse.de
Actions #11

Updated by okurz over 1 year ago

  • Subject changed from [qe-tools] move Power8 and Power9 to new powerhmc3.arch.suse.de to [qe-tools] move Power8 and Power9 to new powerhmc3.arch.suse.de size:M

@mkittler I suggest you add the machines over the ssh interface or over the webUI, preferably using hostnames, not IPv4 addresses; see the sketch at the end of this comment. I suggest ignoring hmc1 for now.

  • qa-power8-2.qa.suse.de is reserved for manual testing. It is not to be used for openQA tests
  • grenache-[2..8] are using the "spvm" worker class (see our workerconf.sls), meaning they rely on a "Novalink" instance running on grenache, but we should still connect the HMC
  • redcurrant is using the "hmc" worker class (see our workerconf.sls)

In the end all three should be reachable over HMC; grenache-[2..8] and redcurrant must work in openQA.

Please look on Confluence or ask people for the hostnames and the credentials for connecting to the HMC, and please add the information to racktables where it is missing. If you don't get/find that information within a reasonable time, please remove the estimate, put the ticket back to "New" and unassign yourself.
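
A minimal sketch of how the machines could be added over the HMC's ssh CLI, assuming the standard IBM HMC commands mksysconn and lssysconn are available on powerhmc3 (the FSP hostname/IP and password below are placeholders):

ssh <user>@powerhmc3.arch.suse.de
# add a managed-system connection, preferably by hostname
mksysconn --ip <fsp-hostname-or-ip> -r sys --passwd <fsp-password>
# list all system connections and their state to verify the machine shows up
lssysconn -r all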

Actions #12

Updated by mkittler over 1 year ago

I've got a statement from Antonio Eisner:

Okay, the explanation: powerhmc1 has now been reinstalled because it did not come up after a reboot. You could migrate all machines from powerhmc3 to powerhmc1. Its version is higher. At some point we will update powerhmc2 and then we can turn off powerhmc3.

That means this ticket makes no sense. Those hosts should rather be re-added to hmc1.

He also shared a link to https://confluence.suse.com/display/labsprojectmanagement/Hardware+inventory but this doesn't answer my questions about FQDNs/IPs/BMC-credentials.

Actions #13

Updated by okurz over 1 year ago

mkittler wrote:

I've got a statement from Antonio Eisner:

Okay, the explanation: powerhmc1 has now been reinstalled because it did not come up after a reboot. You could migrate all machines from powerhmc3 to powerhmc1. Its version is higher. At some point we will update powerhmc2 and then we can turn off powerhmc3.

That means this ticket makes no sense. Those hosts should rather be re-added to hmc1.

I still think this ticket makes sense. As I stated, there shouldn't be a problem with having the machines connected to both. And we were asked to use powerhmc3, so let's do it.

He also shared a link to https://confluence.suse.com/display/labsprojectmanagement/Hardware+inventory but this doesn't answer my questions about FQDNs/IPs/BMC-credentials.

Actions #14

Updated by mkittler over 1 year ago

I was able to find out the IP of redcurrant as it is still connected to hmc2 (and I could guess the credentials for hmc2). I could also guess the ASM credentials of redcurrant. However, adding redcurrant to hmc3 still doesn't seem to work; it is just stuck showing "Systeme verbinden" ("connecting systems"). Maybe redcurrant is not reachable from hmc3 (just as it is not reachable from the normal engineering VPN either).
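
When a connection is stuck in that state it can sometimes help to reset it from the HMC's ssh CLI and check its state again. A sketch, assuming the standard rmsysconn/lssysconn commands (redcurrant's FSP IP below is a placeholder):

# reset the stuck connection to redcurrant's FSP
rmsysconn -o reset --ip <redcurrant-fsp-ip>
# re-check the connection state afterwards
lssysconn -r all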

I could not find the IP of QA-Power8-2-8247-22L-SN1010D5A in a similar way as it is not connected to hmc2.

I've asked Antonio Eisner to give us the new credentials of hmc1, but I doubt this will be useful at this point.

So I have no idea how to do what @zluo is asking for.

Actions #15

Updated by okurz over 1 year ago

  • Due date set to 2023-03-14
Actions #16

Updated by zluo over 1 year ago

mkittler wrote:

I was able to find out the IP of redcurrant as it is still connected to hmc2 (and I could guess the credentials for hmc2). I could also guess the ASM credentials of redcurrant. However, adding redcurrant to hmc3 still doesn't seem to work; it is just stuck showing "Systeme verbinden" ("connecting systems"). Maybe redcurrant is not reachable from hmc3 (just as it is not reachable from the normal engineering VPN either).

I could not find the IP of QA-Power8-2-8247-22L-SN1010D5A in a similar way as it is not connected to hmc2.

I've asked Antonio Eisner to give us the new credentials of hmc1, but I doubt this will be useful at this point.

So I have no idea how to do what @zluo is asking for.

redcurrant is connected to hmc2, but you cannot manage it (restart or shutdown) now that hmc1 is gone. And it has Novalink installed like the other Power machines used for openQA.
I am not sure about this: check the subnets of hmc3 and redcurrant; they need to be reachable from each other. Maybe there is a gateway issue if they are in different subnets.

QA-Power8-2-8247-22L-SN1010D5A was connected to hmc1 only. Antonio might know the differences between the hmc versions: is the new hmc still able to manage the Power8 machine?

Actions #17

Updated by mkittler over 1 year ago

I was able to add grenache on hmc3 by scanning IPs and guessing credentials. Apparently grenache-fsp1.qa.suse.de and grenache-fsp2.qa.suse.de are redundant connections to the same host. Note that this is not a big step forward because those tests are using Novalink (spvm backend) and work independently of the hmc anyway (see e.g. https://openqa.suse.de/tests/10640668 for a test that passed regardless of this).

Actions #18

Updated by mkittler over 1 year ago

@nsinger has also created an SD ticket regarding redcurrant: https://sd.suse.com/servicedesk/customer/portal/1/SD-114488

By the way, the IP of redcurrant was 172.16.222.73 (visible on hmc2).

Actions #19

Updated by nicksinger over 1 year ago

  • Status changed from Feedback to Blocked
  • Assignee changed from mkittler to nicksinger
Actions #20

Updated by okurz over 1 year ago

mkittler wrote:

@nsinger has also created an SD ticket regarding redcurrant: https://sd.suse.com/servicedesk/customer/portal/1/SD-114488

By the way, the IP of redcurrant was 172.16.222.73 (visible on hmc2).

Please make sure to add all such information to https://racktables.nue.suse.com/ where applicable.

Actions #21

Updated by mkittler over 1 year ago

Note that Antonio also suggested to "create a ticket" (not mentioning where) for access via NIS credentials on HMC1. I suppose that's nevertheless something we should pursue (as it cannot hurt to have more options than just HMC3).

Actions #22

Updated by nicksinger over 1 year ago

Toni told us in the SD ticket that HMC3 will be switched off after every machine is migrated to HMC1 - so should we reconsider this task here as "move to powerhmc1"?

Actions #23

Updated by okurz over 1 year ago

  • Subject changed from [qe-tools] move Power8 and Power9 to new powerhmc3.arch.suse.de size:M to [qe-tools] Make sure Power8 and Power9 machines can be used with *any* usable HMC (was: move to new powerhmc3.arch.suse.de) size:M
  • Due date changed from 2023-03-14 to 2023-03-21
  • Status changed from Blocked to Feedback
  • Priority changed from High to Urgent

nicksinger wrote:

Toni told us in the SD ticket that HMC3 will be switched off after every machine is migrated to HMC1 - so should we reconsider this task here as "move to powerhmc1"?

Either way this ticket should focus on getting Power8+9 machines under control of a usable HMC, at least one, at best two :)

Actions #24

Updated by nicksinger over 1 year ago

  • Status changed from Feedback to In Progress

Redcurrant is already in powerhmc1 and I created an openQA service user in https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/511

Actions #25

Updated by okurz over 1 year ago

MR merged.

Actions #26

Updated by JERiveraMoya over 1 year ago

Is there any connection with ppc64le-spvm now having a high rate of failures (not seen in months, btw; it was very stable until today's build 81.1)?
https://openqa.suse.de/tests/overview?arch=&flavor=&machine=ppc64le-spvm&test=&modules=&module_re=&distri=sle&version=15-SP5&build=81.1&groupid=129

Actions #27

Updated by mkittler over 1 year ago

I don't think so. If you're affected by this issue then your test would just fail regardless of how often you retry. Besides, the spvm backend should be independent from the HMC.

Actions #28

Updated by nicksinger over 1 year ago

For some reason power8-2 is already in hmc1: https://powerhmc1.arch.suse.de/dashboard/#resources/systems/9d3c4d88-a65a-3ad5-97e0-a3f9147d76fb/logical-partitions - I used the opportunity to update racktables with the corresponding (static) IP for that interface.
I tried to verify that redcurrant is executing jobs again and found a strange needle in the most recent run (https://openqa.suse.de/tests/10720780#step/bootloader_start/4) whose origin I don't understand. Asked in #team-qa-tools for help.

Actions #29

Updated by nicksinger over 1 year ago

  • Status changed from In Progress to Feedback

Apparently this needle comes from https://github.com/os-autoinst/os-autoinst-distri-opensuse/blame/79441e7d5b1211ddc3f4af39c4194001058cb6cd/lib/susedistribution.pm#L858, which I created 4 years ago and which has not been touched since. I decided that a new needle should suffice to fix the test without breaking existing tests (e.g. Novalink-based tests, which use a similar needle-matching flow). I restarted https://openqa.suse.de/tests/10745740, which hopefully validates that redcurrant is usable again over powerhmc1. If that succeeds, I consider this task done.

Actions #30

Updated by nicksinger over 1 year ago

The openQA test failed because hmc1 was not the primary management controller for redcurrant and couldn't control the power state of the LPARs. I had to execute

chcomgmt -o setmaster -t norm -m redcurrant

Testing again with https://openqa.suse.de/tests/10747085
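
For reference, whether an HMC currently holds the master role for a managed system can be checked before and after such a change; a sketch assuming lscomgmt, the listing counterpart of chcomgmt (syntax may vary between HMC versions):

# show the co-management (master) settings for redcurrant
lscomgmt -m redcurrant
# chcomgmt -o relmaster -m redcurrant would release the master role again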

Actions #31

Updated by livdywan over 1 year ago

  • Due date changed from 2023-03-21 to 2023-03-24

nicksinger wrote:

The openQA test failed because hmc1 was not the primary management controller for redcurrant and couldn't control the power state of the LPARs. I had to execute

chcomgmt -o setmaster -t norm -m redcurrant

Testing again with https://openqa.suse.de/tests/10747085

Disk sda was not used for partitioning.
Disk sdb was wrongly used during partitioning. at sle/tests/console/validate_first_disk_selection.pm line 56.

Seems like it got further this time?

Incidentally bumping the due date so I stop stumbling over this ticket when checking our backlog ;-)

Actions #32

Updated by nicksinger over 1 year ago

  • Status changed from Feedback to Resolved

I can definitely confirm that this must be a test failure, so we can resolve this :)

Actions #33

Updated by openqa_review over 1 year ago

  • Status changed from Resolved to Feedback

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: online_sles15sp4_pscc_basesys-srv-desk-dev_def_full_zypp
https://openqa.suse.de/tests/10815518#step/bootloader#1/1

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.

Actions #34

Updated by livdywan over 1 year ago

This bug is still referenced in a failing openQA test: online_sles15sp4_pscc_basesys-srv-desk-dev_def_full_zypp
https://openqa.suse.de/tests/10815518#step/bootloader#1/1

@zoecao This looks to be a different issue? It's novalink, not hmc?

# Test died: no candidate needle with tag(s) 'novalink-ssh' matched
Actions #35

Updated by nicksinger over 1 year ago

  • Status changed from Feedback to Resolved

cdywan wrote:

This bug is still referenced in a failing openQA test: online_sles15sp4_pscc_basesys-srv-desk-dev_def_full_zypp
https://openqa.suse.de/tests/10815518#step/bootloader#1/1

@zoecao This looks to be a different issue? It's novalink, not hmc?

# Test died: no candidate needle with tag(s) 'novalink-ssh' matched

Indeed a different issue. I pinged @zoecao in Slack and removed the wrong bug reference/comment from the test.

Actions #38

Updated by openqa_review over 1 year ago

  • Status changed from Resolved to Feedback

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: offline_sles15sp3_pscc_lp-basesys-srv-desk-dev-contm-lgm-py2-tsm-wsm-pcm_all_full_pre
https://openqa.suse.de/tests/10957516#step/bootloader_start/1

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.

Actions #39

Updated by nicksinger over 1 year ago

  • Status changed from Feedback to Resolved

openqa_review wrote:

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: offline_sles15sp3_pscc_lp-basesys-srv-desk-dev-contm-lgm-py2-tsm-wsm-pcm_all_full_pre
https://openqa.suse.de/tests/10957516#step/bootloader_start/1

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.

Again a wrong reference. Added https://progress.opensuse.org/issues/126821 as the correct one and deleted the old one.

Actions #40

Updated by okurz over 1 year ago

$ openqa-query-for-job-label 'poo#124685'
10945318|2023-04-18 22:04:32|done|failed|select_disk||grenache-1
10945291|2023-04-18 21:21:09|done|failed|select_disk||grenache-1
10941323|2023-04-18 20:18:00|done|failed|select_disk||grenache-1
10918360|2023-04-14 08:53:05|done|failed|offline_sles15sp2_ltss_pscc_basesys-srv-phub_def_full_pre||grenache-1
10917811|2023-04-14 08:30:36|done|failed|sap_migration_online_zypper_sles4sap15sp2_pre||grenache-1
10917807|2023-04-14 08:30:35|done|failed|sap_migration_offline_scc_sles4sap15sp4_fullcd_pre||grenache-1
10917805|2023-04-14 08:25:54|done|failed|sap_migration_offline_scc_sles4sap15sp3_fullcd_pre||grenache-1
10917798|2023-04-14 08:21:34|done|failed|sap_migration_offline_dvd_sles4sap15sp3_fullcd_pre||grenache-1
10917775|2023-04-14 08:07:22|done|failed|offline_sles15sp2_ltss_pscc_basesys-srv-phub_def_full_pre||grenache-1
10912036|2023-04-13 08:56:58|done|failed|sap_migration_online_zypper_sles4sap15sp2_pre||grenache-1
Actions #41

Updated by okurz over 1 year ago

  • Due date deleted (2023-03-24)