action #106257

closed

[sle][security][migration][PowerVM][hardware]test fails in await_install which is caused by disk error on redcurrant-*.qa

Added by rfan1 about 2 years ago. Updated about 2 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Infrastructure
Target version:
Start date:
2022-02-08
Due date:
% Done:

0%

Estimated time:
Difficulty:

Description

Observation

openQA test in scenario sle-15-SP4-Migration-from-SLE15-SPx-Milestone-ppc64le-online_sles15sp3_scc_basesys-srv-desk-dev_all_full_zypp_pre@ppc64le-hmc-single-disk fails in
await_install

Test suite description

Reproducible

Fails since (at least) Build 88.4

Expected result

Last good: 79.1 (or more recent)

Further details

Always latest result in this scenario: latest


Related issues 1 (0 open, 1 closed)

Related to openQA Infrastructure - action #106598: Redcurrant has a broken HDD (Rejected, nicksinger, 2022-02-10)

Actions #1

Updated by rfan1 about 2 years ago

The issue is caused by a hardware problem, and it seems only part of the LPARs on redcurrant are impacted [redcurrant-5 and redcurrant-7]; there is no issue with [redcurrant-4, redcurrant-6, etc.].

I have filed a Jira ticket https://sd.suse.com/servicedesk/customer/portal/1/SD-75978 to track the issue and got confirmation that it is a hardware issue:

Michal Suchanek commented:

The VIOS errlog command clearly shows disk errors (attachment: errlog.txt)
-> hardware problem (with hdisk2 specifically)
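
For reference, a rough sketch of how such disk errors can be inspected on the VIOS restricted shell (exact command flags may vary by VIOS level):

errlog           # summary of logged hardware/software errors (run as padmin on the VIOS)
errlog -ls       # detailed listing; look for disk error entries against the failing hdisk (here hdisk2)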

Actions #2

Updated by rfan1 about 2 years ago

  • Subject changed from [sle][PowerVM][hardware]test fails in await_install which is caused by disk error to [sle][migration][PowerVM][hardware]test fails in await_install which is caused by disk error
Actions #3

Updated by rfan1 about 2 years ago

  • Subject changed from [sle][migration][PowerVM][hardware]test fails in await_install which is caused by disk error to [sle][security][migration][PowerVM][hardware]test fails in await_install which is caused by disk error
Actions #4

Updated by rfan1 about 2 years ago

More updates!

Only part of the LPARs are impacted based on my findings [redcurrant-3, 5, 7].

Others work fine [redcurrant-4, 6, etc.].

Actions #5

Updated by JERiveraMoya about 2 years ago

  • Priority changed from Normal to High

Plenty of failures in the latest build; even restarting multiple times we don't hit the good LPARs easily.
Could it be possible to deactivate the failing ones?
https://openqa.suse.de/tests/overview?arch=ppc64le&flavor=&machine=ppc64le-hmc-single-disk%2Cppc64le-hmc-4disk&test=&modules=&module_re=&distri=sle&version=15-SP4&build=95.1&groupid=129#

Yes, tests in the security group failed as well :(
The worker classes are recorded at:
https://gitlab.suse.de/-/ide/project/openqa/salt-pillars-openqa/tree/master/-/openqa/workerconf.sls/

We can adjust this file or define a new worker class for an individual test case as a workaround.
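
As an illustration of that workaround (the job id and worker class below are placeholders, not values taken from this ticket), a failing scenario could be re-triggered on a healthy LPAR by overriding WORKER_CLASS when cloning the job:

openqa-clone-job --within-instance https://openqa.suse.de <job_id> WORKER_CLASS=<class_matching_good_lpars>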

At the same time, I am still looking for help to fix the hardware issue via the Jira SD ticket as well; hopefully it can be fixed asap.

Actions #6

Updated by rfan1 about 2 years ago

Just received some kind feedback from Matthias Griessmeier; we may need to order a new disk.

For now, we may try to use the good LPARs for our tests as a workaround.

Actions #7

Updated by okurz about 2 years ago

Actions #8

Updated by nicksinger about 2 years ago

  • Status changed from New to In Progress
  • Assignee set to nicksinger

The affected LPARs are:

  • redcurrant-1
  • redcurrant-2
  • redcurrant-3
  • redcurrant-5
  • redcurrant-7
  • redcurrant-9 defined but not in use currently
  • redcurrant-11 defined but not in use currently

I will mask the corresponding workers for now so tests only run on working LPARs. Next Wednesday we will be in the office and will try to replace the disk (also see the SD ticket for more details).

Actions #9

Updated by nicksinger about 2 years ago

Worker instances stopped and masked with

systemctl stop openqa-worker-auto-restart@{21,22,23,25,27} && systemctl mask openqa-worker-auto-restart@{21,22,23,25,27}

on worker grenache-1 (see https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls#L1003-1042).
I guess that for the replacement and reconstruction of the "Storage pool" I will need to shut down the whole machine (redcurrant-1), but I will do this shortly before we replace the physical disk.

Afterwards the following steps are needed to enable the instances again:

systemctl unmask openqa-worker-auto-restart@{21,22,23,25,27} && systemctl start openqa-worker-auto-restart@{21,22,23,25,27}
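
A quick sanity check on grenache-1 (sketch; same instance numbers as above) to confirm the state of those worker instances:

systemctl is-enabled openqa-worker-auto-restart@{21,22,23,25,27}   # reports "masked" while they are disabled
systemctl status openqa-worker-auto-restart@21                     # spot-check a single instance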
Actions #10

Updated by nicksinger about 2 years ago

  • Status changed from In Progress to Blocked

Blocked until we are in the office.

Actions #11

Updated by rfan1 about 2 years ago

The physical disk was replaced; the storage pool will be checked by nsinger.

Actions #12

Updated by nicksinger about 2 years ago

  • Due date set to 2022-02-18
  • Status changed from Blocked to Feedback

Unfortunately, just replacing the disk is not enough; the tests still produce errors. Checking the storage pool, I can see that only one HDD is present (which is expected), but I also can't see the newly built-in HDD. I have to check with Toni and IBM on how to proceed here.
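
For reference, a rough sketch of the VIOS commands that can be used to check whether the replacement disk shows up (generic names, not taken from the actual machine):

cfgdev    # rescan for newly attached devices on the VIOS
lspv      # the replacement disk should appear here as a new hdisk
lssp      # list storage pools and the disks/space they contain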

Actions #13

Updated by nicksinger about 2 years ago

  • Due date changed from 2022-02-18 to 2022-02-21
  • Status changed from Feedback to Workable

I managed to make the new HDD visible in the HMC. I used smitty on the VIOS itself to reformat the new disk. Adding it to the storage pool produced an error and I have to look into what the issue could be. Maybe the storage pool has to be recreated.
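
A sketch of the step that failed here, assuming generic pool and disk names (the real names on redcurrant may differ):

chsp -add -sp <storage_pool> hdiskN    # add the reformatted disk to the existing storage pool
mksp <storage_pool> hdiskN             # or, if the pool is beyond repair, recreate it on the new disk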

Actions #14

Updated by okurz about 2 years ago

  • Due date changed from 2022-02-21 to 2022-03-11
  • Target version set to Ready

@nicksinger if you take a look then let's treat it as part of the SUSE QE Tools backlog officially

Actions #15

Updated by okurz about 2 years ago

  • Subject changed from [sle][security][migration][PowerVM][hardware]test fails in await_install which is caused by disk error to [sle][security][migration][PowerVM][hardware]test fails in await_install which is caused by disk error on redcurrant-*.qa
Actions #16

Updated by rfan1 about 2 years ago

@nicksinger,

Do you have any update for this ticket?

BR/Richard.

Actions #17

Updated by nicksinger about 2 years ago

I fiddled around with the storage pool again today and had to recreate every virtual HDD from scratch. But apparently this already worked for redcurrant-6: https://openqa.suse.de/tests/8284153#step/partitioning/2
I will now re-enable all other instances to check whether they work as expected too.
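
For the record, recreating the virtual HDDs roughly corresponds to steps like the following on the VIOS (pool, size, backing-device and vhost names are placeholders):

lsmap -all                                                            # find which vhost adapter serves each LPAR
mkbdsp -sp <storage_pool> <size>G -bd <lpar>_disk -vadapter vhostN    # recreate one virtual disk and map it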

Actions #18

Updated by nicksinger about 2 years ago

  • Due date deleted (2022-03-11)
  • Status changed from Workable to Resolved

I checked the other instances too and found at least one working job. Please reopen if you see further problems.

Actions #20

Updated by openqa_review about 2 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: offline_sles15sp2_ltss_media_basesys-srv_all_full_pre
https://openqa.suse.de/tests/8382301

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.
