action #106257

closed

[sle][security][migration][PowerVM][hardware]test fails in await_install which is caused by disk error on redcurrant-*.qa

Added by rfan1 about 2 years ago. Updated about 2 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Infrastructure
Target version:
Start date:
2022-02-08
Due date:
% Done:

0%

Estimated time:
Difficulty:

Description

Observation

openQA test in scenario sle-15-SP4-Migration-from-SLE15-SPx-Milestone-ppc64le-online_sles15sp3_scc_basesys-srv-desk-dev_all_full_zypp_pre@ppc64le-hmc-single-disk fails in
await_install

Test suite description

Reproducible

Fails since (at least) Build 88.4

Expected result

Last good: 79.1 (or more recent)

Further details

Always latest result in this scenario: latest


Related issues 1 (0 open, 1 closed)

Related to openQA Infrastructure - action #106598: Redcurrant has a broken HDD (Rejected, nicksinger, 2022-02-10)

Actions #1

Updated by rfan1 about 2 years ago

The issue is caused by a hardware problem, and it seems only part of the LPARs on redcurrant are impacted [redcurrant-5 and redcurrant-7]; there is no issue with [redcurrant-4, redcurrant-6, etc.].

I have filed a Jira ticket https://sd.suse.com/servicedesk/customer/portal/1/SD-75978 to track the issue and got confirmation that it is a hardware issue:

Michal Suchanek commented:

The VIOS errlog command clearly shows disk errors (attachment: errlog.txt)
-> hardware problem (with hdisk2 specifically)
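
For reference, a rough sketch of how such disk errors can be inspected on the VIOS restricted shell (exact command flags may vary by VIOS level):

errlog           # summary of logged hardware/software errors (run as padmin on the VIOS)
errlog -ls       # detailed listing; look for disk error entries against the failing hdisk (here hdisk2)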

Actions #2

Updated by rfan1 about 2 years ago

  • Subject changed from [sle][PowerVM][hardware]test fails in await_install which is caused by disk error to [sle][migration][PowerVM][hardware]test fails in await_install which is caused by disk error
Actions #3

Updated by rfan1 about 2 years ago

  • Subject changed from [sle][migration][PowerVM][hardware]test fails in await_install which is caused by disk error to [sle][security][migration][PowerVM][hardware]test fails in await_install which is caused by disk error
Actions #4

Updated by rfan1 about 2 years ago

More updates!

Only part of the LPARs are impacted based on my findings [redcurrant-3, 5, 7].

Others work fine [redcurrant-4, 6, etc.].

Actions #5

Updated by JERiveraMoya about 2 years ago

  • Priority changed from Normal to High

Plenty of failures in the latest build; even restarting multiple times we don't hit the good LPARs easily.
Could it be possible to deactivate the failing ones?
https://openqa.suse.de/tests/overview?arch=ppc64le&flavor=&machine=ppc64le-hmc-single-disk%2Cppc64le-hmc-4disk&test=&modules=&module_re=&distri=sle&version=15-SP4&build=95.1&groupid=129#

Yes, tests in the security group failed as well :(
The worker classes are recorded at:
https://gitlab.suse.de/-/ide/project/openqa/salt-pillars-openqa/tree/master/-/openqa/workerconf.sls/

We can adjust this file or define a new worker class for an individual test case as a workaround.
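
As an illustration of that workaround (the job id and worker class below are placeholders, not values taken from this ticket), a failing scenario could be re-triggered on a healthy LPAR by overriding WORKER_CLASS when cloning the job:

openqa-clone-job --within-instance https://openqa.suse.de <job_id> WORKER_CLASS=<class_matching_good_lpars>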

At the same time, I am still looking for help to fix the hardware issue via the Jira SD ticket as well; hopefully it can be fixed asap.

Actions #6

Updated by rfan1 about 2 years ago

Just received some kind feedback from Matthias Griessmeier; we may need to order a new disk.

For now, we may try to use the good LPARs for our tests as a workaround.

Actions #7

Updated by okurz about 2 years ago

Actions #8

Updated by nicksinger about 2 years ago

  • Status changed from New to In Progress
  • Assignee set to nicksinger

The affected LPARs are:

  • redcurrant-1
  • redcurrant-2
  • redcurrant-3
  • redcurrant-5
  • redcurrant-7
  • redcurrant-9 defined but not in use currently
  • redcurrant-11 defined but not in use currently

I will mask the corresponding workers for now so tests only run on working LPARs. Next Wednesday we will be in the office and will try to replace the disk (also see the SD ticket for more details).

Actions #9

Updated by nicksinger about 2 years ago

Worker instances stopped and masked with

systemctl stop openqa-worker-auto-restart@{21,22,23,25,27} && systemctl mask openqa-worker-auto-restart@{21,22,23,25,27}

on worker grenache-1 (see https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls#L1003-1042).
I guess that for the replacement and reconstruction of the "Storage pool" I will need to shut down the whole machine (redcurrant-1), but I will do this shortly before we replace the physical disk.

Afterwards the following steps are needed to enable the instances again:

systemctl unmask openqa-worker-auto-restart@{21,22,23,25,27} && systemctl start openqa-worker-auto-restart@{21,22,23,25,27}
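
A quick sanity check on grenache-1 (sketch; same instance numbers as above) to confirm the state of those worker instances:

systemctl is-enabled openqa-worker-auto-restart@{21,22,23,25,27}   # reports "masked" while they are disabled
systemctl status openqa-worker-auto-restart@21                     # spot-check a single instance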
Actions #10

Updated by nicksinger about 2 years ago

  • Status changed from In Progress to Blocked

Blocked until we are in the office.

Actions #11

Updated by rfan1 about 2 years ago

The physical disk was replaced; the storage pool will be checked by nsinger.

Actions #12

Updated by nicksinger about 2 years ago

  • Due date set to 2022-02-18
  • Status changed from Blocked to Feedback

Unfortunately, just replacing the disk is not enough; the tests still produce errors. Checking the storage pool, I can see that only one HDD is present (which is expected), but I also can't see the newly built-in HDD. I have to check with Toni and IBM on how to proceed here.
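
For reference, a rough sketch of the VIOS commands that can be used to check whether the replacement disk shows up (generic names, not taken from the actual machine):

cfgdev    # rescan for newly attached devices on the VIOS
lspv      # the replacement disk should appear here as a new hdisk
lssp      # list storage pools and the disks/space they contain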

Actions #13

Updated by nicksinger about 2 years ago

  • Due date changed from 2022-02-18 to 2022-02-21
  • Status changed from Feedback to Workable

I managed to make the new HDD visible in the HMC. I used smitty on the VIOS itself to reformat the new disk. Adding it to the storage pool produced an error and I have to look into what the issue could be. Maybe the storage pool has to be recreated.
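
A sketch of the step that failed here, assuming generic pool and disk names (the real names on redcurrant may differ):

chsp -add -sp <storage_pool> hdiskN    # add the reformatted disk to the existing storage pool
mksp <storage_pool> hdiskN             # or, if the pool is beyond repair, recreate it on the new disk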

Actions #14

Updated by okurz about 2 years ago

  • Due date changed from 2022-02-21 to 2022-03-11
  • Target version set to Ready

@nicksinger if you take a look then let's treat it as part of the SUSE QE Tools backlog officially

Actions #15

Updated by okurz about 2 years ago

  • Subject changed from [sle][security][migration][PowerVM][hardware]test fails in await_install which is caused by disk error to [sle][security][migration][PowerVM][hardware]test fails in await_install which is caused by disk error on redcurrant-*.qa
Actions #16

Updated by rfan1 about 2 years ago

@nicksinger,

Do you have any update for this ticket?

BR/Richard.

Actions #17

Updated by nicksinger about 2 years ago

I fiddled around with the storage pool again today and had to recreate every virtual HDD from scratch. But apparently this already worked for redcurrant-6: https://openqa.suse.de/tests/8284153#step/partitioning/2
I will now re-enable all other instances to check whether they work as expected too.
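
For the record, recreating the virtual HDDs roughly corresponds to steps like the following on the VIOS (pool, size, backing-device and vhost names are placeholders):

lsmap -all                                                            # find which vhost adapter serves each LPAR
mkbdsp -sp <storage_pool> <size>G -bd <lpar>_disk -vadapter vhostN    # recreate one virtual disk and map it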

Actions #18

Updated by nicksinger about 2 years ago

  • Due date deleted (2022-03-11)
  • Status changed from Workable to Resolved

I checked the other instances too and found at least one working job. Please reopen if you see further problems.

Actions #20

Updated by openqa_review about 2 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: offline_sles15sp2_ltss_media_basesys-srv_all_full_pre
https://openqa.suse.de/tests/8382301

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.
