action #106257
Status: closed
[sle][security][migration][PowerVM][hardware]test fails in await_install which is caused by disk error on redcurrant-*.qa
Added by rfan1 almost 3 years ago. Updated over 2 years ago.
Description
Observation
openQA test in scenario sle-15-SP4-Migration-from-SLE15-SPx-Milestone-ppc64le-online_sles15sp3_scc_basesys-srv-desk-dev_all_full_zypp_pre@ppc64le-hmc-single-disk fails in
await_install
Test suite description
Reproducible
Fails since (at least) Build 88.4
Expected result
Last good: 79.1 (or more recent)
Further details
Always latest result in this scenario: latest
Updated by rfan1 almost 3 years ago
The issue is caused by a hardware problem, and only some of the LPARs on redcurrant seem to be impacted (redcurrant-5 and redcurrant-7); there is no issue with redcurrant-4, redcurrant-6, etc.
I have filed a Jira ticket https://sd.suse.com/servicedesk/customer/portal/1/SD-75978 to track the issue and got confirmation that it is a hardware issue:
Michal Suchanek commented:
The VIOS errlog command clearly shows disk errors. (Attachment: errlog.txt)
-> hardware problem (with hdisk2 specifically)
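For reference, a rough sketch of how such disk errors can be inspected on a VIOS; these commands are illustrative (run from the padmin restricted shell, not taken from the ticket), and hdisk2 is the disk named above:

```shell
# Illustrative VIOS error-log inspection (padmin restricted shell).
errlog                 # summary listing of logged errors
errlog -ls             # long listing with full error details
# From the AIX root shell (after oem_setup_env), the underlying error report:
# errpt -a -N hdisk2   # detailed error entries for hdisk2 specifically
```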
Updated by rfan1 almost 3 years ago
- Subject changed from [sle][PowerVM][hardware]test fails in await_install which is caused by disk error to [sle][migration][PowerVM][hardware]test fails in await_install which is caused by disk error
Updated by rfan1 almost 3 years ago
- Subject changed from [sle][migration][PowerVM][hardware]test fails in await_install which is caused by disk error to [sle][security][migration][PowerVM][hardware]test fails in await_install which is caused by disk error
Updated by rfan1 almost 3 years ago
More updates!
Only part of the LPARs are impacted based on my findings (red....ant-3, 5, 7).
Others work fine (red....ant-4, 6, etc.)
Updated by JERiveraMoya almost 3 years ago
- Priority changed from Normal to High
There are plenty of failures in the latest build; even restarting multiple times, we don't easily hit the good ones.
Could the failing ones be deactivated?
https://openqa.suse.de/tests/overview?arch=ppc64le&flavor=&machine=ppc64le-hmc-single-disk%2Cppc64le-hmc-4disk&test=&modules=&module_re=&distri=sle&version=15-SP4&build=95.1&groupid=129#
Yes, tests in the security group failed as well :(
The worker classes are recorded at:
https://gitlab.suse.de/-/ide/project/openqa/salt-pillars-openqa/tree/master/-/openqa/workerconf.sls/
We can adjust this file or define a new worker class for the individual test cases as a workaround.
At the same time, I am still looking for help to fix the hardware issue via the Jira SD ticket as well; hopefully it can be fixed asap.
Updated by rfan1 almost 3 years ago
Just received some kind feedback from Matthias Griessmeier: we may need to order a new disk.
For now, we can use the good LPARs for our tests as a workaround.
Updated by okurz almost 3 years ago
- Related to action #106598: Redcurrant has a broken HDD added
Updated by nicksinger almost 3 years ago
- Status changed from New to In Progress
- Assignee set to nicksinger
The affected LPARs are:
- redcurrant-1
- redcurrant-2
- redcurrant-3
- redcurrant-5
- redcurrant-7
- redcurrant-9 (defined but not in use currently)
- redcurrant-11 (defined but not in use currently)
I will mask the corresponding workers for now so tests only run on working LPARs. Next Wednesday we will be in the office and will try to replace the disk (also see the SD ticket for more details).
Updated by nicksinger almost 3 years ago
Worker instances stopped and masked with
systemctl stop openqa-worker-auto-restart@{21,22,23,25,27} && systemctl mask openqa-worker-auto-restart@{21,22,23,25,27}
on worker grenache-1 (see https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls#L1003-1042).
I guess for the replacement and reconstruction of the "storage pool" I will need to shut down the whole machine (redcurrant-1), but I will do this shortly before we replace the physical disk.
Afterwards the following steps are needed to enable the instances again:
systemctl unmask openqa-worker-auto-restart@{21,22,23,25,27} && systemctl start openqa-worker-auto-restart@{21,22,23,25,27}
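As a side note, the brace expansion in these commands produces one systemd template instance per worker slot; a quick way to preview which unit names are affected (instance numbers taken from the commands above, bash brace expansion assumed):

```shell
# Preview the unit names generated by the brace expansion used above
# (purely illustrative; does not touch systemd).
echo openqa-worker-auto-restart@{21,22,23,25,27}
# → openqa-worker-auto-restart@21 openqa-worker-auto-restart@22 openqa-worker-auto-restart@23 openqa-worker-auto-restart@25 openqa-worker-auto-restart@27
```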
Updated by nicksinger almost 3 years ago
- Status changed from In Progress to Blocked
Blocked until we are in the office.
Updated by rfan1 almost 3 years ago
The physical disk was replaced; the storage pool will be checked by nsinger.
Updated by nicksinger almost 3 years ago
- Due date set to 2022-02-18
- Status changed from Blocked to Feedback
Unfortunately, just replacing the disk is not enough; the tests still produce errors. Checking the storage pool, I can see that only one HDD is present (which is expected), but I also can't see the newly built-in HDD. I have to check with Toni and IBM on how to proceed here.
Updated by nicksinger almost 3 years ago
- Due date changed from 2022-02-18 to 2022-02-21
- Status changed from Feedback to Workable
I managed to make the new HDD visible in the HMC. I used smitty on the VIOS itself to reformat the new disk. Adding it to the storage pool produced an error, and I have to look into what the issue could be. Maybe the storage pool has to be recreated.
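A sketch of the kind of VIOS storage-pool commands involved here; these are illustrative (padmin restricted shell), and the pool name "testpool" is a made-up example, not from the ticket:

```shell
# Illustrative VIOS storage-pool handling (padmin shell); names are examples.
lssp                                 # list storage pools and their sizes
lspv                                 # list physical volumes; the new disk should appear
# chsp -add -sp rootvg_pool hdisk2   # add the replaced disk to an existing pool
# mksp -f testpool hdisk2            # or (re)create a pool on the new disk
```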
Updated by okurz almost 3 years ago
- Due date changed from 2022-02-21 to 2022-03-11
- Target version set to Ready
@nicksinger if you take a look then let's treat it as part of the SUSE QE Tools backlog officially
Updated by okurz almost 3 years ago
- Subject changed from [sle][security][migration][PowerVM][hardware]test fails in await_install which is caused by disk error to [sle][security][migration][PowerVM][hardware]test fails in await_install which is caused by disk error on redcurrant-*.qa
Updated by rfan1 almost 3 years ago
Updated by nicksinger almost 3 years ago
I fiddled around with the storage pool again today and had to recreate every virtual HDD from scratch. But apparently this already worked for redcurrant-6: https://openqa.suse.de/tests/8284153#step/partitioning/2
I will now re-enable all other instances to check whether they work as expected too.
Updated by nicksinger almost 3 years ago
- Due date deleted (2022-03-11)
- Status changed from Workable to Resolved
I checked the other instances too and found at least one working job. Please reopen if you see further problems.
Updated by openqa_review over 2 years ago
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: offline_sles15sp2_ltss_media_basesys-srv_all_full_pre
https://openqa.suse.de/tests/8382301
To prevent further reminder comments one of the following options should be followed:
- The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
- The openQA job group is moved to "Released" or "EOL" (End-of-Life)
- The bugref in the openQA scenario is removed or replaced, e.g.
label:wontfix:boo1234
Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.