action #157432
parted /dev/sda disk got error at powerVM worker size:M
Status: closed
Description
Observation
Failed job:
https://openqa.suse.de/tests/13768555#step/bootloader_start/39
Reproduce steps:
- wipefs -af /dev/sda erased the disk successfully.
- sync
- parted -s /dev/sda mklabel gpt failed with:
  Error: Partition(s) 2, 3 on /dev/sda have been written, but we have been unable to inform the kernel of the change, probably because it/they are in use. As a result, the old partition(s) will remain in use. You should reboot now before making future changes.
Acceptance criteria
- AC1: Errors clearly state what the problem is
- AC2: Test is able to handle a non-empty disk
Suggestions
- Maybe specific to grenache, see https://openqa.suse.de/tests/14289526#next_previous
  Error detected during first stage of the installation at sle/tests/autoyast/installation.pm line 288.
- Maybe related to #157447 - to be confirmed
- Sporadic issue https://openqa.suse.de/tests/latest?arch=ppc64le&distri=sle&flavor=Online&machine=ppc64le-spvm&test=create_hdd_textmode_yast&version=15-SP6#next_previous
- Ensure the disk is cleaned if it's not empty before running the test
- Properly wipe the disk
- Unmount the partition before trying to delete it (it seems partitions are automounted)
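The suggestions above could be combined into one cleanup step in the test code. A minimal sketch, assuming /dev/sda is the target disk and that lsblk, wipefs, blockdev, and parted are available on the SUT (the helper name cleanup_disk is hypothetical, not something from the existing test code):

```shell
# Hypothetical cleanup helper: unmount any automounted partitions, wipe the
# disk, and make the kernel drop its stale partition table before parted runs.
cleanup_disk() {
    dev="$1"    # e.g. /dev/sda
    # lsblk lists each partition with its mount point; unmount the mounted ones.
    lsblk -lnpo NAME,MOUNTPOINT "$dev" | awk '$2 != "" {print $1}' |
    while read -r part; do
        umount "$part"
    done
    wipefs -af "$dev"   # erase all filesystem/partition-table signatures
    sync                # flush outstanding writes
    # Re-read the (now empty) partition table so the kernel forgets the old
    # partitions instead of keeping them "in use".
    blockdev --rereadpt "$dev"
    parted -s "$dev" mklabel gpt
}
```

If blockdev --rereadpt still fails with a busy error, something is still holding a partition open and the unmount step would need to be revisited.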
Further details
Link to latest: https://openqa.suse.de/tests/latest?arch=ppc64le&distri=sle&flavor=Online&machine=ppc64le-spvm&test=create_hdd_textmode_yast#next_previous
Updated by okurz 8 months ago
@tinawang123 I don't think we can do anything about that from the openQA side. This is quite usual behaviour I have seen on Linux regardless of the architecture and definitely not related to a specific machine. You need to handle that behaviour in the test code accordingly.
Updated by JERiveraMoya 8 months ago
- Subject changed from [tools] parted /dev/sda disk got error at powerVM worker to parted /dev/sda disk got error at powerVM worker
okurz wrote in #note-2:
@tinawang123 I don't think we can do anything about that from the openQA side. This is quite usual behaviour I have seen on Linux regardless of the architecture and definitely not related to a specific machine. You need to handle that behaviour in the test code accordingly.
The reason for introducing this wiping + partitioning was to make the tests more stable (the disk was reused with whatever was left from previous runs, which made it unpredictable and caused sporadic failures). This has been working for ages on PowerVM, and on s390x we do something similar, https://openqa.suse.de/tests/13783114#step/bootloader_start/49.
You can compare with what we expect for PowerVM in an old passing job: https://openqa.suse.de/tests/11163215#step/bootloader_start/35.
Is there any other way to have a fresh LPAR there? Rebooting at that point and handling it in the test would be overkill; this new setup for some reason doesn't allow performing that operation. Googling, I hit some results regarding potential kernel issues (but no idea, honestly).
Updated by okurz 8 months ago
JERiveraMoya wrote in #note-3:
Is there any other way to have a fresh lpar there?
Potentially by wiping the LPAR-assigned storage from NovaLink at the beginning of the test execution.
As an alternative, one could try to force a refresh of the storage devices from the Linux system.
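One way that "refresh from the Linux system" alternative could look, as a hedged sketch (the function name and the exact sysfs path are assumptions; partprobe ships with parted, udevadm with systemd):

```shell
# Hypothetical refresh step: nudge the kernel and udev to re-scan the disk.
refresh_storage() {
    dev="$1"              # e.g. /dev/sda
    name="${dev##*/}"     # basename, e.g. sda
    # Ask the kernel to re-read the partition table from the device.
    partprobe "$dev"
    # For SCSI-backed devices a low-level rescan can also be triggered,
    # if the sysfs attribute exists and is writable.
    rescan="/sys/block/$name/device/rescan"
    [ -w "$rescan" ] && echo 1 > "$rescan"
    # Wait until udev has processed the resulting events.
    udevadm settle
}
```

Whether the SCSI rescan path applies to these PowerVM-virtualized disks is an open question; partprobe plus udevadm settle is the more generic part of the sketch.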
Rebooting at that point and handling it in the test would be overkill; this new setup for some reason doesn't allow performing that operation.
What do you mean by "new setup"? What we have now in PRG2 are the very same machines that were already in use before.
Updated by openqa_review 8 months ago
- Due date set to 2024-04-02
Setting due date based on mean cycle time of SUSE QE Tools
Updated by JERiveraMoya 7 months ago · Edited
Unfortunately, the latest build doesn't get past the bootloader, so I can't give you any feedback here (sorry for the delay in any case).
Once that happens we might consider your advice (although technically I don't know how it can be done).
The issues will most likely persist, but if you prefer to reject this for now we can reopen it later; up to you how to handle it.
The other point is that I now know these are the same machines that had existing issues from years ago, thanks for that info.
Updated by JERiveraMoya 6 months ago
Here is the sporadic issue: https://openqa.suse.de/tests/latest?arch=ppc64le&distri=sle&flavor=Online&machine=ppc64le-spvm&test=create_hdd_textmode_yast&version=15-SP6#next_previous
What command did you suggest running before parted to refresh the storage?
As for the other suggestion, we don't have the expertise to wipe the LPAR-assigned storage from NovaLink.
Could this be connected with https://progress.opensuse.org/issues/157447?
If you need a new ticket instead of reopening this one, please let us know.
Updated by leli 6 months ago
- Status changed from Rejected to New
@okurz I reopened this ticket; please help solve it, as we don't have enough knowledge to fix it. Refer to Joaquin's comments. Thanks.
Some investigation into the history of the issue: https://openqa.suse.de/tests/14289526#next_previous.
All failed cases have the parted issue and ran on grenache-1:2 and grenache-1:4, while passed jobs without the parted issue ran on other workers (strangely, I also found one passed job on 1:4).
Updated by okurz 6 months ago
- Related to action #157447: ppc64le-spvm stops at grub page, timeout due to slow PXE traffic between PRG2 and NUE2? added
Updated by openqa_review 5 months ago
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: autoyast_reinstall
https://openqa.suse.de/tests/14373678#step/bootloader_start/1
To prevent further reminder comments one of the following options should be followed:
- The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
- The openQA job group is moved to "Released" or "EOL" (End-of-Life)
- The bugref in the openQA scenario is removed or replaced, e.g.
label:wontfix:boo1234
Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.
Updated by nicksinger 5 months ago
I have also seen such behavior on several Linux systems. From the configuration point of view I can only tell you that all disks are created equally once and live in some virtualization layer ("PowerVM") managed by either NovaLink or the HMC/VIOS. It would surprise me if the management layer made any difference, but it might be worth a try.
leli wrote in #note-11:
All failed cases have the parted issue and ran on grenache-1:2 and grenache-1:4, while passed jobs without the parted issue ran on other workers (strangely, I also found one passed job on 1:4).
This brings me back to my first sentence; it might just be a case of random access to that disk. I don't see much done between the wipefs call and parted except a single sync, which does not really ensure the kernel is no longer holding any previous references to the partition table.
Fully recreating the disk in NovaLink or the HMC might be equally overkill as a reboot of the testcase itself because it usually takes some time to process and is prone to produce errors.
We could, however, introduce another disk to these systems so the test could switch to it in case this (sporadic) issue is encountered. But I'm not sure this is really worth the effort to implement.
Updated by jbaier_cz 3 months ago
Looking at https://openqa.suse.de/tests/14373678#step/bootloader_start/39: the partitions are mounted before the wipefs call. Of course the system complains about deleting mounted partitions. Wouldn't the solution here be to just unmount them before the wipe and sync?
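A minimal sketch of that "unmount first" idea, reading the mount table directly (the helper name is hypothetical; the second argument defaults to /proc/mounts and exists only so the logic can be exercised without touching a real disk):

```shell
# Hypothetical helper: unmount every mounted partition of the given disk
# before wipefs/sync/parted touch it.
unmount_disk_parts() {
    dev="$1"
    tab="${2:-/proc/mounts}"   # mount table; overridable for dry runs
    # The first column of /proc/mounts is the mounted device; select
    # entries like /dev/sdaN while skipping the bare disk itself.
    awk -v d="$dev" 'index($1, d) == 1 && $1 != d {print $1}' "$tab" |
    while read -r part; do
        umount "$part"
    done
}
```

unmount_disk_parts /dev/sda would then run right before the existing wipefs -af /dev/sda call.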
Updated by okurz about 2 months ago
- Description updated (diff)
- Status changed from Workable to Resolved
- Assignee set to okurz
By now the "link to latest" is https://openqa.suse.de/tests/latest?arch=ppc64le&distri=sle&flavor=Online&machine=ppc64le-spvm&test=create_hdd_textmode_yast#next_previous, showing at least 5 consecutive passed jobs. The ticket so far does not mention a fail ratio, so I am assuming the issue is resolved.