action #157432

closed

parted /dev/sda disk got error at powerVM worker size:M

Added by tinawang123 8 months ago. Updated about 2 months ago.

Status:
Resolved
Priority:
Low
Assignee:
Category:
Support
Target version:
Start date:
2024-03-18
Due date:
% Done:

0%

Estimated time:
Tags:

Description

Observation

Failed job:
https://openqa.suse.de/tests/13768555#step/bootloader_start/39
Reproduce steps:

  1. wipefs -af /dev/sda erased the disk successfully.
  2. sync
  3. parted -s /dev/sda mklabel gpt failed with: Error: Partition(s) 2, 3 on /dev/sda have been written, but we have been unable to inform the kernel of the change, probably because it/they are in use. As a result, the old partition(s) will remain in use. You should reboot now before making future changes.
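The steps above can be sketched as a script; the device path is taken from the report, while the function name and comments are illustrative (the real test drives these commands over the serial console):

```shell
#!/bin/sh
# Sketch of the failing sequence from the report (illustrative only).
reproduce_failure() {
    disk=${1:-/dev/sda}
    wipefs -af "$disk"   # erase filesystem/partition-table signatures
    sync                 # flushes dirty buffers, but does NOT force the
                         # kernel to re-read the now-stale partition table
    parted -s "$disk" mklabel gpt   # can fail while the kernel still
                                    # holds the old partitions in use
}
```

The gap between steps 2 and 3 is the crux: sync flushes data, but nothing tells the kernel that the partition table it cached is gone.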

Acceptance criteria

  • AC1: Errors clearly state what the problem is
  • AC2: Test is able to handle a non-empty disk

Suggestions

Further details

Link to latest: https://openqa.suse.de/tests/latest?arch=ppc64le&distri=sle&flavor=Online&machine=ppc64le-spvm&test=create_hdd_textmode_yast#next_previous


Related issues 1 (1 open, 0 closed)

Related to openQA Tests - action #157447: ppc64le-spvm stops at grub page, timeout due to slow PXE traffic between PRG2 and NUE2? (New, 2024-03-18)

Actions #1

Updated by okurz 8 months ago

  • Category set to Support
  • Status changed from New to In Progress
  • Assignee set to okurz
  • Target version set to Ready
Actions #2

Updated by okurz 8 months ago

@tinawang123 I don't think we can do anything about that from the openQA side. This is quite usual behaviour that I have seen in Linux regardless of the architecture, and it is definitely not related to a specific machine. You need to handle that behaviour in the test code accordingly.

Actions #3

Updated by JERiveraMoya 8 months ago

  • Subject changed from [tools] parted /dev/sda disk got error at powerVM worker to parted /dev/sda disk got error at powerVM worker

okurz wrote in #note-2:

@tinawang123 I don't think we can do anything about that from the openQA side. This is quite usual behaviour that I have seen in Linux regardless of the architecture, and it is definitely not related to a specific machine. You need to handle that behaviour in the test code accordingly.

The reason to introduce this wiping + partitioning was to make the tests more stable (because the disk was reused with whatever was left on it before, which made it unpredictable and caused sporadic failures). This has been working for ages on powervm, and on s390x, for example, we do something similar: https://openqa.suse.de/tests/13783114#step/bootloader_start/49.
You can compare with what we expect for pvm in an old passing job: https://openqa.suse.de/tests/11163215#step/bootloader_start/35.

Is there any other way to have a fresh lpar there? Rebooting at that point and handling it in the test would be overkill, and this new setup for some reason doesn't allow that operation to be performed. Googling, I hit some results pointing at potential kernel issues (but no idea, honestly).

Actions #4

Updated by okurz 8 months ago

JERiveraMoya wrote in #note-3:

Is there any other way to have a fresh lpar there?

Potentially by wiping the LPAR-assigned storage from novalink at the beginning of the test execution.
As an alternative, one could try to force a refresh of the storage devices from the Linux system.

Rebooting at that point and handling it in the test would be overkill, and this new setup for some reason doesn't allow that operation to be performed.

What do you mean by "new setup"? What we have now in PRG2 are the very same machines that were already in use before.
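The alternative suggested above, forcing a refresh of the storage devices from the Linux side, could look roughly like this. This is a minimal sketch, assuming blockdev, partprobe, partx, and udevadm are available in the installer environment; it is not the actual test code:

```shell
#!/bin/sh
# Sketch: ask the kernel to drop its cached view of a disk and rescan
# its partitions before repartitioning. Tool availability is assumed.
refresh_storage() {
    disk=${1:-/dev/sda}
    sync
    blockdev --flushbufs "$disk"   # flush buffered data for the device
    partprobe "$disk" \
        || partx -u "$disk"        # inform the kernel of table changes
    udevadm settle                 # wait for pending device events
}
```

Calling such a helper between wipefs and parted would address the "unable to inform the kernel" error without a full reboot, assuming nothing still holds the old partitions open.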

Actions #5

Updated by openqa_review 8 months ago

  • Due date set to 2024-04-02

Setting due date based on mean cycle time of SUSE QE Tools

Actions #6

Updated by okurz 8 months ago

  • Status changed from In Progress to Feedback
Actions #7

Updated by okurz 7 months ago

  • Due date changed from 2024-04-02 to 2024-04-30
  • Priority changed from Normal to Low
  • Target version changed from Ready to Tools - Next

No response. Following up with lower priority.

Actions #8

Updated by okurz 7 months ago

  • Due date deleted (2024-04-30)
  • Status changed from Feedback to Rejected
  • Target version changed from Tools - Next to Ready

Rejecting due to no response.

Actions #9

Updated by JERiveraMoya 7 months ago · Edited

Unfortunately, the latest build doesn't get past the bootloader, so I can't give you any feedback here (sorry for the delay in any case).
Once that happens we might consider your advice (although technically I don't know how it can be done).
The issues will most likely persist, but if you prefer to reject it for now we can reopen it later; up to you how to handle it.
The other point is that I now know these are the same machines that have had existing issues for years; thanks for that info.

Actions #10

Updated by JERiveraMoya 6 months ago

Here is the sporadic issue: https://openqa.suse.de/tests/latest?arch=ppc64le&distri=sle&flavor=Online&machine=ppc64le-spvm&test=create_hdd_textmode_yast&version=15-SP6#next_previous

What command did you suggest that could be run before the parted call to refresh the storage?
For the other suggestion, we don't have the expertise to remove the LPAR-assigned storage from novalink.

Could this be connected with https://progress.opensuse.org/issues/157447 ?
If you need a new ticket instead of reopening this one, please let us know.

Actions #11

Updated by leli 6 months ago

  • Status changed from Rejected to New

@okurz Reopened this ticket; please help to solve it. We don't have enough knowledge to fix it ourselves; see Joaquin's comments. Thanks.

Some investigation into the history of the issue: https://openqa.suse.de/tests/14289526#next_previous.

All failed cases have the parted issue and ran on grenache-1:2 and grenache-1:4, while jobs without the parted issue passed on other workers (strangely, I also found one passed job on 1:4).

Actions #12

Updated by okurz 6 months ago

  • Related to action #157447: ppc64le-spvm stops at grub page, timeout due to slow PXE traffic between PRG2 and NUE2? added
Actions #13

Updated by okurz 6 months ago

  • Assignee deleted (okurz)
Actions #14

Updated by livdywan 6 months ago

  • Subject changed from parted /dev/sda disk got error at powerVM worker to parted /dev/sda disk got error at powerVM worker size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #15

Updated by openqa_review 5 months ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: autoyast_reinstall
https://openqa.suse.de/tests/14373678#step/bootloader_start/1

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.

Actions #16

Updated by okurz 5 months ago

  • Tags set to infra
  • Project changed from openQA Project to openQA Infrastructure
  • Category changed from Support to Support

Maybe something that can be investigated more from the infra side rather than within openQA itself.

Actions #17

Updated by nicksinger 5 months ago

I also saw such behavior on several Linux systems. From the configuration point of view, I can only tell you that all disks are created equally once and live in some virtualization layer ("PowerVM") managed by either NovaLink or the HMC/VIOS. It would surprise me if the management layer made any difference, but it might be worth a try.

leli wrote in #note-11:

All failed case has the parted issue, on grenache-1:2 grenache-1:4, while passed job without the parted issue on other workers(Strange is that I found one passed job on 1:4 also)

This brings me back to my first sentence; it might just be a case of random access to that disk. I don't see much done between the wipefs call and parted except a single sync, which does not really ensure the kernel is no longer holding previous references to the partition table.
Fully recreating the disk in NovaLink or the HMC might be as much overkill as a reboot in the test case itself, because it usually takes some time to process and is prone to produce errors.

We could, however, introduce another disk to these systems so that they could switch to the other disk in case this (sporadic) issue is encountered. But I'm not sure if this is really worth the effort to implement.
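One way to narrow the gap between wipefs and parted without a reboot or a second disk would be a short retry loop that repeatedly nudges the kernel before relabeling. This is a sketch under the assumption that the error is transient; the function name, retry count, and sleep interval are illustrative, not the test's actual code:

```shell
#!/bin/sh
# Sketch: wipe, then retry the relabel a few times, asking the kernel
# to rescan the device between attempts. All parameters illustrative.
relabel_with_retry() {
    disk=${1:-/dev/sda}
    wipefs -af "$disk"
    sync
    for attempt in 1 2 3; do
        partprobe "$disk" 2>/dev/null   # nudge the kernel to rescan
        udevadm settle
        parted -s "$disk" mklabel gpt && return 0
        sleep 2   # give the kernel time to release the old partitions
    done
    return 1
}
```

Compared with adding a second disk, a retry keeps the worker configuration unchanged, at the cost of masking rather than removing the underlying race.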

Actions #18

Updated by jbaier_cz 3 months ago

Looking at https://openqa.suse.de/tests/14373678#step/bootloader_start/39: the partitions are mounted before the wipefs call, so of course the system complains about deleting mounted partitions. Wouldn't the solution here be to simply unmount them first, before the wipe and sync?
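Finding what has to be unmounted first can be done by scanning /proc/mounts for entries backed by the disk. A minimal sketch; the helper name and the optional mounts-file argument (which exists only to make the function testable) are assumptions, not part of the actual test code:

```shell
#!/bin/sh
# Sketch: print the mount point of every entry whose source device
# starts with the given disk path, e.g. /dev/sda matches /dev/sda2.
list_disk_mounts() {
    awk -v d="$1" 'index($1, d) == 1 { print $2 }' "${2:-/proc/mounts}"
}

# Usage sketch, run before wiping:
#   list_disk_mounts /dev/sda | while read -r mp; do umount "$mp"; done
#   wipefs -af /dev/sda && sync
```

Unmounting first removes the "in use" condition at its source, which is why this suggestion fits the eventual resolution better than forcing a kernel rescan.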

Actions #19

Updated by okurz about 2 months ago

  • Description updated (diff)
  • Status changed from Workable to Resolved
  • Assignee set to okurz

By now the "link to latest" is https://openqa.suse.de/tests/latest?arch=ppc64le&distri=sle&flavor=Online&machine=ppc64le-spvm&test=create_hdd_textmode_yast#next_previous, showing at least 5 consecutive passed jobs. The ticket so far does not mention a fail ratio, so given that, I am assuming the issue is resolved.
