action #157432

closed

parted /dev/sda disk got error at powerVM worker size:M

Added by tinawang123 8 months ago. Updated about 2 months ago.

Status:
Resolved
Priority:
Low
Assignee:
Category:
Support
Target version:
Start date:
2024-03-18
Due date:
% Done:

0%

Estimated time:
Tags:

Description

Observation

Failed job:
https://openqa.suse.de/tests/13768555#step/bootloader_start/39
Reproduce steps:

  1. wipefs -af /dev/sda erased the disk successfully.
  2. sync
  3. parted -s /dev/sda mklabel gpt failed with: Error: Partition(s) 2, 3 on /dev/sda have been written, but we have been unable to inform the kernel of the change, probably because it/they are in use. As a result, the old partition(s) will remain in use. You should reboot now before making future changes.
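The steps above can be sketched as a script; the device path is taken from the report, while the function name and comments are illustrative (the real test drives these commands over the serial console):

```shell
#!/bin/sh
# Sketch of the failing sequence from the report (illustrative only).
reproduce_failure() {
    disk=${1:-/dev/sda}
    wipefs -af "$disk"   # erase filesystem/partition-table signatures
    sync                 # flushes dirty buffers, but does NOT force the
                         # kernel to re-read the now-stale partition table
    parted -s "$disk" mklabel gpt   # can fail while the kernel still
                                    # holds the old partitions in use
}
```

The gap between steps 2 and 3 is the crux: sync flushes data, but nothing tells the kernel that the partition table it cached is gone.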

Acceptance criteria

  • AC1: Errors clearly state what the problem is
  • AC2: Test is able to handle a non-empty disk

Suggestions

Further details

Link to latest: https://openqa.suse.de/tests/latest?arch=ppc64le&distri=sle&flavor=Online&machine=ppc64le-spvm&test=create_hdd_textmode_yast#next_previous


Related issues 1 (1 open, 0 closed)

Related to openQA Tests - action #157447: ppc64le-spvm stops at grub page, timeout due to slow PXE traffic between PRG2 and NUE2? (New, 2024-03-18)

Actions #1

Updated by okurz 8 months ago

  • Category set to Support
  • Status changed from New to In Progress
  • Assignee set to okurz
  • Target version set to Ready
Actions #2

Updated by okurz 8 months ago

@tinawang123 I don't think we can do anything about that from the openQA side. This is quite usual behaviour that I have seen in Linux regardless of the architecture, and it is definitely not related to a specific machine. You need to handle that behaviour in the test code accordingly.

Actions #3

Updated by JERiveraMoya 8 months ago

  • Subject changed from [tools] parted /dev/sda disk got error at powerVM worker to parted /dev/sda disk got error at powerVM worker

okurz wrote in #note-2:

@tinawang123 I don't think we can do anything about that from the openQA side. This is quite usual behaviour that I have seen in Linux regardless of the architecture, and it is definitely not related to a specific machine. You need to handle that behaviour in the test code accordingly.

The reason to introduce this wiping + partitioning was to make the tests more stable (because the disk was reused with whatever was left on it before, which made it unpredictable and caused sporadic failures). This has been working for ages on powervm, and on s390x, for example, we do something similar: https://openqa.suse.de/tests/13783114#step/bootloader_start/49.
You can compare with what we expect for pvm in an old passing job: https://openqa.suse.de/tests/11163215#step/bootloader_start/35.

Is there any other way to have a fresh lpar there? Rebooting at that point and handling it in the test would be overkill, and this new setup for some reason doesn't allow that operation to be performed. Googling, I hit some results pointing at potential kernel issues (but no idea, honestly).

Actions #4

Updated by okurz 8 months ago

JERiveraMoya wrote in #note-3:

Is there any other way to have a fresh lpar there?

Potentially by wiping the LPAR-assigned storage from novalink at the beginning of the test execution.
As an alternative, one could try to force a refresh of the storage devices from the Linux system.

Rebooting at that point and handling it in the test would be overkill, and this new setup for some reason doesn't allow that operation to be performed.

What do you mean by "new setup"? What we have now in PRG2 are the very same machines that were already in use before.
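The alternative suggested above, forcing a refresh of the storage devices from the Linux side, could look roughly like this. This is a minimal sketch, assuming blockdev, partprobe, partx, and udevadm are available in the installer environment; it is not the actual test code:

```shell
#!/bin/sh
# Sketch: ask the kernel to drop its cached view of a disk and rescan
# its partitions before repartitioning. Tool availability is assumed.
refresh_storage() {
    disk=${1:-/dev/sda}
    sync
    blockdev --flushbufs "$disk"   # flush buffered data for the device
    partprobe "$disk" \
        || partx -u "$disk"        # inform the kernel of table changes
    udevadm settle                 # wait for pending device events
}
```

Calling such a helper between wipefs and parted would address the "unable to inform the kernel" error without a full reboot, assuming nothing still holds the old partitions open.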

Actions #5

Updated by openqa_review 8 months ago

  • Due date set to 2024-04-02

Setting due date based on mean cycle time of SUSE QE Tools

Actions #6

Updated by okurz 8 months ago

  • Status changed from In Progress to Feedback
Actions #7

Updated by okurz 7 months ago

  • Due date changed from 2024-04-02 to 2024-04-30
  • Priority changed from Normal to Low
  • Target version changed from Ready to Tools - Next

No response. Following up with lower priority.

Actions #8

Updated by okurz 7 months ago

  • Due date deleted (2024-04-30)
  • Status changed from Feedback to Rejected
  • Target version changed from Tools - Next to Ready

Rejecting due to no response.

Actions #9

Updated by JERiveraMoya 7 months ago · Edited

Unfortunately, the latest build doesn't get past the bootloader, so I can't give you any feedback here (sorry for the delay in any case).
Once that happens we might consider your advice (although technically I don't know how it can be done).
The issues will most likely persist, but if you prefer to reject it for now we can reopen it later; up to you how to handle it.
The other point is that I now know these are the same machines that have had existing issues for years; thanks for that info.

Actions #10

Updated by JERiveraMoya 6 months ago

Here is the sporadic issue: https://openqa.suse.de/tests/latest?arch=ppc64le&distri=sle&flavor=Online&machine=ppc64le-spvm&test=create_hdd_textmode_yast&version=15-SP6#next_previous

What command did you suggest that could be run before the parted call to refresh the storage?
For the other suggestion, we don't have the expertise to remove the LPAR-assigned storage from novalink.

Could this be connected with https://progress.opensuse.org/issues/157447 ?
If you need a new ticket instead of reopening this one, please let us know.

Actions #11

Updated by leli 6 months ago

  • Status changed from Rejected to New

@okurz Reopened this ticket; please help to solve it. We don't have enough knowledge to fix it ourselves; see Joaquin's comments. Thanks.

Some investigation into the history of the issue: https://openqa.suse.de/tests/14289526#next_previous.

All failed cases have the parted issue and ran on grenache-1:2 and grenache-1:4, while jobs without the parted issue passed on other workers (strangely, I also found one passed job on 1:4).

Actions #12

Updated by okurz 6 months ago

  • Related to action #157447: ppc64le-spvm stops at grub page, timeout due to slow PXE traffic between PRG2 and NUE2? added
Actions #13

Updated by okurz 6 months ago

  • Assignee deleted (okurz)
Actions #14

Updated by livdywan 6 months ago

  • Subject changed from parted /dev/sda disk got error at powerVM worker to parted /dev/sda disk got error at powerVM worker size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #15

Updated by openqa_review 5 months ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: autoyast_reinstall
https://openqa.suse.de/tests/14373678#step/bootloader_start/1

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.

Actions #16

Updated by okurz 5 months ago

  • Tags set to infra
  • Project changed from openQA Project to openQA Infrastructure
  • Category changed from Support to Support

Maybe something that can be investigated more from the infra side rather than within openQA itself.

Actions #17

Updated by nicksinger 5 months ago

I also saw such behavior on several Linux systems. From the configuration point of view, I can only tell you that all disks are created equally once and live in some virtualization layer ("PowerVM") managed by either NovaLink or the HMC/VIOS. It would surprise me if the management layer made any difference, but it might be worth a try.

leli wrote in #note-11:

All failed case has the parted issue, on grenache-1:2 grenache-1:4, while passed job without the parted issue on other workers(Strange is that I found one passed job on 1:4 also)

This brings me back to my first sentence; it might just be a case of random access to that disk. I don't see much done between the wipefs call and parted except a single sync, which does not really ensure the kernel is no longer holding previous references to the partition table.
Fully recreating the disk in NovaLink or the HMC might be as much overkill as a reboot in the test case itself, because it usually takes some time to process and is prone to produce errors.

We could, however, introduce another disk to these systems so that they could switch to the other disk in case this (sporadic) issue is encountered. But I'm not sure if this is really worth the effort to implement.
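One way to narrow the gap between wipefs and parted without a reboot or a second disk would be a short retry loop that repeatedly nudges the kernel before relabeling. This is a sketch under the assumption that the error is transient; the function name, retry count, and sleep interval are illustrative, not the test's actual code:

```shell
#!/bin/sh
# Sketch: wipe, then retry the relabel a few times, asking the kernel
# to rescan the device between attempts. All parameters illustrative.
relabel_with_retry() {
    disk=${1:-/dev/sda}
    wipefs -af "$disk"
    sync
    for attempt in 1 2 3; do
        partprobe "$disk" 2>/dev/null   # nudge the kernel to rescan
        udevadm settle
        parted -s "$disk" mklabel gpt && return 0
        sleep 2   # give the kernel time to release the old partitions
    done
    return 1
}
```

Compared with adding a second disk, a retry keeps the worker configuration unchanged, at the cost of masking rather than removing the underlying race.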

Actions #18

Updated by jbaier_cz 3 months ago

Looking at https://openqa.suse.de/tests/14373678#step/bootloader_start/39: the partitions are mounted before the wipefs call, so of course the system complains about deleting mounted partitions. Wouldn't the solution here be to simply unmount them first, before the wipe and sync?
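Finding what has to be unmounted first can be done by scanning /proc/mounts for entries backed by the disk. A minimal sketch; the helper name and the optional mounts-file argument (which exists only to make the function testable) are assumptions, not part of the actual test code:

```shell
#!/bin/sh
# Sketch: print the mount point of every entry whose source device
# starts with the given disk path, e.g. /dev/sda matches /dev/sda2.
list_disk_mounts() {
    awk -v d="$1" 'index($1, d) == 1 { print $2 }' "${2:-/proc/mounts}"
}

# Usage sketch, run before wiping:
#   list_disk_mounts /dev/sda | while read -r mp; do umount "$mp"; done
#   wipefs -af /dev/sda && sync
```

Unmounting first removes the "in use" condition at its source, which is why this suggestion fits the eventual resolution better than forcing a kernel rescan.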

Actions #19

Updated by okurz about 2 months ago

  • Description updated (diff)
  • Status changed from Workable to Resolved
  • Assignee set to okurz

By now the "link to latest" is https://openqa.suse.de/tests/latest?arch=ppc64le&distri=sle&flavor=Online&machine=ppc64le-spvm&test=create_hdd_textmode_yast#next_previous, showing at least 5 consecutive passed jobs. The ticket so far does not mention a fail ratio, so given that, I am assuming the issue is resolved.
