action #96833

[sap] test fails in hana_install auto_review:"(?s)tests/sles4sap/hana_install.*Test died: poo#96833 - locked block device"

Added by acarvajal 4 months ago. Updated about 2 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Bugs in existing tests
Target version: -
Start date: 2021-08-13
Due date: 2021-08-31
% Done: 100%
Estimated time:
Difficulty:

Description

Observation

openQA test in scenario sle-15-SP1-Server-DVD-SAP-Incidents-x86_64-qam-sles4sap_online_dvd_gnome_hana_nvdimm@64bit-ipmi-nvdimm fails in hana_install

Reproducible

The failure is sporadic, and it only affects tests that schedule the sles4sap/hana_install test module on MACHINE=64bit-ipmi-nvdimm, identified by IPMI_HOSTNAME=sp.holmes.qa.suse.de.

This system has one 465G SATA HDD, which needs to be re-partitioned before installing HANA to ensure there is enough free space for the HANA installation (2.5 times the amount of RAM).
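The sizing rule above can be written out as a quick check. This is only a sketch: the RAM figure used below is a made-up example (the ticket does not state how much memory the machine has), and the function name is hypothetical.

```python
HANA_SPACE_FACTOR = 2.5  # from the ticket: free space needed is 2.5 times the RAM
DISK_SIZE_G = 465        # from the ticket: one 465G SATA HDD


def required_hana_space(ram_g: float, factor: float = HANA_SPACE_FACTOR) -> float:
    """Return the free space (same unit as ram_g) that the sizing rule asks for."""
    return ram_g * factor


# Hypothetical example value; the real amount of RAM is not given in the ticket.
example_ram_g = 128
needed = required_hana_space(example_ram_g)
print(f"RAM {example_ram_g}G -> {needed:.0f}G needed, fits on disk: {needed <= DISK_SIZE_G}")
```

Space left over from a previous run's LVM layout counts against this budget, which is why the disk has to be re-partitioned first.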

As can be seen on https://openqa.suse.de/tests/6797823#next_previous, the test sometimes completes successfully, while at other times it is unable to clear all LVM structures left over from the previous run, resulting in a failure:

# wait_serial expected: "pvremove -f /dev/sda3; echo 2NoOV-\$?-"
# Result:
pvremove -f /dev/sda3; echo 2NoOV-$?-

# wait_serial expected: qr/2NoOV-\d+-/
# Result:

  Can't open /dev/sda3 exclusively.  Mounted filesystem?
  Can't open /dev/sda3 exclusively.  Mounted filesystem?
2NoOV-5-

Currently the test module calls wipefs -a as well as the pvremove, vgremove, lvremove and dmsetup remove commands to clean up the device, and the boot/boot_from_pxe test module also calls wipefs -a before installation, but this does not seem to be enough on some test runs.

One idea could be to zero-out the whole disk during boot/boot_from_pxe, but this could turn out to be very slow due to the disk size.

Expected result

Last good: :20736:libarchive (or more recent)

Further details

Always latest result in this scenario: latest

History

#1 Updated by acarvajal 4 months ago

  • Assignee set to acarvajal

As a first step towards gathering more information on the state of the LVM configuration in the system when the tests fail, I will submit a PR adding more debugging commands on failure.

I will also include the poo# in the die command in these cases so the test is properly identified as failing with this poo.

#3 Updated by acarvajal 4 months ago

Attempted clearing out logical volumes first with dmsetup remove, then with lvremove, and the last 5 verification runs have all completed successfully.

I have just cleaned up the code in the PR, and will be submitting at least 5 new verification runs to be certain that this is a fix before merging.
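The ordering described in this note can be sketched as a dry-run command builder. This is not the code from the PR (the test module itself is not shown in this ticket); the volume group and logical volume names are hypothetical, and the function only returns command strings without executing anything.

```python
from typing import Dict, List


def dm_name(vg: str, lv: str) -> str:
    """Device-mapper name LVM uses for vg/lv (hyphens inside names are doubled)."""
    return f"{vg.replace('-', '--')}-{lv.replace('-', '--')}"


def cleanup_commands(pv: str, volumes: Dict[str, List[str]]) -> List[str]:
    """Build the cleanup commands in order; nothing is executed here."""
    cmds: List[str] = []
    for vg, lvs in volumes.items():
        for lv in lvs:
            # Drop the device-mapper mapping first, so it no longer holds the PV open...
            cmds.append(f"dmsetup remove {dm_name(vg, lv)}")
            # ...then remove the logical volume itself.
            cmds.append(f"lvremove -f {vg}/{lv}")
        cmds.append(f"vgremove -f {vg}")
    cmds.append(f"pvremove -f {pv}")
    cmds.append(f"wipefs -a {pv}")
    return cmds


# Hypothetical VG/LV names, for illustration only.
for cmd in cleanup_commands("/dev/sda3", {"vg_hana": ["lv_data", "lv_log"]}):
    print(cmd)
```

Keeping the sequence as data makes it easy to log the exact order that was attempted when a run fails with "Can't open /dev/sda3 exclusively".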

#4 Updated by acarvajal 3 months ago

  • Subject changed from test fails in hana_install to test fails in hana_install auto_review:"(?s)tests/sles4sap/hana_install.*Test died: locked block device"

#5 Updated by acarvajal 3 months ago

  • Subject changed from test fails in hana_install auto_review:"(?s)tests/sles4sap/hana_install.*Test died: locked block device" to test fails in hana_install auto_review:"(?s)tests/sles4sap/hana_install.*Test died: poo#96833 - locked block device"
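The auto_review expression in the new subject can be tried out against a log excerpt as follows. This is a minimal sketch: Python's re module is only a stand-in (the labelling script may use a different regex engine), and the sample log lines are made up for illustration.

```python
import re

# Pattern copied from the ticket subject. (?s) makes "." match newlines too,
# so ".*" can span the lines between the module path and the die message.
AUTO_REVIEW = r"(?s)tests/sles4sap/hana_install.*Test died: poo#96833 - locked block device"

# Made-up log excerpt; real openQA logs look different.
sample_log = (
    "[debug] running tests/sles4sap/hana_install.pm\n"
    "[debug] pvremove -f /dev/sda3 returned 5\n"
    "[info] Test died: poo#96833 - locked block device\n"
)


def matches_known_issue(log_text: str) -> bool:
    """Return True if the log text matches the auto_review expression."""
    return re.search(AUTO_REVIEW, log_text) is not None


print(matches_known_issue(sample_log))
```

Including the poo# in the die message keeps the expression specific, so unrelated hana_install failures are not labelled with this ticket.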

#6 Updated by acarvajal 3 months ago

  • Due date set to 2021-08-31

The latest changes seem to have consistently yielded positive results.

Besides the 5 tests from https://progress.opensuse.org/issues/96833?#note-3, 5 more tests are passing with the code from the pull request.

I have removed the WIP label from the pull request and will proceed to merge it.

I have also updated this ticket with an auto_review regexp so that the openqa-label-known-issues-and-investigate-hook script will tag failed tests that match the new error message on L162 of tests/sles4sap/hana_install with this poo#. The latest tests that failed with this issue are currently:

$ env host=openqa.suse.de ./openqa-query-for-job-label poo#96833
6893068|2021-08-19 08:57:42|done|failed|qam-sles4sap_online_dvd_gnome_hana_nvdimm||grenache-1
6884463|2021-08-18 20:58:23|done|failed|qam-sles4sap_online_dvd_gnome_hana_nvdimm||grenache-1
6881752|2021-08-18 11:57:59|done|failed|qam-sles4sap_online_dvd_gnome_hana_nvdimm||grenache-1
6880171|2021-08-18 04:19:42|done|failed|qam-sles4sap_online_dvd_gnome_hana_nvdimm||grenache-1
6867933|2021-08-17 09:15:57|done|failed|qam-sles4sap_online_dvd_gnome_hana_nvdimm:investigate:retry||grenache-1
6857549|2021-08-17 05:22:38|done|failed|qam-sles4sap_online_dvd_gnome_hana_nvdimm||grenache-1
6857533|2021-08-17 01:53:58|done|failed|qam-sles4sap_online_dvd_gnome_hana_nvdimm||grenache-1
6857525|2021-08-16 12:22:38|done|failed|qam-sles4sap_online_dvd_gnome_hana_nvdimm||grenache-1
6855972|2021-08-16 07:55:43|done|failed|qam-sles4sap_online_dvd_gnome_hana_nvdimm||grenache-1
6840844|2021-08-14 01:27:03|done|failed|qam-sles4sap_online_dvd_gnome_hana_nvdimm||grenache-1
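The pipe-separated listing above is easy to post-process. The sketch below splits each row into a dictionary; the column names are guesses made for this example (the script's output carries no header), and the two sample rows are copied from the listing.

```python
from typing import Dict, List

# Column names are assumptions for illustration; the sixth column is empty in the listing.
FIELDS = ["job_id", "timestamp", "state", "result", "test", "unknown", "host"]


def parse_listing(text: str) -> List[Dict[str, str]]:
    """Parse pipe-separated job rows, skipping prompts and malformed lines."""
    records: List[Dict[str, str]] = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("$"):
            continue
        values = line.split("|")
        if len(values) != len(FIELDS):
            continue
        records.append(dict(zip(FIELDS, values)))
    return records


sample = """\
$ env host=openqa.suse.de ./openqa-query-for-job-label poo#96833
6893068|2021-08-19 08:57:42|done|failed|qam-sles4sap_online_dvd_gnome_hana_nvdimm||grenache-1
6867933|2021-08-17 09:15:57|done|failed|qam-sles4sap_online_dvd_gnome_hana_nvdimm:investigate:retry||grenache-1
"""

records = parse_listing(sample)
for record in records:
    print(record["job_id"], record["timestamp"], record["test"])
```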

I will be monitoring for failures with this issue until the end of the month before closing the ticket.

#7 Updated by maritawerner 3 months ago

  • Subject changed from test fails in hana_install auto_review:"(?s)tests/sles4sap/hana_install.*Test died: poo#96833 - locked block device" to [sap] test fails in hana_install auto_review:"(?s)tests/sles4sap/hana_install.*Test died: poo#96833 - locked block device"

#8 Updated by acarvajal 3 months ago

  • Status changed from New to Closed
  • % Done changed from 0 to 100

Since https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/13069 was merged, no more jobs have failed on this issue.

openqa-query-for-job-label poo#96833 shows the same results as before:

acarvajal@linux-mkji:~/git/openqa-scripts [master|✔] > env host=openqa.suse.de ./openqa-query-for-job-label poo#96833
6893068|2021-08-19 08:57:42|done|failed|qam-sles4sap_online_dvd_gnome_hana_nvdimm||grenache-1
6888371|2021-08-19 01:58:33|done|failed|sles4sap_online_dvd_gnome_hana_nvdimm||grenache-1
6884463|2021-08-18 20:58:23|done|failed|qam-sles4sap_online_dvd_gnome_hana_nvdimm||grenache-1
6881752|2021-08-18 11:57:59|done|failed|qam-sles4sap_online_dvd_gnome_hana_nvdimm||grenache-1
6880171|2021-08-18 04:19:42|done|failed|qam-sles4sap_online_dvd_gnome_hana_nvdimm||grenache-1
6867933|2021-08-17 09:15:57|done|failed|qam-sles4sap_online_dvd_gnome_hana_nvdimm:investigate:retry||grenache-1
6857549|2021-08-17 05:22:38|done|failed|qam-sles4sap_online_dvd_gnome_hana_nvdimm||grenache-1
6857533|2021-08-17 01:53:58|done|failed|qam-sles4sap_online_dvd_gnome_hana_nvdimm||grenache-1
6857525|2021-08-16 12:22:38|done|failed|qam-sles4sap_online_dvd_gnome_hana_nvdimm||grenache-1
6855972|2021-08-16 07:55:43|done|failed|qam-sles4sap_online_dvd_gnome_hana_nvdimm||grenache-1

Meanwhile, the job results on the worker https://openqa.suse.de/admin/workers/1264 show the following since August 20th (the PR was merged on the 19th):

Total Jobs: 151
Passed: 65 (43%)
Obsoleted: 59 (39%)
Incomplete: 4 (3%)
Failed: 23 (15%)

  • 7 on installation/add_update_test_repo
  • 3 on installation/partitioning_firstdisk
  • 3 on sles4sap/hana_test (probably performance-related; will create a new poo# to handle this)
  • 5 on boot/boot_from_pxe (failed connection to IPMI server)
  • 1 on installation/welcome (seems also like a connection issue with IPMI server)
  • 1 on sles4sap/wizard_hana_install (bsc#1184679)
  • 3 on boot/first_boot (new DM screen in 15-SP4)

And if we remove the Obsoleted tests we get:

Total Jobs: 92
Passed: 65 (71%)
Incomplete: 4 (4%)
Failed: 23 (25%)

No failures in hana_install in any of these tests.
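The figures in this note can be recomputed with a few lines. The counts are taken from the note itself; the helper name is hypothetical.

```python
from typing import Dict

# Job counts on the worker since August 20th, as reported in this note.
counts = {"Passed": 65, "Obsoleted": 59, "Incomplete": 4, "Failed": 23}

# Failed jobs broken down by test module, as reported in this note.
failed_by_module = {
    "installation/add_update_test_repo": 7,
    "installation/partitioning_firstdisk": 3,
    "sles4sap/hana_test": 3,
    "boot/boot_from_pxe": 5,
    "installation/welcome": 1,
    "sles4sap/wizard_hana_install": 1,
    "boot/first_boot": 3,
}


def percentages(job_counts: Dict[str, int]) -> Dict[str, int]:
    """Return each count as a rounded percentage of the total."""
    total = sum(job_counts.values())
    return {name: round(100 * n / total) for name, n in job_counts.items()}


without_obsoleted = {k: v for k, v in counts.items() if k != "Obsoleted"}

print(sum(counts.values()), percentages(counts))
print(sum(without_obsoleted.values()), percentages(without_obsoleted))
print("hana_install failures:", failed_by_module.get("sles4sap/hana_install", 0))
```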

Closing this.

#9 Updated by acarvajal 3 months ago

  • Status changed from Closed to Resolved

#10 Updated by openqa_review about 2 months ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: qam-sles4sap_online_dvd_gnome_hana_nvdimm@64bit-ipmi-nvdimm
https://openqa.suse.de/tests/7431617

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234
