action #162941

closed

Add job group definitions for SLEM 6.0 to QAC-yaml

Added by ph03nix 6 months ago. Updated 2 months ago.

Status:
Resolved
Priority:
High
Assignee:
Target version:
-
Start date:
2024-06-27
Due date:
% Done:

100%

Estimated time:

Description

https://openqa.suse.de/group_overview/566 is a prototype of the upcoming maintenance setup for SLEM 6.0. We need to create a job group definition for this job group in the https://gitlab.suse.de/qac/qac-openqa-yaml/ repository.

I think a new file staging-slem6_0.yaml in https://gitlab.suse.de/qac/qac-openqa-yaml/-/tree/master/sle-micro would fit nicely
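For reference, job group definitions in that repository follow the openQA job-group YAML schema with defaults/products/scenarios sections. A minimal illustrative sketch of what staging-slem6_0.yaml could look like (the flavor, machine, and test-suite names below are assumptions for illustration, not the final content; the real entries should be taken from the prototype group 566):

```yaml
# Hypothetical sketch only -- real flavors and test suites to be copied from
# https://openqa.suse.de/group_overview/566
defaults:
  x86_64:
    machine: 64bit
    priority: 50
products:
  slem-6.0-Default-Updates-x86_64:
    distri: sle-micro
    flavor: Default-Updates
    version: '6.0'
scenarios:
  x86_64:
    slem-6.0-Default-Updates-x86_64:
      - slem_basic
```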

Acceptance criteria


Checklist

  • Default-qcow-Updates
  • Default-encrypted-Updates (x86_64 only)
  • Default-VMware-Updates (x86_64 only)
  • Base-qcow-Updates
  • Base-encrypted-Updates (x86_64 only)
  • Base-VMware-Updates (x86_64 only)
  • Base-RT-Updates (x86_64 only)

Related issues: 2 (2 open, 0 closed)

Related to openQA Project (public) - action #165923: [qa-tools][vmware][spikesolution][timeboxed:20h] VNC reconnect after reboot size:S (Workable, 2024-08-28)

Related to Containers and images - action #166748: [MinimalVM] VMware images not handling hdd subfolders (Workable, 2024-09-12)

Actions #1

Updated by ph03nix 6 months ago

  • Parent task set to #159828
Actions #2

Updated by mdati 6 months ago

  • Assignee set to mdati
Actions #3

Updated by mdati 6 months ago

  • Status changed from Workable to In Progress
Actions #5

Updated by mdati 6 months ago

  • Tags set to slem, yaml
  • Status changed from In Progress to Feedback

Acceptance criteria OK; no issues to fix at the moment.

Actions #6

Updated by mdati 6 months ago

  • Status changed from Feedback to Resolved
Actions #7

Updated by ph03nix 5 months ago

  • Status changed from Resolved to In Progress
  • Assignee changed from mdati to ph03nix

Reopening, as the product increments (https://openqa.suse.de/group_overview/572) are still to be done.

Actions #9

Updated by ph03nix 5 months ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

https://openqa.suse.de/admin/job_templates/572 is now populated and under our control.

Actions #10

Updated by mdati 5 months ago · Edited

  • Status changed from Resolved to In Progress
  • Assignee changed from ph03nix to mdati

Ticket reopened after discussion in Slack, to also address the SL Micro 6.0 product increments, including the VMware and encrypted flavors.

See https://progress.opensuse.org/issues/159828#note-27

Actions #11

Updated by mdati 5 months ago

  • Checklist item Default-qcow-Updates added
  • Checklist item Default-encrypted-Updates (x86_64 only) added
  • Checklist item Default-VMware-Updates (x86_64 only) added
  • Checklist item Base-qcow-Updates added
  • Checklist item Base-encrypted-Updates (x86_64 only) added
  • Checklist item Base-VMware-Updates (x86_64 only) added
  • Checklist item Base-RT-Updates (x86_64 only) added
Actions #12

Updated by mdati 5 months ago

Created MR https://gitlab.suse.de/qac/qac-openqa-yaml/-/merge_requests/1763, covering all products/flavors in the checklist.

Actions #13

Updated by mdati 5 months ago

MR 1763 merged.

Some logic errors were fixed in a new MR https://gitlab.suse.de/qac/qac-openqa-yaml/-/merge_requests/1766, also merged.

See the tests in the latest build of https://openqa.suse.de/group_overview/572

Actions #14

Updated by mdati 5 months ago · Edited

Today a new error affected the Base-VMware-Updates tests in SL Micro 6.0 Product Increments - Containers: they failed in the boot phase because the SL-Micro.x86_64-6.0-Base-VMware-GM.vmdk image is located in hdd/fixed.

Analysis below.

Findings on the VMware boot error, from comparing the autoinst logs of cases A and B:

(A) boot PASS:

https://openqa.suse.de/tests/15036824/logfile?filename=autoinst-log.txt

run_ssh_cmd(if test -e /vmfs/volumes/datastore1/openQA/SL-Micro.x86_64-6.0-Base-VMware-GM.vmdk; then while lsof | grep 'cp.*SL-Micro.x86_64-6.0-Base-VMware-GM.vmdk'; do echo File SL-Micro.x86_64-6.0-Base-VMware-GM.vmdk is being copied by other process, sleeping for 60 seconds; sleep 60;done;else cp /vmfs/volumes/openqa/hdd/SL-Micro.x86_64-6.0-Base-VMware-GM.vmdk /vmfs/volumes/datastore1/openQA/;fi;)] exit-code: 0

and

(B) boot FAIL:

https://openqa.suse.de/tests/15040545/logfile?filename=autoinst-log.txt

run_ssh_cmd(if test -e /vmfs/volumes/Datastore2/openQA/SL-Micro.x86_64-6.0-Base-VMware-GM.vmdk; then while lsof | grep 'cp.*SL-Micro.x86_64-6.0-Base-VMware-GM.vmdk'; do echo File SL-Micro.x86_64-6.0-Base-VMware-GM.vmdk is being copied by other process, sleeping for 60 seconds; sleep 60;done;else cp /vmfs/volumes/openqa/hdd/SL-Micro.x86_64-6.0-Base-VMware-GM.vmdk /vmfs/volumes/Datastore2/openQA/;fi;)] stderr:
  cp: can't stat '/vmfs/volumes/openqa/hdd/SL-Micro.x86_64-6.0-Base-VMware-GM.vmdk': No such file or directory
... exit-code: 1

The .vmdk file exists in ./hdd/fixed in both cases A and B, but:

  • in case B, the script does not find SL-Micro.x86_64-6.0-Base-VMware-GM.vmdk in VMWARE_DATASTORE, so it triggers a cp from ./hdd/; the file is not there either, and the copy fails;

  • in case A, the script finds the .vmdk already present in VMWARE_DATASTORE, probably left over from previous runs, so the (broken) cp is never executed and the error is skipped.

The preceding cleanup should have removed the .vmdk image in case A, but its match expression does not catch it: it expects an additional string such as "openQA-SUT" appended, so the bare SL-Micro.x86_64-6.0-Base-VMware-GM.vmdk file is never removed from VMWARE_DATASTORE. This cleanup runs before the if above.

Cleanup also ran in case B, but some files were locked and could not be deleted. This may be the reason the .vmdk was already present in case A.

Proposed solutions, by priority:

1) In _copy_image_vmware, either cp should fall back to hdd/fixed when no file is found in hdd, or $file_basename should carry the path including hdd/fixed.

2) Cleanup should also remove file_basename if nothing is using it:

   ...
   rm -f ${vmware_openqa_datastore}*${name}* \
      ${vmware_openqa_datastore}*${file_basename}

3) Clarify/normalize image management in the test code, handling both hdd and hdd/fixed consistently.
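As an illustration of item 1, a minimal shell sketch of a copy helper that falls back to hdd/fixed/ when the image is not directly under hdd/. The function and path layout are hypothetical, not the actual os-autoinst code:

```shell
#!/bin/sh
# Hypothetical sketch of the hdd/fixed fallback from item 1.
# copy_image SRC_ROOT DST_DIR IMAGE_BASENAME
copy_image() {
    src_root="$1"   # e.g. /vmfs/volumes/openqa
    dst_dir="$2"    # e.g. /vmfs/volumes/datastore1/openQA
    img="$3"        # e.g. SL-Micro.x86_64-6.0-Base-VMware-GM.vmdk
    if [ -e "$dst_dir/$img" ]; then
        echo "image already present in datastore, nothing to do"
    elif [ -e "$src_root/hdd/$img" ]; then
        cp "$src_root/hdd/$img" "$dst_dir/"
    elif [ -e "$src_root/hdd/fixed/$img" ]; then
        # fallback: fixed assets live in hdd/fixed/, not hdd/
        cp "$src_root/hdd/fixed/$img" "$dst_dir/"
    else
        echo "image $img not found in hdd/ or hdd/fixed/" >&2
        return 1
    fi
}
```

With this fallback, case B above would have found the image in hdd/fixed/ instead of failing on the missing hdd/ path.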

Actions #15

Updated by mdati 5 months ago

For item 1 above, created PR 2524 on Aug 4.

Actions #16

Updated by mdati 5 months ago

Activities temporarily paused; they will be resumed soon.

Actions #17

Updated by mdati 5 months ago · Edited

Activity restarted:

In the os-autoinst code flow, the add_disk($self, $args) routine, which calls _copy_image_to_vm_host($args, ...), also provides in $args the original full path of the image (possibly in a subfolder such as fixed/), coming from the bootloader_svirt.pm call (see my $hddpath); but that full path is never used and is effectively lost.

In fact, when _copy_image_vmware(..., $file_basename, ...) is called internally, only the basename is extracted and passed as a parameter; inside, the original image path is hard-coded and partially recalculated, without any subfolder handling. As a result, images in hdd/ (or iso/) subfolders are not handled correctly by the copy commands.

In the latest PR 2524 update, the file_basename input parameter of _copy_image_to_vm_host() and the inner _copy_image_vmware() was replaced with the full path of the source file, passed via the add_disk $args and coming (only) from the bootloader_svirt.pm settings at runtime (or the similar bootloader_zkvm.pm). This way images in subfolders such as fixed/ can also be handled, avoiding the needless recalculation of the original folder.
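To illustrate the problem in shell terms (the paths below are illustrative assumptions, not the exact ones in the Perl code): extracting only the basename discards the fixed/ component, so the recalculated source path points at the wrong location.

```shell
#!/bin/sh
# Illustrative only: shows how recalculating the source path from the basename
# drops the fixed/ subfolder that was present in the original $hddpath.
hddpath="/var/lib/openqa/share/factory/hdd/fixed/SL-Micro.x86_64-6.0-Base-VMware-GM.vmdk"
file_basename=$(basename "$hddpath")

# hard-coded recalculation, as in the pre-fix code path (assumed layout):
recalculated="/vmfs/volumes/openqa/hdd/$file_basename"

echo "$recalculated"
# -> /vmfs/volumes/openqa/hdd/SL-Micro.x86_64-6.0-Base-VMware-GM.vmdk
#    (the fixed/ component is gone, so the cp cannot find the file)
```

Passing the full source path through instead of the basename, as the PR does, makes the recalculation unnecessary.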

Actions #18

Updated by mdati 4 months ago · Edited

All tests in SL Micro 6.0 Product Increments - Containers, and in other groups too, are currently failing, affected by an IBS repo renaming issue that causes install_updates to fail.

Ticket opened: https://progress.opensuse.org/issues/165536, but the issue is being handled in the named Jira ticket.

Moreover, in the latest builds a not-yet-understood behavior lets the bootloader_svirt.pm step pass even though the original image sits in the unhandled hdd/fixed/ subdirectory, which caused the error discussed in https://progress.opensuse.org/issues/162941#note-14. It could be that the image is already present in the destination folder as well.

Actions #19

Updated by mdati 4 months ago · Edited

With the IBS repo renaming issue recently resolved, almost all tests in group 572 pass, but 2 VMware tests still fail in podman netavark/skopeo/remote; for those issues I created https://progress.opensuse.org/issues/165884.

About the hdd/fixed/ issue in os-autoinst: PR 2524 has been updated, all code fixes reverted; instead, on-demand debugging was introduced in the NFS datastore script (via VMWARE_NFS_DATASTORE_DEBUG=1) to verify the image file status.

Actions #20

Updated by mdati 4 months ago · Edited

  • Checklist item Default-qcow-Updates set to Done
  • Checklist item Default-encrypted-Updates (x86_64 only) set to Done
  • Checklist item Base-qcow-Updates set to Done
  • Checklist item Base-encrypted-Updates (x86_64 only) set to Done
  • Checklist item Base-RT-Updates (x86_64 only) set to Done
  • Tags changed from slem, yaml to slem, yaml, vmware

Status today for SL Micro 6.0 Product Increments - Containers: all tests pass, except that the VMware-flavor tests fail on rerun.

The main issue turned out to be a form of slowness or lost key presses, blocking the screen until a needle timeout occurred: see poo#165923.
Those VMware tests are always assigned qesapworker# instances in Prague, even though sapworker# hosts in Nuremberg are also available.
So I executed a run forcing a Nuremberg worker, WORKER_CLASS="sapworker1,svirt-vmware70": https://openqa.suse.de/tests/15299487.
The test proceeded to the end, failing only on needle format differences. But after the needle was updated, the next rerun failed due to worker problems, and now all reruns on that worker fail this way:
https://openqa.suse.de/tests/15305429/logfile?filename=autoinst-log.txt#line-616

...
!!!! X64 Exception Type - 06(#UD - Invalid Opcode)  CPU Apic ID - 00000000 !!!!
RIP  - 0000000000000040, CS  - 0000000000000018, RFLAGS - 0000000000010247
RAX  - 000000005FC0E020, RCX - 000000005FC0E020, RDX - 000000005FC10EC8
RBX  - 000000005FC10EC8, RSP - 000000005FFBD7D8, RBP - 000000005FFBD830
RSI  - 000000005EB71120, RDI - 0000000000000031
R8   - 0000000000000004, R9  - 0000000000000001, R10 - 0000000000000000
R11  - 000000005EBF4140, R12 - 000000005FC10EC8, R13 - 000000005FD1DD98
R14  - 000000005FD8B818, R15 - 000000005EB71130
DS   - 0000000000000008, ES  - 0000000000000008, FS  - 0000000000000008
GS   - 0000000000000008, SS  - 0000000000000008
CR0  - 0000000080010033, CR2 - 0000000000000000, CR3 - 000000005FF98000
CR4  - 0000000000000668, CR8 - 0000000000000000
DR0  - 0000000000000000, DR1 - 0000000000000000, DR2 - 0000000000000000
DR3  - 0000000000000000, DR6 - 00000000FFFF0FF0, DR7 - 0000000000000400
GDTR - 00000000FFFFFCC0 000000000000002F, LDTR - 0000000000000000
IDTR - 000000005FEE6440 0000000000000FFF,   TR - 0000000000000000
FXSAVE_STATE - 000000005FFBD430
!!!! Can't find image information. !!!!
...

See the discussion in https://suse.slack.com/archives/C02CANHLANP/p1725012429713349; the problem seems to be inside the common host unreal7.qe.nue2.suse.org.

In summary, the VMware tests, depending on the workers they run on:

qesapworker-prg#: seem affected by random slowdowns or missed key actions: poo#165923;
sapworker#: since today, affected by a CPU issue on unreal7.qe.nue2.suse.org.

Actions #21

Updated by mdati 4 months ago

  • Related to action #165923: [qa-tools][vmware][spikesolution][timeboxed:20h] VNC reconnect after reboot size:S added
Actions #22

Updated by mdati 4 months ago

  • % Done changed from 100 to 80
Actions #23

Updated by mdati 4 months ago · Edited

Today all VMware tests in group 572 PASS.

In particular, using a WORKER_CLASS targeting unreal7 they all pass; see Slack.

But the tests assigned a qesapworker worker still fail because of issues on the esxi7 server they use; the engineering team opened a ticket for this: https://progress.opensuse.org/issues/166529.

Suggested workaround until that is fixed: run the VMware tests on workers using unreal7, i.e.
WORKER_CLASS=sapworker1,svirt-vmware70, WORKER_CLASS=unreal7,svirt-vmware70 or WORKER_CLASS=unreal7.
E.g. https://openqa.suse.de/tests/15390100 passes.

Actions #24

Updated by mdati 4 months ago · Edited

At the moment all VMware tests in https://openqa.suse.de/group_overview/572 pass.

Please note that in the current VMware test runs the issue from note-14/B is no longer present, because the image file is already present in the expected folder (transferred there by some unknown or manual operation), as also revealed by cloning the test with VMWARE_NFS_DATASTORE_DEBUG=1 from PR 2524. See the bash snippet in e.g. job 15400838.

But this could mean that the existing local image is always used, because it is never cleaned up, so new images from incoming builds are never tested.

A possible correction, as sequential changes:

  1. Ensure that the correct full-path image is provided as the source in _copy_image_vmware, as proposed in PR https://github.com/os-autoinst/os-autoinst/pull/2542.
  2. Define a lock-file policy for these VMware images (in place of the lsof check), to prevent cleanup while a running test is using them.
  3. Implement the cleanup from item 2 of note-14.
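The lock-file policy from item 2 could be sketched with flock(1); this is a hypothetical sketch, not the current os-autoinst implementation, and the file names are illustrative:

```shell
#!/bin/sh
# Hypothetical sketch of the lock-file policy from item 2: the copier holds an
# exclusive flock on <image>.lock while copying; cleanup removes the image only
# when it can take the lock, i.e. no running test holds it.
img="$(mktemp -u /tmp/SL-Micro-example-XXXXXX.vmdk)"
lock="$img.lock"

# copier side: hold the lock for the duration of the copy
exec 9>"$lock"
flock -x 9
touch "$img"        # stand-in for the real cp from the NFS datastore
flock -u 9

# cleanup side: delete the image only if the lock is currently free;
# -n makes flock fail immediately instead of blocking
exec 8>"$lock"
if flock -n -x 8; then
    rm -f "$img" "$lock"
fi
```

Compared to the current lsof polling loop, the lock is explicit and race-free: a test that is still copying or using the image holds the flock, so cleanup skips it instead of grepping process tables.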
Actions #25

Updated by mdati 4 months ago · Edited

  • Checklist item Default-VMware-Updates (x86_64 only) set to Done
  • Checklist item Base-VMware-Updates (x86_64 only) set to Done
  • Status changed from In Progress to Feedback

Confirming that at the moment all VMware tests in https://openqa.suse.de/group_overview/572 pass; with the problems in https://progress.opensuse.org/issues/165884 also resolved, the requests in note-10 are addressed and completed.

The only topic still open is the one in note-24, addressed by the three proposed points; since it is a pre-existing situation, it can be handled in a dedicated new ticket, to be created soon.

Actions #26

Updated by ph03nix 4 months ago

  • Related to action #166748: [MinimalVM] VMware images not handling hdd subfolders added
Actions #27

Updated by mdati 3 months ago

  • Status changed from Feedback to Resolved
Actions #28

Updated by mdati 3 months ago

  • % Done changed from 80 to 100
Actions #29

Updated by mdati 3 months ago

Unscheduled VMware for the SLEM product increments; MR https://gitlab.suse.de/qac/qac-openqa-yaml/-/merge_requests/1842 merged.

Actions #30

Updated by ph03nix 2 months ago

  • Tags changed from slem, yaml, vmware to containers