action #162941

open

Add job group definitions for SLEM 6.0 to QAC-yaml

Added by ph03nix 2 months ago. Updated 9 days ago.

Status: In Progress
Priority: High
Assignee:
Target version: -
Start date: 2024-06-27
Due date:
% Done: 80%
Estimated time:
Description

https://openqa.suse.de/group_overview/566 is a prototype of the upcoming maintenance setup for SLEM 6.0. We need to create a job group definition for this job group in our https://gitlab.suse.de/qac/qac-openqa-yaml/

I think a new file staging-slem6_0.yaml in https://gitlab.suse.de/qac/qac-openqa-yaml/-/tree/master/sle-micro would fit nicely

Acceptance criteria


Checklist

  • Default-qcow-Updates
  • Default-encrypted-Updates (x86_64 only)
  • Default-VMware-Updates (x86_64 only)
  • Base-qcow-Updates
  • Base-encrypted-Updates (x86_64 only)
  • Base-VMware-Updates (x86_64 only)
  • Base-RT-Updates (x86_64 only)

Related issues 1 (1 open, 0 closed)

Related to openQA Project - action #165923: [qa-tools][vmware][spikesolution][timeboxed:20h] VNC reconnect after reboot size:S (Workable, 2024-08-28)

Actions #1

Updated by ph03nix 2 months ago

  • Parent task set to #159828
Actions #2

Updated by mdati 2 months ago

  • Assignee set to mdati
Actions #3

Updated by mdati 2 months ago

  • Status changed from Workable to In Progress
Actions #5

Updated by mdati 2 months ago

  • Tags set to slem, yaml
  • Status changed from In Progress to Feedback

ACs are OK; no issue to fix at the moment.

Actions #6

Updated by mdati 2 months ago

  • Status changed from Feedback to Resolved
Actions #7

Updated by ph03nix about 2 months ago

  • Status changed from Resolved to In Progress
  • Assignee changed from mdati to ph03nix

Reopening, as the product increments (https://openqa.suse.de/group_overview/572) are still to be done.

Actions #9

Updated by ph03nix about 2 months ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

https://openqa.suse.de/admin/job_templates/572 is now populated and under our control.

Actions #10

Updated by mdati about 1 month ago · Edited

  • Status changed from Resolved to In Progress
  • Assignee changed from ph03nix to mdati

Poo reopened following discussion in Slack, to cover Product Increments for SL Micro 6.0, including the VMware and encrypted flavors.

See https://progress.opensuse.org/issues/159828#note-27

Actions #11

Updated by mdati about 1 month ago

  • Checklist item Default-qcow-Updates added
  • Checklist item Default-encrypted-Updates (x86_64 only) added
  • Checklist item Default-VMware-Updates (x86_64 only) added
  • Checklist item Base-qcow-Updates added
  • Checklist item Base-encrypted-Updates (x86_64 only) added
  • Checklist item Base-VMware-Updates (x86_64 only) added
  • Checklist item Base-RT-Updates (x86_64 only) added
Actions #12

Updated by mdati about 1 month ago

Created MR https://gitlab.suse.de/qac/qac-openqa-yaml/-/merge_requests/1763,
for all products/flavors in checklist.

Actions #13

Updated by mdati about 1 month ago

MR 1763 Merged.

Some logic errors were fixed in a new MR https://gitlab.suse.de/qac/qac-openqa-yaml/-/merge_requests/1766,
also merged.

See tests in last build of https://openqa.suse.de/group_overview/572

Actions #14

Updated by mdati about 1 month ago · Edited

Today a new error affected the Base-VMware-Updates tests in SL Micro 6.0 Product Increments - Containers: they failed in the boot phase because the SL-Micro.x86_64-6.0-Base-VMware-GM.vmdk image is located in hdd/fixed.

See the analysis below.

Findings on the VMware boot error, from comparing the autoinst logs of cases A and B:

(A) boot PASS:

https://openqa.suse.de/tests/15036824/logfile?filename=autoinst-log.txt

run_ssh_cmd(if test -e /vmfs/volumes/datastore1/openQA/SL-Micro.x86_64-6.0-Base-VMware-GM.vmdk; then while lsof | grep 'cp.*SL-Micro.x86_64-6.0-Base-VMware-GM.vmdk'; do echo File SL-Micro.x86_64-6.0-Base-VMware-GM.vmdk is being copied by other process, sleeping for 60 seconds; sleep 60;done;else cp /vmfs/volumes/openqa/hdd/SL-Micro.x86_64-6.0-Base-VMware-GM.vmdk /vmfs/volumes/datastore1/openQA/;fi;)] exit-code: 0

and

(B) boot FAIL:

https://openqa.suse.de/tests/15040545/logfile?filename=autoinst-log.txt

run_ssh_cmd(if test -e /vmfs/volumes/Datastore2/openQA/SL-Micro.x86_64-6.0-Base-VMware-GM.vmdk; then while lsof | grep 'cp.*SL-Micro.x86_64-6.0-Base-VMware-GM.vmdk'; do echo File SL-Micro.x86_64-6.0-Base-VMware-GM.vmdk is being copied by other process, sleeping for 60 seconds; sleep 60;done;else cp /vmfs/volumes/openqa/hdd/SL-Micro.x86_64-6.0-Base-VMware-GM.vmdk /vmfs/volumes/Datastore2/openQA/;fi;)] stderr:
  cp: can't stat '/vmfs/volumes/openqa/hdd/SL-Micro.x86_64-6.0-Base-VMware-GM.vmdk': No such file or directory
... exit-code: 1

The .vmdk file exists in ./hdd/fixed in both cases A and B, but:

  • in B, the script does not find SL-Micro.x86_64-6.0-Base-VMware-GM.vmdk in VMWARE_DATASTORE, so the else branch triggers a cp from ./hdd/; the file is not there and the copy fails;

  • in A, the script apparently finds the .vmdk already in VMWARE_DATASTORE, probably left over from earlier runs, so the (wrong) cp is not executed and the error is avoided.

The preceding cleanup should have removed the .vmdk image in A, but its expression does not match it: it expects some additional string such as "openQA-SUT" in the file name, so the plain SL-Micro.x86_64-6.0-Base-VMware-GM.vmdk file is not removed from VMWARE_DATASTORE. This cleanup runs before the if shown above.

The cleanup for B also ran, but some files were locked, so it did not delete them. This could explain why the .vmdk in A was already present.
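
For readability, the one-liner quoted from the logs can be unrolled as below. This is only a sketch of the same logic: the function name and its three parameters are hypothetical and stand in for the literal datastore path, the hdd/ path and the image name seen in the logs.

```shell
# Unrolled form of the copy step from the autoinst logs.
# copy_image_if_missing and its parameters are illustrative names, not os-autoinst code.
copy_image_if_missing() {
    datastore_dir=$1   # e.g. /vmfs/volumes/datastore1/openQA
    source_dir=$2      # e.g. /vmfs/volumes/openqa/hdd
    image=$3           # e.g. SL-Micro.x86_64-6.0-Base-VMware-GM.vmdk
    if test -e "$datastore_dir/$image"; then
        # Case A: image already on the datastore, only wait for a copy still in flight.
        while lsof 2>/dev/null | grep "cp.*$image"; do
            echo "File $image is being copied by other process, sleeping for 60 seconds"
            sleep 60
        done
    else
        # Case B: image absent, copy from hdd/ only; hdd/fixed/ is never consulted.
        cp "$source_dir/$image" "$datastore_dir/"
    fi
}
```

With the image present only in hdd/fixed, the first branch returns 0 without touching the source, while the second branch fails in cp, which matches the A and B outcomes above.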

Proposed solutions, by priority:

1) In _copy_image_vmware, cp should fall back to hdd/fixed when the file is not found in hdd.

2) Cleanup should also remove ${file_basename}:

   ...
   rm -f ${vmware_openqa_datastore}*${name}* \\
      ${vmware_openqa_datastore}*${file_basename}

3) Clarify/normalize image management in the test code, consistently for both hdd and hdd/fixed.
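
Items 1 and 2 could look roughly like the sketch below. It is not the os-autoinst implementation: resolve_image_source and cleanup_datastore are hypothetical helper names, and the directory layout (hdd/ and hdd/fixed/ under one NFS root) is assumed from the paths in the logs.

```shell
# Item 1 (sketch): print the first existing source path for an image,
# trying hdd/ first and then hdd/fixed/. Returns 1 if neither exists.
resolve_image_source() {
    nfs_root=$1   # assumed layout, e.g. /vmfs/volumes/openqa
    image=$2      # e.g. SL-Micro.x86_64-6.0-Base-VMware-GM.vmdk
    for candidate in "$nfs_root/hdd/$image" "$nfs_root/hdd/fixed/$image"; do
        if [ -e "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Item 2 (sketch): remove both the per-VM files matching *name* and the
# plain image copy matching *file_basename from the datastore directory.
cleanup_datastore() {
    datastore_dir=$1
    name=$2            # e.g. an "openQA-SUT..." VM name
    file_basename=$3
    rm -f "$datastore_dir"/*"$name"* "$datastore_dir"/*"$file_basename"
}
```

The copy step would then use the path printed by resolve_image_source instead of a hard-coded hdd/ location, and fail only when the image is in neither directory.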

Actions #15

Updated by mdati 29 days ago

For item 1 above, created PR 2524 on Aug 4.

Actions #16

Updated by mdati 29 days ago

Activities temporarily paused; they will be resumed soon.

Actions #17

Updated by mdati 19 days ago · Edited

Activity restarted:

Noted in the os-autoinst code flow: the add_disk($self, $args) routine, which calls _copy_image_to_vm_host($args,...), also receives in $args the full path of the original image file (possibly located in a subfolder like fixed/), coming from the bootloader_svirt.pm call (see my $hddpath); but that full path is then never used and is eventually lost.

In fact, the internal call to _copy_image_vmware(...,$file_basename,...) extracts and passes only the basename, and inside that routine the original image path is hard-coded and partially recomputed, without any subfolder handling.
Therefore images in subfolders of hdd/ (or iso/) are not handled correctly by the copy commands.

In the latest PR 2524 update, I replaced the file_basename input parameter of _copy_image_to_vm_host() and of the inner _copy_image_vmware() with the full path of the source file, passed via add_disk's $args and coming (only) from the bootloader_svirt.pm settings at runtime (or similarly from bootloader_zkvm.pm). This allows images in subfolders like fixed/ to be handled and avoids the unneeded recomputation of the original folder.
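
The difference between the two approaches can be illustrated in shell, although the real code is Perl. Both function names and all paths below are hypothetical and serve only to show why the basename alone loses the fixed/ subfolder.

```shell
# Behaviour described above (sketch): only the basename survives, so the
# source path is rebuilt directly under the NFS hdd/ directory.
source_from_basename() {
    nfs_dir=$1     # e.g. /vmfs/volumes/openqa/hdd
    full_path=$2   # original image path as known to the test
    printf '%s/%s\n' "$nfs_dir" "$(basename "$full_path")"
}

# Alternative (sketch): keep the part of the path below the asset directory,
# so a subfolder such as fixed/ is preserved.
source_from_full_path() {
    nfs_dir=$1
    asset_dir=$2   # local hdd/ directory the full path starts with
    full_path=$3
    printf '%s/%s\n' "$nfs_dir" "${full_path#"$asset_dir"/}"
}
```

For an image stored as <asset_dir>/fixed/img.vmdk, the first function yields <nfs_dir>/img.vmdk, which does not exist, while the second yields <nfs_dir>/fixed/img.vmdk.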

Actions #18

Updated by mdati 17 days ago · Edited

All tests in SL Micro 6.0 Product Increments - Containers, and in other groups too, are currently failing because of an IBS repo renaming issue that makes install_updates fail.

Poo opened: https://progress.opensuse.org/issues/165536, but the issue is handled in the Jira ticket named there.

Moreover, in the latest builds a not-yet-understood behavior lets the bootloader_svirt.pm step pass even though the original image sits in the unmanaged hdd/fixed/ subdirectory, which caused the error discussed in https://progress.opensuse.org/issues/162941#note-14. It could be that the image is already present in the destination folder.

Actions #19

Updated by mdati 12 days ago · Edited

Now that the IBS repo renaming issue is resolved, almost all tests in group 572 pass, but 2 VMware tests still fail in podman netavark/skopeo/remote: for those issues I created poo https://progress.opensuse.org/issues/165884.

Regarding the hdd/fixed/ issue in os-autoinst, PR 2524 has been updated: all code fixes were reverted and only on-demand debugging was added to the NFS datastore script, to verify the image file status.

Actions #20

Updated by mdati 9 days ago · Edited

  • Checklist item Default-qcow-Updates set to Done
  • Checklist item Default-encrypted-Updates (x86_64 only) set to Done
  • Checklist item Base-qcow-Updates set to Done
  • Checklist item Base-encrypted-Updates (x86_64 only) set to Done
  • Checklist item Base-RT-Updates (x86_64 only) set to Done
  • Tags changed from slem, yaml to slem, yaml, vmware

Status today for SL Micro 6.0 Product Increments - Containers: all tests pass except the VMware flavors, which fail on rerun.

The main issue turned out to be some form of slowness or lost key presses, blocking the screen until the needle timeout occurred: see poo 165923.
However, those VMware tests are always assigned qesapworker# instances in Prg, even though other sapworker# hosts in Nue are also available.
So I ran a job forcing a worker in Nue, WORKER_CLASS="sapworker1,svirt-vmware70": https://openqa.suse.de/tests/15299487.
The test ran to the end, failing on needle format differences. But after the needles were updated, the next rerun failed due to worker problems, and now all reruns on that worker fail this way:
https://openqa.suse.de/tests/15305429/logfile?filename=autoinst-log.txt#line-616

...
!!!! X64 Exception Type - 06(#UD - Invalid Opcode)  CPU Apic ID - 00000000 !!!!
RIP  - 0000000000000040, CS  - 0000000000000018, RFLAGS - 0000000000010247
RAX  - 000000005FC0E020, RCX - 000000005FC0E020, RDX - 000000005FC10EC8
RBX  - 000000005FC10EC8, RSP - 000000005FFBD7D8, RBP - 000000005FFBD830
RSI  - 000000005EB71120, RDI - 0000000000000031
R8   - 0000000000000004, R9  - 0000000000000001, R10 - 0000000000000000
R11  - 000000005EBF4140, R12 - 000000005FC10EC8, R13 - 000000005FD1DD98
R14  - 000000005FD8B818, R15 - 000000005EB71130
DS   - 0000000000000008, ES  - 0000000000000008, FS  - 0000000000000008
GS   - 0000000000000008, SS  - 0000000000000008
CR0  - 0000000080010033, CR2 - 0000000000000000, CR3 - 000000005FF98000
CR4  - 0000000000000668, CR8 - 0000000000000000
DR0  - 0000000000000000, DR1 - 0000000000000000, DR2 - 0000000000000000
DR3  - 0000000000000000, DR6 - 00000000FFFF0FF0, DR7 - 0000000000000400
GDTR - 00000000FFFFFCC0 000000000000002F, LDTR - 0000000000000000
IDTR - 000000005FEE6440 0000000000000FFF,   TR - 0000000000000000
FXSAVE_STATE - 000000005FFBD430
!!!! Can't find image information. !!!!
...

See the discussion in https://suse.slack.com/archives/C02CANHLANP/p1725012429713349; the problem seems to lie in the shared host unreal7.qe.nue2.suse.org.

In summary, VMware tests behave as follows depending on the worker they run on:

  • qesapworker-prg# workers seem affected by random slowness or missed key presses: poo#165923;
  • sapworker# workers have been affected since today by a CPU issue in unreal7.qe.nue2.suse.org.

Actions #21

Updated by mdati 9 days ago

  • Related to action #165923: [qa-tools][vmware][spikesolution][timeboxed:20h] VNC reconnect after reboot size:S added
Actions #22

Updated by mdati 9 days ago

  • % Done changed from 100 to 80