action #137408

closed

QA - coordination #121720: [saga][epic] Migration to QE setup in PRG2+NUE3 while ensuring availability

QA - coordination #123800: [epic] Provide SUSE QE Tools services running in PRG2 aka. Prg CoLo

Support move of s390x mainframe(s) to PRG2 - o3 size:M

Added by okurz 7 months ago. Updated 5 months ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version:
Start date: 2023-06-29
Due date:
% Done: 0%
Estimated time:

Description

Motivation

s390x mainframe(s) are being moved to PRG2 with the help of IBM. We need to support the process and help to bring back any QE related LPARs or VMs on those mainframes to be able to use them as part of openQA.

Acceptance criteria

  • AC1: s390x openQA tests on openqa.opensuse.org are able to successfully execute tests

Suggestions

  • Follow s390x mainframe related moving coordination as referenced in #100455#Important-documents
  • Read #132152 about what was done in the area of o3
  • Update access details in https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls where necessary
  • See last worker definitions in #134912-34 and crosscheck, e.g. see openqa.opensuse.org/admin/workers/
  • Ask in particular mgriessmeier+gschlotter for collaboration – if not the assignee themselves ;)
  • Ensure we have access to machines manually as well as with verification openQA jobs, both for o3+osd
  • Ensure we work with proper FQDNs where possible, not IPv4 mess :)
  • React to any user reports about not working or missing s390x tests/VMs/LPARs, etc.
  • the openqa.opensuse.org openQA machine setting for "s390x-zVM-vswitch-l2" has REPO_HOST=192.168.112.100 and other references to 192.168.112. This needs to be changed as soon as zVM instances are able to reach new-ariel internally, e.g. over FTP
  • Inform users about the result
  • Ensure that https://openqa.opensuse.org/group_overview/34 shows no obvious incompletes/fails/stuck jobs, currently no new build for some days

Related issues 3 (0 open, 3 closed)

Related to openQA Infrastructure - action #133364: Migration of o3 VM to PRG2 - Decommission old-ariel in NUE1 as soon as we do not need it anymore (Resolved, okurz)

Related to openQA Tests - action #153057: [tools] test fails in bootloader_start because openQA can not boot for s390x size:M (Resolved, mkittler, 2024-01-03)

Has duplicate openQA Infrastructure - action #139307: openQA for s390x does not work at the moment (Rejected, okurz, 2023-11-12)

Actions #2

Updated by okurz 7 months ago

  • Subject changed from Support move of s390x mainframe(s) to PRG2 - o3 to Support move of s390x mainframe(s) to PRG2 - o3 size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #3

Updated by okurz 7 months ago

  • Related to action #133364: Migration of o3 VM to PRG2 - Decommission old-ariel in NUE1 as soon as we do not need it anymore added
Actions #4

Updated by okurz 6 months ago

  • Target version changed from Ready to Tools - Next
Actions #5

Updated by okurz 6 months ago

  • Has duplicate action #139307: openQA for s390x does not work at the moment added
Actions #6

Updated by okurz 6 months ago

  • Project changed from 46 to openQA Infrastructure
  • Category deleted (Infrastructure)
  • Priority changed from Normal to High
Actions #7

Updated by mgriessmeier 6 months ago

just to not forget: in theory the o3 setup has progressed as far as o.s.d; only the last basic test failed, as the LPAR s390zl11 was not pingable from o.o.o, which would need to be investigated by SUSE IT

Actions #8

Updated by okurz 6 months ago

  • Target version changed from Tools - Next to Ready

getting more and more questions, e.g. from DimStar today in irc://irc.libera.chat/opensuse-factory

[13/11/2023 17:12:45] okurz_: now as the latest s390x snapshot is in openQA we do seem to have an issue there: no worker available to pick up the jobs; is that expected?

hence adding to our backlog

Actions #9

Updated by okurz 6 months ago

According to mgriessmeier, in the o3 network the s390x mainframe zVM interface should be reachable at 10.150.1.41.

mgriessmeier will ask gschlotter to double check the vswitch config related to that IP and whether it should be reachable from o3 (10.150.1.11/24). Ideally find out the corresponding L2 address as well.
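
A minimal sketch of the basic reachability test mentioned above, assuming Linux iputils ping and a shell on a host inside the o3 network. The function name check_reachable is made up for illustration.

```shell
# Sketch: report whether a host answers a single ICMP echo request.
# -c 1 sends one packet, -W 2 waits at most two seconds for the reply
# (iputils ping semantics; other ping implementations may differ).
check_reachable() {
    if ping -c 1 -W 2 "$1" >/dev/null 2>&1; then
        echo "$1 reachable"
    else
        echo "$1 NOT reachable"
        return 1
    fi
}

# Example, to be run from o3 or a worker in 10.150.1.0/24:
#   check_reachable 10.150.1.41
```

A failing ping only shows that ICMP does not get through; the vswitch config or a firewall could both be the cause.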

Actions #10

Updated by mgriessmeier 6 months ago

update:

  • connection from ariel (and the whole 10.150.1.x range) to s390zl11 (10.150.1.41) is working now
  • two o3 guests have been created
  • modified existing worker on o3 (openqaworker23:1)
  • started first job https://openqa.opensuse.org/tests/3731839#live with openqa-clone-job --skip-chained-deps --within-instance https://openqa.opensuse.org/tests/3588513 WORKER_CLASS=s390x-zVM_poo137408 _GROUP=0 BUILD=poo137408 TEST+=-poo137408
Actions #11

Updated by mgriessmeier 6 months ago

it used the old MACHINE settings since it was a cloned job, so I triggered another one while explicitly overriding the MACHINE settings:

openqa-clone-job --skip-chained-deps --within-instance https://openqa.opensuse.org/tests/3588513 WORKER_CLASS=s390x-zVM_poo137408 _GROUP=0 BUILD=poo137408 TEST+=-poo137408 REPO_HOST=10.150.1.11 S390_NETWORK_PARAMS="OSAMedium=eth OSAInterface=qdio InstNetDev=osa HostIP=10.150.1.150 Hostname=o3zvm001 Gateway=10.150.1.254 Nameserver=10.151.53.53 Domain=openqanet.opensuse.org PortNo= Layer2=0 ReadChannel=0.0.0A00 WriteChannel=0.0.0A01 DataChannel=0.0.0A02 OSAHWAddr="

-> https://openqa.opensuse.org/tests/3731858
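
The long S390_NETWORK_PARAMS value is easy to get wrong when typed by hand. A minimal sketch, assuming the fixed OSA and channel settings from the command above stay the same, that builds the string from the variable parts. The function name and argument order are hypothetical.

```shell
# Sketch: build the S390_NETWORK_PARAMS value from its variable parts.
# Arguments: host IP, hostname, gateway, nameserver, domain.
# The fixed fields are copied from the clone-job command in this comment.
build_s390_network_params() {
    host_ip=$1 hostname=$2 gateway=$3 nameserver=$4 domain=$5
    printf '%s' "OSAMedium=eth OSAInterface=qdio InstNetDev=osa"
    printf ' HostIP=%s Hostname=%s Gateway=%s' "$host_ip" "$hostname" "$gateway"
    printf ' Nameserver=%s Domain=%s' "$nameserver" "$domain"
    printf ' PortNo= Layer2=0 ReadChannel=0.0.0A00 WriteChannel=0.0.0A01'
    printf ' DataChannel=0.0.0A02 OSAHWAddr=\n'
}

params=$(build_s390_network_params 10.150.1.150 o3zvm001 10.150.1.254 \
    10.151.53.53 openqanet.opensuse.org)
echo "S390_NETWORK_PARAMS=\"$params\""
```

The resulting value can be passed to openqa-clone-job as S390_NETWORK_PARAMS="$params".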

Actions #12

Updated by nicksinger 6 months ago

we had some issues with apparmor which caused icewm (containing the xterm-console) to fail. Apparently the apparmor-profile package was missing which we fixed by just installing the apparmor pattern on worker23

Actions #13

Updated by mgriessmeier 6 months ago

the core infrastructure part is done, mainframe is able to start the installer -> https://openqa.opensuse.org/tests/3733312

This still shows an error which is most likely caused by a slow network (or, less likely, a DNS/firewall issue, though the service is reachable from the worker) and would need more investigation

still to do:

  • currently there is only one instance hardcoded as the old approach with substitution of @S390_HOST@ does not work anymore, so refactoring in the same way as on o.s.d is needed - see https://progress.opensuse.org/issues/132152#note-62 for details
  • trigger enough verification runs to get statistical stability
  • adapt MACHINE definition accordingly after the refactoring mentioned above

created 50 jobs with

for i in {1..50}; do openqa-clone-job --skip-chained-deps --within-instance https://openqa.opensuse.org/tests/3718373 WORKER_CLASS=s390x-zVM_poo137408 _GROUP=0 BUILD=poo137408 TEST+=-poo137408-$i REPO_HOST=10.150.1.11 S390_NETWORK_PARAMS="OSAMedium=eth OSAInterface=qdio InstNetDev=osa HostIP=10.150.1.150/24 Hostname=o3zvm001.openqanet.opensuse.org Gateway=10.150.1.254 Nameserver=10.151.1.11 Domain=openqanet.opensuse.org PortNo= Layer2=0 ReadChannel=0.0.0A00 WriteChannel=0.0.0A01 DataChannel=0.0.0A02 OSAHWAddr="; done

-> https://openqa.opensuse.org/tests/overview?build=poo137408&version=Tumbleweed&distri=opensuse
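
To judge the stability of such a batch, the results need to be tallied. A minimal sketch, assuming the job results have already been exported as plain text with one result per line (how they are exported is left open here). The function name is hypothetical.

```shell
# Sketch: summarise verification results read from stdin, one per line
# (e.g. "passed", "failed", "incomplete"). Blank lines are ignored.
summarise_results() {
    awk '
        NF { total++; count[$1]++ }
        END {
            if (total == 0) { print "no results"; exit 1 }
            printf "passed %d/%d (%d%%)\n", count["passed"], total, count["passed"] * 100 / total
        }'
}

printf 'passed\npassed\nfailed\npassed\n' | summarise_results   # prints "passed 3/4 (75%)"
```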

Actions #14

Updated by mgriessmeier 5 months ago

nicksinger wrote in #note-12:

we had some issues with apparmor which caused icewm (containing the xterm-console) to fail. Apparently the apparmor-profile package was missing which we fixed by just installing the apparmor pattern on worker23

hmm seems like our temporary changes to apparmor were reverted somehow/by someone -> https://openqa.opensuse.org/tests/3735119#step/bootloader_s390/33
as I understood this is not acceptable anyway and the worker should go either in a container or on a separate host, please move it accordingly :)

However, the infrastructure to run the tests is in place, there seems to be an issue with ntp which needs to be investigated further, I assume it's a timeout but haven't verified it -> https://openqa.opensuse.org/tests/3733329#step/partitioning_finish/2

besides that, next steps would be:

  • schedule a bunch of verification jobs after the worker changes have been made (ideally over the weekend)
  • apply the same refactoring as to be done for the o.s.d. workers in #132152
Actions #15

Updated by okurz 5 months ago

Regarding s390 in containers please see https://progress.opensuse.org/projects/openqav3/wiki/Wiki#o3-s390-workers . As we discussed we see the main infrastructure network configuration in good shape. There might be specific individual network services still blocked. This would need to be solved on a case-by-case basis, based on openQA test results.

mgriessmeier wrote in #note-14:

However, the infrastructure to run the tests is in place, there seems to be an issue with ntp which needs to be investigated further, I assume it's a timeout but haven't verified it -> https://openqa.opensuse.org/tests/3733329#step/partitioning_finish/2

For that I suggest using the interactive mode of openQA: pause at the partitioning module and check whether the connection can be established at all. I suspect a firewall is blocking the communication.

Actions #16

Updated by mgriessmeier 5 months ago

okurz wrote in #note-15:

Regarding s390 in containers please see https://progress.opensuse.org/projects/openqav3/wiki/Wiki#o3-s390-workers . As we discussed we see the main infrastructure network configuration in good shape. There might be specific individual network services still blocked. This would need to be solved on a case-by-case basis, based on openQA test results.

Installed openqaworker23_container_101 in openqaworker23 with the following steps:

  • created directory /opt/s390x_opensuse
  • copied /etc/openqa/client.conf and /etc/openqa/workers.ini from openqaworker23 itself
  • chmod 644 /opt/s390x_opensuse/client.conf in order to be readable by the container

  • modify /opt/s390x_opensuse/workers.ini:

[global]
HOST = https://openqa.opensuse.org
WORKER_HOSTNAME = openqaworker23
CACHEDIRECTORY = /var/lib/openqa/cache
CACHESERVICEURL=http://10.150.1.26:9530/
CACHELIMIT = 400


[101]
WORKER_CLASS = s390x-zVM_poo137408
s390_HOST = '001'
ZVM_GUEST = o3zvm001
ZVM_HOST = 10.150.1.41
ZVM_PASSWORD =
  • override cache service (systemctl edit openqa-worker-cacheservice.service):
# /etc/systemd/system/openqa-worker-cacheservice.service.d/override.conf
[Service]
Environment="MOJO_LISTEN=http://0.0.0.0:9530"
  • podman run -d -h openqaworker23_container --name openqaworker23_container_101 -p $(python3 -c"p=101*10+20003;print(f'{p}:{p}')") -e OPENQA_WORKER_INSTANCE=101 -v /opt/s390x_opensuse:/etc/openqa -v /var/lib/openqa/share:/var/lib/openqa/share -v /var/lib/openqa/cache:/var/lib/openqa/cache registry.opensuse.org/devel/openqa/containers15.5/openqa_worker_os_autoinst_distri_opensuse:latest

  • (cd /etc/systemd/system/; podman generate systemd -f -n --new openqaworker23_container_101 --restart-policy always)

  • systemctl enable container-openqaworker23_container_101.service
    Created symlink /etc/systemd/system/default.target.wants/container-openqaworker23_container_101.service → /etc/systemd/system/container-openqaworker23_container_101.service.

  • one presumed bug I found: openssh was missing from the container (see https://openqa.opensuse.org/tests/3736877#step/bootloader_s390/33) -> installed manually with podman exec -it openqaworker23_container_101 zypper in openssh
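
The port mapping in the podman run step is computed with an inline python3 expression (instance * 10 + 20003). A minimal sketch of the same arithmetic in plain shell, in case python3 is not at hand; the function name worker_port is made up for illustration.

```shell
# Sketch: derive the published port from the worker instance number,
# mirroring the python3 expression p = instance * 10 + 20003.
worker_port() {
    echo $(( $1 * 10 + 20003 ))
}

instance=101
port=$(worker_port "$instance")
echo "-p ${port}:${port}"   # prints "-p 21013:21013"
```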

https://openqa.opensuse.org/tests/3733329#step/partitioning_finish/2

For that I suggest using the interactive mode of openQA: pause at the partitioning module and check whether the connection can be established at all. I suspect a firewall is blocking the communication.

I tried once, but it didn't pause. However, I logged in before that step and noticed that DNS resolution didn't work at all. Changed the nameserver to 10.150.2.10 -> https://openqa.opensuse.org/tests/3737257

Actions #18

Updated by okurz 5 months ago

On w23 in /opt/s390x_opensuse I added

ZVM_HOST = s390zl11.openqanet.opensuse.org

to fix https://openqa.opensuse.org/tests/3749284/. Retriggered as https://openqa.opensuse.org/tests/3749340

Actions #19

Updated by okurz 5 months ago

  • Status changed from Workable to In Progress

As https://openqa.opensuse.org/tests/3749350 progressed far enough, mgriessmeier and I enabled four instances using s3270, added corresponding entries in o3:/etc/hosts and then generated four openQA worker systemd services on w23.

@mgriessmeier after today please review again if that still looks fine and then resolve.
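
With four instances, the per-instance sections of workers.ini can be generated rather than typed. A minimal sketch, assuming instance 10N maps to guest o3zvm00N (as it does for instance 101) and that the guests live in the openqanet.opensuse.org domain. The function name is hypothetical.

```shell
# Sketch: print workers.ini instance sections for a range of z/VM guests.
# Arguments: first guest number, last guest number, DNS domain.
# Assumption: instance number = 100 + guest number.
generate_instance_sections() {
    first=$1 last=$2 domain=$3
    n=$first
    while [ "$n" -le "$last" ]; do
        printf '[%d]\nZVM_GUEST = o3zvm%03d.%s\n\n' "$((100 + n))" "$n" "$domain"
        n=$((n + 1))
    done
}

generate_instance_sections 1 4 openqanet.opensuse.org
```

The output is meant to be reviewed and appended to the [global] section by hand, not written to workers.ini blindly.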

Actions #20

Updated by openqa_review 5 months ago

  • Due date set to 2023-12-07

Setting due date based on mean cycle time of SUSE QE Tools

Actions #21

Updated by okurz 5 months ago · Edited

There are still some old o3 jobs scheduled with old worker class "s390x-zVM-vswitch-l2", not the new one which is just "s390x-zVM". We could just ignore those. But I opted to add that back to the worker class config of all workers so we have now in w23:/opt/s390x_opensuse/workers.ini

[global]
…
WORKER_CLASS = s390x-zVM,s390x-zVM-vswitch-l2,prg,prg2,prg2-j12
ZVM_PASSWORD = …
ZVM_HOST = s390zl11.openqanet.opensuse.org

[101]
ZVM_GUEST = o3zvm001.openqanet.opensuse.org

[102]
ZVM_GUEST = o3zvm002.openqanet.opensuse.org

[103]
ZVM_GUEST = o3zvm003.openqanet.opensuse.org

[104]
ZVM_GUEST = o3zvm004.openqanet.opensuse.org

and then systemctl restart container-openqaworker23_container_10{1..4}.service

That led to problems: reverse resolution did not work for o3zvm003.openqanet.opensuse.org and o3zvm004.openqanet.opensuse.org, but did work for o3zvm001 and o3zvm002, which we already had in o3:/etc/hosts. I restarted dnsmasq on o3 and restarted https://openqa.opensuse.org/tests/3751739 leading to https://openqa.opensuse.org/tests/3751741
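
A minimal sketch for checking forward and reverse resolution of a guest name before restarting workers, assuming getent is available and the name is given as FQDN. The function name is made up for illustration.

```shell
# Sketch: check that a name resolves to an IP and that the IP resolves
# back to the same name, using getent (which honours /etc/hosts and DNS).
check_dns_roundtrip() {
    name=$1
    ip=$(getent hosts "$name" | awk 'NR == 1 { print $1 }')
    if [ -z "$ip" ]; then
        echo "$name: forward lookup failed"
        return 1
    fi
    reverse=$(getent hosts "$ip" | awk 'NR == 1 { print $2 }')
    if [ "$reverse" != "$name" ]; then
        echo "$name: reverse lookup of $ip gave '${reverse:-nothing}'"
        return 1
    fi
    echo "$name: ok ($ip)"
}

# Example:
#   for n in 1 2 3 4; do
#       check_dns_roundtrip "o3zvm00$n.openqanet.opensuse.org"
#   done
```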

Actions #22

Updated by okurz 5 months ago

Actions #23

Updated by mgriessmeier 5 months ago

okurz wrote in #note-22:

Not clear about https://openqa.opensuse.org/tests/3751745#step/bootloader_s390/29 . Please take a look

hmm must have missed this...

well, looks like it's not booting within 300s and there is no output after network config created... but I don't see any obvious failure

Actions #24

Updated by mgriessmeier 5 months ago

  • Status changed from In Progress to Feedback

I was able to reproduce the issue on o3zvm003 and o3zvm004, but not on 001 or 002... which led me to have a look into the z/VM internal network config. I found that o3zvm003 and o3zvm004 were simply not configured at all in the vswitch, hence failing to establish a network connection. Added them, restarted z/VM and triggered 4 parallel jobs to verify all 4 workers: https://openqa.opensuse.org/tests/overview?distri=opensuse&version=Tumbleweed&build=poo137408_network-test

Actions #25

Updated by mgriessmeier 5 months ago

  • Status changed from Feedback to Resolved

the 4 tests are passing, and the live test of the production build also looks fine: https://openqa.opensuse.org/tests/3767882#live
so I will close this

Actions #26

Updated by okurz 5 months ago

  • Due date deleted (2023-12-07)
Actions #27

Updated by okurz 4 months ago

  • Related to action #153057: [tools] test fails in bootloader_start because openQA can not boot for s390x size:M added