action #137408
QA - coordination #121720: [saga][epic] Migration to QE setup in PRG2+NUE3 while ensuring availability
QA - coordination #123800: [epic] Provide SUSE QE Tools services running in PRG2 aka. Prg CoLo
Support move of s390x mainframe(s) to PRG2 - o3 size:M
Description
Motivation
s390x mainframe(s) are being moved to PRG2 with the help of IBM. We need to support the process and help bring back any QE-related LPARs or VMs on those mainframes so that they can be used as part of openQA again.
Acceptance criteria
- AC1: s390x openQA tests on openqa.opensuse.org are able to successfully execute tests
Suggestions
- Follow s390x mainframe related moving coordination as referenced in #100455#Important-documents
- Read #132152 about what was done in the area of o3
- Update access details in https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls where necessary
- See the last worker definitions in #134912-34 and crosscheck, e.g. with openqa.opensuse.org/admin/workers/
- Ask in particular mgriessmeier+gschlotter for collaboration – if not the assignee themselves ;)
- Ensure we have access to machines manually as well as with verification openQA jobs, both for o3+osd
- Ensure we work with proper FQDNs where possible, not IPv4 mess :)
- React to any user reports about not working or missing s390x tests/VMs/LPARs, etc.
- The openqa.opensuse.org openQA machine setting for "s390x-zVM-vswitch-l2" has REPO_HOST=192.168.112.100 and other references to 192.168.112. This needs to be changed as soon as zVM instances are able to reach new-ariel internally, e.g. over FTP; see the sketch after this list for finding such references
- Inform users about the result
- Ensure that https://openqa.opensuse.org/group_overview/34 shows no obvious incompletes/fails/stuck jobs; currently there has been no new build for some days
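A quick way to find remaining references to the old range, as a sketch; it assumes openqa-cli can reach o3 and that the machines route returns the usual {"Machines": [...]} JSON:
# list machine definitions and print the names of those still referencing 192.168.112.*
openqa-cli api --o3 machines \
  | jq -r '.Machines[] | select(tostring | test("192\\.168\\.112")) | .name'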
Updated by okurz about 1 year ago
- Subject changed from Support move of s390x mainframe(s) to PRG2 - o3 to Support move of s390x mainframe(s) to PRG2 - o3 size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by okurz almost 1 year ago
- Related to action #133364: Migration of o3 VM to PRG2 - Decommission old-ariel in NUE1 as soon as we do not need it anymore added
Updated by okurz 11 months ago
- Has duplicate action #139307: openQA for s390x does not work at the moment added
Updated by mgriessmeier 11 months ago
just to not forget: in theory the o3 setup is as far progressed as o.s.d; only the last basic test remains, pinging the LPAR s390zl11, which was not pingable from o.o.o and would need to be investigated by SUSE IT
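A minimal connectivity check to hand over to SUSE IT could look like this sketch, run from new-ariel (the LPAR IP 10.150.1.41 is taken from the next comment):
# basic reachability plus the path taken, numeric to avoid DNS noise
ping -c 3 s390zl11
traceroute -n 10.150.1.41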
Updated by okurz 11 months ago
- Target version changed from Tools - Next to Ready
We are getting more and more questions, e.g. from DimStar today in irc://irc.libera.chat/opensuse-factory:
[13/11/2023 17:12:45] okurz_: now as the latest s390x snapshot is in openQA we do seem to have an issue there: no worker available to pick up the jobs; is that expected?
hence adding to our backlog
Updated by okurz 11 months ago
According to mgriessmeier, in the o3 network the s390x mainframe zVM interface should be reachable over 10.150.1.41.
mgriessmeier will ask gschlotter to double-check the vswitch config related to that IP and whether it should be reachable from o3 (10.150.1.11/24). Ideally, find out the L2 address as well; a sketch for that follows.
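One way to find the L2 address from a host in the same /24: ping once so the kernel populates its neighbor table, then read the entry back (a sketch):
ping -c 1 10.150.1.41
# shows the MAC (lladdr) learned for that IP
ip neigh show 10.150.1.41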
Updated by mgriessmeier 11 months ago
update:
- connection from ariel (and the whole 10.150.1.x range) to s390zl11 (10.150.1.41) is working now
- two o3 guests have been created
- modified an existing worker on o3 (openqaworker23:1)
- started a first job https://openqa.opensuse.org/tests/3731839#live with:
openqa-clone-job --skip-chained-deps --within-instance https://openqa.opensuse.org/tests/3588513 WORKER_CLASS=s390x-zVM_poo137408 _GROUP=0 BUILD=poo137408 TEST+=-poo137408
Updated by mgriessmeier 11 months ago
it used the old MACHINE settings since it was a cloned job, so I triggered another one while explicitly overriding the MACHINE settings:
openqa-clone-job --skip-chained-deps --within-instance https://openqa.opensuse.org/tests/3588513 WORKER_CLASS=s390x-zVM_poo137408 _GROUP=0 BUILD=poo137408 TEST+=-poo137408 REPO_HOST=10.150.1.11 S390_NETWORK_PARAMS="OSAMedium=eth OSAInterface=qdio InstNetDev=osa HostIP=10.150.1.150 Hostname=o3zvm001 Gateway=10.150.1.254 Nameserver=10.151.53.53 Domain=openqanet.opensuse.org PortNo= Layer2=0 ReadChannel=0.0.0A00 WriteChannel=0.0.0A01 DataChannel=0.0.0A02 OSAHWAddr="
Updated by nicksinger 11 months ago
we had some issues with apparmor which caused icewm (containing the xterm-console) to fail. Apparently the apparmor-profile package was missing, which we fixed by just installing the apparmor pattern on worker23
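For reference, installing the pattern as described would be this one-liner (a sketch; -t pattern selects a pattern rather than a single package):
zypper in -t pattern apparmor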
Updated by mgriessmeier 11 months ago
the core infrastructure part is done, the mainframe is able to start the installer -> https://openqa.opensuse.org/tests/3733312
This still shows an error, most likely caused by a slow network (or, less likely, a DNS/firewall issue, though the service is reachable from the worker), and would need more investigation
still to do:
- currently there is only one instance hardcoded, as the old approach of substituting @S390_HOST@ does not work anymore, so refactoring in the same way as on o.s.d is needed - see https://progress.opensuse.org/issues/132152#note-62 for details
- trigger enough verification runs to get statistical stability
- adapt MACHINE definition accordingly after the refactoring mentioned above
created 50 jobs with
for i in {1..50}; do openqa-clone-job --skip-chained-deps --within-instance https://openqa.opensuse.org/tests/3718373 WORKER_CLASS=s390x-zVM_poo137408 _GROUP=0 BUILD=poo137408 TEST+=-poo137408-$i REPO_HOST=10.150.1.11 S390_NETWORK_PARAMS="OSAMedium=eth OSAInterface=qdio InstNetDev=osa HostIP=10.150.1.150/24 Hostname=o3zvm001.openqanet.opensuse.org Gateway=10.150.1.254 Nameserver=10.151.1.11 Domain=openqanet.opensuse.org PortNo= Layer2=0 ReadChannel=0.0.0A00 WriteChannel=0.0.0A01 DataChannel=0.0.0A02 OSAHWAddr="; done
-> https://openqa.opensuse.org/tests/overview?build=poo137408&version=Tumbleweed&distri=opensuse
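Once those runs finished, a sketch for tallying them per result, assuming the jobs route accepts build/distri filters and returns the usual {"jobs": [...]} JSON:
openqa-cli api --o3 jobs build=poo137408 distri=opensuse \
  | jq -r '.jobs[].result' | sort | uniq -c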
Updated by mgriessmeier 11 months ago
nicksinger wrote in #note-12:
we had some issues with apparmor which caused icewm (containing the xterm-console) to fail. Apparently the apparmor-profile package was missing, which we fixed by just installing the apparmor pattern on worker23
hmm, seems like our temporary changes to apparmor were reverted somehow/by someone -> https://openqa.opensuse.org/tests/3735119#step/bootloader_s390/33
as I understood, this is not acceptable anyway and the worker should go either into a container or onto a separate host, please move it accordingly :)
However, the infrastructure to run the tests is in place. There seems to be an issue with ntp which needs to be investigated further; I assume it's a timeout but haven't verified it -> https://openqa.opensuse.org/tests/3733329#step/partitioning_finish/2
besides that, next steps would be:
- schedule a bunch of verification jobs after the worker changes have been made (ideally over the weekend)
- apply the same refactoring as to be done for the o.s.d. workers in #132152
Updated by okurz 11 months ago
Regarding s390 in containers please see https://progress.opensuse.org/projects/openqav3/wiki/Wiki#o3-s390-workers . As we discussed, we see the main infrastructure network configuration in good shape. There might still be specific individual network services blocked. This would need to be solved case by case based on openQA test results.
mgriessmeier wrote in #note-14:
However, the infrastructure to run the tests is in place. There seems to be an issue with ntp which needs to be investigated further; I assume it's a timeout but haven't verified it -> https://openqa.opensuse.org/tests/3733329#step/partitioning_finish/2
For that I suggest using the interactive mode of openQA: pause at the partitioning module and check whether the connection can be established at all. I suspect a firewall blocking the communication.
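A sketch of how such a paused verification run could be scheduled, assuming the PAUSE_AT developer-mode setting and a hypothetical BUILD suffix; the job then waits at that module until resumed from the web UI's developer console:
openqa-clone-job --skip-chained-deps --within-instance https://openqa.opensuse.org/tests/3733329 \
  WORKER_CLASS=s390x-zVM_poo137408 _GROUP=0 BUILD=poo137408_debug PAUSE_AT=partitioning_finish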
Updated by mgriessmeier 11 months ago
okurz wrote in #note-15:
Regarding s390 in containers please see https://progress.opensuse.org/projects/openqav3/wiki/Wiki#o3-s390-workers . As we discussed, we see the main infrastructure network configuration in good shape. There might still be specific individual network services blocked. This would need to be solved case by case based on openQA test results.
Installed openqaworker23_container_101 on openqaworker23 with the following steps:
- created directory /opt/s390x_opensuse
- copied /etc/openqa/client.conf and /etc/openqa/workers.ini from openqaworker23 itself
- chmod 644 /opt/s390x_opensuse/client.conf in order to be readable by the container
- modified /opt/s390x_opensuse/workers.ini:
[global]
HOST = https://openqa.opensuse.org
WORKER_HOSTNAME = openqaworker23
CACHEDIRECTORY = /var/lib/openqa/cache
CACHESERVICEURL=http://10.150.1.26:9530/
CACHELIMIT = 400
[101]
WORKER_CLASS = s390x-zVM_poo137408
s390_HOST = '001'
ZVM_GUEST = o3zvm001
ZVM_HOST = 10.150.1.41
ZVM_PASSWORD =
- override the cache service:
systemctl edit openqa-worker-cacheservice.service
# /etc/systemd/system/openqa-worker-cacheservice.service.d/override.conf
[Service]
Environment="MOJO_LISTEN=http://0.0.0.0:9530"
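Before starting the container, a sketch for applying the override and confirming the cache service now answers on the routable address (any HTTP status code proves it is listening):
systemctl restart openqa-worker-cacheservice.service
curl -sS -o /dev/null -w '%{http_code}\n' http://10.150.1.26:9530/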
- started the container:
podman run -d -h openqaworker23_container --name openqaworker23_container_101 -p $(python3 -c"p=101*10+20003;print(f'{p}:{p}')") -e OPENQA_WORKER_INSTANCE=101 -v /opt/s390x_opensuse:/etc/openqa -v /var/lib/openqa/share:/var/lib/openqa/share -v /var/lib/openqa/cache:/var/lib/openqa/cache registry.opensuse.org/devel/openqa/containers15.5/openqa_worker_os_autoinst_distri_opensuse:latest
(cd /etc/systemd/system/; podman generate systemd -f -n --new openqaworker23_container_101 --restart-policy always)
systemctl enable container-openqaworker23_container_101.service
Created symlink /etc/systemd/system/default.target.wants/container-openqaworker23_container_101.service → /etc/systemd/system/container-openqaworker23_container_101.service
One presumable bug I found: openssh was missing from the container (see https://openqa.opensuse.org/tests/3736877#step/bootloader_s390/33), so I installed it manually:
podman exec -it openqaworker23_container_101 zypper in openssh
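A quick check that the containerized worker came up and registered, as a sketch:
systemctl status container-openqaworker23_container_101.service
# recent worker log lines, should show the registration with openqa.opensuse.org
podman logs --tail 20 openqaworker23_container_101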
okurz wrote in #note-15:
https://openqa.opensuse.org/tests/3733329#step/partitioning_finish/2
For that I suggest using the interactive mode of openQA: pause at the partitioning module and check whether the connection can be established at all. I suspect a firewall blocking the communication.
I tried once, but it didn't pause; however, I had logged in before that step and already noticed that DNS resolution didn't work at all. Changed the nameserver to 10.150.2.10 -> https://openqa.opensuse.org/tests/3737257
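A sketch for double-checking the new nameserver before re-triggering, forward as well as reverse (10.150.1.150 is the guest IP used in the jobs above):
dig @10.150.2.10 +short download.opensuse.org
dig @10.150.2.10 +short -x 10.150.1.150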
Updated by mgriessmeier 11 months ago
there is a job which passed the installation step: https://openqa.opensuse.org/tests/3737299#live
let's see -> https://openqa.opensuse.org/tests/overview?build=poo137408_V2&version=Tumbleweed&distri=opensuse
Updated by okurz 11 months ago
On w23 in /opt/s390x_opensuse I added
ZVM_HOST = s390zl11.openqanet.opensuse.org
to fix https://openqa.opensuse.org/tests/3749284/. Retriggered as https://openqa.opensuse.org/tests/3749340
Updated by okurz 11 months ago
- Status changed from Workable to In Progress
As https://openqa.opensuse.org/tests/3749350 progressed far enough, mgriessmeier and I enabled four instances using s3270, added corresponding entries in o3:/etc/hosts and then generated four openQA worker systemd services on w23.
@mgriessmeier after today please review again whether that still looks fine and then resolve.
Updated by openqa_review 11 months ago
- Due date set to 2023-12-07
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz 11 months ago · Edited
There are still some old o3 jobs scheduled with the old worker class "s390x-zVM-vswitch-l2", not the new one which is just "s390x-zVM". We could just ignore those, but I opted to add it back to the worker class config of all workers, so we now have in w23:/opt/s390x_opensuse/workers.ini:
[global]
…
WORKER_CLASS = s390x-zVM,s390x-zVM-vswitch-l2,prg,prg2,prg2-j12
ZVM_PASSWORD = …
ZVM_HOST = s390zl11.openqanet.opensuse.org
[101]
ZVM_GUEST = o3zvm001.openqanet.opensuse.org
[102]
ZVM_GUEST = o3zvm002.openqanet.opensuse.org
[103]
ZVM_GUEST = o3zvm003.openqanet.opensuse.org
[104]
ZVM_GUEST = o3zvm004.openqanet.opensuse.org
and then systemctl restart container-openqaworker23_container_10{1..4}.service
That led to the problem that reverse resolution did not work for o3zvm003.openqanet.opensuse.org and o3zvm004.openqanet.opensuse.org while it worked for o3zvm001 and o3zvm002, which we already had in o3:/etc/hosts. I restarted dnsmasq on o3 and restarted https://openqa.opensuse.org/tests/3751739, leading to https://openqa.opensuse.org/tests/3751741
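Such a reverse-resolution gap can be confirmed from ariel with a sketch like this; the o3zvm003 address is a hypothetical continuation of the o3zvm001 numbering:
dig +short -x 10.150.1.150   # o3zvm001, resolves
dig +short -x 10.150.1.152   # o3zvm003 (hypothetical IP), empty output before the dnsmasq restart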
Updated by okurz 11 months ago
I am not clear about https://openqa.opensuse.org/tests/3751745#step/bootloader_s390/29, please take a look
Updated by mgriessmeier 11 months ago
okurz wrote in #note-22:
I am not clear about https://openqa.opensuse.org/tests/3751745#step/bootloader_s390/29, please take a look
hmm, I must have missed this...
well, it looks like it's not booting within 300s and there is no output after the network config was created
... but I don't see any obvious failure
Updated by mgriessmeier 11 months ago
- Status changed from In Progress to Feedback
I was able to reproduce the issue on o3zvm003 and o3zvm004, but not on 001 or 002... which led me to have a look into the z/VM-internal network config. I figured out that o3zvm003 and o3zvm004 were simply not configured at all in the vswitch, hence failing to establish a network connection. Added them, restarted z/VM and triggered 4 parallel jobs to verify all 4 workers: https://openqa.opensuse.org/tests/overview?distri=opensuse&version=Tumbleweed&build=poo137408_network-test
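For reference, the kind of z/VM change involved, sketched as CP commands issued via vmcp from a privileged Linux guest; VSWITCH1 is a hypothetical vswitch name and the generic GRANT syntax is an assumption, not necessarily the exact commands used:
# grant the two missing guests access to the vswitch (guest names per ZVM_GUEST above)
vmcp set vswitch VSWITCH1 grant O3ZVM003
vmcp set vswitch VSWITCH1 grant O3ZVM004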
Updated by mgriessmeier 11 months ago
- Status changed from Feedback to Resolved
the 4 tests are passing, and the live test of the production build also looks good: https://openqa.opensuse.org/tests/3767882#live
so I will close this
Updated by okurz 9 months ago
- Related to action #153057: [tools] test fails in bootloader_start because openQA can not boot for s390x size:M added