action #97751
replacement setup for o3 s390x openQA workers size:M (closed)
Added by okurz over 3 years ago. Updated over 3 years ago.
Description
Motivation
#97658 is about recovering the original machine.
Regarding rebel: if we can't recover it in a reasonable time, we could try to run the s390x openQA worker instances within containers on one of the other hosts, since we don't run qemu on these machines anyway; they mostly forward VNC and record video. So we should be able to come up with a replacement setup, e.g. containers that only know /etc/openqa/client.conf and /etc/openqa/workers.ini and run individual worker instances on openqaworker7 or any of the other existing o3 machines.
Suggestion
- Configure a container image with the existing client.conf/workers.ini from #97658 (see the sketch after this list)
- Use podman on openqaworker7 (prefer non-root)
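For reference, a minimal sketch of what the two mounted config files could contain, assuming the standard openQA config layout; the API key/secret values are placeholders, only the web UI host and the worker class from the log further down are taken from this ticket:
# hypothetical content for /opt/s390x_rebel_replacement/client.conf (key/secret are placeholders)
cat > /opt/s390x_rebel_replacement/client.conf <<'EOF'
[openqa1-opensuse]
key = 0123456789ABCDEF
secret = FEDCBA9876543210
EOF
# hypothetical minimal /opt/s390x_rebel_replacement/workers.ini
cat > /opt/s390x_rebel_replacement/workers.ini <<'EOF'
[global]
HOST = http://openqa1-opensuse
WORKER_CLASS = s390x-zVM-vswitch-l2,s390x-rebel-1-linux144
EOF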
Updated by okurz over 3 years ago
- Copied from action #97658: many (maybe all) jobs on rebel within o3 run into timeout_exceeded "setup exceeded MAX_SETUP_TIME" size:M added
Updated by livdywan over 3 years ago
- Subject changed from replacement setup for o3 s390x openQA workers to replacement setup for o3 s390x openQA workers size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by okurz over 3 years ago
podman is available on openqaworker7. I already put the two config files on openqaworker7 into /opt/s390x_rebel_replacement and tried
podman run --rm -it -v /opt/s390x_rebel_replacement:/etc/openqa registry.opensuse.org/devel/openqa/containers15.2/openqa_worker:latest
This starts but then fails trying to access the worker cache, see:
[info] [pid:21] worker 1:
- config file: /etc/openqa/workers.ini
- worker hostname: 02d7428a4ac4
- isotovideo version: 23
- websocket API version: 1
- web UI hosts: http://openqa1-opensuse
- class: s390x-zVM-vswitch-l2,s390x-rebel-1-linux144
- no cleanup: no
- pool directory: /var/lib/openqa/pool/1
[error] [pid:21] Worker cache not available: Cache service info error: Connection refused
[info] [pid:21] CACHE: caching is enabled, setting up /var/lib/openqa/cache/openqa1-opensuse
[info] [pid:21] Project dir for host http://openqa1-opensuse is /var/lib/openqa/share
[info] [pid:21] Registering with openQA http://openqa1-opensuse
[info] [pid:21] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/397
[info] [pid:21] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 397
[warn] [pid:21] Worker cache not available: Cache service info error: Connection refused - checking again for web UI 'http://openqa1-opensuse' in 100.00 s
Updated by dheidler over 3 years ago
- Status changed from Workable to In Progress
- Assignee set to dheidler
Updated by dheidler over 3 years ago
openqaworker7 seems to be unreachable.
Also, running podman as non-root doesn't seem to work (on openSUSE?):
$ podman run --rm -it registry.opensuse.org/opensuse/leap:15.3 bash
Trying to pull registry.opensuse.org/opensuse/leap:15.3...
Getting image source signatures
Copying blob 795e626d95ff done
Copying config 4826cf609b done
Writing manifest to image destination
Storing signatures
Error: Error committing the finished image: error adding layer with blob "sha256:795e626d95ff6936a1f4c64c8fde63e59d8f9f373557db78f84fe9ac4a91f1da": Error processing tar file(exit status 1): potentially insufficient UIDs or GIDs available in user namespace (requested 0:15 for /etc/shadow): Check /etc/subuid and /etc/subgid: lchown /etc/shadow: invalid argument
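One common way to make rootless podman work in this situation (an assumption, not something applied in this ticket) is to give the user a subordinate uid/gid range, as the error message suggests, and then reset podman's per-user storage so the new mapping takes effect:
# allocate an example range of 65536 subordinate uids/gids for the current user
echo "$USER:100000:65536" | sudo tee -a /etc/subuid /etc/subgid
# let rootless podman pick up the new mapping
podman system migrate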
Updated by okurz over 3 years ago
openqaworker7 is part of the o3 network, so you need to go over o3, a.k.a. ariel. Also, I suggest using registry.opensuse.org/devel/openqa/containers15.2/openqa_worker:latest, which already contains openQA-worker, see above.
Updated by openqa_review over 3 years ago
- Due date set to 2021-09-15
Setting due date based on mean cycle time of SUSE QE Tools
Updated by dheidler over 3 years ago
I currently use the command
podman run --rm -it -h openqaworker7_container -v /opt/s390x_rebel_replacement:/etc/openqa -v /var/lib/openqa/share:/var/lib/openqa/share --entrypoint /bin/bash registry.opensuse.org/devel/openqa/containers15.2/openqa_worker:latest
on openqaworker7 to start the container.
I disabled the cache service via workers.ini and use the NFS mount from the host instead.
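One way to do that in the mounted config (a sketch, assuming the stock key name CACHEDIRECTORY; the worker only starts the cache service when that key is set in [global]):
# comment out the cache directory so the worker uses the shared /var/lib/openqa/share directly
sed -i 's/^CACHEDIRECTORY/#CACHEDIRECTORY/' /opt/s390x_rebel_replacement/workers.ini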
For some reason the permissions of the pool directory are incorrect, even though the Dockerfile contains a line that should set them correctly:
chown -R _openqa-worker /usr/share/openqa/script/worker /var/lib/openqa/cache /var/lib/openqa/pool
I have to execute the same command manually:
chown -R _openqa-worker /var/lib/openqa/pool
Then I can run /run_openqa_worker.sh, which I currently do in an interactive (bash) session on openqaworker7 within a tmux session.
The results look promising (https://openqa.opensuse.org/admin/workers/407), but there seem to be some network issues, though I'm not sure whether they are actually related: https://openqa.opensuse.org/tests/1898458#step/prepare_test_data/8
Updated by dheidler over 3 years ago
Also something odd happens:
Sometimes the entrypoint script is missing from the container.
Then I need to run podman rmi openqa_worker and let podman fetch a new image.
Updated by dheidler over 3 years ago
The network issues seem to be due to a wrong WORKER_HOSTNAME entry in workers.ini.
I updated that file to use the IP of the container host and created a new start command:
i=102; podman run --rm -it -h openqaworker7_container -p $(python3 -c"p=${i}*10+20003;print(f'{p}:{p}')") -e OPENQA_WORKER_INSTANCE=$i -v /opt/s390x_rebel_replacement:/etc/openqa -v /var/lib/openqa/share:/var/lib/openqa/share --entrypoint /bin/bash registry.opensuse.org/devel/openqa/containers15.2/openqa_worker:latest
The command server port is calculated as ${instance}*10+20003, so I'm using instance numbers starting at 101 to avoid conflicting with the host worker ports.
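For clarity, the same port arithmetic without the inline python (a plain-shell equivalent, not taken from the ticket):
i=102
p=$(( i * 10 + 20003 ))   # 102*10 + 20003 = 21023, the os-autoinst command server port
echo "-p ${p}:${p}"       # -> -p 21023:21023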
Then I need to run these commands within the container:
mkdir /var/lib/openqa/pool/$OPENQA_WORKER_INSTANCE
chown -R _openqa-worker /var/lib/openqa/pool/
/run_openqa_worker.sh
Updated by dheidler over 3 years ago
With the PR merged and the new container image published on the registry, this command should be sufficient to start the container worker:
i=101; podman run --rm -it -h openqaworker7_container -p $(python3 -c"p=${i}*10+20003;print(f'{p}:{p}')") -e OPENQA_WORKER_INSTANCE=$i -v /opt/s390x_rebel_replacement:/etc/openqa -v /var/lib/openqa/share:/var/lib/openqa/share registry.opensuse.org/devel/openqa/containers15.2/openqa_worker:latest
Updated by dheidler over 3 years ago
i=102
podman run -d -h openqaworker7_container --name openqaworker7_container_$i -p $(python3 -c"p=${i}*10+20003;print(f'{p}:{p}')") -e OPENQA_WORKER_INSTANCE=$i -v /opt/s390x_rebel_replacement:/etc/openqa -v /var/lib/openqa/share:/var/lib/openqa/share registry.opensuse.org/devel/openqa/containers15.2/openqa_worker:latest
(cd /etc/systemd/system/; podman generate systemd -f -n openqaworker7_container_$i --restart-policy always)
systemctl enable container-openqaworker7_container_$i
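To verify that the generated unit actually manages the container (a hypothetical sanity check, not recorded in this ticket):
systemctl status container-openqaworker7_container_$i
podman ps --filter name=openqaworker7_container_$i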
Updated by okurz over 3 years ago
The reference for tests should be https://openqa.opensuse.org/tests/overview?build=20210827&groupid=34&version=Tumbleweed&distri=opensuse with 4/4 passed.
As discussed, if you suspect problems because of the hybrid transactional-update setup on w7, try openqaworker1 or openqaworker4 instead. Or put everything, including removing the container before start, into a dirty shell script, e.g. in /opt/bin/our_hacky_containers_on_broken_transactional_systems_because_we_do_not_know_how_to_do_it_properly
Or ask on the opensuse-factory@opensuse.org mailing list or on libera.chat how to properly maintain container services under transactional-update without installing a full-blown orchestration setup, e.g. Kubernetes.
Updated by dheidler over 3 years ago
Seems that there are some dependencies missing in the container: https://openqa.opensuse.org/tests/1911614
[2021-09-11T09:36:31.323 UTC] [warn] !!! main_common.pm: Failed to load main_ltp.pm:
Can't locate XML/Simple.pm in @INC (you may need to install the XML::Simple module) (@INC contains: . opensuse/lib /var/lib/openqa/pool/102/blib/arch /var/lib/openqa/pool/102/blib/lib /usr/lib/os-autoinst /usr/lib/perl5/site_perl/5.26.1/x86_64-linux-thread-multi /usr/lib/perl5/site_perl/5.26.1 /usr/lib/perl5/vendor_perl/5.26.1/x86_64-linux-thread-multi /usr/lib/perl5/vendor_perl/5.26.1 /usr/lib/perl5/5.26.1/x86_64-linux-thread-multi /usr/lib/perl5/5.26.1 /usr/lib/perl5/site_perl) at opensuse/lib/bugzilla.pm line 25.
BEGIN failed--compilation aborted at opensuse/lib/bugzilla.pm line 25.
Compilation failed in require at opensuse/lib/LTP/WhiteList.pm line 26.
BEGIN failed--compilation aborted at opensuse/lib/LTP/WhiteList.pm line 26.
Compilation failed in require at opensuse/lib/LTP/utils.pm line 27.
BEGIN failed--compilation aborted at opensuse/lib/LTP/utils.pm line 27.
Compilation failed in require at opensuse/lib/main_ltp.pm line 27.
BEGIN failed--compilation aborted at opensuse/lib/main_ltp.pm line 27.
Compilation failed in require at (eval 142) line 1.
BEGIN failed--compilation aborted at (eval 142) line 1.
[2021-09-11T09:36:31.791 UTC] [warn] !!! autotest::loadtest: error on tests/installation/installation_overview.pm: Can't locate Test/Assert.pm in @INC (you may need to install the Test::Assert module) (@INC contains: opensuse/tests/installation opensuse/lib . opensuse/products/opensuse/../../lib /var/lib/openqa/pool/102/blib/arch /var/lib/openqa/pool/102/blib/lib /usr/lib/os-autoinst /usr/lib/perl5/site_perl/5.26.1/x86_64-linux-thread-multi /usr/lib/perl5/site_perl/5.26.1 /usr/lib/perl5/vendor_perl/5.26.1/x86_64-linux-thread-multi /usr/lib/perl5/vendor_perl/5.26.1 /usr/lib/perl5/5.26.1/x86_64-linux-thread-multi /usr/lib/perl5/5.26.1 /usr/lib/perl5/site_perl) at opensuse/tests/installation/installation_overview.pm line 25.
So perl(XML::Simple) and perl(Test::Assert) are missing.
Turns out that os-autoinst-distri-opensuse-deps is missing in the container.
Updated by dheidler over 3 years ago
- Status changed from In Progress to Feedback
Updated by dheidler over 3 years ago
I manually installed the dependencies into the containers for now:
podman exec -it openqaworker1_container_101 zypper -n --gpg-auto-import-keys in os-autoinst-distri-opensuse-deps
podman exec -it openqaworker1_container_102 zypper -n --gpg-auto-import-keys in os-autoinst-distri-opensuse-deps
podman exec -it openqaworker1_container_103 zypper -n --gpg-auto-import-keys in os-autoinst-distri-opensuse-deps
podman exec -it openqaworker1_container_104 zypper -n --gpg-auto-import-keys in os-autoinst-distri-opensuse-deps
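The same four invocations could also be expressed as a loop (an equivalent sketch):
for i in 101 102 103 104; do
    podman exec -it openqaworker1_container_$i \
        zypper -n --gpg-auto-import-keys in os-autoinst-distri-opensuse-deps
done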
Updated by livdywan over 3 years ago
- Due date changed from 2021-09-15 to 2021-09-17
Bumping the due date. It's currently not clear what's missing. @okurz suggests, if needed, to simply run the container, install and fix things up in the instance, and file tickets as needed for concrete follow-ups.
Updated by okurz over 3 years ago
Unfortunately the ticket has been "Urgent" for 15 days now. Please focus on removing the urgency ASAP. It does not have to be pretty, but some s390x tests should run on o3, regardless of how hacky it looks.
Please prioritize coming up with a quick-and-dirty solution, e.g. podman run … bash to get an interactive prompt in the container, do whatever is necessary in that interactive bash session to have at least one s390x job completing, and store that in a dirty shell script.
Updated by dheidler over 3 years ago
- Priority changed from Urgent to High
After my fixes from Tuesday we should have at least one container worker that performs jobs:
https://openqa.opensuse.org/tests/overview?build=20210915&groupid=34&version=Tumbleweed&distri=opensuse
So I think we could remove the urgency for now.
The tests run through (or the remaining failures look like product issues), and in the meantime I can fix the (hopefully) last dependency issues.
Updated by okurz over 3 years ago
dheidler wrote:
So I think we could remove the urgency for now.
The tests run through (or the remaining failures look like product issues), and in the meantime I can fix the (hopefully) last dependency issues.
Thank you. I appreciate that. Result looks good so far.
Updated by dheidler over 3 years ago
- Status changed from Feedback to Resolved
With the last obstacle removed (FileProvides: /usr/bin/Xvnc xorg-x11-Xvnc), we now have four running containers on openqaworker1 that were started using the described command and that survive reboots via their systemd service files.
https://openqa.opensuse.org/tests/1922456 was run in such a container and worked out of the box.
Updated by dheidler over 3 years ago
Turns out that the unit files are still specific to the container's internal id due to:
[Service]
PIDFile=/var/run/containers/storage/btrfs-containers/b19ecda81ee2710f7cd3aadeb463c89f7d36b767ec4b23f71f49401da3c5860d/userdata/conmon.pid
Therefore we always need to regenerate the systemd service files when running a new container instance using podman run.
i=101
podman run -d -h openqaworker1_container --name openqaworker1_container_$i -p $(python3 -c"p=${i}*10+20003;print(f'{p}:{p}')") -e OPENQA_WORKER_INSTANCE=$i -v /opt/s390x_rebel_replacement:/etc/openqa -v /var/lib/openqa/share:/var/lib/openqa/share registry.opensuse.org/devel/openqa/containers15.2/openqa_worker:latest
(cd /etc/systemd/system/; podman generate systemd -f -n openqaworker1_container_$i --restart-policy always)
systemctl daemon-reload
systemctl enable container-openqaworker1_container_$i
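Collecting these steps into one script, along the lines of the dirty shell script okurz suggested earlier; a sketch only, with a hypothetical path and minimal error handling, not the exact script used on the worker:
#!/bin/bash
# /opt/bin/recreate_s390x_container.sh <instance>  (hypothetical helper)
set -e
i=${1:-101}
name=openqaworker1_container_$i
port=$(( i * 10 + 20003 ))   # os-autoinst command server port, same formula as above

# remove a possibly existing container before recreating it
podman rm -f "$name" 2>/dev/null || true

podman run -d -h openqaworker1_container --name "$name" \
    -p ${port}:${port} -e OPENQA_WORKER_INSTANCE=$i \
    -v /opt/s390x_rebel_replacement:/etc/openqa \
    -v /var/lib/openqa/share:/var/lib/openqa/share \
    registry.opensuse.org/devel/openqa/containers15.2/openqa_worker:latest

# regenerate the unit file, since it embeds the new container's internal id (PIDFile)
(cd /etc/systemd/system/ && podman generate systemd -f -n "$name" --restart-policy always)
systemctl daemon-reload
systemctl enable container-"$name"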
Updated by okurz over 2 years ago
- Related to action #116782: o3 s390 workers are offline added
Updated by okurz about 2 years ago
- Related to action #119713: Leap tests are failing because of failed log file uploading in multiple tests on s390x size:M added