action #97751

closed

replacement setup for o3 s390x openQA workers size:M

Added by okurz about 3 years ago. Updated about 3 years ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version:
Start date:
Due date: 2021-09-17
% Done: 0%
Estimated time:
Description

Motivation

#97658 is about recovering the original machine.
Regarding rebel: if we can't recover it in a reasonable time, we could try to run the s390x openQA worker instances in containers on one of the other hosts, since we don't run qemu on those machines anyway; the workers mostly forward VNC and record video. So we should be able to come up with a replacement setup, for example containers that only need /etc/openqa/client.conf and /etc/openqa/workers.ini and run individual worker instances on openqaworker7 or any of the other existing o3 machines.

Suggestion

  • Configure a container image with existing client.conf/workers.ini from #97658
  • Use podman on openqaworker7 (prefer non-root)
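A rough sketch of this suggestion (the config directory path is a placeholder; the image is the existing openQA worker container image used further below):

podman run --rm -it -v /path/to/copied/rebel/config:/etc/openqa registry.opensuse.org/devel/openqa/containers15.2/openqa_worker:latest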

Related issues 3 (0 open, 3 closed)

Related to openQA Infrastructure - action #116782: o3 s390 workers are offline (Resolved, mkittler, 2022-09-19 to 2022-10-04)

Related to openQA Project - action #119713: Leap tests are failing because of failed log file uploading in multiple tests on s390x size:M (Resolved, okurz, 2022-11-01)

Copied from openQA Infrastructure - action #97658: many (maybe all) jobs on rebel within o3 run into timeout_exceeded "setup exceeded MAX_SETUP_TIME" size:M (Resolved, nicksinger, 2021-08-30)
Actions #1

Updated by okurz about 3 years ago

  • Copied from action #97658: many (maybe all) jobs on rebel within o3 run into timeout_exceeded "setup exceeded MAX_SETUP_TIME" size:M added
Actions #2

Updated by livdywan about 3 years ago

  • Subject changed from replacement setup for o3 s390x openQA workers to replacement setup for o3 s390x openQA workers size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #3

Updated by okurz about 3 years ago

podman is available on openqaworker7. I already put the two config files on openqaworker7 into /opt/s390x_rebel_replacement and tried

podman run --rm -it -v /opt/s390x_rebel_replacement:/etc/openqa registry.opensuse.org/devel/openqa/containers15.2/openqa_worker:latest

this starts but then fails trying to access the worker cache, see

[info] [pid:21] worker 1:
 - config file:           /etc/openqa/workers.ini
 - worker hostname:       02d7428a4ac4
 - isotovideo version:    23
 - websocket API version: 1
 - web UI hosts:          http://openqa1-opensuse
 - class:                 s390x-zVM-vswitch-l2,s390x-rebel-1-linux144
 - no cleanup:            no
 - pool directory:        /var/lib/openqa/pool/1
[error] [pid:21] Worker cache not available: Cache service info error: Connection refused
[info] [pid:21] CACHE: caching is enabled, setting up /var/lib/openqa/cache/openqa1-opensuse
[info] [pid:21] Project dir for host http://openqa1-opensuse is /var/lib/openqa/share
[info] [pid:21] Registering with openQA http://openqa1-opensuse
[info] [pid:21] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/397
[info] [pid:21] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 397
[warn] [pid:21] Worker cache not available: Cache service info error: Connection refused - checking again for web UI 'http://openqa1-opensuse' in 100.00 s
Actions #4

Updated by dheidler about 3 years ago

  • Status changed from Workable to In Progress
  • Assignee set to dheidler
Actions #5

Updated by dheidler about 3 years ago

openqaworker7 seems to be unreachable.

Also running podman as non-root doesn't seem to work (on openSUSE?):

$ podman run --rm -it registry.opensuse.org/opensuse/leap:15.3 bash
Trying to pull registry.opensuse.org/opensuse/leap:15.3...
Getting image source signatures
Copying blob 795e626d95ff done  
Copying config 4826cf609b done  
Writing manifest to image destination
Storing signatures
Error: Error committing the finished image: error adding layer with blob "sha256:795e626d95ff6936a1f4c64c8fde63e59d8f9f373557db78f84fe9ac4a91f1da": Error processing tar file(exit status 1): potentially insufficient UIDs or GIDs available in user namespace (requested 0:15 for /etc/shadow): Check /etc/subuid and /etc/subgid: lchown /etc/shadow: invalid argument
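For reference, rootless podman needs subordinate UID/GID ranges for the invoking user, which is what the error above points at with /etc/subuid and /etc/subgid. A typical way to add such ranges (user name and range are examples, not taken from this host):

usermod --add-subuids 100000-165535 --add-subgids 100000-165535 someuser
podman system migrate   # make podman pick up the new ranges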
Actions #6

Updated by okurz about 3 years ago

openqaworker7 is part of the o3 network, so you need to go via o3 (aka ariel). Also, I suggest using registry.opensuse.org/devel/openqa/containers15.2/openqa_worker:latest which already includes openQA-worker, see above
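For example, with OpenSSH's jump-host option (the bare hostname "ariel" here stands in for whatever the actual o3 host resolves to):

ssh -J ariel openqaworker7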

Actions #7

Updated by openqa_review about 3 years ago

  • Due date set to 2021-09-15

Setting due date based on mean cycle time of SUSE QE Tools

Actions #8

Updated by dheidler about 3 years ago

I currently use the command

podman run --rm -it -h openqaworker7_container -v /opt/s390x_rebel_replacement:/etc/openqa -v /var/lib/openqa/share:/var/lib/openqa/share --entrypoint /bin/bash registry.opensuse.org/devel/openqa/containers15.2/openqa_worker:latest

on openqaworker7 to start the container.

I disabled the cache service via workers.ini and use the NFS mount from the host instead.
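Disabling the cache service effectively means not setting CACHEDIRECTORY; a minimal sketch of such a workers.ini, with the WORKER_CLASS values taken from the log above and everything else assumed:

[global]
HOST = http://openqa1-opensuse
WORKER_CLASS = s390x-zVM-vswitch-l2,s390x-rebel-1-linux144
# no CACHEDIRECTORY entry: without it the worker does not start the cache
# service and reads assets/tests from the shared /var/lib/openqa/share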

For some reason the permissions of the pool directory are incorrect, even though the Dockerfile contains a line that should set them correctly:

chown -R _openqa-worker /usr/share/openqa/script/worker /var/lib/openqa/cache /var/lib/openqa/pool

So I have to run the chown manually:

chown -R _openqa-worker /var/lib/openqa/pool

Then I can run /run_openqa_worker.sh which I currently do in an interactive (bash) session on openqaworker7 within a tmux session.

The results look promising (https://openqa.opensuse.org/admin/workers/407), but there seem to be some network issues, though I'm not sure whether they are actually related: https://openqa.opensuse.org/tests/1898458#step/prepare_test_data/8

Actions #9

Updated by dheidler about 3 years ago

Also something odd happens:
Sometimes the entrypoint script is missing from the container.
Then I need to run podman rmi openqa_worker and let podman fetch a new image.

Actions #10

Updated by dheidler about 3 years ago

The network issues seem to be due to the wrong WORKER_HOSTNAME entry in workers.ini.

I updated that file to use the IP of the container host and created a new start command:

i=102; podman run --rm -it -h openqaworker7_container -p $(python3 -c"p=${i}*10+20003;print(f'{p}:{p}')") -e OPENQA_WORKER_INSTANCE=$i -v /opt/s390x_rebel_replacement:/etc/openqa -v /var/lib/openqa/share:/var/lib/openqa/share --entrypoint /bin/bash registry.opensuse.org/devel/openqa/containers15.2/openqa_worker:latest

The command server port is calculated as ${instance}*10+20003, so I'm using instance numbers starting at 101 to avoid conflicting with the host worker ports.
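As a worked example (same arithmetic as the python one-liner in the command above), instance 102 publishes port 21023:

i=102
echo $((i * 10 + 20003))   # prints 21023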

Then I need to run these commands within the container:

mkdir /var/lib/openqa/pool/$OPENQA_WORKER_INSTANCE
chown -R _openqa-worker /var/lib/openqa/pool/
/run_openqa_worker.sh
Actions #12

Updated by dheidler about 3 years ago

With the PR merged and the new container image published on the registry, this command should be sufficient to start the container worker:

i=101; podman run --rm -it -h openqaworker7_container -p $(python3 -c"p=${i}*10+20003;print(f'{p}:{p}')") -e OPENQA_WORKER_INSTANCE=$i -v /opt/s390x_rebel_replacement:/etc/openqa -v /var/lib/openqa/share:/var/lib/openqa/share registry.opensuse.org/devel/openqa/containers15.2/openqa_worker:latest
Actions #13

Updated by dheidler about 3 years ago

i=102
podman run -d -h openqaworker7_container --name openqaworker7_container_$i -p $(python3 -c"p=${i}*10+20003;print(f'{p}:{p}')") -e OPENQA_WORKER_INSTANCE=$i -v /opt/s390x_rebel_replacement:/etc/openqa -v /var/lib/openqa/share:/var/lib/openqa/share registry.opensuse.org/devel/openqa/containers15.2/openqa_worker:latest
(cd /etc/systemd/system/; podman generate systemd -f -n openqaworker7_container_$i --restart-policy always)
systemctl enable container-openqaworker7_container_$i
Actions #14

Updated by okurz about 3 years ago

The reference for tests should be https://openqa.opensuse.org/tests/overview?build=20210827&groupid=34&version=Tumbleweed&distri=opensuse with 4/4 passed.

As discussed, if you suspect problems because of the hybrid transactional-update setup on openqaworker7, try openqaworker1 or openqaworker4 instead. Or put everything, including removing the container before starting it, into a dirty shell script, e.g. in /opt/bin/our_hacky_containers_on_broken_transactional_systems_because_we_do_not_know_how_to_do_it_properly

Or ask on the opensuse-factory@opensuse.org mailing list or on libera.chat how to properly maintain container services on a transactional-update system without installing a full-blown orchestration setup, e.g. Kubernetes.
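A minimal sketch of such a wrapper script, reusing the podman invocation from #13 (the script itself, its argument handling and the preceding "podman rm -f" are assumptions following the suggestion above, not the final solution):

#!/bin/sh
# dirty wrapper: remove any stale container of this instance, then start a fresh one
i=${1:-101}
name=openqaworker7_container_$i
port=$((i * 10 + 20003))
podman rm -f "$name" 2>/dev/null || true
podman run -d -h openqaworker7_container --name "$name" -p "$port:$port" -e OPENQA_WORKER_INSTANCE="$i" -v /opt/s390x_rebel_replacement:/etc/openqa -v /var/lib/openqa/share:/var/lib/openqa/share registry.opensuse.org/devel/openqa/containers15.2/openqa_worker:latest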

Actions #15

Updated by dheidler about 3 years ago

Seems that there are some dependencies missing in the container: https://openqa.opensuse.org/tests/1911614

[2021-09-11T09:36:31.323 UTC] [warn] !!! main_common.pm: Failed to load main_ltp.pm:
  Can't locate XML/Simple.pm in @INC (you may need to install the XML::Simple module) (@INC contains: . opensuse/lib /var/lib/openqa/pool/102/blib/arch /var/lib/openqa/pool/102/blib/lib /usr/lib/os-autoinst /usr/lib/perl5/site_perl/5.26.1/x86_64-linux-thread-multi /usr/lib/perl5/site_perl/5.26.1 /usr/lib/perl5/vendor_perl/5.26.1/x86_64-linux-thread-multi /usr/lib/perl5/vendor_perl/5.26.1 /usr/lib/perl5/5.26.1/x86_64-linux-thread-multi /usr/lib/perl5/5.26.1 /usr/lib/perl5/site_perl) at opensuse/lib/bugzilla.pm line 25.
  BEGIN failed--compilation aborted at opensuse/lib/bugzilla.pm line 25.
  Compilation failed in require at opensuse/lib/LTP/WhiteList.pm line 26.
  BEGIN failed--compilation aborted at opensuse/lib/LTP/WhiteList.pm line 26.
  Compilation failed in require at opensuse/lib/LTP/utils.pm line 27.
  BEGIN failed--compilation aborted at opensuse/lib/LTP/utils.pm line 27.
  Compilation failed in require at opensuse/lib/main_ltp.pm line 27.
  BEGIN failed--compilation aborted at opensuse/lib/main_ltp.pm line 27.
  Compilation failed in require at (eval 142) line 1.
  BEGIN failed--compilation aborted at (eval 142) line 1.
[2021-09-11T09:36:31.791 UTC] [warn] !!! autotest::loadtest: error on tests/installation/installation_overview.pm: Can't locate Test/Assert.pm in @INC (you may need to install the Test::Assert module) (@INC contains: opensuse/tests/installation opensuse/lib . opensuse/products/opensuse/../../lib /var/lib/openqa/pool/102/blib/arch /var/lib/openqa/pool/102/blib/lib /usr/lib/os-autoinst /usr/lib/perl5/site_perl/5.26.1/x86_64-linux-thread-multi /usr/lib/perl5/site_perl/5.26.1 /usr/lib/perl5/vendor_perl/5.26.1/x86_64-linux-thread-multi /usr/lib/perl5/vendor_perl/5.26.1 /usr/lib/perl5/5.26.1/x86_64-linux-thread-multi /usr/lib/perl5/5.26.1 /usr/lib/perl5/site_perl) at opensuse/tests/installation/installation_overview.pm line 25.

So perl(XML::Simple) and perl(Test::Assert) are missing.

Turns out that os-autoinst-distri-opensuse-deps is missing in the container.

Actions #16

Updated by dheidler about 3 years ago

  • Status changed from In Progress to Feedback
Actions #17

Updated by dheidler about 3 years ago

I manually installed the dependencies into the container for now:

podman exec -it openqaworker1_container_101 zypper -n --gpg-auto-import-keys in os-autoinst-distri-opensuse-deps
podman exec -it openqaworker1_container_102 zypper -n --gpg-auto-import-keys in os-autoinst-distri-opensuse-deps
podman exec -it openqaworker1_container_103 zypper -n --gpg-auto-import-keys in os-autoinst-distri-opensuse-deps
podman exec -it openqaworker1_container_104 zypper -n --gpg-auto-import-keys in os-autoinst-distri-opensuse-deps
Actions #22

Updated by livdywan about 3 years ago

  • Due date changed from 2021-09-15 to 2021-09-17

Bumping the due date. It's currently not clear what's missing. @okurz suggests, if needed, to simply run the container, install and fix things up in the instance, and file tickets for concrete follow-ups as needed.

Actions #23

Updated by okurz about 3 years ago

Unfortunately the ticket has been "Urgent" for 15 days now. Please focus on removing the urgency ASAP. It does not have to be pretty, but some s390x tests should run on o3, regardless of how hacky it looks.
Please prioritize coming up with a quick-and-dirty solution, e.g. podman run … bash to get an interactive prompt in the container, do whatever is necessary in that interactive bash session to have at least one s390x job completing, and store that in a dirty shell script.

Actions #24

Updated by dheidler about 3 years ago

  • Priority changed from Urgent to High

After my fixes from Tuesday we should have at least one container worker that performs jobs:
https://openqa.opensuse.org/tests/overview?build=20210915&groupid=34&version=Tumbleweed&distri=opensuse

So I think we could remove the urgency for now.
The tests run through (or look like product issues) and in the meantime I can fix the (hopefully) last dependency issues.

Actions #25

Updated by okurz about 3 years ago

dheidler wrote:

So I think we could remove the urgency for now.
The tests run through (or look like product issues) and in the meantime I can fix the (hopefully) last dependency issues.

Thank you. I appreciate that. Result looks good so far.

Actions #28

Updated by dheidler about 3 years ago

  • Status changed from Feedback to Resolved

With the last obstacles removed (FileProvides: /usr/bin/Xvnc xorg-x11-Xvnc), we now have four running containers on openqaworker1 that were started using the described command and that can survive reboots using their systemd service files.

https://openqa.opensuse.org/tests/1922456 was run in such a container and worked out of the box.

Actions #29

Updated by dheidler about 3 years ago

Turns out that the unit files are still specific to the container-internal ID due to lines like:

[Service]
PIDFile=/var/run/containers/storage/btrfs-containers/b19ecda81ee2710f7cd3aadeb463c89f7d36b767ec4b23f71f49401da3c5860d/userdata/conmon.pid

Therefore we always need to regenerate the systemd service files when starting a new container instance via podman run:

i=101
podman run -d -h openqaworker1_container --name openqaworker1_container_$i -p $(python3 -c"p=${i}*10+20003;print(f'{p}:{p}')") -e OPENQA_WORKER_INSTANCE=$i -v /opt/s390x_rebel_replacement:/etc/openqa -v /var/lib/openqa/share:/var/lib/openqa/share registry.opensuse.org/devel/openqa/containers15.2/openqa_worker:latest
(cd /etc/systemd/system/; podman generate systemd -f -n openqaworker1_container_$i --restart-policy always)
systemctl daemon-reload
systemctl enable container-openqaworker1_container_$i
Actions #31

Updated by okurz about 2 years ago

  • Related to action #116782: o3 s390 workers are offline added
Actions #32

Updated by okurz almost 2 years ago

  • Related to action #119713: Leap tests are failing because of failed log file uploading in multiple tests on s390x size:M added