action #129484


openQA Project - coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances

openQA Project - coordination #108209: [epic] Reduce load on OSD

high response times on osd - Move OSD workers to o3 to prevent OSD overload size:M

Added by okurz 11 months ago. Updated 10 months ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version:
Start date: 2023-05-17
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Motivation

OSD can be overloaded by too many jobs uploading too much data. We have more workers pending installation, so the situation can become even worse. On the other hand o3 sometimes runs into capacity limits when a new Tumbleweed snapshot, Leap and staging tests come together. So logically we should move some machines in NUE1-SRV1 from osd to o3.

Acceptance criteria

  • AC1: o3 has more workers than before e.g. 1-2 extra workers
  • AC2: wiki listing o3 workers and racktables is up-to-date

Suggestions


Related issues 2 (0 open, 2 closed)

Related to openQA Project - action #78390: Worker is stuck in "broken" state due to unavailable cache service (was: and even continuously fails to (re)connect to some configured web UIs) - Resolved - mkittler - 2021-01-18

Copied to openQA Project - action #129487: high response times on osd - Limit the number of concurrent job upload handling on webUI side. Can we use a semaphore or lock using the database? size:M - Rejected - okurz
Actions #1

Updated by okurz 11 months ago

  • Copied to action #129487: high response times on osd - Limit the number of concurrent job upload handling on webUI side. Can we use a semaphore or lock using the database? size:M added
Actions #2

Updated by okurz 11 months ago

  • Project changed from openQA Project to openQA Infrastructure
  • Category deleted (Feature requests)
Actions #3

Updated by okurz 11 months ago

  • Target version changed from Ready to future
Actions #4

Updated by okurz 11 months ago

  • Subject changed from high response times on osd - Consider reducing number of openQA worker instances to high response times on osd - Move OSD workers to o3 to prevent OSD overload
  • Description updated (diff)
  • Priority changed from Normal to High
  • Target version changed from future to Ready
Actions #5

Updated by livdywan 11 months ago

  • Subject changed from high response times on osd - Move OSD workers to o3 to prevent OSD overload to high response times on osd - Move OSD workers to o3 to prevent OSD overload size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #6

Updated by livdywan 10 months ago

  • Status changed from Workable to Feedback
  • Assignee set to livdywan

IPMI on worker6 is fine. Following our ticket handling process to get the zone updated as a first step, and updating the moving instructions, which refer to an invalid ticket link.

Unsure what to make of this:

salt-key -y -d worker6.oqa.suse.de
The key glob 'worker6.oqa.suse.de' does not match any unaccepted keys.
Actions #7

Updated by okurz 10 months ago

  • Due date set to 2023-06-26

I tried on OSD:

sudo salt-key -d worker6.oqa.suse.de
The following keys are going to be deleted:
Accepted Keys:
worker6.oqa.suse.de
Proceed? [N/y] n

I could reproduce your problem as non-root. Make sure to run it as root.

I added some details to the SD ticket and updated the wiki entry.

Actions #8

Updated by livdywan 10 months ago

  • Status changed from Feedback to Blocked

okurz wrote:

I could reproduce your problem as non-root. Make sure to run it as root.

Ack🤦🏾‍♀️

Works with that. Marking the ticket as blocked on the SD ticket now.

Actions #9

Updated by livdywan 10 months ago

  • Status changed from Blocked to In Progress

cdywan wrote:

Works with that. Marking the ticket as blocked on the SD ticket now.

The change seems to have worked; the machine disappeared from oqa.suse.de as expected. I also swapped the steps so that adjusting mount points and restarting the network on the worker come after restarting dnsmasq, since there's no way to reach the machine before that 🤓

Restart network on machine (over IPMI) using systemctl restart network and monitor on o3: journalctl -f -u dnsmasq until an address is assigned, e.g.:

Kind of wondering what the point of these instructions is... the machine is already reachable at this point and the network service can be restarted without getting disconnected from SSH.

Ensure all mountpoints up
mount -a

This seems to be hanging forever... will investigate this later if I see any problems, I guess.
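If this hangs again, a hedged sketch of how to avoid a stuck shell (timeout value arbitrary; /var/lib/openqa/share assumed as the NFS share path):

timeout 30 mount -a || echo "mount -a timed out, most likely the NFS share is stuck"
findmnt /var/lib/openqa/share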

Actions #10

Updated by livdywan 10 months ago

Change root password to o3 one

I guess the warning is expected and I assume it means I did set the correct password. Can't login via SSH, though🤔

Update /etc/openqa/workers.ini with similar config as used on other workers, e.g. based on openqaworker4, example:

Also copying EXTERNAL_VIDEO_ENCODER.
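For illustration only, a hypothetical sketch of such a workers.ini (the keys are real openQA worker settings, but the values are assumptions and not copied from openqaworker4):

[global]
HOST = http://openqa1-opensuse
WORKER_CLASS = qemu_x86_64
CACHEDIRECTORY = /var/lib/openqa/cache
# plus EXTERNAL_VIDEO_ENCODER_CMD copied from another worker, as mentioned above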

And apparently there was an explicit mention of the machine in salt which was causing pipeline failures despite the salt key having been removed early on.

systemctl disable --now auto-update.timer salt-minion telegraf

My shell's getting stuck running this 🤔

Actions #11

Updated by livdywan 10 months ago

cdywan wrote:

Change root password to o3 one

I guess the warning is expected and I assume it means I did set the correct password. Can't login via SSH, though🤔

This got resolved by installing openssh-server-config-rootlogin.

systemctl disable --now auto-update.timer salt-minion telegraf

My shell's getting stuck running this 🤔

Failed to get load state of salt-minion.service: Connection timed out
Actions #12

Updated by okurz 10 months ago

cdywan wrote:

cdywan wrote:

Change root password to o3 one

I guess the warning is expected and I assume it means I did set the correct password. Can't login via SSH, though🤔

This got resolved by installing openssh-server-config-rootlogin.

Not true. That package does not exist in Leap, see https://software.opensuse.org/package/openssh-server-config-rootlogin?search_term=openssh-server-config-rootlogin . I did the change with sed in the sshd config.
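The exact invocation is not in the ticket; presumably something along these lines, followed by an sshd restart:

sed -i 's/^#\?PermitRootLogin.*/PermitRootLogin yes/' /etc/ssh/sshd_config
systemctl restart sshd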

systemctl disable --now auto-update.timer salt-minion telegraf

My shell's getting stuck running this 🤔

Failed to get load state of salt-minion.service: Connection timed out

Then terminate it manually and try again. Or reboot first and try again.

Actions #13

Updated by livdywan 10 months ago

  • Status changed from In Progress to Workable
  • Assignee deleted (livdywan)

okurz wrote:

Failed to get load state of salt-minion.service: Connection timed out

Then terminate it manually and try again. Or reboot first and try again.

I did. That didn't help ;-)

Edit: My other comment update got lost. zypper -n dup --force-resolution --allow-vendor-change repeatedly got stuck trying to remove openQA-common. There was no output. I didn't manage to investigate this further whilst I still had a working connection.

I'm putting the ticket back for now while I can't reliably connect to any o3 workers. No tracker ticket filed yet since different people report it working for them, and "it's Thursday".

Actions #14

Updated by okurz 10 months ago

  • Status changed from Workable to In Progress
  • Assignee set to okurz

The system was never rebooted and it was stuck due to NFS file handles to openqa.suse.de still being open. We did a umount -l /var/lib/openqa/share but to get to a properly clean system we just triggered a reboot anyway. To shorten further reinitialization time I tried kexec but failed. Likely the correct command line would be kexec -l /boot/vmlinuz --initrd=/boot/initrd --reuse-cmdline && kexec -e. I extended the NFS mount line in https://progress.opensuse.org/projects/openqav3/wiki/Wiki#Moving-worker-from-osd-to-o3 as it was missing the special arguments we found to be quite helpful, e.g. noauto,nofail,retry=30,ro,x-systemd.automount,x-systemd.device-timeout=10m,x-systemd.mount-timeout=30m
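As a hedged sketch, the resulting fstab line could look roughly like this (server name and export path are assumptions, the option string is the one above):

openqa1-opensuse:/var/lib/openqa/share /var/lib/openqa/share nfs noauto,nofail,retry=30,ro,x-systemd.automount,x-systemd.device-timeout=10m,x-systemd.mount-timeout=30m 0 0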

So we rebooted, and after the reboot we found that the openQA worker services failed to register against openqa1-opensuse. That looked suspiciously like the problems observed in #78390 regarding nscd. Restarting the services nscd and openqa-worker-auto-restart@1 helped. Triggering another reboot to try to reproduce the problem.
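Presumably the restart that helped was along these lines (the exact invocation is not recorded here):

systemctl restart nscd openqa-worker-auto-restart@1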

EDIT: A reboot reproduces the problem.

Maybe related https://bugzilla.opensuse.org/show_bug.cgi?id=1149603

nicksinger disabled the nscd cache lookup with enable-cache … no in /etc/nscd.conf and after kexec'ing the openQA services immediately started up and began working on jobs.
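As a hypothetical sketch of that change (the comment does not name the cache table; "hosts" is assumed here):

# /etc/nscd.conf
enable-cache hosts no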

/etc/openqa/workers.ini had a wrong entry in WORKER_HOSTNAME. As WORKER_HOSTNAME should be determined automatically to a working value, I disabled the entry in the config file and restarted the openQA worker services again with systemctl restart openqa-worker-auto-restart@{1..30}. Monitoring openQA jobs, e.g.
https://openqa.opensuse.org/tests/3361924
on openqaworker6:1
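Regarding the WORKER_HOSTNAME fix above, a hedged sketch of the edit (the wrong value itself is not shown in the ticket):

[global]
# entry commented out so the worker determines its reachable hostname automatically
#WORKER_HOSTNAME = <wrong value>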

Regarding the nscd problem we investigated the systemd dependencies because we suspect some problem there. With for i in aarch64 openqaworker4 openqaworker6 openqaworker7 openqaworker19 openqaworker20 qa-power8-3 rebel; do echo $i && ssh root@$i "systemd-analyze critical-chain" ; done we found that the chain for worker6 looks quite different than for other workers.

openqaworker6
The time when unit became active or started is printed after the "@" character.
The time the unit took to start is printed after the "+" character.

multi-user.target @5min 25.302s
└─getty.target @5min 25.286s
  └─serial-getty@ttyS1.service @5min 25.271s
    └─rc-local.service @24.987s +5min 250ms
      └─basic.target @24.742s
        └─sockets.target @24.726s
          └─xvnc.socket @24.710s
            └─sysinit.target @24.002s
              └─systemd-update-utmp.service @23.943s +41ms
                └─auditd.service @23.735s +186ms
                  └─systemd-tmpfiles-setup.service @23.543s +174ms
                    └─local-fs.target @23.486s
                      └─run-user-0.mount @3min 32.216s
                        └─local-fs-pre.target @3.814s
                          └─systemd-tmpfiles-setup-dev.service @3.634s +163ms
                            └─kmod-static-nodes.service @3.043s +391ms
                              └─systemd-journald.socket
                                └─system.slice
                                  └─-.slice
openqaworker7
The time when unit became active or started is printed after the "@" character.
The time the unit took to start is printed after the "+" character.

multi-user.target @3min 29.136s
└─cron.service @3min 29.122s
  └─postfix.service @3min 26.806s +2.293s
    └─time-sync.target @3min 26.237s
      └─chronyd.service @3min 25.442s +778ms
        └─network.target @3min 25.410s
          └─wicked.service @2min 55.170s +30.223s
            └─wickedd-nanny.service @2min 55.130s +21ms
              └─wickedd.service @2min 55.042s +70ms
                └─openvswitch.service @2min 55.002s +18ms
                  └─ovs-vswitchd.service @2min 54.742s +241ms
                    └─ovs-delete-transient-ports.service @2min 54.694s +29ms
                      └─ovsdb-server.service @2min 53.368s +1.307s
                        └─network-pre.target @2min 53.297s
                          └─firewalld.service @2min 49.377s +3.895s
                            └─polkit.service @2min 56.485s +319ms
                              └─basic.target @2min 49.308s
                                └─sockets.target @2min 49.292s
                                  └─iscsid.socket @2min 49.276s
                                    └─sysinit.target @2min 48.852s
                                      └─systemd-update-utmp.service @2min 48.809s +26ms
                                        └─auditd.service @2min 48.618s +170ms
                                          └─systemd-tmpfiles-setup.service @2min 48.485s +114ms
                                            └─local-fs.target @2min 48.436s
                                              └─var-lib-openqa-share.automount @2min 48.419s
                                                └─var-lib-openqa.mount @2min 48.211s +171ms
                                                  └─dev-md127.device @5.980s

Also found that apparmor was masked. Unmasked and started it now.
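The exact commands are not in the comment; presumably:

systemctl unmask apparmor
systemctl start apparmor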

Actions #15

Updated by okurz 10 months ago

  • Related to action #78390: Worker is stuck in "broken" state due to unavailable cache service (was: and even continuously fails to (re)connect to some configured web UIs) added
Actions #16

Updated by mkittler 10 months ago

Because I've just got a complaint about broken jobs via #factory:opensuse.org on Matrix, I stopped all worker slots for now: systemctl disable --now openqa-worker-auto-restart@{1..30} && systemctl disable --now openqa-reload-worker-auto-restart@{1..30}.path && systemctl disable --now openqa-worker-cacheservice.service

Actions #17

Updated by okurz 10 months ago

uefi jobs failed like https://openqa.opensuse.org/tests/3363797

I called from o3

rsync -aHP root@openqaworker7:/usr/share/qemu/*staging* . && rsync -aHP *staging* root@openqaworker6:/usr/share/qemu/

and then to verify

openqa-clone-job --within-instance https://openqa.opensuse.org/tests/3363797 WORKER_CLASS=openqaworker6 TEST+=poo129484 BUILD= _GROUP_ID=0

The error in https://openqa.opensuse.org/tests/3361980 is a bit hidden in the log but rather explicit: "ffmpeg: command not found". But the job is incomplete and was properly restarted; at least that works fine. I will look into installing the missing package and restarting.

Actions #18

Updated by okurz 10 months ago

Following https://progress.opensuse.org/projects/openqav3/wiki/#Setup-guide-for-new-machines

okurz@ariel:~> rsync -aHP *staging* root@openqaworker6:/usr/share/qemu/

and on w6:

zypper -n in ffmpeg && systemctl enable --now openqa-worker-cacheservice openqa-worker-cacheservice-minion openqa-worker-auto-restart@1 openqa-reload-worker-auto-restart@1

I started just a single worker instance now. It picked up https://openqa.opensuse.org/tests/3362077

Actions #19

Updated by okurz 10 months ago

Multiple jobs were fine so I enabled more with systemctl enable --now openqa-worker-auto-restart@{2..30} openqa-reload-worker-auto-restart@{2..30}

Actions #20

Updated by okurz 10 months ago

Ok, that caused problems because worker instance 30 did not have vtap devices configured. Previously only 29 worker instances were used. So I disabled instance 30 again. Then we realized that we need to update the GRE tunnel connections. So in /etc/wicked/scripts/gre_tunnel_preup.sh we added the corresponding IPv4 entries for each "other" multi-machine openQA worker.
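A hedged sketch of what such an entry in gre_tunnel_preup.sh can look like, following the openQA multi-machine setup documentation (bridge and port names on openqaworker6 may differ; the IP is one of those listed below):

ovs-vsctl --may-exist add-port br1 gre1 -- set interface gre1 type=gre options:remote_ip=192.168.112.7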

Oh, and I did

WORKER=openqaworker6 failed_since="2023-06-15" openqa-advanced-retrigger-jobs

to retrigger recent incompletes on openqaworker6 that are not handled yet.

Checking with for i in aarch64 openqaworker4 openqaworker6 openqaworker7 openqaworker19 openqaworker20 qa-power8-3 rebel; do echo $i && ssh root@$i "systemctl list-timers | grep -i openqa" ; done I find more amiss:

aarch64
Mon 2023-06-19 14:00:34 CEST 3min 25s left      Mon 2023-06-19 13:55:34 CEST 1min 34s ago       openqa-continuous-update.timer openqa-continuous-update.service
Tue 2023-06-20 03:00:00 CEST 13h left           Mon 2023-06-19 03:00:34 CEST 10h ago            openqa-auto-update.timer       openqa-auto-update.service
openqaworker4
Mon 2023-06-19 13:59:41 CEST 2min 32s left      n/a                          n/a                openqa-continuous-update.timer openqa-continuous-update.service
Tue 2023-06-20 03:00:00 CEST 13h left           Mon 2023-06-19 13:49:17 CEST 7min ago           openqa-auto-update.timer       openqa-auto-update.service
openqaworker6
openqaworker7
Tue 2023-06-20 03:00:00 CEST 13h left           Mon 2023-06-19 03:00:40 CEST 10h ago            openqa-auto-update.timer       openqa-auto-update.service
n/a                          n/a                Mon 2023-06-19 13:55:19 CEST 1min 54s ago       openqa-continuous-update.timer openqa-continuous-update.service
openqaworker19
Mon 2023-06-19 13:57:28 CEST 13s left           Mon 2023-06-19 13:52:28 CEST 4min 46s ago       openqa-continuous-update.timer openqa-continuous-update.service
Tue 2023-06-20 03:00:00 CEST 13h left           Mon 2023-06-19 03:00:49 CEST 10h ago            openqa-auto-update.timer       openqa-auto-update.service
openqaworker20
Tue 2023-06-20 03:00:00 CEST 13h left           Mon 2023-06-19 03:00:03 CEST 10h ago            openqa-auto-update.timer     openqa-auto-update.service
qa-power8-3
Tue 2023-06-20 03:00:00 CEST 13h left           Mon 2023-06-19 03:00:20 CEST 10h ago            openqa-auto-update.timer       openqa-auto-update.service
n/a                          n/a                Mon 2023-06-19 13:57:15 CEST 234ms ago          openqa-continuous-update.timer openqa-continuous-update.service
rebel
Mon 2023-06-19 13:57:30 CEST 15s left           Mon 2023-06-19 13:52:30 CEST 4min 44s ago       openqa-continuous-update.timer openqa-continuous-update.service

So I enabled them with zypper -n in openQA-auto-update openQA-continuous-update && systemctl enable --now openqa-{auto,continuous}-update.timer

Also the output of for i in aarch64 openqaworker4 openqaworker6 openqaworker7 openqaworker19 openqaworker20 qa-power8-3 rebel; do echo $i && ssh root@$i "ovs-vsctl show | grep 192" ; done looks clean now:

aarch64
openqaworker4
                options: {remote_ip="192.168.112.17"}
                options: {remote_ip="192.168.112.18"}
                options: {remote_ip="192.168.112.23"}
                options: {remote_ip="192.168.112.12"}
openqaworker6
                options: {remote_ip="192.168.112.17"}
                options: {remote_ip="192.168.112.18"}
                options: {remote_ip="192.168.112.7"}
                options: {remote_ip="192.168.112.12"}
openqaworker7
                options: {remote_ip="192.168.112.6"}
                options: {remote_ip="192.168.112.7"}
                options: {remote_ip="192.168.112.17"}
                options: {remote_ip="192.168.112.18"}
openqaworker19
                options: {remote_ip="192.168.112.23"}
                options: {remote_ip="192.168.112.12"}
                options: {remote_ip="192.168.112.7"}
                options: {remote_ip="192.168.112.18"}
openqaworker20
                options: {remote_ip="192.168.112.12"}
                options: {remote_ip="192.168.112.7"}
                options: {remote_ip="192.168.112.23"}
                options: {remote_ip="192.168.112.17"}
Actions #21

Updated by okurz 10 months ago

  • Due date deleted (2023-06-26)
  • Status changed from In Progress to Resolved