action #139103
closed
openQA Project (public) - coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances
openQA Project (public) - coordination #139010: [epic] Long OSD ppc64le job queue
Long OSD ppc64le job queue - Decrease number of x86_64 worker slots on osd to give ppc64le jobs a better chance to be assigned jobs size:M
Added by okurz about 1 year ago. Updated 2 months ago.
Description
Motivation¶
Currently on OSD there is a long job queue, in particular for ppc64le, for multiple reasons, see #139010. One idea is to decrease the number of x86_64 worker slots on osd so that ppc64le jobs have a better chance to be assigned despite the OSD openQA instance job limit.
Acceptance criteria¶
- AC1: The impact of worker instance ratio by arch/class has been verified
- AC2: Given the openQA instance job limit is impacting the ppc64le job queue When the ratio of ppc64le/all workers has been increased Then the ppc64le job age is lower
Suggestions¶
- DONE Look up the current number of x86_64 and qemu ppc64le jobs, assuming that we have a very low ppc64le/all ratio, e.g. many workers for qemu_x86_64 and very few for qemu_ppc64le (16 as of 2023-11-04).
- DONE Reduce number of x86_64 qemu slots if we have "too many"
- Monitor the impact on qemu_ppc64le job age (see the query sketch below this list)
- Increase the number of ppc64le machines and then re-enable x86_64 machines again
- Take care to apply the workarounds from #157975-12 to prevent accidental distribution upgrades
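For the job-age monitoring step above, a rough query sketch against the openQA database (assuming the usual jobs table columns arch, state and t_created):
select arch, count(*) as scheduled_jobs, max(now() - t_created) as max_age from jobs where state = 'scheduled' group by arch order by max_age desc;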
Rollback steps¶
- Re-enable openQA OSD workers w35-w36, remove according alert https://monitor.qa.suse.de/alerting/silence/e2c36842-e6a9-4d48-aeef-330c3d8604c7/edit?alertmanager=grafana
- Revert https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/687 to enable multi-machine tests after ensuring stability
Out of scope¶
- Any code changes for the scheduler
Updated by okurz about 1 year ago
- Copied from action #139100: Long OSD ppc64le job queue - Move nue3 power8 machines to nue2 added
Updated by okurz about 1 year ago
- Description updated (diff)
- Status changed from New to Feedback
I ran the SQL query
select host,count(distinct(w.id)) from workers w join worker_properties wp on w.id = wp.worker_id where w.t_seen >= '2023-11-01' group by host;
with the result:
host | count
------------------+-------
diesel | 8
imagetester | 18
openqa-piworker | 3
openqaworker1 | 11
openqaworker14 | 16
openqaworker16 | 20
openqaworker17 | 20
openqaworker18 | 20
petrol | 8
qesapworker-prg4 | 24
qesapworker-prg5 | 23
qesapworker-prg6 | 24
qesapworker-prg7 | 22
sapworker1 | 32
sapworker2 | 33
sapworker3 | 29
worker-arm1 | 40
worker-arm2 | 40
worker29 | 49
worker30 | 57
worker31 | 50
worker32 | 50
worker33 | 50
worker34 | 50
worker35 | 40
worker36 | 40
worker37 | 40
worker38 | 40
worker39 | 40
worker40 | 46
(30 rows)
From this we cannot easily see which exact worker classes the machines are using, but by looking into https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls we can cross-reference a little.
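For a rough per-host overview of worker classes one could extend the query to pull the WORKER_CLASS property directly (a sketch, assuming the key/value columns of worker_properties):
select w.host, wp.value as worker_class, count(distinct(w.id)) as slots from workers w join worker_properties wp on w.id = wp.worker_id where wp.key = 'WORKER_CLASS' and w.t_seen >= '2023-11-01' group by w.host, wp.value order by w.host;
Note that the value is the full comma-separated class list per slot, so this only gives a coarse grouping.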
Disabling the two worker machines w35+w36 with:
sudo salt 'worker3[5-6].oqa.*' cmd.run "sudo systemctl disable --now telegraf \$(systemctl list-units | grep openqa-worker-auto-restart | cut -d . -f 1 | xargs); sudo poweroff" && sudo salt-key -y -d worker3[5-6].oqa.*
Updated by okurz about 1 year ago
- Subject changed from Long OSD ppc64le job queue - Decrease number of x86_64 worker slots on osd to give ppc64le jobs a better chance to be assigned jobs to Long OSD ppc64le job queue - Decrease number of x86_64 worker slots on osd to give ppc64le jobs a better chance to be assigned jobs size:M
- Description updated (diff)
Updated by okurz about 1 year ago · Edited
- Status changed from Feedback to In Progress
Given that with #139271 we have many more qemu_ppc64le worker slots I am bringing worker3[5-6] back into production.
Powered them on and then ran
salt --no-color 'worker3[5-6].oqa.*' --state-output=changes state.apply | grep -va 'Result: Clean'
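Besides the web UI, the re-registration can also be checked via the API, roughly like this (a sketch; the jq field names are assumptions):
openqa-cli api --osd workers | jq -r '.workers[] | select(.host | test("worker3[56]")) | "\(.host):\(.instance) \(.status)"'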
Updated by okurz about 1 year ago
- Due date changed from 2023-11-25 to 2023-11-30
- Status changed from In Progress to Feedback
Machines showed up fine in https://openqa.suse.de/admin/workers again. Waiting for jobs to be executed on those hosts over the next days.
Updated by acarvajal about 1 year ago
During QE-SAP osd review today, we started noticing multiple Multi-Machine errors in the HA/SAP Aggregate jobs from 2023-11-23, whereas jobs from the previous days were passing without issues.
The most common failure seems to be the SUT failing to resolve names outside of openQA (updates.suse.com, scc.suse.com, download.suse.de), and then also failing to upload logs to 10.0.2.2.
The name resolution issue could point to a communication problem between the SUT and the DNS server in the support server job, but the failure to reach 10.0.2.2 could point to a bigger issue.
Examples:
- job in w35 failing to resolve updates.suse.com and failing to connect to 10.0.2.2. Other jobs in w38 (support server), w29, w30 (https://openqa.suse.de/tests/12864114)
- another job in w35 with the same symptoms. Other jobs in w29 (support server), w36 & w39 (https://openqa.suse.de/tests/12872442#step/qnetd/29)
- job in w30 with the same symptoms. Other jobs in w35 (https://openqa.suse.de/tests/12872444)
- job in w37 with the same symptoms. Other jobs in w36 (https://openqa.suse.de/tests/12872454#step/iscsi_client/32)
- job in w37 with the same symptoms. Other jobs in w36 (support server), w35 & w38 (https://openqa.suse.de/tests/12872461#step/qnetd/28)
- job in w30 with the same symptoms. Other jobs in w35 (support server) & w29 (https://openqa.suse.de/tests/12872465#step/iscsi_client/35)
- job in w39 with the same symptoms. Other jobs in w35 (support server) & w40 (https://openqa.suse.de/tests/12872482#step/iscsi_client/35)
- job in w40 with the same symptoms. Other jobs in w36 (support server) & w35 (https://openqa.suse.de/tests/12872499#step/iscsi_client/32)
- job in w36 with the same symptoms. Other jobs in w38 (support server) & w35 & w39 (https://openqa.suse.de/tests/12872436#step/iscsi_client/34)
The following jobs also had a failure in name resolution, but did not attempt a connection to 10.0.2.2:
- job in w35 failed to resolve scc.suse.com. Other jobs in w29 (support server) & w35 (https://openqa.suse.de/tests/12872400#step/suseconnect_scc/20)
- another one like the one before, this one in w39. Other jobs in w36 (support server) & w35 (https://openqa.suse.de/tests/12872392#step/suseconnect_scc/20)
And finally this one which was different but could be related:
- job in w35 failed with network not configured. Other jobs in w30 (support server) & w38 (https://openqa.suse.de/tests/12872417#step/register_system/12)
Initially I suspected something wrong with worker35, but looking only at the results above was not conclusive.
I manually restarted all these jobs, and during the course of the afternoon saw restarted jobs fail whenever one of the nodes or the support server was picked up by worker35 or worker36.
Example:
- https://openqa.suse.de/tests/12876165#step/iscsi_client/34 (w35)
- https://openqa.suse.de/tests/12876161#step/iscsi_client/34 (w36)
So my current suspicion is that there is something wrong in the MM configuration of these 2 workers.
I checked with ovs-vsctl show, checked the IPv4 forwarding settings below /proc/sys/net/ipv4 and compared sysctl -a on both worker35 and worker38 (the latter as a control) and found no obvious differences, so no idea so far why one seems to be working and the other not.
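For reference, a quick way to capture and diff that state between a failing and a working host could look like this (a sketch; the hostnames and the chosen sysctls are assumptions, and ovs-vsctl output contains per-host UUIDs so some noise is expected):
for h in worker35 worker38; do
  ssh root@$h.oqa.prg2.suse.org "ovs-vsctl show; sysctl net.ipv4.ip_forward net.ipv4.conf.all.forwarding" > /tmp/mm-$h.txt
done
diff -u /tmp/mm-worker35.txt /tmp/mm-worker38.txt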
After several restarts, failures in https://openqa.suse.de/group_overview/405 decreased from 12 to 2, and due to https://suse.slack.com/archives/C02CANHLANP/p1700750992725609?thread_ts=1700727149.287059&cid=C02CANHLANP, I expect the 2 ongoing jobs to finish successfully.
Will add more details as I find them.
Updated by okurz about 1 year ago
- Status changed from Feedback to In Progress
- Priority changed from Low to High
Due to the problems mentioned above I powered down w35+w36.
TODO remove from salt again and block on "multi machine debugging issues"
Updated by okurz about 1 year ago
- Related to action #151382: [qe-sap] test fails in iscsi_client with unclear error message, please add "ping_size_check" from https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/17817 to rule out MTU problems added
Updated by okurz about 1 year ago
- Due date deleted (2023-11-30)
- Status changed from In Progress to Blocked
- Priority changed from High to Low
I removed w35+w36 from salt again. Blocking on #151382
Updated by okurz 6 months ago
- Related to action #162296: openQA workers crash with Linux 6.4 after upgrade openSUSE Leap 15.6 size:S added
Updated by okurz 4 months ago
- Category set to Feature requests
- Status changed from Blocked to Workable
- Assignee deleted (okurz)
- Priority changed from Low to High
- Target version changed from future to Ready
#151382 wasn't picked up for 10 months. But by now we have a pending change in the infrastructure which could adversely affect us, see #165282. For this it would be good if we can bring back more x86_64 openQA workers before NUE2 might lose the connection as we had it before.
Updated by nicksinger 4 months ago
- Status changed from Workable to In Progress
worker35 is already online and has been working properly for a long time. worker36 was online but not reachable via ssh. Connecting via SOL worked but I don't have the root password handy currently, so I am giving rebooting a try now because the machine shows no network connectivity at all. Boot shows we currently run Leap 15.6 on there, maybe that's the reason for no network?
Updated by nicksinger 4 months ago
oh well, the current 15.6 kernel is happily crashing along every few minutes. No network even after a "proper" reboot. I will try to get into some kind of recovery system now and will eventually reinstall it to our current stable configuration.
Updated by nicksinger 4 months ago
okurz wrote in #note-21:
Did you follow #139103-13 ?
thanks, I'm going to follow that approach. Just confirmed from an efi-shell that network is fine:
Shell> ifconfig -l
-----------------------------------------------------------------
name : eth0
Media State : Media present
policy : static
mac addr : 7C:C2:55:24:DE:DE
ipv4 address : 0.0.0.0
subnet mask : 0.0.0.0
default gateway: 0.0.0.0
Routes (0 entries):
DNS server :
-----------------------------------------------------------------
name : eth1
Media State : Media present
policy : static
mac addr : 7C:C2:55:24:DE:DF
ipv4 address : 0.0.0.0
subnet mask : 0.0.0.0
Shell> ifconfig -s eth0 dns 192.168.0.8 192.168.0.9
Shell> ifconfig -s eth0 dhcp
Shell> ping 1.1.1.1
16 bytes from 1.1.1.1 : icmp_seq=2 ttl=0 time0~53ms
2 packets transmitted, 2 received, 0% packet loss, time 0ms
-----------------------------------------------------------------
Rtt(round trip time) min=0~53ms max=0~53ms avg=0~53ms
Updated by openqa_review 3 months ago
- Due date set to 2024-09-27
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz 3 months ago
- Copied to action #166802: Recover worker37, worker38, worker39 size:S added
Updated by okurz 3 months ago
- Related to action #157726: osd-deployment | Failed pipeline for master (worker3[6-9].oqa.prg2.suse.org) added
Updated by nicksinger 3 months ago
@okurz helped me to log into the machine. Only a few snapshots are available and likely none contains an old kernel. Going to reinstall the machine now.
Updated by okurz 3 months ago
As an alternative you can also forcefully install a Leap 15.5 kernel and firewalld as noted in https://bugzilla.suse.com/show_bug.cgi?id=1227616#c20 and we can see whether that helps to keep an otherwise Leap 15.6 system stable.
Updated by nicksinger 3 months ago
I found no easy way to boot anything on worker36 yet as the running machine has no network in Linux (I have not tried setting it up manually) and I cannot boot anything via UEFI PXE (haven't tried legacy yet) because there seems to be no PXE server present. The HTML5 console in the web UI only has a grayed out "Virtual Media" button. Virtual media itself in the BMC web UI requires a Samba host (inside the firewalled .qe-ipmi-ur network) which I don't have handy. I also haven't managed yet to use SMCIPMITool ( https://www.thomas-krenn.com/de/wiki/IPMI_Virtual_Media_einbinden# ) to mount virtual media because I would need to execute it on qe-jumpy.prg2.suse.org (if that is even possible because of all the Java stuff around it).
Updated by okurz 3 months ago · Edited
- Assignee changed from nicksinger to okurz
I will give it a try. I booted the normal system and reproduced the crashes after some time. I booted again and disabled the firewalld service. Then I ran
wicked --log-level info ifup all
with the result
wicked: lo: configuration applied to nanny
wicked: eth0: configuration applied to nanny
[ 327.488637][ T2632] pps pps0: new PPS source ptp0
[ 327.493993][ T2632] ixgbe 0000:41:00.0: registered PHC device on eth0
[ 331.933937][ T856] ixgbe 0000:41:00.0 eth0: NIC Link is Up 10 Gbps, Flow Control: RX/TX
wicked: Interface wait time reached
lo up
eth0 up
and can use the network.
Regarding reinstallation attempts we can consider instructions from #132137-6 and https://wiki.suse.net/index.php/OpenQA#Installation_of_OSD_PRG2_workers
On http://download.opensuse.org/update/leap/15.5/sle/noarch/?P=*firewalld* I found http://download.opensuse.org/update/leap/15.5/sle/noarch/firewalld-0.9.3-150400.8.12.1.noarch.rpm . Maybe we can downgrade to that? I did
zypper in --oldpackage http://download.opensuse.org/update/leap/15.5/sle/noarch/{firewalld,firewalld-lang,python3-firewall}-0.9.3-150400.8.12.1.noarch.rpm
removed the conflicting firewalld-bash-completion and added a package lock with
zypper al -m "boo#1227616" *firewall*
I think dheidler used the machine for debugging that bug. I disabled the kernel:head repo and did
zypper dup
After reboot the system came up fine including network. I now enabled the downgraded firewalld again with
systemctl enable --now firewalld
Let's monitor.
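To verify later that the downgrade and the lock are still in effect, something like the following should suffice (a sketch, version string as noted above):
zypper ll # lists package locks, should include the *firewall* lock with the boo#1227616 comment
rpm -q firewalld python3-firewall # both should still report 0.9.3-150400.8.12.1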
Updated by okurz 3 months ago
- Status changed from In Progress to Feedback
Looks all good and I have seen multiple production OSD jobs on it. livdywan fixed a problem with auto-update in the related ticket. The worker is good. Now let's see how the system behaves with additional workers online. https://openqa.suse.de/admin/workers right now shows 947 worker instances online, so still below the count we had problems with in the past.
Updated by okurz 3 months ago · Edited
While monitoring the operation it looks like not as many jobs are running as would be possible. On osd in /var/log/openqa_scheduler I find
[2024-09-18T18:40:46.762323Z] [debug] [pid:15820] [Job#15468085] Prepare for being processed by worker 2899
[2024-09-18T18:50:47.026654Z] [warn] [pid:15820] Failed to send data to websocket server, reason: Inactivity timeout at /usr/share/openqa/script/../lib/OpenQA/WebSockets/Client.pm line 27.
[2024-09-18T18:50:47.026823Z] [warn] [pid:15820] Failed sending job(s) '15468085' to worker '2899': unknown error
[2024-09-18T18:50:47.035165Z] [debug] [pid:15820] Job 15468085 reset to state scheduled
[2024-09-18T18:50:47.060326Z] [debug] [pid:15820] Assigned job '15468104' to worker ID '2880'
[2024-09-18T18:50:47.063860Z] [debug] [pid:15820] [Job#15468104] Prepare for being processed by worker 2880
[2024-09-18T19:00:47.191988Z] [warn] [pid:15820] Failed to send data to websocket server, reason: Inactivity timeout at /usr/share/openqa/script/../lib/OpenQA/WebSockets/Client.pm line 27.
This looks similar to #135122 and related tickets. https://openqa.suse.de/admin/workers shows that we currently have 1070 worker instances connected. Following https://progress.opensuse.org/projects/openqav3/wiki/#Take-machines-out-of-salt-controlled-production I took w38+w39 out of production again. Apparently immediately afterwards OSD could schedule jobs again and showed in /var/log/openqa_scheduler
[2024-09-18T19:09:32.060402Z] [debug] [pid:15820] Scheduler took 1729.79666s to perform operations and allocated 28 jobs
which is excessive.
From grep 'Scheduler took' /var/log/openqa_scheduler | less
I see that usually a scheduling cycle takes 1-2s. Since 2024-09-18T16:00Z, so around the time I brought back w36+w37, the cycle time increased to 5-20s; after I brought back w38+w39 it reached 97s at 18:35Z and, as mentioned above, 1730s at 19:09Z.
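To follow the cycle times over time rather than paging through less, timestamp and duration can be extracted from the log format shown above, e.g. (a sketch):
grep 'Scheduler took' /var/log/openqa_scheduler | sed -E 's/^\[([^]]+)\].*Scheduler took ([0-9.]+)s.*/\1 \2/'
This yields one "timestamp seconds" pair per scheduler tick, which makes the jump from 1-2s to hundreds of seconds easy to spot.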