action #139103
closed
openQA Project (public) - coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances
openQA Project (public) - coordination #139010: [epic] Long OSD ppc64le job queue
Long OSD ppc64le job queue - Decrease number of x86_64 worker slots on osd to give ppc64le jobs a better chance to be assigned jobs size:M
Added by okurz about 1 year ago. Updated 2 months ago.
Description
Motivation¶
Currently on OSD there is a long job queue, in particular for ppc64le, for multiple reasons, see #139010. One idea is to decrease the number of x86_64 worker slots on osd so that ppc64le jobs have a better chance to be assigned despite the OSD openQA instance job limit.
Acceptance criteria¶
- AC1: The impact of worker instance ratio by arch/class has been verified
- AC2: Given the openQA instance job limit is impacting the ppc64le job queue When the ratio of ppc64le/all workers has been increased Then the ppc64le job age is lower
Suggestions¶
- DONE Look up the current number of x86_64 and qemu ppc64le jobs, assuming that we have a very low ppc64le/all ratio, e.g. many workers for qemu_x86_64 and very few for qemu_ppc64le (16 as of 2023-11-04).
- DONE Reduce number of x86_64 qemu slots if we have "too many"
- Monitor the impact on qemu_ppc64le job age (see the query sketch below this list)
- Increase the number of ppc64le machines and then re-enable x86_64 machines again
- Take care to apply the workarounds from #157975-12 to prevent accidental distribution upgrades
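For the job-age monitoring step above, a rough query sketch against the openQA database (assuming the usual jobs table columns arch, state and t_created):
select arch, count(*) as scheduled_jobs, max(now() - t_created) as max_age from jobs where state = 'scheduled' group by arch order by max_age desc;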
Rollback steps¶
- Re-enable openQA OSD workers w35-w36, remove according alert https://monitor.qa.suse.de/alerting/silence/e2c36842-e6a9-4d48-aeef-330c3d8604c7/edit?alertmanager=grafana
- Revert https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/687 to enable multi-machine tests after ensuring stability
Out of scope¶
- Any code changes for the scheduler
Updated by okurz about 1 year ago
- Copied from action #139100: Long OSD ppc64le job queue - Move nue3 power8 machines to nue2 added
Updated by okurz about 1 year ago
- Description updated (diff)
- Status changed from New to Feedback
I ran the SQL query
select host,count(distinct(w.id)) from workers w join worker_properties wp on w.id = wp.worker_id where w.t_seen >= '2023-11-01' group by host;
with the result:
host | count
------------------+-------
diesel | 8
imagetester | 18
openqa-piworker | 3
openqaworker1 | 11
openqaworker14 | 16
openqaworker16 | 20
openqaworker17 | 20
openqaworker18 | 20
petrol | 8
qesapworker-prg4 | 24
qesapworker-prg5 | 23
qesapworker-prg6 | 24
qesapworker-prg7 | 22
sapworker1 | 32
sapworker2 | 33
sapworker3 | 29
worker-arm1 | 40
worker-arm2 | 40
worker29 | 49
worker30 | 57
worker31 | 50
worker32 | 50
worker33 | 50
worker34 | 50
worker35 | 40
worker36 | 40
worker37 | 40
worker38 | 40
worker39 | 40
worker40 | 46
(30 rows)
From this we cannot easily see which exact worker classes the machines are using, but by looking into https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls we can cross-reference a little.
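For a rough per-host overview of worker classes one could extend the query to pull the WORKER_CLASS property directly (a sketch, assuming the key/value columns of worker_properties):
select w.host, wp.value as worker_class, count(distinct(w.id)) as slots from workers w join worker_properties wp on w.id = wp.worker_id where wp.key = 'WORKER_CLASS' and w.t_seen >= '2023-11-01' group by w.host, wp.value order by w.host;
Note that the value is the full comma-separated class list per slot, so this only gives a coarse grouping.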
Disabling the two worker machines w35+w36 with:
sudo salt 'worker3[5-6].oqa.*' cmd.run "sudo systemctl disable --now telegraf \$(systemctl list-units | grep openqa-worker-auto-restart | cut -d . -f 1 | xargs); sudo poweroff" && sudo salt-key -y -d worker3[5-6].oqa.*
Updated by okurz about 1 year ago
- Subject changed from Long OSD ppc64le job queue - Decrease number of x86_64 worker slots on osd to give ppc64le jobs a better chance to be assigned jobs to Long OSD ppc64le job queue - Decrease number of x86_64 worker slots on osd to give ppc64le jobs a better chance to be assigned jobs size:M
- Description updated (diff)
Updated by okurz about 1 year ago · Edited
- Status changed from Feedback to In Progress
Given that with #139271 we have many more qemu_ppc64le worker slots I am bringing worker3[5-6] back into production.
Powered them on and then ran
salt --no-color 'worker3[5-6].oqa.*' --state-output=changes state.apply | grep -va 'Result: Clean'
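Besides the web UI, the re-registration can also be checked via the API, roughly like this (a sketch; the jq field names are assumptions):
openqa-cli api --osd workers | jq -r '.workers[] | select(.host | test("worker3[56]")) | "\(.host):\(.instance) \(.status)"'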
Updated by okurz about 1 year ago
- Due date changed from 2023-11-25 to 2023-11-30
- Status changed from In Progress to Feedback
Machines showed up fine in https://openqa.suse.de/admin/workers again. Waiting for jobs to be executed on those hosts over the next days.
Updated by acarvajal about 1 year ago
During QE-SAP osd review today, we started noticing multiple Multi-Machine errors in the HA/SAP Aggregate jobs from 2023-11-23, whereas jobs from the previous days were passing without issues.
The most common failure seems to be the SUT failing to resolve names outside of openQA (updates.suse.com, scc.suse.com, download.suse.de), and then also failing to upload logs to 10.0.2.2.
The name resolution issue could point to a communication problem between the SUT and the DNS server in the support server job, but the failure to reach 10.0.2.2 could point to a bigger issue.
Examples:
- job in w35 failing to resolve updates.suse.com and failing to connect to 10.0.2.2. Other jobs in w38 (support server), w29, w30 (https://openqa.suse.de/tests/12864114)
- another job in w35 with the same symptoms. Other jobs in w29 (support server), w36 & w39 (https://openqa.suse.de/tests/12872442#step/qnetd/29)
- job in w30 with the same symptoms. Other jobs in w35 (https://openqa.suse.de/tests/12872444)
- job in w37 with the same symptoms. Other jobs in w36 (https://openqa.suse.de/tests/12872454#step/iscsi_client/32)
- job in w37 with the same symptoms. Other jobs in w36 (support server), w35 & w38 (https://openqa.suse.de/tests/12872461#step/qnetd/28)
- job in w30 with the same symptoms. Other jobs in w35 (support server) & w29 (https://openqa.suse.de/tests/12872465#step/iscsi_client/35)
- job in w39 with the same symptoms. Other jobs in w35 (support server) & w40 (https://openqa.suse.de/tests/12872482#step/iscsi_client/35)
- job in w40 with the same symptoms. Other jobs in w36 (support server) & w35 (https://openqa.suse.de/tests/12872499#step/iscsi_client/32)
- job in w36 with the same symptoms. Other jobs in w38 (support server) & w35 & w39 (https://openqa.suse.de/tests/12872436#step/iscsi_client/34)
The following jobs also had a failure in name resolution, but did not attempt a connection to 10.0.2.2:
- job in w35 failed to resolve scc.suse.com. Other jobs in w29 (support server) & w35 (https://openqa.suse.de/tests/12872400#step/suseconnect_scc/20)
- another one like the one before, this one in w39. Other jobs in w36 (support server) & w35 (https://openqa.suse.de/tests/12872392#step/suseconnect_scc/20)
And finally this one which was different but could be related:
- job in w35 failed with network not configured. Other jobs in w30 (support server) & w38 (https://openqa.suse.de/tests/12872417#step/register_system/12)
Initially I suspected something wrong with worker35, but looking only at the results above was not conclusive.
I manually restarted all these jobs, and during the course of the afternoon saw restarted jobs fail whenever one of the nodes or the support server was picked up by worker35 or worker36.
Example:
- https://openqa.suse.de/tests/12876165#step/iscsi_client/34 (w35)
- https://openqa.suse.de/tests/12876161#step/iscsi_client/34 (w36)
So my current suspicion is that there is something wrong in the MM configuration of these 2 workers.
I checked with ovs-vsctl show, checked the IPv4 forwarding settings below /proc/sys/net/ipv4 and compared sysctl -a on both worker35 and worker38 (the latter as a control) and found no obvious differences, so no idea so far why one seems to be working and the other not.
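For reference, a quick way to capture and diff that state between a failing and a working host could look like this (a sketch; the hostnames and the chosen sysctls are assumptions, and ovs-vsctl output contains per-host UUIDs so some noise is expected):
for h in worker35 worker38; do
  ssh root@$h.oqa.prg2.suse.org "ovs-vsctl show; sysctl net.ipv4.ip_forward net.ipv4.conf.all.forwarding" > /tmp/mm-$h.txt
done
diff -u /tmp/mm-worker35.txt /tmp/mm-worker38.txt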
After several restarts, failures in https://openqa.suse.de/group_overview/405 decreased from 12 to 2, and due to https://suse.slack.com/archives/C02CANHLANP/p1700750992725609?thread_ts=1700727149.287059&cid=C02CANHLANP, I expect the 2 ongoing jobs to finish successfully.
Will add more details as I find them.
Updated by okurz about 1 year ago
- Status changed from Feedback to In Progress
- Priority changed from Low to High
Due to the problems mentioned above I powered down w35+w36.
TODO remove from salt again and block on "multi machine debugging issues"
Updated by okurz about 1 year ago
- Related to action #151382: [qe-sap] test fails in iscsi_client with unclear error message, please add "ping_size_check" from https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/17817 to rule out MTU problems added
Updated by okurz about 1 year ago
- Due date deleted (2023-11-30)
- Status changed from In Progress to Blocked
- Priority changed from High to Low
I removed w35+w36 from salt again. Blocking on #151382
Updated by okurz 6 months ago
- Related to action #162296: openQA workers crash with Linux 6.4 after upgrade openSUSE Leap 15.6 size:S added
Updated by okurz 4 months ago
- Category set to Feature requests
- Status changed from Blocked to Workable
- Assignee deleted (okurz)
- Priority changed from Low to High
- Target version changed from future to Ready
#151382 wasn't picked up for 10 months. But by now we have a pending change in the infrastructure which could adversely affect us, see #165282. For this it would be good if we can bring back more x86_64 openQA workers before NUE2 might lose the connection as we had it before.
Updated by nicksinger 4 months ago
- Status changed from Workable to In Progress
worker35 is already online and has been working properly for a long time. worker36 was online but not reachable via ssh. Connecting via SOL worked but I don't have the root password handy currently, so I am giving rebooting a try now because the machine shows no network connectivity at all. Boot shows we currently run Leap 15.6 on there, maybe that's the reason for no network?
Updated by nicksinger 4 months ago
oh well, the current 15.6 kernel is happily crashing along every few minutes. No network even after a "proper" reboot. I will try to get into some kind of recovery system now and will eventually reinstall it to our current stable configuration.
Updated by nicksinger 4 months ago
okurz wrote in #note-21:
Did you follow #139103-13 ?
thanks, I'm going to follow that approach. Just confirmed from an efi-shell that network is fine:
Shell> ifconfig -l
-----------------------------------------------------------------
name : eth0
Media State : Media present
policy : static
mac addr : 7C:C2:55:24:DE:DE
ipv4 address : 0.0.0.0
subnet mask : 0.0.0.0
default gateway: 0.0.0.0
Routes (0 entries):
DNS server :
-----------------------------------------------------------------
name : eth1
Media State : Media present
policy : static
mac addr : 7C:C2:55:24:DE:DF
ipv4 address : 0.0.0.0
subnet mask : 0.0.0.0
Shell> ifconfig -s eth0 dns 192.168.0.8 192.168.0.9
Shell> ifconfig -s eth0 dhcp
Shell> ping 1.1.1.1
16 bytes from 1.1.1.1 : icmp_seq=2 ttl=0 time0~53ms
2 packets transmitted, 2 received, 0% packet loss, time 0ms
-----------------------------------------------------------------
Rtt(round trip time) min=0~53ms max=0~53ms avg=0~53ms
Updated by openqa_review 3 months ago
- Due date set to 2024-09-27
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz 3 months ago
- Copied to action #166802: Recover worker37, worker38, worker39 size:S added
Updated by okurz 3 months ago
- Related to action #157726: osd-deployment | Failed pipeline for master (worker3[6-9].oqa.prg2.suse.org) added
Updated by nicksinger 3 months ago
@okurz helped me to log into the machine. Only a few snapshots are available and likely none contains an old kernel. Going to reinstall the machine now.
Updated by okurz 3 months ago
As an alternative you can also forcefully install a Leap 15.5 kernel and firewalld as noted in https://bugzilla.suse.com/show_bug.cgi?id=1227616#c20 and we can see whether that helps to keep an otherwise Leap 15.6 system stable.
Updated by nicksinger 3 months ago
I found no easy way to boot anything on worker36 yet as the running machine has no network in Linux (I have not tried setting it up manually) and I cannot boot anything via UEFI PXE (haven't tried legacy yet) because there seems to be no PXE server present. The HTML5 console in the web UI only has a grayed out "Virtual Media" button. Virtual media itself in the BMC web UI requires a Samba host (inside the firewalled .qe-ipmi-ur network) which I don't have handy. I also haven't managed yet to use SMCIPMITool ( https://www.thomas-krenn.com/de/wiki/IPMI_Virtual_Media_einbinden# ) to mount virtual media because I would need to execute it on qe-jumpy.prg2.suse.org (if that is even possible because of all the Java stuff around it).
Updated by okurz 3 months ago · Edited
- Assignee changed from nicksinger to okurz
I will give it a try. I booted the normal system and reproduced the crashes after some time. I booted again and disabled the firewalld service. Then I ran
wicked --log-level info ifup all
with the result
wicked: lo: configuration applied to nanny
wicked: eth0: configuration applied to nanny
[ 327.488637][ T2632] pps pps0: new PPS source ptp0
[ 327.493993][ T2632] ixgbe 0000:41:00.0: registered PHC device on eth0
[ 331.933937][ T856] ixgbe 0000:41:00.0 eth0: NIC Link is Up 10 Gbps, Flow Control: RX/TX
wicked: Interface wait time reached
lo up
eth0 up
and can use the network.
Regarding reinstallation attempts we can consider instructions from #132137-6 and https://wiki.suse.net/index.php/OpenQA#Installation_of_OSD_PRG2_workers
On http://download.opensuse.org/update/leap/15.5/sle/noarch/?P=*firewalld* I found http://download.opensuse.org/update/leap/15.5/sle/noarch/firewalld-0.9.3-150400.8.12.1.noarch.rpm . Maybe we can downgrade to that? I did
zypper in --oldpackage http://download.opensuse.org/update/leap/15.5/sle/noarch/{firewalld,firewalld-lang,python3-firewall}-0.9.3-150400.8.12.1.noarch.rpm
removed the conflicting firewalld-bash-completion and added a package lock with
zypper al -m "boo#1227616" *firewall*
I think dheidler used the machine for debugging that bug. I disabled the kernel:head repo and did
zypper dup
After reboot the system came up fine including network. I now enabled the downgraded firewalld again with
systemctl enable --now firewalld
Let's monitor.
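To verify later that the downgrade and the lock are still in effect, something like the following should suffice (a sketch, version string as noted above):
zypper ll # lists package locks, should include the *firewall* lock with the boo#1227616 comment
rpm -q firewalld python3-firewall # both should still report 0.9.3-150400.8.12.1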
Updated by okurz 3 months ago
- Status changed from In Progress to Feedback
Looks all good and I have seen multiple production OSD jobs on it. livdywan fixed a problem with auto-update in the related ticket. The worker is good. Now let's see how the system behaves with additional workers online. https://openqa.suse.de/admin/workers right now shows 947 worker instances online, so still below the count we had problems with in the past.
Updated by okurz 3 months ago · Edited
While monitoring the operation it looks like not as many jobs are running as would be possible. On osd in /var/log/openqa_scheduler I find
[2024-09-18T18:40:46.762323Z] [debug] [pid:15820] [Job#15468085] Prepare for being processed by worker 2899
[2024-09-18T18:50:47.026654Z] [warn] [pid:15820] Failed to send data to websocket server, reason: Inactivity timeout at /usr/share/openqa/script/../lib/OpenQA/WebSockets/Client.pm line 27.
[2024-09-18T18:50:47.026823Z] [warn] [pid:15820] Failed sending job(s) '15468085' to worker '2899': unknown error
[2024-09-18T18:50:47.035165Z] [debug] [pid:15820] Job 15468085 reset to state scheduled
[2024-09-18T18:50:47.060326Z] [debug] [pid:15820] Assigned job '15468104' to worker ID '2880'
[2024-09-18T18:50:47.063860Z] [debug] [pid:15820] [Job#15468104] Prepare for being processed by worker 2880
[2024-09-18T19:00:47.191988Z] [warn] [pid:15820] Failed to send data to websocket server, reason: Inactivity timeout at /usr/share/openqa/script/../lib/OpenQA/WebSockets/Client.pm line 27.
This looks similar to #135122 and related tickets. https://openqa.suse.de/admin/workers shows that we currently have 1070 worker instances connected. Following https://progress.opensuse.org/projects/openqav3/wiki/#Take-machines-out-of-salt-controlled-production I took w38+w39 out of production again. Apparently immediately afterwards OSD could schedule jobs again and showed in /var/log/openqa_scheduler
[2024-09-18T19:09:32.060402Z] [debug] [pid:15820] Scheduler took 1729.79666s to perform operations and allocated 28 jobs
which is excessive.
From grep 'Scheduler took' /var/log/openqa_scheduler | less
I see that usually a scheduling cycle takes 1-2s. Since 2024-09-18T16:00Z, so around the time I brought back w36+w37, the cycle time increased to 5-20s; after I brought back w38+w39 it reached 97s at 18:35Z and, as mentioned above, 1730s at 19:09Z.
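To follow the cycle times over time rather than paging through less, timestamp and duration can be extracted from the log format shown above, e.g. (a sketch):
grep 'Scheduler took' /var/log/openqa_scheduler | sed -E 's/^\[([^]]+)\].*Scheduler took ([0-9.]+)s.*/\1 \2/'
This yields one "timestamp seconds" pair per scheduler tick, which makes the jump from 1-2s to hundreds of seconds easy to spot.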