action #139103

openQA Project - coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances

openQA Project - coordination #139010: [epic] Long OSD ppc64le job queue

Long OSD ppc64le job queue - Decrease number of x86_64 worker slots on osd to give ppc64le jobs a better chance to be assigned jobs size:M

Added by okurz 6 months ago. Updated 3 months ago.

Status: Blocked
Priority: Low
Assignee: -
Category: -
Target version: Tools - Next
Start date: 2023-11-04
Due date: -
% Done: 0%
Estimated time: -

Description

Motivation

Currently on OSD there is a longer job queue, in particular for ppc64le, for multiple reasons, see #139010. One idea is to decrease the number of x86_64 worker slots on OSD to give ppc64le jobs a better chance of being assigned, given the OSD openQA instance job limit.

Acceptance criteria

  • AC1: The impact of the worker instance ratio by arch/class has been verified
  • AC2: Given the openQA instance job limit is impacting the ppc64le job queue, When the ratio of ppc64le/all workers has been increased, Then the ppc64le job age is lower

Suggestions

  • DONE Look up the current number of x86_64 and ppc64le qemu jobs, assuming that we have a very low ppc64le/all ratio, e.g. many workers for qemu_x86_64 and very few for qemu_ppc64le (16 as of 2023-11-04).
  • DONE Reduce the number of x86_64 qemu slots if we have "too many"
  • Monitor the impact on the qemu_ppc64le job age (see the query sketch after this list)
  • Increase the number of ppc64le machines and then re-enable the x86_64 machines again
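
A minimal sketch for the job-age monitoring point above, as an SQL query against the openQA database; the jobs columns used here (state, arch, t_created) are assumptions based on the current openQA schema:

select arch, count(*) as scheduled, avg(now() - t_created) as avg_age
  from jobs
 where state = 'scheduled'
 group by arch
 order by avg_age desc;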

Rollback steps

Out of scope

  • Any code changes for the scheduler

Related issues: 2 (1 open, 1 closed)

Related to openQA Tests - action #151382: [qe-sap] test fails in iscsi_client with unclear error message, please add "ping_size_check" from https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/17817 to rule out MTU problems (New, 2023-11-24)

Copied from openQA Infrastructure - action #139100: Long OSD ppc64le job queue - Move nue3 power8 machines to nue2 (Resolved, okurz, 2023-11-04)

Actions #1

Updated by okurz 6 months ago

  • Copied from action #139100: Long OSD ppc64le job queue - Move nue3 power8 machines to nue2 added
Actions #2

Updated by okurz 6 months ago

  • Description updated (diff)
  • Status changed from New to Feedback

I called an SQL query:

select host, count(distinct(w.id)) from workers w join worker_properties wp on w.id = wp.worker_id where w.t_seen >= '2023-11-01' group by host;

       host       | count 
------------------+-------
 diesel           |     8
 imagetester      |    18
 openqa-piworker  |     3
 openqaworker1    |    11
 openqaworker14   |    16
 openqaworker16   |    20
 openqaworker17   |    20
 openqaworker18   |    20
 petrol           |     8
 qesapworker-prg4 |    24
 qesapworker-prg5 |    23
 qesapworker-prg6 |    24
 qesapworker-prg7 |    22
 sapworker1       |    32
 sapworker2       |    33
 sapworker3       |    29
 worker-arm1      |    40
 worker-arm2      |    40
 worker29         |    49
 worker30         |    57
 worker31         |    50
 worker32         |    50
 worker33         |    50
 worker34         |    50
 worker35         |    40
 worker36         |    40
 worker37         |    40
 worker38         |    40
 worker39         |    40
 worker40         |    46
(30 rows)

From this we cannot easily see which exact worker classes the machines are using, but by looking into https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls we can cross-reference a little bit.
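
A variant of the query from above that would show the classes directly, assuming the worker class is stored in worker_properties under the key WORKER_CLASS (possibly as a comma-separated list per slot):

select host, wp.value as worker_class, count(distinct w.id)
  from workers w join worker_properties wp on w.id = wp.worker_id
 where wp.key = 'WORKER_CLASS' and w.t_seen >= '2023-11-01'
 group by host, wp.value
 order by host;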

Disabling two worker machines w35+w36:

# stop telegraf and all openqa-worker-auto-restart units, power both machines off, then delete their salt keys
sudo salt 'worker3[5-6].oqa.*' cmd.run "sudo systemctl disable --now telegraf \$(systemctl list-units | grep openqa-worker-auto-restart | cut -d . -f 1 | xargs); sudo poweroff" && sudo salt-key -y -d worker3[5-6].oqa.*
Actions #3

Updated by okurz 6 months ago

  • Subject changed from Long OSD ppc64le job queue - Decrease number of x86_64 worker slots on osd to give ppc64le jobs a better chance to be assigned jobs to Long OSD ppc64le job queue - Decrease number of x86_64 worker slots on osd to give ppc64le jobs a better chance to be assigned jobs size:M
  • Description updated (diff)
Actions #4

Updated by okurz 5 months ago · Edited

  • Status changed from Feedback to In Progress

Given that with #139271 we have many more qemu-ppc64le worker slots I am bringing worker3[56] back to production.
Powered them on and then:

# apply the full salt state, showing only non-clean results
salt --no-color 'worker3[5-6].oqa.*' --state-output=changes state.apply | grep -va 'Result: Clean'
Actions #5

Updated by okurz 5 months ago

  • Due date changed from 2023-11-25 to 2023-11-30
  • Status changed from In Progress to Feedback

Machines showed up fine in https://openqa.suse.de/admin/workers again. Waiting for jobs to be executed on those hosts over the next days.

Actions #6

Updated by acarvajal 5 months ago

During the QE-SAP OSD review today, we started noticing multiple multi-machine errors in the HA/SAP aggregate jobs from 2023-11-23, whereas jobs from the previous days were passing without issues.

The most common failure seems to be the SUT attempting and failing to resolve names outside of openQA (updates.suse.com, scc.suse.com, download.suse.de), and then also failing to upload logs to 10.0.2.2.

The name resolution issue could point to a communication problem between the SUT and the DNS server in the support server job, but the failure to reach 10.0.2.2 could point to a bigger issue.
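
A minimal triage sketch from a SUT console for this kind of failure, assuming the usual openQA MM layout where 10.0.2.2 is the worker-side gateway and DNS is served by the support server; the hostnames are the ones listed above:

# can the SUT reach the worker-side gateway at all?
ping -c 3 10.0.2.2
# which DNS server is the SUT actually configured to use?
cat /etc/resolv.conf
# does name resolution work through it?
getent hosts updates.suse.com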

Examples:

The following jobs also had a failure in name resolution, but did not attempt a connection to 10.0.2.2:

And finally this one which was different but could be related:

Initially I suspected something wrong with worker35, but looking only at the results above was not conclusive.

I manually restarted all these jobs, and during the course of the afternoon saw restarted jobs fail when one of the nodes or the support server was picked up by worker35 or worker36.

Example:

So my current suspicion is that there is something wrong in the multi-machine configuration of these 2 workers.

I checked ovs-vsctl show and the IPv4 forwarding settings below /proc/sys/net/ipv4, as well as sysctl -a, on both worker35 and worker38 (the latter as a control node) and found no obvious differences, so no idea so far why one seems to work and the other does not.
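
A comparison sketch for that kind of check, assuming ssh access to both workers and that br1 is the MM bridge name (the usual openQA default, an assumption here); worker38 acts as the known-good reference:

# collect the relevant network state from both hosts, then diff the two dumps
for h in worker35 worker38; do
    ssh "$h" 'ovs-vsctl show; sysctl -a 2>/dev/null | grep -E "net.ipv4.(ip_forward|conf.(all|br1).(forwarding|rp_filter))"' > "/tmp/mm-$h.txt"
done
diff -u /tmp/mm-worker35.txt /tmp/mm-worker38.txt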

After several restarts, failures in https://openqa.suse.de/group_overview/405 decreased from 12 to 2, and due to https://suse.slack.com/archives/C02CANHLANP/p1700750992725609?thread_ts=1700727149.287059&cid=C02CANHLANP, I expect the 2 ongoing jobs to finish successfully.

Will add more details as I find them.

Actions #7

Updated by okurz 5 months ago

  • Status changed from Feedback to In Progress
  • Priority changed from Low to High

Due to the issues with the above-mentioned machines I powered down w35+w36.

TODO: remove from salt again and block on "multi machine debugging issues"

Actions #8

Updated by okurz 5 months ago

  • Related to action #151382: [qe-sap] test fails in iscsi_client with unclear error message, please add "ping_size_check" from https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/17817 to rule out MTU problems added
Actions #9

Updated by okurz 5 months ago

  • Due date deleted (2023-11-30)
  • Status changed from In Progress to Blocked
  • Priority changed from High to Low

I removed w35+w36 from salt again. Blocking on #151382.
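
A sketch of that removal, mirroring the salt-key deletion from #2:

sudo salt-key -y -d worker3[5-6].oqa.*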

Actions #10

Updated by okurz 5 months ago

  • Description updated (diff)
Actions #11

Updated by okurz 3 months ago

  • Target version changed from Ready to Tools - Next