action #158170

closed

openQA Project (public) - coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances

openQA Project (public) - coordination #158167: [epic] Increase worker capacity

Increase resources for s390x kvm size:M

Added by okurz 9 months ago. Updated 8 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Feature requests
Start date: 2024-03-27
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Motivation

https://suse.slack.com/archives/C02CANHLANP/p1711533706482229

(Oliver Kurz) @Matthias Griessmeier would you be interested in trying to acquire more s390x kvm testing resources? Looking into https://suse.slack.com/archives/C02CLB8TZP1/p1711532709502039 I found that s390x kvm openQA jobs have a significant scheduling backlog due to the limit of available instances. We would be able to run more instances with more memory assigned to the hypervisor LPAR.

Acceptance criteria

  • AC1: s390zl12+13 run more than 5 VMs each
  • AC2: openQA jobs on s390zl12+13 still consistently pass and there are no related monitoring alerts

Suggestions


Related issues 2 (0 open, 2 closed)

Related to openQA Infrastructure (public) - action #153958: [alert] s390zl12: Memory usage alert Generic memory_usage_alert_s390zl12 generic (Resolved, okurz, 2024-01-19)

Related to openQA Infrastructure (public) - action #160598: [alert] s390zl12: CPU load alert openQA s390zl12 salt cpu_load_alert_s390zl12 worker size:S (Resolved, jbaier_cz)

Actions #1

Updated by okurz 9 months ago

  • Project changed from openQA Project (public) to openQA Infrastructure (public)
  • Description updated (diff)
  • Category changed from Feature requests to Feature requests
  • Status changed from New to In Progress
Actions #2

Updated by okurz 9 months ago

  • Related to action #153958: [alert] s390zl12: Memory usage alert Generic memory_usage_alert_s390zl12 generic added
Actions #3

Updated by okurz 9 months ago

Both s390zl12+13 will have double the original memory amount. Created https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/761 for re-enabling the previously disabled instances as part of #153958.

After that, waiting for https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4951 to prepare DHCP/DNS entries for more s390x kvm instances, then enabling more instances in workerconf.sls.

Actions #4

Updated by mgriessmeier 9 months ago

s390zl12 and s390zl13 have been upgraded and now have 160GB RAM each (double the previous amount) and 6.0 processors (previously 4.0).

I have prepared and reserved 20 more instances for future use with https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4951

Actions #5

Updated by openqa_review 9 months ago

  • Due date set to 2024-04-11

Setting due date based on mean cycle time of SUSE QE Tools

Actions #6

Updated by okurz 9 months ago

  • Tracker changed from coordination to action
  • Status changed from In Progress to Feedback

Right now the situation looks stable. s390zl12+13 are using more resources and both are back again with +2 instances. More is still pending on https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4951

Actions #7

Updated by okurz 9 months ago

https://suse.slack.com/archives/C02CANHLANP/p1712228880986239

(Oliver Kurz) @Matthias Griessmeier will you follow up on https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4951 regarding the DHCP pool adjustment or do we need to take over?

Actions #8

Updated by okurz 9 months ago

  • Description updated (diff)
  • Due date deleted (2024-04-11)
  • Status changed from Feedback to New
  • Assignee deleted (okurz)
Actions #9

Updated by okurz 9 months ago

  • Subject changed from Increase ressources for s390x kvm to Increase resources for s390x kvm size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #10

Updated by nicksinger 8 months ago

  • Status changed from Workable to In Progress
  • Assignee set to nicksinger
Actions #11

Updated by nicksinger 8 months ago

  • Status changed from In Progress to Feedback

IPs for machines adjusted in https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4977. I will wait for the merge before bringing up the corresponding worker instances.

Actions #12

Updated by nicksinger 8 months ago

  • Status changed from Feedback to Workable

Merged. Ready to be worked on again, e.g. by validating that the entries work and that the mentioned machines are ready to be used.

Actions #13

Updated by nicksinger 8 months ago

  • Status changed from Workable to Feedback
Actions #14

Updated by nicksinger 8 months ago

  • Status changed from Feedback to Workable

Merged. It would have been wise to add the new instances with a ticket suffix, but now we're testing live. Let's review on Monday whether the new virsh instances perform as expected.

Actions #15

Updated by okurz 8 months ago · Edited

Doesn't go well. https://openqa.suse.de/admin/workers/3087 shows no successful jobs on the new instances. In particular https://openqa.suse.de/tests/14103625#step/bootloader_zkvm/44 states

# Test died: Error connecting to VNC server <s390kvm115.oqa.prg2.suse.org:5901>: IO::Socket::INET: connect: No route to host at /usr/lib/os-autoinst/testapi.pm line 1690.

I checked all the s390-kvm workers and saw consistent failures on s390kvm100…s390kvm119 but I have also seen s390kvm093 consistently failing https://openqa.suse.de/admin/workers/2650, not sure about that one.

I created https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/784 for mitigation, will merge and restart all related failures with

for i in worker33 worker35 worker40; do WORKER="$i" host=openqa.suse.de failed_since=2024-04-19 result="result='failed'" comment="label:poo158170" ./openqa-advanced-retrigger-jobs; done
Actions #16

Updated by nicksinger 8 months ago

  • Status changed from Workable to In Progress

DHCP configs did not properly apply after the merge because suttner1 apparently was "out of sync" with suttner2 - not sure what or who fixed that but we're good now: https://openqa.suse.de/tests/overview?build=nsinger_s390validation

The failing instances most likely hit the sexagesimal quirk, which I am trying to fix or work around now. After this is done we can merge https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/791 to finally bring them into production.
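
For readers unfamiliar with the term: assuming the "sexagesimal quirk" refers to YAML 1.1 treating unquoted colon-separated digit groups as base-60 integers (the ticket itself does not spell this out), the following minimal sketch shows why an all-decimal MAC address in a pillar file could silently turn into a number. The function name `yaml11_scalar` and the sample values are hypothetical and only illustrate the pattern; other implicit YAML types are ignored here.

```python
import re

# Base-60 integer pattern as given in the YAML 1.1 "int" type definition.
# Assumption: this is the rule behind the quirk mentioned in the comment above.
SEXAGESIMAL_INT = re.compile(r"[-+]?[1-9][0-9_]*(?::[0-5]?[0-9])+")


def yaml11_scalar(value: str):
    """Return an int if the unquoted scalar matches the base-60 pattern,
    otherwise return the string unchanged (hypothetical helper)."""
    if not SEXAGESIMAL_INT.fullmatch(value):
        return value
    sign = -1 if value.startswith("-") else 1
    digits = value.lstrip("+-").replace("_", "")
    total = 0
    for part in digits.split(":"):
        # Each colon-separated group is one base-60 digit.
        total = total * 60 + int(part)
    return sign * total


# A MAC address with hex letters stays a string ...
print(repr(yaml11_scalar("52:54:00:aa:bb:cc")))
# ... but one made only of decimal groups below 60 becomes an integer.
print(repr(yaml11_scalar("52:54:00:12:34:56")))
```

Under this assumption, quoting such values in the YAML source is the usual way to keep them as strings.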

Actions #17

Updated by okurz 8 months ago

Feel welcome to block on https://jira.suse.com/browse/ENGINFRA-4030 "suttner1.oqa.prg2.suse.org+suttner2.oqa.prg2.suse.org times are both out of sync with NTP causing DHCP-failover to fail" any time and escalate to a line manager of your choice :)

Actions #18

Updated by nicksinger 8 months ago

  • Status changed from In Progress to Feedback
Actions #19

Updated by nicksinger 8 months ago

  • Status changed from Feedback to Resolved

I checked the instances. There are a lot of red container tests, but those look like test issues. A few green jobs in between show that the workers do their job as expected.

Actions #20

Updated by nicksinger 8 months ago

  • Status changed from Resolved to In Progress

worker36+37 are offline because of https://progress.opensuse.org/issues/157726 (and linked), meaning we are missing 10 instances. Not sure how I missed them previously, but we have to move them now. Doing this now.

Actions #21

Updated by nicksinger 8 months ago

  • Status changed from In Progress to Resolved

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/804 moved the slots around and OSD now has 20 production instances (zl13 disabled due to https://progress.opensuse.org/issues/159066) which are capable of successfully completing jobs.

Actions #22

Updated by jbaier_cz 7 months ago

  • Related to action #160598: [alert] s390zl12: CPU load alert openQA s390zl12 salt cpu_load_alert_s390zl12 worker size:S added