action #158170: Increase resources for s390x kvm size:M - openQA Infrastructure (public) - openSUSE Project Management Tool

action #158170

closed

openQA Project (public) - coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances

openQA Project (public) - coordination #158167: [epic] Increase worker capacity

Increase resources for s390x kvm size:M

Added by okurz about 1 year ago. Updated 11 months ago.

Status:

Resolved

Priority:

Normal

Assignee:

nicksinger

Category:

Feature requests

Target version:

openQA Project (public) - Ready

Start date:

2024-03-27

Due date:

% Done:

Estimated time:

Tags:

infra

Description

Motivation¶

https://suse.slack.com/archives/C02CANHLANP/p1711533706482229

(Oliver Kurz) @Matthias Griessmeier would you be interested in trying to acquire more s390x kvm testing resources? Looking into https://suse.slack.com/archives/C02CLB8TZP1/p1711532709502039 I found that s390x kvm openQA jobs have a significant queue of scheduled jobs due to the limited number of available instances. We would be able to run more instances with more memory assigned to the hypervisor LPAR.

Acceptance criteria¶

AC1: s390zl12+13 run more than 5 VMs each
AC2: openQA jobs on s390zl12+13 still consistently pass and no related monitoring alerts

Suggestions¶

s390zl12+13 have more resources
There are already more VMs configured by mgriessmeier
Bring https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4951 forward, i.e. adjust DHCP pool addresses
Then increase instances in workerconf.sls
Verify while monitoring

Related issues 2 (0 open — 2 closed)

Updated by okurz about 1 year ago

Project changed from openQA Project (public) to openQA Infrastructure (public)
Description updated (diff)
Category changed from Feature requests to Feature requests
Status changed from New to In Progress

Updated by okurz about 1 year ago

Related to action #153958: [alert] s390zl12: Memory usage alert Generic memory_usage_alert_s390zl12 generic added

Updated by okurz about 1 year ago

Both s390zl12+13 will have double the original memory amount. Created https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/761 for re-enabling the previously disabled instances as part of #153958.

After that, waiting for https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4951 so that DHCP/DNS entries for more s390x kvm instances are prepared, then enable more instances in workerconf.sls.

Updated by mgriessmeier about 1 year ago

s390zl12 and s390zl13 have been upgraded and now have 160GB RAM each (double the previous amount) and 6.0 processors (previously 4.0).

I have prepared and reserved 20 more instances for future use with https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4951

Updated by openqa_review about 1 year ago

Due date set to 2024-04-11

Setting due date based on mean cycle time of SUSE QE Tools

Updated by okurz about 1 year ago

Tracker changed from coordination to action
Status changed from In Progress to Feedback

Right now the situation looks stable. s390zl12+13 are using more resources and both are back with +2 instances. More instances are still pending on https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4951

Updated by okurz 12 months ago

https://suse.slack.com/archives/C02CANHLANP/p1712228880986239

(Oliver Kurz) @Matthias Griessmeier will you follow up on https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4951 regarding the DHCP pool adjustment or do we need to take over?

Updated by okurz 12 months ago

Description updated (diff)
Due date deleted (~~2024-04-11~~)
Status changed from Feedback to New
Assignee deleted (~~okurz~~)

Updated by okurz 12 months ago

Subject changed from Increase ressources for s390x kvm to Increase resources for s390x kvm size:M
Description updated (diff)
Status changed from New to Workable

#10

Updated by nicksinger 12 months ago

Status changed from Workable to In Progress
Assignee set to nicksinger

#11

Updated by nicksinger 12 months ago

Status changed from In Progress to Feedback

IPs for the machines were adjusted in https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4977. I will wait for the merge before bringing up the corresponding worker instances.

#12

Updated by nicksinger 12 months ago

Status changed from Feedback to Workable

Merged. Ready to be worked on again, e.g. by validating that the entries work and that the mentioned machines are ready to be used.

#13

Updated by nicksinger 12 months ago

Status changed from Workable to Feedback

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/781

#14

Updated by nicksinger 12 months ago

Status changed from Feedback to Workable

Merged. It would have been wise to add the new instances with a ticket suffix, but now we're testing live. Let's review on Monday whether the new virsh instances perform as expected.

#15

Updated by okurz 12 months ago · Edited

This is not going well. https://openqa.suse.de/admin/workers/3087 shows no successful jobs on the new instances. In particular, https://openqa.suse.de/tests/14103625#step/bootloader_zkvm/44 states:

# Test died: Error connecting to VNC server <s390kvm115.oqa.prg2.suse.org:5901>: IO::Socket::INET: connect: No route to host at /usr/lib/os-autoinst/testapi.pm line 1690.

I checked all the s390-kvm workers and saw consistent failures on s390kvm100…s390kvm119 but I have also seen s390kvm093 consistently failing https://openqa.suse.de/admin/workers/2650, not sure about that one.
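A quick reachability check over the affected range could look like the following minimal sketch. It assumes the guests follow the `s390kvmNNN.oqa.prg2.suse.org` naming and the VNC port 5901 seen in the error message above; the helper names `kvm_hostnames` and `vnc_reachable` are hypothetical and not part of openQA or os-autoinst.

```python
import socket


def kvm_hostnames(first=100, last=119, domain="oqa.prg2.suse.org"):
    """Build the guest hostnames for an inclusive number range.

    The naming scheme is taken from the error message in this ticket;
    the default range matches the instances reported as failing.
    """
    return [f"s390kvm{n:03d}.{domain}" for n in range(first, last + 1)]


def vnc_reachable(host, port=5901, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened.

    Any OSError (DNS failure, "No route to host", refused, timeout)
    is reported as False, so this does not tell those cases apart.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    for host in kvm_hostnames():
        state = "reachable" if vnc_reachable(host) else "NOT reachable"
        print(f"{host}: {state}")
```

Note that the VNC port is only expected to be open while a guest is running, so a negative result for an idle instance does not by itself indicate a DHCP or routing problem.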

I created https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/784 for mitigation, will merge and restart all related failures with:

for i in worker33 worker35 worker40; do WORKER=$i host=openqa.suse.de failed_since=2024-04-19 result="result='failed'" comment="label:poo158170" ./openqa-advanced-retrigger-jobs; done

#16

Updated by nicksinger 12 months ago

Status changed from Workable to In Progress

DHCP configs did not properly apply after the merge because suttner1 apparently was "out of sync" with suttner2 - not sure what or who fixed that but we're good now: https://openqa.suse.de/tests/overview?build=nsinger_s390validation

The failing instances most likely hit the sexagesimal quirk, which I am now trying to fix or work around. Once this is done we can merge https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/791 to finally bring them into production.
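For readers unfamiliar with the term: assuming the "sexagesimal quirk" refers to YAML 1.1 resolving unquoted, colon-separated digit groups as base-60 integers (the ticket does not say which value was affected), the effect can be sketched as below. The function name and the sample values are made up for illustration; the pattern is the base-60 integer form from the YAML 1.1 type specification.

```python
import re

# YAML 1.1 base-60 integer form: [-+]?[1-9][0-9_]*(:[0-5]?[0-9])+
SEXAGESIMAL_INT = re.compile(r"[-+]?[1-9][0-9_]*(?::[0-5]?[0-9])+")


def yaml11_sexagesimal_value(scalar):
    """Return the integer a YAML 1.1 resolver would produce for an
    unquoted scalar, or None if the scalar does not match the base-60
    integer pattern and would therefore stay a string."""
    if not SEXAGESIMAL_INT.fullmatch(scalar):
        return None
    sign = -1 if scalar.startswith("-") else 1
    digits = scalar.lstrip("+-").replace("_", "")
    total = 0
    for group in digits.split(":"):
        total = total * 60 + int(group)
    return sign * total


if __name__ == "__main__":
    # Hypothetical sample values, not taken from the actual configuration.
    for sample in ("52:54:00:12:34:56", "52:54:00:ab:cd:ef", "1:30"):
        print(sample, "->", yaml11_sexagesimal_value(sample))
```

Under this assumption, a value made up only of digit groups below 60 is silently turned into a number, while one containing hex letters stays a string. Quoting the value in the YAML source avoids the conversion.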

#17

Related to action #160598: [alert] s390zl12: CPU load alert openQA s390zl12 salt cpu_load_alert_s390zl12 worker size:S added
