action #158170
closed
openQA Project (public) - coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances
openQA Project (public) - coordination #158167: [epic] Increase worker capacity
Increase resources for s390x kvm size:M
Description
Motivation¶
https://suse.slack.com/archives/C02CANHLANP/p1711533706482229
(Oliver Kurz) @Matthias Griessmeier would you be interested in trying to acquire more s390x kvm testing resources? Looking into https://suse.slack.com/archives/C02CLB8TZP1/p1711532709502039 I found that s390x kvm openQA jobs have a significant schedule due to the limit of available instances. We would be able to run more instances with more memory assigned to the hypervisor LPAR.
Acceptance criteria¶
- AC1: s390zl12+13 run more than 5 VMs each
- AC2: openQA jobs on s390zl12+13 still consistently pass and no related monitoring alerts
Suggestions¶
- s390zl12+13 have more resources
- There are already more VMs configured by mgriessmeier
- Bring https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4951 forward, i.e. adjust DHCP pool addresses
- Then increase instances in workerconf.sls
- Verify while monitoring
Updated by okurz about 1 year ago
- Project changed from openQA Project (public) to openQA Infrastructure (public)
- Description updated (diff)
- Category changed from Feature requests to Feature requests
- Status changed from New to In Progress
Updated by okurz about 1 year ago
- Related to action #153958: [alert] s390zl12: Memory usage alert Generic memory_usage_alert_s390zl12 generic added
Updated by okurz about 1 year ago
Both s390zl12+13 will have double the original memory amount. Created https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/761 for re-enabling the previously disabled instances as part of #153958.
After that, wait for https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4951 to have DHCP/DNS entries prepared for more s390x kvm instances, then enable more instances in workerconf.sls
Updated by mgriessmeier about 1 year ago
s390zl12 and s390zl13 have been upgraded and now have 160 GB RAM each (double the previous amount) and 6.0 processors (previously 4.0)
I have prepared and reserved 20 more instances for future use with https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4951
Updated by openqa_review about 1 year ago
- Due date set to 2024-04-11
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz about 1 year ago
- Tracker changed from coordination to action
- Status changed from In Progress to Feedback
Right now the situation looks stable. s390zl12+13 are using more resources and both are back again with +2 instances each. More is still pending on https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4951
Updated by okurz 12 months ago
https://suse.slack.com/archives/C02CANHLANP/p1712228880986239
(Oliver Kurz) @Matthias Griessmeier will you follow up on https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4951 regarding the DHCP pool adjustment or do we need to take over?
Updated by nicksinger 12 months ago
- Status changed from Workable to In Progress
- Assignee set to nicksinger
Updated by nicksinger 12 months ago
- Status changed from In Progress to Feedback
IPs for machines adjusted in https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4977. I will wait for the merge before bringing up the corresponding worker instances.
Updated by nicksinger 12 months ago
- Status changed from Feedback to Workable
Merged. Ready to be worked on again, e.g. by validating that the entries work and that the mentioned machines are ready to be used.
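One way to validate the new DNS entries is a quick resolver sweep. A minimal sketch, assuming the s390kvmNNN.oqa.prg2.suse.org naming scheme and the 100-119 range reported in the failures further down (adjust the range to the actual MR content):

```shell
#!/bin/sh
# Hypothetical hostnames of the newly added instances; the naming scheme and
# range follow the failures reported later in this ticket (s390kvm100..119).
new_hosts() {
    for n in $(seq 100 119); do
        echo "s390kvm${n}.oqa.prg2.suse.org"
    done
}

# Resolver sweep: an instance whose DNS entry was not applied will not resolve.
for h in $(new_hosts); do
    if getent hosts "$h" > /dev/null 2>&1; then
        echo "$h resolves"
    else
        echo "$h does NOT resolve"
    fi
done
```

This only checks name resolution, not that DHCP actually hands out the matching lease to the guest.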
Updated by nicksinger 12 months ago
- Status changed from Workable to Feedback
Updated by nicksinger 12 months ago
- Status changed from Feedback to Workable
Merged. It would have been wise to add the new instances with a ticket suffix, but now we're testing live. Let's review on Monday whether the new virsh instances perform as expected.
Updated by okurz 12 months ago · Edited
Doesn't go well. https://openqa.suse.de/admin/workers/3087 shows no successful jobs on the new instances. In particular https://openqa.suse.de/tests/14103625#step/bootloader_zkvm/44 states
# Test died: Error connecting to VNC server <s390kvm115.oqa.prg2.suse.org:5901>: IO::Socket::INET: connect: No route to host at /usr/lib/os-autoinst/testapi.pm line 1690.
I checked all the s390-kvm workers and saw consistent failures on s390kvm100…s390kvm119 but I have also seen s390kvm093 consistently failing https://openqa.suse.de/admin/workers/2650, not sure about that one.
I created https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/784 for mitigation, will merge and restart all related failures with
for i in worker33 worker35 worker40; do host=openqa.suse.de WORKER=$i failed_since=2024-04-19 result="result='failed'" comment="label:poo158170" ./openqa-advanced-retrigger-jobs; done
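The "No route to host" failures above can be swept across the whole batch at once. A hedged sketch probing the VNC port taken from the error message (the 100-119 range follows the observations above; the `nc` probe is illustrative, not part of the actual mitigation):

```shell
#!/bin/sh
# Suspect instances, per the observations above (s390kvm100..s390kvm119).
suspects() {
    for n in $(seq 100 119); do
        echo "s390kvm${n}.oqa.prg2.suse.org"
    done
}

# Probe the VNC port from the error message; a failed connect here matches
# the "No route to host" the openQA jobs ran into in bootloader_zkvm.
for h in $(suspects); do
    if nc -z -w 3 "$h" 5901 2>/dev/null; then
        echo "$h: VNC port 5901 reachable"
    else
        echo "$h: VNC port 5901 unreachable"
    fi
done
```

A host that resolves but is unreachable on 5901 points at routing/DHCP rather than a missing DNS entry.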
Updated by nicksinger 12 months ago
- Status changed from Workable to In Progress
DHCP configs did not properly apply after the merge because suttner1 apparently was "out of sync" with suttner2 - not sure what or who fixed that, but we're good now: https://openqa.suse.de/tests/overview?build=nsinger_s390validation
The failing instances most likely hit the sexagesimal quirk, which I am trying to fix or work around now. After this is done we can merge https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/791 to finally bring them into production.
Updated by okurz 12 months ago
Feel welcome to block on https://jira.suse.com/browse/ENGINFRA-4030 "suttner1.oqa.prg2.suse.org+suttner2.oqa.prg2.suse.org times are both out of sync with NTP causing DHCP-failover to fail" any time and escalate to a line manager of your choice :)
Updated by nicksinger 12 months ago
- Status changed from In Progress to Feedback
well, we're good for now :) https://openqa.suse.de/tests/overview?build=nsinger_s390validation
Updated by nicksinger 11 months ago
- Status changed from Feedback to Resolved
I checked the instances. There are a lot of red container tests, but these look like test issues. A few green jobs in between show that the workers do their job as expected.
Updated by nicksinger 11 months ago
- Status changed from Resolved to In Progress
worker36+37 are offline because of https://progress.opensuse.org/issues/157726 (and linked tickets), meaning we are missing 10 instances. Not sure how I overlooked them previously, but we have to move them now. Doing this now.
Updated by nicksinger 11 months ago
- Status changed from In Progress to Resolved
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/804 moved the slots around and OSD now has 20 production worker slots (zl13 disabled due to https://progress.opensuse.org/issues/159066) which are capable of successfully completing jobs.
Updated by jbaier_cz 11 months ago
- Related to action #160598: [alert] s390zl12: CPU load alert openQA s390zl12 salt cpu_load_alert_s390zl12 worker size:S added