action #90857
closed
Add redundancy for SAP multi machines tests - Extend openQA worker config to accommodate for upgraded RAM
Added by jadamek over 3 years ago. Updated over 3 years ago.
0%
Description
OpenQA QEM review reported an issue with our SAP HANA tests executed on Maintenance TestRepo.
We need either more resources or to execute the tests on more of the existing machines.
First, let me summarize the situation:
The timing is currently very tight because, as you know, the maintenance test repo is triggered twice a day.
That means 2 x 6 OS versions to test (12-SP3 to 15-SP2) with one HANA test per OS version.
Each run must complete before the next build is triggered, otherwise the jobs are tagged obsolete.
One HANA test requires 49 GB RAM: 2 x 24 GB (HANA machines) + 1 GB for the support server machine.
For these tests we only use openqaworker8 (sap_sle12) and openqaworker9 (sap_sle15). We set it up that way to limit the memory usage of the openQA instance (https://progress.opensuse.org/issues/73246).
As a result, the HANA tests run serially for sle12 as well as for sle15.
For instance:
The HANA test starts for 15 GA on openqaworker9 and lasts an hour and a half. Once it is done, the HANA test on 15 SP1 starts, and so on...
As we have 3 different 15 versions (GA, SP1, SP2), the tests take 4 hours and a half for SLE15 alone.
For SLE12, the HANA test lasts one hour, so as we have 3 different 12 versions (SP3, SP4, SP5), the tests take 3 hours for SLE12. 12-SP2 was removed recently.
Besides that, both workers are also used for Maintenance incidents, and we cannot know in advance how much capacity is needed there.
I agree the solution isn't redundant at all. If one of the workers is down, the tests cannot be executed elsewhere.
To speed up the tests, we could add memory to both workers (at least 64 GB per worker, not less, because the jobs are multi-machine jobs and are linked together).
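The figures in the description can be checked with a short calculation. This is a minimal sketch that uses only the numbers stated above (24 GB per HANA machine, 1 GB for the support server, 1.5 hours per SLE15 run, 1 hour per SLE12 run); the function and constant names are illustrative and not part of any openQA tooling.

```python
# Numbers taken from the ticket description; all names are illustrative only.
HANA_NODE_RAM_GB = 24      # each of the two HANA machines
SUPPORT_SERVER_RAM_GB = 1  # support server machine


def cluster_ram_gb(hana_nodes=2):
    """RAM needed by one multi-machine HANA test."""
    return hana_nodes * HANA_NODE_RAM_GB + SUPPORT_SERVER_RAM_GB


def serial_duration_hours(versions, hours_per_test):
    """Wall-clock time when the test for each OS version runs one after another."""
    return versions * hours_per_test


print(cluster_ram_gb())               # RAM for one HANA test
print(serial_duration_hours(3, 1.5))  # SLE15: GA, SP1, SP2
print(serial_duration_hours(3, 1.0))  # SLE12: SP3, SP4, SP5
```

With these inputs the sketch reproduces the 49 GB per test, 4.5 hours for SLE15 and 3 hours for SLE12 mentioned in the description.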
Updated by jadamek over 3 years ago
This merge request will add consistency between the SAP tests and free a little bit of RAM:
https://gitlab.suse.de/qa-css/openqa_ha_sap/-/merge_requests/274
Updated by okurz over 3 years ago
- Target version set to future
Thank you for the improvement in the MR. That should already help. Reading dmidecode on openqaworker8 I consider a memory upgrade feasible.
https://racktables.nue.suse.com/index.php?page=object&tab=default&object_id=7626
is the racktables entry for openqaworker8_9. Can you guys find sponsoring for a RAM upgrade related to SAP development work?
Updated by jadamek over 3 years ago
Thanks Oliver, I will raise the point to our team leads this afternoon.
Updated by runger over 3 years ago
Funding is secured. Will be taken out of my QE LSG labs budget. Matthias will organize procurement.
Updated by okurz over 3 years ago
- Status changed from New to In Progress
- Assignee set to nicksinger
- Target version changed from future to Ready
wow, that was quick :) Thanks a lot!
We checked dmidecode. There are 4 banks with support for max 256GB each. Currently we have 4x4x16GB. nicksinger researched what replacement modules would cost: "cheapest module found (https://geizhals.de/samsung-rdimm-32gb-m393a4k40bb1-crc-a1378231.html) this would make ~520€ for a 64GB upgrade (4*32 as we would need to replace 4 16GB dimms). maybe something more along the lines of https://www.mindfactory.de/product_info.php/32GB-Kingston-Server-Premier-KSM24RD4-32MEI-DDR4-2400-regECC-DIMM-CL17-_1250630.html - which would bump the price up to ~700€"
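The upgrade arithmetic in this comment can be sketched as follows. The sketch assumes that "4x4x16GB" means 4 banks with 4 modules of 16 GB each, which the ticket does not spell out; the per-module prices are simply the quoted totals divided by four and are therefore approximate.

```python
# Assumption: "4x4x16GB" = 4 banks x 4 modules x 16 GB (not confirmed in the ticket).
BANKS = 4
MODULES_PER_BANK = 4
OLD_MODULE_GB = 16
NEW_MODULE_GB = 32
REPLACED_MODULES = 4


def installed_gb(replaced=0):
    """Total installed RAM after swapping `replaced` old modules for new ones."""
    total_modules = BANKS * MODULES_PER_BANK
    return (total_modules - replaced) * OLD_MODULE_GB + replaced * NEW_MODULE_GB


before = installed_gb()
after = installed_gb(REPLACED_MODULES)
print(before, after, after - before)

# Approximate per-module price implied by the quoted totals for four modules.
for total_eur in (520, 700):
    print(total_eur / REPLACED_MODULES)
```

Under that reading, replacing four 16 GB modules with 32 GB modules gives a net gain of 64 GB per machine, which matches the "64GB upgrade" in the quote.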
Updated by openqa_review over 3 years ago
- Due date set to 2021-04-27
Setting due date based on mean cycle time of SUSE QE Tools
Updated by nicksinger over 3 years ago
runger wrote:
Funding is secured. Will be taken out of my QE LSG labs budget. Matthias will organize procurement.
I've requested an offer from delta. Matthias is included in the CC. Will update once we have that offer.
Updated by nicksinger over 3 years ago
- Status changed from In Progress to Feedback
I'm currently waiting for a decision made by mgmt. Matthi and Ralf are in the loop here
Updated by okurz over 3 years ago
- Due date changed from 2021-04-27 to 2021-05-27
As this is a ticket waiting for feedback from others I am suggesting a new due date one month in the future to check back if there is any response at that time or check back with them.
Updated by mgrifalconi over 3 years ago
Hello, some SAP updates are stuck now since they require 3 parallel jobs for worker sap_sle15 but there are only 2 workers for that. Another 2 are available as sap_sle12.
Chat discussion
https://chat.suse.de/channel/qem-openqa-review?msg=8iih2j3m7koNSfC6Q
Updated by livdywan over 3 years ago
@nicksinger Did you see the above question?
Updated by jadamek over 3 years ago
mgrifalconi wrote:
Hello, some SAP updates are stuck now since they require 3 parallel jobs for worker sap_sle15 but there are only 2 workers for that. Another 2 are available as sap_sle12.
Chat discussion
https://chat.suse.de/channel/qem-openqa-review?msg=8iih2j3m7koNSfC6Q
Hello Michael,
I remember this issue and as far as I know, it's already fixed.
The worker class for the supportserver was incorrect.
Updated by okurz over 3 years ago
- Subject changed from Add redundancy for SAP multi machines tests to Add redundancy for SAP multi machines tests - extend RAM on machines
So let's keep this ticket centered around the approach to buy a RAM upgrade for our machines.
@nicksinger deadline will be reached tomorrow, what's the status?
Updated by mgriessmeier over 3 years ago
quote is requested and will be purchased today or tomorrow
Updated by mgriessmeier over 3 years ago
RAM upgrade was ordered and is already sent out.
Updated by nicksinger over 3 years ago
- Status changed from Feedback to Blocked
As the sticks arrived in NBG by now I created https://infra.nue.suse.com/Ticket/Display.html?id=189709 and asked infra to build them into the machines. I will provide them with further assistance if needed. Please be aware that this upgrade will cause a downtime so if you have a timeslot where the machines need to run please let me know ASAP.
Updated by okurz over 3 years ago
- Due date changed from 2021-05-27 to 2021-06-22
ok, looks good. Bumping due date so that you can check the latest when the due date has passed if the blocking ticket progressed. Please remember for EngInfra tickets to use [openqa] in the subject and CC osd-admins@suse.de
Updated by mgriessmeier over 3 years ago
RAM has arrived at SUSE Office in NUE, I have put the package on the desk in Nick's office (3.2.12 iirc).
Please open infra ticket for installation.
I'm sorry - forget about my comment... please let infra know that I have moved the package to Nick's office... (I don't have vpn available atm)
Updated by nicksinger over 3 years ago
- Status changed from Blocked to Resolved
The new modules were built in today. I checked both machines and I can see them with dmidecode -t memory. According to free -h both machines now have 314Gi of RAM.
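A check like the one above can also be scripted. The following is a minimal sketch that extracts the MemTotal value from /proc/meminfo-style text and converts it to GiB; the sample text is made up to match the 314Gi figure and is not actual output from openqaworker8 or openqaworker9.

```python
def mem_total_gib(meminfo_text):
    """Return MemTotal from /proc/meminfo-style text, converted from kB to GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            kib = int(line.split()[1])  # the value is given in kB (1024-byte units)
            return kib / (1024 * 1024)
    raise ValueError("MemTotal not found")


# Made-up sample chosen to equal 314 GiB; not real output from the workers.
SAMPLE = "MemTotal:       329252864 kB\nMemFree:        1048576 kB\n"

print(mem_total_gib(SAMPLE))
```

On a Linux host the function could be fed the contents of /proc/meminfo instead of the sample string.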
Updated by okurz over 3 years ago
- Related to action #73246: [osd-admins][alert] openqaworker8: Memory usage alert added
Updated by okurz over 3 years ago
- Copied to action #93961: Add redundancy for SAP multi machines tests - extend RAM on machines added
Updated by okurz over 3 years ago
- Subject changed from Add redundancy for SAP multi machines tests - extend RAM on machines to Add redundancy for SAP multi machines tests - Extend openQA worker config to accommodate for upgraded RAM
- Due date deleted (2021-06-22)
- Status changed from Resolved to New
- Assignee deleted (nicksinger)
- Start date deleted (2021-04-08)
Very nice. I copied the ticket into #93961 assigned to you, nicksinger, and resolved so that we can reopen here and make better use of the machines with upgraded memory as I still see the need to tweak the worker config to actually use the extended resources for more worker instances.
Updated by okurz over 3 years ago
- Status changed from New to Feedback
- Assignee set to okurz
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/324
To add redundancy for SAP multi-machine tests we now have more RAM available in the corresponding openQA workers. I am proposing to have the SAP worker classes spread out over openqaworker8+9 so that we have 1. more worker instances to run such tests, 2. higher redundancy as either of the two machines is able to execute multi-machine clusters for both the classes "sap_sle12" and "sap_sle15".
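The redundancy gain from spreading the classes can be illustrated with a small sketch. The host and class names come from this ticket; the dictionaries are a simplified, hypothetical representation and not the actual salt pillar format used in the merge request.

```python
# Simplified, hypothetical layouts; not the real salt pillar structure.
BEFORE = {
    "openqaworker8": {"sap_sle12"},
    "openqaworker9": {"sap_sle15"},
}
AFTER = {
    "openqaworker8": {"sap_sle12", "sap_sle15"},
    "openqaworker9": {"sap_sle12", "sap_sle15"},
}


def hosts_for(layout, worker_class, down=frozenset()):
    """Hosts that offer `worker_class` and are not currently down."""
    return sorted(
        host
        for host, classes in layout.items()
        if worker_class in classes and host not in down
    )


# With openqaworker9 down, sap_sle15 jobs have no host before the change ...
print(hosts_for(BEFORE, "sap_sle15", down={"openqaworker9"}))
# ... but can still run on openqaworker8 afterwards.
print(hosts_for(AFTER, "sap_sle15", down={"openqaworker9"}))
```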
Updated by okurz over 3 years ago
merged and has been looking ok for 8 days.
Now, what was the reason to have separate "sap_sle12" and "sap_sle15" worker classes? I would like to simplify that and keep only "sap" classes or even better have no special class at all. I assume the main challenge is to have a much bigger RAM amount than normal, right?
Updated by acarvajal over 3 years ago
okurz wrote:
merged and has been looking ok for 8 days.
Now, what was the reason to have separate "sap_sle12" and "sap_sle15" worker classes? I would like to simplify that and keep only "sap" classes or even better have no special class at all. I assume the main challenge is to have a much bigger RAM amount than normal, right?
Apparently this was done to force MM jobs to run in the same worker, i.e., all QAM SLES for SAP Applications 12-SP* jobs in a given worker, and all QAM SLES for SAP Applications 15-SP* jobs in a given worker.
I think this can be simplified, and if there is still a need to have MM jobs running in the same worker, we can use something like WORKER_CLASS=openqaworker8 or WORKER_CLASS=openqaworker9 instead.
I will add a task to our backlog to replace sap_sle12 and sap_sle15 in the WORKER_CLASS setting with whatever name is chosen. The settings are currently in use in the Maintenance Incidents and Maintenance TestRepo job groups.
Updated by okurz over 3 years ago
Updated by acarvajal over 3 years ago
Merge requests to remove the setting from QAM job groups:
Maintenance Single Incidents: https://gitlab.suse.de/qa-css/openqa_ha_sap/-/merge_requests/313
Maintenance TestRepo: https://gitlab.suse.de/qa-maintenance/qam-openqa-yml/-/merge_requests/153
Oliver - just to confirm - will we follow the following sequence?
- Merge https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/327
- Update WORKER_CLASS setting on machine 64bit-sap-qam in osd (changing WORKER_CLASS=qemu_x86_64 to WORKER_CLASS=qemu_x86_64-large-mem)
- Merge https://gitlab.suse.de/qa-css/openqa_ha_sap/-/merge_requests/313 & https://gitlab.suse.de/qa-maintenance/qam-openqa-yml/-/merge_requests/153
Or did you have something else in mind?
I'm thinking we could leave the machine definition unchanged, and I add qemu_x86_64-large-mem to the WORKER_CLASS setting in the job groups, but not sure what is better.
Updated by okurz over 3 years ago
acarvajal wrote:
Merge requests to remove the setting from QAM job groups:
Maintenance Single Incidents: https://gitlab.suse.de/qa-css/openqa_ha_sap/-/merge_requests/313
Maintenance TestRepo: https://gitlab.suse.de/qa-maintenance/qam-openqa-yml/-/merge_requests/153
Oliver - just to confirm - will we follow the following sequence?
- Merge https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/327
- Update WORKER_CLASS setting on machine 64bit-sap-qam in osd (changing WORKER_CLASS=qemu_x86_64 to WORKER_CLASS=qemu_x86_64-large-mem)
- Merge https://gitlab.suse.de/qa-css/openqa_ha_sap/-/merge_requests/313 & https://gitlab.suse.de/qa-maintenance/qam-openqa-yml/-/merge_requests/153
Yes, sounds safe. https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/327 is already merged as it's just adding a new worker class setting which should not break any existing settings.
I'm thinking we could leave the machine definition unchanged, and I add qemu_x86_64-large-mem to the WORKER_CLASS setting in the job groups, but not sure what is better.
I advise using the machine definitions as an intermediate abstraction point. Just some days ago some cloud test scenarios had problems after I needed to set the worker class in jobs from machine definitions and these jobs had the worker class overridden in job templates directly. So that is one more reason to define the worker class in machines, with the exception of adding "tap" as an additional worker class requirement.
After all three steps that you mentioned above are done we can remove the "sap_sle*" worker class settings. But for this we should allow a reasonable grace period because otherwise retriggered older jobs that still have the old worker class restrictions would be stuck in scheduled state, never being executed.
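Why a grace period matters can be sketched as follows. The sketch assumes that a job is only picked up by a worker offering every class listed in the job's WORKER_CLASS; the concrete class lists for the workers and jobs are hypothetical examples built from names used in this ticket, not the real configuration.

```python
def parse_classes(value):
    """Split a comma-separated WORKER_CLASS value into a set of class names."""
    return {c.strip() for c in value.split(",") if c.strip()}


def can_run(job_worker_class, worker_classes):
    """Assumed rule: the worker must offer every class the job asks for."""
    return parse_classes(job_worker_class) <= parse_classes(worker_classes)


# Hypothetical worker configurations before and after removing "sap_sle*".
worker_now = "qemu_x86_64,qemu_x86_64-large-mem,sap_sle12,sap_sle15"
worker_after_cleanup = "qemu_x86_64,qemu_x86_64-large-mem"

old_job = "sap_sle15"              # older job retriggered with its old setting
new_job = "qemu_x86_64-large-mem"  # job scheduled with the new setting

print(can_run(old_job, worker_now))            # old job still finds a worker
print(can_run(old_job, worker_after_cleanup))  # old job would stay scheduled
print(can_run(new_job, worker_after_cleanup))  # new job is unaffected
```

Under that assumption, an older job retriggered after the cleanup finds no matching worker, which is the stuck state described above.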
Updated by acarvajal over 3 years ago
Updated WORKER_CLASS setting on machine 64bit-sap-qam in osd.
Also merged Maintenance Single Incidents job group configuration MR.
Pending TestRepo MR.
Updated by okurz over 3 years ago
- Due date set to 2021-08-03
- Priority changed from Normal to Low
waiting for https://gitlab.suse.de/qa-maintenance/qam-openqa-yml/-/merge_requests/153 to be merged + grace period before we remove the "sap_sle*" classes from worker config.
Updated by acarvajal over 3 years ago
HANA jobs impacted by this change seem to be working with no issues both in TestRepo and in Single Incidents.
Some examples from today below:
TestRepo:
- https://openqa.suse.de/tests/overview?groupid=366&build=20210713-2&distri=sle&version=15-SP3&flavor=SAP-DVD-Updates
- https://openqa.suse.de/tests/overview?distri=sle&version=12-SP3&build=20210713-1&groupid=108&flavor=SAP-DVD-Updates
- https://openqa.suse.de/tests/overview?distri=sle&version=15-SP1&build=20210713-2&groupid=232&flavor=SAP-DVD-Updates
Single Incidents:
Updated by okurz over 3 years ago
ok, good. Now I suggest to remove the unused worker classes in https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/331
Updated by okurz over 3 years ago
- Status changed from Feedback to Resolved
No further problems observed. With this I see all points covered.