action #90857

closed

Add redundancy for SAP multi machines tests - Extend openQA worker config to accommodate upgraded RAM

Added by jadamek over 3 years ago. Updated about 3 years ago.

Status: Resolved
Priority: Low
Assignee:
Category: -
Target version:
Start date:
Due date: 2021-08-03
% Done: 0%
Estimated time:

Description

The openQA QEM review reported an issue with our SAP HANA tests executed on Maintenance TestRepo.
We need either more resources or a way to run the tests on more of the existing machines.

First, let me summarize the situation:
Nowadays, the timing is really tight because as you know, the maintenance test repo is triggered twice a day.
That means 2 X 6 OS versions to test (12-SP3 to 15-SP2) with one HANA test per OS version.
And it must be completed before the next build otherwise jobs are tagged obsolete.
One HANA test requires 49 GB RAM: 2 x 24 GB (HANA machines) + 1 GB for the support server machine.

For these tests, we only use openqaworker8 (sap_sle12) and openqaworker9 (sap_sle15). We set it up this way to limit the memory usage of the openQA instance (https://progress.opensuse.org/issues/73246).
As a result, the HANA tests run serially for sle12 as well as for sle15.

For instance:
The HANA test starts for 15 GA on openqaworker9 and lasts an hour and a half. Once the test is done, the HANA test on 15 SP1 starts, and so on...
As we have 3 different 15 versions (GA, SP1, SP2), the tests last 4 hours and a half for SLE15 alone.
For SLE12, the HANA test lasts one hour, so as we have 3 different 12 versions (SP3, SP4, SP5), the tests last 3 hours for SLE12. 12-SP2 was removed recently.
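
The figures above can be checked with a small sketch. It only restates the numbers from this description; the 1.5 hours per SLE15 test is an assumption derived from the stated total of 4 hours and a half for three versions, and the function names are made up for illustration.

```python
# Figures taken from the ticket description.
HANA_NODES = 2
HANA_NODE_RAM_GB = 24
SUPPORT_SERVER_RAM_GB = 1


def ram_per_test_gb():
    """RAM needed by one HANA multi-machine test (two nodes plus support server)."""
    return HANA_NODES * HANA_NODE_RAM_GB + SUPPORT_SERVER_RAM_GB


def serial_duration_hours(versions, hours_per_test):
    """Wall-clock time when one test per version runs back to back on one worker."""
    return len(versions) * hours_per_test


sle15_hours = serial_duration_hours(["15 GA", "15 SP1", "15 SP2"], 1.5)
sle12_hours = serial_duration_hours(["12 SP3", "12 SP4", "12 SP5"], 1.0)

print(ram_per_test_gb())         # 49
print(sle15_hours, sle12_hours)  # 4.5 3.0
```

Because the two workers run in parallel, the slower SLE15 chain determines how long one TestRepo build takes in this model.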

Besides that, both workers are also used on Maintenance incident and we can not know how much we need there in advance.

I agree the solution isn't redundant at all. If one of the workers is down, the tests cannot be executed elsewhere.
To speed up the tests, we can think about adding memory to both workers (at least 64GB per worker, not less, because the jobs are linked together as multi-machine jobs).


Related issues 2 (0 open, 2 closed)

Related to openQA Infrastructure - action #73246: [osd-admins][alert] openqaworker8: Memory usage alert (Resolved, okurz, 2020-10-12, 2020-10-23)
Copied to openQA Infrastructure - action #93961: Add redundancy for SAP multi machines tests - extend RAM on machines (Resolved, nicksinger, 2021-04-08, 2021-06-22)

Actions #1

Updated by jadamek over 3 years ago

This merge request will add consistency between the SAP tests and free a little bit of RAM:
https://gitlab.suse.de/qa-css/openqa_ha_sap/-/merge_requests/274

Actions #2

Updated by okurz over 3 years ago

  • Target version set to future

Thank you for the improvement in the MR. That should already help. Reading dmidecode on openqaworker8 I consider a memory upgrade feasible.

https://racktables.nue.suse.com/index.php?page=object&tab=default&object_id=7626
is the racktables entry for openqaworker8_9. Can you guys find sponsoring for a RAM upgrade related to SAP development work?

Actions #3

Updated by jadamek over 3 years ago

Thanks Oliver, I will raise the point to our team leads this afternoon.

Actions #4

Updated by runger over 3 years ago

Funding is secured. Will be taken out of my QE LSG labs budget. Matthias will organize procurement.

Actions #5

Updated by okurz over 3 years ago

  • Status changed from New to In Progress
  • Assignee set to nicksinger
  • Target version changed from future to Ready

wow, that was quick :) Thanks a lot!

We checked dmidecode: there are 4 banks with support for max 256GB each. Currently we have 4x4x16GB. nicksinger researched what replacement modules would cost: "cheapest module found (https://geizhals.de/samsung-rdimm-32gb-m393a4k40bb1-crc-a1378231.html) this would make ~520€ for a 64gb upgrade (4*32 as we would need to replace 4 16GB dimms). maybe something more in the lines of https://www.mindfactory.de/product_info.php/32GB-Kingston-Server-Premier-KSM24RD4-32MEI-DDR4-2400-regECC-DIMM-CL17-_1250630.html - which would bump the price up to ~700€ "
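
As a rough sketch of the numbers in this comment: replacing four 16GB DIMMs with 32GB modules adds 64GB to a machine. The code below assumes that "4x4x16GB" means 256GB installed per machine, reuses the 49GB per HANA test from the description, and ignores host overhead and other jobs, so the parallel-test count is an upper bound and not a sizing recommendation. All names are illustrative.

```python
CURRENT_TOTAL_GB = 4 * 4 * 16   # "4x4x16GB", assumed to be per machine
DIMMS_REPLACED = 4
OLD_DIMM_GB = 16
NEW_DIMM_GB = 32
RAM_PER_HANA_TEST_GB = 49       # from the ticket description


def upgraded_total_gb():
    """Installed RAM after swapping the replaced DIMMs for larger ones."""
    return CURRENT_TOTAL_GB + DIMMS_REPLACED * (NEW_DIMM_GB - OLD_DIMM_GB)


def price_per_module_eur(total_eur):
    """Approximate price of one module given the quoted total."""
    return total_eur / DIMMS_REPLACED


def max_parallel_tests(total_gb):
    """Upper bound on concurrent HANA tests, ignoring any host overhead."""
    return total_gb // RAM_PER_HANA_TEST_GB


print(upgraded_total_gb())                      # 320
print(price_per_module_eur(520))                # 130.0
print(max_parallel_tests(CURRENT_TOTAL_GB))     # 5
print(max_parallel_tests(upgraded_total_gb()))  # 6
```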

Actions #6

Updated by openqa_review over 3 years ago

  • Due date set to 2021-04-27

Setting due date based on mean cycle time of SUSE QE Tools

Actions #7

Updated by nicksinger over 3 years ago

runger wrote:

Funding is secured. Will be taken out of my QE LSG labs budget. Matthias will organize procurement.

I've requested an offer from delta. Matthias is included in the CC. Will update once we have that offer.

Actions #8

Updated by nicksinger over 3 years ago

  • Status changed from In Progress to Feedback

I'm currently waiting for a decision made by mgmt. Matthi and Ralf are in the loop here

Actions #9

Updated by okurz over 3 years ago

  • Due date changed from 2021-04-27 to 2021-05-27

As this is a ticket waiting for feedback from others I am suggesting a new due date one month in the future to check back if there is any response at that time or check back with them.

Actions #10

Updated by mgrifalconi over 3 years ago

Hello, some SAP updates are stuck now since they require 3 parallel jobs for worker sap_sle15 but there are only 2 workers for that. Another 2 are available as sap_sle12.

Job stuck
https://openqa.suse.de/tests/overview?version=15-SP2&groupid=311&flavor=Server-DVD-SAP-Incidents&distri=sle&build=%3A19259%3Adrbd-formula

Chat discussion
https://chat.suse.de/channel/qem-openqa-review?msg=8iih2j3m7koNSfC6Q

Actions #11

Updated by livdywan over 3 years ago

@nicksinger Did you see the above question?

Actions #12

Updated by jadamek over 3 years ago

mgrifalconi wrote:

Hello, some SAP updates are stuck now since they require 3 parallel jobs for worker sap_sle15 but there are only 2 workers for that. Another 2 are available as sap_sle12.

Job stuck
https://openqa.suse.de/tests/overview?version=15-SP2&groupid=311&flavor=Server-DVD-SAP-Incidents&distri=sle&build=%3A19259%3Adrbd-formula

Chat discussion
https://chat.suse.de/channel/qem-openqa-review?msg=8iih2j3m7koNSfC6Q

Hello Michael,
I remember this issue and as far as I know, it's already fixed.
The worker class for the supportserver was incorrect.

Actions #13

Updated by okurz over 3 years ago

  • Subject changed from Add redundancy for SAP multi machines tests to Add redundancy for SAP multi machines tests - extend RAM on machines

So let's keep this ticket centered around the approach to buy a RAM upgrade for our machines.

@nicksinger deadline will be reached tomorrow, what's the status?

Actions #14

Updated by mgriessmeier over 3 years ago

quote is requested and will be purchased today or tomorrow

Actions #15

Updated by mgriessmeier over 3 years ago

RAM upgrade was ordered and is already sent out.

Actions #16

Updated by nicksinger over 3 years ago

  • Status changed from Feedback to Blocked

As the sticks arrived in NBG by now I created
https://infra.nue.suse.com/Ticket/Display.html?id=189709 and asked infra to build them into the machines. I will provide them with further assistance if needed. Please be aware that this upgrade will cause a downtime so if you have a timeslot where the machines need to run please let me know ASAP.

Actions #17

Updated by okurz over 3 years ago

  • Due date changed from 2021-05-27 to 2021-06-22

ok, looks good. Bumping due date so that you can check the latest when the due date has passed if the blocking ticket progressed. Please remember for EngInfra tickets to use [openqa] in the subject and CC osd-admins@suse.de

Actions #18

Updated by mgriessmeier over 3 years ago

RAM has arrived at the SUSE office in NUE. I have put the package on the desk in Nick's office (3.2.12 iirc).
Please open an infra ticket for installation.

I'm sorry - forget about my comment... please let infra know that I have moved the package to Nick's office... (I don't have vpn available atm)

Actions #19

Updated by nicksinger over 3 years ago

  • Status changed from Blocked to Resolved

The new modules were installed today. I checked both machines and can see them with dmidecode -t memory. According to free -h both machines now have 314Gi of RAM.

Actions #20

Updated by okurz over 3 years ago

  • Related to action #73246: [osd-admins][alert] openqaworker8: Memory usage alert added

Actions #21

Updated by okurz over 3 years ago

  • Copied to action #93961: Add redundancy for SAP multi machines tests - extend RAM on machines added

Actions #22

Updated by okurz over 3 years ago

  • Subject changed from Add redundancy for SAP multi machines tests - extend RAM on machines to Add redundancy for SAP multi machines tests - Extend openQA worker config to accomodate for upgraded RAM
  • Due date deleted (2021-06-22)
  • Status changed from Resolved to New
  • Assignee deleted (nicksinger)
  • Start date deleted (2021-04-08)

Very nice. I copied the ticket into #93961, assigned it to you, nicksinger, and resolved it, so that we can reopen this one and make better use of the machines with upgraded memory. I still see the need to tweak the worker config to actually use the extended resources for more worker instances.

Actions #23

Updated by okurz over 3 years ago

  • Status changed from New to Feedback
  • Assignee set to okurz

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/324

To add redundancy for SAP multi-machine tests we now have more RAM available in the corresponding openQA workers. I am proposing to spread the SAP worker classes over openqaworker8+9 so that we have 1. more worker instances to run such tests, 2. higher redundancy, as either of the two machines is able to execute multi-machine clusters for both the classes "sap_sle12" and "sap_sle15".
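
The redundancy argument can be illustrated with a simplified model of worker class matching, in which a job may run on a host only if that host offers every class the job requests. This is a sketch and not the openQA scheduler; the class sets per host (including "qemu_x86_64") are assumptions for illustration, not the content of the merge request.

```python
def eligible_hosts(job_worker_class, hosts, down=()):
    """Return hosts that are up and offer every class in the job's WORKER_CLASS."""
    required = set(job_worker_class.split(","))
    return sorted(
        host
        for host, classes in hosts.items()
        if host not in down and required <= classes
    )


# Assumed layout before the change: one SAP class per machine.
before = {
    "openqaworker8": {"qemu_x86_64", "sap_sle12"},
    "openqaworker9": {"qemu_x86_64", "sap_sle15"},
}

# Assumed layout after the change: both SAP classes on both machines.
after = {
    "openqaworker8": {"qemu_x86_64", "sap_sle12", "sap_sle15"},
    "openqaworker9": {"qemu_x86_64", "sap_sle12", "sap_sle15"},
}

outage = {"openqaworker9"}
print(eligible_hosts("sap_sle15", before, down=outage))  # []
print(eligible_hosts("sap_sle15", after, down=outage))   # ['openqaworker8']
```

With the classes spread over both machines, an outage of one host still leaves a place to run either SLE12 or SLE15 clusters.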

Actions #24

Updated by okurz over 3 years ago

Merged, and it has been looking OK for 8 days.

Now, what was the reason to have separate "sap_sle12" and "sap_sle15" worker classes? I would like to simplify that and keep only "sap" classes or even better have no special class at all. I assume the main challenge is to have a much bigger RAM amount than normal, right?

Actions #25

Updated by acarvajal over 3 years ago

okurz wrote:

Merged, and it has been looking OK for 8 days.

Now, what was the reason to have separate "sap_sle12" and "sap_sle15" worker classes? I would like to simplify that and keep only "sap" classes or even better have no special class at all. I assume the main challenge is to have a much bigger RAM amount than normal, right?

Apparently this was done to force MM jobs to run in the same worker, i.e., all QAM SLES for SAP Applications 12-SP* jobs in a given worker, and all QAM SLES for SAP Applications 15-SP* jobs in a given worker.

I think this can be simplified, and if there is still a need to have MM jobs running in the same worker, we can use something like WORKER_CLASS=openqaworker8 or WORKER_CLASS=openqaworker9 instead.

I will add a task to our backlog to replace sap_sle12 and sap_sle15 in the WORKER_CLASS with whatever name is chosen. The settings are currently in use in the Maintenance Incidents and Maintenance TestRepo job groups.

Edit: https://jira.suse.com/browse/TEAM-4381

Actions #27

Updated by acarvajal over 3 years ago

Merge requests to remove the setting from QAM job groups:

Maintenance Single Incidents: https://gitlab.suse.de/qa-css/openqa_ha_sap/-/merge_requests/313
Maintenance TestRepo: https://gitlab.suse.de/qa-maintenance/qam-openqa-yml/-/merge_requests/153

Oliver - just to confirm - will we follow the following sequence?

  1. Merge https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/327
  2. Update WORKER_CLASS setting on machine 64bit-sap-qam in osd (changing WORKER_CLASS=qemu_x86_64 to WORKER_CLASS=qemu_x86_64-large-mem)
  3. Merge https://gitlab.suse.de/qa-css/openqa_ha_sap/-/merge_requests/313 & https://gitlab.suse.de/qa-maintenance/qam-openqa-yml/-/merge_requests/153

Or did you have something else in mind?

I'm thinking we could leave the machine definition unchanged, and I add qemu_x86_64-large-mem to the WORKER_CLASS setting in the job groups, but not sure what is better.

Actions #28

Updated by okurz over 3 years ago

acarvajal wrote:

Merge requests to remove the setting from QAM job groups:

Maintenance Single Incidents: https://gitlab.suse.de/qa-css/openqa_ha_sap/-/merge_requests/313
Maintenance TestRepo: https://gitlab.suse.de/qa-maintenance/qam-openqa-yml/-/merge_requests/153

Oliver - just to confirm - will we follow the following sequence?

  1. Merge https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/327
  2. Update WORKER_CLASS setting on machine 64bit-sap-qam in osd (changing WORKER_CLASS=qemu_x86_64 to WORKER_CLASS=qemu_x86_64-large-mem)
  3. Merge https://gitlab.suse.de/qa-css/openqa_ha_sap/-/merge_requests/313 & https://gitlab.suse.de/qa-maintenance/qam-openqa-yml/-/merge_requests/153

Yes, sounds safe. https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/327 is already merged as it's just adding a new worker class setting which should not break any existing settings.

I'm thinking we could leave the machine definition unchanged, and I add qemu_x86_64-large-mem to the WORKER_CLASS setting in the job groups, but not sure what is better.

I advise using the machine definitions as an intermediate abstraction point. Just some days ago some cloud test scenarios had problems after I needed to set the worker class in jobs from machine definitions while these jobs had the worker class overridden directly in job templates. So that is one more reason to define the worker class in machines, with the exception of adding "tap" as an additional worker class requirement.

After all three points that you mentioned above we can remove the "sap_sle*" worker class settings. But for this we should give a reasonable grace time because otherwise retriggering older jobs still having the old worker class restrictions would be stuck in schedule, never being executed.
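
A minimal sketch of why the grace period matters, under the assumption that a retriggered old job keeps its original worker class requirement. The class sets below are hypothetical and only model one worker before and after the cleanup; the matching rule (a worker must offer every class the job requires) is a simplification.

```python
def schedulable(job_classes, worker_classes):
    """A job can be picked up only if the worker offers every required class."""
    return set(job_classes) <= set(worker_classes)


# Hypothetical classes offered by one worker at each stage.
during_grace = {"qemu_x86_64", "qemu_x86_64-large-mem", "sap_sle12", "sap_sle15"}
after_cleanup = during_grace - {"sap_sle12", "sap_sle15"}

old_job = ["sap_sle15"]              # retriggered job with the old setting
new_job = ["qemu_x86_64-large-mem"]  # job created after the migration

for stage, classes in [("grace period", during_grace), ("after cleanup", after_cleanup)]:
    print(stage, schedulable(old_job, classes), schedulable(new_job, classes))
# grace period True True
# after cleanup False True
```

Once the old classes are gone, a retriggered old job has no matching worker and stays scheduled indefinitely, which is the situation the grace time is meant to avoid.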

Actions #29

Updated by acarvajal over 3 years ago

Updated WORKER_CLASS setting on machine 64bit-sap-qam in osd.

Also merged Maintenance Single Incidents job group configuration MR.

Pending TestRepo MR.

Actions #30

Updated by okurz over 3 years ago

  • Due date set to 2021-08-03
  • Priority changed from Normal to Low

waiting for https://gitlab.suse.de/qa-maintenance/qam-openqa-yml/-/merge_requests/153 to be merged + grace period before we remove the "sap_sle*" classes from worker config.

Actions #32

Updated by okurz over 3 years ago

ok, good. Now I suggest removing the unused worker classes in https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/331

Actions #33

Updated by okurz about 3 years ago

merged

Actions #34

Updated by okurz about 3 years ago

  • Status changed from Feedback to Resolved

No further problems observed. With this I see all points covered
