action #97943

openQA Project - coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

Increase number of CPU cores on OSD VM due to high usage size:S

Added by okurz 10 months ago. Updated 9 months ago.

Status: Resolved
Priority: Normal
Assignee:
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:

Description

Motivation

As the number of workers, tests processed in parallel, products tested on OSD, and users of the system grows, the workload on OSD constantly increases. CPU load alerts were seen recently in #96713, and the higher load is visible in https://monitor.qa.suse.de/d/WebuiDb/webui-summary?viewPanel=25 . We should increase the number of CPU cores on the OSD VM to handle the higher usage.

Acceptance criteria

  • AC1: A reasonably higher number of CPU cores than the current 10 is available on OSD

Suggestions


Related issues

Related to openQA Project - coordination #99549: [epic] Split production workload onto multiple hosts (focusing on OSD) (New, 2021-09-30)

Related to openQA Project - action #107701: [osd] Job detail page fails to load (Resolved, 2022-02-28, 2022-03-15)

Copied from openQA Infrastructure - action #96713: Slow grep in openqa-label-known-issues leads to high CPU usage (Resolved, 2021-09-10)

History

#1 Updated by okurz 10 months ago

  • Copied from action #96713: Slow grep in openqa-label-known-issues leads to high CPU usage added

#2 Updated by okurz 10 months ago

  • Subject changed from Increase number of CPU cores on OSD VM due to high usage to Increase number of CPU cores on OSD VM due to high usage size:S
  • Description updated (diff)
  • Status changed from New to Workable

#3 Updated by mkittler 10 months ago

  • Assignee set to mkittler

#4 Updated by mkittler 10 months ago

  • Status changed from Workable to Feedback

I filed the ticket using the new ticket system as described on https://progress.opensuse.org/projects/qa/wiki#How-we-work-on-our-backlog but I couldn't find the "Request participants" field anywhere, so I just wrote a comment asking to add osd-admins@suse.de. The ticket URL is https://sd.suse.com/servicedesk/customer/portal/1/SD-60167 but likely nobody except me can currently access it.

#5 Updated by ghormoon 10 months ago

Repasting from SD, as I see you can't access it:

It's currently running on morla11 and has fairly constant CPU usage of around 50%.

The problem might be that most morlas have 48 cores but one has only 32, so if this workload ends up on the weaker one during maintenance, we're already at 75% and there won't be much headroom for spikes. If we want such a huge VM, we'll need some special handling for it to be sure it won't affect other workloads. Does it definitely need to be one huge node, or can you have multiple smaller workers distributed more evenly over the cluster?
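
The 75% figure can be reproduced with a quick calculation. This is only a sketch: it assumes the ~50% usage refers to a 48-core host and that the same absolute load would land on the 32-core host. The function name `usage_after_move` is made up for illustration.

```python
def usage_after_move(usage_fraction: float, cores_from: int, cores_to: int) -> float:
    """Scale a CPU usage fraction from one host to a host with a different core count.

    Assumes the absolute load (number of busy cores) stays the same after the move.
    """
    busy_cores = usage_fraction * cores_from  # 0.50 * 48 = 24 busy cores
    return busy_cores / cores_to              # 24 / 32 = 0.75


# ~50% usage on a 48-core host, moved to the 32-core host
print(f"{usage_after_move(0.50, 48, 32):.0%}")  # prints "75%"
```

Under these assumptions, only about 8 of the 32 cores would remain free for load spikes.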

#6 Updated by mkittler 9 months ago

As mentioned in the Jira ticket, it currently needs to be one host. I suppose we need to decide in the team how we'd like to proceed here.

#7 Updated by cdywan 9 months ago

Discussed it in the weekly. From what I understand:

  • We can't get more CPU cores
  • Additional VMs could be provided
  • Dedicated hardware instead of a VM could also be provided

mkittler please ask if we can get shared storage between multiple VMs

#8 Updated by ghormoon 9 months ago

Dedicated HW is usually bought by the team itself. I can discuss this with Evzenia when she comes back from vacation.

It is technically possible to have an NFS share if you need shared storage and that helps you run more, smaller VMs. How much storage would it need?

#9 Updated by mkittler 9 months ago

  • Status changed from Feedback to Resolved

I'm resolving this ticket with the outcome that it is not possible to extend the VM. It would be possible to request another VM, so I created #99549 to split our workload.

#10 Updated by okurz 9 months ago

  • Related to coordination #99549: [epic] Split production workload onto multiple hosts (focusing on OSD) added

#11 Updated by okurz 9 months ago

  • Parent task set to #80142

#12 Updated by okurz 4 months ago

  • Related to action #107701: [osd] Job detail page fails to load added
