action #77890

openQA Project - coordination #64746: [saga][epic] Scale up: Efficient handling of large storage to be able to run current tests efficiently but keep big archives of old results

openQA Project - coordination #80546: [epic] Scale up: Enable to store more results

[easy] Extend OSD storage space for "results" to make bug investigation and failure archeology easier

Added by okurz 9 months ago. Updated 4 months ago.

Status: Resolved
Priority: Normal
Assignee:
Target version:
Start date: 2020-11-14
Due date:
% Done: 0%
Estimated time:

Description

Motivation

More and more products are tested and more and more tests are running on OSD, which increases the volume of results we need to store per time period. This in turn means that we need to restrict the duration for which we can keep results. This makes it harder to investigate product bugs that have been reported based on openQA test results, as well as test failures. With the formation of the new QE department, the two biggest user groups of OSD are now joined in one department, which can make some decisions easier. It is a good opportunity to let QE management coordinate an increase of storage space for "results" with EngInfra.

Acceptance criteria

  • AC1: Significant increase of storage space for "results" on OSD
  • AC2: Job group result retention periods have been increased to make efficient use of the available storage space
  • AC3: "results" on OSD has still enough headroom, e.g. only used up to 80-85%
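
AC3 can be checked mechanically. The following is a minimal sketch, assuming Python 3; the function names and the 85% default (the upper end of the range named in AC3) are illustrative choices and not part of openQA:

```python
import shutil


def usage_percent(used: int, total: int) -> float:
    """Return used space as a percentage of total space."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * used / total


def has_headroom(used: int, total: int, limit_percent: float = 85.0) -> bool:
    """True if usage is at or below the limit (assumed default: 85%)."""
    return usage_percent(used, total) <= limit_percent


def check_mount(path: str, limit_percent: float = 85.0) -> bool:
    """Check a live mount point; the path is supplied by the caller.

    Note: this divides used by total, so the value can differ slightly
    from the Use% column printed by df.
    """
    usage = shutil.disk_usage(path)
    return has_headroom(usage.used, usage.total, limit_percent)
```

On OSD this could be called as check_mount("/results"), assuming that is the mount point of interest.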

Suggestions

  • Suggest the proposal to QE mgmt
  • Ask QE mgmt to create EngInfra ticket with according "budget allocation" or at least something like a "yes, we need it to run the business" :)
  • If not happening create the EngInfra ticket on your own and ask QE mgmt to support after the fact

Further details

In mid-2020 EngInfra was kind enough to provide us with a comparatively big increase of storage space for "assets". That is "cheap+slow" rotating disk storage, so it was cheaper and hence easier for them to just give us the space (someone could have come to that conclusion earlier). "results", however, is on "expensive+fast" SSD-backed storage, so it is likely not as easy to convince them. Still, it is likely cheaper to buy more enterprise SSD storage at 10x the consumer price than to keep people busy rerunning tests in openQA or manually just to collect logs or find out what errors are about.

History

#1 Updated by okurz 9 months ago

  • Tags set to osd, storage, EngInfra, mgmt
  • Subject changed from Extend OSD storage space for "results" to make bug investigation and failure archeology easier to [easy] Extend OSD storage space for "results" to make bug investigation and failure archeology easier

#2 Updated by okurz 8 months ago

  • Parent task set to #64746

#3 Updated by okurz 8 months ago

  • Parent task changed from #64746 to #80546

#4 Updated by cdywan 6 months ago

Acceptance criteria

  • AC3: "results" on OSD has still enough headroom, e.g. only used up to 80-85%

$ df -h /var/lib/openqa
/dev/vdd        5.0T  3.9T  1.2T  77% /var/lib/openqa

Is this what we're looking for? Or a different disk? What's the actual preferred size we would like?

#5 Updated by okurz 6 months ago

Mentioned in AC1: "Significant increase" from the current, 5.0TB. For example 2TB, better more. I see even 10-20TB as realistic.

#6 Updated by cdywan 6 months ago

  • Status changed from Workable to Blocked

I'm setting this to Blocked with SD-36448 being the ticket waiting on a response. Whenever that's resolved the last part tracked here is probably adjusting the mount and jobgroup limits as needed.

#7 Updated by cdywan 6 months ago

  • Assignee set to cdywan

#8 Updated by okurz 6 months ago

  • Status changed from Blocked to Feedback

cdywan wrote:

I'm setting this to Blocked with SD-36448 being the ticket waiting on a response. Whenever that's resolved the last part tracked here is probably adjusting the mount and jobgroup limits as needed.

I don't have access to the SD ticket and I do not know what it is about. As a general approach for SUSE-IT tickets I suggest to always create a ticket in gitlab.suse.de/suse/internal-tools/issues/ which is readable/writable for others as well that might be able to help or could be interested. Then create an SD ticket and cross-reference with the gitlab ticket. But for the above tasks I do not know what you created a SUSE-IT SD ticket for. My suggestions were first "Suggest the proposal to QE mgmt", then "Ask QE mgmt to create EngInfra ticket … If not happening create the EngInfra ticket on your own and ask QE mgmt to support after the fact". If you followed the last then the ticket should be in infra.nue.suse.com/ (or by email to infra@suse.de ending up in the same). Is this what you intended or is the SUSE-IT ticket about something different?

#9 Updated by esujskaja 6 months ago

I would propose to continue the conversation here.

20TB on top of your 5TB is quite big. Currently, EngInfra administers NetApp storage in Nue, which is designed for extreme performance for the build job, so it's very expensive per gigabyte and should be managed accordingly. Keeping in mind a possible move to AWS, it becomes an even more serious question.

I would ask to provide a plan - how you'll be using this space, what and how will be distributed across it; for which period of time, you believe, it's enough; is any regular clean up/archiving planned; how the estimation has been done.

I'll contact Gerald Pfeiffer meanwhile to understand the process better.

#10 Updated by cdywan 6 months ago

The SD ticket was closed. A call will be organized to discuss the details.

#11 Updated by okurz 6 months ago

Hi esujskaja,

good to see that you found this ticket directly. Unfortunately I do not have access to https://sd.suse.com/servicedesk/customer/portal/1/SD-36448 so I do not know what has been written there

esujskaja wrote:

I would propose to continue the conversation here.

20TB on top of your 5TB is quite big. Currently, EngInfra administers NetApp storage in Nue, which is designed for extreme performance for the build job, so it's very expensive per gigabyte and should be managed accordingly. Keeping in mind a possible move to AWS, it becomes an even more serious question.

I am aware of the higher cost for storage on the NetApp storage. However as I learned our mount point for /results on openqa.suse.de resides on rotating disk storage so I understand that in comparison to SSD or NVMe based storage the cost is lower than that.

I would ask to provide a plan - how you'll be using this space, what and how will be distributed across it; for which period of time, you believe, it's enough; is any regular clean up/archiving planned; how the estimation has been done.

Unlike previous requests for storage extension, which were mostly made in case of imminent storage exhaustion, the idea for this ticket was to discuss an extension that helps not just short-term but longer-term. openQA can efficiently manage more storage than is currently assigned to OSD by using all available storage to store more test results for a longer period, while still ensuring that the available space is not completely depleted. There are recurring requests to store more test results and for longer. Also, if we provide test results for longer, then test engineers as well as bug assignees save time on bug investigation. Hence, overall, the idea was to save cost by investing in hardware to reduce the time that people need to invest. My team lead mgriessmeier got in contact with me about this ticket (not sure how he got involved now) and we reminded ourselves that the convention or agreement so far was to increase the space by 2TB per year or something along those lines. It would be feasible to keep this, but the idea of this ticket is something different: not how to "cope with what we have" but "make more efficient use of hardware but save engineering time". Any available space would be distributed among all users with relative quotas, and we can keep the ratios for any increase of hardware. https://progress.opensuse.org/projects/qa/wiki#Our-common-userbase describes our "common userbase". Please understand that there is no hard requirement for "20TB" or something like that. It can be higher or lower than that ( https://progress.opensuse.org/issues/77890#note-5 )
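
To illustrate what "keep the ratios for any increase of hardware" could mean in practice, here is a minimal sketch, assuming Python 3. The function name and the group names are hypothetical and do not reflect how openQA stores its job group limits; the sketch only shows proportional scaling:

```python
from typing import Dict


def scale_quotas(quotas_gb: Dict[str, float], new_total_gb: float) -> Dict[str, float]:
    """Scale per-group quotas so they sum to new_total_gb, keeping their ratios."""
    current_total = sum(quotas_gb.values())
    if current_total <= 0:
        raise ValueError("current quotas must sum to a positive value")
    factor = new_total_gb / current_total
    return {group: size * factor for group, size in quotas_gb.items()}


# Hypothetical groups and sizes, for illustration only.
example = scale_quotas({"group-a": 1000, "group-b": 3000}, new_total_gb=6000)
```

With these example values, each group grows by the same factor of 1.5, so the 1:3 ratio between the two groups is preserved.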

#12 Updated by esujskaja 6 months ago

I've had a meeting with Vit today; we agreed to meet and develop a plan together on how to proceed. NetApp was significantly upgraded a year ago and now it's high-performing as a whole. It's really preferable not to reserve too much without using it in the foreseeable future.

We also have to take into consideration that openSUSE is supposed to get separated at the end of the day. If the public cloud becomes its landing place, then you would be able to get as much space as you need, dynamically (in that case, archiving and read/write patterns may require a look). Otherwise, individual storage for openSUSE could be a good variant too; we will discuss a possible config, and I'll request a quote.

#13 Updated by okurz 6 months ago

esujskaja wrote:

I've had a meeting with Vit today; we agreed to meet and develop a plan together on how to proceed.

If you mean "Vit Pelcak", sure, it's good to coordinate with multiple people but I don't know what plan you want to develop with Vit or what group of persons he can speak for. Just please keep in mind that the userbase for openqa.suse.de is much bigger than Vit's team and definitely bigger than the QE department as I explained in comments above.

NetApp was significantly upgraded a year ago and now it's high-performing as a whole. It's really preferable not to reserve too much without using it in the foreseeable future.

I do not understand what you mean by that. Was my understanding wrong that /results of openqa.suse.de resides on rotating disk storage? And the plan is to make near-immediate use of the available space. The space would not be just kept free as reserve. Also in my understanding just enlarging the available assigned storage without allocating it would actually not occupy the space. With netapp assigned storage and deduplication over-allocation is default, right?

We also have to take into consideration that openSUSE is supposed to get separated at the end of the day.

Sorry, I don't understand what such plans for openSUSE have to do with our internal instance openqa.suse.de .

If the public cloud becomes its landing place, then you would be able to get as much space as you need, dynamically (in that case, archiving and read/write patterns may require a look). Otherwise, individual storage for openSUSE could be a good variant too; we will discuss a possible config, and I'll request a quote.

Of course technically it is possible to use public cloud offerings for both openqa.opensuse.org as well as openqa.suse.de but I doubt we would save money regarding storage.

#14 Updated by esujskaja 6 months ago

Was my understanding wrong that /results of openqa.suse.de resides on rotating disk storage?

I'm not sure where it is now, but we don't have 20TB on spindles anyway. Huge money was invested in NetApp to upgrade it to flash.

Sorry, I don't understand what such plans for openSUSE have to do with our internal instance openqa.suse.de .

Of course technically it is possible to use public cloud offerings for both openqa.opensuse.org as well as openqa.suse.de but I doubt we would save money regarding storage.

I agree with you here; nevertheless, we have to start thinking toward independent infrastructure, and toward storage space management too. It seems reasonable to me to ask whether you know how much space you actually need and what the growth expectations are.

But the most important point is: storage on NetApp is not a free and endless resource. We would totally like to help you here, but we need to get more info to understand the plans and how they may impact the rest of the users.

#15 Updated by okurz 6 months ago

esujskaja wrote:

Was my understanding wrong that /results of openqa.suse.de resides on rotating disk storage?

I'm not sure where it is now, but we don't have 20TB on spindles anyway. Huge money was invested in NetApp to upgrade it to flash.

Ok, I see. It would help if you clarify the situation for us first. I agree that if all further storage is on SSD then a further increase is more expensive than expected and we would prioritize looking into other solutions first.
It would also help us if you could share what the actual costs are for assigned storage so that we can accommodate that better in our plans as well.

Sorry, I don't understand what such plans for openSUSE have to do with our internal instance openqa.suse.de .

Of course technically it is possible to use public cloud offerings for both openqa.opensuse.org as well as openqa.suse.de but I doubt we would save money regarding storage.

I agree with you here; nevertheless, we have to start thinking toward independent infrastructure, and toward storage space management too. It seems reasonable to me to ask whether you know how much space you actually need and what the growth expectations are.

But the most important point is: storage on NetApp is not a free and endless resource. We would totally like to help you here, but we need to get more info to understand the plans and how they may impact the rest of the users.

Sure, that makes sense. Again, there is no urgent need for any more storage, but additional available storage would reduce engineering efforts.

#16 Updated by esujskaja 6 months ago

Ok, I see. It would help if you clarify the situation for us first.

I did as soon as I get the question :)

#17 Updated by okurz 6 months ago

esujskaja wrote:

Ok, I see. It would help if you clarify the situation for us first.

I did as soon as I get the question :)

Hm, not sure what you mean. Do you mean you did or you still want to "get the question"?
Let me rephrase: It would help if you could confirm that we currently have 5TB of storage backed by rotating disk and that rotating disk storage is cheaper than SSD-based storage.

#19 Updated by okurz 6 months ago

Questions discussed in the meeting

my questions I wanted to clarify:

  • Where is /results currently stored?

    • 5.0TB, currently based on SSD
  • Is it true that the NetApp storage is split into rotating disk and flash based storage?

    • Yes, confirmed by bmwiedemann
  • What other storage solution exists?

    • the only real production-worthy option is Netapp storage (additionally there is only ceph based non-production setup)
  • What are numbers regarding cost to keep in mind?

    • According to bmwiedemann "flash is 3x more expensive than rotating disk".

Results

  • We clarified that this is only about "results", limited to the current setup where we have a single directory. /results is currently based on flash storage and we can make good use of more available space of for example +4.0TB for /results.
  • esujskaja can on request provide AWS testing accounts for experimentation, e.g. using AWS glacier
  • esujskaja will create a SUSE-IT ticket to handle increase of the current 5.0TB for /results as currently there should be the space available

#20 Updated by okurz 6 months ago

  • Status changed from Feedback to Blocked
  • Assignee changed from cdywan to okurz

https://jira.suse.com/browse/ENGINFRA-487 was created, handling in there for now.

#21 Updated by okurz 6 months ago

I provided a comment in the EngInfra ticket but there was no response. I realized now that the ticket was marked as "Done" so I set that back to "Open" now in the hope that someone from EngInfra will react. Setting due-date to check again at the latest when the due date has passed and provide a bump otherwise

#22 Updated by okurz 5 months ago

  • Status changed from Blocked to In Progress

We found an agreement with EngInfra to increase the space on vdd (expensive+fast) by 0.5TB and vde (cheap+slow) by 2.5TB. gschlotter did that in cooperation with me and he removed vdf again.

As the partitions were already detected with the new size by the OS on OSD, I just did:

xfs_growfs /space-slow
xfs_growfs /results

so now we have

/dev/vde        5.5T  1.9T  3.7T  33% /space-slow
/dev/vdd        5.5T  4.1T  1.5T  74% /results

will adjust asset and result settings on osd for various groups to make use of that.
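
For a rough feeling of why the increase was split this way, the cost ratio quoted earlier in this ticket ("flash is 3x more expensive than rotating disk") can be applied to the numbers above. This is only a back-of-the-envelope sketch, assuming Python 3 and that the 3x ratio applies linearly per TB; it uses no real prices, and the function name is illustrative:

```python
# Ratio quoted in the meeting notes: "flash is 3x more expensive than rotating disk".
FLASH_COST_FACTOR = 3.0


def relative_cost(flash_tb: float, rotating_tb: float,
                  flash_factor: float = FLASH_COST_FACTOR) -> float:
    """Cost expressed in units of 'one TB of rotating disk' (assumed linear)."""
    return flash_tb * flash_factor + rotating_tb


# Agreed increase: 0.5TB on vdd (expensive+fast) and 2.5TB on vde (cheap+slow).
agreed = relative_cost(flash_tb=0.5, rotating_tb=2.5)

# For comparison: the "+4.0TB for /results" mentioned in the meeting, all on flash.
all_flash = relative_cost(flash_tb=4.0, rotating_tb=0.0)
```

Under these assumptions the agreed split costs about a third of what +4.0TB purely on flash would have cost, while adding 3.0TB in total.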

#23 Updated by okurz 5 months ago

  • Status changed from In Progress to Feedback

Already bumped limits for SLE15. Asked in https://chat.suse.de/channel/testing which groups in general should receive more. Based on that feedback I can then adjust.

#24 Updated by okurz 5 months ago

  • Due date set to 2021-03-03

No further feedback from anyone. I will wait some days to see how results fills up a bit more.

#25 Updated by okurz 5 months ago

  • Status changed from Feedback to Resolved

We have now

/dev/vde        5.5T  1.9T  3.7T  33% /space-slow
/dev/vdc        7.0T  5.3T  1.8T  76% /assets
/dev/vdd        5.5T  4.5T  1.1T  81% /results

so enough headroom but more space used
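
The headroom statement can be reproduced from the df lines above. The following is a minimal sketch, assuming Python 3 and the six-column layout printed by df -h; the names DfEntry, parse_df_line and over_limit are illustrative, and the 85% default is the upper end of the range in AC3:

```python
from typing import List, NamedTuple

# df -h prints sizes in powers of 1024.
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4, "P": 1024**5}


class DfEntry(NamedTuple):
    device: str
    size: float
    used: float
    avail: float
    use_percent: int
    mount: str


def parse_size(text: str) -> float:
    """Convert a df -h size such as '5.5T' into bytes."""
    suffix = text[-1].upper()
    if suffix in UNITS:
        return float(text[:-1]) * UNITS[suffix]
    return float(text)


def parse_df_line(line: str) -> DfEntry:
    """Parse one data line of df -h output (not the header line)."""
    device, size, used, avail, percent, mount = line.split(None, 5)
    return DfEntry(device, parse_size(size), parse_size(used),
                   parse_size(avail), int(percent.rstrip("%")), mount.strip())


def over_limit(lines: List[str], limit_percent: int = 85) -> List[str]:
    """Return mount points whose Use% exceeds the limit."""
    entries = [parse_df_line(line) for line in lines]
    return [e.mount for e in entries if e.use_percent > limit_percent]


# The three lines reported in this comment.
DF_OUTPUT = [
    "/dev/vde        5.5T  1.9T  3.7T  33% /space-slow",
    "/dev/vdc        7.0T  5.3T  1.8T  76% /assets",
    "/dev/vdd        5.5T  4.5T  1.1T  81% /results",
]
```

Calling over_limit(DF_OUTPUT) returns an empty list for these values, since all three mounts are at or below 85%.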

#26 Updated by okurz 4 months ago

  • Due date deleted (2021-03-03)
