action #77890
openQA Project - coordination #64746: [saga][epic] Scale up: Efficient handling of large storage to be able to run current tests efficiently but keep big archives of old results
openQA Project - coordination #80546: [epic] Scale up: Enable to store more results
[easy] Extend OSD storage space for "results" to make bug investigation and failure archeology easier
Added by okurz about 4 years ago. Updated over 3 years ago.
Description
Motivation
More and more products are tested and more and more tests are running on OSD, which increases the amount of results we need to store per time period. This in turn means we need to restrict how long results are kept, which makes it harder to investigate product bugs that have been reported based on openQA test results, as well as test failures. Now that the new QE department has been formed, the two biggest user groups of OSD are joined in one department, which can make some decisions easier. It is a good opportunity to let QE management coordinate an increase of storage space for "results" with EngInfra.
Acceptance criteria
- AC1: Significant increase of storage space for "results" on OSD
- AC2: Job group result retention periods have been increased to make efficient use of the available storage space
- AC3: "results" on OSD still has enough headroom, e.g. used only up to 80-85% (see the check sketched below)
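A minimal sketch of how AC3 could be checked periodically (assuming /results is the mount point holding results on OSD; the 85% threshold comes from the AC, and the alerting mechanism is left out):

$ usage=$(df --output=pcent /results | tail -n 1 | tr -dc '0-9')
$ [ "$usage" -le 85 ] || echo "/results at ${usage}%, above the headroom target"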
Suggestions
- Present the proposal to QE mgmt
- Ask QE mgmt to create an EngInfra ticket with a corresponding "budget allocation" or at least something like a "yes, we need it to run the business" :)
- If that does not happen, create the EngInfra ticket yourself and ask QE mgmt to support it after the fact
Further details
In mid-2020 EngInfra was so kind to provide us with a comparably big increase of storage space for "assets". That space is "cheap+slow" rotating disk storage, so it was cheaper and hence easier for them to just give us the space (someone could have come to that conclusion earlier). "results", however, is on "expensive+fast" SSD-backed storage, so EngInfra is likely not as easy to convince. Still, buying more enterprise SSD storage, even at 10x consumer prices, is likely cheaper than keeping people busy rerunning tests in openQA or manually collecting logs just to find out what errors are about.
Updated by okurz about 4 years ago
- Tags set to osd, storage, EngInfra, mgmt
- Subject changed from Extend OSD storage space for "results" to make bug investigation and failure archeology easier to [easy] Extend OSD storage space for "results" to make bug investigation and failure archeology easier
Updated by livdywan almost 4 years ago
Acceptance criteria
- AC3: "results" on OSD still has enough headroom, e.g. used only up to 80-85%
$ df -h /var/lib/openqa
/dev/vdd        5.0T  3.9T  1.2T  77% /var/lib/openqa
Is this what we're looking for, or a different disk? And what size would we actually prefer?
Updated by okurz almost 4 years ago
As mentioned in AC1: a "significant increase" over the current 5.0TB. For example +2TB, better more; I see even 10-20TB as realistic.
Updated by livdywan almost 4 years ago
- Status changed from Workable to Blocked
I'm setting this to Blocked, with SD-36448 being the ticket waiting on a response. Once that's resolved, the last part tracked here is probably adjusting the mount and job group limits as needed.
Updated by okurz almost 4 years ago
- Status changed from Blocked to Feedback
cdywan wrote:
I'm setting this to Blocked, with SD-36448 being the ticket waiting on a response. Once that's resolved, the last part tracked here is probably adjusting the mount and job group limits as needed.
I don't have access to the SD ticket and I do not know what it is about. As a general approach for SUSE-IT tickets I suggest to always create a ticket in gitlab.suse.de/suse/internal-tools/issues/, which is readable and writable for others who might be able to help or could be interested, then create an SD ticket and cross-reference the two. But I do not know which of the above tasks you created a SUSE-IT SD ticket for. My suggestions were first "Present the proposal to QE mgmt", then "Ask QE mgmt to create an EngInfra ticket … If that does not happen, create the EngInfra ticket yourself and ask QE mgmt to support it after the fact". If you followed the latter, the ticket should be in infra.nue.suse.com/ (or, via email to infra@suse.de, ending up in the same place). Is this what you intended, or is the SUSE-IT ticket about something different?
Updated by esujskaja almost 4 years ago
I would propose to continue the conversation here.
20TB on top of your 5TB is quite big. Currently, EngInfra administers the NetApp storage in Nue, which is designed for extreme performance for build jobs, so it is very expensive per gigabyte and should be managed accordingly. Keeping a possible move into AWS in mind, this becomes an even more serious question.
I would ask you to provide a plan: how you will be using this space, what will be distributed across it and how, for which period of time you believe it will be enough, whether any regular clean-up/archiving is planned, and how the estimate was made.
I'll contact Gerald Pfeiffer meanwhile to understand the process better.
Updated by livdywan almost 4 years ago
The SD ticket was closed. A call will be organized to discuss the details.
Updated by okurz almost 4 years ago
Hi esujskaja,
good to see that you found this ticket directly. Unfortunately I do not have access to https://sd.suse.com/servicedesk/customer/portal/1/SD-36448, so I do not know what has been written there.
esujskaja wrote:
I would propose to continue the conversation here.
20TB on top of your 5TB is quite big. Currently, EngInfra administers the NetApp storage in Nue, which is designed for extreme performance for build jobs, so it is very expensive per gigabyte and should be managed accordingly. Keeping a possible move into AWS in mind, this becomes an even more serious question.
I am aware of the higher cost of storage on the NetApp. However, as I learned, our mount point for /results on openqa.suse.de resides on rotating disk storage, so I understand the cost is lower than that of SSD- or NVMe-based storage.
I would ask you to provide a plan: how you will be using this space, what will be distributed across it and how, for which period of time you believe it will be enough, whether any regular clean-up/archiving is planned, and how the estimate was made.
Unlike previous requests for storage extensions, which were mostly made in case of imminent storage exhaustion, the idea of this ticket is to discuss an extension that helps not just short-term but longer-term. openQA can efficiently manage more storage than is currently assigned to OSD by using all available space to store more test results for a longer period while still ensuring that the available space is never completely depleted. There are recurring requests to store more test results, and for longer. Also, if we provide test results for longer, then test engineers as well as bug assignees save time on bug investigation. So overall the idea is to save cost by trading hardware cost for the time people would otherwise need to invest. My team lead mgriessmeier got in contact with me about this ticket (I am not sure how he got involved) and we reminded ourselves that the convention or agreement so far was to increase the space by roughly 2TB per year. That would be feasible to keep, but the idea of this ticket is something different: not "cope with what we have" but "make more efficient use of hardware to save engineering time". Any available space would be distributed to all users with relative quotas, and we can keep those ratios for any hardware increase. https://progress.opensuse.org/projects/qa/wiki#Our-common-userbase describes our "common userbase". Please understand that there is no hard requirement for "20TB" or any specific number; it can be higher or lower than that ( https://progress.opensuse.org/issues/77890#note-5 ).
Updated by esujskaja almost 4 years ago
I had a meeting with Vit today; we agreed to meet and develop a plan together on how to proceed. The NetApp was significantly upgraded a year ago and is now high-performing as a whole. It is really preferable not to reserve too much space without using it in the foreseeable future.
We also have to take into consideration that openSUSE is supposed to be separated at the end of the day. If the public cloud becomes its landing place, then you would be able to get as much space as you need, dynamically (in that case, archiving and read/write patterns may require a look). Otherwise, individual storage for openSUSE could be a good variant too; we will discuss a possible config, and I'll request a quote.
Updated by okurz almost 4 years ago
esujskaja wrote:
I had a meeting with Vit today; we agreed to meet and develop a plan together on how to proceed.
If you mean "Vit Pelcak", sure, it's good to coordinate with multiple people, but I don't know what plan you want to develop with Vit or what group of persons he can speak for. Just please keep in mind that the userbase for openqa.suse.de is much bigger than Vit's team and definitely bigger than the QE department, as I explained in the comments above.
The NetApp was significantly upgraded a year ago and is now high-performing as a whole. It is really preferable not to reserve too much space without using it in the foreseeable future.
I do not understand what you mean by that. Was my understanding wrong that /results of openqa.suse.de resides on rotating disk storage? And the plan is to make near-immediate use of the available space; the space would not just be kept free as a reserve. Also, in my understanding, merely enlarging the assigned storage without allocating it would not actually occupy the space. With NetApp-assigned storage and deduplication, over-allocation is the default, right?
We also have to take into consideration that openSUSE is supposed to be separated at the end of the day.
Sorry, I don't understand what such plans for openSUSE have to do with our internal instance openqa.suse.de.
If the public cloud becomes its landing place, then you would be able to get as much space as you need, dynamically (in that case, archiving and read/write patterns may require a look). Otherwise, individual storage for openSUSE could be a good variant too; we will discuss a possible config, and I'll request a quote.
Of course, technically it is possible to use public cloud offerings for both openqa.opensuse.org and openqa.suse.de, but I doubt we would save money on storage.
Updated by esujskaja almost 4 years ago
Was my understanding wrong that /results of openqa.suse.de resides on rotating disk storage?
I'm not sure where it is now, but we don't have 20TB on spindles anyway. A huge amount of money was invested to upgrade the NetApp to flash.
Sorry, I don't understand what such plans for openSUSE have to do with our internal instance openqa.suse.de .
Of course technically it is possible to use public cloud offerings for both openqa.opensuse.org as well as openqa.suse.de but I doubt we would save money regarding storage.
I agree with you here; nevertheless, we have to start thinking about independent infrastructure, and about storage space management too. It seems reasonable to ask whether you know how much space you actually need and what the growth expectations are.
But most important: storage on the NetApp is not a free and endless resource. We would very much like to help you here, but we need more info to understand the plans and how they may impact the rest of the users.
Updated by okurz almost 4 years ago
esujskaja wrote:
Was my understanding wrong that /results of openqa.suse.de resides on rotating disk storage?
I'm not sure where it is now, but we don't have 20TB on spindles anyway. A huge amount of money was invested to upgrade the NetApp to flash.
Ok, I see. It would help if you could clarify the situation for us first. I agree that if all further storage is on SSD, then a further increase is more expensive than expected, and we would prioritize looking into other solutions first.
It would also help us if you could share the actual costs for assigned storage so that we can accommodate that better in our plans as well.
Sorry, I don't understand what such plans for openSUSE have to do with our internal instance openqa.suse.de.
Of course, technically it is possible to use public cloud offerings for both openqa.opensuse.org and openqa.suse.de, but I doubt we would save money on storage.
I agree with you here; nevertheless, we have to start thinking about independent infrastructure, and about storage space management too. It seems reasonable to ask whether you know how much space you actually need and what the growth expectations are.
But most important: storage on the NetApp is not a free and endless resource. We would very much like to help you here, but we need more info to understand the plans and how they may impact the rest of the users.
Sure, that makes sense. Again, there is no urgent need for any more storage, but additional available storage would reduce engineering efforts.
Updated by esujskaja almost 4 years ago
Ok, I see. It would help if you could clarify the situation for us first.
I did, as soon as I got the question :)
Updated by okurz almost 4 years ago
esujskaja wrote:
Ok, I see. It would help if you could clarify the situation for us first.
I did, as soon as I got the question :)
Hm, not sure what you mean. Do you mean you did, or that you still want to "get the question"?
Let me rephrase: it would help if you could confirm that we currently have 5TB of storage backed by rotating disks, and that rotating disk storage is cheaper than SSD-based storage.
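For reference, a way to inspect this from inside the OSD VM could be (a sketch; note that for virtio disks the ROTA flag only reflects what the hypervisor presents, not necessarily the NetApp backend, so EngInfra would still need to confirm):

$ lsblk -d -o NAME,SIZE,ROTA /dev/vdd

ROTA=1 means the kernel treats the device as rotational, ROTA=0 as non-rotating.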
Updated by okurz almost 4 years ago
Questions discussed in the meeting
The questions I wanted to clarify:
Where is /results currently stored?
- 5.0TB, currently based on SSD
Is it true that the NetApp storage is split into rotating disk and flash-based parts?
- Yes, confirmed by bmwiedemann
What other storage solution exists?
- the only real production-worthy option is NetApp storage (additionally there is only a Ceph-based non-production setup)
What numbers regarding cost should we keep in mind?
- According to bmwiedemann "flash is 3x more expensive than rotating disk".
Results
- We clarified that this is only about "results", limited to the current setup where we have a single directory. /results is currently based on flash storage and we could make good use of more available space, for example +4.0TB for /results.
- @esujskaja can provide AWS testing accounts on request for experimentation, e.g. using AWS Glacier (see the sketch after this list)
- @esujskaja will create a SUSE-IT ticket to handle an increase of the current 5.0TB for /results, as the space should currently be available
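Regarding the Glacier experiment, a minimal sketch using the AWS CLI with the S3 Glacier storage class (the bucket name and archive file are hypothetical placeholders, nothing has been set up):

$ aws s3 cp results-archive.tar.gz s3://osd-results-archive/ --storage-class GLACIER

Retrieval from Glacier takes hours and incurs extra cost per restore, so this would only fit rarely accessed old results.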
Updated by okurz almost 4 years ago
- Status changed from Feedback to Blocked
- Assignee changed from livdywan to okurz
https://jira.suse.com/browse/ENGINFRA-487 was created; handling continues there for now.
Updated by okurz almost 4 years ago
I provided a comment in the EngInfra ticket but there was no response. I now realized that the ticket was marked as "Done", so I set it back to "Open" in the hope that someone from EngInfra will react. Setting a due date to check again at the latest when the due date has passed and to provide a bump otherwise.
Updated by okurz almost 4 years ago
- Status changed from Blocked to In Progress
We found an agreement with EngInfra to increase the space on vdd (expensive+fast) by 0.5TB and on vde (cheap+slow) by 2.5TB. gschlotter did that in cooperation with me and removed vdf again.
As the partitions were already detected with the new size by the OS on OSD, I just ran:
xfs_growfs /space-slow
xfs_growfs /results
so now we have
/dev/vde 5.5T 1.9T 3.7T 33% /space-slow
/dev/vdd 5.5T 4.1T 1.5T 74% /results
I will adjust asset and result settings on OSD for various job groups to make use of that.
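For reference, the complete sequence for such an online extension, as a sketch based on the commands above (the lsblk step is only an assumption about how one would confirm the kernel already sees the new device sizes):

# confirm the kernel already sees the enlarged virtual disks
lsblk /dev/vdd /dev/vde
# grow the XFS filesystems online; XFS can grow but not shrink
xfs_growfs /space-slow
xfs_growfs /results
# verify the new sizes and usage
df -h /space-slow /results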
Updated by okurz almost 4 years ago
- Status changed from In Progress to Feedback
I already bumped limits for SLE15 and asked in https://chat.suse.de/channel/testing which groups should receive more in general. Based on that feedback I can then adjust further, e.g. as sketched below.
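A sketch of such an adjustment via the openQA API (the group id and day values are placeholders; the property names follow the job group settings openQA stores, e.g. keep_results_in_days and keep_logs_in_days, but should be double-checked against the current API documentation):

$ openqa-cli api --host https://openqa.suse.de -X PUT job_groups/110 \
      keep_results_in_days=120 keep_logs_in_days=60

The same values can also be changed per job group in the web UI under the group's properties.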
Updated by okurz over 3 years ago
- Due date set to 2021-03-03
No further feedback from anyone. I will wait some days to watch /results fill up a bit more.
Updated by okurz over 3 years ago
- Status changed from Feedback to Resolved
We now have:
/dev/vde 5.5T 1.9T 3.7T 33% /space-slow
/dev/vdc 7.0T 5.3T 1.8T 76% /assets
/dev/vdd 5.5T 4.5T 1.1T 81% /results
so there is enough headroom, even with more space used.
Updated by okurz almost 2 years ago
- Copied to action #121594: Extend OSD storage space for "results" to make bug investigation and failure archeology easier - 2022 added