action #57494

increase space for o3, potentially split /assets and /results same as for osd

Added by okurz 5 months ago. Updated 3 days ago.

Status: Feedback
Start date: 29/09/2019
Priority: High
Due date:
Assignee: okurz
% Done: 0%
Category: -
Target version: openQA Project - Current Sprint
Duration:

Description

Motivation

We currently have 4TB for o3 but we are again struggling to accommodate all necessary data. Some job groups are hardly able to sustain even a single repo snapshot of Factory, see #57338. For OSD we recently switched to separate partitions for /assets and /results, so we should probably do the same for o3.

History

#1 Updated by okurz 5 months ago

  • Status changed from In Progress to Feedback

#2 Updated by okurz 5 months ago

  • Status changed from Feedback to Blocked

#3 Updated by michel_mno 4 months ago

oups wrong issue

#4 Updated by okurz 2 months ago

  • Status changed from Blocked to Feedback
  • Priority changed from Normal to High
  • Target version set to Current Sprint

hmpf, ticket in "openqa" queue was resolved without any further comment. Reopened as https://infra.nue.suse.com/SelfService/Display.html?id=158336

Beyond that, it seems we again have a problem right now with too many results. Leap 15 has long result and log retention periods: currently 365d for "important logs" and 3650d for "important results".

The SQL query select group_id,count(id) from jobs where logs_present=TRUE group by group_id order by count; shows

 group_id | count 
----------+-------
[…]
       50 |  6720
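A minimal sketch of how the per-group count above can be reproduced and checked on sample data. It uses an in-memory SQLite database as a stand-in for the real openQA database; the three-column "jobs" table and the sample rows are hypothetical, and the column alias job_count is added so the ordering is explicit.

```python
import sqlite3


def jobs_with_logs_per_group(conn):
    """Count jobs that still have logs, per job group, smallest count first."""
    query = """
        SELECT group_id, COUNT(id) AS job_count
        FROM jobs
        WHERE logs_present = 1
        GROUP BY group_id
        ORDER BY job_count
    """
    return conn.execute(query).fetchall()


# Hypothetical stand-in for the "jobs" table, reduced to the columns the query needs.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs (id INTEGER PRIMARY KEY, group_id INTEGER, logs_present INTEGER)"
)
# Made-up sample rows: (id, group_id, logs_present)
sample = [(1, 50, 1), (2, 50, 1), (3, 50, 0), (4, 1, 1)]
conn.executemany("INSERT INTO jobs VALUES (?, ?, ?)", sample)

rows = jobs_with_logs_per_group(conn)
for group_id, job_count in rows:
    print(f"{group_id:>9} | {job_count:>5}")
```

The group with the highest count ends up last, which matches how group 50 appears at the bottom of the output above.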

We should ask lnussel, maxlin, lkocman if we can reduce the log retention for Leap. coolo and I did so on #opensuse-factory:

[18/12/2019 11:58:40] <coolo> maxlin, DimStar: how far back do you need openqa logs on o3?
[18/12/2019 11:59:56] <DimStar> coolo: you mean on the tests? or logs like syncd?
[18/12/2019 12:00:36] <coolo> test results
[18/12/2019 12:01:51] <maxlin> do we ran out of space on o3?
[18/12/2019 12:02:07] <coolo> if not this week, then next
[18/12/2019 12:02:27] <okurz> maxlin, lnussel: Who is the main person for Leap on o3? I guess lkocman but couldn't find him here. Regarding to what coolo asks. Currently the main "Leap" job group on o3 has very long retention periods, e.g. 365d for "important" logs and "3650d" for "important results" and we currently do not have enough space to accomodate that much. either we reduce it or we find more disk space really fast which will only happen with 
[18/12/2019 12:02:27] <okurz> a good escalation

Further:

[18/12/2019 12:07:40] <lnussel> yast logs and videos for not linked jobs don't have to stay long. either we review something in time or never
[18/12/2019 12:07:53] <coolo> but it's important to differentiate your needs for test results and for logs
[18/12/2019 12:07:56] <okurz> define "not long" and "in time"?
[18/12/2019 12:08:17] <coolo> lnussel: a week? two?
[18/12/2019 12:08:26] <lnussel> I'd have guesses something like that too
[18/12/2019 12:08:40] <maxlin> a week should good enough IMO
[18/12/2019 12:08:49] <coolo> let's make it 10 days then?
[18/12/2019 12:08:53] <lnussel> k
[18/12/2019 12:08:57] <coolo> (right now we have set 60, which is insane)
[18/12/2019 12:09:11] <lnussel> still requesting more disk space makes sense as this issue will come again
[18/12/2019 12:09:27] <coolo> ok, now that we discussed logs - how long do you need results of daily builds?
[18/12/2019 12:09:33] <coolo> lnussel: go ahead!
[18/12/2019 12:11:13] <coolo> are 2 months for results of unlabeled builds good?
[18/12/2019 12:11:41] <lnussel> I hope so, yes
[18/12/2019 12:11:55] <lnussel> how much space are we talking about per build here?
[18/12/2019 12:13:04] <coolo> lnussel: one job is 50MB on average. one build has 130 jobs - clones not counted
[18/12/2019 12:13:18] <maxlin> the oldest bug I've had tagged on current snapshot is https://bugzilla.suse.com/show_bug.cgi?id=1158873
[18/12/2019 12:13:29] <maxlin> 2 months should be ok
[18/12/2019 12:13:40] <|Anna|> SUSE bug 1158873 in openSUSE Distribution "[Build 536.2] openQA test fails in random places on GNOME - wayland session suddenly crashed(sporadic)" [Normal, New]
[18/12/2019 12:13:47] <okurz> "oldest bugs" are not relevant. it's "oldest job" on o3
[18/12/2019 12:14:06] <coolo> good. I harmonized TW and Leap now to keep tagged builds forever, untaged logs for 10 days, tagged logs for 120 days and untagged results for 80 days
[18/12/2019 12:14:35] <okurz> lnussel: I doubt we will get "more space" without sponsoring and management escalation stating the business importance

So coolo set "Keep logs for" to 10, "Keep important logs for" to 120, "Keep results for" to 80 and "Keep important results for" to 0. Currently /space is at 93% with 304G free. Over the past weeks /assets was at around 89-90%. https://nagios-devel.suse.de/pnp4nagios//index.php/graph?host=ariel-opensuse.suse.de&srv=space_partition&start=1576235168&end=1576649843 shows the recent rise yesterday morning. Not sure if it's "just" more Leap 15 jobs now, but of course critical bugs in parents cancelling whole clusters, especially kernel tests, could help us to save space.
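A rough back-of-the-envelope estimate for the per-build space question from the chat, as a sketch only. The per-job size and jobs per build are coolo's figures; the number of builds per day is an assumption that does not come from the ticket, and the result is an upper bound because logs of untagged jobs are already removed after 10 days.

```python
MB_PER_JOB = 50             # from the chat: "one job is 50MB on average"
JOBS_PER_BUILD = 130        # from the chat: "one build has 130 jobs", clones not counted
RESULT_RETENTION_DAYS = 80  # "Keep results for" as set by coolo
BUILDS_PER_DAY = 1          # ASSUMPTION, not stated in the ticket


def build_size_gb(mb_per_job=MB_PER_JOB, jobs_per_build=JOBS_PER_BUILD):
    """Approximate size of one build in GB (1 GB = 1000 MB)."""
    return mb_per_job * jobs_per_build / 1000


def retained_size_gb(days, builds_per_day=BUILDS_PER_DAY):
    """Upper bound for the space used if every build is kept in full for `days`."""
    return build_size_gb() * builds_per_day * days


per_build = build_size_gb()
upper_bound = retained_size_gb(RESULT_RETENTION_DAYS)
print(f"one build: about {per_build} GB")
print(f"{RESULT_RETENTION_DAYS} days of untagged results: at most about {upper_bound} GB")
```

With these numbers one build is about 6.5 GB, and 80 days of untagged results for a single job group stay below roughly 520 GB under the one-build-per-day assumption.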

#5 Updated by okurz about 1 month ago

Wrote a comment on https://trello.com/c/JQtnALhz/6-openqa-hw-budget-planning#comment-5e185a3e9a5c3786c32fd089 kindly requesting budget for an increase of available space.

#6 Updated by okurz 3 days ago

No response and no action in either Trello or the infra ticket.
