coordination #64746

[saga][epic] Scale up: Efficient handling of large storage to be able to run current tests efficiently but keep big archives of old results

Added by mkittler almost 4 years ago. Updated about 2 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Feature requests
Target version:
Start date: 2020-03-18
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)

Description

ideas

  • OSD-specific: a cron job that checks whether df reports low available space for /results and, if so, calls find … to delete videos older than <…> (see the sketch after this list)
  • DONE: Default to NOVIDEO=1 for all scenarios that set a MAX_JOB_TIME higher than default 2h -> gh#os-autoinst/openQA#3112 commit ab74357
  • Improve video compression codecs (a 20 MB video.ogv can easily be compressed to a 14 MB video.ogv.xz; that can be improved further)
  • Set up an additional storage server for old assets and results
    • Use overlayfs to make archived assets/results appear alongside the regular assets/results
    • Move old assets/results at some point from the regular web UI host to the storage server, e.g. automatically via a Minion job
      • Mark jobs with archived results as such
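
A minimal sketch of the cron-job cleanup idea above, assuming /results as the results directory; the 90% usage threshold and the 14-day age limit are placeholder values, not agreed-on settings:

    #!/bin/bash
    # Hypothetical cleanup sketch: delete old videos once /results runs low on space
    set -euo pipefail
    results_dir=/results
    # current usage of the filesystem holding the results, as a bare number (e.g. 93)
    usage=$(df --output=pcent "$results_dir" | tail -n1 | tr -dc '0-9')
    if [ "$usage" -ge 90 ]; then
        # delete videos of jobs older than 14 days; adjust -mtime as needed
        find "$results_dir" -name 'video.ogv' -mtime +14 -delete
    fi

For the recompression idea from the same list, replacing -delete with something like -exec xz -9 '{}' ';' would trade CPU time for disk space instead of deleting the videos outright.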

Subtasks 18 (0 open, 18 closed)

  • action #64574: Keep track of disk usage of results by job groups (Resolved, mkittler, 2020-03-18)
  • action #67342: Support using a "generic" video encoder like ffmpeg instead of only relying on the one provided by os-autoinst itself (Resolved, mkittler, 2020-05-27)
  • action #75256: Try out AV1 video codec as potential new default (Resolved, mkittler, 2020-10-25)
  • openQA Infrastructure - action #76987: re-encode some videos from existing results to save space (Resolved, okurz, 2020-11-04)
  • coordination #80546: [epic] Scale up: Enable to store more results (Resolved, okurz, 2020-08-04)
  • openQA Infrastructure - action #69577: Handle installation of the new "Storage Server" (Resolved, nicksinger, 2020-08-04)
  • openQA Infrastructure - action #77890: [easy] Extend OSD storage space for "results" to make bug investigation and failure archeology easier (Resolved, okurz, 2020-11-14)
  • openQA Infrastructure - action #88546: Make use of the new "Storage Server", e.g. complete OSD backup (Resolved, okurz)
  • openQA Infrastructure - action #90629: administration of the new "Storage Server" (Resolved, nicksinger, 2020-08-04)
  • action #91347: [spike][timeboxed:18h] Support for archived jobs (Resolved, mkittler, 2021-04-19)
  • openQA Infrastructure - action #91779: Add monitoring for storage.qa.suse.de (Resolved, mkittler, 2021-04-26)
  • action #91782: Add support for archived jobs (Resolved, mkittler, 2021-04-26)
  • action #91785: UI representation for archived jobs (Resolved, mkittler)
  • action #92788: Use openQA archiving feature on osd size:S (Resolved, okurz)
  • openQA Infrastructure - coordination #96974: [epic] Improve/reconsider thresholds for skipping cleanup (Resolved, okurz, 2021-09-20)
  • openQA Infrastructure - action #98922: Run asset cleanup concurrently to results based on config (Resolved, mkittler, 2021-09-20)
  • openQA Infrastructure - action #103954: Run asset cleanup concurrently to results based on config on o3 as well (Resolved, okurz)
  • action #103953: Use openQA archiving feature on o3 size:S (Resolved, okurz, 2021-12-14)

Related issues 6 (3 open, 3 closed)

  • Related to openQA Infrastructure - action #30595: [ppc64le] Deploy bcache on tmpfs workers (Closed, 2018-01-22)
  • Related to openQA Project - coordination #34357: [epic] Improve openQA performance (New, 2017-08-25)
  • Related to openQA Infrastructure - action #58805: [infra] Severe storage performance issue on openqa.suse.de workers (Resolved, okurz, 2019-10-29)
  • Related to openQA Infrastructure - action #66709: Storage server for OSD and monitoring (Resolved, nicksinger, 2020-05-12)
  • Copied to openQA Project - coordination #92323: [saga][epic] Scale up: Fine-grained control over use and removal of results, assets, test data (New, 2020-05-20)
  • Copied to openQA Project - coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances (New, 2022-05-09)
#1

Updated by livdywan almost 4 years ago

mkittler wrote:

ideas

use innovative storage solutions

What real technology are you referring to here?

#2

Updated by okurz almost 4 years ago

cdywan wrote:

What real technology are you referring to here?

Something better than single fixed, inflexible volumes, e.g. dm-cache, lvmcache, bcache or SUSE Enterprise Storage. Maybe also splitting the results stored in openQA into "recent, active" and "old, archived" and putting both categories into different folders which can be mounted from different storage locations, e.g. fast and expensive for "recent, active" and slow, cheap and big for "old, archived".
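
A minimal sketch of how such a split could look with overlayfs (as also suggested in the ticket description); the mount points and the NFS export below are placeholders, not actual paths:

    # slow, big archive storage mounted read-only from the storage server (placeholder export)
    mount -t nfs -o ro storage.qa.suse.de:/export/results-archive /srv/results-archive
    # fast local storage provides the writable upper layer (upperdir and workdir must be
    # on the same filesystem); jobs then see a single merged /results
    mount -t overlay overlay \
        -o lowerdir=/srv/results-archive,upperdir=/srv/results-fast/upper,workdir=/srv/results-fast/work \
        /results

New results would land on the fast upper layer while archived results stay readable from the lower layer; actually moving data between the two would still need a separate job.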

EDIT: 2020-04-08: What I had been reading and thinking:

  • http://strugglers.net/~andy/blog/2017/07/19/bcache-and-lvmcache/ is a very nice blog post comparing HDD, SSD, bcache and lvmcache with git tests and fio. Overall result: bcache can get very close to SSD performance; lvmcache is good but stays below bcache, yet it is much more flexible and probably easier to migrate to and from
  • ZFS seems to provide good support for using a fast SSD for caching, but that is not an easy option for us as our OS does not support it for now
  • mitiao already worked on bcache for our tmpfs workers in #30595 but I think overall there was a lot of misunderstanding and confusion and obviously no resolution. Still, the hints mentioned there can be evaluated as well. In this specific case I wonder, though, how comparable RAM+HDD is to SSD+HDD; maybe we need dedicated solutions for both cases rather than one overly generic approach covering both
  • Discussed with nsinger: tmpfs should only use as much RAM as is actually used within the tmpfs, so maybe for our tmpfs workers we should just allocate a bigger tmpfs (as big as the RAM) and let it use what is there? Or configure "dirty_ratio" and "dirty_background_ratio" to effectively allow caching 100% but start writing to persistent storage in the background as soon as possible to avoid I/O spikes, e.g. "dirty_ratio=100%" and "dirty_background_ratio=0%" (see the sketch after this list)? That applies to all devices, but it is also possible to configure these parameters per device. nsinger also suggested https://www.kernel.org/doc/html/latest/admin-guide/device-mapper/delay.html for local experiments, or a qcow image on tmpfs that we map into a VM as "fast storage" and a qcow image on HDD/SSD as "slow storage" -> #30595
  • ramfs has no size limit and does not use swap, so it is risky.
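
A sketch of the writeback knobs mentioned above, purely to illustrate the idea from this comment; the exact values and the example device are placeholders, not a recommendation:

    # Allow nearly all of RAM to hold dirty pages before writers are throttled,
    # but start background writeback as early as possible to avoid I/O spikes
    sysctl -w vm.dirty_ratio=100
    sysctl -w vm.dirty_background_ratio=0
    # The vm.dirty_* settings are global; per-device limits are available via the
    # BDI interface, e.g. for the block device with major:minor 8:0 (placeholder):
    echo 30 > /sys/class/bdi/8:0/max_ratio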

EDIT: 2020-04-20: Some more:

EDIT: 2020-05-01: More:

EDIT: 2020-05-13: coolo did some research years ago, see https://github.com/os-autoinst/os-autoinst/pull/664

EDIT: 2020-06-13: in combination zram might help us as well

#3

Updated by okurz almost 4 years ago

  • Status changed from New to In Progress
  • Assignee set to okurz
#4

Updated by okurz almost 4 years ago

  • Subject changed from [epic] Handle large storage efficiently to be able to run current tests but keep big archives of old results to [saga][epic] Handle large storage efficiently to be able to run current tests efficiently but keep big archives of old results
#5

Updated by okurz almost 4 years ago

  • Related to action #30595: [ppc64le] Deploy bcache on tmpfs workers added
#6

Updated by okurz almost 4 years ago

#7

Updated by okurz almost 4 years ago

  • Related to action #58805: [infra]Severe storage performance issue on openqa.suse.de workers added
#8

Updated by livdywan almost 4 years ago

  • Target version set to Current Sprint

Should this be on Feedback? Should it be in the Current Sprint? I'm not clear on the current status.
It looks like a research ticket which needs to be refined further but you set it to In Progress.

#9

Updated by livdywan almost 4 years ago

  • Target version deleted (Current Sprint)
#10

Updated by okurz almost 4 years ago

cdywan wrote:

Should this be on Feedback? Should it be in the Current Sprint? I'm not clear on the current status.
It looks like a research ticket which needs to be refined further but you set it to In Progress.

Yes, I set it to "In Progress" because I have many articles in my reading backlog. It is not "Feedback" because I am not waiting for anything, I am simply working on more than just this task. Please note that I updated my previous comments with fairly recent additions. I was asked to add fewer comments to tickets to prevent "too many mail notifications", hence I took that approach instead.

#11

Updated by okurz almost 4 years ago

  • Related to action #66709: Storage server for OSD and monitoring added
#12

Updated by okurz almost 4 years ago

  • Description updated (diff)
#13

Updated by okurz over 3 years ago

  • Description updated (diff)
  • Status changed from In Progress to Blocked
  • Target version set to Ready
#14

Updated by mkittler over 3 years ago

  • Description updated (diff)
#15

Updated by szarate over 3 years ago

  • Tracker changed from action to coordination
  • Status changed from Blocked to New
#17

Updated by okurz over 3 years ago

  • Status changed from New to Blocked
#18

Updated by okurz over 3 years ago

  • Subject changed from [saga][epic] Handle large storage efficiently to be able to run current tests efficiently but keep big archives of old results to [saga][epic] Scale up: Handle large storage efficiently to be able to run current tests efficiently but keep big archives of old results
#19

Updated by okurz almost 3 years ago

  • Subject changed from [saga][epic] Scale up: Handle large storage efficiently to be able to run current tests efficiently but keep big archives of old results to [saga][epic] Scale up: Efficient handling of large storage to be able to run current tests efficiently but keep big archives of old results
#20

Updated by okurz almost 3 years ago

  • Copied to coordination #92323: [saga][epic] Scale up: Fine-grained control over use and removal of results, assets, test data added
#21

Updated by okurz almost 3 years ago

  • Description updated (diff)

Split out some "future" ideas into the follow-up saga #92323

#22

Updated by okurz about 2 years ago

  • Status changed from Blocked to Resolved

With the archiving feature and multiple other features completed, and with archiving enabled on both o3 and OSD, we are much more flexible and should be in good shape for the future.

#23

Updated by okurz almost 2 years ago

  • Copied to coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances added