action #17574

closed

[tools]Add caching/syncing of assets

Added by szarate about 7 years ago. Updated over 3 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Feature requests
Target version:
Start date:
2017-04-29
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)

Description

Due to problems generated by the heavy NFS usage, actual asset caching needs to be implemented in openQA.

After a few discussions we concluded that the caching of assets is also tied to the caching of the test distribution; the test distribution should be cached via rsync from the openQA web UI.

Currently the only criterion for enabling caching is that the share directory does not exist on the worker; we should come up with a less hard-coded approach.

See: https://github.com/os-autoinst/openQA/pull/1256


Checklist

  • Caching should allow synchronization of assets (HDD and ISO) in the first stage.
  • Caching should be enabled if the /share/ directory is not present in the worker.
  • Asset download should be done via HTTP
  • Caching should be done on the worker host, which should keep track of its assets and clean up when needed
  • Caching method should be LRU
  • Test distribution must be cached through rsync
  • proper cleanup on s390pb
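
The LRU item in the checklist can be illustrated with a minimal sketch. This is not the openQA implementation: the class name LRUAssetIndex, the use() method and the max_assets limit are hypothetical names chosen for illustration, and the index lives only in memory.

```python
from collections import OrderedDict


class LRUAssetIndex:
    """Hypothetical in-memory index of cached assets, least recently used first."""

    def __init__(self, max_assets):
        self.max_assets = max_assets
        self.assets = OrderedDict()  # asset name -> local path

    def use(self, name, path):
        """Record that a job used an asset; return the names evicted to make room."""
        self.assets[name] = path
        self.assets.move_to_end(name)  # most recently used goes to the end
        evicted = []
        while len(self.assets) > self.max_assets:
            old_name, _ = self.assets.popitem(last=False)  # drop the oldest entry
            evicted.append(old_name)
        return evicted
```

A worker would delete the files behind the returned names; an asset that is requested again moves to the end of the order and so outlives assets that were not touched.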

Subtasks 2 (0 open, 2 closed)

action #18880: fix interactive mode with caching (Closed, 2017-04-29)
action #18882: job hangs in "pre-processing" for multiple hours, no screen display -> caching locked (Rejected, 2017-04-29)

Related issues 7 (0 open, 7 closed)

Related to openQA Tests - action #17594: [tools]Missing characters in the middle of type_string/assert_script_run (Ninja Keys) (Resolved, szarate, 2017-03-07)
Related to openQA Project - action #17790: Need to detach HDDs before upload (Resolved, okurz, 2017-03-19)
Related to openQA Project - action #17998: [tools] "No space left on device" on openqaworker7 (Rejected, szarate, 2017-03-24)
Related to openQA Project - action #18730: openqaworker1 is causing weird test failures (Rejected, 2017-04-24)
Blocks openQA Tests - action #18072: [tools][opensuse] test fails in "good_buttons" seeing an existing cryptlvm partition when the harddisk should be clean (Resolved, okurz, 2017-03-28)
Blocks openQA Project - action #19274: cache download runs into 502 or 400 (Resolved, szarate, 2017-05-19)
Follows openQA Project - action #17760: Upload .chksum files together with assets (Rejected, 2017-03-16)

Actions #1

Updated by szarate about 7 years ago

  • Subject changed from Add caching of Assets to Add caching/sincyng of assets
  • Description updated (diff)
Actions #2

Updated by szarate about 7 years ago

  • Subject changed from Add caching/sincyng of assets to Add caching/syncing of assets
Actions #3

Updated by oholecek about 7 years ago

Hmm, I have a couple of questions.
Why is caching enabled only when the share directory is missing? What about using a different directory name, which is possible now?
Is there a problem with also obtaining files through NFS? (I imagine random I/O would drop once caching is used on workers.) Why limit it to HTTP only? (Is it just because openQA can already do that?)
Are we so space-constrained on the worker machines that we can't sync the whole share directory to all workers?

During hackweek I read about https://github.com/chrislusf/seaweedfs, which is an implementation of Facebook's Haystack (www.usenix.org/event/osdi10/tech/full_papers/Beaver.pdf). True, it is intended for smaller files, but in large numbers. The problem is that I think it is still experimental.

How about web seeding for BitTorrent downloads between workers? It would help during the peak when a new product is published, IMO.

Actions #4

Updated by coolo about 7 years ago

We don't need any fancy stuff, and none of the problems BitTorrent solves is one we have. We need to separate downloading assets (and needles) from actually using them, to avoid I/O bottlenecks on the VM affecting the tests.

And to give you a number: our workers have 300 GB of NVMe and our VM has 6 TB of data. So no, we can't, and no, we don't want to sync all of it. And dropping NFS completely in favor of using a worker directory as a sparse copy (do not call it a cache, it's misleading) has the advantage of having unique filenames all over the stack. Whatever is /var/lib/openqa/share/X is CACHEDIR/X on the worker.

But if we mix /var/lib/openqa/share and CACHEDIR, we get into trouble, as the needle mis-caching demonstrated all too prominently. So I'm against any half-baked solution like copying off NFS.
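
The naming rule above (whatever is /var/lib/openqa/share/X is CACHEDIR/X on the worker) could be sketched as a small path-mapping helper. The function name cache_path is hypothetical and not part of openQA; the sketch only shows the mapping, not any download.

```python
from pathlib import PurePosixPath

SHARE_DIR = PurePosixPath("/var/lib/openqa/share")


def cache_path(share_path, cache_dir):
    """Map a path under the share directory to the same relative path under cache_dir."""
    # relative_to() raises ValueError for paths outside the share directory,
    # so the two trees cannot be mixed by accident.
    relative = PurePosixPath(share_path).relative_to(SHARE_DIR)
    return PurePosixPath(cache_dir) / relative
```

Rejecting paths outside the share directory is a design choice of this sketch; it keeps the worker copy a strict mirror of names on the web UI host.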

Actions #5

Updated by coolo about 7 years ago

Oh, and before that gets lost: I'm also against researching experimental file systems at this point. We have a simple problem and we need a simple solution for it, ASAP.

Actions #6

Updated by szarate about 7 years ago

  • Related to action #17594: [tools]Missing characters in the middle of type_string/assert_script_run (Ninja Keys) added
Actions #7

Updated by RBrownSUSE about 7 years ago

  • Subject changed from Add caching/syncing of assets to [tools]Add caching/syncing of assets
Actions #8

Updated by szarate about 7 years ago

  • Follows action #17760: Upload .chksum files together with assets added
Actions #9

Updated by szarate about 7 years ago

  • Checklist item changed from to [ ] Caching should allow syncronization of assets (HDD and ISO) in the first stage., [ ] Caching should be done on the worker, and each worker should keep track of the assets in it's pool, [ ] Caching should be enabled if the /share/ directory is not present in the worker., [ ] Asset download should be done via HTTP, [ ] Caching method should be LRU, [ ] Test distribution must be cached through rsync
  • Description updated (diff)
  • Status changed from New to In Progress
  • Start date deleted (2017-03-17)
  • % Done changed from 0 to 40
Actions #10

Updated by szarate about 7 years ago

  • Checklist item changed from to [x] Caching should allow syncronization of assets (HDD and ISO) in the first stage.
Actions #11

Updated by szarate about 7 years ago

  • Checklist item changed from to [x] Caching should be done on the worker, and each worker should keep track of the assets in it's pool
Actions #12

Updated by szarate about 7 years ago

  • Checklist item changed from [x] Caching should allow syncronization of assets (HDD and ISO) in the first stage., [x] Caching should be done on the worker, and each worker should keep track of the assets in it's pool, [ ] Caching should be enabled if the /share/ directory is not present in the worker., [ ] Asset download should be done via HTTP, [ ] Caching method should be LRU, [ ] Test distribution must be cached through rsync to [x] Caching should allow syncronization of assets (HDD and ISO) in the first stage., [ ] Caching should be enabled if the /share/ directory is not present in the worker., [ ] Asset download should be done via HTTP, [ ] Caching should be done on the worker host, who should keep track of it's assets and clean up when needed, [ ] Caching method should be LRU, [ ] Test distribution must be cached through rsync
Actions #13

Updated by szarate about 7 years ago

  • Checklist item changed from to [x] Caching should be enabled if the /share/ directory is not present in the worker.
Actions #14

Updated by szarate about 7 years ago

  • Checklist item changed from to [x] Asset download should be done via HTTP
Actions #15

Updated by szarate about 7 years ago

  • Checklist item changed from to [x] Caching should be done on the worker host, who should keep track of it's assets and clean up when needed
Actions #16

Updated by szarate about 7 years ago

  • Description updated (diff)
Actions #17

Updated by szarate about 7 years ago

To test locally, I'm using the following rsyncd configuration:

gid = nobody
uid = nobody
read only = true
use chroot = true
transfer logging = true
log format = %h %o %f %l %b
log file = /var/log/rsyncd.log
pid file = /var/run/rsyncd.pid
slp refresh = 300
use slp = false
[openqa-tests]
path = /var/lib/openqa/share/tests
comment = OpenQA Test Distributions

I currently start the server with rcrsyncd start

Actions #18

Updated by mgriessmeier about 7 years ago

I've set up rsyncd on openqa.suse.de with a config file similar to the one santi provided.

rsync -av rsync://openqa.suse.de/tests . is working

TODO: put it into a salt recipe (nsinger and I will work on this)

Actions #19

Updated by coolo about 7 years ago

  • Checklist item changed from to [x] Test distribution must be cached through rsync
Actions #20

Updated by coolo about 7 years ago

I finished up the PR, but my faith in the cache class is small. The locking looks broken IMO: a full download happens between read_db and write_db, so another worker will have overwritten the content in the meantime. The locking needs to be around atomic operations like inserts and deletes.

And we need a size limit, not an asset limit. For now we will survive, as our disks are hopefully large enough for 50 ISOs.
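
One possible shape of both points (locking only around inserts and deletes, and a limit in bytes instead of a number of assets) is sketched below. It assumes the index is an SQLite file; the table layout and the names open_index and register are hypothetical, not the actual cache class. The download itself would happen before register() is called, outside any lock.

```python
import sqlite3


def open_index(db_path):
    """Open (or create) the hypothetical asset index."""
    conn = sqlite3.connect(db_path, isolation_level=None)  # explicit transactions only
    conn.execute(
        "CREATE TABLE IF NOT EXISTS assets "
        "(name TEXT PRIMARY KEY, size INTEGER NOT NULL, last_use INTEGER NOT NULL)"
    )
    return conn


def register(conn, name, size, now, limit_bytes):
    """Record a finished download and evict old entries in one short transaction."""
    evicted = []
    conn.execute("BEGIN IMMEDIATE")  # write lock is held only for the bookkeeping
    try:
        conn.execute(
            "INSERT OR REPLACE INTO assets (name, size, last_use) VALUES (?, ?, ?)",
            (name, size, now),
        )
        total = conn.execute("SELECT COALESCE(SUM(size), 0) FROM assets").fetchone()[0]
        rows = conn.execute(
            "SELECT name, size FROM assets WHERE name != ? ORDER BY last_use ASC",
            (name,),
        ).fetchall()
        for old_name, old_size in rows:
            if total <= limit_bytes:
                break
            conn.execute("DELETE FROM assets WHERE name = ?", (old_name,))
            total -= old_size
            evicted.append(old_name)
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return evicted
```

The caller removes the files named in the returned list. Because the transaction covers only the insert and the deletes, a second worker process is blocked for milliseconds rather than for the duration of a download.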

Actions #21

Updated by coolo about 7 years ago

  • Related to action #17790: Need to detach HDDs before upload added
Actions #22

Updated by coolo about 7 years ago

The cache.db needs to move to CACHEDIR. I tried to add it to the checklist, but failed.

Actions #23

Updated by SLindoMansilla about 7 years ago

  • Blocks action #17812: [sles][functional][opensuse] Test fails in autoyast/installation: error fetching autoinst.xml (404) added
Actions #24

Updated by nicksinger about 7 years ago

  • Checklist item changed from to [x] Caching method should be LRU
Actions #25

Updated by nicksinger about 7 years ago

  • Checklist item changed from to [ ] Caching method should be LRU
Actions #26

Updated by nicksinger about 7 years ago

Since the cleanup of the cache is broken and workers run out of space, I've added the following cronjob to all workers:

0       4       *       *       *       if [ $(ls -l /var/lib/openqa/cache/* | wc -l) -gt 50 ]; then rcopenqa-worker stop && rm -rf /var/lib/openqa/cache/* && rcopenqa-worker start ;fi
Actions #27

Updated by nicksinger about 7 years ago

The current commit broke more things on the workers, so we had to apply a hotfix on them.
The file "/usr/share/openqa/lib/OpenQA/Worker/Cache.pm" (line 58 at the time of writing) now contains:

$db_file = catdir($location, 'cache.db');

Instead of: $db_file = catdir($location, $db_file);

Actions #28

Updated by szarate about 7 years ago

https://github.com/os-autoinst/openQA/pull/1279 has the hotfix

nicksinger wrote:

The current commit broke more things on the workers, so we had to apply a hotfix on them.
The file "/usr/share/openqa/lib/OpenQA/Worker/Cache.pm" (line 58 at the time of writing) now contains:

$db_file = catdir($location, 'cache.db');

Instead of: $db_file = catdir($location, $db_file);

Actions #29

Updated by szarate about 7 years ago

Second hotfix for the caching: limit the number of assets.

The following gist has a new hotfix that just ensures the workers keep in the cache directory only the elements that are present in the cache.db file:

https://gist.github.com/foursixnine/623397e6f3231c3dc8a0c435d874e5df
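
The idea of keeping only the files that the index knows about can be sketched as follows. This is not the code from the gist: the function purge_untracked and its parameters are hypothetical, and the set of tracked names is assumed to have been read from cache.db beforehand.

```python
import os


def purge_untracked(cache_dir, tracked_names, keep=("cache.db",)):
    """Delete regular files in cache_dir that the index does not list; return their names."""
    removed = []
    for entry in sorted(os.listdir(cache_dir)):
        path = os.path.join(cache_dir, entry)
        # Skip the index file itself, tracked assets, and anything that is not a plain file.
        if entry in keep or entry in tracked_names or not os.path.isfile(path):
            continue
        os.remove(path)
        removed.append(entry)
    return removed
```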

Actions #30

Updated by okurz about 7 years ago

Please take a look at #17812, which is blocked by this issue.

Actions #31

Updated by szarate about 7 years ago

  • Related to action #17998: [tools] "No space left on device" on openqaworker7 added
Actions #32

Updated by coolo about 7 years ago

  • Blocks deleted (action #17812: [sles][functional][opensuse] Test fails in autoyast/installation: error fetching autoinst.xml (404))
Actions #33

Updated by RBrownSUSE about 7 years ago

  • Target version changed from Milestone 6 to Milestone 7
Actions #34

Updated by mgriessmeier about 7 years ago

  • Checklist item changed from [x] Caching should allow syncronization of assets (HDD and ISO) in the first stage., [x] Caching should be enabled if the /share/ directory is not present in the worker., [x] Asset download should be done via HTTP, [x] Caching should be done on the worker host, who should keep track of it's assets and clean up when needed, [ ] Caching method should be LRU, [x] Test distribution must be cached through rsync to [x] Caching should allow syncronization of assets (HDD and ISO) in the first stage., [x] Caching should be enabled if the /share/ directory is not present in the worker., [x] Asset download should be done via HTTP, [x] Caching should be done on the worker host, who should keep track of it's assets and clean up when needed, [ ] Caching method should be LRU, [x] Test distribution must be cached through rsync, [ ] proper cleanup on s390pb

A quickfix has been applied for s390pb; it should be done properly by accessing the database and checking for unused assets instead of using fuser.
See https://progress.opensuse.org/issues/18608

Actions #35

Updated by okurz almost 7 years ago

  • Blocks action #18072: [tools][opensuse] test fails in "good_buttons" seeing an existing cryptlvm partition when the harddisk should be clean added
Actions #36

Updated by okurz almost 7 years ago

https://openqa.suse.de/tests/905351 shows an issue which I consider serious: https://openqa.suse.de/tests/905351/modules/mediacheck/steps/1/src shows, in lines 28 and following, the workaround that I wanted to try, but https://openqa.suse.de/tests/905351/file/autoinst-log.txt shows that the corresponding part was not executed and that the code corresponds to the old source. So the content on the worker was not updated properly at the start of the test.

Sorry, I was wrong. I think an if condition just evaluated to false here.

Actions #37

Updated by coolo almost 7 years ago

  • Related to action #18730: openqaworker1 is causing weird test failures added
Actions #38

Updated by mgriessmeier almost 7 years ago

  • Due date set to 2017-05-11
  • Start date changed from 2017-04-18 to 2017-04-29

due to changes in a related task

Actions #39

Updated by szarate almost 7 years ago

  • Checklist item changed from to [x] Caching method should be LRU
Actions #40

Updated by szarate almost 7 years ago

  • Checklist item changed from to [x] proper cleanup on s390pb
Actions #41

Updated by szarate almost 7 years ago

  • Status changed from In Progress to Resolved

The PR is already merged and the caching has been deployed.

Actions #42

Updated by okurz almost 7 years ago

  • Blocks action #19274: cache download runs into 502 or 400 added
Actions #43

Updated by okurz over 6 years ago

  • Due date set to 2018-01-30

due to changes in a related task

Actions #44

Updated by okurz over 6 years ago

  • Due date changed from 2018-01-30 to 2018-02-13

due to changes in a related task

Actions #45

Updated by okurz about 6 years ago

  • Due date changed from 2018-02-13 to 2017-05-11

due to changes in a related task

Actions #46

Updated by okurz about 6 years ago

  • Due date set to 2018-02-08

due to changes in a related task

Actions #47

Updated by okurz almost 6 years ago

  • Due date set to 2017-04-29

due to changes in a related task

Actions #48

Updated by okurz almost 6 years ago

  • Due date set to 2017-04-29

due to changes in a related task

Actions #49

Updated by okurz almost 6 years ago

  • Due date changed from 2017-04-29 to 2018-07-03

due to changes in a related task

Actions #50

Updated by mgriessmeier almost 6 years ago

  • Due date changed from 2018-07-03 to 2018-07-31

due to changes in a related task

Actions #51

Updated by okurz almost 6 years ago

  • Due date changed from 2018-07-31 to 2017-04-29

due to changes in a related task

Actions #52

Updated by okurz almost 6 years ago

  • Due date set to 2017-04-29

due to changes in a related task

Actions #53

Updated by okurz over 5 years ago

  • Due date set to 2017-04-29

due to changes in a related task

Actions #54

Updated by okurz almost 5 years ago

  • Due date set to 2017-04-29

due to changes in a related task

Actions #55

Updated by SLindoMansilla over 4 years ago

  • Due date set to 2017-04-29

due to changes in a related task

Actions #56

Updated by SLindoMansilla over 4 years ago

  • Due date set to 2017-04-29

due to changes in a related task

Actions #57

Updated by SLindoMansilla over 4 years ago

  • Due date set to 2017-04-29

due to changes in a related task

Actions #58

Updated by mgriessmeier over 4 years ago

  • Due date set to 2017-04-29

due to changes in a related task

Actions #59

Updated by mgriessmeier over 4 years ago

  • Due date set to 2017-04-29

due to changes in a related task
