action #17574

[tools]Add caching/syncing of assets

Added by szarate over 5 years ago. Updated over 1 year ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Feature requests
Target version:
Start date: 2017-04-29
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Difficulty:

Description

Due to problems generated by the heavy NFS usage, actual asset caching needs to be implemented in openQA.

After a few discussions we concluded that the caching of assets is also tied to the caching of the test distribution; the test distribution should be cached via rsync from the openQA web UI.

Currently the only criterion for enabling the caching is that the share directory does not exist on the worker; we should come up with a less hard-coded approach.

See: https://github.com/os-autoinst/openQA/pull/1256


Checklist

  • Caching should allow synchronization of assets (HDD and ISO) in the first stage.
  • Caching should be enabled if the /share/ directory is not present in the worker.
  • Asset download should be done via HTTP
  • Caching should be done on the worker host, which should keep track of its assets and clean up when needed
  • Caching method should be LRU
  • Test distribution must be cached through rsync
  • proper cleanup on s390pb

Subtasks

action #18880: fix interactive mode with caching (Closed)

action #18882: job hangs in "pre-processing" for multiple hours, no screen display -> caching locked (Rejected)


Related issues

Related to openQA Tests - action #17594: [tools]Missing characters in the middle of type_string/assert_script_run (Ninja Keys) (Resolved, 2017-03-07)

Related to openQA Project - action #17790: Need to detach HDDs before upload (Resolved, 2017-03-19)

Related to openQA Project - action #17998: [tools] "No space left on device" on openqaworker7 (Rejected, 2017-03-24)

Related to openQA Project - action #18730: openqaworker1 is causing weird test failures (Rejected, 2017-04-24)

Blocks openQA Tests - action #18072: [tools][opensuse] test fails in "good_buttons" seeing an existing cryptlvm partition when the harddisk should be clean (Resolved, 2017-03-28)

Blocks openQA Project - action #19274: cache download runs into 502 or 400 (Resolved, 2017-05-19)

Follows openQA Project - action #17760: Upload .chksum files together with assets (Rejected, 2017-03-16)

History

#1 Updated by szarate over 5 years ago

  • Subject changed from Add caching of Assets to Add caching/sincyng of assets
  • Description updated (diff)

#2 Updated by szarate over 5 years ago

  • Subject changed from Add caching/sincyng of assets to Add caching/syncing of assets

#3 Updated by oholecek over 5 years ago

Hmm.. I have a couple of questions.
Why is the caching enabled only when the share directory is missing? What about using a different directory name, which is possible now?
Is there a problem with obtaining files through NFS as well? (I imagine random I/O would drop once caching is used on workers.) Why limit it to HTTP only? (Is it just because openQA can already do that?)
Are we so space constrained on the worker machines that we can't sync the whole share directory to all workers?

During hackweek I read about https://github.com/chrislusf/seaweedfs, which is an implementation of Facebook's Haystack ( www.usenix.org/event/osdi10/tech/full_papers/Beaver.pdf ). True, it is intended for smaller files, but in large numbers. The problem is that I think it is still experimental.

How about web seeding for BitTorrent downloads between workers? It would help during the peak when a new product is published, IMO.

#4 Updated by coolo over 5 years ago

We don't need any fancy stuff, and none of the problems BitTorrent solves is something we have. We need to separate downloading assets (and needles) from actually using them, to avoid IO bottlenecks on the VM affecting the test.

And to give you a number: our workers have 300GB of NVMe and our VM has 6TB of data. So no, we can't, and no, we don't want to sync all of it. Dropping NFS completely in favor of using a worker directory as a sparse copy (do not call it a cache, it's misleading) has the advantage of having unique filenames all over the stack. Whatever is /var/lib/openqa/share/X is CACHEDIR/X on the worker.

But if we mix /var/lib/openqa/share and CACHEDIR, we get in trouble, as the needle mis-caching demonstrated all too prominently. So I'm against any half-baked solution like copying off NFS.
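
For illustration, the naming rule from this comment (whatever is /var/lib/openqa/share/X is CACHEDIR/X on the worker) could be sketched as below. This is only a sketch and not openQA code: the function name cache_path is hypothetical, and /var/lib/openqa/cache is assumed as CACHEDIR only because that path appears in a later comment on this ticket.

```python
from pathlib import PurePosixPath

# Share directory on the web UI host, as named in the comment above.
SHARE_DIR = PurePosixPath("/var/lib/openqa/share")


def cache_path(share_path: str, cache_dir: str) -> PurePosixPath:
    """Map a path under the share directory to the same relative path
    under the worker's cache directory (hypothetical helper)."""
    # relative_to() raises ValueError if share_path is not below SHARE_DIR,
    # so paths from outside the share are never mixed into the cache.
    relative = PurePosixPath(share_path).relative_to(SHARE_DIR)
    return PurePosixPath(cache_dir) / relative


print(cache_path("/var/lib/openqa/share/factory/iso/example.iso",
                 "/var/lib/openqa/cache"))
```

Keeping the relative part identical on both sides is what gives every file a single name across the stack; rejecting paths outside the share is one way to avoid the mixing problem described above.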

#5 Updated by coolo over 5 years ago

Oh, and before that gets lost: I'm also against researching experimental file systems at this point. We have a simple problem and we need a simple solution for it, ASAP.

#6 Updated by szarate over 5 years ago

  • Related to action #17594: [tools]Missing characters in the middle of type_string/assert_script_run (Ninja Keys) added

#7 Updated by RBrownSUSE over 5 years ago

  • Subject changed from Add caching/syncing of assets to [tools]Add caching/syncing of assets

#8 Updated by szarate over 5 years ago

  • Follows action #17760: Upload .chksum files together with assets added

#9 Updated by szarate over 5 years ago

  • Checklist item changed from to [ ] Caching should allow synchronization of assets (HDD and ISO) in the first stage., [ ] Caching should be done on the worker, and each worker should keep track of the assets in its pool, [ ] Caching should be enabled if the /share/ directory is not present in the worker., [ ] Asset download should be done via HTTP, [ ] Caching method should be LRU, [ ] Test distribution must be cached through rsync
  • Description updated (diff)
  • Status changed from New to In Progress
  • Start date deleted (2017-03-17)
  • % Done changed from 0 to 40

#10 Updated by szarate over 5 years ago

  • Checklist item changed from to [x] Caching should allow synchronization of assets (HDD and ISO) in the first stage.

#11 Updated by szarate over 5 years ago

  • Checklist item changed from to [x] Caching should be done on the worker, and each worker should keep track of the assets in its pool

#12 Updated by szarate over 5 years ago

  • Checklist item changed from [x] Caching should allow synchronization of assets (HDD and ISO) in the first stage., [x] Caching should be done on the worker, and each worker should keep track of the assets in its pool, [ ] Caching should be enabled if the /share/ directory is not present in the worker., [ ] Asset download should be done via HTTP, [ ] Caching method should be LRU, [ ] Test distribution must be cached through rsync to [x] Caching should allow synchronization of assets (HDD and ISO) in the first stage., [ ] Caching should be enabled if the /share/ directory is not present in the worker., [ ] Asset download should be done via HTTP, [ ] Caching should be done on the worker host, which should keep track of its assets and clean up when needed, [ ] Caching method should be LRU, [ ] Test distribution must be cached through rsync

#13 Updated by szarate over 5 years ago

  • Checklist item changed from to [x] Caching should be enabled if the /share/ directory is not present in the worker.

#14 Updated by szarate over 5 years ago

  • Checklist item changed from to [x] Asset download should be done via HTTP

#15 Updated by szarate over 5 years ago

  • Checklist item changed from to [x] Caching should be done on the worker host, which should keep track of its assets and clean up when needed

#16 Updated by szarate over 5 years ago

  • Description updated (diff)

#17 Updated by szarate over 5 years ago

To test locally, I'm using the following rsyncd configuration:

gid = nobody
uid = nobody
read only = true
use chroot = true
transfer logging = true
log format = %h %o %f %l %b
log file = /var/log/rsyncd.log
pid file = /var/run/rsyncd.pid
slp refresh = 300
use slp = false
[openqa-tests]
path = /var/lib/openqa/share/tests
comment = OpenQA Test Distributions

I'm currently starting the server with rcrsyncd start.

#18 Updated by mgriessmeier over 5 years ago

I've set up rsyncd on openqa.suse.de with a config file similar to the one santi provided.

rsync -av rsync://openqa.suse.de/tests . is working.

TODO: put it into a salt recipe (nsinger and me will work on this)

#19 Updated by coolo over 5 years ago

  • Checklist item changed from to [x] Test distribution must be cached through rsync

#20 Updated by coolo over 5 years ago

I finished up the PR, but my faith in the cache class is small. The locking looks broken IMO: a full download happens between read_db and write_db, so another worker will have overwritten the content in between. The locking needs to be around atomic operations like inserts and deletes.
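
As a rough sketch of what "locking around atomic operations" could look like: the example below assumes, purely for illustration, that the bookkeeping lives in an SQLite database. The ticket only names cache.db, read_db and write_db, so the table layout and the function names open_db, claim and finish are hypothetical and not taken from Cache.pm. The write lock is held only for the single insert or update, never across the download.

```python
import sqlite3
import time


def open_db(path: str) -> sqlite3.Connection:
    # isolation_level=None puts the connection in autocommit mode so that
    # the transactions below can be started and ended explicitly.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS assets ("
        "name TEXT PRIMARY KEY, state TEXT NOT NULL, last_use REAL NOT NULL)"
    )
    return conn


def claim(conn: sqlite3.Connection, name: str) -> bool:
    """Atomically record that this worker will download `name`.
    Returns False if another worker already registered the asset."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock for this insert only
    try:
        cur = conn.execute(
            "INSERT OR IGNORE INTO assets (name, state, last_use) "
            "VALUES (?, 'downloading', ?)",
            (name, time.time()),
        )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return cur.rowcount == 1


def finish(conn: sqlite3.Connection, name: str) -> None:
    """Mark the asset as ready in a second, equally short transaction."""
    conn.execute("BEGIN IMMEDIATE")
    conn.execute(
        "UPDATE assets SET state = 'ready', last_use = ? WHERE name = ?",
        (time.time(), name),
    )
    conn.execute("COMMIT")
```

A worker would call claim(), download the file without holding any lock, and then call finish(); a second worker whose claim() returns False knows the asset is already being handled.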

And we need a size limit not an asset limit. For now we will survive - as our disks are hopefully large enough for 50 ISOs.
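
To illustrate the difference between an asset-count limit and a size limit, here is a minimal sketch of least-recently-used eviction bounded by total bytes. It is not the openQA implementation; the function name select_evictions and the shape of the input dictionary are assumptions made for the example.

```python
from typing import Dict, List, Tuple


def select_evictions(assets: Dict[str, Tuple[int, float]],
                     limit_bytes: int) -> List[str]:
    """Return the asset names to delete, least recently used first,
    until the remaining total size fits within limit_bytes.

    `assets` maps name -> (size in bytes, last-use timestamp).
    """
    total = sum(size for size, _ in assets.values())
    evict: List[str] = []
    # Walk the assets from oldest to newest last-use timestamp.
    for name, (size, _) in sorted(assets.items(), key=lambda item: item[1][1]):
        if total <= limit_bytes:
            break
        evict.append(name)
        total -= size
    return evict
```

With a byte limit, one very large HDD image counts for what it actually occupies, whereas a fixed count of 50 assets can overflow the disk or waste space depending on the mix of file sizes.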

#21 Updated by coolo over 5 years ago

  • Related to action #17790: Need to detach HDDs before upload added

#22 Updated by coolo over 5 years ago

The cache.db needs to move to CACHEDIR. I tried to add it to the checklist, but failed

#23 Updated by SLindoMansilla over 5 years ago

  • Blocks action #17812: [sles][functional][opensuse] Test fails in autoyast/installation: error fetching autoinst.xml (404) added

#24 Updated by nicksinger over 5 years ago

  • Checklist item changed from to [x] Caching method should be LRU

#25 Updated by nicksinger over 5 years ago

  • Checklist item changed from to [ ] Caching method should be LRU

#26 Updated by nicksinger over 5 years ago

Since the cleanup of the cache is broken and workers run out of space, I've added the following cron job to all workers:

0       4       *       *       *       if [ $(ls -l /var/lib/openqa/cache/* | wc -l) -gt 50 ]; then rcopenqa-worker stop && rm -rf /var/lib/openqa/cache/* && rcopenqa-worker start ;fi

#27 Updated by nicksinger over 5 years ago

The current commit broke more things on the workers, so we had to apply a hotfix on them.
The file "/usr/share/openqa/lib/OpenQA/Worker/Cache.pm" (line 58 at the time of writing) now contains:

$db_file = catdir($location, 'cache.db');

Instead of: $db_file = catdir($location, $db_file);

#28 Updated by szarate over 5 years ago

https://github.com/os-autoinst/openQA/pull/1279 has the hotfix

nicksinger wrote:

The current commit broke more things on the workers, so we had to apply a hotfix on them.
The file "/usr/share/openqa/lib/OpenQA/Worker/Cache.pm" (line 58 at the time of writing) now contains:

$db_file = catdir($location, 'cache.db');

Instead of: $db_file = catdir($location, $db_file);

#29 Updated by szarate about 5 years ago

Second hotfix for the caching: limit the number of assets.

The following gist contains a new hotfix that ensures the workers' cache directory holds only the elements that are present in the cache.db file:

https://gist.github.com/foursixnine/623397e6f3231c3dc8a0c435d874e5df
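
The gist itself is not reproduced here. As a hedged illustration of the idea only (keep in the cache directory nothing but what the database knows about), a sketch could look like the following. The function name prune_untracked, its parameters, and the choice to always keep cache.db are assumptions for the example and are not taken from the gist.

```python
import os
from typing import Iterable, List


def prune_untracked(cache_dir: str, tracked: Iterable[str],
                    keep: Iterable[str] = ("cache.db",)) -> List[str]:
    """Delete regular files in cache_dir whose names are neither tracked
    nor listed in `keep`; return the names that were removed."""
    allowed = set(tracked) | set(keep)
    removed: List[str] = []
    for entry in sorted(os.listdir(cache_dir)):
        path = os.path.join(cache_dir, entry)
        # Only plain files are touched; subdirectories are left alone.
        if os.path.isfile(path) and entry not in allowed:
            os.remove(path)
            removed.append(entry)
    return removed
```

Here `tracked` would be the list of asset names read from the database. Unlike the cron job quoted earlier, this removes only the untracked leftovers rather than wiping the whole directory.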

#30 Updated by okurz about 5 years ago

Please take a look at #17812, which is blocked by this issue.

#31 Updated by szarate about 5 years ago

  • Related to action #17998: [tools] "No space left on device" on openqaworker7 added

#32 Updated by coolo about 5 years ago

  • Blocks deleted (action #17812: [sles][functional][opensuse] Test fails in autoyast/installation: error fetching autoinst.xml (404))

#33 Updated by RBrownSUSE about 5 years ago

  • Target version changed from Milestone 6 to Milestone 7

#34 Updated by mgriessmeier about 5 years ago

  • Checklist item changed from [x] Caching should allow synchronization of assets (HDD and ISO) in the first stage., [x] Caching should be enabled if the /share/ directory is not present in the worker., [x] Asset download should be done via HTTP, [x] Caching should be done on the worker host, which should keep track of its assets and clean up when needed, [ ] Caching method should be LRU, [x] Test distribution must be cached through rsync to [x] Caching should allow synchronization of assets (HDD and ISO) in the first stage., [x] Caching should be enabled if the /share/ directory is not present in the worker., [x] Asset download should be done via HTTP, [x] Caching should be done on the worker host, which should keep track of its assets and clean up when needed, [ ] Caching method should be LRU, [x] Test distribution must be cached through rsync, [ ] proper cleanup on s390pb

A quick fix has been applied for s390pb. It should be done properly by accessing the database and checking for unused assets instead of using fuser.
See https://progress.opensuse.org/issues/18608

#35 Updated by okurz about 5 years ago

  • Blocks action #18072: [tools][opensuse] test fails in "good_buttons" seeing an existing cryptlvm partition when the harddisk should be clean added

#36 Updated by okurz about 5 years ago

https://openqa.suse.de/tests/905351 shows an issue which I consider serious. https://openqa.suse.de/tests/905351/modules/mediacheck/steps/1/src shows, in lines 28 and following, the workaround that I wanted to try. However, https://openqa.suse.de/tests/905351/file/autoinst-log.txt shows that the corresponding part was not executed and that the code corresponds to the old source, so the content on the worker was not updated properly at the start of the test.

Sorry, I was wrong. I think an if just evaluated to false here.

#37 Updated by coolo about 5 years ago

  • Related to action #18730: openqaworker1 is causing weird test failures added

#38 Updated by mgriessmeier about 5 years ago

  • Due date set to 2017-05-11
  • Start date changed from 2017-04-18 to 2017-04-29

due to changes in a related task

#39 Updated by szarate about 5 years ago

  • Checklist item changed from to [x] Caching method should be LRU

#40 Updated by szarate about 5 years ago

  • Checklist item changed from to [x] proper cleanup on s390pb

#41 Updated by szarate about 5 years ago

  • Status changed from In Progress to Resolved

The PR is already merged and the caching has been deployed.

#42 Updated by okurz about 5 years ago

  • Blocks action #19274: cache download runs into 502 or 400 added

#43 Updated by okurz over 4 years ago

  • Due date set to 2018-01-30

due to changes in a related task

#44 Updated by okurz over 4 years ago

  • Due date changed from 2018-01-30 to 2018-02-13

due to changes in a related task

#45 Updated by okurz over 4 years ago

  • Due date changed from 2018-02-13 to 2017-05-11

due to changes in a related task

#46 Updated by okurz over 4 years ago

  • Due date set to 2018-02-08

due to changes in a related task

#47 Updated by okurz about 4 years ago

  • Due date set to 2017-04-29

due to changes in a related task

#48 Updated by okurz about 4 years ago

  • Due date set to 2017-04-29

due to changes in a related task

#49 Updated by okurz about 4 years ago

  • Due date changed from 2017-04-29 to 2018-07-03

due to changes in a related task

#50 Updated by mgriessmeier almost 4 years ago

  • Due date changed from 2018-07-03 to 2018-07-31

due to changes in a related task

#51 Updated by okurz almost 4 years ago

  • Due date changed from 2018-07-31 to 2017-04-29

due to changes in a related task

#52 Updated by okurz almost 4 years ago

  • Due date set to 2017-04-29

due to changes in a related task

#53 Updated by okurz almost 4 years ago

  • Due date set to 2017-04-29

due to changes in a related task

#54 Updated by okurz about 3 years ago

  • Due date set to 2017-04-29

due to changes in a related task

#55 Updated by SLindoMansilla almost 3 years ago

  • Due date set to 2017-04-29

due to changes in a related task

#56 Updated by SLindoMansilla almost 3 years ago

  • Due date set to 2017-04-29

due to changes in a related task

#57 Updated by SLindoMansilla almost 3 years ago

  • Due date set to 2017-04-29

due to changes in a related task

#58 Updated by mgriessmeier almost 3 years ago

  • Due date set to 2017-04-29

due to changes in a related task

#59 Updated by mgriessmeier over 2 years ago

  • Due date set to 2017-04-29

due to changes in a related task
