action #17574
closed
[tools]Add caching/syncing of assets
100%
Description
Due to problems caused by heavy NFS usage, actual asset caching needs to be implemented in openQA.
After a few discussions we concluded that caching of assets is also tied to caching of the test distribution; the test distribution should be cached via rsync from the openQA web UI.
Currently the only criterion for enabling caching is that the share directory does not exist on the worker; we should come up with a less hard-coded approach.
Checklist
- Caching should allow synchronization of assets (HDD and ISO) in the first stage.
- Caching should be enabled if the /share/ directory is not present in the worker.
- Asset download should be done via HTTP
- Caching should be done on the worker host, which should keep track of its assets and clean up when needed
- Caching method should be LRU
- Test distribution must be cached through rsync
- proper cleanup on s390pb
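As an illustration of the "Caching method should be LRU" item, here is a minimal Python sketch of least-recently-used bookkeeping. It is not the openQA implementation (the worker code is Perl); the class name LRUAssetIndex, the touch method and the max_assets parameter are all assumptions made for this example.

```python
from collections import OrderedDict


class LRUAssetIndex:
    """Hypothetical index: tracks cached assets, evicts least recently used."""

    def __init__(self, max_assets):
        self.max_assets = max_assets
        self._assets = OrderedDict()  # asset name -> size in bytes

    def touch(self, name, size):
        """Record a use of `name` and return the names evicted to make room."""
        if name in self._assets:
            # A repeated use makes the asset the most recently used entry.
            self._assets.move_to_end(name)
        self._assets[name] = size
        evicted = []
        while len(self._assets) > self.max_assets:
            oldest, _ = self._assets.popitem(last=False)
            evicted.append(oldest)
        return evicted

    def names(self):
        """Asset names ordered from least to most recently used."""
        return list(self._assets)
```

The caller would delete the files named in the returned list; the index itself only does the bookkeeping.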
Updated by szarate almost 8 years ago
- Subject changed from Add caching of Assets to Add caching/sincyng of assets
- Description updated (diff)
Updated by szarate almost 8 years ago
- Subject changed from Add caching/sincyng of assets to Add caching/syncing of assets
Updated by oholecek almost 8 years ago
Hmm... I have a couple of questions.
Why is caching enabled only when the share directory is missing? What about using a different directory name, which is possible now?
Is there a problem with also obtaining files through NFS? (I imagine random I/O would drop once caching is used on workers.) Why limit it to HTTP only? (Is it just because openQA can already do that?)
Are we so space constrained on the worker machines that we can't sync the whole share directory to all workers?
During hackweek I read about https://github.com/chrislusf/seaweedfs, which is an implementation of Facebook's Haystack ( www.usenix.org/event/osdi10/tech/full_papers/Beaver.pdf ). True, it is intended for smaller files, but in large numbers. The problem is that I think it is still experimental.
How about web seeding for BitTorrent downloads between workers? It would help during the peak when a new product is published, IMO.
Updated by coolo almost 8 years ago
We don't need any fancy stuff, and none of the problems BitTorrent solves is something we have. We need to separate downloading assets (and needles) from actually using them - to avoid IO bottlenecks on the VM affecting the test.
And to give you a number: our workers have 300GB of NVMe and our VM has 6TB of data. So no, we can't - and no, we don't want to sync all of it. And dropping NFS completely in favor of using a worker directory as a sparse copy (do not call it cache - it's misleading) has the advantage of having unique filenames all over the stack. Whatever is /var/lib/openqa/share/X is CACHEDIR/X on the worker.
But if we mix /var/lib/openqa/share and CACHEDIR, we get in trouble - as the needle (mis-)caching demonstrated all too prominently. So I'm against any half-baked solution like copying off NFS.
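The share-to-CACHEDIR mapping described above can be sketched in a few lines of Python. This is only an illustration: the function name to_cache_path is hypothetical, and the cache directory is passed in by the caller rather than taken from any real openQA setting.

```python
from pathlib import PurePosixPath

SHARE_DIR = PurePosixPath("/var/lib/openqa/share")


def to_cache_path(share_path, cache_dir):
    """Map SHARE_DIR/X to cache_dir/X (hypothetical helper).

    Raises ValueError for paths that do not start with SHARE_DIR, so the
    two directory trees are never mixed silently.
    """
    relative = PurePosixPath(share_path).relative_to(SHARE_DIR)
    return PurePosixPath(cache_dir) / relative
```

Keeping the relative part X identical on both sides is what gives the same filename everywhere in the stack.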
Updated by coolo almost 8 years ago
Oh - and before that gets lost: I'm also against researching experimental file systems at this point. We have a simple problem and we need a simple solution for it - ASAP.
Updated by szarate almost 8 years ago
- Related to action #17594: [tools]Missing characters in the middle of type_string/assert_script_run (Ninja Keys) added
Updated by RBrownSUSE almost 8 years ago
- Subject changed from Add caching/syncing of assets to [tools]Add caching/syncing of assets
Updated by szarate almost 8 years ago
- Follows action #17760: Upload .chksum files together with assets added
Updated by szarate almost 8 years ago
- Checklist item changed from to [ ] Caching should allow synchronization of assets (HDD and ISO) in the first stage., [ ] Caching should be done on the worker, and each worker should keep track of the assets in its pool, [ ] Caching should be enabled if the /share/ directory is not present in the worker., [ ] Asset download should be done via HTTP, [ ] Caching method should be LRU, [ ] Test distribution must be cached through rsync
- Description updated (diff)
- Status changed from New to In Progress
- Start date deleted (2017-03-17)
- % Done changed from 0 to 40
Updated by szarate almost 8 years ago
- Checklist item changed from to [x] Caching should allow synchronization of assets (HDD and ISO) in the first stage.
Updated by szarate almost 8 years ago
- Checklist item changed from to [x] Caching should be done on the worker, and each worker should keep track of the assets in its pool
Updated by szarate almost 8 years ago
- Checklist item changed from [x] Caching should allow synchronization of assets (HDD and ISO) in the first stage., [x] Caching should be done on the worker, and each worker should keep track of the assets in its pool, [ ] Caching should be enabled if the /share/ directory is not present in the worker., [ ] Asset download should be done via HTTP, [ ] Caching method should be LRU, [ ] Test distribution must be cached through rsync to [x] Caching should allow synchronization of assets (HDD and ISO) in the first stage., [ ] Caching should be enabled if the /share/ directory is not present in the worker., [ ] Asset download should be done via HTTP, [ ] Caching should be done on the worker host, which should keep track of its assets and clean up when needed, [ ] Caching method should be LRU, [ ] Test distribution must be cached through rsync
Updated by szarate almost 8 years ago
- Checklist item changed from to [x] Caching should be enabled if the /share/ directory is not present in the worker.
Updated by szarate almost 8 years ago
- Checklist item changed from to [x] Asset download should be done via HTTP
Updated by szarate almost 8 years ago
- Checklist item changed from to [x] Caching should be done on the worker host, which should keep track of its assets and clean up when needed
Updated by szarate almost 8 years ago
To test locally, I'm using the following rsyncd configuration:
gid = nobody
uid = nobody
read only = true
use chroot = true
transfer logging = true
log format = %h %o %f %l %b
log file = /var/log/rsyncd.log
pid file = /var/run/rsyncd.pid
slp refresh = 300
use slp = false
[openqa-tests]
path = /var/lib/openqa/share/tests
comment = OpenQA Test Distributions
I currently start the server with rcrsyncd start
Updated by mgriessmeier almost 8 years ago
I've set up rsyncd on openqa.suse.de with a config file similar to the one santi provided
rsync -av rsync://openqa.suse.de/tests . is working
TODO: put it into a salt recipe (nsinger and me will work on this)
Updated by coolo almost 8 years ago
- Checklist item changed from to [x] Test distribution must be cached through rsync
Updated by coolo almost 8 years ago
I finished up the PR, but my faith in the cache class is small. The locking looks broken IMO: a full download happens between read_db and write_db, so another worker may have overwritten the content in between. The locking needs to be around atomic operations like inserts and deletes.
And we need a size limit, not an asset limit. For now we will survive - as our disks are hopefully large enough for 50 ISOs.
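To make the two points above concrete (short atomic database operations instead of a lock held across a download, and a limit on total size instead of on the number of assets), here is a hedged Python sketch using SQLite. It is not the Cache.pm code: the table layout, the column names and the functions open_index, register and purge are assumptions invented for this example.

```python
import sqlite3


def open_index(path=":memory:"):
    """Open the (hypothetical) asset index; transactions are managed by hand."""
    db = sqlite3.connect(path, isolation_level=None)
    db.execute(
        "CREATE TABLE IF NOT EXISTS assets ("
        "filename TEXT PRIMARY KEY, size INTEGER NOT NULL, "
        "last_use INTEGER NOT NULL)"
    )
    return db


def register(db, filename, size, now):
    """Insert or refresh one asset; a single statement is atomic by itself."""
    db.execute(
        "INSERT OR REPLACE INTO assets (filename, size, last_use) "
        "VALUES (?, ?, ?)",
        (filename, size, now),
    )


def purge(db, limit_bytes):
    """Drop least recently used rows until the total size fits the limit."""
    removed = []
    db.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        total = db.execute(
            "SELECT COALESCE(SUM(size), 0) FROM assets"
        ).fetchone()[0]
        rows = db.execute(
            "SELECT filename, size FROM assets ORDER BY last_use ASC"
        ).fetchall()
        for filename, size in rows:
            if total <= limit_bytes:
                break
            db.execute("DELETE FROM assets WHERE filename = ?", (filename,))
            total -= size
            removed.append(filename)
        db.execute("COMMIT")
    except Exception:
        db.execute("ROLLBACK")
        raise
    return removed
```

In this sketch the download itself would happen outside any transaction; only the short register and purge calls touch the database, so another worker cannot interleave in the middle of them.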
Updated by coolo almost 8 years ago
- Related to action #17790: Need to detach HDDs before upload added
Updated by coolo over 7 years ago
The cache.db needs to move to CACHEDIR. I tried to add this to the checklist, but failed.
Updated by SLindoMansilla over 7 years ago
- Blocks action #17812: [sles][functional][opensuse] Test fails in autoyast/installation: error fetching autoinst.xml (404) added
Updated by nicksinger over 7 years ago
- Checklist item changed from to [x] Caching method should be LRU
Updated by nicksinger over 7 years ago
- Checklist item changed from to [ ] Caching method should be LRU
Updated by nicksinger over 7 years ago
Since the cleanup of the cache is broken and workers run out of space, I've added the following cron job to all workers:
0 4 * * * if [ $(ls -l /var/lib/openqa/cache/* | wc -l) -gt 50 ]; then rcopenqa-worker stop && rm -rf /var/lib/openqa/cache/* && rcopenqa-worker start ;fi
Updated by nicksinger over 7 years ago
The current commit broke more things on the workers, so we had to apply a hotfix to them.
The file "/usr/share/openqa/lib/OpenQA/Worker/Cache.pm" (line 58 at the time of writing) now contains:
$db_file = catdir($location, 'cache.db');
Instead of: $db_file = catdir($location, $db_file);
Updated by szarate over 7 years ago
https://github.com/os-autoinst/openQA/pull/1279 has the hotfix
nicksinger wrote:
The current commit broke more things on the workers, so we had to apply a hotfix to them.
The file "/usr/share/openqa/lib/OpenQA/Worker/Cache.pm" (line 58 at the time of writing) now contains:
$db_file = catdir($location, 'cache.db');
Instead of: $db_file = catdir($location, $db_file);
Updated by szarate over 7 years ago
Second hotfix for the caching: limit the number of assets.
The following gist has a new hotfix that ensures the cache directory on a worker contains only the elements that are present in the cache.db file:
https://gist.github.com/foursixnine/623397e6f3231c3dc8a0c435d874e5df
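The idea behind this hotfix, keeping only files that the database knows about, can be illustrated roughly as follows. This Python sketch is not the content of the gist: the function remove_untracked and its parameters are hypothetical, and the set of tracked names is assumed to have been read from cache.db by the caller.

```python
import os


def remove_untracked(cache_dir, tracked, keep=("cache.db",)):
    """Delete regular files in cache_dir that are neither tracked nor kept.

    `tracked` is the set of file names recorded in the database; `keep`
    protects the database file itself. Returns the removed file names.
    """
    removed = []
    for entry in sorted(os.listdir(cache_dir)):
        path = os.path.join(cache_dir, entry)
        if not os.path.isfile(path):
            continue  # leave subdirectories alone in this sketch
        if entry in tracked or entry in keep:
            continue
        os.remove(path)
        removed.append(entry)
    return removed
```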
Updated by okurz over 7 years ago
Please take a look at #17812, which is blocked by this issue.
Updated by szarate over 7 years ago
- Related to action #17998: [tools] "No space left on device" on openqaworker7 added
Updated by coolo over 7 years ago
- Blocks deleted (action #17812: [sles][functional][opensuse] Test fails in autoyast/installation: error fetching autoinst.xml (404))
Updated by RBrownSUSE over 7 years ago
- Target version changed from Milestone 6 to Milestone 7
Updated by mgriessmeier over 7 years ago
- Checklist item changed from [x] Caching should allow synchronization of assets (HDD and ISO) in the first stage., [x] Caching should be enabled if the /share/ directory is not present in the worker., [x] Asset download should be done via HTTP, [x] Caching should be done on the worker host, which should keep track of its assets and clean up when needed, [ ] Caching method should be LRU, [x] Test distribution must be cached through rsync to [x] Caching should allow synchronization of assets (HDD and ISO) in the first stage., [x] Caching should be enabled if the /share/ directory is not present in the worker., [x] Asset download should be done via HTTP, [x] Caching should be done on the worker host, which should keep track of its assets and clean up when needed, [ ] Caching method should be LRU, [x] Test distribution must be cached through rsync, [ ] proper cleanup on s390pb
A quick fix has been applied for s390pb; it should be done properly by accessing the database and checking for unused assets instead of using fuser.
See https://progress.opensuse.org/issues/18608
Updated by okurz over 7 years ago
- Blocks action #18072: [tools][opensuse] test fails in "good_buttons" seeing an existing cryptlvm partition when the harddisk should be clean added
Updated by okurz over 7 years ago
https://openqa.suse.de/tests/905351 shows an issue which I consider serious: https://openqa.suse.de/tests/905351/modules/mediacheck/steps/1/src shows, in lines 28 and following, the workaround that I wanted to try, but https://openqa.suse.de/tests/905351/file/autoinst-log.txt shows that the corresponding part was not executed and that the code corresponds to the old source. So the content on the worker was not updated properly at the start of the test.
Sorry, I was wrong. I think an if just evaluated to false here.
Updated by coolo over 7 years ago
- Related to action #18730: openqaworker1 is causing weird test failures added
Updated by mgriessmeier over 7 years ago
- Due date set to 2017-05-11
- Start date changed from 2017-04-18 to 2017-04-29
due to changes in a related task
Updated by szarate over 7 years ago
- Checklist item changed from to [x] Caching method should be LRU
Updated by szarate over 7 years ago
- Checklist item changed from to [x] proper cleanup on s390pb
Updated by szarate over 7 years ago
- Status changed from In Progress to Resolved
PR is already merged and the caching has been deployed
Updated by okurz over 7 years ago
- Blocks action #19274: cache download runs into 502 or 400 added
Updated by okurz almost 7 years ago
- Due date changed from 2018-01-30 to 2018-02-13
due to changes in a related task
Updated by okurz almost 7 years ago
- Due date changed from 2018-02-13 to 2017-05-11
due to changes in a related task
Updated by okurz over 6 years ago
- Due date changed from 2017-04-29 to 2018-07-03
due to changes in a related task
Updated by mgriessmeier over 6 years ago
- Due date changed from 2018-07-03 to 2018-07-31
due to changes in a related task
Updated by okurz over 6 years ago
- Due date changed from 2018-07-31 to 2017-04-29
due to changes in a related task
Updated by SLindoMansilla over 5 years ago
- Due date set to 2017-04-29
due to changes in a related task
Updated by mgriessmeier over 5 years ago
- Due date set to 2017-04-29
due to changes in a related task
Updated by mgriessmeier almost 5 years ago
- Due date set to 2017-04-29
due to changes in a related task