<h1>openSUSE Project Management Tool</h1>
<p>openQA Project, coordination #64746: [saga][epic] Scale up: Efficient handling of large storage to be able to run current tests efficiently but keep big archives of old results (<a href="https://progress.opensuse.org/issues/64746">https://progress.opensuse.org/issues/64746</a>)</p>
<hr><p><strong>Updated by livdywan (liv.dywan@suse.com) on 2020-03-24T11:00:33Z</strong> (<a href="https://progress.opensuse.org/issues/64746?journal_id=287127">journal #287127</a>)</p>
<p>mkittler wrote:</p>
<blockquote>
<a name="ideas"></a>
<h2>ideas<a href="#ideas" class="wiki-anchor">¶</a></h2>
<p>use innovative storage solutions</p>
</blockquote>
<p>What real technology are you referring to here?</p>
<hr><p><strong>Updated by okurz (okurz@suse.com) on 2020-03-27T20:56:00Z</strong> (<a href="https://progress.opensuse.org/issues/64746?journal_id=288600">journal #288600</a>)</p>
<p>cdywan wrote:</p>
<blockquote>
<p>What real technology are you referring to here?</p>
</blockquote>
<p>Something better than single, fixed, inflexible volumes, e.g. dm-cache, lvmcache, bcache or SUSE Enterprise Storage. Maybe also split the results stored by openQA into "recent, active" and "old, archived" and put the two categories into different folders which can be mounted from different storage locations, e.g. fast and expensive for "recent, active" and slow, cheap and big for "old, archived"</p>
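<p>For illustration only, the recent/archived split could be sketched as two mounts. The device names and the archive path below are hypothetical, and openQA's actual archiving feature may lay this out differently:</p>

```
# /etc/fstab sketch -- hypothetical devices and paths, for illustration only:
# fast SSD-backed volume for "recent, active" results,
# big, cheap HDD-backed volume for "old, archived" results
/dev/fast_vg/results   /var/lib/openqa/testresults   ext4  noatime  0 0
/dev/slow_vg/archive   /var/lib/openqa/archive       ext4  noatime  0 0
```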
<p>EDIT: 2020-04-08: What I had been reading and thinking:</p>
<ul>
<li><a href="http://strugglers.net/%7Eandy/blog/2017/07/19/bcache-and-lvmcache/">http://strugglers.net/~andy/blog/2017/07/19/bcache-and-lvmcache/</a> is a very nice blog post comparing HDD, SSD, bcache and lvmcache with git tests and fio. Overall result: bcache can get very close to SSD performance; lvmcache is good but stays below bcache, yet lvmcache is much more flexible and probably easier to migrate to and from</li>
<li>ZFS seems to provide good support for using a fast SSD as cache, but that is not an easy option for us because our OS does not support it for now</li>
<li>mitiao already worked on bcache for our tmpfs workers in <a class="issue tracker-4 status-5 priority-3 priority-lowest closed" title="action: [ppc64le] Deploy bcache on tmpfs workers (Closed)" href="https://progress.opensuse.org/issues/30595">#30595</a> but I think overall there was a lot of misunderstanding and confusion and ultimately no resolution. Still, the hints mentioned there can be evaluated as well. In this specific case I wonder, though, how comparable RAM+HDD is to SSD+HDD; maybe we need specialized solutions for both cases rather than a common, too generic one</li>
<li>Discussed with nsinger: tmpfs should only use as much RAM as is actually used within the tmpfs, so maybe for our tmpfs workers we should just allocate a bigger tmpfs (as big as RAM) and let it use what is there? Or configure "dirty_ratio" and "dirty_background_ratio" to effectively allow caching 100% of writes but start writing to persistent storage in the background as soon as possible to avoid I/O spikes, i.e. "dirty_ratio=100%" and "dirty_background_ratio=0%"? That applies to all devices, but it is also possible to configure these parameters per device. nsinger also suggested <a href="https://www.kernel.org/doc/html/latest/admin-guide/device-mapper/delay.html">https://www.kernel.org/doc/html/latest/admin-guide/device-mapper/delay.html</a> for local experiments, or a qcow image on tmpfs that we map into a VM as "fast storage" and a qcow image on HDD/SSD as "slow storage" -> <a class="issue tracker-4 status-5 priority-3 priority-lowest closed" title="action: [ppc64le] Deploy bcache on tmpfs workers (Closed)" href="https://progress.opensuse.org/issues/30595">#30595</a></li>
<li>ramfs has no size limit and does not use swap, which is risky</li>
</ul>
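<p>The "dirty_ratio" idea above, expressed as a sysctl fragment. The values are illustrative extremes for experimentation, not tested recommendations, and they apply globally rather than per device:</p>

```
# /etc/sysctl.d/90-writeback-experiment.conf -- sketch, illustrative values only
# Throttle writers only once dirty pages reach 90% of RAM ...
vm.dirty_ratio = 90
# ... but start background writeback almost immediately to avoid I/O spikes
vm.dirty_background_ratio = 1
```

<p>Such a fragment would be activated with <code>sysctl --system</code>.</p>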
<p>EDIT: 2020-04-20: Some more:</p>
<ul>
<li><a href="https://unix.stackexchange.com/questions/334415/dirty-ratio-per-device">https://unix.stackexchange.com/questions/334415/dirty-ratio-per-device</a> describes that we can set a dirty ratio per device which we could use for tmpfs workers. The low-level details are in <a href="https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-class-bdi">https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-class-bdi</a> -> <a class="issue tracker-4 status-5 priority-3 priority-lowest closed" title="action: [ppc64le] Deploy bcache on tmpfs workers (Closed)" href="https://progress.opensuse.org/issues/30595">#30595</a></li>
<li><a href="https://unix.stackexchange.com/questions/237030/how-safe-is-it-to-increase-tmpfs-to-more-than-physical-memory">https://unix.stackexchange.com/questions/237030/how-safe-is-it-to-increase-tmpfs-to-more-than-physical-memory</a> describes how tmpfs can be configured with even bigger size than complete RAM size to use swap. For tmpfs workers, maybe just configure a much bigger tmpfs and supply a big swap -> <a class="issue tracker-4 status-5 priority-3 priority-lowest closed" title="action: [ppc64le] Deploy bcache on tmpfs workers (Closed)" href="https://progress.opensuse.org/issues/30595">#30595</a></li>
<li><a href="https://superuser.com/questions/1060468/confused-whether-to-switch-from-ext2-to-ext4-or-not/1060818#1060818">https://superuser.com/questions/1060468/confused-whether-to-switch-from-ext2-to-ext4-or-not/1060818#1060818</a> recommends ext4 over ext2 due to mentioned features, e.g. "like extents, pre-allocation, delayed allocation and multiblock allocators which all contribute to reduce fragmentation and therefore extend your SSD life."</li>
<li><a href="https://www.thomas-krenn.com/de/wiki/SSD_Performance_optimieren">https://www.thomas-krenn.com/de/wiki/SSD_Performance_optimieren</a> describes how GPT+ext4 can perform better. We should at least ensure we use GPT, but if we do not use a partition table at all, maybe that also has an impact?</li>
<li><a href="https://unix.stackexchange.com/questions/155784/advantages-disadvantages-of-increasing-commit-in-fstab">https://unix.stackexchange.com/questions/155784/advantages-disadvantages-of-increasing-commit-in-fstab</a> describes advantages and disadvantages of increasing "commit" in fstab; we can experiment with higher commit values and check the performance impact</li>
<li><a href="https://heiko-sieger.info/tuning-vm-disk-performance/">https://heiko-sieger.info/tuning-vm-disk-performance/</a> describes that <code>qemu-img create -f raw -o preallocation=full vmdisk.img 100G</code> can bring best performance for VMs, i.e. "raw" with "preallocation=full". Also recommend to use the virtio driver within qemu VMs with "iothread", e.g. <code>-object iothread,id=io1 -device virtio-blk-pci,drive=disk0,iothread=io1 -drive if=none,id=disk0,cache=none,format=raw,aio=threads,file=/path/to/vmdisk.img</code>, "note the aio=threads options, preferred option when storing the VM image file on an ext4 file system. With other file systems, aio=native should perform better. You can experiment with that."</li>
<li><a href="http://mail-archives.apache.org/mod_mbox/cloudstack-users/201708.mbox/raw/%3CF7FFAF25-5228-47F1-9DD2-7A828E071520@gmail.com%3E/">http://mail-archives.apache.org/mod_mbox/cloudstack-users/201708.mbox/raw/%3CF7FFAF25-5228-47F1-9DD2-7A828E071520@gmail.com%3E/</a> recommends sparse qcow files to prevent double writes (the actual write plus the image-size extension), using the "writeback" caching mode, and "fat" allocation with qcow which takes more time initially but helps further down the road; it also recommends XFS over the ext* family for the datastore</li>
<li><a href="http://www.linux-kvm.org/page/Tuning_KVM">http://www.linux-kvm.org/page/Tuning_KVM</a> recommends the qemu parameter <code>-cpu host</code> (equivalent to the libvirt selection "host-passthrough") and also <code>if=virtio</code> for storage. A qemu command as used in openQA tests looks like <code>/usr/bin/qemu-system-x86_64 -only-migratable … -cpu qemu64 … -smp 1 -enable-kvm … -S -device virtio-scsi-pci,id=scsi0 -blockdev driver=file,node-name=hd0-file,filename=/var/lib/openqa/pool/1/raid/hd0,cache.no-flush=on -blockdev driver=qcow2,node-name=hd0,file=hd0-file,cache.no-flush=on -device virtio-blk,id=hd0-device,drive=hd0,serial=hd0 -blockdev driver=file,node-name=cd0-overlay0-file,filename=/var/lib/openqa/pool/1/raid/cd0-overlay0,cache.no-flush=on -blockdev driver=qcow2,node-name=cd0-overlay0,file=cd0-overlay0-file,cache.no-flush=on -device scsi-cd,id=cd0-device,drive=cd0-overlay0,serial=cd0</code>. As we never migrate machines to other hosts and just boot the qcow images on other machines, "host-passthrough" might not be a problem.</li>
<li><a href="https://qemu.weilnetz.de/doc/qemu-doc.html#qemu_005fimg_005finvocation">https://qemu.weilnetz.de/doc/qemu-doc.html#qemu_005fimg_005finvocation</a> describes how one can run benchmarks against qemu drive files using <code>qemu-img bench</code>, which could help us measure the performance of pool filesystems more easily than running complete openQA test runs</li>
<li><a href="https://blog.frehi.be/2011/05/27/linux-performance-improvements/">https://blog.frehi.be/2011/05/27/linux-performance-improvements/</a> mentions KSM to save memory on workers running the same OS in multiple VMs; it also mentions the "deadline" I/O scheduler for hosts mainly running VMs</li>
</ul>
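<p>Combining two of the hints above, a small script could benchmark a candidate pool filesystem with <code>qemu-img</code> instead of full openQA runs. This is a sketch under assumptions: the target directory, image size and request count are arbitrary, the function name is made up, and the script simply skips when qemu-img is not installed:</p>

```shell
#!/bin/sh
# Sketch: benchmark a pool filesystem candidate with qemu-img instead of
# running complete openQA test runs. All sizes/paths are illustrative.
bench_pool_fs() {
    dir=${1:-/tmp}                  # directory on the filesystem under test
    img=$dir/openqa-bench.raw
    if ! command -v qemu-img >/dev/null 2>&1; then
        echo "qemu-img not installed, skipping"
        return 0
    fi
    # "raw" with "preallocation=full" was reported above as the fastest combination
    qemu-img create -f raw -o preallocation=full "$img" 256M
    # 10000 sequential 8k writes, roughly the pattern of a busy VM disk
    qemu-img bench -w -s 8k -c 10000 "$img"
    rm -f "$img"
}

bench_pool_fs /tmp
```

<p>Running it on each candidate filesystem (SSD, HDD, bcache, lvmcache) and comparing the reported times would give a much quicker signal than full test runs, at the cost of a less realistic workload.</p>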
<p>EDIT: 2020-05-01: More:</p>
<ul>
<li><a href="https://wiki.archlinux.org/index.php/Ext4#Improving_performance">https://wiki.archlinux.org/index.php/Ext4#Improving_performance</a> also has a nice list of performance hints for ext4 filesystems, which is what we use on o3 workers; aarch64.o.o now uses <code>noatime,data=writeback,commit=1200</code>. We did not try "barrier=0" yet.</li>
</ul>
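<p>As an fstab line, the mount options above would look like this. The device and mount point are hypothetical; note that "data=writeback" weakens crash consistency and "commit=1200" can lose up to 20 minutes of data on power loss, which may be acceptable for disposable pool data but not elsewhere:</p>

```
# /etc/fstab sketch -- hypothetical device and mount point
/dev/nvme0n1p1  /var/lib/openqa  ext4  noatime,data=writeback,commit=1200  0 0
```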
<p>EDIT: 2020-05-13: coolo did some research years ago, see <a href="https://github.com/os-autoinst/os-autoinst/pull/664">https://github.com/os-autoinst/os-autoinst/pull/664</a></p>
<p>EDIT: 2020-06-13: in combination <a href="https://en.m.wikipedia.org/wiki/Zram" class="external">zram</a> might help us as well</p>
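<p>A minimal zram sketch, assuming the zram-generator package is available and with illustrative values: it provides compressed swap in RAM, which the kernel prefers over any slower disk swap:</p>

```
# /etc/systemd/zram-generator.conf -- sketch, illustrative values
[zram0]
zram-size = ram / 2
compression-algorithm = zstd
```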
<hr><p><strong>Updated by okurz on 2020-04-10T06:44:38Z</strong> (<a href="https://progress.opensuse.org/issues/64746?journal_id=292157">journal #292157</a>)</p>
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>In Progress</i></li><li><strong>Assignee</strong> set to <i>okurz</i></li></ul>
<hr><p><strong>Updated by okurz on 2020-04-20T04:32:43Z</strong> (<a href="https://progress.opensuse.org/issues/64746?journal_id=294074">journal #294074</a>)</p>
<ul><li><strong>Subject</strong> changed from <i>[epic] Handle large storage efficiently to be able to run current tests but keep big archives of old results</i> to <i>[saga][epic] Handle large storage efficiently to be able to run current tests efficiently but keep big archives of old results</i></li></ul>
<hr><p><strong>Updated by okurz on 2020-04-21T07:50:52Z</strong> (<a href="https://progress.opensuse.org/issues/64746?journal_id=294467">journal #294467</a>)</p>
<ul><li><strong>Related to</strong> <i><a class="issue tracker-4 status-5 priority-3 priority-lowest closed" href="/issues/30595">action #30595</a>: [ppc64le] Deploy bcache on tmpfs workers</i> added</li></ul>
<hr><p><strong>Updated by okurz on 2020-04-21T07:51:03Z</strong> (<a href="https://progress.opensuse.org/issues/64746?journal_id=294473">journal #294473</a>)</p>
<ul><li><strong>Related to</strong> <i><a class="issue tracker-6 status-1 priority-4 priority-default parent" href="/issues/34357">coordination #34357</a>: [epic] Improve openQA performance</i> added</li></ul>
<hr><p><strong>Updated by okurz on 2020-04-23T11:17:28Z</strong> (<a href="https://progress.opensuse.org/issues/64746?journal_id=295543">journal #295543</a>)</p>
<ul><li><strong>Related to</strong> <i><a class="issue tracker-4 status-3 priority-5 priority-high3 closed" href="/issues/58805">action #58805</a>: [infra]Severe storage performance issue on openqa.suse.de workers</i> added</li></ul>
<hr><p><strong>Updated by livdywan on 2020-04-28T06:48:17Z</strong> (<a href="https://progress.opensuse.org/issues/64746?journal_id=296531">journal #296531</a>)</p>
<ul><li><strong>Target version</strong> set to <i>Current Sprint</i></li></ul><p>Should this be on Feedback? Should it be in the Current Sprint? I'm not clear on the current status.<br>
It looks like a research ticket which needs to be refined further but you set it to In Progress.</p>
<hr><p><strong>Updated by livdywan on 2020-04-28T06:50:42Z</strong> (<a href="https://progress.opensuse.org/issues/64746?journal_id=296534">journal #296534</a>)</p>
<ul><li><strong>Target version</strong> deleted (<del><i>Current Sprint</i></del>)</li></ul>
<hr><p><strong>Updated by okurz on 2020-04-28T12:39:49Z</strong> (<a href="https://progress.opensuse.org/issues/64746?journal_id=296663">journal #296663</a>)</p>
<p>cdywan wrote:</p>
<blockquote>
<p>Should this be on Feedback? Should it be in the Current Sprint? I'm not clear on the current status.<br>
It looks like a research ticket which needs to be refined further but you set it to In Progress.</p>
</blockquote>
<p>Yes, I set it to "In Progress" because I have many articles in my reading backlog. It is not "Feedback" because I am not waiting for anything, I am just working on more than this one task. Please note that I updated my previous comments with quite recent additions. I was asked to add fewer comments to tickets to prevent "too many mail notifications", hence I took that approach instead.</p>
<hr><p><strong>Updated by okurz on 2020-05-14T11:51:54Z</strong> (<a href="https://progress.opensuse.org/issues/64746?journal_id=300247">journal #300247</a>)</p>
<ul><li><strong>Related to</strong> <i><a class="issue tracker-4 status-3 priority-5 priority-high3 closed" href="/issues/66709">action #66709</a>: Storage server for OSD and monitoring</i> added</li></ul>
<hr><p><strong>Updated by okurz on 2020-05-26T09:01:21Z</strong> (<a href="https://progress.opensuse.org/issues/64746?journal_id=302497">journal #302497</a>)</p>
<ul><li><strong>Description</strong> updated (<a title="View differences" href="/journals/302497/diff?detail_id=299269">diff</a>)</li></ul>
<hr><p><strong>Updated by okurz on 2020-07-03T20:49:46Z</strong> (<a href="https://progress.opensuse.org/issues/64746?journal_id=311603">journal #311603</a>)</p>
<ul><li><strong>Description</strong> updated (<a title="View differences" href="/journals/311603/diff?detail_id=308855">diff</a>)</li><li><strong>Status</strong> changed from <i>In Progress</i> to <i>Blocked</i></li><li><strong>Target version</strong> set to <i>Ready</i></li></ul>
<hr><p><strong>Updated by mkittler (marius.kittler@suse.com) on 2020-09-08T14:25:25Z</strong> (<a href="https://progress.opensuse.org/issues/64746?journal_id=323506">journal #323506</a>)</p>
<ul><li><strong>Description</strong> updated (<a title="View differences" href="/journals/323506/diff?detail_id=320755">diff</a>)</li></ul>
<hr><p><strong>Updated by szarate on 2020-10-12T13:31:40Z</strong> (<a href="https://progress.opensuse.org/issues/64746?journal_id=333382">journal #333382</a>)</p>
<ul><li><strong>Tracker</strong> changed from <i>action</i> to <i>coordination</i></li><li><strong>Status</strong> changed from <i>Blocked</i> to <i>New</i></li></ul>
<hr><p><strong>Updated by szarate on 2020-10-12T13:44:55Z</strong> (<a href="https://progress.opensuse.org/issues/64746?journal_id=334414">journal #334414</a>)</p>
<p>See the reason for the tracker change: <a href="http://mailman.suse.de/mailman/private/qa-sle/2020-October/002722.html" class="external">http://mailman.suse.de/mailman/private/qa-sle/2020-October/002722.html</a></p>
<hr><p><strong>Updated by okurz on 2020-10-13T11:55:13Z</strong> (<a href="https://progress.opensuse.org/issues/64746?journal_id=335635">journal #335635</a>)</p>
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>Blocked</i></li></ul>
<hr><p><strong>Updated by okurz on 2020-11-21T23:05:51Z</strong> (<a href="https://progress.opensuse.org/issues/64746?journal_id=353142">journal #353142</a>)</p>
<ul><li><strong>Subject</strong> changed from <i>[saga][epic] Handle large storage efficiently to be able to run current tests efficiently but keep big archives of old results</i> to <i>[saga][epic] Scale up: Handle large storage efficiently to be able to run current tests efficiently but keep big archives of old results</i></li></ul>
<hr><p><strong>Updated by okurz on 2021-04-28T21:20:09Z</strong> (<a href="https://progress.opensuse.org/issues/64746?journal_id=401916">journal #401916</a>)</p>
<ul><li><strong>Subject</strong> changed from <i>[saga][epic] Scale up: Handle large storage efficiently to be able to run current tests efficiently but keep big archives of old results</i> to <i>[saga][epic] Scale up: Efficient handling of large storage to be able to run current tests efficiently but keep big archives of old results</i></li></ul>
<hr><p><strong>Updated by okurz on 2021-05-07T15:35:37Z</strong> (<a href="https://progress.opensuse.org/issues/64746?journal_id=405325">journal #405325</a>)</p>
<ul><li><strong>Copied to</strong> <i><a class="issue tracker-6 status-1 priority-3 priority-lowest parent" href="/issues/92323">coordination #92323</a>: [saga][epic] Scale up: Fine-grained control over use and removal of results, assets, test data</i> added</li></ul>
<hr><p><strong>Updated by okurz on 2021-05-07T15:40:35Z</strong> (<a href="https://progress.opensuse.org/issues/64746?journal_id=405331">journal #405331</a>)</p>
<ul><li><strong>Description</strong> updated (<a title="View differences" href="/journals/405331/diff?detail_id=385102">diff</a>)</li></ul><p>split out some "future" ideas into the future saga <a class="issue tracker-6 status-1 priority-3 priority-lowest parent" title="coordination: [saga][epic] Scale up: Fine-grained control over use and removal of results, assets, test data (New)" href="https://progress.opensuse.org/issues/92323">#92323</a></p>
<hr><p><strong>Updated by okurz on 2021-12-23T10:08:57Z</strong> (<a href="https://progress.opensuse.org/issues/64746?journal_id=475797">journal #475797</a>)</p>
<ul><li><strong>Status</strong> changed from <i>Blocked</i> to <i>Resolved</i></li></ul><p>With the archiving feature and multiple others completed, and archiving enabled on both o3 and osd, we are much more flexible and should be in good shape for the future</p>
<hr><p><strong>Updated by okurz on 2022-05-10T09:40:25Z</strong> (<a href="https://progress.opensuse.org/issues/64746?journal_id=517471">journal #517471</a>)</p>
<ul><li><strong>Copied to</strong> <i><a class="issue tracker-6 status-1 priority-5 priority-high3 parent behind-schedule" href="/issues/110833">coordination #110833</a>: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances</i> added</li></ul>