openSUSE Project Management Tool
openQA Infrastructure - action #19238: setup pool devices+mounts+folders with salt (was: ext2 on workers busted)
https://progress.opensuse.org/issues/19238

Update #50400 by coolo (coolo@suse.com), 2017-05-19T05:26:28Z
<p>This needs to be salted and systemd-ed</p>
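<p>For the systemd half, one option would be dedicated mount units instead of the manual mounts in the script below. A minimal sketch, assuming the same device and paths as the script (the unit name must be the path with <code>/</code> escaped to <code>-</code>, per systemd's mount-unit naming rules):</p>

```ini
# /etc/systemd/system/var-lib-openqa-pool.mount — hypothetical sketch
[Unit]
Description=openQA pool directory on the first NVMe

[Mount]
What=/dev/nvme0n1p1
Where=/var/lib/openqa/pool
Type=ext2

[Install]
WantedBy=local-fs.target
```

<p>The per-pool-directory bind mounts could be expressed the same way with <code>Options=bind</code>.</p>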
<pre><code>#!/bin/sh
# Re-create the ext2 pool filesystems on both NVMes and restore the
# pool/cache directory layout with bind mounts.
set -e

# unmount only if currently mounted (mountpoints in /proc/mounts are
# delimited by spaces)
_umount() {
    if grep -q " $1 " /proc/mounts; then
        umount "$1"
    fi
}

POOL2="1 2 3 4 5 6 7 8 9 10 11 12"
POOL1="13 14 15 16 17 18 19 20"

_umount /var/lib/openqa/cache
for i in $POOL2; do
    _umount /var/lib/openqa/pool/$i
done
_umount /var/lib/openqa/pool2
_umount /var/lib/openqa/pool

mkfs.ext2 -F /dev/nvme0n1p1
mkfs.ext2 -F /dev/nvme1n1p1
mount /var/lib/openqa/pool
mount /var/lib/openqa/pool2

# pool slots 1-12 live on the second NVMe, bind-mounted into pool
for i in $POOL2; do
    mkdir -p /var/lib/openqa/pool/$i /var/lib/openqa/pool2/$i
    chown _openqa-worker /var/lib/openqa/pool2/$i
    mount -o bind /var/lib/openqa/pool2/$i /var/lib/openqa/pool/$i
done
# pool slots 13-20 stay on the first NVMe
for i in $POOL1; do
    mkdir -p /var/lib/openqa/pool/$i
    chown _openqa-worker /var/lib/openqa/pool/$i
done

mkdir -p /var/lib/openqa/pool/cache
chown _openqa-worker /var/lib/openqa/pool/cache
mount -o bind /var/lib/openqa/pool/cache /var/lib/openqa/cache
</code></pre>

Update #51462 by okurz (okurz@suse.com), 2017-05-26T20:23:05Z
<ul><li><strong>Category</strong> set to <i>168</i></li></ul>

Update #73189 by coolo (coolo@suse.com), 2017-11-21T14:36:49Z
<ul><li><strong>Subject</strong> changed from <i>[tools] ext2 on workers busted</i> to <i>ext2 on workers busted</i></li><li><strong>Target version</strong> set to <i>Ready</i></li></ul>

Update #168269 by mkittler (marius.kittler@suse.com), 2018-11-23T14:38:50Z
<ul><li><strong>Project</strong> changed from <i>openQA Project</i> to <i>openQA Infrastructure</i></li><li><strong>Category</strong> deleted (<del><i>168</i></del>)</li></ul><p>Seems to be an infra issue.</p>
Update #168887 by nicksinger (nsinger@suse.com), 2018-11-27T07:04:16Z
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>Workable</i></li></ul>

Update #245846 by okurz (okurz@suse.com), 2019-09-24T19:07:35Z
<ul><li><strong>Subject</strong> changed from <i>ext2 on workers busted</i> to <i>setup pool devices+mounts+folders with salt(was: ext2 on workers busted)</i></li></ul><p>By now we have the NVMe devices on the three arm workers set up with salt, see <a href="https://gitlab.suse.de/openqa/salt-states-openqa/tree/master/openqa/nvme_store">https://gitlab.suse.de/openqa/salt-states-openqa/tree/master/openqa/nvme_store</a>. The caveat I saw there is that the file system is recreated on every reboot – as actually suggested here – but with the need to sync the big test and needle repos again, the overall setup process takes rather long. I think we can find a way to re-use the existing partition and data with proper consistency checks and only repair what is necessary. Can you describe what the original problem was? Also, why ext2? I know there is no journal, but is it still the best approach?</p>
<p>EDIT: I tried on openqaworker10: mkfs.ext2 on an NVMe partition took 25s, mkfs.ext4 took 1s. As we reformat the arm workers on every reboot, that is one more reason to use ext4.</p>
<p><a href="http://www.ilsistemista.net/index.php/virtualization/47-zfs-btrfs-xfs-ext4-and-lvm-with-kvm-a-storage-performance-comparison.html">http://www.ilsistemista.net/index.php/virtualization/47-zfs-btrfs-xfs-ext4-and-lvm-with-kvm-a-storage-performance-comparison.html</a> has some related info. <a href="https://www.phoronix.com/scan.php?page=article&item=linux-50-filesystems&num=2">https://www.phoronix.com/scan.php?page=article&item=linux-50-filesystems&num=2</a> indicates XFS might (by now) be a good choice for us for the pool dir. Following <a href="https://wiki.archlinux.org/index.php/ext4#Improving_performance">https://wiki.archlinux.org/index.php/ext4#Improving_performance</a> or <a href="https://www.thegeekdiary.com/what-are-the-mount-options-to-improve-ext4-filesystem-performance-in-linux/">https://www.thegeekdiary.com/what-are-the-mount-options-to-improve-ext4-filesystem-performance-in-linux/</a> I will try optimized settings on openqaworker10, see <a class="issue tracker-4 status-3 priority-4 priority-default closed behind-schedule" title="action: bring openqaworker10 back into the infrastructure (was: openqaworker10 is giving us many incomple... (Resolved)" href="https://progress.opensuse.org/issues/32605">#32605</a> as well. Interestingly enough, I could not easily prove that ext4 w/o journal is any better than ext2:</p>
<pre><code>openqaworker10:/srv # time mkfs.ext2 -F /dev/nvme0n1p1
…
real 0m24.034s
openqaworker10:/srv # mount -o defaults /dev/nvme0n1p1 /var/lib/openqa/pool/
openqaworker10:/srv # mount | grep pool
/dev/nvme0n1p1 on /var/lib/openqa/pool type ext2 (rw,relatime,block_validity,barrier,user_xattr,acl)
openqaworker10:/srv # /tmp/avgtime -q -d -r 5 -h dd bs=4M count=1000 if=/dev/zero of=/var/lib/openqa/pool/test.img
Avg time : 7013.06
Std dev. : 225.566
Minimum : 6786.32
Maximum : 7442.27
openqaworker10:/srv # /tmp/avgtime -d -r 5 -h dd bs=4M count=1000 if=/dev/zero of=/var/lib/openqa/pool/test.img
4194304000 bytes (4.2 GB, 3.9 GiB) copied, 6.04282 s, 694 MB/s
4194304000 bytes (4.2 GB, 3.9 GiB) copied, 6.25296 s, 671 MB/s
4194304000 bytes (4.2 GB, 3.9 GiB) copied, 6.04532 s, 694 MB/s
4194304000 bytes (4.2 GB, 3.9 GiB) copied, 6.27314 s, 669 MB/s
4194304000 bytes (4.2 GB, 3.9 GiB) copied, 6.40667 s, 655 MB/s
4194304000 bytes (4.2 GB, 3.9 GiB) copied, 6.56258 s, 639 MB/s
…
Avg time : 7090.44
Std dev. : 171.836
Minimum : 6789.99
Maximum : 7304.62
openqaworker10:/srv # umount /var/lib/openqa/pool
openqaworker10:/srv # time mkfs.ext4 -O ^has_journal -F /dev/nvme0n1p1
…
real 0m0.757s
openqaworker10:/srv # mount -o defaults,noatime,barrier=0 /dev/nvme0n1p1 /var/lib/openqa/pool/
openqaworker10:/srv # /tmp/avgtime -d -r 5 -h dd bs=4M count=1000 if=/dev/zero of=/var/lib/openqa/pool/test.img
4194304000 bytes (4.2 GB, 3.9 GiB) copied, 4.23314 s, 991 MB/s
4194304000 bytes (4.2 GB, 3.9 GiB) copied, 7.79238 s, 538 MB/s
4194304000 bytes (4.2 GB, 3.9 GiB) copied, 7.6331 s, 549 MB/s
4194304000 bytes (4.2 GB, 3.9 GiB) copied, 7.95202 s, 527 MB/s
4194304000 bytes (4.2 GB, 3.9 GiB) copied, 7.57801 s, 553 MB/s
4194304000 bytes (4.2 GB, 3.9 GiB) copied, 7.87948 s, 532 MB/s
Avg time : 8476.28
Std dev. : 163.676
Minimum : 8273.7
Maximum : 8641.14
openqaworker10:/srv #
</code></pre>
<p>I also conducted a test with ext4+journal, which was worse. However, this is all still on openSUSE Leap 42.3 with Linux 4.4.159; I should redo this after an upgrade (or reinstall).</p>
<p>EDIT: I checked all our current production workers: we have two NVMes on some, and a single NVMe on openqaworker{9,10,13} and arm{1,2,3}:</p>
<pre><code>$ sudo salt --no-color '*' cmd.run 'ls /dev/nvme?'
QA-Power8-4-kvm.qa.suse.de:
ls: cannot access '/dev/nvme?': No such file or directory
QA-Power8-5-kvm.qa.suse.de:
ls: cannot access '/dev/nvme?': No such file or directory
powerqaworker-qam-1:
ls: cannot access '/dev/nvme?': No such file or directory
malbec.arch.suse.de:
ls: cannot access '/dev/nvme?': No such file or directory
openqaworker2.suse.de:
/dev/nvme0
/dev/nvme1
openqaworker9.suse.de:
/dev/nvme0
openqaworker8.suse.de:
/dev/nvme0
openqaworker5.suse.de:
/dev/nvme0
/dev/nvme1
openqaworker3.suse.de:
/dev/nvme0
/dev/nvme1
openqaworker7.suse.de:
/dev/nvme0
/dev/nvme1
openqaworker6.suse.de:
/dev/nvme0
/dev/nvme1
grenache-1.qa.suse.de:
ls: cannot access '/dev/nvme?': No such file or directory
openqa-monitor.qa.suse.de:
ls: cannot access '/dev/nvme?': No such file or directory
openqa.suse.de:
ls: cannot access '/dev/nvme?': No such file or directory
openqaworker10.suse.de:
/dev/nvme0
openqaworker-arm-1.suse.de:
/dev/nvme0
openqaworker13.suse.de:
/dev/nvme0
openqaworker-arm-3.suse.de:
/dev/nvme0
openqaworker-arm-2.suse.de:
/dev/nvme0
ERROR: Minions returned with non-zero exit code
</code></pre>
<p>So we can either make the script dynamic to handle 0–2 (or more) NVMe devices, or rely on a static setup for the specific workers. Preferences or ideas?</p>
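<p>For the dynamic variant, the device count could be derived at run time rather than hard-coded. A sketch (the <code>nvme?n1</code> pattern matches the namespace names listed above; the directory is a parameter only to keep the logic testable, on a real worker it would be <code>/dev</code>):</p>

```shell
#!/bin/sh
# Count NVMe namespace block devices under a directory (normally /dev)
# so the same setup script can branch on 0, 1 or 2+ devices.
count_nvmes() {
    # expand the glob into the positional parameters
    set -- "$1"/nvme?n1
    # if the glob matched nothing it stays literal, so -e is false
    [ -e "$1" ] && echo $# || echo 0
}

# demo with two fake namespace nodes in a scratch directory
demo=$(mktemp -d)
touch "$demo/nvme0n1" "$demo/nvme1n1"
count_nvmes "$demo"   # → 2
```

<p>On a worker, <code>count_nvmes /dev</code> would yield 0, 1 or 2 and the rest of the script could pick a single-device or multi-device layout accordingly.</p>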
Update #251150 by okurz (okurz@suse.com), 2019-10-18T06:30:49Z
<ul><li><strong>Related to</strong> <i><a class="issue tracker-4 status-3 priority-3 priority-lowest closed" href="/issues/49694">action #49694</a>: openqaworker7 lost one NVMe</i> added</li></ul>

Update #271121 by okurz (okurz@suse.com), 2020-01-14T12:53:53Z
<ul><li><strong>Related to</strong> <i><a class="issue tracker-4 status-3 priority-4 priority-default closed behind-schedule" href="/issues/46742">action #46742</a>: test incompletes trying to revert to qemu snapshot auto_review:"Could not open backing file: Could not open .*.qcow.*No such file or directory", likely premature deletion of files from cache</i> added</li></ul>

Update #271148 by okurz (okurz@suse.com), 2020-01-14T13:34:45Z
<p>Given that with <a class="issue tracker-4 status-3 priority-4 priority-default closed behind-schedule" title="action: test incompletes trying to revert to qemu snapshot auto_review:"Could not open backing file: Coul... (Resolved)" href="https://progress.opensuse.org/issues/46742">#46742</a> we try to use efficient hard links from cache to the pools, I suggest striping all available NVMes together, i.e. creating a RAID0 over all of them and mounting it as /var/lib/openqa – on workers the latter commonly has only three dirs: pool, cache and share, and share should be an NFS mountpoint anyway. Based on <a href="http://www.fibrevillage.com/storage/429-performance-comparison-of-mdadm-raid0-and-lvm-striped-mapping" class="external">http://www.fibrevillage.com/storage/429-performance-comparison-of-mdadm-raid0-and-lvm-striped-mapping</a> I would choose mdadm RAID0 over striped LVM. We could configure the volumes with <a href="https://docs.saltstack.com/en/latest/ref/states/all/salt.states.mdadm_raid.html" class="external">https://docs.saltstack.com/en/latest/ref/states/all/salt.states.mdadm_raid.html</a>, but I doubt we can easily configure the number of devices dynamically based on what is present on a specific worker. I suggest reworking what we have in <a href="https://gitlab.suse.de/openqa/salt-states-openqa/tree/master/openqa/nvme_store" class="external">https://gitlab.suse.de/openqa/salt-states-openqa/tree/master/openqa/nvme_store</a>, which we currently only use for the three ARM workers, and applying it the same way to all workers.</p>
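<p>The striping proposal would boil down to roughly the following sequence. A sketch that only prints the commands instead of executing them – device names, the md path and the ext4 options are assumptions for illustration, not the final salt state:</p>

```shell
#!/bin/sh
# Print the commands that would stripe all present NVMes into one
# RAID0 and mount it as /var/lib/openqa; nothing is executed here.
plan_raid0() {
    # $@: the NVMe block devices found on this worker
    printf '%s\n' \
        "mdadm --create /dev/md0 --level=0 --raid-devices=$# $*" \
        "mkfs.ext4 -O ^has_journal /dev/md0" \
        "mount -o noatime /dev/md0 /var/lib/openqa"
}

plan_raid0 /dev/nvme0n1 /dev/nvme1n1
```

<p>Running the printed commands for real would of course require root and an unmounted /var/lib/openqa.</p>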
<p>EDIT: I have an idea, maybe we <em>can</em> define it dynamically within salt based on grains, e.g.</p>
<pre><code>sudo salt --no-color '*' grains.item SSDs
QA-Power8-4-kvm.qa.suse.de:
----------
SSDs:
…
openqaworker10.suse.de:
----------
SSDs:
- nvme0n1
…
openqaworker-arm-2.suse.de:
----------
SSDs:
- sdb
- sda
- nvme0n1
</code></pre>
<p>see <a href="https://gitlab.suse.de/openqa/salt-states-openqa/merge_requests/250" class="external">https://gitlab.suse.de/openqa/salt-states-openqa/merge_requests/250</a> for an idea how to do that.</p>
Update #275618 by okurz (okurz@suse.com), 2020-01-31T22:24:38Z
<ul><li><strong>Status</strong> changed from <i>Workable</i> to <i>Feedback</i></li><li><strong>Assignee</strong> set to <i>okurz</i></li></ul><p><a href="https://gitlab.suse.de/openqa/salt-states-openqa/merge_requests/250" class="external">https://gitlab.suse.de/openqa/salt-states-openqa/merge_requests/250</a> merged. I am experimenting with openqaworker11 and openqaworker13 for reinstalls.</p>
<pre><code>ipmitool -I lanplus -H openqaworker11-ipmi.suse.de -U $user -P $pass chassis bootdev pxe
ipmitool -I lanplus -H openqaworker11-ipmi.suse.de -U $user -P $pass power reset
sleep 3
ipmitool -I lanplus -H openqaworker11-ipmi.suse.de -U $user -P $pass sol activate
</code></pre>
<p>in the PXE menu selecting openSUSE Leap 15.1, pressing "tab" for options, adding the parameter <code>autoyast=http://w3.nue.suse.com/~okurz/ay-openqa-worker.xml</code>, but so far I fail to see anything useful on the screen – just, after loading the kernel and initrd, after about a minute: "ààààààààààààààààààüààààààüààààààff".</p>
<p>And now I'm lost.</p>
<p>EDIT: Trying with additional parameters <code>console=ttyS1,115200</code> so the complete boot line in PXE:</p>
<pre><code>/find/openSUSE-Leap-15.1-x86_64-DVD1/boot/x86_64/loader/linux initrd=/find/openSUSE-Leap-15.1-x86_64-DVD1/boot/x86_64/loader/initrd install=http://dist.suse.de/netboot/find/openSUSE-Leap-15.1-x86_64-DVD1 splash=silent minmemory=128 ramdisk_size=73728 vga=normal console=tty0 console=ttyS0,115200 sysrq_always_enabled linemode=1 panic=100 ignore_loglevel unknown_nmi_panic insecure=1 console=ttyS1,115200
</code></pre>
<p>EDIT: No luck</p>
Update #276539 by nicksinger (nsinger@suse.com), 2020-02-04T14:35:49Z
<ul></ul><p>okurz wrote:</p>
<blockquote>
<p>EDIT: Trying with additional parameters <code>console=ttyS1,115200</code> so the complete boot line in PXE:</p>
<pre><code>/find/openSUSE-Leap-15.1-x86_64-DVD1/boot/x86_64/loader/linux initrd=/find/openSUSE-Leap-15.1-x86_64-DVD1/boot/x86_64/loader/initrd install=http://dist.suse.de/netboot/find/openSUSE-Leap-15.1-x86_64-DVD1 splash=silent minmemory=128 ramdisk_size=73728 vga=normal console=tty0 console=ttyS0,115200 sysrq_always_enabled linemode=1 panic=100 ignore_loglevel unknown_nmi_panic insecure=1 console=ttyS1,115200
</code></pre>
<p>EDIT: No luck</p>
</blockquote>
<p>You were close. I realized it renders black text on a black background. Changing the command line as follows fixes it, and your AY profile starts up:</p>
<pre><code>/find/openSUSE-Leap-15.1-x86_64-DVD1/boot/x86_64/loader/linux initrd=/find/openSUSE-Leap-15.1-x86_64-DVD1/boot/x86_64/loader/initrd install=http://dist.suse.de/netboot/find/openSUSE-Leap-15.1-x86_64-DVD1 splash=silent console=ttyS1,115200 autoyast=http://w3.suse.de/~okurz/ay-openqa-worker.xml
</code></pre>
<p>I've really no clue why any of the existing <code>vga</code>, <code>minmemory</code> or <code>ramdisk_size</code> parameters should cause this, but simply removing them worked fine. Be aware that the normal backspace does not work in PXE over SOL; you have to use ctrl+h instead.<br>
Unfortunately your profile also fails quite early with:</p>
<pre><code>salt-minion: The package is not available.
</code></pre>

Update #277150 by okurz (okurz@suse.com), 2020-02-06T20:48:09Z
<ul></ul><p>trying now with</p>
<pre><code>/find/openSUSE-Leap-15.1-x86_64-DVD1/boot/x86_64/loader/linux initrd=/find/openSUSE-Leap-15.1-x86_64-DVD1/boot/x86_64/loader/initrd install=http://download.opensuse.org/distribution/leap/15.1/repo/oss/ autoyast=http://w3.suse.de/~okurz/ay-openqa-worker.xml console=ttyS1,115200
</code></pre>
<p>but that somehow brought me to an installation summary screen, not something that looks like AutoYaST, hm.</p>
<p>Anyway, I can also continue with manual migration which I need to do for o3 anyway:</p>
<ul>
<li>openqaworker4.o.o: migrated as one NVMe is broken and I experimented with the machine anyway, verified with jobs</li>
<li>aarch64.o.o: migrated, <code>openqa-clone-job --within-instance https://openqa.opensuse.org 1162608 _GROUP=0 BUILD=X TEST=okurz_poo19238</code> -> Created job #1165797: opensuse-Tumbleweed-DVD-aarch64-Build20200201-mediacheck@aarch64 -> <a href="https://openqa.opensuse.org/t1165797">https://openqa.opensuse.org/t1165797</a> -> passed</li>
<li>power8.o.o: migrated, <code>openqa-clone-job --within-instance https://openqa.opensuse.org 1164156 _GROUP=0 BUILD=X TEST=okurz_poo19238</code> -> Created job #1165798: opensuse-Tumbleweed-DVD-ppc64le-Build20200203-mediacheck@ppc64le -> <a href="https://openqa.opensuse.org/t1165798">https://openqa.opensuse.org/t1165798</a> -> passed</li>
<li>openqaworker1.o.o: migrated, verified</li>
</ul>
<p>For OSD I first checked again where this is needed (and where potentially not that many jobs are running right now): <code>salt -l error --no-color '*' cmd.run 'lsblk | grep -q nvme && ps auxf | grep -c isotovideo'</code></p>
<p>but then I realized that e.g. openqaworker2 already has /dev/md0 and /dev/md1. We actually have the same on the o3 workers, but I think we should use the same name on all workers regardless of the existence of md0 or md1, so I created<br>
<a href="https://gitlab.suse.de/openqa/salt-states-openqa/merge_requests/268">https://gitlab.suse.de/openqa/salt-states-openqa/merge_requests/268</a><br>
proposing /dev/md/openqa</p>
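<p>A textual array name like this is independent of which numeric md devices already exist on a machine; pinned in the mdadm config it could look like the following (a sketch – metadata version and the name are examples, and in practice the array would usually be keyed by UUID as well):</p>

```
# /etc/mdadm.conf — hypothetical fragment
ARRAY /dev/md/openqa metadata=1.2 name=openqa
```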
<p>Done:</p>
<ul>
<li>openqaworker-arm-1.suse.de: migrated, verified</li>
<li>openqaworker-arm-2.suse.de: migrated, <code>openqa-clone-job --skip-chained-deps --within-instance https://openqa.suse.de 3867058 _GROUP=0 BUILD=X TEST=mediacheck_okurz_poo19238 WORKER_CLASS=openqaworker-arm-2</code> -> Created job #3872316: sle-15-SP2-Online-aarch64-Build136.2-mediacheck@aarch64 -> <a href="https://openqa.suse.de/t3872316">https://openqa.suse.de/t3872316</a> -> passed</li>
<li>openqaworker-arm-3.suse.de: migrated, verified</li>
<li>openqaworker2.suse.de: migrated, verified</li>
<li>openqaworker3.suse.de: migrated, verified</li>
<li>openqaworker5.suse.de: migrated, verified</li>
<li>openqaworker10.suse.de: migrated, verified</li>
<li>openqaworker6.suse.de: migrated, verified</li>
<li>openqaworker7.suse.de: migrated, verified</li>
<li>openqaworker8.suse.de: same as for 9, needs manual handling or adaptations, verified</li>
<li>openqaworker9.suse.de: has <em>only</em> NVMe, no other disks or SSDs, migrated manually, verified</li>
</ul>
<p>created <a href="https://gitlab.suse.de/openqa/salt-states-openqa/merge_requests/269">https://gitlab.suse.de/openqa/salt-states-openqa/merge_requests/269</a> to handle a single NVMe setup as for openqaworker8+9 automatically in the future.</p>
Update #279571 by okurz (okurz@suse.com), 2020-02-21T13:15:23Z
<ul><li><strong>Status</strong> changed from <i>Feedback</i> to <i>In Progress</i></li></ul><p>nicksinger ignored me in the MR so I merged it myself ;)</p>
<p>I applied the state explicitly with <code>salt -l error --no-color -C 'G@roles:worker' --state-output=changes state.apply openqa.nvme_store</code> and the diff of changes looks fine.</p>
<p><a href="https://gitlab.suse.de/openqa/salt-states-openqa/merge_requests/275" class="external">https://gitlab.suse.de/openqa/salt-states-openqa/merge_requests/275</a> to apply by default.</p>
Update #279592 by okurz (okurz@suse.com), 2020-02-21T14:06:41Z
<ul><li><strong>Status</strong> changed from <i>In Progress</i> to <i>Resolved</i></li></ul><p><a href="https://gitlab.suse.de/openqa/salt-states-openqa/merge_requests/275" class="external">https://gitlab.suse.de/openqa/salt-states-openqa/merge_requests/275</a> merged and <a href="https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/172852" class="external">successfully applied</a></p>