openQA Infrastructure - action #125885: worker10 crashed triggering systemd-services alert and host-up alert size:M</h1> <article> <p>2023-03-13T11:36:33Z</p> <ul><li><strong>Priority</strong> changed from <i>Normal</i> to <i>High</i></li><li><strong>Target version</strong> set to <i>Ready</i></li></ul> </article> <article> <p>2023-03-13T11:55:33Z</p> <ul><li><strong>File</strong> <a href="/attachments/14807">dmesg.txt</a> <a class="icon-only icon-download" title="Download" href="/attachments/download/14807/dmesg.txt">dmesg.txt</a> added</li></ul><p>I've attached the dmesg log. It looks similar to the traces from worker11 and worker13. There's also a crash dump from a few months ago in <code>/var/crash-bak/2022-09-26-16:02</code> but it looks different.</p> </article> <article> <p>2023-03-13T12:47:27Z</p> <ul><li><strong>Subject</strong> changed from <i>worker10 crashed triggering systemd-services alert and host-up alert</i> to <i>worker10 crashed triggering systemd-services alert and host-up alert size:M</i></li><li><strong>Description</strong> updated (<a title="View differences" href="/journals/612665/diff?detail_id=575336">diff</a>)</li><li><strong>Status</strong> changed from <i>New</i> to <i>Feedback</i></li></ul> </article> <article> <p>2023-03-13T13:42:52Z</p> <ul><li><strong>Tags</strong> deleted (<del><i>alert, infra</i></del>)</li><li><strong>Target version</strong> deleted (<del><i>Ready</i></del>)</li></ul><p>I've asked on the public opensuse-factory channel and
the internal kernel squad channel. Maybe someone knows more.</p> </article> <article> <p>2023-03-13T13:43:15Z</p> <ul><li><strong>Tags</strong> set to <i>infra, alert</i></li><li><strong>Target version</strong> set to <i>Ready</i></li></ul> </article> <article> <p>2023-03-13T16:04:28Z</p> <p>I've reported some new workqueue lockup warnings on SLE-15SP4 kernels in <a href="https://bugzilla.suse.com/show_bug.cgi?id=1201188" class="external">bsc#1201188</a>, but this doesn't seem to be the same issue. However, all 3 workers ran a Btrfs rebalance shortly (~20 minutes) before the crash. Can you trigger the crash by running a Btrfs rebalance manually? And if so, can you trigger it again when you boot the worker without the igb driver module?</p> </article> <article> <p>2023-03-13T16:37:41Z</p> <blockquote> <p>However, all 3 workers ran a Btrfs rebalance shortly (~20 minutes) before the crash.</p> </blockquote> <p>I had noticed that as well, but I'm not sure whether it is related.</p> <blockquote> <p>Can you trigger the crash by running a Btrfs rebalance manually?</p> </blockquote> <p>Good idea, I'll try that.</p> <hr> <p>EDIT: I've just invoked <code>sudo systemctl start btrfs-balance.service</code> on all three workers (worker10, 11 and 13) to increase the chances of reproducing the issue. So far it hasn't led to any crashes, and the re-balancing has already finished on all hosts.
Maybe this isn't an ideal way to reproduce the problem because there was likely not much to re-balance; the re-balancing didn't take very long.</p> <p>I've just triggered one more re-balancing on all three hosts, but it also exited very quickly, with no crashes so far.</p> </article> <article> <p>2023-03-15T12:27:09Z</p> <ul><li><strong>Due date</strong> set to <i>2023-03-27</i></li></ul><p>We suggest triggering the balance manually, e.g.</p> <pre><code>sudo btrfs balance start --full-balance / </code></pre> </article> <article> <p>2023-03-16T14:27:06Z</p> <p>I did 2 re-balances this way on each worker. They indeed took much more time than before. However, so far none of the machines has crashed.</p> </article> <article> <p>2023-03-22T10:20:04Z</p> <ul><li><strong>Status</strong> changed from <i>Feedback</i> to <i>Resolved</i></li></ul><p>There were no further crashes, so I'm considering this resolved for now. Maybe it was a kernel issue that has been patched in the meantime. At this point, putting more effort into investigating this is not worth it.</p> </article> </main></body></html>