openSUSE Project Management Tool | https://progress.opensuse.org/ | 2021-09-26T04:34:58Z
openQA Infrastructure - action #99288: [Alerting] openqaworker-arm-5 and openqaworker-arm-4: host up alert on 2021-09-26 size:M | https://progress.opensuse.org/issues/99288?journal_id=449457 | 2021-09-26T04:34:58Z | Xiaojing_liu (xliu1@suse.com)
<ul><li><strong>Project</strong> changed from <i>openQA Project</i> to <i>openQA Infrastructure</i></li></ul>
openQA Infrastructure - action #99288: [Alerting] openqaworker-arm-5 and openqaworker-arm-4: host up alert on 2021-09-26 size:M | https://progress.opensuse.org/issues/99288?journal_id=449460 | 2021-09-26T05:42:21Z | okurz (okurz@suse.com)
<ul><li><strong>Target version</strong> set to <i>Ready</i></li></ul>
openQA Infrastructure - action #99288: [Alerting] openqaworker-arm-5 and openqaworker-arm-4: host up alert on 2021-09-26 size:M | https://progress.opensuse.org/issues/99288?journal_id=449610 | 2021-09-27T04:41:10Z | okurz (okurz@suse.com)
<ul><li><strong>Description</strong> updated (<a title="View differences" href="/journals/449610/diff?detail_id=426369">diff</a>)</li><li><strong>Priority</strong> changed from <i>Normal</i> to <i>Urgent</i></li></ul><p>Xiaojing_liu wrote:</p>
<blockquote>
<a name="Workaround"></a>
<h2 >Workaround<a href="#Workaround" class="wiki-anchor">¶</a></h2>
<p>used <code>ipmitool power cycle</code> to reboot arm-4 and arm-5</p>
</blockquote>
<p>I cannot confirm that this worked. The alerts in OSD are still active, and <code>ssh openqaworker-arm-4.qa</code> asks me for a password, which it should not. I updated the description with suggestions and rollback steps and paused the alerts in Grafana.</p>
<p>I assume that in <a class="issue tracker-4 status-3 priority-4 priority-default closed" title="action: Replacement openQA OSD aarch64 hardware (was: Dedicated non-rpi aarch64 hardware for manual testing) (Resolved)" href="https://progress.opensuse.org/issues/90275">#90275</a> the intended target state was never verified, so likely no reboot was conducted after a salt high state. Right now on openqaworker-arm-5.qa it seems that /var/lib/openqa cannot be mounted correctly because the mount expects /dev/md/openqa, which does not exist.</p>
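<p>A minimal sketch of how such a broken entry could be spotted before a reboot: it scans fstab text for <code>/dev/...</code> sources that do not exist. The function name <code>missing_fstab_devices</code> and the injectable <code>exists</code> parameter are hypothetical and only serve the illustration; the sample line is the entry reported for arm-5.</p>

```python
import os


def missing_fstab_devices(fstab_text, exists=os.path.exists):
    """Return (device, mountpoint) pairs whose /dev/... source does not exist."""
    missing = []
    for line in fstab_text.splitlines():
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        device, mountpoint = fields[0], fields[1]
        # Only plain device paths are checked; UUID=/LABEL= sources are skipped.
        if device.startswith("/dev/") and not exists(device):
            missing.append((device, mountpoint))
    return missing


# The entry from this ticket, checked against a stub that reports "not found".
sample = "/dev/md/openqa /var/lib/openqa ext2 defaults 0 0\n"
print(missing_fstab_devices(sample, exists=lambda path: False))
```

<p>On a real host the function would be called with the contents of /etc/fstab and the default <code>exists</code> check.</p>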
openQA Infrastructure - action #99288: [Alerting] openqaworker-arm-5 and openqaworker-arm-4: host up alert on 2021-09-26 size:M | https://progress.opensuse.org/issues/99288?journal_id=449619 | 2021-09-27T04:51:00Z | okurz (okurz@suse.com)
<ul><li><strong>Related to</strong> <i><a class="issue tracker-4 status-3 priority-4 priority-default closed" href="/issues/90275">action #90275</a>: Replacement openQA OSD aarch64 hardware (was: Dedicated non-rpi aarch64 hardware for manual testing)</i> added</li></ul>
openQA Infrastructure - action #99288: [Alerting] openqaworker-arm-5 and openqaworker-arm-4: host up alert on 2021-09-26 size:M | https://progress.opensuse.org/issues/99288?journal_id=449676 | 2021-09-27T07:44:14Z | okurz (okurz@suse.com)
<ul></ul><p>openqaworker-arm-4 seems to boot into the installer medium repeatedly.</p>
<p>openqaworker-arm-5 could be temporarily recovered by disabling the line <code>/dev/md/openqa /var/lib/openqa ext2 defaults 0 0</code> in /etc/fstab, calling <code>mount -a</code>, and logging out of the recovery shell, which continued the boot.</p>
<p><a class="user active user-mention" href="https://progress.opensuse.org/users/24624">@nicksinger</a> didn't you mention that changes are necessary to support the filesystem setup, which differs slightly from other hosts?</p>
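<p>The temporary recovery above (disabling the fstab line) can be sketched as a small text transformation. This is an illustration only, not the procedure that was actually run in the recovery shell; the helper name <code>disable_fstab_entry</code> and the <code>UUID=abcd</code> root entry in the sample are made up.</p>

```python
def disable_fstab_entry(fstab_text, device, mountpoint):
    """Comment out active fstab lines that match the given device and mountpoint."""
    out = []
    for line in fstab_text.splitlines(keepends=True):
        fields = line.split()
        # An already commented line has "#..." as its first field and is left alone.
        if len(fields) >= 2 and fields[0] == device and fields[1] == mountpoint:
            out.append("#" + line)
        else:
            out.append(line)
    return "".join(out)


before = (
    "UUID=abcd / btrfs defaults 0 0\n"  # hypothetical root entry
    "/dev/md/openqa /var/lib/openqa ext2 defaults 0 0\n"
)
after = disable_fstab_entry(before, "/dev/md/openqa", "/var/lib/openqa")
print(after, end="")
```

<p>Applying the function a second time leaves the text unchanged, so it is safe to repeat. After writing the result back to /etc/fstab, <code>mount -a</code> would mount the remaining entries as described in the comment.</p>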
openQA Infrastructure - action #99288: [Alerting] openqaworker-arm-5 and openqaworker-arm-4: host up alert on 2021-09-26 size:M | https://progress.opensuse.org/issues/99288?journal_id=449736 | 2021-09-27T09:12:22Z | okurz (okurz@suse.com)
<ul><li><strong>Description</strong> updated (<a title="View differences" href="/journals/449736/diff?detail_id=426516">diff</a>)</li></ul>
openQA Infrastructure - action #99288: [Alerting] openqaworker-arm-5 and openqaworker-arm-4: host up alert on 2021-09-26 size:M | https://progress.opensuse.org/issues/99288?journal_id=449739 | 2021-09-27T09:13:35Z | okurz (okurz@suse.com)
<ul><li><strong>Subject</strong> changed from <i>[Alerting] openqaworker-arm-5 and openqaworker-arm-4: host up alert on 2021-09-26</i> to <i>[Alerting] openqaworker-arm-5 and openqaworker-arm-4: host up alert on 2021-09-26 size:M</i></li><li><strong>Description</strong> updated (<a title="View differences" href="/journals/449739/diff?detail_id=426522">diff</a>)</li><li><strong>Status</strong> changed from <i>New</i> to <i>Workable</i></li></ul>
openQA Infrastructure - action #99288: [Alerting] openqaworker-arm-5 and openqaworker-arm-4: host up alert on 2021-09-26 size:M | https://progress.opensuse.org/issues/99288?journal_id=449757 | 2021-09-27T09:30:02Z | nicksinger (nsinger@suse.com)
<ul><li><strong>Assignee</strong> set to <i>nicksinger</i></li><li><strong>Target version</strong> deleted (<del><i>Ready</i></del>)</li></ul><blockquote>
<p>openqaworker-arm-5 could be temporarily recovered by disabling the line <code>/dev/md/openqa /var/lib/openqa ext2 defaults 0 0</code> in /etc/fstab, calling <code>mount -a</code>, and logging out of the recovery shell, which continued the boot.</p>
<p><a class="user active user-mention" href="https://progress.opensuse.org/users/24624">@nicksinger</a> didn't you mention that changes are necessary to support the filesystem setup, which differs slightly from other hosts?</p>
</blockquote>
<p>Yes, <a href="https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/572" class="external">https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/572</a>. But I had manually reverted the changes on arm-5 and overlooked <a href="https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/nvme_store/init.sls#L35" class="external">https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/nvme_store/init.sls#L35</a>, which creates an entry in /etc/fstab. I have now deleted that entry manually on arm-5.</p>
openQA Infrastructure - action #99288: [Alerting] openqaworker-arm-5 and openqaworker-arm-4: host up alert on 2021-09-26 size:M | https://progress.opensuse.org/issues/99288?journal_id=449790 | 2021-09-27T10:16:18Z | livdywan (liv.dywan@suse.com)
<ul><li><strong>Description</strong> updated (<a title="View differences" href="/journals/449790/diff?detail_id=426573">diff</a>)</li><li><strong>Status</strong> changed from <i>Workable</i> to <i>In Progress</i></li><li><strong>Target version</strong> set to <i>Ready</i></li></ul>
openQA Infrastructure - action #99288: [Alerting] openqaworker-arm-5 and openqaworker-arm-4: host up alert on 2021-09-26 size:M | https://progress.opensuse.org/issues/99288?journal_id=449796 | 2021-09-27T10:21:47Z | nicksinger (nsinger@suse.com)
<ul><li><strong>Target version</strong> deleted (<del><i>Ready</i></del>)</li></ul><p>I set the boot device for openqaworker-arm-4 persistently with <code>ipmitool -I lanplus -C 3 -H ipmi.openqaworker-arm-4.qa.suse.de chassis bootdev disk options=persistent</code>. The reboot worked 4/4 times. Another highstate also confirmed that the fstab entry does not reappear. I am now doing the reboot test with arm-5.</p>
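<p>To apply the same persistent boot device setting to several hosts, the command above can be assembled programmatically. The sketch below only builds the argument list and does not run anything; the helper name <code>bootdev_command</code> is hypothetical, and credentials are left out because the comment does not show how they were supplied.</p>

```python
import shlex


def bootdev_command(host, device="disk", persistent=True):
    """Build the ipmitool argument list for setting the chassis boot device."""
    argv = ["ipmitool", "-I", "lanplus", "-C", "3", "-H", host,
            "chassis", "bootdev", device]
    if persistent:
        # Without this option the setting applies to the next boot only.
        argv.append("options=persistent")
    return argv


cmd = bootdev_command("ipmi.openqaworker-arm-4.qa.suse.de")
print(shlex.join(cmd))
```

<p>The resulting list could be passed to <code>subprocess.run</code> once authentication is handled in whatever way the environment requires.</p>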
openQA Infrastructure - action #99288: [Alerting] openqaworker-arm-5 and openqaworker-arm-4: host up alert on 2021-09-26 size:M | https://progress.opensuse.org/issues/99288?journal_id=449844 | 2021-09-27T11:29:47Z | nicksinger (nsinger@suse.com)
<ul><li><strong>Status</strong> changed from <i>In Progress</i> to <i>Resolved</i></li><li><strong>Target version</strong> set to <i>Ready</i></li></ul><p>Gnah, sorry for deleting "Ready" once again… openqaworker-arm-5 also survived 3/3 reboots, so I consider this ticket covered.</p>
openQA Infrastructure - action #99288: [Alerting] openqaworker-arm-5 and openqaworker-arm-4: host up alert on 2021-09-26 size:M | https://progress.opensuse.org/issues/99288?journal_id=449904 | 2021-09-27T13:23:52Z | okurz (okurz@suse.com)
<ul><li><strong>Status</strong> changed from <i>Resolved</i> to <i>Feedback</i></li></ul><p>waaait … what happens if salt overwrites the fstab changes again? I think the AC should be "survives multiple reboot+salt+reboot cycles" :) Or does the above really mean that you reverted parts but never rebooted and so missed the fstab part?</p>
openQA Infrastructure - action #99288: [Alerting] openqaworker-arm-5 and openqaworker-arm-4: host up alert on 2021-09-26 size:M | https://progress.opensuse.org/issues/99288?journal_id=450087 | 2021-09-28T06:33:29Z | nicksinger (nsinger@suse.com)
<ul><li><strong>Status</strong> changed from <i>Feedback</i> to <i>Resolved</i></li></ul><p>okurz wrote:</p>
<blockquote>
<p>waaait … what happens if salt overwrites the fstab changes again? I think the AC should be "survives multiple reboot+salt+reboot cycles" :) Or does the above really mean that you reverted parts but never rebooted and so missed the fstab part?</p>
</blockquote>
<p>Right, I reverted the changes manually after applying a proper fix in salt (the addition of another grain) and never rebooted afterwards. In <a href="https://progress.opensuse.org/issues/99288#note-10" class="external">https://progress.opensuse.org/issues/99288#note-10</a> I mentioned another highstate that confirmed this is working. Another way to view this: arm-5 was deployed first and showed problems, while arm-4 didn't have the fstab problems (it was deployed with the grain already present) and just had a wrong boot order :)</p>
<p>So your newly formulated AC is met as well.</p>
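<p>The acceptance criterion "survives multiple reboot+salt+reboot cycles" can be written down as a small loop. This is a sketch under stated assumptions: <code>reboot</code>, <code>apply_highstate</code> and <code>is_up</code> are hypothetical callables that a caller would have to implement (for example around a power cycle, a salt highstate and an ssh check); here they are replaced by stubs.</p>

```python
def survives_cycles(reboot, apply_highstate, is_up, cycles=3):
    """Run reboot -> highstate -> reboot `cycles` times; return how many cycles passed."""
    passed = 0
    for _ in range(cycles):
        reboot()
        if not is_up():
            # Host did not come back after the first reboot; this cycle fails.
            continue
        apply_highstate()
        reboot()
        if is_up():
            passed += 1
    return passed


# Stubs that record the order of actions instead of touching a real host.
log = []
result = survives_cycles(
    reboot=lambda: log.append("reboot"),
    apply_highstate=lambda: log.append("highstate"),
    is_up=lambda: True,
    cycles=2,
)
print(f"{result}/2 cycles passed")
```

<p>A result equal to <code>cycles</code> corresponds to the "3/3" and "4/4" style reports in the comments above, extended by the salt run between the two reboots.</p>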