openSUSE Project Management Tool: Issues | https://progress.opensuse.org/ | 2024-03-22T10:52:59Z
QA - action #157753 (Workable): Bring back automatic recovery for openqaworker-arm-1 size:M | https://progress.opensuse.org/issues/157753 | 2024-03-22T10:52:59Z | okurz (okurz@suse.com)
<a name="Motivation"></a>
<h2 >Motivation<a href="#Motivation" class="wiki-anchor">¶</a></h2>
<p>In #132614 openqaworker-arm-1 was moved to FC Basement so that we have one hot-redundant aarch64 OSD machine outside of PRG2. For that to be set up, we also need to accommodate the automatic recovery feature.</p>
<a name="Acceptance-criteria"></a>
<h2 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1:</strong> The automatic recovery of openqaworker-arm-1 on crashes works</li>
<li><strong>AC2:</strong> openqaworker-arm-1 runs OSD production jobs in a stable way</li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Read the notes on PDU auto-control in <a class="issue tracker-4 status-3 priority-3 priority-lowest closed child" title="action: Move of openqaworker-arm-1 to FC Basement size:M (Resolved)" href="https://progress.opensuse.org/issues/133748">#133748</a></li>
<li>Find out on <a href="https://wiki.suse.net/index.php/SUSE-Quality_Assurance/Labs" class="external">https://wiki.suse.net/index.php/SUSE-Quality_Assurance/Labs</a> how the new PDU can be used</li>
<li>Integrate the new PDU in <a href="https://gitlab.suse.de/openqa/grafana-webhook-actions" class="external">https://gitlab.suse.de/openqa/grafana-webhook-actions</a></li>
<li>After openqaworker-arm-1 is fully back, including recovery, remove the silences in <a href="https://monitor.qa.suse.de/alerting/silences" class="external">https://monitor.qa.suse.de/alerting/silences</a></li>
<li>Remove the "Mute All times" in <a href="https://monitor.qa.suse.de/alerting/routes" class="external">https://monitor.qa.suse.de/alerting/routes</a> for <code>__contacts__ =~ .*"Trigger reboot of openqaworker-arm-1".*</code></li>
</ul>
<a name="Rollback-actions"></a>
<h2 >Rollback actions<a href="#Rollback-actions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Bring openqaworker-arm-1 back into production, see <a href="https://progress.opensuse.org/projects/openqav3/wiki/#Bring-back-machines-into-salt-controlled-production" class="external">https://progress.opensuse.org/projects/openqav3/wiki/#Bring-back-machines-into-salt-controlled-production</a></li>
</ul>
openQA Infrastructure - action #64580 (Workable): Detect and recover from I/O blocked worker mach... | https://progress.opensuse.org/issues/64580 | 2020-03-18T15:53:16Z | okurz (okurz@suse.com)
<a name="Motivation"></a>
<h2 >Motivation<a href="#Motivation" class="wiki-anchor">¶</a></h2>
<p>In <a class="issue tracker-4 status-3 priority-3 priority-lowest closed" title="action: all arm worker die after some time (Resolved)" href="https://progress.opensuse.org/issues/41882">#41882</a> we identified arm machines that become completely unresponsive, and we now detect these situations automatically and recover from them. But there are also cases where a system is I/O blocked: the machine still responds to ping but is not "usable". In this situation the machine can still have openQA jobs assigned, which are then stuck for many hours. Also, the machine is not detected as broken in grafana and hence is never recovered automatically. We should detect a situation like this and recover automatically.</p>
<a name="Acceptance-criteria"></a>
<h2 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1:</strong> Machines that stay in an I/O blocked state for multiple minutes/hours are detected and recovered, e.g. with a reboot, in the same or a similar way as for "worker completely down"</li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Check if there are already measurements available in grafana that could be used to trigger alerts, which then trigger the reboot actions in the same way as <a href="https://stats.openqa-monitor.qa.suse.de/d/1bNU0StZz/automatic-actions?orgId=1" class="external">https://stats.openqa-monitor.qa.suse.de/d/1bNU0StZz/automatic-actions?orgId=1</a></li>
<li>If not, find an additional measurement/alert for this purpose</li>
<li>Ensure the alerts and notification configurations are covered in salt</li>
</ul>
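<p>As a rough illustration of what an "additional measurement" could look like, the sketch below evaluates the Linux pressure stall information (PSI) for I/O as exposed in <code>/proc/pressure/io</code>. Using PSI at all is an assumption and not something the ticket prescribes; the function names, the 90% threshold and the <code>avg300</code> window are hypothetical and would need tuning against real worker data before feeding an alert.</p>

```python
def parse_psi(text):
    """Parse PSI text into a nested dict, e.g. {"full": {"avg10": 0.0, ...}}."""
    result = {}
    for line in text.strip().splitlines():
        # Each line looks like: "full avg10=0.00 avg60=0.00 avg300=0.00 total=0"
        kind, *fields = line.split()
        result[kind] = {
            key: float(value)
            for key, value in (field.split("=", 1) for field in fields)
        }
    return result


def is_io_blocked(text, threshold=90.0, window="avg300"):
    """Hypothetical rule: report True when the share of time in which all
    non-idle tasks were stalled on I/O ("full") reaches the threshold
    (in percent) over the chosen averaging window."""
    psi = parse_psi(text)
    return psi.get("full", {}).get(window, 0.0) >= threshold
```

<p>On a worker this could be called as <code>is_io_blocked(open("/proc/pressure/io").read())</code>, assuming a kernel with PSI enabled. The resulting value would still have to be exported as a metric so that grafana can alert on it and trigger the existing reboot action.</p>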