openSUSE Project Management Tool: Issues | https://progress.opensuse.org/ | 2023-03-23T16:08:51Z
Redmine QA - action #126554 (New): [qem-dashboard] Show more details about incident specific openQA jobs ... | https://progress.opensuse.org/issues/126554 | 2023-03-23T16:08:51Z | kraih (sebastian.riedel@suse.com)
<a name="Motivation"></a>
<h2 >Motivation<a href="#Motivation" class="wiki-anchor">¶</a></h2>
<p>Inconsistent job counts have been a recurring issue in the qem-dashboard, and each investigation currently requires database access and SQL knowledge. It would therefore make sense to optionally show additional job details, not just counts, on the incident detail page. This would empower test reviewers to diagnose issues on their own and allow us to benefit from their domain knowledge.</p>
<a name="Acceptance-criteria"></a>
<h2 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1:</strong> It is possible to view details about individual openQA jobs for incident specific jobs</li>
<li><strong>AC2:</strong> openQA jobs flagged as missing do not appear in job counts</li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Some incidents have a lot of incident specific jobs; consider hiding the details by default so they don't get in the way</li>
<li>Highlight jobs that are flagged as missing in openQA in some way</li>
</ul>
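<p>To illustrate AC2, here is a minimal sketch of counting jobs while skipping those flagged as missing. The job dicts and the <code>flagged_missing</code> key are hypothetical and do not reflect the actual qem-dashboard schema.</p>
<pre><code># Sketch only: the job structure and "flagged_missing" key are hypothetical.
def job_counts(jobs):
    """Count jobs per result, skipping jobs flagged as missing (AC2)."""
    counts = {}
    for job in jobs:
        if job.get("flagged_missing"):
            continue  # missing jobs must not appear in job counts
        counts[job["result"]] = counts.get(job["result"], 0) + 1
    return counts

jobs = [
    {"id": 1, "result": "passed"},
    {"id": 2, "result": "failed"},
    {"id": 3, "result": "passed", "flagged_missing": True},
]
counts = job_counts(jobs)  # {"passed": 1, "failed": 1}
</code></pre>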
openQA Project - coordination #98463 (Blocked): [epic] Avoid too slow asset downloads leading to ... | https://progress.opensuse.org/issues/98463 | 2021-09-10T12:34:46Z | mkittler (marius.kittler@suse.com)
<a name="problem-and-scope"></a>
<h3 >problem and scope<a href="#problem-and-scope" class="wiki-anchor">¶</a></h3>
<p>This epic is about the general problem that asset downloads can be quite slow leading to jobs exceeding <code>MAX_SETUP_TIME</code> or being incompleted with <code>Cache service queue already full</code>; it is <strong>not</strong> about worker host specific problems, e.g. broken filesystem or networking problems.</p>
<a name="ideas-to-improve"></a>
<h3 >ideas to improve<a href="#ideas-to-improve" class="wiki-anchor">¶</a></h3>
<p>There are multiple factors contributing to the problem so there's not one simple fix. Here is a list of the areas where we have room for improvement (feel free to add more items):</p>
<ol>
<li>The file system on OSD workers is re-created on every reboot, so the cache needs to be completely renewed after each reboot. Hence this problem shows up almost exclusively on OSD (and not on o3).
<ul>
<li>see <a class="issue tracker-4 status-1 priority-4 priority-default child" title="action: Re-use existing filesystems on workers after reboot if possible to prevent full worker asset cach... (New)" href="https://progress.opensuse.org/issues/97409">#97409</a></li>
</ul></li>
<li>We would also benefit from using a bigger asset cache (although without item 1 being addressed it is likely not of much use)
<ul>
<li>see <a class="issue tracker-4 status-1 priority-4 priority-default child" title="action: Reduce I/O load on OSD by using more cache size on workers with using free disk space when availa... (New)" href="https://progress.opensuse.org/issues/97412">#97412</a></li>
</ul></li>
<li>We should avoid processing downloads when their jobs have already exceeded the timeout anyway. This of course only mitigates a symptom of the problem and might not be very useful anymore once the underlying problem is fixed.
<ul>
<li>see <a class="issue tracker-4 status-6 priority-3 priority-lowest closed child" title="action: Abort asset download via the cache service when related job runs into a timeout (or is otherwise ... (Rejected)" href="https://progress.opensuse.org/issues/96684">#96684</a></li>
</ul></li>
<li>We could try to tweak the parameter <code>OPENQA_CACHE_MAX_INACTIVE_JOBS</code>.
<ul>
<li>This parameter has been introduced by <a class="issue tracker-4 status-3 priority-4 priority-default closed child" title="action: Let workers declare themselves as broken if asset downloads are piling up size:M (Resolved)" href="https://progress.opensuse.org/issues/96623">#96623</a>.</li>
<li>At this point, it is set to 10 via <a href="https://gitlab.suse.de/openqa/salt-states-openqa/-/commit/1e7e862475b40d94f46dc2a72af6b7a4dae6340b">https://gitlab.suse.de/openqa/salt-states-openqa/-/commit/1e7e862475b40d94f46dc2a72af6b7a4dae6340b</a>.</li>
<li>Such broken workers are already ignored by our monitoring, but too low a value can still cause unnecessary incomplete jobs.</li>
</ul></li>
</ol>
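<p>The idea behind item 4 can be sketched as a simple threshold check: once too many download requests are queued up, the worker stops accepting jobs. This is an illustrative sketch, not the actual openQA implementation; the function and variable names are made up.</p>
<pre><code># Illustrative sketch of the OPENQA_CACHE_MAX_INACTIVE_JOBS idea (#96623);
# names are made up and do not match the real openQA code.
MAX_INACTIVE_JOBS = 10  # the value currently set via salt (see above)

def worker_should_declare_broken(queued_download_jobs, limit=MAX_INACTIVE_JOBS):
    """Worker declares itself broken if asset downloads are piling up."""
    return queued_download_jobs > limit
</code></pre>
<p>Tweaking the limit trades fewer "Cache service queue already full" incompletes against keeping workers available.</p>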
<a name="acceptance-criteria"></a>
<h3 >acceptance criteria<a href="#acceptance-criteria" class="wiki-anchor">¶</a></h3>
<ul>
<li><strong>AC1</strong>: The figures for jobs exceeding <code>MAX_SETUP_TIME</code> are significantly lower than the ones mentioned under "further details" below. A specific worker host causing problems for reasons specific to that machine is out of scope, though.</li>
</ul>
<a name="further-details"></a>
<h3 >further details<a href="#further-details" class="wiki-anchor">¶</a></h3>
<p>Multiple worker hosts are affected:</p>
<pre><code>openqa=> select host, count(id) as online_slots, (select array[count(distinct id), count(distinct id) / (extract(epoch FROM (timezone('UTC', now()) - '2021-09-07T00:00:00')) / 3600)] from jobs join jobs_assets on jobs.id = jobs_assets.job_id where assigned_worker_id = any(array_agg(w.id)) and t_finished >= '2021-09-07T00:00:00' and reason like '%setup exceeded MAX_SETUP_TIME%') as recently_abandoned_jobs_total_and_per_hour from workers as w where t_updated > (timezone('UTC', now()) - interval '1 hour') group by host order by recently_abandoned_jobs_total_and_per_hour desc;
host | online_slots | recently_abandoned_jobs_total_and_per_hour
---------------------+--------------+--------------------------------------------
openqaworker5 | 41 | {14,0.167352897235061}
openqaworker6 | 29 | {12,0.143445340487195}
openqaworker13 | 16 | {9,0.107584005365396}
openqaworker3 | 19 | {5,0.0597688918696647}
openqaworker8 | 16 | {5,0.0597688918696647}
openqaworker9 | 16 | {5,0.0597688918696647}
QA-Power8-5-kvm | 8 | {3,0.0358613351217988}
openqaworker11 | 10 | {0,0}
openqaworker2 | 34 | {0,0}
QA-Power8-4-kvm | 8 | {0,0}
powerqaworker-qam-1 | 8 | {0,0}
automotive-3 | 1 | {0,0}
grenache-1 | 50 | {0,0}
malbec | 4 | {0,0}
openqaworker-arm-1 | 10 | {0,0}
openqaworker-arm-2 | 20 | {0,0}
openqaworker10 | 10 | {0,0}
(17 rows)
</code></pre>
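<p>The <code>recently_abandoned_jobs_total_and_per_hour</code> column simply divides the abandoned-job count by the hours elapsed since 2021-09-07T00:00:00. The same arithmetic as a short Python sketch (the query time below is an example, so the rate differs slightly from the table):</p>
<pre><code>from datetime import datetime

def abandoned_per_hour(total, since, now):
    """Mirror the SQL expression: total / (elapsed seconds / 3600)."""
    hours = (now - since).total_seconds() / 3600
    return total / hours

since = datetime(2021, 9, 7, 0, 0, 0)
now = datetime(2021, 9, 10, 12, 0, 0)  # example query time
rate = abandoned_per_hour(14, since, now)  # openqaworker5's 14 jobs, ~0.167/h
</code></pre>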
<p>The ones which are affected most are also the ones needing the most assets:</p>
<pre><code>openqa=> select host, count(id) as online_slots, (select array[((select sum(size) from assets where id = any(array_agg(distinct jobs_assets.asset_id))) / 1024 / 1024 / 1024), count(distinct id)] from jobs join jobs_assets on jobs.id = jobs_assets.job_id where assigned_worker_id = any(array_agg(w.id)) and t_finished >= '2021-09-07T00:00:00') as recent_asset_size_in_gb_and_job_count from workers as w where t_updated > (timezone('UTC', now()) - interval '1 hour') group by host order by recent_asset_size_in_gb_and_job_count desc;
host | online_slots | recent_asset_size_in_gb_and_job_count
---------------------+--------------+---------------------------------------
openqaworker11 | 10 | {NULL,0}
automotive-3 | 1 | {NULL,0}
openqaworker6 | 29 | {1739.5315849324688340,3444}
openqaworker5 | 41 | {1668.8964441129937744,3665}
openqaworker13 | 16 | {1591.4191119810566328,2221}
openqaworker8 | 16 | {1487.1783863399177842,2531}
openqaworker3 | 19 | {1447.2926171422004697,2350}
openqaworker9 | 16 | {1368.1286235852167031,2380}
openqaworker10 | 10 | {1117.2662402801215645,1706}
openqaworker2 | 34 | {781.0186277972534277,718}
grenache-1 | 50 | {663.5168796060606865,1477}
openqaworker-arm-2 | 20 | {346.2731295535340879,1123}
openqaworker-arm-1 | 10 | {332.1729393638670449,614}
QA-Power8-5-kvm | 8 | {239.5352552458643916,298}
powerqaworker-qam-1 | 8 | {238.9669120963662910,361}
QA-Power8-4-kvm | 8 | {223.1794419540092373,297}
malbec | 4 | {187.9319233968853955,141}
(17 rows)
</code></pre>
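<p>This also hints at why a bigger asset cache (item 2 above) would help: the busiest hosts needed well over 1.5 TiB of assets within a few days. A rough sketch of that comparison; the 400 GiB cache size is a hypothetical example, not the real OSD configuration:</p>
<pre><code>def cache_deficit_gb(recent_asset_demand_gb, cache_size_gb):
    """How far recent asset demand exceeds the local cache, in GiB."""
    return max(0, recent_asset_demand_gb - cache_size_gb)

# openqaworker6 needed ~1739 GiB of assets recently; with a hypothetical
# 400 GiB cache, the shortfall that must be re-downloaded is large:
deficit = cache_deficit_gb(1739, 400)  # 1339
</code></pre>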