openSUSE Project Management Tool
openQA Project - action #121768: openQA jobs have been restarted but are stuck in "running" size:M
--- Update by okurz (okurz@suse.com) at 2022-12-09T13:07:30Z: https://progress.opensuse.org/issues/121768?journal_id=583135 ---
<p>That's what the openQA database knows about the job:</p>
<pre><code>openqa=> select * from jobs where id=2940351;
-[ job 2940351, one column per line for readability ]-
id                              | 2940351
priority                        | 45
clone_id                        | 2942915
t_started                       | 2022-12-09 04:30:05
t_finished                      |
t_created                       | 2022-12-09 04:29:50
t_updated                       | 2022-12-09 04:30:05
state                           | running
result                          | user_restarted
backend_info                    |
result_dir                      |
group_id                        | 2
test                            | kde_usbboot
distri                          | opensuse
version                         | Staging:G
flavor                          | Staging-DVD
arch                            | x86_64
build                           | G.587.1
machine                         | USBboot_64
logs_present                    | t
assigned_worker_id              |
passed_module_count             | 0
failed_module_count             | 0
softfailed_module_count         | 0
skipped_module_count            | 0
blocked_by_id                   |
externally_skipped_module_count | 0
scheduled_product_id            |
reason                          |
result_size                     |
archived                        | f
(1 row)
</code></pre>
<p>So there is a result <code>user_restarted</code> combined with the state <code>running</code>. We should make sure to prevent this combination from happening.</p>
<p>Right now on o3 <code>select count(*) from jobs where state='running' and result='user_restarted';</code> says there are ten such jobs.</p>
<p>I now ran <code>update jobs set state='done' where state='running' and result='user_restarted';</code> as a manual workaround.</p>
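<p>The effect of that workaround can be sketched against a throwaway SQLite copy of the relevant columns (the real instance is PostgreSQL accessed via psql; the table layout here is reduced to the essentials):</p>

```python
import sqlite3

# In-memory stand-in for the openQA jobs table (reduced columns).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, state TEXT, result TEXT)")
conn.executemany(
    "INSERT INTO jobs (id, state, result) VALUES (?, ?, ?)",
    [
        (2940351, "running", "user_restarted"),  # stuck: restarted but never finalized
        (2941400, "running", "none"),            # genuinely running, must stay untouched
    ],
)

# Count the inconsistent combination, then apply the manual workaround.
stuck = conn.execute(
    "SELECT count(*) FROM jobs WHERE state='running' AND result='user_restarted'"
).fetchone()[0]
conn.execute(
    "UPDATE jobs SET state='done' WHERE state='running' AND result='user_restarted'"
)
conn.commit()
```

<p>Only jobs with the contradictory state/result pair are patched; genuinely running jobs are left alone.</p>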
--- Update by okurz (okurz@suse.com) at 2022-12-09T13:09:24Z: https://progress.opensuse.org/issues/121768?journal_id=583138 ---
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>In Progress</i></li><li><strong>Assignee</strong> set to <i>okurz</i></li></ul><p>Found more examples of stuck jobs in "running", patched two with <code>update jobs set state='done' where state='running' and result='parallel_failed';</code></p>
--- Update by okurz (okurz@suse.com) at 2022-12-09T13:13:09Z: https://progress.opensuse.org/issues/121768?journal_id=583141 ---
<ul><li><strong>Status</strong> changed from <i>In Progress</i> to <i>New</i></li><li><strong>Assignee</strong> deleted (<del><i>okurz</i></del>)</li></ul><p>dimstar confirmed that this helped to release the next Tumbleweed snapshot, so I won't do more here for now. <a href="https://openqa.opensuse.org/tests/2941399" class="external">https://openqa.opensuse.org/tests/2941399</a> could be the last remaining "still running" candidate. It shouldn't disturb anyone for now, so somebody else can look into that live example and see what's going on.</p>
--- Update by kraih (sebastian.riedel@suse.com) at 2022-12-09T13:50:58Z: https://progress.opensuse.org/issues/121768?journal_id=583210 ---
<p>This might be related to the update deadlock issue <a class="issue tracker-4 status-3 priority-5 priority-high3 closed" title="action: PostgreSQL update ordering deadlock on jobs table (Resolved)" href="https://progress.opensuse.org/issues/121777">#121777</a>.</p>
--- Update by mkittler (marius.kittler@suse.com) at 2022-12-12T08:46:30Z: https://progress.opensuse.org/issues/121768?journal_id=583975 ---
<ul><li><strong>Assignee</strong> set to <i>mkittler</i></li></ul>
--- Update by mkittler (marius.kittler@suse.com) at 2022-12-12T09:46:27Z: https://progress.opensuse.org/issues/121768?journal_id=584005 ---
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>In Progress</i></li></ul><p>2941399 is the last remaining instance, indeed:</p>
<pre><code>openqa=# select id, state, result, assigned_worker_id from jobs where (state != 'done' and state != 'cancelled') and result != 'none';
id | state | result | assigned_worker_id
---------+---------+--------------------+--------------------
2941399 | running | parallel_restarted |
(1 row)
</code></pre>
<p>I'll keep it for now to have at least one example around (as having the job around shouldn't cause any further problems).</p>
--- Update by mkittler (marius.kittler@suse.com) at 2022-12-12T10:24:23Z: https://progress.opensuse.org/issues/121768?journal_id=584023 ---
<p>Looks like this job hasn't had the best start:</p>
<pre><code>/var/log/openqa.6.xz:[2022-12-09T06:19:27.763482Z] [debug] [pid:13150] Job 2941394 duplicated as 2941399
/var/log/openqa.6.xz:[2022-12-09T06:19:46.158303Z] [debug] [pid:11987] Started to send message to 498 for job(s) 2941399
/var/log/openqa.6.xz:[2022-12-09T06:19:46.161826Z] [debug] [pid:11987] Worker 498 rejected job(s) 2941399: already busy with job(s) 2941386
/var/log/openqa.6.xz:[2022-12-09T06:19:46.174983Z] [debug] [pid:11987] Job 2941399 reset to state scheduled
/var/log/openqa.6.xz:[2022-12-09T06:20:04.262921Z] [debug] [pid:11987] Started to send message to 508 for job(s) 2941399
/var/log/openqa.6.xz:[2022-12-09T06:20:04.271645Z] [debug] [pid:11987] Worker 508 rejected job(s) 2941399: Unable to lock pool directory: /var/lib/openqa/pool/12 already locked
/var/log/openqa.6.xz:[2022-12-09T06:20:04.303268Z] [debug] [pid:11987] Job 2941399 reset to state scheduled
/var/log/openqa.6.xz:[2022-12-09T06:20:26.183862Z] [debug] [pid:11987] Started to send message to 498 for job(s) 2941399
/var/log/openqa.6.xz:[2022-12-09T06:20:26.197333Z] [debug] [pid:11987] Worker 498 rejected job(s) 2941399: already busy with job(s) 2941400
/var/log/openqa.6.xz:[2022-12-09T06:20:26.230838Z] [debug] [pid:11987] Job 2941399 reset to state scheduled
/var/log/openqa.6.xz:[2022-12-09T06:21:04.848448Z] [debug] [pid:11987] Started to send message to 506 for job(s) 2941399
/var/log/openqa.6.xz:[2022-12-09T06:21:04.923170Z] [debug] [pid:21222] Job 2941399 reset to state scheduled
/var/log/openqa.6.xz:[2022-12-09T06:21:04.976339Z] [info] [pid:11987] Worker 506 accepted job 2941399 which was never assigned to it or has already finished
/var/log/openqa.6.xz: parallel_parents => [2941399],
/var/log/openqa.6.xz: 2941399 => {
/var/log/openqa.6.xz:[2022-12-09T06:21:09.989474Z] [info] [pid:6924] Got status update for job 2941399 and worker 506 but there is not even a worker assigned to this job (job is running)
/var/log/openqa.6.xz:[2022-12-09T06:21:10.035153Z] [info] [pid:21075] Got status update for job 2941399 and worker 506 but there is not even a worker assigned to this job (job is running)
/var/log/openqa.6.xz:[2022-12-09T06:21:15.093301Z] [info] [pid:32084] Got status update for job 2941399 and worker 506 but there is not even a worker assigned to this job (job is running)
/var/log/openqa.6.xz:[2022-12-09T06:21:15.133679Z] [info] [pid:6924] Got status update for job 2941399 and worker 506 but there is not even a worker assigned to this job (job is running)
/var/log/openqa.6.xz:[2022-12-09T06:21:20.200140Z] [info] [pid:21222] Got status update for job 2941399 and worker 506 but there is not even a worker assigned to this job (job is running)
/var/log/openqa.6.xz:[2022-12-09T06:21:20.248761Z] [info] [pid:12587] Got status update for job 2941399 and worker 506 but there is not even a worker assigned to this job (job is running)
/var/log/openqa.6.xz:[2022-12-09T06:21:25.204731Z] [debug] [pid:11987] Worker 506 rejected job(s) 2941402: already busy with job(s) 2941399
/var/log/openqa.6.xz:[2022-12-09T06:21:25.288844Z] [info] [pid:24498] Got status update for job 2941399 and worker 506 but there is not even a worker assigned to this job (job is running)
/var/log/openqa.6.xz:[2022-12-09T06:21:25.363313Z] [info] [pid:7547] Got status update for job 2941399 and worker 506 but there is not even a worker assigned to this job (job is running)
/var/log/openqa.6.xz:[2022-12-09T06:21:30.409337Z] [info] [pid:26975] Got status update for job 2941399 and worker 506 but there is not even a worker assigned to this job (job is running)
/var/log/openqa.6.xz:[2022-12-09T06:21:30.467943Z] [info] [pid:12947] Got status update for job 2941399 and worker 506 but there is not even a worker assigned to this job (job is running)
/var/log/openqa.6.xz:[2022-12-09T06:21:35.509723Z] [info] [pid:16283] Got status update for job 2941399 and worker 506 but there is not even a worker assigned to this job (job is running)
/var/log/openqa.6.xz:[2022-12-09T06:21:35.581128Z] [info] [pid:15316] Got status update for job 2941399 and worker 506 but there is not even a worker assigned to this job (job is running)
/var/log/openqa.6.xz:[2022-12-09T06:21:35.623819Z] [info] [pid:4226] Got status update for job 2941399 and worker 506 but there is not even a worker assigned to this job (job is running)
/var/log/openqa_scheduler.1.xz:[2022-12-09T06:19:45.979664Z] [debug] [pid:11986] Assigned job '2941399' to worker ID '498'
/var/log/openqa_scheduler.1.xz:[2022-12-09T06:19:45.983909Z] [debug] [pid:11986] [Job#2941399] Prepare for being processed by worker 498
/var/log/openqa_scheduler.1.xz:[2022-12-09T06:19:46.161223Z] [debug] [pid:11986] Sent job(s) '2941399' to worker '498'
/var/log/openqa_scheduler.1.xz:[2022-12-09T06:19:46.164712Z] [debug] [pid:11986] Allocated: { job => 2941399, worker => 498 }
/var/log/openqa_scheduler.1.xz:[2022-12-09T06:20:03.954815Z] [debug] [pid:11986] Assigned job '2941399' to worker ID '508'
/var/log/openqa_scheduler.1.xz:[2022-12-09T06:20:03.959613Z] [debug] [pid:11986] [Job#2941399] Prepare for being processed by worker 508
/var/log/openqa_scheduler.1.xz:[2022-12-09T06:20:04.193124Z] [debug] [pid:11986] Created network for 2941399 : 27
/var/log/openqa_scheduler.1.xz:[2022-12-09T06:20:04.266865Z] [debug] [pid:11986] Sent job(s) '2941399' to worker '508'
/var/log/openqa_scheduler.1.xz:[2022-12-09T06:20:06.369507Z] [debug] [pid:11986] Allocated: { job => 2941399, worker => 508 }
/var/log/openqa_scheduler.1.xz:[2022-12-09T06:20:25.982559Z] [debug] [pid:11986] Assigned job '2941399' to worker ID '498'
/var/log/openqa_scheduler.1.xz:[2022-12-09T06:20:25.992590Z] [debug] [pid:11986] [Job#2941399] Prepare for being processed by worker 498
/var/log/openqa_scheduler.1.xz:[2022-12-09T06:20:26.192479Z] [debug] [pid:11986] Sent job(s) '2941399' to worker '498'
/var/log/openqa_scheduler.1.xz:[2022-12-09T06:20:26.774716Z] [debug] [pid:11986] Allocated: { job => 2941399, worker => 498 }
/var/log/openqa_scheduler.1.xz:[2022-12-09T06:21:04.682691Z] [debug] [pid:11986] Assigned job '2941399' to worker ID '506'
/var/log/openqa_scheduler.1.xz:[2022-12-09T06:21:04.687134Z] [debug] [pid:11986] [Job#2941399] Prepare for being processed by worker 506
/var/log/openqa_scheduler.1.xz:[2022-12-09T06:21:04.895021Z] [debug] [pid:11986] Sent job(s) '2941399' to worker '506'
/var/log/openqa_scheduler.1.xz:[2022-12-09T06:21:05.519232Z] [debug] [pid:11986] Allocated: { job => 2941399, worker => 506 }
</code></pre>
<p>The workers the job was assigned to rejected it, until worker 506 finally accepted it. The web UI side, however, apparently misses that 506 has in fact accepted the job, despite at least considering the job running (just not assigned to a worker). This looks like an inconsistency in how the database is updated. Maybe it is related to <a class="issue tracker-4 status-3 priority-5 priority-high3 closed" title="action: PostgreSQL update ordering deadlock on jobs table (Resolved)" href="https://progress.opensuse.org/issues/121777">#121777</a>, indeed. However, the interleaving of these three events is likely what causes the problem here:</p>
<pre><code>/var/log/openqa.6.xz:[2022-12-09T06:21:04.848448Z] [debug] [pid:11987] Started to send message to 506 for job(s) 2941399
/var/log/openqa.6.xz:[2022-12-09T06:21:04.923170Z] [debug] [pid:21222] Job 2941399 reset to state scheduled
/var/log/openqa.6.xz:[2022-12-09T06:21:04.976339Z] [info] [pid:11987] Worker 506 accepted job 2941399 which was never assigned to it or has already finished
</code></pre>
<p>We start assigning the job to worker 506, then reset it back to scheduled, and when the worker finally accepts the assignment we reject the acceptance because the job has already been set back to scheduled.</p>
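<p>That failing check can be replayed deterministically with a toy model. All function and field names below are made up for illustration; the real logic lives in openQA's scheduler and websocket server:</p>

```python
# Toy model of the three log events above; names are illustrative only.
jobs = {2941399: {"state": "assigned", "worker": 506}}

def reset_to_scheduled(job_id):
    # What the stale job detection does: drop the worker assignment.
    jobs[job_id].update(state="scheduled", worker=None)

def worker_accepts(job_id, worker_id):
    # Acceptance is only honoured while the job is still assigned to this worker.
    job = jobs[job_id]
    if job["state"] == "assigned" and job["worker"] == worker_id:
        job["state"] = "running"
        return True
    return False  # -> "accepted job ... which was never assigned to it"

# 06:21:04.848 message to worker sent, 06:21:04.923 reset, 06:21:04.976 acceptance:
reset_to_scheduled(2941399)
accepted = worker_accepts(2941399, 506)
```

<p>With this ordering the acceptance is refused and the job stays "scheduled", matching the log lines above.</p>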
--- Update by mkittler (marius.kittler@suse.com) at 2022-12-12T11:56:53Z: https://progress.opensuse.org/issues/121768?journal_id=584044 ---
<p>While it is unfortunate that the stale job detection reset a job back to scheduled, this alone shouldn't have caused the issue. After the reset the job should have stayed "scheduled" and eventually been assigned again. However, it somehow ended up stuck in "running", which of course means the scheduler won't touch it anymore (as there's no stale job detection for jobs in "running").</p>
<p>There's a deadlock logged by PostgreSQL:</p>
<pre><code>2022-12-09 06:21:08.416 UTC openqa geekotest [6931]ERROR: deadlock detected
2022-12-09 06:21:08.416 UTC openqa geekotest [6931]DETAIL: Process 6931 waits for ShareLock on transaction 243609936; blocked by process 5779.
Process 5779 waits for ShareLock on transaction 243609932; blocked by process 6931.
Process 6931: UPDATE jobs SET result = $1, t_updated = $2 WHERE id = $3
Process 5779: UPDATE jobs SET result = $1, t_updated = $2 WHERE id = $3
2022-12-09 06:21:08.416 UTC openqa geekotest [6931]HINT: See server log for query details.
2022-12-09 06:21:08.416 UTC openqa geekotest [6931]CONTEXT: while updating tuple (3166,30) in relation "jobs"
2022-12-09 06:21:08.416 UTC openqa geekotest [6931]STATEMENT: UPDATE jobs SET result = $1, t_updated = $2 WHERE id = $3
</code></pre>
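<p>Two transactions that each update the same set of job rows, but in opposite orders, produce exactly this kind of deadlock. A common remedy, sketched below with Python locks standing in for PostgreSQL's row locks (the job ids come from this ticket, everything else is hypothetical), is to always touch rows in one canonical order:</p>

```python
import threading

# One lock per job row; a "transaction" holds its row locks until commit.
row_locks = {2940351: threading.Lock(), 2941399: threading.Lock()}

def update_jobs(job_ids, done):
    held = []
    # Acquiring in sorted id order means every transaction takes locks in
    # the same sequence, so one simply waits instead of deadlocking.
    for job_id in sorted(job_ids):
        lock = row_locks[job_id]
        lock.acquire()
        held.append(lock)
    done.append(sorted(job_ids))
    for lock in reversed(held):  # "commit": release everything
        lock.release()

done = []
t1 = threading.Thread(target=update_jobs, args=([2941399, 2940351], done))
t2 = threading.Thread(target=update_jobs, args=([2940351, 2941399], done))
t1.start(); t2.start(); t1.join(); t2.join()
```

<p>Without the <code>sorted()</code>, each thread could grab one lock and wait forever on the other, which is precisely the ShareLock cycle PostgreSQL reports above.</p>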
<hr>
<p>Likely the following happened:</p>
<ol>
<li>The job is assigned to worker 506 (which should be openqaworker20 slot 5, although there's no trace of job 2941399 anywhere in this machine's journal).</li>
<li>The job is reset to scheduled and thus <code>Job 2941399 reset to state scheduled</code> is being logged but that change has <em>not</em> been committed yet (as the enclosing transaction has not been committed yet). So the log message is kind of a lie at this point.</li>
<li>The job is picked up by the worker after all. It informs the web socket server about it (but doesn't wait for a reply) and then continues sending status updates.</li>
<li>The job is updated by the worker. Since the job ended up running, presumably at least one update succeeded (successful status updates are not logged individually, so there's no corresponding log message).</li>
<li>The web UI starts handling the job update from 4.; it sees that the job is assigned to the worker so it won't refuse the update.</li>
<li>The transaction from 2. is completed.</li>
<li>The web socket server handles the message from 3. and logs <code>Worker 506 accepted job 2941399 which was never assigned to it or has already finished</code>. That's not really interfering with anything and just a log message (the job isn't updated here).</li>
<li>The update from 5. reaches the point where it updates the job status to be "running". It has a check to do so only if the job is still assigned/setup but it will not see the change made by 2./6. as the job information has already been queried before 6.</li>
<li>The job is updated by the worker again. This time we see <code>Got status update for job 2941399 and worker 506 but there is not even a worker assigned to this job (job is running)</code> because now the update code sees 2./6.</li>
<li>The worker eventually gives up and can proceed (successfully) with other jobs but the job is stuck in "running".</li>
</ol>
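<p>Step 8 of this theory can be modelled in a few lines: the decisive check runs against job data that was fetched <em>before</em> the reset transaction committed. SQLite stands in for PostgreSQL here and the columns are simplified:</p>

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, state TEXT, worker INTEGER)")
db.execute("INSERT INTO jobs VALUES (2941399, 'assigned', 506)")
db.commit()

# 5. The status-update handler queries the job and keeps the result in memory.
state, worker = db.execute(
    "SELECT state, worker FROM jobs WHERE id=2941399").fetchone()

# 6. Meanwhile the stale-job transaction commits its reset.
db.execute("UPDATE jobs SET state='scheduled', worker=NULL WHERE id=2941399")
db.commit()

# 8. The handler decides based on its stale copy and flips the job to
#    "running" although it is no longer assigned to any worker.
if state == "assigned" and worker == 506:
    db.execute("UPDATE jobs SET state='running' WHERE id=2941399")
    db.commit()
```

<p>The result is exactly the stuck combination seen in production: state "running" with no assigned worker.</p>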
<p>I'm not yet sure whether that actually <em>is</em> what happened but it is a valid theory.</p>
<p>I'm not sure where the deadlock fits in. It happens between</p>
<pre><code>/var/log/openqa.6.xz:[2022-12-09T06:21:04.976339Z] [info] [pid:11987] Worker 506 accepted job 2941399 which was never assigned to it or has already finished
</code></pre>
<p>and</p>
<pre><code>/var/log/openqa.6.xz:[2022-12-09T06:21:09.989474Z] [info] [pid:6924] Got status update for job 2941399 and worker 506 but there is not even a worker assigned to this job (job is running)
</code></pre>
<p>Maybe it is within the job update code. That would correspond to point 8. in the sequence of my theory. However, it would be very strange if it were about setting the job to "running" considering that actually worked (which is what leads us to this problem in the first place). Besides, when setting the job to "running" the columns <code>t_started</code> and <code>state</code> would be affected (and the message about the deadlock only mentions <code>t_updated</code> and <code>result</code>).</p>
<hr>
<p>For 2940351 the situation looks similar:</p>
<pre><code>/var/log/openqa.5.xz:[2022-12-09T13:05:58.868294Z] [debug] attempt to open ws proxy for job opensuse-Staging:G-Staging-DVD-x86_64-BuildG.587.1-kde_usbboot@USBboot_64 (2940351) where URL to os-autoinst command server is unknown
/var/log/openqa.6.xz:[2022-12-09T08:06:17.907841Z] [debug] attempt to open ws proxy for job opensuse-Staging:G-Staging-DVD-x86_64-BuildG.587.1-kde_usbboot@USBboot_64 (2940351) where URL to os-autoinst command server is unknown
/var/log/openqa.6.xz:[2022-12-09T08:06:27.423709Z] [debug] Restarting job 2940351
/var/log/openqa.6.xz: 2940351 => {
/var/log/openqa.6.xz:[2022-12-09T08:06:27.682788Z] [debug] [pid:21403] Job 2940351 duplicated as 2942915
/var/log/openqa.7.xz:[2022-12-09T04:29:50.511153Z] [debug] [pid:11018] Job 2940338 duplicated as 2940351
/var/log/openqa.7.xz:[2022-12-09T04:30:05.097349Z] [debug] [pid:11987] Started to send message to 510 for job(s) 2940351
/var/log/openqa.7.xz:[2022-12-09T04:30:05.163987Z] [debug] [pid:21591] Job 2940351 reset to state scheduled
/var/log/openqa.7.xz:[2022-12-09T04:30:05.244247Z] [info] [pid:11987] Worker 510 accepted job 2940351 which was never assigned to it or has already finished
/var/log/openqa.7.xz:[2022-12-09T04:30:10.240912Z] [info] [pid:18312] Got status update for job 2940351 and worker 510 but there is not even a worker assigned to this job (job is running)
/var/log/openqa.7.xz:[2022-12-09T04:30:10.277291Z] [info] [pid:22634] Got status update for job 2940351 and worker 510 but there is not even a worker assigned to this job (job is running)
/var/log/openqa.7.xz:[2022-12-09T04:30:16.312427Z] [info] [pid:22634] Got status update for job 2940351 and worker 510 but there is not even a worker assigned to this job (job is running)
/var/log/openqa.7.xz:[2022-12-09T04:30:16.346489Z] [info] [pid:22634] Got status update for job 2940351 and worker 510 but there is not even a worker assigned to this job (job is running)
/var/log/openqa.7.xz:[2022-12-09T04:30:21.381752Z] [info] [pid:5061] Got status update for job 2940351 and worker 510 but there is not even a worker assigned to this job (job is running)
/var/log/openqa.7.xz:[2022-12-09T04:30:25.263946Z] [debug] [pid:11987] Worker 510 rejected job(s) 2940342: already busy with job(s) 2940351
/var/log/openqa.7.xz:[2022-12-09T04:30:26.431890Z] [info] [pid:6026] Got status update for job 2940351 and worker 510 but there is not even a worker assigned to this job (job is running)
/var/log/openqa.7.xz:[2022-12-09T04:30:26.487044Z] [info] [pid:5041] Got status update for job 2940351 and worker 510 but there is not even a worker assigned to this job (job is running)
/var/log/openqa.7.xz:[2022-12-09T04:30:26.522891Z] [info] [pid:21587] Got status update for job 2940351 and worker 510 but there is not even a worker assigned to this job (job is running)
</code></pre>
<p>By the way, the log lines <code>[debug] attempt to open ws proxy for job</code> are just a symptom of setting a job to "running" when it is actually not running.</p>
<hr>
<p>I suppose I'll have to figure out two things:</p>
<ol>
<li>How to avoid the race condition?</li>
<li>Why the stale job detection tries to reset these jobs almost immediately after the assignment and how to prevent that.</li>
</ol>
--- Update by mkittler (marius.kittler@suse.com) at 2022-12-12T14:01:10Z: https://progress.opensuse.org/issues/121768?journal_id=584110 ---
<p>I found a piece of seemingly unused code: <a href="https://github.com/os-autoinst/openQA/pull/4954" class="external">https://github.com/os-autoinst/openQA/pull/4954</a></p>
<p>Maybe it is a good idea to split the transaction of the stale job detection: <a href="https://github.com/os-autoinst/openQA/pull/4955" class="external">https://github.com/os-autoinst/openQA/pull/4955</a></p>
<p>However, this won't really fix 2. from the end of my last comment. That the job is considered by the stale job detection makes it look like the timestamp when the worker was last seen is somehow messed with (and thus all jobs assigned to the worker are considered stale).</p>
--- Update by mkittler (marius.kittler@suse.com) at 2022-12-12T14:35:24Z: https://progress.opensuse.org/issues/121768?journal_id=584131 ---
<p>To make sense of the deadlock I've checked for places where we only set <code>result</code> and <code>t_updated</code>:</p>
<ol>
<li><code>sub auto_duplicate</code> for <code>PARALLEL_RESTARTED</code>; it is invoked in transactions, e.g. the one of the stale job detection. <em>Maybe</em> <a href="https://github.com/os-autoinst/openQA/pull/4954" class="external">https://github.com/os-autoinst/openQA/pull/4954</a> helps regarding that but it won't remove the possibility for deadlocks completely as we also affect different jobs (in a random order) within one iteration of that loop.</li>
<li><code>sub _job_stop_cluster</code> for <code>PARALLEL_FAILED</code>; it is invoked in transactions, e.g. the one of the stale job detection. <em>Maybe</em> <a href="https://github.com/os-autoinst/openQA/pull/4954" class="external">https://github.com/os-autoinst/openQA/pull/4954</a> helps regarding that. Updates are at least already done in a sorted order within the call itself.</li>
<li><code>sub update_result</code> but I don't see it being called in a problematic context.</li>
<li><code>sub job_restart</code> but I also don't see it being called in a problematic context.</li>
</ol>
--- Update by livdywan (liv.dywan@suse.com) at 2022-12-12T15:09:38Z: https://progress.opensuse.org/issues/121768?journal_id=584182 ---
<p>mkittler wrote:</p>
<blockquote>
<p>However, this won't really fix 2. from the end of my last comment. That the job is considered by the stale job detection makes it look like the timestamp when the worker was last seen is somehow messed with (and thus all jobs assigned to the worker are considered stale).</p>
</blockquote>
<p>I looked up the <a href="https://metacpan.org/pod/DBIx::Class::Storage#txn_do" class="external">txn_do</a> docs while reviewing the changes, and this gave me an idea. It seems the best option would be to log after the committed transaction so I made a proof of concept: <a href="https://github.com/os-autoinst/openQA/pull/4956" class="external">https://github.com/os-autoinst/openQA/pull/4956</a></p>
<p>This obviously loses the distinction between log levels. I'm not sure whether that's a problem or whether there's another way to preserve it.</p>
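<p>The idea of the proof of concept can be sketched as a logger that buffers messages inside a transaction and flushes them only once the commit has succeeded, so a message like "Job ... reset to state scheduled" is never logged for a rolled-back change. All names here are illustrative, not openQA's or DBIx::Class's actual API:</p>

```python
from contextlib import contextmanager

emitted = []  # stands in for the real log sink

class TxnLogger:
    def __init__(self):
        self.buffer = None

    def info(self, msg):
        # Inside a transaction messages are buffered; outside, emitted directly.
        (self.buffer if self.buffer is not None else emitted).append(msg)

    @contextmanager
    def txn_do(self):
        self.buffer = []
        try:
            yield
        except Exception:
            self.buffer = None  # rollback: buffered messages would be lies
            raise
        pending, self.buffer = self.buffer, None
        emitted.extend(pending)  # commit succeeded: flush now

log = TxnLogger()
with log.txn_do():
    log.info("Job 2941399 reset to state scheduled")

try:
    with log.txn_do():
        log.info("this message must never appear")
        raise RuntimeError("simulated rollback")
except RuntimeError:
    pass
```

<p>Only the committed transaction's message survives; the rolled-back one is dropped, at the cost of losing per-level routing as noted above.</p>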
--- Update by mkittler (marius.kittler@suse.com) at 2022-12-12T16:01:30Z: https://progress.opensuse.org/issues/121768?journal_id=584197 ---
<p>PR to fix the race condition between workers updating a job and the stale job detection setting jobs back to scheduled: <a href="https://github.com/os-autoinst/openQA/pull/4957" class="external">https://github.com/os-autoinst/openQA/pull/4957</a></p>
<p>It may also prevent the stale job detection from setting jobs back to scheduled too eagerly, by first checking whether the worker has shown up meanwhile.</p>
<p>I'll have a look at the PR to improve logging although I don't find it <em>that</em> important. It is the same deal as with any other logging of operations we do within a transaction. So I'd avoid making the code more complicated just for the sake of logging.</p>
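<p>Conceptually, that back-off could look like the following sketch. The function name, field names, and the threshold are assumptions for illustration, not the actual openQA code:</p>

```python
import time

STALE_AFTER = 120  # assumed: seconds without contact before a worker counts as stale

def maybe_reset_stale_job(job, worker, now=None):
    now = time.time() if now is None else now
    # Re-check the worker's last-seen timestamp right before acting.
    if now - worker["last_seen"] < STALE_AFTER:
        return False  # worker showed up meanwhile: back off, keep the job
    job.update(state="scheduled", worker_id=None)
    return True

# Worker 506 just sent a status update, so its job must not be reset:
fresh_worker = {"id": 506, "last_seen": time.time()}
fresh_job = {"id": 2941399, "state": "assigned", "worker_id": 506}
kept = maybe_reset_stale_job(fresh_job, fresh_worker)

# A worker silent for ten minutes really is stale, so its job is reset:
stale_worker = {"id": 510, "last_seen": time.time() - 600}
stale_job = {"id": 2940351, "state": "assigned", "worker_id": 510}
reset = maybe_reset_stale_job(stale_job, stale_worker)
```

<p>The key design point is that the freshness check happens immediately before the reset, not once at the start of a long-running sweep.</p>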
--- Update by openqa_review (openqa-review@suse.de) at 2022-12-13T04:11:13Z: https://progress.opensuse.org/issues/121768?journal_id=584326 ---
<ul><li><strong>Due date</strong> set to <i>2022-12-27</i></li></ul><p>Setting due date based on mean cycle time of SUSE QE Tools</p>
--- Update by okurz (okurz@suse.com) at 2022-12-22T13:08:26Z: https://progress.opensuse.org/issues/121768?journal_id=587293 ---
<ul><li><strong>Due date</strong> changed from <i>2022-12-27</i> to <i>2023-01-19</i></li><li><strong>Status</strong> changed from <i>In Progress</i> to <i>Feedback</i></li></ul><p>Christmas grace due date bump :)</p>
<p><a href="https://github.com/os-autoinst/openQA/pull/4957" class="external">https://github.com/os-autoinst/openQA/pull/4957</a> was merged</p>
--- Update by okurz (okurz@suse.com) at 2022-12-22T13:08:39Z: https://progress.opensuse.org/issues/121768?journal_id=587296 ---
<ul><li><strong>Related to</strong> <i><a class="issue tracker-4 status-3 priority-5 priority-high3 closed" href="/issues/121777">action #121777</a>: PostgreSQL update ordering deadlock on jobs table</i> added</li></ul>
--- Update by livdywan (liv.dywan@suse.com) at 2023-01-05T10:28:12Z: https://progress.opensuse.org/issues/121768?journal_id=590446 ---
<ul><li><strong>Subject</strong> changed from <i>openQA jobs have been restarted but are stuck in "running"</i> to <i>openQA jobs have been restarted but are stuck in "running" size:M</i></li><li><strong>Description</strong> updated (<a title="View differences" href="/journals/590446/diff?detail_id=554575">diff</a>)</li></ul>
--- Update by kraih (sebastian.riedel@suse.com) at 2023-01-06T10:46:53Z: https://progress.opensuse.org/issues/121768?journal_id=590836 ---
<p>After the last round of PRs, it seems we have a remaining deadlock on the assets table:</p>
<pre><code>2022-12-30 10:50:18.466 CET openqa geekotest [16610]DETAIL: Process 16610 waits for ShareLock on transaction 1570088837; blocked by process 16606.
Process 16606 waits for ShareLock on transaction 1570088839; blocked by process 16610.
Process 16610: SELECT me.id, me.type, me.name, me.size, me.checksum, me.last_use_job_id, me.fixed, me.t_created, me.t_updated FROM assets me WHERE ( ( me.name = $1 AND me.type = $2 ) ) FOR UPDATE
Process 16606: SELECT me.id, me.type, me.name, me.size, me.checksum, me.last_use_job_id, me.fixed, me.t_created, me.t_updated FROM assets me WHERE ( ( me.name = $1 AND me.type = $2 ) ) FOR UPDATE
</code></pre>
--- Update by kraih (sebastian.riedel@suse.com) at 2023-01-06T12:27:24Z: https://progress.opensuse.org/issues/121768?journal_id=590878 ---
<ul><li><strong>Related to</strong> <i><a class="issue tracker-4 status-1 priority-4 priority-default" href="/issues/122791">action #122791</a>: New PostgreSQL query deadlock on assets table</i> added</li></ul>
--- Update by kraih (sebastian.riedel@suse.com) at 2023-01-06T12:28:08Z: https://progress.opensuse.org/issues/121768?journal_id=590884 ---
<p>Opened a new ticket for the <code>SELECT...assets...FOR UPDATE</code> deadlock. <a class="issue tracker-4 status-1 priority-4 priority-default" title="action: New PostgreSQL query deadlock on assets table (New)" href="https://progress.opensuse.org/issues/122791">#122791</a></p>
--- Update by okurz (okurz@suse.com) at 2023-01-07T13:39:00Z: https://progress.opensuse.org/issues/121768?journal_id=590977 ---
<ul><li><strong>Tags</strong> changed from <i>openQA</i> to <i>openQA, reactive work</i></li></ul>
--- Update by mkittler (marius.kittler@suse.com) at 2023-01-10T14:50:12Z: https://progress.opensuse.org/issues/121768?journal_id=591775 ---
<ul><li><strong>Status</strong> changed from <i>Feedback</i> to <i>Resolved</i></li></ul><p>The PR was merged 3 weeks ago. There are no further jobs stuck in "running" on OSD or o3. So presumably the fix worked (or it just didn't happen again). Considering this issue is hard to reproduce in production I suppose the unit tests in the PR must be good enough to verify that the code changes work as intended.</p>
<p>For the remaining race condition (that was there before my PR) we have <a class="issue tracker-4 status-1 priority-4 priority-default" title="action: New PostgreSQL query deadlock on assets table (New)" href="https://progress.opensuse.org/issues/122791">#122791</a>.</p>
<p>I can also now answer my previously asked questions:</p>
<blockquote>
<p>How to avoid the race condition?</p>
</blockquote>
<p>The PR <a href="https://github.com/os-autoinst/openQA/pull/4957" class="external">https://github.com/os-autoinst/openQA/pull/4957</a> should fix them.</p>
<blockquote>
<p>Why the stale job detection tries to reset these jobs almost immediately after the assignment and how to prevent that.</p>
</blockquote>
<p>I'm still not sure about this but maybe the worker's last-seen update was really not present when stale jobs were queried. The change <a href="https://github.com/os-autoinst/openQA/pull/4957/files#diff-f15ca1d973749d88c3683f1ea766d9a61ba5cebac970999ef9da9dfe83e94a08" class="external">https://github.com/os-autoinst/openQA/pull/4957/files#diff-f15ca1d973749d88c3683f1ea766d9a61ba5cebac970999ef9da9dfe83e94a08</a> might help as the stale job detection now backs off when the worker has shown up meanwhile.</p>
<hr>
<p>So I suppose the issue can be considered resolved.</p>