action #121768
openQA jobs have been restarted but are stuck in "running" size:M (closed)
Added by okurz about 2 years ago. Updated almost 2 years ago.
Description
Observation
Right now https://openqa.opensuse.org/tests shows multiple jobs "running" with a grey icon and apparently stuck. For example, https://openqa.opensuse.org/tests/2940351#live shows as running but has already been cloned as 2942915. Clicking on cancel has no effect.
Acceptance criteria
- AC1: No more jobs getting stuck in state "running"
- AC2: Jobs can always be cancelled
Suggestions
- Look into deadlock issues to see if they are responsible: #121777
Updated by okurz about 2 years ago
That's what the openQA database knows about the job:
openqa=> select * from jobs where id=2940351;
-[ RECORD 1 ]-------------------+--------------------
id                              | 2940351
priority                        | 45
clone_id                        | 2942915
t_started                       | 2022-12-09 04:30:05
t_finished                      |
t_created                       | 2022-12-09 04:29:50
t_updated                       | 2022-12-09 04:30:05
state                           | running
result                          | user_restarted
backend_info                    |
result_dir                      |
group_id                        | 2
test                            | kde_usbboot
distri                          | opensuse
version                         | Staging:G
flavor                          | Staging-DVD
arch                            | x86_64
build                           | G.587.1
machine                         | USBboot_64
logs_present                    | t
assigned_worker_id              |
passed_module_count             | 0
failed_module_count             | 0
softfailed_module_count         | 0
skipped_module_count            | 0
blocked_by_id                   |
externally_skipped_module_count | 0
scheduled_product_id            |
reason                          |
result_size                     |
archived                        | f
So the job has the result "user_restarted" combined with the state "running". We should make sure to prevent this combination from happening.
Right now on o3,
select count(*) from jobs where state='running' and result='user_restarted';
says there are ten such jobs. As a manual workaround I now ran
update jobs set state='done' where state='running' and result='user_restarted';
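For future reference, a broader check along the same lines could look like the sketch below; it lists only the result values that have been observed together with state "running" in this ticket, whether other combinations can occur is an assumption.
-- sketch: find jobs stuck in "running" although a restart/cluster result was already set
select id, result, clone_id, t_updated
  from jobs
 where state = 'running'
   and result in ('user_restarted', 'parallel_restarted', 'parallel_failed');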
Updated by okurz about 2 years ago
- Status changed from New to In Progress
- Assignee set to okurz
Found more examples of jobs stuck in "running"; patched two of them with
update jobs set state='done' where state='running' and result='parallel_failed';
Updated by okurz about 2 years ago
- Status changed from In Progress to New
- Assignee deleted (okurz)
dimstar confirmed that this helped to release the next Tumbleweed snapshot, so I won't do more here for now. https://openqa.opensuse.org/tests/2941399 could be the last remaining "still running" candidate; it shouldn't disturb anyone for now, so somebody else can look into that live example and see what's going on.
Updated by kraih about 2 years ago
This might be related to the update deadlock issue #121777.
Updated by mkittler about 2 years ago
- Status changed from New to In Progress
2941399 is the last remaining instance, indeed:
openqa=# select id, state, result, assigned_worker_id from jobs where (state != 'done' and state != 'cancelled') and result != 'none';
id | state | result | assigned_worker_id
---------+---------+--------------------+--------------------
2941399 | running | parallel_restarted |
(1 row)
I'll keep it for now to have at least one example around (as having the job around shouldn't cause any further problems).
Updated by mkittler about 2 years ago
Looks like this job hasn't had the best start:
/var/log/openqa.6.xz:[2022-12-09T06:19:27.763482Z] [debug] [pid:13150] Job 2941394 duplicated as 2941399
/var/log/openqa.6.xz:[2022-12-09T06:19:46.158303Z] [debug] [pid:11987] Started to send message to 498 for job(s) 2941399
/var/log/openqa.6.xz:[2022-12-09T06:19:46.161826Z] [debug] [pid:11987] Worker 498 rejected job(s) 2941399: already busy with job(s) 2941386
/var/log/openqa.6.xz:[2022-12-09T06:19:46.174983Z] [debug] [pid:11987] Job 2941399 reset to state scheduled
/var/log/openqa.6.xz:[2022-12-09T06:20:04.262921Z] [debug] [pid:11987] Started to send message to 508 for job(s) 2941399
/var/log/openqa.6.xz:[2022-12-09T06:20:04.271645Z] [debug] [pid:11987] Worker 508 rejected job(s) 2941399: Unable to lock pool directory: /var/lib/openqa/pool/12 already locked
/var/log/openqa.6.xz:[2022-12-09T06:20:04.303268Z] [debug] [pid:11987] Job 2941399 reset to state scheduled
/var/log/openqa.6.xz:[2022-12-09T06:20:26.183862Z] [debug] [pid:11987] Started to send message to 498 for job(s) 2941399
/var/log/openqa.6.xz:[2022-12-09T06:20:26.197333Z] [debug] [pid:11987] Worker 498 rejected job(s) 2941399: already busy with job(s) 2941400
/var/log/openqa.6.xz:[2022-12-09T06:20:26.230838Z] [debug] [pid:11987] Job 2941399 reset to state scheduled
/var/log/openqa.6.xz:[2022-12-09T06:21:04.848448Z] [debug] [pid:11987] Started to send message to 506 for job(s) 2941399
/var/log/openqa.6.xz:[2022-12-09T06:21:04.923170Z] [debug] [pid:21222] Job 2941399 reset to state scheduled
/var/log/openqa.6.xz:[2022-12-09T06:21:04.976339Z] [info] [pid:11987] Worker 506 accepted job 2941399 which was never assigned to it or has already finished
/var/log/openqa.6.xz: parallel_parents => [2941399],
/var/log/openqa.6.xz: 2941399 => {
/var/log/openqa.6.xz:[2022-12-09T06:21:09.989474Z] [info] [pid:6924] Got status update for job 2941399 and worker 506 but there is not even a worker assigned to this job (job is running)
/var/log/openqa.6.xz:[2022-12-09T06:21:10.035153Z] [info] [pid:21075] Got status update for job 2941399 and worker 506 but there is not even a worker assigned to this job (job is running)
/var/log/openqa.6.xz:[2022-12-09T06:21:15.093301Z] [info] [pid:32084] Got status update for job 2941399 and worker 506 but there is not even a worker assigned to this job (job is running)
/var/log/openqa.6.xz:[2022-12-09T06:21:15.133679Z] [info] [pid:6924] Got status update for job 2941399 and worker 506 but there is not even a worker assigned to this job (job is running)
/var/log/openqa.6.xz:[2022-12-09T06:21:20.200140Z] [info] [pid:21222] Got status update for job 2941399 and worker 506 but there is not even a worker assigned to this job (job is running)
/var/log/openqa.6.xz:[2022-12-09T06:21:20.248761Z] [info] [pid:12587] Got status update for job 2941399 and worker 506 but there is not even a worker assigned to this job (job is running)
/var/log/openqa.6.xz:[2022-12-09T06:21:25.204731Z] [debug] [pid:11987] Worker 506 rejected job(s) 2941402: already busy with job(s) 2941399
/var/log/openqa.6.xz:[2022-12-09T06:21:25.288844Z] [info] [pid:24498] Got status update for job 2941399 and worker 506 but there is not even a worker assigned to this job (job is running)
/var/log/openqa.6.xz:[2022-12-09T06:21:25.363313Z] [info] [pid:7547] Got status update for job 2941399 and worker 506 but there is not even a worker assigned to this job (job is running)
/var/log/openqa.6.xz:[2022-12-09T06:21:30.409337Z] [info] [pid:26975] Got status update for job 2941399 and worker 506 but there is not even a worker assigned to this job (job is running)
/var/log/openqa.6.xz:[2022-12-09T06:21:30.467943Z] [info] [pid:12947] Got status update for job 2941399 and worker 506 but there is not even a worker assigned to this job (job is running)
/var/log/openqa.6.xz:[2022-12-09T06:21:35.509723Z] [info] [pid:16283] Got status update for job 2941399 and worker 506 but there is not even a worker assigned to this job (job is running)
/var/log/openqa.6.xz:[2022-12-09T06:21:35.581128Z] [info] [pid:15316] Got status update for job 2941399 and worker 506 but there is not even a worker assigned to this job (job is running)
/var/log/openqa.6.xz:[2022-12-09T06:21:35.623819Z] [info] [pid:4226] Got status update for job 2941399 and worker 506 but there is not even a worker assigned to this job (job is running)
/var/log/openqa_scheduler.1.xz:[2022-12-09T06:19:45.979664Z] [debug] [pid:11986] Assigned job '2941399' to worker ID '498'
/var/log/openqa_scheduler.1.xz:[2022-12-09T06:19:45.983909Z] [debug] [pid:11986] [Job#2941399] Prepare for being processed by worker 498
/var/log/openqa_scheduler.1.xz:[2022-12-09T06:19:46.161223Z] [debug] [pid:11986] Sent job(s) '2941399' to worker '498'
/var/log/openqa_scheduler.1.xz:[2022-12-09T06:19:46.164712Z] [debug] [pid:11986] Allocated: { job => 2941399, worker => 498 }
/var/log/openqa_scheduler.1.xz:[2022-12-09T06:20:03.954815Z] [debug] [pid:11986] Assigned job '2941399' to worker ID '508'
/var/log/openqa_scheduler.1.xz:[2022-12-09T06:20:03.959613Z] [debug] [pid:11986] [Job#2941399] Prepare for being processed by worker 508
/var/log/openqa_scheduler.1.xz:[2022-12-09T06:20:04.193124Z] [debug] [pid:11986] Created network for 2941399 : 27
/var/log/openqa_scheduler.1.xz:[2022-12-09T06:20:04.266865Z] [debug] [pid:11986] Sent job(s) '2941399' to worker '508'
/var/log/openqa_scheduler.1.xz:[2022-12-09T06:20:06.369507Z] [debug] [pid:11986] Allocated: { job => 2941399, worker => 508 }
/var/log/openqa_scheduler.1.xz:[2022-12-09T06:20:25.982559Z] [debug] [pid:11986] Assigned job '2941399' to worker ID '498'
/var/log/openqa_scheduler.1.xz:[2022-12-09T06:20:25.992590Z] [debug] [pid:11986] [Job#2941399] Prepare for being processed by worker 498
/var/log/openqa_scheduler.1.xz:[2022-12-09T06:20:26.192479Z] [debug] [pid:11986] Sent job(s) '2941399' to worker '498'
/var/log/openqa_scheduler.1.xz:[2022-12-09T06:20:26.774716Z] [debug] [pid:11986] Allocated: { job => 2941399, worker => 498 }
/var/log/openqa_scheduler.1.xz:[2022-12-09T06:21:04.682691Z] [debug] [pid:11986] Assigned job '2941399' to worker ID '506'
/var/log/openqa_scheduler.1.xz:[2022-12-09T06:21:04.687134Z] [debug] [pid:11986] [Job#2941399] Prepare for being processed by worker 506
/var/log/openqa_scheduler.1.xz:[2022-12-09T06:21:04.895021Z] [debug] [pid:11986] Sent job(s) '2941399' to worker '506'
/var/log/openqa_scheduler.1.xz:[2022-12-09T06:21:05.519232Z] [debug] [pid:11986] Allocated: { job => 2941399, worker => 506 }
The workers the job was assigned to rejected it until worker 506 finally accepted it, but the web UI side apparently misses that 506 has in fact accepted the job - despite at least thinking that the job is running (just not assigned to a worker). This looks like there's some inconsistency in how the database is updated. Maybe it is related to #121777, indeed. However, the interleaving of these three events is likely what causes the problem here:
/var/log/openqa.6.xz:[2022-12-09T06:21:04.848448Z] [debug] [pid:11987] Started to send message to 506 for job(s) 2941399
/var/log/openqa.6.xz:[2022-12-09T06:21:04.923170Z] [debug] [pid:21222] Job 2941399 reset to state scheduled
/var/log/openqa.6.xz:[2022-12-09T06:21:04.976339Z] [info] [pid:11987] Worker 506 accepted job 2941399 which was never assigned to it or has already finished
We start assigning the job to worker 506, then we reset it back to scheduled, and then, when the worker finally accepts the assignment, we refuse the acceptance because the job has already been set back to scheduled.
Updated by mkittler about 2 years ago
While it is unfortunate that the stale job detection reset a job back to scheduled, this shouldn't have caused the issue. After the reset the job should have stayed "scheduled" and eventually been assigned again. However, it somehow ended up being stuck in "running", which of course means the scheduler won't touch it anymore (as there's no stale job detection for jobs in "running").
There's a deadlock logged by PostgreSQL:
2022-12-09 06:21:08.416 UTC openqa geekotest [6931]ERROR: deadlock detected
2022-12-09 06:21:08.416 UTC openqa geekotest [6931]DETAIL: Process 6931 waits for ShareLock on transaction 243609936; blocked by process 5779.
Process 5779 waits for ShareLock on transaction 243609932; blocked by process 6931.
Process 6931: UPDATE jobs SET result = $1, t_updated = $2 WHERE id = $3
Process 5779: UPDATE jobs SET result = $1, t_updated = $2 WHERE id = $3
2022-12-09 06:21:08.416 UTC openqa geekotest [6931]HINT: See server log for query details.
2022-12-09 06:21:08.416 UTC openqa geekotest [6931]CONTEXT: while updating tuple (3166,30) in relation "jobs"
2022-12-09 06:21:08.416 UTC openqa geekotest [6931]STATEMENT: UPDATE jobs SET result = $1, t_updated = $2 WHERE id = $3
Likely the following happened:
1. The job is assigned to worker 506 (which should be openqaworker20 slot 5, although there's no trace of job 2941399 anywhere in this machine's journal).
2. The job is reset to scheduled and thus "Job 2941399 reset to state scheduled" is being logged, but that change has not been committed yet (as the enclosing transaction has not been committed yet). So the log message is kind of a lie at this point.
3. The job is picked up by the worker after all. It informs the web socket server about it (but does not wait for a reply) and then continues sending status updates.
4. The job is updated by the worker. Since the job ended up running, presumably at least one successful update was made (there's no log message for every successful status update, so there's no corresponding log message).
5. The web UI starts handling the job update from 4.; it sees that the job is assigned to the worker so it won't refuse the update.
6. The transaction from 2. is completed.
7. The web socket server handles the message from 3. and logs "Worker 506 accepted job 2941399 which was never assigned to it or has already finished". That's not really interfering with anything and is just a log message (the job isn't updated here).
8. The update from 5. reaches the point where it sets the job status to "running". It has a check to do so only if the job is still assigned/setup, but it does not see the change made by 2./6. as the job information has already been queried before 6.
9. The job is updated by the worker again. This time we see "Got status update for job 2941399 and worker 506 but there is not even a worker assigned to this job (job is running)" because now the update code sees 2./6.
10. The worker eventually gives up and can proceed (successfully) with other jobs, but the job is stuck in "running".
I'm not yet sure whether that actually is what happened but it is a valid theory.
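The core of the theory, steps 2., 5., 6. and 8., can be illustrated at the database level with two concurrent sessions. This is only a rough sketch; the real code paths are the Perl stale job detection and the job status update handler:
-- session A: stale job detection, one enclosing transaction (step 2.)
BEGIN;
UPDATE jobs SET state = 'scheduled' WHERE id = 2941399;  -- logged, but not committed yet

-- session B: job status update handler (step 5.); it still sees the job
-- assigned to worker 506 because session A has not committed yet
SELECT state, assigned_worker_id FROM jobs WHERE id = 2941399;

-- session A: the stale job transaction completes (step 6.)
COMMIT;

-- session B: based on the stale data read above it now marks the job as
-- running (step 8.), silently overwriting the reset from session A
UPDATE jobs SET state = 'running' WHERE id = 2941399;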
I'm not sure where the deadlock fits in. It happens between
/var/log/openqa.6.xz:[2022-12-09T06:21:04.976339Z] [info] [pid:11987] Worker 506 accepted job 2941399 which was never assigned to it or has already finished
and
/var/log/openqa.6.xz:[2022-12-09T06:21:09.989474Z] [info] [pid:6924] Got status update for job 2941399 and worker 506 but there is not even a worker assigned to this job (job is running)
Maybe it is within the job update code. That would correspond to point 8. in the sequence of my theory. However, it would be very strange if it were about setting the job to "running" considering that actually worked (which is what leads us to this problem in the first place). Besides, when setting the job to "running" the columns t_started and state would be affected (and the message about the deadlock only mentions t_updated and result).
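For what it's worth, a deadlock of the shape PostgreSQL reported above can be reproduced with two sessions that update the same two job rows in opposite order. This is only an illustration with made-up ids, not the actual statements openQA issued:
-- session A (e.g. stale job detection)
BEGIN;
UPDATE jobs SET result = 'parallel_restarted', t_updated = now() WHERE id = 1;

-- session B (e.g. another cluster operation), concurrently
BEGIN;
UPDATE jobs SET result = 'parallel_failed', t_updated = now() WHERE id = 2;

-- session A: now waits for B's row lock on id = 2
UPDATE jobs SET result = 'parallel_restarted', t_updated = now() WHERE id = 2;

-- session B: waits for A's lock on id = 1; PostgreSQL detects the cycle
-- and aborts one of the transactions with "deadlock detected"
UPDATE jobs SET result = 'parallel_failed', t_updated = now() WHERE id = 1;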
For 2940351 the situation looks similar:
/var/log/openqa.5.xz:[2022-12-09T13:05:58.868294Z] [debug] attempt to open ws proxy for job opensuse-Staging:G-Staging-DVD-x86_64-BuildG.587.1-kde_usbboot@USBboot_64 (2940351) where URL to os-autoinst command server is unknown
/var/log/openqa.6.xz:[2022-12-09T08:06:17.907841Z] [debug] attempt to open ws proxy for job opensuse-Staging:G-Staging-DVD-x86_64-BuildG.587.1-kde_usbboot@USBboot_64 (2940351) where URL to os-autoinst command server is unknown
/var/log/openqa.6.xz:[2022-12-09T08:06:27.423709Z] [debug] Restarting job 2940351
/var/log/openqa.6.xz: 2940351 => {
/var/log/openqa.6.xz:[2022-12-09T08:06:27.682788Z] [debug] [pid:21403] Job 2940351 duplicated as 2942915
/var/log/openqa.7.xz:[2022-12-09T04:29:50.511153Z] [debug] [pid:11018] Job 2940338 duplicated as 2940351
/var/log/openqa.7.xz:[2022-12-09T04:30:05.097349Z] [debug] [pid:11987] Started to send message to 510 for job(s) 2940351
/var/log/openqa.7.xz:[2022-12-09T04:30:05.163987Z] [debug] [pid:21591] Job 2940351 reset to state scheduled
/var/log/openqa.7.xz:[2022-12-09T04:30:05.244247Z] [info] [pid:11987] Worker 510 accepted job 2940351 which was never assigned to it or has already finished
/var/log/openqa.7.xz:[2022-12-09T04:30:10.240912Z] [info] [pid:18312] Got status update for job 2940351 and worker 510 but there is not even a worker assigned to this job (job is running)
/var/log/openqa.7.xz:[2022-12-09T04:30:10.277291Z] [info] [pid:22634] Got status update for job 2940351 and worker 510 but there is not even a worker assigned to this job (job is running)
/var/log/openqa.7.xz:[2022-12-09T04:30:16.312427Z] [info] [pid:22634] Got status update for job 2940351 and worker 510 but there is not even a worker assigned to this job (job is running)
/var/log/openqa.7.xz:[2022-12-09T04:30:16.346489Z] [info] [pid:22634] Got status update for job 2940351 and worker 510 but there is not even a worker assigned to this job (job is running)
/var/log/openqa.7.xz:[2022-12-09T04:30:21.381752Z] [info] [pid:5061] Got status update for job 2940351 and worker 510 but there is not even a worker assigned to this job (job is running)
/var/log/openqa.7.xz:[2022-12-09T04:30:25.263946Z] [debug] [pid:11987] Worker 510 rejected job(s) 2940342: already busy with job(s) 2940351
/var/log/openqa.7.xz:[2022-12-09T04:30:26.431890Z] [info] [pid:6026] Got status update for job 2940351 and worker 510 but there is not even a worker assigned to this job (job is running)
/var/log/openqa.7.xz:[2022-12-09T04:30:26.487044Z] [info] [pid:5041] Got status update for job 2940351 and worker 510 but there is not even a worker assigned to this job (job is running)
/var/log/openqa.7.xz:[2022-12-09T04:30:26.522891Z] [info] [pid:21587] Got status update for job 2940351 and worker 510 but there is not even a worker assigned to this job (job is running)
By the way, the log lines "[debug] attempt to open ws proxy for job ..." are just a symptom of setting a job to "running" when it is actually not running.
I suppose I'll have to figure out two things:
- How to avoid the race condition?
- Why the stale job detection tries to reset these jobs almost immediately after the assignment and how to prevent that.
Updated by mkittler about 2 years ago
I found a piece of seemingly unused code: https://github.com/os-autoinst/openQA/pull/4954
Maybe it is a good idea to split the transaction of the stale job detection: https://github.com/os-autoinst/openQA/pull/4955
However, this won't really fix 2. from the end of my last comment. That the job is considered by the stale job detection makes it look like the timestamp when the worker was last seen is somehow messed with (and thus all jobs assigned to the worker are considered stale).
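The effect of splitting that transaction, expressed at the database level, is roughly the sketch below: each stale job is reset in its own short transaction instead of one big transaction spanning all workers and jobs, so a conflict only affects a single job. The exact set of columns reset here is an assumption for illustration; the real logic lives in the Perl code of the PR.
-- sketch only: per-job reset instead of one enclosing transaction
BEGIN;
UPDATE jobs
   SET state = 'scheduled', assigned_worker_id = NULL, t_updated = now()
 WHERE id = 2941399;
COMMIT;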
Updated by mkittler about 2 years ago
To make sense of the deadlock I've checked for places where we only set result and t_updated:
- sub auto_duplicate for PARALLEL_RESTARTED; it is invoked in transactions, e.g. the one of the stale job detection. Maybe https://github.com/os-autoinst/openQA/pull/4954 helps regarding that, but it won't remove the possibility for deadlocks completely as we also affect different jobs (in a random order) within one iteration of that loop.
- sub _job_stop_cluster for PARALLEL_FAILED; it is invoked in transactions, e.g. the one of the stale job detection. Maybe https://github.com/os-autoinst/openQA/pull/4954 helps regarding that. Updates are at least already done in a sorted order within the call itself.
- sub update_result, but I don't see it being called in a problematic context.
- sub job_restart, but I also don't see it being called in a problematic context.
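Since several of these code paths update multiple jobs of a cluster within one transaction, a generic way to rule out this class of deadlock is to take the row locks in a consistent order before updating. A minimal sketch of that idea with illustrative ids, not the actual openQA code:
BEGIN;
-- lock the affected job rows in ascending id order so two concurrent
-- transactions touching the same set cannot wait on each other in a cycle
SELECT id FROM jobs WHERE id IN (2941394, 2941399) ORDER BY id FOR UPDATE;
UPDATE jobs
   SET result = 'parallel_restarted', t_updated = now()
 WHERE id IN (2941394, 2941399);
COMMIT;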
Updated by livdywan about 2 years ago
mkittler wrote:
However, this won't really fix 2. from the end of my last comment. That the job is considered by the stale job detection makes it look like the timestamp when the worker was last seen is somehow messed with (and thus all jobs assigned to the worker are considered stale).
I looked up the txn_do docs while reviewing the changes, and this gave me an idea. It seems the best option would be to log after the transaction has been committed, so I made a proof of concept: https://github.com/os-autoinst/openQA/pull/4956
This obviously loses the distinction between log levels. I'm not sure whether that's a problem or whether there's another way to preserve it.
Updated by mkittler about 2 years ago
PR to fix the race condition between workers updating a job and the stale job detection setting jobs back to scheduled: https://github.com/os-autoinst/openQA/pull/4957
It may also prevent the stale job detection from trying to set jobs back to scheduled too eagerly by first checking whether the worker has shown up in the meantime.
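That check, expressed roughly at the SQL level, could look like the sketch below; the workers.t_seen column, the state filter and the concrete timeout are assumptions for illustration, the actual implementation is in the Perl code of the linked PR:
-- sketch: only treat an assigned job as stale when its worker really has
-- not been seen recently
SELECT j.id
  FROM jobs j
  JOIN workers w ON w.id = j.assigned_worker_id
 WHERE j.state = 'assigned'
   AND w.t_seen < now() - interval '2 minutes';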
I'll have a look at the PR to improve logging although I don't find it that important. It is the same deal as with any other logging of operations we do within a transaction. So I'd avoid making the code more complicated just for the sake of logging.
Updated by openqa_review about 2 years ago
- Due date set to 2022-12-27
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz almost 2 years ago
- Due date changed from 2022-12-27 to 2023-01-19
- Status changed from In Progress to Feedback
Christmas grace due date bump :)
Updated by okurz almost 2 years ago
- Related to action #121777: PostgreSQL update ordering deadlock on jobs table added
Updated by livdywan almost 2 years ago
- Subject changed from openQA jobs have been restarted but are stuck in "running" to openQA jobs have been restarted but are stuck in "running" size:M
- Description updated (diff)
Updated by kraih almost 2 years ago
After the last round of PRs it seems we have one remaining deadlock, on the assets table:
2022-12-30 10:50:18.466 CET openqa geekotest [16610]DETAIL: Process 16610 waits for ShareLock on transaction 1570088837; blocked by process 16606.
Process 16606 waits for ShareLock on transaction 1570088839; blocked by process 16610.
Process 16610: SELECT me.id, me.type, me.name, me.size, me.checksum, me.last_use_job_id, me.fixed, me.t_created, me.t_updated FROM assets me WHERE ( ( me.name = $1 AND me.type = $2 ) ) FOR UPDATE
Process 16606: SELECT me.id, me.type, me.name, me.size, me.checksum, me.last_use_job_id, me.fixed, me.t_created, me.t_updated FROM assets me WHERE ( ( me.name = $1 AND me.type = $2 ) ) FOR UPDATE
Updated by kraih almost 2 years ago
- Related to action #122791: New PostgreSQL query deadlock on assets table added
Updated by kraih almost 2 years ago
Opened a new ticket for the SELECT...assets...FOR UPDATE deadlock: #122791
Updated by okurz almost 2 years ago
- Tags changed from openQA to openQA, reactive work
Updated by mkittler almost 2 years ago
- Status changed from Feedback to Resolved
The PR was merged 3 weeks ago. There are no further jobs stuck in "running" on OSD or o3, so presumably the fix worked (or the problem just didn't happen again). Considering this issue is hard to reproduce in production, I suppose the unit tests in the PR must be good enough to verify that the code changes work as intended.
For the remaining race condition (that was there before my PR) we have #122791.
I can also now answer my previously asked questions:
How to avoid the race condition?
The PR https://github.com/os-autoinst/openQA/pull/4957 should fix it.
Why the stale job detection tries to reset these jobs almost immediately after the assignment and how to prevent that.
I'm still not sure about this, but maybe the worker's last-seen update was really not present yet when stale jobs were queried. The change https://github.com/os-autoinst/openQA/pull/4957/files#diff-f15ca1d973749d88c3683f1ea766d9a61ba5cebac970999ef9da9dfe83e94a08 might help as the stale job detection now backs off when the worker has shown up in the meantime.
So I suppose the issue can be considered resolved.