action #165866
closedJobs stuck waiting on background tasks git_clone size:M
Added by livdywan 3 months ago. Updated 2 months ago.
0%
Description
Observation¶
https://openqa.suse.de/tests/15275211 is stuck waiting for background tasks (id: 38298990, name: git_clone)
whilst the parallel job is running https://openqa.suse.de/tests/15275212#live but not progressing (but was meanwhile fixed by deleting the gru job).
The minion job says it finished successfuly.
https://openqa.suse.de/tests/15264257 similarly got stuck in scheduled waiting for background tasks (id: 38286481, name: git_clone)
(but was meanwhile cancelled).
Acceptance Criteria¶
- AC1: Errors from git_clone tasks are reflected in the journal and/or minion task
- AC2: Errors in git_clone tasks either cause incompletes or be retried
Suggestions¶
- Something stops gru tasks being deleted even though they are finished with no error.
- The journal doesn't seem to contain any relevant errors for
openqa-gru
.- Check
xzgrep 'Gru job error' /var/log/openqa_gru*
- Check
- Relevant database query:
select (select job_id from gru_dependencies where gru_dependencies.gru_task_id = gru_tasks.id) as job_id from gru_tasks where taskname = 'git_clone';
Updated by livdywan 3 months ago
https://openqa.suse.de/tests/15275077
State: scheduled, waiting for background tasks (id: 38298622, name: git_clone), created about 4 hours ago
Updated by tinita 3 months ago · Edited
openqa=> select gd.job_id, gt.id, gt.taskname, gt.args, gt.run_at from gru_tasks gt join gru_dependencies gd on gt.id=gd.gru_task_id where taskname='git_clone' order by run_at desc;
job_id | id | taskname | args | run_at
----------+----------+-----------+--------------------------------------------------------------------------------------------------------------------------------------------------+---------------------
15275455 | 38299421 | git_clone | [{"\/var\/lib\/openqa\/share\/tests\/sle":"https:\/\/github.com\/tjyrinki\/os-autoinst-distri-opensuse.git#poo-165608"}] | 2024-08-27 12:44:34
15275091 | 38298636 | git_clone | [{"\/var\/lib\/openqa\/share\/tests\/sle":"https:\/\/github.com\/mdoucha\/os-autoinst-distri-opensuse.git#force_livepatch"}] | 2024-08-27 09:15:58
15275087 | 38298632 | git_clone | [{"\/var\/lib\/openqa\/share\/tests\/sle":"https:\/\/github.com\/mdoucha\/os-autoinst-distri-opensuse.git#force_livepatch"}] | 2024-08-27 09:15:58
15275085 | 38298630 | git_clone | [{"\/var\/lib\/openqa\/share\/tests\/sle":"https:\/\/github.com\/mdoucha\/os-autoinst-distri-opensuse.git#force_livepatch"}] | 2024-08-27 09:15:58
15275083 | 38298628 | git_clone | [{"\/var\/lib\/openqa\/share\/tests\/sle":"https:\/\/github.com\/mdoucha\/os-autoinst-distri-opensuse.git#force_livepatch"}] | 2024-08-27 09:15:58
15275089 | 38298634 | git_clone | [{"\/var\/lib\/openqa\/share\/tests\/sle":"https:\/\/github.com\/mdoucha\/os-autoinst-distri-opensuse.git#force_livepatch"}] | 2024-08-27 09:15:58
15275093 | 38298638 | git_clone | [{"\/var\/lib\/openqa\/share\/tests\/sle":"https:\/\/github.com\/mdoucha\/os-autoinst-distri-opensuse.git#force_livepatch"}] | 2024-08-27 09:15:58
15275095 | 38298640 | git_clone | [{"\/var\/lib\/openqa\/share\/tests\/sle":"https:\/\/github.com\/mdoucha\/os-autoinst-distri-opensuse.git#force_livepatch"}] | 2024-08-27 09:15:58
15275053 | 38298598 | git_clone | [{"\/var\/lib\/openqa\/share\/tests\/sle":"https:\/\/github.com\/mdoucha\/os-autoinst-distri-opensuse.git#force_livepatch"}] | 2024-08-27 09:15:57
15275055 | 38298600 | git_clone | [{"\/var\/lib\/openqa\/share\/tests\/sle":"https:\/\/github.com\/mdoucha\/os-autoinst-distri-opensuse.git#force_livepatch"}] | 2024-08-27 09:15:57
15275058 | 38298603 | git_clone | [{"\/var\/lib\/openqa\/share\/tests\/sle":"https:\/\/github.com\/mdoucha\/os-autoinst-distri-opensuse.git#force_livepatch"}] | 2024-08-27 09:15:57
15275060 | 38298605 | git_clone | [{"\/var\/lib\/openqa\/share\/tests\/sle":"https:\/\/github.com\/mdoucha\/os-autoinst-distri-opensuse.git#force_livepatch"}] | 2024-08-27 09:15:57
15275062 | 38298607 | git_clone | [{"\/var\/lib\/openqa\/share\/tests\/sle":"https:\/\/github.com\/mdoucha\/os-autoinst-distri-opensuse.git#force_livepatch"}] | 2024-08-27 09:15:57
15275064 | 38298609 | git_clone | [{"\/var\/lib\/openqa\/share\/tests\/sle":"https:\/\/github.com\/mdoucha\/os-autoinst-distri-opensuse.git#force_livepatch"}] | 2024-08-27 09:15:57
15275066 | 38298611 | git_clone | [{"\/var\/lib\/openqa\/share\/tests\/sle":"https:\/\/github.com\/mdoucha\/os-autoinst-distri-opensuse.git#force_livepatch"}] | 2024-08-27 09:15:57
15275068 | 38298613 | git_clone | [{"\/var\/lib\/openqa\/share\/tests\/sle":"https:\/\/github.com\/mdoucha\/os-autoinst-distri-opensuse.git#force_livepatch"}] | 2024-08-27 09:15:57
15275070 | 38298615 | git_clone | [{"\/var\/lib\/openqa\/share\/tests\/sle":"https:\/\/github.com\/mdoucha\/os-autoinst-distri-opensuse.git#force_livepatch"}] | 2024-08-27 09:15:57
15275071 | 38298616 | git_clone | [{"\/var\/lib\/openqa\/share\/tests\/sle":"https:\/\/github.com\/mdoucha\/os-autoinst-distri-opensuse.git#force_livepatch"}] | 2024-08-27 09:15:57
15275072 | 38298617 | git_clone | [{"\/var\/lib\/openqa\/share\/tests\/sle":"https:\/\/github.com\/mdoucha\/os-autoinst-distri-opensuse.git#force_livepatch"}] | 2024-08-27 09:15:57
15275074 | 38298619 | git_clone | [{"\/var\/lib\/openqa\/share\/tests\/sle":"https:\/\/github.com\/mdoucha\/os-autoinst-distri-opensuse.git#force_livepatch"}] | 2024-08-27 09:15:57
15275076 | 38298621 | git_clone | [{"\/var\/lib\/openqa\/share\/tests\/sle":"https:\/\/github.com\/mdoucha\/os-autoinst-distri-opensuse.git#force_livepatch"}] | 2024-08-27 09:15:57
15275077 | 38298622 | git_clone | [{"\/var\/lib\/openqa\/share\/tests\/sle":"https:\/\/github.com\/mdoucha\/os-autoinst-distri-opensuse.git#force_livepatch"}] | 2024-08-27 09:15:57
15275079 | 38298624 | git_clone | [{"\/var\/lib\/openqa\/share\/tests\/sle":"https:\/\/github.com\/mdoucha\/os-autoinst-distri-opensuse.git#force_livepatch"}] | 2024-08-27 09:15:57
15275081 | 38298626 | git_clone | [{"\/var\/lib\/openqa\/share\/tests\/sle":"https:\/\/github.com\/mdoucha\/os-autoinst-distri-opensuse.git#force_livepatch"}] | 2024-08-27 09:15:57
15275057 | 38298602 | git_clone | [{"\/var\/lib\/openqa\/share\/tests\/sle":"https:\/\/github.com\/mdoucha\/os-autoinst-distri-opensuse.git#force_livepatch"}] | 2024-08-27 09:15:57
15275040 | 38298585 | git_clone | [{"\/var\/lib\/openqa\/share\/tests\/sle":"https:\/\/github.com\/mdoucha\/os-autoinst-distri-opensuse.git#force_livepatch"}] | 2024-08-27 09:15:57
15275042 | 38298587 | git_clone | [{"\/var\/lib\/openqa\/share\/tests\/sle":"https:\/\/github.com\/mdoucha\/os-autoinst-distri-opensuse.git#force_livepatch"}] | 2024-08-27 09:15:57
15275044 | 38298589 | git_clone | [{"\/var\/lib\/openqa\/share\/tests\/sle":"https:\/\/github.com\/mdoucha\/os-autoinst-distri-opensuse.git#force_livepatch"}] | 2024-08-27 09:15:57
15275046 | 38298591 | git_clone | [{"\/var\/lib\/openqa\/share\/tests\/sle":"https:\/\/github.com\/mdoucha\/os-autoinst-distri-opensuse.git#force_livepatch"}] | 2024-08-27 09:15:57
15275049 | 38298594 | git_clone | [{"\/var\/lib\/openqa\/share\/tests\/sle":"https:\/\/github.com\/mdoucha\/os-autoinst-distri-opensuse.git#force_livepatch"}] | 2024-08-27 09:15:57
15275051 | 38298596 | git_clone | [{"\/var\/lib\/openqa\/share\/tests\/sle":"https:\/\/github.com\/mdoucha\/os-autoinst-distri-opensuse.git#force_livepatch"}] | 2024-08-27 09:15:57
15275052 | 38298597 | git_clone | [{"\/var\/lib\/openqa\/share\/tests\/sle":"https:\/\/github.com\/mdoucha\/os-autoinst-distri-opensuse.git#force_livepatch"}] | 2024-08-27 09:15:57
15275036 | 38298581 | git_clone | [{"\/var\/lib\/openqa\/share\/tests\/sle":"https:\/\/github.com\/mdoucha\/os-autoinst-distri-opensuse.git#force_livepatch"}] | 2024-08-27 09:15:56
15275034 | 38298579 | git_clone | [{"\/var\/lib\/openqa\/share\/tests\/sle":"https:\/\/github.com\/mdoucha\/os-autoinst-distri-opensuse.git#force_livepatch"}] | 2024-08-27 09:15:56
15275039 | 38298584 | git_clone | [{"\/var\/lib\/openqa\/share\/tests\/sle":"https:\/\/github.com\/mdoucha\/os-autoinst-distri-opensuse.git#force_livepatch"}] | 2024-08-27 09:15:56
15275038 | 38298583 | git_clone | [{"\/var\/lib\/openqa\/share\/tests\/sle":"https:\/\/github.com\/mdoucha\/os-autoinst-distri-opensuse.git#force_livepatch"}] | 2024-08-27 09:15:56
15275025 | 38298553 | git_clone | [{"\/var\/lib\/openqa\/share\/tests\/sle":"https:\/\/github.com\/tjyrinki\/os-autoinst-distri-opensuse.git#poo-165608"}] | 2024-08-27 09:11:57
15274758 | 38297995 | git_clone | [{"\/var\/lib\/openqa\/share\/tests\/sle":"https:\/\/github.com\/tjyrinki\/os-autoinst-distri-opensuse.git#poo-165608"}] | 2024-08-27 07:26:30
15274687 | 38297801 | git_clone | [{"\/var\/lib\/openqa\/share\/tests\/sle":"https:\/\/github.com\/lilyeyes\/os-autoinst-distri-opensuse.git#enhance-sap_suse_cluster_connector"}] | 2024-08-27 07:03:26
15274655 | 38297756 | git_clone | [{"\/var\/lib\/openqa\/share\/tests\/sle":"https:\/\/github.com\/lilyeyes\/os-autoinst-distri-opensuse.git#enhance-sap_suse_cluster_connector"}] | 2024-08-27 06:55:15
15274656 | 38297757 | git_clone | [{"\/var\/lib\/openqa\/share\/tests\/sle":"https:\/\/github.com\/lilyeyes\/os-autoinst-distri-opensuse.git#enhance-sap_suse_cluster_connector"}] | 2024-08-27 06:55:15
15274567 | 38297543 | git_clone | [{"\/var\/lib\/openqa\/share\/tests\/sle":"https:\/\/github.com\/lilyeyes\/os-autoinst-distri-opensuse.git#enhance-sap_suse_cluster_connector"}] | 2024-08-27 06:38:00
15274566 | 38297542 | git_clone | [{"\/var\/lib\/openqa\/share\/tests\/sle":"https:\/\/github.com\/lilyeyes\/os-autoinst-distri-opensuse.git#enhance-sap_suse_cluster_connector"}] | 2024-08-27 06:38:00
15274015 | 38296669 | git_clone | [{"\/var\/lib\/openqa\/share\/tests\/sle":"https:\/\/github.com\/lilyeyes\/os-autoinst-distri-opensuse.git#enhance-sap_suse_cluster_connector"}] | 2024-08-27 03:51:37
15274014 | 38296668 | git_clone | [{"\/var\/lib\/openqa\/share\/tests\/sle":"https:\/\/github.com\/lilyeyes\/os-autoinst-distri-opensuse.git#enhance-sap_suse_cluster_connector"}] | 2024-08-27 03:51:36
15274013 | 38296667 | git_clone | [{"\/var\/lib\/openqa\/share\/tests\/sle":"https:\/\/github.com\/lilyeyes\/os-autoinst-distri-opensuse.git#enhance-sap_suse_cluster_connector"}] | 2024-08-27 03:51:36
15274011 | 38296661 | git_clone | [{"\/var\/lib\/openqa\/share\/tests\/sle":"https:\/\/github.com\/lilyeyes\/os-autoinst-distri-opensuse.git#enhance-sap_suse_cluster_connector"}] | 2024-08-27 03:50:13
15274010 | 38296660 | git_clone | [{"\/var\/lib\/openqa\/share\/tests\/sle":"https:\/\/github.com\/lilyeyes\/os-autoinst-distri-opensuse.git#enhance-sap_suse_cluster_connector"}] | 2024-08-27 03:50:13
15274007 | 38296653 | git_clone | [{"\/var\/lib\/openqa\/share\/tests\/sle":"https:\/\/github.com\/lilyeyes\/os-autoinst-distri-opensuse.git#enhance-sap_suse_cluster_connector"}] | 2024-08-27 03:47:46
15274008 | 38296654 | git_clone | [{"\/var\/lib\/openqa\/share\/tests\/sle":"https:\/\/github.com\/lilyeyes\/os-autoinst-distri-opensuse.git#enhance-sap_suse_cluster_connector"}] | 2024-08-27 03:47:46
15274004 | 38296650 | git_clone | [{"\/var\/lib\/openqa\/share\/tests\/sle":"https:\/\/github.com\/lilyeyes\/os-autoinst-distri-opensuse.git#enhance-sap_suse_cluster_connector"}] | 2024-08-27 03:47:04
15274005 | 38296651 | git_clone | [{"\/var\/lib\/openqa\/share\/tests\/sle":"https:\/\/github.com\/lilyeyes\/os-autoinst-distri-opensuse.git#enhance-sap_suse_cluster_connector"}] | 2024-08-27 03:47:04
15273955 | 38296005 | git_clone | [{"\/var\/lib\/openqa\/share\/tests\/sle":"https:\/\/github.com\/lilyeyes\/os-autoinst-distri-opensuse.git#enhance-sap_suse_cluster_connector"}] | 2024-08-27 01:21:46
15273954 | 38296004 | git_clone | [{"\/var\/lib\/openqa\/share\/tests\/sle":"https:\/\/github.com\/lilyeyes\/os-autoinst-distri-opensuse.git#enhance-sap_suse_cluster_connector"}] | 2024-08-27 01:21:46
15273953 | 38296003 | git_clone | [{"\/var\/lib\/openqa\/share\/tests\/sle":"https:\/\/github.com\/lilyeyes\/os-autoinst-distri-opensuse.git#enhance-sap_suse_cluster_connector"}] | 2024-08-27 01:21:45
15264257 | 38286481 | git_clone | [{"\/var\/lib\/openqa\/share\/tests\/sle":"https:\/\/github.com\/lilyeyes\/os-autoinst-distri-opensuse.git#enhance-sap_suse_cluster_connector"}] | 2024-08-26 09:56:44
15264259 | 38286483 | git_clone | [{"\/var\/lib\/openqa\/share\/tests\/sle":"https:\/\/github.com\/lilyeyes\/os-autoinst-distri-opensuse.git#enhance-sap_suse_cluster_connector"}] | 2024-08-26 09:56:44
15264258 | 38286482 | git_clone | [{"\/var\/lib\/openqa\/share\/tests\/sle":"https:\/\/github.com\/lilyeyes\/os-autoinst-distri-opensuse.git#enhance-sap_suse_cluster_connector"}] | 2024-08-26 09:56:44
(58 rows)
Interestingly, the first timestamp is 2024-08-26 09:56:44, so either this didn't happen before, or we some cleanup those tasks at some point.
It only seems to be about git_clone jobs that are about forks.
We do see these in the openqa_gru logfile (this is from a reproducing job I created):
[2024-08-27T14:44:34.931696+02:00] [warn] [#12588972] Local checkout at /var/lib/openqa/share/tests/sle has origin https://github.com/os-autoinst/os-autoinst-distri-opensuse.git but requesting to clone from https://github.com/tjyrinki/os-autoinst-distri-opensuse.git
So the minion job is running and doing what it is supposed to do.
https://openqa.suse.de/tests/15275455
Updated by tinita 3 months ago
We do have two leftovers on o3 as well:
openqa=> select gd.job_id, gt.id, gt.taskname, gt.args, gt.run_at from gru_tasks gt join gru_dependencies gd on gt.id=gd.gru_task_id where taskname='git_clone' order by run_at desc;
job_id | id | taskname | args
| run_at
---------+----------+-----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------+---------------------
4423323 | 20497721 | git_clone | [{"\/var\/lib\/openqa\/share\/tests\/openqa":"https:\/\/github.com\/os-autoinst\/os-autoinst-distri-openQA.git","\/var\/lib\/openqa\/share\/tests\/openqa\/products\/openqa\/needles":"https:\/\/github.com\/os-autoinst\/
os-autoinst-needles-openQA"}] | 2024-08-22 13:28:11
4398081 | 20467406 | git_clone | [{"\/var\/lib\/openqa\/share\/tests\/opensuse":"https:\/\/github.com\/os-autoinst\/os-autoinst-distri-opensuse.git#c0780d9a1617c6f5c3446888cd2a0dc9505aa897"}]
| 2024-08-12 09:45:12
(2 rows)
https://openqa.opensuse.org/tests/4423323
https://openqa.opensuse.org/tests/4398081
Updated by tinita 3 months ago
- Status changed from New to In Progress
- Assignee set to tinita
So I think it's not only about forks; it's just that git_clone currently only runs if CASEDIR is set to a url, and that's very often a fork.
I scheduled a job on o3 and it ran fine: https://openqa.opensuse.org/tests/4438083
Updated by tinita 3 months ago
I deleted 69 hanging gru_tasks on osd.
openqa=> delete from gru_tasks gt where taskname='git_clone' and run_at < '2024-08-27 14:00:00' ;
DELETE 69
Marius and I think it could be a race condition.
It only seems to happen for jobs with dependencies, which make a race condition more likely, as two jobs are trying to be scheduled at almost the same time.
I will turn off git_auto_clone on osd.
We thought about adding more logging to OpenQA::Shared::GruJob, because we can't even reproduce this on o3, and very less likely on a local developer instance.
Updated by tinita 3 months ago · Edited
according to journalctl --system
on osd:
Aug 26 07:16:34 openqa [RPM][24227]: install openQA-4.6.1724422555.bca3b3b4-lp155.7112.1.x86_64: success
Aug 26 07:16:50 openqa [RPM][24227]: install openQA-4.6.1724422555.bca3b3b4-lp155.7112.1.x86_64: success
Aug 26 11:33:48 openqa [RPM][10046]: install openQA-4.6.1724663185.4de94ebc-lp155.7113.1.x86_64: success
Aug 26 11:35:19 openqa [RPM][10046]: install openQA-4.6.1724663185.4de94ebc-lp155.7113.1.x86_64: success
Aug 27 07:26:26 openqa [RPM][20656]: install openQA-4.6.1724734919.d0b91cb5-lp155.7126.1.x86_64: success
Aug 27 07:26:34 openqa [RPM][20656]: install openQA-4.6.1724734919.d0b91cb5-lp155.7126.1.x86_64: success
It happened first at 2024-08-26 09:56:44
, and since this is a db timestamp, it is UTC and would be 11:56 CEST.
So I assume the installation at 11:33:48
or 11:35:19
is the culprit - not sure why we have it two times for every update (something to investigate?).
The only commit in between is my git command "improvement":
https://github.com/os-autoinst/openQA/commit/a5cfb1daee95ca037fb01e3b7b1a19f273ea0197
Looking at all evidence, this must be the reason, I just don't see how yet.
It could be related to the fact that in the cases we saw the gru job finishes very fast. I just needs to check the remote origin url (and doesn't need to do a network request); it sees the url differs from the requested url and returns. This could make race conditions more likely.
Note that we at least had two cases in the o3 database from 2024-08-22 and 2024-08-12, both before that commit was merged.
Updated by tinita 3 months ago
I find this a bit suspicious:
https://github.com/os-autoinst/openQA/blob/master/lib/OpenQA/Shared/GruJob.pm#L32
# Avoid a possible race condition where the task retries the job and it gets
# picked up by a new worker before we reach this line (by checking the
# "finish" return value)
elsif ($state eq 'active') { $self->_delete_gru($gru_id) if $self->finish('Job successfully executed') }
Updated by tinita 3 months ago · Edited
\o/
I was able to reproduce it locally by reducing the _git_clone_all
function to just print a debugging line, thus making it very fast, and cloning the ping server/client test 10 times.
Nothing suspicious found yet. I need to add more logging.
Updated by tinita 3 months ago
Sometimes $gru
is undef here:
https://github.com/os-autoinst/openQA/blob/master/lib/OpenQA/Shared/GruJob.pm#L44
sub _delete_gru {
my ($self, $id) = @_;
my $gru = $self->minion->app->schema->resultset('GruTasks')->find($id);
$gru->delete() if $gru;
}
Updated by openqa_review 3 months ago
- Due date set to 2024-09-11
Setting due date based on mean cycle time of SUSE QE Tools
Updated by tinita 3 months ago
- Related to action #159171: Create and maintain up to date version of test distri/needles for webui size:M added
Updated by ybonatakis 3 months ago
- Subject changed from Jobs stuck waiting on background tasks git_clone to Jobs stuck waiting on background tasks git_clone size:M
Updated by tinita 3 months ago
Just a little refactoring: https://github.com/os-autoinst/openQA/pull/5902
Moving the enqueuing of the minion job after the transaction looks hard because the creation of the GruTasks entry in the database and the minion enqueue happens in the same method:
https://github.com/os-autoinst/openQA/blob/master/lib/OpenQA/Shared/Plugin/Gru.pm#L130-L151
We would have to save the parametes instead and return all the way up to the code where the transaction is created.
Haven't got a good idea yet how to solve that.
Updated by tinita 3 months ago
I'm currently implementing the idea to check for the GruTask entry in OpenQA::Shared::GruJob and call retry if it's not there yet.
That's easy, but I'm struggling with testing this because in the test the same schema instance is used by both code paths, so they are in the same transaction and the GruTask can be seen by GruJob.
Updated by tinita 3 months ago
Draft: https://github.com/os-autoinst/openQA/pull/5905 Require GruTask in database before running a minion job
Updated by tinita 3 months ago
- Status changed from In Progress to Feedback
Undrafted: https://github.com/os-autoinst/openQA/pull/5905 Require GruTask in database before running a minion job