action #177048
closedAggregates only on 12-SP5 x86_64 randomly don't get scheduled because of deadlock size:M
0%
Description
Lately sometimes 12-SP5 x86_64 are not scheduled, when I restart the schedule there is no problem.
At the time of scheduling there is some deadlock, very strange this happens only on 12-SP5 x86_64
Is there some conflict between Server-DVD-Updates-LTSS-ES & Server-DVD-Updates ?
https://openqa.suse.de/admin/productlog?id=2689877 BUILD=20250211-1
https://openqa.suse.de/admin/productlog?id=2691469 BUILD=20250212-1
{
"failed_job_info": [
{
"error_message": "OpenQA::Schema::Result::Jobs::register_assets_from_settings(): DBI Exception: DBD::Pg::st execute failed: ERROR: deadlock detected\nDETAIL: Process 8553 waits for ShareLock on transaction 2490096299; blocked by process 21248.\nProcess 21248 waits for ShareLock on transaction 2490096379; blocked by process 8553.\nHINT: See server log for query details.\nCONTEXT: while inserting index tuple (795,49) in relation \"assets\" [for Statement \"INSERT INTO assets (type, name, t_created, t_updated)\n VALUES (?, ?, now(), now())\n ON CONFLICT DO NOTHING RETURNING id\n\" with ParamValues: 1='hdd', 2='SLES-12-SP5-x86_64-mru-install-desktop-with-addons-Build20250211-1.qcow2'] at /usr/share/openqa/script/../lib/OpenQA/Schema/ResultSet/Jobs.pm line 205\n",
"job_name": "qam-smt-server"
},
{
"error_messages": [
"DBIx::Class::Storage::DBI::_dbh_execute(): DBI Exception: DBD::Pg::st execute failed: ERROR: insert or update on table \"job_dependencies\" violates foreign key constraint \"job_dependencies_fk_child_job_id\"\nDETAIL: Key (child_job_id)=(16732338) is not present in table \"jobs\". [for Statement \"INSERT INTO job_dependencies ( child_job_id, dependency, parent_job_id) VALUES ( ?, ?, ? )\" with ParamValues: 1='16732338', 2='1', 3='16732331'] at /usr/share/openqa/script/../lib/OpenQA/Schema/Result/ScheduledProducts.pm line 703\n"
],
"job_id": 16732301
},
{
"error_messages": [
"DBIx::Class::Storage::DBI::_dbh_execute(): DBI Exception: DBD::Pg::st execute failed: ERROR: insert or update on table \"job_dependencies\" violates foreign key constraint \"job_dependencies_fk_child_job_id\"\nDETAIL: Key (child_job_id)=(16732338) is not present in table \"jobs\". [for Statement \"INSERT INTO job_dependencies ( child_job_id, dependency, parent_job_id) VALUES ( ?, ?, ? )\" with ParamValues: 1='16732338', 2='1', 3='16732331'] at /usr/share/openqa/script/../lib/OpenQA/Schema/Result/ScheduledProducts.pm line 703\n"
],
"job_id": 16732307
},
{
"error_messages": [
"DBIx::Class::Storage::DBI::_dbh_execute(): DBI Exception: DBD::Pg::st execute failed: ERROR: insert or update on table \"job_dependencies\" violates foreign key constraint \"job_dependencies_fk_child_job_id\"\nDETAIL: Key (child_job_id)=(16732338) is not present in table \"jobs\". [for Statement \"INSERT INTO job_dependencies ( child_job_id, dependency, parent_job_id) VALUES ( ?, ?, ? )\" with ParamValues: 1='16732338', 2='1', 3='16732331'] at /usr/share/openqa/script/../lib/OpenQA/Schema/Result/ScheduledProducts.pm line 703\n"
],
"job_id": 16732311
},
{
"error_messages": [
"DBIx::Class::Storage::DBI::_dbh_execute(): DBI Exception: DBD::Pg::st execute failed: ERROR: insert or update on table \"job_dependencies\" violates foreign key constraint \"job_dependencies_fk_child_job_id\"\nDETAIL: Key (child_job_id)=(16732338) is not present in table \"jobs\". [for Statement \"INSERT INTO job_dependencies ( child_job_id, dependency, parent_job_id) VALUES ( ?, ?, ? )\" with ParamValues: 1='16732338', 2='1', 3='16732331'] at /usr/share/openqa/script/../lib/OpenQA/Schema/Result/ScheduledProducts.pm line 703\n"
],
"job_id": 16732319
},
{
"error_messages": [
"DBIx::Class::Storage::DBI::_dbh_execute(): DBI Exception: DBD::Pg::st execute failed: ERROR: insert or update on table \"job_dependencies\" violates foreign key constraint \"job_dependencies_fk_child_job_id\"\nDETAIL: Key (child_job_id)=(16732338) is not present in table \"jobs\". [for Statement \"INSERT INTO job_dependencies ( child_job_id, dependency, parent_job_id) VALUES ( ?, ?, ? )\" with ParamValues: 1='16732338', 2='1', 3='16732331'] at /usr/share/openqa/script/../lib/OpenQA/Schema/Result/ScheduledProducts.pm line 703\n"
],
"job_id": 16732326
},
{
"error_messages": [
"DBIx::Class::Storage::DBI::_dbh_execute(): DBI Exception: DBD::Pg::st execute failed: ERROR: insert or update on table \"job_dependencies\" violates foreign key constraint \"job_dependencies_fk_child_job_id\"\nDETAIL: Key (child_job_id)=(16732338) is not present in table \"jobs\". [for Statement \"INSERT INTO job_dependencies ( child_job_id, dependency, parent_job_id) VALUES ( ?, ?, ? )\" with ParamValues: 1='16732338', 2='1', 3='16732331'] at /usr/share/openqa/script/../lib/OpenQA/Schema/Result/ScheduledProducts.pm line 703\n"
],
"job_id": 16732331
}
],
"notes": [
"Transaction failed: DBIx::Class::Storage::DBI::_dbh_execute(): DBI Exception: DBD::Pg::st execute failed: ERROR: insert or update on table \"job_dependencies\" violates foreign key constraint \"job_dependencies_fk_child_job_id\"\nDETAIL: Key (child_job_id)=(16732338) is not present in table \"jobs\". [for Statement \"INSERT INTO job_dependencies ( child_job_id, dependency, parent_job_id) VALUES ( ?, ?, ? )\" with ParamValues: 1='16732338', 2='1', 3='16732331'] at /usr/share/openqa/script/../lib/OpenQA/Schema/Result/ScheduledProducts.pm line 703\n"
],
"successful_job_ids": []
}
Acceptance criteria¶
- AC1: Multiple products can be scheduled reliably at the same time
Suggestions¶
- It looks like jobs are not created because scheduling fails within a transaction
- This might happen when many products are scheduled at the same time?
- Confirm how often this happens
- Check for similar issues we had in the past and how the solution looked like back then. Maybe something was overlooked or now a similar problem surfaces at a slightly different place.
Updated by tinita about 2 months ago
- Tags set to reactive work
- Category set to Regressions/Crashes
- Target version set to Ready
Updated by livdywan about 2 months ago
- Subject changed from Aggregates only on 12-SP5 x86_64 randomly don't get scheduled because of deadlock to Aggregates only on 12-SP5 x86_64 randomly don't get scheduled because of deadlock size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by mkittler about 2 months ago
We recently did one change that caused more things going on in database transactions (but when restarting, not when initially creating jobs): #168376
Maybe that's related because there is now a potential for deadlocks between the transaction when creating/scheduling jobs and the transaction when restarting jobs due to additional locks held on Minion-related tables on the restarting transaction.
Updated by mkittler about 2 months ago
- Status changed from Workable to In Progress
Updated by mkittler about 2 months ago
I checked the server logs (in particular /var/lib/pgsql/data/log/postgresql-2025-02-11_000000.log
). There is in fact a mentioning of the problem:
2025-02-11 21:24:53.751 CET openqa geekotest [8553]ERROR: deadlock detected
2025-02-11 21:24:53.751 CET openqa geekotest [8553]DETAIL: Process 8553 waits for ShareLock on transaction 2490096299; blocked by process 21248.
Process 21248 waits for ShareLock on transaction 2490096379; blocked by process 8553.
Process 8553: INSERT INTO assets (type, name, t_created, t_updated)
VALUES ($1, $2, now(), now())
ON CONFLICT DO NOTHING RETURNING id
Process 21248: INSERT INTO assets (type, name, t_created, t_updated)
VALUES ($1, $2, now(), now())
ON CONFLICT DO NOTHING RETURNING id
2025-02-11 21:24:53.751 CET openqa geekotest [8553]HINT: See server log for query details.
2025-02-11 21:24:53.751 CET openqa geekotest [8553]CONTEXT: while inserting index tuple (795,49) in relation "assets"
2025-02-11 21:24:53.751 CET openqa geekotest [8553]STATEMENT: INSERT INTO assets (type, name, t_created, t_updated)
VALUES ($1, $2, now(), now())
ON CONFLICT DO NOTHING RETURNING id
2025-02-11 21:24:57.114 CET openqa geekotest [8553]ERROR: insert or update on table "job_dependencies" violates foreign key constraint "job_dependencies_fk_child_job_id"
2025-02-11 21:24:57.114 CET openqa geekotest [8553]DETAIL: Key (child_job_id)=(16732338) is not present in table "jobs".
2025-02-11 21:24:57.114 CET openqa geekotest [8553]STATEMENT: INSERT INTO job_dependencies ( child_job_id, dependency, parent_job_id) VALUES ( $1, $2, $3 )
This or something similar happened multiple times:
# grep -i deadlock /var/lib/pgsql/data/log/postgresql-2025-02-*.log
/var/lib/pgsql/data/log/postgresql-2025-02-02_000000.log:2025-02-02 21:25:10.106 CET openqa geekotest [29631]ERROR: deadlock detected
/var/lib/pgsql/data/log/postgresql-2025-02-03_000000.log:2025-02-03 21:12:30.233 CET openqa geekotest [14642]ERROR: deadlock detected
/var/lib/pgsql/data/log/postgresql-2025-02-03_000000.log:2025-02-03 21:12:32.752 CET openqa geekotest [25812]ERROR: deadlock detected
/var/lib/pgsql/data/log/postgresql-2025-02-06_000000.log:2025-02-06 21:13:08.483 CET openqa geekotest [5598]ERROR: deadlock detected
/var/lib/pgsql/data/log/postgresql-2025-02-06_000000.log:2025-02-06 21:13:11.897 CET openqa geekotest [9939]ERROR: deadlock detected
/var/lib/pgsql/data/log/postgresql-2025-02-09_000000.log:2025-02-09 21:11:39.094 CET openqa geekotest [20798]ERROR: deadlock detected
/var/lib/pgsql/data/log/postgresql-2025-02-11_000000.log:2025-02-11 21:24:53.751 CET openqa geekotest [8553]ERROR: deadlock detected
/var/lib/pgsql/data/log/postgresql-2025-02-12_000000.log:2025-02-12 21:25:08.259 CET openqa geekotest [10476]ERROR: deadlock detected
The first occurrence is from 2025-02-02. That is not super long ago but before https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1368 was merged and also before https://github.com/os-autoinst/openQA/pull/6048 was merged. The first occurrence looks very similar:
2025-02-02 21:25:10.106 CET openqa geekotest [29631]ERROR: deadlock detected
2025-02-02 21:25:10.106 CET openqa geekotest [29631]DETAIL: Process 29631 waits for ShareLock on transaction 2337143893; blocked by process 24346.
Process 24346 waits for ShareLock on transaction 2337144008; blocked by process 29631.
Process 29631: INSERT INTO assets (type, name, t_created, t_updated)
VALUES ($1, $2, now(), now())
ON CONFLICT DO NOTHING RETURNING id
Process 24346: INSERT INTO assets (type, name, t_created, t_updated)
VALUES ($1, $2, now(), now())
ON CONFLICT DO NOTHING RETURNING id
2025-02-02 21:25:10.106 CET openqa geekotest [29631]HINT: See server log for query details.
2025-02-02 21:25:10.106 CET openqa geekotest [29631]CONTEXT: while inserting index tuple (255,34) in relation "assets"
2025-02-02 21:25:10.106 CET openqa geekotest [29631]STATEMENT: INSERT INTO assets (type, name, t_created, t_updated)
VALUES ($1, $2, now(), now())
ON CONFLICT DO NOTHING RETURNING id
2025-02-02 21:25:12.926 CET openqa geekotest [29631]ERROR: insert or update on table "job_dependencies" violates foreign key constraint "job_dependencies_fk_child_job_id"
2025-02-02 21:25:12.926 CET openqa geekotest [29631]DETAIL: Key (child_job_id)=(16635636) is not present in table "jobs".
2025-02-02 21:25:12.926 CET openqa geekotest [29631]STATEMENT: INSERT INTO job_dependencies ( child_job_id, dependency, parent_job_id) VALUES ( $1, $2, $3 )
As well as the 2nd:
2025-02-03 21:12:30.233 CET openqa geekotest [14642]ERROR: deadlock detected
2025-02-03 21:12:30.233 CET openqa geekotest [14642]DETAIL: Process 14642 waits for ShareLock on transaction 2353126592; blocked by process 25812.
Process 25812 waits for ShareLock on transaction 2353126226; blocked by process 14642.
Process 14642: INSERT INTO assets (type, name, t_created, t_updated)
VALUES ($1, $2, now(), now())
ON CONFLICT DO NOTHING RETURNING id
Process 25812: INSERT INTO assets (type, name, t_created, t_updated)
VALUES ($1, $2, now(), now())
ON CONFLICT DO NOTHING RETURNING id
2025-02-03 21:12:30.233 CET openqa geekotest [14642]HINT: See server log for query details.
2025-02-03 21:12:30.233 CET openqa geekotest [14642]CONTEXT: while inserting index tuple (683,8) in relation "assets"
2025-02-03 21:12:30.233 CET openqa geekotest [14642]STATEMENT: INSERT INTO assets (type, name, t_created, t_updated)
VALUES ($1, $2, now(), now())
ON CONFLICT DO NOTHING RETURNING id
2025-02-03 21:12:32.752 CET openqa geekotest [25812]ERROR: deadlock detected
2025-02-03 21:12:32.752 CET openqa geekotest [25812]DETAIL: Process 25812 waits for ShareLock on transaction 2353127203; blocked by process 14642.
Process 14642 waits for ShareLock on transaction 2353126592; blocked by process 25812.
Process 25812: INSERT INTO assets (type, name, t_created, t_updated)
VALUES ($1, $2, now(), now())
ON CONFLICT DO NOTHING RETURNING id
Process 14642: INSERT INTO assets (type, name, t_created, t_updated)
VALUES ($1, $2, now(), now())
ON CONFLICT DO NOTHING RETURNING id
2025-02-03 21:12:32.752 CET openqa geekotest [25812]HINT: See server log for query details.
2025-02-03 21:12:32.752 CET openqa geekotest [25812]CONTEXT: while inserting index tuple (690,10) in relation "assets"
2025-02-03 21:12:32.752 CET openqa geekotest [25812]STATEMENT: INSERT INTO assets (type, name, t_created, t_updated)
VALUES ($1, $2, now(), now())
ON CONFLICT DO NOTHING RETURNING id
2025-02-03 21:12:32.823 CET openqa geekotest [14642]ERROR: insert or update on table "job_dependencies" violates foreign key constraint "job_dependencies_fk_child_job_id"
2025-02-03 21:12:32.823 CET openqa geekotest [14642]DETAIL: Key (child_job_id)=(16644741) is not present in table "jobs".
2025-02-03 21:12:32.823 CET openqa geekotest [14642]STATEMENT: INSERT INTO job_dependencies ( child_job_id, dependency, parent_job_id) VALUES ( $1, $2, $3 )
2025-02-03 21:12:34.246 CET openqa geekotest [25812]ERROR: insert or update on table "job_dependencies" violates foreign key constraint "job_dependencies_fk_child_job_id"
2025-02-03 21:12:34.246 CET openqa geekotest [25812]DETAIL: Key (child_job_id)=(16644952) is not present in table "jobs".
2025-02-03 21:12:34.246 CET openqa geekotest [25812]STATEMENT: INSERT INTO job_dependencies ( child_job_id, dependency, parent_job_id) VALUES ( $1, $2, $3 )
So I think they're all about the same underlying issue. And because started to happen before https://github.com/os-autoinst/openQA/pull/6048 was merged it is also unrelated. Probably a change in job dependencies or assets is the surfacing this problem which we likely have for a long time.
Updated by openqa_review about 2 months ago
- Due date set to 2025-02-28
Setting due date based on mean cycle time of SUSE QE Tools
Updated by mkittler about 2 months ago
I checked the code and we have the following setup:
- A top-level txn for the whole scheduling is created.
- We create the jobs one after another in a loop. For this we create a nested txn introduced by b03aa8d6c6 in 2017. This is also where we register assets and where the first error is generated. The nested txn is not committed due to the error but the exception is caught.
- We continue creating dependencies based on the jobs we created. This generates additional errors because not all jobs are present. I'm not sure why that is because we only try to create dependencies for jobs that could have been created in the previous step.
Maybe the nested txn we create in 2. is problematic when being rolled back and actually causes all changes from the top-level txn to be rolled back. This would explain the missing job IDs in step 3. The source of the error is definitely the asset creation anyway, and we even know the specific asset to blame (even though it might be different ones in different cases):
while inserting index tuple (795,49) in relation \"assets\" [for Statement \"INSERT INTO assets (type, name, t_created, t_updated)\n VALUES (?, ?, now(), now())\n ON CONFLICT DO NOTHING RETURNING id\n\" with ParamValues: 1='hdd', 2='SLES-12-SP5-x86_64-mru-install-desktop-with-addons-Build20250211-1.qcow2'] at /usr/share/openqa/script/../lib/OpenQA/Schema/ResultSet/Jobs.pm line 205\n"
Updated by mkittler about 2 months ago · Edited
The most recent occurrence is:
2025-02-13 21:21:53.160 CET openqa geekotest [9954]ERROR: deadlock detected
2025-02-13 21:21:53.160 CET openqa geekotest [9954]DETAIL: Process 9954 waits for ShareLock on transaction 2537902483; blocked by process 8998.
Process 8998 waits for ShareLock on transaction 2537902718; blocked by process 9954.
Process 9954: INSERT INTO assets (type, name, t_created, t_updated)
VALUES ($1, $2, now(), now())
ON CONFLICT DO NOTHING RETURNING id
Process 8998: INSERT INTO assets (type, name, t_created, t_updated)
VALUES ($1, $2, now(), now())
ON CONFLICT DO NOTHING RETURNING id
2025-02-13 21:21:53.160 CET openqa geekotest [9954]HINT: See server log for query details.
2025-02-13 21:21:53.160 CET openqa geekotest [9954]CONTEXT: while inserting index tuple (1623,6) in relation "assets"
2025-02-13 21:21:53.160 CET openqa geekotest [9954]STATEMENT: INSERT INTO assets (type, name, t_created, t_updated)
VALUES ($1, $2, now(), now())
ON CONFLICT DO NOTHING RETURNING id
2025-02-13 21:21:56.488 CET openqa geekotest [9954]ERROR: insert or update on table "job_dependencies" violates foreign key constraint "job_dependencies_fk_child_job_id"
2025-02-13 21:21:56.488 CET openqa geekotest [9954]DETAIL: Key (child_job_id)=(16765742) is not present in table "jobs".
2025-02-13 21:21:56.488 CET openqa geekotest [9954]STATEMENT: INSERT INTO job_dependencies ( child_job_id, dependency, parent_job_id) VALUES ( $1, $2, $3 )
Unfortunately none of the processes are running anymore so we don't know to what services they correspond to.
In other cases the PIDs drifted apart much more. So maybe the process holding the ShareLock is a gru task (which are running always as a new processes and thus might have an high process ID).
According to the documentation a ShareLock is acquired only by CREATE INDEX
. Not sure what this would correspond to in our code using DBIx except the deployment code.
It is always about the query from register_assets_from_settings
. We call that in:
- create_from_settings -> _create_job/_create_jobs_in_database -> _create_jobs/_schedule_iso -> creating a single set of jobs or scheduling a product
- prepare_for_work -> ws_send -> scheduler loop
- _create_clones -> duplicate -> auto_duplicate -> several API functions
- reevaluate_children_asset_settings -> register (asset) -> asset uploads and scans
I tried to trace locks via https://github.com/jnidzwetzki/pg-lock-tracer but it would require a custom build of PostgreSQL so I'm now simply putting a sleep in our code and then check what locks are held via select relation::regclass, * from pg_locks;
.
After the register_assets_from_settings
call in 1. we have the following locks registered:
openqa-local=> select relation::regclass, * from pg_locks;
relation | locktype | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction | pid | mode | granted | fastpath
------------------------------------+---------------+----------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+-------+------------------+---------+----------
pg_locks | relation | 413008 | 12143 | | | | | | | | 5/2101 | 35899 | AccessShareLock | t | t
| virtualxid | | | | | 5/2101 | | | | | 5/2101 | 35899 | ExclusiveLock | t | t
jobs_id_seq | relation | 413008 | 17553 | | | | | | | | 3/7001 | 36084 | RowExclusiveLock | t | t
jobs_pkey | relation | 413008 | 349401 | | | | | | | | 3/7001 | 36084 | RowShareLock | t | t
idx_jobs_scenario | relation | 413008 | 17584 | | | | | | | | 3/7001 | 36084 | RowShareLock | t | t
idx_jobs_result | relation | 413008 | 17582 | | | | | | | | 3/7001 | 36084 | RowShareLock | t | t
idx_jobs_state | relation | 413008 | 17581 | | | | | | | | 3/7001 | 36084 | RowShareLock | t | t
job_settings_id_seq | relation | 413008 | 17292 | | | | | | | | 3/7001 | 36084 | RowExclusiveLock | t | t
job_settings_idx_job_id | relation | 413008 | 349384 | | | | | | | | 3/7001 | 36084 | AccessShareLock | t | t
idx_job_id_value_settings | relation | 413008 | 349383 | | | | | | | | 3/7001 | 36084 | AccessShareLock | t | t
job_settings_pkey | relation | 413008 | 349371 | | | | | | | | 3/7001 | 36084 | AccessShareLock | t | t
idx_value_settings | relation | 413008 | 17304 | | | | | | | | 3/7001 | 36084 | AccessShareLock | t | t
job_settings | relation | 413008 | 17294 | | | | | | | | 3/7001 | 36084 | AccessShareLock | t | t
job_settings | relation | 413008 | 17294 | | | | | | | | 3/7001 | 36084 | RowExclusiveLock | t | t
jobs | relation | 413008 | 17555 | | | | | | | | 3/7001 | 36084 | RowShareLock | t | t
jobs | relation | 413008 | 17555 | | | | | | | | 3/7001 | 36084 | RowExclusiveLock | t | t
job_groups_idx_parent_id | relation | 413008 | 350081 | | | | | | | | 3/7001 | 36084 | AccessShareLock | t | t
job_groups_name_parent_id | relation | 413008 | 350079 | | | | | | | | 3/7001 | 36084 | AccessShareLock | t | t
job_groups_pkey | relation | 413008 | 350053 | | | | | | | | 3/7001 | 36084 | AccessShareLock | t | t
job_groups | relation | 413008 | 17498 | | | | | | | | 3/7001 | 36084 | AccessShareLock | t | t
| virtualxid | | | | | 3/7001 | | | | | 3/7001 | 36084 | ExclusiveLock | t | t
idx_job_dependencies_dependency | relation | 413008 | 17607 | | | | | | | | 3/7001 | 36084 | AccessShareLock | t | f
assets_idx_last_use_job_id | relation | 413008 | 349248 | | | | | | | | 3/7001 | 36084 | AccessShareLock | t | f
assets_idx_last_use_job_id | relation | 413008 | 349248 | | | | | | | | 3/7001 | 36084 | RowShareLock | t | f
jobs_idx_scheduled_product_id | relation | 413008 | 350246 | | | | | | | | 3/7001 | 36084 | RowShareLock | t | f
scheduled_products | relation | 413008 | 51673 | | | | | | | | 3/7001 | 36084 | RowShareLock | t | f
job_dependencies_idx_child_job_id | relation | 413008 | 349308 | | | | | | | | 3/7001 | 36084 | AccessShareLock | t | f
scheduled_products_idx_gru_task_id | relation | 413008 | 350449 | | | | | | | | 3/7001 | 36084 | RowShareLock | t | f
| transactionid | | | | | | 13958313 | | | | 3/7001 | 36084 | ExclusiveLock | t | f
idx_jobs_build_group | relation | 413008 | 350223 | | | | | | | | 3/7001 | 36084 | RowShareLock | t | f
jobs_idx_blocked_by_id | relation | 413008 | 349516 | | | | | | | | 3/7001 | 36084 | RowShareLock | t | f
assets_id_seq | relation | 413008 | 17585 | | | | | | | | 3/7001 | 36084 | RowExclusiveLock | t | f
job_dependencies | relation | 413008 | 17600 | | | | | | | | 3/7001 | 36084 | AccessShareLock | t | f
jobs_idx_group_id | relation | 413008 | 350224 | | | | | | | | 3/7001 | 36084 | RowShareLock | t | f
scheduled_products_idx_user_id | relation | 413008 | 350434 | | | | | | | | 3/7001 | 36084 | RowShareLock | t | f
jobs_idx_clone_id | relation | 413008 | 349494 | | | | | | | | 3/7001 | 36084 | RowShareLock | t | f
job_groups_name_parent_id | relation | 413008 | 350079 | | | | | | | | 3/7001 | 36084 | RowShareLock | t | f
scheduled_products_idx_webhook_id | relation | 413008 | 414420 | | | | | | | | 3/7001 | 36084 | RowShareLock | t | f
scheduled_products_pkey | relation | 413008 | 350418 | | | | | | | | 3/7001 | 36084 | RowShareLock | t | f
| transactionid | | | | | | 13958312 | | | | 3/7001 | 36084 | ExclusiveLock | t | f
assets | relation | 413008 | 17587 | | | | | | | | 3/7001 | 36084 | AccessShareLock | t | f
assets | relation | 413008 | 17587 | | | | | | | | 3/7001 | 36084 | RowShareLock | t | f
assets | relation | 413008 | 17587 | | | | | | | | 3/7001 | 36084 | RowExclusiveLock | t | f
assets_pkey | relation | 413008 | 349232 | | | | | | | | 3/7001 | 36084 | AccessShareLock | t | f
assets_pkey | relation | 413008 | 349232 | | | | | | | | 3/7001 | 36084 | RowShareLock | t | f
jobs_assets | relation | 413008 | 17648 | | | | | | | | 3/7001 | 36084 | RowExclusiveLock | t | f
job_dependencies_idx_parent_job_id | relation | 413008 | 349323 | | | | | | | | 3/7001 | 36084 | AccessShareLock | t | f
job_groups_pkey | relation | 413008 | 350053 | | | | | | | | 3/7001 | 36084 | RowShareLock | t | f
job_groups_idx_parent_id | relation | 413008 | 350081 | | | | | | | | 3/7001 | 36084 | RowShareLock | t | f
jobs_idx_assigned_worker_id | relation | 413008 | 349538 | | | | | | | | 3/7001 | 36084 | RowShareLock | t | f
job_dependencies_pkey | relation | 413008 | 349321 | | | | | | | | 3/7001 | 36084 | AccessShareLock | t | f
assets_type_name | relation | 413008 | 17597 | | | | | | | | 3/7001 | 36084 | AccessShareLock | t | f
assets_type_name | relation | 413008 | 17597 | | | | | | | | 3/7001 | 36084 | RowShareLock | t | f
job_groups | relation | 413008 | 17498 | | | | | | | | 3/7001 | 36084 | RowShareLock | t | f
(54 Zeilen)
In 2.:
openqa-local=> select relation::regclass, * from pg_locks;
relation | locktype | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction | pid | mode | granted | fastpath
------------------------------------+------------+----------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+-------+------------------+---------+----------
pg_locks | relation | 413008 | 12143 | | | | | | | | 5/2118 | 35899 | AccessShareLock | t | t
| virtualxid | | | | | 5/2118 | | | | | 5/2118 | 35899 | ExclusiveLock | t | t
jobs_assets | relation | 413008 | 17648 | | | | | | | | 8/114 | 37982 | RowExclusiveLock | t | t
assets_id_seq | relation | 413008 | 17585 | | | | | | | | 8/114 | 37982 | RowExclusiveLock | t | t
assets_idx_last_use_job_id | relation | 413008 | 349248 | | | | | | | | 8/114 | 37982 | AccessShareLock | t | t
assets_pkey | relation | 413008 | 349232 | | | | | | | | 8/114 | 37982 | AccessShareLock | t | t
assets_type_name | relation | 413008 | 17597 | | | | | | | | 8/114 | 37982 | AccessShareLock | t | t
assets | relation | 413008 | 17587 | | | | | | | | 8/114 | 37982 | AccessShareLock | t | t
assets | relation | 413008 | 17587 | | | | | | | | 8/114 | 37982 | RowExclusiveLock | t | t
job_dependencies_idx_parent_job_id | relation | 413008 | 349323 | | | | | | | | 8/114 | 37982 | AccessShareLock | t | t
job_dependencies_pkey | relation | 413008 | 349321 | | | | | | | | 8/114 | 37982 | AccessShareLock | t | t
job_dependencies_idx_child_job_id | relation | 413008 | 349308 | | | | | | | | 8/114 | 37982 | AccessShareLock | t | t
idx_job_dependencies_dependency | relation | 413008 | 17607 | | | | | | | | 8/114 | 37982 | AccessShareLock | t | t
job_dependencies | relation | 413008 | 17600 | | | | | | | | 8/114 | 37982 | AccessShareLock | t | t
| virtualxid | | | | | 8/114 | | | | | 8/114 | 37982 | ExclusiveLock | t | t
(15 Zeilen)
In 3.:
openqa-local=> select relation::regclass, * from pg_locks;
relation | locktype | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction | pid | mode | granted | fastpath
------------------------------------+---------------+----------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+-------+------------------+---------+----------
pg_locks | relation | 413008 | 12143 | | | | | | | | 5/2112 | 35899 | AccessShareLock | t | t
| virtualxid | | | | | 5/2112 | | | | | 5/2112 | 35899 | ExclusiveLock | t | t
jobs_idx_group_id | relation | 413008 | 350224 | | | | | | | | 3/7814 | 37278 | RowShareLock | t | t
idx_jobs_build_group | relation | 413008 | 350223 | | | | | | | | 3/7814 | 37278 | RowShareLock | t | t
jobs_idx_assigned_worker_id | relation | 413008 | 349538 | | | | | | | | 3/7814 | 37278 | RowShareLock | t | t
jobs_idx_blocked_by_id | relation | 413008 | 349516 | | | | | | | | 3/7814 | 37278 | RowShareLock | t | t
jobs_idx_clone_id | relation | 413008 | 349494 | | | | | | | | 3/7814 | 37278 | RowShareLock | t | t
idx_jobs_scenario | relation | 413008 | 17584 | | | | | | | | 3/7814 | 37278 | RowShareLock | t | t
idx_jobs_result | relation | 413008 | 17582 | | | | | | | | 3/7814 | 37278 | RowShareLock | t | t
idx_jobs_state | relation | 413008 | 17581 | | | | | | | | 3/7814 | 37278 | RowShareLock | t | t
job_settings_id_seq | relation | 413008 | 17292 | | | | | | | | 3/7814 | 37278 | RowExclusiveLock | t | t
job_settings_idx_job_id | relation | 413008 | 349384 | | | | | | | | 3/7814 | 37278 | AccessShareLock | t | t
idx_job_id_value_settings | relation | 413008 | 349383 | | | | | | | | 3/7814 | 37278 | AccessShareLock | t | t
job_settings_pkey | relation | 413008 | 349371 | | | | | | | | 3/7814 | 37278 | AccessShareLock | t | t
idx_value_settings | relation | 413008 | 17304 | | | | | | | | 3/7814 | 37278 | AccessShareLock | t | t
job_settings | relation | 413008 | 17294 | | | | | | | | 3/7814 | 37278 | AccessShareLock | t | t
job_settings | relation | 413008 | 17294 | | | | | | | | 3/7814 | 37278 | RowExclusiveLock | t | t
jobs_pkey | relation | 413008 | 349401 | | | | | | | | 3/7814 | 37278 | AccessShareLock | t | t
jobs_pkey | relation | 413008 | 349401 | | | | | | | | 3/7814 | 37278 | RowShareLock | t | t
jobs | relation | 413008 | 17555 | | | | | | | | 3/7814 | 37278 | AccessShareLock | t | t
jobs | relation | 413008 | 17555 | | | | | | | | 3/7814 | 37278 | RowShareLock | t | t
jobs | relation | 413008 | 17555 | | | | | | | | 3/7814 | 37278 | RowExclusiveLock | t | t
| virtualxid | | | | | 3/7814 | | | | | 3/7814 | 37278 | ExclusiveLock | t | t
idx_job_dependencies_dependency | relation | 413008 | 17607 | | | | | | | | 3/7814 | 37278 | AccessShareLock | t | f
assets_idx_last_use_job_id | relation | 413008 | 349248 | | | | | | | | 3/7814 | 37278 | AccessShareLock | t | f
assets_idx_last_use_job_id | relation | 413008 | 349248 | | | | | | | | 3/7814 | 37278 | RowShareLock | t | f
jobs_idx_scheduled_product_id | relation | 413008 | 350246 | | | | | | | | 3/7814 | 37278 | RowShareLock | t | f
jobs_idx_scheduled_product_id | relation | 413008 | 350246 | | | | | | | | 3/7814 | 37278 | RowExclusiveLock | t | f
idx_jobs_scenario | relation | 413008 | 17584 | | | | | | | | 3/7814 | 37278 | RowExclusiveLock | t | f
job_dependencies_idx_child_job_id | relation | 413008 | 349308 | | | | | | | | 3/7814 | 37278 | AccessShareLock | t | f
idx_jobs_build_group | relation | 413008 | 350223 | | | | | | | | 3/7814 | 37278 | RowExclusiveLock | t | f
jobs_idx_blocked_by_id | relation | 413008 | 349516 | | | | | | | | 3/7814 | 37278 | RowExclusiveLock | t | f
assets_id_seq | relation | 413008 | 17585 | | | | | | | | 3/7814 | 37278 | RowExclusiveLock | t | f
job_dependencies | relation | 413008 | 17600 | | | | | | | | 3/7814 | 37278 | AccessShareLock | t | f
job_dependencies | relation | 413008 | 17600 | | | | | | | | 3/7814 | 37278 | RowExclusiveLock | t | f
jobs_idx_group_id | relation | 413008 | 350224 | | | | | | | | 3/7814 | 37278 | RowExclusiveLock | t | f
idx_jobs_state | relation | 413008 | 17581 | | | | | | | | 3/7814 | 37278 | RowExclusiveLock | t | f
jobs_idx_clone_id | relation | 413008 | 349494 | | | | | | | | 3/7814 | 37278 | RowExclusiveLock | t | f
job_groups_name_parent_id | relation | 413008 | 350079 | | | | | | | | 3/7814 | 37278 | RowShareLock | t | f
assets | relation | 413008 | 17587 | | | | | | | | 3/7814 | 37278 | AccessShareLock | t | f
assets | relation | 413008 | 17587 | | | | | | | | 3/7814 | 37278 | RowShareLock | t | f
assets | relation | 413008 | 17587 | | | | | | | | 3/7814 | 37278 | RowExclusiveLock | t | f
assets_pkey | relation | 413008 | 349232 | | | | | | | | 3/7814 | 37278 | AccessShareLock | t | f
assets_pkey | relation | 413008 | 349232 | | | | | | | | 3/7814 | 37278 | RowShareLock | t | f
jobs_assets | relation | 413008 | 17648 | | | | | | | | 3/7814 | 37278 | RowExclusiveLock | t | f
idx_jobs_result | relation | 413008 | 17582 | | | | | | | | 3/7814 | 37278 | RowExclusiveLock | t | f
job_dependencies_idx_parent_job_id | relation | 413008 | 349323 | | | | | | | | 3/7814 | 37278 | AccessShareLock | t | f
job_groups_pkey | relation | 413008 | 350053 | | | | | | | | 3/7814 | 37278 | RowShareLock | t | f
| transactionid | | | | | | 13958363 | | | | 3/7814 | 37278 | ExclusiveLock | t | f
job_groups_idx_parent_id | relation | 413008 | 350081 | | | | | | | | 3/7814 | 37278 | RowShareLock | t | f
jobs_idx_assigned_worker_id | relation | 413008 | 349538 | | | | | | | | 3/7814 | 37278 | RowExclusiveLock | t | f
jobs_pkey | relation | 413008 | 349401 | | | | | | | | 3/7814 | 37278 | RowExclusiveLock | t | f
job_dependencies_pkey | relation | 413008 | 349321 | | | | | | | | 3/7814 | 37278 | AccessShareLock | t | f
assets_type_name | relation | 413008 | 17597 | | | | | | | | 3/7814 | 37278 | AccessShareLock | t | f
assets_type_name | relation | 413008 | 17597 | | | | | | | | 3/7814 | 37278 | RowShareLock | t | f
jobs_id_seq | relation | 413008 | 17553 | | | | | | | | 3/7814 | 37278 | RowExclusiveLock | t | f
job_groups | relation | 413008 | 17498 | | | | | | | | 3/7814 | 37278 | RowShareLock | t | f
(57 Zeilen)
In 4.:
openqa-local=> select relation::regclass, * from pg_locks;
relation | locktype | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction | pid | mode | granted | fastpath
------------------------------------+---------------+----------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+-------+------------------+---------+----------
pg_locks | relation | 413008 | 12143 | | | | | | | | 5/2129 | 35899 | AccessShareLock | t | t
| virtualxid | | | | | 5/2129 | | | | | 5/2129 | 35899 | ExclusiveLock | t | t
jobs_idx_blocked_by_id | relation | 413008 | 349516 | | | | | | | | 3/18602 | 44258 | RowShareLock | t | t
jobs_idx_clone_id | relation | 413008 | 349494 | | | | | | | | 3/18602 | 44258 | RowShareLock | t | t
jobs_pkey | relation | 413008 | 349401 | | | | | | | | 3/18602 | 44258 | RowShareLock | t | t
idx_jobs_scenario | relation | 413008 | 17584 | | | | | | | | 3/18602 | 44258 | RowShareLock | t | t
idx_jobs_result | relation | 413008 | 17582 | | | | | | | | 3/18602 | 44258 | RowShareLock | t | t
idx_jobs_state | relation | 413008 | 17581 | | | | | | | | 3/18602 | 44258 | RowShareLock | t | t
jobs | relation | 413008 | 17555 | | | | | | | | 3/18602 | 44258 | RowShareLock | t | t
job_dependencies_idx_child_job_id | relation | 413008 | 349308 | | | | | | | | 3/18602 | 44258 | AccessShareLock | t | t
idx_job_dependencies_dependency | relation | 413008 | 17607 | | | | | | | | 3/18602 | 44258 | AccessShareLock | t | t
job_dependencies | relation | 413008 | 17600 | | | | | | | | 3/18602 | 44258 | AccessShareLock | t | t
jobs_assets | relation | 413008 | 17648 | | | | | | | | 3/18602 | 44258 | RowExclusiveLock | t | t
assets_id_seq | relation | 413008 | 17585 | | | | | | | | 3/18602 | 44258 | RowExclusiveLock | t | t
assets_idx_last_use_job_id | relation | 413008 | 349248 | | | | | | | | 3/18602 | 44258 | RowShareLock | t | t
assets_pkey | relation | 413008 | 349232 | | | | | | | | 3/18602 | 44258 | RowShareLock | t | t
assets_type_name | relation | 413008 | 17597 | | | | | | | | 3/18602 | 44258 | RowShareLock | t | t
assets | relation | 413008 | 17587 | | | | | | | | 3/18602 | 44258 | RowShareLock | t | t
assets | relation | 413008 | 17587 | | | | | | | | 3/18602 | 44258 | RowExclusiveLock | t | t
| virtualxid | | | | | 3/18602 | | | | | 3/18602 | 44258 | ExclusiveLock | t | t
jobs_idx_scheduled_product_id | relation | 413008 | 350246 | | | | | | | | 3/18602 | 44258 | RowShareLock | t | f
idx_jobs_build_group | relation | 413008 | 350223 | | | | | | | | 3/18602 | 44258 | RowShareLock | t | f
jobs_idx_group_id | relation | 413008 | 350224 | | | | | | | | 3/18602 | 44258 | RowShareLock | t | f
job_dependencies_idx_parent_job_id | relation | 413008 | 349323 | | | | | | | | 3/18602 | 44258 | AccessShareLock | t | f
jobs_idx_assigned_worker_id | relation | 413008 | 349538 | | | | | | | | 3/18602 | 44258 | RowShareLock | t | f
job_dependencies_pkey | relation | 413008 | 349321 | | | | | | | | 3/18602 | 44258 | AccessShareLock | t | f
| transactionid | | | | | | 13959720 | | | | 3/18602 | 44258 | ExclusiveLock | t | f
(27 Zeilen)
All of these places require a similar set of locks. I think these tables show the locks in the order they were acquired. That means the order is always a bit different which can of course lead to deadlocks.
Updated by mkittler about 2 months ago · Edited
I was able to write a (failing) unit test that provokes this situation: https://github.com/os-autoinst/openQA/compare/master...Martchus:openQA:improve-job-creation
The unit tests can provoke the situation via job creation (case 1. in my previous comment) and via general asset registration (case 4. in my previous comment). We don't do the latter in a txn so it is probably not relevant. The same counts most likely for the other cases mentioned in my previous comment. So I would focus on job creation now.
We could resolve this by creating jobs always in the same order and then also their assets. Not sure how easy this is, though. Otherwise we could simply retry the transaction.
Note that in my unit test I got the exact same error message also mentioning ShareLock
. Not sure why https://www.postgresql.org/docs/current/explicit-locking.html only lists this lock mode for CREATE INDEX
which is definitely not used at this point in the unit test. Maybe the error message is also just a bit unprecise when mentioning the lock mode.
Updated by mkittler about 1 month ago
- Status changed from In Progress to Feedback
Updated by mkittler about 1 month ago · Edited
The PR has been merged but hasn't been deployed yet.
I've just checked grep -i deadlock /var/lib/pgsql/data/log/postgresql-2025-02-*.log
and now the most recent occurrence is /var/lib/pgsql/data/log/postgresql-2025-02-13_000000.log:2025-02-13 21:21:53.160 CET openqa geekotest [9954]ERROR: deadlock detected.
. That means the problem hasn't happened for almost a week.
So I'm not sure whether it makes sense to keep this PR open to see whether the problem has disappeared in production when it isn't happening often/consistently anyway. The PR contains a test to provoke the problem (hopefully reproducing the same underlying problem we saw in production) so I hope it will fix the problem.
Updated by mkittler about 1 month ago
- Status changed from Feedback to Resolved
There were two more deadlocks yesterday:
2025-02-24 21:10:31.025 CET [1654]LOG: checkpoint complete: wrote 82 buffers (0.5%); 0 WAL file(s) added, 0 removed, 1 recycled; write=9.522 s, sync=0.058 s, total=9.587 s; sync files=170, longest=0.002 s, average=0.001 s; distance=22420 kB, estimate=36091 kB; lsn=1A9E/E09F3A80, redo lsn=1A9E/E07768F0
2025-02-24 21:14:40.641 CET openqa geekotest [16625]ERROR: deadlock detected
2025-02-24 21:14:40.641 CET openqa geekotest [16625]DETAIL: Process 16625 waits for ShareLock on transaction 2701480135; blocked by process 26356.
Process 26356 waits for ShareLock on transaction 2701479809; blocked by process 16625.
Process 16625: INSERT INTO assets (type, name, t_created, t_updated)
VALUES ($1, $2, now(), now())
ON CONFLICT DO NOTHING RETURNING id
Process 26356: INSERT INTO assets (type, name, t_created, t_updated)
VALUES ($1, $2, now(), now())
ON CONFLICT DO NOTHING RETURNING id
2025-02-24 21:14:40.641 CET openqa geekotest [16625]HINT: See server log for query details.
2025-02-24 21:14:40.641 CET openqa geekotest [16625]CONTEXT: while inserting index tuple (1749,42) in relation "assets"
2025-02-24 21:14:40.641 CET openqa geekotest [16625]STATEMENT: INSERT INTO assets (type, name, t_created, t_updated)
VALUES ($1, $2, now(), now())
ON CONFLICT DO NOTHING RETURNING id
2025-02-24 21:14:42.241 CET openqa geekotest [16625]ERROR: deadlock detected
2025-02-24 21:14:42.241 CET openqa geekotest [16625]DETAIL: Process 16625 waits for ShareLock on transaction 2701480135; blocked by process 26356.
Process 26356 waits for ShareLock on transaction 2701481320; blocked by process 16625.
Process 16625: INSERT INTO assets (type, name, t_created, t_updated)
VALUES ($1, $2, now(), now())
ON CONFLICT DO NOTHING RETURNING id
Process 26356: INSERT INTO assets (type, name, t_created, t_updated)
VALUES ($1, $2, now(), now())
ON CONFLICT DO NOTHING RETURNING id
2025-02-24 21:14:42.241 CET openqa geekotest [16625]HINT: See server log for query details.
2025-02-24 21:14:42.241 CET openqa geekotest [16625]CONTEXT: while inserting index tuple (1749,52) in relation "assets"
2025-02-24 21:14:42.241 CET openqa geekotest [16625]STATEMENT: INSERT INTO assets (type, name, t_created, t_updated)
VALUES ($1, $2, now(), now())
ON CONFLICT DO NOTHING RETURNING id
2025-02-24 21:15:21.307 CET [1654]LOG: checkpoint starting: time
That means my change worked, so despite the deadlock there is no further failing scheduled product.
The most recent scheduled product failing due to a deadlock is from the 13.:
openqa=> select id, t_updated from scheduled_products where CAST (results AS text) ilike '%deadlock%' order by t_updated desc limit 25;
id | t_updated
---------+---------------------
2693126 | 2025-02-13 20:21:56
2691469 | 2025-02-12 20:25:11
2689877 | 2025-02-11 20:24:57
2686126 | 2025-02-09 20:11:41
2682605 | 2025-02-06 20:13:13
2682604 | 2025-02-06 20:13:11
2677143 | 2025-02-03 20:12:34
2677142 | 2025-02-03 20:12:32
2675937 | 2025-02-02 20:25:12
2669282 | 2025-01-27 20:20:48
2666889 | 2025-01-24 20:21:53
2663827 | 2025-01-22 20:28:47
2662568 | 2025-01-21 20:24:44
2661273 | 2025-01-20 20:22:59
2661272 | 2025-01-20 20:22:54
2659876 | 2025-01-19 20:14:45
2659873 | 2025-01-19 20:14:43
2657906 | 2025-01-17 20:14:41
2656427 | 2025-01-16 20:13:51
2654537 | 2025-01-15 20:13:51
2648966 | 2025-01-12 20:14:44
2646800 | 2025-01-10 20:14:27
2645239 | 2025-01-09 20:10:19
2645241 | 2025-01-09 20:10:16
2643768 | 2025-01-08 20:15:38
(25 rows)
Updated by livdywan about 1 month ago
mkittler wrote in #note-15:
There were two more deadlocks yesterday:
[...]
That means my change worked, so despite the deadlock there is no further failing scheduled product.
The most recent scheduled product failing due to a deadlock is from the 13.:
Nice! My first thought seeing that was oh no, is it getting even worse. Awesome to see it work so well in production.