Project

General

Profile

Actions

action #120891

closed

Product not scheduled: DBD::Pg::st execute failed: ERROR: deadlock detected size:M

Added by dimstar about 2 years ago. Updated almost 2 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2022-11-23
Due date:
2023-01-20
% Done:

0%

Estimated time:

Description

Motivation

Last night, Tumbleweed snapshot 20221122 was synced over to openQA and scheduled for testing.
Normally, the full test suite is 313 tests long; for 1122, only 311 were scheduled though

I managed to track it down to the NET/i586 test not having scheduled, which was attempted in
https://openqa.opensuse.org/admin/productlog?id=283823

The error reported was

{
"failed_job_info": [
{
"error_message": "DBIx::Class::Storage::DBI::_dbh_execute(): DBI Exception: DBD::Pg::st execute failed: ERROR: deadlock detected\nDETAIL: Process 3080 waits for ShareLock on transaction 166511492; blocked by process 3132.\nProcess 3132 waits for ShareLock on transaction 166511322; blocked by process 3080.\nHINT: See server log for query details.\nCONTEXT: while updating tuple (144,69) in relation \"assets\" [for Statement \"UPDATE assets SET size = ?, t_updated = ? WHERE id = ?\" with ParamValues: 1='499788508', 2='2022-11-23 01:09:38', 3='66765244'] at /usr/share/openqa/script/../lib/OpenQA/Schema/Result/Assets.pm line 147\n",
"job_name": "install_only"
},
{
"error_message": "DBIx::Class::Storage::DBI::_dbh_execute(): DBI Exception: DBD::Pg::st execute failed: ERROR: current transaction is aborted, commands ignored until end of transaction block [for Statement \"SELECT me.id, me.name, me.parent_id, me.size_limit_gb, me.exclusively_kept_asset_size, me.keep_logs_in_days, me.keep_important_logs_in_days, me.keep_results_in_days, me.keep_important_results_in_days, me.default_priority, me.sort_order, me.description, me.template, me.build_version_sort, me.carry_over_bugrefs, me.t_created, me.t_updated FROM job_groups me WHERE ( me.id = ? )\" with ParamValues: 1='1'] at /usr/share/openqa/script/../lib/OpenQA/Schema/Result/Jobs.pm line 576\n",
"job_name": "install_only"
}
],
"successful_job_ids": []
}

Acceptance criteria

  • AC1: Jobs are scheduled reliably

Suggestions

  • Clarify why the "unrelated" operations interfere
  • Try and avoid updating the asset size too often
  • Identify all places that lock the asset table
  • Read PostgreSQL documentation, e.g. https://www.postgresql.org/docs/current/explicit-locking.html
  • We could add a re-try around the transaction, which by design is atomic. It's not a "proper solution" but would release all locks as well before the next transaction is attempted.
  • Maybe we can ensure there's only one place that updates the asset size e.g. a dedicated job that any users can trigger as needed
    • There's https://github.com/os-autoinst/openQA/pull/4963 but we're still seeing cases on o3 so maybe that fix wasn't complete
    • Maybe the my $asset = $assets->find_or_create($asset_info, {for => 'update'}); using or_create/for update is simply the problem because it always "updates"

Out of scope

  • Introduce another internal state to distinguish the sate of a not really running but should be scheduled job that can be reset back

Related issues 1 (0 open1 closed)

Related to openQA Project (public) - action #121777: PostgreSQL update ordering deadlock on jobs tableResolvedkraih2022-12-09

Actions
Actions #1

Updated by mkittler about 2 years ago

The first query likely corresponds to the update via DBIx in OpenQA::Schema::Result::Assets::refresh_size which is invoked when tracking/cleaning assets. The second query points to OpenQA::Schema::Result::Jobs::extract_group_args_from_settings which is invoked when scheduling the product (and thus happening within an explicit transaction but using the default isolation level). So asset tracking and product scheduling are interfering with each other.

After re-scheduling the product the jobs were created as expected. The deadlock likely only occurs in a very specific situation (the asset tracking needs to be at a certain point and the scheduling needs to be at a certain point). That makes it hard to reproduce.

Actions #2

Updated by mkittler about 2 years ago

  • Assignee set to mkittler
Actions #3

Updated by mkittler about 2 years ago

Looks like this isn't the only time when a deadlock occurred:

ariel:/var/lib/pgsql/data/log # grep deadlock *.log
postgresql-2022-11-09_000000.log:2022-11-09 16:55:50.979 UTC openqa geekotest [8975]ERROR:  deadlock detected
postgresql-2022-11-14_000000.log:2022-11-14 17:14:51.019 UTC openqa geekotest [31118]ERROR:  deadlock detected
postgresql-2022-11-19_000000.log:2022-11-19 05:21:07.088 UTC openqa geekotest [16976]ERROR:  deadlock detected
postgresql-2022-11-20_000000.log:2022-11-20 01:59:09.753 UTC openqa geekotest [6320]ERROR:  deadlock detected
postgresql-2022-11-23_000000.log:2022-11-23 00:40:32.766 UTC openqa geekotest [26448]ERROR:  deadlock detected
postgresql-2022-11-23_000000.log:2022-11-23 00:40:33.770 UTC openqa geekotest [26494]ERROR:  deadlock detected
postgresql-2022-11-23_000000.log:2022-11-23 01:09:39.771 UTC openqa geekotest [3080]ERROR:  deadlock detected

The other deadlocks are also either about reading a row from job groups or about updating the size of an asset.

The log looks like this:

2022-11-23 00:40:32.766 UTC openqa geekotest [26448]ERROR:  deadlock detected
2022-11-23 00:40:32.766 UTC openqa geekotest [26448]DETAIL:  Process 26448 waits for ShareLock on transaction 166381112; blocked by process 26494.
        Process 26494 waits for ExclusiveLock on tuple (135,64) of relation 16400 of database 16401; blocked by process 26466.
        Process 26466 waits for ShareLock on transaction 166381053; blocked by process 26448.
        Process 26448: UPDATE assets SET size = $1, t_updated = $2 WHERE id = $3
        Process 26494: UPDATE assets SET size = $1, t_updated = $2 WHERE id = $3
        Process 26466: UPDATE assets SET size = $1, t_updated = $2 WHERE id = $3
2022-11-23 00:40:32.766 UTC openqa geekotest [26448]HINT:  See server log for query details.
2022-11-23 00:40:32.766 UTC openqa geekotest [26448]CONTEXT:  while updating tuple (135,67) in relation "assets"
2022-11-23 00:40:32.766 UTC openqa geekotest [26448]STATEMENT:  UPDATE assets SET size = $1, t_updated = $2 WHERE id = $3
2022-11-23 00:40:32.795 UTC openqa geekotest [26448]ERROR:  current transaction is aborted, commands ignored until end of transaction block
2022-11-23 00:40:32.795 UTC openqa geekotest [26448]STATEMENT:  SELECT me.id, me.name, me.parent_id, me.size_limit_gb, me.exclusively_kept_asset_size, me.keep_logs_in_days, me.keep_important_logs_in_days, me.keep_results_in_days, me.keep_important_results_in_days, me.default_priority, me.sort_order, me.description, me.template, me.build_version_sort, me.carry_over_bugrefs, me.t_created, me.t_updated FROM job_groups me WHERE ( me.id = $1 )
2022-11-23 00:40:32.801 UTC openqa geekotest [26448]ERROR:  current transaction is aborted, commands ignored until end of transaction block
2022-11-23 00:40:32.801 UTC openqa geekotest [26448]STATEMENT:  SELECT me.id, me.name, me.parent_id, me.size_limit_gb, me.exclusively_kept_asset_size, me.keep_logs_in_days, me.keep_important_logs_in_days, me.keep_results_in_days, me.keep_important_results_in_days, me.default_priority, me.sort_order, me.description, me.template, me.build_version_sort, me.carry_over_bugrefs, me.t_created, me.t_updated FROM job_groups me WHERE ( me.id = $1 )
2022-11-23 00:40:32.808 UTC openqa geekotest [26448]ERROR:  current transaction is aborted, commands ignored until end of transaction block
2022-11-23 00:40:32.808 UTC openqa geekotest [26448]STATEMENT:  SELECT me.id, me.name, me.parent_id, me.size_limit_gb, me.exclusively_kept_asset_size, me.keep_logs_in_days, me.keep_important_logs_in_days, me.keep_results_in_days, me.keep_important_results_in_days, me.default_priority, me.sort_order, me.description, me.template, me.build_version_sort, me.carry_over_bugrefs, me.t_created, me.t_updated FROM job_groups me WHERE ( me.id = $1 )
2022-11-23 00:40:32.813 UTC openqa geekotest [26448]ERROR:  current transaction is aborted, commands ignored until end of transaction block
2022-11-23 00:40:32.813 UTC openqa geekotest [26448]STATEMENT:  SELECT me.id, me.name, me.parent_id, me.size_limit_gb, me.exclusively_kept_asset_size, me.keep_logs_in_days, me.keep_important_logs_in_days, me.keep_results_in_days, me.keep_important_results_in_days, me.default_priority, me.sort_order, me.description, me.template, me.build_version_sort, me.carry_over_bugrefs, me.t_created, me.t_updated FROM job_groups me WHERE ( me.id = $1 )
2022-11-23 00:40:32.817 UTC openqa geekotest [26448]ERROR:  current transaction is aborted, commands ignored until end of transaction block
2022-11-23 00:40:32.817 UTC openqa geekotest [26448]STATEMENT:  SELECT me.id, me.name, me.parent_id, me.size_limit_gb, me.exclusively_kept_asset_size, me.keep_logs_in_days, me.keep_important_logs_in_days, me.keep_results_in_days, me.keep_important_results_in_days, me.default_priority, me.sort_order, me.description, me.template, me.build_version_sort, me.carry_over_bugrefs, me.t_created, me.t_updated FROM job_groups me WHERE ( me.id = $1 )
2022-11-23 00:40:32.822 UTC openqa geekotest [26448]ERROR:  current transaction is aborted, commands ignored until end of transaction block
2022-11-23 00:40:32.822 UTC openqa geekotest [26448]STATEMENT:  SELECT me.id, me.name, me.parent_id, me.size_limit_gb, me.exclusively_kept_asset_size, me.keep_logs_in_days, me.keep_important_logs_in_days, me.keep_results_in_days, me.keep_important_results_in_days, me.default_priority, me.sort_order, me.description, me.template, me.build_version_sort, me.carry_over_bugrefs, me.t_created, me.t_updated FROM job_groups me WHERE ( me.id = $1 )
2022-11-23 00:40:32.827 UTC openqa geekotest [26448]ERROR:  current transaction is aborted, commands ignored until end of transaction block
2022-11-23 00:40:32.827 UTC openqa geekotest [26448]STATEMENT:  SELECT me.id, me.name, me.parent_id, me.size_limit_gb, me.exclusively_kept_asset_size, me.keep_logs_in_days, me.keep_important_logs_in_days, me.keep_results_in_days, me.keep_important_results_in_days, me.default_priority, me.sort_order, me.description, me.template, me.build_version_sort, me.carry_over_bugrefs, me.t_created, me.t_updated FROM job_groups me WHERE ( me.id = $1 )
2022-11-23 00:40:32.834 UTC openqa geekotest [26448]ERROR:  current transaction is aborted, commands ignored until end of transaction block
2022-11-23 00:40:32.834 UTC openqa geekotest [26448]STATEMENT:  SELECT me.id, me.name, me.parent_id, me.size_limit_gb, me.exclusively_kept_asset_size, me.keep_logs_in_days, me.keep_important_logs_in_days, me.keep_results_in_days, me.keep_important_results_in_days, me.default_priority, me.sort_order, me.description, me.template, me.build_version_sort, me.carry_over_bugrefs, me.t_created, me.t_updated FROM job_groups me WHERE ( me.id = $1 )
2022-11-23 00:40:32.839 UTC openqa geekotest [26448]ERROR:  current transaction is aborted, commands ignored until end of transaction block
2022-11-23 00:40:32.839 UTC openqa geekotest [26448]STATEMENT:  SELECT me.id, me.name, me.parent_id, me.size_limit_gb, me.exclusively_kept_asset_size, me.keep_logs_in_days, me.keep_important_logs_in_days, me.keep_results_in_days, me.keep_important_results_in_days, me.default_priority, me.sort_order, me.description, me.template, me.build_version_sort, me.carry_over_bugrefs, me.t_created, me.t_updated FROM job_groups me WHERE ( me.id = $1 )
2022-11-23 00:40:32.844 UTC openqa geekotest [26448]ERROR:  current transaction is aborted, commands ignored until end of transaction block
2022-11-23 00:40:32.844 UTC openqa geekotest [26448]STATEMENT:  SELECT me.id, me.name, me.parent_id, me.size_limit_gb, me.exclusively_kept_asset_size, me.keep_logs_in_days, me.keep_important_logs_in_days, me.keep_results_in_days, me.keep_important_results_in_days, me.default_priority, me.sort_order, me.description, me.template, me.build_version_sort, me.carry_over_bugrefs, me.t_created, me.t_updated FROM job_groups me WHERE ( me.id = $1 )
2022-11-23 00:40:32.848 UTC openqa geekotest [26448]ERROR:  current transaction is aborted, commands ignored until end of transaction block
2022-11-23 00:40:32.848 UTC openqa geekotest [26448]STATEMENT:  SELECT me.id, me.name, me.parent_id, me.size_limit_gb, me.exclusively_kept_asset_size, me.keep_logs_in_days, me.keep_important_logs_in_days, me.keep_results_in_days, me.keep_important_results_in_days, me.default_priority, me.sort_order, me.description, me.template, me.build_version_sort, me.carry_over_bugrefs, me.t_created, me.t_updated FROM job_groups me WHERE ( me.id = $1 )
2022-11-23 00:40:32.853 UTC openqa geekotest [26448]ERROR:  current transaction is aborted, commands ignored until end of transaction block
2022-11-23 00:40:32.853 UTC openqa geekotest [26448]STATEMENT:  SELECT me.id, me.name, me.parent_id, me.size_limit_gb, me.exclusively_kept_asset_size, me.keep_logs_in_days, me.keep_important_logs_in_days, me.keep_results_in_days, me.keep_important_results_in_days, me.default_priority, me.sort_order, me.description, me.template, me.build_version_sort, me.carry_over_bugrefs, me.t_created, me.t_updated FROM job_groups me WHERE ( me.id = $1 )
2022-11-23 00:40:32.859 UTC openqa geekotest [26448]ERROR:  current transaction is aborted, commands ignored until end of transaction block
2022-11-23 00:40:32.859 UTC openqa geekotest [26448]STATEMENT:  SELECT me.id, me.name, me.parent_id, me.size_limit_gb, me.exclusively_kept_asset_size, me.keep_logs_in_days, me.keep_important_logs_in_days, me.keep_results_in_days, me.keep_important_results_in_days, me.default_priority, me.sort_order, me.description, me.template, me.build_version_sort, me.carry_over_bugrefs, me.t_created, me.t_updated FROM job_groups me WHERE ( me.id = $1 )
2022-11-23 00:40:32.864 UTC openqa geekotest [26448]ERROR:  current transaction is aborted, commands ignored until end of transaction block
2022-11-23 00:40:32.864 UTC openqa geekotest [26448]STATEMENT:  SELECT me.id, me.name, me.parent_id, me.size_limit_gb, me.exclusively_kept_asset_size, me.keep_logs_in_days, me.keep_important_logs_in_days, me.keep_results_in_days, me.keep_important_results_in_days, me.default_priority, me.sort_order, me.description, me.template, me.build_version_sort, me.carry_over_bugrefs, me.t_created, me.t_updated FROM job_groups me WHERE ( me.id = $1 )
2022-11-23 00:40:32.869 UTC openqa geekotest [26448]ERROR:  current transaction is aborted, commands ignored until end of transaction block
2022-11-23 00:40:32.869 UTC openqa geekotest [26448]STATEMENT:  SELECT me.id, me.name, me.parent_id, me.size_limit_gb, me.exclusively_kept_asset_size, me.keep_logs_in_days, me.keep_important_logs_in_days, me.keep_results_in_days, me.keep_important_results_in_days, me.default_priority, me.sort_order, me.description, me.template, me.build_version_sort, me.carry_over_bugrefs, me.t_created, me.t_updated FROM job_groups me WHERE ( me.id = $1 )
2022-11-23 00:40:32.873 UTC openqa geekotest [26448]ERROR:  current transaction is aborted, commands ignored until end of transaction block
2022-11-23 00:40:32.873 UTC openqa geekotest [26448]STATEMENT:  SELECT me.id, me.name, me.parent_id, me.size_limit_gb, me.exclusively_kept_asset_size, me.keep_logs_in_days, me.keep_important_logs_in_days, me.keep_results_in_days, me.keep_important_results_in_days, me.default_priority, me.sort_order, me.description, me.template, me.build_version_sort, me.carry_over_bugrefs, me.t_created, me.t_updated FROM job_groups me WHERE ( me.id = $1 )
2022-11-23 00:40:32.879 UTC openqa geekotest [26448]ERROR:  current transaction is aborted, commands ignored until end of transaction block
2022-11-23 00:40:32.879 UTC openqa geekotest [26448]STATEMENT:  SELECT me.id, me.name, me.parent_id, me.size_limit_gb, me.exclusively_kept_asset_size, me.keep_logs_in_days, me.keep_important_logs_in_days, me.keep_results_in_days, me.keep_important_results_in_days, me.default_priority, me.sort_order, me.description, me.template, me.build_version_sort, me.carry_over_bugrefs, me.t_created, me.t_updated FROM job_groups me WHERE ( me.id = $1 )
2022-11-23 00:40:32.890 UTC openqa geekotest [26448]ERROR:  current transaction is aborted, commands ignored until end of transaction block
2022-11-23 00:40:32.890 UTC openqa geekotest [26448]STATEMENT:  SELECT me.id, me.name, me.parent_id, me.size_limit_gb, me.exclusively_kept_asset_size, me.keep_logs_in_days, me.keep_important_logs_in_days, me.keep_results_in_days, me.keep_important_results_in_days, me.default_priority, me.sort_order, me.description, me.template, me.build_version_sort, me.carry_over_bugrefs, me.t_created, me.t_updated FROM job_groups me WHERE ( me.id = $1 )
2022-11-23 00:40:32.912 UTC openqa geekotest [26448]ERROR:  current transaction is aborted, commands ignored until end of transaction block
2022-11-23 00:40:32.912 UTC openqa geekotest [26448]STATEMENT:  SELECT me.id, me.name, me.parent_id, me.size_limit_gb, me.exclusively_kept_asset_size, me.keep_logs_in_days, me.keep_important_logs_in_days, me.keep_results_in_days, me.keep_important_results_in_days, me.default_priority, me.sort_order, me.description, me.template, me.build_version_sort, me.carry_over_bugrefs, me.t_created, me.t_updated FROM job_groups me WHERE ( me.id = $1 )
2022-11-23 00:40:32.919 UTC openqa geekotest [26448]ERROR:  current transaction is aborted, commands ignored until end of transaction block
2022-11-23 00:40:32.919 UTC openqa geekotest [26448]STATEMENT:  SELECT me.id, me.name, me.parent_id, me.size_limit_gb, me.exclusively_kept_asset_size, me.keep_logs_in_days, me.keep_important_logs_in_days, me.keep_results_in_days, me.keep_important_results_in_days, me.default_priority, me.sort_order, me.description, me.template, me.build_version_sort, me.carry_over_bugrefs, me.t_created, me.t_updated FROM job_groups me WHERE ( me.id = $1 )
2022-11-23 00:40:32.924 UTC openqa geekotest [26448]ERROR:  current transaction is aborted, commands ignored until end of transaction block
2022-11-23 00:40:32.924 UTC openqa geekotest [26448]STATEMENT:  SELECT me.id, me.name, me.parent_id, me.size_limit_gb, me.exclusively_kept_asset_size, me.keep_logs_in_days, me.keep_important_logs_in_days, me.keep_results_in_days, me.keep_important_results_in_days, me.default_priority, me.sort_order, me.description, me.template, me.build_version_sort, me.carry_over_bugrefs, me.t_created, me.t_updated FROM job_groups me WHERE ( me.id = $1 )
2022-11-23 00:40:32.928 UTC openqa geekotest [26448]ERROR:  current transaction is aborted, commands ignored until end of transaction block
2022-11-23 00:40:32.928 UTC openqa geekotest [26448]STATEMENT:  SELECT me.id, me.name, me.parent_id, me.size_limit_gb, me.exclusively_kept_asset_size, me.keep_logs_in_days, me.keep_important_logs_in_days, me.keep_results_in_days, me.keep_important_results_in_days, me.default_priority, me.sort_order, me.description, me.template, me.build_version_sort, me.carry_over_bugrefs, me.t_created, me.t_updated FROM job_groups me WHERE ( me.id = $1 )
2022-11-23 00:40:32.932 UTC openqa geekotest [26448]ERROR:  current transaction is aborted, commands ignored until end of transaction block
2022-11-23 00:40:32.932 UTC openqa geekotest [26448]STATEMENT:  SELECT me.id, me.name, me.parent_id, me.size_limit_gb, me.exclusively_kept_asset_size, me.keep_logs_in_days, me.keep_important_logs_in_days, me.keep_results_in_days, me.keep_important_results_in_days, me.default_priority, me.sort_order, me.description, me.template, me.build_version_sort, me.carry_over_bugrefs, me.t_created, me.t_updated FROM job_groups me WHERE ( me.id = $1 )
2022-11-23 00:40:32.937 UTC openqa geekotest [26448]ERROR:  current transaction is aborted, commands ignored until end of transaction block
…
2022-11-23 00:40:33.065 UTC openqa geekotest [26448]ERROR:  current transaction is aborted, commands ignored until end of transaction block
2022-11-23 00:40:33.065 UTC openqa geekotest [26448]STATEMENT:  SELECT me.id, me.name, me.parent_id, me.size_limit_gb, me.exclusively_kept_asset_size, me.keep_logs_in_days, me.keep_important_logs_in_days, me.keep_results_in_days, me.keep_important_results_in_days, me.default_priority, me.sort_order, me.description, me.template, me.build_version_sort, me.carry_over_bugrefs, me.t_created, me.t_updated FROM job_groups me WHERE ( me.id = $1 )
2022-11-23 00:40:33.072 UTC openqa geekotest [26448]ERROR:  current transaction is aborted, commands ignored until end of transaction block
2022-11-23 00:40:33.072 UTC openqa geekotest [26448]STATEMENT:  SELECT me.id, me.name, me.parent_id, me.size_limit_gb, me.exclusively_kept_asset_size, me.keep_logs_in_days, me.keep_important_logs_in_days, me.keep_results_in_days, me.keep_important_results_in_days, me.default_priority, me.sort_order, me.description, me.template, me.build_version_sort, me.carry_over_bugrefs, me.t_created, me.t_updated FROM job_groups me WHERE ( me.id = $1 )
2022-11-23 00:40:33.770 UTC openqa geekotest [26494]ERROR:  deadlock detected
2022-11-23 00:40:33.770 UTC openqa geekotest [26494]DETAIL:  Process 26494 waits for ShareLock on transaction 166381074; blocked by process 26466.
        Process 26466 waits for ShareLock on transaction 166381112; blocked by process 26494.
        Process 26494: UPDATE assets SET size = $1, t_updated = $2 WHERE id = $3
        Process 26466: UPDATE assets SET size = $1, t_updated = $2 WHERE id = $3
2022-11-23 00:40:33.770 UTC openqa geekotest [26494]HINT:  See server log for query details.
2022-11-23 00:40:33.770 UTC openqa geekotest [26494]CONTEXT:  while updating tuple (135,64) in relation "assets"
2022-11-23 00:40:33.770 UTC openqa geekotest [26494]STATEMENT:  UPDATE assets SET size = $1, t_updated = $2 WHERE id = $3
2022-11-23 01:09:39.771 UTC openqa geekotest [3080]ERROR:  deadlock detected
2022-11-23 01:09:39.771 UTC openqa geekotest [3080]DETAIL:  Process 3080 waits for ShareLock on transaction 166511492; blocked by process 3132.
        Process 3132 waits for ShareLock on transaction 166511322; blocked by process 3080.
        Process 3080: UPDATE assets SET size = $1, t_updated = $2 WHERE id = $3
        Process 3132: UPDATE assets SET size = $1, t_updated = $2 WHERE id = $3
2022-11-23 01:09:39.771 UTC openqa geekotest [3080]HINT:  See server log for query details.
2022-11-23 01:09:39.771 UTC openqa geekotest [3080]CONTEXT:  while updating tuple (144,69) in relation "assets"
2022-11-23 01:09:39.771 UTC openqa geekotest [3080]STATEMENT:  UPDATE assets SET size = $1, t_updated = $2 WHERE id = $3
2022-11-23 01:09:39.784 UTC openqa geekotest [3080]ERROR:  current transaction is aborted, commands ignored until end of transaction block
2022-11-23 01:09:39.784 UTC openqa geekotest [3080]STATEMENT:  SELECT me.id, me.name, me.parent_id, me.size_limit_gb, me.exclusively_kept_asset_size, me.keep_logs_in_days, me.keep_important_logs_in_days, me.keep_results_in_days, me.keep_important_results_in_days, me.default_priority, me.sort_order, me.description, me.template, me.build_version_sort, me.carry_over_bugrefs, me.t_created, me.t_updated FROM job_groups me WHERE ( me.id = $1 )
2022-11-23 01:09:46.768 UTC openqa geekotest [3188]LOG:  duration: 7578.684 ms  execute <unnamed>:         INSERT INTO assets (type, name, t_created, t_updated)
                            VALUES ($1,    $2,    now(),     now()    ) ON CONFLICT DO NOTHING

2022-11-23 01:09:46.768 UTC openqa geekotest [3188]DETAIL:  parameters: $1 = 'repo', $2 = 'openSUSE-Tumbleweed-oss-i586-x86_64-Snapshot20221122-source'
2022-11-23 01:09:46.768 UTC openqa geekotest [3168]LOG:  duration: 7809.833 ms  execute <unnamed>: UPDATE assets SET size = $1, t_updated = $2 WHERE id = $3
2022-11-23 01:09:46.768 UTC openqa geekotest [3168]DETAIL:  parameters: $1 = '499788508', $2 = '2022-11-23 01:09:38', $3 = '66765244'
Actions #4

Updated by mkittler about 2 years ago

I fail to see how reading job groups would interfere with updating assets. Maybe this is just the query that's executed when the transaction is aborted. So I had a look what else we do in that transaction. It turns out that within the scheduling transaction we run register_assets_from_settings which runs ensure_size which runs ensure_size which eventually also runs refresh_size. So we update the size column of assets while tracking assets and while scheduling products. That could be our deadlock/conflict.

Actions #5

Updated by mkittler about 2 years ago

By the way, when doing

$schema->txn_do(sub {  # explicit transaction via first database connection
        note 'update 1';
        $schema->resultset('Assets')->first->update({size => 50});
        note 'update 2';
        $schema2->resultset('Assets')->first->update({size => 49});  # an update via a different database connection while the explicit transaction has not been committed
        note 'done 1';
    });
    note 'done 2';

that creates a deadlock. I get only the first two lines logged. However, in my tests it gets just stuck without running into the "deadlock detected" error. I just get the following log messages:

2022-11-23 15:27:36.136 CET openqa_test martchus [31691]LOG:  process 31691 still waiting for ShareLock on transaction 996 after 1.052 ms
2022-11-23 15:27:36.136 CET openqa_test martchus [31691]DETAIL:  Process holding the lock: 31688. Wait queue: 31691.
2022-11-23 15:27:36.136 CET openqa_test martchus [31691]CONTEXT:  while updating tuple (0,1) in relation "assets"
2022-11-23 15:27:36.136 CET openqa_test martchus [31691]STATEMENT:  UPDATE assets SET size = $1, t_updated = $2 WHERE ( id = $3 )
2022-11-23 15:27:46.128 CET openqa_test martchus [31691]LOG:  process 31691 still waiting for ShareLock on transaction 996 after 9993.372 ms
2022-11-23 15:27:46.128 CET openqa_test martchus [31691]DETAIL:  Process holding the lock: 31688. Wait queue: 31691.
2022-11-23 15:27:46.128 CET openqa_test martchus [31691]CONTEXT:  while updating tuple (0,1) in relation "assets"
2022-11-23 15:27:46.128 CET openqa_test martchus [31691]STATEMENT:  UPDATE assets SET size = $1, t_updated = $2 WHERE ( id = $3 )

EDIT: That's actually not a good example. If I start the 2nd update in a sub process and wait in the main process until it terminated it works just fine.

Actions #6

Updated by okurz about 2 years ago

  • Category set to Regressions/Crashes
  • Priority changed from Normal to High
  • Target version set to Ready
Actions #7

Updated by livdywan about 2 years ago

  • Subject changed from Product not scheduled: DBD::Pg::st execute failed: ERROR: deadlock detected to Product not scheduled: DBD::Pg::st execute failed: ERROR: deadlock detected size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #8

Updated by kraih about 2 years ago

PostgreSQL docs about transaction isolation that should be relevant here: https://www.postgresql.org/docs/current/transaction-iso.html

Actions #9

Updated by mkittler about 2 years ago

  • Description updated (diff)
  • Category deleted (Regressions/Crashes)
  • Target version deleted (Ready)
  • The explicit transaction we're starting when scheduling a job is using the default isolation level which is according to the documentation mentioned by @kraih "Read Committed Isolation Level". The updates when scanning assets are not using an explicit transaction.
  • The error message is about a ShareLock but according to https://www.postgresql.org/docs/current/explicit-locking.html an UPDATE actually uses a RowExclusiveLock. The ShareLock mentioned in the error message would only be created acquired by CREATE INDEX (and is indeed in conflict with ROW EXCLUSIVE). That raises the question why the ShareLock is acquired because we surely don't use CREATE INDEX after the initial deployment.
  • In theory one option would be to avoid the asset size update when scheduling jobs. However, that would break the check for missing assets when restarting jobs (see https://github.com/os-autoinst/openQA/commit/50b986e957a4f2c30d94c8499d43700746ae6c42). Besides, it is still not completely clear whether these two updates are actually really the culprit.
Actions #10

Updated by mkittler about 2 years ago

  • Category set to Regressions/Crashes
  • Target version set to Ready
Actions #11

Updated by mkittler about 2 years ago

  • Description updated (diff)
Actions #12

Updated by mkittler about 2 years ago

  • Assignee deleted (mkittler)

I'm unassigning because I have currently enough other hard tickets assigned to myself. It would be great if someone else in the team could have a look as well.

Actions #13

Updated by kraih about 2 years ago

  • Assignee set to kraih

Since i said i would take a look.

Actions #14

Updated by kraih about 2 years ago

  • Status changed from Workable to In Progress
Actions #15

Updated by kraih about 2 years ago

As expected this is a tricky one. I do have suspicions though, the deadlock itself is undoubtedly on this line. And the root cause will have something to do with this transaction, which has a very wide scope. This kind of deadlock usually happens with an UPDATE ordering problem, where two transactions try to update multiple rows of a table, and each gets the lock on a different row at the same time, deadlocking each other. Which leads me to believe that this line could be at fault. The usual fix for a problem like this would be to do a SELECT ... FOR UPDATE on all the rows you plan to update before doing it. Which is not so easy with the convoluted DBIx::Class code involved here.

Actions #16

Updated by openqa_review about 2 years ago

  • Due date set to 2022-12-20

Setting due date based on mean cycle time of SUSE QE Tools

Actions #17

Updated by kraih about 2 years ago

We did have a few more cases recently btw.

postgresql-2022-11-24_000000.log:2022-11-24 09:35:05.878 UTC openqa geekotest [32110]ERROR:  deadlock detected
postgresql-2022-11-24_000000.log:2022-11-24 09:35:06.881 UTC openqa geekotest [32298]ERROR:  deadlock detected
postgresql-2022-11-27_000000.log:2022-11-27 04:31:55.788 UTC openqa geekotest [10404]ERROR:  deadlock detected
postgresql-2022-12-01_000000.log:2022-12-01 03:46:12.925 UTC openqa geekotest [7300]ERROR:  deadlock detected
postgresql-2022-12-01_000000.log:2022-12-01 22:08:49.641 UTC openqa geekotest [26387]ERROR:  deadlock detected
postgresql-2022-12-01_000000.log:2022-12-01 23:57:27.173 UTC openqa geekotest [25245]ERROR:  deadlock detected
postgresql-2022-12-06_000000.log:2022-12-06 01:38:27.615 UTC openqa geekotest [10906]ERROR:  deadlock detected
Actions #18

Updated by kraih about 2 years ago

Added a PR to rule out one of two spots where the UPDATE query causing the deadlock could have originated: https://github.com/os-autoinst/openQA/pull/4943

Actions #19

Updated by kraih about 2 years ago

In case someone else has to work on this again, this article has a good explanation for UPDATE ordering deadlocks. I suspect that we are dealing with a variant of that.

Actions #20

Updated by kraih about 2 years ago

Think i know how to fix it, simply making sure that all processes handle assets in the same order here should do the trick. The exact order doesn't matter, it just needs to be consistent between processes. Preparing a patch now.

Actions #21

Updated by kraih about 2 years ago

Second PR with a possible ordering fix: https://github.com/os-autoinst/openQA/pull/4952

Actions #22

Updated by kraih about 2 years ago

  • Status changed from In Progress to Feedback

Lets wait a few days and keep an eye on the logs.

Actions #23

Updated by livdywan about 2 years ago

kraih wrote:

Lets wait a few days and keep an eye on the logs.

[2022-12-09T10:01:48.096130Z] [warn] [pid:21728] Unable to incomplete/duplicate or reschedule jobs abandoned by worker 498: DBIx::Class::Storage::DBI::_dbh_execute(): DBI Exception: DBD::Pg::st execute failed: ERROR:  deadlock detected
DETAIL:  Process 21755 waits for ShareLock on transaction 244660614; blocked by process 22220.
Process 22220 waits for ShareLock on transaction 244660621; blocked by process 21755.
HINT:  See server log for query details.
CONTEXT:  while updating tuple (5964,52) in relation "jobs" [for Statement "UPDATE jobs SET result = ?, t_updated = ? WHERE id = ?" with ParamValues: 1='parallel_failed', 2='2022-12-09 10:01:47', 3='2943098'] at /usr/share/openqa/script/../lib/OpenQA/Schema/Result/Jobs.pm line 1779

[2022-12-09T06:21:08.420239Z] [warn] [pid:6924] Unable to incomplete/duplicate or reschedule jobs abandoned by worker 511: DBIx::Class::Storage::DBI::_dbh_execute(): DBI Exception: DBD::Pg::st execute failed: ERROR:  deadlock detected
DETAIL:  Process 6931 waits for ShareLock on transaction 243609936; blocked by process 5779.
Process 5779 waits for ShareLock on transaction 243609932; blocked by process 6931.
HINT:  See server log for query details.
CONTEXT:  while updating tuple (3166,30) in relation "jobs" [for Statement "UPDATE jobs SET result = ?, t_updated = ? WHERE id = ?" with ParamValues: 1='parallel_failed', 2='2022-12-09 06:21:07', 3='2939946'] at /usr/share/openqa/script/../lib/OpenQA/Schema/Result/Jobs.pm line 1779
[2022-12-09T06:54:32.672698Z] [warn] [pid:28682] Unable to incomplete/duplicate or reschedule jobs abandoned by worker 519: DBIx::Class::Storage::DBI::_dbh_execute(): DBI Exception: DBD::Pg::st execute failed: ERROR:  deadlock detected
DETAIL:  Process 28685 waits for ShareLock on transaction 243819997; blocked by process 30309.
Process 30309 waits for ShareLock on transaction 243819999; blocked by process 28685.
HINT:  See server log for query details.
CONTEXT:  while updating tuple (2930,111) in relation "jobs" [for Statement "UPDATE jobs SET result = ?, t_updated = ? WHERE id = ?" with ParamValues: 1='parallel_failed', 2='2022-12-09 06:54:31', 3='2941928'] at /usr/share/openqa/script/../lib/OpenQA/Schema/Result/Jobs.pm line 1779
Actions #24

Updated by kraih about 2 years ago

  • Related to action #121777: PostgreSQL update ordering deadlock on jobs table added
Actions #25

Updated by kraih about 2 years ago

cdywan wrote:

[2022-12-09T10:01:48.096130Z] [warn] [pid:21728] Unable to incomplete/duplicate or reschedule jobs abandoned by worker 498: DBIx::Class::Storage::DBI::_dbh_execute(): DBI Exception: DBD::Pg::st execute failed: ERROR:  deadlock detected
DETAIL:  Process 21755 waits for ShareLock on transaction 244660614; blocked by process 22220.
Process 22220 waits for ShareLock on transaction 244660621; blocked by process 21755.
HINT:  See server log for query details.
CONTEXT:  while updating tuple (5964,52) in relation "jobs" [for Statement "UPDATE jobs SET result = ?, t_updated = ? WHERE id = ?" with ParamValues: 1='parallel_failed', 2='2022-12-09 10:01:47', 3='2943098'] at /usr/share/openqa/script/../lib/OpenQA/Schema/Result/Jobs.pm line 1779

Ah, you noticed it too. That's a different deadlock, and i've opened a new ticket for it: #121777

Actions #26

Updated by jbaier_cz about 2 years ago

Can this: https://openqa.suse.de/minion/jobs?id=6073000 be related?

  - error_message: |
      DBIx::Class::Storage::DBI::_dbh_execute(): DBI Exception: DBD::Pg::st execute failed: ERROR:  deadlock detected
      DETAIL:  Process 11392 waits for ShareLock on transaction 1441261351; blocked by process 20591.
      Process 20591 waits for ShareLock on transaction 1441213509; blocked by process 11392.
      HINT:  See server log for query details.
      CONTEXT:  while locking tuple (238,55) in relation "assets" [for Statement "SELECT me.id, me.type, me.name, me.size, me.checksum, me.last_use_job_id, me.fixed, me.t_created, me.t_updated FROM assets me WHERE ( ( me.name = ? AND me.type = ? ) ) FOR UPDATE" with ParamValues: 1='windows-10-x86_64-21H2@windows_bios_boot.qcow2', 2='hdd'] at /usr/share/openqa/script/../lib/OpenQA/Schema/Result/Jobs.pm line 1509
    job_name: sles4sap_remote_desktop_windows_client_ppc64le
  - error_message: |
      DBIx::Class::Storage::DBI::_dbh_execute(): DBI Exception: DBD::Pg::st execute failed: ERROR:  current transaction is aborted, commands ignored until end of transaction block [for Statement "SELECT me.id, me.name, me.parent_id, me.size_limit_gb, me.exclusively_kept_asset_size, me.keep_logs_in_days, me.keep_important_logs_in_days, me.keep_results_in_days, me.keep_important_results_in_days, me.default_priority, me.sort_order, me.description, me.template, me.build_version_sort, me.carry_over_bugrefs, me.t_created, me.t_updated FROM job_groups me WHERE ( me.id = ? )" with ParamValues: 1='146'] at /usr/share/openqa/script/../lib/OpenQA/Schema/Result/Jobs.pm line 576
    job_name: SAPHanaSR_ScaleUp_PerfOpt_supportserver
...
Actions #27

Updated by kraih almost 2 years ago

On O3 there have been no new cases. So it looks like at least the assets specific deadlock has been resolved.

Actions #28

Updated by kraih almost 2 years ago

jbaier_cz wrote:

Can this: https://openqa.suse.de/minion/jobs?id=6073000 be related?

Yes it can is. This is what i see in the PostgreSQL logs:

2022-12-15 22:30:58.389 CET openqa geekotest [11392]ERROR:  deadlock detected
2022-12-15 22:30:58.389 CET openqa geekotest [11392]DETAIL:  Process 11392 waits for ShareLock on transaction 1441261351; blocked by process 20591.
        Process 20591 waits for ShareLock on transaction 1441213509; blocked by process 11392.
        Process 11392: SELECT me.id, me.type, me.name, me.size, me.checksum, me.last_use_job_id, me.fixed, me.t_created, me.t_updated FROM assets me WHERE ( ( me.name = $1 AND me.type = $2 ) ) FOR UPDATE
        Process 20591: SELECT me.id, me.type, me.name, me.size, me.checksum, me.last_use_job_id, me.fixed, me.t_created, me.t_updated FROM assets me WHERE ( ( me.name = $1 AND me.type = $2 ) ) FOR UPDATE

Which seems to come from here, triggered by the schedule_iso task. And from there it reaches the same loop that previously caused the update deadlock. Problem is the huge transaction block spanning all jobs being created. There is probably no way around splitting it up. At least the deadlock from this code path is a lot more rare, so this is not as urgent.

To sum it up: I think the problem here is that the big transaction loops over the same assets multiple times, for different jobs, which means the order from the inner loop cannot prevent the deadlock. One solution would be to split up the big transaction, or maybe the update of the assets table can be skipped since this is about creating jobs.

Actions #29

Updated by kraih almost 2 years ago

  • Status changed from Feedback to Workable
  • Assignee deleted (kraih)
Actions #30

Updated by kraih almost 2 years ago

Putting it back in the queue for now, in case someone else would like to give it a shot. Untangling the big transaction is definitely a bit tricky, and there is no reproducer to make sure you've actually fixed it. If necessary i'll pick it up again later.

Actions #31

Updated by mkittler almost 2 years ago

Problem is the huge transaction block spanning all jobs being created. There is probably no way around splitting it up.

Unfortunately this is in a transaction on purpose. The scheduler code should only see all those newly created jobs or none and in case of an error we don't want a half-scheduled product.

or maybe the update of the assets table can be skipped since this is about creating jobs.

That might be the simplest solution. However, this would likely break the check whether an asset exists that's done when restarting (see 50b986e957a4f2c30d94c8499d43700746ae6c42). At least that would be the case when the asset size hasn't been otherwise determined yet.

Actions #32

Updated by mkittler almost 2 years ago

  • Assignee set to mkittler
Actions #33

Updated by kraih almost 2 years ago

mkittler wrote:

Unfortunately this is in a transaction on purpose. The scheduler code should only see all those newly created jobs or none and in case of an error we don't want a half-scheduled product.

I feared this would be the case. Really too bad, was hoping we could switch to smaller transactions everywhere and make the code easier to follow.

Actions #34

Updated by livdywan almost 2 years ago

  • Due date changed from 2022-12-20 to 2023-01-06

Let's assume we're not in a rush to fix things up today. It's fine to discuss further and select the right solution.

Actions #35

Updated by mkittler almost 2 years ago

  • Status changed from Workable to In Progress

PR for removing the asset size update: https://github.com/os-autoinst/openQA/pull/4963

Note that the transaction should generally not be that problematic because it is (besides the asset update) not supposed to alter any existing rows (only add new ones).

Actions #36

Updated by mkittler almost 2 years ago

  • Status changed from In Progress to Feedback

The PR has been merged. Let's see whether log messages about deadlocks will no longer occur after is has been deployed.

Actions #37

Updated by okurz almost 2 years ago

  • Due date changed from 2023-01-06 to 2023-01-20

christmas grace due date bump :)

Actions #38

Updated by livdywan almost 2 years ago

  • Description updated (diff)
Actions #39

Updated by mkittler almost 2 years ago

Here's a follow-up for my previous PR (which was incomplete): https://github.com/os-autoinst/openQA/pull/4981

That will hopefully fix the remaining cases we've still seen on OSD.

Actions #40

Updated by mkittler almost 2 years ago

I haven't found any further occurrences via sudo -u postgres bash -c 'grep "deadlock detected" /var/lib/pgsql/data/log/postgresql-2023-01-*' on o3 and OSD after my PR has been merged. I'll keep the ticket in feedback to check again next week.

Actions #41

Updated by mkittler almost 2 years ago

Still no deadlocks being looked. Let's wait before resolving this ticket at least until Wednesday (when I'll check again).

Actions #42

Updated by mkittler almost 2 years ago

  • Status changed from Feedback to Resolved

There are still no further occurrences so I'm considering this ticket resolved now.

Actions

Also available in: Atom PDF