Project

General

Profile

Actions

action #120891

closed

Product not scheduled: DBD::Pg::st execute failed: ERROR: deadlock detected size:M

Added by dimstar about 2 years ago. Updated almost 2 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2022-11-23
Due date:
2023-01-20
% Done:

0%

Estimated time:

Description

Motivation

Last night, Tumbleweed snapshot 20221122 was synced over to openQA and scheduled for testing.
Normally, the full test suite is 313 tests long; for 1122, only 311 were scheduled though

I managed to track it down to the NET/i586 test not having scheduled, which was attempted in
https://openqa.opensuse.org/admin/productlog?id=283823

The error reported was

{
"failed_job_info": [
{
"error_message": "DBIx::Class::Storage::DBI::_dbh_execute(): DBI Exception: DBD::Pg::st execute failed: ERROR: deadlock detected\nDETAIL: Process 3080 waits for ShareLock on transaction 166511492; blocked by process 3132.\nProcess 3132 waits for ShareLock on transaction 166511322; blocked by process 3080.\nHINT: See server log for query details.\nCONTEXT: while updating tuple (144,69) in relation \"assets\" [for Statement \"UPDATE assets SET size = ?, t_updated = ? WHERE id = ?\" with ParamValues: 1='499788508', 2='2022-11-23 01:09:38', 3='66765244'] at /usr/share/openqa/script/../lib/OpenQA/Schema/Result/Assets.pm line 147\n",
"job_name": "install_only"
},
{
"error_message": "DBIx::Class::Storage::DBI::_dbh_execute(): DBI Exception: DBD::Pg::st execute failed: ERROR: current transaction is aborted, commands ignored until end of transaction block [for Statement \"SELECT me.id, me.name, me.parent_id, me.size_limit_gb, me.exclusively_kept_asset_size, me.keep_logs_in_days, me.keep_important_logs_in_days, me.keep_results_in_days, me.keep_important_results_in_days, me.default_priority, me.sort_order, me.description, me.template, me.build_version_sort, me.carry_over_bugrefs, me.t_created, me.t_updated FROM job_groups me WHERE ( me.id = ? )\" with ParamValues: 1='1'] at /usr/share/openqa/script/../lib/OpenQA/Schema/Result/Jobs.pm line 576\n",
"job_name": "install_only"
}
],
"successful_job_ids": []
}

Acceptance criteria

  • AC1: Jobs are scheduled reliably

Suggestions

  • Clarify why the "unrelated" operations interfere
  • Try and avoid updating the asset size too often
  • Identify all places that lock the asset table
  • Read PostgreSQL documentation, e.g. https://www.postgresql.org/docs/current/explicit-locking.html
  • We could add a re-try around the transaction, which by design is atomic. It's not a "proper solution" but would release all locks as well before the next transaction is attempted.
  • Maybe we can ensure there's only one place that updates the asset size e.g. a dedicated job that any users can trigger as needed
    • There's https://github.com/os-autoinst/openQA/pull/4963 but we're still seeing cases on o3 so maybe that fix wasn't complete
    • Maybe the my $asset = $assets->find_or_create($asset_info, {for => 'update'}); using or_create/for update is simply the problem because it always "updates"

Out of scope

  • Introduce another internal state to distinguish the sate of a not really running but should be scheduled job that can be reset back

Related issues 1 (0 open1 closed)

Related to openQA Project (public) - action #121777: PostgreSQL update ordering deadlock on jobs tableResolvedkraih2022-12-09

Actions
Actions

Also available in: Atom PDF