Project

General

Profile

action #120891

Updated by livdywan over 1 year ago

## Motivation 

 Last night, Tumbleweed snapshot 20221122 was synced over to openQA and scheduled for testing. 
 Normally, the full test suite is 313 tests long; for 1122, only 311 were scheduled though 

 I managed to track it down to the NET/i586 test not having scheduled, which was attempted in 
 https://openqa.opensuse.org/admin/productlog?id=283823 

 The error reported was 

 ``` 
 { 
 "failed_job_info": [ 
 { 
 "error_message": "DBIx::Class::Storage::DBI::_dbh_execute(): DBI Exception: DBD::Pg::st execute failed: ERROR: deadlock detected\nDETAIL: Process 3080 waits for ShareLock on transaction 166511492; blocked by process 3132.\nProcess 3132 waits for ShareLock on transaction 166511322; blocked by process 3080.\nHINT: See server log for query details.\nCONTEXT: while updating tuple (144,69) in relation \"assets\" [for Statement \"UPDATE assets SET size = ?, t_updated = ? WHERE id = ?\" with ParamValues: 1='499788508', 2='2022-11-23 01:09:38', 3='66765244'] at /usr/share/openqa/script/../lib/OpenQA/Schema/Result/Assets.pm line 147\n", 
 "job_name": "install_only" 
 }, 
 { 
 "error_message": "DBIx::Class::Storage::DBI::_dbh_execute(): DBI Exception: DBD::Pg::st execute failed: ERROR: current transaction is aborted, commands ignored until end of transaction block [for Statement \"SELECT me.id, me.name, me.parent_id, me.size_limit_gb, me.exclusively_kept_asset_size, me.keep_logs_in_days, me.keep_important_logs_in_days, me.keep_results_in_days, me.keep_important_results_in_days, me.default_priority, me.sort_order, me.description, me.template, me.build_version_sort, me.carry_over_bugrefs, me.t_created, me.t_updated FROM job_groups me WHERE ( me.id = ? )\" with ParamValues: 1='1'] at /usr/share/openqa/script/../lib/OpenQA/Schema/Result/Jobs.pm line 576\n", 
 "job_name": "install_only" 
 } 
 ], 
 "successful_job_ids": [] 
 } 
 ``` 

 ## Acceptance criteria 
 - **AC1**: Jobs Scheduling products and asset scanning isn't interfering with each others; there are scheduled reliably no deadlocks 

 ## Suggestions 
 - Clarify why the "unrelated" operations interfere 
 - Try and avoid updating the asset size too often 
 - Identify all places that lock the asset table 
 - Read PostgreSQL documentation, e.g. https://www.postgresql.org/docs/current/explicit-locking.html  
 - We could add a re-try around the transaction, which by design is atomic. It's not a "proper solution" but would release all locks as well before the next transaction is attempted. 
 - Maybe we can ensure there's only one place that updates the asset size e.g. a dedicated job that any users can trigger as needed 
   - There's https://github.com/os-autoinst/openQA/pull/4963 but we're still seeing cases on o3 so maybe that fix wasn't complete 
   - Maybe the `my $asset = $assets->find_or_create($asset_info, {for => 'update'});` using `or_create/for update` is simply the problem because it always "updates" 

 ## Out of scope 
 - Introduce another internal state to distinguish the sate of a not really running but should be scheduled job that can be reset back 

Back