Project

General

Profile

Actions

action #119866

closed

o3 logreport message DBIx::Class::Row::update(): Can't update OpenQA::Schema::Result::Jobs=HASH(...): row not found at lib/OpenQA/Schema/Result/Jobs.pm line 1283 size:M

Added by okurz over 1 year ago. Updated over 1 year ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2022-11-03
Due date:
% Done:

0%

Estimated time:

Description

Observation

From the email from 2022-10-26 08:10Z:

[2022-10-26T08:01:19.477782Z] [error] [A3jpdQkRrxxb] DBIx::Class::Row::update(): Can't update OpenQA::Schema::Result::Jobs=HASH(0x561f96aeec80): row not found at /usr/share/openqa/script/../lib/OpenQA/Schema/Result/Jobs.pm line 1283

(Line corresponds to DBIx update of job in account_result_size.)

Actions #1

Updated by mkittler over 1 year ago

  • Assignee set to mkittler
Actions #2

Updated by mkittler over 1 year ago

We'd get different behavior when the job doesn't exist from the start (in all code paths calling this function). All code paths obtain the job ID first and then do this update only later at some point (without using a txn) so maybe the job has been deleted in the middle. It seems a bit unlikely but it is still the most likely explanation I can currently think of.

Actions #3

Updated by mkittler over 1 year ago

A job deletion was actually logged very shortly before the error occurred:

[2022-10-26T08:01:18.432185Z] [debug] [pid:9672] Updating seen of worker 106 from worker_status (free)
[2022-10-26T08:01:19.044647Z] [debug] [pid:9672] Updating seen of worker 102 from worker_status (free)
[2022-10-26T08:01:19.241330Z] [debug] [pid:16265] Sending AMQP event: opensuse.openqa.job.delete
[2022-10-26T08:01:19.242132Z] [debug] [pid:16265] AMQP URL: amqps://openqa:b45z45bz645tzrhwer@rabbit.opensuse.org:5671/?exchange=pubsub
[2022-10-26T08:01:19.477782Z] [error] [A3jpdQkRrxxb] DBIx::Class::Row::update(): Can't update OpenQA::Schema::Result::Jobs=HASH(0x561f96aeec80): row not found at /usr/share/openqa/script/../lib/OpenQA/Schema/Result/Jobs.pm line 1283

[2022-10-26T08:01:19.558298Z] [debug] [pid:16265] opensuse.openqa.job.delete published
[2022-10-26T08:01:19.886680Z] [debug] [pid:9672] Updating seen of worker 97 from worker_status (free)
[2022-10-26T08:01:20.258523Z] [debug] [pid:9672] Updating seen of worker 178 from worker_status (free)
[2022-10-26T08:01:20.933074Z] [debug] Worker exceeded RSS limit "188116 > 180000", restarting

So maybe the unlikely case mentioned in the previous comment was actually happening. (There's no job ID logged so the log messages could still be unrelated.)

Actions #4

Updated by mkittler over 1 year ago

If the job was deleted in the middle that would definitely give us this exact error message, e.g. adding the following in t/10-jobs.t

my $job = $jobs->find(80000);
$jobs->find(80000)->delete;
$job->account_result_size(test => 123);

leads to

DBIx::Class::Row::update(): Can't update OpenQA::Schema::Result::Jobs=HASH(0x55df72e65908): row not found at /hdd/openqa-devel/repos/openQA/lib/OpenQA/Schema/Result/Jobs.pm line 1283
Actions #5

Updated by mkittler over 1 year ago

Using a transaction in the controller code that so far runs into "row not found" wouldn't really help much. That is of course not a surprise. The only advantage is that it would cause a better error message, e.g.

my $schema_name = OpenQA::Test::Database::generate_schema_name;
my $schema = OpenQA::Test::Case->new->init_data(fixtures_glob => '01-jobs.pl 05-job_modules.pl 06-job_dependencies.pl', schema_name => $schema_name);
my $t = Test::Mojo->new('OpenQA::WebAPI');
my $jobs = $t->app->schema->resultset('Jobs');

…

$schema->txn_begin;
$schema->storage->dbh->prepare('SET TRANSACTION ISOLATION LEVEL REPEATABLE READ DEFERRABLE;')->execute();
my $job = $jobs->find(80000);
my $schema2 = OpenQA::Schema->connect($ENV{TEST_PG});
$schema2->storage->on_connect_do("SET search_path TO \"$schema_name\"");
$schema2->resultset('Jobs')->find(80000)->delete;
$job->account_result_size(test => 123);
$schema->txn_commit;

prints the following error:

DBIx::Class::Storage::DBI::_dbh_execute(): DBI Exception: DBD::Pg::st execute failed: ERROR:  could not serialize access due to concurrent delete [for Statement "UPDATE jobs SET result_size = coalesce(result_size, 0) + 123, t_updated = ? WHERE ( id = ? )" with ParamValues: 1='2022-11-04 13:15:14', 2='80000'] at /hdd/openqa-devel/repos/openQA/lib/OpenQA/Schema/Result/Jobs.pm line 1283
Actions #6

Updated by mkittler over 1 year ago

  • Description updated (diff)

Note that the txn would help with read accesses. So if one would add

$schema2->resultset('Jobs')->find(80000)->delete;
$job->discard_changes;
is $job->id, 80000, 'job still appears to be present';

then that test still passes. However, using such a txn has actually the disadvantage that concurrent writes won't work anymore and fail similar to concurrent deletes:

DBIx::Class::Storage::DBI::_dbh_execute(): DBI Exception: DBD::Pg::st execute failed: ERROR:  could not serialize access due to concurrent update [for Statement "UPDATE jobs SET result_size = coalesce(result_size, 0) + 123, t_updated = ? WHERE ( id = ? )" with ParamValues: 1='2022-11-04 13:41:46', 2='80000'] at /hdd/openqa-devel/repos/openQA/lib/OpenQA/Schema/Result/Jobs.pm line 1283

(Without the txn simply the last update wins in case of a conflict which I supposedly even ok in most cases.)


I'm wondering what we should do about it. We have lots of code dealing with jobs that is written under the assumption that a job is not deleted (and that there are no problematic concurrent updates). The specific exception that showed up as error in the logs here is only one possible place where things could've gone wrong. As my experiments showed, wrapping controller code dealing with jobs in a txn would only make things worse (except if the code is read-only; in this case we'd be able to avoid inconsistent subsequent reads by using a txn which could then be a read-only txn).

Actions #7

Updated by mkittler over 1 year ago

We could only delete jobs that are in state DONE. If we then additionally ensure that no updates to jobs are made on jobs in state DONE the problem is less likely to happen. This is still not ideal, though. If the job is set done while a job update is ongoing it will still break:

create_artefact is invoked on running job
done is invoked on running job and job no longer done
delete is invoked on done job and job deleted
create_artefact is still ongoing and reaches a point were it does a job update -> same problem as before
Actions #8

Updated by mkittler over 1 year ago

We could of course also simply accept that a "row not found" error can happen and handle the error accordingly. I suppose in most situations, when we try updating a job but it has already been deleted meanwhile we can simply abort the whole controller code that does the updating and might even return a 200 return code. The 200 return code would make sense considering that if the deletion would have happened slightly later then there's be the same overall outcome (no job in the end so the update wasn't really making any difference after all despite being successful).

Actions #9

Updated by livdywan over 1 year ago

mkittler wrote:

We could only delete jobs that are in state DONE. If we then additionally ensure that no updates to jobs are made on jobs in state DONE the problem is less likely to happen. This is still not ideal, though. If the job is set done while a job update is ongoing it will still break:

My first thought was maybe we can track that all operations regarding a job are complete but that would probably require a new state. On second thought, what if the delete was queued - and queued could simply mean try to delete and if that fails spawn a minion job that tries to delete the job again assuming the update will be finished by then.

Actions #10

Updated by tinita over 1 year ago

mkittler wrote:

We could of course also simply accept that a "row not found" error can happen and handle the error accordingly.

Yes, I also think that would be ok.

Actions #11

Updated by mkittler over 1 year ago

… and queued could simply mean try to delete and if that fails …

Not really because the deletion won't fail. Deleting is not the problem, dealing with the deletion in other places is the problem. Not sure whether we could make the deletion fail if there are any pending transactions.

We could of course also simply accept that a "row not found" error can happen and handle the error accordingly.

And this would also be similar to how single DELETE/UPDATE queries behave (see 2nd paragraph of https://www.postgresql.org/docs/current/transaction-iso.html#XACT-READ-COMMITTED).


Looks like depending on whether $job->discard_changes; has been invoked in the meantime we either get "row not found" or "Not in database". Maybe it is better to use raw SQL (instead of doing regex matching on the die message). PR: https://github.com/os-autoinst/openQA/pull/4893

Actions #12

Updated by mkittler over 1 year ago

  • Status changed from New to Feedback

The PR has been merged. It covers the case producing the log message from the ticket description.

I'd wait for the estimation meeting to decide whether we want to drag this out to cover also other places where we update jobs.

Actions #13

Updated by livdywan over 1 year ago

  • Subject changed from o3 logreport message DBIx::Class::Row::update(): Can't update OpenQA::Schema::Result::Jobs=HASH(...): row not found at lib/OpenQA/Schema/Result/Jobs.pm line 1283 to o3 logreport message DBIx::Class::Row::update(): Can't update OpenQA::Schema::Result::Jobs=HASH(...): row not found at lib/OpenQA/Schema/Result/Jobs.pm line 1283 size:M
  • Status changed from Feedback to Resolved

We discussed this briefly. If another similar error occurs in the future we don't expect a big problem as a side effect, so we can look into it whenever that happened. A similar approach may not be usable, and this is why we're deciding against a proactive fix for now.

Actions

Also available in: Atom PDF