action #45041

Asset cache cleanup still fails if new jobs are created at the same time

Added by mkittler over 1 year ago. Updated over 1 year ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version:
Start date: 2018-12-12
Due date:
% Done: 0%
Estimated time:
Difficulty:
Duration:

Description

The error message is:

"{UNKNOWN}: repo/SLE-15-SP1-Module-Transactional-Server-POOL-aarch64-Build121.1-Media1 was scheduled during cleanup (max job initially 0, now 2322143) at /usr/share/openqa/script/../lib/OpenQA/Schema/ResultSet/Assets.pm line 273. at /usr/share/openqa/script/../lib/OpenQA/Schema/ResultSet/Assets.pm line 278\n"

(from https://openqa.suse.de/minion/jobs?state=failed&offset=0&task=limit_assets)


We previously wrapped the relevant cleanup code in a transaction so that it would not see other changes made in the meantime. So either the transaction has no effect, or there's a bug in the condition that triggers the die.
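
A minimal sketch of the kind of guard that produces this message. It assumes, hypothetically, that the cleanup records the highest job ID when it starts and, before deleting an asset, compares it with the highest ID among the jobs currently using that asset. The function name assert_not_scheduled_during_cleanup and the %jobs_by_asset data are made up for illustration; this is not the code in Assets.pm. If the second lookup can see jobs committed after the first one, the guard fires even when both lookups run inside the same transaction. Note also that with an initial value of 0, as in the log above, any job using the asset would trip a guard of this shape.

```perl
use strict;
use warnings;

# Hypothetical guard: dies if a job that uses $asset was created after the
# cleanup started, i.e. its ID exceeds the maximum job ID recorded back then.
sub assert_not_scheduled_during_cleanup {
    my ($asset, $jobs_by_asset, $initial_max_job_id) = @_;
    my $current_max = 0;
    for my $job_id (@{ $jobs_by_asset->{$asset} // [] }) {
        $current_max = $job_id if $job_id > $current_max;
    }
    die "$asset was scheduled during cleanup "
      . "(max job initially $initial_max_job_id, now $current_max)\n"
      if $current_max > $initial_max_job_id;
    return 1;
}

# Made-up data: asset name => IDs of the jobs referencing it.
my %jobs_by_asset = (
    'iso/old.iso'     => [100, 120],
    'repo/new-Media1' => [2322143],
);

assert_not_scheduled_during_cleanup('iso/old.iso', \%jobs_by_asset, 2000);    # passes
eval { assert_not_scheduled_during_cleanup('repo/new-Media1', \%jobs_by_asset, 2000) };
print $@;    # repo/new-Media1 was scheduled during cleanup (max job initially 2000, now 2322143)
```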


Related issues

Related to openQA Project - action #19672: GRU may delete assets while jobs are registered (Resolved, 2017-06-08)

Related to openQA Project - action #41483: [tools] medium that should belong to job group was deleted after just 2 minutes making our SLE15 tests in build 47.1 useless ... and more (Resolved, 2018-09-24)

History

#1 Updated by mkittler over 1 year ago

  • Description updated (diff)

#2 Updated by mkittler over 1 year ago

Apparently a transaction doesn't help here, as it only protects against seeing incomplete changes made by other transactions*. But the new jobs are likely not added in a transactional way.

I also tested this by adding the following to a controller:

$self->app->schema->txn_do(sub {
    log_debug('starting transaction');
    sleep 10; # in the meantime, insert a new worker via psql
    log_debug('actual lookup: ' . $self->app->schema->resultset('Workers')->count); # the count is incremented
});

This still leaves the question of how to handle it instead. We could make scheduling an ISO a transaction. But in general, new jobs can always be added, so the asset cleanup should probably be able to cope with that.


* see https://www.postgresql.org/docs/8.3/tutorial-transactions.html, "when multiple transactions are running concurrently, each one should not be able to see the incomplete changes made by others"

#3 Updated by mkittler over 1 year ago

  • Related to action #19672: GRU may delete assets while jobs are registered added

#4 Updated by mkittler over 1 year ago

  • Related to action #41483: [tools] medium that should belong to job group was deleted after just 2 minutes making our SLE15 tests in build 47.1 useless ... and more added

#5 Updated by mkittler over 1 year ago

The database operations for scheduling ISOs are actually already wrapped in a transaction. Maybe the jobs that are added in the meantime are created elsewhere?

#6 Updated by mkittler over 1 year ago

  • Status changed from New to Feedback

It seems that the transaction isolation level is not high enough.

PR: https://github.com/os-autoinst/openQA/pull/1919 (merged, let's see whether this fixes the issue in production)
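A toy model, not PostgreSQL's implementation, of why the experiment in #2 saw the new worker and why a higher isolation level helps. Under READ COMMITTED, PostgreSQL's default, each statement sees all rows committed before that statement started. Under REPEATABLE READ, all statements of a transaction share one snapshot, taken at its first statement. The names insert_committed, count_workers and run_scenario are hypothetical, and the sketch does not reproduce the actual change from the PR.

```perl
use strict;
use warnings;

# Toy model of snapshot visibility (hypothetical names, not PostgreSQL internals).
# Every committed row stores the value of a global commit counter.
my @workers;           # one commit sequence number per committed row
my $commit_seq = 0;    # incremented by every committed insert

# Simulates another session inserting and committing a worker (e.g. via psql).
sub insert_committed {
    push @workers, ++$commit_seq;
}

# Counts the rows one statement of $txn can see.
sub count_workers {
    my ($txn) = @_;
    # READ COMMITTED: fresh snapshot for every statement.
    # REPEATABLE READ: snapshot taken at the first statement, then reused.
    if ($txn->{level} eq 'READ COMMITTED' || !defined $txn->{snapshot}) {
        $txn->{snapshot} = $commit_seq;
    }
    return scalar grep { $_ <= $txn->{snapshot} } @workers;
}

# Replays the experiment: count, concurrent insert, count again in one transaction.
sub run_scenario {
    my ($level) = @_;
    @workers    = ();
    $commit_seq = 0;
    insert_committed() for 1 .. 3;    # three workers exist beforehand
    my $txn    = { level => $level, snapshot => undef };
    my $before = count_workers($txn);
    insert_committed();               # concurrent commit while $txn is open
    my $after = count_workers($txn);
    return ($before, $after);
}

for my $level ('READ COMMITTED', 'REPEATABLE READ') {
    my ($before, $after) = run_scenario($level);
    print "$level: before=$before after=$after\n";
}
# READ COMMITTED: before=3 after=4
# REPEATABLE READ: before=3 after=3
```

In PostgreSQL, a REPEATABLE READ transaction that tries to modify rows changed concurrently fails with a serialization error, so code running at that level generally needs to be prepared to retry.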

#7 Updated by mkittler over 1 year ago

  • Status changed from Feedback to Resolved

o3 was also affected by the issue, and it hasn't occurred since the last deployment, so I assume the fix now works in production.
