Project

General

Profile

action #71827

coordination #39719: [saga][epic] Detect "known failures" and mark jobs as such to make tests more stable, reviewing test results and tracking known issues easier

coordination #62420: [epic] Distinguish all types of incompletes

test incompletes with auto_review:"(?s)Failed to download.*Asset was pruned immediately after download":retry because worker cache prunes the asset it just downloaded

Added by okurz 4 months ago. Updated 2 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Concrete Bugs
Target version:
Start date:
2020-07-30
Due date:
% Done:

0%

Estimated time:
Difficulty:

Description

Observation

see the relevant cases from #69451

Acceptance criteria

  • AC1: No "download successful" should end with "Failed to download" referencing the same asset

Suggestions


Related issues

Related to openQA Project - action #73285: test incompletes with auto_review:"(?s)Download of.*processed[^:].*Failed to download":retry , not helpful details about reason of errorBlocked2020-07-30

Copied from openQA Project - coordination #69451: [epic] test incompletes with "(?s)Download.*successful.*Failed to download":retry, not helpful detailsBlocked2020-07-30

Copied to openQA Project - action #80118: test incompletes with auto_review:"(?s)Failed to download.*Asset was pruned immediately after download":retry, not effective on osd, or second fix neededResolved

History

#1 Updated by okurz 4 months ago

  • Copied from coordination #69451: [epic] test incompletes with "(?s)Download.*successful.*Failed to download":retry, not helpful details added

#2 Updated by okurz 4 months ago

https://github.com/os-autoinst/openQA/pull/3418 for a specific log message which auto-review can match on

#3 Updated by mkittler 4 months ago

The note

            # note: This check has no effect if the download was performed by
            # an already enqueued Minion job or if the pruning happened within
            # a completely different asset download.

I asked to add in the PR review shows that the problem is tricky:

  1. Multiple downloads can run in parallel and therefore purging might run in parallel.
  2. Even the download for the particular asset might be triggered by different jobs and is therefore executed within another Minion job. That means at least within the worker (where the PR tries to extend the message) we can not reliable detect the case.

kraih You know the asset cache better than me: Are the mentioned points correct? Any idea how to solve this? Even if we use a lock to prevent parallel purging it would still leave a race condition because at the time the worker tries to access the asset (and creates the hardlink) it might have been purged nevertheless. Maybe it would help to create the hard link already within the download task but different jobs obviously need different hard links so that would complicate things. By the way, updating the asset to be the most recently used asset before the purging doesn't seem to do the trick. Otherwise we wouldn't see the issue (see #69451#note-5).

#4 Updated by kraih 3 months ago

mkittler wrote:

I asked to add in the PR review shows that the problem is tricky:

  1. Multiple downloads can run in parallel and therefore purging might run in parallel.
  2. Even the download for the particular asset might be triggered by different jobs and is therefore executed within another Minion job. That means at least within the worker (where the PR tries to extend the message) we can not reliable detect the case.

kraih You know the asset cache better than me: Are the mentioned points correct?

If i understand you correctly only the first point is correct. There can be multiple downloads active at the same time (and multiple purges, which probably is the cause for race conditions somehow), and multiple requests for the same asset at the same time. It is correct that only one cache service Minion job will do the actual download, but all requests will receive the output from the Minion job that performed the download. That's a fairly new feature i added to the cache service, you see it in action every time the Minion job ids from the cache service logs are different from the id of the job that was supposed to perform the download.

#5 Updated by okurz 3 months ago

  • Status changed from New to Workable

#6 Updated by okurz 3 months ago

  • Subject changed from test incompletes with "(?s)Download.*successful.*Failed to download" because worker cache prunes the asset it just downloaded to test incompletes with auto_review:"(?s)Download.*successful.*Failed to download.*Asset was pruned immediately after download (poo#71827), please retrigger":retry because worker cache prunes the asset it just downloaded

After https://github.com/os-autoinst/openQA/pull/3418 is in we can have a more specific auto-review string in this ticket now with "retry" flag.

#7 Updated by mkittler 3 months ago

  • Subject changed from test incompletes with auto_review:"(?s)Download.*successful.*Failed to download.*Asset was pruned immediately after download (poo#71827), please retrigger":retry because worker cache prunes the asset it just downloaded to test incompletes with auto_review:"(?s)Download.*successful.*Failed to download.*Asset was pruned immediately after download":retry because worker cache prunes the asset it just downloaded

Just for the record, there are now actually 2 jobs showing the error from okurz's PR:

I'm changing the title because auto-review fails to associate the incompletes with this ticket (likely the braces needed to be escaped).

#8 Updated by okurz 3 months ago

  • Parent task set to #62420

#9 Updated by okurz 3 months ago

  • Related to action #73285: test incompletes with auto_review:"(?s)Download of.*processed[^:].*Failed to download":retry , not helpful details about reason of error added

#10 Updated by mkittler 3 months ago

Although only point 1. is correct the conclusion is still true: We can not easily detect the "Purging" case on the worker-side because the log message won't show up when the purging of the downloaded asset happened to make room for a completely different asset.

I don't know how to solve this race-free. As mentioned before: We could of course add a lock to prevent parallel purging within the cache service but that would still leave room for the asset being purged when the worker is trying to create the hardlink. We could try to create the hard-link already within the cache service. However, that's difficult because the same download/Minion job might be running for multiple openQA jobs and therefore multiple hardlinks need to be created. kraih Any idea how to solve this? There's no designated way for Minion jobs to communicate with each other, right? We could of course "leave a message" about additional hardlinks within the database.

#11 Updated by mkittler 2 months ago

Encountered more cases today: https://openqa.suse.de/tests/4961797, https://openqa.suse.de/tests/4961771, https://openqa.suse.de/tests/4961784, https://openqa.suse.de/tests/4961792, https://openqa.suse.de/tests/4961774

So multiple jobs are affected but these are all about the asset sle-12-SP5-x86_64-:17046:kernel-ec2-Server-DVD-Incidents-Kernel@64bit-with-ltp.qcow2 and they ran at the same time on the same worker:

openqa=> select id, state, result, t_started, (select host from workers where id = assigned_worker_id) as worker_host from jobs where id in (4961797, 4961771, 4961784, 4961792, 4961774) order by t_started asc;
   id    | state |   result   |      t_started      | worker_host 
---------+-------+------------+---------------------+-------------
 4961797 | done  | incomplete | 2020-11-09 18:33:33 | 10.160.1.20
 4961771 | done  | incomplete | 2020-11-09 18:33:34 | 10.160.1.20
 4961774 | done  | incomplete | 2020-11-09 18:33:34 | 10.160.1.20
 4961784 | done  | incomplete | 2020-11-09 18:33:34 | 10.160.1.20
 4961792 | done  | incomplete | 2020-11-09 18:33:34 | 10.160.1.20

(10.160.1.20 is actually openqaworker9.)

That's the relevant log of the cache service:

Nov 09 19:33:28 openqaworker9 openqa-worker-cacheservice-minion[3240]: [24380] [i] [#1348] Downloading: "sle-12-SP5-x86_64-:17046:kernel-ec2-Server-DVD-Incidents-Kernel@64bit-with-ltp.qcow2"
Nov 09 19:33:28 openqaworker9 openqa-worker-cacheservice-minion[3240]: [24380] [i] [#1348] Cache size of "/var/lib/openqa/cache" is 47GiB, with limit 50GiB
Nov 09 19:33:28 openqaworker9 openqa-worker-cacheservice-minion[3240]: [24380] [i] [#1348] Downloading "sle-12-SP5-x86_64-:17046:kernel-ec2-Server-DVD-Incidents-Kernel@64bit-with-ltp.qcow2" from "http://openqa.suse.de/tests/4961776/asset/hdd/sle-12-SP5-x86_64-:17046:kernel-ec2-Server-DVD-Incidents-Kernel@64bit-wit>
Nov 09 19:34:22 openqaworker9 openqa-worker-cacheservice-minion[3240]: [24736] [i] [#1354] Downloading: "SLE-15-SP1-Installer-DVD-x86_64-GM-DVD1.iso"
Nov 09 19:34:22 openqaworker9 openqa-worker-cacheservice-minion[3240]: [24736] [i] [#1354] Cache size of "/var/lib/openqa/cache" is 47GiB, with limit 50GiB
Nov 09 19:34:22 openqaworker9 openqa-worker-cacheservice-minion[3240]: [24736] [i] [#1354] Downloading "SLE-15-SP1-Installer-DVD-x86_64-GM-DVD1.iso" from "http://openqa.suse.de/tests/4962171/asset/iso/SLE-15-SP1-Installer-DVD-x86_64-GM-DVD1.iso"
Nov 09 19:34:22 openqaworker9 openqa-worker-cacheservice-minion[3240]: [24736] [i] [#1354] Content of "/var/lib/openqa/cache/openqa.suse.de/SLE-15-SP1-Installer-DVD-x86_64-GM-DVD1.iso" has not changed, updating last use
Nov 09 19:34:22 openqaworker9 openqa-worker-cacheservice-minion[3240]: [24736] [i] [#1354] Finished download
Nov 09 19:34:32 openqaworker9 openqa-worker-cacheservice-minion[3240]: [24797] [i] [#1355] Sync: "rsync://openqa.suse.de/tests" to "/var/lib/openqa/cache/openqa.suse.de"
Nov 09 19:34:32 openqaworker9 openqa-worker-cacheservice-minion[3240]: [24797] [i] [#1355] Calling: rsync -avHP rsync://openqa.suse.de/tests/ --delete /var/lib/openqa/cache/openqa.suse.de/tests/
Nov 09 19:34:33 openqaworker9 openqa-worker-cacheservice-minion[3240]: [24797] [i] [#1355] Finished sync: 0
Nov 09 19:36:01 openqaworker9 openqa-worker-cacheservice-minion[3240]: [14671] [i] [#1339] Cache size 47GiB + needed 9.9GiB exceeds limit of 50GiB, purging least used assets
Nov 09 19:36:01 openqaworker9 openqa-worker-cacheservice-minion[3240]: [14671] [i] [#1339] Purging "/var/lib/openqa/cache/openqa.suse.de/SLE-12-SP4-SAP-DVD-x86_64-GM-DVD1.iso" because we need space for new assets, reclaiming 4.1GiB
Nov 09 19:36:01 openqaworker9 openqa-worker-cacheservice-minion[3240]: [14671] [i] [#1339] Purging "/var/lib/openqa/cache/openqa.suse.de/sle-12-SP5-x86_64-:17046:kernel-ec2-Server-DVD-Incidents-Kernel@64bit-with-ltp.qcow2" because we need space for new assets, reclaiming 2GiB
Nov 09 19:36:01 openqaworker9 openqa-worker-cacheservice-minion[3240]: [14671] [i] [#1339] Purging "/var/lib/openqa/cache/openqa.suse.de/SLES-12-SP3-x86_64-mru-install-minimal-with-addons-Build:17088:openldap2-Server-DVD-Incidents-64bit.qcow2" because we need space for new assets, reclaiming 1GiB
Nov 09 19:36:01 openqaworker9 openqa-worker-cacheservice-minion[3240]: [14671] [i] [#1339] Purging "/var/lib/openqa/cache/openqa.suse.de/SLE-12-SP3-Server-DVD-x86_64-GM-DVD1.iso" because we need space for new assets, reclaiming 3.6GiB
Nov 09 19:36:01 openqaworker9 openqa-worker-cacheservice-minion[3240]: [14671] [i] [#1339] Size of "/var/lib/openqa/cache/openqa.suse.de/SLE-15-SP2-Installer-DVD-x86_64-GM-DVD1.iso" is 9.9GiB, with ETag ""279000000-5b1a5e45390d0""
Nov 09 19:36:01 openqaworker9 openqa-worker-cacheservice-minion[3240]: [14671] [i] [#1339] Download of "/var/lib/openqa/cache/openqa.suse.de/SLE-15-SP2-Installer-DVD-x86_64-GM-DVD1.iso" successful, new cache size is 47GiB
Nov 09 19:36:01 openqaworker9 openqa-worker-cacheservice-minion[3240]: [14671] [i] [#1339] Finished download

That the asset is purged to make room for itself is really stupid. I suppose we can at least prevent this case in a simple way.

#12 Updated by mkittler 2 months ago

  • Status changed from Workable to In Progress

PR to fix cases like the one from the previous comment: https://github.com/os-autoinst/openQA/pull/3535


It would also make sense to generally preserve assets while being downloaded.

#13 Updated by okurz 2 months ago

PR merged.

mkittler wrote:

It would also make sense to generally preserve assets while being downloaded.

can you explain that further?

#14 Updated by mkittler 2 months ago

I believe currently an asset might be purged even during its download. I'm not 100 % sure but I suppose this is so far only avoided by preferring the oldest assets when purging. However, an asset "ages" already during its download and as we've seen that might be enough so that the asset is purged right after its own download. The current fix only ensures that the very stupid case of purging an asset to make room for itself is prevented. However, the purging might have as well be triggered by a concurrent download and in this case we would have still purged the asset.

So I'd introduce a pending flag and prevent such assets from being purged. The pending flag would be unset after the download when we currently already bump the "last use". Likely we should also clear the flag on the service startup to avoid leftovers.

#15 Updated by kraih 2 months ago

mkittler wrote:

I believe currently an asset might be purged even during its download. I'm not 100 % sure but I suppose this is so far only avoided by preferring the oldest assets when purging. However, an asset "ages" already during its download and as we've seen that might be enough so that the asset is purged right after its own download. The current fix only ensures that the very stupid case of purging an asset to make room for itself is prevented. However, the purging might have as well be triggered by a concurrent download and in this case we would have still purged the asset.

Yes, i believe you are correct. If assets age so fast that they expire while they are still being downloaded that that could actually happen.

So I'd introduce a pending flag and prevent such assets from being purged. The pending flag would be unset after the download when we currently already bump the "last use". Likely we should also clear the flag on the service startup to avoid leftovers.

Flagged entries would be incomplete, so they could just be removed to avoid other side effects.

#17 Updated by Xiaojing_liu 2 months ago

On o3 some jobs are incomplete because of setup failure: Cache service status error 500: Internal Server Error. Not sure if it's related to the pr#3545.
An example: https://openqa.opensuse.org/tests/1469375#

#18 Updated by mkittler 2 months ago

My conclusion so far: The migration added by PR might have the existing problems with corrupted SQLite databases worse. However, the migration and queries generally work in production. By the way, the migration worked on other updated worker hosts and I couldn't reproduce it locally as well.

Just for the record, the error messages look like this on the worker host:

Nov 13 03:33:33 openqaworker4 openqa-workercache-daemon[2288]: [2425] [i] Worker 2425 started
Nov 13 03:33:38 openqaworker4 openqa-workercache-daemon[2288]: [2425] [e] [M7EWttEz] DBD::SQLite::st execute failed: database disk image is malformed at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/SQLite.pm line 115.
Nov 13 03:33:38 openqaworker4 openqa-workercache-daemon[2288]:  at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/SQLite.pm line 115.
Nov 13 03:33:38 openqaworker4 openqa-workercache-daemon[2288]: [2422] [e] [qyZyZIUC] DBD::SQLite::st execute failed: database disk image is malformed at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 363.
Nov 13 03:33:38 openqaworker4 openqa-workercache-daemon[2288]:  at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 363.
Nov 13 03:33:38 openqaworker4 openqa-workercache-daemon[2288]: [2422] [e] [sEiUhW7b] DBD::SQLite::st execute failed: database disk image is malformed at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 363.
Nov 13 03:33:38 openqaworker4 openqa-workercache-daemon[2288]:  at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 363.
Nov 13 03:38:05 openqaworker4 openqa-workercache-daemon[2288]: [2424] [e] [wJIj5DWB] DBD::SQLite::st execute failed: database disk image is malformed at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 138.
Nov 13 03:38:05 openqaworker4 openqa-workercache-daemon[2288]:  at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 138.
Nov 13 11:12:55 openqaworker4 systemd[1]: Stopping OpenQA Worker Cache Service...
Nov 13 11:12:55 openqaworker4 openqa-workercache-daemon[2288]: [2288] [w] Stopping worker 2424 immediately
Nov 13 11:12:55 openqaworker4 openqa-workercache-daemon[2288]: [2288] [w] Stopping worker 2422 immediately
Nov 13 11:12:55 openqaworker4 openqa-workercache-daemon[2288]: [2288] [w] Stopping worker 2425 immediately
Nov 13 11:12:55 openqaworker4 openqa-workercache-daemon[2288]: [2288] [w] Stopping worker 2423 immediately
Nov 13 11:12:55 openqaworker4 openqa-workercache-daemon[2288]: [2288] [i] Worker 2425 stopped
Nov 13 11:12:55 openqaworker4 openqa-workercache-daemon[2288]: [2288] [i] Worker 2424 stopped
Nov 13 11:12:55 openqaworker4 openqa-workercache-daemon[2288]: [2288] [w] Stopping worker 2424 immediately
Nov 13 11:12:55 openqaworker4 openqa-workercache-daemon[2288]: [2288] [w] Stopping worker 2422 immediately
Nov 13 11:12:55 openqaworker4 openqa-workercache-daemon[2288]: [2288] [w] Stopping worker 2423 immediately
Nov 13 11:12:55 openqaworker4 openqa-workercache-daemon[2288]: [2288] [i] Worker 2422 stopped
Nov 13 11:12:55 openqaworker4 openqa-workercache-daemon[2288]: [2288] [w] Stopping worker 2423 immediately
Nov 13 11:12:55 openqaworker4 openqa-workercache-daemon[2288]: [2288] [i] Worker 2423 stopped
Nov 13 11:12:55 openqaworker4 openqa-workercache-daemon[2288]: [2288] [i] Manager 2288 stopped
Nov 13 11:12:55 openqaworker4 systemd[1]: Stopped OpenQA Worker Cache Service.

Even after restarting I still got

Nov 13 11:12:55 openqaworker4 systemd[1]: Started OpenQA Worker Cache Service.
Nov 13 11:12:55 openqaworker4 openqa-workercache-daemon[3805]: [3805] [i] [gAI627UE] Purging "foo" because it appears pending after service startup
Nov 13 11:12:55 openqaworker4 openqa-workercache-daemon[3805]: [3805] [i] [gAI627UE] Cache size of "/var/lib/openqa/cache" is 8.6GiB, with limit 50GiB
Nov 13 11:12:55 openqaworker4 openqa-workercache-daemon[3805]: [3805] [i] Listening at "http://127.0.0.1:9530"
Nov 13 11:12:55 openqaworker4 openqa-workercache-daemon[3805]: [3805] [i] Listening at "http://[::1]:9530"
Nov 13 11:12:55 openqaworker4 openqa-workercache-daemon[3805]: [3805] [i] Manager 3805 started
Nov 13 11:12:55 openqaworker4 openqa-workercache-daemon[3805]: Web application available at http://127.0.0.1:9530
Nov 13 11:12:55 openqaworker4 openqa-workercache-daemon[3805]: Web application available at http://[::1]:9530
Nov 13 11:12:55 openqaworker4 openqa-workercache-daemon[3805]: [3809] [i] Worker 3809 started
Nov 13 11:12:55 openqaworker4 openqa-workercache-daemon[3805]: [3810] [i] Worker 3810 started
Nov 13 11:12:55 openqaworker4 openqa-workercache-daemon[3805]: [3811] [i] Worker 3811 started
Nov 13 11:12:55 openqaworker4 openqa-workercache-daemon[3805]: [3805] [i] Creating process id file "/tmp/prefork.pid"
Nov 13 11:12:55 openqaworker4 openqa-workercache-daemon[3805]: [3812] [i] Worker 3812 started
Nov 13 11:20:28 openqaworker4 openqa-workercache-daemon[3805]: [3810] [e] [pA0QM9KS] DBD::SQLite::db prepare_cached failed: no such table: minion_jobs at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 138.
Nov 13 11:20:38 openqaworker4 openqa-workercache-daemon[3805]: [3810] [e] [UB1x20qX] DBD::SQLite::db prepare_cached failed: no such table: minion_jobs at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 363.
Nov 13 11:20:38 openqaworker4 openqa-workercache-daemon[3805]:  at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 363.
Nov 13 11:20:40 openqaworker4 openqa-workercache-daemon[3805]: [3810] [e] [XywseRLg] DBD::SQLite::db prepare_cached failed: no such table: minion_jobs at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 363.
Nov 13 11:20:40 openqaworker4 openqa-workercache-daemon[3805]:  at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 363.

That's the log of the -minion service at the same time:

Nov 13 03:37:57 openqaworker4 openqa-worker-cacheservice-minion[2289]: [2490] [i] [#165669] Size of "/var/lib/openqa/cache/openqa1-opensuse/openSUSE-Staging-DVD-x86_64.qcow2" is 2.4GiB, with ETag ""9744fe00-553b4174305b3""
Nov 13 03:37:57 openqaworker4 openqa-worker-cacheservice-minion[2289]: [2490] [i] [#165669] Download of "/var/lib/openqa/cache/openqa1-opensuse/openSUSE-Staging-DVD-x86_64.qcow2" successful, new cache size is 8.6GiB
Nov 13 03:37:57 openqaworker4 openqa-worker-cacheservice-minion[2289]: [2490] [i] [#165669] Finished download
Nov 13 03:38:05 openqaworker4 openqa-worker-cacheservice-minion[2289]: DBD::SQLite::st execute failed: database disk image is malformed at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 422.
Nov 13 03:38:05 openqaworker4 openqa-worker-cacheservice-minion[2289]:  at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 422.
Nov 13 03:38:05 openqaworker4 openqa-worker-cacheservice-minion[2289]:  at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Command/minion/worker.pm line 26.
Nov 13 03:38:05 openqaworker4 systemd[1]: openqa-worker-cacheservice-minion.service: Main process exited, code=exited, status=4/NOPERMISSION
Nov 13 03:38:05 openqaworker4 systemd[1]: openqa-worker-cacheservice-minion.service: Unit entered failed state.
Nov 13 03:38:05 openqaworker4 systemd[1]: openqa-worker-cacheservice-minion.service: Failed with result 'exit-code'.
Nov 13 03:38:05 openqaworker4 systemd[1]: openqa-worker-cacheservice-minion.service: Service RestartSec=100ms expired, scheduling restart.
Nov 13 03:38:05 openqaworker4 systemd[1]: Stopped OpenQA Worker Cache Service Minion.
Nov 13 03:38:05 openqaworker4 systemd[1]: Started OpenQA Worker Cache Service Minion.
Nov 13 03:38:05 openqaworker4 openqa-worker-cacheservice-minion[2517]: [2517] [i] [7csBjR0R] Purging "/var/lib/openqa/cache/openqa1-opensuse/openSUSE-Staging:K-Tumbleweed-DVD-x86_64-Build246.1-Media.iso" because it appears pending after service startup
Nov 13 03:38:05 openqaworker4 openqa-worker-cacheservice-minion[2517]: [2517] [i] [7csBjR0R] Purging "/var/lib/openqa/cache/openqa1-opensuse/openSUSE-Staging:B-Tumbleweed-DVD-x86_64-Build355.20-Media.iso" because it appears pending after service startup
Nov 13 03:38:05 openqaworker4 openqa-worker-cacheservice-minion[2517]: [2517] [i] [7csBjR0R] Cache size of "/var/lib/openqa/cache" is 8.6GiB, with limit 50GiB
Nov 13 03:38:05 openqaworker4 openqa-worker-cacheservice-minion[2517]: [2517] [i] Resetting all leftover locks after restart
Nov 13 03:38:05 openqaworker4 openqa-worker-cacheservice-minion[2517]: [2517] [i] Worker 2517 started
Nov 13 03:38:05 openqaworker4 openqa-worker-cacheservice-minion[2517]: DBD::SQLite::st execute failed: database disk image is malformed at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 422.
Nov 13 03:38:05 openqaworker4 openqa-worker-cacheservice-minion[2517]:  at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 422.
Nov 13 03:38:05 openqaworker4 openqa-worker-cacheservice-minion[2517]:  at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Command/minion/worker.pm line 26.
Nov 13 03:38:05 openqaworker4 systemd[1]: openqa-worker-cacheservice-minion.service: Main process exited, code=exited, status=255/n/a
Nov 13 03:38:05 openqaworker4 systemd[1]: openqa-worker-cacheservice-minion.service: Unit entered failed state.
Nov 13 03:38:05 openqaworker4 systemd[1]: openqa-worker-cacheservice-minion.service: Failed with result 'exit-code'.
…
Nov 13 11:14:16 openqaworker4 systemd[1]: Stopped OpenQA Worker Cache Service Minion.
Nov 13 11:14:16 openqaworker4 systemd[1]: Started OpenQA Worker Cache Service Minion.
Nov 13 11:14:17 openqaworker4 openqa-worker-cacheservice-minion[3842]: [3842] [i] [_VrdFEyu] Cache size of "/var/lib/openqa/cache" is 8.6GiB, with limit 50GiB
Nov 13 11:14:17 openqaworker4 openqa-worker-cacheservice-minion[3842]: [3842] [i] Resetting all leftover locks after restart
Nov 13 11:14:17 openqaworker4 openqa-worker-cacheservice-minion[3842]: [3842] [i] Worker 3842 started
Nov 13 11:14:17 openqaworker4 openqa-worker-cacheservice-minion[3842]: DBD::SQLite::st execute failed: database disk image is malformed at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 301.
Nov 13 11:14:17 openqaworker4 openqa-worker-cacheservice-minion[3842]:  at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 301.
Nov 13 11:14:17 openqaworker4 openqa-worker-cacheservice-minion[3842]:  at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Command/minion/worker.pm line 26.
Nov 13 11:14:17 openqaworker4 systemd[1]: openqa-worker-cacheservice-minion.service: Main process exited, code=exited, status=255/n/a
Nov 13 11:14:17 openqaworker4 systemd[1]: openqa-worker-cacheservice-minion.service: Unit entered failed state.
Nov 13 11:14:17 openqaworker4 systemd[1]: openqa-worker-cacheservice-minion.service: Failed with result 'exit-code'.
Nov 13 11:14:17 openqaworker4 systemd[1]: openqa-worker-cacheservice-minion.service: Service RestartSec=100ms expired, scheduling restart.
Nov 13 11:14:17 openqaworker4 systemd[1]: Stopped OpenQA Worker Cache Service Minion.
Nov 13 11:14:17 openqaworker4 systemd[1]: openqa-worker-cacheservice-minion.service: Start request repeated too quickly.
Nov 13 11:14:17 openqaworker4 systemd[1]: Failed to start OpenQA Worker Cache Service Minion.
Nov 13 11:14:17 openqaworker4 systemd[1]: openqa-worker-cacheservice-minion.service: Unit entered failed state.
Nov 13 11:14:17 openqaworker4 systemd[1]: openqa-worker-cacheservice-minion.service: Failed with result 'exit-code'.
Nov 13 12:05:10 openqaworker4 systemd[1]: Started OpenQA Worker Cache Service Minion.
Nov 13 12:05:10 openqaworker4 openqa-worker-cacheservice-minion[3974]: [3974] [i] [hhbV1BD_] Purging "/var/lib/openqa/cache/openqa1-opensuse/openSUSE-Staging-DVD-x86_64.qcow2" because the asset is not registered
Nov 13 12:05:10 openqaworker4 openqa-worker-cacheservice-minion[3974]: [3974] [i] [hhbV1BD_] Purging "/var/lib/openqa/cache/openqa1-opensuse/openSUSE-Staging:J-Kubic-DVD-x86_64-Build324.2-Media.iso" because the asset is not registered
Nov 13 12:05:10 openqaworker4 openqa-worker-cacheservice-minion[3974]: [3974] [i] [hhbV1BD_] Purging "/var/lib/openqa/cache/openqa1-opensuse/openSUSE-Staging:J-MicroOS-DVD-x86_64-Build324.4-Media.iso" because the asset is not registered
Nov 13 12:05:11 openqaworker4 openqa-worker-cacheservice-minion[3974]: [3974] [i] [hhbV1BD_] Purging "/var/lib/openqa/cache/openqa1-opensuse/openSUSE-Staging:J-Tumbleweed-DVD-x86_64-Build324.5-Media.iso" because the asset is not registered
Nov 13 12:05:11 openqaworker4 openqa-worker-cacheservice-minion[3974]: [3974] [i] [hhbV1BD_] Cache size of "/var/lib/openqa/cache" is 0 Byte, with limit 50GiB
Nov 13 12:05:11 openqaworker4 openqa-worker-cacheservice-minion[3974]: [3974] [i] Resetting all leftover locks after restart
Nov 13 12:05:11 openqaworker4 openqa-worker-cacheservice-minion[3974]: [3974] [i] Worker 3974 started
Nov 13 12:07:43 openqaworker4 systemd[1]: Stopping OpenQA Worker Cache Service Minion...
Nov 13 12:07:46 openqaworker4 openqa-worker-cacheservice-minion[3974]: [3974] [i] Worker 3974 stopped
Nov 13 12:07:46 openqaworker4 systemd[1]: Stopped OpenQA Worker Cache Service Minion.
Nov 13 12:07:46 openqaworker4 systemd[1]: Started OpenQA Worker Cache Service Minion.
Nov 13 12:07:46 openqaworker4 openqa-worker-cacheservice-minion[4002]: [4002] [i] [y4Mp0J15] Cache size of "/var/lib/openqa/cache" is 0 Byte, with limit 50GiB
Nov 13 12:07:46 openqaworker4 openqa-worker-cacheservice-minion[4002]: [4002] [i] Resetting all leftover locks after restart
Nov 13 12:07:46 openqaworker4 openqa-worker-cacheservice-minion[4002]: [4002] [i] Worker 4002 started
Nov 13 12:07:52 openqaworker4 openqa-worker-cacheservice-minion[4002]: [4013] [i] [#1] Downloading: "openSUSE-Staging:F-Tumbleweed-DVD-x86_64-Build463.6-Media.iso.sha256"
Nov 13 12:07:52 openqaworker4 openqa-worker-cacheservice-minion[4002]: [4013] [i] [#1] Cache size of "/var/lib/openqa/cache" is 0 Byte, with limit 50GiB
Nov 13 12:07:52 openqaworker4 openqa-worker-cacheservice-minion[4002]: [4013] [i] [#1] Downloading "openSUSE-Staging:F-Tumbleweed-DVD-x86_64-Build463.6-Media.iso.sha256" from "http://openqa1-opensuse/tests/1470110/asset/other/openSUSE-Staging:F-Tumbleweed-DVD-x86_64-Build463.6-Media.iso.sha256"
Nov 13 12:07:52 openqaworker4 openqa-worker-cacheservice-minion[4002]: [4013] [i] [#1] Size of "/var/lib/openqa/cache/openqa1-opensuse/openSUSE-Staging:F-Tumbleweed-DVD-x86_64-Build463.6-Media.iso.sha256" is 118 Byte, with ETag ""76-5b3fad73dbfeb""
Nov 13 12:07:52 openqaworker4 openqa-worker-cacheservice-minion[4002]: [4013] [i] [#1] Download of "/var/lib/openqa/cache/openqa1-opensuse/openSUSE-Staging:F-Tumbleweed-DVD-x86_64-Build463.6-Media.iso.sha256" successful, new cache size is 118 Byte
Nov 13 12:07:52 openqaworker4 openqa-worker-cacheservice-minion[4002]: [4013] [i] [#1] Finished download

After restarting the -minion service, the openqa-workercache service works again as well:

Nov 13 12:07:46 openqaworker4 systemd[1]: Stopping OpenQA Worker Cache Service...
Nov 13 12:07:46 openqaworker4 openqa-workercache-daemon[3805]: [3805] [w] Stopping worker 3811 immediately
Nov 13 12:07:46 openqaworker4 openqa-workercache-daemon[3805]: [3805] [w] Stopping worker 3812 immediately
Nov 13 12:07:46 openqaworker4 openqa-workercache-daemon[3805]: [3805] [w] Stopping worker 3809 immediately
Nov 13 12:07:46 openqaworker4 openqa-workercache-daemon[3805]: [3805] [w] Stopping worker 3810 immediately
Nov 13 12:07:46 openqaworker4 openqa-workercache-daemon[3805]: [3805] [i] Worker 3810 stopped
Nov 13 12:07:46 openqaworker4 openqa-workercache-daemon[3805]: [3805] [i] Worker 3811 stopped
Nov 13 12:07:46 openqaworker4 openqa-workercache-daemon[3805]: [3805] [i] Worker 3812 stopped
Nov 13 12:07:46 openqaworker4 openqa-workercache-daemon[3805]: [3805] [w] Stopping worker 3809 immediately
Nov 13 12:07:46 openqaworker4 openqa-workercache-daemon[3805]: [3805] [i] Worker 3809 stopped
Nov 13 12:07:46 openqaworker4 openqa-workercache-daemon[3805]: [3805] [i] Manager 3805 stopped
Nov 13 12:07:46 openqaworker4 systemd[1]: Stopped OpenQA Worker Cache Service.
Nov 13 12:07:46 openqaworker4 systemd[1]: Started OpenQA Worker Cache Service.
Nov 13 12:07:46 openqaworker4 openqa-workercache-daemon[4001]: [4001] [i] [sGYWrUvD] Cache size of "/var/lib/openqa/cache" is 0 Byte, with limit 50GiB
Nov 13 12:07:46 openqaworker4 openqa-workercache-daemon[4001]: [4001] [i] Listening at "http://127.0.0.1:9530"
Nov 13 12:07:46 openqaworker4 openqa-workercache-daemon[4001]: [4001] [i] Listening at "http://[::1]:9530"
Nov 13 12:07:46 openqaworker4 openqa-workercache-daemon[4001]: [4001] [i] Manager 4001 started
Nov 13 12:07:46 openqaworker4 openqa-workercache-daemon[4001]: Web application available at http://127.0.0.1:9530
Nov 13 12:07:46 openqaworker4 openqa-workercache-daemon[4001]: Web application available at http://[::1]:9530
Nov 13 12:07:46 openqaworker4 openqa-workercache-daemon[4001]: [4005] [i] Worker 4005 started
Nov 13 12:07:46 openqaworker4 openqa-workercache-daemon[4001]: [4006] [i] Worker 4006 started
Nov 13 12:07:46 openqaworker4 openqa-workercache-daemon[4001]: [4007] [i] Worker 4007 started
Nov 13 12:07:46 openqaworker4 openqa-workercache-daemon[4001]: [4001] [i] Creating process id file "/tmp/prefork.pid"
Nov 13 12:07:46 openqaworker4 openqa-workercache-daemon[4001]: [4008] [i] Worker 4008 started

Purging cache directory because database has been corrupted has never been logged. So the recovery did not involve deleting the SQLite file and the errors occurred during the runtime of the services (and not at startup). At any time I've tried to access the database file via sqlite3 I didn't run into any problems. I also inserted some data for testing the queries my PR used manually. That's the reason for the Purging "foo" messages.


I would not revert the PR but keep monitoring the situation. Likely the problem won't re-occur because the migration only runs once (if the problem was even related to the PR at all). A simple restart of both services (openqa-workercache and openqa-workercache-minion) should help if the issue neverthless re-occurs. I didn't need to remove the SQLite database but if the restart doesn't help one can still try rm /var/lib/openqa/cache/cache.sqlite* (and restart the services again). Before deleting, please backup the old files for investigation purposes. Use a different directory to store the backup because the services will wipe out the cache directory. Also backup the -wal and -shm files.

#19 Updated by Xiaojing_liu 2 months ago

  • Related to action #78163: After OSD upgrade, many jobs incomplete with "Cache service status error 500: Internal Server Error" added

#20 Updated by Xiaojing_liu 2 months ago

After OSD is deployed today, this issue happens again.
I backup the sqlite files in openqaworker6
sudo cp cache/cache.sqlite* cache_backup/
If you need to check the files, you could find them in /var/lib/openqa/cache_backup/ on worker6

#21 Updated by mkittler 2 months ago

  • Related to deleted (action #78163: After OSD upgrade, many jobs incomplete with "Cache service status error 500: Internal Server Error")

#22 Updated by mkittler 2 months ago

  • Status changed from In Progress to Resolved

The SQLite problems are nothing new and are handled in #67000.

The actual problem to be solved here can still happen when the asset is purged by another download before the worker picks it up. However, fixing that is a bit more involved and given the supposedly small probability for this to happen it is likely not worth fixing. I've added this outstanding point to #65271.

#23 Updated by okurz 2 months ago

  • Status changed from Resolved to Feedback

sorry if I have to reopen but we have encountered at lot of these issues in https://gitlab.suse.de/openqa/auto-review/-/jobs/290493 and I think we are not using resolved ticket for automatically marking so the gitlab CI pipeline failed. Maybe we should look into closed tickets as well in the automatic labelling and retry script or we ignore that error differently.

#24 Updated by okurz 2 months ago

  • Status changed from Feedback to Resolved

hm, sorry. I think the problem is that the regex needs to change afterall, but not here :)

#25 Updated by okurz 2 months ago

  • Subject changed from test incompletes with auto_review:"(?s)Download.*successful.*Failed to download.*Asset was pruned immediately after download":retry because worker cache prunes the asset it just downloaded to test incompletes with auto_review:"(?s)Failed to download.*Asset was pruned immediately after download":retry because worker cache prunes the asset it just downloaded

#26 Updated by okurz 2 months ago

  • Copied to action #80118: test incompletes with auto_review:"(?s)Failed to download.*Asset was pruned immediately after download":retry, not effective on osd, or second fix needed added

Also available in: Atom PDF