Project

General

Profile

Actions

action #67000

closed

coordination #39719: [saga][epic] Detection of "known failures" for stable tests, easy test results review and easy tracking of known issues

coordination #62420: [epic] Distinguish all types of incompletes

Job incompletes due to malformed worker cache database disk image with auto_review:"Cache service status error.*(database disk image is malformed|Specified job ID is invalid).*":retry

Added by okurz over 4 years ago. Updated about 4 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2020-05-18
Due date:
% Done:

0%

Estimated time:

Description

Observation

https://openqa.opensuse.org/tests/1271133 incompleted with, maybe due to out-of-space corrupting db content?

Likely error from autoinst-log.txt:

[2020-05-18T14:21:52.0879 CEST] [info] +++ setup notes +++
[2020-05-18T14:21:52.0879 CEST] [info] Running on openqa-aarch64:5 (Linux 4.12.14-lp151.28.48-default #1 SMP Fri Apr 17 05:38:36 UTC 2020 (18849d1) aarch64)
[2020-05-18T14:21:52.0891 CEST] [debug] Found ASSET_1, caching Tumbleweed.aarch64-1.0-libvirt_aarch64-Snapshot20200517.vagrant.libvirt.box
[2020-05-18T14:21:52.0895 CEST] [info] Downloading Tumbleweed.aarch64-1.0-libvirt_aarch64-Snapshot20200517.vagrant.libvirt.box, request #14735 sent to Cache Service
[2020-05-18T14:22:42.0696 CEST] [info] Download of Tumbleweed.aarch64-1.0-libvirt_aarch64-Snapshot20200517.vagrant.libvirt.box processed
[2020-05-18T14:22:42.0714 CEST] [debug] Found ASSET_256, caching openSUSE-Tumbleweed-DVD-aarch64-Snapshot20200517-Media.iso.sha256
[2020-05-18T14:22:42.0719 CEST] [info] Downloading openSUSE-Tumbleweed-DVD-aarch64-Snapshot20200517-Media.iso.sha256, request #14738 sent to Cache Service
[2020-05-18T14:22:52.0865 CEST] [info] Download of openSUSE-Tumbleweed-DVD-aarch64-Snapshot20200517-Media.iso.sha256 processed:
[info] [#14738] Cache size of "/var/lib/openqa/cache" is 157GiB, with limit 160GiB
[info] [#14738] Downloading "openSUSE-Tumbleweed-DVD-aarch64-Snapshot20200517-Media.iso.sha256" from "http://openqa1-opensuse/tests/1271133/asset/other/openSUSE-Tumbleweed-DVD-aarch64-Snapshot20200517-Media.iso.sha256"
[info] [#14738] Content of "/var/lib/openqa/cache/openqa1-opensuse/openSUSE-Tumbleweed-DVD-aarch64-Snapshot20200517-Media.iso.sha256" has not changed, updating last use

[2020-05-18T14:22:52.0874 CEST] [debug] Found HDD_1, caching opensuse-Tumbleweed-aarch64-20200517-textmode@aarch64.qcow2
[2020-05-18T14:22:52.0879 CEST] [info] Downloading opensuse-Tumbleweed-aarch64-20200517-textmode@aarch64.qcow2, request #14740 sent to Cache Service
[2020-05-18T14:22:57.0961 CEST] [info] Download of opensuse-Tumbleweed-aarch64-20200517-textmode@aarch64.qcow2 processed
[2020-05-18T14:22:57.0962 CEST] [error] Failed to download opensuse-Tumbleweed-aarch64-20200517-textmode@aarch64.qcow2 to /var/lib/openqa/cache/openqa1-opensuse/opensuse-Tumbleweed-aarch64-20200517-textmode@aarch64.qcow2
[2020-05-18T14:22:57.0992 CEST] [info] +++ worker notes +++
[2020-05-18T14:22:57.0992 CEST] [info] End time: 2020-05-18 12:22:57
[2020-05-18T14:22:57.0993 CEST] [info] Result: setup failure
[2020-05-18T14:22:58.0019 CEST] [info] Uploading autoinst-log.txt

This might be related to my actions earlier this day. Previously the cache directory was set to a size of 200GB and I had to reduce that to 160GB as the available space was exceeded in before. After changing the configuration I restarted all services, journal output:

May 18 08:10:21 openqa-aarch64 openqa-worker-cacheservice-minion[11087]: [11087] [i] [raw1f4qa] Cache size 198GiB + needed 0 Byte exceeds limit of 160GiB, purging least used assets
May 18 08:10:21 openqa-aarch64 openqa-workercache-daemon[11086]: [11086] [i] [QgQl857h] Cache size 198GiB + needed 0 Byte exceeds limit of 160GiB, purging least used assets
May 18 08:10:21 openqa-aarch64 openqa-worker-cacheservice-minion[11087]: [11087] [i] [raw1f4qa] Purging "/var/lib/openqa/cache/openqa1-opensuse/opensuse-15.2-aarch64-164.1-gnome-x11@aarch64.qcow2" because we need space for n>
May 18 08:10:21 openqa-aarch64 openqa-workercache-daemon[11086]: [11086] [i] [QgQl857h] Purging "/var/lib/openqa/cache/openqa1-opensuse/opensuse-15.2-aarch64-164.1-gnome-x11@aarch64.qcow2" because we need space for new asset>
May 18 08:10:21 openqa-aarch64 openqa-workercache-daemon[11086]: [11086] [i] [QgQl857h] Purging "/var/lib/openqa/cache/openqa1-opensuse/opensuse-15.2-aarch64-164.1-gnome-x11@aarch64-uefi-vars.qcow2" because we need space for>
May 18 08:10:21 openqa-aarch64 openqa-workercache-daemon[11086]: [11086] [i] [QgQl857h] Purging "/var/lib/openqa/cache/openqa1-opensuse/opensuse-15.2-aarch64-164.1-xfce@aarch64.qcow2" because we need space for new assets, re>
May 18 08:10:21 openqa-aarch64 openqa-workercache-daemon[11086]: [11086] [i] [QgQl857h] Purging "/var/lib/openqa/cache/openqa1-opensuse/opensuse-15.2-aarch64-164.1-xfce@aarch64-uefi-vars.qcow2" because we need space for new >
May 18 08:10:21 openqa-aarch64 openqa-workercache-daemon[11086]: [11086] [i] [QgQl857h] Purging "/var/lib/openqa/cache/openqa1-opensuse/openSUSE-Leap-15.2-ARM-JeOS-efi.aarch64-2020.02.26-Build1.92.raw.xz.sha256" because we n>
May 18 08:10:21 openqa-aarch64 openqa-workercache-daemon[11086]: [11086] [i] [QgQl857h] Purging "/var/lib/openqa/cache/openqa1-opensuse/openSUSE-Leap-15.2-ARM-JeOS-efi.aarch64-2020.02.26-Build1.92.raw.xz" because we need spa>
May 18 08:10:21 openqa-aarch64 openqa-worker-cacheservice-minion[11087]: [11087] [i] [raw1f4qa] Purging "/var/lib/openqa/cache/openqa1-opensuse/opensuse-15.2-aarch64-164.1-gnome-x11@aarch64-uefi-vars.qcow2" because we need s>
May 18 08:10:21 openqa-aarch64 openqa-worker-cacheservice-minion[11087]: [11087] [i] [raw1f4qa] Purging "/var/lib/openqa/cache/openqa1-opensuse/opensuse-15.2-aarch64-164.1-xfce@aarch64.qcow2" because we need space for new as>
May 18 08:10:21 openqa-aarch64 openqa-worker-cacheservice-minion[11087]: [11087] [i] [raw1f4qa] Purging "/var/lib/openqa/cache/openqa1-opensuse/opensuse-15.2-aarch64-164.1-xfce@aarch64-uefi-vars.qcow2" because we need space >
May 18 08:10:21 openqa-aarch64 openqa-worker-cacheservice-minion[11087]: [11087] [i] [raw1f4qa] Purging "/var/lib/openqa/cache/openqa1-opensuse/openSUSE-Leap-15.2-ARM-JeOS-efi.aarch64-2020.02.26-Build1.92.raw.xz.sha256" beca>
May 18 08:10:21 openqa-aarch64 openqa-worker-cacheservice-minion[11087]: [11087] [i] [raw1f4qa] Purging "/var/lib/openqa/cache/openqa1-opensuse/openSUSE-Leap-15.2-ARM-JeOS-efi.aarch64-2020.02.26-Build1.92.raw.xz" because we >
…
May 18 08:10:22 openqa-aarch64 openqa-worker-cacheservice-minion[11087]: [11087] [i] [raw1f4qa] Purging "/var/lib/openqa/cache/openqa1-opensuse/opensuse-15.2-aarch64-164.3-minimalx@aarch64.qcow2" because we need space for ne>
May 18 08:10:22 openqa-aarch64 openqa-workercache-daemon[11086]: [11086] [i] [QgQl857h] Purging "/var/lib/openqa/cache/openqa1-opensuse/opensuse-15.2-aarch64-164.3-minimalx@aarch64.qcow2" because we need space for new assets>
May 18 08:10:22 openqa-aarch64 openqa-workercache-daemon[11086]: [11086] [i] [QgQl857h] Cache size of "/var/lib/openqa/cache" is 159GiB, with limit 160GiB
May 18 08:10:22 openqa-aarch64 openqa-worker-cacheservice-minion[11087]: [11087] [i] [raw1f4qa] Cache size of "/var/lib/openqa/cache" is 159GiB, with limit 160GiB
May 18 08:10:22 openqa-aarch64 openqa-worker-cacheservice-minion[11087]: [11087] [i] Worker 11087 started
May 18 08:10:22 openqa-aarch64 openqa-workercache-daemon[11086]: [11086] [i] Listening at "http://127.0.0.1:9530"
May 18 08:10:22 openqa-aarch64 openqa-workercache-daemon[11086]: [11086] [i] Listening at "http://[::1]:9530"
May 18 08:10:22 openqa-aarch64 openqa-workercache-daemon[11086]: [11086] [i] Manager 11086 started
May 18 08:10:22 openqa-aarch64 openqa-workercache-daemon[11086]: Server available at http://127.0.0.1:9530
May 18 08:10:22 openqa-aarch64 openqa-workercache-daemon[11086]: Server available at http://[::1]:9530
May 18 08:10:22 openqa-aarch64 openqa-workercache-daemon[11086]: [11090] [i] Worker 11090 started
May 18 08:10:22 openqa-aarch64 openqa-workercache-daemon[11086]: [11091] [i] Worker 11091 started
May 18 08:10:22 openqa-aarch64 openqa-workercache-daemon[11086]: [11092] [i] Worker 11092 started
May 18 08:10:22 openqa-aarch64 openqa-workercache-daemon[11086]: [11086] [i] Creating process id file "/tmp/prefork.pid"
May 18 08:10:22 openqa-aarch64 openqa-workercache-daemon[11086]: [11093] [i] Worker 11093 started
May 18 08:10:26 openqa-aarch64 worker[2847]: [info] +++ worker notes +++
May 18 08:10:26 openqa-aarch64 worker[2847]: [info] End time: 2020-05-18 06:10:26
May 18 08:10:26 openqa-aarch64 worker[2847]: [info] Result: quit
May 18 08:10:26 openqa-aarch64 worker[2835]: [info] +++ worker notes +++
May 18 08:10:26 openqa-aarch64 worker[2835]: [info] End time: 2020-05-18 06:10:26
May 18 08:10:26 openqa-aarch64 worker[2835]: [info] Result: quit
…
May 18 08:10:32 openqa-aarch64 openqa-workercache-daemon[11086]: [11092] [e] [H-9wvmR1] DBD::SQLite::st execute failed: database disk image is malformed at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 339.
May 18 08:10:32 openqa-aarch64 openqa-workercache-daemon[11086]:  at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 339.
May 18 08:10:32 openqa-aarch64 worker[11162]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 157
May 18 08:10:32 openqa-aarch64 worker[11160]: [error] Worker cache not available: Cache service returned error 500: Internal Server Error
May 18 08:10:32 openqa-aarch64 worker[11160]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 173
May 18 08:10:32 openqa-aarch64 openqa-workercache-daemon[11086]: [11092] [e] [YaWNG1Vj] DBD::SQLite::st execute failed: database disk image is malformed at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 339.
May 18 08:10:32 openqa-aarch64 openqa-workercache-daemon[11086]:  at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 339.
May 18 08:10:32 openqa-aarch64 worker[11165]: [error] Worker cache not available: Cache service returned error 500: Internal Server Error
May 18 08:10:32 openqa-aarch64 worker[11165]: [info] CACHE: caching is enabled, setting up /var/lib/openqa/cache/openqa1-opensuse
May 18 08:10:32 openqa-aarch64 worker[11165]: [info] Project dir for host http://openqa1-opensuse is /var/lib/openqa/share

observe the error lines:

May 18 08:10:32 openqa-aarch64 openqa-workercache-daemon[11086]: [11092] [e] [H-9wvmR1] DBD::SQLite::st execute failed: database disk image is malformed at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 339.
May 18 08:10:32 openqa-aarch64 openqa-workercache-daemon[11086]:  at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 339.
…
May 18 08:10:32 openqa-aarch64 worker[11160]: [error] Worker cache not available: Cache service returned error 500: Internal Server Error
…
May 18 08:10:32 openqa-aarch64 openqa-workercache-daemon[11086]: [11092] [e] [YaWNG1Vj] DBD::SQLite::st execute failed: database disk image is malformed at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 339.
May 18 08:10:32 openqa-aarch64 openqa-workercache-daemon[11086]:  at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 339.
May 18 08:10:32 openqa-aarch64 worker[11165]: [error] Worker cache not available: Cache service returned error 500: Internal Server Error

which might be due to the previous out-of-space condition potentially corrupting the database as well which resides on the same filesystem as the assets that filled up the available space?

Suggestions

  • Check integrity and content of attached database file for suspicious entries
  • Try to ensure cache database integrity even if writes fail (out-of-space) or reset cache on unexpected problems like these automatically

Further details

#66988 happened at around the same time.

the compressed SQLite database file "cache.sqlite.xz" is attached

Workaround

On the affected worker machine:

systemctl stop openqa-worker.target openqa-worker-cacheservice openqa-worker-cacheservice-minion.service
rm -rf /var/lib/openqa/cache/*
systemctl start openqa-worker.target openqa-worker-cacheservice openqa-worker-cacheservice-minion.service

Files

cache.sqlite.xz (742 KB) cache.sqlite.xz okurz, 2020-05-18 13:28
database-disk-image-malformed.txt (8.06 KB) database-disk-image-malformed.txt mkittler, 2020-08-05 09:26

Related issues 9 (2 open7 closed)

Related to openQA Project (public) - action #73321: all jobs run on openqaworker8 incomplete:"Cache service status error from API: Minion job #46203 failed: Couldn't add download: DBD::SQLite::st execute failed: database disk image is malformed*"Rejectedokurz2020-10-14

Actions
Related to openQA Infrastructure (public) - action #73342: all jobs run on openqaworker8 incomplete: "Cache service status error from API: Minion job .*failed: .*(database disk image is malformed|not a database)":retryResolvedokurz2020-10-142020-10-16

Actions
Related to openQA Project (public) - action #73273: job incompletes with auto_review:"setup failure: Cache service status error from API.* file is not a database .*":retryWorkable2020-10-13

Actions
Related to openQA Infrastructure (public) - action #78058: [Alerting] Incomplete jobs of last 24h alert - again many incompletes due to corrupted cache, on openqaworker8Resolvedokurz2020-11-162020-11-18

Actions
Related to openQA Project (public) - action #62441: openqa-worker systemd service can timeout when stoppingNew2020-01-21

Actions
Related to openQA Project (public) - action #80202: jobs incomplete with auto_review:"setup failure: No workers active in the cache service":retryResolvedmkittler2020-11-182020-12-19

Actions
Related to openQA Project (public) - action #73339: auto_review:"setup failure: Cache service status error from API: Minion job.* failed: Can't use an undefined value as a HASH reference at.*"Resolvedkraih2020-10-14

Actions
Has duplicate openQA Project (public) - action #78163: After OSD upgrade, many jobs incomplete with "Cache service status error 500: Internal Server Error"Closed2020-11-18

Actions
Copied to openQA Infrastructure (public) - action #80106: corrupted worker cache sqlite: Enlarge systemd service kill timeout temporarilyResolvednicksinger

Actions
Actions

Also available in: Atom PDF