action #67000

closed

coordination #39719: [saga][epic] Detection of "known failures" for stable tests, easy test results review and easy tracking of known issues

coordination #62420: [epic] Distinguish all types of incompletes

Job incompletes due to malformed worker cache database disk image with auto_review:"Cache service status error.*(database disk image is malformed|Specified job ID is invalid).*":retry

Added by okurz over 4 years ago. Updated almost 4 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2020-05-18
Due date:
% Done:

0%

Estimated time:

Description

Observation

https://openqa.opensuse.org/tests/1271133 incompleted with a failure to download into the cache, maybe due to out-of-space corrupting the database content?

Likely error from autoinst-log.txt:

[2020-05-18T14:21:52.0879 CEST] [info] +++ setup notes +++
[2020-05-18T14:21:52.0879 CEST] [info] Running on openqa-aarch64:5 (Linux 4.12.14-lp151.28.48-default #1 SMP Fri Apr 17 05:38:36 UTC 2020 (18849d1) aarch64)
[2020-05-18T14:21:52.0891 CEST] [debug] Found ASSET_1, caching Tumbleweed.aarch64-1.0-libvirt_aarch64-Snapshot20200517.vagrant.libvirt.box
[2020-05-18T14:21:52.0895 CEST] [info] Downloading Tumbleweed.aarch64-1.0-libvirt_aarch64-Snapshot20200517.vagrant.libvirt.box, request #14735 sent to Cache Service
[2020-05-18T14:22:42.0696 CEST] [info] Download of Tumbleweed.aarch64-1.0-libvirt_aarch64-Snapshot20200517.vagrant.libvirt.box processed
[2020-05-18T14:22:42.0714 CEST] [debug] Found ASSET_256, caching openSUSE-Tumbleweed-DVD-aarch64-Snapshot20200517-Media.iso.sha256
[2020-05-18T14:22:42.0719 CEST] [info] Downloading openSUSE-Tumbleweed-DVD-aarch64-Snapshot20200517-Media.iso.sha256, request #14738 sent to Cache Service
[2020-05-18T14:22:52.0865 CEST] [info] Download of openSUSE-Tumbleweed-DVD-aarch64-Snapshot20200517-Media.iso.sha256 processed:
[info] [#14738] Cache size of "/var/lib/openqa/cache" is 157GiB, with limit 160GiB
[info] [#14738] Downloading "openSUSE-Tumbleweed-DVD-aarch64-Snapshot20200517-Media.iso.sha256" from "http://openqa1-opensuse/tests/1271133/asset/other/openSUSE-Tumbleweed-DVD-aarch64-Snapshot20200517-Media.iso.sha256"
[info] [#14738] Content of "/var/lib/openqa/cache/openqa1-opensuse/openSUSE-Tumbleweed-DVD-aarch64-Snapshot20200517-Media.iso.sha256" has not changed, updating last use

[2020-05-18T14:22:52.0874 CEST] [debug] Found HDD_1, caching opensuse-Tumbleweed-aarch64-20200517-textmode@aarch64.qcow2
[2020-05-18T14:22:52.0879 CEST] [info] Downloading opensuse-Tumbleweed-aarch64-20200517-textmode@aarch64.qcow2, request #14740 sent to Cache Service
[2020-05-18T14:22:57.0961 CEST] [info] Download of opensuse-Tumbleweed-aarch64-20200517-textmode@aarch64.qcow2 processed
[2020-05-18T14:22:57.0962 CEST] [error] Failed to download opensuse-Tumbleweed-aarch64-20200517-textmode@aarch64.qcow2 to /var/lib/openqa/cache/openqa1-opensuse/opensuse-Tumbleweed-aarch64-20200517-textmode@aarch64.qcow2
[2020-05-18T14:22:57.0992 CEST] [info] +++ worker notes +++
[2020-05-18T14:22:57.0992 CEST] [info] End time: 2020-05-18 12:22:57
[2020-05-18T14:22:57.0993 CEST] [info] Result: setup failure
[2020-05-18T14:22:58.0019 CEST] [info] Uploading autoinst-log.txt

This might be related to my actions earlier this day. Previously the cache directory was set to a size of 200GB and I had to reduce that to 160GB as the available space had been exceeded before. After changing the configuration I restarted all services; journal output:

May 18 08:10:21 openqa-aarch64 openqa-worker-cacheservice-minion[11087]: [11087] [i] [raw1f4qa] Cache size 198GiB + needed 0 Byte exceeds limit of 160GiB, purging least used assets
May 18 08:10:21 openqa-aarch64 openqa-workercache-daemon[11086]: [11086] [i] [QgQl857h] Cache size 198GiB + needed 0 Byte exceeds limit of 160GiB, purging least used assets
May 18 08:10:21 openqa-aarch64 openqa-worker-cacheservice-minion[11087]: [11087] [i] [raw1f4qa] Purging "/var/lib/openqa/cache/openqa1-opensuse/opensuse-15.2-aarch64-164.1-gnome-x11@aarch64.qcow2" because we need space for n>
May 18 08:10:21 openqa-aarch64 openqa-workercache-daemon[11086]: [11086] [i] [QgQl857h] Purging "/var/lib/openqa/cache/openqa1-opensuse/opensuse-15.2-aarch64-164.1-gnome-x11@aarch64.qcow2" because we need space for new asset>
May 18 08:10:21 openqa-aarch64 openqa-workercache-daemon[11086]: [11086] [i] [QgQl857h] Purging "/var/lib/openqa/cache/openqa1-opensuse/opensuse-15.2-aarch64-164.1-gnome-x11@aarch64-uefi-vars.qcow2" because we need space for>
May 18 08:10:21 openqa-aarch64 openqa-workercache-daemon[11086]: [11086] [i] [QgQl857h] Purging "/var/lib/openqa/cache/openqa1-opensuse/opensuse-15.2-aarch64-164.1-xfce@aarch64.qcow2" because we need space for new assets, re>
May 18 08:10:21 openqa-aarch64 openqa-workercache-daemon[11086]: [11086] [i] [QgQl857h] Purging "/var/lib/openqa/cache/openqa1-opensuse/opensuse-15.2-aarch64-164.1-xfce@aarch64-uefi-vars.qcow2" because we need space for new >
May 18 08:10:21 openqa-aarch64 openqa-workercache-daemon[11086]: [11086] [i] [QgQl857h] Purging "/var/lib/openqa/cache/openqa1-opensuse/openSUSE-Leap-15.2-ARM-JeOS-efi.aarch64-2020.02.26-Build1.92.raw.xz.sha256" because we n>
May 18 08:10:21 openqa-aarch64 openqa-workercache-daemon[11086]: [11086] [i] [QgQl857h] Purging "/var/lib/openqa/cache/openqa1-opensuse/openSUSE-Leap-15.2-ARM-JeOS-efi.aarch64-2020.02.26-Build1.92.raw.xz" because we need spa>
May 18 08:10:21 openqa-aarch64 openqa-worker-cacheservice-minion[11087]: [11087] [i] [raw1f4qa] Purging "/var/lib/openqa/cache/openqa1-opensuse/opensuse-15.2-aarch64-164.1-gnome-x11@aarch64-uefi-vars.qcow2" because we need s>
May 18 08:10:21 openqa-aarch64 openqa-worker-cacheservice-minion[11087]: [11087] [i] [raw1f4qa] Purging "/var/lib/openqa/cache/openqa1-opensuse/opensuse-15.2-aarch64-164.1-xfce@aarch64.qcow2" because we need space for new as>
May 18 08:10:21 openqa-aarch64 openqa-worker-cacheservice-minion[11087]: [11087] [i] [raw1f4qa] Purging "/var/lib/openqa/cache/openqa1-opensuse/opensuse-15.2-aarch64-164.1-xfce@aarch64-uefi-vars.qcow2" because we need space >
May 18 08:10:21 openqa-aarch64 openqa-worker-cacheservice-minion[11087]: [11087] [i] [raw1f4qa] Purging "/var/lib/openqa/cache/openqa1-opensuse/openSUSE-Leap-15.2-ARM-JeOS-efi.aarch64-2020.02.26-Build1.92.raw.xz.sha256" beca>
May 18 08:10:21 openqa-aarch64 openqa-worker-cacheservice-minion[11087]: [11087] [i] [raw1f4qa] Purging "/var/lib/openqa/cache/openqa1-opensuse/openSUSE-Leap-15.2-ARM-JeOS-efi.aarch64-2020.02.26-Build1.92.raw.xz" because we >
…
May 18 08:10:22 openqa-aarch64 openqa-worker-cacheservice-minion[11087]: [11087] [i] [raw1f4qa] Purging "/var/lib/openqa/cache/openqa1-opensuse/opensuse-15.2-aarch64-164.3-minimalx@aarch64.qcow2" because we need space for ne>
May 18 08:10:22 openqa-aarch64 openqa-workercache-daemon[11086]: [11086] [i] [QgQl857h] Purging "/var/lib/openqa/cache/openqa1-opensuse/opensuse-15.2-aarch64-164.3-minimalx@aarch64.qcow2" because we need space for new assets>
May 18 08:10:22 openqa-aarch64 openqa-workercache-daemon[11086]: [11086] [i] [QgQl857h] Cache size of "/var/lib/openqa/cache" is 159GiB, with limit 160GiB
May 18 08:10:22 openqa-aarch64 openqa-worker-cacheservice-minion[11087]: [11087] [i] [raw1f4qa] Cache size of "/var/lib/openqa/cache" is 159GiB, with limit 160GiB
May 18 08:10:22 openqa-aarch64 openqa-worker-cacheservice-minion[11087]: [11087] [i] Worker 11087 started
May 18 08:10:22 openqa-aarch64 openqa-workercache-daemon[11086]: [11086] [i] Listening at "http://127.0.0.1:9530"
May 18 08:10:22 openqa-aarch64 openqa-workercache-daemon[11086]: [11086] [i] Listening at "http://[::1]:9530"
May 18 08:10:22 openqa-aarch64 openqa-workercache-daemon[11086]: [11086] [i] Manager 11086 started
May 18 08:10:22 openqa-aarch64 openqa-workercache-daemon[11086]: Server available at http://127.0.0.1:9530
May 18 08:10:22 openqa-aarch64 openqa-workercache-daemon[11086]: Server available at http://[::1]:9530
May 18 08:10:22 openqa-aarch64 openqa-workercache-daemon[11086]: [11090] [i] Worker 11090 started
May 18 08:10:22 openqa-aarch64 openqa-workercache-daemon[11086]: [11091] [i] Worker 11091 started
May 18 08:10:22 openqa-aarch64 openqa-workercache-daemon[11086]: [11092] [i] Worker 11092 started
May 18 08:10:22 openqa-aarch64 openqa-workercache-daemon[11086]: [11086] [i] Creating process id file "/tmp/prefork.pid"
May 18 08:10:22 openqa-aarch64 openqa-workercache-daemon[11086]: [11093] [i] Worker 11093 started
May 18 08:10:26 openqa-aarch64 worker[2847]: [info] +++ worker notes +++
May 18 08:10:26 openqa-aarch64 worker[2847]: [info] End time: 2020-05-18 06:10:26
May 18 08:10:26 openqa-aarch64 worker[2847]: [info] Result: quit
May 18 08:10:26 openqa-aarch64 worker[2835]: [info] +++ worker notes +++
May 18 08:10:26 openqa-aarch64 worker[2835]: [info] End time: 2020-05-18 06:10:26
May 18 08:10:26 openqa-aarch64 worker[2835]: [info] Result: quit
…
May 18 08:10:32 openqa-aarch64 openqa-workercache-daemon[11086]: [11092] [e] [H-9wvmR1] DBD::SQLite::st execute failed: database disk image is malformed at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 339.
May 18 08:10:32 openqa-aarch64 openqa-workercache-daemon[11086]:  at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 339.
May 18 08:10:32 openqa-aarch64 worker[11162]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 157
May 18 08:10:32 openqa-aarch64 worker[11160]: [error] Worker cache not available: Cache service returned error 500: Internal Server Error
May 18 08:10:32 openqa-aarch64 worker[11160]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 173
May 18 08:10:32 openqa-aarch64 openqa-workercache-daemon[11086]: [11092] [e] [YaWNG1Vj] DBD::SQLite::st execute failed: database disk image is malformed at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 339.
May 18 08:10:32 openqa-aarch64 openqa-workercache-daemon[11086]:  at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 339.
May 18 08:10:32 openqa-aarch64 worker[11165]: [error] Worker cache not available: Cache service returned error 500: Internal Server Error
May 18 08:10:32 openqa-aarch64 worker[11165]: [info] CACHE: caching is enabled, setting up /var/lib/openqa/cache/openqa1-opensuse
May 18 08:10:32 openqa-aarch64 worker[11165]: [info] Project dir for host http://openqa1-opensuse is /var/lib/openqa/share

Observe the error lines:

May 18 08:10:32 openqa-aarch64 openqa-workercache-daemon[11086]: [11092] [e] [H-9wvmR1] DBD::SQLite::st execute failed: database disk image is malformed at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 339.
May 18 08:10:32 openqa-aarch64 openqa-workercache-daemon[11086]:  at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 339.
…
May 18 08:10:32 openqa-aarch64 worker[11160]: [error] Worker cache not available: Cache service returned error 500: Internal Server Error
…
May 18 08:10:32 openqa-aarch64 openqa-workercache-daemon[11086]: [11092] [e] [YaWNG1Vj] DBD::SQLite::st execute failed: database disk image is malformed at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 339.
May 18 08:10:32 openqa-aarch64 openqa-workercache-daemon[11086]:  at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 339.
May 18 08:10:32 openqa-aarch64 worker[11165]: [error] Worker cache not available: Cache service returned error 500: Internal Server Error

This might be due to the previous out-of-space condition corrupting the database as well, since it resides on the same filesystem as the assets that filled up the available space.

Suggestions

  • Check integrity and content of the attached database file for suspicious entries (see the sketch below)
  • Try to ensure cache database integrity even if writes fail (out-of-space), or automatically reset the cache on unexpected problems like these
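
A minimal sketch of such an integrity check, assuming Mojo::SQLite (which the cache service uses) and the default cache location:

    use Mojo::SQLite;

    # Open the worker cache database and run SQLite's built-in check
    my $sql    = Mojo::SQLite->new('sqlite:/var/lib/openqa/cache/cache.sqlite');
    my @errors = grep { $_ ne 'ok' }
      map { $_->{integrity_check} }
      @{$sql->db->query('PRAGMA integrity_check')->hashes};
    die "Cache database is corrupted:\n" . join("\n", @errors) if @errors;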

Further details

#66988 happened at around the same time.

The compressed SQLite database file "cache.sqlite.xz" is attached.

Workaround

On the affected worker machine:

systemctl stop openqa-worker.target openqa-worker-cacheservice openqa-worker-cacheservice-minion.service
rm -rf /var/lib/openqa/cache/*
systemctl start openqa-worker.target openqa-worker-cacheservice openqa-worker-cacheservice-minion.service
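
Note that this wipes all cached assets, so the first jobs after the restart will need to re-download them.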

Files

cache.sqlite.xz (742 KB), okurz, 2020-05-18 13:28
database-disk-image-malformed.txt (8.06 KB), mkittler, 2020-08-05 09:26

Related issues 9 (2 open, 7 closed)

Related to openQA Project - action #73321: all jobs run on openqaworker8 incomplete:"Cache service status error from API: Minion job #46203 failed: Couldn't add download: DBD::SQLite::st execute failed: database disk image is malformed*" (Rejected, okurz, 2020-10-14)

Related to openQA Infrastructure - action #73342: all jobs run on openqaworker8 incomplete: "Cache service status error from API: Minion job .*failed: .*(database disk image is malformed|not a database)":retry (Resolved, okurz, 2020-10-14 - 2020-10-16)

Related to openQA Project - action #73273: job incompletes with auto_review:"setup failure: Cache service status error from API.* file is not a database .*":retry (Workable, 2020-10-13)

Related to openQA Infrastructure - action #78058: [Alerting] Incomplete jobs of last 24h alert - again many incompletes due to corrupted cache, on openqaworker8 (Resolved, okurz, 2020-11-16 - 2020-11-18)

Related to openQA Project - action #62441: openqa-worker systemd service can timeout when stopping (New, 2020-01-21)

Related to openQA Project - action #80202: jobs incomplete with auto_review:"setup failure: No workers active in the cache service":retry (Resolved, mkittler, 2020-11-18 - 2020-12-19)

Related to openQA Project - action #73339: auto_review:"setup failure: Cache service status error from API: Minion job.* failed: Can't use an undefined value as a HASH reference at.*" (Resolved, kraih, 2020-10-14)

Has duplicate openQA Project - action #78163: After OSD upgrade, many jobs incomplete with "Cache service status error 500: Internal Server Error" (Closed, 2020-11-18)

Copied to openQA Infrastructure - action #80106: corrupted worker cache sqlite: Enlarge systemd service kill timeout temporarily (Resolved, nicksinger)

Actions #1

Updated by kraih over 4 years ago

  • Assignee set to kraih

Great that the database was preserved this time. Let's see what I can find out.

Actions #2

Updated by kraih over 4 years ago

The SQLite file is definitely corrupted. And I believe it's the same table as last time.

*** in database main ***
On tree page 6 cell 73: Rowid 13377 out of order
row 2476 missing from index downloads_created
row 2476 missing from index downloads_lock
row 2477 missing from index downloads_created
row 2477 missing from index downloads_lock
wrong # of entries in index downloads_created
wrong # of entries in index downloads_lock
Actions #3

Updated by okurz over 4 years ago

So do you think this could be related to the "out-of-space" condition?

Actions #4

Updated by kraih over 4 years ago

I don't think so. Running out of disk space would make write queries fail with an out-of-space error message (rolling back the transaction with the write operation). But it is not supposed to corrupt the SQLite file. So far the only ways to actually corrupt an SQLite file I have found are a) hardware failure, b) the SQLite process gets killed in the middle of a write (the journal might mitigate this one though), or c) the SQLite file is copied during a write operation.

Only scenario b would seem to make sense for our setup (even though we run SQLite in WAL mode). While there is nothing in the SQLite documentation, I did find some stories about explicit exclusive locks around write operations mitigating SQLite file corruption, so I'm just going to try that.
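
For illustration, a minimal sketch of such an explicit lock with Mojo::SQLite (the table values are hypothetical; this is not the exact code from the PR below):

    use Mojo::SQLite;

    my $sql = Mojo::SQLite->new('sqlite:cache.sqlite');
    my $db  = $sql->db;

    # Hold an exclusive transaction for the whole write so no other
    # connection can touch the file while table and indexes are updated
    my $tx = $db->begin('exclusive');
    $db->insert(downloads => {lock => 'some-asset', created => \"datetime('now')"});
    $tx->commit;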

Actions #5

Updated by kraih over 4 years ago

I've opened a PR that adds explicit locks. https://github.com/os-autoinst/openQA/pull/3103

Actions #6

Updated by kraih over 4 years ago

One more thing. The Minion job queue tables are not covered by explicit locks (3rd party module). That is something to look out for in case we see a corrupted SQLite file again. As far as I can remember, until now we've only seen the downloads table and its indexes get corrupted. If that changes to one of the minion_* tables, it might suggest the mitigation PR above is working, and that we need to contribute a similar patch to Minion::Backend::SQLite.

Actions #7

Updated by kraih over 4 years ago

  • Status changed from Workable to Feedback

Let's see if the patch works out after the next deployment.

Actions #8

Updated by livdywan over 4 years ago

Do we consider this solved?

Actions #9

Updated by okurz over 4 years ago

  • Status changed from Feedback to Resolved

We could not link any recent problems to this issue, so yes, I agree we can consider this as Resolved.

Actions #10

Updated by mkittler about 4 years ago

  • File database-disk-image-malformed.txt database-disk-image-malformed.txt added
  • Subject changed from job incompletes with failure to download into cache (cache sqlite db attached), maybe due to out-of-space corrupting db content? to Job incompletes due to malformed worker cache database disk image with auto_review:"setup failure: Cache service status error.*database disk image is malformed.*"
  • Description updated (diff)
  • Status changed from Resolved to Feedback

Seems like it happened again: https://openqa.opensuse.org/tests/1352583

[2020-08-04T14:23:41.0564 CEST] [info] +++ setup notes +++
[2020-08-04T14:23:41.0564 CEST] [info] Running on openqa-aarch64:2 (Linux 4.12.14-lp151.28.52-default #1 SMP Wed Jun 10 15:32:08 UTC 2020 (464fb5f) aarch64)
[2020-08-04T14:23:41.0571 CEST] [debug] Found ASSET_1, caching Tumbleweed.aarch64-1.0-libvirt_aarch64-Snapshot20200802.vagrant.libvirt.box
[2020-08-04T14:23:41.0574 CEST] [info] Downloading Tumbleweed.aarch64-1.0-libvirt_aarch64-Snapshot20200802.vagrant.libvirt.box, request #23990 sent to Cache Service
[2020-08-04T14:23:46.0637 CEST] [info] Download of Tumbleweed.aarch64-1.0-libvirt_aarch64-Snapshot20200802.vagrant.libvirt.box processed:
[info] [#23990] Cache size of "/var/lib/openqa/cache" is 158GiB, with limit 160GiB
[info] [#23990] Downloading "Tumbleweed.aarch64-1.0-libvirt_aarch64-Snapshot20200802.vagrant.libvirt.box" from "http://openqa1-opensuse/tests/1352583/asset/other/Tumbleweed.aarch64-1.0-libvirt_aarch64-Snapshot20200802.vagrant.libvirt.box"
[info] [#23990] Content of "/var/lib/openqa/cache/openqa1-opensuse/Tumbleweed.aarch64-1.0-libvirt_aarch64-Snapshot20200802.vagrant.libvirt.box" has not changed, updating last use

[2020-08-04T14:23:46.0644 CEST] [debug] Found ASSET_256, caching openSUSE-Tumbleweed-DVD-aarch64-Snapshot20200802-Media.iso.sha256
[2020-08-04T14:23:46.0647 CEST] [info] Downloading openSUSE-Tumbleweed-DVD-aarch64-Snapshot20200802-Media.iso.sha256, request #23994 sent to Cache Service
[2020-08-04T14:23:51.0693 CEST] [info] Download of openSUSE-Tumbleweed-DVD-aarch64-Snapshot20200802-Media.iso.sha256 processed:
[info] [#23994] Cache size of "/var/lib/openqa/cache" is 158GiB, with limit 160GiB
[info] [#23994] Downloading "openSUSE-Tumbleweed-DVD-aarch64-Snapshot20200802-Media.iso.sha256" from "http://openqa1-opensuse/tests/1352583/asset/other/openSUSE-Tumbleweed-DVD-aarch64-Snapshot20200802-Media.iso.sha256"
[info] [#23994] Content of "/var/lib/openqa/cache/openqa1-opensuse/openSUSE-Tumbleweed-DVD-aarch64-Snapshot20200802-Media.iso.sha256" has not changed, updating last use

[2020-08-04T14:23:51.0697 CEST] [debug] Found ISO, caching openSUSE-Tumbleweed-DVD-aarch64-Snapshot20200802-Media.iso
[2020-08-04T14:23:51.0701 CEST] [info] Downloading openSUSE-Tumbleweed-DVD-aarch64-Snapshot20200802-Media.iso, request #24004 sent to Cache Service
[2020-08-04T14:24:01.0803 CEST] [info] +++ worker notes +++
[2020-08-04T14:24:01.0803 CEST] [info] End time: 2020-08-04 12:24:01
[2020-08-04T14:24:01.0803 CEST] [info] Result: setup failure
[2020-08-04T14:24:01.0818 CEST] [info] Uploading autoinst-log.txt

The problem strangely seems to have resolved itself because subsequent jobs on the same worker haven't incompleted. Hence I also cannot upload the malformed SQLite file. I can only attach the system journal around the time the error was logged. It is weird that there are no other related errors at the same time.

Actions #11

Updated by okurz about 4 years ago

  • Target version set to Ready
Actions #12

Updated by kraih about 4 years ago

That seems very strange, a malformed SQLite file does not fix itself. The cases I investigated had actual data inconsistencies, where indexes had different content than tables.

Actions #13

Updated by okurz about 4 years ago

  • Related to action #73321: all jobs run on openqaworker8 incomplete:"Cache service status error from API: Minion job #46203 failed: Couldn't add download: DBD::SQLite::st execute failed: database disk image is malformed*" added
Actions #14

Updated by okurz about 4 years ago

  • Related to action #73342: all jobs run on openqaworker8 incomplete: "Cache service status error from API: Minion job .*failed: .*(database disk image is malformed|not a database)":retry added
Actions #15

Updated by okurz about 4 years ago

@kraih I have new data for you to churn on: #73321 :)

Actions #16

Updated by kraih about 4 years ago

Great, I'll have a look right away!

Actions #17

Updated by kraih about 4 years ago

Interesting results. Apparently this time it affects more than one table (minion_jobs and downloads). Both are never accessed by the same query, not even for selects.

*** in database main ***
Page 26 is never used
Page 27 is never used
Page 1048 is never used
Page 1367 is never used
Page 1411 is never used
Page 1445 is never used
Page 1649 is never used
Page 1706 is never used
Page 1783 is never used
Page 1803 is never used
Page 1816 is never used
Page 1846 is never used
Page 1851 is never used
Page 1941 is never used
Page 1970 is never used
Page 1976 is never used
Page 1993 is never used
Page 2032 is never used
Page 2082 is never used
Page 2129 is never used
Page 2140 is never used
Page 2242 is never used
Page 2264 is never used
Page 2294 is never used
Page 2345 is never used
Page 2349 is never used
Page 2405 is never used
Page 2420 is never used
Page 2475 is never used
Page 2513 is never used
Page 2531 is never used
Page 2532 is never used
Page 2576 is never used
Page 2588 is never used
Page 2632 is never used
Page 2677 is never used
Page 2695 is never used
Page 2903 is never used
Page 2916 is never used
Page 2940 is never used
Page 2997 is never used
Page 3117 is never used
Page 3119 is never used
Page 3142 is never used
wrong # of entries in index minion_jobs_state_priority_id
NULL value in downloads.created
row 1 missing from index downloads_created
row 1 missing from index downloads_lock
NULL value in downloads.created
row 2 missing from index downloads_created
row 2 missing from index downloads_lock
NULL value in downloads.created
row 3 missing from index downloads_created
row 3 missing from index downloads_lock
NULL value in downloads.created
row 4 missing from index downloads_created
row 4 missing from index downloads_lock
row 1704 missing from index downloads_created
wrong # of entries in index downloads_created
wrong # of entries in index downloads_lock

Unfortunately I have absolutely no idea what might cause this.

Actions #18

Updated by kraih about 4 years ago

  • Status changed from Feedback to Workable
  • Assignee deleted (kraih)

I'm out of ideas, I'm afraid; maybe it's time for someone else to give it a try.

Actions #19

Updated by okurz about 4 years ago

  • Related to action #73273: job incompletes with auto_review:"setup failure: Cache service status error from API.* file is not a database .*":retry added
Actions #20

Updated by okurz about 4 years ago

  • Subject changed from Job incompletes due to malformed worker cache database disk image with auto_review:"setup failure: Cache service status error.*database disk image is malformed.*" to Job incompletes due to malformed worker cache database disk image with auto_review:"Cache service status error.*database disk image is malformed.*":retry
Actions #21

Updated by kraih about 4 years ago

  • Assignee set to kraih

OK, since nobody could think of anything else to try, we are going to add mitigation features for the issue, starting with a check on service startup to see whether the SQLite file has been corrupted. If it has been, that will trigger the cache cleanup.

Actions #22

Updated by okurz about 4 years ago

  • Parent task set to #62420
Actions #23

Updated by kraih about 4 years ago

  • Status changed from Workable to In Progress

I've opened a PR with a mitigation feature. https://github.com/os-autoinst/openQA/pull/3497

Actions #24

Updated by kraih about 4 years ago

Additionally we could have the mitigation code make a copy of the corrupted database for later investigation.

Actions #25

Updated by okurz almost 4 years ago

kraih wrote:

Additionally we could have the mitigation code make a copy of the corrupted database for later investigation.

Yes, we have a good entry point for that. Currently I don't see the need: you already tried hard to corrupt the database file on purpose and struggled to do so, so it seems the system itself is resilient enough, except for cases when the filesystem itself is in an inconsistent state.

The PR has been merged and should be validated in production on osd before we can call the work done.

Actions #26

Updated by livdywan almost 4 years ago

  • Status changed from In Progress to Feedback
Actions #27

Updated by ggardet_arm almost 4 years ago

This error started on openqa-aarch64 today: https://openqa.opensuse.org/tests/1469625

EDIT: could also be related to https://progress.opensuse.org/issues/77752

Actions #28

Updated by okurz almost 4 years ago

@kraih I think we are still missing a "restart" trigger for the cache service. Something like:

  • If corruption is encountered, terminate the service with an error exit code
  • Change the systemd service to restart on failure (see the sketch below)
  • The already existing corruption mitigation in sub init should then recover
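
A minimal sketch of the systemd side of this suggestion, as a drop-in for the cache service unit (hypothetical file name and values, assuming the daemon exits non-zero on corruption):

    # /etc/systemd/system/openqa-worker-cacheservice.service.d/restart.conf
    [Service]
    Restart=on-failure
    RestartSec=5
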
Actions #29

Updated by mkittler almost 4 years ago

I've added my findings in #71827#note-18. I agree that the service should terminate on the "database disk image is malformed" error because apparently it will never recover from that error state otherwise. Note that while the service was still continuously logging the "database disk image is malformed" message, I was able to read/write the same database file just fine via SQLite, and a manual restart of the service didn't even enter the "Purging cache directory because database has been corrupted" code path (or the log message was suppressed).

Actions #30

Updated by kraih almost 4 years ago

  • Assignee changed from kraih to mkittler
Actions #31

Updated by favogt almost 4 years ago

This happened to openqaworker1 today and I was able to fix it by doing:

openqaworker1:/var/lib/openqa/cache # sqlite3 cache.sqlite
SQLite version 3.28.0 2019-04-16 19:49:53
Enter ".help" for usage hints.
sqlite> vacuum
   ...> ;
sqlite> PRAGMA integrity_check;
wrong # of entries in index minion_jobs_state_priority_id
row 1 missing from index downloads_created
row 1 missing from index downloads_lock
row 2 missing from index downloads_created
row 2 missing from index downloads_lock
row 3 missing from index downloads_created
row 3 missing from index downloads_lock
row 4 missing from index downloads_created
row 4 missing from index downloads_lock
wrong # of entries in index downloads_created
wrong # of entries in index downloads_lock
sqlite> reindex;
sqlite> PRAGMA integrity_check;
ok
sqlite> .quit

I stopped openqa-worker-cacheservice.service and openqa-worker-cacheservice-minion.service during that time.

Actions #32

Updated by okurz almost 4 years ago

  • Description updated (diff)
  • Priority changed from Normal to Urgent

For some reason this seems to happen more and more, e.g. it was reported by fvogt (see above) for openqaworker1.o.o and today again by guillaume_g on aarch64.o.o.

Updated the workaround section

Actions #33

Updated by mkittler almost 4 years ago

If it is still happening, then it is unlikely that the new mitigation causes it (as that code runs only once).

I couldn't find any errors on aarch64 in the journal (journalctl --since '1 day ago' -u openqa-worker-cacheservice, journalctl --since '1 day ago' -u openqa-worker-cacheservice-minion).

Actions #34

Updated by mkittler almost 4 years ago

I couldn't find anything in the logs for openqaworker1 either. Usually we've at least seen lots of errors there (e.g. #71827#note-18). I'll try to dig into the job history to find any clues. However, if the error isn't visible from the side of the cache services, then I wouldn't know how to implement #67000#note-28.

Actions #35

Updated by okurz almost 4 years ago

On aarch64.o.o, journalctl --since="3 days ago" -u openqa-worker-cacheservice tells me:

Nov 14 08:09:32 openqa-aarch64 openqa-workercache-daemon[2924]: [3034] [e] [6EpmB1Ma] DBD::SQLite::st execute failed: database disk image is malformed at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 363.
Nov 14 08:09:32 openqa-aarch64 openqa-workercache-daemon[2924]:  at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 363.

the current context from source is:

 360 sub stats {
 361   my $self = shift;
 362 
 363   my $stats = $self->sqlite->db->query(
 364     q{select count(case when state = 'inactive' and (expires is null or expires > datetime('now'))
…
 377       (select count(*) from minion_workers) as inactive_workers, null as uptime
 378       from minion_jobs}
 379   )->hash;
 380   $stats->{inactive_workers} -= $stats->{active_workers};
 381 
 382   return $stats;
 383 }

So the call $self->sqlite->db->query fails but is caught by some error handler. I would just look for where this error handler is and not catch this exception, which we cannot gracefully handle in a useful way within the minion.

Is my suggestion from #67000#note-28 not possible?

Trying to rephrase: openqa-worker-cacheservice-minion sees the problem and reports it as an error. Same as before, I am still suggesting to either crash the service hard, let openqa-worker-cacheservice detect that the minion is gone unexpectedly and also crash hard, and then implement the retry on the systemd service level. Or, if openqa-worker-cacheservice can't detect that the minion is gone, then the minion needs to inform openqa-worker-cacheservice that it should crash hard (same game for the systemd service).
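
A minimal sketch of the "crash hard" idea (hypothetical code just to illustrate the suggestion; $minion and $log are assumed to be in scope):

    # Re-throw anything but the known corruption error; on corruption exit
    # non-zero so systemd (with Restart=on-failure) restarts the service
    eval { $minion->perform_jobs };
    if (my $err = $@) {
        die $err unless $err =~ /database disk image is malformed/;
        $log->error('Stopping service after critical database error');
        exit 1;
    }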

Actions #36

Updated by mkittler almost 4 years ago

I'm aware of the log from 3 days ago but it is not relevant for the current incompletes. But I was able to find them by digging into the job history and by asking on IRC. Seems like this time only the downloads table was affected. That would be in accordance with the output from @fvogt's PRAGMA integrity_check; and would explain why we didn't see any errors within the logs. Now, knowing where the error comes from, I can implement an automatic repair/restart in this case as well. I assume we need to trigger that from any eval block we use to catch database errors in.

Actions #37

Updated by okurz almost 4 years ago

I assume we need to trigger that from any eval block we use to catch database errors in.

Or even simpler: Delete the eval block and let the whole process crash.

Actions #38

Updated by kraih almost 4 years ago

Nov 14 08:09:32 openqa-aarch64 openqa-workercache-daemon[2924]: [3034] [e] [6EpmB1Ma] DBD::SQLite::st execute failed: database disk image is malformed at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 363.
Nov 14 08:09:32 openqa-aarch64 openqa-workercache-daemon[2924]:  at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 363.

the current context from source is:

 360 sub stats {
 361   my $self = shift;
 362 
 363   my $stats = $self->sqlite->db->query(
 364     q{select count(case when state = 'inactive' and (expires is null or expires > datetime('now'))
…
 377       (select count(*) from minion_workers) as inactive_workers, null as uptime
 378       from minion_jobs}
 379   )->hash;
 380   $stats->{inactive_workers} -= $stats->{active_workers};
 381 
 382   return $stats;
 383 }

That would actually be the first time I've seen a SELECT trigger the "database disk image is malformed" error. In all other cases I'm aware of it has been write operations.

Actions #39

Updated by mkittler almost 4 years ago

By the way, that's one of the jobs where the error didn't end up within the log: https://openqa.opensuse.org/tests/1474352

Reason: setup failure: Cache service status error from API: Minion job #3966 failed: Couldn't add download: DBD::SQLite::st execute failed: database disk image is malformed at /usr/share/openqa/script/../lib/OpenQA/CacheService/Model/Downloads.pm line 34. at /usr/share/openqa/script/../lib/OpenQA/CacheServ…

@kraih Yes, the recent errors we've already been discussing on Friday show up when reading the database (but the corruption is nevertheless likely caused by write operations).


@okurz

Or even simpler: Delete the eval block and let the whole process crash.

I'll have to check the various eval blocks one by one for what happens if an exception passes. Maybe the error would just be logged after all and the event loop would continue to run. It looks like that's actually the case for the error messages occurring within the logs. Besides, it would likely also still make sense to check for the particular error.

Actions #40

Updated by kraih almost 4 years ago

I'll have to check the various eval blocks one by one for what happens if an exception passes. Maybe the error would just be logged after all and the event loop would continue to run. It looks like that's actually the case for the error messages occurring within the logs. Besides, it would likely also still make sense to check for the particular error.

There are at least 4 different process types where the error would be handled differently (and probably a whole bunch of variants for each case depending on context).

  1. openqa-worker-cacheservice (web app) manager process that connects to the database on startup. Exceptions should already kill the service.
  2. openqa-worker-cacheservice (web app) worker processes that process requests and can encounter the problem with many different queries. Exceptions should get caught by a framework handler generating a built-in 500 page.
  3. openqa-worker-cacheservice-minion (job queue) manager process that also connects to the database on startup and runs queries to dequeue jobs and update results and other metadata. Exceptions should already kill the service.
  4. openqa-worker-cacheservice-minion (job queue) job processes that get forked for each job and process rsync and asset download tasks. Exceptions get caught and stored in the job result.

So it's unfortunately not as simple as disabling an eval. At best you take down a forked process that doesn't really influence the service as a whole.

Actions #42

Updated by okurz almost 4 years ago

Just to provide context to prove that this issue is really important: https://gitlab.suse.de/openqa/auto-review/-/jobs/287066 shows that just today we have 355 incompletes linked with this ticket. All could be retriggered, but imagine how reviewers would need to handle this manually if we did not have "auto-review" :D

But on osd I just found that openqaworker8 is causing a lot of incompletes. If we don't have the PR merged tomorrow and deployed as a hotpatch, I would suggest we look for another, faster workaround for the time being.

Actions #43

Updated by okurz almost 4 years ago

  • Description updated (diff)
Actions #44

Updated by okurz almost 4 years ago

  • Related to action #78058: [Alerting] Incomplete jobs of last 24h alert - again many incompletes due to corrupted cache, on openqaworker8 added
Actions #45

Updated by okurz almost 4 years ago

I found https://www.sqlite.org/howtocorrupt.html, sounds interesting :) This article lists many different scenarios that can lead to corruption. Maybe we should check whether we are doing any of these. For example, are we "Carrying an open database connection across a fork()"?

Actions #46

Updated by mkittler almost 4 years ago

PR to better deal with corrupted databases: https://github.com/os-autoinst/openQA/pull/3559


Thanks, I'll read it.

For example, are we doing "Carrying an open database connection across a fork()" ?

I suppose we do, because the Minion framework creates sub-processes to run the jobs and these most likely just re-use the established database connection (@kraih, right?).

Actions #47

Updated by kraih almost 4 years ago

For example, are we doing "Carrying an open database connection across a fork()" ?

I suppose we do, because the Minion framework creates sub-processes to run the jobs and these most likely just re-use the established database connection (@kraih, right?).

No, that should not be the case. The Mojo::SQLite module has a fork-safety feature that deletes all connections after a fork without ever using them. https://metacpan.org/release/Mojo-SQLite/source/lib/Mojo/SQLite.pm#L75
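
To illustrate the pattern (a simplified sketch of the general idea, not the actual Mojo::SQLite code; the file name is an example):

    use DBI;

    my ($dbh, $owner_pid);

    # Return a handle that is safe for the current process: if the process
    # has forked since the handle was created, open a fresh connection
    # instead of carrying the inherited one across the fork() boundary
    sub safe_dbh {
        if (!defined $dbh || $owner_pid != $$) {
            # Keep the inherited handle from closing the parent's connection
            $dbh->{InactiveDestroy} = 1 if defined $dbh;
            $dbh = DBI->connect('dbi:SQLite:dbname=cache.sqlite',
                '', '', {RaiseError => 1});
            $owner_pid = $$;
        }
        return $dbh;
    }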

Actions #48

Updated by mkittler almost 4 years ago

  • Has duplicate action #78163: After OSD upgrade, many jobs incomplete with "Cache service status error 500: Internal Server Error" added
Actions #49

Updated by mkittler almost 4 years ago

Seems like the restart introduced with https://github.com/os-autoinst/openQA/pull/3559 works in production:

Nov 19 13:15:19 openqaworker4 openqa-workercache-daemon[9261]: [9266] [e] [HfT4AA7G] DBD::SQLite::st execute failed: database disk image is malformed at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 363.
Nov 19 13:15:19 openqaworker4 openqa-workercache-daemon[9261]:  at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 363.
Nov 19 13:15:19 openqaworker4 openqa-workercache-daemon[9261]: [9266] [e] Stopping service after critical database error
Nov 19 13:15:19 openqaworker4 openqa-workercache-daemon[9261]: [9261] [i] Stopping worker 9266 gracefully (80 seconds)
Nov 19 13:15:19 openqaworker4 openqa-workercache-daemon[9261]: [9261] [i] Stopping worker 9267 gracefully (80 seconds)
Nov 19 13:15:19 openqaworker4 openqa-workercache-daemon[9261]: [9261] [i] Stopping worker 9265 gracefully (80 seconds)
Nov 19 13:15:19 openqaworker4 openqa-workercache-daemon[9261]: [9261] [i] Stopping worker 9268 gracefully (80 seconds)
Nov 19 13:15:19 openqaworker4 openqa-workercache-daemon[9261]: [9261] [i] Worker 9265 stopped
Nov 19 13:15:19 openqaworker4 openqa-workercache-daemon[9261]: [9261] [i] Worker 9266 stopped
Nov 19 13:15:19 openqaworker4 openqa-workercache-daemon[9261]: [9261] [i] Worker 9267 stopped
Nov 19 13:15:19 openqaworker4 openqa-workercache-daemon[9261]: [9261] [i] Worker 9268 stopped
Nov 19 13:15:19 openqaworker4 openqa-workercache-daemon[9261]: [9261] [i] Manager 9261 stopped
Nov 19 13:15:19 openqaworker4 systemd[1]: openqa-worker-cacheservice.service: Main process exited, code=exited, status=32/n/a
Nov 19 13:15:19 openqaworker4 systemd[1]: openqa-worker-cacheservice.service: Unit entered failed state.
Nov 19 13:15:19 openqaworker4 systemd[1]: openqa-worker-cacheservice.service: Failed with result 'exit-code'.
Nov 19 13:15:20 openqaworker4 systemd[1]: openqa-worker-cacheservice.service: Service RestartSec=100ms expired, scheduling restart.
Nov 19 13:15:20 openqaworker4 systemd[1]: Stopped OpenQA Worker Cache Service.
Nov 19 13:15:20 openqaworker4 systemd[1]: Started OpenQA Worker Cache Service.
Nov 19 13:15:21 openqaworker4 openqa-workercache-daemon[9285]: [9285] [i] [YThoSVqB] Cache size of "/var/lib/openqa/cache" is 48GiB, with limit 50GiB
Nov 19 13:15:21 openqaworker4 openqa-workercache-daemon[9285]: [9285] [i] Listening at "http://127.0.0.1:9530"
Nov 19 13:15:21 openqaworker4 openqa-workercache-daemon[9285]: [9285] [i] Listening at "http://[::1]:9530"
Nov 19 13:15:21 openqaworker4 openqa-workercache-daemon[9285]: [9285] [i] Manager 9285 started
Nov 19 13:15:21 openqaworker4 openqa-workercache-daemon[9285]: Web application available at http://127.0.0.1:9530
Nov 19 13:15:21 openqaworker4 openqa-workercache-daemon[9285]: Web application available at http://[::1]:9530
Nov 19 13:15:21 openqaworker4 openqa-workercache-daemon[9285]: [9289] [i] Worker 9289 started
Nov 19 13:15:21 openqaworker4 openqa-workercache-daemon[9285]: [9290] [i] Worker 9290 started
Nov 19 13:15:21 openqaworker4 openqa-workercache-daemon[9285]: [9291] [i] Worker 9291 started
Nov 19 13:15:21 openqaworker4 openqa-workercache-daemon[9285]: [9285] [i] Creating process id file "/tmp/prefork.pid"
Nov 19 13:15:21 openqaworker4 openqa-workercache-daemon[9285]: [9292] [i] Worker 9292 started

In the meantime I've created https://github.com/os-autoinst/openQA/pull/3565 to further improve the repairing. See the comments on GitHub for details. (Note that none of the mentioned PRs were deployed on the QA-Power8-5-kvm machine mentioned there.)

Actions #50

Updated by favogt almost 4 years ago

This just happened to openqaworker4 and checking the log during shutdown I found:

Nov 19 03:30:03 openqaworker4 openqa-workercache-daemon[2199]: [2199] [w] Stopping worker 2403 immediately
Nov 19 03:30:03 openqaworker4 openqa-workercache-daemon[2199]: [2199] [w] Stopping worker 2404 immediately
Nov 19 03:30:03 openqaworker4 openqa-workercache-daemon[2199]: [2199] [w] Stopping worker 2405 immediately
Nov 19 03:30:03 openqaworker4 openqa-workercache-daemon[2199]: [2199] [w] Stopping worker 2402 immediately
Nov 19 03:30:03 openqaworker4 openqa-workercache-daemon[2199]: [2199] [i] Worker 2404 stopped
Nov 19 03:30:03 openqaworker4 openqa-workercache-daemon[2199]: [2199] [i] Worker 2405 stopped
Nov 19 03:30:03 openqaworker4 openqa-workercache-daemon[2199]: [2199] [i] Worker 2403 stopped
Nov 19 03:30:03 openqaworker4 openqa-workercache-daemon[2199]: [2199] [i] Worker 2402 stopped
Nov 19 03:30:03 openqaworker4 openqa-workercache-daemon[2199]: [2199] [w] Stopping worker 2403 immediately

If those workers had the sqlite database open when they got sent a SIGKILL, then that might indeed cause corruption.

Actions #51

Updated by okurz almost 4 years ago

That's a very good observation. You know my catch phrase? "We have a ticket for that" :)

That would be #62441 about "openqa-worker systemd service can timeout when stopping".

Actions #52

Updated by okurz almost 4 years ago

  • Related to action #62441: openqa-worker systemd service can timeout when stopping added
Actions #53

Updated by mkittler almost 4 years ago

If those workers had the sqlite database open when they got sent a SIGKILL, then that might indeed cause corruption.

But shouldn't SQLite's journaling help us here? Of course we might lose something but at least the database should be able to restore itself to a consistent state.

I'm noting down the possible upstream issue you've mentioned on IRC: https://github.com/Grinnz/Mojo-SQLite/issues/5

Actions #54

Updated by mkittler almost 4 years ago

This now constantly shows up on openqaworker4:

Nov 20 03:48:39 openqaworker4 systemd[1]: Started OpenQA Worker Cache Service.
Nov 20 03:48:40 openqaworker4 openqa-workercache-daemon[2769]: [2769] [i] [WNLIkngQ] Cache size of "/var/lib/openqa/cache" is 47GiB, with limit 50GiB
Nov 20 03:48:40 openqaworker4 openqa-workercache-daemon[2769]: [2769] [i] Listening at "http://127.0.0.1:9530"
Nov 20 03:48:40 openqaworker4 openqa-workercache-daemon[2769]: [2769] [i] Listening at "http://[::1]:9530"
Nov 20 03:48:40 openqaworker4 openqa-workercache-daemon[2769]: [2769] [i] Manager 2769 started
Nov 20 03:48:40 openqaworker4 openqa-workercache-daemon[2769]: Web application available at http://127.0.0.1:9530
Nov 20 03:48:40 openqaworker4 openqa-workercache-daemon[2769]: Web application available at http://[::1]:9530
Nov 20 03:48:40 openqaworker4 openqa-workercache-daemon[2769]: [2773] [i] Worker 2773 started
Nov 20 03:48:40 openqaworker4 openqa-workercache-daemon[2769]: [2774] [i] Worker 2774 started
Nov 20 03:48:40 openqaworker4 openqa-workercache-daemon[2769]: [2775] [i] Worker 2775 started
Nov 20 03:48:40 openqaworker4 openqa-workercache-daemon[2769]: [2769] [i] Creating process id file "/tmp/prefork.pid"
Nov 20 03:48:40 openqaworker4 openqa-workercache-daemon[2769]: [2776] [i] Worker 2776 started
Nov 20 03:48:43 openqaworker4 openqa-workercache-daemon[2769]: [2774] [e] [mjgYfJJ7] DBD::SQLite::st execute failed: database disk image is malformed at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/SQLite.pm line 115.
Nov 20 03:48:43 openqaworker4 openqa-workercache-daemon[2769]:  at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/SQLite.pm line 115.
Nov 20 03:48:43 openqaworker4 openqa-workercache-daemon[2769]: [2776] [e] [mjgYfJJ7] DBD::SQLite::st execute failed: database disk image is malformed at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/SQLite.pm line 115.
Nov 20 03:48:43 openqaworker4 openqa-workercache-daemon[2769]:  at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/SQLite.pm line 115.
Nov 20 03:48:43 openqaworker4 openqa-workercache-daemon[2769]: [2773] [e] [mjgYfJJ7] DBD::SQLite::st execute failed: database disk image is malformed at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/SQLite.pm line 115.
Nov 20 03:48:43 openqaworker4 openqa-workercache-daemon[2769]:  at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/SQLite.pm line 115.
Nov 20 03:48:43 openqaworker4 openqa-workercache-daemon[2769]: [2774] [e] Stopping service after critical database error
Nov 20 03:48:43 openqaworker4 openqa-workercache-daemon[2769]: [2769] [i] Stopping worker 2775 gracefully (80 seconds)
Nov 20 03:48:43 openqaworker4 openqa-workercache-daemon[2769]: [2769] [i] Stopping worker 2773 gracefully (80 seconds)
Nov 20 03:48:43 openqaworker4 openqa-workercache-daemon[2769]: [2769] [i] Stopping worker 2776 gracefully (80 seconds)
Nov 20 03:48:43 openqaworker4 openqa-workercache-daemon[2769]: [2769] [i] Stopping worker 2774 gracefully (80 seconds)
Nov 20 03:48:43 openqaworker4 openqa-workercache-daemon[2769]: [2776] [e] Stopping service after critical database error
Nov 20 03:48:43 openqaworker4 openqa-workercache-daemon[2769]: [2773] [e] Stopping service after critical database error
Nov 20 03:48:43 openqaworker4 openqa-workercache-daemon[2769]: [2769] [i] Worker 2774 stopped
Nov 20 03:48:43 openqaworker4 openqa-workercache-daemon[2769]: [2769] [i] Worker 2773 stopped
Nov 20 03:48:43 openqaworker4 openqa-workercache-daemon[2769]: [2769] [i] Worker 2775 stopped
Nov 20 03:48:43 openqaworker4 openqa-workercache-daemon[2769]: [2769] [i] Worker 2776 stopped
Nov 20 03:48:43 openqaworker4 openqa-workercache-daemon[2769]: [2769] [i] Manager 2769 stopped
Nov 20 03:48:43 openqaworker4 systemd[1]: openqa-worker-cacheservice.service: Main process exited, code=exited, status=160/n/a
Nov 20 03:48:43 openqaworker4 systemd[1]: openqa-worker-cacheservice.service: Unit entered failed state.
Nov 20 03:48:43 openqaworker4 systemd[1]: openqa-worker-cacheservice.service: Failed with result 'exit-code'.
Nov 20 03:48:44 openqaworker4 systemd[1]: openqa-worker-cacheservice.service: Service RestartSec=100ms expired, scheduling restart.
Nov 20 03:48:45 openqaworker4 systemd[1]: Stopped OpenQA Worker Cache Service.
Nov 20 03:48:45 openqaworker4 systemd[1]: Started OpenQA Worker Cache Service.
Nov 20 03:48:45 openqaworker4 openqa-workercache-daemon[2777]: [2777] [i] [4MOBVYiU] Cache size of "/var/lib/openqa/cache" is 47GiB, with limit 50GiB
Nov 20 03:48:45 openqaworker4 openqa-workercache-daemon[2777]: [2777] [i] Listening at "http://127.0.0.1:9530"
Nov 20 03:48:45 openqaworker4 openqa-workercache-daemon[2777]: [2777] [i] Listening at "http://[::1]:9530"
Nov 20 03:48:45 openqaworker4 openqa-workercache-daemon[2777]: [2777] [i] Manager 2777 started
Nov 20 03:48:45 openqaworker4 openqa-workercache-daemon[2777]: Web application available at http://127.0.0.1:9530
Nov 20 03:48:45 openqaworker4 openqa-workercache-daemon[2777]: Web application available at http://[::1]:9530
Nov 20 03:48:45 openqaworker4 openqa-workercache-daemon[2777]: [2781] [i] Worker 2781 started
Nov 20 03:48:45 openqaworker4 openqa-workercache-daemon[2777]: [2782] [i] Worker 2782 started
Nov 20 03:48:45 openqaworker4 openqa-workercache-daemon[2777]: [2783] [i] Worker 2783 started
Nov 20 03:48:45 openqaworker4 openqa-workercache-daemon[2777]: [2777] [i] Creating process id file "/tmp/prefork.pid"
Nov 20 03:48:45 openqaworker4 openqa-workercache-daemon[2777]: [2784] [i] Worker 2784 started
Nov 20 09:16:18 openqaworker4 systemd[1]: Stopping OpenQA Worker Cache Service...
Nov 20 09:16:18 openqaworker4 openqa-workercache-daemon[2777]: [2777] [w] Stopping worker 2784 immediately
Nov 20 09:16:18 openqaworker4 openqa-workercache-daemon[2777]: [2777] [w] Stopping worker 2783 immediately
Nov 20 09:16:18 openqaworker4 openqa-workercache-daemon[2777]: [2777] [w] Stopping worker 2781 immediately
Nov 20 09:16:18 openqaworker4 openqa-workercache-daemon[2777]: [2777] [w] Stopping worker 2782 immediately
Nov 20 09:16:18 openqaworker4 openqa-workercache-daemon[2777]: [2777] [i] Worker 2783 stopped
Nov 20 09:16:18 openqaworker4 openqa-workercache-daemon[2777]: [2777] [w] Stopping worker 2784 immediately
Nov 20 09:16:18 openqaworker4 openqa-workercache-daemon[2777]: [2777] [w] Stopping worker 2781 immediately
Nov 20 09:16:18 openqaworker4 openqa-workercache-daemon[2777]: [2777] [w] Stopping worker 2782 immediately
Nov 20 09:16:18 openqaworker4 openqa-workercache-daemon[2777]: [2777] [i] Worker 2781 stopped
Nov 20 09:16:18 openqaworker4 openqa-workercache-daemon[2777]: [2777] [i] Worker 2784 stopped
Nov 20 09:16:18 openqaworker4 openqa-workercache-daemon[2777]: [2777] [w] Stopping worker 2782 immediately
Nov 20 09:16:18 openqaworker4 openqa-workercache-daemon[2777]: [2777] [i] Worker 2782 stopped
Nov 20 09:16:18 openqaworker4 openqa-workercache-daemon[2777]: [2777] [i] Manager 2777 stopped
Nov 20 09:16:18 openqaworker4 systemd[1]: openqa-worker-cacheservice.service: Main process exited, code=exited, status=8/n/a
Nov 20 09:16:18 openqaworker4 systemd[1]: Stopped OpenQA Worker Cache Service.
Nov 20 09:16:18 openqaworker4 systemd[1]: openqa-worker-cacheservice.service: Unit entered failed state.
Nov 20 09:16:18 openqaworker4 systemd[1]: openqa-worker-cacheservice.service: Failed with result 'exit-code'.
Nov 20 09:16:52 openqaworker4 systemd[1]: Started OpenQA Worker Cache Service.
Nov 20 09:16:52 openqaworker4 openqa-workercache-daemon[8990]: [8990] [i] [GPabEAVD] Cache size of "/var/lib/openqa/cache" is 48GiB, with limit 50GiB
Nov 20 09:16:52 openqaworker4 openqa-workercache-daemon[8990]: [8990] [i] Listening at "http://127.0.0.1:9530"
Nov 20 09:16:52 openqaworker4 openqa-workercache-daemon[8990]: [8990] [i] Listening at "http://[::1]:9530"
Nov 20 09:16:52 openqaworker4 openqa-workercache-daemon[8990]: [8990] [i] Manager 8990 started
Nov 20 09:16:52 openqaworker4 openqa-workercache-daemon[8990]: Web application available at http://127.0.0.1:9530
Nov 20 09:16:52 openqaworker4 openqa-workercache-daemon[8990]: Web application available at http://[::1]:9530
Nov 20 09:16:52 openqaworker4 openqa-workercache-daemon[8990]: [8994] [i] Worker 8994 started
Nov 20 09:16:52 openqaworker4 openqa-workercache-daemon[8990]: [8995] [i] Worker 8995 started
Nov 20 09:16:52 openqaworker4 openqa-workercache-daemon[8990]: [8996] [i] Worker 8996 started
Nov 20 09:16:52 openqaworker4 openqa-workercache-daemon[8990]: [8990] [i] Creating process id file "/tmp/prefork.pid"
Nov 20 09:16:52 openqaworker4 openqa-workercache-daemon[8990]: [8997] [i] Worker 8997 started

And that even though the SQLite integrity check shows errors:

openqaworker4:~ # sqlite3 /var/lib/openqa/cache/cache.sqlite
SQLite version 3.28.0 2019-04-16 19:49:53
Enter ".help" for usage hints.
sqlite> pragma integrity_check;
*** in database main ***
Page 18 is never used
Page 19 is never used
Page 24 is never used
Page 26 is never used
Page 42 is never used
Page 83 is never used
Page 99 is never used
row 1 missing from index downloads_created
row 1 missing from index downloads_lock
row 2 missing from index downloads_created
row 2 missing from index downloads_lock
row 3 missing from index downloads_created
row 3 missing from index downloads_lock
row 22 missing from index downloads_lock
row 31 missing from index downloads_lock
row 35 missing from index downloads_lock
row 39 missing from index downloads_lock
wrong # of entries in index downloads_created
wrong # of entries in index downloads_lock
sqlite>

However, after a restart of the service the repairing seems to be done automatically:

Nov 20 11:04:11 openqaworker4 systemd[1]: Stopping OpenQA Worker Cache Service Minion...
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[8992]: [8992] [i] Worker 8992 stopped
Nov 20 11:04:15 openqaworker4 systemd[1]: openqa-worker-cacheservice-minion.service: Main process exited, code=exited, status=136/n/a
Nov 20 11:04:15 openqaworker4 systemd[1]: Stopped OpenQA Worker Cache Service Minion.
Nov 20 11:04:15 openqaworker4 systemd[1]: openqa-worker-cacheservice-minion.service: Unit entered failed state.
Nov 20 11:04:15 openqaworker4 systemd[1]: openqa-worker-cacheservice-minion.service: Failed with result 'exit-code'.
Nov 20 11:04:15 openqaworker4 systemd[1]: Started OpenQA Worker Cache Service Minion.
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: [16691] [e] [Df2AfljU] Database integrity check found errors:
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: [16691] [e] [Df2AfljU] *** in database main ***
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: Page 18 is never used
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: Page 19 is never used
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: Page 24 is never used
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: Page 26 is never used
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: Page 42 is never used
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: Page 83 is never used
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: Page 99 is never used
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: [16691] [e] [Df2AfljU] row 1 missing from index downloads_created
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: [16691] [e] [Df2AfljU] row 1 missing from index downloads_lock
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: [16691] [e] [Df2AfljU] row 2 missing from index downloads_created
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: [16691] [e] [Df2AfljU] row 2 missing from index downloads_lock
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: [16691] [e] [Df2AfljU] row 3 missing from index downloads_created
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: [16691] [e] [Df2AfljU] row 3 missing from index downloads_lock
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: [16691] [e] [Df2AfljU] row 22 missing from index downloads_lock
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: [16691] [e] [Df2AfljU] row 31 missing from index downloads_lock
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: [16691] [e] [Df2AfljU] row 35 missing from index downloads_lock
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: [16691] [e] [Df2AfljU] row 39 missing from index downloads_lock
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: [16691] [e] [Df2AfljU] wrong # of entries in index downloads_created
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: [16691] [e] [Df2AfljU] wrong # of entries in index downloads_lock
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: [16691] [e] [Df2AfljU] Re-indexing and vacuuming broken database
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: [16691] [e] [Df2AfljU] Database integrity check found errors:
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: [16691] [e] [Df2AfljU] row 1 missing from index downloads_created
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: [16691] [e] [Df2AfljU] row 1 missing from index downloads_lock
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: [16691] [e] [Df2AfljU] row 2 missing from index downloads_created
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: [16691] [e] [Df2AfljU] row 2 missing from index downloads_lock
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: [16691] [e] [Df2AfljU] row 3 missing from index downloads_created
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: [16691] [e] [Df2AfljU] row 3 missing from index downloads_lock
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: [16691] [e] [Df2AfljU] row 19 missing from index downloads_lock
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: [16691] [e] [Df2AfljU] row 22 missing from index downloads_lock
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: [16691] [e] [Df2AfljU] row 23 missing from index downloads_lock
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: [16691] [e] [Df2AfljU] row 28 missing from index downloads_lock
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: [16691] [e] [Df2AfljU] row 31 missing from index downloads_lock
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: [16691] [e] [Df2AfljU] row 35 missing from index downloads_lock
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: [16691] [e] [Df2AfljU] row 36 missing from index downloads_lock
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: [16691] [e] [Df2AfljU] row 39 missing from index downloads_lock
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: [16691] [e] [Df2AfljU] wrong # of entries in index downloads_created
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: [16691] [e] [Df2AfljU] wrong # of entries in index downloads_lock
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: [16691] [e] [Df2AfljU] Purging cache directory because database has been corrupted: Unable to fix errors reported by integrity check
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: [16691] [i] [Df2AfljU] Creating cache directory tree for "/var/lib/openqa/cache"
Nov 20 11:04:16 openqaworker4 openqa-worker-cacheservice-minion[16691]: cannot remove directory for /var/lib/openqa/cache/tmp: No such file or directory at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/File.pm line 120.
Nov 20 11:04:16 openqaworker4 openqa-worker-cacheservice-minion[16691]: cannot unlink file for /var/lib/openqa/cache/openqa1-opensuse/tests/opensuse/products/opensuse/needles/gnome-next-button-20170213.json: No such file or directory at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/File.pm line 120.

I'm wondering why it still cannot fix all errors, failing with "Unable to fix errors reported by integrity check". The errors the integrity check found can usually be fixed via a reindex (see the sketch below).
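
For illustration, such a check-and-reindex repair would amount to something like this (a minimal sketch assuming direct access to the cache database via Mojo::SQLite; the file name is just a placeholder):

    use Mojo::Base -strict;
    use Mojo::SQLite;

    my $db = Mojo::SQLite->new('sqlite:cache.sqlite')->db;

    # Collect all messages from the integrity check; a healthy database
    # returns a single row containing just "ok"
    my @errors = map { $_->[0] } @{$db->query('PRAGMA integrity_check')->arrays};
    if (grep { $_ ne 'ok' } @errors) {
        # Missing index entries and "wrong # of entries" problems can
        # usually be repaired by rebuilding all indexes and compacting
        $db->query('REINDEX');
        $db->query('VACUUM');
    }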

I guess this code in our repair function can also be problematic:

    # remove broken database
    if (my $err = $@) {
        $log->error("Purging cache directory because database has been corrupted: $err");
        $db_file->remove;
    }

Even if it is just for the sake of throwing away a database that is broken anyway, I'm not sure whether simply removing the SQLite file might not cause further trouble. @kraih What do you think (since you've introduced this purging)?

Actions #55

Updated by okurz almost 4 years ago

  • Copied to action #80106: corrupted worker cache sqlite: Enlarge systemd service kill timeout temporarily added
Actions #56

Updated by mkittler almost 4 years ago

To follow up on my last comment:


I'm wondering why it still cannot fix all errors, failing with "Unable to fix errors reported by integrity check". The errors the integrity check found can usually be fixed via a reindex.

This PR fixes that problem: https://github.com/os-autoinst/openQA/pull/3566

I guess this code in our repair function can also be problematic:

That's the corresponding section of the SQLite documentation: https://www.sqlite.org/howtocorrupt.html#_unlinking_or_renaming_a_database_file_while_in_use

With the mentioned PR we will hopefully no longer run into this case. However, it is still questionable whether we should have this code path at all. Maybe this kind of cleanup is better done by an external script which shuts down all services before removing the file; see the sketch below.
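
Such an external cleanup script could look roughly like this (a hypothetical sketch, not existing tooling; service names and the cache path are taken from the logs above):

    use Mojo::Base -strict;
    use Mojo::File 'path';

    # Stop both services first so no process still holds the database open
    system 'systemctl', 'stop', $_
      for qw(openqa-worker-cacheservice-minion openqa-worker-cacheservice);

    # Remove the database file together with its "-wal" and "-shm" sidecars
    $_->remove for grep { -e $_ }
      map { path("/var/lib/openqa/cache/cache.sqlite$_") } '', '-wal', '-shm';

    # Starting the cache service again re-creates a fresh database; the
    # Minion service follows via its systemd dependency
    system 'systemctl', 'start', 'openqa-worker-cacheservice';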


By the way, that's where Mojo::SQLite implements "fork-safety" to avoid using the SQLite handle within forked processes: https://metacpan.org/release/Mojo-SQLite/source/lib/Mojo/SQLite.pm#L75
The only way to circumvent this protection would be passing an SQLite handle established before forking into the forked process (which we don't do).
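
In essence that fork-safety pattern boils down to the following (an illustrative sketch, not the actual Mojo::SQLite code): the PID that created the handle is remembered and the handle is discarded once it no longer matches the current process:

    use Mojo::Base -strict;
    use DBI;

    my %state;

    sub dbh {
        # Throw away the cached handle if it was created in another process
        delete $state{dbh} if defined $state{pid} && $state{pid} != $$;
        return $state{dbh} //= do {
            $state{pid} = $$;
            DBI->connect('dbi:SQLite:dbname=cache.sqlite', '', '',
                {RaiseError => 1});
        };
    }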

Actions #57

Updated by mkittler almost 4 years ago

When investigating #80202 I had a look at the log of openqaworker1's cache service:

Nov 23 03:33:36 openqaworker1 systemd[1]: Started OpenQA Worker Cache Service.
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: [3045] [i] [ADC6V2Xw] Cache size of "/var/lib/openqa/cache" is 19GiB, with limit 400GiB
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: [3045] [i] Listening at "http://127.0.0.1:9530"
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: [3045] [i] Listening at "http://[::1]:9530"
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: [3045] [i] Manager 3045 started
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: Web application available at http://127.0.0.1:9530
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: Web application available at http://[::1]:9530
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: [3292] [i] Worker 3292 started
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: [3293] [i] Worker 3293 started
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: [3294] [i] Worker 3294 started
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: [3045] [i] Creating process id file "/tmp/prefork.pid"
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: [3295] [i] Worker 3295 started
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: [3295] [e] [K7E-kVGB] DBD::SQLite::st execute failed: database disk image is malformed at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/SQLite.pm line 115.
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]:  at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/SQLite.pm line 115.
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: [3294] [e] [K7E-kVGB] DBD::SQLite::st execute failed: database disk image is malformed at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/SQLite.pm line 115.
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]:  at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/SQLite.pm line 115.
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: [3295] [e] Stopping service after critical database error
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: [3045] [i] Stopping worker 3292 gracefully (80 seconds)
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: [3045] [i] Stopping worker 3295 gracefully (80 seconds)
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: [3045] [i] Stopping worker 3293 gracefully (80 seconds)
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: [3045] [i] Stopping worker 3294 gracefully (80 seconds)
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: [3294] [e] Stopping service after critical database error
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: [3045] [i] Worker 3295 stopped
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: [3045] [i] Worker 3292 stopped
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: [3045] [i] Worker 3294 stopped
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: [3045] [i] Worker 3293 stopped
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: [3045] [i] Manager 3045 stopped
Nov 23 03:33:39 openqaworker1 systemd[1]: openqa-worker-cacheservice.service: Main process exited, code=exited, status=232/ADDRESS_FAMILIES
Nov 23 03:33:39 openqaworker1 systemd[1]: openqa-worker-cacheservice.service: Unit entered failed state.
Nov 23 03:33:39 openqaworker1 systemd[1]: openqa-worker-cacheservice.service: Failed with result 'exit-code'.
Nov 23 03:33:39 openqaworker1 systemd[1]: openqa-worker-cacheservice.service: Service RestartSec=100ms expired, scheduling restart.
Nov 23 03:33:44 openqaworker1 systemd[1]: Stopped OpenQA Worker Cache Service.
Nov 23 03:33:44 openqaworker1 systemd[1]: Started OpenQA Worker Cache Service.
Nov 23 03:33:44 openqaworker1 openqa-workercache-daemon[3323]: [3323] [i] [QFKZjHFD] Cache size of "/var/lib/openqa/cache" is 19GiB, with limit 400GiB
Nov 23 03:33:44 openqaworker1 openqa-workercache-daemon[3323]: [3323] [i] Listening at "http://127.0.0.1:9530"
Nov 23 03:33:44 openqaworker1 openqa-workercache-daemon[3323]: [3323] [i] Listening at "http://[::1]:9530"
Nov 23 03:33:44 openqaworker1 openqa-workercache-daemon[3323]: [3323] [i] Manager 3323 started
Nov 23 03:33:44 openqaworker1 openqa-workercache-daemon[3323]: Web application available at http://127.0.0.1:9530
Nov 23 03:33:44 openqaworker1 openqa-workercache-daemon[3323]: Web application available at http://[::1]:9530
Nov 23 03:33:44 openqaworker1 openqa-workercache-daemon[3323]: [3327] [i] Worker 3327 started
Nov 23 03:33:44 openqaworker1 openqa-workercache-daemon[3323]: [3328] [i] Worker 3328 started
Nov 23 03:33:44 openqaworker1 openqa-workercache-daemon[3323]: [3329] [i] Worker 3329 started
Nov 23 03:33:44 openqaworker1 openqa-workercache-daemon[3323]: [3323] [i] Creating process id file "/tmp/prefork.pid"
Nov 23 03:33:44 openqaworker1 openqa-workercache-daemon[3323]: [3330] [i] Worker 3330 started

The weird thing is that it just works again after the restart but there's no trace of the database being repaired (e.g. the log message "Re-indexing and vacuuming broken database"). The log of openqa-workercache-minion doesn't even show any SQLite-related errors. The Minion dashboard also shows jobs from the past 12 hours, so the database hasn't been purged at any point either. So it is strange that we see this database error but no errors at all anymore after the (now automated) service restart. (I also just ran the integrity check manually again to be sure, in case the automated check is buggy.)
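
For reference, the manual check amounts to something like this (a minimal sketch, assuming the database path from the logs):

    use Mojo::Base -strict;
    use DBI;

    my $dbh = DBI->connect(
        'dbi:SQLite:dbname=/var/lib/openqa/cache/cache.sqlite', '', '',
        {RaiseError => 1});

    # Prints "ok" for a healthy database, one line per problem otherwise
    say $_->[0] for @{$dbh->selectall_arrayref('PRAGMA integrity_check')};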

Actions #58

Updated by okurz almost 4 years ago

  • Status changed from Feedback to In Progress

What we found out when looking over logs again together:

Problems started on openqaworker1 at 2020-11-22 07:36:50 when openqa-worker-cacheservice-minion detected that the database disk image is malformed. No messages before that in the complete system journal (journalctl --since="2020-11-22 07:00") could explain why the image would suddenly be malformed:

Nov 22 07:36:50 openqaworker1 openqa-worker-cacheservice-minion[4906]: DBD::SQLite::st execute failed: database disk image is malformed at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend>
Nov 22 07:36:50 openqaworker1 openqa-worker-cacheservice-minion[4906]:  at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 301.
Nov 22 07:36:50 openqaworker1 openqa-worker-cacheservice-minion[4906]:  at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Command/minion/worker.pm line 26.
Nov 22 07:36:50 openqaworker1 systemd[1]: openqa-worker-cacheservice-minion.service: Main process exited, code=exited, status=4/NOPERMISSION
Nov 22 07:36:50 openqaworker1 systemd[1]: openqa-worker-cacheservice-minion.service: Unit entered failed state.
Nov 22 07:36:50 openqaworker1 systemd[1]: openqa-worker-cacheservice-minion.service: Failed with result 'exit-code'.
Nov 22 07:36:51 openqaworker1 systemd[1]: openqa-worker-cacheservice-minion.service: Service RestartSec=100ms expired, scheduling restart.
Nov 22 07:36:51 openqaworker1 systemd[1]: Stopped OpenQA Worker Cache Service Minion.

The service was stopped and restarted automatically by systemd, starting an integrity check soon after:

Nov 22 07:36:51 openqaworker1 systemd[1]: Started OpenQA Worker Cache Service Minion.
Nov 22 07:36:51 openqaworker1 openqa-worker-cacheservice-minion[31276]: [31276] [e] [4MP8m9ei] Database integrity check found errors:
Nov 22 07:36:51 openqaworker1 openqa-worker-cacheservice-minion[31276]: [31276] [e] [4MP8m9ei] *** in database main ***
Nov 22 07:36:51 openqaworker1 openqa-worker-cacheservice-minion[31276]: Page 15 is never used
Nov 22 07:36:51 openqaworker1 openqa-worker-cacheservice-minion[31276]: Page 16 is never used
Nov 22 07:36:51 openqaworker1 openqa-worker-cacheservice-minion[31276]: Page 18 is never used
Nov 22 07:36:51 openqaworker1 openqa-worker-cacheservice-minion[31276]: Page 19 is never used
…

This triggered a repair attempt which seemed to succeed, but then another error happened on the migration attempt, which triggered the attempt to remove the database file "cache.sqlite". Note that with the current code the "-shm" and "-wal" files are not removed (see the sketch after the log excerpt):

Nov 22 07:36:51 openqaworker1 openqa-worker-cacheservice-minion[31276]: [31276] [e] [4MP8m9ei] Re-indexing and vacuuming broken database
Nov 22 07:36:51 openqaworker1 openqa-worker-cacheservice-minion[31276]: [31276] [e] [4MP8m9ei] Purging cache directory because database has been corrupted: DBD::SQLite::st execute failed:>
Nov 22 07:36:51 openqaworker1 openqa-worker-cacheservice-minion[31276]:  at /usr/share/openqa/script/../lib/OpenQA/CacheService/Model/Cache.pm line 66.
Nov 22 07:36:51 openqaworker1 openqa-worker-cacheservice-minion[31276]: [31276] [i] [4MP8m9ei] Creating cache directory tree for "/var/lib/openqa/cache"
Nov 22 07:36:56 openqaworker1 openqa-worker-cacheservice-minion[31276]: [31276] [i] [4MP8m9ei] Cache size of "/var/lib/openqa/cache" is 0 Byte, with limit 400GiB
Nov 22 07:36:57 openqaworker1 openqa-worker-cacheservice-minion[31276]: [31276] [i] Resetting all leftover locks after restart
Nov 22 07:36:57 openqaworker1 openqa-worker-cacheservice-minion[31276]: [31276] [i] Worker 31276 started
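
For completeness, removing the sidecar files as well would amount to something like this (a minimal sketch extending the "$db_file->remove" from the repair function quoted earlier; not the current code):

    use Mojo::Base -strict;
    use Mojo::File 'path';

    my $db_file = path('/var/lib/openqa/cache/cache.sqlite');

    # The "-wal" and "-shm" files belong to the database; remove them with it
    $_->remove for grep { -e $_ } $db_file, path("$db_file-wal"), path("$db_file-shm");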

Note that "openqa-worker-cacheservice" itself did not stop and did not even realize any problems. Likely the server processes were continuing based on an inconsistent database state in memory.

Some minutes later the next openQA worker service finished its job and tried to pick up a new one, which failed with "[error] Worker cache not available: No workers active in the cache service":

Nov 22 07:40:07 openqaworker1 worker[3076]: [info] Isotovideo exit status: 0
Nov 22 07:40:07 openqaworker1 worker[3076]: [info] +++ worker notes +++
Nov 22 07:40:07 openqaworker1 worker[3076]: [info] End time: 2020-11-22 06:40:07
Nov 22 07:40:07 openqaworker1 worker[3076]: [info] Result: done
Nov 22 07:40:07 openqaworker1 worker[3076]: [info] Uploading video.ogv
Nov 22 07:40:07 openqaworker1 worker[3076]: [info] Uploading vars.json
Nov 22 07:40:07 openqaworker1 worker[3076]: [info] Uploading autoinst-log.txt
Nov 22 07:40:07 openqaworker1 worker[3076]: [info] Uploading worker-log.txt
Nov 22 07:40:08 openqaworker1 worker[3076]: [info] Uploading serial0.txt
Nov 22 07:40:08 openqaworker1 worker[3076]: [info] Uploading video_time.vtt
Nov 22 07:40:08 openqaworker1 worker[3076]: [info] Uploading serial_terminal.txt
Nov 22 07:40:12 openqaworker1 worker[3076]: [error] Worker cache not available: No workers active in the cache service

Likely no job was ever picked up and the worker reported itself as "broken" until it checked again for worker cache availability.

Over the next hours workers alternated between the above error message and actually picking up jobs which then incompleted. Likely the worker services were talking to different openqa-worker-cacheservice processes returning inconsistent states.

This all continued until the situation effectively healed itself after the reboot of the system, when the start of openqa-worker-cacheservice (finally) detected the malformed database and re-initialized it (again):

Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: [3295] [e] [K7E-kVGB] DBD::SQLite::st execute failed: database disk image is malformed at /usr/lib/perl5/vendor_perl/5.26.1/>
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]:  at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/SQLite.pm line 115.

This then triggered, as expected, a restart of openqa-worker-cacheservice which then, due to the systemd service dependency, triggered a restart of openqa-worker-cacheservice-minion as well:

Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: [3295] [e] Stopping service after critical database error
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: [3045] [i] Stopping worker 3292 gracefully (80 seconds)
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: [3045] [i] Stopping worker 3295 gracefully (80 seconds)
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: [3045] [i] Stopping worker 3293 gracefully (80 seconds)
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: [3045] [i] Stopping worker 3294 gracefully (80 seconds)
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: [3294] [e] Stopping service after critical database error
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: [3045] [i] Worker 3295 stopped
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: [3045] [i] Worker 3292 stopped
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: [3045] [i] Worker 3294 stopped
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: [3045] [i] Worker 3293 stopped
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: [3045] [i] Manager 3045 stopped
Nov 23 03:33:39 openqaworker1 systemd[1]: openqa-worker-cacheservice.service: Main process exited, code=exited, status=232/ADDRESS_FAMILIES
Nov 23 03:33:39 openqaworker1 systemd[1]: openqa-worker-cacheservice.service: Unit entered failed state.
Nov 23 03:33:39 openqaworker1 systemd[1]: openqa-worker-cacheservice.service: Failed with result 'exit-code'.
…
Nov 23 03:33:39 openqaworker1 systemd[1]: openqa-worker-cacheservice.service: Service RestartSec=100ms expired, scheduling restart.
Nov 23 03:33:39 openqaworker1 systemd[1]: Stopping OpenQA Worker Cache Service Minion...
Nov 23 03:33:44 openqaworker1 openqa-worker-cacheservice-minion[3060]: [3060] [i] Worker 3060 stopped
Nov 23 03:33:44 openqaworker1 systemd[1]: openqa-worker-cacheservice-minion.service: Main process exited, code=exited, status=144/n/a
Nov 23 03:33:44 openqaworker1 systemd[1]: Stopped OpenQA Worker Cache Service Minion.
Nov 23 03:33:44 openqaworker1 systemd[1]: openqa-worker-cacheservice-minion.service: Unit entered failed state.
Nov 23 03:33:44 openqaworker1 systemd[1]: openqa-worker-cacheservice-minion.service: Failed with result 'exit-code'.
Nov 23 03:33:44 openqaworker1 systemd[1]: Stopped OpenQA Worker Cache Service.
Nov 23 03:33:44 openqaworker1 systemd[1]: Started OpenQA Worker Cache Service.
Nov 23 03:33:44 openqaworker1 systemd[1]: Started OpenQA Worker Cache Service Minion.
Nov 23 03:33:44 openqaworker1 openqa-workercache-daemon[3323]: [3323] [i] [QFKZjHFD] Cache size of "/var/lib/openqa/cache" is 19GiB, with limit 400GiB
Nov 23 03:33:44 openqaworker1 openqa-worker-cacheservice-minion[3324]: [3324] [i] [MCXi6dwt] Cache size of "/var/lib/openqa/cache" is 19GiB, with limit 400GiB
Nov 23 03:33:44 openqaworker1 openqa-workercache-daemon[3323]: [3323] [i] Listening at "http://127.0.0.1:9530"
Nov 23 03:33:44 openqaworker1 openqa-workercache-daemon[3323]: [3323] [i] Listening at "http://[::1]:9530"
Nov 23 03:33:44 openqaworker1 openqa-workercache-daemon[3323]: [3323] [i] Manager 3323 started
Nov 23 03:33:44 openqaworker1 openqa-workercache-daemon[3323]: Web application available at http://127.0.0.1:9530
Nov 23 03:33:44 openqaworker1 openqa-workercache-daemon[3323]: Web application available at http://[::1]:9530
Nov 23 03:33:44 openqaworker1 openqa-worker-cacheservice-minion[3324]: [3324] [i] Resetting all leftover locks after restart
Nov 23 03:33:44 openqaworker1 openqa-worker-cacheservice-minion[3324]: [3324] [i] Worker 3324 started
Nov 23 03:33:44 openqaworker1 openqa-workercache-daemon[3323]: [3327] [i] Worker 3327 started
Nov 23 03:33:44 openqaworker1 openqa-workercache-daemon[3323]: [3328] [i] Worker 3328 started
Nov 23 03:33:44 openqaworker1 openqa-workercache-daemon[3323]: [3329] [i] Worker 3329 started
Nov 23 03:33:44 openqaworker1 openqa-workercache-daemon[3323]: [3323] [i] Creating process id file "/tmp/prefork.pid"
Nov 23 03:33:44 openqaworker1 openqa-workercache-daemon[3323]: [3330] [i] Worker 3330 started

and now the situation was in sync again, with a clean database and Minion workers available to pick up work. All in all, the impact was many unhandled incompletes with a new reason (handled in #80202) and the unavailability of openqaworker1 for 20 h, but the situation eventually healed itself.

Suggestions for improvement:

Actions #59

Updated by okurz almost 4 years ago

  • Related to action #80202: jobs incomplete with auto_review:"setup failure: No workers active in the cache service":retry added
Actions #60

Updated by okurz almost 4 years ago

  • Subject changed from Job incompletes due to malformed worker cache database disk image with auto_review:"Cache service status error.*database disk image is malformed.*":retry to Job incompletes due to malformed worker cache database disk image with auto_review:"Cache service status error.*(database disk image is malformed|Specified job ID is invalid).*":retry
Actions #61

Updated by okurz almost 4 years ago

https://openqa.opensuse.org/tests/1482184 on openqaworker4 this morning: "Reason: setup failure: Cache service status error from API: Specified job ID is invalid", extended auto_review expression.

From log on openqaworker4:

Nov 24 02:08:12 openqaworker4 openqa-worker-cacheservice-minion[7772]: [4377] [i] [#2884] Downloading: "openSUSE-Staging:K-Tumbleweed-DVD-x86_64-Build249.1-Media.iso"
Nov 24 02:08:12 openqaworker4 openqa-worker-cacheservice-minion[7772]: [4377] [i] [#2884] Cache size of "/var/lib/openqa/cache" is 43GiB, with limit 50GiB
Nov 24 02:08:12 openqaworker4 openqa-worker-cacheservice-minion[7772]: [4377] [i] [#2884] Downloading "openSUSE-Staging:K-Tumbleweed-DVD-x86_64-Build249.1-Media.iso" from "http://openqa1->
Nov 24 02:08:12 openqaworker4 openqa-workercache-daemon[7771]: [7782] [e] [nizmOeKc] DBD::SQLite::st execute failed: database disk image is malformed at /usr/lib/perl5/vendor_perl/5.26.1/>
Nov 24 02:08:12 openqaworker4 openqa-workercache-daemon[7771]:  at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 138.
Nov 24 02:08:12 openqaworker4 worker[2240]: [info] Download of openSUSE-Staging:K-Tumbleweed-DVD-x86_64-Build249.1-Media.iso.sha256 processed:
Nov 24 02:08:12 openqaworker4 worker[2240]: [info] [#2878] Cache size of "/var/lib/openqa/cache" is 43GiB, with limit 50GiB
Nov 24 02:08:12 openqaworker4 worker[2240]: [info] [#2878] Downloading "openSUSE-Staging:K-Tumbleweed-DVD-x86_64-Build249.1-Media.iso.sha256" from "http://openqa1-opensuse/tests/1482133/a>
Nov 24 02:08:12 openqaworker4 worker[2240]: [info] [#2878] Size of "/var/lib/openqa/cache/openqa1-opensuse/openSUSE-Staging:K-Tumbleweed-DVD-x86_64-Build249.1-Media.iso.sha256" is 118 Byt>
Nov 24 02:08:12 openqaworker4 worker[2240]: [info] [#2878] Download of "/var/lib/openqa/cache/openqa1-opensuse/openSUSE-Staging:K-Tumbleweed-DVD-x86_64-Build249.1-Media.iso.sha256" succes>
Nov 24 02:08:12 openqaworker4 openqa-workercache-daemon[7771]: [7782] [e] Stopping service after critical database error
Nov 24 02:08:12 openqaworker4 openqa-workercache-daemon[7771]: [7771] [i] Stopping worker 7784 gracefully (80 seconds)

so suddenly the cache database turned bad and openqa-workercache-daemon terminated itself.
At this time the worker was still running without https://github.com/os-autoinst/openQA/pull/3579, but the startup still looks clean because it was initiated by openqa-workercache-daemon, which already restarts the minion service through systemd:

Nov 24 02:08:23 openqaworker4 systemd[1]: Stopped OpenQA Worker Cache Service.
Nov 24 02:08:23 openqaworker4 systemd[1]: Started OpenQA Worker Cache Service.
Nov 24 02:08:23 openqaworker4 systemd[1]: Started OpenQA Worker Cache Service Minion.
…
Nov 24 02:08:23 openqaworker4 openqa-workercache-daemon[4385]: [4385] [e] [GwomvWTX] Re-indexing and vacuuming broken database
Nov 24 02:08:23 openqaworker4 openqa-workercache-daemon[4385]: [4385] [i] [GwomvWTX] Purging "/var/lib/openqa/cache/openqa1-opensuse/openSUSE-Staging:K-Tumbleweed-DVD-x86_64-Build249.1-Me>
Nov 24 02:08:23 openqaworker4 openqa-workercache-daemon[4385]: [4385] [i] [GwomvWTX] Cache size of "/var/lib/openqa/cache" is 43GiB, with limit 50GiB
Nov 24 02:08:23 openqaworker4 openqa-worker-cacheservice-minion[4386]: [4386] [i] [Pplh3Aqx] Cache size of "/var/lib/openqa/cache" is 43GiB, with limit 50GiB
Nov 24 02:08:24 openqaworker4 openqa-workercache-daemon[4385]: [4385] [i] Listening at "http://127.0.0.1:9530"
Nov 24 02:08:24 openqaworker4 openqa-workercache-daemon[4385]: [4385] [i] Listening at "http://[::1]:9530"
Nov 24 02:08:24 openqaworker4 openqa-workercache-daemon[4385]: [4385] [i] Manager 4385 started
Nov 24 02:08:24 openqaworker4 openqa-workercache-daemon[4385]: Web application available at http://127.0.0.1:9530
Nov 24 02:08:24 openqaworker4 openqa-workercache-daemon[4385]: Web application available at http://[::1]:9530
Nov 24 02:08:24 openqaworker4 openqa-workercache-daemon[4385]: [4389] [i] Worker 4389 started
Nov 24 02:08:24 openqaworker4 openqa-workercache-daemon[4385]: [4390] [i] Worker 4390 started
Nov 24 02:08:24 openqaworker4 openqa-workercache-daemon[4385]: [4391] [i] Worker 4391 started
Nov 24 02:08:24 openqaworker4 openqa-workercache-daemon[4385]: [4385] [i] Creating process id file "/tmp/prefork.pid"
Nov 24 02:08:24 openqaworker4 openqa-workercache-daemon[4385]: [4392] [i] Worker 4392 started
Nov 24 02:08:24 openqaworker4 openqa-worker-cacheservice-minion[4386]: [4386] [i] Resetting all leftover locks after restart
Nov 24 02:08:24 openqaworker4 openqa-worker-cacheservice-minion[4386]: [4386] [i] Worker 4386 started

On openqaworker7 I saw the following after bootup:

Nov 24 03:42:56 openqaworker7 openqa-worker-cacheservice-minion[3911]: [3911] [i] Resetting all leftover locks after restart
Nov 24 03:42:56 openqaworker7 openqa-worker-cacheservice-minion[3911]: [3911] [i] Worker 3911 started
Nov 24 03:42:56 openqaworker7 openqa-workercache-daemon[3622]: [3910] [e] [2O1O1owz] DBD::SQLite::st execute failed: UNIQUE constraint failed: mojo_migrations.name at /usr/lib/perl5/vendo>
Nov 24 03:42:56 openqaworker7 openqa-workercache-daemon[3622]:  at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/SQLite.pm line 115.

but then other tests continued fine. It seems the error had no effect, yet it was neither caught nor did it trigger a reset of the DB. This should be checked.

Actions #62

Updated by okurz almost 4 years ago

https://sqlite.org/lpc2019/doc/trunk/briefing.md mentions "Rollback is therefore the default. The write-ahead log (WAL mode) is faster, but only works on systems for which either (1) only a single process accesses the database at a time or (2) a mmap()-ed file can be used as shared memory.". Judging from the presence of a "WAL" file I assume we use "write-ahead". We could try one of the other modes to see whether that prevents the database corruption.

from kraih: "btw. to use DELETE mode there's a no_wal=1 option in Mojo::SQLite https://metacpan.org/pod/Mojo::SQLite#from_string . only problem is that the old sqlite file will need to be deleted to do the switch. WAL mode is actually not the default for sqlite, it's just Mojo::SQLite changing the default from DELETE mode"
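
A minimal sketch of using that option (assuming a fresh database file; the name is a placeholder):

    use Mojo::Base -strict;
    use Mojo::SQLite;

    # no_wal=1 keeps SQLite's default DELETE journaling mode instead of
    # switching the newly created database file to WAL mode
    my $sql = Mojo::SQLite->new('sqlite:cache.sqlite?no_wal=1');
    say $sql->db->query('PRAGMA journal_mode')->array->[0];    # "delete"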

Actions #63

Updated by mkittler almost 4 years ago

https://openqa.opensuse.org/tests/1482184 on openqaworker4 this morning: "Reason: setup failure: Cache service status error from API: Specified job ID is invalid", extended auto_review expression.

I've seen it before as well, e.g. https://progress.opensuse.org/issues/80202#note-4

Judging from the presence of a "WAL" file I assume we use "write-ahead".

That's correct. And we also use the mmap()-ed file so our multi-process use case should be covered.

We could try one of the other modes to see whether that prevents the database corruption.

I've created https://github.com/os-autoinst/openQA/pull/3586 so we could try disabling WAL mode on a worker to see whether it makes a difference.

Actions #64

Updated by nicksinger almost 4 years ago

The systemd timeouts were adjusted by me last Friday to 300s. I remember @mkittler mentioning jobs running even longer than that. Let me know if I should bump them once again or if you'd rather pursue another possibility.

Actions #65

Updated by okurz almost 4 years ago

I checked today with

for i in aarch64 openqaworker1 openqaworker4 openqaworker7 power8 imagetester rebel; do echo $i && ssh root@$i "systemctl --no-legend --failed" ; done

and found:

openqaworker7
openqa-worker-cacheservice-minion.service loaded failed failed OpenQA Worker Cache Service Minion
openqa-worker-cacheservice.service        loaded failed failed OpenQA Worker Cache Service

did on openqaworker7:

systemctl start openqa-worker.target openqa-worker-cacheservice.service openqa-worker-cacheservice-minion.service

Please check the logs.

Actions #66

Updated by mkittler almost 4 years ago

On openqaworker7 I saw only the following SQL errors lately:

Nov 25 05:09:56 openqaworker7 openqa-workercache-daemon[3858]: [3864] [e] [ziUeQAqI] DBD::SQLite::st execute failed: file is not a database at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 363.
Nov 25 05:09:56 openqaworker7 openqa-workercache-daemon[3858]:  at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 363.
Nov 25 05:09:56 openqaworker7 openqa-workercache-daemon[3858]: [3864] [e] [00sUh8eh] DBD::SQLite::st execute failed: file is not a database at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 363.
Nov 25 05:09:56 openqaworker7 openqa-workercache-daemon[3858]:  at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 363.
Nov 25 05:09:56 openqaworker7 openqa-workercache-daemon[3858]: [3864] [e] [NSRbWKla] DBD::SQLite::st execute failed: file is not a database at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 363.
Nov 25 05:09:56 openqaworker7 openqa-workercache-daemon[3858]:  at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 363.
Nov 25 05:11:42 openqaworker7 openqa-workercache-daemon[3858]: [3864] [e] [gVS33GSC] DBD::SQLite::st execute failed: file is not a database at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 363.
Nov 25 05:11:42 openqaworker7 openqa-workercache-daemon[3858]:  at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 363.
Nov 25 05:13:17 openqaworker7 openqa-workercache-daemon[3858]: [3864] [e] [QexCEyDn] DBD::SQLite::st execute failed: file is not a database at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 363.
Nov 25 05:13:17 openqaworker7 openqa-workercache-daemon[3858]:  at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 363.
Nov 25 05:13:22 openqaworker7 openqa-workercache-daemon[3858]: [3864] [e] [sSslKzHD] DBD::SQLite::st execute failed: file is not a database at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 363.
Nov 25 05:13:22 openqaworker7 openqa-workercache-daemon[3858]:  at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 363.

Since I hadn't seen this error before, it isn't detected by my change for stopping the service after fatal errors. So the service continued to run after that. The next lines in the journal concerning the cache service were logged only 4 hours later:

Nov 25 09:18:37 openqaworker7 systemd[1]: Stopping OpenQA Worker Cache Service...
Nov 25 09:18:37 openqaworker7 openqa-workercache-daemon[3858]: [3858] [w] Stopping worker 3865 immediately
Nov 25 09:18:37 openqaworker7 openqa-workercache-daemon[3858]: [3858] [w] Stopping worker 3863 immediately
Nov 25 09:18:37 openqaworker7 openqa-workercache-daemon[3858]: [3858] [w] Stopping worker 3864 immediately
Nov 25 09:18:37 openqaworker7 openqa-workercache-daemon[3858]: [3858] [w] Stopping worker 3862 immediately
Nov 25 09:18:37 openqaworker7 openqa-workercache-daemon[3858]: [3858] [i] Worker 3862 stopped
Nov 25 09:18:37 openqaworker7 openqa-workercache-daemon[3858]: [3858] [i] Worker 3863 stopped
Nov 25 09:18:37 openqaworker7 openqa-workercache-daemon[3858]: [3858] [i] Worker 3864 stopped
Nov 25 09:18:37 openqaworker7 openqa-workercache-daemon[3858]: [3858] [i] Worker 3865 stopped
Nov 25 09:18:37 openqaworker7 openqa-workercache-daemon[3858]: [3858] [i] Manager 3858 stopped
Nov 25 09:18:37 openqaworker7 systemd[1]: openqa-worker-cacheservice.service: Main process exited, code=exited, status=144/n/a
Nov 25 09:18:37 openqaworker7 systemd[1]: Stopped OpenQA Worker Cache Service.
Nov 25 09:18:37 openqaworker7 systemd[1]: openqa-worker-cacheservice.service: Unit entered failed state.
Nov 25 09:18:37 openqaworker7 systemd[1]: openqa-worker-cacheservice.service: Failed with result 'exit-code'.
Nov 25 10:05:57 openqaworker7 systemd[1]: Started OpenQA Worker Cache Service.
Nov 25 10:05:57 openqaworker7 systemd[1]: Started OpenQA Worker Cache Service Minion.

Maybe this is when @okurz restarted the service? The system was rebooted at Nov 25 10:06:53 and then rebooted again a few times. Nothing special happens in that time frame anymore.


On openqaworker1 and openqaworker4 we see the usual SQLite problems in the log which lead to the services being restarted. However, no failing integrity checks have been logged since the last deployment, so https://github.com/os-autoinst/openQA/pull/3584 couldn't actually have made a difference.

Actions #67

Updated by mkittler almost 4 years ago

I've just noticed that we use ext2 as the filesystem where the SQLite file is stored. I've never used ext2 myself and I'm wondering whether it might help to use a different filesystem; might locking, for example, be an issue with ext2 compared to newer filesystems?

Actions #68

Updated by mkittler almost 4 years ago

Regarding ext2:

[11:29] ext2 can easily have data corruption on unclean shutdown, but locking is AFAIK fine


I've now enabled the DELETE journaling mode on openqaworker4 (and I'm keeping the WAL journaling mode on openqaworker1). This way we can compare whether it makes a difference. (So far the SQLite errors have been seen on both workers quite frequently.)

The relevant configuration files are /etc/systemd/system/openqa-worker-cacheservice.service.d/poo67000.conf and /etc/systemd/system/openqa-worker-cacheservice-minion.service.d/poo67000.conf, both with the following content:

[Service]
Environment=OPENQA_CACHE_SERVICE_SQLITE_JOURNAL_MODE=delete
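
Roughly, the effect of that variable corresponds to the following (illustrative only, not the actual openQA code; the file name is a placeholder):

    use Mojo::Base -strict;
    use Mojo::SQLite;

    # Fall back to the previous default (WAL) when the variable is unset
    my $mode = lc($ENV{OPENQA_CACHE_SERVICE_SQLITE_JOURNAL_MODE} // 'wal');
    my $dsn  = 'sqlite:cache.sqlite';
    $dsn .= '?no_wal=1' if $mode eq 'delete';
    my $sql = Mojo::SQLite->new($dsn);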

To undo this change:

  1. Delete the files again.
  2. systemctl daemon-reload
  3. systemctl stop openqa-worker-cacheservice{,-minion}
  4. systemctl start openqa-worker-cacheservice
  5. Check whether the service behaves as expected and whether /var/lib/openqa/cache/ contains a WAL file.
  6. systemctl start openqa-worker-cacheservice-minion
Actions #69

Updated by okurz almost 4 years ago

  • Related to action #73339: auto_review:"setup failure: Cache service status error from API: Minion job.* failed: Can't use an undefined value as a HASH reference at.*" added
Actions #70

Updated by mkittler almost 4 years ago

PR for using DELETE journaling mode by default since it worked out well¹ on openqaworker4 so far: https://github.com/os-autoinst/openQA/pull/3592

¹ The query

    select state, t_started, (select host from workers where id = assigned_worker_id), reason
      from jobs
      where t_started > '2020-11-25 12:17:44' and result = 'incomplete' and reason like '%Cache service%';

only returns 20 jobs, none of them ran on openqaworker4, and the word "SQLite" isn't present in openqaworker4's journal since 2020-11-25 12:17:44.

Actions #71

Updated by mkittler almost 4 years ago

It looks like enabling the DELETE journaling mode has helped so far. There are no incomplete jobs returned by the query from my previous comment after the deployment, not even when searching for %No workers active%.

On power8 the issue was still present after the nightly deployment time, e.g. Nov 27 08:08:38 power8 worker[119107]: [error] Unable to setup job 1485673: Cache service status error from API: Minion job #48922 failed: DBD::SQLite::st execute failed: database disk image is malformed at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 138.. However, this worker hadn't been updated, so I've now updated the worker package there manually.

There were some SQLite errors logged shortly after the nightly reboot, but these problems resolved themselves and there were no further SQLite errors:

Nov 27 03:34:26 openqaworker1 openqa-workercache-daemon[3098]: [3098] [e] [GAqgNxDL] Database has been corrupted: DBD::SQLite::db do failed: database is locked at /usr/share/openqa/script/../lib/OpenQA/CacheService.pm line 82.
Nov 27 03:34:26 openqaworker1 openqa-workercache-daemon[3098]: [3098] [e] [GAqgNxDL] Killing processes accessing the database file handles and removing database
Nov 27 03:34:26 openqaworker1 openqa-worker-cacheservice-minion[3109]: [3109] [e] [p6d97lXH] Database has been corrupted: DBD::SQLite::db do failed: database is locked at /usr/share/openqa/script/../lib/OpenQA/CacheService.pm line 82.
Nov 27 03:34:26 openqaworker1 openqa-worker-cacheservice-minion[3109]: [3109] [e] [p6d97lXH] Killing processes accessing the database file handles and removing database
Nov 27 03:34:26 openqaworker1 openqa-worker-cacheservice-minion[3109]: [3109] [i] [p6d97lXH] Creating cache directory tree for "/var/lib/openqa/cache"
Nov 27 03:34:26 openqaworker1 openqa-workercache-daemon[3098]: [3098] [i] [GAqgNxDL] Creating cache directory tree for "/var/lib/openqa/cache"
Nov 27 03:38:57 openqaworker7 openqa-workercache-daemon[3930]: [3930] [i] [Q8NeRHTH] Creating cache directory tree for "/var/lib/openqa/cache"
Nov 27 03:38:57 openqaworker7 openqa-worker-cacheservice-minion[3931]: [3931] [i] [SoISGbrc] Creating cache directory tree for "/var/lib/openqa/cache"
Nov 27 03:38:57 openqaworker7 openqa-worker-cacheservice-minion[3931]: [3931] [i] [SoISGbrc] Cache size of "/var/lib/openqa/cache" is 0 Byte, with limit 180GiB
Nov 27 03:38:57 openqaworker7 openqa-workercache-daemon[3930]: Deploying cache database to "/var/lib/openqa/cache/cache.sqlite" failed (Maybe the file is corrupted and needs to be deleted?): DBD::SQLite::st execute failed: UNIQUE constraint failed: mojo_migrations.name at /usr/share/openqa/script/../lib/OpenQA/Cac>
Nov 27 03:38:57 openqaworker7 openqa-workercache-daemon[3930]:  at /usr/share/openqa/script/../lib/OpenQA/CacheService/Model/Cache.pm line 96.
Nov 27 03:38:57 openqaworker7 openqa-workercache-daemon[3930]:  at /usr/share/openqa/script/../lib/OpenQA/CacheService.pm line 97.
Nov 27 03:38:57 openqaworker7 systemd[1]: openqa-worker-cacheservice.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Nov 27 03:38:57 openqaworker7 systemd[1]: openqa-worker-cacheservice.service: Unit entered failed state.
Nov 27 03:38:57 openqaworker7 systemd[1]: openqa-worker-cacheservice.service: Failed with result 'exit-code'.
Nov 27 03:38:57 openqaworker7 openqa-worker-cacheservice-minion[3931]: [3931] [i] Resetting all leftover locks after restart
Nov 27 03:38:57 openqaworker7 openqa-worker-cacheservice-minion[3931]: [3931] [i] Worker 3931 started
Nov 27 03:38:57 openqaworker7 systemd[1]: openqa-worker-cacheservice.service: Service RestartSec=100ms expired, scheduling restart.
Nov 27 03:38:57 openqaworker7 systemd[1]: Stopping OpenQA Worker Cache Service Minion...
Nov 27 03:39:02 openqaworker7 openqa-worker-cacheservice-minion[3931]: [3931] [i] Worker 3931 stopped
Nov 27 03:39:02 openqaworker7 systemd[1]: openqa-worker-cacheservice-minion.service: Main process exited, code=exited, status=48/n/a
Nov 27 03:39:02 openqaworker7 systemd[1]: Stopped OpenQA Worker Cache Service Minion.
Nov 27 03:39:02 openqaworker7 systemd[1]: openqa-worker-cacheservice-minion.service: Unit entered failed state.
Nov 27 03:39:02 openqaworker7 systemd[1]: openqa-worker-cacheservice-minion.service: Failed with result 'exit-code'.
Nov 27 03:39:02 openqaworker7 systemd[1]: Stopped OpenQA Worker Cache Service.
Nov 27 03:39:02 openqaworker7 systemd[1]: Started OpenQA Worker Cache Service.
Nov 27 03:39:02 openqaworker7 systemd[1]: Started OpenQA Worker Cache Service Minion.
Nov 27 03:34:52 openqa-aarch64 openqa-worker-cacheservice-minion[2936]: [2936] [e] [6c9O_fq2] Database has been corrupted: DBD::SQLite::db do failed: database is locked at /usr/share/openqa/script/../lib/OpenQA/CacheService.pm line 82.
Nov 27 03:34:52 openqa-aarch64 openqa-worker-cacheservice-minion[2936]: [2936] [e] [6c9O_fq2] Killing processes accessing the database file handles and removing database
Nov 27 03:34:52 openqa-aarch64 openqa-workercache-daemon[2926]: [2926] [e] [pw-GcYQZ] Database has been corrupted: DBD::SQLite::db do failed: database is locked at /usr/share/openqa/script/../lib/OpenQA/CacheService.pm line 82.
Nov 27 03:34:52 openqa-aarch64 openqa-workercache-daemon[2926]: [2926] [e] [pw-GcYQZ] Killing processes accessing the database file handles and removing database
Actions #72

Updated by mkittler almost 4 years ago

failed_since="timestamp '2020-11-27 03:30:00'" show_job_ids=1 show_worker_hosts=1 ./openqa-incompletes-stats

shows no further relevant jobs and there are no SQLite errors in the journal (checked on 3 o3 workers).

The change hasn't been deployed on OSD yet. There we have 626 incompletes for #78169 alone (in the same timespan). Maybe we want to deploy there sooner than usual?

Actions #73

Updated by mkittler almost 4 years ago

Still no more relevant jobs on o3, and for OSD it also looks good after today's deployment (host=openqa.suse.de failed_since="timestamp '2020-12-02 07:12:00'" show_job_ids=1 show_worker_hosts=1 ./openqa-incompletes-stats shows only unrelated jobs).

Actions #74

Updated by livdywan almost 4 years ago

post mortem

  • retrospective
    • the issue became visible several months ago, but likely existed for years
    • have we always been using WAL? Yes
    • explicit transactions were used years ago, which made reads exclusive, but that would have been inconsistent with Minion
    • adding new column and new table might have increased the frequency of corruption
    • logic to remove the DB in case of problems was never used? It probably contributed to the corruption but never actually deleted the DB
    • all of the changes helped us understand the issues better
  • user impact
    • a small issue has a big impact on end users
    • there was no automatic restarting, no self-repair and no failure
    • jobs became incompletes instead of failing
    • gradually got worse due to increased frequency
  • superficial mitigation
    • #opensuse-factory
    • repairing SQLite by hand
    • auto_review restarted a lot of jobs -> the number of incompletes shows the effects of further fixes
  • upstream
    • properties of ext2; SQLite heavily relies on locking; very fast NVMe drives
    • an easy reproducer for a malformed SQLite database was not feasible
    • Perl SQLite bindings: contacting the maintainer didn't yield anything; Mojo::SQLite uses write-ahead mode by default; Minion uses Mojo::SQLite
    • Mojo::SQLite: the default DELETE mode has a 5s timeout and users very commonly run into that; consider an addition to the upstream docs?
  • what could be improved
    • a lot of similar issues reported, a lot of time spent
    • auto_review never looked into resolved tickets
    • we could have spent more time early on?
  • what went well
    • we can be more confident extending the cache service again
    • Marius took over after Sebastian stepped back for a bit
    • we managed to keep up the urgency w/o blaming anyone in particular
    • auto_review, incompletes, Grafana etc. worked well
Actions #75

Updated by livdywan almost 4 years ago

Next steps:

  • check for work-arounds to be removed
  • add a big warning in the code so that nobody changes the mode again
  • check all related tickets to see if specific symptoms are still there or work-around were applied
  • better stress tests? (see the sketch after this list)
  • have auto_review look into resolved tickets
    • could find rejected or closed tickets not deployed yet
    • queue will grow substantially
    • make it an additional/optional source of information
  • reminder: more brainstorming on complicated issues
  • reminder: handling outbound communication to keep the team focussed is good
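
Regarding the stress-test idea above: a hypothetical sketch of a concurrency stress test against an SQLite database (illustrative names; as noted in the post mortem, an easy reproducer for the corruption was not found):

    use Mojo::Base -strict;
    use Mojo::SQLite;

    my $sql = Mojo::SQLite->new('sqlite:stress.db');
    $sql->db->query('create table if not exists t (id integer primary key, v text)');

    # Fork a few writers that hammer the same database concurrently
    my @pids;
    for my $n (1 .. 4) {
        defined(my $pid = fork) or die "fork failed: $!";
        if ($pid) { push @pids, $pid; next }
        # Create a fresh handle in the child (Mojo::SQLite's fork-safety
        # would do this automatically anyway)
        my $db = Mojo::SQLite->new('sqlite:stress.db')->db;
        $db->query('insert into t (v) values (?)', "writer $n row $_") for 1 .. 1000;
        exit 0;
    }
    waitpid $_, 0 for @pids;

    # "ok" means this run did not manage to corrupt the database
    say $sql->db->query('PRAGMA integrity_check')->array->[0];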
Actions #76

Updated by mkittler almost 4 years ago

  • Status changed from In Progress to Resolved

We've already improved the previous workaround so that we no longer delete the SQLite database file while processes still have it open. So I suppose we can keep the workaround in its current state. I also don't see any leftover workarounds for this ticket in production anymore. There have also been no problems on OSD and o3 anymore since the switch to the DELETE journaling mode.
