Project

General

Profile

action #67000

coordination #39719: [saga][epic] Detect "known failures" and mark jobs as such to make tests more stable, reviewing test results and tracking known issues easier

coordination #62420: [epic] Distinguish all types of incompletes

Job incompletes due to malformed worker cache database disk image with auto_review:"Cache service status error.*(database disk image is malformed|Specified job ID is invalid).*":retry

Added by okurz 8 months ago. Updated about 2 months ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
Concrete Bugs
Target version:
Start date:
2020-05-18
Due date:
% Done:

0%

Estimated time:
Difficulty:

Description

Observation

https://openqa.opensuse.org/tests/1271133 incompleted with, maybe due to out-of-space corrupting db content?

Likely error from autoinst-log.txt:

[2020-05-18T14:21:52.0879 CEST] [info] +++ setup notes +++
[2020-05-18T14:21:52.0879 CEST] [info] Running on openqa-aarch64:5 (Linux 4.12.14-lp151.28.48-default #1 SMP Fri Apr 17 05:38:36 UTC 2020 (18849d1) aarch64)
[2020-05-18T14:21:52.0891 CEST] [debug] Found ASSET_1, caching Tumbleweed.aarch64-1.0-libvirt_aarch64-Snapshot20200517.vagrant.libvirt.box
[2020-05-18T14:21:52.0895 CEST] [info] Downloading Tumbleweed.aarch64-1.0-libvirt_aarch64-Snapshot20200517.vagrant.libvirt.box, request #14735 sent to Cache Service
[2020-05-18T14:22:42.0696 CEST] [info] Download of Tumbleweed.aarch64-1.0-libvirt_aarch64-Snapshot20200517.vagrant.libvirt.box processed
[2020-05-18T14:22:42.0714 CEST] [debug] Found ASSET_256, caching openSUSE-Tumbleweed-DVD-aarch64-Snapshot20200517-Media.iso.sha256
[2020-05-18T14:22:42.0719 CEST] [info] Downloading openSUSE-Tumbleweed-DVD-aarch64-Snapshot20200517-Media.iso.sha256, request #14738 sent to Cache Service
[2020-05-18T14:22:52.0865 CEST] [info] Download of openSUSE-Tumbleweed-DVD-aarch64-Snapshot20200517-Media.iso.sha256 processed:
[info] [#14738] Cache size of "/var/lib/openqa/cache" is 157GiB, with limit 160GiB
[info] [#14738] Downloading "openSUSE-Tumbleweed-DVD-aarch64-Snapshot20200517-Media.iso.sha256" from "http://openqa1-opensuse/tests/1271133/asset/other/openSUSE-Tumbleweed-DVD-aarch64-Snapshot20200517-Media.iso.sha256"
[info] [#14738] Content of "/var/lib/openqa/cache/openqa1-opensuse/openSUSE-Tumbleweed-DVD-aarch64-Snapshot20200517-Media.iso.sha256" has not changed, updating last use

[2020-05-18T14:22:52.0874 CEST] [debug] Found HDD_1, caching opensuse-Tumbleweed-aarch64-20200517-textmode@aarch64.qcow2
[2020-05-18T14:22:52.0879 CEST] [info] Downloading opensuse-Tumbleweed-aarch64-20200517-textmode@aarch64.qcow2, request #14740 sent to Cache Service
[2020-05-18T14:22:57.0961 CEST] [info] Download of opensuse-Tumbleweed-aarch64-20200517-textmode@aarch64.qcow2 processed
[2020-05-18T14:22:57.0962 CEST] [error] Failed to download opensuse-Tumbleweed-aarch64-20200517-textmode@aarch64.qcow2 to /var/lib/openqa/cache/openqa1-opensuse/opensuse-Tumbleweed-aarch64-20200517-textmode@aarch64.qcow2
[2020-05-18T14:22:57.0992 CEST] [info] +++ worker notes +++
[2020-05-18T14:22:57.0992 CEST] [info] End time: 2020-05-18 12:22:57
[2020-05-18T14:22:57.0993 CEST] [info] Result: setup failure
[2020-05-18T14:22:58.0019 CEST] [info] Uploading autoinst-log.txt

This might be related to my actions earlier this day. Previously the cache directory was set to a size of 200GB and I had to reduce that to 160GB as the available space was exceeded in before. After changing the configuration I restarted all services, journal output:

May 18 08:10:21 openqa-aarch64 openqa-worker-cacheservice-minion[11087]: [11087] [i] [raw1f4qa] Cache size 198GiB + needed 0 Byte exceeds limit of 160GiB, purging least used assets
May 18 08:10:21 openqa-aarch64 openqa-workercache-daemon[11086]: [11086] [i] [QgQl857h] Cache size 198GiB + needed 0 Byte exceeds limit of 160GiB, purging least used assets
May 18 08:10:21 openqa-aarch64 openqa-worker-cacheservice-minion[11087]: [11087] [i] [raw1f4qa] Purging "/var/lib/openqa/cache/openqa1-opensuse/opensuse-15.2-aarch64-164.1-gnome-x11@aarch64.qcow2" because we need space for n>
May 18 08:10:21 openqa-aarch64 openqa-workercache-daemon[11086]: [11086] [i] [QgQl857h] Purging "/var/lib/openqa/cache/openqa1-opensuse/opensuse-15.2-aarch64-164.1-gnome-x11@aarch64.qcow2" because we need space for new asset>
May 18 08:10:21 openqa-aarch64 openqa-workercache-daemon[11086]: [11086] [i] [QgQl857h] Purging "/var/lib/openqa/cache/openqa1-opensuse/opensuse-15.2-aarch64-164.1-gnome-x11@aarch64-uefi-vars.qcow2" because we need space for>
May 18 08:10:21 openqa-aarch64 openqa-workercache-daemon[11086]: [11086] [i] [QgQl857h] Purging "/var/lib/openqa/cache/openqa1-opensuse/opensuse-15.2-aarch64-164.1-xfce@aarch64.qcow2" because we need space for new assets, re>
May 18 08:10:21 openqa-aarch64 openqa-workercache-daemon[11086]: [11086] [i] [QgQl857h] Purging "/var/lib/openqa/cache/openqa1-opensuse/opensuse-15.2-aarch64-164.1-xfce@aarch64-uefi-vars.qcow2" because we need space for new >
May 18 08:10:21 openqa-aarch64 openqa-workercache-daemon[11086]: [11086] [i] [QgQl857h] Purging "/var/lib/openqa/cache/openqa1-opensuse/openSUSE-Leap-15.2-ARM-JeOS-efi.aarch64-2020.02.26-Build1.92.raw.xz.sha256" because we n>
May 18 08:10:21 openqa-aarch64 openqa-workercache-daemon[11086]: [11086] [i] [QgQl857h] Purging "/var/lib/openqa/cache/openqa1-opensuse/openSUSE-Leap-15.2-ARM-JeOS-efi.aarch64-2020.02.26-Build1.92.raw.xz" because we need spa>
May 18 08:10:21 openqa-aarch64 openqa-worker-cacheservice-minion[11087]: [11087] [i] [raw1f4qa] Purging "/var/lib/openqa/cache/openqa1-opensuse/opensuse-15.2-aarch64-164.1-gnome-x11@aarch64-uefi-vars.qcow2" because we need s>
May 18 08:10:21 openqa-aarch64 openqa-worker-cacheservice-minion[11087]: [11087] [i] [raw1f4qa] Purging "/var/lib/openqa/cache/openqa1-opensuse/opensuse-15.2-aarch64-164.1-xfce@aarch64.qcow2" because we need space for new as>
May 18 08:10:21 openqa-aarch64 openqa-worker-cacheservice-minion[11087]: [11087] [i] [raw1f4qa] Purging "/var/lib/openqa/cache/openqa1-opensuse/opensuse-15.2-aarch64-164.1-xfce@aarch64-uefi-vars.qcow2" because we need space >
May 18 08:10:21 openqa-aarch64 openqa-worker-cacheservice-minion[11087]: [11087] [i] [raw1f4qa] Purging "/var/lib/openqa/cache/openqa1-opensuse/openSUSE-Leap-15.2-ARM-JeOS-efi.aarch64-2020.02.26-Build1.92.raw.xz.sha256" beca>
May 18 08:10:21 openqa-aarch64 openqa-worker-cacheservice-minion[11087]: [11087] [i] [raw1f4qa] Purging "/var/lib/openqa/cache/openqa1-opensuse/openSUSE-Leap-15.2-ARM-JeOS-efi.aarch64-2020.02.26-Build1.92.raw.xz" because we >
…
May 18 08:10:22 openqa-aarch64 openqa-worker-cacheservice-minion[11087]: [11087] [i] [raw1f4qa] Purging "/var/lib/openqa/cache/openqa1-opensuse/opensuse-15.2-aarch64-164.3-minimalx@aarch64.qcow2" because we need space for ne>
May 18 08:10:22 openqa-aarch64 openqa-workercache-daemon[11086]: [11086] [i] [QgQl857h] Purging "/var/lib/openqa/cache/openqa1-opensuse/opensuse-15.2-aarch64-164.3-minimalx@aarch64.qcow2" because we need space for new assets>
May 18 08:10:22 openqa-aarch64 openqa-workercache-daemon[11086]: [11086] [i] [QgQl857h] Cache size of "/var/lib/openqa/cache" is 159GiB, with limit 160GiB
May 18 08:10:22 openqa-aarch64 openqa-worker-cacheservice-minion[11087]: [11087] [i] [raw1f4qa] Cache size of "/var/lib/openqa/cache" is 159GiB, with limit 160GiB
May 18 08:10:22 openqa-aarch64 openqa-worker-cacheservice-minion[11087]: [11087] [i] Worker 11087 started
May 18 08:10:22 openqa-aarch64 openqa-workercache-daemon[11086]: [11086] [i] Listening at "http://127.0.0.1:9530"
May 18 08:10:22 openqa-aarch64 openqa-workercache-daemon[11086]: [11086] [i] Listening at "http://[::1]:9530"
May 18 08:10:22 openqa-aarch64 openqa-workercache-daemon[11086]: [11086] [i] Manager 11086 started
May 18 08:10:22 openqa-aarch64 openqa-workercache-daemon[11086]: Server available at http://127.0.0.1:9530
May 18 08:10:22 openqa-aarch64 openqa-workercache-daemon[11086]: Server available at http://[::1]:9530
May 18 08:10:22 openqa-aarch64 openqa-workercache-daemon[11086]: [11090] [i] Worker 11090 started
May 18 08:10:22 openqa-aarch64 openqa-workercache-daemon[11086]: [11091] [i] Worker 11091 started
May 18 08:10:22 openqa-aarch64 openqa-workercache-daemon[11086]: [11092] [i] Worker 11092 started
May 18 08:10:22 openqa-aarch64 openqa-workercache-daemon[11086]: [11086] [i] Creating process id file "/tmp/prefork.pid"
May 18 08:10:22 openqa-aarch64 openqa-workercache-daemon[11086]: [11093] [i] Worker 11093 started
May 18 08:10:26 openqa-aarch64 worker[2847]: [info] +++ worker notes +++
May 18 08:10:26 openqa-aarch64 worker[2847]: [info] End time: 2020-05-18 06:10:26
May 18 08:10:26 openqa-aarch64 worker[2847]: [info] Result: quit
May 18 08:10:26 openqa-aarch64 worker[2835]: [info] +++ worker notes +++
May 18 08:10:26 openqa-aarch64 worker[2835]: [info] End time: 2020-05-18 06:10:26
May 18 08:10:26 openqa-aarch64 worker[2835]: [info] Result: quit
…
May 18 08:10:32 openqa-aarch64 openqa-workercache-daemon[11086]: [11092] [e] [H-9wvmR1] DBD::SQLite::st execute failed: database disk image is malformed at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 339.
May 18 08:10:32 openqa-aarch64 openqa-workercache-daemon[11086]:  at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 339.
May 18 08:10:32 openqa-aarch64 worker[11162]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 157
May 18 08:10:32 openqa-aarch64 worker[11160]: [error] Worker cache not available: Cache service returned error 500: Internal Server Error
May 18 08:10:32 openqa-aarch64 worker[11160]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 173
May 18 08:10:32 openqa-aarch64 openqa-workercache-daemon[11086]: [11092] [e] [YaWNG1Vj] DBD::SQLite::st execute failed: database disk image is malformed at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 339.
May 18 08:10:32 openqa-aarch64 openqa-workercache-daemon[11086]:  at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 339.
May 18 08:10:32 openqa-aarch64 worker[11165]: [error] Worker cache not available: Cache service returned error 500: Internal Server Error
May 18 08:10:32 openqa-aarch64 worker[11165]: [info] CACHE: caching is enabled, setting up /var/lib/openqa/cache/openqa1-opensuse
May 18 08:10:32 openqa-aarch64 worker[11165]: [info] Project dir for host http://openqa1-opensuse is /var/lib/openqa/share

observe the error lines:

May 18 08:10:32 openqa-aarch64 openqa-workercache-daemon[11086]: [11092] [e] [H-9wvmR1] DBD::SQLite::st execute failed: database disk image is malformed at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 339.
May 18 08:10:32 openqa-aarch64 openqa-workercache-daemon[11086]:  at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 339.
…
May 18 08:10:32 openqa-aarch64 worker[11160]: [error] Worker cache not available: Cache service returned error 500: Internal Server Error
…
May 18 08:10:32 openqa-aarch64 openqa-workercache-daemon[11086]: [11092] [e] [YaWNG1Vj] DBD::SQLite::st execute failed: database disk image is malformed at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 339.
May 18 08:10:32 openqa-aarch64 openqa-workercache-daemon[11086]:  at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 339.
May 18 08:10:32 openqa-aarch64 worker[11165]: [error] Worker cache not available: Cache service returned error 500: Internal Server Error

which might be due to the previous out-of-space condition potentially corrupting the database as well which resides on the same filesystem as the assets that filled up the available space?

Suggestions

  • Check integrity and content of attached database file for suspicious entries
  • Try to ensure cache database integrity even if writes fail (out-of-space) or reset cache on unexpected problems like these automatically

Further details

#66988 happened at around the same time.

the compressed SQLite database file "cache.sqlite.xz" is attached

Workaround

On the affected worker machine:

systemctl stop openqa-worker.target openqa-worker-cacheservice openqa-worker-cacheservice-minion.service
rm -rf /var/lib/openqa/cache/*
systemctl start openqa-worker.target openqa-worker-cacheservice openqa-worker-cacheservice-minion.service
cache.sqlite.xz (742 KB) cache.sqlite.xz okurz, 2020-05-18 13:28
database-disk-image-malformed.txt (8.06 KB) database-disk-image-malformed.txt mkittler, 2020-08-05 09:26

Related issues

Related to openQA Project - action #73321: all jobs run on openqaworker8 incomplete:"Cache service status error from API: Minion job #46203 failed: Couldn't add download: DBD::SQLite::st execute failed: database disk image is malformed*"Rejected2020-10-14

Related to openQA Infrastructure - action #73342: all jobs run on openqaworker8 incomplete: "Cache service status error from API: Minion job .*failed: .*(database disk image is malformed|not a database)":retryResolved2020-10-142020-10-16

Related to openQA Project - action #73273: job incompletes with auto_review:"setup failure: Cache service status error from API.* file is not a database .*":retryWorkable2020-10-13

Related to openQA Infrastructure - action #78058: [Alerting] Incomplete jobs of last 24h alert - again many incompletes due to corrupted cache, on openqaworker8Resolved2020-11-162020-11-18

Related to openQA Project - action #62441: openqa-worker systemd service can timeout when stoppingWorkable2020-01-21

Related to openQA Project - action #80202: jobs incomplete with auto_review:"setup failure: No workers active in the cache service":retryResolved2020-11-182020-12-19

Related to openQA Project - action #73339: auto_review:"setup failure: Cache service status error from API: Minion job.* failed: Can't use an undefined value as a HASH reference at.*"Resolved2020-10-14

Has duplicate openQA Project - action #78163: After OSD upgrade, many jobs incomplete with "Cache service status error 500: Internal Server Error"Closed2020-11-18

Copied to openQA Infrastructure - action #80106: corrupted worker cache sqlite: Enlarge systemd service kill timeout temporarilyResolved

History

#1 Updated by kraih 8 months ago

  • Assignee set to kraih

Great that the database was preserved this time. Lets see what i can find out.

#2 Updated by kraih 8 months ago

The SQLite file is definitely corrupted. And i believe it's the same table as last time.

*** in database main ***
On tree page 6 cell 73: Rowid 13377 out of order
row 2476 missing from index downloads_created
row 2476 missing from index downloads_lock
row 2477 missing from index downloads_created
row 2477 missing from index downloads_lock
wrong # of entries in index downloads_created
wrong # of entries in index downloads_lock

#3 Updated by okurz 8 months ago

so do you think this could be related to the "out-of-space" condition?

#4 Updated by kraih 8 months ago

I don't think so. Running out of disk space would make write queries fail with an out of space error message (rolling back the transaction with the write operation). But it is not supposed to corrupt the SQLite file. So far the only ways to actually corrupt an SQLite file i have found are a) hardware failure, b) SQLite process gets killed in the middle of a write (journal might mitigate this one though), or c) SQLite file is copied during a write operation.

Only scenario b would seem to make sense for our setup (even though we run SQLite in wal mode). While there is nothing in the SQLite documentation, I did find some stories about explicit exclusive locks around write operations mitigating SQLite file corruption, so i'm just going to try that.

#5 Updated by kraih 8 months ago

I've opened a PR that adds explicit locks. https://github.com/os-autoinst/openQA/pull/3103

#6 Updated by kraih 8 months ago

One more thing. The Minion job queue tables are not covered by explicit locks (3rd party module). That is something to look out for in case we see a corrupted SQLite file again. As far as i can remember until now we've only seen the downloads table and its indexes get corrupted. If that changes to one of the minion_* tables, it might suggest the mitigation PR above is working, and that we need to contribute a similar patch to Minon::Backend::SQLite.

#7 Updated by kraih 8 months ago

  • Status changed from Workable to Feedback

Lets see if the patch works out after the next deployment.

#8 Updated by cdywan 6 months ago

Do we consider this solved?

#9 Updated by okurz 6 months ago

  • Status changed from Feedback to Resolved

We could not link any recent problems to this issue so, yes, I agree we can see this as Resolved

#10 Updated by mkittler 6 months ago

  • File database-disk-image-malformed.txt database-disk-image-malformed.txt added
  • Subject changed from job incompletes with failure to download into cache (cache sqlite db attached), maybe due to out-of-space corrupting db content? to Job incompletes due to malformed worker cache database disk image with auto_review:"setup failure: Cache service status error.*database disk image is malformed.*"
  • Description updated (diff)
  • Status changed from Resolved to Feedback

Seems like it happened again: https://openqa.opensuse.org/tests/1352583

[2020-08-04T14:23:41.0564 CEST] [info] +++ setup notes +++
[2020-08-04T14:23:41.0564 CEST] [info] Running on openqa-aarch64:2 (Linux 4.12.14-lp151.28.52-default #1 SMP Wed Jun 10 15:32:08 UTC 2020 (464fb5f) aarch64)
[2020-08-04T14:23:41.0571 CEST] [debug] Found ASSET_1, caching Tumbleweed.aarch64-1.0-libvirt_aarch64-Snapshot20200802.vagrant.libvirt.box
[2020-08-04T14:23:41.0574 CEST] [info] Downloading Tumbleweed.aarch64-1.0-libvirt_aarch64-Snapshot20200802.vagrant.libvirt.box, request #23990 sent to Cache Service
[2020-08-04T14:23:46.0637 CEST] [info] Download of Tumbleweed.aarch64-1.0-libvirt_aarch64-Snapshot20200802.vagrant.libvirt.box processed:
[info] [#23990] Cache size of "/var/lib/openqa/cache" is 158GiB, with limit 160GiB
[info] [#23990] Downloading "Tumbleweed.aarch64-1.0-libvirt_aarch64-Snapshot20200802.vagrant.libvirt.box" from "http://openqa1-opensuse/tests/1352583/asset/other/Tumbleweed.aarch64-1.0-libvirt_aarch64-Snapshot20200802.vagrant.libvirt.box"
[info] [#23990] Content of "/var/lib/openqa/cache/openqa1-opensuse/Tumbleweed.aarch64-1.0-libvirt_aarch64-Snapshot20200802.vagrant.libvirt.box" has not changed, updating last use

[2020-08-04T14:23:46.0644 CEST] [debug] Found ASSET_256, caching openSUSE-Tumbleweed-DVD-aarch64-Snapshot20200802-Media.iso.sha256
[2020-08-04T14:23:46.0647 CEST] [info] Downloading openSUSE-Tumbleweed-DVD-aarch64-Snapshot20200802-Media.iso.sha256, request #23994 sent to Cache Service
[2020-08-04T14:23:51.0693 CEST] [info] Download of openSUSE-Tumbleweed-DVD-aarch64-Snapshot20200802-Media.iso.sha256 processed:
[info] [#23994] Cache size of "/var/lib/openqa/cache" is 158GiB, with limit 160GiB
[info] [#23994] Downloading "openSUSE-Tumbleweed-DVD-aarch64-Snapshot20200802-Media.iso.sha256" from "http://openqa1-opensuse/tests/1352583/asset/other/openSUSE-Tumbleweed-DVD-aarch64-Snapshot20200802-Media.iso.sha256"
[info] [#23994] Content of "/var/lib/openqa/cache/openqa1-opensuse/openSUSE-Tumbleweed-DVD-aarch64-Snapshot20200802-Media.iso.sha256" has not changed, updating last use

[2020-08-04T14:23:51.0697 CEST] [debug] Found ISO, caching openSUSE-Tumbleweed-DVD-aarch64-Snapshot20200802-Media.iso
[2020-08-04T14:23:51.0701 CEST] [info] Downloading openSUSE-Tumbleweed-DVD-aarch64-Snapshot20200802-Media.iso, request #24004 sent to Cache Service
[2020-08-04T14:24:01.0803 CEST] [info] +++ worker notes +++
[2020-08-04T14:24:01.0803 CEST] [info] End time: 2020-08-04 12:24:01
[2020-08-04T14:24:01.0803 CEST] [info] Result: setup failure
[2020-08-04T14:24:01.0818 CEST] [info] Uploading autoinst-log.txt

The problem strangely seemed to have resolved itself because subsequent jobs on the same worker haven't incompleted. Hence I can also not upload the malformed SQLite file. I can only attache the system journal around the time the error was logged. It is weird that there are no other related errors at the same time.

#11 Updated by okurz 5 months ago

  • Target version set to Ready

#12 Updated by kraih 4 months ago

That seems very strange, a malformed SQLite file does not fix itself. The cases i investigated had actual data inconsistencies, where indexes had different content than tables.

#13 Updated by okurz 3 months ago

  • Related to action #73321: all jobs run on openqaworker8 incomplete:"Cache service status error from API: Minion job #46203 failed: Couldn't add download: DBD::SQLite::st execute failed: database disk image is malformed*" added

#14 Updated by okurz 3 months ago

  • Related to action #73342: all jobs run on openqaworker8 incomplete: "Cache service status error from API: Minion job .*failed: .*(database disk image is malformed|not a database)":retry added

#15 Updated by okurz 3 months ago

kraih I have new data for you to churn on: #73321 :)

#16 Updated by kraih 3 months ago

Great, i'll have a look right away!

#17 Updated by kraih 3 months ago

Interesting results. Apparently this time it affects more than one table (minion_jobs and downloads). Both are never accessed by the same query, not even for selects.

*** in database main ***
Page 26 is never used
Page 27 is never used
Page 1048 is never used
Page 1367 is never used
Page 1411 is never used
Page 1445 is never used
Page 1649 is never used
Page 1706 is never used
Page 1783 is never used
Page 1803 is never used
Page 1816 is never used
Page 1846 is never used
Page 1851 is never used
Page 1941 is never used
Page 1970 is never used
Page 1976 is never used
Page 1993 is never used
Page 2032 is never used
Page 2082 is never used
Page 2129 is never used
Page 2140 is never used
Page 2242 is never used
Page 2264 is never used
Page 2294 is never used
Page 2345 is never used
Page 2349 is never used
Page 2405 is never used
Page 2420 is never used
Page 2475 is never used
Page 2513 is never used
Page 2531 is never used
Page 2532 is never used
Page 2576 is never used
Page 2588 is never used
Page 2632 is never used
Page 2677 is never used
Page 2695 is never used
Page 2903 is never used
Page 2916 is never used
Page 2940 is never used
Page 2997 is never used
Page 3117 is never used
Page 3119 is never used
Page 3142 is never used
wrong # of entries in index minion_jobs_state_priority_id
NULL value in downloads.created
row 1 missing from index downloads_created
row 1 missing from index downloads_lock
NULL value in downloads.created
row 2 missing from index downloads_created
row 2 missing from index downloads_lock
NULL value in downloads.created
row 3 missing from index downloads_created
row 3 missing from index downloads_lock
NULL value in downloads.created
row 4 missing from index downloads_created
row 4 missing from index downloads_lock
row 1704 missing from index downloads_created
wrong # of entries in index downloads_created
wrong # of entries in index downloads_lock

Unfortunately i have absolutely no idea what might cause this.

#18 Updated by kraih 3 months ago

  • Status changed from Feedback to Workable
  • Assignee deleted (kraih)

I'm out of ideas i'm afraid, maybe it's time for someone else to give it a try.

#19 Updated by okurz 3 months ago

  • Related to action #73273: job incompletes with auto_review:"setup failure: Cache service status error from API.* file is not a database .*":retry added

#20 Updated by okurz 3 months ago

  • Subject changed from Job incompletes due to malformed worker cache database disk image with auto_review:"setup failure: Cache service status error.*database disk image is malformed.*" to Job incompletes due to malformed worker cache database disk image with auto_review:"Cache service status error.*database disk image is malformed.*":retry

#21 Updated by kraih 3 months ago

  • Assignee set to kraih

Ok, since nobody could think of anything else to try, we are going to add mitigation features for the issue. Starting with a check on service startup if the SQLite file has been corrupted. And if it has been, that will trigger the cache cleanup.

#22 Updated by okurz 3 months ago

  • Parent task set to #62420

#23 Updated by kraih 3 months ago

  • Status changed from Workable to In Progress

I've opened a PR with a mitigation feature. https://github.com/os-autoinst/openQA/pull/3497

#24 Updated by kraih 3 months ago

Additionally we could have the mitigation code make a copy of the corrupted database for later investigation.

#25 Updated by okurz 3 months ago

kraih wrote:

Additionally we could have the mitigation code make a copy of the corrupted database for later investigation.

Yes, we have a good entry point for that. Currently I don't see the need after you already tried hard enough to corrupt the database file on purpose and struggled to do that so it seems the system itself is resilient enough except for cases when the filesystem itself is in an inconsistent state

PR had been merged and should be validated in production on osd before we can call the work done.

#26 Updated by cdywan 3 months ago

  • Status changed from In Progress to Feedback

#27 Updated by ggardet_arm 2 months ago

This error started on openqa-aarch64 today: https://openqa.opensuse.org/tests/1469625

EDIT: could also be related to https://progress.opensuse.org/issues/77752

#28 Updated by okurz 2 months ago

kraih I think we are still missing a "restart" trigger for the cache service. Something like:

  • If corruption is encountered, terminate the service with error exit code
  • Change systemd service to restart on failure
  • the already existing corruption mitigation in sub init should recover

#29 Updated by mkittler 2 months ago

I've added my finding in #71827#note-18. I agree that the service should terminate in case of the database disk image is malformed error because apparently it will never recover from that error state otherwise. Note that while the service still continuously logged the database disk image is malformed message I was able to read/write the same database file just fine via SQLite and a manual restart of the service didn't even enter the Purging cache directory because database has been corrupted code path (or the log message has been suppressed).

#30 Updated by kraih 2 months ago

  • Assignee changed from kraih to mkittler

#31 Updated by favogt 2 months ago

This happened to openqaworker1 today and I was able to fix it by doing:

openqaworker1:/var/lib/openqa/cache # sqlite3 cache.sqlite
SQLite version 3.28.0 2019-04-16 19:49:53
Enter ".help" for usage hints.
sqlite> vacuum
   ...> ;
sqlite> PRAGMA integrity_check;
wrong # of entries in index minion_jobs_state_priority_id
row 1 missing from index downloads_created
row 1 missing from index downloads_lock
row 2 missing from index downloads_created
row 2 missing from index downloads_lock
row 3 missing from index downloads_created
row 3 missing from index downloads_lock
row 4 missing from index downloads_created
row 4 missing from index downloads_lock
wrong # of entries in index downloads_created
wrong # of entries in index downloads_lock
sqlite> reindex;
sqlite> PRAGMA integrity_check;
ok
sqlite> .quit

I stopped openqa-worker-cacheservice.service and openqa-worker-cacheservice-minion.service for that time.

#32 Updated by okurz 2 months ago

  • Description updated (diff)
  • Priority changed from Normal to Urgent

For some reasons this seems to happen more and more. E.g. was reported by fvogt (see above) for openqaworker1.o.o and today again by guillaume_g on aarch64.o.o

Updated the workaround section

#33 Updated by mkittler 2 months ago

If it is still happening more than it is unlikely that the new migration causes it (as the code runs only once).

I couldn't find any errors on aarch64 in the journal (journalctl --since '1 day ago' -u openqa-worker-cacheservice, journalctl --since '1 day ago' -u openqa-worker-cacheservice-minion).

#34 Updated by mkittler 2 months ago

I couldn't find anything in the logs for openqaworker1 as well. Usually we've at least seen lots of errors there (e.g. #71827#note-18). I'll try to dig in the job history to find any clues. However, if the error isn't visible from the side of the cache services than I wouldn't know how to implement #67000#note-28.

#35 Updated by okurz 2 months ago

on aarch64.o.o journalctl --since="3 days ago" -u openqa-worker-cacheservice tells me:

Nov 14 08:09:32 openqa-aarch64 openqa-workercache-daemon[2924]: [3034] [e] [6EpmB1Ma] DBD::SQLite::st execute failed: database disk image is malformed at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 363.
Nov 14 08:09:32 openqa-aarch64 openqa-workercache-daemon[2924]:  at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 363.

the current context from source is:

 360 sub stats {
 361   my $self = shift;
 362 
 363   my $stats = $self->sqlite->db->query(
 364     q{select count(case when state = 'inactive' and (expires is null or expires > datetime('now'))
…
 377       (select count(*) from minion_workers) as inactive_workers, null as uptime
 378       from minion_jobs}
 379   )->hash;
 380   $stats->{inactive_workers} -= $stats->{active_workers};
 381 
 382   return $stats;
 383 }

so the call $self->sqlite->db->query fails but is caught by some error handler. I would just look for where this error handler is and not catch this exception which we can not gracefully handle in a useful way within the minion.

Is my suggestion from #67000#note-28 not possible?

Trying to rephrase: openqa-worker-cacheservice-minion sees the problem and reports as error. Same as before I am still suggesting to either crash the service hard and let the openqa-worker-cacheservice detect that the minion is gone unexpectedly and also crash hard and then implement the retry on systemd service level. Or, if openqa-worker-cacheservice can't detect that the minion is gone then the minion needs to inform openqa-worker-cacheservice that it should crash hard (same game for systemd service)

#36 Updated by mkittler 2 months ago

I'm aware of the log from 3 days ago but is is not relevant for the current incompletes. But I was able to find them by digging into the job history and by asking on IRC. Seems like this time only the downloads table was affected. That would be in accordance with the output from @fvogt's PRAGMA integrity_check; and explain why we didn't see any errors within the logs. Now, knowing where the error comes from I can implement an automatic repair/restart in this case as well. I assume we need to trigger that from any eval block we use to catch database errors in.

#37 Updated by okurz 2 months ago

I assume we need to trigger that from any eval block we use to catch database errors in.

Or even simpler: Delete the eval block and let the whole process crash.

#38 Updated by kraih 2 months ago

Nov 14 08:09:32 openqa-aarch64 openqa-workercache-daemon[2924]: [3034] [e] [6EpmB1Ma] DBD::SQLite::st execute failed: database disk image is malformed at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 363.
Nov 14 08:09:32 openqa-aarch64 openqa-workercache-daemon[2924]:  at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 363.

the current context from source is:

 360 sub stats {
 361   my $self = shift;
 362 
 363   my $stats = $self->sqlite->db->query(
 364     q{select count(case when state = 'inactive' and (expires is null or expires > datetime('now'))
…
 377       (select count(*) from minion_workers) as inactive_workers, null as uptime
 378       from minion_jobs}
 379   )->hash;
 380   $stats->{inactive_workers} -= $stats->{active_workers};
 381 
 382   return $stats;
 383 }

That would actually be the first time i've seen a SELECT trigger the database disk image is malformed error. In all other cases i'm aware of it has been write operations.

#39 Updated by mkittler 2 months ago

By the way, that's one of the jobs where the error didn't end up within the log: https://openqa.opensuse.org/tests/1474352

Reason: setup failure: Cache service status error from API: Minion job #3966 failed: Couldn't add download: DBD::SQLite::st execute failed: database disk image is malformed at /usr/share/openqa/script/../lib/OpenQA/CacheService/Model/Downloads.pm line 34. at /usr/share/openqa/script/../lib/OpenQA/CacheServ…

kraih Yes, the recent errors we've already been discussing about on Friday show up when reading the database (but the corruption is nevertheless likely caused by write operations).


okurz

Or even simpler: Delete the eval block and let the whole process crash.

I'll have to check the various eval blocks one-by-one for what happens if an exception passes. Maybe the error would just be logged after all and the event loop continues to run. It looks like that's actually the case for the error messages occurring within the logs. Besides, it would likely also still make sense to check for the particular error.

#40 Updated by kraih 2 months ago

I'll have to check the various eval blocks one-by-one for what happens if an exception passes. Maybe the error would just be logged after all and the event loop continues to run. It looks like that's actually the case for the error messages occurring within the logs. Besides, it would likely also still make sense to check for the particular error.

There's at least 4 different process types where the error would be handled differently (and probably a whole bunch of variants for each case depending on context).

  1. openqa-cacheservice-worker (web app) manager process that connects to the database on startup. Exceptions should already kill the service.
  2. openqa-cacheservice-worker (web app) worker processes that processes requests and which can encounter the problem with many different queries. Exceptions should get caught by a framework handler generating a built-in 500 page.
  3. openqa-cacheservice-worker-minion (job queue) manager process that also connects to the database on startup and runs queries to dequeue jobs and update results and other metadata. Exceptions should already kill the service.
  4. openqa-cacheservice-worker-minion (job queue) job processes that get forked for each job and processes rsync and asset download tasks. Exceptions get caught and stored in the job result.

So it's unfortunately not as simple as disabling an eval. At best you take down a forked process that doesn't really influence the service as a whole.

#42 Updated by okurz 2 months ago

Just to provide context to prove that this issue is really important: https://gitlab.suse.de/openqa/auto-review/-/jobs/287066 shows just today we have 355 incompletes linked with this ticket. All could be retriggered but imagine how reviewers would need to handle this manually if we would not have "auto-review" :D

But on osd I found just now that openqaworker8 is heavily causing a lot of incompletes. If we don't have the PR merged tomorrow and deployed as hotpatch I would suggest we look for another, faster workaround for the time being.

#43 Updated by okurz 2 months ago

  • Description updated (diff)

#44 Updated by okurz 2 months ago

  • Related to action #78058: [Alerting] Incomplete jobs of last 24h alert - again many incompletes due to corrupted cache, on openqaworker8 added

#45 Updated by okurz 2 months ago

found https://www.sqlite.org/howtocorrupt.html , sounds interesting :) This article lists many different scenarios that can lead to corruption. Maybe something we should look into if we are doing any of these. For example, are we doing "Carrying an open database connection across a fork()" ?

#46 Updated by mkittler 2 months ago

PR to better deal with corrupted databases: https://github.com/os-autoinst/openQA/pull/3559


Thanks, I'll read it.

For example, are we doing "Carrying an open database connection across a fork()" ?

I suppose we do because the Minion framework creates sub process to run the job and these most likely just re-use the established database connection (kraih , right?).

#47 Updated by kraih 2 months ago

For example, are we doing "Carrying an open database connection across a fork()" ?

I suppose we do because the Minion framework creates sub process to run the job and these most likely just re-use the established database connection (kraih , right?).

No, that should not be the case. The Mojo::SQLite module has a fork-safety feature that deletes all connections after a fork without ever using them. https://metacpan.org/release/Mojo-SQLite/source/lib/Mojo/SQLite.pm#L75

#48 Updated by mkittler 2 months ago

  • Has duplicate action #78163: After OSD upgrade, many jobs incomplete with "Cache service status error 500: Internal Server Error" added

#49 Updated by mkittler 2 months ago

Seems like the restart introduced with https://github.com/os-autoinst/openQA/pull/3559 works in production:

Nov 19 13:15:19 openqaworker4 openqa-workercache-daemon[9261]: [9266] [e] [HfT4AA7G] DBD::SQLite::st execute failed: database disk image is malformed at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 363.
Nov 19 13:15:19 openqaworker4 openqa-workercache-daemon[9261]:  at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 363.
Nov 19 13:15:19 openqaworker4 openqa-workercache-daemon[9261]: [9266] [e] Stopping service after critical database error
Nov 19 13:15:19 openqaworker4 openqa-workercache-daemon[9261]: [9261] [i] Stopping worker 9266 gracefully (80 seconds)
Nov 19 13:15:19 openqaworker4 openqa-workercache-daemon[9261]: [9261] [i] Stopping worker 9267 gracefully (80 seconds)
Nov 19 13:15:19 openqaworker4 openqa-workercache-daemon[9261]: [9261] [i] Stopping worker 9265 gracefully (80 seconds)
Nov 19 13:15:19 openqaworker4 openqa-workercache-daemon[9261]: [9261] [i] Stopping worker 9268 gracefully (80 seconds)
Nov 19 13:15:19 openqaworker4 openqa-workercache-daemon[9261]: [9261] [i] Worker 9265 stopped
Nov 19 13:15:19 openqaworker4 openqa-workercache-daemon[9261]: [9261] [i] Worker 9266 stopped
Nov 19 13:15:19 openqaworker4 openqa-workercache-daemon[9261]: [9261] [i] Worker 9267 stopped
Nov 19 13:15:19 openqaworker4 openqa-workercache-daemon[9261]: [9261] [i] Worker 9268 stopped
Nov 19 13:15:19 openqaworker4 openqa-workercache-daemon[9261]: [9261] [i] Manager 9261 stopped
Nov 19 13:15:19 openqaworker4 systemd[1]: openqa-worker-cacheservice.service: Main process exited, code=exited, status=32/n/a
Nov 19 13:15:19 openqaworker4 systemd[1]: openqa-worker-cacheservice.service: Unit entered failed state.
Nov 19 13:15:19 openqaworker4 systemd[1]: openqa-worker-cacheservice.service: Failed with result 'exit-code'.
Nov 19 13:15:20 openqaworker4 systemd[1]: openqa-worker-cacheservice.service: Service RestartSec=100ms expired, scheduling restart.
Nov 19 13:15:20 openqaworker4 systemd[1]: Stopped OpenQA Worker Cache Service.
Nov 19 13:15:20 openqaworker4 systemd[1]: Started OpenQA Worker Cache Service.
Nov 19 13:15:21 openqaworker4 openqa-workercache-daemon[9285]: [9285] [i] [YThoSVqB] Cache size of "/var/lib/openqa/cache" is 48GiB, with limit 50GiB
Nov 19 13:15:21 openqaworker4 openqa-workercache-daemon[9285]: [9285] [i] Listening at "http://127.0.0.1:9530"
Nov 19 13:15:21 openqaworker4 openqa-workercache-daemon[9285]: [9285] [i] Listening at "http://[::1]:9530"
Nov 19 13:15:21 openqaworker4 openqa-workercache-daemon[9285]: [9285] [i] Manager 9285 started
Nov 19 13:15:21 openqaworker4 openqa-workercache-daemon[9285]: Web application available at http://127.0.0.1:9530
Nov 19 13:15:21 openqaworker4 openqa-workercache-daemon[9285]: Web application available at http://[::1]:9530
Nov 19 13:15:21 openqaworker4 openqa-workercache-daemon[9285]: [9289] [i] Worker 9289 started
Nov 19 13:15:21 openqaworker4 openqa-workercache-daemon[9285]: [9290] [i] Worker 9290 started
Nov 19 13:15:21 openqaworker4 openqa-workercache-daemon[9285]: [9291] [i] Worker 9291 started
Nov 19 13:15:21 openqaworker4 openqa-workercache-daemon[9285]: [9285] [i] Creating process id file "/tmp/prefork.pid"
Nov 19 13:15:21 openqaworker4 openqa-workercache-daemon[9285]: [9292] [i] Worker 9292 started

In the meantime I've created https://github.com/os-autoinst/openQA/pull/3565 to further improve the repairing. See the comments on GitHub for details. (Note that none of the mentioned PR were not deployed on the QA-Power8-5-kvm machine mentioned there.)

#50 Updated by favogt 2 months ago

This just happened to openqaworker4 and checking the log during shutdown I found:

Nov 19 03:30:03 openqaworker4 openqa-workercache-daemon[2199]: [2199] [w] Stopping worker 2403 immediately
Nov 19 03:30:03 openqaworker4 openqa-workercache-daemon[2199]: [2199] [w] Stopping worker 2404 immediately
Nov 19 03:30:03 openqaworker4 openqa-workercache-daemon[2199]: [2199] [w] Stopping worker 2405 immediately
Nov 19 03:30:03 openqaworker4 openqa-workercache-daemon[2199]: [2199] [w] Stopping worker 2402 immediately
Nov 19 03:30:03 openqaworker4 openqa-workercache-daemon[2199]: [2199] [i] Worker 2404 stopped
Nov 19 03:30:03 openqaworker4 openqa-workercache-daemon[2199]: [2199] [i] Worker 2405 stopped
Nov 19 03:30:03 openqaworker4 openqa-workercache-daemon[2199]: [2199] [i] Worker 2403 stopped
Nov 19 03:30:03 openqaworker4 openqa-workercache-daemon[2199]: [2199] [i] Worker 2402 stopped
Nov 19 03:30:03 openqaworker4 openqa-workercache-daemon[2199]: [2199] [w] Stopping worker 2403 immediately

If those workers had the sqlite database open when they got sent a SIGKILL, then that might indeed cause corruption.

#51 Updated by okurz 2 months ago

That's a very good observation. You know my catch phrase? "We have a ticket for that" :)

That would be #62441 about "openqa-worker systemd service can timeout when stopping".

#52 Updated by okurz 2 months ago

  • Related to action #62441: openqa-worker systemd service can timeout when stopping added

#53 Updated by mkittler 2 months ago

If those workers had the sqlite database open when they got sent a SIGKILL, then that might indeed cause corruption.

But shouldn't SQLite's journaling help us here? Of course we might lose something but at least the database should be able to restore itself to a consistent state.

I note down the possible upstream issue you've mentioned on IRC: https://github.com/Grinnz/Mojo-SQLite/issues/5

#54 Updated by mkittler 2 months ago

This now constantly shows up on openqaworker4:

Nov 20 03:48:39 openqaworker4 systemd[1]: Started OpenQA Worker Cache Service.
Nov 20 03:48:40 openqaworker4 openqa-workercache-daemon[2769]: [2769] [i] [WNLIkngQ] Cache size of "/var/lib/openqa/cache" is 47GiB, with limit 50GiB
Nov 20 03:48:40 openqaworker4 openqa-workercache-daemon[2769]: [2769] [i] Listening at "http://127.0.0.1:9530"
Nov 20 03:48:40 openqaworker4 openqa-workercache-daemon[2769]: [2769] [i] Listening at "http://[::1]:9530"
Nov 20 03:48:40 openqaworker4 openqa-workercache-daemon[2769]: [2769] [i] Manager 2769 started
Nov 20 03:48:40 openqaworker4 openqa-workercache-daemon[2769]: Web application available at http://127.0.0.1:9530
Nov 20 03:48:40 openqaworker4 openqa-workercache-daemon[2769]: Web application available at http://[::1]:9530
Nov 20 03:48:40 openqaworker4 openqa-workercache-daemon[2769]: [2773] [i] Worker 2773 started
Nov 20 03:48:40 openqaworker4 openqa-workercache-daemon[2769]: [2774] [i] Worker 2774 started
Nov 20 03:48:40 openqaworker4 openqa-workercache-daemon[2769]: [2775] [i] Worker 2775 started
Nov 20 03:48:40 openqaworker4 openqa-workercache-daemon[2769]: [2769] [i] Creating process id file "/tmp/prefork.pid"
Nov 20 03:48:40 openqaworker4 openqa-workercache-daemon[2769]: [2776] [i] Worker 2776 started
Nov 20 03:48:43 openqaworker4 openqa-workercache-daemon[2769]: [2774] [e] [mjgYfJJ7] DBD::SQLite::st execute failed: database disk image is malformed at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/SQLite.pm line 115.
Nov 20 03:48:43 openqaworker4 openqa-workercache-daemon[2769]:  at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/SQLite.pm line 115.
Nov 20 03:48:43 openqaworker4 openqa-workercache-daemon[2769]: [2776] [e] [mjgYfJJ7] DBD::SQLite::st execute failed: database disk image is malformed at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/SQLite.pm line 115.
Nov 20 03:48:43 openqaworker4 openqa-workercache-daemon[2769]:  at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/SQLite.pm line 115.
Nov 20 03:48:43 openqaworker4 openqa-workercache-daemon[2769]: [2773] [e] [mjgYfJJ7] DBD::SQLite::st execute failed: database disk image is malformed at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/SQLite.pm line 115.
Nov 20 03:48:43 openqaworker4 openqa-workercache-daemon[2769]:  at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/SQLite.pm line 115.
Nov 20 03:48:43 openqaworker4 openqa-workercache-daemon[2769]: [2774] [e] Stopping service after critical database error
Nov 20 03:48:43 openqaworker4 openqa-workercache-daemon[2769]: [2769] [i] Stopping worker 2775 gracefully (80 seconds)
Nov 20 03:48:43 openqaworker4 openqa-workercache-daemon[2769]: [2769] [i] Stopping worker 2773 gracefully (80 seconds)
Nov 20 03:48:43 openqaworker4 openqa-workercache-daemon[2769]: [2769] [i] Stopping worker 2776 gracefully (80 seconds)
Nov 20 03:48:43 openqaworker4 openqa-workercache-daemon[2769]: [2769] [i] Stopping worker 2774 gracefully (80 seconds)
Nov 20 03:48:43 openqaworker4 openqa-workercache-daemon[2769]: [2776] [e] Stopping service after critical database error
Nov 20 03:48:43 openqaworker4 openqa-workercache-daemon[2769]: [2773] [e] Stopping service after critical database error
Nov 20 03:48:43 openqaworker4 openqa-workercache-daemon[2769]: [2769] [i] Worker 2774 stopped
Nov 20 03:48:43 openqaworker4 openqa-workercache-daemon[2769]: [2769] [i] Worker 2773 stopped
Nov 20 03:48:43 openqaworker4 openqa-workercache-daemon[2769]: [2769] [i] Worker 2775 stopped
Nov 20 03:48:43 openqaworker4 openqa-workercache-daemon[2769]: [2769] [i] Worker 2776 stopped
Nov 20 03:48:43 openqaworker4 openqa-workercache-daemon[2769]: [2769] [i] Manager 2769 stopped
Nov 20 03:48:43 openqaworker4 systemd[1]: openqa-worker-cacheservice.service: Main process exited, code=exited, status=160/n/a
Nov 20 03:48:43 openqaworker4 systemd[1]: openqa-worker-cacheservice.service: Unit entered failed state.
Nov 20 03:48:43 openqaworker4 systemd[1]: openqa-worker-cacheservice.service: Failed with result 'exit-code'.
Nov 20 03:48:44 openqaworker4 systemd[1]: openqa-worker-cacheservice.service: Service RestartSec=100ms expired, scheduling restart.
Nov 20 03:48:45 openqaworker4 systemd[1]: Stopped OpenQA Worker Cache Service.
Nov 20 03:48:45 openqaworker4 systemd[1]: Started OpenQA Worker Cache Service.
Nov 20 03:48:45 openqaworker4 openqa-workercache-daemon[2777]: [2777] [i] [4MOBVYiU] Cache size of "/var/lib/openqa/cache" is 47GiB, with limit 50GiB
Nov 20 03:48:45 openqaworker4 openqa-workercache-daemon[2777]: [2777] [i] Listening at "http://127.0.0.1:9530"
Nov 20 03:48:45 openqaworker4 openqa-workercache-daemon[2777]: [2777] [i] Listening at "http://[::1]:9530"
Nov 20 03:48:45 openqaworker4 openqa-workercache-daemon[2777]: [2777] [i] Manager 2777 started
Nov 20 03:48:45 openqaworker4 openqa-workercache-daemon[2777]: Web application available at http://127.0.0.1:9530
Nov 20 03:48:45 openqaworker4 openqa-workercache-daemon[2777]: Web application available at http://[::1]:9530
Nov 20 03:48:45 openqaworker4 openqa-workercache-daemon[2777]: [2781] [i] Worker 2781 started
Nov 20 03:48:45 openqaworker4 openqa-workercache-daemon[2777]: [2782] [i] Worker 2782 started
Nov 20 03:48:45 openqaworker4 openqa-workercache-daemon[2777]: [2783] [i] Worker 2783 started
Nov 20 03:48:45 openqaworker4 openqa-workercache-daemon[2777]: [2777] [i] Creating process id file "/tmp/prefork.pid"
Nov 20 03:48:45 openqaworker4 openqa-workercache-daemon[2777]: [2784] [i] Worker 2784 started
Nov 20 09:16:18 openqaworker4 systemd[1]: Stopping OpenQA Worker Cache Service...
Nov 20 09:16:18 openqaworker4 openqa-workercache-daemon[2777]: [2777] [w] Stopping worker 2784 immediately
Nov 20 09:16:18 openqaworker4 openqa-workercache-daemon[2777]: [2777] [w] Stopping worker 2783 immediately
Nov 20 09:16:18 openqaworker4 openqa-workercache-daemon[2777]: [2777] [w] Stopping worker 2781 immediately
Nov 20 09:16:18 openqaworker4 openqa-workercache-daemon[2777]: [2777] [w] Stopping worker 2782 immediately
Nov 20 09:16:18 openqaworker4 openqa-workercache-daemon[2777]: [2777] [i] Worker 2783 stopped
Nov 20 09:16:18 openqaworker4 openqa-workercache-daemon[2777]: [2777] [w] Stopping worker 2784 immediately
Nov 20 09:16:18 openqaworker4 openqa-workercache-daemon[2777]: [2777] [w] Stopping worker 2781 immediately
Nov 20 09:16:18 openqaworker4 openqa-workercache-daemon[2777]: [2777] [w] Stopping worker 2782 immediately
Nov 20 09:16:18 openqaworker4 openqa-workercache-daemon[2777]: [2777] [i] Worker 2781 stopped
Nov 20 09:16:18 openqaworker4 openqa-workercache-daemon[2777]: [2777] [i] Worker 2784 stopped
Nov 20 09:16:18 openqaworker4 openqa-workercache-daemon[2777]: [2777] [w] Stopping worker 2782 immediately
Nov 20 09:16:18 openqaworker4 openqa-workercache-daemon[2777]: [2777] [i] Worker 2782 stopped
Nov 20 09:16:18 openqaworker4 openqa-workercache-daemon[2777]: [2777] [i] Manager 2777 stopped
Nov 20 09:16:18 openqaworker4 systemd[1]: openqa-worker-cacheservice.service: Main process exited, code=exited, status=8/n/a
Nov 20 09:16:18 openqaworker4 systemd[1]: Stopped OpenQA Worker Cache Service.
Nov 20 09:16:18 openqaworker4 systemd[1]: openqa-worker-cacheservice.service: Unit entered failed state.
Nov 20 09:16:18 openqaworker4 systemd[1]: openqa-worker-cacheservice.service: Failed with result 'exit-code'.
Nov 20 09:16:52 openqaworker4 systemd[1]: Started OpenQA Worker Cache Service.
Nov 20 09:16:52 openqaworker4 openqa-workercache-daemon[8990]: [8990] [i] [GPabEAVD] Cache size of "/var/lib/openqa/cache" is 48GiB, with limit 50GiB
Nov 20 09:16:52 openqaworker4 openqa-workercache-daemon[8990]: [8990] [i] Listening at "http://127.0.0.1:9530"
Nov 20 09:16:52 openqaworker4 openqa-workercache-daemon[8990]: [8990] [i] Listening at "http://[::1]:9530"
Nov 20 09:16:52 openqaworker4 openqa-workercache-daemon[8990]: [8990] [i] Manager 8990 started
Nov 20 09:16:52 openqaworker4 openqa-workercache-daemon[8990]: Web application available at http://127.0.0.1:9530
Nov 20 09:16:52 openqaworker4 openqa-workercache-daemon[8990]: Web application available at http://[::1]:9530
Nov 20 09:16:52 openqaworker4 openqa-workercache-daemon[8990]: [8994] [i] Worker 8994 started
Nov 20 09:16:52 openqaworker4 openqa-workercache-daemon[8990]: [8995] [i] Worker 8995 started
Nov 20 09:16:52 openqaworker4 openqa-workercache-daemon[8990]: [8996] [i] Worker 8996 started
Nov 20 09:16:52 openqaworker4 openqa-workercache-daemon[8990]: [8990] [i] Creating process id file "/tmp/prefork.pid"
Nov 20 09:16:52 openqaworker4 openqa-workercache-daemon[8990]: [8997] [i] Worker 8997 started

And that although the SQLite integrity check shows errors:

openqaworker4:~ # sqlite3 /var/lib/openqa/cache/cache.sqlite
SQLite version 3.28.0 2019-04-16 19:49:53
Enter ".help" for usage hints.
sqlite> pragma integrity_check;
*** in database main ***
Page 18 is never used
Page 19 is never used
Page 24 is never used
Page 26 is never used
Page 42 is never used
Page 83 is never used
Page 99 is never used
row 1 missing from index downloads_created
row 1 missing from index downloads_lock
row 2 missing from index downloads_created
row 2 missing from index downloads_lock
row 3 missing from index downloads_created
row 3 missing from index downloads_lock
row 22 missing from index downloads_lock
row 31 missing from index downloads_lock
row 35 missing from index downloads_lock
row 39 missing from index downloads_lock
wrong # of entries in index downloads_created
wrong # of entries in index downloads_lock
sqlite>

However, after a restart of the service the repairing seems to be done automatically:

Nov 20 11:04:11 openqaworker4 systemd[1]: Stopping OpenQA Worker Cache Service Minion...
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[8992]: [8992] [i] Worker 8992 stopped
Nov 20 11:04:15 openqaworker4 systemd[1]: openqa-worker-cacheservice-minion.service: Main process exited, code=exited, status=136/n/a
Nov 20 11:04:15 openqaworker4 systemd[1]: Stopped OpenQA Worker Cache Service Minion.
Nov 20 11:04:15 openqaworker4 systemd[1]: openqa-worker-cacheservice-minion.service: Unit entered failed state.
Nov 20 11:04:15 openqaworker4 systemd[1]: openqa-worker-cacheservice-minion.service: Failed with result 'exit-code'.
Nov 20 11:04:15 openqaworker4 systemd[1]: Started OpenQA Worker Cache Service Minion.
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: [16691] [e] [Df2AfljU] Database integrity check found errors:
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: [16691] [e] [Df2AfljU] *** in database main ***
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: Page 18 is never used
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: Page 19 is never used
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: Page 24 is never used
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: Page 26 is never used
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: Page 42 is never used
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: Page 83 is never used
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: Page 99 is never used
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: [16691] [e] [Df2AfljU] row 1 missing from index downloads_created
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: [16691] [e] [Df2AfljU] row 1 missing from index downloads_lock
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: [16691] [e] [Df2AfljU] row 2 missing from index downloads_created
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: [16691] [e] [Df2AfljU] row 2 missing from index downloads_lock
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: [16691] [e] [Df2AfljU] row 3 missing from index downloads_created
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: [16691] [e] [Df2AfljU] row 3 missing from index downloads_lock
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: [16691] [e] [Df2AfljU] row 22 missing from index downloads_lock
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: [16691] [e] [Df2AfljU] row 31 missing from index downloads_lock
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: [16691] [e] [Df2AfljU] row 35 missing from index downloads_lock
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: [16691] [e] [Df2AfljU] row 39 missing from index downloads_lock
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: [16691] [e] [Df2AfljU] wrong # of entries in index downloads_created
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: [16691] [e] [Df2AfljU] wrong # of entries in index downloads_lock
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: [16691] [e] [Df2AfljU] Re-indexing and vacuuming broken database
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: [16691] [e] [Df2AfljU] Database integrity check found errors:
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: [16691] [e] [Df2AfljU] row 1 missing from index downloads_created
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: [16691] [e] [Df2AfljU] row 1 missing from index downloads_lock
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: [16691] [e] [Df2AfljU] row 2 missing from index downloads_created
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: [16691] [e] [Df2AfljU] row 2 missing from index downloads_lock
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: [16691] [e] [Df2AfljU] row 3 missing from index downloads_created
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: [16691] [e] [Df2AfljU] row 3 missing from index downloads_lock
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: [16691] [e] [Df2AfljU] row 19 missing from index downloads_lock
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: [16691] [e] [Df2AfljU] row 22 missing from index downloads_lock
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: [16691] [e] [Df2AfljU] row 23 missing from index downloads_lock
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: [16691] [e] [Df2AfljU] row 28 missing from index downloads_lock
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: [16691] [e] [Df2AfljU] row 31 missing from index downloads_lock
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: [16691] [e] [Df2AfljU] row 35 missing from index downloads_lock
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: [16691] [e] [Df2AfljU] row 36 missing from index downloads_lock
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: [16691] [e] [Df2AfljU] row 39 missing from index downloads_lock
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: [16691] [e] [Df2AfljU] wrong # of entries in index downloads_created
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: [16691] [e] [Df2AfljU] wrong # of entries in index downloads_lock
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: [16691] [e] [Df2AfljU] Purging cache directory because database has been corrupted: Unable to fix errors reported by integrity check
Nov 20 11:04:15 openqaworker4 openqa-worker-cacheservice-minion[16691]: [16691] [i] [Df2AfljU] Creating cache directory tree for "/var/lib/openqa/cache"
Nov 20 11:04:16 openqaworker4 openqa-worker-cacheservice-minion[16691]: cannot remove directory for /var/lib/openqa/cache/tmp: No such file or directory at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/File.pm line 120.
Nov 20 11:04:16 openqaworker4 openqa-worker-cacheservice-minion[16691]: cannot unlink file for /var/lib/openqa/cache/openqa1-opensuse/tests/opensuse/products/opensuse/needles/gnome-next-button-20170213.json: No such file or directory at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/File.pm line 120.

I'm wondering why it still can not fix all errors with "Unable to fix errors reported by integrity check". The errors the integrity check found can actually usually be fixed via reindex.

I guess this code in our repair function can also be problematic:

    # remove broken database
    if (my $err = $@) {
        $log->error("Purging cache directory because database has been corrupted: $err");
        $db_file->remove;
    }

Even if it is just for the sake of throwing away a database which is broken anyways I'm not sure whether simply removing the SQLite file will not cause any further trouble. kraih What do you think (since you've introduced this purging)?

#55 Updated by okurz 2 months ago

  • Copied to action #80106: corrupted worker cache sqlite: Enlarge systemd service kill timeout temporarily added

#56 Updated by mkittler 2 months ago

To follow up on my last comment:


I'm wondering why it still can not fix all errors with "Unable to fix errors reported by integrity check". The errors the integrity check found can actually usually be fixed via reindex.

This PR fixes that problem: https://github.com/os-autoinst/openQA/pull/3566

I guess this code in our repair function can also be problematic:

That's the according section from the SQLite documentation: https://www.sqlite.org/howtocorrupt.html#_unlinking_or_renaming_a_database_file_while_in_use

With the mentioned PR we will hopefully no longer run into this case. However, it is still questionable whether we should have it at all. Maybe this kind of cleanup is better done by an external script which shuts down all services before removing the file.


By the way, that's where Mojo::SQLite implements "fork-safety" to avoid using the SQLite handle within forked processes: https://metacpan.org/release/Mojo-SQLite/source/lib/Mojo/SQLite.pm#L75
The only way to circumvent this protection would be passing an SQLite handle established before forking into the forked process (which we don't do).

#57 Updated by mkittler about 2 months ago

When investigating #80202 I had a look at the log of openqaworker1's cache service:

Nov 23 03:33:36 openqaworker1 systemd[1]: Started OpenQA Worker Cache Service.
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: [3045] [i] [ADC6V2Xw] Cache size of "/var/lib/openqa/cache" is 19GiB, with limit 400GiB
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: [3045] [i] Listening at "http://127.0.0.1:9530"
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: [3045] [i] Listening at "http://[::1]:9530"
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: [3045] [i] Manager 3045 started
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: Web application available at http://127.0.0.1:9530
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: Web application available at http://[::1]:9530
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: [3292] [i] Worker 3292 started
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: [3293] [i] Worker 3293 started
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: [3294] [i] Worker 3294 started
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: [3045] [i] Creating process id file "/tmp/prefork.pid"
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: [3295] [i] Worker 3295 started
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: [3295] [e] [K7E-kVGB] DBD::SQLite::st execute failed: database disk image is malformed at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/SQLite.pm line 115.
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]:  at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/SQLite.pm line 115.
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: [3294] [e] [K7E-kVGB] DBD::SQLite::st execute failed: database disk image is malformed at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/SQLite.pm line 115.
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]:  at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/SQLite.pm line 115.
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: [3295] [e] Stopping service after critical database error
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: [3045] [i] Stopping worker 3292 gracefully (80 seconds)
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: [3045] [i] Stopping worker 3295 gracefully (80 seconds)
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: [3045] [i] Stopping worker 3293 gracefully (80 seconds)
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: [3045] [i] Stopping worker 3294 gracefully (80 seconds)
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: [3294] [e] Stopping service after critical database error
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: [3045] [i] Worker 3295 stopped
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: [3045] [i] Worker 3292 stopped
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: [3045] [i] Worker 3294 stopped
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: [3045] [i] Worker 3293 stopped
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: [3045] [i] Manager 3045 stopped
Nov 23 03:33:39 openqaworker1 systemd[1]: openqa-worker-cacheservice.service: Main process exited, code=exited, status=232/ADDRESS_FAMILIES
Nov 23 03:33:39 openqaworker1 systemd[1]: openqa-worker-cacheservice.service: Unit entered failed state.
Nov 23 03:33:39 openqaworker1 systemd[1]: openqa-worker-cacheservice.service: Failed with result 'exit-code'.
Nov 23 03:33:39 openqaworker1 systemd[1]: openqa-worker-cacheservice.service: Service RestartSec=100ms expired, scheduling restart.
Nov 23 03:33:44 openqaworker1 systemd[1]: Stopped OpenQA Worker Cache Service.
Nov 23 03:33:44 openqaworker1 systemd[1]: Started OpenQA Worker Cache Service.
Nov 23 03:33:44 openqaworker1 openqa-workercache-daemon[3323]: [3323] [i] [QFKZjHFD] Cache size of "/var/lib/openqa/cache" is 19GiB, with limit 400GiB
Nov 23 03:33:44 openqaworker1 openqa-workercache-daemon[3323]: [3323] [i] Listening at "http://127.0.0.1:9530"
Nov 23 03:33:44 openqaworker1 openqa-workercache-daemon[3323]: [3323] [i] Listening at "http://[::1]:9530"
Nov 23 03:33:44 openqaworker1 openqa-workercache-daemon[3323]: [3323] [i] Manager 3323 started
Nov 23 03:33:44 openqaworker1 openqa-workercache-daemon[3323]: Web application available at http://127.0.0.1:9530
Nov 23 03:33:44 openqaworker1 openqa-workercache-daemon[3323]: Web application available at http://[::1]:9530
Nov 23 03:33:44 openqaworker1 openqa-workercache-daemon[3323]: [3327] [i] Worker 3327 started
Nov 23 03:33:44 openqaworker1 openqa-workercache-daemon[3323]: [3328] [i] Worker 3328 started
Nov 23 03:33:44 openqaworker1 openqa-workercache-daemon[3323]: [3329] [i] Worker 3329 started
Nov 23 03:33:44 openqaworker1 openqa-workercache-daemon[3323]: [3323] [i] Creating process id file "/tmp/prefork.pid"
Nov 23 03:33:44 openqaworker1 openqa-workercache-daemon[3323]: [3330] [i] Worker 3330 started

The weird thing is that it just works again after the restart but there's no trace of the database repairing (e.g. the log message "Re-indexing and vacuuming broken database"). The log of openqa-workercache-minion doesn't even show any SQLite related errors. The Minion dashboard also shows jobs from past 12 hours ago so the database also hasn't been purged at any point. So it is strange that we see this database error but after the (now automated service restart) no errors anymore at all. (I also just did the integrity check manually again to be sure in case the automated check would be buggy.)

#58 Updated by okurz about 2 months ago

  • Status changed from Feedback to In Progress

What we found out when looking over logs again together:

Problems started on openqaworker1 at 2020-11-22 07:36:50 when openqa-worker-cacheservice-minion detected that the database disk image is malformed. No messages before that in the complete system journal log journalctl --since="2020-11-22 07:00 could explain why suddenly the image would be malformed:

Nov 22 07:36:50 openqaworker1 openqa-worker-cacheservice-minion[4906]: DBD::SQLite::st execute failed: database disk image is malformed at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend>
Nov 22 07:36:50 openqaworker1 openqa-worker-cacheservice-minion[4906]:  at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 301.
Nov 22 07:36:50 openqaworker1 openqa-worker-cacheservice-minion[4906]:  at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Command/minion/worker.pm line 26.
Nov 22 07:36:50 openqaworker1 systemd[1]: openqa-worker-cacheservice-minion.service: Main process exited, code=exited, status=4/NOPERMISSION
Nov 22 07:36:50 openqaworker1 systemd[1]: openqa-worker-cacheservice-minion.service: Unit entered failed state.
Nov 22 07:36:50 openqaworker1 systemd[1]: openqa-worker-cacheservice-minion.service: Failed with result 'exit-code'.
Nov 22 07:36:51 openqaworker1 systemd[1]: openqa-worker-cacheservice-minion.service: Service RestartSec=100ms expired, scheduling restart.
Nov 22 07:36:51 openqaworker1 systemd[1]: Stopped OpenQA Worker Cache Service Minion.

The service was stopped and restarted automatically by systemd, starting an integrity check soon after:

Nov 22 07:36:51 openqaworker1 systemd[1]: Started OpenQA Worker Cache Service Minion.
Nov 22 07:36:51 openqaworker1 openqa-worker-cacheservice-minion[31276]: [31276] [e] [4MP8m9ei] Database integrity check found errors:
Nov 22 07:36:51 openqaworker1 openqa-worker-cacheservice-minion[31276]: [31276] [e] [4MP8m9ei] *** in database main ***
Nov 22 07:36:51 openqaworker1 openqa-worker-cacheservice-minion[31276]: Page 15 is never used
Nov 22 07:36:51 openqaworker1 openqa-worker-cacheservice-minion[31276]: Page 16 is never used
Nov 22 07:36:51 openqaworker1 openqa-worker-cacheservice-minion[31276]: Page 18 is never used
Nov 22 07:36:51 openqaworker1 openqa-worker-cacheservice-minion[31276]: Page 19 is never used
…

this triggered a repair attempt which seemed like it succeeded but then another error on the migration attempt happened which triggered the attempt to remove the database file "cache.sqlite". Note that with the current code the files "-shm" and "-wal" are not removed:

Nov 22 07:36:51 openqaworker1 openqa-worker-cacheservice-minion[31276]: [31276] [e] [4MP8m9ei] Re-indexing and vacuuming broken database
Nov 22 07:36:51 openqaworker1 openqa-worker-cacheservice-minion[31276]: [31276] [e] [4MP8m9ei] Purging cache directory because database has been corrupted: DBD::SQLite::st execute failed:>
Nov 22 07:36:51 openqaworker1 openqa-worker-cacheservice-minion[31276]:  at /usr/share/openqa/script/../lib/OpenQA/CacheService/Model/Cache.pm line 66.
Nov 22 07:36:51 openqaworker1 openqa-worker-cacheservice-minion[31276]: [31276] [i] [4MP8m9ei] Creating cache directory tree for "/var/lib/openqa/cache"
Nov 22 07:36:56 openqaworker1 openqa-worker-cacheservice-minion[31276]: [31276] [i] [4MP8m9ei] Cache size of "/var/lib/openqa/cache" is 0 Byte, with limit 400GiB
Nov 22 07:36:57 openqaworker1 openqa-worker-cacheservice-minion[31276]: [31276] [i] Resetting all leftover locks after restart
Nov 22 07:36:57 openqaworker1 openqa-worker-cacheservice-minion[31276]: [31276] [i] Worker 31276 started

Note that "openqa-worker-cacheservice" itself did not stop and did not even realize any problems. Likely the server processes were continuing based on an inconsistent database state in memory.

Some minutes later the next openQA worker service finished its job and was trying to pick up a new job which failed with "[error] Worker cache not available: No workers active in the cache service":

Nov 22 07:40:07 openqaworker1 worker[3076]: [info] Isotovideo exit status: 0
Nov 22 07:40:07 openqaworker1 worker[3076]: [info] +++ worker notes +++
Nov 22 07:40:07 openqaworker1 worker[3076]: [info] End time: 2020-11-22 06:40:07
Nov 22 07:40:07 openqaworker1 worker[3076]: [info] Result: done
Nov 22 07:40:07 openqaworker1 worker[3076]: [info] Uploading video.ogv
Nov 22 07:40:07 openqaworker1 worker[3076]: [info] Uploading vars.json
Nov 22 07:40:07 openqaworker1 worker[3076]: [info] Uploading autoinst-log.txt
Nov 22 07:40:07 openqaworker1 worker[3076]: [info] Uploading worker-log.txt
Nov 22 07:40:08 openqaworker1 worker[3076]: [info] Uploading serial0.txt
Nov 22 07:40:08 openqaworker1 worker[3076]: [info] Uploading video_time.vtt
Nov 22 07:40:08 openqaworker1 worker[3076]: [info] Uploading serial_terminal.txt
Nov 22 07:40:12 openqaworker1 worker[3076]: [error] Worker cache not available: No workers active in the cache service

Likely no job was ever picked up and the worker likely reported itself as "broken" until it would check again for the worker cache availability.

Over the next hours workers were changing between the above error message and actually picking up jobs but incompleting. Likely the worker services were talking to different processes of openqa-worker-cacheservice returning inconsistent states.

This all continued until the situation effectively healed itself after the reboot of the system with the start of openqa-worker-cacheservice which detected the malformed database (finally) and re-initialized it (again):

Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: [3295] [e] [K7E-kVGB] DBD::SQLite::st execute failed: database disk image is malformed at /usr/lib/perl5/vendor_perl/5.26.1/>
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]:  at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/SQLite.pm line 115.

This then triggered, as expected, a restart of openqa-worker-cacheservice which then, due to the systemd service dependency, triggered a restart of openqa-worker-cacheservice-minion as well:

Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: [3295] [e] Stopping service after critical database error
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: [3045] [i] Stopping worker 3292 gracefully (80 seconds)
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: [3045] [i] Stopping worker 3295 gracefully (80 seconds)
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: [3045] [i] Stopping worker 3293 gracefully (80 seconds)
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: [3045] [i] Stopping worker 3294 gracefully (80 seconds)
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: [3294] [e] Stopping service after critical database error
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: [3045] [i] Worker 3295 stopped
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: [3045] [i] Worker 3292 stopped
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: [3045] [i] Worker 3294 stopped
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: [3045] [i] Worker 3293 stopped
Nov 23 03:33:39 openqaworker1 openqa-workercache-daemon[3045]: [3045] [i] Manager 3045 stopped
Nov 23 03:33:39 openqaworker1 systemd[1]: openqa-worker-cacheservice.service: Main process exited, code=exited, status=232/ADDRESS_FAMILIES
Nov 23 03:33:39 openqaworker1 systemd[1]: openqa-worker-cacheservice.service: Unit entered failed state.
Nov 23 03:33:39 openqaworker1 systemd[1]: openqa-worker-cacheservice.service: Failed with result 'exit-code'.
…
Nov 23 03:33:39 openqaworker1 systemd[1]: openqa-worker-cacheservice.service: Service RestartSec=100ms expired, scheduling restart.
Nov 23 03:33:39 openqaworker1 systemd[1]: Stopping OpenQA Worker Cache Service Minion...
Nov 23 03:33:44 openqaworker1 openqa-worker-cacheservice-minion[3060]: [3060] [i] Worker 3060 stopped
Nov 23 03:33:44 openqaworker1 systemd[1]: openqa-worker-cacheservice-minion.service: Main process exited, code=exited, status=144/n/a
Nov 23 03:33:44 openqaworker1 systemd[1]: Stopped OpenQA Worker Cache Service Minion.
Nov 23 03:33:44 openqaworker1 systemd[1]: openqa-worker-cacheservice-minion.service: Unit entered failed state.
Nov 23 03:33:44 openqaworker1 systemd[1]: openqa-worker-cacheservice-minion.service: Failed with result 'exit-code'.
Nov 23 03:33:44 openqaworker1 systemd[1]: Stopped OpenQA Worker Cache Service.
Nov 23 03:33:44 openqaworker1 systemd[1]: Started OpenQA Worker Cache Service.
Nov 23 03:33:44 openqaworker1 systemd[1]: Started OpenQA Worker Cache Service Minion.
Nov 23 03:33:44 openqaworker1 openqa-workercache-daemon[3323]: [3323] [i] [QFKZjHFD] Cache size of "/var/lib/openqa/cache" is 19GiB, with limit 400GiB
Nov 23 03:33:44 openqaworker1 openqa-worker-cacheservice-minion[3324]: [3324] [i] [MCXi6dwt] Cache size of "/var/lib/openqa/cache" is 19GiB, with limit 400GiB
Nov 23 03:33:44 openqaworker1 openqa-workercache-daemon[3323]: [3323] [i] Listening at "http://127.0.0.1:9530"
Nov 23 03:33:44 openqaworker1 openqa-workercache-daemon[3323]: [3323] [i] Listening at "http://[::1]:9530"
Nov 23 03:33:44 openqaworker1 openqa-workercache-daemon[3323]: [3323] [i] Manager 3323 started
Nov 23 03:33:44 openqaworker1 openqa-workercache-daemon[3323]: Web application available at http://127.0.0.1:9530
Nov 23 03:33:44 openqaworker1 openqa-workercache-daemon[3323]: Web application available at http://[::1]:9530
Nov 23 03:33:44 openqaworker1 openqa-worker-cacheservice-minion[3324]: [3324] [i] Resetting all leftover locks after restart
Nov 23 03:33:44 openqaworker1 openqa-worker-cacheservice-minion[3324]: [3324] [i] Worker 3324 started
Nov 23 03:33:44 openqaworker1 openqa-workercache-daemon[3323]: [3327] [i] Worker 3327 started
Nov 23 03:33:44 openqaworker1 openqa-workercache-daemon[3323]: [3328] [i] Worker 3328 started
Nov 23 03:33:44 openqaworker1 openqa-workercache-daemon[3323]: [3329] [i] Worker 3329 started
Nov 23 03:33:44 openqaworker1 openqa-workercache-daemon[3323]: [3323] [i] Creating process id file "/tmp/prefork.pid"
Nov 23 03:33:44 openqaworker1 openqa-workercache-daemon[3323]: [3330] [i] Worker 3330 started

and now the situation was again in sync with a clean database and minions available to pick up work again. After all the impact were many unhandled incompletes with a new reason handled in #80202 and the unavailability of openqaworker1 for 20h but eventually healing itself.

Suggestions for improvement:

#59 Updated by okurz about 2 months ago

  • Related to action #80202: jobs incomplete with auto_review:"setup failure: No workers active in the cache service":retry added

#60 Updated by okurz about 2 months ago

  • Subject changed from Job incompletes due to malformed worker cache database disk image with auto_review:"Cache service status error.*database disk image is malformed.*":retry to Job incompletes due to malformed worker cache database disk image with auto_review:"Cache service status error.*(database disk image is malformed|Specified job ID is invalid).*":retry

#61 Updated by okurz about 2 months ago

https://openqa.opensuse.org/tests/1482184 on openqaworker4 this morning: "Reason: setup failure: Cache service status error from API: Specified job ID is invalid", extended auto_review expression.

From log on openqaworker4:

Nov 24 02:08:12 openqaworker4 openqa-worker-cacheservice-minion[7772]: [4377] [i] [#2884] Downloading: "openSUSE-Staging:K-Tumbleweed-DVD-x86_64-Build249.1-Media.iso"
Nov 24 02:08:12 openqaworker4 openqa-worker-cacheservice-minion[7772]: [4377] [i] [#2884] Cache size of "/var/lib/openqa/cache" is 43GiB, with limit 50GiB
Nov 24 02:08:12 openqaworker4 openqa-worker-cacheservice-minion[7772]: [4377] [i] [#2884] Downloading "openSUSE-Staging:K-Tumbleweed-DVD-x86_64-Build249.1-Media.iso" from "http://openqa1->
Nov 24 02:08:12 openqaworker4 openqa-workercache-daemon[7771]: [7782] [e] [nizmOeKc] DBD::SQLite::st execute failed: database disk image is malformed at /usr/lib/perl5/vendor_perl/5.26.1/>
Nov 24 02:08:12 openqaworker4 openqa-workercache-daemon[7771]:  at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 138.
Nov 24 02:08:12 openqaworker4 worker[2240]: [info] Download of openSUSE-Staging:K-Tumbleweed-DVD-x86_64-Build249.1-Media.iso.sha256 processed:
Nov 24 02:08:12 openqaworker4 worker[2240]: [info] [#2878] Cache size of "/var/lib/openqa/cache" is 43GiB, with limit 50GiB
Nov 24 02:08:12 openqaworker4 worker[2240]: [info] [#2878] Downloading "openSUSE-Staging:K-Tumbleweed-DVD-x86_64-Build249.1-Media.iso.sha256" from "http://openqa1-opensuse/tests/1482133/a>
Nov 24 02:08:12 openqaworker4 worker[2240]: [info] [#2878] Size of "/var/lib/openqa/cache/openqa1-opensuse/openSUSE-Staging:K-Tumbleweed-DVD-x86_64-Build249.1-Media.iso.sha256" is 118 Byt>
Nov 24 02:08:12 openqaworker4 worker[2240]: [info] [#2878] Download of "/var/lib/openqa/cache/openqa1-opensuse/openSUSE-Staging:K-Tumbleweed-DVD-x86_64-Build249.1-Media.iso.sha256" succes>
Nov 24 02:08:12 openqaworker4 openqa-workercache-daemon[7771]: [7782] [e] Stopping service after critical database error
Nov 24 02:08:12 openqaworker4 openqa-workercache-daemon[7771]: [7771] [i] Stopping worker 7784 gracefully (80 seconds)

so suddenly the cache database turned bad. openqa-workercache-daemon terminated itself.
At this time the worker was still running without https://github.com/os-autoinst/openQA/pull/3579 but the startup still looks clean because it was initiated by openqa-workercache-daemon which already restarts the minion service through systemd:

Nov 24 02:08:23 openqaworker4 systemd[1]: Stopped OpenQA Worker Cache Service.
Nov 24 02:08:23 openqaworker4 systemd[1]: Started OpenQA Worker Cache Service.
Nov 24 02:08:23 openqaworker4 systemd[1]: Started OpenQA Worker Cache Service Minion.
…
Nov 24 02:08:23 openqaworker4 openqa-workercache-daemon[4385]: [4385] [e] [GwomvWTX] Re-indexing and vacuuming broken database
Nov 24 02:08:23 openqaworker4 openqa-workercache-daemon[4385]: [4385] [i] [GwomvWTX] Purging "/var/lib/openqa/cache/openqa1-opensuse/openSUSE-Staging:K-Tumbleweed-DVD-x86_64-Build249.1-Me>
Nov 24 02:08:23 openqaworker4 openqa-workercache-daemon[4385]: [4385] [i] [GwomvWTX] Cache size of "/var/lib/openqa/cache" is 43GiB, with limit 50GiB
Nov 24 02:08:23 openqaworker4 openqa-worker-cacheservice-minion[4386]: [4386] [i] [Pplh3Aqx] Cache size of "/var/lib/openqa/cache" is 43GiB, with limit 50GiB
Nov 24 02:08:24 openqaworker4 openqa-workercache-daemon[4385]: [4385] [i] Listening at "http://127.0.0.1:9530"
Nov 24 02:08:24 openqaworker4 openqa-workercache-daemon[4385]: [4385] [i] Listening at "http://[::1]:9530"
Nov 24 02:08:24 openqaworker4 openqa-workercache-daemon[4385]: [4385] [i] Manager 4385 started
Nov 24 02:08:24 openqaworker4 openqa-workercache-daemon[4385]: Web application available at http://127.0.0.1:9530
Nov 24 02:08:24 openqaworker4 openqa-workercache-daemon[4385]: Web application available at http://[::1]:9530
Nov 24 02:08:24 openqaworker4 openqa-workercache-daemon[4385]: [4389] [i] Worker 4389 started
Nov 24 02:08:24 openqaworker4 openqa-workercache-daemon[4385]: [4390] [i] Worker 4390 started
Nov 24 02:08:24 openqaworker4 openqa-workercache-daemon[4385]: [4391] [i] Worker 4391 started
Nov 24 02:08:24 openqaworker4 openqa-workercache-daemon[4385]: [4385] [i] Creating process id file "/tmp/prefork.pid"
Nov 24 02:08:24 openqaworker4 openqa-workercache-daemon[4385]: [4392] [i] Worker 4392 started
Nov 24 02:08:24 openqaworker4 openqa-worker-cacheservice-minion[4386]: [4386] [i] Resetting all leftover locks after restart
Nov 24 02:08:24 openqaworker4 openqa-worker-cacheservice-minion[4386]: [4386] [i] Worker 4386 started

on openqaworker7 I saw after bootup:

Nov 24 03:42:56 openqaworker7 openqa-worker-cacheservice-minion[3911]: [3911] [i] Resetting all leftover locks after restart
Nov 24 03:42:56 openqaworker7 openqa-worker-cacheservice-minion[3911]: [3911] [i] Worker 3911 started
Nov 24 03:42:56 openqaworker7 openqa-workercache-daemon[3622]: [3910] [e] [2O1O1owz] DBD::SQLite::st execute failed: UNIQUE constraint failed: mojo_migrations.name at /usr/lib/perl5/vendo>
Nov 24 03:42:56 openqaworker7 openqa-workercache-daemon[3622]:  at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/SQLite.pm line 115.

but then other tests continued fine. Seems like the error had no effect and was not caught and triggering a reset of the DB though. Should be checked.

#62 Updated by okurz about 2 months ago

https://sqlite.org/lpc2019/doc/trunk/briefing.md mentions "Rollback is therefore the default. The write-ahead log (WAL mode) is faster, but only works on systems for which either (1) only a single process accesses the database at a time or (2) a mmap()-ed file can be used as shared memory. ". Judging from the presence of a "WAL" file I assume we use "write-ahead". We could try if we can use one of the other modes to prevent the database corruptions.

from kraih: "btw. to use DELETE mode there's a no_wal=1 option in Mojo::SQLite https://metacpan.org/pod/Mojo::SQLite#from_string . only problem is that the old sqlite file will need to be deleted to do the switch. WAL mode is actually not the default for sqlite, it's just Mojo::SQLite changing the default from DELETE mode"

#63 Updated by mkittler about 2 months ago

https://openqa.opensuse.org/tests/1482184 on openqaworker4 this morning: "Reason: setup failure: Cache service status error from API: Specified job ID is invalid", extended auto_review expression.

I've seen it before as well, e.g. https://progress.opensuse.org/issues/80202#note-4

Judging from the presence of a "WAL" file I assume we use "write-ahead".

That's correct. And we also use the mmap()-ed file so our multi-process use case should be covered.

We could try if we can use one of the other modes to prevent the database corruptions.

I've created https://github.com/os-autoinst/openQA/pull/3586 so we could try disabling WAL mode on a worker to see whether it makes a difference.

#64 Updated by nicksinger about 2 months ago

the systemd timeouts where adjusted last friday by me to 300s. I remember mkittler mentioning jobs running even longer then that. Let me know if I should bump once again or if you're rather pursue another possibility.

#65 Updated by okurz about 2 months ago

I checked today with

for i in aarch64 openqaworker1 openqaworker4 openqaworker7 power8 imagetester rebel; do echo $i && ssh root@$i "systemctl --no-legend --failed" ; done

and found:

openqaworker7
openqa-worker-cacheservice-minion.service loaded failed failed OpenQA Worker Cache Service Minion
openqa-worker-cacheservice.service        loaded failed failed OpenQA Worker Cache Service

did on openqaworker7:

systemctl start openqa-worker.target openqa-worker-cacheservice.service openqa-worker-cacheservice-minion.service

please check logs

#66 Updated by mkittler about 2 months ago

On openqaworker7 I saw only the following SQL errors lately:

Nov 25 05:09:56 openqaworker7 openqa-workercache-daemon[3858]: [3864] [e] [ziUeQAqI] DBD::SQLite::st execute failed: file is not a database at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 363.
Nov 25 05:09:56 openqaworker7 openqa-workercache-daemon[3858]:  at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 363.
Nov 25 05:09:56 openqaworker7 openqa-workercache-daemon[3858]: [3864] [e] [00sUh8eh] DBD::SQLite::st execute failed: file is not a database at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 363.
Nov 25 05:09:56 openqaworker7 openqa-workercache-daemon[3858]:  at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 363.
Nov 25 05:09:56 openqaworker7 openqa-workercache-daemon[3858]: [3864] [e] [NSRbWKla] DBD::SQLite::st execute failed: file is not a database at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 363.
Nov 25 05:09:56 openqaworker7 openqa-workercache-daemon[3858]:  at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 363.
Nov 25 05:11:42 openqaworker7 openqa-workercache-daemon[3858]: [3864] [e] [gVS33GSC] DBD::SQLite::st execute failed: file is not a database at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 363.
Nov 25 05:11:42 openqaworker7 openqa-workercache-daemon[3858]:  at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 363.
Nov 25 05:13:17 openqaworker7 openqa-workercache-daemon[3858]: [3864] [e] [QexCEyDn] DBD::SQLite::st execute failed: file is not a database at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 363.
Nov 25 05:13:17 openqaworker7 openqa-workercache-daemon[3858]:  at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 363.
Nov 25 05:13:22 openqaworker7 openqa-workercache-daemon[3858]: [3864] [e] [sSslKzHD] DBD::SQLite::st execute failed: file is not a database at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 363.
Nov 25 05:13:22 openqaworker7 openqa-workercache-daemon[3858]:  at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 363.

Since I haven't seen this error before it isn't detected by my change for stopping the service after fatal errors. So the service continued to run after that. The next lines in the journal concerning the cache service are logged only 4 hours later:

Nov 25 09:18:37 openqaworker7 systemd[1]: Stopping OpenQA Worker Cache Service...
Nov 25 09:18:37 openqaworker7 openqa-workercache-daemon[3858]: [3858] [w] Stopping worker 3865 immediately
Nov 25 09:18:37 openqaworker7 openqa-workercache-daemon[3858]: [3858] [w] Stopping worker 3863 immediately
Nov 25 09:18:37 openqaworker7 openqa-workercache-daemon[3858]: [3858] [w] Stopping worker 3864 immediately
Nov 25 09:18:37 openqaworker7 openqa-workercache-daemon[3858]: [3858] [w] Stopping worker 3862 immediately
Nov 25 09:18:37 openqaworker7 openqa-workercache-daemon[3858]: [3858] [i] Worker 3862 stopped
Nov 25 09:18:37 openqaworker7 openqa-workercache-daemon[3858]: [3858] [i] Worker 3863 stopped
Nov 25 09:18:37 openqaworker7 openqa-workercache-daemon[3858]: [3858] [i] Worker 3864 stopped
Nov 25 09:18:37 openqaworker7 openqa-workercache-daemon[3858]: [3858] [i] Worker 3865 stopped
Nov 25 09:18:37 openqaworker7 openqa-workercache-daemon[3858]: [3858] [i] Manager 3858 stopped
Nov 25 09:18:37 openqaworker7 systemd[1]: openqa-worker-cacheservice.service: Main process exited, code=exited, status=144/n/a
Nov 25 09:18:37 openqaworker7 systemd[1]: Stopped OpenQA Worker Cache Service.
Nov 25 09:18:37 openqaworker7 systemd[1]: openqa-worker-cacheservice.service: Unit entered failed state.
Nov 25 09:18:37 openqaworker7 systemd[1]: openqa-worker-cacheservice.service: Failed with result 'exit-code'.
Nov 25 10:05:57 openqaworker7 systemd[1]: Started OpenQA Worker Cache Service.
Nov 25 10:05:57 openqaworker7 systemd[1]: Started OpenQA Worker Cache Service Minion.

Maybe this is when okurz restarted the service? The system is rebooted at Nov 25 10:06:53 and then rebooted again a few times. Nothing special is happening anymore in that time frame.


On openqaworker1 and openqaworker4 we see the usual SQLite problems in the log which lead to the services being restarted. However, there are no failing integrity checks being logged after the last deployment so https://github.com/os-autoinst/openQA/pull/3584 couldn't actually make a difference.

#67 Updated by mkittler about 2 months ago

I've just noticed that we use ext2 as filesystem (where the SQLite file is stored). I've never used ext2 myself and I'm wondering whether it might help to use a different filesystem, e.g. might locking be an issue with ext2 (compared to newer file systems)?

#68 Updated by mkittler about 2 months ago

regarding ext2:

[11:29] ext2 can easily have data corruption on unclean shutdown, but locking is AFAIK fine


I've been enabling the DELETE journaling mode now on openqaworker4 (and I'm keeping the WAL journaling mode on openqaworker1). This way we can compare whether it makes a difference. (So far the SQLite errors are seen on both workers quite frequently.)

The relevant configuration are /etc/systemd/system/openqa-worker-cacheservice.service.d/poo67000.conf and /etc/systemd/system/openqa-worker-cacheservice-minion.service.d/poo67000.conf with the following content:

[Service]
Environment=OPENQA_CACHE_SERVICE_SQLITE_JOURNAL_MODE=delete

To undo this change:

  1. Delete files again.
  2. systemctl daemon-reload
  3. systemctl stop openqa-worker-cacheservice{,-minion}
  4. systemctl start openqa-worker-cacheservice
  5. Check whether the service behaves as expected and whether /var/lib/openqa/cache/ contains a WAL file.
  6. systemctl start openqa-worker-cacheservice-minion

#69 Updated by okurz about 2 months ago

  • Related to action #73339: auto_review:"setup failure: Cache service status error from API: Minion job.* failed: Can't use an undefined value as a HASH reference at.*" added

#70 Updated by mkittler about 2 months ago

PR for using DELETE journaling mode by default since it worked out well¹ on openqaworker4 so far: https://github.com/os-autoinst/openQA/pull/3592

¹ select state, t_started, (select host from workers where id = assigned_worker_id), reason from jobs where t_started > '2020-11-25 12:17:44' and result = 'incomplete' and reason like '%Cache service%'; only returns 20 jobs and none of them ran on openqaworker4 and the word "SQLite" isn't present in openqaworker4's journal since 2020-11-25 12:17:44.

#71 Updated by mkittler about 2 months ago

It looks like enabling DELETE journaling mode helped so far. There are no incomplete jobs returned by the query from my previous comment after the deployment and also not when searching for %No workers active%.

On power8 the issue is still present after the nightly deployment time, e.g. Nov 27 08:08:38 power8 worker[119107]: [error] Unable to setup job 1485673: Cache service status error from API: Minion job #48922 failed: DBD::SQLite::st execute failed: database disk image is malformed at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 138.. However, this worker wasn't updated so I updated the worker package there manually now.

There were some SQLite errors logged shortly after the nightly reboot but these problems resolved themselves and then there were no further SQLite errors:

Nov 27 03:34:26 openqaworker1 openqa-workercache-daemon[3098]: [3098] [e] [GAqgNxDL] Database has been corrupted: DBD::SQLite::db do failed: database is locked at /usr/share/openqa/script/../lib/OpenQA/CacheService.pm line 82.
Nov 27 03:34:26 openqaworker1 openqa-workercache-daemon[3098]: [3098] [e] [GAqgNxDL] Killing processes accessing the database file handles and removing database
Nov 27 03:34:26 openqaworker1 openqa-worker-cacheservice-minion[3109]: [3109] [e] [p6d97lXH] Database has been corrupted: DBD::SQLite::db do failed: database is locked at /usr/share/openqa/script/../lib/OpenQA/CacheService.pm line 82.
Nov 27 03:34:26 openqaworker1 openqa-worker-cacheservice-minion[3109]: [3109] [e] [p6d97lXH] Killing processes accessing the database file handles and removing database
Nov 27 03:34:26 openqaworker1 openqa-worker-cacheservice-minion[3109]: [3109] [i] [p6d97lXH] Creating cache directory tree for "/var/lib/openqa/cache"
Nov 27 03:34:26 openqaworker1 openqa-workercache-daemon[3098]: [3098] [i] [GAqgNxDL] Creating cache directory tree for "/var/lib/openqa/cache"
Nov 27 03:38:57 openqaworker7 openqa-workercache-daemon[3930]: [3930] [i] [Q8NeRHTH] Creating cache directory tree for "/var/lib/openqa/cache"
Nov 27 03:38:57 openqaworker7 openqa-worker-cacheservice-minion[3931]: [3931] [i] [SoISGbrc] Creating cache directory tree for "/var/lib/openqa/cache"
Nov 27 03:38:57 openqaworker7 openqa-worker-cacheservice-minion[3931]: [3931] [i] [SoISGbrc] Cache size of "/var/lib/openqa/cache" is 0 Byte, with limit 180GiB
Nov 27 03:38:57 openqaworker7 openqa-workercache-daemon[3930]: Deploying cache database to "/var/lib/openqa/cache/cache.sqlite" failed (Maybe the file is corrupted and needs to be deleted?): DBD::SQLite::st execute failed: UNIQUE constraint failed: mojo_migrations.name at /usr/share/openqa/script/../lib/OpenQA/Cac>
Nov 27 03:38:57 openqaworker7 openqa-workercache-daemon[3930]:  at /usr/share/openqa/script/../lib/OpenQA/CacheService/Model/Cache.pm line 96.
Nov 27 03:38:57 openqaworker7 openqa-workercache-daemon[3930]:  at /usr/share/openqa/script/../lib/OpenQA/CacheService.pm line 97.
Nov 27 03:38:57 openqaworker7 systemd[1]: openqa-worker-cacheservice.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Nov 27 03:38:57 openqaworker7 systemd[1]: openqa-worker-cacheservice.service: Unit entered failed state.
Nov 27 03:38:57 openqaworker7 systemd[1]: openqa-worker-cacheservice.service: Failed with result 'exit-code'.
Nov 27 03:38:57 openqaworker7 openqa-worker-cacheservice-minion[3931]: [3931] [i] Resetting all leftover locks after restart
Nov 27 03:38:57 openqaworker7 openqa-worker-cacheservice-minion[3931]: [3931] [i] Worker 3931 started
Nov 27 03:38:57 openqaworker7 systemd[1]: openqa-worker-cacheservice.service: Service RestartSec=100ms expired, scheduling restart.
Nov 27 03:38:57 openqaworker7 systemd[1]: Stopping OpenQA Worker Cache Service Minion...
Nov 27 03:39:02 openqaworker7 openqa-worker-cacheservice-minion[3931]: [3931] [i] Worker 3931 stopped
Nov 27 03:39:02 openqaworker7 systemd[1]: openqa-worker-cacheservice-minion.service: Main process exited, code=exited, status=48/n/a
Nov 27 03:39:02 openqaworker7 systemd[1]: Stopped OpenQA Worker Cache Service Minion.
Nov 27 03:39:02 openqaworker7 systemd[1]: openqa-worker-cacheservice-minion.service: Unit entered failed state.
Nov 27 03:39:02 openqaworker7 systemd[1]: openqa-worker-cacheservice-minion.service: Failed with result 'exit-code'.
Nov 27 03:39:02 openqaworker7 systemd[1]: Stopped OpenQA Worker Cache Service.
Nov 27 03:39:02 openqaworker7 systemd[1]: Started OpenQA Worker Cache Service.
Nov 27 03:39:02 openqaworker7 systemd[1]: Started OpenQA Worker Cache Service Minion.
Nov 27 03:34:52 openqa-aarch64 openqa-worker-cacheservice-minion[2936]: [2936] [e] [6c9O_fq2] Database has been corrupted: DBD::SQLite::db do failed: database is locked at /usr/share/openqa/script/../lib/OpenQA/CacheService.pm line 82.
Nov 27 03:34:52 openqa-aarch64 openqa-worker-cacheservice-minion[2936]: [2936] [e] [6c9O_fq2] Killing processes accessing the database file handles and removing database
Nov 27 03:34:52 openqa-aarch64 openqa-workercache-daemon[2926]: [2926] [e] [pw-GcYQZ] Database has been corrupted: DBD::SQLite::db do failed: database is locked at /usr/share/openqa/script/../lib/OpenQA/CacheService.pm line 82.
Nov 27 03:34:52 openqa-aarch64 openqa-workercache-daemon[2926]: [2926] [e] [pw-GcYQZ] Killing processes accessing the database file handles and removing database

#72 Updated by mkittler about 2 months ago

failed_since="timestamp '2020-11-27 03:30:00'" show_job_ids=1 show_worker_hosts=1 ./openqa-incompletes-stats

shows no further relevant jobs and there are no SQLite errors in the journal (checked 3 o3 workers).

The change hasn't been deployed on OSD yet. There we have 626 incompletes alone for #78169 (in the same timespan). Maybe we want to deploy there sooner than usual?

#73 Updated by mkittler about 2 months ago

Still no more relevant jobs on o3 and for OSD it also looks good after the deployment today (host=openqa.suse.de failed_since="timestamp '2020-12-02 07:12:00'" show_job_ids=1 show_worker_hosts=1 ./openqa-incompletes-stats only shows unrelated jobs).

#74 Updated by cdywan about 2 months ago

post mortem

  • retrospective
    • the issue became visible several months ago, but likely existed for years
    • have we always been using wal? yes
    • explicit transactions used to be used years ago, which made reads exclusive, but would have been inconsistent with minion
    • adding new column and new table might have increased the frequency of corruption
    • logic to remove the db in case of problems was never used? probably contributed to the corruption but never actually deleted the db
    • all of the changes helped us understand the issues better
  • user impact
    • a small issue has a big impact on end users
    • there was no automatic restarting, no self-repair and no failure
    • jobs became incompletes instead of failing
    • gradually got worse due to increased frequency
  • superficial mitigation
    • #opensuse-factory
    • repairing sqlite by hand
    • autoreview restarted a lot of jobs -> number of incompletes shows the effects of further fixes
  • upstream
    • properties of ext2, heavily relies on locking, very fast nvme's
    • easy reproducer for malformed sqlite was not feasible
    • perl sqlite bindings: contacting maintainer didn't yield anything, mojo sqlite uses write ahead mode by default, minion uses mojo sqlite
    • mojo sqlite: the default delete mode has 5s timeout and users very commonly run into that, consider addition to upstream docs?
  • what could be improved
    • a lot of similar issues reported, a lot of time spent
    • auto review never looked into resolved ticket
    • we could have spent more time early on?
  • what went well
    • we can be more confident extending the cache service again
    • marius took over after sebstian stepper back for a bit
    • we managed to keep up the urgency w/o blaming anyone in particular
    • autoreview, incompletes, grafana etc worked well

#75 Updated by cdywan about 2 months ago

Next steps:

  • check for work-arounds to be removed
  • add a big warning in the code that nobody changes the mode again
  • check all related tickets to see if specific symptoms are still there or work-around were applied
  • better stress tests?
  • have autoreview look into resolved tickets
    • could find rejected or closed tickets not deployed yet
    • queue will grow substantially
    • make it an additional/optional source of information
  • reminder: more brainstorming on complicated issues
  • reminder: handling outbound communication to keep the team focussed is good

#76 Updated by mkittler about 2 months ago

  • Status changed from In Progress to Resolved

We've already improved the previous workaround so we don't delete the SQLite database file anymore while processes still have it opened. So I suppose we can keep the workaround in its current state. I also don't see any leftover workarounds for this ticket in production anymore. There are also no problems anymore on OSD and o3 since the switch to the DELETE journaling mode.

Also available in: Atom PDF