action #73321: all jobs run on openqaworker8 incomplete:"Cache service status error from API: Minion job #46203 failed: Couldn't add download: DBD::SQLite::st execute failed: database disk image is malformed*" - openQA Project (public) - openSUSE Project Management Tool

Actions

Copy link

action #73321

closed

all jobs run on openqaworker8 incomplete:"Cache service status error from API: Minion job #46203 failed: Couldn't add download: DBD::SQLite::st execute failed: database disk image is malformed*"

Added by Xiaojing_liu over 4 years ago. Updated over 4 years ago.

Status:

Rejected

Priority:

Normal

Assignee:

okurz

Category:

Regressions/Crashes

Target version:

Ready

Start date:

2020-10-14

Due date:

% Done:

Estimated time:

Tags:

worker, minion, cache, sqlite

Description

Observation¶

jobs run on openqaworker8 incomplete, the reason is Cache service status error from API: Minion job #46203 failed: Couldn't add download: DBD::SQLite::st execute failed: database disk image is malformed at /usr/share/openqa/script/../lib/OpenQA/CacheService/Model/Downloads.pm line 34. at /usr/share/openqa/script/../lib/OpenQA/CacheSer…

Please see more details on:
https://10.160.0.207/tests/4823712
https://10.160.0.207/tests/4821829

Acceptance criteria¶

AC1: No jobs incomplete due to corrupted worker cache sqlite database files

Suggestions¶

As suggested in #73321#note-7 it makes sense if someone else besides kraih looks into this issue to double check his assertions
After that look into mitigation features (like automatic cache service reset when this error is encountered)

Files

Screenshot_2020-10-14_09-30-37.png (120 KB) Screenshot_2020-10-14_09-30-37.png

"All" jobs failed due to this issue

dzedro, 2020-10-14 07:55

Related issues 3 (0 open — 3 closed)

Actions

Copy link

Updated by dzedro over 4 years ago

File Screenshot_2020-10-14_09-30-37.png Screenshot_2020-10-14_09-30-37.png added

Happened only on openqaworker8

Actions

Copy link

Updated by okurz over 4 years ago

Copied to action #73342: all jobs run on openqaworker8 incomplete: "Cache service status error from API: Minion job .*failed: .*(database disk image is malformed|not a database)":retry added

Actions

Copy link

Updated by okurz over 4 years ago

Tags set to cache, worker, minion, sqlite
Category set to Regressions/Crashes
Priority changed from High to Normal
Target version set to Ready

I will keep this ticket to improve the handling within openQA itself and handle the worker specifically in #73342

Actions

Copy link

Updated by okurz over 4 years ago

The db file is available in https://w3.nue.suse.com/~okurz/cache.sqlite_poo73321

Actions

Copy link

Updated by okurz over 4 years ago

Related to action #67000: Job incompletes due to malformed worker cache database disk image with auto_review:"Cache service status error.*(database disk image is malformed|Specified job ID is invalid).*":retry added

Actions

Copy link

Updated by kraih over 4 years ago

Analysis of the corrupted SQLite file is interesting, but inconclusive. Two entirely unrelated tables are affected and no hints at what might cause the issue.

*** in database main ***
Page 26 is never used
Page 27 is never used
Page 1048 is never used
Page 1367 is never used
Page 1411 is never used
Page 1445 is never used
Page 1649 is never used
Page 1706 is never used
Page 1783 is never used
Page 1803 is never used
Page 1816 is never used
Page 1846 is never used
Page 1851 is never used
Page 1941 is never used
Page 1970 is never used
Page 1976 is never used
Page 1993 is never used
Page 2032 is never used
Page 2082 is never used
Page 2129 is never used
Page 2140 is never used
Page 2242 is never used
Page 2264 is never used
Page 2294 is never used
Page 2345 is never used
Page 2349 is never used
Page 2405 is never used
Page 2420 is never used
Page 2475 is never used
Page 2513 is never used
Page 2531 is never used
Page 2532 is never used
Page 2576 is never used
Page 2588 is never used
Page 2632 is never used
Page 2677 is never used
Page 2695 is never used
Page 2903 is never used
Page 2916 is never used
Page 2940 is never used
Page 2997 is never used
Page 3117 is never used
Page 3119 is never used
Page 3142 is never used
wrong # of entries in index minion_jobs_state_priority_id
NULL value in downloads.created
row 1 missing from index downloads_created
row 1 missing from index downloads_lock
NULL value in downloads.created
row 2 missing from index downloads_created
row 2 missing from index downloads_lock
NULL value in downloads.created
row 3 missing from index downloads_created
row 3 missing from index downloads_lock
NULL value in downloads.created
row 4 missing from index downloads_created
row 4 missing from index downloads_lock
row 1704 missing from index downloads_created
wrong # of entries in index downloads_created
wrong # of entries in index downloads_lock

Actions

Copy link

Updated by kraih over 4 years ago

I've tried tracking down the root problem for a few months now without much luck. Adding explicit transactions around write queries seems to have reduced the frequency of the problem occuring (could be a coincidence though). But every now and then it still happens on a different worker, so hardware issues are also rather unlikely. We've never been able to replicate the problem on a dev machine, and the corrupted SQLite files we've been able to preserve and analyse so far have not shown anything conclusive. It would probably make sense if someone else from the team looks into this issue to double check my assertions before we look into mitigation features (like automatic cache service reset when this error is encountered).

Actions

Copy link