action #73321
closed
all jobs run on openqaworker8 incomplete: "Cache service status error from API: Minion job #46203 failed: Couldn't add download: DBD::SQLite::st execute failed: database disk image is malformed*"
Description
Observation
Jobs run on openqaworker8 incomplete; the reason is: Cache service status error from API: Minion job #46203 failed: Couldn't add download: DBD::SQLite::st execute failed: database disk image is malformed at /usr/share/openqa/script/../lib/OpenQA/CacheService/Model/Downloads.pm line 34. at /usr/share/openqa/script/../lib/OpenQA/CacheSer…
Please see more details at:
https://10.160.0.207/tests/4823712
https://10.160.0.207/tests/4821829
Acceptance criteria
- AC1: No jobs incomplete due to corrupted worker cache sqlite database files
Suggestions
- As suggested in #73321#note-7, it makes sense for someone else besides kraih to look into this issue and double-check his assertions
- After that, look into mitigation features, like an automatic cache service reset when this error is encountered (see the sketch below)
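A minimal sketch of what such a mitigation could look like, assuming a hypothetical helper in the cache service that detects the corruption error and wipes the cache database so it can be recreated; names, paths and the regex are illustrative, not the actual openQA implementation:

    # Hypothetical sketch only, not the actual openQA code.
    use strict;
    use warnings;
    use File::Spec;

    sub maybe_reset_cache {
        my ($error, $cache_dir) = @_;    # both parameters are assumed
        return 0 unless defined $error
            && $error =~ /database disk image is malformed|not a database/;

        # Remove the corrupted database and its SQLite sidecar files so the
        # cache service can recreate a fresh, empty cache on the next startup.
        for my $suffix ('', '-wal', '-shm') {
            my $file = File::Spec->catfile($cache_dir, "cache.sqlite$suffix");
            unlink $file if -e $file;
        }
        return 1;    # caller would then restart/reinitialize the cache service
    }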
Updated by dzedro about 4 years ago
Happened only on openqaworker8
Updated by okurz about 4 years ago
- Copied to action #73342: all jobs run on openqaworker8 incomplete: "Cache service status error from API: Minion job .*failed: .*(database disk image is malformed|not a database)":retry added
Updated by okurz about 4 years ago
- Tags set to cache, worker, minion, sqlite
- Category set to Regressions/Crashes
- Priority changed from High to Normal
- Target version set to Ready
I will keep this ticket to improve the handling within openQA itself and handle the worker specifically in #73342
Updated by okurz about 4 years ago
The db file is available in https://w3.nue.suse.com/~okurz/cache.sqlite_poo73321
Updated by okurz about 4 years ago
- Related to action #67000: Job incompletes due to malformed worker cache database disk image with auto_review:"Cache service status error.*(database disk image is malformed|Specified job ID is invalid).*":retry added
Updated by kraih about 4 years ago
Analysis of the corrupted SQLite file is interesting but inconclusive. Two entirely unrelated tables are affected, and there are no hints at what might cause the issue.
*** in database main ***
Page 26 is never used
Page 27 is never used
Page 1048 is never used
Page 1367 is never used
Page 1411 is never used
Page 1445 is never used
Page 1649 is never used
Page 1706 is never used
Page 1783 is never used
Page 1803 is never used
Page 1816 is never used
Page 1846 is never used
Page 1851 is never used
Page 1941 is never used
Page 1970 is never used
Page 1976 is never used
Page 1993 is never used
Page 2032 is never used
Page 2082 is never used
Page 2129 is never used
Page 2140 is never used
Page 2242 is never used
Page 2264 is never used
Page 2294 is never used
Page 2345 is never used
Page 2349 is never used
Page 2405 is never used
Page 2420 is never used
Page 2475 is never used
Page 2513 is never used
Page 2531 is never used
Page 2532 is never used
Page 2576 is never used
Page 2588 is never used
Page 2632 is never used
Page 2677 is never used
Page 2695 is never used
Page 2903 is never used
Page 2916 is never used
Page 2940 is never used
Page 2997 is never used
Page 3117 is never used
Page 3119 is never used
Page 3142 is never used
wrong # of entries in index minion_jobs_state_priority_id
NULL value in downloads.created
row 1 missing from index downloads_created
row 1 missing from index downloads_lock
NULL value in downloads.created
row 2 missing from index downloads_created
row 2 missing from index downloads_lock
NULL value in downloads.created
row 3 missing from index downloads_created
row 3 missing from index downloads_lock
NULL value in downloads.created
row 4 missing from index downloads_created
row 4 missing from index downloads_lock
row 1704 missing from index downloads_created
wrong # of entries in index downloads_created
wrong # of entries in index downloads_lock
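For reference, output like the above can presumably be obtained with SQLite's built-in PRAGMA integrity_check, e.g. via a small DBI script against a copy of the file (a sketch; the file name refers to the copy linked above):

    use strict;
    use warnings;
    use DBI;

    # Open the preserved copy of the corrupted cache database.
    my $dbh = DBI->connect('dbi:SQLite:dbname=cache.sqlite_poo73321', '', '',
        { RaiseError => 1, PrintError => 0 });

    # PRAGMA integrity_check returns a single row "ok" for a healthy database,
    # otherwise one row per detected problem (unused pages, missing index rows, ...).
    my $problems = $dbh->selectcol_arrayref('PRAGMA integrity_check');
    print "$_\n" for @$problems;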
Updated by kraih about 4 years ago
I've tried tracking down the root problem for a few months now without much luck. Adding explicit transactions around write queries seems to have reduced the frequency of the problem occurring (though that could be a coincidence). But every now and then it still happens on a different worker, so hardware issues are also rather unlikely. We've never been able to replicate the problem on a dev machine, and the corrupted SQLite files we've been able to preserve and analyse so far have not shown anything conclusive. It would probably make sense if someone else from the team looked into this issue to double-check my assertions before we look into mitigation features (like an automatic cache service reset when this error is encountered).
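To illustrate what "explicit transactions around write queries" means here, a rough sketch against the downloads table (column names inferred from the integrity check output above; this is not the actual Downloads.pm code):

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:SQLite:dbname=cache.sqlite', '', '',
        { RaiseError => 1, AutoCommit => 1 });

    # Wrap the bookkeeping write in an explicit transaction so a crash or a
    # concurrent writer is less likely to leave the file in an inconsistent state.
    sub add_download {
        my ($lock) = @_;    # hypothetical argument
        eval {
            $dbh->begin_work;
            $dbh->do(q{INSERT INTO downloads ("lock", created) VALUES (?, datetime('now'))},
                undef, $lock);
            $dbh->commit;
        };
        if (my $err = $@) {
            eval { $dbh->rollback };
            die "Couldn't add download: $err";
        }
    }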
Updated by okurz about 4 years ago
- Description updated (diff)
- Status changed from New to Workable
Agreed; I included the corresponding suggestions in the ticket description.
Updated by okurz about 4 years ago
- Status changed from Workable to Rejected
- Assignee set to okurz
Actually, I consider this a duplicate of #67000, where we already have similar or the same suggestions.
Updated by okurz about 4 years ago
- Related to action #78058: [Alerting] Incomplete jobs of last 24h alert - again many incompletes due to corrupted cache, on openqaworker8 added