Project

General

Profile

Actions

action #73321

closed

all jobs run on openqaworker8 incomplete:"Cache service status error from API: Minion job #46203 failed: Couldn't add download: DBD::SQLite::st execute failed: database disk image is malformed*"

Added by Xiaojing_liu over 3 years ago. Updated over 3 years ago.

Status:
Rejected
Priority:
Normal
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2020-10-14
Due date:
% Done:

0%

Estimated time:

Description

Observation

jobs run on openqaworker8 incomplete, the reason is Cache service status error from API: Minion job #46203 failed: Couldn't add download: DBD::SQLite::st execute failed: database disk image is malformed at /usr/share/openqa/script/../lib/OpenQA/CacheService/Model/Downloads.pm line 34. at /usr/share/openqa/script/../lib/OpenQA/CacheSer…

Please see more details on:
https://10.160.0.207/tests/4823712
https://10.160.0.207/tests/4821829

Acceptance criteria

  • AC1: No jobs incomplete due to corrupted worker cache sqlite database files

Suggestions

  • As suggested in #73321#note-7 it makes sense if someone else besides kraih looks into this issue to double check his assertions
  • After that look into mitigation features (like automatic cache service reset when this error is encountered)

Files

Screenshot_2020-10-14_09-30-37.png (120 KB) Screenshot_2020-10-14_09-30-37.png "All" jobs failed due to this issue dzedro, 2020-10-14 07:55

Related issues 3 (0 open3 closed)

Related to openQA Project - action #67000: Job incompletes due to malformed worker cache database disk image with auto_review:"Cache service status error.*(database disk image is malformed|Specified job ID is invalid).*":retryResolvedmkittler2020-05-18

Actions
Related to openQA Infrastructure - action #78058: [Alerting] Incomplete jobs of last 24h alert - again many incompletes due to corrupted cache, on openqaworker8Resolvedokurz2020-11-162020-11-18

Actions
Copied to openQA Infrastructure - action #73342: all jobs run on openqaworker8 incomplete: "Cache service status error from API: Minion job .*failed: .*(database disk image is malformed|not a database)":retryResolvedokurz2020-10-142020-10-16

Actions
Actions #1

Updated by dzedro over 3 years ago

Happened only on openqaworker8

Actions #2

Updated by okurz over 3 years ago

  • Copied to action #73342: all jobs run on openqaworker8 incomplete: "Cache service status error from API: Minion job .*failed: .*(database disk image is malformed|not a database)":retry added
Actions #3

Updated by okurz over 3 years ago

  • Tags set to cache, worker, minion, sqlite
  • Category set to Regressions/Crashes
  • Priority changed from High to Normal
  • Target version set to Ready

I will keep this ticket to improve the handling within openQA itself and handle the worker specifically in #73342

Actions #4

Updated by okurz over 3 years ago

Actions #5

Updated by okurz over 3 years ago

  • Related to action #67000: Job incompletes due to malformed worker cache database disk image with auto_review:"Cache service status error.*(database disk image is malformed|Specified job ID is invalid).*":retry added
Actions #6

Updated by kraih over 3 years ago

Analysis of the corrupted SQLite file is interesting, but inconclusive. Two entirely unrelated tables are affected and no hints at what might cause the issue.

*** in database main ***
Page 26 is never used
Page 27 is never used
Page 1048 is never used
Page 1367 is never used
Page 1411 is never used
Page 1445 is never used
Page 1649 is never used
Page 1706 is never used
Page 1783 is never used
Page 1803 is never used
Page 1816 is never used
Page 1846 is never used
Page 1851 is never used
Page 1941 is never used
Page 1970 is never used
Page 1976 is never used
Page 1993 is never used
Page 2032 is never used
Page 2082 is never used
Page 2129 is never used
Page 2140 is never used
Page 2242 is never used
Page 2264 is never used
Page 2294 is never used
Page 2345 is never used
Page 2349 is never used
Page 2405 is never used
Page 2420 is never used
Page 2475 is never used
Page 2513 is never used
Page 2531 is never used
Page 2532 is never used
Page 2576 is never used
Page 2588 is never used
Page 2632 is never used
Page 2677 is never used
Page 2695 is never used
Page 2903 is never used
Page 2916 is never used
Page 2940 is never used
Page 2997 is never used
Page 3117 is never used
Page 3119 is never used
Page 3142 is never used
wrong # of entries in index minion_jobs_state_priority_id
NULL value in downloads.created
row 1 missing from index downloads_created
row 1 missing from index downloads_lock
NULL value in downloads.created
row 2 missing from index downloads_created
row 2 missing from index downloads_lock
NULL value in downloads.created
row 3 missing from index downloads_created
row 3 missing from index downloads_lock
NULL value in downloads.created
row 4 missing from index downloads_created
row 4 missing from index downloads_lock
row 1704 missing from index downloads_created
wrong # of entries in index downloads_created
wrong # of entries in index downloads_lock
Actions #7

Updated by kraih over 3 years ago

I've tried tracking down the root problem for a few months now without much luck. Adding explicit transactions around write queries seems to have reduced the frequency of the problem occuring (could be a coincidence though). But every now and then it still happens on a different worker, so hardware issues are also rather unlikely. We've never been able to replicate the problem on a dev machine, and the corrupted SQLite files we've been able to preserve and analyse so far have not shown anything conclusive. It would probably make sense if someone else from the team looks into this issue to double check my assertions before we look into mitigation features (like automatic cache service reset when this error is encountered).

Actions #8

Updated by okurz over 3 years ago

  • Description updated (diff)
  • Status changed from New to Workable

agreed, included the according suggestions in the ticket description

Actions #9

Updated by okurz over 3 years ago

  • Status changed from Workable to Rejected
  • Assignee set to okurz

Actually I consider this a duplicate of #67000 where we already have similar or same suggestions

Actions #10

Updated by okurz over 3 years ago

  • Related to action #78058: [Alerting] Incomplete jobs of last 24h alert - again many incompletes due to corrupted cache, on openqaworker8 added
Actions

Also available in: Atom PDF