action #73342

closed

all jobs run on openqaworker8 incomplete: "Cache service status error from API: Minion job .*failed: .*(database disk image is malformed|not a database)":retry

Added by okurz over 3 years ago. Updated over 3 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version:
Start date: 2020-10-14
Due date: 2020-10-16
% Done: 0%
Estimated time:

Description

Jobs run on openqaworker8 end up incomplete. The reason given is: Cache service status error from API: Minion job #46203 failed: Couldn't add download: DBD::SQLite::st execute failed: database disk image is malformed at /usr/share/openqa/script/../lib/OpenQA/CacheService/Model/Downloads.pm line 34. at /usr/share/openqa/script/../lib/OpenQA/CacheSer…

For more details, see:
https://10.160.0.207/tests/4823712
https://10.160.0.207/tests/4821829
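
The "database disk image is malformed" text in the reason above is an error message from SQLite. As a minimal sketch of how one could check a worker cache database file for this kind of corruption, the following uses SQLite's `PRAGMA integrity_check`. The helper name `sqlite_file_is_healthy` is hypothetical and not part of openQA, and the path of the cache database has to be supplied by the caller.

```python
import pathlib
import sqlite3


def sqlite_file_is_healthy(path):
    """Return True if SQLite reports the file at `path` as an intact database."""
    # Build a read-only URI so the check can neither create nor modify the file.
    uri = pathlib.Path(path).resolve().as_uri() + "?mode=ro"
    try:
        conn = sqlite3.connect(uri, uri=True)
    except sqlite3.Error:
        # For example: the file does not exist or cannot be opened.
        return False
    try:
        rows = conn.execute("PRAGMA integrity_check").fetchall()
    except sqlite3.DatabaseError:
        # Raised for errors such as "file is not a database" or
        # "database disk image is malformed".
        return False
    finally:
        conn.close()
    # An intact database yields exactly one row containing "ok".
    return rows == [("ok",)]
```

A `False` result only says that SQLite could not validate the file. It does not explain how the corruption happened.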


Related issues 4 (0 open, 4 closed)

Related to openQA Project - action #67000: Job incompletes due to malformed worker cache database disk image with auto_review:"Cache service status error.*(database disk image is malformed|Specified job ID is invalid).*":retry | Resolved | mkittler | 2020-05-18

Related to openQA Infrastructure - action #78058: [Alerting] Incomplete jobs of last 24h alert - again many incompletes due to corrupted cache, on openqaworker8 | Resolved | okurz | 2020-11-16 | 2020-11-18

Copied from openQA Project - action #73321: all jobs run on openqaworker8 incomplete:"Cache service status error from API: Minion job #46203 failed: Couldn't add download: DBD::SQLite::st execute failed: database disk image is malformed*" | Rejected | okurz | 2020-10-14

Copied to openQA Infrastructure - action #75220: all jobs run on openqaworker8 incomplete: "Cache service status error from API: Minion job .*failed: .*(database disk image is malformed|not a database)":retry | Resolved | okurz

Actions #1

Updated by okurz over 3 years ago

  • Copied from action #73321: all jobs run on openqaworker8 incomplete:"Cache service status error from API: Minion job #46203 failed: Couldn't add download: DBD::SQLite::st execute failed: database disk image is malformed*" added
Actions #2

Updated by okurz over 3 years ago

I copied the SQLite database file so that it can be investigated further, see #73321 for this. To stop this worker from producing any further problematic jobs I ran:

systemctl stop openqa-worker* salt-minion telegraf

I see that the worker has not been rebooted for a long time, although I expected that to happen automatically. I will check package versions and reboot, which should clean out the pool and cache automatically anyway and fix the reported issue.

EDIT: I found out that many workers have not rebooted automatically for 31 days. Manually triggering the "auto-update" service on openqaworker8 and looking into the corresponding journal shows that an "interactive" patch is pending, which seems to be a kernel upgrade. So I am now trying with the flag "--non-interactive-include-reboot-patches": systemctl daemon-reload; systemctl start auto-update; journalctl -f -u auto-update. This triggered the kernel upgrade and asked rebootmgr to reboot during the next maintenance window. For the reasons above I triggered a reboot right away.
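
To spot workers that have silently stopped rebooting, as observed above, one could compare the uptime against a threshold. The following is only a sketch: it assumes a Linux host where `/proc/uptime` starts with the uptime in seconds, and the function names and the 14-day default are hypothetical choices, not values taken from the openQA infrastructure.

```python
SECONDS_PER_DAY = 86400


def uptime_days(proc_uptime_text):
    """Convert the content of /proc/uptime into days of uptime."""
    # The first field of /proc/uptime is the uptime in seconds.
    seconds = float(proc_uptime_text.split()[0])
    return seconds / SECONDS_PER_DAY


def reboot_overdue(proc_uptime_text, max_days=14):
    """Return True if the host has been up longer than `max_days` (assumed threshold)."""
    return uptime_days(proc_uptime_text) > max_days
```

On a worker this could be fed with the content of `/proc/uptime`, for example `reboot_overdue(open("/proc/uptime").read())`.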

Actions #3

Updated by okurz over 3 years ago

  • Due date set to 2020-10-16
  • Status changed from In Progress to Feedback

The machine has been rebooted, has a clean cache database and is working on jobs fine for now. I will keep monitoring it over the next days.

Actions #4

Updated by okurz over 3 years ago

  • Related to action #67000: Job incompletes due to malformed worker cache database disk image with auto_review:"Cache service status error.*(database disk image is malformed|Specified job ID is invalid).*":retry added
Actions #5

Updated by okurz over 3 years ago

  • Subject changed from all jobs run on openqaworker8 incomplete:"Cache service status error from API: Minion job #46203 failed: Couldn't add download: DBD::SQLite::st execute failed: database disk image is malformed*" to all jobs run on openqaworker8 incomplete: auto_review:"Cache service status error from API: Minion job .*failed: .*database disk image is malformed*":retry
Actions #6

Updated by okurz over 3 years ago

  • Subject changed from all jobs run on openqaworker8 incomplete: auto_review:"Cache service status error from API: Minion job .*failed: .*database disk image is malformed*":retry to all jobs run on openqaworker8 incomplete: auto_review:"Cache service status error from API: Minion job .*failed: .*(database disk image is malformed|not a database)":retry
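
The subject change above widens the expression so that it also covers "not a database" failures. As an illustration of what the new expression matches, the sketch below extracts the quoted regex and the optional `:retry` suffix from a subject and applies the regex to a job reason. How the actual auto_review tooling parses subjects is an assumption here; `AUTO_REVIEW` and `match_subject` are hypothetical names.

```python
import re

# Hypothetical parser: the quoted regex after "auto_review:" plus an optional ":retry".
AUTO_REVIEW = re.compile(r'auto_review:"(?P<pattern>[^"]+)"(?P<retry>:retry)?')


def match_subject(subject, reason):
    """Return (matched, retry) for a job `reason` checked against a ticket `subject`."""
    found = AUTO_REVIEW.search(subject)
    if found is None:
        return (False, False)
    matched = re.search(found.group("pattern"), reason) is not None
    # Only report a retry when the reason matched and the suffix is present.
    return (matched, matched and found.group("retry") is not None)


# Subject as set in this update.
subject = (
    'all jobs run on openqaworker8 incomplete: auto_review:"Cache service status '
    "error from API: Minion job .*failed: .*(database disk image is malformed|"
    'not a database)":retry'
)
```

With this subject, a reason containing "database disk image is malformed" and one containing "not a database" both match, while an unrelated reason does not.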
Actions #7

Updated by okurz over 3 years ago

  • Priority changed from Urgent to Normal

I have monitored all openQA worker instances since the reboot, when the cache database was re-initialized, and so far I have seen a lot of passed jobs on all worker instances. So it seems that the problem of the corrupt cache database is resolved, as is the problem that the machine(s) did not apply "interactive" patches: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/380

EDIT: It seems the alerts for "incompletes" are also related to this, so I paused the alerts for https://stats.openqa-monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&panelId=17&fullscreen&edit&tab=alert and https://stats.openqa-monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&panelId=14&fullscreen&edit&tab=alert until the situation is repaired.

Actions #8

Updated by okurz over 3 years ago

  • Status changed from Feedback to Resolved

Merged, alerts reactivated, done.

Actions #9

Updated by okurz over 3 years ago

  • Subject changed from all jobs run on openqaworker8 incomplete: auto_review:"Cache service status error from API: Minion job .*failed: .*(database disk image is malformed|not a database)":retry to all jobs run on openqaworker8 incomplete: "Cache service status error from API: Minion job .*failed: .*(database disk image is malformed|not a database)":retry
Actions #10

Updated by okurz over 3 years ago

  • Copied to action #75220: all jobs run on openqaworker8 incomplete: "Cache service status error from API: Minion job .*failed: .*(database disk image is malformed|not a database)":retry added
Actions #11

Updated by okurz over 3 years ago

  • Related to action #78058: [Alerting] Incomplete jobs of last 24h alert - again many incompletes due to corrupted cache, on openqaworker8 added