action #78163

After OSD upgrade, many jobs incomplete with "Cache service status error 500: Internal Server Error"

Added by Xiaojing_liu 2 months ago. Updated 2 months ago.

Status: Closed
Priority: High
Assignee: -
Category: Concrete Bugs
Target version:
Start date: 2020-11-18
Due date:
% Done: 0%
Estimated time:
Difficulty:

Description

Observation

An example:
https://openqa.suse.de/tests/5020263

The worker-log.txt only shows:

[2020-11-18T06:56:31.0997 CET] [debug] [pid:32153] REST-API call: POST http://openqa.suse.de/api/v1/jobs/5019106/status
[2020-11-18T06:56:32.0037 CET] [error] [pid:32153] Unable to setup job 5019106: Cache service status error 500: Internal Server Error
[2020-11-18T06:56:32.0037 CET] [debug] [pid:32153] Stopping job 5019106 from openqa.suse.de: 05019106-sle-15-SP3-Online-x86_64-Build81.1-xfstests_btrfs-generic-001-100@64bit-smp - reason: setup failure

Impact

Many incomplete jobs, at least on OSD; see #78165


Related issues

Is duplicate of openQA Project - action #67000: Job incompletes due to malformed worker cache database disk image with auto_review:"Cache service status error.*(database disk image is malformed|Specified job ID is invalid).*":retry (Resolved, 2020-05-18)

Copied to openQA Infrastructure - action #78165: infrastructure task: After osd deployment 2020-11-18 many jobs incomplete with auto_review:"Cache service (status error from API|.*error 500: Internal Server Error)":retry (Resolved, 2020-11-18)

History

#1 Updated by Xiaojing_liu 2 months ago

  • Related to action #71827: test incompletes with auto_review:"(?s)Failed to download.*Asset was pruned immediately after download":retry because worker cache prunes the asset it just downloaded added

#2 Updated by Xiaojing_liu 2 months ago

Some jobs are incomplete with "Cache service status error from API: Minion job #1290 failed: Job terminated unexpectedly (exit code: 11, signal: 0)" or "Cache service status error from API: Minion job #1287 failed: Couldn't add download: DBD::SQLite::st execute failed: database disk image is malformed at /usr/share/openqa/script/../lib/OpenQA/CacheService/Model/Downloads.pm line 34. at /usr/share/openqa/script/../lib/OpenQA/CacheServ…".
Examples: https://openqa.suse.de/tests/5021465 https://openqa.suse.de/tests/5021526
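
To see which of the auto_review expressions quoted under "Related issues" would pick up which of these reasons, here is a minimal Python sketch. It assumes the expressions are applied as unanchored regular-expression searches against the reason text; that matching mode, the names `PATTERNS`, `REASONS` and `matching_tickets`, and the shortened third reason are illustrative assumptions, not taken from openQA itself.

```python
import re

# auto_review expressions copied from the subjects of #67000 and #78165.
PATTERNS = {
    "#67000": r"Cache service status error.*(database disk image is malformed|Specified job ID is invalid).*",
    "#78165": r"Cache service (status error from API|.*error 500: Internal Server Error)",
}

# Reasons quoted in this ticket. The third one is cut off right after
# SQLite's own message; the full text continues with file paths.
REASONS = [
    "Cache service status error 500: Internal Server Error",
    "Cache service status error from API: Minion job #1290 failed: "
    "Job terminated unexpectedly (exit code: 11, signal: 0)",
    "Cache service status error from API: Minion job #1287 failed: "
    "Couldn't add download: DBD::SQLite::st execute failed: "
    "database disk image is malformed",
]


def matching_tickets(reason):
    """Return the tickets whose pattern is found anywhere in the reason (assumed search semantics)."""
    return [ticket for ticket, pattern in PATTERNS.items() if re.search(pattern, reason)]


if __name__ == "__main__":
    for reason in REASONS:
        print(matching_tickets(reason), "<-", reason[:60])
```

Under that assumption, only the "database disk image is malformed" variant is covered by the #67000 expression, while all three reasons are covered by the broader #78165 expression.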

#3 Updated by Xiaojing_liu 2 months ago

I tried to restart openqa-worker-cacheservice-minion.service on openqaworker6, but the service failed to come up.
The log shows:

Nov 18 08:24:36 openqaworker6 systemd[1]: openqa-worker-cacheservice-minion.service: Service RestartSec=100ms expired, scheduling restart.
Nov 18 08:24:36 openqaworker6 systemd[1]: Stopped OpenQA Worker Cache Service Minion.
Nov 18 08:24:36 openqaworker6 systemd[1]: Started OpenQA Worker Cache Service Minion.
Nov 18 08:24:36 openqaworker6 worker[31139]: [debug] [pid:31139] Updating status so job 5015093 is not considered dead.
Nov 18 08:24:36 openqaworker6 worker[31139]: [debug] [pid:31139] REST-API call: POST http://openqa.suse.de/api/v1/jobs/5015093/status
Nov 18 08:24:36 openqaworker6 openqa-worker-cacheservice-minion[768]: [768] [i] [5eGOhfGq] Cache size of "/var/lib/openqa/cache" is 49GiB, with limit 50GiB
Nov 18 08:24:36 openqaworker6 openqa-worker-cacheservice-minion[768]: [768] [i] Resetting all leftover locks after restart
Nov 18 08:24:36 openqaworker6 openqa-worker-cacheservice-minion[768]: [768] [i] Worker 768 started
Nov 18 08:24:36 openqaworker6 openqa-worker-cacheservice-minion[768]: DBD::SQLite::st execute failed: database disk image is malformed at /usr/lib/perl5/vend>
Nov 18 08:24:36 openqaworker6 openqa-worker-cacheservice-minion[768]:  at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 301.
Nov 18 08:24:36 openqaworker6 openqa-worker-cacheservice-minion[768]:  at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Command/minion/worker.pm line 26.
Nov 18 08:24:36 openqaworker6 systemd[1]: openqa-worker-cacheservice-minion.service: Main process exited, code=exited, status=255/n/a
Nov 18 08:24:36 openqaworker6 systemd[1]: openqa-worker-cacheservice-minion.service: Unit entered failed state.
Nov 18 08:24:36 openqaworker6 systemd[1]: openqa-worker-cacheservice-minion.service: Failed with result 'exit-code'.
Nov 18 08:24:37 openqaworker6 systemd[1]: openqa-worker-cacheservice-minion.service: Service RestartSec=100ms expired, scheduling restart.
Nov 18 08:24:37 openqaworker6 systemd[1]: Stopped OpenQA Worker Cache Service Minion.
Nov 18 08:24:37 openqaworker6 systemd[1]: openqa-worker-cacheservice-minion.service: Start request repeated too quickly.
Nov 18 08:24:37 openqaworker6 systemd[1]: Failed to start OpenQA Worker Cache Service Minion.
Nov 18 08:24:37 openqaworker6 systemd[1]: openqa-worker-cacheservice-minion.service: Unit entered failed state.
Nov 18 08:24:37 openqaworker6 systemd[1]: openqa-worker-cacheservice-minion.service: Failed with result 'exit-code'.
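
The "database disk image is malformed" message in the log above comes from SQLite. As a rough sketch of how such a file could be inspected before restarting the service again, the following Python snippet runs SQLite's `PRAGMA integrity_check` against a database opened read-only. The function name `integrity_problems` is hypothetical, and the path to the cache service database is left as a parameter because this ticket does not state it.

```python
import sqlite3
from pathlib import Path


def integrity_problems(db_path):
    """Return the problems SQLite reports for db_path; an empty list means the check passed."""
    # Open read-only through a URI so that a missing file is reported
    # as an error instead of being silently created.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    try:
        conn = sqlite3.connect(uri, uri=True)
        try:
            rows = conn.execute("PRAGMA integrity_check").fetchall()
        finally:
            conn.close()
    except sqlite3.DatabaseError as err:
        # Badly damaged or unreadable files fail before any rows come back.
        return [str(err)]
    messages = [row[0] for row in rows]
    # A healthy database yields exactly one row containing "ok".
    return [] if messages == ["ok"] else messages


if __name__ == "__main__":
    import sys

    for path in sys.argv[1:]:
        print(path, integrity_problems(path) or "ok")
```

The check should ideally run while the cache service is stopped, so that it does not compete with a writer for the database.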

#4 Updated by okurz 2 months ago

  • Copied to action #78165: infrastructure task: After osd deployment 2020-11-18 many jobs incomplete with auto_review:"Cache service (status error from API|.*error 500: Internal Server Error)":retry added

#5 Updated by okurz 2 months ago

  • Category set to Concrete Bugs
  • Priority changed from Urgent to High
  • Target version set to Ready

I have created task #78165 specifically for the OSD deployment and for handling workarounds. This ticket should focus on investigating the "openQA-code-only" aspects, i.e. any problems in the openQA project itself that we should fix upstream rather than only in the OSD deployment.

#6 Updated by okurz 2 months ago

  • Subject changed from After OSD upgrade, many jobs incomplete with auto_review:"Cache service status error 500: Internal Server Error" to After OSD upgrade, many jobs incomplete with "Cache service status error 500: Internal Server Error"
  • Description updated (diff)

I have removed the "auto_review" keyword because I would like to handle that in the infrastructure-specific task #78165 and keep this ticket for the upstream investigation.

#7 Updated by mkittler 2 months ago

  • Is duplicate of action #67000: Job incompletes due to malformed worker cache database disk image with auto_review:"Cache service status error.*(database disk image is malformed|Specified job ID is invalid).*":retry added

#8 Updated by mkittler 2 months ago

  • Status changed from New to Closed

Since #78165 seems to be the "infrastructure ticket", I would close this one as a duplicate of #67000. It doesn't make sense to have the comments about this problem in two different places. This isn't really OSD-specific: we've already seen the exact same issue (a corrupted SQLite database) after updating the o3 workers.

#9 Updated by mkittler 2 months ago

  • Related to deleted (action #71827: test incompletes with auto_review:"(?s)Failed to download.*Asset was pruned immediately after download":retry because worker cache prunes the asset it just downloaded)

#10 Updated by mkittler 2 months ago

I also removed the relation to #71827. Let's not lump everything related to the cache service together.
