action #56819

worker cacheservice on *arm* does not seem to be reboot safe (race condition with nvme prepare?)

Added by okurz 10 months ago. Updated 9 months ago.

Status: Resolved
Priority: Normal
Assignee:
Target version:
Start date: 2019-09-11
Due date:
% Done: 0%
Estimated time:
Duration:

Description

Observation

From openqaworker-arm-2 after reboot:

Sep 11 20:12:16 openqaworker-arm-2 openqa-workercache[8962]: [INFO] OpenQA::Worker::Cache: loading database from /var/lib/openqa/cache/cache.sqlite
Sep 11 20:12:16 openqaworker-arm-2 openqa-workercache[8962]: [INFO] Creating cache directory tree for /var/lib/openqa/cache
Sep 11 20:12:16 openqaworker-arm-2 openqa-workercache[8962]: mkdir /var/lib/openqa/cache/tmp: Permission denied at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/File.pm line 85.
Sep 11 20:12:16 openqaworker-arm-2 openqa-workercache[8962]: Compilation failed in require at /usr/share/openqa/script/openqa-workercache line 26.
Sep 11 20:12:16 openqaworker-arm-2 openqa-workercache[8962]:         (in cleanup) DBI connect('dbname=/var/lib/openqa/cache/cache.sqlite','',...) failed: unable to open database file at /usr/lib/perl5/vendor_perl/5.26.1/Mojo>
Sep 11 20:12:16 openqaworker-arm-2 systemd[1]: openqa-worker-cacheservice.service: Main process exited, code=exited, status=13/n/a
Sep 11 20:12:16 openqaworker-arm-2 systemd[1]: openqa-worker-cacheservice.service: Unit entered failed state.
Sep 11 20:12:16 openqaworker-arm-2 systemd[1]: openqa-worker-cacheservice.service: Failed with result 'exit-code'.
Sep 11 20:12:16 openqaworker-arm-2 systemd[1]: openqa-worker-cacheservice.service: Service RestartSec=100ms expired, scheduling restart.
Sep 11 20:12:16 openqaworker-arm-2 systemd[1]: Stopped OpenQA Worker Cache Service.
Sep 11 20:12:16 openqaworker-arm-2 systemd[1]: openqa-worker-cacheservice.service: Start request repeated too quickly.
Sep 11 20:12:16 openqaworker-arm-2 systemd[1]: Failed to start OpenQA Worker Cache Service.
Sep 11 20:12:16 openqaworker-arm-2 systemd[1]: openqa-worker-cacheservice.service: Unit entered failed state.
Sep 11 20:12:16 openqaworker-arm-2 systemd[1]: openqa-worker-cacheservice.service: Failed with result 'exit-code'.

A simple `systemctl restart openqa-worker-cacheservice` helped:

Sep 11 20:16:34 openqaworker-arm-2 systemd[1]: Started OpenQA Worker Cache Service.
Sep 11 20:16:35 openqaworker-arm-2 openqa-workercache[9940]: [INFO] OpenQA::Worker::Cache: loading database from /var/lib/openqa/cache/cache.sqlite
Sep 11 20:16:35 openqaworker-arm-2 openqa-workercache[9940]: [INFO] Creating cache directory tree for /var/lib/openqa/cache
Sep 11 20:16:35 openqaworker-arm-2 openqa-workercache[9940]: [DEBUG] CACHE: Health: Real size: 0, Configured limit: 53687091200
Sep 11 20:16:35 openqaworker-arm-2 openqa-workercache[9940]: [INFO] OpenQA::Worker::Cache: Initialized with localhost at /var/lib/openqa/cache, current size is 0
Sep 11 20:16:35 openqaworker-arm-2 openqa-workercache[9940]: [9940] [i] Listening at "http://127.0.0.1:7844"
Sep 11 20:16:35 openqaworker-arm-2 openqa-workercache[9940]: [9940] [i] Listening at "http://[::1]:7844"

Suggestion

The salt states repo already contains "openqa/nvme_store/openqa-worker@_override.conf", which adds a wait for the nvme prepare service, but there is no corresponding override for cache-service or cache-service-minion.
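
A minimal sketch of what such an override could look like for the cache service, assuming the race really is with the nvme prepare step. The drop-in path and the unit name `openqa_nvme_prepare.service` are assumptions for illustration and need to be checked against the salt states repo. `After=` only orders startup; `Requires=` additionally makes the cache service depend on the prepare unit having been started.

```ini
# Hypothetical drop-in, e.g.
# /etc/systemd/system/openqa-worker-cacheservice.service.d/override.conf
[Unit]
# Assumed name of the nvme prepare unit; replace with the real unit name.
Requires=openqa_nvme_prepare.service
# Start the cache service only after the prepare unit has finished starting.
After=openqa_nvme_prepare.service
```

After placing a drop-in like this, `systemctl daemon-reload` is needed for systemd to pick it up. An equivalent drop-in would presumably be needed for the cache-service-minion unit as well.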

History

#1 Updated by okurz 9 months ago

  • Status changed from New to Resolved
  • Assignee set to okurz
  • Target version set to Done
