action #65450

workers on o3 power did not restart after upgrade as NFS mount point was stale "Ignoring host 'http://openqa1-opensuse': Working directory does not exist"

Added by okurz about 1 year ago. Updated 12 months ago.

Status: Workable
Priority: Low
Assignee: -
Category: Feature requests
Target version:
Start date: 2020-04-08
Due date:
% Done: 0%
Estimated time:
Difficulty:

Description

Observation

After an upgrade with zypper dup, the worker instances on power8 refused to start with a confusing error message: "Ignoring host 'http://openqa1-opensuse': Working directory does not exist".

Only after starting a worker manually under strace could I find out what was wrong:

[info] [pid:69150] worker 1:
 - config file:           /etc/openqa/workers.ini
 - worker hostname:       power8
 - isotovideo version:    0
 - websocket API version: 1
 - web UI hosts:          http://openqa1-opensuse
 - class:                 qemu_ppc64le,qemu_ppc64,qemu_ppc,heavyload
 - no cleanup:            no
 - pool directory:        /var/lib/openqa/pool/1
stat("/var/lib/empty/.config/openqa/client.conf", 0x100174904f0) = -1 ENOENT (No such file or directory)
stat("/etc/openqa/client.conf", {st_mode=S_IFREG|0400, st_size=166, ...}) = 0
stat("/var/lib/openqa/pool/1", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/var/lib/openqa/cache/openqa1-opensuse", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
[info] [pid:69150] CACHE: caching is enabled, setting up /var/lib/openqa/cache/openqa1-opensuse
stat("/var/lib/openqa/cache/tmp", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/var/lib/openqa/share", 0x100174904f0) = -1 ESTALE (Stale file handle)
[debug] [pid:69150] Found possible working directory for http://openqa1-opensuse: /var/lib/openqa/share
[error] [pid:69150] Ignoring host 'http://openqa1-opensuse': Working directory does not exist.
+++ exited with 0 +++

After I changed the storage setup on o3, the old mount point, which had vanished, was still present as a stale NFS mount on power8, which is rebooted less often than the other machines.

I don't know yet what the cache code wants to do with /var/lib/openqa/share, which is only provided for backward compatibility with tests that rely on that path and do not use the cache properly. But it seems we are checking it anyway. After I unmounted it, the worker started up just fine, so we do not really need it.
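To illustrate the difference the strace output shows, here is a minimal sketch (in Python, not the actual openQA Perl code) of a check that keeps the errno instead of collapsing everything into "does not exist". The function name probe_directory and its return strings are made up for illustration; the stat_fn parameter exists only so the stale case can be simulated without a real NFS mount.

```python
import errno
import os
import stat


def probe_directory(path, stat_fn=os.stat):
    """Classify a candidate working directory (hypothetical helper).

    Returns a short reason string rather than a plain true/false, so a
    stale NFS handle is not reported as a missing directory.
    """
    try:
        info = stat_fn(path)
    except OSError as err:
        if err.errno == errno.ESTALE:
            # What strace showed for /var/lib/openqa/share on power8
            return "stale"
        if err.errno == errno.ENOENT:
            return "missing"
        # Any other failure: keep the system's own description
        return "error: " + (os.strerror(err.errno) if err.errno else str(err))
    return "ok" if stat.S_ISDIR(info.st_mode) else "not a directory"
```

With something like this, the worker could tell "stale" and "missing" apart before deciding to ignore a host.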

Suggestions

  • Check why we even read this path
  • Improve error messages to show without --verbose what is wrong
  • Potentially improve to not even rely on this path

History

#1 Updated by okurz about 1 year ago

  • Status changed from Workable to Feedback
  • Assignee set to okurz

The least I can do is improve the error message: https://github.com/os-autoinst/openQA/pull/2922

#2 Updated by okurz 12 months ago

  • Category changed from Concrete Bugs to Feature requests
  • Status changed from Feedback to Workable
  • Assignee deleted (okurz)

OK, I tried and I failed. I struggle to understand the logic right now, so I closed my PR and hope someone else can cover this more easily. What I suggest: every log message at a higher log level should carry complete information and not rely on e.g. debug messages to provide context. Hence we need to output which working directory was not usable for a worker.
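A rough sketch of what such a self-contained message could look like, again in Python rather than the worker's Perl and with a made-up function name (find_working_dir) and made-up message wording: the error line itself names every directory that was checked and the system's reason for rejecting it, so no debug output is needed for context.

```python
import os


def find_working_dir(host, candidates, stat_fn=os.stat):
    """Return (path, None) for the first usable directory, else (None, message).

    Hypothetical helper: the message text is an illustration, not what
    openQA actually logs.
    """
    rejected = []
    for path in candidates:
        try:
            stat_fn(path)
        except OSError as err:
            # err.strerror carries e.g. "Stale file handle"
            rejected.append(f"{path} ({err.strerror or err})")
            continue
        return path, None
    message = (
        f"Ignoring host '{host}': no usable working directory, checked: "
        + ", ".join(rejected)
    )
    return None, message
```

For the case from this ticket, the message would then contain both /var/lib/openqa/share and "Stale file handle" at error level.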
