action #65450
openQA workers on o3 power did not restart after upgrade as NFS mount point was stale "Ignoring host 'http://openqa1-opensuse': Working directory does not exist"
Description
Observation
After an upgrade with zypper dup, the worker instances on power8 refused to start with a confusing error message: "Ignoring host 'http://openqa1-opensuse': Working directory does not exist".
Only after starting the workers manually under strace could I find out what was wrong:
[info] [pid:69150] worker 1:
- config file: /etc/openqa/workers.ini
- worker hostname: power8
- isotovideo version: 0
- websocket API version: 1
- web UI hosts: http://openqa1-opensuse
- class: qemu_ppc64le,qemu_ppc64,qemu_ppc,heavyload
- no cleanup: no
- pool directory: /var/lib/openqa/pool/1
stat("/var/lib/empty/.config/openqa/client.conf", 0x100174904f0) = -1 ENOENT (No such file or directory)
stat("/etc/openqa/client.conf", {st_mode=S_IFREG|0400, st_size=166, ...}) = 0
stat("/var/lib/openqa/pool/1", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/var/lib/openqa/cache/openqa1-opensuse", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
[info] [pid:69150] CACHE: caching is enabled, setting up /var/lib/openqa/cache/openqa1-opensuse
stat("/var/lib/openqa/cache/tmp", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/var/lib/openqa/share", 0x100174904f0) = -1 ESTALE (Stale file handle)
[debug] [pid:69150] Found possible working directory for http://openqa1-opensuse: /var/lib/openqa/share
[error] [pid:69150] Ignoring host 'http://openqa1-opensuse': Working directory does not exist.
+++ exited with 0 +++
After I changed the storage setup on o3, the old mount point, which had vanished on the server side, was still present as a stale NFS mount on power8, which is rebooted less often than the other machines.
I don't know yet what the cache code wants to do with /var/lib/openqa/share, which is only provided for backward compatibility with tests that rely on that path and do not use the cache properly. But it seems we are also checking it. After I unmounted it, the worker started up just fine, so we do not really need it.
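The difference between a missing directory and a stale NFS handle, as seen in the strace output above, can be illustrated with a minimal Python sketch. This is not openQA code (openQA is written in Perl); the function name `describe_dir_problem` and the wording of the returned reasons are assumptions made for illustration only.

```python
import errno
import os
import stat


def describe_dir_problem(path):
    """Return None if path is a usable directory, else a short reason string.

    Hypothetical helper: it only shows how ESTALE can be told apart
    from ENOENT instead of reporting both as "does not exist".
    """
    try:
        st = os.stat(path)
    except OSError as e:
        if e.errno == errno.ESTALE:
            return "stale file handle (the NFS mount likely needs to be remounted)"
        if e.errno == errno.ENOENT:
            return "does not exist"
        # Any other failure: fall back to the system's error description
        return os.strerror(e.errno)
    if not stat.S_ISDIR(st.st_mode):
        return "not a directory"
    return None


if __name__ == "__main__":
    # Paths taken from the log excerpt above
    for candidate in ("/var/lib/openqa/share", "/var/lib/openqa/pool/1"):
        problem = describe_dir_problem(candidate)
        print(f"{candidate}: {problem or 'ok'}")
```

On a host in the state described here, such a check would presumably have reported the stale handle for /var/lib/openqa/share directly, without the need for strace.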
Suggestions
- Check why we even read this path
- Improve error messages to show what is wrong without --verbose
- Potentially improve to not even rely on this path
Updated by okurz over 4 years ago
- Status changed from Workable to Feedback
- Assignee set to okurz
The least I can do is improve the error message: https://github.com/os-autoinst/openQA/pull/2922
Updated by okurz over 4 years ago
- Category changed from Regressions/Crashes to Feature requests
- Status changed from Feedback to Workable
- Assignee deleted (okurz)
OK, I tried and I failed. I struggle to understand the logic right now, so I closed my PR and hope someone else can cover this more easily. What I suggest: every log message of a higher log level should carry complete information and not rely on e.g. debug messages to provide context. Hence we need to output which working directory was not usable for a worker.
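As a rough illustration of that suggestion, the following Python sketch builds an error-level message that names every candidate directory together with the reason it was rejected. It is not the openQA implementation; the function `format_ignored_host`, its parameters, and the exact message wording are hypothetical.

```python
def format_ignored_host(host, failures):
    """Build a self-contained error message for an ignored web UI host.

    failures is a list of (path, reason) pairs, one per working directory
    candidate that was tried and rejected. Hypothetical helper for
    illustration only.
    """
    if not failures:
        return f"Ignoring host '{host}': no working directory candidates were found"
    details = "; ".join(f"{path} ({reason})" for path, reason in failures)
    return f"Ignoring host '{host}': no usable working directory, tried: {details}"


if __name__ == "__main__":
    print(format_ignored_host(
        "http://openqa1-opensuse",
        [("/var/lib/openqa/share", "Stale file handle")],
    ))
```

With a message of this shape the path and the underlying reason appear at error level, so neither debug output nor strace is needed to see which directory was the problem.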
Updated by okurz over 1 year ago
- Related to action #127754: osd nfs-server needed to be restarted but we got no alerts size:M added