action #131309
closed[alert] NFS mount can fail due to hostname resolution error size:M
Description
Observations
martchus@worker11:~> sudo journalctl --since '2 days ago' -u var-lib-openqa-share.mount
Jun 18 03:30:00 worker11 systemd[1]: Unmounting /var/lib/openqa/share...
Jun 18 03:30:00 worker11 systemd[1]: var-lib-openqa-share.mount: Deactivated successfully.
Jun 18 03:30:00 worker11 systemd[1]: Unmounted /var/lib/openqa/share.
-- Boot e08a8d421816483eb387db9aac85fdc4 --
Jun 18 03:36:02 worker11 systemd[1]: Mounting /var/lib/openqa/share...
Jun 18 03:36:02 worker11 mount[3411]: mount.nfs: Failed to resolve server openqa.suse.de: Name or service not known
Jun 18 03:36:02 worker11 systemd[1]: var-lib-openqa-share.mount: Mount process exited, code=exited, status=32/n/a
Jun 18 03:36:02 worker11 systemd[1]: var-lib-openqa-share.mount: Failed with result 'exit-code'.
Jun 18 03:36:02 worker11 systemd[1]: Failed to mount /var/lib/openqa/share.
Jun 18 03:37:03 worker11 systemd[1]: Mounting /var/lib/openqa/share...
Jun 18 03:37:05 worker11 mount[3803]: Created symlink /run/systemd/system/remote-fs.target.wants/rpc-statd.service → /usr/lib/systemd/system/rpc-statd.service.
Jun 18 03:37:06 worker11 systemd[1]: Mounted /var/lib/openqa/share.
The timestamps show that worker11 rebooted during the planned weekly reboot window; OSD was likely also rebooting at that time and thus not immediately available again.
Suggestions
- worker11 is one of our two staging test machines. Cross-check that it actually has the latest salt state: power on both machines, bring them back as salt minions connected to OSD and ensure the salt high state is applied
- Cross-check the mount options we have configured in salt and locally in /etc/fstab; they should include some custom tweaking like a systemd mount timeout, retry, etc. (see the illustrative fstab line after this list)
- Ensure that machines retry sufficiently often to mount from OSD to cover periods when OSD itself is not up, e.g. as happened above in the usual weekly maintenance reboot window
- This could be tested by mounting a share from one machine and keeping the server off for some minutes while the client reboots, e.g.: power on worker11+12, create an NFS server on w11, mount it from w12, power off w11, reboot w12, wait and monitor how w12 behaves, then power on w11 and ensure that w12 eventually mounts and reaches the final systemd target without any failed services
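As an illustration of the kind of tweaking meant above, an fstab entry could combine nofail (boot continues even if the mount fails), retry= (mount.nfs itself keeps retrying for the given number of minutes) and x-systemd.mount-timeout= (bounds how long systemd waits for the mount process). The concrete values and options here are an assumption for illustration, not the actual configuration from salt:

# illustrative only - the actual options live in the salt worker.sls linked below
openqa.suse.de:/var/lib/openqa/share /var/lib/openqa/share nfs4 ro,nofail,retry=30,x-systemd.mount-timeout=35m 0 0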
Updated by nicksinger over 1 year ago
- Status changed from Workable to In Progress
The options defined in https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/worker.sls#L86 are present on both workers. To avoid scheduling tests I refrain for now from adding them back into salt, as it wouldn't bring any updates in that regard.
Going to check the last suggestion now: cross-mount between the workers and reboot them to see how it behaves.
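A rough sketch of how that experiment could be set up (package name, paths and export options are illustrative assumptions, not taken from the actual machines):

# on worker11, acting as NFS server
sudo zypper --non-interactive install nfs-kernel-server
sudo mkdir -p /srv/nfs-test
echo '/srv/nfs-test *(ro,no_subtree_check)' | sudo tee -a /etc/exports
sudo systemctl enable --now nfs-server

# on worker12, the client; nofail avoids hanging the boot while worker11 is off
echo 'worker11:/srv/nfs-test /mnt/nfs-test nfs ro,nofail,x-systemd.mount-timeout=10m 0 0' | sudo tee -a /etc/fstab
sudo systemctl daemon-reload && sudo mount /mnt/nfs-test

# then power off worker11, reboot worker12 and observe:
journalctl -u 'mnt-nfs\x2dtest.mount' -f   # systemd escapes '-' in unit names
systemctl --failed                         # should end up empty once worker11 is back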
Updated by nicksinger over 1 year ago
I kept worker12 shut off while rebooting worker11. It behaved as we expected and did not fail. After starting worker12 I noticed I had forgotten to enable the nfs-server, so accessing the share timed out, but it recovered gracefully once I enabled the nfs-server process on worker12.
"Name or service not known" gives me the impression this has to do with DNS and a race with the network connection on boot. Checking if there are additional options or dependencies we can introduce.
Updated by okurz over 1 year ago
Sounds very similar to the case we had where the openQA cache service failed to resolve DNS after bootup but worked fine when restarted. According to mkittler, though, we resolved that problem by using the IP address for localhost instead, as we did not need a remote host connection there. Maybe something with glibc, getent, whatever, restarting nscd or simply retrying helps here?
Slightly related as we discussed: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/924 and https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/925
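To tell glibc/NSS issues apart from actual DNS server issues, standard tools can exercise the two resolver paths separately (the hostname is the one from the log above, the commands are stock):

getent hosts openqa.suse.de   # resolves via glibc/NSS, the same path mount.nfs uses
dig +short openqa.suse.de     # queries the DNS server directly, bypassing NSS
sudo systemctl restart nscd   # restart the name service cache daemon to flush its cache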
Updated by openqa_review over 1 year ago
- Due date set to 2023-08-11
Setting due date based on mean cycle time of SUSE QE Tools
Updated by nicksinger over 1 year ago
While debugging in poo# I saw the following line from systemd: [104432.144929][ T1512] systemd-fstab-generator[1512]: x-systemd.device-timeout ignored for openqa.suse.de:/var/lib/openqa/share. Maybe this is related?
Updated by okurz over 1 year ago
Good point. Maybe we never wrote the correct name for the setting, or we use something that is no longer supported, or not yet supported.
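For reference, systemd.mount(5) documents both options: x-systemd.device-timeout= controls how long systemd waits for a backing device unit to appear, and since a network source like openqa.suse.de:/... has no device unit, systemd-fstab-generator ignores it, which matches the warning above. x-systemd.mount-timeout= limits the mount process itself and should apply here, e.g. (illustrative line, not the actual config):

# x-systemd.mount-timeout applies to the mount process, unlike device-timeout
openqa.suse.de:/var/lib/openqa/share /var/lib/openqa/share nfs4 ro,nofail,x-systemd.mount-timeout=10m 0 0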
Updated by nicksinger over 1 year ago
- Status changed from In Progress to Feedback
Couldn't find any conclusive results on why this option isn't supported, but it shouldn't matter because we retry several times anyway.
Discovered an answer in https://suse.slack.com/archives/C02D92APKNU/p1690458456152179 and came up with a workaround to dynamically populate /etc/hosts based on DNS via salt: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/931
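Not the content of the linked MR, but a minimal sketch of how such a salt state could look, assuming a host.present state and the dnsutil.A execution module (state ID and approach are illustrative): the IP is resolved while DNS works and pinned in /etc/hosts, so boot-time mounts no longer depend on DNS.

# hypothetical sketch, not the actual MR content
openqa_hosts_entry:
  host.present:
    - name: openqa.suse.de
    - ip: {{ salt['dnsutil.A']('openqa.suse.de') | first }}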
Given the several options we already tried, I think this is the best we can do ATM.
Updated by nicksinger over 1 year ago
- Status changed from Feedback to Resolved
There was another fixup needed which was done in https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/932
Multiple reboots showed no problems and we should be good again.
Updated by okurz about 1 year ago
- Copied to action #137114: openQA workers fail to register after bootup due to unable to resolve openqa.suse.de but manage to do so immediately when restarting worker services added
Updated by okurz 9 months ago
- Related to action #158041: grenache needs upgrade to 15.5 added
Updated by okurz 6 months ago
- Related to action #163097: Share mount not working on openqaworker-arm-1 and other workers size:M added