action #131309: [alert] NFS mount can fail due to hostname resolution error size:M - openQA Infrastructure (public) - openSUSE Project Management Tool


action #131309

closed


Added by okurz over 1 year ago. Updated over 1 year ago.

Status: Resolved
Priority: Normal
Assignee: nicksinger
Category:
Target version: openQA Project (public) - Ready
Start date: 2023-06-19
Due date: 2023-08-11
% Done:
Estimated time:
Tags: alert, nfs, infra

Description

Observations

martchus@worker11:~> sudo journalctl --since '2 days ago' -u var-lib-openqa-share.mount 
Jun 18 03:30:00 worker11 systemd[1]: Unmounting /var/lib/openqa/share...
Jun 18 03:30:00 worker11 systemd[1]: var-lib-openqa-share.mount: Deactivated successfully.
Jun 18 03:30:00 worker11 systemd[1]: Unmounted /var/lib/openqa/share.
-- Boot e08a8d421816483eb387db9aac85fdc4 --
Jun 18 03:36:02 worker11 systemd[1]: Mounting /var/lib/openqa/share...
Jun 18 03:36:02 worker11 mount[3411]: mount.nfs: Failed to resolve server openqa.suse.de: Name or service not known
Jun 18 03:36:02 worker11 systemd[1]: var-lib-openqa-share.mount: Mount process exited, code=exited, status=32/n/a
Jun 18 03:36:02 worker11 systemd[1]: var-lib-openqa-share.mount: Failed with result 'exit-code'.
Jun 18 03:36:02 worker11 systemd[1]: Failed to mount /var/lib/openqa/share.
Jun 18 03:37:03 worker11 systemd[1]: Mounting /var/lib/openqa/share...
Jun 18 03:37:05 worker11 mount[3803]: Created symlink /run/systemd/system/remote-fs.target.wants/rpc-statd.service → /usr/lib/systemd/system/rpc-statd.service.
Jun 18 03:37:06 worker11 systemd[1]: Mounted /var/lib/openqa/share.

The timestamps show that worker11 rebooted during the planned weekly reboot window. OSD was likely rebooting at the same time and not yet available again, so the first mount attempt at 03:36:02 failed to resolve openqa.suse.de. The next attempt at 03:37:03 succeeded.
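The log shows a failed attempt followed by a successful one about a minute later, so waiting for name resolution before mounting is one possible mitigation. The following is a minimal sketch only: the `retry` helper is hypothetical, not part of openQA or the salt states, and the attempt count and delay in the usage line are illustrative.

```shell
#!/bin/sh
# Hypothetical helper: run a command until it succeeds or attempts run out.
# Usage: retry <attempts> <delay-seconds> <command> [args...]
retry() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        "$@" && return 0                          # success: stop retrying
        [ "$i" -lt "$attempts" ] && sleep "$delay" # no sleep after the last try
        i=$((i + 1))
    done
    return 1                                      # every attempt failed
}
```

For example, `retry 30 10 getent hosts openqa.suse.de && mount /var/lib/openqa/share` would wait up to roughly five minutes for the name to resolve before mounting.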

Suggestions

worker11 is one of our two staging test machines. Crosscheck that it actually has the latest salt state: power on both machines, bring them back as salt minions connected to OSD, and ensure the salt high state is applied.
Crosscheck the mount options configured in salt and locally in /etc/fstab. They should include custom tweaks such as a systemd mount timeout, retries, etc.
Ensure that machines retry mounting from OSD often enough to cover periods when OSD itself is not up, e.g. as happened above in the usual weekly maintenance reboot window.
This could be tested by mounting a share from one machine and keeping the server off for some minutes while the client reboots: power on worker11 and worker12, create an NFS server on worker11, mount it from worker12, power off worker11, reboot worker12, then wait and monitor how worker12 behaves. Finally, power on worker11 and ensure that worker12 eventually mounts the share and reaches the final systemd target without any failed services.
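The crosscheck of mount options could be partly scripted. The sketch below lists NFS mount points in an fstab-style file that lack a given option. The function name `nfs_missing_opt` is hypothetical, and which options the salt states actually set is not stated in this ticket. `x-systemd.mount-timeout=` (systemd.mount(5)) and `retry=` (nfs(5)) are only examples of options one might check for.

```shell
#!/bin/sh
# Hypothetical helper: print mount points of NFS entries missing an option.
# Usage: nfs_missing_opt <fstab-file> <option-prefix>
nfs_missing_opt() {
    awk -v opt="$2" '
        /^[[:space:]]*(#|$)/ { next }        # skip comments and blank lines
        $3 ~ /^nfs4?$/ {                      # only nfs and nfs4 entries
            n = split($4, opts, ",")          # 4th field holds the mount options
            found = 0
            for (i = 1; i <= n; i++)
                if (index(opts[i], opt) == 1) found = 1   # prefix match
            if (!found) print $2              # 2nd field is the mount point
        }
    ' "$1"
}
```

Usage could look like `nfs_missing_opt /etc/fstab 'x-systemd.mount-timeout='`, which prints nothing when every NFS entry carries that option.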

Related issues 3 (1 open — 2 closed)

