action #55064
closed
nscd.service failed on openqaworker-arm-2 (and other arm machines as well)
Added by okurz over 5 years ago. Updated about 5 years ago.
0%
Description
Updated by okurz over 5 years ago
- Copied from action #55061: openqa-metrics.service failed on openqaworker-arm-2 since "Jul 24 17:07:08" added
Updated by okurz over 5 years ago
Reproducing manually with
/usr/sbin/nscd -d
Fri Aug 2 15:39:59 2019 - 43127: monitoring file /etc/passwd for database passwd
Fri Aug 2 15:39:59 2019 - 43127: monitoring file `/etc/passwd` (1)
Fri Aug 2 15:39:59 2019 - 43127: monitoring directory `/etc` (2)
Fri Aug 2 15:39:59 2019 - 43127: monitoring file /etc/group for database group
Fri Aug 2 15:39:59 2019 - 43127: monitoring file `/etc/group` (3)
Fri Aug 2 15:39:59 2019 - 43127: monitoring directory `/etc` (2)
Fri Aug 2 15:39:59 2019 - 43127: monitoring file /etc/hosts for database hosts
Fri Aug 2 15:39:59 2019 - 43127: monitoring file `/etc/hosts` (4)
Fri Aug 2 15:39:59 2019 - 43127: monitoring directory `/etc` (2)
Fri Aug 2 15:39:59 2019 - 43127: monitoring file /etc/resolv.conf for database hosts
Fri Aug 2 15:39:59 2019 - 43127: monitoring file `/etc/resolv.conf` (5)
Fri Aug 2 15:39:59 2019 - 43127: monitoring directory `/etc` (2)
Fri Aug 2 15:39:59 2019 - 43127: monitoring file /etc/services for database services
Fri Aug 2 15:39:59 2019 - 43127: monitoring file `/etc/services` (6)
Fri Aug 2 15:39:59 2019 - 43127: monitoring directory `/etc` (2)
Fri Aug 2 15:39:59 2019 - 43127: monitoring file /etc/netgroup for database netgroup
Fri Aug 2 15:39:59 2019 - 43127: monitoring file `/etc/netgroup` (7)
Fri Aug 2 15:39:59 2019 - 43127: monitoring directory `/etc` (2)
Fri Aug 2 15:39:59 2019 - 43127: cannot create /var/run/nscd/passwd; no persistent database used
Fri Aug 2 15:39:59 2019 - 43127: cannot create /var/run/nscd/group; no persistent database used
Fri Aug 2 15:39:59 2019 - 43127: cannot create /var/run/nscd/hosts; no sharing possible
Fri Aug 2 15:39:59 2019 - 43127: cannot create /var/run/nscd/services; no persistent database used
Fri Aug 2 15:39:59 2019 - 43127: cannot create /var/run/nscd/netgroup; no persistent database used
Fri Aug 2 15:39:59 2019 - 43127: /var/run/nscd/socket: No such file or directory
I can see that the problem seems to be that /var/run/nscd/ does not exist. A quick check on openqaworker-arm-3 reveals that the directory exists there (with the socket file inside). So I manually created the directory and restarted the service, which was then fine. I also reset the failed units with systemctl reset-failed
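The manual fix described above can be sketched roughly as follows. This is a minimal sketch and not the exact commands typed on the worker; `ensure_dir` is a hypothetical helper name introduced only for illustration, and the commented-out `systemctl` calls assume root privileges on the affected machine.

```shell
#!/bin/sh
# Sketch of the manual fix: make sure the nscd runtime directory exists.
# ensure_dir is an illustrative helper, not something from the ticket.
ensure_dir() {
    dir="$1"
    if [ -d "$dir" ]; then
        echo "exists: $dir"
    else
        mkdir -p "$dir" && echo "created: $dir"
    fi
}

# On the affected worker (as root) one would then run something like:
#   ensure_dir /var/run/nscd
#   systemctl restart nscd.service
#   systemctl reset-failed
```

Keeping the directory check in a function makes it safe to run repeatedly, for example on every arm worker, without failing where the directory is already present.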
Updated by okurz over 5 years ago
- Status changed from New to Feedback
- Assignee changed from okurz to nicksinger
@nicksinger so I "fixed" the problem by creating the directory. I wonder if you have seen this before on other workers?
Checking failed services with sudo salt -C 'G@roles:worker' cmd.run 'systemctl --failed'
I could not find another instance, but quite a few other services are failing on all workers.
Updated by nicksinger over 5 years ago
- Assignee changed from nicksinger to okurz
okurz wrote:
@nicksinger so I "fixed" the problem by creating the directory. I wonder if you have seen this before on other workers?
Checking failed services with
sudo salt -C 'G@roles:worker' cmd.run 'systemctl --failed'
I could not find another instance, but quite a few other services are failing on all workers.
Nope, not really. But IIRC the workers have no common ground besides our current configuration inside salt, so there are slight setup differences (for example, I remember a worker with pool and pool2 directories and some weird symlinks between them).
Updated by okurz over 5 years ago
- Due date set to 2019-09-12
Fine, let's try another iteration. I ran sudo salt -C 'G@roles:worker' cmd.run 'systemctl reset-failed'
also because what failed on all instances was "openqa-metrics", which we have removed in the meantime. Some other services have failed, but only on single machines. Let's see what pops up again after some time.
Updated by okurz over 5 years ago
still visible e.g. on arm-* machines.
I reported https://bugzilla.opensuse.org/show_bug.cgi?id=1149603
and can add a workaround in the salt recipes: https://gitlab.suse.de/openqa/salt-states-openqa/merge_requests/162
Let's see how this behaves over the next days. I should probably trigger some reboots as well in between.
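The content of the merge request is not quoted in this ticket, so the following is only a guess at the shape such a workaround could take, inferred from the `mkdir_nscd` journal lines shown further down. `BUG_REF` and `mkdir_if_missing` are hypothetical names, and hooking the helper into an `ExecStartPre=` line of a systemd drop-in for nscd.service is an assumption, not a description of the actual salt state.

```shell
#!/bin/sh
# Hypothetical pre-start helper; the real workaround in the salt recipes
# is not shown in this ticket and may look different.
BUG_REF="boo#1149603"

mkdir_if_missing() {
    dir="$1"
    if [ ! -d "$dir" ]; then
        # Leave a greppable trace so recurrences can be found in the journal.
        echo "$BUG_REF: $dir does not exist"
        mkdir -p "$dir"
    fi
}

# Assumed usage, e.g. from an ExecStartPre= line in a drop-in for nscd.service:
#   mkdir_if_missing /run/nscd
```

Printing the bug reference whenever the directory had to be created is what makes a later `journalctl -u nscd | grep 1149603` useful for checking whether the underlying problem still occurs.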
Updated by okurz over 5 years ago
- Due date changed from 2019-09-12 to 2019-09-29
Suggested removing the workaround again in https://gitlab.suse.de/openqa/salt-states-openqa/merge_requests/170 as I have not seen the problem appear in the last few days. After the merge I will just check that the service does not fail and then close the ticket.
Updated by okurz about 5 years ago
- Subject changed from nscd.service failed on openqaworker-arm-2 to nscd.service failed on openqaworker-arm-2 (and other arm machines as well)
- Due date deleted (2019-09-29)
- Status changed from Feedback to Blocked
I checked if the bug was still detected:
sudo salt '*arm*' cmd.run 'journalctl -u nscd | grep 1149603'
openqaworker-arm-1.suse.de:
Sep 28 14:23:40 linux-rz6p mkdir_nscd[1143]: boo#1149603: /run/nscd does not exist
openqaworker-arm-2.suse.de:
Sep 19 18:38:39 linux-79cz mkdir_nscd[1576]: boo#1149603: /run/nscd does not exist
openqaworker-arm-3.suse.de:
Sep 17 20:20:37 openqaworker-arm-3 mkdir_nscd[1662]: boo#1149603: /run/nscd does not exist
Sep 26 12:07:28 openqaworker-arm-3 mkdir_nscd[1546]: boo#1149603: /run/nscd does not exist
So it has now appeared on all three arm machines. I brought back the workaround with https://gitlab.suse.de/openqa/salt-states-openqa/commit/fad660eebbec153ca91155a9094d5b4db3819862 but it had never actually been removed from the ARM machines.
Updated https://bugzilla.opensuse.org/show_bug.cgi?id=1149603#c5
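The per-host journal check above can be condensed into a count per machine. This is a small sketch, assuming the salt output has the shape shown in this comment (a host name followed by a colon on its own line, then the matching journal lines); `count_hits` is a hypothetical helper name.

```shell
#!/bin/sh
# Count journal lines mentioning boo#1149603 per host in salt cmd.run output.
# count_hits is an illustrative name; it reads the salt output from stdin.
count_hits() {
    awk '
        # A host header is a single word ending in a colon.
        NF == 1 && /:$/ { host = substr($1, 1, length($1) - 1); next }
        # Any other line mentioning the bug counts for the current host.
        /boo#1149603/ && host != "" { hits[host]++ }
        END { for (h in hits) print h, hits[h] }
    ' | sort
}

# Assumed usage on the salt master:
#   sudo salt '*arm*' cmd.run 'journalctl -u nscd | grep 1149603' | count_hits
```

For the output quoted above this would report one hit each for arm-1 and arm-2 and two hits for arm-3.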
Updated by okurz about 5 years ago
- Due date set to 2019-10-23
- Status changed from Blocked to Feedback
After https://gitlab.suse.de/openqa/salt-states-openqa/merge_requests/192 was merged, the root problem should be fixed. Let's monitor for two weeks whether nscd can still fail. I assume not, in which case we can remove the workaround again.
Updated by okurz about 5 years ago
- Status changed from Feedback to Resolved