action #55064

nscd.service failed on openqaworker-arm-2 (and other arm machines as well)

Added by okurz 7 months ago. Updated 4 months ago.

Status: Resolved
Start date: 30/07/2019
Priority: Normal
Due date: 23/10/2019
Assignee: okurz
% Done: 0%
Category: -
Target version: openQA Project - Current Sprint
Duration: 62

Description

Observation

On openqaworker-arm-2 I could see in systemctl --failed that nscd failed persistently.


Related issues

Copied from openQA Infrastructure - action #55061: openqa-metrics.service failed on openqaworker-arm-2 since... Resolved 30/07/2019

History

#1 Updated by okurz 7 months ago

  • Copied from action #55061: openqa-metrics.service failed on openqaworker-arm-2 since "Jul 24 17:07:08" added

#2 Updated by okurz 7 months ago

Reproducing manually with

/usr/sbin/nscd -d
Fri Aug  2 15:39:59 2019 - 43127: monitoring file /etc/passwd for database passwd
Fri Aug  2 15:39:59 2019 - 43127: monitoring file `/etc/passwd` (1)
Fri Aug  2 15:39:59 2019 - 43127: monitoring directory `/etc` (2)
Fri Aug  2 15:39:59 2019 - 43127: monitoring file /etc/group for database group
Fri Aug  2 15:39:59 2019 - 43127: monitoring file `/etc/group` (3)
Fri Aug  2 15:39:59 2019 - 43127: monitoring directory `/etc` (2)
Fri Aug  2 15:39:59 2019 - 43127: monitoring file /etc/hosts for database hosts
Fri Aug  2 15:39:59 2019 - 43127: monitoring file `/etc/hosts` (4)
Fri Aug  2 15:39:59 2019 - 43127: monitoring directory `/etc` (2)
Fri Aug  2 15:39:59 2019 - 43127: monitoring file /etc/resolv.conf for database hosts
Fri Aug  2 15:39:59 2019 - 43127: monitoring file `/etc/resolv.conf` (5)
Fri Aug  2 15:39:59 2019 - 43127: monitoring directory `/etc` (2)
Fri Aug  2 15:39:59 2019 - 43127: monitoring file /etc/services for database services
Fri Aug  2 15:39:59 2019 - 43127: monitoring file `/etc/services` (6)
Fri Aug  2 15:39:59 2019 - 43127: monitoring directory `/etc` (2)
Fri Aug  2 15:39:59 2019 - 43127: monitoring file /etc/netgroup for database netgroup
Fri Aug  2 15:39:59 2019 - 43127: monitoring file `/etc/netgroup` (7)
Fri Aug  2 15:39:59 2019 - 43127: monitoring directory `/etc` (2)
Fri Aug  2 15:39:59 2019 - 43127: cannot create /var/run/nscd/passwd; no persistent database used
Fri Aug  2 15:39:59 2019 - 43127: cannot create /var/run/nscd/group; no persistent database used
Fri Aug  2 15:39:59 2019 - 43127: cannot create /var/run/nscd/hosts; no sharing possible
Fri Aug  2 15:39:59 2019 - 43127: cannot create /var/run/nscd/services; no persistent database used
Fri Aug  2 15:39:59 2019 - 43127: cannot create /var/run/nscd/netgroup; no persistent database used
Fri Aug  2 15:39:59 2019 - 43127: /var/run/nscd/socket: No such file or directory

The problem seems to be that /var/run/nscd/ does not exist. A quick check on openqaworker-arm-3 shows that the directory exists there (with the socket file inside). So I manually created the directory and restarted the service, which was then fine. I also reset the failed units with systemctl reset-failed.
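For reference, the manual fix described above could be sketched roughly as follows. This is only an illustration, not the exact commands that were run: the function name ensure_nscd_dir is hypothetical, the 0755 mode is an assumption that should be compared against a healthy worker such as openqaworker-arm-3, and the systemctl calls are left as comments because they need root on an affected machine.

```shell
#!/bin/sh
# Create the nscd runtime directory if it is missing.
# ensure_nscd_dir is a hypothetical helper name, not part of nscd or salt.
ensure_nscd_dir() {
    # Default path taken from the nscd debug output in this comment.
    dir="${1:-/var/run/nscd}"
    if [ -d "$dir" ]; then
        echo "$dir already exists"
    else
        # Mode 0755 is an assumption; check a healthy worker for the real value.
        mkdir -p "$dir" && chmod 0755 "$dir" && echo "created $dir"
    fi
}

# Manual recovery on an affected worker (run as root; not executed here):
#   ensure_nscd_dir /var/run/nscd
#   systemctl restart nscd
#   systemctl reset-failed
```

Taking the directory as a parameter makes it possible to try the helper against a scratch path first, without touching the real runtime directory.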

#3 Updated by okurz 7 months ago

  • Status changed from New to Feedback
  • Assignee changed from okurz to nicksinger

@nicksinger so I "fixed" the problem by creating the directory. I wonder if you have seen this before on other workers?

Checking failed services with sudo salt -C 'G@roles:worker' cmd.run 'systemctl --failed' I could not find another instance, but quite a few other services are failing on all workers.

#4 Updated by nicksinger 7 months ago

  • Assignee changed from nicksinger to okurz

okurz wrote:

@nicksinger so I "fixed" the problem by creating the directory. I wonder if you have seen this before on other workers?

Checking failed services with sudo salt -C 'G@roles:worker' cmd.run 'systemctl --failed' I could not find another instance, but quite a few other services are failing on all workers.

Nope, not really. But IIRC the workers have no common ground besides our current configuration inside salt, so there are slight setup differences (for example, I remember a worker with a pool and a pool2 directory and some weird symlinks between them).

#5 Updated by okurz 6 months ago

  • Due date set to 12/09/2019

Fine, let's try another iteration. I ran sudo salt -C 'G@roles:worker' cmd.run 'systemctl reset-failed', also because the service that failed on all instances was "openqa-metrics", which we have since removed. Some other services have failed, but only on individual machines. Let's see what pops up again after some time.

#6 Updated by okurz 6 months ago

Still visible, e.g. on the arm-* machines.

I reported

https://bugzilla.opensuse.org/show_bug.cgi?id=1149603

and can add a workaround in salt recipes: https://gitlab.suse.de/openqa/salt-states-openqa/merge_requests/162

Let's see how this behaves over the next days. I should probably trigger some reboots as well in between.

#7 Updated by okurz 5 months ago

  • Due date changed from 12/09/2019 to 29/09/2019

I suggested removing the workaround again in https://gitlab.suse.de/openqa/salt-states-openqa/merge_requests/170 as I have not seen the problem appear in the last few days. After the merge I will just check that the service does not fail and then close the ticket.

#8 Updated by okurz 5 months ago

  • Subject changed from nscd.service failed on openqaworker-arm-2 to nscd.service failed on openqaworker-arm-2 (and other arm machines as well)
  • Due date deleted (29/09/2019)
  • Status changed from Feedback to Blocked

I checked if the bug was still detected:

sudo salt '*arm*' cmd.run 'journalctl -u nscd | grep 1149603'
openqaworker-arm-1.suse.de:
    Sep 28 14:23:40 linux-rz6p mkdir_nscd[1143]: boo#1149603: /run/nscd does not exist
openqaworker-arm-2.suse.de:
    Sep 19 18:38:39 linux-79cz mkdir_nscd[1576]: boo#1149603: /run/nscd does not exist
openqaworker-arm-3.suse.de:
    Sep 17 20:20:37 openqaworker-arm-3 mkdir_nscd[1662]: boo#1149603: /run/nscd does not exist
    Sep 26 12:07:28 openqaworker-arm-3 mkdir_nscd[1546]: boo#1149603: /run/nscd does not exist

So it has now appeared on all three arm machines. I brought back the workaround with https://gitlab.suse.de/openqa/salt-states-openqa/commit/fad660eebbec153ca91155a9094d5b4db3819862, but it had never been removed from the ARM machines anyway.

Updated https://bugzilla.opensuse.org/show_bug.cgi?id=1149603#c5
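To compare hosts over time, the salt output above could be condensed into a per-host count. The following is a minimal sketch, assuming salt's default text output (an unindented "minion-id:" line followed by indented result lines); the function name count_hits_per_host is hypothetical.

```shell
#!/bin/sh
# Count journal lines mentioning boo#1149603 per minion in `salt ... cmd.run` output.
# Reads the salt output on stdin and prints "<minion> <count>" per host.
count_hits_per_host() {
    awk '
        # An unindented line ending in ":" starts a new minion section.
        /^[^[:space:]].*:$/ {
            host = substr($0, 1, length($0) - 1)
            hits[host] += 0          # make sure hosts without matches are listed
            order[++n] = host
            next
        }
        # Any line in the current section mentioning the bug marker counts as a hit.
        /boo#1149603/ && host != "" { hits[host]++ }
        END {
            for (i = 1; i <= n; i++) printf "%s %d\n", order[i], hits[order[i]]
        }
    '
}

# Possible usage on the salt master (not executed here):
#   sudo salt '*arm*' cmd.run 'journalctl -u nscd | grep 1149603' | count_hits_per_host
```

Applied to the output quoted in this comment, this would list one hit each for arm-1 and arm-2 and two hits for arm-3.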

#9 Updated by okurz 5 months ago

  • Due date set to 23/10/2019
  • Status changed from Blocked to Feedback

Now that https://gitlab.suse.de/openqa/salt-states-openqa/merge_requests/192 is merged, the root problem should be fixed. Let's monitor for two weeks whether nscd can still fail. I assume not, in which case we can remove the workaround again.

#10 Updated by okurz 4 months ago

  • Status changed from Feedback to Resolved
