action #55064

closed

nscd.service failed on openqaworker-arm-2 (and other arm machines as well)

Added by okurz over 4 years ago. Updated over 4 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Start date: 2019-07-30
Due date: 2019-10-23
% Done: 0%
Estimated time:

Description

Observation

On openqaworker-arm-2 I could see in systemctl --failed that nscd failed persistently.


Related issues 1 (0 open, 1 closed)

Copied from openQA Infrastructure - action #55061: openqa-metrics.service failed on openqaworker-arm-2 since "Jul 24 17:07:08" (Resolved, nicksinger, 2019-07-30)

Actions #1

Updated by okurz over 4 years ago

  • Copied from action #55061: openqa-metrics.service failed on openqaworker-arm-2 since "Jul 24 17:07:08" added
Actions #2

Updated by okurz over 4 years ago

Reproducing manually with

/usr/sbin/nscd -d
Fri Aug  2 15:39:59 2019 - 43127: monitoring file /etc/passwd for database passwd
Fri Aug  2 15:39:59 2019 - 43127: monitoring file `/etc/passwd` (1)
Fri Aug  2 15:39:59 2019 - 43127: monitoring directory `/etc` (2)
Fri Aug  2 15:39:59 2019 - 43127: monitoring file /etc/group for database group
Fri Aug  2 15:39:59 2019 - 43127: monitoring file `/etc/group` (3)
Fri Aug  2 15:39:59 2019 - 43127: monitoring directory `/etc` (2)
Fri Aug  2 15:39:59 2019 - 43127: monitoring file /etc/hosts for database hosts
Fri Aug  2 15:39:59 2019 - 43127: monitoring file `/etc/hosts` (4)
Fri Aug  2 15:39:59 2019 - 43127: monitoring directory `/etc` (2)
Fri Aug  2 15:39:59 2019 - 43127: monitoring file /etc/resolv.conf for database hosts
Fri Aug  2 15:39:59 2019 - 43127: monitoring file `/etc/resolv.conf` (5)
Fri Aug  2 15:39:59 2019 - 43127: monitoring directory `/etc` (2)
Fri Aug  2 15:39:59 2019 - 43127: monitoring file /etc/services for database services
Fri Aug  2 15:39:59 2019 - 43127: monitoring file `/etc/services` (6)
Fri Aug  2 15:39:59 2019 - 43127: monitoring directory `/etc` (2)
Fri Aug  2 15:39:59 2019 - 43127: monitoring file /etc/netgroup for database netgroup
Fri Aug  2 15:39:59 2019 - 43127: monitoring file `/etc/netgroup` (7)
Fri Aug  2 15:39:59 2019 - 43127: monitoring directory `/etc` (2)
Fri Aug  2 15:39:59 2019 - 43127: cannot create /var/run/nscd/passwd; no persistent database used
Fri Aug  2 15:39:59 2019 - 43127: cannot create /var/run/nscd/group; no persistent database used
Fri Aug  2 15:39:59 2019 - 43127: cannot create /var/run/nscd/hosts; no sharing possible
Fri Aug  2 15:39:59 2019 - 43127: cannot create /var/run/nscd/services; no persistent database used
Fri Aug  2 15:39:59 2019 - 43127: cannot create /var/run/nscd/netgroup; no persistent database used
Fri Aug  2 15:39:59 2019 - 43127: /var/run/nscd/socket: No such file or directory

The problem seems to be that /var/run/nscd/ does not exist. A quick check on openqaworker-arm-3 shows that the directory exists there (with the socket file inside). So I manually created the directory and restarted the service, which then ran fine. I also reset the failed units with systemctl reset-failed.
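The manual fix above can be sketched as a small shell helper. This is only a sketch under assumptions: the function name ensure_rundir is made up for illustration, and the default path /run/nscd assumes that /var/run is a symlink to /run, as is usual on systemd-based systems. The systemctl calls are left as comments because they need root on the affected worker.

```shell
#!/bin/sh
# Hypothetical helper (not from the ticket): create the runtime directory
# for nscd if it is missing. The directory defaults to /run/nscd.
ensure_rundir() {
    dir="${1:-/run/nscd}"
    if [ -d "$dir" ]; then
        echo "$dir already exists"
    else
        mkdir -p "$dir" && echo "created $dir"
    fi
}

# On the affected worker, as root (not executed here):
#   ensure_rundir /run/nscd
#   systemctl restart nscd
#   systemctl reset-failed nscd
```

Passing the directory as an argument keeps the helper testable against a scratch directory before touching /run.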

Actions #3

Updated by okurz over 4 years ago

  • Status changed from New to Feedback
  • Assignee changed from okurz to nicksinger

@nicksinger so I "fixed" the problem by creating the directory. I wonder if you have seen this before on other workers?

Checking failed services with sudo salt -C 'G@roles:worker' cmd.run 'systemctl --failed' I could not find another instance, although quite a few other services are failing on all workers.
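To check a single machine for one specific unit instead of reading the whole list, a filter like the following could be used. It is a sketch, not something from this ticket: the function name is_unit_failed is hypothetical, and it assumes the input comes from systemctl --failed --no-legend --plain, where the unit name is the first column.

```shell
#!/bin/sh
# Hypothetical filter: reads "systemctl --failed --no-legend --plain" output
# on stdin and exits 0 if the given unit is listed, 1 otherwise.
is_unit_failed() {
    unit="$1"
    awk -v unit="$unit" '$1 == unit { found = 1 } END { exit !found }'
}

# Usage on a worker (not executed here):
#   systemctl --failed --no-legend --plain | is_unit_failed nscd.service \
#       && echo "nscd.service is in failed state"
```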

Actions #4

Updated by nicksinger over 4 years ago

  • Assignee changed from nicksinger to okurz

okurz wrote:

@nicksinger so I "fixed" the problem by creating the directory. I wonder if you have seen this before on other workers?

Checking failed services with sudo salt -C 'G@roles:worker' cmd.run 'systemctl --failed' I could not find another instance, although quite a few other services are failing on all workers.

Nope, not really. But IIRC the workers have no common ground besides our current configuration in salt, so there are slight setup differences (for example, I remember a worker with a pool and a pool2 directory and some weird symlinks between them).

Actions #5

Updated by okurz over 4 years ago

  • Due date set to 2019-09-12

Fine, let's try another iteration. I ran sudo salt -C 'G@roles:worker' cmd.run 'systemctl reset-failed', also because the service that failed on all instances was "openqa-metrics", which we have removed in the meantime. Some other services have failed, but only on individual machines. Let's see what pops up again after some time.

Actions #6

Updated by okurz over 4 years ago

Still visible, e.g. on the arm-* machines.

I reported

https://bugzilla.opensuse.org/show_bug.cgi?id=1149603

and can add a workaround in salt recipes: https://gitlab.suse.de/openqa/salt-states-openqa/merge_requests/162

Let's see how this behaves over the next days. I should probably trigger some reboots as well in between.

Actions #7

Updated by okurz over 4 years ago

  • Due date changed from 2019-09-12 to 2019-09-29

I suggested removing the workaround again in https://gitlab.suse.de/openqa/salt-states-openqa/merge_requests/170 as I did not see the problem appear in the last days. After the merge I will just check that the service does not fail and then close the ticket.

Actions #8

Updated by okurz over 4 years ago

  • Subject changed from nscd.service failed on openqaworker-arm-2 to nscd.service failed on openqaworker-arm-2 (and other arm machines as well)
  • Due date deleted (2019-09-29)
  • Status changed from Feedback to Blocked

I checked if the bug was still detected:

sudo salt '*arm*' cmd.run 'journalctl -u nscd | grep 1149603'
openqaworker-arm-1.suse.de:
    Sep 28 14:23:40 linux-rz6p mkdir_nscd[1143]: boo#1149603: /run/nscd does not exist
openqaworker-arm-2.suse.de:
    Sep 19 18:38:39 linux-79cz mkdir_nscd[1576]: boo#1149603: /run/nscd does not exist
openqaworker-arm-3.suse.de:
    Sep 17 20:20:37 openqaworker-arm-3 mkdir_nscd[1662]: boo#1149603: /run/nscd does not exist
    Sep 26 12:07:28 openqaworker-arm-3 mkdir_nscd[1546]: boo#1149603: /run/nscd does not exist

So it has now appeared on all three arm machines. I brought back the workaround with https://gitlab.suse.de/openqa/salt-states-openqa/commit/fad660eebbec153ca91155a9094d5b4db3819862, although it had never actually been removed from the ARM machines.

Updated https://bugzilla.opensuse.org/show_bug.cgi?id=1149603#c5
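To see at a glance how often the workaround triggered per machine, the salt output above can be summarized with a small awk filter. This is a sketch: the function name count_hits is hypothetical, and it assumes the output layout shown above, with host names on unindented lines ending in a colon and journal lines indented below them.

```shell
#!/bin/sh
# Hypothetical summary filter: reads the salt cmd.run output on stdin and
# prints "<host> <number of boo#1149603 journal lines>" per host, in input order.
count_hits() {
    awk '
        /^[^ \t].*:$/ {                      # unindented "host:" header line
            host = substr($0, 1, length($0) - 1)
            order[++n] = host
            next
        }
        /boo#1149603/ && host != "" { hits[host]++ }
        END {
            for (i = 1; i <= n; i++)
                printf "%s %d\n", order[i], hits[order[i]]
        }
    '
}

# Usage (not executed here):
#   sudo salt '*arm*' cmd.run 'journalctl -u nscd | grep 1149603' | count_hits
```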

Actions #9

Updated by okurz over 4 years ago

  • Due date set to 2019-10-23
  • Status changed from Blocked to Feedback

After https://gitlab.suse.de/openqa/salt-states-openqa/merge_requests/192 was merged, the root problem should be fixed. Let's monitor for two weeks whether nscd can still fail. I assume not, in which case we can remove the workaround again.

Actions #10

Updated by okurz over 4 years ago

  • Status changed from Feedback to Resolved