openSUSE Project Management Tool https://progress.opensuse.org/
openQA Infrastructure - action #55064: nscd.service failed on openqaworker-arm-2 (and other arm machines as well) https://progress.opensuse.org/issues/55064?journal_id=232313 2019-08-02T15:43:53Z okurz okurz@suse.com
<ul><li><strong>Copied from</strong> <i><a class="issue tracker-4 status-3 priority-5 priority-high3 closed" href="/issues/55061">action #55061</a>: openqa-metrics.service failed on openqaworker-arm-2 since "Jul 24 17:07:08"</i> added</li></ul>
openQA Infrastructure - action #55064: nscd.service failed on openqaworker-arm-2 (and other arm machines as well) https://progress.opensuse.org/issues/55064?journal_id=232316 2019-08-02T15:48:24Z okurz okurz@suse.com
<ul></ul><p>Reproducing manually with</p>
<pre><code>/usr/sbin/nscd -d
Fri Aug 2 15:39:59 2019 - 43127: monitoring file /etc/passwd for database passwd
Fri Aug 2 15:39:59 2019 - 43127: monitoring file `/etc/passwd` (1)
Fri Aug 2 15:39:59 2019 - 43127: monitoring directory `/etc` (2)
Fri Aug 2 15:39:59 2019 - 43127: monitoring file /etc/group for database group
Fri Aug 2 15:39:59 2019 - 43127: monitoring file `/etc/group` (3)
Fri Aug 2 15:39:59 2019 - 43127: monitoring directory `/etc` (2)
Fri Aug 2 15:39:59 2019 - 43127: monitoring file /etc/hosts for database hosts
Fri Aug 2 15:39:59 2019 - 43127: monitoring file `/etc/hosts` (4)
Fri Aug 2 15:39:59 2019 - 43127: monitoring directory `/etc` (2)
Fri Aug 2 15:39:59 2019 - 43127: monitoring file /etc/resolv.conf for database hosts
Fri Aug 2 15:39:59 2019 - 43127: monitoring file `/etc/resolv.conf` (5)
Fri Aug 2 15:39:59 2019 - 43127: monitoring directory `/etc` (2)
Fri Aug 2 15:39:59 2019 - 43127: monitoring file /etc/services for database services
Fri Aug 2 15:39:59 2019 - 43127: monitoring file `/etc/services` (6)
Fri Aug 2 15:39:59 2019 - 43127: monitoring directory `/etc` (2)
Fri Aug 2 15:39:59 2019 - 43127: monitoring file /etc/netgroup for database netgroup
Fri Aug 2 15:39:59 2019 - 43127: monitoring file `/etc/netgroup` (7)
Fri Aug 2 15:39:59 2019 - 43127: monitoring directory `/etc` (2)
Fri Aug 2 15:39:59 2019 - 43127: cannot create /var/run/nscd/passwd; no persistent database used
Fri Aug 2 15:39:59 2019 - 43127: cannot create /var/run/nscd/group; no persistent database used
Fri Aug 2 15:39:59 2019 - 43127: cannot create /var/run/nscd/hosts; no sharing possible
Fri Aug 2 15:39:59 2019 - 43127: cannot create /var/run/nscd/services; no persistent database used
Fri Aug 2 15:39:59 2019 - 43127: cannot create /var/run/nscd/netgroup; no persistent database used
Fri Aug 2 15:39:59 2019 - 43127: /var/run/nscd/socket: No such file or directory
</code></pre>
<p>The problem seems to be that <code>/var/run/nscd/</code> does not exist. A quick check on openqaworker-arm-3 shows that the directory exists there (with the socket file in it). So I manually created the directory and restarted the service, which then ran fine. I also reset the failed units with <code>systemctl reset-failed</code>.</p>
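<p>A minimal sketch of the manual fix described above, assuming a POSIX shell. The helper name <code>ensure_runtime_dir</code> is hypothetical (it is not part of nscd or systemd), and the default path is taken from the debug output above:</p>

```shell
#!/bin/sh
# Create a service's runtime directory if it is missing.
# ensure_runtime_dir is a hypothetical helper name used only for illustration.
ensure_runtime_dir() {
    # Default path taken from the nscd debug output in this ticket.
    dir="${1:-/var/run/nscd}"
    if [ -d "$dir" ]; then
        echo "exists: $dir"
    else
        mkdir -p "$dir" && echo "created: $dir"
    fi
}

# On an affected worker (as root) the manual fix would then be roughly:
#   ensure_runtime_dir /var/run/nscd
#   systemctl restart nscd.service
#   systemctl reset-failed
```

<p>The function is idempotent, so it should be safe to run on workers where the directory already exists.</p>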
openQA Infrastructure - action #55064: nscd.service failed on openqaworker-arm-2 (and other arm machines as well) https://progress.opensuse.org/issues/55064?journal_id=232319 2019-08-02T15:50:28Z okurz okurz@suse.com
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>Feedback</i></li><li><strong>Assignee</strong> changed from <i>okurz</i> to <i>nicksinger</i></li></ul><p><a class="user active user-mention" href="https://progress.opensuse.org/users/24624">@nicksinger</a> so I "fixed" the problem by creating the directory. I wonder if you have seen this before on other workers?</p>
<p>Checking failed services with <code>sudo salt -C 'G@roles:worker' cmd.run 'systemctl --failed'</code> I could not find another instance of this failure, but quite a few other services are failing on all workers.</p>
openQA Infrastructure - action #55064: nscd.service failed on openqaworker-arm-2 (and other arm machines as well) https://progress.opensuse.org/issues/55064?journal_id=232799 2019-08-06T13:00:39Z nicksinger nsinger@suse.com
<ul><li><strong>Assignee</strong> changed from <i>nicksinger</i> to <i>okurz</i></li></ul><p>okurz wrote:</p>
<blockquote>
<p><a class="user active user-mention" href="https://progress.opensuse.org/users/24624">@nicksinger</a> so I "fixed" the problem by creating the directory. I wonder if you have seen this before on other workers?</p>
<p>Checking failed services with <code>sudo salt -C 'G@roles:worker' cmd.run 'systemctl --failed'</code> I could not find another instance of this failure, but quite a few other services are failing on all workers.</p>
</blockquote>
<p>Nope, not really. But IIRC the workers have no common ground besides our current configuration in salt, so there are slight setup differences (for example, I remember a worker with a pool and a pool2 directory and some weird symlinks between them).</p>
openQA Infrastructure - action #55064: nscd.service failed on openqaworker-arm-2 (and other arm machines as well) https://progress.opensuse.org/issues/55064?journal_id=239504 2019-08-29T13:18:35Z okurz okurz@suse.com
<ul><li><strong>Due date</strong> set to <i>2019-09-12</i></li></ul><p>Fine, let's try another iteration. I ran <code>sudo salt -C 'G@roles:worker' cmd.run 'systemctl reset-failed'</code>, also because what failed on all instances was "openqa-metrics", which we have since removed. Some other services have failed, but only on individual machines. Let's see what pops up again after some time.</p>
openQA Infrastructure - action #55064: nscd.service failed on openqaworker-arm-2 (and other arm machines as well) https://progress.opensuse.org/issues/55064?journal_id=241559 2019-09-05T13:20:17Z okurz okurz@suse.com
<ul></ul><p>Still visible, e.g. on the arm-* machines.</p>
<p>I reported</p>
<p><a href="https://bugzilla.opensuse.org/show_bug.cgi?id=1149603" class="external">https://bugzilla.opensuse.org/show_bug.cgi?id=1149603</a></p>
<p>and can add a workaround in salt recipes: <a href="https://gitlab.suse.de/openqa/salt-states-openqa/merge_requests/162" class="external">https://gitlab.suse.de/openqa/salt-states-openqa/merge_requests/162</a></p>
<p>Let's see how this behaves over the next days. I should probably trigger some reboots as well in between.</p>
openQA Infrastructure - action #55064: nscd.service failed on openqaworker-arm-2 (and other arm machines as well) https://progress.opensuse.org/issues/55064?journal_id=243431 2019-09-15T20:21:12Z okurz okurz@suse.com
<ul><li><strong>Due date</strong> changed from <i>2019-09-12</i> to <i>2019-09-29</i></li></ul><p>I suggested removing the workaround again in <a href="https://gitlab.suse.de/openqa/salt-states-openqa/merge_requests/170" class="external">https://gitlab.suse.de/openqa/salt-states-openqa/merge_requests/170</a> as I have not seen the problem appear in the last days. After the merge I will just check that the service does not fail and then close the ticket.</p>
openQA Infrastructure - action #55064: nscd.service failed on openqaworker-arm-2 (and other arm machines as well) https://progress.opensuse.org/issues/55064?journal_id=246956 2019-09-30T08:47:32Z okurz okurz@suse.com
<ul><li><strong>Subject</strong> changed from <i>nscd.service failed on openqaworker-arm-2</i> to <i>nscd.service failed on openqaworker-arm-2 (and other arm machines as well)</i></li><li><strong>Due date</strong> deleted (<del><i>2019-09-29</i></del>)</li><li><strong>Status</strong> changed from <i>Feedback</i> to <i>Blocked</i></li></ul><p>I checked if the bug was still detected:</p>
<pre><code>sudo salt '*arm*' cmd.run 'journalctl -u nscd | grep 1149603'
openqaworker-arm-1.suse.de:
Sep 28 14:23:40 linux-rz6p mkdir_nscd[1143]: boo#1149603: /run/nscd does not exist
openqaworker-arm-2.suse.de:
Sep 19 18:38:39 linux-79cz mkdir_nscd[1576]: boo#1149603: /run/nscd does not exist
openqaworker-arm-3.suse.de:
Sep 17 20:20:37 openqaworker-arm-3 mkdir_nscd[1662]: boo#1149603: /run/nscd does not exist
Sep 26 12:07:28 openqaworker-arm-3 mkdir_nscd[1546]: boo#1149603: /run/nscd does not exist
</code></pre>
<p>so it has now appeared on all three arm machines. I brought back the workaround with <a href="https://gitlab.suse.de/openqa/salt-states-openqa/commit/fad660eebbec153ca91155a9094d5b4db3819862" class="external">https://gitlab.suse.de/openqa/salt-states-openqa/commit/fad660eebbec153ca91155a9094d5b4db3819862</a>, but it was never actually removed from the ARM machines.</p>
<p>Updated <a href="https://bugzilla.opensuse.org/show_bug.cgi?id=1149603#c5" class="external">https://bugzilla.opensuse.org/show_bug.cgi?id=1149603#c5</a></p>
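<p>To compare hosts at a glance, the salt output above could be summarized per machine. This is only a sketch under the assumption that salt prints each minion name on its own unindented line ending in a colon, followed by the matching journal lines; <code>count_hits_per_host</code> is a hypothetical helper name:</p>

```shell
#!/bin/sh
# Count how often the boo#1149603 marker was logged per host.
# Reads salt cmd.run output on stdin; count_hits_per_host is a
# hypothetical helper name used only for illustration.
count_hits_per_host() {
    awk '
        # Journal lines carrying the marker count towards the current host.
        /boo#1149603/ { hits[host]++; next }
        # Unindented lines ending in ":" are taken to be minion names.
        /^[^ \t].*:$/ { host = substr($0, 1, length($0) - 1) }
        END { for (h in hits) print h, hits[h] }
    ' | sort
}

# Possible usage on the salt master:
#   sudo salt '*arm*' cmd.run 'journalctl -u nscd | grep 1149603' | count_hits_per_host
```

<p>Hosts without any match are simply not listed.</p>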
openQA Infrastructure - action #55064: nscd.service failed on openqaworker-arm-2 (and other arm machines as well) https://progress.opensuse.org/issues/55064?journal_id=248666 2019-10-09T18:10:09Z okurz okurz@suse.com
<ul><li><strong>Due date</strong> set to <i>2019-10-23</i></li><li><strong>Status</strong> changed from <i>Blocked</i> to <i>Feedback</i></li></ul><p>Now that <a href="https://gitlab.suse.de/openqa/salt-states-openqa/merge_requests/192" class="external">https://gitlab.suse.de/openqa/salt-states-openqa/merge_requests/192</a> is merged, the root problem should be fixed. Let's monitor for two weeks whether nscd still fails. I assume it will not, in which case we can remove the workaround again.</p>
openQA Infrastructure - action #55064: nscd.service failed on openqaworker-arm-2 (and other arm machines as well) https://progress.opensuse.org/issues/55064?journal_id=250181 2019-10-15T10:28:56Z okurz okurz@suse.com
<ul><li><strong>Status</strong> changed from <i>Feedback</i> to <i>Resolved</i></li></ul><p><a href="https://gitlab.suse.de/openqa/salt-states-openqa/merge_requests/201" class="external">https://gitlab.suse.de/openqa/salt-states-openqa/merge_requests/201</a></p>