openSUSE Project Management Tool (https://progress.opensuse.org/)
openQA Infrastructure - action #75445: unknown dashboards for "linux-fwcx" and "localhost" reappearing on monitor.qahttps://progress.opensuse.org/issues/75445?journal_id=3442662020-10-28T06:42:20Zokurzokurz@suse.com
<ul><li><strong>Due date</strong> set to <i>2020-10-30</i></li></ul><p>I did the manual steps mentioned again. Will see if the problematic dashboards reappear.</p>
openQA Infrastructure - action #75445: unknown dashboards for "linux-fwcx" and "localhost" reappearing on monitor.qahttps://progress.opensuse.org/issues/75445?journal_id=3446532020-10-29T12:05:34Zokurzokurz@suse.com
<ul><li><strong>Due date</strong> deleted (<del><i>2020-10-30</i></del>)</li><li><strong>Status</strong> changed from <i>Feedback</i> to <i>Workable</i></li><li><strong>Assignee</strong> deleted (<del><i>okurz</i></del>)</li></ul><p>They have reappeared at least twice by now.</p>
openQA Infrastructure - action #75445: unknown dashboards for "linux-fwcx" and "localhost" reappearing on monitor.qahttps://progress.opensuse.org/issues/75445?journal_id=3446742020-10-29T13:08:15Znicksingernsinger@suse.com
<ul><li><strong>Status</strong> changed from <i>Workable</i> to <i>In Progress</i></li><li><strong>Assignee</strong> set to <i>nicksinger</i></li></ul><p>Our dashboards are generated from the realtime data present in salt. Occasionally a host accidentally registers against OSD, which can cause symptoms like this. However, not this time:</p>
<pre><code>openqa:~ # salt-key -L
Accepted Keys:
QA-Power8-4-kvm.qa.suse.de
QA-Power8-5-kvm.qa.suse.de
grenache-1.qa.suse.de
malbec.arch.suse.de
openqa-monitor.qa.suse.de
openqa.suse.de
openqaworker-arm-1.suse.de
openqaworker-arm-2.suse.de
openqaworker-arm-3.suse.de
openqaworker10.suse.de
openqaworker13.suse.de
openqaworker2.suse.de
openqaworker3.suse.de
openqaworker5.suse.de
openqaworker6.suse.de
openqaworker8.suse.de
openqaworker9.suse.de
Denied Keys:
Unaccepted Keys:
powerqaworker-qam-1
Rejected Keys:
</code></pre>
<p>All of these machines are expected. Nothing unusual. Going one step deeper into the mine (baha) where this data is generated: <a href="https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/monitoring/grafana.sls#L3">https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/monitoring/grafana.sls#L3</a><br>
It took me way too long to transform this single line of Python into a bash command:</p>
<pre><code>openqa:~ # salt -l error --no-color -C 'openqa.suse.de' mine.get 'roles:worker' 'nodename' 'grain'
openqa.suse.de:
----------
QA-Power8-4-kvm.qa.suse.de:
QA-Power8-4-kvm
QA-Power8-5-kvm.qa.suse.de:
localhost
grenache-1.qa.suse.de:
grenache-1
malbec.arch.suse.de:
malbec
openqaworker-arm-1.suse.de:
openqaworker-arm-1
openqaworker-arm-2.suse.de:
openqaworker-arm-2
openqaworker-arm-3.suse.de:
openqaworker-arm-3
openqaworker10.suse.de:
openqaworker10
openqaworker13.suse.de:
openqaworker13
openqaworker2.suse.de:
openqaworker2
openqaworker3.suse.de:
openqaworker3
openqaworker5.suse.de:
openqaworker5
openqaworker6.suse.de:
openqaworker6
openqaworker8.suse.de:
linux-fwcx
openqaworker9.suse.de:
openqaworker9
</code></pre>
<p>So <code>openqaworker8.suse.de</code> and <code>QA-Power8-5-kvm.qa.suse.de</code> are the misbehaving hosts. Let's see what I can do about this.</p>
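<p>The manual comparison above could be automated. The following is only a minimal sketch, assuming the <code>mine.get</code> output keeps the "minion ID, then nodename" layout shown above and that the expected nodename is the first label of the minion ID. The function name <code>find_mismatched_nodenames</code> is hypothetical and not part of salt.</p>

```shell
# Sketch only: reads `salt ... mine.get 'roles:worker' 'nodename' 'grain'`
# style output on stdin and prints every minion whose reported nodename is
# not the first label of its minion ID.
# find_mismatched_nodenames is a hypothetical helper name, not part of salt.
find_mismatched_nodenames() {
    awk '
        # Skip the dashed separator line.
        $1 ~ /^-+$/ { next }
        # A line ending in ":" names a minion (or the queried target, which
        # is simply overwritten by the first real minion line).
        /:[[:space:]]*$/ {
            minion = $1
            sub(/:$/, "", minion)
            next
        }
        # Any other non-empty line is the nodename of the last minion seen.
        NF && minion != "" {
            short = minion
            sub(/\..*/, "", short)      # keep only the first DNS label
            if ($1 != short) print minion " -> " $1
            minion = ""
        }
    '
}
```

<p>It would be fed from the command above, for example by piping the <code>salt … mine.get</code> call into <code>find_mismatched_nodenames</code>. Elided output such as "[…]" is not handled.</p>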
openQA Infrastructure - action #75445: unknown dashboards for "linux-fwcx" and "localhost" reappearing on monitor.qahttps://progress.opensuse.org/issues/75445?journal_id=3446772020-10-29T13:10:49Zokurzokurz@suse.com
<ul></ul><p>OMG, this is so funny because we, or at least I, were also missing QA-Power8-5-kvm.qa.suse.de over the past days :D To me this looks like yet another symptom of <a class="issue tracker-4 status-3 priority-6 priority-high2 closed behind-schedule" title="action: OSD partially unresponsive, triggering 500 responses, spotty response visible in monitoring panel... (Resolved)" href="https://progress.opensuse.org/issues/73633">#73633</a>: I think the hostname should be updated from DHCP, but this likely does not happen in time because either the link-up or the DHCP response is very slow.</p>
openQA Infrastructure - action #75445: unknown dashboards for "linux-fwcx" and "localhost" reappearing on monitor.qahttps://progress.opensuse.org/issues/75445?journal_id=3446862020-10-29T13:41:53Znicksingernsinger@suse.com
<ul></ul><p>okurz wrote:</p>
<blockquote>
<p>OMG, this is so funny because we, or at least I, were also missing QA-Power8-5-kvm.qa.suse.de over the past days :D To me this looks like yet another symptom of <a class="issue tracker-4 status-3 priority-6 priority-high2 closed behind-schedule" title="action: OSD partially unresponsive, triggering 500 responses, spotty response visible in monitoring panel... (Resolved)" href="https://progress.opensuse.org/issues/73633">#73633</a>: I think the hostname should be updated from DHCP, but this likely does not happen in time because either the link-up or the DHCP response is very slow.</p>
</blockquote>
<p>I'm not sure where this is coming from. Looking at worker8 I saw that the <em>static</em> hostname was not set to the expected name:</p>
<pre><code>nsinger@openqaworker8:~> hostnamectl
Static hostname: linux-fwcx.suse
Transient hostname: openqaworker8
Icon name: computer-server
Chassis: server
Machine ID: 7900bf3c706198423a0678e05913115f
Boot ID: 119abb6122e94753b4d46a405c525048
Operating System: openSUSE Leap 15.1
CPE OS Name: cpe:/o:opensuse:leap:15.1
Kernel: Linux 4.12.14-lp151.28.75-default
Architecture: x86-64
</code></pre>
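<p>A check like this could be scripted. This is only a minimal sketch, assuming the "Static hostname" and "Transient hostname" labels shown above and a verbatim comparison of the two values (a fully qualified static name would therefore count as a mismatch). The function name <code>check_static_hostname</code> is hypothetical.</p>

```shell
# Sketch only: reads `hostnamectl` output on stdin, succeeds if the static and
# transient hostnames agree, and otherwise prints both and returns 1.
# check_static_hostname is a hypothetical helper name.
check_static_hostname() {
    awk -F': *' '
        # Real hostnamectl output is indented, so trim the label first.
        { key = $1; gsub(/^[[:space:]]+/, "", key) }
        key == "Static hostname"    { static = $2 }
        key == "Transient hostname" { transient = $2 }
        END {
            # Assumption: a missing transient line means no deviation.
            if (transient == "" || static == transient) exit 0
            print "static=" static " transient=" transient
            exit 1
        }
    '
}
```

<p>On a host this would be used as <code>hostnamectl | check_static_hostname</code>.</p>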
<p>After setting the right one and restarting <code>salt-minion</code>:</p>
<pre><code>openqaworker8:~ # hostnamectl --static set-hostname openqaworker8
openqaworker8:~ # sudo systemctl restart salt-minion
</code></pre>
<p>The machine reported the right "nodename":</p>
<pre><code>openqa:~ # salt -l error --no-color -C 'openqa.suse.de' mine.get 'roles:worker' 'nodename' 'grain'
openqa.suse.de:
----------
[…]
openqaworker8.suse.de:
openqaworker8
[…]
</code></pre> openQA Infrastructure - action #75445: unknown dashboards for "linux-fwcx" and "localhost" reappearing on monitor.qahttps://progress.opensuse.org/issues/75445?journal_id=3447072020-10-29T14:00:11Znicksingernsinger@suse.com
<ul></ul><p>QA-Power8-5-kvm gave me a bit of a hard time bringing it back. Everything looks good now:</p>
<pre><code>openqa:~ # salt -l error --no-color -C 'openqa.suse.de' mine.get 'roles:worker' 'nodename' 'grain'
openqa.suse.de:
----------
QA-Power8-4-kvm.qa.suse.de:
QA-Power8-4-kvm
QA-Power8-5-kvm.qa.suse.de:
QA-Power8-5-kvm
grenache-1.qa.suse.de:
grenache-1
malbec.arch.suse.de:
malbec
openqaworker-arm-1.suse.de:
openqaworker-arm-1
openqaworker-arm-2.suse.de:
openqaworker-arm-2
openqaworker-arm-3.suse.de:
openqaworker-arm-3
openqaworker10.suse.de:
openqaworker10
openqaworker13.suse.de:
openqaworker13
openqaworker2.suse.de:
openqaworker2
openqaworker3.suse.de:
openqaworker3
openqaworker5.suse.de:
openqaworker5
openqaworker6.suse.de:
openqaworker6
openqaworker8.suse.de:
openqaworker8
openqaworker9.suse.de:
openqaworker9
</code></pre> openQA Infrastructure - action #75445: unknown dashboards for "linux-fwcx" and "localhost" reappearing on monitor.qahttps://progress.opensuse.org/issues/75445?journal_id=3447132020-10-29T14:08:28Znicksingernsinger@suse.com
<ul><li><strong>Status</strong> changed from <i>In Progress</i> to <i>Resolved</i></li></ul><p>I'd say the immediate problem this ticket describes is gone for now. However, we might need to follow up with <a href="https://progress.opensuse.org/issues/76783" class="external">https://progress.opensuse.org/issues/76783</a> if this persists :(</p>
openQA Infrastructure - action #75445: unknown dashboards for "linux-fwcx" and "localhost" reappearing on monitor.qahttps://progress.opensuse.org/issues/75445?journal_id=3447162020-10-29T14:08:53Zokurzokurz@suse.com
<ul><li><strong>Copied to</strong> <i><a class="issue tracker-4 status-3 priority-5 priority-high3 closed child" href="/issues/76786">action #76786</a>: Configure static hostnames with salt for all salt nodes</i> added</li></ul> openQA Infrastructure - action #75445: unknown dashboards for "linux-fwcx" and "localhost" reappearing on monitor.qahttps://progress.opensuse.org/issues/75445?journal_id=3447222020-10-29T14:13:14Zokurzokurz@suse.com
<ul><li><strong>Status</strong> changed from <i>Resolved</i> to <i>In Progress</i></li><li><strong>Assignee</strong> changed from <i>nicksinger</i> to <i>okurz</i></li></ul><p>I hope you agree that it makes sense to ensure correct static hostnames in salt already, so I recorded <a class="issue tracker-4 status-3 priority-5 priority-high3 closed child" title="action: Configure static hostnames with salt for all salt nodes (Resolved)" href="https://progress.opensuse.org/issues/76786">#76786</a> for this. I still see the host names "linux-fwcx" and "localhost" in <a href="https://stats.openqa-monitor.qa.suse.de/alerting/list?state=not_ok" class="external">https://stats.openqa-monitor.qa.suse.de/alerting/list?state=not_ok</a>; maybe you need to apply the high state once more? If the unexpected dashboards are gone you can resolve the ticket.</p>
<p>I am trying</p>
<pre><code>sudo salt '*monitor*' state.apply
</code></pre>
<p>right now and will check.</p>
openQA Infrastructure - action #75445: unknown dashboards for "linux-fwcx" and "localhost" reappearing on monitor.qahttps://progress.opensuse.org/issues/75445?journal_id=3447252020-10-29T14:18:19Zokurzokurz@suse.com
<ul><li><strong>Status</strong> changed from <i>In Progress</i> to <i>Resolved</i></li><li><strong>Assignee</strong> changed from <i>okurz</i> to <i>nicksinger</i></li></ul><p>This wasn't sufficient. The deployed dashboard template files on the monitor host were fine but the "unknown dashboards" were still there. I manually deleted them in the grafana service instance. This might suffice now :) Setting back to nicksinger as original assignee.</p>
openQA Infrastructure - action #75445: unknown dashboards for "linux-fwcx" and "localhost" reappearing on monitor.qahttps://progress.opensuse.org/issues/75445?journal_id=3447312020-10-29T14:21:59Znicksingernsinger@suse.com
<ul></ul><p>oopsie, didn't check the full chain for the fix. Thanks for taking over!</p>
openQA Infrastructure - action #75445: unknown dashboards for "linux-fwcx" and "localhost" reappearing on monitor.qahttps://progress.opensuse.org/issues/75445?journal_id=3448812020-10-29T17:48:06Zokurzokurz@suse.com
<ul><li><strong>Related to</strong> <i><a class="issue tracker-4 status-3 priority-5 priority-high3 closed" href="/issues/76783">action #76783</a>: research how hostnames with systemd work and make them static for all OSD related machines</i> added</li></ul> openQA Infrastructure - action #75445: unknown dashboards for "linux-fwcx" and "localhost" reappearing on monitor.qahttps://progress.opensuse.org/issues/75445?journal_id=3487422020-11-09T19:03:50Zokurzokurz@suse.com
<ul><li><strong>Status</strong> changed from <i>Resolved</i> to <i>Feedback</i></li></ul><p>We are back with this problem:</p>
<pre><code>sudo salt -l error --no-color -C 'openqa.suse.de' mine.get 'roles:worker' 'nodename' 'grain'
openqa.suse.de:
----------
QA-Power8-4-kvm.qa.suse.de:
QA-Power8-4-kvm
QA-Power8-5-kvm.qa.suse.de:
QA-Power8-5-kvm
grenache-1.qa.suse.de:
grenache-1
malbec.arch.suse.de:
malbec
openqaworker-arm-1.suse.de:
openqaworker-arm-1
openqaworker-arm-2.suse.de:
openqaworker-arm-2
openqaworker-arm-3.suse.de:
openqaworker-arm-3
openqaworker10.suse.de:
openqaworker10
openqaworker13.suse.de:
localhost
openqaworker2.suse.de:
linux-1nn1
openqaworker3.suse.de:
openqaworker3
openqaworker5.suse.de:
openqaworker5
openqaworker6.suse.de:
openqaworker6
openqaworker8.suse.de:
openqaworker8
openqaworker9.suse.de:
linux-q6bp
powerqaworker-qam-1:
powerqaworker-qam-1
</code></pre>
<p>I assume something must have caused this problem to appear more often lately. Maybe related to <a class="issue tracker-4 status-3 priority-5 priority-high3 closed" title="action: [osd-admins][alert] Failed systemd services alert (workers): os-autoinst-openvswitch.service (and... (Resolved)" href="https://progress.opensuse.org/issues/75016">#75016</a> and slow link-up time? What do you think?</p>
openQA Infrastructure - action #75445: unknown dashboards for "linux-fwcx" and "localhost" reappearing on monitor.qahttps://progress.opensuse.org/issues/75445?journal_id=3522882020-11-19T20:12:58Zokurzokurz@suse.com
<ul><li><strong>Priority</strong> changed from <i>Normal</i> to <i>High</i></li></ul><p>raising prio due to <a class="issue tracker-4 status-3 priority-6 priority-high2 closed behind-schedule" title="action: OSD partially unresponsive, triggering 500 responses, spotty response visible in monitoring panel... (Resolved)" href="https://progress.opensuse.org/issues/73633#note-37">#73633#note-37</a></p>
openQA Infrastructure - action #75445: unknown dashboards for "linux-fwcx" and "localhost" reappearing on monitor.qahttps://progress.opensuse.org/issues/75445?journal_id=3539682020-11-25T12:08:20Zokurzokurz@suse.com
<ul><li><strong>Estimated time</strong> set to <i>80142.00 h</i></li></ul> openQA Infrastructure - action #75445: unknown dashboards for "linux-fwcx" and "localhost" reappearing on monitor.qahttps://progress.opensuse.org/issues/75445?journal_id=3540082020-11-25T12:10:19Zokurzokurz@suse.com
<ul><li><strong>Estimated time</strong> deleted (<del><i>80142.00 h</i></del>)</li></ul> openQA Infrastructure - action #75445: unknown dashboards for "linux-fwcx" and "localhost" reappearing on monitor.qahttps://progress.opensuse.org/issues/75445?journal_id=3541702020-11-25T22:03:44Zokurzokurz@suse.com
<ul><li><strong>Status</strong> changed from <i>Feedback</i> to <i>Resolved</i></li><li><strong>Assignee</strong> changed from <i>nicksinger</i> to <i>okurz</i></li></ul><p>Finished <a class="issue tracker-4 status-3 priority-5 priority-high3 closed child" title="action: Configure static hostnames with salt for all salt nodes (Resolved)" href="https://progress.opensuse.org/issues/76786">#76786</a> and crosschecked that all hosts have the correct name. I removed the wrongly generated dashboard files manually and on osd ran</p>
<pre><code>salt --hide-timeout \* saltutil.sync_grains,saltutil.refresh_grains,saltutil.refresh_pillar,mine.update ,,,
salt -l error -C 'G@roles:monitor' state.apply
</code></pre>
<p>but the state run still executed <code>find -type f ! -name worker-openqaworker-arm-1.json ! -name worker-malbec.json ! -name worker-grenache-1.json ! -name worker-linux-1nn1.json ! -name worker-openqaworker8.json ! -name worker-openqaworker6.json ! -name worker-QA-Power8-5-kvm.json ! -name worker-openqaworker-arm-3.json ! -name worker-QA-Power8-4-kvm.json ! -name worker-powerqaworker-qam-1.json ! -name worker-localhost.json ! -name worker-openqaworker10.json ! -name worker-linux-q6bp.json ! -name worker-openqaworker-arm-2.json ! -name worker-openqaworker3.json ! -name worker-openqaworker5.json ! -name webui.dashboard.json ! -name webui.services.json ! -name failed_systemd_services.json ! -name automatic_actions.json ! -name job_age.json ! -name openqa_jobs.json ! -name status_overview.json -exec rm {} \;</code>. Note that wrong names such as "worker-localhost" are still included in the list of files to keep.</p>
<p>After a <code>systemctl restart</code> on the affected machines the above worked. I still had to delete the dashboards in the grafana webUI.</p>
<p>That should be enough. As I had already tested that the hostname settings are static I don't expect this issue to reappear – well, not soon at least ;)</p>
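<p>To spot leftover dashboard files in the future, a check along these lines might help. This is only a minimal sketch, assuming the <code>worker-&lt;name&gt;.json</code> naming visible in the <code>find</code> command above. The function name <code>list_stale_worker_dashboards</code>, the directory argument and the list of expected names are hypothetical, and dashboards already stored in the grafana service itself are not covered.</p>

```shell
# Sketch only: prints worker dashboard files in a directory whose name does
# not correspond to any expected worker name.
# Usage: list_stale_worker_dashboards DIR EXPECTED_NAME...
# Function name and arguments are hypothetical; only the worker-<name>.json
# pattern is taken from the find command above.
list_stale_worker_dashboards() {
    dir=$1
    shift
    for file in "$dir"/worker-*.json; do
        # Without any match the glob stays literal; skip it.
        [ -e "$file" ] || continue
        name=${file##*/}        # strip the directory part
        name=${name#worker-}    # strip the "worker-" prefix
        name=${name%.json}      # strip the ".json" suffix
        stale=yes
        for expected in "$@"; do
            if [ "$name" = "$expected" ]; then
                stale=no
            fi
        done
        if [ "$stale" = yes ]; then
            printf '%s\n' "$file"
        fi
    done
    return 0
}
```

<p>The expected names could come from the <code>mine.get</code> output once all nodenames are correct. The function only lists files, so nothing is deleted without review.</p>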