action #75445

unknown dashboards for "linux-fwcx" and "localhost" reappearing on monitor.qa

Added by okurz 12 months ago. Updated 11 months ago.

Status: Resolved
Priority: High
Assignee:
Target version:
Start date: 2020-10-28
Due date:
% Done: 0%
Estimated time:

Description

Observation

https://stats.openqa-monitor.qa.suse.de/alerting/list?state=not_ok
shows many paused alerts for "linux-fwcx" and "localhost", e.g.

linux-fwcx: Memory usage alert
UNKNOWN for 4 days
linux-fwcx: Minion Jobs alert
UNKNOWN for 4 days
linux-fwcx: NTP offset alert
UNKNOWN for 4 days
linux-fwcx: OpenQA Ping time alert
UNKNOWN for 4 days
linux-fwcx: partitions usage (%) alert
UNKNOWN for 4 days
localhost: Disk I/O time alert
UNKNOWN for 5 days
localhost: Memory usage alert
UNKNOWN for 5 days
localhost: Minion Jobs alert
UNKNOWN for 5 days
localhost: NTP offset alert
UNKNOWN for 5 days
localhost: OpenQA Ping time alert
UNKNOWN for 5 days
localhost: partitions usage (%) alert
UNKNOWN for 5 days

I already tried to manually delete them but they seem to reappear. What I did on monitor.qa:

sudo su
cd /var/lib/grafana/dashboards
rm worker-linux-fwcx.json worker-localhost.json
systemctl restart grafana-server

Acceptance criteria

Suggestions

  • Find out who did that, which machines these are, maybe experiments on "staging" or on the staging worker machines?
  • Prevent the same monitoring instance from being reconfigured from elsewhere

Related issues

Related to openQA Infrastructure - action #76783: research how hostnames with systemd work and make them static for all OSD related machines (Resolved, 2020-10-29)

Copied to openQA Infrastructure - action #76786: Configure static hostnames with salt for all salt nodes (Resolved)

History

#1 Updated by okurz 12 months ago

  • Due date set to 2020-10-30

I did the manual steps mentioned again. Will see if the problematic dashboards reappear.

#2 Updated by okurz 12 months ago

  • Due date deleted (2020-10-30)
  • Status changed from Feedback to Workable
  • Assignee deleted (okurz)

They reappeared at least twice by now.

#3 Updated by nicksinger 12 months ago

  • Status changed from Workable to In Progress
  • Assignee set to nicksinger

Our dashboards are generated from the real-time data present in salt. Sometimes a host accidentally registers against OSD, which can cause symptoms like this. However, not this time:

openqa:~ # salt-key -L
Accepted Keys:
QA-Power8-4-kvm.qa.suse.de
QA-Power8-5-kvm.qa.suse.de
grenache-1.qa.suse.de
malbec.arch.suse.de
openqa-monitor.qa.suse.de
openqa.suse.de
openqaworker-arm-1.suse.de
openqaworker-arm-2.suse.de
openqaworker-arm-3.suse.de
openqaworker10.suse.de
openqaworker13.suse.de
openqaworker2.suse.de
openqaworker3.suse.de
openqaworker5.suse.de
openqaworker6.suse.de
openqaworker8.suse.de
openqaworker9.suse.de
Denied Keys:
Unaccepted Keys:
powerqaworker-qam-1
Rejected Keys:

All of these machines are expected; nothing unusual. Going one step deeper into the mine (baha), where this data is generated: https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/monitoring/grafana.sls#L3
It took me way too long to translate this single line of Python into a shell command:

openqa:~ # salt -l error --no-color -C 'openqa.suse.de' mine.get 'roles:worker' 'nodename' 'grain'
openqa.suse.de:
    ----------
    QA-Power8-4-kvm.qa.suse.de:
        QA-Power8-4-kvm
    QA-Power8-5-kvm.qa.suse.de:
        localhost
    grenache-1.qa.suse.de:
        grenache-1
    malbec.arch.suse.de:
        malbec
    openqaworker-arm-1.suse.de:
        openqaworker-arm-1
    openqaworker-arm-2.suse.de:
        openqaworker-arm-2
    openqaworker-arm-3.suse.de:
        openqaworker-arm-3
    openqaworker10.suse.de:
        openqaworker10
    openqaworker13.suse.de:
        openqaworker13
    openqaworker2.suse.de:
        openqaworker2
    openqaworker3.suse.de:
        openqaworker3
    openqaworker5.suse.de:
        openqaworker5
    openqaworker6.suse.de:
        openqaworker6
    openqaworker8.suse.de:
        linux-fwcx
    openqaworker9.suse.de:
        openqaworker9

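So openqaworker8.suse.de and QA-Power8-5-kvm.qa.suse.de are the misbehaving hosts: their reported nodenames ("linux-fwcx" and "localhost") are exactly the names of the unknown dashboards. Let's see what I can do about this.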
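
The comparison done by eye above (minion ID versus reported nodename) could be automated roughly as follows. This is only a sketch, not part of the salt states: the function names and the SAMPLE text are made up for illustration, and it assumes the plain indented output format shown in this comment (minion IDs at 4 spaces, nodenames at 8 spaces).

```python
# Hypothetical helper: flag minions whose "nodename" grain does not match
# the first label of their minion ID. Assumes the text layout printed by
# the mine.get call quoted above.

SAMPLE = """\
openqa.suse.de:
    ----------
    QA-Power8-5-kvm.qa.suse.de:
        localhost
    grenache-1.qa.suse.de:
        grenache-1
    openqaworker8.suse.de:
        linux-fwcx
"""


def parse_mine_output(text):
    """Return {minion_id: nodename} parsed from the indented output."""
    mapping = {}
    current_id = None
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or set(stripped) == {"-"}:
            continue  # skip blank lines and the "----------" separator
        indent = len(line) - len(line.lstrip())
        if indent == 4 and stripped.endswith(":"):
            current_id = stripped[:-1]  # a minion ID line
        elif indent >= 8 and current_id is not None:
            mapping[current_id] = stripped  # the nodename for that minion
            current_id = None
    return mapping


def mismatched_hosts(mapping):
    """Return the sorted minion IDs whose nodename differs from the short ID."""
    return sorted(
        minion
        for minion, nodename in mapping.items()
        if minion.split(".")[0] != nodename
    )


if __name__ == "__main__":
    print(mismatched_hosts(parse_mine_output(SAMPLE)))
```

Fed with the real output, any host listed by mismatched_hosts would be a candidate for a wrongly named dashboard.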

#4 Updated by okurz 12 months ago

OMG, this is so funny because we (at least I) were also missing QA-Power8-5-kvm.qa.suse.de in the past days :D This looks like yet another symptom of #73633 to me: I think the hostname should be updated from DHCP, but this likely does not happen in time because either the link-up or the DHCP response is very slow.

#5 Updated by nicksinger 12 months ago

okurz wrote:

OMG, this is so funny because we (at least I) were also missing QA-Power8-5-kvm.qa.suse.de in the past days :D This looks like yet another symptom of #73633 to me: I think the hostname should be updated from DHCP, but this likely does not happen in time because either the link-up or the DHCP response is very slow.

I'm not sure where this is coming from. Looking at worker8 I saw that the static hostname was not set to the machine's real name:

nsinger@openqaworker8:~> hostnamectl
   Static hostname: linux-fwcx.suse
Transient hostname: openqaworker8
         Icon name: computer-server
           Chassis: server
        Machine ID: 7900bf3c706198423a0678e05913115f
           Boot ID: 119abb6122e94753b4d46a405c525048
  Operating System: openSUSE Leap 15.1
       CPE OS Name: cpe:/o:opensuse:leap:15.1
            Kernel: Linux 4.12.14-lp151.28.75-default
      Architecture: x86-64

After setting the right one and restarting salt-minion:

openqaworker8:~ # hostnamectl --static set-hostname openqaworker8
openqaworker8:~ # sudo systemctl restart salt-minion

The machine reported the right "nodename":

openqa:~ # salt -l error --no-color -C 'openqa.suse.de' mine.get 'roles:worker' 'nodename' 'grain'
openqa.suse.de:
    ----------
    […]
    openqaworker8.suse.de:
        openqaworker8
    […]
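
The check done manually on worker8 (static hostname differs from the transient one) could be scripted along these lines. A sketch only: the function names are hypothetical, and it assumes the "Key: value" layout of the hostnamectl output quoted above. If no "Transient hostname" line is present, the sketch reports no mismatch.

```python
# Hypothetical helper: compare the static and transient hostname found in
# the text printed by hostnamectl.

SAMPLE = """\
   Static hostname: linux-fwcx.suse
Transient hostname: openqaworker8
         Icon name: computer-server
       CPE OS Name: cpe:/o:opensuse:leap:15.1
"""


def parse_hostnamectl(text):
    """Return the "Key: value" pairs of the output as a dict."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, values may contain colons themselves.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def hostname_mismatch(text):
    """Return (static, transient) short names if they differ, else None."""
    fields = parse_hostnamectl(text)
    static = fields.get("Static hostname", "").split(".")[0]
    transient = fields.get("Transient hostname", "").split(".")[0]
    if transient and static != transient:
        return (static, transient)
    return None


if __name__ == "__main__":
    print(hostname_mismatch(SAMPLE))
```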

#6 Updated by nicksinger 12 months ago

QA-Power8-5-kvm gave me a bit of a hard time bringing it back. Everything looks good now:

openqa:~ # salt -l error --no-color -C 'openqa.suse.de' mine.get 'roles:worker' 'nodename' 'grain'
openqa.suse.de:
    ----------
    QA-Power8-4-kvm.qa.suse.de:
        QA-Power8-4-kvm
    QA-Power8-5-kvm.qa.suse.de:
        QA-Power8-5-kvm
    grenache-1.qa.suse.de:
        grenache-1
    malbec.arch.suse.de:
        malbec
    openqaworker-arm-1.suse.de:
        openqaworker-arm-1
    openqaworker-arm-2.suse.de:
        openqaworker-arm-2
    openqaworker-arm-3.suse.de:
        openqaworker-arm-3
    openqaworker10.suse.de:
        openqaworker10
    openqaworker13.suse.de:
        openqaworker13
    openqaworker2.suse.de:
        openqaworker2
    openqaworker3.suse.de:
        openqaworker3
    openqaworker5.suse.de:
        openqaworker5
    openqaworker6.suse.de:
        openqaworker6
    openqaworker8.suse.de:
        openqaworker8
    openqaworker9.suse.de:
        openqaworker9

#7 Updated by nicksinger 12 months ago

  • Status changed from In Progress to Resolved

I'd say the immediate problem this ticket describes is gone for now. However, we might need to follow up with https://progress.opensuse.org/issues/76783 if this persists :(

#8 Updated by okurz 12 months ago

  • Copied to action #76786: Configure static hostnames with salt for all salt nodes added

#9 Updated by okurz 12 months ago

  • Status changed from Resolved to In Progress
  • Assignee changed from nicksinger to okurz

I hope you agree that it makes sense to ensure good static hostnames already in salt, so I recorded #76786 for this. In https://stats.openqa-monitor.qa.suse.de/alerting/list?state=not_ok I still see the host names "linux-fwcx" and "localhost". Maybe you need to apply the high state once more? If the unexpected dashboards are gone you can resolve the ticket.

I am trying

sudo salt '*monitor*' state.apply

right now and will check.

#10 Updated by okurz 12 months ago

  • Status changed from In Progress to Resolved
  • Assignee changed from okurz to nicksinger

This wasn't sufficient. The deployed dashboard template files on the monitor host were fine but the "unknown dashboards" were still there. I manually deleted them in the grafana service instance. This might suffice now :) Setting back to nicksinger as original assignee.

#11 Updated by nicksinger 12 months ago

oopsie, didn't check the full chain for the fix. Thanks for taking over!

#12 Updated by okurz 12 months ago

  • Related to action #76783: research how hostnames with systemd work and make them static for all OSD related machines added

#13 Updated by okurz 11 months ago

  • Status changed from Resolved to Feedback

We are back with this problem:

sudo salt -l error --no-color -C 'openqa.suse.de' mine.get 'roles:worker' 'nodename' 'grain'
openqa.suse.de:
    ----------
    QA-Power8-4-kvm.qa.suse.de:
        QA-Power8-4-kvm
    QA-Power8-5-kvm.qa.suse.de:
        QA-Power8-5-kvm
    grenache-1.qa.suse.de:
        grenache-1
    malbec.arch.suse.de:
        malbec
    openqaworker-arm-1.suse.de:
        openqaworker-arm-1
    openqaworker-arm-2.suse.de:
        openqaworker-arm-2
    openqaworker-arm-3.suse.de:
        openqaworker-arm-3
    openqaworker10.suse.de:
        openqaworker10
    openqaworker13.suse.de:
        localhost
    openqaworker2.suse.de:
        linux-1nn1
    openqaworker3.suse.de:
        openqaworker3
    openqaworker5.suse.de:
        openqaworker5
    openqaworker6.suse.de:
        openqaworker6
    openqaworker8.suse.de:
        openqaworker8
    openqaworker9.suse.de:
        linux-q6bp
    powerqaworker-qam-1:
        powerqaworker-qam-1

I assume something must have caused this problem to appear more often lately. Maybe related to #75016 and slow link-up time? What do you think?

#14 Updated by okurz 11 months ago

  • Priority changed from Normal to High

raising prio due to #73633#note-37

#15 Updated by okurz 11 months ago

  • Estimated time set to 80142.00 h

#16 Updated by okurz 11 months ago

  • Estimated time deleted (80142.00 h)

#17 Updated by okurz 11 months ago

  • Status changed from Feedback to Resolved
  • Assignee changed from nicksinger to okurz

Finished #76786 and cross-checked that all hosts have the correct name. I removed the wrongly generated dashboard files manually and on OSD ran

salt --hide-timeout \* saltutil.sync_grains,saltutil.refresh_grains,saltutil.refresh_pillar,mine.update ,,,
salt -l error -C 'G@roles:monitor' state.apply

but the state still executed the cleanup command below. It removes every file that is not on its list of names to keep, and that list still includes wrong names like "worker-localhost":

find -type f ! -name worker-openqaworker-arm-1.json ! -name worker-malbec.json ! -name worker-grenache-1.json ! -name worker-linux-1nn1.json ! -name worker-openqaworker8.json ! -name worker-openqaworker6.json ! -name worker-QA-Power8-5-kvm.json ! -name worker-openqaworker-arm-3.json ! -name worker-QA-Power8-4-kvm.json ! -name worker-powerqaworker-qam-1.json ! -name worker-localhost.json ! -name worker-openqaworker10.json ! -name worker-linux-q6bp.json ! -name worker-openqaworker-arm-2.json ! -name worker-openqaworker3.json ! -name worker-openqaworker5.json ! -name webui.dashboard.json ! -name webui.services.json ! -name failed_systemd_services.json ! -name automatic_actions.json ! -name job_age.json ! -name openqa_jobs.json ! -name status_overview.json -exec rm {} \;

After a systemctl restart on the affected machines the above worked. I still had to delete the dashboards in the grafana webUI.

That should be enough. As I had already tested that the hostname settings are static I don't expect this issue to reappear – well, not soon at least ;)
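
For future cross-checks, the leftover dashboard files could be listed with something like the sketch below. It mirrors the idea of the find command in this comment (everything not on the keep-list is stale), but only reports files and never deletes them. The function name and the example file names in the usage are assumptions; the "worker-<nodename>.json" pattern is taken from the file names seen in this ticket. Unlike find, it does not descend into subdirectories.

```python
# Hypothetical helper: list dashboard files that do not belong to any
# expected worker nodename or to an explicit keep-list.
from pathlib import Path


def stale_dashboards(directory, expected_nodenames, keep=()):
    """Return the sorted file names in directory that are not expected."""
    wanted = {f"worker-{name}.json" for name in expected_nodenames}
    wanted |= set(keep)  # e.g. non-worker dashboards such as webui.services.json
    return sorted(
        path.name
        for path in Path(directory).iterdir()
        if path.is_file() and path.name not in wanted
    )
```

Called as stale_dashboards("/var/lib/grafana/dashboards", ["openqaworker8"], keep=["webui.services.json"]) it would report a leftover worker-localhost.json. As noted above, removing the files alone was not enough here: the dashboards also had to be deleted in the grafana webUI.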
