action #75445

closed

unknown dashboards for "linux-fwcx" and "localhost" reappearing on monitor.qa

Added by okurz about 4 years ago. Updated about 4 years ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Start date: 2020-10-28
Due date:
% Done: 0%
Estimated time:

Description

Observation

https://stats.openqa-monitor.qa.suse.de/alerting/list?state=not_ok
shows many alerts in the "UNKNOWN" state for "linux-fwcx" and "localhost", e.g.

linux-fwcx: Memory usage alert - UNKNOWN for 4 days
linux-fwcx: Minion Jobs alert - UNKNOWN for 4 days
linux-fwcx: NTP offset alert - UNKNOWN for 4 days
linux-fwcx: OpenQA Ping time alert - UNKNOWN for 4 days
linux-fwcx: partitions usage (%) alert - UNKNOWN for 4 days
localhost: Disk I/O time alert - UNKNOWN for 5 days
localhost: Memory usage alert - UNKNOWN for 5 days
localhost: Minion Jobs alert - UNKNOWN for 5 days
localhost: NTP offset alert - UNKNOWN for 5 days
localhost: OpenQA Ping time alert - UNKNOWN for 5 days
localhost: partitions usage (%) alert - UNKNOWN for 5 days

I already tried to delete them manually but they keep reappearing. What I did on monitor.qa:

sudo su
cd /var/lib/grafana/dashboards
rm worker-linux-fwcx.json worker-localhost.json
systemctl restart grafana-server
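
For later cross-checking, a quick way to see whether the unexpected dashboard files are back (just a sketch; path and host names are the ones used above):

# check on monitor.qa whether dashboard files for the unexpected hosts reappeared
ls -l /var/lib/grafana/dashboards | grep -E 'linux-fwcx|localhost'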

Acceptance criteria

Suggestions

  • Find out who did this and which machines these are; maybe experiments on "staging" or on the staging worker machines?
  • Prevent the same monitoring instance from being reconfigured from elsewhere

Related issues 2 (0 open, 2 closed)

Related to openQA Infrastructure (public) - action #76783: research how hostnames with systemd work and make them static for all OSD related machines (Resolved, okurz, 2020-10-29)

Copied to openQA Infrastructure (public) - action #76786: Configure static hostnames with salt for all salt nodes (Resolved, okurz)

Actions #1

Updated by okurz about 4 years ago

  • Due date set to 2020-10-30

I did the manual steps mentioned again. Will see if the problematic dashboards reappear.

Actions #2

Updated by okurz about 4 years ago

  • Due date deleted (2020-10-30)
  • Status changed from Feedback to Workable
  • Assignee deleted (okurz)

They reappeared at least twice by now.

Actions #3

Updated by nicksinger about 4 years ago

  • Status changed from Workable to In Progress
  • Assignee set to nicksinger

Our dashboards are generated based on the real-time data present in salt. Sometimes a host accidentally registers against OSD, which can show symptoms like this. However, not this time:

openqa:~ # salt-key -L
Accepted Keys:
QA-Power8-4-kvm.qa.suse.de
QA-Power8-5-kvm.qa.suse.de
grenache-1.qa.suse.de
malbec.arch.suse.de
openqa-monitor.qa.suse.de
openqa.suse.de
openqaworker-arm-1.suse.de
openqaworker-arm-2.suse.de
openqaworker-arm-3.suse.de
openqaworker10.suse.de
openqaworker13.suse.de
openqaworker2.suse.de
openqaworker3.suse.de
openqaworker5.suse.de
openqaworker6.suse.de
openqaworker8.suse.de
openqaworker9.suse.de
Denied Keys:
Unaccepted Keys:
powerqaworker-qam-1
Rejected Keys:

All of these machines are expected, nothing unusual. Going one step deeper into the (salt) mine where this data is generated: https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/monitoring/grafana.sls#L3
It took me way too long to transform this single line of Python into a bash command:

openqa:~ # salt -l error --no-color -C 'openqa.suse.de' mine.get 'roles:worker' 'nodename' 'grain'
openqa.suse.de:
    ----------
    QA-Power8-4-kvm.qa.suse.de:
        QA-Power8-4-kvm
    QA-Power8-5-kvm.qa.suse.de:
        localhost
    grenache-1.qa.suse.de:
        grenache-1
    malbec.arch.suse.de:
        malbec
    openqaworker-arm-1.suse.de:
        openqaworker-arm-1
    openqaworker-arm-2.suse.de:
        openqaworker-arm-2
    openqaworker-arm-3.suse.de:
        openqaworker-arm-3
    openqaworker10.suse.de:
        openqaworker10
    openqaworker13.suse.de:
        openqaworker13
    openqaworker2.suse.de:
        openqaworker2
    openqaworker3.suse.de:
        openqaworker3
    openqaworker5.suse.de:
        openqaworker5
    openqaworker6.suse.de:
        openqaworker6
    openqaworker8.suse.de:
        linux-fwcx
    openqaworker9.suse.de:
        openqaworker9

So openqaworker8.suse.de and QA-Power8-5-kvm.qa.suse.de are the misbehaving hosts. Let's see what I can do about this.
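
A way to spot such mismatches without reading through the whole list could look like this (only a sketch: it assumes jq is available on the salt master and that every worker's minion id starts with its short hostname):

# list workers whose reported nodename grain does not match the beginning of their minion id
salt -C 'G@roles:worker' grains.item nodename --out=json --static \
  | jq -r 'to_entries[] | select((.key | startswith(.value.nodename)) | not) | .key'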

Actions #4

Updated by okurz about 4 years ago

OMG, this is so funny because we (at least I) were also missing QA-Power8-5-kvm.qa.suse.de in the past days :D This looks like yet another symptom of #73633 to me: I think the hostname should be updated from DHCP, but this likely fails to happen in time because either the link-up or the DHCP response is very slow.

Actions #5

Updated by nicksinger about 4 years ago

okurz wrote:

OMG, this is so funny because we (at least I) were also missing QA-Power8-5-kvm.qa.suse.de in the past days :D This looks like yet another symptom of #73633 to me: I think the hostname should be updated from DHCP, but this likely fails to happen in time because either the link-up or the DHCP response is very slow.

I'm not sure where this is coming from. Looking at worker8 I saw that the static hostname was not set to the expected name:

nsinger@openqaworker8:~> hostnamectl
   Static hostname: linux-fwcx.suse
Transient hostname: openqaworker8
         Icon name: computer-server
           Chassis: server
        Machine ID: 7900bf3c706198423a0678e05913115f
           Boot ID: 119abb6122e94753b4d46a405c525048
  Operating System: openSUSE Leap 15.1
       CPE OS Name: cpe:/o:opensuse:leap:15.1
            Kernel: Linux 4.12.14-lp151.28.75-default
      Architecture: x86-64

After setting the right one and restarting salt-minion:

openqaworker8:~ # hostnamectl --static set-hostname openqaworker8
openqaworker8:~ # sudo systemctl restart salt-minion

The machine reported the right "nodename":

openqa:~ # salt -l error --no-color -C 'openqa.suse.de' mine.get 'roles:worker' 'nodename' 'grain'
openqa.suse.de:
    ----------
    […]
    openqaworker8.suse.de:
        openqaworker8
    […]
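
To see whether other workers have the same divergence between static and transient hostname, something along these lines should do (a sketch using salt's cmd.run; hostnamectl prints just the selected name when called with --static or --transient):

# print static and transient hostname for every worker; both should match the expected worker name
salt -C 'G@roles:worker' cmd.run 'hostnamectl --static; hostnamectl --transient'
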
Actions #6

Updated by nicksinger about 4 years ago

QA-Power8-5-kvm gave me a bit of a hard time bringing it back. Everything looks good now:

openqa:~ # salt -l error --no-color -C 'openqa.suse.de' mine.get 'roles:worker' 'nodename' 'grain'
openqa.suse.de:
    ----------
    QA-Power8-4-kvm.qa.suse.de:
        QA-Power8-4-kvm
    QA-Power8-5-kvm.qa.suse.de:
        QA-Power8-5-kvm
    grenache-1.qa.suse.de:
        grenache-1
    malbec.arch.suse.de:
        malbec
    openqaworker-arm-1.suse.de:
        openqaworker-arm-1
    openqaworker-arm-2.suse.de:
        openqaworker-arm-2
    openqaworker-arm-3.suse.de:
        openqaworker-arm-3
    openqaworker10.suse.de:
        openqaworker10
    openqaworker13.suse.de:
        openqaworker13
    openqaworker2.suse.de:
        openqaworker2
    openqaworker3.suse.de:
        openqaworker3
    openqaworker5.suse.de:
        openqaworker5
    openqaworker6.suse.de:
        openqaworker6
    openqaworker8.suse.de:
        openqaworker8
    openqaworker9.suse.de:
        openqaworker9
Actions #7

Updated by nicksinger about 4 years ago

  • Status changed from In Progress to Resolved

I'd say the immediate problem this ticket describes is gone for now. However, we might need to follow up with https://progress.opensuse.org/issues/76783 if it persists :(

Actions #8

Updated by okurz about 4 years ago

  • Copied to action #76786: Configure static hostnames with salt for all salt nodes added
Actions #9

Updated by okurz about 4 years ago

  • Status changed from Resolved to In Progress
  • Assignee changed from nicksinger to okurz

I hope you agree that it makes sense to ensure correct static hostnames already in salt, so I recorded #76786 for this. I still see the host names "linux-fwcx" and "localhost" in https://stats.openqa-monitor.qa.suse.de/alerting/list?state=not_ok, maybe you need to apply a highstate once more? If the unexpected dashboards are gone you can resolve the ticket.

I am trying

sudo salt '*monitor*' state.apply

right now and will check.
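
To verify what the state actually deployed, listing the generated dashboard files on the monitor host should show whether the bogus ones are gone (a sketch):

# list the worker dashboard files that salt generated on the monitor host
salt '*monitor*' cmd.run 'ls /var/lib/grafana/dashboards/worker-*.json'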

Actions #10

Updated by okurz about 4 years ago

  • Status changed from In Progress to Resolved
  • Assignee changed from okurz to nicksinger

This wasn't sufficient. The deployed dashboard template files on the monitor host were fine, but the "unknown dashboards" were still there, so I deleted them manually in the Grafana service instance. This might suffice now :) Setting back to nicksinger as original assignee.
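
For the record, deleting such leftovers through Grafana's HTTP API instead of the webUI could look roughly like this (a sketch: it assumes an admin API token in GRAFANA_TOKEN and jq, uses Grafana's standard dashboard search and delete endpoints, and the "localhost" query would need care not to match dashboards we want to keep):

# find and delete the dashboards for the bogus host names via the Grafana HTTP API
for host in linux-fwcx localhost; do
  curl -s -H "Authorization: Bearer $GRAFANA_TOKEN" \
    "https://stats.openqa-monitor.qa.suse.de/api/search?query=$host" \
    | jq -r '.[].uid' \
    | while read -r uid; do
        curl -s -X DELETE -H "Authorization: Bearer $GRAFANA_TOKEN" \
          "https://stats.openqa-monitor.qa.suse.de/api/dashboards/uid/$uid"
      done
done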

Actions #11

Updated by nicksinger about 4 years ago

oopsie, didn't check the full chain for the fix. Thanks for taking over!

Actions #12

Updated by okurz about 4 years ago

  • Related to action #76783: research how hostnames with systemd work and make them static for all OSD related machines added
Actions #13

Updated by okurz about 4 years ago

  • Status changed from Resolved to Feedback

We are back with this problem:

sudo salt -l error --no-color -C 'openqa.suse.de' mine.get 'roles:worker' 'nodename' 'grain'
openqa.suse.de:
    ----------
    QA-Power8-4-kvm.qa.suse.de:
        QA-Power8-4-kvm
    QA-Power8-5-kvm.qa.suse.de:
        QA-Power8-5-kvm
    grenache-1.qa.suse.de:
        grenache-1
    malbec.arch.suse.de:
        malbec
    openqaworker-arm-1.suse.de:
        openqaworker-arm-1
    openqaworker-arm-2.suse.de:
        openqaworker-arm-2
    openqaworker-arm-3.suse.de:
        openqaworker-arm-3
    openqaworker10.suse.de:
        openqaworker10
    openqaworker13.suse.de:
        localhost
    openqaworker2.suse.de:
        linux-1nn1
    openqaworker3.suse.de:
        openqaworker3
    openqaworker5.suse.de:
        openqaworker5
    openqaworker6.suse.de:
        openqaworker6
    openqaworker8.suse.de:
        openqaworker8
    openqaworker9.suse.de:
        linux-q6bp
    powerqaworker-qam-1:
        powerqaworker-qam-1

I assume something must have caused this problem to appear more often lately. Maybe related to #75016 and slow link-up time? What do you think?
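
Until #76786 is in place, a possible stop-gap would be to set the static hostname from the minion id on all nodes (only a sketch: it loops over every accepted minion, assumes jq on the master, and assumes that the first label of each minion id is the intended short hostname, which holds for the list above):

# set each node's static hostname to the first label of its salt minion id
for minion in $(salt-key -L --out=json | jq -r '.minions[]'); do
  salt "$minion" cmd.run "hostnamectl set-hostname ${minion%%.*}"
done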

Actions #14

Updated by okurz about 4 years ago

  • Priority changed from Normal to High

raising prio due to #73633#note-37

Actions #15

Updated by okurz about 4 years ago

  • Estimated time set to 80142.00 h
Actions #16

Updated by okurz about 4 years ago

  • Estimated time deleted (80142.00 h)
Actions #17

Updated by okurz about 4 years ago

  • Status changed from Feedback to Resolved
  • Assignee changed from nicksinger to okurz

Finished #76786 and crosschecked that all hosts have the correct name. I have removed the wrongly generated dashboard files manually, and on osd I ran:

salt --hide-timeout \* saltutil.sync_grains,saltutil.refresh_grains,saltutil.refresh_pillar,mine.update ,,,
salt -l error -C 'G@roles:monitor' state.apply

but the applied state still generated the cleanup command

find -type f ! -name worker-openqaworker-arm-1.json ! -name worker-malbec.json ! -name worker-grenache-1.json ! -name worker-linux-1nn1.json ! -name worker-openqaworker8.json ! -name worker-openqaworker6.json ! -name worker-QA-Power8-5-kvm.json ! -name worker-openqaworker-arm-3.json ! -name worker-QA-Power8-4-kvm.json ! -name worker-powerqaworker-qam-1.json ! -name worker-localhost.json ! -name worker-openqaworker10.json ! -name worker-linux-q6bp.json ! -name worker-openqaworker-arm-2.json ! -name worker-openqaworker3.json ! -name worker-openqaworker5.json ! -name webui.dashboard.json ! -name webui.services.json ! -name failed_systemd_services.json ! -name automatic_actions.json ! -name job_age.json ! -name openqa_jobs.json ! -name status_overview.json -exec rm {} \;

Note the wrong names like "worker-localhost", "worker-linux-1nn1" and "worker-linux-q6bp" still included in the list of dashboard files to keep.

After a systemctl restart on the affected machines the above worked. I still had to delete the dashboards in the Grafana webUI.

That should be enough. As I had already tested that the hostname settings are static, I don't expect this issue to reappear, at least not soon ;)
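
Should it come back nevertheless, the regression check boils down to the two commands used above (sketch): compare the nodenames in the salt mine with the dashboard files generated from them:

# nodenames as seen by the salt mine vs. the dashboard files generated from them
salt -l error -C 'openqa.suse.de' mine.get 'roles:worker' 'nodename' 'grain'
salt '*monitor*' cmd.run 'ls /var/lib/grafana/dashboards/worker-*.json'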
