action #75445

closed

unknown dashboards for "linux-fwcx" and "localhost" reappearing on monitor.qa

Added by okurz about 4 years ago. Updated about 4 years ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Start date: 2020-10-28
Due date:
% Done: 0%
Estimated time:

Description

Observation

https://stats.openqa-monitor.qa.suse.de/alerting/list?state=not_ok
shows many alerts in the "UNKNOWN" state for "linux-fwcx" and "localhost", e.g.

linux-fwcx: Memory usage alert - UNKNOWN for 4 days
linux-fwcx: Minion Jobs alert - UNKNOWN for 4 days
linux-fwcx: NTP offset alert - UNKNOWN for 4 days
linux-fwcx: OpenQA Ping time alert - UNKNOWN for 4 days
linux-fwcx: partitions usage (%) alert - UNKNOWN for 4 days
localhost: Disk I/O time alert - UNKNOWN for 5 days
localhost: Memory usage alert - UNKNOWN for 5 days
localhost: Minion Jobs alert - UNKNOWN for 5 days
localhost: NTP offset alert - UNKNOWN for 5 days
localhost: OpenQA Ping time alert - UNKNOWN for 5 days
localhost: partitions usage (%) alert - UNKNOWN for 5 days

I already tried to delete them manually but they keep reappearing. What I did on monitor.qa:

sudo su
cd /var/lib/grafana/dashboards
rm worker-linux-fwcx.json worker-localhost.json
systemctl restart grafana-server
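
For later cross-checking, a quick way to see whether the unexpected dashboard files are back (just a sketch; path and host names are the ones used above):

# check on monitor.qa whether dashboard files for the unexpected hosts reappeared
ls -l /var/lib/grafana/dashboards | grep -E 'linux-fwcx|localhost'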

Acceptance criteria

Suggestions

  • Find out who did this and which machines these are; maybe experiments on "staging" or on the staging worker machines?
  • Prevent the same monitoring instance from being reconfigured from elsewhere

Related issues 2 (0 open, 2 closed)

Related to openQA Infrastructure (public) - action #76783: research how hostnames with systemd work and make them static for all OSD related machines (Resolved, okurz, 2020-10-29)

Copied to openQA Infrastructure (public) - action #76786: Configure static hostnames with salt for all salt nodes (Resolved, okurz)

Actions #1

Updated by okurz about 4 years ago

  • Due date set to 2020-10-30

I did the manual steps mentioned again. Will see if the problematic dashboards reappear.

Actions #2

Updated by okurz about 4 years ago

  • Due date deleted (2020-10-30)
  • Status changed from Feedback to Workable
  • Assignee deleted (okurz)

They reappeared at least twice by now.

Actions #3

Updated by nicksinger about 4 years ago

  • Status changed from Workable to In Progress
  • Assignee set to nicksinger

Our dashboards are generated based on the real-time data present in salt. Sometimes a host accidentally registers against OSD, which can show symptoms like this. However, not this time:

openqa:~ # salt-key -L
Accepted Keys:
QA-Power8-4-kvm.qa.suse.de
QA-Power8-5-kvm.qa.suse.de
grenache-1.qa.suse.de
malbec.arch.suse.de
openqa-monitor.qa.suse.de
openqa.suse.de
openqaworker-arm-1.suse.de
openqaworker-arm-2.suse.de
openqaworker-arm-3.suse.de
openqaworker10.suse.de
openqaworker13.suse.de
openqaworker2.suse.de
openqaworker3.suse.de
openqaworker5.suse.de
openqaworker6.suse.de
openqaworker8.suse.de
openqaworker9.suse.de
Denied Keys:
Unaccepted Keys:
powerqaworker-qam-1
Rejected Keys:

All of these machines are expected, nothing unusual. Going one step deeper into the (salt) mine where this data is generated: https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/monitoring/grafana.sls#L3
It took me way too long to transform this single line of Python into a bash command:

openqa:~ # salt -l error --no-color -C 'openqa.suse.de' mine.get 'roles:worker' 'nodename' 'grain'
openqa.suse.de:
    ----------
    QA-Power8-4-kvm.qa.suse.de:
        QA-Power8-4-kvm
    QA-Power8-5-kvm.qa.suse.de:
        localhost
    grenache-1.qa.suse.de:
        grenache-1
    malbec.arch.suse.de:
        malbec
    openqaworker-arm-1.suse.de:
        openqaworker-arm-1
    openqaworker-arm-2.suse.de:
        openqaworker-arm-2
    openqaworker-arm-3.suse.de:
        openqaworker-arm-3
    openqaworker10.suse.de:
        openqaworker10
    openqaworker13.suse.de:
        openqaworker13
    openqaworker2.suse.de:
        openqaworker2
    openqaworker3.suse.de:
        openqaworker3
    openqaworker5.suse.de:
        openqaworker5
    openqaworker6.suse.de:
        openqaworker6
    openqaworker8.suse.de:
        linux-fwcx
    openqaworker9.suse.de:
        openqaworker9

So openqaworker8.suse.de and QA-Power8-5-kvm.qa.suse.de are the misbehaving hosts. Let's see what I can do about this.
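
A way to spot such mismatches without reading through the whole list could look like this (only a sketch: it assumes jq is available on the salt master and that every worker's minion id starts with its short hostname):

# list workers whose reported nodename grain does not match the beginning of their minion id
salt -C 'G@roles:worker' grains.item nodename --out=json --static \
  | jq -r 'to_entries[] | select((.key | startswith(.value.nodename)) | not) | .key'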

Actions #4

Updated by okurz about 4 years ago

OMG, this is so funny because we (at least I) were also missing QA-Power8-5-kvm.qa.suse.de in the past days :D This looks like yet another symptom of #73633 to me: I think the hostname should be updated from DHCP, but this likely fails to happen in time because either the link-up or the DHCP response is very slow.

Actions #5

Updated by nicksinger about 4 years ago

okurz wrote:

OMG, this is so funny because we (at least I) were also missing QA-Power8-5-kvm.qa.suse.de in the past days :D This looks like yet another symptom of #73633 to me: I think the hostname should be updated from DHCP, but this likely fails to happen in time because either the link-up or the DHCP response is very slow.

I'm not sure where this is coming from. Looking at worker8 I saw that the static hostname was not set to the expected name:

nsinger@openqaworker8:~> hostnamectl
   Static hostname: linux-fwcx.suse
Transient hostname: openqaworker8
         Icon name: computer-server
           Chassis: server
        Machine ID: 7900bf3c706198423a0678e05913115f
           Boot ID: 119abb6122e94753b4d46a405c525048
  Operating System: openSUSE Leap 15.1
       CPE OS Name: cpe:/o:opensuse:leap:15.1
            Kernel: Linux 4.12.14-lp151.28.75-default
      Architecture: x86-64

After setting the right one and restarting salt-minion:

openqaworker8:~ # hostnamectl --static set-hostname openqaworker8
openqaworker8:~ # sudo systemctl restart salt-minion

The machine reported the right "nodename":

openqa:~ # salt -l error --no-color -C 'openqa.suse.de' mine.get 'roles:worker' 'nodename' 'grain'
openqa.suse.de:
    ----------
    […]
    openqaworker8.suse.de:
        openqaworker8
    […]
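
To see whether other workers have the same divergence between static and transient hostname, something along these lines should do (a sketch using salt's cmd.run; hostnamectl prints just the selected name when called with --static or --transient):

# print static and transient hostname for every worker; both should match the expected worker name
salt -C 'G@roles:worker' cmd.run 'hostnamectl --static; hostnamectl --transient'
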
Actions #6

Updated by nicksinger about 4 years ago

QA-Power8-5-kvm gave me a bit of a hard time bringing it back. Everything looks good now:

openqa:~ # salt -l error --no-color -C 'openqa.suse.de' mine.get 'roles:worker' 'nodename' 'grain'
openqa.suse.de:
    ----------
    QA-Power8-4-kvm.qa.suse.de:
        QA-Power8-4-kvm
    QA-Power8-5-kvm.qa.suse.de:
        QA-Power8-5-kvm
    grenache-1.qa.suse.de:
        grenache-1
    malbec.arch.suse.de:
        malbec
    openqaworker-arm-1.suse.de:
        openqaworker-arm-1
    openqaworker-arm-2.suse.de:
        openqaworker-arm-2
    openqaworker-arm-3.suse.de:
        openqaworker-arm-3
    openqaworker10.suse.de:
        openqaworker10
    openqaworker13.suse.de:
        openqaworker13
    openqaworker2.suse.de:
        openqaworker2
    openqaworker3.suse.de:
        openqaworker3
    openqaworker5.suse.de:
        openqaworker5
    openqaworker6.suse.de:
        openqaworker6
    openqaworker8.suse.de:
        openqaworker8
    openqaworker9.suse.de:
        openqaworker9
Actions #7

Updated by nicksinger about 4 years ago

  • Status changed from In Progress to Resolved

I'd say the immediate problem this ticket describes is gone for now. However, we might need to follow up with https://progress.opensuse.org/issues/76783 if it persists :(

Actions #8

Updated by okurz about 4 years ago

  • Copied to action #76786: Configure static hostnames with salt for all salt nodes added
Actions #9

Updated by okurz about 4 years ago

  • Status changed from Resolved to In Progress
  • Assignee changed from nicksinger to okurz

I hope you agree that it makes sense to ensure correct static hostnames already in salt, so I recorded #76786 for this. I still see the host names "linux-fwcx" and "localhost" in https://stats.openqa-monitor.qa.suse.de/alerting/list?state=not_ok, maybe you need to apply a highstate once more? If the unexpected dashboards are gone you can resolve the ticket.

I am trying

sudo salt '*monitor*' state.apply

right now and will check.
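
To verify what the state actually deployed, listing the generated dashboard files on the monitor host should show whether the bogus ones are gone (a sketch):

# list the worker dashboard files that salt generated on the monitor host
salt '*monitor*' cmd.run 'ls /var/lib/grafana/dashboards/worker-*.json'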

Actions #10

Updated by okurz about 4 years ago

  • Status changed from In Progress to Resolved
  • Assignee changed from okurz to nicksinger

This wasn't sufficient. The deployed dashboard template files on the monitor host were fine, but the "unknown dashboards" were still there, so I deleted them manually in the Grafana service instance. This might suffice now :) Setting back to nicksinger as original assignee.
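
For the record, deleting such leftovers through Grafana's HTTP API instead of the webUI could look roughly like this (a sketch: it assumes an admin API token in GRAFANA_TOKEN and jq, uses Grafana's standard dashboard search and delete endpoints, and the "localhost" query would need care not to match dashboards we want to keep):

# find and delete the dashboards for the bogus host names via the Grafana HTTP API
for host in linux-fwcx localhost; do
  curl -s -H "Authorization: Bearer $GRAFANA_TOKEN" \
    "https://stats.openqa-monitor.qa.suse.de/api/search?query=$host" \
    | jq -r '.[].uid' \
    | while read -r uid; do
        curl -s -X DELETE -H "Authorization: Bearer $GRAFANA_TOKEN" \
          "https://stats.openqa-monitor.qa.suse.de/api/dashboards/uid/$uid"
      done
done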

Actions #11

Updated by nicksinger about 4 years ago

oopsie, didn't check the full chain for the fix. Thanks for taking over!

Actions #12

Updated by okurz about 4 years ago

  • Related to action #76783: research how hostnames with systemd work and make them static for all OSD related machines added
Actions #13

Updated by okurz about 4 years ago

  • Status changed from Resolved to Feedback

We are back with this problem:

sudo salt -l error --no-color -C 'openqa.suse.de' mine.get 'roles:worker' 'nodename' 'grain'
openqa.suse.de:
    ----------
    QA-Power8-4-kvm.qa.suse.de:
        QA-Power8-4-kvm
    QA-Power8-5-kvm.qa.suse.de:
        QA-Power8-5-kvm
    grenache-1.qa.suse.de:
        grenache-1
    malbec.arch.suse.de:
        malbec
    openqaworker-arm-1.suse.de:
        openqaworker-arm-1
    openqaworker-arm-2.suse.de:
        openqaworker-arm-2
    openqaworker-arm-3.suse.de:
        openqaworker-arm-3
    openqaworker10.suse.de:
        openqaworker10
    openqaworker13.suse.de:
        localhost
    openqaworker2.suse.de:
        linux-1nn1
    openqaworker3.suse.de:
        openqaworker3
    openqaworker5.suse.de:
        openqaworker5
    openqaworker6.suse.de:
        openqaworker6
    openqaworker8.suse.de:
        openqaworker8
    openqaworker9.suse.de:
        linux-q6bp
    powerqaworker-qam-1:
        powerqaworker-qam-1

I assume something must have caused this problem to appear more often lately. Maybe related to #75016 and slow link-up time? What do you think?
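
Until #76786 is in place, a possible stop-gap would be to set the static hostname from the minion id on all nodes (only a sketch: it loops over every accepted minion, assumes jq on the master, and assumes that the first label of each minion id is the intended short hostname, which holds for the list above):

# set each node's static hostname to the first label of its salt minion id
for minion in $(salt-key -L --out=json | jq -r '.minions[]'); do
  salt "$minion" cmd.run "hostnamectl set-hostname ${minion%%.*}"
done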

Actions #14

Updated by okurz about 4 years ago

  • Priority changed from Normal to High

raising prio due to #73633#note-37

Actions #15

Updated by okurz about 4 years ago

  • Estimated time set to 80142.00 h
Actions #16

Updated by okurz about 4 years ago

  • Estimated time deleted (80142.00 h)
Actions #17

Updated by okurz about 4 years ago

  • Status changed from Feedback to Resolved
  • Assignee changed from nicksinger to okurz

Finished #76786 and crosschecked that all hosts have the correct name. I have removed the wrongly generated dashboard files manually, and on osd I ran:

salt --hide-timeout \* saltutil.sync_grains,saltutil.refresh_grains,saltutil.refresh_pillar,mine.update ,,,
salt -l error -C 'G@roles:monitor' state.apply

but the applied state still generated the cleanup command

find -type f ! -name worker-openqaworker-arm-1.json ! -name worker-malbec.json ! -name worker-grenache-1.json ! -name worker-linux-1nn1.json ! -name worker-openqaworker8.json ! -name worker-openqaworker6.json ! -name worker-QA-Power8-5-kvm.json ! -name worker-openqaworker-arm-3.json ! -name worker-QA-Power8-4-kvm.json ! -name worker-powerqaworker-qam-1.json ! -name worker-localhost.json ! -name worker-openqaworker10.json ! -name worker-linux-q6bp.json ! -name worker-openqaworker-arm-2.json ! -name worker-openqaworker3.json ! -name worker-openqaworker5.json ! -name webui.dashboard.json ! -name webui.services.json ! -name failed_systemd_services.json ! -name automatic_actions.json ! -name job_age.json ! -name openqa_jobs.json ! -name status_overview.json -exec rm {} \;

Note the wrong names like "worker-localhost", "worker-linux-1nn1" and "worker-linux-q6bp" still included in the list of dashboard files to keep.

After a systemctl restart on the affected machines the above worked. I still had to delete the dashboards in the Grafana webUI.

That should be enough. As I had already tested that the hostname settings are static, I don't expect this issue to reappear, at least not soon ;)
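
Should it come back nevertheless, the regression check boils down to the two commands used above (sketch): compare the nodenames in the salt mine with the dashboard files generated from them:

# nodenames as seen by the salt mine vs. the dashboard files generated from them
salt -l error -C 'openqa.suse.de' mine.get 'roles:worker' 'nodename' 'grain'
salt '*monitor*' cmd.run 'ls /var/lib/grafana/dashboards/worker-*.json'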
