action #75445 (closed)
unknown dashboards for "linux-fwcx" and "localhost" reappearing on monitor.qa
Description
Observation
https://stats.openqa-monitor.qa.suse.de/alerting/list?state=not_ok
shows many paused alerts for "linux-fwcx" and "localhost", e.g.
- linux-fwcx: Memory usage alert (UNKNOWN for 4 days)
- linux-fwcx: Minion Jobs alert (UNKNOWN for 4 days)
- linux-fwcx: NTP offset alert (UNKNOWN for 4 days)
- linux-fwcx: OpenQA Ping time alert (UNKNOWN for 4 days)
- linux-fwcx: partitions usage (%) alert (UNKNOWN for 4 days)
- localhost: Disk I/O time alert (UNKNOWN for 5 days)
- localhost: Memory usage alert (UNKNOWN for 5 days)
- localhost: Minion Jobs alert (UNKNOWN for 5 days)
- localhost: NTP offset alert (UNKNOWN for 5 days)
- localhost: OpenQA Ping time alert (UNKNOWN for 5 days)
- localhost: partitions usage (%) alert (UNKNOWN for 5 days)
I already tried to manually delete them but they seem to reappear. What I did on monitor.qa:
sudo su
cd /var/lib/grafana/dashboards
rm worker-linux-fwcx.json worker-localhost.json
systemctl restart grafana-server
Acceptance criteria
- AC1: Only osd production machines as maintained by https://gitlab.suse.de/openqa/salt-states-openqa and mentioned in https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls are included
Suggestions
- Find out who or what caused this and which machines these are – maybe experiments on "staging" or on the staging worker machines?
- Prevent the same monitoring instance from being reconfigured from elsewhere (see the sketch below)
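One quick check towards AC1 could look like this (just a sketch: it only lists which worker dashboards are currently provisioned on monitor.qa, using the same path as in the manual steps above, so they can be compared against the hosts in workerconf.sls):
# list the hosts for which worker dashboards are currently deployed on monitor.qa
ls -1 /var/lib/grafana/dashboards/worker-*.json | sed 's|.*/worker-||; s|\.json$||' | sort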
Updated by okurz about 4 years ago
- Due date set to 2020-10-30
I did the manual steps mentioned again. Will see if the problematic dashboards reappear.
Updated by okurz about 4 years ago
- Due date deleted (2020-10-30)
- Status changed from Feedback to Workable
- Assignee deleted (okurz)
They reappeared at least twice by now.
Updated by nicksinger about 4 years ago
- Status changed from Workable to In Progress
- Assignee set to nicksinger
Our dashboards are generated based on the real-time data present in salt. Sometimes a host accidentally registers against OSD, which can show symptoms like this. However, not this time:
openqa:~ # salt-key -L
Accepted Keys:
QA-Power8-4-kvm.qa.suse.de
QA-Power8-5-kvm.qa.suse.de
grenache-1.qa.suse.de
malbec.arch.suse.de
openqa-monitor.qa.suse.de
openqa.suse.de
openqaworker-arm-1.suse.de
openqaworker-arm-2.suse.de
openqaworker-arm-3.suse.de
openqaworker10.suse.de
openqaworker13.suse.de
openqaworker2.suse.de
openqaworker3.suse.de
openqaworker5.suse.de
openqaworker6.suse.de
openqaworker8.suse.de
openqaworker9.suse.de
Denied Keys:
Unaccepted Keys:
powerqaworker-qam-1
Rejected Keys:
All of these machines are expected. Nothing unusual. Going one step deeper into the mine (baha) where this data is generated: https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/monitoring/grafana.sls#L3
It took me way too long to transform this single line of python into a bash command:
openqa:~ # salt -l error --no-color -C 'openqa.suse.de' mine.get 'roles:worker' 'nodename' 'grain'
openqa.suse.de:
----------
QA-Power8-4-kvm.qa.suse.de:
QA-Power8-4-kvm
QA-Power8-5-kvm.qa.suse.de:
localhost
grenache-1.qa.suse.de:
grenache-1
malbec.arch.suse.de:
malbec
openqaworker-arm-1.suse.de:
openqaworker-arm-1
openqaworker-arm-2.suse.de:
openqaworker-arm-2
openqaworker-arm-3.suse.de:
openqaworker-arm-3
openqaworker10.suse.de:
openqaworker10
openqaworker13.suse.de:
openqaworker13
openqaworker2.suse.de:
openqaworker2
openqaworker3.suse.de:
openqaworker3
openqaworker5.suse.de:
openqaworker5
openqaworker6.suse.de:
openqaworker6
openqaworker8.suse.de:
linux-fwcx
openqaworker9.suse.de:
openqaworker9
So openqaworker8.suse.de and QA-Power8-5-kvm.qa.suse.de are the misbehaving hosts. Let's see what I can do about this.
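A quick way to double-check a single suspect host (a sketch; grains.get just reads the same grain that the mine serves):
# on osd, query the nodename grain of one suspect minion directly
salt 'openqaworker8.suse.de' grains.get nodename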
Updated by okurz about 4 years ago
OMG, this is so funny because we – at least I – were also missing QA-Power8-5-kvm.qa.suse.de in the past days :D This looks like yet another symptom of #73633 to me: I think the hostname should be updated from DHCP, but this likely fails to happen in time because either the link-up or the DHCP response is very slow.
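One place to check this hypothesis would be the DHCP client configuration on the workers (a sketch, assuming they use wicked with the usual openSUSE defaults):
# shows whether the DHCP client is allowed to (re)set the hostname
grep DHCLIENT_SET_HOSTNAME /etc/sysconfig/network/dhcp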
Updated by nicksinger about 4 years ago
okurz wrote:
OMG, this is so funny because we – at least I – were also missing QA-Power8-5-kvm.qa.suse.de in the past days :D This looks like yet another symptom of #73633 to me: I think the hostname should be updated from DHCP, but this likely fails to happen in time because either the link-up or the DHCP response is very slow.
I'm not sure where this is coming from. Looking at worker8 I saw that the static hostname was wrong:
nsinger@openqaworker8:~> hostnamectl
Static hostname: linux-fwcx.suse
Transient hostname: openqaworker8
Icon name: computer-server
Chassis: server
Machine ID: 7900bf3c706198423a0678e05913115f
Boot ID: 119abb6122e94753b4d46a405c525048
Operating System: openSUSE Leap 15.1
CPE OS Name: cpe:/o:opensuse:leap:15.1
Kernel: Linux 4.12.14-lp151.28.75-default
Architecture: x86-64
After setting the right one and restarting salt-minion:
openqaworker8:~ # hostnamectl --static set-hostname openqaworker8
openqaworker8:~ # sudo systemctl restart salt-minion
The machine reported the right "nodename":
openqa:~ # salt -l error --no-color -C 'openqa.suse.de' mine.get 'roles:worker' 'nodename' 'grain'
openqa.suse.de:
----------
[…]
openqaworker8.suse.de:
openqaworker8
[…]
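Since grafana.sls reads the nodename from the salt mine, the cached mine entry may also need a refresh before the dashboards are regenerated (a sketch; a later highstate run may do this implicitly anyway):
# on osd, push fresh mine data for the fixed worker
salt 'openqaworker8.suse.de' mine.update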
Updated by nicksinger about 4 years ago
QA-Power8-5-kvm gave me a bit of a hard time bringing it back. Everything looks good now:
openqa:~ # salt -l error --no-color -C 'openqa.suse.de' mine.get 'roles:worker' 'nodename' 'grain'
openqa.suse.de:
----------
QA-Power8-4-kvm.qa.suse.de:
QA-Power8-4-kvm
QA-Power8-5-kvm.qa.suse.de:
QA-Power8-5-kvm
grenache-1.qa.suse.de:
grenache-1
malbec.arch.suse.de:
malbec
openqaworker-arm-1.suse.de:
openqaworker-arm-1
openqaworker-arm-2.suse.de:
openqaworker-arm-2
openqaworker-arm-3.suse.de:
openqaworker-arm-3
openqaworker10.suse.de:
openqaworker10
openqaworker13.suse.de:
openqaworker13
openqaworker2.suse.de:
openqaworker2
openqaworker3.suse.de:
openqaworker3
openqaworker5.suse.de:
openqaworker5
openqaworker6.suse.de:
openqaworker6
openqaworker8.suse.de:
openqaworker8
openqaworker9.suse.de:
openqaworker9
Updated by nicksinger about 4 years ago
- Status changed from In Progress to Resolved
I'd say the immediate problem this ticket describes is gone for now. However, we might need to follow up with https://progress.opensuse.org/issues/76783 if this persists :(
Updated by okurz about 4 years ago
- Copied to action #76786: Configure static hostnames with salt for all salt nodes added
Updated by okurz about 4 years ago
- Status changed from Resolved to In Progress
- Assignee changed from nicksinger to okurz
I hope you agree that it makes sense to ensure good static hostnames already in salt, so I recorded #76786 for this. I still see the host names "linux-fwcx" and "localhost" in https://stats.openqa-monitor.qa.suse.de/alerting/list?state=not_ok, maybe you need to call a highstate once more? If the unexpected dashboards are gone you can resolve the ticket.
I am trying sudo salt '*monitor*' state.apply right now and will check.
Updated by okurz about 4 years ago
- Status changed from In Progress to Resolved
- Assignee changed from okurz to nicksinger
This wasn't sufficient. The deployed dashboard template files on the monitor host were fine but the "unknown dashboards" were still there. I manually deleted them in the grafana service instance. This might suffice now :) Setting back to nicksinger as original assignee.
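For reference, such leftover dashboards could also be removed without the webUI via the Grafana HTTP API (a sketch, not necessarily what was done here; it assumes Grafana >= 5, an admin API token in $TOKEN and jq installed):
# find dashboards matching the bogus host name and delete them by uid
for uid in $(curl -s -H "Authorization: Bearer $TOKEN" \
    'https://stats.openqa-monitor.qa.suse.de/api/search?query=linux-fwcx&type=dash-db' | jq -r '.[].uid'); do
  curl -s -X DELETE -H "Authorization: Bearer $TOKEN" \
    "https://stats.openqa-monitor.qa.suse.de/api/dashboards/uid/$uid"
done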
Updated by nicksinger about 4 years ago
oopsie, didn't check the full chain for the fix. Thanks for taking over!
Updated by okurz about 4 years ago
- Related to action #76783: research how hostnames with systemd work and make them static for all OSD related machines added
Updated by okurz about 4 years ago
- Status changed from Resolved to Feedback
We are back with this problem:
sudo salt -l error --no-color -C 'openqa.suse.de' mine.get 'roles:worker' 'nodename' 'grain'
openqa.suse.de:
----------
QA-Power8-4-kvm.qa.suse.de:
QA-Power8-4-kvm
QA-Power8-5-kvm.qa.suse.de:
QA-Power8-5-kvm
grenache-1.qa.suse.de:
grenache-1
malbec.arch.suse.de:
malbec
openqaworker-arm-1.suse.de:
openqaworker-arm-1
openqaworker-arm-2.suse.de:
openqaworker-arm-2
openqaworker-arm-3.suse.de:
openqaworker-arm-3
openqaworker10.suse.de:
openqaworker10
openqaworker13.suse.de:
localhost
openqaworker2.suse.de:
linux-1nn1
openqaworker3.suse.de:
openqaworker3
openqaworker5.suse.de:
openqaworker5
openqaworker6.suse.de:
openqaworker6
openqaworker8.suse.de:
openqaworker8
openqaworker9.suse.de:
linux-q6bp
powerqaworker-qam-1:
powerqaworker-qam-1
I assume something must have caused this problem to appear more often lately. Maybe related to #75016 and slow link-up time? What do you think?
Updated by okurz about 4 years ago
- Priority changed from Normal to High
Raising prio due to #73633#note-37.
Updated by okurz about 4 years ago
- Status changed from Feedback to Resolved
- Assignee changed from nicksinger to okurz
Finished #76786 and crosschecked that all hosts have the correct name. I removed the wrongly generated dashboard files manually and on osd ran
salt --hide-timeout \* saltutil.sync_grains,saltutil.refresh_grains,saltutil.refresh_pillar,mine.update ,,,
salt -l error -C 'G@roles:monitor' state.apply
but the applied state still executed
find -type f ! -name worker-openqaworker-arm-1.json ! -name worker-malbec.json ! -name worker-grenache-1.json ! -name worker-linux-1nn1.json ! -name worker-openqaworker8.json ! -name worker-openqaworker6.json ! -name worker-QA-Power8-5-kvm.json ! -name worker-openqaworker-arm-3.json ! -name worker-QA-Power8-4-kvm.json ! -name worker-powerqaworker-qam-1.json ! -name worker-localhost.json ! -name worker-openqaworker10.json ! -name worker-linux-q6bp.json ! -name worker-openqaworker-arm-2.json ! -name worker-openqaworker3.json ! -name worker-openqaworker5.json ! -name webui.dashboard.json ! -name webui.services.json ! -name failed_systemd_services.json ! -name automatic_actions.json ! -name job_age.json ! -name openqa_jobs.json ! -name status_overview.json -exec rm {} \;
See the wrong names like "worker-localhost" and "worker-linux-1nn1" still included among the files to keep.
After a systemctl restart on the affected machines the above worked. I still had to delete the dashboards in the grafana webUI.
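In other words, the nodename grain only changes once the minion process re-reads the hostname, so the missing piece was roughly this (a sketch; the assumption is that the restarted service above was salt-minion):
# on each affected worker: let the minion pick up the corrected hostname
systemctl restart salt-minion
# on osd afterwards: refresh the mine so grafana.sls sees the new nodenames
salt \* mine.update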
That should be enough. As I had already tested that the hostname settings are static, I don't expect this issue to reappear – well, not soon at least ;)