action #174652
closed
Ensure uniqueness of nodenames for generating configs on monitor size:M
Added by dheidler 5 months ago. Updated 2 months ago.
0%
Description
Motivation
We need to prevent issues like #174610 (caused by work on #168811) in the future, so let's use the FQDN for each host in grafana/influxdb or, as an alternative, use unique hostnames.
Acceptance criteria
- AC1: All salt minion IDs and/or hostnames are unique
- AC2: All grafana monitored OSD hosts show current data
- AC3: all salt minion ids are again the FQDN https://gitlab.suse.de/openqa/salt-states-openqa#how-to-use
Suggestions
- Follow up with https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1329
- Consider https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1329#note_702658
- Or maybe use different hostnames
- Ensure that monitoring data on https://monitor.qa.suse.de/ continues (not that new dashboards are generated for every host)
- Remove workaround from #168811-36
- Ensure that salt states can be cleanly applied even if two hostnames (not FQDN) are the same, e.g. "baremetal-support"
- Ensure that all salt minion IDs equal the FQDN
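A minimal sketch of how the uniqueness requirement could be checked, assuming the minion IDs are available as a plain list of FQDNs. The function name and the sample list are hypothetical; the second entry is the assumed pre-rename FQDN of the PRG2 machine and is used only for illustration.

```python
from collections import defaultdict


def find_short_name_collisions(minion_ids):
    """Group FQDN minion IDs by short hostname and return only the collisions."""
    by_short = defaultdict(list)
    for minion_id in minion_ids:
        # The short hostname is everything before the first dot of the FQDN.
        short = minion_id.split(".", 1)[0]
        by_short[short].append(minion_id)
    return {short: sorted(ids) for short, ids in by_short.items() if len(ids) > 1}


# Illustrative input only, not an export of the real minion list.
minion_ids = [
    "baremetal-support.qe.nue2.suse.org",
    "baremetal-support.qe.prg2.suse.org",  # assumed pre-rename FQDN
    "worker39.oqa.prg2.suse.org",
]
collisions = find_short_name_collisions(minion_ids)
print(collisions)
```

An empty result would mean that every short hostname is unique, which is the stricter of the two options named in AC1.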
Rollback actions
- Remove silence from https://monitor.qa.suse.de/alerting/silences?alertmanager=grafana
alertname=baremetal-support-prg2: host up alert
Updated by jbaier_cz 5 months ago
- Related to action #174610: [alert] salt-states-openqa deploy pipeline failed: data failed to compile added
Updated by jbaier_cz 5 months ago
As can be seen at https://monitor.qa.suse.de/d/GDbaremetal-support-prg2/dashboard-for-baremetal-support-prg2, we now have a dashboard without data because the hostname and the minion nodename are different.
Updated by okurz 5 months ago
- Related to action #168811: baremetal-support in PRG2 size:M added
Updated by okurz 5 months ago
- Related to action #174985: [alert] salt-states-openqa | Failed pipeline for master "salt.exceptions.SaltReqTimeoutError: Message timed out" size:S added
Updated by livdywan 5 months ago
See also https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/3443316#L430 which might be related as we discussed:
worker39.oqa.prg2.suse.org:
Data failed to compile:
----------
Detected conflicting IDs, SLS IDs need to be globally unique.
The conflicting ID is 'net.ipv6.conf.br0.accept_ra' and is found in SLS 'base:network.accept_ra' and SLS 'base:openqa.worker'
mania.qe.nue2.suse.org:
Data failed to compile:
----------
Detected conflicting IDs, SLS IDs need to be globally unique.
The conflicting ID is 'net.ipv6.conf.br0.accept_ra' and is found in SLS 'base:network.accept_ra' and SLS 'base:openqa.worker'
ada.qe.prg2.suse.org:
Updated by nicksinger 5 months ago
Ensure that monitoring data on https://monitor.qa.suse.de/ continues (not that new dashboards are generated for every host)
This directly contradicts the ticket itself. We use the nodename as the dashboard identifier (the UID is filled by a for-loop using the nodename) and also instruct telegraf to collect data with only the hostname/nodename (https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/monitoring/telegraf/telegraf-worker.conf#L14-15 -> "[…] if empty use os.Hostname()." from the telegraf docs).
So the ticket currently suggests that we redefine "nodename" (or better: the hostname) as non-unique in our infrastructure. IMHO this task is rather an epic and should be planned/handled accordingly. Also remember that we do this purely because of https://progress.opensuse.org/issues/168811#note-36 - we can also just rename this single machine for now.
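The coupling described above can be sketched roughly as follows. This is not the actual salt template: the "GD" UID prefix is only inferred from the dashboard URL quoted earlier in this ticket, and both function names are hypothetical.

```python
def dashboard_uid(nodename, prefix="GD"):
    """Build a dashboard UID from the nodename (prefix inferred from the dashboard URL)."""
    return f"{prefix}{nodename}"


def dashboards_without_data(nodenames, telegraf_hosts):
    """Return UIDs of dashboards whose nodename matches no host reported by telegraf."""
    reported = set(telegraf_hosts)
    return [dashboard_uid(name) for name in nodenames if name not in reported]


# The salt grain says "baremetal-support-prg2" while telegraf still reports
# the OS hostname, so the generated dashboard finds no matching data.
empty = dashboards_without_data(
    nodenames=["baremetal-support-prg2"],
    telegraf_hosts=["baremetal-support"],
)
print(empty)
```

The same sketch shows why non-unique nodenames are a problem: two machines with the same nodename would map to the same UID.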
Updated by ybonatakis 4 months ago
I changed the hostname manually
iob@baremetal-support:~> sudo hostnamectl set-hostname baremetal-support-prg2
iob@baremetal-support:~> cat /etc/salt/grains
cat: /etc/salt/grains: Permission denied
iob@baremetal-support:~> sudo cat /etc/salt/grains
nodename: baremetal-support-prg2
iob@baremetal-support:~> sudo cat /etc/hostname
baremetal-support-prg2
Then I updated the openqa/workerconf.sls
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/960
I see a reference in salt-states-openqa/monitoring/grafana/alerts_to_delete.yaml but I am not sure if I have to do something there (I will look at it, possibly on Monday).
Finally, I don't see any entry in the salt repo other than ./salt/profile/dns/files/prg2_suse_org/dns-qa.suse.de.zone:34:baremetal-support CNAME baremetal-support.qe.nue2.suse.org.
I guess I will have to update this line too? Or do I have to add it to the host.yaml in that repo?
Updated by ybonatakis 4 months ago
I think the alias has to change in the salt/profile/dns/files/prg2_suse_org/dns-qa.suse.de.zone as well
https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/5952
Can someone let me know if this is correct and the only change required in the dns files?
Updated by okurz 4 months ago
Please introduce a new A-record and add back a CNAME, e.g. in salt/profile/dns/files/prg2_suse_org/dns-qe.prg2.suse.org
- baremetal-support A 10.144.110.162
+ baremetal-support-prg2 A 10.144.110.162
+ baremetal-support CNAME baremetal-support-prg2.qe.prg2.suse.org.
- salt/profile/dns/files/prg2_suse_org/dns-qe.prg2.suse.org:baremetal-support AAAA 2a07:de40:b211:24::162
+ salt/profile/dns/files/prg2_suse_org/dns-qe.prg2.suse.org:baremetal-support-prg2 AAAA 2a07:de40:b211:24::162
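The record pattern suggested above (move the address records to the new name, keep the old name as a CNAME) could be sketched like this. The helper name is hypothetical and the output only mirrors the simplified record notation used in this comment, not a complete zone file. Since a CNAME cannot coexist with other records at the same name, the old name's AAAA record has to move along with the A record.

```python
def rename_records(old_name, new_name, zone, ipv4, ipv6):
    """Return zone lines that move A/AAAA to new_name and alias old_name to it."""
    return [
        f"{new_name} A {ipv4}",
        f"{new_name} AAAA {ipv6}",
        # The trailing dot makes the CNAME target fully qualified.
        f"{old_name} CNAME {new_name}.{zone}.",
    ]


records = rename_records(
    old_name="baremetal-support",
    new_name="baremetal-support-prg2",
    zone="qe.prg2.suse.org",
    ipv4="10.144.110.162",
    ipv6="2a07:de40:b211:24::162",
)
print("\n".join(records))
```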
Updated by ybonatakis 4 months ago
okurz wrote in #note-15:
Please introduce a new A-record and add back a CNAME, e.g. in salt/profile/dns/files/prg2_suse_org/dns-qe.prg2.suse.org
- baremetal-support A 10.144.110.162
+ baremetal-support-prg2 A 10.144.110.162
+ baremetal-support CNAME baremetal-support-prg2.qe.prg2.suse.org.
- salt/profile/dns/files/prg2_suse_org/dns-qe.prg2.suse.org:baremetal-support AAAA 2a07:de40:b211:24::162
+ salt/profile/dns/files/prg2_suse_org/dns-qe.prg2.suse.org:baremetal-support-prg2 AAAA 2a07:de40:b211:24::162
PR updated
Updated by openqa_review 4 months ago
- Due date set to 2025-02-04
Setting due date based on mean cycle time of SUSE QE Tools
Updated by ybonatakis 4 months ago
- Status changed from In Progress to Workable
Pipeline passes for https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/5952. Waiting for merge.
Updated by okurz 4 months ago
- Status changed from Feedback to Workable
https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/5952 was merged & deployed.
Updated by ybonatakis 4 months ago
- Status changed from Workable to In Progress
I found that the hostname was baremetal-support again.
So I updated it again with hostnamectl, but hostname --fqdn was failing to give me the long domain name.
As such, I edited /etc/hosts (and restarted systemd-hostnamed). With the changes in /etc/hosts the FQDN looked fine.
But I still don't see any updates in the dashboard.
I checked the logs (journalctl -fu grafana-server) but I don't know if they are relevant. So first I assume that /etc/hosts should be updated somewhere in one of the salt repos. What else should I check to make sure that the dashboard is properly configured?
Updated by okurz 4 months ago
ybonatakis wrote in #note-22:
I found that the hostname was baremetal-support again.
So I updated again with hostnamectl. but hostname --fqdn was failing to give me the long domain name.
As such, I editted the /etc/hosts (and restarted systemd-hostnamed). With the changes in /etc/hosts the fqdn looked fine.
But still dont see any updates in the dashboard.
Please revert your changes to /etc/hosts. hostnamectl shows the problem: only the transient hostname is set to baremetal-support-prg2, the static hostname is still baremetal-support. I suggest renaming the VM in OpenPlatform itself to baremetal-support-prg2. Maybe the static hostname is set from the VM name. Also please ensure this is safe over reboots.
Checking the logs (journalctl -fu grafana-server) but I dont know if they are relevant.
Not relevant
So first I assume that the /etc/hosts should be updated somewhere in one of the salt repos. What else should I check out to make sure that the dashboard is properly configured?
Please read https://gitlab.suse.de/openqa/salt-states-openqa/#setup-production-machine
/etc/salt/minion_id should be the hostname, right now it's the old FQDN.
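The static versus transient mismatch mentioned above can be detected from the text that hostnamectl prints. A minimal sketch, assuming the output is passed in as a string; the sample text is reconstructed from the description in this comment, not copied from the machine, and the function names are hypothetical.

```python
def parse_hostnamectl(output):
    """Parse 'Key: value' lines of hostnamectl output into a dict."""
    fields = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def hostname_mismatch(output):
    """Return (mismatch, static, transient); transient falls back to static if absent."""
    fields = parse_hostnamectl(output)
    static = fields.get("Static hostname")
    transient = fields.get("Transient hostname", static)
    return static != transient, static, transient


# Illustrative sample matching the situation described in this comment.
sample = """\
   Static hostname: baremetal-support
Transient hostname: baremetal-support-prg2
"""
result = hostname_mismatch(sample)
print(result)
```

A transient-only rename is lost on reboot, which would explain why the hostname was found to be baremetal-support again.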
Updated by okurz 4 months ago
- Copied to action #175998: Multiple unaccepted salt keys on OSD added
Updated by ybonatakis 4 months ago
- Status changed from In Progress to Feedback
After the changes in https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/5952 and after setting the hostname as described above, the dashboard still could not match the correct value, so we performed some steps manually.
- ssh into baremetal-support-prg2.qe.prg2.suse.org
- Verify that hostnamectl provides the correct static name and that it remains after reboot.
- hostname --fqdn still wasn't giving the correct full domain. For that we had to adjust /etc/hosts manually.
- Then remove the /etc/salt/grains file. As baremetal-support doesn't act as a worker, apparently this isn't needed.
- Edit /etc/salt/minion_id and update it to baremetal-support-prg2.qe.prg2.suse.org. This seems to provide the correct host value in the dashboard queries, matching the hostname in the VM.
- We noticed an ERROR from systemctl status salt-minion.service:
The Salt Master has cached the public key for this node, this salt minion will wait for 10 seconds before attempting to re-authenticate
Now on openqa.suse.de:
- List all the keys and remove the old key for baremetal-support-prg2
- Add the new one (at that point I noticed that the dashboard already showed some graphs)
- Finally, to make sure that monitor is provisioned properly, we re-applied the state: sudo salt "monitor" state.apply
To fulfill all the requirements of this ticket, the last action was to unsilence the silenced alert.
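The manual checks above boil down to three values agreeing with each other. A rough sketch of such a consistency check, assuming the static hostname, the output of hostname --fqdn and the content of /etc/salt/minion_id have already been collected; the function name and the "before" values are illustrative assumptions, not recorded output.

```python
def check_identity(static_hostname, fqdn, minion_id):
    """Return a list of human-readable problems; an empty list means consistent."""
    problems = []
    if "." not in fqdn:
        problems.append(f"hostname --fqdn returned no domain: {fqdn!r}")
    if fqdn.split(".", 1)[0] != static_hostname:
        problems.append(f"FQDN {fqdn!r} does not start with static hostname {static_hostname!r}")
    if minion_id != fqdn:
        problems.append(f"minion_id {minion_id!r} differs from FQDN {fqdn!r}")
    return problems


# Assumed state before the manual fixes: no domain in the FQDN, old minion ID.
before = check_identity(
    static_hostname="baremetal-support-prg2",
    fqdn="baremetal-support-prg2",
    minion_id="baremetal-support.qe.prg2.suse.org",
)
# State after the manual steps described in this comment.
after = check_identity(
    static_hostname="baremetal-support-prg2",
    fqdn="baremetal-support-prg2.qe.prg2.suse.org",
    minion_id="baremetal-support-prg2.qe.prg2.suse.org",
)
print(before)
print(after)
```

Because the minion ID changed, the salt master also needed the old key removed and the new one accepted, as described in the steps above.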
Updated by ybonatakis 4 months ago
- Status changed from Feedback to Resolved
I updated https://racktables.nue.suse.com/index.php?page=object&tab=default&object_id=28360.
I think with this the job is done here.
In brief:
- VM has unique fqdn
- workaround for unique nodename in salt removed
- dashboard shows data again and silence is reverted
- racktables is up to date