action #174652

closed

Ensure uniqueness of nodenames for generating configs on monitor size:M

Added by dheidler 5 months ago. Updated 2 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Feature requests
Start date:
2024-12-20
Due date:
% Done:

0%

Estimated time:
Tags:

Description

Motivation

To prevent issues like #174610 (caused by work on #168811) in the future, let's use the full FQDN for each host in Grafana/InfluxDB or, as an alternative, use unique hostnames.

Acceptance criteria

Suggestions

Rollback actions


Related issues 4 (0 open, 4 closed)

Related to openQA Infrastructure (public) - action #174610: [alert] salt-states-openqa deploy pipeline failed: data failed to compile (Resolved, dheidler, 2024-12-19)

Related to openQA Infrastructure (public) - action #168811: baremetal-support in PRG2 size:M (Resolved, dheidler, 2024-02-15)

Related to openQA Infrastructure (public) - action #174985: [alert] salt-states-openqa | Failed pipeline for master "salt.exceptions.SaltReqTimeoutError: Message timed out" size:S (Rejected, nicksinger, 2025-01-03)

Copied to openQA Infrastructure (public) - action #175998: Multiple unaccepted salt keys on OSD (Resolved, okurz)

Actions #1

Updated by okurz 5 months ago

  • Tags set to infra, salt
  • Description updated (diff)
  • Category set to Feature requests
  • Target version set to Ready
  • Parent task set to #159852
Actions #2

Updated by okurz 5 months ago

  • Parent task changed from #159852 to #166598
Actions #3

Updated by jbaier_cz 5 months ago

  • Related to action #174610: [alert] salt-states-openqa deploy pipeline failed: data failed to compile added
Actions #4

Updated by jbaier_cz 5 months ago

As can be seen at https://monitor.qa.suse.de/d/GDbaremetal-support-prg2/dashboard-for-baremetal-support-prg2, we now have a dashboard without data because the hostname and the minion nodename differ.

Actions #5

Updated by okurz 5 months ago

  • Description updated (diff)
Actions #6

Updated by okurz 5 months ago

Actions #7

Updated by okurz 5 months ago

  • Related to action #174985: [alert] salt-states-openqa | Failed pipeline for master "salt.exceptions.SaltReqTimeoutError: Message timed out" size:S added
Actions #8

Updated by livdywan 5 months ago

See also https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/3443316#L430 which might be related as we discussed:

worker39.oqa.prg2.suse.org:
    Data failed to compile:
----------
    Detected conflicting IDs, SLS IDs need to be globally unique.
    The conflicting ID is 'net.ipv6.conf.br0.accept_ra' and is found in SLS 'base:network.accept_ra' and SLS 'base:openqa.worker'
mania.qe.nue2.suse.org:
    Data failed to compile:
----------
    Detected conflicting IDs, SLS IDs need to be globally unique.
    The conflicting ID is 'net.ipv6.conf.br0.accept_ra' and is found in SLS 'base:network.accept_ra' and SLS 'base:openqa.worker'
ada.qe.prg2.suse.org:
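
The "conflicting IDs" error above means the same state ID is declared in two SLS files. As a rough illustration only (this is not how Salt implements the check, and the input mapping is a hypothetical, pre-parsed stand-in for the real SLS files), such duplicates could be detected like this:

```python
from collections import defaultdict


def find_conflicting_ids(sls_ids):
    """Return state IDs declared in more than one SLS.

    sls_ids maps an SLS name to the set of state IDs it declares
    (a hypothetical, already-parsed representation of the SLS files).
    """
    owners = defaultdict(list)
    for sls_name, ids in sls_ids.items():
        for state_id in ids:
            owners[state_id].append(sls_name)
    # Keep only IDs that have more than one owner.
    return {sid: sorted(names) for sid, names in owners.items() if len(names) > 1}


# SLS names and the conflicting ID are taken from the log above;
# "hypothetical.other.id" is made up for illustration.
declared = {
    "base:network.accept_ra": {"net.ipv6.conf.br0.accept_ra"},
    "base:openqa.worker": {"net.ipv6.conf.br0.accept_ra", "hypothetical.other.id"},
}
for state_id, sls_names in find_conflicting_ids(declared).items():
    print(f"{state_id} is declared in: {', '.join(sls_names)}")
```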
Actions #9

Updated by livdywan 5 months ago

  • Subject changed from Use fqdn instead of nodename for generating configs on monitor to Ensure uniqueness of nodenames for generating configs on monitor size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #10

Updated by nicksinger 5 months ago

Ensure that monitoring data on https://monitor.qa.suse.de/ continues (not that new dashboards are generated for every host)

This directly contradicts the ticket itself. We use the nodename as the dashboard identifier (the UID is filled in by a for-loop over the nodenames) and also instruct telegraf to collect data with only the hostname/nodename (https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/monitoring/telegraf/telegraf-worker.conf#L14-15; per the telegraf docs: "[…] if empty use os.Hostname()").

So the ticket currently suggests that we redefine "nodename" (or better: the hostname) as non-unique in our infrastructure. IMHO this task is rather an epic and should be planned/handled accordingly. Also remember that we do this purely because of https://progress.opensuse.org/issues/168811#note-36 - we can also just rename this single machine for now.
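
The collision described in this note can be sketched as follows. This is only an illustration of the naming problem, assuming the nodename is the first label of the FQDN; it is not code from salt-states-openqa. The pre-rename name baremetal-support.qe.prg2.suse.org is inferred from the DNS records discussed later in this ticket.

```python
from collections import defaultdict


def short_nodename(fqdn):
    """Return the first DNS label, i.e. the short hostname/nodename."""
    return fqdn.split(".", 1)[0]


def colliding_nodenames(fqdns):
    """Map each nodename shared by several FQDNs to the sorted FQDNs using it."""
    by_nodename = defaultdict(list)
    for fqdn in fqdns:
        by_nodename[short_nodename(fqdn)].append(fqdn)
    return {name: sorted(hosts) for name, hosts in by_nodename.items() if len(hosts) > 1}


hosts = [
    "baremetal-support.qe.nue2.suse.org",
    "baremetal-support.qe.prg2.suse.org",  # inferred pre-rename name
    "worker39.oqa.prg2.suse.org",
]
for nodename, fqdns in colliding_nodenames(hosts).items():
    print(f"nodename {nodename!r} is shared by: {', '.join(fqdns)}")
```

Anything keyed only by the short nodename (such as a dashboard UID) cannot tell the two baremetal-support machines apart, which is why either FQDNs or a unique hostname are needed.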

Actions #11

Updated by okurz 5 months ago

nicksinger wrote in #note-10:

[...] Also remember that we do this purely because of #168811-36 - we can also just rename this single machine for now.

Well, that's what I keep suggesting. That fulfills all criteria.

Actions #12

Updated by ybonatakis 4 months ago

  • Assignee set to ybonatakis
Actions #13

Updated by ybonatakis 4 months ago

I changed the hostname manually

iob@baremetal-support:~> sudo hostnamectl set-hostname baremetal-support-prg2
iob@baremetal-support:~> cat /etc/salt/grains
cat: /etc/salt/grains: Permission denied
iob@baremetal-support:~> sudo cat /etc/salt/grains
nodename: baremetal-support-prg2
iob@baremetal-support:~> sudo cat /etc/hostname 
baremetal-support-prg2

Then I updated the openqa/workerconf.sls
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/960

I see a reference in salt-states-openqa/monitoring/grafana/alerts_to_delete.yaml but I am not sure if I have to do something there (I will look at it, possibly on Monday).
Finally, I don't see any entry in the salt repo other than ./salt/profile/dns/files/prg2_suse_org/dns-qa.suse.de.zone:34:baremetal-support CNAME baremetal-support.qe.nue2.suse.org. I guess I will have to update this line too? Or do I have to add it to the host.yaml in that repo?

Actions #14

Updated by ybonatakis 4 months ago

I think the alias has to change in the salt/profile/dns/files/prg2_suse_org/dns-qa.suse.de.zone as well

https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/5952

Can someone let me know if this is correct and the only change required in the dns files?

Actions #15

Updated by okurz 4 months ago

Please introduce a new A-record and add back a CNAME, e.g. in salt/profile/dns/files/prg2_suse_org/dns-qe.prg2.suse.org

- baremetal-support            A     10.144.110.162
+ baremetal-support-prg2       A     10.144.110.162
+ baremetal-support            CNAME baremetal-support-prg2.qe.prg2.suse.org.
- salt/profile/dns/files/prg2_suse_org/dns-qe.prg2.suse.org:baremetal-support            AAAA  2a07:de40:b211:24::162
+ salt/profile/dns/files/prg2_suse_org/dns-qe.prg2.suse.org:baremetal-support-prg2       AAAA  2a07:de40:b211:24::162
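
The record layout requested in this note (address records under the new unique name, plus a CNAME that keeps the old name resolvable) can be sketched as below. This is a hypothetical helper for illustration, not tooling from the OPS-Service/salt repository; the names and addresses are the ones quoted in the diff above.

```python
def rename_records(old, new, zone, ipv4, ipv6):
    """Render zone-file lines for renaming a host while keeping the old name as alias."""
    return [
        f"{new:<28} A     {ipv4}",
        f"{new:<28} AAAA  {ipv6}",
        # The old name becomes an alias pointing at the new, unique name.
        f"{old:<28} CNAME {new}.{zone}.",
    ]


for line in rename_records(
    old="baremetal-support",
    new="baremetal-support-prg2",
    zone="qe.prg2.suse.org",
    ipv4="10.144.110.162",
    ipv6="2a07:de40:b211:24::162",
):
    print(line)
```

A name that carries a CNAME must not carry other records, so the old name loses its own A and AAAA entries once it becomes an alias.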
Actions #16

Updated by ybonatakis 4 months ago

okurz wrote in #note-15:

Please introduce a new A-record and add back a CNAME, e.g. in salt/profile/dns/files/prg2_suse_org/dns-qe.prg2.suse.org

- baremetal-support            A     10.144.110.162
+ baremetal-support-prg2       A     10.144.110.162
+ baremetal-support            CNAME baremetal-support-prg2.qe.prg2.suse.org.
- salt/profile/dns/files/prg2_suse_org/dns-qe.prg2.suse.org:baremetal-support            AAAA  2a07:de40:b211:24::162
+ salt/profile/dns/files/prg2_suse_org/dns-qe.prg2.suse.org:baremetal-support-prg2       AAAA  2a07:de40:b211:24::162

PR updated

Actions #17

Updated by ybonatakis 4 months ago

  • Status changed from Workable to In Progress
Actions #18

Updated by openqa_review 4 months ago

  • Due date set to 2025-02-04

Setting due date based on mean cycle time of SUSE QE Tools

Actions #19

Updated by ybonatakis 4 months ago

  • Status changed from In Progress to Workable
Actions #20

Updated by ybonatakis 4 months ago

  • Status changed from Workable to Feedback
Actions #21

Updated by okurz 4 months ago

  • Status changed from Feedback to Workable
Actions #22

Updated by ybonatakis 4 months ago

  • Status changed from Workable to In Progress

I found that the hostname was baremetal-support again.
So I set it again with hostnamectl, but hostname --fqdn still failed to return the fully qualified domain name.
As such, I edited /etc/hosts (and restarted systemd-hostnamed). With the changes in /etc/hosts the FQDN looked fine.
But I still don't see any updates in the dashboard.

I checked the logs (journalctl -fu grafana-server) but I don't know if they are relevant. I assume that /etc/hosts should be updated somewhere in one of the salt repos. What else should I check to make sure that the dashboard is properly configured?

Actions #23

Updated by okurz 4 months ago

ybonatakis wrote in #note-22:

I found that the hostname was baremetal-support again.
So I set it again with hostnamectl, but hostname --fqdn still failed to return the fully qualified domain name.
As such, I edited /etc/hosts (and restarted systemd-hostnamed). With the changes in /etc/hosts the FQDN looked fine.
But I still don't see any updates in the dashboard.

Please revert your changes to /etc/hosts. hostnamectl shows the problem: only the transient hostname is set to baremetal-support-prg2, while the static hostname is still baremetal-support. I suggest renaming the VM in OpenPlatform itself to baremetal-support-prg2. Maybe the static hostname is set from the VM name. Also, please ensure the change persists across reboots.

I checked the logs (journalctl -fu grafana-server) but I don't know if they are relevant.

Not relevant

I assume that /etc/hosts should be updated somewhere in one of the salt repos. What else should I check to make sure that the dashboard is properly configured?

Please read
https://gitlab.suse.de/openqa/salt-states-openqa/#setup-production-machine

/etc/salt/minion_id should be the hostname, right now it's the old FQDN.
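
The static/transient mismatch described in this note can be checked mechanically. The sketch below parses the text printed by hostnamectl and reports discrepancies; the sample text is an illustrative reconstruction of the situation described here, not output captured from the machine, and the helper names are hypothetical.

```python
def parse_hostnamectl(status_text):
    """Parse 'Key: value' lines of hostnamectl status output into a dict."""
    fields = {}
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip().lower()] = value.strip()
    return fields


def hostname_problems(status_text, expected_hostname):
    """Return human-readable problems with the static/transient hostnames."""
    fields = parse_hostnamectl(status_text)
    static = fields.get("static hostname")
    # hostnamectl only prints the transient name when it differs from the static one.
    transient = fields.get("transient hostname", static)
    problems = []
    if static != expected_hostname:
        problems.append(f"static hostname is {static!r}, expected {expected_hostname!r}")
    if transient != static:
        problems.append(f"transient hostname {transient!r} differs from static {static!r}")
    return problems


# Illustrative sample only, assumed from the description in this note.
sample = """\
 Static hostname: baremetal-support
Transient hostname: baremetal-support-prg2
"""
for problem in hostname_problems(sample, "baremetal-support-prg2"):
    print(problem)
```

Only the static hostname survives a reboot, so a check like this should pass after a reboot as well, not just right after running hostnamectl.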

Actions #25

Updated by okurz 4 months ago

  • Copied to action #175998: Multiple unaccepted salt keys on OSD added
Actions #26

Updated by ybonatakis 4 months ago

  • Status changed from In Progress to Feedback

After the changes in https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/5952 and after setting the hostname as described above, the dashboard still did not match the correct host value, so we performed the following steps manually.

  • ssh into baremetal-support-prg2.qe.prg2.suse.org

  • verify that hostnamectl reports the correct static hostname and that it persists after a reboot.

  • hostname --fqdn still did not return the correct fully qualified domain name, so we had to adjust /etc/hosts manually.

  • remove the /etc/salt/grains file. As baremetal-support does not act as a worker, this file apparently isn't needed.

  • edit /etc/salt/minion_id and set it to baremetal-support-prg2.qe.prg2.suse.org. With this, the dashboard queries get the correct host value, matching the hostname in the VM.

    • we noticed an ERROR in systemctl status salt-minion.service:
      The Salt Master has cached the public key for this node, this salt minion will wait for 10 seconds before attempting to re-authenticate

Now on openqa.suse.de:

  • list all the keys and remove the old key for baremetal-support-prg2
  • add the new one (at that point I noticed that the dashboard was already showing some graphs)
  • finally, to make sure that monitor is provisioned properly, re-apply the state: sudo salt "monitor" state.apply

To fulfill all the requirements of this ticket, the last action was to unsilence the silenced alert.
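
The steps on openqa.suse.de can be summarised as an ordered command list. This is a sketch under assumptions: the note does not record the exact commands apart from the final state.apply, so the salt-key invocations below (-L to list, -d to delete, -a to accept) are the standard ones rather than a transcript, and the old minion ID is left as a placeholder because the note does not spell it out.

```python
import shlex


def rekey_commands(old_minion_id, new_minion_id, monitor_target):
    """Return the commands to run on the salt master after renaming a minion."""
    return [
        ["salt-key", "-L"],                       # list all keys
        ["salt-key", "-d", old_minion_id],        # delete the stale key
        ["salt-key", "-a", new_minion_id],        # accept the key under the new ID
        ["salt", monitor_target, "state.apply"],  # re-provision the monitor host
    ]


commands = rekey_commands(
    old_minion_id="OLD_MINION_ID",  # placeholder, not stated in the note
    new_minion_id="baremetal-support-prg2.qe.prg2.suse.org",
    monitor_target="monitor",  # target as written in the note
)
for argv in commands:
    print("sudo " + shlex.join(argv))
```

The commands are only printed, not executed, so the list can be reviewed before anything touches the key store.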

Actions #27

Updated by ybonatakis 4 months ago

  • Status changed from Feedback to Resolved

I updated https://racktables.nue.suse.com/index.php?page=object&tab=default&object_id=28360.
I think with this the job is done here.

In brief:

  • VM has unique fqdn
  • workaround for unique nodename in salt removed
  • dashboard shows data again and silence is reverted
  • racktables is up to date
Actions #28

Updated by okurz 2 months ago

  • Due date deleted (2025-02-04)