action #120267
closed coordination #121720: [saga][epic] QE setup in PRG2+NUE3
coordination #116623: [epic] Migration of SUSE openQA+QA+QAM systems to new security zones
Conduct the migration of openqa-ses aka. "storage.qa.suse.de" size:M
Description
Motivation
See parent #116623
Acceptance criteria
- AC1: openqa-ses is either migrated to the new network zone or decommissioned
Suggestions
- Coordinate the move between SUSE-IT and the machine owners in Slack #discuss-qe-new-security-zones
- Ensure Racktables is up to date
Updated by okurz about 1 year ago
- Copied from action #120264: Conduct the migration of SUSE QA systems (non-tools-team maintained) from Nbg SRV1 to new security zones size:M added
Updated by okurz about 1 year ago
- Status changed from New to Blocked
Regarding "openqa-ses" chat was in https://suse.slack.com/archives/C02CANHLANP/p1667993899105129:
(Oliver Kurz) ok. I wrote to qa-team@suse.de. If I receive no response then I think this is a topic for QE mgmt; it shouldn't happen that we have machines which nobody wants to know about :slightly_smiling_face:
(Matthias Griessmeier) I agree. Ses is storage; I don't know/remember why it has osd-admins as contact, probably because openqa is in the name. I'd say if the machine is not pingable, nor reachable over IPMI, and no one responds, let's unmount it and move it to cold storage. This seems to be a historical leftover. […] Gerhard is aware and will disconnect it next time he is in SRV1. I cannot even find it in https://gitlab.suse.de/OPS-Service/salt/-/blob/production/pillar/domain/suse_de.sls
That's probably the reason why DNS does not resolve, so I think it is safe to remove it. To make it official, I created https://sd.suse.com/servicedesk/customer/portal/1/SD-103880
I also asked Lazaros Haleplidis in https://suse.slack.com/archives/C0488BZNA5S/p1668081298115499:
(Oliver Kurz) also I have a question about machines like openqa-ses.suse.de https://racktables.nue.suse.com/index.php?page=object&object_id=13558 . Nobody seems to know the machine and it's not reachable. But Racktables lists switch port connections, so by powering machines off/on or disconnecting/connecting the switch ports, you and others from EngInfra could identify the MAC addresses and just go ahead with the migration; nobody from our side can contribute more.
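For reference, the reachability checks mentioned above could look like the following sketch. The IPMI hostname and the credential variables are hypothetical placeholders, not values from this ticket:

```bash
#!/bin/bash
# Reachability checks along the lines discussed above. The IPMI hostname and
# the credential variables are hypothetical, not confirmed values.
host=openqa-ses.suse.de
ipmi_host=openqa-ses-ipmi.suse.de  # assumption: a *-ipmi naming scheme

# Basic network reachability and DNS resolution (the chat suggests the DNS
# record may already be gone):
ping -c 3 -W 2 "$host" || echo "$host is not pingable"
dig +short "$host"

# IPMI reachability, assuming lanplus and credentials in the environment:
ipmitool -I lanplus -H "$ipmi_host" -U "$IPMI_USER" -P "$IPMI_PASSWORD" power status \
  || echo "$ipmi_host is not reachable over IPMI"
```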
Updated by okurz about 1 year ago
- Status changed from Blocked to Feedback
No response in Slack. I asked mgriessmeier to share access to https://sd.suse.com/servicedesk/customer/portal/1/SD-103880. The machine is still listed in Racktables, untouched.
Updated by okurz about 1 year ago
- Status changed from Feedback to Blocked
OSD-Admins is now a participant on https://sd.suse.com/servicedesk/customer/portal/1/SD-103880, so we can track this there.
Updated by okurz 12 months ago
- Related to action #121282: Recover storage.qa.suse.de size:S added
Updated by okurz 12 months ago
- Subject changed from Conduct the migration/decommissioning of openqa-ses to Conduct the migration of openqa-ses aka. "storage.qa.suse.de"
- Category set to Infrastructure
- Status changed from Feedback to Blocked
Blocked by #121282. After that we need to ensure that the system is actually migrated to the new network security zone(s).
Updated by mkittler 11 months ago
- Related to action #120270: Conduct the migration of SUSE openQA systems IPMI from Nbg SRV1 to new security zones size:M added
Updated by okurz 11 months ago
- Status changed from Feedback to Blocked
We can declare this blocked by https://sd.suse.com/servicedesk/customer/portal/1/SD-109299.
Updated by okurz 10 months ago
I commented in https://sd.suse.com/servicedesk/customer/portal/1/SD-109299:
Currently we use storage.qa.suse.de solely within the scope of openQA, so please move it into .oqa.suse.de, not .qe.suse.de.
Updated by mkittler 9 months ago
- Status changed from Feedback to Blocked
The VPN is online again. However, it doesn't look like the host has actually been migrated, so I've left a comment on https://sd.suse.com/servicedesk/customer/portal/1/SD-109299. (It is not possible to re-open the ticket; I've mentioned it in the chat to get some attention.)
Updated by okurz 9 months ago
- Status changed from Blocked to In Progress
mkittler to check the latest changes after mcaj mentioned that the DNS change for the device storage.qa.suse.de (https://racktables.nue.suse.com/index.php?page=object&tab=default&object_id=13558) is in https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/3232/, already merged. Maybe we need to reboot or check the host's status.
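A minimal sketch for checking the merged DNS change and the host's status; SSH access to the host is an assumption:

```bash
#!/bin/bash
# The old name should stop resolving while the new name should resolve:
dig +short storage.qa.suse.de
dig +short storage.oqa.suse.de

# If the new name resolves, verify the host is actually up:
ping -c 3 storage.oqa.suse.de && ssh storage.oqa.suse.de uptime
```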
Updated by mkittler 9 months ago
- Status changed from In Progress to Blocked
Still blocked by https://sd.suse.com/servicedesk/customer/portal/1/SD-109299.
Updated by mkittler 9 months ago
I deleted the old host storage.qa.suse.de and added storage.oqa.suse.de instead. This fixed the alert and the host seems to be almost properly in salt again. There are no failing systemd services. The only problem I found so far is telegraf:
Feb 28 16:07:37 storage telegraf[14403]: 2023-02-28T15:07:37Z W! [outputs.influxdb] When writing to [http://openqa-monitor.qa.suse.de:8086]: database "telegraf" creation failed: Post "http://openqa-monitor.qa.suse.de:8086/query": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Feb 28 16:07:52 storage telegraf[14403]: 2023-02-28T15:07:52Z E! [outputs.influxdb] When writing to [http://openqa-monitor.qa.suse.de:8086]: failed doing req: Post "http://openqa-monitor.qa.suse.de:8086/write?db=telegraf": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Feb 28 16:07:52 storage telegraf[14403]: 2023-02-28T15:07:52Z E! [agent] Error writing to outputs.influxdb: could not write any address
Note that the host is generally reachable, and e.g. curl --verbose -X POST http://openqa-monitor.qa.suse.de:8086/write?db=telegraf returns fast (with a "204 No Content" response).
I couldn't find anything useful in the InfluxDB logs.
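Steps matching the symptoms above (telegraf runs into a client timeout even though a manual write succeeds) could look like this sketch, assuming SSH access to the host and the standard telegraf unit; the dummy measurement name is illustrative:

```bash
#!/bin/bash
# Reproduce the write with an explicit timeout and a dummy data point:
time curl --verbose --max-time 5 -X POST \
  --data-binary 'debug_measurement value=1' \
  'http://openqa-monitor.qa.suse.de:8086/write?db=telegraf'

# Check telegraf's recent view of the problem:
journalctl -u telegraf --since '1 hour ago' | grep -i influxdb

# Restart telegraf in case it still holds a stale connection from before
# the migration:
sudo systemctl restart telegraf
```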
Updated by okurz 9 months ago
Deployed https://gitlab.suse.de/qa-sle/qanet-configs/-/merge_requests/51 for the updated DNS entry.
Updated by openqa_review 9 months ago
- Due date set to 2023-03-15
Setting due date based on mean cycle time of SUSE QE Tools
Updated by mkittler 9 months ago
The InfluxDB issue is gone (the error is not logged anymore and data is visible on https://stats.openqa-monitor.qa.suse.de/d/GDstorage/dashboard-for-storage). I don't know what has changed, but I suppose it is good enough that it works now.
Looks like I still need to replace some references to the old domain (a sketch for double-checking leftovers follows the list):
- https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/801
- https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/497
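One way to find remaining references to the old domain; the paths assume local checkouts of the two repositories linked above:

```bash
#!/bin/bash
for repo in salt-states-openqa salt-pillars-openqa; do
  grep -rn 'storage\.qa\.suse\.de' "$repo" || echo "no hits in $repo"
done
```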
Annoyingly, the host-up alert (and possibly other alerts for storage) still used the old domain, even after applying salt states. For now I have changed the domain of the host-up alert by editing it manually; I could simply save the changes on that page without being prompted to copy-and-paste JSON.
All of this means that our alerting is currently not covered by the JSON files in salt, and generic alert-related code (e.g. "value": "{{ host_interface }}" in generic.json.template) is not effective¹.
¹ The old alert config is actually still visibly stored: e.g. if I save the JSON of that dashboard I get "value": "storage.oqa.suse.de". However, it has no effect on the migrated alert.
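Since the stored dashboard JSON can apparently diverge from what is effective, one could compare what Grafana actually stores via its HTTP API. A sketch, where $GRAFANA_TOKEN is an assumed API token and GDstorage is the dashboard uid taken from the URL above:

```bash
#!/bin/bash
# Dump the stored dashboard JSON and count occurrences of either domain:
curl -s -H "Authorization: Bearer $GRAFANA_TOKEN" \
  'https://stats.openqa-monitor.qa.suse.de/api/dashboards/uid/GDstorage' \
  | grep -oE 'storage\.(qa|oqa)\.suse\.de' | sort | uniq -c
```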
Updated by livdywan 9 months ago
- Status changed from Resolved to Feedback
I'm seeing alerts because salt-states-openqa is failing like so, hence re-opening:
ID: /root/.ssh/id_ed25519.backup_osd
Function: file.managed
Result: False
Comment: Pillar id_ed25519.backup_osd does not exist
Started: 13:25:13.881664
Duration: 2.935 ms
Changes:
And this looks to be the fix for it: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/498
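After that MR is merged, one could verify on the salt master that the pillar key resolves and the state applies cleanly. A sketch; the 'storage*' minion glob is an assumption:

```bash
#!/bin/bash
# Refresh pillar data, check the previously missing key, and do a dry run:
sudo salt 'storage*' saltutil.refresh_pillar
sudo salt 'storage*' pillar.item 'id_ed25519.backup_osd'
sudo salt 'storage*' state.apply test=True
```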