action #120267

closed

coordination #121720: [saga][epic] Migration to QE setup in PRG2+NUE3 while ensuring availability

coordination #116623: [epic] Migration of SUSE Nbg based openQA+QA+QAM systems to new security zones

Conduct the migration of openqa-ses aka. "storage.qa.suse.de" size:M

Added by okurz over 1 year ago. Updated 9 months ago.

Status: Resolved
Priority: Normal
Assignee:
Target version:
Start date: 2022-09-15
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Motivation

See parent #116623

Acceptance criteria

  • AC1: openqa-ses is either migrated to the new network zone or decommissioned

Suggestions


Related issues 2 (0 open, 2 closed)

Related to openQA Infrastructure - action #121282: Recover storage.qa.suse.de size:S (Resolved, nicksinger, 2022-12-01)

Related to openQA Infrastructure - action #120270: Conduct the migration of SUSE openQA systems IPMI from Nbg SRV1 to new security zones size:M (Resolved, mkittler)

Actions #1

Updated by okurz over 1 year ago

  • Copied from action #120264: Conduct the migration of SUSE QA systems (non-tools-team maintained) from Nbg SRV1 to new security zones size:M added
Actions #2

Updated by okurz over 1 year ago

  • Status changed from New to Blocked

Regarding "openqa-ses", the chat discussion was in https://suse.slack.com/archives/C02CANHLANP/p1667993899105129:

(Oliver Kurz) ok. I wrote to qa-team@suse.de. If I receive no response then I think this is a topic for QE mgmt, shouldn't happen that we have machines which nobody wants to know about :slightly_smiling_face:
(Matthias Griessmeier) I agree. Ses is storage, I don't know/remember why it has osd-admins as contact. Probably because openqa is in the name. I'd say if the machine is not pingable, nor reachable over ipmi and no one responses, let's unmount it and move it to cold storage. This seems to be historical leftover. […] Gerhard is aware and next time he is in srv1. He will disconnect it. I cannot even find it in https://gitlab.suse.de/OPS-Service/salt/-/blob/production/pillar/domain/suse_de.sls
that's probably the reason why dns does not resolve - so I think it is safe to remove it. to make it official, created https://sd.suse.com/servicedesk/customer/portal/1/SD-103880

Also I asked Lazaros Haleplidis in https://suse.slack.com/archives/C0488BZNA5S/p1668081298115499:

(Lazaros Haleplidis) also I have a question about machines like openqa-ses.suse.de https://racktables.nue.suse.com/index.php?page=object&object_id=13558 . nobody seems to know the machine and it's not reachable. But racktable lists switch port connections so by powering off/on machines or disconnecting/connecting the switch ports you and others from EngInfra could identify the MAC addresses and just go ahead with the migration and nobody from our side can contribute more.

Actions #3

Updated by okurz over 1 year ago

  • Parent task changed from #116623 to #120264
Actions #4

Updated by okurz over 1 year ago

  • Status changed from Blocked to Feedback

No response in Slack. I asked mgriessmeier to share access to https://sd.suse.com/servicedesk/customer/portal/1/SD-103880. The machine is still listed in racktables, untouched.

Actions #5

Updated by okurz over 1 year ago

  • Status changed from Feedback to Blocked

OSD-Admins is now a participant on https://sd.suse.com/servicedesk/customer/portal/1/SD-103880, so we can track this there.

Actions #6

Updated by okurz over 1 year ago

  • Status changed from Blocked to Feedback

The SD ticket was resolved with: "Has been unmounted and placed in the cold storage." Racktables wasn't updated yet. I need to check with mgriessmeier whether he created a new cold-storage entry at the FC location.

Actions #7

Updated by okurz over 1 year ago

Actions #8

Updated by okurz over 1 year ago

  • Subject changed from Conduct the migration/decommissioning of openqa-ses to Conduct the migration of openqa-ses aka. "storage.qa.suse.de"
  • Category set to Infrastructure
  • Status changed from Feedback to Blocked

Blocked by #121282. After that we need to ensure that the system is actually migrated to the new network security zone(s).

Actions #9

Updated by okurz over 1 year ago

  • Status changed from Blocked to New
  • Assignee deleted (okurz)

With #121282 resolved we can now unblock and continue here.

Actions #10

Updated by okurz over 1 year ago

  • Parent task changed from #120264 to #116623
Actions #11

Updated by livdywan over 1 year ago

  • Subject changed from Conduct the migration of openqa-ses aka. "storage.qa.suse.de" to Conduct the migration of openqa-ses aka. "storage.qa.suse.de" size:M
  • Status changed from New to Workable
Actions #12

Updated by mkittler over 1 year ago

  • Assignee set to mkittler
Actions #13

Updated by mkittler over 1 year ago

  • Related to action #120270: Conduct the migration of SUSE openQA systems IPMI from Nbg SRV1 to new security zones size:M added
Actions #15

Updated by mkittler over 1 year ago

  • Status changed from Workable to Feedback
Actions #16

Updated by okurz over 1 year ago

  • Status changed from Feedback to Blocked
Actions #17

Updated by mkittler about 1 year ago

The IPMI interface has been moved (hopefully that was actually wanted; this ticket makes no mention of it, but the IPMI interface is supposedly covered by #120270). Now we only need to wait for the migration of "storage.qa.suse.de" itself.
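
For reference, reachability of a relocated IPMI interface can be checked with a plain BMC query over the lanplus interface; the address and credentials below are placeholders, not the actual ones for this host:

    # hypothetical check, run from a machine inside the new security zone
    ipmitool -I lanplus -H <new-ipmi-address> -U <user> -P <password> chassis power status
    ipmitool -I lanplus -H <new-ipmi-address> -U <user> -P <password> lan print 1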

Actions #18

Updated by okurz about 1 year ago

I commented in https://sd.suse.com/servicedesk/customer/portal/1/SD-109299:

Currently we use storage.qa.suse.de solely within the scope of openQA so please move it into .oqa.suse.de., not .qe.suse.de.

Actions #19

Updated by mkittler about 1 year ago

  • Status changed from Blocked to Feedback

The migration is supposedly concluded. I cannot log in on qe-jumpy.suse.de to verify it at the moment, though. On storage.qa.suse.de, hostname --fqdn only returns "hostname: Name or service not known".
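
For context, hostname --fqdn fails with that error when the host cannot resolve its own name through /etc/hosts or DNS; a rough way to narrow this down once login works again (generic commands, nothing specific to this setup):

    # on storage.qa.suse.de itself
    hostname                      # short host name, should still work
    hostname --fqdn               # fails if the name cannot be resolved to an FQDN
    getent hosts "$(hostname)"    # what /etc/hosts and DNS (via NSS) return for the name
    cat /etc/resolv.conf          # which resolvers and search domains are in use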

Actions #20

Updated by mkittler about 1 year ago

I would have checked this again today, but the VPN is offline.

Actions #21

Updated by mkittler about 1 year ago

  • Status changed from Feedback to Blocked

The VPN is online again. However, it doesn't look like the host has actually been migrated. So I've left a comment on https://sd.suse.com/servicedesk/customer/portal/1/SD-109299. (It is not possible to re-open the ticket. I've mentioned it in the chat to get some attention.)

Actions #22

Updated by okurz about 1 year ago

  • Status changed from Blocked to In Progress

mkittler to check the latest changes after mcaj mentioned:

the DNS change for the device storage.qa.suse.de - https://racktables.nue.suse.com/index.php?page=object&tab=default&object_id=13558
is here: https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/3232/

The merge request is already merged. Maybe we need to reboot or check the host's status.
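
A sketch of how the merged DNS change could be verified from a VPN-connected machine; the record names come from this ticket, everything else is generic:

    dig +short storage.oqa.suse.de            # new record from the merged MR
    dig +short storage.qa.suse.de             # old record, expected to disappear
    dig +short -x <IP-from-the-first-query>   # reverse lookup should point into the new zone
    ping -c 1 storage.oqa.suse.de             # basic reachability over the VPN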

Actions #23

Updated by mkittler about 1 year ago

  • Status changed from In Progress to Blocked
Actions #24

Updated by mkittler about 1 year ago

  • Status changed from Blocked to In Progress

The host has been migrated and I've updated racktables. Now I only need to check our salt/alerting (as currently the host-up alert for the old domain is firing).
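
Checking salt after such a rename usually boils down to swapping the minion key and re-applying the state; a sketch, assuming the minion id simply follows the FQDN:

    # on the salt master; minion ids assumed to match the FQDNs
    salt-key -L                       # list accepted and pending keys
    salt-key -d storage.qa.suse.de    # drop the key registered under the old name
    salt-key -a storage.oqa.suse.de   # accept the key the renamed minion submits
    salt 'storage.oqa.suse.de' test.ping
    salt 'storage.oqa.suse.de' state.apply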

Actions #25

Updated by mkittler about 1 year ago

I deleted the old host storage.qa.suse.de and added storage.oqa.suse.de instead. This fixed the alert, and the host seems to be properly back in salt again for the most part. There are no failing systemd services. The only problem I found so far is telegraf:

Feb 28 16:07:37 storage telegraf[14403]: 2023-02-28T15:07:37Z W! [outputs.influxdb] When writing to [http://openqa-monitor.qa.suse.de:8086]: database "telegraf" creation failed: Post "http://openqa-monitor.qa.suse.de:8086/query": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Feb 28 16:07:52 storage telegraf[14403]: 2023-02-28T15:07:52Z E! [outputs.influxdb] When writing to [http://openqa-monitor.qa.suse.de:8086]: failed doing req: Post "http://openqa-monitor.qa.suse.de:8086/write?db=telegraf": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Feb 28 16:07:52 storage telegraf[14403]: 2023-02-28T15:07:52Z E! [agent] Error writing to outputs.influxdb: could not write any address

Note that the host is generally reachable and e.g. curl --verbose -X POST http://openqa-monitor.qa.suse.de:8086/write?db=telegraf returns fast (with a "204 No Content" response).

I couldn't find anything useful in the InfluxDB logs.
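
The timeouts above can be reproduced (or ruled out) independently of telegraf by hitting the same two endpoints directly; a minimal check, assuming the InfluxDB 1.x HTTP API shown in the log and a made-up measurement name:

    # hypothetical test point written into the telegraf database
    curl -i -X POST 'http://openqa-monitor.qa.suse.de:8086/write?db=telegraf' \
      --data-binary 'migration_check,host=storage value=1'
    # the query endpoint that timed out during database creation
    curl -i -X POST 'http://openqa-monitor.qa.suse.de:8086/query' \
      --data-urlencode 'q=SHOW DATABASES'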

Actions #26

Updated by okurz about 1 year ago

Actions #27

Updated by openqa_review about 1 year ago

  • Due date set to 2023-03-15

Setting due date based on mean cycle time of SUSE QE Tools

Actions #28

Updated by mkittler about 1 year ago

The InfluxDB issue is gone (error not logged anymore and data is visible on https://stats.openqa-monitor.qa.suse.de/d/GDstorage/dashboard-for-storage). I don't know what has changed but I suppose it is good enough that it works now.


Looks like I still need to replace some references to the old domain.

Annoyingly, the host-up alert (and possibly other alerts for storage) still used the old domain, even after applying salt states. For now I have changed the domain of the host-up alert by editing the alert manually. I could just save the changes on this page without being prompted to copy & paste JSON.

All of this means that our alerting is currently not covered by the JSON files in salt and generic alert-related code (e.g. "value": "{{ host_interface }}" in generic.json.template is not effective¹).

¹ The old alert config is actually still visibly stored - e.g. if I save the JSON of that dashboard I get "value": "storage.oqa.suse.de". However, it has no effect on the migrated alert.
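
To find the remaining references to the old domain, a plain text search over the salt repositories should suffice; a sketch, assuming local checkouts of salt-states-openqa and salt-pillars-openqa:

    # hypothetical checkout paths
    grep -rn 'storage\.qa\.suse\.de' salt-states-openqa/ salt-pillars-openqa/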

Actions #30

Updated by mkittler about 1 year ago

  • Status changed from In Progress to Resolved

I've also edited the other alerts manually (actually just the ping alert used the full domain name and had to be changed). So this should cover everything.

Actions #31

Updated by livdywan about 1 year ago

  • Status changed from Resolved to Feedback

I'm seeing alerts because salt-states-openqa is failing like so, hence re-opening:

         ID: /root/.ssh/id_ed25519.backup_osd
   Function: file.managed
     Result: False
    Comment: Pillar id_ed25519.backup_osd does not exist
    Started: 13:25:13.881664
   Duration: 2.935 ms
    Changes:  

And this looks to be the fix for it: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/498
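
Once that merge request is in, the missing pillar can be double-checked from the salt master before re-running the pipeline; a sketch with a wildcard target, since the failing minion isn't named in the output above:

    # on the salt master; '*' is a placeholder, narrow it to the affected minion
    salt '*' saltutil.refresh_pillar
    salt '*' pillar.get id_ed25519.backup_osd   # must no longer come back empty
    salt '*' state.apply test=True              # dry-run before the pipeline re-applies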

Actions #32

Updated by mkittler about 1 year ago

  • Status changed from Feedback to Resolved

Yes, and the pipeline has already passed.

Actions #33

Updated by okurz about 1 year ago

  • Due date deleted (2023-03-15)