action #120270

Conduct the migration of SUSE openQA systems IPMI from Nbg SRV1 to new security zones size:M

Added by okurz 3 months ago. Updated 17 days ago.

Status: Blocked
Priority: High
Assignee:
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Motivation

See parent #116623

Acceptance criteria

  • AC1: All IPMI interfaces of openQA machines in Nbg SRV1 are in new security zones
  • AC2: All IPMI interfaces of openQA machines in Nbg SRV1 are fully usable in production
  • AC3: All documentation referencing O3+OSD IPMI interfaces is up-to-date
  • AC4: Our automated tools using O3+OSD IPMI interfaces are up-to-date, e.g. GitLab pipelines and salt states

Suggestions

Open points

  1. Where is the documentation by SUSE-IT?
  2. Where is the git repo handling ssh keys?
  3. Fix the multi-second login time over ssh (workaround: use ssh -4)

Related issues

Blocked by openQA Infrastructure - action #120651: [openQA][infra][ipmi][worker][api] The expected pattern CMD_FINISHED-xxxxx returned but did not show up in serial log (wait_serial timed out) size:M (New, 2022-11-17)

Copied to openQA Tests - action #120288: [tools] cloud based tests fail due to traffic to cloud blocked auto_review:"2022-11-0.*Test died: (Waiting for Godot.*ssh|Cannot find image after upload)":retry (Resolved, 2022-11-10)

History

#2 Updated by okurz 3 months ago

  • Description updated (diff)

Preliminary instructions in https://suse.slack.com/archives/C0488BZNA5S/p1668011380114319

(Martin Caj) you do not need to be root there to do ssh / ipmi. Try this: ssh jumpy@qe-jumpy.suse.de
(Nick Singer) I'm able to log onto the machine. Let me try ipmi access to one of the migrated hosts
(Oliver Kurz) can you tell me the git repo where you manage the keys so that I can check myself and add an ed25519 key
(Oliver Kurz) it takes rather long to log in: time ssh jumpy@qe-jumpy.suse.de true takes 3.8s!
(Nick Singer) yes, I also noticed the long connection time. It might get annoying in the future and might even break automated pipelines or require special, dirty hacks to increase timeouts
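
A rough way to compare the login time with and without forcing IPv4 (the ssh -4 workaround from the open points) could look like the sketch below. The helper name elapsed_seconds is hypothetical. It only counts whole seconds, so it is coarser than the time call quoted in the chat.

```shell
# Hypothetical helper: print the wall-clock time (whole seconds) a command takes.
# Output of the command itself is discarded so that only the duration is printed.
elapsed_seconds() {
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    echo $((end - start))
}

# Intended use (needs access to the jump host, therefore left commented out):
#   elapsed_seconds ssh jumpy@qe-jumpy.suse.de true
#   elapsed_seconds ssh -4 jumpy@qe-jumpy.suse.de true
```

If the second call is clearly faster, that points to the address family as the source of the delay.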

#3 Updated by okurz 3 months ago

  • Description updated (diff)

#4 Updated by okurz 3 months ago

So far the conversion rule seems to be:

sed 's/ipmitool/ssh -4 jumpy@qe-jumpy.suse.de -- &/;s/-ipmi\.suse\.de/-ipmi.qe-ipmi-ur/' openqa/workerconf.sls 

This only covers hostnames ending in "-ipmi". For the others we will have to find out what mcaj comes up with :)
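
To make the rule concrete, here is a small sketch that applies the same sed expressions to standard input instead of openqa/workerconf.sls. The function name convert_ipmi_command and the sample input line are assumptions made up for illustration; they are not taken from the actual pillar file.

```shell
# Hypothetical helper: apply the conversion rule quoted above to stdin.
# Neither substitution uses the "g" flag, so only the first match per line changes.
convert_ipmi_command() {
    sed 's/ipmitool/ssh -4 jumpy@qe-jumpy.suse.de -- &/;s/-ipmi\.suse\.de/-ipmi.qe-ipmi-ur/'
}

# Illustrative input line (made up for this example):
printf '%s\n' 'ipmitool -I lanplus -H openqaworker4-ipmi.suse.de power status' | convert_ipmi_command
# prints: ssh -4 jumpy@qe-jumpy.suse.de -- ipmitool -I lanplus -H openqaworker4-ipmi.qe-ipmi-ur power status
```

Lines that contain neither ipmitool nor a hostname ending in -ipmi.suse.de pass through unchanged.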

#5 Updated by okurz 3 months ago

  • Copied to action #120288: [tools] cloud based tests fail due to traffic to cloud blocked auto_review:"2022-11-0.*Test died: (Waiting for Godot.*ssh|Cannot find image after upload)":retry added

#6 Updated by cdywan 3 months ago

  • Subject changed from Conduct the migration of SUSE openQA systems IPMI from Nbg SRV1 to new security zones to Conduct the migration of SUSE openQA systems IPMI from Nbg SRV1 to new security zones size:M
  • Description updated (diff)
  • Status changed from New to Workable

#8 Updated by mkittler 2 months ago

  • Blocked by action #120651: [openQA][infra][ipmi][worker][api] The expected pattern CMD_FINISHED-xxxxx returned but did not show up in serial log (wait_serial timed out) size:M added

#9 Updated by mkittler 2 months ago

  • Status changed from Workable to Blocked

We should not move grenache-1 to the new security zone as it is the last machine that can conduct tests being broken on worker2 in the new security zone (see #120270).

#10 Updated by okurz about 2 months ago

  • Project changed from SUSE QA to openQA Infrastructure
  • Assignee set to mkittler

mkittler wrote:

We should not move grenache-1 to the new security zone as it is the last machine that can conduct tests being broken on worker2 in the new security zone (see #120270).

I guess you meant to block on #120261, right? Also please only use "Blocked" with assignee.

#11 Updated by cdywan 26 days ago

  • Status changed from Blocked to Workable

okurz wrote:

I guess you meant to block on #120261, right? Also please only use "Blocked" with assignee.

Looks like this is unblocked now

#12 Updated by mkittler 24 days ago

  • Status changed from Workable to In Progress

True. Tests using the IPMI backend are impaired (https://progress.opensuse.org/issues/120651) but that's not due to the IPMI interface itself.

I was told that AC1 and AC2 are now implemented so I'm checking what's left for AC3 and AC4.

#13 Updated by mkittler 24 days ago

I ran test-ipmi-access to verify AC3 (with updated regex, see https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/477). Some ipmi commands using jumpy@qe-jumpy still fail. I logged in on the jump host and ran the failing commands for relevant hosts (those in SRV1) manually:

jumpy@qe-jumpy:~> ipmitool -I lanplus -C 3 -H openqa-aarch64-ipmi.qe-ipmi-ur -U … -P …
Address lookup for openqa-aarch64-ipmi.qe-ipmi-ur failed
Could not open socket!
Error: Unable to establish IPMI v2 / RMCP+ session
jumpy@qe-jumpy:~> ipmitool -I lanplus -C 3 -H openqaworker-power8-ipmi.qe-ipmi-ur -U … -P …
Address lookup for openqaworker-power8-ipmi.qe-ipmi-ur failed
Could not open socket!
Error: Unable to establish IPMI v2 / RMCP+ session
jumpy@qe-jumpy:~> ipmitool -I lanplus -C 3 -H rebel-ipmi.qe-ipmi-ur -U … -P …
Address lookup for rebel-ipmi.qe-ipmi-ur failed
Could not open socket!
Error: Unable to establish IPMI v2 / RMCP+ session

Note that not all such hosts are broken (e.g. jumpy@qe-jumpy:~> ipmitool -I lanplus -C 3 -H openqaworker4-ipmi.qe-ipmi-ur … works fine).

Looks like the IPs for the hosts are hardcoded in /etc/hosts on jumpy@qe-jumpy. The failing hosts are missing from that list, so presumably someone needs to update it. I cannot do it because I don't know the IPs, so I've been asking in the chat.
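
A check like the following could list which names are missing from a hosts file. This is only a sketch: the function name missing_from_hosts is hypothetical, and it assumes the usual hosts format of an address followed by one or more names, with # starting a comment.

```shell
# Hypothetical helper: print every given hostname that has no entry in the hosts file.
# Usage: missing_from_hosts HOSTS_FILE NAME...
missing_from_hosts() {
    hosts_file=$1
    shift
    for name in "$@"; do
        # Strip comments, then compare the name against every field after the address.
        if ! awk -v n="$name" '
            { sub(/#.*/, "") }
            { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
            END { exit !found }' "$hosts_file"; then
            echo "$name"
        fi
    done
}

# Intended use on the jump host (left commented out):
#   missing_from_hosts /etc/hosts openqa-aarch64-ipmi.qe-ipmi-ur \
#       openqaworker-power8-ipmi.qe-ipmi-ur rebel-ipmi.qe-ipmi-ur
```

The function prints nothing once every name has an entry, so the same call can be rerun after the list is updated.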

#14 Updated by mkittler 24 days ago

Looks like https://gitlab.suse.de/openqa/grafana-webhook-actions/-/blob/master/.gitlab-ci.yml is only about workers not in SRV1.

https://gitlab.suse.de/openqa/monitor-o3/-/blob/master/.gitlab-ci.yml needs to be adjusted, but first the issues mentioned in my previous comment need to be resolved. To be able to use the jump host, the script also needs minor changes.

#15 Updated by openqa_review 24 days ago

  • Due date set to 2023-01-26

Setting due date based on mean cycle time of SUSE QE Tools

#16 Updated by mkittler 23 days ago

#20 Updated by mkittler 22 days ago

  • Private changed from Yes to No

#21 Updated by mkittler 22 days ago

  • Status changed from In Progress to Feedback

#23 Updated by okurz 18 days ago

  • Due date deleted (2023-01-26)
  • Status changed from Feedback to Blocked

#24 Updated by okurz 18 days ago

  • Status changed from Blocked to Feedback

But the GitLab CI pipelines seem to have problems reaching the jumpy host, see e.g. https://gitlab.suse.de/openqa/monitor-o3/-/jobs/1346163#L37

#26 Updated by mkittler 17 days ago

  • Status changed from Feedback to Blocked

The SR has been merged and it works. So this leaves me waiting for https://sd.suse.com/servicedesk/customer/portal/1/SD-109299.
