action #120270

closed

Conduct the migration of SUSE openQA systems IPMI from Nbg SRV1 to new security zones size:M

Added by okurz about 2 years ago. Updated almost 2 years ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Motivation

See parent #116623

Acceptance criteria

  • AC1: All IPMI interfaces of openQA machines in Nbg SRV1 are in new security zones
  • AC2: All IPMI interfaces of openQA machines in Nbg SRV1 are fully usable in production
  • AC3: All documentation referencing O3+OSD IPMI interfaces is up to date
  • AC4: Our automated tools using O3+OSD IPMI interfaces are up to date, e.g. GitLab pipelines and salt states

Suggestions

Open points

  1. Where is the documentation by SUSE-IT?
  2. Where is the git repo handling ssh keys?
  3. Fix the multi-second login time over ssh (workaround: use ssh -4)

Related issues 3 (1 open, 2 closed)

Related to openQA Infrastructure - action #120651: [openQA][infra][ipmi][worker][api] The expected pattern CMD_FINISHED-xxxxx returned but did not show up in serial log (wait_serial timed out) size:M (New, 2022-11-17)
Copied to openQA Tests - action #120288: [tools] cloud based tests fail due to traffic to cloud blocked auto_review:"2022-11-0.*Test died: (Waiting for Godot.*ssh|Cannot find image after upload)":retry (Resolved, okurz, 2022-11-10)
Copied to openQA Infrastructure - action #124119: Conduct the migration of remaining SUSE openQA systems IPMI to new security zones (Resolved, okurz, 2023-02-08)

Actions #2

Updated by okurz about 2 years ago

  • Description updated (diff)

Preliminary instructions in https://suse.slack.com/archives/C0488BZNA5S/p1668011380114319

(Martin Caj) you do not need to be root there to do ssh / ipmi. try this: ssh jumpy@qe-jumpy.suse.de
(Nick Singer) I'm able to log onto the machine. Let me try ipmi access to one of the migrated hosts
(Oliver Kurz) can you tell me the git repo where you manage the keys so that I can check myself and add an ed25519 key
(Oliver Kurz) it takes rather long to login: time ssh jumpy@qe-jumpy.suse.de true takes 3.8s!
(Nick Singer) yes, I also noticed the long connection time. might get annoying in the future and might even break automated pipelines or require special, dirty hacks to increase timeouts
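
To put numbers on the login delay discussed above, a small timing helper can compare the default connection with a forced-IPv4 one. This is only a sketch: `elapsed_ms` is a hypothetical helper name, it relies on GNU date's `%N` format (so it assumes Linux), and the ssh calls are left commented out because they need network access and an accepted key on the jump host.

```shell
# Print the wall-clock time of a command in milliseconds.
# Assumes GNU date (the %N nanosecond format is not portable).
elapsed_ms() {
    local start end
    start=$(date +%s%N)
    "$@" >/dev/null 2>&1
    end=$(date +%s%N)
    echo $(( (end - start) / 1000000 ))
}

# Compare default address selection with forced IPv4:
# echo "default: $(elapsed_ms ssh jumpy@qe-jumpy.suse.de true) ms"
# echo "IPv4:    $(elapsed_ms ssh -4 jumpy@qe-jumpy.suse.de true) ms"
```

If the `-4` run is consistently faster, that would support the workaround listed under the open points.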

Actions #3

Updated by okurz about 2 years ago

  • Description updated (diff)
Actions #4

Updated by okurz about 2 years ago

So far the conversion rule seems to be:

sed 's/ipmitool/ssh -4 jumpy@qe-jumpy.suse.de -- &/;s/-ipmi\.suse\.de/-ipmi.qe-ipmi-ur/' openqa/workerconf.sls 

only covering hostnames ending in "-ipmi". For others we will have to find out what mcaj will think of :)
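
As an illustration of that rule, the same two substitutions can be run against sample lines. The input lines below are made up for the example (they are not taken from workerconf.sls), `USER` and `PASS` are placeholders, and `convert_ipmi_line` is a hypothetical wrapper name.

```shell
# The two substitutions from the sed command above, wrapped in a function
# that reads lines on stdin.
convert_ipmi_line() {
    sed 's/ipmitool/ssh -4 jumpy@qe-jumpy.suse.de -- &/;s/-ipmi\.suse\.de/-ipmi.qe-ipmi-ur/'
}

# Hypothetical sample lines; USER and PASS are placeholders, not real values.
printf '%s\n' \
    'ipmitool -I lanplus -H openqaworker4-ipmi.suse.de -U USER -P PASS' \
    'ipmitool -I lanplus -H sp.openqaw5-xen.qa.suse.de -U USER -P PASS' |
    convert_ipmi_line
```

The first line gets both the ssh prefix and the new hostname. The second only gets the ssh prefix, because its hostname does not end in "-ipmi.suse.de"; this is the gap mentioned above.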

Actions #5

Updated by okurz about 2 years ago

  • Copied to action #120288: [tools] cloud based tests fail due to traffic to cloud blocked auto_review:"2022-11-0.*Test died: (Waiting for Godot.*ssh|Cannot find image after upload)":retry added
Actions #6

Updated by livdywan about 2 years ago

  • Subject changed from Conduct the migration of SUSE openQA systems IPMI from Nbg SRV1 to new security zones to Conduct the migration of SUSE openQA systems IPMI from Nbg SRV1 to new security zones size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #8

Updated by mkittler about 2 years ago

  • Blocked by action #120651: [openQA][infra][ipmi][worker][api] The expected pattern CMD_FINISHED-xxxxx returned but did not show up in serial log (wait_serial timed out) size:M added
Actions #9

Updated by mkittler about 2 years ago

  • Status changed from Workable to Blocked

We should not move grenache-1 to the new security zone as it is the last machine that can conduct tests being broken on worker2 in the new security zone (see #120270).

Actions #10

Updated by okurz almost 2 years ago

  • Project changed from 46 to openQA Infrastructure
  • Assignee set to mkittler

mkittler wrote:

We should not move grenache-1 to the new security zone as it is the last machine that can conduct tests being broken on worker2 in the new security zone (see #120270).

I guess you meant to block on #120261, right? Also please only use "Blocked" with assignee.

Actions #11

Updated by livdywan almost 2 years ago

  • Status changed from Blocked to Workable

okurz wrote:

I guess you meant to block on #120261, right? Also please only use "Blocked" with assignee.

Looks like this is unblocked now

Actions #12

Updated by mkittler almost 2 years ago

  • Status changed from Workable to In Progress

True. Tests using the IPMI backend are impaired (https://progress.opensuse.org/issues/120651) but that's not due to the IPMI interface itself.

I was told that AC1 and AC2 are now implemented so I'm checking what's left for AC3 and AC4.

Actions #13

Updated by mkittler almost 2 years ago

I ran test-ipmi-access to verify AC3 (with updated regex, see https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/477). Some ipmi commands using jumpy@qe-jumpy still fail. I logged in on the jump host and ran the failing commands for relevant hosts (those in SRV1) manually:

jumpy@qe-jumpy:~> ipmitool -I lanplus -C 3 -H openqa-aarch64-ipmi.qe-ipmi-ur -U … -P …
Address lookup for openqa-aarch64-ipmi.qe-ipmi-ur failed
Could not open socket!
Error: Unable to establish IPMI v2 / RMCP+ session
jumpy@qe-jumpy:~> ipmitool -I lanplus -C 3 -H openqaworker-power8-ipmi.qe-ipmi-ur -U … -P …
Address lookup for openqaworker-power8-ipmi.qe-ipmi-ur failed
Could not open socket!
Error: Unable to establish IPMI v2 / RMCP+ session
jumpy@qe-jumpy:~> ipmitool -I lanplus -C 3 -H rebel-ipmi.qe-ipmi-ur -U … -P …
Address lookup for rebel-ipmi.qe-ipmi-ur failed
Could not open socket!
Error: Unable to establish IPMI v2 / RMCP+ session

Note that not all such hosts are broken (e.g. jumpy@qe-jumpy:~> ipmitool -I lanplus -C 3 -H openqaworker4-ipmi.qe-ipmi-ur … works fine).

Looks like IPs for the hosts are hardcoded in /etc/hosts on jumpy@qe-jumpy. The failing hosts are missing from that list, so presumably someone needs to update it. I cannot do it because I don't know the IPs, so I've been asking in the chat.
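
A check along these lines could list which names are missing from a hosts-format file before running ipmitool at all. This is a sketch under assumptions: `missing_from_hosts` is a hypothetical helper, and it only looks at the file it is given, not at DNS or any other resolver source.

```shell
# Print each given hostname that has no entry in a hosts-format file.
# Usage: missing_from_hosts HOSTS_FILE NAME...
missing_from_hosts() {
    local hosts_file=$1 name
    shift
    for name in "$@"; do
        # Strip comments, then compare the name with every field after the IP.
        awk -v n="$name" '
            { sub(/#.*/, "") }
            { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
            END { exit !found }
        ' "$hosts_file" || echo "$name"
    done
}

# Example call on the jump host (hostnames taken from the failing commands above):
# missing_from_hosts /etc/hosts openqa-aarch64-ipmi.qe-ipmi-ur rebel-ipmi.qe-ipmi-ur
```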

Actions #14

Updated by mkittler almost 2 years ago

Looks like https://gitlab.suse.de/openqa/grafana-webhook-actions/-/blob/master/.gitlab-ci.yml is only about workers not in SRV1.

https://gitlab.suse.de/openqa/monitor-o3/-/blob/master/.gitlab-ci.yml needs to be adjusted, but first the issues mentioned in my previous comment need to be resolved. To be able to use the jump host, the script also needs to be changed a little.
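
One possible shape for such a script change is a helper that builds the ipmitool call routed through the jump host. This is not the actual monitor-o3 script: `jump_ipmi_cmd` is a hypothetical name, credentials are left out on purpose, and the function only prints the command so it can be inspected before anything is run.

```shell
# Build (but do not run) an ipmitool invocation routed through the jump host.
# Usage: jump_ipmi_cmd BMC_HOST IPMITOOL_ARGS...
# Note: ssh joins the remote arguments with spaces and hands them to a remote
# shell, so arguments with spaces or shell metacharacters need extra quoting.
jump_ipmi_cmd() {
    local bmc=$1
    shift
    printf '%s\n' "ssh -4 jumpy@qe-jumpy.suse.de -- ipmitool -I lanplus -C 3 -H $bmc $*"
}

jump_ipmi_cmd openqaworker4-ipmi.qe-ipmi-ur chassis power status
```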

Actions #15

Updated by openqa_review almost 2 years ago

  • Due date set to 2023-01-26

Setting due date based on mean cycle time of SUSE QE Tools

Actions #16

Updated by mkittler almost 2 years ago

Actions #20

Updated by mkittler almost 2 years ago

  • Private changed from Yes to No
Actions #21

Updated by mkittler almost 2 years ago

  • Status changed from In Progress to Feedback
Actions #23

Updated by okurz almost 2 years ago

  • Due date deleted (2023-01-26)
  • Status changed from Feedback to Blocked
Actions #24

Updated by okurz almost 2 years ago

  • Status changed from Blocked to Feedback

But the GitLab CI pipelines seem to have problems reaching the jumpy host, see e.g. https://gitlab.suse.de/openqa/monitor-o3/-/jobs/1346163#L37

Actions #26

Updated by mkittler almost 2 years ago

  • Status changed from Feedback to Blocked

The SR has been merged and it works. So this leaves me waiting for https://sd.suse.com/servicedesk/customer/portal/1/SD-109299.

Actions #27

Updated by mkittler almost 2 years ago

  • Status changed from Blocked to Feedback

The remaining IPMI interfaces have been migrated. Now only the documentation/automation changes are pending:


I've also gone through the list of IPMI commands in the pillars and only the following hosts have not been migrated:

  1. not used anymore anyways or broken: openqaworker1, imagetester, power8
  2. special: sp.openqaw5-xen.qa.suse.de
  3. Prague located: openqaworker14-ipmi.qa.suse.cz, openqaworker15-ipmi.qa.suse.cz, openqaworker16-ipmi.qa.suse.cz, openqaworker17-ipmi.qa.suse.cz, openqaworker18-ipmi.qa.suse.cz
  4. power: qa-power8-4.qa.suse.de, qa-power8-5.qa.suse.de, fsp1-powerqaworker-qam.qa.suse.de, malbec
  5. arm: openqaworker-arm-1-ipmi.suse.de, openqaworker-arm-2-ipmi.suse.de, openqaworker-arm-4-ipmi.suse.de, openqaworker-arm-4-ipmi.suse.de, openqaworker-arm-5-ipmi.suse.de

I suppose we can ignore category 1. I'm not sure about the others. Should I request migrating them, too? If I remember correctly, we talked about this ticket in Jitsi and concluded that only the hosts I requested in https://sd.suse.com/servicedesk/customer/portal/1/SD-109299 would be missing.

EDIT: Created #124119 as follow-up for remaining hosts.

Actions #28

Updated by okurz almost 2 years ago

  • Status changed from Feedback to In Progress

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/489 merged. As this ticket is about SRV1 in particular, please make sure the list you provided is covered in a follow-up ticket: check the parent and other subtasks, and create a new ticket as necessary.

Actions #30

Updated by mkittler almost 2 years ago

  • Copied to action #124119: Conduct the migration of remaining SUSE openQA systems IPMI to new security zones added
Actions #31

Updated by okurz almost 2 years ago

  • Blocked by deleted (action #120651: [openQA][infra][ipmi][worker][api] The expected pattern CMD_FINISHED-xxxxx returned but did not show up in serial log (wait_serial timed out) size:M)
Actions #32

Updated by okurz almost 2 years ago

  • Related to action #120651: [openQA][infra][ipmi][worker][api] The expected pattern CMD_FINISHED-xxxxx returned but did not show up in serial log (wait_serial timed out) size:M added
Actions #33

Updated by livdywan almost 2 years ago

  • Status changed from In Progress to Resolved

okurz wrote:

https://gitlab.suse.de/openqa/monitor-o3/-/merge_requests/10 merged.

Merged. We're good here.
