action #157753

closed

coordination #121720: [saga][epic] Migration to QE setup in PRG2+NUE3 while ensuring availability

coordination #129280: [epic] Move from SUSE NUE1 (Maxtorhof) to new NBG Datacenters

Bring back automatic recovery for openqaworker-arm-1 size:M

Added by okurz 8 months ago. Updated 7 months ago.

Status: Resolved
Priority: Normal
Assignee:
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:

Description

Motivation

In #132614 openqaworker-arm-1 was moved to FC Basement so that we have one hot-redundant aarch64 OSD machine outside of PRG2. For that setup to be complete we also need to accommodate the automatic recovery feature.

Acceptance criteria

  • AC1: The automatic recovery of openqaworker-arm-1 on crashes works
  • AC2: openqaworker-arm-1 runs OSD production jobs in a stable way

Suggestions

Rollback actions


Files


Related issues 4 (0 open, 4 closed)

  • Related to QA - action #155179: Participate in alpha-testing of new version of velociraptor-client (Resolved, okurz, 2024-02-08)
  • Related to openQA Infrastructure - action #159303: [alert] osd-deployment pre-deploy pipeline failed because openqaworker-arm-1.qe.nue2.suse.org was offline size:S (Resolved, nicksinger, 2024-06-25)
  • Related to openQA Infrastructure - action #159270: openqaworker-arm-1 is Unreachable size:S (Resolved, ybonatakis, 2024-04-19)
  • Copied from QA - action #133748: Move of openqaworker-arm-1 to FC Basement size:M (Resolved, ybonatakis)
Actions #1

Updated by okurz 8 months ago

  • Copied from action #133748: Move of openqaworker-arm-1 to FC Basement size:M added
Actions #2

Updated by okurz 8 months ago

  • Subject changed from Bring back automatic recovery for openqaworker-arm-1 to Bring back automatic recovery for openqaworker-arm-1 size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #3

Updated by nicksinger 8 months ago

Some more information on how to control the Eaton PDUs:

  • Available protocols to communicate with the PDU: SSH/Telnet (only one at a time seems to be possible), HTTP(S), SNMP, "EnergyWise"
  • SNMPv3 is finicky to set up but provides authentication. A corresponding MIB file can be downloaded from the PDU itself. I haven't explored it further.
  • The SSH server no longer works with modern SSH clients out of the box, but it can be used with additional options, like so, to get the status of an outlet on the PDU (A20 in this example):
ssh -oKexAlgorithms=+diffie-hellman-group1-sha1 admin@epdu-b2.qe.nue2.suse.org
+============================================================================+
|                      EATON ePDU Configuration Utility                      |
+============================================================================+
admin@epdu-b2.qe.nue2.suse.org's password:
pdu#0>get PDU.OutletSystem.Outlet[20].PresentStatus.SwitchOnOff
1

Other "paths" can be explored by using info PDU.*. Not sure how to control the outlet yet. Maybe with PDU.OutletSystem.Outlet[x].Schedule[1].Command? But what to write there?
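The ssh invocation and the outlet query from the session above can be wrapped in a few small helpers. This is only a sketch: the function names are made up for illustration and do not exist in any of the real scripts, and reading a reply of 1 as "on" and 0 as "off" is an assumption, not something confirmed from the Eaton documentation.

```shell
#!/bin/sh
# Illustrative helpers only; none of these names exist in the real scripts.

# Print the ssh invocation shown above, including the legacy key exchange
# option that the PDU's old SSH server needs.
pdu_ssh_command() {
    # $1 = user, $2 = PDU hostname
    printf 'ssh -oKexAlgorithms=+diffie-hellman-group1-sha1 %s@%s\n' "$1" "$2"
}

# Print the command to type at the "pdu#0>" prompt to read an outlet's state.
pdu_get_command() {
    # $1 = outlet number
    printf 'get PDU.OutletSystem.Outlet[%s].PresentStatus.SwitchOnOff\n' "$1"
}

# Map the numeric reply to a word.
# ASSUMPTION: 1 means switched on, 0 means switched off.
outlet_state_name() {
    case "$1" in
        1) echo on ;;
        0) echo off ;;
        *) echo unknown ;;
    esac
}

pdu_ssh_command admin epdu-b2.qe.nue2.suse.org
pdu_get_command 20
outlet_state_name 1
```

The helpers only build strings, so they can be tried without touching the PDU; the interactive password prompt still has to be handled separately.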

Actions #4

Updated by okurz 8 months ago

  • Related to action #155179: Participate in alpha-testing of new version of velociraptor-client added
Actions #5

Updated by okurz 8 months ago

  • Description updated (diff)

As expected, openqaworker-arm-1 has already crashed, as visible on https://monitor.qa.suse.de/d/WDopenqaworker-arm-1/worker-dashboard-openqaworker-arm-1?orgId=1&from=1711486404798&to=1711516350686 . I removed the salt key for openqaworker-arm-1 with ssh osd 'sudo salt-key -y -d openqaworker-arm-1.qe.nue2.suse.org' and will restart the failed osd deployment. I will also lend the machine openqaworker-arm-1 temporarily to another team as part of #155179.

Actions #6

Updated by okurz 8 months ago

  • Priority changed from Low to Normal
Actions #7

Updated by ybonatakis 8 months ago

  • Status changed from Workable to In Progress
  • Assignee set to ybonatakis
Actions #8

Updated by openqa_review 8 months ago

  • Due date set to 2024-04-13

Setting due date based on mean cycle time of SUSE QE Tools

Actions #9

Updated by ybonatakis 8 months ago

  • Due date deleted (2024-04-13)

https://gitlab.suse.de/openqa/grafana-webhook-actions/-/merge_requests/31

I ran the script manually in a container and it produced the following output:

69fecd9cd196:/ # ./ipmi-recover-worker
Attempting to reboot openqaworker-arm-1
System Power         : on
Power Overload       : false
Power Interlock      : inactive
Main Power Fault     : false
Power Control Fault  : false
Power Restore Policy : previous
Last Power Event     : 
Chassis Intrusion    : inactive
Front-Panel Lockout  : inactive
Drive Fault          : false
Cooling/Fan Fault    : false
Sleep Button Disable : not allowed
Diag Button Disable  : allowed
Reset Button Disable : allowed
Power Button Disable : allowed
Sleep Button Disabled: false
Diag Button Disabled : false
Reset Button Disabled: false
Power Button Disabled: false
Set Boot Device to disk
IPMI reports power status on ==> performing power cycle
Chassis Power Control: Cycle
PING openqaworker-arm-1.qe.nue2.suse.org (10.168.192.213) 56(84) bytes of data.

--- openqaworker-arm-1.qe.nue2.suse.org ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 11.207/11.207/11.207/0.000 ms
Connection to openqaworker-arm-1.qe.nue2.suse.org 22 port [tcp/ssh] succeeded!
IPMI succeeded ==> openqaworker-arm-1 is online again
69fecd9cd196:/ # echo $?
0

After a while I could ssh into openqaworker-arm-1.qe.nue2.suse.org, but on OSD the machine is shown as offline.
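The last lines of the output suggest that the script treats the host as recovered once a single ping is answered and a TCP connection to port 22 succeeds. A minimal sketch of such a check follows. It is not the code of ipmi-recover-worker: the function name is hypothetical, and it assumes Linux iputils ping and a netcat that supports -z and -w.

```shell
#!/bin/sh
# Hypothetical helper, not taken from ipmi-recover-worker.

# Succeed only if the host answers one ping AND accepts TCP connections on
# the SSH port. ASSUMPTION: Linux iputils ping (-W = reply timeout in
# seconds) and a netcat that supports -z (connect only) and -w (timeout).
host_is_up() {
    # $1 = hostname, $2 = optional port (default 22)
    ping -c 1 -W 5 "$1" >/dev/null 2>&1 || return 1
    nc -z -w 5 "$1" "${2:-22}" >/dev/null 2>&1
}
```

A call such as host_is_up openqaworker-arm-1.qe.nue2.suse.org && echo online only proves network and SSH reachability. It says nothing about whether the worker is registered on OSD, which matches the observation that the machine was reachable but still shown as offline.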

Actions #10

Updated by openqa_review 8 months ago

  • Due date set to 2024-04-17

Setting due date based on mean cycle time of SUSE QE Tools

Actions #11

Updated by okurz 8 months ago

ybonatakis wrote in #note-9:

https://gitlab.suse.de/openqa/grafana-webhook-actions/-/merge_requests/31

I ran the script manually in a container and it produced the following output:
[…]
After a while I could ssh into openqaworker-arm-1.qe.nue2.suse.org, but on OSD the machine is shown as offline.

Yes, because of #157753-5. What needs to be extended in ipmi-recover-worker is support for the FC Basement ePDUs.

Actions #12

Updated by ybonatakis 8 months ago

So openqaworker-arm-1.qe.nue2.suse.org is connected to outlet 1.
The CLI command to get the state is get PDU.OutletSystem.Outlet[1].PresentStatus.SwitchOnOff, but it is read-only.

A reboot can be achieved with set PDU.OutletSystem.Outlet[1].ToggleControl 1, which has a 120 s delay before the restart is triggered, plus some time to boot and actually have access again.

I am going to update the script to perform these commands.
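Because the toggle only fires after about 120 s and the machine then still has to boot, any check that follows needs a generous retry budget. The sketch below builds the toggle command and provides a generic retry helper. Both function names are hypothetical, and the numbers in the usage note are guesses rather than measured values.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only.

# Print the command to type at the "pdu#0>" prompt to power-cycle an outlet.
pdu_toggle_command() {
    # $1 = outlet number
    printf 'set PDU.OutletSystem.Outlet[%s].ToggleControl 1\n' "$1"
}

# Run a command repeatedly until it succeeds or the attempts are used up.
# Usage: retry_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Note: POSIX sh has no local variables, hence the underscore prefixes.
retry_until() {
    _attempts=$1
    _delay=$2
    shift 2
    _try=1
    while [ "$_try" -le "$_attempts" ]; do
        "$@" && return 0
        _try=$((_try + 1))
        # Sleep only if another attempt follows.
        if [ "$_try" -le "$_attempts" ]; then
            sleep "$_delay"
        fi
    done
    return 1
}

pdu_toggle_command 1
```

For example, retry_until 30 10 nc -z -w 5 openqaworker-arm-1.qe.nue2.suse.org 22 would keep probing for roughly five minutes or more, which covers the 120 s delay with some margin for booting.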

Actions #13

Updated by okurz 8 months ago

Great. That sounds quite promising

Actions #14

Updated by ybonatakis 8 months ago · Edited

PR updated. I updated the script to use ssh in order to reboot the machine.
I noticed that there is retry logic at https://gitlab.suse.de/openqa/grafana-webhook-actions/-/blob/6d70304f191411a7796a8aec350a58e433df522b/ipmi-recover-worker#L64
and I wonder how valid it is and whether it could be moved into the expect script. Just food for thought. It seems a bit odd to me because there are settings in place that set a timeout > 10s, which is the $pdu_retry value.
Note that I put a sleep in the expect script to catch the status after the restart (see the output below).

Another small thing is that registry.opensuse.org/home/okurz/container/containers/tumbleweed:ipmitool-ping-nc-msmtp-expect doesn't seem to contain ssh. I had to install it to perform tests.
Once I had installed openssh I ran the script as follows:
expect -f control-switched-rack-pdu.exp "admin" "admin" "epdu-b3.qe.nue2.suse.org" "1"
The output (when run a second time):

7072e66e7929:/ # expect -f control-switched-rack-pdu.exp "admin" "admin" "epdu-b3.qe.nue2.suse.org" "1"
spawn ssh -oKexAlgorithms=+diffie-hellman-group1-sha1 admin@epdu-b3.qe.nue2.suse.org
+============================================================================+
|                      EATON ePDU Configuration Utility                      |
+============================================================================+
admin@epdu-b3.qe.nue2.suse.org's password: 
pdu#0>set PDU.OutletSystem.Outlet[1].ToggleControl 1
1
pdu#0>get PDU.OutletSystem.Outlet[1].PresentStatus.SwitchOnOff
1
pdu#0>get PDU.OutletSystem.Outlet[1].PresentStatus.SwitchOnOff
0
pdu#0>
Actions #16

Updated by ybonatakis 8 months ago · Edited

  • Status changed from In Progress to Feedback
Actions #17

Updated by ybonatakis 8 months ago

  • Status changed from Feedback to In Progress
Actions #18

Updated by okurz 8 months ago

I created a new container https://build.opensuse.org/package/show/home:okurz:container/grafana-webhook-actions instead of https://build.opensuse.org/request/show/1165434 with fixed package dependency name and updated .gitlab-ci.yml to use the new container image with https://gitlab.suse.de/openqa/grafana-webhook-actions/-/commit/a72304f3a0c9adfc62e051a31893d102c3c3dabb so you should be able to use ssh now in the CI triggered scripts.

Actions #19

Updated by ybonatakis 8 months ago

  • Status changed from In Progress to Feedback
Actions #20

Updated by ybonatakis 8 months ago

okurz wrote in #note-18:

I created a new container https://build.opensuse.org/package/show/home:okurz:container/grafana-webhook-actions instead of https://build.opensuse.org/request/show/1165434 with fixed package dependency name and updated .gitlab-ci.yml to use the new container image with https://gitlab.suse.de/openqa/grafana-webhook-actions/-/commit/a72304f3a0c9adfc62e051a31893d102c3c3dabb so you should be able to use ssh now in the CI triggered scripts.

I came across this a few minutes late. I have updated the old container. I guess I need to check the recent changes again.

Actions #21

Updated by ybonatakis 8 months ago

  • Status changed from Feedback to In Progress
Actions #22

Updated by ybonatakis 8 months ago

  • Status changed from In Progress to Feedback

ybonatakis wrote in #note-20:

okurz wrote in #note-18:

I created a new container https://build.opensuse.org/package/show/home:okurz:container/grafana-webhook-actions instead of https://build.opensuse.org/request/show/1165434 with fixed package dependency name and updated .gitlab-ci.yml to use the new container image with https://gitlab.suse.de/openqa/grafana-webhook-actions/-/commit/a72304f3a0c9adfc62e051a31893d102c3c3dabb so you should be able to use ssh now in the CI triggered scripts.

I came across this a few minutes late. I have updated the old container. I guess I need to check the recent changes again.

So I don't think I need to change anything on https://gitlab.suse.de/openqa/grafana-webhook-actions/-/merge_requests/31 in this case, and we can ignore https://build.opensuse.org/request/show/1166102. The only question is whether the new container works with the old expect script.

Actions #24

Updated by okurz 8 months ago

openqaworker-arm-1 is no longer needed by the security sensor development team. You can continue bringing the machine back into production and removing the corresponding silences.

Actions #25

Updated by ybonatakis 7 months ago

I spent some time looking into https://gitlab.suse.de/openqa/salt-states-openqa but when I logged in to Grafana I could just delete the entries from the UI. https://monitor.qa.suse.de/alerting/silences and https://monitor.qa.suse.de/alerting/routes have been updated. I am not sure if I actually have to do anything with salt-states-openqa.

Actions #26

Updated by ybonatakis 7 months ago

  • Status changed from Feedback to Resolved

openqaworker-arm-1 back in production

Actions #27

Updated by okurz 7 months ago

  • Due date deleted (2024-04-17)
Actions #28

Updated by okurz 7 months ago

  • Related to action #159303: [alert] osd-deployment pre-deploy pipeline failed because openqaworker-arm-1.qe.nue2.suse.org was offline size:S added
Actions #29

Updated by okurz 7 months ago

  • Status changed from Resolved to Feedback

Please ensure that the automatic recovery actually works from Grafana+GitLab. Also see https://monitor.qa.suse.de/d/1bNU0StZz/automatic-actions?orgId=1

Actions #30

Updated by okurz 7 months ago

  • Related to action #159270: openqaworker-arm-1 is Unreachable size:S added
Actions #32

Updated by ybonatakis 7 months ago

The PDU_HOSTNAME variable has been updated from qaps06nue.qa.suse.de to epdu-b3.qe.nue2.suse.org on https://gitlab.suse.de/openqa/grafana-webhook-actions/-/settings/ci_cd

Actions #33

Updated by nicksinger 7 months ago

  • Status changed from Feedback to Workable

Good, the gitlab/CI part seems to work: https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/2513430 (we have to wait until the PDU fallback is needed to check if the new flow for FC PDUs works).

What is missing is the Grafana alert which triggers these pipeline runs.

Actions #34

Updated by ybonatakis 7 months ago


I think this is solved now.

Actions #35

Updated by ybonatakis 7 months ago

  • Status changed from Feedback to Workable

Putting this back to Workable as my last comment changed the status.

Actions #36

Updated by ybonatakis 7 months ago

  • Status changed from Workable to In Progress
Actions #37

Updated by ybonatakis 7 months ago

  • Status changed from In Progress to Feedback

nicksinger wrote in #note-33:

Good, the gitlab/CI part seems to work: https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/2513430 (we have to wait until the PDU fallback is needed to check if the new flow for FC PDUs works).

What is missing is the Grafana alert which triggers these pipeline runs.

I think the Grafana alert is already there and seems to work properly.
I went to https://monitor.qa.suse.de/alerting/notifications/receivers/Trigger%20reboot%20of%20openqaworker-arm-1/edit and triggered a test notification.
The job https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/2523869 shows the results of the CI run and the Grafana dashboard https://monitor.qa.suse.de/d/WDopenqaworker-arm-1/worker-dashboard-openqaworker-arm-1?orgId=1 shows the restart.
I think we can resolve the ticket. @nicksinger, is there anything else to be done?
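The chain exercised here is: a Grafana contact point sends a webhook, GitLab starts a pipeline, and the pipeline runs the recovery script. One way to exercise the GitLab half by hand is the pipeline trigger API. The sketch below uses placeholder values for the base URL, project ID, token and ref, and the function names are hypothetical. Whether the contact point uses exactly this endpoint is an assumption.

```shell
#!/bin/sh
# Placeholder values and function names; nothing here is taken from the
# real grafana-webhook-actions configuration.

# Build the GitLab "trigger pipeline" endpoint for a project.
gitlab_trigger_url() {
    # $1 = GitLab base URL (trailing slash tolerated), $2 = numeric project ID
    printf '%s/api/v4/projects/%s/trigger/pipeline\n' "${1%/}" "$2"
}

# Start a pipeline on the given ref using a pipeline trigger token.
trigger_pipeline() {
    # $1 = base URL, $2 = project ID, $3 = trigger token, $4 = ref
    curl --fail --silent --show-error --request POST \
        --form "token=$3" \
        --form "ref=$4" \
        "$(gitlab_trigger_url "$1" "$2")"
}

gitlab_trigger_url https://gitlab.example.com 1234
```

Triggering the pipeline this way only tests GitLab and the script. It does not show that a firing alert is routed to the contact point, so that part still needs to be verified in Grafana.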

Actions #38

Updated by okurz 7 months ago

LGTM

Actions #39

Updated by nicksinger 7 months ago

The "Contact point" still shows "Unused" which indicates to me that no alert actually uses it.

Actions #40

Updated by ybonatakis 7 months ago

Not sure if I did it right.
https://monitor.qa.suse.de/alerting/notifications?alertmanager=grafana&search=Trigger%20reboot%20of%20openqaworker-arm-1 doesn't show an "Unused" notification.
But I wonder if I changed something, as I don't remember whether autogen-contact-point-default, which now says "unused", was there before.

Actions #41

Updated by ybonatakis 7 months ago

This morning I found the machine down. There were alerts but the recovery process didn't run automatically.
The host is running again now.

Actions #42

Updated by ybonatakis 7 months ago

  • Status changed from Feedback to Resolved

@nicksinger fixed the delivery issue and we tested it. I think we are done with it.

Actions #43

Updated by okurz 7 months ago

  • Status changed from Resolved to Feedback

Can you please provide verifications for the ACs?

Actions #44

Updated by ybonatakis 7 months ago

okurz wrote in #note-43:

Can you please provide verifications for the ACs?

Do you mean https://gitlab.suse.de/openqa/grafana-webhook-actions/-/pipelines/1098371?

Actions #45

Updated by okurz 7 months ago

For AC1 maybe, what about AC2?

Actions #46

Updated by ybonatakis 7 months ago

I can see jobs running and many of them successfully within the last few days https://openqa.suse.de/admin/workers/3532

Actions #47

Updated by okurz 7 months ago

  • Status changed from Feedback to Resolved

I see some green dots on https://openqa.suse.de/admin/workers/3532 so AC2 fulfilled as well
