action #157753
closedcoordination #121720: [saga][epic] Migration to QE setup in PRG2+NUE3 while ensuring availability
coordination #129280: [epic] Move from SUSE NUE1 (Maxtorhof) to new NBG Datacenters
Bring back automatic recovery for openqaworker-arm-1 size:M
0%
Description
Motivation
In #132614 openqaworker-arm-1 was moved to FC Basement so that we have one hot-redundant aarch64 OSD machine outside of PRG2. For that setup to be complete we also need to accommodate the automatic recovery feature.
Acceptance criteria
- AC1: The automatic recovery of openqaworker-arm-1 on crashes works
- AC2: openqaworker-arm-1 runs OSD production jobs in a stable way
Suggestions
- Read #133748 about notes regarding PDU auto-control
- Find on https://wiki.suse.net/index.php/SUSE-Quality_Assurance/Labs how the new PDU can be used
- Integrate the new PDU in https://gitlab.suse.de/openqa/grafana-webhook-actions
- After openqaworker-arm-1 is fully back including recovery remove silences in https://monitor.qa.suse.de/alerting/silences
- Remove the "Mute All times" in https://monitor.qa.suse.de/alerting/routes for __contacts__ =~ .*"Trigger reboot of openqaworker-arm-1".*
Rollback actions
- Bring back openqaworker-arm-1 into production https://progress.opensuse.org/projects/openqav3/wiki/#Bring-back-machines-into-salt-controlled-production
Updated by okurz 8 months ago
- Copied from action #133748: Move of openqaworker-arm-1 to FC Basement size:M added
Updated by nicksinger 8 months ago
Some more information on how to control the Eaton PDUs:
- Available protocols to communicate with the PDU: SSH/Telnet (only one at the same time seems to be possible), HTTP(S), SNMP, "EnergyWise"
- SNMPv3 is finicky to set up but provides authentication. A matching MIB file can be downloaded from the PDU itself. Haven't explored it further.
- The SSH server no longer works with modern ssh clients out of the box, but with additional options it can be used like this to get the status of an outlet on the PDU (A20 in this example):
ssh -oKexAlgorithms=+diffie-hellman-group1-sha1 admin@epdu-b2.qe.nue2.suse.org
+============================================================================+
| EATON ePDU Configuration Utility |
+============================================================================+
admin@epdu-b2.qe.nue2.suse.org's password:
pdu#0>get PDU.OutletSystem.Outlet[20].PresentStatus.SwitchOnOff
1
Other "paths" can be explored by using info PDU.*. Not sure how to control the outlet yet. Maybe with PDU.OutletSystem.Outlet[x].Schedule[1].Command? But what to write there?
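For scripting against the ePDU CLI, the object path queried above can be generated per outlet. The following is only a minimal shell sketch: the function names are hypothetical, and the mapping of 1 to "on" and 0 to "off" is an assumption that this ticket does not confirm.

```shell
#!/bin/sh
# Hypothetical helpers, not part of grafana-webhook-actions.

# Print the CLI object path that holds the switch state of outlet $1.
outlet_status_path() {
    printf 'PDU.OutletSystem.Outlet[%s].PresentStatus.SwitchOnOff\n' "$1"
}

# Translate the raw value returned by "get <path>" into a word.
# ASSUMPTION: 1 means on and 0 means off.
describe_outlet_state() {
    case "$1" in
        1) echo on ;;
        0) echo off ;;
        *) echo unknown ;;
    esac
}
```

For example, `outlet_status_path 20` prints the path used for outlet A20 in the session above.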
Updated by okurz 8 months ago
- Related to action #155179: Participate in alpha-testing of new version of velociraptor-client added
Updated by okurz 8 months ago
- Description updated (diff)
As expected, openqaworker-arm-1 already crashed, as visible on https://monitor.qa.suse.de/d/WDopenqaworker-arm-1/worker-dashboard-openqaworker-arm-1?orgId=1&from=1711486404798&to=1711516350686 . I removed the salt key for openqaworker-arm-1 with ssh osd 'sudo salt-key -y -d openqaworker-arm-1.qe.nue2.suse.org' and will restart the failed osd deployment. I will also lend the machine openqaworker-arm-1 temporarily to another team as part of #155179.
Updated by ybonatakis 8 months ago
- Status changed from Workable to In Progress
- Assignee set to ybonatakis
Updated by openqa_review 8 months ago
- Due date set to 2024-04-13
Setting due date based on mean cycle time of SUSE QE Tools
Updated by ybonatakis 8 months ago
- Due date deleted (2024-04-13)
https://gitlab.suse.de/openqa/grafana-webhook-actions/-/merge_requests/31
I ran the script manually in a container and it produced the following output:
69fecd9cd196:/ # ./ipmi-recover-worker
Attempting to reboot openqaworker-arm-1
System Power : on
Power Overload : false
Power Interlock : inactive
Main Power Fault : false
Power Control Fault : false
Power Restore Policy : previous
Last Power Event :
Chassis Intrusion : inactive
Front-Panel Lockout : inactive
Drive Fault : false
Cooling/Fan Fault : false
Sleep Button Disable : not allowed
Diag Button Disable : allowed
Reset Button Disable : allowed
Power Button Disable : allowed
Sleep Button Disabled: false
Diag Button Disabled : false
Reset Button Disabled: false
Power Button Disabled: false
Set Boot Device to disk
IPMI reports power status on ==> performing power cycle
Chassis Power Control: Cycle
PING openqaworker-arm-1.qe.nue2.suse.org (10.168.192.213) 56(84) bytes of data.
--- openqaworker-arm-1.qe.nue2.suse.org ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 11.207/11.207/11.207/0.000 ms
Connection to openqaworker-arm-1.qe.nue2.suse.org 22 port [tcp/ssh] succeeded!
IPMI succeeded ==> openqaworker-arm-1 is online again
69fecd9cd196:/ # echo $?
0
After a while I could ssh into openqaworker-arm-1.qe.nue2.suse.org, but on OSD the machine is shown as offline.
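The output above ends with a ping and a TCP check on port 22. A minimal sketch of such a reachability check with a bounded retry could look like the following. It is not the code of ipmi-recover-worker; the function names, the attempt count and the delay are assumptions, and the ping and nc options are those of iputils ping and OpenBSD netcat.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only.

# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Run COMMAND until it succeeds or ATTEMPTS runs have failed.
retry() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        # Only sleep when another attempt follows.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}

# Succeed when host $1 answers one ping and accepts TCP connections on port 22.
host_is_up() {
    ping -c 1 -W 5 "$1" >/dev/null 2>&1 &&
        nc -z -w 5 "$1" 22 >/dev/null 2>&1
}
```

A call such as `retry 30 10 host_is_up openqaworker-arm-1.qe.nue2.suse.org` would then poll for up to roughly five minutes plus the per-attempt timeouts. Note that a reachable SSH port says nothing about whether the worker is registered on OSD.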
Updated by openqa_review 8 months ago
- Due date set to 2024-04-17
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz 8 months ago
ybonatakis wrote in #note-9:
https://gitlab.suse.de/openqa/grafana-webhook-actions/-/merge_requests/31
I ran the script manually in a container and it produced the following output
[…]
After a while I could ssh into openqaworker-arm-1.qe.nue2.suse.org, but on OSD the machine is shown as offline
Yes, because of #157753-5. What needs to be extended in ipmi-recover-worker is support for the FC Basement ePDUs.
Updated by ybonatakis 8 months ago
So openqaworker-arm-1.qe.nue2.suse.org is connected to outlet 1.
The CLI command to get the state is get PDU.OutletSystem.Outlet[1].PresentStatus.SwitchOnOff
but that value is read-only.
A reboot can be achieved with set PDU.OutletSystem.Outlet[1].ToggleControl 1
which triggers the restart after a 120 s delay. After that the machine needs some more time to boot before it is accessible again.
I am going to update the script to perform these commands.
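To make the timing explicit, here is a minimal sketch of how a recovery script might build the toggle command and budget the waiting time. Only the command text and the 120 s delay come from this ticket; the function names and the default boot allowance of 300 s are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only.

# Print the ePDU CLI command that power-cycles outlet $1.
outlet_toggle_command() {
    printf 'set PDU.OutletSystem.Outlet[%s].ToggleControl 1\n' "$1"
}

# Print how many seconds to wait before expecting the host back:
# the 120 s toggle delay plus a boot allowance ($1, default 300 s, an assumption).
recovery_wait_seconds() {
    toggle_delay=120
    boot_allowance=${1:-300}
    echo $((toggle_delay + boot_allowance))
}
```

Any timeout in the calling script that is shorter than the value printed by `recovery_wait_seconds` would report a failure before the machine had a chance to come back.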
Updated by ybonatakis 8 months ago · Edited
PR updated. I updated the script to use ssh in order to reboot the VMs.
I noticed that there is retry logic at https://gitlab.suse.de/openqa/grafana-webhook-actions/-/blob/6d70304f191411a7796a8aec350a58e433df522b/ipmi-recover-worker#L64
and I wonder how valid it is and whether it could be moved into the expect script. Just food for thought. It looks a bit odd to me because there are settings in place which set a timeout > 10s, which is the $pdu_retry.
Note that I put a sleep in the expect script to catch the status after the restart (see the output below).
Another small thing is that registry.opensuse.org/home/okurz/container/containers/tumbleweed:ipmitool-ping-nc-msmtp-expect doesn't seem to contain ssh. I had to install it to perform tests.
Once I had installed openssh I ran the script as follows:
expect -f control-switched-rack-pdu.exp "admin" "admin" "epdu-b3.qe.nue2.suse.org" "1"
The output (when run a second time):
7072e66e7929:/ # expect -f control-switched-rack-pdu.exp "admin" "admin" "epdu-b3.qe.nue2.suse.org" "1"
spawn ssh -oKexAlgorithms=+diffie-hellman-group1-sha1 admin@epdu-b3.qe.nue2.suse.org
+============================================================================+
| EATON ePDU Configuration Utility |
+============================================================================+
admin@epdu-b3.qe.nue2.suse.org's password:
pdu#0>set PDU.OutletSystem.Outlet[1].ToggleControl 1
1
pdu#0>get PDU.OutletSystem.Outlet[1].PresentStatus.SwitchOnOff
1
pdu#0>get PDU.OutletSystem.Outlet[1].PresentStatus.SwitchOnOff
0
pdu#0>
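To act on such a session automatically, the status values have to be pulled out of the transcript. A minimal sketch, assuming the transcript is fed on standard input exactly in the form shown above (real sessions may contain carriage returns or other control characters that would need stripping first); the function name is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper for illustration only.

# Read an ePDU CLI transcript on stdin and print the line that follows
# each "get ...SwitchOnOff" command, one value per line.
outlet_states_from_transcript() {
    awk '/^pdu#[0-9]+>get .*SwitchOnOff/ {
        if ((getline value) > 0) print value
    }'
}
```

Comparing the first and last printed value is one way to confirm that the outlet state actually changed during the session.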
Updated by ybonatakis 8 months ago
Updated by ybonatakis 8 months ago · Edited
- Status changed from In Progress to Feedback
@okurz https://build.opensuse.org/request/show/1165434 adds ssh to the container. Updated: https://build.opensuse.org/request/show/1166102
Updated by okurz 8 months ago
I created a new container https://build.opensuse.org/package/show/home:okurz:container/grafana-webhook-actions instead of https://build.opensuse.org/request/show/1165434 with fixed package dependency name and updated .gitlab-ci.yml to use the new container image with https://gitlab.suse.de/openqa/grafana-webhook-actions/-/commit/a72304f3a0c9adfc62e051a31893d102c3c3dabb so you should be able to use ssh now in the CI triggered scripts.
Updated by ybonatakis 8 months ago
okurz wrote in #note-18:
I created a new container https://build.opensuse.org/package/show/home:okurz:container/grafana-webhook-actions instead of https://build.opensuse.org/request/show/1165434 with fixed package dependency name and updated .gitlab-ci.yml to use the new container image with https://gitlab.suse.de/openqa/grafana-webhook-actions/-/commit/a72304f3a0c9adfc62e051a31893d102c3c3dabb so you should be able to use ssh now in the CI triggered scripts.
I came across this a few minutes too late. I have updated the old container. I guess I need to check the recent changes again.
Updated by ybonatakis 8 months ago
- Status changed from In Progress to Feedback
ybonatakis wrote in #note-20:
okurz wrote in #note-18:
I created a new container https://build.opensuse.org/package/show/home:okurz:container/grafana-webhook-actions instead of https://build.opensuse.org/request/show/1165434 with fixed package dependency name and updated .gitlab-ci.yml to use the new container image with https://gitlab.suse.de/openqa/grafana-webhook-actions/-/commit/a72304f3a0c9adfc62e051a31893d102c3c3dabb so you should be able to use ssh now in the CI triggered scripts.
I came across this a few minutes too late. I have updated the old container. I guess I need to check the recent changes again.
So I don't think I need to change anything on https://gitlab.suse.de/openqa/grafana-webhook-actions/-/merge_requests/31 in this case, and we can ignore https://build.opensuse.org/request/show/1166102. The only question is whether the new container works with the old expect script.
Updated by ybonatakis 7 months ago
I spent some time looking into https://gitlab.suse.de/openqa/salt-states-openqa, but when I logged in to Grafana I could just delete the entries from the UI. https://monitor.qa.suse.de/alerting/silences and https://monitor.qa.suse.de/alerting/routes have been updated. I am not sure if I actually have to do anything with salt-states-openqa.
Updated by ybonatakis 7 months ago
- Status changed from Feedback to Resolved
openqaworker-arm-1 back in production
Updated by okurz 7 months ago
- Related to action #159303: [alert] osd-deployment pre-deploy pipeline failed because openqaworker-arm-1.qe.nue2.suse.org was offline size:S added
Updated by okurz 7 months ago
- Status changed from Resolved to Feedback
Please ensure that the automatic recovery actually works from grafana+gitlab. Also https://monitor.qa.suse.de/d/1bNU0StZz/automatic-actions?orgId=1
Updated by okurz 7 months ago
- Related to action #159270: openqaworker-arm-1 is Unreachable size:S added
Updated by ybonatakis 7 months ago
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1159 missing stuff. thanks @okurz
Updated by ybonatakis 7 months ago
PDU_HOSTNAME variable is updated from qaps06nue.qa.suse.de to epdu-b3.qe.nue2.suse.org on https://gitlab.suse.de/openqa/grafana-webhook-actions/-/settings/ci_cd
Updated by nicksinger 7 months ago
- Status changed from Feedback to Workable
Good, the gitlab/CI part seems to work: https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/2513430 (we have to wait until the PDU fallback is needed to check if the new flow for FC PDUs works).
What is missing is the Grafana alert which triggers these pipeline runs.
Updated by ybonatakis 7 months ago
- File clipboard-202404200912-8h8yb.png clipboard-202404200912-8h8yb.png added
- Status changed from Workable to Feedback
i think this is solved now
Updated by ybonatakis 7 months ago
- Status changed from Feedback to Workable
Putting this back to Workable, as my last comment changed the status.
Updated by ybonatakis 7 months ago
- Status changed from In Progress to Feedback
nicksinger wrote in #note-33:
Good, the gitlab/CI part seems to work: https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/2513430 (we have to wait until the PDU fallback is needed to check if the new flow for FC PDUs works).
What is missing is the Grafana alert which triggers these pipeline runs.
I think the Grafana alert is already there and seems to work properly.
I went to https://monitor.qa.suse.de/alerting/notifications/receivers/Trigger%20reboot%20of%20openqaworker-arm-1/edit and triggered a test notification.
The job https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/2523869 shows the results of the CI, and the Grafana dashboard https://monitor.qa.suse.de/d/WDopenqaworker-arm-1/worker-dashboard-openqaworker-arm-1?orgId=1 shows the restart.
I think we can resolve the ticket. @nicksinger is there anything else to be done?
Updated by nicksinger 7 months ago
The "Contact point" still shows "Unused" which indicates to me that no alert actually uses it.
Updated by ybonatakis 7 months ago
Not sure if I did it right.
https://monitor.qa.suse.de/alerting/notifications?alertmanager=grafana&search=Trigger%20reboot%20of%20openqaworker-arm-1 no longer shows the contact point as "Unused".
But I wonder if I changed something else, as I don't remember whether autogen-contact-point-default was there before; it now says "Unused".
Updated by ybonatakis 7 months ago
This morning I found the machine down. There were alerts, but the recovery process didn't run automatically.
The host is running again now.
Updated by ybonatakis 7 months ago
- Status changed from Feedback to Resolved
@nicksinger fixed the delivery issue and we tested it. I think we are done with it.
Updated by ybonatakis 7 months ago
okurz wrote in #note-43:
Can you please provide verifications for the ACs?
you mean https://gitlab.suse.de/openqa/grafana-webhook-actions/-/pipelines/1098371?
Updated by ybonatakis 7 months ago
I can see jobs running, many of them successfully, within the last few days: https://openqa.suse.de/admin/workers/3532
Updated by okurz 7 months ago
- Status changed from Feedback to Resolved
I see some green dots on https://openqa.suse.de/admin/workers/3532 so AC2 fulfilled as well