action #106933

closed

Use PSU capabilities to power cycle openqaworker-arm-[1-3] instead of infra tickets size:M

Added by nicksinger about 2 years ago. Updated about 2 years ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version:
Start date: 2021-11-17
Due date:
% Done: 0%
Estimated time:

Description

Observation

Today we installed a controllable PSU (qaps06nue.qa.suse.de) into the rack for openqaworker-arm-[1-3]. We should make use of it in our automatic recovery pipeline to power cycle the BMC if it is down.

Here is the mapping for each machine:

ARM1: Plug 1
ARM2: Plug 2+3
ARM3: Plug 4+5
ARM4: Plug 6
ARM6: Plug 7
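
A minimal sketch of how this mapping could be kept as data for the recovery automation. The plug numbers are copied from the list above; the dictionary name, the `plugs_for` helper and the assumption that worker host names look like `openqaworker-arm-2` are illustrative only and not taken from the actual pipeline.

```python
import re
from typing import Dict, Tuple

# Plug numbers per machine, copied from the mapping in the description.
PLUGS_BY_MACHINE: Dict[str, Tuple[int, ...]] = {
    "ARM1": (1,),
    "ARM2": (2, 3),
    "ARM3": (4, 5),
    "ARM4": (6,),
    "ARM6": (7,),
}


def plugs_for(worker: str) -> Tuple[int, ...]:
    """Return the plugs for a name such as 'openqaworker-arm-2' or 'ARM2'.

    The accepted name formats are an assumption made for this sketch.
    """
    match = re.fullmatch(r"(?:openqaworker-)?arm-?(\d+)", worker.strip().lower())
    if match is None:
        raise ValueError(f"unrecognized worker name: {worker!r}")
    key = "ARM" + match.group(1)
    if key not in PLUGS_BY_MACHINE:
        raise ValueError(f"no plug mapping for {worker!r}")
    return PLUGS_BY_MACHINE[key]
```

For example, `plugs_for("openqaworker-arm-3")` yields both plugs of ARM3, so a caller does not need to know which machines have one plug and which have two.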

Suggestions

Research how we can automate the power cycle on the PSU side. The PSUs have a web interface which can be scripted/scraped (no real API AFAIK) and several access options, e.g. ssh, telnet, ftp.
Ask nsinger for the password if you want to browse through the web UI. Keep in mind to choose a security-conscious option (e.g. an encrypted channel).
Integrate this automation into our pipeline at https://gitlab.suse.de/openqa/grafana-webhook-actions/-/blob/master/ipmi-recover-worker. It should replace the create_ticket() function (https://gitlab.suse.de/openqa/grafana-webhook-actions/-/blob/master/ipmi-recover-worker#L26-28).

Acceptance criteria

  • AC1: Infra tickets are no longer created
  • AC2: grafana-webhook-actions uses some API of the PSU to automate the power cycle

Related issues 1 (0 open, 1 closed)

Related to openQA Infrastructure - action #102575: Prevent false-positive ticket reporting for openqaworker-arm-3 (Resolved, mkittler, 2021-11-17, 2021-12-02)

Actions #1

Updated by nicksinger about 2 years ago

  • Copied from action #102575: Prevent false-positive ticket reporting for openqaworker-arm-3 added
Actions #2

Updated by nicksinger about 2 years ago

  • Copied from deleted (action #102575: Prevent false-positive ticket reporting for openqaworker-arm-3)
Actions #3

Updated by nicksinger about 2 years ago

  • Related to action #102575: Prevent false-positive ticket reporting for openqaworker-arm-3 added
Actions #4

Updated by nicksinger about 2 years ago

I think a "High" priority is reasonable because we currently flood infra with mails/tickets and they are already overloaded. I asked them to ignore these tickets for now as we have full access to the system ourselves.

Actions #5

Updated by okurz about 2 years ago

  • Status changed from New to In Progress
  • Assignee set to okurz

I am trying with an expect script or something

Actions #6

Updated by nicksinger about 2 years ago

  • Assignee deleted (okurz)
  • Target version deleted (Ready)

I just recently scripted closing the FTP port. You could reuse at least the login part:

    import requests

    # PDU host name as given in the ticket description; adjust as needed.
    host = "qaps06nue.qa.suse.de"

    # A session keeps the login cookie across requests.
    s = requests.Session()

    login_params = {
      "login_username": "admin",
      "login_password": "<INSERT_PASSWORD_HERE>",
      "submit": "Log On"
    }

    disable_ftp_params = {
      "ftpPort": "21",
      "submit": "Apply"
    }

    headers = {'Content-Type': 'application/x-www-form-urlencoded'}

    print("Opening admin page…", end="")
    req0 = s.get("http://" + host)
    print("Done.")

    # The login part: this is what can be reused for other actions.
    print("Login…", end="")
    req1 = s.post("http://" + host + "/Forms/login1", data=login_params, headers=headers)
    print("Done.")

    print("Disabling FTP…", end="")
    req2 = s.post("http://" + host + "/Forms/ftpserv1", data=disable_ftp_params, headers=headers)
    print("Done.")

    print("Logout…", end="")
    req3 = s.get("http://" + host + "/logout.htm")
    print("Done.")
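
To reuse only the login part of the snippet above, the login and logout steps could be wrapped in a context manager, roughly as sketched here. The `/Forms/login1` and `/logout.htm` paths and the form fields come from the snippet; the `pdu_session` name and its parameters are made up for illustration. Like the snippet, it talks plain HTTP, and whether the device reports a failed login through the HTTP status code is not known.

```python
from contextlib import contextmanager

FORM_HEADERS = {"Content-Type": "application/x-www-form-urlencoded"}


@contextmanager
def pdu_session(host, password, username="admin", session=None):
    """Log in to the PDU web UI, yield the session and always log out.

    `session` may be any object with requests.Session-like get()/post();
    by default a real requests.Session is created.
    """
    if session is None:
        # Imported lazily so the helper can be tried with a stand-in session.
        import requests
        session = requests.Session()
    base = "http://" + host
    session.get(base)  # open the admin page first, as in the snippet
    login_params = {
        "login_username": username,
        "login_password": password,
        "submit": "Log On",
    }
    response = session.post(base + "/Forms/login1", data=login_params, headers=FORM_HEADERS)
    response.raise_for_status()  # only catches HTTP-level errors
    try:
        yield session
    finally:
        # Log out even if the body of the with-block raised.
        session.get(base + "/logout.htm")
```

Any further form posts, such as the FTP change above, would then go inside `with pdu_session(host, password) as s:`.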
Actions #7

Updated by nicksinger about 2 years ago

  • Assignee set to okurz
  • Target version set to Ready
Actions #8

Updated by okurz about 2 years ago

  • Due date set to 2022-03-02
  • Status changed from In Progress to Feedback

Thanks. I already have it covered with https://github.com/okurz/scripts/blob/master/control-switched-rack-pdu.exp :) What I have not figured out yet is how to properly control one or two sockets depending on which machines we want to control. Maybe something to hack together :)

The integration into our webhook-triggered recovery scripts is included with https://gitlab.suse.de/openqa/grafana-webhook-actions/-/merge_requests/19

I already added PDU_HOSTNAME and PDU_PASSWORD as variables in gitlab CI
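
One way the "one or two sockets" question could be approached is sketched below: switch all sockets of a machine off first, wait, and only then switch them back on. This assumes that a machine with two plugs (e.g. ARM2 on plugs 2+3) keeps running as long as either plug has power, which the ticket does not state. The `switch` callable is a hypothetical stand-in for whatever actually talks to the PDU (for example the expect script); it is not an existing API.

```python
import time


def power_cycle(outlets, switch, off_seconds=10.0, sleep=time.sleep):
    """Power cycle a machine that is fed by one or more PDU outlets.

    `switch(outlet, on)` is a caller-supplied function that turns a single
    outlet on (True) or off (False). All outlets are turned off before any
    is turned on again, so the machine is fully without power in between.
    """
    outlets = list(outlets)
    if not outlets:
        raise ValueError("no outlets given")
    for outlet in outlets:
        switch(outlet, False)
    sleep(off_seconds)  # the off time is an arbitrary default in this sketch
    for outlet in outlets:
        switch(outlet, True)
```

The same function covers single-plug machines, since a one-element list simply goes off and on again.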

Actions #9

Updated by livdywan about 2 years ago

  • Subject changed from Use PSU capabilities to power cycle openqaworker-arm-[1-3] instead of infra tickets to Use PSU capabilities to power cycle openqaworker-arm-[1-3] instead of infra tickets size:M
Actions #10

Updated by okurz about 2 years ago

  • Due date deleted (2022-03-02)
  • Status changed from Feedback to Resolved

We have seen multiple successful runs of the pipeline since the merge in https://gitlab.suse.de/openqa/grafana-webhook-actions/-/pipelines . As expected, there have likely been no PSU-based recoveries yet, as those are needed more rarely.
