action #158907

closed

openQA Project - coordination #155485: [saga][epic] Efficient openQA worker pool resource handling in datacenters

coordination #158374: [epic] Prevention of inefficient hardware resource use

Automated check for machines marked as "unused" in racktables but still pingable (as they should not be powered on at all) size:M

Added by okurz 2 months ago. Updated about 2 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Feature requests
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:

Description

Acceptance criteria

  • AC1: A regular automated check of all "unused" machines for pingability is conducted

Suggestions

  • See what was done in #158383-7
  • Request a non-personal IDP account with mailing list usable for CI, e.g. "qe-admins-bot"? Then use a mailing list like "osd-admins+bot"
  • Ensure that the bot has access to the data we need
  • Find an upstream place
  • Make that a periodic check, e.g. scheduled in gitlab CI pipeline schedule every day, e.g. in https://gitlab.suse.de/openqa/scripts-ci/
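
The check described in AC1 and the suggestions above could be sketched roughly as follows. This is not the script that was eventually added to the scripts repository, only a minimal illustration. It assumes the list of "unused" hostnames has already been obtained from racktables by some other means, and that a Linux `ping` supporting `-c` and `-W` is available. The function names `is_pingable` and `find_powered_on` and the example hostnames are hypothetical.

```python
import subprocess
from typing import Callable, Iterable, List


def is_pingable(host: str, timeout_s: int = 2) -> bool:
    """Return True if one ICMP echo request to `host` is answered.

    Assumption: Linux iputils `ping`, where `-c 1` sends a single packet
    and `-W` sets the reply timeout in seconds.
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def find_powered_on(
    hosts: Iterable[str],
    probe: Callable[[str], bool] = is_pingable,
) -> List[str]:
    """Return the hosts that answer the probe although marked "unused"."""
    return [host for host in hosts if probe(host)]


def main(hosts: Iterable[str], probe: Callable[[str], bool] = is_pingable) -> int:
    """Print offending hosts and return a CI-friendly exit code (0 = ok)."""
    alive = find_powered_on(hosts, probe)
    for host in alive:
        print(f'{host} is marked "unused" but still pingable')
    return 1 if alive else 0
```

In a scheduled CI job, one would pass the exit code on, e.g. `sys.exit(main(["unused-host-1.example.org"]))`, so that the pipeline fails whenever an "unused" machine still answers. The `probe` parameter exists only so the logic can be exercised without sending real pings.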

Related issues 1 (0 open, 1 closed)

Copied from openQA Infrastructure - action #158383: Crosscheck which machines marked as "unused" in racktables are still pingable (as they should not be powered on at all) size:M | Resolved | nicksinger | 2024-04-01

Actions #1

Updated by okurz 2 months ago

  • Copied from action #158383: Crosscheck which machines marked as "unused" in racktables are still pingable (as they should not be powered on at all) size:M added
Actions #2

Updated by nicksinger 2 months ago · Edited

  • Status changed from Workable to Feedback

Waiting on the account creation e-mail

Actions #3

Updated by nicksinger 2 months ago

  • Status changed from Feedback to Workable

The account e-mail was apparently stuck in the spam filter for some time but eventually arrived. With the credentials now stored in our password repository I was able to log into racktables. Next I will try to implement the check in a CI run.

Actions #4

Updated by nicksinger about 2 months ago

  • Status changed from Workable to In Progress

I already added the script itself to our scripts-repo with https://github.com/os-autoinst/scripts/pull/314
I also added a new schedule to our scripts-ci project on gitlab and created https://gitlab.suse.de/openqa/scripts-ci/-/merge_requests/5 to fetch additional files. A first test in a personal project shows that we now miss some dependencies in the CI run: https://gitlab.suse.de/nicksinger/scripts-ci/-/jobs/2501549
A PR to add these is https://github.com/os-autoinst/scripts/pull/315

Actions #5

Updated by nicksinger about 2 months ago

  • Status changed from In Progress to Blocked

All MRs and PRs are merged. After the infra daily, @okurz and I created a new CA container to supply the needed internal certificate. A current run can be seen here: https://gitlab.suse.de/nicksinger/scripts-ci/-/jobs/2502360#L96

Waiting for a firewall change now so that our job can access racktables.suse.de: https://sd.suse.com/servicedesk/customer/portal/1/SD-154389

Actions #6

Updated by nicksinger about 2 months ago

  • Status changed from Blocked to Workable

SD ticket resolved and validated with https://gitlab.suse.de/nicksinger/scripts-ci/-/jobs/2513431#L116 - the script now shows an error while parsing racktables HTTP, which I have to check next

Actions #7

Updated by nicksinger about 2 months ago

  • Status changed from Workable to In Progress
Actions #8

Updated by nicksinger about 2 months ago

  • Status changed from In Progress to Feedback

It turns out that our password needs escaping, which I have figured out by now. I also created some more MRs/PRs to handle various smaller issues I encountered while debugging this.

Actions #9

Updated by nicksinger about 2 months ago

  • Status changed from Feedback to Resolved

After several additional adjustments I was finally able to enable a weekly schedule (at 04:00 on Monday) in our scripts-ci pipeline: https://gitlab.suse.de/openqa/scripts-ci/-/jobs/2535545
Feel free to find an "unused" machine, power it on and see if the check fails after the weekend.
