action #99117

closed

malbec 🍷️ is not reachable via ssh or ipmi

Added by livdywan over 2 years ago. Updated over 2 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
Start date:
2021-09-23
Due date:
2021-10-21
% Done:

0%

Estimated time:

Description

Observation

  • alert [Alerting] malbec: host up alert triggered at 13:41 CEST today
  • ssh malbec.arch.suse.de isn't responsive.
  • ipmi on fsp1-malbec.arch.suse.de says Unable to establish IPMI v2 / RMCP+ session
  • chassis reboot fails with Unable to establish IPMI v2 / RMCP+ session

Workaround

ipmitool -4 -I lanplus -C 3 -H fsp1-malbec.arch.suse.de -P $password

Actions #1

Updated by livdywan over 2 years ago

  • Project changed from QA to openQA Infrastructure

Actions #2

Updated by okurz over 2 years ago

I can confirm that IPMI does not work. ipmi-fsp1-malbec.arch yields Error: Unable to establish IPMI v2 / RMCP+ session. I suggest reporting an EngInfra ticket.

Actions #3

Updated by livdywan over 2 years ago

  • Status changed from New to Feedback

Somehow malbec came back at 3:38 CEST yesterday, and it looks fine.

Actions #4

Updated by okurz over 2 years ago

  • Due date set to 2021-10-07
  • Assignee set to okurz
  • Priority changed from Urgent to High

Please only use "Feedback" together with an assignee; otherwise tickets tend to stay around for ages. mgriessmeier mentioned that gschlotter from EngInfra said they had "undefined network issues" in the past few days. Let's assume that was it.

I could log in over ssh with ssh malbec.arch and verify that openQA tests are running. Also, https://monitor.qa.suse.de/d/WDmalbec/worker-dashboard-malbec?orgId=1&refresh=1m&from=now-7d&to=now says it's fine, there are no malbec-related alerts on monitor.qa.suse.de, and https://openqa.suse.de/admin/workers says that malbec is up and working on jobs. The history of https://openqa.suse.de/admin/workers/914 , https://openqa.suse.de/admin/workers/887 , https://openqa.suse.de/admin/workers/898 , https://openqa.suse.de/admin/workers/885 looks fine, with the exception of some incomplete jobs like https://openqa.suse.de/tests/7201511 stating an oddly specific reason: "Reason: api failure: Failed to register at openqa.suse.de - 503 response: <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN"> 503 Service Unavailable Service Unavailable The server is temporarily unable to service your request due to maintenance downtime … ".

I called

host=openqa.suse.de WORKER=malbec failed_since=2021-09-23 openqa-advanced-retrigger-jobs

to handle the incompletes on this host. This retriggered some tests and we should be good.

I wonder whether we can better handle restarting jobs that fail with such a reason.
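
One way to approach this question would be to classify incomplete jobs by their reason text and only retrigger those that look transient. The following is a rough sketch, not existing openQA behaviour: the function names, the regular expression, and the job dictionaries with `id` and `reason` keys are assumptions made for illustration. Only the shape of the reason text is taken from the job quoted above.

```python
import re

# Assumed pattern: worker registration failed because the web UI answered
# with a 5xx status, as in the reason quoted in this comment.
TRANSIENT_REASON = re.compile(
    r"api failure: Failed to register at \S+ - 5\d\d response", re.IGNORECASE
)


def is_transient(reason):
    """Return True if the reason text looks like a temporary web UI outage."""
    return bool(reason) and TRANSIENT_REASON.search(reason) is not None


def jobs_to_retrigger(jobs):
    """Return the ids of jobs whose reason matches the transient pattern.

    `jobs` is a list of dicts with hypothetical keys "id" and "reason".
    """
    return [job["id"] for job in jobs if is_transient(job.get("reason"))]
```

Jobs with an empty reason or an unrelated reason are left alone, so a filter like this would only reduce manual retriggering for the specific "503 during registration" case rather than restart every incomplete job.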

EDIT: Also IPMI does not work. Created EngInfra ticket for that: https://sd.suse.com/servicedesk/customer/portal/1/SD-61278

Actions #5

Updated by okurz over 2 years ago

  • Status changed from Feedback to Blocked

Actions #6

Updated by livdywan over 2 years ago

okurz wrote:

-> https://sd.suse.com/servicedesk/customer/portal/1/SD-61278

I can't access that. I guess the automatic addition of team members still isn't working?

Actions #7

Updated by okurz over 2 years ago

Yes, it is not working, and that is not even expected to work because the solution was only discussed for automatically created tickets. I have now added osd-admins@suse.de (which is now a user) via the "Share" button and have seen the email confirmation. I also created https://sd.suse.com/servicedesk/customer/portal/1/SD-61392 to ask about the general procedure for efficient workflows.

Actions #8

Updated by okurz over 2 years ago

  • Description updated (diff)
  • Due date changed from 2021-10-07 to 2021-10-21
  • Priority changed from High to Normal

gschlotter responded. It seems that IPv6 DNS resolution fails. A workaround that works is to force IPv4 with -4:
ipmitool -4 -I lanplus -C 3 -H fsp1-malbec.arch.suse.de -P $password

Documented workaround in ticket description, lowering prio.
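
The IPv6 DNS failure can also be checked independently of ipmitool by resolving the hostname once per address family. The following is a minimal sketch in Python, not something that was run for this ticket: the helper name `resolve_by_family` and its `resolver` parameter are hypothetical, and it relies only on the standard library function `socket.getaddrinfo`.

```python
import socket


def resolve_by_family(host, resolver=socket.getaddrinfo):
    """Resolve `host` separately for IPv4 and IPv6.

    Returns a dict mapping "IPv4" and "IPv6" to a sorted list of addresses.
    An empty list means resolution failed for that family.
    `resolver` is injectable (an assumption of this sketch) so the function
    can be exercised without network access.
    """
    results = {}
    for label, family in (("IPv4", socket.AF_INET), ("IPv6", socket.AF_INET6)):
        try:
            infos = resolver(host, None, family)
        except socket.gaierror:
            # The resolver could not produce an answer for this family.
            results[label] = []
            continue
        # Each entry is (family, type, proto, canonname, sockaddr);
        # sockaddr[0] is the address string for both families.
        results[label] = sorted({info[4][0] for info in infos})
    return results
```

Called as `resolve_by_family("fsp1-malbec.arch.suse.de")` from a machine inside the affected network, an empty "IPv6" list next to a populated "IPv4" list would be consistent with the behaviour described here, where forcing IPv4 with -4 makes ipmitool work.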

Actions #9

Updated by okurz over 2 years ago

  • Status changed from Blocked to Resolved

https://sd.suse.com/servicedesk/customer/portal/1/SD-61278 was resolved. trenninger fixed the IPv4/IPv6 DNS entries in the arch network, so IPMI access to malbec now works again over both IPv4 and IPv6. I confirmed this. No further changes are needed in salt pillars.
