action #99117

malbec 🍷️ is not reachable via ssh or ipmi

Added by cdywan 4 months ago. Updated 4 months ago.

Status: Resolved
Priority: Normal
Assignee:
Target version:
Start date: 2021-09-23
Due date: 2021-10-21
% Done: 0%
Estimated time:
Description

Observation

  • The alert "[Alerting] malbec: host up alert" triggered at 13:41 CEST today.
  • ssh malbec.arch.suse.de is unresponsive.
  • IPMI on fsp1-malbec.arch.suse.de reports "Unable to establish IPMI v2 / RMCP+ session".
  • A chassis reboot fails with the same "Unable to establish IPMI v2 / RMCP+ session" error.

Workaround

ipmitool -4 -I lanplus -C 3 -H fsp1-malbec.arch.suse.de -P $password

History

#1 Updated by cdywan 4 months ago

  • Project changed from QA to openQA Infrastructure

#2 Updated by okurz 4 months ago

I can confirm that IPMI does not work. ipmi-fsp1-malbec.arch yields "Error: Unable to establish IPMI v2 / RMCP+ session". I suggest filing an EngInfra ticket.

#3 Updated by cdywan 4 months ago

  • Status changed from New to Feedback

Somehow malbec came back at 3:38 CEST yesterday, and it looks fine.

#4 Updated by okurz 4 months ago

  • Due date set to 2021-10-07
  • Assignee set to okurz
  • Priority changed from Urgent to High

Please only use "Feedback" with an assignee. Otherwise tickets tend to stay around for ages. mgriessmeier mentioned that gschlotter from EngInfra reported "undefined network issues" in the past days. Let's assume that was it.

I could log in over ssh with ssh malbec.arch and verify that openQA tests are running. Also, https://monitor.qa.suse.de/d/WDmalbec/worker-dashboard-malbec?orgId=1&refresh=1m&from=now-7d&to=now says it's fine, there are no malbec-related alerts on monitor.qa.suse.de, and https://openqa.suse.de/admin/workers says that malbec is up and working on jobs.

The history of https://openqa.suse.de/admin/workers/914 , https://openqa.suse.de/admin/workers/887 , https://openqa.suse.de/admin/workers/898 , https://openqa.suse.de/admin/workers/885 looks fine, with the exception of some incomplete jobs like https://openqa.suse.de/tests/7201511 stating an oddly specific reason: "Reason: api failure: Failed to register at openqa.suse.de - 503 response: <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN"> 503 Service Unavailable Service Unavailable The server is temporarily unable to service your request due to maintenance downtime … ".

I called

host=openqa.suse.de WORKER=malbec failed_since=2021-09-23 openqa-advanced-retrigger-jobs

to handle the incompletes on this host. This retriggered some tests and we should be good.

I wonder, can we handle restarting jobs with such a reason better?
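
As a starting point for that question, the reason text itself could be matched before deciding to restart a job. The following is only a minimal sketch under assumptions: is_transient_api_failure is a hypothetical helper, not part of openQA or openqa-advanced-retrigger-jobs, and the pattern is derived solely from the "api failure ... 503 response" reason quoted above.

```shell
# Hypothetical helper (not an openQA feature): return success when an
# incomplete job's reason text looks like a temporary web UI outage,
# i.e. an "api failure" that mentions a 503 response.
is_transient_api_failure() {
    case $1 in
        *"api failure"*"503"*) return 0 ;;  # e.g. "api failure: Failed to register ... - 503 response"
        *) return 1 ;;                      # any other reason needs a human look
    esac
}

# Example use with a reason string taken from a job:
#   if is_transient_api_failure "$reason"; then echo "safe to retrigger"; fi
```

A plain string match like this would have to be kept in sync with the actual wording of the reasons, so it is better seen as an illustration of the idea than as a robust classifier.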

EDIT: Also IPMI does not work. Created EngInfra ticket for that: https://sd.suse.com/servicedesk/customer/portal/1/SD-61278

#5 Updated by okurz 4 months ago

  • Status changed from Feedback to Blocked

#6 Updated by cdywan 4 months ago

okurz wrote:

-> https://sd.suse.com/servicedesk/customer/portal/1/SD-61278

I can't access that. I guess the automatic addition of team members still isn't working?

#7 Updated by okurz 4 months ago

Yes, it is not working. And that's not even expected, because the solution was only discussed for the automatically created tickets. I have now added osd-admins@suse.de (which is now a user) via the "Share" button, and I have seen the email confirmation. I also created https://sd.suse.com/servicedesk/customer/portal/1/SD-61392 to ask about the general procedure for efficient workflows.

#8 Updated by okurz 4 months ago

  • Description updated
  • Due date changed from 2021-10-07 to 2021-10-21
  • Priority changed from High to Normal

gschlotter responded. It seems that IPv6 DNS resolution fails. A workaround that works is to force IPv4 with -4:
ipmitool -4 -I lanplus -C 3 -H fsp1-malbec.arch.suse.de -P $password

Documented workaround in ticket description, lowering prio.
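
One way to script the workaround above is to try the normal call first and retry with -4 only when that fails. This is a minimal sketch, not something taken from the ticket: ipmi_with_fallback and the IPMITOOL override are hypothetical names, and the chassis power status subcommand in the usage comment is just an example. The connection flags are the ones from the workaround.

```shell
# IPMITOOL can be overridden (e.g. for a dry run); defaults to the real binary.
IPMITOOL=${IPMITOOL:-ipmitool}

# Hypothetical wrapper: $1 is the BMC host name, the remaining arguments are
# the ipmitool subcommand. The password is read from $password, as in the
# workaround.
ipmi_with_fallback() {
    host=$1
    shift
    # First attempt: let ipmitool pick the address family itself.
    if "$IPMITOOL" -I lanplus -C 3 -H "$host" -P "$password" "$@"; then
        return 0
    fi
    # Second attempt: force IPv4, which sidesteps a broken IPv6 DNS entry.
    echo "retrying $host with -4 (IPv4 only)" >&2
    "$IPMITOOL" -4 -I lanplus -C 3 -H "$host" -P "$password" "$@"
}

# Example use:
#   password=... ipmi_with_fallback fsp1-malbec.arch.suse.de chassis power status
```

The first attempt may take a while to time out when the IPv6 path is broken, so calling ipmitool with -4 directly remains the quicker option while the DNS entries are known to be wrong.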

#9 Updated by okurz 4 months ago

  • Status changed from Blocked to Resolved

https://sd.suse.com/servicedesk/customer/portal/1/SD-61278 was resolved. trenninger fixed the IPv4/IPv6 DNS entries in the arch network, so IPMI access to malbec now works (again) over both IPv4 and IPv6. I confirmed this. No further changes are needed in the salt pillars.
