action #99117
closed
malbec 🍷️ is not reachable via ssh or ipmi
0%
Description
Observation
- alert "[Alerting] malbec: host up alert" triggered at 13:41 CEST today - ssh malbec.arch.suse.de isn't responsive.
- ipmi on fsp1-malbec.arch.suse.de says "Unable to establish IPMI v2 / RMCP+ session"
- chassis reboot fails with "Unable to establish IPMI v2 / RMCP+ session"
Workaround
ipmitool -4 -I lanplus -C 3 -H fsp1-malbec.arch.suse.de -P $password
Updated by livdywan about 3 years ago
- Project changed from QA (public) to openQA Infrastructure (public)
Updated by okurz about 3 years ago
I can confirm that IPMI does not work. ipmi-fsp1-malbec.arch yields "Error: Unable to establish IPMI v2 / RMCP+ session". I suggest reporting an EngInfra ticket.
Updated by livdywan about 3 years ago
- Status changed from New to Feedback
Somehow malbec came back at 3:38 CEST yesterday, and it looks fine.
Updated by okurz about 3 years ago
- Due date set to 2021-10-07
- Assignee set to okurz
- Priority changed from Urgent to High
Please only use "Feedback" with an assignee, otherwise tickets tend to stay around for ages. mgriessmeier mentioned that gschlotter from EngInfra reported "undefined network issues" in the past days. Let's assume that was it.
I could log in over ssh with "ssh malbec.arch" and verify that openQA tests are running. Also, https://monitor.qa.suse.de/d/WDmalbec/worker-dashboard-malbec?orgId=1&refresh=1m&from=now-7d&to=now says it's fine, there are no malbec related alerts on monitor.qa.suse.de, and https://openqa.suse.de/admin/workers says that malbec is up and working on jobs. The history of https://openqa.suse.de/admin/workers/914 , https://openqa.suse.de/admin/workers/887 , https://openqa.suse.de/admin/workers/898 , https://openqa.suse.de/admin/workers/885 looks very much ok, with the exception of some incomplete jobs like https://openqa.suse.de/tests/7201511 stating an oddly specific reason: "Reason: api failure: Failed to register at openqa.suse.de - 503 response: <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN"> 503 Service Unavailable Service Unavailable The server is temporarily unable to service your request due to maintenance downtime … ".
I called
host=openqa.suse.de WORKER=malbec failed_since=2021-09-23 openqa-advanced-retrigger-jobs
to handle the incompletes on this host. This retriggered some tests and we should be good.
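For illustration, restarting a known list of job IDs by hand could look roughly like the sketch below. This is not how openqa-advanced-retrigger-jobs itself is implemented; restart_jobs and DRY_RUN are hypothetical names, and the sketch assumes openqa-cli is installed and configured with API credentials for the target host.

```shell
#!/bin/sh
# Sketch only: restart_jobs and DRY_RUN are hypothetical names and not part
# of openqa-advanced-retrigger-jobs.
# Reads one job ID per line from stdin and restarts each job on the host
# given as first argument. Defaults to a dry run that only prints commands.
restart_jobs() {
    host=$1
    while read -r id; do
        # skip empty lines
        [ -n "$id" ] || continue
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo "openqa-cli api --host https://$host -X POST jobs/$id/restart"
        else
            # keep openqa-cli from consuming the ID list on stdin
            openqa-cli api --host "https://$host" -X POST "jobs/$id/restart" < /dev/null
        fi
    done
}
```

With the default dry run, "echo 7201511 | restart_jobs openqa.suse.de" only prints the call it would make; setting DRY_RUN=0 would actually send the restart requests.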
I wonder: can we handle restarting jobs with such a reason better?
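One possible direction, purely as a sketch: match the reason text of an incomplete job against a pattern for this kind of temporary outage and only restart jobs that match. The function name and the pattern are assumptions made up for this example, derived from the single reason string quoted above.

```shell
#!/bin/sh
# Sketch only: the function name and the pattern are illustrative assumptions.
# Succeeds (exit status 0) when the reason of an incomplete job looks like a
# temporary web UI outage, i.e. a failed worker registration with a 503 response.
is_transient_reason() {
    case "$1" in
        *"api failure: Failed to register at "*"503 response"*) return 0 ;;
        *) return 1 ;;
    esac
}
```

A caller could then restart a job only if is_transient_reason "$reason" succeeds and leave all other incompletes for manual review.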
EDIT: Also IPMI does not work. Created EngInfra ticket for that: https://sd.suse.com/servicedesk/customer/portal/1/SD-61278
Updated by okurz about 3 years ago
- Status changed from Feedback to Blocked
Updated by livdywan about 3 years ago
okurz wrote:
-> https://sd.suse.com/servicedesk/customer/portal/1/SD-61278
I can't access that. I guess the automatic addition of team members still isn't working?
Updated by okurz about 3 years ago
Yes, it's not working. And that's not even expected, because the solution was only discussed for the automatically created tickets. I have now added osd-admins@suse.de (which is now a user) via the "Share" button, and I have seen the email confirmation. I also created https://sd.suse.com/servicedesk/customer/portal/1/SD-61392 to ask about the general procedure for efficient workflows.
Updated by okurz about 3 years ago
- Description updated (diff)
- Due date changed from 2021-10-07 to 2021-10-21
- Priority changed from High to Normal
gschlotter responded. Seems like IPv6 DNS resolution fails. Workaround that works:
ipmitool -4 -I lanplus -C 3 -H fsp1-malbec.arch.suse.de -P $password
Documented workaround in ticket description, lowering prio.
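To see whether a host name resolves for both address families, which is the kind of problem the -4 switch sidesteps, a check along the following lines could help. It is only a sketch: lookup, check_dual_stack and the DIG override are names made up here, and it assumes dig is installed. A name that resolves is also no proof that the address is reachable.

```shell
#!/bin/sh
# Sketch only: helper names and the DIG override are assumptions.
DIG=${DIG:-dig}

# Print the addresses returned for one record type (A or AAAA).
# "dig +short" can also print CNAME targets, which end in a dot; drop those.
lookup() {
    "$DIG" +short "$2" "$1" | grep -v '\.$'
}

# Report for each record type whether the name has at least one address.
check_dual_stack() {
    for rtype in A AAAA; do
        if [ -n "$(lookup "$1" "$rtype")" ]; then
            echo "$rtype: ok"
        else
            echo "$rtype: missing"
        fi
    done
}
```

For example, "check_dual_stack fsp1-malbec.arch.suse.de" prints one line per record type.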
Updated by okurz about 3 years ago
- Status changed from Blocked to Resolved
https://sd.suse.com/servicedesk/customer/portal/1/SD-61278 was resolved. trenninger fixed the IPv4/IPv6 DNS entries in the arch network, so IPMI access to malbec now works (again) over both IPv4 and IPv6. I confirmed. No further changes are needed in salt pillars.