action #159270
closedQA - coordination #121720: [saga][epic] Migration to QE setup in PRG2+NUE3 while ensuring availability
QA - coordination #129280: [epic] Move from SUSE NUE1 (Maxtorhof) to new NBG Datacenters
openqaworker-arm-1 is Unreachable size:S
0%
Description
Observation
❯ ping openqaworker-arm-1.qe.nue2.suse.org
PING openqaworker-arm-1.qe.nue2.suse.org (10.168.192.213) 56(84) bytes of data.
From 81.95.8.245 icmp_seq=1 Destination Host Unreachable
From 81.95.8.245 icmp_seq=2 Destination Host Unreachable
From 81.95.8.245 icmp_seq=3 Destination Host Unreachable
The graph shows that the host went down at 2024-04-18 15:32:00.
I think the most relevant graph is https://stats.openqa-monitor.qa.suse.de/d/WDopenqaworker-arm-1/worker-dashboard-openqaworker-arm-1?orgId=1&from=now-12h&to=now&viewPanel=65113
The QA network infrastructure packet loss panel shows 100% packet loss for walter1.qe.nue2.suse.org at 2024-04-18 15:19:00.
Suggestions
- Just recover the machine and ensure it's up again as alert mitigation
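A minimal sketch of such a manual recovery, assuming out-of-band access works; the BMC hostname, user and password below are placeholders (the real values are documented in the salt pillars):

# confirm the host is really down
ping -c 3 openqaworker-arm-1.qe.nue2.suse.org
# check the power state and power-cycle over IPMI
ipmitool -I lanplus -H '<bmc-hostname>' -U '<user>' -P '<password>' chassis power status
ipmitool -I lanplus -H '<bmc-hostname>' -U '<user>' -P '<password>' chassis power cycle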
Out of scope
Updated by ybonatakis 6 months ago
And this also breaks https://gitlab.suse.de/openqa/osd-deployment/-/jobs/2510604
Updated by okurz 6 months ago
- Related to action #159303: [alert] osd-deployment pre-deploy pipeline failed because openqaworker-arm-1.qe.nue2.suse.org was offline size:S added
Updated by okurz 6 months ago
- Related to action #157753: Bring back automatic recovery for openqaworker-arm-1 size:M added
Updated by ybonatakis 6 months ago · Edited
- Status changed from Workable to Feedback
Updated by ybonatakis 6 months ago
- Status changed from Feedback to Resolved
It is also possible to ssh into it.
Updated by okurz 5 months ago
- Status changed from Resolved to In Progress
@ybonatakis you leaked the IPMI passwords in #159270-6. I deleted that comment. Now please update all IPMI passwords as documented in https://gitlab.suse.de/openqa/salt-pillars-openqa/#ipmi-passwords . Please use a pronounceable password. I suggest thinking of a good password based on https://github.com/okurz/scripts/blob/master/xkcdpass-two-word
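For example, a pronounceable two-word passphrase could be generated with xkcdpass; the options here are an assumption and not necessarily what the linked script uses:

# two random words, joined without a delimiter
xkcdpass --numwords 2 --delimiter ''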
Updated by okurz 5 months ago
- Related to action #159318: openqa-piworker host up alert added
Updated by ybonatakis 5 months ago
- Status changed from In Progress to Feedback
Updated by livdywan 5 months ago
- Status changed from Feedback to In Progress
@ybonatakis Please remember you need to address the urgency of the ticket or resolve it immediately.
Updated by ybonatakis 5 months ago
But https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/789 is still open.
Updated by openqa_review 5 months ago
- Due date set to 2024-05-08
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz 5 months ago · Edited
I found the following entries with inconsistencies:
- malbec, you did not change the password, please revert
- imagetester, should be changed
- storage, should be changed
- kerosene, should be changed
- openqaworker{20..28}, should be changed
- openqaworker-arm{21..22}, should be changed
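Once the new passwords are agreed on, the remaining hosts could be updated in a loop roughly like the sketch below; the "-ipmi" BMC hostname pattern and the IPMI user ID 2 are assumptions that need to be verified per machine:

# placeholders: take the real values from the salt pillars
IPMI_USER='<user>'
OLD_PASS='<current password>'
NEW_PASS='<new pronounceable password>'
for host in openqaworker{20..28} openqaworker-arm{21..22}; do
    # hypothetical BMC naming scheme; verify the actual hostname per machine
    ipmitool -I lanplus -H "${host}-ipmi.qe.nue2.suse.org" -U "$IPMI_USER" -P "$OLD_PASS" user set password 2 "$NEW_PASS"
done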
Updated by okurz 5 months ago
- Related to action #159555: IPMI access over IPv6 doesn't work on imagetester - try to update BIOS with physical access size:S added
Updated by ybonatakis 5 months ago
- Status changed from In Progress to Blocked
Updated by ybonatakis 5 months ago
- Due date deleted (2024-05-08)
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/794
Still waiting to get ssh access to update:
openqaworker{20..28}, should be changed
openqaworker-arm{21..22}, should be changed
Updated by nicksinger 5 months ago
I addressed your question in https://suse.slack.com/archives/C02AJ1E568M/p1713963938332169?thread_ts=1713940157.475059&cid=C02AJ1E568M and deleted the stale alerts with:
sqlite3 /var/lib/grafana/grafana.db "$(for RULEID in DzAhcifVk dzA25mfVk Fk0h5iBVk Sk02ciBVk VzA2cif4zz; do echo -n "delete from alert_rule where uid = '$RULEID'; delete from alert_rule_version where rule_uid = '$RULEID'; delete from provenance_type where record_key = '$RULEID'; delete from annotation where text like '%$RULEID%';"; done)"
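Assuming the same grafana.db schema, a quick sanity check that the stale rules are really gone could be:

sqlite3 /var/lib/grafana/grafana.db "select uid from alert_rule where uid in ('DzAhcifVk','dzA25mfVk','Fk0h5iBVk','Sk02ciBVk','VzA2cif4zz');"

An empty result means all five rules were removed.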
Updated by openqa_review 5 months ago
- Due date set to 2024-05-09
Setting due date based on mean cycle time of SUSE QE Tools
Updated by ybonatakis 5 months ago
- Status changed from In Progress to Feedback
- Priority changed from Urgent to High
As the main reported issue has been resolved, I am lowering the priority.
What remains to be done is to update the passwords on:
openqaworker{20..28} and openqaworker-arm{21..22}
My ssh keys are not yet on oqa-jumpy.dmz-prg2.suse.org, so I will have to wait.
Updated by ybonatakis 5 months ago
Updated by nicksinger 5 months ago
@ybonatakis could you please share the SD ticket with our "OSD Admins" group so we can see the progress? Also, can you maybe ask someone from the team to make the required changes? I don't think it is useful if you have to wait several days for a response. Especially for a "Workable" ticket with "High" Priority…
Updated by ybonatakis 5 months ago
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/800
The only machine whose password hasn't been updated yet is kerosene.
Updated by ybonatakis 5 months ago
- Status changed from Workable to Resolved
kerosene was updated as well.
https://suse.slack.com/archives/C02AJ1E568M/p1714466396069729
Updated by livdywan 5 months ago
- Status changed from Resolved to Feedback
openqaworker-arm-1.qe.nue2.suse.org:
Minion did not return. [No response]
The minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:
salt-run jobs.lookup_jid 20240510052716652055
Is this machine broken again?
See https://gitlab.suse.de/openqa/osd-deployment/-/jobs/2589906
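To double check from the salt master, the standard salt commands should be enough (nothing specific to this setup assumed):

# does the minion respond at all?
salt 'openqaworker-arm-1*' test.ping
# which minions does the master currently consider down?
salt-run manage.down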
Updated by ybonatakis 5 months ago
I just checked the CI jobs and there have been no problems since the reopening 4 days ago. The machine was restarted 3 days ago:
12:59:02 up 3 days 13:54, 1 user, load average: 1.39, 1.49, 1.31
most likely from https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/2596329, but I can't tell whether Oliver triggered this himself or it was automation.
Therefore I assume that the recovery works.
livdywan wrote in #note-30:
openqaworker-arm-1.qe.nue2.suse.org:
Minion did not return. [No response]
The minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:
salt-run jobs.lookup_jid 20240510052716652055
Is this machine broken again?
See https://gitlab.suse.de/openqa/osd-deployment/-/jobs/2589906
Updated by okurz 5 months ago · Edited
ybonatakis wrote in #note-31:
I just checked the CI jobs and there have been no problems since the reopening 4 days ago. The machine was restarted 3 days ago:
12:59:02 up 3 days 13:54, 1 user, load average: 1.39, 1.49, 1.31
most likely from https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/2596329, but I can't tell whether Oliver triggered this himself or it was automation.
Therefore I assume that the recovery works.
I have not triggered a recovery manually recently, so I concur: the automatic recovery works in principle. https://gitlab.suse.de/openqa/osd-deployment/-/jobs/2589906 failed as openqaworker-arm-1 did not respond to salt commands in time. That is also what we saw in https://gitlab.suse.de/openqa/osd-deployment/-/jobs/2589864#L326 in the pre-deploy check, which isn't fatal but just informative. I am not sure how we handled the temporary outages of openqaworker-arm-[123] during osd-deployment in the past. We might simply not have seen the problem before and were possibly just lucky/unlucky? I will look into some older tickets to see if something comes up.
EDIT: Nothing came up from older tickets, so except for "moar retries!!!1" covering the recovery periods of openqaworker-arm-1 I don't have further good ideas.
EDIT2: https://gitlab.suse.de/openqa/osd-deployment/-/jobs/2589906 was running at 07:27. There are already retries by gitlab with retry: 2,
but those retries happen quickly one after another. The previous jobs were https://gitlab.suse.de/openqa/osd-deployment/-/jobs/2589904 at 07:25 and https://gitlab.suse.de/openqa/osd-deployment/-/jobs/2589872 at 07:23, so only 4 minutes apart, not enough to allow openqaworker-arm-1 to be up again. We could try retries with customized retry periods. Will propose a solution.
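One possible shape for such customized retry periods, just as a sketch with placeholder command, attempt count and delay: a small helper that waits between attempts long enough for a recovering worker to come back, instead of relying on gitlab's back-to-back retry: 2.

# retry a command a few times with a fixed pause between attempts
retry_with_delay() {
    local attempts=$1 delay=$2
    shift 2
    for ((i = 1; i <= attempts; i++)); do
        "$@" && return 0
        echo "attempt $i/$attempts failed, sleeping ${delay}s"
        sleep "$delay"
    done
    return 1
}

# e.g. allow several minutes for openqaworker-arm-1 to finish its recovery
retry_with_delay 3 300 salt --no-color 'openqaworker-arm-1*' test.ping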
Updated by okurz 5 months ago
- Related to action #41882: all arm worker die after some time added
Updated by okurz 5 months ago
- Related to action #89815: osd-deployment blocked by openqaworker-arm-3 offline and not recovered automatically added
Updated by okurz 5 months ago
- Related to action #95482: openqaworker-arm-3 offline and not automatically recovered due to gitlab CI failures added
Updated by okurz 5 months ago
- Related to action #107074: error on openqaworker-arm-2 failing osd-deployment size:M added
Updated by okurz 5 months ago
- Related to action #151588: [potential-regression] Our salt node up check in osd-deployment never fails size:M added
Updated by okurz 5 months ago
https://gitlab.suse.de/openqa/osd-deployment/-/merge_requests/57 merged. I suggest awaiting at least one successful normal osd-deployment, regardless of whether a machine like openqaworker-arm-1 is currently unavailable during the deployment.