action #159270

closed

QA - coordination #121720: [saga][epic] Migration to QE setup in PRG2+NUE3 while ensuring availability

QA - coordination #129280: [epic] Move from SUSE NUE1 (Maxtorhof) to new NBG Datacenters

openqaworker-arm-1 is Unreachable size:S

Added by ybonatakis 2 months ago. Updated about 1 month ago.

Status: Resolved
Priority: High
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2024-04-19
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Observation

❯ ping openqaworker-arm-1.qe.nue2.suse.org
PING openqaworker-arm-1.qe.nue2.suse.org (10.168.192.213) 56(84) bytes of data.
From 81.95.8.245 icmp_seq=1 Destination Host Unreachable
From 81.95.8.245 icmp_seq=2 Destination Host Unreachable
From 81.95.8.245 icmp_seq=3 Destination Host Unreachable

The graph shows that it went down at 2024-04-18 15:32:00.
I think the most relevant graph is https://stats.openqa-monitor.qa.suse.de/d/WDopenqaworker-arm-1/worker-dashboard-openqaworker-arm-1?orgId=1&from=now-12h&to=now&viewPanel=65113
The "QA network infrastructure packet loss" panel shows walter1.qe.nue2.suse.org at 100 at 2024-04-18 15:19:00.

Suggestions

  • Just recover the machine and ensure it's up again as alert mitigation

Out of scope


Related issues 9 (0 open, 9 closed)

Related to openQA Infrastructure - action #159303: [alert] osd-deployment pre-deploy pipeline failed because openqaworker-arm-1.qe.nue2.suse.org was offline size:S (Resolved, nicksinger, 2024-06-25)

Related to QA - action #157753: Bring back automatic recovery for openqaworker-arm-1 size:M (Resolved, ybonatakis)

Related to openQA Infrastructure - action #159318: openqa-piworker host up alert (Resolved, nicksinger, 2023-08-09)

Related to openQA Infrastructure - action #159555: IPMI access over IPv6 doesn't work on imagetester - try to update BIOS with physical access size:S (Resolved, okurz, 2024-04-24)

Related to openQA Infrastructure - action #41882: all arm worker die after some time (Resolved, okurz, 2018-10-02)

Related to openQA Infrastructure - action #89815: osd-deployment blocked by openqaworker-arm-3 offline and not recovered automatically (Resolved, mkittler, 2021-03-10, 2021-04-22)

Related to openQA Infrastructure - action #95482: openqaworker-arm-3 offline and not automatically recovered due to gitlab CI failures (Resolved, okurz, 2021-07-14)

Related to openQA Infrastructure - action #107074: error on openqaworker-arm-2 failing osd-deployment size:M (Resolved, mkittler, 2022-02-18)

Related to openQA Infrastructure - action #151588: [potential-regression] Our salt node up check in osd-deployment never fails size:M (Rejected, okurz, 2023-11-28)

Actions #2

Updated by tinita 2 months ago

  • Target version changed from Tools - Next to Ready
Actions #3

Updated by okurz 2 months ago

  • Related to action #159303: [alert] osd-deployment pre-deploy pipeline failed because openqaworker-arm-1.qe.nue2.suse.org was offline size:S added
Actions #4

Updated by okurz 2 months ago

  • Related to action #157753: Bring back automatic recovery for openqaworker-arm-1 size:M added
Actions #5

Updated by okurz 2 months ago

  • Subject changed from openqaworker-arm-1 is Unreachable to openqaworker-arm-1 is Unreachable size:S
  • Description updated (diff)
  • Status changed from New to Workable
  • Assignee set to ybonatakis
Actions #6

Updated by ybonatakis 2 months ago · Edited

  • Status changed from Workable to Feedback
Actions #7

Updated by ybonatakis 2 months ago

  • Status changed from Feedback to Resolved

Also possible to ssh into it.

Actions #8

Updated by okurz 2 months ago

  • Status changed from Resolved to In Progress

@ybonatakis you leaked the IPMI passwords in #159270-6. I deleted that comment. Now please update all IPMI passwords as documented in https://gitlab.suse.de/openqa/salt-pillars-openqa/#ipmi-passwords . Please use a pronounceable password. I suggest thinking of a good password based on https://github.com/okurz/scripts/blob/master/xkcdpass-two-word

Actions #9

Updated by okurz 2 months ago

Actions #11

Updated by livdywan 2 months ago

  • Status changed from Feedback to In Progress

@ybonatakis Please remember you need to address the urgency of the ticket or resolve it immediately.

Actions #13

Updated by livdywan 2 months ago

ybonatakis wrote in #note-12:

but https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/789 is still open

Did you see the unanswered questions there?

Actions #14

Updated by openqa_review about 2 months ago

  • Due date set to 2024-05-08

Setting due date based on mean cycle time of SUSE QE Tools

Actions #15

Updated by okurz about 2 months ago · Edited

I found the following entries with inconsistencies:

  1. malbec, you did not change the password, please revert
  2. imagetester, should be changed
  3. storage, should be changed
  4. kerosene, should be changed
  5. openqaworker{20..28}, should be changed
  6. openqaworker-arm{21..22}, should be changed
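
The `{20..28}` and `{21..22}` entries above use shell brace-range notation. As a minimal sketch, the affected hosts could be enumerated like this; the helper name `expand_range` is hypothetical, and whether these short names match the actual inventory names is an assumption:

```python
def expand_range(prefix: str, first: int, last: int) -> list[str]:
    """Mimic shell brace expansion prefix{first..last}, bounds inclusive."""
    return [f"{prefix}{n}" for n in range(first, last + 1)]


# Hosts from list items 5 and 6 (names assumed to follow the brace pattern).
hosts = expand_range("openqaworker", 20, 28) + expand_range("openqaworker-arm", 21, 22)

for host in hosts:
    print(host)
```

This makes it easy to tick off each machine while changing passwords one by one.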
Actions #16

Updated by okurz about 2 months ago

  • Related to action #159555: IPMI access over IPv6 doesn't work on imagetester - try to update BIOS with physical access size:S added
Actions #17

Updated by okurz about 2 months ago

  • Parent task set to #129280
Actions #18

Updated by ybonatakis about 2 months ago

  • Status changed from In Progress to Blocked
Actions #19

Updated by ybonatakis about 2 months ago

  • Status changed from Blocked to In Progress

merged

Actions #20

Updated by ybonatakis about 2 months ago

  • Due date deleted (2024-05-08)

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/794

Still waiting to get ssh access to update
openqaworker{20..28}, should be changed
openqaworker-arm{21..22}, should be changed

Actions #21

Updated by nicksinger about 2 months ago

I addressed your question in https://suse.slack.com/archives/C02AJ1E568M/p1713963938332169?thread_ts=1713940157.475059&cid=C02AJ1E568M and deleted the stale alerts with:

sqlite3 /var/lib/grafana/grafana.db "$(for RULEID in DzAhcifVk dzA25mfVk Fk0h5iBVk Sk02ciBVk VzA2cif4zz; do echo -n "delete from alert_rule where uid = '$RULEID'; delete from alert_rule_version where rule_uid = '$RULEID'; delete from provenance_type where record_key = '$RULEID'; delete from annotation where text like '%$RULEID%';"; done)"
Actions #22

Updated by openqa_review about 2 months ago

  • Due date set to 2024-05-09

Setting due date based on mean cycle time of SUSE QE Tools

Actions #23

Updated by ybonatakis about 2 months ago

  • Status changed from In Progress to Feedback
  • Priority changed from Urgent to High

As the main reported issue has been resolved, I am lowering the priority.
What remains to be done is to update the passwords on
openqaworker{20..28} and openqaworker-arm{21..22}.
My ssh keys are not yet on oqa-jumpy.dmz-prg2.suse.org, so I will have to wait.

Actions #24

Updated by okurz about 2 months ago

  • Status changed from Feedback to Workable

Please create an IT ticket about getting your ssh key deployed on oqa-jumpy.

Actions #26

Updated by nicksinger about 2 months ago

@ybonatakis could you please share the SD ticket with our "OSD Admins" group so we can see the progress? Also, can you maybe ask someone from the team to make the required changes? I don't think it is useful if you have to wait several days for a response. Especially for a "Workable" ticket with "High" Priority…

Actions #27

Updated by ybonatakis about 2 months ago

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/800
The only machine whose password has not been updated is kerosene.

Actions #28

Updated by ybonatakis about 2 months ago

  • Status changed from Workable to Resolved
Actions #29

Updated by okurz about 2 months ago

  • Due date deleted (2024-05-09)
Actions #30

Updated by livdywan about 1 month ago

  • Status changed from Resolved to Feedback
openqaworker-arm-1.qe.nue2.suse.org:
    Minion did not return. [No response]
    The minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:

    salt-run jobs.lookup_jid 20240510052716652055

Is this machine broken again?

See https://gitlab.suse.de/openqa/osd-deployment/-/jobs/2589906

Actions #31

Updated by ybonatakis about 1 month ago

I just checked the CI jobs and there have been no problems since the reopening.
That was 4 days ago. The machine was restarted 3 days ago:

 12:59:02  up 3 days 13:54,  1 user,  load average: 1.39, 1.49, 1.31

most likely by https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/2596329
but I can't tell whether Oliver triggered this himself or whether it was automation.
Therefore I assume that the recovery works.

livdywan wrote in #note-30:

openqaworker-arm-1.qe.nue2.suse.org:
    Minion did not return. [No response]
    The minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:

    salt-run jobs.lookup_jid 20240510052716652055

Is this machine broken again?

See https://gitlab.suse.de/openqa/osd-deployment/-/jobs/2589906

Actions #32

Updated by okurz about 1 month ago · Edited

ybonatakis wrote in #note-31:

I just checked the CI jobs and there have been no problems since the reopening.
That was 4 days ago. The machine was restarted 3 days ago:

 12:59:02  up 3 days 13:54,  1 user,  load average: 1.39, 1.49, 1.31

most likely by https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/2596329
but I can't tell whether Oliver triggered this himself or whether it was automation.
Therefore I assume that the recovery works.

I have not triggered a recovery manually recently so I concur, the automatic recovery works in principle. https://gitlab.suse.de/openqa/osd-deployment/-/jobs/2589906 failed as openqaworker-arm-1 did not respond to salt commands in time. That was also what we saw in https://gitlab.suse.de/openqa/osd-deployment/-/jobs/2589864#L326 in the pre-deploy check, which isn't fatal but just informative. I am not sure how we handled the temporary outages of openqaworker-arm-[123] in the past during osd-deployment. We might not really have seen the problem before and were possibly just lucky/unlucky? I will look into some older tickets to see if something comes up.

EDIT: Nothing came up from older tickets so except for "moar retries!!!1" covering the recovery periods of openqaworker-arm-1 I don't have further good ideas.

EDIT2: https://gitlab.suse.de/openqa/osd-deployment/-/jobs/2589906 was running at 07:27. There are already retries by gitlab with retry: 2 but those retries follow quickly one after another. The previous jobs were https://gitlab.suse.de/openqa/osd-deployment/-/jobs/2589904 at 07:25 and https://gitlab.suse.de/openqa/osd-deployment/-/jobs/2589872 at 07:23, so all attempts fell within only 4 minutes, which is not enough to allow openqaworker-arm-1 to come up again. We could try retries with customized retry periods. Will propose a solution.
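
For illustration only, retries with a customized retry period could look roughly like the following sketch. It is not the change that was later proposed or merged; the function name `wait_until_responsive`, the `check` callable (for example a hypothetical wrapper that tests whether the worker answers salt) and the default of 5 attempts spaced 300 seconds apart are all assumptions chosen to span a reboot rather than 4 minutes:

```python
import time
from typing import Callable


def wait_until_responsive(
    check: Callable[[], bool],
    attempts: int = 5,          # assumed value, not taken from the ticket
    delay_s: float = 300.0,     # assumed pause between attempts, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call check() up to `attempts` times, sleeping delay_s between tries.

    Returns True as soon as check() succeeds, False if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        if check():
            return True
        if attempt < attempts:  # no pointless sleep after the final attempt
            sleep(delay_s)
    return False
```

With these assumed defaults the attempts are spread over about 20 minutes instead of 4, which gives a worker that is being power-cycled a realistic chance to come back before the deployment gives up. The `sleep` parameter is injectable so the loop can be exercised without actually waiting.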

Actions #33

Updated by okurz about 1 month ago

  • Related to action #41882: all arm worker die after some time added
Actions #34

Updated by okurz about 1 month ago

  • Related to action #89815: osd-deployment blocked by openqaworker-arm-3 offline and not recovered automatically added
Actions #35

Updated by okurz about 1 month ago

  • Related to action #95482: openqaworker-arm-3 offline and not automatically recovered due to gitlab CI failures added
Actions #36

Updated by okurz about 1 month ago

  • Related to action #107074: error on openqaworker-arm-2 failing osd-deployment size:M added
Actions #37

Updated by okurz about 1 month ago

  • Related to action #151588: [potential-regression] Our salt node up check in osd-deployment never fails size:M added
Actions #39

Updated by okurz about 1 month ago

https://gitlab.suse.de/openqa/osd-deployment/-/merge_requests/57 merged. I suggest waiting for at least one successful normal osd-deployment, regardless of whether a machine like openqaworker-arm-1 is unavailable during the deployment.

Actions #40

Updated by okurz about 1 month ago

  • Due date deleted (2024-05-28)
  • Status changed from Feedback to Resolved

At least one OSD deployment has passed now, so we should be good again. We don't know if every outage of openqaworker-arm-1 can be covered with this, but at least the normal case didn't break.
