action #94919

closed

All arm workers down 2021-06-30 , NUE SRV2 Rack A8 was switched off by EngInfra size:S

Added by okurz over 3 years ago. Updated over 3 years ago.

Status: Resolved
Priority: Immediate
Assignee:
Category: -
Start date: 2021-06-30
Due date:
% Done: 0%
Estimated time:

Description

Observation

On 2021-06-29 the rack NUE SRV2 A8 was switched off by SUSE EngInfra, see https://chat.suse.de/channel/suse-it-ama?msg=WA9YfB7CumBiPfupu and the following messages. The rack includes openqaworker-arm-2 and openqaworker-arm-3. openqaworker-arm-1 is now off as well.

Acceptance criteria

  • AC1: All three ARM workers are up again

Suggestions

  • Monitor e.g. https://chat.suse.de/channel/suse-it-ama for updates on the rack NUE SRV2 A8 regarding openqaworker-arm-2 and openqaworker-arm-3
  • For openqaworker-arm-1, check if the machine is reachable over IPMI, or contact EngInfra if it is not
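
The reachability check suggested above could be scripted roughly as follows. This is only a minimal sketch, not what was actually run for this ticket: the IPMI hostnames are hypothetical placeholders (the ticket does not state the real BMC addresses), and it assumes the Linux iputils `ping` with its `-c` and `-W` options.

```python
import subprocess

# Hypothetical IPMI hostnames; replace with the real BMC addresses,
# which are not given in this ticket.
IPMI_HOSTS = [
    "openqaworker-arm-1-ipmi.example.org",
    "openqaworker-arm-2-ipmi.example.org",
    "openqaworker-arm-3-ipmi.example.org",
]


def ping_command(host, count=1, timeout_s=2):
    """Build a Linux iputils ping command: `count` probes, `timeout_s` seconds wait."""
    return ["ping", "-c", str(count), "-W", str(timeout_s), host]


def is_reachable(host, run=subprocess.run):
    """Return True if ping to `host` exits with status 0.

    `run` is injectable so the check can be exercised without network access.
    """
    try:
        result = run(
            ping_command(host),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        # The ping binary is missing or could not be started.
        return False
    return result.returncode == 0


def unreachable_hosts(hosts, run=subprocess.run):
    """Return the subset of `hosts` that did not answer a ping."""
    return [host for host in hosts if not is_reachable(host, run=run)]
```

Calling `unreachable_hosts(IPMI_HOSTS)` would return the hosts that do not answer. A non-empty result would be the cue to contact EngInfra.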

Related issues 2 (0 open, 2 closed)

Related to openQA Infrastructure (public) - action #94940: multiple network related problems, gitlab CI pipelines not working, workers not reachable, proxySCC not reachable (Status: Resolved, Assignee: okurz, Start date: 2021-06-30, Due date: 2021-07-07)

Related to openQA Infrastructure (public) - action #94949: Failed systemd services alert for openqaworker3 var-lib-openqa-share.automount (Status: Resolved, Assignee: okurz, Start date: 2021-06-30)

Actions #1

Updated by okurz over 3 years ago

  • Subject changed from All arm workers down 2021-06-30 , NUE SRV2 Rack A8 was switched off by EngInfra to All arm workers down 2021-06-30 , NUE SRV2 Rack A8 was switched off by EngInfra size:S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #2

Updated by okurz over 3 years ago

  • Related to action #94940: multiple network related problems, gitlab CI pipelines not working, workers not reachable, proxySCC not reachable added
Actions #3

Updated by livdywan over 3 years ago

  • Status changed from Workable to In Progress
  • Assignee set to dheidler
Actions #4

Updated by dheidler over 3 years ago

  • Status changed from In Progress to Blocked

We tried accessing the IPMI console but for all three workers we can't even ping the IPMI host.

Created a ticket with infra: https://infra.nue.suse.com/SelfService/Display.html?id=191475

Actions #5

Updated by okurz over 3 years ago

  • Related to action #94949: Failed systemd services alert for openqaworker3 var-lib-openqa-share.automount added
Actions #6

Updated by dzedro over 3 years ago

The ARM workers are still down. There is an infra ticket, but I don't see any activity there either.
Can somebody provide an update (if there is any news) or push to get this fixed?

Actions #7

Updated by okurz over 3 years ago

Yeah, I already asked yesterday but there was no update. I have now written in the EngInfra ticket with the following text:

"Hi, unfortunately this ticket did not receive an update yesterday. You mentioned that you would investigate yesterday. Can you please provide an update and also an estimation when we will have access to the machines openqaworker-arm-1/2/3 again as these machines are part of our critical test infrastructure."

I have also written in https://chat.suse.de/channel/suse-it-ama?msg=5f92J6mF2bWt76FSe now.

Actions #8

Updated by okurz over 3 years ago

Update in https://chat.suse.de/channel/suse-it-ama?msg=KhcQFuiuiZSN4Zu5S

Evzenie Sujskaja @esujskaja all Please be kindly informed, that next week EngInfra team will be working in a low headcount due to bank holidays in CZ and vacations. Some delays are possible, especially for SRV-related tasks. Main focus will operational stable functioning. Thank you for your understanding..
Oliver Kurz @okurz Evzenie Sujskaja good for letting us know. Thank you. Can you help us get an estimate on when machines from the racks NUE-SRV2-A7+8 can be accessible for us again? See https://infra.nue.suse.com/SelfService/Display.html?id=191475
Evzenie Sujskaja @esujskaja Oliver Kurz it's in progress now, fatma ghariani is opening port by port and following the traffic. We should be very careful - we don't know what provoked the storm. We set up storm control on the port - but it can't give 100% safety; and our network is quite vulnerable for such issues. We will do our best to make it back available today or beg next week.
Oliver Kurz @okurz today or next week, thank you

Actions #9

Updated by okurz over 3 years ago

Provided an update in https://infra.nue.suse.com/SelfService/Display.html?id=191475#txn-2929933. All three machines are reachable over IPMI but have no network link.

Actions #10

Updated by dheidler over 3 years ago

Update on the INFRA ticket from Fatma Ghariani:

All openQAworkers on that rack are back to UP.
Actions #11

Updated by dheidler over 3 years ago

  • Status changed from Blocked to Resolved

I was able to log in to all three workers via SSH and there are jobs running on all of them.
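
For future incidents, a quick first check before logging in manually could look like the sketch below. It is an illustration only, under these assumptions: the short hostnames from this ticket resolve in the local network, and SSH listens on the default port 22. An open port shows that the network link is back. It does not confirm that login works or that jobs are running.

```python
import socket

# Short hostnames as used in this ticket; resolving them is assumed
# to work in the local network (the domain is not stated here).
WORKERS = [
    "openqaworker-arm-1",
    "openqaworker-arm-2",
    "openqaworker-arm-3",
]


def ssh_port_open(host, port=22, timeout_s=3.0, connect=socket.create_connection):
    """Return True if a TCP connection to host:port succeeds within timeout_s.

    `connect` is injectable so the function can be tested without a network.
    """
    try:
        with connect((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS resolution failures.
        return False


def workers_up(hosts, connect=socket.create_connection):
    """Map each host to whether its SSH port accepted a connection."""
    return {host: ssh_port_open(host, connect=connect) for host in hosts}
```

`workers_up(WORKERS)` returns a dictionary such as `{"openqaworker-arm-1": True, ...}`, so any worker mapped to `False` still needs attention.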
