action #96938

closed

openqaworker10+13 are offline, reason unknown, let's fix other problems first size:M

Added by okurz over 2 years ago. Updated over 2 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
Start date:
2021-08-16
Due date:
% Done:

0%

Estimated time:


Related issues 1 (0 open, 1 closed)

Related to openQA Project - action #96260: Failed to add GRE tunnel to openqaworker10 on most OSD workers, recent regression explaining multi-machine errors? size:M - Resolved - dheidler - 2021-07-29

Actions #1

Updated by okurz over 2 years ago

  • Assignee set to okurz
  • Target version set to Ready

pausing alerts

Actions #2

Updated by okurz over 2 years ago

  • Assignee deleted (okurz)

paused alerts, removed keys from salt.

Actions #3

Updated by okurz over 2 years ago

  • Related to action #96260: Failed to add GRE tunnel to openqaworker10 on most OSD workers, recent regression explaining multi-machine errors? size:M added
Actions #4

Updated by okurz over 2 years ago

As dheidler found out, the MC for w10 is not reachable. Please create an infra ticket, CC osd-admins@suse.de, take #96938, and set it to blocked on the infra ticket. I suggest keeping openqaworker10 powered down for now. I prefer to have fewer but consistent jobs to reduce the variables.

Actions #5

Updated by livdywan over 2 years ago

  • Subject changed from openqaworker10+13 are offline, reason unknown, let's fix other problems first to openqaworker10+13 are offline, reason unknown, let's fix other problems first size:M
  • Status changed from New to Workable

okurz wrote:

As dheidler found out, the MC for w10 is not reachable. Please create an infra ticket, CC osd-admins@suse.de, take #96938, and set it to blocked on the infra ticket. I suggest keeping openqaworker10 powered down for now. I prefer to have fewer but consistent jobs to reduce the variables.

https://infra.nue.suse.com/SelfService/Display.html?id=195815

Alerts were enabled again, machines seem to work?

Actions #6

Updated by mkittler over 2 years ago

  • Status changed from Workable to In Progress
  • Assignee set to mkittler
Actions #7

Updated by mkittler over 2 years ago

  • Status changed from In Progress to Feedback

I've powered on openqaworker10 (because I wanted to look into #96507#note-9). This means both workers are now back online. I've been updating the workers so at least the workaround for #96710 is deployed. Apart from some initial jobs that still ran with the old version, it looks good so far.

I've also unpaused the alerts.


The reason for the outage was stated in the infra ticket:

There was an issue with one of the USV (UPS) units around the time you pointed out. Should be restored now.

I assume this also applies to openqaworker13 but I've asked in the ticket to be sure.

Actions #8

Updated by mkittler over 2 years ago

  • Status changed from Feedback to Resolved

The openqaworker10 host up alert has just fired again, but IPMI and SSH still work. I assume someone on the team took out the worker intentionally (possibly to work on #96260). It would have been nice to pause the alert before doing that, though.

While both workers were actually running jobs, they seemed to work fine (not producing more incompletes than other workers in the same time frame). I also got a response in the infra ticket, and it is indeed very likely that the same applies to openqaworker13. So I'm resolving the ticket now.
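
For anyone who wants to repeat the "alert fired but SSH still works" check by hand, here is a minimal sketch of a TCP probe. The function name `tcp_port_open` and the timeout are illustrative assumptions, not part of any team tooling. It only covers TCP services such as SSH on port 22. IPMI over LAN typically uses UDP, so this probe does not cover it.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established.

    Hypothetical helper for a quick manual check; it does not authenticate
    or verify that the service behind the port is healthy.
    """
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


# Example call (host name taken from this ticket, port 22 assumed for SSH):
# tcp_port_open("openqaworker10.suse.de", 22)
```

A successful probe only shows that the port accepts connections, so it complements the host up alert rather than replacing it.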

Actions #9

Updated by okurz over 2 years ago

  • Status changed from Resolved to Feedback

Please read all comments and the ticket description. The salt keys are currently not part of the "accepted" group.

Actions #10

Updated by mkittler over 2 years ago

They were and both workers were working fine. @dheidler only removed openqaworker10 again later for #96260. I guess we don't need to re-open this ticket every time someone takes out the worker again in the future.

Actions #11

Updated by okurz over 2 years ago

Yes, but sudo salt-key -L says:

Unaccepted Keys:
openqaworker10.suse.de
openqaworker12.suse.de
openqaworker13.suse.de

So openqaworker13 is also still missing from the accepted salt keys?
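
As a rough illustration of how the listing above could be checked automatically, the following sketch groups minion IDs by the section header they appear under. The helper name `parse_salt_key_list` is hypothetical, and the sample text is only the excerpt quoted in this comment, not the full command output.

```python
from typing import Dict, List


def parse_salt_key_list(output: str) -> Dict[str, List[str]]:
    """Group minion IDs by the section header ("... Keys:") above them."""
    groups: Dict[str, List[str]] = {}
    current = None
    for raw in output.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":"):
            # A header such as "Unaccepted Keys:" starts a new group.
            current = line[:-1]
            groups[current] = []
        elif current is not None:
            groups[current].append(line)
    return groups


# Excerpt of the `sudo salt-key -L` output quoted in this ticket.
SAMPLE = """\
Unaccepted Keys:
openqaworker10.suse.de
openqaworker12.suse.de
openqaworker13.suse.de
"""

unaccepted = parse_salt_key_list(SAMPLE).get("Unaccepted Keys", [])
print(unaccepted)
```

With the excerpt above, `unaccepted` contains all three workers, which matches the observation that openqaworker13 has not been accepted yet.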

Actions #12

Updated by mkittler over 2 years ago

Not sure when openqaworker13 was taken out and for what reason. I definitely did not take it out due to the host being offline; it was likely taken out earlier and by somebody else. I can nevertheless add it back. The worker slots/services are already active anyway.

Actions #13

Updated by mkittler over 2 years ago

  • Status changed from Feedback to Resolved

The workers are back online and seem to work.
