action #96938
closed
openqaworker10+13 are offline, reason unknown, let's fix other problems first size:M
0%
Description
Observation
https://gitlab.suse.de/openqa/osd-deployment/-/jobs/531244 failed due to openqaworker10 and openqaworker13 being offline. Alerts also showed this in https://stats.openqa-monitor.qa.suse.de/d/WDopenqaworker10/worker-dashboard-openqaworker10?orgId=1&editPanel=65105&tab=alert&refresh=1m and https://stats.openqa-monitor.qa.suse.de/d/WDopenqaworker13/worker-dashboard-openqaworker13?orgId=1&editPanel=65105&tab=alert&refresh=1m
Tasks
- Bring the machines back into the salt-controlled infrastructure, see https://progress.opensuse.org/projects/openqav3/wiki/#Bring-back-machines-into-production
- Unpause alerts
- Verify green state
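The first task can be sketched as a short command sequence. This is only an illustration under assumptions: the authoritative procedure is the wiki section linked above, and the particular sequence here (accept the key, ping the minion, apply states) is not taken from this ticket. The helper name bring_back_commands is hypothetical; the salt-key and salt invocations use standard Salt CLI options.

```python
import shlex


def bring_back_commands(minion_id):
    """Return the commands one might run on the salt master for one worker.

    Assumption: this sequence is a generic Salt workflow, not the exact
    procedure from the openQA wiki.
    """
    return [
        # Accept the pending key without an interactive confirmation.
        ["sudo", "salt-key", "-y", "-a", minion_id],
        # Check that the minion answers the master.
        ["sudo", "salt", minion_id, "test.ping"],
        # Apply the configured states to the minion.
        ["sudo", "salt", minion_id, "state.apply"],
    ]


if __name__ == "__main__":
    # Only print the commands so they can be reviewed before running anything.
    for command in bring_back_commands("openqaworker10.suse.de"):
        print(shlex.join(command))
```

Printing instead of executing keeps the sketch safe to run anywhere; the commands themselves only make sense on the salt master.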
Updated by okurz over 3 years ago
- Related to action #96260: Failed to add GRE tunnel to openqaworker10 on most OSD workers, recent regression explaining multi-machine errors? size:M added
Updated by okurz over 3 years ago
As dheidler found out, the MC for w10 is not reachable. Please create an infra ticket, CC osd-admins@suse.de, take #96938 and set it to blocked on the infra ticket. I suggest keeping openqaworker10 powered down at first. I prefer to have fewer but consistent jobs to reduce the variables.
Updated by livdywan over 3 years ago
- Subject changed from openqaworker10+13 are offline, reason unknown, let's fix other problems first to openqaworker10+13 are offline, reason unknown, let's fix other problems first size:M
- Status changed from New to Workable
okurz wrote:
As dheidler found out, the MC for w10 is not reachable. Please create an infra ticket, CC osd-admins@suse.de, take #96938 and set it to blocked on the infra ticket. I suggest keeping openqaworker10 powered down at first. I prefer to have fewer but consistent jobs to reduce the variables.
https://infra.nue.suse.com/SelfService/Display.html?id=195815
Alerts were enabled again, machines seem to work?
Updated by mkittler over 3 years ago
- Status changed from Workable to In Progress
- Assignee set to mkittler
Updated by mkittler over 3 years ago
- Status changed from In Progress to Feedback
I've powered on openqaworker10 (because I wanted to look into #96507#note-9). This means both workers are now back online. I've updated the workers so that at least the workaround for #96710 is deployed. Apart from some initial jobs which still ran with the old version, it looks good so far.
I've also unpaused the alerts.
The reason for the outage was stated in the infra ticket:
There was an issue with one of the USV units around the time you pointed out. Should be restored now.
I assume this also applies to openqaworker13 but I've asked in the ticket to be sure.
Updated by mkittler over 3 years ago
- Status changed from Feedback to Resolved
The openqaworker10 host up alert has just fired again, but IPMI and SSH still work. I assume someone on the team took the worker out intentionally (possibly to work on #96260). It would have been nice to pause the alert before doing that, though.
While both workers were actually running jobs, they seemed to work fine (not producing more incompletes than other workers in the same time frame). I also got a response in the infra ticket, and it is indeed very likely that the same cause applies to openqaworker13. So I'm resolving the ticket now.
Updated by okurz over 3 years ago
- Status changed from Resolved to Feedback
Please read all comments and the ticket description. The salt keys are currently not part of the "accepted" group.
Updated by okurz over 3 years ago
Yes. But sudo salt-key -L says:
Unaccepted Keys:
openqaworker10.suse.de
openqaworker12.suse.de
openqaworker13.suse.de
so openqaworker13 is also still missing from salt keys?
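A check like this can be scripted. The following is a minimal sketch, assuming the plain-text layout of salt-key -L in which each section starts with a heading ending in "Keys:" followed by one minion ID per line. The function names parse_salt_key_list and missing_workers are hypothetical, and the sample text is just the excerpt quoted above.

```python
def parse_salt_key_list(output):
    """Group minion IDs from `salt-key -L` text output by section heading."""
    groups = {}
    current = None
    for raw_line in output.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.endswith("Keys:"):
            # A new section starts, e.g. "Unaccepted Keys:".
            current = line[:-1]
            groups[current] = []
        elif current is not None:
            groups[current].append(line)
    return groups


def missing_workers(output, expected):
    """Return the expected hosts that are still listed as unaccepted."""
    unaccepted = set(parse_salt_key_list(output).get("Unaccepted Keys", []))
    return sorted(host for host in expected if host in unaccepted)


# Excerpt of the output quoted in this ticket.
SAMPLE = """\
Unaccepted Keys:
openqaworker10.suse.de
openqaworker12.suse.de
openqaworker13.suse.de
"""

if __name__ == "__main__":
    hosts = ["openqaworker10.suse.de", "openqaworker13.suse.de"]
    print(missing_workers(SAMPLE, hosts))
```

With the excerpt above, both openqaworker10 and openqaworker13 are reported as still waiting for key acceptance.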
Updated by mkittler over 3 years ago
I'm not sure when openqaworker13 was taken out or for what reason. I definitely did not take it out because of the host being offline; it was likely taken out earlier by somebody else. I can nevertheless bring it back. The worker slots/services are already active anyway.
Updated by mkittler over 3 years ago
- Status changed from Feedback to Resolved
The workers are back online and seem to work.