action #96938
closed
- Assignee set to okurz
- Target version set to Ready
Paused alerts, removed keys from salt.
- Related to action #96260: Failed to add GRE tunnel to openqaworker10 on most OSD workers, recent regression explaining multi-machine errors? size:M added
As dheidler found out, the MC for w10 is not reachable. Please create an infra ticket, CC osd-admins@suse.de, take #96938 and set it to blocked on the infra ticket. I suggest keeping openqaworker10 powered down for now. I prefer to have fewer but consistent jobs to reduce the variables.
- Subject changed from openqaworker10+13 are offline, reason unknown, let's fix other problems first to openqaworker10+13 are offline, reason unknown, let's fix other problems first size:M
- Status changed from New to Workable
- Status changed from Workable to In Progress
- Assignee set to mkittler
- Status changed from In Progress to Feedback
I've powered on openqaworker10 (because I wanted to look into #96507#note-9). This means both workers are now back online. I've updated the workers so that at least the workaround for #96710 is deployed. Apart from some initial jobs which still ran with the old version, it looks good so far.
I've also unpaused the alerts.
The reason for the outage was stated in the infra ticket:
There was an issue with one of the UPS (USV) units around the time you pointed out. Should be restored now.
I assume this also applies to openqaworker13 but I've asked in the ticket to be sure.
- Status changed from Feedback to Resolved
The openqaworker10 host up alert has just fired again, but IPMI and SSH still work. I assume someone on the team took the worker out intentionally (possibly to work on #96260). It would have been nice to pause the alert before doing that, though.
While both workers were actually running jobs, they seemed to work fine (not producing more incompletes than other workers in the same time frame). I also got a response in the infra ticket, and it is indeed very likely that the same cause applies to openqaworker13. So I'm resolving the ticket now.
- Status changed from Resolved to Feedback
Please read all comments and the ticket description. The salt keys are currently not part of the "accepted" group.
They were and both workers were working fine. @dheidler only removed openqaworker10 again later for #96260. I guess we don't need to re-open this ticket every time someone takes out the worker again in the future.
Yes, but sudo salt-key -L says:
Unaccepted Keys:
openqaworker10.suse.de
openqaworker12.suse.de
openqaworker13.suse.de
so openqaworker13 is also still missing from salt keys?
Not sure when openqaworker13 was taken out and for what reason. I definitely did not take it out due to the host being offline; it was likely taken out before and by somebody else. I can nevertheless take it back. The worker slots/services are already active anyways.
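For reference, a minimal sketch of how the unaccepted minions could be pulled out of the salt-key -L listing quoted above. The helper name unaccepted_keys is hypothetical and not part of salt; it only assumes the section layout shown in this ticket ("Unaccepted Keys:" followed by one minion ID per line).

```shell
# Hypothetical helper: read `salt-key -L` output on stdin and print only
# the minion IDs listed under the "Unaccepted Keys:" section.
unaccepted_keys() {
    awk '
        # Every "<Something> Keys:" header starts a new section; remember
        # whether the section we just entered is the unaccepted one.
        /^[A-Za-z]+ Keys:$/ { in_section = ($0 == "Unaccepted Keys:"); next }
        # Inside that section, print each non-empty line (one minion ID).
        in_section && NF { print $1 }
    '
}

# Possible usage on the salt master (not executed here):
#   sudo salt-key -L | unaccepted_keys
#   sudo salt-key -y -a openqaworker13.suse.de   # accept one key again
```

Whether a listed key should be accepted again is still a manual decision, since a worker may have been taken out on purpose (as with openqaworker10 for #96260).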
- Status changed from Feedback to Resolved
The workers are back online and seem to work.