action #103128

Updated by okurz over 2 years ago

## Observation 

 For example, the recent recovery of arm-1 failed due to the usual network problems during a maintenance window: 

 ``` 
 fatal: unable to access 'https://gitlab.suse.de/openqa/grafana-webhook-actions.git/': Could not resolve host: gitlab.suse.de 
 ``` 
 (https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/711927) 

 This left the worker unrecovered until the problem became apparent during the OSD deployment the next day. The notification mail about the failed pipeline was visible, and all that was needed was to restart the pipeline after the maintenance window, but this was simply forgotten. It also did not help that no mail was sent for the firing long-term alert. 

 Note that *blindly* retriggering the pipeline later would not be ideal: if the worker has already been recovered in the meantime, the retrigger would needlessly power-cycle it. 
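 A minimal sketch of such a guard, assuming the pipeline can somehow check whether the worker is back before acting. The names `recover_if_needed`, `is_worker_online` and `power_cycle` are hypothetical placeholders, not existing functions in grafana-webhook-actions:

```python
from typing import Callable


def recover_if_needed(
    is_worker_online: Callable[[], bool],
    power_cycle: Callable[[], None],
) -> bool:
    """Power-cycle the worker only if it is still offline.

    Both callables are hypothetical hooks: `is_worker_online` stands in for
    whatever health check is available, `power_cycle` for the recovery action.
    Returns True if a power cycle was triggered, False if it was skipped.
    """
    if is_worker_online():
        # Worker already recovered in the meantime: do nothing.
        return False
    power_cycle()
    return True
```

 With a check like this in place, a delayed retrigger becomes harmless because it does nothing when the worker is already up.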

 ## Acceptance criteria 
 * **AC1:** The GitLab CI pipeline does not fail while gitlab.suse.de is not accessible (e.g. it is not triggered at all during this time, or it is retried, or the outage is worked around) 

 ## Suggestions 
 * Research if we can prevent Grafana from triggering the pipeline during the SUSE IT maintenance window 
 * Research if we can retry until the git repo can be reached again 
 * Currently the long-term alert https://monitor.qa.suse.de/d/1bNU0StZz/automatic-actions?tab=alert&editPanel=5&orgId=1 does not have a notification target configured. Research in our git history why we did not enable this. Maybe we can just enable it and send email to osd-admins@suse.de after we also try to create tickets automatically in the CI pipeline. 
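
 The "retry until the git repo can be reached again" suggestion could look roughly like the following sketch. It assumes that `git ls-remote` succeeding is a good enough reachability probe; the helper names, attempt count and delays are illustrative assumptions, not values taken from the existing pipeline:

```python
import subprocess
import time
from typing import Callable

# Repository URL as it appears in the failed job log.
REPO_URL = "https://gitlab.suse.de/openqa/grafana-webhook-actions.git/"


def repo_reachable(url: str = REPO_URL, timeout: float = 30.0) -> bool:
    """Return True if `git ls-remote` can list the refs of the repository."""
    try:
        result = subprocess.run(
            ["git", "ls-remote", url],
            capture_output=True,
            timeout=timeout,
            check=False,
        )
    except (subprocess.TimeoutExpired, OSError):
        # Timed out, or git is not installed: treat as not reachable.
        return False
    return result.returncode == 0


def wait_until(
    check: Callable[[], bool],
    attempts: int = 6,
    initial_delay: float = 60.0,
    backoff: float = 2.0,
    max_delay: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `check` up to `attempts` times with exponential backoff.

    Returns True as soon as `check` succeeds, False if all attempts fail.
    """
    delay = initial_delay
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            sleep(delay)
            delay = min(delay * backoff, max_delay)
    return False
```

 A job could call `wait_until(repo_reachable)` before cloning and only proceed with the recovery once it returns True. The `sleep` parameter is injectable so the loop can be tested without actually waiting.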

 ## Further details 
 The decommissioning of the CAASP cluster is tracked in the EngInfra ticket https://jira.suse.com/browse/ENGINFRA-705
