action #107671

Updated by okurz about 2 years ago

## Observation 

Seems to be a different issue than #106179 since the dashboard is accessible this time.

Link to the list of aggregate runs of the day:

 https://openqa.suse.de/tests/overview?arch=&flavor=&machine=&test=&modules=&module_re=&groupid=366&groupid=308&groupid=232&groupid=165&groupid=280&groupid=218&groupid=108&groupid=54&groupid=405&groupid=412&groupid=411&groupid=369&groupid=352&groupid=353&groupid=357&groupid=355&groupid=354&groupid=358&groupid=370&groupid=348&groupid=349&groupid=351&groupid=356&groupid=375&groupid=376&groupid=397&groupid=414&build=20220228-1# 
 (This was showing an empty list at that point) 

 Impact: update approval blocked 

 ## Suggestions 
* Likely caused by downtime of http://download.suse.de
* Read the suggestions from #105603
* Some GitLab CI steps are failing, but we allow them to fail so that other steps can continue, e.g. in https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/886067 "sync smelt" fails but we allow it to fail so that "sync incidents" can continue. As a consequence we also do not receive an alert about it, and there is no sufficient retrying. We could split the steps into separate pipelines, make each step fatal, and add a configurable number of retries and a configurable interval between retries customized for each step in https://gitlab.suse.de/qa-maintenance/bot-ng/-/blob/master/.gitlab-ci.yml, e.g. retrying "sync smelt" long enough to cover the weekly SUSE IT maintenance window and less for other critical steps (see the first sketch after this list)
* For retrying we would not even need to change qem-bot; we could use a wrapper in the GitLab CI job itself, e.g. https://github.com/okurz/leaky_bucket_error_count (see the second sketch below)
* Also look into GitLab CI options to either abort a previous pipeline when a new one is triggered or to not start new pipelines while old ones are still running (see the third sketch below)
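
A minimal sketch of what splitting into fatal, individually retried jobs could look like in `.gitlab-ci.yml`. The job names and `qem-bot` invocations are assumptions for illustration, not the actual bot-ng CLI; note that GitLab's built-in `retry:` keyword allows at most 2 retries and no interval between them, so it only covers the simple cases:

```yaml
# Hypothetical sketch: one fatal job per step instead of allow_failure steps.
sync-smelt:
  stage: sync
  script:
    - python3 qem-bot.py sync-smelt  # assumed invocation
  retry:
    max: 2               # built-in retry is capped at 2 attempts
    when: script_failure
  # no allow_failure: a failure here fails the pipeline and can alert us

sync-incidents:
  stage: sync
  script:
    - python3 qem-bot.py sync-incidents  # assumed invocation
  retry:
    max: 2
    when: script_failure
```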
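
Where a longer retry window with a configurable interval is needed, e.g. to ride out the weekly SUSE IT maintenance window for "sync smelt", a shell retry loop in the job script would work without changing qem-bot. This is a generic loop in the spirit of the linked leaky_bucket_error_count wrapper, not its actual interface; the attempt count, interval, and command are assumptions:

```yaml
sync-smelt:
  stage: sync
  variables:
    RETRIES: "12"          # assumed: up to 12 attempts
    RETRY_INTERVAL: "600"  # assumed: 10 minutes between attempts
  script:
    # Retry until success or until attempts are exhausted; exhausting them
    # fails the job and therefore the pipeline.
    - |
      for i in $(seq 1 "$RETRIES"); do
        python3 qem-bot.py sync-smelt && exit 0  # assumed invocation
        echo "Attempt $i/$RETRIES failed, sleeping ${RETRY_INTERVAL}s"
        sleep "$RETRY_INTERVAL"
      done
      exit 1
```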
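
For the last point, GitLab CI has two relevant mechanisms: `interruptible: true` lets a newer pipeline cancel still-running jobs of an older one (when the project's "Auto-cancel redundant pipelines" setting is enabled, and depending on how the pipelines are triggered), while `resource_group` serializes jobs so that a new run waits until the previous one has finished. A sketch, again with an assumed job name:

```yaml
sync-incidents:
  interruptible: true          # a newer pipeline may cancel this job
  resource_group: bot-ng-sync  # at most one such job runs at a time
  script:
    - python3 qem-bot.py sync-incidents  # assumed invocation
```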
