action #107671
closed
No aggregate maintenance runs scheduled today on osd size:M
0%
Description
Observation
Seems a different issue than #106179 since the dashboard is accessible this time.
Link to list aggregate runs of the day:
Impact: update approval blocked
Suggestions
- caused by downtime of http://download.suse.de
- read suggestions from #105603
- Some GitLab CI steps fail, but we allow them to fail so that other steps can continue. For example, in https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/886067 "sync smelt" fails and is allowed to fail so that "sync incidents" can continue, but we receive no alert about the failure and there is not sufficient retrying. We could split the steps into separate pipelines, make each step fatal, and add a configurable number of retries and interval between retries, customized for each step, in https://gitlab.suse.de/qa-maintenance/bot-ng/-/blob/master/.gitlab-ci.yml, e.g. for "sync smelt" retrying long enough to cover the weekly SUSE IT maintenance window, and less for other critical steps
- For retrying we do not even need to change qem-bot; we could just use a wrapper in the GitLab CI job itself, e.g. https://github.com/okurz/leaky_bucket_error_count
- Also look into GitLab CI options to either abort a previous pipeline if a new one is triggered, or to not start new ones as long as old ones are still running
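The retry idea from the suggestions above could be sketched as a small wrapper script that the CI job calls instead of the step itself. This is only an illustration under assumptions: the function name `run_with_retries`, its parameters, and the default values are hypothetical and are not taken from qem-bot, bot-ng, or leaky_bucket_error_count.

```python
import subprocess
import sys
import time


def run_with_retries(command, retries=3, interval=60.0, sleep=time.sleep):
    """Run `command` until it exits with 0 or all attempts are used up.

    Hypothetical helper: `retries` is the number of additional attempts
    after the first one, `interval` the pause in seconds between attempts.
    Returns the exit code of the last attempt, so a persistent failure
    stays fatal for the calling CI job.
    """
    attempts = retries + 1
    returncode = 1
    for attempt in range(1, attempts + 1):
        returncode = subprocess.call(command)
        if returncode == 0:
            return 0
        if attempt < attempts:
            print(
                f"attempt {attempt}/{attempts} failed with exit code "
                f"{returncode}, retrying in {interval}s",
                file=sys.stderr,
            )
            sleep(interval)
    return returncode


if __name__ == "__main__" and len(sys.argv) > 1:
    # Everything after the script name is treated as the command to run.
    sys.exit(run_with_retries(sys.argv[1:]))
```

Each step could then get its own `retries` and `interval` values, for example a long total retry window for "sync smelt" and a shorter one for other steps. The concrete numbers would have to be chosen per step.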
Updated by okurz over 2 years ago
- Related to action #106179: No aggregate maintenance runs scheduled today on osd - dashboard.qem.suse.de down size:S added
Updated by osukup over 2 years ago
- Status changed from New to Resolved
caused by downtime of http://download.suse.de
Updated by okurz over 2 years ago
- Status changed from Resolved to Feedback
- Assignee set to okurz
- Priority changed from Immediate to High
- Target version set to Ready
osukup, thank you for handling this so urgently. Please keep in mind our process best practice of finding at least one additional improvement.
Updated by okurz over 2 years ago
- Status changed from Feedback to New
- Assignee deleted (okurz)
OK, maybe it wasn't clear. I asked a question and was waiting for feedback, hence assigning the ticket to myself. Because it was never assigned to anyone else, I wanted to avoid assigning it to individual members of the team. You know, don't shoot the messenger :) But I think I can make the ticket clearer by setting it back to "New". Another question is whether all members of our team know about our best practice of finding at least one additional improvement.
Updated by livdywan over 2 years ago
- Subject changed from No aggregate maintenance runs scheduled today on osd to No aggregate maintenance runs scheduled today on osd size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by okurz over 2 years ago
- Related to action #105603: openQABot pipeline failed: "ERROR:root:Something bad happended during reading MR data from SMELT/IBS: Expecting value: line 4 column 1 (char 3)" size:M added
Updated by jbaier_cz over 2 years ago
- Status changed from Workable to In Progress
- Assignee set to jbaier_cz
Updated by jbaier_cz over 2 years ago
- Status changed from In Progress to Feedback
The improvement as discussed today in the meeting is implemented inside https://gitlab.suse.de/qa-maintenance/bot-ng/-/merge_requests/50; please see the more detailed explanation in the MR / commit messages.
Updated by jbaier_cz over 2 years ago
The new pipeline system is working, I will be watching it for a few days. We will see how it deals with server errors.
Updated by okurz over 2 years ago
- Related to action #108824: Some of the daily aggregate tests are cancelled without a reason size:M added
Updated by okurz over 2 years ago
Could #108824 be a regression due to this? Could it be that, due to splitting the calls, we call isos post on the openQA API with the OBSOLETE parameter on matching products, so that jobs are now cancelled unexpectedly and never rescheduled?
Updated by okurz over 2 years ago
So it turned out that #108824 was actually a regression due to the work on this ticket. I suggest to monitor this further and at best resolve next week, still long before the due-date, please :)
Updated by jbaier_cz over 2 years ago
okurz wrote:
So it turned out that #108824 was actually a regression due to the work on this ticket. I suggest to monitor this further and at best resolve next week, still long before the due-date, please :)
No, not this one. #108869 was a regression due to this (the other bot-ng ticket :) ). I agree, we should monitor further.
Updated by okurz over 2 years ago
- Due date deleted (2022-03-31)
- Status changed from Feedback to Resolved
No further problems observed, so this can be seen as resolved. We also have a Five Whys analysis planned for a related topic, so we will think of improvements in this area anyway.