coordination #94258
closed[epic] deployment pipeline failed, alerts not handled
Description
Observation
Pipelines in https://gitlab.suse.de/openqa/osd-deployment/-/pipelines already failed last Wednesday and again today, and nobody reacted to the failures. See our alert handling process as documented on https://progress.opensuse.org/projects/qa/wiki#Alert-handling
Acceptance criteria
- AC1: Team has been made aware of our alert handling process
- AC2: Automatic deployment continued
Suggestions
- Check the specific problem in the failed pipeline and fix it
- Consider improving the pipeline so that this or similar problems are fixed automatically in the future
- Make sure the team is aware of our alert handling process
- Find out why pipeline failure was not seen or reacted upon for the past days
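One way to notice failures independently of mail notifications would be to ask GitLab directly. The following is only a minimal sketch, not something the team runs: it assumes the GitLab REST API pipeline list endpoint (`/api/v4/projects/:id/pipelines`) is reachable on this instance, and the helper names `pipelines_url`, `fetch_pipelines` and `unresolved_failures` are made up for illustration.

```python
import json
import urllib.parse
import urllib.request
from typing import Any, Dict, List, Optional


def pipelines_url(base: str, project: str, per_page: int = 20) -> str:
    """Build the GitLab REST API URL that lists recent pipelines of a project."""
    # The project path must be URL-encoded, e.g. openqa%2Fosd-deployment.
    project_id = urllib.parse.quote(project, safe="")
    return f"{base}/api/v4/projects/{project_id}/pipelines?per_page={per_page}"


def fetch_pipelines(url: str, token: Optional[str] = None) -> List[Dict[str, Any]]:
    """Fetch and decode the pipeline list. Needs network access and maybe a token."""
    request = urllib.request.Request(url)
    if token:
        request.add_header("PRIVATE-TOKEN", token)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def unresolved_failures(pipelines: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return failed pipelines that are newer than the latest successful one."""
    failures = []
    # Walk from newest to oldest and stop at the first success.
    for pipeline in sorted(pipelines, key=lambda p: p["id"], reverse=True):
        if pipeline["status"] == "success":
            break
        if pipeline["status"] == "failed":
            failures.append(pipeline)
    return failures
```

Calling `unresolved_failures(fetch_pipelines(pipelines_url("https://gitlab.suse.de", "openqa/osd-deployment")))` from a periodic job would return a non-empty list while a failure is still waiting for a reaction. Whether the instance requires a token for this project is an assumption to verify.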
Updated by mkittler over 3 years ago
Find out why pipeline failure was not seen or reacted upon for the past days
I haven't received a mail despite having "Failed pipeline" checked for the project https://gitlab.suse.de/openqa/osd-deployment. I have also checked my Outlook mail account directly. Maybe other team members have the same problem. (The same goes for https://gitlab.suse.de/openqa/salt-states-openqa btw.)
Updated by okurz over 3 years ago
mkittler wrote:
Find out why pipeline failure was not seen or reacted upon for the past days
I haven't received a mail despite having "Failed pipeline" checked for the project https://gitlab.suse.de/openqa/osd-deployment. I have also checked my Outlook mail account directly. Maybe other team members have the same problem. (The same goes for https://gitlab.suse.de/openqa/salt-states-openqa btw.)
Right. That would address the last point "Find out why pipeline failure was not seen". I can confirm that I have received an email for both the failure on Wednesday and the one this morning.
Updated by mkittler over 3 years ago
- AC1: I guess the "Failed pipeline" was just pre-checked in the menu for customizing notifications and I was previously watching any activity. However, when using the customized setting with everything checked it also doesn't work for me.
- AC2: Executing rpm --rebuilddb on arm-1 worked. Considering the machine sometimes randomly crashes it isn't really a surprise to find a broken rpm database.
Updated by mkittler over 3 years ago
- Assignee deleted (mkittler)
Not sure what to do about it. The notifications just don't seem to work for me.
Updated by okurz over 3 years ago
mkittler wrote:
Not sure what to do about it. The notifications just don't seem to work for me.
I have the following ideas:
- Ask multiple team members whether it works for them (works for me)
- Simulate failures in a fork and check if notifications work from there
- Crosscheck if you receive emails from other repos. Do you get emails if CI checks in your own MRs fail?
- Ask EngInfra admins if they can check if you should receive an email
Further ideas of the ticket:
- Detect the RPM failure and apply the database rebuild automatically
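The last idea could look roughly like the following. This is only a sketch under assumptions, not what the pipeline does today: it assumes that a failing `rpm -qa` query is a good enough sign of a broken database, and the names `run_quiet` and `ensure_rpm_db` are made up for illustration. The rebuild command is the same one that was run manually on arm-1.

```python
import subprocess
from typing import Callable, Sequence

# A runner takes an argv list and returns the exit code of the command.
Runner = Callable[[Sequence[str]], int]


def run_quiet(argv: Sequence[str]) -> int:
    """Run a command, discard its output and return its exit code."""
    try:
        completed = subprocess.run(
            argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL, check=False
        )
    except OSError:
        # Command missing or not executable: report it like a shell would.
        return 127
    return completed.returncode


def ensure_rpm_db(run: Runner = run_quiet) -> str:
    """Probe the rpm database and rebuild it once if the probe fails.

    Returns "ok", "rebuilt" or "failed".
    """
    # Assumption: a failing package query indicates a broken database.
    probe = ["rpm", "-qa"]
    if run(probe) == 0:
        return "ok"
    # Same command that fixed arm-1 when run manually.
    if run(["rpm", "--rebuilddb"]) != 0:
        return "failed"
    # Only report success if the probe passes after the rebuild.
    return "rebuilt" if run(probe) == 0 else "failed"
```

A deployment step could call `ensure_rpm_db()` on each worker before installing packages and only alert when it returns "failed". Passing the runner as a parameter keeps the logic testable without touching a real rpm database.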
Updated by okurz over 3 years ago
ARM workers are handled, deployment continued, AC2 covered. That leaves AC1.
Updated by mkittler over 3 years ago
For me it doesn't work for other repositories either. The last mail I've received was "salt-states-openqa | Pipeline #82985 has been fixed for firewall | 44894364 in !378" from 15.10.20 17:27.
Updated by livdywan over 3 years ago
- Recommended settings as suggested on the call: Settings > Notifications > Global notification level > Watch
- A failed pipeline looks something like "Failed pipeline for break-not-master | osd-deployment | ff035118". I produced this by opening an MR with broken YAML (https://gitlab.suse.de/cdywan/osd-deployment/-/pipelines/159292)
Updated by livdywan over 3 years ago
- Status changed from Workable to In Progress
- Assignee set to livdywan
Updated by livdywan over 3 years ago
- Status changed from In Progress to Feedback
So Marius seems to be getting notifications but they're just filtered out into a folder usually.
The idea to shut down arm3 was not very fruitful as it recovered perfectly with no alerts or failed pipelines. So I'm still not sure how to really test this 🤔️
Updated by okurz over 3 years ago
cdywan wrote:
So Marius seems to be getting notifications but they're just filtered out into a folder usually.
The idea to shut down arm3 was not very fruitful as it recovered perfectly with no alerts or failed pipelines. So I'm still not sure how to really test this 🤔️
Well, you need to shut down a salt-controlled host and trigger a GitLab CI pipeline within osd-deployment.
Updated by okurz over 3 years ago
- Status changed from Feedback to Blocked
I created a specific subtask #95105 for the non-technical part with the right due date
Updated by okurz over 3 years ago
- Tracker changed from action to coordination
- Subject changed from https://gitlab.suse.de/openqa/osd-deployment/-/pipelines fails and alerts not handled to [epic] deployment pipeline failed, alerts not handled
Updated by livdywan over 3 years ago
- Status changed from Blocked to Resolved
Kinda forgot about the epic :-D