coordination #94258
closed
[epic] deployment pipeline failed, alerts not handled
Added by okurz over 3 years ago.
Updated over 3 years ago.
Description
Observation
https://gitlab.suse.de/openqa/osd-deployment/-/pipelines already failed last Wednesday and again today, and nobody reacted to the failed pipelines. See our alert handling process as documented on https://progress.opensuse.org/projects/qa/wiki#Alert-handling
Acceptance criteria
- AC1: Team has been made aware of our alert handling process
- AC2: Automatic deployment continued
Suggestions
- Check specific problem in failed pipeline and fix it
- Consider coming up with improvements within the pipeline to fix this or similar problems in the future automatically
- Make sure the team is aware of our alert handling process
- Find out why pipeline failure was not seen or reacted upon for the past days
mkittler wrote:
Find out why pipeline failure was not seen or reacted upon for the past days
I haven't received a mail despite having "Failed pipeline" checked for the project https://gitlab.suse.de/openqa/osd-deployment. I have also checked my Outlook mail account directly. Maybe other team members have the same problem. (The same goes for https://gitlab.suse.de/openqa/salt-states-openqa, btw.)
Right. That would address the last point, "Find out why pipeline failure was not seen". I can confirm that I received an email for both the failure on Wednesday and the one this morning.
- AC1: I guess the "Failed pipeline" was just pre-checked in the menu for customizing notifications and I was previously watching any activity. However, when using the customized setting with everything checked it also doesn't work for me.
- AC2: Executing rpm --rebuilddb on arm-1 worked. Considering the machine sometimes randomly crashes, it isn't really a surprise to find a broken rpm database.
- Assignee deleted (mkittler)
Not sure what to do about it. The notifications just don't seem to work for me.
mkittler wrote:
Not sure what to do about it. The notifications just don't seem to work for me.
I have the following ideas:
- Ask multiple team members whether it works for them (works for me)
- Simulate failures in a fork and check if notifications work from there
- Crosscheck if you receive emails from other repos. Do you get emails if CI checks in your own MRs fail?
- Ask EngInfra admins if they can check if you should receive an email
Further ideas for the ticket:
- Detect the RPM failure and apply the database rebuild automatically
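A minimal sketch of how such an automatic repair could look, not the actual pipeline code. It assumes that a non-zero exit code of rpm -qa is a good enough sign of a broken database, which is a simplification; the names run_command and ensure_rpmdb are hypothetical.

```python
import subprocess
from typing import Callable, Sequence

# A runner takes an argv list and returns the exit code of the command.
Runner = Callable[[Sequence[str]], int]


def run_command(argv: Sequence[str]) -> int:
    """Run a command locally and return its exit code, discarding output."""
    result = subprocess.run(
        argv,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=False,
    )
    return result.returncode


def ensure_rpmdb(run: Runner = run_command) -> str:
    """Probe the rpm database and rebuild it once if the probe fails.

    Assumption: "rpm -qa" exiting non-zero means the database is broken.
    Returns "ok", "rebuilt" or "failed".
    """
    probe = ["rpm", "-qa"]
    if run(probe) == 0:
        return "ok"
    # Same manual fix that worked on arm-1.
    run(["rpm", "--rebuilddb"])
    # Probe again instead of trusting the exit code of the rebuild.
    return "rebuilt" if run(probe) == 0 else "failed"
```

The runner is injectable so the logic can be exercised without touching a real host; in a deployment job it would have to run on the affected worker (for example via salt or ssh), and "failed" should still fail the pipeline so that the alert handling process applies.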
ARM workers are handled and deployment continued, so AC2 is covered. That leaves AC1.
For me it doesn't work for other repositories either. The last mail I've received was "salt-states-openqa | Pipeline #82985 has been fixed for firewall | 44894364 in !378" from 15.10.20 17:27.
- Recommended settings as suggested on the call: Settings > Notifications > Global notification level > Watch
- A failed pipeline notification looks something like "Failed pipeline for break-not-master | osd-deployment | ff035118". I produced this by opening an MR with broken YAML (https://gitlab.suse.de/cdywan/osd-deployment/-/pipelines/159292)
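To check whether a mail filter would catch such notifications, the subject can be matched with a small pattern. This is only a sketch based on the single example subject quoted in this ticket; the exact format GitLab uses may differ, and SUBJECT_RE and parse_failed_pipeline_subject are hypothetical names.

```python
import re
from typing import Dict, Optional

# Pattern derived from the one example subject in this ticket:
#   Failed pipeline for <ref> | <project> | <short sha>
SUBJECT_RE = re.compile(
    r"^Failed pipeline for (?P<ref>\S+) \| (?P<project>[^|]+?) \| (?P<sha>[0-9a-f]{7,40})$"
)


def parse_failed_pipeline_subject(subject: str) -> Optional[Dict[str, str]]:
    """Return ref, project and sha if the subject looks like a failed pipeline mail."""
    match = SUBJECT_RE.match(subject.strip())
    return match.groupdict() if match else None
```

Subjects of "has been fixed" mails, such as the one from salt-states-openqa quoted earlier, do not match and yield None.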
- Status changed from Workable to In Progress
- Assignee set to livdywan
- Status changed from In Progress to Feedback
So Marius seems to be getting notifications but they're just filtered out into a folder usually.
The idea to shut down arm3 was not very fruitful as it recovered perfectly w/ no alerts or failed pipelines. So I'm still not sure how to really test this 🤔️
cdywan wrote:
So Marius seems to be getting notifications but they're just filtered out into a folder usually.
The idea to shut down arm3 was not very fruitful as it recovered perfectly w/ no alerts or failed pipelines. So I'm still not sure how to really test this 🤔️
Well, you need to shut down a salt-controlled host and then trigger a GitLab CI pipeline within osd-deployment.
- Status changed from Feedback to Blocked
I created a specific subtask #95105 for the non-technical part with the right due date
- Tracker changed from action to coordination
- Subject changed from https://gitlab.suse.de/openqa/osd-deployment/-/pipelines fails and alerts not handled to [epic] deployment pipeline failed, alerts not handled
- Status changed from Blocked to Resolved
Kinda forgot about the epic :-D