coordination #94258

closed

[epic] deployment pipeline failed, alerts not handled

Added by okurz over 3 years ago. Updated over 3 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Organisational
Target version:
Start date: 2021-06-25
Due date: 2021-07-08
% Done: 100%
Estimated time: (Total: 0.00 h)

Description

Observation

The pipelines at https://gitlab.suse.de/openqa/osd-deployment/-/pipelines failed last Wednesday and again today, and nobody reacted to the failures. See our alert handling process as documented on https://progress.opensuse.org/projects/qa/wiki#Alert-handling

Acceptance criteria

  • AC1: The team has been made aware of our alert handling process
  • AC2: Automatic deployment continued

Suggestions

  • Check specific problem in failed pipeline and fix it
  • Consider coming up with improvements within the pipeline to fix this or similar problems in the future automatically
  • Make sure the team is aware of our alert handling process
  • Find out why pipeline failure was not seen or reacted upon for the past days

Subtasks: 2 (0 open, 2 closed)

  • openQA Infrastructure - action #94747: broken RPM database on openqaworker-arm-* during osd deployment (Resolved, okurz, start 2021-06-25)
  • action #95105: osd-deployment pipelines fail and alerts are not handled size:M (Resolved, livdywan, start 2021-07-06, due 2021-07-08)

Actions #1

Updated by mkittler over 3 years ago

Find out why pipeline failure was not seen or reacted upon for the past days

I haven't received a mail despite having "Failed pipeline" checked for the project https://gitlab.suse.de/openqa/osd-deployment. I have also checked my Outlook mail account directly. Maybe other team members have the same problem. (The same goes for https://gitlab.suse.de/openqa/salt-states-openqa, by the way.)

Actions #2

Updated by mkittler over 3 years ago

  • Assignee set to mkittler
Actions #3

Updated by okurz over 3 years ago

mkittler wrote:

Find out why pipeline failure was not seen or reacted upon for the past days

I haven't received a mail despite having "Failed pipeline" checked for the project https://gitlab.suse.de/openqa/osd-deployment. I have also checked my Outlook mail account directly. Maybe other team members have the same problem. (The same goes for https://gitlab.suse.de/openqa/salt-states-openqa, by the way.)

Right. That would address the last point, "Find out why pipeline failure was not seen". I can confirm that I received an email for both the failure on Wednesday and the one this morning.

Actions #4

Updated by mkittler over 3 years ago

  • AC1: I guess "Failed pipeline" was merely pre-checked in the menu for customizing notifications, and my previous setting was to watch any activity. However, with the custom setting and everything checked it doesn't work for me either.
  • AC2: Executing rpm --rebuilddb on arm-1 worked. Since the machine sometimes crashes randomly, a broken RPM database isn't really a surprise.
Actions #5

Updated by mkittler over 3 years ago

  • Assignee deleted (mkittler)

Not sure what to do about it. The notifications just don't seem to work for me.

Actions #6

Updated by okurz over 3 years ago

mkittler wrote:

Not sure what to do about it. The notifications just don't seem to work for me.

I have the following ideas:

  • Ask multiple team members whether it works for them (works for me)
  • Simulate failures in a fork and check if notifications work from there
  • Crosscheck if you receive emails from other repos. Do you get emails if CI checks in your own MRs fail?
  • Ask EngInfra admins if they can check if you should receive an email

Further ideas for the ticket:

  • Detect the RPM failure and apply the database rebuild automatically
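
The idea of detecting the RPM failure and rebuilding the database automatically could look roughly like the following. This is only a minimal sketch, not what the osd-deployment pipeline actually does: the error hints, the function names and the injectable `run` parameter are assumptions made for illustration. Only `rpm --rebuilddb` itself comes from this ticket.

```python
import subprocess

# Assumed substrings hinting at a broken RPM database. Adjust them to the
# messages actually seen in the failed deployment logs.
RPMDB_ERROR_HINTS = ("rpmdb", "cannot open Packages")


def looks_like_rpmdb_failure(output: str) -> bool:
    """Return True if the command output hints at a broken RPM database."""
    lowered = output.lower()
    return any(hint.lower() in lowered for hint in RPMDB_ERROR_HINTS)


def run_with_rpmdb_repair(command, run=subprocess.run):
    """Run `command`; on a suspected rpmdb failure, rebuild and retry once."""
    result = run(command, capture_output=True, text=True)
    if result.returncode == 0:
        return result
    if not looks_like_rpmdb_failure(result.stdout + result.stderr):
        # Unrelated failure: report it unchanged instead of masking it.
        return result
    rebuild = run(["rpm", "--rebuilddb"], capture_output=True, text=True)
    if rebuild.returncode != 0:
        return rebuild
    # Retry exactly once so a persistent problem still fails the pipeline.
    return run(command, capture_output=True, text=True)
```

Retrying only once keeps a persistently broken host visible as a failed pipeline, so the alert handling process still applies.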
Actions #7

Updated by okurz over 3 years ago

The ARM workers are handled and deployment continued, so AC2 is covered. That leaves AC1.

Actions #8

Updated by mkittler over 3 years ago

For me it doesn't work for other repositories either. The last mail I've received was "salt-states-openqa | Pipeline #82985 has been fixed for firewall | 44894364 in !378" from 15.10.20 17:27.

Actions #9

Updated by livdywan over 3 years ago

  • Recommended settings as suggested on the call: Settings > Notifications > Global notification level > Watch
  • A failed pipeline notification looks something like "Failed pipeline for break-not-master | osd-deployment | ff035118". I produced this by opening an MR with broken YAML (https://gitlab.suse.de/cdywan/osd-deployment/-/pipelines/159292)
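
To check whether such mails arrive or end up in a filtered folder, the subject can be matched programmatically. The sketch below is derived only from the single example subject quoted above; the pattern and the function name are assumptions, and other GitLab versions or settings may format subjects differently.

```python
import re

# Pattern inferred from one example subject in this ticket (an assumption):
# "Failed pipeline for <branch> | <project> | <short commit hash>"
FAILED_PIPELINE_SUBJECT = re.compile(
    r"^Failed pipeline for (?P<branch>\S+) \| (?P<project>\S+) \| (?P<commit>[0-9a-f]+)$"
)


def parse_failed_pipeline_subject(subject: str):
    """Return branch, project and commit as a dict, or None if no match."""
    match = FAILED_PIPELINE_SUBJECT.match(subject.strip())
    return match.groupdict() if match else None
```

A subject such as the "has been fixed" mail quoted earlier in this ticket does not match and yields `None`.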
Actions #10

Updated by livdywan over 3 years ago

  • Status changed from Workable to In Progress
  • Assignee set to livdywan
Actions #11

Updated by livdywan over 3 years ago

  • Status changed from In Progress to Feedback

So Marius seems to be getting notifications, but they are usually just filtered into a folder.
The idea to shut down arm3 was not very fruitful as it recovered perfectly with no alerts or failed pipelines. So I'm still not sure how to really test this 🤔️

Actions #12

Updated by okurz over 3 years ago

cdywan wrote:

So Marius seems to be getting notifications, but they are usually just filtered into a folder.
The idea to shut down arm3 was not very fruitful as it recovered perfectly with no alerts or failed pipelines. So I'm still not sure how to really test this 🤔️

Well, you need to shut down a Salt-controlled host and then trigger a GitLab CI pipeline within osd-deployment.
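
Triggering the pipeline for such a test could be scripted against the GitLab REST API ("create a new pipeline", `POST /projects/:id/pipeline`). The following is a minimal sketch under assumptions: the function name is hypothetical, and the branch name and access token must be supplied by the caller. Shutting down the host is not covered here.

```python
import urllib.parse
import urllib.request


def build_pipeline_request(base_url, project_path, ref, token):
    """Build (but do not send) a request that creates a pipeline for `ref`."""
    # GitLab accepts the URL-encoded project path in place of a numeric ID.
    project_id = urllib.parse.quote(project_path, safe="")
    query = urllib.parse.urlencode({"ref": ref})
    url = f"{base_url}/api/v4/projects/{project_id}/pipeline?{query}"
    return urllib.request.Request(
        url, method="POST", headers={"PRIVATE-TOKEN": token}
    )
```

Passing the returned request to `urllib.request.urlopen` would send it. Building and sending are kept separate so the URL can be inspected before anything is triggered.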

Actions #13

Updated by okurz over 3 years ago

  • Status changed from Feedback to Blocked

I created a dedicated subtask, #95105, for the non-technical part with the right due date.

Actions #14

Updated by okurz over 3 years ago

  • Tracker changed from action to coordination
  • Subject changed from https://gitlab.suse.de/openqa/osd-deployment/-/pipelines fails and alerts not handled to [epic] deployment pipeline failed, alerts not handled
Actions #15

Updated by livdywan over 3 years ago

  • Status changed from Blocked to Resolved

Kinda forgot about the epic :-D
