action #98979

monitor-post-deployment failed while arm3 was being rebooted by our automatic recovery

Added by livdywan about 3 years ago. Updated about 3 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Start date:
2021-09-21
Due date:
2021-10-04
% Done:

0%

Estimated time:

Description

Observation

[{'dashboardId': 35,
  'dashboardSlug': 'automatic-actions',
  'dashboardUid': '1bNU0StZz',
  'evalData': {'evalMatches': [{'metric': 'ping.mean',
                                'tags': None,
                                'value': 0}]},
  'evalDate': '0001-01-01T00:00:00Z',
  'executionError': '',
  'id': 206,
  'name': '[openqa] openqaworker-arm-3 online (long-time) alert',
  'newStateDate': '2021-09-20T07:31:37+02:00',
  'panelId': 7,
  'state': 'pending',
  'url': '/d/1bNU0StZz/automatic-actions'},
 {'dashboardId': 35,
  'dashboardSlug': 'automatic-actions',
  'dashboardUid': '1bNU0StZz',
  'evalData': {'evalMatches': [{'metric': 'ping.mean',
                                'tags': None,
                                'value': 0}]},
  'evalDate': '0001-01-01T00:00:00Z',
  'executionError': '',
  'id': 185,
  'name': 'openqaworker-arm-3 offline',
  'newStateDate': '2021-09-20T07:33:11+02:00',
  'panelId': 4,
  'state': 'alerting',
  'url': '/d/1bNU0StZz/automatic-actions'}]

Acceptance criteria

  • AC1: Broken workers only trigger alerts in one place

Suggestions

  • make monitor-post-deploy retry several times
  • drop this step since it duplicates other alerts / don't check workers "online" in this step via ping since there's no clear benefit and it wouldn't catch regressions this way
  • connect pipelines for broken worker healthcheck and deployment
Actions #1

Updated by okurz about 3 years ago

  • Priority changed from Normal to High

cdywan wrote:

Yeah, but only because I was the last to commit to "osd-deployment", which did not cause the problem.

Suggestions

  • make monitor-post-deploy retry several times

Better to retry within the step, with sleep time in between, to cover cases like the one above.
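The suggested retry could be a small helper in the monitor step. Here is a minimal sketch; the function name, attempt count, and delay are illustrative assumptions, not taken from the actual pipeline:

```python
import time

def check_with_retries(check, attempts=5, delay=60):
    """Retry a boolean check with a sleep between attempts, so a worker
    that is briefly offline (e.g. mid-reboot during automatic recovery)
    does not immediately fail the monitor step.

    Hypothetical sketch; attempts/delay values are placeholders."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```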

  • drop this step since it duplicates other alerts / don't check workers "online" in this step via ping since there's no clear benefit and it wouldn't catch regressions this way

We should keep this step, as it links potential alerts to actual deployments. In the past there have been regressions caused by deployments, but people could not easily connect "there is an alert" with "maybe the deployment caused it".
We are simply checking that there are no alerts, and I would really prefer that we not need to add any special exclusions to this step either.
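"Checking that there are no alerts" can be as simple as asserting that nothing is in the `alerting` state in the data Grafana returns. The alert dicts below follow the shape shown in the Observation; the function itself is a hypothetical sketch, not the actual monitor step:

```python
def monitor_step_passes(alerts):
    """Pass the post-deployment monitor step only if no alert is firing.
    'pending' and 'ok' states do not count as failures."""
    return not any(a.get('state') == 'alerting' for a in alerts)
```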

  • connect pipelines for broken worker healthcheck and deployment

I don't know what you mean by that

Actions #2

Updated by livdywan about 3 years ago

okurz wrote:

cdywan wrote:

  • connect pipelines for broken worker healthcheck and deployment

I don't know what you mean by that

Sync up the two cases that were triggered here, i.e. "we detected that the worker no longer responds, try ipmi, failing that reboot" and "we just deployed, check that the worker can be pinged". Since we check whether the worker is online in two different places, it's racy.

Actions #3

Updated by okurz about 3 years ago

  • Status changed from New to In Progress
  • Assignee set to okurz
Actions #4

Updated by okurz about 3 years ago

cdywan wrote:

Sync up the two cases that were triggered here, i.e. "we detected that the worker no longer responds, try ipmi, failing that reboot" and "we just deployed, check that the worker can be pinged". Since we check whether the worker is online in two different places, it's racy.

Not quite. We actually do not check whether openqaworker-arm-[123] are online anywhere else, because those alerts are permanently paused to prevent unactionable alert messages. Nick and I tried in vain some months ago to generate the worker dashboards so that there would simply be no "host up" alert for those three hosts.

Now Nick and I have collaborated on a different improvement: excluding pending alerts as well as any alert from the "automatic-actions" dashboard. With that, the monitor step no longer fails if openqaworker-arm-[123] are down, including when they are down due to a deployment; we consider it unlikely that a deployment failure would affect only these three hosts. "automatic-actions" would still follow up by trying to recover the workers and, as a last resort, automatically reporting EngInfra tickets.
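The exclusion described here could look roughly like the following sketch, under the assumption that alerts arrive as dicts shaped like the Observation above (the real change lives in the osd-deployment MR):

```python
def relevant_alerts(alerts):
    """Filter the Grafana alert list for the monitor step: ignore
    'pending' alerts and everything on the 'automatic-actions'
    dashboard, which handles the arm workers' recovery on its own.

    Illustrative sketch, not the code from the merged MR."""
    return [
        a for a in alerts
        if a.get('state') != 'pending'
        and a.get('dashboardSlug') != 'automatic-actions'
    ]
```

Applied to the two alerts in the Observation, both would be excluded: the first is `pending`, and both belong to the `automatic-actions` dashboard.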

Actions #5

Updated by okurz about 3 years ago

  • Status changed from In Progress to Resolved

MR merged and active for next deployment. AC fulfilled.

Actions #6

Updated by livdywan about 3 years ago

okurz wrote:

MR merged and active for next deployment. AC fulfilled.

What MR is that? Please mention it here so others can learn from it.

Edit: It's https://gitlab.suse.de/openqa/osd-deployment/-/merge_requests/34

Unfortunately it looks like it's not working: https://gitlab.suse.de/openqa/osd-deployment/-/jobs/599916

Actions #7

Updated by livdywan about 3 years ago

  • Status changed from Resolved to Feedback
Actions #8

Updated by okurz about 3 years ago

  • Due date set to 2021-10-04

I understand now that I picked the wrong image URL for verification: I used the old URL in https://gitlab.suse.de/openqa/osd-deployment/-/merge_requests/34#note_346765

Fixed in https://gitlab.suse.de/openqa/osd-deployment/-/merge_requests/35; I will retrigger and monitor the deployment.

Actions #9

Updated by okurz about 3 years ago

  • Status changed from Feedback to Resolved

All good now; I received an email that the deployment pipeline was fixed. :)
