action #98979

closed

monitor-post-deployment failed while arm3 was being rebooted by our automatic recovery

Added by livdywan about 3 years ago. Updated about 3 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Start date:
2021-09-21
Due date:
2021-10-04
% Done:

0%

Estimated time:

Description

Observation

[{'dashboardId': 35,
  'dashboardSlug': 'automatic-actions',
  'dashboardUid': '1bNU0StZz',
  'evalData': {'evalMatches': [{'metric': 'ping.mean',
                                'tags': None,
                                'value': 0}]},
  'evalDate': '0001-01-01T00:00:00Z',
  'executionError': '',
  'id': 206,
  'name': '[openqa] openqaworker-arm-3 online (long-time) alert',
  'newStateDate': '2021-09-20T07:31:37+02:00',
  'panelId': 7,
  'state': 'pending',
  'url': '/d/1bNU0StZz/automatic-actions'},
 {'dashboardId': 35,
  'dashboardSlug': 'automatic-actions',
  'dashboardUid': '1bNU0StZz',
  'evalData': {'evalMatches': [{'metric': 'ping.mean',
                                'tags': None,
                                'value': 0}]},
  'evalDate': '0001-01-01T00:00:00Z',
  'executionError': '',
  'id': 185,
  'name': 'openqaworker-arm-3 offline',
  'newStateDate': '2021-09-20T07:33:11+02:00',
  'panelId': 4,
  'state': 'alerting',
  'url': '/d/1bNU0StZz/automatic-actions'}]

Acceptance criteria

  • AC1: Broken workers only trigger alerts in one place

Suggestions

  • make monitor-post-deploy retry several times
  • drop this step since it duplicates other alerts, or at least stop checking via ping in this step whether workers are "online": there is no clear benefit, and a ping check would not catch deployment regressions anyway
  • connect pipelines for broken worker healthcheck and deployment
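
The first suggestion (retrying several times) could look roughly like the sketch below. It is only an illustration, not the actual monitor-post-deploy implementation: the names `check_with_retries`, `alerting`, `fetch_alerts`, `attempts` and `delay_s` are hypothetical, and it assumes the alert data arrives as a list of dicts with a 'state' key, as in the Observation above. An entry in state 'pending' is not treated as a failure, and a worker that is merely rebooting gets time to come back before the step fails.

```python
import time
from typing import Callable, Dict, List

Alert = Dict[str, object]


def alerting(alerts: List[Alert]) -> List[Alert]:
    """Return only the entries whose state is 'alerting' (skips 'pending', 'ok', ...)."""
    return [a for a in alerts if a.get("state") == "alerting"]


def check_with_retries(
    fetch_alerts: Callable[[], List[Alert]],  # hypothetical: returns the current alert list
    attempts: int = 5,                        # assumed default, tune to the recovery time
    delay_s: float = 60.0,                    # assumed pause between attempts
    sleep: Callable[[float], None] = time.sleep,
) -> List[Alert]:
    """Poll until nothing is alerting or the attempts run out.

    Returns the alerts still firing after the last attempt (empty list on success).
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    firing: List[Alert] = []
    for attempt in range(1, attempts + 1):
        firing = alerting(fetch_alerts())
        if not firing:
            return []
        if attempt < attempts:
            # Give automatic recovery (e.g. a reboot) time to finish before re-checking.
            sleep(delay_s)
    return firing
```

A caller would fail the pipeline step only if the returned list is non-empty, and could print the remaining entries' 'name' fields to show which alerts never cleared.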