action #98979
closed
monitor-post-deployment failed while arm3 was being rebooted by our automatic recovery
Start date:
2021-09-21
Due date:
2021-10-04
% Done:
0%
Estimated time:
Description
Observation
[{'dashboardId': 35,
  'dashboardSlug': 'automatic-actions',
  'dashboardUid': '1bNU0StZz',
  'evalData': {'evalMatches': [{'metric': 'ping.mean',
                                'tags': None,
                                'value': 0}]},
  'evalDate': '0001-01-01T00:00:00Z',
  'executionError': '',
  'id': 206,
  'name': '[openqa] openqaworker-arm-3 online (long-time) alert',
  'newStateDate': '2021-09-20T07:31:37+02:00',
  'panelId': 7,
  'state': 'pending',
  'url': '/d/1bNU0StZz/automatic-actions'},
 {'dashboardId': 35,
  'dashboardSlug': 'automatic-actions',
  'dashboardUid': '1bNU0StZz',
  'evalData': {'evalMatches': [{'metric': 'ping.mean',
                                'tags': None,
                                'value': 0}]},
  'evalDate': '0001-01-01T00:00:00Z',
  'executionError': '',
  'id': 185,
  'name': 'openqaworker-arm-3 offline',
  'newStateDate': '2021-09-20T07:33:11+02:00',
  'panelId': 4,
  'state': 'alerting',
  'url': '/d/1bNU0StZz/automatic-actions'}]
- monitor-post-deploy failed, commit by @okurz mentioned
- arm3 was rebooted twice, at 07:41 and 04:00, as can be seen on the automatic-actions dashboard
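To make alert output like the above easier to read, here is a minimal Python sketch that groups alert names by their state. The `alerts` list is a trimmed copy of the two entries from this observation, and `names_by_state` is a hypothetical helper for illustration, not part of the actual pipeline.

```python
from collections import defaultdict

# Trimmed copy of the two alert entries from the observation above.
alerts = [
    {"id": 206,
     "name": "[openqa] openqaworker-arm-3 online (long-time) alert",
     "state": "pending"},
    {"id": 185,
     "name": "openqaworker-arm-3 offline",
     "state": "alerting"},
]


def names_by_state(alerts):
    """Group alert names by their 'state' field (hypothetical helper)."""
    grouped = defaultdict(list)
    for alert in alerts:
        grouped[alert["state"]].append(alert["name"])
    return dict(grouped)


grouped = names_by_state(alerts)
```

Here only `openqaworker-arm-3 offline` is in state `alerting`; the long-time alert is still `pending`.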
Acceptance criteria
- AC1: Broken workers only trigger alerts in one place
Suggestions
- make monitor-post-deploy retry several times
- drop this step since it duplicates other alerts, or at least don't check whether workers are "online" via ping in this step, since there is no clear benefit and it wouldn't catch regressions this way
- connect pipelines for broken worker healthcheck and deployment
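The first suggestion (retrying the check several times) could look roughly like the following Python sketch. It assumes the check can be expressed as a callable returning True when no relevant alert is firing; `retry_check`, `attempts`, and `delay_s` are illustrative names and values, not taken from the real monitor-post-deploy job.

```python
import time


def retry_check(check, attempts=5, delay_s=60.0, sleep=time.sleep):
    """Call check() until it returns True or attempts are exhausted.

    All names and defaults here are assumptions for illustration;
    the real job may structure its check differently.
    """
    for attempt in range(1, attempts + 1):
        if check():
            return True
        # Wait before the next try, but not after the final attempt.
        if attempt < attempts:
            sleep(delay_s)
    return False


# Simulated worker that is offline for two checks (e.g. while being
# rebooted by automatic recovery) and reachable on the third.
results = iter([False, False, True])
ok = retry_check(lambda: next(results), attempts=5, delay_s=0)
```

With a total retry window longer than a typical reboot, a worker that is only briefly offline would no longer fail the step, while a worker that stays down would still be reported.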