action #98979
monitor-post-deployment failed while arm3 was being rebooted by our automatic recovery
Description
Observation
[{'dashboardId': 35,
'dashboardSlug': 'automatic-actions',
'dashboardUid': '1bNU0StZz',
'evalData': {'evalMatches': [{'metric': 'ping.mean',
'tags': None,
'value': 0}]},
'evalDate': '0001-01-01T00:00:00Z',
'executionError': '',
'id': 206,
'name': '[openqa] openqaworker-arm-3 online (long-time) alert',
'newStateDate': '2021-09-20T07:31:37+02:00',
'panelId': 7,
'state': 'pending',
'url': '/d/1bNU0StZz/automatic-actions'},
{'dashboardId': 35,
'dashboardSlug': 'automatic-actions',
'dashboardUid': '1bNU0StZz',
'evalData': {'evalMatches': [{'metric': 'ping.mean',
'tags': None,
'value': 0}]},
'evalDate': '0001-01-01T00:00:00Z',
'executionError': '',
'id': 185,
'name': 'openqaworker-arm-3 offline',
'newStateDate': '2021-09-20T07:33:11+02:00',
'panelId': 4,
'state': 'alerting',
'url': '/d/1bNU0StZz/automatic-actions'}]
- monitor-post-deploy failed, commit by @okurz mentioned
- arm3 was rebooted twice, at 7:41 and 4:00, as can be seen on the "automatic-actions" dashboard
Acceptance criteria
- AC1: Broken workers only trigger alerts in one place
Suggestions
- make monitor-post-deploy retry several times
- drop this step since it duplicates other alerts / don't check workers "online" in this step via ping since there's no clear benefit and it wouldn't catch regressions this way
- connect pipelines for broken worker healthcheck and deployment
Updated by okurz about 3 years ago
- Priority changed from Normal to High
cdywan wrote:
- monitor-post-deploy failed, commit by @okurz mentioned
Yeah, but only because I was the last to commit to "osd-deployment"; that commit did not cause the problem.
Suggestions
- make monitor-post-deploy retry several times
Better to retry within the step, with a sleep between attempts, to cover cases like the one stated above.
- drop this step since it duplicates other alerts / don't check workers "online" in this step via ping since there's no clear benefit and it wouldn't catch regressions this way
We should keep this as it links potential alerts to actual deployments. In the past there had been regressions from deployments but then people were not able to make a connection easily between "there is an alert" and "maybe deployment caused it".
We are simply checking that there are no alerts. I would really like us to not need to add any special exclusions in this step as well
- connect pipelines for broken worker healthcheck and deployment
I don't know what you mean by that
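The idea of retrying within the step with a sleep between attempts could be sketched roughly as follows. This is only an illustration, not what the pipeline actually runs: `retry_check`, its parameters and the default values are hypothetical, and `check` stands for whatever "no alerts are firing" test the step performs.

```python
import time
from typing import Callable


def retry_check(
    check: Callable[[], bool],
    attempts: int = 3,
    delay_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run `check` up to `attempts` times, sleeping `delay_s` between tries.

    Hypothetical helper: names and defaults are assumptions for illustration.
    Returns True as soon as one attempt succeeds, False if all attempts fail.
    """
    for attempt in range(1, attempts + 1):
        if check():
            return True
        # Only wait if another attempt will follow.
        if attempt < attempts:
            sleep(delay_s)
    return False
```

A host that is rebooting during the first attempt would then get `(attempts - 1) * delay_s` seconds to come back before the step fails.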
Updated by livdywan about 3 years ago
okurz wrote:
cdywan wrote:
- connect pipelines for broken worker healthcheck and deployment
I don't know what you mean by that
Sync up the two cases that were triggered here, i.e. "we detected that the worker no longer responds, try ipmi, failing that reboot" and "we just deployed, check that the worker can be pinged". Since we're checking whether the worker is online in two different places, it's racy.
Updated by okurz about 3 years ago
- Status changed from New to In Progress
- Assignee set to okurz
Updated by okurz about 3 years ago
cdywan wrote:
Sync up the two cases that were triggered here, i.e. "we detected that the worker no longer responds, try ipmi, failing that reboot" and "we just deployed, check that the worker can be pinged". Since we're checking whether the worker is online in two different places, it's racy.
Not quite. We do not actually check anywhere else whether openqaworker-arm-[123] are online, because those alerts are always paused to prevent unactionable alert messages. Some months ago Nick and I tried in vain to generate the worker dashboards so that we would simply have no "host up" alert for those three hosts. Now Nick and I have collaborated on a different improvement: excluding pending alerts as well as any alert from the "automatic-actions" dashboard. As a result, no monitor step would fail if openqaworker-arm-[123] were down, not even if they were down because of a deployment, but we consider it unlikely that a deployment failure would affect only these three hosts. "automatic-actions" would still follow up by trying to recover the hosts and, as a last resort, reporting EngInfra tickets automatically.
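The exclusion described above could be sketched like this, using the two alerts from the Observation section as input. This is a minimal illustration under assumptions, not the code from the merge request: `blocking_alerts` and the constant names are hypothetical, and the third alert in the sample data is made up to show an alert that would still fail the step.

```python
# Hypothetical names; only the two exclusion rules come from the comment above.
EXCLUDED_STATES = {"pending"}
EXCLUDED_DASHBOARDS = {"automatic-actions"}


def blocking_alerts(alerts):
    """Return the alerts that should still fail the monitor step."""
    return [
        alert
        for alert in alerts
        if alert.get("state") not in EXCLUDED_STATES
        and alert.get("dashboardSlug") not in EXCLUDED_DASHBOARDS
    ]


alerts = [
    # The two alerts from the Observation section (trimmed to relevant keys).
    {"id": 206, "dashboardSlug": "automatic-actions", "state": "pending",
     "name": "[openqa] openqaworker-arm-3 online (long-time) alert"},
    {"id": 185, "dashboardSlug": "automatic-actions", "state": "alerting",
     "name": "openqaworker-arm-3 offline"},
    # Made-up alert on a made-up dashboard, for illustration only.
    {"id": 1, "dashboardSlug": "example-dashboard", "state": "alerting",
     "name": "example host up alert"},
]

remaining = blocking_alerts(alerts)
for alert in remaining:
    print(alert["id"], alert["name"])
```

With this filter, both arm3 alerts from the Observation are ignored, so a reboot by the automatic recovery would no longer fail the step on its own.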
Updated by okurz about 3 years ago
- Status changed from In Progress to Resolved
MR merged and active for next deployment. AC fulfilled.
Updated by livdywan about 3 years ago
okurz wrote:
MR merged and active for next deployment. AC fulfilled.
What MR is that? Please mention it here so others can learn from it
Edit: It's https://gitlab.suse.de/openqa/osd-deployment/-/merge_requests/34
Unfortunately it looks like it's not working: https://gitlab.suse.de/openqa/osd-deployment/-/jobs/599916
Updated by okurz about 3 years ago
- Due date set to 2021-10-04
I understand now that I picked the wrong image URL for verification: I used the old URL in https://gitlab.suse.de/openqa/osd-deployment/-/merge_requests/34#note_346765
Fixed in https://gitlab.suse.de/openqa/osd-deployment/-/merge_requests/35. I will take care of retriggering and monitoring the deployment.
Updated by okurz about 3 years ago
- Status changed from Feedback to Resolved
All good now, I received an email that the deployment pipeline was fixed :)