action #98979: monitor-post-deployment failed while arm3 was being rebooted by our automatic recovery - openQA Infrastructure (public) - openSUSE Project Management Tool

Actions

Copy link

action #98979

closed

monitor-post-deployment failed while arm3 was being rebooted by our automatic recovery

Added by livdywan over 3 years ago. Updated over 3 years ago.

Status:

Resolved

Priority:

High

Assignee:

okurz

Category:

Target version:

openQA Project (public) - Ready

Start date:

2021-09-21

Due date:

2021-10-04

% Done:

Estimated time:

Description

Observation¶

[{'dashboardId': 35,
  'dashboardSlug': 'automatic-actions',
  'dashboardUid': '1bNU0StZz',
  'evalData': {'evalMatches': [{'metric': 'ping.mean',
                                'tags': None,
                                'value': 0}]},
  'evalDate': '0001-01-01T00:00:00Z',
  'executionError': '',
  'id': 206,
  'name': '[openqa] openqaworker-arm-3 online (long-time) alert',
  'newStateDate': '2021-09-20T07:31:37+02:00',
  'panelId': 7,
  'state': 'pending',
  'url': '/d/1bNU0StZz/automatic-actions'},
 {'dashboardId': 35,
  'dashboardSlug': 'automatic-actions',
  'dashboardUid': '1bNU0StZz',
  'evalData': {'evalMatches': [{'metric': 'ping.mean',
                                'tags': None,
                                'value': 0}]},
  'evalDate': '0001-01-01T00:00:00Z',
  'executionError': '',
  'id': 185,
  'name': 'openqaworker-arm-3 offline',
  'newStateDate': '2021-09-20T07:33:11+02:00',
  'panelId': 4,
  'state': 'alerting',
  'url': '/d/1bNU0StZz/automatic-actions'}]

monitor-post-deploy failed, commit by @okurz mentioned
arm3 rebooted twice, at 7.41 and 4.00 as could be seen on automatic actions

Acceptance criteria¶

AC1: Broken workers only trigger alerts in one place

Suggestions¶

make monitor-post-deploy retry several times
drop this step since it duplicates other alerts / don't check workers "online" in this step via ping since there's no clear benefit and it wouldn't catch regressions this way
connect pipelines for broken worker healthcheck and deployment

Actions

Copy link

Updated by okurz over 3 years ago

Priority changed from Normal to High

cdywan wrote:

monitor-post-deploy failed, commit by @okurz mentioned

yeah but only because I was the last to commit to "osd-deployment", which did not cause the problem

Suggestions¶

make monitor-post-deploy retry several times

better retry within the step with sleep time between to cover for such cases as stated above

drop this step since it duplicates other alerts / don't check workers "online" in this step via ping since there's no clear benefit and it wouldn't catch regressions this way

We should keep this as it links potential alerts to actual deployments. In the past there had been regressions from deployments but then people were not able to make a connection easily between "there is an alert" and "maybe deployment caused it".
We are simply checking that there are no alerts. I would really like us to not need to add any special exclusions in this step as well

connect pipelines for broken worker healthcheck and deployment

I don't know what you mean by that

Actions

Copy link

Updated by livdywan over 3 years ago

okurz wrote:

cdywan wrote:

connect pipelines for broken worker healthcheck and deployment

I don't know what you mean by that

Sync up the two cases that were triggered here, i.e. "we detected that the worker no longer respons, try ipmi, failing that reboot" and "we just deployed, check that the worker can be pinged". Since we're checking if the worker is online in two different places it's racy.

Actions

Copy link

Updated by okurz over 3 years ago

Status changed from New to In Progress
Assignee set to okurz

Actions

Copy link

Updated by okurz over 3 years ago

cdywan wrote:

Sync up the two cases that were triggered here, i.e. "we detected that the worker no longer respons, try ipmi, failing that reboot" and "we just deployed, check that the worker can be pinged". Since we're checking if the worker is online in two different places it's racy.

Not quite. We actually do not check if openqaworker-arm-[123] are online anywhere else because these alerts are always paused to prevent unactionable alert messages. Nick and me tried in vain some months ago to generate the worker dashboards so that we would simply have no "host up" alert for those three hosts. But now Nick and me collaborated to improve in a different way, by excluding pending alerts as well as excluding any alert from the "automatic-actions" dashboard. So we would not fail any monitor step if openqaworker-arm-[123] would be down, also not if they are down due to deployment but we consider it unlikely that any deployment failure would only affect these three hosts. There would still be follow-up by "automatic-actions" by trying to recover as well as reporting EngInfra tickets automatically as last resort.

Actions

Copy link