action #91494

[epic] work on #90152 caused deployment problem and no monitoring alert

Added by okurz over 3 years ago. Updated over 3 years ago.

Status: New
Priority: Normal
Assignee: -
Category: Feature requests
Target version: future
Start date: 2021-04-21
Due date: -
% Done: 0%
Estimated time: -
Description

Observation

The work on #90152 caused deployment problems on both o3 and osd, as users informed us, e.g. in #91461 and in chat messages by dzedro, coolo and dimstar. In particular, https://github.com/os-autoinst/openQA/pull/3849, which showed no problems in CI, seems to have caused jobs to show no details and fail despite all modules passing. This is two problems in one: we deployed a bug that no tests revealed before merge, and there was no monitoring alert.

Acceptance criteria

  • AC1: Failed jobs caused by openQA deployments trigger alerts (a possible check is sketched after this list)
  • AC2: We have better CI tests to prevent regressions in the accounting of failed openQA jobs
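
A minimal sketch of what a check for AC1 could look like, offered as an assumption rather than an agreed design: a periodic job that compares the failure ratio in a window after a deployment against the window before it and alerts only on a sharp jump, which accounts for failed tests being part of normal operation. The jobs table columns (result, t_finished) follow the upstream openQA schema, but the connection string, the thresholds and the source of the deployment timestamp are hypothetical.

```python
# Hypothetical deployment-aware failure-rate check for AC1; meant to be run
# e.g. from a timer shortly after each deployment. The jobs table layout
# (result, t_finished) matches upstream openQA; everything else is made up.
from datetime import datetime, timedelta

import psycopg2  # assumes read access to the openQA PostgreSQL database

WINDOW = timedelta(hours=2)
RATIO_FACTOR = 2.0  # alert only if the failure ratio at least doubles


def failure_ratio(cur, start: datetime, end: datetime) -> float:
    """Fraction of jobs finished in [start, end] that failed or were incomplete."""
    cur.execute(
        """
        SELECT count(*) FILTER (WHERE result IN ('failed', 'incomplete'))::float
               / greatest(count(*), 1)
        FROM jobs WHERE t_finished BETWEEN %s AND %s
        """,
        (start, end),
    )
    return cur.fetchone()[0]


def deployment_looks_bad(conn, deployed_at: datetime) -> bool:
    """Compare the failure ratio before and after the deployment timestamp."""
    with conn.cursor() as cur:
        before = failure_ratio(cur, deployed_at - WINDOW, deployed_at)
        after = failure_ratio(cur, deployed_at, deployed_at + WINDOW)
    # Failed tests are part of normal operation, so alert on the relative
    # jump rather than on any absolute number of failed jobs.
    return after > max(before, 0.01) * RATIO_FACTOR


if __name__ == "__main__":
    conn = psycopg2.connect("dbname=openqa")  # hypothetical connection string
    if deployment_looks_bad(conn, datetime.utcnow() - WINDOW):
        print("ALERT: job failure ratio jumped after the last deployment")
```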

Related issues 1 (0 open, 1 closed)

Copied from openQA Project - action #91461: Test is missing webui results and fail despite all tests passed (Resolved, assignee: mkittler, 2021-04-21 to 2021-05-06)

Actions #1

Updated by okurz over 3 years ago

  • Copied from action #91461: Test is missing webui results and fail despite all tests passed added
Actions #2

Updated by mkittler over 3 years ago

Note that this is not about a problem of the deployment itself but about problems caused by the newly deployed code. I turned failures like the ones from #90152 into incompletes (see https://github.com/os-autoinst/openQA/pull/3869), so the exact same issue as in #90152 should be resolved now.

I'm not sure whether AC1 makes sense. Do we really want an alert for failed jobs in general? We would likely need a very high threshold because failed tests are part of normal operation.

I assume that by "CI tests" AC2 means the openQA upstream test suite. I'm not sure what you want to change here. What do you mean by "accounting" specifically? We do have tests for setting the test result and for the views which show accumulated figures (and e.g. in https://github.com/os-autoinst/openQA/pull/3869 I had to adapt tests).

Actions #3

Updated by okurz over 3 years ago

mkittler wrote:

> Note that this is not about a problem of the deployment itself but about problems caused by the newly deployed code. I turned failures like the ones from #90152 into incompletes (see https://github.com/os-autoinst/openQA/pull/3869), so the exact same issue as in #90152 should be resolved now.
>
> I'm not sure whether AC1 makes sense. Do we really want an alert for failed jobs in general? We would likely need a very high threshold because failed tests are part of normal operation.

Well, possibly not an alert as simple as "any failed jobs", but really only "failed jobs caused by openQA deployments", so we need to find out which those are. E.g. think about openqa-investigate: assume that after a deployment a job fails while the previous one was good, we schedule openqa-investigate, and all openqa-investigate jobs fail as well, so we identify neither a test regression nor a product regression as the failure reason. Then it becomes much more likely that openQA itself is the reason for the failure, and we should learn about such cases as soon as possible.
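
A minimal sketch of that heuristic, deliberately kept abstract: the Job type and all names are made up for illustration, and whether the data would come from the openQA REST API or directly from the database is left open.

```python
# Sketch of the openqa-investigate heuristic described above; all names
# are hypothetical.
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Job:
    scenario: str
    result: str  # e.g. "passed", "failed", "incomplete"
    finished: datetime


def suspect_openqa_regression(job: Job, previous: Job,
                              investigate_jobs: list[Job],
                              deployed_at: datetime) -> bool:
    """Flag a failure as a likely openQA regression: it finished after the
    deployment, the previous run of the same scenario was good, and all
    openqa-investigate jobs failed too, so neither a test regression nor a
    product regression was identified as the reason."""
    return (job.result != "passed"
            and job.finished > deployed_at
            and previous.scenario == job.scenario
            and previous.result == "passed"
            and bool(investigate_jobs)
            and all(j.result != "passed" for j in investigate_jobs))
```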

> I assume that by "CI tests" AC2 means the openQA upstream test suite. I'm not sure what you want to change here. What do you mean by "accounting" specifically? We do have tests for setting the test result and for the views which show accumulated figures (and e.g. in https://github.com/os-autoinst/openQA/pull/3869 I had to adapt tests).

Maybe we can answer this question better after we understand how the recent regression could slip through our CI tests. Could it be that we mostly rely on static job fixtures and hence do not test the timing-dependent behavior of handling test module result updates well enough?
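
To illustrate what testing such timing-dependent behavior could mean, here is a generic sketch, not openQA's actual test API: the accounting function is a toy stand-in, and the test simply replays module result updates in every arrival order and asserts that the accounted overall result does not depend on the order.

```python
# Generic illustration of testing timing/order dependence instead of relying
# on static job fixtures. The accounting function is a toy stand-in, not
# openQA's real implementation.
from itertools import permutations


def account_overall(updates):
    """Toy accounting: the job passes only if all reported modules passed."""
    state = {}
    for module, result in updates:
        state[module] = result  # a later update overwrites an earlier one
    return "passed" if all(r == "passed" for r in state.values()) else "failed"


def test_result_accounting_is_order_independent():
    updates = [("boot", "passed"), ("install", "passed"), ("reboot", "failed")]
    expected = account_overall(updates)
    for order in permutations(updates):
        assert account_overall(list(order)) == expected


if __name__ == "__main__":
    test_result_accounting_is_order_independent()
    print("ok")
```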

Actions #4

Updated by okurz over 3 years ago

  • Status changed from Workable to New

Moving all tickets without size confirmation by the team back to "New". The team should move the tickets back after estimating and agreeing on a consistent size.

Actions #5

Updated by okurz over 3 years ago

  • Target version changed from Ready to future