action #91494: [epic] work on #90152 caused deployment problem and no monitoring alert - openQA Project (public) - openSUSE Project Management Tool

[epic] work on #90152 caused deployment problem and no monitoring alert

Added by okurz almost 4 years ago. Updated over 3 years ago.

Status: New
Priority: Normal
Assignee:
Category: Feature requests
Target version: QA (public) - future
Start date: 2021-04-21
Due date:
% Done:
Estimated time:

Description

Observation

The work on #90152 caused deployment problems on both o3 and osd, as users reported, e.g. in #91461 and in chat messages from dzedro, coolo and dimstar. In particular, the PR https://github.com/os-autoinst/openQA/pull/3849, which showed no problems in CI, seems to have caused jobs to show no details and to fail despite all modules passing. This reveals two problems: we deployed a bug that no test caught before the merge, and no monitoring alert fired.

Acceptance criteria

AC1: Failed jobs caused by openQA deployments trigger alerts
AC2: We have better CI tests to prevent regressions in the accounting for failed openQA jobs

Related issues 1 (0 open — 1 closed)

Updated by okurz almost 4 years ago

Copied from action #91461: Test is missing webui results and fail despite all tests passed added

Updated by mkittler almost 4 years ago

Note that this is not about a problem of the deployment itself but about problems caused by the newly deployed code. I turned failures like the ones from #90152 into incompletes (see https://github.com/os-autoinst/openQA/pull/3869) so the exact same issue from #90152 should be resolved now.

Not sure whether AC1 makes sense. Do we really generally want an alert for that? We would likely need a very high threshold because failed tests are part of the normal operation.
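To make the threshold concern concrete, here is a minimal sketch of what a naive ratio-based alert could look like. The function names, the 0.9 threshold and the minimum of 20 jobs are assumptions chosen for illustration only; they are not taken from openQA or from the existing monitoring setup.

```python
from typing import Sequence


def failure_ratio(results: Sequence[str]) -> float:
    """Return the share of results equal to "failed" (0.0 for an empty list)."""
    if not results:
        return 0.0
    return sum(1 for r in results if r == "failed") / len(results)


def should_alert(
    results_after_deployment: Sequence[str],
    threshold: float = 0.9,  # assumed value; failed tests are normal, so it must be high
    min_jobs: int = 20,      # assumed value; avoids alerting on a handful of jobs
) -> bool:
    """Alert only if enough jobs finished and almost all of them failed."""
    if len(results_after_deployment) < min_jobs:
        return False
    return failure_ratio(results_after_deployment) >= threshold
```

With such a high threshold the alert would only catch a near-total breakage, which illustrates why a plain ratio may be too blunt for AC1.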

I assume "CI tests" in AC2 means the openQA upstream test suite. I'm not sure what you want to change here. What do you mean by "accounting" specifically? We do have tests for setting the test result and for the views that show accumulated figures (e.g. in https://github.com/os-autoinst/openQA/pull/3869 I had to adapt tests).

Updated by okurz almost 4 years ago

mkittler wrote:

Note that this is not about a problem of the deployment itself but about problems caused by the newly deployed code. I turned failures like the ones from #90152 into incompletes (see https://github.com/os-autoinst/openQA/pull/3869) so the exact same issue from #90152 should be resolved now.

Not sure whether AC1 makes sense. Do we really generally want an alert for that? We would likely need a very high threshold because failed tests are part of the normal operation.

Well, possibly not an alert as simple as "any failed jobs" but really only "Failed jobs caused by openQA deployments", so we need to find out which ones those are. Think about openqa-investigate, for example. Let's assume a job fails after a deployment while the previous one was good, we schedule openqa-investigate, and all openqa-investigate jobs fail as well, so we can identify neither a test regression nor a product regression as the reason for the failure. Then it becomes much more likely that openQA itself is the cause, and we should learn about these cases as soon as possible.
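The heuristic described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `JobOutcome` and `suspect_openqa_regression` are hypothetical names, the result strings are simplified, and nothing here reflects the actual openQA data model or the openqa-investigate implementation.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class JobOutcome:
    """Hypothetical, simplified view of a finished job (not openQA's schema)."""
    result: str                       # e.g. "passed" or "failed"
    finished_after_deployment: bool   # True if the job ran on the newly deployed code


def suspect_openqa_regression(
    current: JobOutcome,
    previous: JobOutcome,
    investigate_jobs: Sequence[JobOutcome],
) -> bool:
    """Return True if openQA itself is a likely cause of the failure.

    Conditions, following the reasoning in the comment:
    1. the current job failed after a deployment,
    2. the previous job in the same scenario was good,
    3. investigation jobs were scheduled and every one of them failed,
       so neither a test nor a product regression was identified.
    """
    if current.result != "failed" or not current.finished_after_deployment:
        return False
    if previous.result != "passed":
        return False
    if not investigate_jobs:
        # Without investigation results there is no evidence either way.
        return False
    return all(job.result == "failed" for job in investigate_jobs)
```

An alert built on this check would fire only for the narrow case where investigation rules out the usual suspects, rather than for every failed job.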

I assume "CI tests" in AC2 means the openQA upstream test suite. I'm not sure what you want to change here. What do you mean by "accounting" specifically? We do have tests for setting the test result and for the views that show accumulated figures (e.g. in https://github.com/os-autoinst/openQA/pull/3869 I had to adapt tests).

Maybe we can answer this question better once we understand how the recent regression could slip through our CI tests. Could it be that we mostly rely on job fixtures and hence do not test the timing-dependent handling of test module result updates well enough?

Updated by okurz over 3 years ago

Status changed from Workable to New

Moving all tickets without size confirmation by the team back to "New". The team should move the tickets back after estimating and agreeing on a consistent size.

Updated by okurz over 3 years ago

Target version changed from Ready to future
