action #91494: [epic] work on #90152 caused deployment problem and no monitoring alert - openQA Project (public) - openSUSE Project Management Tool

action #91494

open

[epic] work on #90152 caused deployment problem and no monitoring alert

Added by okurz over 3 years ago. Updated over 3 years ago.

Status: New
Priority: Normal
Assignee:
Category: Feature requests
Target version: QA (public) - future
Start date: 2021-04-21
Due date:
% Done:
Estimated time:

Description

Observation

The work on #90152 caused deployment problems on both o3 and osd, as users reported, e.g. in #91461 and in chat messages from dzedro, coolo and dimstar. In particular, the PR https://github.com/os-autoinst/openQA/pull/3849, which showed no problems in CI, seems to have caused jobs to show no details and to fail despite all modules passing. This reveals two problems: we deployed a bug that no test revealed before merge, and no monitoring alert fired.

Acceptance criteria

AC1: Failed jobs caused by openQA deployments trigger alerts
AC2: We have better CI tests to prevent regressions in the accounting for failed openQA jobs

Related issues 1 (0 open — 1 closed)

Copied from openQA Project (public) - action #91461: Test is missing webui results and fail despite all tests passed | Resolved | mkittler | 2021-04-21 | 2021-05-06


History

Updated by okurz over 3 years ago

Copied from action #91461: Test is missing webui results and fail despite all tests passed added

Updated by mkittler over 3 years ago

Note that this is not about a problem of the deployment itself but about problems caused by the newly deployed code. I turned failures like the ones from #90152 into incompletes (see https://github.com/os-autoinst/openQA/pull/3869) so the exact same issue from #90152 should be resolved now.

Not sure whether AC1 makes sense. Do we really generally want an alert for that? We would likely need a very high threshold because failed tests are part of the normal operation.
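
To illustrate the threshold concern, here is a minimal sketch of a relative check rather than a fixed one. It assumes job results are available as plain strings and compares the failure rate after a deployment against a baseline from before it. The function names, the result string "failed" and the default parameters (`min_jobs`, `factor`, `floor`) are hypothetical and not part of openQA or its monitoring setup.

```python
from typing import Sequence


def failure_rate(results: Sequence[str]) -> float:
    """Return the fraction of results equal to "failed" (0.0 for no results)."""
    if not results:
        return 0.0
    return sum(1 for r in results if r == "failed") / len(results)


def should_alert(
    baseline: Sequence[str],
    post_deploy: Sequence[str],
    min_jobs: int = 20,    # hypothetical: ignore samples that are too small
    factor: float = 2.0,   # hypothetical: required increase over the baseline
    floor: float = 0.5,    # hypothetical: never alert below this failure rate
) -> bool:
    """Alert only if failures after a deployment clearly exceed normal operation."""
    if len(post_deploy) < min_jobs:
        return False
    # Failed tests are part of normal operation, so the threshold scales with
    # the baseline and is capped at 1.0 so that it stays reachable.
    threshold = min(1.0, max(floor, factor * failure_rate(baseline)))
    return failure_rate(post_deploy) >= threshold
```

With a baseline failure rate of 20%, this sketch would stay silent at 25% failures after a deployment and alert at 90%. Whether such a rule is precise enough for AC1 is exactly the open question.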

I assume "CI tests" in AC2 refers to the openQA upstream test suite. I'm not sure what you want to change here. What do you mean by "accounting" specifically? We actually do have tests for setting the test result and for the views which show accumulated figures (e.g. in https://github.com/os-autoinst/openQA/pull/3869 I had to adapt tests).

Updated by okurz over 3 years ago

mkittler wrote:

Note that this is not about a problem of the deployment itself but about problems caused by the newly deployed code. I turned failures like the ones from #90152 into incompletes (see https://github.com/os-autoinst/openQA/pull/3869) so the exact same issue from #90152 should be resolved now.

Not sure whether AC1 makes sense. Do we really generally want an alert for that? We would likely need a very high threshold because failed tests are part of the normal operation.

Well, possibly not an alert as simple as "any failed jobs" but really only "failed jobs caused by openQA deployments", so we need to find out which ones those are. For example, think about openqa-investigate: assume a job fails after a deployment while the previous one passed, we schedule openqa-investigate, and all openqa-investigate jobs fail as well, so we can identify neither a test regression nor a product regression as the reason for the failure. Then it becomes much more likely that openQA itself is the cause, and we should learn about these cases as soon as possible.
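
The heuristic described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name, its parameters and the result strings "passed" and "failed" are hypothetical, and the real output of openqa-investigate is not modeled here.

```python
from typing import Sequence


def openqa_is_suspect(
    failed_after_deployment: bool,
    previous_result: str,
    investigate_results: Sequence[str],
) -> bool:
    """Return True if openQA itself is a likely cause of a job failure.

    All inputs are hypothetical simplifications of what openQA and
    openqa-investigate would provide.
    """
    # Only failures that happened after a deployment are of interest.
    if not failed_after_deployment:
        return False
    # The previous job must have been good, otherwise the failure is not new.
    if previous_result != "passed":
        return False
    # Without investigation jobs there is nothing to conclude yet.
    if not investigate_results:
        return False
    # If every investigation job fails too, neither a test regression nor a
    # product regression was identified, which points towards openQA.
    return all(result == "failed" for result in investigate_results)
```

A job for which this returns True would be a candidate for the alert in AC1, while a single passing investigation job would instead point to a test or product regression.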

I assume "CI tests" in AC2 refers to the openQA upstream test suite. I'm not sure what you want to change here. What do you mean by "accounting" specifically? We actually do have tests for setting the test result and for the views which show accumulated figures (e.g. in https://github.com/os-autoinst/openQA/pull/3869 I had to adapt tests).

Maybe we can answer this question better once we understand how the recent regression could slip through our CI tests. Could it be that we mostly rely on job fixtures and hence do not test the timing-dependent handling of test module result updates well enough?

Updated by okurz over 3 years ago

Status changed from Workable to New

Moving all tickets without size confirmation by the team back to "New". The team should move the tickets back after estimating and agreeing on a consistent size.

Updated by okurz over 3 years ago

Target version changed from Ready to future
