action #91494
open
[epic] work on #90152 caused deployment problem and no monitoring alert
Added by okurz over 3 years ago.
Updated over 3 years ago.
Category:
Feature requests
Description
Observation
The work on #90152 caused deployment problems on both o3 and osd, as users informed us, e.g. in #91461 and in chat messages from dzedro, coolo and dimstar. In particular, the PR https://github.com/os-autoinst/openQA/pull/3849, which showed no problems in CI, seems to have caused jobs to show no details and to fail despite all modules passing. Two problems: we deployed a bug that no test revealed before merge, and there was no monitoring alert.
Acceptance criteria
- AC1: Failed jobs caused by openQA deployments trigger alerts
- AC2: We have better CI tests to prevent regressions in the accounting for failed openQA jobs
- Copied from action #91461: Test is missing webui results and fail despite all tests passed added
Note that this is not about a problem of the deployment itself but about problems caused by the newly deployed code. I turned failures like the ones from #90152 into incompletes (see https://github.com/os-autoinst/openQA/pull/3869) so the exact same issue from #90152 should be resolved now.
Not sure whether AC1 makes sense. Do we really generally want an alert for that? We would likely need a very high threshold because failed tests are part of the normal operation.
I assume AC2 means the openQA upstream testsuite with "CI tests". I'm not sure what you want to change here. What do you mean with "accounting" specifically? We actually do have tests for setting the test result and the views which show accumulated figures (and e.g. in https://github.com/os-autoinst/openQA/pull/3869 I had to adapt tests).
mkittler wrote:
Note that this is not about a problem of the deployment itself but about problems caused by the newly deployed code. I turned failures like the ones from #90152 into incompletes (see https://github.com/os-autoinst/openQA/pull/3869) so the exact same issue from #90152 should be resolved now.
Not sure whether AC1 makes sense. Do we really generally want an alert for that? We would likely need a very high threshold because failed tests are part of the normal operation.
Well, possibly not an alert as simple as "any failed jobs" but really only "Failed jobs caused by openQA deployments", so we need to find out which ones those are. E.g. think about openqa-investigate: let's assume a job fails after a deployment and the previous one was good. We schedule openqa-investigate and all openqa-investigate jobs fail, so we identify neither a test regression nor a product regression as the reason for the failure. Then it becomes much more likely that openQA itself is the reason for the failure, and we should learn about these cases as soon as possible.
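The heuristic above could be sketched roughly as follows. This is only an illustration under assumptions: `JobResult`, its fields and `openqa_regression_suspected` are hypothetical names, not part of openQA or openqa-investigate, and how "finished after deployment" would actually be determined is left open.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class JobResult:
    """Hypothetical minimal view of a job; not an openQA data structure."""
    job_id: int
    result: str  # e.g. "passed", "failed", "incomplete"
    finished_after_deployment: bool = False


def openqa_regression_suspected(
    job: JobResult,
    previous: Optional[JobResult],
    investigation_jobs: Sequence[JobResult],
) -> bool:
    """Return True if openQA itself is a likely cause of the failure.

    Assumed conditions, following the reasoning above:
    1. the job failed after a deployment,
    2. the previous job in the same scenario passed,
    3. investigation jobs were scheduled and all of them failed, so
       neither a test regression nor a product regression was identified.
    """
    if job.result != "failed" or not job.finished_after_deployment:
        return False
    if previous is None or previous.result != "passed":
        return False
    if not investigation_jobs:
        # Without investigation results we cannot rule out other causes.
        return False
    return all(j.result == "failed" for j in investigation_jobs)
```

An alert would then only fire on jobs for which this check returns true, rather than on every failed job, which should keep the threshold concern manageable.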
I assume AC2 means the openQA upstream testsuite with "CI tests". I'm not sure what you want to change here. What do you mean with "accounting" specifically? We actually do have tests for setting the test result and the views which show accumulated figures (and e.g. in https://github.com/os-autoinst/openQA/pull/3869 I had to adapt tests).
Maybe we can answer this question better once we understand how the recent regression could slip through our CI tests. Could it be that we mostly rely on job fixtures and hence do not test the timing-dependent behavior of handling test module result updates well enough?
- Status changed from Workable to New
Moving all tickets without size confirmation by the team back to "New". The team should move the tickets back after estimating and agreeing on a consistent size.
- Target version changed from Ready to future