Project

General

Profile

Actions

action #80274

closed

coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

coordination #55364: [epic] Let's make codecov reports reliable

Fix flaky coverage - t/lib/OpenQA/Test/Utils.pm size:M

Added by okurz over 3 years ago. Updated over 2 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Feature requests
Target version:
Start date:
2020-11-24
Due date:
% Done:

0%

Estimated time:

Description

Motivation

See #55364 : codecov reports often report about coverage changes which are obviously not related to the actual changes of a PR, e.g. when documentation is changed. We can already trust our coverage analysis more but should have only coverage changes reported for actual changes we introduced in a pull request.

Acceptance criteria

  • AC1: t/lib/OpenQA/Test/Utils.pm does not appear anymore as changing code coverage in unrelated changes

Acceptance tests

  • AT1-1: codecov reports in PRs do not reference t/lib/OpenQA/Test/Utils.pm in about 20 PR CI runs (except where t/lib/OpenQA/Test/Utils.pm is actively changed)

Suggestions


Related issues 4 (0 open4 closed)

Related to openQA Project - action #95662: Codecov: Green/red markers are off by one or more lines sometimes size:SResolvedtinita2021-07-192021-08-03

Actions
Copied from openQA Project - action #80268: Fix flaky coverage - t/05-scheduler-full.tResolvedokurz2020-11-24

Actions
Copied to openQA Project - action #80298: Fix flaky coverage - lib/OpenQA/WebSockets.pmResolvedokurz2020-11-24

Actions
Copied to openQA Project - action #89899: Fix flaky coverage - t/ui/27-plugin_obs_rsync_status_details.tResolvedkraih

Actions
Actions #1

Updated by okurz over 3 years ago

  • Copied from action #80268: Fix flaky coverage - t/05-scheduler-full.t added
Actions #2

Updated by okurz over 3 years ago

  • Status changed from In Progress to Workable
  • Assignee deleted (okurz)

reports like https://app.codecov.io/gh/os-autoinst/openQA/compare/3567/changes look like we are missing coverage reports from some processes. Check if we need to add something like Devel::Cover::report() if Devel::Cover->can('report'); in some more cases.

Actions #3

Updated by okurz over 3 years ago

  • Copied to action #80298: Fix flaky coverage - lib/OpenQA/WebSockets.pm added
Actions #4

Updated by okurz over 3 years ago

https://codecov.io/gh/os-autoinst/openQA/pull/3606/changes shows more unreliable coverage changes:

In t/lib/OpenQA/Test/Utils.pm the method sub setup_worker shows completely missing coverage now and sub unstable_worker partial coverage:

sub setup_worker {
    my ($worker, $host) = @_;
    $worker->settings->webui_hosts([]);
    $worker->settings->webui_host_specific_settings({});
    push(@{$worker->settings->webui_hosts}, $host);
    $worker->settings->webui_host_specific_settings->{$host} = {};
    $worker->log_setup_info;

sub unstable_worker {

        _setup_sub_process 'openqa-worker-unstable';
        my $worker = OpenQA::Worker->new(
        setup_worker($worker, $host);
        $worker->init();
        if ($ticks < 0) {
            Mojo::IOLoop->singleton->one_tick for (0 .. $ticks);
        Devel::Cover::report() if Devel::Cover->can('report');

so maybe the method was called but we could not successfully finish the last line which could explain the missing report.

Actions #5

Updated by okurz over 3 years ago

  • Description updated (diff)
Actions #6

Updated by Xiaojing_liu about 3 years ago

I could reproduce this in my local. But I am not sure how to fix it. Could we simply use "uncoverable" to the flaky part, such as sub setup_worker and unstable_worker? or we should find out why sometimes these codes are not covered? I did not find out why this happened.

Actions #7

Updated by livdywan about 3 years ago

Commands used:

make coverage SCHEDULER_FULLSTACK=1 TESTS=t/05-scheduler-full.t | grep '^t/05-scheduler-full'
DIE_ON_FAIL=1 make test-with-database SCHEDULER_FULLSTACK=1 TESTS=t/05-scheduler-full.t

So I think the problem here is that the sigchild handler never completes. A simple note I added wasn't even showing. There's existing code checking if the worker's available via the Workers table but that's not useful for this purpose. The function at best makes it to the initial IOLoop invocation, before it gets to the explicit Devel::Cover::report call.
Having written down all of that I realize this is with $ticks=-1 which means the IOLoop blocks forever...🤦 Always forcing n ticks doesn't seem to help much, but moving up the report call at least gets coverage up to 55.4%.

https://github.com/os-autoinst/openQA/pull/3655 has some of my attempts at improving this, and although I'm not making a lot of headway here maybe someone else will have more ideas

Actions #8

Updated by livdywan about 3 years ago

  • Description updated (diff)
Actions #9

Updated by livdywan about 3 years ago

  • Status changed from Workable to In Progress
  • Assignee set to livdywan

I should probably take this ticket

Actions #10

Updated by openqa_review about 3 years ago

  • Due date set to 2021-01-23

Setting due date based on mean cycle time of SUSE QE Tools

Actions #12

Updated by livdywan about 3 years ago

We discussed this in the weekly and I took some notes on how to tackle this better, since the PR review had raised some interesting questions.

  • What does unstable_worker do?
  • Can we use an event on the worker, like the current job?
  • Where does it usually stop?
  • Why do we want the infinite worker?
  • An "unstable" worker can't process jobs?
  • The original idea of the test was conceived to behave differently between runs - a counter example would be the scalability test which overlaps with this one.
  • Maybe delete unstable worker tests, see what the coverage looks like afterwards and write new tests for what's missing with defined use cases for issues that we know to expect. We have more knowledge since the test was introduced.

That last part is what I'll look into, rather than trying to fixup a test don't really consider a good approach.

Actions #13

Updated by livdywan about 3 years ago

  • Due date changed from 2021-01-23 to 2021-01-29
Actions #14

Updated by livdywan about 3 years ago

cdywan wrote:

  • Maybe delete unstable worker tests, see what the coverage looks like afterwards and write new tests for what's missing with defined use cases for issues that we know to expect. We have more knowledge since the test was introduced.

https://github.com/os-autoinst/openQA/pull/3695

Actions #15

Updated by livdywan about 3 years ago

  • Due date changed from 2021-01-29 to 2021-02-12
Actions #16

Updated by livdywan about 3 years ago

  • Due date changed from 2021-02-12 to 2021-02-19

Unfortunately I didn't actually make much progress here, maybe I should've put this on Feedback. I put up a basic/radical PR first of all to see what Codecov might report back. I still need to add new tests as discussed.

Actions #17

Updated by okurz about 3 years ago

  • Due date deleted (2021-02-19)
  • Status changed from In Progress to Blocked

As this is strongly relying on a reliable coverage report I consider this blocked by #88915 which you also picked up yourself so let's mark this ticket accordingly

Actions #18

Updated by okurz about 3 years ago

  • Status changed from Blocked to Workable

unblocked

Actions #19

Updated by openqa_review about 3 years ago

  • Due date set to 2021-03-24

Setting due date based on mean cycle time of SUSE QE Tools

Actions #20

Updated by okurz about 3 years ago

  • Copied to action #89899: Fix flaky coverage - t/ui/27-plugin_obs_rsync_status_details.t added
Actions #21

Updated by livdywan about 3 years ago

  • Due date deleted (2021-03-24)

Resetting the due date due to hackweek. And since I'm not actively working on the proposed approach I think it makes more sense to release it for now and take it back later. Or in case somebody feels like taking it I won't be upset.

Actions #22

Updated by livdywan about 3 years ago

  • Assignee deleted (livdywan)

Trying to unassign again 😛️ Mainly to stop the bot from adding a due date before it's in progress

Actions #23

Updated by okurz almost 3 years ago

  • Description updated (diff)
Actions #24

Updated by okurz over 2 years ago

  • Subject changed from Fix flaky coverage - t/lib/OpenQA/Test/Utils.pm to Fix flaky coverage - t/lib/OpenQA/Test/Utils.pm size:M
  • Description updated (diff)
Actions #25

Updated by kraih over 2 years ago

  • Assignee set to kraih

I'll take a look, since i've fixed similar issues before.

Actions #26

Updated by tinita over 2 years ago

  • Related to action #95662: Codecov: Green/red markers are off by one or more lines sometimes size:S added
Actions #27

Updated by okurz over 2 years ago

With #95662 resolved – thx tinita – codecov reports should be more consistent and it should be easier to fix the coverage.

Actions #28

Updated by kraih over 2 years ago

okurz wrote:

With #95662 resolved – thx tinita – codecov reports should be more consistent and it should be easier to fix the coverage.

Excellent, that was really fast.

Actions #29

Updated by kraih over 2 years ago

  • Status changed from Workable to In Progress
Actions #30

Updated by openqa_review over 2 years ago

  • Due date set to 2021-08-10

Setting due date based on mean cycle time of SUSE QE Tools

Actions #31

Updated by kraih over 2 years ago

  • Status changed from In Progress to Feedback

Maybe resolved with the merged PR.

Actions #32

Updated by livdywan over 2 years ago

  • Due date deleted (2021-08-10)
  • Status changed from Feedback to Resolved

I still see gaps but it looks to be stable

Actions

Also available in: Atom PDF