Project

General

Profile

action #80274

coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

coordination #55364: [epic] Let's make codecov reports reliable

Fix flaky coverage - t/lib/OpenQA/Test/Utils.pm size:M

Added by okurz 8 months ago. Updated 2 days ago.

Status:
Feedback
Priority:
Normal
Assignee:
Category:
Feature requests
Target version:
Start date:
2020-11-24
Due date:
2021-08-10
% Done:

0%

Estimated time:
Difficulty:

Description

Motivation

See #55364 : codecov reports often report about coverage changes which are obviously not related to the actual changes of a PR, e.g. when documentation is changed. We can already trust our coverage analysis more but should have only coverage changes reported for actual changes we introduced in a pull request.

Acceptance criteria

  • AC1: t/lib/OpenQA/Test/Utils.pm does not appear anymore as changing code coverage in unrelated changes

Acceptance tests

  • AT1-1: codecov reports in PRs do not reference t/lib/OpenQA/Test/Utils.pm in about 20 PR CI runs (except where t/lib/OpenQA/Test/Utils.pm is actively changed)

Suggestions


Related issues

Related to openQA Project - action #95662: Codecov: Green/red markers are off by one or more lines sometimes size:SResolved2021-07-192021-08-03

Copied from openQA Project - action #80268: Fix flaky coverage - t/05-scheduler-full.tResolved2020-11-24

Copied to openQA Project - action #80298: Fix flaky coverage - lib/OpenQA/WebSockets.pmResolved2020-11-24

Copied to openQA Project - action #89899: Fix flaky coverage - t/ui/27-plugin_obs_rsync_status_details.tResolved

History

#1 Updated by okurz 8 months ago

  • Copied from action #80268: Fix flaky coverage - t/05-scheduler-full.t added

#2 Updated by okurz 8 months ago

  • Status changed from In Progress to Workable
  • Assignee deleted (okurz)

reports like https://app.codecov.io/gh/os-autoinst/openQA/compare/3567/changes look like we are missing coverage reports from some processes. Check if we need to add something like Devel::Cover::report() if Devel::Cover->can('report'); in some more cases.

#3 Updated by okurz 8 months ago

  • Copied to action #80298: Fix flaky coverage - lib/OpenQA/WebSockets.pm added

#4 Updated by okurz 8 months ago

https://codecov.io/gh/os-autoinst/openQA/pull/3606/changes shows more unreliable coverage changes:

In t/lib/OpenQA/Test/Utils.pm the method sub setup_worker shows completely missing coverage now and sub unstable_worker partial coverage:

sub setup_worker {
    my ($worker, $host) = @_;
    $worker->settings->webui_hosts([]);
    $worker->settings->webui_host_specific_settings({});
    push(@{$worker->settings->webui_hosts}, $host);
    $worker->settings->webui_host_specific_settings->{$host} = {};
    $worker->log_setup_info;
…
sub unstable_worker {
…
        _setup_sub_process 'openqa-worker-unstable';
        my $worker = OpenQA::Worker->new(
        setup_worker($worker, $host);
        $worker->init();
        if ($ticks < 0) {
            Mojo::IOLoop->singleton->one_tick for (0 .. $ticks);
        Devel::Cover::report() if Devel::Cover->can('report');

so maybe the method was called but we could not successfully finish the last line which could explain the missing report.

#5 Updated by okurz 8 months ago

  • Description updated (diff)

#6 Updated by Xiaojing_liu 7 months ago

I could reproduce this in my local. But I am not sure how to fix it. Could we simply use "uncoverable" to the flaky part, such as sub setup_worker and unstable_worker? or we should find out why sometimes these codes are not covered? I did not find out why this happened.

#7 Updated by cdywan 7 months ago

Commands used:

make coverage SCHEDULER_FULLSTACK=1 TESTS=t/05-scheduler-full.t | grep '^t/05-scheduler-full'
DIE_ON_FAIL=1 make test-with-database SCHEDULER_FULLSTACK=1 TESTS=t/05-scheduler-full.t

So I think the problem here is that the sigchild handler never completes. A simple note I added wasn't even showing. There's existing code checking if the worker's available via the Workers table but that's not useful for this purpose. The function at best makes it to the initial IOLoop invocation, before it gets to the explicit Devel::Cover::report call.
Having written down all of that I realize this is with $ticks=-1 which means the IOLoop blocks forever...🤦 Always forcing n ticks doesn't seem to help much, but moving up the report call at least gets coverage up to 55.4%.

https://github.com/os-autoinst/openQA/pull/3655 has some of my attempts at improving this, and although I'm not making a lot of headway here maybe someone else will have more ideas

#8 Updated by cdywan 7 months ago

  • Description updated (diff)

#9 Updated by cdywan 7 months ago

  • Status changed from Workable to In Progress
  • Assignee set to cdywan

I should probably take this ticket

#10 Updated by openqa_review 7 months ago

  • Due date set to 2021-01-23

Setting due date based on mean cycle time of SUSE QE Tools

#12 Updated by cdywan 6 months ago

We discussed this in the weekly and I took some notes on how to tackle this better, since the PR review had raised some interesting questions.

  • What does unstable_worker do?
  • Can we use an event on the worker, like the current job?
  • Where does it usually stop?
  • Why do we want the infinite worker?
  • An "unstable" worker can't process jobs?
  • The original idea of the test was conceived to behave differently between runs - a counter example would be the scalability test which overlaps with this one.
  • Maybe delete unstable worker tests, see what the coverage looks like afterwards and write new tests for what's missing with defined use cases for issues that we know to expect. We have more knowledge since the test was introduced.

That last part is what I'll look into, rather than trying to fixup a test don't really consider a good approach.

#13 Updated by cdywan 6 months ago

  • Due date changed from 2021-01-23 to 2021-01-29

#14 Updated by cdywan 6 months ago

cdywan wrote:

  • Maybe delete unstable worker tests, see what the coverage looks like afterwards and write new tests for what's missing with defined use cases for issues that we know to expect. We have more knowledge since the test was introduced.

https://github.com/os-autoinst/openQA/pull/3695

#15 Updated by cdywan 6 months ago

  • Due date changed from 2021-01-29 to 2021-02-12

#16 Updated by cdywan 5 months ago

  • Due date changed from 2021-02-12 to 2021-02-19

Unfortunately I didn't actually make much progress here, maybe I should've put this on Feedback. I put up a basic/radical PR first of all to see what Codecov might report back. I still need to add new tests as discussed.

#17 Updated by okurz 5 months ago

  • Due date deleted (2021-02-19)
  • Status changed from In Progress to Blocked

As this is strongly relying on a reliable coverage report I consider this blocked by #88915 which you also picked up yourself so let's mark this ticket accordingly

#18 Updated by okurz 5 months ago

  • Status changed from Blocked to Workable

unblocked

#19 Updated by openqa_review 5 months ago

  • Due date set to 2021-03-24

Setting due date based on mean cycle time of SUSE QE Tools

#20 Updated by okurz 5 months ago

  • Copied to action #89899: Fix flaky coverage - t/ui/27-plugin_obs_rsync_status_details.t added

#21 Updated by cdywan 4 months ago

  • Due date deleted (2021-03-24)

Resetting the due date due to hackweek. And since I'm not actively working on the proposed approach I think it makes more sense to release it for now and take it back later. Or in case somebody feels like taking it I won't be upset.

#22 Updated by cdywan 4 months ago

  • Assignee deleted (cdywan)

Trying to unassign again 😛️ Mainly to stop the bot from adding a due date before it's in progress

#23 Updated by okurz 3 months ago

  • Description updated (diff)

#24 Updated by okurz 29 days ago

  • Subject changed from Fix flaky coverage - t/lib/OpenQA/Test/Utils.pm to Fix flaky coverage - t/lib/OpenQA/Test/Utils.pm size:M
  • Description updated (diff)

#25 Updated by kraih 17 days ago

  • Assignee set to kraih

I'll take a look, since i've fixed similar issues before.

#26 Updated by tinita 9 days ago

  • Related to action #95662: Codecov: Green/red markers are off by one or more lines sometimes size:S added

#27 Updated by okurz 7 days ago

With #95662 resolved – thx tinita – codecov reports should be more consistent and it should be easier to fix the coverage.

#28 Updated by kraih 7 days ago

okurz wrote:

With #95662 resolved – thx tinita – codecov reports should be more consistent and it should be easier to fix the coverage.

Excellent, that was really fast.

#29 Updated by kraih 3 days ago

  • Status changed from Workable to In Progress

#30 Updated by openqa_review 2 days ago

  • Due date set to 2021-08-10

Setting due date based on mean cycle time of SUSE QE Tools

#31 Updated by kraih 2 days ago

  • Status changed from In Progress to Feedback

Maybe resolved with the merged PR.

Also available in: Atom PDF