action #93086
closed
unstable test in openQA master t/10-jobs.t exceeding runtime of 280s
Added by okurz over 3 years ago. Updated over 3 years ago.
Description
Observation
The failing CircleCI job shows:
RETRY=0 timeout -s SIGINT -k 5 -v $((25 * (0 + 1) ))m tools/retry prove -l --harness TAP::Harness::JUnit --timer t/10-jobs.t t/14-grutasks.t t/32-openqa_client.t t/44-scripts.t
[11:40:36] t/10-jobs.t ........... 22/? Bailout called. Further testing stopped: test exceeds runtime limit of '280' seconds
Acceptance criteria
- AC1: t/10-jobs.t takes a runtime significantly below 280s in CircleCI (with coverage analysis enabled)
Suggestions
- Run t/10-jobs.t locally with and without coverage analysis and check runtime for regression and optimization hotspots
Updated by kraih over 3 years ago
- Assignee set to kraih
I'll run a profiler against the test to see what's going on.
Updated by kraih over 3 years ago
There might actually be a bug somewhere, because the test is not heavy at all. Locally it takes 4 seconds to finish.
All tests successful.
Files=1, Tests=41, 4 wallclock secs ( 0.06 usr 0.01 sys + 3.40 cusr 0.45 csys = 3.92 CPU)
Result: PASS
Updated by kraih over 3 years ago
Ok, I think I can now replicate locally what is happening on Circle CI.
$ time TEST_PG='DBI:Pg:dbname=openqa_test;host=/dev/shm/tpg' HEAVY=1 DEVEL_COVER_DB_FORMAT=JSON perl -Ilib -MDevel::Cover=-select_re,'^/lib',+ignore_re,lib/perlcritic/Perl/Critic/Policy,-coverage,statement,-db,cover_db t/10-jobs.t
...
real 2m12.795s
user 2m5.298s
sys 0m5.640s
I will do a full profiling with NYTProf to make sure, but I'm afraid so far it does look like the overhead is simply Devel::Cover processing its data and writing to the database in all the spawned Minion jobs. And if I disable the Devel::Cover::report() call for the Minion jobs the numbers drop sharply.
real 0m13.368s
user 0m11.653s
sys 0m0.894s
Perhaps there are ways to speed up Devel::Cover.
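For reference, that experiment can be reproduced crudely with a throwaway patch like the following, applied in the main test process before any Minion jobs run. This is purely an illustration of the idea (the variable names are made up); it simply turns the report() call mentioned above into a no-op in forked children:

# Remember the main test process and make Devel::Cover::report() a no-op
# in any forked child, so only the main process writes coverage data.
my $main_pid = $$;
if (Devel::Cover->can('report')) {
    no warnings 'redefine';
    my $original_report = \&Devel::Cover::report;
    *Devel::Cover::report = sub { $original_report->(@_) if $$ == $main_pid };
}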
Updated by kraih over 3 years ago
So, there are 10 forked processes; most look like this:
Stmts Exclusive Time Reports Source File
8813074 10.6s line B/Deparse.pm
3206105 3.54s line Devel/Cover.pm (including 3 string evals)
4570491 3.25s line B.pm
302536 530ms line Data/Dumper.pm
178862 269ms line Devel/Cover/DB/Structure.pm
209533 178ms line Devel/Cover/DB.pm
318697 129ms line Devel/Cover/Criterion.pm
90654 81.5ms line Devel/Cover/Statement.pm
10190 41.5ms line Devel/Cover/DB/IO/Base.pm
7197 39.6ms line Devel/Cover/DB/IO/JSON.pm
35925 33.9ms line warnings.pm
3689 22.4ms line Devel/Cover/DB/IO.pm (including 2 string evals)
29650 18.3ms line Devel/Cover/DB/File.pm
5830 9.64ms line DBIx/Class/Storage/DBIHacks.pm
619 7.74ms line DBIx/Class/Storage/DBI.pm
2635 6.41ms line JSON/MaybeXS.pm
189 5.32ms line Mojo/Pg/Database.pm
104 4.93ms line DBD/Pg.pm
2606 4.67ms line DBIx/Class/ResultSet.pm
2478 4.28ms line SQL/Abstract/Classic.pm
1002 2.67ms line DBIx/Class/ResultSource.pm
1245 1.88ms line DBIx/Class/SQLMaker/ClassicExtensions.pm
363 1.55ms line DBI.pm (including 2 string evals)
And some look like this (I assume they have a no_cover setting active):
Stmts Exclusive Time Reports Source File
155 989µs line IPC/Run.pm (including 1 string eval)
84 462µs line Devel/Cover.pm (including 3 string evals)
96 209µs line IPC/Run/Debug.pm (including 1 string eval)
0 0s line YAML/XS/LibYAML.pm
0 0s line YAML/XS.pm
0 0s line YAML/PP/Writer/File.pm
And then the main test process:
Stmts Exclusive Time Reports Source File
94 130s line Minion/Job.pm
8811266 10.2s line B/Deparse.pm
4313728 4.77s line Devel/Cover.pm (including 3 string evals)
4745489 3.33s line B.pm
3079 882ms line IPC/Run.pm (including 1 string eval)
216646 551ms line DBIx/Class/Storage/DBI.pm
473804 510ms line Archive/Extract.pm
289556 499ms line SQL/SplitStatement.pm
302586 493ms line Data/Dumper.pm
335375 452ms line SQL/Abstract/Classic.pm
321298 404ms line DBIx/Class/ResultSet.pm
186789 295ms line Devel/Cover/DB/Structure.pm
172707 294ms line Class/Accessor/Grouped.pm (including 182 string evals)
188547 286ms line DBIx/Class/ResultSource.pm
233287 256ms line DBIx/Class/Storage/DBIHacks.pm
149411 202ms line DBIx/Class/Row.pm
164689 187ms line Try/Tiny.pm
209658 165ms line Devel/Cover/DB.pm
109961 155ms line DateTime.pm
And looking at the actual lines, there's also nothing too unexpected:
Calls P F Exclusive Time Inclusive Time Subroutine
6 1 1 130s 130s Minion::Job::CORE:waitpid (opcode)
1077415 29 2 1.61s 2.15s B::class
95217 92 2 1.39s 12.4s Devel::Cover::deparse (recurses: max depth 22, inclusive time 30.2s)
2410 2 1 1.24s 2.21s B::Deparse::populate_curcvlex
64472 3 1 849ms 1.41s B::Deparse::_pessimise_walk (recurses: max depth 31, inclusive time 7.30s)
12 1 1 840ms 840ms POSIX::read (xsub)
Updated by kraih over 3 years ago
We are using $minion->perform_jobs to run all jobs currently in the queue sequentially. That could also be done in parallel with a custom worker. On my local dev machine that did cut the runtime in half, but of course it depends very much on the available CPUs, and might therefore also be very inconsistent on Circle CI.
real 1m0.666s
user 1m47.174s
sys 0m7.287s
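For reference, a custom worker along those lines could look roughly like the sketch below, built on Minion's documented custom-worker methods (Minion::Worker::dequeue, Minion::Job::start and Minion::Job::is_finished); the helper name, concurrency handling and polling interval are illustrative, not the actual hack:

use Mojo::Base -strict;
use Time::HiRes 'sleep';

# Process all currently queued jobs with up to $limit concurrent child
# processes, returning once the queue is drained and all children are reaped.
sub perform_jobs_in_parallel {
    my ($minion, $limit) = @_;
    my $worker = $minion->worker->register;
    my @running;
    while (1) {
        @running = grep { !$_->is_finished } @running;
        my $job = @running < $limit ? $worker->dequeue(0) : undef;
        if ($job) {
            $job->start;           # fork the job, but do not wait for it
            push @running, $job;
            next;
        }
        last unless @running;      # queue empty and no children left
        sleep 0.025;               # give running jobs time to finish
    }
    $worker->unregister;
}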
Updated by okurz over 3 years ago
Ok, I think I understand now how this could become a regression: over time we moved more functionality to Minion jobs. We could try to not collect coverage information from Minion subprocesses at all and see whether we actually lose significant coverage. Or we could circumvent the Minion subprocesses completely and execute the corresponding code segments directly within the main test process. Theoretically this might be slower, but it makes the tests deterministic and hence easier to debug, and it does not trigger the coverage collection in other processes, which is what causes the slowdown.
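A minimal sketch of that second option, using Minion's documented custom-worker API where Minion::Job::execute runs the task in the current process instead of forking ($app here stands for the test's application instance):

# Drain the queue without forking, so only the main test process
# accumulates and reports Devel::Cover data.
my $worker = $app->minion->worker->register;
while (my $job = $worker->dequeue(0)) {
    my $err = $job->execute;       # run the task in this very process
    defined $err ? $job->fail($err) : $job->finish;
}
$worker->unregister;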
Updated by kraih over 3 years ago
I ran a bunch of Circle CI tests with my hacked-up support for parallel Minion job processing, and unfortunately the results ended up fairly underwhelming. The first result in each of these three runs is always the one with parallel jobs, and the second without.
[13:03:31] t/10-jobs.t ........... ok 153668 ms ( 0.08 usr 0.00 sys + 250.87 cusr 5.38 csys = 256.33 CPU)
[13:16:46] t/10-jobs.t ........... ok 173135 ms ( 0.07 usr 0.00 sys + 166.85 cusr 4.42 csys = 171.34 CPU)
[14:01:59] t/10-jobs.t ........... ok 151535 ms ( 0.08 usr 0.00 sys + 241.61 cusr 5.61 csys = 247.30 CPU)
[14:15:16] t/10-jobs.t ........... ok 162337 ms ( 0.08 usr 0.00 sys + 156.56 cusr 4.60 csys = 161.24 CPU)
[14:32:42] t/10-jobs.t ........... ok 143575 ms ( 0.06 usr 0.00 sys + 231.69 cusr 5.24 csys = 236.99 CPU)
[15:02:31] t/10-jobs.t ........... ok 165410 ms ( 0.07 usr 0.00 sys + 160.33 cusr 4.07 csys = 164.47 CPU)
For the three runs above the parallel job limit was 10. I also did another run with 4 parallel jobs, and the results improved a little bit again (so there's probably a sweet spot where we use Circle CI resources most efficiently).
[16:03:02] t/10-jobs.t ........... ok 120680 ms ( 0.07 usr 0.01 sys + 188.07 cusr 4.51 csys = 192.66 CPU)
My hack to make it work is not really production quality, but I can offer to make it a proper upstream feature in Minion ($t->app->minion->perform_jobs({jobs => 4}) or so). Not sure it's really worth it though.
Updated by mkittler over 3 years ago
A cheap workaround is already present in t/42-df-based-cleanup.t, but I know you don't like giving up the forking.
Not sure whether parallelization would gain us much considering that the test likely just spawns one job at a time and then calls perform_minion_jobs.
Perhaps there are ways to speed up Devel::Cover.
That would of course be very good because it would also help with many other tests slowed down by this (also within os-autoinst).
Updated by kraih over 3 years ago
mkittler wrote:
A cheap workaround is already present in t/42-df-based-cleanup.t, but I know you don't like giving up the forking.
Not sure why I didn't think of just doing that. :) That is a very good idea and I'll probably add a cleaner implementation of it to OpenQA::Test::Utils.
Updated by kraih over 3 years ago
That does look like a success on Circle CI.
[15:39:48] t/10-jobs.t ........... ok 19263 ms ( 0.07 usr 0.01 sys + 18.16 cusr 0.66 csys = 18.90 CPU)
Updated by kraih over 3 years ago
Opened a PR. https://github.com/os-autoinst/openQA/pull/3934
Updated by openqa_review over 3 years ago
- Due date set to 2021-06-23
Setting due date based on mean cycle time of SUSE QE Tools
Updated by kraih over 3 years ago
- Status changed from In Progress to Resolved
PR has been merged. The feature will also be available upstream from Minion soon. https://github.com/mojolicious/minion/compare/de1057fa4607...b8bb86e119c2
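Assuming the new upstream method keeps the perform_jobs_in_foreground name mentioned in the follow-up ticket, switching a test over should be a one-line change along these lines:

# Run all queued Minion jobs in the current process, so Devel::Cover only
# collects and reports coverage data once, in the main test process.
$t->app->minion->perform_jobs_in_foreground;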
Updated by okurz over 3 years ago
- Copied to coordination #93883: [epic] Speedup openQA coverage tests with running minion jobs synchronously using new upstream "perform_jobs_in_foreground" mojo function added