action #93086 (closed)

unstable test in openQA master t/10-jobs.t exceeding runtime of 280s

Added by okurz about 4 years ago. Updated almost 4 years ago.

Status: Resolved
Priority: High
Assignee: kraih
Category: Regressions/Crashes
Target version:
Start date: 2021-05-25
Due date: 2021-06-23
% Done: 0%
Estimated time:

Description

Observation

https://app.circleci.com/pipelines/github/os-autoinst/openQA/6526/workflows/3caccede-e153-4757-b795-d97046d58228/jobs/61350

shows

RETRY=0 timeout -s SIGINT -k 5 -v $((25 * (0 + 1) ))m tools/retry prove -l --harness TAP::Harness::JUnit --timer t/10-jobs.t t/14-grutasks.t t/32-openqa_client.t t/44-scripts.t 
[11:40:36] t/10-jobs.t ........... 22/? Bailout called.  Further testing stopped:  test exceeds runtime limit of '280' seconds

Acceptance criteria

  • AC1: t/10-jobs.t takes a runtime significantly below 280s in CircleCI (with coverage analysis enabled)

Suggestions

  • Run t/10-jobs.t locally with and without coverage analysis and check the runtime for regressions and optimization hotspots, e.g. as sketched below
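
A minimal sketch of such a comparison (the actual openQA coverage run needs a test PostgreSQL database via TEST_PG and passes additional Devel::Cover options, see the commands further down in the comments):

# without coverage
prove -l --timer t/10-jobs.t
# with coverage analysis via Devel::Cover
HARNESS_PERL_SWITCHES=-MDevel::Cover prove -l --timer t/10-jobs.t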

Related issues: 1 (0 open, 1 closed)

  • Copied to openQA Project (public) - coordination #93883: [epic] Speedup openQA coverage tests with running minion jobs synchronously using new upstream "perform_jobs_in_foreground" mojo function (Resolved, mkittler, 2021-07-07)

Actions #1

Updated by kraih about 4 years ago

  • Assignee set to kraih

I'll run a profiler against the test to see what's going on.

Actions #2

Updated by kraih about 4 years ago

There might actually be a bug somewhere, because the test is not heavy at all. Locally it takes 4 seconds to finish.

All tests successful.
Files=1, Tests=41,  4 wallclock secs ( 0.06 usr  0.01 sys +  3.40 cusr  0.45 csys =  3.92 CPU)
Result: PASS
Actions #3

Updated by mkittler about 4 years ago

Did you test with coverage enabled?

Actions #4

Updated by kraih about 4 years ago

OK, I think I can now replicate locally what is happening on Circle CI.

$ time TEST_PG='DBI:Pg:dbname=openqa_test;host=/dev/shm/tpg' HEAVY=1 DEVEL_COVER_DB_FORMAT=JSON perl -Ilib -MDevel::Cover=-select_re,'^/lib',+ignore_re,lib/perlcritic/Perl/Critic/Policy,-coverage,statement,-db,cover_db t/10-jobs.t
...
real    2m12.795s
user    2m5.298s
sys     0m5.640s

I will do a full profiling with NYTProf to make sure, but I'm afraid so far it does look like the overhead is simply Devel::Cover processing its data and writing it to the database in all the spawned Minion jobs. And if I disable the Devel::Cover::report() call for the Minion jobs, the number drops sharply.

real    0m13.368s
user    0m11.653s
sys     0m0.894s

Perhaps there are ways to speed up Devel::Cover.
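
A rough sketch of such an NYTProf run (assuming Devel::NYTProf is installed and using the same environment as the coverage command above):

TEST_PG='DBI:Pg:dbname=openqa_test;host=/dev/shm/tpg' HEAVY=1 \
  perl -d:NYTProf -Ilib t/10-jobs.t
nytprofhtml   # turns the generated nytprof.out into an HTML report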

Actions #5

Updated by kraih about 4 years ago

So, there are 10 forked processes; most look like this:

Stmts   Exclusive Time Reports Source File
8813074	10.6s	line	B/Deparse.pm
3206105	3.54s	line	Devel/Cover.pm (including 3 string evals)
4570491	3.25s	line	B.pm
302536	530ms	line	Data/Dumper.pm
178862	269ms	line	Devel/Cover/DB/Structure.pm
209533	178ms	line	Devel/Cover/DB.pm
318697	129ms	line	Devel/Cover/Criterion.pm
90654	81.5ms	line	Devel/Cover/Statement.pm
10190	41.5ms	line	Devel/Cover/DB/IO/Base.pm
7197	39.6ms	line	Devel/Cover/DB/IO/JSON.pm
35925	33.9ms	line	warnings.pm
3689	22.4ms	line	Devel/Cover/DB/IO.pm (including 2 string evals)
29650	18.3ms	line	Devel/Cover/DB/File.pm
5830	9.64ms	line	DBIx/Class/Storage/DBIHacks.pm
619	7.74ms	line	DBIx/Class/Storage/DBI.pm
2635	6.41ms	line	JSON/MaybeXS.pm
189	5.32ms	line	Mojo/Pg/Database.pm
104	4.93ms	line	DBD/Pg.pm
2606	4.67ms	line	DBIx/Class/ResultSet.pm
2478	4.28ms	line	SQL/Abstract/Classic.pm
1002	2.67ms	line	DBIx/Class/ResultSource.pm
1245	1.88ms	line	DBIx/Class/SQLMaker/ClassicExtensions.pm
363	1.55ms	line	DBI.pm (including 2 string evals)

And some look like this (I assume they have a no_cover setting active):

Stmts   Exclusive Time Reports Source File
155	989µs	line	IPC/Run.pm (including 1 string eval)
84	462µs	line	Devel/Cover.pm (including 3 string evals)
96	209µs	line	IPC/Run/Debug.pm (including 1 string eval)
0	0s	line	YAML/XS/LibYAML.pm
0	0s	line	YAML/XS.pm
0	0s	line	YAML/PP/Writer/File.pm

And then the main test process:

Stmts   Exclusive Time Reports Source File
94	130s	line	Minion/Job.pm
8811266	10.2s	line	B/Deparse.pm
4313728	4.77s	line	Devel/Cover.pm (including 3 string evals)
4745489	3.33s	line	B.pm
3079	882ms	line	IPC/Run.pm (including 1 string eval)
216646	551ms	line	DBIx/Class/Storage/DBI.pm
473804	510ms	line	Archive/Extract.pm
289556	499ms	line	SQL/SplitStatement.pm
302586	493ms	line	Data/Dumper.pm
335375	452ms	line	SQL/Abstract/Classic.pm
321298	404ms	line	DBIx/Class/ResultSet.pm
186789	295ms	line	Devel/Cover/DB/Structure.pm
172707	294ms	line	Class/Accessor/Grouped.pm (including 182 string evals)
188547	286ms	line	DBIx/Class/ResultSource.pm
233287	256ms	line	DBIx/Class/Storage/DBIHacks.pm
149411	202ms	line	DBIx/Class/Row.pm
164689	187ms	line	Try/Tiny.pm
209658	165ms	line	Devel/Cover/DB.pm
109961	155ms	line	DateTime.pm

And looking at the actual lines, there's also nothing too unexpected:

Calls 	P 	F 	Exclusive Time 	Inclusive Time 	Subroutine
6	1	1	130s	130s	     Minion::Job::CORE:waitpid (opcode)
1077415	29	2	1.61s	2.15s	               B::class
95217	92	2	1.39s	12.4s	    Devel::Cover::deparse (recurses: max depth 22, inclusive time 30.2s)
2410	2	1	1.24s	2.21s	      B::Deparse::populate_curcvlex
64472	3	1	849ms	1.41s	      B::Deparse::_pessimise_walk (recurses: max depth 31, inclusive time 7.30s)
12	1	1	840ms	840ms	           POSIX::read (xsub)
Actions #6

Updated by kraih about 4 years ago

We are using $minion->perform_jobs to run all jobs currently in the queue sequentially. That could also be done in parallel with a custom worker. On my local dev machine that did cut runtime in half, but of course it depends very much on the available CPU, and might therefore also be very inconsistent on Circle CI.

real    1m0.666s
user    1m47.174s
sys     0m7.287s
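
For illustration, such a custom worker loop could look roughly like this (a sketch based on the documented Minion::Job start()/is_finished() API, not the actual hack used for these numbers):

use Mojo::Base -strict;

# Rough sketch: perform all currently queued Minion jobs with up to $limit
# forked job processes running at the same time.
sub perform_jobs_in_parallel {
    my ($minion, $limit) = @_;
    my $worker = $minion->worker->register;
    my @running;
    while (1) {
        # forget jobs whose forked process has already finished
        @running = grep { !$_->is_finished } @running;
        # only dequeue more work while we are below the limit
        my $job = @running < $limit ? $worker->dequeue(0.5) : undef;
        if    ($job)      { push @running, $job->start }    # fork and continue
        elsif (!@running) { last }                           # queue drained, all done
        else              { sleep 1 }                        # wait for running jobs
    }
    $worker->unregister;
}

In the test it would then be called in place of $minion->perform_jobs, e.g. perform_jobs_in_parallel($t->app->minion, 4).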
Actions #7

Updated by okurz about 4 years ago

OK, I think I understand then how this could be a regression: over time we moved more functionality into minion jobs. We could try not collecting coverage information from minion subprocesses at all and see whether we actually lose significant coverage. Or we could circumvent the separate minion processes completely and execute the corresponding code directly within the main test process. Theoretically this might be slower, but it makes the tests deterministic (hence easier to debug) and does not trigger the coverage collection in other processes, which is what causes the slowdown.

Actions #8

Updated by kraih about 4 years ago

I ran a bunch of Circle CI tests with my hacked-up support for parallel Minion job processing, and unfortunately the results ended up fairly underwhelming. In each of the three runs below, the first result is the one with parallel jobs and the second one without.

[13:03:31] t/10-jobs.t ........... ok   153668 ms ( 0.08 usr  0.00 sys + 250.87 cusr  5.38 csys = 256.33 CPU)
[13:16:46] t/10-jobs.t ........... ok   173135 ms ( 0.07 usr  0.00 sys + 166.85 cusr  4.42 csys = 171.34 CPU)

[14:01:59] t/10-jobs.t ........... ok   151535 ms ( 0.08 usr  0.00 sys + 241.61 cusr  5.61 csys = 247.30 CPU)
[14:15:16] t/10-jobs.t ........... ok   162337 ms ( 0.08 usr  0.00 sys + 156.56 cusr  4.60 csys = 161.24 CPU)

[14:32:42] t/10-jobs.t ........... ok   143575 ms ( 0.06 usr  0.00 sys + 231.69 cusr  5.24 csys = 236.99 CPU)
[15:02:31] t/10-jobs.t ........... ok   165410 ms ( 0.07 usr  0.00 sys + 160.33 cusr  4.07 csys = 164.47 CPU)

For the three runs above the parallel job limit was 10. I also did another run with 4 parallel jobs, and the results improved a little again (so there's probably a sweet spot where we use Circle CI resources most efficiently).

[16:03:02] t/10-jobs.t ........... ok   120680 ms ( 0.07 usr  0.01 sys + 188.07 cusr  4.51 csys = 192.66 CPU)

My hack to make it work is not really production quality, but I can offer to make it a proper upstream feature in Minion ($t->app->minion->perform_jobs({jobs => 4}) or so). Not sure it's really worth it, though.

Actions #9

Updated by mkittler about 4 years ago

A cheap workaround is already present in t/42-df-based-cleanup.t but I know you don't like giving up the forking.


Not sure whether parallelization would gain us much considering that the test likely just spawns one job at a time and then calls perform_minion_jobs.


kraih wrote:

Perhaps there are ways to speed up Devel::Cover.

That would of course be very good because it would also help with many other tests slowed down by this (also within os-autoinst).

Actions #10

Updated by kraih almost 4 years ago

mkittler wrote:

A cheap workaround is already present in t/42-df-based-cleanup.t but I know you don't like giving up the forking.

Not sure why I didn't think of just doing that. :) That is a very good idea and I'll probably add a cleaner implementation of it to OpenQA::Test::Utils.
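
A hypothetical sketch of such a helper (not the actual OpenQA::Test::Utils code): dequeue the queued jobs with a worker but run them via Minion::Job::execute() in the current process, so no coverage data has to be collected and written in forked children.

# Sketch only: perform all queued Minion jobs inside the current process.
sub perform_minion_jobs_in_process {
    my ($minion) = @_;
    my $worker = $minion->worker->register;
    while (my $job = $worker->dequeue(0)) {
        # execute() runs the task right here instead of forking like perform()
        if (my $err = $job->execute) { $job->fail($err) }
        else                         { $job->finish }
    }
    $worker->unregister;
}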

Actions #11

Updated by kraih almost 4 years ago

That does look like a success on Circle CI.

[15:39:48] t/10-jobs.t ........... ok    19263 ms ( 0.07 usr  0.01 sys + 18.16 cusr  0.66 csys = 18.90 CPU)
Actions #13

Updated by kraih almost 4 years ago

  • Status changed from Workable to In Progress
Actions #14

Updated by openqa_review almost 4 years ago

  • Due date set to 2021-06-23

Setting due date based on mean cycle time of SUSE QE Tools

Actions #15

Updated by kraih almost 4 years ago

  • Status changed from In Progress to Resolved

PR has been merged. The feature will also be available upstream from Minion soon. https://github.com/mojolicious/minion/compare/de1057fa4607...b8bb86e119c2
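
Once that upstream release is out, the tests should be able to replace the forking perform_jobs call with the new method referenced in the follow-up epic, roughly:

# run all queued Minion jobs in the current (test) process, without forking
$t->app->minion->perform_jobs_in_foreground;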

Actions #16

Updated by okurz almost 4 years ago

  • Copied to coordination #93883: [epic] Speedup openQA coverage tests with running minion jobs synchronously using new upstream "perform_jobs_in_foreground" mojo function added