action #36727
job_grab does not cope with parallel cycles (closed)
Description
Apparently the scheduler is deadlocking at some point when the amount of jobs is too big (quick hypothesis).
ps aux shows:
postgres 15941 90.5 1.2 282272 208904 ? Rs 11:07 29:55 postgres: geekotest openqa [local] BIND
pid | 15941
age | -00:00:05.616602
usename | geekotest
query | SELECT COUNT( * ) FROM job_dependencies me JOIN jobs parent ON parent.id = me.parent_job_id WHERE ( ( child_job_id IN ( SELECT me.parent_job_id FROM job_dependencies me JOIN jobs parent ON parent.id = me.parent_job_id WHERE ( ( child_job_id IN ( SELECT me.parent_job_id FROM job_dependencies me JOIN jobs parent ON parent.id = me.parent_job_id WHERE ( ( child_job_id IN ( SELECT me.parent_job_id FROM job_dependencies me JOIN jobs parent ON parent.id = me.parent_job_id WHERE ( ( child_job_id IN ( SELECT me.parent_job_id FROM job_dependencies me JOIN jobs parent ON parent.id = me.parent_job_id WHERE ( ( child_job_id IN ( SELECT me.parent_job_id FROM job_dependencies me JOIN jobs parent ON parent.id = me.parent_job_id WHERE ( ( child_job_id IN ( SELECT me.parent_job_id FROM job_dependencies me JOIN jobs parent ON parent.id = me.parent_job_id WHERE ( ( child_job_id IN ( SELECT me.parent_job_id FROM job_dependencies me JOIN jobs parent ON parent.id = me.parent_job_id WHERE ( ( child_job_id IN ( SELECT me.par
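The truncated query above is the scheduler expanding parent dependencies through deeply nested "child_job_id IN (SELECT parent_job_id ...)" subselects. A minimal sketch (hypothetical helper names, not openQA code) of why a mutual PARALLEL_WITH pair makes that expansion never bottom out:

```python
# Hypothetical sketch: naive recursive expansion of parent jobs, mirroring
# the nested subselects in the query above. deps is a list of
# (child_job_id, parent_job_id) pairs like the job_dependencies table.
def parents_of(deps, job_id):
    """Direct parents of job_id."""
    return {p for (c, p) in deps if c == job_id}

def all_ancestors_naive(deps, job_id, depth=0, max_depth=10):
    """Expands parents recursively with NO visited set, like the SQL above.
    With a dependency cycle it only stops because of the depth guard."""
    if depth >= max_depth:
        raise RecursionError("dependency cycle: expansion never terminates")
    result = set()
    for p in parents_of(deps, job_id):
        result.add(p)
        result |= all_ancestors_naive(deps, p, depth + 1, max_depth)
    return result

# Two jobs declaring PARALLEL_WITH each other -> a 2-cycle:
deps = [(1, 2), (2, 1)]
try:
    all_ancestors_naive(deps, 1)
except RecursionError as e:
    print("expansion aborted:", e)
```

With an acyclic graph the recursion terminates normally; with the 2-cycle each level re-adds the other job, which is what the endlessly nested subselect corresponds to on the database side.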
Files
Updated by szarate over 6 years ago
- File scheduler-locking.txt scheduler-locking.txt added
killing the PID worked: SELECT pg_terminate_backend(15941);
but dbi is barfing:
Jun 04 11:42:14 openqa openqa-scheduler[15912]: rollback ineffective with AutoCommit enabled at /usr/lib/perl5/vendor_perl/5.18.2/DBIx/Class/Storage/DBI.pm line 1651.
in the attached log file
d', 207='2', 208='scheduled', 209='2', 210='scheduled', 211='2', 212='scheduled', 213='2', 214='scheduled', 215='2', 216='1741516', 217='1741515', 218='scheduled', 219='WORKER_CLASS', 220='qemu_aarch64', 221='tap', 222='1', 223='done', 224='cancelled', 225='2', 226='1741516', 227='1741515', 228='scheduled', 229='scheduled'] at /usr/share/openqa/script/../lib/OpenQA/Scheduler/Scheduler.pm line 155
Note we're still running: 3f6bb61a
Updated by szarate over 6 years ago
After restarting the scheduler:
Jun 04 12:09:15 openqa openqa-scheduler[9278]: Can't open file "/var/log/openqa_scheduler": Permission denied at
Jun 04 12:09:15 openqa openqa-scheduler[9278]: /usr/share/openqa/script/../lib/OpenQA/Setup.pm line 71 (#1)
Jun 04 12:09:15 openqa openqa-scheduler[9278]: (S inplace) The implicit opening of a file through use of the <>
Jun 04 12:09:15 openqa openqa-scheduler[9278]: filehandle, either implicitly under the -n or -p command-line
Jun 04 12:09:15 openqa openqa-scheduler[9278]: switches, or explicitly, failed for the indicated reason. Usually
Jun 04 12:09:15 openqa openqa-scheduler[9278]: this is because you don't have read permission for a file which
Jun 04 12:09:15 openqa openqa-scheduler[9278]: you named on the command line.
Jun 04 12:09:15 openqa openqa-scheduler[9278]:
Jun 04 12:09:15 openqa openqa-scheduler[9278]: (F) You tried to call perl with the -e switch, but /dev/null (or
Jun 04 12:09:15 openqa openqa-scheduler[9278]: your operating system's equivalent) could not be opened.
Jun 04 12:09:15 openqa openqa-scheduler[9278]:
Jun 04 12:09:15 openqa openqa-scheduler[9278]: Uncaught exception from user code:
Jun 04 12:09:15 openqa openqa-scheduler[9278]: Can't open file "/var/log/openqa_scheduler": Permission denied at /usr/share/openqa/script/../lib/OpenQA/Setup.pm line 71.
Jun 04 12:09:15 openqa openqa-scheduler[9278]: Mojo::File::open('Mojo::File=SCALAR(0x4a1f9e8)', '>>') called at /usr/share/openqa/script/../lib/OpenQA/Setup.pm line 71
Jun 04 12:09:15 openqa openqa-scheduler[9278]: OpenQA::Setup::setup_log('OpenQA::Setup=HASH(0x13d9e80)') called at /usr/share/openqa/script/../lib/OpenQA/Scheduler.pm line 81
Jun 04 12:09:15 openqa openqa-scheduler[9278]: OpenQA::Scheduler::run() called at /usr/share/openqa/script/openqa-scheduler line 38
Jun 04 12:09:15 openqa systemd[1]: openqa-scheduler.service: Main process exited, code=exited, status=13/n/a
Updated by szarate over 6 years ago
- Subject changed from openQA scheduler hangs to job_grab seems to hang with over 1.2K jobs
We've had over 1.2K scheduled jobs before and it didn't die... but it looks like this time it was unavoidable.
Updated by EDiGiacinto over 6 years ago
Idea by coolo:
12:45 <coolo> mudler: I want a blocked_by column in the jobs table that also shows the job differently in the scheduled table
12:45 <coolo> mudler: and then the scheduler can ignore those - this should limit this subselect query a lot :)
That could work, and it can likely be tested experimentally without many code changes. We need to check whether, with that approach, we still see performance degradation in job_grab's generated query with more than 1.2K scheduled jobs (and, for a good test, we also need to schedule jobs with chained and parallel parents, which makes the ORM query heavier).
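A minimal sketch of coolo's blocked_by idea (names are hypothetical, not the actual openQA schema change): each job records one unfinished parent that blocks it, and the scheduler only considers jobs whose blocked_by is unset, replacing the deep subselect with a cheap filter:

```python
# Hypothetical sketch: precompute one blocking parent per job so the
# scheduler can filter with a cheap "blocked_by IS NULL"-style test.
def compute_blocked_by(jobs, deps):
    """jobs: {job_id: state}; deps: [(child, parent)].
    Returns {job_id: blocking_parent_or_None}."""
    blocked_by = {}
    for job_id in jobs:
        blocked_by[job_id] = None
        for child, parent in deps:
            if child == job_id and jobs.get(parent) != 'done':
                blocked_by[job_id] = parent  # first unfinished parent wins
                break
    return blocked_by

def grabbable(jobs, deps):
    """Jobs the scheduler may hand to workers: scheduled and unblocked."""
    blocked = compute_blocked_by(jobs, deps)
    return sorted(j for j, s in jobs.items()
                  if s == 'scheduled' and blocked[j] is None)

jobs = {1: 'done', 2: 'scheduled', 3: 'scheduled'}
deps = [(3, 2)]               # job 3 depends on job 2
print(grabbable(jobs, deps))  # -> [2]; job 3 waits until 2 is done
```

The point of the design is that the expensive dependency walk happens once, when states change, instead of inside every job_grab query.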
Updated by szarate over 6 years ago
- Priority changed from Immediate to Urgent
- Target version set to Ready
So let's stop the workers then during the weekend, and start them after the database dump is created. (Or we can use the one from yesterday)
Updated by coolo over 6 years ago
- Priority changed from Urgent to Immediate
- Target version changed from Ready to Current Sprint
Damn, there is no way around fixing it - whatever caused it, we have this now whenever there is a lot of load :(
Updated by coolo over 6 years ago
And I had to cancel the one job and restart the scheduler to have it rolling again.
Updated by coolo over 6 years ago
https://openqa.suse.de/tests/1743900 in this case
Updated by coolo over 6 years ago
Next stop https://openqa.suse.de/tests/1743954 - looks like support servers of clusters are a problem :(
Updated by coolo over 6 years ago
- Category changed from 168 to 122
- Assignee set to coolo
It also can fail with simple Server-Client clusters - https://openqa.suse.de/tests/1744194#settings
Updated by coolo over 6 years ago
- Subject changed from job_grab seems to hang with over 1.2K jobs to job_grab hangs on multimachine jobs
The query created is https://pastebin.com/r0zbhCrn - there is something majorly f****ed
Updated by szarate over 6 years ago
- Related to action #32545: Provide warning when user creates/triggers misconfigured multi-machine clusters added
Updated by coolo over 6 years ago
- Subject changed from job_grab hangs on multimachine jobs to job_grab does not cope with parallel cycles
- Priority changed from Immediate to Normal
After a lot of debugging we found the underlying problem: two test suites each added PARALLEL_WITH pointing at the other, which resulted in a dependency cycle. We were simply not aware this was possible, so we didn't look into it earlier and hunted in many other directions first :(
The test suites are fixed.
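Since the root cause was a dependency cycle, a standard way to reject such configurations up front is a depth-first cycle check over the dependency graph (a generic sketch, not the code openQA later adopted):

```python
# Generic DFS cycle detection over (child, parent) dependency edges.
def has_cycle(deps):
    """deps: [(child, parent)]. Returns True if the directed dependency
    graph contains a cycle (e.g. two suites PARALLEL_WITH each other)."""
    graph = {}
    for child, parent in deps:
        graph.setdefault(child, []).append(parent)
    WHITE, GRAY, BLACK = 0, 1, 2   # unvisited / on current path / finished
    color = {}

    def visit(node):
        color[node] = GRAY
        for nxt in graph.get(node, []):
            c = color.get(nxt, WHITE)
            if c == GRAY:          # back edge onto the current path -> cycle
                return True
            if c == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color.get(n, WHITE) == WHITE and visit(n) for n in graph)

print(has_cycle([(1, 2), (2, 1)]))  # -> True: mutual PARALLEL_WITH
print(has_cycle([(1, 2), (2, 3)]))  # -> False: a plain chain
```

Running such a check at job-creation time would turn the silent scheduler hang into an immediate, explainable rejection of the misconfigured cluster.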
Updated by EDiGiacinto over 6 years ago
- Related to coordination #32851: [tools][EPIC] Scheduling redesign added
Updated by szarate over 6 years ago
Started adding a new fixture (https://github.com/foursixnine/openQA/commits/recursioniscalled_); let's see how it goes.
Updated by szarate over 6 years ago
- Status changed from New to In Progress
So working with the test fixture is not a problem; catching the real problem is...
I tried to catch the cycles in job_create_dependencies, but had no luck so far, as the jobs are already created by then, making the tests fail horribly. I switched to looking at the create_from_settings method, but having a key=value table doesn't make it easy.
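One way to catch the mutual reference before any jobs exist is to validate the flat key=value settings directly (a hypothetical helper, not the actual create_from_settings hook; openQA's PARALLEL_WITH values are comma-separated suite names):

```python
# Hypothetical pre-creation check on test suite settings: find pairs of
# suites that each list the other in PARALLEL_WITH, i.e. a 2-cycle.
def mutual_parallel_pairs(suites):
    """suites: {suite_name: {key: value}} with comma-separated
    PARALLEL_WITH values. Returns sorted mutual pairs."""
    wants = {
        name: set(filter(None, settings.get('PARALLEL_WITH', '').split(',')))
        for name, settings in suites.items()
    }
    pairs = []
    for a, targets in wants.items():
        for b in targets:
            if a in wants.get(b, set()) and a < b:  # report each pair once
                pairs.append((a, b))
    return pairs

suites = {
    'hpc_master': {'PARALLEL_WITH': 'hpc_slave'},
    'hpc_slave':  {'PARALLEL_WITH': 'hpc_master'},
    'textmode':   {},
}
print(mutual_parallel_pairs(suites))  # -> [('hpc_master', 'hpc_slave')]
```

This only catches direct 2-cycles; longer cycles would still need a full graph walk, but it covers the exact misconfiguration that triggered this ticket.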
Updated by coolo over 6 years ago
Updated by coolo about 6 years ago
- Target version changed from Current Sprint to Done