Project

General

Profile

Actions

action #36727

closed

job_grab does not cope with parallel cycles

Added by szarate over 6 years ago. Updated about 6 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Feature requests
Target version:
Start date:
2018-06-04
Due date:
% Done:

0%

Estimated time:

Description

Apparently the scheduler is deadlocking at some point when the ammount of jobs is too big (quick hipotesys)

px aux shows:

postgres 15941 90.5 1.2 282272 208904 ? Rs 11:07 29:55 postgres: geekotest openqa [local] BIND

pid     | 15941
age     | -00:00:05.616602
usename | geekotest
query   | SELECT COUNT( * ) FROM job_dependencies me  JOIN jobs parent ON parent.id = me.parent_job_id WHERE ( ( child_job_id IN ( SELECT me.parent_job_id FROM job_dependencies me  JOIN jobs parent ON parent.id = me.parent_job_id WHERE ( ( child_job_id IN ( SELECT me.parent_job_id FROM job_dependencies me  JOIN jobs parent ON parent.id = me.parent_job_id WHERE ( ( child_job_id IN ( SELECT me.parent_job_id FROM job_dependencies me  JOIN jobs parent ON parent.id = me.parent_job_id WHERE ( ( child_job_id IN ( SELECT me.parent_job_id FROM job_dependencies me  JOIN jobs parent ON parent.id = me.parent_job_id WHERE ( ( child_job_id IN ( SELECT me.parent_job_id FROM job_dependencies me  JOIN jobs parent ON parent.id = me.parent_job_id WHERE ( ( child_job_id IN ( SELECT me.parent_job_id FROM job_dependencies me  JOIN jobs parent ON parent.id = me.parent_job_id WHERE ( ( child_job_id IN ( SELECT me.parent_job_id FROM job_dependencies me  JOIN jobs parent ON parent.id = me.parent_job_id WHERE ( ( child_job_id IN ( SELECT me.par

Files

scheduler-locking.txt (194 KB) scheduler-locking.txt szarate, 2018-06-04 09:50

Related issues 2 (0 open2 closed)

Related to openQA Project (public) - action #32545: Provide warning when user creates/triggers misconfigured multi-machine clustersResolvedmkittler2018-02-28

Actions
Related to openQA Project (public) - coordination #32851: [tools][EPIC] Scheduling redesignResolvedokurz2018-05-05

Actions
Actions #1

Updated by szarate over 6 years ago

killing the PID worked: SELECT pg_terminate_backend(15941);

but dbi is barfing:

Jun 04 11:42:14 openqa openqa-scheduler[15912]: rollback ineffective with AutoCommit enabled at /usr/lib/perl5/vendor_perl/5.18.2/DBIx/Class/Storage/DBI.pm line 1651.

in the attached log file

d', 207='2', 208='scheduled', 209='2', 210='scheduled', 211='2', 212='scheduled', 213='2', 214='scheduled', 215='2', 216='1741516', 217='1741515', 218='scheduled', 219='WORKER_CLASS', 220='qemu_aarch64', 221='tap', 222='1', 223='done', 224='cancelled', 225='2', 226='1741516', 227='1741515', 228='scheduled', 229='scheduled'] at /usr/share/openqa/script/../lib/OpenQA/Scheduler/Scheduler.pm line 155

Note we're still running: 3f6bb61a

Actions #2

Updated by szarate over 6 years ago

After restarting the scheduler:

Jun 04 12:09:15 openqa openqa-scheduler[9278]: Can't open file "/var/log/openqa_scheduler": Permission denied at
Jun 04 12:09:15 openqa openqa-scheduler[9278]: /usr/share/openqa/script/../lib/OpenQA/Setup.pm line 71 (#1)
Jun 04 12:09:15 openqa openqa-scheduler[9278]: (S inplace) The implicit opening of a file through use of the <>
Jun 04 12:09:15 openqa openqa-scheduler[9278]: filehandle, either implicitly under the -n or -p command-line
Jun 04 12:09:15 openqa openqa-scheduler[9278]: switches, or explicitly, failed for the indicated reason. Usually
Jun 04 12:09:15 openqa openqa-scheduler[9278]: this is because you don't have read permission for a file which
Jun 04 12:09:15 openqa openqa-scheduler[9278]: you named on the command line.
Jun 04 12:09:15 openqa openqa-scheduler[9278]:

Jun 04 12:09:15 openqa openqa-scheduler[9278]: (F) You tried to call perl with the -e switch, but /dev/null (or
Jun 04 12:09:15 openqa openqa-scheduler[9278]: your operating system's equivalent) could not be opened.
Jun 04 12:09:15 openqa openqa-scheduler[9278]:

Jun 04 12:09:15 openqa openqa-scheduler[9278]: Uncaught exception from user code:
Jun 04 12:09:15 openqa openqa-scheduler[9278]: Can't open file "/var/log/openqa_scheduler": Permission denied at /usr/share/openqa/script/../lib/OpenQA/Setup.pm line 71.
Jun 04 12:09:15 openqa openqa-scheduler[9278]: Mojo::File::open('Mojo::File=SCALAR(0x4a1f9e8)', '>>') called at /usr/share/openqa/script/../lib/OpenQA/Setup.pm line 71
Jun 04 12:09:15 openqa openqa-scheduler[9278]: OpenQA::Setup::setup_log('OpenQA::Setup=HASH(0x13d9e80)') called at /usr/share/openqa/script/../lib/OpenQA/Scheduler.pm line 81
Jun 04 12:09:15 openqa openqa-scheduler[9278]: OpenQA::Scheduler::run() called at /usr/share/openqa/script/openqa-scheduler line 38
Jun 04 12:09:15 openqa systemd[1]: openqa-scheduler.service: Main process exited, code=exited, status=13/n/a

Actions #3

Updated by szarate over 6 years ago

  • Subject changed from openQA scheduler hangs to job_grab seems to hang with over 1.2K jobs

We've had before 1.2, and didn't die... but looks like this time was impossible not to

Actions #4

Updated by EDiGiacinto over 6 years ago

Idea by coolo:

12:45 <coolo> mudler: I want a blocked_by column in the jobs table that also shows the job differently in the scheduled table
12:45 <coolo> mudler: and then the scheduler can ignore those - this should limit this subselect query a lot :)

That can work, and likely can be tested experimentally with not much code changes - we need to check out if with that approach we still see a degradation of performance in the job_grab's generated query in > 1.2K scheduled jobs (and among those, for a good test we need also to schedule jobs with chained parents/parallel, that makes the orm query more heavy).

Actions #5

Updated by szarate over 6 years ago

  • Priority changed from Immediate to Urgent
  • Target version set to Ready

So let's stop the workers then during the weekend, and start them after the database dump is created. (Or we can use the one from yesterday)

Actions #6

Updated by szarate over 6 years ago

  • Assignee deleted (szarate)
Actions #7

Updated by coolo over 6 years ago

  • Priority changed from Urgent to Immediate
  • Target version changed from Ready to Current Sprint

Damn, there is no way around fixing it - whatever caused it, we have this now whenever there is a lot of load :(

Actions #8

Updated by coolo over 6 years ago

And I had to cancel the one job and restart the scheduler to have it rolling again.

Actions #10

Updated by coolo over 6 years ago

Next stop https://openqa.suse.de/tests/1743954 - looks like support servers of clusters are a problem :(

Actions #11

Updated by coolo over 6 years ago

  • Category changed from 168 to 122
  • Assignee set to coolo

It also can fail with simple Server-Client clusters - https://openqa.suse.de/tests/1744194#settings

Actions #12

Updated by coolo over 6 years ago

  • Subject changed from job_grab seems to hang with over 1.2K jobs to job_grab hangs on multimachine jobs

The query created is https://pastebin.com/r0zbhCrn - there is something majorly f****ed

Actions #13

Updated by szarate over 6 years ago

  • Related to action #32545: Provide warning when user creates/triggers misconfigured multi-machine clusters added
Actions #14

Updated by coolo over 6 years ago

  • Subject changed from job_grab hangs on multimachine jobs to job_grab does not cope with parallel cycles
  • Priority changed from Immediate to Normal

After a lot of debugging we found the underlying problem. two test suites both added PARALLEL_WITH with each other, which resulted in a cycle of dependencies. We were just not aware that this is possible, so we didn't look earlier into this and hunted in many other directions first :(

the test suites are fixed

Actions #15

Updated by EDiGiacinto over 6 years ago

Actions #16

Updated by szarate over 6 years ago

  • Assignee changed from coolo to szarate

I will give it a look

Actions #17

Updated by szarate over 6 years ago

Started adding a new fixture, https://github.com/foursixnine/openQA/commits/recursioniscalled_ let's see how it goes.

Actions #18

Updated by szarate over 6 years ago

  • Status changed from New to In Progress

So working with the test fixture is not a problem, now caching the real problem is...

I tried to catch them in the job_create_dependencies, but had no luck so far, as the jobs are already created, making the tests fail horribly. I switched to look at create_from_settings method, but having a key=value table doesn't make it easy.

Actions #20

Updated by coolo over 6 years ago

  • Status changed from In Progress to Resolved

merged

Actions #21

Updated by coolo about 6 years ago

  • Target version changed from Current Sprint to Done
Actions

Also available in: Atom PDF