action #36727

job_grab does not cope with parallel cycles

Added by szarate over 1 year ago. Updated over 1 year ago.

Status:ResolvedStart date:04/06/2018
Priority:NormalDue date:
Assignee:szarate% Done:

0%

Category:Feature requests
Target version:Done
Difficulty:
Duration:

Description

Apparently the scheduler is deadlocking at some point when the ammount of jobs is too big (quick hipotesys)

px aux shows:

postgres 15941 90.5 1.2 282272 208904 ? Rs 11:07 29:55 postgres: geekotest openqa [local] BIND

pid     | 15941
age     | -00:00:05.616602
usename | geekotest
query   | SELECT COUNT( * ) FROM job_dependencies me  JOIN jobs parent ON parent.id = me.parent_job_id WHERE ( ( child_job_id IN ( SELECT me.parent_job_id FROM job_dependencies me  JOIN jobs parent ON parent.id = me.parent_job_id WHERE ( ( child_job_id IN ( SELECT me.parent_job_id FROM job_dependencies me  JOIN jobs parent ON parent.id = me.parent_job_id WHERE ( ( child_job_id IN ( SELECT me.parent_job_id FROM job_dependencies me  JOIN jobs parent ON parent.id = me.parent_job_id WHERE ( ( child_job_id IN ( SELECT me.parent_job_id FROM job_dependencies me  JOIN jobs parent ON parent.id = me.parent_job_id WHERE ( ( child_job_id IN ( SELECT me.parent_job_id FROM job_dependencies me  JOIN jobs parent ON parent.id = me.parent_job_id WHERE ( ( child_job_id IN ( SELECT me.parent_job_id FROM job_dependencies me  JOIN jobs parent ON parent.id = me.parent_job_id WHERE ( ( child_job_id IN ( SELECT me.parent_job_id FROM job_dependencies me  JOIN jobs parent ON parent.id = me.parent_job_id WHERE ( ( child_job_id IN ( SELECT me.par

scheduler-locking.txt Magnifier (194 KB) szarate, 04/06/2018 09:50 am


Related issues

Related to openQA Project - action #32545: Catch multi-machine clusters misconfigured New 28/02/2018
Related to openQA Project - action #32851: [tools][EPIC] Scheduling redesign New 05/05/2018

History

#1 Updated by szarate over 1 year ago

killing the PID worked: SELECT pg_terminate_backend(15941);

but dbi is barfing:

Jun 04 11:42:14 openqa openqa-scheduler[15912]: rollback ineffective with AutoCommit enabled at /usr/lib/perl5/vendor_perl/5.18.2/DBIx/Class/Storage/DBI.pm line 1651.

in the attached log file

d', 207='2', 208='scheduled', 209='2', 210='scheduled', 211='2', 212='scheduled', 213='2', 214='scheduled', 215='2', 216='1741516', 217='1741515', 218='scheduled', 219='WORKER_CLASS', 220='qemu_aarch64', 221='tap', 222='1', 223='done', 224='cancelled', 225='2', 226='1741516', 227='1741515', 228='scheduled', 229='scheduled'] at /usr/share/openqa/script/../lib/OpenQA/Scheduler/Scheduler.pm line 155

Note we're still running: 3f6bb61a

#2 Updated by szarate over 1 year ago

After restarting the scheduler:

Jun 04 12:09:15 openqa openqa-scheduler[9278]: Can't open file "/var/log/openqa_scheduler": Permission denied at
Jun 04 12:09:15 openqa openqa-scheduler[9278]: /usr/share/openqa/script/../lib/OpenQA/Setup.pm line 71 (#1)
Jun 04 12:09:15 openqa openqa-scheduler[9278]: (S inplace) The implicit opening of a file through use of the <>
Jun 04 12:09:15 openqa openqa-scheduler[9278]: filehandle, either implicitly under the -n or -p command-line
Jun 04 12:09:15 openqa openqa-scheduler[9278]: switches, or explicitly, failed for the indicated reason. Usually
Jun 04 12:09:15 openqa openqa-scheduler[9278]: this is because you don't have read permission for a file which
Jun 04 12:09:15 openqa openqa-scheduler[9278]: you named on the command line.
Jun 04 12:09:15 openqa openqa-scheduler[9278]:

Jun 04 12:09:15 openqa openqa-scheduler[9278]: (F) You tried to call perl with the -e switch, but /dev/null (or
Jun 04 12:09:15 openqa openqa-scheduler[9278]: your operating system's equivalent) could not be opened.
Jun 04 12:09:15 openqa openqa-scheduler[9278]:

Jun 04 12:09:15 openqa openqa-scheduler[9278]: Uncaught exception from user code:
Jun 04 12:09:15 openqa openqa-scheduler[9278]: Can't open file "/var/log/openqa_scheduler": Permission denied at /usr/share/openqa/script/../lib/OpenQA/Setup.pm line 71.
Jun 04 12:09:15 openqa openqa-scheduler[9278]: Mojo::File::open('Mojo::File=SCALAR(0x4a1f9e8)', '>>') called at /usr/share/openqa/script/../lib/OpenQA/Setup.pm line 71
Jun 04 12:09:15 openqa openqa-scheduler[9278]: OpenQA::Setup::setup_log('OpenQA::Setup=HASH(0x13d9e80)') called at /usr/share/openqa/script/../lib/OpenQA/Scheduler.pm line 81
Jun 04 12:09:15 openqa openqa-scheduler[9278]: OpenQA::Scheduler::run() called at /usr/share/openqa/script/openqa-scheduler line 38
Jun 04 12:09:15 openqa systemd[1]: openqa-scheduler.service: Main process exited, code=exited, status=13/n/a

#3 Updated by szarate over 1 year ago

  • Subject changed from openQA scheduler hangs to job_grab seems to hang with over 1.2K jobs

We've had before 1.2, and didn't die... but looks like this time was impossible not to

#4 Updated by EDiGiacinto over 1 year ago

Idea by coolo:

12:45 <coolo> mudler: I want a blocked_by column in the jobs table that also shows the job differently in the scheduled table
12:45 <coolo> mudler: and then the scheduler can ignore those - this should limit this subselect query a lot :)

That can work, and likely can be tested experimentally with not much code changes - we need to check out if with that approach we still see a degradation of performance in the job_grab's generated query in > 1.2K scheduled jobs (and among those, for a good test we need also to schedule jobs with chained parents/parallel, that makes the orm query more heavy).

#5 Updated by szarate over 1 year ago

  • Priority changed from Immediate to Urgent
  • Target version set to Ready

So let's stop the workers then during the weekend, and start them after the database dump is created. (Or we can use the one from yesterday)

#6 Updated by szarate over 1 year ago

  • Assignee deleted (szarate)

#7 Updated by coolo over 1 year ago

  • Priority changed from Urgent to Immediate
  • Target version changed from Ready to Current Sprint

Damn, there is no way around fixing it - whatever caused it, we have this now whenever there is a lot of load :(

#8 Updated by coolo over 1 year ago

And I had to cancel the one job and restart the scheduler to have it rolling again.

#10 Updated by coolo over 1 year ago

Next stop https://openqa.suse.de/tests/1743954 - looks like support servers of clusters are a problem :(

#11 Updated by coolo over 1 year ago

  • Category changed from 168 to 122
  • Assignee set to coolo

It also can fail with simple Server-Client clusters - https://openqa.suse.de/tests/1744194#settings

#12 Updated by coolo over 1 year ago

  • Subject changed from job_grab seems to hang with over 1.2K jobs to job_grab hangs on multimachine jobs

The query created is https://pastebin.com/r0zbhCrn - there is something majorly f****ed

#13 Updated by szarate over 1 year ago

  • Related to action #32545: Catch multi-machine clusters misconfigured added

#14 Updated by coolo over 1 year ago

  • Subject changed from job_grab hangs on multimachine jobs to job_grab does not cope with parallel cycles
  • Priority changed from Immediate to Normal

After a lot of debugging we found the underlying problem. two test suites both added PARALLEL_WITH with each other, which resulted in a cycle of dependencies. We were just not aware that this is possible, so we didn't look earlier into this and hunted in many other directions first :(

the test suites are fixed

#15 Updated by EDiGiacinto over 1 year ago

  • Related to action #32851: [tools][EPIC] Scheduling redesign added

#16 Updated by szarate over 1 year ago

  • Assignee changed from coolo to szarate

I will give it a look

#17 Updated by szarate over 1 year ago

Started adding a new fixture, https://github.com/foursixnine/openQA/commits/recursioniscalled_ let's see how it goes.

#18 Updated by szarate over 1 year ago

  • Status changed from New to In Progress

So working with the test fixture is not a problem, now caching the real problem is...

I tried to catch them in the job_create_dependencies, but had no luck so far, as the jobs are already created, making the tests fail horribly. I switched to look at create_from_settings method, but having a key=value table doesn't make it easy.

#20 Updated by coolo over 1 year ago

  • Status changed from In Progress to Resolved

merged

#21 Updated by coolo over 1 year ago

  • Target version changed from Current Sprint to Done

Also available in: Atom PDF