action #36727
job_grab does not cope with parallel cycles (closed)
Description
Apparently the scheduler is deadlocking at some point when the amount of jobs is too big (quick hypothesis).
ps aux shows:
postgres 15941 90.5 1.2 282272 208904 ? Rs 11:07 29:55 postgres: geekotest openqa [local] BIND
pid | 15941
age | -00:00:05.616602
usename | geekotest
query | SELECT COUNT( * ) FROM job_dependencies me JOIN jobs parent ON parent.id = me.parent_job_id WHERE ( ( child_job_id IN ( SELECT me.parent_job_id FROM job_dependencies me JOIN jobs parent ON parent.id = me.parent_job_id WHERE ( ( child_job_id IN ( SELECT me.parent_job_id FROM job_dependencies me JOIN jobs parent ON parent.id = me.parent_job_id WHERE ( ( child_job_id IN ( SELECT me.parent_job_id FROM job_dependencies me JOIN jobs parent ON parent.id = me.parent_job_id WHERE ( ( child_job_id IN ( SELECT me.parent_job_id FROM job_dependencies me JOIN jobs parent ON parent.id = me.parent_job_id WHERE ( ( child_job_id IN ( SELECT me.parent_job_id FROM job_dependencies me JOIN jobs parent ON parent.id = me.parent_job_id WHERE ( ( child_job_id IN ( SELECT me.parent_job_id FROM job_dependencies me JOIN jobs parent ON parent.id = me.parent_job_id WHERE ( ( child_job_id IN ( SELECT me.parent_job_id FROM job_dependencies me JOIN jobs parent ON parent.id = me.parent_job_id WHERE ( ( child_job_id IN ( SELECT me.par
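The truncated query above is the scheduler expanding parent dependencies through deeply nested "child_job_id IN (SELECT parent_job_id ...)" subselects. A minimal sketch (hypothetical helper names, not openQA code) of why a mutual PARALLEL_WITH pair makes that expansion never bottom out:

```python
# Hypothetical sketch: naive recursive expansion of parent jobs, mirroring
# the nested subselects in the query above. deps is a list of
# (child_job_id, parent_job_id) pairs like the job_dependencies table.
def parents_of(deps, job_id):
    """Direct parents of job_id."""
    return {p for (c, p) in deps if c == job_id}

def all_ancestors_naive(deps, job_id, depth=0, max_depth=10):
    """Expands parents recursively with NO visited set, like the SQL above.
    With a dependency cycle it only stops because of the depth guard."""
    if depth >= max_depth:
        raise RecursionError("dependency cycle: expansion never terminates")
    result = set()
    for p in parents_of(deps, job_id):
        result.add(p)
        result |= all_ancestors_naive(deps, p, depth + 1, max_depth)
    return result

# Two jobs declaring PARALLEL_WITH each other -> a 2-cycle:
deps = [(1, 2), (2, 1)]
try:
    all_ancestors_naive(deps, 1)
except RecursionError as e:
    print("expansion aborted:", e)
```

With an acyclic graph the recursion terminates normally; with the 2-cycle each level re-adds the other job, which is what the endlessly nested subselect corresponds to on the database side.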
Files
Updated by szarate over 6 years ago
- File scheduler-locking.txt scheduler-locking.txt added
killing the PID worked: SELECT pg_terminate_backend(15941);
but dbi is barfing:
Jun 04 11:42:14 openqa openqa-scheduler[15912]: rollback ineffective with AutoCommit enabled at /usr/lib/perl5/vendor_perl/5.18.2/DBIx/Class/Storage/DBI.pm line 1651.
in the attached log file
d', 207='2', 208='scheduled', 209='2', 210='scheduled', 211='2', 212='scheduled', 213='2', 214='scheduled', 215='2', 216='1741516', 217='1741515', 218='scheduled', 219='WORKER_CLASS', 220='qemu_aarch64', 221='tap', 222='1', 223='done', 224='cancelled', 225='2', 226='1741516', 227='1741515', 228='scheduled', 229='scheduled'] at /usr/share/openqa/script/../lib/OpenQA/Scheduler/Scheduler.pm line 155
Note we're still running: 3f6bb61a
Updated by szarate over 6 years ago
After restarting the scheduler:
Jun 04 12:09:15 openqa openqa-scheduler[9278]: Can't open file "/var/log/openqa_scheduler": Permission denied at
Jun 04 12:09:15 openqa openqa-scheduler[9278]: /usr/share/openqa/script/../lib/OpenQA/Setup.pm line 71 (#1)
Jun 04 12:09:15 openqa openqa-scheduler[9278]: (S inplace) The implicit opening of a file through use of the <>
Jun 04 12:09:15 openqa openqa-scheduler[9278]: filehandle, either implicitly under the -n or -p command-line
Jun 04 12:09:15 openqa openqa-scheduler[9278]: switches, or explicitly, failed for the indicated reason. Usually
Jun 04 12:09:15 openqa openqa-scheduler[9278]: this is because you don't have read permission for a file which
Jun 04 12:09:15 openqa openqa-scheduler[9278]: you named on the command line.
Jun 04 12:09:15 openqa openqa-scheduler[9278]:
Jun 04 12:09:15 openqa openqa-scheduler[9278]: (F) You tried to call perl with the -e switch, but /dev/null (or
Jun 04 12:09:15 openqa openqa-scheduler[9278]: your operating system's equivalent) could not be opened.
Jun 04 12:09:15 openqa openqa-scheduler[9278]:
Jun 04 12:09:15 openqa openqa-scheduler[9278]: Uncaught exception from user code:
Jun 04 12:09:15 openqa openqa-scheduler[9278]: Can't open file "/var/log/openqa_scheduler": Permission denied at /usr/share/openqa/script/../lib/OpenQA/Setup.pm line 71.
Jun 04 12:09:15 openqa openqa-scheduler[9278]: Mojo::File::open('Mojo::File=SCALAR(0x4a1f9e8)', '>>') called at /usr/share/openqa/script/../lib/OpenQA/Setup.pm line 71
Jun 04 12:09:15 openqa openqa-scheduler[9278]: OpenQA::Setup::setup_log('OpenQA::Setup=HASH(0x13d9e80)') called at /usr/share/openqa/script/../lib/OpenQA/Scheduler.pm line 81
Jun 04 12:09:15 openqa openqa-scheduler[9278]: OpenQA::Scheduler::run() called at /usr/share/openqa/script/openqa-scheduler line 38
Jun 04 12:09:15 openqa systemd[1]: openqa-scheduler.service: Main process exited, code=exited, status=13/n/a
Updated by szarate over 6 years ago
- Subject changed from openQA scheduler hangs to job_grab seems to hang with over 1.2K jobs
We've had over 1.2K scheduled jobs before and it didn't die... but it looks like this time it was unavoidable.
Updated by EDiGiacinto over 6 years ago
Idea by coolo:
12:45 <coolo> mudler: I want a blocked_by column in the jobs table that also shows the job differently in the scheduled table
12:45 <coolo> mudler: and then the scheduler can ignore those - this should limit this subselect query a lot :)
That could work, and it can likely be tested experimentally without many code changes. We need to check whether, with that approach, we still see performance degradation in job_grab's generated query with more than 1.2K scheduled jobs (and, for a good test, we also need to schedule jobs with chained and parallel parents, which makes the ORM query heavier).
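A minimal sketch of coolo's blocked_by idea (names are hypothetical, not the actual openQA schema change): each job records one unfinished parent that blocks it, and the scheduler only considers jobs whose blocked_by is unset, replacing the deep subselect with a cheap filter:

```python
# Hypothetical sketch: precompute one blocking parent per job so the
# scheduler can filter with a cheap "blocked_by IS NULL"-style test.
def compute_blocked_by(jobs, deps):
    """jobs: {job_id: state}; deps: [(child, parent)].
    Returns {job_id: blocking_parent_or_None}."""
    blocked_by = {}
    for job_id in jobs:
        blocked_by[job_id] = None
        for child, parent in deps:
            if child == job_id and jobs.get(parent) != 'done':
                blocked_by[job_id] = parent  # first unfinished parent wins
                break
    return blocked_by

def grabbable(jobs, deps):
    """Jobs the scheduler may hand to workers: scheduled and unblocked."""
    blocked = compute_blocked_by(jobs, deps)
    return sorted(j for j, s in jobs.items()
                  if s == 'scheduled' and blocked[j] is None)

jobs = {1: 'done', 2: 'scheduled', 3: 'scheduled'}
deps = [(3, 2)]               # job 3 depends on job 2
print(grabbable(jobs, deps))  # -> [2]; job 3 waits until 2 is done
```

The point of the design is that the expensive dependency walk happens once, when states change, instead of inside every job_grab query.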
Updated by szarate over 6 years ago
- Priority changed from Immediate to Urgent
- Target version set to Ready
So let's stop the workers then during the weekend, and start them after the database dump is created. (Or we can use the one from yesterday)
Updated by coolo over 6 years ago
- Priority changed from Urgent to Immediate
- Target version changed from Ready to Current Sprint
Damn, there is no way around fixing it - whatever caused it, we have this now whenever there is a lot of load :(
Updated by coolo over 6 years ago
And I had to cancel the one job and restart the scheduler to have it rolling again.
Updated by coolo over 6 years ago
https://openqa.suse.de/tests/1743900 in this case
Updated by coolo over 6 years ago
Next stop https://openqa.suse.de/tests/1743954 - looks like support servers of clusters are a problem :(
Updated by coolo over 6 years ago
- Category changed from 168 to 122
- Assignee set to coolo
It also can fail with simple Server-Client clusters - https://openqa.suse.de/tests/1744194#settings
Updated by coolo over 6 years ago
- Subject changed from job_grab seems to hang with over 1.2K jobs to job_grab hangs on multimachine jobs
The query created is https://pastebin.com/r0zbhCrn - there is something majorly f****ed
Updated by szarate over 6 years ago
- Related to action #32545: Provide warning when user creates/triggers misconfigured multi-machine clusters added
Updated by coolo over 6 years ago
- Subject changed from job_grab hangs on multimachine jobs to job_grab does not cope with parallel cycles
- Priority changed from Immediate to Normal
After a lot of debugging we found the underlying problem: two test suites each added PARALLEL_WITH pointing at the other, which resulted in a dependency cycle. We were simply not aware this was possible, so we didn't look into it earlier and hunted in many other directions first :(
The test suites are fixed.
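Since the root cause was a dependency cycle, a standard way to reject such configurations up front is a depth-first cycle check over the dependency graph (a generic sketch, not the code openQA later adopted):

```python
# Generic DFS cycle detection over (child, parent) dependency edges.
def has_cycle(deps):
    """deps: [(child, parent)]. Returns True if the directed dependency
    graph contains a cycle (e.g. two suites PARALLEL_WITH each other)."""
    graph = {}
    for child, parent in deps:
        graph.setdefault(child, []).append(parent)
    WHITE, GRAY, BLACK = 0, 1, 2   # unvisited / on current path / finished
    color = {}

    def visit(node):
        color[node] = GRAY
        for nxt in graph.get(node, []):
            c = color.get(nxt, WHITE)
            if c == GRAY:          # back edge onto the current path -> cycle
                return True
            if c == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color.get(n, WHITE) == WHITE and visit(n) for n in graph)

print(has_cycle([(1, 2), (2, 1)]))  # -> True: mutual PARALLEL_WITH
print(has_cycle([(1, 2), (2, 3)]))  # -> False: a plain chain
```

Running such a check at job-creation time would turn the silent scheduler hang into an immediate, explainable rejection of the misconfigured cluster.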
Updated by EDiGiacinto over 6 years ago
- Related to coordination #32851: [tools][EPIC] Scheduling redesign added
Updated by szarate over 6 years ago
Started adding a new fixture (https://github.com/foursixnine/openQA/commits/recursioniscalled_); let's see how it goes.
Updated by szarate over 6 years ago
- Status changed from New to In Progress
So working with the test fixture is not a problem; catching the real problem is...
I tried to catch the cycles in job_create_dependencies, but had no luck so far, as the jobs are already created by then, making the tests fail horribly. I switched to looking at the create_from_settings method, but having a key=value table doesn't make it easy.
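One way to catch the mutual reference before any jobs exist is to validate the flat key=value settings directly (a hypothetical helper, not the actual create_from_settings hook; openQA's PARALLEL_WITH values are comma-separated suite names):

```python
# Hypothetical pre-creation check on test suite settings: find pairs of
# suites that each list the other in PARALLEL_WITH, i.e. a 2-cycle.
def mutual_parallel_pairs(suites):
    """suites: {suite_name: {key: value}} with comma-separated
    PARALLEL_WITH values. Returns sorted mutual pairs."""
    wants = {
        name: set(filter(None, settings.get('PARALLEL_WITH', '').split(',')))
        for name, settings in suites.items()
    }
    pairs = []
    for a, targets in wants.items():
        for b in targets:
            if a in wants.get(b, set()) and a < b:  # report each pair once
                pairs.append((a, b))
    return pairs

suites = {
    'hpc_master': {'PARALLEL_WITH': 'hpc_slave'},
    'hpc_slave':  {'PARALLEL_WITH': 'hpc_master'},
    'textmode':   {},
}
print(mutual_parallel_pairs(suites))  # -> [('hpc_master', 'hpc_slave')]
```

This only catches direct 2-cycles; longer cycles would still need a full graph walk, but it covers the exact misconfiguration that triggered this ticket.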
Updated by coolo over 6 years ago
Updated by coolo about 6 years ago
- Target version changed from Current Sprint to Done