action #20002
closed
[tools] openqa sometimes doesn't update job_dependencies table
Description
For multi-machine jobs (caasp and slenkins), openQA from time to time doesn't schedule all child jobs when their parent CaaSP-controller or slenkins--control job is triggered.
It seems that not all child jobs are running because **entries for those jobs are missing from the job_dependencies SQL table**.
Example of a broken job: https://openqa.suse.de/tests/1016423 CaaSP-controller (in this case the admin node is missing, so the whole test failed)
`# select count(child_job_id) from job_dependencies where parent_job_id=1016423;
count
-------
22
(1 row)`
If you examine a successful CaaSP-controller job (e.g. id=1015418), you should get count=25 (1x controller, 1x admin, 1x master, 22x workers).
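The count comparison above could be automated roughly as follows. This is only a sketch in Python with sqlite3: the two-column table is a simplified stand-in for the real job_dependencies table, the helper names `count_children` and `find_incomplete_parents` are hypothetical, and the job ids in `demo` are made up for illustration.

```python
import sqlite3


def count_children(conn, parent_job_id):
    """Count job_dependencies rows that point at the given parent job."""
    row = conn.execute(
        "SELECT COUNT(child_job_id) FROM job_dependencies WHERE parent_job_id = ?",
        (parent_job_id,),
    ).fetchone()
    return row[0]


def find_incomplete_parents(conn, parent_job_ids, expected_children):
    """Return {parent_job_id: actual_count} for parents with too few children."""
    incomplete = {}
    for parent_job_id in parent_job_ids:
        actual = count_children(conn, parent_job_id)
        if actual < expected_children:
            incomplete[parent_job_id] = actual
    return incomplete


def demo():
    # Assumption: a simplified two-column table; the real one has more columns.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE job_dependencies (parent_job_id INTEGER, child_job_id INTEGER)"
    )
    # Made-up ids: parent 100 has 3 children, parent 200 lost one and has 2.
    rows = [(100, 101), (100, 102), (100, 103), (200, 201), (200, 202)]
    conn.executemany("INSERT INTO job_dependencies VALUES (?, ?)", rows)
    return find_incomplete_parents(conn, [100, 200], expected_children=3)
```

For the CaaSP case described here, `expected_children` would be 25 and the list of parents would be the controller job ids to check.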
I'm not able to reproduce the issue on demand, but the problem sometimes occurs both on my local openQA instance using sqlite and on o.s.d using postgresql. The broken job dependency can be worked around by posting the iso again.
Maybe it has something to do with the scheduler, which just skips some db insert queries.
I'm sorry for being so brief, but I really don't know more.
Updated by coolo over 7 years ago
We also have the problem that sometimes jobs are scheduled on the wrong workerclass. This and your issue together make me believe that the jobs are grabbed/scheduled before the final picture is there, i.e. job settings and job dependencies aren't inserted in a transaction but piece by piece? Can you check?
Updated by szarate over 7 years ago
- Related to action #18684: Jobs with worker class qemu_x86_64 are taken by machines without this class, causing incomplete jobs added
Updated by thehejik over 7 years ago
coolo wrote:
We also have the problem that sometimes jobs are scheduled on the wrong workerclass. This and your issue together make me believe that the jobs are grabbed/scheduled before the final picture is there, i.e. job settings and job dependencies aren't inserted in a transaction but piece by piece? Can you check?
Sorry, I have no idea how to check that.
Updated by coolo over 7 years ago
Somehow I had the feeling that I was talking to Ettore :)
Updated by EDiGiacinto over 7 years ago
@coolo: seems possible. AFAICS from https://github.com/os-autoinst/openQA/blob/master/lib/OpenQA/Scheduler/Scheduler.pm#L214 the search->first is not inside a transaction
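To illustrate the "piece by piece" concern, here is a minimal sketch of inserting jobs, settings and dependencies in one transaction. It is not openQA's code: it uses Python with sqlite3, the three-table schema is a simplified assumption, and `make_db` and `schedule_group` are hypothetical names. The point is only that a reader of the database sees either the whole group or none of it.

```python
import sqlite3

# Assumption: a simplified schema, not the real openQA tables.
SCHEMA = """
CREATE TABLE jobs (id INTEGER PRIMARY KEY, test TEXT NOT NULL);
CREATE TABLE job_settings (
    job_id INTEGER NOT NULL, key TEXT NOT NULL, value TEXT NOT NULL
);
CREATE TABLE job_dependencies (
    parent_job_id INTEGER NOT NULL,
    child_job_id INTEGER NOT NULL,
    UNIQUE (parent_job_id, child_job_id)
);
"""


def make_db():
    """Create an in-memory database with the simplified schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def schedule_group(conn, jobs, settings, dependencies):
    """Insert a whole job group atomically.

    The connection context manager commits if every insert succeeds and
    rolls back all of them if any insert raises.
    """
    with conn:
        conn.executemany("INSERT INTO jobs (id, test) VALUES (?, ?)", jobs)
        conn.executemany(
            "INSERT INTO job_settings (job_id, key, value) VALUES (?, ?, ?)",
            settings,
        )
        conn.executemany(
            "INSERT INTO job_dependencies (parent_job_id, child_job_id) VALUES (?, ?)",
            dependencies,
        )
```

If the dependency insert fails, the jobs and settings inserted just before it are rolled back as well, so nothing can pick up a job whose settings or dependencies are still incomplete.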
Updated by szarate over 7 years ago
Looks like poo#18684 is fixed, but this is still happening, https://openqa.suse.de/tests/1061075#settings
Updated by EDiGiacinto over 7 years ago
It might be fixed by https://github.com/os-autoinst/openQA/pull/1389. Unfortunately, I can't reproduce the issue locally on one machine, but I've been running openQA with those patches with no issues.
Updated by coolo over 7 years ago
Fixed in master doesn't matter here - we have c2c7bcd2 deployed. Remember EDiGiacinto's first feature? :)
Updated by asmorodskyi over 7 years ago
Evidence that the issue is not fixed: https://openqa.suse.de/tests/1065721
Updated by pcervinka over 7 years ago
- Related to action #20790: [qam] SLE12-SP3 test fails in 1__unknown_ - slenkins-tests-openvpn-control added
Updated by okurz over 7 years ago
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: slenkins-tests-openvpn-control
https://openqa.suse.de/tests/1130899
Updated by coolo about 7 years ago
- Status changed from New to Resolved
The same test worked flawlessly throughout October and November, so I'm assuming it's fixed.