action #20002

[tools] openqa sometimes doesn't update job_dependencies table

Added by thehejik about 4 years ago. Updated almost 4 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: Feature requests
Target version: -
Start date: 2017-06-22
Due date:
% Done: 0%
Estimated time:
Difficulty:

Description

For multi-machine jobs (caasp and slenkins), openQA from time to time doesn't schedule all child jobs when their parent CaaSP-controller or slenkins--control job is triggered.
It seems that not all child jobs are running because entries for those jobs are missing from the job_dependencies SQL table.

Example of a broken job: https://openqa.suse.de/tests/1016423 CaaSP-controller (in this case the admin node is missing, so the whole test failed)

`# select count(child_job_id) from job_dependencies where parent_job_id=1016423;
count 
-------
22
(1 row)`

If you examine a successful CaaSP-controller job (e.g. id=1015418), you should get count=25 (1x controller, 1x admin, 1x master, 22x workers).
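To spot affected parents without checking job ids one by one, the count query above can be grouped by parent. The following is only a sketch in Python against an in-memory SQLite database: the column names parent_job_id and child_job_id come from the query above, while the table definition, the child ids and the helper names are assumptions made up for illustration and not openQA's actual schema or code.

```python
import sqlite3


def children_per_parent(conn):
    """Return {parent_job_id: number of child rows} from job_dependencies."""
    rows = conn.execute(
        "SELECT parent_job_id, COUNT(child_job_id) "
        "FROM job_dependencies GROUP BY parent_job_id"
    )
    return dict(rows)


def parents_missing_children(conn, expected):
    """List (parent_job_id, count) for parents with fewer than `expected` children.

    Parents with no rows at all do not show up here, because the query
    only sees parents that have at least one dependency row.
    """
    return sorted(
        (parent, count)
        for parent, count in children_per_parent(conn).items()
        if count < expected
    )


def demo():
    conn = sqlite3.connect(":memory:")
    # Assumed, simplified table layout (hypothetical, for illustration only).
    conn.execute(
        "CREATE TABLE job_dependencies (parent_job_id INTEGER, child_job_id INTEGER)"
    )
    # Made-up child ids: 25 rows for the good parent, 22 for the broken one.
    good = [(1015418, 2000000 + i) for i in range(25)]
    broken = [(1016423, 3000000 + i) for i in range(22)]
    conn.executemany("INSERT INTO job_dependencies VALUES (?, ?)", good + broken)
    result = parents_missing_children(conn, expected=25)
    conn.close()
    return result
```

With the made-up rows in demo(), only the parent with 22 child rows is reported. The expected count of 25 applies to this CaaSP scenario only and would have to be adjusted per test suite.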

I'm not able to reproduce the issue on demand, but the problem sometimes occurs on my local openQA instance using SQLite and also on o.s.d using PostgreSQL. The broken job dependency can be fixed by posting the ISO again.

Maybe it has something to do with the scheduler, which just skips some DB insert queries.

I'm sorry for being so brief, but I really don't know more.


Related issues

Related to openQA Project - action #18684: Jobs with worker class qemu_x86_64 are taken by machines without this class, causing incomplete jobs (Resolved, 2017-04-20)

Related to openQA Tests - action #20790: [qam] SLE12-SP3 test fails in 1__unknown_ - slenkins-tests-openvpn-control (Rejected, 2017-07-26)

History

#1 Updated by coolo about 4 years ago

We also have the problem that sometimes jobs are scheduled on the wrong worker class. This and your issue together make me believe that the jobs are grabbed/scheduled before the final picture is there, i.e. job settings and job dependencies aren't inserted in a transaction but piece by piece? Can you check?
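To illustrate the suspicion, here is a minimal sketch in Python with SQLite of what "inserted in a transaction" would change. It is not openQA code (openQA is written in Perl); the three-table layout, the job ids and the helper names are assumptions made up for the example. A second connection stands in for the scheduler and checks what it can see while the first one is still inserting.

```python
import os
import sqlite3
import tempfile

# Assumed, heavily simplified schema (hypothetical, not openQA's real one).
SCHEMA = """
CREATE TABLE jobs (id INTEGER PRIMARY KEY, test TEXT);
CREATE TABLE job_settings (job_id INTEGER, key TEXT, value TEXT);
CREATE TABLE job_dependencies (parent_job_id INTEGER, child_job_id INTEGER);
"""


def visible_state(reader):
    """Row counts (jobs, settings, dependencies) as seen by another connection."""
    counts = []
    for table in ("jobs", "job_settings", "job_dependencies"):
        counts.append(reader.execute(f"SELECT COUNT(*) FROM {table}").fetchall()[0][0])
    return tuple(counts)


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sketch.db")
        writer = sqlite3.connect(path)  # plays the part that posts the ISO
        reader = sqlite3.connect(path)  # plays the scheduler
        writer.executescript(SCHEMA)

        # All inserts share one transaction; "with writer" commits on success
        # and rolls back if any statement raises.
        with writer:
            writer.execute("INSERT INTO jobs VALUES (1, 'controller')")
            writer.execute("INSERT INTO jobs VALUES (2, 'admin')")
            writer.execute(
                "INSERT INTO job_settings VALUES (2, 'WORKER_CLASS', 'qemu_x86_64')"
            )
            writer.execute("INSERT INTO job_dependencies VALUES (1, 2)")
            during = visible_state(reader)  # nothing committed yet

        after = visible_state(reader)  # everything appears at once
        writer.close()
        reader.close()
        return during, after
```

In this sketch the reader sees either no rows or the complete set, never a job without its settings and dependencies. If each insert were committed on its own, a scheduler polling in between could pick up a job whose worker class or dependencies are not there yet, which would match both symptoms described here.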

#2 Updated by szarate about 4 years ago

  • Related to action #18684: Jobs with worker class qemu_x86_64 are taken by machines without this class, causing incomplete jobs added

#3 Updated by thehejik about 4 years ago

coolo wrote:

We also have the problem that sometimes jobs are scheduled on the wrong worker class. This and your issue together make me believe that the jobs are grabbed/scheduled before the final picture is there, i.e. job settings and job dependencies aren't inserted in a transaction but piece by piece? Can you check?

Sorry, I have no idea how to check that.

#4 Updated by coolo about 4 years ago

Somehow I had the feeling that I was talking to Ettore :)

#5 Updated by EDiGiacinto about 4 years ago

coolo: seems possible. AFAICS from https://github.com/os-autoinst/openQA/blob/master/lib/OpenQA/Scheduler/Scheduler.pm#L214 the search->first is not inside a transaction

#6 Updated by szarate about 4 years ago

Looks like poo#18684 is fixed, but this is still happening: https://openqa.suse.de/tests/1061075#settings

#7 Updated by EDiGiacinto about 4 years ago

It might be fixed by https://github.com/os-autoinst/openQA/pull/1389. Unfortunately I can't reproduce the issue locally on a single machine, but I've been running openQA with those patches with no issues.

#8 Updated by coolo about 4 years ago

Fixed in master doesn't matter here - we have c2c7bcd2 deployed. Remember EDiGiacinto's first feature? :)

#9 Updated by asmorodskyi about 4 years ago

Evidence that the issue is not fixed: https://openqa.suse.de/tests/1065721

#10 Updated by pcervinka about 4 years ago

  • Related to action #20790: [qam] SLE12-SP3 test fails in 1__unknown_ - slenkins-tests-openvpn-control added

#11 Updated by okurz about 4 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: slenkins-tests-openvpn-control
https://openqa.suse.de/tests/1130899

#12 Updated by coolo almost 4 years ago

  • Status changed from New to Resolved

The same test worked flawlessly throughout October and November, so I'm assuming it's fixed.
