action #39560

Tests for blocked_by and loops inside of it

Added by szarate over 1 year ago. Updated over 1 year ago.

Status: Resolved
Start date: 10/08/2018
Priority: High
Due date:
Assignee: -
% Done:
Target version: Done


Currently we have the following PR: that is already hotpatched onto osd.

We need to pick it up where coolo left it and add tests for the situation as well. Note that this might also require revisiting the changes done in

Related issues

Related to openQA Project - action #32725: [tools] Scheduler job_grab/filter_jobs refactoring Resolved 05/05/2018
Related to openQA Project - action #39629: openQA Scheduler refactor fallout Resolved 13/08/2018


#1 Updated by szarate over 1 year ago

  • Description updated (diff)

#2 Updated by EDiGiacinto over 1 year ago

Also is related to it.

From IRC:

<mudler> that PR is hotpatched in osd already right? -> not sure, but looks like it's not catching more-than-one chained parents still
<mudler> as one of the parent is uploading now, but it went running and missed asset
<mudler> (they were both running in parallel, now it failed the parent, for other reasons)
<coolo> mudler: what's also possible is that the blocked_by wasn't even calculated - depending how the job was created
<mudler> it's also somehow missing the parents.. i mean they should be two, no?
<foursixnine> coolo: I think blocked_by is depending on using isos post/job duplicate right?
<mudler> but if i read correctly, is the one that was posted
<mudler> and misses parents as well
<coolo> mudler: if it's missing parents, that's a completely different part then
<mudler> my point :) but i guess there are two bugs then, because even if the parent was in the DB it didn't waited for it
<coolo> I don't think there is a parent - so no waiting
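
To make the two cases from the log concrete (a job with more than one chained parent, and a job whose parent is not recorded at all), here is a minimal Python sketch. It is not the openQA scheduler code: the `Job` class, the `blocked_by` helper, the `chained_parents` field and the set of final states are all hypothetical names chosen for illustration. The assumption is that a job stays blocked for as long as any of its chained parents has not reached a final state.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# States in which a parent no longer blocks its children.
# This set is an assumption made for this sketch only.
FINAL_STATES = {"done", "cancelled"}


@dataclass
class Job:
    """Hypothetical stand-in for a scheduled job, not the real schema."""
    id: int
    state: str = "scheduled"
    # IDs of chained parents; a job may have more than one.
    chained_parents: List[int] = field(default_factory=list)


def blocked_by(job: Job, jobs: Dict[int, Job]) -> Optional[int]:
    """Return the ID of the first chained parent that is not final, else None."""
    for parent_id in job.chained_parents:
        parent = jobs.get(parent_id)
        # A parent that is not in the table cannot block anything, which
        # corresponds to the "no parent, so no waiting" case in the log.
        if parent is not None and parent.state not in FINAL_STATES:
            return parent_id
    return None


jobs = {
    1: Job(1, state="done"),
    2: Job(2, state="uploading"),
    3: Job(3, chained_parents=[1, 2]),  # two chained parents
    4: Job(4, chained_parents=[99]),    # parent 99 was never recorded
}
```

With this data, job 3 stays blocked by job 2 until that parent is done, while job 4 is never blocked because its parent is missing from the table. A test for the multi-parent situation would need to cover both paths.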

#3 Updated by EDiGiacinto over 1 year ago

  • Related to action #32725: [tools] Scheduler job_grab/filter_jobs refactoring added

#4 Updated by EDiGiacinto over 1 year ago

For the sake of reference, even with that PR, everything still seems pretty broken:

1) In certain cases, child jobs no longer wait for their parents. This still needs bisecting, but I believe it is a symptom of two different bugs in two different parts of the code, e.g.

2) Stale jobs stay in the running state forever. After only a few days of having this in production, we are seeing bugs again that were already addressed in the previous scheduler logic, which had to cope with production loads (maybe we simplified too much here?). See e.g. but we had plenty of them with 'State: running finished 3 days ago ( 00:03 minutes )' or similar.

3) Having a separate way to represent the cluster is a bit confusing now. The settings page shows something that is not consistent with what the scheduler is actually considering, which makes debugging even messier: you see one thing and the scheduler does another. From my point of view this is a big -1, as it makes things a lot more counterintuitive (just my opinion).

4) Tests are kind of bent to make them pass against the new scheduler logic, which makes me wonder whether this is working as expected, as the core logic is not covered by unit tests.

IMHO this is kind of a no-go at this point.

#5 Updated by szarate over 1 year ago

  • Related to action #39629: openQA Scheduler refactor fallout added

#6 Updated by coolo over 1 year ago

  • Status changed from New to Resolved

I added 2 more test cases for blocked_by, and it looks good in production, so let's resolve it.

#7 Updated by coolo over 1 year ago

  • Target version changed from Current Sprint to Done
