Tests for blocked_by and loops inside of it
Currently we have the following PR: https://github.com/os-autoinst/openQA/pull/1743, which is already hotpatched onto osd.
We need to pick it up where coolo left it and add tests for this situation as well. Note that this might also require revisiting the changes made in https://github.com/os-autoinst/openQA/pull/1718
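As a starting point for such tests, here is a minimal sketch of loop detection over job dependencies. It is not openQA code (openQA is written in Perl). The function name `find_dependency_loop` and the plain `job id -> parent ids` mapping are hypothetical stand-ins for whatever the dependency table actually provides.

```python
from typing import Dict, List, Optional


def find_dependency_loop(parents: Dict[int, List[int]]) -> Optional[List[int]]:
    """Return one dependency loop as a list of job ids, or None if there is none.

    `parents` maps a job id to the ids of its parents (hypothetical layout,
    assumed for illustration only).
    """
    IN_PROGRESS, FINISHED = 1, 2
    state: Dict[int, int] = {}   # jobs not in here have not been visited yet
    path: List[int] = []         # jobs on the current depth-first path

    def visit(job: int) -> Optional[List[int]]:
        state[job] = IN_PROGRESS
        path.append(job)
        for parent in parents.get(job, []):
            if state.get(parent) == IN_PROGRESS:
                # The parent is already on the current path: that is a loop.
                return path[path.index(parent):] + [parent]
            if parent not in state:
                loop = visit(parent)
                if loop:
                    return loop
        path.pop()
        state[job] = FINISHED
        return None

    for job in list(parents):
        if job not in state:
            loop = visit(job)
            if loop:
                return loop
    return None
```

A test could then assert that a loop such as 1 -> 2 -> 3 -> 1 is reported and that a job with two chained parents and no loop is not.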
#2 Updated by EDiGiacinto over 1 year ago
Also https://github.com/os-autoinst/openQA/pull/1717 is related to it.
<mudler>that PR is hotpatched in osd already right? https://openqa.suse.de/tests/1917358#settings -> not sure, but looks like it's not catching more-than-one chained parents still
<mudler>as one of the parent is uploading now, but it went running and missed asset
<mudler>(they were both running in parallel, now it failed the parent, for other reasons)
<coolo>mudler: what's also possible is that the blocked_by wasn't even calculated - depending how the job was created
<mudler>it's also somehow missing the parents.. i mean they should be two, no?
<foursixnine>coolo: I think blocked_by is depending on using isos post/job duplicate right?
<mudler>but if i read correctly, https://openqa.suse.de/tests/1913372#settings is the one that was posted
<mudler>and misses parents as well
<coolo>mudler: if it's missing parents, that's a completely different part then
<mudler>my point :) but i guess there are two bugs then, because even if the parent was in the DB it didn't waited for it
<coolo>I don't think there is a parent - so no waiting
#4 Updated by EDiGiacinto over 1 year ago
For the sake of reference: even with that PR, everything still seems pretty broken:
1) Child jobs no longer wait for their parents to finish in certain cases. Still to be bisected, but I believe this is the result of two different bugs in two different parts of the code, e.g. https://openqa.suse.de/tests/1917358
2) Stale jobs stay in running state forever. Within a few days of having this in production, bugs are back that were actually addressed in the previous scheduler logic, which had to cope with production loads (maybe we simplified too much here?). See e.g. https://openqa.suse.de/tests/1925778, but we had plenty of them with 'State: running finished 3 days ago ( 00:03 minutes )' or similar.
3) Having a separate way to represent the cluster is a bit confusing now. The settings page shows something that is not consistent with what the scheduler actually considers, which makes debugging even messier: you see one thing, the scheduler does another. From my point of view this is a big -1, as it makes things a lot more counterintuitive (just my opinion).
4) The tests have been bent to make them pass with the new scheduler logic, which makes me wonder whether this works as expected, since the core logic is not covered by unit tests.
IMHO, this is kind of a no-go at this point.
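To make points 1) and 4) concrete, here is a rough sketch of the kind of core-logic check a unit test could pin down: a child with more than one chained parent stays blocked until every parent is final. This is not the scheduler's actual implementation. The names `blocked_by`, `FINAL_STATES` and the two dictionaries are assumptions made for illustration, as is the choice of which states count as final.

```python
from typing import Dict, List, Optional

# Assumption: only these states count as "parent is finished".
# In particular, a parent that is still uploading must keep blocking.
FINAL_STATES = {"done", "cancelled"}


def blocked_by(job: int,
               chained_parents: Dict[int, List[int]],
               states: Dict[int, str]) -> Optional[int]:
    """Return the id of one parent that still blocks `job`, or None.

    `chained_parents` maps a job id to its chained parent ids and `states`
    maps a job id to its state; both are hypothetical in-memory stand-ins
    for the database.
    """
    for parent in sorted(chained_parents.get(job, [])):
        # A parent with an unknown state is treated as not finished.
        if states.get(parent) not in FINAL_STATES:
            return parent
    return None


# Two chained parents: one finished, one still uploading.
parents = {3: [1, 2]}
print(blocked_by(3, parents, {1: "done", 2: "uploading"}))  # prints 2
print(blocked_by(3, parents, {1: "done", 2: "done"}))       # prints None
# If the parents were never recorded, nothing blocks the job at all.
print(blocked_by(3, {}, {1: "running", 2: "running"}))      # prints None
```

The last call mirrors the second suspected bug from the chat: when the job is created without its parents being recorded, there is nothing to wait for, regardless of how the blocking check itself behaves.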