action #39560: Tests for blocked_by and loops inside of it - openQA Project (public) - openSUSE Project Management Tool

action #39560

closed

Tests for blocked_by and loops inside of it

Added by szarate over 6 years ago. Updated over 6 years ago.

Status: Resolved
Priority: High
Assignee:
Category:
Target version: Done
Start date: 2018-08-10
Due date:
% Done:
Estimated time:

Description

Currently we have the following PR, https://github.com/os-autoinst/openQA/pull/1743, which is already hotpatched onto osd.

We need to pick it up where coolo left it and add tests for the situation as well. Note that this might also require revisiting the changes done in https://github.com/os-autoinst/openQA/pull/1718

Related issues 2 (0 open — 2 closed)

Updated by szarate over 6 years ago

Description updated (diff)

Updated by EDiGiacinto over 6 years ago

Also https://github.com/os-autoinst/openQA/pull/1717 is related to it.

From IRC:

<mudler>that PR is hotpatched in osd already right? https://openqa.suse.de/tests/1917358#settings -> not sure, but looks like it's not catching more-than-one chained parents still 
<mudler>as one of the parent is uploading now, but it went running and missed asset
<mudler>(they were both running in parallel, now it failed the parent, for other reasons)
<coolo>mudler: what's also possible is that the blocked_by wasn't even calculated - depending how the job was created
<mudler>it's also somehow missing the parents.. i mean they should be two, no?
<foursixnine>coolo: I think blocked_by is depending on using isos post/job duplicate right?
<mudler>but if i read correctly, https://openqa.suse.de/tests/1913372#settings is the one that was posted
<mudler>and misses parents as well
<coolo>mudler: if it's missing parents, that's a completely different part then
<mudler>my point :) but i guess there are two bugs then, because even if the parent was in the DB it didn't waited for it
<coolo>I don't think there is a parent - so no waiting
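
To make the expectation from the log concrete, here is a minimal sketch of what a test for "more-than-one chained parents" could check. It is not openQA's implementation (openQA is written in Perl); the function name `blocking_ancestors`, the `chained_parents` mapping and the state strings are hypothetical and only illustrate the idea: a child should stay blocked until every chained ancestor is done, and a dependency loop that leads back to the job itself should be reported instead of looping forever.

```python
from typing import Dict, List, Set

DONE = "done"  # hypothetical state name, chosen for this sketch only


def blocking_ancestors(
    job_id: int,
    chained_parents: Dict[int, List[int]],
    states: Dict[int, str],
) -> Set[int]:
    """Return the ids of all chained ancestors of job_id that are not done.

    An empty set means nothing blocks the job. A ValueError is raised if
    following the parents leads back to job_id itself (a dependency loop).
    """
    blockers: Set[int] = set()
    seen: Set[int] = set()
    stack = list(chained_parents.get(job_id, []))
    while stack:
        parent = stack.pop()
        if parent == job_id:
            raise ValueError(f"job {job_id} depends on itself through a loop")
        if parent in seen:
            # Already visited; this also keeps loops among ancestors finite.
            continue
        seen.add(parent)
        if states.get(parent) != DONE:
            blockers.add(parent)
        stack.extend(chained_parents.get(parent, []))
    return blockers


# Job 3 has two chained parents; one of them is still uploading.
parents = {3: [1, 2]}
states = {1: "done", 2: "uploading", 3: "scheduled"}
print(blocking_ancestors(3, parents, states))  # {2}
```

In this sketch a job whose parents were never recorded gets an empty set and would run immediately, which matches coolo's remark that a missing parent means no waiting and is therefore a separate problem from the blocking logic itself.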

Updated by EDiGiacinto over 6 years ago

Related to action #32725: [tools] Scheduler job_grab/filter_jobs refactoring added

Updated by EDiGiacinto over 6 years ago

For the sake of reference: even with that PR, everything still seems pretty broken:

1) Child jobs no longer wait for their parents in certain cases. Still to bisect, but I believe this is a symptom of two different bugs in two different parts of the code, e.g. https://openqa.suse.de/tests/1917358

2) Stale jobs stuck in the running state forever. After a few days of having this in production, we are seeing bugs again that were already addressed in the previous scheduler logic, which had to cope with production loads (maybe we simplified too much here?). See e.g. https://openqa.suse.de/tests/1925778, but we had plenty of them showing 'State: running finished 3 days ago ( 00:03 minutes )' or similar.

3) Having a separate way to represent the cluster is a bit confusing now. The settings page shows something that is not consistent with what the scheduler actually considers, and this makes debugging even messier: you see one thing, the scheduler does another. From my point of view this is a big -1, as it makes things a lot more counterintuitive (just my opinion).

4) The tests are kind of bent to make them pass with the new scheduler logic, which makes me wonder whether this works as expected, as the core logic is not covered by unit tests.

IMHO this is kind of a no-go at this point.
