action #181895
opencoordination #162539: [saga][epic] future ideas version for version control features within openQA
coordination #181886: [epic] Good git defaults for new instances while keeping compatibility for old instances
openQA: script dump / load issues: loading job templates is broken
0%
Description
Observation¶
job templates are never dumped unless --group and --full are set, $options{group}
magically gets defined halfway through dump-templates
lruzicka reported to me today that template loading is broken in current openQA, and boy, did this turn out to be a rabbit hole.
The initial issue is actually simple enough. https://github.com/os-autoinst/openQA/pull/5973 broke job template loading completely, because it made openqa-load-templates invariably pass the data to the API as JSON, but the JobTemplate controller's create
method cannot work with JSON, it heavily uses Mojolicious validation stuff (via the validation
default helper that it has), and that does not work with JSON, only with "GET and POST parameters extracted from the query string and application/x-www-form-urlencoded or multipart/form-data message body", per https://docs.mojolicious.org/Mojolicious/Plugin/DefaultHelpers#validation . One way to fix this is simple enough - change openqa-load-templates to use form => \%param
for the JobTemplates table like it used to, not json => \%param
.
But then I got curious and wondered why the test suite never caught this, and that turns out to be a heck of a rabbit hole!
The answer turns out to be: the test suite works by doing dump-templates and then load-templates on the dumped data, but dump-templates has been broken for several years, it will never dump job templates. It's broken in two interesting ways.
The first relates to the --full
arg. This is meant to only be relevant if you also pass --group
; it's documented as "when dumping a specific job group also dump related test suites, products and machines". But during refactors over the years, this has broken. In handle_row
we do this:
if ($table eq 'JobTemplates') {
return undef if $options{group} && $r->{group_name} && !$options{group}->{$r->{group_name}};
return undef unless $options{product}->{product_key($r->{product})};
}
But the only code that ever sets $options{product}->{product_key($r->{product})}
is gated behind a check for $options{full}
, above:
if ($tables{JobTemplates} && $options{full}) {
for my $r (@{$result{JobTemplates}}) {
next if $options{group} && !$options{group}->{$r->{group_name}};
$options{test}->{$r->{test_suite}->{name}} = 1;
$options{machine}->{$r->{machine}->{name}} = 1;
$options{product}->{product_key($r->{product})} = 1;
}
}
...so we will never dump any JobTemplates unless --full
is passed. So I thought we could fix this by doing:
if ($table eq 'JobTemplates' && $options{group}) {
return undef if $r->{group_name} && !$options{group}->{$r->{group_name}};
return undef unless $options{product}->{product_key($r->{product})};
}
but even this does not work, for a surprising reason. If you do not pass --group
, $options{group}
is undefined (as you'd expect) until partway through the script, but after this bit, it is magically defined:
if ($tables{'JobGroups'}) {
my $group = (keys %{$options{group}})[0];
once that keys
call has happened, $options{group}
is defined even if it wasn't before. You can prove that with this test script:
#!/bin/perl
my %options;
defined $options{group} ? print "defined\n" : print "undefined\n";
keys %{$options{group}};
defined $options{group} ? print "defined\n" : print "undefined\n";
...so in fact, once we reach the meat of dump-templates, $options{group}
is always defined. This has the effect that we never dump job templates unless --group
and --full
are both passed, because we will always hit the check for $options{product}->{product_key($r->{product})}
in handle_row
(even with my fix), but we will never set it unless --group
is passed, because we'll always bail out of the block that tries to set it, here:
next if $options{group} && !$options{group}->{$r->{group_name}};
It looks like the call to keys
that introduces this breakage was added in 323246dc9f597c0efa137e910ad038594489ce83 , "Add YAML support to {dump,load}_templates".
So, in the test suite, when we dump the templates, we get a file with empty JobTemplates
, so we don't catch the 'loading job templates is broken' bug when we load it back, because we don't try and load any. The test suite unfortunately does not check that the dumped file contains everything it should, or that everything expected is loaded back (it only explicitly checks that the expected number of JobGroups got loaded).
I'm writing all this up because my brain is too frazzled to decide how to fix it today, so I have a reference to get started on it tomorrow.
Updated by gpathak 13 days ago
- Tags set to openQA, job templates, bug, regression
- Subject changed from script dump / load issues: loading job templates is broken, job templates are never dumped unless --group and --full are set, `$options{group}` magically gets defined halfway through dump-templates to openQA: script dump / load issues: loading job templates is broken
- Description updated (diff)
- Category set to Regressions/Crashes
- Target version set to Ready
Updated by AdamWill 12 days ago · Edited
Ugh, so this is getting fun. If I fix all the bugs, the 're-imported groups' test in 40-script_load_dump_templates.t unexpectedly starts failing with an error:
Job group 'opensuse' already exists. Use --update to modify existing entries. at /builddir/build/BUILD/openqa-5_20250430git10b1e43-build/openQA-10b1e43840bddcc9839be91ebb46fb422effe601/script/openqa-load-templates line 127, <> line 388.
That seems weird because the test intentionally clears all the database tables immediately before running the command:
# Clear the data in relevant tables
$schema->resultset($_)->delete for qw(Machines TestSuites Products JobTemplates JobGroups);
The problem seems to happen because job templates are loaded before job groups. I can't quite see how, but I think that somehow, now we actually have job templates to load, loading them somehow causes at least one job group to be created; so when we get to loading the job groups, we fail the check that it doesn't already exist. (At first I thought it was odd I'd never run into this in the real world, but then I realized this check is quite new - it was added in 97d34ec to fix https://progress.opensuse.org/issues/174808 ). I tested this by flipping the loading order, so we load job groups before job templates:
my @tables = (qw(Machines TestSuites Products JobGroups JobTemplates));
and now it doesn't fail that way any more, but it still fails. Now, somehow, the repeated dump doesn't give identical data:
# Failed test 'both dumps match'
# at t/40-script_load_dump_templates.t line 109.
# Structures begin differing at:
# $got->{JobTemplates}[0]{test_suite}{name} = 'textmode'
# $expected->{JobTemplates}[0]{test_suite}{name} = 'client1'
I assume this is just because we don't work hard enough to make sure the job templates array is always ordered the same in the output, or something, but I kinda ran out of time at this point. Will get back to it tomorrow.
Updated by AdamWill 11 days ago · Edited
oh, god, every time I think I have this fixed, something else rears up.
So...this kinda runs into the whole "job template YAML migration" thing, right? https://progress.opensuse.org/issues/44360 .
With that, we have two kinda-competing ways of defining job groups and job templates. Job groups can kinda encode job templates: a job group can include a YAML template string which defines the templates. When load-templates runs (sigh, the word 'templates' is now overloaded...), if a job group definition includes a templates string, then loading the job group also implicitly loads the templates it defines - the template string gets POSTed to the job_templates_scheduling
API endpoint, which is connected to the JobTemplate
controller's update
method, which reads the template string as YAML and loads the resulting templates.
However, you can still define and load job templates directly by posting to job_templates
, which is connected to the JobTemplate
controller's create
method, and load-templates still supports this; if the template file has a JobTemplates
object, it will be read and loaded. You can still load a template file containing JobGroups
with no YAML template strings, and some JobTemplates
, and that works fine, but gives you the 'deprecated' webUI Job Groups admin view (this is how things currently work in Fedora).
If a template file has both JobGroups
with YAML template strings and JobTemplates
, then load-templates will try and load both, which seems maybe surprising? As it currently stands, it tries to load the job templates first, then the job groups. Loading the templates seems to somehow implicitly create the groups they refer to, and then when it tries to load the groups, it chokes on the check added in 97d34ec . If I reverse this so it loads the groups first, then it will create the job templates they describe, then try to load the job templates explicitly defined in the templates file. If the two sets of templates match, then all the explicitly-defined ones just get skipped, and things are fine. I'm not sure what happens if the explicitly-defined templates are a superset of the ones defined in YAML in the groups - I suspect it'd die on the "Group foo must be updated through the YAML template" error check in JobTemplate
create
.
Due to the bugs I found at first, dump-templates currently almost never actually dumps any job templates, which goes a long way to hiding all of this if you only ever load template files created by dump-templates. But it does seem clearly intended to dump job templates, the issues that make it not do so really do look like bugs, nobody ever committed anything that said 'make dump-templates not write JobTemplates by default' or anything like that.
But it also unconditionally writes the YAML template strings, if it's dumping job group data at all, and the mechanism for this means you'll get a YAML template string in the output data even if there is no YAML template string in the job group as it currently exists on the server. That's because dump-templates GETs the job_templates_scheduling
endpoint, which is connected to JobTemplate
schedules
method, which winds up iterating over the groups in the DB and calling $group->to_yaml
on each, and that returns the existing YAML template string if there is one, but if there isn't one, it calls $self->to_template
which tries to auto-generate it from the existing legacy templates.
Now I've traced this all the way out, that feels...odd. Let's say you have a 'legacy' instance, and you run dump-templates on it. The fact that it unconditionally auto-generates and writes YAML template strings implies that this will be a 'conversion' operation - if you load the dumped templates, you'll be converting the instance to the 'modern' style with job groups containing YAML template strings. But in this case, why is it dumping the legacy templates at all? It doesn't make sense.
At this point I kinda feel like dump-templates should only ever dump 'legacy' data (groups with no template string, explicit templates) or 'modern' data (groups with template strings, no templates), and load-templates should only accept data in that format: it should intentionally exit if the data it's passed contains both groups with template strings and a JobTemplates object.
(Alternatively we could do a band-aid rip and give up on the 'legacy' format entirely, which would require Fedora and any other legacy deployments to convert).
Does anyone see any problems with this plan?
Updated by AdamWill 11 days ago · Edited
hmm, indeed, looking at https://github.com/os-autoinst/openQA/commit/323246dc9f597c0efa137e910ad038594489ce83 - which added the initial YAML support to dump/load-templates - I remember that before that, job groups weren't explicitly dumped or loaded at all. They were always implied by the templates. If a template referenced a group that didn't exist, it would be created. Other properties of job groups could only (and still can only) be set via the web UI, they don't exist so far as the template dump/load stuff is concerned.
So...I think we could just make this either/or in both dump and load. Job template and job group data should never co-exist in the same template file. dump-templates should only ever create files with one or the other, and load-templates should only ever load files with one or the other, and error out if you try and feed it a file with both.
ETA: ugh...do we possibly consider a 'hybrid' setup valid? where some groups are YAML-converted but some are not, and this is OK so long as they don't mix?
Updated by AdamWill 11 days ago
If anyone wants to peer down my rabbit hole, the git branch is https://github.com/AdamWill/openQA/tree/template-dump-load-bugs . I'm really hoping to get it to be a PR tomorrow.
Updated by AdamWill 10 days ago
okay, after another solid day slogging at this I have a pretty clear view of where I'm going at last. The current end state of my git branch is mostly good, but I want to rearrange it a bit - moving some of the later test changes earlier would avoid the need for the "load-templates: load job groups before job templates" commit from the middle of the series, now I have a better understanding of exactly why I got errors without it (it's because if a legacy job group has a job template with settings in it, we can dump them, but we cannot restore them via the legacy API).
I need to add an explicit test of dumping and reloading YAML templates with settings, and then there are two potential follow-ups I want to look at: making --clean
work for YAML templates (see https://progress.opensuse.org/issues/182063 ), and making the limit options in dump-templates (--group
, --test
, --machine
, --product
, --full
) make sense, because there's a lot that's broken about them now, I'm pretty sure. But I may send those as separate follow-up PRs. Not sure yet.
Updated by AdamWill 6 days ago
https://github.com/os-autoinst/openQA/pull/6446 addresses most of the issues discussed here.
Updated by okurz about 12 hours ago
https://github.com/os-autoinst/openQA/pull/6446 is merged since some days. @AdamWill In case you want to look into the rest of "most of the issues" be my guest to do it as part of this ticket. If you need help from others then I suggest another ticket with clear description of the still missing parts.