action #159168
closed
[openqa-in-openqa] Builds in openQA job group very broken since 2024-04-16
Added by okurz 8 months ago.
Updated 8 months ago.
Category: Regressions/Crashes
Description
Motivation
There are many failed jobs in the openQA builds due to issues with the PRODUCTDIR and CASEDIR folder structure.
Acceptance criteria
- AC1: Significantly lower number of new failed jobs (duh!)
Suggestions
- Status changed from New to In Progress
- Assignee set to mkittler
It looks good to me? What was (or perhaps still is) broken?
Oh, you mean the tests which are not passing anymore? I was looking for a problem with the page itself :-)
It looks like it depends on the worker: w25 and w26 are producing those incompletes, while w24 is not.
I cloned the same job twice to verify this:
openqa-clone-job --within-instance https://openqa.opensuse.org/tests/4089568 {BUILD,TEST}+=-poo159168 _GROUP=0 WORKER_CLASS+=,openqaworker24
- passes with PRODUCTDIR set to . by the openQA worker, which is how it should work
openqa-clone-job --within-instance https://openqa.opensuse.org/tests/4089568 {BUILD,TEST}+=-poo159168 _GROUP=0 WORKER_CLASS+=,openqaworker26
- incompleted with PRODUCTDIR set to products/openqa by the openQA worker, which of course doesn't exist in the Git checkout
On all workers /var/lib/openqa/cache/openqa.opensuse.org/tests/openqa/products/openqa/ is an existing directory containing the needles subdirectory, as expected. (And this was also definitely the case when I ran those test jobs.)
I'm currently looking at the code and I'm actually wondering more why it works on w24 than why it doesn't on w25/w26 (because the supposedly relevant piece of code is $vars->{PRODUCTDIR} //= … abs2rel($default_productdir, $default_casedir), which produced PRODUCTDIR=products/openqa).
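To illustrate how a default of products/openqa can come out of such a relative-path computation, here is a minimal Python sketch. It is not the worker's actual Perl code: the function name and both directory values are hypothetical, modelled on the share directory mentioned in this ticket, and the elided part of the original expression is not reproduced. os.path.relpath stands in for abs2rel, and the "only if unset" check mimics Perl's //= operator.

```python
import os.path


def apply_default_productdir(job_vars, default_productdir, default_casedir):
    """Set PRODUCTDIR only when it is unset, relative to the test case directory.

    Hypothetical helper for illustration; not part of openQA.
    """
    # Mimics Perl's `//=`: an already defined value is left untouched.
    if job_vars.get("PRODUCTDIR") is None:
        # relpath works on the path strings only; it does not check
        # whether either directory exists.
        job_vars["PRODUCTDIR"] = os.path.relpath(default_productdir, default_casedir)
    return job_vars


# Assumed defaults (hypothetical), based on the directory layout named in this ticket.
casedir = "/var/lib/openqa/share/tests/openqa"
productdir = casedir + "/products/openqa"

# Unset PRODUCTDIR: the default becomes the path relative to the case directory.
print(apply_default_productdir({}, productdir, casedir))

# PRODUCTDIR already set (here to "."): the existing value is kept.
print(apply_default_productdir({"PRODUCTDIR": "."}, productdir, casedir))
```

Under these assumptions the first call yields products/openqa, which only makes sense if that sub path also exists inside whatever CASEDIR the job actually uses. A plain Git checkout without that layout would not contain it.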
EDIT: On w24 /var/lib/openqa/share/tests doesn't exist. I don't see why this would make a difference (as caching is used, so no code paths using this directory should be executed), but I'll try unmounting NFS.
EDIT: Unmounting NFS did the trick. So somehow the worker is influenced by the presence of /var/lib/openqa/share/tests/…, breaking our use case where CASEDIR and NEEDLES_DIR are both Git repos.
- Status changed from In Progress to Feedback
I unmounted all NFS mounts on o3 workers as a short-term mitigation, but without the fix this only works by accident.
PR for the fix: https://github.com/os-autoinst/openQA/pull/5582
Note that we cannot just delete /var/lib/openqa/share/tests/openqa/products/openqa on o3. It would help but also break the needle editor for the openQA distribution at the same time.
- Priority changed from Urgent to High
- Subject changed from [openqa-in-openqa] https://openqa.opensuse.org/group_overview/24?limit_builds=50 very broken since 2024-04-16 to [openqa-in-openqa] Builds in openQA job group very broken since 2024-04-16 Mandatory trainings as of 2024-04
- Description updated (diff)
- Subject changed from [openqa-in-openqa] Builds in openQA job group very broken since 2024-04-16 Mandatory trainings as of 2024-04 to [openqa-in-openqa] Builds in openQA job group very broken since 2024-04-16
- Status changed from Feedback to Resolved
All PRs have been merged and https://openqa.opensuse.org/group_overview/24?limit_builds=50 still looks good.
I'll keep the NFS shares disabled because most workers had them disabled anyway. Additionally, this hasn't caused any trouble so far, and nobody complained about it when I mentioned it in #opensuse-factory.