action #159168
closed
[openqa-in-openqa] Builds in openQA job group very broken since 2024-04-16
Description
Motivation
There are a lot of failed jobs in the openQA builds due to issues with the PRODUCTDIR and CASEDIR folder structure.
Acceptance criteria
- AC1: Significantly lower number of new failed jobs (duh!)
Suggestions
- Investigate the cause of the failures e.g. nfs setup, git cloning, test variables
- Consider behavior when test code is only pulled in from git
- Resolve https://github.com/os-autoinst/openQA/pull/5582
Updated by mkittler 8 months ago · Edited
Looks like it depends on the worker, e.g. w25 and w26 are producing those incompletes and w24 not.
I cloned the same job twice to verify this:
- openqa-clone-job --within-instance https://openqa.opensuse.org/tests/4089568 {BUILD,TEST}+=-poo159168 _GROUP=0 WORKER_CLASS+=,openqaworker24
  - passes with PRODUCTDIR set to "." by the openQA worker, which is how it should work
- openqa-clone-job --within-instance https://openqa.opensuse.org/tests/4089568 {BUILD,TEST}+=-poo159168 _GROUP=0 WORKER_CLASS+=,openqaworker26
  - incompleted with PRODUCTDIR set to "products/openqa" by the openQA worker, which of course doesn't exist in the Git checkout
On all workers /var/lib/openqa/cache/openqa.opensuse.org/tests/openqa/products/openqa/ is an existing directory containing the needles sub directory as expected. (And this was also definitely the case when I ran those test jobs.)
I'm currently looking at the code and I'm actually wondering more why it works on w24 than why it doesn't on w25/w26, because the supposedly relevant piece of code is $vars->{PRODUCTDIR} //= … abs2rel($default_productdir, $default_casedir), which produced PRODUCTDIR=products/openqa.
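To make the defaulting behaviour above easier to follow, here is a minimal Python sketch of the same idea. It is not the worker's actual code: `apply_default_productdir` is a hypothetical helper, `posixpath.relpath` only stands in for Perl's `abs2rel`, the elided part of the original expression is ignored, and the casedir path used below is an assumption for illustration.

```python
import posixpath


def apply_default_productdir(test_vars, default_productdir, default_casedir):
    """Set PRODUCTDIR only if it is not already defined (mimics Perl's //=)."""
    if test_vars.get("PRODUCTDIR") is None:
        # posixpath.relpath is used here as a stand-in for abs2rel
        test_vars["PRODUCTDIR"] = posixpath.relpath(default_productdir, default_casedir)
    return test_vars


# Hypothetical casedir; the real default on a worker may differ.
casedir = "/var/lib/openqa/share/tests/openqa"

# Default product directory nested below the casedir
nested = apply_default_productdir({}, casedir + "/products/openqa", casedir)
print(nested["PRODUCTDIR"])  # products/openqa

# Default product directory identical to the casedir
same = apply_default_productdir({}, casedir, casedir)
print(same["PRODUCTDIR"])  # .

# An explicitly set value is left untouched
preset = apply_default_productdir({"PRODUCTDIR": "custom"}, casedir + "/products/openqa", casedir)
print(preset["PRODUCTDIR"])  # custom
```

Under these assumptions, a nested default yields the relative path products/openqa, which only resolves if the Git checkout used as CASEDIR actually contains that sub directory, while identical directories yield ".". The sketch does not explain why different workers end up with different defaults.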
EDIT: On w24 /var/lib/openqa/share/tests doesn't exist. I don't see why this would make a difference (as caching is used so no code paths using this directory should be executed) but I'll try unmounting nfs.
EDIT: Unmounting nfs did the trick. So somehow the worker is influenced by the presence of /var/lib/openqa/share/tests/…, breaking our use case where CASEDIR and NEEDLES_DIR are both Git repos.
Updated by mkittler 8 months ago
- Status changed from In Progress to Feedback
I unmounted all NFS mounts on the o3 workers as a short-term mitigation, but it only works by accident without them.
PR for the fix: https://github.com/os-autoinst/openQA/pull/5582
Note that we cannot just delete /var/lib/openqa/share/tests/openqa/products/openqa on o3. It would help but also break the needle editor for the openQA distribution at the same time.
Updated by livdywan 8 months ago
- Subject changed from [openqa-in-openqa] https://openqa.opensuse.org/group_overview/24?limit_builds=50 very broken since 2024-04-16 to [openqa-in-openqa] Builds in openQA job group very broken since 2024-04-16 Mandatory trainings as of 2024-04
- Description updated (diff)
Updated by mkittler 8 months ago
The PR has been merged but see my latest comments there and https://github.com/os-autoinst/os-autoinst/pull/2490.
Updated by mkittler 8 months ago
- Status changed from Feedback to Resolved
All PRs have been merged and https://openqa.opensuse.org/group_overview/24?limit_builds=50 still looks good.
I'll keep the NFS shares disabled because most workers had them disabled anyway. Additionally, this doesn't seem to have caused any trouble so far and nobody complained about it when I mentioned it in #opensuse-factory.