action #159168

closed

[openqa-in-openqa] Builds in openQA job group very broken since 2024-04-16

Added by okurz 8 months ago. Updated 8 months ago.

Status: Resolved
Priority: High
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2024-04-17
Due date:
% Done: 0%
Estimated time:

Description

Motivation

There are a lot of failed jobs in the openQA builds due to issues with the PRODUCTDIR and CASEDIR folder structure.

Acceptance criteria

  • AC1: Significantly lower number of new failed jobs (duh!)

Suggestions

Actions #1

Updated by mkittler 8 months ago

  • Status changed from New to In Progress
  • Assignee set to mkittler
Actions #2

Updated by mkittler 8 months ago

It looks good to me? What was (or perhaps still is) broken?

Actions #3

Updated by mkittler 8 months ago

Oh, you mean the tests which are not passing anymore? I was looking for a problem with the page itself :-)

Actions #4

Updated by mkittler 8 months ago · Edited

Looks like it depends on the worker, e.g. w25 and w26 are producing those incompletes and w24 not.

I cloned the same job twice to verify this:

  • openqa-clone-job --within-instance https://openqa.opensuse.org/tests/4089568 {BUILD,TEST}+=-poo159168 _GROUP=0 WORKER_CLASS+=,openqaworker24
    • passes with PRODUCTDIR set to . by the openQA worker which is how it should work
  • openqa-clone-job --within-instance https://openqa.opensuse.org/tests/4089568 {BUILD,TEST}+=-poo159168 _GROUP=0 WORKER_CLASS+=,openqaworker26
    • incompleted with PRODUCTDIR set to products/openqa by the openQA worker which of course doesn't exist in the Git checkout
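The comparison above can be repeated for any set of workers. The following is only a sketch: `clone_cmd` is a hypothetical helper that prints the command for one worker instead of running it, and `BUILD+=-poo159168 TEST+=-poo159168` is simply the brace expression `{BUILD,TEST}+=-poo159168` written out so that it does not depend on bash brace expansion.

```shell
#!/bin/sh
# Job URL and settings are copied from the two commands above.
job_url="https://openqa.opensuse.org/tests/4089568"

# clone_cmd (hypothetical helper): print the clone command that pins the
# job to one worker through its WORKER_CLASS. It does not run the command.
clone_cmd() {
    printf 'openqa-clone-job --within-instance %s BUILD+=-poo159168 TEST+=-poo159168 _GROUP=0 WORKER_CLASS+=,%s\n' \
        "$job_url" "$1"
}

# One known-good and one known-bad worker from the comment above.
for worker in openqaworker24 openqaworker26; do
    clone_cmd "$worker"
done
```

Printing the commands first makes it easy to review them before pasting them into a shell with access to the openQA instance.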

On all workers /var/lib/openqa/cache/openqa.opensuse.org/tests/openqa/products/openqa/ is an existing directory containing the needles subdirectory as expected. (And this was also definitely the case when I ran those test jobs.)

I'm currently looking at the code and I'm actually wondering more why it works on w24 than why it doesn't on w25/w26 (because the supposedly relevant piece of code is $vars->{PRODUCTDIR} //= … abs2rel($default_productdir, $default_casedir) which produced PRODUCTDIR=products/openqa).
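To illustrate how a relative PRODUCTDIR of products/openqa can come out of that expression, here is a minimal sketch. `rel_below` is a hypothetical, simplified stand-in for abs2rel that only handles a target at or below the base directory, and the two paths are assumed values for the defaults, not taken from the worker code.

```shell
#!/bin/sh
# rel_below BASE TARGET (hypothetical helper): print TARGET relative to BASE.
# Simplification: only handles TARGET equal to or below BASE; anything else
# is printed unchanged. The real abs2rel also handles "../" cases.
rel_below() {
    base=${1%/}
    target=${2%/}
    case "$target" in
        "$base") printf '.\n' ;;
        "$base"/*) printf '%s\n' "${target#"$base"/}" ;;
        *) printf '%s\n' "$target" ;;
    esac
}

# Assumed default directories, for illustration only.
default_casedir="/var/lib/openqa/share/tests/openqa"
default_productdir="$default_casedir/products/openqa"

rel_below "$default_casedir" "$default_productdir"   # prints products/openqa
rel_below "$default_casedir" "$default_casedir"      # prints .
```

A relative value like products/openqa is only useful if that path also exists below the actual CASEDIR, which is not the case when CASEDIR is a Git checkout without that directory.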

EDIT: On w24 /var/lib/openqa/share/tests doesn't exist. I don't see why this would make a difference (as caching is used so no code paths using this directory should be executed) but I'll try unmounting nfs.

EDIT: Unmounting nfs did the trick. So somehow the worker is influenced by the presence of /var/lib/openqa/share/tests/… breaking our use case where CASEDIR and NEEDLES_DIR are both Git repos.
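A quick way to tell the two kinds of workers apart is to check whether the share's tests directory is present. This is only a sketch: `has_share_tests` is a hypothetical helper, and the optional argument exists only so that the check can be tried against another directory.

```shell
#!/bin/sh
# has_share_tests [SHARE_DIR] (hypothetical helper): succeed if SHARE_DIR/tests
# is a directory. SHARE_DIR defaults to the path mentioned in this ticket.
has_share_tests() {
    [ -d "${1:-/var/lib/openqa/share}/tests" ]
}

if has_share_tests; then
    echo "share/tests present: jobs may behave like on w25/w26"
else
    echo "share/tests absent: jobs should behave like on w24"
fi
```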

Actions #5

Updated by mkittler 8 months ago

  • Status changed from In Progress to Feedback

I unmounted all NFS mounts on the o3 workers as a short-term mitigation, but without the mounts it only works by accident.

PR for the fix: https://github.com/os-autoinst/openQA/pull/5582

Note that we cannot just delete /var/lib/openqa/share/tests/openqa/products/openqa on o3. It would help but also break the needle editor for the openQA distribution at the same time.

Actions #6

Updated by mkittler 8 months ago

  • Priority changed from Urgent to High
Actions #7

Updated by livdywan 8 months ago

  • Subject changed from [openqa-in-openqa] https://openqa.opensuse.org/group_overview/24?limit_builds=50 very broken since 2024-04-16 to [openqa-in-openqa] Builds in openQA job group very broken since 2024-04-16 Mandatory trainings as of 2024-04
  • Description updated (diff)
Actions #8

Updated by mkittler 8 months ago

The PR has been merged but see my latest comments there and https://github.com/os-autoinst/os-autoinst/pull/2490.

Actions #9

Updated by okurz 8 months ago

  • Subject changed from [openqa-in-openqa] Builds in openQA job group very broken since 2024-04-16 Mandatory trainings as of 2024-04 to [openqa-in-openqa] Builds in openQA job group very broken since 2024-04-16
Actions #10

Updated by mkittler 8 months ago

  • Status changed from Feedback to Resolved

All PRs have been merged and https://openqa.opensuse.org/group_overview/24?limit_builds=50 still looks good.

I'll keep the NFS shares disabled because most workers had them disabled anyway. Additionally, this doesn't seem to have caused any trouble so far and nobody complained about it when I mentioned it in #opensuse-factory.
