coordination #67723

closed

coordination #58184: [saga][epic][use case] full version control awareness within openQA

coordination #80372: [epic] Cleanup vars.json as initial information container between openQA worker and isotovideo

[epic] Remote openQA worker fails to run tests from openqa-clone-custom-git-refspec

Added by ggardet_arm almost 4 years ago. Updated almost 3 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Feature requests
Target version:
Start date: 2021-03-18
Due date:
% Done: 100%
Estimated time: (Total: 24.00 h)

Description

Observation

Remote openQA worker fails to run tests from openqa-clone-custom-git-refspec, see: https://openqa.opensuse.org/tests/1287895

needles_dir not found: /var/lib/openqa/pool/2/os-autoinst-distri-opensuse/var/lib/openqa/cache/openqa1-opensuse/tests/opensuse/products/opensuse/needles (check vars.json?) at /usr/lib/os-autoinst/needle.pm line 330, <$fh> line 20.
[2020-06-04T09:10:50.934 UTC] [debug] terminating command server 19004 because test execution ended through exception
[2020-06-04T09:10:51.935 UTC] [debug] done with command server
18926: EXIT 1

Is there anything to set up?

Motivation

As a user of remote openQA worker instances I want to be able to use openqa-clone-custom-git-refspec as well

Acceptance criteria

  • AC1: openqa-clone-custom-git-refspec creates jobs with valid CASEDIR and valid NEEDLES_DIR if source job is a "remote worker"
  • AC2: openqa-clone-custom-git-refspec still creates jobs that find tests from specified git hash and needles for other workers

Suggestions

  • Crosscheck the command line with which openqa-clone-job would be or is called within openqa-clone-custom-git-refspec, both for normal workers and for the affected machines
  • As the paths given to openqa-clone-job seem valid, also check how openQA internally handles paths to generate the obviously wrong path "/var/lib/openqa/pool/2/os-autoinst-distri-opensuse/var/lib/openqa/cache/openqa1-opensuse/tests/opensuse/products/opensuse/needles"
  • Apply the fix either on the openQA side or in openqa-clone-custom-git-refspec

Subtasks 3 (0 open, 3 closed)

  • action #90290: Relative paths for CASEDIR and others as default to be not bound to specific workers (Resolved, okurz, 2021-03-18)
  • action #90293: Optional relative paths for CASEDIR and others to be not bound to specific workers (Resolved, Xiaojing_liu, 2021-03-18)
  • action #90302: Remote openQA worker fails to run tests from openqa-clone-custom-git-refspec due to differing paths (Resolved, Xiaojing_liu, 2021-03-18)

Related issues 3 (0 open, 3 closed)

  • Related to openQA Project - action #66658: Fail to use openqa-clone-custom-git-refspec to trigger a job which is also triggered by this script (Resolved, Xiaojing_liu, 2020-05-09)
  • Related to openQA Project - action #88482: Two absolute paths concatenated to form a default needle dir when PRODUCT_DIR/needles doesn't exist (Resolved, Xiaojing_liu, 2021-02-08)
  • Related to openQA Tests - action #89990: Error on tests/kernel/run_ltp.pm: Can't locate sle/tests/kernel/run_ltp.pm (Resolved, okurz, 2021-03-12)

Actions #1

Updated by okurz almost 4 years ago

  • Project changed from openQA Tests to openQA Project
  • Category set to Regressions/Crashes
  • Status changed from New to Feedback
  • Assignee set to okurz

/var/lib/openqa/pool/2/os-autoinst-distri-opensuse/var/lib/openqa/cache/openqa1-opensuse/tests/opensuse/products/opensuse/needles is obviously an incorrect path. Could it be that your clone source job was itself already created with openqa-clone-custom-git-refspec?

Actions #2

Updated by okurz over 3 years ago

  • Category changed from Regressions/Crashes to Support
  • Status changed from Feedback to Resolved

No response

Actions #3

Updated by ggardet_arm over 3 years ago

  • Status changed from Resolved to New

okurz wrote:

/var/lib/openqa/pool/2/os-autoinst-distri-opensuse/var/lib/openqa/cache/openqa1-opensuse/tests/opensuse/products/opensuse/needles is obviously an incorrect path. Could it be that your clone source job was itself already created with openqa-clone-custom-git-refspec?

I tried to clone a test which was the first run for the 20200704 snapshot and it failed. See: https://openqa.opensuse.org/t1326587

Actions #4

Updated by okurz over 3 years ago

  • Category changed from Support to Regressions/Crashes
  • Status changed from New to Workable
  • Assignee deleted (okurz)
  • Target version set to Ready
Actions #5

Updated by Xiaojing_liu over 3 years ago

I reproduced this issue in my local environment. The source job's PRODUCTDIR is /var/lib/openqa/cache/openqa1-opensuse/tests/opensuse/products/opensuse/ (it ran on a remote worker with caching enabled), so openqa-clone-custom-git-refspec sets the new job's NEEDLES_DIR to '/var/lib/openqa/cache/openqa1-opensuse/tests/opensuse/products/opensuse/needles' in this line: https://github.com/os-autoinst/openQA/blob/master/script/openqa-clone-custom-git-refspec#L126

In os-autoinst, when the needles_dir does not exist, it is changed to CASEDIR+NEEDLES_DIR: https://github.com/os-autoinst/os-autoinst/blob/master/needle.pm#L329
So in this situation, the needles_dir is /var/lib/openqa/pool/2/os-autoinst-distri-opensuse/var/lib/openqa/cache/openqa1-opensuse/tests/opensuse/products/opensuse/needles

I am wondering: do we need to pass the arguments NEEDLES_DIR and PRODUCTDIR when calling openqa-clone-job in openqa-clone-custom-git-refspec?
I do not have a good idea for fixing this ticket yet.
A workaround:
use openqa-clone-job ... CASEDIR=https://github.com/ggardet/os-autoinst-distri-opensuse.git#add_ai_ml _GROUP="0"

Actions #6

Updated by okurz over 3 years ago

  • Related to action #66658: Fail to use openqa-clone-custom-git-refspec to trigger a job which is also triggered by this script added
Actions #7

Updated by okurz over 3 years ago

  • Priority changed from Normal to Low

@Xiaojing_liu thank you for looking into this and also for giving a workaround suggestion. I am not sure which job you used as "source job".

@ggardet_arm it seems we are still missing something here. Could you please help us by providing better steps for reproduction, e.g. the command line you called the script with?

Actions #8

Updated by okurz over 3 years ago

  • Status changed from Workable to Feedback
  • Assignee set to ggardet_arm
Actions #9

Updated by Xiaojing_liu over 3 years ago

I reproduced it in my local environment, so both the source job and the cloned job are local jobs. Here are the links:
source job: http://10.67.19.103/tests/158 (worker: barry:1 is a remote worker)
clone job: http://10.67.19.103/tests/159

Actions #10

Updated by okurz over 3 years ago

  • Tags set to openqa-clone-custom-git-refspec
  • Description updated (diff)
  • Category changed from Regressions/Crashes to Feature requests
  • Status changed from Feedback to Workable
  • Assignee deleted (ggardet_arm)
  • Difficulty set to easy

Thanks. I updated the ticket description to be a workable feature request for us to cover remote workers as well.

Xiaojing_liu wrote:

So in this situation, the needles_dir is /var/lib/openqa/pool/2/os-autoinst-distri-opensuse/var/lib/openqa/cache/openqa1-opensuse/tests/opensuse/products/opensuse/needles

I am wondering: do we need to pass the arguments NEEDLES_DIR and PRODUCTDIR when calling openqa-clone-job in openqa-clone-custom-git-refspec?

Yes, openqa-clone-custom-git-refspec adds NEEDLES_DIR explicitly because we only check out the test data and rely on the needles to be present on the system already. For PRODUCTDIR we point to a relative dir within the checked out test distribution directory.

So far it seems to me as if the openqa-clone-job command line is correctly generated but within openQA the additional path /var/lib/openqa/pool/1/os-autoinst-distri-opensuse seems to be added in front.

Actions #11

Updated by okurz over 3 years ago

  • Due date set to 2020-09-25
  • Status changed from Workable to Feedback
  • Assignee set to okurz

@ggardet_arm can you please check again if this works now after https://github.com/os-autoinst/openQA/pull/3348 has been merged?

Actions #12

Updated by ggardet_arm over 3 years ago

okurz wrote:

@ggardet_arm can you please check again if this works now after https://github.com/os-autoinst/openQA/pull/3348 has been merged?

No, it does not work.
See: https://openqa.opensuse.org/tests/1384426#details created by:

./openqa-clone-custom-git-refspec https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/10944 https://openqa.opensuse.org/tests/1384351 WORKER_CLASS=aws
Actions #13

Updated by okurz over 3 years ago

  • Status changed from Feedback to Workable
  • Assignee deleted (okurz)

alright, thanks for testing.

Actions #14

Updated by Xiaojing_liu over 3 years ago

  • Status changed from Workable to In Progress
  • Assignee set to Xiaojing_liu
Actions #15

Updated by Xiaojing_liu over 3 years ago

  • Status changed from In Progress to Feedback

For now, if the source job has no NEEDLES_DIR and the user does not specify this parameter when calling openqa-clone-custom-git-refspec, the script sets NEEDLES_DIR to $old_product_dir/needles.
In this ticket, $old_product_dir is the cache directory /var/lib/openqa/cache/openqa1-opensuse/tests/opensuse/products/opensuse, because the cache service is enabled. But when the new job runs on another worker, the NEEDLES_DIR (/var/lib/openqa/cache/openqa1-opensuse/tests/opensuse/products/opensuse/needles) may not exist there (maybe because that worker does not enable the cache service, or because the openQA host is known under a different name than openqa1-opensuse on that worker). So, following the way os-autoinst handles needles_dir in https://github.com/os-autoinst/os-autoinst/blob/c9117f1b248d5a08fe3192d34826d0c42c40750b/needle.pm#L329

$needles_dir = File::Spec->catdir($bmwqemu::vars{CASEDIR}, $needles_dir) unless -d $needles_dir;

the needles_dir is wrong, just like the description in this ticket.
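
To make the failure mode concrete, here is a minimal Python sketch of the two steps described above: the clone script's default and the os-autoinst fallback. It is not the actual Perl code. The function names and the injectable `is_dir` check are hypothetical, and the CASEDIR value is inferred from the error message in the ticket description.

```python
import os
import posixpath


def default_needles_dir(old_product_dir: str) -> str:
    """Sketch of the clone script default: NEEDLES_DIR = <source PRODUCTDIR>/needles."""
    return posixpath.join(old_product_dir.rstrip("/"), "needles")


def effective_needles_dir(casedir: str, needles_dir: str, is_dir=os.path.isdir) -> str:
    """Sketch of the os-autoinst fallback: prefix CASEDIR unless needles_dir exists.

    Perl's File::Spec->catdir joins its arguments with "/" even when the
    second one is absolute, which is emulated here by plain concatenation.
    """
    if is_dir(needles_dir):
        return needles_dir
    return posixpath.normpath(casedir.rstrip("/") + "/" + needles_dir)


if __name__ == "__main__":
    # Values taken from this ticket; CASEDIR is inferred from the error message.
    casedir = "/var/lib/openqa/pool/2/os-autoinst-distri-opensuse"
    old_product_dir = "/var/lib/openqa/cache/openqa1-opensuse/tests/opensuse/products/opensuse/"
    needles = default_needles_dir(old_product_dir)
    # Pretend the cache path does not exist on the worker that runs the clone.
    print(effective_needles_dir(casedir, needles, is_dir=lambda path: False))
```

With a worker on which the cache path is missing, the sketch yields the concatenation of two absolute paths, i.e. the path from the error message.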

I also checked the jobs on OSD. It seems all workers there use the same cache directory configuration, /var/lib/openqa/cache/openqa.suse.de/tests/..., so we have not hit this problem on OSD.

In this scenario, we cannot find out the correct NEEDLES_DIR from the source job. But we could specify the NEEDLES_DIR as a workaround, such as

./openqa-clone-custom-git-refspec https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/11043 https://openqa.opensuse.org/tests/1399182 WORKER_CLASS=aws NEEDLES_DIR="/var/lib/openqa/share/tests/opensuse/products/opensuse/needles" 
Created job #1400225: opensuse-Tumbleweed-DVD-aarch64-Build20200918-mediacheck@aarch64 -> https://openqa.opensuse.org/t1400225

It worked, and the job https://openqa.opensuse.org/t1400225 passed.

@ggardet_arm @okurz WDYT?

Actions #16

Updated by okurz over 3 years ago

That's a good and correct explanation and a good workaround. For now I don't think we should do more than document this for the special case of test distributions with an extra needle repo cloned from a caching worker to a non-caching worker. How about we document it within the openQA docs or the help of openqa-clone-custom-git-refspec? To make the script smarter I think we should work on other tickets first, e.g. the full version control awareness.

Actions #17

Updated by Guillaume_G over 3 years ago

Thanks for the workaround. The problem is that when jobs are cloned and WORKER_CLASS is not overwritten, any aarch64 worker could be picked up and could fail due to this problem.
Maybe we should set up all workers to use the same path, as done on OSD?

Actions #18

Updated by okurz over 3 years ago

Yes. I guess for this you just need to ensure two things on the remote worker machines that you maintain:

  • Caching is enabled
  • The main o3 webui server can be reached by the hostname "openqa1-opensuse"

The relevant parts of the worker config:

$ ssh root@aarch64 cat /etc/openqa/workers.ini
[global]
HOST = http://openqa1-opensuse
WORKER_HOSTNAME = 192.168.112.3
CACHEDIRECTORY = /var/lib/openqa/cache
WORKER_CLASS = qemu_aarch64,qemu_aarch32,heavyload

[http://openqa1-opensuse]
TESTPOOLSERVER = rsync://openqa1-opensuse/tests
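
The two conditions above could be checked mechanically. The following Python sketch parses a workers.ini text and reports which condition is not met. The helper name `check_worker_config` and its messages are hypothetical; only the keys HOST and CACHEDIRECTORY come from the config shown above, and treating HOST as a whitespace-separated list is an assumption.

```python
import configparser
from typing import List


def check_worker_config(ini_text: str, expected_host: str) -> List[str]:
    """Return a list of problems found in a workers.ini text (empty if none)."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names case-sensitive (HOST, CACHEDIRECTORY)
    parser.read_string(ini_text)
    if not parser.has_section("global"):
        return ["no [global] section found"]
    settings = parser["global"]
    problems = []
    # Condition 1: caching is enabled, i.e. a cache directory is configured.
    if not settings.get("CACHEDIRECTORY", "").strip():
        problems.append("caching is not enabled: CACHEDIRECTORY is unset")
    # Condition 2: the web UI is addressed by the expected host name.
    if expected_host not in settings.get("HOST", "").split():
        problems.append(f"HOST does not list {expected_host}")
    return problems


if __name__ == "__main__":
    sample = "[global]\nHOST = http://openqa1-opensuse\nCACHEDIRECTORY = /var/lib/openqa/cache\n"
    print(check_worker_config(sample, "http://openqa1-opensuse"))
```

Such a check only covers the configuration text; whether the host name actually resolves and is reachable from the worker still has to be verified on the machine itself.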
Actions #19

Updated by ggardet_arm over 3 years ago

Is rsync available from Internet?

And I think it would be better to rename openqa1-opensuse to openqa.opensuse.org.

Actions #20

Updated by ggardet_arm over 3 years ago

ggardet_arm wrote:

Is rsync available from Internet?

It looks like it is not:

rsync: [Receiver] failed to connect to openqa.opensuse.org (2001:67c:2178:8::16): Connection timed out (110)
rsync: [Receiver] failed to connect to openqa.opensuse.org (195.135.221.140): Connection timed out (110)
rsync error: error in socket IO (code 10) at clientserver.c(137) [Receiver=3.2.3]

So, we cannot go this way.

Actions #21

Updated by Xiaojing_liu over 3 years ago

May I set this ticket to 'Resolved'?

Actions #22

Updated by okurz over 3 years ago

Hm, I don't think we have actually resolved the issue yet. For test distributions where needles are in a separate directory and the worker configs differ, e.g. some with caching, some without, we still run into problems.

@Xiaojing_liu do you think the acceptance criteria as currently specified are fulfilled?

I think we can improve the documentation to mention the limitations. Also we can try to handle this situation better, e.g. look in multiple alternative paths when looking for needles.
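
The idea of looking in multiple alternative paths could look roughly like the following Python sketch. It is only an illustration of the suggestion, not something os-autoinst is known to implement; the function name and the injectable `is_dir` check are hypothetical, and the two candidate paths are the cache and share locations mentioned earlier in this ticket.

```python
import os
from typing import Callable, Iterable, Optional


def first_existing_dir(
    candidates: Iterable[str],
    is_dir: Callable[[str], bool] = os.path.isdir,
) -> Optional[str]:
    """Return the first candidate that is an existing directory, else None."""
    for candidate in candidates:
        if is_dir(candidate):
            return candidate
    return None


if __name__ == "__main__":
    candidates = [
        # Location on a caching worker, as seen in this ticket
        "/var/lib/openqa/cache/openqa1-opensuse/tests/opensuse/products/opensuse/needles",
        # Location on a worker using the shared directory, as used in the workaround
        "/var/lib/openqa/share/tests/opensuse/products/opensuse/needles",
    ]
    print(first_existing_dir(candidates))
```

Returning None instead of silently concatenating paths would let the caller fail with a clear message that lists all locations that were tried.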

Actions #23

Updated by Xiaojing_liu over 3 years ago

  • Assignee deleted (Xiaojing_liu)

I have not worked on this for a long time, and I also missed the last message. Sorry for that. I am removing myself as assignee because I have no idea how to get a valid NEEDLES_DIR if the source job ran on a remote worker.

Actions #24

Updated by okurz over 3 years ago

  • Due date deleted (2020-09-25)
  • Status changed from Feedback to Workable

alright. Thanks for the update. Back to Workable.

Actions #25

Updated by okurz over 3 years ago

  • Priority changed from Low to Normal
Actions #26

Updated by Xiaojing_liu over 3 years ago

  • Due date set to 2020-11-27
  • Status changed from Workable to In Progress

I would like to try it again.

Actions #27

Updated by Xiaojing_liu over 3 years ago

  • Assignee set to Xiaojing_liu
Actions #29

Updated by okurz over 3 years ago

  • Parent task set to #58184
Actions #30

Updated by Xiaojing_liu over 3 years ago

  • Status changed from In Progress to Blocked
Actions #31

Updated by Xiaojing_liu over 3 years ago

  • Blocked by coordination #80372: [epic] Cleanup vars.json as initial information container between openQA worker and isotovideo added
Actions #32

Updated by Xiaojing_liu over 3 years ago

  • Due date deleted (2020-11-27)
Actions #33

Updated by okurz over 3 years ago

  • Status changed from Blocked to In Progress
  • Parent task changed from #58184 to #80372

As discussed in the daily: adding this to the epic as a subtask instead of "blocked by", and setting the ticket to "In Progress".

Actions #35

Updated by okurz over 3 years ago

Today multiple persons met and we discussed the ticket and various points around it. Important points to mention:

Ideas:

  • In every step, overwrite fewer or no variables; define new ones instead of overwriting input parameters
  • Use relative paths where possible, e.g. absolute PRJDIR or SHAREDIR but CASEDIR, PRODUCTDIR, NEEDLES_DIR relative to PRJDIR or SHAREDIR
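
A short Python sketch of the relative-path idea: the job carries a relative value, and each worker resolves it against its own base directory. The function name `resolve_setting` and the two base directories are assumptions for illustration (they are modelled on the share and cache paths seen in this ticket), and the sketch does not claim to match how openQA eventually implemented the feature.

```python
import posixpath


def resolve_setting(base_dir: str, value: str) -> str:
    """Resolve a job setting such as CASEDIR against a worker-specific base.

    URLs (e.g. a git CASEDIR) and absolute paths are returned unchanged;
    relative values are interpreted relative to base_dir.
    """
    if "://" in value or posixpath.isabs(value):
        return value
    return posixpath.normpath(posixpath.join(base_dir, value))


if __name__ == "__main__":
    # The same relative job settings resolve differently per worker.
    for base in ("/var/lib/openqa/share/tests", "/var/lib/openqa/cache/openqa1-opensuse/tests"):
        print(resolve_setting(base, "opensuse/products/opensuse"))
```

Because the job settings no longer contain a worker-specific prefix, a clone can run on a caching or a non-caching worker without the paths going stale.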
Actions #36

Updated by livdywan about 3 years ago

With https://github.com/os-autoinst/os-autoinst/pull/1606 merged which superseded previous PRs, is there more to be done here? @Xiaojing_liu

Actions #37

Updated by livdywan about 3 years ago

  • Due date set to 2021-01-22

Setting a due date since it doesn't seem like Ivan's script is catching it

Actions #38

Updated by Xiaojing_liu about 3 years ago

cdywan wrote:

With https://github.com/os-autoinst/os-autoinst/pull/1606 merged which superseded previous PRs, is there more to be done here? @Xiaojing_liu

There is an open PR which is waiting for feedback: https://github.com/os-autoinst/openQA/pull/3604

Actions #39

Updated by okurz about 3 years ago

  • Status changed from In Progress to Feedback
Actions #41

Updated by livdywan about 3 years ago

  • Due date changed from 2021-01-22 to 2021-01-29

For the record, I suggested in the chat to add DISTRI=mine BUILD=test CASEDIR=https://github.com/coolo/os-autoinst-distri-example.git#master (or a variation of it) as a unit test, and this could be added even before re-introducing the feature.

Actions #42

Updated by Xiaojing_liu about 3 years ago

Actions #43

Updated by livdywan about 3 years ago

  • Status changed from Feedback to In Progress
Actions #44

Updated by okurz about 3 years ago

We decided to merge the PR after adding further unit tests, as we did not find easy cases to test pre-merge. We then observed openQA "last_good_tests" investigation jobs ending up incomplete because they could not find the main schedule files, e.g.
openqa.opensuse.org/tests/1609706
On openqa.opensuse.org/tests?match=last_good_test I also saw the suspicion confirmed: every "last_good_tests" investigation job since this day (since the new version was deployed) is incomplete, except for the ones on the old aarch64 workers, which still have the old version.

Further notes from chat about the reason and our choices:

@Xiaojing_liu, 15:02
Oliver Kurz I checked the other incompletes; they have the same reason. Actually I am confused about the scenario. If we supply a git URL as CASEDIR and use the old PRODUCTDIR value, the main.pm in PRODUCTDIR will not make sense, right? I guess that is the reason why we supply PRODUCTDIR in openqa-clone-custom-git-refspec as well.

Oliver Kurz @okurz, 15:04
The reason why I supplied PRODUCTDIR in openqa-clone-custom-git-refspec initially was as a workaround, because it is needed. Let me check the openQA code for what is expected and what should be done in the future.

Oliver Kurz @okurz, 15:07
Right. https://github.com/os-autoinst/openQA/blob/1998b3cde8c85be95d8ddab6933d7877c9437670/lib/OpenQA/Utils.pm#L127 explains it: normally the user should not need to specify PRODUCTDIR because we have the line return $dir . "/products/$distri" if -e "$dir/products/$distri"; in https://github.com/os-autoinst/openQA/blob/1998b3cde8c85be95d8ddab6933d7877c9437670/lib/OpenQA/Utils.pm#L142 . And if someone wants to use CASEDIR= and keep DISTRI as before, there should be no need to specify PRODUCTDIR. Only if one wants to rename "DISTRI" to a value that would not correspond to "$casedir/products/$distri/main.pm" would PRODUCTDIR need to be set.
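
The default described in the chat can be paraphrased in Python as follows. Only the "products/$distri" check is taken from the quoted Perl line; falling back to the test case directory itself when that subdirectory is missing is an assumption, and the function name and the injectable `exists` check are hypothetical.

```python
import os


def default_productdir(casedir_path: str, distri: str, exists=os.path.exists) -> str:
    """Sketch of the PRODUCTDIR default derived from the test case directory.

    Mirrors: return $dir . "/products/$distri" if -e "$dir/products/$distri";
    The fallback to casedir_path itself is an assumption of this sketch.
    """
    candidate = f"{casedir_path}/products/{distri}"
    return candidate if exists(candidate) else casedir_path


if __name__ == "__main__":
    # Simulate a checkout that only contains products/opensuse.
    present = {"/var/lib/openqa/share/tests/opensuse/products/opensuse"}
    for distri in ("opensuse", "mine"):
        print(distri, default_productdir("/var/lib/openqa/share/tests/opensuse", distri, exists=present.__contains__))
```

In this simulation the renamed DISTRI ("mine") does not match any products subdirectory, which is the case where PRODUCTDIR would have to be set explicitly.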

Actions #45

Updated by Xiaojing_liu about 3 years ago

Please help to confirm those points:

  1. If users only supply CASEDIR=https://github.com/os-autoinst/os-autoinst-distri-opensuse.git#6f7510c5fc8d0c3c46a1d2154f5fc54296c22bd5, should we define PRODUCTDIR=/var/lib/openqa/share/tests/opensuse/products/opensuse? This is the current behavior. In this scenario, if the push modifies products/opensuse/main.pm, it doesn't work, right? Because the PRODUCTDIR is not under the CASEDIR.
  2. There is also an issue with openqa-clone-custom-git-refspec. If this script is used to trigger a new job (job's CASEDIR=opensuse, PRODUCTDIR=opensuse/products/opensuse), the NEEDLES_DIR will be set to opensuse/products/opensuse/needles, which is not correct. We could modify the script to set this value to needles, or modify the worker code to be compatible with it.
Actions #46

Updated by livdywan about 3 years ago

  • Due date changed from 2021-01-29 to 2021-02-12

Xiaojing_liu wrote:

Please help to confirm those points:

  1. If users only supply CASEDIR=https://github.com/os-autoinst/os-autoinst-distri-opensuse.git#6f7510c5fc8d0c3c46a1d2154f5fc54296c22bd5, should we define PRODUCTDIR=/var/lib/openqa/share/tests/opensuse/products/opensuse? This is the current behavior. In this scenario, if the push modifies products/opensuse/main.pm, it doesn't work, right? Because the PRODUCTDIR is not under the CASEDIR.

Are you talking about a new default for PRODUCTDIR? Would the default be relative to CASEDIR?

  2. There is also an issue with openqa-clone-custom-git-refspec. If this script is used to trigger a new job (job's CASEDIR=opensuse, PRODUCTDIR=opensuse/products/opensuse), the NEEDLES_DIR will be set to opensuse/products/opensuse/needles, which is not correct. We could modify the script to set this value to needles, or modify the worker code to be compatible with it.

I guess, same as above, we should make sure the defaults are well-defined. Either that or they have to be set at all times.

Actions #47

Updated by Xiaojing_liu about 3 years ago

Following note #44, I submitted a new PR: https://github.com/os-autoinst/openQA/pull/3712

Actions #48

Updated by livdywan about 3 years ago

  • Due date changed from 2021-02-12 to 2021-02-19

Let's assume Jane is enjoying the new year's holidays before getting back to this🧧🧨🐂🥮

Actions #49

Updated by livdywan about 3 years ago

  • Due date changed from 2021-02-19 to 2021-02-26
Actions #50

Updated by okurz about 3 years ago

  • Related to action #88482: Two absolute paths concatenated to form a default needle dir when PRODUCT_DIR/needles doesn't exist added
Actions #51

Updated by Xiaojing_liu about 3 years ago

  • Status changed from In Progress to Feedback

The pull request is marked acceptance-tests-needed, and a colleague helped deploy this patch to one worker: https://openqa.opensuse.org/admin/workers/260

See details in https://github.com/os-autoinst/openQA/pull/3712#issuecomment-784178554.
I have not spent much time on it, I just reviewed the results of that worker's jobs from time to time.

Since we are waiting for more test results, I prefer to set this ticket to Feedback.

Actions #52

Updated by okurz about 3 years ago

  • Due date changed from 2021-02-26 to 2021-03-04

Good choice! As the smaller os-autoinst change https://github.com/os-autoinst/os-autoinst/pull/1627 is still pending deployment on OSD, I would wait until after the deployment tomorrow and then merge the openQA change so that it comes in next.

Actions #53

Updated by livdywan about 3 years ago

  • Due date changed from 2021-03-04 to 2021-03-12

I didn't save the comment here, so once more:

We briefly discussed that https://github.com/os-autoinst/openQA/pull/3712 was tested enough, the os-autoinst branch was merged as well.

Unfortunately now the PR is stuck even 20 hours after review, trying to figure out what's wrong there now.

Actions #54

Updated by okurz about 3 years ago

  • Due date changed from 2021-03-12 to 2021-03-17

PR is merged. Was deployed on OSD by now. Given that with the previous iterations we had some problems I suggest to await any potential user feedback until next week before we check against the original request again if the feature is completed.

Actions #55

Updated by pcervinka about 3 years ago

  • Related to action #89990: Error on tests/kernel/run_ltp.pm: Can't locate sle/tests/kernel/run_ltp.pm added
Actions #56

Updated by okurz about 3 years ago

Regression found with #89990 . I will handle the rollback in #89990 . Can you try to understand the specific test requirements and include in your changes as well?

Further ideas for improvements:

  • Instead of enabling the new functionality by default, introduce an experimental feature switch so that we do not have to revert the source code, only the switch configuration
  • We should have an alert for the broken jobs caused by the change in behaviour -> Check the existing alert for "new incompletes"
Actions #57

Updated by livdywan about 3 years ago

From the weekly:

  • no alerts due to incomplete alerts not firing, this was expected
  • use feature switch approach similar to faster localhost uploads
  • enable via job setting (could be machine, job group, etc)
Actions #59

Updated by Xiaojing_liu about 3 years ago

  • Due date changed from 2021-03-17 to 2021-03-19

The PR for os-autoinst has been merged. After it is deployed on OSD and o3, and if there is no problem, we could consider restoring the PR on openQA.

Actions #60

Updated by okurz about 3 years ago

Yes, we should bring back the PR but only enable the changed behaviour based on a feature switch, ok?

Actions #61

Updated by Xiaojing_liu about 3 years ago

As discussed with @cdywan, we agreed with using a job setting to enable or disable this feature. Will commit a new pull request later.

Actions #63

Updated by okurz about 3 years ago

  • Tracker changed from action to coordination
  • Subject changed from Remote openQA worker fails to run tests from openqa-clone-custom-git-refspec to [epic] Remote openQA worker fails to run tests from openqa-clone-custom-git-refspec
  • Status changed from Feedback to Blocked
  • Assignee changed from Xiaojing_liu to okurz

As after this PR there are likely more steps also regarding the feature switch I know now how to switch this ticket better, e.g. see subtask #90293

Actions #64

Updated by okurz almost 3 years ago

  • Status changed from Blocked to Resolved

All subtasks resolved, both ACs covered.
