action #67723

coordination #58184: [saga][epic][use case] full version control awareness within openQA, e.g. user forks and branches, fully versioned test schedules and configuration settings

action #80372: [epic] Cleanup vars.json as initial information container between openQA worker and isotovideo

Remote openQA worker fails to run tests from openqa-clone-custom-git-refspec

Added by ggardet_arm 8 months ago. Updated about 1 month ago.

Status:
In Progress
Priority:
Normal
Assignee:
Category:
Feature requests
Target version:
Start date:
2020-06-04
Due date:
% Done:

0%

Estimated time:
Difficulty:
easy

Description

Observation

Remote openQA worker fails to run tests from openqa-clone-custom-git-refspec, see: https://openqa.opensuse.org/tests/1287895

needles_dir not found: /var/lib/openqa/pool/2/os-autoinst-distri-opensuse/var/lib/openqa/cache/openqa1-opensuse/tests/opensuse/products/opensuse/needles (check vars.json?) at /usr/lib/os-autoinst/needle.pm line 330, <$fh> line 20.
[2020-06-04T09:10:50.934 UTC] [debug] terminating command server 19004 because test execution ended through exception
[2020-06-04T09:10:51.935 UTC] [debug] done with command server
18926: EXIT 1

Is there anything to set up?

Motivation

As a user of remote openQA worker instances I want to be able to use openqa-clone-custom-git-refspec as well

Acceptance criteria

  • AC1: openqa-clone-custom-git-refspec creates jobs with valid CASEDIR and valid NEEDLES_DIR if source job is a "remote worker"
  • AC2: openqa-clone-custom-git-refspec still creates jobs that find tests from specified git hash and needles for other workers

Suggestions

  • Crosscheck the command line with which openqa-clone-job is (or would be) called within openqa-clone-custom-git-refspec, both for normal workers and for the affected machines
  • As the paths given to openqa-clone-job seem valid, also check how openQA internally handles paths so that it ends up with the obviously wrong path "/var/lib/openqa/pool/2/os-autoinst-distri-opensuse/var/lib/openqa/cache/openqa1-opensuse/tests/opensuse/products/opensuse/needles"
  • Apply the fix either on the openQA side or in openqa-clone-custom-git-refspec

Related issues

Related to openQA Project - action #66658: Fail to use openqa-clone-custom-git-refspec to trigger a job which is also triggered by this script (Resolved, 2020-05-09)

History

#1 Updated by okurz 8 months ago

  • Project changed from openQA Tests to openQA Project
  • Category set to Concrete Bugs
  • Status changed from New to Feedback
  • Assignee set to okurz

/var/lib/openqa/pool/2/os-autoinst-distri-opensuse/var/lib/openqa/cache/openqa1-opensuse/tests/opensuse/products/opensuse/needles is obviously an incorrect path. Could it be that your clone source job was itself created with openqa-clone-custom-git-refspec?

#2 Updated by okurz 7 months ago

  • Category changed from Concrete Bugs to Support
  • Status changed from Feedback to Resolved

No response

#3 Updated by ggardet_arm 7 months ago

  • Status changed from Resolved to New

okurz wrote:

/var/lib/openqa/pool/2/os-autoinst-distri-opensuse/var/lib/openqa/cache/openqa1-opensuse/tests/opensuse/products/opensuse/needles is obviously an incorrect path. Could it be that your clone source job was itself created with openqa-clone-custom-git-refspec?

I tried to clone a test which was the first run for the 20200704 snapshot and it failed. See: https://openqa.opensuse.org/t1326587

#4 Updated by okurz 6 months ago

  • Category changed from Support to Concrete Bugs
  • Status changed from New to Workable
  • Assignee deleted (okurz)
  • Target version set to Ready

#5 Updated by Xiaojing_liu 6 months ago

I reproduced this issue in my local environment. The source job's PRODUCTDIR is /var/lib/openqa/cache/openqa1-opensuse/tests/opensuse/products/opensuse/ (remote worker with caching enabled), so openqa-clone-custom-git-refspec sets the new job's NEEDLES_DIR to '/var/lib/openqa/cache/openqa1-opensuse/tests/opensuse/products/opensuse/needles' in this line: https://github.com/os-autoinst/openQA/blob/master/script/openqa-clone-custom-git-refspec#L126

In os-autoinst, when the needles_dir does not exist, it is replaced by CASEDIR joined with NEEDLES_DIR: https://github.com/os-autoinst/os-autoinst/blob/master/needle.pm#L329
So in this situation, the needle_dir is /var/lib/openqa/pool/2/os-autoinst-distri-opensuse/var/lib/openqa/cache/openqa1-opensuse/tests/opensuse/products/opensuse/needles
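
To make the path derivation above concrete, here is a minimal Python model of the fallback. It is only a sketch, not the os-autoinst code: the function names are hypothetical, catdir is a rough stand-in for Perl's File::Spec->catdir on Unix, and the CASEDIR value is inferred from the error message in the description.

```python
import posixpath


def catdir(*parts):
    """Rough stand-in for Perl's File::Spec->catdir on Unix.

    Unlike os.path.join, it does not discard earlier parts when a later
    part is absolute; it concatenates and then tidies duplicate slashes.
    """
    return posixpath.normpath("/".join(parts))


def resolve_needles_dir(casedir, needles_dir, is_dir=posixpath.isdir):
    """Model of the fallback: keep needles_dir if it exists, else prefix CASEDIR."""
    if is_dir(needles_dir):
        return needles_dir
    return catdir(casedir, needles_dir)


# Values taken from the ticket; CASEDIR is inferred from the error message.
casedir = "/var/lib/openqa/pool/2/os-autoinst-distri-opensuse"
needles_dir = (
    "/var/lib/openqa/cache/openqa1-opensuse"
    "/tests/opensuse/products/opensuse/needles"
)

# Pretend the cache directory does not exist on this worker.
print(resolve_needles_dir(casedir, needles_dir, is_dir=lambda path: False))
```

Because the absolute NEEDLES_DIR is appended to CASEDIR instead of replacing it, the result is the doubled path from the error message.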

I am wondering, do we need to give the arguments NEEDLES_DIR and PRODUCTDIR when calling openqa-clone-job in openqa-clone-custom-git-refspec?
I do not have a good idea yet for how to fix this ticket.
A workaround:
use openqa-clone-job ... CASEDIR=https://github.com/ggardet/os-autoinst-distri-opensuse.git#add_ai_ml _GROUP="0"

#6 Updated by okurz 6 months ago

  • Related to action #66658: Fail to use openqa-clone-custom-git-refspec to trigger a job which is also triggered by this script added

#7 Updated by okurz 6 months ago

  • Priority changed from Normal to Low

@Xiaojing_liu thank you for looking into this and also for giving a workaround suggestion. I am not sure which job you used as "source job".

ggardet_arm it seems we are still missing something here. Could you please help us by providing better steps for reproduction, e.g. the command line you called the script with?

#8 Updated by okurz 6 months ago

  • Status changed from Workable to Feedback
  • Assignee set to ggardet_arm

#9 Updated by Xiaojing_liu 6 months ago

I reproduced it in my local environment, so both the source job and the cloned job are local jobs. Here are the links:
source job: http://10.67.19.103/tests/158 (worker: barry:1 is a remote worker)
clone job: http://10.67.19.103/tests/159

#10 Updated by okurz 6 months ago

  • Tags set to openqa-clone-custom-git-refspec
  • Description updated (diff)
  • Category changed from Concrete Bugs to Feature requests
  • Status changed from Feedback to Workable
  • Assignee deleted (ggardet_arm)
  • Difficulty set to easy

Thanks. I updated the ticket description to be a workable feature request for us to cover remote workers as well.

Xiaojing_liu wrote:

So in this situation, the needle_dir is /var/lib/openqa/pool/2/os-autoinst-distri-opensuse/var/lib/openqa/cache/openqa1-opensuse/tests/opensuse/products/opensuse/needles

I am wondering, do we need to give the arguments NEEDLES_DIR and PRODUCTDIR when calling openqa-clone-job in openqa-clone-custom-git-refspec?

Yes, openqa-clone-custom-git-refspec adds NEEDLES_DIR explicitly because we only check out the test data and rely on the needles to be present on the system already. For PRODUCTDIR we point to a relative dir within the checked out test distribution directory.

So far it seems to me as if the openqa-clone-job command line is correctly generated but within openQA the additional path /var/lib/openqa/pool/1/os-autoinst-distri-opensuse seems to be added in front.

#11 Updated by okurz 5 months ago

  • Due date set to 2020-09-25
  • Status changed from Workable to Feedback
  • Assignee set to okurz

ggardet_arm can you please check again if this works now after https://github.com/os-autoinst/openQA/pull/3348 has been merged?

#12 Updated by ggardet_arm 5 months ago

okurz wrote:

ggardet_arm can you please check again if this works now after https://github.com/os-autoinst/openQA/pull/3348 has been merged?

No, it does not work.
See: https://openqa.opensuse.org/tests/1384426#details created by:

./openqa-clone-custom-git-refspec https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/10944 https://openqa.opensuse.org/tests/1384351 WORKER_CLASS=aws

#13 Updated by okurz 5 months ago

  • Status changed from Feedback to Workable
  • Assignee deleted (okurz)

alright, thanks for testing.

#14 Updated by Xiaojing_liu 4 months ago

  • Status changed from Workable to In Progress
  • Assignee set to Xiaojing_liu

#15 Updated by Xiaojing_liu 4 months ago

  • Status changed from In Progress to Feedback

Currently, if the source job has no NEEDLES_DIR and the user does not specify this parameter when calling openqa-clone-custom-git-refspec, the script sets NEEDLES_DIR to $old_product_dir/needles.
In this ticket, $old_product_dir is the cache directory /var/lib/openqa/cache/openqa1-opensuse/tests/opensuse/products/opensuse, because the cache service is enabled. But when the new job runs on another worker, the NEEDLES_DIR (/var/lib/openqa/cache/openqa1-opensuse/tests/opensuse/products/opensuse/needles) may not exist there, maybe because that worker does not enable the cache service or because it knows the openQA host under a different host name (not openqa1-opensuse). os-autoinst then handles needles_dir as follows: https://github.com/os-autoinst/os-autoinst/blob/c9117f1b248d5a08fe3192d34826d0c42c40750b/needle.pm#L329

$needles_dir = File::Spec->catdir($bmwqemu::vars{CASEDIR}, $needles_dir) unless -d $needles_dir;

As a result the needles_dir is wrong, just as described in this ticket.

I also checked the jobs on OSD: all workers seem to have the same cache directory configuration /var/lib/openqa/cache/openqa.suse.de/tests/..., so we have not hit this problem on OSD.

In this scenario, we cannot determine the correct NEEDLES_DIR from the source job. But we can specify NEEDLES_DIR explicitly as a workaround, for example:

./openqa-clone-custom-git-refspec https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/11043 https://openqa.opensuse.org/tests/1399182 WORKER_CLASS=aws NEEDLES_DIR="/var/lib/openqa/share/tests/opensuse/products/opensuse/needles" 
Created job #1400225: opensuse-Tumbleweed-DVD-aarch64-Build20200918-mediacheck@aarch64 -> https://openqa.opensuse.org/t1400225

This worked, and the job https://openqa.opensuse.org/t1400225 passed.
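
The way NEEDLES_DIR ends up being chosen, as described in this comment, can be sketched as follows. This is a Python model and not the script itself (which is a shell script); the function name is hypothetical, and the precedence of a user-supplied value over the source job's value is an assumption.

```python
import posixpath


def pick_needles_dir(user_settings, source_job_settings, old_product_dir):
    """Hypothetical model of how the cloned job's NEEDLES_DIR is selected."""
    # Assumption: an explicit NEEDLES_DIR=... on the command line wins.
    if user_settings.get("NEEDLES_DIR"):
        return user_settings["NEEDLES_DIR"]
    # Otherwise reuse the value of the source job, if it has one.
    if source_job_settings.get("NEEDLES_DIR"):
        return source_job_settings["NEEDLES_DIR"]
    # Fallback described in the ticket: $old_product_dir/needles
    return posixpath.join(old_product_dir, "needles")


old_product_dir = (
    "/var/lib/openqa/cache/openqa1-opensuse/tests/opensuse/products/opensuse"
)

# Without an override the cache path leaks into the new job.
print(pick_needles_dir({}, {}, old_product_dir))

# With the workaround the shared path is used instead.
override = {
    "NEEDLES_DIR": "/var/lib/openqa/share/tests/opensuse/products/opensuse/needles"
}
print(pick_needles_dir(override, {}, old_product_dir))
```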

ggardet_arm okurz WDYT?

#16 Updated by okurz 4 months ago

That's a good and correct explanation and a good workaround. For now I don't think we should do more than document this for the special case of test distributions with an extra needle repo that are cloned from a caching worker to a non-caching worker. How about we document it within the openQA docs or in the help of openqa-clone-custom-git-refspec? For a smarter solution I think we should work on other tickets first, e.g. the full version control awareness.

#17 Updated by Guillaume_G 4 months ago

Thanks for the workaround. The problem is that when jobs are cloned and WORKER_CLASS is not overwritten, any aarch64 worker could pick up the job and fail due to this problem.
Maybe we should set up all workers to use the same path, as done on OSD?

#18 Updated by okurz 4 months ago

Yes. I guess for this you just need to ensure two things on the remote worker machines that you maintain:

  • Enable caching
  • Ensure the main o3 webui server can be reached by the hostname "openqa1-opensuse"

These are the relevant parts of the worker config:

$ ssh root@aarch64 cat /etc/openqa/workers.ini
[global]
HOST = http://openqa1-opensuse
WORKER_HOSTNAME = 192.168.112.3
CACHEDIRECTORY = /var/lib/openqa/cache
WORKER_CLASS = qemu_aarch64,qemu_aarch32,heavyload

[http://openqa1-opensuse]
TESTPOOLSERVER = rsync://openqa1-opensuse/tests
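
As a rough illustration of why both settings matter, the following Python sketch derives the cache path for tests from such a config. The path layout (CACHEDIRECTORY, then the host name from HOST, then "tests") is taken from the paths quoted earlier in this ticket; the function name is hypothetical and a single HOST entry is assumed.

```python
import configparser
import posixpath
from urllib.parse import urlparse

WORKERS_INI = """\
[global]
HOST = http://openqa1-opensuse
WORKER_HOSTNAME = 192.168.112.3
CACHEDIRECTORY = /var/lib/openqa/cache
WORKER_CLASS = qemu_aarch64,qemu_aarch32,heavyload

[http://openqa1-opensuse]
TESTPOOLSERVER = rsync://openqa1-opensuse/tests
"""


def cached_tests_dir(ini_text):
    """Return the expected cache path for tests, or None if caching is off."""
    config = configparser.ConfigParser(interpolation=None)
    config.read_string(ini_text)
    settings = config["global"]
    cache_dir = settings.get("CACHEDIRECTORY")
    if not cache_dir:
        return None  # caching is not enabled on this worker
    # Assumption: HOST holds exactly one URL.
    host = urlparse(settings["HOST"]).hostname
    return posixpath.join(cache_dir, host, "tests")


print(cached_tests_dir(WORKERS_INI))
```

Two workers only agree on the resulting path if both enable caching and both reach the web UI under the same host name.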

#19 Updated by ggardet_arm 4 months ago

Is rsync available from the Internet?

And I think it would be better to rename openqa1-opensuse to openqa.opensuse.org.

#20 Updated by ggardet_arm 4 months ago

ggardet_arm wrote:

Is rsync available from the Internet?

It looks like it is not:

rsync: [Receiver] failed to connect to openqa.opensuse.org (2001:67c:2178:8::16): Connection timed out (110)
rsync: [Receiver] failed to connect to openqa.opensuse.org (195.135.221.140): Connection timed out (110)
rsync error: error in socket IO (code 10) at clientserver.c(137) [Receiver=3.2.3]

So, we cannot go this way.

#21 Updated by Xiaojing_liu 3 months ago

May I set this ticket to 'Resolved'?

#22 Updated by okurz 3 months ago

Hm, I don't think we have actually resolved the issue yet. In the case of test distributions where needles are in a separate directory and the worker configurations differ, e.g. some with caching and some without, we still run into problems.

@Xiaojing_liu do you think the acceptance criteria as currently specified are fulfilled?

I think we can improve the documentation to mention the limitations. Also we can try to handle this situation better, e.g. look in multiple alternative paths when looking for needles.
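
The idea of looking in multiple alternative paths could look roughly like the sketch below. It is written in Python for illustration only; the function name and the candidate list are hypothetical, with the two paths taken from earlier comments in this ticket.

```python
import os


def find_needles_dir(candidates, is_dir=os.path.isdir):
    """Return the first candidate that is an existing directory, else None."""
    for candidate in candidates:
        if is_dir(candidate):
            return candidate
    return None


# Hypothetical candidate list: the value from the job first, then the
# shared path that was used as a workaround in this ticket.
candidates = [
    "/var/lib/openqa/cache/openqa1-opensuse/tests/opensuse/products/opensuse/needles",
    "/var/lib/openqa/share/tests/opensuse/products/opensuse/needles",
]

# Prints the first path that exists on this machine, or None.
print(find_needles_dir(candidates))
```

Returning None instead of a guessed path keeps the failure explicit when no candidate exists.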

#23 Updated by Xiaojing_liu 3 months ago

  • Assignee deleted (Xiaojing_liu)

I have not worked on this for a long time and I also missed the last message. Sorry for that. I removed the assignee because I have no idea how to get a valid NEEDLES_DIR if the source job ran on a remote worker.

#24 Updated by okurz 3 months ago

  • Due date deleted (2020-09-25)
  • Status changed from Feedback to Workable

alright. Thanks for the update. Back to Workable.

#25 Updated by okurz 2 months ago

  • Priority changed from Low to Normal

#26 Updated by Xiaojing_liu 2 months ago

  • Due date set to 2020-11-27
  • Status changed from Workable to In Progress

I would like to try it again.

#27 Updated by Xiaojing_liu 2 months ago

  • Assignee set to Xiaojing_liu

#29 Updated by okurz about 2 months ago

  • Parent task set to #58184

#30 Updated by Xiaojing_liu about 2 months ago

  • Status changed from In Progress to Blocked

#31 Updated by Xiaojing_liu about 2 months ago

  • Blocked by action #80372: [epic] Cleanup vars.json as initial information container between openQA worker and isotovideo added

#32 Updated by Xiaojing_liu about 2 months ago

  • Due date deleted (2020-11-27)

#33 Updated by okurz about 2 months ago

  • Status changed from Blocked to In Progress
  • Parent task changed from #58184 to #80372

As discussed in the daily: adding the ticket to the epic as a subtask instead of "blocked by", and setting the ticket to "In Progress".

#35 Updated by okurz about 1 month ago

Today several of us met and discussed the ticket and various points around it. Important points to mention:

Ideas:

  • In every step, overwrite fewer or no variables; define new ones instead of overwriting input parameters
  • Use relative paths where possible, e.g. absolute PRJDIR or SHAREDIR but CASEDIR, PRODUCTDIR, NEEDLES_DIR relative to PRJDIR or SHAREDIR
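
Both ideas can be sketched together in a few lines of Python. This is only an illustration of the proposal, not existing openQA behaviour: the function name and the RESOLVED_ prefix are hypothetical, and the example values are made up from paths mentioned in this ticket.

```python
import posixpath

RELATIVE_KEYS = ("CASEDIR", "PRODUCTDIR", "NEEDLES_DIR")


def resolve_paths(settings):
    """Derive absolute paths as new variables, leaving the inputs untouched."""
    base = settings["PRJDIR"]
    resolved = {}
    for key in RELATIVE_KEYS:
        if key in settings:
            # posixpath.join keeps an already absolute value as it is.
            resolved["RESOLVED_" + key] = posixpath.join(base, settings[key])
    return resolved


job_settings = {
    "CASEDIR": "tests/opensuse",
    "PRODUCTDIR": "tests/opensuse/products/opensuse",
    "NEEDLES_DIR": "tests/opensuse/products/opensuse/needles",
}

# The same job settings resolve correctly on differently configured workers.
for prjdir in ("/var/lib/openqa/share", "/var/lib/openqa/cache/openqa1-opensuse"):
    print(resolve_paths({**job_settings, "PRJDIR": prjdir}))
```

With relative values in the job settings, only PRJDIR differs per worker, so a cloned job no longer carries another worker's cache path.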
