coordination #67723
[epic] Remote openQA worker fails to run tests from openqa-clone-custom-git-refspec
Description
Observation
Remote openQA worker fails to run tests from openqa-clone-custom-git-refspec, see: https://openqa.opensuse.org/tests/1287895
needles_dir not found: /var/lib/openqa/pool/2/os-autoinst-distri-opensuse/var/lib/openqa/cache/openqa1-opensuse/tests/opensuse/products/opensuse/needles (check vars.json?) at /usr/lib/os-autoinst/needle.pm line 330, <$fh> line 20.
[2020-06-04T09:10:50.934 UTC] [debug] terminating command server 19004 because test execution ended through exception
[2020-06-04T09:10:51.935 UTC] [debug] done with command server
18926: EXIT 1
Is there anything to set up?
Motivation
As a user of remote openQA worker instances I want to be able to use openqa-clone-custom-git-refspec as well
Acceptance criteria
- AC1: openqa-clone-custom-git-refspec creates jobs with valid CASEDIR and valid NEEDLES_DIR if source job is a "remote worker"
- AC2: openqa-clone-custom-git-refspec still creates jobs that find tests from specified git hash and needles for other workers
Suggestions
- Crosscheck the command line with which openqa-clone-job would be or is called within openqa-clone-custom-git-refspec in case of normal workers or the affected machines
- As the paths given to openqa-clone-job seem valid, also check how openQA internally handles paths to generate the obviously wrong path like "/var/lib/openqa/pool/2/os-autoinst-distri-opensuse/var/lib/openqa/cache/openqa1-opensuse/tests/opensuse/products/opensuse/needles"
- Either apply the fix on the openQA side or in openqa-clone-custom-git-refspec
Updated by okurz over 4 years ago
- Project changed from openQA Tests to openQA Project
- Category set to Regressions/Crashes
- Status changed from New to Feedback
- Assignee set to okurz
/var/lib/openqa/pool/2/os-autoinst-distri-opensuse/var/lib/openqa/cache/openqa1-opensuse/tests/opensuse/products/opensuse/needles is obviously an incorrect path. Could it be that your clone source job was already one created with openqa-clone-custom-git-refspec?
Updated by okurz over 4 years ago
- Category changed from Regressions/Crashes to Support
- Status changed from Feedback to Resolved
No response
Updated by ggardet_arm over 4 years ago
- Status changed from Resolved to New
okurz wrote:
/var/lib/openqa/pool/2/os-autoinst-distri-opensuse/var/lib/openqa/cache/openqa1-opensuse/tests/opensuse/products/opensuse/needles is obviously an incorrect path. Could it be that your clone source job was already one created with openqa-clone-custom-git-refspec?
I tried to clone a test which was the first run for the 20200704 snapshot and it failed. See: https://openqa.opensuse.org/t1326587
Updated by okurz about 4 years ago
- Category changed from Support to Regressions/Crashes
- Status changed from New to Workable
- Assignee deleted (okurz)
- Target version set to Ready
Updated by Xiaojing_liu about 4 years ago
I reproduced this issue in my local environment. Because the cloned source job's PRODUCTDIR is /var/lib/openqa/cache/openqa1-opensuse/tests/opensuse/products/opensuse/ (remote worker with cache enabled), openqa-clone-custom-git-refspec sets the new job's NEEDLES_DIR to '/var/lib/openqa/cache/openqa1-opensuse/tests/opensuse/products/opensuse/needles' in this line: https://github.com/os-autoinst/openQA/blob/master/script/openqa-clone-custom-git-refspec#L126
In os-autoinst, when the needles_dir does not exist, it is changed to CASEDIR+NEEDLES_DIR, see https://github.com/os-autoinst/os-autoinst/blob/master/needle.pm#L329.
So in this situation, the needles_dir is /var/lib/openqa/pool/2/os-autoinst-distri-opensuse/var/lib/openqa/cache/openqa1-opensuse/tests/opensuse/products/opensuse/needles
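The path construction described in this comment can be reproduced with a minimal Python sketch. It only mimics the directory check and the concatenation referenced in needle.pm; the function name `resolve_needles_dir` and the `existing_dirs` parameter are hypothetical stand-ins for illustration, and the real logic is the Perl code in os-autoinst.

```python
import posixpath


def resolve_needles_dir(casedir, needles_dir, existing_dirs):
    """Mimic the os-autoinst fallback: if needles_dir is not an existing
    directory, prepend CASEDIR to it.

    existing_dirs is a hypothetical stand-in for the worker's file system.
    """
    if needles_dir in existing_dirs:
        return needles_dir
    # Perl's File::Spec->catdir concatenates both parts even when the second
    # one is absolute, so emulate that with plain string concatenation
    # (posixpath.join would discard casedir here).
    return posixpath.normpath(casedir + "/" + needles_dir)


casedir = "/var/lib/openqa/pool/2/os-autoinst-distri-opensuse"
needles_dir = "/var/lib/openqa/cache/openqa1-opensuse/tests/opensuse/products/opensuse/needles"

# On a worker where the cache path does not exist, the two absolute paths
# end up glued together, as in the error message of this ticket.
print(resolve_needles_dir(casedir, needles_dir, existing_dirs=set()))
```

If the cache path does exist on the worker (pass it in `existing_dirs`), the function returns it unchanged, which matches the observation that the problem only shows up on some workers.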
I am wondering, do we need to give the arguments NEEDLES_DIR and PRODUCTDIR when calling openqa-clone-job in openqa-clone-custom-git-refspec?
I have not got a good idea to fix this ticket.
A workaround:
use openqa-clone-job ... CASEDIR=https://github.com/ggardet/os-autoinst-distri-opensuse.git#add_ai_ml _GROUP="0"
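In the workaround, CASEDIR is a git URL whose `#` fragment names the ref to check out (a branch here; a later comment uses a commit hash the same way). A minimal Python sketch of splitting such a value, assuming the fragment is always the ref; `split_casedir` is a hypothetical helper and not part of openQA:

```python
from urllib.parse import urldefrag


def split_casedir(casedir):
    """Split a CASEDIR git URL into (repository URL, ref or None)."""
    url, ref = urldefrag(casedir)
    return url, (ref or None)


repo, ref = split_casedir(
    "https://github.com/ggardet/os-autoinst-distri-opensuse.git#add_ai_ml"
)
print(repo)
print(ref)
```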
Updated by okurz about 4 years ago
- Related to action #66658: Fail to use openqa-clone-custom-git-refspec to trigger a job which is also triggered by this script added
Updated by okurz about 4 years ago
- Priority changed from Normal to Low
@Xiaojing_liu thank you for looking into this and also for giving a workaround suggestion. I am not sure which job you used as "source job".
@ggardet_arm it seems we are still missing something here. Could you help us please to provide better steps for reproduction, e.g. the command line you called the script with.
Updated by okurz about 4 years ago
- Status changed from Workable to Feedback
- Assignee set to ggardet_arm
Updated by Xiaojing_liu about 4 years ago
I reproduced it in my local environment, so both the source job and the cloned job are local jobs. Here are the links:
source job: http://10.67.19.103/tests/158 (worker: barry:1 is a remote worker)
clone job: http://10.67.19.103/tests/159
Updated by okurz about 4 years ago
- Tags set to openqa-clone-custom-git-refspec
- Description updated (diff)
- Category changed from Regressions/Crashes to Feature requests
- Status changed from Feedback to Workable
- Assignee deleted (ggardet_arm)
- Difficulty set to easy
Thanks. I updated the ticket description to be a workable feature request for us to cover remote workers as well.
Xiaojing_liu wrote:
So in this situation, the needles_dir is /var/lib/openqa/pool/2/os-autoinst-distri-opensuse/var/lib/openqa/cache/openqa1-opensuse/tests/opensuse/products/opensuse/needles
I am wondering, do we need to give the arguments NEEDLES_DIR and PRODUCTDIR when calling openqa-clone-job in openqa-clone-custom-git-refspec?
Yes, openqa-clone-custom-git-refspec adds NEEDLES_DIR explicitly because we only check out the test data and rely on the needles to be present on the system already. For PRODUCTDIR we point to a relative dir within the checked out test distribution directory.
So far it seems to me as if the openqa-clone-job command line is correctly generated but within openQA the additional path /var/lib/openqa/pool/1/os-autoinst-distri-opensuse seems to be added in front.
Updated by okurz about 4 years ago
- Due date set to 2020-09-25
- Status changed from Workable to Feedback
- Assignee set to okurz
@ggardet_arm can you please check again if this works now after https://github.com/os-autoinst/openQA/pull/3348 has been merged?
Updated by ggardet_arm about 4 years ago
okurz wrote:
@ggardet_arm can you please check again if this works now after https://github.com/os-autoinst/openQA/pull/3348 has been merged?
No, it does not work.
See: https://openqa.opensuse.org/tests/1384426#details created by:
./openqa-clone-custom-git-refspec https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/10944 https://openqa.opensuse.org/tests/1384351 WORKER_CLASS=aws
Updated by okurz about 4 years ago
- Status changed from Feedback to Workable
- Assignee deleted (okurz)
alright, thanks for testing.
Updated by Xiaojing_liu about 4 years ago
- Status changed from Workable to In Progress
- Assignee set to Xiaojing_liu
Updated by Xiaojing_liu about 4 years ago
- Status changed from In Progress to Feedback
For now, if the source job has no NEEDLES_DIR and users do not specify this parameter when using openqa-clone-custom-git-refspec, the script sets NEEDLES_DIR to $old_product_dir/needles.
In this ticket, $old_product_dir is a cache directory, /var/lib/openqa/cache/openqa1-opensuse/tests/opensuse/products/opensuse, because the cache service is enabled. But when the new job runs on another worker, the NEEDLES_DIR (/var/lib/openqa/cache/openqa1-opensuse/tests/opensuse/products/opensuse/needles) may not exist (maybe because the worker does not enable the cache service, or because the openQA host has a different name (not openqa1-opensuse) on this worker), so following the way os-autoinst deals with needles_dir: https://github.com/os-autoinst/os-autoinst/blob/c9117f1b248d5a08fe3192d34826d0c42c40750b/needle.pm#L329
$needles_dir = File::Spec->catdir($bmwqemu::vars{CASEDIR}, $needles_dir) unless -d $needles_dir;
the needles_dir is wrong, just like the description in this ticket.
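The default described in this comment can be sketched in Python as follows. The function name `default_needles_dir` and the two dictionaries are hypothetical; the lookup order between the user's value and the source job's value is an assumption for this sketch, since the comment only states that the fallback applies when neither is set. The actual script is a shell script in the openQA repository.

```python
def default_needles_dir(source_job_vars, user_overrides):
    """Pick NEEDLES_DIR for the cloned job.

    Order assumed for this sketch: explicit user value, then the source
    job's own NEEDLES_DIR, then "<old PRODUCTDIR>/needles".
    """
    if "NEEDLES_DIR" in user_overrides:
        return user_overrides["NEEDLES_DIR"]
    if "NEEDLES_DIR" in source_job_vars:
        return source_job_vars["NEEDLES_DIR"]
    old_product_dir = source_job_vars["PRODUCTDIR"].rstrip("/")
    return old_product_dir + "/needles"


# Source job ran on a caching worker, so its PRODUCTDIR is a cache path.
source = {"PRODUCTDIR": "/var/lib/openqa/cache/openqa1-opensuse/tests/opensuse/products/opensuse/"}

# Without an override the cache path leaks into the cloned job.
print(default_needles_dir(source, {}))

# An explicit NEEDLES_DIR on the command line takes its place.
override = {"NEEDLES_DIR": "/var/lib/openqa/share/tests/opensuse/products/opensuse/needles"}
print(default_needles_dir(source, override))
```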
I also checked the jobs on OSD; it seems all workers have the same configuration for the cache directory /var/lib/openqa/cache/openqa.suse.de/tests/..., so we have not met this problem on OSD.
In this scenario, we cannot find out the correct NEEDLES_DIR from the source job. But we could specify the NEEDLES_DIR as a workaround, such as
./openqa-clone-custom-git-refspec https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/11043 https://openqa.opensuse.org/tests/1399182 WORKER_CLASS=aws NEEDLES_DIR="/var/lib/openqa/share/tests/opensuse/products/opensuse/needles"
Created job #1400225: opensuse-Tumbleweed-DVD-aarch64-Build20200918-mediacheck@aarch64 -> https://openqa.opensuse.org/t1400225
It also worked, and the job https://openqa.opensuse.org/t1400225 passed.
@ggardet_arm @okurz WDYT?
Updated by okurz about 4 years ago
That's a good and correct explanation and a good workaround. For now I don't think we should do more than document this for the special case of test distributions with extra needle repo cloned from caching worker to non-caching worker. How about we document within the openQA docs or the help of openqa-clone-custom-git-refspec? For a better intelligence I think we should work on other tickets first, e.g. the full version control awareness
Updated by Guillaume_G about 4 years ago
Thanks for the workaround. The problem is that when jobs are cloned and WORKER_CLASS is not overwritten, any aarch64 worker could be picked up and could fail due to this problem.
Maybe we should set up all workers to use the same path, as done in osd?
Updated by okurz about 4 years ago
Yes. I guess for this you just need to ensure two things on the remote worker machines that you maintain:
- Enable caching
- Ensure the main o3 webui server can be reached by the hostname "openqa1-opensuse"
To show the relevant parts from the worker config
$ ssh root@aarch64 cat /etc/openqa/workers.ini
[global]
HOST = http://openqa1-opensuse
WORKER_HOSTNAME = 192.168.112.3
CACHEDIRECTORY = /var/lib/openqa/cache
WORKER_CLASS = qemu_aarch64,qemu_aarch32,heavyload
[http://openqa1-opensuse]
TESTPOOLSERVER = rsync://openqa1-opensuse/tests
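As a rough illustration of why both settings matter, the following Python sketch derives the per-host cache directory from such a config. It assumes, based on the paths quoted in this ticket, that the cache lives under CACHEDIRECTORY/<host name>, and that HOST holds a single URL with a scheme; `cache_dir_for_host` is a hypothetical helper, not an openQA function.

```python
import configparser
import posixpath
from urllib.parse import urlsplit

SAMPLE = """\
[global]
HOST = http://openqa1-opensuse
CACHEDIRECTORY = /var/lib/openqa/cache
WORKER_CLASS = qemu_aarch64,qemu_aarch32,heavyload

[http://openqa1-opensuse]
TESTPOOLSERVER = rsync://openqa1-opensuse/tests
"""


def cache_dir_for_host(ini_text):
    """Return the per-host cache directory, or None if caching is not set up."""
    # Only "=" separates keys from values, so URLs with ":" stay intact.
    parser = configparser.ConfigParser(delimiters=("=",))
    parser.read_string(ini_text)
    cache_root = parser.get("global", "CACHEDIRECTORY", fallback=None)
    host = parser.get("global", "HOST", fallback=None)
    if not cache_root or not host:
        return None
    # Assumption: HOST is one URL including a scheme, e.g. http://name
    return posixpath.join(cache_root, urlsplit(host).hostname)


print(cache_dir_for_host(SAMPLE))
```

Two workers only agree on the needles path recorded in a job if this derived directory is the same on both, which is what the two points above aim for.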
Updated by ggardet_arm about 4 years ago
Is rsync available from the Internet?
And I think it would be better to rename openqa1-opensuse to openqa.opensuse.org.
Updated by ggardet_arm about 4 years ago
ggardet_arm wrote:
Is rsync available from the Internet?
It looks like it is not:
rsync: [Receiver] failed to connect to openqa.opensuse.org (2001:67c:2178:8::16): Connection timed out (110)
rsync: [Receiver] failed to connect to openqa.opensuse.org (195.135.221.140): Connection timed out (110)
rsync error: error in socket IO (code 10) at clientserver.c(137) [Receiver=3.2.3]
So, we cannot go this way.
Updated by okurz almost 4 years ago
Hm, I don't think we have actually resolved the issue yet. In the case of test distributions where needles are in a separate directory and we have differing workers config, e.g. some with caching, some without, we still run into problems.
@Xiaojing_liu do you think the acceptance criteria as currently specified are fulfilled?
I think we can improve the documentation to mention the limitations. Also we can try to handle this situation better, e.g. look in multiple alternative paths when looking for needles.
Updated by Xiaojing_liu almost 4 years ago
- Assignee deleted (Xiaojing_liu)
I have not worked on it for a long time and I also missed the last message. Sorry for that. I am removing the assignee because I have no idea how to get a valid NEEDLES_DIR if the source job ran on a remote worker.
Updated by okurz almost 4 years ago
- Due date deleted (2020-09-25)
- Status changed from Feedback to Workable
alright. Thanks for the update. Back to Workable.
Updated by Xiaojing_liu almost 4 years ago
- Due date set to 2020-11-27
- Status changed from Workable to In Progress
I would like to try it again.
Updated by Xiaojing_liu almost 4 years ago
- Status changed from In Progress to Blocked
Updated by Xiaojing_liu almost 4 years ago
- Blocked by coordination #80372: [epic] Cleanup vars.json as initial information container between openQA worker and isotovideo added
Updated by okurz almost 4 years ago
- Status changed from Blocked to In Progress
- Parent task changed from #58184 to #80372
as discussed in daily adding to the epic as subtask, not blocked by, and setting ticket to "In Progress"
Updated by okurz almost 4 years ago
Today multiple persons met and we discussed the ticket and various points around it. Important points to mention:
- https://github.com/os-autoinst/os-autoinst/blob/master/doc/backend_vars.asciidoc describes the current test variables that are read and written by isotovideo
- https://github.com/os-autoinst/openQA/blob/master/docs/Installing.asciidoc#terms-and-variables-for-certain-directories-used-by-openqa-and-isotovideo has an overview of path related variables from openQA side
- openQA-in-openQA tests like https://openqa.opensuse.org/tests/1506935# using https://github.com/os-autoinst/os-autoinst-distri-openQA/ show an example of a test distribution that has main.pm in the root folder, so we cannot strictly set productdir=$casedir/products/$distri
- We need to keep in mind that openQA has job settings stored in the database, which are also used by the clone script. The worker updates vars.json, overwrites some variables. os-autoinst uses vars.json and also overwrites some variables and adds more variables to vars.json. vars.json is also used by openqa-clone-custom-git-refspec
Ideas:
- In every step overwrite fewer or no variables; only define new ones instead of overwriting input parameters
- Use relative paths where possible, e.g. absolute PRJDIR or SHAREDIR but CASEDIR, PRODUCTDIR, NEEDLES_DIR relative to PRJDIR or SHAREDIR
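The relative-path idea can be illustrated with a small Python sketch. It is not how openQA resolves paths; `resolve_job_paths` and the example base directories are assumptions chosen to match the share and cache paths quoted in this ticket, and git URLs as CASEDIR are deliberately not handled.

```python
import posixpath

PATH_KEYS = ("CASEDIR", "PRODUCTDIR", "NEEDLES_DIR")


def resolve_job_paths(base_dir, settings):
    """Resolve relative path settings against base_dir; keep absolute ones."""
    resolved = {}
    for key in PATH_KEYS:
        if key not in settings:
            continue
        value = settings[key]
        if posixpath.isabs(value):
            resolved[key] = value
        else:
            resolved[key] = posixpath.normpath(posixpath.join(base_dir, value))
    return resolved


relative_settings = {
    "CASEDIR": "tests/opensuse",
    "PRODUCTDIR": "tests/opensuse/products/opensuse",
    "NEEDLES_DIR": "tests/opensuse/products/opensuse/needles",
}

# The same job settings resolve to the shared directory on one worker ...
on_share = resolve_job_paths("/var/lib/openqa/share", relative_settings)
# ... and to the per-host cache directory on a caching worker.
on_cache = resolve_job_paths("/var/lib/openqa/cache/openqa1-opensuse", relative_settings)
print(on_share["NEEDLES_DIR"])
print(on_cache["NEEDLES_DIR"])
```

With only the base directory being worker-specific, a cloned job would no longer carry an absolute path that is valid on the source worker alone.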
Updated by livdywan over 3 years ago
With https://github.com/os-autoinst/os-autoinst/pull/1606 merged which superseded previous PRs, is there more to be done here? @Xiaojing_liu
Updated by livdywan over 3 years ago
- Due date set to 2021-01-22
Setting a due date since it doesn't seem like Ivan's script is catching it
Updated by Xiaojing_liu over 3 years ago
cdywan wrote:
With https://github.com/os-autoinst/os-autoinst/pull/1606 merged which superseded previous PRs, is there more to be done here? @Xiaojing_liu
https://github.com/os-autoinst/openQA/pull/3604 There is an open PR which is waiting for feedback.
Updated by okurz over 3 years ago
- Status changed from In Progress to Feedback
Updated by okurz over 3 years ago
https://github.com/os-autoinst/openQA/pull/3604 was merged and reverted again in https://github.com/os-autoinst/openQA/pull/3693 after an error report https://github.com/os-autoinst/openQA/pull/3604#issuecomment-765888161
Updated by livdywan over 3 years ago
- Due date changed from 2021-01-22 to 2021-01-29
For the record, I suggested in the chat to add DISTRI=mine BUILD=test CASEDIR=https://github.com/coolo/os-autoinst-distri-example.git#master
(or a variation of it) as a unit test, and this could be added even before re-introducing the feature.
Updated by Xiaojing_liu over 3 years ago
Another pr to fix that issue: https://github.com/os-autoinst/openQA/pull/3696
Updated by livdywan over 3 years ago
- Status changed from Feedback to In Progress
Updated by okurz over 3 years ago
We decided to merge the PR after we added additional unit tests and did not find easy cases to test pre-merge. We observed openQA "last_good_tests" investigation jobs ending up incomplete because they were unable to find the main schedule files, e.g.
openqa.opensuse.org/tests/1609706
On openqa.opensuse.org/tests?match=last_good_test I also saw the suspicion confirmed that all "last_good_tests" investigation jobs since this day (since the new version was deployed) are incomplete, except for the ones on the old aarch64 workers which still have the old version.
Further notes from chat about the reason and our choices:
@Xiaojing_liu 15:02
Oliver Kurz I checked the other incompletes; they have the same reason. Actually I am confused about the scenario. If we supply a git URL as CASEDIR and use the old PRODUCTDIR value, the main.pm in productdir will not make sense, right? I guess that is the reason why we supply PRODUCTDIR in openqa-clone-custom-git-refspec together.
Oliver Kurz @okurz 15:04
the reason why I supplied PRODUCTDIR in openqa-clone-custom-git-refspec initially was a workaround because it is needed. Let me check the openQA code what is expected and what should be done for the future
Oliver Kurz @okurz 15:07
right. https://github.com/os-autoinst/openQA/blob/1998b3cde8c85be95d8ddab6933d7877c9437670/lib/OpenQA/Utils.pm#L127 explains it: Normally the user should not need to specify PRODUCTDIR because we have the line return $dir . "/products/$distri" if -e "$dir/products/$distri";
in https://github.com/os-autoinst/openQA/blob/1998b3cde8c85be95d8ddab6933d7877c9437670/lib/OpenQA/Utils.pm#L142 . And if someone wants to use CASEDIR= and keep DISTRI as before there should be no need to specify PRODUCTDIR; only if one wants to rename "DISTRI" to a value that would not correspond to "$casedir/products/$distri/main.pm" would PRODUCTDIR need to be set
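The default quoted in the chat can be paraphrased in Python as a minimal sketch. The first branch mirrors the quoted Perl line; the fallback to the case directory itself is an assumption for this sketch (it would fit test distributions with main.pm in the root folder), and `default_productdir` with its `exists` callback is a hypothetical stand-in for the code in OpenQA/Utils.pm.

```python
def default_productdir(casedir, distri, exists):
    """Return "<casedir>/products/<distri>" if it exists, else casedir.

    exists is a callable standing in for Perl's -e file test.
    """
    candidate = f"{casedir}/products/{distri}"
    if exists(candidate):
        return candidate
    # Assumption for this sketch: fall back to the case directory itself.
    return casedir


# A distribution laid out like os-autoinst-distri-opensuse:
layout = {"/tests/opensuse/products/opensuse"}
print(default_productdir("/tests/opensuse", "opensuse", layout.__contains__))

# A distribution with main.pm in its root folder has no products/ directory:
print(default_productdir("/tests/openqa", "openqa", lambda path: False))
```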
Updated by Xiaojing_liu over 3 years ago
Please help to confirm these points:
- If users only supply CASEDIR=https://github.com/os-autoinst/os-autoinst-distri-opensuse.git#6f7510c5fc8d0c3c46a1d2154f5fc54296c22bd5, should we define PRODUCTDIR=/var/lib/openqa/share/tests/opensuse/products/opensuse? This is the current behavior. In this scenario, if the push includes a modification of products/opensuse/main.pm, it doesn't work, right? Because the PRODUCTDIR is not under the CASEDIR.
- There is also an issue with openqa-clone-custom-git-refspec. If this script is used to trigger a new job (job's CASEDIR=opensuse, PRODUCTDIR=opensuse/products/opensuse), the NEEDLES_DIR will be set to opensuse/products/opensuse/needles. That is not correct. We could modify this script to set this value to needles, or modify the worker code to be compatible with it.
Updated by livdywan over 3 years ago
- Due date changed from 2021-01-29 to 2021-02-12
Xiaojing_liu wrote:
Please help to confirm these points:
- If users only supply CASEDIR=https://github.com/os-autoinst/os-autoinst-distri-opensuse.git#6f7510c5fc8d0c3c46a1d2154f5fc54296c22bd5, should we define PRODUCTDIR=/var/lib/openqa/share/tests/opensuse/products/opensuse? This is the current behavior. In this scenario, if the push includes a modification of products/opensuse/main.pm, it doesn't work, right? Because the PRODUCTDIR is not under the CASEDIR.
Are you talking about a new default for PRODUCTDIR? Would the default be relative to CASEDIR?
- There is also an issue with openqa-clone-custom-git-refspec. If this script is used to trigger a new job (job's CASEDIR=opensuse, PRODUCTDIR=opensuse/products/opensuse), the NEEDLES_DIR will be set to opensuse/products/opensuse/needles. That is not correct. We could modify this script to set this value to needles, or modify the worker code to be compatible with it.
I guess, same as above, we should make sure the defaults are well-defined. Either that or they have to be set at all times.
Updated by Xiaojing_liu over 3 years ago
According to note #44, I created a new PR: https://github.com/os-autoinst/openQA/pull/3712
Updated by livdywan over 3 years ago
- Due date changed from 2021-02-12 to 2021-02-19
Let's assume Jane is enjoying the new year's holidays before getting back to this🧧🧨🐂🥮
Updated by livdywan over 3 years ago
- Due date changed from 2021-02-19 to 2021-02-26
Manual tests to be found here: https://github.com/os-autoinst/openQA/pull/3712#issuecomment-783193059
Updated by okurz over 3 years ago
- Related to action #88482: Two absolute paths concatenated to form a default needle dir when PRODUCT_DIR/needles doesn't exist added
Updated by Xiaojing_liu over 3 years ago
- Status changed from In Progress to Feedback
The pull request is labeled acceptance-tests-needed, and one colleague helped deploy this patch to one worker: https://openqa.opensuse.org/admin/workers/260
See details in https://github.com/os-autoinst/openQA/pull/3712#issuecomment-784178554.
I did not spend much time on it, just reviewed the job results of that worker from time to time.
Since we are waiting for more test results, I prefer to set this ticket to Feedback.
Updated by okurz over 3 years ago
- Due date changed from 2021-02-26 to 2021-03-04
Good choice! As the smaller os-autoinst change https://github.com/os-autoinst/os-autoinst/pull/1627 is still pending deployment on OSD I would wait until after the deployment tomorrow and then merge the openQA change to come in next.
Updated by livdywan over 3 years ago
- Due date changed from 2021-03-04 to 2021-03-12
I didn't save the comment here, so once more:
We briefly discussed that https://github.com/os-autoinst/openQA/pull/3712 was tested enough, the os-autoinst branch was merged as well.
Unfortunately now the PR is stuck even 20 hours after review, trying to figure out what's wrong there now.
Updated by okurz over 3 years ago
- Due date changed from 2021-03-12 to 2021-03-17
PR is merged. Was deployed on OSD by now. Given that with the previous iterations we had some problems I suggest to await any potential user feedback until next week before we check against the original request again if the feature is completed.
Updated by pcervinka over 3 years ago
- Related to action #89990: Error on tests/kernel/run_ltp.pm: Can't locate sle/tests/kernel/run_ltp.pm added
Updated by okurz over 3 years ago
Regression found with #89990 . I will handle the rollback in #89990 . Can you try to understand the specific test requirements and include in your changes as well?
Further ideas for improvements:
- Instead of enabling the new functionality by default, introduce an experimental feature switch so that we do not have to revert the source code, only the switch configuration
- We should have an alert for the broken jobs caused by the change in behaviour -> Check the existing alert for "new incompletes"
Updated by livdywan over 3 years ago
From the weekly:
- no alerts due to incomplete alerts not firing, this was expected
- use feature switch approach similar to faster localhost uploads
- enable via job setting (could be machine, job group, etc)
Updated by Xiaojing_liu over 3 years ago
PR for this regression issue: https://github.com/os-autoinst/os-autoinst/pull/1634
Updated by Xiaojing_liu over 3 years ago
- Due date changed from 2021-03-17 to 2021-03-19
The PR for os-autoinst has been merged. After it is deployed on OSD and o3, if there is no problem, we could consider restoring the PR in openQA.
Updated by okurz over 3 years ago
Yes, we should bring back the PR but only enable the changed behaviour based on a feature switch, ok?
Updated by Xiaojing_liu over 3 years ago
As discussed with @cdywan, we agreed with using a job setting to enable or disable this feature. Will commit a new pull request later.
Updated by Xiaojing_liu over 3 years ago
Updated by okurz over 3 years ago
- Tracker changed from action to coordination
- Subject changed from Remote openQA worker fails to run tests from openqa-clone-custom-git-refspec to [epic] Remote openQA worker fails to run tests from openqa-clone-custom-git-refspec
- Status changed from Feedback to Blocked
- Assignee changed from Xiaojing_liu to okurz
As after this PR there are likely more steps also regarding the feature switch I know now how to switch this ticket better, e.g. see subtask #90293
Updated by okurz over 3 years ago
- Status changed from Blocked to Resolved
All subtasks resolved, both ACs covered.