action #88754
closed
openQA-in-openQA tests always fail and results do not impact submission pipeline
Description
Observation
https://openqa.opensuse.org/tests/latest?arch=x86_64&distri=openqa&flavor=dev&machine=64bit-2G&test=openqa_from_containers&version=Tumbleweed#next_previous shows a long history of jobs that most often fail in various steps, but the result is completely ignored in http://jenkins.qa.suse.de/view/openQA-in-openQA/
Expected results
- pipeline reliable and mostly green
- failures in tests prevent the submission of new packages
Suggestion
- https://github.com/os-autoinst/scripts/blob/master/monitor-openqa_job#L26 JOB_ID uses only one job ID rather than all job IDs found
- Switching from openqa-client to openqa-cli would provide JSON output where we can easily handle multiple jobs, e.g. via jq (see also filter_id)
- Adjust or manually run the Build command in http://jenkins.qa.suse.de/job/monitor-openQA_in_openQA-TW/configure
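The first two suggestions amount to collecting every job ID from a JSON response instead of only one. A minimal sketch in Python follows; it is an illustration, not the actual monitor-openqa_job script. The JSON shape (an object with an "ids" list) and the helper name extract_job_ids are assumptions made for this example.

```python
import json
from typing import List


def extract_job_ids(response_text: str) -> List[int]:
    """Return every job ID found in a JSON response.

    Assumption: the response is an object with an "ids" list, for example
    {"count": 2, "ids": [1802764, 1802766]}. Adjust the key if the real
    output is shaped differently.
    """
    data = json.loads(response_text)
    # Keep all IDs rather than only the first one.
    return [int(job_id) for job_id in data.get("ids", [])]


if __name__ == "__main__":
    sample = '{"count": 2, "ids": [1802764, 1802766]}'
    for job_id in extract_job_ids(sample):
        print(job_id)
```

Each returned ID could then be monitored in turn, so that a single failed job is no longer missed.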
Updated by okurz over 3 years ago
- Priority changed from Normal to High
Hm, actually it seems that the submission pipeline is blocked sometimes, or maybe always.
I removed the schedule part from https://openqa.opensuse.org/admin/job_templates/24 to allow the important fix for perl-Mojolicious to be submitted:
- openqa_from_containers:
    testsuite: null
    settings:
      OPENQA_CONTAINERS: '1'
      OPENQA_FROM_GIT: '1' # see: main.pm, avoid load_osautoinst_tests
    description: >-
      Maintainer: okurz@suse.de Test for running openQA itself from containers. To be used with "openqa"
      distri.
Updated by livdywan over 3 years ago
I can see a failure in the 3 most recent worker tests:
# Test died: command 'docker logs openqa_worker 2>&1 | grep "API key and secret are needed" >/dev/null' failed at /var/lib/openqa/cache/openqa1-opensuse/tests/openqa/lib/utils.pm line 100.
Maybe something for @ilausuch to take a look at. I guess the expected log message is absent here, meaning the credentials are already set or the connection isn't coming up at all 🤔
Updated by Xiaojing_liu over 3 years ago
- Status changed from Workable to In Progress
- Assignee set to Xiaojing_liu
Updated by openqa_review over 3 years ago
- Due date set to 2021-03-25
Setting due date based on mean cycle time of SUSE QE Tools
Updated by livdywan over 3 years ago
Jane, Ivan and I discussed this together a bit. Some notes from that:
- https://github.com/os-autoinst/os-autoinst-distri-openQA/blob/master/lib/utils.pm#L94
- in the logs you (don't) find this:
  [debug] --------------------------
  [2021-03-11T10:29:03.624 CET] [debug] /tests/containers/worker.pm:10 called utils::wait_for_container_log -> lib/utils.pm:95
  - $cmd log ... returns no logs
  - Can we conditionally output all logs if the ...grep failed?
- groupmod GID '0' already exists
  - 0 is passed via groupmod -g 0 kvm, which may not be the kvm group - shouldn't we do groupmod kvm with no ID?
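The idea of showing all logs when the grep fails can be sketched as follows. This is a hypothetical Python illustration, not the Perl code in lib/utils.pm; the function name find_log_line is invented for the example, and it performs a single check rather than waiting for the message to appear.

```python
import re


def find_log_line(log_text: str, pattern: str) -> str:
    """Return the first log line matching pattern.

    If nothing matches, raise an error that carries the complete log, so
    the failure shows what the container actually printed instead of only
    reporting that the grep failed.
    """
    for line in log_text.splitlines():
        if re.search(pattern, line):
            return line
    raise RuntimeError(
        f"pattern {pattern!r} not found; full log follows:\n{log_text}"
    )


if __name__ == "__main__":
    log = "starting worker\nAPI key and secret are needed\n"
    print(find_log_line(log, "API key and secret are needed"))
```

With this shape, a message such as groupmod GID '0' already exists would be visible directly in the test failure.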
Updated by ilausuch over 3 years ago
Fixed the entrypoint
https://github.com/os-autoinst/openQA/pull/3787
What remains is to fix the test.
Updated by livdywan over 3 years ago
- Assignee changed from Xiaojing_liu to ilausuch
ilausuch wrote:
Fixed the entrypoint
https://github.com/os-autoinst/openQA/pull/3787
Since Jane's fix for validate_script_output got merged, I assume we're waiting for the groupmod GID '0' already exists issue to be resolved before we can re-enable the tests?
Updated by Xiaojing_liu over 3 years ago
The new PR has been merged. I ran a test: if the groupmod GID '0' already exists error does not occur, the job passes. See an example: https://openqa.opensuse.org/tests/1672546#
So once https://github.com/os-autoinst/openQA/pull/3787 is merged, we can add the test back.
Updated by livdywan over 3 years ago
- Due date changed from 2021-03-25 to 2021-04-01
Postponing the due date because of hackweek
Xiaojing_liu wrote:
The new pr has been merged. I did a test if there is no groupmod GID '0' already exists, the job will pass. See an example: https://openqa.opensuse.org/tests/1672546#
So after https://github.com/os-autoinst/openQA/pull/3787 got merged, we could add the test back.
@ilausuch Are you going to add the test back?
Updated by ilausuch over 3 years ago
https://github.com/os-autoinst/openQA/pull/3787 is under review
Updated by livdywan over 3 years ago
- Due date changed from 2021-04-01 to 2021-04-09
ilausuch wrote:
https://github.com/os-autoinst/openQA/pull/3787 is under review
The PR got merged - what's the status of the openQA tests now? Could you please comment on what, if anything, is still to be done here, and update the status as needed?
Updated by ilausuch over 3 years ago
I created a test to prove that this works now
https://openqa.opensuse.org/tests/1696773#
Running this PR https://github.com/os-autoinst/os-autoinst-distri-openQA/pull/65
I created this test with the env variable OPENQA_CONTAINERS=1
https://openqa.opensuse.org/tests/1696822
Updated by ilausuch over 3 years ago
Could we activate the test in the scheduler again?
Updated by okurz over 3 years ago
Sure, please try that yourself. Basically undoing the changes from #88754#note-1
Updated by ilausuch over 3 years ago
I found that it occasionally fails in the same way as #90614. I am preparing the same solution: retrying when building the container images.
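A retry around a flaky build step can be sketched like this. It is a generic Python illustration under assumptions, not the change made for #90614; the helper name retry, the attempt count and the delay are all hypothetical.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry(action: Callable[[], T], attempts: int = 3, delay_seconds: float = 1.0) -> T:
    """Call action until it succeeds or the attempts are used up.

    The last exception is re-raised if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Optional[Exception] = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as error:  # broad on purpose for this sketch
            last_error = error
            if attempt < attempts:
                time.sleep(delay_seconds)
    assert last_error is not None
    raise last_error


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky_build() -> str:
        # Stand-in for a container image build that fails twice, then works.
        state["calls"] += 1
        if state["calls"] < 3:
            raise RuntimeError("build failed")
        return "built"

    print(retry(flaky_build, attempts=3, delay_seconds=0))
```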
Updated by ilausuch over 3 years ago
I created this PR to check an alternative https://github.com/os-autoinst/os-autoinst-distri-openQA/pull/66
Running test https://openqa.opensuse.org/tests/1700457
Updated by ilausuch over 3 years ago
A new running test: https://openqa.opensuse.org/tests/1704825
Updated by ilausuch over 3 years ago
In a training session with Oliver and Christian we identified a problem that was affecting the container tests. This was the first time it failed: https://openqa.opensuse.org/tests/overview?distri=openqa&version=Tumbleweed&build=%3ATW.7835&groupid=24
We created a needle to solve that:
https://github.com/os-autoinst/os-autoinst-needles-openQA/commit/10eeb87d6a33aca10d1f1d5cff3145cacd802617
This is the running test with the new needle
https://openqa.opensuse.org/tests/overview?distri=openqa&version=Tumbleweed&build=%3ATW.7865&groupid=24
Updated by ilausuch over 3 years ago
- Due date changed from 2021-04-09 to 2021-04-23
Updated by ilausuch over 3 years ago
The next step is to verify that "failures in tests prevent the submission of new packages" works by triggering a failure manually.
Updated by ilausuch over 3 years ago
- Status changed from In Progress to Blocked
- Assignee deleted (ilausuch)
I am unable to change the parameters to force a failure and test this. Could someone with Jenkins experience please check this out?
Updated by okurz over 3 years ago
- Status changed from Blocked to Workable
Please use "Blocked" only with an assignee to track any blocker. Blockers are only other tickets.
Updated by livdywan over 3 years ago
- Blocked by action #91752: jenkins: Multiple missing fields and errors in configuration of openQA-in-openQA added
Updated by livdywan over 3 years ago
- Status changed from Workable to In Progress
- Assignee changed from ilausuch to mkittler
Updated by mkittler over 3 years ago
PR for first suggestion: https://github.com/os-autoinst/scripts/pull/84
This leaves only the last suggestion.
Updated by openqa_review over 3 years ago
- Due date set to 2021-07-08
Setting due date based on mean cycle time of SUSE QE Tools
Updated by mkittler over 3 years ago
- Status changed from In Progress to Resolved
I have tested the change from the PR locally against multiple jobs from o3 and it seemed to work, e.g. if one of the jobs fails it'll exit with a non-zero return code.
I've also re-triggered the Jenkins job and it failed (as expected, since one of the previously triggered openQA jobs failed), leaving a comment on OBS, see:
- http://jenkins.qa.suse.de/job/monitor-openQA_in_openQA-TW/7195/console
- https://build.opensuse.org/project/show/devel:openQA#comment-1477717 (openQA-in-openQA test(s) failed (job IDs: 1802764), see https://openqa.opensuse.org/tests/overview?version=Tumbleweed&groupid=24)
Note that copying the file with the job IDs from trigger-openQA_in_openQA-TW works. It is currently not shown under http://jenkins.qa.suse.de/job/monitor-openQA_in_openQA-TW/ because there hasn't been a successful run since the change, and the page shows only the artifact produced by the last successful run. The console log clearly shows that the expected jobs have been considered, though (+ echo 'Result of job 1802764: failed', + echo 'Result of job 1802766: passed', + echo 'Result of job 1802766: passed').
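The behavior described above, a non-zero exit code and a comment as soon as one of several jobs fails, can be sketched as follows. This is a simplified Python illustration, not the script from the PR. The function name summarize is hypothetical, and treating "passed" and "softfailed" as the only acceptable results is an assumption; the message format follows the OBS comment quoted above.

```python
from typing import Dict, Tuple

OVERVIEW_URL = "https://openqa.opensuse.org/tests/overview?version=Tumbleweed&groupid=24"

# Assumption: these results count as success; everything else is a failure.
OK_RESULTS = {"passed", "softfailed"}


def summarize(results: Dict[int, str]) -> Tuple[int, str]:
    """Map job results to an exit code and an optional failure comment."""
    failed = sorted(job_id for job_id, result in results.items() if result not in OK_RESULTS)
    if not failed:
        return 0, ""
    ids = ", ".join(str(job_id) for job_id in failed)
    return 1, f"openQA-in-openQA test(s) failed (job IDs: {ids}), see {OVERVIEW_URL}"


if __name__ == "__main__":
    exit_code, comment = summarize({1802764: "failed", 1802766: "passed"})
    if comment:
        print(comment)
    raise SystemExit(exit_code)
```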