action #88754
openQA-in-openQA tests always fail and results do not impact submission pipeline
0%
Description
Observation
https://openqa.opensuse.org/tests/latest?arch=x86_64&distri=openqa&flavor=dev&machine=64bit-2G&test=openqa_from_containers&version=Tumbleweed#next_previous shows a long history of jobs that mostly failed in various steps, but the result is completely ignored in http://jenkins.qa.suse.de/view/openQA-in-openQA/
Expected results
- pipeline reliable and mostly green
- failures in tests prevent the submission of new packages
Suggestion
- https://github.com/os-autoinst/scripts/blob/master/monitor-openqa_job#L26 JOB_ID uses only one job ID rather than all found job IDs
- Switching from openqa-client to openqa-cli would provide JSON output where we can easily handle multiple jobs, e.g. via jq (see also filter_id)
- Adjust or manually run Build command in http://jenkins.qa.suse.de/job/monitor-openQA_in_openQA-TW/configure
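The first two suggestions come down to reading every job ID from a JSON response instead of only one. A minimal sketch of that idea follows, written in Python rather than the shell/jq the ticket mentions. It assumes the response has a top-level "ids" array; that shape, the function name extract_job_ids and the sample response are assumptions for illustration and not taken from the monitor-openqa_job script.

```python
import json


def extract_job_ids(response_text):
    """Return every job ID found in a JSON scheduling response.

    Assumption: the response carries the IDs in a top-level "ids" array.
    """
    data = json.loads(response_text)
    # Keep all IDs, not just the first one.
    return [int(job_id) for job_id in data.get("ids", [])]


# Hypothetical response describing two scheduled jobs.
sample = '{"count": 2, "ids": [1802764, 1802766]}'
print(extract_job_ids(sample))
```

Returning a list keeps the later monitoring step free to loop over all jobs, whereas a single JOB_ID variable silently drops the rest.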
Related issues
History
#1
Updated by okurz over 1 year ago
- Priority changed from Normal to High
hm, actually it seems that sometimes or maybe always the submission pipeline actually is blocked.
I removed the schedule part from https://openqa.opensuse.org/admin/job_templates/24 to allow the important fix for perl-Mojolicious to be submitted:
- openqa_from_containers:
    testsuite: null
    settings:
      OPENQA_CONTAINERS: '1'
      OPENQA_FROM_GIT: '1' # see: main.pm, avoid load_osautoinst_tests
    description: >-
      Maintainer: okurz@suse.de
      Test for running openQA itself from containers. To be used with "openqa" distri.
#2
Updated by cdywan over 1 year ago
I can see a failure in the 3 most recent worker tests:
# Test died: command 'docker logs openqa_worker 2>&1 | grep "API key and secret are needed" >/dev/null' failed at /var/lib/openqa/cache/openqa1-opensuse/tests/openqa/lib/utils.pm line 100.
Maybe something for ilausuch to take a look at. I guess the expected log message is absent here, meaning the credentials are already set or the connection isn't coming up at all 🤔
#3
Updated by Xiaojing_liu over 1 year ago
- Status changed from Workable to In Progress
- Assignee set to Xiaojing_liu
#4
Updated by openqa_review over 1 year ago
- Due date set to 2021-03-25
Setting due date based on mean cycle time of SUSE QE Tools
#5
Updated by cdywan over 1 year ago
Jane, Ivan and I discussed this together a bit. Some notes from that:
- https://github.com/os-autoinst/os-autoinst-distri-openQA/blob/master/lib/utils.pm#L94
- in the logs you (don't) find this:
  [debug] --------------------------
  [2021-03-11T10:29:03.624 CET] [debug] /tests/containers/worker.pm:10 called utils::wait_for_container_log -> lib/utils.pm:95
- $cmd log ... returns no logs
  - Can we conditionally output all logs if the ...grep failed?
- groupmod GID '0' already exists
  - 0 is passed via groupmod -g 0 kvm which may not be the kvm group - shouldn't we do groupmod kvm with no ID?
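The question about conditionally showing all logs when the grep fails can be sketched as follows. This is an illustration in Python, not the Perl code from lib/utils.pm; the function name check_container_log and its return shape are hypothetical.

```python
import re


def check_container_log(logs, pattern):
    """Search container logs for a pattern.

    Returns (found, diagnostic). When the pattern is missing, the
    diagnostic contains the complete logs so the failure can be analysed
    without re-running the job.
    """
    if re.search(pattern, logs):
        return True, ""
    shown = logs if logs else "<no logs>"
    return False, f"pattern {pattern!r} not found; full logs follow:\n{shown}"


found, diagnostic = check_container_log("", "API key and secret are needed")
print(found)
print(diagnostic)
```

The empty-log case is reported explicitly because the notes above mention that the log command returned no logs at all.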
#6
Updated by ilausuch over 1 year ago
Fixed the entrypoint
https://github.com/os-autoinst/openQA/pull/3787
What remains is to fix the test.
#7
Updated by cdywan over 1 year ago
- Assignee changed from Xiaojing_liu to ilausuch
ilausuch wrote:
Fixed the entrypoint
https://github.com/os-autoinst/openQA/pull/3787
Since Jane's fix for validate_script_output got merged, I assume we're waiting for the "groupmod GID '0' already exists" issue to be resolved before we can re-enable the tests?
#8
Updated by Xiaojing_liu over 1 year ago
The new PR has been merged. I ran a test: if the "groupmod GID '0' already exists" error does not occur, the job passes. See an example: https://openqa.opensuse.org/tests/1672546#
So after https://github.com/os-autoinst/openQA/pull/3787 got merged, we could add the test back.
#9
Updated by cdywan over 1 year ago
- Due date changed from 2021-03-25 to 2021-04-01
Postponing the due date due to hackweek
Xiaojing_liu wrote:
The new pr has been merged. I did a test if there is no groupmod GID '0' already exists, the job will pass. See an example: https://openqa.opensuse.org/tests/1672546#
So after https://github.com/os-autoinst/openQA/pull/3787 got merged, we could add the test back.
ilausuch Are you going to add the test back?
#10
Updated by ilausuch over 1 year ago
https://github.com/os-autoinst/openQA/pull/3787 is under review
#11
Updated by cdywan about 1 year ago
- Due date changed from 2021-04-01 to 2021-04-09
ilausuch wrote:
https://github.com/os-autoinst/openQA/pull/3787 is under review
The PR got merged - what's the status of the openQA tests now? Could you please comment here on what, if anything, is still to be done, and update the status as needed?
#12
Updated by ilausuch about 1 year ago
I created a test to prove that this works now
https://openqa.opensuse.org/tests/1696773#
Running this PR https://github.com/os-autoinst/os-autoinst-distri-openQA/pull/65
I created this test with the env variable OPENQA_CONTAINERS=1
https://openqa.opensuse.org/tests/1696822
#13
Updated by ilausuch about 1 year ago
Could we activate the test in the scheduler again?
#14
Updated by okurz about 1 year ago
Sure, please try that yourself. Basically undoing the changes from #88754#note-1
#15
Updated by ilausuch about 1 year ago
#16
Updated by ilausuch about 1 year ago
I found that it occasionally fails in the same way as #90614. I am preparing the same solution: retrying when building the container images.
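A retry around a flaky build step could look roughly like the sketch below. It is a generic Python illustration, not the change made for #90614; the helper name retry and its parameters are hypothetical.

```python
import time


def retry(action, attempts=3, delay=0.0):
    """Call action() until it succeeds or the attempts are used up.

    Re-raises the last error if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as error:
            last_error = error
            if attempt < attempts:
                time.sleep(delay)  # wait before the next try
    raise last_error


# Hypothetical flaky step: fails twice, then succeeds.
calls = {"count": 0}


def flaky_build():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("image build failed")
    return "built"


print(retry(flaky_build))
```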
#17
Updated by ilausuch about 1 year ago
I created this PR to check an alternative https://github.com/os-autoinst/os-autoinst-distri-openQA/pull/66
Running test https://openqa.opensuse.org/tests/1700457
#18
Updated by ilausuch about 1 year ago
A new running test: https://openqa.opensuse.org/tests/1704825
#19
Updated by ilausuch about 1 year ago
In a training session with Oliver and Christian we identified a problem that was affecting the container tests. This was the first time it failed: https://openqa.opensuse.org/tests/overview?distri=openqa&version=Tumbleweed&build=%3ATW.7835&groupid=24
We created a needle to solve that:
https://github.com/os-autoinst/os-autoinst-needles-openQA/commit/10eeb87d6a33aca10d1f1d5cff3145cacd802617
This is the running test with the new needle
https://openqa.opensuse.org/tests/overview?distri=openqa&version=Tumbleweed&build=%3ATW.7865&groupid=24
#20
Updated by ilausuch about 1 year ago
- Due date changed from 2021-04-09 to 2021-04-23
#21
Updated by ilausuch about 1 year ago
The next step is to ensure that "failures in tests prevent the submission of new packages" works, by manually generating a failure.
#22
Updated by ilausuch about 1 year ago
- Status changed from In Progress to Blocked
- Assignee deleted (ilausuch)
I am unable to change the parameters to force a failure and test this. Could someone with Jenkins experience please check this out?
#23
Updated by okurz about 1 year ago
- Status changed from Blocked to Workable
please use "Blocked" only with an assignee to track any blocker. And blockers are only other tickets
#24
Updated by ilausuch about 1 year ago
- Assignee set to ilausuch
#26
Updated by cdywan about 1 year ago
- Blocked by action #91752: jenkins: Multiple missing fields and errors in configuration of openQA-in-openQA added
#27
Updated by ilausuch about 1 year ago
- Due date deleted (2021-04-23)
#28
Updated by okurz about 1 year ago
- Status changed from Blocked to Workable
blocker #91752 resolved
#29
Updated by cdywan about 1 year ago
- Description updated (diff)
#30
Updated by cdywan about 1 year ago
- Status changed from Workable to In Progress
- Assignee changed from ilausuch to mkittler
#31
Updated by mkittler about 1 year ago
PR for first suggestion: https://github.com/os-autoinst/scripts/pull/84
This leaves only the last suggestion.
#32
Updated by openqa_review about 1 year ago
- Due date set to 2021-07-08
Setting due date based on mean cycle time of SUSE QE Tools
#33
Updated by mkittler about 1 year ago
- Status changed from In Progress to Resolved
I have tested the change from the PR locally against multiple jobs from o3 and it seemed to work, e.g. if one of the jobs fails it'll exit with a non-zero return code.
I've also re-triggered the Jenkins job and it failed (as expected, since one of the previously triggered openQA jobs failed), leaving a comment on OBS, see:
- http://jenkins.qa.suse.de/job/monitor-openQA_in_openQA-TW/7195/console
- https://build.opensuse.org/project/show/devel:openQA#comment-1477717 (openQA-in-openQA test(s) failed (job IDs: 1802764), see https://openqa.opensuse.org/tests/overview?version=Tumbleweed&groupid=24)
Note that copying the file with the job IDs from trigger-openQA_in_openQA-TW works. It is currently not shown under http://jenkins.qa.suse.de/job/monitor-openQA_in_openQA-TW/ because there hasn't been a successful run since the change, and the page shows only the artifact produced by the last successful run. The console log clearly shows that the expected jobs have been considered, though (+ echo 'Result of job 1802764: failed', + echo 'Result of job 1802766: passed', + echo 'Result of job 1802766: passed').
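The behaviour described here (one result line per job, a non-zero exit code as soon as any job fails, and a comment listing the failed job IDs) can be illustrated with a small Python sketch. This is not the code from the PR. The function name summarize is hypothetical, and treating "softfailed" as acceptable is an assumption. The message format follows the OBS comment quoted above.

```python
def summarize(results, overview_url):
    """Return (exit_code, lines) for a mapping of job ID to result.

    Assumption: only "passed" and "softfailed" count as success.
    """
    ok_results = {"passed", "softfailed"}
    lines = [f"Result of job {job_id}: {result}" for job_id, result in results.items()]
    failed = [job_id for job_id, result in results.items() if result not in ok_results]
    if failed:
        ids = ", ".join(str(job_id) for job_id in failed)
        lines.append(
            f"openQA-in-openQA test(s) failed (job IDs: {ids}), see {overview_url}"
        )
    return (1 if failed else 0), lines


exit_code, lines = summarize(
    {1802764: "failed", 1802766: "passed"},
    "https://openqa.opensuse.org/tests/overview?version=Tumbleweed&groupid=24",
)
print("\n".join(lines))
print("exit code:", exit_code)
```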