action #88754: openQA-in-openQA tests always fail and results do not impact submission pipeline - openQA Project (public) - openSUSE Project Management Tool

closed

Added by okurz about 4 years ago. Updated almost 4 years ago.

Status:

Resolved

Priority:

High

Assignee:

mkittler

Category:

Regressions/Crashes

Target version:

Ready

Start date:

2021-02-18

Due date:

2021-07-08

Description

Observation

https://openqa.opensuse.org/tests/latest?arch=x86_64&distri=openqa&flavor=dev&machine=64bit-2G&test=openqa_from_containers&version=Tumbleweed#next_previous shows a long history of jobs that mostly fail, in various steps, but the result is completely ignored in http://jenkins.qa.suse.de/view/openQA-in-openQA/

Expected results

- pipeline reliable and mostly green
- failures in tests prevent the submission of new packages

Suggestion

- https://github.com/os-autoinst/scripts/blob/master/monitor-openqa_job#L26 JOB_ID uses only one job ID rather than all job IDs found
- Switching from openqa-client to openqa-cli would provide JSON output, which makes it easy to handle multiple jobs, e.g. via jq (see also filter_id)
- Adjust or manually run the Build command in http://jenkins.qa.suse.de/job/monitor-openQA_in_openQA-TW/configure
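
To illustrate the idea of checking every job rather than a single JOB_ID, here is a minimal Python sketch. It is not the code from monitor-openqa_job: the function names, the `OK_RESULTS` set and the per-job JSON shape (`{"job": {"id": ..., "result": ...}}`) are assumptions made for illustration only.

```python
import json

# Results counted as success in this sketch (an assumption, adjust as needed).
OK_RESULTS = {"passed", "softfailed"}


def failed_job_ids(payloads):
    """Return the IDs of all jobs whose result is not in OK_RESULTS.

    `payloads` is an iterable of JSON strings, one per job, each assumed to
    look like {"job": {"id": 1, "result": "passed"}}.
    """
    failed = []
    for raw in payloads:
        job = json.loads(raw)["job"]
        if job.get("result") not in OK_RESULTS:
            failed.append(job["id"])
    return failed


def exit_code(payloads):
    """Return 0 if every job succeeded and 1 otherwise."""
    return 1 if failed_job_ids(payloads) else 0


if __name__ == "__main__":
    # Hypothetical sample input; real data would come from the openQA API.
    sample = [
        '{"job": {"id": 1802764, "result": "failed"}}',
        '{"job": {"id": 1802766, "result": "passed"}}',
    ]
    print("failed job IDs:", failed_job_ids(sample))
    print("exit code:", exit_code(sample))
```

A monitoring step built along these lines would return a non-zero exit code as soon as any job in the list failed, which is what the expected results above ask for.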

Related issues 1 (0 open — 1 closed)

Updated by okurz about 4 years ago

Priority changed from Normal to High

Hm, actually it seems that sometimes, or maybe always, the submission pipeline is in fact blocked.

I removed the schedule part from https://openqa.opensuse.org/admin/job_templates/24 to allow the important fix for perl-Mojolicious to be submitted:

    - openqa_from_containers:
        testsuite: null
        settings:
          OPENQA_CONTAINERS: '1'
          OPENQA_FROM_GIT: '1' # see: main.pm, avoid load_osautoinst_tests
        description: >-
          Maintainer: okurz@suse.de Test for running openQA itself from containers. To be used with "openqa"
          distri.

Updated by livdywan about 4 years ago

I can see a failure in the 3 most recent worker tests:

# Test died: command 'docker logs openqa_worker 2>&1 | grep "API key and secret are needed" >/dev/null' failed at /var/lib/openqa/cache/openqa1-opensuse/tests/openqa/lib/utils.pm line 100.

Maybe something for @ilausuch to take a look at. I guess the expected log message is absent here, meaning the credentials are already set or the connection isn't coming up at all 🤔

Updated by Xiaojing_liu about 4 years ago

Status changed from Workable to In Progress
Assignee set to Xiaojing_liu

Updated by openqa_review about 4 years ago

Due date set to 2021-03-25

Setting due date based on mean cycle time of SUSE QE Tools

Updated by livdywan about 4 years ago

Jane, Ivan and I were discussing this together a bit, some notes from that:

- https://github.com/os-autoinst/os-autoinst-distri-openQA/blob/master/lib/utils.pm#L94
- in the logs you (don't) find this:
  - [debug] -------------------------- [2021-03-11T10:29:03.624 CET] [debug] /tests/containers/worker.pm:10 called utils::wait_for_container_log -> lib/utils.pm:95
  - $cmd log ... returns no logs
  - Can we conditionally output all logs if the ...grep failed?
- groupmod GID '0' already exists
  - 0 is passed via groupmod -g 0 kvm which may not be the kvm group
  - shouldn't we do groupmod kvm with no ID?
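
The question about conditionally printing all logs when the grep fails can be sketched as follows. This is a Python illustration, not the Perl helper from lib/utils.pm; `wait_for_log`, `fetch_logs` and the parameter names are hypothetical. The idea is simply to keep the last log snapshot and attach it to the error when the expected message never shows up.

```python
import time


def wait_for_log(fetch_logs, needle, attempts=5, delay=0.0):
    """Poll fetch_logs() until `needle` appears in its output.

    `fetch_logs` is any callable returning the current log text. If the
    needle is never found, the full last log snapshot is included in the
    error so the failure can be diagnosed without re-running the test.
    """
    logs = ""
    for _ in range(attempts):
        logs = fetch_logs()
        if needle in logs:
            return logs
        time.sleep(delay)
    raise RuntimeError(
        f"{needle!r} not found after {attempts} attempts; full logs:\n{logs}"
    )


if __name__ == "__main__":
    # Hypothetical log source standing in for the container's log output.
    print(wait_for_log(lambda: "API key and secret are needed", "API key"))
```

With this shape, a message such as "groupmod GID '0' already exists" would be visible directly in the test failure instead of only an unexplained grep failure.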

Updated by ilausuch about 4 years ago

Fixed the entrypoint
https://github.com/os-autoinst/openQA/pull/3787

What remains is to fix the test.

Updated by livdywan about 4 years ago

Assignee changed from Xiaojing_liu to ilausuch

ilausuch wrote:

Fixed the entrypoint
https://github.com/os-autoinst/openQA/pull/3787

Since Jane's fix for validate_script_output got merged, I assume we're waiting for the "groupmod GID '0' already exists" issue to be resolved before we can re-enable the tests?

Updated by Xiaojing_liu about 4 years ago

The new PR has been merged. I ran a test: if the "groupmod GID '0' already exists" error does not occur, the job passes. See an example: https://openqa.opensuse.org/tests/1672546#
So once https://github.com/os-autoinst/openQA/pull/3787 is merged, we can add the test back.

Updated by livdywan about 4 years ago

Due date changed from 2021-03-25 to 2021-04-01

Pushing back the due date because of hackweek

Xiaojing_liu wrote:

The new PR has been merged. I ran a test: if the "groupmod GID '0' already exists" error does not occur, the job passes. See an example: https://openqa.opensuse.org/tests/1672546#
So once https://github.com/os-autoinst/openQA/pull/3787 is merged, we can add the test back.

@ilausuch Are you going to add the test back?

#10

Updated by ilausuch about 4 years ago

https://github.com/os-autoinst/openQA/pull/3787 is under review

#11

Updated by livdywan about 4 years ago

Due date changed from 2021-04-01 to 2021-04-09

ilausuch wrote:

https://github.com/os-autoinst/openQA/pull/3787 is under review

The PR got merged. What's the status of the openQA tests now? Could you please comment here on what, if anything, is still to be done, and update the status as needed?

#12

Updated by ilausuch about 4 years ago

I created a test to prove that this works now
https://openqa.opensuse.org/tests/1696773#
Running this PR https://github.com/os-autoinst/os-autoinst-distri-openQA/pull/65

I created this test with the env variable OPENQA_CONTAINERS=1
https://openqa.opensuse.org/tests/1696822

#13

Updated by ilausuch about 4 years ago

Could we activate the test in the scheduler again?

#14

Updated by okurz about 4 years ago

Sure, please try that yourself. Basically undoing the changes from #88754#note-1

#15

Updated by ilausuch about 4 years ago

Done
https://openqa.opensuse.org/tests/1699047

#16

Updated by ilausuch about 4 years ago

I found that it occasionally fails in the same way as #90614. I am preparing the same solution: retrying when building the container images.

See: https://openqa.opensuse.org/tests/1700287#step/build/5
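
The retry approach mentioned above can be sketched generically in Python. This is not the change from the referenced ticket or PR; `retry` and its parameters are hypothetical names, and `action` stands in for whatever step builds the container image.

```python
import time


def retry(action, attempts=3, delay=0.0):
    """Call action() until it succeeds or `attempts` tries are used up.

    The last exception is re-raised if every attempt fails, so a
    persistent problem still makes the test fail.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == attempts:
                raise
            time.sleep(delay)  # give a transient problem time to clear
```

Retrying only hides transient failures such as a flaky download during the image build; a real regression still surfaces once the attempts are exhausted.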

#17

Updated by ilausuch about 4 years ago

I created this PR to check an alternative https://github.com/os-autoinst/os-autoinst-distri-openQA/pull/66
Running test https://openqa.opensuse.org/tests/1700457

#18

Updated by ilausuch about 4 years ago

A new running test: https://openqa.opensuse.org/tests/1704825

#19

Updated by ilausuch about 4 years ago

In a training session with Oliver and Christian we identified a problem that was affecting the container tests. This was the first time it failed: https://openqa.opensuse.org/tests/overview?distri=openqa&version=Tumbleweed&build=%3ATW.7835&groupid=24

And we created a needle to solve that
https://github.com/os-autoinst/os-autoinst-needles-openQA/commit/10eeb87d6a33aca10d1f1d5cff3145cacd802617

This is the running test with the new needle
https://openqa.opensuse.org/tests/overview?distri=openqa&version=Tumbleweed&build=%3ATW.7865&groupid=24

#20

Updated by ilausuch about 4 years ago

Due date changed from 2021-04-09 to 2021-04-23

#21

Updated by ilausuch almost 4 years ago

The next step is to verify that "failures in tests prevent the submission of new packages" works, by triggering a failure manually.

#22

Updated by ilausuch almost 4 years ago

Status changed from In Progress to Blocked
Assignee deleted (~~ilausuch~~)

I am unable to change the parameters to force a failure and test this. Could someone with Jenkins experience please check this out?

#23

Updated by okurz almost 4 years ago

Status changed from Blocked to Workable

please use "Blocked" only with an assignee to track any blocker. And blockers are only other tickets

#24

Updated by ilausuch almost 4 years ago

Assignee set to ilausuch

#25

Updated by ilausuch almost 4 years ago

Status changed from Workable to Blocked

Blocked by #91752

#26

Updated by livdywan almost 4 years ago

Blocked by action #91752: jenkins: Multiple missing fields and errors in configuration of openQA-in-openQA added

#27

Updated by ilausuch almost 4 years ago

Due date deleted (~~2021-04-23~~)

#28

Updated by okurz almost 4 years ago

Status changed from Blocked to Workable

blocker #91752 resolved

#29

Updated by livdywan almost 4 years ago

Description updated (diff)

#30

Updated by livdywan almost 4 years ago

Status changed from Workable to In Progress
Assignee changed from ilausuch to mkittler

#31

Updated by mkittler almost 4 years ago

PR for first suggestion: https://github.com/os-autoinst/scripts/pull/84
This leaves only the last suggestion.

#32

Updated by openqa_review almost 4 years ago

Due date set to 2021-07-08

Setting due date based on mean cycle time of SUSE QE Tools

#33

Updated by mkittler almost 4 years ago

Status changed from In Progress to Resolved

I have tested the change from the PR locally against multiple jobs from o3 and it seemed to work, e.g. if one of the jobs fails it'll exit with a non-zero return code.

I've also re-triggered the Jenkins job and it failed (as expected, since one of the previously triggered openQA jobs failed), leaving a comment on OBS, see:

http://jenkins.qa.suse.de/job/monitor-openQA_in_openQA-TW/7195/console
https://build.opensuse.org/project/show/devel:openQA#comment-1477717 (openQA-in-openQA test(s) failed (job IDs: 1802764), see https://openqa.opensuse.org/tests/overview?version=Tumbleweed&groupid=24)

Note that copying the file with the job IDs from trigger-openQA_in_openQA-TW works. It is currently not shown under http://jenkins.qa.suse.de/job/monitor-openQA_in_openQA-TW/ because since the change there hasn't been a successful run and it shows only the artifact produced by the last successful run. The console log shows clearly that the expected jobs have been considered (+ echo 'Result of job 1802764: failed', + echo 'Result of job 1802766: passed', + echo 'Result of job 1802766: passed'), though.
