action #88754

closed

openQA-in-openQA tests always fail and results do not impact submission pipeline

Added by okurz almost 4 years ago. Updated over 3 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2021-02-18
Due date:
2021-07-08
% Done:

0%

Estimated time:

Description

Observation

https://openqa.opensuse.org/tests/latest?arch=x86_64&distri=openqa&flavor=dev&machine=64bit-2G&test=openqa_from_containers&version=Tumbleweed#next_previous shows a long history of jobs, most of which failed in various steps, but the result is completely ignored in http://jenkins.qa.suse.de/view/openQA-in-openQA/

Expected results

  • pipeline reliable and mostly green
  • failures in tests prevent the submission of new packages

Suggestion


Related issues 1 (0 open, 1 closed)

Blocked by openQA Project (public) - action #91752: jenkins: Multiple missing fields and errors in configuration of openQA-in-openQA (Resolved, okurz, 2021-04-26)

Actions #1

Updated by okurz almost 4 years ago

  • Priority changed from Normal to High

Hm, actually it seems that sometimes, or maybe always, the submission pipeline is in fact blocked.

I removed the schedule part from https://openqa.opensuse.org/admin/job_templates/24 to allow the important fix for perl-Mojolicious to be submitted:

    - openqa_from_containers:
        testsuite: null
        settings:
          OPENQA_CONTAINERS: '1'
          OPENQA_FROM_GIT: '1' # see: main.pm, avoid load_osautoinst_tests
        description: >-
          Maintainer: okurz@suse.de Test for running openQA itself from containers. To be used with "openqa"
          distri.
Actions #2

Updated by livdywan almost 4 years ago

I can see a failure in the 3 most recent worker tests:

# Test died: command 'docker logs openqa_worker 2>&1 | grep "API key and secret are needed" >/dev/null' failed at /var/lib/openqa/cache/openqa1-opensuse/tests/openqa/lib/utils.pm line 100.

Maybe something for @ilausuch to take a look at. I guess the expected log message is absent here, meaning the credentials are already set or the connection isn't coming up at all 🤔

Actions #3

Updated by Xiaojing_liu almost 4 years ago

  • Status changed from Workable to In Progress
  • Assignee set to Xiaojing_liu
Actions #4

Updated by openqa_review almost 4 years ago

  • Due date set to 2021-03-25

Setting due date based on mean cycle time of SUSE QE Tools

Actions #5

Updated by livdywan almost 4 years ago

Jane, Ivan and I discussed this a bit. Some notes from that:

  • https://github.com/os-autoinst/os-autoinst-distri-openQA/blob/master/lib/utils.pm#L94
  • in the logs you (don't) find this:
    • [debug] --------------------------[2021-03-11T10:29:03.624 CET] [debug] /tests/containers/worker.pm:10 called utils::wait_for_container_log -> lib/utils.pm:95
    • $cmd log ... returns no logs
    • Can we conditionally output all logs if the ...grep failed?
  • groupmod GID '0' already exists
    • 0 is passed via groupmod -g 0 kvm which may not be the kvm group
    • shouldn't we do groupmod kvm with no ID?
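
A minimal sketch of the "output all logs if the grep failed" idea from the notes above. It is written in Python rather than the Perl of lib/utils.pm, and the names `wait_for_container_log`, `docker_logs`, `attempts` and `delay` as used here are illustrative assumptions, not the actual utils::wait_for_container_log implementation:

```python
import re
import subprocess
import time
from typing import Callable


def docker_logs(container: str) -> str:
    """Return stdout and stderr of `docker logs <container>` as one string."""
    completed = subprocess.run(
        ["docker", "logs", container],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    return completed.stdout


def wait_for_container_log(
    fetch_logs: Callable[[], str],
    pattern: str,
    attempts: int = 5,
    delay: float = 1.0,
) -> str:
    """Poll the logs until `pattern` appears; on failure, report the full logs."""
    logs = ""
    for attempt in range(attempts):
        logs = fetch_logs()
        if re.search(pattern, logs):
            return logs
        if attempt < attempts - 1:
            time.sleep(delay)
    # The expected message never showed up: include everything captured so
    # the failure can be diagnosed, instead of reporting a bare grep failure.
    raise RuntimeError(
        f"pattern {pattern!r} not found after {attempts} attempts; full logs:\n{logs}"
    )
```

With this shape, a call such as `wait_for_container_log(lambda: docker_logs("openqa_worker"), "API key and secret are needed")` would surface a message like "groupmod GID '0' already exists" directly in the failure output.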
Actions #6

Updated by ilausuch almost 4 years ago

Fixed the entrypoint
https://github.com/os-autoinst/openQA/pull/3787

What remains is to fix the test.

Actions #7

Updated by livdywan over 3 years ago

  • Assignee changed from Xiaojing_liu to ilausuch

ilausuch wrote:

Fixed the entrypoint
https://github.com/os-autoinst/openQA/pull/3787

Since Jane's fix for validate_script_output got merged, I assume we're waiting for the "groupmod GID '0' already exists" issue to be resolved before we can re-enable the tests?

Actions #8

Updated by Xiaojing_liu over 3 years ago

The new PR has been merged. I ran a test: when the "groupmod GID '0' already exists" error does not occur, the job passes. See an example: https://openqa.opensuse.org/tests/1672546#
So after https://github.com/os-autoinst/openQA/pull/3787 is merged, we can add the test back.

Actions #9

Updated by livdywan over 3 years ago

  • Due date changed from 2021-03-25 to 2021-04-01

Pushing back the due date because of hackweek

Xiaojing_liu wrote:

The new PR has been merged. I ran a test: when the "groupmod GID '0' already exists" error does not occur, the job passes. See an example: https://openqa.opensuse.org/tests/1672546#
So after https://github.com/os-autoinst/openQA/pull/3787 is merged, we can add the test back.

@ilausuch Are you going to add the test back?

Actions #11

Updated by livdywan over 3 years ago

  • Due date changed from 2021-04-01 to 2021-04-09

ilausuch wrote:

https://github.com/os-autoinst/openQA/pull/3787 is under review

The PR got merged - what's the status of the openQA tests now? Could you please comment here on what, if anything, is still to be done, and update the status as needed?

Actions #12

Updated by ilausuch over 3 years ago

I created a test to show that this works now:
https://openqa.opensuse.org/tests/1696773#
It runs this PR: https://github.com/os-autoinst/os-autoinst-distri-openQA/pull/65

I created this test with the environment variable OPENQA_CONTAINERS=1:
https://openqa.opensuse.org/tests/1696822

Actions #13

Updated by ilausuch over 3 years ago

Could we activate the test in the scheduler again?

Actions #14

Updated by okurz over 3 years ago

Sure, please try that yourself. Basically, undo the changes from #88754#note-1

Actions #16

Updated by ilausuch over 3 years ago

I found that it occasionally fails in the same way as #90614. I am preparing the same solution: retrying when building the container images.

See: https://openqa.opensuse.org/tests/1700287#step/build/5
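
A minimal sketch of the retry approach, assuming the build failures are transient. The helper name `retry` and its parameters are hypothetical, and this is not the code used for #90614 or in the test distribution:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(action: Callable[[], T], attempts: int = 3, delay: float = 5.0) -> T:
    """Run `action`, retrying after any exception, at most `attempts` times."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as error:
            if attempt == attempts:
                # Out of retries: let the last error propagate to the caller.
                raise
            print(f"attempt {attempt}/{attempts} failed ({error}); retrying in {delay}s")
            time.sleep(delay)
    raise AssertionError("unreachable")
```

For an image build, `action` could wrap something like `subprocess.run(["docker", "build", "-t", "openqa_worker", "."], check=True)`, where the image tag and build context are placeholders.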

Actions #19

Updated by ilausuch over 3 years ago

In a training session with Oliver and Christian, we identified a problem that was affecting the container tests. This was the first time it failed: https://openqa.opensuse.org/tests/overview?distri=openqa&version=Tumbleweed&build=%3ATW.7835&groupid=24

And we created a needle to solve that
https://github.com/os-autoinst/os-autoinst-needles-openQA/commit/10eeb87d6a33aca10d1f1d5cff3145cacd802617

This is the running test with the new needle
https://openqa.opensuse.org/tests/overview?distri=openqa&version=Tumbleweed&build=%3ATW.7865&groupid=24

Actions #20

Updated by ilausuch over 3 years ago

  • Due date changed from 2021-04-09 to 2021-04-23
Actions #21

Updated by ilausuch over 3 years ago

The next step is to verify that "failures in tests prevent the submission of new packages" works, by provoking a failure manually

Actions #22

Updated by ilausuch over 3 years ago

  • Status changed from In Progress to Blocked
  • Assignee deleted (ilausuch)

I am unable to change the parameters to force a failure and test this. Could someone with Jenkins experience please check this out?

Actions #23

Updated by okurz over 3 years ago

  • Status changed from Blocked to Workable

Please use "Blocked" only with an assignee who tracks the blocker. Also, only other tickets count as blockers

Actions #24

Updated by ilausuch over 3 years ago

  • Assignee set to ilausuch
Actions #25

Updated by ilausuch over 3 years ago

  • Status changed from Workable to Blocked

Blocked by #91752

Actions #26

Updated by livdywan over 3 years ago

  • Blocked by action #91752: jenkins: Multiple missing fields and errors in configuration of openQA-in-openQA added
Actions #27

Updated by ilausuch over 3 years ago

  • Due date deleted (2021-04-23)
Actions #28

Updated by okurz over 3 years ago

  • Status changed from Blocked to Workable

blocker #91752 resolved

Actions #29

Updated by livdywan over 3 years ago

  • Description updated (diff)
Actions #30

Updated by livdywan over 3 years ago

  • Status changed from Workable to In Progress
  • Assignee changed from ilausuch to mkittler
Actions #31

Updated by mkittler over 3 years ago

PR for first suggestion: https://github.com/os-autoinst/scripts/pull/84
This leaves only the last suggestion.

Actions #32

Updated by openqa_review over 3 years ago

  • Due date set to 2021-07-08

Setting due date based on mean cycle time of SUSE QE Tools

Actions #33

Updated by mkittler over 3 years ago

  • Status changed from In Progress to Resolved

I have tested the change from the PR locally against multiple jobs from o3 and it seemed to work: if one of the jobs fails, it exits with a non-zero return code.

I've also re-triggered the Jenkins job and it failed (as expected, since one of the previously triggered openQA jobs failed), leaving a comment on OBS, see:

Note that copying the file with the job IDs from trigger-openQA_in_openQA-TW works. It is currently not shown under http://jenkins.qa.suse.de/job/monitor-openQA_in_openQA-TW/ because there hasn't been a successful run since the change, and the page shows only the artifact produced by the last successful run. The console log clearly shows, though, that the expected jobs have been considered (+ echo 'Result of job 1802764: failed', + echo 'Result of job 1802766: passed', + echo 'Result of job 1802766: passed').
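
A minimal sketch of the gating behaviour described above, in which the monitor step exits non-zero as soon as any considered job did not pass. This is not the code from the os-autoinst/scripts PR. The function name `exit_code_for` is hypothetical, and treating "softfailed" as acceptable and an empty job list as a failure are assumptions:

```python
from typing import Mapping

# Assumption: only these openQA results let the submission proceed.
OK_RESULTS = frozenset({"passed", "softfailed"})


def exit_code_for(results: Mapping[int, str]) -> int:
    """Return 0 only if there is at least one job and every result is acceptable."""
    for job_id, result in results.items():
        print(f"Result of job {job_id}: {result}")
    failed = [job_id for job_id, result in results.items() if result not in OK_RESULTS]
    # Assumption: having no jobs to check is an error, not a silent pass.
    if not results or failed:
        return 1
    return 0
```

A wrapper script would end with `sys.exit(exit_code_for(results))`, so that the Jenkins job turns red whenever one of the monitored openQA jobs fails.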
