action #88754

openQA-in-openQA tests always fail and results do not impact submission pipeline

Added by okurz 8 months ago. Updated 4 months ago.

Status: Resolved
Priority: High
Assignee:
Category: Concrete Bugs
Target version:
Start date: 2021-02-18
Due date: 2021-07-08
% Done: 0%
Estimated time:
Difficulty:

Description

Observation

https://openqa.opensuse.org/tests/latest?arch=x86_64&distri=openqa&flavor=dev&machine=64bit-2G&test=openqa_from_containers&version=Tumbleweed#next_previous shows a long history of jobs, most of which failed in various steps, but the results are completely ignored in http://jenkins.qa.suse.de/view/openQA-in-openQA/

Expected results

  • pipeline reliable and mostly green
  • failures in tests prevent the submission of new packages

Suggestion


Related issues

Blocked by openQA Project - action #91752: jenkins: Multiple missing fields and errors in configuration of openQA-in-openQA (Resolved, 2021-04-26)

History

#1 Updated by okurz 8 months ago

  • Priority changed from Normal to High

Hm, actually it seems that sometimes, or maybe always, the submission pipeline is in fact blocked.

I removed the schedule part from https://openqa.opensuse.org/admin/job_templates/24 to allow the important fix for perl-Mojolicious to be submitted:

    - openqa_from_containers:
        testsuite: null
        settings:
          OPENQA_CONTAINERS: '1'
          OPENQA_FROM_GIT: '1' # see: main.pm, avoid load_osautoinst_tests
        description: >-
          Maintainer: okurz@suse.de Test for running openQA itself from containers. To be used with "openqa"
          distri.

#2 Updated by cdywan 8 months ago

I can see a failure in the 3 most recent worker tests:

# Test died: command 'docker logs openqa_worker 2>&1 | grep "API key and secret are needed" >/dev/null' failed at /var/lib/openqa/cache/openqa1-opensuse/tests/openqa/lib/utils.pm line 100.

Maybe something for ilausuch to take a look at. I guess the expected log message is absent here, meaning the credentials are already set or the connection isn't coming up at all 🤔

#3 Updated by Xiaojing_liu 7 months ago

  • Status changed from Workable to In Progress
  • Assignee set to Xiaojing_liu

#4 Updated by openqa_review 7 months ago

  • Due date set to 2021-03-25

Setting due date based on mean cycle time of SUSE QE Tools

#5 Updated by cdywan 7 months ago

Jane, Ivan and I were discussing this together a bit, some notes from that:

  • https://github.com/os-autoinst/os-autoinst-distri-openQA/blob/master/lib/utils.pm#L94
  • in the logs you (don't) find this:
    • [debug] --------------------------[2021-03-11T10:29:03.624 CET] [debug] /tests/containers/worker.pm:10 called utils::wait_for_container_log -> lib/utils.pm:95
    • $cmd log ... returns no logs
    • Can we conditionally output all logs if the ...grep failed?
  • groupmod GID '0' already exists
    • 0 is passed via groupmod -g 0 kvm which may not be the kvm group
    • shouldn't we do groupmod kvm with no ID?
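
The idea of printing the full container log only when the grep fails could be sketched roughly as follows. This is an illustrative Python sketch, not the actual Perl code in lib/utils.pm: the function name `check_container_log` and its parameters are hypothetical, and the log text is passed in as a string instead of being read from `docker logs`.

```python
def check_container_log(log_text, expected):
    """Return (found, diagnostics) for an expected message in a container log.

    Hypothetical helper: `log_text` stands in for the captured output of
    `docker logs <container> 2>&1`.
    """
    if expected in log_text:
        # Message present: no need to clutter the test output.
        return True, ""
    # Message absent: include the complete log so the failure can be debugged.
    if log_text.strip():
        details = log_text
    else:
        details = "(container produced no log output)"
    return False, f"expected message {expected!r} not found; full log:\n{details}"


found, diagnostics = check_container_log("", "API key and secret are needed")
print(found)
print(diagnostics)
```

With an empty log, as in the case noted above, the diagnostics say so explicitly, which would help distinguish "message missing" from "no logs at all".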

#6 Updated by ilausuch 7 months ago

Fixed the entrypoint
https://github.com/os-autoinst/openQA/pull/3787

What remains is to fix the test.

#7 Updated by cdywan 7 months ago

  • Assignee changed from Xiaojing_liu to ilausuch

ilausuch wrote:

Fixed the entrypoint
https://github.com/os-autoinst/openQA/pull/3787

Since Jane's fix for validate_script_output got merged, I assume we're waiting for the groupmod GID '0' already exists issue to be resolved before we can re-enable the tests?

#8 Updated by Xiaojing_liu 7 months ago

The new PR has been merged. I ran a test: if the groupmod GID '0' already exists error does not occur, the job passes. See an example: https://openqa.opensuse.org/tests/1672546#
So after https://github.com/os-autoinst/openQA/pull/3787 got merged, we could add the test back.

#9 Updated by cdywan 7 months ago

  • Due date changed from 2021-03-25 to 2021-04-01

Postponing the due date due to hackweek

Xiaojing_liu wrote:

The new pr has been merged. I did a test if there is no groupmod GID '0' already exists, the job will pass. See an example: https://openqa.opensuse.org/tests/1672546#
So after https://github.com/os-autoinst/openQA/pull/3787 got merged, we could add the test back.

ilausuch Are you going to add the test back?

#11 Updated by cdywan 6 months ago

  • Due date changed from 2021-04-01 to 2021-04-09

ilausuch wrote:

https://github.com/os-autoinst/openQA/pull/3787 is under review

The PR got merged. What's the status of the openQA tests now? Could you please comment here on what, if anything, is still to be done, and update the status as needed?

#12 Updated by ilausuch 6 months ago

I created a test to prove that this works now
https://openqa.opensuse.org/tests/1696773#
Running this PR https://github.com/os-autoinst/os-autoinst-distri-openQA/pull/65

I created this test with the env variable OPENQA_CONTAINERS=1
https://openqa.opensuse.org/tests/1696822

#13 Updated by ilausuch 6 months ago

Could we activate the test in the scheduler again?

#14 Updated by okurz 6 months ago

Sure, please try that yourself. Basically undoing the changes from #88754#note-1

#16 Updated by ilausuch 6 months ago

I found that it occasionally fails in the same way as #90614. I am preparing the same solution: retrying when building the container images.

See: https://openqa.opensuse.org/tests/1700287#step/build/5
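
A retry around a flaky build step, as mentioned above, might look like the following. This is only a sketch: `retry`, its parameters, and `flaky_build` are hypothetical names used for illustration, and the actual fix for #90614 is not shown in this ticket.

```python
import time


def retry(action, attempts=3, delay=0.0):
    """Call `action` until it succeeds or `attempts` calls have failed.

    Hypothetical helper: `action` stands in for a container image build and
    signals failure by raising an exception.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as error:
            if attempt == attempts:
                # Out of retries: let the last error propagate.
                raise
            print(f"attempt {attempt} failed ({error}), retrying")
            time.sleep(delay)


calls = {"count": 0}


def flaky_build():
    # Simulated build that fails twice before succeeding.
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("transient build error")
    return "built"


print(retry(flaky_build))
```

Re-raising the last error once the retries are exhausted keeps a persistent build failure visible instead of hiding it behind the retry.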

#19 Updated by ilausuch 6 months ago

In a training session with Oliver and Christian we identified a problem that was affecting the container tests. This was the first time it failed: https://openqa.opensuse.org/tests/overview?distri=openqa&version=Tumbleweed&build=%3ATW.7835&groupid=24

And we created a needle to solve that
https://github.com/os-autoinst/os-autoinst-needles-openQA/commit/10eeb87d6a33aca10d1f1d5cff3145cacd802617

This is the running test with the new needle
https://openqa.opensuse.org/tests/overview?distri=openqa&version=Tumbleweed&build=%3ATW.7865&groupid=24

#20 Updated by ilausuch 6 months ago

  • Due date changed from 2021-04-09 to 2021-04-23

#21 Updated by ilausuch 6 months ago

The next step is to ensure that "failures in tests prevent the submission of new packages" works, by manually triggering a failure.

#22 Updated by ilausuch 6 months ago

  • Status changed from In Progress to Blocked
  • Assignee deleted (ilausuch)

I am unable to change the parameters to force the failure to test this. Could someone with Jenkins experience please check this out?

#23 Updated by okurz 6 months ago

  • Status changed from Blocked to Workable

Please use "Blocked" only with an assignee to track any blocker. Also, only other tickets count as blockers.

#24 Updated by ilausuch 6 months ago

  • Assignee set to ilausuch

#25 Updated by ilausuch 6 months ago

  • Status changed from Workable to Blocked

Blocked by #91752

#26 Updated by cdywan 6 months ago

  • Blocked by action #91752: jenkins: Multiple missing fields and errors in configuration of openQA-in-openQA added

#27 Updated by ilausuch 6 months ago

  • Due date deleted (2021-04-23)

#28 Updated by okurz 4 months ago

  • Status changed from Blocked to Workable

blocker #91752 resolved

#29 Updated by cdywan 4 months ago

  • Description updated (diff)

#30 Updated by cdywan 4 months ago

  • Status changed from Workable to In Progress
  • Assignee changed from ilausuch to mkittler

#31 Updated by mkittler 4 months ago

PR for first suggestion: https://github.com/os-autoinst/scripts/pull/84
This leaves only the last suggestion.

#32 Updated by openqa_review 4 months ago

  • Due date set to 2021-07-08

Setting due date based on mean cycle time of SUSE QE Tools

#33 Updated by mkittler 4 months ago

  • Status changed from In Progress to Resolved

I have tested the change from the PR locally against multiple jobs from o3 and it seemed to work, e.g. if one of the jobs fails it'll exit with a non-zero return code.

I've also re-triggered the Jenkins job and it failed (as expected, since one of the previously triggered openQA jobs failed), leaving a comment on OBS, see:

Note that copying the file with the job IDs from trigger-openQA_in_openQA-TW works. It is currently not shown under http://jenkins.qa.suse.de/job/monitor-openQA_in_openQA-TW/ because since the change there hasn't been a successful run and it shows only the artifact produced by the last successful run. The console log shows clearly that the expected jobs have been considered (+ echo 'Result of job 1802764: failed', + echo 'Result of job 1802766: passed', + echo 'Result of job 1802766: passed'), though.
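
The behaviour described here, reporting each job's result and exiting non-zero if any job failed, can be sketched as follows. This is a sketch under assumptions, not the script from the PR: `summarize_results` is a hypothetical name, the results are passed in directly instead of being fetched from openQA, and treating every result other than `passed` as a failure is an assumption.

```python
def summarize_results(results):
    """Print one line per job and return a process exit code.

    Hypothetical helper: `results` maps openQA job IDs to result strings.
    Assumption: any result other than "passed" counts as a failure.
    """
    exit_code = 0
    for job_id, result in results.items():
        print(f"Result of job {job_id}: {result}")
        if result != "passed":
            exit_code = 1
    return exit_code


code = summarize_results({1802764: "failed", 1802766: "passed"})
print(f"exit code: {code}")
```

In a real script the returned value would be handed to `sys.exit`, so that the Jenkins job fails whenever one of the monitored openQA jobs did not pass.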
