action #98898

closed

`t/05-scheduler-full.t` sometimes fails in CircleCI size:M

Added by mkittler over 2 years ago. Updated over 2 years ago.

Status: Resolved
Priority: Urgent
Assignee: mkittler
Category: Regressions/Crashes
Target version: Ready
Start date: 2021-09-20
Due date: 2021-10-06
% Done: 0%
Estimated time:
Description

The test is already deemed unstable, but now it also fails repeatedly despite the retry we have for unstable tests. There currently seem to be two types of failures, which may or may not share the same cause.


Test Summary Report
-------------------
t/05-scheduler-full.t (Wstat: 256 Tests: 5 Failed: 1)
  Failed test:  3
  Non-zero exit status: 1
Files=2, Tests=51, 598.414 wallclock secs ( 1.77 usr  0.10 sys + 501.36 cusr 22.31 csys = 525.54 CPU)
Result: FAIL
Retry 5 of 5 …
[16:38:38] t/05-scheduler-full.t .. 2/?
    #   Failed test 'running job set to done if its worker re-connects claiming not to work on it anymore'
    #   at t/05-scheduler-full.t line 219.
    #          got: 'running'
    #     expected: 'done'

    #   Failed test 'running job incompleted if its worker re-connects claiming not to work on it anymore'
    #   at t/05-scheduler-full.t line 221.
    #          got: 'none'
    #     expected: 'incomplete'

    #   Failed test 'reason is set'
    #   at t/05-scheduler-full.t line 223.
    #                   undef
    #     doesn't match '(?^:abandoned: associated worker .+:\d+ re-connected but abandoned the job)'
    # Looks like you failed 3 tests of 12.
[16:38:38] t/05-scheduler-full.t .. 3/?
#   Failed test 're-scheduling and incompletion of jobs when worker rejects jobs or goes offline'
#   at t/05-scheduler-full.t line 227.
[16:38:38] t/05-scheduler-full.t .. 4/? [16:38:38] t/05-scheduler-full.t .. 5/? # Looks like you failed 1 test of 5.
                                         [16:38:38] t/05-scheduler-full.t .. Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/5 subtests

So far I have only seen the problem on https://github.com/os-autoinst/openQA/pull/4213 (https://app.circleci.com/pipelines/github/os-autoinst/openQA/7726/workflows/132858b7-d6bd-4df0-9667-50caca647581/jobs/73025), but it seems unrelated to that PR's changes because the test passes locally and re-triggering eventually helped. I've attached the detailed log.


[09:08:13] t/05-scheduler-full.t .. 3/? 
    #   Failed test 'Allocated maximum number of jobs that could have been allocated'
    #   at t/05-scheduler-full.t line 240.
    #          got: '9'
    #     expected: '10'
    # Looks like you failed 1 test of 1.

This one has been seen on multiple PRs, e.g. https://github.com/os-autoinst/openQA/pull/4208 (https://app.circleci.com/pipelines/github/os-autoinst/openQA/7736/workflows/afd32447-77c5-4f29-b1fe-aecce3dbc434/jobs/73024).

Acceptance criteria

  • AC1: scheduler-full.t is reliable in CircleCI after repeated runs

Suggestion

  • Check whether there is a hard-coded wait (e.g. 5 minutes) after which the test proceeds before the jobs have finished
  • Try to reproduce locally
  • Rough idea: insert a sleep where the "job done" handling happens, e.g. via mocking in the test, to slow that step down (see the sketch after this list)
  • If the problem cannot be reproduced locally, come up with an improvement based on the CircleCI logs and reading the code, and monitor CI to see whether it improves
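A minimal sketch of the mocking idea, assuming the "job done" logic lives in a method like OpenQA::Schema::Result::Jobs::done (the module and method names here are assumptions and may need adjusting):

use Test::MockModule;

# Artificially delay the "job done" step so races around job completion become
# easier to reproduce locally.
my $jobs_mock = Test::MockModule->new('OpenQA::Schema::Result::Jobs');
$jobs_mock->redefine(done => sub {
    my ($self, @args) = @_;
    sleep 5;    # hypothetical delay, tune as needed
    return $jobs_mock->original('done')->($self, @args);
});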

Files

05-scheduler-full.t (226 KB), uploaded by mkittler, 2021-09-20 09:44
Actions #1

Updated by mkittler over 2 years ago

  • Description updated (diff)
Actions #2

Updated by okurz over 2 years ago

  • Category set to Regressions/Crashes
  • Priority changed from Normal to Urgent
  • Target version set to Ready
Actions #3

Updated by livdywan over 2 years ago

  • Subject changed from `t/05-scheduler-full.t` sometimes fails in CircleCI to `t/05-scheduler-full.t` sometimes fails in CircleCI size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #4

Updated by dheidler over 2 years ago

  • Status changed from Workable to In Progress
  • Assignee set to dheidler
Actions #5

Updated by openqa_review over 2 years ago

  • Due date set to 2021-10-06

Setting due date based on mean cycle time of SUSE QE Tools

Actions #6

Updated by dheidler over 2 years ago

All mentioned errors are in consecutive lines - it looks like the

note 'waiting for job to be incompleted';
for (0 .. 100) {
    last if $jobs->find(99982)->state eq OpenQA::Jobs::Constants::DONE;
    sleep .2;
}

did not wait long enough.
Let's see if 200*0.2=40s is enough: https://github.com/os-autoinst/openQA/pull/4228
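Assuming the PR simply raises the iteration count (the exact bound used there may differ), the adjusted poll loop would look roughly like this:

note 'waiting for job to be incompleted';
for (0 .. 200) {
    last if $jobs->find(99982)->state eq OpenQA::Jobs::Constants::DONE;
    sleep .2;    # 200 iterations * 0.2 s ~= 40 s maximum wait
}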

Actions #7

Updated by livdywan over 2 years ago

dheidler wrote:

All mentioned errors are in consecutive lines - it looks like the

note 'waiting for job to be incompleted';
for (0 .. 100) {
    last if $jobs->find(99982)->state eq OpenQA::Jobs::Constants::DONE;
    sleep .2;
}

did not wait long enough.
Let's see if 200*0.2=40s is enough: https://github.com/os-autoinst/openQA/pull/4228

Can you replace this magic number with something better? Either make it scale, or better yet wait on a prerequisite and fail directly if that's not fulfilled. Will also comment on the PR.
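One way to do that, sketched here as an illustration rather than the actual change in the PR: wrap the polling in a helper that takes a description, a condition and a timeout, and fail the test right away when the condition is not reached in time.

use Test::More;

# Hypothetical helper: poll a condition until it holds or the timeout is hit,
# then fail the test immediately instead of silently falling through.
sub wait_for_condition {
    my ($description, $check, $timeout) = @_;
    my $waited = 0;
    while ($waited < $timeout) {
        return 1 if $check->();
        sleep 0.2;
        $waited += 0.2;
    }
    fail "timed out after ${timeout}s waiting for: $description";
    return 0;
}

# Usage in the test could then look like:
wait_for_condition('job 99982 is done',
    sub { $jobs->find(99982)->state eq OpenQA::Jobs::Constants::DONE }, 40);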

Actions #9

Updated by dheidler over 2 years ago

  • Status changed from In Progress to Feedback
Actions #10

Updated by okurz over 2 years ago

I still see the problem, e.g. in https://github.com/os-autoinst/openQA/pull/4208

Retry 1 of 5 …
[21:26:20] t/05-scheduler-full.t .. 2/? Bailout called.  Further testing stopped:  Job 99982 is done not available
FAILED--Further testing stopped: Job 99982 is done not available
Retry 2 of 5 …
[21:28:16] t/05-scheduler-full.t .. 2/? Bailout called.  Further testing stopped:  Job 99982 is done not available
FAILED--Further testing stopped: Job 99982 is done not available
Retry 3 of 5 …
[21:30:03] t/05-scheduler-full.t .. 2/? Bailout called.  Further testing stopped:  Job 99982 is done not available
FAILED--Further testing stopped: Job 99982 is done not available
Retry 4 of 5 …
[21:31:50] t/05-scheduler-full.t .. 2/? Bailout called.  Further testing stopped:  Job 99982 is done not available
FAILED--Further testing stopped: Job 99982 is done not available
Retry 5 of 5 …
[21:33:36] t/05-scheduler-full.t .. 2/? Bailout called.  Further testing stopped:  Job 99982 is done not available
FAILED--Further testing stopped: Job 99982 is done not available

from https://app.circleci.com/pipelines/github/os-autoinst/openQA/7856/workflows/ad76d514-0af0-47d6-9ebe-aaf551dcfe83/jobs/74235

Actions #11

Updated by okurz over 2 years ago

  • Status changed from Feedback to Blocked
  • Assignee changed from dheidler to okurz

Discussed in the daily on 2021-09-29; we concluded that we might be blocked by #98952, or that it at least makes sense to await updated dependencies and a stable t/full-stack.t.

Actions #12

Updated by okurz over 2 years ago

  • Status changed from Blocked to Workable
  • Assignee deleted (okurz)

The "dependency update PR" problem we commented in #98952 is fixed. There is no new dependency PR because likely there are no necessary dependency updates. I did not see the full-stack.t test failing recently. But t/05-scheduler-full.t seems to fail in most if not all pull requests so should be looked into individually.

Actions #13

Updated by livdywan over 2 years ago

  • Status changed from Workable to In Progress
  • Assignee set to livdywan

While #98952 is going slow, I might as well poke at this. I can reproduce it now as well, so I'm testing some changes to the "stepping" to see whether the test is simply unreliable or whether there is a specific regression in what is being checked.

Actions #14

Updated by livdywan over 2 years ago

Interim update: the test passes more often locally than it fails, but I'm at least consistently getting `scheduler does not consider broken worker for allocating job at t/05-scheduler-full.t line 163`.
My simple idea of re-stepping doesn't seem to help here, and it's the same on Tumbleweed and Leap.

Actions #15

Updated by livdywan over 2 years ago

  • Status changed from In Progress to Workable
  • Assignee deleted (livdywan)

Going through different commits did not help make the tests pass, i.e. recent code changes don't seem to affect it. I'm kind of running out of ideas. Maybe somebody else wants to give it a stab?

Actions #16

Updated by okurz over 2 years ago

Well, a simple approach would still be to go back to git commits from some weeks/months ago and check whether the test runs stably then.

Actions #17

Updated by livdywan over 2 years ago

okurz wrote:

Well, a simple approach would still be to go back to git commits from some weeks/months ago and check whether the test runs stably then.

See my above comment :-)

Actions #18

Updated by mkittler over 2 years ago

  • Assignee set to mkittler
Actions #19

Updated by mkittler over 2 years ago

I'm currently looking into the first of the two problems mentioned in the ticket description. The problem is that the job is apparently not associated with any worker, so the re-registration of the worker it is supposed to be associated with doesn't cause the job's incompletion:

    ok 9 - job allocated to expected worker
[debug] API key from client: *CB4F1C079FADCAE6*
[debug] [pid:2525] Updating seen of worker 5 from worker_status
[debug] Key is for user "worker"
[debug] API auth by user: worker, operator: 1
    # Starting unstable worker. Instance: 3 for host http://localhost:9526
    # PID of openqa-worker-unstable: 2552
    # Waiting for worker with ID 5
[info] [pid:2526] Got status update for job 99982 and worker 5 but there is not even a worker assigned to this job (job is running)
[debug] [0hdoojO-nUDN] 400 Bad Request (0s, ??/s)
[debug] [pid:2525] Rescheduling jobs assigned to worker 5
    # waiting for job to be incompleted
[info] [pid:2552] worker 3:
 - config file:           /tmp/b5Cu6Nkflr/scheduler/config/workers.ini
 - worker hostname:       95b38fd71521
 - isotovideo version:    0
 - websocket API version: 1
 - web UI hosts:          http://localhost:9526
 - class:                 ?
 - no cleanup:            no
 - pool directory:        /tmp/b5Cu6Nkflr/scheduler/openqa/pool/3
[info] [pid:2552] Project dir for host http://localhost:9526 is /tmp/b5Cu6Nkflr/scheduler/openqa/share
[info] [pid:2552] Registering with openQA http://localhost:9526
[warn] [pid:2525] Worker 5 accepted job 99982 which was never assigned to it or has already finished
[debug] [pid:2525] Sent population to worker: { population => 5, type => "info" }
[info] [pid:2525] Worker 5 websocket connection closed - 1006
[debug] API key from client: *CB4F1C079FADCAE6*
[debug] Key is for user "worker"
[debug] API auth by user: worker, operator: 1
[debug] [q2bN6or9HBvf] 200 OK (0s, ??/s)
[info] [pid:2552] Establishing ws connection via ws://localhost:9526/api/v1/ws/5
[debug] [d_Q73YZHi0A0] 302 Found (0s, ??/s)
[debug] [MF8wN-hbpUP_] 101 Switching Protocols (0s, ??/s)
[info] [pid:2552] Registered and connected via websockets with openQA host http://localhost:9526 and worker ID 5
[debug] [pid:2525] Received from worker "5" worker_status message "{
  "status" => "free",
  "type" => "worker_status"
}

It looks like this problem happens because the worker actually takes the job and incompletes it beforehand (see the log below), so the job is no longer assigned by the time the failing test assertion is reached:

[debug] [pid:2550] Accepting job 99982 from http://localhost:9526.
[debug] [pid:2550] Setting job 99982 from http://localhost:9526 up
[debug] [pid:2550] Preparing Mojo::IOLoop::ReadWriteProcess::Session
[info] [pid:2550] +++ setup notes +++
[info] [pid:2550] Running on 95b38fd71521:3 (Linux 4.15.0-1106-aws #113-Ubuntu SMP Fri Jun 18 21:31:27 UTC 2021 x86_64)
[debug] [pid:2550] Job settings:
[debug] [pid:2550] 
    ARCH=i686
    BUILD=0091
    DESKTOP=gnome
    DISTRI=opensuse
    FLAVOR=GNOME-Live
    GNOME=1
    INSTALLONLY=1
    ISO=openSUSE-13.1-GNOME-Live-i686-Build0091-Media.iso
    ISO_MAXSIZE=999999999
    JOBTOKEN=e2u7qrZtYVCcLlr1
    LIVECD=1
    LOG_LEVEL=debug
    MACHINE=32bit
    NAME=00099982-opensuse-13.1-GNOME-Live-i686-Build0091-RAID0@32bit
    OPENQA_HOSTNAME=localhost:9526
    OPENQA_URL=http://localhost:9526
    PRJDIR=/tmp/b5Cu6Nkflr/scheduler/openqa/share
    QEMUCPU=qemu32
    QEMUPORT=20032
    RAIDLEVEL=0
    RETRY_DELAY=5
    RETRY_DELAY_IF_WEBUI_BUSY=60
    TEST=RAID0
    VERSION=13.1
    VNC=93
    WORKER_HOSTNAME=127.0.0.1
    WORKER_ID=5
    WORKER_INSTANCE=3
[error] [pid:2550] ISO handling Cannot find ISO asset iso/openSUSE-13.1-GNOME-Live-i686-Build0091-Media.iso!
[error] [pid:2550] Unable to setup job 99982: Cannot find ISO asset iso/openSUSE-13.1-GNOME-Live-i686-Build0091-Media.iso!
[debug] [pid:2550] Stopping job 99982 from http://localhost:9526: 00099982-opensuse-13.1-GNOME-Live-i686-Build0091-RAID0@32bit - reason: setup failure
[debug] [pid:2550] REST-API call: POST http://localhost:9526/api/v1/jobs/99982/status
Actions #20

Updated by mkittler over 2 years ago

PR to address the first problem: https://github.com/os-autoinst/openQA/pull/4276


The second problem is a bit harder because I pasted the wrong CircleCI URL when creating the ticket, and the current CI run of the PR only shows the first problem (although it looks different because of https://github.com/os-autoinst/openQA/pull/4240).

Actions #21

Updated by mkittler over 2 years ago

  • Status changed from Workable to In Progress
Actions #22

Updated by mkittler over 2 years ago

  • Description updated (diff)
Actions #23

Updated by okurz over 2 years ago

  • Status changed from In Progress to Resolved

Let's hope that https://github.com/os-autoinst/openQA/pull/4276 did the job. As this concerns the self-tests and we should be able to find this ticket again in case of recurring problems, we can resolve here.
