action #98898
closed: `t/05-scheduler-full.t` sometimes fails in CircleCI size:M
Description
The test is already deemed unstable but now also fails repeatedly despite the retry we have for unstable tests. There currently seem to be two types of failures, which may or may not have the same cause.
Test Summary Report
-------------------
t/05-scheduler-full.t (Wstat: 256 Tests: 5 Failed: 1)
Failed test: 3
Non-zero exit status: 1
Files=2, Tests=51, 598.414 wallclock secs ( 1.77 usr 0.10 sys + 501.36 cusr 22.31 csys = 525.54 CPU)
Result: FAIL
Retry 5 of 5 …
[16:38:38] t/05-scheduler-full.t .. 2/?
# Failed test 'running job set to done if its worker re-connects claiming not to work on it anymore'
# at t/05-scheduler-full.t line 219.
# got: 'running'
# expected: 'done'
# Failed test 'running job incompleted if its worker re-connects claiming not to work on it anymore'
# at t/05-scheduler-full.t line 221.
# got: 'none'
# expected: 'incomplete'
# Failed test 'reason is set'
# at t/05-scheduler-full.t line 223.
# undef
# doesn't match '(?^:abandoned: associated worker .+:\d+ re-connected but abandoned the job)'
# Looks like you failed 3 tests of 12.
[16:38:38] t/05-scheduler-full.t .. 3/?
# Failed test 're-scheduling and incompletion of jobs when worker rejects jobs or goes offline'
# at t/05-scheduler-full.t line 227.
[16:38:38] t/05-scheduler-full.t .. 4/? [16:38:38] t/05-scheduler-full.t .. 5/? # Looks like you failed 1 test of 5.
[16:38:38] t/05-scheduler-full.t .. Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/5 subtests
So far I have only seen the problem on https://github.com/os-autoinst/openQA/pull/4213 (https://app.circleci.com/pipelines/github/os-autoinst/openQA/7726/workflows/132858b7-d6bd-4df0-9667-50caca647581/jobs/73025), but it seems unrelated to that PR because the test passes locally and re-triggering eventually helped. I've attached the detailed log.
[09:08:13] t/05-scheduler-full.t .. 3/?
# Failed test 'Allocated maximum number of jobs that could have been allocated'
# at t/05-scheduler-full.t line 240.
# got: '9'
# expected: '10'
# Looks like you failed 1 test of 1.
This one has been seen on multiple PRs, e.g. https://github.com/os-autoinst/openQA/pull/4208 (https://app.circleci.com/pipelines/github/os-autoinst/openQA/7736/workflows/afd32447-77c5-4f29-b1fe-aecce3dbc434/jobs/73024).
Acceptance criteria
- AC1: t/05-scheduler-full.t is reliable in CircleCI after repeated runs
Suggestions
- Check if there's a hard-coded wait, e.g. 5 minutes, after which the test proceeds before the jobs have finished
- Try to reproduce locally
- Rough idea: try to insert a sleep where the job completion happens, e.g. via mocking in the test, to make it slow (see the sketch after this list)
- If it can't be reproduced locally, come up with an improvement based on the CircleCI logs and reading the code, and monitor CI to see if it improves
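A minimal sketch of that mocking idea; the assumption that job completion goes through OpenQA::Schema::Result::Jobs::done as well as the 2 s delay are mine and would need to be verified:
use Test::MockModule;

my $jobs_mock     = Test::MockModule->new('OpenQA::Schema::Result::Jobs');
my $original_done = $jobs_mock->original('done');
$jobs_mock->redefine(
    done => sub {
        sleep 2;    # artificial delay to widen the race window around "job done"
        return $original_done->(@_);
    });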
Updated by okurz about 3 years ago
- Category set to Regressions/Crashes
- Priority changed from Normal to Urgent
- Target version set to Ready
Updated by livdywan about 3 years ago
- Subject changed from `t/05-scheduler-full.t` sometimes fails in CircleCI to `t/05-scheduler-full.t` sometimes fails in CircleCI size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by dheidler about 3 years ago
- Status changed from Workable to In Progress
- Assignee set to dheidler
Updated by openqa_review about 3 years ago
- Due date set to 2021-10-06
Setting due date based on mean cycle time of SUSE QE Tools
Updated by dheidler about 3 years ago
All mentioned errors are in consecutive lines - it looks like the
note 'waiting for job to be incompleted';
for (0 .. 100) {
last if $jobs->find(99982)->state eq OpenQA::Jobs::Constants::DONE;
sleep .2;
}
did not wait long enough (at most about 100 * 0.2 s = 20 s).
Let's see if 200 * 0.2 s = 40 s is enough: https://github.com/os-autoinst/openQA/pull/4228
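For reference, that roughly means bumping the loop bound (just a sketch of the idea; see the PR for the actual change):
note 'waiting for job to be incompleted';
for (0 .. 200) {    # up to about 200 * 0.2 s = 40 s
    last if $jobs->find(99982)->state eq OpenQA::Jobs::Constants::DONE;
    sleep .2;
}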
Updated by livdywan about 3 years ago
dheidler wrote:
All mentioned errors are in consecutive lines - it looks like the
note 'waiting for job to be incompleted';
for (0 .. 100) {
    last if $jobs->find(99982)->state eq OpenQA::Jobs::Constants::DONE;
    sleep .2;
}
did not wait long enough (at most about 100 * 0.2 s = 20 s).
Let's see if 200 * 0.2 s = 40 s is enough: https://github.com/os-autoinst/openQA/pull/4228
Can you replace this magic number with something better? Either make it scale, or better yet wait on a prerequisite and fail directly if that's not fulfilled. Will also comment on the PR.
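Something along these lines, as an untested sketch (the helper name and the 60 s limit are placeholders I made up, not existing openQA test API):
use Test::More;
use Time::HiRes 'sleep';

sub wait_for_job_state {
    my ($jobs, $job_id, $expected_state, $timeout) = @_;
    $timeout //= 60;
    my $deadline = time + $timeout;
    while (time < $deadline) {
        return 1 if $jobs->find($job_id)->state eq $expected_state;
        sleep 0.2;    # poll until the prerequisite holds instead of counting iterations
    }
    return 0;
}

wait_for_job_state($jobs, 99982, OpenQA::Jobs::Constants::DONE)
  or BAIL_OUT 'job 99982 never reached state "done"';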
Updated by dheidler about 3 years ago
- Status changed from In Progress to Feedback
Updated by okurz about 3 years ago
I still see the problem, e.g. in https://github.com/os-autoinst/openQA/pull/4208
Retry 1 of 5 …
[21:26:20] t/05-scheduler-full.t .. 2/? Bailout called. Further testing stopped: Job 99982 is done not available
FAILED--Further testing stopped: Job 99982 is done not available
Retry 2 of 5 …
[21:28:16] t/05-scheduler-full.t .. 2/? Bailout called. Further testing stopped: Job 99982 is done not available
FAILED--Further testing stopped: Job 99982 is done not available
Retry 3 of 5 …
[21:30:03] t/05-scheduler-full.t .. 2/? Bailout called. Further testing stopped: Job 99982 is done not available
FAILED--Further testing stopped: Job 99982 is done not available
Retry 4 of 5 …
[21:31:50] t/05-scheduler-full.t .. 2/? Bailout called. Further testing stopped: Job 99982 is done not available
FAILED--Further testing stopped: Job 99982 is done not available
Retry 5 of 5 …
[21:33:36] t/05-scheduler-full.t .. 2/? Bailout called. Further testing stopped: Job 99982 is done not available
FAILED--Further testing stopped: Job 99982 is done not available
Updated by okurz about 3 years ago
- Status changed from Feedback to Blocked
- Assignee changed from dheidler to okurz
Discussed in the daily 2021-09-29: we concluded that we might be blocked by #98952, or that it at least makes sense to await updated dependencies and a stable t/full-stack.t.
Updated by okurz about 3 years ago
- Status changed from Blocked to Workable
- Assignee deleted (okurz)
The "dependency update PR" problem we commented in #98952 is fixed. There is no new dependency PR because likely there are no necessary dependency updates. I did not see the full-stack.t test failing recently. But t/05-scheduler-full.t seems to fail in most if not all pull requests so should be looked into individually.
Updated by livdywan about 3 years ago
- Status changed from Workable to In Progress
- Assignee set to livdywan
While #98952 is going slowly, I might as well poke at this. I can reproduce it now as well, so I'm testing some changes to the "stepping" to see whether the checks are simply flaky or whether there is a specific regression in what is being checked.
Updated by livdywan about 3 years ago
Interim update: the test passes more often locally than it fails, but at least I'm consistently getting `scheduler does not consider broken worker for allocating job at t/05-scheduler-full.t line 163`.
My simple idea of re-stepping doesn't seem to help here. And it's the same on Tumbleweed and Leap.
Updated by livdywan about 3 years ago
- Status changed from In Progress to Workable
- Assignee deleted (livdywan)
Going through different commits did not help make the tests pass, i.e. it doesn't seem like recent code changes affect it. I'm kind of running out of ideas. Maybe somebody else wants to give it a stab?
Updated by okurz about 3 years ago
Well, a simple approach would still be to go back to git commits from some weeks/months ago and check whether the test runs stably then.
Updated by livdywan about 3 years ago
okurz wrote:
Well, a simple approach would still be to go back to git commits from some weeks/months ago and check whether the test runs stably then.
See my above comment :-)
Updated by mkittler about 3 years ago
I'm currently looking into the first of the two problems mentioned in the ticket description. The problem is that the job is apparently not associated with any worker, so the re-registration of the worker it is supposed to be associated with doesn't cause the job's incompletion:
ok 9 - job allocated to expected worker
[debug] API key from client: *CB4F1C079FADCAE6*
[debug] [pid:2525] Updating seen of worker 5 from worker_status
[debug] Key is for user "worker"
[debug] API auth by user: worker, operator: 1
# Starting unstable worker. Instance: 3 for host http://localhost:9526
# PID of openqa-worker-unstable: 2552
# Waiting for worker with ID 5
[info] [pid:2526] Got status update for job 99982 and worker 5 but there is not even a worker assigned to this job (job is running)
[debug] [0hdoojO-nUDN] 400 Bad Request (0s, ??/s)
[debug] [pid:2525] Rescheduling jobs assigned to worker 5
# waiting for job to be incompleted
[info] [pid:2552] worker 3:
- config file: /tmp/b5Cu6Nkflr/scheduler/config/workers.ini
- worker hostname: 95b38fd71521
- isotovideo version: 0
- websocket API version: 1
- web UI hosts: http://localhost:9526
- class: ?
- no cleanup: no
- pool directory: /tmp/b5Cu6Nkflr/scheduler/openqa/pool/3
[info] [pid:2552] Project dir for host http://localhost:9526 is /tmp/b5Cu6Nkflr/scheduler/openqa/share
[info] [pid:2552] Registering with openQA http://localhost:9526
[warn] [pid:2525] Worker 5 accepted job 99982 which was never assigned to it or has already finished
[debug] [pid:2525] Sent population to worker: { population => 5, type => "info" }
[info] [pid:2525] Worker 5 websocket connection closed - 1006
[debug] API key from client: *CB4F1C079FADCAE6*
[debug] Key is for user "worker"
[debug] API auth by user: worker, operator: 1
[debug] [q2bN6or9HBvf] 200 OK (0s, ??/s)
[info] [pid:2552] Establishing ws connection via ws://localhost:9526/api/v1/ws/5
[debug] [d_Q73YZHi0A0] 302 Found (0s, ??/s)
[debug] [MF8wN-hbpUP_] 101 Switching Protocols (0s, ??/s)
[info] [pid:2552] Registered and connected via websockets with openQA host http://localhost:9526 and worker ID 5
[debug] [pid:2525] Received from worker "5" worker_status message "{
"status" => "free",
"type" => "worker_status"
}
It looks like this problem happens because the worker actually takes the job and incompletes it beforehand, so the job is no longer assigned when the assertion that fails here is reached:
[debug] [pid:2550] Accepting job 99982 from http://localhost:9526.
[debug] [pid:2550] Setting job 99982 from http://localhost:9526 up
[debug] [pid:2550] Preparing Mojo::IOLoop::ReadWriteProcess::Session
[info] [pid:2550] +++ setup notes +++
[info] [pid:2550] Running on 95b38fd71521:3 (Linux 4.15.0-1106-aws #113-Ubuntu SMP Fri Jun 18 21:31:27 UTC 2021 x86_64)
[debug] [pid:2550] Job settings:
[debug] [pid:2550]
ARCH=i686
BUILD=0091
DESKTOP=gnome
DISTRI=opensuse
FLAVOR=GNOME-Live
GNOME=1
INSTALLONLY=1
ISO=openSUSE-13.1-GNOME-Live-i686-Build0091-Media.iso
ISO_MAXSIZE=999999999
JOBTOKEN=e2u7qrZtYVCcLlr1
LIVECD=1
LOG_LEVEL=debug
MACHINE=32bit
NAME=00099982-opensuse-13.1-GNOME-Live-i686-Build0091-RAID0@32bit
OPENQA_HOSTNAME=localhost:9526
OPENQA_URL=http://localhost:9526
PRJDIR=/tmp/b5Cu6Nkflr/scheduler/openqa/share
QEMUCPU=qemu32
QEMUPORT=20032
RAIDLEVEL=0
RETRY_DELAY=5
RETRY_DELAY_IF_WEBUI_BUSY=60
TEST=RAID0
VERSION=13.1
VNC=93
WORKER_HOSTNAME=127.0.0.1
WORKER_ID=5
WORKER_INSTANCE=3
[error] [pid:2550] ISO handling Cannot find ISO asset iso/openSUSE-13.1-GNOME-Live-i686-Build0091-Media.iso!
[error] [pid:2550] Unable to setup job 99982: Cannot find ISO asset iso/openSUSE-13.1-GNOME-Live-i686-Build0091-Media.iso!
[debug] [pid:2550] Stopping job 99982 from http://localhost:9526: 00099982-opensuse-13.1-GNOME-Live-i686-Build0091-RAID0@32bit - reason: setup failure
[debug] [pid:2550] REST-API call: POST http://localhost:9526/api/v1/jobs/99982/status
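One way the test could perhaps be made deterministic here, as a rough sketch only (the assigned_worker_id accessor and $unstable_worker_id are assumptions on my side and the actual fix may look different): only restart the unstable worker once the job is really assigned to it, so that the re-registration path is what incompletes the job.
note 'waiting for job 99982 to be assigned to the unstable worker before restarting it';
for (0 .. 100) {
    my $assigned = $jobs->find(99982)->assigned_worker_id;
    last if defined $assigned && $assigned == $unstable_worker_id;
    sleep .2;
}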
Updated by mkittler about 3 years ago
PR to address the first problem: https://github.com/os-autoinst/openQA/pull/4276
The second problem is a bit harder to investigate because I pasted the wrong CircleCI URL when creating the ticket and the current CI run of the PR only shows the first problem (although it looks different because of https://github.com/os-autoinst/openQA/pull/4240).
Updated by mkittler about 3 years ago
- Status changed from Workable to In Progress
Updated by okurz about 3 years ago
- Status changed from In Progress to Resolved
Let's hope that https://github.com/os-autoinst/openQA/pull/4276 did the job. As this concerns our self-tests and we should be able to find this ticket again in case of recurring problems, we can resolve it here.