action #71554

unstable/flaky/sporadic t/full-stack.t test failing in script waits on CircleCI

Added by okurz about 1 year ago. Updated about 1 year ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Concrete Bugs
Target version:
Start date: 2020-09-19
Due date:
% Done: 0%
Estimated time:
Difficulty:

Description

Observation

Recently t/full-stack.t has become more unstable.

Steps to reproduce

Probably reproducible locally with

make test FULLSTACK=1 TESTS=t/full-stack.t

Suggestions

Bisect where the regression was introduced and fix it so that the test is stable both locally and within CI.
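
One way to drive such a bisection (a sketch only; the last known good revision and the number of repetitions per commit are assumptions) is git bisect run combined with the reproducer above. Because the failure is sporadic, a single passing run proves little, so each candidate commit is exercised several times:

git bisect start
git bisect bad HEAD                       # current, flaky state
git bisect good <last-known-good-commit>  # placeholder: a revision where the test was still reliable
# Repeat the reproducer a few times per commit and report "bad" as soon as one run fails
git bisect run sh -c 'for i in 1 2 3 4 5; do make test FULLSTACK=1 TESTS=t/full-stack.t || exit 1; done'
git bisect reset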


Related issues

Related to openQA Project - action #37638: Flaky fullstack test: 'Test 3 is scheduled' at t/full-stack.t (Resolved, 2018-06-21)

Related to openQA Project - action #59043: Fix unstable/flaky full-stack test, i.e. remove sleep, and ui tests (Resolved, 2019-11-04)

Copied from openQA Project - action #71551: unstable/flaky/sporadic t/04-scheduler.t test failing (Resolved, 2020-09-19)

History

#1 Updated by okurz about 1 year ago

  • Copied from action #71551: unstable/flaky/sporadic t/04-scheduler.t test failing added

#2 Updated by mkittler about 1 year ago

I've created https://github.com/os-autoinst/openQA/pull/3405 to better track down the problem. I'll have a look at the full-stack test when checking CI failures of my PRs, but so far I'm not quite sure what the problem is.

#3 Updated by okurz about 1 year ago

  • Related to action #37638: Flaky fullstack test: 'Test 3 is scheduled' at t/full-stack.t added

#4 Updated by okurz about 1 year ago

  • Related to action #59043: Fix unstable/flaky full-stack test, i.e. remove sleep, and ui tests added

#5 Updated by okurz about 1 year ago

It seems to be getting worse now, e.g. https://app.circleci.com/pipelines/github/os-autoinst/openQA/4317/workflows/66701e42-dd43-4159-824e-d8ec08883956/jobs/41463 shows

timeout -s SIGINT -k 5 -v $((20 * (3 + 1) ))m tools/retry prove -l --harness TAP::Harness::JUnit --timer --merge t/full-stack.t
Retry 1 of 3 …
[17:59:23] t/full-stack.t .. 92/? make[2]: *** [Makefile:174: test-unit-and-integration] Terminated
make[1]: *** [Makefile:169: test-with-database] Terminated
make: *** [Makefile:154: test-fullstack] Terminated

Too long with no output (exceeded 30m0s): context deadline exceeded

so the test job is not even finishing within 30m. However, the logfile in https://circle-production-customer-artifacts.s3.amazonaws.com/picard/forks/58f7029dc9e77c000129905e/46416941/5f6b8c9e29253478672eb817-0-build/artifacts/artifacts/full-stack.t?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20200924T051432Z&X-Amz-SignedHeaders=host&X-Amz-Expires=60&X-Amz-Credential=AKIAJR3Q6CR467H7Z55A%2F20200924%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Signature=6562cafe3e6873cd02d008bc1837d7dad6d8924b65ccffc262ec9f1780401456 shows what looks like the test running just fine (albeit probably super slowly) until it is aborted by CircleCI. Unfortunately, neither the test module timeout nor the timeout on the make level triggers. Normal runs take about 5m, e.g. see https://app.circleci.com/pipelines/github/os-autoinst/openQA/4320/workflows/6e1fdc89-9482-4a3d-9a3e-b78135abbe6e/jobs/41464, so I guess we can at least tweak some timeouts: https://github.com/os-autoinst/openQA/pull/3415
This is only fighting the symptoms, not addressing the root cause of this problem, so I am not assigning the ticket to myself yet.
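
For reference, the outer timeout in the command above evaluates to 80 minutes, presumably 20 minutes per attempt for the initial run plus up to 3 retries. That is well above CircleCI's 30-minute no-output limit, so when the test hangs silently, the CI watchdog fires long before the make-level timeout ever could:

echo $((20 * (3 + 1)))m   # prints "80m": 20 minutes x (3 retries + 1 initial attempt)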

#6 Updated by cdywan about 1 year ago

okurz wrote:

[…] so the test job is not even finishing within 30m but the logfile shows what looks like the test running just fine (albeit probably super slowly) until it is aborted by CircleCI. […] This is only fighting the symptoms, not addressing the root cause of this problem, so I am not assigning the ticket to myself yet.

Another piece of this puzzle we've not considered is JavaScript. I suspect the code gets stuck waiting for the result panel of job 8: the loop never times out. It is bounded by a number of iterations rather than a timeout, which means it is as slow as the JavaScript and sleep calls make it in practice.

https://github.com/os-autoinst/openQA/pull/3430
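
To illustrate the difference between the two wait strategies (a hypothetical shell sketch, not the actual Perl/Selenium test code; check_panel stands in for whatever polls the result panel of job 8):

# Iteration-bound wait: total time stretches with however long each poll takes
for i in $(seq 1 30); do check_panel && break; sleep 1; done

# Deadline-bound wait: capped by wall-clock time regardless of how slow each poll is
deadline=$(( $(date +%s) + 300 ))
until check_panel; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
        echo "timed out waiting for the result panel" >&2
        exit 1
    fi
    sleep 1
done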

#7 Updated by cdywan about 1 year ago

  • Status changed from Workable to In Progress
  • Assignee set to cdywan

#8 Updated by cdywan about 1 year ago

Note that I also evaluated past jobs on CI and the only failures I could find were due to JavaScript getting stuck; it is actually not failing very often. I will keep an eye on it anyway, though (and that is part of the Feedback state).

#9 Updated by cdywan about 1 year ago

  • Subject changed from unstable/flaky/sporadic t/full-stack.t test failing to flaky t/full-stack.t test failing in script waits on CircleCI
  • Description updated (diff)

#10 Updated by okurz about 1 year ago

  • Subject changed from flaky t/full-stack.t test failing in script waits on CircleCI to unstable/flaky/sporadic t/full-stack.t test failing in script waits on CircleCI
  • Status changed from In Progress to Feedback
  • Priority changed from Urgent to Normal

I included "unstable/flaky/sporadic" in the subject line to have a higher chance of finding this ticket again when searching subjects :)

You created https://github.com/os-autoinst/openQA/pull/3430, which I have merged now. As you stated, the problem seems to have become less severe again lately, so we can track this in Feedback with lower priority now. Thanks!

#11 Updated by okurz about 1 year ago

I created https://github.com/os-autoinst/openQA/pull/3455 to mark t/full-stack.t as stable and faster (reduced timeout). Do you plan any further work here, or what feedback are you waiting for?

#12 Updated by cdywan about 1 year ago

  • Status changed from Feedback to Resolved

I think it's fine now.
