action #71554

closed

unstable/flaky/sporadic t/full-stack.t test failing in script waits on CircleCI

Added by okurz almost 4 years ago. Updated almost 4 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2020-09-19
Due date:
% Done: 0%
Estimated time:

Description

Observation

t/full-stack.t has recently become more unstable.

Steps to reproduce

Probably reproducible locally with

make test FULLSTACK=1 TESTS=t/full-stack.t

Suggestions

Bisect to find where the regression comes from, then fix it so that the test is stable both locally and in CI.
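
Because the failure is sporadic, a single run per commit is unlikely to be enough when bisecting. A minimal sketch of a helper that repeats a command and counts failures follows; the function name `count_failures` and the run count of 20 are illustrative assumptions, not part of the openQA tooling.

```shell
#!/bin/bash
# Run a command repeatedly and print how many of the runs failed.
# Usage: count_failures RUNS COMMAND [ARGS...]
# (hypothetical helper, not part of the openQA repository)
count_failures() {
    local runs=$1; shift
    local failures=0 i
    for ((i = 1; i <= runs; i++)); do
        # Discard the command's output; only the exit status matters here.
        if ! "$@" >/dev/null 2>&1; then
            failures=$((failures + 1))
        fi
    done
    echo "$failures"
}

# Example with an assumed run count of 20:
# count_failures 20 make test FULLSTACK=1 TESTS=t/full-stack.t
```

For bisecting, a wrapper script could exit with status 1 whenever the count is non-zero, since `git bisect run` treats exit status 0 as good and 1 to 127 (except 125) as bad.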


Related issues 4 (0 open, 4 closed)

Related to openQA Project - action #37638: Flaky fullstack test: 'Test 3 is scheduled' at t/full-stack.t (Resolved, okurz, 2018-06-21)

Related to openQA Project - action #59043: Fix unstable/flaky full-stack test, i.e. remove sleep, and ui tests (Resolved, okurz, 2019-11-04)

Related to openQA Infrastructure - action #152941: circleCI job runs into 20m timeout due to slow download from registry.opensuse.org (Resolved, okurz, 2023-12-27)

Copied from openQA Project - action #71551: unstable/flaky/sporadic t/04-scheduler.t test failing (Resolved, okurz, 2020-09-19)

Actions #1

Updated by okurz almost 4 years ago

  • Copied from action #71551: unstable/flaky/sporadic t/04-scheduler.t test failing added
Actions #2

Updated by mkittler almost 4 years ago

I've created https://github.com/os-autoinst/openQA/pull/3405 to better track down the problem. I'll have a look at the full stack test when checking CI failures of my PRs but so far I'm not quite sure what the problem is.

Actions #3

Updated by okurz almost 4 years ago

  • Related to action #37638: Flaky fullstack test: 'Test 3 is scheduled' at t/full-stack.t added
Actions #4

Updated by okurz almost 4 years ago

  • Related to action #59043: Fix unstable/flaky full-stack test, i.e. remove sleep, and ui tests added
Actions #5

Updated by okurz almost 4 years ago

It seems to be getting worse now, e.g. https://app.circleci.com/pipelines/github/os-autoinst/openQA/4317/workflows/66701e42-dd43-4159-824e-d8ec08883956/jobs/41463 shows

timeout -s SIGINT -k 5 -v $((20 * (3 + 1) ))m tools/retry prove -l --harness TAP::Harness::JUnit --timer --merge t/full-stack.t
Retry 1 of 3 …
[17:59:23] t/full-stack.t .. 92/? make[2]: *** [Makefile:174: test-unit-and-integration] Terminated
make[1]: *** [Makefile:169: test-with-database] Terminated
make: *** [Makefile:154: test-fullstack] Terminated

Too long with no output (exceeded 30m0s): context deadline exceeded

so the test job does not even finish within 30m. However, the logfile in https://circle-production-customer-artifacts.s3.amazonaws.com/picard/forks/58f7029dc9e77c000129905e/46416941/5f6b8c9e29253478672eb817-0-build/artifacts/artifacts/full-stack.t?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20200924T051432Z&X-Amz-SignedHeaders=host&X-Amz-Expires=60&X-Amz-Credential=AKIAJR3Q6CR467H7Z55A%2F20200924%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Signature=6562cafe3e6873cd02d008bc1837d7dad6d8924b65ccffc262ec9f1780401456 shows what looks like the test running just fine (albeit probably very slowly) until it is aborted by CircleCI. Unfortunately, neither the test module timeout nor the make-level timeout triggers. Normal runs take about 5m, e.g. see https://app.circleci.com/pipelines/github/os-autoinst/openQA/4320/workflows/6e1fdc89-9482-4a3d-9a3e-b78135abbe6e/jobs/41464, so I guess we can at least tweak some timeouts: https://github.com/os-autoinst/openQA/pull/3415
This only fights the symptoms and does not address the root cause of this problem, so I am not assigning the ticket to myself yet.
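
A quick check of the numbers in the log above may explain why the make-level timeout never triggers. This is only a sketch: the variable names are illustrative, and reading the `20` as minutes per attempt and the `3` as the retry count is an assumption based on the "Retry 1 of 3" line.

```shell
#!/bin/bash
# Values read off the log above; variable names are illustrative.
per_attempt_minutes=20    # assumed meaning of the "20" in the timeout expression
retries=3                 # matches "Retry 1 of 3"
ci_no_output_minutes=30   # CircleCI: "Too long with no output (exceeded 30m0s)"

# Same arithmetic as $((20 * (3 + 1))) in the logged command.
outer_timeout_minutes=$((per_attempt_minutes * (retries + 1)))
echo "outer timeout: ${outer_timeout_minutes}m"

if ((outer_timeout_minutes > ci_no_output_minutes)); then
    echo "a silent job hits the CI no-output limit before the outer timeout"
else
    echo "the outer timeout is reached first"
fi
```

With these values the outer `timeout` would only fire after 80 minutes, so a job that goes quiet is aborted by CircleCI long before that.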

Actions #6

Updated by livdywan almost 4 years ago

Another piece of this puzzle that we have not considered is JavaScript. I suspect the code gets stuck waiting for the result panel of job 8. The loop never times out: it is bounded by a number of iterations rather than by elapsed time, so in practice it runs as slowly as the JavaScript and sleep calls make it.

https://github.com/os-autoinst/openQA/pull/3430
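
To illustrate the difference, here is a minimal sketch of a wait loop bounded by wall-clock time instead of an iteration count. It is not the code from the pull request above; the function name `wait_until` and its parameters are hypothetical, and the real loop lives in the test code rather than in shell.

```shell
#!/bin/bash
# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Polls COMMAND until it succeeds or TIMEOUT_SECONDS of wall-clock time
# have passed (hypothetical helper; whole-second resolution via $SECONDS).
wait_until() {
    local timeout=$1 interval=$2; shift 2
    local deadline=$((SECONDS + timeout))
    until "$@"; do
        # Give up based on elapsed time, no matter how slow each attempt was.
        if ((SECONDS >= deadline)); then
            return 1
        fi
        sleep "$interval"
    done
    return 0
}
```

The deadline is only checked between attempts, so a single attempt that hangs forever would still need its own timeout.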

Actions #7

Updated by livdywan almost 4 years ago

  • Status changed from Workable to In Progress
  • Assignee set to livdywan
Actions #8

Updated by livdywan almost 4 years ago

Note that I also evaluated past jobs on CI. The only failures I could find were due to JavaScript getting stuck, and the test is actually not failing a lot. I will keep an eye on it anyway (that is part of Feedback).

Actions #9

Updated by livdywan almost 4 years ago

  • Subject changed from unstable/flaky/sporadic t/full-stack.t test failing to flaky t/full-stack.t test failing in script waits on CircleCI
  • Description updated (diff)
Actions #10

Updated by okurz almost 4 years ago

  • Subject changed from flaky t/full-stack.t test failing in script waits on CircleCI to unstable/flaky/sporadic t/full-stack.t test failing in script waits on CircleCI
  • Status changed from In Progress to Feedback
  • Priority changed from Urgent to Normal

I included "unstable/flaky/sporadic" in the subject line to have a higher chance to find this ticket again when searching subjects :)

You created https://github.com/os-autoinst/openQA/pull/3430 and I have merged it now. As you stated, the problem seems to have become less severe again lately, so we can track it in Feedback with lower priority now. Thanks.

Actions #11

Updated by okurz almost 4 years ago

I created https://github.com/os-autoinst/openQA/pull/3455 to mark t/full-stack.t as stable and to make it faster (reduced timeout). Do you plan any further work here, or what feedback are you waiting for?

Actions #12

Updated by livdywan almost 4 years ago

  • Status changed from Feedback to Resolved

I think it's fine now

Actions #13

Updated by okurz 9 months ago

  • Related to action #152941: circleCI job runs into 20m timeout due to slow download from registry.opensuse.org added