action #71554
closed: unstable/flaky/sporadic t/full-stack.t test failing in script waits on CircleCI
Added by okurz about 4 years ago. Updated almost 4 years ago.
Description
Updated by okurz about 4 years ago
- Copied from action #71551: unstable/flaky/sporadic t/04-scheduler.t test failing added
Updated by mkittler about 4 years ago
I've created https://github.com/os-autoinst/openQA/pull/3405 to better track down the problem. I'll have a look at the full-stack test when checking CI failures of my PRs, but so far I'm not quite sure what the problem is.
Updated by okurz about 4 years ago
- Related to action #37638: Flaky fullstack test: 'Test 3 is scheduled' at t/full-stack.t added
Updated by okurz about 4 years ago
- Related to action #59043: Fix unstable/flaky full-stack test, i.e. remove sleep, and ui tests added
Updated by okurz about 4 years ago
It seems to be getting worse now, e.g. https://app.circleci.com/pipelines/github/os-autoinst/openQA/4317/workflows/66701e42-dd43-4159-824e-d8ec08883956/jobs/41463 shows
timeout -s SIGINT -k 5 -v $((20 * (3 + 1) ))m tools/retry prove -l --harness TAP::Harness::JUnit --timer --merge t/full-stack.t
Retry 1 of 3 …
[17:59:23] t/full-stack.t .. 92/? make[2]: *** [Makefile:174: test-unit-and-integration] Terminated
make[1]: *** [Makefile:169: test-with-database] Terminated
make: *** [Makefile:154: test-fullstack] Terminated
Too long with no output (exceeded 30m0s): context deadline exceeded
So the test job is not even finishing within 30m, but the logfile in https://circle-production-customer-artifacts.s3.amazonaws.com/picard/forks/58f7029dc9e77c000129905e/46416941/5f6b8c9e29253478672eb817-0-build/artifacts/artifacts/full-stack.t?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20200924T051432Z&X-Amz-SignedHeaders=host&X-Amz-Expires=60&X-Amz-Credential=AKIAJR3Q6CR467H7Z55A%2F20200924%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Signature=6562cafe3e6873cd02d008bc1837d7dad6d8924b65ccffc262ec9f1780401456 shows what looks like the test running just fine (albeit probably super slowly) until it is aborted by CircleCI. Unfortunately neither the test module timeout nor the timeout on the make level triggers. Normal runs take about 5m, e.g. see https://app.circleci.com/pipelines/github/os-autoinst/openQA/4320/workflows/6e1fdc89-9482-4a3d-9a3e-b78135abbe6e/jobs/41464, so I guess we can at least tweak some timeouts: https://github.com/os-autoinst/openQA/pull/3415
This is only fighting the symptoms, not addressing the root cause of this problem, so I'm not assigning the ticket to myself yet.
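For context, here is a rough sketch of how the timeouts involved nest, assuming the 20 in the command line above is a per-attempt budget in minutes and the 3 matches the retry count used by tools/retry ("Retry 1 of 3"); the variable names below are purely illustrative, not the actual Makefile variables:

PER_ATTEMPT_MINUTES=20   # assumed per-attempt budget, taken from the command line above
RETRIES=3                # assumed retry count used by tools/retry ("Retry 1 of 3")

# Outer guard around all attempts: 20 * (3 + 1) = 80 minutes in total
timeout -s SIGINT -k 5 -v "$((PER_ATTEMPT_MINUTES * (RETRIES + 1)))m" \
    tools/retry prove -l --harness TAP::Harness::JUnit --timer --merge t/full-stack.t

# Independently of that, CircleCI aborts the step after 30 minutes without any
# output ("Too long with no output (exceeded 30m0s)"), which is what fired here,
# long before the 80-minute outer timeout or the test module timeout could trigger.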
Updated by livdywan about 4 years ago
okurz wrote:
It seems to be getting worse now, e.g. https://app.circleci.com/pipelines/github/os-autoinst/openQA/4317/workflows/66701e42-dd43-4159-824e-d8ec08883956/jobs/41463 shows […]
Another piece we've not considered in this puzzle is JavaScript. I suspect the code gets stuck waiting for the result panel of job 8. The loop never times out, and it's a loop based on a number of iterations, not a timeout, which means it's as slow as the JavaScript and sleep calls make it in practice.
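To illustrate the difference, a generic sketch only, not the actual t/full-stack.t code (which is Perl driving a browser); check_panel below is a hypothetical stand-in for the check on job 8's result panel:

# Iteration-based wait: bounds the *number* of polls, not the elapsed time, so
# if every poll (JavaScript evaluation plus sleep) is slow, the total runtime
# grows without any timeout ever firing.
wait_by_iterations() {
    local max_iterations=$1
    for i in $(seq 1 "$max_iterations"); do
        check_panel && return 0    # hypothetical check for job 8's result panel
        sleep 1                    # each round costs at least 1s plus however long check_panel takes
    done
    return 1
}

# Deadline-based wait: bounds the *elapsed time*, so a slow environment makes
# the wait fail after a fixed number of seconds instead of silently stretching
# past the CI job's 30-minute no-output limit.
wait_by_deadline() {
    local deadline=$(( $(date +%s) + $1 ))
    while [ "$(date +%s)" -lt "$deadline" ]; do
        check_panel && return 0
        sleep 1
    done
    return 1
}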
Updated by livdywan about 4 years ago
- Status changed from Workable to In Progress
- Assignee set to livdywan
Updated by livdywan about 4 years ago
Note that I also evaluated past jobs on CI, and the only failures I could find were due to JavaScript getting stuck; it's actually not failing a lot. I will keep an eye on it anyway, though (and that's part of Feedback).
Updated by livdywan about 4 years ago
- Subject changed from unstable/flaky/sporadic t/full-stack.t test failing to flaky t/full-stack.t test failing in script waits on CircleCI
- Description updated (diff)
Updated by okurz about 4 years ago
- Subject changed from flaky t/full-stack.t test failing in script waits on CircleCI to unstable/flaky/sporadic t/full-stack.t test failing in script waits on CircleCI
- Status changed from In Progress to Feedback
- Priority changed from Urgent to Normal
I included "unstable/flaky/sporadic" in the subject line to have a higher chance of finding this ticket again when searching subjects :)
You created https://github.com/os-autoinst/openQA/pull/3430; I merged that now. As you stated, the problem seems to have become less severe again lately, so we can track this in Feedback with lower prio now. Thanks!
Updated by okurz about 4 years ago
I created https://github.com/os-autoinst/openQA/pull/3455 to mark t/full-stack.t as stable and faster (reduced timeout). Do you plan any further work here, or what feedback are you waiting for?
Updated by livdywan almost 4 years ago
- Status changed from Feedback to Resolved
I think it's fine now
Updated by okurz 10 months ago
- Related to action #152941: circleCI job runs into 20m timeout due to slow download from registry.opensuse.org added