Project

General

Profile

Actions

action #95836

closed

[sporadic] CircleCI: openQA t/full-stack.t flaky size:M

Added by tinita almost 3 years ago. Updated over 2 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2021-07-22
Due date:
2021-09-02
% Done:

0%

Estimated time:

Description

Problem

t/full-stack.t failed in this PR https://github.com/os-autoinst/openQA/pull/4064

https://app.circleci.com/pipelines/github/os-autoinst/openQA/7140/workflows/760cdf7c-01a6-4627-9dce-1f020e046a2f/jobs/67378/steps

[02:46:52] t/full-stack.t .. 42/? # stopped waiting for '(?^u:Result: passed)', result turned out to be 'failed'
# full result panel contents:
# 1 1
# Result: failed finished less than a minute ago ( 01:56 minutes )
# Scheduled product: job has not been created by posting an ISO
# Assigned worker: 45009a9ca970:1

#   Failed test 'test 1 is passed'
#   at t/full-stack.t line 191.
[02:46:52] t/full-stack.t .. 45/? # 0
[02:46:52] t/full-stack.t .. 53/? [02:46:52] t/full-stack.t .. 54/? [02:46:52] t/full-stack.t .. 55/? [02:46:52] t/full-stack.t .. 56/? [02:46:52] t/full-stack.t .. 58/? [02:46:52] t/full-stack.t .. 61/? [02:46:52] t/full-stack.t .. 63/? [02:46:52] t/full-stack.t .. 65/? [02:46:52] t/full-stack.t .. 66/? [02:46:52] t/full-stack.t .. 68/? [02:46:52] t/full-stack.t .. 69/? [02:46:52] t/full-stack.t .. 71/? [02:46:52] t/full-stack.t .. 74/? [02:46:52] t/full-stack.t .. 75/? [02:46:52] t/full-stack.t .. 77/? [02:46:52] t/full-stack.t .. 83/? [02:46:52] t/full-stack.t .. 84/? [02:46:52] t/full-stack.t .. 86/? [02:46:52] t/full-stack.t .. 87/? [02:46:52] t/full-stack.t .. 93/?     # stopped waiting for '(?^u:Result: passed)', result turned out to be 'failed'
    # full result panel contents:
    # 1 1
    # Result: failed finished less than a minute ago ( 02:00 minutes )
    # Clone of 5
    # Scheduled product: job has not been created by posting an ISO (but possibly the original job)
    # Assigned worker: 45009a9ca970:1

    #   Failed test 'test 6 is passed'
    #   at t/full-stack.t line 366.
    # 0
    # stopped waiting for '(?^u:Result: passed)', result turned out to be 'failed'
    # full result panel contents:
    # 1 1
    # Result: failed finished less than a minute ago ( 02:00 minutes )
    # Clone of 6
    # Scheduled product: job has not been created by posting an ISO (but possibly the original job)
    # Assigned worker: 45009a9ca970:1

    #   Failed test 'test 7 is passed'
    #   at t/full-stack.t line 383.
    # 0
    # Looks like you failed 2 tests of 615.
[02:46:52] t/full-stack.t .. 95/? 
#   Failed test 'Cache tests'
#   at t/full-stack.t line 414.
# Looks like you failed 2 tests of 96.
                                   [02:46:52] t/full-stack.t .. Dubious, test returned 2 (wstat 512, 0x200)
Failed 2/96 subtests 

Retriggering it fixed it.

Acceptance Criteria

AC 1: t/full-stack.t is not flaky anymore

Suggestions

  • Try to reproduce locally

Related issues 1 (0 open1 closed)

Copied to openQA Project - action #95839: [sporadic] CircleCI: openQA t/ui/14-dashboard.t flaky size:MResolvedmkittler2021-07-222021-08-07

Actions
Actions #1

Updated by tinita almost 3 years ago

  • Copied to action #95839: [sporadic] CircleCI: openQA t/ui/14-dashboard.t flaky size:M added
Actions #2

Updated by tinita almost 3 years ago

  • Description updated (diff)
Actions #3

Updated by okurz almost 3 years ago

  • Priority changed from Normal to Urgent
Actions #4

Updated by okurz almost 3 years ago

  • Subject changed from CircleCI: openQA t/full-stack.t flaky to [sporadic] CircleCI: openQA t/full-stack.t flaky
Actions #5

Updated by ilausuch almost 3 years ago

  • Subject changed from [sporadic] CircleCI: openQA t/full-stack.t flaky to [sporadic] CircleCI: openQA t/full-stack.t flaky size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #6

Updated by tinita over 2 years ago

Just wanted to note that this failure hasn't occurred since this ticket was reported.

Actions #7

Updated by kraih over 2 years ago

  • Assignee set to kraih

I'll try to reproduce it.

Actions #8

Updated by kraih over 2 years ago

  • Assignee deleted (kraih)

Unfortunately i've not been able to replicate the issue locally or on Circle CI. The test seems very stable at the moment. Maybe someone else has more luck.

Actions #9

Updated by tinita over 2 years ago

Maybe we can reduce the priority, if we don't see it happen at all right now.

Actions #10

Updated by okurz over 2 years ago

  • Status changed from Workable to Resolved
  • Assignee set to kraih

well, I consider sporadic issues urgent because they can damage our trust in tests a lot. But as kraih could not replicate both locally and on circleCI I will just accept the hypothesis that it was an intermittent failure that we have no chance in fixing hence I see the issue as actually fixed even though we don't know what change might have fixed it again. Hence setting to resolved and setting kraih as assignee as he did the work though it was only testing which proved the stability and that's what we wanted to achieve, see ACs.

Actions #11

Updated by okurz over 2 years ago

  • Status changed from Resolved to Workable
  • Assignee deleted (kraih)

Seems like not stable enough: https://app.circleci.com/pipelines/github/os-autoinst/openQA/7234/workflows/6f863810-1f88-402f-bfbe-c0ff7fdbbc3d/jobs/68304

maybe just bump Makefile level timeout because everything is much slower

Actions #12

Updated by mkittler over 2 years ago

When looking further into the logs of the failure mentioned in the last comments then it doesn't look like bumping the timeouts would help anymore. The test was stuck waiting for the first job to progress but that job couldn't be started for some reason.

The worker status always repeats the job status and the job is considered running (ok 23 - test 1 is running) but it doesn't really progress. The continuous upload has already been started (Upload concluded (no current module)) and the developer mode is available (ok 24 - wait until developer console becomes available). However, it fails to reach the point where the test is paused (or the test is paused but it cannot be acknowledged). It looks like the developer mode connection has been re-opened multiple times but there was no response to the status command (which would contain what the test is waiting for):

    # waiting for paused, developer console contains:
    # status: Connecting to ws://127.0.0.1:20013/ML7iBoUodd8WmcnR/ws
    # status: Using proxy: no
    # status: Connection opened
    # ==> {"cmd":"status"}
    # status: Connection closed, trying to reconnect in 500 ms
    # status: Connecting to ws://127.0.0.1:20013/ML7iBoUodd8WmcnR/ws
    # status: Using proxy: no
    # status: Connection opened
    # ==> {"cmd":"status"}
    # status: Connection closed, trying to reconnect in 500 ms
    # status: Connecting to ws://127.0.0.1:20013/ML7iBoUodd8WmcnR/ws
    # status: Using proxy: no
    # status: Connection opened
    # ==> {"cmd":"status"}
    # status: Connection closed, trying to reconnect in 500 ms
    # status: Connecting to ws://127.0.0.1:20013/ML7iBoUodd8WmcnR/ws
    # status: Using proxy: no
    # status: Connection opened

So there must be something stuck in os-autoinst's command processing but it is hard to tell what exactly.

If the problem doesn't happen very often I would lower the priority. Maybe we find some more clues from further test runs.

Actions #13

Updated by livdywan over 2 years ago

See also https://app.circleci.com/pipelines/github/os-autoinst/openQA/7284/workflows/e059f161-33a7-4f73-85b7-7564e501d14d/jobs/68683/steps

[09:44:33] t/full-stack.t .. 94/?     # full result panel contents:
    # State: running started 3 minutes ago
    # Scheduled product: job has not been created by posting an ISO
    # Assigned worker: 885bbd1b5daf:1

    #   Failed test 'test 5 is passed'
    #   at t/full-stack.t line 313.
    # 0

    #   Failed test 'Test 5 autoinst-log.txt file created'
    #   at t/full-stack.t line 316.
    # Looks like you failed 2 tests of 384.
[09:44:33] t/full-stack.t .. 96/? 
#   Failed test 'Cache tests'
#   at t/full-stack.t line 414.
Can't open file "/tmp/392z3XoTPN/full-stack.d/openqa/testresults/00000/00000005-tinycore-1-flavor-i386-Build1-core@coolone/autoinst-log.txt": No such file or directory at t/full-stack.t line 317.
Actions #15

Updated by livdywan over 2 years ago

  • Assignee set to livdywan

As suggested I'll give this a stab, possibly with Marius' help and see if I learn something

Actions #16

Updated by livdywan over 2 years ago

Not a fix, but to prevent file errors from breaking the test plan: https://github.com/os-autoinst/openQA/pull/4107 This is why there's an error message at the bottom. It's supposed to be part of the test.

Actions #17

Updated by livdywan over 2 years ago

  • Status changed from Workable to In Progress
Actions #19

Updated by openqa_review over 2 years ago

  • Due date set to 2021-08-19

Setting due date based on mean cycle time of SUSE QE Tools

Actions #20

Updated by livdywan over 2 years ago

Somehow the disable tour permanently and find_element_by_link_text('core@coolone')->click() checks now fail for me locally. Manually deleting the db makes no difference - using FULLSTACK=1 make test-with-database TESTS=t/full-stack.t which I've used before with no problems. The web UI doesn't show the tour UI and also doesn't show the added job before it fails.

tinita wrote:

This seems to get more frequent:
https://app.circleci.com/pipelines/github/os-autoinst/openQA/7311/workflows/3914758f-521f-4f20-b6d3-4d776baa53e1/jobs/68949/steps

Apparently wait_for_result_panel failed and the job is still running:

    ok 385 - Wait for jQuery successful: result panel shows '(?^u:Result: passed)'
    not ok 386 - test 5 is passed
    # job info: {
    #   'assigned_worker_id' => 1,
    #   'blocked_by_id' => undef,
    #   'clone_id' => undef,
    #   'group_id' => undef,
    #   'id' => 5,
    #   'name' => 'tinycore-1-flavor-i386-Build1-core@coolone',
    #   'priority' => 50,
    #   'result' => 'none',
    #   'settings' => {
    #     'ARCH' => 'i386',
    #     'BUILD' => '1',
    #     'DISTRI' => 'tinycore',
    #     'FLAVOR' => 'flavor',
    #     'INTEGRATION_TESTS' => '1',
    #     'ISO' => 'Core-7.2.iso',
    #     'MACHINE' => 'coolone',
    #     'NAME' => '00000005-tinycore-1-flavor-i386-Build1-core@coolone',
    #     'PUBLISH_HDD_1' => '',
    #     'QEMU' => 'i386',
    #     'QEMU_NO_FDC_SET' => '1',
    #     'QEMU_NO_KVM' => '1',
    #     'QEMU_NO_TABLET' => '1',
    #     'TEST' => 'core',
    #     'UEFI_PFLASH_VARS' => '/usr/share/qemu/ovmf-x86_64.bin',
    #     'VERSION' => '1',
    #     'WORKER_CLASS' => 'qemu_i386'
    #   },
    #   'state' => 'running',
    #   't_finished' => undef,
    #   't_started' => '2021-08-04T16:58:54',
    #   'test' => 'core'
    # }
[info] [pid:736] Received signal TERM
[debug] [pid:736] Stopping job 5 from http://localhost:9526: 00000005-tinycore-1-flavor-i386-Build1-core@coolone - reason: quit

Job 5 was auto_duplicated after that which happens during done but the scheduler refuses the connection and test_order.json is gone 🤔️

[debug] [pid:170] Job 5 duplicated as 6
[debug] [aTX-Prx--Y8r] 200 OK (0.170014s, 5.882/s)
[warn] [pid:170] Unable to wakeup scheduler: Connection refused
[debug] [pid:736] Unable to read test_order.json: Can't open file "/tmp/uvQ70yVnRE/full-stack.d/openqa/pool/1/testresults/test_order.json": No such file or directory at /home/squamata/project/script/../lib/OpenQA/Worker/Job.pm line 1098.
[debug] [pid:736] Upload concluded (no current module)
Actions #21

Updated by livdywan over 2 years ago

I proposed another PR to extend the panel wait timeout, since the test fails because the job hasn't actually finished yet: https://github.com/os-autoinst/openQA/pull/4111
My second commit actually makes the wait bail out like we do in other cases to make this clearer.

Actions #22

Updated by livdywan over 2 years ago

Belated update, apparently I hadn't submitted the comment and much of my attention was on #96713 and related issues.

cdywan wrote:

Somehow the disable tour permanently and find_element_by_link_text('core@coolone')->click() checks now fail for me locally. Manually deleting the db makes no difference - using FULLSTACK=1 make test-with-database TESTS=t/full-stack.t which I've used before with no problems. The web UI doesn't show the tour UI and also doesn't show the added job before it fails.

For the record, git clean -dfX got rid of some of the ignored files mixed in with tracked files and I assume something in there's tripping up the test with no clear error message:

Removing .sass-cache/
Removing assets/assetpack.db
Removing assets/cache/
Removing base_state.json
Removing cover_db/
Removing docker.env
Removing job.json
Removing t/cache.d/
Removing t/data/openqa/db/db.lock
Removing t/data/openqa/pool/
Removing t/data/openqa/share/factory/iso/Core-7.2.iso
Removing t/data/openqa/share/tests/opensuse/needles/
Removing t/data/openqa/testresults/
Removing t/diagram-v87.png
Removing t/diagram-v88.png
Removing t/log_chromedriver
Removing t/poolPEcH/
Removing testresults/

I'm still looking to reproduce this reliably or confirm it's really just CI being slow and which my second PR would remedy.

Actions #23

Updated by livdywan over 2 years ago

Observed another similar failure, although this was test 6:

ok 462 - Wait for jQuery successful: result panel shows '(?^u:Result: passed)'
    not ok 463 - test 6 is passed
    # job info: {
    #   'assigned_worker_id' => 1,
    #   'blocked_by_id' => undef,
    #   'clone_id' => undef,
    #   'group_id' => undef,
    #   'id' => 6,
    #   'name' => 'tinycore-1-flavor-i386-Build1-core@coolone',
    #   'origin_id' => 5,
    #   'priority' => 50,
    #   'result' => 'none',
    #   'settings' => {
    #     'ARCH' => 'i386',
    #     'BUILD' => '1',
    #     'DISTRI' => 'tinycore',
    #     'FLAVOR' => 'flavor',
    #     'INTEGRATION_TESTS' => '1',
    #     'ISO' => 'Core-7.2.iso',
    #     'MACHINE' => 'coolone',
    #     'NAME' => '00000006-tinycore-1-flavor-i386-Build1-core@coolone',
    #     'PUBLISH_HDD_1' => '',
    #     'QEMU' => 'i386',
    #     'QEMU_NO_FDC_SET' => '1',
    #     'QEMU_NO_KVM' => '1',
    #     'QEMU_NO_TABLET' => '1',
    #     'TEST' => 'core',
    #     'UEFI_PFLASH_VARS' => '/usr/share/qemu/ovmf-x86_64.bin',
    #     'VERSION' => '1',
    #     'WORKER_CLASS' => 'qemu_i386'
    #   },
    #   'state' => 'running',
    #   't_finished' => undef,
    #   't_started' => '2021-08-11T13:49:53',
    #   'test' => 'core'
    # }
[info] [pid:980] Received signal TERM
[debug] [pid:980] Stopping job 6 from http://localhost:9526: 00000006-tinycore-1-flavor-i386-Build1-core@coolone - reason: quit
[debug] [pid:980] REST-API call: POST http://localhost:9526/api/v1/jobs/6/status
[debug] API key from client: *1234567890ABCDEF*
[debug] Key is for user "Demo"
[debug] API auth by user: Demo, operator: 1
[debug] [vUJruy90IEKO] 200 OK (0s, ??/s)
[info] [pid:980] Trying to stop job gracefully by announcing it to command server via http://localhost:20013/4Cy5hI1eBpLmIYuS/broadcast
[info] [pid:980] Isotovideo exit status: 1
[info] [pid:980] +++ worker notes +++
[info] [pid:980] End time: 2021-08-11 13:53:03
[info] [pid:980] Result: quit
[debug] [pid:980] Job 6 stopped as incomplete
[debug] [pid:980] Duplicating job 6
[debug] [pid:980] REST-API call: POST http://localhost:9526/api/v1/jobs/6/duplicate?dup_type_auto=1
[debug] API key from client: *1234567890ABCDEF*
[debug] Key is for user "Demo"
[debug] API auth by user: Demo, operator: 1
[debug] Restarting job 6
[debug] [pid:181] Duplicating jobs: {
  6 => {
         chained_children => [],
         chained_parents => [],
         directly_chained_children => [],
         directly_chained_parents => [],
         is_parent_or_initial_job => 1,
         ok => 0,
         parallel_children => [],
         parallel_parents => [],
         state => "uploading",
       },
}
[debug] [pid:181] Job 6 duplicated as 7
[debug] [e6jn1vk__jmH] 200 OK (0s, ??/s)
[warn] [pid:181] Unable to wakeup scheduler: Connection refused
[debug] [pid:980] Unable to read test_order.json: Can't open file "/tmp/dFO0RsD8Pu/full-stack.d/openqa/pool/1/testresults/test_order.json": No such file or directory at /home/squamata/project/script/../lib/OpenQA/Worker/Job.pm line 1098.
[debug] [pid:980] Upload concluded (no current module)
[debug] [pid:980] REST-API call: POST http://localhost:9526/api/v1/jobs/6/set_done?reason=quit%3A+worker+has+been+stopped+or+restarted&result=incomplete&worker_id=1
[debug] API key from client: *1234567890ABCDEF*
[debug] Key is for user "Demo"
[debug] API auth by user: Demo, operator: 1
[debug] [3hvHqtFwVqT3] 200 OK (0s, ??/s)
[debug] [pid:980] Job 6 from http://localhost:9526 finished - reason: quit
[debug] [pid:980] Informing http://localhost:9526 that we are going offline
[info] [pid:180] Worker 1 websocket connection closed - 1006
    not ok 464 - Test 6 autoinst-log.txt file created
    ok 465 - asset 5.qcow2 removed during cache init
    ok 466 - Core-7.2.iso the most recent asset again
[debug] API key from client: *1234567890ABCDEF*
[debug] Key is for user "Demo"
[debug] API auth by user: Demo, operator: 1
[debug] Restarting job 6
[debug] [1Tr19SSe0zk-] 200 OK (0s, ??/s)
[warn] [pid:181] Unable to wakeup scheduler: Connection refused
[debug] [pid:181] Unrecognized referer 'localhost'
[debug] [9jtIkthp8U5T] 200 OK (0s, ??/s)
[debug] [pid:181] Unrecognized referer 'localhost'
[debug] [X5J1PGJqPN2s] 200 OK (0s, ??/s)
[debug] [pid:181] Unrecognized referer 'localhost'
[debug] [omqi-OTHkNRk] 200 OK (0s, ??/s)
    not ok 467 - client returned new test_url for test 7
Actions #24

Updated by livdywan over 2 years ago

With my PR adjusting the timeout to 5 minutes (instead of 3) in wait_for_result_panel I'm observing errors like this on CI:

[14:02:50] t/full-stack.t .. 98/? Mojo::Reactor::Poll: Timer failed: Not a SCALAR reference at /home/squamata/project/script/../lib/OpenQA/Worker/Engines/isotovideo.pm line 102.
    # full result panel contents:
    # State: setup
    # Scheduled product: job has not been created by posting an ISO
    # Assigned worker: 3500d1376f46:1
Bailout called.  Further testing stopped:  Expected result not found
[14:24:45] t/full-stack.t .. 617/? # full result panel contents:
# State: scheduled
# Scheduled product: job has not been created by posting an ISO
# 50
Bailout called.  Further testing stopped:  Expected result not found
Actions #25

Updated by livdywan over 2 years ago

cdywan wrote:

I proposed another PR to extend the panel wait timeout, since the test fails because the job hasn't actually finished yet: https://github.com/os-autoinst/openQA/pull/4111

I'm proposing to scale the timeouts rather than increasing them (undoing the change in the timeout from the previous PR): https://github.com/os-autoinst/openQA/pull/4129

Actions #26

Updated by livdywan over 2 years ago

  • Status changed from In Progress to Feedback

With the PR merged I would tentatively consider this solved. Unless new data suggests this isn't racy.

Actions #27

Updated by okurz over 2 years ago

  • Status changed from Feedback to Resolved
Actions #28

Updated by okurz over 2 years ago

  • Due date changed from 2021-08-19 to 2021-09-02
  • Status changed from Resolved to Feedback
Actions #29

Updated by livdywan over 2 years ago

  • Status changed from Feedback to In Progress

okurz wrote:

Too soon. See https://app.circleci.com/pipelines/github/os-autoinst/openQA/7458/workflows/99260186-0e4e-48e4-aa87-ff676c323792/jobs/70352 from https://github.com/os-autoinst/openQA/pull/4140

RETRY=0 timeout -s SIGINT -k 5 -v $((9 * 2 * (0 + 1) ))m tools/retry prove -l --harness TAP::Harness::JUnit --timer t/full-stack.t
[09:05:18] t/full-stack.t .. 18/? Bailout called.  Further testing stopped:  findElement: no such element: Unable to locate element: {"method":"link text","selector":"Live View"} at /home/squamata/project/t/lib/OpenQA/SeleniumTest.pm:94
FAILED--Further testing stopped: findElement: no such element: Unable to locate element: {"method":"link text","selector":"Live View"} at /home/squamata/project/t/lib/OpenQA/SeleniumTest.pm:94
make[2]: *** [Makefile:191: test-unit-and-integration] Error 255

This isn't the same error (job supposed to be done but running too long) so I regard this as a good thing. The test fails a wait_for_job_running check, which means the "panel" wait finished. Except the live view button isn't visible (yet): https://github.com/os-autoinst/openQA/pull/4146

Actions #30

Updated by livdywan over 2 years ago

  • Status changed from In Progress to Feedback

cdywan wrote:

This isn't the same error (job supposed to be done but running too long) so I regard this as a good thing. The test fails a wait_for_job_running check, which means the "panel" wait finished. Except the live view button isn't visible (yet): https://github.com/os-autoinst/openQA/pull/4146

Merged. Will monitor and see if this is it.

Actions #31

Updated by livdywan over 2 years ago

  • Status changed from Feedback to Resolved

cdywan wrote:

cdywan wrote:

This isn't the same error (job supposed to be done but running too long) so I regard this as a good thing. The test fails a wait_for_job_running check, which means the "panel" wait finished. Except the live view button isn't visible (yet): https://github.com/os-autoinst/openQA/pull/4146

Merged. Will monitor and see if this is it.

I didn't observe any more failures in the last week, so I'm assuming we're good.

Actions

Also available in: Atom PDF