Project

General

Profile

Actions

action #163127

closed

Failed openQA CI tests in master: t/ui/15-admin-workers.t and t/full-stack.t size:M

Added by okurz 6 months ago. Updated 5 months ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2024-07-02
Due date:
% Done:

0%

Estimated time:

Description

Observation

https://app.circleci.com/pipelines/github/os-autoinst/openQA/13929/workflows/19d49b27-c2b5-4b66-8fe4-711b4ee108b0/jobs/131058?invite=true#step-111-88

RETRY=0 HOOK=./tools/delete-coverdb-folder timeout -s SIGINT -k 5 -v $((30 * 2 * (0 + 1) ))m tools/retry prove -l --harness TAP::Harness::JUnit --timer t/full-stack.t
[14:42:14] t/full-stack.t ......... 33/?     # full result panel contents:
    # State: scheduled, created 28 minutes ago
    # Scheduled product: job has not been created by posting an ISO
    #  50 
    # Actions:
    # 
    # Cancel job

    #   Failed test 'Expected result not found'
    #   at /home/squamata/project/t/lib/OpenQA/Test/FullstackUtils.pm line 76.

    #   Failed test 'job 5 running'
    #   at t/full-stack.t line 332.
    # 0
Bailout called.  Further testing stopped:  test 't/full-stack.t' exceeds runtime limit of '1800' seconds

and

https://app.circleci.com/pipelines/github/os-autoinst/openQA/13930/workflows/032473ad-e89f-4787-a7a2-abbbca3a85c6/jobs/131066?invite=true#step-108-91

[15:39:42] t/ui/15-admin-workers.t .................... 5/? 
    #   Failed test 'on 99963'
    #   at t/ui/15-admin-workers.t line 119.
    #                   ''
    #     doesn't match '(?^:Worker status\nJob: 99963)'
    # Looks like you failed 1 test of 19.
[15:39:42] t/ui/15-admin-workers.t .................... 7/? 
#   Failed test 'worker overview'
#   at t/ui/15-admin-workers.t line 142.
[15:39:42] t/ui/15-admin-workers.t .................... 16/? [15:39:42] t/ui/15-admin-workers.t .................... 17/? # Looks like you failed 1 test of 17.
                                                              [15:39:42] t/ui/15-admin-workers.t .................... Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/17 subtests

Acceptance criteria

  • AC1: t/ui/15-admin-workers.t passes consistently in circleCI
  • AC2: t/full-stack.t passes consistently in circleCI

Suggestions

  • Try to reproduce locally, e.g. using 'count-fail-ratio' from https://software.opensuse.org/package/retry . If not reproducible locally try with coverage enabled or in circleCI
  • While the above is running consider recent changes that could possibly trigger this as regression, maybe even bisect (e.g. animations not disabled correctly, could be #159408).
  • As necessary get additional logs from circleCI artifacts
  • Maybe SSH into Circle CI to try things with the test
  • File another ticket if it turns out there's 2 issues here
Actions #1

Updated by okurz 6 months ago

  • Subject changed from Failed openQA CI tests in master: t/ui/15-admin-workers.t and t/full-stack.t to Failed openQA CI tests in master: t/ui/15-admin-workers.t and t/full-stack.t size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #2

Updated by mkittler 6 months ago

  • Assignee set to mkittler
Actions #3

Updated by mkittler 6 months ago

  • Status changed from Workable to In Progress
Actions #4

Updated by mkittler 6 months ago

Actions #5

Updated by mkittler 6 months ago · Edited

The failure of the fullstack test could be a problem with the worker websocket connection. The problem is that the worker disconnects in the critical moment when the job is being assigned:

# Subtest: Cache tests
    ok 1 - cache.sqlite is not present
    # Starting cache minion worker
    # Waiting for 'cache service at http://127.0.0.1:9530' to become available (timeout: 60)
    # Starting worker cache service on port 9530
    # Waiting for 'cache service worker' to become available (timeout: 60)
[debug] [pid:367] Unable to wakeup scheduler: Connection refused. Retry scheduled
    ok 2 - test 5 is scheduled
    ok 3 - Worker started as IPC::Run=HASH(0x55edc96d1d28)
    ok 4 - job with worker class qemu_i386 scheduled
[info] [pid:817] worker 1:
 - config file:                      /home/squamata/project/test-results/fullstack/full-stack.d/config/workers.ini
 - name used to register:            372c80df1da0
 - worker address (WORKER_HOSTNAME): localhost
 - isotovideo version:               40
 - websocket API version:            1
 - web UI hosts:                     http://localhost:9526
 - class:                            qemu_i386,qemu_x86_64
 - no cleanup:                       no
 - pool directory:                   /home/squamata/project/test-results/fullstack/full-stack.d/openqa/pool/1
[info] [pid:817] Project dir for host http://localhost:9526 is /home/squamata/project/test-results/fullstack/full-stack.d/openqa/share
[info] [pid:817] Registering with openQA http://localhost:9526
    ok 5 - at least one free worker with class qemu_i386 registered
[debug] [pid:351] Scheduling: Free workers: 1/1; Scheduled jobs: 1
[info] [pid:817] Establishing ws connection via ws://localhost:9526/api/v1/ws/1
[debug] [pid:351] Need to schedule 1 parallel jobs for job 5 (with priority 50)
[debug] [pid:351] limit reached, scheduling no additional jobs (max_running_jobs=-1, free workers=1, running=0, allocated=1)
[debug] [pid:351] Assigned job '5' to worker ID '1'
[debug] [pid:351] [Job#5] Prepare for being processed by worker 1
[warn] [pid:351] Failed sending job(s) '5' to worker '1': Unable to assign job to worker 1: the worker is not connected anymore
[info] [pid:817] Registered and connected via websockets with openQA host http://localhost:9526 and worker ID 1
[debug] [pid:366] Updating seen of worker 1 from worker_status (free)
[debug] [pid:351] Job 5 reset to state scheduled
[debug] [pid:351] Scheduler took 0.11739s to perform operations and allocated 0 jobs
    # Wait for jQuery successful: wait_for_result_panel: waiting for '(?^u:State: running)' (count 0 of 3240)
…
    # Wait for jQuery successful: wait_for_result_panel: waiting for '(?^u:State: running)' (count 269 of 3240)
[debug] [pid:366] Updating seen of worker 1 from worker_status (free)
[debug] [pid:366] Rescheduling jobs assigned to worker 1
    # Wait for jQuery successful: wait_for_result_panel: waiting for '(?^u:State: running)' (count 270 of 3240)
…
    # Wait for jQuery successful: wait_for_result_panel: waiting for '(?^u:State: running)' (count 637 of 3240)
    # Wait for jQuery successful: wait_for_re[debug] [pid:366] Updating seen of worker 1 from worker_status (free)
[debug] [pid:366] Rescheduling jobs assigned to worker 1sult_panel: waiting for '(?^u:State: running)' (count 638 of 3240)
    # Wait for jQuery successful: wait_for_result_panel: waiting for '(?^u:State: running)' (count 639 of 3240)
    # Wait for jQuery successful: wait_for_result_panel: waiting for '(?^u:State: running)' (count 640 of 3240)
    # Wait for jQuery successful: wait_for_result_panel: waiting for '(?^u:State: running)' (count 641 of 3240)
    # Wait for jQuery successful: wait_for_result_panel: waiting for '(?^u:State: running)' (count 642 of 3240)
    # Wait for jQuery successful: wait_for_result_panel: waiting for '(?^u:State: running)' (count 643 of 3240)
    # Wait for jQuery successful: wait_for_result_panel: waiting for '(?^u:State: running)' (count 644 of 3240)
    # Wait for jQuery successful: wait_for_result_panel: waiting for '(?^u:State: running)' (count 645 of 3240)
    # Wait for jQuery successful: wait_for_result_panel: waiting for '(?^u:State: running)' (count 646 of 3240)
    # Wait for jQuery successful: wait_for_result_panel: waiting for '(?^u:State: running)' (count 647 of 3240)
    # Wait for jQuery successful: wait_for_result_panel: waiting for '(?^u:State: running)' (count 648 of 3240)
    # Wait for jQuery successful: wait_for_result_panel: waiting for '(?^u:State: running)' (count 649 of 3240)
    # Wait for jQuery successful: wait_for_result_panel: waiting for '(?^u:State: running)' (count 650 of 3240)
    # Wait for jQuery successful: wait_for_result_panel: waiting for '(?^u:State: running)' (count 651 of 3240)
    # Wait for jQuery successful: wait_for_result_panel: waiting for '(?^u:State: running)' (count 652 of 3240)
    # Wait for jQuery successful: wait_for_result_panel: waiting for '(?^u:State: running)' (count 653 of 3240)
    # Wait for jQuery successful: wait_for_result_panel: waiting for '(?^u:State: running)' (count 654 of 3240)
    # Wait for jQuery successful: wait_for_re[debug] [pid:366] Updating seen of worker 1 from worker_status (free)
[debug] [pid:366] Rescheduling jobs assigned to worker 1sult_panel: waiting for '(?^u:State: running)' (count 655 of 3240)
    # Wait for jQuery successful: wait_for_result_panel: waiting for '(?^u:State: running)' (count 656 of 3240)
…
    # Wait for jQuery successful: wait_for_result_panel: waiting for '(?^u:State: running)' (count 3238 of 3240)
    # Wait for jQuery successful: wait_for_result_panel: waiting for '(?^u:State: running)' (count 3239 of 3240)
    ok 6 - No unexpected js warnings
    not ok 7 - Expected result not found
    not ok 8 - job 5 running
    # job info: {
    #   'blocked_by_id' => undef,
    #   'clone_id' => undef,
    #   'group_id' => undef,
    #   'id' => 5,
    #   'name' => 'tinycore-1-flavor-i386-Build1-core@coolone',
    #   'priority' => 50,
    #   'result' => 'none',
    #   'settings' => {
    #     'ARCH' => 'i386',
    #     'BUILD' => '1',
    #     'DISTRI' => 'tinycore',
    #     'FLAVOR' => 'flavor',
    #     'INTEGRATION_TESTS' => '1',
    #     'ISO' => 'Core-7.2.iso',
    #     'MACHINE' => 'coolone',
    #     'NAME' => '00000005-tinycore-1-flavor-i386-Build1-core@coolone',
    #     'PUBLISH_HDD_1' => '',
    #     'QEMU' => 'i386',
    #     'QEMU_NO_FDC_SET' => '1',
    #     'QEMU_NO_KVM' => '1',
    #     'QEMU_NO_TABLET' => '1',
    #     'TEST' => 'core',
    #     'UEFI_PFLASH_VARS' => '/usr/share/qemu/ovmf-x86_64.bin',
    #     'VERSION' => '1',
    #     'WORKER_CLASS' => 'qemu_i386'
    #   },
    #   'state' => 'scheduled',
    #   't_finished' => undef,
    #   't_started' => undef,
    #   'test' => 'core'
    # }
    # Subtest: results of test 5
…

Not sure why the scheduler/ws-server cannot just do the assignment on the next tick. The worker is at least updating its status and considered free.

EDIT: Looks like the scheduling in the fullstack test happens explicitly (the scheduler is not constantly running in the background). So we start the worker, don' wait that it is actually ready and then try to assign jobs. That is a recipe for failures like this and I'm surprised we don't see such failures more often.

Actions #6

Updated by openqa_review 6 months ago

  • Due date set to 2024-07-20

Setting due date based on mean cycle time of SUSE QE Tools

Actions #7

Updated by mkittler 5 months ago

  • Status changed from In Progress to Feedback

PR for fullstack test: https://github.com/os-autoinst/openQA/pull/5749


So we start the worker, don' wait that it is actually ready and then try to assign jobs. That is a recipe for failures like this and I'm surprised we don't see such failures more often.

The test actually does wait for the worker to be registered. It only does not wait for the web socket connection. Therefore it makes sense that this kind of failure is not happening very often.

Actions #8

Updated by mkittler 5 months ago

  • Status changed from Feedback to Resolved

I haven't seen any failures so far. I'm closing the ticket for now and if it happens again we can reopen it.

Actions #9

Updated by okurz 5 months ago

  • Due date deleted (2024-07-20)
Actions

Also available in: Atom PDF