Project

General

Profile

Actions

action #128807

closed

[sporadic] ci: openQA circleci fullstack-unstable fails with new error size:M

Added by tinita over 1 year ago. Updated about 1 year ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2023-05-05
Due date:
2023-07-05
% Done:

0%

Estimated time:

Description

Observation

https://app.circleci.com/pipelines/github/os-autoinst/openQA/11429/workflows/2deb72d7-9b20-4ac4-b917-a83129bef0a3/jobs/106787

[10:02:24] t/33-developer_mode.t .. 20/? 
    #   Failed test 'otherdeveloper logged-in'
    #   at t/33-developer_mode.t line 262.
    #          got: 'Login'
    #     expected: 'Logged in as otherdeveloper'

        #   Failed test '#developer-panel .card-body visible'
        #   at /home/squamata/project/t/lib/OpenQA/SeleniumTest.pm line 261.

        #   Failed test '#developer-panel .card-body contains (?^u:Another user has already locked this job.)'
        #   at /home/squamata/project/t/lib/OpenQA/SeleniumTest.pm line 267.
        #                   ''
        #     doesn't match '(?^u:Another user has already locked this job.)'
        # Looks like you failed 2 tests of 13.

    #   Failed test 'expand developer panel'
    #   at t/33-developer_mode.t line 282.
    # Looks like you failed 2 tests of 3.
[10:02:24] t/33-developer_mode.t .. 21/? 
#   Failed test 'status-only route accessible for other users'
#   at t/33-developer_mode.t line 283.

Acceptance criteria

  • AC1: Test no longer fails

Suggestions

  • Try to reproduce manually with make test FULLSTACK=1 KEEP_DB=1 TESTS=t/33-developer_mode.t STABILITY_TEST=1 RETRY=30, otherwise run the test in Circle CI with SSH activated
Actions #1

Updated by okurz over 1 year ago

  • Priority changed from Normal to High
Actions #2

Updated by mkittler over 1 year ago

  • Subject changed from [sporadic] ci: openQA circleci fullstack-unstable fails with new error to [sporadic] ci: openQA circleci fullstack-unstable fails with new error size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #3

Updated by okurz over 1 year ago

I could just reproduce the problem locally easily

Actions #4

Updated by okurz over 1 year ago

  • Status changed from Workable to In Progress
  • Assignee set to okurz
Actions #5

Updated by openqa_review over 1 year ago

  • Due date set to 2023-05-26

Setting due date based on mean cycle time of SUSE QE Tools

Actions #6

Updated by okurz over 1 year ago

I lost my statistics, trying again:

echo "### Before" && OPENQA_TEST_TIMEOUT_DISABLE=1 FULLSTACK=1 count_fail_ratio timeout 120 prove -l t/33-developer_mode.t && echo "### end of before" && git stash pop  && echo "### After" && OPENQA_TEST_TIMEOUT_DISABLE=1 FULLSTACK=1 count_fail_ratio timeout 120 prove -l t/33-developer_mode.t && echo "### end of after"

shows right now ## count_fail_ratio: Run: 20. Fails: 3. Fail ratio 15.00% for "before fix".

Actions #8

Updated by okurz over 1 year ago

By the way, the test is not fully stable hence it's correct that the test is in "unstable" tests. I just saw one occurence of

## Run 101
t/33-developer_mode.t .. 25/?
    #   Failed test 'log file generated under /tmp/abKc0gdL7l/developer-mode/openqa/testresults/00000/00000001-tinycore-1-flavor-i386-Build1-core@coolone/autoinst-log.txt'
    #   at t/33-developer_mode.t line 345.
    # Looks like you failed 1 test of 2.

#   Failed test 'test cancelled by quitting the session'
#   at t/33-developer_mode.t line 346.
t/33-developer_mode.t .. 27/? # Looks like you failed 1 test of 27.
t/33-developer_mode.t .. Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/27 subtests

The according test code is:

339 subtest 'test cancelled by quitting the session' => sub {
340     $driver->switch_to_window($first_tab);
341     $driver->get($job_page_url);
342     ok wait_for_result_panel($driver, qr/(State: cancelled|Result: (user_cancelled|passed))/),
343       'test 1 has been cancelled (if it was fast enough to actually pass that is ok, too)';
344     my $log_file_path = path($resultdir, '00000', "00000001-$job_name")->make_path->child('autoinst-log.txt');
345     ok -s $log_file_path, "log file generated under $log_file_path";
346 };
Actions #9

Updated by okurz over 1 year ago

Talked about the topic within the daily 2023-05-16. I am trying to reproduce the problem with a simpler test:

25 $driver->click_element_ok('Login', 'link_text');
26 while (1) {
27     $driver->get('/logout');
28     $driver->get('/login?user=otherdeveloper');
29     $driver->element_text_is('#user-action a', 'Logged in as otherdeveloper', 'otherdeveloper logged-in');
30 }

Results after running that for 3h: 4/22959=0.0174% fails so way less than in the original test where maybe the openQA instance is "more busy" to process the logout/login. I don't consider this that useful for reproduction. However let me still try to use it with get_ok as variant.

Current state 0/(26000/3) fails. Calculation is "/3" as there are three test messages for logout&login

Actions #10

Updated by okurz over 1 year ago

  • Status changed from In Progress to Feedback

8/(114741/3)=0.021% so about the same fail ratio as in before, get_ok does not help. Any other ideas?

Regarding a bit of statistics and improving count_fail_ratio I am thinking of https://en.wikipedia.org/wiki/Rule_of_three_(statistics) stating that if no failures occur in N runs one can expect the probability of a failure to occur to be less than 3/N (with a confidence value of 95%).

I still would like to find out how to calculate the 1-sigma and like 3-sigma.

Other thing I want to try is to try to provoke the error more easily with artificially delaying the request on the server side.

Actions #11

Updated by okurz over 1 year ago

  • Due date changed from 2023-05-26 to 2023-06-23
  • Priority changed from High to Normal
Actions #13

Updated by okurz over 1 year ago

PR is clean, CI jobs passed. Local execution

echo "### Before" && OPENQA_TEST_TIMEOUT_DISABLE=1 FULLSTACK=1 runs=1000 count_fail_ratio timeout 120 prove -l t/33-developer_mode.t && echo "### end of before" && git co fix/developer_t && echo "### After" && OPENQA_TEST_TIMEOUT_DISABLE=1 FULLSTACK=1 runs=1000 count_fail_ratio timeout 120 prove -l t/33-developer_mode.t && echo "### end of after"

Before fix (aborted prematurely)
"## count_fail_ratio: Run: 53. Fails: 7. Fail ratio 13.00±0%"
After
"## count_fail_ratio: Run: 24. Fails: 4. Fail ratio 16.66±14.91%"

So no effect. I need further ideas what to do.

Actions #14

Updated by dheidler about 1 year ago

  • Status changed from Feedback to Workable
  • Assignee changed from okurz to mkittler
Actions #15

Updated by dheidler about 1 year ago

  • Due date changed from 2023-06-23 to 2023-07-05

Decided in unblock meeting

Actions #16

Updated by mkittler about 1 year ago

  • Status changed from Workable to In Progress

Even when introducing a 5 seconds delay to /logout or to / (which /logout redirects to) this issue is not locally reproducible. So Selenium generally synchronizes the get calls well enough (also in case there is a redirection).

Actions #17

Updated by okurz about 1 year ago

mkittler wrote:

Even when introducing a 5 seconds delay to /logout or to / (which /logout redirects to) this issue is not locally reproducible.

Have you seen my results in #128807-13 regarding reproduction rate? That was on my Leap 15.4 x86_64 notebook. Did you try a reasonable big sample size? Maybe your Tumbleweed(?) instance has different reproduction rates? Maybe Tumbleweed actually has a fix?

Actions #18

Updated by mkittler about 1 year ago

I was and I am still aiming for a stable reproducer. We have to figure out the conditions/problems that lead to this situation so we can fix the problem. If that doesn't work I would just introduce a retry loop.

Actions #19

Updated by mkittler about 1 year ago

  • Status changed from In Progress to Feedback

I added a retry loop: https://github.com/os-autoinst/openQA/pull/5218

Even if we could pin-down the problem more closely that might be the solution we'd end up anyways (except ditching Selenium completely and rewriting with something else).

Actions #20

Updated by mkittler about 1 year ago

  • Status changed from Feedback to In Progress
Actions #21

Updated by mkittler about 1 year ago

  • Status changed from In Progress to Feedback
Actions #22

Updated by mkittler about 1 year ago

  • Status changed from Feedback to Resolved

The PR has been merged.

Actions

Also available in: Atom PDF