action #128807
closed[sporadic] ci: openQA circleci fullstack-unstable fails with new error size:M
Description
Observation¶
[10:02:24] t/33-developer_mode.t .. 20/?
# Failed test 'otherdeveloper logged-in'
# at t/33-developer_mode.t line 262.
# got: 'Login'
# expected: 'Logged in as otherdeveloper'
# Failed test '#developer-panel .card-body visible'
# at /home/squamata/project/t/lib/OpenQA/SeleniumTest.pm line 261.
# Failed test '#developer-panel .card-body contains (?^u:Another user has already locked this job.)'
# at /home/squamata/project/t/lib/OpenQA/SeleniumTest.pm line 267.
# ''
# doesn't match '(?^u:Another user has already locked this job.)'
# Looks like you failed 2 tests of 13.
# Failed test 'expand developer panel'
# at t/33-developer_mode.t line 282.
# Looks like you failed 2 tests of 3.
[10:02:24] t/33-developer_mode.t .. 21/?
# Failed test 'status-only route accessible for other users'
# at t/33-developer_mode.t line 283.
Acceptance criteria¶
- AC1: Test no longer fails
Suggestions¶
- Try to reproduce manually with
make test FULLSTACK=1 KEEP_DB=1 TESTS=t/33-developer_mode.t STABILITY_TEST=1 RETRY=30
, otherwise run the test in Circle CI with SSH activated
Updated by mkittler over 1 year ago
- Subject changed from [sporadic] ci: openQA circleci fullstack-unstable fails with new error to [sporadic] ci: openQA circleci fullstack-unstable fails with new error size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by okurz over 1 year ago
I could just reproduce the problem locally easily
Updated by okurz over 1 year ago
- Status changed from Workable to In Progress
- Assignee set to okurz
Updated by openqa_review over 1 year ago
- Due date set to 2023-05-26
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz over 1 year ago
I lost my statistics, trying again:
echo "### Before" && OPENQA_TEST_TIMEOUT_DISABLE=1 FULLSTACK=1 count_fail_ratio timeout 120 prove -l t/33-developer_mode.t && echo "### end of before" && git stash pop && echo "### After" && OPENQA_TEST_TIMEOUT_DISABLE=1 FULLSTACK=1 count_fail_ratio timeout 120 prove -l t/33-developer_mode.t && echo "### end of after"
shows right now ## count_fail_ratio: Run: 20. Fails: 3. Fail ratio 15.00%
for "before fix".
Updated by okurz over 1 year ago
Updated by okurz over 1 year ago
By the way, the test is not fully stable hence it's correct that the test is in "unstable" tests. I just saw one occurence of
## Run 101
t/33-developer_mode.t .. 25/?
# Failed test 'log file generated under /tmp/abKc0gdL7l/developer-mode/openqa/testresults/00000/00000001-tinycore-1-flavor-i386-Build1-core@coolone/autoinst-log.txt'
# at t/33-developer_mode.t line 345.
# Looks like you failed 1 test of 2.
# Failed test 'test cancelled by quitting the session'
# at t/33-developer_mode.t line 346.
t/33-developer_mode.t .. 27/? # Looks like you failed 1 test of 27.
t/33-developer_mode.t .. Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/27 subtests
The according test code is:
339 subtest 'test cancelled by quitting the session' => sub {
340 $driver->switch_to_window($first_tab);
341 $driver->get($job_page_url);
342 ok wait_for_result_panel($driver, qr/(State: cancelled|Result: (user_cancelled|passed))/),
343 'test 1 has been cancelled (if it was fast enough to actually pass that is ok, too)';
344 my $log_file_path = path($resultdir, '00000', "00000001-$job_name")->make_path->child('autoinst-log.txt');
345 ok -s $log_file_path, "log file generated under $log_file_path";
346 };
Updated by okurz over 1 year ago
Talked about the topic within the daily 2023-05-16. I am trying to reproduce the problem with a simpler test:
25 $driver->click_element_ok('Login', 'link_text');
26 while (1) {
27 $driver->get('/logout');
28 $driver->get('/login?user=otherdeveloper');
29 $driver->element_text_is('#user-action a', 'Logged in as otherdeveloper', 'otherdeveloper logged-in');
30 }
Results after running that for 3h: 4/22959=0.0174% fails so way less than in the original test where maybe the openQA instance is "more busy" to process the logout/login. I don't consider this that useful for reproduction. However let me still try to use it with get_ok
as variant.
Current state 0/(26000/3) fails. Calculation is "/3" as there are three test messages for logout&login
Updated by okurz over 1 year ago
- Status changed from In Progress to Feedback
8/(114741/3)=0.021% so about the same fail ratio as in before, get_ok
does not help. Any other ideas?
Regarding a bit of statistics and improving count_fail_ratio I am thinking of https://en.wikipedia.org/wiki/Rule_of_three_(statistics) stating that if no failures occur in N runs one can expect the probability of a failure to occur to be less than 3/N (with a confidence value of 95%).
I still would like to find out how to calculate the 1-sigma and like 3-sigma.
Other thing I want to try is to try to provoke the error more easily with artificially delaying the request on the server side.
Updated by okurz over 1 year ago
- Due date changed from 2023-05-26 to 2023-06-23
- Priority changed from High to Normal
Trying to bring in https://github.com/os-autoinst/openQA/pull/5154 first
Updated by okurz over 1 year ago
PR merged, back to https://github.com/os-autoinst/openQA/pull/5137
Updated by okurz over 1 year ago
PR is clean, CI jobs passed. Local execution
echo "### Before" && OPENQA_TEST_TIMEOUT_DISABLE=1 FULLSTACK=1 runs=1000 count_fail_ratio timeout 120 prove -l t/33-developer_mode.t && echo "### end of before" && git co fix/developer_t && echo "### After" && OPENQA_TEST_TIMEOUT_DISABLE=1 FULLSTACK=1 runs=1000 count_fail_ratio timeout 120 prove -l t/33-developer_mode.t && echo "### end of after"
Before fix (aborted prematurely)
"## count_fail_ratio: Run: 53. Fails: 7. Fail ratio 13.00±0%"
After
"## count_fail_ratio: Run: 24. Fails: 4. Fail ratio 16.66±14.91%"
So no effect. I need further ideas what to do.
Updated by dheidler over 1 year ago
- Status changed from Feedback to Workable
- Assignee changed from okurz to mkittler
Updated by dheidler over 1 year ago
- Due date changed from 2023-06-23 to 2023-07-05
Decided in unblock meeting
Updated by mkittler over 1 year ago
- Status changed from Workable to In Progress
Even when introducing a 5 seconds delay to /logout
or to /
(which /logout
redirects to) this issue is not locally reproducible. So Selenium generally synchronizes the get
calls well enough (also in case there is a redirection).
Updated by okurz over 1 year ago
mkittler wrote:
Even when introducing a 5 seconds delay to
/logout
or to/
(which/logout
redirects to) this issue is not locally reproducible.
Have you seen my results in #128807-13 regarding reproduction rate? That was on my Leap 15.4 x86_64 notebook. Did you try a reasonable big sample size? Maybe your Tumbleweed(?) instance has different reproduction rates? Maybe Tumbleweed actually has a fix?
Updated by mkittler over 1 year ago
I was and I am still aiming for a stable reproducer. We have to figure out the conditions/problems that lead to this situation so we can fix the problem. If that doesn't work I would just introduce a retry loop.
Updated by mkittler over 1 year ago
- Status changed from In Progress to Feedback
I added a retry loop: https://github.com/os-autoinst/openQA/pull/5218
Even if we could pin-down the problem more closely that might be the solution we'd end up anyways (except ditching Selenium completely and rewriting with something else).
Updated by mkittler over 1 year ago
- Status changed from Feedback to In Progress
Updated by mkittler over 1 year ago
- Status changed from In Progress to Feedback
Updated by mkittler over 1 year ago
- Status changed from Feedback to Resolved
The PR has been merged.