action #10460

*Huge* gaps in assert_screen

Added by coolo about 4 years ago. Updated about 4 years ago.

Status:ResolvedStart date:28/01/2016
Priority:ImmediateDue date:
Assignee:coolo% Done:

0%

Category:Concrete Bugs
Target version:Milestone 1
Difficulty:
Duration:

Description

https://openqa.suse.de/tests/190772 is a very suprising failure - waiting for rebootnow it suddenly finds itself in gdm. Looking at the logs, we missed 15 seconds:

no match 1915
MATCH(rebootnow-20131217:0.00)
MATCH(rebootnow-20141204:0.00)
MATCH(rebootnow-20150409:0.00)
MATCH(rebootnow-390x-20150709:0.00)
no match 1914
MATCH(rebootnow-20131217:0.00)
MATCH(rebootnow-20141204:0.00)
MATCH(rebootnow-20150409:0.00)
MATCH(rebootnow-390x-20150709:0.00)
no match 1899
no change 1899
no change 1898

So this is about half an hour before the test was done - which should be around 2am according to the journal:

Jan 28 01:55:14 openqaworker1 worker[13862]: got job 190772: 00190772-sle-12SP2-Staging:D-Server-DVD-x86_64-Build0006-RAID1
Jan 28 01:55:14 openqaworker1 worker[13862]: 28004: WORKING 190772
Jan 28 02:37:55 openqaworker1 worker[13862]: cleaning up 00190772-sle-12SP2-Staging:D-Server-DVD-x86_64-Build0006-RAID1...

There is nothing too unusual in the log and if I count this right, it had roughly 8 jobs running - so what is going on?

But I think we need to mark this gap in the result somehow. We need to consider more than 2 seconds for one assert_screen match a fatal failure. This also applies to the grub needle. If we have too many needles to get done in 2 seconds, we need to clean up the needle pool.

vars.json (1.26 KB) okurz, 07/02/2016 09:48 pm


Related issues

Related to openQA Tests - action #10136: "reboot now" can not be detected, potentially skipped Resolved 07/01/2016
Duplicated by openQA Project - action #10688: [Build 0902] "btrfs" and "gnome" - select_console("instal... Resolved 10/02/2016

History

#1 Updated by coolo about 4 years ago

190775 is even more brutal:


no change 1551
MATCH(rebootnow-20131217:0.00)
MATCH(rebootnow-20141204:0.00)
MATCH(rebootnow-20150409:0.00)
MATCH(rebootnow-390x-20150709:0.00)
no match 1550
MATCH(rebootnow-20131217:0.00)
MATCH(rebootnow-20141204:0.00)
MATCH(rebootnow-20150409:0.00)
MATCH(rebootnow-390x-20150709:0.00)
no match 1526
MATCH(rebootnow-20131217:0.00)
MATCH(rebootnow-20141204:0.00)
MATCH(rebootnow-20150409:0.00)
MATCH(rebootnow-390x-20150709:0.00)
no match 1360
no change 1350
no change 1350
no change 1349

We're missing 166 seconds there! This is an environment error we need to error on - instead of pretending.

#2 Updated by okurz about 4 years ago

  • Related to action #10136: "reboot now" can not be detected, potentially skipped added

#3 Updated by coolo about 4 years ago

I reduced it to capture_screenshot taking way too long. Normally it's done in < 0.1s, but the highscore in my tests was 33.4s. capture_screenshot is creating a new file and it seems problematic to the filesystem ;(

#4 Updated by coolo about 4 years ago

  • Category set to Concrete Bugs
  • Status changed from New to Resolved
  • Assignee set to coolo
  • Target version set to Milestone 1

The result of my day of research:

https://github.com/os-autoinst/os-autoinst/pull/405 and our workers being consistent on xfs now

#5 Updated by coolo about 4 years ago

  • Status changed from Resolved to In Progress
  • Priority changed from Normal to Immediate

The problem is less likely but still happening. I disabled now 4 workers on worker1 and worker2 - last resort ;(

#6 Updated by coolo about 4 years ago

Reenabled the workers as it a) didn't really change anything and b) we now reschedule the job if it happens.

But it breaks in all kind of places ;(

Debug: /var/lib/openqa/share/tests/sle/tests/console/curl_https.pm:17 called testapi::validate_script_output
<<< wait_serial(regex='SCRIPT_FINISHED', timeout=10)
WARNING: capture_screenshot took 9.93 seconds - slow IO?
>>> wait_serial: SCRIPT_FINISHED: fail
test curl_https died at /usr/lib/os-autoinst/basetest.pm line 274.

#7 Updated by okurz about 4 years ago

maybe this is a regression because of changes within os-autoinst || openQA || test distri. You should have observed this earlier if not…

#8 Updated by coolo about 4 years ago

we did notice strange "overload" problems before. We just couldn't name the reason so far.

#9 Updated by okurz about 4 years ago

We should improve the error checking, e.g. in cases like these:

MATCH(bootmenu-20141112:0.00)
no match 15
MATCH(bootmenu-20141112:0.00)
no match 14
MATCH(bootmenu-20141112:0.00)
no match 13
MATCH(bootmenu-20141112:0.00)
no match 11
MATCH(bootmenu-20141112:0.00)
no match 11
WARNING: capture_screenshot took 49.06 seconds - slow IO?
MATCH(bootmenu-20141112:0.00)
>>> assert_screen: match=inst-bootmenu timed out after 15
test rescuesystem failed at /usr/lib/os-autoinst/basetest.pm line 274.
WARNING: capture_screenshot took 0.55 seconds - slow IO?
test rescuesystem failed

bad luck that the capture die happened in second 0, the warning is correct - but it doesn't abort the test where it should have been incomplete here, too.
IMHO it must be as complicated as "if we fail in an assert_screen AND we reported stalls in before, judge this as a reason and return incomplete. To make the "WARNING" memorable to os-autoinst we could probably use a leaky bucket error counter and check the bucket fill grade at test outcome to decide if it should be incomplete or pass. I know, sounds "too intelligent" but I am sure this is correct.

Of course, it is more important to fix the gaps appearing rather than get lost in statistical analysis but the same problem might appear again anyway in the future because of differing reasons as long as we don't use a hard-realtime system for monitoring the SUTs.

#10 Updated by okurz about 4 years ago

error detection also triggers on my machine on cloud.suse.de

Debug: /var/lib/openqa/share/tests/opensuse/tests/installation/installation_overview.pm:21 called testapi::assert_screen
<<< assert_screen(mustmatch='inst-overview', timeout=30)
MATCH(O42-inst-overview-gnome-20150624:0.00)
MATCH(installation_overview-Update-to-TW+NonOSS-20150317:0.00)
MATCH(inst-overview-gnome-20151116:0.00)
MATCH(installation_overview-Staging_Update-20150316:0.00)
MATCH(installation_overview-Update-to-TW-20151120:0.00)
MATCH(installation_overview-Update-to-TW-20150317:0.00)
MATCH(installation_overview-Update-to-TW+NonOSS-20151120:0.00)
MATCH(installation_overview-Staging_Update-20151116:0.00)
MATCH(installation_overview-Update-121-to-42+NonOSS-20151015:0.00)
MATCH(installation_overview-Update-121-to-TW+NonOSS-20150322:0.00)
MATCH(42-installation_overview-Staging_Update-20151116:0.00)
MATCH(installation_overview-Update-to-42+NonOSS-20151016:0.00)
MATCH(42-installation_overview-Staging_Update-20150822:0.00)
MATCH(inst-overview-gnome-20141210:0.00)
MATCH(installation_overview-Update-121-to-TW+NonOSS-20151125:0.00)
WARNING: check_asserted_screen took 13.15 seconds - make your needles more specific
no match 120
<<< mydie(cause_of_death='There is some problem with your environment, we detected a stall for 13 seconds')
DIE mydie at /home/okurz/local/os-autoinst/backend/baseclass.pm line 188 thread 1.

 at /home/okurz/local/os-autoinst/backend/baseclass.pm line 73 thread 1.
    backend::baseclass::die_handler('mydie at /home/okurz/local/os-autoinst/backend/baseclass.pm l...') called at /usr/lib/perl5/5.18.2/Carp.pm line 100 thread 1
    Carp::croak('mydie') called at /home/okurz/local/os-autoinst/bmwqemu.pm line 343 thread 1
    bmwqemu::mydie('There is some problem with your environment, we detected a st...') called at /home/okurz/local/os-autoinst/backend/baseclass.pm line 188 thread 1
    eval {...} called at /home/okurz/local/os-autoinst/backend/baseclass.pm line 164 thread 1
    backend::baseclass::run_capture_loop('backend::qemu=HASH(0x89b8c58)', 'IO::Select=ARRAY(0x73cf2d8)') called at /home/okurz/local/os-autoinst/backend/baseclass.pm line 113 thread 1
    backend::baseclass::run('backend::qemu=HASH(0x89b8c58)', 11, 14) called at /home/okurz/local/os-autoinst/backend/driver.pm line 82 thread 1
    backend::driver::_run('backend::qemu=HASH(0x89b8c58)', 11, 14) called at /home/okurz/local/os-autoinst/backend/driver.pm line 69 thread 1
    eval {...} called at /home/okurz/local/os-autoinst/backend/driver.pm line 69 thread 1

The test parameters used can be found in the attachment.

In this case I would have assumed the run to continue fine but I don't know for sure. This was easily reproduced four times in a row (because of automatic retrigger). Can we please reduce the die to a warning here, too?

#11 Updated by okurz about 4 years ago

looks like https://openqa.suse.de/tests/200116/file/autoinst-log.txt fails because of "stall", I will retrigger the test. Tell me if I am wrong.

#12 Updated by coolo about 4 years ago

https://openqa.suse.de/tests/200116 doesn't appear failed to me

#13 Updated by okurz about 4 years ago

the job did not fail but the zypper_in test module: https://openqa.suse.de/tests/200116/modules/zypper_in/steps/3 From the logs it looked like because of the issue at hand but I don't remember in detail why.

#14 Updated by okurz about 4 years ago

from
https://openqa.suse.de/tests/204149/file/autoinst-log.txt

WARNING: enqueue_screenshot took 62.81 seconds - slow IO? (opencv: 0.01 - encoder: 62.80)

and then dies shortly after with

WARNING: check_asserted_screen took 1.34 seconds - make your needles more specific
>>> assert_screen: match=generic-desktop timed out after 30

#15 Updated by coolo about 4 years ago

  • Status changed from In Progress to Resolved

The kernel issues still need to be fixed, but with https://github.com/os-autoinst/os-autoinst/pull/427 I consider this issue done.

#16 Updated by coolo about 4 years ago

  • Duplicated by action #10688: [Build 0902] "btrfs" and "gnome" - select_console("install-shell") seems to not work properly added

Also available in: Atom PDF