action #10460: *Huge* gaps in assert_screen - openQA Project (public) - openSUSE Project Management Tool

Actions

Copy link

action #10460

closed

Huge gaps in assert_screen

Added by coolo over 9 years ago. Updated over 9 years ago.

Status:

Resolved

Priority:

Immediate

Assignee:

coolo

Category:

Regressions/Crashes

Target version:

Milestone 1

Start date:

2016-01-28

Due date:

% Done:

Estimated time:

Description

https://openqa.suse.de/tests/190772 is a very suprising failure - waiting for rebootnow it suddenly finds itself in gdm. Looking at the logs, we missed 15 seconds:

no match 1915
MATCH(rebootnow-20131217:0.00)
MATCH(rebootnow-20141204:0.00)
MATCH(rebootnow-20150409:0.00)
MATCH(rebootnow-390x-20150709:0.00)
no match 1914
MATCH(rebootnow-20131217:0.00)
MATCH(rebootnow-20141204:0.00)
MATCH(rebootnow-20150409:0.00)
MATCH(rebootnow-390x-20150709:0.00)
no match 1899
no change 1899
no change 1898

So this is about half an hour before the test was done - which should be around 2am according to the journal:

Jan 28 01:55:14 openqaworker1 worker[13862]: got job 190772: 00190772-sle-12SP2-Staging:D-Server-DVD-x86_64-Build0006-RAID1
Jan 28 01:55:14 openqaworker1 worker[13862]: 28004: WORKING 190772
Jan 28 02:37:55 openqaworker1 worker[13862]: cleaning up 00190772-sle-12SP2-Staging:D-Server-DVD-x86_64-Build0006-RAID1...

There is nothing too unusual in the log and if I count this right, it had roughly 8 jobs running - so what is going on?

But I think we need to mark this gap in the result somehow. We need to consider more than 2 seconds for one assert_screen match a fatal failure. This also applies to the grub needle. If we have too many needles to get done in 2 seconds, we need to clean up the needle pool.

Files

vars.json (1.26 KB) vars.json

okurz, 2016-02-07 21:48

Related issues 2 (0 open — 2 closed)

Actions

Copy link

Updated by coolo over 9 years ago

190775 is even more brutal:

no change 1551 MATCH(rebootnow-20131217:0.00) MATCH(rebootnow-20141204:0.00) MATCH(rebootnow-20150409:0.00) MATCH(rebootnow-390x-20150709:0.00) no match 1550 MATCH(rebootnow-20131217:0.00) MATCH(rebootnow-20141204:0.00) MATCH(rebootnow-20150409:0.00) MATCH(rebootnow-390x-20150709:0.00) no match 1526 MATCH(rebootnow-20131217:0.00) MATCH(rebootnow-20141204:0.00) MATCH(rebootnow-20150409:0.00) MATCH(rebootnow-390x-20150709:0.00) no match 1360 no change 1350 no change 1350 no change 1349

We're missing 166 seconds there! This is an environment error we need to error on - instead of pretending.

Actions

Copy link

Updated by okurz over 9 years ago

Related to action #10136: "reboot now" can not be detected, potentially skipped added

Actions

Copy link

Updated by coolo over 9 years ago

I reduced it to capture_screenshot taking way too long. Normally it's done in < 0.1s, but the highscore in my tests was 33.4s. capture_screenshot is creating a new file and it seems problematic to the filesystem ;(

Actions

Copy link

Updated by coolo over 9 years ago

Category set to Regressions/Crashes
Status changed from New to Resolved
Assignee set to coolo
Target version set to Milestone 1

The result of my day of research:

https://github.com/os-autoinst/os-autoinst/pull/405 and our workers being consistent on xfs now

Actions

Copy link

Updated by coolo over 9 years ago

Status changed from Resolved to In Progress
Priority changed from Normal to Immediate

The problem is less likely but still happening. I disabled now 4 workers on worker1 and worker2 - last resort ;(

Actions

Copy link

Updated by coolo over 9 years ago

Reenabled the workers as it a) didn't really change anything and b) we now reschedule the job if it happens.

But it breaks in all kind of places ;(

Debug: /var/lib/openqa/share/tests/sle/tests/console/curl_https.pm:17 called testapi::validate_script_output
<<< wait_serial(regex='SCRIPT_FINISHED', timeout=10)
WARNING: capture_screenshot took 9.93 seconds - slow IO?
>>> wait_serial: SCRIPT_FINISHED: fail
test curl_https died at /usr/lib/os-autoinst/basetest.pm line 274.

Actions

Copy link

Updated by okurz over 9 years ago

maybe this is a regression because of changes within os-autoinst || openQA || test distri. You should have observed this earlier if not…

Actions

Copy link

Updated by coolo over 9 years ago

we did notice strange "overload" problems before. We just couldn't name the reason so far.

Actions

Copy link

Updated by okurz over 9 years ago

We should improve the error checking, e.g. in cases like these:

MATCH(bootmenu-20141112:0.00)
no match 15
MATCH(bootmenu-20141112:0.00)
no match 14
MATCH(bootmenu-20141112:0.00)
no match 13
MATCH(bootmenu-20141112:0.00)
no match 11
MATCH(bootmenu-20141112:0.00)
no match 11
WARNING: capture_screenshot took 49.06 seconds - slow IO?
MATCH(bootmenu-20141112:0.00)
>>> assert_screen: match=inst-bootmenu timed out after 15
test rescuesystem failed at /usr/lib/os-autoinst/basetest.pm line 274.
WARNING: capture_screenshot took 0.55 seconds - slow IO?
test rescuesystem failed

bad luck that the capture die happened in second 0, the warning is correct - but it doesn't abort the test where it should have been incomplete here, too.
IMHO it must be as complicated as "if we fail in an assert_screen AND we reported stalls in before, judge this as a reason and return incomplete. To make the "WARNING" memorable to os-autoinst we could probably use a leaky bucket error counter and check the bucket fill grade at test outcome to decide if it should be incomplete or pass. I know, sounds "too intelligent" but I am sure this is correct.

Of course, it is more important to fix the gaps appearing rather than get lost in statistical analysis but the same problem might appear again anyway in the future because of differing reasons as long as we don't use a hard-realtime system for monitoring the SUTs.

Actions

Copy link

#10

Updated by okurz over 9 years ago

File vars.json vars.json added

error detection also triggers on my machine on cloud.suse.de

Debug: /var/lib/openqa/share/tests/opensuse/tests/installation/installation_overview.pm:21 called testapi::assert_screen
<<< assert_screen(mustmatch='inst-overview', timeout=30)
MATCH(O42-inst-overview-gnome-20150624:0.00)
MATCH(installation_overview-Update-to-TW+NonOSS-20150317:0.00)
MATCH(inst-overview-gnome-20151116:0.00)
MATCH(installation_overview-Staging_Update-20150316:0.00)
MATCH(installation_overview-Update-to-TW-20151120:0.00)
MATCH(installation_overview-Update-to-TW-20150317:0.00)
MATCH(installation_overview-Update-to-TW+NonOSS-20151120:0.00)
MATCH(installation_overview-Staging_Update-20151116:0.00)
MATCH(installation_overview-Update-121-to-42+NonOSS-20151015:0.00)
MATCH(installation_overview-Update-121-to-TW+NonOSS-20150322:0.00)
MATCH(42-installation_overview-Staging_Update-20151116:0.00)
MATCH(installation_overview-Update-to-42+NonOSS-20151016:0.00)
MATCH(42-installation_overview-Staging_Update-20150822:0.00)
MATCH(inst-overview-gnome-20141210:0.00)
MATCH(installation_overview-Update-121-to-TW+NonOSS-20151125:0.00)
WARNING: check_asserted_screen took 13.15 seconds - make your needles more specific
no match 120
<<< mydie(cause_of_death='There is some problem with your environment, we detected a stall for 13 seconds')
DIE mydie at /home/okurz/local/os-autoinst/backend/baseclass.pm line 188 thread 1.

 at /home/okurz/local/os-autoinst/backend/baseclass.pm line 73 thread 1.
	backend::baseclass::die_handler('mydie at /home/okurz/local/os-autoinst/backend/baseclass.pm l...') called at /usr/lib/perl5/5.18.2/Carp.pm line 100 thread 1
	Carp::croak('mydie') called at /home/okurz/local/os-autoinst/bmwqemu.pm line 343 thread 1
	bmwqemu::mydie('There is some problem with your environment, we detected a st...') called at /home/okurz/local/os-autoinst/backend/baseclass.pm line 188 thread 1
	eval {...} called at /home/okurz/local/os-autoinst/backend/baseclass.pm line 164 thread 1
	backend::baseclass::run_capture_loop('backend::qemu=HASH(0x89b8c58)', 'IO::Select=ARRAY(0x73cf2d8)') called at /home/okurz/local/os-autoinst/backend/baseclass.pm line 113 thread 1
	backend::baseclass::run('backend::qemu=HASH(0x89b8c58)', 11, 14) called at /home/okurz/local/os-autoinst/backend/driver.pm line 82 thread 1
	backend::driver::_run('backend::qemu=HASH(0x89b8c58)', 11, 14) called at /home/okurz/local/os-autoinst/backend/driver.pm line 69 thread 1
	eval {...} called at /home/okurz/local/os-autoinst/backend/driver.pm line 69 thread 1

The test parameters used can be found in the attachment.

In this case I would have assumed the run to continue fine but I don't know for sure. This was easily reproduced four times in a row (because of automatic retrigger). Can we please reduce the die to a warning here, too?

Actions

Copy link

#11

Updated by okurz over 9 years ago

looks like https://openqa.suse.de/tests/200116/file/autoinst-log.txt fails because of "stall", I will retrigger the test. Tell me if I am wrong.

Actions

Copy link

#12

Updated by coolo over 9 years ago

https://openqa.suse.de/tests/200116 doesn't appear failed to me

Actions

Copy link

#13

Updated by okurz over 9 years ago

the job did not fail but the zypper_in test module: https://openqa.suse.de/tests/200116/modules/zypper_in/steps/3 From the logs it looked like because of the issue at hand but I don't remember in detail why.

Actions

Copy link

#14

Updated by okurz over 9 years ago

from
https://openqa.suse.de/tests/204149/file/autoinst-log.txt

WARNING: enqueue_screenshot took 62.81 seconds - slow IO? (opencv: 0.01 - encoder: 62.80)

and then dies shortly after with

WARNING: check_asserted_screen took 1.34 seconds - make your needles more specific
>>> assert_screen: match=generic-desktop timed out after 30

Actions

Copy link

#15

Updated by coolo over 9 years ago

Status changed from In Progress to Resolved

The kernel issues still need to be fixed, but with https://github.com/os-autoinst/os-autoinst/pull/427 I consider this issue done.

Actions

Copy link

#16

Updated by coolo over 9 years ago

Has duplicate action #10688: [Build 0902] "btrfs" and "gnome" - select_console("install-shell") seems to not work properly added

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

QA (public) » openQA Project (public)

Tags

Custom queries

action #10460

Huge gaps in assert_screen

Updated by coolo over 9 years ago

Updated by okurz over 9 years ago

Updated by coolo over 9 years ago

Updated by coolo over 9 years ago

Updated by coolo over 9 years ago

Updated by coolo over 9 years ago

Updated by okurz over 9 years ago

Updated by coolo over 9 years ago

Updated by okurz over 9 years ago

Updated by okurz over 9 years ago

Updated by okurz over 9 years ago

Updated by coolo over 9 years ago

Updated by okurz over 9 years ago

Updated by okurz over 9 years ago

Updated by coolo over 9 years ago

Updated by coolo over 9 years ago