action #16438

closed

[tools]System stall during test run not detected

Added by asmorodskyi about 7 years ago. Updated over 6 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: Feature requests
Target version: -
Start date: 2017-02-03
Due date:
% Done: 0%
Estimated time:

Description

Observation

openQA test in scenario sle-12-SP3-Server-DVD-x86_64-RAID5@64bit fails in install_and_reboot.

The test fails when pressing the "alt-s" key combination at the reboot prompt dialog in "install_and_reboot".

However, this does not look like a product issue. It seems to happen because of this 28-second stall:

11:03:54.1033 35503 MATCH(rebootnow-20150409:0.45)
11:03:54.1135 35503 MATCH(rebootnow-20160504:1.00)
11:04:22.2395 35325 >>> testapi::_check_backend_response: found rebootnow-20160504, similarity 1.00 @ 271/257
11:04:22.2397 Debug: /var/lib/openqa/share/tests/sle/tests/installation/install_and_reboot.pm:108 called testapi::send_key
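
A stall like this can be spotted by comparing the timestamps of consecutive log lines. The following is a minimal sketch, not part of os-autoinst: the function names and the 5-second threshold are made up for illustration, and it assumes every relevant line starts with `HH:MM:SS.ffff`, where the digits after the dot are a decimal fraction of a second. It does not handle logs that cross midnight.

```python
import re
from datetime import datetime, timedelta

# Assumed timestamp layout at the start of a log line: HH:MM:SS.ffff
TIMESTAMP = re.compile(r"^(\d{2}:\d{2}:\d{2})\.(\d+)")


def parse_timestamp(line):
    """Return the line's timestamp as a datetime, or None if it has none."""
    m = TIMESTAMP.match(line)
    if not m:
        return None
    base = datetime.strptime(m.group(1), "%H:%M:%S")
    fraction = float("0." + m.group(2))
    return base + timedelta(seconds=fraction)


def find_gaps(lines, threshold=5.0):
    """Return (gap_seconds, previous_line, current_line) for each gap above threshold."""
    gaps = []
    previous_time, previous_line = None, None
    for line in lines:
        current_time = parse_timestamp(line)
        if current_time is None:
            continue  # skip lines without a leading timestamp
        if previous_time is not None:
            gap = (current_time - previous_time).total_seconds()
            if gap > threshold:
                gaps.append((gap, previous_line, line))
        previous_time, previous_line = current_time, line
    return gaps


# The four log lines quoted in this description.
LOG = """
11:03:54.1033 35503 MATCH(rebootnow-20150409:0.45)
11:03:54.1135 35503 MATCH(rebootnow-20160504:1.00)
11:04:22.2395 35325 >>> testapi::_check_backend_response: found rebootnow-20160504, similarity 1.00 @ 271/257
11:04:22.2397 Debug: /var/lib/openqa/share/tests/sle/tests/installation/install_and_reboot.pm:108 called testapi::send_key
"""

gaps = find_gaps(LOG.splitlines())
for gap, before, after in gaps:
    print(f"{gap:.1f}s of silence before: {after}")
```

On this excerpt the only gap above the threshold is the one between the second MATCH line and the `_check_backend_response` line.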

Reproducible

Fails sporadically in different scenarios.

Expected result

Last good: 0231 (or more recent)

Further details

Always latest result in this scenario: latest


Related issues 1 (0 open, 1 closed)

Related to openQA Tests - action #17594: [tools]Missing characters in the middle of type_string/assert_script_run (Ninja Keys), Resolved, szarate, 2017-03-07

Actions #1

Updated by asmorodskyi about 7 years ago

  • Project changed from openQA Tests to openQA Project
  • Category changed from Bugs in existing tests to 132
  • Priority changed from Normal to High
Actions #2

Updated by asmorodskyi about 7 years ago

Another job that fails with the same symptoms has something interesting in its logs (https://openqa.suse.de/tests/757155):
Couldn't get a file descriptor referring to the console
%@umount: /var/lib/nfs/rpc_pipefs: target is busy
(In some cases useful info about processes that
use the device is found by lsof(8) or fuser(1).)
Couldn't get a file descriptor referring to the console
Couldn't get a file descriptor referring to the console

Actions #3

Updated by coolo about 7 years ago

  • Priority changed from High to Immediate
  • Target version set to Milestone 5

I checked what we deployed when this started: it is caused by a combination of slow NFS and commit 2f6118c39fa972b4aba94407e267e5e6f47cef0f in needle.pm, which now checks the JSON on NFS (and not in the needle cache). As a result, there is a huge delay after every needle match. The backend is not stalling, but the test thread is. In the end, it doesn't matter: we do not send alt+s.
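
To illustrate why reading needle JSON from a slow share after every match hurts, and what caching changes, here is a minimal sketch of an in-memory metadata cache. It is not the os-autoinst implementation: the class name, the `disk_reads` counter and the sample JSON content are hypothetical, and a temporary directory stands in for the shared needles directory.

```python
import json
import os
import tempfile


class NeedleMetadataCache:
    """Hypothetical helper: read each needle JSON once, then serve it from memory."""

    def __init__(self, needles_dir):
        self.needles_dir = needles_dir
        self._cache = {}
        self.disk_reads = 0  # number of reads that actually hit the filesystem

    def get(self, name):
        if name not in self._cache:
            path = os.path.join(self.needles_dir, name + ".json")
            with open(path, encoding="utf-8") as handle:
                self._cache[name] = json.load(handle)
            self.disk_reads += 1
        return self._cache[name]


# Demo: a temporary directory stands in for the (slow) shared needles directory.
with tempfile.TemporaryDirectory() as needles_dir:
    sample_path = os.path.join(needles_dir, "rebootnow-20160504.json")
    with open(sample_path, "w", encoding="utf-8") as handle:
        json.dump({"tags": ["rebootnow"]}, handle)  # made-up sample content

    cache = NeedleMetadataCache(needles_dir)
    for _ in range(3):  # three "matches" of the same needle
        metadata = cache.get("rebootnow-20160504")

    print(metadata["tags"], "disk reads:", cache.disk_reads)
```

The trade-off of a cache like this is that changes to a needle file during a run go unnoticed until the cache is rebuilt.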

Actions #4

Updated by okurz about 7 years ago

[16 Feb 2017 10:18:38] <okurz> any news on https://progress.opensuse.org/issues/16438 "System stall during test run not detected" about NFS performance?
[16 Feb 2017 10:19:04] * coolo is mainly trying to workaround https://openqa.suse.de/tests/778533#step/scc_registration/42
[16 Feb 2017 10:19:48] <coolo> okurz: the news is: rudi loadbalanced the netapps and I rebooted the VM - the combination made the problem less severe
[16 Feb 2017 10:20:04] <coolo> but we need to a) keep monitoring it and b) get rid of NFS and use proper caching
[16 Feb 2017 10:20:30] <coolo> because we will have more and more worker CPUs - all pulling on the poor virtual NFSd
Actions #5

Updated by okurz about 7 years ago

  • Status changed from New to Feedback
  • Assignee set to okurz
  • Priority changed from Immediate to High
  • Target version changed from Milestone 5 to Milestone 6

as per #1648#note-4

Actions #6

Updated by okurz about 7 years ago

  • Status changed from Feedback to In Progress
  • Assignee changed from okurz to szarate
  • Priority changed from High to Normal

https://openqa.suse.de/tests/781515 is an example of the test working fine again. I guess this is closely related to stall detection and to needle and asset caching, so I am assigning it to szarate.

Actions #7

Updated by RBrownSUSE about 7 years ago

  • Subject changed from System stall during test run not detected to [tools]System stall during test run not detected
Actions #8

Updated by mgriessmeier about 7 years ago

https://openqa.suse.de/tests/812947

The stall was 'only' 9 seconds, but it looks like the same issue to me:

23:18:20.0989 11234 MATCH(rebootnow-20160504:1.00)
23:18:29.1150 11220 >>> testapi::_handle_found_needle: found rebootnow-20160504, similarity 1.00 @ 271/257
23:18:29.1152 Debug: /var/lib/openqa/share/tests/sle/tests/installation/install_and_reboot.pm:110 called testapi::send_key
23:18:29.1152 11220 <<< testapi::send_key(key='alt-s')
23:18:29.3188 Debug: /var/lib/openqa/share/tests/sle/tests/installation/install_and_reboot.pm:110 called testapi::wait_still_screen
23:18:29.3190 11220 <<< testapi::wait_still_screen(stilltime=3, timeout=4, simlvl=47)
23:18:33.4918 11220 >>> testapi::wait_still_screen: wait_still_screen timed out after 4
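
The delay between the backend reporting a needle match and the test code handling it can be measured by pairing each `MATCH(...)` line with the next `found ...` line for the same needle. This is a rough sketch under assumptions: the regular expressions are derived only from the log lines quoted in this ticket, the function names are made up, and midnight rollover is ignored.

```python
import re

# Patterns derived from the log lines quoted in this ticket (an assumption,
# not a specification of the os-autoinst log format).
MATCH_LINE = re.compile(
    r"^(?P<h>\d{2}):(?P<m>\d{2}):(?P<s>\d{2}\.\d+) \d+ MATCH\((?P<needle>[^:]+):[\d.]+\)"
)
FOUND_LINE = re.compile(
    r"^(?P<h>\d{2}):(?P<m>\d{2}):(?P<s>\d{2}\.\d+) \d+ >>> testapi::\w+: found (?P<needle>[^,]+),"
)


def seconds_of_day(match):
    """Convert the h/m/s groups of a regex match to seconds since midnight."""
    return int(match["h"]) * 3600 + int(match["m"]) * 60 + float(match["s"])


def match_to_found_delays(lines):
    """Return (needle, delay_seconds) for every 'found' line with an earlier MATCH."""
    last_match = {}  # needle name -> time of its most recent MATCH line
    delays = []
    for line in lines:
        matched = MATCH_LINE.match(line)
        if matched:
            last_match[matched["needle"]] = seconds_of_day(matched)
            continue
        found = FOUND_LINE.match(line)
        if found and found["needle"] in last_match:
            delay = seconds_of_day(found) - last_match[found["needle"]]
            delays.append((found["needle"], delay))
    return delays


# First three lines of the log excerpt from this comment.
LOG = """
23:18:20.0989 11234 MATCH(rebootnow-20160504:1.00)
23:18:29.1150 11220 >>> testapi::_handle_found_needle: found rebootnow-20160504, similarity 1.00 @ 271/257
23:18:29.1152 Debug: /var/lib/openqa/share/tests/sle/tests/installation/install_and_reboot.pm:110 called testapi::send_key
"""

delays = match_to_found_delays(LOG.splitlines())
for needle, delay in delays:
    print(f"{needle}: handled {delay:.1f}s after the match")
```

Note that the MATCH line and the `found` line carry different process IDs (11234 and 11220), which is consistent with the delay being on the test side and not in the backend.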
Actions #9

Updated by RBrownSUSE about 7 years ago

  • Related to action #17594: [tools]Missing characters in the middle of type_string/assert_script_run (Ninja Keys) added
Actions #11

Updated by SLindoMansilla about 7 years ago

sle-12-SP3-Server-DVD-ppc64le-Build0313-cryptlvm_minimal_x@ppc64le
osd#856166#step/grub_test/6

sle-12-SP3-Server-DVD-ppc64le-Build0313-gnome@ppc64le
osd#856402#step/grub_test/6

Actions #12

Updated by nicksinger about 7 years ago

@SLindoMansilla: Your references are wrong. This ticket is about "stall not detected". But the jobs you referenced clearly show "stall detected". That would be #13276

Actions #13

Updated by okurz almost 7 years ago

M6 is long gone and so is M7. Care to update? I assume currently it's not planned to continue with this as we did not see it more often. Maybe it does not even apply anymore?

Actions #14

Updated by RBrownSUSE almost 7 years ago

  • Target version changed from Milestone 6 to Milestone 8
Actions #15

Updated by szarate almost 7 years ago

  • Status changed from In Progress to Feedback
  • Assignee deleted (szarate)
  • Target version deleted (Milestone 8)

I have not seen this for a while and have not worked on it either. Removing it from my queue, and removing the milestone too.

Actions #16

Updated by coolo over 6 years ago

  • Status changed from Feedback to Resolved

Not using NFS anymore clearly helped, even though okurz didn't believe it.
