Project

General

Profile

Actions

action #17442

closed

[tools]Bad performance or too many failures because of typing too soon/mistyping (send_keys/type_string)

Added by dzedro about 7 years ago. Updated about 7 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Regressions/Crashes
Target version:
-
Start date:
2017-03-02
Due date:
% Done:

0%

Estimated time:

Description

observation

tests like zypper_lifecycle, sshd, ssh_sleanup, sshxterm and other started to fail much more than some time ago.
It is not new issue, but this failures were not so frequent as now.
All fails are caused by typing to soon or letters get lost because of very low performance.
#16738 is somehow related.

reproducible

seems to happen in some jobs

problem

  • H1 REJECTED worker overbooking
  • H2 qemu loosing keys -> unlikely as it was happening more often (or at all) recently but not in before, see H5; we can not be so sure. Probably it was seen in before (we have tickets for this) but as it persists it stays in people's mind.
    • H2.1 REJECTED loosing keys in keyboard buffer -> rejected: we would expect "key event buffer full" messages and an early fail in the jobs
    • H2.2 loosing keys in some part of qemu before it reaches the keyboard buffer (VNC part?)
  • H3 performance regression in the product
    • H3.1 … recently detectable on very fast machines
    • H3.2 REJECTED detectable on any worker -> #17442#note-32
  • H4 REJECTED network -> only affecting x11 tests and not console tests so why should we loose just VNC typing events there? (okurz)
  • H5 REJECTED updates on the workers
  • H5.1 REJECTED package updates on the workers (not openQA packages) -> no updates on worker since 3 weeks (rbrown)
  • H5.2 REJECTED os-autoinst update on the workers -> #17442#note-32
  • H5.3 REJECTED updates of tests -> #17442#note-35
  • H6 it was there all the time but nobody cared because it was not seen often enough and now with more and more jobs and more thorough review again it stays in everyones scope

suggestion

  • E1-1 DONE: check workers of last examples -> O1-1 linked jobs are spread over openqaworker2, …3, …5, …6 -> REJECT H1
  • E3-1 DONE: crosscheck with differing parameters (older build / test on other machines) -> #17442#note-32 -> SPLIT H3
  • E5.3-1 DONE: crosscheck with older/newer version of tests -> #17442#note-35 -> REJECT H5.3+H5

further details

Few examples:
https://openqa.suse.de/tests/793680#step/yast2_snapper/5
https://openqa.suse.de/tests/793974#step/zypper_lifecycle/55
https://openqa.suse.de/tests/793716#step/ssh_cleanup/3 --> #14826
https://openqa.suse.de/tests/793683#step/sshd/16 --> #14826
http://openqa.suse.de/t796211
https://openqa.suse.de/tests/796232#step/gnome_terminal/7
https://openqa.suse.de/tests/795241 (Build 0260)

workaround

retrigger job


Related issues 6 (0 open6 closed)

Related to openQA Tests - action #16738: test fails in zypper_lifecycle and the next tests are keep failingResolvednadvornik2017-02-13

Actions
Related to openQA Tests - action #17502: [sle][functional]test fails in grub_test on timeout waiting for inst-bootmenu Resolvedokurz2017-03-032017-11-08

Actions
Related to openQA Tests - action #17594: [tools]Missing characters in the middle of type_string/assert_script_run (Ninja Keys)Resolvedszarate2017-03-07

Actions
Related to openQA Tests - action #17458: test fails in libreoffice_mainmenu_componentsResolvedqkzhu2017-03-02

Actions
Related to openQA Tests - action #17456: [qam][x11regressions]test fails sporadically in gedit_saveClosedmitiao2017-03-02

Actions
Is duplicate of openQA Tests - action #17446: [qam][x11regressions]test fails in evince_find sporadically (network issue or mistyping on long strings)Resolvedqkzhu2017-03-02

Actions
Actions #1

Updated by dzedro about 7 years ago

  • Related to action #16738: test fails in zypper_lifecycle and the next tests are keep failing added
Actions #2

Updated by okurz about 7 years ago

  • Subject changed from Bad performance or too many failrues because of typing too soon/mistyping to Bad performance or too many failrues because of typing too soon/mistyping (send_keys/type_string)
  • Category set to Regressions/Crashes
  • Priority changed from High to Urgent

also staging test fail now: xterm (http://openqa.suse.de/t796211)

Actions #3

Updated by okurz about 7 years ago

  • Subject changed from Bad performance or too many failrues because of typing too soon/mistyping (send_keys/type_string) to Bad performance or too many failures because of typing too soon/mistyping (send_keys/type_string)
Actions #4

Updated by okurz about 7 years ago

load of the system seems not too high: https://openqa.suse.de/tests/796211/file/xterm-loadavg.txt 0.50 0.79 0.42 1/238 8266

Actions #5

Updated by okurz about 7 years ago

  • Description updated (diff)
  • Status changed from New to In Progress
  • Assignee set to okurz

same: https://openqa.suse.de/tests/796232#step/gnome_terminal/7

checked workers and the linked jobs are spread over openqaworker2, …3, …5, …6 -> REJECT H1

Actions #6

Updated by nicksinger about 7 years ago

Build0260 also affected: https://openqa.suse.de/tests/795241

Actions #7

Updated by okurz about 7 years ago

  • Description updated (diff)

rbrown: as discussed with okurz right now, 793716 looks like it might be a console_select bug, but all the other examples I'm finding so far all look like they're only happening in x11 tests, which make me think we might have a product performance regression that impacts only x11 performance - if it was os-autoinst, qemu, or some other part of the infra I do not see how it would manifest itself only on x11 tests
rbrown: 793683 another sign of a console select bug
rbrown: suspecting there is a correlation between select_console and x11_start_program - the more I look at this the more I am convinced this is not a worker/hardware problem - the system status + asmorodskyi's observation that it happens recurringly really convinces me of that

Actions #8

Updated by okurz about 7 years ago

  • Description updated (diff)
Actions #9

Updated by okurz about 7 years ago

  • Description updated (diff)
Actions #10

Updated by nicksinger about 7 years ago

Actions #12

Updated by asmorodskyi about 7 years ago

  • Related to action #17502: [sle][functional]test fails in grub_test on timeout waiting for inst-bootmenu added
Actions #13

Updated by szarate about 7 years ago

  • Assignee changed from okurz to szarate
Actions #15

Updated by RBrownSUSE about 7 years ago

Discussed with Oliver, below is the current leading hypothesis

https://openqa.suse.de/tests/796211#step/xterm/4 and https://openqa.suse.de/tests/796211#step/xterm/4 both show that the issue is not X11 bound

The issue is that os-autoinst is typing before the bash shell is ready to accept input https://openqa.suse.de/tests/796211#step/xterm/4 shows that many of the Enter presses were committed before bash was accepting input, even though the prompt was on the screen (confirmed by the assert_screen)

With os-autoinsts fractional time, needling on assert screen and such is a lot faster, so os-autoinst is typing before the prompt is ready

We can rule out hardware overload, this problem does not happen when systems are overloaded - in fact overloading a system seems to AVOID this problem.

Hypothethis - There may be a performance issue around bash that increases the time between the bash prompt being visible on the screen and it being ready to accept input.
This performance issue might be a regression compared to SP2, or could be just detected now as a result of the performance improvements to os-autoinst
This performance issue might be a problem with the generation of the bash prompt - is there an issue finding the hostname? is there an issue finding the username?

Jozef's pull request in January shows that the problem was at least present in mid January.

As discussed with Oliver, we're going to need some manual testing. Simple tests would be an installation of SLES 12 SP3 and holding down the enterkey on a bash prompt in xterm or tty. A well working system should produce an output with an almost endless stream of prompts

rbrown@ibrokeit:~>
rbrown@ibrokeit:~>
rbrown@ibrokeit:~>
rbrown@ibrokeit:~>
rbrown@ibrokeit:~>

If we're able to reproduce this issue, we'd expect to see something like

rbrown@ibrokeit:~>

rbrown@ibrokeit:~>

Which would imply that bash is taking too long to generate it's prompt, just like we see in openQA

If we can reproduce this outside of openQA, then it's a clear bug that needs to be filed

If we can reproduce it outside of openQA but not on SP2, then we know it is an SP3 specific bug (but like I say above, this might not be a regression)

if we can't reproduce it out of openQA, there is still the chance of it being a bug. Areas of further investigation would include name resolution, username detection, and how they are done by bash when creating the prompt.

The different network configuration for the SUT in openQA could have a part to play - because it's installed by YaST our configuration should be valid, but are we sure it is? Is SELF name resolution working properly? the hostname.pm test only checks hostnamectl, is that the method used by bash?

If we manage to rule out all of the above as a problem with the product, then I would be prepared to accept that in this case openQA could be testing 'too fast', in which case I think we should investigate adding appropriate delays in either assert_screen or the affected tests so they give more time between checking the prompt is visible and then using it.

Actions #17

Updated by okurz about 7 years ago

  • Description updated (diff)

split H5 into os-autoinst update and others

Actions #18

Updated by szarate about 7 years ago

  • Assignee deleted (szarate)

I'm unnasigning myself from this issue for now.

Actions #19

Updated by okurz about 7 years ago

  • Description updated (diff)
  • Assignee set to RBrownSUSE
  • Priority changed from Urgent to High

xterm/gterm/terminal tests failing: I don't think it is because of os-autoinst change -> https://openqa.suse.de/tests/800650#step/yast2_snapper/8

That might also be a different issue but after talking to coolo and szarate we could not identify anything in

https://openqa.suse.de/tests/802036#step/yast2_lan/1 looks like "ninja keys" but in console tests now -> staging

Currently we are following on H2 and want to come up with a qemu with debug information about key presses and such.

rbrown wants to further evaluate statistics in preparation for a good "product issue" report. I myself triggered 100 jobs on lord.arch/tests?hoursfresh=24&match=wait_for_x11_program_responsive to check if the problem can also be reproduced on a "slow" server (i.e. rotating disk, no NVME)

Actions #20

Updated by dzedro about 7 years ago

This typing too soon happened on my local openQA (desktop with 2 workers)

Update: without nfs, just local hdd

Actions #22

Updated by osukup about 7 years ago

same behaviour in many Maintenace Test's , Test Repo + Incidents ( +-5 cases per run)

Actions #23

Updated by RBrownSUSE about 7 years ago

  • Status changed from In Progress to New
  • Assignee deleted (RBrownSUSE)
  • Priority changed from High to Urgent

Please provide at least some examples in this thread so I can follow them up

Actions #24

Updated by mgriessmeier about 7 years ago

seems also related:
https://openqa.suse.de/tests/804730#step/yast2_snapper/11

typing echo0 - probably missing a space here

also related:
https://openqa.suse.de/tests/804703#step/wireshark/4

things getting typed twice compared to old jobs

Actions #25

Updated by dzedro about 7 years ago

Here is another failed test with system performance statistics
Job was started at 2017-03-08 07:24:58

http://10.100.12.155/tests/5565#step/ssh_cleanup/3

pdf
ftp://10.100.12.155/sar.pdf
text
ftp://10.100.12.155/sar.txt

Actions #26

Updated by szarate about 7 years ago

  • Related to action #17594: [tools]Missing characters in the middle of type_string/assert_script_run (Ninja Keys) added
Actions #29

Updated by qkzhu about 7 years ago

  • Related to action #17458: test fails in libreoffice_mainmenu_components added
Actions #30

Updated by okurz about 7 years ago

https://openqa.suse.de/tests/806822#step/zypper_lifecycle/29 is another fun example of missing keys from beginning of string in console but not immediately after console switching

Actions #31

Updated by okurz about 7 years ago

  • Status changed from New to In Progress
  • Assignee set to okurz

the ticket is in progress because we are already investing a lot of time in investigation. If rbrown does not want to take it I will because at least I can provide a little bit of answers if someone wants to understand what's going on

Actions #32

Updated by okurz about 7 years ago

  • Description updated (diff)

conducted experiment on lord.arch by triggering 100 jobs in a row as a clone of a recently failing staging job.

-> H3 split, H3.1 rejected as well as H5.2

Also adding H6, H5.3

next experiment E5.3-1 with tests update on lord.arch

Actions #33

Updated by okurz about 7 years ago

https://openqa.suse.de/tests/807661#step/xterm/6 single 'ret' had an effect, nothing more is shown in xterm.
https://openqa.suse.de/tests/807661#step/sshxterm/5 same job, next module, missing characters from beginning.

https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/2525 is my proposal for a workaround but I will test something more thoroughly later on, e.g. some kind of send_key_until_needle_match 'next-line-in-xterm', 'ret' or a while-loop, maybe.

Actions #34

Updated by okurz about 7 years ago

Actions #35

Updated by okurz about 7 years ago

https://openqa.suse.de/tests/807449#step/check_console_font/3 in check_console_font, beginning "echo" missing

Actions #36

Updated by okurz about 7 years ago

  • Description updated (diff)

E5.3-1 DONE: http://lord.arch/tests/5903 - http://lord.arch/tests/5922 all passed, HDD on lord.arch was full afterwards so I aborted.

This leaves H2.2, H3.1, H6

rbrown is conducting an experiment to load the production workers after "rudi did some changes".

Actions #38

Updated by okurz about 7 years ago

  • Assignee changed from okurz to RBrownSUSE

@RBrownSUSE, any recent findings? staging tests are still failing a lot on osd as it seems, I can't reproduce it myself, dzedro can on his machine. What is your take on this?

workaround https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/2525 is still ready but we don't see the problem on o3 so not approved. What to do now?

Actions #39

Updated by RBrownSUSE about 7 years ago

As we chatted about in my office today - I think the console failures of this type like the one we saw today could/should be mitigated by a better needle that actually checks it has the correct console so it doesn't type into the user console before it's actually flipped to the root console

Have you or anyone else submitted a needle to fix that?

The X11 failures, like we saw in gedit, need more investigation and http://openqa.suse.de/tests/810619#step/xterm/6 alone isn't enough for me to get a feel for the problem:
More examples would help, so I can profile the system performance at the time and such.
I'm waiting and seeing before deciding what to do about https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/2525/

If things are really getting in the way of staging, then I can accept gh#2525 as a 'make release managers happy' measure, but as we're just on the dark side of Alpha 2 I would prefer to have more time to investigate if you can help make that happen

What do you think?

Actions #40

Updated by okurz about 7 years ago

RBrownSUSE wrote:

As we chatted about in my office today - I think the console failures of this type like the one we saw today could/should be mitigated by a better needle that actually checks it has the correct console so it doesn't type into the user console before it's actually flipped to the root console

Have you or anyone else submitted a needle to fix that?

I am on it in #14826

The X11 failures, like we saw in gedit, need more investigation and http://openqa.suse.de/tests/810619#step/xterm/6 alone isn't enough for me to get a feel for the problem:
More examples would help, so I can profile the system performance at the time and such.
I'm waiting and seeing before deciding what to do about https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/2525/

If things are really getting in the way of staging, then I can accept gh#2525 as a 'make release managers happy' measure, but as we're just on the dark side of Alpha 2 I would prefer to have more time to investigate if you can help make that happen

What do you think?

I think not enough people trust the system so after two weeks with no solution in sight we should expedite to find a workaround and make some people hunt for product bugs while we can continue to investigate further on this. If we simply put in a record_soft_failure or maybe even a single but unambiguous "diag" message we can find occurences of these problems and gather more data. I updated my PR to record a soft failure and try again. This might help to gather even more data.

Actions #41

Updated by RBrownSUSE about 7 years ago

  • Subject changed from Bad performance or too many failures because of typing too soon/mistyping (send_keys/type_string) to [tools]Bad performance or too many failures because of typing too soon/mistyping (send_keys/type_string)
Actions #42

Updated by okurz about 7 years ago

so we merged the test pull request which might help us to detect this (or break everything, as always, but I will monitor)

But now for something completely different. In
https://openqa.opensuse.org/tests/369987#step/inkscape/8
which is on o3, we see a mistyping of "xtrem" instead of "xterm". The test code is fine and so is the call to the testapi

testapi::type_string(string='xterm', max_interval=250, wait_screen_changes=0)

same issue?

Actions #43

Updated by asmorodskyi about 7 years ago

https://openqa.suse.de/tests/818019 - another example of missing keys job

Actions #44

Updated by okurz about 7 years ago

recent example of failing job: https://openqa.suse.de/tests/820752#step/xterm/9

Actions #46

Updated by coolo about 7 years ago

Careful not to mix different issues. the zypper_lifecycle cases are 'not waiting for select-console to finish'.

Same with sshd

Actions #47

Updated by coolo about 7 years ago

same with mtab

Actions #48

Updated by okurz about 7 years ago

o3 examples of mistyped:
https://openqa.opensuse.org/tests/375639#step/vlc/22 -> "v-lc -no-autoscale" instead of "vlc --no-autoscale" so the one "-" ended up somewhere differently.
Similar is https://openqa.opensuse.org/tests/376071#step/ooffice/7 -> "oowritre" instead of "oowriter"

Actions #49

Updated by coolo about 7 years ago

krunner is using autocompletion - and it breaks often if you type manually in there too. So we need to type very slow there - but it's more a product workaround not an openqa problem.

Actions #50

Updated by coolo about 7 years ago

So I experimented with waiting for screen updates after every key on malbec (which has qemu 2.8 now), but the emulated keyboard still bounces keys:
https://openqa.suse.de/tests/837199#step/snapper_cleanup/30 - so typing SLOOOW on power and aarch is the only option atm :(

Actions #51

Updated by asmorodskyi about 7 years ago

Actions #52

Updated by coolo about 7 years ago

that test ran on the weekend when I experimented with malbec. The result of that was https://github.com/os-autoinst/os-autoinst/pull/743/files - and we haven't yet updated the machines

Actions #53

Updated by RBrownSUSE about 7 years ago

  • Priority changed from Urgent to High

Lowering Prio as issues are no longer 'on fire'

Will lower further after PR 743 is deployed and this becomes a 'monitoring' task

Still open as clearly happens still on Power and aarch but shouldn't after 743 is merged.

Actions #54

Updated by coolo about 7 years ago

743 is deployed, but not used as not configured

Actions #56

Updated by szarate about 7 years ago

There's another one too: https://openqa.suse.de/tests/843506

Actions #57

Updated by RBrownSUSE about 7 years ago

  • Status changed from In Progress to Resolved

Believed to be resolved with the new deployed settings, closing ticket.

Actions #58

Updated by okurz about 7 years ago

see #17594#note-16 for more information on potential resolution helps

Actions #59

Updated by okurz about 7 years ago

  • Related to action #17456: [qam][x11regressions]test fails sporadically in gedit_save added
Actions #60

Updated by qkzhu about 7 years ago

  • Is duplicate of action #17446: [qam][x11regressions]test fails in evince_find sporadically (network issue or mistyping on long strings) added
Actions #61

Updated by okurz about 7 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: migration_zdup_offline_sle12sp2_allpatterns_fullupdate_ppc
https://openqa.suse.de/tests/878304

Actions #62

Updated by okurz about 7 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: om_proxyscc_sles12sp1_smt_full_update_by_yast
https://openqa.suse.de/tests/878430

Actions

Also available in: Atom PDF