action #17442: [tools]Bad performance or too many failures because of typing too soon/mistyping (send_keys/type_string) - openQA Project - openSUSE Project Management Tool

Actions

Copy link

action #17442

closed

[tools]Bad performance or too many failures because of typing too soon/mistyping (send_keys/type_string)

Added by dzedro over 7 years ago. Updated about 7 years ago.

Status:

Resolved

Priority:

High

Assignee:

RBrownSUSE

Category:

Regressions/Crashes

Target version:

Start date:

2017-03-02

Due date:

% Done:

Estimated time:

Description

observation¶

tests like zypper_lifecycle, sshd, ssh_sleanup, sshxterm and other started to fail much more than some time ago.
It is not new issue, but this failures were not so frequent as now.
All fails are caused by typing to soon or letters get lost because of very low performance.
#16738 is somehow related.

reproducible¶

seems to happen in some jobs

problem¶

H1 REJECTED worker overbooking
H2 qemu loosing keys -> unlikely as it was happening more often (or at all) recently but not in before, see H5; we can not be so sure. Probably it was seen in before (we have tickets for this) but as it persists it stays in people's mind.
- H2.1 REJECTED loosing keys in keyboard buffer -> rejected: we would expect "key event buffer full" messages and an early fail in the jobs
- H2.2 loosing keys in some part of qemu before it reaches the keyboard buffer (VNC part?)
H3 performance regression in the product
- H3.1 … recently detectable on very fast machines
- H3.2 REJECTED detectable on any worker -> #17442#note-32
H4 REJECTED network -> only affecting x11 tests and not console tests so why should we loose just VNC typing events there? (okurz)
H5 REJECTED updates on the workers
H5.1 REJECTED package updates on the workers (not openQA packages) -> no updates on worker since 3 weeks (rbrown)
H5.2 REJECTED os-autoinst update on the workers -> #17442#note-32
H5.3 REJECTED updates of tests -> #17442#note-35
H6 it was there all the time but nobody cared because it was not seen often enough and now with more and more jobs and more thorough review again it stays in everyones scope

suggestion¶

E1-1 DONE: check workers of last examples -> O1-1 linked jobs are spread over openqaworker2, …3, …5, …6 -> REJECT H1
E3-1 DONE: crosscheck with differing parameters (older build / test on other machines) -> #17442#note-32 -> SPLIT H3
E5.3-1 DONE: crosscheck with older/newer version of tests -> #17442#note-35 -> REJECT H5.3+H5

further details¶

Few examples:
https://openqa.suse.de/tests/793680#step/yast2_snapper/5
https://openqa.suse.de/tests/793974#step/zypper_lifecycle/55
https://openqa.suse.de/tests/793716#step/ssh_cleanup/3 --> #14826
https://openqa.suse.de/tests/793683#step/sshd/16 --> #14826
http://openqa.suse.de/t796211
https://openqa.suse.de/tests/796232#step/gnome_terminal/7
https://openqa.suse.de/tests/795241 (Build 0260)

workaround¶

retrigger job

Related issues 6 (0 open — 6 closed)

Related to openQA Tests - action #16738: test fails in zypper_lifecycle and the next tests are keep failing

Resolved

nadvornik

2017-02-13

Actions

Related to openQA Tests - action #17502: [sle][functional]test fails in grub_test on timeout waiting for inst-bootmenu

Resolved

okurz

2017-03-03

2017-11-08

Actions

Related to openQA Tests - action #17594: [tools]Missing characters in the middle of type_string/assert_script_run (Ninja Keys)

Resolved

szarate

2017-03-07

Actions

Related to openQA Tests - action #17458: test fails in libreoffice_mainmenu_components

Resolved

qkzhu

2017-03-02

Actions

Related to openQA Tests - action #17456: [qam][x11regressions]test fails sporadically in gedit_save

Closed

mitiao

2017-03-02

Actions

Is duplicate of openQA Tests - action #17446: [qam][x11regressions]test fails in evince_find sporadically (network issue or mistyping on long strings)

Resolved

qkzhu

2017-03-02

Actions

Copy link

Updated by dzedro over 7 years ago

Related to action #16738: test fails in zypper_lifecycle and the next tests are keep failing added

Actions

Copy link

Updated by okurz over 7 years ago

Subject changed from Bad performance or too many failrues because of typing too soon/mistyping to Bad performance or too many failrues because of typing too soon/mistyping (send_keys/type_string)
Category set to Regressions/Crashes
Priority changed from High to Urgent

also staging test fail now: xterm (http://openqa.suse.de/t796211)

Actions

Copy link

Updated by okurz over 7 years ago

Subject changed from Bad performance or too many failrues because of typing too soon/mistyping (send_keys/type_string) to Bad performance or too many failures because of typing too soon/mistyping (send_keys/type_string)

Actions

Copy link

Updated by okurz over 7 years ago

load of the system seems not too high: https://openqa.suse.de/tests/796211/file/xterm-loadavg.txt 0.50 0.79 0.42 1/238 8266

Actions

Copy link

Updated by okurz over 7 years ago

Description updated (diff)
Status changed from New to In Progress
Assignee set to okurz

same: https://openqa.suse.de/tests/796232#step/gnome_terminal/7

checked workers and the linked jobs are spread over openqaworker2, …3, …5, …6 -> REJECT H1

Actions

Copy link

Updated by nicksinger over 7 years ago

Build0260 also affected: https://openqa.suse.de/tests/795241

Actions

Copy link

Updated by okurz over 7 years ago

Description updated (diff)

rbrown: as discussed with okurz right now, 793716 looks like it might be a console_select bug, but all the other examples I'm finding so far all look like they're only happening in x11 tests, which make me think we might have a product performance regression that impacts only x11 performance - if it was os-autoinst, qemu, or some other part of the infra I do not see how it would manifest itself only on x11 tests
rbrown: 793683 another sign of a console select bug
rbrown: suspecting there is a correlation between select_console and x11_start_program - the more I look at this the more I am convinced this is not a worker/hardware problem - the system status + asmorodskyi's observation that it happens recurringly really convinces me of that

Actions

Copy link

Updated by okurz over 7 years ago

Description updated (diff)

Actions

Copy link

Updated by okurz over 7 years ago

Description updated (diff)

Actions

Copy link

#10

Updated by nicksinger over 7 years ago

Another one in 0260: https://openqa.suse.de/tests/796258

Actions

Copy link

#11

Updated by pgeorgiadis over 7 years ago

another one: https://openqa.suse.de/tests/796012#step/change_password/10

Actions

Copy link

#12

Updated by asmorodskyi over 7 years ago

Related to action #17502: [sle][functional]test fails in grub_test on timeout waiting for inst-bootmenu added

Actions

Copy link

#13

Updated by szarate over 7 years ago

Assignee changed from okurz to szarate

Actions

Copy link

#14

Updated by asmorodskyi over 7 years ago

https://openqa.suse.de/tests/797387

Actions

Copy link

#15

Updated by RBrownSUSE over 7 years ago

Discussed with Oliver, below is the current leading hypothesis

https://openqa.suse.de/tests/796211#step/xterm/4 and https://openqa.suse.de/tests/796211#step/xterm/4 both show that the issue is not X11 bound

The issue is that os-autoinst is typing before the bash shell is ready to accept input https://openqa.suse.de/tests/796211#step/xterm/4 shows that many of the Enter presses were committed before bash was accepting input, even though the prompt was on the screen (confirmed by the assert_screen)

With os-autoinsts fractional time, needling on assert screen and such is a lot faster, so os-autoinst is typing before the prompt is ready

We can rule out hardware overload, this problem does not happen when systems are overloaded - in fact overloading a system seems to AVOID this problem.

Hypothethis - There may be a performance issue around bash that increases the time between the bash prompt being visible on the screen and it being ready to accept input.
This performance issue might be a regression compared to SP2, or could be just detected now as a result of the performance improvements to os-autoinst
This performance issue might be a problem with the generation of the bash prompt - is there an issue finding the hostname? is there an issue finding the username?

Jozef's pull request in January shows that the problem was at least present in mid January.

As discussed with Oliver, we're going to need some manual testing. Simple tests would be an installation of SLES 12 SP3 and holding down the enterkey on a bash prompt in xterm or tty. A well working system should produce an output with an almost endless stream of prompts

rbrown@ibrokeit:~>
rbrown@ibrokeit:~>
rbrown@ibrokeit:~>
rbrown@ibrokeit:~>
rbrown@ibrokeit:~>

If we're able to reproduce this issue, we'd expect to see something like

rbrown@ibrokeit:~>

Which would imply that bash is taking too long to generate it's prompt, just like we see in openQA

If we can reproduce this outside of openQA, then it's a clear bug that needs to be filed

If we can reproduce it outside of openQA but not on SP2, then we know it is an SP3 specific bug (but like I say above, this might not be a regression)

if we can't reproduce it out of openQA, there is still the chance of it being a bug. Areas of further investigation would include name resolution, username detection, and how they are done by bash when creating the prompt.

The different network configuration for the SUT in openQA could have a part to play - because it's installed by YaST our configuration should be valid, but are we sure it is? Is SELF name resolution working properly? the hostname.pm test only checks hostnamectl, is that the method used by bash?

If we manage to rule out all of the above as a problem with the product, then I would be prepared to accept that in this case openQA could be testing 'too fast', in which case I think we should investigate adding appropriate delays in either assert_screen or the affected tests so they give more time between checking the prompt is visible and then using it.

Actions

Copy link

#16

Updated by nicksinger over 7 years ago

https://openqa.suse.de/tests/789944#

Actions

Copy link

#17

Updated by okurz over 7 years ago

Description updated (diff)

split H5 into os-autoinst update and others

Actions

Copy link

#18

Updated by szarate over 7 years ago

Assignee deleted (~~szarate~~)

I'm unnasigning myself from this issue for now.

Actions

Copy link

#19

Updated by okurz over 7 years ago

Description updated (diff)
Assignee set to RBrownSUSE
Priority changed from Urgent to High

xterm/gterm/terminal tests failing: I don't think it is because of os-autoinst change -> https://openqa.suse.de/tests/800650#step/yast2_snapper/8

That might also be a different issue but after talking to coolo and szarate we could not identify anything in

https://openqa.suse.de/tests/802036#step/yast2_lan/1 looks like "ninja keys" but in console tests now -> staging

Currently we are following on H2 and want to come up with a qemu with debug information about key presses and such.

rbrown wants to further evaluate statistics in preparation for a good "product issue" report. I myself triggered 100 jobs on lord.arch/tests?hoursfresh=24&match=wait_for_x11_program_responsive to check if the problem can also be reproduced on a "slow" server (i.e. rotating disk, no NVME)

Actions

Copy link

#20

Updated by dzedro over 7 years ago

This typing too soon happened on my local openQA (desktop with 2 workers)

Update: without nfs, just local hdd

Actions

Copy link

#21

Updated by mkravec over 7 years ago

Typing too soon / too late happened a lot in build 0267:

All gedit failures:
https://openqa.suse.de/tests/804361#step/gedit/5
https://openqa.suse.de/tests/804365#step/gedit/5
https://openqa.suse.de/tests/804342#step/gedit/5
https://openqa.suse.de/tests/804413#step/gedit/5
https://openqa.suse.de/tests/804357#step/gedit/5

Gnome terminal:
https://openqa.suse.de/tests/804357#step/gnome_terminal/7

Firefox:
https://openqa.suse.de/tests/804345#step/firefox/4

GPK updates:
https://openqa.suse.de/tests/804350#step/updates_packagekit_gpk/5
https://openqa.suse.de/tests/804375#step/updates_packagekit_gpk/5

VNC:
https://openqa.suse.de/tests/802819#step/vnc_two_passwords/11

X11_start_program: (not typing program name in many tests)
https://openqa.suse.de/tests/804319#step/xterm/1

Actions

Copy link

#22

Updated by osukup over 7 years ago

same behaviour in many Maintenace Test's , Test Repo + Incidents ( +-5 cases per run)

Actions

Copy link

#23

Updated by RBrownSUSE over 7 years ago

Status changed from In Progress to New
Assignee deleted (~~RBrownSUSE~~)
Priority changed from High to Urgent

Please provide at least some examples in this thread so I can follow them up

Actions

Copy link

#24

Updated by mgriessmeier over 7 years ago

seems also related:
https://openqa.suse.de/tests/804730#step/yast2_snapper/11

typing echo0 - probably missing a space here

also related:
https://openqa.suse.de/tests/804703#step/wireshark/4

things getting typed twice compared to old jobs

Actions

Copy link

#25

Updated by dzedro over 7 years ago

Here is another failed test with system performance statistics
Job was started at 2017-03-08 07:24:58

http://10.100.12.155/tests/5565#step/ssh_cleanup/3

pdf
ftp://10.100.12.155/sar.pdf
text
ftp://10.100.12.155/sar.txt

Actions

Copy link

#26

Updated by szarate over 7 years ago

Related to action #17594: [tools]Missing characters in the middle of type_string/assert_script_run (Ninja Keys) added

Actions

Copy link

#27

Updated by osukup over 7 years ago

http://argus.suse.cz/tests/160#step/x11regressions_setup/4 hit locally :)

Actions

Copy link

#28

Updated by mgriessmeier over 7 years ago

https://openqa.suse.de/tests/806657

Actions

Copy link

#29

Updated by qkzhu over 7 years ago

Related to action #17458: test fails in libreoffice_mainmenu_components added

Actions

Copy link

#30

Updated by okurz over 7 years ago

https://openqa.suse.de/tests/806822#step/zypper_lifecycle/29 is another fun example of missing keys from beginning of string in console but not immediately after console switching

Actions

Copy link

#31

Updated by okurz over 7 years ago

Status changed from New to In Progress
Assignee set to okurz

the ticket is in progress because we are already investing a lot of time in investigation. If rbrown does not want to take it I will because at least I can provide a little bit of answers if someone wants to understand what's going on

Actions

Copy link

#32

Updated by okurz over 7 years ago

Description updated (diff)

conducted experiment on lord.arch by triggering 100 jobs in a row as a clone of a recently failing staging job.

os-autoinst: 945efee
os-autoinst-distri-opensuse: 04dcf59
os-autoinst-needles-sle: 7811e4e
- jobs http://lord.arch/tests?hoursfresh=24&match=wait_for_x11_program_responsive 5699-5793 (94 jobs) all passed or only failed in unrelated modules so no problem observed -> update os-autoinst 945efee..02cd04f
- http://lord.arch/tests?hoursfresh=24&match=poo17442 -> all 100 jobs (5794-5902) do not show any related failure (one incomplete because of 'stall detected', one softfailed module in reboot_gnome) -> os-autoinst is not to blame in general

-> H3 split, H3.1 rejected as well as H5.2

Also adding H6, H5.3

next experiment E5.3-1 with tests update on lord.arch

Actions

Copy link

#33

Updated by okurz over 7 years ago

https://openqa.suse.de/tests/807661#step/xterm/6 single 'ret' had an effect, nothing more is shown in xterm.
https://openqa.suse.de/tests/807661#step/sshxterm/5 same job, next module, missing characters from beginning.

https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/2525 is my proposal for a workaround but I will test something more thoroughly later on, e.g. some kind of send_key_until_needle_match 'next-line-in-xterm', 'ret' or a while-loop, maybe.

Actions

Copy link

#34

Updated by okurz over 7 years ago

https://openqa.suse.de/tests/807824#step/updates_packagekit_gpk/6 in xterm, missing keys in beginning

Actions

Copy link

#35

Updated by okurz over 7 years ago

https://openqa.suse.de/tests/807449#step/check_console_font/3 in check_console_font, beginning "echo" missing

Actions

Copy link

#36

Updated by okurz over 7 years ago

Description updated (diff)

E5.3-1 DONE: http://lord.arch/tests/5903 - http://lord.arch/tests/5922 all passed, HDD on lord.arch was full afterwards so I aborted.

This leaves H2.2, H3.1, H6

rbrown is conducting an experiment to load the production workers after "rudi did some changes".

Actions

Copy link

#37

Updated by dzedro over 7 years ago

Staging http://openqa.suse.de/tests/810619#step/xterm/6

Actions

Copy link

#38

Updated by okurz over 7 years ago

Assignee changed from okurz to RBrownSUSE

@RBrownSUSE, any recent findings? staging tests are still failing a lot on osd as it seems, I can't reproduce it myself, dzedro can on his machine. What is your take on this?

workaround https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/2525 is still ready but we don't see the problem on o3 so not approved. What to do now?

Actions

Copy link

#39

Updated by RBrownSUSE over 7 years ago

As we chatted about in my office today - I think the console failures of this type like the one we saw today could/should be mitigated by a better needle that actually checks it has the correct console so it doesn't type into the user console before it's actually flipped to the root console

Have you or anyone else submitted a needle to fix that?

The X11 failures, like we saw in gedit, need more investigation and http://openqa.suse.de/tests/810619#step/xterm/6 alone isn't enough for me to get a feel for the problem:
More examples would help, so I can profile the system performance at the time and such.
I'm waiting and seeing before deciding what to do about https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/2525/

If things are really getting in the way of staging, then I can accept gh#2525 as a 'make release managers happy' measure, but as we're just on the dark side of Alpha 2 I would prefer to have more time to investigate if you can help make that happen

What do you think?

Actions

Copy link

#40

Updated by okurz over 7 years ago

RBrownSUSE wrote:

As we chatted about in my office today - I think the console failures of this type like the one we saw today could/should be mitigated by a better needle that actually checks it has the correct console so it doesn't type into the user console before it's actually flipped to the root console

Have you or anyone else submitted a needle to fix that?

I am on it in #14826

The X11 failures, like we saw in gedit, need more investigation and http://openqa.suse.de/tests/810619#step/xterm/6 alone isn't enough for me to get a feel for the problem:
More examples would help, so I can profile the system performance at the time and such.
I'm waiting and seeing before deciding what to do about https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/2525/

If things are really getting in the way of staging, then I can accept gh#2525 as a 'make release managers happy' measure, but as we're just on the dark side of Alpha 2 I would prefer to have more time to investigate if you can help make that happen

What do you think?

I think not enough people trust the system so after two weeks with no solution in sight we should expedite to find a workaround and make some people hunt for product bugs while we can continue to investigate further on this. If we simply put in a record_soft_failure or maybe even a single but unambiguous "diag" message we can find occurences of these problems and gather more data. I updated my PR to record a soft failure and try again. This might help to gather even more data.

Actions

Copy link

#41

Updated by RBrownSUSE over 7 years ago

Subject changed from Bad performance or too many failures because of typing too soon/mistyping (send_keys/type_string) to [tools]Bad performance or too many failures because of typing too soon/mistyping (send_keys/type_string)

Actions

Copy link

#42

Updated by okurz over 7 years ago

so we merged the test pull request which might help us to detect this (or break everything, as always, but I will monitor)

But now for something completely different. In
https://openqa.opensuse.org/tests/369987#step/inkscape/8
which is on o3, we see a mistyping of "xtrem" instead of "xterm". The test code is fine and so is the call to the testapi

testapi::type_string(string='xterm', max_interval=250, wait_screen_changes=0)

same issue?

Actions

Copy link

#43

Updated by asmorodskyi over 7 years ago

https://openqa.suse.de/tests/818019 - another example of missing keys job

Actions

Copy link

#44

Updated by okurz over 7 years ago

recent example of failing job: https://openqa.suse.de/tests/820752#step/xterm/9

Actions

Copy link

#45

Updated by mkravec over 7 years ago

In build 0288 review:
https://openqa.suse.de/tests/826663#step/zypper_lifecycle/59
https://openqa.suse.de/tests/826673#step/zypper_lifecycle/55
https://openqa.suse.de/tests/826653#step/zypper_lifecycle/29
https://openqa.suse.de/tests/826680#step/zypper_lifecycle/29
https://openqa.suse.de/tests/826680#step/updates_packagekit_gpk/11
https://openqa.suse.de/tests/826696#step/sshd/15
https://openqa.suse.de/tests/826679#step/sshd/17
https://openqa.suse.de/tests/826683#step/mtab/1

Actions

Copy link

#46

Updated by coolo over 7 years ago

Careful not to mix different issues. the zypper_lifecycle cases are 'not waiting for select-console to finish'.

Same with sshd

Actions

Copy link

#47

Updated by coolo over 7 years ago

same with mtab

Actions

Copy link

#48

Updated by okurz over 7 years ago

o3 examples of mistyped:
https://openqa.opensuse.org/tests/375639#step/vlc/22 -> "v-lc -no-autoscale" instead of "vlc --no-autoscale" so the one "-" ended up somewhere differently.
Similar is https://openqa.opensuse.org/tests/376071#step/ooffice/7 -> "oowritre" instead of "oowriter"

Actions

Copy link

#49

Updated by coolo over 7 years ago

krunner is using autocompletion - and it breaks often if you type manually in there too. So we need to type very slow there - but it's more a product workaround not an openqa problem.

Actions

Copy link

#50

Updated by coolo over 7 years ago

So I experimented with waiting for screen updates after every key on malbec (which has qemu 2.8 now), but the emulated keyboard still bounces keys:
https://openqa.suse.de/tests/837199#step/snapper_cleanup/30 - so typing SLOOOW on power and aarch is the only option atm :(

Actions

Copy link

#51

Updated by asmorodskyi over 7 years ago

https://openqa.suse.de/tests/837414 - power architecture

Actions

Copy link

#52

Updated by coolo over 7 years ago

that test ran on the weekend when I experimented with malbec. The result of that was https://github.com/os-autoinst/os-autoinst/pull/743/files - and we haven't yet updated the machines

Actions

Copy link

#53