action #17442
closed[tools]Bad performance or too many failures because of typing too soon/mistyping (send_keys/type_string)
Added by dzedro about 7 years ago. Updated about 7 years ago.
0%
Description
observation¶
tests like zypper_lifecycle, sshd, ssh_sleanup, sshxterm and other started to fail much more than some time ago.
It is not new issue, but this failures were not so frequent as now.
All fails are caused by typing to soon or letters get lost because of very low performance.
#16738 is somehow related.
reproducible¶
seems to happen in some jobs
problem¶
- H1 REJECTED worker overbooking
- H2 qemu loosing keys -> unlikely as it was happening more often (or at all) recently but not in before, see H5; we can not be so sure. Probably it was seen in before (we have tickets for this) but as it persists it stays in people's mind.
- H2.1 REJECTED loosing keys in keyboard buffer -> rejected: we would expect "key event buffer full" messages and an early fail in the jobs
- H2.2 loosing keys in some part of qemu before it reaches the keyboard buffer (VNC part?)
- H3 performance regression in the product
- H3.1 … recently detectable on very fast machines
- H3.2 REJECTED detectable on any worker -> #17442#note-32
- H4 REJECTED network -> only affecting x11 tests and not console tests so why should we loose just VNC typing events there? (okurz)
- H5 REJECTED updates on the workers
- H5.1 REJECTED package updates on the workers (not openQA packages) -> no updates on worker since 3 weeks (rbrown)
- H5.2 REJECTED os-autoinst update on the workers -> #17442#note-32
- H5.3 REJECTED updates of tests -> #17442#note-35
- H6 it was there all the time but nobody cared because it was not seen often enough and now with more and more jobs and more thorough review again it stays in everyones scope
suggestion¶
- E1-1 DONE: check workers of last examples -> O1-1 linked jobs are spread over openqaworker2, …3, …5, …6 -> REJECT H1
- E3-1 DONE: crosscheck with differing parameters (older build / test on other machines) -> #17442#note-32 -> SPLIT H3
- E5.3-1 DONE: crosscheck with older/newer version of tests -> #17442#note-35 -> REJECT H5.3+H5
further details¶
Few examples:
https://openqa.suse.de/tests/793680#step/yast2_snapper/5
https://openqa.suse.de/tests/793974#step/zypper_lifecycle/55
https://openqa.suse.de/tests/793716#step/ssh_cleanup/3 --> #14826
https://openqa.suse.de/tests/793683#step/sshd/16 --> #14826
http://openqa.suse.de/t796211
https://openqa.suse.de/tests/796232#step/gnome_terminal/7
https://openqa.suse.de/tests/795241 (Build 0260)
workaround¶
retrigger job
Updated by dzedro about 7 years ago
- Related to action #16738: test fails in zypper_lifecycle and the next tests are keep failing added
Updated by okurz about 7 years ago
- Subject changed from Bad performance or too many failrues because of typing too soon/mistyping to Bad performance or too many failrues because of typing too soon/mistyping (send_keys/type_string)
- Category set to Regressions/Crashes
- Priority changed from High to Urgent
also staging test fail now: xterm (http://openqa.suse.de/t796211)
Updated by okurz about 7 years ago
- Subject changed from Bad performance or too many failrues because of typing too soon/mistyping (send_keys/type_string) to Bad performance or too many failures because of typing too soon/mistyping (send_keys/type_string)
Updated by okurz about 7 years ago
load of the system seems not too high: https://openqa.suse.de/tests/796211/file/xterm-loadavg.txt 0.50 0.79 0.42 1/238 8266
Updated by okurz about 7 years ago
- Description updated (diff)
- Status changed from New to In Progress
- Assignee set to okurz
same: https://openqa.suse.de/tests/796232#step/gnome_terminal/7
checked workers and the linked jobs are spread over openqaworker2, …3, …5, …6 -> REJECT H1
Updated by nicksinger about 7 years ago
Build0260 also affected: https://openqa.suse.de/tests/795241
Updated by okurz about 7 years ago
- Description updated (diff)
rbrown: as discussed with okurz right now, 793716 looks like it might be a console_select bug, but all the other examples I'm finding so far all look like they're only happening in x11 tests, which make me think we might have a product performance regression that impacts only x11 performance - if it was os-autoinst, qemu, or some other part of the infra I do not see how it would manifest itself only on x11 tests
rbrown: 793683 another sign of a console select bug
rbrown: suspecting there is a correlation between select_console and x11_start_program - the more I look at this the more I am convinced this is not a worker/hardware problem - the system status + asmorodskyi's observation that it happens recurringly really convinces me of that
Updated by nicksinger about 7 years ago
Another one in 0260: https://openqa.suse.de/tests/796258
Updated by pgeorgiadis about 7 years ago
Updated by asmorodskyi about 7 years ago
- Related to action #17502: [sle][functional]test fails in grub_test on timeout waiting for inst-bootmenu added
Updated by RBrownSUSE about 7 years ago
Discussed with Oliver, below is the current leading hypothesis
https://openqa.suse.de/tests/796211#step/xterm/4 and https://openqa.suse.de/tests/796211#step/xterm/4 both show that the issue is not X11 bound
The issue is that os-autoinst is typing before the bash shell is ready to accept input https://openqa.suse.de/tests/796211#step/xterm/4 shows that many of the Enter presses were committed before bash was accepting input, even though the prompt was on the screen (confirmed by the assert_screen)
With os-autoinsts fractional time, needling on assert screen and such is a lot faster, so os-autoinst is typing before the prompt is ready
We can rule out hardware overload, this problem does not happen when systems are overloaded - in fact overloading a system seems to AVOID this problem.
Hypothethis - There may be a performance issue around bash that increases the time between the bash prompt being visible on the screen and it being ready to accept input.
This performance issue might be a regression compared to SP2, or could be just detected now as a result of the performance improvements to os-autoinst
This performance issue might be a problem with the generation of the bash prompt - is there an issue finding the hostname? is there an issue finding the username?
Jozef's pull request in January shows that the problem was at least present in mid January.
As discussed with Oliver, we're going to need some manual testing. Simple tests would be an installation of SLES 12 SP3 and holding down the enterkey on a bash prompt in xterm or tty. A well working system should produce an output with an almost endless stream of prompts
rbrown@ibrokeit:~>
rbrown@ibrokeit:~>
rbrown@ibrokeit:~>
rbrown@ibrokeit:~>
rbrown@ibrokeit:~>
If we're able to reproduce this issue, we'd expect to see something like
rbrown@ibrokeit:~>
rbrown@ibrokeit:~>
Which would imply that bash is taking too long to generate it's prompt, just like we see in openQA
If we can reproduce this outside of openQA, then it's a clear bug that needs to be filed
If we can reproduce it outside of openQA but not on SP2, then we know it is an SP3 specific bug (but like I say above, this might not be a regression)
if we can't reproduce it out of openQA, there is still the chance of it being a bug. Areas of further investigation would include name resolution, username detection, and how they are done by bash when creating the prompt.
The different network configuration for the SUT in openQA could have a part to play - because it's installed by YaST our configuration should be valid, but are we sure it is? Is SELF name resolution working properly? the hostname.pm test only checks hostnamectl, is that the method used by bash?
If we manage to rule out all of the above as a problem with the product, then I would be prepared to accept that in this case openQA could be testing 'too fast', in which case I think we should investigate adding appropriate delays in either assert_screen or the affected tests so they give more time between checking the prompt is visible and then using it.
Updated by okurz about 7 years ago
- Description updated (diff)
split H5 into os-autoinst update and others
Updated by szarate about 7 years ago
- Assignee deleted (
szarate)
I'm unnasigning myself from this issue for now.
Updated by okurz about 7 years ago
- Description updated (diff)
- Assignee set to RBrownSUSE
- Priority changed from Urgent to High
xterm/gterm/terminal tests failing: I don't think it is because of os-autoinst change -> https://openqa.suse.de/tests/800650#step/yast2_snapper/8
That might also be a different issue but after talking to coolo and szarate we could not identify anything in
https://openqa.suse.de/tests/802036#step/yast2_lan/1 looks like "ninja keys" but in console tests now -> staging
Currently we are following on H2 and want to come up with a qemu with debug information about key presses and such.
rbrown wants to further evaluate statistics in preparation for a good "product issue" report. I myself triggered 100 jobs on lord.arch/tests?hoursfresh=24&match=wait_for_x11_program_responsive to check if the problem can also be reproduced on a "slow" server (i.e. rotating disk, no NVME)
Updated by dzedro about 7 years ago
This typing too soon happened on my local openQA (desktop with 2 workers)
Update: without nfs, just local hdd
Updated by mkravec about 7 years ago
Typing too soon / too late happened a lot in build 0267:
All gedit failures:
https://openqa.suse.de/tests/804361#step/gedit/5
https://openqa.suse.de/tests/804365#step/gedit/5
https://openqa.suse.de/tests/804342#step/gedit/5
https://openqa.suse.de/tests/804413#step/gedit/5
https://openqa.suse.de/tests/804357#step/gedit/5
Gnome terminal:
https://openqa.suse.de/tests/804357#step/gnome_terminal/7
Firefox:
https://openqa.suse.de/tests/804345#step/firefox/4
GPK updates:
https://openqa.suse.de/tests/804350#step/updates_packagekit_gpk/5
https://openqa.suse.de/tests/804375#step/updates_packagekit_gpk/5
VNC:
https://openqa.suse.de/tests/802819#step/vnc_two_passwords/11
X11_start_program: (not typing program name in many tests)
https://openqa.suse.de/tests/804319#step/xterm/1
Updated by osukup about 7 years ago
same behaviour in many Maintenace Test's , Test Repo + Incidents ( +-5 cases per run)
Updated by RBrownSUSE about 7 years ago
- Status changed from In Progress to New
- Assignee deleted (
RBrownSUSE) - Priority changed from High to Urgent
Please provide at least some examples in this thread so I can follow them up
Updated by mgriessmeier about 7 years ago
seems also related:
https://openqa.suse.de/tests/804730#step/yast2_snapper/11
typing echo0 - probably missing a space here
also related:
https://openqa.suse.de/tests/804703#step/wireshark/4
things getting typed twice compared to old jobs
Updated by dzedro about 7 years ago
Here is another failed test with system performance statistics
Job was started at 2017-03-08 07:24:58
http://10.100.12.155/tests/5565#step/ssh_cleanup/3
pdf
ftp://10.100.12.155/sar.pdf
text
ftp://10.100.12.155/sar.txt
Updated by szarate about 7 years ago
- Related to action #17594: [tools]Missing characters in the middle of type_string/assert_script_run (Ninja Keys) added
Updated by osukup about 7 years ago
http://argus.suse.cz/tests/160#step/x11regressions_setup/4 hit locally :)
Updated by qkzhu about 7 years ago
- Related to action #17458: test fails in libreoffice_mainmenu_components added
Updated by okurz about 7 years ago
https://openqa.suse.de/tests/806822#step/zypper_lifecycle/29 is another fun example of missing keys from beginning of string in console but not immediately after console switching
Updated by okurz about 7 years ago
- Status changed from New to In Progress
- Assignee set to okurz
the ticket is in progress because we are already investing a lot of time in investigation. If rbrown does not want to take it I will because at least I can provide a little bit of answers if someone wants to understand what's going on
Updated by okurz about 7 years ago
- Description updated (diff)
conducted experiment on lord.arch by triggering 100 jobs in a row as a clone of a recently failing staging job.
- os-autoinst: 945efee
- os-autoinst-distri-opensuse: 04dcf59
os-autoinst-needles-sle: 7811e4e
- jobs http://lord.arch/tests?hoursfresh=24&match=wait_for_x11_program_responsive 5699-5793 (94 jobs) all passed or only failed in unrelated modules so no problem observed -> update os-autoinst 945efee..02cd04f
- http://lord.arch/tests?hoursfresh=24&match=poo17442 -> all 100 jobs (5794-5902) do not show any related failure (one incomplete because of 'stall detected', one softfailed module in reboot_gnome) -> os-autoinst is not to blame in general
-> H3 split, H3.1 rejected as well as H5.2
Also adding H6, H5.3
next experiment E5.3-1 with tests update on lord.arch
Updated by okurz about 7 years ago
https://openqa.suse.de/tests/807661#step/xterm/6 single 'ret' had an effect, nothing more is shown in xterm.
https://openqa.suse.de/tests/807661#step/sshxterm/5 same job, next module, missing characters from beginning.
https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/2525 is my proposal for a workaround but I will test something more thoroughly later on, e.g. some kind of send_key_until_needle_match 'next-line-in-xterm', 'ret'
or a while-loop, maybe.
Updated by okurz about 7 years ago
https://openqa.suse.de/tests/807824#step/updates_packagekit_gpk/6 in xterm, missing keys in beginning
Updated by okurz about 7 years ago
https://openqa.suse.de/tests/807449#step/check_console_font/3 in check_console_font, beginning "echo" missing
Updated by okurz about 7 years ago
- Description updated (diff)
E5.3-1 DONE: http://lord.arch/tests/5903 - http://lord.arch/tests/5922 all passed, HDD on lord.arch was full afterwards so I aborted.
This leaves H2.2, H3.1, H6
rbrown is conducting an experiment to load the production workers after "rudi did some changes".
Updated by dzedro about 7 years ago
Updated by okurz about 7 years ago
- Assignee changed from okurz to RBrownSUSE
@RBrownSUSE, any recent findings? staging tests are still failing a lot on osd as it seems, I can't reproduce it myself, dzedro can on his machine. What is your take on this?
workaround https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/2525 is still ready but we don't see the problem on o3 so not approved. What to do now?
Updated by RBrownSUSE about 7 years ago
As we chatted about in my office today - I think the console failures of this type like the one we saw today could/should be mitigated by a better needle that actually checks it has the correct console so it doesn't type into the user console before it's actually flipped to the root console
Have you or anyone else submitted a needle to fix that?
The X11 failures, like we saw in gedit, need more investigation and http://openqa.suse.de/tests/810619#step/xterm/6 alone isn't enough for me to get a feel for the problem:
More examples would help, so I can profile the system performance at the time and such.
I'm waiting and seeing before deciding what to do about https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/2525/
If things are really getting in the way of staging, then I can accept gh#2525 as a 'make release managers happy' measure, but as we're just on the dark side of Alpha 2 I would prefer to have more time to investigate if you can help make that happen
What do you think?
Updated by okurz about 7 years ago
RBrownSUSE wrote:
As we chatted about in my office today - I think the console failures of this type like the one we saw today could/should be mitigated by a better needle that actually checks it has the correct console so it doesn't type into the user console before it's actually flipped to the root console
Have you or anyone else submitted a needle to fix that?
I am on it in #14826
The X11 failures, like we saw in gedit, need more investigation and http://openqa.suse.de/tests/810619#step/xterm/6 alone isn't enough for me to get a feel for the problem:
More examples would help, so I can profile the system performance at the time and such.
I'm waiting and seeing before deciding what to do about https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/2525/If things are really getting in the way of staging, then I can accept gh#2525 as a 'make release managers happy' measure, but as we're just on the dark side of Alpha 2 I would prefer to have more time to investigate if you can help make that happen
What do you think?
I think not enough people trust the system so after two weeks with no solution in sight we should expedite to find a workaround and make some people hunt for product bugs while we can continue to investigate further on this. If we simply put in a record_soft_failure or maybe even a single but unambiguous "diag" message we can find occurences of these problems and gather more data. I updated my PR to record a soft failure and try again. This might help to gather even more data.
Updated by RBrownSUSE about 7 years ago
- Subject changed from Bad performance or too many failures because of typing too soon/mistyping (send_keys/type_string) to [tools]Bad performance or too many failures because of typing too soon/mistyping (send_keys/type_string)
Updated by okurz about 7 years ago
so we merged the test pull request which might help us to detect this (or break everything, as always, but I will monitor)
But now for something completely different. In
https://openqa.opensuse.org/tests/369987#step/inkscape/8
which is on o3, we see a mistyping of "xtrem" instead of "xterm". The test code is fine and so is the call to the testapi
testapi::type_string(string='xterm', max_interval=250, wait_screen_changes=0)
same issue?
Updated by asmorodskyi about 7 years ago
https://openqa.suse.de/tests/818019 - another example of missing keys job
Updated by okurz about 7 years ago
recent example of failing job: https://openqa.suse.de/tests/820752#step/xterm/9
Updated by mkravec about 7 years ago
In build 0288 review:
https://openqa.suse.de/tests/826663#step/zypper_lifecycle/59
https://openqa.suse.de/tests/826673#step/zypper_lifecycle/55
https://openqa.suse.de/tests/826653#step/zypper_lifecycle/29
https://openqa.suse.de/tests/826680#step/zypper_lifecycle/29
https://openqa.suse.de/tests/826680#step/updates_packagekit_gpk/11
https://openqa.suse.de/tests/826696#step/sshd/15
https://openqa.suse.de/tests/826679#step/sshd/17
https://openqa.suse.de/tests/826683#step/mtab/1
Updated by coolo about 7 years ago
Careful not to mix different issues. the zypper_lifecycle cases are 'not waiting for select-console to finish'.
Same with sshd
Updated by okurz about 7 years ago
o3 examples of mistyped:
https://openqa.opensuse.org/tests/375639#step/vlc/22 -> "v-lc -no-autoscale" instead of "vlc --no-autoscale" so the one "-" ended up somewhere differently.
Similar is https://openqa.opensuse.org/tests/376071#step/ooffice/7 -> "oowritre" instead of "oowriter"
Updated by coolo about 7 years ago
krunner is using autocompletion - and it breaks often if you type manually in there too. So we need to type very slow there - but it's more a product workaround not an openqa problem.
Updated by coolo about 7 years ago
So I experimented with waiting for screen updates after every key on malbec (which has qemu 2.8 now), but the emulated keyboard still bounces keys:
https://openqa.suse.de/tests/837199#step/snapper_cleanup/30 - so typing SLOOOW on power and aarch is the only option atm :(
Updated by asmorodskyi about 7 years ago
https://openqa.suse.de/tests/837414 - power architecture
Updated by coolo about 7 years ago
that test ran on the weekend when I experimented with malbec. The result of that was https://github.com/os-autoinst/os-autoinst/pull/743/files - and we haven't yet updated the machines
Updated by RBrownSUSE about 7 years ago
- Priority changed from Urgent to High
Lowering Prio as issues are no longer 'on fire'
Will lower further after PR 743 is deployed and this becomes a 'monitoring' task
Still open as clearly happens still on Power and aarch but shouldn't after 743 is merged.
Updated by coolo about 7 years ago
743 is deployed, but not used as not configured
Updated by SLindoMansilla about 7 years ago
Could you please check these tests:
https://openqa.suse.de/tests/840897#step/curl_https/7
It might be related, see difference between "Console output" and "serial output"
https://openqa.suse.de/tests/840898#step/sshd/20
https://openqa.suse.de/tests/840905#step/sshd/20
https://openqa.suse.de/tests/840899#step/curl_https/7
Updated by szarate about 7 years ago
There's another one too: https://openqa.suse.de/tests/843506
Updated by RBrownSUSE about 7 years ago
- Status changed from In Progress to Resolved
Believed to be resolved with the new deployed settings, closing ticket.
Updated by okurz about 7 years ago
see #17594#note-16 for more information on potential resolution helps
Updated by okurz about 7 years ago
- Related to action #17456: [qam][x11regressions]test fails sporadically in gedit_save added
Updated by qkzhu about 7 years ago
- Is duplicate of action #17446: [qam][x11regressions]test fails in evince_find sporadically (network issue or mistyping on long strings) added
Updated by okurz about 7 years ago
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: migration_zdup_offline_sle12sp2_allpatterns_fullupdate_ppc
https://openqa.suse.de/tests/878304
Updated by okurz about 7 years ago
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: om_proxyscc_sles12sp1_smt_full_update_by_yast
https://openqa.suse.de/tests/878430