Project

General

Profile

action #12250

[sporadic]"QEMU: usb-kbd: warning: key event queue full" on ppc64le and aarch64

Added by okurz about 5 years ago. Updated over 3 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
Infrastructure
Target version:
-
Start date:
2016-06-03
Due date:
% Done:

0%

Estimated time:
Difficulty:

Description

observation

missing character "-f" after the test in https://openqa.suse.de/tests/412084/modules/snapper_undochange/steps/21, message in logfile: "QEMU: usb-kbd: warning: key event queue full"

problem

typing too fast for this backend?

suggestion

Probably we should detect this QEMU message and handle it explicitly.

further details

Related issue: https://progress.opensuse.org/issues/12018

workaround

Sporadic issue: Retrigger job


Related issues

Related to openQA Project - coordination #14972: [tools][epic] Improvements on backend to improve better handling of stallsResolved2016-11-24

Related to openQA Project - action #38777: [tools][ppc64le] QEMU: usb-kbd: warning: key event queue fullRejected2018-07-24

Has duplicate openQA Tests - action #15118: Text is mistyped on aarch64Rejected2016-11-29

History

#1 Updated by okurz about 5 years ago

  • Assignee set to coolo

happened recently here: https://openqa.suse.de/tests/463809

who is our ppc64le qemu expert?

#2 Updated by coolo about 5 years ago

and do what?

#3 Updated by okurz about 5 years ago

reassign to our ppc64le qemu expert

#4 Updated by coolo about 5 years ago

  • Assignee deleted (coolo)

that's not me

#5 Updated by algraf about 5 years ago

I think we just need to use the new vnc completion timeout feature from QEMU to work around that. Same issue as aarch64.

#6 Updated by okurz about 5 years ago

I knew the coolo-okurz back and forth would serve a purpose ;-)

What you mentioned sounds interesting. Can you go into a little bit more detail or provide a reference?

#7 Updated by algraf about 5 years ago

okurz wrote:

I knew the coolo-okurz back and forth would serve a purpose ;-)

What you mentioned sounds interesting. Can you go into a little bit more detail or provide a reference?

Sure, I'm referring to this patch:

https://lists.gnu.org/archive/html/qemu-devel/2016-05/msg03811.html

which was written exactly to fix this issue. The patch is also available in the Virtualization/qemu package:

https://build.opensuse.org/package/view_file/Virtualization/qemu/0060-vnc-add-configurable-keyboard-delay.patch?expand=1

as well as SLE12 SP2:

https://build.suse.de/package/view_file/SUSE:SLE-12-SP2:GA/qemu/0065-vnc-add-configurable-keyboard-delay.patch?expand=1

So all you need to do is make sure you're running a very recent QEMU version and pass -vnc ...,key-delay-ms=200 as parameter to the qemu command line.

#8 Updated by mgriessmeier almost 5 years ago

created PR for that: https://github.com/os-autoinst/os-autoinst/pull/568
but since lack of backend knowledge, I'll need some help with that

okurz suggested to put in a check for the qemu version

#9 Updated by dirkmueller almost 5 years ago

I'm currently out of office and your email has been received. I'll be back August 31st. For issues related
SUSE OpenStack Cloud that need attention before that, please reach out to Roman Arcea rarcea@suse.com.

Thanks,
Dirk

#10 Updated by AdamWill almost 5 years ago

I may be missing something here, but - we already have quite a bit of rate limiting code in os-autoinst , including an arg you can pass to type_string to control the speed of typing. In SUSE's distri code you have type_slowly and type_very_slowly functions that limit the rate beyond the default by using this arg, for instance. Do we really move in the right direction by turning on some all-or-nothing rate limiting in qemu?

If it's really the case that a given arch can't keep up with the same rate of typing in 'ideal conditions' - i.e. at a console with the minimum amount of other activity going on in the SUT and its host - then couldn't we just adjust the default rate limiting in type_string etc. to be different per arch?

And is that really the problem in this case at all? What I think I see on Fedora is that typing issues seem to be related to system load. For instance, we almost never get any errors when typing at a console on an otherwise-idle VM...but very occasionally we do get some characters swallowed, and if you happen to catch it live or get a fortunate video (the speedup on the videos can make them useless for analyzing this) and analyze it frame-by-frame, it looks like the VM mostly keeps up just fine with the typing, then suddenly for a second or two absolutely no keypresses show up, then a bunch of them appear at once; basically, it looks like something somewhere (the SUT, the worker host, or some other damn part of the chain) gets really busy and stops forwarding keypresses, then when it stops being busy, all the backlogged key presses arrive in a lump - only if more presses backed up than the SUT (I think that's where the buffer is) has buffer space for, you lose some of them.

Rate limiting could mitigate that, but universal rate limiting of all typing ever is a very large hammer for cracking that nut, and not even really foolproof, if the 'not forwarding keypresses' period lasts long enough.

#11 Updated by okurz almost 5 years ago

Yes, I think you are right. Basically as also discussed in IRC some times and also in another progress issue, e.g. see #13702: If this issue appears it is not really a problem in qemu but just the symptom that the SUT gets stalled. A human user with better OCR is waiting for the expected even. We mainly have the problem described here when we are typing long sequences without waiting for proper feedback of the system. IMHO there should be some waiting when typing long sequences but I am not sure what kind. Would would you say regarding #13702? I'd say as in #13702#note-3 a forced wait_still_screen(1) after 30 characters by default in type_string which can be overridden on demand is a good approach. WDYT?

#12 Updated by agraf@suse.de almost 5 years ago

Would it help if you had a separate qemu keyboard input command (over vnc?) that only returns if the key was successfully pulled from the fifo queue?

Alex

Am 19.09.2016 um 23:22 schrieb redmine@opensuse.org redmine@opensuse.org:

[openSUSE Tracker]
Issue #12250 has been updated by okurz.

Yes, I think you are right. Basically as also discussed in IRC some times and also in another progress issue, e.g. see #13702: If this issue appears it is not really a problem in qemu but just the symptom that the SUT gets stalled. A human user with better OCR is waiting for the expected even. We mainly have the problem described here when we are typing long sequences without waiting for proper feedback of the system. IMHO there should be some waiting when typing long sequences but I am not sure what kind. Would would you say regarding #13702? I'd say as in #13702#note-3 a forced wait_still_screen(1) after 30 characters by default in type_string which can be overridden on demand is a good approach. WDYT?


action #12250: "QEMU: usb-kbd: warning: key event queue full" on ppc64le
https://progress.opensuse.org/issues/12250#change-28946

  • Author: okurz
  • Status: New
  • Priority: Normal
  • Assignee:
  • Category: Bugs in existing tests

* Target version:

observation

missing character "-f" after the test in https://openqa.suse.de/tests/412084/modules/snapper_undochange/steps/21, message in logfile: "QEMU: usb-kbd: warning: key event queue full"

problem

typing too fast for this backend?

suggestion

Probably we should detect this QEMU message and handle it explicitly.

further details

Related issue: https://progress.opensuse.org/issues/12018

--
You have received this notification because you have either subscribed to it, or are involved in it.
To change your notification preferences, please click here: http://progress.opensuse.org/my/account

#13 Updated by okurz@suse.de almost 5 years ago

Would it help if you had a separate qemu keyboard input command (over vnc?)
that only returns if the key was successfully pulled from the fifo queue?

sounds like a great idea. I don't if we the then still have problems with
other backends but the qemu machines are more prone to mutual stalling so I
might very well help.

#14 Updated by maritawerner over 4 years ago

What is the status here? Can szarate do the fix suggested from agraf?

#15 Updated by szarate over 4 years ago

  • Assignee set to szarate

i'm taking this one... the patch mentioned by @agraf is already present (it's fun to use it with 1000ms...

#16 Updated by szarate over 4 years ago

  • Assignee deleted (szarate)
  • Priority changed from Normal to Low

I could not reproduce the same issue on ppc64le, but i got similar problems with x86_64.

While i was using the -key-delay i could get rid of the type_slow and type_very_slow typing speeds... but this would led to all sort of different problems in following screens.

I'm changing the priority to low, since it has not been seen in a while (At least not a specific message)

#17 Updated by mkravec over 4 years ago

QEMU: usb-kbd: warning: key event queue full
QEMU: main-loop: WARNING: I/O thread spun for 1000 iterations

Job: https://openqa.suse.de/tests/666816#step/zypper_lifecycle/15

#18 Updated by zluo over 4 years ago

yes, it seems that openQA server is overloaded.

#19 Updated by szarate over 4 years ago

To add a bit more on what zluo wrote on #15118, there are stalls in this job that are up to 38 seconds, which basically means that the system was really slow at that point.

However, what mkravec is describing seems to be more related to: This bug... It seems that there are few solutions to this like this one.

#20 Updated by szarate over 4 years ago

#21 Updated by szarate over 4 years ago

  • Related to coordination #14972: [tools][epic] Improvements on backend to improve better handling of stalls added

#22 Updated by asmorodskyi over 4 years ago

another example of failure https://openqa.suse.de/tests/668028

#24 Updated by asmorodskyi over 4 years ago

  • Priority changed from Low to Normal

another example https://openqa.suse.de/tests/685272/file/autoinst-log.txt

also change priority back to normal , according latest comments issue start to appear more often

#26 Updated by mkravec over 4 years ago

  • Subject changed from "QEMU: usb-kbd: warning: key event queue full" on ppc64le to "QEMU: usb-kbd: warning: key event queue full" on ppc64le and aarch64

#27 Updated by szarate over 4 years ago

coolo, i think it would be wise to add some kind of monitoring now at least to capture the CPU usage... i'll play with a machine in orthos to see if i can catch it... but so far looks like we can say that this would be a known failure (And therefore a job to restart automatically by the webui)

#28 Updated by mkravec over 4 years ago

There are failures in every build (5 tests failed in build 0202)
Build0211: https://openqa.suse.de/tests/721648

#29 Updated by szarate over 4 years ago

  • Status changed from New to In Progress

Now with Die early when usb-kbd: key event full is found, jobs that have this abnormal situation will be inmediately terminated. In the future, they should be rescheduled.

#30 Updated by szarate over 4 years ago

  • Status changed from In Progress to Feedback

Will set the status to feedback for now...

#31 Updated by asmorodskyi over 4 years ago

  • Status changed from Feedback to In Progress

#32 Updated by szarate over 4 years ago

  • Status changed from In Progress to Feedback

I came across this interesting one: https://openqa.suse.de/tests/733459#step/updates_packagekit_gpk/7 The message "QEMU: usb-kbd: warning: key event queue full" is not shown, but still the c from clear was already missing.

This might be unrelated but leaving this here for future reference.

Also, changing the status to feedback as it has been fixed but waiting for deploy of os-autoinst on workers for final verification.

#33 Updated by szarate over 4 years ago

szarate wrote:

I came across this interesting one: https://openqa.suse.de/tests/733459#step/updates_packagekit_gpk/7 The message "QEMU: usb-kbd: warning: key event queue full" is not shown, but still the c from clear was already missing.

This might be unrelated but leaving this here for future reference.

Also, changing the status to feedback as it has been fixed but waiting for deploy of os-autoinst on workers for final verification.

This is also another similar occurence of this bug: http://e13.suse.de/tests/1705#step/yast2_apparmor/10 previous screen sends a alt-n but there's no messages again by the QEMU backend...

#35 Updated by okurz over 4 years ago

  • Status changed from Feedback to In Progress

szarate: can you take this? https://openqa.suse.de/tests/754579 fails with this and it is not an "early die"

#36 Updated by okurz over 4 years ago

  • Priority changed from Normal to High

#38 Updated by okurz over 4 years ago

you know that overdrive2 is not under control of salt and therefore fall in the domain of "but it is fixed in os-autoinst@master!!!1eins"

#39 Updated by okurz over 4 years ago

ppc64le: https://openqa.suse.de/tests/754640
it died immediately after the "queue full" was detected but the job is regarded as "incomplete" and the error is not obvious without diving into the logfile. So some time saved by not fooling ourselves. for "incomplete" it is pretty obvious we need to look into the logfile anyway but it would be better if we can call something like a "record_known_failure" and forward that to the webui.

#40 Updated by szarate over 4 years ago

  • Status changed from In Progress to Resolved

I'm thinking about renaming this bug to the: Neverending bug!. Anywho... Changing my mind and will set this one to resolved... The missing/ninja keys is another issue. Guys feel free to reopen in case it happens again :D

#41 Updated by pgeorgiadis over 4 years ago

  • Status changed from Resolved to In Progress

And yes, it happened again :P openQA crashed during qam-minimal for aarch64

Latest build can be found here. The test crashed (dark red color).
Here's the log from auto-inst:

19:37:18.1552 27443 <<< testapi::type_string(string='zypper -n in -l -t patch SUSE-SLE-SERVER-12-SP2-2017-4210 | tee /tmp/zypper.log ; echo 2VCGL-${PIPESTATUS}- > /dev/ttyAMA0', max_interval=250, wait_screen_changes=0)
19:37:18.7930 27445 QEMU: usb-kbd: warning: key event queue full
DIE QEMU: Shutting down the job at /usr/lib/os-autoinst/backend/qemu.pm line 939.

This error had not happened in the last 3 months.

#42 Updated by szarate over 4 years ago

  • Status changed from In Progress to Resolved

pgeorgiadis wrote:

And yes, it happened again :P openQA crashed during qam-minimal for aarch64

Latest build can be found here. The test crashed (dark red color).
Here's the log from auto-inst:

19:37:18.1552 27443 <<< testapi::type_string(string='zypper -n in -l -t patch SUSE-SLE-SERVER-12-SP2-2017-4210 | tee /tmp/zypper.log ; echo 2VCGL-${PIPESTATUS}- > /dev/ttyAMA0', max_interval=250, wait_screen_changes=0)
19:37:18.7930 27445 QEMU: usb-kbd: warning: key event queue full
DIE QEMU: Shutting down the job at /usr/lib/os-autoinst/backend/qemu.pm line 939.

This error had not happened in the last 3 months.

The actual fix was to add specifically that die :). So far it's a problem we will have to live with, until we figure out how to deal with overbooking and qemu usb packet drop (https://bugzilla.redhat.com/show_bug.cgi?id=1293234)

szarate wrote:

Now with Die early when usb-kbd: key event full is found, jobs that have this abnormal situation will be inmediately terminated. In the future, they should be rescheduled.

#44 Updated by mitiao over 4 years ago

The issue always reproduced on power for recent builds
https://openqa.suse.de/tests/829192
https://openqa.suse.de/tests/829193

#46 Updated by SLindoMansilla over 4 years ago

  • Status changed from Resolved to In Progress
  • Priority changed from High to Normal

as discussed with okurz. Not happening too often but as long as it appears we need to handle it somehow and track it.

#47 Updated by coolo over 4 years ago

So when typing with 50kps we can reproduce this problem like 100% in console tests. 15 was okayish, so I guess ppc (and possibly aarch64) would need some even lower.

As also ikvm and other platforms are affected (I had to lower the typing speed for idrac interfaces to 10kps), I guess we need to have this setting globally in the machine definition.

#48 Updated by dzedro over 4 years ago

ppc64le failed

#49 Updated by algraf over 4 years ago

This is the bugzilla to track "keyboard type queue is empty" events which OpenQA could then use to only press keys when they were successfully handled by the guest:

https://bugzilla.suse.com/show_bug.cgi?id=1031692

#50 Updated by SLindoMansilla over 4 years ago

  • Category changed from Bugs in existing tests to Infrastructure

as discussed with okurz

#53 Updated by michel_mno over 4 years ago

Other occurrences for Leap 42.3 and ppc64le (1)
systematically when "curl_https" script has loop of validate_script_output. (2)
Do we need a specific new issue for this problem ?

(1) https://openqa.opensuse.org/tests/overview?distri=opensuse&version=42.3&build=0037&groupid=30
(2) https://openqa.opensuse.org/tests/381080/modules/curl_https/steps/1/src

#54 Updated by michel_mno over 4 years ago

And another exemple (1) for sshd.pm
where the issued string is 86 characters long (2)

(1) https://openqa.opensuse.org/tests/382112/file/autoinst-log.txt
(2) 'ssh -v sshboy@localhost -t echo LOGIN_SUCCESSFUL; echo SSH-1491407073-$?- > /dev/hvc0'

#55 Updated by nicksinger over 4 years ago

Here is another example: https://openqa.suse.de/tests/861360

#56 Updated by dzedro about 4 years ago

ppc64le migration test failed on QEMU: usb-kbd: warning: key event queue full
https://openqa.suse.de/tests/918599
https://openqa.suse.de/tests/918604

#57 Updated by nicksinger about 4 years ago

Here are some more recent examples on ppc64le:
https://openqa.suse.de/tests/931322 (Tried to type 134 chars before fail)
https://openqa.suse.de/tests/930630 (107 chars)
https://openqa.suse.de/tests/931356 (86 chars)
https://openqa.suse.de/tests/931937
https://openqa.suse.de/tests/930663

Also the log of this: https://openqa.suse.de/tests/931322/file/autoinst-log.txt
Shows another interesting error:

22:01:27.9912 88247 capture loop failed QEMU: Shutting down the job at /usr/lib/os-autoinst/backend/qemu.pm line 955.

but this is maybe just caused by the full event queue.

#58 Updated by mitiao about 4 years ago

slesp2 image creating (for migration) failed on aarch64_maintenance worker
https://openqa.suse.de/tests/940559

#59 Updated by okurz about 4 years ago

  • Subject changed from "QEMU: usb-kbd: warning: key event queue full" on ppc64le and aarch64 to [sporadic]"QEMU: usb-kbd: warning: key event queue full" on ppc64le and aarch64
  • Description updated (diff)

#60 Updated by mitiao about 4 years ago

Re-triggered several times and always failed at the same step with key event queue full issue:
https://openqa.suse.de/tests/982202

Any idea to avoid it to publish a SP2 aarch64 image, then I can try migration from that.

#61 Updated by RBrownSUSE about 4 years ago

My advice would be to apply the same fix to the tests that I suggested in #19540

I would recommend changing grub_set_bootargs.pm line 20 to use wait_screen_changes=1 to naturally type more slowly and therefore avoid filling up the key event queue.

#62 Updated by SLindoMansilla about 4 years ago

Recent example:

sle-12-SP3-Server-DVD-ppc64le-Build0424-RAID0@ppc64le
osd#995630#downloads

#63 Updated by pgeorgiadis about 4 years ago

Recent example: qam-minimal-full@aarch64_maintenance just died because of this:

QEMU: usb-kbd: warning: key event queue full

Log: https://openqa.suse.de/tests/1035782/file/autoinst-log.txt
https://openqa.suse.de/tests/1035782#

#64 Updated by algraf about 4 years ago

pgeorgiadis wrote:

Recent example: qam-minimal-full@aarch64_maintenance just died because of this:

QEMU: usb-kbd: warning: key event queue full

Log: https://openqa.suse.de/tests/1035782/file/autoinst-log.txt
https://openqa.suse.de/tests/1035782#

Is this using the latest SP3 QEMU yet?

#65 Updated by algraf about 4 years ago

algraf wrote:

Is this using the latest SP3 QEMU yet?

Oh I just saw this is maintenance, so it's running on the old SP2 OpenQA system?

#66 Updated by okurz about 4 years ago

yes, the host is overdrive2.

#67 Updated by szarate over 3 years ago

  • Status changed from In Progress to Resolved

I think this hasnt been seen for quite a while. Overdrive2 is not even doing any job fro a while.

#68 Updated by okurz about 2 years ago

  • Related to action #38777: [tools][ppc64le] QEMU: usb-kbd: warning: key event queue full added

Also available in: Atom PDF