action #65109

open

[qe-core] Dynamically adjust test waits based on load

Added by dzedro about 4 years ago. Updated about 1 month ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
Enhancement to existing tests
Target version:
Start date:
Due date:
% Done:

0%

Estimated time:
Difficulty:

Description

There are several aspects of the current tests that could be improved.

One option is decreasing I/O load on workers: snapshots on the qemu backend are taken basically everywhere, even where they don't make much sense, e.g. small jobs (one or a few test modules), multimachine scenarios and installations.

I have added QEMU_DISABLE_SNAPSHOTS=1 to most QAM suites. These are installations with follow-up tests, but a failure is mostly some random timeout, typing, etc. issue, and the test is restarted anyway. It could make sense to disable snapshots by default on suites where a snapshot is a waste of I/O; exceptions could override it with QEMU_DISABLE_SNAPSHOTS=0.

Review testapi, remove or fix inconsistent functions, and add functions which are widely used in combination.
E.g. wait_screen_change is widely used and fails where it is expected that the function will actually wait; a wait can mean more than just a screen change, and in general the function may wait zero seconds. That is why I don't use it: when you add a wait, mostly in GUI/ncurses tests, you expect it to wait some minimal time, not zero.

I guess there is some screen change like a button animation, but that does not mean the action is finished and that the next step can safely continue.
Exactly for this reason I prefer wait_still_screen, as you can define min/max wait times based on screen activity.

send_key is often used together with wait_still_screen or wait_screen_change, even though wait_screen_change is available as a parameter of send_key to wait after the key press. I guess most people don't know that send_key has this parameter, and it is still questionable whether it is a sufficient wait function.

Btw I was thinking that the parameter was not working, since it used wait_idle, which became deprecated. In one test, send_key_and_wait was created [1]; wait_still_screen could become a parameter of send_key, either replacing wait_screen_change or being a second parameter next to it.
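The difference between the two waits discussed above can be illustrated with a short snippet (a sketch only; the exact parameter names, in particular the send_key option, are assumptions based on this discussion, not a definitive testapi reference):

```perl
# Common pattern today: wait explicitly after the key press
send_key 'alt-n';
wait_still_screen 2;    # continue once the screen has been still for 2 s

# wait_screen_change returns as soon as *any* change is seen,
# e.g. a button animation, so the effective wait can be near zero
wait_screen_change { send_key 'alt-n' };

# send_key also takes a wait option after the key press,
# which few people seem to know about (option name assumed here)
send_key 'alt-n', wait_screen_change => 1;
```

wait_still_screen is the safer idiom for slow GUI/ncurses dialogs, since it guarantees a minimal quiet period instead of firing on the first pixel change.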

Acceptance criteria

  • AC1: Unify implementation of max_interval/wait_still_screen/wait_screen_change for send_key and other testapi functions.
  • AC2: Reduce I/O overhead proactively

Suggestions

  • AC1:
    • Extend send_key with a wait_still_screen option
    • Refactor wait logic for better re-use
  • AC2:
    • Enable and disable QEMU_DISABLE_SNAPSHOTS systematically
    • Evaluate resource usage e.g. CPU load, I/O causing timeouts

[1] https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/tests/yast2_gui/yast2_instserver.pm#L27

Actions #1

Updated by livdywan about 4 years ago

Note for the benefit of drive-by readers: This is based on previous notes, and will need some refinement. I intend to identify some actionable points from this.

Actions #2

Updated by livdywan about 4 years ago

  • Description updated (diff)
Actions #3

Updated by livdywan about 4 years ago

dzedro wrote:

send_key is often used together with wait_still_screen or wait_screen_change, even though wait_screen_change is available as a parameter of send_key to wait after the key press. I guess most people don't know that send_key has this parameter, and it is still questionable whether it is a sufficient wait function.

Btw I was thinking that the parameter was not working, since it used wait_idle, which became deprecated. In one test, send_key_and_wait was created [1]; wait_still_screen could become a parameter of send_key, either replacing wait_screen_change or being a second parameter next to it.

[1] https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/tests/yast2_gui/yast2_instserver.pm#L27

So the current version of send_key_and_wait($key, $stilltime, $timeout //= 5) is basically send_key $key; wait_still_screen $stilltime, $timeout;. On average it's called with a $stilltime of 2 seconds... so I see two attractive options:

  1. Extending send_key with a new $wait_still_screen option, like type_string/type_password, would make sense for consistency/simplicity
  2. Adding send_key_still_screen which defaults to 2 seconds stilltime and 5 seconds timeout.

To further put this to the test, we could merge all calls of the send_key_and_wait function and see if that works well.
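A minimal sketch of option 1 (names and defaults are hypothetical, loosely modeled on send_key_and_wait as quoted above; not the actual testapi implementation):

```perl
# Hypothetical wrapper showing how send_key could absorb the
# wait_still_screen behaviour.
sub send_key_with_wait {
    my ($key, %args) = @_;
    my $still   = $args{wait_still_screen} // 0;   # opt-in, old behaviour by default
    my $timeout = $args{still_timeout}     // 5;
    send_key($key);
    wait_still_screen($still, $timeout) if $still;
}

# so that
#   send_key_with_wait 'alt-n', wait_still_screen => 2;
# behaves like the current
#   send_key 'alt-n'; wait_still_screen 2, 5;
```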

Actions #4

Updated by okurz about 4 years ago

@dzedro should we move this to "openQA project" or is your intention to discuss this within the scope of "test maintainers" mainly?

I think you are raising many important points and it is good to see that you care about general performance and stability which I only see in a very limited scope among users of openQA.

In general I think we have a lot of potential for improvement, as we are mainly using a system from around 2013-2015 and there have been a lot of improvements since then, mainly in the domain of virtualized environments, e.g. newer qemu features, virtio, qcow tuning parameters, the filesystems from which machines are started, etc.

Actions #5

Updated by livdywan about 4 years ago

dzedro wrote:

One option is decreasing I/O load on workers: snapshots on the qemu backend are taken basically everywhere, even where they don't make much sense, e.g. small jobs (one or a few test modules), multimachine scenarios and installations.

I have added QEMU_DISABLE_SNAPSHOTS=1 to most QAM suites. These are installations with follow-up tests, but a failure is mostly some random timeout, typing, etc. issue, and the test is restarted anyway. It could make sense to disable snapshots by default on suites where a snapshot is a waste of I/O; exceptions could override it with QEMU_DISABLE_SNAPSHOTS=0.

Maybe the setting should be moved to a different layer? The job group, not the test, for instance.

Actions #6

Updated by dzedro about 4 years ago

okurz wrote:

@dzedro should we move this to "openQA project" or is your intention to discuss this within the scope of "test maintainers" mainly?

I think you are raising many important points and it is good to see that you care about general performance and stability which I only see in a very limited scope among users of openQA.

In general I think we have a lot of potential for improvement, as we are mainly using a system from around 2013-2015 and there have been a lot of improvements since then, mainly in the domain of virtualized environments, e.g. newer qemu features, virtio, qcow tuning parameters, the filesystems from which machines are started, etc.

Anybody is welcome to discuss such a change.

@cdywan I would vote for option 1, extending the existing send_key.
I was disabling it in specific tests where snapshots are pointless; when you have a big group and don't know exactly where you want to have snapshots, then per test is the way to go.

Actions #7

Updated by livdywan almost 4 years ago

  • Status changed from New to In Progress
  • Target version set to Current Sprint

dzedro wrote:

@cdywan I would vote for option 1, extending the existing send_key.

https://github.com/os-autoinst/os-autoinst/pull/1418

Actions #8

Updated by livdywan almost 4 years ago

  • Status changed from In Progress to Feedback

I closed the above PR for now since I don't necessarily want to address just the obvious symptoms, but it already helped inspire some interesting conversations.

Actions #9

Updated by livdywan almost 4 years ago

  • Description updated (diff)
Actions #10

Updated by livdywan over 3 years ago

  • Default timeout based on (over)load
    • Query host load data (not VM guest/ worker)
    • (e.g. load average: 2.48, 1.90, 1.91)
    • 8 CPUs, 16 tasks loaded => timeout of 8 seconds
    • timeout += (load/cpu) > 1 ? (load / cpu) * seconds(=1) : 0
    • Swappiness (free, Perl API?)
    • timeout += swap ? (seconds=1) : 0
    • I/O load (iostat, Perl API?)
    • timeout += ??
    • Networking
    • useful? may not be relevant to our setups
  • Wait on ... instead of seconds
    • cpu idle cycles (context switches?)
    • disk sync
  • Conceive a test with artificially busy processes
    • HUNGRY_RAM=n (malloc(1024*1024*1024);)
    • HUNGRY_PROCESSES=n (yes >/dev/null)
    • HUNGRY_DISKS=n
    • artificially hungry node app
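The load-based part of the notes above can be sketched as a small helper (thresholds and weights are assumptions taken from the bullet points, not measured values):

```perl
# Sketch: derive a timeout from host load following the formula above,
#   timeout += (load/cpu) > 1 ? (load / cpu) * seconds(=1) : 0
sub dynamic_timeout {
    my (%args)  = @_;
    my $base    = $args{base} // 8;           # base timeout in seconds
    my $ratio   = $args{load} / $args{cpus};  # 1-minute load per CPU
    my $timeout = $base;
    $timeout += $ratio * 1 if $ratio > 1;     # penalty when overloaded
    $timeout += 1 if $args{swap_used};        # extra second when swapping
    return $timeout;
}

# e.g. 8 CPUs with a 1-minute load of 16 gives a ratio of 2,
# hence a timeout of 8 + 2 = 10 seconds (plus 1 if swapping)
```

On a real worker the inputs could come from /proc/loadavg and free; the I/O part (iostat in the notes) would need its own weight, which the notes leave open.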
Actions #11

Updated by livdywan over 3 years ago

  • Subject changed from [tools] backend setting & API improvement to Dynamically adjust test waits based on load
  • Status changed from Feedback to Workable
  • Assignee deleted (livdywan)
Actions #12

Updated by okurz over 3 years ago

  • Target version changed from Current Sprint to future

With your update I think some useful ideas have been collected. I will remove the ticket from the current SUSE QA Tools team backlog, so this is left for another team to pick up. In general I think the ideas in this ticket are more important than most test extensions that add even more test coverage, which gets hard to maintain.

Actions #13

Updated by okurz over 3 years ago

  • Assignee set to tjyrinki_suse

@tjyrinki_suse I think this is one of many tickets regarding general "improvement" of openQA tests so could be something for "QE Core" hence assigning to you. As this is just one tiny example of many many other cases I would be happy to also discuss this with you in a synchronous meeting between "QE Tools" and "QE Core" PO if you are interested.

Actions #14

Updated by tjyrinki_suse about 3 years ago

  • Subject changed from Dynamically adjust test waits based on load to [qe-core] Dynamically adjust test waits based on load
  • Start date deleted (2020-03-31)
Actions #15

Updated by tjyrinki_suse about 3 years ago

  • Assignee deleted (tjyrinki_suse)
Actions #16

Updated by apappas about 2 years ago

  • Assignee set to tjyrinki_suse

This ticket was discussed in the qe-core refinement session.

We assign it to you, Timo, as it's up to the PO to either reject this or organize the discussion in a synchronous meeting between "QE Tools" and "QE Core", as suggested above.

Actions #17

Updated by tjyrinki_suse about 2 years ago

  • Status changed from Workable to New
  • Assignee deleted (tjyrinki_suse)
  • Priority changed from Normal to High

Back to backlog as I'm now away from QE Core. I think this could be a candidate for the post-SP4 breathing period, so putting High priority.

Actions #18

Updated by szarate almost 2 years ago

These tickets are not on high prio

Actions #19

Updated by szarate almost 2 years ago

  • Tags set to bulkupdate

These tickets are not on high prio

Actions #20

Updated by szarate almost 2 years ago

  • Priority changed from High to Normal
Actions #21

Updated by slo-gin about 1 month ago

This ticket was set to Normal priority but was not updated within the SLO period. Please consider picking up this ticket or just set the ticket to the next lower priority.

