action #121045
closedPerformance regression of `os-autoinst` size:M
0%
Description
Observation¶
I have noticed significant performance regression on aarch64 since few days which leads to some test failures.
On o3, this is visible on MicroOS-Image
and MicroOS-Image-ContainerHost
. For example, see history of https://openqa.opensuse.org/tests/2912766#next_previous
I can point to os-autoinst
since on my local aarch64 Tumbleweed machine I perform some tests and the result is:
os-autoinst
at git ID61d5b38
(https://github.com/os-autoinst/os-autoinst/commit/61d5b38765b5fcecc7006083a5eb9a654b5b9b5f) works fine (Current Tumbleweed version)- But
os-autoinst
at git IDd36887c
(https://github.com/os-autoinst/os-autoinst/commit/d36887c63b6452d3f88356438cf55ecfab763a51) is BROKEN (Was the version fromDevel:openQA
repo when I installed it)
For my tests, I cloned locally https://openqa.opensuse.org/tests/2912766 and the test passes with os-autoinst
at git ID 61d5b38
but os-autoinst
at git ID d36887c
fails in disk_boot
as it fails to "see" the grub menu.
Acceptance criteria¶
- AC1: Ensure there is no performance regression
Suggestions¶
- Investigate sporadic excessive select calls
Updated by ggardet_arm about 2 years ago
os-autoinst
at commit 1f2f0a2
is also BROKEN https://github.com/os-autoinst/os-autoinst/commit/1f2f0a2038c9c0cfc918a16ed1d313516fca93da
Updated by ggardet_arm about 2 years ago
os-autoinst
at commit 4c5642d
works fine https://github.com/os-autoinst/os-autoinst/commit/4c5642d
So, it points to https://github.com/os-autoinst/os-autoinst/pull/2216
Updated by ggardet_arm about 2 years ago
One difference I could spot by comparing the os-autoinst.log
files is that GOT GO
line is:
- after
QEMU: QEMU emulator version...
andQEMU: Copyright (c)...
lines when it works - before
QEMU: QEMU emulator version...
andQEMU: Copyright (c)...
lines when it is broken
Updated by mkittler about 2 years ago
- Assignee set to mkittler
That's an interesting observation. The That's not it. It looks like the backend process is the one consuming more CPU time.GOT GO
line must after the backend starts because only then its fd will be correctly used when polling via select
. Likely now the select
call always returns immediately making the isotovideo
loop spin much faster. I'll have a closer look.
Updated by mkittler about 2 years ago
- Status changed from New to In Progress
For now it is likely best to revert the PR: https://github.com/os-autoinst/os-autoinst/pull/2221
Updated by mkittler about 2 years ago
I've been adding some debugging code to see what handles let the backend loop spin:
bmwqemu::diag("read set: " . join(", ", map { $self->{select_read}->get_name($_) } @$read_set));
bmwqemu::diag("write set: " . join(", ", map { $self->{select_write}->get_name($_) } @$write_set));
bmwqemu::diag("e set: " . join(", ", map { $self->{read_write}->get_name($_) } @$e_set));
Then tail -f /hdd/openqa-devel/openqa/pool/1/autoinst-log.txt
looks like this:
[2022-11-28T17:36:09.947088+01:00] [debug] read set:
[2022-11-28T17:36:09.947171+01:00] [debug] write set:
[2022-11-28T17:36:09.947215+01:00] [debug] e set:
[2022-11-28T17:36:09.981551+01:00] [debug] read set:
[2022-11-28T17:36:09.981719+01:00] [debug] write set:
[2022-11-28T17:36:09.981817+01:00] [debug] e set:
[2022-11-28T17:36:10.004369+01:00] [debug] read set:
[2022-11-28T17:36:10.004538+01:00] [debug] write set:
[2022-11-28T17:36:10.004683+01:00] [debug] e set:
[2022-11-28T17:36:10.016215+01:00] [debug] read set:
[2022-11-28T17:36:10.016347+01:00] [debug] write set:
[2022-11-28T17:36:10.016392+01:00] [debug] e set:
[2022-11-28T17:36:10.050502+01:00] [debug] read set:
[2022-11-28T17:36:10.050675+01:00] [debug] write set:
[2022-11-28T17:36:10.050828+01:00] [debug] e set:
[2022-11-28T17:36:10.084859+01:00] [debug] read set:
[2022-11-28T17:36:10.085061+01:00] [debug] write set:
[2022-11-28T17:36:10.085211+01:00] [debug] e set:
[2022-11-28T17:36:10.119174+01:00] [debug] read set:
[2022-11-28T17:36:10.119400+01:00] [debug] write set:
[2022-11-28T17:36:10.119561+01:00] [debug] e set:
[2022-11-28T17:36:10.153774+01:00] [debug] read set:
[2022-11-28T17:36:10.153960+01:00] [debug] write set:
[2022-11-28T17:36:10.154071+01:00] [debug] e set:
[2022-11-28T17:36:10.188218+01:00] [debug] read set:
[2022-11-28T17:36:10.188470+01:00] [debug] write set:
[2022-11-28T17:36:10.188642+01:00] [debug] e set:
[2022-11-28T17:36:10.223027+01:00] [debug] read set:
[2022-11-28T17:36:10.223238+01:00] [debug] write set:
[2022-11-28T17:36:10.223374+01:00] [debug] e set:
[2022-11-28T17:36:10.255226+01:00] [debug] read set:
[2022-11-28T17:36:10.255451+01:00] [debug] write set:
[2022-11-28T17:36:10.255618+01:00] [debug] e set:
[2022-11-28T17:36:10.257666+01:00] [debug] read set:
[2022-11-28T17:36:10.257840+01:00] [debug] write set:
[2022-11-28T17:36:10.257976+01:00] [debug] e set:
[2022-11-28T17:36:10.291883+01:00] [debug] read set:
[2022-11-28T17:36:10.292056+01:00] [debug] write set:
[2022-11-28T17:36:10.292159+01:00] [debug] e set:
So the select returns very quickly with just empty sets. That seems wrong and likely causes the high CPU usage. Only occasionally the sets aren't empty:
[2022-11-28T17:36:10.293609+01:00] [debug] read set: baseclass::cmdpipe
[2022-11-28T17:36:10.293705+01:00] [debug] write set:
[2022-11-28T17:36:10.293785+01:00] [debug] e set:
[2022-11-28T17:36:10.315112+01:00] [debug] read set:
[2022-11-28T17:36:10.315262+01:00] [debug] write set: baseclass::encoder_pipe, baseclass::external_video_encoder_cmd_pipe
[2022-11-28T17:36:10.315340+01:00] [debug] e set:
[2022-11-28T17:36:10.433268+01:00] [debug] read set:
[2022-11-28T17:36:10.433412+01:00] [debug] write set: baseclass::encoder_pipe
I've also added logging for the timeout which at some point of the test execution is computed to Inf
:
…
[2022-11-28T17:45:37.557189+01:00] [debug] timeout: 0.999979019165039
[2022-11-28T17:45:37.557332+01:00] [debug] read set:
[2022-11-28T17:45:37.557515+01:00] [debug] write set: baseclass::encoder_pipe, baseclass::external_video_encoder_cmd_pipe
[2022-11-28T17:45:37.557586+01:00] [debug] e set:
[2022-11-28T17:45:37.619011+01:00] [debug] timeout: 0.935184955596924
[2022-11-28T17:45:37.619140+01:00] [debug] read set:
[2022-11-28T17:45:37.619266+01:00] [debug] write set: baseclass::encoder_pipe
[2022-11-28T17:45:37.619356+01:00] [debug] e set:
[2022-11-28T17:45:37.804401+01:00] [debug] timeout: 0.933241128921509
[2022-11-28T17:45:37.804548+01:00] [debug] read set:
[2022-11-28T17:45:37.804642+01:00] [debug] write set:
[2022-11-28T17:45:37.804767+01:00] [debug] e set:
[2022-11-28T17:45:38.060613+01:00] [debug] timeout: 0.749186038970947
[2022-11-28T17:45:38.060775+01:00] [debug] read set:
[2022-11-28T17:45:38.060882+01:00] [debug] write set:
[2022-11-28T17:45:38.061037+01:00] [debug] e set:
[2022-11-28T17:45:38.062793+01:00] [debug] timeout: 0.492897987365723
[2022-11-28T17:45:38.062943+01:00] [debug] read set:
[2022-11-28T17:45:38.063042+01:00] [debug] write set: baseclass::encoder_pipe, baseclass::external_video_encoder_cmd_pipe
[2022-11-28T17:45:38.063135+01:00] [debug] e set:
[2022-11-28T17:45:38.311525+01:00] [debug] timeout: 0.425005912780762
[2022-11-28T17:45:38.311678+01:00] [debug] read set:
[2022-11-28T17:45:38.311803+01:00] [debug] write set:
[2022-11-28T17:45:38.311894+01:00] [debug] e set:
[2022-11-28T17:45:38.554664+01:00] [debug] timeout: 0.242048978805542
[2022-11-28T17:45:38.554809+01:00] [debug] read set:
[2022-11-28T17:45:38.554915+01:00] [debug] write set:
[2022-11-28T17:45:38.555031+01:00] [debug] e set:
[2022-11-28T17:45:38.561353+01:00] [debug] timeout: Inf
[2022-11-28T17:45:38.561474+01:00] [debug] read set:
[2022-11-28T17:45:38.561613+01:00] [debug] write set:
[2022-11-28T17:45:38.561685+01:00] [debug] e set:
[2022-11-28T17:45:38.563226+01:00] [debug] timeout: Inf
[2022-11-28T17:45:38.563388+01:00] [debug] read set:
[2022-11-28T17:45:38.563526+01:00] [debug] write set: baseclass::encoder_pipe, baseclass::external_video_encoder_cmd_pipe
[2022-11-28T17:45:38.563657+01:00] [debug] e set:
[2022-11-28T17:45:39.062534+01:00] [debug] timeout: Inf
[2022-11-28T17:45:39.062644+01:00] [debug] read set:
[2022-11-28T17:45:39.062726+01:00] [debug] write set:
[2022-11-28T17:45:39.062797+01:00] [debug] e set:
…
[2022-11-28T17:51:09.353237+01:00] [debug] timeout: Inf
[2022-11-28T17:51:09.353437+01:00] [debug] read set:
[2022-11-28T17:51:09.353659+01:00] [debug] write set:
[2022-11-28T17:51:09.353816+01:00] [debug] e set:
Not sure whether that's normal. I suppose I'll have to re-conduct all investigations also without the problematic change to see what's actually normal and what not.
Updated by openqa_review about 2 years ago
- Due date set to 2022-12-13
Setting due date based on mean cycle time of SUSE QE Tools
Updated by mkittler about 2 years ago
With the PR reverted the select
also returns with empty sets and the timeout is computed to infinity:
[2022-11-30T15:16:48.838931+01:00] [debug] timeout: Inf
[2022-11-30T15:16:48.839063+01:00] [debug] read set:
[2022-11-30T15:16:48.839200+01:00] [debug] write set: baseclass::encoder_pipe, baseclass::external_video_encoder_cmd_pipe
[2022-11-30T15:16:48.839280+01:00] [debug] e set:
[2022-11-30T15:16:49.087809+01:00] [debug] timeout: Inf
[2022-11-30T15:16:49.088003+01:00] [debug] read set:
[2022-11-30T15:16:49.088154+01:00] [debug] write set:
[2022-11-30T15:16:49.088301+01:00] [debug] e set:
[2022-11-30T15:16:49.345798+01:00] [debug] timeout: Inf
[2022-11-30T15:16:49.346663+01:00] [debug] read set:
[2022-11-30T15:16:49.346891+01:00] [debug] write set:
[2022-11-30T15:16:49.347119+01:00] [debug] e set:
[2022-11-30T15:16:49.348873+01:00] [debug] timeout: Inf
[2022-11-30T15:16:49.348989+01:00] [debug] read set:
[2022-11-30T15:16:49.349205+01:00] [debug] write set: baseclass::encoder_pipe, baseclass::external_video_encoder_cmd_pipe
[2022-11-30T15:16:49.349314+01:00] [debug] e set:
[2022-11-30T15:16:49.648401+01:00] [debug] timeout: Inf
[2022-11-30T15:16:49.648568+01:00] [debug] read set: baseclass::cmdpipe
[2022-11-30T15:16:49.648660+01:00] [debug] write set:
[2022-11-30T15:16:49.648714+01:00] [debug] e set:
[2022-11-30T15:16:50.070618+01:00] [debug] no match: -0.3s, best candidate: packages-section-selected-20220222 (0.83)
[2022-11-30T15:16:50.146970+01:00] [debug] timeout: Inf
[2022-11-30T15:16:50.147042+01:00] [debug] read set:
[2022-11-30T15:16:50.147140+01:00] [debug] write set: baseclass::encoder_pipe, baseclass::external_video_encoder_cmd_pipe
[2022-11-30T15:16:50.147194+01:00] [debug] e set:
However, the frequency in which it happens isn't as high. It also doesn't happen as often in a row. This difference in behavior is nevertheless likely just a symptom, though.
Updated by tinita about 2 years ago
So far I also wasn't able to see something suspicious in the code change.
The only thing I realized is that the ->on(tests_done => ...)
is removed: https://github.com/os-autoinst/os-autoinst/pull/2216/files#diff-a0c7c8ca335c30440a2ce90775102e5adbfaf80f2001231562fda076bdfc0093L261
But we still execute ->emit(tests_done ...)
later.
But that can't really be related to this issue, and if there is no event handler for an event, it doesn't complain about it.
I ran a test locally on my laptop twice, with current master and with the PR, and I can't see significant changes in the autoinst-log.txt
Also I can't see any significant difference in the autoinst-log.txt of https://openqa.opensuse.org/tests/2912766/ and https://openqa.opensuse.org/tests/2890492
I would need to take one of the aarch64 workers out and add some debugging there.
Updated by tinita about 2 years ago
I actually can't use an aarch64 worker because I can't ssh to any of them.
So I'm really lost here. To further debug, I need a machine which shows the symptom.
@ggardet_arm any idea?
Updated by ggardet_arm about 2 years ago
tinita wrote:
I actually can't use an aarch64 worker because I can't ssh to any of them.
So I'm really lost here. To further debug, I need a machine which shows the symptom.
@ggardet_arm any idea?
You can likely reach openqa-aarch64
machine on o3, right?
Updated by tinita about 2 years ago
ggardet_arm wrote:
You can likely reach
openqa-aarch64
machine on o3, right?
After we found out the IP and the "real" hostname, yeah. But before I only got:
ssh: Could not resolve hostname openqa-aarch64: Name or service not known
So I now took the worker out and will try to add some debugging code.
Updated by tinita about 2 years ago
In the end I was not able to reproduce performance problems.
First I thought I found something when the load was going up to 7 during a test, but I later also saw that on master.
I now conducted 10 tests each for master and PR 2216.
Master¶
- https://openqa.opensuse.org/tests/2929255 disk_boot: 2:29
- https://openqa.opensuse.org/tests/2929329 disk_boot: 2:26
- https://openqa.opensuse.org/tests/2929310 disk_boot: 2:33
- https://openqa.opensuse.org/tests/2929313 disk_boot: 2:23
- https://openqa.opensuse.org/tests/2929327 disk_boot: 2:29
- https://openqa.opensuse.org/tests/2929328 disk_boot: 2:28
- https://openqa.opensuse.org/tests/2929309 fail
- https://openqa.opensuse.org/tests/2929311 fail
- https://openqa.opensuse.org/tests/2929321 fail
- https://openqa.opensuse.org/tests/2929326 fail
PR 2216¶
- https://openqa.opensuse.org/tests/2929242 disk_boot: 2:30
- https://openqa.opensuse.org/tests/2929272 disk_boot: 2:29
- https://openqa.opensuse.org/tests/2929276 disk_boot: 2:25
- https://openqa.opensuse.org/tests/2929315 disk_boot: 2:33
- https://openqa.opensuse.org/tests/2929316 disk_boot: 2:26
- https://openqa.opensuse.org/tests/2929317 disk_boot: 2:22
- https://openqa.opensuse.org/tests/2929320 disk_boot: 2:40
- https://openqa.opensuse.org/tests/2929330 disk_boot: 2:25
- https://openqa.opensuse.org/tests/2929256 fail
- https://openqa.opensuse.org/tests/2929318 fail
I also ran top
permanently and watched the load closely. There were no significant changes between the two code branches. The load was around 1.5-2.2 during the main part of the test, I would say.
@ggardet_arm How many test runs did you do and which test did you run? Did I maybe chose the wrong test?
The patched files from PR 2216 are still lying on openqa-aarch64 in the /root folder (isotovideo-patched
, CommandHandler.pm-patched
), but I will start the regular services again on that machine soon. Let me know if you have more ideas to reproduce.
Updated by ggardet_arm about 2 years ago
I used MicroOS-Image
and MicroOS-Image-ContainerHost
from Tumbleweed to perform my tests. I manually triggered a number of jobs, but cannot tell you the exact number, sorry.
IIRC, the problem was more obvious on other aarch64 machines but I cannot provide access to them, sorry.
Updated by okurz about 2 years ago
Discussed in daily 2022-12-06. We see three tasks that we could do as we could not reproduce the issue today:
- Bring in parts of the original PR https://github.com/os-autoinst/os-autoinst/pull/2216 to see if they have any impact on production. At best ggardet_arm can check on machines that were affected with hot-patching before merging
- Create a setup that can reproduce the issue in cloud ourselves with help from ggardet_arm
- ggardet_arm needs to debug and fix himself as we can't reproduce but we give hints and suggestions
Updated by tinita about 2 years ago
I prepared a very very small PR: https://github.com/os-autoinst/os-autoinst/pull/2227
But apparently it was too small for some :)
So now I made this big PR: https://github.com/os-autoinst/os-autoinst/pull/2228 which leaves isotovideo with around 150 lines (without pod).
It is divided into 16 commits.
I asked @ggardet_arm if he can test it.
Updated by tinita about 2 years ago
- Status changed from In Progress to Feedback
Tested 2228 successfully locally and now waiting for @ggardet_arm
Updated by ggardet_arm about 2 years ago
I just updated ip-10-252-32-90
and ip-10-252-32-98:8
workers with os-autoinst
from https://github.com/os-autoinst/os-autoinst/pull/2228
Updated by livdywan about 2 years ago
From worker ip-10-252-32-90:2 looking at one of the jobs seems to show a lot of messages about too many candidates needles, making it very slow even before it eventually fails. I'm finding it difficult to assess if this is at all related to Tina's changes. It doesn't seem like it? But I also don't really see any passing runs before it.
Updated by tinita about 2 years ago
- Due date changed from 2022-12-13 to 2022-12-19
Bumping due date as we are still checking the outcome of the PR
Updated by livdywan about 2 years ago
cdywan wrote:
From worker ip-10-252-32-90:2 looking at one of the jobs seems to show a lot of messages about too many candidates needles, making it very slow even before it eventually fails. I'm finding it difficult to assess if this is at all related to Tina's changes.
All the recent jobs seem to be failing. Failures don't look related to the suspected regression but rather network-related.
@ggardet_arm Can you comment on that? A passing job for reference would be helpful. And to be able to see if it's slower than expected.
Updated by ggardet_arm about 2 years ago
cdywan wrote:
cdywan wrote:
From worker ip-10-252-32-90:2 looking at one of the jobs seems to show a lot of messages about too many candidates needles, making it very slow even before it eventually fails. I'm finding it difficult to assess if this is at all related to Tina's changes.
All the recent jobs seem to be failing. Failures don't look related to the suspected regression but rather network-related.
@ggardet_arm Can you comment on that? A passing job for reference would be helpful. And to be able to see if it's slower than expected.
The network related issue is tracked here: https://progress.opensuse.org/issues/122029
I think the current Tina's PR does not bring any significant perf regression. At least I couldn't reproduce it.
Updated by livdywan about 2 years ago
- Due date changed from 2022-12-19 to 2023-01-06
Yay. That means we can go ahead with the branch. Perhaps we should aim to merge first thing in the new year.
Updated by okurz about 2 years ago
- Due date changed from 2023-01-06 to 2023-01-20
christmas grace due date bump :)
Updated by tinita almost 2 years ago
Updated by tinita almost 2 years ago
https://github.com/os-autoinst/os-autoinst/pull/2228 merged, let's observe for a while
Updated by tinita almost 2 years ago
- Status changed from Feedback to Resolved
This looks like it is resolved. Please reopen if you observe failures again.