action #121045: Performance regression of `os-autoinst` size:M - openQA Project (public) - openSUSE Project Management Tool

Actions

Copy link

action #121045

closed

Performance regression of `os-autoinst` size:M

Added by ggardet_arm over 2 years ago. Updated over 2 years ago.

Status:

Resolved

Priority:

High

Assignee:

tinita

Category:

Regressions/Crashes

Target version:

Ready

Start date:

2022-11-28

Due date:

2023-01-20

% Done:

Estimated time:

Tags:

reactive work

Description

Observation¶

I have noticed significant performance regression on aarch64 since few days which leads to some test failures.
On o3, this is visible on MicroOS-Image and MicroOS-Image-ContainerHost. For example, see history of https://openqa.opensuse.org/tests/2912766#next_previous

I can point to os-autoinst since on my local aarch64 Tumbleweed machine I perform some tests and the result is:

os-autoinst at git ID 61d5b38 (https://github.com/os-autoinst/os-autoinst/commit/61d5b38765b5fcecc7006083a5eb9a654b5b9b5f) works fine (Current Tumbleweed version)
But os-autoinst at git ID d36887c (https://github.com/os-autoinst/os-autoinst/commit/d36887c63b6452d3f88356438cf55ecfab763a51) is BROKEN (Was the version from Devel:openQA repo when I installed it)

For my tests, I cloned locally https://openqa.opensuse.org/tests/2912766 and the test passes with os-autoinst at git ID 61d5b38 but os-autoinst at git ID d36887c fails in disk_boot as it fails to "see" the grub menu.

Acceptance criteria¶

AC1: Ensure there is no performance regression

Suggestions¶

Investigate sporadic excessive select calls

Related issues 1 (0 open — 1 closed)

Actions

Copy link

Updated by ggardet_arm over 2 years ago

Priority changed from Normal to High

Actions

Copy link

Updated by ggardet_arm over 2 years ago

os-autoinst at commit 1f2f0a2 is also BROKEN https://github.com/os-autoinst/os-autoinst/commit/1f2f0a2038c9c0cfc918a16ed1d313516fca93da

Actions

Copy link

Updated by ggardet_arm over 2 years ago

os-autoinst at commit 4c5642d works fine https://github.com/os-autoinst/os-autoinst/commit/4c5642d

So, it points to https://github.com/os-autoinst/os-autoinst/pull/2216

Actions

Copy link

Updated by ggardet_arm over 2 years ago

One difference I could spot by comparing the os-autoinst.log files is that GOT GO line is:

after QEMU: QEMU emulator version... and QEMU: Copyright (c)... lines when it works
before QEMU: QEMU emulator version... and QEMU: Copyright (c)... lines when it is broken

Actions

Copy link

Updated by mkittler over 2 years ago

Assignee set to mkittler

That's an interesting observation. The GOT GO line must after the backend starts because only then its fd will be correctly used when polling via select. Likely now the select call always returns immediately making the isotovideo loop spin much faster. I'll have a closer look. That's not it. It looks like the backend process is the one consuming more CPU time.

Actions

Copy link

Updated by mkittler over 2 years ago

Status changed from New to In Progress

For now it is likely best to revert the PR: https://github.com/os-autoinst/os-autoinst/pull/2221

Actions

Copy link

Updated by mkittler over 2 years ago

Category set to Regressions/Crashes
Target version set to Ready

Actions

Copy link

Updated by mkittler over 2 years ago

I've been adding some debugging code to see what handles let the backend loop spin:

bmwqemu::diag("read set: " . join(", ", map { $self->{select_read}->get_name($_) } @$read_set));
bmwqemu::diag("write set: " . join(", ", map { $self->{select_write}->get_name($_) } @$write_set));
bmwqemu::diag("e set: " . join(", ", map { $self->{read_write}->get_name($_) } @$e_set));

Then tail -f /hdd/openqa-devel/openqa/pool/1/autoinst-log.txt looks like this:

[2022-11-28T17:36:09.947088+01:00] [debug] read set: 
[2022-11-28T17:36:09.947171+01:00] [debug] write set: 
[2022-11-28T17:36:09.947215+01:00] [debug] e set: 
[2022-11-28T17:36:09.981551+01:00] [debug] read set: 
[2022-11-28T17:36:09.981719+01:00] [debug] write set: 
[2022-11-28T17:36:09.981817+01:00] [debug] e set: 
[2022-11-28T17:36:10.004369+01:00] [debug] read set: 
[2022-11-28T17:36:10.004538+01:00] [debug] write set: 
[2022-11-28T17:36:10.004683+01:00] [debug] e set: 
[2022-11-28T17:36:10.016215+01:00] [debug] read set: 
[2022-11-28T17:36:10.016347+01:00] [debug] write set: 
[2022-11-28T17:36:10.016392+01:00] [debug] e set: 
[2022-11-28T17:36:10.050502+01:00] [debug] read set: 
[2022-11-28T17:36:10.050675+01:00] [debug] write set: 
[2022-11-28T17:36:10.050828+01:00] [debug] e set: 
[2022-11-28T17:36:10.084859+01:00] [debug] read set: 
[2022-11-28T17:36:10.085061+01:00] [debug] write set: 
[2022-11-28T17:36:10.085211+01:00] [debug] e set: 
[2022-11-28T17:36:10.119174+01:00] [debug] read set: 
[2022-11-28T17:36:10.119400+01:00] [debug] write set: 
[2022-11-28T17:36:10.119561+01:00] [debug] e set: 
[2022-11-28T17:36:10.153774+01:00] [debug] read set: 
[2022-11-28T17:36:10.153960+01:00] [debug] write set: 
[2022-11-28T17:36:10.154071+01:00] [debug] e set: 
[2022-11-28T17:36:10.188218+01:00] [debug] read set: 
[2022-11-28T17:36:10.188470+01:00] [debug] write set: 
[2022-11-28T17:36:10.188642+01:00] [debug] e set: 
[2022-11-28T17:36:10.223027+01:00] [debug] read set: 
[2022-11-28T17:36:10.223238+01:00] [debug] write set: 
[2022-11-28T17:36:10.223374+01:00] [debug] e set: 
[2022-11-28T17:36:10.255226+01:00] [debug] read set: 
[2022-11-28T17:36:10.255451+01:00] [debug] write set: 
[2022-11-28T17:36:10.255618+01:00] [debug] e set: 
[2022-11-28T17:36:10.257666+01:00] [debug] read set: 
[2022-11-28T17:36:10.257840+01:00] [debug] write set: 
[2022-11-28T17:36:10.257976+01:00] [debug] e set: 
[2022-11-28T17:36:10.291883+01:00] [debug] read set: 
[2022-11-28T17:36:10.292056+01:00] [debug] write set: 
[2022-11-28T17:36:10.292159+01:00] [debug] e set:

So the select returns very quickly with just empty sets. That seems wrong and likely causes the high CPU usage. Only occasionally the sets aren't empty:

[2022-11-28T17:36:10.293609+01:00] [debug] read set: baseclass::cmdpipe
[2022-11-28T17:36:10.293705+01:00] [debug] write set: 
[2022-11-28T17:36:10.293785+01:00] [debug] e set: 
[2022-11-28T17:36:10.315112+01:00] [debug] read set: 
[2022-11-28T17:36:10.315262+01:00] [debug] write set: baseclass::encoder_pipe, baseclass::external_video_encoder_cmd_pipe
[2022-11-28T17:36:10.315340+01:00] [debug] e set: 
[2022-11-28T17:36:10.433268+01:00] [debug] read set: 
[2022-11-28T17:36:10.433412+01:00] [debug] write set: baseclass::encoder_pipe

I've also added logging for the timeout which at some point of the test execution is computed to Inf:

…
[2022-11-28T17:45:37.557189+01:00] [debug] timeout: 0.999979019165039
[2022-11-28T17:45:37.557332+01:00] [debug] read set: 
[2022-11-28T17:45:37.557515+01:00] [debug] write set: baseclass::encoder_pipe, baseclass::external_video_encoder_cmd_pipe
[2022-11-28T17:45:37.557586+01:00] [debug] e set: 
[2022-11-28T17:45:37.619011+01:00] [debug] timeout: 0.935184955596924
[2022-11-28T17:45:37.619140+01:00] [debug] read set: 
[2022-11-28T17:45:37.619266+01:00] [debug] write set: baseclass::encoder_pipe
[2022-11-28T17:45:37.619356+01:00] [debug] e set: 
[2022-11-28T17:45:37.804401+01:00] [debug] timeout: 0.933241128921509
[2022-11-28T17:45:37.804548+01:00] [debug] read set: 
[2022-11-28T17:45:37.804642+01:00] [debug] write set: 
[2022-11-28T17:45:37.804767+01:00] [debug] e set: 
[2022-11-28T17:45:38.060613+01:00] [debug] timeout: 0.749186038970947
[2022-11-28T17:45:38.060775+01:00] [debug] read set: 
[2022-11-28T17:45:38.060882+01:00] [debug] write set: 
[2022-11-28T17:45:38.061037+01:00] [debug] e set: 
[2022-11-28T17:45:38.062793+01:00] [debug] timeout: 0.492897987365723
[2022-11-28T17:45:38.062943+01:00] [debug] read set: 
[2022-11-28T17:45:38.063042+01:00] [debug] write set: baseclass::encoder_pipe, baseclass::external_video_encoder_cmd_pipe
[2022-11-28T17:45:38.063135+01:00] [debug] e set: 
[2022-11-28T17:45:38.311525+01:00] [debug] timeout: 0.425005912780762
[2022-11-28T17:45:38.311678+01:00] [debug] read set: 
[2022-11-28T17:45:38.311803+01:00] [debug] write set: 
[2022-11-28T17:45:38.311894+01:00] [debug] e set: 
[2022-11-28T17:45:38.554664+01:00] [debug] timeout: 0.242048978805542
[2022-11-28T17:45:38.554809+01:00] [debug] read set: 
[2022-11-28T17:45:38.554915+01:00] [debug] write set: 
[2022-11-28T17:45:38.555031+01:00] [debug] e set: 
[2022-11-28T17:45:38.561353+01:00] [debug] timeout: Inf
[2022-11-28T17:45:38.561474+01:00] [debug] read set: 
[2022-11-28T17:45:38.561613+01:00] [debug] write set: 
[2022-11-28T17:45:38.561685+01:00] [debug] e set: 
[2022-11-28T17:45:38.563226+01:00] [debug] timeout: Inf
[2022-11-28T17:45:38.563388+01:00] [debug] read set: 
[2022-11-28T17:45:38.563526+01:00] [debug] write set: baseclass::encoder_pipe, baseclass::external_video_encoder_cmd_pipe
[2022-11-28T17:45:38.563657+01:00] [debug] e set: 
[2022-11-28T17:45:39.062534+01:00] [debug] timeout: Inf
[2022-11-28T17:45:39.062644+01:00] [debug] read set: 
[2022-11-28T17:45:39.062726+01:00] [debug] write set: 
[2022-11-28T17:45:39.062797+01:00] [debug] e set:
…
[2022-11-28T17:51:09.353237+01:00] [debug] timeout: Inf
[2022-11-28T17:51:09.353437+01:00] [debug] read set: 
[2022-11-28T17:51:09.353659+01:00] [debug] write set: 
[2022-11-28T17:51:09.353816+01:00] [debug] e set:

Not sure whether that's normal. I suppose I'll have to re-conduct all investigations also without the problematic change to see what's actually normal and what not.

Actions

Copy link

Updated by openqa_review over 2 years ago

Due date set to 2022-12-13

Setting due date based on mean cycle time of SUSE QE Tools

Actions

Copy link

#10

Updated by mkittler over 2 years ago

With the PR reverted the select also returns with empty sets and the timeout is computed to infinity:

[2022-11-30T15:16:48.838931+01:00] [debug] timeout: Inf
[2022-11-30T15:16:48.839063+01:00] [debug] read set: 
[2022-11-30T15:16:48.839200+01:00] [debug] write set: baseclass::encoder_pipe, baseclass::external_video_encoder_cmd_pipe
[2022-11-30T15:16:48.839280+01:00] [debug] e set: 
[2022-11-30T15:16:49.087809+01:00] [debug] timeout: Inf
[2022-11-30T15:16:49.088003+01:00] [debug] read set: 
[2022-11-30T15:16:49.088154+01:00] [debug] write set: 
[2022-11-30T15:16:49.088301+01:00] [debug] e set: 
[2022-11-30T15:16:49.345798+01:00] [debug] timeout: Inf
[2022-11-30T15:16:49.346663+01:00] [debug] read set: 
[2022-11-30T15:16:49.346891+01:00] [debug] write set: 
[2022-11-30T15:16:49.347119+01:00] [debug] e set: 
[2022-11-30T15:16:49.348873+01:00] [debug] timeout: Inf
[2022-11-30T15:16:49.348989+01:00] [debug] read set: 
[2022-11-30T15:16:49.349205+01:00] [debug] write set: baseclass::encoder_pipe, baseclass::external_video_encoder_cmd_pipe
[2022-11-30T15:16:49.349314+01:00] [debug] e set: 
[2022-11-30T15:16:49.648401+01:00] [debug] timeout: Inf
[2022-11-30T15:16:49.648568+01:00] [debug] read set: baseclass::cmdpipe
[2022-11-30T15:16:49.648660+01:00] [debug] write set: 
[2022-11-30T15:16:49.648714+01:00] [debug] e set: 
[2022-11-30T15:16:50.070618+01:00] [debug] no match: -0.3s, best candidate: packages-section-selected-20220222 (0.83)
[2022-11-30T15:16:50.146970+01:00] [debug] timeout: Inf
[2022-11-30T15:16:50.147042+01:00] [debug] read set: 
[2022-11-30T15:16:50.147140+01:00] [debug] write set: baseclass::encoder_pipe, baseclass::external_video_encoder_cmd_pipe
[2022-11-30T15:16:50.147194+01:00] [debug] e set:

However, the frequency in which it happens isn't as high. It also doesn't happen as often in a row. This difference in behavior is nevertheless likely just a symptom, though.

Actions

Copy link

#11

Updated by robert.richardson over 2 years ago

Subject changed from Performance regression of `os-autoinst` to Performance regression of `os-autoinst` size:M
Description updated (diff)
Assignee changed from mkittler to tinita

Actions

Copy link

#12

Updated by tinita over 2 years ago

So far I also wasn't able to see something suspicious in the code change.
The only thing I realized is that the ->on(tests_done => ...) is removed: https://github.com/os-autoinst/os-autoinst/pull/2216/files#diff-a0c7c8ca335c30440a2ce90775102e5adbfaf80f2001231562fda076bdfc0093L261
But we still execute ->emit(tests_done ...) later.
But that can't really be related to this issue, and if there is no event handler for an event, it doesn't complain about it.

I ran a test locally on my laptop twice, with current master and with the PR, and I can't see significant changes in the autoinst-log.txt

Also I can't see any significant difference in the autoinst-log.txt of https://openqa.opensuse.org/tests/2912766/ and https://openqa.opensuse.org/tests/2890492

I would need to take one of the aarch64 workers out and add some debugging there.

Actions

Copy link

#13

Updated by tinita over 2 years ago

I actually can't use an aarch64 worker because I can't ssh to any of them.

So I'm really lost here. To further debug, I need a machine which shows the symptom.
@ggardet_arm any idea?

Actions

Copy link

#14

Updated by ggardet_arm over 2 years ago

tinita wrote:

I actually can't use an aarch64 worker because I can't ssh to any of them.

So I'm really lost here. To further debug, I need a machine which shows the symptom.
@ggardet_arm any idea?

You can likely reach openqa-aarch64 machine on o3, right?

Actions

Copy link

#15

Updated by tinita over 2 years ago

ggardet_arm wrote:

You can likely reach openqa-aarch64 machine on o3, right?

After we found out the IP and the "real" hostname, yeah. But before I only got:

ssh: Could not resolve hostname openqa-aarch64: Name or service not known

So I now took the worker out and will try to add some debugging code.

Actions

Copy link

#16

Updated by tinita over 2 years ago

In the end I was not able to reproduce performance problems.
First I thought I found something when the load was going up to 7 during a test, but I later also saw that on master.

I now conducted 10 tests each for master and PR 2216.

Master¶

https://openqa.opensuse.org/tests/2929255 disk_boot: 2:29
https://openqa.opensuse.org/tests/2929329 disk_boot: 2:26
https://openqa.opensuse.org/tests/2929310 disk_boot: 2:33
https://openqa.opensuse.org/tests/2929313 disk_boot: 2:23
https://openqa.opensuse.org/tests/2929327 disk_boot: 2:29
https://openqa.opensuse.org/tests/2929328 disk_boot: 2:28
https://openqa.opensuse.org/tests/2929309 fail
https://openqa.opensuse.org/tests/2929311 fail
https://openqa.opensuse.org/tests/2929321 fail
https://openqa.opensuse.org/tests/2929326 fail

PR 2216¶

https://openqa.opensuse.org/tests/2929242 disk_boot: 2:30
https://openqa.opensuse.org/tests/2929272 disk_boot: 2:29
https://openqa.opensuse.org/tests/2929276 disk_boot: 2:25
https://openqa.opensuse.org/tests/2929315 disk_boot: 2:33
https://openqa.opensuse.org/tests/2929316 disk_boot: 2:26
https://openqa.opensuse.org/tests/2929317 disk_boot: 2:22
https://openqa.opensuse.org/tests/2929320 disk_boot: 2:40
https://openqa.opensuse.org/tests/2929330 disk_boot: 2:25
https://openqa.opensuse.org/tests/2929256 fail
https://openqa.opensuse.org/tests/2929318 fail

I also ran top permanently and watched the load closely. There were no significant changes between the two code branches. The load was around 1.5-2.2 during the main part of the test, I would say.

@ggardet_arm How many test runs did you do and which test did you run? Did I maybe chose the wrong test?

The patched files from PR 2216 are still lying on openqa-aarch64 in the /root folder (isotovideo-patched, CommandHandler.pm-patched), but I will start the regular services again on that machine soon. Let me know if you have more ideas to reproduce.

Actions

Copy link

#17

Updated by ggardet_arm over 2 years ago

I used MicroOS-Image and MicroOS-Image-ContainerHost from Tumbleweed to perform my tests. I manually triggered a number of jobs, but cannot tell you the exact number, sorry.

IIRC, the problem was more obvious on other aarch64 machines but I cannot provide access to them, sorry.

Actions

Copy link

#18

Updated by okurz over 2 years ago

Discussed in daily 2022-12-06. We see three tasks that we could do as we could not reproduce the issue today:

Bring in parts of the original PR https://github.com/os-autoinst/os-autoinst/pull/2216 to see if they have any impact on production. At best ggardet_arm can check on machines that were affected with hot-patching before merging
Create a setup that can reproduce the issue in cloud ourselves with help from ggardet_arm
ggardet_arm needs to debug and fix himself as we can't reproduce but we give hints and suggestions

Actions

Copy link

#19

Updated by tinita over 2 years ago

I prepared a very very small PR: https://github.com/os-autoinst/os-autoinst/pull/2227
But apparently it was too small for some :)

So now I made this big PR: https://github.com/os-autoinst/os-autoinst/pull/2228 which leaves isotovideo with around 150 lines (without pod).
It is divided into 16 commits.
I asked @ggardet_arm if he can test it.

Actions

Copy link

#20

Updated by tinita over 2 years ago

Status changed from In Progress to Feedback

Tested 2228 successfully locally and now waiting for @ggardet_arm

Actions

Copy link

#21

Updated by ggardet_arm over 2 years ago

I just updated ip-10-252-32-90 and ip-10-252-32-98:8 workers with os-autoinst from https://github.com/os-autoinst/os-autoinst/pull/2228

Actions

Copy link

#22

Updated by livdywan over 2 years ago

From worker ip-10-252-32-90:2 looking at one of the jobs seems to show a lot of messages about too many candidates needles, making it very slow even before it eventually fails. I'm finding it difficult to assess if this is at all related to Tina's changes. It doesn't seem like it? But I also don't really see any passing runs before it.

Actions

Copy link

#23

Updated by tinita over 2 years ago

Due date changed from 2022-12-13 to 2022-12-19

Bumping due date as we are still checking the outcome of the PR

Actions

Copy link

#24

Updated by livdywan over 2 years ago

cdywan wrote:

From worker ip-10-252-32-90:2 looking at one of the jobs seems to show a lot of messages about too many candidates needles, making it very slow even before it eventually fails. I'm finding it difficult to assess if this is at all related to Tina's changes.

All the recent jobs seem to be failing. Failures don't look related to the suspected regression but rather network-related.
@ggardet_arm Can you comment on that? A passing job for reference would be helpful. And to be able to see if it's slower than expected.

Actions

Copy link

#25

Updated by ggardet_arm over 2 years ago

cdywan wrote:

cdywan wrote:

From worker ip-10-252-32-90:2 looking at one of the jobs seems to show a lot of messages about too many candidates needles, making it very slow even before it eventually fails. I'm finding it difficult to assess if this is at all related to Tina's changes.

All the recent jobs seem to be failing. Failures don't look related to the suspected regression but rather network-related.
@ggardet_arm Can you comment on that? A passing job for reference would be helpful. And to be able to see if it's slower than expected.

The network related issue is tracked here: https://progress.opensuse.org/issues/122029

I think the current Tina's PR does not bring any significant perf regression. At least I couldn't reproduce it.

Actions

Copy link

#26