Project

General

Profile

Actions

action #121045

closed

Performance regression of `os-autoinst` size:M

Added by ggardet_arm almost 2 years ago. Updated almost 2 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2022-11-28
Due date:
2023-01-20
% Done:

0%

Estimated time:

Description

Observation

I have noticed significant performance regression on aarch64 since few days which leads to some test failures.
On o3, this is visible on MicroOS-Image and MicroOS-Image-ContainerHost. For example, see history of https://openqa.opensuse.org/tests/2912766#next_previous

I can point to os-autoinst since on my local aarch64 Tumbleweed machine I perform some tests and the result is:

For my tests, I cloned locally https://openqa.opensuse.org/tests/2912766 and the test passes with os-autoinst at git ID 61d5b38 but os-autoinst at git ID d36887c fails in disk_boot as it fails to "see" the grub menu.

Acceptance criteria

  • AC1: Ensure there is no performance regression

Suggestions

  • Investigate sporadic excessive select calls

Related issues 1 (0 open1 closed)

Related to openQA Project - action #81899: Move code from isotovideo to a module size:MResolvedlivdywan2021-01-08

Actions
Actions #1

Updated by ggardet_arm almost 2 years ago

  • Priority changed from Normal to High
Actions #4

Updated by ggardet_arm almost 2 years ago

One difference I could spot by comparing the os-autoinst.log files is that GOT GO line is:

  • after QEMU: QEMU emulator version... and QEMU: Copyright (c)... lines when it works
  • before QEMU: QEMU emulator version... and QEMU: Copyright (c)... lines when it is broken
Actions #5

Updated by mkittler almost 2 years ago

  • Assignee set to mkittler

That's an interesting observation. The GOT GO line must after the backend starts because only then its fd will be correctly used when polling via select. Likely now the select call always returns immediately making the isotovideo loop spin much faster. I'll have a closer look. That's not it. It looks like the backend process is the one consuming more CPU time.

Actions #6

Updated by mkittler almost 2 years ago

  • Status changed from New to In Progress

For now it is likely best to revert the PR: https://github.com/os-autoinst/os-autoinst/pull/2221

Actions #7

Updated by mkittler almost 2 years ago

  • Category set to Regressions/Crashes
  • Target version set to Ready
Actions #8

Updated by mkittler almost 2 years ago

I've been adding some debugging code to see what handles let the backend loop spin:

bmwqemu::diag("read set: " . join(", ", map { $self->{select_read}->get_name($_) } @$read_set));
bmwqemu::diag("write set: " . join(", ", map { $self->{select_write}->get_name($_) } @$write_set));
bmwqemu::diag("e set: " . join(", ", map { $self->{read_write}->get_name($_) } @$e_set));

Then tail -f /hdd/openqa-devel/openqa/pool/1/autoinst-log.txt looks like this:

[2022-11-28T17:36:09.947088+01:00] [debug] read set: 
[2022-11-28T17:36:09.947171+01:00] [debug] write set: 
[2022-11-28T17:36:09.947215+01:00] [debug] e set: 
[2022-11-28T17:36:09.981551+01:00] [debug] read set: 
[2022-11-28T17:36:09.981719+01:00] [debug] write set: 
[2022-11-28T17:36:09.981817+01:00] [debug] e set: 
[2022-11-28T17:36:10.004369+01:00] [debug] read set: 
[2022-11-28T17:36:10.004538+01:00] [debug] write set: 
[2022-11-28T17:36:10.004683+01:00] [debug] e set: 
[2022-11-28T17:36:10.016215+01:00] [debug] read set: 
[2022-11-28T17:36:10.016347+01:00] [debug] write set: 
[2022-11-28T17:36:10.016392+01:00] [debug] e set: 
[2022-11-28T17:36:10.050502+01:00] [debug] read set: 
[2022-11-28T17:36:10.050675+01:00] [debug] write set: 
[2022-11-28T17:36:10.050828+01:00] [debug] e set: 
[2022-11-28T17:36:10.084859+01:00] [debug] read set: 
[2022-11-28T17:36:10.085061+01:00] [debug] write set: 
[2022-11-28T17:36:10.085211+01:00] [debug] e set: 
[2022-11-28T17:36:10.119174+01:00] [debug] read set: 
[2022-11-28T17:36:10.119400+01:00] [debug] write set: 
[2022-11-28T17:36:10.119561+01:00] [debug] e set: 
[2022-11-28T17:36:10.153774+01:00] [debug] read set: 
[2022-11-28T17:36:10.153960+01:00] [debug] write set: 
[2022-11-28T17:36:10.154071+01:00] [debug] e set: 
[2022-11-28T17:36:10.188218+01:00] [debug] read set: 
[2022-11-28T17:36:10.188470+01:00] [debug] write set: 
[2022-11-28T17:36:10.188642+01:00] [debug] e set: 
[2022-11-28T17:36:10.223027+01:00] [debug] read set: 
[2022-11-28T17:36:10.223238+01:00] [debug] write set: 
[2022-11-28T17:36:10.223374+01:00] [debug] e set: 
[2022-11-28T17:36:10.255226+01:00] [debug] read set: 
[2022-11-28T17:36:10.255451+01:00] [debug] write set: 
[2022-11-28T17:36:10.255618+01:00] [debug] e set: 
[2022-11-28T17:36:10.257666+01:00] [debug] read set: 
[2022-11-28T17:36:10.257840+01:00] [debug] write set: 
[2022-11-28T17:36:10.257976+01:00] [debug] e set: 
[2022-11-28T17:36:10.291883+01:00] [debug] read set: 
[2022-11-28T17:36:10.292056+01:00] [debug] write set: 
[2022-11-28T17:36:10.292159+01:00] [debug] e set:

So the select returns very quickly with just empty sets. That seems wrong and likely causes the high CPU usage. Only occasionally the sets aren't empty:

[2022-11-28T17:36:10.293609+01:00] [debug] read set: baseclass::cmdpipe
[2022-11-28T17:36:10.293705+01:00] [debug] write set: 
[2022-11-28T17:36:10.293785+01:00] [debug] e set: 
[2022-11-28T17:36:10.315112+01:00] [debug] read set: 
[2022-11-28T17:36:10.315262+01:00] [debug] write set: baseclass::encoder_pipe, baseclass::external_video_encoder_cmd_pipe
[2022-11-28T17:36:10.315340+01:00] [debug] e set: 
[2022-11-28T17:36:10.433268+01:00] [debug] read set: 
[2022-11-28T17:36:10.433412+01:00] [debug] write set: baseclass::encoder_pipe

I've also added logging for the timeout which at some point of the test execution is computed to Inf:

…
[2022-11-28T17:45:37.557189+01:00] [debug] timeout: 0.999979019165039
[2022-11-28T17:45:37.557332+01:00] [debug] read set: 
[2022-11-28T17:45:37.557515+01:00] [debug] write set: baseclass::encoder_pipe, baseclass::external_video_encoder_cmd_pipe
[2022-11-28T17:45:37.557586+01:00] [debug] e set: 
[2022-11-28T17:45:37.619011+01:00] [debug] timeout: 0.935184955596924
[2022-11-28T17:45:37.619140+01:00] [debug] read set: 
[2022-11-28T17:45:37.619266+01:00] [debug] write set: baseclass::encoder_pipe
[2022-11-28T17:45:37.619356+01:00] [debug] e set: 
[2022-11-28T17:45:37.804401+01:00] [debug] timeout: 0.933241128921509
[2022-11-28T17:45:37.804548+01:00] [debug] read set: 
[2022-11-28T17:45:37.804642+01:00] [debug] write set: 
[2022-11-28T17:45:37.804767+01:00] [debug] e set: 
[2022-11-28T17:45:38.060613+01:00] [debug] timeout: 0.749186038970947
[2022-11-28T17:45:38.060775+01:00] [debug] read set: 
[2022-11-28T17:45:38.060882+01:00] [debug] write set: 
[2022-11-28T17:45:38.061037+01:00] [debug] e set: 
[2022-11-28T17:45:38.062793+01:00] [debug] timeout: 0.492897987365723
[2022-11-28T17:45:38.062943+01:00] [debug] read set: 
[2022-11-28T17:45:38.063042+01:00] [debug] write set: baseclass::encoder_pipe, baseclass::external_video_encoder_cmd_pipe
[2022-11-28T17:45:38.063135+01:00] [debug] e set: 
[2022-11-28T17:45:38.311525+01:00] [debug] timeout: 0.425005912780762
[2022-11-28T17:45:38.311678+01:00] [debug] read set: 
[2022-11-28T17:45:38.311803+01:00] [debug] write set: 
[2022-11-28T17:45:38.311894+01:00] [debug] e set: 
[2022-11-28T17:45:38.554664+01:00] [debug] timeout: 0.242048978805542
[2022-11-28T17:45:38.554809+01:00] [debug] read set: 
[2022-11-28T17:45:38.554915+01:00] [debug] write set: 
[2022-11-28T17:45:38.555031+01:00] [debug] e set: 
[2022-11-28T17:45:38.561353+01:00] [debug] timeout: Inf
[2022-11-28T17:45:38.561474+01:00] [debug] read set: 
[2022-11-28T17:45:38.561613+01:00] [debug] write set: 
[2022-11-28T17:45:38.561685+01:00] [debug] e set: 
[2022-11-28T17:45:38.563226+01:00] [debug] timeout: Inf
[2022-11-28T17:45:38.563388+01:00] [debug] read set: 
[2022-11-28T17:45:38.563526+01:00] [debug] write set: baseclass::encoder_pipe, baseclass::external_video_encoder_cmd_pipe
[2022-11-28T17:45:38.563657+01:00] [debug] e set: 
[2022-11-28T17:45:39.062534+01:00] [debug] timeout: Inf
[2022-11-28T17:45:39.062644+01:00] [debug] read set: 
[2022-11-28T17:45:39.062726+01:00] [debug] write set: 
[2022-11-28T17:45:39.062797+01:00] [debug] e set:
…
[2022-11-28T17:51:09.353237+01:00] [debug] timeout: Inf
[2022-11-28T17:51:09.353437+01:00] [debug] read set: 
[2022-11-28T17:51:09.353659+01:00] [debug] write set: 
[2022-11-28T17:51:09.353816+01:00] [debug] e set:

Not sure whether that's normal. I suppose I'll have to re-conduct all investigations also without the problematic change to see what's actually normal and what not.

Actions #9

Updated by openqa_review almost 2 years ago

  • Due date set to 2022-12-13

Setting due date based on mean cycle time of SUSE QE Tools

Actions #10

Updated by mkittler almost 2 years ago

With the PR reverted the select also returns with empty sets and the timeout is computed to infinity:

[2022-11-30T15:16:48.838931+01:00] [debug] timeout: Inf
[2022-11-30T15:16:48.839063+01:00] [debug] read set: 
[2022-11-30T15:16:48.839200+01:00] [debug] write set: baseclass::encoder_pipe, baseclass::external_video_encoder_cmd_pipe
[2022-11-30T15:16:48.839280+01:00] [debug] e set: 
[2022-11-30T15:16:49.087809+01:00] [debug] timeout: Inf
[2022-11-30T15:16:49.088003+01:00] [debug] read set: 
[2022-11-30T15:16:49.088154+01:00] [debug] write set: 
[2022-11-30T15:16:49.088301+01:00] [debug] e set: 
[2022-11-30T15:16:49.345798+01:00] [debug] timeout: Inf
[2022-11-30T15:16:49.346663+01:00] [debug] read set: 
[2022-11-30T15:16:49.346891+01:00] [debug] write set: 
[2022-11-30T15:16:49.347119+01:00] [debug] e set: 
[2022-11-30T15:16:49.348873+01:00] [debug] timeout: Inf
[2022-11-30T15:16:49.348989+01:00] [debug] read set: 
[2022-11-30T15:16:49.349205+01:00] [debug] write set: baseclass::encoder_pipe, baseclass::external_video_encoder_cmd_pipe
[2022-11-30T15:16:49.349314+01:00] [debug] e set: 
[2022-11-30T15:16:49.648401+01:00] [debug] timeout: Inf
[2022-11-30T15:16:49.648568+01:00] [debug] read set: baseclass::cmdpipe
[2022-11-30T15:16:49.648660+01:00] [debug] write set: 
[2022-11-30T15:16:49.648714+01:00] [debug] e set: 
[2022-11-30T15:16:50.070618+01:00] [debug] no match: -0.3s, best candidate: packages-section-selected-20220222 (0.83)
[2022-11-30T15:16:50.146970+01:00] [debug] timeout: Inf
[2022-11-30T15:16:50.147042+01:00] [debug] read set: 
[2022-11-30T15:16:50.147140+01:00] [debug] write set: baseclass::encoder_pipe, baseclass::external_video_encoder_cmd_pipe
[2022-11-30T15:16:50.147194+01:00] [debug] e set:

However, the frequency in which it happens isn't as high. It also doesn't happen as often in a row. This difference in behavior is nevertheless likely just a symptom, though.

Actions #11

Updated by robert.richardson almost 2 years ago

  • Subject changed from Performance regression of `os-autoinst` to Performance regression of `os-autoinst` size:M
  • Description updated (diff)
  • Assignee changed from mkittler to tinita
Actions #12

Updated by tinita almost 2 years ago

So far I also wasn't able to see something suspicious in the code change.
The only thing I realized is that the ->on(tests_done => ...) is removed: https://github.com/os-autoinst/os-autoinst/pull/2216/files#diff-a0c7c8ca335c30440a2ce90775102e5adbfaf80f2001231562fda076bdfc0093L261
But we still execute ->emit(tests_done ...) later.
But that can't really be related to this issue, and if there is no event handler for an event, it doesn't complain about it.

I ran a test locally on my laptop twice, with current master and with the PR, and I can't see significant changes in the autoinst-log.txt

Also I can't see any significant difference in the autoinst-log.txt of https://openqa.opensuse.org/tests/2912766/ and https://openqa.opensuse.org/tests/2890492

I would need to take one of the aarch64 workers out and add some debugging there.

Actions #13

Updated by tinita almost 2 years ago

I actually can't use an aarch64 worker because I can't ssh to any of them.

So I'm really lost here. To further debug, I need a machine which shows the symptom.
@ggardet_arm any idea?

Actions #14

Updated by ggardet_arm almost 2 years ago

tinita wrote:

I actually can't use an aarch64 worker because I can't ssh to any of them.

So I'm really lost here. To further debug, I need a machine which shows the symptom.
@ggardet_arm any idea?

You can likely reach openqa-aarch64 machine on o3, right?

Actions #15

Updated by tinita almost 2 years ago

ggardet_arm wrote:

You can likely reach openqa-aarch64 machine on o3, right?

After we found out the IP and the "real" hostname, yeah. But before I only got:

ssh: Could not resolve hostname openqa-aarch64: Name or service not known

So I now took the worker out and will try to add some debugging code.

Actions #16

Updated by tinita almost 2 years ago

In the end I was not able to reproduce performance problems.
First I thought I found something when the load was going up to 7 during a test, but I later also saw that on master.

I now conducted 10 tests each for master and PR 2216.

Master

PR 2216

I also ran top permanently and watched the load closely. There were no significant changes between the two code branches. The load was around 1.5-2.2 during the main part of the test, I would say.

@ggardet_arm How many test runs did you do and which test did you run? Did I maybe chose the wrong test?

The patched files from PR 2216 are still lying on openqa-aarch64 in the /root folder (isotovideo-patched, CommandHandler.pm-patched), but I will start the regular services again on that machine soon. Let me know if you have more ideas to reproduce.

Actions #17

Updated by ggardet_arm almost 2 years ago

I used MicroOS-Image and MicroOS-Image-ContainerHost from Tumbleweed to perform my tests. I manually triggered a number of jobs, but cannot tell you the exact number, sorry.

IIRC, the problem was more obvious on other aarch64 machines but I cannot provide access to them, sorry.

Actions #18

Updated by okurz almost 2 years ago

Discussed in daily 2022-12-06. We see three tasks that we could do as we could not reproduce the issue today:

  1. Bring in parts of the original PR https://github.com/os-autoinst/os-autoinst/pull/2216 to see if they have any impact on production. At best ggardet_arm can check on machines that were affected with hot-patching before merging
  2. Create a setup that can reproduce the issue in cloud ourselves with help from ggardet_arm
  3. ggardet_arm needs to debug and fix himself as we can't reproduce but we give hints and suggestions
Actions #19

Updated by tinita almost 2 years ago

I prepared a very very small PR: https://github.com/os-autoinst/os-autoinst/pull/2227
But apparently it was too small for some :)

So now I made this big PR: https://github.com/os-autoinst/os-autoinst/pull/2228 which leaves isotovideo with around 150 lines (without pod).
It is divided into 16 commits.
I asked @ggardet_arm if he can test it.

Actions #20

Updated by tinita almost 2 years ago

  • Status changed from In Progress to Feedback

Tested 2228 successfully locally and now waiting for @ggardet_arm

Actions #21

Updated by ggardet_arm almost 2 years ago

I just updated ip-10-252-32-90 and ip-10-252-32-98:8 workers with os-autoinst from https://github.com/os-autoinst/os-autoinst/pull/2228

Actions #22

Updated by livdywan almost 2 years ago

From worker ip-10-252-32-90:2 looking at one of the jobs seems to show a lot of messages about too many candidates needles, making it very slow even before it eventually fails. I'm finding it difficult to assess if this is at all related to Tina's changes. It doesn't seem like it? But I also don't really see any passing runs before it.

Actions #23

Updated by tinita almost 2 years ago

  • Due date changed from 2022-12-13 to 2022-12-19

Bumping due date as we are still checking the outcome of the PR

Actions #24

Updated by livdywan almost 2 years ago

cdywan wrote:

From worker ip-10-252-32-90:2 looking at one of the jobs seems to show a lot of messages about too many candidates needles, making it very slow even before it eventually fails. I'm finding it difficult to assess if this is at all related to Tina's changes.

All the recent jobs seem to be failing. Failures don't look related to the suspected regression but rather network-related.
@ggardet_arm Can you comment on that? A passing job for reference would be helpful. And to be able to see if it's slower than expected.

Actions #25

Updated by ggardet_arm almost 2 years ago

cdywan wrote:

cdywan wrote:

From worker ip-10-252-32-90:2 looking at one of the jobs seems to show a lot of messages about too many candidates needles, making it very slow even before it eventually fails. I'm finding it difficult to assess if this is at all related to Tina's changes.

All the recent jobs seem to be failing. Failures don't look related to the suspected regression but rather network-related.
@ggardet_arm Can you comment on that? A passing job for reference would be helpful. And to be able to see if it's slower than expected.

The network related issue is tracked here: https://progress.opensuse.org/issues/122029

I think the current Tina's PR does not bring any significant perf regression. At least I couldn't reproduce it.

Actions #26

Updated by livdywan almost 2 years ago

  • Due date changed from 2022-12-19 to 2023-01-06

Yay. That means we can go ahead with the branch. Perhaps we should aim to merge first thing in the new year.

Actions #27

Updated by okurz almost 2 years ago

  • Due date changed from 2023-01-06 to 2023-01-20

christmas grace due date bump :)

Actions #29

Updated by tinita almost 2 years ago

https://github.com/os-autoinst/os-autoinst/pull/2228 merged, let's observe for a while

Actions #30

Updated by tinita almost 2 years ago

  • Status changed from Feedback to Resolved

This looks like it is resolved. Please reopen if you observe failures again.

Actions #31

Updated by okurz almost 2 years ago

  • Related to action #81899: Move code from isotovideo to a module size:M added
Actions

Also available in: Atom PDF