action #57962

Needle load time increase

Added by acarvajal 5 months ago. Updated 5 months ago.

Status:ResolvedStart date:10/10/2019
Priority:HighDue date:
Assignee:okurz% Done:

0%

Category:-
Target version:openQA Project - Current Sprint
Duration:

Description

Needle load time has increased since 09.10.2019 morning, from 3-5 seconds, to 60+ seconds.

This may cause failures in tests with time sensitive screen actions, for example offline upgrades with PATCH=1 which schedule:

migration/reboot_to_upgrade
migration/version_switch_upgrade_target
installation/isosize
installation/bootloader

The needle load time increase makes migration/version_switch_upgrade_target to take over 2 minutes to complete while the VM is rebooting in the background. When the test reaches installation/bootloader, SUT is already in the GRUB menu so it is not able to select upgrade from the installation menu and fails.

The following data is from HA offline migration tests and compare the last failing one vs the last passing one.

offline + SCC migration from 12sp4:

[2019-10-09T16:01:52.266 CEST] [debug] ||| finished version_switch_upgrade_target migration at 2019-10-09 14:01:52 (108 s)
vs
[2019-10-09T01:03:01.624 CEST] [debug] ||| finished version_switch_upgrade_target migration at 2019-10-08 23:03:01 (3 s)

offline + SCC migration from 12sp4:

[2019-10-09T22:12:39.589 CEST] [debug] ||| finished version_switch_upgrade_target migration at 2019-10-09 20:12:39 (121 s)
vs
[2019-10-09T00:57:00.555 CEST] [debug] ||| finished version_switch_upgrade_target migration at 2019-10-08 22:57:00 (3 s)

offline from 11sp4:

[2019-10-09T18:52:58.316 CEST] [debug] ||| finished version_switch_upgrade_target migration at 2019-10-09 16:52:58 (194 s)
vs
[2019-10-09T01:16:05.424 CEST] [debug] ||| finished version_switch_upgrade_target migration at 2019-10-08 23:16:05 (3 s)

offline from 11sp4:

[2019-10-09T18:51:43.165 CEST] [debug] ||| finished version_switch_upgrade_target migration at 2019-10-09 16:51:43 (212 s)
vs
[2019-10-09T01:04:28.904 CEST] [debug] ||| finished version_switch_upgrade_target migration at 2019-10-08 23:04:28 (3 s)

offline from 12sp2-ltss:

[2019-10-09T18:44:03.649 CEST] [debug] ||| finished version_switch_upgrade_target migration at 2019-10-09 16:44:03 (190 s)
vs
[2019-10-09T00:55:09.279 CEST] [debug] ||| finished version_switch_upgrade_target migration at 2019-10-08 22:55:09 (3 s)

offline from 12sp2-ltss:

[2019-10-09T22:09:06.597 CEST] [debug] ||| finished version_switch_upgrade_target migration at 2019-10-09 20:09:06 (198 s)
vs
[2019-10-09T00:50:04.276 CEST] [debug] ||| finished version_switch_upgrade_target migration at 2019-10-08 22:50:04 (5 s)

offline from 12sp3-ltss:

[2019-10-09T22:13:16.424 CEST] [debug] ||| finished version_switch_upgrade_target migration at 2019-10-09 20:13:16 (152 s)
vs
[2019-10-09T01:06:02.459 CEST] [debug] ||| finished version_switch_upgrade_target migration at 2019-10-08 23:06:02 (3 s)

offline from 12sp3-ltss:

[2019-10-09T22:15:35.895 CEST] [debug] ||| finished version_switch_upgrade_target migration at 2019-10-09 20:15:35 (121 s)
vs
[2019-10-09T01:07:07.188 CEST] [debug] ||| finished version_switch_upgrade_target migration at 2019-10-08 23:07:07 (3 s)

offline from 12sp4:

[2019-10-09T22:19:55.434 CEST] [debug] ||| finished version_switch_upgrade_target migration at 2019-10-09 20:19:55 (487 s)
vs
[2019-10-09T00:52:54.409 CEST] [debug] ||| finished version_switch_upgrade_target migration at 2019-10-08 22:52:54 (5 s)

offline from 12sp4:

[2019-10-09T22:04:26.060 CEST] [debug] ||| finished version_switch_upgrade_target migration at 2019-10-09 20:04:26 (97 s)
vs
[2019-10-09T00:57:15.750 CEST] [debug] ||| finished version_switch_upgrade_target migration at 2019-10-08 22:57:15 (3 s)

More logs in: https://openqa.suse.de/tests/overview?distri=sle&version=12-SP5&build=0155%400350&groupid=157


Related issues

Related to openQA Project - action #56789: New needles from git repository not working with openqa-c... New 11/09/2019

History

#1 Updated by okurz 5 months ago

  • Status changed from New to In Progress
  • Assignee set to okurz

so seems very likely to be linked to yesterdays deployment with upgrade of both worker and os-autoinst. Diffs are:

  • openQA-worker: 4.6.1569991263.a96053cdc-lp151.1859.1..4.6.1570550933.4d654a67a-lp151.1882.1
  • os-autoinst: 4.5.1569914332.5870aebd-lp151.180.1..4.5.1570546690.6aabb29c-lp151.184.1

openQA-worker:

0e9086386 Make clickable links if a setting looks like a URL
2668e2598 Too many webui requests take more than a second to be processed, so limiting concurrency is much safer for now
10c54b000 Add note about plugins not to enable on typical production setups
9740f4ab2 (HEAD -> enhance/apparmor, okurz/enhance/apparmor) apparmor: Update profile based on current o3 state
1021bb92f Elaborate on differences between table and YAML
226ed29a1 Use fork for nightly circleci job (#2365)
7953cd21c Retry limit asset job on key constraint violations during scans
885b5160f Improve job dependencies documentation
4aff6eab5 Dependency cron 191002000541
ee7054888 Correctly propagate job template name from YAML

os-autoinst:

bcbde8c6 Fix error handling when passing message from ws client to isotovideo
76bd97b9 (okurz/fix/script_run, fix/script_run) Add possibility to have a comment text in script_run (fixed)
2415ff3a (okurz/fix/ev_timer, fix/ev_timer) t: Fix 'Not enough arguments for EV::timer' with import ordering
eb20617f Allow loading needles from pool directory

My bet is on https://github.com/os-autoinst/os-autoinst/pull/1217/commits/eb20617f5c259e7df93495aa98045c5145fc0df0

Verified the problem together with mkittler by running isotovideo locally and we could observe a heavy increase in loading time, "last good":

okurz@linux-28d6:../os-autoinst/local/x86_micro_selftest_opensuse 0 (fix/needle_load_time) $ ../../isotovideo -d
[2019-10-10T18:22:14.904 CEST] [debug] Current version is c66cc5fc573cb5a993d2d167bc8416bfa2c6d147 [interface v17]
[2019-10-10T18:22:15.118 CEST] [debug] init needles from /var/lib/openqa/share/tests/opensuse/products/opensuse/needles
[2019-10-10T18:22:16.308 CEST] [debug] loaded 5520 needles

"first bad":

okurz@linux-28d6:../os-autoinst/local/x86_micro_selftest_opensuse 0 (fix/needle_load_time) $ ../../isotovideo -d
[2019-10-10T18:22:42.554 CEST] [debug] Current version is e55df29235d5d996a146ec4e5ec870cb0af12aa3 [interface v17]
[2019-10-10T18:22:42.568 CEST] [debug] git hash in /var/lib/openqa/share/tests/opensuse: 03a81cc763a80046a36a9ac9fead8bb72722147d
[2019-10-10T18:22:42.921 CEST] [debug] init needles from /var/lib/openqa/share/tests/opensuse/products/opensuse/needles
[2019-10-10T18:23:23.290 CEST] [debug] loaded 5520 needles

Reverted with
https://github.com/os-autoinst/os-autoinst/pull/1224

Hotpatched OSD workers with:

sudo salt '*' cmd.run 'curl -s https://github.com/os-autoinst/os-autoinst/commit/eb20617f5c259e7df93495aa98045c5145fc0df0.patch | sudo patch -R -p1 --batch --force -d /usr/lib/os-autoinst/'

#2 Updated by coolo 5 months ago

We're talking about this time, right?

before:
[2019-10-04T03:02:35.238 CEST] [debug] init needles from /var/lib/openqa/cache/openqa.suse.de/tests/sle/products/sle/needles
[2019-10-04T03:02:37.322 CEST] [debug] loaded 9589 needles

after:
[2019-10-10T15:13:31.009 CEST] [debug] init needles from /var/lib/openqa/cache/openqa.suse.de/tests/sle/products/sle/needles
[2019-10-10T15:14:07.710 CEST] [debug] loaded 9575 needles

While I agree that the commit stands out as related, it doesn't really make sense in my head - as it only adds an if

#3 Updated by coolo 5 months ago

On o3 the slowdown can be pinned down to the 8th:

01051/01051423-opensuse-Tumbleweed-DVD-ppc64le-Build20191007-ltp_capbounds@ppc64le/autoinst-log.txt:[2019-10-08T16:36:27.760 UTC] [debug] init needles from /var/lib/openqa/cache/openqa1-opensuse/tests/opensuse/products/opensuse/needles
01051/01051423-opensuse-Tumbleweed-DVD-ppc64le-Build20191007-ltp_capbounds@ppc64le/autoinst-log.txt-[2019-10-08T16:36:29.207 UTC] [debug] loaded 5801 needles
--
01051/01051424-microos-Staging:D-Staging-DVD-x86_64-Build226.1-container-host@64bit-2G-HD40G/autoinst-log.txt:[2019-10-08T14:44:28.841 CEST] [debug] init needles from /var/lib/openqa/cache/openqa1-opensuse/tests/microos/products/microos/needles
01051/01051424-microos-Staging:D-Staging-DVD-x86_64-Build226.1-container-host@64bit-2G-HD40G/autoinst-log.txt-[2019-10-08T14:44:53.216 CEST] [debug] loaded 5800 needles

#4 Updated by coolo 5 months ago

ok, that was misleading as power8 wasn't deployed

#6 Updated by okurz 5 months ago

I updated my previous comment. I think this is why you haven't read that I hotpatched on OSD already. Any new deployment of os-autoinst will include the merged revert but I would not plan a new deployment yet.

Jobs can be retriggered, hotpatch can be applied to o3 as well or we wait until the night.

EDIT: power8.o.o did not even install latest os-autoinst packages, to be checked. Others rolled back with for i in aarch64 openqaworker1 openqaworker4 imagetester; do echo $i && ssh root@$i "transactional-update rollback last && reboot"; done

#7 Updated by okurz 5 months ago

  • Related to action #56789: New needles from git repository not working with openqa-clone-custom-git-refspec added

#8 Updated by okurz 5 months ago

  • Status changed from In Progress to Resolved
  • Target version set to Current Sprint

many tests retriggered on both osd as well as some on o3 (probably none failed for because of this ticket though).

Also available in: Atom PDF