Project

General

Profile

action #107701

coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

[osd] Job detail page fails to load

Added by MDoucha 4 months ago. Updated 4 months ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Concrete Bugs
Target version:
Start date:
2022-02-28
Due date:
2022-03-15
% Done:

0%

Estimated time:
Difficulty:

Description

The job detail page for the following ltp_syscalls_secureboot job is timing out:
https://openqa.suse.de/tests/8232404

Please investigate why and fix it if possible.


Related issues

Related to openQA Infrastructure - action #105618: [Alerting] CPU Load alert size:SRejected2022-01-27

Related to openQA Infrastructure - action #97943: Increase number of CPU cores on OSD VM due to high usage size:SResolved

History

#1 Updated by mkittler 4 months ago

It is definitely not about loading results from disk because the results directory doesn't even exist anymore:

martchus@openqa:/> l /var/lib/openqa/testresults/00823
insgesamt 308
drwxr-xr-x    3 geekotest nogroup    101 20. Mär 2018  ./
drwxr-xr-x 6994 geekotest root    172032 28. Feb 16:50 ../
drwxr-xr-x    3 geekotest nogroup  12288 18. Apr 2017  00823441-sle-12-SP3-Server-DVD-x86_64-Build0285-sles12_qa_kernel_lvm2_120@64bit/

The job is also not archived (according to the database and file system).

#2 Updated by MDoucha 4 months ago

mkittler wrote:

It is definitely not about loading results from disk because the results directory doesn't even exist anymore:

martchus@openqa:/> l /var/lib/openqa/testresults/00823

Umm, I think you should be looking into /var/lib/openqa/testresults/08232/ instead. The job ran over the weekend, not 5 years ago.

#3 Updated by tinita 4 months ago

I reloaded the page and it took 116s to load /tests/8232404 and 137s to load /tests/8232404/details_ajax
For the first request I saw a 502 with a bit over 300s where we hit a timeout.

So I assume it is a long database query (and the second time it comes from the cache).

I would look locally what SQL would be executed, but the pg dump is still copying, very slow network today.

#6 Updated by okurz 4 months ago

  • Target version set to future

#7 Updated by okurz 4 months ago

  • Target version changed from future to Ready

#8 Updated by MDoucha 4 months ago

okurz This is not a problem of too much data in the results. Other ltp_syscalls_secureboot jobs with the exact same number of test modules load up just fine. This one job is simply broken and even the main job page (without test module data) that normally takes less than a second to load regardless of job size is taking ages to load.

#9 Updated by okurz 4 months ago

#10 Updated by okurz 4 months ago

  • Related to action #97943: Increase number of CPU cores on OSD VM due to high usage size:S added

#11 Updated by okurz 4 months ago

  • Subject changed from Job detail page fails to load to [osd] Job detail page fails to load
  • Priority changed from Normal to High
  • Parent task set to #80142

I agree that the root cause is not a "too big job" though I do believe it's a generic problem that just becomes apparent in this job. I have seen no indication of "data corruption" that would explain that. OSD is suffering from a high load since a longer time and seems to have become worse recently. This could be due to an unexpected problem, it could be due to just the expected number of increasing jobs and number of worker instances as well as the complexity of jobs. Maybe that's not related to the current problem, maybe it is.

EDIT: Of course not the real problem nor solution but I found that also a "cupsd" service is running on OSD. That's most certainly not needed. Trying to remove that with zypper rm -u cups cups-client cups-config cups-filters libcups2 libcupscgi1 libcupsimage2 libcupsmime1 libcupsppdc1.

That would also remove

PackageKit-gtk3-module adwaita-icon-theme at-spi2-atk-common at-spi2-atk-gtk2 at-spi2-core at-spi2-core-lang atk-lang brltty-driver-at-spi2 bubblewrap cifs-utils cron cronie cryptsetup cryptsetup-lang cups cups-client cups-config cups-filters desktop-translations enchant-2-backend-aspell enchant-2-backend-hunspell
  enchant-data fuse3 gcr-data gcr-lang gcr-prompter gcr-viewer gdk-pixbuf-loader-rsvg ghostscript ghostscript-x11 gnome-online-accounts gnome-online-accounts-lang graphviz graphviz-gd graphviz-gnome graphviz-plugins-core groff-full gtk2-branding-openSUSE gtk2-data gtk2-immodule-amharic gtk2-immodule-inuktitut
  gtk2-immodule-thai gtk2-immodule-tigrigna gtk2-immodule-vietnamese gtk2-lang gtk2-metatheme-adwaita gtk2-theming-engine-adwaita gtk2-tools gtk3-branding-openSUSE gtk3-data gtk3-immodule-amharic gtk3-immodule-inuktitut gtk3-immodule-thai gtk3-immodule-tigrigna gtk3-immodule-vietnamese gtk3-lang
  gtk3-metatheme-adwaita gtk3-schema gtk3-tools gvfs gvfs-backend-afc gvfs-backend-samba gvfs-backends gvfs-fuse gvfs-lang gxditview json-glib-lang libXaw7 libXcomposite1 libXcursor1 libXi6 libXinerama1 libXmu6 libXtst6 libaspell15 libatasmart-utils libatasmart4 libatk-1_0-0 libatk-bridge-2_0-0 libatspi0
  libavahi-glib1 libbd_btrfs2 libbd_crypto2 libbd_fs2 libbd_loop2 libbd_mdraid2 libbd_part2 libbd_swap2 libbd_utils2 libblockdev libblockdev2 libbluray2 libbytesize-lang libbytesize1 libcdio16 libcdio_cdda2 libcdio_paranoia2 libcolord2 libcroco-0_6-3 libcups2 libcupscgi1 libcupsimage2 libcupsmime1 libcupsppdc1
  libenchant-2-2 libexif12 libfuse3-3 libgck-1-0 libgcr-3-1 libgdata-lang libgdata22 libgoa-1_0-0 libgoa-backend-1_0-1 libgphoto2-6 libgphoto2-6-lang libgraphviz6 libgstfft-1_0-0 libgtk-2_0-0 libgtk-3-0 libharfbuzz-icu0 libhyphen0 libjansson4 libjasper4 libjavascriptcoregtk-4_0-18 libjson-glib-1_0-0 liblcms2-2
  libldapcpp1 liblockdev1 libmtp-udev libmtp9 libnetpbm11 libnfs8 libnscd1 liboauth0 libopenjp2-7 libpoppler-cpp0 libpoppler89 libqpdf26 librest-0_7-0 librsvg-2-2 librsvg-lang libsamba-policy0-python3 libsecret-1-0 libsecret-lang libsoup-2_4-1 libsoup-lang libudisks2-0 libudisks2-0_btrfs libvte-2_91-0
  libwayland-cursor0 libwebkit2gtk-4_0-37 libwebkit2gtk3-lang libwebpdemux2 libwoff2common1_0_2 libwoff2dec1_0_2 lockdev mailx mdadm metatheme-adwaita-common netpbm notification-daemon notification-daemon-lang openssh-askpass-gnome parallel-printer-support perl-Crypt-SmbHash perl-Digest-MD4 perl-GraphViz perl-X500-DN
  perl-XML-XPath poppler-data poppler-tools psutils python3-ldb python3-tdb python3-tevent qemu-ui-gtk ruby2.5-rubygem-cfa_grub2 samba samba-ad-dc-libs samba-client samba-client-libs samba-libs samba-libs-python3 samba-python3 system-user-lp udisks2 udisks2-lang vte-lang webkit2gtk-4_0-injected-bundles xbitmaps
  xdg-dbus-proxy yast2-bootloader yast2-country yast2-ldap yast2-network yast2-pam yast2-proxy yast2-python3-bindings yast2-samba-client yast2-samba-server yast2-security yast2-trans-stats yast2-users yast2-xml

including cronie and samba which is a bit too much. Without -u samba would still be removed. zypper rm cups-client looks better. Also zypper rm -u cups-client works. Also did ionice nice zypper rm -u libcupsimage2 which among others is removing ghostscript and graphviz as well. I manually removed more packages which I think are not necessary.

In the meantime tinita is already looking into the specific SQL queries here.

#12 Updated by tinita 4 months ago

Locally it looks like there might be a query not using a table index. I've only looked at one query so far, but as the HTML page as well as the ajax call take long, there must be more than one query behaving badly.

Maybe if the number of job_modules is above a certain point, the postgres planner chooses the wrong strategy.

The query takes 12s locally, but if I force it to use the index with this: SET random_page_cost = 1.1 then it only takes 10ms.

edit: the query took 490s in production, btw.

#13 Updated by tinita 4 months ago

I think I might have a fix.

I haven't found out how to fix the index selection, but I think I can replace the SQL with a simpler one.
The culprit is lib/OpenQA/Schema/Result/Jobs.pm sub show_modules.
Testing my fix now...

#14 Updated by tinita 4 months ago

  • Status changed from New to In Progress
  • Assignee set to tinita

#16 Updated by okurz 4 months ago

Hotpatched OSD with curl -sS https://patch-diff.githubusercontent.com/raw/os-autoinst/openQA/pull/4538.patch | sudo patch -p1 -d /usr/share/openqa/ and the initial loading of https://openqa.suse.de/tests/8232404 including the info box finishes near-immediate.

#17 Updated by openqa_review 4 months ago

  • Due date set to 2022-03-15

Setting due date based on mean cycle time of SUSE QE Tools

#18 Updated by tinita 4 months ago

  • Status changed from In Progress to Resolved

This is resolved.

We are also logging slow queries now on osd.

Also available in: Atom PDF