[osd] Job detail page fails to load
The job detail page for the following ltp_syscalls_secureboot job is timing out:
Please investigate why and fix it if possible.
It is definitely not about loading results from disk because the results directory doesn't even exist anymore:
```
martchus@openqa:/> l /var/lib/openqa/testresults/00823
insgesamt 308
drwxr-xr-x    3 geekotest nogroup    101 20. Mär 2018 ./
drwxr-xr-x 6994 geekotest root    172032 28. Feb 16:50 ../
drwxr-xr-x    3 geekotest nogroup  12288 18. Apr 2017 00823441-sle-12-SP3-Server-DVD-x86_64-Build0285-sles12_qa_kernel_lvm2_120@64bit/
```
The job is also not archived (according to the database and file system).
> It is definitely not about loading results from disk because the results directory doesn't even exist anymore:
> `martchus@openqa:/> l /var/lib/openqa/testresults/00823`

Umm, I think you should be looking into `/var/lib/openqa/testresults/08232/` instead. The job ran over the weekend, not 5 years ago.
I reloaded the page and it took 116s to load /tests/8232404 and 137s to load /tests/8232404/details_ajax
For the first request I saw a 502 after a bit over 300s, which is where we hit a timeout.
So I assume it is a long database query (and the second time it comes from the cache).
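For reference, a minimal sketch of how such timings can be reproduced from the command line; the host name (OSD as openqa.suse.de) is an assumption here, not taken from this ticket:

```sh
# Hedged sketch: time the job page and its details AJAX route with curl.
for route in /tests/8232404 /tests/8232404/details_ajax; do
  curl -o /dev/null -sS -w "$route -> HTTP %{http_code} in %{time_total}s\n" \
    "https://openqa.suse.de$route"
done
```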
I would look locally at what SQL is being executed, but the pg dump is still copying; the network is very slow today.
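Once the dump is imported, one way to see which statements are slow is PostgreSQL's built-in slow-query logging; a sketch, with the 1s threshold being just an assumption:

```sh
# Hedged sketch: log every statement slower than 1s so the offending queries
# show up with their runtime in the PostgreSQL log.
sudo -u postgres psql -c "ALTER SYSTEM SET log_min_duration_statement = '1s';"
sudo -u postgres psql -c "SELECT pg_reload_conf();"
```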
@okurz This is not a problem of too much data in the results. Other ltp_syscalls_secureboot jobs with the exact same number of test modules load just fine. This one job is simply broken, and even the main job page (without test module data), which normally takes less than a second to load regardless of job size, is taking ages to load.
- Subject changed from Job detail page fails to load to [osd] Job detail page fails to load
- Priority changed from Normal to High
- Parent task set to #80142
I agree that the root cause is not a "too big job", though I do believe it's a generic problem that just becomes apparent in this job. I have seen no indication of "data corruption" that would explain it. OSD has been suffering from high load for quite some time and it seems to have become worse recently. This could be due to an unexpected problem, or it could simply be due to the expected increase in the number of jobs and worker instances as well as the complexity of jobs. Maybe that's not related to the current problem, maybe it is.
EDIT: Of course this is neither the real problem nor the solution, but I found that a "cupsd" service is also running on OSD. That is most certainly not needed. Trying to remove it with `zypper rm -u cups cups-client cups-config cups-filters libcups2 libcupscgi1 libcupsimage2 libcupsmime1 libcupsppdc1` would also remove further packages including cronie and samba, which is a bit too much. Even without `-u`, samba would still be removed.

`zypper rm cups-client` looks better, and `zypper rm -u cups-client` works as well. I also ran `ionice nice zypper rm -u libcupsimage2`, which among others removes ghostscript and graphviz. I manually removed more packages which I think are not necessary.
In the meantime tinita is already looking into the specific SQL queries here.
Locally it looks like there might be a query not using a table index. I've only looked at one query so far, but since both the HTML page and the AJAX call take a long time, there must be more than one query behaving badly.
Maybe if the number of job_modules is above a certain point, the postgres planner chooses the wrong strategy.
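Whether the index is used can be checked with EXPLAIN; here is a sketch using a simplified stand-in query over job_modules rather than the actual query from the details route, and assuming the database is named openqa:

```sh
# Hedged sketch: the SELECT is only a simplified stand-in for the real query
# behind the job details page; a "Seq Scan on job_modules" node in the output
# would confirm that the index is being skipped.
sudo -u postgres psql openqa -c \
  "EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM job_modules WHERE job_id = 8232404;"
```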
The query takes 12s locally, but if I force it to use the index with `SET random_page_cost = 1.1;` it only takes 10ms.
edit: the query took 490s in production, btw.
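For completeness, a sketch of how the before/after comparison can be done in a single psql session; SET is session-local, so it does not affect other connections (again with a stand-in query and an assumed database name):

```sh
# Hedged sketch: compare the plan before and after lowering random_page_cost.
sudo -u postgres psql openqa <<'SQL'
EXPLAIN ANALYZE SELECT * FROM job_modules WHERE job_id = 8232404;
SET random_page_cost = 1.1;  -- tell the planner random I/O is cheap, e.g. on SSDs
EXPLAIN ANALYZE SELECT * FROM job_modules WHERE job_id = 8232404;
SQL
```

Lowering random_page_cost persistently (e.g. via ALTER SYSTEM) is a common tuning step on SSD-backed databases, but whether that is appropriate for OSD would be a separate decision.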