action #97979
openQA Infrastructure - action #97976 (closed): [alert] OSD file systems - assets
Asset cleanup takes very long to process 60k files in "other" size:M
Description
Motivation
See #97976. Attaching strace to a gru process during asset cleanup shows that it takes very long to traverse the files in the "other" directory (currently 60k files on OSD). These files are all small but still take a lot of time to traverse, and we seem to process files in reverse alphabetical order:
stat("/var/lib/openqa/share/factory/other/06577642-cve", {st_mode=S_IFREG|0644, st_size=1821, ...}) = 0
stat("/var/lib/openqa/share/factory/other/06577642-cve", {st_mode=S_IFREG|0644, st_size=1821, ...}) = 0
stat("/var/lib/openqa/share/factory/other/06577642-cve", {st_mode=S_IFREG|0644, st_size=1821, ...}) = 0
getpid() = 22883
stat("/var/lib/openqa/share/factory/other/06577641-uevent", {st_mode=S_IFREG|0644, st_size=54, ...}) = 0
stat("/var/lib/openqa/share/factory/other/06577641-uevent", {st_mode=S_IFREG|0644, st_size=54, ...}) = 0
stat("/var/lib/openqa/share/factory/other/06577641-uevent", {st_mode=S_IFREG|0644, st_size=54, ...}) = 0
getpid() = 22883
stat("/var/lib/openqa/share/factory/other/06577641-numa", {st_mode=S_IFREG|0644, st_size=538, ...}) = 0
stat("/var/lib/openqa/share/factory/other/06577641-numa", {st_mode=S_IFREG|0644, st_size=538, ...}) = 0
stat("/var/lib/openqa/share/factory/other/06577641-numa", {st_mode=S_IFREG|0644, st_size=538, ...}) = 0
getpid() = 22883
stat("/var/lib/openqa/share/factory/other/06577640-tracing", {st_mode=S_IFREG|0644, st_size=345, ...}) = 0
stat("/var/lib/openqa/share/factory/other/06577640-tracing", {st_mode=S_IFREG|0644, st_size=345, ...}) = 0
stat("/var/lib/openqa/share/factory/other/06577640-tracing", {st_mode=S_IFREG|0644, st_size=345, ...}) = 0
getpid() = 22883
stat("/var/lib/openqa/share/factory/other/06577635-pty", {st_mode=S_IFREG|0644, st_size=127, ...}) = 0
stat("/var/lib/openqa/share/factory/other/06577635-pty", {st_mode=S_IFREG|0644, st_size=127, ...}) = 0
stat("/var/lib/openqa/share/factory/other/06577635-pty", {st_mode=S_IFREG|0644, st_size=127, ...}) = 0
…
stat("/var/lib/openqa/share/factory/repo/SLE-12-SP5-SDK-POOL-s390x-Build0154-Media1.license", {st_mode=S_IFDIR|0755, st_size=59, ...}) = 0
stat("/var/lib/openqa/share/factory/repo/SLE-12-SP5-SDK-POOL-s390x-Build0154-Media1.license", {st_mode=S_IFDIR|0755, st_size=59, ...}) = 0
stat("/var/lib/openqa/share/factory/repo/SLE-12-SP5-SDK-POOL-s390x-Build0154-Media1.license", {st_mode=S_IFDIR|0755, st_size=59, ...}) = 0
getpid() = 22883
stat("/var/lib/openqa/share/factory/repo/SLE-12-SP5-SDK-POOL-s390x-Build0151-Media1.license", {st_mode=S_IFDIR|0755, st_size=59, ...}) = 0
stat("/var/lib/openqa/share/factory/repo/SLE-12-SP5-SDK-POOL-s390x-Build0151-Media1.license", {st_mode=S_IFDIR|0755, st_size=59, ...}) = 0
stat("/var/lib/openqa/share/factory/repo/SLE-12-SP5-SDK-POOL-s390x-Build0151-Media1.license", {st_mode=S_IFDIR|0755, st_size=59, ...}) = 0
getpid() = 22883
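The trace above shows the same path being stat'ed three times in a row within one pass. A minimal sketch of memoizing the stat result per cleanup run (Python for illustration only; the actual gru worker is Perl, and the counter exists just to make the caching visible):

```python
import os
import tempfile
from functools import lru_cache

stat_calls = {"count": 0}  # track how often we actually hit the filesystem

@lru_cache(maxsize=None)
def cached_stat(path):
    """Stat a path at most once per run and reuse the cached result."""
    stat_calls["count"] += 1
    return os.stat(path)
```

With this in place, the three stat() syscalls per file in the trace collapse into one.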
This has the following consequence:
- We spend a very long time processing "other", where we hardly delete anything, before we get to "iso" or "hdd" with bigger files
Suggestion
- Investigate whether we really want Z-A sorting, i.e. traversing more recent files by build number first - or can we check mtime instead?
- Keep track of files already processed between runs (is that feasible?)
- Reconsider running the cleanup under ionice
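The mtime-based suggestion could be sketched like this (a hypothetical standalone helper in Python, not openQA code; openQA itself is Perl). os.scandir() hands back DirEntry objects whose stat() result is cached, so sorting does not trigger a second stat() per file:

```python
import os

def assets_by_mtime(directory, newest_first=True):
    """Order asset files by modification time instead of by name."""
    entries = [e for e in os.scandir(directory) if e.is_file()]
    # DirEntry.stat() is cached after the first call, so the sort key
    # costs at most one stat() syscall per file
    entries.sort(key=lambda e: e.stat().st_mtime, reverse=newest_first)
    return [e.name for e in entries]
```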
Acceptance criteria
- AC1: Asset cleanup on OSD takes less time until it actually starts deleting files within one asset cleanup task
- AC2: There are fewer files kept in "other" -> #99420
Updated by livdywan about 3 years ago
- Subject changed from Asset cleanup takes very long to process 60k files in "other" to Asset cleanup takes very long to process 60k files in "other" size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by kraih about 3 years ago
- Status changed from Workable to In Progress
The syscalls are pretty easy to optimize, but I fear the SQL queries will end up being the real bottleneck. I'll confirm with a benchmark tomorrow to make sure.
Updated by VANASTASIADIS about 3 years ago
- Due date set to 2021-09-23
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz about 3 years ago
kraih plans to have the benchmark ready and then identify the (possibly multiple) places where we can optimize. There is no explicit "Z-A" sorting, but "opendir" and "readdir" calls which effectively traverse complete sections of the filesystem, which can be very expensive. Instead we should use the information from the database to identify where we actually need a filesystem lookup.
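The idea of driving the lookups from the database rather than from readdir() could be sketched as follows (illustrative Python only; asset_names stands in for a hypothetical SQL query on the assets table):

```python
import os

def sizes_for_known_assets(asset_names, base_dir):
    """Stat only the assets the database already knows about, instead of
    traversing the whole directory tree with opendir/readdir."""
    sizes = {}
    for name in asset_names:
        try:
            sizes[name] = os.stat(os.path.join(base_dir, name)).st_size
        except FileNotFoundError:
            sizes[name] = None  # registered in the DB but gone on disk
    return sizes
```

This way the syscall count scales with the number of registered assets, not with the total number of files on disk.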
Updated by kraih about 3 years ago
Opened a PR that removes most of the syscalls completely. https://github.com/os-autoinst/openQA/pull/4189
Updated by kraih about 3 years ago
- Status changed from In Progress to Feedback
PR has been merged, let's see how it performs in production.
Updated by okurz about 3 years ago
kraih wrote:
PR has been merged, let's see how it performs in production.
could you check?
Updated by mkittler about 3 years ago
There's currently a limit_assets task running (see https://openqa.suse.de/minion/jobs?task=limit_assets and https://openqa.suse.de/minion/jobs?id=2972890 in particular). It has already been running for more than half an hour. Not sure whether that's bad, because it most likely actually needs to clean up lots of assets, and this ticket is more about the idle case. So I would trigger another asset cleanup right after this one finishes to check.
Updated by okurz about 3 years ago
- Due date changed from 2021-09-23 to 2021-09-27
- Assignee changed from kraih to mkittler
@mkittler as discussed, to take over.
Updated by mkittler about 3 years ago
- Assignee changed from mkittler to kraih
It still takes 25 minutes to run an asset cleanup right after another asset cleanup has just run (so when there's almost nothing to delete). Maybe we can swap out refresh_size with ensure_size to avoid scanning assets whose size we already know. Of course this has the caveat that we won't notice when an asset changes in size. I can test how much time we would gain by hot-patching the asset cleanup code in production.
Updated by mkittler about 3 years ago
- Status changed from Feedback to In Progress
- Assignee changed from kraih to mkittler
Updated by mkittler about 3 years ago
Even with only ensure_size it is still running for over 10 minutes now. It looks like database updates are still slowing it down:
poll([{fd=13, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=13, revents=POLLIN}])
recvfrom(13, "2\0\0\0\4n\0\0\0\4C\0\0\0\rUPDATE 1\0Z\0\0\0\5I", 262144, 0, NULL, NULL) = 30
sendto(13, "B\0\0\0002\0dbdpg_p16636_7\0\0\0\0\2\0\0\0\007719"..., 73, MSG_NOSIGNAL, NULL, 0) = 73
poll([{fd=13, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=13, revents=POLLIN}])
recvfrom(13, "2\0\0\0\4n\0\0\0\4C\0\0\0\rUPDATE 1\0Z\0\0\0\5I", 262144, 0, NULL, NULL) = 30
sendto(13, "B\0\0\0003\0dbdpg_p16636_7\0\0\0\0\2\0\0\0\007719"..., 74, MSG_NOSIGNAL, NULL, 0) = 74
poll([{fd=13, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=13, revents=POLLIN}])
recvfrom(13, "2\0\0\0\4n\0\0\0\4C\0\0\0\rUPDATE 1\0Z\0\0\0\5I", 262144, 0, NULL, NULL) = 30
sendto(13, "B\0\0\0003\0dbdpg_p16636_7\0\0\0\0\2\0\0\0\007719"..., 74, MSG_NOSIGNAL, NULL, 0) = 74
I suspect that these are the updates for the "fixed" state of the asset.
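If these really are the "fixed" state updates, one way to cut them down is to skip the UPDATE when the stored value already matches, so idle runs issue no SQL at all. A hedged sketch (Python for illustration; 'execute' is a hypothetical callable wrapping the DB handle):

```python
def update_fixed_state(asset, is_fixed, execute):
    """Issue the UPDATE only when the stored value actually changes."""
    is_fixed = 1 if is_fixed else 0  # normalize truthiness for the boolean column
    if asset["fixed"] == is_fixed:
        return False  # nothing changed, skip the SQL round-trip
    execute("UPDATE assets SET fixed = ? WHERE id = ?", (is_fixed, asset["id"]))
    asset["fixed"] = is_fixed
    return True
```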
Updated by mkittler about 3 years ago
Even when updating the database only if the "fixed status" changes, it takes a long time to stat all those files; e.g. the job I'm currently testing has already been running for 10 minutes. The remaining syscalls are clearly related to the code which determines whether an asset is fixed:
getpid() = 12496
stat("/var/lib/openqa/share/factory/other/07110482-cve", {st_mode=S_IFREG|0644, st_size=2010, ...}) = 0
getpid() = 12496
stat("/var/lib/openqa/share/factory/other/07100435-net.tirpc_tests", {st_mode=S_IFREG|0644, st_size=2753, ...}) = 0
getpid() = 12496
stat("/var/lib/openqa/share/factory/other/07100432-sched", {st_mode=S_IFREG|0644, st_size=371, ...}) = 0
getpid() = 12496
stat("/var/lib/openqa/share/factory/iso/SLE-15-SP1-Installer-DVD-s390x-Build47.3-Media1.iso", 0x55999732d4f0) = -1 ENOENT (No such file or directory)
stat("/var/lib/openqa/share/factory/iso/fixed/SLE-15-SP1-Installer-DVD-s390x-Build47.3-Media1.iso", 0x55999732d4f0) = -1 ENOENT (No such file or directory)
stat("/var/lib/openqa/share/factory/iso/SLE-15-SP1-Installer-DVD-s390x-Build47.3-Media1.iso", 0x55999732d4f0) = -1 ENOENT (No such file or directory)
stat("/var/lib/openqa/share/factory/iso/fixed/SLE-15-SP1-Installer-DVD-s390x-Build47.3-Media1.iso", 0x55999732d4f0) = -1 ENOENT (No such file or directory)
stat("/var/lib/openqa/share/factory/iso/SLE-15-SP1-Installer-DVD-s390x-Build47.3-Media1.iso", 0x55999732d4f0) = -1 ENOENT (No such file or directory)
getpid() = 12496
stat("/var/lib/openqa/share/factory/hdd/SLES-15-SP3-aarch64-minimal_installed_for_LTP_broken.qcow2", 0x55999732d4f0) = -1 ENOENT (No such file or directory)
stat("/var/lib/openqa/share/factory/hdd/fixed/SLES-15-SP3-aarch64-minimal_installed_for_LTP_broken.qcow2", {st_mode=S_IFREG|0644, st_size=1179385856, ...}) = 0
getpid() = 12496
stat("/var/lib/openqa/share/factory/other/06203626-syscalls", {st_mode=S_IFREG|0644, st_size=30774, ...}) = 0
getpid() = 12496
stat("/var/lib/openqa/share/factory/other/06203625-dio", {st_mode=S_IFREG|0644, st_size=2184, ...}) = 0
Updated by mkittler about 3 years ago
When only updating the "fixed status" for assets we have already seen, it goes faster, as all the stat calls from the previous comment #97979#note-14 go away. I'm not sure whether that is safe to do, though.
Additionally, the UPDATE calls from #97979#note-13 remain because these are actually about updating the last job of each asset. I suppose that's really something we want, although we could likely skip it as well if the last job doesn't change anyway.
Parsing each asset's timestamp takes quite a long time, likely because it involves several syscalls per asset:
getpid() = 30348
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2298, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2298, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2298, ...}) = 0
flock(3, LOCK_EX) = 0
write(3, "[2021-09-23T13:16:41.0904 CEST] "..., 113) = 113
flock(3, LOCK_UN)
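One common mitigation for repeated stat("/etc/localtime") calls is exporting TZ once so libc can cache the timezone; whether that helps here depends on the libc and on how the logger formats its timestamps, so this is only an assumption, shown in Python (the gru worker is Perl, but it goes through the same libc localtime()):

```python
import os
import time

# Pin the timezone once; with TZ set, libc no longer needs to consult
# /etc/localtime on every localtime() call. "UTC" is used here purely
# for a deterministic illustration; production would keep local time.
os.environ["TZ"] = "UTC"
time.tzset()
stamp = time.strftime("%Y-%m-%dT%H:%M:%S", time.localtime(0))
```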
Updated by mkittler about 3 years ago
Even when skipping all filesystem checks for assets we have already seen, the cleanup takes 19 minutes. So we likely want to optimize the points from the last two paragraphs of my previous comment.
Updated by mkittler about 3 years ago
I've created two PRs: one containing optimizations which should not change the behavior (https://github.com/os-autoinst/openQA/pull/4226/files) and one with additional optimizations which slightly change the behavior (https://github.com/os-autoinst/openQA/pull/4227).
I'm not sure yet whether the syscalls in #97979#note-15 actually come from the timestamp parsing. In some local testing it doesn't seem to be the cause of these syscalls, so it is likely just some logging.
Updated by livdywan about 3 years ago
mkittler wrote:
I've been creating two PRs, one containing optimizations which should not change the behavior (https://github.com/os-autoinst/openQA/pull/4226/files) and one with additional optimizations which slightly change the behavior (https://github.com/os-autoinst/openQA/pull/4227).
I'm not sure yet whether the syscalls in #97979#note-15 actually come from the timestamp parsing. After some local testing it doesn't seem to cause these syscalls so it is likely just some logging.
openQA#4226 got merged
Updated by mkittler about 3 years ago
- Status changed from In Progress to Feedback
The PR with the possibly dangerous changes (https://github.com/os-autoinst/openQA/pull/4227) is still open, but I suspect these changes could be counterproductive. I would estimate that with only the optimizations done so far the cleanup can still take around 20 minutes in the idle case. I don't see an easy way to improve that (except for the changes in https://github.com/os-autoinst/openQA/pull/4227).
Updated by mkittler about 3 years ago
I've created a ticket for further improvements, so that the "dangerous" PR could be implemented as an optional feature: https://progress.opensuse.org/issues/99393
Updated by okurz about 3 years ago
- Status changed from Resolved to Feedback
I did not see you mention AC2, so I assume you overlooked it.
Updated by okurz about 3 years ago
- Copied to action #99420: Asset cleanup takes very long to process 60k files in "other" - now for real! added
Updated by mkittler about 3 years ago
- Status changed from Feedback to Resolved
See the new ticket #99420 for AC2. AC1 should be fulfilled: I've just run another cleanup on OSD which only took 4 minutes. There's not much load currently, but it still looks like a decrease.
Updated by mkittler about 3 years ago
- Status changed from Resolved to In Progress
Looks like the optimization caused a regression leading to failing Minion jobs:
DBIx::Class::Storage::DBI::_dbh_execute(): DBI Exception: DBD::Pg::st execute failed: ERROR: invalid input syntax for type boolean: "" [for Statement "UPDATE assets SET fixed = ?, t_updated = ? WHERE id = ?" with ParamValues: 1='', 2='2021-09-28 15:01:21', 3='27487345'] at /usr/share/openqa/script/../lib/OpenQA/Schema/ResultSet/Assets.pm line 99
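The error shows Perl's empty-string false ('') being bound to the Postgres boolean column, which DBD::Pg rejects. The fix amounts to normalizing the value before binding it; a language-neutral sketch of that normalization in Python (the real fix lives in the Perl code in Assets.pm):

```python
def to_db_bool(value):
    """Map any truthy/falsy value (including Perl-style '' for false)
    to 0/1, which Postgres accepts as input for a boolean column."""
    return 1 if value else 0
```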
Updated by mkittler about 3 years ago
PR for fixing the regression: https://github.com/os-autoinst/openQA/pull/4250
Updated by okurz about 3 years ago
- Due date changed from 2021-09-27 to 2021-10-01
mkittler wrote:
Looks like the optimization caused a regression leading to failing Minion jobs:
DBIx::Class::Storage::DBI::_dbh_execute(): DBI Exception: DBD::Pg::st execute failed: ERROR: invalid input syntax for type boolean: "" [for Statement "UPDATE assets SET fixed = ?, t_updated = ? WHERE id = ?" with ParamValues: 1='', 2='2021-09-28 15:01:21', 3='27487345'] at /usr/share/openqa/script/../lib/OpenQA/Schema/ResultSet/Assets.pm line 99
I am curious: how did you find it? Also, since you found this regression after resolving the ticket, I would like to take it as an opportunity to see where else we can apply improvements (monitoring, process, etc.).
Updated by mkittler about 3 years ago
I found it by looking at the Minion dashboard. We have a monitoring alert which would have triggered after a certain number of failed jobs. For other reasons we cannot lower the threshold, but there are already tickets for improving the situation. And obviously the file system alert would have fired as well at some point.
By the way, the PR has been merged and deployed and seems to work.
Updated by mkittler about 3 years ago
- Status changed from In Progress to Feedback
Updated by okurz about 3 years ago
- Status changed from Feedback to Resolved