Project

General

Profile

Actions

action #97979

closed

openQA Infrastructure - action #97976: [alert] OSD file systems - assets

Asset cleanup takes very long to process 60k files in "other" size:M

Added by okurz about 3 years ago. Updated almost 3 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Feature requests
Target version:
Start date:
2021-09-02
Due date:
2021-10-01
% Done:

0%

Estimated time:

Description

Motivation

See #97976 . Attaching with strace to a gru process for asset cleanup shows that it takes very long to traverse files in the "other" directory (currently 60k files on OSD) which are all not big but seem to take a lot of time to traverse and also we process files seemingly in reverse alphabetic order

stat("/var/lib/openqa/share/factory/other/06577642-cve", {st_mode=S_IFREG|0644, st_size=1821, ...}) = 0
stat("/var/lib/openqa/share/factory/other/06577642-cve", {st_mode=S_IFREG|0644, st_size=1821, ...}) = 0
stat("/var/lib/openqa/share/factory/other/06577642-cve", {st_mode=S_IFREG|0644, st_size=1821, ...}) = 0
getpid()                                = 22883
stat("/var/lib/openqa/share/factory/other/06577641-uevent", {st_mode=S_IFREG|0644, st_size=54, ...}) = 0
stat("/var/lib/openqa/share/factory/other/06577641-uevent", {st_mode=S_IFREG|0644, st_size=54, ...}) = 0
stat("/var/lib/openqa/share/factory/other/06577641-uevent", {st_mode=S_IFREG|0644, st_size=54, ...}) = 0
getpid()                                = 22883
stat("/var/lib/openqa/share/factory/other/06577641-numa", {st_mode=S_IFREG|0644, st_size=538, ...}) = 0
stat("/var/lib/openqa/share/factory/other/06577641-numa", {st_mode=S_IFREG|0644, st_size=538, ...}) = 0
stat("/var/lib/openqa/share/factory/other/06577641-numa", {st_mode=S_IFREG|0644, st_size=538, ...}) = 0
getpid()                                = 22883
stat("/var/lib/openqa/share/factory/other/06577640-tracing", {st_mode=S_IFREG|0644, st_size=345, ...}) = 0
stat("/var/lib/openqa/share/factory/other/06577640-tracing", {st_mode=S_IFREG|0644, st_size=345, ...}) = 0
stat("/var/lib/openqa/share/factory/other/06577640-tracing", {st_mode=S_IFREG|0644, st_size=345, ...}) = 0
getpid()                                = 22883
stat("/var/lib/openqa/share/factory/other/06577635-pty", {st_mode=S_IFREG|0644, st_size=127, ...}) = 0
stat("/var/lib/openqa/share/factory/other/06577635-pty", {st_mode=S_IFREG|0644, st_size=127, ...}) = 0
stat("/var/lib/openqa/share/factory/other/06577635-pty", {st_mode=S_IFREG|0644, st_size=127, ...}) = 0
…
stat("/var/lib/openqa/share/factory/repo/SLE-12-SP5-SDK-POOL-s390x-Build0154-Media1.license", {st_mode=S_IFDIR|0755, st_size=59, ...}) = 0
stat("/var/lib/openqa/share/factory/repo/SLE-12-SP5-SDK-POOL-s390x-Build0154-Media1.license", {st_mode=S_IFDIR|0755, st_size=59, ...}) = 0
stat("/var/lib/openqa/share/factory/repo/SLE-12-SP5-SDK-POOL-s390x-Build0154-Media1.license", {st_mode=S_IFDIR|0755, st_size=59, ...}) = 0
getpid()                                = 22883
stat("/var/lib/openqa/share/factory/repo/SLE-12-SP5-SDK-POOL-s390x-Build0151-Media1.license", {st_mode=S_IFDIR|0755, st_size=59, ...}) = 0
stat("/var/lib/openqa/share/factory/repo/SLE-12-SP5-SDK-POOL-s390x-Build0151-Media1.license", {st_mode=S_IFDIR|0755, st_size=59, ...}) = 0
stat("/var/lib/openqa/share/factory/repo/SLE-12-SP5-SDK-POOL-s390x-Build0151-Media1.license", {st_mode=S_IFDIR|0755, st_size=59, ...}) = 0
getpid()                                = 22883

which seems to have the following results:

  1. We spend very long time processing "other" where we hardly delete anything before we go to "iso" or "hdd" with bigger files

Suggestion

  • Investigate if we really want to do Z-A sorting i.e. traverse more recent files by build number - or can we check mtime?
  • Keep track of files already processed between runs (can we do that?)
  • Reconsider ionicing the cleanup

Acceptance criteria

  • AC1: Asset cleanup on OSD takes less time until it actually starts deleting files within one asset cleanup task
  • AC2: There are less files kept in "other" --> #99420

Related issues 1 (1 open0 closed)

Copied to openQA Project - action #99420: Asset cleanup takes very long to process 60k files in "other" - now for real!New

Actions
Actions #1

Updated by livdywan about 3 years ago

  • Subject changed from Asset cleanup takes very long to process 60k files in "other" to Asset cleanup takes very long to process 60k files in "other" size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #2

Updated by kraih about 3 years ago

  • Assignee set to kraih

I'll take a closer look.

Actions #3

Updated by kraih about 3 years ago

  • Status changed from Workable to In Progress

The syscalls are pretty easy to optimize, but i fear the SQL queries will end up being the real bottleneck. But i'll confirm with a benchmark tomorrow to make sure.

Actions #4

Updated by VANASTASIADIS about 3 years ago

  • Due date set to 2021-09-23

Setting due date based on mean cycle time of SUSE QE Tools

Actions #5

Updated by okurz about 3 years ago

kraih plans to have the benchmark and then identify possibly multiple places where we can optimize. There is no explicit "Z-A" sorting but "opendir" and "readdir" calls which effectively traverse the complete sections of the filesystem which can be very expensive. Instead we should use the information from the database to identify where we actually need a filesystem-lookup.

Actions #6

Updated by kraih about 3 years ago

Opened a PR that removes most of the syscalls completely. https://github.com/os-autoinst/openQA/pull/4189

Actions #7

Updated by kraih about 3 years ago

  • Status changed from In Progress to Feedback

PR has been merged, let's see how it performs in production.

Actions #8

Updated by okurz almost 3 years ago

kraih wrote:

PR has been merged, let's see how it performs in production.

could you check?

Actions #9

Updated by mkittler almost 3 years ago

There's currently a limit_assets task running (see https://openqa.suse.de/minion/jobs?task=limit_assets and https://openqa.suse.de/minion/jobs?id=2972890 in particular). It has already been running for more than half an hour. Not sure whether that's bad because it most likely actually needs to clean up lots of assets and this was more about the idle case. So I would trigger another asset cleanup right after the one which has just finished to check.

Actions #10

Updated by okurz almost 3 years ago

  • Due date changed from 2021-09-23 to 2021-09-27
  • Assignee changed from kraih to mkittler

@mkittler as disscussed, to take over.

Actions #11

Updated by mkittler almost 3 years ago

  • Assignee changed from mkittler to kraih

It still takes 25 minutes to run an asset cleanup after another asset cleanup has just run (so when there's almost nothing to delete). Maybe we can swap out refresh_size with ensure_size to avoid scanning assets we already know the size of. Of course this has the caveat that we don't track when an asset changes in size. I can test how much time we would gain by hot-patching the asset cleanup code in production.

Actions #12

Updated by mkittler almost 3 years ago

  • Status changed from Feedback to In Progress
  • Assignee changed from kraih to mkittler
Actions #13

Updated by mkittler almost 3 years ago

Even with only ensure_size it is still running for over 10 minutes now. It looks like database updates are still slowing it down:

poll([{fd=13, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=13, revents=POLLIN}])
recvfrom(13, "2\0\0\0\4n\0\0\0\4C\0\0\0\rUPDATE 1\0Z\0\0\0\5I", 262144, 0, NULL, NULL) = 30
sendto(13, "B\0\0\0002\0dbdpg_p16636_7\0\0\0\0\2\0\0\0\007719"..., 73, MSG_NOSIGNAL, NULL, 0) = 73
poll([{fd=13, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=13, revents=POLLIN}])
recvfrom(13, "2\0\0\0\4n\0\0\0\4C\0\0\0\rUPDATE 1\0Z\0\0\0\5I", 262144, 0, NULL, NULL) = 30
sendto(13, "B\0\0\0003\0dbdpg_p16636_7\0\0\0\0\2\0\0\0\007719"..., 74, MSG_NOSIGNAL, NULL, 0) = 74
poll([{fd=13, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=13, revents=POLLIN}])
recvfrom(13, "2\0\0\0\4n\0\0\0\4C\0\0\0\rUPDATE 1\0Z\0\0\0\5I", 262144, 0, NULL, NULL) = 30
sendto(13, "B\0\0\0003\0dbdpg_p16636_7\0\0\0\0\2\0\0\0\007719"..., 74, MSG_NOSIGNAL, NULL, 0) = 74

I suspect that these are the updates for the "fixed" state of the asset.

Actions #14

Updated by mkittler almost 3 years ago

Even when updating the database only if the "fixed status" changes it takes long to run stat all those files, e.g. the job I'm currently testing is already running for 10 minutes. The remaining syscalls are clearly related to the code which determines whether an asset is fixed:

getpid()                                = 12496
stat("/var/lib/openqa/share/factory/other/07110482-cve", {st_mode=S_IFREG|0644, st_size=2010, ...}) = 0
getpid()                                = 12496
stat("/var/lib/openqa/share/factory/other/07100435-net.tirpc_tests", {st_mode=S_IFREG|0644, st_size=2753, ...}) = 0
getpid()                                = 12496
stat("/var/lib/openqa/share/factory/other/07100432-sched", {st_mode=S_IFREG|0644, st_size=371, ...}) = 0
getpid()                                = 12496
stat("/var/lib/openqa/share/factory/iso/SLE-15-SP1-Installer-DVD-s390x-Build47.3-Media1.iso", 0x55999732d4f0) = -1 ENOENT (Datei oder Verzeichnis nicht gefunden)
stat("/var/lib/openqa/share/factory/iso/fixed/SLE-15-SP1-Installer-DVD-s390x-Build47.3-Media1.iso", 0x55999732d4f0) = -1 ENOENT (Datei oder Verzeichnis nicht gefunden)
stat("/var/lib/openqa/share/factory/iso/SLE-15-SP1-Installer-DVD-s390x-Build47.3-Media1.iso", 0x55999732d4f0) = -1 ENOENT (Datei oder Verzeichnis nicht gefunden)
stat("/var/lib/openqa/share/factory/iso/fixed/SLE-15-SP1-Installer-DVD-s390x-Build47.3-Media1.iso", 0x55999732d4f0) = -1 ENOENT (Datei oder Verzeichnis nicht gefunden)
stat("/var/lib/openqa/share/factory/iso/SLE-15-SP1-Installer-DVD-s390x-Build47.3-Media1.iso", 0x55999732d4f0) = -1 ENOENT (Datei oder Verzeichnis nicht gefunden)
getpid()                                = 12496
stat("/var/lib/openqa/share/factory/hdd/SLES-15-SP3-aarch64-minimal_installed_for_LTP_broken.qcow2", 0x55999732d4f0) = -1 ENOENT (Datei oder Verzeichnis nicht gefunden)
stat("/var/lib/openqa/share/factory/hdd/fixed/SLES-15-SP3-aarch64-minimal_installed_for_LTP_broken.qcow2", {st_mode=S_IFREG|0644, st_size=1179385856, ...}) = 0
getpid()                                = 12496
stat("/var/lib/openqa/share/factory/other/06203626-syscalls", {st_mode=S_IFREG|0644, st_size=30774, ...}) = 0
getpid()                                = 12496
stat("/var/lib/openqa/share/factory/other/06203625-dio", {st_mode=S_IFREG|0644, st_size=2184, ...}) = 0
Actions #15

Updated by mkittler almost 3 years ago

When only updating the "fixed status" for assets which we've already seen it goes faster as all the stat calls from the previous comment #97979#note-14 go away. I'm not sure whether it is safe to do so, though.

Additionally, the update calls from #97979#note-13 remain because these are actually about updating the last job of each asset. I suppose that's really something we want although we could likely also skip it if the last job doesn't change anyways.

Parsing each asset's timestamp takes quite a long time, likely because it involves several syscalls per asset:

getpid()                                = 30348
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2298, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2298, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2298, ...}) = 0
flock(3, LOCK_EX)                       = 0
write(3, "[2021-09-23T13:16:41.0904 CEST] "..., 113) = 113
flock(3, LOCK_UN)
Actions #16

Updated by mkittler almost 3 years ago

So even when skipping all filesystem checks for assets we have already seen it takes 19 minutes. So we likely want to optimize the points from the last two paragraphs in my previous comment.

Actions #17

Updated by mkittler almost 3 years ago

I've been creating two PRs, one containing optimizations which should not change the behavior (https://github.com/os-autoinst/openQA/pull/4226/files) and one with additional optimizations which slightly change the behavior (https://github.com/os-autoinst/openQA/pull/4227).

I'm not sure yet whether the syscalls in #97979#note-15 actually come from the timestamp parsing. After some local testing it doesn't seem to cause these syscalls so it is likely just some logging.

Actions #18

Updated by livdywan almost 3 years ago

mkittler wrote:

I've been creating two PRs, one containing optimizations which should not change the behavior (https://github.com/os-autoinst/openQA/pull/4226/files) and one with additional optimizations which slightly change the behavior (https://github.com/os-autoinst/openQA/pull/4227).

I'm not sure yet whether the syscalls in #97979#note-15 actually come from the timestamp parsing. After some local testing it doesn't seem to cause these syscalls so it is likely just some logging.

openQA#4226 got merged

Actions #19

Updated by mkittler almost 3 years ago

  • Status changed from In Progress to Feedback

The PR with possibly dangerous changes (https://github.com/os-autoinst/openQA/pull/4227) is still open but I guess these changes could be counterproductive. I would estimate that with only the optimizations done so far the cleanup can still take around 20 minutes for the idle case. I don't see an easy way to improve that (except for the changes in https://github.com/os-autoinst/openQA/pull/4227).

Actions #20

Updated by mkittler almost 3 years ago

I've been creating a ticket for further improvements so the "dangerous" PR could be implemented as an optional feature: https://progress.opensuse.org/issues/99393

Actions #21

Updated by mkittler almost 3 years ago

  • Status changed from Feedback to Resolved
Actions #22

Updated by okurz almost 3 years ago

  • Status changed from Resolved to Feedback

I did not see you mention AC2 so I assume you overlooked it

Actions #23

Updated by okurz almost 3 years ago

  • Copied to action #99420: Asset cleanup takes very long to process 60k files in "other" - now for real! added
Actions #24

Updated by okurz almost 3 years ago

  • Description updated (diff)

Talked with mkittler. Seems like we got distracted and did some optimization but did not really address the original problem. We have created #99420 for this. Let's try again there :)

Actions #25

Updated by mkittler almost 3 years ago

  • Status changed from Feedback to Resolved

See the new ticket #99420 for AC2. AC1 should be fulfilled. I've just ran another cleanup on OSD which only took 4 minutes. There's not much load currently but it still looks like a decrease.

Actions #26

Updated by mkittler almost 3 years ago

  • Status changed from Resolved to In Progress

Looks like the optimization caused a regression leading to failing Minion jobs:

DBIx::Class::Storage::DBI::_dbh_execute(): DBI Exception: DBD::Pg::st execute failed: ERROR:  invalid input syntax for type boolean: "" [for Statement "UPDATE assets SET fixed = ?, t_updated = ? WHERE id = ?" with ParamValues: 1='', 2='2021-09-28 15:01:21', 3='27487345'] at /usr/share/openqa/script/../lib/OpenQA/Schema/ResultSet/Assets.pm line 99

(see https://openqa.suse.de/minion/jobs?task=limit_assets)

Actions #27

Updated by mkittler almost 3 years ago

Actions #28

Updated by okurz almost 3 years ago

  • Due date changed from 2021-09-27 to 2021-10-01

mkittler wrote:

Looks like the optimization caused a regression leading to failing Minion jobs:

DBIx::Class::Storage::DBI::_dbh_execute(): DBI Exception: DBD::Pg::st execute failed: ERROR:  invalid input syntax for type boolean: "" [for Statement "UPDATE assets SET fixed = ?, t_updated = ? WHERE id = ?" with ParamValues: 1='', 2='2021-09-28 15:01:21', 3='27487345'] at /usr/share/openqa/script/../lib/OpenQA/Schema/ResultSet/Assets.pm line 99

(see https://openqa.suse.de/minion/jobs?task=limit_assets)

I am curious, how did you find it? And also, since you found this regression after resolving I would like to take it as an opportunity to see where else we can apply improvements, monitoring, process, etc.

Actions #29

Updated by mkittler almost 3 years ago

I found it by looking at the Minion dashboard. We have a monitoring alert and it would have triggered after a certain amount of failed jobs. For other reasons we cannot lower the threshold but there are already tickets for improving the situation. And obviously the file system alert would have fired as well at some point.

By the way, the PR has been merged and deployed and seems to work.

Actions #30

Updated by mkittler almost 3 years ago

  • Status changed from In Progress to Feedback
Actions #31

Updated by okurz almost 3 years ago

  • Status changed from Feedback to Resolved

discussed it in daily and have concluded that #96263 would help. I bumped prio for #96263 accordingly and we are done here.discussed it in daily and have concluded that #96263 would help. I bumped prio for #96263 accordingly and we are done here.

Actions

Also available in: Atom PDF