action #99426
closedopenQA Infrastructure (public) - action #97976: [alert] OSD file systems - assets
Asset cleanup takes very long to process 60k files in "other" - suboptimal logging?
Description
Motivation¶
In #97979 we actually focussed on optimizing the current algorithms when the original idea was to rethink the overall cleanup approach to handle "60k other assets".
As I have seen during initial investigation on #97979 and as confirmed by mkittler during asset cleanup there are often syscalls accessing /etc/localtime. mkittler pointed out that the asset lookup also does a lot of logging. In https://github.com/os-autoinst/openQA/blob/master/lib/OpenQA/Log.pm#L178 and https://github.com/os-autoinst/openQA/blob/master/lib/OpenQA/Log.pm#L180 we call gettimeofday
and localtime
a lot.
Acceptance criteria¶
- AC1: Recurring log calls with timestamp do not have a significant performance impact
Updated by okurz about 3 years ago
- Copied from action #99420: Asset cleanup takes very long to process 60k files in "other" - now for real! added
Updated by okurz about 3 years ago
- Target version changed from future to Ready
I benchmarked a comparison of logging a static string vs. recurring calls with gettimeofday+localtime
1 #!/usr/bin/env perl
2 use strictures;
3 use Time::HiRes 'gettimeofday';
4 use POSIX 'strftime';
5 use Benchmark qw(:all);
6
7 #my $cmd = <<"EOM";
8 #echo "foo" && \
9 #false && \
10 #echo "you should not see this"
11 #EOM
12 #system $cmd || die "failed";
13
14
15 my $level = 'debug';
16
17 my $results = timethese(100000, {
18 'plain print' => sub { printf("[2021-09-28T13:45:08.0461 CEST] [debug]\n") },
19 'print with gettimeofday+strftime' => sub {
20 my $time = gettimeofday;
21 printf(sprintf(strftime("[%FT%T.%%04d %Z] [$level] ", localtime($time)), 1000 * ($time - int($time))) . "\n")},
22 });
23 cmpthese($results);
results
print with gettimeofday+strftime: 3 wallclock secs ( 0.47 usr + 0.71 sys = 1.18 CPU) @ 84745.76/s (n=100000)
Rate print with gettimeofday+strftime plain print
print with gettimeofday+strftime 84746/s -- -75%
plain print 344828/s 307% --
so in this case the "plain print" is 307% faster (or a factor of 4.07).
For comparison I saved the output of gettimeofday
into a variable and did another run with just "print with strftime" resulting in 90000/s so a little bit faster but no major impact.
Updated by okurz about 3 years ago
- Status changed from New to Feedback
Let's discuss if vanilla Mojo::Log helps: https://github.com/os-autoinst/openQA/pull/4249
Updated by okurz about 3 years ago
I hoped upstream Mojo::Log might be faster but it looks like it's the worst. From my benchmarking:
Rate Mojo::Log print with gettimeofday+localtime print with localtime plain print
Mojo::Log 23810/s -- -60% -71% -93%
print with gettimeofday+localtime 58824/s 147% -- -29% -82%
print with localtime 83333/s 250% 42% -- -75%
plain print 333333/s 1300% 467% 300% --
so likely it's more beneficial to optimize our use of the log format callback or reduce logging in asset cleanup.
Updated by okurz about 3 years ago
After finding many more alternatives which are all bad or worse I found https://metacpan.org/pod/Time::Moment which is already in Factory and Leap and shows a 43% improvement vs. our current solution and looks much cleaner
Rate Mojo::Log print with localtime+Time::Piece->strftime print with localtime print with gmtime print with gettimeofday+localtime print with Time::Moment plain print
Mojo::Log 35714/s -- -43% -57% -57% -64% -75% -89%
print with localtime+Time::Piece->strftime 62500/s 75% -- -25% -25% -37% -56% -81%
print with localtime 83333/s 133% 33% -- -0% -17% -42% -75%
print with gmtime 83333/s 133% 33% 0% -- -17% -42% -75%
print with gettimeofday+localtime 100000/s 180% 60% 20% 20% -- -30% -70%
print with Time::Moment 142857/s 300% 129% 71% 71% 43% -- -57%
plain print 333333/s 833% 433% 300% 300% 233% 133% --
http://blogs.perl.org/users/chansen/2014/08/timemoment-vs-datetime.html also explains how Time::Moment can provide better performance
Also I found out that the timestamp that is generated in os-autoinst and openQA is incorrect. There is %%04d
filled with the calculation result of 1000 * ($time - int($time))
meaning the milliseconds of current time but with a leading zero leading to values like .0513
which should be .513
.
Code sample to check:
use Time::HiRes 'gettimeofday';
use POSIX 'strftime';
$time = gettimeofday; printf(sprintf(strftime("[%FT%T.%%04d %Z] [$level] ", localtime($time)), 1000 * ($time - int($time))) . " " . Time::Moment->now);
this for example outputs:
[2021-10-03T20:46:57.0513 CEST] [] 2021-10-03T20:46:57.513202+02:00
Se the difference .0513
vs. .513202
? So better we use Time::Moment->now->with_precision(3)
Updated by okurz about 3 years ago
Also created https://github.com/os-autoinst/os-autoinst/pull/1801 for os-autoinst
Updated by okurz about 3 years ago
both PRs for os-autoinst and openQA merged. I am awaiting production results.
Updated by okurz about 3 years ago
- Copied to action #100599: Asset cleanup takes very long to process 60k files in "other" - too verbose logging, switch some debug to trace? added
Updated by okurz about 3 years ago
- Status changed from Feedback to Resolved
I can see the new timestamp in the logging format effective. I know that Time::Moment showed to be more efficient than our p revious approaches but I can't say what's the real-time effect on openQA in production. I triggered an artificial assets cleanup run by configuring the asset cleanup threshold on OSD in /etc/openqa/openqa.ini to be "28" percent as the last entry on https://openqa.suse.de/minion/jobs?task=limit_assets stated that it does not need to run cleanup because there is 27% of free space. Then I restarted the gru service with systemctl restart openqa-gru
and called /usr/share/openqa/script/openqa-enqueue-asset-cleanup
, looked up the most likely correct process in watch 'ps auxf | grep gru'
and called on that process strace -f -eopen,stat -p $pid
. There are still many "stat" calls to /etc/localtime but I according to https://stackoverflow.com/questions/17149668/is-stat-an-expensive-system-call and https://www.linuxquestions.org/questions/programming-9/stat-call-performance-535082/ stat has some caching and that should be fine, meaning we don't need to optimize this further. Certainly the amount of logging information ending up in /var/log/openqa is huge and very verbose. I guess what we could do is turn some "debug" statements into "trace" which is (now) supported by https://docs.mojolicious.org/Mojo/Log#trace -> #100599