action #99426: Asset cleanup takes very long to process 60k files in "other" - suboptimal logging? - openQA Project (public) - openSUSE Project Management Tool

Actions

Copy link

action #99426

closed

openQA Infrastructure (public) - action #97976: [alert] OSD file systems - assets

Asset cleanup takes very long to process 60k files in "other" - suboptimal logging?

Added by okurz over 3 years ago. Updated over 3 years ago.

Status:

Resolved

Priority:

Low

Assignee:

okurz

Category:

Feature requests

Target version:

Ready

Start date:

Due date:

% Done:

Estimated time:

Description

Motivation¶

In #97979 we actually focussed on optimizing the current algorithms when the original idea was to rethink the overall cleanup approach to handle "60k other assets".
As I have seen during initial investigation on #97979 and as confirmed by mkittler during asset cleanup there are often syscalls accessing /etc/localtime. mkittler pointed out that the asset lookup also does a lot of logging. In https://github.com/os-autoinst/openQA/blob/master/lib/OpenQA/Log.pm#L178 and https://github.com/os-autoinst/openQA/blob/master/lib/OpenQA/Log.pm#L180 we call gettimeofday and localtime a lot.

Acceptance criteria¶

AC1: Recurring log calls with timestamp do not have a significant performance impact

Related issues 2 (2 open — 0 closed)

Actions

Copy link

Updated by okurz over 3 years ago

Copied from action #99420: Asset cleanup takes very long to process 60k files in "other" - now for real! added

Actions

Copy link

Updated by okurz over 3 years ago

Target version changed from future to Ready

I benchmarked a comparison of logging a static string vs. recurring calls with gettimeofday+localtime

 1 #!/usr/bin/env perl
 2 use strictures;
 3 use Time::HiRes 'gettimeofday';
 4 use POSIX 'strftime';
 5 use Benchmark qw(:all);
 6 
 7 #my $cmd = <<"EOM";
 8 #echo "foo" && \
 9 #false && \
10 #echo "you should not see this"
11 #EOM
12     #system $cmd || die "failed";
13 
14 
15 my $level = 'debug';
16 
17 my $results = timethese(100000, {
18     'plain print' => sub { printf("[2021-09-28T13:45:08.0461 CEST] [debug]\n") },
19     'print with gettimeofday+strftime' => sub {
20         my $time = gettimeofday;
21         printf(sprintf(strftime("[%FT%T.%%04d %Z] [$level] ", localtime($time)), 1000 * ($time - int($time))) . "\n")},
22     });
23 cmpthese($results);

results

print with gettimeofday+strftime:  3 wallclock secs ( 0.47 usr +  0.71 sys =  1.18 CPU) @ 84745.76/s (n=100000)
                                     Rate print with gettimeofday+strftime plain print
print with gettimeofday+strftime  84746/s                               --        -75%
plain print                      344828/s                             307%          --

so in this case the "plain print" is 307% faster (or a factor of 4.07).

For comparison I saved the output of gettimeofday into a variable and did another run with just "print with strftime" resulting in 90000/s so a little bit faster but no major impact.

Actions

Copy link

Updated by okurz over 3 years ago

Status changed from New to Feedback

Let's discuss if vanilla Mojo::Log helps: https://github.com/os-autoinst/openQA/pull/4249

Actions

Copy link

Updated by okurz over 3 years ago

I hoped upstream Mojo::Log might be faster but it looks like it's the worst. From my benchmarking:

                                      Rate Mojo::Log print with gettimeofday+localtime print with localtime plain print
Mojo::Log                          23810/s        --                              -60%                 -71%        -93%
print with gettimeofday+localtime  58824/s      147%                                --                 -29%        -82%
print with localtime               83333/s      250%                               42%                   --        -75%
plain print                       333333/s     1300%                              467%                 300%          --

so likely it's more beneficial to optimize our use of the log format callback or reduce logging in asset cleanup.

Actions

Copy link

Updated by okurz over 3 years ago

After finding many more alternatives which are all bad or worse I found https://metacpan.org/pod/Time::Moment which is already in Factory and Leap and shows a 43% improvement vs. our current solution and looks much cleaner

                                               Rate Mojo::Log print with localtime+Time::Piece->strftime print with localtime print with gmtime print with gettimeofday+localtime print with Time::Moment plain print
Mojo::Log                                   35714/s        --                                       -43%                 -57%              -57%                              -64%                    -75%        -89%
print with localtime+Time::Piece->strftime  62500/s       75%                                         --                 -25%              -25%                              -37%                    -56%        -81%
print with localtime                        83333/s      133%                                        33%                   --               -0%                              -17%                    -42%        -75%
print with gmtime                           83333/s      133%                                        33%                   0%                --                              -17%                    -42%        -75%
print with gettimeofday+localtime          100000/s      180%                                        60%                  20%               20%                                --                    -30%        -70%
print with Time::Moment                    142857/s      300%                                       129%                  71%               71%                               43%                      --        -57%
plain print                                333333/s      833%                                       433%                 300%              300%                              233%                    133%          --

http://blogs.perl.org/users/chansen/2014/08/timemoment-vs-datetime.html also explains how Time::Moment can provide better performance

Also I found out that the timestamp that is generated in os-autoinst and openQA is incorrect. There is %%04d filled with the calculation result of 1000 * ($time - int($time)) meaning the milliseconds of current time but with a leading zero leading to values like .0513 which should be .513.

Code sample to check:

use Time::HiRes 'gettimeofday';
use POSIX 'strftime';
$time = gettimeofday; printf(sprintf(strftime("[%FT%T.%%04d %Z] [$level] ", localtime($time)), 1000 * ($time - int($time))) . " " . Time::Moment->now);

this for example outputs:

[2021-10-03T20:46:57.0513 CEST] []  2021-10-03T20:46:57.513202+02:00

Se the difference .0513 vs. .513202 ? So better we use Time::Moment->now->with_precision(3)

Updated https://github.com/os-autoinst/openQA/pull/4248

Actions

Copy link

Updated by okurz over 3 years ago

Also created https://github.com/os-autoinst/os-autoinst/pull/1801 for os-autoinst

Actions

Copy link

Updated by okurz over 3 years ago

both PRs for os-autoinst and openQA merged. I am awaiting production results.

Actions

Copy link

Updated by okurz over 3 years ago

Copied to action #100599: Asset cleanup takes very long to process 60k files in "other" - too verbose logging, switch some debug to trace? added

Actions

Copy link

Updated by okurz over 3 years ago

Status changed from Feedback to Resolved

I can see the new timestamp in the logging format effective. I know that Time::Moment showed to be more efficient than our p revious approaches but I can't say what's the real-time effect on openQA in production. I triggered an artificial assets cleanup run by configuring the asset cleanup threshold on OSD in /etc/openqa/openqa.ini to be "28" percent as the last entry on https://openqa.suse.de/minion/jobs?task=limit_assets stated that it does not need to run cleanup because there is 27% of free space. Then I restarted the gru service with systemctl restart openqa-gru and called /usr/share/openqa/script/openqa-enqueue-asset-cleanup, looked up the most likely correct process in watch 'ps auxf | grep gru' and called on that process strace -f -eopen,stat -p $pid. There are still many "stat" calls to /etc/localtime but I according to https://stackoverflow.com/questions/17149668/is-stat-an-expensive-system-call and https://www.linuxquestions.org/questions/programming-9/stat-call-performance-535082/ stat has some caching and that should be fine, meaning we don't need to optimize this further. Certainly the amount of logging information ending up in /var/log/openqa is huge and very verbose. I guess what we could do is turn some "debug" statements into "trace" which is (now) supported by https://docs.mojolicious.org/Mojo/Log#trace -> #100599

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

QA (public) » openQA Project (public)

Tags

Custom queries

action #99426

Asset cleanup takes very long to process 60k files in "other" - suboptimal logging?

Motivation¶

Acceptance criteria¶

Updated by okurz over 3 years ago

Updated by okurz over 3 years ago

Updated by okurz over 3 years ago

Updated by okurz over 3 years ago

Updated by okurz over 3 years ago

Updated by okurz over 3 years ago

Updated by okurz over 3 years ago

Updated by okurz over 3 years ago

Updated by okurz over 3 years ago