Project

General

Profile

action #55922

limit tasks needs to be limited (in runtime)

Added by coolo 9 months ago. Updated about 1 month ago.

Status:
Feedback
Priority:
Normal
Assignee:
Category:
Concrete Bugs
Target version:
Start date:
2019-08-24
Due date:
% Done:

0%

Estimated time:
Difficulty:
medium
Duration:

Description

I changed some job groups to delete jobs for around 200 days. The performance of deleting jobs is unfortunately not good - and so we increased the lock timeout.

But this created another problem: we gained 7% in the assets partition in the timeframe before I stopped limit_logs_and_results. Fortunately that only meant going from 71% to 78% - but if we were closer to the limit (as we usually are when we change job groups to delete 200 days :), it would be a desaster.

So possibly the task itself should verify it's within 2 hours and prefer cleaning up more later.


Related issues

Related to openQA Project - action #55241: Prevent spamming cleanup jobsResolved2019-08-08

History

#1 Updated by mkittler 7 months ago

#2 Updated by cdywan 7 months ago

So would that mean we have a timer every 2 hours and the task itself would be aware of its runtime and always terminate before the 2 hours are up?

#3 Updated by okurz 6 months ago

I don't understand how this is related to #55241 , isn't this what we discussed as alternative in #55241 to unconditionally enqueue any tasks but make them delete themselves if one is already running?

#4 Updated by coolo 6 months ago

This is not about limiting the count or frequency, this is about limiting the runtime of the job itself. And as such it's related: the frequency you schedule in needs to be in relationship to the expected runtime.

#5 Updated by okurz 6 months ago

  • Subject changed from limit tasks needs to be limited to limit tasks needs to be limited (in runtime)

ok, good. So I updated the subject. I assume it's rather a feature request unless we have a regression?

#6 Updated by coolo 6 months ago

That highly depends on your definition of bug and feature :)

But being able to limit logs without running out of disk space due to assets being unlimited is a crucial aspect, so the increased lock timeout causes a regression in this context.

#7 Updated by mkittler 4 months ago

  • Assignee set to mkittler
  • Target version changed from Ready to Current Sprint

Currently we're seeing the problem again on o3. The limit_results_and_logs task keeps running forever. It isn't clear how long it actually takes because the service was killed by systemd before it could finish. So at least for that particular cleanup task it would make sense to interrupt (with a successful result) at some point to continue later. (The other cleanup tasks are usually executed quite fast or at least under 2 hours.)

#8 Updated by mkittler 4 months ago

The main problem is really the limit_results_and_logs task. But simply stopping and restarting it will not help because then it needs to go though all screenshot IDs again from the beginning (and that is what needs time). Ideas which came up in the chat today:

  1. Split the job into smaller "sub jobs". Each sub job does a specific range of screenshot IDs. That should be easy to implement because the query is already done in batches.
  2. Run the cleanup task less often so it is less disturbing.
  3. Use refcounting. It might be problematic as well due to lock congestion and of course is also harder to implement then 1. and 2..
  4. Use hardlinks and let the filesytem do the refcounting. Although certain limitations might not be relevant anymore it isn't clear how well this solution will perform/scale.

So maybe a combination of 1. and 2. is the best we can do for now?


By the way, the SQL query we use to query one batch looks like this:

openqa=# EXPLAIN ANALYZE SELECT me.id, me.filename
         FROM screenshots me
         LEFT OUTER JOIN screenshot_links links_outer
         ON links_outer.screenshot_id = me.id
         WHERE me.id BETWEEN 247 AND 200247
         AND links_outer.screenshot_id is NULL;
                                                                                    QUERY PLAN                                                                                     
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop Anti Join  (cost=21.67..2345.99 rows=441 width=43) (actual time=1631.959..1631.959 rows=0 loops=1)
   ->  Bitmap Heap Scan on screenshots me  (cost=21.10..1732.76 rows=443 width=43) (actual time=33.954..1434.874 rows=242 loops=1)
         Recheck Cond: ((id >= 247) AND (id <= 200247))
         Heap Blocks: exact=182
         ->  Bitmap Index Scan on screenshots_pkey  (cost=0.00..20.99 rows=443 width=0) (actual time=33.909..33.909 rows=242 loops=1)
               Index Cond: ((id >= 247) AND (id <= 200247))
   ->  Index Only Scan using screenshot_links_idx_screenshot_id on screenshot_links links_outer  (cost=0.57..894.06 rows=1099 width=4) (actual time=0.809..0.809 rows=1 loops=242)
         Index Cond: (screenshot_id = me.id)
         Heap Fetches: 9
 Planning time: 0.535 ms
 Execution time: 1632.010 ms
(11 Zeilen)

#9 Updated by mkittler 4 months ago

Yesterday one limit_results_and_logs even concluded on o3 - with a runtime of 6 hours.


The chosen screenshot ID range in the example above wasn't very representative. Usually it takes about 8 s to select one batch of 200k screenshot IDs:

openqa=> EXPLAIN ANALYZE SELECT me.id, me.filename
openqa->          FROM screenshots me
openqa->          LEFT OUTER JOIN screenshot_links links_outer
openqa->          ON links_outer.screenshot_id = me.id
openqa->          WHERE me.id BETWEEN 100247 AND 1001247
openqa->          AND links_outer.screenshot_id is NULL;
                                                                                     QUERY PLAN                                                                                     
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop Anti Join  (cost=45.42..9912.24 rows=1982 width=43) (actual time=7663.669..7663.669 rows=0 loops=1)
   ->  Bitmap Heap Scan on screenshots me  (cost=44.85..7365.28 rows=1992 width=43) (actual time=60.916..7635.385 rows=1052 loops=1)
         Recheck Cond: ((id >= 100247) AND (id <= 1001247))
         Heap Blocks: exact=799
         ->  Bitmap Index Scan on screenshots_pkey  (cost=0.00..44.36 rows=1992 width=0) (actual time=60.764..60.764 rows=1052 loops=1)
               Index Cond: ((id >= 100247) AND (id <= 1001247))
   ->  Index Only Scan using screenshot_links_idx_screenshot_id on screenshot_links links_outer  (cost=0.57..778.05 rows=1099 width=4) (actual time=0.022..0.022 rows=1 loops=1052)
         Index Cond: (screenshot_id = me.id)
         Heap Fetches: 45
 Planning time: 0.586 ms
 Execution time: 7663.729 ms
(11 Zeilen)

I'm also wondering why o3 is performing so much worse than OSD. This is even notable on a simple select count(id):

OSD:

openqa=> explain analyze select count(id) from screenshots;
                                                                     QUERY PLAN                                                                      
-----------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=372845.63..372845.64 rows=1 width=8) (actual time=16669.506..16669.506 rows=1 loops=1)
   ->  Gather  (cost=372845.42..372845.63 rows=2 width=8) (actual time=16669.443..16849.137 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate  (cost=371845.42..371845.43 rows=1 width=8) (actual time=16664.104..16664.105 rows=1 loops=3)
               ->  Parallel Seq Scan on screenshots  (cost=0.00..351882.13 rows=7985313 width=4) (actual time=0.039..15647.008 rows=6500121 loops=3)
 Planning time: 0.182 ms
 Execution time: 16849.343 ms
(8 Zeilen)

o3:

openqa=> explain analyze select count(id) from screenshots;
                                                                      QUERY PLAN                                                                      
------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=250941.70..250941.71 rows=1 width=8) (actual time=80145.093..80145.094 rows=1 loops=1)
   ->  Gather  (cost=250941.49..250941.70 rows=2 width=8) (actual time=80144.893..80148.362 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate  (cost=249941.49..249941.50 rows=1 width=8) (actual time=80136.635..80136.635 rows=1 loops=3)
               ->  Parallel Seq Scan on screenshots  (cost=0.00..234826.59 rows=6045959 width=4) (actual time=41.652..79084.379 rows=4858078 loops=3)
 Planning time: 0.155 ms
 Execution time: 80149.792 ms
(8 Zeilen)

So o3 is 4.75 times slower than OSD even though OSD has 1.33 times more screenshots.

#10 Updated by cdywan 4 months ago

mkittler wrote:

2) Run the cleanup task less often so it is less disturbing.

https://github.com/os-autoinst/openQA/pull/2712

#11 Updated by cdywan 4 months ago

  • Status changed from New to In Progress

#12 Updated by andriinikitin 4 months ago

We are not really looking for alternative implementations at the moment for limit job, but an idea came to me and Marius advised to mention it here. So the plan may be similar to "screenshot usage counter" we discussed earlier, just use expiration date instead. Implementation may be like this:

  1. Job has expires_on column which is set to now()+$keep_in_days during creation
  2. Screenshots have also expires_on column, which is updated to $job->expires_on when it is linked to a job. (or $screenshot->expires_on = max($screenshot->expires_on, $job->expires_on))
  3. Now deleting jobs and screenshots should be simple - only those which have expires_on < now() Could this work?

Obvious disadvantage of this approach is a challenge when $keep_in_days configuration is updated. But I think it is a reasonable tradeoff to claim that configuration will apply on future jobs only.

#13 Updated by mkittler 4 months ago

  • Status changed from In Progress to New

The PR to run the task less often has been merged. I'm setting the issue back to new because it would be good to discuss andriinikitin's idea.

So far the only disadvantage I see is the one he's already mentioned in his comment. However, I wouldn't underestimate that disadvantage. Imaging you set up your new openQA instance and don't put a lot of thought into cleanup. At some point you're running out of disk space and therefore decrease the "cleanup limits" for some job groups. So far this immediately frees disk space on the next cleanup run like one would expect. With andriinikitin proposal it wouldn't be possible to free disk space immediately and the users might even think that the result cleanup is not working.

We could of course update $screenshot->expires_on of all relevant screenshots when somebody changes the limits e.g. for a job group and when we reload the config file. However, updating all relevant screenshots sounds like a very expensive operation.

#14 Updated by mkittler 3 months ago

  • Status changed from New to In Progress

I'd like to implement this as originally planned considering the downsides of andriinikitin's idea. See this draft for how it would look like: https://github.com/os-autoinst/openQA/pull/2710

#15 Updated by okurz 3 months ago

So https://github.com/os-autoinst/openQA/pull/2710 has been merged and is by now deployed on both o3 and osd. On o3 I am worried because of 314G 93% /space. This is not much headroom. I think we had more space available last time I checked, maybe a week ago. Please closely monitor. https://thruk.suse.de/thruk/cgi-bin/extinfo.cgi?type=2&host=ariel-opensuse.suse.de&service=space%20partition#pnp_th4/1580595009/1583359809/0 also shows a graph. For OSD we have received a monitoring alert about the /results partition so please also look into https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?fullscreen&edit&tab=alert&panelId=74&orgId=1&refresh=30s&from=now-2d&to=now and monitor carefully over the next days.

#16 Updated by mkittler 3 months ago

  • Status changed from In Progress to Feedback

I don't have permissions to view the graph on thruk.suse.de. On stats.openqa-monitor.qa.suse.de there's a decline in disk usage again. The previous increase might be explained by coolo's mail "Do not blindly raise the days to keep logs" and has nothing to do with my changes.

Looking at the Minion dashboards of OSD and o3 it seems that the splitting worked. However, the runtime of the splitted jobs varies a lot. The actual deletion of the files in the filesystem contributes a lot to the time so batches with a lot of actual deletions take significantly longer to execute than others.

For instance on o3 there was still a job which ran for 16 hours and then failed with the result "Worker went away". That was the range before the last range. Maybe previously the cleanup didn't come so far so there was simply an unusually high amount of screenshots to be cleaned up within that batch. (I've been following the GRU logs while the task was running from time to time and it looked like it really deleted constantly screenshots.) I've now restarted the failed job for that range. Let's see whether it looks as "normal" as on OSD next time the full cleanup is triggered on o3.

#17 Updated by mkittler 3 months ago

On o3 the Minion dashboard looks now good, too. The execution time of a splitted job varies from a few minutes to 4 hours. That's not really what I had aimed for but at least there are not failed/aborted/timed-out jobs anymore which is already a win. We could easily tweak the parameters to split the jobs into smaller parts (or make that configurable) but I guess for now it is ok.

I'm wondering why it looks so different on OSD. Everything looks good there, too but maybe a little bit too good? The execution time of a splitted job varies there from a few seconds (the first ranges which likely contain almost no screenshots anymore) to 19 minutes. However, the filesystem usage looks ok on stats.openqa-monitor.qa.suse.de so I guess these jobs really just don't take long to execute.

#18 Updated by mkittler 3 months ago

On OSD we're fine but on o3 there's a failed job with result "Worker went away" again and a runtime of over 6 hours. So for o3 the splitting parameters should be tweaked to get smaller/more jobs. That means it would likely be useful to have the splitting parameters configurable (instead of defining them via Perl constants). What do you think?

#19 Updated by okurz 3 months ago

your approach in https://github.com/os-autoinst/openQA/pull/2821 could work with a lot of expertise on the side of instance admins which I would not assume. In the end I don't think this will fly long-term but we need to find a better, automatic approach. Having timeouts to limit the runtime makes sense and of course we need to ensure that cleanup eventually finishes. Do you consider it feasible to have an automatic algorithm that tries different values, e.g. stepwise increase/decrease the parameters until runtime/amount limits are kept?

#20 Updated by mkittler 3 months ago

could work with a lot of expertise

Or with a little bit trial and error by the admin.

Do you consider it feasible to have an automatic algorithm that tries different values, e.g. stepwise increase/decrease the parameters until runtime/amount limits are kept?

It would be a Minion job which keeps track of Minion jobs and based on that re-configures these Minion jobs. Sounds like an overkill.

#21 Updated by mkittler 2 months ago

On o3 the max. job runtime is still 4 hours so I set screenshot_cleanup_batches_per_minion_job = 350 to spread the cleanup into one more Minion job. I've just edited the config file on the host under /etc. I hope that's persistent (not sure about any interfering btrfs or salt configuration on that system).

#22 Updated by okurz 2 months ago

mkittler wrote:

I hope that's persistent (not sure about any interfering btrfs or salt configuration on that system).

yes, that's persistent. btrfs snapshots have no effect here and the system is not currently managed by salt.

#23 Updated by mkittler 2 months ago

My last comment was lost. To put it in a nutshell: Even with the new config there are still long-running cleanup jobs on o3 and even the cleanup of the results took too long and was aborted twice last weekend. So this needs further optimization although things are at least keep going (without manual help).

#24 Updated by mkittler about 1 month ago

There are still failed jobs on o3. All batches except for one are processed within 2 hours and most only need a few minutes.

The problematic batch is:

{
  "args" => [
    {
      "max_screenshot_id" => 350000246,
      "min_screenshot_id" => 280000247,
      "screenshots_per_batch" => 200000
    }
  ],
  "attempts" => 1,
  "children" => [],
  "created" => "2020-04-18T02:27:02.8141Z",
  "delayed" => "2020-04-18T02:27:02.8141Z",
  "finished" => "2020-04-18T08:33:53.35698Z",
  "id" => 138425,
  "notes" => {
    "gru_id" => 17091637,
    "ttl" => 172800
  },
  "parents" => [],
  "priority" => 4,
  "queue" => "default",
  "result" => "Worker went away",
  "retried" => undef,
  "retries" => 0,
  "started" => "2020-04-18T02:27:02.82711Z",
  "state" => "failed",
  "task" => "limit_screenshots",
  "time" => "2020-04-23T12:09:46.36116Z",
  "worker" => 288
}

I'm following the currently ongoing task for that batch and it seems to delete screenshots continuously (and does not hang).

Besides that, 19 days ago there was even a failing limit_results_and_logs task.

Not sure how to improve this. The tasks are already interrupted after some time and started again on the next day. Stopping the task after 2 hours is likely not that helpful because this way old screenshots will just pile up even more. I could split it even further to prevent the "Worker went away" failures visible on the dashboard but that leads to lots of very small Minion jobs. Splitting only the problematic batch is maybe quite some effort to implement. Maybe we can also increase the timeout for the Minion jobs. (Wasn't that thing invented to out-source long lasting tasks?)

Also available in: Atom PDF