action #41459

closed

[sle][functional][u] Explicit test module for btrfs snapshots cleanup performance

Added by okurz about 6 years ago. Updated about 5 years ago.

Status: Rejected
Priority: Normal
Assignee:
Category: New test
Target version: SUSE QA - Milestone 27
Start date: 2018-08-01
Due date:
% Done: 0%
Estimated time:
Difficulty:

Description

Motivation

boo#1063638 is the "openSUSE/SLE sucks bug": it kills system performance, frustrates people and fails many openQA tests (#39059). It appeared after the feature from https://fate.suse.com/312751 was introduced. Existing test modules do not show the problem(s) in an obvious enough way, even though we also reference the bug in e.g. force_scheduled_tasks, and without our help the bug fix seems to be going nowhere. We already have test modules like btrfs_qgroups and snapper_cleanup, which we should complement with more test coverage that explicitly covers the system performance impact.

Acceptance criteria

Suggestions

  • Review existing test modules force_scheduled_tasks, btrfs_qgroups and snapper_cleanup
  • Create a new test module that reproduces the bug, e.g. try to fill up the disk a lot, create snapshots, delete fill-up data again, repeat, then trigger the maintenance tasks as in force_scheduled_tasks and measure the load impact
  • During a test run, try tools like stress-ng with a high IO load or trigger tests with bonnie++
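The second suggestion could be sketched roughly as below, as a helper run inside the system under test. This is only an illustration under assumptions: the function names, the snapshot naming scheme and the idea of passing the maintenance command in as a parameter are hypothetical and not taken from force_scheduled_tasks or any existing test module; only `btrfs subvolume snapshot` is a real command here. Python is used for brevity, actual openQA test modules are written in Perl.

```python
import os
import subprocess
from pathlib import Path


def churn_commands(subvol, snap_dir, rounds):
    """Return one `btrfs subvolume snapshot` command per round (nothing is executed)."""
    return [
        ["btrfs", "subvolume", "snapshot", str(subvol), f"{snap_dir}/snap-{i}"]
        for i in range(rounds)
    ]


def fill(path, size_mib):
    """Write random (incompressible) data and sync it to disk."""
    with open(path, "wb") as fh:
        for _ in range(size_mib):
            fh.write(os.urandom(1024 * 1024))
        fh.flush()
        os.fsync(fh.fileno())


def churn(subvol, snap_dir, rounds, size_mib, run=subprocess.run):
    """Fill up, snapshot, delete the fill-up data again, repeat."""
    for i, snapshot_cmd in enumerate(churn_commands(subvol, snap_dir, rounds)):
        data = Path(subvol) / f"fill-{i}.bin"
        fill(data, size_mib)
        run(snapshot_cmd, check=True)
        # After this, only the snapshot still references the written extents.
        data.unlink()


def measure_maintenance(maintenance_cmd, run=subprocess.run):
    """Return the 1-minute load average before and after the maintenance command.

    Crude on purpose: the load average lags behind and is only a first indicator.
    """
    before = os.getloadavg()[0]
    run(maintenance_cmd, check=True)
    after = os.getloadavg()[0]
    return before, after
```

A call could look like `churn("/mnt/data", "/mnt/snapshots", rounds=10, size_mib=512)` followed by `measure_maintenance(["btrfs", "balance", "start", "-dusage=50", "/mnt"])`; the paths, sizes and the usage filter are placeholders, not values known to trigger the bug.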

Files

btrfs_stress.sh (2.32 KB), szarate, 2019-01-24 08:54

Related issues 4 (0 open, 4 closed)

Related to openQA Tests - action #31351: [functional][u][medium] force_cron_run does not actually run any crons (occasionally), Resolved, zluo, 2018-02-03, 2018-07-03

Related to openQA Tests - action #43784: [functional][y][sporadic] test fails in yast2_snapper now reproducibly not exiting the "show differences" screen, Resolved, oorlov, 2018-11-14

Copied from openQA Tests - action #39059: [sle][functional][y] detect "openSUSE sucks bug" about btrfs balance and record_soft_fail (was: yast2_gui tests modules as application could not start up), Resolved, riafarov, 2018-08-01, 2018-10-09

Copied to openQA Tests - action #41462: [sle][functional][u][medium] Mask btrfs maintenance cron jobs as well, Resolved, oorlov, 2018-08-01, 2018-10-09

Actions #1

Updated by okurz about 6 years ago

  • Copied from action #39059: [sle][functional][y] detect "openSUSE sucks bug" about btrfs balance and record_soft_fail (was: yast2_gui tests modules as application could not start up) added
Actions #2

Updated by okurz about 6 years ago

  • Related to action #31351: [functional][u][medium] force_cron_run does not actually run any crons (occasionally) added
Actions #3

Updated by okurz about 6 years ago

  • Copied to action #41462: [sle][functional][u][medium] Mask btrfs maintenance cron jobs as well added
Actions #4

Updated by SLindoMansilla about 6 years ago

  • Due date changed from 2018-10-09 to 2018-10-23

We need to clarify what "explicitly reproduce" means: a new test suite, or an improvement of the existing system_performance test suite?

This can probably be split.

Not enough capacity during sprint 27, and this is better done after #41462.

Actions #5

Updated by okurz about 6 years ago

  • Target version changed from Milestone 19 to Milestone 20
Actions #6

Updated by okurz almost 6 years ago

  • Description updated (diff)
  • Due date deleted (2018-10-23)
  • Status changed from New to Workable
  • Target version changed from Milestone 20 to Milestone 22

#41462 closed, now we can try again. Discussed in SP

Actions #7

Updated by jorauch almost 6 years ago

  • Assignee set to jorauch
Actions #8

Updated by jorauch almost 6 years ago

  • Status changed from Workable to In Progress

I have actually been working on this for a while, and I wonder: what is the difference to 'snapper_cleanup'?

Actions #9

Updated by okurz almost 6 years ago

I mention snapper_cleanup in the ticket description because there is certainly some overlap. However, as in the suggestions in the ticket description, what I propose is to simulate the workload of a heavily used and messy disk, then trigger the btrfs maintenance jobs (not snapper) and see how the system is impacted.

Actions #10

Updated by jorauch almost 6 years ago

  • Status changed from In Progress to Workable
  • Assignee deleted (jorauch)

Since I failed to deliver within a week I will unassign.

Actions #11

Updated by okurz almost 6 years ago

  • Related to action #43784: [functional][y][sporadic] test fails in yast2_snapper now reproducibly not exiting the "show differences" screen added
Actions #12

Updated by okurz almost 6 years ago

jorauch wrote:

Since I failed to deliver within a week I will unassign.

ok, so what did you accomplish or what did you learn which would be good to know for the next assignee?

Actions #13

Updated by szarate almost 6 years ago

  • Subject changed from [sle][functional][u] Explicit test module for btrfs snapshots cleanup performance to [sle][functional][u] Reproduce boo#1063638 - explicitly trigger btrfs maintenance (btrfs performance issue)
  • Description updated (diff)
Actions #14

Updated by okurz almost 6 years ago

  • Subject changed from [sle][functional][u] Reproduce boo#1063638 - explicitly trigger btrfs maintenance (btrfs performance issue) to [sle][functional][u] Explicit test module for btrfs snapshots cleanup performance

@szarate I don't think your change of the subject is correct. We already explicitly trigger btrfs maintenance, see the test module "force_scheduled_tasks". This ticket here is about not triggering maintenance tasks explicitly.

Actions #15

Updated by agraul almost 6 years ago

  • Status changed from Workable to In Progress
  • Assignee set to agraul
Actions #17

Updated by agraul over 5 years ago

WIP PR: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/6487

I'll try to find a way to create more IO load at the end of the test, just before the measurements start and see if that has an influence on the disk IO queue. If I can't find a promising way by the end of the day, I'll follow up with a better summary of what I tried and put the ticket back in the pool for anyone to take.

Actions #18

Updated by agraul over 5 years ago

  • Status changed from In Progress to Workable
  • Assignee deleted (agraul)

I am unassigning myself; I did not make enough progress to believe I can create a good reproducer now. As promised, here is a summary of what I did and what I think should be done next:

My approach

Create dummy data in a subvolume, then create a btrfs snapshot of it
(a new subvolume that shares data with the snapshotted one). Delete
the data in the original subvolume, so that only the snapshot is still
"pointing to it". The idea here is that btrfs metadata is created and
changed, giving btrfs balance some work to do.

Read the IO queue from /proc/diskstats to determine the load, and later
give stress-ng a try by running a benchmark at the beginning and at the
end, while the btrfs maintenance jobs do their work.
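For reference, reading the IO queue from /proc/diskstats could look roughly like the sketch below. It is not the code from the WIP PR; the names `DiskLoad`, `parse_diskstats` and `average_queue_depth` are made up for illustration. The field positions follow the kernel's iostats documentation: after major, minor and device name, the 9th statistic is the number of I/Os currently in progress, the 10th the milliseconds spent doing I/O and the 11th the weighted milliseconds spent doing I/O.

```python
from typing import Dict, NamedTuple


class DiskLoad(NamedTuple):
    in_flight: int       # I/Os currently in progress
    io_ms: int           # total milliseconds the device spent doing I/O
    weighted_io_ms: int  # I/O milliseconds weighted by the number of I/Os in progress


def parse_diskstats(text: str) -> Dict[str, DiskLoad]:
    """Map device name to its queue-related counters from /proc/diskstats content."""
    loads = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 14:  # skip blank or unexpectedly short lines
            continue
        loads[fields[2]] = DiskLoad(int(fields[11]), int(fields[12]), int(fields[13]))
    return loads


def average_queue_depth(before: DiskLoad, after: DiskLoad, elapsed_ms: float) -> float:
    """Average number of queued I/Os between two samples taken elapsed_ms apart."""
    return (after.weighted_io_ms - before.weighted_io_ms) / elapsed_ms


def read_diskstats(path: str = "/proc/diskstats") -> Dict[str, DiskLoad]:
    """Read and parse the live counters."""
    with open(path) as fh:
        return parse_diskstats(fh.read())
```

Sampling `read_diskstats()` once before and once during the maintenance jobs and feeding both samples into `average_queue_depth` gives a number comparable across runs, whereas `in_flight` alone is only a momentary snapshot.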

I ran the tests on Tumbleweed while developing. While it is different
from SLE12 in regards to kernel / filesystem, the very same bug hit me
two weeks ago on my TW Laptop, so it is not (completely) fixed in TW
yet.

Why didn't I reproduce bsc#1063638?

Firstly, my approach of creating work for btrfs to clean up was probably
not good enough. I tried writing data in different ways, but
ultimately it was not enough to cause changes in system performance
when running the maintenance jobs.

Secondly, one issue I had was starting different things that cause the
load asynchronously in openQA. There might be a simple way to solve
this part, but I did not come across it.
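One conceivable way around the asynchronous part is to let a small helper inside the system under test start the load generator in the background, run the maintenance job in the foreground and take samples until it finishes. The sketch below only illustrates that idea: `run_under_load` and its parameters are hypothetical, it uses plain Python `subprocess` rather than the openQA test API, and it has not been tried in an openQA job.

```python
import subprocess
import time


def run_under_load(load_cmd, task_cmd, sample, interval_s=1.0):
    """Run task_cmd while load_cmd runs in the background.

    Calls sample() every interval_s seconds until task_cmd exits and
    returns (exit code of task_cmd, list of samples).
    """
    samples = []
    load = subprocess.Popen(load_cmd)  # background load generator
    try:
        task = subprocess.Popen(task_cmd)  # the job whose impact we want to see
        while task.poll() is None:
            samples.append(sample())
            time.sleep(interval_s)
        return task.returncode, samples
    finally:
        # Stop the load generator even if the task failed to start.
        load.terminate()
        load.wait()
```

Here `load_cmd` could be a stress-ng or dd invocation, `task_cmd` the maintenance job, and `sample` any callable that returns a load figure, for example one that reads /proc/diskstats.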

Future

There are two things that immediately come to my mind for future work on this ticket:

Actions #19

Updated by szarate over 5 years ago

  • Status changed from Workable to In Progress

I'm taking this one, let's see if a combination of dd and bonnie++ can do something

Actions #20

Updated by mgriessmeier over 5 years ago

  • Assignee set to szarate
Actions #21

Updated by szarate over 5 years ago

I have used Alex's attempt and modified it a bit to use /dev/urandom instead and also modified the size of the files to generate a bit more load.

From time to time I managed to get btrfs balance to stall the system for a moment, but nothing conclusive. I'm unassigning for the time being, rather than waste more time. Perhaps a look at how btrfs scrub works and some white box testing could yield better results.

I also suggest lowering the priority of the ticket.

Actions #22

Updated by szarate over 5 years ago

  • Priority changed from High to Normal

Downgrading to normal for the time being.

Actions #23

Updated by okurz over 5 years ago

  • Target version changed from Milestone 22 to Milestone 25
Actions #24

Updated by mgriessmeier over 5 years ago

  • Target version changed from Milestone 25 to Milestone 26
Actions #25

Updated by mgriessmeier about 5 years ago

  • Target version changed from Milestone 26 to Milestone 27
Actions #26

Updated by SLindoMansilla about 5 years ago

  • Status changed from Workable to Rejected
  • Assignee set to mgriessmeier