action #41459: [sle][functional][u] Explicit test module for btrfs snapshots cleanup performance - openQA Tests (public) - openSUSE Project Management Tool


action #41459

closed

[sle][functional][u] Explicit test module for btrfs snapshots cleanup performance

Added by okurz over 6 years ago. Updated over 5 years ago.

Status:

Rejected

Priority:

Normal

Assignee:

mgriessmeier

Category:

New test

Target version:

SUSE QA (private) - Milestone 27

Start date:

2018-08-01

Due date:

% Done:

Estimated time:

Difficulty:

Description

Motivation

boo#1063638 is the "openSUSE/SLE sucks bug": it kills system performance, frustrates people and has made many openQA tests fail (#39059) since the feature was introduced in https://fate.suse.com/312751. Existing test modules do not show the problem(s) obviously enough, even though we also reference the bug in e.g. force_scheduled_tasks, and without our help the bug fix seems to be going nowhere. We already have test modules like btrfs_qgroups and snapper_cleanup, which we should complement with more test coverage that explicitly covers the system performance impact.

Acceptance criteria

AC1: Tests explicitly reproduce boo#1063638 on SLE12

Suggestions

Review existing test modules force_scheduled_tasks, btrfs_qgroups and snapper_cleanup
Create a new test module that reproduces the bug, e.g. try to fill up the disk a lot, create snapshots, delete fill-up data again, repeat, then trigger the maintenance tasks as in force_scheduled_tasks and measure the load impact
During a test run, try tools like stress-ng with a high IO load or trigger tests with bonnie++
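
The fill-up, snapshot, delete, repeat cycle suggested above could be sketched roughly as follows. This is only an illustration in Python (an actual openQA test module would be written in Perl); the paths, the number of rounds, the file size and the `-dusage=50 -musage=50` balance filters are all assumptions, and the final `btrfs balance` call merely stands in for whatever force_scheduled_tasks really triggers.

```python
import shlex
import subprocess
from typing import List


def build_cycle_commands(subvol: str, snapshot_dir: str, mountpoint: str,
                         rounds: int = 3, size_mb: int = 512) -> List[List[str]]:
    """Return the commands for `rounds` fill/snapshot/delete cycles plus one balance."""
    commands: List[List[str]] = []
    for i in range(rounds):
        data_file = f"{subvol}/fill_{i}.bin"  # hypothetical file name
        # 1. Fill up the subvolume with data.
        commands.append(["dd", "if=/dev/urandom", f"of={data_file}",
                         "bs=1M", f"count={size_mb}", "conv=fsync"])
        # 2. Take a read-only snapshot so the data stays referenced.
        commands.append(["btrfs", "subvolume", "snapshot", "-r", subvol,
                         f"{snapshot_dir}/snap_{i}"])
        # 3. Delete the fill-up data from the original subvolume again.
        commands.append(["rm", "-f", data_file])
    # 4. Trigger a maintenance task (assumed filters, see lead-in).
    commands.append(["btrfs", "balance", "start",
                     "-dusage=50", "-musage=50", mountpoint])
    return commands


def run_commands(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print the commands, or execute them (needs root and a btrfs filesystem)."""
    for argv in commands:
        if dry_run:
            print(shlex.join(argv))
        else:
            subprocess.run(argv, check=True)
```

Calling `run_commands(build_cycle_commands("/mnt/data", "/mnt/snaps", "/mnt"))` only prints the plan; passing `dry_run=False` would execute it. Measuring the load impact would have to wrap the last command.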

Files

btrfs_stress.sh (2.32 KB)

szarate, 2019-01-24 08:54

Related issues 4 (0 open — 4 closed)


Updated by okurz over 6 years ago

Copied from action #39059: [sle][functional][y] detect "openSUSE sucks bug" about btrfs balance and record_soft_fail (was: yast2_gui tests modules as application could not start up) added


Updated by okurz over 6 years ago

Related to action #31351: [functional][u][medium] force_cron_run does not actually run any crons (occasionally) added


Updated by okurz over 6 years ago

Copied to action #41462: [sle][functional][u][medium] Mask btrfs maintenance cron jobs as well added


Updated by SLindoMansilla over 6 years ago

Due date changed from 2018-10-09 to 2018-10-23

We need to clarify what "explicitly reproduce" means: a new test suite, or an improvement to the existing system_performance test suite?

Probably can be split.

There is not enough capacity during sprint 27, and this is better done after #41462.


Updated by okurz over 6 years ago

Target version changed from Milestone 19 to Milestone 20


Updated by okurz over 6 years ago

Description updated (diff)
Due date deleted (~~2018-10-23~~)
Status changed from New to Workable
Target version changed from Milestone 20 to Milestone 22

#41462 closed, now we can try again. Discussed in SP


Updated by jorauch over 6 years ago

Assignee set to jorauch


Updated by jorauch over 6 years ago

Status changed from Workable to In Progress

After actually working on this for a while, I wonder what the difference to 'snapper_cleanup' is.


Updated by okurz over 6 years ago

I mention snapper_cleanup in the ticket description because there is certainly some overlap. However, as suggested in the ticket description, what I propose is to simulate the workload of a heavily used and messy disk, then trigger the btrfs maintenance jobs (not snapper) and see how the system is impacted.
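
One way to "see how the system is impacted" might be to sample the load average while a maintenance job runs. The following is a minimal Python sketch under that assumption (an openQA module would do this in Perl); the function names are hypothetical and the command to run is left to the caller. On Linux the load average also counts tasks in uninterruptible sleep, so I/O stalls should show up in it, although the 1-minute value reacts slowly.

```python
import subprocess
import time
from typing import List, Sequence


def parse_loadavg(text: str) -> float:
    """Return the 1-minute load average from the content of /proc/loadavg."""
    return float(text.split()[0])


def read_loadavg(path: str = "/proc/loadavg") -> float:
    """Read the current 1-minute load average (Linux only)."""
    with open(path) as handle:
        return parse_loadavg(handle.read())


def sample_load_while_running(command: Sequence[str],
                              interval: float = 1.0) -> List[float]:
    """Run `command` and sample the load average until it exits.

    The first sample is the baseline taken before the command starts.
    """
    samples = [read_loadavg()]
    process = subprocess.Popen(command)
    while process.poll() is None:
        time.sleep(interval)
        samples.append(read_loadavg())
    return samples
```

Comparing `max(samples)` against `samples[0]` would give a crude impact figure; which threshold counts as "the bug" is not something this ticket defines.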


#10

Updated by jorauch over 6 years ago

Status changed from In Progress to Workable
Assignee deleted (~~jorauch~~)

Since I failed to deliver within a week I will unassign.


#11

Updated by okurz over 6 years ago

Related to action #43784: [functional][y][sporadic] test fails in yast2_snapper now reproducibly not exiting the "show differences" screen added


#12

Updated by okurz over 6 years ago

jorauch wrote:

Since I failed to deliver within a week I will unassign.

ok, so what did you accomplish or what did you learn which would be good to know for the next assignee?


#13

Updated by szarate about 6 years ago

Subject changed from [sle][functional][u] Explicit test module for btrfs snapshots cleanup performance to [sle][functional][u] Reproduce boo#1063638 - explicitly trigger btrfs maintenance (btrfs performance issue)
Description updated (diff)


#14

Updated by okurz about 6 years ago

Subject changed from [sle][functional][u] Reproduce boo#1063638 - explicitly trigger btrfs maintenance (btrfs performance issue) to [sle][functional][u] Explicit test module for btrfs snapshots cleanup performance

@szarate I don't think your change of the subject is correct. We already explicitly trigger btrfs maintenance, see the test module "force_scheduled_tasks". This ticket is about not triggering maintenance tasks explicitly.


#15

Updated by agraul about 6 years ago

Status changed from Workable to In Progress
Assignee set to agraul


#17

Updated by agraul about 6 years ago

WIP PR: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/6487

I'll try to find a way to create more IO load at the end of the test, just before the measurements start and see if that has an influence on the disk IO queue. If I can't find a promising way by the end of the day, I'll follow up with a better summary of what I tried and put the ticket back in the pool for anyone to take.


#18

Updated by agraul about 6 years ago

Status changed from In Progress to Workable
Assignee deleted (~~agraul~~)

I am unassigning myself because I did not make enough progress to believe I can create a good reproducer now. As promised, here is a summary of what I did and what I think should be done next:

My approach

Create dummy data in a subvolume, then create a btrfs snapshot of it
(a new subvolume that shares data with the snapshotted one). Delete
the data in the original subvolume, so that only the snapshot is
"pointing to it". The idea here is that btrfs metadata is created and
changed, giving btrfs balance some work to do.

Read out /proc/diskstats for IO queue to determine load and later give
stress-ng a try by running a benchmark at the beginning and at the
end, while btrfs maintenance jobs do their work.
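
Reading the IO queue from /proc/diskstats could look roughly like this. It is a Python sketch rather than the code from the WIP PR; the function names are hypothetical. The column positions follow the kernel's documented layout: after major, minor and device name, the ninth statistic is the number of I/Os currently in progress and the eleventh is the weighted time spent doing I/Os in milliseconds.

```python
from typing import Dict

IN_FLIGHT_COLUMN = 11    # I/Os currently in progress (0-based column index)
WEIGHTED_MS_COLUMN = 13  # weighted time spent doing I/Os, in milliseconds


def parse_diskstats(text: str) -> Dict[str, Dict[str, int]]:
    """Map each device name to its in-flight count and weighted I/O time."""
    stats: Dict[str, Dict[str, int]] = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 14:
            continue  # skip blank or truncated lines
        stats[fields[2]] = {
            "in_flight": int(fields[IN_FLIGHT_COLUMN]),
            "weighted_io_ms": int(fields[WEIGHTED_MS_COLUMN]),
        }
    return stats


def average_queue_depth(before: Dict[str, int], after: Dict[str, int],
                        elapsed_s: float) -> float:
    """Average number of queued I/Os between two samples of one device."""
    delta_ms = after["weighted_io_ms"] - before["weighted_io_ms"]
    return delta_ms / (elapsed_s * 1000.0)
```

The in-flight value is only a snapshot of one instant, so averaging over an interval via the weighted time is usually the more stable signal. Taking one sample before and one during the maintenance job would give the two dictionaries to compare.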

I ran the tests on Tumbleweed while developing. While it differs
from SLE12 with regard to kernel / filesystem, the very same bug hit me
two weeks ago on my TW laptop, so it is not (completely) fixed in TW
yet.

Why didn't I reproduce bsc#1063638?

Firstly, my approach of creating work for btrfs to clean up was probably
not good enough. I tried writing data in different ways, but
ultimately it was not enough to cause changes in system performance
when running the maintenance jobs.

Secondly, one issue I had was starting different things that cause the
load asynchronously in openQA. There might be a simple way to solve
this part, but I did not come across it.
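
For illustration only, the general pattern of keeping a load generator running while another command is measured might look like this in plain Python on the machine under test. It does not use the openQA test API, and the function name is hypothetical.

```python
import subprocess
from typing import Sequence, Tuple


def run_with_background_load(load_cmd: Sequence[str],
                             measured_cmd: Sequence[str]) -> Tuple[int, int]:
    """Start `load_cmd` in the background, run `measured_cmd`, then stop the load.

    Returns the exit codes of the measured command and of the load generator.
    """
    load = subprocess.Popen(load_cmd)  # returns immediately, load keeps running
    try:
        result = subprocess.run(measured_cmd)  # blocks until the command is done
    finally:
        load.terminate()  # stop the load generator even if the command failed
        load.wait()
    return result.returncode, load.returncode
```

Here `load_cmd` could be an IO stress tool and `measured_cmd` a maintenance job; both are left to the caller.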

Future

There are two things that immediately come to my mind for future work on this ticket:

There were a few comments on my WIP PR (https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/6487), those could be included in the test.
Find a way to create more "dirt" for btrfs to clean up during its maintenance task


#19

Updated by szarate about 6 years ago

Status changed from Workable to In Progress

I'm taking this one. Let's see if a combination of dd and bonnie++ can do something.


#20

Updated by mgriessmeier about 6 years ago

Assignee set to szarate


#21

Updated by szarate about 6 years ago

File btrfs_stress.sh btrfs_stress.sh added
Status changed from In Progress to Workable
Assignee deleted (~~szarate~~)

I have used Alex's attempt and modified it a bit to use /dev/urandom instead and also modified the size of the files to generate a bit more load.
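
A rough Python equivalent of writing fill-up files with random content is sketched below. This is not the attached btrfs_stress.sh, whose contents are not reproduced here; the function name, chunk size and file sizes are assumptions. `os.urandom` draws from the same kernel random source as /dev/urandom.

```python
import os


def write_random_file(path: str, size_mb: int, chunk_mb: int = 4) -> int:
    """Write `size_mb` MiB of random bytes to `path` and return the bytes written."""
    chunk_bytes = chunk_mb * 1024 * 1024
    remaining = size_mb * 1024 * 1024
    written = 0
    with open(path, "wb") as handle:
        while remaining > 0:
            # Write in chunks so large files do not need to fit in memory.
            block = os.urandom(min(chunk_bytes, remaining))
            handle.write(block)
            written += len(block)
            remaining -= len(block)
        # Push the data to disk so it actually creates IO load.
        handle.flush()
        os.fsync(handle.fileno())
    return written
```

Raising `size_mb` corresponds to the larger file sizes mentioned above.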

From time to time I managed to get btrfs balance to stall the system for a moment, but nothing conclusive. I'm unassigning for the time being rather than wasting more time. Perhaps a look at how btrfs scrub works and some white box testing could yield better results.

I also suggest lowering the priority of the ticket.


#22