action #41459
[sle][functional][u] Explicit test module for btrfs snapshots cleanup performance
Status: closed · Done: 0%
Description
Motivation
boo#1063638 is the "openSUSE/SLE sucks bug": it kills system performance, frustrates people, and fails many openQA tests (#39059) after the feature was introduced in https://fate.suse.com/312751. Existing test modules do not show the problem(s) obviously enough, even though we also reference the bug in e.g. force_scheduled_tasks, and without our help it seems the bug fix is going nowhere. We already have test modules like btrfs_qgroups and snapper_cleanup, which we should complement with more test coverage to explicitly cover the system performance impact.
Acceptance criteria
- AC1: Tests explicitly reproduce boo#1063638 on SLE12
Suggestions
- Review existing test modules force_scheduled_tasks, btrfs_qgroups and snapper_cleanup
- Create a new test module that reproduces the bug, e.g. fill up the disk substantially, create snapshots, delete the fill-up data again, repeat, then trigger the maintenance tasks as in force_scheduled_tasks and measure the load impact
- During a test run, try tools like stress-ng with a high IO load, or run bonnie++ benchmarks; a rough sketch of such a reproducer follows this list
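A rough sketch of such a reproducer as a shell script. Mount points, sizes, and iteration counts are illustrative assumptions; the maintenance script path is the one shipped by the btrfsmaintenance package:

    #!/bin/bash
    # Hypothetical reproducer sketch for boo#1063638: churn data and snapshots,
    # then trigger the btrfs maintenance jobs and measure the load impact.
    set -e
    DIR=/var/tmp/btrfs_perf    # assumed to live on the root btrfs filesystem
    mkdir -p "$DIR"
    for i in $(seq 1 10); do
        # fill up the disk with incompressible data
        dd if=/dev/urandom of="$DIR/fill_$i" bs=1M count=1024
        snapper create -d "perf-test-$i"    # snapshot keeps the extents referenced
        rm -f "$DIR/fill_$i"                # delete the fill-up data again
    done
    # trigger the maintenance task as force_scheduled_tasks does
    /usr/share/btrfsmaintenance/btrfs-balance.sh &
    # measure the load impact while balance runs
    stress-ng --hdd 4 --timeout 120s --metrics-brief
    cat /proc/loadavg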
Updated by okurz about 6 years ago
- Copied from action #39059: [sle][functional][y] detect "openSUSE sucks bug" about btrfs balance and record_soft_fail (was: yast2_gui tests modules as application could not start up) added
Updated by okurz about 6 years ago
- Related to action #31351: [functional][u][medium] force_cron_run does not actually run any crons (occasionally) added
Updated by okurz about 6 years ago
- Copied to action #41462: [sle][functional][u][medium] Mask btrfs maintenance cron jobs as well added
Updated by SLindoMansilla about 6 years ago
- Due date changed from 2018-10-09 to 2018-10-23
We need to clarify what "explicitly reproduce" means: a new test suite? Improving the existing system_performance test suite?
This can probably be split.
Not enough capacity during sprint 27, and this is better done after #41462.
Updated by okurz about 6 years ago
- Target version changed from Milestone 19 to Milestone 20
Updated by jorauch almost 6 years ago
- Status changed from Workable to In Progress
I have actually been working on this for a while; I wonder what the difference to 'snapper_cleanup' is?
Updated by okurz almost 6 years ago
I mention snapper_cleanup in the ticket description because there is certainly some overlap. However, as per the suggestion in the ticket description, what I propose is to simulate the workload of a heavily used, messy disk and then trigger the btrfs maintenance jobs (not snapper) and see how the system is impacted; a rough sketch follows.
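For illustration, triggering the maintenance jobs directly could look like this. The script paths are the ones the btrfsmaintenance package installs on SLE/openSUSE; the systemd units are an assumption for newer, timer-based setups:

    # run the btrfs maintenance jobs directly, bypassing snapper
    /usr/share/btrfsmaintenance/btrfs-balance.sh
    /usr/share/btrfsmaintenance/btrfs-scrub.sh
    /usr/share/btrfsmaintenance/btrfs-trim.sh
    # or, where the package ships systemd units instead of cron jobs:
    systemctl start btrfs-balance.service btrfs-scrub.service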
Updated by jorauch almost 6 years ago
- Status changed from In Progress to Workable
- Assignee deleted (jorauch)
Since I failed to deliver within a week I will unassign.
Updated by okurz almost 6 years ago
- Related to action #43784: [functional][y][sporadic] test fails in yast2_snapper now reproducibly not exiting the "show differences" screen added
Updated by okurz almost 6 years ago
jorauch wrote:
Since I failed to deliver within a week I will unassign.
OK, so what did you accomplish, or what did you learn, that would be good to know for the next assignee?
Updated by szarate almost 6 years ago
- Subject changed from [sle][functional][u] Explicit test module for btrfs snapshots cleanup performance to [sle][functional][u] Reproduce boo#1063638 - explicitly trigger btrfs maintenance (btrfs performance issue)
- Description updated (diff)
Updated by okurz almost 6 years ago
- Subject changed from [sle][functional][u] Reproduce boo#1063638 - explicitly trigger btrfs maintenance (btrfs performance issue) to [sle][functional][u] Explicit test module for btrfs snapshots cleanup performance
@szarate I don't think your change of the subject is correct. We already explicitly trigger btrfs maintenance, see the test module "force_scheduled_tasks". This ticket is not about triggering the maintenance tasks explicitly.
Updated by agraul almost 6 years ago
- Status changed from Workable to In Progress
- Assignee set to agraul
Updated by agraul over 5 years ago
WIP PR: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/6487
I'll try to find a way to create more IO load at the end of the test, just before the measurements start and see if that has an influence on the disk IO queue. If I can't find a promising way by the end of the day, I'll follow up with a better summary of what I tried and put the ticket back in the pool for anyone to take.
Updated by agraul over 5 years ago
- Status changed from In Progress to Workable
- Assignee deleted (agraul)
I am unassigning myself; I did not make enough progress to believe I can create a good reproducer now. As promised, here is a summary of what I did and what I think should be done next:
My approach
Create dummy data in a subvolume, then create a btrfs snapshot of it (a new subvolume that shares data with the snapshotted one). Delete the data in the original subvolume, so that only the snapshot is "pointing to it". The idea here is that btrfs metadata is created and changed, creating some work for btrfs balance.
Read /proc/diskstats (IO queue) to determine the load, and later give stress-ng a try by running a benchmark at the beginning and at the end while the btrfs maintenance jobs do their work.
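A minimal sketch of that approach, assuming a btrfs mount at /mnt and the disk being sda (both illustrative):

    # create dummy data in a subvolume and snapshot it
    btrfs subvolume create /mnt/data
    dd if=/dev/zero of=/mnt/data/blob bs=1M count=2048
    btrfs subvolume snapshot /mnt/data /mnt/data_snap   # shares the extents
    rm /mnt/data/blob   # now only the snapshot references the data
    # column 12 of /proc/diskstats is the number of I/Os currently in progress
    awk '$3 == "sda" { print "in-flight IOs:", $12 }' /proc/diskstats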
I ran the tests on Tumbleweed while developing. While it is different from SLE12 with regard to kernel / filesystem, the very same bug hit me two weeks ago on my TW laptop, so it is not (completely) fixed in TW yet.
Why didn't I reproduce bsc#1063638?
Firstly, my approach of creating work for btrfs to clean up was probably not good enough. I tried writing data in different ways, but ultimately it was not enough to cause changes in system performance when running the maintenance jobs.
Secondly, one issue I had was starting the different load-generating tasks asynchronously in openQA. There might be a simple way to solve this part, but I did not come across it.
Future
There are two things that immediately come to my mind for future work on this ticket:
- There were a few comments on my WIP PR (https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/6487), those could be included in the test.
- Find a way to create more "dirt" for btrfs to clean up during its maintenance task
Updated by szarate over 5 years ago
- Status changed from Workable to In Progress
I'm taking this one; let's see if a combination of dd and bonnie++ can do something.
Updated by szarate over 5 years ago
- File btrfs_stress.sh added
- Status changed from In Progress to Workable
- Assignee deleted (szarate)
I have used Alex's attempt and modified it a bit to use /dev/urandom instead, and also adjusted the size of the files to generate a bit more load; a sketch of the idea follows below.
I managed to get btrfs balance to stall the system for a moment from time to time, but nothing conclusive. I'm unassigning for the time being rather than waste more time. Perhaps a look at how btrfs scrub works and some white-box testing could yield better results.
I also suggest lowering the priority of the ticket.
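For reference, the core of that modification might look like the following (target paths and sizes are illustrative). /dev/urandom yields incompressible data, so every written byte actually hits the disk:

    # generate load with incompressible data instead of /dev/zero
    dd if=/dev/urandom of=/mnt/data/fill bs=4M count=512
    # additional IO load via bonnie++ (size and directory are examples)
    bonnie++ -d /mnt/data -s 4G -n 128 -u root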
Updated by szarate over 5 years ago
- Priority changed from High to Normal
Downgrading to normal for the time being.
Updated by okurz over 5 years ago
- Target version changed from Milestone 22 to Milestone 25
Updated by mgriessmeier over 5 years ago
- Target version changed from Milestone 25 to Milestone 26
Updated by mgriessmeier about 5 years ago
- Target version changed from Milestone 26 to Milestone 27
Updated by SLindoMansilla about 5 years ago
- Status changed from Workable to Rejected
- Assignee set to mgriessmeier
Bug resolved: FIXED
https://bugzilla.opensuse.org/show_bug.cgi?id=1063638