action #30595

closed

[ppc64le] Deploy bcache on tmpfs workers

Added by yosun over 6 years ago. Updated 5 months ago.

Status: Closed
Priority: Low
Assignee: -
Category: -
Target version:
Start date: 2018-01-22
Due date:
% Done: 0%
Estimated time:
Tags:

Description

qemu-system-ppc64: -monitor telnet:127.0.0.1:20062,server,nowait: Failed to bind socket: Address already in use

https://openqa.suse.de/tests/1405047
https://openqa.suse.de/tests/1405047/file/autoinst-log.txt
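
The "Address already in use" part means another process, most likely a qemu instance left over from an earlier job, is still bound to the monitor port. As a quick, generic illustration (not taken from the ticket), one way to check would be:

    # show which process is still listening on the monitor port (20062 in this case)
    ss -ltnp 'sport = :20062'
    # alternatively
    lsof -iTCP:20062 -sTCP:LISTEN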


Files

bcache-zram.sh (2.18 KB) - Script used by Rudi to generate a bcache-backed ramdisk (szarate, 2018-04-10 14:05)

Related issues 6 (1 open, 5 closed)

Related to openQA Project - action #30388: [qam] openqaworker10:4 - worker fail all tests - qemu instances are left behind (Resolved, dasantiago, 2018-02-26)

Related to openQA Infrastructure - action #38780: [tools] Remove tmpfs from malbec again and give it a try (Rejected, okurz, 2018-07-24)

Related to openQA Project - coordination #64746: [saga][epic] Scale up: Efficient handling of large storage to be able to run current tests efficiently but keep big archives of old results (Resolved, okurz, 2020-03-18)

Related to openQA Infrastructure - action #106841: Try using tmpfs on openqaworker1 to use RAM more efficiently (New, 2022-02-15)

Related to openQA Infrastructure - action #139271: Repurpose PowerPC hardware in FC Basement - mania Power8 PowerPC size:M (Resolved, okurz, 2023-09-20)

Has duplicate openQA Tests - action #30249: [sle][migration][sle15][ppc64le] test fails in install_and_reboot - failed by diskspace is exhausted (Resolved, qmsu, 2018-01-12)
Actions #1

Updated by okurz over 6 years ago

  • Subject changed from [tools] Address competition(Failed to bind socket: Address already in use) to [tools][ppc64le][malbec] Address competition(Failed to bind socket: Address already in use)
  • Category set to 168
  • Status changed from New to In Progress
  • Assignee set to okurz

Could be a duplicate of #30388. https://openqa.suse.de/tests/1403769 is the last test that did not end up incomplete. https://openqa.suse.de/tests/1403769/file/autoinst-log.txt shows

[2018-01-20T16:57:14.0206 CET] [debug] no
[2018-01-20T16:57:14.0207 CET] [debug] waitpid for 37981 returned 0
[2018-01-20T16:57:14.0207 CET] [debug] capture loop failed Can't write to log: No space left on device at /usr/lib/os-autoinst/bmwqemu.pm line 197.

DIE Can't write to log: No space left on device at /usr/lib/os-autoinst/bmwqemu.pm line 197.

 at /usr/lib/os-autoinst/backend/baseclass.pm line 80.
    backend::baseclass::die_handler('Can\'t write to log: No space left on device at /usr/lib/os-a...') called at /usr/lib/perl5/5.18.2/Carp.pm line 100
    Carp::croak('Can\'t write to log: No space left on device') called at /usr/lib/perl5/vendor_perl/5.18.2/Mojo/Log.pm line 32

I guess that is the problem.

Right now there is no space depletion on malbec (anymore?). I terminated the still running qemu process and will restart the incomplete jobs.
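
For reference, a minimal sketch of that kind of cleanup, assuming the usual openqa-client tool and using the job from the description as an example (not the exact commands that were run):

    # kill qemu instances left behind by crashed jobs on this worker
    pkill -f qemu-system-ppc64
    # restart an incomplete job via the API client
    openqa-client jobs/1405047/restart post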

Actions #2

Updated by okurz over 6 years ago

  • Related to action #30388: [qam] openqaworker10:4 - worker fail all tests - qemu instances are left behind added
Actions #3

Updated by okurz over 6 years ago

  • Has duplicate action #30249: [sle][migration][sle15][ppc64le] test fails in install_and_reboot - failed by diskspace is exhausted added
Actions #4

Updated by okurz over 6 years ago

  • Priority changed from Normal to Urgent
Actions #5

Updated by okurz over 6 years ago

  • Assignee deleted (okurz)
  • Priority changed from Urgent to High

About 10 jobs are no longer incomplete on malbec now. I wonder if certain scenarios, e.g. migration, are more prone to depleting the space on malbec. I did the "urgent" part for now, back to the backlog for further consideration. I recommend either further limiting the number of instances or increasing the available space in /var/lib/openqa/pool. Right now it's "tmpfs 80G 46G 35G 57% /var/lib/openqa/pool", so there is no big margin.
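
As a rough illustration of the first option (the instance numbers are made up, not a concrete recommendation):

    # check how full the tmpfs pool is
    df -h /var/lib/openqa/pool
    # stop and disable the highest worker instances to reduce the number of parallel jobs
    sudo systemctl disable --now openqa-worker@{7,8,9,10}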

Actions #6

Updated by coolo over 6 years ago

  • Subject changed from [tools][ppc64le][malbec] Address competition(Failed to bind socket: Address already in use) to [ppc64le] Deploy bcache on tmpfs workers
  • Status changed from In Progress to New

malbec does most of its stuff in tmpfs and is overbooked. So if we have too many jobs running in parallel there, we will run into this. But as we're still short on power workers, we have to live with this and restart the jobs that ended up running in parallel.

Alex suggested using bcache to get around this situation, with bcache using a file on disk as the backing store.
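
A minimal sketch of that idea, assuming a loop-mounted file on the rotating disk as the bcache backing device (paths and sizes are illustrative, not taken from malbec):

    # backing store: a sparse file on disk, exposed as a block device
    truncate -s 200G /space/pool-backing.img
    losetup /dev/loop0 /space/pool-backing.img
    # format it as a bcache backing device and put the pool filesystem on top
    make-bcache -B /dev/loop0
    mkfs.ext4 /dev/bcache0
    mount /dev/bcache0 /var/lib/openqa/pool
    # a fast cache device (RAM or SSD) could then be attached to /dev/bcache0 later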

Actions #7

Updated by coolo over 6 years ago

  • Target version set to Ready
Actions #8

Updated by mitiao about 6 years ago

  • Assignee set to mitiao
Actions #9

Updated by mitiao about 6 years ago

  • Status changed from New to In Progress

Reproduced the issue locally.
Trying to deploy bcache on a test machine following the documentation.

Actions #10

Updated by mitiao about 6 years ago

  • Target version changed from Ready to Current Sprint
Actions #11

Updated by mitiao about 6 years ago

Tried some deployments on local x86_64 machines with different configurations.
Wanted to look inside malbec and then deploy bcache according to its hardware and OS environment.

Currently blocked by a 'dead' malbec and waiting for it to come back up.
Filed an infra ticket (107529) for this issue:
https://infra.nue.suse.com/Ticket/Display.html?id=107529

Actions #12

Updated by coolo about 6 years ago

Don't hack this into malbec - hack this into the salt recipes.

Actions #13

Updated by mitiao about 6 years ago

coolo wrote:

Don't hack this into malbec - hack this into the salt recipes.

Got it, will look into deploying this via salt.

Actions #14

Updated by mitiao about 6 years ago

The malbec issue has not had any update since the 2nd of March:

2018-03-01 10:44 The RT System itself - Status changed from 'new' to 'open'
2018-03-01 10:44 lrupp (Lars Vogdt) - Given to mcaj (Martin Caj)
2018-03-01 10:44 lrupp (Lars Vogdt) - Queue changed from BuildOPS to infra
2018-03-01 14:30 mcaj (Martin Caj) - Given to mmaher (Maximilian Maher)
2018-03-02 11:25 fghariani (Fatma Ghariani) - Stolen from mmaher (Maximilian Maher)
2018-03-02 11:25 fghariani (Fatma Ghariani) - Given to ihno (Ihno Krumreich)

Also set up a salt master and minion locally to compose the sls file for the future bcache deployment.

Actions #15

Updated by mitiao about 6 years ago

Finished an sls file that deploys the backing device in a loop-mounted file and the cache device on an x86_64 test machine.
Looked into an alive Power8 worker to get its disk configuration, adapting the sls to match the real worker.
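
For context, a hedged sketch of how such a state could be tried out on a local minion before touching a real worker; the state name "bcache" here is hypothetical, not the actual file from the PR:

    # dry-run the state locally without a master
    salt-call --local state.apply bcache test=True
    # apply it for real once the dry run looks sane
    salt-call --local state.apply bcache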

Actions #16

Updated by szarate about 6 years ago

An update: https://fsp1-malbec.arch.suse.de/cgi-bin/cgi is now accessible again but apparently doesn't allow ipmi or any other type of interaction.

However, the machine seems to have booted in bare-metal mode, OR there might be network issues that don't allow the IP to be set... who knows.

Actions #17

Updated by mitiao about 6 years ago

I am adapting the bcache sls for the Power8-4 worker, and some questions popped up:

  1. I can't find any ssd device on the machine, so where should the caching device be created?
  2. /var/lib/openqa is mounted on a partition; is it ok to format that partition as the backing device and mount it to /var/lib/openqa? That would wipe all existing workers and re-install them on the bcache device. Or should a loop-mounted file be used as the backing device for newly added workers?
Actions #18

Updated by coolo about 6 years ago

If we had ssds in these workers, we wouldn't run them on tmpfs. And we don't want to cache onto ssds, but use the backing device only for overcommitting the memory.

Don't use the full /var/lib/openqa, because we don't want to store /var/lib/openqa/cache in memory, I guess. But you can shrink it to gain space, and possibly we even have other disks.

Actions #20

Updated by mitiao about 6 years ago

malbec is back, last update of https://infra.nue.suse.com/Ticket/Display.html?id=107529:
2018-03-21 11:47 jloeser (Jan Loeser) - Status changed from 'open' to 'resolved'

Actions #21

Updated by mitiao about 6 years ago

Did I misunderstand that we don't need to create a cache on malbec if there is no ssd device?
As I discussed with Coly, using only a backing device is not much different from not using bcache at all.

Also, I don't get the idea of reducing memory overcommitting by using bcache, since bcache mainly uses a fast device as a cache for a slower storage device, and the memory usage would grow as the ssd size increases.

And if a cache is needed but there is no ssd device, a cache on hdd is not meaningful; a cache on tmpfs still uses memory as the cache, but is limited by the memory size, and the memory is used twice.

Btw, since malbec is alive, further work should be done specifically on malbec.

Actions #22

Updated by szarate about 6 years ago

I'm guessing that the solution coolo is referring to, and that Alex pointed him to, was to use bcache as a backing device that would extend the memory of the system via zram... this would extend the amount of memory available for the pool, so that the data that is commonly used is kept in the faster RAM while the rest is moved to the zram-bcache frankenstein... but my memory is foggy here.

@mitiao, the idea with using rotating disks is that, since we have some RAM to use, we just need to make sure there's enough extra space to store big files. Speed is secondary to some extent here.

The GitLab PR looks fine so far... but well, let malbec be the playground unless coolo says otherwise :)
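
For the zram part, a generic illustration (this is not the attached bcache-zram.sh, just a sketch assuming util-linux zramctl is available):

    # create a compressed RAM block device, e.g. 32G, which could then take the
    # place of the fast device in a bcache setup
    modprobe zram
    zramctl --find --size 32G --algorithm lz4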

Actions #23

Updated by coolo about 6 years ago

This is not about zram, no. We have enough memory - in general. But we need a strategy for where to put the blocks in case the RAM is overcommitted. If bcache can't be used with tmpfs, then we need to find a different solution to have RAM as the backing store for the pools.

Actions #24

Updated by algraf about 6 years ago

coolo wrote:

This is not about zram, no. We have enough memory - in general. But we need a strategy for where to put the blocks in case the RAM is overcommitted. If bcache can't be used with tmpfs, then we need to find a different solution to have RAM as the backing store for the pools.

You can just use a genuine ramdisk rather than tmpfs as backing store for bcache. That gives you the same effect as the ZRAM variant, but without RAM overhead reduction. The big benefit of bcache that we found was that it nicely continuously streams dirty data onto the backing storage, so it actually saturates the links rather than generating big latency spikes when flushing from time to time.
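
For illustration, a "genuine ramdisk" in this sense could be a brd block device (size illustrative; whether it ends up on the cache or the backing side of bcache is exactly the question raised later in this ticket):

    # create a 64 GiB RAM block device; rd_size is given in KiB
    modprobe brd rd_nr=1 rd_size=$((64 * 1024 * 1024))
    ls -l /dev/ram0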

Actions #25

Updated by szarate about 6 years ago

Perhaps the attached script (bcache-zram.sh) might shed some more light.

Actions #26

Updated by szarate about 6 years ago

It was agreed during the sprint planning to move forward with this.

Actions #27

Updated by szarate about 6 years ago

@mitiao are you going to keep working on this?

Actions #28

Updated by szarate almost 6 years ago

  • Status changed from In Progress to New
  • Assignee deleted (mitiao)
  • Priority changed from High to Normal
  • Target version changed from Current Sprint to Ready

We need to come up with another idea, so most likely this will come back soon

Actions #29

Updated by szarate almost 6 years ago

  • Related to action #38780: [tools] Remove tmpfs from malbec again and give it a try added
Actions #30

Updated by mkittler over 5 years ago

  • Project changed from openQA Project to openQA Infrastructure
  • Category deleted (168)

seems to be an infra issue

Actions #31

Updated by okurz about 4 years ago

algraf wrote:

You can just use a genuine ramdisk rather than tmpfs as backing store for bcache. That gives you the same effect as the ZRAM variant, but without RAM overhead reduction.

Do you mean ramfs rather than tmpfs, or something like mke2fs -m 0 /dev/ram0? Wouldn't the system memory in both cases be prone to running full and causing more havoc than a size-limited tmpfs?

Actions #32

Updated by okurz about 4 years ago

  • Related to coordination #64746: [saga][epic] Scale up: Efficient handling of large storage to be able to run current tests efficiently but keep big archives of old results added
Actions #33

Updated by okurz over 3 years ago

  • Tags changed from caching, openQA, sporadic, arm, ipmi, worker to worker
Actions #34

Updated by okurz over 3 years ago

  • Priority changed from Normal to Low
Actions #35

Updated by okurz over 3 years ago

  • Target version changed from Ready to future
Actions #36

Updated by okurz about 2 years ago

  • Related to action #106841: Try using tmpfs on openqaworker1 to use RAM more efficiently added
Actions #37

Updated by okurz 5 months ago

  • Related to action #139271: Repurpose PowerPC hardware in FC Basement - mania Power8 PowerPC size:M added
Actions #38

Updated by yosun 5 months ago

  • Status changed from New to Closed

While reviewing my open tickets, I read through this one and I feel that poo#106841 could fully cover the issue. IMO bcache could only use a high-speed disk to increase IO performance, but it can't help with better stability for openQA. I suggest closing this ticket. Feel free to reopen it if anyone feels this is still valid.

Actions #39

Updated by okurz 5 months ago

agreed, thx for the update
