action #30595

[ppc64le] Deploy bcache on tmpfs workers

Added by yosun about 2 years ago. Updated over 1 year ago.

Status: New
Start date: 22/01/2018
Priority: Normal
Due date: -
Assignee: -
% Done: 0%

Category: -
Target version: openQA Project - Ready
Duration:

Description

qemu-system-ppc64: -monitor telnet:127.0.0.1:20062,server,nowait: Failed to bind socket: Address already in use

https://openqa.suse.de/tests/1405047
https://openqa.suse.de/tests/1405047/file/autoinst-log.txt

bcache-zram.sh - Script used by Rudi to generate bcache backed ramdisk (2.18 KB) szarate, 10/04/2018 02:05 pm


Related issues

Related to openQA Project - action #30388: [qam] openqaworker10:4 - worker fail all tests - qemu in... Resolved 26/02/2018
Related to openQA Infrastructure - action #38780: [tools] Remove tmpfs from malbec again and give it a try Workable 24/07/2018
Duplicated by openQA Tests - action #30249: [sle][migration][sle15][ppc64le] test fails in install_an... Resolved 12/01/2018

History

#1 Updated by okurz about 2 years ago

  • Subject changed from [tools] Address competition(Failed to bind socket: Address already in use) to [tools][ppc64le][malbec] Address competition(Failed to bind socket: Address already in use)
  • Category set to 168
  • Status changed from New to In Progress
  • Assignee set to okurz

Could be a duplicate of #30388 . https://openqa.suse.de/tests/1403769 is the last test that did not incomplete. https://openqa.suse.de/tests/1403769/file/autoinst-log.txt shows

[2018-01-20T16:57:14.0206 CET] [debug] no
[2018-01-20T16:57:14.0207 CET] [debug] waitpid for 37981 returned 0
[2018-01-20T16:57:14.0207 CET] [debug] capture loop failed Can't write to log: No space left on device at /usr/lib/os-autoinst/bmwqemu.pm line 197.

DIE Can't write to log: No space left on device at /usr/lib/os-autoinst/bmwqemu.pm line 197.

 at /usr/lib/os-autoinst/backend/baseclass.pm line 80.
    backend::baseclass::die_handler('Can\'t write to log: No space left on device at /usr/lib/os-a...') called at /usr/lib/perl5/5.18.2/Carp.pm line 100
    Carp::croak('Can\'t write to log: No space left on device') called at /usr/lib/perl5/vendor_perl/5.18.2/Mojo/Log.pm line 32

I guess that is the problem.

Right now there is no space depletion on malbec (anymore?). I terminated the still running qemu process and will restart the incomplete jobs.
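For reference, a quick way to see which process still holds a monitor port like the one in the error above. This is an illustration, not a record of the exact commands that were run; the port number is taken from the error message:

```shell
#!/bin/sh
# Illustration only: list the process still listening on the qemu monitor
# port behind the "Address already in use" error. With the listener known,
# the stale qemu can be terminated and the incomplete jobs restarted.
PORT="${1:-20062}"
ss -ltnp "sport = :$PORT" || true     # show the listener and its pid (pid needs root)
# pkill -f "telnet:127.0.0.1:$PORT"   # uncomment to kill the stale qemu
```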

#2 Updated by okurz about 2 years ago

  • Related to action #30388: [qam] openqaworker10:4 - worker fail all tests - qemu instances are left behind added

#3 Updated by okurz about 2 years ago

  • Duplicated by action #30249: [sle][migration][sle15][ppc64le] test fails in install_and_reboot - failed by diskspace is exhausted added

#4 Updated by okurz about 2 years ago

  • Priority changed from Normal to Urgent

#5 Updated by okurz about 2 years ago

  • Assignee deleted (okurz)
  • Priority changed from Urgent to High

About 10 jobs are now incomplete on malbec. I wonder if certain scenarios, e.g. migration, are more prone to deplete the space on malbec. I did the "urgent" part for now; back to the backlog for further consideration. I recommend either further limiting the number of instances or increasing the available space in /var/lib/openqa/pool. Right now it's "tmpfs 80G 46G 35G 57% /var/lib/openqa/pool", so no big margin.
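A small helper along these lines could watch the pool mount and warn before the space runs out. The default path and the 90% threshold are illustrative; on the workers the path would be /var/lib/openqa/pool:

```shell
#!/bin/sh
# Warn when a pool filesystem (the df line quoted above) runs low on space.
# Defaults to "/" so it runs anywhere; pass /var/lib/openqa/pool on a worker.
POOL="${1:-/}"
THRESHOLD="${2:-90}"   # warn at 90% used

# df -P gives the portable single-line format; field 5 is "Use%".
used=$(df -P "$POOL" | awk 'NR==2 { sub(/%/, "", $5); print $5 }')
if [ "$used" -ge "$THRESHOLD" ]; then
    echo "WARNING: $POOL is ${used}% full"
else
    echo "OK: $POOL is ${used}% full"
fi
```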

#6 Updated by coolo about 2 years ago

  • Subject changed from [tools][ppc64le][malbec] Address competition(Failed to bind socket: Address already in use) to [ppc64le] Deploy bcache on tmpfs workers
  • Status changed from In Progress to New

malbec does most of its work in tmpfs and is overbooked. So if too many jobs run in parallel there, we will run into this. But as we're still short on Power workers, we have to live with it and restart the jobs that ended up running in parallel.

Alex suggested using bcache to get around this situation, with bcache using a file on disk as backing store.
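A minimal sketch of that suggestion, assuming a loop device over a sparse file as the bcache backing store. Device names, the file path, and the size are assumptions, not the final deployment. By default it only prints the commands; set DRY_RUN=0 to execute (needs root, bcache-tools, and a free loop device):

```shell
#!/bin/sh
# Sketch: bcache backing device on a file on disk, mounted as the pool.
set -e
BACKING_FILE="${BACKING_FILE:-/space/bcache-backing.img}"   # assumed path
SIZE="${SIZE:-40G}"                                         # assumed size

# Print commands in dry-run mode (default), execute them otherwise.
run() { if [ "${DRY_RUN:-1}" = 1 ]; then echo "+ $*"; else "$@"; fi; }

run truncate -s "$SIZE" "$BACKING_FILE"      # sparse file as backing store
run losetup /dev/loop7 "$BACKING_FILE"       # expose it as a block device
run make-bcache -B /dev/loop7                # format as bcache backing device
run mkfs.ext4 /dev/bcache0                   # filesystem on top
run mount /dev/bcache0 /var/lib/openqa/pool  # mount as the pool
```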

#7 Updated by coolo about 2 years ago

  • Target version set to Ready

#8 Updated by mitiao about 2 years ago

  • Assignee set to mitiao

#9 Updated by mitiao about 2 years ago

  • Status changed from New to In Progress

Reproduced the issue locally.
Trying to deploy bcache on a test machine following the documentation.

#10 Updated by mitiao about 2 years ago

  • Target version changed from Ready to Current Sprint

#11 Updated by mitiao almost 2 years ago

Tried some deployments on local x86_64 machines with different configurations.
Wanted to look inside malbec and then deploy bcache according to its hardware and OS environment.

Currently blocked by a 'dead' malbec and waiting for it to come back up.
Filed an infra ticket (107529) for this issue:
https://infra.nue.suse.com/Ticket/Display.html?id=107529

#12 Updated by coolo almost 2 years ago

Don't hack this into malbec - hack this into the salt recipes.

#13 Updated by mitiao almost 2 years ago

coolo wrote:

Don't hack this into malbec - hack this into the salt recipes.

Got it, will look into deploying via salt.

#14 Updated by mitiao almost 2 years ago

The malbec infra ticket has had no update since the 2nd of March:

2018-03-01 10:44 The RT System itself - Status changed from 'new' to 'open'
2018-03-01 10:44 lrupp (Lars Vogdt) - Given to mcaj (Martin Caj)
2018-03-01 10:44 lrupp (Lars Vogdt) - Queue changed from BuildOPS to infra
2018-03-01 14:30 mcaj (Martin Caj) - Given to mmaher (Maximilian Maher)
2018-03-02 11:25 fghariani (Fatma Ghariani) - Stolen from mmaher (Maximilian Maher)
2018-03-02 11:25 fghariani (Fatma Ghariani) - Given to ihno (Ihno Krumreich)

Also set up a salt master and minion locally to compose an sls file for the future bcache deployment.

#15 Updated by mitiao almost 2 years ago

Finished an sls file that deploys the backing device in a loop file and the cache device on an x86_64 test machine.
Looked into an alive Power8 worker to get its disk configuration; now adapting the sls to match the real worker.

#16 Updated by szarate almost 2 years ago

An update: https://fsp1-malbec.arch.suse.de/cgi-bin/cgi is now accessible again but apparently doesn't allow ipmi or any other type of interaction.

However, the machine seems to have booted in baremetal mode, OR there might be network issues that prevent the IP from being set... who knows.

#17 Updated by mitiao almost 2 years ago

I am adapting the bcache sls for the Power8-4 worker, and some questions popped up:

  1. I can't find any SSD device on the machine, so where should the caching device be created?
  2. /var/lib/openqa is mounted on a partition. Is it OK to format that partition as the backing device and mount it to /var/lib/openqa? That would wipe all existing workers and re-install them on the bcache device. Or should a loop file be used as the backing device for newly added workers?

#18 Updated by coolo almost 2 years ago

If we had SSDs in these workers, we wouldn't run them on tmpfs. And we don't want to cache onto SSDs, but use the backing device only for overcommitting the memory.

Don't use the full /var/lib/openqa, because we don't want to store /var/lib/openqa/cache in memory, I guess. But you can shrink it to gain space, and possibly we even have other disks.

#20 Updated by mitiao almost 2 years ago

malbec is back, last update of https://infra.nue.suse.com/Ticket/Display.html?id=107529:
2018-03-21 11:47 jloeser (Jan Loeser) - Status changed from 'open' to 'resolved'

#21 Updated by mitiao almost 2 years ago

Did I misunderstand that we don't need to create a cache on malbec if there is no SSD device?
As I discussed with Coly, using only a backing device is not much different from not using bcache at all.

Also, I don't get the idea of reducing memory overcommitment by using bcache, since bcache mainly uses a fast device as a cache for a slower storage device, and memory usage would grow along with the SSD size.

And if a cache is needed but there is no SSD device, a cache on HDD is not meaningful; a cache on tmpfs would still use memory, still be limited by memory size, and use the memory twice.

By the way, since malbec is alive again, further work should be specific to malbec.

#22 Updated by szarate almost 2 years ago

I'm guessing that the solution coolo is referring to, and that Alex pointed him to, was to use bcache as a backing device that would extend the memory of the system via zram... this would extend the amount of memory available in the pool, so that the data that is commonly used is kept in the faster RAM while the rest is moved to the zram-bcache frankenstein... but my memory is foggy here.

@mitiao, the idea with using rotating disks is that, since we have some RAM to use, we just need to make sure there's enough extra space to store big files. Speed is secondary to some extent here.

The GitLab PR looks fine so far... but well, let malbec be the playground unless coolo says otherwise :)

#23 Updated by coolo almost 2 years ago

This is not about zram, no. We have enough memory - in general. But we need to have a strategy for where to put the blocks in case the RAM is overcommitted. If bcache can't be used with tmpfs, then we need to find a different solution to have RAM as backing store for the pools.

#24 Updated by algraf almost 2 years ago

coolo wrote:

This is not about zram, no. We have enough memory - in general. But we need to have a strategy for where to put the blocks in case the RAM is overcommitted. If bcache can't be used with tmpfs, then we need to find a different solution to have RAM as backing store for the pools.

You can just use a genuine ramdisk rather than tmpfs, combined with bcache. That gives you the same effect as the ZRAM variant, but without the RAM overhead reduction. The big benefit of bcache that we found was that it nicely and continuously streams dirty data onto the backing storage, so it actually saturates the links rather than generating big latency spikes when flushing from time to time.
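One possible reading of this setup, sketched below. The exact topology in Rudi's attached script may differ; this sketch assumes the brd ramdisk acts as the fast writeback cache in front of a rotating disk, so dirty data streams continuously to the backing store. Device names and the 40 GiB size are assumptions. By default it only prints the commands; set DRY_RUN=0 to execute (root, the bcache kernel module, and bcache-tools required):

```shell
#!/bin/sh
# Sketch: brd ramdisk as bcache cache, rotating disk partition as backing.
run() { if [ "${DRY_RUN:-1}" = 1 ]; then echo "+ $*"; else "$@"; fi; }

run modprobe brd rd_nr=1 rd_size=41943040    # 40 GiB ramdisk -> /dev/ram0 (rd_size is in KiB)
run make-bcache -C /dev/ram0 -B /dev/sdb1    # RAM as cache, disk as backing (assumed partition)
run sh -c 'echo writeback > /sys/block/bcache0/bcache/cache_mode'
run mkfs.ext4 /dev/bcache0
run mount /dev/bcache0 /var/lib/openqa/pool
```

In writeback mode, writes land in the RAM cache first and are flushed to the disk in the background, which matches the "streams dirty data onto the backing storage" behaviour described above.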

#25 Updated by szarate almost 2 years ago

Perhaps the script might shed some more light

#26 Updated by szarate almost 2 years ago

It was agreed during the sprint planning to move forward with this.

#27 Updated by szarate almost 2 years ago

@mitiao are you going to keep working on this?

#28 Updated by szarate almost 2 years ago

  • Status changed from In Progress to New
  • Assignee deleted (mitiao)
  • Priority changed from High to Normal
  • Target version changed from Current Sprint to Ready

We need to come up with another idea, so most likely this will come back soon

#29 Updated by szarate over 1 year ago

  • Related to action #38780: [tools] Remove tmpfs from malbec again and give it a try added

#30 Updated by mkittler over 1 year ago

  • Project changed from openQA Project to openQA Infrastructure
  • Category deleted (168)

seems to be an infra issue
