action #30595: [ppc64le] Deploy bcache on tmpfs workers - openQA Infrastructure (public) - openSUSE Project Management Tool

Actions

Copy link

action #30595

closed

[ppc64le] Deploy bcache on tmpfs workers

Added by yosun over 7 years ago. Updated over 1 year ago.

Status:

Closed

Priority:

Low

Assignee:

Category:

Target version:

QA (public) - future

Start date:

2018-01-22

Due date:

% Done:

Estimated time:

Tags:

worker

Description

qemu-system-ppc64: -monitor telnet:127.0.0.1:20062,server,nowait: Failed to bind socket: Address already in use

https://openqa.suse.de/tests/1405047
https://openqa.suse.de/tests/1405047/file/autoinst-log.txt

Files

bcache-zram.sh (2.18 KB) bcache-zram.sh

Script used by Rudi to generate bcache backed ramdisk

szarate, 2018-04-10 14:05

Related issues 6 (1 open — 5 closed)

Related to openQA Project (public) - action #30388: [qam] openqaworker10:4 - worker fail all tests - qemu instances are left behind

Resolved

dasantiago

2018-02-26

Actions

Related to openQA Infrastructure (public) - action #38780: [tools] Remove tmpfs from malbec again and give it a try

Rejected

okurz

2018-07-24

Actions

Related to openQA Project (public) - coordination #64746: [saga][epic] Scale up: Efficient handling of large storage to be able to run current tests efficiently but keep big archives of old results

Resolved

okurz

2020-03-18

Actions

Related to openQA Infrastructure (public) - action #106841: Try using tmpfs on openqaworker1 to use RAM more efficiently

New

2022-02-15

Actions

Related to openQA Infrastructure (public) - action #139271: Repurpose PowerPC hardware in FC Basement - mania Power8 PowerPC size:M

Resolved

okurz

2023-09-20

Actions

Has duplicate openQA Tests (public) - action #30249: [sle][migration][sle15][ppc64le] test fails in install_and_reboot - failed by diskspace is exhausted

Resolved

qmsu

2018-01-12

Actions

Copy link

Updated by okurz over 7 years ago

Subject changed from [tools] Address competition(Failed to bind socket: Address already in use) to [tools][ppc64le][malbec] Address competition(Failed to bind socket: Address already in use)
Category set to 168
Status changed from New to In Progress
Assignee set to okurz

Could be a duplicate of #30388 . https://openqa.suse.de/tests/1403769 is the last test that did not incomplete. https://openqa.suse.de/tests/1403769/file/autoinst-log.txt shows

[2018-01-20T16:57:14.0206 CET] [debug] no [2018-01-20T16:57:14.0207 CET] [debug] waitpid for 37981 returned 0
[2018-01-20T16:57:14.0207 CET] [debug] capture loop failed Can't write to log: No space left on device at /usr/lib/os-autoinst/bmwqemu.pm line 197.

DIE Can't write to log: No space left on device at /usr/lib/os-autoinst/bmwqemu.pm line 197.

 at /usr/lib/os-autoinst/backend/baseclass.pm line 80.
	backend::baseclass::die_handler('Can\'t write to log: No space left on device at /usr/lib/os-a...') called at /usr/lib/perl5/5.18.2/Carp.pm line 100
	Carp::croak('Can\'t write to log: No space left on device') called at /usr/lib/perl5/vendor_perl/5.18.2/Mojo/Log.pm line 32

I guess that is the problem.

Right now there is no space depletion on malbec (anymore?). I terminated the still running qemu process and will restart the incomplete jobs.

Actions

Copy link

Updated by okurz over 7 years ago

Related to action #30388: [qam] openqaworker10:4 - worker fail all tests - qemu instances are left behind added

Actions

Copy link

Updated by okurz over 7 years ago

Has duplicate action #30249: [sle][migration][sle15][ppc64le] test fails in install_and_reboot - failed by diskspace is exhausted added

Actions

Copy link

Updated by okurz over 7 years ago

Priority changed from Normal to Urgent

Actions

Copy link

Updated by okurz over 7 years ago

Assignee deleted (~~okurz~~)
Priority changed from Urgent to High

About 10 jobs are not incompleted now on malbec . I wonder if certain scenarios, e.g. migration are more prone to deplete the space on malbec. I did the "urgent" part for now, back to backlog for further consideration. I recommend to either further limit the numbers of instances or increase the available space in /var/lib/openqa/pool. Right now it's tmpfs 80G 46G 35G 57% /var/lib/openqa/pool so no big margin.

Actions

Copy link

Updated by coolo over 7 years ago

Subject changed from [tools][ppc64le][malbec] Address competition(Failed to bind socket: Address already in use) to [ppc64le] Deploy bcache on tmpfs workers
Status changed from In Progress to New

malbec does most of its stuff in tmpfs and is overbooked. So if we have too many jobs running in parallel there, we will run into this. But as we're still short on power workers, we have to live with this and restart the jobs that ended up running in parallel.

Alex suggested using bcache to come over this situation, where bcache would use a file on disk to backstore.

Actions

Copy link

Updated by coolo over 7 years ago

Target version set to Ready

Actions

Copy link

Updated by mitiao over 7 years ago

Assignee set to mitiao

Actions

Copy link

Updated by mitiao over 7 years ago

Status changed from New to In Progress

Reproduced issue locally.
Trying to deploy bcache on a test machine following the documents

Actions

Copy link

#10

Updated by mitiao over 7 years ago

Target version changed from Ready to Current Sprint

Actions

Copy link

#11

Updated by mitiao about 7 years ago

Tried some deployments on local x86_64 machines with different configurations.
Wanted to see the inside of malbec and then deploy bcache according to its hardware and os envirenment.

Currently blocked by 'dead' malbec and waiting for it up.
Filed a infra ticket (107529) for this issue:
https://infra.nue.suse.com/Ticket/Display.html?id=107529

Actions

Copy link

#12

Updated by coolo about 7 years ago

Don't hack this into malbec - hack this into the salt receipes.

Actions

Copy link

#13

Updated by mitiao about 7 years ago

coolo wrote:

Don't hack this into malbec - hack this into the salt receipes.

Got it, will look into deploy by salt

Actions

Copy link

#14

Updated by mitiao about 7 years ago

malbec issue has no any update since 2nd of March:

2018-03-01 10:44 The RT System itself - Status changed from 'new' to 'open'
2018-03-01 10:44 lrupp (Lars Vogdt) - Given to mcaj (Martin Caj)
2018-03-01 10:44 lrupp (Lars Vogdt) - Queue changed from BuildOPS to infra
2018-03-01 14:30 mcaj (Martin Caj) - Given to mmaher (Maximilian Maher)
2018-03-02 11:25 fghariani (Fatma Ghariani) - Stolen from mmaher (Maximilian Maher)
2018-03-02 11:25 fghariani (Fatma Ghariani) - Given to ihno (Ihno Krumreich)

Also setup a salt master and minon on local to compose sls file which is for future bcache deployment

Actions

Copy link

#15

Updated by mitiao about 7 years ago

Done a sls file to deploy backing device in a looped file and cache device on a x86_64 test machine.
Looked into alive power8 worker to get disk configurations, adapting sls to match the real worker.

Actions

Copy link

#16

Updated by szarate about 7 years ago

An update, https://fsp1-malbec.arch.suse.de/cgi-bin/cgi is now acessible again but doesn't allow ipmi or any other type of interaction apparently.

However the machine seems to have booted in baremetal mode OR there might be network issues that don't allow the ip to be set... who knows.

Actions

Copy link

#17

Updated by mitiao about 7 years ago

I am adapting bcache sls for Power8-4 worker, and some questions were pop-up:

I can't find any ssd device on the machine, so where the caching device should create?
/var/lib/opneqa is mounted on a partition, is it ok to format the partition as backing device and mount to /var/lib/openqa? so it would clean up all existing workers and re-install them on bcache device.
or use looped file as backing device for newly added workers?

Actions

Copy link

#18

Updated by coolo about 7 years ago

if we had ssd in these workers, we wouldn't run them on tmpfs. And we don't want to cache onto ssds, but use the backing device only for overcommiting the memory.

Don't use the full /var/lib/openqa because we don't want to store /var/lib/openqa/cache in memory I guess. But you can shrink it to gain space, but possibly we even other disks.

Actions

Copy link

#19

Updated by mitiao about 7 years ago

wip pr submitted:
https://gitlab.suse.de/openqa/salt-states-openqa/merge_requests/37

Actions

Copy link

#20

Updated by mitiao about 7 years ago

malbec is back, last update of https://infra.nue.suse.com/Ticket/Display.html?id=107529:
2018-03-21 11:47 jloeser (Jan Loeser) - Status changed from 'open' to 'resolved'

Actions

Copy link

#21

Updated by mitiao about 7 years ago

Did I misunderstand that we don't need make cache on malbec if no ssd device?
As I talked with Coly, only using backing device is not quite different with not using bcache.

Also i don't get the idea about to reduce memory overcommiting by using bcache, since bcache is mainly to use fast device act as a cache for some slower storage device, and the memory usage would be getting lager along with ssd size increase.

And if a cache is needed and no ssd device, cache on hdd is not meaningful, if cache on tmpfs which is still using memory as cache but also limited by memory size and memory is twice used.

btw, since malbec is alive, further work should be specific on malbec.

Actions

Copy link

#22

Updated by szarate about 7 years ago

I'm guessing that the solution that coolo is refering to, and that alex pointed him to, was to use the bcache as backing device that would extend the memory of the system via zram... this would extend the ammount of memory available in the pool, so that the actual data that is commonly used is kept in the faster ram while the other is moved to the zram-bcache frankenstein... but my memory is foggy here.

@mitiao, the idea with using rotating disks, is that since we have some ram to use, just to make sure that there' s enough extra space to store big files. Speed is second to some extent here.

Gitlab PR looks fine so far... but well, let malbec be the play ground unless coolo says otherwise :)

Actions

Copy link

#23

Updated by coolo about 7 years ago

This is not about zram, no. We have enough memory - in general. But we need to have a strategy where to put the blocks in case the RAM is overcommited. If bcache can't be used with tmpfs, then we need to find a different solution to have RAM as backing store to the pools.

Actions

Copy link

#24

Updated by algraf about 7 years ago

coolo wrote:

This is not about zram, no. We have enough memory - in general. But we need to have a strategy where to put the blocks in case the RAM is overcommited. If bcache can't be used with tmpfs, then we need to find a different solution to have RAM as backing store to the pools.

You can just use a genuine ramdisk rather than tmpfs as backing store for bcache. That gives you the same effect as the ZRAM variant, but without RAM overhead reduction. The big benefit of bcache that we found was that it nicely continuously streams dirty data onto the backing storage, so it actually saturates the links rather than generate big latency spikes when flushing from time to time.

Actions

Copy link

#25

Updated by szarate about 7 years ago

File bcache-zram.sh bcache-zram.sh added

Perhaps the script might shed some more light

Actions

Copy link

#26

Updated by szarate about 7 years ago

It was agreed during the sprint planning to move forward with this.

Actions

Copy link

#27

Updated by szarate about 7 years ago

@mitiao are you going to keep working on this?

Actions

Copy link

#28

Updated by szarate about 7 years ago

Status changed from In Progress to New
Assignee deleted (~~mitiao~~)
Priority changed from High to Normal
Target version changed from Current Sprint to Ready

We need to come up with another idea, so most likely this will come back soon

Actions

Copy link

#29

Updated by szarate almost 7 years ago

Related to action #38780: [tools] Remove tmpfs from malbec again and give it a try added

Actions

Copy link

#30

Updated by mkittler over 6 years ago

Project changed from openQA Project (public) to openQA Infrastructure (public)
Category deleted (~~168~~)

seems to be an infra issue

Actions

Copy link

#31

Updated by okurz about 5 years ago

algraf wrote:

You can just use a genuine ramdisk rather than tmpfs as backing store for bcache. That gives you the same effect as the ZRAM variant, but without RAM overhead reduction.

Do you mean ramfs rather then tmpfs or something like mke2fs -m 0 /dev/ram0 ? Wouldn't in both cases the system memory be prone to run full and cause more havoc than a size-limited tmpfs?

Actions

Copy link

#32

Updated by okurz about 5 years ago

Related to coordination #64746: [saga][epic] Scale up: Efficient handling of large storage to be able to run current tests efficiently but keep big archives of old results added

Actions

Copy link

#33

Updated by okurz almost 5 years ago

Tags changed from caching, openQA, sporadic, arm, ipmi, worker to worker

Actions

Copy link

#34

Updated by okurz over 4 years ago

Priority changed from Normal to Low

Actions

Copy link

#35

Updated by okurz over 4 years ago

Target version changed from Ready to future

Actions

Copy link

#36

Updated by okurz over 3 years ago

Related to action #106841: Try using tmpfs on openqaworker1 to use RAM more efficiently added

Actions

Copy link

#37

Updated by okurz over 1 year ago

Related to action #139271: Repurpose PowerPC hardware in FC Basement - mania Power8 PowerPC size:M added

Actions

Copy link

#38

Updated by yosun over 1 year ago

Status changed from New to Closed

When review my opening ticket, I read through this and I feel the poo#106841 could fully cover the issue. IMO bcache could only use high speed disk to increase IO performance, but it can't help with better stability for openQA. I suggest to closing this ticket. Feel free to reopen it if anyone feel this still valid.

Actions

Copy link

#39

Updated by okurz over 1 year ago

agreed, thx for the update

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

QA (public) » openQA Project (public) » openQA Infrastructure (public)

Tags

Custom queries

action #30595

[ppc64le] Deploy bcache on tmpfs workers

Updated by okurz over 7 years ago

Updated by okurz over 7 years ago

Updated by okurz over 7 years ago

Updated by okurz over 7 years ago

Updated by okurz over 7 years ago

Updated by coolo over 7 years ago

Updated by coolo over 7 years ago

Updated by mitiao over 7 years ago

Updated by mitiao over 7 years ago

Updated by mitiao over 7 years ago

Updated by mitiao about 7 years ago

Updated by coolo about 7 years ago

Updated by mitiao about 7 years ago

Updated by mitiao about 7 years ago

Updated by mitiao about 7 years ago

Updated by szarate about 7 years ago

Updated by mitiao about 7 years ago

Updated by coolo about 7 years ago

Updated by mitiao about 7 years ago

Updated by mitiao about 7 years ago

Updated by mitiao about 7 years ago

Updated by szarate about 7 years ago

Updated by coolo about 7 years ago

Updated by algraf about 7 years ago

Updated by szarate about 7 years ago

Updated by szarate about 7 years ago

Updated by szarate about 7 years ago

Updated by szarate about 7 years ago

Updated by szarate almost 7 years ago

Updated by mkittler over 6 years ago

Updated by okurz about 5 years ago

Updated by okurz about 5 years ago

Updated by okurz almost 5 years ago

Updated by okurz over 4 years ago

Updated by okurz over 4 years ago

Updated by okurz over 3 years ago

Updated by okurz over 1 year ago

Updated by yosun over 1 year ago

Updated by okurz over 1 year ago