action #30595
[ppc64le] Deploy bcache on tmpfs workers (closed)
Added by yosun almost 7 years ago. Updated about 1 year ago.
0% done
Description
qemu-system-ppc64: -monitor telnet:127.0.0.1:20062,server,nowait: Failed to bind socket: Address already in use
https://openqa.suse.de/tests/1405047
https://openqa.suse.de/tests/1405047/file/autoinst-log.txt
Files
bcache-zram.sh (2.18 KB): script used by Rudi to generate a bcache-backed ramdisk; added by szarate, 2018-04-10 14:05
Updated by okurz almost 7 years ago
- Subject changed from [tools] Address competition(Failed to bind socket: Address already in use) to [tools][ppc64le][malbec] Address competition(Failed to bind socket: Address already in use)
- Category set to 168
- Status changed from New to In Progress
- Assignee set to okurz
Could be a duplicate of #30388. https://openqa.suse.de/tests/1403769 is the last test that did not end up incomplete. https://openqa.suse.de/tests/1403769/file/autoinst-log.txt shows:
[2018-01-20T16:57:14.0206 CET] [debug] no
[2018-01-20T16:57:14.0207 CET] [debug] waitpid for 37981 returned 0
[2018-01-20T16:57:14.0207 CET] [debug] capture loop failed Can't write to log: No space left on device at /usr/lib/os-autoinst/bmwqemu.pm line 197.
DIE Can't write to log: No space left on device at /usr/lib/os-autoinst/bmwqemu.pm line 197.
at /usr/lib/os-autoinst/backend/baseclass.pm line 80.
backend::baseclass::die_handler('Can\'t write to log: No space left on device at /usr/lib/os-a...') called at /usr/lib/perl5/5.18.2/Carp.pm line 100
Carp::croak('Can\'t write to log: No space left on device') called at /usr/lib/perl5/vendor_perl/5.18.2/Mojo/Log.pm line 32
I guess that is the problem.
Right now there is no space depletion on malbec (anymore?). I terminated the still running qemu process and will restart the incomplete jobs.
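For reference, a minimal sketch of how such a stuck port and leftover qemu processes can be spotted (the port number is taken from the error above, everything else is generic):

    ss -ltnp | grep 20062          # which process listens on the monitor port?
    lsof -i TCP:20062              # alternative view, including the owning PID
    pgrep -a qemu-system-ppc64     # list leftover qemu processes with their command lines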
Updated by okurz almost 7 years ago
- Related to action #30388: [qam] openqaworker10:4 - worker fail all tests - qemu instances are left behind added
Updated by okurz almost 7 years ago
- Has duplicate action #30249: [sle][migration][sle15][ppc64le] test fails in install_and_reboot - failed by diskspace is exhausted added
Updated by okurz almost 7 years ago
- Assignee deleted (okurz)
- Priority changed from Urgent to High
About 10 jobs are not incomplete anymore on malbec. I wonder if certain scenarios, e.g. migration, are more prone to deplete the space on malbec. I did the "urgent" part for now, back to the backlog for further consideration. I recommend either further limiting the number of instances or increasing the available space in /var/lib/openqa/pool. Right now it is
tmpfs 80G 46G 35G 57% /var/lib/openqa/pool
so there is no big margin.
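For context, a minimal sketch of how the pool tmpfs could be grown (the size is only a placeholder, and the extra space still has to be backed by RAM/swap):

    df -h /var/lib/openqa/pool
    mount -o remount,size=100G /var/lib/openqa/pool   # tmpfs can be resized online
    # the size= option also needs to be adjusted wherever the mount is defined
    # (e.g. /etc/fstab or the salt-managed configuration) to survive a reboot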
Updated by coolo almost 7 years ago
- Subject changed from [tools][ppc64le][malbec] Address competition(Failed to bind socket: Address already in use) to [ppc64le] Deploy bcache on tmpfs workers
- Status changed from In Progress to New
malbec does most of its work in tmpfs and is overbooked, so if we have too many jobs running in parallel there, we will run into this. But as we are still short on Power workers, we have to live with this and restart the jobs that ended up running in parallel.
Alex suggested using bcache to get around this situation, where bcache would use a file on disk as the backing store.
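A purely illustrative sketch of that suggestion, assuming a file on the rotating disk, exposed as a loop device, becomes the bcache backing device (paths and sizes are placeholders, not the actual worker layout):

    truncate -s 200G /space/openqa-pool-backing.img      # sparse file on disk
    losetup --find --show /space/openqa-pool-backing.img # e.g. prints /dev/loop0
    make-bcache -B /dev/loop0                            # register it as the bcache backing device
    mkfs.ext4 /dev/bcache0
    mount /dev/bcache0 /var/lib/openqa/pool
    # a fast cache device could later be created with make-bcache -C and attached
    # through /sys/block/bcache0/bcache/attach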
Updated by mitiao almost 7 years ago
- Status changed from New to In Progress
Reproduced the issue locally.
Now trying to deploy bcache on a test machine following the documentation.
Updated by mitiao almost 7 years ago
- Target version changed from Ready to Current Sprint
Updated by mitiao almost 7 years ago
Tried some deployments on local x86_64 machines with different configurations.
I wanted to look inside malbec and then deploy bcache according to its hardware and OS environment.
Currently blocked by the 'dead' malbec and waiting for it to come back up.
Filed an infra ticket (107529) for this issue:
https://infra.nue.suse.com/Ticket/Display.html?id=107529
Updated by coolo almost 7 years ago
Don't hack this into malbec - hack this into the salt recipes.
Updated by mitiao almost 7 years ago
coolo wrote:
Don't hack this into malbec - hack this into the salt recipes.
Got it, I will look into deploying this via salt.
Updated by mitiao almost 7 years ago
The malbec infra ticket has not had any update since the 2nd of March:
2018-03-01 10:44 The RT System itself - Status changed from 'new' to 'open'
2018-03-01 10:44 lrupp (Lars Vogdt) - Given to mcaj (Martin Caj)
2018-03-01 10:44 lrupp (Lars Vogdt) - Queue changed from BuildOPS to infra
2018-03-01 14:30 mcaj (Martin Caj) - Given to mmaher (Maximilian Maher)
2018-03-02 11:25 fghariani (Fatma Ghariani) - Stolen from mmaher (Maximilian Maher)
2018-03-02 11:25 fghariani (Fatma Ghariani) - Given to ihno (Ihno Krumreich)
I also set up a salt master and minion locally to compose the sls file for the future bcache deployment.
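For reference, a rough sketch of such a local salt test setup on an openSUSE/SLE host (package and service names as commonly used there; adjust as needed):

    zypper in salt-master salt-minion
    echo "master: localhost" > /etc/salt/minion.d/local.conf
    systemctl enable --now salt-master salt-minion
    salt-key -A              # accept the local minion's key
    salt '*' test.ping       # verify the minion responds
    salt '*' state.apply     # apply the sls states from /srv/salt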
Updated by mitiao almost 7 years ago
Finished an sls file that deploys the backing device on a loop-mounted file and a cache device on an x86_64 test machine.
I also looked into an alive Power8 worker to get its disk configuration and am adapting the sls to match the real worker.
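The kind of commands used to collect a worker's disk and memory configuration before adapting the sls might look like this (purely illustrative):

    lsblk -o NAME,SIZE,TYPE,ROTA,MOUNTPOINT   # block devices; ROTA=1 means rotating disk
    df -h /var/lib/openqa /var/lib/openqa/pool
    free -h
    cat /proc/mdstat                          # any software RAID in use?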
Updated by szarate almost 7 years ago
An update: https://fsp1-malbec.arch.suse.de/cgi-bin/cgi is accessible again but apparently does not allow IPMI or any other type of interaction.
However, the machine seems to have booted in baremetal mode OR there might be network issues that prevent the IP from being set... who knows.
Updated by mitiao almost 7 years ago
I am adapting the bcache sls for the Power8-4 worker, and some questions popped up:
- I can't find any SSD device on the machine, so where should the caching device be created?
- /var/lib/openqa is mounted on a partition; is it OK to format that partition as the backing device and mount it on /var/lib/openqa? That would wipe all existing workers, which would then be re-installed on the bcache device. Or should a loop-mounted file be used as the backing device for newly added workers?
Updated by coolo almost 7 years ago
If we had SSDs in these workers, we wouldn't run them on tmpfs. And we don't want to cache onto SSDs, but use the backing device only for overcommitting the memory.
Don't use the full /var/lib/openqa, because we don't want to store /var/lib/openqa/cache in memory, I guess. You can shrink it to gain space, but possibly we even have other disks.
Updated by mitiao over 6 years ago
wip pr submitted:
https://gitlab.suse.de/openqa/salt-states-openqa/merge_requests/37
Updated by mitiao over 6 years ago
malbec is back, last update of https://infra.nue.suse.com/Ticket/Display.html?id=107529:
2018-03-21 11:47 jloeser (Jan Loeser) - Status changed from 'open' to 'resolved'
Updated by mitiao over 6 years ago
Did I misunderstand that we don't need to create a cache on malbec if there is no SSD device?
As I discussed with Coly, using only a backing device is not much different from not using bcache at all.
Also, I don't get the idea of reducing memory overcommitment by using bcache, since bcache mainly uses a fast device as a cache for a slower storage device, and the memory usage would grow as the SSD size increases.
And if a cache is needed but there is no SSD device: a cache on HDD is not meaningful, and a cache on tmpfs would still use memory as the cache, would still be limited by the memory size, and the memory would be used twice.
By the way, since malbec is alive again, further work should target malbec specifically.
Updated by szarate over 6 years ago
I'm guessing that the solution coolo is referring to, and that Alex pointed him to, was to use bcache with a backing device that would extend the memory of the system via zram... this would extend the amount of memory available for the pool, so that the data that is commonly used is kept in the faster RAM while the rest is moved to the zram-bcache frankenstein... but my memory is foggy here.
@mitiao, the idea with using rotating disks is that, since we have some RAM to use, we just make sure that there is enough extra space to store big files. Speed is secondary to some extent here.
The GitLab PR looks fine so far... but well, let malbec be the playground unless coolo says otherwise :)
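To illustrate the zram-plus-bcache combination described above (this is only a hedged sketch, not the attached bcache-zram.sh; device names and sizes are placeholders): a compressed RAM device acts as the writeback cache in front of a backing device on rotating disk.

    modprobe zram num_devices=1
    echo 32G > /sys/block/zram0/disksize                 # compressed RAM block device
    make-bcache -C /dev/zram0                            # use it as the cache device
    make-bcache -B /dev/sdb1                             # rotating-disk partition as backing device
    UUID=$(bcache-super-show /dev/zram0 | awk '/cset.uuid/ {print $2}')
    echo "$UUID" > /sys/block/bcache0/bcache/attach      # attach the cache set
    echo writeback > /sys/block/bcache0/bcache/cache_mode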
Updated by coolo over 6 years ago
This is not about zram, no. We have enough memory - in general. But we need a strategy for where to put the blocks in case the RAM is overcommitted. If bcache can't be used with tmpfs, then we need to find a different solution to have RAM as the backing store for the pools.
Updated by algraf over 6 years ago
coolo wrote:
This is not about zram, no. We have enough memory - in general. But we need a strategy for where to put the blocks in case the RAM is overcommitted. If bcache can't be used with tmpfs, then we need to find a different solution to have RAM as the backing store for the pools.
You can just use a genuine ramdisk rather than tmpfs as backing store for bcache. That gives you the same effect as the ZRAM variant, but without RAM overhead reduction. The big benefit of bcache that we found was that it continuously streams dirty data onto the backing storage, so it actually saturates the links rather than generating big latency spikes when flushing from time to time.
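A hedged sketch of one way this could look (device names and sizes are placeholders, and the exact bcache role of the RAM device is my assumption): create a genuine block ramdisk with the brd module and plug it into the same make-bcache/attach/writeback steps as in the zram sketch above, with /dev/ram0 in place of /dev/zram0.

    modprobe brd rd_nr=1 rd_size=$((64 * 1024 * 1024))   # one /dev/ram0 of 64 GiB (rd_size is in KiB)
    make-bcache -C /dev/ram0                             # then attach to the backing device as above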
Updated by szarate over 6 years ago
- File bcache-zram.sh bcache-zram.sh added
Perhaps the script might shed some more light
Updated by szarate over 6 years ago
It was agreed during the sprint planning to move forward with this.
Updated by szarate over 6 years ago
@mitiao are you going to keep working on this?
Updated by szarate over 6 years ago
- Status changed from In Progress to New
- Assignee deleted (mitiao)
- Priority changed from High to Normal
- Target version changed from Current Sprint to Ready
We need to come up with another idea, so most likely this will come back soon
Updated by szarate over 6 years ago
- Related to action #38780: [tools] Remove tmpfs from malbec again and give it a try added
Updated by mkittler about 6 years ago
- Project changed from openQA Project (public) to openQA Infrastructure (public)
- Category deleted (168)
seems to be an infra issue
Updated by okurz almost 5 years ago
algraf wrote:
You can just use a genuine ramdisk rather than tmpfs as backing store for bcache. That gives you the same effect as the ZRAM variant, but without RAM overhead reduction.
Do you mean ramfs rather than tmpfs, or something like mke2fs -m 0 /dev/ram0? Wouldn't in both cases the system memory be prone to running full and cause more havoc than a size-limited tmpfs?
Updated by okurz over 4 years ago
- Related to coordination #64746: [saga][epic] Scale up: Efficient handling of large storage to be able to run current tests efficiently but keep big archives of old results added
Updated by okurz over 4 years ago
- Tags changed from caching, openQA, sporadic, arm, ipmi, worker to worker
Updated by okurz almost 3 years ago
- Related to action #106841: Try using tmpfs on openqaworker1 to use RAM more efficiently added
Updated by okurz about 1 year ago
- Related to action #139271: Repurpose PowerPC hardware in FC Basement - mania Power8 PowerPC size:M added
Updated by yosun about 1 year ago
- Status changed from New to Closed
While reviewing my open tickets, I read through this one and feel that poo#106841 fully covers the issue. IMO bcache could only use a high-speed disk to increase IO performance, but it can't help with better stability for openQA. I suggest closing this ticket. Feel free to reopen it if anyone feels it is still valid.