action #166172
closed
[FIRING:1] ada (ada: partitions usage (%) alert Generic partitions_usage_alert_ada generic) size:S
0%
Description
Observation
https://stats.openqa-monitor.qa.suse.de/d/GDada/dashboard-for-ada?orgId=1&refresh=1m
disk.mean { device: efivarfs, fstype: efivarfs }
16.0% 16.0% 16.0%
disk.mean { device: nvme0n1p1, fstype: vfat }
1.15% 1.15% 1.15%
disk.mean { device: nvme0n1p2, fstype: btrfs }
86.3% 86.3% 86.3%
disk.mean { device: sda2, fstype: xfs }
0.106% 0.106% 0.106%
Suggestions
- Investigate what the numbers mean and what triggers the alert
- DONE Ask people to delete VMs for now
- This is a physical machine - buy more storage
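For the first suggestion, a minimal sketch of how the percentages could be checked on the machine itself. The function name `over_threshold` and the limit of 85 are assumptions for illustration; the thread does not state the threshold the alert actually uses.

```shell
# over_threshold LIMIT: read `df -P` style output on stdin and print
# "mountpoint use%" for every filesystem whose usage is at or above LIMIT.
# Note: mount points containing spaces are not handled by this sketch.
over_threshold() {
    awk -v limit="$1" 'NR > 1 {
        use = $5
        sub(/%$/, "", use)          # strip the trailing percent sign
        if (use + 0 >= limit + 0) print $6, use "%"
    }'
}

# Example against the live system (85 is an assumed limit, not the real alert threshold):
# df -P | over_threshold 85
```

The function only reads text, so it can also be fed saved `df -P` output from another host.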
Updated by livdywan 4 months ago · Edited
disk.mean { fstype: efivarfs, device: efivarfs }
16.0% 16.0% 16.0%
disk.mean { device: nvme0n1p1, fstype: vfat }
1.15% 1.15% 1.15%
disk.mean { device: nvme0n1p2, fstype: btrfs }
81.4% 86.4% 81.4%
disk.mean { device: sda2, fstype: xfs }
0.106% 0.106% 0.106%
Not sure if anyone deleted files. For now numbers are going down. The alert is not currently active.
Updated by livdywan 4 months ago
- Subject changed from [FIRING:1] ada (ada: partitions usage (%) alert Generic partitions_usage_alert_ada generic) to [FIRING:1] ada (ada: partitions usage (%) alert Generic partitions_usage_alert_ada generic) size:S
- Description updated (diff)
- Status changed from New to Workable
Updated by mkittler 3 months ago
- Status changed from Workable to In Progress
- Assignee set to mkittler
This is about the root partition. It is now at 84 %, so the usage is slowly increasing again.
There's an additional SSD, `/dev/sda2`, mounted as `/extra_storage`. It is not used so far but would gain us 464G, so we should probably just make use of it. The question is just how. I'll take a look on the machine. Maybe we can migrate some of the data to this 2nd SSD.
Updated by mkittler 3 months ago
- Status changed from In Progress to Feedback
- Assignee deleted (mkittler)
Not sure what to do with this from our side. Users would simply need to store their VM images on the other disk. According to gdu, most of the data is under `/var/lib/libvirt`. This alone is actually 2.4 TiB, so the additional SSD would gain us relatively little compared to that. (We could move the VM of Phoenix to that SSD because it alone needs 312.2 GiB and will hopefully not grow much more.) The home directories are also quite big with 29.8 GiB in total.
I also wasn't sure how this setup is supposed to be used at all. There's not much information about ada on https://wiki.suse.net/index.php/SUSE-Quality_Assurance/QE_infrastructure and when following (probably outdated) instructions on https://confluence.suse.com/display/openqa/Create+an+openQA+instance I ran into this error:
authentication unavailable: no polkit agent available to authenticate action 'org.libvirt.unix.manage'
Maybe an issue on my side. However, I can connect to osiris without problems (using the same user name) and to qamaster (using the root user) but none of that worked for ada (although I am also not sure whether I was guessing the root password correctly).
We should probably clarify how to access this machine via Virt Manager and what we can/want to move to the other SSD. We could also remove all images older than a certain date from all home directories (including `/root`). We could also try to ask users again to limit their disk usage.
Updated by livdywan 3 months ago
- Status changed from Feedback to In Progress
- Assignee set to livdywan
If you're unassigning yourself the ticket can't really be in Feedback as there is nobody to act on it 🙃
I will take a look and see if I can follow the instructions. Asking our users directly in Slack also seems like a good idea.
Updated by livdywan 3 months ago · Edited
I will take a look and see if I can follow the instructions. Asking our users directly in Slack also seems like a good idea.
I'm able to connect with `ssh -A -L 8443:ada-ipmi:443 -NT jumpy@qe-jumpy.prg2.suse.org`, which gets me a web interface. However, I don't know what password to use.
I can connect to ada.qe.suse.de via SSH. This is where I'm probably running into the same problem, as VMM says `authentication unavailable: no polkit agent available to authenticate action 'org.libvirt.unix.manage'` when trying to connect.
So I'm asking for advice on Slack.
Updated by nicksinger 3 months ago
livdywan wrote in #note-10:
I will take a look and see if I can follow the instructions. Asking our users directly in Slack also seems like a good idea.
I'm able to connect with `ssh -A -L 8443:ada-ipmi:443 -NT jumpy@qe-jumpy.prg2.suse.org`, which gets me a web interface. However, I don't know what password to use.
I can connect to ada.qe.suse.de via SSH. This is where I'm probably running into the same problem, as VMM says `authentication unavailable: no polkit agent available to authenticate action 'org.libvirt.unix.manage'` when trying to connect.
So I'm asking for advice on Slack.
The credentials for the BMC can be found here: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls?ref_type=heads#L2664
I was able to connect with virt-manager and got a list of all machines. The polkit problems you both encountered seem related to your local configuration, although it is indeed strange that connections to e.g. osiris work. I checked my machine and found that `/usr/share/polkit-1/rules.d/50-libvirt.rules` should provide the requested permission if the user is in the `libvirt` group. The rules file comes from the `libvirt-daemon` package. Could you check that you have this installed?
Updated by livdywan 3 months ago · Edited
I was able to connect with virt-manager and got a list of all machines. The polkit problems you both encountered seem related to your local configuration, although it is indeed strange that connections to e.g. osiris work. I checked my machine and found that `/usr/share/polkit-1/rules.d/50-libvirt.rules` should provide the requested permission if the user is in the `libvirt` group. The rules file comes from the `libvirt-daemon` package. Could you check that you have this installed?
Turns out my user on ada wasn't in the group and that's why the connection failed. I'm updating the instructions accordingly.
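The group check described here can be sketched as follows. The helper name `in_group` is hypothetical; the `libvirt` group name comes from the thread.

```shell
# in_group USER GROUP: succeed if USER is a member of GROUP.
in_group() {
    id -nG "$1" | tr ' ' '\n' | grep -qx -- "$2"
}

# Example, to be run on ada (a new login is needed before the change takes effect):
# in_group "$USER" libvirt || sudo usermod -aG libvirt "$USER"
```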
sudo du -h /var/lib/libvirt/images /mnt/SSD_extension /extra_storage
2,5T /var/lib/libvirt/images
0 /mnt/SSD_extension
0 /extra_storage
Nobody is following the recommendation, written in red font, to use SSD_extension, which is on the same disk anyway:
sudo df -h /var/lib/libvirt/images /mnt/SSD_extension /extra_storage/
Filesystem Size Used Avail Use% Mounted on
/dev/nvme0n1p2 3,5T 3,0T 584G 84% /var
/dev/nvme0n1p2 3,5T 3,0T 584G 84% /
/dev/sda2 464G 506M 464G 1% /extra_storage
So I removed SSD_extension from the instructions. Instead I'm recommending extra_storage. The GUI actually shows the amount of available space, too.
Updated by ybonatakis 3 months ago
I removed two images (`/extra_storage/liv-openqa-*`) from ada and moved them to my own machine.
I also deleted some snapshots and killed the qemu processes still holding deleted files open with `sudo lsof | grep deleted | grep qemu | awk '{print $2}' | xargs sudo kill -9`, but `df` doesn't show a reduced size, except for `/extra_storage`.
```
iob@ada:~> sudo du -h /var/lib/libvirt/images /mnt/SSD_extension /extra_storage
2.6T /var/lib/libvirt/images
0 /mnt/SSD_extension
0 /extra_storage
```
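Since `kill -9` on qemu processes stops running VMs abruptly, it may be worth listing the affected PIDs first. A minimal sketch, assuming Linux `lsof` output where removed files carry a `(deleted)` suffix; the function name `deleted_holders` is hypothetical.

```shell
# deleted_holders PATTERN: read `lsof` output on stdin and print the unique PIDs
# of processes matching PATTERN that still hold deleted files open.
deleted_holders() {
    grep -F '(deleted)' | grep -- "$1" | awk '{print $2}' | sort -un
}

# Example: review the list before deciding whether to stop anything.
# sudo lsof -nP | deleted_holders qemu
```

On btrfs, space that is still referenced by a snapshot is not released until that snapshot is removed, which is one possible reason `df` stays unchanged after deleting files.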
Updated by ybonatakis 3 months ago · Edited
I don't know if it helps, but I also ran `sudo btrfs balance -d -m /`. I expect it to run for a while; it is currently 27% done.
Updated by okurz 3 months ago · Edited
TODOs:
- Disable copy-on-write for the libvirt images, e.g. something like `chattr +C /var/lib/libvirt/images` or however that command was
- Use /dev/sda2 as btrfs filesystem extension
- Ensure proper btrfs scrub,trim,balance
- Ask machine and image owners which images can be removed/moved/shrunk
- Consider ordering a bigger SSD to put into a storage extension slot and use that for /var/lib/libvirt/images
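One possible command sequence for the first three TODO items, written as a dry run that only prints the commands. This is a sketch under assumptions: the thread does not record the exact commands that were eventually used, the function name `plan_storage_extension` and the `usage=50` balance filters are illustrative, and `btrfs device add -f` overwrites the existing xfs filesystem on the device.

```shell
# plan_storage_extension DEVICE MOUNTPOINT IMAGE_DIR: print (do not run) a
# candidate command sequence. Review every line before executing anything.
plan_storage_extension() {
    dev=$1; mnt=$2; images=$3
    cat <<EOF
# No-CoW only applies to files created after the flag is set on the directory
chattr +C $images
lsattr -d $images
# Free the device first (also remove its /etc/fstab entry); -f destroys its data
umount $dev
btrfs device add -f $dev $mnt
# Spread existing data over both devices, then verify and trim
btrfs balance start -dusage=50 -musage=50 $mnt
btrfs scrub start $mnt
fstrim -v $mnt
btrfs filesystem usage $mnt
EOF
}

plan_storage_extension /dev/sda2 / /var/lib/libvirt/images
```

Printing instead of executing keeps the sketch safe to run anywhere; existing image files would need to be copied into the directory again to pick up the no-CoW attribute.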
Updated by jbaier_cz 3 months ago
- Status changed from Workable to In Progress
A quick fix for a cron script creating unnecessary output: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1283
Updated by jbaier_cz 2 months ago
I used `for dom in $(virsh list --all --name); do virsh dumpxml --domain "$dom" --xpath '//devices/disk/source/@file'; done | sort -u | cut -f2 -d= | tr -d \"` to identify all used images and cross-checked it with the output of `virsh vol-list default`. That helped me to identify some old, unused qcow files and make some more space.
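The cross-check can be expressed as a set difference between two path lists. A minimal sketch, assuming one path per line in each file; the function name `unused_images` and the file names in the example are hypothetical, and the `awk` filter for `virsh vol-list` assumes its usual two-line header and paths without spaces.

```shell
# unused_images USED_LIST ALL_LIST: print paths that appear in ALL_LIST
# but not in USED_LIST, one per line.
unused_images() {
    used_sorted=$(mktemp) || return 1
    all_sorted=$(mktemp) || return 1
    sort -u "$1" > "$used_sorted"
    sort -u "$2" > "$all_sorted"
    # -1 drops lines only in the used list, -3 drops lines common to both
    comm -13 "$used_sorted" "$all_sorted"
    rm -f "$used_sorted" "$all_sorted"
}

# Example:
#   (the loop from the comment above) > used.txt
#   virsh vol-list default | awk 'NR > 2 && NF >= 2 {print $2}' > all.txt
#   unused_images used.txt all.txt
```

The result is only a list of candidates; images may still be referenced from backing-file chains or by VMs defined elsewhere, so owners should confirm before anything is deleted.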
Updated by jbaier_cz 2 months ago
- Status changed from In Progress to Resolved
I used the mounted SSD as an extension for the root filesystem, which together with the cleaning created quite some spare space:
ada:~ # df -h /
Filesystem Size Used Avail Use% Mounted on
/dev/nvme0n1p2 4.0T 2.8T 1.3T 69% /
Consider ordering a bigger SSD to put into a storage extension slot and use that for /var/lib/libvirt/images
Points 1 to 4 of the TODO list are completed, which leaves option 5 for the future if there is a need for even more VMs.
I did not find any silenced alert for this machine, so I guess we should be fine here.