action #166172

closed

[FIRING:1] ada (ada: partitions usage (%) alert Generic partitions_usage_alert_ada generic) size:S

Added by livdywan 4 months ago. Updated 2 months ago.

Status: Resolved
Priority: High
Assignee:
Category: Regressions/Crashes
Start date: 2024-09-02
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Observation

https://stats.openqa-monitor.qa.suse.de/d/GDada/dashboard-for-ada?orgId=1&refresh=1m

disk.mean { device: efivarfs, fstype: efivarfs }
16.0%        16.0%        16.0%
disk.mean { device: nvme0n1p1, fstype: vfat }
1.15%        1.15%        1.15%
disk.mean { device: nvme0n1p2, fstype: btrfs }
86.3%        86.3%        86.3%
disk.mean { device: sda2, fstype: xfs }
0.106%        0.106%        0.106%

Suggestions

  • Investigate what the numbers mean and what triggers the alert
  • DONE Ask people to delete VMs for now
  • This is a physical machine - buy more storage
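
For the first suggestion, a local cross-check of the dashboard numbers could look like the sketch below. It reads the Use% column from df and compares it with a threshold. The threshold of 85 is a hypothetical value; the real alert rule and its query are not shown in this ticket.

```shell
# Extract the Use% column (5th field) from `df -P`-style output on stdin,
# skipping the header line and stripping the trailing percent sign.
parse_use_percent() {
    awk 'NR > 1 { sub(/%$/, "", $5); print $5 }'
}

# Succeed if usage $1 is at or above threshold $2 (both whole percentages).
over_threshold() {
    [ "$1" -ge "$2" ]
}

THRESHOLD=85   # hypothetical value, not taken from the real alert rule
usage=$(df -P / | parse_use_percent)
if over_threshold "$usage" "$THRESHOLD"; then
    echo "/ is at ${usage}% (threshold ${THRESHOLD}%)"
fi
```

Running this on ada for / would show whether the btrfs root partition (nvme0n1p2) is the one driving the alert.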

Actions #1

Updated by livdywan 4 months ago

  • Assignee deleted (livdywan)
  • Parent task deleted (#111929)
Actions #2

Updated by livdywan 4 months ago · Edited

disk.mean { fstype: efivarfs, device: efivarfs }
16.0%   16.0%   16.0%
disk.mean { device: nvme0n1p1, fstype: vfat }
1.15%   1.15%   1.15%
disk.mean { device: nvme0n1p2, fstype: btrfs }
81.4%   86.4%   81.4%
disk.mean { device: sda2, fstype: xfs }
0.106%  0.106%  0.106%

Not sure if anyone deleted files. For now numbers are going down. The alert is not currently active.

Actions #3

Updated by tinita 4 months ago

I asked in Slack yesterday if people could delete virtual machines. That apparently helped.

Actions #4

Updated by livdywan 4 months ago

  • Description updated (diff)
Actions #5

Updated by livdywan 4 months ago

  • Subject changed from [FIRING:1] ada (ada: partitions usage (%) alert Generic partitions_usage_alert_ada generic) to [FIRING:1] ada (ada: partitions usage (%) alert Generic partitions_usage_alert_ada generic) size:S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #6

Updated by okurz 3 months ago

  • Priority changed from High to Normal

Some storage has been freed. We are down to 81% usage with 706G, so I am lowering the priority.

Actions #7

Updated by mkittler 3 months ago

  • Status changed from Workable to In Progress
  • Assignee set to mkittler

This is about the root partition. It is now at 84 % so the usage is slowly increasing again.

There's an additional SSD, /dev/sda2, mounted as /extra_storage. It is not used so far but would gain us 464G, so we should probably just make use of it. The question is just how. I'll take a look at the machine. Maybe we can migrate some of the data to this second SSD.

Actions #8

Updated by mkittler 3 months ago

  • Status changed from In Progress to Feedback
  • Assignee deleted (mkittler)

Not sure what to do with this from our side. Users would simply need to store their VM images on the other disk. According to gdu most of the data is under /var/lib/libvirt. This alone is actually 2.4 TiB, so the additional SSD would gain us relatively little compared to that. (We could move the VM of Phoenix to that SSD because it alone needs 312.2 GiB and will hopefully not grow much more.) The home directories are also quite big with 29.8 GiB in total.

I also wasn't sure how this setup is supposed to be used at all. There's not much information about ada on https://wiki.suse.net/index.php/SUSE-Quality_Assurance/QE_infrastructure and when following (probably outdated) instructions on https://confluence.suse.com/display/openqa/Create+an+openQA+instance I ran into this error:

authentication unavailable: no polkit agent available to authenticate action 'org.libvirt.unix.manage'

Maybe an issue on my side. However, I can connect to osiris without problems (using the same user name) and to qamaster (using the root user) but none of that worked for ada (although I am also not sure whether I was guessing the root password correctly).

We should probably clarify how to access this machine via Virt Manager and what we can/want to move to the other SSD. We could also remove all images older than a certain date from all home directories (including /root). We could also try to ask users again to limit their disk usage.

Actions #9

Updated by livdywan 3 months ago

  • Status changed from Feedback to In Progress
  • Assignee set to livdywan

If you're unassigning yourself, the ticket can't really be in Feedback as there is nobody to act on it 🙃

I will take a look and see if I can follow the instructions. Asking our users directly in Slack also seems like a good idea.

Actions #10

Updated by livdywan 3 months ago · Edited

I will take a look and see if I can follow the instructions. Asking our users directly in Slack also seems like a good idea.

I'm able to connect with ssh -A -L 8443:ada-ipmi:443 -NT jumpy@qe-jumpy.prg2.suse.org, which gets me a web interface. However, I don't know what password to use.

I can connect to ada.qe.suse.de via SSH. This is where I'm probably running into the same problem as VMM says authentication unavailable: no polkit agent available to authenticate action 'org.libvirt.unix.manage' when trying to connect.

So I'm asking for advice on Slack.

Actions #11

Updated by livdywan 3 months ago

  • Status changed from In Progress to Feedback
Actions #12

Updated by nicksinger 3 months ago

livdywan wrote in #note-10:

I will take a look and see if I can follow the instructions. Asking our users directly in Slack also seems like a good idea.

I'm able to connect with ssh -A -L 8443:ada-ipmi:443 -NT jumpy@qe-jumpy.prg2.suse.org, which gets me a web interface. However, I don't know what password to use.

I can connect to ada.qe.suse.de via SSH. This is where I'm probably running into the same problem as VMM says authentication unavailable: no polkit agent available to authenticate action 'org.libvirt.unix.manage' when trying to connect.

So I'm asking for advice on Slack.

The credentials for the BMC can be found here: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls?ref_type=heads#L2664

I was able to connect with virt-manager and got a list of all machines. The polkit problems you both encountered seem related to your local configuration although it is indeed strange that connections to e.g. osiris work. I checked my machine and found that /usr/share/polkit-1/rules.d/50-libvirt.rules should provide the requested permission if the user is in the libvirt group. The rules-file comes from the libvirt-daemon-package. Could you check that you have this installed?
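
The group-membership part of this check can be sketched as follows. The helper name has_group is made up for illustration, and the usermod hint is only a suggestion for whoever administers the machine; neither comes from the existing instructions.

```shell
# Return success if the space-separated group list in $1 contains group $2.
has_group() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Check the given user (default: the current one) on the machine you connect to.
user="${1:-$(id -un)}"
if has_group "$(id -nG "$user")" libvirt; then
    echo "$user is in the libvirt group"
else
    echo "$user is NOT in the libvirt group; an admin could add it with: usermod -aG libvirt $user"
fi
```

Note that the check has to run on the remote host (here ada), since that is where the polkit rule evaluates the group.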

Actions #13

Updated by livdywan 3 months ago · Edited

I was able to connect with virt-manager and got a list of all machines. The polkit problems you both encountered seem related to your local configuration although it is indeed strange that connections to e.g. osiris work. I checked my machine and found that /usr/share/polkit-1/rules.d/50-libvirt.rules should provide the requested permission if the user is in the libvirt group. The rules-file comes from the libvirt-daemon-package. Could you check that you have this installed?

Turns out my user on ada wasn't in the group and that's why the connection failed. I'm updating the instructions accordingly.

sudo du -h /var/lib/libvirt/images /mnt/SSD_extension /extra_storage
2,5T    /var/lib/libvirt/images
0       /mnt/SSD_extension
0       /extra_storage

Nobody is following the recommendation to use SSD_extension, which is written in red font. It is on the same disk anyway:

sudo df -h /var/lib/libvirt/images /mnt/SSD_extension /extra_storage/
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p2  3,5T  3,0T  584G  84% /var
/dev/nvme0n1p2  3,5T  3,0T  584G  84% /
/dev/sda2       464G  506M  464G   1% /extra_storage

So I removed SSD_extension from the instructions. Instead I'm recommending extra_storage. The GUI actually shows the amount of available space, too.

Actions #14

Updated by livdywan 3 months ago

  • Status changed from Feedback to Resolved

With 1TB available in total and updated instructions I'd say we are good here and I don't see an immediate need for more space.

Actions #15

Updated by okurz 3 months ago

  • Status changed from Resolved to Workable
  • Assignee deleted (livdywan)
  • Priority changed from Normal to High
Actions #16

Updated by ybonatakis 3 months ago

I moved two images (/extra_storage/liv-openqa-*) from ada to my own machine.
I also deleted some snapshots and, to release files that were deleted but still held open by qemu, killed the processes holding them with sudo lsof |grep deleted | grep qemu | awk '{print $2}' | xargs sudo kill -9
but df doesn't show a reduced size.

Only /extra_storage changed:

iob@ada:~> sudo du -h /var/lib/libvirt/images /mnt/SSD_extension /extra_storage
2.6T /var/lib/libvirt/images
0 /mnt/SSD_extension
0 /extra_storage
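
Space held by deleted files is only released once the last process closes them, which is why the lsof pipeline above targets qemu processes. A less drastic first step is to only list such files without killing anything. The sketch below does that through /proc on Linux; the function name list_deleted_open is made up for illustration.

```shell
# List files that PID $1 still holds open although they were deleted.
# Linux marks such descriptors with a " (deleted)" suffix in /proc/PID/fd.
list_deleted_open() {
    for fd in /proc/"$1"/fd/*; do
        target=$(readlink "$fd" 2>/dev/null) || continue
        case "$target" in
            *' (deleted)') echo "$1 $target" ;;
        esac
    done
}

# Example: inspect every running qemu process
# (needs root to read descriptors of other users' processes).
for pid in $(pgrep qemu 2>/dev/null); do
    list_deleted_open "$pid"
done
```

With the list in hand, the affected VMs can be shut down cleanly instead of being killed with -9.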

Actions #17

Updated by ybonatakis 3 months ago · Edited

I don't know if it helps, but I also ran sudo btrfs balance -d -m /. I expect it to run for a while; it is currently 27% done.

Actions #18

Updated by okurz 3 months ago · Edited

TODOs:

  1. Disable copy-on-write for the libvirt images, e.g. something like chattr +C /var/lib/libvirt/images or however that command was
  2. Use /dev/sda2 as btrfs filesystem extension
  3. Ensure proper btrfs scrub, trim and balance
  4. Ask machine and image owners which images can be removed/moved/shrunk
  5. Consider ordering a bigger SSD to put into a storage extension slot and use that for /var/lib/libvirt/images
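
A rough dry-run sketch of points 1 to 3 follows. It only prints the commands unless DRY_RUN=0 is set. The ordering, the balance filter -dusage=50, and the function name extend_root_with are assumptions for illustration; this is not a record of what was actually run on ada.

```shell
# Dry-run sketch: prints the commands instead of running them.
# Set DRY_RUN=0 only after reviewing every step on the target machine.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

extend_root_with() {
    dev="$1"   # device to add, e.g. /dev/sda2 as named in the ticket
    mnt="$2"   # where it is currently mounted, e.g. /extra_storage
    run umount "$mnt"
    run btrfs device add -f "$dev" /        # -f overwrites the existing filesystem on the device
    run btrfs balance start -dusage=50 /    # assumed filter value; optional step
    run chattr +C /var/lib/libvirt/images   # No_COW applies only to files created afterwards
    run btrfs scrub start /
    run fstrim -v /
}

extend_root_with /dev/sda2 /extra_storage
```

Any data on the added device is lost, and a matching /etc/fstab entry for the old mount point would need to be removed by hand.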
Actions #19

Updated by okurz 3 months ago

  • Assignee set to jbaier_cz
Actions #20

Updated by jbaier_cz 3 months ago

  • Status changed from Workable to In Progress

A quick fix for a cron script creating unnecessary output: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1283

Actions #21

Updated by jbaier_cz 2 months ago

I used for dom in $(virsh list --all --name); do virsh dumpxml --domain "$dom" --xpath '//devices/disk/source/@file'; done | sort -u | cut -f2 -d= | tr -d \" to identify all used images and cross-checked it with the output of virsh vol-list default. That helped me to identify some old, unused qcow files and make some more space.
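
The cross-check described above can be sketched as a small script. The helper name unused_images is made up for illustration, and parsing virsh vol-list with awk assumes the usual two header lines and volume paths without spaces.

```shell
# Print paths that appear in the "all volumes" list ($2) but not in the
# "used by some domain" list ($1). Both files hold one path per line.
unused_images() {
    sort -u "$1" > "$1.sorted"
    sort -u "$2" > "$2.sorted"
    comm -13 "$1.sorted" "$2.sorted"
    rm -f "$1.sorted" "$2.sorted"
}

# Only attempt the real lookup where virsh is available.
if command -v virsh >/dev/null 2>&1; then
    used=$(mktemp)
    all=$(mktemp)
    for dom in $(virsh list --all --name); do
        virsh dumpxml --domain "$dom" --xpath '//devices/disk/source/@file'
    done | cut -f2 -d= | tr -d '"' > "$used"
    # Skip the header and separator lines, keep the Path column.
    virsh vol-list default | awk 'NR > 2 && NF { print $2 }' > "$all"
    unused_images "$used" "$all"
    rm -f "$used" "$all"
fi
```

The output is only a list of candidates; each image should still be confirmed with its owner before it is removed.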

Actions #22

Updated by jbaier_cz 2 months ago

  • Status changed from In Progress to Resolved

I used the mounted SSD as an extension for the root filesystem, which together with the cleaning created quite some spare space:

ada:~ #  df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p2  4.0T  2.8T  1.3T  69% /

Consider ordering a bigger SSD to put into a storage extension slot and use that for /var/lib/libvirt/images

Points 1 to 4 are completed, which leaves option 5 for the future if there is a need for even more VMs.

I did not find any silenced alert for this machine, so I guess we should be fine here.
