action #102167

Disk monitoring for s390x z/VM backend

Added by geor 3 months ago. Updated about 2 months ago.

Status: Resolved
Priority: Normal
Assignee:
Target version:
Start date: 2021-11-09
Due date:
% Done: 0%
Estimated time:

Description

Observation

There have been several recent occasions of openQA s390x z/VM tests failing with "Disk or file space is full", e.g. in scenario sle-15-SP4-Online-s390x-allpatterns@s390x-zVM-vswitch-l3, in bootloader_start.

Acceptance Criteria

AC1: Monitor disk space of the z/VM hypervisor
AC2: Trigger a manual or automatic clean-up process

Other Suggestions

  • Increase storage size

History

#1 Updated by geor 3 months ago

  • Subject changed from Disk monitoring for z/VM backend on s390 jobs to Disk monitoring for s390x z/VM backend

#2 Updated by okurz 3 months ago

  • Target version set to future

I consider this a good idea. So far the maintenance, and hence the monitoring, of those "in the middle" hypervisor hosts is done by QE-Core, so that's actually within your team. salt-states-openqa is pretty flexible and we already use it for more hosts than just strictly "openQA" hosts, so one could try to simply register the corresponding machines with salt, and basic monitoring might come along automatically :) Anything on top could be done with salt minion roles.

#3 Updated by geor 3 months ago

Update: as I understand from a thread in qe-core about this issue, the "space is full" message was referring to an ephemeral disk created for the installation process.
So monitoring the hypervisor disk, which is the topic of this progress issue, is not relevant in this case.

#4 Updated by geor 3 months ago

  • Status changed from New to Resolved
  • Assignee set to mgriessmeier

After the fix from Matthias (I presume a larger size was defined for the aforementioned ephemeral disk), bootloader_start is no longer failing.
I will mark this progress issue as resolved: its Acceptance Criteria were not in line with the nature of the problem and are therefore invalid, and the actual problem has been resolved.

#5 Updated by szarate 3 months ago

  • Status changed from Resolved to Feedback

I will keep it open until Matthias provides the steps for the fix, as for now we don't have the knowledge to fix the problem if it arises again (AC2), and the monitoring of it could be implemented in multiple ways.

#6 Updated by mgriessmeier 3 months ago

Hi guys,

let me give you some explanation on the root cause of this issue.

As you might be aware, to spawn installations on z/VM guests there is a script called ftpboot which specifies the installation medium to be loaded. You might be more familiar with the name "qaboot", which is basically a fork of ftpboot I wrote some years ago to adapt it to our use cases, because ftpboot has some functionality we didn't really need.

let me briefly explain how qaboot works. Basically you can compare it to a PXE setup where you specify which kernel and initrd the system is going to load and boot from.

qaboot only does a few things. As an example, let's take this call: qaboot 10.160.0.207 SLE-15-SP4-Full-s390x-Build61.1-Media1

  1. it looks on ftp server 10.160.0.207 (which is ftp://openqa.suse.de/) for a folder "SLE-15-SP4-Full-s390x-Build61.1-Media1"
  2. it creates a temporary disk T on the z/VM with a fixed size, which so far was always big enough to hold all the data (you see where this is going). Any existing disk T is destroyed before each creation, to ensure that no old data is left over.
  3. it downloads initrd, kernel and default parmfile from the directory (ftp://openqa.suse.de/SLE-15-SP4-Full-s390x-Build61.1-Media1)
  4. it loads initrd and kernel from this temporary disk T and boots from it.
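
The four steps above can be sketched as a dry run in Python. This is only an illustration and not the real qaboot script: the function name `plan_qaboot`, the file names in `BOOT_FILES` and the wording of the steps are assumptions made for this example, and nothing is actually downloaded or booted.

```python
# Hypothetical file names: the thread only says "initrd, kernel and default parmfile".
BOOT_FILES = ("initrd", "kernel", "parmfile")


def plan_qaboot(ftp_host, folder, disk_size):
    """Return the steps described above as human-readable strings (dry run only)."""
    base = f"ftp://{ftp_host}/{folder}"
    steps = [
        f"look up {base}",
        f"destroy and recreate temporary disk T (size {disk_size})",
    ]
    # One download step per boot file, all taken from the same FTP folder.
    steps += [f"download {base}/{name} to T" for name in BOOT_FILES]
    steps.append("load initrd and kernel from T and boot")
    return steps


plan = plan_qaboot("10.160.0.207", "SLE-15-SP4-Full-s390x-Build61.1-Media1", "500KB")
for step in plan:
    print(step)
```

The point of the sketch is that the size of disk T is fixed up front, before anything is downloaded, so a larger set of boot files only shows up as a failure in the download step.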

so here was the issue: I took the initial value from ftpboot, which was 200KB, usually way more than is needed to store initrd, kernel and parmfile.
apparently, with build 61.1 the overall size of these 3 files exceeded this (I didn't quite figure out why, because compared to previous builds the size had only slightly increased and my math never added up to 200KB).
However, while investigating this issue I stumbled across this 200KB and simply increased it to 500KB within the qaboot script. This already fixed it. I still cannot say why the data from the new kernel didn't fit, but probably the ftp download creates some swap files or similar.

so long story short, increasing the size of this disk by 150% should solve it for good.
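
As a quick check of the numbers above, a minimal sketch: going from 200KB to 500KB is indeed a 150% increase. The `fits` helper and the file sizes passed to it are invented for illustration (the thread gives no actual file sizes); they only show how some extra temporary data during the download could push an otherwise fitting set of files over the fixed limit.

```python
def percent_increase(old, new):
    """Relative growth from old to new, in percent."""
    return (new - old) / old * 100


def fits(file_sizes, disk_size, overhead=0):
    """True if all files plus any temporary overhead fit on the disk."""
    return sum(file_sizes) + overhead <= disk_size


print(percent_increase(200, 500))  # 150.0

# Made-up sizes in the same unit as the disk size, NOT measured values.
example_files = [90, 100, 1]
print(fits(example_files, 200))               # files alone fit on the old disk
print(fits(example_files, 200, overhead=20))  # with temporary overhead they do not
print(fits(example_files, 500, overhead=20))  # the enlarged disk has headroom
```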

here are the steps I took to check and change the qaboot script (please be cautious, because this can destroy our workflows):

  1. I logged on as the linuxmnt user over x3270 to s390zp11 (our z/VM hypervisor). linuxmnt is a sort of administrator user that has the privileges to modify files such as qaboot and distributes them to all the remaining guests.
  2. I edited the qaboot script to increase the size of the disk.
  3. afterwards I logged off all QAZVM### guests to make them aware of the change in the script.

this was quite a corner case in the z/VM administration area, which imo makes it hard to implement proper monitoring for it, but I'd be happy to support here if you have any nice ideas on how this could be done.

#7 Updated by szarate about 2 months ago

  • Status changed from Feedback to Resolved

Thanks matthi!
