action #80482

openQA Project - coordination #80142: [saga][epic] Scale out openQA: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

qa-power8-5-kvm has been down for days, use more robust filesystem setup

Added by okurz about 2 months ago. Updated about 13 hours ago.

Status:
In Progress
Priority:
High
Assignee:
Target version:
Start date:
Due date:
% Done:

0%

Estimated time:

Related issues

Related to openQA Infrastructure - action #81058: [tracker-ticket] Power machines can't find installed OS. Automatic reboots disabled for now (Feedback, 2020-12-15)

Copied from openQA Infrastructure - action #78218: [openQA][worker] Almost all openQA workers become offline (Resolved, 2020-11-19)

History

#1 Updated by okurz about 2 months ago

  • Copied from action #78218: [openQA][worker] Almost all openQA workers become offline added

#2 Updated by ldevulder about 2 months ago

I restarted the server and found ext2 errors on the /var/lib/openqa filesystem (on the /dev/sdb1 block device); a full fsck on it fixed them.
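For illustration, the repair amounts to something like the following. This is a sketch run against a scratch image file rather than the real device (/dev/sdb1 on qa-power8-5-kvm, which would need to be unmounted first); the exact flags used on the machine are not recorded in the ticket.

```shell
# Create a throwaway ext2 filesystem in an image file to demonstrate on
truncate -s 16M /tmp/scratch-ext2.img
mkfs.ext2 -q -F /tmp/scratch-ext2.img
# -f forces a full check even when the fs is marked clean, -y answers yes
# to every proposed repair; this is what "a full fsck" amounts to
e2fsck -f -y /tmp/scratch-ext2.img
```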

#3 Updated by okurz about 2 months ago

As discussed in chat: that ext2 filesystem is what backs /var/lib/openqa. On other machines we recreate the filesystem on every boot, and ext2 has always shown better performance than ext4. However, I agree that an ext2 filesystem should not be considered reboot-safe, so we should either select a journaling filesystem or apply the same filesystem-recreation method we use on NVMe-enabled workers.
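The recreate-on-boot approach could be expressed as a oneshot systemd unit along these lines. This is a hypothetical sketch only: the unit name, device path and mkfs flags are assumptions, not the actual salt state used on the NVMe workers.

```ini
# recreate-openqa-fs.service (hypothetical; data on the device is disposable)
[Unit]
Description=Recreate /var/lib/openqa filesystem on every boot
DefaultDependencies=no
Before=local-fs.target var-lib-openqa.mount

[Service]
Type=oneshot
ExecStart=/sbin/mkfs.ext2 -F /dev/sdb1
```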

#4 Updated by cdywan about 1 month ago

  • Status changed from Workable to In Progress
  • Assignee set to cdywan

okurz wrote:

As discussed in chat: that ext2 filesystem is what backs /var/lib/openqa. On other machines we recreate the filesystem on every boot, and ext2 has always shown better performance than ext4. However, I agree that an ext2 filesystem should not be considered reboot-safe, so we should either select a journaling filesystem or apply the same filesystem-recreation method we use on NVMe-enabled workers.

I think re-creating the filesystem makes sense. It seems to work well enough on machines where we use it.

#5 Updated by nicksinger about 1 month ago

  • Related to action #81058: [tracker-ticket] Power machines can't find installed OS. Automatic reboots disabled for now added

#6 Updated by cdywan about 1 month ago

  • Subject changed from qa-power8-5-kvm is down since days to qa-power8-5-kvm has been down for days, use more robust filesystem setup

#7 Updated by cdywan about 1 month ago

MR open: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/419. I still need to work out how to deal with an existing filesystem; salt doesn't seem to like it yet and fails with /dev/sda1 already mounted or mount point busy.
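That error means mkfs ran while the device was still in use, so the fix needs an unmount guard first. A minimal sketch of the idea, with the device and mount point taken from the ticket but the function itself an assumption about the eventual fix:

```shell
recreate_fs() {
    dev=$1; mnt=$2
    # mkfs refuses with "already mounted or mount point busy" while the
    # device is in use, so unmount first if needed
    if mountpoint -q "$mnt"; then
        umount "$mnt"
    fi
    mkfs.ext2 -q -F "$dev"
    mount "$dev" "$mnt"
}
# Safe demo of the guard itself: a plain directory is not a mount point
mkdir -p /tmp/plain-dir
mountpoint -q /tmp/plain-dir || echo "/tmp/plain-dir is not a mountpoint"
```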

#8 Updated by cdywan 23 days ago

After the initial draft, where I tried to rely on salt to re-create the filesystem, I updated my MR to keep all of the logic within systemd units, since we want to detect the device and configuration dynamically in all cases. I also moved it to a proper script, recreate.sh, which can be run independently and is a lot more readable. I tried to figure out the correct operator priority in the old one-liner and extend it, but gave up after getting lost repeatedly.

It's also worth noting that this didn't use to be tested in GitLab CI, so I added a fake /etc/fstab, which salt expects to find when looking at mountpoints.
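A fake fstab entry matching the setup described in this ticket could be as small as a single line; the option/dump/pass fields here are assumed defaults, not the values actually committed.

```
# device     mountpoint       fstype  options   dump  pass
/dev/sdb1    /var/lib/openqa  ext2    defaults  0     0
```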

#9 Updated by cdywan about 13 hours ago

Brief update: the MR has since been reviewed (I was waiting on that even though we don't require it for salt repos), and I looked into deploying the changes on a subset of workers first to confirm everything works. I twice came back from the salt docs with weird error messages while figuring out how to deploy from a git branch, learning about git remotes, features and salt-ssh (which never copied my repo to the minion when I tried to use it) along the way; none of these are things we currently use. I'm now looking at two options. The first:

  1. copy /etc/salt to ~
  2. tweak master to point at ~ rather than /srv/salt/
  3. salt -c .

Or, if nothing else, temporarily disable salt somehow, merge my branch into the real branch, and deploy from there.
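For the record, the first option roughly corresponds to a masterless salt-call run from a local checkout. This is only a sketch under assumptions: the branch name is a placeholder, and I haven't verified these exact invocations against our setup.

```shell
git clone https://gitlab.suse.de/openqa/salt-states-openqa ~/salt-states-openqa
cd ~/salt-states-openqa
git checkout my-branch   # placeholder branch name
# --local skips the master entirely; --file-root points salt at the checkout
# instead of /srv/salt; test=True only shows what would change
salt-call --local --file-root="$HOME/salt-states-openqa" state.apply test=True
```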

I think going forward I'll try to document this somewhere.
