action #80482

openQA Project - coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

qa-power8-5-kvm has been down for days, use more robust filesystem setup

Added by okurz 8 months ago. Updated 3 months ago.

Status: Workable
Priority: Low
Assignee: -
Target version: -
Start date: -
Due date: -
% Done: 0%
Estimated time: -
Related issues

Related to openQA Infrastructure - action #81058: [tracker-ticket] Power machines can't find installed OS. Automatic reboots disabled for now (Resolved, 2020-12-15 to 2021-04-16)

Copied from openQA Infrastructure - action #78218: [openQA][worker] Almost all openQA workers become offline (Resolved, 2020-11-19)

History

#1 Updated by okurz 8 months ago

  • Copied from action #78218: [openQA][worker] Almost all openQA workers become offline added

#2 Updated by ldevulder 8 months ago

I restarted the server and found some ext2 errors on the /var/lib/openqa filesystem (on the /dev/sdb1 block device); they were fixed after a full fsck on it.
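For reference, a repair along those lines would look roughly like this (a sketch only; the device name /dev/sdb1 is taken from the comment above, and fsck must run on an unmounted filesystem):

```shell
# Sketch of the repair described above; adjust the device name for your host.
umount /var/lib/openqa            # fsck must not run on a mounted filesystem
fsck.ext2 -f -y /dev/sdb1         # -f forces a full check, -y auto-confirms fixes
mount /dev/sdb1 /var/lib/openqa
```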

#3 Updated by okurz 8 months ago

As discussed in chat: that ext2 filesystem is what is used for /var/lib/openqa. On other machines we recreate the filesystem on every boot. ext2 has always shown better performance than ext4; however, I agree that an ext2 filesystem should not be considered reboot-safe. So we should either select a journaling filesystem or apply the same filesystem-recreation method we use on NVMe-enabled workers.
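The "recreate on every boot" approach could be sketched as a systemd oneshot unit along these lines (illustrative only; the unit name and device are assumptions, not the actual salt state):

```shell
# Hypothetical oneshot unit that recreates the ext2 filesystem before it is
# mounted; unit and device names here are illustrative.
cat > /etc/systemd/system/openqa-recreate-fs.service <<'EOF'
[Unit]
Description=Recreate /var/lib/openqa filesystem on boot
DefaultDependencies=no
Before=var-lib-openqa.mount

[Service]
Type=oneshot
ExecStart=/sbin/mkfs.ext2 -F /dev/sdb1

[Install]
WantedBy=local-fs.target
EOF
systemctl daemon-reload && systemctl enable openqa-recreate-fs.service
```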

#4 Updated by cdywan 8 months ago

  • Status changed from Workable to In Progress
  • Assignee set to cdywan

okurz wrote:

As discussed in chat: that ext2 filesystem is what is used for /var/lib/openqa. On other machines we recreate the filesystem on every boot. ext2 has always shown better performance than ext4; however, I agree that an ext2 filesystem should not be considered reboot-safe. So we should either select a journaling filesystem or apply the same filesystem-recreation method we use on NVMe-enabled workers.

I think re-creating the filesystem makes sense. It seems to work well enough on machines where we use it.

#5 Updated by nicksinger 8 months ago

  • Related to action #81058: [tracker-ticket] Power machines can't find installed OS. Automatic reboots disabled for now added

#6 Updated by cdywan 7 months ago

  • Subject changed from qa-power8-5-kvm is down since days to qa-power8-5-kvm has been down for days, use more robust filesystem setup

#7 Updated by cdywan 7 months ago

MR open. I still need to figure out how to deal with an existing filesystem; salt doesn't seem to like it yet (/dev/sda1 already mounted or mount point busy): https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/419

#8 Updated by cdywan 7 months ago

After the initial draft, where I tried to rely on salt to re-create the filesystem, I updated my MR to keep all of the logic within systemd units, since we want to detect the device and configuration dynamically in all cases. I then moved it to a proper script, recreate.sh, which can be run independently and is a lot more readable. (I tried to figure out the correct operator precedence and extend the original one-liner, but gave up after getting lost repeatedly.)
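As a rough idea of what such an independent script might look like (this is not the actual recreate.sh from the MR; the device-detection logic here is purely illustrative):

```shell
#!/bin/sh
# Illustrative outline only; the real recreate.sh lives in the MR linked above.
set -eu
# Find the device currently providing /var/lib/openqa, if any.
dev=$(findmnt -rno SOURCE /var/lib/openqa || true)
if [ -n "$dev" ]; then
    umount /var/lib/openqa        # cannot recreate a mounted filesystem
else
    dev=/dev/sdb1                 # assumed fallback device
fi
mkfs.ext2 -F "$dev"
mount "$dev" /var/lib/openqa
```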

It's also worth noting that this wasn't previously tested in GitLab CI; I added a fake /etc/fstab which salt expects to find when looking at mount points.

#9 Updated by cdywan 6 months ago

Brief update: the MR has since been reviewed (I was waiting on that even though we don't require it for salt repos), and I looked into deploying the changes on a subset of workers first, to confirm everything works. After twice coming back from the salt docs with weird error messages about deploying from a git branch, and after learning about git remotes, features and salt-ssh (which never copied my repo to the minion when I tried to use it), none of which we currently use, I'm looking at two options. The first:

  1. copy /etc/salt to ~
  2. tweak master to point at ~ rather than /srv/salt/
  3. salt -c .

Or if nothing else, temporarily disable salt somehow, pull my branch into the real branch, and deploy from there.

I think going forward I'll try and document it somewhere.

#10 Updated by okurz 6 months ago

I suggest the following on the worker where you want to try out something, e.g. qa-power8-5-kvm.qa: clone the state and pillar repos to /srv/salt and /srv/pillar respectively, then call salt-call --local -l error …. Afterwards delete /srv/salt and /srv/pillar again.
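Spelled out, that one-off masterless run might look like this (the pillar repo URL is assumed by analogy with the states repo; state.apply without arguments applies the highstate):

```shell
# One-off local salt run on the test worker; delete the clones afterwards.
git clone https://gitlab.suse.de/openqa/salt-states-openqa.git /srv/salt
git clone https://gitlab.suse.de/openqa/salt-pillars-openqa.git /srv/pillar
salt-call --local -l error state.apply   # --local: masterless, -l error: quiet log
rm -rf /srv/salt /srv/pillar             # clean up again afterwards
```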

#11 Updated by okurz 6 months ago

  • Due date set to 2021-02-02

#12 Updated by cdywan 6 months ago

  • Due date deleted (2021-02-02)
  • Status changed from In Progress to Feedback
  • Assignee deleted (cdywan)

Setting this to Feedback for now, as suggested in the Weekly. Also filed #88197 (and #88195) reflecting points made in that same conversation.

#13 Updated by okurz 6 months ago

  • Status changed from Feedback to Workable
  • Priority changed from High to Normal

I doubt a "Feedback" ticket without an assignee works. As the original issue still persists, I'm moving it back to "Workable". The problems you encountered are certainly obstacles and make salt harder for newcomers, but they're not showstoppers :)

#14 Updated by mkittler 5 months ago

  • Assignee set to mkittler

This ticket is just a more general phrasing of the problem I also encountered when investigating #88191. Hence I'm assigning myself to this ticket as well. I'm trying to re-use the existing (and still open) SR.

#15 Updated by mkittler 5 months ago

  • Assignee deleted (mkittler)

The existing SR is problematic in my opinion because it tries to change too much at a time. Besides, judging by the code I'm also not sure whether it will actually work (see https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/419#note_301037). Nevertheless, here is my rebased version in case we want to continue here later: https://gitlab.suse.de/mkittler/salt-states-openqa/-/merge_requests/new?merge_request%5Bsource_branch%5D=mount_ext2_recreate_generic

Being able to apply the code for the filesystem re-creation to all hosts is not necessary for solving #88191, so I'm unassigning here again.

#16 Updated by mkittler 4 months ago

I'm still waiting to see how well my changes for #88191 work out. Considering the problems I encountered, I would not expect more stable behavior from applying the same approach here as well. The RAID creation with mdadm just seems quite error-prone. So maybe for now we're better off just switching to e.g. ext4 here. The performance impact might be noticeable compared to ext2, but I'm using ext4 on my local system with a slow HDD and it is still OK when running a few jobs in parallel.
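Converting an existing worker would be a one-off operation roughly like this (a sketch; the device name and worker unit pattern are assumptions based on earlier comments in this ticket):

```shell
# One-off conversion to ext4; destroys existing data on the device.
systemctl stop 'openqa-worker@*'   # assumed worker unit pattern; stop jobs first
umount /var/lib/openqa
mkfs.ext4 -F /dev/sdb1             # journaling filesystem, reboot-safe unlike ext2
mount /dev/sdb1 /var/lib/openqa
```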

#17 Updated by okurz 4 months ago

Please see #19238#note-6 for my experiments regarding different filesystems for the openQA pool directory, with performance measurements. Once we find a setup that is at least as performant as the current one, we can easily switch to another filesystem.
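For a rough raw-throughput comparison in the style of those experiments, something like the following can be run against any candidate mount point (a sketch: a 64 MiB write with fdatasync so the result is not just page-cache speed):

```shell
# Write a 64 MiB test file to the target directory, report throughput, clean up.
target=${1:-/tmp}
dd if=/dev/zero of="$target/ddtest" bs=1M count=64 conv=fdatasync 2>&1 | tail -n 1
rm -f "$target/ddtest"
```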

#18 Updated by okurz 3 months ago

  • Priority changed from Normal to Low

#19 Updated by mkittler 3 months ago

About the "at least as performant" part: maybe one cannot easily beat a simple filesystem like ext2 in terms of raw I/O speed as measured via your tests with dd. However, what really counts is that there's no notable performance impact for our production workload. (That is of course much harder to measure; hence I've been sharing my experience with ext4.) And maybe it is even worth paying a small price to avoid such issues.

#20 Updated by okurz 3 months ago

I agree. I suggest that we simply replace the filesystem on one of our non-NVMe hosts like qa-power8-5-kvm and monitor that host in production for any significant impact.

#21 Updated by okurz 3 months ago

  • Target version changed from Ready to future
