action #80482

closed

openQA Project - coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

qa-power8-5-kvm has been down for days, use more robust filesystem setup

Added by okurz over 3 years ago. Updated over 1 year ago.

Status: Resolved
Priority: Low
Assignee:
Category: -
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:

Related issues 5 (0 open, 5 closed)

Related to openQA Infrastructure - action #81058: [tracker-ticket] Power machines can't find installed OS. Automatic reboots disabled for now (Resolved, okurz, 2020-12-15 to 2021-04-16)
Related to openQA Infrastructure - action #116437: Recover qa-power8-5 size:M (Resolved, mkittler)
Related to openQA Infrastructure - action #115226: Use ext4 (instead of ext2) for /var/lib/openqa on qa-power8 workers (Resolved, mkittler, 2022-08-11)
Copied from openQA Infrastructure - action #78218: [openQA][worker] Almost all openQA workers become offline (Resolved, okurz, 2020-11-19)
Copied to openQA Infrastructure - action #108740: qa-power8-5-kvm minions alert is heart-broken 💔️ (Rejected, okurz, 2022-03-22)
Actions #1

Updated by okurz over 3 years ago

  • Copied from action #78218: [openQA][worker] Almost all openQA workers become offline added
Actions #2

Updated by ldevulder over 3 years ago

I restarted the server and found some ext2 errors on the /var/lib/openqa filesystem (on the /dev/sdb1 block device), which were fixed after a full fsck on it.
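
For reference, a full check of that filesystem roughly corresponds to the following (device and mount point taken from this comment; run with the filesystem unmounted):

    umount /var/lib/openqa        # the fs must not be mounted during the check
    fsck.ext2 -f -y /dev/sdb1     # -f forces a full check, -y auto-answers repair prompts
    mount /var/lib/openqa         # remount via the existing /etc/fstab entry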

Actions #3

Updated by okurz over 3 years ago

As discussed in chat: That ext2 fs is what is used for /var/lib/openqa. On other machines we recreate the filesystem on every boot. ext2 has always shown better performance than ext4; however, I agree that an ext2 fs should not be considered reboot-safe. So we should either select a journaling filesystem or apply the same filesystem recreation method we use on NVMe-enabled workers.
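
For illustration, the "recreate on every boot" approach boils down to something like the following, run before /var/lib/openqa is mounted (device and filesystem type are assumptions taken from this ticket, not the actual salt state):

    # hypothetical boot-time recreation, e.g. run from a systemd oneshot ordered Before=var-lib-openqa.mount
    mkfs.ext2 -F /dev/sdb1        # -F: don't ask for confirmation; the pool data is disposable
    mount /var/lib/openqa         # mount via the existing /etc/fstab entry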

Actions #4

Updated by livdywan over 3 years ago

  • Status changed from Workable to In Progress
  • Assignee set to livdywan

okurz wrote:

As discussed in chat: That ext2 fs is what is used for /var/lib/openqa. On other machines we recreate the filesystem on every boot. ext2 has always shown better performance than ext4; however, I agree that an ext2 fs should not be considered reboot-safe. So we should either select a journaling filesystem or apply the same filesystem recreation method we use on NVMe-enabled workers.

I think re-creating the filesystem makes sense. It seems to work well enough on machines where we use it.

Actions #5

Updated by nicksinger over 3 years ago

  • Related to action #81058: [tracker-ticket] Power machines can't find installed OS. Automatic reboots disabled for now added
Actions #6

Updated by livdywan over 3 years ago

  • Subject changed from qa-power8-5-kvm is down since days to qa-power8-5-kvm has been down for days, use more robust filesystem setup
Actions #7

Updated by livdywan over 3 years ago

MR open, but I still need to check how to deal with an existing fs; salt doesn't seem to like it yet ("/dev/sda1 already mounted or mount point busy"): https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/419

Actions #8

Updated by livdywan over 3 years ago

After the initial draft, where I tried to rely on salt to re-create the filesystem, I updated my MR to keep all of the logic within systemd units, since we want to detect the device and configuration dynamically in all cases. I actually moved it to a proper script, recreate.sh, which can be run independently and is a lot more readable; I tried to figure out the correct operator priority and extend it, but gave up after getting lost repeatedly.

It's also worth noting that this was not previously tested in GitLab CI, and I added a fake /etc/fstab, which salt expects to find when looking at mount points.
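
A rough sketch of what such a recreate.sh could look like; the script name is from this comment, but the detection logic below (resolving the device from the fstab entry) is illustrative and not the actual MR content:

    #!/bin/sh -e
    # recreate.sh: wipe and recreate the /var/lib/openqa filesystem at boot
    mountpoint=/var/lib/openqa
    # detect the configured device dynamically instead of hard-coding it
    device=$(awk -v mp="$mountpoint" '$2 == mp {print $1}' /etc/fstab)
    [ -n "$device" ] || { echo "no fstab entry for $mountpoint" >&2; exit 1; }
    umount "$mountpoint" 2>/dev/null || true   # avoid "already mounted or mount point busy"
    mkfs.ext2 -F "$device"
    mount "$mountpoint"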

Actions #9

Updated by livdywan over 3 years ago

Brief update: the MR has since been reviewed (I was waiting on that even though we don't require reviews for salt repos), and I looked into deploying the changes on a subset of workers first to confirm everything works. After twice coming back from the salt docs with weird error messages about deploying from a git branch, and after learning about git remotes, features and salt-ssh (which never copied my repo to the minion when I tried to use it), none of which we currently use, I'm looking at two options. The first:

  1. copy /etc/salt to ~
  2. tweak master to point at ~ rather than /srv/salt/
  3. salt -c .

The second, if nothing else works: temporarily disable salt somehow, pull my branch into the real branch, and deploy from there.

I think going forward I'll try and document it somewhere.

Actions #10

Updated by okurz over 3 years ago

I suggest the following on the worker where you want to try something out, e.g. qa-power8-5-kvm.qa: clone the state and pillar repos to /srv/salt and /srv/pillar respectively, then call salt-call --local -l error …. Afterwards delete /srv/salt and /srv/pillar again.
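
As a rough transcript of that suggestion (the pillar repo URL and the state.apply target are assumptions; the "…" above is deliberately left open):

    # on the worker under test, e.g. qa-power8-5-kvm.qa
    git clone https://gitlab.suse.de/openqa/salt-states-openqa /srv/salt      # state repo
    git clone https://gitlab.suse.de/openqa/salt-pillars-openqa /srv/pillar   # pillar repo (URL assumed)
    salt-call --local -l error state.apply    # state.apply chosen for illustration
    rm -rf /srv/salt /srv/pillar              # clean up afterwards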

Actions #11

Updated by okurz over 3 years ago

  • Due date set to 2021-02-02
Actions #12

Updated by livdywan about 3 years ago

  • Due date deleted (2021-02-02)
  • Status changed from In Progress to Feedback
  • Assignee deleted (livdywan)

Setting this to Feedback for now, as suggested in the Weekly. Also filed #88197 (and #88195) reflecting points made in that same conversation.

Actions #13

Updated by okurz about 3 years ago

  • Status changed from Feedback to Workable
  • Priority changed from High to Normal

I doubt a "Feedback" ticket without an assignee works. As the original issue still persists, I'm moving it back to "Workable". The problems you encountered are certainly obstacles and make it harder for newcomers to salt, but they are not showstoppers :)

Actions #14

Updated by mkittler about 3 years ago

  • Assignee set to mkittler

This ticket is just a more general phrasing of the problem I've also encountered when investigating #88191. Hence I'm assigning myself to this ticket as well. I'm trying to re-use the existing (and still open) SR.

Actions #15

Updated by mkittler about 3 years ago

  • Assignee deleted (mkittler)

The existing SR is problematic in my opinion because it tries to change too much at a time. Besides, judging by the code, I'm also not sure whether it will actually work (see https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/419#note_301037). Nevertheless, here is my rebased version in case we want to continue here later: https://gitlab.suse.de/mkittler/salt-states-openqa/-/merge_requests/new?merge_request%5Bsource_branch%5D=mount_ext2_recreate_generic

Being able to apply the code for the filesystem re-creation to all hosts is not necessary for solving #88191, so I'm unassigning here again.

Actions #16

Updated by mkittler about 3 years ago

I'm still waiting to see how well my changes for #88191 work out. Considering the problems I encountered, I would not expect more stable behavior from applying the same approach here. The RAID creation with mdadm just seems quite error-prone. So maybe we're better off for now by just switching to e.g. ext4 here. The performance impact might be notable compared to ext2, but I'm using ext4 on my local system with a slow HDD and it is still ok when running a few jobs in parallel.
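
If we go that route, the switch itself would be a one-off reformat of the pool device, roughly like this (device assumed as above; the sed line is a simplistic placeholder for updating the fstab entry):

    umount /var/lib/openqa
    mkfs.ext4 -F /dev/sdb1            # journaling filesystem instead of ext2
    sed -i 's/ext2/ext4/' /etc/fstab  # keep the fstab fs type in sync (assumes only this entry mentions ext2)
    mount /var/lib/openqa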

Actions #17

Updated by okurz about 3 years ago

Please see #19238#note-6 for my experiments regarding different filesystems for the openQA pool directory, with performance measurements. When we find a setup that is at least as performant as the current one, we can easily switch to another filesystem.

Actions #18

Updated by okurz about 3 years ago

  • Priority changed from Normal to Low
Actions #19

Updated by mkittler about 3 years ago

About the "at least as performant" part: Maybe one can not easily beat a simple filesystem like ext2 in terms of raw I/O speed as measured via your tests with dd. However, what really counts is that there's no notable performance impact for our production workload. (This is of course much harder to measure. Hence I've been sharing my experience with using ext4.) And maybe it is even worth paying a small price for avoiding such issues.
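
For context, the kind of raw sequential-write check presumably meant here looks roughly like this (file path and size are arbitrary):

    # write 4 GiB with direct I/O and a final fsync to bypass the page cache
    dd if=/dev/zero of=/var/lib/openqa/dd-test bs=1M count=4096 oflag=direct conv=fsync
    rm /var/lib/openqa/dd-test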

Actions #20

Updated by okurz about 3 years ago

I agree. I suggest we simply replace the filesystem on one of our non-NVMe hosts like qa-power8-5-kvm and monitor on that host in production whether there is any significant impact.

Actions #21

Updated by okurz almost 3 years ago

  • Target version changed from Ready to future
Actions #22

Updated by livdywan about 2 years ago

  • Copied to action #108740: qa-power8-5-kvm minions alert is heart-broken 💔️ added
Actions #23

Updated by okurz over 1 year ago

Actions #24

Updated by okurz over 1 year ago

  • Related to action #115226: Use ext4 (instead of ext2) for /var/lib/openqa on qa-power8 workers added
Actions #25

Updated by okurz over 1 year ago

  • Status changed from Workable to Resolved
  • Assignee set to okurz

#115226 fixed this
