action #80482

openQA Project - coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

qa-power8-5-kvm has been down for days, use more robust filesystem setup

Added by okurz 8 months ago. Updated 3 months ago.

Status: Workable
Priority: Low
Assignee: -
Target version: -
Start date: -
Due date: -
% Done: 0%
Estimated time: -
Related issues

Related to openQA Infrastructure - action #81058: [tracker-ticket] Power machines can't find installed OS. Automatic reboots disabled for now (Resolved, 2020-12-15 to 2021-04-16)

Copied from openQA Infrastructure - action #78218: [openQA][worker] Almost all openQA workers become offline (Resolved, 2020-11-19)

History

#1 Updated by okurz 8 months ago

  • Copied from action #78218: [openQA][worker] Almost all openQA workers become offline added

#2 Updated by ldevulder 8 months ago

I restarted the server and found some ext2 errors on the /var/lib/openqa filesystem (on the /dev/sdb1 block device); they were fixed after a full fsck on it.
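For reference, a repair along those lines would look roughly like this (a sketch only; the device name /dev/sdb1 is taken from the comment above, and fsck must run on an unmounted filesystem):

```shell
# Sketch of the repair described above; adjust the device name for your host.
umount /var/lib/openqa            # fsck must not run on a mounted filesystem
fsck.ext2 -f -y /dev/sdb1         # -f forces a full check, -y auto-confirms fixes
mount /dev/sdb1 /var/lib/openqa
```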

#3 Updated by okurz 8 months ago

As discussed in chat: that ext2 filesystem is what is used for /var/lib/openqa. On other machines we recreate the filesystem on every boot. ext2 has always shown better performance than ext4; however, I agree that an ext2 filesystem should not be considered reboot-safe. So we should either select a journaling filesystem or apply the same filesystem-recreation method we use on NVMe-enabled workers.
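The "recreate on every boot" approach could be sketched as a systemd oneshot unit along these lines (illustrative only; the unit name and device are assumptions, not the actual salt state):

```shell
# Hypothetical oneshot unit that recreates the ext2 filesystem before it is
# mounted; unit and device names here are illustrative.
cat > /etc/systemd/system/openqa-recreate-fs.service <<'EOF'
[Unit]
Description=Recreate /var/lib/openqa filesystem on boot
DefaultDependencies=no
Before=var-lib-openqa.mount

[Service]
Type=oneshot
ExecStart=/sbin/mkfs.ext2 -F /dev/sdb1

[Install]
WantedBy=local-fs.target
EOF
systemctl daemon-reload && systemctl enable openqa-recreate-fs.service
```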

#4 Updated by cdywan 8 months ago

  • Status changed from Workable to In Progress
  • Assignee set to cdywan

okurz wrote:

As discussed in chat: that ext2 filesystem is what is used for /var/lib/openqa. On other machines we recreate the filesystem on every boot. ext2 has always shown better performance than ext4; however, I agree that an ext2 filesystem should not be considered reboot-safe. So we should either select a journaling filesystem or apply the same filesystem-recreation method we use on NVMe-enabled workers.

I think re-creating the filesystem makes sense. It seems to work well enough on machines where we use it.

#5 Updated by nicksinger 8 months ago

  • Related to action #81058: [tracker-ticket] Power machines can't find installed OS. Automatic reboots disabled for now added

#6 Updated by cdywan 7 months ago

  • Subject changed from qa-power8-5-kvm is down since days to qa-power8-5-kvm has been down for days, use more robust filesystem setup

#7 Updated by cdywan 7 months ago

MR open. I still need to figure out how to deal with an existing filesystem; salt doesn't seem to like it yet (/dev/sda1 already mounted or mount point busy): https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/419

#8 Updated by cdywan 7 months ago

After the initial draft, where I tried to rely on salt to re-create the filesystem, I updated my MR to keep all of the logic within systemd units, since we want to detect the device and configuration dynamically in all cases. I then moved it to a proper script, recreate.sh, which can be run independently and is a lot more readable. (I tried to figure out the correct operator precedence and extend the original one-liner, but gave up after getting lost repeatedly.)
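As a rough idea of what such an independent script might look like (this is not the actual recreate.sh from the MR; the device-detection logic here is purely illustrative):

```shell
#!/bin/sh
# Illustrative outline only; the real recreate.sh lives in the MR linked above.
set -eu
# Find the device currently providing /var/lib/openqa, if any.
dev=$(findmnt -rno SOURCE /var/lib/openqa || true)
if [ -n "$dev" ]; then
    umount /var/lib/openqa        # cannot recreate a mounted filesystem
else
    dev=/dev/sdb1                 # assumed fallback device
fi
mkfs.ext2 -F "$dev"
mount "$dev" /var/lib/openqa
```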

It's also worth noting that this wasn't previously tested in GitLab CI; I added a fake /etc/fstab which salt expects to find when looking at mount points.

#9 Updated by cdywan 6 months ago

Brief update: the MR has since been reviewed (I was waiting on that even though we don't require it for salt repos), and I looked into deploying the changes on a subset of workers first, to confirm everything works. After twice coming back from the salt docs with weird error messages about deploying from a git branch, and after learning about git remotes, features and salt-ssh (which never copied my repo to the minion when I tried to use it), none of which we currently use, I'm looking at two options. The first:

  1. copy /etc/salt to ~
  2. tweak master to point at ~ rather than /srv/salt/
  3. salt -c .

Or if nothing else, temporarily disable salt somehow, pull my branch into the real branch, and deploy from there.

I think going forward I'll try and document it somewhere.

#10 Updated by okurz 6 months ago

I suggest the following on the worker where you want to try out something, e.g. qa-power8-5-kvm.qa: clone the state and pillar repos to /srv/salt and /srv/pillar respectively, then call salt-call --local -l error …. Afterwards delete /srv/salt and /srv/pillar again.
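Spelled out, that one-off masterless run might look like this (the pillar repo URL is assumed by analogy with the states repo; state.apply without arguments applies the highstate):

```shell
# One-off local salt run on the test worker; delete the clones afterwards.
git clone https://gitlab.suse.de/openqa/salt-states-openqa.git /srv/salt
git clone https://gitlab.suse.de/openqa/salt-pillars-openqa.git /srv/pillar
salt-call --local -l error state.apply   # --local: masterless, -l error: quiet log
rm -rf /srv/salt /srv/pillar             # clean up again afterwards
```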

#11 Updated by okurz 6 months ago

  • Due date set to 2021-02-02

#12 Updated by cdywan 6 months ago

  • Due date deleted (2021-02-02)
  • Status changed from In Progress to Feedback
  • Assignee deleted (cdywan)

Setting this to Feedback for now, as suggested in the Weekly. Also filed #88197 (and #88195) reflecting points made in that same conversation.

#13 Updated by okurz 6 months ago

  • Status changed from Feedback to Workable
  • Priority changed from High to Normal

I doubt a "Feedback" ticket without an assignee works. As the original issue still persists, I'm moving it back to "Workable". The problems you encountered are certainly obstacles and make salt harder for newcomers, but they're not showstoppers :)

#14 Updated by mkittler 5 months ago

  • Assignee set to mkittler

This ticket is just a more general phrasing of the problem I also encountered when investigating #88191. Hence I'm assigning myself to this ticket as well. I'm trying to re-use the existing (and still open) SR.

#15 Updated by mkittler 5 months ago

  • Assignee deleted (mkittler)

The existing SR is problematic in my opinion because it tries to change too much at a time. Besides, judging by the code I'm also not sure whether it will actually work (see https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/419#note_301037). Nevertheless, here is my rebased version in case we want to continue here later: https://gitlab.suse.de/mkittler/salt-states-openqa/-/merge_requests/new?merge_request%5Bsource_branch%5D=mount_ext2_recreate_generic

Being able to apply the code for the filesystem re-creation to all hosts is not necessary for solving #88191, so I'm unassigning here again.

#16 Updated by mkittler 4 months ago

I'm still waiting to see how well my changes for #88191 work out. Considering the problems I encountered, I would not expect more stable behavior from applying the same approach here as well. The RAID creation with mdadm just seems quite error-prone. So maybe for now we're better off just switching to e.g. ext4 here. The performance impact might be noticeable compared to ext2, but I'm using ext4 on my local system with a slow HDD and it is still OK when running a few jobs in parallel.
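Converting an existing worker would be a one-off operation roughly like this (a sketch; the device name and worker unit pattern are assumptions based on earlier comments in this ticket):

```shell
# One-off conversion to ext4; destroys existing data on the device.
systemctl stop 'openqa-worker@*'   # assumed worker unit pattern; stop jobs first
umount /var/lib/openqa
mkfs.ext4 -F /dev/sdb1             # journaling filesystem, reboot-safe unlike ext2
mount /dev/sdb1 /var/lib/openqa
```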

#17 Updated by okurz 4 months ago

Please see #19238#note-6 for my experiments regarding different filesystems for the openQA pool directory, with performance measurements. Once we find a setup that is at least as performant as the current one, we can easily switch to another filesystem.
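For a rough raw-throughput comparison in the style of those experiments, something like the following can be run against any candidate mount point (a sketch: a 64 MiB write with fdatasync so the result is not just page-cache speed):

```shell
# Write a 64 MiB test file to the target directory, report throughput, clean up.
target=${1:-/tmp}
dd if=/dev/zero of="$target/ddtest" bs=1M count=64 conv=fdatasync 2>&1 | tail -n 1
rm -f "$target/ddtest"
```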

#18 Updated by okurz 3 months ago

  • Priority changed from Normal to Low

#19 Updated by mkittler 3 months ago

About the "at least as performant" part: maybe one cannot easily beat a simple filesystem like ext2 in terms of raw I/O speed as measured via your tests with dd. However, what really counts is that there's no notable performance impact for our production workload. (That is of course much harder to measure; hence I've been sharing my experience with ext4.) And maybe it is even worth paying a small price to avoid such issues.

#20 Updated by okurz 3 months ago

I agree. I suggest that we simply replace the filesystem on one of our non-NVMe hosts like qa-power8-5-kvm and monitor that host in production for any significant impact.

#21 Updated by okurz 3 months ago

  • Target version changed from Ready to future
