action #80482
closed
openQA Project (public) - coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes
qa-power8-5-kvm has been down for days, use more robust filesystem setup
Added by okurz about 4 years ago. Updated over 2 years ago.
Updated by okurz about 4 years ago
- Copied from action #78218: [openQA][worker] Almost all openQA workers become offline added
Updated by ldevulder about 4 years ago
I restarted the server and found some ext2 errors on the /var/lib/openqa filesystem (on /dev/sdb1 block device), fixed after a full fsck on it.
Updated by okurz about 4 years ago
As discussed in chat: That ext2 fs is what is used for /var/lib/openqa. On other machines we recreate the filesystem on every boot. ext2 always showed better performance than ext4; however, I agree that an ext2 fs should not be considered reboot-safe. So we should either select a journaling filesystem or apply the same filesystem recreation method we use on nvme-enabled workers.
Updated by livdywan about 4 years ago
- Status changed from Workable to In Progress
- Assignee set to livdywan
okurz wrote:
As discussed in chat: That ext2 fs is what is used for /var/lib/openqa. On other machines we recreate the filesystem on every boot. ext2 always showed better performance than ext4; however, I agree that an ext2 fs should not be considered reboot-safe. So we should either select a journaling filesystem or apply the same filesystem recreation method we use on nvme-enabled workers.
I think re-creating the filesystem makes sense. It seems to work well enough on machines where we use it.
Updated by nicksinger about 4 years ago
- Related to action #81058: [tracker-ticket] Power machines can't find installed OS. Automatic reboots disabled for now added
Updated by livdywan about 4 years ago
- Subject changed from qa-power8-5-kvm is down since days to qa-power8-5-kvm has been down for days, use more robust filesystem setup
Updated by livdywan about 4 years ago
MR open. I still need to check how to deal with the existing fs; salt doesn't seem to like it yet: "/dev/sda1 already mounted or mount point busy"
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/419
Updated by livdywan about 4 years ago
After the initial draft where I tried to rely on salt to re-create the filesystem, I updated my MR to keep all of the logic within systemd units since we want to detect the device and configuration dynamically in all cases. I actually moved it to a proper script, recreate.sh, which can be run independently and is a lot more readable. I had tried to figure out the correct operator priority in the existing logic and extend it, but gave up after getting lost repeatedly.
It's also worth noting that this didn't use to be tested in GitLab CI, and I added a fake /etc/fstab, which salt expects to find when looking at mountpoints.
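For readers without access to the MR, the general idea of re-creating the filesystem before mounting it could be sketched roughly as below. This is not the recreate.sh from the MR: the variable names DEVICE, MOUNTPOINT and DRY_RUN, the helper functions, and the default device are assumptions for illustration only. The sketch defaults to a dry run that only prints the commands, because mkfs destroys everything on the device.

```shell
#!/bin/sh
# Hypothetical sketch, NOT the recreate.sh from the MR.
# DEVICE, MOUNTPOINT and DRY_RUN are illustrative names and defaults.
DEVICE="${DEVICE:-/dev/sdb1}"
MOUNTPOINT="${MOUNTPOINT:-/var/lib/openqa}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

recreate_fs() {
    # Unmount first if something is mounted there, otherwise mkfs and
    # mount would complain about a busy device or mount point.
    if mountpoint -q "$MOUNTPOINT"; then
        run umount "$MOUNTPOINT" || return 1
    fi
    # -F skips the interactive confirmation; all data on DEVICE is lost.
    run mkfs.ext2 -F "$DEVICE" || return 1
    run mount "$DEVICE" "$MOUNTPOINT"
}

recreate_fs
```

Running it as is only prints what would be done; setting DRY_RUN=0 executes the commands and wipes the device, so that should only happen on a worker whose /var/lib/openqa content is disposable.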
Updated by livdywan about 4 years ago
Brief update: the MR has since been reviewed (I was waiting on that even though we don't require it for salt repos), and I looked into deploying the changes on a subset of workers first to confirm everything works. Twice I came back from reading salt docs with weird error messages on how to deploy from a git branch, and I learned about git remotes, features and salt-ssh (which never copied my repo to the minion when I tried to use it), none of which we currently use. I'm now looking at two options:
- copy /etc/salt to ~
- tweak master to point at ~ rather than /srv/salt/
salt -c .
Or if nothing else, temporarily disable salt somehow, pull my branch into the real branch, and deploy from there.
I think going forward I'll try and document it somewhere.
Updated by okurz about 4 years ago
I suggest the following on the worker where you want to try out something, e.g. qa-power8-5-kvm.qa: clone state+pillar repo to /srv/salt and /srv/pillar respectively, then call salt-call --local -l error … . Afterwards delete /srv/salt and /srv/pillar again.
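The procedure suggested above could be wrapped roughly like this. It is only a sketch: the function name, the STATES_DIR/PILLAR_DIR/DRY_RUN variables and the dry-run helper are assumptions, the repository URLs have to be supplied by the caller, and the arguments that were elided after "-l error" in the comment are passed through unchanged rather than guessed here.

```shell
#!/bin/sh
# Hypothetical wrapper around the suggested manual steps.
# Variable and function names are illustrative, not from any repo.
STATES_DIR="${STATES_DIR:-/srv/salt}"
PILLAR_DIR="${PILLAR_DIR:-/srv/pillar}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# usage: try_salt_locally STATES_REPO PILLAR_REPO [salt-call arguments...]
try_salt_locally() {
    if [ "$#" -lt 2 ]; then
        echo "usage: try_salt_locally STATES_REPO PILLAR_REPO [args...]" >&2
        return 2
    fi
    # Refuse to touch directories that already exist, since they are
    # deleted again at the end.
    if [ -e "$STATES_DIR" ] || [ -e "$PILLAR_DIR" ]; then
        echo "$STATES_DIR or $PILLAR_DIR already exists, aborting" >&2
        return 1
    fi
    states_repo="$1"
    pillar_repo="$2"
    shift 2
    if run git clone "$states_repo" "$STATES_DIR" &&
       run git clone "$pillar_repo" "$PILLAR_DIR"; then
        # Remaining arguments are whatever the caller wants salt-call to do.
        run salt-call --local -l error "$@"
        rc=$?
    else
        rc=1
    fi
    # Clean up in any case so the worker is left as it was.
    run rm -rf "$STATES_DIR" "$PILLAR_DIR"
    return "$rc"
}
```

With the default DRY_RUN=1 the function only prints the four commands, which makes it easy to double-check the paths before running anything on a production worker.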
Updated by livdywan about 4 years ago
- Due date deleted (2021-02-02)
- Status changed from In Progress to Feedback
- Assignee deleted (livdywan)
Updated by okurz almost 4 years ago
- Status changed from Feedback to Workable
- Priority changed from High to Normal
I doubt a "Feedback" ticket without assignee works. As the original issue still persists, I'm moving it back to "Workable". The problems you encountered are certainly obstacles and make it harder for newcomers to salt, but they are not showstoppers :)
Updated by mkittler almost 4 years ago
- Assignee set to mkittler
This ticket is just a more general phrasing for the problem I've also encountered when investigating #88191. Hence I'm assigning myself to this ticket as well. I'm trying to re-use the existing (and still open) SR.
Updated by mkittler almost 4 years ago
- Assignee deleted (mkittler)
The existing SR is problematic in my opinion because it tries to change too much at a time. Besides, judging by the code I'm also not sure whether it will actually work (see https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/419#note_301037). Nevertheless, here is my rebased version in case we want to continue here later: https://gitlab.suse.de/mkittler/salt-states-openqa/-/merge_requests/new?merge_request%5Bsource_branch%5D=mount_ext2_recreate_generic
Being able to apply the code for the file system re-creation to all hosts is not necessary for solving #88191 so I'm unassigning here again.
Updated by mkittler almost 4 years ago
I'm still waiting to see how well my changes for #88191 work out. Considering the problems I encountered, I would not expect more stable behavior from applying the approach here as well. The RAID creation with mdadm just seems quite error-prone. So maybe we're better off for now just switching to e.g. ext4 here. The performance impact might be notable compared to ext2, but I'm using ext4 on my local system with a slow HDD and it is still ok when running a few jobs in parallel.
Updated by okurz almost 4 years ago
Please see #19238#note-6 for my experiments regarding different filesystems for the openQA pool directory with performance measurements. When we can find a setup that is at least as performant as the current one we can easily switch to another filesystem.
Updated by mkittler almost 4 years ago
About the "at least as performant" part: Maybe one can not easily beat a simple filesystem like ext2 in terms of raw I/O speed as measured via your tests with dd. However, what really counts is that there's no notable performance impact for our production workload. (This is of course much harder to measure. Hence I've been sharing my experience with using ext4.) And maybe it is even worth paying a small price for avoiding such issues.
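For reference, a raw sequential write check with dd on a given directory might look like the sketch below. This is not the measurement from #19238; the function name, the test file name and the default size are assumptions, and conv=fsync is assumed to be supported by the installed dd (it is in GNU coreutils). As said above, such a number only reflects raw I/O speed, not the production workload.

```shell
#!/bin/sh
# Hypothetical helper, not the benchmark used in #19238.
# usage: dd_write_test DIRECTORY [SIZE_IN_MIB]
dd_write_test() {
    dir="$1"
    count_mib="${2:-64}"
    file="$dir/dd-test.$$"
    # Write zeros in 1 MiB blocks and fsync at the end so the page cache
    # does not hide the real write speed. dd reports its statistics on
    # stderr; only the final summary line is kept.
    dd if=/dev/zero of="$file" bs=1048576 count="$count_mib" conv=fsync 2>&1 | tail -n 1
    # Remove the temporary file again.
    rm -f "$file"
}
```

Calling it as dd_write_test /var/lib/openqa 1024 on a host before and after a filesystem change would give a rough comparison of sequential write throughput, nothing more.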
Updated by okurz almost 4 years ago
I agree. I suggest that we simply replace the filesystem on one of our non-NVMe hosts like qa-power8-5-kvm and monitor that host in production to see if there is any significant impact.
Updated by okurz almost 4 years ago
- Target version changed from Ready to future
Updated by livdywan almost 3 years ago
- Copied to action #108740: qa-power8-5-kvm minions alert is heart-broken 💔️ added
Updated by okurz over 2 years ago
- Related to action #116437: Recover qa-power8-5 size:M added
Updated by okurz over 2 years ago
- Related to action #115226: Use ext4 (instead of ext2) for /var/lib/openqa on qa-power8 workers added
Updated by okurz over 2 years ago
- Status changed from Workable to Resolved
- Assignee set to okurz
#115226 fixed this