action #60272
coordination #162539: [saga][epic] future ideas for version control features within openQA
coordination #162551: [epic] Extend needle version control handling - part 2
Make fetching custom git repos (e.g. needles) more efficient
Description
Cloning a needles repository can take a lot of time and traffic, even if we use --depth 1.
It could be made more efficient by having a local mirror/proxy repo.
Checking out branches could be done via git worktree. A worktree shares its .git directory with the original clone, so this would be faster and use less disk space.
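To illustrate the idea, here is a minimal, self-contained sketch using throwaway repositories (the paths and repo names are made up for this example, not openQA's actual layout). It shows that a worktree's .git is just a small file pointing back at the main clone, so the object store is shared instead of duplicated:

```shell
# Minimal sketch with throwaway repositories; paths are illustrative only.
set -e
tmp_wt=$(mktemp -d)
cd "$tmp_wt"

# Tiny stand-in for the needles repository, with an extra branch.
git init -q upstream
git -C upstream -c user.email=a@b -c user.name=t commit -q --allow-empty -m initial
git -C upstream branch topic

# One real clone, then a cheap additional checkout via worktree.
git clone -q upstream main-clone
git -C main-clone worktree add ../topic-checkout origin/topic

# In a worktree, .git is a small file pointing back to the main clone,
# so the object database is shared instead of copied.
cat topic-checkout/.git
```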
Updated by okurz about 5 years ago
- Category set to Feature requests
Do you mean to make the custom git repo cloning more efficient? Because the normal case is that there is just a single working copy on the webui host which is updated with git fetch and such. The content is then either provided as a shared mount point to workers or synced with rsync to each worker using the cache service.
Updated by tinita about 5 years ago
- Subject changed from Make fetching needles more efficient to Make fetching (custom) needles more efficient
okurz wrote:
Do you mean to make the custom git repo cloning more efficient?
Yes.
Updated by tinita about 5 years ago
- Subject changed from Make fetching (custom) needles more efficient to Make fetching custom git repos (e.g. needles) more efficient
Updated by tinita almost 5 years ago
Just tested fetching a PR:
cd os-autoinst-needles-opensuse
git fetch -f git@github.com:os-autoinst/os-autoinst-needles-opensuse.git "refs/pull/619/head:PR/619"
git worktree add ../PR-619 PR/619
This took less than 5 seconds.
Updated by okurz over 4 years ago
I have an idea that is simple to do in the meantime and provides another benefit: try to check out a git refspec from an already existing git working copy: https://github.com/os-autoinst/os-autoinst/pull/1358 . Not sure if this helps us that much within openQA, though. We commonly use caching, meaning that tests reside within the cache directory, common for all worker instances. We don't want to check out something in there because this would affect other instances. What we could try instead is to clone locally from cache to pool and then check out.
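A runnable sketch of that "clone locally from cache to pool, then check out" idea, using throwaway directories as stand-ins for openQA's cache and pool layout (the names here are illustrative, not actual openQA paths):

```shell
# Hypothetical sketch; cache/ and pool/1/ stand in for openQA's real layout.
set -e
tmp_pool=$(mktemp -d)
cd "$tmp_pool"
mkdir -p cache pool/1

# Stand-in for the shared cache copy of a test distribution (two commits).
git init -q cache/distri
git -C cache/distri -c user.email=a@b -c user.name=t commit -q --allow-empty -m v1
git -C cache/distri -c user.email=a@b -c user.name=t commit -q --allow-empty -m v2

# Each worker instance clones from the cache (a cheap local operation)
# and checks out whatever ref it needs without touching the shared copy.
git clone -q cache/distri pool/1/distri
git -C pool/1/distri checkout -q HEAD~1   # e.g. an older commit
git -C pool/1/distri log -1 --format=%s   # prints: v1
```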
Updated by tinita over 4 years ago
okurz wrote:
I have an idea that is simple to do in the meantime and provide another benefit: Try to checkout a git refspec from already existing git working copy
But the working copy would still need a git fetch
, right?
Updated by okurz over 4 years ago
tinita wrote:
But the working copy would still need a git fetch, right?
Not necessarily. I envision https://github.com/os-autoinst/os-autoinst/pull/1358 to be used when you want to use an older git commit within "master" from the same repo.
Updated by okurz over 4 years ago
I have looked into "git worktree" and I could not tell whether it provides any benefit over plain local git clones, which use hardlinks by default. I guess the problem for us is basically the same regardless of the approach: how to map remote repositories to local checkouts and have them available on workers as well. One approach I could think of: map any remote URL to a corresponding tree in the filesystem depending on repo and refspec, and always try to clone/fetch locally first before falling back to any remote operation, e.g.
- https://github.com/os-autoinst/os-autoinst-distri-opensuse#master -> /var/lib/openqa/share/tests/github.com/os-autoinst/os-autoinst-distri-opensuse/master
- https://github.com/okurz/os-autoinst-distri-opensuse#feature/foo -> /var/lib/openqa/share/tests/github.com/okurz/os-autoinst-distri-opensuse/feature/foo
- https://github.com/perlpunk/os-autoinst-distri-opensuse#02535deadbeef -> /var/lib/openqa/share/tests/github.com/perlpunk/os-autoinst-distri-opensuse/02535deadbeef
This would already help to avoid cloning again and again on workers and make it possible to show the corresponding source code. When someone triggers tests using another repo, fork or branch, we would additionally need to check whether a corresponding sibling exists, clone+fetch from there first as a "local cache", and only get what is needed from remote.
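The mapping described above could be sketched as a small shell function. Both the function name map_repo_path and the base directory are made up for this sketch, not existing openQA code:

```shell
# Hypothetical helper implementing the URL-to-path mapping described above;
# the function name and base directory are illustrative only.
map_repo_path() {
    url=$1 refspec=$2
    path=${url#*://}     # drop the scheme, keep host/org/repo
    path=${path%.git}    # tolerate a trailing .git
    printf '%s/%s/%s\n' /var/lib/openqa/share/tests "$path" "$refspec"
}

map_repo_path https://github.com/os-autoinst/os-autoinst-distri-opensuse master
# -> /var/lib/openqa/share/tests/github.com/os-autoinst/os-autoinst-distri-opensuse/master
```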
Another challenge in general on top of that is that currently workers either use …/share directly or a copy from the openQA cache which is shared by all worker instances. Somehow I have the feeling we are coming back to what I already thought some years ago when the "caching" was envisioned: we should just use git to clone from …/share into each worker's pool dir.
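As a side note on the "hardlinks by default" point from above: this can be checked directly with a throwaway repository. git clone from a plain local path hardlinks the files under .git/objects when source and destination are on the same filesystem, so the clone costs almost no extra disk space (stat -c is the GNU coreutils variant):

```shell
# Throwaway demo: a local-path clone shares object files via hardlinks.
set -e
tmp_hl=$(mktemp -d)
cd "$tmp_hl"
git init -q repo
git -C repo -c user.email=a@b -c user.name=t commit -q --allow-empty -m initial

# Cloning from a plain local path hardlinks .git/objects by default.
git clone -q repo hardlinked

obj=$(find hardlinked/.git/objects -type f | head -n 1)
stat -c %h "$obj"   # link count > 1: the file is shared with "repo"
```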
Updated by tinita over 4 years ago
One advantage of worktrees is that a list of them is kept in the repo, so you get an automatic overview of how many checkouts exist.
Also, having manual clones would still mean doing a fetch in the main clone first, and then another fetch in the local clone.
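That bookkeeping can be seen with git worktree list; a tiny sketch with a throwaway repository (names are illustrative):

```shell
# Throwaway demo: the main clone records all of its worktrees.
set -e
tmp_list=$(mktemp -d)
cd "$tmp_list"
git init -q repo
git -C repo -c user.email=a@b -c user.name=t commit -q --allow-empty -m initial

# Two extra checkouts; each gets registered in the main repo.
git -C repo worktree add ../wt-a
git -C repo worktree add ../wt-b

# One line per checkout: the main worktree plus the two added ones.
git -C repo worktree list
```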
I experimented a bit with a normal clone and a worktree:
% git clone git@github.com:os-autoinst/os-autoinst-needles-opensuse --depth 1
% cd os-autoinst-needles-opensuse
% git fetch -f git@github.com:os-autoinst/os-autoinst-needles-opensuse.git "refs/pull/651/head:PR/651" --depth 1
% git fetch -f git@github.com:os-autoinst/os-autoinst-needles-opensuse.git "refs/pull/652/head:PR/652" --depth 1
% git worktree add ../PR-652 PR/652
% cd ..
% git clone os-autoinst-needles-opensuse/.git -b PR/651 PR-651
% du -hs *
1.7G PR-651
863M PR-652
1.7G os-autoinst-needles-opensuse
Also, the worktree command was a bit faster (4s vs. 15s).
I'll repeat the test with a full clone.
Edit: OK, if I have a full original clone, then the local clones have smaller sizes as well.
2.7G PR-649 # manual local clone
872M PR-651 # manual local clone
863M PR-652 # worktree
870M os-autoinst-needles-opensuse
Updated by tinita over 4 years ago
okurz wrote:
Map any remote URLs to corresponding trees in the filesystem depending on repo and refspec and always try to clone/fetch locally first before reverting to any remote operation, e.g.
...
Somehow I have the feeling we are coming back to what I originally already thought some years ago when the "caching" was envisioned: We should just use git to clone from …/share into each worker's pool dir.
Regardless of using worktree or not, yes, I think that's necessary.
Updated by tinita over 4 years ago
A disadvantage of worktree (or a shared .git folder in general) is that if you remove the original repo folder, the git info in the clone/worktree is lost.
Updated by okurz over 4 years ago
- Priority changed from Normal to Low
- Target version set to Ready
By now we have seen that cloning from GitHub all the time is actually not a problem so far. We should still follow up on this, but effectively it has lower priority here.