Git and large files

Posted on 2021-12-29

Git is a cornerstone of software development nowadays; it has become the de facto version control system.

Its interface is a bit complex to work with, but a lot of tooling has been developed over the years to lessen the pain of dealing with it.

One shortcoming of Git though (and of version control systems in general) is dealing with huge binary files. These files are usually media assets that are not meant to be diffable, but they belong to the project nonetheless.

There are different ways to deal with these files:

1. treat them like regular text files

This is the easiest solution: do nothing special. It works and you keep a clean history. However, as you modify your assets, the repository will grow and become slower to clone in your CI pipeline. It will also put more load on your Git server.

2. keep them out of your repository

Out of the repository, out of trouble! If you keep your large assets in a separate location (Dropbox for instance), your repository will stay light. But now you need to keep that external storage in sync with your repository. Most of the time, only the latest version is kept around, making it impossible to inspect an older revision with the appropriate assets.

3. store a pointer to external storage

As a compromise, you can store a pointer to external storage in your repository. Every time you check out a specific revision, you fetch the corresponding data from external storage and inject it into the project.

Solution 3 is the most convenient: we keep the regular Git workflow and move the burden of hosting large files out of Git itself.
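To make the pointer idea concrete, here is a minimal sketch of how such a mechanism could work. It is not how any particular tool does it: the function names, the "sha256:" pointer text and the use of a local directory as "external storage" are all assumptions made for illustration.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def store_asset(asset: Path, storage: Path) -> str:
    """Copy the asset into external storage under its SHA-256 and return pointer text.

    The pointer text is what would be committed instead of the real file.
    (Hypothetical format, chosen for this sketch only.)
    """
    digest = hashlib.sha256(asset.read_bytes()).hexdigest()
    storage.mkdir(parents=True, exist_ok=True)
    target = storage / digest
    if not target.exists():  # identical content is stored only once
        shutil.copyfile(asset, target)
    return f"sha256:{digest}\n"


def restore_asset(pointer_text: str, storage: Path, destination: Path) -> None:
    """Resolve a pointer back to the real content and write it to destination."""
    algorithm, _, digest = pointer_text.strip().partition(":")
    if algorithm != "sha256":
        raise ValueError(f"unsupported pointer: {pointer_text!r}")
    shutil.copyfile(storage / digest, destination)


def demo() -> bytes:
    """Round-trip a fake asset through the store and return the restored bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        asset = root / "logo.png"
        asset.write_bytes(b"fake image bytes")
        pointer = store_asset(asset, root / "storage")
        restored = root / "restored.png"
        restore_asset(pointer, root / "storage", restored)
        return restored.read_bytes()


if __name__ == "__main__":
    print(demo())
```

Because the pointer is derived from the content, an old revision always points at the exact bytes it was committed with, as long as the storage still has them.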

Git Large File Storage

Git Large File Storage (Git LFS) is the most widespread implementation of this mechanism. It is developed by GitHub and is available on all repositories on their platform. It works out of the box: you set it up once and you can forget about it.
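For the curious, what Git LFS commits in place of a large file is a small text pointer with three lines: a spec version, an object id (a SHA-256 of the content) and the size in bytes. The sketch below builds and parses such a pointer; the helper names are mine, and it only mimics the text format, it does not talk to any LFS server.

```python
import hashlib

# Version line used by Git LFS pointer files.
LFS_SPEC_URL = "https://git-lfs.github.com/spec/v1"


def build_lfs_pointer(content: bytes) -> str:
    """Return pointer text in the Git LFS layout for the given file content."""
    oid = hashlib.sha256(content).hexdigest()
    return f"version {LFS_SPEC_URL}\noid sha256:{oid}\nsize {len(content)}\n"


def parse_lfs_pointer(text: str) -> dict:
    """Split pointer text into its fields (hypothetical helper for this sketch)."""
    fields = dict(line.split(" ", 1) for line in text.splitlines() if line.strip())
    algorithm, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algorithm": algorithm,
        "digest": digest,
        "size": int(fields["size"]),
    }


if __name__ == "__main__":
    pointer = build_lfs_pointer(b"pretend this is a large video file")
    print(pointer, end="")
    print(parse_lfs_pointer(pointer))
```

The real file lives on the LFS server, keyed by that object id, and is swapped back in when you check out the revision.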

However, there are some shortcomings with Git LFS.

1. your project is no longer self-contained in git

If you decide to use Git LFS, you tie your project to the LFS storage server. You won’t be able to walk through your history without a storage server. GitHub’s LFS server implementation is currently closed-source and only a “non production ready” reference server is available.

Major hosting platforms have built their own implementations, and it is possible to migrate your data between compatible hosting platforms. But your local copy of the repository will never hold all the data needed for your project. In a way, the storage server becomes a centralized piece. You can fetch all the data locally to have it available, but it won’t be considered a source; it is more like a cache.

2. you can’t easily manage storage in LFS

If you commit a bunch of files, then push your changes, all the files will be stored on the LFS server. If you want to remove them (e.g. you uploaded unwanted files), you can do it locally by rebasing, then calling git lfs prune. However, that will only clean up your local copies of the files. What has been pushed will stay on the server.
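As a rough sketch of the local cleanup step, here is a small wrapper around git lfs prune. It defaults to a dry run so you can see what would be deleted first. The function names and the injectable runner are my own choices for illustration, and this assumes git and git-lfs are installed; it still does nothing about what is already on the server.

```python
import subprocess
from pathlib import Path
from typing import Callable, List


def run_in_repo(command: List[str], repo: Path) -> None:
    """Run a command inside the repository and fail loudly on error."""
    subprocess.run(command, cwd=repo, check=True)


def prune_local_lfs(
    repo: Path,
    dry_run: bool = True,
    runner: Callable[[List[str], Path], None] = run_in_repo,
) -> List[str]:
    """Delete old LFS objects from the local cache only.

    With dry_run=True, git lfs reports what it would remove without removing it.
    """
    command = ["git", "lfs", "prune", "--verbose"]
    if dry_run:
        command.append("--dry-run")
    runner(command, repo)
    return command


if __name__ == "__main__":
    # Preview first, then call again with dry_run=False once the list looks right.
    prune_local_lfs(Path("."), dry_run=True)
```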

If you wish to remove files from the server, your options depend on the server implementation:

- on GitHub, your only option to reclaim LFS quota and truly delete files from LFS is to delete and recreate your repository
- on Bitbucket Cloud, you can browse LFS files in the web UI and delete specific files

Git Annex

git-annex is a complete solution to deal with external files in your git repository. It is also more complex than Git LFS. As you can see in their walkthrough, you need to explicitly set remotes for your files, and sync content between remotes.

Data is shared among local repositories in .git/annex, but it won’t be available on common source forges such as GitHub. To make this data available to everyone on the project, you can use special remotes, which act as data stores, akin to Git LFS (which can itself be used as a special remote).

Contrary to Git LFS, you can see what content is currently unused and delete unwanted files. It is a more complex solution but it is more flexible.
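Here is a small sketch of that cleanup workflow, wrapping the git annex unused and git annex dropunused commands. The Python function names and the injectable runner are assumptions of mine for illustration; the sketch assumes git-annex is installed and that you review the numbered list printed by the first command before dropping anything.

```python
import subprocess
from pathlib import Path
from typing import Callable, List

Runner = Callable[[List[str], Path], None]


def run_in_repo(command: List[str], repo: Path) -> None:
    """Run a command inside the repository and fail loudly on error."""
    subprocess.run(command, cwd=repo, check=True)


def list_unused(repo: Path, runner: Runner = run_in_repo) -> List[str]:
    """Ask git-annex which stored content is no longer referenced.

    git-annex prints a numbered list; those numbers are what dropunused expects.
    """
    command = ["git", "annex", "unused"]
    runner(command, repo)
    return command


def drop_unused(repo: Path, selection: str, runner: Runner = run_in_repo) -> List[str]:
    """Drop unused content by number or range (e.g. "1" or "1-3") from the last listing."""
    command = ["git", "annex", "dropunused", selection]
    runner(command, repo)
    return command


if __name__ == "__main__":
    repo = Path(".")
    list_unused(repo)          # inspect the numbered list first
    # drop_unused(repo, "1-3")  # then drop the entries you really do not need
```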

What I recommend

I think git-annex gives users more control over their data: it can be fully decentralized and offers tools to manage its content.

Git LFS is simpler and more widely used, but once you hit one of its limitations, it can be costly to break free.