Analysing large files in a Git repository

I often come across stores that are consuming significantly more disk space than they should. A healthy stack runs when the disk drives are under 50% utilised,

  • Over 50% utilisation risks performance degradation
  • Over 80% utilisation risks downtime
  • Over 90% utilisation risks data loss

Read more about healthy disk utilisation

Identifying large files

On MageStack, this is simple - it is done for you. A disk report is generated showing a breakdown of the largest files and usage per vhost.

But outside of MageStack (or this report), the best place to start is often a simple find command to identify the largest files and debug from there. In this case, I could see a customer was using a large proportion of disk space and I needed to find out why.

So I launched a command to find files larger than 500MB.

cd /microcloud/domains/example/domains/example.com
find . -type f -size +500M

The results were interesting,

./.git/objects/pack/pack-7805c8f60d8d15cd154dfe8ce624ec1b386f1b97.pack
...

Git uses pack files to store its objects in a compressed binary format. Ordinarily, these should be quite small, and Git even performs garbage collection to optimise their contents; however, there are occasions when pack files can grow beyond a reasonable size.
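Pack files live under .git/objects/pack, so a quick, read-only way of checking how much space they account for (just a sanity check, before doing anything destructive) is,

# total size of the pack directory
du -sh .git/objects/pack/
# Git's own object and pack-size summary (the -H flag needs a reasonably recent Git)
git count-objects -vH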

Try and optimise

The first step is to execute an automated cleanup,

git gc

Counting objects: 27891, done.
Delta compression using up to 32 threads.
Compressing objects: 100% (19481/19481), done.
Writing objects: 100% (27891/27891), done.
Total 27891 (delta 5882), reused 27891 (delta 5882)
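If the default settings don't reclaim much, a more aggressive repack is sometimes worth a try before digging any deeper; it recomputes every delta from scratch, so it can take a long time on a large repository,

# optional: slower, but repacks everything from scratch
git gc --aggressive --prune=now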

Garbage collection has reduced the space a little, but there's clearly still something in the repository (perhaps large binary files like images) that shouldn't be there. So let's decompose the repository.

List the pack objects

We can get Git to list all the objects and then sort them by size to find the largest ones. I've scripted this into a few bash commands.

This can take a few seconds to execute, so you may need to wait,

# List every object reachable from any ref (SHA and path), sorted by path
git rev-list --objects --all | sort -k 2 > file_list_sha.log
# Repack, then list the blobs in the pack index, sorted by object size (largest first)
git gc && git verify-pack -v .git/objects/pack/pack-*.idx | egrep "^\w+ blob\W+[0-9]+ [0-9]+ [0-9]+$" | sort -k 3 -n -r > file_large.log
# Join the two lists on the SHA to produce "SHA size path" for each large blob
for SHA in $(cut -f 1 -d\  < file_large.log); do
  echo $(grep $SHA file_large.log) $(grep $SHA file_list_sha.log) | awk '{print $1,$3,$7}' >> file_size_sorted.log
done

Once the commands above complete, a file called file_size_sorted.log will have been generated, containing a sorted list of objects and their sizes in bytes,

We can view the three largest files using head (filenames have been changed for privacy),

head -n 3 file_size_sorted.log
18f67b182c04513de233a1186cb7939b8f01544f 192025920 wp/wp-content/uploads/largefile1.zip
617d7ca48668b4eb8758d71fd0e3bfadda52b3f2 5679809 wp/wp-content/uploads/largefile2.zip
861b45b90e5392b16ed10ca26fd9a4cf3b966d3a 4908285 wp/wp-content/uploads/largefile3.zip

So it's clear that someone has mistakenly put the wp-content/uploads directory, containing primarily binary data, into version control.
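As a rough check, we can total up how much of the repository that directory accounts for, using the file_size_sorted.log we just generated (the second column is the object size in bytes),

# sum the sizes of every blob stored under wp/wp-content/uploads
grep 'wp/wp-content/uploads/' file_size_sorted.log | awk '{ total += $2 } END { printf "%.1f MB\n", total / 1024 / 1024 }'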

Removing the large files from the repository

Using git filter-branch, you can remove the files from previous commits. This isn't something you should consider lightly, as the history will be rewritten and other people using this repository will need to re-clone (or hard reset) onto the new tree.

E.g. to remove largefile1.zip,

git filter-branch --force --index-filter 'git rm --cached -r --ignore-unmatch wp/wp-content/uploads/largefile1.zip' --prune-empty --tag-name-filter cat -- --all
git update-ref -d refs/original/refs/heads/master
git reflog expire --expire=now --all
git gc --prune=now
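With the history rewritten, the changes need to be published, and every other clone needs to adopt the new history rather than merging it. A sketch, assuming the remote is called origin and the branch is master,

# stop the uploads directory creeping back into version control
echo "wp/wp-content/uploads/" >> .gitignore
git add .gitignore && git commit -m "Ignore wp-content uploads"

# publish the rewritten history (destructive for anyone still tracking the old refs)
git push origin --force --all
git push origin --force --tags

# on every other clone: adopt the rewritten history instead of merging it
git fetch origin
git reset --hard origin/master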