How to check the size of a git repository
Analyzing the size of your Git repository is crucial for maintaining performance, understanding storage needs, and identifying potential issues. The git-sizer tool provides detailed insights into your repository’s size and structure. This guide walks you through obtaining the tool, setting it up, and using it effectively, with examples to help you get started.
Getting the tool and setting things up
First, you’ll need to download and install git-sizer. The tool is open-source and can be found on its GitHub repository. Installation is straightforward.
Prerequisites
Before proceeding, ensure the following:
- Git CLI Installed: git-sizer requires Git version 2.6 or higher.
- Git in PATH: Ensure the git command is available in your system’s PATH.
You can verify your Git installation by running:
git --version
Note: If you see a version number, you’re good to go. Otherwise, download and install Git from git-scm.com.
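The version requirement can also be checked in a script. The following is a minimal sketch for a POSIX shell (Git Bash on Windows should work); the function name git_version_ok is hypothetical and only illustrates comparing the reported version against 2.6.

```shell
# Succeeds (exit status 0) when a version string such as "2.39.2" is at
# least the required MAJOR.MINOR (default: 2.6, the minimum named above).
# Note: POSIX sh has no local variables, so these names are global.
git_version_ok() {
    version=$1
    required=${2:-2.6}

    major=${version%%.*}      # "2.39.2" -> "2"
    rest=${version#*.}        # "2.39.2" -> "39.2"
    minor=${rest%%.*}         # "39.2"   -> "39"

    req_major=${required%%.*}
    req_rest=${required#*.}
    req_minor=${req_rest%%.*}

    if [ "$major" -gt "$req_major" ]; then
        return 0
    fi
    [ "$major" -eq "$req_major" ] && [ "$minor" -ge "$req_minor" ]
}
```

Since git --version prints a line such as "git version 2.39.2", the third word can be passed in: git_version_ok "$(git --version | awk '{print $3}')" && echo "Git is new enough".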
Install git-sizer on Windows
Method 1: Install a Released Version (Recommended)
- Visit the git-sizer Releases page.
- Download the ZIP file corresponding to your platform.
- Extract the ZIP file.
- Move the git-sizer.exe executable to a directory in your PATH.
Tip: The easiest option is to copy the executable to the same directory where Git is installed, as it’s already in your PATH.
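To confirm that the executable really is reachable through PATH, a small check like the one below may help. It is a sketch that assumes a POSIX shell such as Git Bash; check_on_path is a hypothetical helper name, and command -v is the standard shell built-in for looking up a command.

```shell
# Prints where a command was found on PATH, or a hint when it is missing.
# Returns 0 when the command is found and 1 otherwise.
check_on_path() {
    tool=$1
    if location=$(command -v "$tool"); then
        printf '%s found at %s\n' "$tool" "$location"
    else
        printf '%s is not on PATH\n' "$tool"
        return 1
    fi
}
```

After installation, check_on_path git-sizer should print the location of the executable you just copied.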
Method 2: Build from Source
- Clone the repository:
git clone https://github.com/github/git-sizer.git
- Navigate to the directory:
cd git-sizer
- Build using make (this requires a Go toolchain, since git-sizer is written in Go):
make
Pro Tip: Dual Invocation
Once installed and part of your PATH, you can invoke the tool in two ways:
- Directly, by typing git-sizer.
- Via Git, by typing git sizer.
The latter allows you to include Git options between git and sizer, offering additional flexibility.
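As one illustration of placing a Git option between git and sizer, the sketch below uses -C, a standard Git option that runs the command as if Git had been started in another directory. The wrapper name sizer_for_repo and the example path are hypothetical.

```shell
# Runs git-sizer against a repository in another directory.
# -C <dir> goes between "git" and "sizer"; any remaining arguments
# are passed through to git-sizer unchanged.
sizer_for_repo() {
    repo_dir=$1
    shift
    git -C "$repo_dir" sizer "$@"
}
```

For example, sizer_for_repo /path/to/repo --verbose analyzes that repository without changing your current directory.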
How to use git-sizer
The primary purpose of git-sizer is to provide insights into the size and structure of your Git repository. By default, it highlights only problematic areas. For detailed information, use the --verbose flag.
git sizer --verbose
Options:
--threshold THRESHOLD: Set the minimum level of concern.
--verbose: Report all statistics.
--json: Output results in JSON format.
--[no-]progress: Report progress to stderr.
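The options listed above can be combined in small wrappers. The sketch below is only illustrative: the function names save_sizer_report and sizer_above are hypothetical, and the flags are the ones from the list (--json, --no-progress, --threshold).

```shell
# Writes a JSON report for the current repository to the given file.
# --no-progress suppresses the progress messages; only stdout (the JSON
# document) is redirected into the file.
save_sizer_report() {
    out_file=$1
    git sizer --json --no-progress > "$out_file"
}

# Shows only statistics at or above the given level of concern.
sizer_above() {
    min_level=$1
    git sizer --threshold "$min_level"
}
```

Saving the JSON output, for instance with save_sizer_report report.json, makes it possible to compare reports over time, while sizer_above 10 narrows the table to the more worrying rows.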
Example: Analyzing a Repository
Run the following command in your repository:
git sizer --verbose
Sample Output:
The output is a table showing what was measured and a rough indication of which values might be problematic. Note: Only objects that are reachable from references are included.
Key sections in the output:
- The Overall repository size section:
- Includes repository-wide statistics about distinct objects, not including repetition.
- Commits: Number and total size of commits.
- Trees: Number and total size of trees.
- Blobs: Number and total size of blobs.
- Annotated tags: Number of annotated tags.
- References: Number of references.
- Total size is the sum of the sizes of the corresponding objects in their uncompressed form, measured in bytes.
- The overall uncompressed size of all objects is a good indication of how expensive commands like git gc --aggressive (and git repack [-f|-F] and git pack-objects --no-reuse-delta), git fsck, and git log [-G|-S] will be.
- The uncompressed size of trees and commits is a good indication of how expensive reachability traversals will be, including clones and fetches and git gc.
- Biggest Objects:
- Highlights the largest commits, trees, and blobs.
- For example, if a blob has a size of 3.35 MiB, it might indicate a large file committed directly into the repository.
- History Structure:
- Shows the maximum depth of the commit history, that is, the longest chain of commits in the repository.
- A high value indicates a deep history, which can slow down operations like git log and git blame.
- The maximum tag depth is the longest chain of annotated tags that point at other annotated tags.
- A value above 1 means that some tags point at other tags rather than directly at commits, which is rarely intentional.
- Biggest Checkouts:
- Identifies potential issues during checkouts, such as a high number of files or directories.
- Number of directories: Total number of directories.
- Maximum path depth: Maximum depth of paths.
- Maximum path length: Maximum length of paths.
- Number of files: Total number and size of files.
- Total size of files: The sum of all file sizes in the single biggest commit.
- Number of symlinks: Total number of symbolic links.
- Number of submodules: Total number of submodules.
- Example: A path depth of 10 or more can complicate navigation and tooling.
Exploring Problematic repositories
The following is the default output for one of the famous “git bomb” repositories:
$ git-sizer
[...]
| Name | Value | Level of concern |
| ---------------------------- | --------- | ------------------------------ |
| Biggest checkouts | | |
| * Number of directories [1] | 1.11 G | !!!!!!!!!!!!!!!!!!!!!!!!!!!!!! |
| * Maximum path depth [1] | 11 | * |
| * Number of files [1] | ∞ | !!!!!!!!!!!!!!!!!!!!!!!!!!!!!! |
| * Total size of files [2] | 83.8 GiB | !!!!!!!!!!!!!!!!!!!!!!!!!!!!!! |
[1] c1971b07ce6888558e2178a121804774c4201b17 (refs/heads/master^{tree})
[2] d9513477b01825130c48c4bebed114c4b2d50401 (18ed56cbc5012117e24a603e7c072cf65d36d469^{tree})
Insights
- The Biggest checkouts section highlights the problematic areas, such as the number of directories and files, and the total file size.
- Total file size exceeds 80 GiB, which is far beyond manageable levels for most use cases.
- The Level of concern column indicates the severity of each issue: the more * symbols in a row, the higher the concern.
- A row of ! symbols marks the most severe issues, where the value is so extreme that it runs off the scale (the equivalent of more than 30 asterisks).
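When post-processing the table in a script, the Level of concern cell can be turned into something easier to compare. This is a rough sketch: concern_level is a hypothetical helper, and it assumes that asterisks count up the level of concern while exclamation marks mean the value is off the scale.

```shell
# Converts one "Level of concern" cell into a simple value.
# Prints the number of asterisks in the cell, or "off-scale" when the
# cell contains exclamation marks.
concern_level() {
    cell=$1
    case $cell in
        *'!'*)
            printf 'off-scale\n'
            ;;
        *)
            # Keep only the asterisks, then print how many are left.
            stars=$(printf '%s' "$cell" | tr -cd '*')
            printf '%s\n' "${#stars}"
            ;;
    esac
}
```

For the table above, the Maximum path depth row would yield 1, while the Number of files row would yield off-scale.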
When using git-sizer, look out for the following problematic areas that might cause issues:
- Repository size: Ideally under 1 GiB, becomes unwieldy over 5 GiB.
- Number of references: Limit to a few tens of thousands at most.
- Number of objects: Too many can slow down history traversal and garbage collection.
- Gigantic blobs: Git works best with small- to medium-sized files.
- Many versions of large text files: Can be expensive for Git to reconstruct and diff.
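The repository size guideline in this list can be expressed as a quick check. The sketch below is an assumption-laden illustration: classify_repo_size is a hypothetical helper, it expects a size in KiB, and it treats the on-disk pack size as a reasonable proxy for "repository size" (git-sizer itself reports uncompressed sizes, which are larger).

```shell
# Classifies a repository size given in KiB against the guideline:
# under 1 GiB is fine, over 5 GiB is unwieldy, anything between is large.
classify_repo_size() {
    size_kib=$1
    one_gib_kib=$((1024 * 1024))        # 1 GiB expressed in KiB
    five_gib_kib=$((5 * 1024 * 1024))   # 5 GiB expressed in KiB

    if [ "$size_kib" -lt "$one_gib_kib" ]; then
        printf 'ok\n'
    elif [ "$size_kib" -le "$five_gib_kib" ]; then
        printf 'large\n'
    else
        printf 'unwieldy\n'
    fi
}
```

One possible input is the size-pack value from git count-objects -v, which Git reports in KiB: classify_repo_size "$(git count-objects -v | awk '/^size-pack:/ {print $2}')".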
Conclusion
git-sizer is an invaluable tool for diagnosing and addressing repository size issues. By analyzing output metrics like blob sizes, history depth, and checkout structure, you can identify inefficiencies and take corrective actions, such as rewriting history or splitting the repository.
When using git-sizer, remember:
- Start with a high-level overview.
- Dive deeper into specific areas of concern highlighted in the output.
With this guide and examples, you’re well-equipped to use git-sizer to keep your Git repositories optimized and manageable.