How to check the size of a git repository

Analyzing the size of your Git repository is crucial for maintaining performance, understanding storage needs, and identifying potential issues. The git-sizer tool provides detailed insights into your repository’s size and structure. This guide walks you through obtaining the tool, setting it up, and using it effectively, with examples to help you get started.

Getting the tool and setting things up

First, you’ll need to download and install git-sizer. The tool is open-source and can be found on its GitHub repository. Installation is straightforward.

Prerequisites

Before proceeding, ensure the following:

  • Git CLI Installed: git-sizer requires Git version 2.6 or higher.
  • Git in PATH: Ensure the git command is available in your system’s PATH.

You can verify your Git installation by running:

git --version

Note: If the reported version is 2.6 or higher, you’re good to go. Otherwise, download and install a current version of Git from git-scm.com.
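As a rough sketch, the version check can also be scripted. The helper name git_version_ok is made up for this example, and the script assumes that git --version prints the version number as its third word:

```shell
#!/bin/sh
# Succeeds when a "major.minor[.patch]" version string is at least 2.6,
# the minimum Git version listed in the prerequisites.
# git_version_ok is a hypothetical helper name, not part of Git or git-sizer.
git_version_ok() {
    major=${1%%.*}      # text before the first dot
    rest=${1#*.}        # text after the first dot
    minor=${rest%%.*}   # text before the next dot
    [ "$major" -gt 2 ] || { [ "$major" -eq 2 ] && [ "$minor" -ge 6 ]; }
}

# "git --version" prints something like "git version 2.43.0".
if command -v git >/dev/null 2>&1; then
    installed=$(git --version | awk '{print $3}')
    if git_version_ok "$installed"; then
        echo "Git $installed is new enough for git-sizer"
    else
        echo "Git $installed is older than 2.6"
    fi
else
    echo "git was not found in PATH"
fi
```

The comparison looks only at the major and minor numbers, which is enough for a 2.6 minimum.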

Method 1: Install a Prebuilt Release on Windows

  1. Visit the git-sizer Releases page.
  2. Download the ZIP file corresponding to your platform.
  3. Extract the ZIP file.
  4. Move the git-sizer.exe executable to a directory in your PATH.

Tip: The easiest option is to copy the executable to the same directory where Git is installed, as it’s already in your PATH.

Method 2: Build from Source

  1. Clone the repository: git clone https://github.com/github/git-sizer.git
  2. Navigate to the directory: cd git-sizer
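  3. Build using make (git-sizer is written in Go, so a Go toolchain needs to be installed): make
  4. Copy the resulting git-sizer executable to a directory in your PATH.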

Pro Tip: Dual Invocation

Once installed and part of your PATH, you can invoke the tool in two ways:

  1. Directly, by typing git-sizer.
  2. Via Git, by typing git sizer.

The latter allows you to include Git options between git and sizer, offering additional flexibility.
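Before relying on either form, it can help to confirm that the shell actually finds the executable. The following is a minimal sketch; sizer_status is a hypothetical helper name and /path/to/repo is a placeholder:

```shell
#!/bin/sh
# Report whether a git-sizer executable is reachable through PATH.
# sizer_status is a made-up name for this example.
sizer_status() {
    if command -v git-sizer >/dev/null 2>&1; then
        echo "found: $(command -v git-sizer)"
    else
        echo "missing"
    fi
}
sizer_status

# Once the status is "found", both forms below should work. The second one
# passes the Git option -C to analyze a repository in another directory:
#   git-sizer
#   git -C /path/to/repo sizer
```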

How to use git-sizer

The primary purpose of git-sizer is to provide insights into the size and structure of your Git repository. By default, it highlights only problematic areas. For detailed information, use the --verbose flag.

git sizer --verbose

Options:

  • --threshold THRESHOLD: Report only statistics at or above this level of concern, given as a number of asterisks.
  • --verbose: Report all statistics.
  • --json: Output results in JSON format.
  • --[no-]progress: Report progress to stderr.
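The --json option makes the results easy to post-process. The sketch below pulls a single number out of that output with sed. The helper json_number is hypothetical, and the key name max_blob_size is an assumption about the JSON layout, so check the keys your version of git-sizer emits. A dedicated JSON tool such as jq would be more robust than this one-liner.

```shell
#!/bin/sh
# Print the integer stored under a given key of a flat JSON object read
# from stdin. json_number is a hypothetical helper for this example.
json_number() {
    sed -n "s/.*\"$1\"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p" | head -n 1
}

# Only run the real command when git-sizer is installed and the current
# directory is inside a Git repository.
# ASSUMPTION: "max_blob_size" is the key for the largest blob, in bytes.
if command -v git-sizer >/dev/null 2>&1 && git rev-parse --git-dir >/dev/null 2>&1; then
    git sizer --json --no-progress | json_number max_blob_size
fi
```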

Example: Analyzing a Repository

Run the following command in your repository:

git sizer --verbose

Sample Output:

```bash
❯ git sizer --verbose
Processing blobs: 5051
Processing trees: 5112
Processing commits: 1054
Matching commits to trees: 1054
Processing annotated tags: 0
Processing references: 120
| Name                         | Value     | Level of concern               |
| ---------------------------- | --------- | ------------------------------ |
| Overall repository size      |           |                                |
| * Commits                    |           |                                |
|   * Count                    |  1.05 k   |                                |
|   * Total size               |   471 KiB |                                |
| * Trees                      |           |                                |
|   * Count                    |  5.11 k   |                                |
|   * Total size               |  2.13 MiB |                                |
|   * Total tree entries       |  54.3 k   |                                |
| * Blobs                      |           |                                |
|   * Count                    |  5.05 k   |                                |
|   * Total size               |  59.3 MiB |                                |
| * Annotated tags             |           |                                |
|   * Count                    |     0     |                                |
| * References                 |           |                                |
|   * Count                    |   120     |                                |
|     * Branches               |    27     |                                |
|     * Tags                   |     7     |                                |
|     * Remote-tracking refs   |    85     |                                |
|     * Git stash              |     1     |                                |
|                              |           |                                |
| Biggest objects              |           |                                |
| * Commits                    |           |                                |
|   * Maximum size         [1] |  1.20 KiB |                                |
|   * Maximum parents      [2] |     2     |                                |
| * Trees                      |           |                                |
|   * Maximum entries      [3] |   185     |                                |
| * Blobs                      |           |                                |
|   * Maximum size         [4] |  3.35 MiB |                                |
|                              |           |                                |
| History structure            |           |                                |
| * Maximum history depth      |   625     |                                |
| * Maximum tag depth          |     0     |                                |
|                              |           |                                |
| Biggest checkouts            |           |                                |
| * Number of directories  [5] |   561     |                                |
| * Maximum path depth     [5] |    10     | *                              |
| * Maximum path length    [6] |    96 B   |                                |
| * Number of files        [5] |  2.63 k   |                                |
| * Total size of files    [5] |  21.3 MiB |                                |
| * Number of symlinks         |     0     |                                |
| * Number of submodules       |     0     |                                |

[1]  48c5116db547c885966333231a3dd0998c01f748
[2]  df063044a2c4e4caf20085773e46bc848cd805cd (refs/tags/v1.0.6)
[3]  fcb2391a4f394e51b9ef22ef193a76bd653b06fe (refs/remotes/origin/feature/update-npm:node_modules/npm/node_modules)
[4]  a8cf3760f503fa07db8803f242c7f4a3878e1564 (refs/heads/feature/upgrade-dotnet6:src/Web/src/assets/logo.png)
[5]  803ca70b7a8c19cb71b53c59513642232c7e1c13 (refs/remotes/origin/feature/update-npm^{tree})
[6]  75b4a8b14d79d628770d0ff953f9f893f8420885 (refs/heads/feature/upgrade-dotnet6^{tree})
```

The output is a table showing what was measured and a rough indication of which values might be problematic. The numbered footnotes below the table identify the object behind each maximum. Note: Only objects that are reachable from references are included.

Key sections in the output:

  1. Overall repository size:
    • Includes repository-wide statistics about distinct objects, not including repetition.
    • Commits: Number and total size of commits.
    • Trees: Number and total size of trees, plus the total number of tree entries.
    • Blobs: Number and total size of blobs.
    • Annotated tags: Number of annotated tags.
    • References: Number of references, broken down by type.
    • Total size is the sum of the sizes of the corresponding objects in their uncompressed form, measured in bytes.
    • The overall uncompressed size of all objects is a good indication of how expensive commands like git gc --aggressive (and git repack [-f|-F] and git pack-objects --no-reuse-delta), git fsck, and git log [-G|-S] will be.
    • The uncompressed size of trees and commits is a good indication of how expensive reachability traversals will be, including clones and fetches and git gc.
  2. Biggest Objects:
    • Highlights the largest commits, trees, and blobs.
    • For example, a blob with a size of 3.35 MiB might indicate a large file committed directly into the repository. In the sample output, footnote [4] shows that it is a logo.png image.
  3. History Structure:
    • Maximum history depth is the longest chain of commits in the history.
    • A high value indicates a deep history, which can slow down operations that walk it, such as git log and git blame.
    • Maximum tag depth is the longest chain of annotated tags that point at other annotated tags.
      • In the sample output it is 0, since the repository contains no annotated tags.
  4. Biggest Checkouts:
    • Identifies potential issues during checkouts, such as a high number of files or directories. Each value is the maximum found in any single checkout in the history.
      • Number of directories: Largest number of directories in a single checkout.
      • Maximum path depth: Largest number of path components in any path.
      • Maximum path length: Length of the longest path, in bytes.
      • Number of files: Largest number of files in a single checkout.
      • Total size of files: Sum of all file sizes in the single biggest commit.
      • Number of symlinks: Largest number of symbolic links in a single checkout.
      • Number of submodules: Largest number of submodules in a single checkout.
    • Example: A path depth of 10 or more can complicate navigation and tooling.
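The "Biggest Objects" section names only the single largest blob. To see the runners-up as well, plain Git plumbing can list blobs by size. This is a minimal sketch that is independent of git-sizer; largest_blobs is a made-up helper name, and the sizes it prints are uncompressed object sizes in bytes.

```shell
#!/bin/sh
# List the N largest blobs reachable from any reference, biggest first,
# as "<size-in-bytes> <path>" lines. Run it inside a Git repository.
# largest_blobs is a hypothetical helper name; N defaults to 5.
largest_blobs() {
    git rev-list --objects --all |
        git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
        sed -n 's/^blob //p' |     # keep blobs only and drop the type column
        sort -k1,1nr |             # numeric sort on size, descending
        head -n "${1:-5}"
}

# Example: show the ten largest blobs in the current repository.
#   largest_blobs 10
```

A blob that appears under several paths is listed with the first path that git rev-list reports for it.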

Exploring problematic repositories

The following is the default output for one of the famous “git bomb” repositories:


$ git-sizer
[...]
| Name                         | Value     | Level of concern               |
| ---------------------------- | --------- | ------------------------------ |
| Biggest checkouts            |           |                                |
| * Number of directories  [1] |  1.11 G   | !!!!!!!!!!!!!!!!!!!!!!!!!!!!!! |
| * Maximum path depth     [1] |    11     | *                              |
| * Number of files        [1] |     ∞     | !!!!!!!!!!!!!!!!!!!!!!!!!!!!!! |
| * Total size of files    [2] |  83.8 GiB | !!!!!!!!!!!!!!!!!!!!!!!!!!!!!! |

[1]  c1971b07ce6888558e2178a121804774c4201b17 (refs/heads/master^{tree})
[2]  d9513477b01825130c48c4bebed114c4b2d50401 (18ed56cbc5012117e24a603e7c072cf65d36d469^{tree})

Insights

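  • The Biggest checkouts section highlights the problematic areas: the number of directories, the number of files, and the total file size.
  • Total file size exceeds 80 GiB, which is far beyond manageable levels for most use cases.
  • The number of files is shown as ∞ because the true count overflowed the counter that git-sizer uses for that field.
  • The Level of concern column uses asterisks to mark values that are high compared with typical repositories. The more asterisks, the greater the concern.
  • A row of ! marks a value that is extremely high, beyond the top of the asterisk scale (more than 30 asterisks).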
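When the table is long, it can be handy to reduce it to just the rows that carry a mark in the Level of concern column. The filter below is a small sketch that reads the table from stdin; flagged_rows is a hypothetical helper name, and it assumes the pipe-delimited layout shown above.

```shell
#!/bin/sh
# Print "<metric>: <marks>" for every table row whose "Level of concern"
# column contains at least one * or !.
# flagged_rows is a made-up helper name for this example.
flagged_rows() {
    awk -F'|' '
        $4 ~ /[*!]/ {
            name = $2; marks = $4
            gsub(/^[ *]+|[ ]+$/, "", name)   # drop the bullet and the padding
            sub(/ *\[[0-9]+\]$/, "", name)   # drop a footnote marker such as [1]
            gsub(/ /, "", marks)             # drop the padding around the marks
            print name ": " marks
        }'
}

# Example, assuming git-sizer is installed:
#   git-sizer --no-progress | flagged_rows
```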

When using git-sizer, look out for the following problematic areas that might cause issues:

  • Repository size: Ideally under 1 GiB, becomes unwieldy over 5 GiB.
  • Number of references: Limit to a few tens of thousands at most.
  • Number of objects: Too many can slow down history traversal and garbage collection.
  • Gigantic blobs: Git works best with small- to medium-sized files.
  • Many versions of large text files: Can be expensive for Git to reconstruct and diff.
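For a quick comparison against the 1 GiB and 5 GiB guidance without running git-sizer, git count-objects -v reports how much space the object database uses on disk. The sketch below sums its loose and packed sizes, both reported in KiB. The helper names and the labels ok, large, and unwieldy are invented for this example.

```shell
#!/bin/sh
# Classify an on-disk size given in KiB against the rough guidance above:
# under 1 GiB is comfortable, over 5 GiB is unwieldy.
# size_verdict and its labels are hypothetical, chosen for this example.
size_verdict() {
    if [ "$1" -lt 1048576 ]; then      # 1 GiB = 1048576 KiB
        echo "ok"
    elif [ "$1" -le 5242880 ]; then    # 5 GiB = 5242880 KiB
        echo "large"
    else
        echo "unwieldy"
    fi
}

# Sum loose ("size") and packed ("size-pack") object storage, in KiB.
repo_kib() {
    git count-objects -v |
        awk '$1 == "size:" || $1 == "size-pack:" { total += $2 } END { print total + 0 }'
}

# Only report when the current directory is inside a Git repository.
if git rev-parse --git-dir >/dev/null 2>&1; then
    kib=$(repo_kib)
    echo "objects on disk: ${kib} KiB ($(size_verdict "$kib"))"
fi
```

This measures compressed storage on disk, whereas the totals in the git-sizer table are uncompressed sizes, so the two numbers are not directly comparable.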

Conclusion

git-sizer is an invaluable tool for diagnosing and addressing repository size issues. By analyzing output metrics like blob sizes, history depth, and checkout structure, you can identify inefficiencies and take corrective actions, such as rewriting history or splitting the repository.

When using git-sizer, remember:

  • Start with a high-level overview.
  • Dive deeper into specific areas of concern highlighted in the output.

With this guide and examples, you’re well-equipped to use git-sizer to keep your Git repositories optimized and manageable.