I Built Git From Scratch in Python. Here's What I Learned

I’ve used Git for a good chunk of time. I knew the commands. I knew roughly what “staging” meant and had a vague mental model of branches as pointers. But I didn’t really understand Git - not until I tried to build it.

Pit is a version control system I wrote from scratch in Python. It supports init, add, commit, branch, merge, rebase, stash, diff, and most of the other commands you use daily. This post isn’t a tutorial on how to use it. It’s about what building it taught me about how Git works - and why some parts are cleverer than they first appear.

The Three Primitives

Everything in Git - and by extension Pit - is built on three ideas.

Blobs are the simplest. A blob is just file content, hashed with SHA-1 and stored by that hash. No filename. No path. Just content. If two files in your project have identical content, they share one blob. This is content-addressable storage: the key is derived from the value itself.
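
Here is a minimal sketch of content-addressable storage, not Pit's actual code. It assumes Git's object header format ("blob", the size, then a NUL byte) is hashed along with the content; whether Pit uses the same header is an assumption. The names hash_blob and store_blob and the in-memory dict are hypothetical stand-ins for the on-disk object store.

```python
import hashlib


def hash_blob(content: bytes) -> str:
    """Return a blob id: SHA-1 over a Git-style header plus the content."""
    # Header is b"blob <size>" followed by a NUL byte, as Git does.
    header = b"blob " + str(len(content)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


def store_blob(store: dict, content: bytes) -> str:
    """Store content under its own hash; identical content is stored once."""
    oid = hash_blob(content)
    store.setdefault(oid, content)
    return oid


store = {}
a = store_blob(store, b"hello\n")    # e.g. README.txt
b = store_blob(store, b"hello\n")    # a second file with the same content
c = store_blob(store, b"goodbye\n")

print(a == b)       # True: same content, same key
print(len(store))   # 2: only two distinct blobs were stored
```

Note that the filename never enters the hash, which is why two identical files collapse into one blob.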

Trees represent directories. A tree object maps names to either blobs (files) or other trees (subdirectories). This is a Merkle tree - each node’s hash is computed from its children’s hashes. If anything changes anywhere in the directory structure, the root tree hash changes. That property is what makes Git’s integrity guarantees work.
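
The Merkle property can be sketched in a few lines. This is an illustration under simplifying assumptions: a directory is modelled as a nested dict (bytes for files, dicts for subdirectories), and the tree is serialized as plain text lines rather than Git's binary tree format. hash_tree and hash_blob are hypothetical names.

```python
import hashlib


def hash_blob(content: bytes) -> str:
    # Simplified blob hash for this sketch (no size header).
    return hashlib.sha1(b"blob " + content).hexdigest()


def hash_tree(directory: dict) -> str:
    """Hash a directory from the hashes of its children (Merkle style)."""
    lines = []
    for name in sorted(directory):          # sort so the hash is deterministic
        child = directory[name]
        if isinstance(child, dict):         # subdirectory -> nested tree
            lines.append(f"tree {hash_tree(child)} {name}")
        else:                               # file -> blob
            lines.append(f"blob {hash_blob(child)} {name}")
    return hashlib.sha1("\n".join(lines).encode("utf-8")).hexdigest()


project = {
    "README": b"hi",
    "docs": {"intro.txt": b"welcome"},
    "src": {"main.py": b"print(1)"},
}
root_before = hash_tree(project)
docs_before = hash_tree(project["docs"])

project["src"]["main.py"] = b"print(2)"     # change one deeply nested file

root_after = hash_tree(project)
docs_after = hash_tree(project["docs"])

print(root_before != root_after)   # True: the change propagates to the root
print(docs_before == docs_after)   # True: untouched subtrees keep their hash
```

The second result is the practical payoff: unchanged subtrees keep their hash, so they can be reused between commits instead of being stored again.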

Commits wrap a tree with metadata: author, timestamp, message, and a pointer to the parent commit. That parent pointer is what makes the history a directed acyclic graph rather than just a flat list. The first commit has no parent; merge commits have two.

Once I implemented these three types in utils/objects.py, the rest of Pit was mostly just connecting them correctly.

The Index Is the Part Nobody Talks About

Everyone learns that Git has a staging area. What I didn’t understand until I built it is that the staging area is a persistent data structure - a binary file at .git/index that records the staged version of every tracked file, not just the ones you changed most recently.

In Pit, the index maps file paths to blob hashes. When you run pit add, it hashes the file’s content, writes the blob to the object store, and updates the index entry for that path. When you run pit commit, it reads the index and constructs the tree from it. When you run pit status, it compares three things:

What’s in the working directory
What’s in the index
What’s in the last commit’s tree

That three-way comparison is what produces “staged”, “unstaged”, and “untracked” - three separate states for each file. I had mentally collapsed staging and committing into one thing before building this. They’re not the same thing at all.
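
The comparison can be sketched with three plain dicts mapping paths to blob hashes. This is a simplified model, not Pit's implementation: the status function and the short fake hashes are illustrative assumptions, and a real working directory would need to be hashed from disk first.

```python
def status(working: dict, index: dict, head: dict) -> dict:
    """Classify paths by comparing working directory, index, and HEAD tree."""
    # Staged: the index differs from the last commit (includes staged deletions).
    staged = sorted(
        p for p in index.keys() | head.keys() if index.get(p) != head.get(p)
    )
    # Unstaged: a tracked file differs on disk from what the index recorded.
    unstaged = sorted(p for p in index if working.get(p) != index[p])
    # Untracked: present on disk but unknown to the index.
    untracked = sorted(p for p in working if p not in index)
    return {"staged": staged, "unstaged": unstaged, "untracked": untracked}


head = {"a.txt": "h1", "b.txt": "h2"}
index = {"a.txt": "h1", "b.txt": "h3", "c.txt": "h4"}
working = {"a.txt": "h9", "b.txt": "h3", "c.txt": "h4", "notes.txt": "h5"}

result = status(working, index, head)
print(result["staged"])     # ['b.txt', 'c.txt']
print(result["unstaged"])   # ['a.txt']
print(result["untracked"])  # ['notes.txt']
```

The same file can appear as both staged and unstaged if it was edited again after pit add, which is exactly what the three-way comparison makes possible.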

How Merge Actually Works

This was the part I was most curious about, and the most satisfying to implement.

pit merge implements a three-way merge. The name describes the three versions it compares:

The common ancestor of the two branches
The current HEAD
The branch being merged in

Finding the common ancestor is the first problem. Pit uses BFS on the commit DAG - it walks back through both branches’ parent pointers simultaneously until it finds the first commit they share. That commit is the merge base.
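
One way to sketch that search is an alternating breadth-first walk from both tips. This is an assumption about the shape of the algorithm, not Pit's actual code: the history is modelled as a hypothetical dict mapping each commit id to its list of parent ids, and find_merge_base is an illustrative name. It returns a common ancestor found early in the walk, which is not guaranteed to be the best merge base in unusual histories such as criss-cross merges.

```python
from collections import deque


def find_merge_base(parents: dict, a: str, b: str):
    """Walk back from both commits in lockstep; return a shared ancestor or None."""
    if a == b:
        return a
    seen_a, seen_b = {a}, {b}
    queue_a, queue_b = deque([a]), deque([b])
    while queue_a or queue_b:
        # Take one step on each side per round.
        for queue, seen, other_seen in (
            (queue_a, seen_a, seen_b),
            (queue_b, seen_b, seen_a),
        ):
            if not queue:
                continue
            commit = queue.popleft()
            if commit in other_seen:        # the other side already reached it
                return commit
            for parent in parents.get(commit, []):
                if parent not in seen:
                    seen.add(parent)
                    queue.append(parent)
    return None                             # unrelated histories


# A <- B <- C (main)
#       \
#        D <- E (feature)
history = {"A": [], "B": ["A"], "C": ["B"], "D": ["B"], "E": ["D"]}

print(find_merge_base(history, "C", "E"))   # B
print(find_merge_base(history, "B", "E"))   # B (B is an ancestor of E)
```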

With the merge base identified, the algorithm compares each file across all three versions:

If only one branch changed a file, that change wins automatically.
If both branches changed the same file in the same way, no conflict.
If both branches changed the same file differently, that’s a conflict.
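
Those three rules fit in a few comparisons. The sketch below works on whole-file blob hashes, with None meaning the file does not exist in that version; merge_entry is a hypothetical name, and a real implementation would go on to merge line by line inside a file rather than stopping at the file level.

```python
def merge_entry(base, ours, theirs):
    """Decide the merged blob for one path. Returns (blob_hash, is_conflict)."""
    if ours == theirs:
        return ours, False      # untouched, or both sides made the same change
    if ours == base:
        return theirs, False    # only the other branch changed it
    if theirs == base:
        return ours, False      # only the current branch changed it
    return None, True           # both changed it, differently: conflict


print(merge_entry("v1", "v1", "v2"))   # ('v2', False)
print(merge_entry("v1", "v2", "v2"))   # ('v2', False)
print(merge_entry("v1", "v2", "v3"))   # (None, True)
print(merge_entry("v1", None, "v1"))   # (None, False): deleted on our side
```

Without the base version, the first and third cases would be indistinguishable from a conflict, which is why a two-way comparison is not enough.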

When a conflict occurs, Pit writes the familiar markers into the file:

<<<<<<< HEAD
Content from the current branch
=======
Content from the branch being merged
>>>>>>> feature-branch

Then it stops and waits. pit mergetool opens an external tool to resolve it. Once resolved, you stage the file and commit.

What I found interesting is how little “magic” is involved. Merge is just structured file comparison with a clear rule for what counts as a conflict. The hard part is finding the right ancestor - the rest follows from that.

Rebase Is Just Cherry-Pick in a Loop

Before building Pit, rebase felt like the advanced, dangerous Git command that senior developers used and juniors were told to avoid. After implementing it, I find it hard to be intimidated by it.

pit rebase <upstream> does this:

Find the common ancestor of the current branch and the upstream
Collect every commit on the current branch that comes after that ancestor
Check out the upstream tip
Replay each collected commit, one by one, onto the new base

Each replay is essentially a cherry-pick: compute the diff that commit introduced, apply it to the current state, and create a new commit with the same message and author but a different parent and a new hash.

That last part is the key thing to understand about rebase: it rewrites history. The commits look the same but they’re new objects with new hashes. This is why rebasing shared branches causes problems - you’re replacing commits other people already have.
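
A toy model makes the new-hash behaviour concrete. This is a sketch under heavy simplifications, not Pit's code: history is linear, each commit carries its change as an opaque "patch" tuple instead of a real diff, conflicts are ignored, and the commit id is a SHA-1 of the commit's fields. Commit, store, commits_since, and rebase are all hypothetical names.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Commit:
    parent: Optional[str]
    author: str
    message: str
    patch: tuple          # toy stand-in for "the diff this commit introduced"


def store(db: dict, commit: Commit) -> str:
    """Save a commit under the hash of its fields, parent included."""
    payload = repr((commit.parent, commit.author, commit.message, commit.patch))
    cid = hashlib.sha1(payload.encode("utf-8")).hexdigest()
    db[cid] = commit
    return cid


def commits_since(db: dict, tip: str, base: str) -> list:
    """Commit ids after base up to tip, oldest first (linear history only)."""
    chain = []
    while tip != base:
        if tip is None:
            raise ValueError("base is not an ancestor of tip")
        chain.append(tip)
        tip = db[tip].parent
    return chain[::-1]


def rebase(db: dict, branch_tip: str, base: str, upstream_tip: str) -> str:
    """Replay each commit onto upstream_tip; return the new branch tip."""
    new_tip = upstream_tip
    for cid in commits_since(db, branch_tip, base):
        old = db[cid]
        # Same author, message, and change, but a different parent.
        new_tip = store(db, Commit(new_tip, old.author, old.message, old.patch))
    return new_tip


db = {}
root = store(db, Commit(None, "ann", "init", (("a.txt", "1"),)))
feature = store(db, Commit(root, "ann", "add feature", (("f.txt", "x"),)))
main = store(db, Commit(root, "bob", "fix bug", (("a.txt", "2"),)))

new_feature = rebase(db, feature, root, main)

print(db[new_feature].message)       # add feature
print(db[new_feature].parent == main)  # True
print(new_feature != feature)        # True: same change, new object
print(feature in db)                 # True: the old commit still exists
```

Because the parent is part of what gets hashed, changing the parent is enough to produce a different commit, even when nothing else about it changed.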

If a conflict occurs mid-replay, Pit pauses and lets you resolve it. You stage the resolved files, then run pit rebase --continue to proceed to the next commit. Pit saves a reference to the original HEAD before the rebase starts, so pit rebase --abort can restore the branch to that state.

Stash Is a Commit You Don’t Put on a Branch

Stash felt like it should be complicated. It isn’t.

When you run pit stash, it does two things: saves the current index state as a commit object and saves the working directory changes as another. It stores references to both in .pit/logs/stash as a stack - LIFO order, so pit stash pop restores the most recent one.

The working directory is then reset to match the last real commit. When you pop the stash, Pit reads those saved commit objects and re-applies the changes.

The thing that makes this elegant is that it reuses the same object storage as everything else. Blobs, trees, commits - stash entries are just commits that happen to live in a different reference log rather than on a branch.
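
The stack itself can be as simple as an append-only text file. The sketch below assumes one line per stash entry holding the two commit hashes separated by a space; that line format, and the names stash_push and stash_pop, are illustrative assumptions rather than Pit's actual layout. A temporary directory stands in for .pit/logs.

```python
import tempfile
from pathlib import Path


def stash_push(log: Path, index_commit: str, worktree_commit: str) -> None:
    """Append one stash entry (two commit hashes) to the end of the log."""
    with log.open("a", encoding="utf-8") as f:
        f.write(f"{index_commit} {worktree_commit}\n")


def stash_pop(log: Path) -> tuple:
    """Remove and return the most recent entry (last line of the log)."""
    lines = log.read_text(encoding="utf-8").splitlines() if log.exists() else []
    if not lines:
        raise IndexError("no stash entries")
    *rest, last = lines
    log.write_text("".join(line + "\n" for line in rest), encoding="utf-8")
    index_commit, worktree_commit = last.split()
    return index_commit, worktree_commit


with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "stash"
    stash_push(log, "idx1", "wt1")
    stash_push(log, "idx2", "wt2")
    first = stash_pop(log)
    second = stash_pop(log)

print(first)    # ('idx2', 'wt2'): most recent entry comes back first
print(second)   # ('idx1', 'wt1')
```

Only references live in the log; the commits they point to sit in the ordinary object store alongside everything else.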

What Pit Can’t Do

Two things are missing, and they’re not small.

No networking. There’s no clone, fetch, push, or pull. Git’s network layer is a separate protocol on top of the object model - it’s not trivial to add, and it wasn’t the point of this project. Pit is strictly local.

Basic conflict handling only. When merge or rebase produces a conflict, Pit writes the markers and stops. Real Git has rename detection, binary file handling, theirs/ours strategies, and more. Pit’s conflict detection only covers content conflicts in text files.

These aren’t bugs. They’re scope decisions. The goal was to understand the core model - object storage, history as a DAG, branch pointers, index management - not to replicate Git entirely.

What Actually Changed in How I Use Git

A few things shifted after building this.

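I stopped being confused by git reset. Its three main modes (--soft, --mixed, --hard) differ in how far the reset reaches: --soft moves only HEAD (really, the branch it points to), --mixed also resets the index, and --hard also resets the working directory. Once you see HEAD, the index, and the working directory as three separate things, the modes are obvious.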
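
A toy model of the three modes, under simplifying assumptions: the repository is a dict, each commit is just a snapshot mapping paths to content, and HEAD stores a commit id directly rather than a branch name. The reset function and the repo layout are hypothetical, for illustration only.

```python
def reset(repo: dict, target: str, mode: str = "mixed") -> None:
    """Move HEAD to target; optionally reset the index and working directory."""
    if mode not in ("soft", "mixed", "hard"):
        raise ValueError(f"unknown mode: {mode}")
    snapshot = repo["commits"][target]
    repo["head"] = target                    # all three modes move HEAD
    if mode in ("mixed", "hard"):
        repo["index"] = dict(snapshot)       # mixed and hard reset the index
    if mode == "hard":
        repo["worktree"] = dict(snapshot)    # only hard touches files on disk


def make_repo() -> dict:
    return {
        "commits": {"c1": {"a.txt": "v1"}, "c2": {"a.txt": "v2"}},
        "head": "c2",
        "index": {"a.txt": "v3"},       # a staged edit
        "worktree": {"a.txt": "v4"},    # a further unstaged edit
    }


soft, mixed, hard = make_repo(), make_repo(), make_repo()
reset(soft, "c1", "soft")
reset(mixed, "c1", "mixed")
reset(hard, "c1", "hard")

for name, repo in (("soft", soft), ("mixed", mixed), ("hard", hard)):
    print(name, repo["head"], repo["index"]["a.txt"], repo["worktree"]["a.txt"])
# soft c1 v3 v4
# mixed c1 v1 v4
# hard c1 v1 v1
```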

I stopped being confused by detached HEAD. It just means HEAD points directly at a commit hash instead of at a branch name. No branch is tracking your position, so new commits won’t be reachable by any branch reference once you move away. That’s all it is.
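
In Git, the difference is visible in the HEAD file itself: it either contains a line starting with "ref: " that names a branch, or a bare commit hash. A small sketch of telling the two apart follows; read_head is a hypothetical helper, and whether Pit stores HEAD in exactly this format is an assumption.

```python
def read_head(text: str) -> tuple:
    """Classify the contents of a HEAD file as attached or detached."""
    text = text.strip()
    if text.startswith("ref: "):
        # Symbolic reference: HEAD follows a branch.
        return ("branch", text[len("ref: "):])
    # Anything else is treated as a raw commit hash.
    return ("detached", text)


print(read_head("ref: refs/heads/main\n"))
# ('branch', 'refs/heads/main')
print(read_head("3b18e512dba79e4c8300dd08aeb37f8e728b8dad\n"))
# ('detached', '3b18e512dba79e4c8300dd08aeb37f8e728b8dad')
```

When HEAD is attached, a new commit updates the branch file it names; when detached, only HEAD itself moves, so nothing else remembers the new commit.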

I started thinking about commits as immutable objects rather than editable history. When you amend a commit or rebase, you’re not editing anything - you’re creating new objects and moving branch pointers to them. The old objects still exist until garbage collection removes them. This is why git reflog can rescue you from almost anything.

Code and Contributing

Pit is on GitHub at BIJJUDAMA/pit. It’s Python only, no external dependencies for the core. The test suite runs on Ubuntu, Windows, and macOS across Python 3.10, 3.11, and 3.12 via GitHub Actions.

If you want to understand Git from the inside, I’d recommend building something like this over reading any documentation. The documentation tells you what commands do. Building it tells you why they work the way they do.