I Built Git From Scratch in Python. Here's What I Learned
I’ve used Git for a good chunk of time. I knew the commands. I knew roughly what “staging” meant and had a vague mental model of branches as pointers. But I didn’t really understand Git - not until I tried to build it.
Pit is a version control system I wrote from scratch in Python. It supports init, add, commit, branch, merge, rebase, stash, diff, and most of the other commands you use daily. This post isn’t a tutorial on how to use it. It’s about what building it taught me about how Git works - and why some parts are cleverer than they first appear.
The Three Primitives
Everything in Git - and by extension Pit - is built on three ideas.
Blobs are the simplest. A blob is just file content, hashed with SHA-1 and stored by that hash. No filename. No path. Just content. If two files in your project have identical content, they share one blob. This is content-addressable storage: the key is derived from the value itself.
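To make content-addressing concrete, here is a minimal sketch of blob storage. It uses the header real Git prepends before hashing ("blob", the size in bytes, and a NUL byte). Whether Pit uses the same header is an assumption, and ObjectStore is a hypothetical in-memory stand-in for the on-disk object directory.

```python
import hashlib


def hash_blob(content: bytes) -> str:
    """Return the ID of a blob: SHA-1 over a Git-style header plus the content."""
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


class ObjectStore:
    """Hypothetical in-memory object store keyed by content hash."""

    def __init__(self):
        self.objects = {}

    def put_blob(self, content: bytes) -> str:
        oid = hash_blob(content)
        # Identical content always maps to the same key, so it is stored once.
        self.objects[oid] = content
        return oid


store = ObjectStore()
first = store.put_blob(b"same bytes\n")   # e.g. the content of a.txt
second = store.put_blob(b"same bytes\n")  # e.g. the content of copy/of/a.txt
```

Both calls return the same ID and the store ends up holding a single object, because the filename never enters the hash. With this header, empty content hashes to e69de29bb2d1d6434b8b29ae775ad8c2e48c5391, the empty-blob ID that shows up in real Git repositories.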
Trees represent directories. A tree object maps names to either blobs (files) or other trees (subdirectories). This is a Merkle tree - each node’s hash is computed from its children’s hashes. If anything changes anywhere in the directory structure, the root tree hash changes. That property is what makes Git’s integrity guarantees work.
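The Merkle property is easy to see in a small sketch. The text serialization below is a simplification invented for illustration (real Git uses a binary tree format), and hash_tree is a hypothetical helper, not Pit's API.

```python
import hashlib


def hash_object(kind: str, body: bytes) -> str:
    """Hash a body with a Git-style "<kind> <size>\\0" header."""
    header = f"{kind} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def hash_tree(directory: dict) -> str:
    """Hash a directory given as {name: bytes (file) or dict (subdirectory)}.

    Each entry line contains the child's hash, so a change anywhere below
    changes every hash on the path up to the root.
    """
    lines = []
    for name in sorted(directory):
        node = directory[name]
        if isinstance(node, dict):
            lines.append(f"tree {hash_tree(node)} {name}")
        else:
            lines.append(f"blob {hash_object('blob', node)} {name}")
    return hash_object("tree", "\n".join(lines).encode())


before = {"README": b"hi\n", "src": {"main.py": b"print(1)\n"}}
after = {"README": b"hi\n", "src": {"main.py": b"print(2)\n"}}
root_changed = hash_tree(before) != hash_tree(after)
```

Editing one nested file changes the root hash, while rebuilding the same directory produces the same hash. Comparing two root hashes is therefore enough to tell whether two snapshots are identical.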
Commits wrap a tree with metadata: author, timestamp, message, and a pointer to the parent commit. That parent pointer is what makes the history a directed acyclic graph rather than just a flat list. The first commit in a repository has no parent, and a merge commit has two.
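A commit can be sketched as a small immutable record whose ID is the hash of its serialized fields. The field layout loosely follows Git's commit format, but the exact serialization, the Commit class, and the placeholder tree IDs are assumptions for illustration.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    tree: str        # hash of the root tree (placeholder strings below)
    parents: tuple   # zero, one, or two parent commit IDs
    author: str
    timestamp: int
    message: str

    def serialize(self) -> bytes:
        lines = [f"tree {self.tree}"]
        lines += [f"parent {p}" for p in self.parents]
        lines.append(f"author {self.author} {self.timestamp}")
        lines.append("")
        lines.append(self.message)
        return "\n".join(lines).encode()

    def oid(self) -> str:
        body = self.serialize()
        header = f"commit {len(body)}".encode() + b"\0"
        return hashlib.sha1(header + body).hexdigest()


root = Commit("tree-1", (), "Ada", 1, "initial")
second = Commit("tree-2", (root.oid(),), "Ada", 2, "add feature")
merge = Commit("tree-3", (second.oid(), root.oid()), "Ada", 3, "merge")
```

Because the parent IDs are part of the hashed body, a commit's ID depends on its entire ancestry. Changing any earlier commit would change every ID after it.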
Once I implemented these three types in utils/objects.py, the rest of Pit
was mostly just connecting them correctly.
The Index Is the Part Nobody Talks About
Everyone learns that Git has a staging area. What I didn’t understand until
I built it is that the staging area is a persistent data structure - a
binary file at .git/index that tracks the current state of every file
you’ve staged.
In Pit, the index maps file paths to blob hashes. When you run pit add,
it hashes the file’s content, writes the blob to the object store, and
updates the index entry for that path. When you run pit commit, it reads
the index and constructs the tree from it. When you run pit status, it
compares three things:
- What’s in the working directory
- What’s in the index
- What’s in the last commit’s tree
That three-way comparison is what produces “staged”, “unstaged”, and “untracked” - three separate states for each file. I had mentally collapsed staging and committing into one thing before building this. They’re not the same thing at all.
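The comparison can be sketched with three dictionaries that each map a path to a blob hash. This is a simplified model under stated assumptions: the status function and its return shape are hypothetical, and the hash strings are placeholders.

```python
def status(working: dict, index: dict, head_tree: dict) -> dict:
    """Classify paths by comparing working directory, index, and HEAD tree."""
    # Staged: the index differs from the last commit (added, modified, deleted).
    staged = sorted(
        path for path in index.keys() | head_tree.keys()
        if index.get(path) != head_tree.get(path)
    )
    # Unstaged: the working copy differs from what the index recorded.
    unstaged = sorted(
        path for path in index if working.get(path) != index[path]
    )
    # Untracked: present on disk but unknown to the index.
    untracked = sorted(path for path in working if path not in index)
    return {"staged": staged, "unstaged": unstaged, "untracked": untracked}


head_tree = {"a.txt": "h1", "b.txt": "h2", "c.txt": "h3-old"}
index = {"a.txt": "h1", "b.txt": "h2", "c.txt": "h3"}
working = {"a.txt": "h1", "b.txt": "h2-edited", "c.txt": "h3", "new.txt": "h4"}
result = status(working, index, head_tree)
```

Here c.txt is staged (index differs from HEAD), b.txt is unstaged (working copy differs from index), and new.txt is untracked. A file can appear as both staged and unstaged if it was edited again after being added.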
How Merge Actually Works
This was the part I was most curious about, and the most satisfying to implement.
pit merge implements a three-way merge. The name describes the three
versions it compares:
- The common ancestor of the two branches
- The current HEAD
- The branch being merged in
Finding the common ancestor is the first problem. Pit uses BFS on the commit DAG - it walks back through both branches’ parent pointers simultaneously until it finds the first commit they share. That commit is the merge base.
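Here is a minimal sketch of that search, assuming the history is available as a dictionary from commit ID to parent IDs (merge_base and the single-letter commit names are hypothetical). It alternates between the two sides and stops at the first commit both have reached. In histories with criss-cross merges this can return a shared ancestor that is not the best one, a case this sketch does not handle.

```python
from collections import deque


def merge_base(parents: dict, a: str, b: str):
    """Breadth-first search from both tips; return the first shared commit."""
    seen_a, seen_b = {a}, {b}
    # Each queue entry remembers which side it belongs to.
    queue = deque([(a, seen_a, seen_b), (b, seen_b, seen_a)])
    while queue:
        commit, mine, theirs = queue.popleft()
        if commit in theirs:
            return commit
        for parent in parents.get(commit, []):
            if parent not in mine:
                mine.add(parent)
                queue.append((parent, mine, theirs))
    return None  # the two histories share nothing


# A <- B <- C (main), and B <- D <- E (feature)
history = {"A": [], "B": ["A"], "C": ["B"], "D": ["B"], "E": ["D"]}
base = merge_base(history, "C", "E")
```

For this history the search returns B, the commit where the two branches diverged. If one tip is an ancestor of the other, the search returns that tip, which is the fast-forward case.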
With the merge base identified, the algorithm compares each file across all three versions:
- If only one branch changed a file, that change wins automatically.
- If both branches changed the same file in the same way, no conflict.
- If both branches changed the same file differently, that’s a conflict.
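The three rules above translate almost directly into code. This sketch works at whole-file granularity, comparing blob hashes, with None standing for "file absent in this version". The function names are hypothetical and the hash strings are placeholders.

```python
def merge_file(base, ours, theirs):
    """Return (resulting blob hash or None, is_conflict) for one path."""
    if ours == theirs:
        return ours, False    # both sides agree (or neither changed it)
    if ours == base:
        return theirs, False  # only their side changed it
    if theirs == base:
        return ours, False    # only our side changed it
    return None, True         # both changed it, differently


def merge_trees(base: dict, ours: dict, theirs: dict):
    """Merge three {path: blob hash} snapshots into (merged, conflicts)."""
    merged, conflicts = {}, []
    for path in sorted(base.keys() | ours.keys() | theirs.keys()):
        result, conflict = merge_file(
            base.get(path), ours.get(path), theirs.get(path)
        )
        if conflict:
            conflicts.append(path)
        elif result is not None:  # None here means the file was deleted
            merged[path] = result
    return merged, conflicts


base = {"a": "1", "b": "1", "c": "1", "d": "1"}
ours = {"a": "2", "b": "1", "c": "2", "d": "2"}
theirs = {"a": "1", "b": "3", "c": "2", "d": "9", "e": "5"}
merged, conflicts = merge_trees(base, ours, theirs)
```

In this example a, b, c, and the newly added e merge cleanly, and only d is reported as a conflict. Comparing hashes rather than contents keeps each decision to a few equality checks.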
When a conflict occurs, Pit writes the familiar markers into the file:
<<<<<<< HEAD
Content from the current branch
=======
Content from the branch being merged
>>>>>>> feature-branch
Then it stops and waits. pit mergetool opens an external tool to resolve
it. Once resolved, you stage the file and commit.
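Producing that marker block takes only a few lines. The helpers below are a hypothetical sketch. The check for leftover markers before staging is an assumption about how a tool might guard against committing an unresolved file, not a description of Pit's behavior.

```python
def _with_newline(text: str) -> str:
    return text if text.endswith("\n") else text + "\n"


def render_conflict(ours: str, theirs: str, branch: str) -> str:
    """Wrap both versions of a conflicting region in conflict markers."""
    return (
        "<<<<<<< HEAD\n"
        + _with_newline(ours)
        + "=======\n"
        + _with_newline(theirs)
        + f">>>>>>> {branch}\n"
    )


def has_conflict_markers(text: str) -> bool:
    """Return True if any line still starts with an opening or closing marker."""
    return any(
        line.startswith(("<<<<<<< ", ">>>>>>> ")) for line in text.splitlines()
    )


conflicted = render_conflict(
    "Content from the current branch",
    "Content from the branch being merged",
    "feature-branch",
)
```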
What I found interesting is how little “magic” is involved. Merge is just structured file comparison with a clear rule for what counts as a conflict. The hard part is finding the right ancestor - the rest follows from that.
Rebase Is Just Cherry-Pick in a Loop
Before building Pit, rebase felt like the advanced, dangerous Git command that senior developers used and juniors were told to avoid. After implementing it, I find it hard to be intimidated by it.
pit rebase <upstream> does this:
- Find the common ancestor of the current branch and the upstream
- Collect every commit on the current branch that comes after that ancestor
- Check out the upstream tip
- Replay each collected commit, one by one, onto the new base
Each replay is essentially a cherry-pick: compute the diff that commit introduced, apply it to the current state, and create a new commit with the same message and author but a different parent and a new hash.
That last part is the key thing to understand about rebase: it rewrites history. The commits look the same but they’re new objects with new hashes. This is why rebasing shared branches causes problems - you’re replacing commits other people already have.
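The whole procedure fits in a short sketch. It assumes a linear, single-parent history, stores commits in a plain dictionary, and overwrites files on replay instead of detecting conflicts. All names are hypothetical, and the JSON-based commit ID is an illustrative stand-in for a real object hash.

```python
import hashlib
import json


def make_commit(store, tree, parent, author, message):
    """Store a commit and return its ID, derived from all of its fields."""
    payload = json.dumps([tree, parent, author, message], sort_keys=True)
    cid = hashlib.sha1(payload.encode()).hexdigest()
    store[cid] = {"tree": tree, "parent": parent,
                  "author": author, "message": message}
    return cid


def ancestors(store, cid):
    """Return the chain of commits from cid back to the root."""
    chain = []
    while cid is not None:
        chain.append(cid)
        cid = store[cid]["parent"]
    return chain


def cherry_pick(store, cid, onto):
    """Apply the change introduced by cid on top of the commit onto."""
    commit = store[cid]
    old, new = store[commit["parent"]]["tree"], commit["tree"]
    tree = dict(store[onto]["tree"])
    for path in old.keys() | new.keys():
        if old.get(path) != new.get(path):  # this commit touched the path
            if path in new:
                tree[path] = new[path]
            else:
                tree.pop(path, None)
    return make_commit(store, tree, onto, commit["author"], commit["message"])


def rebase(store, branch_tip, upstream_tip):
    """Replay commits unique to the branch onto upstream_tip, oldest first."""
    upstream = set(ancestors(store, upstream_tip))
    to_replay = []
    for cid in ancestors(store, branch_tip):
        if cid in upstream:  # reached the common ancestor
            break
        to_replay.append(cid)
    new_tip = upstream_tip
    for cid in reversed(to_replay):
        new_tip = cherry_pick(store, cid, new_tip)
    return new_tip


store = {}
base = make_commit(store, {"a": "1"}, None, "me", "base")
upstream = make_commit(store, {"a": "1", "b": "2"}, base, "me", "upstream work")
f1 = make_commit(store, {"a": "1", "c": "3"}, base, "me", "feature 1")
f2 = make_commit(store, {"a": "1", "c": "4"}, f1, "me", "feature 2")
new_tip = rebase(store, f2, upstream)
```

After the rebase, new_tip has the same message and author as f2 but a different ID, because its parent and tree differ. The original f1 and f2 are still in the store, just no longer reachable from the branch.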
If a conflict occurs mid-replay, Pit pauses and lets you resolve it. You
stage the resolved files, then run pit rebase --continue to proceed to
the next commit. Before the rebase starts, Pit saves a reference to the
original HEAD, and pit rebase --abort uses it to restore that state.
Stash Is a Commit You Don’t Put on a Branch
Stash felt like it should be complicated. It isn’t.
When you run pit stash, it does two things: saves the current index state
as a commit object and saves the working directory changes as another. It
stores references to both in .pit/logs/stash as a stack - LIFO order,
so pit stash pop restores the most recent one.
The working directory is then reset to match the last real commit. When you pop the stash, Pit reads those saved commit objects and re-applies the changes.
The thing that makes this elegant is that it reuses the same object storage as everything else. Blobs, trees, commits - stash entries are just commits that happen to live in a different reference log rather than on a branch.
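The stack behavior can be modeled in a few lines. This sketch keeps snapshots as plain dictionaries instead of commit objects, restores them wholesale rather than re-applying a diff, and assumes the index is reset along with the working directory. The repo dictionary and function names are hypothetical.

```python
def stash_push(repo: dict) -> None:
    """Save the index and working directory, then reset both to HEAD."""
    repo["stash"].append(
        {"index": dict(repo["index"]), "working": dict(repo["working"])}
    )
    repo["index"] = dict(repo["head_tree"])
    repo["working"] = dict(repo["head_tree"])


def stash_pop(repo: dict) -> None:
    """Restore the most recently stashed state (last in, first out)."""
    if not repo["stash"]:
        raise IndexError("no stash entries")
    entry = repo["stash"].pop()
    repo["index"] = entry["index"]
    repo["working"] = entry["working"]


repo = {
    "head_tree": {"a.txt": "v1"},
    "index": {"a.txt": "v2"},                      # a staged edit
    "working": {"a.txt": "v2", "notes.txt": "x"},  # plus an unstaged file
    "stash": [],
}
stash_push(repo)
clean_after_push = repo["working"] == repo["head_tree"]
stash_pop(repo)
```

A real implementation also has to handle the case where files changed between push and pop, which is where re-applying the stashed changes can itself produce conflicts.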
What Pit Can’t Do
Two things are missing, and they’re not small.
No networking. There’s no clone, fetch, push, or pull. Git’s
network layer is a separate protocol on top of the object model - it’s not
trivial to add, and it wasn’t the point of this project. Pit is strictly
local.
Basic conflict handling only. When merge or rebase produces a conflict, Pit writes the markers and stops. Real Git has rename detection, binary file handling, theirs/ours strategies, and more. Pit’s conflict detection only covers content conflicts in text files.
These aren’t bugs. They’re scope decisions. The goal was to understand the core model - object storage, history as a DAG, branch pointers, index management - not to replicate Git entirely.
What Actually Changed in How I Use Git
A few things shifted after building this.
I stopped being confused by git reset. There are three modes - --soft,
--mixed, --hard - because there are three things to reset: HEAD, the
index, and the working directory. --soft moves only HEAD, --mixed also
resets the index, and --hard resets the working directory as well. Once you
understand those as three separate things, the modes are obvious.
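A toy model shows how each mode reaches one layer further than the one before it. The repo dictionary and the reset function are hypothetical, and each commit is reduced to its tree.

```python
def reset(repo: dict, target: str, mode: str = "mixed") -> None:
    """Model git reset: each mode touches one more layer than the last."""
    if mode not in ("soft", "mixed", "hard"):
        raise ValueError(f"unknown mode: {mode}")
    tree = repo["commits"][target]
    repo["head"] = target                # every mode moves HEAD
    if mode in ("mixed", "hard"):
        repo["index"] = dict(tree)       # --mixed and --hard reset the index
    if mode == "hard":
        repo["working"] = dict(tree)     # only --hard touches files on disk


def fresh_repo() -> dict:
    return {
        "commits": {"c1": {"f": "old"}, "c2": {"f": "new"}},
        "head": "c2",
        "index": {"f": "new"},
        "working": {"f": "new"},
    }


soft, mixed, hard = fresh_repo(), fresh_repo(), fresh_repo()
reset(soft, "c1", "soft")    # changes from c2 remain staged
reset(mixed, "c1", "mixed")  # changes from c2 remain on disk, unstaged
reset(hard, "c1", "hard")    # changes from c2 are gone from all three
```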
I stopped being confused by detached HEAD. It just means HEAD points directly at a commit hash instead of at a branch name. No branch is tracking your position, so new commits won’t be reachable by any branch reference once you move away. That’s all it is.
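In real Git, HEAD is a small text file that contains either "ref: refs/heads/<branch>" or a bare commit hash. The sketch below models what happens to a new commit in each case. The helper names and the short placeholder hashes are hypothetical.

```python
def parse_head(content: str):
    """Return ("branch", ref name) or ("detached", commit hash)."""
    content = content.strip()
    if content.startswith("ref: "):
        return "branch", content[len("ref: "):]
    return "detached", content


def record_commit(refs: dict, head: str, new_commit: str) -> str:
    """Advance whatever HEAD points at and return the new HEAD content."""
    kind, value = parse_head(head)
    if kind == "branch":
        refs[value] = new_commit  # the branch moves, HEAD keeps naming it
        return head
    return new_commit             # detached: only HEAD itself moves


refs = {"refs/heads/main": "aaa111"}
attached = record_commit(refs, "ref: refs/heads/main\n", "bbb222")
detached = record_commit(refs, "aaa111\n", "ccc333")
```

In the attached case refs/heads/main now points at the new commit. In the detached case no entry in refs changes, so once HEAD moves elsewhere nothing names the commit ccc333 anymore.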
I started thinking about commits as immutable objects rather than editable
history. When you amend a commit or rebase, you’re not editing anything -
you’re creating new objects and moving branch pointers to them. The old
objects still exist until garbage collection removes them. This is why
git reflog can rescue you from almost anything.
Code and Contributing
Pit is on GitHub at BIJJUDAMA/pit. It’s Python only, no external dependencies for the core. The test suite runs on Ubuntu, Windows, and macOS across Python 3.10, 3.11, and 3.12 via GitHub Actions.
If you want to understand Git from the inside, I’d recommend building something like this over reading any documentation. The documentation tells you what commands do. Building it tells you why they work the way they do.