Literate Git

Explaining software history

The history of a software project captured by a version control system is extensive. Commit logs are good at showing individual changes, but it is difficult to condense them into a coherent summary.

This project explores ways to create a high-level summary of the various modules in a software project alongside a step-by-step tutorial explaining the implementation choices made at each step of their creation.

An example tutorial created using this approach is located here.

Motivation

Computer programs tend to stick around longer than their authors. A programmer new to a project must spend some time inspecting it to gain an understanding of its structure and operation. Documentation usually explains how to use the software, rather than how it was created. Version control systems can show a detailed history of changes, but they don't often provide summary views.

Adding two specific types of documentation would help the onboarding process greatly. A high-level summary of the various modules within a program written in everyday language can help a newcomer understand how the various parts of a project work together. A tutorial-style guide breaking down the implementation of each of these features can provide insight into the decisions made during the project's creation.

Direct inspiration

Ben North's git-dendrify project transforms a linear git repository into a tree with separate sections for each feature. His related literate-git project transforms a hierarchical feature-based tree into an HTML document that can be navigated like a step-by-step tutorial. Major steps can function as a high-level overview of individual features, and minor steps can provide detailed information about their implementation.

The literate-git tool is flexible and easy to use. The behavior of literate-git can be modified to handle a variety of hierarchically-structured trees, and its output format can be tweaked depending on the type of project being annotated. For example, the Result links that go to HTML pages in Ben's tamagotchi app tutorial could instead link to the Rust Playground in a Rust tutorial.

More interesting is the issue of building the hierarchical tree literate-git consumes. git-dendrify can build one, but it requires significant manipulation of git history, even when working on a new project. Commits to be dendrified must be reordered, split, and combined. Errors made in the production of the tutorial must be fixed by rewriting history in order to preserve the correct structure.

Modified approach

In order to overcome some of the tediousness involved in creating a linear structure for use with git-dendrify, I have developed a three-headed approach.

To validate this approach on a real project, I applied it to the git repository for Rust's semver-parser crate. The resulting repository is located on GitLab and the HTML document produced by running literate-git on this repository is located here.

First head - source branch

The first of the three heads is the main line of development. For simple projects, this might just be the master branch. In a large project, perhaps this is a filtered branch containing only commits touching a specific portion of the codebase. Let's refer to this as the source branch. For the semver-parser demonstration, the source branch is master.

We don't want to modify the source branch because it is where active work is happening. It would be improper to rewrite commits and possibly troublesome to add new ones just for documentation purposes.

Second head - preparation branch

The second head is where we will have the freedom to split, reorder, amend, and merge commits to suit our needs. In the semver-parser demo, this branch is called preparation.

The goal of the preparation branch is to end up with a linear history that can be parsed by a tool like git-dendrify. With the final presentation in mind, individual commits in the preparation branch are tagged with a feature name and a series number. Features can be chosen arbitrarily; they will represent the major steps in the final tutorial. Numbered sub-tasks to create each feature will be minor steps. The restructuring tool can use the tag names to build the final hierarchical structure.

The preparation branch in the demo project ended up with about twice as many commits as the source branch. Many commits were split in order to compartmentalize changes. Patch hunks relating to the version module were tagged with version and an ordering number, but portions of them that modified test code were split off and tagged test instead. Commits that modified code in both the version and range modules were also split.

Creation

To create the preparation branch, I started with an orphan commit, then did a large cherry-pick of every commit from the initial commit to the tip of the master branch.

After getting all the commits in place on the preparation branch, I did an interactive rebase from the initial commit to the tip, stopping at each commit to edit. After manual inspection of the diff, I split commits that touched multiple features. Once split, each commit was tagged with the appropriate feature and a step number within the feature.

An example tag from the demo project is version.3.3. The first portion is the feature name, the number following that is the step number, and the third value is a patch number. The commit tagged version.3.3 is a minor change or bugfix for the code originally introduced in version.3.1. In the final HTML output a summary for step 3 of the version feature will show only the final diff from step 2 to step 3, rolling up all the patch-level fixes. This behavior is discussed in more detail in the Versioning a tutorial section below.
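The naming scheme described above can be sketched in a few lines of Python. This is only an illustration of the feature.step.patch convention, not code from the demo project; the names StepTag and parse_step_tag are hypothetical.

```python
from typing import NamedTuple


class StepTag(NamedTuple):
    """One tag on the preparation branch (hypothetical helper type)."""
    feature: str
    step: int
    patch: int


def parse_step_tag(name: str) -> StepTag:
    """Split a tag such as 'version.3.3' into feature, step, and patch."""
    # Split from the right so a feature name containing dots still parses.
    feature, step, patch = name.rsplit(".", 2)
    return StepTag(feature, int(step), int(patch))


print(parse_step_tag("version.3.3"))
# prints: StepTag(feature='version', step=3, patch=3)
```

Tags that do not follow the three-part pattern raise a ValueError here, so a real tool would need to filter them out or handle them separately.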

Even commits that only dealt with one feature were sometimes split. If a single commit included a bugfix for some code introduced in step 3 and a bugfix for code introduced in step 4, it was split into two commits that could be individually tagged.

Some commits were modified in more significant ways. One file in the semver-parser project changed names halfway through the commit history. Since I knew this happened later on, I decided to modify the commit that introduced the affected code to use the final file name instead. Because git tracks content and detects renames when applying changes, later patches to this file applied fine, even though they were written against the old file name.

Tooling

One useful property of the preparation branch is that it should be stable once created. The work only needs to be done once. Any future commits to the source branch can be cherry-picked to the preparation branch, and the existing work on the preparation branch should not need to be modified. However, the initial creation of the preparation branch may be very time-consuming. For a small project with approximately 35 commits, it took over three hours to go through this process.

For this approach to catch on, tooling to assist in the creation of the preparation branch would be a necessity. I could have gone through commits much faster if each patch hunk were displayed with a blame-like log and an input control where I could identify which feature the hunk belonged to and whether it was a patch to an existing step or a new step. So far, I have made no effort to create such a tool, but if I wanted to apply this approach to another repository, creating something to assist with this process would be my first step.

Some parts of the process also had a significant number of merge conflicts. Tooling might be able to help speed up resolution.

Third head - final branch

Once the preparation branch is created, it should be possible to turn it into a hierarchical structure with minimal human intervention. The git-dendrify tool could be modified to take care of this step. Instead of acting on hidden tags in commit messages, it would need to look for our feature-named tags and put them together in the order specified by their step and patch numbers. Because the commits in the preparation branch have been split into small pieces, merge conflicts in the creation of the final branch should be straightforward to handle.
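As a rough sketch of the ordering such a tool would need, the following Python groups feature.step.patch tag names into features, then steps, then patches in order. It does not touch git and is not part of git-dendrify; the function name plan_final_branch and the sample tag list are hypothetical, and the order of the features themselves is simply assumed to be the order in which they first appear.

```python
from collections import defaultdict


def plan_final_branch(tag_names):
    """Group 'feature.step.patch' tags into feature -> step -> ordered tags."""
    plan = defaultdict(lambda: defaultdict(list))
    for name in tag_names:
        feature, step, patch = name.rsplit(".", 2)
        plan[feature][int(step)].append((int(patch), name))
    # Sort steps numerically, and patches numerically within each step.
    return {
        feature: {
            step: [name for _, name in sorted(patches)]
            for step, patches in sorted(steps.items())
        }
        for feature, steps in plan.items()
    }


# Hypothetical tags, listed in the order they might appear on the linear branch.
tags = ["version.2.1", "range.1.1", "version.1.2", "version.1.1"]
for feature, steps in plan_final_branch(tags).items():
    for step, names in steps.items():
        print(feature, step, names)
```

Each inner list is the sequence of commits to cherry-pick for one minor step, and each feature would become one major step merged into the final branch.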

Instead of modifying the git-dendrify tool, I created the final branch in the semver-parser demo manually. I once again started with an orphan commit. Building on top of this new orphan, I cherry-picked all the misc.x commits into the misc.final branch, then merged that into the final branch. A similar process was used for each step of the version and range features.

Writing the documentation

Although creating the preparation and final branches was a significant undertaking, no actual documentation was written in the process. The text displayed in the web demo comes from the tags ending with summary on the final branch. After a minor modification to source the text from these annotated tags rather than commit messages, literate-git was able to generate the demo pages from the final branch.

For this demo, I just wrote all these tags after creating the final code structure. With tooling to create the preparation and final branches, these summary tags could be propagated from commit messages in the source branch. Some massaging to the commit messages could be applied during the creation of the preparation branch, and they could be rolled up into the summary tags during the creation of the final branch. By using annotated tags on the preparation or final branch rather than commit messages, updates to the tutorial text can be made without altering all the following commit hashes.

My tutorial text is probably not very good, as I am not very familiar with the Rust language. The goal was to explain what was in the programmer's head while documenting the individual steps, and give a broad overview of a feature in the feature-level summaries. Since I was not the original author of this code, I have little confidence in the accuracy of the comments on the individual steps within each feature. I passed my tutorial along to the original author and will update it with any feedback I receive.

Versioning a tutorial

My favorite result from this experimentation is the way bugs present in the initial implementation of a piece of code can be made invisible to the tutorial reader if they are fixed in a patch within the same step number.

Over half the commits in the preparation branch have a patch number of 2 or higher. Much of the noise of minor bugfixes can be rolled up into one final clean step.
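The roll-up can be illustrated with a small Python sketch that, for one feature, finds the highest-numbered patch of each step and pairs it with the final patch of the previous step. Those pairs are the endpoints of the one clean diff a reader would see per step. The function name step_diff_ranges and the sample tags are hypothetical, and the sketch assumes the actual diff is taken on the restructured branch, where the commits of a step sit together.

```python
def step_diff_ranges(tag_names, feature):
    """Return (step, previous_final_tag, final_tag) for each step of a feature."""
    finals = {}  # step number -> (highest patch number seen, tag name)
    for name in tag_names:
        tag_feature, step, patch = name.rsplit(".", 2)
        if tag_feature != feature:
            continue
        key = int(step)
        candidate = (int(patch), name)
        if key not in finals or candidate > finals[key]:
            finals[key] = candidate
    ranges = []
    previous = None  # None: the first step is diffed against whatever precedes it
    for step in sorted(finals):
        final_tag = finals[step][1]
        ranges.append((step, previous, final_tag))
        previous = final_tag
    return ranges


# Hypothetical tags: step 2 of "version" needed two follow-up patches.
tags = ["version.1.1", "version.1.2", "range.1.1",
        "version.2.1", "version.2.2", "version.2.3"]
print(step_diff_ranges(tags, "version"))
# prints: [(1, None, 'version.1.2'), (2, 'version.1.2', 'version.2.3')]
```

The intermediate patches version.2.1 and version.2.2 never appear as endpoints, which is why their bugs stay invisible to the tutorial reader.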

The README file of the git-dendrify project led me to a collection of thoughts from James Tauber about versioning a literate programming presentation. Quoting James:

The problem is there's really two "version dimensions" in this context: the versioning of the code being developed as part of the tutorial and the versioning of the tutorial itself.

You need to be able to go back and fix a problem in tutorial step 2 and have it cascade through all subsequent steps. So in one sense, you're rewriting history, rebasing, etc. But there's still a separate history your tutorial writing is following.

By rolling up patches into a final presentation of the completed step, we can at least solve half of this "version dimension" dilemma. Errors in the code creation are hidden from the final display.

What about the progression of the tutorial descriptions? Can they be versioned as well?

One interesting property of annotated tags is that if a tag is re-applied to itself with the --force option, git will keep both the new and the previous version of the tag. The new tag points to the old tag with the same name, which points to the commit. If the documentation is written in annotated tags on the preparation branch, then their history is at least preserved. Only the latest versions would be propagated each time the preparation branch is re-converted to the final branch. The history of the tags could be extracted, but I have not experimented with git to figure out a command to do this automatically.
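One possible way to extract that history is to follow the chain of tag objects by hand. The Python sketch below is untested against the demo repository: it assumes each version of the text is an annotated tag whose object header points at the previous tag object, as printed by git cat-file -p. The names tag_history and git_read_object, and the tag name version.3.summary, are hypothetical.

```python
import subprocess


def git_read_object(ref):
    """Return the text `git cat-file -p <ref>` prints (needs a git repository)."""
    result = subprocess.run(["git", "cat-file", "-p", ref],
                            capture_output=True, text=True, check=True)
    return result.stdout


def tag_history(name, read_object=git_read_object):
    """Collect the messages of a chain of nested annotated tags, newest first."""
    messages = []
    ref = name
    while True:
        header, _, message = read_object(ref).partition("\n\n")
        fields = {}
        for line in header.splitlines():
            key, _, value = line.partition(" ")
            fields[key] = value
        if "tag" not in fields:  # not a tag object (e.g. a lightweight tag)
            break
        messages.append(message.strip())
        if fields.get("type") != "tag":  # this tag points at the commit itself
            break
        ref = fields["object"]  # otherwise step down to the older tag object
    return messages


# In-memory stand-in for a repository, so the sketch runs without git.
fake_objects = {
    "version.3.summary": "object aaa\ntype tag\ntag version.3.summary\n"
                         "tagger A <a@example.com> 0 +0000\n\nsecond draft\n",
    "aaa": "object bbb\ntype commit\ntag version.3.summary\n"
           "tagger A <a@example.com> 0 +0000\n\nfirst draft\n",
}
print(tag_history("version.3.summary", fake_objects.__getitem__))
# prints: ['second draft', 'first draft']
```

Called with only a tag name inside a real repository, the function falls back to git_read_object and shells out to git for each object in the chain.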

What's next?

I would love to gather feedback and have a discussion about the method presented here. The literate-git tool is wonderful, but hard to use without better tooling to create a hierarchical tree from an existing project.

The next step is probably to create tools to speed up the process of making that hierarchical tree, but it is hard to justify the time expenditure without having some validation of this approach.

I couldn't think of a better name for this page than Literate Git, but I don't want it to be confused with the literate-git project. Any renaming ideas are appreciated.

If you have something to share, you can email me at:

scott@sabbey.net

I would love to know about any groups where it would be appropriate to share my explorations.

Acknowledgements

None of this would have been possible without the literate-git and git-dendrify tools created by Ben North. The resources he listed in his projects have also been very valuable to help frame my wandering thoughts.

Appendix: Existing documentation approaches

Many languages provide a special type of code comment that can be automatically formatted into documentation pages. These comments often explain in great detail how a module is used, but don't often include any information about implementation. The consumers of this style of documentation are not interested in such details.

Donald Knuth's Literate Programming merges the coding and documenting steps into one. Authors can explain their choices at every step. Although the approach is over thirty years old, it has never really caught on. It would also be difficult to transition an existing project to the literate programming style.

A major benefit of using a version control system is having a record of all changes to a project. Commit logs often fail to explain why certain implementation decisions were made. It can be hard to determine if a questionable choice was made out of ignorance or if it was actually a carefully selected solution. Programmers are encouraged to leave comments in their code or write detailed commit messages to explain their choices, but the rationale for a change is often not connected to the permanent history.

Some substantial projects, such as GNU libavl, have been crafted with special attention given to understandability. These projects are few and far between, however. Why aren't there more? One possible explanation is that many programmers are just not great writers. It is hard for a writer to step in after a programmer has finished and explain what was in the programmer's mind.