A reddit visualization

Here I investigate the intersection between the most popular subreddits. By “intersection” I mean those users who have activity (upvote or downvote on a post) in multiple subreddits. You can also think of a subreddit as the collection of its users, and intersections are then people in two or more subreddits at the same time. The data I use contains activity of ~10k users across ~11k subreddits (more details in the last section). A few hints on how to read the diagram (more on this in a previous post):

  • Subreddits are represented on a circle; chords between circle segments represent common users between subreddits (the thicker the chord, the larger the intersection).
  • Intersections can have different “orders”: if we consider, say, the four subreddits A, B, C and D, I call a “4th order intersection” all those elements common to all four sets; a “3rd order intersection” will be all those elements common to A, B and C but with no element in D, and so on. The diagram represents an n-th order intersection via a sort of “polygon”, i.e. a loop of n chords connecting the sets involved.
  • To reduce the clutter of the visualization, the concept of a “threshold” is introduced: all intersections smaller than a threshold T won’t be represented, but “absorbed” by lower order intersections. To be precise, a small intersection of order n is evenly split among all the (n-1)th order intersections it is contained in. This process is iterative, since there is no guarantee that an “upward redistribution” of elements will make the receivers pass the threshold.
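To make the absorption step concrete, here is a small python sketch of it (the representation and names are my own illustration, not the code behind the diagrams): each exclusive intersection is keyed by the frozenset of subreddits it belongs to, and any sub-threshold intersection of order n is split evenly among its (n-1)-subreddit “parents”, repeating until nothing more can be absorbed.

```python
from itertools import combinations

def absorb(intersections, T):
    # intersections: {frozenset of subreddit names: exclusive intersection size}
    result = dict(intersections)
    changed = True
    while changed:  # iterate: the receivers may still be below threshold
        changed = False
        for key in sorted(result, key=len, reverse=True):  # higher orders first
            size = result.get(key, 0)
            if len(key) > 1 and 0 < size < T:
                # split evenly among the (n-1)th order intersections containing it
                parents = [frozenset(c) for c in combinations(key, len(key) - 1)]
                share = size / len(parents)
                for p in parents:
                    result[p] = result.get(p, 0) + share
                del result[key]
                changed = True
    return result
```

Order-1 “intersections” (elements in a single set) are left alone, since there is nothing lower to absorb them into.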

The top 10 subreddits

This plot tells us that, as far as the top 10 subreddits go, everyone is everywhere. The big fat dark blue intersection touches all of them. Then we have three groups of people: light orange, who like every sub but /r/gaming; dark orange, everywhere but /r/AskReddit; then the green group, which is on neither /r/gaming nor /r/AskReddit. The last interesting story is that another group, light blue, is only on the main page (reddit.com) and on /r/pics, with no interest in anything else.
If we now line up those 10 subreddits by size, we see that reddit.com and /r/pics are the most popular.

8407	reddit.com
7790	pics
7462	funny
6966	WTF
6828	politics
6424	science
6378	technology
6101	worldnews
5519	AskReddit
5115	gaming

The diagram shows that the edge they have is partly due to the light-blue group of people, who boost both. On the other end of the scale, /r/gaming and /r/AskReddit are at the bottom; indeed the light orange and dark orange groups avoid one or the other, and the green group avoids both.
Note that these numbers don’t represent the actual subreddit population (which is way larger), but the number of users who participated in the experiment, agreeing to have their activity anonymously published.

Looking down in the ranks

It would be very nice to get a diagram with more than 10 lists, but unfortunately it’s very expensive to compute: the algorithm that calculates the size of the chords is exponential in the number of lists one considers. To see why, remember that the power set of a set of n elements (the collection of all its subsets) has 2^n elements; and these diagrams need you to compute all possible intersections between the lists, which means taking every element of the power set and computing the size of its intersection. So I’ll resort to sampling; I’ll take 10 random subreddits and see what happens. Since the population of subreddits follows a Pareto distribution (see below for pictures), a uniform sampling would favour very small elements; so what I do is sort the subs and put them into buckets, each containing roughly the same number of users. The first buckets will be made of only a few large subs; the last will contain a lot of small ones (I don’t consider subreddits with less than 300 users). Here are the exact bucket sizes I got:

[3, 4, 4, 6, 8, 12, 17, 26, 43]

and the corresponding number of users per bucket:

[23659, 26596, 21296, 21967, 21229, 21011, 21371, 21544, 21296]
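A sketch of the bucketing in python (my reconstruction of the procedure just described; the function and variable names are made up):

```python
def make_buckets(sub_sizes, nbuckets=10, min_size=300):
    # drop tiny subreddits, then sort from largest to smallest
    sizes = sorted((s for s in sub_sizes if s >= min_size), reverse=True)
    target = sum(sizes) / float(nbuckets)  # ideal user mass per bucket
    buckets, current, total = [], [], 0
    for s in sizes:
        current.append(s)
        total += s
        # close the bucket once it holds roughly its share of users
        if total >= target and len(buckets) < nbuckets - 1:
            buckets.append(current)
            current, total = [], 0
    if current:
        buckets.append(current)
    return buckets
```

The first buckets close after a handful of large subs; the last one collects the long tail.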

I’ll now take one random subreddit from each bucket. Here are a couple of shots:

Uhm. What’s the story behind this? A nice discovery would be something like “subreddits X and Y have a strong affinity, because users in X tend to like Y and vice-versa”. But that’s not what’s in the diagram. It’s a mere reflection of the popularity of the four big subreddits /r/pics, /r/worldnews, /r/science and /r/videos. It’s just telling us “the top subreddits captivate attention from all users”, which isn’t news. It’s evident if we focus on the two medium-sized subs in the diagram, /r/blog and /r/fffffffuuuuuuuuuuuu (it’s a sub about comics with funny faces that say crazy things, I had to check): they’re composed of groups that follow all the big subs, and this doesn’t say much about affinities. Another shot to make sure this is not a particular case:

Again, the medium-sized /r/news shares users with all the big subs without any particular skew.
So I removed the top 20 subreddits from my lists and recomputed the buckets. Here are two examples:

Some non-trivial links come to the surface: people in /r/energy unsurprisingly care about /r/environment, a big chunk of the NSFW subreddit /r/gonewild likes the funny comics at /r/fffffffuuuuuuuuuuuu and enjoys /r/humor, and people interested in /r/marijuana also like /r/environment and /r/humor. One more:

Somebody is into /r/blog, /r/adviceanimals and /r/bestof all at the same time; people in /r/happy and /r/libertarian are often also on /r/bestof.

I spent more time than I care to admit generating these random matches and discovering subreddits I didn’t know existed!

Data sanity check

The data I use is from Alex Mizrahi (/u/killerstorm), who posted it to the rrecommender Google group in 2012, in the context of an initiative to build a recommender for subreddits launched in 2010 by David King (/u/ketralnis, a reddit employee at the time).

The dataset shows the activity of 9579 users (who opted-in to contribute their data for research) across 10855 subreddits. “Activity” means merely that a user made an upvote or downvote in a subreddit. Here I’ll have a quantitative look at the dataset, to see how healthy it is.
It’s reasonable to expect the subreddit population to follow a Pareto distribution (see this post by Terence Tao for a definition of “reasonable”). The simplest way to check is to see if the cumulative distribution of subreddit sizes is a Pareto (see “Power laws, Pareto distributions and Zipf’s law” by M. E. J. Newman for details). Here we go:


The log-log plot looks a lot like a straight line, and with some handwaving we can conclude the data follows a Pareto. This diagram, also called a “rank/frequency plot” because of its use in word frequency analysis in texts, is very simple to draw: line up your population sizes sorted from largest to smallest, and plot their rank (position in the list, counted starting from 1) versus their size. The largest population has exactly one element larger than or equal to it (itself), so it gets rank 1; the second has two (itself and the previous one, which is larger since elements are sorted), so it gets rank 2; and so on.
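The plot takes only a few lines; here is a hedged python sketch (my own helper names, not the code used for the figures). The slope check is a crude stand-in for eyeballing the straight line:

```python
import math

def rank_size(sizes):
    # sort descending; the i-th entry (1-based) has exactly i
    # elements larger than or equal to it
    ordered = sorted(sizes, reverse=True)
    return list(enumerate(ordered, start=1))

def loglog_slope(pairs):
    # slope of the line through the first and last point in log-log space
    (r1, s1), (r2, s2) = pairs[0], pairs[-1]
    return (math.log(s2) - math.log(s1)) / (math.log(r2) - math.log(r1))

# with matplotlib, the rank/frequency plot is then just:
#   plt.loglog(*zip(*rank_size(sizes)))
```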

The first dump by David King showed users with abnormal activity; Alex Mizrahi’s sampling was said to mitigate the problem. So here I plot the histograms of user activities; i.e. what’s the distribution of the number of subreddits that users follow.


Most of the users are on <20 subreddits. Somebody has activity on ~250 subs, but the histogram shows those are outliers.
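The histogram itself is a one-liner once the data is keyed by user; a small sketch (the input shape is my assumption about the dataset, not its actual format):

```python
from collections import Counter

def activity_histogram(users_to_subs):
    # users_to_subs: {user: set of subreddits where they have activity}
    # returns: how many users are active in exactly k subreddits, for each k
    return Counter(len(subs) for subs in users_to_subs.values())
```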

Translating this imperative algorithm into a good-looking clojure program

Here I describe how I turned some imperative python code, which relies on updating a dictionary in place to keep track of its state, into a functional-style clojure program. My clojure experience amounts to about two and a half hours, so I’d be glad if someone could point out how to make the result more idiomatic and easier to read.

The following setup might appear toyish and flaky, please bear with me so that I can make my point.

Problem statement

Andy’s guacamole

Andy owns a chain of shops that produce guacamole in jars; each shop is in a different city.


He provisions his shops with avocados every day; to make a jar of the dip, a shop needs exactly 10 avocados. If, on a given day, a shop has fewer than 10 avocados, it can’t make even a single jar and the day is lost. So if two shops have 6 and 4 avocados respectively, that’s quite a pity, since neither can make a jar; much better if Andy can arrange for the first shop to send its fruits to the second, so that at least one of the two can produce and sell something that day.

The road network

Andy cannot just pick two of his shops and decide that the first sends some fruits to the other. The shops are arranged on the following map of roads, and not all of them are directly connected to one another:


Provisions can go from one city to another only if there is an arrow between the two on the map, and only in the direction of the arrow. For example, Andy can decide to send something from Aberdeen to Bath, but not from Birmingham to Aberdeen nor to Bradford. The list of connections between cities is as in the above picture:

from Aberdeen to Bath, Birmingham, Bradford, Bristol

from Bath to Cambridge, Canterbury, Chelmsford
from Birmingham to Cambridge, Carlisle, Chester
from Bradford to Canterbury, Carlisle, Coventry
from Bristol to Chelmsford, Chester, Coventry

from Cambridge to Derby, Durham
from Canterbury to Derby, Dundee
from Carlisle to Derby, Dover
from Chelmsford to Durham, Dundee
from Chester to Durham, Dover
from Coventry to Dundee, Dover

Today’s stocks

Today Andy looks at the reports from the shops and the situation is:

Aberdeen                 1

Bath                     2
Birmingham               3
Bradford                 4
Bristol                  5

Cambridge                6
Canterbury               7
Carlisle                 8
Chelmsford               9
Chester                 10
Coventry                11

Derby                   12
Durham                  13
Dundee                  14
Dover                   15


Andy’s algorithm

Andy’s strategy to rearrange the shops’ stocks of avocados is as follows.
He considers the shops by their distance from Aberdeen (in hops on the graph). He says that the shops in “level N” are those N hops away from the Aberdeen shop, so the list of levels is:

Level 0: Aberdeen

Level 1: Bath, Birmingham, Bradford, Bristol

Level 2: Cambridge, Canterbury, Carlisle, Chelmsford, Chester, Coventry

Level 3: Derby, Durham, Dundee, Dover

First Andy looks at Level 0, which is the Aberdeen shop. It has only one avocado, which is sent to Bath: the first shop on Level 1 if you read the map from left to right.

Then Andy studies the updated Level 1 (where Bath now has three avocados). For each city there, Andy tries to split its avocados evenly between the cities in Level 2 connected to it. When the division is not exact because the number of avocados isn’t a multiple of the number of targets, Andy takes the remaining fruits and sends them to Level 2’s cities from left to right (not all will be served: this is only the remainder of the division). Example: Bristol is connected to Chelmsford, Chester and Coventry, and has 5 avocados. Each of the three cities gets 1 fruit, plus Chelmsford and Chester (the first two reading from left to right) get one more each.

Then Andy moves to Level 2 (the last one where he can redistribute fruits, because on Level 3 no city has outgoing arrows), and does the same. As stated before, if a city has 10 or more avocados, those do not get split. It’s Andy’s rule.


  1. Start from level 0, i.e. the Aberdeen shop (which has one avocado).
  2. Do not move to level N+1 before processing each of the cities at level N.
  3. If a given city has fewer than 10 avocados, take them away and distribute them evenly to all cities reachable from there via the road network.
  4. If the integer division is not exact, split the remainder of the division, giving one avocado to each target city in the same order as they are listed in the above map, from left to right.
  5. If a city has 10 or more avocados, there is nothing to do. Move to the next city (or the next level).

The question

Here the game is not to question Andy’s way of optimizing his avocado stocks. He’s been in the business for a long time and knows his trade. The point is to write a program to compute the end result of Andy’s algorithm.

Imperative implementation (Python)

If I had to implement this algorithm in an imperative style with good ol’ mutable data structures, I’d go for something like the following (in python).

A dictionary for the map of roads:

roads = {"Aberdeen":   ["Bath", "Birmingham", "Bradford", "Bristol"],
         "Bath":       ["Cambridge", "Canterbury", "Chelmsford"],
         "Birmingham": ["Cambridge", "Carlisle", "Chester"],
         "Bradford":   ["Canterbury", "Carlisle", "Coventry"],
         "Bristol":    ["Chelmsford", "Chester", "Coventry"],

         "Cambridge":  ["Derby", "Durham"],
         "Canterbury": ["Derby", "Dundee"],
         "Carlisle":   ["Derby", "Dover"],
         "Chelmsford": ["Durham", "Dundee"],
         "Chester":    ["Durham", "Dover"],
         "Coventry":   ["Dundee", "Dover"]}

A list of list of pairs for the shop stocks kept together by “level”:

level0 = [["Aberdeen",    1]]

level1 = [["Bath",        2],
          ["Birmingham",  3],
          ["Bradford",    4],
          ["Bristol",     5]]

level2 = [["Cambridge",   6],
          ["Canterbury",  7],
          ["Carlisle",    8],
          ["Chelmsford",  9],
          ["Chester",    10],
          ["Coventry",   11]]

level3 = [["Derby",      12],
          ["Durham",     13],
          ["Dundee",     14],
          ["Dover",      15]]

stocks = [level0, level1, level2, level3]

Here is the function that processes a single level, redistributing what was taken from the “lower” level:

def distributeonelevel(currentlevel, frompreviouslevel, threshold):
    tmplevel = dict(currentlevel)
    # first stage: spread stocks from a shop to its neighbors
    for city, stock in frompreviouslevel:
        targets = roads[city]
        ntargets = len(targets)
        q, r = stock / ntargets, stock % ntargets
        for cnt, target in enumerate(targets):
            tmplevel[target] += q
            if cnt + 1 <= r:
                tmplevel[target] += 1
    # second stage: shop's stock passes threshold
    updatedlevel = []
    tonextlevel = []
    for city, stock in tmplevel.iteritems():
        if stock < threshold:
            updatedlevel.append([city, 0])
            tonextlevel.append([city, stock])
        else:
            updatedlevel.append([city, stock])
    return updatedlevel, tonextlevel

The work is divided in two steps: first the leftovers from the previous level are redistributed; then the updated values are checked against the threshold. Those that don’t pass aren’t assigned to the current level but transferred to the following one.
Note that the roads dictionary lists destinations in the very same order as they appear on the map (from left to right for each level). This ensures that when iterating over enumerate(targets) the rule about distributing the remainder of the division is automatically fulfilled.
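The remainder rule can be checked in isolation; taking Bristol’s 5 avocados as an example:

```python
stock, targets = 5, ["Chelmsford", "Chester", "Coventry"]
q, r = divmod(stock, len(targets))
# everybody gets the quotient; the first r cities get one extra
shares = [q + (1 if i < r else 0) for i in range(len(targets))]
print(list(zip(targets, shares)))  # [('Chelmsford', 2), ('Chester', 2), ('Coventry', 1)]
```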
Then I launch that function in a loop, iterating on all levels:

def distributeall(stocks, threshold):
    updatedstocks = []
    frompreviouslevel = []
    for currentlevel in stocks:
        updatedlevel, frompreviouslevel = \
            distributeonelevel(currentlevel, frompreviouslevel, threshold)
        updatedstocks.append(updatedlevel)
    return updatedstocks

and the result is

>>> distributeall(stocks, 10)
[[['Aberdeen', 0]],
 [['Bradford', 0], ['Birmingham', 0], ['Bristol', 0], ['Bath', 0]],
 [['Cambridge', 0], ['Coventry', 13], ['Chester', 13], ['Canterbury', 10], ['Carlisle', 10], ['Chelmsford', 12]],
 [['Dundee', 14], ['Durham', 17], ['Dover', 15], ['Derby', 16]]]

The thing that worried me most about translating the above example into a functional-style program was the tmplevel dictionary: it is initialized to some currentlevel value, and then updated with new data as the computation proceeds. It turns out it isn’t much of a big deal: conceptually, “update this dictionary with that (key, value) pair” and “give me a new dictionary which is the same as this one plus a given (key, value) pair” are not that different, and the algorithms translate pretty much by themselves.
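In python terms, the conceptual shift looks like this (a toy sketch of mine, not part of the original code): the in-place loop becomes a reduce whose accumulator is a fresh dictionary at every step.

```python
from functools import reduce

words = ["a", "b", "a"]

# imperative: one dictionary, mutated in place
counts = {}
for w in words:
    counts[w] = counts.get(w, 0) + 1

# functional: each step returns a *new* dictionary
counts2 = reduce(lambda acc, w: {**acc, w: acc.get(w, 0) + 1}, words, {})

assert counts == counts2 == {"a": 2, "b": 1}
```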

Functional implementation (Clojure)

First off, input data: a hashmap for the road network:

(def roads {"Aberdeen"   ["Bath", "Birmingham", "Bradford", "Bristol"],
            "Bath"       ["Cambridge", "Canterbury", "Chelmsford"],
            "Birmingham" ["Cambridge", "Carlisle", "Chester"],
            "Bradford"   ["Canterbury", "Carlisle", "Coventry"],
            "Bristol"    ["Chelmsford", "Chester", "Coventry"],
            "Cambridge"  ["Derby", "Durham"],
            "Canterbury" ["Derby", "Dundee"],
            "Carlisle"   ["Derby", "Dover"],
            "Chelmsford" ["Durham", "Dundee"],
            "Chester"    ["Durham", "Dover"],
            "Coventry"   ["Dundee", "Dover"]})

and nested lists to describe the levels:

(def level0 [["Aberdeen",    1]])

(def level1 [["Bath",        2],
             ["Birmingham",  3],
             ["Bradford",    4],
             ["Bristol",     5]])

(def level2 [["Cambridge",   6],
             ["Canterbury",  7],
             ["Carlisle",    8],
             ["Chelmsford",  9],
             ["Chester",    10],
             ["Coventry",   11]])

(def level3 [["Derby",      12],
             ["Durham",     13],
             ["Dundee",     14],
             ["Dover",      15]])

(def stocks [level0, level1, level2, level3])

Now let’s look at the distributeonelevel python function. It’s split into two stages:

  • Loop over the remaining items from the previous level, and spread them across the cities of the current level
  • Check whether the updated stock values are large enough to pass the threshold. If yes, “commit”; if not, pass them on to the next level.

Both steps are accomplished by a for loop. My rule of thumb is that an imperative for loop, i.e. a computation over a collection where each successive step somehow depends on the previous one, translates into a clojure reduce. In this case, the first step just described is made of two nested for loops; I’ll use a reduce for each one. Here is the “inner reduce”:

(defn distrib-one-city [city stock tmplevel]
  (let [targets (get roads city)
        ntargets (count targets)
        q (quot stock ntargets)
        r (mod stock ntargets)]
    (reduce (fn [tmplevel- [cnt target]]
              (if (<= (+ cnt 1) r)
                (update-in tmplevel- [target] #(+ % q 1))
                (update-in tmplevel- [target] #(+ % q))))
            tmplevel
            (map-indexed vector targets))))

used as in

user=> (distrib-one-city "Aberdeen" 1 (->> level1
                                           (apply concat)
                                           (apply hash-map)))
{"Bradford" 4, "Bath" 3, "Birmingham" 3, "Bristol" 5}

Note how I pass the hashmap tmplevel to this function: it constitutes the initializer of the reduce expression, and the whole point of distrib-one-city is to return an updated copy of tmplevel.
As in the python counterpart, the list of destinations (get roads city) has the same order as on the map (from left to right). This ensures the rule about distributing the division remainder is respected.
Now the outer loop of the first step, another reduce expression:

(defn distrib-one-level [from-previous-level current-level]
  (let [tmplevel (->> current-level (apply concat) (apply hash-map))]
    (reduce (fn [tmplevel- [city stock]]
              (distrib-one-city city stock tmplevel-))
            tmplevel
            from-previous-level)))

Here I use the let form to bind the initial value of tmplevel to a hashmap version of current-level. The first argument of the reducer function is named tmplevel- (note the trailing dash): probably a poor choice; my intent was to mark its relation to tmplevel while avoiding confusion.

All in all, what we have here is a hashmap named tmplevel that gets updated for each city we need to redistribute, for each destination reachable from said city. Note that what we really care about is the value of the outermost reduce expression; once evaluated, that value is the return value of distrib-one-level. Nobody will ever know about a temporary hashmap named tmplevel… au revoir.

We use it like

user=> (distrib-one-level
         [["Bath" 2] ["Birmingham" 3] ["Bradford" 4] ["Bristol" 5]]
         [["Cambridge" 6] ["Canterbury" 7] ["Carlisle" 8] ["Chelmsford" 9] ["Chester" 10] ["Coventry" 11]])
{"Coventry" 13, "Canterbury" 10, "Chelmsford" 11, "Chester" 13, "Cambridge" 8, "Carlisle" 10}

And now thresholding, the second step announced before: just another reduce.

(defn threshold-filter [tmplevel threshold]
  (reduce (fn [[updated-level to-next-level] [city stock]]
            (if (< stock threshold)
              (list (conj updated-level [city 0])
                    (conj to-next-level [city stock]))
              (list (conj updated-level [city stock])
                    to-next-level)))
          (list () ())
          (seq tmplevel)))

The first argument is the level before thresholding, and I again call it tmplevel. Note how the reduce carries along a pair of lists: the updated values for the current level and the values to transfer to the next level.
The following ties together first and second step, i.e. redistribution and thresholding of items:

(defn distrib-and-filter [from-previous-level current-level threshold]
  (threshold-filter (distrib-one-level from-previous-level current-level)
                    threshold))

user=> (distrib-and-filter level1 level2 10)
((["Carlisle" 10]
  ["Cambridge" 0]
  ["Chester" 13]
  ["Chelmsford" 11]
  ["Canterbury" 10]
  ["Coventry" 13])
 (["Cambridge" 8]))

All that’s left now is to loop over all levels. Again, reduce:

(defn distrib-all- [stocks threshold]
  (reduce (fn [[updated-pile from-previous-level] current-level]
            (let [[updated-level to-next-level]
                  (distrib-and-filter from-previous-level
                                      current-level
                                      threshold)]
              (list (conj updated-pile updated-level)
                    to-next-level)))
          (list () ())
          stocks))

(defn distrib-all [stocks threshold]
  (first (distrib-all- stocks threshold)))

The first “utility” function distrib-all- does all the work, but carries along the items to redistribute to the next level. Hence I define a tiny wrapper on top of it that throws away the second element of the result.

user=> (distrib-all stocks 10)
((["Dover" 15] ["Durham" 17] ["Derby" 16] ["Dundee" 14])
 (["Carlisle" 10] ["Cambridge" 0] ["Chester" 13] ["Chelmsford" 12] ["Canterbury" 10] ["Coventry" 13])
 (["Bristol" 0] ["Birmingham" 0] ["Bath" 0] ["Bradford" 0])
 (["Aberdeen" 0]))

Venn diagrams are so eighteenth century

We use Venn diagrams to show relationships between sets. Draw a couple of overlapping circles: what’s in the overlap region belongs to both sets. The problems are:

  • the relative sizes of sets are hard to represent accurately
  • if you have more than two or three sets the diagram gets messy very quickly.

Here I show an alternative way to display intersections of lists using the so-called “chord diagram”. The visualization I am advocating here is packaged as a d3 plugin I call “chord2”, which provides a convenient API for the chord diagram, specifically devised to represent intersections.


How to read a chord intersection diagram

A chord diagram is composed of arcs, which lie on the perimeter of a circle, and chords (or ribbons) that connect arcs together. Each arc represents a set or collection of things: the width of an arc’s angle is proportional to the size of the set it represents. A chord represents common elements between two sets (the arcs it connects); the wider the chord, the more elements we find in the intersection of the sets. Let’s consider this example of three lists of people:

nice people := [Charline, Martin, Benedikt, Camilla]
rich people := [Sebastian, Camilla]
smart people := [Larry, Camilla, Charline, Gustavo, Audrey]

Here are the chord and Venn diagrams side by side for the lists above.


Each overlap region of the Venn diagram is mapped to one or more chords. The region corresponding to Camilla, which belongs to all sets, is represented by three chords:


Charline, who is nice and smart but not rich, is in a two-set intersection, hence is represented by a single chord (if you want, a chord with two ends, as opposed to the three-ended triangular chord for Camilla):


All other people lie in one and only one list, so there are no chords for them.

The need for exploding intersections


To draw the chord diagram above you’ll need to split your lists into the atomic components of their intersections, as depicted in the figure. And I have you covered: a previous post illustrates how that is done.

A visual optimization: polygons of chords

As the number of sets increases, the naive approach is to connect each arc with every other arc it has elements in common with. But it is much better to economize on chords, since the naive approach leads to lots of chords happily running around the diagram for each “n-th order” overlap (group of elements common to n sets). What we can do instead is to draw only n chords so that they form a chain: where a chord “arrives”, another one “leaves”, and every set in the overlap is touched by exactly two chords, no matter what the order of the overlap is. I call these constructs “polygons”, as if each chord were a side of a polygon.
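The saving is easy to quantify: an n-th order overlap needs one chord per pair of sets in the naive scheme, but only one per set in the polygon scheme. A quick check:

```python
def naive_chords(n):
    # one chord for every pair of sets sharing the overlap
    return n * (n - 1) // 2

def polygon_chords(n):
    # a closed chain of chords: one per set
    return n

for n in range(3, 7):
    print(n, naive_chords(n), polygon_chords(n))
```

For n = 3 the two schemes coincide (a triangle), and the gap grows quadratically from there.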

The following figure explains how much clutter can be saved: on the left, the naive “connect everything” solution. On the right, the cleaner strategy. My chord2 d3 plugin uses the latter.


Use fading to make polygons stand out

The optimization discussed above has a price: given an arc, now it takes a little more effort to recognize all the other arcs it connects to. We need to follow polygons around, instead of just getting the answer from multiple chords all springing out of a single source.

To get back the desired readability, when you mouse over an arc the chord2 plugin will make all unrelated chords fade out. Try it for yourself in the diagram below; all other images in this post are static PNGs, but this one is a real chord diagram powered by d3.

It always passes the tests

Last February, Graydon Hoare (of Monotone and Rust fame) wrote about what he calls “The Not Rocket Science Rule Of Software Engineering”: automatically maintain a repository of code that always passes all the tests.

Graydon outlines that:

  • Tests on pull request X must be performed before that code gets into trunk, not after.
  • Tests on pull request X must be performed on the resulting merge of X with the current trunk, not on X alone. Your pull request can have unexpected interactions with the rest of the codebase, and you want to test that.

It’s a very simple principle, i.e. not rocket science at all. In my team we adopt an integration workflow that respects The Not Rocket Science Rule, and here is how we do it.

It all starts from Main

Our central repository has only one branch. We call it Main. All development starts from there.

Clone Main and work

Alice starts the development of her feature (or bugfix) by cloning Main and committing on top of it.

Make a pull request

When done, Alice makes a pull request and asks for a code review.

Meanwhile, Main goes on

Meanwhile, Main has evolved: Bob and Chuck already terminated the integration process, and their code is now in the central repository.

Buildbot takes it

After the code review passes, our buildbot instance will merge Alice’s work with the current Main in a throw-away branch. If there are conflicts, the pull request is rejected and Alice has to pull/merge to fix them, then re-submit.
When I explain it, this is the point where people yell at me “ZOMG unsupervised merges, ya craaazy!?!”. My answer is: first, it’s C++ and if the merge screws things up, it’s unlikely it will even compile; second, the three-way merge algorithm is pretty solid and with a simple topology like ours (i.e.: no criss crosses) I’ve never seen it failing; third, the code still needs to pass the tests (see next section).


Buildbot will then proceed to run unit tests and integration tests. Buildbot processes only one pull request at a time, so no other code will be pushed before Alice’s tests are done.

Push back to Main

If tests are successful, buildbot will push Alice’s work to Main, the central repository.

The “next branch” workflow and criss-cross merges

I’ll describe the so-called “next branch” workflow, used for the development of the Git project and for the Linux kernel, and illustrate how it systematically leads to criss-cross merges. A criss-cross merge happens when the greatest common ancestor of the nodes to be merged is not unique. It’s more of a curiosity; I made this observation and wanted to share it. Feel free to prove me wrong if I am missing things.
If you are in a hurry, you can skip the text and jump from image to image reading the captions; that should give you a fair idea of the content of this post.

The “next branch” workflow

This workflow is described in great detail in the gitworkflows(7) manpage. The main idea behind it is to keep a so-called “throw-away integration branch” called “next” (hence the name), where you merge all the contributions from your developers. The purpose of next is to see how all those features interact together “in the wild”: some people build the software directly from next, use it and report back to the developers. With next acting as a “buffer” for features, you can decide which ones are ready for prime time and which ones need more time, testing and improvements. It’s not “all or nothing”: you can select what you like.
When you’re satisfied with a feature that is sitting in next, you merge that feature alone into your “principal branch” (as Git calls it, “master”), the one that sooner or later will be tagged, released and distributed to your users; this process is called “graduation”.
Note that some features might graduate from next to master very quickly, while others might sit in next for a long time before seeing master and hit the release.

Let’s see the workflow in action stage by stage.

It all starts from master


As described in the gitworkflows(7) manpage, this workflow uses two more branches: maint and pu (which stands for “proposed updates”). The most stable of the four is maint (the maintenance branch, meant to host fixes to the previous release), then come master and next, and the most experimental of them is pu. I will not consider maint nor pu, but limit this description to master and next for the sake of simplicity.
So, for what matters to this post, “master” is the principal and most stable branch.

next is an experimental fork of master


next, our “throw-away” integration branch, is a fork of master and hosts features that you aren’t yet sure work well. Since it’s clear to everybody that next is only for testers and developers, regular users who build from the source know that the software from next is not to be used in production.

Contributions are based on master


Most of the developers send their patches to the mailing list. Once those diffs are peer reviewed, the git maintainer (Junio Hamano) applies them in his repository as “topic branches” that fork master. Junio’s repository, with all those topic branches, is available at http://github.com/gister/git.

Topic branches are merged into next


Alice and Bob, from the picture, are good developers and wrote patches that pass all tests and had favorable reviews on the mailing list. But how do their changes interact together? The combined result might not work at all.
This is the crux of the whole workflow: Junio uses a throw-away integration branch, next, to try things out.
next is built and used by people willing to tolerate alpha software and help the developers by reporting their experiences with it. I hear that the Debian community is involved in using Git built from next and reporting what they see.

Topics may (or may not) graduate to master


In the picture, the topic branch from Bob is merged into master. This process is called “graduation”, and happens when Junio sees that Bob’s feature is sufficiently mature for prime time. master will eventually be tagged as an official Git release, so graduation is kind of a big deal.
Note that in my picture Bob’s topic was the first to be merged into next, and it’s also the first to graduate. Things might not happen in that order. Namely, a feature can stay in next indefinitely: there is no timeline for graduation. It might not graduate at all. The whole point of the “next workflow” is to have next acting as a buffer. When you like a feature, you pick it from next and merge it into master. Note that this doesn’t happen via cherry-picking, but via merging branches; merging offers some benefits over cherry-picking:

  • changes are in a single place in the graph, as opposed to having the same changeset in many places under different hashes.
  • merges are easier to handle, since with a single merge you can add many commits at once to a development line, as opposed to cherry-picking commits one by one.


Uh, look, while we were talking about merging and cherry picking, Alice’s topic graduated to master. Congratulations, Alice!

What if we merge next into master?

This would transform the “next branch workflow” into a completely different thing. I’ve seen this workflow erroneously called “topic branches workflow”, as if its central part was the fact that developers do their work in separate branches. While that is an important characteristic, topic branches are not the defining trait of this way of working; better: they are not what surprised me the most when learning about this workflow. The very point is that you try things out in next, and when you’re happy you put them in master.
If you were to merge the whole next into master, it would become “all-or-nothing”: no more freedom to accept mature features into the release while giving less stable developments some more time to stabilize.

Bugfixes go to the oldest stable branch…


There are changes that are both trivially correct and urgently needed. Those bugfixes get applied directly on a stable branch, skipping all the graduation dance.

… and propagate to experimental branches via merges


The small bugfixes I mentioned above aren’t needed only in the stable branch, they’re needed everywhere. The reason they get applied to master and then propagated via a merge of master into next is that merging in the opposite direction is forbidden.
Think of this: if you applied the bugfix to next first, you’d then want to merge next into master. That’s not allowed: such a merge would carry a lot of things that are not for master (all the topics that haven’t graduated yet). This is to say: it has to be this way.

Here comes the criss-cross merge

The very reason I wrote this post is to share the following remark: when you merge master into next you get a criss-cross merge. A criss-cross merge is one where the greatest common ancestor of your two heads is not unique. Let’s see how the head of Alice’s topic branch is a greatest common ancestor for master and next:


As you can imagine, Bob’s branch head is also a greatest common ancestor:


Criss-cross merges might or might not be a big deal. Sometimes they cause headaches. If you use this workflow and find yourself in this situation, I hope I provided some useful insight on the topology of the history graph that can help you understand what’s going on.
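To make the topology concrete, here is a toy sketch (my own Python, not Git’s code) that computes every greatest common ancestor of two heads in a DAG. In the criss-cross situation described above, both topic heads qualify:

```python
def ancestors_of(node, parents):
    # all ancestors of a node, the node itself included
    seen, stack = set(), [node]
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(parents[n])
    return seen

def greatest_common_ancestors(a, b, parents):
    # common ancestors of which no other common ancestor is a descendant
    common = ancestors_of(a, parents) & ancestors_of(b, parents)
    return {n for n in common
            if not any(m != n and n in ancestors_of(m, parents)
                       for m in common)}

# a criss-cross: both topic heads were merged into both branches
parents = {'base':   [],
           'alice':  ['base'],           # Alice's topic head
           'bob':    ['base'],           # Bob's topic head
           'master': ['alice', 'bob'],   # both topics graduated
           'next':   ['bob', 'alice']}   # both topics were in next too
greatest_common_ancestors('master', 'next', parents)
# → {'alice', 'bob'}: the GCA is not unique
```

The names (base, alice, bob) are of course made up for the example; the point is only that once both topic heads are reachable from both branch heads, neither dominates the other as an ancestor.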

master and next continue their parallel lives…


The steps above are repeated as necessary: master and next continue to evolve.

… until the end of the cycle

I said that next is a “throw-away” branch. If next and master really were to live side by side indefinitely, you’d end up with two branches whose content differs a lot. At some point, you want to kill next and start afresh. At the end of a “cycle”, which for Git typically lasts 8 to 10 weeks, next is indeed erased and rebuilt afresh by merging all topics that haven’t graduated yet. This gives the opportunity to restart clean, reorder topics, etc.

This step is of capital importance to stay on top of all this growing complexity, and helps to reason about what’s where. An invariant like “next has been forked off master no more than 10 weeks ago” might come in handy.
But it is also a lot of work: all the conflicts you had merging topics into next will show up again, and again, and again, until the conflicting topics finally get into master. Junio uses a git feature called rerere to remember how he solved conflicts in the past. Whatever trick you use, the luxury of merging features selectively from a sandbox branch to a production branch comes at the price of a lot of housekeeping.
It has been rightly observed that there’s no good reason to do this if you’re not automating the construction of next. This is what Linux does: linux-next is an automatically-built branch combining more than 50 sources, which detects conflicts early.


  • The next branch workflow uses two parallel branches: one is production and the other is a sandbox where you try things out. This architecture gives you complete freedom on what features to promote from the sandbox to the real world.
  • It reflects a development model where you operate under no tight deadlines and only code that meets the highest quality standards is allowed to be released.
  • The complexity of this workflow is nontrivial, and can only be handled if the next branch is built automatically.
  • If the product of your shop is a web service, i.e. you have marketing pushing for features, you operate your own platform and can patch the servers on the go if needed, then a branch model where you merge the whole “development branch” into production at once, with scheduled release dates, is probably better suited to your needs. The next branch model lets you put features on hold indefinitely, but the sales people are on your neck anyway.

Thanks to Antoine Pelisse who introduced me to this workflow, and for the many insightful discussions we had on this.

Exploding intersections with the Inclusion-Exclusion Principle

How many elements are in the union of two sets? The sum of those in each set respectively, you say? Of course not: there might be some elements in common that you counted twice. What if the sets involved are three? Or four?

The answer is known as “the Inclusion-Exclusion Principle”, and in this post I’ll describe how to use it to “explode intersections”, i.e. give the cardinality of each region of a Venn diagram. With a picture, here the case of three sets:

The Inclusion-Exclusion Principle tells us that the cardinality of the union of three sets is given by the formula

|A ∪ B ∪ C| = |A| + |B| + |C|
              - |A∩B| - |A∩C| - |B∩C|
              + |A∩B∩C|

which doesn’t come as too much of a surprise: we first sum the sets’ cardinalities, then see that we added the “two-by-two” intersections once too many and subtract them, but then hey, we added the “all-in” intersection three times and subtracted it three times, so we gotta put it back in.

The generalized case follows the same pattern: add all “order one” intersections (the sets alone), remove all “order two” intersections (the two-by-two), add back all “order three” ones (the three-by-three), remove all “order four”, and so on until no more cases exist.
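The generalized pattern is easy to check in code. Here is a quick sketch in modern Python (my own, using itertools.combinations, not the code developed below) that computes the cardinality of a union this way:

```python
from itertools import combinations

def union_size(sets):
    # |S1 ∪ … ∪ Sn| by inclusion-exclusion: alternately add and
    # subtract the cardinality of every k-wise intersection
    total = 0
    for k in range(1, len(sets) + 1):
        sign = (-1) ** (k + 1)
        for group in combinations(sets, k):
            total += sign * len(set.intersection(*group))
    return total

union_size([{1, 2}, {2, 3}])  # → 3, i.e. |{1, 2, 3}|
```

Comparing the result with the directly-computed union is a handy sanity check for any number of sets.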

When thinking of that formula applied to our problem of “exploding” the intersection, I find it useful to visualize the situation with the following lattice, which stresses the fact that higher order intersections are contained in lower order ones. Nodes in the lattice are intersections (all nicely lined up by order); edges represent the partial order relationship “contains”.

The diagram reminds us that, for instance, to get how many elements are only in A (and not in B nor C), one has to follow all paths from A down to the highest order intersection, flipping the sign as the order of the intersections increases. Which is: sum |A∩B| and |A∩C|, then subtract |A∩B∩C|. The result corresponds to the cardinality of (A∩B) ∪ (A∩C), and has to be subtracted from the cardinality of A.

Slightly harder: 4 sets instead of 3. Following the above diagram, we can compute the number of items in A but in none of the other sets by following all paths that go from A to the very bottom intersection A∩B∩C∩D, and using the principle to know what to add and what to subtract:

in A only = |A| - |A∩B| - |A∩C| - |A∩D|
                + |A∩B∩C| + |A∩B∩D| + |A∩C∩D|
                - |A∩B∩C∩D|

In this other example we want to see how many things are in A and C but in none of the other sets; same story: follow the paths and flip the sign at each step down.

in A∩C only = |A∩C| - |A∩B∩C| - |A∩C∩D|
                    + |A∩B∩C∩D|

Now we implement this algorithm in code. First some helper functions; list flattening:

def flatten(listlist):
    return [elem for list_ in listlist for elem in list_]
>>> flatten([['a', 'b'], ['c', 'd', 'e']])
['a', 'b', 'c', 'd', 'e']

permutations of a list:

def perm(sequence):
    if len(sequence) == 0:
        return [[]]
    else:
        out = []
        for cnt, elem in enumerate(sequence):
            out += map(lambda x: [elem] + x,
                       perm(sequence[:cnt] +
                            sequence[(cnt+1):]))
        return out
>>> perm(['a', 'b', 'c'])
[['a', 'b', 'c'],
 ['a', 'c', 'b'],
 ['b', 'a', 'c'],
 ['b', 'c', 'a'],
 ['c', 'a', 'b'],
 ['c', 'b', 'a']]

All subsets of given cardinality of a list:

def choose_n(n, srcList):
    if n == 0:
        return [[]]
    else:
        out = []
        for cnt, elem in enumerate(srcList):
            out += map(lambda x: [elem] + x,
                       choose_n(n-1, srcList[(cnt+1):]))
        return out
>>> choose_n(2, ['a', 'b', 'c'])
[['a', 'b'], ['a', 'c'], ['b', 'c']]
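As an aside, the standard library can do this for us: itertools.combinations yields exactly these subsets (as tuples, in the same order). A quick equivalence check against a modern-Python rewrite of choose_n:

```python
from itertools import combinations

def choose_n(n, src):
    # recursive version, equivalent in spirit to the one above
    if n == 0:
        return [[]]
    return [[e] + rest
            for i, e in enumerate(src)
            for rest in choose_n(n - 1, src[i+1:])]

# combinations yields the same subsets, as tuples, in the same order
assert choose_n(2, ['a', 'b', 'c']) == \
       [list(t) for t in combinations(['a', 'b', 'c'], 2)]
```

Rolling your own is still instructive here, since the lattice walk below builds on the same recursion.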

All possible subsets of a list, excluded the empty one:

def allsubsets(srcList):
    tot = len(srcList)+1
    complem = lambda universe, ensemble: \
        list(set(universe) - set(ensemble))
    complSrcList = lambda ensemble: complem(srcList,
                                            ensemble)
    for i in range(1, tot):
        if i > tot / 2: # integer division
            yield map(complSrcList,
                      choose_n(tot - (i + 1), srcList))
        else:
            yield choose_n(i, srcList)

Note the optimization where I work on the complementary list: for large subsets, this reduces the work done by choose_n, whose recursion depth is equal to the size of the requested subsets.

>>> list(allsubsets(['a', 'b', 'c']))
[[['a'], ['b'], ['c']],
 [['a', 'b'], ['a', 'c'], ['b', 'c']],
 [['a', 'c', 'b']]]

Now a function to precompute all intersection values. We store them in a dictionary where values are intersection cardinalities, and keys are the names of the sets involved in the intersection. Many items in this lookup table represent the same intersection, with a different order of the sets in the key. I do this to speed up the lookup: instead of normalizing the queries, I store the answers to all possible queries.

def intersLookup(lists):
    toInters = flatten(allsubsets(lists.keys()))
    def inters_n(names):
        return reduce(lambda s, t: set(s) & set(t),
                      [lists[name] for name in names[1:]],
                      set(lists[names[0]]))
    lookup = {}
    for sequence in toInters:
        cardinality = len(inters_n(sequence))
        for variant in perm(sequence):
            lookup['&'.join(variant)] = cardinality
    return lookup
>>> intersLookup({'a': ['apple', 'banana'],
...               'b': ['orange', 'apple', 'watermelon'],
...               'c': ['peach', 'plum', 'pear', 'apple', 'orange']})
{'a': 2, 'c': 5, 'b': 3, 
 'a&b&c': 1, 'b&c&a': 1, 'c&b&a': 1,
 'a&c&b': 1, 'b&a&c': 1, 'c&a&b': 1,
 'b&a': 1, 'a&b': 1,
 'b&c': 2, 'c&b': 2,
 'c&a': 1, 'a&c': 1}
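An alternative to storing every permutation of the key (my own variant, in modern Python, not from the original code) is to normalize the key itself: a frozenset is hashable and order-insensitive, so one entry per intersection suffices:

```python
from itertools import combinations

def inters_lookup(sets):
    # one entry per subset of set names; the frozenset key makes
    # the lookup order-insensitive without storing permutations
    names = list(sets)
    lookup = {}
    for k in range(1, len(names) + 1):
        for group in combinations(names, k):
            inter = set.intersection(*(set(sets[n]) for n in group))
            lookup[frozenset(group)] = len(inter)
    return lookup

table = inters_lookup({'a': ['apple', 'banana'],
                       'b': ['orange', 'apple', 'watermelon']})
table[frozenset(['b', 'a'])]  # → 1, same entry for any key order
```

The trade-off is a hash of a frozenset per query instead of a string join, in exchange for a table with no duplicates.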

The following generator, named subints as in “sub-intersections”, is particularly important since it implements the concept of “walking down all paths” in the above diagrams, one level at a time. It returns the nodes of the lattice conveniently grouped for the sign flipping that happens in the Inclusion-Exclusion formula.

def subints(elem, compl):
    for thing in allsubsets(compl):
        yield map(lambda x: elem + x, thing)

Take the first example I gave above for the 4-sets case. We want to see how much of A is “eaten” by the sub-intersections. What to sum, what to subtract? Here’s the answer:

>>> si = subints(['a'], ['b', 'c', 'd'])
>>> si.next() # add
[['a', 'b'], ['a', 'c'], ['a', 'd']]
>>> si.next() # subtract
[['a', 'b', 'c'], ['a', 'b', 'd'], ['a', 'c', 'd']]
>>> si.next() # add
[['a', 'c', 'b', 'd']]
>>> si.next() # it's over
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration

The trick is to observe that, in a sub-lattice of our diagrams, lower levels are just intersections of the nodes in upper levels with the sets that were left out. Now it’s clear why allsubsets works the way it does: it returns subsets grouped by size, so that subints can be written more concisely.
Let’s see the second example above, again from the 4-sets case:

>>> si =  subints(['a', 'c'], ['b', 'd'])
>>> si.next()
[['a', 'c', 'b'], ['a', 'c', 'd']]
>>> si.next()
[['a', 'c', 'b', 'd']]
>>> si.next()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration

We now have everything needed to write the inclusion-exclusion formula:

def flip():
    curr = 1
    while True:
        yield curr
        curr *= -1

def o(f, g):
    # function composition
    def helper(x):
        return f(g(x))
    return helper

def inclusionexclusion(elem, compl, allInters):
    currentInters = allInters['&'.join(elem)]
    sign = flip()
    subintValue = \
        sum(map(lambda (sign, value): sign * value,
                zip(sign,
                    map(o(sum,
                          lambda nodes: map(lambda node: allInters['&'.join(node)],
                                            nodes)),
                        subints(elem, compl)))))
    return currentInters - subintValue

Let’s see:

>>> allInters = intersLookup({'a': ['apple', 'banana'],
...                           'b': ['orange', 'apple', 'watermelon'],
...                           'c': ['peach', 'plum', 'pear', 'apple', 'orange']})

How many items are in a and b but not in c?

>>> inclusionexclusion(['a', 'b'], ['c'], allInters)
0

Zero, since the only common element is apple, but it’s also in c. How many elements in the intersection of all three?

>>> inclusionexclusion(['a', 'b', 'c'], [], allInters)
1

The apple, indeed. How many in a, but not in b nor c?

>>> inclusionexclusion(['a'], ['b', 'c'], allInters)
1

That’s the banana.

The last thing left to do is to package all of this into a shiny function that computes our complete explosion:

def complem(universe, ensemble):
    return list(set(universe) - set(ensemble))

def explode(allNames, allInters):
    out = {}
    for elem in flatten(allsubsets(allNames)):
        out['&'.join(elem)] = \
            inclusionexclusion(elem,
                               complem(allNames, elem),
                               allInters)
    return out

There we go:

>>> sets = {'a': ['apple', 'banana'],
...         'b': ['orange', 'apple', 'watermelon'],
...         'c': ['peach', 'plum', 'pear', 'apple', 'orange']}
>>> allInters = intersLookup(sets)
>>> explode(sets.keys(), allInters)
{'a': 1, 'a&c&b': 1, 'c': 3, 'b': 1, 'c&b': 1, 'a&b': 0, 'a&c': 0}
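As a sanity check (my own, in modern Python, not part of the original code), each region’s cardinality can also be computed by brute force, counting the elements that belong to exactly the named sets; the counts must match the explosion above:

```python
from itertools import combinations

def explode_bruteforce(sets):
    # for every non-empty group of set names, count the elements
    # that are in all sets of the group and in none of the others
    names = list(sets)
    universe = set().union(*(set(v) for v in sets.values()))
    out = {}
    for k in range(1, len(names) + 1):
        for group in combinations(names, k):
            others = set(names) - set(group)
            out[frozenset(group)] = sum(
                1 for e in universe
                if all(e in sets[n] for n in group)
                and not any(e in sets[n] for n in others))
    return out

sets = {'a': ['apple', 'banana'],
        'b': ['orange', 'apple', 'watermelon'],
        'c': ['peach', 'plum', 'pear', 'apple', 'orange']}
explode_bruteforce(sets)[frozenset(['c'])]  # → 3: peach, plum, pear
```

Brute force is exponentially slower than inclusion-exclusion on many large sets, which is exactly why it makes a good independent cross-check on small ones.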

The code is available at https://github.com/gghh/intersections. Images are drawn with gimp and zwibbler. The 4-sets Venn diagram comes from Wikipedia.

The annoying merge conflicts in case of criss cross

Here I show an example of a problematic merge that I sometimes have with Mercurial. What happens is that for a given file foo.txt two commits, one the parent of the other, get in competition at merge time. Obviously the child edit should be selected as the winner of the merge, but this involves user intervention when it shouldn’t be necessary.

Here’s a minimal example of history:

Our goal is to merge nodes 5 and 6. The first thing we note is that the GCA isn’t unique, since both 1 and 4 are good candidates. Since for my example I need 4 to be selected (so that things can go bad), I inserted node 3; this way 4 is the common ancestor furthest from the root, and will thus be selected.

Why did I color 0, 1 and 2 in pink? Because those are the only nodes where I edit my foo.txt file, if we exclude merges. Before I show a “Try It And See” example with Mercurial, I’ll describe what happens at each step in plain English.

  0. foo.txt is edited.
  1. foo.txt is edited again.
  2. foo.txt is edited a third time.
  3. Somebody forks at 0. foo.txt is not modified.
  4. Development continues on top of 3. foo.txt is not modified. At this stage, foo.txt has the very same content as it had in 0.
  5. 4 is merged with 1. With respect to foo.txt, the 3-way merge algorithm automatically selects the changes in 1 as winners: if a line is changed in “this” and unchanged in “other”, the change is selected.
  6. 2 is merged with 4. Same story, ordinary business for 3-way merge: the most recent change is selected.

And then we want to merge 5 and 6. Mercurial selects 4 as the base. Let’s observe things from foo.txt’s point of view:

  • (this) in 6 we have the latest version (from 2).
  • (other) in 5 we have the next-to-last version (from 1). This should not “compete” with the latest… should it?
  • Yes, it does, because in 4 (the base of the merge) we have the very first version, so the 3-way merge has no idea of which is more recent, “this” or “other”. They are both changes as far as you compare them with the early version in 0 (inherited by 4).
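The selection rule these bullets describe can be sketched for a single line (a toy model of 3-way merge, my own code, not Mercurial’s): a side wins only when it is the one that changed; if both sides changed with respect to the base, it’s a conflict:

```python
def merge_line(base, this, other):
    # toy three-way merge of a single line
    if this == other:
        return this
    if this == base:          # only "other" changed: it wins
        return other
    if other == base:         # only "this" changed: it wins
        return this
    return 'CONFLICT'         # both changed: user intervention needed

# with 4 as base (same content as 0), both sides look "changed":
merge_line('I am a cow.',                   # base, foo.txt in 4
           'I am a cow with superpowers.',  # this, in 6 (from 2)
           'I have superpowers.')           # other, in 5 (from 1)
# → 'CONFLICT'
```

Had 1 been chosen as the base instead, “other” would equal the base and the latest version would win automatically, with no conflict.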

Now, the real example. First, we create commits 0, 1 and 2:

$ hg init
$ echo 'I am a cow.' > foo.txt ; hg add foo.txt ; hg commit -m foo
$ echo 'I have superpowers.' > foo.txt ; hg commit -m foo
$ echo 'I am a cow with superpowers.' > foo.txt ; hg commit -m foo

Then we create the second branch, with commits 3 and 4 forking from 0.

$ hg up 0
$ touch bar.txt ; hg add bar.txt ; hg commit -m bar
$ touch baz.txt ; hg add baz.txt ; hg commit -m baz

We are now at 4, and we merge with 1. The base will be 0, and all goes fine: the most recent version of foo.txt is selected.

$ hg merge 1 ; hg commit -m merged
$ cat foo.txt 
I have superpowers.

We then update to 2 and merge with 4. Again, no surprises from the merge. The most recent version (i.e. 2) is selected.

$ hg up 2
$ hg merge 4 ; hg commit -m merged
$ cat foo.txt 
I am a cow with superpowers.

A look at the history so far:

$ hg log --graph --template '{rev} {desc}'
@    6 merged
| | o  5 merged
| |/|
| o |  4 baz
| | |
| o |  3 bar
| | |
o---+  2 foo
 / /
| o  1 foo
o  0 foo

Being now in 6, we merge with 5

$ hg merge

As expected, the stupid conflict.

It is worth noting that the Consensus Merge algorithm would not solve this situation. It would take into account the two candidates for the GCA (1 and 4), and weigh their “opinions” equally. But one is right and the other is wrong: in this small example, democracy fails to apply.

My naive observation is that when the history of a file is all on the same side of the merge, that file should not be touched at merge time: just take the (single) head of the sub-DAG that tracks all commits affecting said file. But maybe I am missing undesired side effects of this approach. Also, thinking “per file” is usually the wrong way to reason about source control. Considering the repository as a whole is a thing that Mercurial and Git got right with respect to their predecessors. A “file” is a very artificial way to split code into chunks, as far as version control is concerned. The atomic entity is “the single line of code”.

In this case, with multiple GCAs, one can see that one of those GCAs is good for the file we’re considering (foo.txt should be merged with 1 as base), and the other is not. I wonder if the following principle holds: “if the GCA is unique, then it is optimal for all files (better: lines of code)”, where by “optimal” I mean “it doesn’t compare a revision against its ancestors in stupid ways”. On the other hand, if there are many GCAs, each line of code has its favourite one, and it’s impossible to choose.

How does Mercurial select the Greatest Common Ancestor

The GCA of two nodes X and Y in a DAG (Directed Acyclic Graph) is an ancestor of both X and Y such that no descendant of it is also a common ancestor of X and Y. This definition is equivalent to the revset expression heads(ancestors(X) and ancestors(Y)), i.e. the head (if unique) of the intersection among the ancestors of X and those of Y.

Mercurial adopts a slightly different definition: the GCA is the common ancestor of the two nodes which is furthest from a root, as measured by the longest path. This makes sense, since the GCA is used as base for merges and it’s meaningful to take a (topologically) more recent version in case of ambiguity.

I am going to discuss the algorithm that was used to perform GCA computations in Mercurial until version 2.5.4; in 2.5.5 this code was replaced by a faster algorithm.

The interesting part is what happens when there are multiple candidates for the GCA, i.e. the GCA is not unique. What happens is that Mercurial will pick the one it finds first. Here we walk through the algorithm step by step and see how it works.

First off, where to find things: the ancestor() function is in the mercurial.ancestor module. Here’s the docstring:

$ python

>>> from mercurial import ancestor
>>> help(ancestor.ancestor)

ancestor(a, b, pfunc)
    Returns the common ancestor of a and b that is furthest from a
    root (as measured by longest path) or None if no ancestor is
    found. If there are multiple common ancestors at the same
    distance, the first one found is returned.
    pfunc must return a list of parent vertices for a given vertex

We’ll work with the following graph, where GCA(4, 5) isn’t unique: candidates are 1 and 2, equally good (parents are on the left and children on the right).

We can try it right away, like this:

>>> parents = {0: [], 2: [0], 1: [0], 3: [2],
...            5: [1, 2], 4: [3, 1]}
>>> pfunc = lambda x: parents[x]
>>> ancestor.ancestor(4, 5, pfunc)
1

1 is chosen… apparently it’s the one found first. Let’s dig deeper into the algorithm. The ancestor() function does its job in two steps:

  1. prepare a depth dictionary that associates each ancestor of both nodes X and Y to its distance from a root, computed as the longest path to it.
  2. use a generator for each of the two starting vertices that, at each iteration, returns all ancestors “of the same generation”, i.e. all ancestors at a given distance d from the root. The two generators are such that they return generations in descending order of distance. If you want, you can picture this as two ancestor lists that “race” towards the root, until they meet at the same distance level (or don’t meet, in which case no common ancestor is found).

I have split the code in three separate functions for this explanation. First, the construction of the depth dictionary:

def depthf(a, b, pfunc):
    a, b = sorted([a, b])
    # find depth from root of all ancestors
    # depth is stored as a negative for heapq
    parentcache = {}
    visit = [a, b]
    depth = {}
    while visit:
        vertex = visit[-1]
        pl = pfunc(vertex)
        parentcache[vertex] = pl
        if not pl:
            depth[vertex] = 0
            visit.pop()
        else:
            for p in pl:
                if p not in depth:
                    visit.append(p)
            if visit[-1] == vertex:
                # -(maximum distance of parents + 1)
                depth[vertex] = min([depth[p] for p in pl]) - 1
                visit.pop()
    return depth, parentcache

The parentcache dictionary is there to save some time at a later step; we’ll pass it along to the forthcoming functions.

What’s to note here is that we can compute the depth of a node only if we have the depth of all its parents, taking the maximum and adding 1 (since depths are stored negated, the code does min([depth[p] for p in pl]) - 1).

We visit nodes in a depth-first fashion. We’ll encounter a lot of unseen vertices and append them to the visit array, until at some point we hit a root and set its depth to 0. At this point we start backtracking, pop()-ing nodes from visit as soon as we compute their depth.

Note that the condition visit[-1] == vertex is equivalent to “all parents of vertex already have their depth computed”, since we are handling DAGs: the absence of cycles assures us that no vertex can be found twice in the visit list. Well, strictly speaking, the point is that no node can be parent of itself: the only threat to the last element of visit being unchanged since the top of the loop is the inner loop that appends unseen parents.

Here’s the result when applied to our toy example:

>>> depth, parentcache = depthf(4, 5, pfunc)
>>> depth
{0: 0, 1: -1, 2: -1, 3: -2, 4: -3, 5: -2}

Depths are stored negated since we’ll later use a heap to figure out the furthest elements.

Here’s the helper function that traverses ancestors in order of decreasing distance from the root:

import heapq

def ancestors(vertex, depth, parentcache):
    h = [(depth[vertex], vertex)]
    seen = set()
    while h:
        d, n = heapq.heappop(h)
        if n not in seen:
            seen.add(n)
            yield (d, n)
            for p in parentcache[n]:
                heapq.heappush(h, (depth[p], p))

A few things to note here:

  1. duplicate nodes are removed
  2. since the distance of a node is always larger than the distance of any of its parents, it’s safe to pop() an element and only later push() its parents into the heap. Subsequent pop() will always return elements in the desired order.

Here’s the result of applying it to vertex 4:

>>> list(ancestors(4, depth, parentcache))
[(-3, 4), (-2, 3), (-1, 1), (-1, 2), (0, 0)]

Another helper function: we need to get all ancestors of a given depth, not one by one. This one does the buffering into “generations”:

def generations(vertex, depth, parentcache):
    sg, s = None, set()
    for g, v in ancestors(vertex, depth, parentcache):
        if g != sg:
            if sg:
                yield sg, s
            sg, s = g, set((v,))
        else:
            s.add(v)
    yield sg, s

Here are the generations that lead to vertex 4:

>>> list(generations(4, depth, parentcache))
[(-3, set([4])), (-2, set([3])), (-1, set([1, 2])), (0, set([0]))]

Finally, here’s the function that puts the pieces together and gives the final result:

def gca(a, b, depth, parentcache):
    x = generations(a, depth, parentcache)
    y = generations(b, depth, parentcache)
    gx = x.next()
    gy = y.next()
    # increment each ancestor list until it is closer to root
    # than the other, or they match
    try:
        while True:
            if gx[0] == gy[0]:
                for v in gx[1]:
                    if v in gy[1]:
                        return v
                gy = y.next()
                gx = x.next()
            elif gx[0] > gy[0]:
                gy = y.next()
            else:
                gx = x.next()
    except StopIteration:
        return None

gx and gy are tuples like (depth, nodes). A GCA can be found only when gx[0] == gy[0]; since we start from the maximum depths and progressively decrease them, we’re guaranteed to find an element in the intersection of the ancestors of the original nodes that is furthest away from the root, or we find nothing at all.

Here we run it, showing the value of gx and gy as we go along:

>>> gca(4, 5, depth, parentcache)
gx (-3, set([4]))
gy (-2, set([5]))
gx (-2, set([3]))
gx (-1, set([1, 2]))
gy (-1, set([1, 2]))
1

Note that 1 is chosen instead of 2 for no particular reason: just because it’s the first element of list(set([1, 2])).

Interestingly, if we add a node 6 between 0 and 2, this makes 2 the common ancestor furthest from the root, and the algorithm no longer sees any ambiguity in the choice:

>>> parents = {0: [], 2: [6], 1: [0], 6: [0], 3: [2],
...            5: [1, 2], 4: [3, 1]}
>>> pfunc = lambda x: parents[x]
>>> depth, parentcache = depthf(4, 5, pfunc)
>>> gca(4, 5, depth, parentcache)
gx (-4, set([4]))
gy (-3, set([5]))
gx (-3, set([3]))
gx (-2, set([2]))
gy (-2, set([2]))

In the last iteration the intersection between gx[1] and gy[1] is the element 2 alone.

Caching in python with a descriptor and a decorator

The Mercurial SCM has several strategies to cache expensive computations; you can find an overview of all of them at this page. This post is about the propertycache decorator: a mechanism to replace a function with its result the very first time it is called. The trick is made possible by the combination of two features of the python programming language: descriptors and decorators.

Python descriptors

The python descriptor protocol works as follows: if you override the methods __get__(), __set__() or __delete__() in a (new style) python class, you’re given the chance to execute your own code when an object of that class is evaluated (__get__()) or is bound to another object (__set__()). Which is, you can decide what happens when that evaluation (or binding) occurs.

One more thing to remember is that an object that overrides __get__() and __set__() behaves like a descriptor (which is: these methods are actually called instead of the default evaluation/binding behaviour) only if it is a member of another class.


A dummy example

Here we use the “descriptor protocol” to prevent user code from assigning to an attribute of an object. That attribute is kept at a constant value instead:

class joke(object):
    def __get__(self, obj, objtype):
        return 'Bazinga!'
    def __set__(self, obj, value):
        pass

class foo(object):
    x = joke()

We had to implement __set__() too, despite the fact that it’s empty: otherwise python would have regarded joke as a non-data descriptor, which is a slightly different beast. Here is what happens when writing and reading the attribute x of an object of type foo:

>>> f = foo()
>>> f.x
'Bazinga!'
>>> f.x = 5
>>> f.x
'Bazinga!'
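Why does the data/non-data distinction matter? A non-data descriptor (one with no __set__()) can be shadowed by a plain instance attribute. A minimal demonstration (my own, in modern Python, not from the original post):

```python
class nondata(object):
    # only __get__ is defined: this is a non-data descriptor
    def __get__(self, obj, objtype=None):
        return 'Bazinga!'

class foo(object):
    x = nondata()

f = foo()
print(f.x)    # Bazinga! -- the descriptor is consulted
f.x = 5       # allowed: the value lands in the instance __dict__
print(f.x)    # 5 -- the instance attribute now shadows the descriptor
```

Attribute lookup gives data descriptors priority over the instance __dict__, but the instance __dict__ priority over non-data descriptors; the propertycache trick described further down exploits exactly this.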

Python decorators

Are cool. In python, any callable that wraps a function is called a “decorator”. Example: you have a function that does stuff, like

def deco(arg):
    # stuff

and another function, that does some other stuff:

def fun(arg):
    # other stuff

Sometimes it is interesting to rebind fun like

fun = deco(fun)

and deco is said to be a “decorator” for fun.

Python has convenient syntactic sugar for the rebinding above: it can be written as

@deco
def fun(arg):
    # other stuff

I often use decorators when I need to enhance the behaviour of a function by having something done before the function call, or after. So I write the following closure and use it as a decorator:

def deco(fun):
    def helper(arg):
        # do something before
        result = fun(arg)
        # do something after
        return result
    return helper

and then, for whatever fun this makes sense for, I rebind its name to

fun = deco(fun)

The propertycache decorator

Without further ado, here’s the code, taken verbatim from the mercurial codebase:

class propertycache(object):
    def __init__(self, func):
        self.func = func
        self.name = func.__name__
    def __get__(self, obj, type=None):
        result = self.func(obj)
        self.cachevalue(obj, result)
        return result

    def cachevalue(self, obj, value):
        setattr(obj, self.name, value)

and the example usage, taken straight from the mercurial wiki:

from util import propertycache

class foo(object):
    @propertycache
    def expensive(self):
        print 'created'
        return [1, 2, 3]

f = foo()
f.expensive # prints created
f.expensive # simply returns [1, 2, 3]

# to forget the saved value and calculate it again, use delattr
delattr(f, 'expensive')

How does it work

Before anything happens, at class creation time the decorator syntax rebinds the method inside the class:

expensive = propertycache(expensive)

Note that this is a descriptor object, since the class propertycache overrides the __get__() method. From this point on, you can’t call the method f.expensive() anymore:

>>> f.expensive()
<__main__.foo object at 0xb742d28c>
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'list' object is not callable

What you do, instead, is refer to f.expensive as a data member (which is, no () parens). The first time f.expensive is evaluated, the original method expensive() is called (remember: it was saved in the func member of the propertycache object). Its result is then rebound to f.expensive by the cachevalue() method.
Now f.expensive is not a descriptor object anymore, but only the returned value of the original method: further evaluations won’t go through a call of the original method anymore, they just return the cached value.

As a final remark, note that this doesn’t generalize to methods that take arguments other than the self reference, since the __get__() method has a well-defined signature and you can’t add more parameters: only a reference to the “owner” object is allowed (the instance of foo in our example).

How was it done before?

Descriptors were introduced in the Mercurial codebase with revision 595ba2537d4f by Dirkjan Ochtman, the same person who wrote the excellent chapter on the AOSA book about Mercurial.

How was caching done before? It looked like this:

class foo:
    def __getattr__(self, name):
        if name == 'expensive1':
            print 'Computing expensive1...'
            self.expensive1 = [1,2,3]
            return self.expensive1
        elif name == 'expensive2':
            print 'Computing expensive2...'
            self.expensive2 = [4,5,6]
            return self.expensive2
        raise AttributeError(name)

Let’s see that in action:

>>> f = foo()
>>> f.expensive1
Computing expensive1...
[1, 2, 3]
>>> f.expensive2
Computing expensive2...
[4, 5, 6]
>>> f.expensive1
[1, 2, 3]

Here the concertmaster is the __getattr__() method, which, if implemented, intercepts all accesses to members not found otherwise; in our example, those are expensive1 and expensive2. Once accessed the first time, their value is computed and bound to a data member; subsequent accesses will find them without passing through __getattr__().

It goes without saying that the decorator described above is much terser than this boilerplate code; it factors away all the common behaviour.

Mercurial histedit behind the scenes

What’s the magic behind the history rewriting extension in Mercurial? It’s called histedit, it’s written by Augie Fackler, and it’s a fundamental tool in my daily coding activity. I fired up pdb and watched closely what it does. Lean back and enjoy (all good ideas are Augie’s, all mistakes are mine!).

First off, I am obliged to raise a giant “don’t try this at home” sign upfront. I will use internal Mercurial APIs, which are deliberately undocumented; the reason is very simple: they might change next week without further notice.

Having said that, let me ask you a question: how would you create a changeset that is equivalent to a sequence of N other changesets? Which is, in Mercurial lingo, how would you implement folding N changesets? You’d probably smash the first two together, then add the third one to that result, then the fourth, the fifth and so on. That’s exactly what happens in histedit. Let’s see it in action.
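The pairwise smashing can be modeled with plain dictionaries, treating each changeset as a patch mapping file name to new contents (a toy model of the idea, not Mercurial’s real data structures):

```python
def smash(a, b):
    # combine two patches: b's changes win over a's
    folded = dict(a)
    folded.update(b)
    return folded

def fold(changesets):
    # fold a sequence of patches into one, left to right
    result = {}
    for cset in changesets:
        result = smash(result, cset)
    return result

csets = [{'foo': 'v1'}, {'bar': 'v1'}, {'foo': 'v2'}]
assert fold(csets) == {'foo': 'v2', 'bar': 'v1'}
```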

Before we tackle the adventurous task of editing the history of our repository, we have to learn how to do two operations:

  1. pick a changeset somewhere in the history and apply it to the working directory
  2. collapse a chain of commits into a single one

Pick there, apply here

The histedit extension special-cases two situations.

In place

consider the repo created with

$ hg init
$ touch foo ; hg add foo ; hg commit -m foo
$ touch bar ; hg add bar ; hg commit -m bar
$ hg up 0

which then looks like

what we want to do here is to extract changeset 1 and apply it to the working directory. Since the parents of cset 1 and of the working directory are the same (they’re both children of changeset 0), the situation can be handled by a simple shot of cmdutil.revert(). Let’s switch to the python shell:

from mercurial import ui, cmdutil, node, localrepo

repo = localrepo.localrepository(ui.ui(), '.')
cmdutil.revert(ui.ui(), repo, repo[1],
               (repo[0].node(), node.nullid), all=True)
repo.commit(text="Applied cset 1")

this, as promised, leads us to

The revert(ui, repo, ctx, parents, *pats, **opts) function restores the files to the state they had at revision ctx, and sets the parents of the working directory. You might think there is something odd going on here, and I did too: this is a quite unusual application of revert, which is more often used to restore an ancestor revision. Here we’re picking stuff from another branch (say, from “the future”).

Far away nodes: merge needed

The other situation is when the parents of the working directory and of our target revision differ but… those two (different) parents have the same content. It might seem an artificial situation, but it is what happens in a generic step of the histedit algorithm. In that case we resort to a merge. Let’s first build the sample repo:

$ hg init
$ touch foo ; hg add foo ; hg commit -m foo
$ touch bar ; hg add bar ; hg commit -m bar
$ touch baz ; hg add baz ; hg commit -m baz
$ hg up 0

then again the revert() trick to apply changeset 1 on the top of 0, but on a parallel line; the code is exactly the python snippet above, so we have

what’s the histedit recipe to apply 2 on the top of 3? We merge the working directory (updated to 3) with 2, setting 1 as the common ancestor; this isn’t the topological truth (1 is not an ancestor of 3) but is morally ok, since 1 and 3 contain the same state.

repo = localrepo.localrepository(ui.ui(), '.')
merge.update(repo, repo[2].node(), True, True, False, repo[1].node())
repo.setparents(repo[3].node(), node.nullid)
repo.commit(text="Applied cset 2")

here the result:

Note how I carefully clean up the parents of the changeset, removing traces of the merge that just happened.

All of this is nicely packaged in the histedit.applychanges() function in the histedit extension.

Collapse many into one

Again, let’s kickstart with a toy repository:

$ hg init
$ touch foo ; hg add foo ; hg commit -m foo
$ touch bar ; hg add bar ; hg commit -m bar
$ touch baz ; hg add baz ; hg commit -m baz
$ touch woo ; hg add woo ; hg commit -m woo

We want to take changesets 1 to 3 and smash them into one that sums them all. Here is the python code:

from mercurial import ui, cmdutil, node, localrepo, context

repo = localrepo.localrepository(ui.ui(), '.')

files = set(repo[1].files()) | set(repo[2].files()) | set(repo[3].files())
mf = repo[3].manifest()

def filectxfn(repo, ctx, path):
    if path in mf:
        fctx = repo[3][path]
        mctx = context.memfilectx(fctx.path(), fctx.data())
        return mctx
    raise IOError()

parents = (repo[0].node(), node.nullid)
new = context.memctx(repo, parents,
                     text='collapsed 1 to 3',
                     files=sorted(files),
                     filectxfn=filectxfn)
repo.commitctx(new)

Again, there is a function for this in the histedit extension, histedit.collapse(). I was surprised looking at that code: I have a mental model of commits as, say, an ordered sequence of diffs; that’s how I read this picture anyway. If I am collapsing changesets, I expect some sort of iteration over that chain. But that’s not what collapse() does. Look at the closure filectxfn: basically it considers only filelogs from the last changeset, skipping those filelogs that didn’t change since the first. As if the last changeset already carried everything necessary to know what happened in between.

Which is exactly how mercurial works: filelogs keep the diff sequence internally, but externally only full contents are exposed. So basically all you can do is ask a filelog for the contents of a file, plus its parents, at a given version.
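That “the last snapshot already has everything” idea can be sketched with dicts: model each revision as the full repo state (path to contents), so collapsing a range means reading every touched file from the last state (again a toy model, not the real filelog API):

```python
def collapse(snapshots, touched):
    # snapshots: full repo states, oldest first;
    # touched: files changed anywhere in the range.
    # The last snapshot already carries the final contents of every
    # touched file, so no intermediate snapshot is consulted.
    last = snapshots[-1]
    return {path: last[path] for path in touched if path in last}

snaps = [
    {'chap1.txt': 'line 1'},
    {'chap1.txt': 'line 1\nline 2'},
    {'chap1.txt': 'line 1\nline 2\nline 3'},
]
assert collapse(snaps, {'chap1.txt'}) == snaps[-1]
```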

When does this show start? It’s getting late

With all this machinery in place, we’re now ready to juggle those little balls so fast that TortoiseHg will have a hard time catching up.

First off, the toy repo. For the occasion we’re giving foo, bar and baz a little break and use some piece of real literature.

$ hg init
$ echo 'Dorothy lived in the midst of the great\
> Kansas prairies, with Uncle' > chap1.txt
$ hg add chap1.txt ; hg commit -m "line 1"

$ echo "Henry, who was a farmer, and Aunt Em,\
> who was the farmer's wife. Their" >> chap1.txt
$ hg commit -m "line 2"

$ echo 'house was small, for the lumber to build it\
> had to be carried by wagon' >> chap1.txt
$ hg commit -m "line 3"

$ echo 'many miles.  There were four walls, a floor\
> and a roof, which made one' >> chap1.txt
$ hg commit -m "line 4"

$ echo 'room; and this room contained a rusty looking\
> cookstove, a cupboard for' >> chap1.txt
$ hg commit -m "line 5"

$ echo 'the dishes, a table, three or four chairs,\
> and the beds.' >> chap1.txt
$ hg commit -m "line 6"

Our plan is to fold all changesets from 1 to 5 inclusive into a single one. That is, we would run

hg histedit 1

The histedit commands will be:

(pick, 1)
(fold, 2)
(fold, 3)
(fold, 4)
(fold, 5)

The first thing we do is update the working directory to 1 and play the cmdutil.revert() trick to apply 2 on the top of 1.

from mercurial import ui, hg, commands, localrepo, node, merge
from mercurial import cmdutil, scmutil, context, repair
from hgext import histedit

repo = localrepo.localrepository(ui.ui(), '.')

hg.update(repo, repo[1].node())

cmdutil.revert(ui.ui(), repo, repo[2],
               (repo[1].node(), node.nullid), all=True)
repo.commit(text="2 bis")

Now we again update to 0, and collapse the line 1 -- 6 into changeset 7, which we apply on the top of 0.

hg.update(repo, repo[0].node())
histedit.collapse(repo, repo[1], repo[6], {})
hg.update(repo, repo[7].node())

Now we want to apply 3 on the top of 7. That’s the merge case we saw earlier; our reference ancestor will be 2, since 0 -- 7 is the same as 0 -- 1 -- 2.

merge.update(repo, repo[3].node(), True, True, False, repo[2].node())
repo.setparents(repo[7].node(), node.nullid)
repo.commit(text="7 and 3")

Now guess what? collapse the line 7 -- 8, call it 9.

hg.update(repo, repo[0].node())
histedit.collapse(repo, repo[7], repo[8], {})
hg.update(repo, repo[9].node())

Come on, a couple more and we’re done: 4 on top of 9, then 5 on top of 11.

merge.update(repo, repo[4].node(), True, True, False, repo[3].node())
repo.setparents(repo[9].node(), node.nullid)
repo.commit(text="4 bis")

hg.update(repo, repo[0].node())
histedit.collapse(repo, repo[9], repo[10], {})
hg.update(repo, repo[11].node())

merge.update(repo, repo[5].node(), True, True, False, repo[4].node())
repo.setparents(repo[11].node(), node.nullid)
repo.commit(text="5 bis")

now the final collapse: changeset 13 is the result of 1 -- 2 -- 3 -- 4 -- 5 all folded together.

hg.update(repo, repo[0].node())
histedit.collapse(repo, repo[11], repo[12], {})
hg.update(repo, repo[13].node())

Now it’s time to throw away the leftovers. repair.strip() comes to the rescue. Note that after each strip the surviving revisions are renumbered, so the head of the next leftover chain becomes revision 1 again; that’s why the very same call is repeated four times:

repair.strip(ui.ui(), repo, repo[1].node())
repair.strip(ui.ui(), repo, repo[1].node())
repair.strip(ui.ui(), repo, repo[1].node())
repair.strip(ui.ui(), repo, repo[1].node())