The “next branch” workflow and criss-cross merges

I’ll describe the so-called “next branch” workflow, used for the development of the Git project and for the Linux kernel, and illustrate how it systematically leads to criss-cross merges. A criss-cross merge happens when the greatest common ancestor of the nodes to be merged is not unique. It’s more of a curiosity; I made this observation and wanted to share it. Feel free to prove me wrong if I am missing things.
If you are in a hurry, you can skip the text and jump from image to image reading the captions; that should give you a fair idea of the content of this post.

The “next branch” workflow

This workflow is described in great detail in the gitworkflows(7) manpage. The main idea behind it is to keep a so-called “throw-away integration branch” called “next” (hence the name), where you merge all the contributions from your developers. The purpose of next is to see how all those features interact together “in the wild”: some people build the software directly from next, use it and report back to the developers. With next acting as a “buffer” for features, you can decide which ones are ready for prime time and which ones need more time, testing and improvements. It’s not “all or nothing”: you can select what you like.
When you’re satisfied with a feature that is sitting in next you merge that feature alone into your “principal branch” (as Git calls it, “master”), the one that sooner or later will be tagged, released and distributed to your users; this process is called “graduation”.
Note that some features might graduate from next to master very quickly, while others might sit in next for a long time before seeing master and hitting a release.

Let’s see the workflow in action stage by stage.

It all starts from master

p1

As described in the gitworkflows(7) manpage, this workflow uses two more branches: maint and pu (which stands for “proposed updates”). The most stable of the four is maint (the maintenance branch, meant to host fixes to the previous release), then come master and next, and the most experimental of them all is pu. I will consider neither maint nor pu, but limit this description to master and next for the sake of simplicity.
So, for what matters to this post, “master” is the principal and most stable branch.

next is an experimental fork of master

p2

next, our “throw-away” integration branch, is a fork of master and hosts features that you aren’t yet sure work well. Since it’s clear to everybody that next is only for testers and developers, regular users who build from the source know that the software from next is not to be used in production.

Contributions are based on master

p3

Most of the developers send their patches to the mailing list. Once those diffs are peer reviewed, the Git maintainer (Junio Hamano) applies them in his repository as “topic branches” that fork off master. Junio’s repository, with all those topic branches, is available at http://github.com/gister/git.

Topic branches are merged into next

p4

Alice and Bob, from the picture, are good developers and wrote patches that pass all tests and had favorable reviews on the mailing list. But how do their changes interact together? The combined result might not work at all.
This is the crux of the whole workflow: Junio uses a throw-away integration branch, next, to try things out.
next is built and used by people willing to tolerate alpha software and to help the developers by reporting their experiences with it. I hear that the Debian community is involved in using Git built from next and reporting what they see.
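
To make it concrete, the integration step boils down to something like this (just a sketch, with made-up topic branch names):

$ git checkout next
$ git merge alice/topic-a      # try Alice's feature in next
$ git merge bob/topic-b        # and Bob's, on top of it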

Topics may (or may not) graduate to master

p5

In the picture, the topic branch from Bob is merged into master. This process is called “graduation”, and happens when Junio sees that Bob’s feature is sufficiently mature for prime time. master will eventually be tagged as an official Git release, so graduation is kind of a big deal.
Note that in my picture Bob was the first to be merged into next, and his is also the first feature to graduate. Things might not happen in that order. Namely, a feature can stay in next indefinitely — there is no timeline for graduation. It might not happen at all. The whole point of the “next workflow” is to have next acting as a buffer. When you like a feature, you pick it from next and merge it into master. Note that this doesn’t happen via cherry-picking, but via merging branches; merging offers some benefits over cherry-picking (a quick sketch follows the list below):

  • changes are in a single place in the graph, as opposed to having the same changeset in many places under different hashes.
  • merges are easier to handle, since with a single merge you can add many commits at once to a development line, as opposed to cherry-picking commits one by one.
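
A graduation is then just a merge of the whole topic into master; sketched with the same made-up names:

$ git checkout master
$ git merge bob/topic-b        # graduate the whole topic with a single merge
# the alternative, not used here, would copy each commit under a new hash:
# git cherry-pick <commit-1> <commit-2> ...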

p6

Uh, look, while we were talking about merging and cherry picking, Alice’s topic graduated to master. Congratulations, Alice!

What if we merge next into master?

This would transform the “next branch workflow” into a completely different thing. I’ve seen this workflow erroneously called the “topic branches workflow”, as if its central trait were the fact that developers do their work in separate branches. While that is an important characteristic, topic branches are not the defining trait of this way of working; or better: it’s not what surprised me the most when learning about this workflow. The very point is that you try things out in next, and when you’re happy you put them in master.
If you were to merge the whole next into master, it would become “all-or-nothing”; no more freedom to accept mature features into the release while giving less stable developments some more time to stabilize.

Bugfixes go to the oldest stable branch…

p7

There are changes that are both trivially correct and urgently needed. Those bugfixes get applied directly on a stable branch, skipping all the graduation dance.

… and propagate to experimental branches via merges

p8

The small bugfixes I mentioned above aren’t needed only in the stable branch, they’re needed everywhere. The reason they get applied in master and then propagated via a merge of master into next is that merging in the opposite direction is forbidden.
Think of it: if you applied the bugfix to next first, you’d then want to merge next into master. That’s not allowed: such a merge would carry a lot of things that are not for master (all the topics that haven’t graduated yet). This is to say: it has to be this way.
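
In command form, the one-way propagation looks roughly like this (a sketch; bugfix.patch is a hypothetical patch file taken from the mailing list):

$ git checkout master
$ git am bugfix.patch          # the trivially correct, urgent fix lands in master
$ git checkout next
$ git merge master             # and reaches next via a merge of master into next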

Here comes the criss-cross merge

The very reason I wrote this post is to share the following remark: when you merge master into next you get a criss-cross merge. A criss cross merge is one where the greatest common ancestor of your two heads is not unique. Let’s see how the head of Alice’s topic branch is a greatest common ancestor for master and next:

px1

as you can imagine, Bob’s branch head is also a greatest common ancestor:

px2

Criss-cross merges might or might not be a big deal. Sometimes they cause headaches. If you use this workflow and find yourself in this situation, I hope I have provided some useful insight into the topology of the history graph that can help you understand what’s going on.

master and next continue their parallel lives…

p9

The steps above are repeated as necessary: master and next continue to evolve.

… until the end of the cycle

I said that next is a “throw-away” branch. If next and master really were to live side by side indefinitely, you’d end up with two branches whose content differs a lot. At some point, you want to kill next and start afresh. At the end of a “cycle”, which for Git typically lasts 8 to 10 weeks, next is indeed erased and rebuilt afresh by merging all the topics that haven’t graduated yet. This gives the opportunity to restart clean, reorder topics, etc.

This step is of capital importance to stay on top of all this growing complexity, and it helps to reason about what is where. An invariant like “next has been forked off master no more than 10 weeks ago” might come in handy.
But it is also a lot of work: all the conflicts you had merging topics into next will show up again, and again, and again, until the conflicting topics finally get into master. Junio uses a Git feature called rerere to remember how he resolved conflicts in the past. Whatever trick you use, the luxury of merging features selectively from a sandbox branch to a production branch comes at the price of a lot of housekeeping.
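
The end-of-cycle rebuild would go roughly along these lines (topic names are invented; rerere is what spares you from resolving the same conflicts yet again):

$ git config rerere.enabled true
$ git checkout next
$ git reset --hard master          # throw the old next away
$ git merge carol/topic-c          # re-merge whatever hasn't graduated yet
$ git merge dave/topic-d
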
It has been rightly observed that there’s no good reason to do this if you’re not automating next. This is what Linux does: linux-next is an automatically built branch that combines more than 50 sources and detects conflicts and similar problems early.

Conclusion

  • The next branch workflow uses two parallel branches: one is production and the other is a sandbox where you try things out. This architecture gives you complete freedom on what features to promote from the sandbox to the real world.
  • It reflects a development model where you operate under no tight deadlines and only code that meets the highest quality standards is allowed to be released.
  • The complexity of this workflow is nontrivial, and can only be handled if the next branch is built automatically.
  • If the product of your shop is a web service, i.e. you have marketing pushing for features, you operate your own platform and you can patch the servers on the fly if needed, then a branching model where you merge the whole “development branch” into production all at once, on scheduled release dates, is probably better suited to your needs. The next branch model lets you put features on hold indefinitely, but the sales people will be on your neck anyway.

Thanks to Antoine Pelisse who introduced me to this workflow, and for the many insightful discussions we had on this.

Exploding intersections with the Inclusion-Exclusion Principle

How many elements are there in the union of two sets? The sum of those in each set respectively, you say? Of course not, there might be some elements in common that you counted twice. What if the sets involved are three? Or four?

The answer is known as “the Inclusion-Exclusion Principle”, and in this post I’ll describe how to use it to “explode intersections”, i.e. to give the cardinality of each region of a Venn diagram. In a picture, here is the case of three sets:

The Inclusion-Exclusion Principle tells us that the cardinality of the union of three sets is given by the formula

|A ∪ B ∪ C| = |A| + |B| + |C|
              - |A∩B| - |A∩C| - |B∩C|
              + |A∩B∩C|

which doesn’t come as too much of a surprise: we first sum the sets’ cardinalities, then see that we added the “two-by-two” intersections once too many and subtract them, but then hey, we added the “all-in” intersection three times and subtracted it three times, so we have to put it back in.

The generalized case follows the same pattern: add all “order one” intersections (the sets alone), remove all “order two” intersections (the two-by-two), add back all “order three” ones (the three-by-three), remove all “order four”, and so on until no more cases exist.
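
In symbols, for n sets A1, …, An, with each sum running over all the intersections of the given order:

|A1 ∪ … ∪ An| = Σ|Ai| - Σ|Ai∩Aj| + Σ|Ai∩Aj∩Ak| - … + (-1)^(n+1) |A1∩…∩An|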

When thinking of that formula applied to our problem of “exploding” the intersections, I find it useful to visualize the situation with the following lattice, where I stress the fact that higher order intersections are contained in some lower order ones. Nodes in the lattice are intersections (all nicely lined up by order), and edges represent the partial order relationship “contains”.

The diagram reminds us that, for instance, to get how many elements are only in A (and not in B nor C), one has to follow all the paths from A down to the highest order intersection, flipping the sign as the order of the intersections increases. Which is: sum |A∩B| and |A∩C|, then subtract |A∩B∩C|. The result corresponds to the cardinality of (A∩B) ∪ (A∩C), and has to be subtracted from the cardinality of A.
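
Written out in the same style as the formulas below, for the three-set case this reads:

in A only = |A| - |A∩B| - |A∩C|
                + |A∩B∩C|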

Slightly harder: 4 sets instead of 3. Following the diagram above, we can compute the number of items in A but in none of the other sets by following all the paths that go from A to the very bottom intersection A∩B∩C∩D, and using the principle to know what to add and what to subtract:

in A only = |A| - |A∩B| - |A∩C| - |A∩D|
                + |A∩B∩C| + |A∩B∩D| + |A∩C∩D|
                - |A∩B∩C∩D|

In this other example we want to see how many things are in A and C but in none of the other sets; same story: follow the paths and flip the sign at each step down.

in A∩C only = |A∩C| - |A∩B∩C| - |A∩C∩D|
                    + |A∩B∩C∩D|

Now we implement this algorithm in code. First some helper functions; list flattening:

def flatten(listlist):
    return [elem for list_ in listlist for elem in list_]
>>> flatten([['a', 'b'], ['c', 'd', 'e']])
['a', 'b', 'c', 'd', 'e']

permutations of a list:

def perm(sequence):
    if len(sequence) == 0:
        return [[]]
    else:
        out = []
        for cnt, elem in enumerate(sequence):
            out += map(lambda x: [elem] + x,
                       perm(sequence[:cnt] +
                            sequence[(cnt+1):]))
        return out
>>> perm(['a', 'b', 'c'])
[['a', 'b', 'c'],
 ['a', 'c', 'b'],
 ['b', 'a', 'c'],
 ['b', 'c', 'a'],
 ['c', 'a', 'b'],
 ['c', 'b', 'a']]

All subsets of given cardinality of a list:

def choose_n(n, srcList):
    if n == 0:
        return [[]]
    else:
        out = []
        for cnt, elem in enumerate(srcList):
            out += map(lambda x: [elem] + x,
                       choose_n(n-1, srcList[(cnt+1):]))
        return out
>>> choose_n(2, ['a', 'b', 'c'])
[['a', 'b'], ['a', 'c'], ['b', 'c']]

All possible subsets of a list, excluded the empty one:

def allsubsets(srcList):
    out = []
    tot = len(srcList)+1
    complem = lambda universe, ensemble: \
        list(set(universe) - set(ensemble))
    complSrcList = lambda ensemble: complem(srcList,
                                            ensemble)
    for i in range(1, tot):
        if i > tot / 2: # integer division
            yield map(complSrcList,
                      choose_n(tot - (i + 1), srcList))
        else:
            yield choose_n(i, srcList)

Note the optimization: for subsets of more than half the list I enumerate the complements instead, so that choose_n is always called for the smaller cardinality (the amount of work it does grows with the size of the subsets requested).

>>> list(allsubsets(['a', 'b', 'c']))
[[['a'], ['b'], ['c']],
 [['a', 'b'], ['a', 'c'], ['b', 'c']],
 [['a', 'c', 'b']]]

Now a function to precompute all intersection values. We store them in a dictionary where values are intersection cardinalities, and keys are the names of the sets involved in the intersection. Many items in this lookup table represent the same intersection, with a different order of the sets in the key. I do this to speed up the lookup: instead of normalizing the queries, I store the answer to all possible queries.

def intersLookup(lists):
    toInters = flatten(allsubsets(lists.keys()))
    def inters_n(names):
        return reduce(lambda s, t: set(s) & set(t),
                      [lists[name] for name in names[1:]],
                      lists[names[0]])
    lookup = {}
    for sequence in toInters:
        cardinality = len(inters_n(sequence))
        for variant in perm(sequence):
            lookup['&'.join(variant)] = cardinality
    return lookup
>>> intersLookup({'a': ['apple', 'banana'],
...               'b': ['orange', 'apple', 'watermelon'],
...               'c': ['peach', 'plum', 'pear', 'apple', 'orange']})
{'a': 2, 'c': 5, 'b': 3, 
 'a&b&c': 1, 'b&c&a': 1, 'c&b&a': 1,
 'a&c&b': 1, 'b&a&c': 1, 'c&a&b': 1,
 'b&a': 1, 'a&b': 1,
 'b&c': 2, 'c&b': 2,
 'c&a': 1, 'a&c': 1}

The following generator, named subints as in “sub-intersections”, is particularly important since it implements the concept of “walking down all paths” in the above diagrams, “one level at a time”. It returns the nodes of the lattice conveniently grouped for the sign flipping that happens in the Inclusion Exclusion formula.

def subints(elem, compl):
    for thing in allsubsets(compl):
        yield map(lambda x: elem + x, thing)

See the example above, the first one I gave for the 4-set case. We want to see how much of A is “eaten” by the sub-intersections. What to sum, what to subtract? Here is the answer:

>>> si = subints(['a'], ['b', 'c', 'd'])
>>> si.next() # add
[['a', 'b'], ['a', 'c'], ['a', 'd']]
>>> si.next() # subtract
[['a', 'b', 'c'], ['a', 'b', 'd'], ['a', 'c', 'd']]
>>> si.next() # add
[['a', 'c', 'b', 'd']]
>>> si.next() # it's over
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration

The trick is to observe that, in a sub-lattice of our diagrams, the lower levels are just the nodes of the upper levels intersected with the sets left out. Now it’s clear why allsubsets works the way it does: it returns subsets grouped by size, so that subints can be written more concisely.
Let’s see the second example above, again from the 4-set case:

>>> si =  subints(['a', 'c'], ['b', 'd'])
>>> si.next()
[['a', 'c', 'b'], ['a', 'c', 'd']]
>>> si.next()
[['a', 'c', 'b', 'd']]
>>> si.next()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration

We now have everything needed to write the inclusion-exclusion formula:

def flip():
    curr = 1
    while True:
        yield curr
        curr *= -1

def o(f, g):
    # function composition
    def helper(x):
        return f(g(x))
    return helper

def inclusionexclusion(elem, compl, allInters):
    currentInters = allInters['&'.join(elem)]
    sign = flip()
    subintValue = \
        sum(map(lambda (sign, value): sign * value,
                zip(sign,
                    map(o(sum, (lambda nodes: 
                                map(lambda node: allInters['&'.join(node)],
                                    nodes))),
                        subints(elem, compl)))))
    return currentInters - subintValue

Let’s see:

>>> allInters = intersLookup({'a': ['apple', 'banana'],
                'b': ['orange', 'apple', 'watermelon'],
                'c': ['peach', 'plum', 'pear', 'apple', 'orange']})

how many items in a and b but not in c?

>>> inclusionexclusion(['a', 'b'], ['c'], allInters)
0

zero, since the only common element is apple, but it’s also in c. How many elements in the intersection of all three?

>>> inclusionexclusion(['a', 'b', 'c'], [], allInters)
1

the apple, indeed. How many in a, but not in b nor c?

>>> inclusionexclusion(['a'], ['b', 'c'], allInters)
1

that’s the banana.

The last thing left to do is to package all of this into a shiny function that computes our complete explosion:

# complem was a local helper inside allsubsets; explode needs it too,
# so it is assumed to be defined at module level as well
def complem(universe, ensemble):
    return list(set(universe) - set(ensemble))

def explode(allNames, allInters):
    out = {}
    for elem in flatten(allsubsets(allNames)):
        out['&'.join(elem)] = \
                    inclusionexclusion(elem,
                                       complem(allNames, elem),
                                       allInters)
    return out

There we go:

>>> sets = {'a': ['apple', 'banana'],
...         'b': ['orange', 'apple', 'watermelon'],
...         'c': ['peach', 'plum', 'pear', 'apple', 'orange']}
>>> 
>>> allInters = intersLookup(sets)
>>> explode(sets.keys(), allInters)
{'a': 1, 'a&c&b': 1, 'c': 3, 'b': 1, 'c&b': 1, 'a&b': 0, 'a&c': 0}

The code is available at https://github.com/gghh/intersections. Images are drawn with gimp and zwibbler. The 4-sets Venn diagram comes from Wikipedia.

The annoying merge conflicts in case of criss cross

Here I show an example of a problematic merge that I sometimes run into with Mercurial. What happens is that, for a given file foo.txt, two commits, one the parent of the other, get in competition at merge time. Obviously the child edit should be selected as the winner of the merge, but this requires user intervention when it shouldn’t be necessary.

Here is a minimal example of history:

Our goal is to merge nodes 5 and 6. The first thing we note is that the GCA isn’t unique, since both 1 and 4 are good candidates. Since for my example I need 4 to be selected (so that things can go bad), I inserted node 3; this way 4 is the common ancestor furthest from the root, and it will therefore be selected.

Why did I color 0, 1 and 2 in pink? Because those are the only nodes where I edit my foo.txt file, if we exclude merges. Before I show a “Try It And See” example with Mercurial, I’ll describe what happens at each node, in plain English (the numbers below match the nodes in the picture).

  0. foo.txt is edited.
  1. foo.txt is edited again.
  2. foo.txt is edited a third time.
  3. Somebody forks at 0. foo.txt is not modified.
  4. Development continues on top of 3. foo.txt is not modified. At this stage, foo.txt has the very same content as it had in 0.
  5. 4 is merged with 1. With respect to foo.txt, the 3-way merge algorithm automatically selects the changes in 1 as winners: if a line is changed in “this” and unchanged in “other”, the change is selected.
  6. 2 is merged with 4. Same story, ordinary business for 3-way merge: the most recent change is selected.

And then we want to merge 5 and 6. Mercurial selects 4 as the base. Let’s observe things from foo.txt’s point of view:

  • (this) in 6 we have the latest version (from 2).
  • (other) in 5 we have the before-the-last version (from 1). This should not “compete” with the latest one… should it?
  • it does, because in 4 (the base of the merge) we have the very first version, so the 3-way merge has no idea which of “this” and “other” is more recent. They are both changes as far as a comparison with the early version from 0 (inherited by 4) can tell.

Now, the real example. First, we go on with commits 0, 1 and 2:

$ hg init
$ echo 'I am a cow.' > foo.txt ; hg add foo.txt ; hg commit -m foo
$ echo 'I have superpowers.' > foo.txt ; hg commit -m foo
$ echo 'I am a cow with superpowers.' > foo.txt ; hg commit -m foo

Then we create the second branch, with commits 3 and 4 forking from 0.

$ hg up 0
$ touch bar.txt ; hg add bar.txt ; hg commit -m bar
$ touch baz.txt ; hg add baz.txt ; hg commit -m baz

We are now at 4, and we merge with 1. The base will be 0, and all goes fine: the most recent version of foo.txt is selected.

$ hg merge 1 ; hg commit -m merged
$ cat foo.txt 
I have superpowers.

We then update to 2 and merge with 4. Again, no surprises from the merge. The most recent version (i.e. 2) is selected.

$ hg up 2
$ hg merge 4 ; hg commit -m merged
$ cat foo.txt 
I am a cow with superpowers.

A look at the history so far:

$ hg log --graph --template '{rev} {desc}'
@    6 merged
|\
| | o  5 merged
| |/|
| o |  4 baz
| | |
| o |  3 bar
| | |
o---+  2 foo
 / /
| o  1 foo
|/
o  0 foo

Being now at 6, we merge with 5:

$ hg merge

As expected, the stupid conflict.

It is worth noting that the Consensus Merge algorithm would not solve this situation. It would take into account the two candidates for GCA (1 and 4), and weigh their “opinions” equally. But one is right and the other is wrong: in this small example, democracy fails to apply.

My naive observation is that when the history of a file is all on the same side of the merge, that file should not be touched at merge time. Just take the (single) head of the sub-DAG that tracks all the commits affecting said file. But maybe I am missing undesired side effects of this approach. Also, thinking “per file” is usually the wrong way to reason about source control. Considering the repository as a whole is something that Mercurial and Git got right with respect to their predecessors. A “file” is a very artificial way to split code into chunks, as far as version control is concerned. The atomic entity is “the single line of code”.

In this case, with multiple GCAs, one can see that one of those GCAs is good for the file we’re considering (foo.txt should be merged with 1 as the base), and the other is not. I wonder if the following principle holds: “if the GCA is unique, then it is optimal for all files (better: all lines of code)”, where by “optimal” I mean “it doesn’t compare a revision against its ancestors in stupid ways”. On the other hand, if there are many GCAs, each line of code has its favourite one, and it’s impossible to choose.

How does Mercurial select the Greatest Common Ancestor

The GCA of two nodes X and Y in a DAG (Directed Acyclic Graph) is an ancestor of both X and Y such that no descendant of it is also a common ancestor of X and Y. This definition is equivalent to the revset expression heads(ancestors(X) and ancestors(Y)), i.e. the head (if unique) of the intersection among the ancestors of X and those of Y.

Mercurial adopts a slightly different definition: the GCA is the common ancestor of the two nodes which is furthest from a root, as measured by the longest path. This makes sense, since the GCA is used as base for merges and it’s meaningful to take a (topologically) more recent version in case of ambiguity.
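
Both definitions can be played with directly from the command line using revsets (a quick sketch; X and Y stand for any two revisions):

$ hg log -r "heads(ancestors(X) and ancestors(Y))"   # all the GCA candidates
$ hg log -r "ancestor(X, Y)"                         # the one Mercurial actually picks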

I am going to discuss the algorithm that Mercurial used for GCA computations until version 2.5.4; in 2.5.5 this code was replaced by a faster algorithm.

The interesting part is what happens when there are multiple candidates for the GCA, i.e. the GCA is not unique. What happens is that Mercurial will pick the one it finds first. Here we walk through the algorithm step by step and see how it works.

First off, where to find things: the ancestor() function lives in the mercurial.ancestor module. Here is the docstring:

$ python

>>> from mercurial import ancestor
>>> help(ancestor.ancestor)

ancestor(a, b, pfunc)
    Returns the common ancestor of a and b that is furthest from a
    root (as measured by longest path) or None if no ancestor is
    found. If there are multiple common ancestors at the same
    distance, the first one found is returned.
    
    pfunc must return a list of parent vertices for a given vertex

We’ll work with the following graph, where GCA(4, 5) isn’t unique: candidates are 1 and 2, equally good (parents are on the left and children on the right).

We can try it right away, like this:

>>> parents = {0: [], 2: [0], 1: [0], 3: [2],
...            5: [1, 2], 4: [3, 1]}
>>> 
>>> pfunc = lambda x: parents[x]
>>> ancestor.ancestor(4, 5, pfunc)
1

1 is chosen… apparently it’s the one found first. Let’s dig deeper into the algorithm. The ancestor() function does its job in two steps:

  1. prepare a depth dictionary that associates each ancestor of both nodes X and Y to its distance from a root, computed as the longest path to it.
  2. use a generator for each of the two starting vertices that, at each iteration, returns all the ancestors “of the same generation”, i.e. all the ancestors at a given distance d from the root. The two generators are such that they return generations in descending order of distance. If you want, you can picture this as two ancestor lists that “race” towards the root, until they meet at the same distance level (or don’t meet, in which case no common ancestor is found).

I have split the code into three separate functions for this explanation. First, the construction of the depth dictionary:

def depthf(a, b, pfunc):
    a, b = sorted([a, b])
    
    # find depth from root of all ancestors
    # depth is stored as a negative for heapq
    parentcache = {}
    visit = [a, b]
    depth = {}
    while visit:
        vertex = visit[-1]
        pl = pfunc(vertex)
        parentcache[vertex] = pl
        if not pl:
            depth[vertex] = 0
            visit.pop()
        else:
            for p in pl:
                if p not in depth:
                    visit.append(p)
            if visit[-1] == vertex:
                # -(maximum distance of parents + 1)
                depth[vertex] = min([depth[p] for p in pl]) - 1
                visit.pop()
    return depth, parentcache

The parentcache dictionary is there to save some time at a later step; we’ll pass it along to the forthcoming functions.

What’s to note here is that we can compute the depth of a node only once we have the depth of all its parents, taking the maximum and adding 1 (that’s the line depth[vertex] = min([depth[p] for p in pl]) - 1: depths are stored negated, hence the min and the -1).

We visit nodes in a depth-first fashion. We’ll encounter a lot of unseen vertices and append them to the visit list (the visit.append(p) call), until at some point we hit a root and set its depth to 0. At that point we start backtracking, pop()-ing nodes from visit as soon as we compute their depth.

Note that the condition visit[-1] == vertex is equivalent to “all parents of vertex already have their depth computed”, and it works because we are handling DAGs. Strictly speaking, the point is that no node can be a parent of itself: the only thing that could have changed the last element of visit since we read it into vertex is the for loop that appends the missing parents, and a node never appends itself.

Here is the result when applied to our toy example:

>>> depth, parentcache = depthf(4, 5, pfunc)
>>> depth
{0: 0, 1: -1, 2: -1, 3: -2, 4: -3, 5: -2}

Depths are stored negated since we’ll later use a heap to figure out the furthest elements.

Here is the helper function that traverses the ancestors in order of decreasing distance from the root:

def ancestors(vertex, depth, parentcache):
    h = [(depth[vertex], vertex)]
    seen = set()
    while h:
        d, n = heapq.heappop(h)
        if n not in seen:
            seen.add(n)
            yield (d, n)
            for p in parentcache[n]:
                heapq.heappush(h, (depth[p], p))

A few things to note here:

  1. duplicate nodes are removed
  2. since the distance of a node is always larger than the distance of any of its parents, it’s safe to pop() an element and only later push() its parents into the heap. Subsequent pop() will always return elements in the desired order.

Here is the result of applying it to vertex 4:

>>> list(ancestors(4, depth, parentcache))
[(-3, 4), (-2, 3), (-1, 1), (-1, 2), (0, 0)]

Another helper function: we need to get all ancestors of a given depth, not one by one. This one does the buffering into “generations”:

def generations(vertex, depth, parentcache):
    sg, s = None, set()
    for g, v in ancestors(vertex, depth, parentcache):
        if g != sg:
            if sg:
                yield sg, s
            sg, s = g, set((v,))
        else:
            s.add(v)
    yield sg, s

Here are the generations that lead to vertex 4:

>>> list(generations(4, depth, parentcache))
[(-3, set([4])), (-2, set([3])), (-1, set([1, 2])), (0, set([0]))]

Finally, here is the function that puts the pieces together and gives the final result:

def gca(a, b, depth, parentcache):
    x = generations(a, depth, parentcache)
    y = generations(b, depth, parentcache)
    gx = x.next()
    gy = y.next()
    # increment each ancestor list until it is closer to root
    # than the other, or they match
    try:
        while True:
            if gx[0] == gy[0]:
                for v in gx[1]:
                    if v in gy[1]:
                        return v
                gy = y.next()
                gx = x.next()
            elif gx[0] > gy[0]:
                gy = y.next()
            else:
                gx = x.next()
    except StopIteration:
        return None

gx and gy are tuples like (depth, nodes). A GCA can be found only when gx[0] == gy[0]; since we start from the maximum depths and progressively decrease them, we’re guaranteed to find an element of the intersection of the ancestors of the original nodes that is furthest away from the root, or to find nothing at all.

Here we run it, showing the value of gx and gy as we go along:

>>> gca(4, 5, depth, parentcache)
gx (-3, set([4]))
gy (-2, set([5]))
----
gx (-2, set([3]))
----
gx (-1, set([1, 2]))
gy (-1, set([1, 2]))
----
1

Note that 1 is chosen, instead of 2, for no particular reason: just because it’s the first element of list(set([1, 2])).

Interestingly, if we add a node 6 between 0 and 2, this makes 2 the common ancestor furthest from the root, and so the algorithm doesn’t see any ambiguity in the choice anymore:

>>> parents = {0: [], 2: [6], 1: [0], 6: [0], 3: [2],
...            5: [1, 2], 4: [3, 1]}
>>> 
>>> pfunc = lambda x: parents[x]
>>> 
>>> depth, parentcache = depthf(4, 5, pfunc)
>>> gca(4, 5, depth, parentcache)
gx (-4, set([4]))
gy (-3, set([5]))
----
gx (-3, set([3]))
----
gx (-2, set([2]))
gy (-2, set([2]))
----
2

In the last iteration the intersection between gx[1] and gy[1] is the element 2 alone.

Caching in python with a descriptor and a decorator

The Mercurial SCM has several strategies to cache expensive computations; you can find an overview of all of them at this page. This post is about the propertycache decorator: a mechanism to replace a function with its result the very first time it is called. The trick is made possible by the combination of two features of the python programming language: descriptors and decorators.

Python descriptors

The python descriptor protocol works as follows: if you override the methods __get__(), __set__() or __delete__() in a (new style) python class, you’re given the chance to execute your own code when an object of that class is evaluated (__get__()) or is bound to another object (__set__()). That is, you get to decide what happens when that evaluation (or binding) occurs.

One more thing to remember is that an object that overrides __get__() and __set__() behaves like a descriptor (which is: these methods are actually called instead of the default evaluation/binding behaviour) only if it is a class attribute of another class.


A dummy example

Here we use the “descriptor protocol” to prevent user code from assigning to an attribute of an object. That attribute is kept at a constant value instead:

class joke(object):
    def __get__(self, obj, objtype):
        return 'Bazinga!'
    def __set__(self, obj, value):
        pass

class foo(object):
    x = joke()

We had to implement __set__() too, despite the fact that it’s empty: otherwise python would have regarded joke as a non-data descriptor, which is a slightly different beast. Here is what happens when writing and reading the attribute x of an object of type foo:

>>> f = foo()
>>> f.x
'Bazinga!'
>>> f.x = 5
>>> f.x
'Bazinga!'

Python decorators

Are cool. In python, any callable that wraps a function is called a “decorator”. Example: you have a function that does stuff, like

def deco(arg):
    # stuff

and another function, that does some other stuff:

def fun(arg):
    # other stuff

Sometimes it is interesting to rebind fun like

fun = deco(fun)

and deco is said to be a “decorator” for fun.

Python has a convenient syntactic sugar for the rebinding above: it can be written as

@deco
def fun(arg):
    # other stuff

I often use decorators when I need to enhance the behaviour of a function by having something done before the function call, or after. So I write the following closure and use it as a decorator:

def deco(fun):
    def helper(arg):
        # do something before
        result = fun(arg)
        # do something after
        return result
    return helper

and then, for whatever fun this makes sense for, I rebind its name:

fun = deco(fun)

The propertycache decorator

Without further ado, here is the code, taken verbatim from the Mercurial codebase:

class propertycache(object):
    def __init__(self, func):
        self.func = func
        self.name = func.__name__
    def __get__(self, obj, type=None):
        result = self.func(obj)
        self.cachevalue(obj, result)
        return result

    def cachevalue(self, obj, value):
        setattr(obj, self.name, value)

and the example usage, taken straight from the mercurial wiki:

from util import propertycache

class foo(object):
    @propertycache
    def expensive(self):
        print 'created'
        return [1, 2, 3]

f = foo()
f.expensive # prints created
f.expensive # simply returns [1, 2, 3]

# to forget the saved value and calculate it again, use delattr
delattr(f, 'expensive')

How does it work

Before anything happens, when the class body of foo is executed, python rebinds

expensive = propertycache(expensive)

so the class attribute foo.expensive is now a propertycache object wrapping the original method.

Note that this is a descriptor object, since the class propertycache overrides the __get__() method. From this point on, you can’t call the method f.expensive() anymore:

>>> f.expensive()
<__main__.foo object at 0xb742d28c>
created
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'list' object is not callable

What you do, instead, is refer to f.expensive as a data member (that is, with no () parens). The first time f.expensive is evaluated, the original method f.expensive() is called (remember: it was saved in the func member of the propertycache object). Its result is then rebound to f.expensive by the cachevalue() method.
Now f.expensive is not a descriptor object anymore, but just the value returned by the original method: further evaluations won’t go through the call of the original method anymore, they just return the cached value.
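
You can see the rebinding with your own eyes by peeking at the dictionaries involved (a quick sketch, using the foo class defined above):

>>> f = foo()
>>> 'expensive' in f.__dict__          # nothing cached yet
False
>>> f.expensive                        # first access: propertycache.__get__ runs
created
[1, 2, 3]
>>> 'expensive' in f.__dict__          # the cached value now shadows the descriptor
True
>>> isinstance(foo.__dict__['expensive'], propertycache)  # class attribute untouched
True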

As a final remark, note that this doesn’t generalize to methods that take more arguments than the self reference, since the __get__() method has a well-defined signature and you can’t add more items to it: only a reference to the “owner” object is allowed (the instance of foo in our example).

How was it done before?

Descriptors were introduced in the Mercurial codebase with revision 595ba2537d4f by Dirkjan Ochtman, the same person who wrote the excellent chapter about Mercurial in the AOSA book.

How was caching done before? It looked like this:

class foo:
    def __getattr__(self, name):
        if name == 'expensive1':
            print 'Computing expensive1...'
            self.expensive1 = [1,2,3]
            return self.expensive1
        elif name == 'expensive2':
            print 'Computing expensive2...'
            self.expensive2 = [4,5,6]
            return self.expensive2
        else:
            raise AttributeError(name)

Let’s see that in action:

>>> f = foo()
>>> f.expensive1
Computing expensive1...
[1, 2, 3]
>>> f.expensive2
Computing expensive2...
[4, 5, 6]
>>> f.expensive1
[1, 2, 3]

Here the concertmaster is the __getattr__() method which, if implemented, intercepts all accesses to members not found otherwise; in our example, those are expensive1 and expensive2. Once accessed the first time, their value is computed and bound to a data member; subsequent accesses will find them without passing through __getattr__().

It goes without saying that the decorator described above is much more terse than this boilerplate code; it factors away all the common behaviour.

Mercurial histedit behind the scenes

What’s the magic behind the history rewriting extension in Mercurial? It’s called histedit, it’s written by Augie Fackler, and it’s a fundamental tool in my daily coding activity. I fired up pdb and watched closely what it does. Lean back and enjoy (all good ideas are Augie’s, all mistakes are mine!).


First off, I am obliged to raise a giant “don’t try this at home” sign upfront. I will use internal Mercurial APIs, which are deliberately undocumented; the reason is very simple: they might change next week without further notice.

Having said that, let me ask you a question: how would you create a changeset that is equivalent to a sequence of N other changesets? That is, in Mercurial lingo, how would you implement folding N changesets? You’d probably smash the first two together, then add the third one to that result, then the fourth, the fifth and so on. That’s what happens in histedit. Let’s see it in action.

Before we tackle the adventurous task of editing the history of our repository, we have to learn how to do two operations:

  1. pick a changeset somewhere in the history and apply it to the working directory
  2. collapse a chain of commits into a single one

Pick there, apply here

The histedit extension special-cases two situations.

In place

consider the repo created with

$ hg init
$ touch foo ; hg add foo ; hg commit -m foo
$ touch bar ; hg add bar ; hg commit -m bar
$ hg up 0

which then looks like


What we want to do here is to extract changeset 1 and apply it to the working directory. Since the parents of cset 1 and of the working directory are the same (they’re both children of changeset 0), the situation can be handled by a simple shot of cmdutil.revert(). Let’s switch to the python shell:

from mercurial import ui, cmdutil, node, localrepo

repo = localrepo.localrepository(ui.ui(), '.')
cmdutil.revert(ui.ui(), repo, repo[1],
               (repo[0].node(), node.nullid), all=True)
repo.commit(text="Applied cset 1")

this, as promised, leads us to


The revert(ui, repo, ctx, parents, *pats, **opts) function restores the files to the state they had at revision ctx, and sets the parents of the working directory. You might think there is something odd going on here, and I did too: this is a quite unusual application of revert, which is more often used to restore an ancestor revision. Here we’re picking stuff from another branch (let’s say, from “the future”).

Far away nodes: merge needed

The other situation is when the parents of the working directory and of our target revision differ but… those two (different) parents have the same content. It might seem an artificial situation, but it is what happens in a generic step of the histedit algorithm. In that case we resort to a merge. Let’s first build the sample repo:

$ touch foo ; hg add foo ; hg commit -m foo
$ touch bar ; hg add bar ; hg commit -m bar
$ touch baz ; hg add baz ; hg commit -m baz
$ hg up 0

then again the revert() trick to apply changeset 1 on top of 0, but on a parallel line; the code is exactly the python snippet above, so we have


What’s the histedit recipe to apply 2 on top of 3? We merge the working directory (updated to 3) with 2, setting 1 as the common ancestor; this isn’t the topological truth (1 is not an ancestor of 3) but it is morally ok, since 1 and 3 contain the same state.

repo = localrepo.localrepository(ui.ui(), '.')
merge.update(repo, repo[2].node(), True, True, False, repo[1].node())
repo.setparents(repo[3].node(), node.nullid)
repo.dirstate.write()
repo.commit(text="Applied cset 2")

here is the result:


Note how I carefully clean up the parents of the changeset, removing traces of the merge that just happened.

All of this is nicely packaged in the histedit.applychanges() function in the histedit extension.

Collapse many into one

Again, let’s kickstart with a toy repository:

$ hg init
$ touch foo ; hg add foo ; hg commit -m foo
$ touch bar ; hg add bar ; hg commit -m bar
$ touch baz ; hg add baz ; hg commit -m baz
$ touch woo ; hg add woo ; hg commit -m woo

We want to take changesets 1 to 3 and smash them into one that sums them all. Here is the python code:

from mercurial import ui, cmdutil, node, localrepo, context

repo = localrepo.localrepository(ui.ui(), '.')

files = set(repo[1].files()) | set(repo[2].files()) | set(repo[3].files())
mf = repo[3].manifest()

def filectxfn(repo, ctx, path):
    if path in mf:
        fctx = repo[3][path]
        mctx = context.memfilectx(fctx.path(), fctx.data())
        return mctx
    raise IOError()

parents = (repo[0].node(), node.nullid)
new = context.memctx(repo,
                     parents=parents,
                     text='collapsed 1 to 3',
                     files=files,
                     filectxfn=filectxfn,
                     user='gghh')

repo.commitctx(new)


Again, there is a function for this in the histedit extension, histedit.collapse(). I was surprised looking at that code: I have a mental model of commits as, say, an ordered sequence of diffs; that’s how I read this picture anyway. If I am collapsing changesets, I expect some sort of iteration over that chain. But that’s not what collapse() does. Look at the closure filectxfn: basically it considers only the filelogs from the last changeset, skipping those filelogs that didn’t change since first. As if in last you had everything necessary to know what happened in between.

Which is exactly how Mercurial works: filelogs handle the diff sequence internally; externally, only full contents are exposed. So basically all you can do is ask a filelog for the contents of a file, plus its parents, at a given version.
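
As a sketch of that idea (again, internal and undocumented API, so take it just as an illustration), a filelog answers questions about full contents and parents at a given node, never about diffs:

fl = repo.file('foo')        # the filelog of a tracked file
n = fl.node(0)               # node id of the first revision of this file
data = fl.read(n)            # full content at that revision, not a delta
p1, p2 = fl.parents(n)       # parent nodes of that file revision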

When does this show start? It’s getting late

With all this machinery in place, we’re now ready to juggle those little balls so fast that TortoiseHg will have a hard time catching up.


First off, the toy repo. For the occasion we’re giving foo, bar and baz a little break and using a piece of real literature.

$ hg init
$ echo 'Dorothy lived in the midst of the great\
> Kansas prairies, with Uncle' > chap1.txt
$ hg add chap1.txt ; hg commit -m "line 1"

$ echo "Henry, who was a farmer, and Aunt Em,\
> who was the farmer's wife. Their" >> chap1.txt
$ hg commit -m "line 2"

$ echo 'house was small, for the lumber to build it\
> had to be carried by wagon' >> chap1.txt
$ hg commit -m "line 3"

$ echo 'many miles.  There were four walls, a floor\
> and a roof, which made one' >> chap1.txt
$ hg commit -m "line 4"

$ echo 'room; and this room contained a rusty looking\
> cookstove, a cupboard for' >> chap1.txt
$ hg commit -m "line 5"

$ echo 'the dishes, a table, three or four chairs,\
> and the beds.' >> chap1.txt
$ hg commit -m "line 6"


Our plan is to fold all the changesets between 1 and 5 included into a single one. That is, we would run

hg histedit 1

The histedit commands will be:

(pick, 1)
(fold, 2)
(fold, 3)
(fold, 4)
(fold, 5)

The first thing we do is update the working directory to 1 and play the cmdutil.revert() trick to apply 2 on top of 1.

from mercurial import ui, hg, commands, localrepo, node, merge
from mercurial import cmdutil, scmutil, context, hg, repair
from hgext import histedit

repo = localrepo.localrepository(ui.ui(), '.')

hg.update(repo, repo[1].node())

cmdutil.revert(ui.ui(), repo, repo[2],
               (repo[1].node(), node.nullid), all=True)
repo.commit(text="2 bis")


Now we again update to 0, and collapse the line 1 -- 6 into changeset 7, which we apply on the top of 0.

hg.update(repo, repo[0].node())
histedit.collapse(repo, repo[1], repo[6], {})
hg.update(repo, repo[7].node())


Now we want to apply 3 on the top of 7. That’s the merge case we saw earlier; our reference ancestor will be 2, since 0 -- 7 is the same as 0 -- 1 -- 2.

merge.update(repo, repo[3].node(), True, True, False, repo[2].node())
repo.setparents(repo[7].node(), node.nullid)
repo.dirstate.write()
repo.commit(text="7 and 3")


Now guess what? Collapse the line 7 -- 8, and call it 9.

hg.update(repo, repo[0].node())
histedit.collapse(repo, repo[7], repo[8], {})
hg.update(repo, repo[9].node())


Come on, a couple more and we’re done: 4 on top of 9, then 5 on top of 11.

merge.update(repo, repo[4].node(), True, True, False, repo[3].node())
repo.setparents(repo[9].node(), node.nullid)
repo.dirstate.write()
repo.commit(text="4 bis")

hg.update(repo, repo[0].node())
histedit.collapse(repo, repo[9], repo[10], {})
hg.update(repo, repo[11].node())

merge.update(repo, repo[5].node(), True, True, False, repo[4].node())
repo.setparents(repo[11].node(), node.nullid)
repo.dirstate.write()
repo.commit(text="5 bis")

Now the final collapse: changeset 13 is the result of 1 -- 2 -- 3 -- 4 -- 5 all folded together.

hg.update(repo, repo[0].node())
histedit.collapse(repo, repo[11], repo[12], {})
hg.update(repo, repo[13].node())


Now it’s time to throw away the leftovers. repair.strip() comes to the rescue; we strip revision 1 four times on purpose, because after each strip the remaining changesets are renumbered and revision 1 points to the next leftover chain:

repair.strip(ui.ui(), repo, repo[1].node())

repair.strip(ui.ui(), repo, repo[1].node())

repair.strip(ui.ui(), repo, repo[1].node())

repair.strip(ui.ui(), repo, repo[1].node())


Hooray!

Is TCP turing complete?


Is the TCP protocol capable of… computing things? In other words: consider the topology of your network (nodes and the connections between them) and the data that each node plans to send across the network as “the input” of TCP seen as a “computing machine”, and the data exchange that actually happens (or maybe just the rate of the data exchange at steady state) as “the output”. Can one imagine some weird, complicated “encoding scheme” so that, at the end of the day, one can make TCP compute prime numbers?

In much more cryptic terms, for people used to theoretical computer science lingo, is the TCP protocol somehow Turing Complete?

I have asked myself this question ever since I learnt about the AIMD congestion control scheme, where basically the rate at which data is transmitted increases linearly in time until a packet loss is detected; at that point, the transmission rate is divided by a constant factor. Well, I feel that this scheme is complex enough that there might be some computational power lurking in there.

Personally I find it very amusing when Turing completeness sneaks into systems “by accident”, and this happens quite often in real world situations.

Usually when you have Turing completeness by accident it’s quite an undesirable thing, as famous programmer Jamie Zawinski points out in his blog; the comment about the DNS resource record NAPTR is also interesting.

Any comment is *very* welcome! I feel the formulation of my question is quite fuzzy, so even if somebody can just make me better understand *what* I am asking… well, that would make my day :-)

A feature of (q)bzr I really love

Bazaar lets you compare branches. Not just branches from the same repository, i.e. all stored in the same .bzr/ folder, but arbitrary branches that may (or may not) have some common history. All those years of typing code for fun and profit, and of staring at directed acyclic graphs wondering where the heck that stupid C macro that breaks the build was introduced, might have changed my sense of “awesome”.
But ladies and gentlemen, this is pretty much awesome.

Let me show you an example.

Playing the TIAS game with bazaar

TIAS is Try It And See, no more, no less. I did some TIAS in another post about Mercurial, and I am doing it again. Let’s get this party started.

First thing, we create a repository and fork it two times. Actually, in bzr the correspondence between repositories and branches can be one to one, with bzr init, or one to many, with bzr init-repo. The latter creates what is called a “shared repository”, i.e. a repository that can host many branches. This is indeed the way git and mercurial work; the former option is a bzr-only thing. Let’s stick to the one-repo-one-branch setup; it doesn’t make a big difference for the purpose of this post, and it requires less typing.
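
For the record, the shared-repository variant would look roughly like this (the names are just examples):

$ bzr init-repo project                        # one shared store hosting many branches
$ bzr init project/trunk
$ bzr branch project/trunk project/feature-x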

$ bzr init b
Created a standalone tree (format: 2a)
$ bzr branch b c1
Branched 0 revisions.
$ bzr branch b c2
Branched 0 revisions.

We now switch to c1, enter a couple of commits, and push them back to the parent branch, which is b/.

$ cd c1

/c1$ echo 'hello world!' > src1.txt
/c1$ bzr add src1.txt
adding src1.txt
/c1$ bzr commit -m "initial import"
Committing to: c1/
added src1.txt
Committed revision 1.

/c1$ echo 'hello world!\nIt is gonna be a nice day.' > src1.txt
/c1$ bzr commit -m "added optimism."
Committing to: c1/
modified src1.txt
Committed revision 2.

/c1$ bzr push :parent
All changes applied successfully.
Pushed up to revision 2.

Some time later, developer c2 enters the scene, retrieves the current content of b/ and makes a commit. He doesn’t push back to b/, because he’s not quite done yet.

$ cd ../c2/

/c2$ bzr pull
Using saved parent location: b/
+N  src1.txt
All changes applied successfully.
Now on revision 2.

/c2$ echo "That's a nice Tnettenba" > src2.txt
/c2$ bzr add src2.txt
adding src2.txt
/c2$ bzr commit -m "added joke"
Committing to: c2/
added src2.txt
Committed revision 3.

c1 gets back to work: he continues to develop his thing.

/c2$ cd ../c1

/c1$ echo 'voodoo atchoo to you too' > src3.txt
/c1$ bzr add src3.txt
adding src3.txt
/c1$ bzr commit -m 'thank you!'
Committing to: c1/
added src3.txt
Committed revision 3.

The release engineer, who has direct access to the server where b/ is stored, has been thinking about how to fix the shell script that handles the deployment. He finally figures it out; he’s in a hurry, and doesn’t want to branch and merge — he commits directly on b/. It happens.

$ cd ../b/

/b$ echo 'the grumpy cat says it had fun, once' > src4.txt
/b$ bzr add src4.txt
adding src4.txt
/b$ bzr commit -m 'that never happened.'
Committing to: b/
added src4.txt
Committed revision 3.

It’s c1‘s turn to play now. He had a coffee with c2; they talked about the code and c1 realizes he needs the new feature that c2 just committed. So he merges with c2, and then goes a little further.

/c1$ bzr merge ../c2
+N  src2.txt
All changes applied successfully.
/c1$ bzr commit -m "thanks c2, that was neat"
Committing to: c1/
added src2.txt
Committed revision 4.

/c1$ echo 'if (0 == a);' > src6.txt
/c1$ bzr add src6.txt
adding src6.txt
/c1$ bzr commit -m 'Zero the value is not. Yoda style.'
Committing to: c1/
added src6.txt
Committed revision 5.

It’s almost time to leave, and c2 has just the time to fix that little annoying bug before heading home.

/c1$ cd ../c2

/c2$ echo 'NaNNaNaNNaNaN' > src5.txt
/c2$ bzr add src5.txt 
adding src5.txt
/c2$ bzr commit -m 'Batman. I need Batman.'
Committing to: c2/
added src5.txt
Committed revision 4.

The sweet pretty pictures

The day after, c1 and c2 get back to work, chit-chat a little, and realize that they don’t know exactly how their two branches relate. Ok, somebody pushed, somebody merged… but they’d like to see it crystal clear, to anticipate all the problems that will arise when, next Wednesday, all lines are going to be merged into the production code. Moreover, they have no idea that the release engineer jumped in and added stuff directly to the reference repo b/. What do they do?
Easy peasy: bzr qlog b c1 c2.
The amazing command

bzr qlog <branch_1> <branch_2> ... <branch_n>

shows what all the branches have in common, who diverged, who pushed to whom, everything. And recall that we’re talking about branches that reside in different repositories. Too good! This is what the command

bzr qlog b c1 c2

would show:

Isn’t that pretty?
Just for the sake of taking some screenshots and messing around with gimp, I’ll show you that the branches you compare can be totally, and I say totally, unrelated. Here is what you get if you compare the trunk of bzr itself with the latest mysql-server codebase via

bzr qlog lp:bzr lp:mysql-server

I had to shrink it a little bit, as you can imagine; of course the two don’t share a thing, but that’s just for the fun of the abuse:

All your rebase are belong to us

When I hit “commit” on my version control system, I behave very much like when I click “save” on my text editor: I just do it whenever I feel like it. I can’t see in advance how the thing is going to look at the end, in all its details. Result: at the end of the day, I have a bazillion small commits without any clear, organic structure. I just cannot hand this blob to the release engineer, or code reviewer, or whatever — the guy is gonna yell at me all over the building. So, guess what? You have to rewrite your history, my son. And here is how it works (with Mercurial).

Beautify the commit history, even between regular merges

Alice clones Production; commits some stuff; then merges from Production just to keep up to date; commits again a few times, re-merges, etc. Then she would like to beautify her commit history a little, and would like to “squash” all the commits she made between a merge and the next one into a single commit. Here is a picture that explains the scenario:


I apologize for the ugly picture (you might want to read again the title of this blog). At the expense of killing the narrative, I am gonna say upfront how the story is gonna end (Spoiler Alert!): initially Alice would be happy just by folding the commits between merge M1 and merge M2. But then she discovers that she’s actually free to move her commits all the way up and down her development line, and that the two merges with upstream she did are… not a “boundary” anymore. So, in the final picture, all merges are still there (even if M1 is technically just a clone), and commits A, B, C, D, E can be in arbitrary positions with respect to M1 and M2 (they just need to respect their relative order); look at the picture at the end of the post for some abstract art representing the final result.
The hg terminology for squashing commits is folding. We are going to see how to handle this situation using the rebase and histedit Mercurial extensions.


First off, Mercurial TIAS (Try It And See)

TIAS is an effective technique to grasp new concepts, and the first step on the way to mastery. For mercurial, it goes like this: let’s create a repo serving as a (local) reference one

$ hg init b

this will create an empty repository in the folder b/. Other clones will communicate via this one as if it were a remote repo, like in a real-world (distributed) development workflow. You will then clone repo b/ one or more times, simulating actions from several developers:

$ hg clone b c1
$ hg clone b c2
...

Please note that all local repos are basically equal; here we are just talking about conventions. There is no such thing as a client/server scenario in the distributed version control world, but rather a peer-to-peer setup.
You now have a basic working environment where you can play, experiment, try it, and see.
For instance, why not enter c1, add a small text file and commit it:

$ cd c1/ ; echo "hello world" > src.txt ;
c1/$ hg add src.txt ; hg commit -m "initial import"

now we could push this commit to c1‘s parent repository, which is b, with

c1/$ hg push

then enter c2/ and retrieve those modifications via

c1/$ cd ../c2
c2/$ hg pull
c2/$ hg update

Did it work? Let’s see:

c2/$ cd ..
$ tree

.
├── b
├── c1
│   └── src.txt
└── c2
    └── src.txt

Hooray! Note that b‘s working tree is empty, because we never ran hg update there; nonetheless, all the stuff we care about is safely stored in b/.hg/.
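
As a quick double check (just an aside; the -R flag points hg at a repository without cd’ing into it, and the template is an arbitrary compact format):

$ hg -R b log --template '{rev}: {desc}\n'
0: initial import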

Simulation of the workflow

I’ll now reproduce the situation described above. Let’s create a dumb production repo:

$ hg init b
$ cd b
/b$ echo -e 'Hello\nworld!!' > src1.txt
/b$ cat src1.txt 
Hello
world!!
/b$ hg add src1.txt; hg commit -m "initial import"
/b$ cd ..

Note that I committed in b/, which actually is not what you do in real life; the reference repository only receives pushes and pulls from other repos. But this little idiosyncrasy is inessential here; it doesn’t alter the point I want to make. Let’s then clone b

$ hg clone b c1
updating to branch default
1 files updated, 0 files merged, 0 files removed, 0 files unresolved
$ tree
.
├── b
│   └── src1.txt
└── c1
    └── src1.txt

2 directories, 2 files

back on prod to do a commit

$ cd b/
/b$ echo -e 'Hello\nworld' > src1.txt
/b$ hg commit -m "removed bangs"
/b$ hg log --graph
@  changeset:   1:38bceaef9ec1
|  tag:         tip
|  user:        gghh
|  date:        Thu Apr 11 10:03:52 2013 +0200
|  summary:     removed bangs
|
o  changeset:   0:1e0b6ab683de
   user:        gghh
   date:        Thu Apr 11 10:03:06 2013 +0200
   summary:     initial import

/b$ cd ..

now some commit on repo c1

$ cd c1/
/c1$ echo 'foo bar' > src2.txt
/c1$ hg add src2.txt; hg commit -m "added a source file"

another commit

/c1$ echo -e 'foo\nbar' > src2.txt
/c1$ cat src2.txt 
foo
bar
/c1$ hg commit -m "inserted newline"

a third commit

/c1$ echo -e 'foo\nbar\nbaz\nbazinga' > src2.txt
/c1$ hg commit -m "more text"

checking history

/c1$ hg log --graph
@  changeset:   3:5df72ef842d2
|  tag:         tip
|  user:        gghh
|  date:        Thu Apr 11 10:07:12 2013 +0200
|  summary:     more text
|
o  changeset:   2:26a980e728e2
|  user:        gghh
|  date:        Thu Apr 11 10:07:00 2013 +0200
|  summary:     inserted newline
|
o  changeset:   1:9f7e0b186fa4
|  user:        gghh
|  date:        Thu Apr 11 10:06:43 2013 +0200
|  summary:     added a source file
|
o  changeset:   0:1e0b6ab683de
   user:        gghh
   date:        Thu Apr 11 10:03:06 2013 +0200
   summary:     initial import

now pull and merge from b (production)

/c1$ hg pull
pulling from /b
searching for changes
adding changesets
adding manifests
adding file changes
added 1 changesets with 1 changes to 1 files (+1 heads)
(run 'hg heads' to see heads, 'hg merge' to merge)
/c1$ hg merge
1 files updated, 0 files merged, 0 files removed, 0 files unresolved
(branch merge, don't forget to commit)

/c1$ hg commit -m "merged with production"

checking history

/c1$ hg log --graph
@    changeset:   5:0877a5c95a08
|\   tag:         tip
| |  parent:      3:5df72ef842d2
| |  parent:      4:38bceaef9ec1
| |  user:        gghh
| |  date:        Thu Apr 11 10:09:04 2013 +0200
| |  summary:     merged with production
| |
| o  changeset:   4:38bceaef9ec1
| |  parent:      0:1e0b6ab683de
| |  user:        gghh
| |  date:        Thu Apr 11 10:03:52 2013 +0200
| |  summary:     removed bangs
| |
o |  changeset:   3:5df72ef842d2
| |  user:        gghh
| |  date:        Thu Apr 11 10:07:12 2013 +0200
| |  summary:     more text
| |
o |  changeset:   2:26a980e728e2
| |  user:        gghh
| |  date:        Thu Apr 11 10:07:00 2013 +0200
| |  summary:     inserted newline
| |
o |  changeset:   1:9f7e0b186fa4
|/   user:        gghh
|    date:        Thu Apr 11 10:06:43 2013 +0200
|    summary:     added a source file
|
o  changeset:   0:1e0b6ab683de
   user:        gghh
   date:        Thu Apr 11 10:03:06 2013 +0200
   summary:     initial import

checking phases

/c1$ hg phase -r 0::5
0: public
1: draft
2: draft
3: draft
4: public
5: draft

Wait, hold on: phases? whazat?

Mercurial phases are a powerful feature that prevents you from making mistakes when rewriting the history of a repo. A changeset (commit) marked as public is one that has been shared (i.e., through a push/pull), so it is not wise to modify it. Mercurial keeps track of that, and any attempt to rewrite a public changeset is refused, unless you change the phase of the commit with

hg phase -v --draft [rev] --force

note the option --force, since we’re taking the risk of moving a public changeset back to draft. Just to give a little context on how advanced this feature is, I’ll mention the fact that Git is going to implement it in the next Google Summer of Code.
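
Purely as an illustration (we won’t actually do this in the walkthrough), demoting the public changeset 4 of our repo back to draft would look something like this:

/c1$ hg phase -r 4
4: public
/c1$ hg phase --draft --force -r 4
/c1$ hg phase -r 4
4: draft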

Rewrite history!

Back to the original problem, note that commits 1, 2, 3 and 5 are draft, since they never left the repo c1, so it’s safe to move them around. We do that using the rebase Mercurial extension (and, later in this post, histedit, which ships with Mercurial since version 2.3).

Taking stock

Before we start messing up with our repository, let’s take note of all changesets:

/c1$ hg log -p
changeset:   5:0877a5c95a08
tag:         tip
parent:      3:5df72ef842d2
parent:      4:38bceaef9ec1
user:        gghh
date:        Thu Apr 11 10:09:04 2013 +0200
summary:     merged with production

diff -r 5df72ef842d2 -r 0877a5c95a08 src1.txt
--- a/src1.txt	Thu Apr 11 10:07:12 2013 +0200
+++ b/src1.txt	Thu Apr 11 10:09:04 2013 +0200
@@ -1,2 +1,2 @@
 Hello
-world!!
+world

changeset:   4:38bceaef9ec1
parent:      0:1e0b6ab683de
user:        gghh
date:        Thu Apr 11 10:03:52 2013 +0200
summary:     removed bangs

diff -r 1e0b6ab683de -r 38bceaef9ec1 src1.txt
--- a/src1.txt	Thu Apr 11 10:03:06 2013 +0200
+++ b/src1.txt	Thu Apr 11 10:03:52 2013 +0200
@@ -1,2 +1,2 @@
 Hello
-world!!
+world

changeset:   3:5df72ef842d2
user:        gghh
date:        Thu Apr 11 10:07:12 2013 +0200
summary:     more text

diff -r 26a980e728e2 -r 5df72ef842d2 src2.txt
--- a/src2.txt	Thu Apr 11 10:07:00 2013 +0200
+++ b/src2.txt	Thu Apr 11 10:07:12 2013 +0200
@@ -1,2 +1,4 @@
 foo
 bar
+baz
+bazinga

changeset:   2:26a980e728e2
user:        gghh
date:        Thu Apr 11 10:07:00 2013 +0200
summary:     inserted newline

diff -r 9f7e0b186fa4 -r 26a980e728e2 src2.txt
--- a/src2.txt	Thu Apr 11 10:06:43 2013 +0200
+++ b/src2.txt	Thu Apr 11 10:07:00 2013 +0200
@@ -1,1 +1,2 @@
-foo bar
+foo
+bar

changeset:   1:9f7e0b186fa4
user:        gghh
date:        Thu Apr 11 10:06:43 2013 +0200
summary:     added a source file

diff -r 1e0b6ab683de -r 9f7e0b186fa4 src2.txt
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/src2.txt	Thu Apr 11 10:06:43 2013 +0200
@@ -0,0 +1,1 @@
+foo bar

changeset:   0:1e0b6ab683de
user:        gghh
date:        Thu Apr 11 10:03:06 2013 +0200
summary:     initial import

diff -r 000000000000 -r 1e0b6ab683de src1.txt
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/src1.txt	Thu Apr 11 10:03:06 2013 +0200
@@ -0,0 +1,2 @@
+Hello
+world!!

Rebase

Let’s do it! Just as in this example from the docs,

/c1$ hg rebase --dest 4 --source 1
saved backup bundle to c1/.hg/strip-backup/9f7e0b186fa4-backup.hg

Note that, for the sake of readability, I am referring to changesets via their revision numbers; the other option is to use their hashes. This works perfectly well; just be aware that revision numbers aren’t unique across repositories, while hashes are.
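
For the record, the same rebase expressed with hashes would have looked like this (equivalent, shown only for illustration):

/c1$ hg rebase --dest 38bceaef9ec1 --source 9f7e0b186fa4
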
What does the history look like now?

/c1$ hg log --graph
@  changeset:   4:c61940d6fbeb
|  tag:         tip
|  user:        gghh
|  date:        Thu Apr 11 10:07:12 2013 +0200
|  summary:     more text
|
o  changeset:   3:45eebb742cee
|  user:        gghh
|  date:        Thu Apr 11 10:07:00 2013 +0200
|  summary:     inserted newline
|
o  changeset:   2:9ad745b5c9db
|  user:        gghh
|  date:        Thu Apr 11 10:06:43 2013 +0200
|  summary:     added a source file
|
o  changeset:   1:38bceaef9ec1
|  user:        gghh
|  date:        Thu Apr 11 10:03:52 2013 +0200
|  summary:     removed bangs
|
o  changeset:   0:1e0b6ab683de
   user:        gghh
   date:        Thu Apr 11 10:03:06 2013 +0200
   summary:     initial import

and phases are

/c1$ hg phase 0::4
0: public
1: public
2: draft
3: draft
4: draft

Now collapse (fold)

To do the folding from revision 2 onwards, we finally resort to the “histedit” Mercurial extension:

/c1$ hg histedit 2

change the commands from

pick 9ad745b5c9db 2 added a source file
pick 45eebb742cee 3 inserted newline
pick c61940d6fbeb 4 more text

# Edit history between 9ad745b5c9db and c61940d6fbeb
#
# Commands:
#  p, pick = use commit
#  e, edit = use commit, but stop for amending
#  f, fold = use commit, but fold into previous commit (combines N and N-1)
#  d, drop = remove commit from history
#  m, mess = edit message without changing commit content
#

to

pick 9ad745b5c9db 2 added a source file
fold 45eebb742cee 3 inserted newline
fold c61940d6fbeb 4 more text

add an updated commit message and it’s done. A last glance at our pretty history:

/c1$ hg log --graph
@  changeset:   2:5f8e8722ad9a
|  tag:         tip
|  user:        gghh
|  date:        Thu Apr 11 10:07:12 2013 +0200
|  summary:     added a source file and other stuff
|
o  changeset:   1:38bceaef9ec1
|  user:        gghh
|  date:        Thu Apr 11 10:03:52 2013 +0200
|  summary:     removed bangs
|
o  changeset:   0:1e0b6ab683de
   user:        gghh
   date:        Thu Apr 11 10:03:06 2013 +0200
   summary:     initial import

now the last revision is the sum of the three.

/c1$ hg log -p -r 2
changeset:   2:5f8e8722ad9a
tag:         tip
user:        gghh
date:        Thu Apr 11 10:07:12 2013 +0200
summary:     added a source file and other stuff

diff -r 38bceaef9ec1 -r 5f8e8722ad9a src2.txt
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/src2.txt	Thu Apr 11 10:07:12 2013 +0200
@@ -0,0 +1,4 @@
+foo
+bar
+baz
+bazinga

So that’s all I’ve got. Just a few more pointers for the men of good will:

hg pull --rebase

and

hg rebase --dest [rev] --source [rev] --collapse

can perform these common tasks in one shot, as described here and here.
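
In our toy repo, for instance, the rebase and the fold above could presumably have been condensed into a single step (an untested sketch, for illustration only):

/c1$ hg rebase --dest 4 --source 1 --collapse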

A note about getting the latest Mercurial

If you happen to work on an Ubuntu box like me, you can bypass the bureaucracy of the official Ubuntu repositories and get the latest, bleeding-edge Mercurial with a simple

sudo add-apt-repository ppa:mercurial-ppa/releases
sudo apt-get update  
sudo apt-get install mercurial

Recap

To recap, we started with the goal of helping Alice reorganize her commits between one merge with upstream and the next, and we discovered that much more is possible: if the situation isn’t pathological, merges aren’t a boundary at all, and Alice’s “draft” commits can really be moved around.

Mercurial has a really powerful mechanism to edit the history of a repo, and more awesomeness is still to come. Moreover, “phases” help keep the whole thing safe.

chord diagram layout in d3.js: implementation

In a previous post I wrote how much I like the d3 chord diagram per se, and how little I like configuring it via the matrix API function. I argued that a different interface can be more useful; in this post I describe the implementation details of this alternative layout function.

To recap, for me a chord diagram is a collection of “polygons”. A “polygon” is a collection of “vertices”; a “vertex” is given by a pair of values, i.e. the ID of the arc the vertex belongs to and a “size”. The layout implementation will connect those vertices with chords in the natural way, i.e. with the shortest closed path of chords. There can be many such chord paths, and we don’t really care which one is chosen. It’s inessential, as long as the path is closed and minimal. Here is a sample JavaScript object that contains all the information needed to build such a diagram:

connections = [ [ {group: 1, value: 1},
                  {group: 3, value: 1},
                  {group: 5, value: 1},
                  {group: 7, value: 1} ],

                [ {group: 0, value: 2} ],

                [ {group: 2, value: 2} ],

                [ {group: 4, value: 2} ],

                [ {group: 6, value: 2} ] ]

Here is the resulting diagram; as far as we’re concerned, both renderings represent the input faithfully, and we leave it to the layout engine to “fill in the blanks” and choose at will.

The goal of the game is to massage our input data until we get a representation of it compatible with the interface of d3.svg.chord, i.e. groups are described by an array of objects like

{endAngle:   ...,
 index:      ...,
 startAngle: ...,
 value:      ...}

where startAngle and endAngle are expressed in radians and identify the arc, index is an integer ID corresponding to the group properties found in connections, and value is the size of the group expressed in some scale (namely the sum of all the connection values where that group appears). Chords shall be described by an array of objects like

{source: {endAngle:   ...,
          index:      ...,
          startAngle: ...,
          subindex:   ...,
          value:      ...,
          groups:     {ID_1: '',
                       ID_2: '', 
                       ...}      } ,
target:  {endAngle:   ...,
          index:      ...,
          startAngle: ...,
          subindex:   ...,
          value:      ...,
          groups:     {ID_1: '', 
                       ID_2: '', 
                       ...}      } }

where the properties source and target identify the start and end points of a chord. The sub-properties startAngle, endAngle and value describe the width of the chord base in radians and on an absolute scale, while index identifies the group where the base should be placed. subindex tells the engine where to put the chord inside a group; if two chords’ endpoints have the same index and subindex, their bases will be superposed. The groups subproperty isn’t present in the current d3 implementation; it is a collection of group IDs: all the groups touched by the polygon the chord belongs to. I use that datum for the fade-out effect on whole polygons.

My implementation is a sequence of loops, where I pass over my initial data again and again to change its representation. To describe what I do, I will use the JavaScript REPL provided by Node.js (you thought that node was just a webserver… you were wrong!). So fire up your node REPL:

$ node
>

From now on, if you have the node REPL on your machine, just cut and paste my code snippets to follow live what’s going on. Let’s first input some connection data

var connections = [ [ {group: 0, value: 42} ],
                    [ {group: 0, value: 5},
                      {group: 1, value: 7},
                      {group: 2, value: 7},
                      {group: 3, value: 5} ],
                    [ {group: 0, value: 11},
                      {group: 2, value: 11} ] ];

First thing, we compute the size of each group

var ep, i, j, k, ngroups = 0, groupSums = {};

k = 0, i = -1; while (++i < connections.length) {
    j = -1; while (++j < connections[i].length) {
        ep = connections[i][j].group;
        if (ep in groupSums) {
            groupSums[ep] += connections[i][j].value;
        } else {
            groupSums[ep] = connections[i][j].value;
            ++ngroups;
        }
        k += connections[i][j].value;
    }
}

we’re running over all the connections elements, and each time we find a group ID we update the corresponding entry in the groupSums dictionary. Let’s inspect it:

> ngroups
4
> groupSums
{ '0': 58,
  '1': 7,
  '2': 18,
  '3': 5 }

As expected, ngroups is 4 and, for instance, the total size of group 0 is 42+5+11=58. Those values will then be rendered as portions of a circle, like in a pie chart, but only on the perimeter. I now need the list of group IDs; since importing external modules in the REPL is a little tricky, I won’t resort to d3.range but provide it by hand, just for the sake of this example

var range = function(n) {
    var out = [], i = -1; 
    while(++i<n) {out.push(i);} 
    return out;
}

var groupIndex = range(ngroups);
> groupIndex
[ 0, 1, 2, 3 ]

as a side note, in JavaScript there is no builtin way to get the number of properties of an object that is portable across older engines; that’s why I keep the ngroups counter running on the side. Now convert k from the total sum of all the values to our reference angular unit, expressed in radians

var padding = 0.05;
k = (2 * 3.1415 - padding * ngroups) / k;
> k
0.069125
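
Two quick asides of mine before moving on: first, in an ES5-capable environment Object.keys would give a portable way to list and count the group IDs (so you wouldn’t need the ngroups counter); second, as a sanity check on k, group 0 holds 58 units, so it should span roughly 58 * 0.069 ≈ 4.01 radians, which is exactly the arc we’ll find for it in the groups array further down.

> Object.keys(groupSums)
[ '0', '1', '2', '3' ]
> Object.keys(groupSums).length
4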

now build the polygons data structure. With the following snippet we loop over the input connections; we saw that a connection element corresponds to a polygon, i.e. a closed path of chords. Since all we want is to join vertices in the natural way, there’s no need to specify the edges in the input; but at some point we have to compute those edges, and that moment is now. For every vertex v_i in a connection element, we add the edge (v_{i-1}, v_i) to the corresponding polygon.

var polygons = [];
var poly = {edges: [],
            vertices: {}};

i = -1; while (++i < connections.length) {
    if (connections[i].length > 1) {
        j = 0; while (++j < connections[i].length) {
            poly.edges.push({
                                source: connections[i][j-1],
                                target: connections[i][j]
                            });
            poly.vertices[connections[i][j-1].group + ''] = '';
        }
        poly.vertices[connections[i][j-1].group + ''] = '';
        if (poly.edges.length > 1) {
            poly.edges.push({
                                source: connections[i][0],
                                target: connections[i][j-1]
                            });
        }
        polygons.push(poly);
        poly = {edges: [],
                vertices: {}};
    }
};

Note that we set apart two special cases: single-vertex connections (they correspond to no polygon at all; we just need to remember to leave some free space in the group involved) and two-vertex connections (they are composed of a single chord, so there’s no need to close the path at the end). Let’s use the REPL with joy, and inspect the polygons object.

> polygons.length
2

As expected: one polygon is a “square”, and the other is a single segment.

> polygons[0]
{ edges: 
   [ { source: [Object], target: [Object] },
     { source: [Object], target: [Object] },
     { source: [Object], target: [Object] },
     { source: [Object], target: [Object] } ],
  vertices: 
   { '0': '',
     '1': '',
     '2': '',
     '3': '' } }

the REPL’s pretty printer doesn’t show us more than three levels of nesting, so we need to access those by ourselves. By the way, note the “vertices” property of a polygon; this will be useful when we set the fade effect on whole polygons.

> polygons[0].edges
[ { source: { group: 0, value: 5 },
    target: { group: 1, value: 7 } },
  { source: { group: 1, value: 7 },
    target: { group: 2, value: 7 } },
  { source: { group: 2, value: 7 },
    target: { group: 3, value: 5 } },
  { source: { group: 0, value: 5 },
    target: { group: 3, value: 5 } } ]
> polygons[0].vertices
{ '0': '',
  '1': '',
  '2': '',
  '3': '' }

The source and target of each polygons[i].edges element are references to connections sub-elements.

> polygons[1].edges
[ { source: { group: 0, value: 11 },
    target: { group: 2, value: 11 } } ]
> polygons[1].vertices
{ '0': '', '2': '' }
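
If you prefer to see the whole nested structure in one go, an aside of mine: JSON.stringify dumps it without the [Object] placeholders:

> JSON.stringify(polygons[1])
'{"edges":[{"source":{"group":0,"value":11},"target":{"group":2,"value":11}}],"vertices":{"0":"","2":""}}'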

Everything as expected. We can move on to the next loop, where we fill up a new data structure called subgroups; a “subgroup” is something that lives inside a group (one of the arcs of our pie chart), and takes care of multiple ribbons sharing the same base. If two ribbons depart from the same group and, moreover, have that base in common, they belong to the same subgroup. Otherwise the SVG engine would have no way to know that they have to be placed together, and would draw them one next to the other.

Of course a chord is referenced in two distinct subgroups, one for each group it touches (begin and end).

var h, samebase, subgroups = [];

i = -1; while (++i < ngroups) {
    subgroups[i] = [];
    j = -1; while (++j < polygons.length) {
        samebase = {'ribbons': [],
                    'basevalue': 0};
        h = -1; while (++h < polygons[j].edges.length) {
            if (polygons[j].edges[h].source.group === i) {
                samebase.ribbons.push(polygons[j].edges[h]);
                samebase.basevalue = polygons[j].edges[h].source.value;
            } else if (polygons[j].edges[h].target.group === i) {
                samebase.ribbons.push(polygons[j].edges[h]);
                samebase.basevalue = polygons[j].edges[h].target.value;
            }
        }
        subgroups[i].push(samebase);
    }
}

here we set up a collection of subgroups for each group: we scan all the polygons and check whether each one has a vertex in that group. In our polygon-based model, each subgroup will hold at most two ribbons, since each corner of a polygon has only two sides departing from it. Special cases are the segment and the single point, which have one ribbon and no ribbons respectively. Let’s inspect our subgroups object so far:

> subgroups
[ [ { ribbons: [Object], basevalue: 5 },
    { ribbons: [Object], basevalue: 11 } ],
  [ { ribbons: [Object], basevalue: 7 },
    { ribbons: [], basevalue: 0 } ],
  [ { ribbons: [Object], basevalue: 7 },
    { ribbons: [Object], basevalue: 11 } ],
  [ { ribbons: [Object], basevalue: 5 },
    { ribbons: [], basevalue: 0 } ] ]

we also added the basevalue property; it tells us the size of the chord’s base. Let’s take a closer look at the subgroups of the first group:

> subgroups[0]
[ { ribbons: [ [Object], [Object] ],
    basevalue: 5 },
  { ribbons: [ [Object] ], basevalue: 11 } ]

there are two: one with two ribbons that share a base of size 5, and one with a single ribbon of base 11 (that clearly belongs to the segment; check connections again, you’ll find it in the third position of the array). We can inspect which ribbons are involved, via

> subgroups[0][0].ribbons
[ { source: { group: 0, value: 5 },
    target: { group: 1, value: 7 } },
  { source: { group: 0, value: 5 },
    target: { group: 3, value: 5 } } ]
> subgroups[0][1].ribbons
[ { source: { group: 0, value: 11 },
    target: { group: 2, value: 11 } } ]

Note that, again, we’re dealing with references to connections’ components. Change one of them, and you’ll be writing directly into connections. We’ll use this in a little while.
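
You can verify the reference sharing directly in the REPL (another aside of mine): the single ribbon departing from group 0 with base 11 is literally the same object as the first element of the third connection.

> subgroups[0][1].ribbons[0].source === connections[2][0]
true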

Hey! I was about to forget the single-point connections. We need to add them to the subgroups too, since they have to be taken into account when computing the position of the graphical elements. Here we go:

i = -1; while (++i < connections.length) {
    if (connections[i].length === 1) {
        subgroups[connections[i][0].group].push({
                   'ribbons': [],
                   'basevalue': connections[i][0].value
                 });
    }
}

you can check that group 0 now holds an extra no-ribbons subgroup with a basevalue of 42 (check connections again and you’ll see that’s correct):

> subgroups
[ [ { ribbons: [Object], basevalue: 5 },
    { ribbons: [Object], basevalue: 11 },
    { ribbons: [], basevalue: 42 } ],
  [ { ribbons: [Object], basevalue: 7 },
    { ribbons: [], basevalue: 0 } ],
  [ { ribbons: [Object], basevalue: 7 },
    { ribbons: [Object], basevalue: 11 } ],
  [ { ribbons: [Object], basevalue: 5 },
    { ribbons: [], basevalue: 0 } ] ]

we now use the range function, defined before, to generate the array of arrays subgroupIndex, in case we need to sort subgroups; it’s inessential for the scope of this article, but I introduce it to keep the naming consistent with the implementation on my github branch.

var subgroupIndex = [];
i = -1; while (++i < ngroups) {
    subgroupIndex.push(range(subgroups[i].length));
}

now everything is in place to find out the geometry of our diagram:

var pt, pt1, pt2, groups = [];

var x = 0, i = -1; while (++i < ngroups) {
    var di = groupIndex[i];
    var x0 = x, j = -1; while (++j < subgroupIndex[di].length) {
        var dj = subgroupIndex[di][j],
        v = subgroups[di][dj].basevalue,
        a0 = x,
        a1 = x += v * k;
        h = -1; while(++h < subgroups[di][dj].ribbons.length) {
            pt1 = subgroups[di][dj].ribbons[h].source;
            pt2 = subgroups[di][dj].ribbons[h].target;
            if (pt1.group === di) { pt = pt1; }
            else { pt = pt2; }
            pt['geometry'] = {
                index: di,
                subindex: dj,
                startAngle: a0,
                endAngle: a1,
                value: v
            };
        }
    }
    groups[di] = {
        index: di,
        startAngle: x0,
        endAngle: x,
        value: (x - x0) / k
    };
    x += padding;
}

we are scanning all our subgroups in the order defined by subgroupIndex and groupIndex; we then pick all the ribbons in a subgroup, select the endpoint sitting in the current group, and augment that datum with the geometry property. Note that since we’re managing references, these changes propagate to all objects holding that reference, as if they were a sort of global variable. Here we modify the values via the subgroups array; later we’ll access those chord endpoints via the polygons array. Pretty cool. Let’s just check what ended up in the groups array

> groups
[ { index: 0,
    startAngle: 0,
    endAngle: 4.009250000000001,
    value: 58.00000000000001 },
  { index: 1,
    startAngle: 4.0592500000000005,
    endAngle: 4.543125000000001,
    value: 7.0000000000000036 },
  { index: 2,
    startAngle: 4.593125000000001,
    endAngle: 5.837375000000001,
    value: 18 },
  { index: 3,
    startAngle: 5.8873750000000005,
    endAngle: 6.2330000000000005,
    value: 5.000000000000001 } ]

and in the geometry property of our chords’ endpoints. Here is the outer layout of the object:

> subgroups[0][0].ribbons
[ { source: 
     { group: 0,
       value: 5,
       geometry: [Object] },
    target: 
     { group: 1,
       value: 7,
       geometry: [Object] } },
  { source: 
     { group: 0,
       value: 5,
       geometry: [Object] },
    target: 
     { group: 3,
       value: 5,
       geometry: [Object] } } ]

zoom in a little:

> subgroups[0][0].ribbons[0].source
{ group: 0,
  value: 5,
  geometry: 
   { index: 0,
     subindex: 0,
     startAngle: 0,
     endAngle: 0.345625,
     value: 5 } }

so far so good. We’re almost there; we just have to explicitly build the chords objects.

var chords = [];

i = -1; while (++i < polygons.length) {
    j = -1; while (++j < polygons[i].edges.length) {
        var source = polygons[i].edges[j].source.geometry,
        target = polygons[i].edges[j].target.geometry;
        if (source.value || target.value) {
            chords.push(source.value < target.value
                        ? {source: target, target: source, groups: polygons[i].vertices}
                        : {source: source, target: target, groups: polygons[i].vertices});
        }
    }
}

Done! Recall that the groups property of a chords element means: “the IDs of all the groups touched by the polygon this chord belongs to”, and it will be used for the fading effect. It comes directly from the polygons[i].vertices objects; since looking up a key in an object is O(1) as opposed to O(n) for scanning an array (well, that’s kind of a mystery to me, arrays being objects after all) and I need fast lookups, I encoded the vertices as keys of a dictionary whose values are dummy empty strings. Let’s proudly inspect the result of our efforts:

> chords.length
5

as expected: a 4-sided polygon plus a segment. What does a chord look like?

> chords[0]
{ source: 
   { index: 1,
     subindex: 0,
     startAngle: 4.0592500000000005,
     endAngle: 4.543125000000001,
     value: 7 },
  target: 
   { index: 0,
     subindex: 0,
     startAngle: 0,
     endAngle: 0.345625,
     value: 5 },
  groups: 
   { '0': '',
     '1': '',
     '2': '',
     '3': '' } }

Here is the resulting diagram, where I also added labels to the arcs for clarity: