Exploding intersections with the Inclusion-Exclusion Principle

How many elements are in the union of two sets? The sum of the elements in each set, you say? Of course not: there might be elements in common that you would count twice. What if there are three sets involved? Or four?

The answer is known as “the Inclusion-Exclusion Principle”, and in this post I’ll describe how to use it to “explode intersections”, i.e. give the cardinality of each region of a Venn diagram. In a picture, here is the case of three sets:

The Inclusion-Exclusion Principle tells us that the cardinality of the union of three sets is given by the formula

|A ∪ B ∪ C| = |A| + |B| + |C|
              - |A∩B| - |A∩C| - |B∩C|
              + |A∩B∩C|

which doesn’t come as too much of a surprise: we first sum the sets’ cardinalities, then notice we counted the “two-by-two” intersections once too often and subtract them, but then hey, we added the “all-in” intersection three times and subtracted it three times, so we have to put it back in.

The generalized case follows the same pattern: add all “order one” intersections (the sets alone), remove all “order two” intersections (the two-by-two), add back all “order three” ones (the three-by-three), remove all “order four”, and so on until no more cases exist.
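
Just to make the pattern concrete before we get to the lattice view, here is a small standalone sketch of mine (not part of the code developed below) that applies the general formula directly, using itertools.combinations to enumerate the intersections of each order:

from itertools import combinations

def union_cardinality(sets):
    # Inclusion-Exclusion: alternate the sign as the order of the intersection grows.
    total = 0
    for order in range(1, len(sets) + 1):
        sign = (-1) ** (order + 1)
        for combo in combinations(sets, order):
            total += sign * len(set.intersection(*combo))
    return total

>>> union_cardinality([{1, 2}, {2, 3}, {3, 4}])
4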

When thinking of that formula applied to our problem of “exploding” the intersections, I find it useful to visualize the situation with the following lattice, which stresses the fact that higher order intersections are contained in some lower order ones. Nodes in the lattice are intersections (all nicely lined up by order), edges represent the partial order relationship “contains”.

The diagram reminds us that, for instance, to get how many elements are only in A (and not in B nor C), one has to follow all paths from A down to the highest order intersection, flipping the sign as the order of the intersections increases. Which is: sum |A∩B| and |A∩C|, then subtract |A∩B∩C|. The result corresponds to the cardinality of (A∩B) ∪ (A∩C), and has to be subtracted from the cardinality of A.

Slightly harder: 4 sets instead of 3. Following the above diagram, we can compute the number of items in A but in none of the other sets via following all paths that go from A to the very bottom intersection A∩B∩C∩D, and using the principle to know what to add and what to subtract:

in A only = |A| - |A∩B| - |A∩C| - |A∩D|
                + |A∩B∩C| + |A∩B∩D| + |A∩C∩D|
                - |A∩B∩C∩D|

In this other example we want to see how many things are in A and C but in none of the other sets; same story, following the paths and flipping the sign at each step down.

in A∩C only = |A∩C| - |A∩B∩C| - |A∩C∩D|
                    + |A∩B∩C∩D|

Now we implement this algorithm in code. First some helper functions; list flattening:

def flatten(listlist):
    return [elem for list_ in listlist for elem in list_]
>>> flatten([['a', 'b'], ['c', 'd', 'e']])
['a', 'b', 'c', 'd', 'e']

permutations of a list:

def perm(sequence):
    if len(sequence) == 0:
        return [[]]
    else:
        out = []
        for cnt, elem in enumerate(sequence):
            out += map(lambda x: [elem] + x,
                       perm(sequence[:cnt] +
                            sequence[(cnt+1):]))
        return out
>>> perm(['a', 'b', 'c'])
[['a', 'b', 'c'],
 ['a', 'c', 'b'],
 ['b', 'a', 'c'],
 ['b', 'c', 'a'],
 ['c', 'a', 'b'],
 ['c', 'b', 'a']]
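
As an aside (not used in the code below), the standard library offers the same thing: itertools.permutations yields tuples instead of lists.

>>> from itertools import permutations
>>> [list(p) for p in permutations(['a', 'b', 'c'])]
[['a', 'b', 'c'], ['a', 'c', 'b'], ['b', 'a', 'c'], ['b', 'c', 'a'], ['c', 'a', 'b'], ['c', 'b', 'a']]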

All subsets of given cardinality of a list:

def choose_n(n, srcList):
    if n == 0:
        return [[]]
    else:
        out = []
        for cnt, elem in enumerate(srcList):
            out += map(lambda x: [elem] + x,
                       choose_n(n-1, srcList[(cnt+1):]))
        return out
>>> choose_n(2, ['a', 'b', 'c'])
[['a', 'b'], ['a', 'c'], ['b', 'c']]
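
Again, just as an aside, itertools has an equivalent: combinations yields the subsets of a given size as tuples, in the same order.

>>> from itertools import combinations
>>> [list(c) for c in combinations(['a', 'b', 'c'], 2)]
[['a', 'b'], ['a', 'c'], ['b', 'c']]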

All possible subsets of a list, excluding the empty one:

def allsubsets(srcList):
    out = []
    tot = len(srcList)+1
    complem = lambda universe, ensemble: \
        list(set(universe) - set(ensemble))
    complSrcList = lambda ensemble: complem(srcList,
                                            ensemble)
    for i in range(1, tot):
        if i > tot / 2: # integer division
            yield map(complSrcList,
                      choose_n(tot - (i + 1), srcList))
        else:
            yield choose_n(i, srcList)

Note the optimization where I work on the complementary list to reduce the work done by choose_n, whose recursion depth is equal to its first argument.

>>> list(allsubsets(['a', 'b', 'c']))
[[['a'], ['b'], ['c']],
 [['a', 'b'], ['a', 'c'], ['b', 'c']],
 [['a', 'c', 'b']]]

Now a function to precompute all intersection values. We store them in a dictionary where values are intersection cardinalities, and keys are the names of the sets involved in the intersection. Many items in this lookup table represent the same intersection, with a different order of the sets in the key. I do this to speed up the lookup: instead of normalizing the queries, I store the answer to all possible queries.

def intersLookup(lists):
    toInters = flatten(allsubsets(lists.keys()))
    def inters_n(names):
        return reduce(lambda s, t: set(s) & set(t),
                      [lists[name] for name in names[1:]],
                      lists[names[0]])
    lookup = {}
    for sequence in toInters:
        cardinality = len(inters_n(sequence))
        for variant in perm(sequence):
            lookup['&'.join(variant)] = cardinality
    return lookup
>>> intersLookup({'a': ['apple', 'banana'],
...               'b': ['orange', 'apple', 'watermelon'],
...               'c': ['peach', 'plum', 'pear', 'apple', 'orange']})
{'a': 2, 'c': 5, 'b': 3, 
 'a&b&c': 1, 'b&c&a': 1, 'c&b&a': 1,
 'a&c&b': 1, 'b&a&c': 1, 'c&a&b': 1,
 'b&a': 1, 'a&b': 1,
 'b&c': 2, 'c&b': 2,
 'c&a': 1, 'a&c': 1}

The following generator, named subints as in “sub-intersections”, is particularly important since it implements the concept of “walking down all paths” in the above diagrams, “one level at a time”. It returns the nodes of the lattice conveniently grouped for the sign flipping that happens in the Inclusion Exclusion formula.

def subints(elem, compl):
    for thing in allsubsets(compl):
        yield map(lambda x: elem + x, thing)

Consider the example above, the first one I gave for the 4-set case. We want to see how much of A is “eaten” by the sub-intersections. What to sum, what to subtract? Here is the answer:

>>> si = subints(['a'], ['b', 'c', 'd'])
>>> si.next() # add
[['a', 'b'], ['a', 'c'], ['a', 'd']]
>>> si.next() # subtract
[['a', 'b', 'c'], ['a', 'b', 'd'], ['a', 'c', 'd']]
>>> si.next() # add
[['a', 'c', 'b', 'd']]
>>> si.next() # it's over
Traceback (most recent call last):
  File "", line 1, in 
StopIteration

The trick is to observe that in a sub-lattice of our diagrams the lower levels are just intersections of the nodes in the upper levels with the sets that were left out. Now it’s clear why allsubsets works the way it does: it returns subsets grouped by size, so that subints can be written more concisely.
Let’s see the second example above, again from the 4-sets case:

>>> si =  subints(['a', 'c'], ['b', 'd'])
>>> si.next()
[['a', 'c', 'b'], ['a', 'c', 'd']]
>>> si.next()
[['a', 'c', 'b', 'd']]
>>> si.next()
Traceback (most recent call last):
  File "", line 1, in 
StopIteration

We now have everything needed to write the inclusion-exclusion formula:

def flip():
    curr = 1
    while True:
        yield curr
        curr *= -1

def o(f, g):
    # function composition
    def helper(x):
        return f(g(x))
    return helper

def inclusionexclusion(elem, compl, allInters):
    currentInters = allInters['&'.join(elem)]
    sign = flip()
    subintValue = \
        sum(map(lambda (sign, value): sign * value,
                zip(sign,
                    map(o(sum, (lambda nodes: 
                                map(lambda node: allInters['&'.join(node)],
                                    nodes))),
                        subints(elem, compl)))))
    return currentInters - subintValue
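
If the nested maps feel dense, here is an equivalent rewrite of mine with plain loops (same behaviour as the function above, just more verbose):

def inclusionexclusion_loops(elem, compl, allInters):
    result = allInters['&'.join(elem)]
    sign = -1  # the first generation of sub-intersections is subtracted
    for generation in subints(elem, compl):
        level = sum(allInters['&'.join(node)] for node in generation)
        result += sign * level
        sign *= -1
    return result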

Let’s see:

>>> allInters = intersLookup({'a': ['apple', 'banana'],
                'b': ['orange', 'apple', 'watermelon'],
                'c': ['peach', 'plum', 'pear', 'apple', 'orange']})

how many items in a and b but not in c?

>>> inclusionexclusion(['a', 'b'], ['c'], allInters)
0

zero, since the only common element is apple, but it’s also in c. How many elements in the intersection of all three?

>>> inclusionexclusion(['a', 'b', 'c'], [], allInters)
1

the apple, indeed. How many in a, but not in b nor c?

>>> inclusionexclusion(['a'], ['b', 'c'], allInters)
1

that’s the banana.

The last thing left to do is to package all of this into a shiny function that computes our complete explosion:

def complem(universe, ensemble):
    # same helper used as a local lambda inside allsubsets,
    # needed here too, so it gets a module-level definition
    return list(set(universe) - set(ensemble))

def explode(allNames, allInters):
    out = {}
    for elem in flatten(allsubsets(allNames)):
        out['&'.join(elem)] = \
                    inclusionexclusion(elem,
                                       complem(allNames, elem),
                                       allInters)
    return out

There we go:

>>> sets = {'a': ['apple', 'banana'],
...         'b': ['orange', 'apple', 'watermelon'],
...         'c': ['peach', 'plum', 'pear', 'apple', 'orange']}
>>> 
>>> allInters = intersLookup(sets)
>>> explode(sets.keys(), allInters)
{'a': 1, 'a&c&b': 1, 'c': 3, 'b': 1, 'c&b': 1, 'a&b': 0, 'a&c': 0}

The code is available at https://github.com/gghh/intersections. Images are drawn with gimp and zwibbler. The 4-sets Venn diagram comes from Wikipedia.

The annoying merge conflicts in case of criss cross

Here I show an example of a problematic merge that I sometimes hit with Mercurial. What happens is that for a given file foo.txt two commits, one the parent of the other, get in competition at merge time. Obviously the child edit should be selected as the winner of the merge, but this requires user intervention when it shouldn’t be necessary.

Here is a minimal example of such a history:

Our goal is to merge node 5 and 6. The first thing we note is that the GCA isn’t unique, since both 1 and 4 are good candidates. Since for my example I need 4 to be selected (so that things can go bad), I inserted node 3; this way 4 is the common ancestor furthest from the root, and will then be selected.

Why did I color 0, 1 and 2 in pink? Because those are the only nodes where I edit my foo.txt file, if we exclude merges. Before I show a “Try It And See” example with Mercurial, I’ll describe what happens at each node in plain English.

  0. foo.txt is edited.
  1. foo.txt is edited again.
  2. foo.txt is edited a third time.
  3. Somebody forks at 0. foo.txt is not modified.
  4. Development continues on top of 3. foo.txt is not modified. At this stage, foo.txt has the very same content as it had in 0.
  5. 4 is merged with 1. With respect to foo.txt, the 3-way merge algorithm automatically selects the changes from 1 as winners: if a line is changed in “this” and unchanged in “other”, the change is selected.
  6. 2 is merged with 4. Same story, ordinary business for 3-way merge: the most recent change is selected.

And then we want to merge 5 and 6. Mercurial selects 4 as the base. Let’s observe things from foo.txt’s point of view:

  • (this) in 6 we have the latest version (from 2).
  • (other) in 5 we have the next-to-last version (from 1). This should not “compete” with the latest… should it?
  • yes, it does, because in 4 (the base of the merge) we have the very first version, and so the 3-way merge has no idea which of “this” or “other” is more recent. They both look like changes when compared against the early version in 0 (inherited by 4).

Now, the real example. First, we create commits 0, 1 and 2:

$ hg init
$ echo 'I am a cow.' > foo.txt ; hg add foo.txt ; hg commit -m foo
$ echo 'I have superpowers.' > foo.txt ; hg commit -m foo
$ echo 'I am a cow with superpowers.' > foo.txt ; hg commit -m foo

Then we create the second branch, with commits 3 and 4 forking from 0.

$ hg up 0
$ touch bar.txt ; hg add bar.txt ; hg commit -m bar
$ touch baz.txt ; hg add baz.txt ; hg commit -m baz

We are now at 4, and we merge with 1. The base will be 0, and all goes fine: the most recent version of foo.txt is selected.

$ hg merge 1 ; hg commit -m merged
$ cat foo.txt 
I have superpowers.

We then update to 2 and merge with 4. Again, no surprises from the merge. The most recent version (i.e. 2) is selected.

$ hg up 2
$ hg merge 4 ; hg commit -m merged
$ cat foo.txt 
I am a cow with superpowers.

A look at the history so far:

$ hg log --graph --template '{rev} {desc}'
@    6 merged
|\
| | o  5 merged
| |/|
| o |  4 baz
| | |
| o |  3 bar
| | |
o---+  2 foo
 / /
| o  1 foo
|/
o  0 foo

Being now in 6, we merge with 5

$ hg merge

As expected, the stupid conflict.

It is worth noting that the Consensus Merge algorithm would not solve this situation. It would take into account the two candidates for the GCA (1 and 4), and weigh their “opinions” equally. But one is right and the other is wrong: in this small example, democracy fails to apply.

My naive observation is that when the history of a file is all on the same side of the merge, that file should not be touched at merge time. Just take the (single) head of the sub-DAG that tracks all commits affecting said file. But maybe I am missing undesired side effects of this approach. Also, thinking “per file” is usually the wrong way to reason about source control. Considering the repository as a whole is a thing that Mercurial and Git got right with respect to their predecessors. A “file” is a very artificial way to split code into chunks, as far as version control is concerned. The atomic entity is “the single line of code”.

In this case, with multiple GCAs, one can see that one of those GCAs is good for the file we’re considering (foo.txt should be merged with 1 as base), and the other is not. I wonder if the following principle holds: “if the GCA is unique, then it is optimal for all files (better: lines of code)”, where by “optimal” I mean “it doesn’t compare a revision against its ancestors in stupid ways”. On the other hand, if there are many GCAs, each line of code has its favourite one, and it’s impossible to choose.

How does Mercurial select the Greatest Common Ancestor

The GCA of two nodes X and Y in a DAG (Directed Acyclic Graph) is an ancestor of both X and Y such that no descendant of it is also a common ancestor of X and Y. This definition is equivalent to the revset expression heads(ancestors(X) and ancestors(Y)), i.e. the head (if unique) of the intersection among the ancestors of X and those of Y.

Mercurial adopts a slightly different definition: the GCA is the common ancestor of the two nodes which is furthest from a root, as measured by the longest path. This makes sense, since the GCA is used as base for merges and it’s meaningful to take a (topologically) more recent version in case of ambiguity.

I am going to discuss the algorithm that was used to perform GCA computations in Mercurial until version 2.5.4; in 2.5.5 this code was replaced by a faster algorithm.

The interesting part is what happens when there are multiple candidates for the GCA, i.e. the GCA is not unique. In that case Mercurial picks the one it finds first. Here we walk through the algorithm step by step and see how it works.

First off, where to find things: the ancestor() function is in the mercurial.ancestor module. Here is the docstring:

$ python

>>> from mercurial import ancestor
>>> help(ancestor.ancestor)

ancestor(a, b, pfunc)
    Returns the common ancestor of a and b that is furthest from a
    root (as measured by longest path) or None if no ancestor is
    found. If there are multiple common ancestors at the same
    distance, the first one found is returned.
    
    pfunc must return a list of parent vertices for a given vertex

We’ll work with the following graph, where GCA(4, 5) isn’t unique: candidates are 1 and 2, equally good (parents are on the left and children on the right).

We can try it right away, like this:

>>> parents = {0: [], 2: [0], 1: [0], 3: [2],
...            5: [1, 2], 4: [3, 1]}
>>> 
>>> pfunc = lambda x: parents[x]
>>> ancestor.ancestor(4, 5, pfunc)
1

1 is chosen… apparently it’s the one found first. Let’s dig deeper into the algorithm. The ancestor() function does its job in two steps:

  1. prepare a depth dictionary that associates each ancestor of both nodes X and Y to its distance from a root, computed as the longest path to it.
  2. use a generator for each of the two starting vertices that, at each iteration, returns all ancestors “of the same generation”, i.e. all ancestors at a given distance d from the root. The two generators are such that they return generations in descending order of distance. If you want, you can picture this as two ancestor lists that “race” towards the root, until they meet at the same distance level (or don’t meet, in which case no common ancestor is found).

I have split the code into three separate functions for this explanation. First, the construction of the depth dictionary:

def depthf(a, b, pfunc):
    a, b = sorted([a, b])
    
    # find depth from root of all ancestors
    # depth is stored as a negative for heapq
    parentcache = {}
    visit = [a, b]
    depth = {}
    while visit:
        vertex = visit[-1]
        pl = pfunc(vertex)
        parentcache[vertex] = pl
        if not pl:
            depth[vertex] = 0
            visit.pop()
        else:
            for p in pl:
                if p not in depth:
                    visit.append(p)
            if visit[-1] == vertex:
                # -(maximum distance of parents + 1)
                depth[vertex] = min([depth[p] for p in pl]) - 1
                visit.pop()
    return depth, parentcache

The parentcache dictionary is there to save some time at a later step; we’ll pass it along to the forthcoming functions.

What’s to note here is that we can compute the depth of a node only once we have the depth of all its parents, taking the maximum and adding 1 (the min([depth[p] for p in pl]) - 1 expression, since depths are stored negated).

We visit nodes in a depth-first fashion. We’ll encounter a lot of unseen vertices and append them to the visit array (visit.append(p)), until at some point we hit a root and set its depth to 0 (depth[vertex] = 0). At that point we start backtracking, pop()-ing nodes from visit as soon as we compute their depth.

Note that the condition visit[-1] == vertex is equivalent to “all parents of vertex already have their depth computed”, since we are handling DAGs: the absence of cycles assures us that no vertex can be found twice in the visit list. Well, strictly speaking, the point is that no node can be a parent of itself: the only thing that could change the last element of visit since it was read at the top of the iteration (vertex = visit[-1]) is the loop that appends the unseen parents.

Here is the result when applied to our toy example:

>>> depth, parentcache = depthf(4, 5, pfunc)
>>> depth
{0: 0, 1: -1, 2: -1, 3: -2, 4: -3, 5: -2}

Depths are stored negated since we’ll later use a heap to figure out the furthest elements.
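
A tiny illustration of why the negation matters: heapq always pops the smallest element first, so negated depths come out furthest-from-root first.

>>> import heapq
>>> h = [(0, 0), (-1, 1), (-3, 4), (-2, 3)]
>>> heapq.heapify(h)
>>> [heapq.heappop(h) for _ in range(len(h))]
[(-3, 4), (-2, 3), (-1, 1), (0, 0)]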

Here is the helper function that traverses ancestors in order of decreasing distance from the root:

def ancestors(vertex, depth, parentcache):
    h = [(depth[vertex], vertex)]
    seen = set()
    while h:
        d, n = heapq.heappop(h)
        if n not in seen:
            seen.add(n)
            yield (d, n)
            for p in parentcache[n]:
                heapq.heappush(h, (depth[p], p))

A few things to note here:

  1. duplicate nodes are removed
  2. since the distance of a node is always larger than the distance of any of its parents, it’s safe to pop() an element and only later push() its parents into the heap. Subsequent pop() will always return elements in the desired order.

Here the result of applying it to the vertex 4:

>>> list(ancestors(4, depth, parentcache))
[(-3, 4), (-2, 3), (-1, 1), (-1, 2), (0, 0)]

Another helper function: we need to get all ancestors of a given depth, not one by one. This one does the buffering into “generations”:

def generations(vertex, depth, parentcache):
    sg, s = None, set()
    for g, v in ancestors(vertex, depth, parentcache):
        if g != sg:
            if sg:
                yield sg, s
            sg, s = g, set((v,))
        else:
            s.add(v)
    yield sg, s

Here are the generations that lead to the vertex 4:

>>> list(generations(4, depth, parentcache))
[(-3, set([4])), (-2, set([3])), (-1, set([1, 2])), (0, set([0]))]

Finally, here is the function that puts the pieces together and gives the final result:

def gca(a, b, depth, parentcache):
    x = generations(a, depth, parentcache)
    y = generations(b, depth, parentcache)
    gx = x.next()
    gy = y.next()
    # increment each ancestor list until it is closer to root
    # than the other, or they match
    try:
        while True:
            if gx[0] == gy[0]:
                for v in gx[1]:
                    if v in gy[1]:
                        return v
                gy = y.next()
                gx = x.next()
            elif gx[0] > gy[0]:
                gy = y.next()
            else:
                gx = x.next()
    except StopIteration:
        return None

gx and gy are tuples like (depth, nodes). A GCA can be found only when gx[0] == gy[0]; since we start from the maximum depths and progressively decrease them, we’re guaranteed to find an element in the intersection of the ancestors of the original nodes that is furthest away from the root, or we find nothing at all.

Here we run it, showing the value of gx and gy as we go along:

>>> gca(4, 5, depth, parentcache)
gx (-3, set([4]))
gy (-2, set([5]))
----
gx (-2, set([3]))
----
gx (-1, set([1, 2]))
gy (-1, set([1, 2]))
----
1

Note that 1 is chosen, instead of 2, for no particular reason: just because it’s the first element of list(set([1, 2])).

Interestingly, if we add a node 6 between 0 and 2, this makes 2 the common ancestor furthest from the root, and so the algorithm doesn’t see any ambiguity in the choice anymore:

>>> parents = {0: [], 2: [6], 1: [0], 6: [0], 3: [2],
...            5: [1, 2], 4: [3, 1]}
>>> 
>>> pfunc = lambda x: parents[x]
>>> 
>>> depth, parentcache = depthf(4, 5, pfunc)
>>> gca(4, 5, depth, parentcache)
gx (-4, set([4]))
gy (-3, set([5]))
----
gx (-3, set([3]))
----
gx (-2, set([2]))
gy (-2, set([2]))
----
2

In the last iteration the intersection between gx[1] and gy[1] is the element 2 alone.

Caching in python with a descriptor and a decorator

The Mercurial SCM has several strategies to cache expensive computations; you can find an overview of all of them at this page. This post is about the propertycache decorator: a mechanism to replace a function with its result the very first time it is called. The trick is made possible by the combination of two features of the python programming language: descriptors and decorators.

Python descriptors

The python descriptor protocol is something that works as follows: if you override the methods __get__(), __set__() or __delete__() in a (new style) python class, you’re given the chance to execute your own code when an object of that class is evaluated (__get__()) or is bound to another object (__set__()). Which is, you can decide what happens when that evaluation (or binding) occurs.

One more thing to remember is that an object that overrides __get__() and __set__() behaves like a descriptor (which is: these methods are actually called instead of the default evaluation/binding behaviour) only if it is a class attribute of another class.

Here are a few references on descriptors:

A dummy example

Here we use the “descriptor protocol” to prevent user code from assigning to an attribute of an object. That attribute is kept at a constant value instead:

class joke(object):
    def __get__(self, obj, objtype):
        return 'Bazinga!'
    def __set__(self, obj, value):
        pass

class foo(object):
    x = joke()

We had to implement __set__() too, despite the fact that it’s empty: otherwise python would have regarded joke as a non-data descriptor, which is a slightly different beast. Here is what happens when writing and reading the attribute x of an object of type foo:

>>> f = foo()
>>> f.x
'Bazinga!'
>>> f.x = 5
>>> f.x
'Bazinga!'
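
A quick extra check (my own) that the “class attribute of another class” requirement matters: the descriptor machinery is only triggered by attribute access, and the raw descriptor object is still reachable through the class dictionary.

>>> foo.x                 # attribute access on the class also goes through __get__()
'Bazinga!'
>>> foo.__dict__['x']     # the descriptor object itself, no __get__() involved
<__main__.joke object at 0x...>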

Python decorators

Are cool. In python any callable that wraps a function is called a “decorator”. Example: you have a function that does stuff, like

def deco(arg):
    # stuff

and another function, that does some other stuff:

def fun(arg):
    # other stuff

Sometimes it is interesting to rebind fun like

fun = deco(fun)

and deco is said to be a “decorator” for fun. Some basic references:

Python has a convenient syntactic sugar for the rebinding above: it can be written as

@deco
def fun(arg):
    # other stuff

I often use decorators when I need to enhance the behaviour of a function by having something done before the function call, or after. So I write the following closure and use it as a decorator:

def deco(fun):
    def helper(arg):
        # do something before
        result = fun(arg)
        # do something after
        return result
    return helper

and then, for whatever fun this makes sense, I rebind its name to

fun = deco(fun)
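
As a concrete (and entirely hypothetical) instance of this pattern, here is a decorator that measures how long each call takes; the names are mine, not from any library:

import time

def timed(fun):
    def helper(arg):
        start = time.time()            # do something before
        result = fun(arg)
        elapsed = time.time() - start  # do something after
        print '%s took %.3f seconds' % (fun.__name__, elapsed)
        return result
    return helper

@timed
def slow_square(x):
    time.sleep(0.1)
    return x * x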

The propertycache decorator

Without further ado, here is the code, taken verbatim from the mercurial codebase:

class propertycache(object):
    def __init__(self, func):
        self.func = func
        self.name = func.__name__
    def __get__(self, obj, type=None):
        result = self.func(obj)
        self.cachevalue(obj, result)
        return result

    def cachevalue(self, obj, value):
        setattr(obj, self.name, value)

and the example usage, taken straight from the mercurial wiki:

from util import propertycache

class foo(object):
    @propertycache
    def expensive(self):
        print 'created'
        return [1, 2, 3]

f = foo()
f.expensive # prints created
f.expensive # simply returns [1, 2, 3]

# to forget the saved value and calculate it again, use delattr
delattr(f, 'expensive')

How does it work

Before anything happens (that is, when the class body of foo is executed), python rebinds the method at class level:

foo.expensive = propertycache(expensive)

Note that this is a descriptor object, since the class propertycache overrides the __get__() method. From this point on, you can’t call the method f.expensive() anymore:

>>> f.expensive()
<__main__.foo object at 0xb742d28c>
created
Traceback (most recent call last):
  File "", line 1, in 
TypeError: 'list' object is not callable

What you do, instead, is refer to f.expensive as a data member (which is, no () parens). The first time f.expensive is evaluated, the original method f.expensive() is called (remember: it was saved in the func member of the propertycache object). Its result is then rebound to f.expensive by the cachevalue() method.
Now f.expensive is not a descriptor object anymore, but only the returned value of the original method: further evaluations won’t go thru the call of the original method anymore, just return the cached value.
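
One way to watch the replacement happen (a little sketch of mine, relying on the fact that cachevalue() uses setattr): after the first access the value shows up in the instance dictionary, shadowing the descriptor defined on the class.

>>> f = foo()
>>> 'expensive' in f.__dict__
False
>>> f.expensive
created
[1, 2, 3]
>>> f.__dict__['expensive']
[1, 2, 3]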

As a final remark, note that this doesn’t generalize to methods that have more arguments than the self reference, since the __get__() method has a well defined signature and you can’t add more items in there: only a reference to the “owner” object is allowed (the instance of foo in our example).
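
For methods (or plain functions) that do take arguments, one would need a cache keyed on those arguments instead. A minimal sketch of that idea, mine and not Mercurial code:

def memoize(fun):
    cache = {}
    def helper(*args):
        # cache on the full argument tuple; only works for hashable arguments
        if args not in cache:
            cache[args] = fun(*args)
        return cache[args]
    return helper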

How was it done before?

Descriptors were introduced in the Mercurial codebase with revision 595ba2537d4f by Dirkjan Ochtman, the same person who wrote the excellent chapter on the AOSA book about Mercurial.

How was caching done before? It looked like this:

class foo:
    def __getattr__(self, name):
        if name == 'expensive1':
            print 'Computing expensive1...'
            self.expensive1 = [1,2,3]
            return self.expensive1
        elif name == 'expensive2':
            print 'Computing expensive2...'
            self.expensive2 = [4,5,6]
            return self.expensive2
        else:
            raise AttributeError(name)

Let’s see that in action:

>>> f = foo()
>>> f.expensive1
Computing expensive1...
[1, 2, 3]
>>> f.expensive2
Computing expensive2...
[4, 5, 6]
>>> f.expensive1
[1, 2, 3]

Here the concertmaster is the __getattr__() method, which, if implemented, intercepts all accesses to members not found otherwise; in our example, those are expensive1 and expensive2. Once accessed the first time, their value is computed and bound to a data member; subsequent accesses will find them without passing thru __getattr__.

It goes without saying that the decorator described above is much more terse than this boilerplate code; it factors away all the common behaviour.

Mercurial histedit behind the scenes

What’s the magic behind the history rewriting extension in Mercurial? It’s called histedit, it’s written by Augie Fackler, and it’s a fundamental tool in my daily coding activity. I fired up pdb and watched closely what it does. Lean back and enjoy (all good ideas are Augie’s, all mistakes are mine!).


First off, I am obliged to raise a giant “don’t try this at home” sign upfront. I will use internal Mercurial APIs, which are deliberately undocumented; the reason is very simple: they might change next week without further notice.

Having said that, let me ask you a question: how would you create a changeset that is equivalent to a sequence of N other changesets? Which is, in mercurial lingo, how would you implement folding N changesets? You’d probably smash the first two together, then add the third one to that result, and after that the fourth, the fifth and so on. That’s what happens in histedit. Let’s see it in action.

Before we tackle the adventurous task of editing the history of our repository, we have to learn how to do two operations:

  1. pick a changeset somewhere in the history and apply it to the working directory
  2. collapse a chain of commits into a single one

Pick there, apply here

The histedit extension special-cases two situations.

In place

consider the repo created with

$ hg init
$ touch foo ; hg add foo ; hg commit -m foo
$ touch bar ; hg add bar ; hg commit -m bar
$ hg up 0

which then looks like


What we want to do here is to extract changeset 1 and apply it to the working directory. Since the parents of cset 1 and of the working directory are the same (they’re both children of changeset 0), the situation can be handled by a simple shot of cmdutil.revert(). Let’s switch to the python shell:

from mercurial import ui, cmdutil, node, localrepo

repo = localrepo.localrepository(ui.ui(), '.')
cmdutil.revert(ui.ui(), repo, repo[1],
               (repo[0].node(), node.nullid), all=True)
repo.commit(text="Applied cset 1")

this, as promised, leads us to


The revert(ui, repo, ctx, parents, *pats, **opts) function restores the files to the state they had at revision ctx, and sets the parents of the working directory. You might think there is something odd going on here, and I did too: this is a quite unusual application of revert, which is more often used to restore an ancestor revision. Here we’re picking stuff from another branch (let’s say, from “the future”).

Far away nodes: merge needed

The other situation is when the parents of the working directory and of our target revision differ but… those two (different) parents have the same content. It might seem an artificial situation, but it is what happens in a generic step of the histedit algorithm. In that case we resort to a merge. Let’s first build the sample repo:

$ touch foo ; hg add foo ; hg commit -m foo
$ touch bar ; hg add bar ; hg commit -m bar
$ touch baz ; hg add baz ; hg commit -m baz
$ hg up 0

then again the revert() trick to apply changeset 1 on top of 0, but on a parallel line; the code is exactly the python snippet above, so we have


What’s the histedit recipe to apply 2 on top of 3? We merge the working directory (updated to 3) with 2, setting 1 as the common ancestor; this isn’t the topological truth (1 is not an ancestor of 3) but it is morally ok, since 1 and 3 contain the same state.

repo = localrepo.localrepository(ui.ui(), '.')
merge.update(repo, repo[2].node(), True, True, False, repo[1].node())
repo.setparents(repo[3].node(), node.nullid)
repo.dirstate.write()
repo.commit(text="Applied cset 2")

here is the result:


Note how I carefully clean up the parents of the changeset, removing traces of the merge that just happened.

All of this is nicely packaged in the histedit.applychanges() function in the histedit extension.

Collapse many into one

Again, let’s kickstart with a toy repository:

$ hg init
$ touch foo ; hg add foo ; hg commit -m foo
$ touch bar ; hg add bar ; hg commit -m bar
$ touch baz ; hg add baz ; hg commit -m baz
$ touch woo ; hg add woo ; hg commit -m woo

We want to take changesets 1 to 3 and smash them into one that sums them all. Here is the python code:

from mercurial import ui, cmdutil, node, localrepo, context

repo = localrepo.localrepository(ui.ui(), '.')

files = set(repo[1].files()) | set(repo[2].files()) | set(repo[3].files())
mf = repo[3].manifest()

def filectxfn(repo, ctx, path):
    if path in mf:
        fctx = repo[3][path]
        mctx = context.memfilectx(fctx.path(), fctx.data())
        return mctx
    raise IOError()

parents = (repo[0].node(), node.nullid)
new = context.memctx(repo,
                     parents=parents,
                     text='collapsed 1 to 3',
                     files=files,
                     filectxfn=filectxfn,
                     user='gghh')

repo.commitctx(new)


Again, there is a function for this in the histedit extension, histedit.collapse(). I was surprised looking at that code: I have a mental model of commits as, say, an ordered sequence of diffs; that’s how I read this picture anyway. If I am collapsing changesets, I expect some sort of iteration over that chain. But that’s not what collapse() does. Look at the closure filectxfn: basically it considers only filelogs from the last changeset, skipping those filelogs that didn’t change since the first. As if the last changeset contained everything necessary to know what happened in between.

Which is exactly how mercurial works: filelogs store the diff sequence internally, and externally only the full contents are exposed. So basically all you can do is ask a filelog for the contents of a file at a given revision, plus its parents.
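
To make that concrete, here is roughly what such a query looks like through the (internal, unstable) context API; a sketch, assuming the repo object from the snippets above and a file named 'foo':

fctx = repo[3]['foo']      # the file 'foo' as it appears in changeset 3
data = fctx.data()         # the full content at that revision, not a diff
parents = fctx.parents()   # parent file revisions, again as file contexts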

When does this show start? It’s getting late

With all this machinery in place, we’re now ready to juggle those little balls so fast that tortoiseHg will have a hard time catching up.


First off, the toy repo. For the occasion we’re giving foo, bar and baz a little break and using a piece of real literature.

$ hg init
$ echo 'Dorothy lived in the midst of the great\
> Kansas prairies, with Uncle' > chap1.txt
$ hg add chap1.txt ; hg commit -m "line 1"

$ echo "Henry, who was a farmer, and Aunt Em,\
> who was the farmer's wife. Their" >> chap1.txt
$ hg commit -m "line 2"

$ echo 'house was small, for the lumber to build it\
> had to be carried by wagon' >> chap1.txt
$ hg commit -m "line 3"

$ echo 'many miles.  There were four walls, a floor\
> and a roof, which made one' >> chap1.txt
$ hg commit -m "line 4"

$ echo 'room; and this room contained a rusty looking\
> cookstove, a cupboard for' >> chap1.txt
$ hg commit -m "line 5"

$ echo 'the dishes, a table, three or four chairs,\
> and the beds.' >> chap1.txt
$ hg commit -m "line 6"


Our plan is to fold all changesets between 1 and 5 inclusive into a single one. Which is, we would run

hg histedit 1

The histedit commands will be:

(pick, 1)
(fold, 2)
(fold, 3)
(fold, 4)
(fold, 5)

The first thing we do is to update the working directory to 1 and play the cmdutil.revert() trick to apply 2 on top of 1.

from mercurial import ui, hg, commands, localrepo, node, merge
from mercurial import cmdutil, scmutil, context, hg, repair
from hgext import histedit

repo = localrepo.localrepository(ui.ui(), '.')

hg.update(repo, repo[1].node())

cmdutil.revert(ui.ui(), repo, repo[2],
               (repo[1].node(), node.nullid), all=True)
repo.commit(text="2 bis")


Now we again update to 0, and collapse the line 1 -- 6 into changeset 7, which we apply on the top of 0.

hg.update(repo, repo[0].node())
histedit.collapse(repo, repo[1], repo[6], {})
hg.update(repo, repo[7].node())


Now we want to apply 3 on the top of 7. That’s the merge case we saw earlier; our reference ancestor will be 2, since 0 -- 7 is the same as 0 -- 1 -- 2.

merge.update(repo, repo[3].node(), True, True, False, repo[2].node())
repo.setparents(repo[7].node(), node.nullid)
repo.dirstate.write()
repo.commit(text="7 and 3")


Now guess what? collapse the line 7 -- 8, call it 9.

hg.update(repo, repo[0].node())
histedit.collapse(repo, repo[7], repo[8], {})
hg.update(repo, repo[9].node())


Come on, a couple more and we’re done: 4 on top of 9, then 5 on top of 11.

merge.update(repo, repo[4].node(), True, True, False, repo[3].node())
repo.setparents(repo[9].node(), node.nullid)
repo.dirstate.write()
repo.commit(text="4 bis")

hg.update(repo, repo[0].node())
histedit.collapse(repo, repo[9], repo[10], {})
hg.update(repo, repo[11].node())

merge.update(repo, repo[5].node(), True, True, False, repo[4].node())
repo.setparents(repo[11].node(), node.nullid)
repo.dirstate.write()
repo.commit(text="5 bis")

now the final collapse: changeset 13 is the result of 1 -- 2 -- 3 -- 4 -- 5 all folded together.

hg.update(repo, repo[0].node())
histedit.collapse(repo, repo[11], repo[12], {})
hg.update(repo, repo[13].node())


Now it’s time to throw away the leftovers. repair.strip() comes to the rescue:

repair.strip(ui.ui(), repo, repo[1].node())

repair.strip(ui.ui(), repo, repo[1].node())

repair.strip(ui.ui(), repo, repo[1].node())

repair.strip(ui.ui(), repo, repo[1].node())


Hooray!

Is TCP turing complete?


Is the TCP protocol capable of… computing things? In other words: consider the topology of your network (nodes and the connections between them) and the data that each node plans to send across the network as “the input” of TCP seen as a “computing machine”, and the data exchange that actually happens (or maybe just its rate at steady state) as “the output”. Can one imagine some weird, complicated “encoding scheme” so that, at the end of the day, one can make TCP compute prime numbers?

In much more cryptic terms, for people used to theoretical computer science lingo, is the TCP protocol somehow Turing Complete?

I have asked myself this question since I learnt of the AIMD congestion control scheme, where basically the rate at which data is transmitted increases linearly in time until a packet loss is detected. At that point, the transmission rate is divided by a constant factor. Well, I feel that this scheme is complex enough that there might be some computational power lurking in there.
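
To pin down the scheme I have in mind, here is a toy simulation of AIMD, purely illustrative and with made-up parameters:

def aimd(loss_events, steps, increase=1.0, decrease_factor=2.0):
    # Additive Increase, Multiplicative Decrease: grow the rate linearly,
    # divide it by a constant factor whenever a loss is detected.
    rate, history = 1.0, []
    for t in range(steps):
        if t in loss_events:
            rate /= decrease_factor
        else:
            rate += increase
        history.append(rate)
    return history

>>> aimd(loss_events={4, 8}, steps=10)
[2.0, 3.0, 4.0, 5.0, 2.5, 3.5, 4.5, 5.5, 2.75, 3.75]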

Personally I find it very amusing when Turing completeness sneaks into systems “by accident”, and this happens quite often in real world situations. Some examples:

Usually when you have Turing completeness by accident it’s quite an undesirable thing, as famous programmer Jamie Zawinski points out in his blog here; interesting also the comment on the DNS resource record NAPTR.

Any comment is *very* welcome! I feel the formulation of my question is quite fuzzy, so even if somebody can just make me better understand *what* I am asking… well, that would make my day :-)

A feature of (q)bzr I really love

Bazaar lets you compare branches. Not just branches from the same repository, i.e. all stored in the same .bzr/ folder, but arbitrary branches that may (or may not) have some common history. All those years of typing code for fun and profit, and staring at directed acyclic graphs wondering where the heck that stupid C macro that breaks the build was introduced, might have changed my sense of “awesome”.
But ladies and gentlemen, this is pretty much awesome.

Let me show you an example.

Playing the TIAS game with bazaar

TIAS is Try It And See, no less no more. I did some TIAS in another post about Mercurial, and I am doing it again. Let’s get this party started.

First thing, we create a repository and fork it two times. Actually in bzr the correspondence between repositories and branches can be one to one, with bzr init, or one to many, with bzr init-repo. The latter creates what is called a “shared repository”, i.e. a repository that can host many branches. This is indeed the way git and mercurial work; the former option is a bzr-only thing. Let’s stick to the one-repo-one-branch setup; it doesn’t make a big difference for the purpose of this post, and it requires less typing.

$ bzr init b
Created a standalone tree (format: 2a)
$ bzr branch b c1
Branched 0 revisions.
$ bzr branch b c2
Branched 0 revisions.

We now switch to c1, enter a couple of commits, and push them back to the parent branch, which is b/.

$ cd c1

/c1$ echo 'hello world!' > src1.txt
/c1$ bzr add src1.txt
adding src1.txt
/c1$ bzr commit -m "initial import"
Committing to: c1/
added src1.txt
Committed revision 1.

/c1$ echo 'hello world!\nIt is gonna be a nice day.' > src1.txt
/c1$ bzr commit -m "added optimism."
Committing to: c1/
modified src1.txt
Committed revision 2.

/c1$ bzr push :parent
All changes applied successfully.
Pushed up to revision 2.

Some time later, developer c2 enters the scene, retrieves the current content of b/ and makes a commit. He doesn’t push back to b/, because he’s not quite done yet.

$ cd ../c2/

/c2$ bzr pull
Using saved parent location: b/
+N  src1.txt
All changes applied successfully.
Now on revision 2.

/c2$ echo "That's a nice Tnettenba" > src2.txt
/c2$ bzr add src2.txt
adding src2.txt
/c2$ bzr commit -m "added joke"
Committing to: c2/
added src2.txt
Committed revision 3.

c1 gets back to work: he continues to develop his thing.

/c2$ cd ../c1

/c1$ echo 'voodoo atchoo to you too' > src3.txt
/c1$ bzr add src3.txt
adding src3.txt
/c1$ bzr commit -m 'thank you!'
Committing to: c1/
added src3.txt
Committed revision 3.

The release engineer, who has direct access to the server where b/ is stored, has been thinking on how to fix that shell script that handles the deployment. He finally figures it out; he’s in a hurry, and doesn’t want to branch and merge — he commits directly on b/. It happens.

$ cd ../b/

/b$ echo 'the grumpy cat says it had fun, once' > src4.txt
/b$ bzr add src4.txt
adding src4.txt
/b$ bzr commit -m 'that never happened.'
Committing to: b/
added src4.txt
Committed revision 3.

It’s c1‘s turn to play now. He had a coffee with c2; they talked about the code and c1 realizes he needs the new feature that c2 just committed. So he merges with c2, and then goes a little further.

/c1$ bzr merge ../c2
+N  src2.txt
All changes applied successfully.
/c1$ bzr commit -m "thanks c2, that was neat"
Committing to: c1/
added src2.txt
Committed revision 4.

/c1$ echo 'if (0 == a);' > src6.txt
/c1$ bzr add src6.txt
adding src6.txt
/c1$ bzr commit -m 'Zero the value is not. Yoda style.'
Committing to: c1/
added src6.txt
Committed revision 5.

It’s almost time to leave, and c2 has just the time to fix that little annoying bug before heading home.

/c1$ cd ../c2

/c2$ echo 'NaNNaNaNNaNaN' > src5.txt
/c2$ bzr add src5.txt 
adding src5.txt
/c2$ bzr commit -m 'Batman. I need Batman.'
Committing to: c2/
added src5.txt
Committed revision 4.

The sweet pretty pictures

The day after, c1 and c2 get back to work, chit-chat a little, and realize that they don’t know exactly how their two branches relate. Ok, somebody pushed, somebody merged… but they’d like to see it crystal clear, to anticipate all the problems that will arise when, next wednesday, all lines are going to be merged into the production code. Moreover, they’ve no idea that the release engineer jumped in and added stuff directly to the reference repo b/. What do they do?
Easy peasy: bzr qlog b c1 c2.
The amazing command

bzr qlog <branch_1> <branch_2> ... <branch_n>

shows what all branches have in common, who diverged, who pushed to whom, everything. And recall that we’re talking of branches that reside in different repositories. Too good! This is what the command

bzr qlog b c1 c2

would show:

Isn’t that pretty?
Just for the sake of taking some screenshots and messing around with gimp, I’ll show you that the branches you compare can be totally, and I say totally, unrelated. Here is what you get if you compare the trunk of bzr itself with the latest mysql-server codebase via

bzr qlog lp:bzr lp:mysql-server

I had to shrink it a little bit, as you can imagine; of course the two don’t share a thing, but that’s just for the fun of the abuse:

All your rebase are belong to us

When I hit “commit” on my version control system, I behave very much like when I click “save” on my text editor. I just do it whenever I feel like it. I can’t see in advance how the thing is gonna look at the end, in all details. Result: at the end of the day, I have a bazillion small commits without any clear, organic structure. I just cannot hand this blob to the release engineer, or code reviewer or whatever — the guy is gonna yell at me all over the building. So, guess what? You have to rewrite your history, my son. And here is how it works (with Mercurial).

Beautify the commit history, even between regular merges

Alice clones Production; commits some stuff; then merges from Production just to keep up to date; commits again a few times, re-merges, etc. Then she would like to beautify her commit history a little, and would like to “squash” all the commits she made between one merge and the next into a single commit. Here is a picture that explains the scenario:


I apologize for the ugly picture (you might want to read again the title of this blog). At the expense of killing the narrative, I am gonna say upfront how the story is gonna end (Spoiler Alert!): initially Alice would be happy just by folding the commits between merge M1 and merge M2. But then she discovers that she’s actually free to move her commits all the way up and down her development line, and that the two merges with upstream that she did are… not a “boundary” anymore. So, in the final picture, all merges are still there (even if M1 is technically just a clone), and commits A, B, C, D, E can be in arbitrary positions with respect to M1 and M2 (they just need to respect their relative order); look at the picture at the end of the post for some abstract art representing the final result.
The hg terminology for squashing commits is folding. We are going to see how to handle this situation using the rebase and histedit mercurial extensions. Things we’re gonna need, and reference documents:

Ok, last step is optional but appreciated.

First off, Mercurial TIAS (Try It And See)

TIAS is an effective technique to grasp new concepts, and the first step on the way to mastery. For mercurial, it goes like this: let’s create a repo serving as a (local) reference one

$ hg init b

this will create an empty repository in folder b/. Other clones will communicate via this one as if it were a remote repo, like in a real world (distributed) development workflow. You will then clone repo b/ one or more times, simulating actions from several developers:

$ hg clone b c1
$ hg clone b c2
...

Please note that all local repos are basically equal; here we are just talking about conventions. There is no such thing as a client/server scenario in the distributed version control world, but rather a peer-to-peer setup.
You now have a basic working environment where you can play, experiment, try it, and see.
For instance, why not enter c1, add a small text file and commit it:

$ cd c1/ ; echo "hello world" > src.txt ;
c1/$ hg add src.txt ; hg commit -m "initial import"

now we could push this commit to c1‘s parent repository, which is b, with

c1/$ hg push

then enter c2/ and retrieve those modifications via

c1/$ cd ../c2
c2/$ hg pull
c2/$ hg update

Did it work? Let’s see:

c2/$ cd ..
$ tree

.
├── b
├── c1
│   └── src.txt
└── c2
    └── src.txt

Hooray! Note that b‘s working tree is empty, because we never run hg update there; nonetheless, all the stuff we care about is safely stored into b/.hg/.

Simulation of the workflow

I’ll now reproduce the situation described above. Let’s create a dumb production repo:

$ hg init b
$ cd b
/b$ echo -e 'Hello\nworld!!' > src1.txt
/b$ cat src1.txt 
Hello
world!!
/b$ hg add src1.txt; hg commit -m "initial import"
/b$ cd ..

Note that I committed in b/, which actually is not what you do in real life; the reference repository only receives pushes and pulls from the other repos. But this little idiosyncrasy is inessential here, it doesn’t alter the point I want to make. Let’s then clone b

$ hg clone b c1
updating to branch default
1 files updated, 0 files merged, 0 files removed, 0 files unresolved
$ tree
.
├── b
│   └── src1.txt
└── c1
    └── src1.txt

2 directories, 2 files

back on prod to do a commit

$ cd b/
/b$ echo -e 'Hello\nworld' > src1.txt
/b$ hg commit -m "removed bangs"
/b$ hg log --graph
@  changeset:   1:38bceaef9ec1
|  tag:         tip
|  user:        gghh
|  date:        Thu Apr 11 10:03:52 2013 +0200
|  summary:     removed bangs
|
o  changeset:   0:1e0b6ab683de
   user:        gghh
   date:        Thu Apr 11 10:03:06 2013 +0200
   summary:     initial import

/b$ cd ..

now some commit on repo c1

$ cd c1/
/c1$ echo 'foo bar' > src2.txt
/c1$ hg add src2.txt; hg commit -m "added a source file"

another commit

/c1$ echo -e 'foo\nbar' > src2.txt
/c1$ cat src2.txt 
foo
bar
/c1$ hg commit -m "inserted newline"

a third commit

/c1$ echo -e 'foo\nbar\nbaz\nbazinga' > src2.txt
/c1$ hg commit -m "more text"

checking history

/c1$ hg log --graph
@  changeset:   3:5df72ef842d2
|  tag:         tip
|  user:        gghh
|  date:        Thu Apr 11 10:07:12 2013 +0200
|  summary:     more text
|
o  changeset:   2:26a980e728e2
|  user:        gghh
|  date:        Thu Apr 11 10:07:00 2013 +0200
|  summary:     inserted newline
|
o  changeset:   1:9f7e0b186fa4
|  user:        gghh
|  date:        Thu Apr 11 10:06:43 2013 +0200
|  summary:     added a source file
|
o  changeset:   0:1e0b6ab683de
   user:        gghh
   date:        Thu Apr 11 10:03:06 2013 +0200
   summary:     initial import

now pull and merge from b (production)

/c1$ hg pull
pulling from /b
searching for changes
adding changesets
adding manifests
adding file changes
added 1 changesets with 1 changes to 1 files (+1 heads)
(run 'hg heads' to see heads, 'hg merge' to merge)
/c1$ hg merge
1 files updated, 0 files merged, 0 files removed, 0 files unresolved
(branch merge, don't forget to commit)

/c1$ hg commit -m "merged with production"

checking history

/c1$ hg log --graph
@    changeset:   5:0877a5c95a08
|\   tag:         tip
| |  parent:      3:5df72ef842d2
| |  parent:      4:38bceaef9ec1
| |  user:        gghh
| |  date:        Thu Apr 11 10:09:04 2013 +0200
| |  summary:     merged with production
| |
| o  changeset:   4:38bceaef9ec1
| |  parent:      0:1e0b6ab683de
| |  user:        gghh
| |  date:        Thu Apr 11 10:03:52 2013 +0200
| |  summary:     removed bangs
| |
o |  changeset:   3:5df72ef842d2
| |  user:        gghh
| |  date:        Thu Apr 11 10:07:12 2013 +0200
| |  summary:     more text
| |
o |  changeset:   2:26a980e728e2
| |  user:        gghh
| |  date:        Thu Apr 11 10:07:00 2013 +0200
| |  summary:     inserted newline
| |
o |  changeset:   1:9f7e0b186fa4
|/   user:        gghh
|    date:        Thu Apr 11 10:06:43 2013 +0200
|    summary:     added a source file
|
o  changeset:   0:1e0b6ab683de
   user:        gghh
   date:        Thu Apr 11 10:03:06 2013 +0200
   summary:     initial import

checking phases

/c1$ hg phase -r 0::5
0: public
1: draft
2: draft
3: draft
4: public
5: draft

Wait, hold on: phases? whazat?

Mercurial phases are a powerful feature that prevents you from making mistakes when rewriting the history of a repo. A changeset (commit) marked as public is a changeset that has been shared (i.e., thru a push/pull), so it is not wise to modify it. Mercurial keeps track of that, and an attempt to rewrite a public changeset is prevented, unless you change the phase level of the commit with

hg phase -v --draft [rev] --force

note the option --force, since we’re about to take the risk of moving a commit that may already have been shared back to the draft phase. Just to give a little context on how advanced this feature is, I’ll mention the fact that git is going to implement it in the next Google Summer of Code.

Rewrite history!

Back to the original problem, note that commits 1, 2, 3 and 5 are draft, since they never left the repo c1, so it’s safe to move them around. We do that using the rebase mercurial extension, available since version 2.3.

State of affairs

Before we start messing with our repository, let’s take note of all the changesets:

/c1$ hg log -p
changeset:   5:0877a5c95a08
tag:         tip
parent:      3:5df72ef842d2
parent:      4:38bceaef9ec1
user:        gghh
date:        Thu Apr 11 10:09:04 2013 +0200
summary:     merged with production

diff -r 5df72ef842d2 -r 0877a5c95a08 src1.txt
--- a/src1.txt	Thu Apr 11 10:07:12 2013 +0200
+++ b/src1.txt	Thu Apr 11 10:09:04 2013 +0200
@@ -1,2 +1,2 @@
 Hello
-world!!
+world

changeset:   4:38bceaef9ec1
parent:      0:1e0b6ab683de
user:        gghh
date:        Thu Apr 11 10:03:52 2013 +0200
summary:     removed bangs

diff -r 1e0b6ab683de -r 38bceaef9ec1 src1.txt
--- a/src1.txt	Thu Apr 11 10:03:06 2013 +0200
+++ b/src1.txt	Thu Apr 11 10:03:52 2013 +0200
@@ -1,2 +1,2 @@
 Hello
-world!!
+world

changeset:   3:5df72ef842d2
user:        gghh
date:        Thu Apr 11 10:07:12 2013 +0200
summary:     more text

diff -r 26a980e728e2 -r 5df72ef842d2 src2.txt
--- a/src2.txt	Thu Apr 11 10:07:00 2013 +0200
+++ b/src2.txt	Thu Apr 11 10:07:12 2013 +0200
@@ -1,2 +1,4 @@
 foo
 bar
+baz
+bazinga

changeset:   2:26a980e728e2
user:        gghh
date:        Thu Apr 11 10:07:00 2013 +0200
summary:     inserted newline

diff -r 9f7e0b186fa4 -r 26a980e728e2 src2.txt
--- a/src2.txt	Thu Apr 11 10:06:43 2013 +0200
+++ b/src2.txt	Thu Apr 11 10:07:00 2013 +0200
@@ -1,1 +1,2 @@
-foo bar
+foo
+bar

changeset:   1:9f7e0b186fa4
user:        gghh
date:        Thu Apr 11 10:06:43 2013 +0200
summary:     added a source file

diff -r 1e0b6ab683de -r 9f7e0b186fa4 src2.txt
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/src2.txt	Thu Apr 11 10:06:43 2013 +0200
@@ -0,0 +1,1 @@
+foo bar

changeset:   0:1e0b6ab683de
user:        gghh
date:        Thu Apr 11 10:03:06 2013 +0200
summary:     initial import

diff -r 000000000000 -r 1e0b6ab683de src1.txt
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/src1.txt	Thu Apr 11 10:03:06 2013 +0200
@@ -0,0 +1,2 @@
+Hello
+world!!

Rebase

Let’s do it! just as in this example from the docs,

/c1$ hg rebase --dest 4 --source 1
saved backup bundle to c1/.hg/strip-backup/9f7e0b186fa4-backup.hg

Note that, for the sake of readability, I am referring to changesets via their revision numbers; the other option is to use their hashes. This works perfectly well, just be aware that revision numbers aren’t unique across repositories, while hashes are.
What does the history look like now?

/c1$ hg log --graph
@  changeset:   4:c61940d6fbeb
|  tag:         tip
|  user:        gghh
|  date:        Thu Apr 11 10:07:12 2013 +0200
|  summary:     more text
|
o  changeset:   3:45eebb742cee
|  user:        gghh
|  date:        Thu Apr 11 10:07:00 2013 +0200
|  summary:     inserted newline
|
o  changeset:   2:9ad745b5c9db
|  user:        gghh
|  date:        Thu Apr 11 10:06:43 2013 +0200
|  summary:     added a source file
|
o  changeset:   1:38bceaef9ec1
|  user:        gghh
|  date:        Thu Apr 11 10:03:52 2013 +0200
|  summary:     removed bangs
|
o  changeset:   0:1e0b6ab683de
   user:        gghh
   date:        Thu Apr 11 10:03:06 2013 +0200
   summary:     initial import

and phases are

/c1$ hg phase 0::4
0: public
1: public
2: draft
3: draft
4: draft

Now collapse (fold)

To do the folding from revision 2 onwards, we finally resort to the “histedit” Mercurial extension:

/c1$ hg histedit 2

change the commands from

pick 9ad745b5c9db 2 added a source file
pick 45eebb742cee 3 inserted newline
pick c61940d6fbeb 4 more text

# Edit history between 9ad745b5c9db and c61940d6fbeb
#
# Commands:
#  p, pick = use commit
#  e, edit = use commit, but stop for amending
#  f, fold = use commit, but fold into previous commit (combines N and N-1)
#  d, drop = remove commit from history
#  m, mess = edit message without changing commit content
#

to

pick 9ad745b5c9db 2 added a source file
fold 45eebb742cee 3 inserted newline
fold c61940d6fbeb 4 more text

add an updated commit message and it’s done. A last glance at our pretty history:

/c1$ hg log --graph
@  changeset:   2:5f8e8722ad9a
|  tag:         tip
|  user:        gghh
|  date:        Thu Apr 11 10:07:12 2013 +0200
|  summary:     added a source file and other stuff
|
o  changeset:   1:38bceaef9ec1
|  user:        gghh
|  date:        Thu Apr 11 10:03:52 2013 +0200
|  summary:     removed bangs
|
o  changeset:   0:1e0b6ab683de
   user:        gghh
   date:        Thu Apr 11 10:03:06 2013 +0200
   summary:     initial import

now the last revision is the sum of the three.

/c1$ hg log -p -r 2
changeset:   2:5f8e8722ad9a
tag:         tip
user:        gghh
date:        Thu Apr 11 10:07:12 2013 +0200
summary:     added a source file and other stuff

diff -r 38bceaef9ec1 -r 5f8e8722ad9a src2.txt
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/src2.txt	Thu Apr 11 10:07:12 2013 +0200
@@ -0,0 +1,4 @@
+foo
+bar
+baz
+bazinga

So that’s all I’ve got. Just a few more pointers for people of good will:

hg pull --rebase

and

hg rebase --dest [rev] --source [rev] --collapse

can perform common tasks in one step, as described here and here.

A note about getting the latest Mercurial

If you happen to work on an Ubuntu box like me, you can bypass the bureaucracy of the official Ubuntu repositories and get the latest and greatest Mercurial with a simple

sudo add-apt-repository ppa:mercurial-ppa/releases
sudo apt-get update  
sudo apt-get install mercurial

Recap

To recap, we started with the goal of helping Alice reorganize her commits between one merge with upstream and the next, and we discovered that much more is possible: if the situation isn’t pathological, merges aren’t a boundary at all, and “draft” commits from Alice can really be moved around.

Mercurial has a really powerful mechanism to edit the history of a repo, and more awesomeness is still to come. Moreover, “phases” help keep the whole thing safe.

chord diagram layout in d3.js: implementation

In a previous post I wrote how much I like the d3 chord diagram per se, and how little I like configuring it via the matrix API function. I argued that a different interface can be more useful; in this post I describe the implementation details of this alternative layout function.

To recap, for me a chord diagram is a collection of “polygons”. A “polygon” is a collection of “vertices”; a “vertex” is given by a pair of values, i.e. the ID of the arc the vertex belongs to and a “size”. The layout implementation will connect those vertices with chords in the natural way, i.e. with the shortest closed path of chords. There can be many such chord paths, and we don’t really care which one is chosen: it’s inessential, as long as the path is closed and minimal. Here is a sample javascript object that contains all the information necessary to build such a diagram:

connections = [ [ {group: 1, value: 1},
                  {group: 3, value: 1},
                  {group: 5, value: 1},
                  {group: 7, value: 1} ],

                [ {group: 0, value: 2} ],

                [ {group: 2, value: 2} ],

                [ {group: 4, value: 2} ],

                [ {group: 6, value: 2} ] ]

Here is the resulting diagram; as far as we’re concerned, both renderings faithfully represent the input, and we leave it to the layout engine to “fill in the blanks” and choose as it pleases.

The goal of the game is to massage our input data until we get a representation of it compatible with the interfaces of d3.svg.chord, i.e. groups are described by an array of objects like

{endAngle:   ...,
 index:      ...,
 startAngle: ...,
 value:      ...}

where startAngle and endAngle are expressed in radians and identify the arc, index is an integer ID corresponding to the group properties found in connections, and value is the size of the group expressed in some scale (namely the sum of all the connections’ values where that group appears). Chords shall be described by an array of objects like

{source: {endAngle:   ...,
          index:      ...,
          startAngle: ...,
          subindex:   ...,
          value:      ...,
          groups:     {ID_1: '',
                       ID_2: '', 
                       ...}      } ,
target:  {endAngle:   ...,
          index:      ...,
          startAngle: ...,
          subindex:   ...,
          value:      ...,
          groups:     {ID_1: '', 
                       ID_2: '', 
                       ...}      } }

where the properties source and target identify the start and end point of a chord. The sub-properties startAngle, endAngle and value describe the width of the chord base in radians and in an absolute scale, and index identifies the group where the base should be placed. subindex tells the engine where to put the chord inside a group; if two chords’ endpoints have the same index and subindex, their bases will be superposed. The groups subproperty isn’t present in the current d3 implementation; it is a collection of group IDs, namely all the groups touched by the polygon the chord belongs to. I use that datum for the fade-out effect on whole polygons.

My implementation is a sequence of loops, where I pass over my initial data again and again to change its representation. To describe what I do, I will use the javascript REPL provided by nodejs (you thought that node was just a webserver… you were wrong!). So fire up your node REPL:

$ node
>

From now on, if you have the node REPL on your machine, just cut and paste my code snippets to follow live what’s going on. Let’s first input some connection data

var connections = [ [ {group: 0, value: 42} ],
                    [ {group: 0, value: 5},
                      {group: 1, value: 7},
                      {group: 2, value: 7},
                      {group: 3, value: 5} ],
                    [ {group: 0, value: 11},
                      {group: 2, value: 11} ] ];

First thing, we compute the size of each group

var ep, i, j, k, ngroups = 0, groupSums = {};

k = 0, i = -1; while (++i < connections.length) {
    j = -1; while (++j < connections[i].length) {
        ep = connections[i][j].group;
        if (ep in groupSums) {
            groupSums[ep] += connections[i][j].value;
        } else {
            groupSums[ep] = connections[i][j].value;
            ++ngroups;
        }
        k += connections[i][j].value;
    }
}

we’re running over all the connections elements, and each time we find a group ID we update the corresponding entry in the groupSums dictionary. Let’s inspect it:

> ngroups
4
> groupSums
{ '0': 58,
  '1': 7,
  '2': 18,
  '3': 5 }

As expected, ngroups is 4 and, for instance, the total size of group 0 is 42+5+11=58. Those values will then be rendered as portions of a circle. Like in a pie chart, but only on the perimeter. I now need the list of group IDs; since importing external modules in the REPL is a little tricky, I won’t resort to d3.range but provide it by hand, just for the sake of this example

var range = function(n) {
    var out = [], i = -1; 
    while(++i<n) {out.push(i);} 
    return out;
}

var groupIndex = range(ngroups);
> groupIndex
[ 0, 1, 2, 3 ]

as a side note, in javascript there is no builtin and portable way to get the number of properties of an object; that’s why I have to keep the ngroups counter running on the side. Now convert k from the total sum of the values to our reference angular unit, expressed in radians

var padding = 0.05;
k = (2 * 3.1415 - padding * ngroups) / k;
> k
0.069125
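
To see where that number comes from (a quick check you can paste in the REPL; the arithmetic below is mine, not part of the layout code): the connection values sum to 88, the four paddings take 0.2 radians out of the full circle, and the rest is divided by 88 to get the angular width of one unit of value.

> 42 + 5 + 7 + 7 + 5 + 11 + 11
88
> (2 * 3.1415 - 0.05 * 4) / 88
0.069125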

now build the polygons data structure. With the following snippet we loop over the input connections; we saw that a connection element corresponds to a polygon, i.e. a closed path of chords. Since all we want is to join vertices in the natural way, there’s no need to specify the edges in the input; but at some point we do have to compute those edges, and that moment is now. For every vertex v_i in a connection element, we add the edge (v_i-1, v_i) to the corresponding polygon.

var polygons = [];
var poly = {edges: [],
            vertices: {}};

i = -1; while (++i < connections.length) {
    if (connections[i].length > 1) {
        j = 0; while (++j < connections[i].length) {
            poly.edges.push({
                                source: connections[i][j-1],
                                target: connections[i][j]
                            });
            poly.vertices[connections[i][j-1].group + ''] = '';
        }
        poly.vertices[connections[i][j-1].group + ''] = '';
        if (poly.edges.length > 1) {
            poly.edges.push({
                                source: connections[i][0],
                                target: connections[i][j-1]
                            });
        }
        polygons.push(poly);
        poly = {edges: [],
                vertices: {}};
    }
};

Note that we set apart two special cases: single-vertex connections (they correspond to no polygon at all; we just need to remember to leave some free space in the relevant group) and two-vertex connections (they are composed of a single chord, so there’s no need to close the path at the end). Let us use the REPL with joy, and inspect the polygons object.

> polygons.length
2

As expected; one polygon is a “square”, and the other is a single segment.

> polygons[0]
{ edges: 
   [ { source: [Object], target: [Object] },
     { source: [Object], target: [Object] },
     { source: [Object], target: [Object] },
     { source: [Object], target: [Object] } ],
  vertices: 
   { '0': '',
     '1': '',
     '2': '',
     '3': '' } }

the REPL’s pretty printer doesn’t show us more than three levels of nesting, so we need to access those by ourselves. By the way, note the “vertices” property of a polygon; this will be useful when we set the fade effect on whole polygons.
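
(As a side tip, assuming a reasonably recent node: if you want the REPL to print the whole structure in one go, util.inspect with an unlimited depth does the trick.)

> console.log(require('util').inspect(polygons[0], false, null))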

> polygons[0].edges
[ { source: { group: 0, value: 5 },
    target: { group: 1, value: 7 } },
  { source: { group: 1, value: 7 },
    target: { group: 2, value: 7 } },
  { source: { group: 2, value: 7 },
    target: { group: 3, value: 5 } },
  { source: { group: 0, value: 5 },
    target: { group: 3, value: 5 } } ]
> polygons[0].vertices
{ '0': '',
  '1': '',
  '2': '',
  '3': '' }

polygons[i].edges[j].source and polygons[i].edges[j].target are references to connections sub-elements.

> polygons[1].edges
[ { source: { group: 0, value: 11 },
    target: { group: 2, value: 11 } } ]
> polygons[1].vertices
{ '0': '', '2': '' }

Everything as expected. We can move on to the next loop, where we fill up a new data structure called subgroups; a “subgroup” is something that lives inside a group (one of the arcs of our pie chart), and takes care of multiple ribbons sharing the same base. If two ribbons depart from the same group and, moreover, have their base in common, they belong to the same subgroup. Otherwise the SVG engine would have no way to know that they have to be placed together, and would draw them one next to the other.

Of course a chord is referenced in two distinct subgroups, one for each group it touches (begin and end).

var h, samebase, subgroups = [];

i = -1; while (++i < ngroups) {
    subgroups[i] = [];
    j = -1; while (++j < polygons.length) {
        samebase = {'ribbons': [],
                    'basevalue': 0};
        h = -1; while (++h < polygons[j].edges.length) {
            if (polygons[j].edges[h].source.group === i) {
                samebase.ribbons.push(polygons[j].edges[h]);
                samebase.basevalue = polygons[j].edges[h].source.value;
            } else if (polygons[j].edges[h].target.group === i) {
                samebase.ribbons.push(polygons[j].edges[h]);
                samebase.basevalue = polygons[j].edges[h].target.value;
            }
        }
        subgroups[i].push(samebase);
    }
}

here we set up a collection of subgroups for each group. For each group, we scan all polygons and check whether each has a vertex in that group. In our polygon-based model, each subgroup will hold at most two ribbons, since at each corner of a polygon only two sides meet. Special cases are the segment and the single point, which have one and no ribbons respectively. Let us inspect our subgroups object so far:

> subgroups
[ [ { ribbons: [Object], basevalue: 5 },
    { ribbons: [Object], basevalue: 11 } ],
  [ { ribbons: [Object], basevalue: 7 },
    { ribbons: [], basevalue: 0 } ],
  [ { ribbons: [Object], basevalue: 7 },
    { ribbons: [Object], basevalue: 11 } ],
  [ { ribbons: [Object], basevalue: 5 },
    { ribbons: [], basevalue: 0 } ] ]

we actually added the basevalue property; it tells us the size of the chords’ base. We can take a closer look at the subgroups of the first group:

> subgroups[0]
[ { ribbons: [ [Object], [Object] ],
    basevalue: 5 },
  { ribbons: [ [Object] ], basevalue: 11 } ]

there are two: one with two ribbons that share a base of size 5, and one with a single ribbon whose base is 11 (that one clearly belongs to the segment; check connections again, you’ll find it in the third position of the array). We can inspect the ribbons involved via

> subgroups[0][0].ribbons
[ { source: { group: 0, value: 5 },
    target: { group: 1, value: 7 } },
  { source: { group: 0, value: 5 },
    target: { group: 3, value: 5 } } ]
> subgroups[0][1].ribbons
[ { source: { group: 0, value: 11 },
    target: { group: 2, value: 11 } } ]

Note that, again, we’re dealing with references to connections’ components. Change one of them, and you’ll be writing directly into connections. We’ll use this in a little while.
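
You can verify the sharing directly in the REPL (the comparison below is mine, just to make the point concrete): the source endpoint of the segment’s only ribbon is the very same object as the first vertex of the third connection.

> subgroups[0][1].ribbons[0].source === connections[2][0]
true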

Ehy! I almost forgot: the single-vertex connections. We need to add them to the subgroups too, since they have to be taken into account when computing the position of the graphical elements. Here we go:

i = -1; while (++i < connections.length) {
    if (connections[i].length === 1) {
        subgroups[connections[i][0].group].push({
                   'ribbons': [],
                   'basevalue': connections[i][0].value
                 });
    }
}

you can check that group 0 now has an additional no-ribbon subgroup with a basevalue of 42 (check connections again and you’ll see that’s correct):

> subgroups
[ [ { ribbons: [Object], basevalue: 5 },
    { ribbons: [Object], basevalue: 11 },
    { ribbons: [], basevalue: 42 } ],
  [ { ribbons: [Object], basevalue: 7 },
    { ribbons: [], basevalue: 0 } ],
  [ { ribbons: [Object], basevalue: 7 },
    { ribbons: [Object], basevalue: 11 } ],
  [ { ribbons: [Object], basevalue: 5 },
    { ribbons: [], basevalue: 0 } ] ]

we now use the range function, defined before, to generate the array of arrays subgroupIndex, in case we need to sort the subgroups; it’s inessential to the scope of this article, but I introduce it to keep the naming consistent with the implementation on my github branch.

var subgroupIndex = [];

i = -1; while (++i < ngroups) {
    subgroupIndex.push(range(subgroups[i].length));
}

now everything is in place to find out the geometry of our diagram:

var pt, pt1, pt2, groups = [];

var x = 0, i = -1; while (++i < ngroups) {
    var di = groupIndex[i];
    var x0 = x, j = -1; while (++j < subgroupIndex[di].length) {
        var dj = subgroupIndex[di][j],
        v = subgroups[di][dj].basevalue,
        a0 = x,
        a1 = x += v * k;
        h = -1; while(++h < subgroups[di][dj].ribbons.length) {
            pt1 = subgroups[di][dj].ribbons[h].source;
            pt2 = subgroups[di][dj].ribbons[h].target;
            if (pt1.group === di) { pt = pt1; }
            else { pt = pt2; }
            pt['geometry'] = {
                index: di,
                subindex: dj,
                startAngle: a0,
                endAngle: a1,
                value: v
            };
        }
    }
    groups[di] = {
        index: di,
        startAngle: x0,
        endAngle: x,
        value: (x - x0) / k
    };
    x += padding;
}

we are scanning all our subgroups in the order defined by subgroupIndex and groupIndex; we then pick all the ribbons in a subgroup, select the right end of each, and augment that datum with the geometry property. Note that since we’re handling references, these changes propagate to all objects holding that reference, as if they were a sort of global variable. Here we modify the values via the subgroups array; later we’ll access those chord endpoints via the polygons array. Pretty cool. Let’s just check what ended up in the groups array

> groups
[ { index: 0,
    startAngle: 0,
    endAngle: 4.009250000000001,
    value: 58.00000000000001 },
  { index: 1,
    startAngle: 4.0592500000000005,
    endAngle: 4.543125000000001,
    value: 7.0000000000000036 },
  { index: 2,
    startAngle: 4.593125000000001,
    endAngle: 5.837375000000001,
    value: 18 },
  { index: 3,
    startAngle: 5.8873750000000005,
    endAngle: 6.2330000000000005,
    value: 5.000000000000001 } ]

and in the geometry property of our chords’ endpoints; here is the outer layout of the object

> subgroups[0][0].ribbons
[ { source: 
     { group: 0,
       value: 5,
       geometry: [Object] },
    target: 
     { group: 1,
       value: 7,
       geometry: [Object] } },
  { source: 
     { group: 0,
       value: 5,
       geometry: [Object] },
    target: 
     { group: 3,
       value: 5,
       geometry: [Object] } } ]

zoom in a little:

> subgroups[0][0].ribbons[0].source
{ group: 0,
  value: 5,
  geometry: 
   { index: 0,
     subindex: 0,
     startAngle: 0,
     endAngle: 0.345625,
     value: 5 } }

so far so good. We’re almost there; we just have to explicitly build the chords objects.

var chords = [];

i = -1; while (++i < polygons.length) {
    j = -1; while (++j < polygons[i].edges.length) {
        var source = polygons[i].edges[j].source.geometry,
        target = polygons[i].edges[j].target.geometry;
        if (source.value || target.value) {
            chords.push(source.value < target.value
                        ? {source: target, target: source, groups: polygons[i].vertices}
                        : {source: source, target: target, groups: polygons[i].vertices});
        }
    }
}

Done! Recall that the property groups in a chords element means: “the indices of all groups touched by the polygon this chord belongs to”, and it will be used for the fading effect. It comes directly from the polygons[i].vertices objects; since object access is O(1) as opposed to O(n) for arrays (well, that’s kind of a mystery to me, arrays are objects in fact) and I need fast lookup, I encoded the vertices as keys of a dictionary whose values are dummy empty strings. Let’s proudly inspect the result of our efforts:

> chords.length
5

as expected: a 4-sided polygon plus a segment. What does a chord look like?

> chords[0]
{ source: 
   { index: 1,
     subindex: 0,
     startAngle: 4.0592500000000005,
     endAngle: 4.543125000000001,
     value: 7 },
  target: 
   { index: 0,
     subindex: 0,
     startAngle: 0,
     endAngle: 0.345625,
     value: 5 },
  groups: 
   { '0': '',
     '1': '',
     '2': '',
     '3': '' } }
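
The dictionary encoding makes the membership test a one-liner (again, my own check, just to illustrate the fast lookup mentioned above):

> 2 in chords[0].groups
true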

Here is the resulting diagram, where I also added labels to the arcs for clarity:
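
For completeness, here is a minimal rendering sketch; it is not part of the layout itself, it assumes d3 (version 3) is loaded in a browser page, and the radii, sizes, colors and class names are arbitrary choices of mine. It simply feeds the groups and chords arrays built above to d3.svg.arc and d3.svg.chord.

var innerRadius = 180, outerRadius = 200;

var svg = d3.select("body").append("svg")
    .attr("width", 500)
    .attr("height", 500)
  .append("g")
    .attr("transform", "translate(250,250)");

// one arc per group, driven by its startAngle / endAngle
svg.selectAll("path.group")
    .data(groups)
  .enter().append("path")
    .attr("class", "group")
    .style("fill", "#cccccc")
    .attr("d", d3.svg.arc().innerRadius(innerRadius).outerRadius(outerRadius));

// one ribbon per chord, driven by the source / target geometries
svg.selectAll("path.chord")
    .data(chords)
  .enter().append("path")
    .attr("class", "chord")
    .style("fill", "#6699cc")
    .style("opacity", 0.7)
    .attr("d", d3.svg.chord().radius(innerRadius));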

A new API to configure chord diagrams in d3.js

I really like the chord diagram from the d3 javascript library, but the function used to build those diagrams could be better. Here I discuss a possible alternative.

Have you ever heard of circos? It’s a visualization tool written in perl; it draws pretty pictures arranging your data on a circle. It’s a powerful idea by Martin Krzywinski; unfortunately the perl program is quite difficult to use (an awful lot of configuration files) and the code isn’t that friendly to jump in and hack away. I always had the impression that only Martin can use it at full power; all the rest of us can do is copy and paste some of his examples, hoping he wrote something similar to what we’re trying to do. Things might get better over time, since Circos is starting to be packaged for Ubuntu and some real programmers are putting their hands on it. Still, for the time being, circos is not a viable option if you need a quick and meaningful way to explore your data.

“It’s a pity!”, you say? I thought so too, but cheer up! There is an alternative: Mike Bostock’s d3 javascript library.

Mike saw Martin’s work, liked it, and ported the table viewer example to d3, giving us what he calls the “chord diagram”. Mike is a good programmer, and got things right: he separated the low-level drawing API, in d3.svg.chord, from the higher-level layout setting, in d3.layout.chord. While the drawing API lets you create every possible chord diagram you can think of, it’s pretty heavy to use; the layout API is much more straightforward, but it only implements something called “the table viewer layout”. My whole point is that this is far from the best possible use of this very nice visualization idea. Thanks to this separation of roles in the code design, when I decided that the table viewer layout API doesn’t have the level of expressivity I need, it was relatively straightforward to write a new layout API; I just had to stick to the interfaces designed by Mike.

The General Circos Idea

Here it goes: you have a circumference, which you split into several arcs. You then draw ribbons connecting arcs. The meaning you attach to this diagram is up to you; there are situations in which the semantics of those drawings is so natural that you immediately understand what is going on. You just get it. It’s a simple yet powerful visualization technique.

The Table Viewer Thing

Even though Circos was designed with genomic data in mind, Martin suggests representing tabular data as one application. It works like this: you have a square table full of numbers; each arc in the circle corresponds to a row and a column at the same time. A ribbon whose ends lie in the i-th and j-th arcs takes the values for its bases from the (i,j) and (j,i) entries of the table respectively. Here I try to explain this with a picture:

Frankly, I believe that such a representation is as messy as the initial table of data. Per se, the chord diagram is a very powerful visualization; this just isn’t its killer application. But I am very happy that it caught Mike’s attention, since whatever layout system you write to simplify the creation of chord diagrams, the d3 API that creates the basic SVG elements, such as arcs and ribbons, is already written and will not change.

The Chord Diagram Matrix Layout

It is really useful to have a mechanism to configure a chord diagram with a simple javascript object, but the matrix input in d3.layout.chord just doesn’t cut it. Here are three situations I can’t represent via the chord.matrix setter:

  1. I cannot specify that two chords (or ribbons) share the same base.
  2. I cannot set a portion of an arc to have no chord departing from it. The matrix layout forces the “same arc” chords to always exist; those chords correspond to values on the diagonal of the input matrix. I feel that a chord with the same departure and arrival arc is redundant; make every pixel count, and just leave it off.
  3. I cannot draw “parallel chords”, i.e. two distinct chords that both begin in a group A and end in a group B (see the sketch right after this list).
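
To make the last point concrete, here is a hypothetical input (mine, purely for illustration) for d3.layout.chord: entry matrix[i][j] is the amount flowing from group i to group j, so for any pair of groups the layout can draw at most one ribbon, whose two bases are sized by the two symmetric entries; there is simply no slot to ask for a second, parallel ribbon between the same two groups.

matrix = [ [0, 4],
           [4, 0] ];

d3.layout.chord()
    .padding(0.05)
    .matrix(matrix);   // one ribbon between group 0 and group 1, and that's it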



Proposition: a Polygon Layout

My personal experience in drawing chord diagrams suggests a different approach, based on what I call “polygons”. If a chord is like a segment connecting two arcs, a polygon is a closed path of chords; a given polygon touches an arc only once. If a polygon represents a relation, or a property, then all arcs touched by the polygon share that property; the width of the chords involved in the polygon is a parameter that can carry additional information (i.e., “how much” of a property a given arc has). In the following pretty picture, I call the ensemble of chords showing up “a pentagon”:

A polygon is described by an array of vertices [v0, v1, ..., vn]; it’s up to the d3 engine to connect those vertices in the natural way, i.e. with the edges [(v0, v1), (v1, v2), ..., (vn-1, vn), (vn, v0)]. A vertex is represented by a javascript object with two properties, “group” and “value”: the former is the ID of the arc where the vertex is, the latter is its width. Therefore the full configuration of a chord diagram is given by an array of polygons. As a side note, a polygon with only two vertices is a single chord (i.e., just a segment), and a “polygon” with only one vertex is a dummy placeholder in an arc, meaning “leave an empty space in this arc, i.e. without any chord”. To follow the d3.layout.chord convention, arcs are called “groups” in what follows; moreover, to pile up some more confusion, I’ll give the name “connections” to the above-mentioned array of polygons.
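
A tiny hypothetical helper (not part of the proposed API, just to make the convention concrete) that turns such a vertex array into the list of edges; the orientation of the closing edge is irrelevant, and two-vertex polygons are not closed, since that would just duplicate their single chord:

function edgesOf(vertices) {
    var edges = [], i;
    for (i = 1; i < vertices.length; i++) {
        edges.push({source: vertices[i - 1], target: vertices[i]});
    }
    // close the path only for real polygons (three or more vertices)
    if (vertices.length > 2) {
        edges.push({source: vertices[0], target: vertices[vertices.length - 1]});
    }
    return edges;
}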

Up to now I have shown several examples of chord diagrams, but never specified the javascript object that actually drives the layout. Now I do exactly that; each diagram in this post has a number that I use here to refer to it. The javascript object below is for diagram #1, i.e. the pentagon and triangle with two more empty arcs; one can clearly recognize the pentagon (the five-element array), the triangle (the three-element array) and the remaining two “empty arcs”.

connections = [ [ {group: 0, value: 1},
                  {group: 2, value: 1},
                  {group: 4, value: 1},
                  {group: 6, value: 1},
                  {group: 8, value: 1} ],

                [ {group: 1, value: 1},
                  {group: 5, value: 1},
                  {group: 9, value: 1} ],

                [ {group: 3, value: 1} ],

                [ {group: 7, value: 1} ] ]

Next is the descriptor for diagram #2, a triangle with three bigger empty arcs interleaved, used as an example of ribbons that share an arc. Sharing an arc is the very basic primitive for drawing polygons, and this API makes it very natural to use. The triangle is the three-vertex polygon; the three other arcs with no ribbons are specified as single-element arrays in the polygon list.

connections = [ [ {group: 0, value: 1},
                  {group: 2, value: 1},
                  {group: 4, value: 1} ],

                [ {group: 1, value: 2} ],

                [ {group: 3, value: 2} ],

                [ {group: 5, value: 2} ] ]

Diagram #3, used as an example of arcs that are partially empty; note that the empty space on arc 0 is specified via a single-vertex polygon in the polygon list.

connections = [ [ {group: 0, value: 1},
                  {group: 1, value: 1} ],

                [ {group: 0, value: 1},
                  {group: 2, value: 1} ],

                [ {group: 0, value: 1},
                  {group: 3, value: 1} ],

                [ {group: 0, value: 1},
                  {group: 4, value: 1} ],

                [ {group: 0, value: 4} ] ]

Diagram #4, with “parallel ribbons”; note that each ribbon is a two-vertex polygon on its own. I also added blank spaces on each arc.

connections = [ [ {group: 0, value: 1},
                  {group: 1, value: 1} ],

                [ {group: 0, value: 1},
                  {group: 1, value: 2} ],

                [ {group: 0, value: 1},
                  {group: 1, value: 3} ],

                [ {group: 0, value: 1},
                  {group: 1, value: 4} ],

                [ {group: 0, value: 1} ],

                [ {group: 1, value: 1} ] ]

And here is diagram #5, a pentagon with blank arcs interleaved:

connections = [ [ {group: 0, value: 1},
                  {group: 2, value: 1},
                  {group: 4, value: 1},
                  {group: 6, value: 1},
                  {group: 8, value: 1} ],

                [ {group: 1, value: 1} ],

                [ {group: 3, value: 1} ],

                [ {group: 5, value: 1} ],

                [ {group: 7, value: 1} ],

                [ {group: 9, value: 1} ] ]

The final detail is the fading effect when you point your mouse at an arc. This is specific to Mike’s implementation of this kind of diagram, since the original Circos only produces a still image. Well, in fact it is a general property that you can apply to every d3-generated graphic with little effort. Anyway, in the sample diagram Mike chose to fade out all chords not departing from the arc under your mouse. With all this polygon business, another strategy makes more sense: fade out all polygons not having a vertex in the current arc. In this figure you see diagram #1 with the pentagon fading out, since the mouse pointer is on one of the arcs of the yellow triangle.
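
A minimal sketch of such a handler (mine, not the actual implementation), assuming each SVG path bound to a chord carries the groups dictionary produced by the layout, that arcs and ribbons carry the class names “group” and “chord”, and that svg is the enclosing selection; the opacity values are arbitrary.

function fade(opacity) {
    return function(g, i) {
        svg.selectAll("path.chord")
            // keep the chords whose polygon touches group i, fade all the others
            .filter(function(d) { return !(i in d.groups); })
          .transition()
            .style("opacity", opacity);
    };
}

// attach the handler to the group arcs
svg.selectAll("path.group")
    .on("mouseover", fade(0.1))
    .on("mouseout", fade(1));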

In a forthcoming post I’ll describe the implementation of this alternative layout.