# It always passes the tests

Last February Graydon Hoare (of Monotone and Rust fame) wrote about what he calls “The Not Rocket Science Rule Of Software Engineering”: automatically maintain a repository of code that always passes all the tests.

Graydon outlines two requirements:

• Tests on pull request X must be performed before that code gets into trunk, not after.
• Tests on pull request X must be performed on the resulting merge of X with the current trunk, not on X alone. Your pull request can have unexpected interactions with the rest of the codebase, and you want to test that.

It’s a very simple principle, i.e. not rocket science at all. My team adopted an integration workflow that respects The Not Rocket Science Rule, and here is how we do it.
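The whole gatekeeper fits in a few lines. Here is a minimal sketch in Python, with toy stand-ins for the repository and the merge step (`Trunk`, `merge` and `integrate` are hypothetical names, not our actual buildbot code):

```python
# Toy model of a "not rocket science" gatekeeper.
class Trunk:
    def __init__(self):
        self.head = "base"

def merge(a, b):
    # stand-in for a real three-way merge; pretend that branches
    # whose name contains "conflict" fail to merge cleanly
    return None if "conflict" in b else a + "+" + b

def integrate(trunk, branch_head, run_tests):
    candidate = merge(trunk.head, branch_head)  # rule 2: test the merge, not the branch alone
    if candidate is None:
        return False                            # conflicts: the author must pull/merge and resubmit
    if not run_tests(candidate):
        return False                            # rule 1: tests gate the push, they don't follow it
    trunk.head = candidate                      # only now does the code land; trunk always passes
    return True

trunk = Trunk()
print(integrate(trunk, "feature-x", run_tests=lambda rev: True))  # True
print(trunk.head)  # base+feature-x
```

Note that a failing test run or a conflict leaves `trunk.head` untouched, which is the whole point of the rule.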

## It all starts from Main

Our central repository has only one branch. We call it Main. All development starts from there.

## Clone Main and work

Alice starts the development of her feature (or bugfix) by cloning Main and committing on top of it.

## Make a pull request

When done, Alice makes a pull request and asks for a code review.

## Meanwhile, Main goes on

Meanwhile, Main has evolved: Bob and Chuck have already completed the integration process, and their code is now in the central repository.

## Buildbot takes it

After the code review passes, our buildbot instance merges Alice’s work with the current Main in a throw-away branch. If there are conflicts, the pull request is rejected and Alice has to pull/merge to fix them, then re-submit.
When I explain this, people yell at me “ZOMG unsupervised merges, ya craaazy!?!”. My answer is threefold: first, it’s C++, and if the merge screws things up it’s unlikely the result will even compile; second, the three-way merge algorithm is pretty solid, and with a simple topology like ours (i.e. no criss-crosses) I’ve never seen it fail; third, the code still needs to pass the tests (see next section).

## Testing

Buildbot then runs the unit tests and the integration tests. Buildbot processes only one pull request at a time, so no other code will be pushed before Alice’s tests are done.

## Push back to Main

If the tests succeed, buildbot pushes Alice’s work to Main, the central repository.

# The “next branch” workflow and criss-cross merges

I’ll describe the so-called “next branch” workflow, used for the development of the Git project and of the Linux kernel, and illustrate how it systematically leads to criss-cross merges. A criss-cross merge happens when the greatest common ancestor of the nodes to be merged is not unique. This is more of a curiosity than a problem report: I made the observation and wanted to share it. Feel free to prove me wrong if I am missing something.
If you are in a hurry, you can skip the text and jump from image to image reading the captions; that should give you a fair idea of the content of this post.

## The “next branch” workflow

This workflow is described in great detail in the gitworkflows(7) manpage. The main idea behind it is to keep a so-called “throw-away integration branch” named “next” (hence the name), where you merge all the contributions from your developers. The purpose of next is to see how all those features interact together “in the wild”: some people build the software directly from next, use it, and report back to the developers. With next acting as a “buffer” for features, you can decide which ones are ready for prime time and which ones need more time, testing, or improvements. It’s not “all or nothing”: you can select what you like.
When you’re satisfied with a feature that is sitting in next, you merge that feature alone into your “principal branch” (“master”, as Git calls it), the one that sooner or later will be tagged, released and distributed to your users; this process is called “graduation”.
Note that some features might graduate from next to master very quickly, while others might sit in next for a long time before seeing master and hitting a release.

Let’s see the workflow in action stage by stage.

### It all starts from master

As described in the gitworkflows(7) manpage, this workflow uses two more branches: maint and pu (which stands for “proposed updates”). The most stable of the four is maint (the maintenance branch, meant to host fixes to the previous release), then come master and next, and the most experimental of them is pu. I will consider neither maint nor pu, but limit this description to master and next for the sake of simplicity.
So, for what matters to this post, “master” is the principal and most stable branch.

### next is an experimental fork of master

next, our “throw-away” integration branch, is a fork of master and hosts features that you aren’t yet sure work well. Since it’s clear to everybody that next is only for testers and developers, regular users who build from source know that the software from next is not to be used in production.

### Contributions are based on master

Most of the developers send their patches to the mailing list. Once those diffs are peer reviewed, the Git maintainer (Junio Hamano) applies them in his repository as “topic branches” that fork from master. Junio’s repository, with all those topic branches, is available at http://github.com/gister/git.

### Topic branches are merged into next

Alice and Bob, from the picture, are good developers: they wrote patches that pass all tests and had favorable reviews on the mailing list. But how do their changes interact together? The combined result might not work at all.
This is the crux of the whole workflow: Junio uses a throw-away integration branch, next, to try things out.
next is built and used by people willing to tolerate alpha software and to help the developers by reporting their experiences with it. I hear that the Debian community is involved in using Git built from next and reporting what they see.

### Topics may (or may not) graduate to master

In the picture, the topic branch from Bob is merged into master. This process is called “graduation”, and happens when Junio sees that Bob’s feature is sufficiently mature for prime time. master will eventually be tagged as an official Git release, so graduation is kind of a big deal.
Note that in my picture Bob was the first to be merged into next, and his is also the first feature to graduate. Things don’t have to happen in that order. Namely, a feature can stay in next indefinitely; there is no timeline for graduation. It might not happen at all. The whole point of the “next workflow” is to have next act as a buffer. When you like a feature, you pick it from next and merge it into master. Note that this doesn’t happen via cherry-picking, but via merging branches; merging offers some benefits over cherry-picking:

• changes live in a single place in the graph, as opposed to having the same changeset in many places under different hashes;
• merges are easier to handle, since a single merge can add many commits at once to a development line, as opposed to cherry-picking commits one by one.

Uh, look, while we were talking about merging and cherry picking, Alice’s topic graduated to master. Congratulations, Alice!

### What if we merge next into master?

This would transform the “next branch workflow” into a completely different thing. I’ve seen this workflow erroneously called the “topic branches workflow”, as if its central trait were the fact that developers do their work in separate branches. While important, topic branches are not the defining trait of this way of working; better: they are not what surprised me the most when learning about this workflow. The real point is that you try things out in next, and when you’re happy you put them in master.
If you were to merge the whole next into master, the workflow would become “all or nothing”: no more freedom to accept mature features into the release while giving less stable developments some more time to stabilize.

### Bugfixes go to the oldest stable branch…

There are changes that are both trivially correct and urgently needed. Those bugfixes get applied directly on a stable branch, skipping all the graduation dance.

### … and propagate to experimental branches via merges

The small bugfixes I mentioned above aren’t needed only in the stable branch; they’re needed everywhere. The reason they get applied to master and then propagated via a merge of master into next is that merging in the opposite direction is forbidden.
Think of it: if you applied the bugfix to next first, you’d then want to merge next into master. That’s not allowed: such a merge would carry a lot of things that are not for master (all the topics that haven’t graduated yet). This is to say: it has to be this way.

### Here comes the criss-cross merge

The very reason I wrote this post is to share the following remark: when you merge master into next you get a criss-cross merge. A criss-cross merge is one where the greatest common ancestor of your two heads is not unique. Let’s see how the head of Alice’s topic branch is a greatest common ancestor for master and next:

As you can imagine, Bob’s branch head is also a greatest common ancestor:

Criss-cross merges may or may not be a big deal. Sometimes they cause headaches. If you use this workflow and find yourself in this situation, I hope I provided some useful insight into the topology of the history graph that can help you understand what’s going on.
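The observation can be checked mechanically. Here is a minimal sketch (the node names are made up for illustration; `gcas` is a naive maximal-common-ancestor search, not Git’s implementation) modelling two topics merged into next and then graduated, followed by the merge of master back into next:

```python
def ancestors(node, parents):
    # all ancestors of node, node included
    out, stack = set(), [node]
    while stack:
        n = stack.pop()
        if n not in out:
            out.add(n)
            stack.extend(parents[n])
    return out

def gcas(x, y, parents):
    # greatest common ancestors: common ancestors that are not
    # strict ancestors of another common ancestor
    common = ancestors(x, parents) & ancestors(y, parents)
    return {n for n in common
            if not any(n in ancestors(m, parents) - {m} for m in common)}

# toy history: topics A and B fork from "base",
# both are merged into next, then both graduate to master
parents = {"base": [], "A": ["base"], "B": ["base"],
           "next1": ["A", "B"],            # next after merging both topics
           "master1": ["base", "A"],       # A graduates
           "master2": ["master1", "B"],    # B graduates
           "next2": ["next1", "master2"]}  # master merged back into next

print(gcas("master2", "next1", parents))   # {'A', 'B'}: two GCAs, a criss-cross
```

Merging `master2` into `next1` finds both topic heads as greatest common ancestors, which is exactly the criss-cross situation described above.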

### master and next continue their parallel lives…

The steps above are repeated as necessary: master and next continue to evolve.

### … until the end of the cycle

I said that next is a “throw-away” branch. If next and master really were to live side by side indefinitely, you’d end up with two branches whose content differs a lot. At some point, you want to kill next and start afresh. At the end of a “cycle”, which for Git typically lasts 8 to 10 weeks, next is indeed erased and rebuilt from scratch by merging all the topics that haven’t graduated yet. This gives the opportunity to restart clean, reorder topics, etc.

This step is of capital importance to stay on top of all this growing complexity, and it helps to reason about what is where. An invariant like “next was forked off master no more than 10 weeks ago” might come in handy.
But it is also a lot of work: all the conflicts you had merging topics into next will show up again, and again, and again, until the conflicting topics finally get into master. Junio uses a Git feature called rerere to remember how he resolved conflicts in the past. Whatever trick you use, the luxury of merging features selectively from a sandbox branch to a production branch comes at the price of a lot of housekeeping.
It has been rightly observed that there’s no good reason to do this if you’re not automating next. This is what Linux does: linux-next is an automatically built branch combining more than 50 sources, which detects conflicts (and other problems) early.

### Conclusion

• The next branch workflow uses two parallel branches: one is production and the other is a sandbox where you try things out. This architecture gives you complete freedom on what features to promote from the sandbox to the real world.
• It reflects a development model where you operate under no tight deadlines and only code that meets the highest quality standards is allowed to be released.
• The complexity of this workflow is nontrivial, and can only be handled if the next branch is built automatically.
• If the product of your shop is a web service, i.e. you have marketing pushing for features, you operate your own platform and can patch the servers on the go if needed, then a branching model where you merge the whole “development branch” into production all at once, with scheduled release dates, is probably better suited to your needs. The next branch model lets you put features on hold indefinitely, but the sales people are on your back anyway.

Thanks to Antoine Pelisse, who introduced me to this workflow, and with whom I had many insightful discussions on it.

# Exploding intersections with the Inclusion-Exclusion Principle

How many elements are in the union of two sets? The sum of those in each set, you say? Of course not: there might be some elements in common that you counted twice. What if the sets involved are three? Or four?

The answer is known as “the Inclusion-Exclusion Principle”, and in this post I’ll describe how to use it to “explode intersections”, i.e. give the cardinality of each region of a Venn diagram. In a picture, here is the case of three sets:

The Inclusion-Exclusion Principle tells us that the cardinality of the union of three sets is given by the formula

|A ∪ B ∪ C| = |A| + |B| + |C|
- |A∩B| - |A∩C| - |B∩C|
+ |A∩B∩C|


which shouldn’t come as too much of a surprise: we first sum the sets’ cardinalities, then see that we counted the “two-by-two” intersections once too many and subtract them, but then, hey, we added the “all-in” intersection three times and subtracted it three times, so we’ve got to put it back in.

The generalized case follows the same pattern: add all “order one” intersections (the sets alone), remove all “order two” intersections (the two-by-two), add back all “order three” ones (the three-by-three), remove all “order four”, and so on until no more cases exist.

When thinking of that formula applied to our problem of “exploding” the intersections, I find it useful to visualize the situation with the following lattice, which stresses the fact that higher-order intersections are contained in some lower-order ones. Nodes in the lattice are intersections (all nicely lined up by order), and edges represent the partial-order relationship “contains”.

The diagram reminds us that, for instance, to get how many elements are only in A (and neither in B nor in C), one has to follow all paths from A down to the highest-order intersection, flipping the sign as the order of the intersections increases. That is: sum |A∩B| and |A∩C|, then subtract |A∩B∩C|. The result corresponds to the cardinality of (A∩B) ∪ (A∩C), and has to be subtracted from the cardinality of A.
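A quick numeric check of that walk, with small sets I made up for the purpose:

```python
A = {1, 2, 3, 4}
B = {3, 4, 5}
C = {4, 6}

# elements only in A, via the lattice walk described above
only_a = len(A) - len(A & B) - len(A & C) + len(A & B & C)
print(only_a)          # 2
print(len(A - B - C))  # 2, the direct set difference agrees
```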

Slightly harder: 4 sets instead of 3. Following the diagram above, we can compute the number of items in A but in none of the other sets by following all paths that go from A to the bottom intersection A∩B∩C∩D, using the principle to know what to add and what to subtract:

in A only = |A| - |A∩B| - |A∩C| - |A∩D|
+ |A∩B∩C| + |A∩B∩D| + |A∩C∩D|
- |A∩B∩C∩D|


In this other example we want to see how many things are in A and C but in none of the other sets; same story: follow the paths and flip the sign at each step down.

in A∩C only = |A∩C| - |A∩B∩C| - |A∩C∩D|
+ |A∩B∩C∩D|


Now we implement this algorithm in code. First some helper functions; list flattening:

```
def flatten(listlist):
    return [elem for list_ in listlist for elem in list_]

>>> flatten([['a', 'b'], ['c', 'd', 'e']])
['a', 'b', 'c', 'd', 'e']
```


permutations of a list:

```
def perm(sequence):
    if len(sequence) == 0:
        return [[]]
    else:
        out = []
        for cnt, elem in enumerate(sequence):
            out += map(lambda x: [elem] + x,
                       perm(sequence[:cnt] +
                            sequence[(cnt+1):]))
        return out

>>> perm(['a', 'b', 'c'])
[['a', 'b', 'c'],
 ['a', 'c', 'b'],
 ['b', 'a', 'c'],
 ['b', 'c', 'a'],
 ['c', 'a', 'b'],
 ['c', 'b', 'a']]
```


All subsets of a given cardinality of a list:

```
def choose_n(n, srcList):
    if n == 0:
        return [[]]
    else:
        out = []
        for cnt, elem in enumerate(srcList):
            out += map(lambda x: [elem] + x,
                       choose_n(n-1, srcList[(cnt+1):]))
        return out

>>> choose_n(2, ['a', 'b', 'c'])
[['a', 'b'], ['a', 'c'], ['b', 'c']]
```


All possible subsets of a list, excluding the empty one:

```
def allsubsets(srcList):
    tot = len(srcList) + 1
    complem = lambda universe, ensemble: \
        list(set(universe) - set(ensemble))
    complSrcList = lambda ensemble: complem(srcList,
                                            ensemble)
    for i in range(1, tot):
        if i > tot / 2:  # integer division
            yield map(complSrcList,
                      choose_n(tot - (i + 1), srcList))
        else:
            yield choose_n(i, srcList)
```


Note the optimization: for subsets larger than half the total, I work on the complementary list, so that choose_n is called with a smaller first argument (its recursion depth, and hence its cost, grows with that argument).

```
>>> list(allsubsets(['a', 'b', 'c']))
[[['a'], ['b'], ['c']],
 [['a', 'b'], ['a', 'c'], ['b', 'c']],
 [['a', 'c', 'b']]]
```


Now a function to precompute all intersection values. We store them in a dictionary where values are intersection cardinalities, and keys are the names of the sets involved in the intersection. Many items in this lookup table represent the same intersection, with a different order of the sets in the key. I do this to speed up the lookup: instead of normalizing the queries, I store the answer to all possible queries.

```
def intersLookup(lists):
    toInters = flatten(allsubsets(lists.keys()))
    def inters_n(names):
        return reduce(lambda s, t: set(s) & set(t),
                      [lists[name] for name in names[1:]],
                      lists[names[0]])
    lookup = {}
    for sequence in toInters:
        cardinality = len(inters_n(sequence))
        for variant in perm(sequence):
            lookup['&'.join(variant)] = cardinality
    return lookup
```

```
>>> intersLookup({'a': ['apple', 'banana'],
...               'b': ['orange', 'apple', 'watermelon'],
...               'c': ['peach', 'plum', 'pear', 'apple', 'orange']})
{'a': 2, 'c': 5, 'b': 3,
 'a&b&c': 1, 'b&c&a': 1, 'c&b&a': 1,
 'a&c&b': 1, 'b&a&c': 1, 'c&a&b': 1,
 'b&a': 1, 'a&b': 1,
 'b&c': 2, 'c&b': 2,
 'c&a': 1, 'a&c': 1}
```


The following generator, named subints as in “sub-intersections”, is particularly important, since it implements the concept of “walking down all paths” in the diagrams above, “one level at a time”. It returns the nodes of the lattice conveniently grouped for the sign flipping that happens in the Inclusion-Exclusion formula.

```
def subints(elem, compl):
    for thing in allsubsets(compl):
        yield map(lambda x: elem + x, thing)
```


See the example above, the first I gave for the 4-set case. We want to see how much of A is “eaten” by the sub-intersections. What to sum, what to subtract? Here is the answer:

```
>>> si = subints(['a'], ['b', 'c', 'd'])
>>> si.next()  # add
[['a', 'b'], ['a', 'c'], ['a', 'd']]
>>> si.next()  # subtract
[['a', 'b', 'c'], ['a', 'b', 'd'], ['a', 'c', 'd']]
>>> si.next()  # add
[['a', 'c', 'b', 'd']]
>>> si.next()  # it's over
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
```


The trick is to observe that, in a sub-lattice of our diagrams, lower levels are just the intersections of the nodes in the upper levels with all the sets left out. Now it’s clear why allsubsets works the way it does: it returns subsets grouped by size, so that subints can be written more concisely.
Let’s see the second example above, again from the 4-set case:

```
>>> si = subints(['a', 'c'], ['b', 'd'])
>>> si.next()
[['a', 'c', 'b'], ['a', 'c', 'd']]
>>> si.next()
[['a', 'c', 'b', 'd']]
>>> si.next()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
```


We now have everything needed to write the inclusion-exclusion formula:

```
def flip():
    curr = 1
    while True:
        yield curr
        curr *= -1

def o(f, g):
    # function composition
    def helper(x):
        return f(g(x))
    return helper

def inclusionexclusion(elem, compl, allInters):
    currentInters = allInters['&'.join(elem)]
    sign = flip()
    subintValue = \
        sum(map(lambda (sign, value): sign * value,
                zip(sign,
                    map(o(sum, (lambda nodes:
                                map(lambda node: allInters['&'.join(node)],
                                    nodes))),
                        subints(elem, compl)))))
    return currentInters - subintValue
```


Let’s see:

```
>>> allInters = intersLookup({'a': ['apple', 'banana'],
...                           'b': ['orange', 'apple', 'watermelon'],
...                           'c': ['peach', 'plum', 'pear', 'apple', 'orange']})
```


how many items in a and b but not in c?

```
>>> inclusionexclusion(['a', 'b'], ['c'], allInters)
0
```


zero, since the only common element is apple, but it’s also in c. How many elements in the intersection of all three?

```
>>> inclusionexclusion(['a', 'b', 'c'], [], allInters)
1
```


the apple, indeed. How many in a, but not in b nor c?

```
>>> inclusionexclusion(['a'], ['b', 'c'], allInters)
1
```


that’s the banana.

The last thing left to do is to package all of this into a shiny function that computes our complete explosion:

```
# complem was local to allsubsets above; it must also be
# available at this scope for explode to work
def complem(universe, ensemble):
    return list(set(universe) - set(ensemble))

def explode(allNames, allInters):
    out = {}
    for elem in flatten(allsubsets(allNames)):
        out['&'.join(elem)] = \
            inclusionexclusion(elem,
                               complem(allNames, elem),
                               allInters)
    return out
```


There we go:

```
>>> sets = {'a': ['apple', 'banana'],
...         'b': ['orange', 'apple', 'watermelon'],
...         'c': ['peach', 'plum', 'pear', 'apple', 'orange']}
>>>
>>> allInters = intersLookup(sets)
>>> explode(sets.keys(), allInters)
{'a': 1, 'a&c&b': 1, 'c': 3, 'b': 1, 'c&b': 1, 'a&b': 0, 'a&c': 0}
```


The code is available at https://github.com/gghh/intersections. Images are drawn with Gimp and Zwibbler. The 4-set Venn diagram comes from Wikipedia.

# The annoying merge conflicts in case of criss cross

Here I show an example of a problematic merge that I sometimes have with Mercurial. What happens is that, for a given file foo.txt, two commits, one the parent of the other, compete at merge time. Obviously the child edit should win the merge, but resolving this requires user intervention when it shouldn’t be necessary.

Here is a minimal example of a history:

Our goal is to merge nodes 5 and 6. The first thing to note is that the GCA isn’t unique: both 1 and 4 are good candidates. Since for my example I need 4 to be selected (so that things can go bad), I inserted node 3; this way 4 is the common ancestor furthest from the root, and it will be selected.

Why did I color 0, 1 and 2 in pink? Because those are the only nodes where the foo.txt file is edited, if we exclude merges. Before I show a “Try It And See” example with Mercurial, I’ll describe what happens at each step in plain English.

0. foo.txt is created.
1. foo.txt is edited.
2. foo.txt is edited again.
3. Somebody forks at 0. foo.txt is not modified.
4. Development continues on top of 3. foo.txt is not modified. At this stage, foo.txt has the very same content as it had in 0.
5. 4 is merged with 1. With respect to foo.txt, the three-way merge algorithm automatically selects the changes in 1 as winners: if a line is changed in “this” and unchanged in “other”, the change is selected.
6. 2 is merged with 4. Same story, ordinary business for three-way merge: the most recent change is selected.

And then we want to merge 5 and 6. Mercurial selects 4 as the base. Let’s observe things from foo.txt’s point of view:

• (this) in 6 we have the latest version (from 2).
• (other) in 5 we have the next-to-last version (from 1). This should not “compete” with the latest… should it?
• Yes, it does: in 4 (the base of the merge) we have the very first version, so the three-way merge has no idea of whether “this” or “other” is more recent. They both count as changes when compared with the early version in 0 (inherited by 4).
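The competition is easy to reproduce with a tiny sketch of a single-line three-way merge (`merge3` is a hypothetical helper of mine, not Mercurial’s algorithm):

```python
def merge3(base, this, other):
    # minimal three-way merge of a single line: a side that
    # differs from the base counts as a change
    if this == other:
        return this
    if this == base:
        return other
    if other == base:
        return this
    return None  # both sides changed the line differently: conflict

# base = node 4 (original line), this = node 6 (latest edit),
# other = node 5 (an ancestor of that edit)
print(merge3("I am a cow.",
             "I am a cow with superpowers.",
             "I have superpowers."))  # None: a conflict, although "this" is clearly newer
```

With the true base (node 1’s version, `"I have superpowers."`) the same call would cleanly pick the newest line; with node 4 as base, both sides look changed and the merge gives up.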

Now, the real example. First, commits 0, 1 and 2:

```
$ hg init
$ echo 'I am a cow.' > foo.txt ; hg add foo.txt ; hg commit -m foo
$ echo 'I have superpowers.' > foo.txt ; hg commit -m foo
$ echo 'I am a cow with superpowers.' > foo.txt ; hg commit -m foo
```

Then we create the second branch, with commits 3 and 4 forking from 0.

```
$ hg up 0
$ touch bar.txt ; hg add bar.txt ; hg commit -m bar
$ touch baz.txt ; hg add baz.txt ; hg commit -m baz
```

We are now at 4, and we merge with 1. The base will be 0, and all goes fine: the most recent version of foo.txt is selected.

```
$ hg merge 1 ; hg commit -m merged
$ cat foo.txt
I have superpowers.
```

We then update to 2 and merge with 4. Again, no surprises from the merge. The most recent version (i.e. 2) is selected.

```
$ hg up 2
$ hg merge 4 ; hg commit -m merged
$ cat foo.txt
I am a cow with superpowers.
```

A look at the history so far:

```
$ hg log --graph --template '{rev} {desc}'
@    6 merged
|\
| | o  5 merged
| |/|
| o |  4 baz
| | |
| o |  3 bar
| | |
o---+  2 foo
 / /
| o  1 foo
|/
o  0 foo
```

Being now in 6, we merge with 5:

```
$ hg merge
```

As expected, the stupid conflict.

It is worth noting that the Consensus Merge algorithm would not solve this situation. It would take into account the two candidates for the GCA (1 and 4) and weigh their “opinions” equally. But one is right and the other is wrong: in this small example, democracy fails to apply.

My naive observation is that when the history of a file is all on one side of the merge, that file should not be touched at merge time: just take the (single) head of the sub-DAG that tracks all commits affecting said file. But maybe I am missing undesired side effects of this approach. Also, thinking “per file” is usually the wrong way to reason about source control. Considering the repository as a whole is something that Mercurial and Git got right with respect to their predecessors. A “file” is a very artificial way to split code into chunks, as far as version control is concerned. The atomic entity is the single line of code.

In this case, with multiple GCAs, one can see that one of those GCAs is good for the file we’re considering (foo.txt should be merged with 1 as the base), and the other is not. I wonder if the following principle holds: “if the GCA is unique, then it is optimal for all files (better: lines of code)”, where by “optimal” I mean “it doesn’t compare a revision against its ancestors in stupid ways”. On the other hand, if there are many GCAs, each line of code has its favourite one, and it’s impossible to choose.

# How does Mercurial select the Greatest Common Ancestor

The GCA of two nodes X and Y in a DAG (Directed Acyclic Graph) is an ancestor of both X and Y such that no descendant of it is also a common ancestor of X and Y. This definition is equivalent to the revset expression heads(ancestors(X) and ancestors(Y)), i.e. the head (if unique) of the intersection among the ancestors of X and those of Y.

Mercurial adopts a slightly different definition: the GCA is the common ancestor of the two nodes which is furthest from a root, as measured by the longest path. This makes sense, since the GCA is used as base for merges and it’s meaningful to take a (topologically) more recent version in case of ambiguity.

I am going to discuss the algorithm that was used to perform GCA computations in Mercurial until version 2.5.4; in 2.5.5 this code was replaced by a faster algorithm.

The interesting part is what happens when there are multiple candidates for the GCA, i.e. when the GCA is not unique. What happens is that Mercurial picks the one it finds first. Here we walk through the algorithm step by step and see how it works.

First off, where to find things: the ancestor() function is in the mercurial.ancestor module. Here is its docstring:

```
$ python
>>> from mercurial import ancestor
>>> help(ancestor.ancestor)

ancestor(a, b, pfunc)
    Returns the common ancestor of a and b that is furthest from a
    root (as measured by longest path) or None if no ancestor is
    found. If there are multiple common ancestors at the same
    distance, the first one found is returned.

    pfunc must return a list of parent vertices for a given vertex
```

We’ll work with the following graph, where GCA(4, 5) isn’t unique: candidates are 1 and 2, equally good (parents are on the left and children on the right). We can try it right away, like this:

```
>>> parents = {0: [], 2: [0], 1: [0], 3: [2],
...            5: [1, 2], 4: [3, 1]}
>>>
>>> pfunc = lambda x: parents[x]
>>> ancestor.ancestor(4, 5, pfunc)
1
```

1 is chosen… apparently it’s the one found first. Let’s dig deeper into the algorithm. The ancestor() function does its job in two steps:

1. Prepare a depth dictionary that associates each ancestor of both nodes X and Y with its distance from a root, computed as the longest path to it.
2. Use a generator for each of the two starting vertices that, at each iteration, returns all ancestors “of the same generation”, i.e. all ancestors at a given distance d from the root. The two generators are such that they return generations in descending order of distance.

If you want, you can picture this as two ancestor lists that “race” towards the root, until they meet at the same distance level (or don’t meet, in which case no common ancestor is found). I have split the code into three separate functions for this explanation.
First, the construction of the depth dictionary:

```
 1  def depthf(a, b, pfunc):
 2      a, b = sorted([a, b])
 3      # find depth from root of all ancestors
 4      # depth is stored as a negative for heapq
 5      parentcache = {}
 6      visit = [a, b]
 7      depth = {}
 8
 9      while visit:
10          vertex = visit[-1]
11          pl = pfunc(vertex)
12          parentcache[vertex] = pl
13          if not pl:
14              depth[vertex] = 0
15              visit.pop()
16          else:
17              for p in pl:
18                  if p not in depth:
19                      visit.append(p)
20              if visit[-1] == vertex:
21                  # -(maximum distance of parents + 1)
22                  depth[vertex] = min([depth[p] for p in pl]) - 1
23                  visit.pop()
24      return depth, parentcache
```

The parentcache dictionary is there to save some time at a later step; we’ll pass it along to the forthcoming functions. What’s to note here is that we can compute the depth of a node only if we have the depth of all its parents, taking the max and adding 1 (line 22). We visit nodes in a depth-first fashion. We’ll encounter a lot of unseen vertices and append them to the visit array (line 19), until at some point we hit a root and set its depth to 0 (line 14). At this point we start backtracking, pop()-ing nodes from visit as soon as we compute their depth (line 23). Note that the condition visit[-1] == vertex (line 20) is equivalent to “all parents of vertex already have their depth computed”, since we are handling DAGs: the absence of cycles assures us that no vertex can be found twice in the visit list. Well, strictly speaking, the point is that no node can be a parent of itself: the only threat to the last element of visit being unchanged since line 10 is the loop at line 17. Here is the result applied to our toy example:

```
>>> depth, parentcache = depthf(4, 5, pfunc)
>>> depth
{0: 0, 1: -1, 2: -1, 3: -2, 4: -3, 5: -2}
```

Depths are stored negated since we’ll later use a heap to figure out the furthest elements.
Here is the helper function that traverses ancestors in order of decreasing distance from the root:

```
def ancestors(vertex, depth, parentcache):
    h = [(depth[vertex], vertex)]
    seen = set()
    while h:
        d, n = heapq.heappop(h)
        if n not in seen:
            seen.add(n)
            yield (d, n)
            for p in parentcache[n]:
                heapq.heappush(h, (depth[p], p))
```

A few things to note here:

1. duplicate nodes are removed;
2. since the distance of a node is always larger than the distance of any of its parents, it’s safe to pop() an element and only later push() its parents into the heap. Subsequent pop()s will always return elements in the desired order.

Here is the result of applying it to the vertex 4:

```
>>> list(ancestors(4, depth, parentcache))
[(-3, 4), (-2, 3), (-1, 1), (-1, 2), (0, 0)]
```

Another helper function: we need all ancestors of a given depth, not one by one. This one does the buffering into “generations”:

```
def generations(vertex, depth, parentcache):
    sg, s = None, set()
    for g, v in ancestors(vertex, depth, parentcache):
        if g != sg:
            if sg:
                yield sg, s
            sg, s = g, set((v,))
        else:
            s.add(v)
    yield sg, s
```

Here are the generations that lead to the vertex 4:

```
>>> list(generations(4, depth, parentcache))
[(-3, set([4])), (-2, set([3])), (-1, set([1, 2])), (0, set([0]))]
```

Finally, here is the function that puts the pieces together and gives the final result:

```
def gca(a, b, depth, parentcache):
    x = generations(a, depth, parentcache)
    y = generations(b, depth, parentcache)
    gx = x.next()
    gy = y.next()
    # increment each ancestor list until it is closer to root
    # than the other, or they match
    try:
        while True:
            if gx[0] == gy[0]:
                for v in gx[1]:
                    if v in gy[1]:
                        return v
                gy = y.next()
                gx = x.next()
            elif gx[0] > gy[0]:
                gy = y.next()
            else:
                gx = x.next()
    except StopIteration:
        return None
```

gx and gy are tuples like (depth, nodes).
A GCA can be found only when gx[0] == gy[0]; since we start from the maximum depths and progressively decrease them, we’re guaranteed to find an element in the intersection of the ancestors of the original nodes that is furthest away from the root, or nothing at all. Here we run it, showing the values of gx and gy as we go along:

```
>>> gca(4, 5, depth, parentcache)
gx (-3, set([4]))
gy (-2, set([5]))
----
gx (-2, set([3]))
----
gx (-1, set([1, 2]))
gy (-1, set([1, 2]))
----
1
```

Note that 1 is chosen, instead of 2, for no particular reason: just because it’s the first element of list(set([1, 2])). Interestingly, if we add a node 6 between 0 and 2, this makes 2 the common ancestor furthest from the root, and the algorithm no longer sees any ambiguity in the choice:

```
>>> parents = {0: [], 2: [6], 1: [0], 6: [0], 3: [2],
...            5: [1, 2], 4: [3, 1]}
>>>
>>> pfunc = lambda x: parents[x]
>>>
>>> depth, parentcache = depthf(4, 5, pfunc)
>>> gca(4, 5, depth, parentcache)
gx (-4, set([4]))
gy (-3, set([5]))
----
gx (-3, set([3]))
----
gx (-2, set([2]))
gy (-2, set([2]))
----
2
```

In the last iteration the intersection between gx[1] and gy[1] is the element 2 alone.

# Caching in python with a descriptor and a decorator

The Mercurial SCM has several strategies to cache expensive computations; you can find an overview of all of them at this page. This post is about the propertycache decorator: a mechanism to replace a function with its result the very first time it is called. The trick is made possible by the combination of two features of the Python programming language: descriptors and decorators.

## Python descriptors

The Python descriptor protocol works as follows: if you override the methods __get__(), __set__() or __delete__() in a (new style) Python class, you’re given the chance to execute your own code when an object of that class is evaluated (__get__()) or is bound to another object (__set__()).
That is, you can decide what happens when that evaluation (or binding) occurs. One more thing to remember: an object that overrides __get__() and __set__() behaves like a descriptor (that is, these methods are actually called instead of the default evaluation/binding behaviour) only if it is a member of another class. Here are a few references on descriptors:

## A dummy example

Here we use the “descriptor protocol” to prevent user code from assigning to an attribute of an object. That attribute is kept to a constant value instead:

```python
class joke(object):
    def __get__(self, obj, objtype):
        return 'Bazinga!'
    def __set__(self, obj, value):
        pass

class foo(object):
    x = joke()
```

We had to implement __set__() too, despite the fact that it’s empty: python would otherwise have regarded joke as a non-data descriptor, which is a slightly different beast. Here is what happens when writing and reading the attribute x of an object of type foo:

>>> f = foo()
>>> f.x
'Bazinga!'
>>> f.x = 5
>>> f.x
'Bazinga!'

# Python decorators

Are cool. In python, any callable that wraps a function is called a “decorator”. Example: you have a function that does stuff, like

```python
def deco(arg):
    # stuff
```

and another function, that does some other stuff:

```python
def fun(arg):
    # other stuff
```

Sometimes it is interesting to rebind fun like

```python
fun = deco(fun)
```

and deco is said to be a “decorator” for fun. Some basic references:

Python has convenient syntactic sugar for the rebinding above: it can be written as

```python
@deco
def fun(arg):
    # other stuff
```

I often use decorators when I need to enhance the behaviour of a function by having something done before the function call, or after.
So I write the following closure and use it as a decorator:

```python
def deco(fun):
    def helper(arg):
        # do something before
        result = fun(arg)
        # do something after
        return result
    return helper
```

and then, for whatever fun this makes sense for, I rebind its name to

```python
fun = deco(fun)
```

## The propertycache decorator

Without further ado, here is the code, taken verbatim from the mercurial codebase:

```python
class propertycache(object):
    def __init__(self, func):
        self.func = func
        self.name = func.__name__
    def __get__(self, obj, type=None):
        result = self.func(obj)
        self.cachevalue(obj, result)
        return result
    def cachevalue(self, obj, value):
        setattr(obj, self.name, value)
```

and the example usage, taken straight from the mercurial wiki:

```python
from util import propertycache

class foo(object):
    @propertycache
    def expensive(self):
        print 'created'
        return [1, 2, 3]

f = foo()
f.expensive  # prints created
f.expensive  # simply returns [1, 2, 3]
# to forget the saved value and calculate it again, use delattr
delattr(f, 'expensive')
```

## How does it work

Before anything happens, python defines

```python
f.expensive = propertycache(f.expensive)
```

Note that this is a descriptor object, since the class propertycache overrides the __get__() method. From this point on, you can’t call the method f.expensive() anymore:

>>> f.expensive()
<__main__.foo object at 0xb742d28c>
created
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'list' object is not callable

What you do, instead, is refer to f.expensive as a data member (that is, no () parens). The first time f.expensive is evaluated, the original method f.expensive() is called (remember: it was saved in the func member of the object of type propertycache). Its result is then rebound to f.expensive by the cachevalue() method. Now f.expensive is not a descriptor object anymore, but only the returned value of the original method: further evaluations won’t go through the call of the original method anymore, they just return the cached value.
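The whole mechanism can be checked end to end with a Python 3 rendition of the same pattern (the original above is Python 2). The class and attribute names mirror the example; the calls counter is my addition, there only to make the caching observable:

```python
class propertycache:
    # Same idea as Mercurial's decorator, in Python 3 syntax.
    # It is a non-data descriptor (no __set__), so the instance
    # attribute set below shadows it after the first access.
    def __init__(self, func):
        self.func = func
        self.name = func.__name__

    def __get__(self, obj, objtype=None):
        if obj is None:              # accessed on the class itself
            return self
        result = self.func(obj)
        setattr(obj, self.name, result)   # replace descriptor with value
        return result

class foo:
    calls = 0

    @propertycache
    def expensive(self):
        type(self).calls += 1
        return [1, 2, 3]

f = foo()
f.expensive   # computes and caches
f.expensive   # cache hit: the method is not called again
```

Python 3.8 later shipped essentially this idea in the standard library as functools.cached_property.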
As a final remark, note that this doesn’t generalize to methods that take more arguments than the self reference, since the __get__() method has a well defined signature and you can’t add more items in there: only a reference to the “owner” object is allowed (the instance of foo in our example).

## How was it done before?

Descriptors were introduced in the Mercurial codebase with revision 595ba2537d4f by Dirkjan Ochtman, the same person who wrote the excellent chapter about Mercurial in the AOSA book. How was caching done before? It looked like this:

```python
class foo:
    def __getattr__(self, name):
        if name == 'expensive1':
            print 'Computing expensive1...'
            self.expensive1 = [1,2,3]
            return self.expensive1
        elif name == 'expensive2':
            print 'Computing expensive2...'
            self.expensive2 = [4,5,6]
            return self.expensive2
        else:
            raise AttributeError(name)
```

Let’s see that in action:

>>> f = foo()
>>> f.expensive1
Computing expensive1...
[1, 2, 3]
>>> f.expensive2
Computing expensive2...
[4, 5, 6]
>>> f.expensive1
[1, 2, 3]

Here the concertmaster is the __getattr__() method, which, if implemented, intercepts all accesses to members not found otherwise; in our example, those are expensive1 and expensive2. Once accessed the first time, their value is computed and bound to a data member; subsequent accesses will find them without passing through __getattr__(). It goes without saying that the decorator described above is much more terse than this boilerplate code; it factors away all the common behaviour.

# Mercurial histedit behind the scenes

What’s the magic behind the history rewriting extension in Mercurial? It’s called histedit, it’s written by Augie Fackler, and it’s a fundamental tool in my daily coding activity. I fired up pdb and watched closely what it does. Lean back and enjoy (all good ideas are Augie’s, all mistakes are mine!). First off, I am obliged to raise a giant “don’t try this at home” sign upfront.
I will use internal Mercurial APIs, which are deliberately undocumented; the reason is very simple: they might change next week without further notice. Having said that, let me ask you a question: how would you create a changeset that is equivalent to a sequence of N other changesets? That is, in mercurial lingo, how would you implement folding N changesets? You’d probably smash the first two, then add the third one to that result, then the fourth, the fifth and so on. That’s what happens in histedit. Let’s see it in action. Before we tackle the adventurous task of editing the history of our repository, we have to learn how to do two operations:

1. pick a changeset somewhere in the history and apply it to the working directory
2. collapse a chain of commits into a single one

# Pick there, apply here

The histedit extension special-cases two situations.

## In place

Consider the repo created with

$ hg init
$ touch foo ; hg add foo ; hg commit -m foo
$ touch bar ; hg add bar ; hg commit -m bar
$ hg up 0

which then looks like

What we want to do here is to extract changeset 1 and apply it to the working directory. Since the parents of cset 1 and of the working directory are the same (they’re both children of changeset 0), the situation can be handled by a simple shot of cmdutil.revert(). Let’s switch to the python shell:

```python
from mercurial import ui, cmdutil, node, localrepo

repo = localrepo.localrepository(ui.ui(), '.')
cmdutil.revert(ui.ui(), repo, repo[1],
               (repo[0].node(), node.nullid), all=True)
repo.commit(text="Applied cset 1")
```

This, as promised, leads us to

The revert(ui, repo, ctx, parents, *pats, **opts) function restores the files to the state they had at revision ctx, and sets the parents of the working directory. You might think there is something odd going on here, and I did too: this is a quite unusual application of revert, which is more often used to restore an ancestor revision. Here we’re picking stuff from another branch (let’s say, from “the future”).

## Far away nodes: merge needed

The other situation is when the parents of the working directory and of our target revision differ but… those two (different) parents have the same content. It might seem an artificial situation, but it is what happens in a generic step of the histedit algorithm. In that case we resort to a merge. Let’s first build the sample repo:

$ touch foo ; hg add foo ; hg commit -m foo
$ touch bar ; hg add bar ; hg commit -m bar
$ touch baz ; hg add baz ; hg commit -m baz
$ hg up 0

then again the revert() trick to apply changeset 1 on top of 0, but on a parallel line; the code is exactly the python snippet above, so we have

What’s the histedit recipe to apply 2 on top of 3? We merge the working directory (updated to 3) with 2, setting 1 as the common ancestor; this isn’t the topological truth (1 is not an ancestor of 3) but it is morally ok, since 1 and 3 contain the same state.

```python
repo = localrepo.localrepository(ui.ui(), '.')
merge.update(repo, repo[2].node(), True, True, False, repo[1].node())
repo.setparents(repo[3].node(), node.nullid)
repo.dirstate.write()
repo.commit(text="Applied cset 2")
```

Here is the result:

Note how I carefully clean up the parents of the changeset, removing traces of the merge that just happened. All of this is nicely packaged in the histedit.applychanges() function in the histedit extension.

# Collapse many into one

Again, let’s kickstart with a toy repository:

$ hg init
$ touch foo ; hg add foo ; hg commit -m foo
$ touch bar ; hg add bar ; hg commit -m bar
$ touch baz ; hg add baz ; hg commit -m baz
$ touch woo ; hg add woo ; hg commit -m woo

We want to take changesets from 1 to 3 and smash them into one that sums them all. Here is the python code:

```python
from mercurial import ui, cmdutil, node, localrepo, context

repo = localrepo.localrepository(ui.ui(), '.')

files = set(repo[1].files()) | set(repo[2].files()) | set(repo[3].files())
mf = repo[3].manifest()

def filectxfn(repo, ctx, path):
    if path in mf:
        fctx = repo[3][path]
        mctx = context.memfilectx(fctx.path(), fctx.data())
        return mctx
    raise IOError()

parents = (repo[0].node(), node.nullid)
new = context.memctx(repo,
                     parents=parents,
                     text='collapsed 1 to 3',
                     files=files,
                     filectxfn=filectxfn,
                     user='gghh')

repo.commitctx(new)
```


Again, there is a function for this in the histedit extension, histedit.collapse(). I was surprised looking at that code: I have a mental model of commits as, say, an ordered sequence of diffs; that’s how I read this picture anyway. If I am collapsing changesets, I expect some sort of iteration over that chain. But that’s not what collapse() does. Look at the closure filectxfn: basically it considers only filelogs from the last changeset, skipping those filelogs that didn’t change since the first one. As if the last changeset carried everything necessary to know what happened in between.

Which is exactly how mercurial works: filelogs only deal with the diff sequence internally; externally, only the full contents are exposed. So basically all you can do is ask a filelog for the contents of a file at a given version, plus the parents of that version.
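That snapshot-oriented interface is easy to model. Here is a toy stand-in of my own (illustrative only, not Mercurial's actual filelog API): however the storage works internally, the only questions you can ask are “full contents at revision r” and “parents of revision r”:

```python
class toyfilelog:
    # Toy model of a filelog: whatever the internal storage does,
    # read() always answers with the full contents of a revision,
    # never with a diff.
    def __init__(self):
        self._snapshots = []   # full contents per revision
        self._parents = []     # parent revision number, or None

    def add(self, data, parent=None):
        self._snapshots.append(data)
        self._parents.append(parent)
        return len(self._snapshots) - 1

    def read(self, rev):
        return self._snapshots[rev]

    def parents(self, rev):
        return self._parents[rev]

log = toyfilelog()
r0 = log.add(b'foo\n')
r1 = log.add(b'foo\nbar\n', parent=r0)
```

This is why collapse() can get away with looking only at the last changeset: reading the last revision of each touched filelog already yields the summed-up contents.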

# When does this show start? It’s getting late

With all this machinery in place, we’re now ready to juggle those little balls so fast that TortoiseHg will have a hard time catching up.

First off, the toy repo. For the occasion we’re giving foo, bar and baz a little break and using a piece of real literature.

$ hg init
$ echo 'Dorothy lived in the midst of the great\
> Kansas prairies, with Uncle' > chap1.txt
$ hg add chap1.txt ; hg commit -m "line 1"
$ echo "Henry, who was a farmer, and Aunt Em,\
> who was the farmer's wife. Their" >> chap1.txt
$ hg commit -m "line 2"
$ echo 'house was small, for the lumber to build it\
> had to be carried by wagon' >> chap1.txt
$ hg commit -m "line 3"
$ echo 'many miles.  There were four walls, a floor\
> and a roof, which made one' >> chap1.txt
$ hg commit -m "line 4"
$ echo 'room; and this room contained a rusty looking\
> cookstove, a cupboard for' >> chap1.txt
$ hg commit -m "line 5"
$ echo 'the dishes, a table, three or four chairs,\
> and the beds.' >> chap1.txt
$ hg commit -m "line 6"

Our plan is to fold all changesets between 1 and 5 included into a single one. That is, we would run

hg histedit 1

The histedit commands will be:

(pick, 1)
(fold, 2)
(fold, 3)
(fold, 4)
(fold, 5)

The first thing we do is update the working directory to 1 and play the cmdutil.revert() trick to apply 2 on top of 1.

```python
from mercurial import ui, hg, commands, localrepo, node, merge
from mercurial import cmdutil, scmutil, context, hg, repair
from hgext import histedit

repo = localrepo.localrepository(ui.ui(), '.')
hg.update(repo, repo[1].node())
cmdutil.revert(ui.ui(), repo, repo[2],
               (repo[1].node(), node.nullid), all=True)
repo.commit(text="2 bis")
```

Now we again update to 0, and collapse the line 1 -- 6 into changeset 7, which we apply on the top of 0.

```python
hg.update(repo, repo[0].node())
histedit.collapse(repo, repo[1], repo[6], {})
hg.update(repo, repo[7].node())
```

Now we want to apply 3 on the top of 7. That’s the merge case we saw earlier; our reference ancestor will be 2, since 0 -- 7 is the same as 0 -- 1 -- 2.

```python
merge.update(repo, repo[3].node(), True, True, False, repo[2].node())
repo.setparents(repo[7].node(), node.nullid)
repo.dirstate.write()
repo.commit(text="7 and 3")
```

Now guess what? Collapse the line 7 -- 8, call it 9.

```python
hg.update(repo, repo[0].node())
histedit.collapse(repo, repo[7], repo[8], {})
hg.update(repo, repo[9].node())
```

Come on, a couple more and we’re done: 4 on top of 9, then 5 on top of 11.

```python
merge.update(repo, repo[4].node(), True, True, False, repo[3].node())
repo.setparents(repo[9].node(), node.nullid)
repo.dirstate.write()
repo.commit(text="4 bis")

hg.update(repo, repo[0].node())
histedit.collapse(repo, repo[9], repo[10], {})
hg.update(repo, repo[11].node())

merge.update(repo, repo[5].node(), True, True, False, repo[4].node())
repo.setparents(repo[11].node(), node.nullid)
repo.dirstate.write()
repo.commit(text="5 bis")
```

Now the final collapse: changeset 13 is the result of 1 -- 2 -- 3 -- 4 -- 5 all folded together.
```python
hg.update(repo, repo[0].node())
histedit.collapse(repo, repo[11], repo[12], {})
hg.update(repo, repo[13].node())
```

Now it’s time to throw away the leftovers. repair.strip() comes to the rescue; revision numbers are recomputed after each strip, so the same expression repo[1] targets a different leftover changeset every time:

```python
repair.strip(ui.ui(), repo, repo[1].node())
repair.strip(ui.ui(), repo, repo[1].node())
repair.strip(ui.ui(), repo, repo[1].node())
repair.strip(ui.ui(), repo, repo[1].node())
```

Hooray!

# Is TCP turing complete?

Is the TCP protocol capable of… computing things? In other words: consider the topology of your network (nodes and connections between them) and the data that each node plans to send across the network as “the input” of TCP as a “computing machine”, and the data exchange (or maybe just the rate of the data exchange at steady state) that actually happens as “the output”. Can one imagine some weird, complicated “encoding scheme” so that, at the end of the day, one can make TCP compute prime numbers? In much more cryptic terms, for people used to theoretical computer science lingo: is the TCP protocol somehow Turing complete?

I asked myself this question when I learnt about the AIMD congestion control scheme, where basically the rate at which data is transmitted increases linearly in time until a packet loss is detected. At that point, the transmission rate is divided by a constant factor. Well, I feel that this scheme is complex enough that there might be some computation power lurking in there.

Personally I find it very amusing when Turing completeness sneaks into systems “by accident”, and this happens quite often in real world situations. Some examples:

Usually when you have Turing completeness by accident it’s quite an undesirable thing, as famous programmer Jamie Zawinski points out in his blog here; interesting also the comment on the DNS resource record NAPTR. Any comment is *very* welcome!
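For concreteness, here is a toy trace of the AIMD sawtooth described above: a plain simulation of additive increase / multiplicative decrease, nothing TCP-specific, with made-up parameters:

```python
def aimd_trace(loss_steps, steps, incr=1.0, factor=2.0):
    # Additive increase at every step; multiplicative decrease
    # whenever a loss is detected at that step.
    rate, trace = 1.0, []
    for t in range(steps):
        if t in loss_steps:
            rate /= factor
        else:
            rate += incr
        trace.append(rate)
    return trace
```

For instance, aimd_trace({3}, 5) climbs to 4.0, halves at the loss on step 3, then climbs again.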
I feel the formulation of my question is quite fuzzy, so even if somebody can just help me better understand *what* I am asking… well, that would make my day.

# A feature of (q)bzr I really love

Bazaar lets you compare branches. Not just branches from the same repository, i.e. all stored in the same .bzr/ folder, but arbitrary branches that may (or may not) have some common history. All those years of typing code for fun and profit, and staring at directed acyclic graphs wondering where the heck that stupid C macro that breaks the build was introduced, might have changed my sense of “awesome”. But ladies and gentlemen, this is pretty much awesome. Let me show you an example.

## Playing the TIAS game with bazaar

TIAS is Try It And See, no less no more. I did some TIAS in another post about Mercurial, and I am doing it again. Let’s get this party started. First thing, we create a repository and fork it two times. Actually, in bzr the correspondence between repositories and branches can be one to one, with bzr init, or one to many, with bzr init-repo. The latter creates what is called a “shared repository”, i.e. a repository that can host many branches. This is indeed the way git and mercurial work; the former option is a bzr-only thing. Let’s stick to the one-repo-one-branch setup; it doesn’t make a big difference for the purpose of this post, and it requires less typing.

$ bzr init b
Created a standalone tree (format: 2a)
$ bzr branch b c1
Branched 0 revisions.
$ bzr branch b c2
Branched 0 revisions.

We now switch to c1, enter a couple of commits, and push them back to the parent branch, which is b/.

$ cd c1
/c1$ echo 'hello world!' > src1.txt
/c1$ bzr add src1.txt
adding src1.txt
/c1$ bzr commit -m "initial import"
Committing to: c1/
Committed revision 1.

/c1$ echo 'hello world!\nIt is gonna be a nice day.' > src1.txt
/c1$ bzr commit -m "added optimism."
Committing to: c1/
modified src1.txt
Committed revision 2.

/c1$ bzr push :parent
All changes applied successfully.
Pushed up to revision 2.

Some time later, developer c2 enters the scene, retrieves the current content of b/ and makes a commit. He doesn’t push back to b/, because he’s not quite done yet.

$ cd ../c2/
/c2$ bzr pull
Using saved parent location: b/
+N  src1.txt
All changes applied successfully.
Now on revision 2.
/c2$ echo "That's a nice Tnettenba" > src2.txt
/c2$ bzr add src2.txt
adding src2.txt
/c2$ bzr commit -m "added joke"
Committing to: c2/
Committed revision 3.

c1 gets back to work: he continues to develop his thing.

/c2$ cd ../c1
/c1$ echo 'voodoo atchoo to you too' > src3.txt
/c1$ bzr add src3.txt
adding src3.txt
/c1$ bzr commit -m 'thank you!'
Committing to: c1/
Committed revision 3.

The release engineer, who has direct access to the server where b/ is stored, has been thinking on how to fix that shell script that handles the deployment. He finally figures it out; he’s in a hurry, and doesn’t want to branch and merge — he commits directly on b/. It happens.

$ cd ../b/
/b$ echo 'the grumpy cat says it had fun, once' > src4.txt
/b$ bzr add src4.txt
adding src4.txt
/b$ bzr commit -m 'that never happened.'
Committing to: b/
Committed revision 3.

It’s c1‘s turn to play now. He had a coffee with c2; they talked about the code and c1 realizes he needs the new feature that c2 just committed. So he merges with c2, and then goes a little further.

/c1$ bzr merge ../c2
+N  src2.txt
All changes applied successfully.
/c1$ bzr commit -m "thanks c2, that was neat"
Committing to: c1/
Committed revision 4.

/c1$ echo 'if (0 == a);' > src6.txt
/c1$ bzr add src6.txt
/c1$ bzr commit -m 'Zero the value is not. Yoda style.'
Committing to: c1/
added src6.txt
Committed revision 5.

It’s almost time to leave, and c2 has just the time to fix that little annoying bug before heading home.

/c1$ cd ../c2

/c2$ echo 'NaNNaNaNNaNaN' > src5.txt
/c2$ bzr add src5.txt
/c2$ bzr commit -m 'Batman. I need Batman.'
Committing to: c2/
added src5.txt
Committed revision 4.

## The sweet pretty pictures

The day after, c1 and c2 get back to work, chit-chat a little, and realize that they don’t know exactly how their two branches relate. Ok, somebody pushed, somebody merged… but they’d like to see it crystal clear, to anticipate all the problems that will arise when, next wednesday, all lines are going to be merged into the production code. Moreover, they’ve no idea that the release engineer jumped in and added stuff directly to the reference repo b/. What do they do? Easy peasy:

bzr qlog b c1 c2

The amazing command

bzr qlog <branch_1> <branch_2> ... <branch_n>

shows what all branches have in common, which diverged, who pushed to whom, everything. And recall that we’re talking of branches that reside in different repositories. Too good! This is what the command bzr qlog b c1 c2 would show:

Isn’t that pretty? Just for the sake of taking some screenshots and messing around with gimp, I’ll show you that the branches you compare can be totally, and I say totally, unrelated. Here is what you get if you compare the trunk of bzr itself with the latest mysql-server codebase via

bzr qlog lp:bzr lp:mysql-server

I had to shrink it a little bit, as you can imagine; of course the two don’t share a thing, but that’s just for the fun of the abuse:

# All your rebase are belong to us

When I hit “commit” on my version control system, I behave very much like when I click “save” on my text editor. I just do it whenever I feel like it. I can’t see in advance how the thing is gonna look at the end, in all details. Result: at the end of the day, I have a bazillion small commits without any clear, organic structure. I just cannot hand this blob to the release engineer, or code reviewer or whatever — the guy is gonna yell at me all over the building. So, guess what? You have to rewrite your history, my son. And here is how it works (with Mercurial).
## Beautify the commit history, even between regular merges

Alice clones Production; commits some stuff; then merges from Production just to keep up to date; commits again a few times, re-merges, etc. Then she would like to beautify her commit history a little, and would like to “squash” all commits she made between one merge and the next into a single commit. Here is a picture that explains the scenario:

I apologize for the ugly picture (you might want to read again the title of this blog). At the expense of killing the narrative, I am gonna say upfront how the story is gonna end (Spoiler Alert!): initially Alice would be happy just by folding commits between merge M1 and merge M2. But then she discovers that actually she’s free to move her commits all the way up and down her development line, and that the two merges with upstream she did are… not a “boundary” anymore. So, in the final picture, all merges are still there (even if M1 is technically just a clone), and commits A, B, C, D, E can be in arbitrary positions with respect to M1 and M2 (they just need to respect their relative order); look at the picture at the end of the post for some abstract art representing the final result. The hg terminology for squashing commits is folding. We are going to see how to handle this situation using the rebase and histedit mercurial extensions. Things we’re gonna need, and reference documents:

Ok, the last step is optional but appreciated.

## First off, Mercurial TIAS (Try It And See)

TIAS is an effective technique to grasp new concepts, and the first step on the way to mastery. For mercurial, it goes like this: let’s create a repo serving as a (local) reference one

$ hg init b

this will create an empty repository in folder b/. Other clones will communicate via this one, as if it were a remote repo in a real-world (distributed) development workflow. You will then clone repo b/ one or more times, simulating actions from several developers:

$ hg clone b c1
$ hg clone b c2
...

Please note that all local repos are basically equal; we are just talking about conventions here. There is no such thing as a client/server scenario in the distributed version control world, but rather a peer-to-peer setup.
You now have a basic working environment where you can play, experiment, try it, and see.
For instance, why not enter c1, add a small text file and commit it:

$ cd c1/ ; echo "hello world" > src.txt
c1/$ hg add src.txt ; hg commit -m "initial import"

now we could push this commit to c1‘s parent repository, which is b, with

c1/$ hg push

then enter c2/ and retrieve those modifications via

c1/$ cd ../c2
c2/$ hg pull
c2/$ hg update

Did it work? Let’s see:

c2/$ cd ..
$ tree

.
├── b
├── c1
│   └── src.txt
└── c2
└── src.txt

Hooray! Note that b‘s working tree is empty, because we never run hg update there; nonetheless, all the stuff we care about is safely stored into b/.hg/.

## Simulation of the workflow

I’ll now reproduce the situation described above. Let’s create a dumb production repo:

$ hg init b
$ cd b
/b$ echo -e 'Hello\nworld!!' > src1.txt
/b$ cat src1.txt
Hello
world!!
/b$ hg add src1.txt; hg commit -m "initial import"
/b$ cd ..

Note that I committed in b/, which is actually not what you do in real life; the reference repository only receives pushes and pulls from other repos. But this little idiosyncrasy is inessential here; it doesn’t alter the point I want to make. Let’s then clone b:

$ hg clone b c1
updating to branch default
1 files updated, 0 files merged, 0 files removed, 0 files unresolved
$ tree
.
├── b
│   └── src1.txt
└── c1
└── src1.txt

2 directories, 2 files

back on prod to do a commit

$cd b/ /b$ echo -e 'Hello\nworld' > src1.txt
/b$hg commit -m "removed bangs" /b$ hg log --graph
@  changeset:   1:38bceaef9ec1
|  tag:         tip
|  user:        gghh
|  date:        Thu Apr 11 10:03:52 2013 +0200
|  summary:     removed bangs
|
o  changeset:   0:1e0b6ab683de
user:        gghh
date:        Thu Apr 11 10:03:06 2013 +0200
summary:     initial import

/b$ cd ..

now some commits on repo c1

$ cd c1/
/c1$ echo 'foo bar' > src2.txt
/c1$ hg add src2.txt; hg commit -m "added a source file"

another commit

/c1$ echo -e 'foo\nbar' > src2.txt
/c1$ cat src2.txt
foo
bar
/c1$ hg commit -m "inserted newline"

a third commit

/c1$ echo -e 'foo\nbar\nbaz\nbazinga' > src2.txt
/c1$ hg commit -m "more text"

checking history

/c1$ hg log --graph
@  changeset:   3:5df72ef842d2
|  tag:         tip
|  user:        gghh
|  date:        Thu Apr 11 10:07:12 2013 +0200
|  summary:     more text
|
o  changeset:   2:26a980e728e2
|  user:        gghh
|  date:        Thu Apr 11 10:07:00 2013 +0200
|  summary:     inserted newline
|
o  changeset:   1:9f7e0b186fa4
|  user:        gghh
|  date:        Thu Apr 11 10:06:43 2013 +0200
|  summary:     added a source file
|
o  changeset:   0:1e0b6ab683de
user:        gghh
date:        Thu Apr 11 10:03:06 2013 +0200
summary:     initial import

now pull and merge from b (production)

/c1$ hg pull
pulling from /b
searching for changes
adding changesets
adding manifests
adding file changes
added 1 changesets with 1 changes to 1 files (+1 heads)
(run 'hg heads' to see heads, 'hg merge' to merge)
/c1$ hg merge
1 files updated, 0 files merged, 0 files removed, 0 files unresolved
(branch merge, don't forget to commit)

/c1$ hg commit -m "merged with production"

checking history

/c1$ hg log --graph
@    changeset:   5:0877a5c95a08
|\   tag:         tip
| |  parent:      3:5df72ef842d2
| |  parent:      4:38bceaef9ec1
| |  user:        gghh
| |  date:        Thu Apr 11 10:09:04 2013 +0200
| |  summary:     merged with production
| |
| o  changeset:   4:38bceaef9ec1
| |  parent:      0:1e0b6ab683de
| |  user:        gghh
| |  date:        Thu Apr 11 10:03:52 2013 +0200
| |  summary:     removed bangs
| |
o |  changeset:   3:5df72ef842d2
| |  user:        gghh
| |  date:        Thu Apr 11 10:07:12 2013 +0200
| |  summary:     more text
| |
o |  changeset:   2:26a980e728e2
| |  user:        gghh
| |  date:        Thu Apr 11 10:07:00 2013 +0200
| |  summary:     inserted newline
| |
o |  changeset:   1:9f7e0b186fa4
|/   user:        gghh
|    date:        Thu Apr 11 10:06:43 2013 +0200
|    summary:     added a source file
|
o  changeset:   0:1e0b6ab683de
user:        gghh
date:        Thu Apr 11 10:03:06 2013 +0200
summary:     initial import

checking phases

/c1$ hg phase -r 0::5
0: public
1: draft
2: draft
3: draft
4: public
5: draft

## Wait, hold on: phases? whazat?

Mercurial phases are a powerful feature that prevents you from making mistakes when rewriting the history of a repo. A changeset (commit) marked as public is a changeset that has been shared (i.e., through a push/pull), so it is not wise to modify it. Mercurial keeps track of that, and an attempt to rewrite a public changeset is prevented, unless you change the phase of the commit with

hg phase -v --draft [rev] --force

note the option --force, since we’re about to take the risk of moving an already-shared commit back to draft. Just to give a little context on how advanced this feature is, I’ll mention the fact that git is going to implement it in the next Google Summer of Code.

## Rewrite history!

Back to the original problem: note that commits 1, 2, 3 and 5 are draft, since they never left the repo c1, so it’s safe to move them around. We do that using the rebase mercurial extension, available since version 2.3.

### État des lieux

Before we start messing up our repository, let’s take note of all changesets:

/c1$ hg log -p
changeset:   5:0877a5c95a08
tag:         tip
parent:      3:5df72ef842d2
parent:      4:38bceaef9ec1
user:        gghh
date:        Thu Apr 11 10:09:04 2013 +0200
summary:     merged with production

diff -r 5df72ef842d2 -r 0877a5c95a08 src1.txt
--- a/src1.txt	Thu Apr 11 10:07:12 2013 +0200
+++ b/src1.txt	Thu Apr 11 10:09:04 2013 +0200
@@ -1,2 +1,2 @@
Hello
-world!!
+world

changeset:   4:38bceaef9ec1
parent:      0:1e0b6ab683de
user:        gghh
date:        Thu Apr 11 10:03:52 2013 +0200
summary:     removed bangs

diff -r 1e0b6ab683de -r 38bceaef9ec1 src1.txt
--- a/src1.txt	Thu Apr 11 10:03:06 2013 +0200
+++ b/src1.txt	Thu Apr 11 10:03:52 2013 +0200
@@ -1,2 +1,2 @@
Hello
-world!!
+world

changeset:   3:5df72ef842d2
user:        gghh
date:        Thu Apr 11 10:07:12 2013 +0200
summary:     more text

diff -r 26a980e728e2 -r 5df72ef842d2 src2.txt
--- a/src2.txt	Thu Apr 11 10:07:00 2013 +0200
+++ b/src2.txt	Thu Apr 11 10:07:12 2013 +0200
@@ -1,2 +1,4 @@
foo
bar
+baz
+bazinga

changeset:   2:26a980e728e2
user:        gghh
date:        Thu Apr 11 10:07:00 2013 +0200
summary:     inserted newline

diff -r 9f7e0b186fa4 -r 26a980e728e2 src2.txt
--- a/src2.txt	Thu Apr 11 10:06:43 2013 +0200
+++ b/src2.txt	Thu Apr 11 10:07:00 2013 +0200
@@ -1,1 +1,2 @@
-foo bar
+foo
+bar

changeset:   1:9f7e0b186fa4
user:        gghh
date:        Thu Apr 11 10:06:43 2013 +0200
summary:     added a source file

diff -r 1e0b6ab683de -r 9f7e0b186fa4 src2.txt
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/src2.txt	Thu Apr 11 10:06:43 2013 +0200
@@ -0,0 +1,1 @@
+foo bar

changeset:   0:1e0b6ab683de
user:        gghh
date:        Thu Apr 11 10:03:06 2013 +0200
summary:     initial import

diff -r 000000000000 -r 1e0b6ab683de src1.txt
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/src1.txt	Thu Apr 11 10:03:06 2013 +0200
@@ -0,0 +1,2 @@
+Hello
+world!!

### Rebase

Let’s do it! Just as in this example from the docs,

/c1$ hg rebase --dest 4 --source 1
saved backup bundle to c1/.hg/strip-backup/9f7e0b186fa4-backup.hg

Note that, for the sake of readability, I am referring to changesets via their revision numbers; the other option is to use their hashes. This works perfectly well, just be aware that revision numbers aren’t unique across repositories, while hashes are. What does the history look like now?

/c1$ hg log --graph
@  changeset:   4:c61940d6fbeb
|  tag:         tip
|  user:        gghh
|  date:        Thu Apr 11 10:07:12 2013 +0200
|  summary:     more text
|
o  changeset:   3:45eebb742cee
|  user:        gghh
|  date:        Thu Apr 11 10:07:00 2013 +0200
|  summary:     inserted newline
|
o  changeset:   2:9ad745b5c9db
|  user:        gghh
|  date:        Thu Apr 11 10:06:43 2013 +0200
|  summary:     added a source file
|
o  changeset:   1:38bceaef9ec1
|  user:        gghh
|  date:        Thu Apr 11 10:03:52 2013 +0200
|  summary:     removed bangs
|
o  changeset:   0:1e0b6ab683de
user:        gghh
date:        Thu Apr 11 10:03:06 2013 +0200
summary:     initial import

and phases are

/c1$ hg phase 0::4
0: public
1: public
2: draft
3: draft
4: draft

### Now collapse (fold)

To do the folding from revision 2 onwards, we finally resort to the “histedit” Mercurial extension:

/c1$ hg histedit 2

change the commands from

pick 9ad745b5c9db 2 added a source file
pick 45eebb742cee 3 inserted newline
pick c61940d6fbeb 4 more text

# Edit history between 9ad745b5c9db and c61940d6fbeb
#
# Commands:
#  p, pick = use commit
#  e, edit = use commit, but stop for amending
#  f, fold = use commit, but fold into previous commit (combines N and N-1)
#  d, drop = remove commit from history
#  m, mess = edit message without changing commit content
#

to

pick 9ad745b5c9db 2 added a source file
fold 45eebb742cee 3 inserted newline
fold c61940d6fbeb 4 more text

add an updated commit message and it’s done. A last glance at our pretty history:

/c1$ hg log --graph
@  changeset:   2:5f8e8722ad9a
|  tag:         tip
|  user:        gghh
|  date:        Thu Apr 11 10:07:12 2013 +0200
|  summary:     added a source file and other stuff
|
o  changeset:   1:38bceaef9ec1
|  user:        gghh
|  date:        Thu Apr 11 10:03:52 2013 +0200
|  summary:     removed bangs
|
o  changeset:   0:1e0b6ab683de
   user:        gghh
   date:        Thu Apr 11 10:03:06 2013 +0200
   summary:     initial import

Now the last revision is the sum of the three.

/c1$ hg log -p -r 2
changeset:   2:5f8e8722ad9a
tag:         tip
user:        gghh
date:        Thu Apr 11 10:07:12 2013 +0200
summary:     added a source file and other stuff

diff -r 38bceaef9ec1 -r 5f8e8722ad9a src2.txt
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/src2.txt	Thu Apr 11 10:07:12 2013 +0200
@@ -0,0 +1,4 @@
+foo
+bar
+baz
+bazinga

So that’s all I’ve got. Just a few more pointers for people of good will:

hg pull --rebase

and

hg rebase --dest [rev] --source [rev] --collapse

can perform these common tasks in a single step, as described here and here.

## A note about getting the latest Mercurial

If you happen to work on an Ubuntu box like me, you can bypass the bureaucracy of the official Ubuntu repositories and get the latest Mercurial with a simple

sudo add-apt-repository ppa:mercurial-ppa/releases
sudo apt-get update
sudo apt-get install mercurial

## Recap

To recap, we started with the goal of helping Alice reorganize her commits between one merge with upstream and the next, and we discovered that much more is possible: if the situation isn’t pathological, merges aren’t a boundary at all, and “draft” commits from Alice can really be moved around.

Mercurial has a really powerful mechanism for editing the history of a repo, and more awesomeness is still to come. Moreover, “phases” help keep the whole thing safe.