How many elements in the union of two sets? The sum of those in each set respectively, you say? Of course not, there might be some element in common that you counted twice. How if the sets involved are three? Or four?

The answer is known as “the Inclusion-Exclusion Principle”, and in this post I’ll describe how to use it to “explode intersections”, i.e. give the cardinality of each region of a Venn diagram. With a picture, here the case of three sets:

The Inclusion-Exclusion Principle tells us that the cardinality of the union of three sets is given by the formula

|A ∪ B ∪ C| = |A| + |B| + |C| - |A∩B| - |A∩C| - |B∩C| + |A∩B∩C|

which doesn’t come too much as a surprise: we first sum the sets’ cardinalities, then see that we added once too much the “two-by-two” intersections and subtract them, but then ehy, we added the “all-in” interesection three times and subtracted three times, gotta put it back in.

The generalized case follows the same pattern: add all “order one” intersections (the sets alone), remove all “order two” intersections (the two-by-two), add back all “order three” ones (the three-by-three), remove all “order four”, and so on until all no more cases exists.

When thinking of that formula applied to our problem of “exploding” the intersection, I find useful to visualize the situation with the following lattice, where I stress the fact that higher order intersections are contained into some lower order ones. Nodes in the lattice are intersections (all nicely lined up by order), edges represent the partial order relationship “contains”.

The diagram reminds that, for instance, to get how many elements are only in A (and not in B nor C), one has to follow all paths from A down to the highest order intersection, and flip the sign as the order of intersections increases. Which is: sum |A∩B| and |A∩C|, then subtract |A∩B∩C|. The result corresponds to the cardinality of (A∩B) ∪ (A∩C), and has to be subtracted from the cardinality of A.

Slightly harder: 4 sets instead of 3. Following the above diagram, we can compute the number of items in A but in none of the other sets via following all paths that go from A to the very bottom intersection A∩B∩C∩D, and using the principle to know what to add and what to subtract:

in A only = |A| - |A∩B| - |A∩C| - |A∩D| + |A∩B∩C| + |A∩B∩D| + |A∩C∩D| - |A∩B∩C∩D|

In this other example we want to see how many things are un A and C but in none of the other sets; same story, following the paths and flipping the sign at each step down.

in A∩C only = |A∩C| - |A∩B∩C| - |A∩C∩D| + |A∩B∩C∩D|

Now we implement this algorithm in code. First some helper functions; list flattening:

def flatten(listlist): return [elem for list_ in listlist for elem in list_]

>>> flatten([['a', 'b'], ['c', 'd', 'e']]) ['a', 'b', 'c', 'd', 'e']

permutations of a list:

def perm(sequence): if len(sequence) == 0: return [[]] else: out = [] for cnt, elem in enumerate(sequence): out += map(lambda x: [elem] + x, perm(sequence[:cnt] + sequence[(cnt+1):])) return out

>>> perm(['a', 'b', 'c']) [['a', 'b', 'c'], ['a', 'c', 'b'], ['b', 'a', 'c'], ['b', 'c', 'a'], ['c', 'a', 'b'], ['c', 'b', 'a']]

All subsets of given cardinality of a list:

def choose_n(n, srcList): if n == 0: return [[]] else: out = [] for cnt, elem in enumerate(srcList): out += map(lambda x: [elem] + x, choose_n(n-1, srcList[(cnt+1):])) return out

>>> choose_n(2, ['a', 'b', 'c']) [['a', 'b'], ['a', 'c'], ['b', 'c']]

All possible subsets of a list, excluded the empty one:

def allsubsets(srcList): out = [] tot = len(srcList)+1 complem = lambda universe, ensemble: \ list(set(universe) - set(ensemble)) complSrcList = lambda ensemble: complem(srcList, ensemble) for i in range(1, tot): if i > tot / 2: # integer division yield map(complSrcList, choose_n(tot - (i + 1), srcList)) else: yield choose_n(i, srcList)

Note the optimization where I work on the complementary list to reduce the number of calls to `choose_n`

, which is equal to the size of the argument.

>>> list(allsubsets(['a', 'b', 'c'])) [[['a'], ['b'], ['c']], [['a', 'b'], ['a', 'c'], ['b', 'c']], [['a', 'c', 'b']]]

Now a function to precompute all intersection values. We store them in a dictionary where values are intersection cardinalities, and keys are the names of the sets involved in the intersection. Many items in this lookup table represent the same intersection, with a different order of the sets in the key. I do this to speed up the lookup: instead of normalizing the queries, I store the answer to all possible queries.

def intersLookup(lists): toInters = flatten(allsubsets(lists.keys())) def inters_n(names): return reduce(lambda s, t: set(s) & set(t), [lists[name] for name in names[1:]], lists[names[0]]) lookup = {} for sequence in toInters: cardinality = len(inters_n(sequence)) for variant in perm(sequence): lookup['&'.join(variant)] = cardinality return lookup

>>> intersLookup({'a': ['apple', 'banana'], ... 'b': ['orange', 'apple', 'watermelon'], ... 'c': ['peach', 'plum', 'pear', 'apple', 'orange']}) {'a': 2, 'c': 5, 'b': 3, 'a&b&c': 1, 'b&c&a': 1, 'c&b&a': 1, 'a&c&b': 1, 'b&a&c': 1, 'c&a&b': 1, 'b&a': 1, 'a&b': 1, 'b&c': 2, 'c&b': 2, 'c&a': 1, 'a&c': 1}

The following generator, named `subints`

as in “sub-intersections”, is particularly important since it implements the concept of “walking down all paths” in the above diagrams, “one level at a time”. It returns the nodes of the lattice conveniently grouped for the sign flipping that happens in the Inclusion Exclusion formula.

def subints(elem, compl): for thing in allsubsets(compl): yield map(lambda x: elem + x, thing)

See the example above, the first I gave for the 4 sets case. We want to see how much of A is “eaten” by the sub-intersections. What to sum, what to subtract? Here the answer:

>>> si = subints(['a'], ['b', 'c', 'd']) >>> si.next() # add [['a', 'b'], ['a', 'c'], ['a', 'd']] >>> si.next() # subtract [['a', 'b', 'c'], ['a', 'b', 'd'], ['a', 'c', 'd']] >>> si.next() # add [['a', 'c', 'b', 'd']] >>> si.next() # it's over Traceback (most recent call last): File "", line 1, in StopIteration

The trick is to observe that in a sub-lattice of our diagrams lower levels are just intersections of the nodes in upper levels plus all sets left out. Now it’s clear why `allsubsets`

works the way it does: it returns subsets grouped by size, so that `subints`

can be written more concisely.

Let’s see the second example above, again from the 4-sets case:

>>> si = subints(['a', 'c'], ['b', 'd']) >>> si.next() [['a', 'c', 'b'], ['a', 'c', 'd']] >>> si.next() [['a', 'c', 'b', 'd']] >>> si.next() Traceback (most recent call last): File "", line 1, in StopIteration

We’ve now everything needed to write the inclusion-exclusion formula:

def flip(): curr = 1 while True: yield curr curr *= -1 def o(f, g): # function composition def helper(x): return f(g(x)) return helper def inclusionexclusion(elem, compl, allInters): currentInters = allInters['&'.join(elem)] sign = flip() subintValue = \ sum(map(lambda (sign, value): sign * value, zip(sign, map(o(sum, (lambda nodes: map(lambda node: allInters['&'.join(node)], nodes))), subints(elem, compl))))) return currentInters - subintValue

Let’s see:

>>> allInters = intersLookup({'a': ['apple', 'banana'], 'b': ['orange', 'apple', 'watermelon'], 'c': ['peach', 'plum', 'pear', 'apple', 'orange']})

how many items in `a`

and `b`

but not in `c`

?

>>> inclusionexclusion(['a', 'b'], ['c'], allInters) 0

zero, since the only common element is `apple`

, but it’s also in `c`

. How many elements in the intersection of all three?

>>> inclusionexclusion(['a', 'b', 'c'], [], allInters) 1

the `apple`

, indeed. How many in `a`

, but not in `b`

nor `c`

?

>>> inclusionexclusion(['a'], ['b', 'c'], allInters) 1

that’s the `banana`

.

The last thing left to do is to package all of this into a shiny function that computes our complete explosion:

def explode(allNames, allInters): out = {} for elem in flatten(allsubsets(allNames)): out['&'.join(elem)] = \ inclusionexclusion(elem, complem(allNames, elem), allInters) return out

There we go:

>>> sets = {'a': ['apple', 'banana'], ... 'b': ['orange', 'apple', 'watermelon'], ... 'c': ['peach', 'plum', 'pear', 'apple', 'orange']} >>> >>> allInters = intersLookup(sets) >>> explode(sets.keys(), allInters) {'a': 1, 'a&c&b': 1, 'c': 3, 'b': 1, 'c&b': 1, 'a&b': 0, 'a&c': 0}

The code is available at https://github.com/gghh/intersections. Images are drawn with gimp and zwibbler. The 4-sets Venn diagram comes from Wikipedia.