Here I investigate the intersections between the most popular subreddits. By “intersection” I mean the users who have activity (an upvote or downvote on a post) in multiple subreddits. You can also think of a subreddit as the collection of its users; an intersection is then the set of people who are in two or more subreddits at the same time. The data I use contains the activity of ~10k users across ~11k subreddits (more details in the last section). A few hints on how to read the diagram (more on this in a previous post):
- Subreddits are represented on a circle; chords between circle segments represent the users that subreddits have in common (the thicker the chord, the larger the intersection).
- Intersections can have different “orders”: if we consider, say, the four subreddits A, B, C and D, I call a “4th order intersection” the set of elements common to all four sets; a “3rd order intersection” is the set of elements common to A, B and C but not in D, and so on. The diagram represents an n-th order intersection via a sort of “polygon”, i.e. a loop of n chords connecting the sets involved.
- To reduce the clutter of the visualization, the concept of “threshold” is introduced: all intersections smaller than a threshold T won’t be represented, but “absorbed” by lower order intersections. To be precise, a small intersection of order n is evenly split among all the (n-1)th order intersections it is contained in. This process is iterative, since there is no guarantee that an “upward redistribution” of elements will push the receivers past the threshold.
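The notions of “order” and “threshold” described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not necessarily the code behind the diagrams: the function names `exclusive_intersections` and `absorb_small` are made up for the example, and I assume that intersections of order 1 (users in a single subreddit) are never redistributed, since there is nothing below them.

```python
from collections import Counter


def exclusive_intersections(sets):
    """Count, for each combination of set names, the elements that belong
    to exactly those sets and to no others."""
    membership = {}
    for name, members in sets.items():
        for element in members:
            membership.setdefault(element, set()).add(name)
    return Counter(frozenset(names) for names in membership.values())


def absorb_small(counts, threshold):
    """Split every intersection smaller than `threshold` evenly among the
    lower-order intersections obtained by dropping one set at a time.

    Orders are processed from highest to lowest, so an intersection that is
    still too small after receiving a share is redistributed in turn.
    Assumption: order-1 intersections are left untouched.
    """
    sizes = {combo: float(size) for combo, size in counts.items()}
    max_order = max((len(combo) for combo in sizes), default=0)
    for order in range(max_order, 1, -1):
        for combo in [c for c in sizes if len(c) == order]:
            size = sizes[combo]
            if size >= threshold:
                continue
            del sizes[combo]
            share = size / order
            for dropped in combo:
                parent = combo - {dropped}
                sizes[parent] = sizes.get(parent, 0.0) + share
    return sizes
```

For example, with `sets = {"A": {1, 2, 3, 4}, "B": {2, 3, 4, 5}, "C": {4, 6}}` and a threshold of 2, the single user common to A, B and C is split among the three pairs; the pairs that are still below the threshold are then split again among the single sets. The total number of users is preserved along the way.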
The top 10 subreddits
This plot tells us that, as far as the top 10 subreddits go, everyone is everywhere. The big fat dark blue intersection touches all of them. Then we have three groups of people: light orange, who like all subs but /r/gaming; dark orange, who are everywhere but on /r/AskReddit; and the green group, which is on neither /r/gaming nor /r/AskReddit. The last interesting story is that another group, light blue, is only on the main page (reddit.com) and on /r/pics, with no interest in anything else.
If we now line up those 10 subreddits by size, we see that reddit.com and /r/pics are the most popular.
8407 reddit.com
7790 pics
7462 funny
6966 WTF
6828 politics
6424 science
6378 technology
6101 worldnews
5519 AskReddit
5115 gaming
The diagram shows that the edge they have is partly due to the light-blue group of people, which boosts both. At the other end of the scale, /r/gaming and /r/AskReddit are at the bottom; indeed the light orange and dark orange groups avoid one or the other, and the green group avoids both.
Note that these numbers don’t represent the actual subreddit population (which is far larger), but the number of users who took part in the experiment and agreed to have their activity published anonymously.
Looking down in the ranks
It would be very nice to get a diagram with more than 10 lists, but unfortunately it’s very expensive to compute: the algorithm that calculates the size of the chords is exponential in the number of lists one considers. To see why, remember that the power set of a set of n elements (the collection of all its subsets) has 2^n elements; these diagrams require computing all possible intersections between the lists, which means taking every element of the power set and computing the size of the corresponding intersection.
So I’ll resort to sampling: I’ll take 10 random subreddits and see what happens. Since the population of subreddits follows a Pareto distribution (see below for pictures), uniform sampling would favour very small subreddits; so what I do is sort the subs and put them into 10 buckets, each containing roughly the same number of users. The first buckets will be made of only a few large subs; the last will contain a lot of small ones (I don’t consider subreddits with fewer than 300 users). Here are the exact figures I got for the number of subreddits per bucket:
[3, 4, 4, 6, 8, 12, 17, 26, 43]
and the corresponding number of users per bucket:
[23659, 26596, 21296, 21967, 21229, 21011, 21371, 21544, 21296]
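The bucketing step can be sketched as follows. This is a minimal greedy version written for illustration, and it may differ from the procedure that produced the figures above: the function names `equal_user_buckets` and `sample_one_per_bucket` are hypothetical, and the rule "start a new bucket once the current one has reached the average share of users" is my assumption.

```python
import random


def equal_user_buckets(sizes, n_buckets, min_size=300):
    """Group subreddits, largest first, into buckets holding roughly the
    same number of users. `sizes` maps subreddit name -> number of users;
    subreddits with fewer than `min_size` users are ignored."""
    kept = sorted(
        ((name, size) for name, size in sizes.items() if size >= min_size),
        key=lambda item: item[1],
        reverse=True,
    )
    target = sum(size for _, size in kept) / n_buckets
    buckets = [[]]
    running = 0
    for name, size in kept:
        # Open a new bucket once the current one has reached the target,
        # unless we are already filling the last bucket.
        if running >= target and len(buckets) < n_buckets:
            buckets.append([])
            running = 0
        buckets[-1].append(name)
        running += size
    return buckets


def sample_one_per_bucket(buckets, seed=None):
    """Pick one random subreddit from each bucket."""
    rng = random.Random(seed)
    return [rng.choice(bucket) for bucket in buckets]
```

With a heavy-tailed size distribution this gives short buckets of big subreddits first and long buckets of small ones last, which is the shape of the figures above. Being greedy, it only balances the buckets approximately.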
I’ll now take one random subreddit from each bucket. Here are a couple of shots:
Uhm. What’s the story behind this? A nice discovery would be something like “subreddits X and Y have a strong affinity, because users in X tend to like Y and vice-versa”. But that’s not what’s in the diagram. It’s a mere reflection of the popularity of the four big subreddits in the diagram (/r/videos among them). It’s just telling us “the top subreddits capture the attention of all users”, which isn’t news. It’s evident if we focus on the two medium-sized subs in the diagram, one of which is /r/fffffffuuuuuuuuuuuu (it’s a sub about comics with funny faces that say crazy things, I had to check): they’re made up of groups that follow all the big subs, and this doesn’t say much about affinities. Another shot to make sure this is not a particular case:
Again, the medium-sized /r/news shares users with all the big ones without any particular skew.
So I removed the top 20 subreddits from my lists and recomputed the buckets. Here are two examples:
Some non-trivial links come to the surface: people in /r/energy unsurprisingly care about /r/energy, a big chunk of the NSFW subreddit /r/gonewild likes the funny comics at /r/fffffffuuuuuuuuuuuu and enjoys /r/humor, people interested in /r/marijuana also like /r/humor. One more:
Somebody’s at the same time into /r/bestof, people in /r/libertarian are often also on
I spent more time than I care to admit generating these random matches and discovering subreddits I didn’t know existed.
Data sanity check
The data I use is from Alex Mizrahi (/u/killerstorm), who posted it to the rrecommender Google group in 2012 as part of an initiative, launched in 2010 by David King (/u/ketralnis, a reddit employee at the time), to build a recommender for subreddits.
The dataset shows the activity of 9579 users (who opted in to contribute their data for research) across 10855 subreddits. “Activity” merely means that a user made an upvote or downvote in a subreddit. Here I’ll have a quantitative look at the dataset, to see how healthy it is.
It’s reasonable to expect subreddit populations to follow a Pareto distribution (see this post by Terence Tao for a definition of “reasonable”). The simplest way to check is to see whether the cumulative distribution of subreddit sizes is a Pareto (see “Power laws, Pareto distributions and Zipf’s law” by M. E. J. Newman for details). Here we go:
The log-log plot looks a lot like a straight line, and with some handwaving we can conclude that the data follows a Pareto distribution. This diagram, also called a “rank/frequency plot” because of its use in word frequency analysis in texts, is very simple to draw: line up your population sizes sorted from largest to smallest, and plot their rank (position in the list, counting from 1) against their size. The rank of a population equals the number of populations at least as large as it: the first has only one (itself); the second has two (itself and the previous one, which is larger since the elements are sorted), and so on.
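The rank/frequency construction is short enough to spell out in code. The sketch below builds the (rank, size) pairs and, as a rough stand-in for eyeballing the straight line, fits a least-squares slope in log-log space. The function names `rank_frequency` and `loglog_slope` are hypothetical, and the fit is only a crude check that I am adding for illustration, not a proper estimate of the Pareto exponent.

```python
import math


def rank_frequency(sizes):
    """Return (rank, size) pairs, with rank 1 for the largest population."""
    ordered = sorted(sizes, reverse=True)
    return list(enumerate(ordered, start=1))


def loglog_slope(points):
    """Least-squares slope of log(rank) against log(size).

    A roughly constant negative slope is what a Pareto-like tail looks
    like on a log-log plot. All sizes must be positive."""
    xs = [math.log(size) for _, size in points]
    ys = [math.log(rank) for rank, _ in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance
```

For instance, sizes of the form `1000 / rank` lie exactly on a line of slope -1 in log-log space. Real data will be noisier, especially at the small-size end of the list.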
The first dump by David King showed users with abnormal activity; Alex Mizrahi’s sampling was said to mitigate the problem. So here I plot the histogram of user activity, i.e. the distribution of the number of subreddits each user is active in.
Most of the users are on <20 subreddits. Somebody has activity on ~250 subs, but the histogram shows those are outliers.
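The counts behind such a histogram can be computed as in the sketch below. The input layout (a mapping from user to the set of subreddits they voted in), the function name `activity_histogram` and the bin width of 10 are all assumptions made for the example; the original dataset may be organised differently.

```python
from collections import Counter


def activity_histogram(user_subs, bin_width=10):
    """Count users per activity bin.

    `user_subs` maps a user id to the set of subreddits where that user
    has at least one vote. Each key of the result is the lower edge of a
    bin: with bin_width=10, key 20 covers users active in 20-29 subs."""
    histogram = Counter()
    for subs in user_subs.values():
        lower_edge = (len(subs) // bin_width) * bin_width
        histogram[lower_edge] += 1
    return dict(sorted(histogram.items()))
```

Outliers such as the user with ~250 subreddits would show up as isolated bins with a count of one, far to the right of the bulk.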