So, I’ve been working on a bloom filter application for keeping “buddy” counts in edge lists. Given a stream of edges, count the number of distinct neighbors of each vertex that appears in the stream. The simple solution would be to just store a list of neighbors for each vertex, but that can exhaust memory if you have a large number of vertices and/or edges. So, I’ve been trying to use a bloom filter and a per-vertex counter instead. Basically, for each edge, I ask whether its endpoints are already known to be neighbors by querying the bloom filter. If not, I insert the edge into the filter and increment both endpoints’ counters; otherwise I just go on to the next edge.
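A minimal sketch of the scheme in Python, assuming a toy bloom filter (the class name, sizes, and salted-hash trick are my own illustrative choices, not from any particular library):

```python
import hashlib

class BloomFilter:
    """A minimal bloom filter over a fixed-size bit array."""
    def __init__(self, m, k):
        self.m = m      # number of bits
        self.k = k      # number of hash functions
        self.bits = 0   # bit array packed into a Python int

    def _hashes(self, item):
        # Derive k indices by salting a single hash, a common trick.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for idx in self._hashes(item):
            self.bits |= 1 << idx

    def __contains__(self, item):
        return all(self.bits >> idx & 1 for idx in self._hashes(item))

def count_neighbors(edges, m=1 << 16, k=4):
    """Approximate per-vertex distinct-neighbor counts from an edge stream.

    Undercounts whenever the filter reports a false positive for a
    never-before-seen edge.
    """
    seen = BloomFilter(m, k)
    counts = {}
    for u, v in edges:
        key = (min(u, v), max(u, v))  # canonical form for undirected edges
        if key not in seen:
            seen.add(key)
            counts[u] = counts.get(u, 0) + 1
            counts[v] = counts.get(v, 0) + 1
    return counts
```

Duplicate edges in the stream are absorbed by the filter, so `count_neighbors([(1, 2), (2, 3), (1, 2)])` counts the repeated edge only once.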

Now, one of the things I wanted to get a handle on was how the false positive rate interacts with the count. In other words, could I predict the degree of undercounting for a given bloom filter configuration? The trouble is that the false positive rate varies as a function of how many elements have been inserted. Being a little naive, I totally forgot about that part and tried modeling the behavior of the per-vertex counter as a Markov chain by computing transition probabilities between counts. It was only after I ran the simulations that I realized that the transition matrix varies with the number of inserts (and that the system therefore doesn’t have the Markov property). D’oh!
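The dependence on inserts is visible in the standard approximation: for a filter with m bits and k hash functions, the false positive rate after n insertions is roughly (1 − e^(−kn/m))^k.

```python
import math

def false_positive_rate(m, k, n):
    """Standard approximation of a bloom filter's false positive rate
    after n insertions, given m bits and k hash functions."""
    return (1.0 - math.exp(-k * n / m)) ** k
```

At n = 0 the rate is zero, and it climbs toward 1 as the bit array saturates, which is exactly why a fixed transition matrix can’t describe the counter.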

I guess I have to just directly simulate the system.
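A direct simulation can be kept quite small if each distinct edge’s k hash indices are modeled as uniform random draws into the bit array (an assumption of mine; a real filter’s hashes are deterministic per edge):

```python
import random

def simulate_counter(n_edges, m, k, seed=0):
    """Directly simulate the bloom-filter-backed counter on a stream of
    n_edges distinct edges; returns the resulting (under)count."""
    rng = random.Random(seed)
    bits = [False] * m
    count = 0
    for _ in range(n_edges):
        # Model the edge's k hash indices as uniform random draws.
        idxs = [rng.randrange(m) for _ in range(k)]
        if not all(bits[i] for i in idxs):  # filter says "new"
            count += 1
            for i in idxs:
                bits[i] = True
    return count
```

Sweeping `n_edges` against the returned count gives an empirical undercount curve for a given (m, k) configuration, which is the prediction I was after in the first place.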


I’m not sure if anybody is still even monitoring this blog. It’d be understandable if I were tossing stuff out into the void. After all, it’s been months since I started this blog and I haven’t made a single post. Part of it was that I just didn’t really have much neat stuff to talk about.

Last week, I came up with what I think is a pretty neat optimization for a problem I often come up against. As some of you may know, I work with large graphs. These graphs have on the order of 10s of millions of vertices and 100s of millions of edges and are stored as edge lists. Thanks to the rapidly dropping price of DRAM, things fit into memory. What about when things don’t, though? Bored with IRC, Twitter, Facebook, and Hacker News one night, I started thinking about how I might go about working with large graphs. The first task: how do you update such a graph?

Like I said, I’ve been loading everything into memory. I’d load the graph up, load up and apply my updates, and then write the graph back out. Easy peasy, and just fine for smaller stuff, but things were starting to get out of hand. Mind you, I have no formal training in databases. I know what a B-tree is and how they could be used, but I’ve never written any on-disk data structure that supports expanding and contracting records such as you’d need when updating graphs.

I did eventually come up with such a layout and felt pretty chuffed until I came to a huge realization. The updates I do are substantially smaller than the size of the graph! That is, any set of updates only touches a relatively small portion of the graph, so why load the whole thing? I realized that I could just load my updates, and then apply them as I read the original edge list sequentially off disk. So, the only thing I had to keep in memory was my update set.
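A sketch of the idea, assuming edges arrive as (u, v) tuples and the update set is split into additions and deletions (the function and parameter names are mine, not from my actual code):

```python
def apply_updates(edge_lines, additions, deletions):
    """Stream an edge list, dropping deleted edges and appending new ones.

    edge_lines: iterable of (u, v) tuples read sequentially (e.g. off disk);
    additions/deletions: the in-memory update set. Only the updates need
    to fit in RAM -- the original graph is never fully loaded.
    """
    deletions = set(deletions)
    pending = set(additions)
    for edge in edge_lines:
        if edge in deletions:
            continue             # drop deleted edges
        pending.discard(edge)    # don't duplicate an edge already present
        yield edge
    yield from sorted(pending)   # append genuinely new edges at the end
```

Because it’s a generator, the output can be written straight back to disk as it is produced, so peak memory stays proportional to the update set.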

This little change had a huge effect on my processing workflow. Now, updating graphs took at most a few gigabytes of RAM vs. 10s of GB. I also got a bit of a speedup, since the OS no longer needed to swap out bits of my working set, and I soon found other uses for the improved tool.

I’m always a little skeptical when lots of people become irrationally excited about something. From time to time, I get to cash in on this excitement. When that happens I also get to be on the receiving end of jealous rhetoric. On my first day of grad school, I went out to the local pub with some of my new classmates. On hearing that I would be working on social networks, I got a passel of questions asking why my chosen subfield of sociology was getting so much attention. I also got to hear why it wasn’t new, but we all know that there’s nothing new under the sun anyway. Daniel Lemire has some interesting thoughts on this.


So what are social networks good for? Are they only good for throwing sheep? Are they just vast databases of dubious friend lists? My answers are: 1. I suspect a lot, 2. no, and 3. no. Before I start spewing rhetoric about online social network services, let me outline what I understand social networks to be, and what kind of tools they give us for thinking about the world. Coming from a social sciences background, I understand social networks to be simply collections of people and the relations between them. Now, relations can be anything. This means that for any group of people, we can induce any number of networks. For example, we could think about the acquaintanceship network (who knows whom), the communication network (who talks to whom), or any number of other networks.


What’s interesting about real social networks is what is often referred to as the small-world property (see the work by Watts and Strogatz). This doesn’t refer to any single empirical measure. Instead, the small-world property refers to the fact that real world social networks seem to have much higher clustering than random networks (where people friend each other uniformly at random) while retaining comparably short path lengths. Perhaps even more interesting, it seems like people can actually navigate these networks without resorting to centralized directories. In other words, they are “searchable.”
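To make the comparison concrete, here’s a stdlib-only sketch along the lines of the Watts–Strogatz construction (a simplified rewiring, not their exact procedure): a ring lattice has high clustering and long paths, and a little random rewiring collapses the path length while clustering stays high.

```python
import random
from collections import deque

def ring_lattice(n, k):
    """Ring of n nodes, each linked to its k nearest neighbors per side."""
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(1, k + 1):
            adj[i].add((i + j) % n)
            adj[(i + j) % n].add(i)
    return adj

def rewire(adj, p, seed=0):
    """Rewire each edge with probability p to a random endpoint."""
    rng = random.Random(seed)
    n = len(adj)
    for u in range(n):
        for v in sorted(adj[u]):       # snapshot; adj mutates below
            if v > u and rng.random() < p:
                w = rng.randrange(n)
                if w != u and w not in adj[u]:
                    adj[u].discard(v); adj[v].discard(u)
                    adj[u].add(w); adj[w].add(u)
    return adj

def clustering(adj):
    """Average local clustering coefficient."""
    total = 0.0
    for u, nbrs in adj.items():
        d = len(nbrs)
        if d < 2:
            continue
        links = sum(1 for a in nbrs for b in nbrs if a < b and b in adj[a])
        total += 2.0 * links / (d * (d - 1))
    return total / len(adj)

def avg_path_length(adj):
    """Mean shortest-path length over all reachable pairs, via BFS."""
    total, pairs = 0, 0
    for src in adj:
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs
```

Measuring a 100-node lattice before and after rewiring a tenth of its edges shows the signature: clustering barely moves while the average path length drops noticeably.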


In my mind, online social network services will really come into their own when they provide mechanisms for people to take advantage of this property to solve individual problems. The challenge will be to find a way of incentivizing people to participate in search chains. For example, suppose two people separated by a chain of other people want to meet each other. How do we incentivize or compensate intermediaries to participate in such chains when their payoffs are minimal? Kleinberg and Raghavan have done some work on this in what they call query incentive networks, but I have yet to see any commercial applications.


Services such as StackOverflow (or Slashdot’s comment moderation system) use reputation systems to elicit query responses, but such services don’t take advantage of human ability to route queries. The reputation mechanisms serve more to regulate crowd behavior rather than as distributed “computing” networks. I think part of what could make social networks effective computing networks is the fact that individuals seem to be able to do this sort of social search.


Hello! Depending on who you talk to, I’m either a distracted sociologist or a lost computer scientist. Up until this past Spring, I was a Ph.D. candidate studying collective problem solving and social networks. My work till now has been mostly sociological in nature, but there has always been a computational component. I am now on leave, working on methods of audience segmentation for Internet ad targeting.

I am starting this blog in order to have a place to participate in online discussions concerning social networks (both real and virtual) and the various applications, algorithms, and conceptualizations surrounding them. It’s my way of keeping my hand in while I serve more commercial interests.
