## Archive for the ‘Uncategorized’ Category

So, I’ve been working on a Bloom filter application for keeping “buddy” counts in edge lists. Given a stream of edges, count up the number of neighbors for each vertex that appears in the stream. The simple solution would be to just store a list of neighbors for each vertex, but this can exhaust memory if you have a large number of vertices and/or edges. So, I’ve been trying to use a Bloom filter and a per-vertex counter instead. Basically, for each edge, I query the Bloom filter to ask whether I’ve already seen one vertex as a neighbor of the other. If not, I insert the edge into the filter and increment the counter; otherwise I just move on to the next edge.
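Here’s a minimal sketch of the scheme, assuming a toy Bloom filter built on `hashlib`; the names (`BloomFilter`, `buddy_counts`) and parameters are mine, not from any particular library:

```python
import hashlib

class BloomFilter:
    """A toy Bloom filter: k positions per item, derived from SHA-256."""
    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

def buddy_counts(edges, num_bits=1 << 20, num_hashes=4):
    """Approximate per-vertex neighbor counts from a stream of edges."""
    seen = BloomFilter(num_bits, num_hashes)
    counts = {}
    for u, v in edges:
        key = (min(u, v), max(u, v))  # treat the edge as undirected
        if key not in seen:           # a false positive here means an undercount
            seen.add(key)
            counts[u] = counts.get(u, 0) + 1
            counts[v] = counts.get(v, 0) + 1
    return counts
```

The memory cost is just the bit array plus one counter per vertex, instead of a full neighbor list per vertex.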

Now, one of the things I wanted to get a handle on was how the false positive rate interacts with the count. In other words, could I predict the degree of undercounting for a given Bloom filter configuration? The trouble is, the false positive rate varies as a function of how many elements have been inserted. Being a little naive, I totally forgot about that part and tried modeling the behavior of the per-vertex counter as a Markov chain by computing transition probabilities between counts. It was only after I ran the simulations that I realized that the transition matrix varies with the number of inserts (and that the system therefore doesn’t have the Markov property). D’oh!

I guess I’ll just have to simulate the system directly.
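A direct simulation is easy enough to sketch. This is a rough Monte Carlo version assuming the usual approximation p(i) = (1 − e^(−ki/m))^k for the false positive rate of a filter with m bits and k hashes after i insertions, and it simplifies by treating one vertex’s edges in isolation from the rest of the filter’s traffic; the function name and parameters are mine:

```python
import math
import random

def simulate_undercount(true_degree, m, k, trials=1000, seed=42):
    """Average observed count for a vertex with `true_degree` distinct
    neighbors, counted through a Bloom filter with m bits and k hashes."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        inserted = 0  # edges actually admitted into the filter so far
        count = 0
        for _ in range(true_degree):
            # false positive probability after `inserted` insertions
            p = (1.0 - math.exp(-k * inserted / m)) ** k
            if rng.random() >= p:  # filter says "not seen": count it
                count += 1
                inserted += 1
        total += count
    return total / trials
```

Comparing the return value against `true_degree` gives the expected undercount for that configuration.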

I’m not sure if anybody is still even monitoring this blog. It’d be understandable if I were tossing stuff out into the void. After all, it’s been months since I started this blog and I haven’t made a single post. Part of it was that I just didn’t really have much neat stuff to talk about.

Last week, I came up with what I think is a pretty neat optimization for a problem I often come up against. As some of you may know, I work with large graphs. These graphs have on the order of 10s of millions of vertices and 100s of millions of edges and are stored as edge lists. Thanks to the rapidly dropping price of DRAM, things fit into memory. What about when things don’t, though? Bored with IRC, Twitter, Facebook, *and* Hacker News one night, I started thinking about how I might go about working with large graphs. The first task: how do you update such a graph?

Like I said, I’ve been loading everything into memory. I’d load the graph up, load up and apply my updates, and then write the graph back out. Easy peasy, and just fine for smaller stuff, but things were starting to get out of hand. Mind you, I have no formal training in databases. I know what a B-tree is and how they could be used, but I’ve never written any on-disk data structure that supports expanding and contracting records such as you’d need when updating graphs.

I did eventually come up with such a layout and felt pretty chuffed until I came to a huge realization. The updates I do are substantially smaller than the size of the graph! That is, any set of updates only touches a relatively small portion of the graph, so why load the whole thing? I realized that I could just load my updates, and then apply them as I read the original edge list sequentially off disk. So, the only thing I had to keep in memory was my update set.

This little change had a huge effect on my processing workflow. Now, updating graphs took at most a few gigabytes of RAM vs. 10s of GB. I even got a bit of a speedup, since the OS no longer needed to swap out bits of my working set. I’ve since found other uses for the improved tool, too.