## Archive for August, 2009

So, I’ve been working on a bloom filter application for keeping “buddy” counts in edge lists. Given a stream of edges, count up the number of neighbors for each vertex that appears in the stream. The simple solution would be to just store a list of neighbors for each vertex, but this can exhaust memory if you have a large number of vertices and/or edges. So, I’ve been trying to use a bloom filter and a per-vertex counter to do this. Basically, for each pair of vertices, I ask if one is a neighbor of the other by querying the bloom filter. If not, then I increment a counter, otherwise I just go on to the next edge.

Now, one of the things I wanted to get a handle on was how the false positive rate interacts with the count. In other words, could I predict the degree of undercounting for a given bloom filter configuration. The trouble is, the false positive rate varies as a function of how many elements have been inserted. Being a little naive, I totally forgot about that part and tried modeling the behavior of the per-vertex counter as a markov chain by computing transition probabilities between counts. It was only after I ran the simulations that I realized that the transition matrix varies with the number of inserts (and that the system is therefore does not have markov chain properties). D’oh!

I guess I have to just directly simulate the system.