Archive for July, 2009

I’m not sure if anybody is still even monitoring this blog. It’d be understandable if I were tossing stuff out into the void. After all, it’s been months since I started this blog and I haven’t made a single post. Part of it was that I just didn’t really have much neat stuff to talk about.

Last week, I came up with what I think is a pretty neat optimization for a problem I often come up against. As some of you may know, I work with large graphs. These graphs have on the order of 10s of millions of vertices and 100s of millions of edges and are stored as edge lists. Thanks to the rapidly dropping price of DRAM, things fit into memory. What about when things don’t though? Bored with IRC, Twitter,¬†Facebook, and Hacker News one night, I started thinking about how I might go about working with large graphs. The first task, how do you update such a graph?

Like I said, I’ve been loading everything into memory. I’d load the graph up, load up and apply my updates, and then write the graph back out. Easy peasy, and just fine for smaller stuff, but things were starting to get out of hand. Mind you, I have no formal training in databases. I know what a B-tree is and how they could be used, but I’ve never written any on-disk data structure that supports expanding and contracting records such as you’d need when updating graphs.

I did eventually come up with such a layout and felt pretty chuffed until I came to a huge realization. The updates I do are substantially smaller than the size of the graph! That is, any set of updates only touches a relatively small portion of the graph, so why load the whole thing? I realized that I could just load my updates, and then apply them as I read the original edge list sequentially off disk. So, the only thing I had to keep in memory was my update set.

This little change had a huge effect on my processing workflow. Now, updating graphs took at most a few gigabytes of RAM vs. 10s of GB. In fact, I also got a bit of a speedup since the OS didn’t need to swap out bits of my working set. In fact, I also found other uses for this improved tool.

Advertisements



  • Recent Comments

    Daniel Tunkelang on Squirt!
    antisociology on Squirt!
    Daniel Tunkelang on Squirt!
  • Ten to read