back

Graph Traversal Benchmark for Graph Databases

Intro:

The goal of the project is to provide a benchmarking mechanism to measure effectiveness of graph databases for graph traversal operations. Our motivation is to measure the capabilities of graph databases to perform a) query-like traversal, where one searches for topologically related vertices for a given vertex (breath-first traversal from a single vertex (or small number of vertices)) b) grcooccurenceaph analysis/mining operations that require the (multiple) traversal of the whole graph (e.g. computation of connected components, centrality measures (HITS/PageRank), community detection etc.).

Benchmark Design Decisions:

Data Sets: We believe that the benchmark should be performed on the data sets with properties similar to those of the real data sets. As most of the networks that people are interested nowadays, like social networks, information networks (internet), traffic networks, biological networks, term cooccurence networks etc., have small world properties, our aim is to use network generators with such realistic properties. Currently, we use LFR-Benchmark generator, which produces networks with power-law degree distribution and with implanted communities within the network.

Traversal Patterns: In our minds, the main distinction between the graph databases and the other data management solutions is the capability to capture and exploit rich structure of relations / connections between records/entities of the database. Thus, a graph database should be able to perform traversal operations efficiently. The benchmark focuses mainly on two traversal patters: breath-first traversal from a single vertex and whole graph traversal.

Supported Implementation of Graph Databases: Currently, the benchmark is designed for graph databases implementing Blueprints API. Blueprints is an effort to build a common API for graph databases. As the number of graph database systems is growing, it is not feasible for us to implement the benchmark operations for all the native APIs of graph database systems. We thus limit the benchmark to the Blueprints enabled dbs. The drawback of this approach is that implementation of blueprints API over distinct databases might bring additional overhead to the database operations and that the system specific optimization techniques might not be available.

Sources and Data files:

The code is hosted on Sourceforge. The benchmark implementation utilize heavily Graph-bench.
SVN repository: https://simplegdb.svn.sourceforge.net/svnroot/simplegdb/TraversalBench/
Sample data files generated by LFR benchmark generator can be downloaded here.

Benchmark Operations:

current state (as of 28.09.2011)
query-like traversal benchmarking operations:
  1. load graph from LFR-Benchmark network definition file
  2. get node based on node ID, for 10.000 random nodes, group operations by 100
  3. compute local clustering coefficient, for 10.000 random nodes, group operations by 100a
  4. do BFS to level 3 from a given node, for 10.000 random nodes, group operations by 100
  5. get node based on node ID, for 10.000 random nodes, group operations by 10000
  6. compute local clustering coefficient, for 10.000 random nodes, group operations by 10000
  7. do BFS to level 3 from a given node, for 10.000 random nodes, group operations by 10000

Whole graph traversal benchmarking operations:
  1. load graph from LFR-Benchmark network definition file
  2. compute connected components, traversing using outgoing edges
  3. compute connected components, traversing using incoming edges
  4. do 50 iterations of HITS algorithm; keeping mapping between nodes and their hubs/authorities in memory

Build instructions: Build with: ant