Wider spectrum of data sources for Apache Giraph using Apache Gora
by J MM for the Apache Software Foundation
Apache Giraph is a graph-processing framework that runs as regular Hadoop jobs in order to leverage existing Hadoop infrastructure. Giraph was built following the Pregel paper[1], but adds fault tolerance to the coordinator process by using Apache ZooKeeper as its centralized coordination service. It applies the bulk-synchronous parallel model to graphs: computation proceeds in supersteps, and in each superstep vertices send messages to other vertices. Apache Gora could provide a new vertex input format for Giraph, giving Giraph a wider spectrum of data sources from which graphs can be read for processing and in which results can be stored. Because Gora provides access to different data stores, the best configuration parameters for each of them should be tested within a graph-processing framework. This could be done by testing specific parameters with well-known algorithms implemented in Giraph, e.g. PageRank, shared connections, or others.
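To make the superstep model concrete, here is a minimal single-process sketch of PageRank written in the bulk-synchronous style. It does not use the Giraph or Gora APIs; the function name bsp_pagerank, its parameters, and the toy graph are all assumptions made for illustration, and a real Giraph job would distribute the vertices across Hadoop workers.

```python
from collections import defaultdict


def bsp_pagerank(edges, num_supersteps=20, damping=0.85):
    """Toy PageRank in the bulk-synchronous parallel style (hypothetical helper).

    edges maps every vertex id to a list of its out-neighbour ids.
    Every vertex must appear as a key, possibly with an empty list.
    Vertices without out-edges simply send nothing, so their rank mass
    is not redistributed in this simplified sketch.
    """
    n = len(edges)
    rank = {v: 1.0 / n for v in edges}
    inbox = {v: [] for v in edges}

    for superstep in range(num_supersteps + 1):
        outbox = defaultdict(list)
        for v, neighbours in edges.items():
            # Compute phase: consume messages sent during the previous superstep.
            if superstep > 0:
                rank[v] = (1.0 - damping) / n + damping * sum(inbox[v])
            # Send phase: split the current rank evenly among out-neighbours.
            if superstep < num_supersteps:
                for target in neighbours:
                    outbox[target].append(rank[v] / len(neighbours))
        # Barrier: messages become visible only in the next superstep.
        inbox = {v: outbox.get(v, []) for v in edges}
    return rank


if __name__ == "__main__":
    # Hypothetical three-vertex graph: a and b link to each other, c links to a.
    graph = {"a": ["b"], "b": ["a"], "c": ["a"]}
    for vertex, value in sorted(bsp_pagerank(graph, num_supersteps=50).items()):
        print(vertex, round(value, 4))
```

The point of the sketch is the barrier between supersteps: a vertex only ever reads messages produced in the previous superstep. That is why only the input side needs to change to load vertices from a different store, which is the role a Gora-backed vertex input format could play.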