Ringo: In-Memory Graph Exploration System

Ringo is a system for interactive data analysis for workflows that involve tabular and graph data representations. Ringo provides a high-productivity environment for construction, analysis, and manipulation of graphs on a single large-memory multicore machine.

Motivation

Detecting users in a question-answering forum, tracing the propagation of information in a social network, or reconstructing the Internet topology from a set of traceroutes are all examples of tasks faced by today's data scientists. To solve such problems, data scientists engage in large-scale data analyses that require quick, trial-and-error prototyping of graph modeling, processing, and manipulation.

To support their work, data scientists need a system that offers operations for graph construction and transformations between tabular and graph representations of the data, in addition to a large number of efficient ready-to-use graph algorithms and constructs. Such a system needs to provide an easy-to-use, high productivity front-end, as well as an optimized back-end that supports fast execution on large datasets, suitable for interactive use.

Systems available today meet only a subset of a data scientist's requirements. On one hand, rich and easy-to-use graph packages such as NetworkX do not scale to large graphs. On the other hand, scalable graph processing systems offer very limited out-of-the-box functionality, target batch execution instead of interactive data exploration, and require a high level of user expertise. Furthermore, most graph processing systems do not support manipulation of tabular input data and its transformation into graphs.

About Ringo

To support the work of data scientists, we present Ringo, a system for construction and analysis of large graphs on a single large-memory multicore machine that combines high-productivity analysis with fast and scalable execution.
Ringo offers the following features:

  • An interactive easy-to-use Python interface
  • A rich set of over 200 advanced graph operations and algorithms (based on the SNAP graph library).
  • Integration of table and graph processing, and support for efficient graph construction and transformations between tables and graphs.
  • Object provenance tracking to make it easier for a data scientist to follow multiple data exploration paths in parallel and later reproduce the analyses.

Why a "Big-Memory" Machine?

In building Ringo, we recognize the trend that large memory machines are becoming affordable and provide a real alternative to distributed computing environments. Most real-world graphs being analyzed today fit comfortably in the memory of a single "big-memory" server. Furthermore, even in cases where raw data might not fit initially in main memory, data cleaning and manipulation often result in significant data size reduction, so that the "interesting" part of the data nicely fits in memory.

Targeting single big-memory machines offers many benefits for interactive analysis, compared to distributed environments, both in terms of performance and ease of programming. Ringo table operations, transformations between tables and graphs, and several graph algorithms are fully parallelized to take full advantage of the multi-core environment, and the set of graph algorithms available for parallel execution is under constant expansion.

Example

Consider a typical use case, in which a data scientist's goal is to identify the top Java experts in the StackOverflow user community. StackOverflow is the world's largest question-answering website, where users post questions and others answer them. As answers come in, the person who posted the question has the option of picking the best answer by "accepting" it.

One way to define user expertise is by constructing a network of users and finding the importance of user nodes in the network. In order to do so, the data scientist would extract the relevant tables from input data, build a graph connecting users providing Java-related questions and answers, and run a node centrality / link analysis algorithm such as PageRank on that graph to identify experts.





Assume that the input table has the following schema:

posts(Id, PostTypeId, AcceptedAnswerId, OwnerUserId, Body, Tag)

First, the data scientist would initialize Ringo, load the input data, and remove the text field of the posts:

import ringo
Schema = [('Id','int'), ('PostTypeId','int'), ('AcceptedAnswerId','int'), 
          ('OwnerUserId','int'), ('Body','string'), ('Tag','string')]
ringo = ringo.Ringo()
P = ringo.LoadTableTSV(Schema, 'posts.tsv', '\t', True)
ringo.Project(P, ['Id', 'PostTypeId', 'AcceptedAnswerId', 'OwnerUserId', 'Tag'])

Next, the scientist would extract the relevant tables - a table of Java questions, and a table of Java answers:

JP = ringo.Select(P, "Tag = 'java'", False)
Q = ringo.Select(JP, 'PostTypeId = 1', False)
A = ringo.Select(JP, 'PostTypeId = 2', False)

Then, the scientist would construct a graph representing the user network, where nodes are users and an edge from user u to user v is formed if v provided an answer accepted by u:

QA = ringo.Join(Q, A, 'AcceptedAnswerId', 'Id')
G = ringo.ToGraph(QA, 'OwnerUserId-1', 'OwnerUserId-2')
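
To make the edge definition concrete, the following plain-Python sketch mimics the join on a few made-up rows. It does not use Ringo, and the helper name accepted_answer_edges and the toy data are assumptions introduced only for illustration. Each question is matched to its accepted answer, and the pair (asker, answerer) becomes a directed edge:

```python
# Toy stand-ins for the Q (questions) and A (answers) tables.
# The rows are invented for illustration; they are not StackOverflow data.
questions = [
    {"Id": 1, "AcceptedAnswerId": 3, "OwnerUserId": 10},
    {"Id": 2, "AcceptedAnswerId": 5, "OwnerUserId": 11},
]
answers = [
    {"Id": 3, "OwnerUserId": 20},
    {"Id": 4, "OwnerUserId": 21},  # never accepted, so it yields no edge
    {"Id": 5, "OwnerUserId": 10},
]


def accepted_answer_edges(questions, answers):
    """Return one (asker, answerer) pair per question with an accepted answer."""
    # Index answers by Id so each question can look up its accepted answer.
    answer_owner = {a["Id"]: a["OwnerUserId"] for a in answers}
    edges = []
    for q in questions:
        answerer = answer_owner.get(q["AcceptedAnswerId"])
        if answerer is not None:
            edges.append((q["OwnerUserId"], answerer))
    return edges


edges = accepted_answer_edges(questions, answers)
print(edges)
```

User 10 appears both as an asker and as an answerer, which is what lets a link analysis algorithm pass importance along chains of accepted answers.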

Finally, the scientist would run the PageRank algorithm on the graph, sort the results, and save them as a measure of user expertise:

PR_MAP = ringo.PageRank(G)	# A hash map object: node/user id -> PageRank score
PR = ringo.TableFromHashMap(PR_MAP, 'user', 'score')
PR = ringo.Order(PR, ['score'])
ringo.SaveTableTSV(PR, 'scores.tsv')

There are many alternatives for each step of the workflow. The scientist may want to build the user network in different ways, or use other algorithms to find node importance. With an interactive interface and fast execution times, Ringo allows for quick and agile prototyping of such alternatives.
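
For readers unfamiliar with PageRank, here is a minimal power-iteration sketch in plain Python. It is not Ringo's or SNAP's implementation; the damping factor of 0.85, the iteration count, and the descending sort order are assumptions chosen for the illustration, and the edge list is made up:

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Approximate PageRank scores for a directed edge list by power iteration."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    count = len(nodes)
    rank = {node: 1.0 / count for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without out-links is spread evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / count + damping * dangling / count
        new_rank = {node: base for node in nodes}
        for src in nodes:
            targets = out_links[src]
            for dst in targets:
                new_rank[dst] += damping * rank[src] / len(targets)
        rank = new_rank
    return rank


# Edge (u, v) means that u accepted an answer written by v.
edges = [(10, 20), (11, 10), (20, 10)]
scores = pagerank(edges)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Users whose answers are accepted by other highly ranked users end up at the front of the ranking, which is the intuition behind using PageRank as an expertise measure.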

Dataset Download

The Ringo script for the above example can also be downloaded here: usecase.py
A sample StackOverflow dataset with 100K records (79 MB) can be downloaded here: posts_100K.tgz
A complete, preprocessed StackOverflow dataset with ~15.7M records (16 GB) can be downloaded here: posts_full.tgz

Provenance

To further support high-productivity data exploration, Ringo records user actions and tracks the provenance of each object (table or graph) in the session. Ringo can provide a provenance script for an object, which is an executable and human-readable Python script containing the sequence of operations that led to the creation of that object. Provenance scripts can be used to reproduce and edit workflows that generated objects of interest. For instance, the provenance script for the user network can be retrieved by calling ringo.GetProvenance(G) which will generate the following script:

 import sys
 import ringo


 def generate(engine, filename0):
     P = engine.LoadTableTSV([('Id', 'int'), ('PostTypeId', 'int'), 
                              ('AcceptedAnswerId', 'int'),
                              ('OwnerUserId', 'int'), ('Body', 'string'), 
                              ('Tag', 'string')], filename0, '\t', True)
                             
     P = engine.Project(P, ['Id', 'PostTypeId', 'AcceptedAnswerId', 'OwnerUserId', 'Tag'])
     JP = engine.Select(P, "Tag = 'java'", False)
     Q = engine.Select(JP, 'PostTypeId = 1', False)
     A = engine.Select(JP, 'PostTypeId = 2', False)
     QA = engine.Join(Q, A, 'AcceptedAnswerId', 'Id')
     G = engine.ToGraph(QA, 'OwnerUserId-1', 'OwnerUserId-2')
     return G

 engine = ringo.Ringo()
 files = ['posts.tsv']
 for i in xrange(min(len(files), len(sys.argv)-1)):
     files[i] = sys.argv[i+1]
 G = generate(engine, *files)
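
One way such tracking could work is sketched below. This is a hypothetical illustration and not Ringo's actual mechanism: the class TrackedEngine and its methods record and get_provenance are invented names. Each object name is mapped to the list of source lines that produced it, and a new object inherits the lines of its inputs:

```python
class TrackedEngine(object):
    """Hypothetical engine that logs every operation as a line of Python source."""

    def __init__(self):
        # Object name -> ordered list of source lines needed to rebuild it.
        self._history = {}

    def record(self, name, op, inputs, *args):
        """Register that `name` was produced by calling `op` on `inputs` with `args`."""
        lines = []
        for parent in inputs:
            # Inherit the parents' lines, skipping those already collected.
            for line in self._history.get(parent, []):
                if line not in lines:
                    lines.append(line)
        call_args = list(inputs) + [repr(arg) for arg in args]
        lines.append("%s = engine.%s(%s)" % (name, op, ", ".join(call_args)))
        self._history[name] = lines

    def get_provenance(self, name):
        """Return the recorded source lines for `name` as a single script string."""
        return "\n".join(self._history[name])


engine = TrackedEngine()
# P has no recorded history here, so it is treated as an external input.
engine.record("JP", "Select", ["P"], "Tag = 'java'", False)
engine.record("Q", "Select", ["JP"], "PostTypeId = 1", False)
engine.record("A", "Select", ["JP"], "PostTypeId = 2", False)
engine.record("QA", "Join", ["Q", "A"], "AcceptedAnswerId", "Id")
engine.record("G", "ToGraph", ["QA"], "OwnerUserId-1", "OwnerUserId-2")
script = engine.get_provenance("G")
print(script)
```

Because every object carries only the lines of its own ancestors, asking for the provenance of Q would omit the steps that built A, QA, and G.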

Download and Installation

The latest version of Ringo is 0.1 (Jun 4, 2014). A package for Linux (CentOS) is available at Ringo download. Ringo requires a 64-bit operating system.

Ringo requires that Python 2.7.x is installed on your machine. Python 2.7.x can be downloaded from the Python Download page. Make sure that you are using a 64-bit Python 2.7.x package.

To install Ringo, download and unpack the package and run setup.py.

Ringo is largely self-contained and requires external packages only for drawing and visualization. The following packages need to be installed on the system to support drawing and visualization in Ringo:

  • Gnuplot for plotting structural properties of networks (e.g., degree distribution);
  • Graphviz for drawing and visualizing small graphs.
Set the system PATH variable, so that Gnuplot and Graphviz are available, or put the executables in the working directory.
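
A quick way to check whether the tools are reachable is sketched below. The helper find_on_path is a name introduced here, not part of Ringo, and the executable names gnuplot and dot (the Graphviz layout program) are the usual ones on Linux; they may differ on other platforms:

```python
import os


def find_on_path(program, path=None):
    """Return the full path of an executable named `program`, or None if not found."""
    if path is None:
        path = os.environ.get("PATH", "")
    for directory in path.split(os.pathsep):
        # An empty PATH entry resolves relative to the working directory.
        candidate = os.path.join(directory, program)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


# Assumed executable names: "gnuplot" for Gnuplot, "dot" for Graphviz.
for tool in ("gnuplot", "dot"):
    location = find_on_path(tool)
    print("%s: %s" % (tool, location or "not found on PATH"))
```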

Installation of Ringo on Linux

On Linux, use the following commands:
tar zxvf ringo-0.1-centos6.2-x64-py2.6.tar.gz
cd ringo-0.1-centos6.2-x64-py2.6
sudo python setup.py install

Local Install of Ringo

If you want to use Ringo in a local directory without installing it system-wide, download the Ringo package, unpack it, and copy the files ringo.py, snap.py, and _snap.so (or _snap.pyd) to your working directory.

Online Documentation and Resources

SIGMOD 2015 Demo Paper

Ringo: Interactive Graph Analytics on Big-Memory Machines

Programming Manual

Under construction.

GitHub Repository

The following development repositories are available for Ringo:

Snap.py Documentation

Ringo's backend relies on an extended version of the SNAP graph library. SNAP also has a Python interface, called snap.py. Most snap.py functions are available through Ringo. Documentation of Snap.py functions can be found here. A Ringo script should begin with an import statement and an initialization of the Ringo engine:
import ringo
ringo = ringo.Ringo()

In most cases, calls to snap.py functions of the form snap.f(args), as well as calls to static methods of snap.py objects of the form snap.class.f(args), would translate into a call of the form ringo.f(args) in Ringo.
For instance, the snap.py function call snap.GetShortPath(G, Src, Dst) would translate into the Ringo call ringo.GetShortPath(G, Src, Dst). This applies to most of the graph algorithms provided by Ringo and snap.

Calls to instance methods of a snap.py object of the form obj.f(args) would, in most cases, translate into a Ringo call of the form ringo.f(obj,args). For example, the snap.py call t1.Join("col1", t2, "col2") would translate into the Ringo call ringo.Join(t1,t2,"col1","col2").

Note that the order of parameters, function names, and other details are not always preserved in this translation. Thus, it is advisable to look up functions of interest in the Ringo Python interface file for the exact call syntax.
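
The two translation rules can be pictured with a small forwarding object. This is only an illustration of the calling convention, not how Ringo is implemented: FakeSnap, FakeTable, ForwardingEngine, CountEdges, and Head are all hypothetical names that exist neither in Ringo nor in snap.py. The sketch forwards arguments unchanged, so it does not reproduce the reordering seen in the Join example:

```python
class FakeSnap(object):
    """Stand-in for a module of graph functions; not the real snap.py."""

    @staticmethod
    def CountEdges(graph):
        return sum(len(neighbors) for neighbors in graph.values())


class FakeTable(object):
    """Stand-in for a table object that has an instance method."""

    def __init__(self, rows):
        self.rows = rows

    def Head(self, count):
        return self.rows[:count]


class ForwardingEngine(object):
    """Hypothetical engine exposing backend.f(args) and obj.f(args) as engine.f(...)."""

    def __init__(self, backend):
        self._backend = backend

    def __getattr__(self, name):
        # Python calls __getattr__ only when `name` is missing on the engine itself.
        if hasattr(self._backend, name):
            # Rule 1: backend.f(args) becomes engine.f(args).
            return getattr(self._backend, name)

        def call_method(obj, *args):
            # Rule 2: obj.f(args) becomes engine.f(obj, args).
            return getattr(obj, name)(*args)

        return call_method


engine = ForwardingEngine(FakeSnap)
graph = {1: [2, 3], 2: [3], 3: []}
edge_count = engine.CountEdges(graph)
first_rows = engine.Head(FakeTable(["a", "b", "c"]), 2)
print(edge_count, first_rows)
```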

Contributing to Ringo

We encourage you to try out Ringo for your data science needs. We also welcome contributions to the project via GitHub pull requests. Along with any pull request, please state that the contribution is your original work and that you license the work to the project under the project's open source license. Whether or not you state this explicitly, by submitting any copyrighted material via pull request, email, or other means you agree to license the material under the project's open source license and warrant that you have the legal authority to do so.

Contributors

The following people contributed to the Ringo project (listed in alphabetical order):

Active Contributors

Jure Leskovec
Yonathan Perez
Rok Sosic

Past Contributors

Arijit Banerjee
Chantat Eksombatchai
Jason Jong
Nikhil Khadke
Vikesh Khanna
Rohan Puttagunta
Martin Raison
Sheila Ramaswamy
Pararth Shah
Nicholas Shelly