Ringo is a system for interactive data analysis for workflows that involve tabular and graph data representations. Ringo provides a high-productivity environment for construction, analysis, and manipulation of graphs on a single large-memory multicore machine.
Detecting expert users in a question-answering forum, tracing the propagation of information in a social network, and reconstructing the Internet topology from a set of traceroutes are all examples of tasks faced by today's data scientists. To solve such problems, data scientists engage in large-scale data analyses that require quick, trial-and-error prototyping of graph modeling, processing, and manipulation.
To support their work, data scientists need a system that offers operations for graph construction and transformations between tabular and graph representations of the data, in addition to a large number of efficient ready-to-use graph algorithms and constructs. Such a system needs to provide an easy-to-use, high-productivity front end, as well as an optimized back end that supports fast execution on large datasets, suitable for interactive use.
Systems available today meet only a subset of a data scientist's requirements. On one hand, rich and easy-to-use graph packages such as NetworkX do not scale to large graphs. On the other hand, scalable graph processing systems offer very limited out-of-the-box functionality, target batch execution instead of interactive data exploration, and require a high level of user expertise. Furthermore, most graph processing systems do not support manipulation of tabular input data or its transformation into graphs.
To support the work of data scientists, we present Ringo, a system for construction and analysis of large graphs on a single large-memory multicore machine that combines high-productivity analysis with fast and scalable execution.
Ringo offers the following features:
In building Ringo, we recognize the trend that large-memory machines are becoming affordable and provide a real alternative to distributed computing environments. Most real-world graphs being analyzed today fit comfortably in the memory of a single "big-memory" server. Furthermore, even in cases where the raw data might not initially fit in main memory, data cleaning and manipulation often reduce the data size significantly, so that the "interesting" part of the data fits in memory.
Targeting single big-memory machines offers many benefits for interactive analysis, compared to distributed environments, both in terms of performance and ease of programming. Ringo table operations, transformations between tables and graphs, and several graph algorithms are fully parallelized to take full advantage of the multi-core environment, and the set of graph algorithms available for parallel execution is under constant expansion.
Consider a typical use case, in which a data scientist's goal is to identify the top Java experts in the StackOverflow user community. StackOverflow is the world's largest question-answering website, where users post questions and other users answer them. As answers come in, the person who posted the question has the option of picking the best answer by "accepting" it.
One way to define user expertise is by constructing a network of users and finding the importance of user nodes in the network. In order to do so, the data scientist would extract the relevant tables from input data, build a graph connecting users providing Java-related questions and answers, and run a node centrality / link analysis algorithm such as PageRank on that graph to identify experts.
Assume that the input table has the following schema:

    posts(Id, PostTypeId, AcceptedAnswerId, OwnerUserId, Body, Tag)

First, the data scientist would initialize Ringo, load the input data, and remove the text field of the posts:
    import ringo

    Schema = [('Id','int'), ('PostTypeId','int'), ('AcceptedAnswerId','int'),
              ('OwnerUserId','int'), ('Body','string'), ('Tag','string')]
    ringo = ringo.Ringo()
    P = ringo.LoadTableTSV(Schema, 'posts.tsv', '\t', True)
    ringo.Project(P, ['Id', 'PostTypeId', 'AcceptedAnswerId', 'OwnerUserId', 'Tag'])
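For readers without Ringo at hand, the load-and-project step can be approximated in plain Python. The following is a minimal sketch, not Ringo's implementation: the helper names `load_rows` and `project`, the shortened schema, and the two sample rows are all hypothetical, and rows are modeled as dictionaries rather than Ringo tables.

```python
import csv


def load_rows(lines, schema):
    """Parse tab-separated lines into dicts, casting each field per the schema."""
    casts = {'int': int, 'string': str}
    rows = []
    for fields in csv.reader(lines, delimiter='\t'):
        rows.append({name: casts[kind](value)
                     for (name, kind), value in zip(schema, fields)})
    return rows


def project(rows, columns):
    """Keep only the named columns of every row."""
    return [{c: row[c] for c in columns} for row in rows]


# Hypothetical, shortened schema and sample data (no header line).
schema = [('Id', 'int'), ('OwnerUserId', 'int'), ('Body', 'string'), ('Tag', 'string')]
lines = ['1\t10\tHow do I sort a list?\tjava',
         '2\t20\tUse Collections.sort.\tjava']

# Drop the bulky Body column, as the walkthrough does.
rows = project(load_rows(lines, schema), ['Id', 'OwnerUserId', 'Tag'])
```

Dropping the text column early keeps the working set small, which matters when the goal is to keep the whole dataset in memory.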
Next, the scientist would extract the two relevant tables, one of Java questions and one of Java answers:
    JP = ringo.Select(P, "Tag = 'java'", False)
    Q = ringo.Select(JP, 'PostTypeId = 1', False)
    A = ringo.Select(JP, 'PostTypeId = 2', False)
Then, the scientist would construct a graph representing the user network, where nodes are users and an edge from user u to user v is formed if v provided an answer accepted by u:
    QA = ringo.Join(Q, A, 'AcceptedAnswerId', 'Id')
    G = ringo.ToGraph(QA, 'OwnerUserId-1', 'OwnerUserId-2')
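The join-then-graph step can be sketched in plain Python to show which edges result. This is an illustration under stated assumptions, not Ringo's code: rows are dictionaries, the helper `accepted_answer_edges` is hypothetical, and the sample rows are made up. Each question is matched to the answer whose Id equals the question's AcceptedAnswerId, and each match yields one directed edge from the asker to the answerer.

```python
def accepted_answer_edges(questions, answers):
    """Return directed (asker, answerer) edges, one per accepted answer."""
    # Index answers by Id so each question finds its accepted answer directly.
    answer_by_id = {a['Id']: a for a in answers}
    edges = []
    for q in questions:
        a = answer_by_id.get(q['AcceptedAnswerId'])
        if a is not None:  # questions with no matching answer produce no edge
            edges.append((q['OwnerUserId'], a['OwnerUserId']))
    return edges


# Hypothetical sample data.
questions = [
    {'Id': 1, 'AcceptedAnswerId': 3, 'OwnerUserId': 10},
    {'Id': 2, 'AcceptedAnswerId': 4, 'OwnerUserId': 11},
    {'Id': 5, 'AcceptedAnswerId': 99, 'OwnerUserId': 12},  # answer 99 is absent
]
answers = [
    {'Id': 3, 'OwnerUserId': 20},
    {'Id': 4, 'OwnerUserId': 20},
]
edges = accepted_answer_edges(questions, answers)
```

Here user 20 receives two incoming edges, one from each user who accepted their answer.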
Finally, the scientist would run the PageRank algorithm on the graph, sort the results, and save them as a measure of user expertise:
    PR_MAP = ringo.PageRank(G)  # A hash map object: node/user id -> PageRank score
    PR = ringo.TableFromHashMap(PR_MAP, 'user', 'score')
    PR = ringo.Order(PR, ['score'])
    ringo.SaveTableTSV(PR, 'scores.tsv')
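To show why PageRank on this graph rewards answerers, here is a minimal power-iteration sketch in plain Python. It is not Ringo's implementation, which may differ in damping factor, convergence test, and treatment of nodes without outgoing edges. The function name, the damping value of 0.85, the fixed iteration count, and the sample edges are assumptions made for illustration.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Return a dict mapping each node to its PageRank score."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    count = len(nodes)
    rank = {node: 1.0 / count for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without out-links is spread evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / count + damping * dangling / count
        new_rank = {node: base for node in nodes}
        for src in nodes:
            targets = out_links[src]
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
        rank = new_rank
    return rank


# Hypothetical network: users 10 and 11 accepted answers from user 20,
# and user 20 accepted an answer from user 30.
edges = [(10, 20), (11, 20), (20, 30)]
scores = pagerank(edges)
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy network user 30 ranks first, because the only user whose answer they accepted as an asker's choice, user 20, is itself highly ranked. That is the property that makes link analysis a plausible proxy for expertise.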
There are many alternatives for each step of the workflow. The scientist may want to build the user network in different ways, or use other algorithms to find node importance. With an interactive interface and fast execution times, Ringo allows for quick and agile prototyping of such alternatives.
The Ringo script for the above example can also be downloaded here: usecase.py
A sample StackOverflow dataset with 100K records (79 MB) can be downloaded here: posts_100K.tgz
A complete, preprocessed StackOverflow dataset with ~15.7M records (16 GB) can be downloaded here: posts_full.tgz
To further support high-productivity data exploration, Ringo records user actions and tracks the provenance of each object (table or graph) in the session. Ringo can provide a provenance script for an object: an executable, human-readable Python script containing the sequence of operations that led to the creation of that object. Provenance scripts can be used to reproduce and edit the workflows that generated objects of interest. For instance, the provenance script for the user network can be retrieved by calling ringo.GetProvenance(G), which generates the following script:
    import sys
    import ringo

    def generate(engine, filename0):
        P = engine.LoadTableTSV([('Id', 'int'), ('PostTypeId', 'int'), ('AcceptedAnswerId', 'int'),
                                 ('OwnerUserId', 'int'), ('Body', 'string'), ('Tag', 'string')],
                                filename0, '\t', True)
        P = engine.Project(P, ['Id', 'PostTypeId', 'AcceptedAnswerId', 'OwnerUserId', 'Tag'])
        JP = engine.Select(P, "Tag = 'java'", False)
        Q = engine.Select(JP, 'PostTypeId = 1', False)
        A = engine.Select(JP, 'PostTypeId = 2', False)
        QA = engine.Join(Q, A, 'AcceptedAnswerId', 'Id')
        G = engine.ToGraph(QA, 'OwnerUserId-1', 'OwnerUserId-2')
        return G

    engine = ringo.Ringo()
    files = ['posts.tsv']
    # Input file names can be overridden from the command line.
    for i in xrange(min(len(files), len(sys.argv)-1)):
        files[i] = sys.argv[i+1]
    G = generate(engine, *files)
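The idea behind provenance scripts can be illustrated with a toy recorder. The sketch below is hypothetical and makes no claim about how Ringo tracks provenance internally: the class `RecordingEngine` and its methods are invented for this example, objects are identified by name only, and in-place updates of an object are not modeled. Each recorded operation remembers its inputs, so the script for an object is obtained by walking its dependencies and emitting only the operations it actually depends on.

```python
class RecordingEngine(object):
    """Toy engine that logs operations so a script can be regenerated."""

    def __init__(self):
        # object name -> (operation name, input object names, extra arguments)
        self._log = {}

    def record(self, name, op, inputs=(), args=()):
        self._log[name] = (op, tuple(inputs), tuple(args))

    def provenance(self, name):
        """Return script lines, in dependency order, that rebuild `name`."""
        lines, seen = [], set()

        def visit(target):
            if target in seen:
                return
            seen.add(target)
            op, inputs, args = self._log[target]
            for dep in inputs:  # emit the inputs before the object itself
                visit(dep)
            call_args = list(inputs) + [repr(a) for a in args]
            lines.append('%s = engine.%s(%s)' % (target, op, ', '.join(call_args)))

        visit(name)
        return lines


engine = RecordingEngine()
engine.record('P', 'LoadTableTSV', args=['posts.tsv'])
engine.record('Q', 'Select', inputs=['P'], args=['PostTypeId = 1'])
engine.record('A', 'Select', inputs=['P'], args=['PostTypeId = 2'])
engine.record('QA', 'Join', inputs=['Q', 'A'], args=['AcceptedAnswerId', 'Id'])
engine.record('PR', 'Order', inputs=['P'], args=[['Id']])  # not an input of QA
script = engine.provenance('QA')
```

The operation that produced `PR` is left out of the script for `QA`, which mirrors how a provenance script contains only the steps that led to the requested object.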
The latest version of Ringo is 0.1 (Jun 4, 2014). A package for Linux (such as CentOS) is available at Ringo download. Ringo requires a 64-bit operating system.
Ringo requires that Python 2.7.x is installed on your machine. Python 2.7.x can be downloaded from the Python Download page. Make sure that you are using a 64-bit Python 2.7.x package.
To install Ringo, download and unpack the package and run setup.py.
Ringo is largely self-contained and requires external packages only for drawing and visualization. The following packages need to be installed on the system to support drawing and visualization in Ringo:
SIGMOD 2015 Demo Paper: Ringo: Interactive Graph Analytics on Big-Memory Machines
Programming Manual: Under construction.
GitHub Repository: The following development repositories are available for Ringo:
Snap.py Documentation
Ringo's backend relies on an extended version of the SNAP graph library. SNAP also has a Python interface, called snap.py. Most snap.py functions are available through Ringo. Documentation of Snap.py functions can be found here. A Ringo script should begin with an import statement and an initialization of the Ringo engine:
    import ringo
    ringo = ringo.Ringo()
In most cases, calls to snap.py functions of the form snap.f(args), as well as calls to static methods of snap.py objects of the form snap.class.f(args), translate into a call of the form ringo.f(args) in Ringo. Calls to instance methods of a snap.py object of the form obj.f(args) translate, in most cases, into a Ringo call of the form ringo.f(obj,args). For example, the snap.py call t1.Join("col1", t2, "col2") translates into the Ringo call ringo.Join(t1,t2,"col1","col2"). Note that parameter order, function names, and other details are not always preserved in this translation. It is therefore advisable to look up functions of interest in the Ringo Python interface file for the exact call syntax.
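The calling convention for instance methods can be illustrated with toy classes. The sketch below uses neither snap.py nor Ringo: `Table` and `Engine` are hypothetical stand-ins, and `Join` here only returns a description string. It shows an engine-level function that takes the objects first and then forwards to the instance method, reordering the parameters in the same way as the Join example above.

```python
class Table(object):
    """Toy stand-in for a table object with an instance method."""

    def __init__(self, name):
        self.name = name

    def Join(self, col1, other, col2):
        # Instance-method style: the receiver is the left table.
        return 'join(%s.%s, %s.%s)' % (self.name, col1, other.name, col2)


class Engine(object):
    """Toy stand-in for an engine that exposes the same operation."""

    def Join(self, t1, t2, col1, col2):
        # Engine style: both tables first, then both column names.
        return t1.Join(col1, t2, col2)


t1, t2 = Table('t1'), Table('t2')
engine = Engine()
direct = t1.Join('col1', t2, 'col2')
via_engine = engine.Join(t1, t2, 'col1', 'col2')
```

Both calls describe the same operation; only the position of the arguments differs, which is why checking the exact signature before translating a call is worthwhile.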
The following people contributed to the Ringo project (appear in alphabetical order):