Ruby on Big Data (Cassandra + Hadoop)

Ruby on Big Data
Brian O’Neill
Lead Architect, Health Market Science (HMS)

The views expressed herein are my own and do not necessarily reflect the views of HMS or other
organizations mentioned.

Agenda
Big Data Orientation
Cassandra
Hadoop
SOLR
Storm

DEMO
Java/Ruby Interoperability
Advanced Ideas
Rails Integration
Combining Real-time w/ Batch Processing (The Final Frontier)

“Big” Data
Size doesn’t always matter; it may be
what you’re doing with it
e.g. Natural-Language Processing

Flexibility was our major motivator
Data sources with disparate schema

Decomposing the Problem
Data: Storage, Indexing, Querying
Processing: Distributed, Batch, Real-time

Relational Storage
ACID
Atomic: Everything in a transaction succeeds or the
entire transaction is rolled back.
Consistent: A transaction cannot leave the
database in an inconsistent state.
Isolated: Transactions cannot interfere with each
other.
Durable: Completed transactions persist, even when
servers restart, etc.

Relational Storage
Benefits: Data Integrity, Ubiquity
Limitations: Static Schemas, Scalability

NoSQL Storage
BASE
Basic Availability
Soft-state
Eventual consistency

Simple API
REST + JSON

Indexing
Real-time Answers
Full-text queries
Fuzzy Searching

Nickname analysis
Geospatial and Temporal Search

Why?
Cassandra
Consistency-level per operation
Temporal dimension of an operation
Idempotent mentality

SOLR
Community
Integration (Solandra)
NOT scalability and flexibility (sharding stinks)

Cassandra’s Data Model
Keyspaces

Column Families
Rows
(Sorted by KEY!)

Columns
(Name : Value)

Example
BeerGuys (Keyspace)
  Users (Column Family)
    bonedog (Row)
      firstName : Brian
      lastName : O’Neill
    lisa (Row)
      firstName : Lisa
      lastName : O’Neill
      maidenName : Kelley
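
The hierarchy above can be pictured as nested hashes. This is only a sketch of the shape of the data in plain Ruby, not a Cassandra client; the variable `beer_guys` and the helper `sorted_row_keys` are names made up for illustration.

```ruby
# Keyspace -> column family -> row key -> { column name => value }.
# Plain hashes stand in for Cassandra here; nothing talks to a cluster.
beer_guys = {
  "Users" => {
    "lisa"    => { "firstName" => "Lisa",  "lastName" => "O'Neill", "maidenName" => "Kelley" },
    "bonedog" => { "firstName" => "Brian", "lastName" => "O'Neill" }
  }
}

# Hypothetical helper: returns the row keys in key order, as the slide describes.
def sorted_row_keys(column_family)
  column_family.keys.sort
end

row_keys = sorted_row_keys(beer_guys["Users"])

# Rows in the same column family need not share the same columns.
lisa_maiden_name    = beer_guys["Users"]["lisa"]["maidenName"]
bonedog_maiden_name = beer_guys["Users"]["bonedog"]["maidenName"]
```

Note that `bonedog` simply has no `maidenName` column, which is the schema flexibility the earlier slides point to.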

Cassandra Architecture
Ring Architecture
Hash(key) -> Node
Reliability
Scalability
[Diagram: a client connected to a ring of three nodes, A (N-Z), F (A-F), and M (G-M)]
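
A toy version of the key-to-node lookup on this ring might look like the following. It is a simplification under stated assumptions: the first letter of the key stands in for the hash, the ranges are the ones drawn on the slide, and `RING` and `node_for` are illustrative names rather than anything from Cassandra itself.

```ruby
# Node name => range of tokens it owns, as drawn on the slide.
RING = {
  "A" => ("N".."Z"),
  "F" => ("A".."F"),
  "M" => ("G".."M")
}.freeze

# Stand-in for Hash(key) -> Node: the "hash" is just the key's first letter.
# A real cluster hashes the whole key to a token instead.
def node_for(key)
  token = key[0].upcase
  node, _range = RING.find { |_node, range| range.cover?(token) }
  node
end
```

With these ranges, `node_for("bonedog")` lands on F and `node_for("lisa")` lands on M. Adding a node means splitting one range, which is why the ring scales without reshuffling every key.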

Why NoSQL for us?
Flexibility
A new data processing paradigm
Instead of:
Data -> Processing
Do this:
Processing -> Data

Batch Processing
Distributable
Scalable
Data Locality
[Diagram: a JOB sent to the DATA on a ring of nodes A (T-A), S (I-R), and H (B-G), backed by HDFS]

Map / Reduce
tuple = (key, value)
map(x) -> tuple[]
reduce(key, value[]) -> tuple[]

Word Count
The Code
def map(doc)
  doc.split.each do |word|
    emit(word, 1)
  end
end

def reduce(key, values)
  sum = values.inject { |acc, x| acc + x }
  emit(key, sum)
end

The Run
doc1 = "boy meets girl"
doc2 = "girl likes boy"
map (doc1) -> (boy, 1), (meets, 1), (girl, 1)
map (doc2) -> (girl, 1), (likes, 1), (boy, 1)
reduce (boy, [1, 1]) -> (boy, 2)
reduce (girl, [1, 1]) -> (girl, 2)
reduce (likes, [1]) -> (likes, 1)
reduce (meets, [1]) -> (meets, 1)
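
The slide's pseudocode relies on an `emit` that the framework supplies. A minimal in-memory sketch of the same flow, with the grouping step between map and reduce written out, could look like this. The method names `word_count_map`, `word_count_reduce`, and `run_word_count` are invented for the example, and nothing here uses Hadoop.

```ruby
# Map: one document in, a list of (word, 1) tuples out.
def word_count_map(doc)
  doc.split.map { |word| [word, 1] }
end

# Reduce: one key and all of its values in, a single (key, total) tuple out.
def word_count_reduce(key, values)
  [key, values.sum]
end

# Stand-in for the framework: map every document, group tuples by key,
# then reduce each group.
def run_word_count(docs)
  grouped = docs.flat_map { |doc| word_count_map(doc) }
                .group_by { |word, _count| word }
  grouped.map { |word, tuples| word_count_reduce(word, tuples.map { |_word, count| count }) }.to_h
end

counts = run_word_count(["boy meets girl", "girl likes boy"])
```

Here `counts` holds boy and girl at 2, and meets and likes at 1, matching the run on the slide.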

Queries / Flows

Hive
Pig
Cascading

Real-time Processing
Deals with data streams
Storm
[Diagram: spouts emit tuples to bolts; bolts process them and emit new tuples to downstream bolts]
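
The spout-to-bolt flow can be imitated in a few lines of plain Ruby. This is a rough stand-in, not the Storm API: an enumerator plays the spout, lambdas play the bolts, and all names (`sentence_spout`, `split_bolt`, `count_bolt`) are made up for illustration. Real streams are unbounded, whereas this one ends.

```ruby
# Spout: a source of tuples (a finite list standing in for a stream).
sentence_spout = ["boy meets girl", "girl likes boy"].each

# Bolt: takes one tuple in, emits zero or more tuples out.
split_bolt = ->(sentence) { sentence.split }

# Stateful bolt: keeps a running count and emits the updated total per word.
counts = Hash.new(0)
count_bolt = lambda do |word|
  counts[word] += 1
  [[word, counts[word]]]
end

# Wire the topology by hand: spout -> split_bolt -> count_bolt.
emitted = []
sentence_spout.each do |sentence|
  split_bolt.call(sentence).each do |word|
    emitted.concat(count_bolt.call(word))
  end
end
```

Unlike the batch word count, totals are updated as each tuple arrives, so a consumer sees running counts rather than one final answer.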

Putting it Together
[Diagram: Storm running alongside the ring of nodes A (T-A), S (I-R), and H (B-G)]

But...
We love Ruby!
and it’s all in Java. :(

That’s okay,
because
We love REST!

REST Layer
CRUD via HTTP
Map/Reduce via HTTP
[Diagram: a client calling the REST layer in front of the ring (nodes A, S, H) and Storm]

Java Interoperability
Conventional Interoperability
I/O Streams between processes

Hadoop Streaming
Storm Multilang

CRUD via HTTP
http://virgil/data/{keyspace}/{columnFamily}/{column}/{row}
PUT : Replaces Content of Row/Column
GET : Retrieves Value of a Row/Column
DELETE : Removes Value of a Row/Column
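
From Ruby, the three verbs could be issued with the standard library's `Net::HTTP`. The sketch below only builds the requests, so it runs without a server. The host `virgil` and the path layout are copied from the URL template on this slide; the helpers `virgil_uri`, `build_request`, and `send_request` are hypothetical names, and the keyspace and column values are borrowed from the earlier example.

```ruby
require "net/http"
require "uri"

# Fills in the slide's URL template, keeping its segment order.
def virgil_uri(keyspace, column_family, column, row)
  URI("http://virgil/data/#{keyspace}/#{column_family}/#{column}/#{row}")
end

# Builds (but does not send) a PUT, GET, or DELETE request for the URI.
def build_request(verb, uri, body = nil)
  request_class = {
    put:    Net::HTTP::Put,
    get:    Net::HTTP::Get,
    delete: Net::HTTP::Delete
  }.fetch(verb)
  request = request_class.new(uri)
  request.body = body if body
  request
end

# Sends a built request; only useful against a reachable server.
def send_request(request)
  uri = request.uri
  Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
end
```

For example, `build_request(:put, virgil_uri("BeerGuys", "Users", "firstName", "bonedog"), "Brian")` prepares the write that `send_request` would then deliver.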

[Diagram: curl issuing HTTP requests against the ring (nodes A, S, H)]

Map/Reduce over HTTP
wordcount.rb
def map(rowKey, columns)
  result = []
  columns.each do |column_name, value|
    words = value.split
    words.each do |word|
      result << [word, "1"]
    end
  end
  return result
end

def reduce(key, values)
  rows = {}
  total = 0
  columns = {}
  values.each do |value|
    total += value.to_i
  end
  columns["count"] = total.to_s
  rows[key] = columns
  return rows
end

CF in CF out
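
To see what "CF in, CF out" means for wordcount.rb, the script can be driven locally over an in-memory column family. This is only a sketch of what the server side presumably does: the `WordCount` module wraps compact copies of the slide's two functions, and `run_job` plus the sample rows are assumptions made for the example.

```ruby
# Compact copies of the map and reduce functions from wordcount.rb.
module WordCount
  def self.map(_row_key, columns)
    columns.values.flat_map { |value| value.split.map { |word| [word, "1"] } }
  end

  def self.reduce(key, values)
    { key => { "count" => values.sum(&:to_i).to_s } }
  end
end

# Hypothetical driver: maps every input row, groups values by key,
# reduces each group, and merges the resulting rows into an output CF.
def run_job(input_cf)
  grouped = Hash.new { |hash, key| hash[key] = [] }
  input_cf.each do |row_key, columns|
    WordCount.map(row_key, columns).each { |word, count| grouped[word] << count }
  end
  grouped.each_with_object({}) do |(word, counts), output_cf|
    output_cf.merge!(WordCount.reduce(word, counts))
  end
end

input_cf = {
  "doc1" => { "text" => "boy meets girl" },
  "doc2" => { "text" => "girl likes boy" }
}
output_cf = run_job(input_cf)
```

Each word becomes a row key in the output column family, with a single `count` column holding the total as a string.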

Better?
Use JRuby
Single Process
Parse Once / Eval Many

JSR 223
ScriptEngine ENGINE = new ScriptEngineManager().getEngineByName("jruby");
ScriptContext context = new SimpleScriptContext();
Bindings bindings = context.getBindings(ScriptContext.ENGINE_SCOPE);
bindings.put("variable", "value");
ENGINE.eval(script, context);

Redbridge
this.rubyContainer = new ScriptingContainer(LocalContextScope.CONCURRENT);
this.rubyReceiver = rubyContainer.runScriptlet(script);
this.rubyContainer.callMethod(rubyReceiver, "foo", "value");
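
The "Parse Once / Eval Many" idea can be tried in plain Ruby: evaluate a script one time, keep the object it returns, then call a method on that object repeatedly. This mirrors the receiver pattern in the Redbridge snippet but is not JRuby embedding code; the `Handler` class and the script contents are invented for illustration.

```ruby
# A script whose last expression is the object the host keeps as its receiver.
# The quoted heredoc tag prevents interpolation while the script is defined.
script = <<~'RUBY'
  class Handler
    def foo(value)
      "handled #{value}"
    end
  end
  Handler.new
RUBY

# Parse and evaluate once.
receiver = eval(script, TOPLEVEL_BINDING)

# Call many times without re-parsing the script.
results = %w[a b c].map { |value| receiver.public_send(:foo, value) }
```

The saving comes from skipping the parse step on every call, which matters when the method is invoked once per row or per tuple.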

Rails Integration
[Diagram: a load balancer in front of the ring (nodes A, S, H), which handles both data and processing]
“REST is the new JDBC”
ActiveRecord backed by REST?
Anything more than a proxy?

Ratch Processing
(Combining Real-time and Batch)

Data Flows as:
Cascading Map/Reduce jobs
Storm Topologies?

Can’t we have one framework to rule
them all?
