The document discusses using Ruby for big data applications, including using Ruby with NoSQL databases like Cassandra and Hadoop for distributed storage and processing, and integrating Ruby with real-time streaming frameworks like Storm. It also covers using REST APIs to allow Ruby applications to interact with these big data systems and perform batch and real-time processing of data.
Ruby on Big Data (Cassandra + Hadoop)
1. Ruby on Big Data
Brian O’Neill
Lead Architect, Health Market Science (HMS)
The views expressed herein are my own and do not necessarily reflect the views of HMS or other
organizations mentioned.
2. Agenda
Big Data Orientation
Cassandra
Hadoop
SOLR
Storm
DEMO
Java/Ruby Interoperability
Advanced Ideas
Rails Integration
Combining Real-time w/ Batch Processing (The Final Frontier)
3. “Big” Data
Size doesn’t always matter; it may be
what you’re doing with it
e.g. Natural-Language Processing
Flexibility was our major motivator
Data sources with disparate schema
4. Decomposing the Problem
Data: Storage, Indexing, Querying
Processing: Distributed, Batch, Real-time
5. Relational Storage
ACID
Atomic: Everything in a transaction succeeds or the
entire transaction is rolled back.
Consistent: A transaction cannot leave the
database in an inconsistent state.
Isolated: Transactions cannot interfere with each
other.
Durable: Completed transactions persist, even when
servers restart, etc.
11. Why?
Cassandra
Consistency-level per operation
Temporal dimension of an operation
Idempotent mentality
SOLR
Community
Integration (Solandra)
NOT scalability and flexibility (sharding stinks)
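The "temporal dimension" and "idempotent mentality" bullets can be illustrated with a toy in-memory store. This is a sketch, not Cassandra client code: `ToyColumnStore` and its methods are hypothetical names. It assumes every column write carries a timestamp and the newest timestamp wins, so replaying a write has no further effect.

```ruby
# Toy in-memory model of timestamped column writes (hypothetical class,
# not a Cassandra client). The newest timestamp wins, so replaying a
# write leaves the stored state unchanged.
class ToyColumnStore
  def initialize
    @rows = Hash.new { |hash, key| hash[key] = {} }
  end

  # Store value under row_key/column unless an equal or newer write exists.
  def write(row_key, column, value, timestamp)
    current = @rows[row_key][column]
    return if current && current[:timestamp] >= timestamp
    @rows[row_key][column] = { value: value, timestamp: timestamp }
  end

  def read(row_key, column)
    cell = @rows[row_key][column]
    cell && cell[:value]
  end
end

store = ToyColumnStore.new
store.write("bonedog", "firstName", "Brian", 100)
store.write("bonedog", "firstName", "Brian", 100) # replayed write: no effect
store.write("bonedog", "firstName", "Bryan", 50)  # older write: ignored
puts store.read("bonedog", "firstName")
```

Because the outcome depends only on the timestamp and not on arrival order, a client in this model could safely retry a write it is unsure about.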
12. Cassandra’s Data Model
Keyspaces
Column Families
Rows
(Sorted by KEY!)
Columns
(Name : Value)
13. Example
BeerGuys (Keyspace)
Users (Column Family)
bonedog (Row)
firstName : Brian
lastName : O’Neill
lisa (Row)
firstName : Lisa
lastName : O’Neill
maidenName : Kelley
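The example above maps naturally onto nested hashes. The sketch below uses plain Ruby hashes as a stand-in for the data model only; it is not client code, and the variable name `beer_guys` is made up for illustration.

```ruby
# Keyspace -> column family -> row key -> { column name => value }.
# Plain Ruby hashes stand in for Cassandra here; this is not client code.
beer_guys = {
  "Users" => {
    "bonedog" => { "firstName" => "Brian", "lastName" => "O'Neill" },
    "lisa"    => { "firstName" => "Lisa", "lastName" => "O'Neill",
                   "maidenName" => "Kelley" }
  }
}

# Rows in the same column family need not share the same columns.
# Iterate in row-key order, as the data model slide describes.
beer_guys["Users"].sort_by { |row_key, _columns| row_key }.each do |row_key, columns|
  puts "#{row_key}: #{columns.keys.join(', ')}"
end
```

The `lisa` row carries a `maidenName` column that `bonedog` lacks, which is the kind of schema flexibility the earlier slides point to.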
14. Cassandra Architecture
Ring Architecture
Hash(key) -> Node
Reliability
Scalability
[Diagram: a Client connected to a ring of three nodes: A (N-Z), F (A-F), M (G-M)]
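The "Hash(key) -> Node" idea can be sketched as a toy token ring. `ToyRing` and the node names are hypothetical, and the sketch assumes each node owns the hash values up to and including its own token. Cassandra's real partitioners and replication are not modelled here.

```ruby
require "digest"

# Toy token ring (hypothetical class). Each node owns the range of hash
# values up to and including its token; values past the last token wrap
# around to the first node.
class ToyRing
  def initialize(node_names)
    # Place each node at a token derived from its name, sorted by token.
    @tokens = node_names.map { |name| [token_for(name), name] }.sort
  end

  # Find the first node whose token is at or after the key's token.
  def node_for(key)
    key_token = token_for(key)
    owner = @tokens.find { |token, _name| token >= key_token }
    (owner || @tokens.first).last
  end

  private

  # Hash(key) -> position on the ring (first 32 bits of the MD5 digest).
  def token_for(value)
    Digest::MD5.hexdigest(value)[0, 8].to_i(16)
  end
end

ring = ToyRing.new(%w[node-a node-f node-m])
%w[bonedog lisa].each do |row_key|
  puts "#{row_key} -> #{ring.node_for(row_key)}"
end
```

Any client that knows the token list can compute the owner of a key locally, which is why no central lookup is needed in this model.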
15. Why NoSQL for us?
Flexibility
A new data processing paradigm
Instead of:
Data -> Processing
Do this:
Processing -> Data
16. Batch Processing
Distributable
Scalable
Data Locality
[Diagram: a JOB and its DATA on an HDFS ring of three nodes: A (T-A), H (B-G), S (I-R)]
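27. Map/Reduce over HTTP
wordcount.rb
# Emit a [word, "1"] pair for every word in every column value of a row.
def map(rowKey, columns)
  result = []
  columns.each do |column_name, value|
    words = value.split
    words.each do |word|
      result << [word, "1"]
    end
  end
  return result
end
# Sum the counts for one word and return a single row keyed by that word.
def reduce(key, values)
  rows = {}
  total = 0
  columns = {}
  values.each do |value|
    total += value.to_i
  end
  columns["count"] = total.to_s
  rows[key] = columns
  return rows
end
[Diagram: curl sends the script to the ring (nodes A, S, H); CF in -> CF out]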
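To see what the two functions produce, they can be run locally over a small in-memory column family. The driver below is only a sketch: the sample rows and the map, shuffle, and reduce steps are assumptions about how such a job could be wired together, not the deck's actual server-side runner. The two functions are repeated in condensed form so the block runs on its own.

```ruby
# map and reduce as in wordcount.rb, condensed.
def map(row_key, columns)
  result = []
  columns.each do |_column_name, value|
    value.split.each { |word| result << [word, "1"] }
  end
  result
end

def reduce(key, values)
  total = 0
  values.each { |value| total += value.to_i }
  { key => { "count" => total.to_s } }
end

# Hypothetical input column family: row key => { column name => value }.
input_cf = {
  "doc1" => { "text" => "ruby on big data" },
  "doc2" => { "text" => "big data big ideas" }
}

# Map phase: run map over every row and collect the [word, "1"] pairs.
pairs = input_cf.flat_map { |row_key, columns| map(row_key, columns) }

# Shuffle phase: group the emitted values by key.
grouped = Hash.new { |hash, key| hash[key] = [] }
pairs.each { |word, count| grouped[word] << count }

# Reduce phase: one call per key, merged into the output column family.
output_cf = {}
grouped.each { |word, counts| output_cf.merge!(reduce(word, counts)) }

puts output_cf["big"]["count"]
```

Each word becomes a row key in the output column family with a single `count` column, matching the shape that `reduce` returns.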
28. Better?
Use JRuby
Single Process
Parse Once / Eval Many
JSR 223
ScriptEngine ENGINE = new ScriptEngineManager().getEngineByName("jruby");
ScriptContext context = new SimpleScriptContext();
Bindings bindings = context.getBindings(ScriptContext.ENGINE_SCOPE);
bindings.put("variable", "value");
ENGINE.eval(script, context);
Redbridge
this.rubyContainer = new ScriptingContainer(LocalContextScope.CONCURRENT);
this.rubyReceiver = rubyContainer.runScriptlet(script);
this.rubyContainer.callMethod(rubyReceiver, "foo", "value");
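The "Parse Once / Eval Many" idea can be mirrored in plain Ruby. The sketch below is an analogy, not the JRuby embedding API: it evaluates a script string once into a receiver object and then calls the resulting method repeatedly. The script text and the method name `foo` are hypothetical.

```ruby
# Hypothetical script text, as it might arrive from a client.
script = <<~RUBY
  def foo(value)
    "foo got " + value
  end
RUBY

# Parse once: evaluate the script a single time into a receiver object,
# which defines foo as a singleton method on that object.
receiver = Object.new
receiver.instance_eval(script)

# Eval many: invoke the already-defined method for each input.
results = %w[a b c].map { |value| receiver.send(:foo, value) }
puts results.inspect
```

The cost of parsing the script is paid once, and each later call is an ordinary method dispatch on the receiver.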
29. Rails Integration
[Diagram: a Load Balancer in front of the ring (nodes A, S, H), labelled Data and Processing]
“REST is the new JDBC”
ActiveRecord backed by REST?
Anything more than a proxy?
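One way to picture "ActiveRecord backed by REST" is a small model class that loads a row over HTTP and exposes its columns as methods. Everything here is an assumption for illustration: the `RestRow` class, the URL layout, the port, and the injected `http_get` callable are hypothetical, and the transport is stubbed so the sketch runs without a server.

```ruby
require "json"

# Hypothetical REST-backed model. The URL layout (base/keyspace/cf/row)
# and the injected http_get callable are assumptions for illustration.
class RestRow
  attr_reader :key, :columns

  def initialize(key, columns)
    @key = key
    @columns = columns
  end

  # Fetch one row as JSON and wrap it. http_get is any callable that takes
  # a URL string and returns the response body as a string.
  def self.find(base_url, keyspace, column_family, key, http_get)
    body = http_get.call("#{base_url}/#{keyspace}/#{column_family}/#{key}")
    new(key, JSON.parse(body))
  end

  # Expose columns as reader methods, loosely in the spirit of ActiveRecord.
  def method_missing(name, *args)
    return @columns[name.to_s] if args.empty? && @columns.key?(name.to_s)
    super
  end

  def respond_to_missing?(name, include_private = false)
    @columns.key?(name.to_s) || super
  end
end

# Stub transport so the sketch runs without a server; a real client could
# pass ->(url) { Net::HTTP.get(URI(url)) } after requiring "net/http".
fake_get = ->(_url) { '{"firstName":"Brian","lastName":"O\'Neill"}' }
user = RestRow.find("http://localhost:8080/data", "BeerGuys", "Users", "bonedog", fake_get)
puts user.firstName
```

A class like this is little more than a proxy, which is the question the slide raises: anything richer (validations, associations, queries) would need support from the REST layer itself.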
30. Ratch Processing
(Combining Real-time and Batch)
Data Flows as:
Cascading Map/Reduce jobs
Storm Topologies?
Can’t we have one framework to rule
them all?