Ruby on Big Data
                 Brian O’Neill
Lead Architect, Health Market Science (HMS)



The views expressed herein are my own and do not necessarily reflect the views of HMS or other organizations mentioned.
Agenda
Big Data Orientation
  Cassandra
  Hadoop
  SOLR
  Storm

DEMO
Java/Ruby Interoperability
Advanced Ideas
  Rails Integration
  Combining Real-time w/ Batch Processing (The Final Frontier)
“Big” Data
Size doesn’t always matter; it may be
what you’re doing with it
 e.g. Natural-Language Processing

Flexibility was our major motivator
 Data sources with disparate schema
Decomposing the Problem
Data         Processing
 Storage      Distributed
 Indexing     Batch
 Querying     Real-time
Relational Storage
ACID
 Atomic: Everything in a transaction succeeds or the
 entire transaction is rolled back.
 Consistent: A transaction cannot leave the
 database in an inconsistent state.
 Isolated: Transactions cannot interfere with each
 other.
 Durable: Completed transactions persist, even when
 servers restart, etc.
Relational Storage
Benefits          Limitations
 Data Integrity   Static Schemas

 Ubiquity         Scalability
NoSQL Storage
BASE
 Basic Availability
 Soft-state
 Eventual consistency

Simple API
 REST + JSON
Indexing
Real-time Answers
Full-text queries
 Fuzzy Searching

Nickname analysis
Geospatial and Temporal Search
Storage Options
Indexing Options
Why?
Cassandra
 Consistency-level per operation
 Temporal dimension of an operation
 Idempotent mentality

SOLR
 Community
 Integration (Solandra)
   NOT scalability and flexibility (sharding stinks)
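One way to read "temporal dimension" and "idempotent mentality" together: every write carries a timestamp, the newest timestamp wins, and so replaying a write is harmless. The sketch below illustrates that idea in plain Ruby. It is not a Cassandra client; the class name `TimestampedStore` is hypothetical, and treating equal timestamps as "apply again" is a simplifying assumption.

```ruby
# Hypothetical in-memory stand-in for a column store (not a Cassandra client).
class TimestampedStore
  def initialize
    @cells = {} # [row, column] => [value, timestamp]
  end

  # Apply a write only if it is at least as new as what is already stored.
  def write(row, column, value, timestamp)
    current = @cells[[row, column]]
    if current.nil? || timestamp >= current[1]
      @cells[[row, column]] = [value, timestamp]
    end
    self
  end

  def read(row, column)
    cell = @cells[[row, column]]
    cell && cell[0]
  end
end

store = TimestampedStore.new
store.write("bonedog", "firstName", "Brian", 100)
store.write("bonedog", "firstName", "Bryan", 90)  # older write arrives late: ignored
store.write("bonedog", "firstName", "Brian", 100) # replayed write: no change
puts store.read("bonedog", "firstName")
```

Because the outcome depends only on the timestamps, the three writes can arrive in any order, or more than once, and the stored value is the same.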
Cassandra’s Data Model
   Keyspaces

     Column Families
                 Rows
               (Sorted by KEY!)


                       Columns
                         (Name : Value)
Example
BeerGuys (Keyspace)
  Users (Column Family)
     bonedog (Row)
        firstName : Brian
        lastName : O’Neill
     lisa (Row)
        firstName : Lisa
        lastName : O’Neill
        maidenName : Kelley
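The example above can be pictured as nested hashes. This is only a mental model in plain Ruby, not how Cassandra stores data; it shows that rows in the same column family need not share columns, and it mimics the slide's "sorted by key" note by sorting the row keys.

```ruby
# Keyspace -> column family -> row key -> { column name => value }
beer_guys = {
  "Users" => {
    "lisa"    => { "firstName" => "Lisa", "lastName" => "O'Neill", "maidenName" => "Kelley" },
    "bonedog" => { "firstName" => "Brian", "lastName" => "O'Neill" }
  }
}

users = beer_guys["Users"]
sorted_keys = users.keys.sort # rows are presented in key order

sorted_keys.each do |row_key|
  columns = users[row_key].map { |name, value| "#{name}: #{value}" }.join(", ")
  puts "#{row_key} => #{columns}"
end
```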
Cassandra Architecture
 Ring Architecture
  Hash(key) -> Node
  Reliability
  Scalability
 [Diagram: a Client connected to a ring of three nodes: A (N-Z), F (A-F), M (G-M)]
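A toy version of Hash(key) -> Node, using the letter ranges from the ring diagram. The "hash" here is just the first letter of the key, which is an assumption made to match the diagram; a real cluster hashes the whole key. The names `RING`, `token_for`, and `node_for` are hypothetical.

```ruby
# Upper bound of each node's range, in ring order:
# F owns A-F, M owns G-M, A owns N-Z.
RING = [["F", "node F"], ["M", "node M"], ["Z", "node A"]].freeze

# Toy token: the first letter of the key, upcased. Stand-in for a real hash function.
def token_for(key)
  key[0].upcase
end

# Walk the ring and pick the first node whose upper bound covers the token.
def node_for(key)
  token = token_for(key)
  owner = RING.find { |upper_bound, _node| token <= upper_bound }
  # Anything past the last bound wraps around to the first node.
  (owner || RING.first)[1]
end

%w[bonedog lisa zed].each { |key| puts "#{key} -> #{node_for(key)}" }
```

Any client can compute the owner of a key locally, which is why no central lookup service is needed.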
Why NoSQL for us?
Flexibility
A new data processing paradigm
  Instead of: Data -> Processing (move the data to the processing)
  Do this: Processing -> Data (move the processing to the data)
Batch Processing
 Distributable
 Scalable
 Data Locality
 [Diagram: a JOB and its DATA distributed across HDFS nodes A (T-A), S (I-R), H (B-G)]
Map / Reduce
tuple = (key, value)
map(x) -> tuple[]
reduce(key, value[]) -> tuple[]
Word Count
The Code
def map(doc)
  doc.split.each do |word|
    emit(word, 1)
  end
end

def reduce(key, values)
  sum = values.inject(0) { |acc, x| acc + x }
  emit(key, sum)
end

The Run
doc1 = "boy meets girl"
doc2 = "girl likes boy"

map(doc1) -> (boy, 1), (meets, 1), (girl, 1)
map(doc2) -> (girl, 1), (likes, 1), (boy, 1)

reduce(boy, [1, 1]) -> (boy, 2)
reduce(girl, [1, 1]) -> (girl, 2)
reduce(likes, [1]) -> (likes, 1)
reduce(meets, [1]) -> (meets, 1)
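The slide's code relies on an `emit` supplied by the framework. To try the same word count without Hadoop, here is a minimal in-memory sketch in which the functions return their tuples and a small driver does the grouping step (the shuffle) between map and reduce. The names `map_doc`, `reduce_counts`, and `run_word_count` are hypothetical.

```ruby
# Map: one document in, a list of (word, 1) tuples out.
def map_doc(doc)
  doc.split.map { |word| [word, 1] }
end

# Reduce: one key and all of its values in, one (key, total) tuple out.
def reduce_counts(key, values)
  [key, values.sum]
end

# Driver: map every document, group values by key (the shuffle), reduce each group.
def run_word_count(docs)
  grouped = Hash.new { |hash, key| hash[key] = [] }
  docs.each do |doc|
    map_doc(doc).each { |key, value| grouped[key] << value }
  end
  grouped.map { |key, values| reduce_counts(key, values) }.to_h
end

counts = run_word_count(["boy meets girl", "girl likes boy"])
p counts
```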
Queries / Flows


 Hive
 Pig
 Cascading
Real-time Processing
Deals with data streams
 Storm
 [Diagram: a Storm topology in which Spouts emit tuples to Bolts, and Bolts emit tuples to further Bolts]
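The spout and bolt roles can be sketched in plain Ruby without Storm. This is only an illustration of the data flow, not Storm's API: a spout is a source of tuples, and each bolt consumes a tuple and emits zero or more new ones. The variable names are hypothetical, and a finite array stands in for an unbounded stream.

```ruby
# Spout: a source of tuples (finite here; a real stream never ends).
sentence_spout = ["boy meets girl", "girl likes boy"].each

# Bolt: splits one sentence tuple into several word tuples.
split_bolt = ->(sentence) { sentence.split }

# Stateful bolt: keeps a running count per word and emits the updated pair.
counts = Hash.new(0)
count_bolt = lambda do |word|
  counts[word] += 1
  [word, counts[word]]
end

# Wire the topology: spout -> split bolt -> count bolt.
emitted = []
sentence_spout.each do |sentence|
  split_bolt.call(sentence).each do |word|
    emitted << count_bolt.call(word)
  end
end

emitted.each { |word, count| puts "#{word}: #{count}" }
```

Unlike the batch word count, a result is emitted for every incoming tuple, so counts are always current but never final.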
Putting it Together
 [Diagram: Storm at the center of the ring of nodes A (T-A), S (I-R), H (B-G)]
But...
We love Ruby!
 and it’s all in Java. :(


That’s okay,
  because
We love REST!
REST Layer
 CRUD via HTTP
 Map/Reduce via HTTP
 [Diagram: a Client talking over HTTP to the ring (nodes A, S, H), with Storm at its center]
DEMO
Java Interoperability
Conventional Interoperability
 I/O Streams between processes



Hadoop Streaming
Storm Multilang
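With Hadoop Streaming, the mapper and reducer are ordinary programs that read records on standard input and write tab-separated key/value lines to standard output. A minimal Ruby sketch of that contract follows. The functions take any IO object so the pipeline can be simulated locally with `StringIO`; the function names are hypothetical, and the local `sort` stands in for Hadoop's shuffle.

```ruby
require "stringio"

# Mapper: read lines, emit "word<TAB>1" for every word.
def stream_map(input, output)
  input.each_line do |line|
    line.split.each { |word| output.puts "#{word}\t1" }
  end
end

# Reducer: read "word<TAB>count" lines and emit one total per word.
def stream_reduce(input, output)
  totals = Hash.new(0)
  input.each_line do |line|
    word, count = line.chomp.split("\t")
    totals[word] += count.to_i
  end
  totals.each { |word, total| output.puts "#{word}\t#{total}" }
end

# Simulate the pipeline locally: map, sort (the shuffle), reduce.
mapped = StringIO.new
stream_map(StringIO.new("boy meets girl\ngirl likes boy\n"), mapped)
shuffled = mapped.string.lines.sort.join
reduced = StringIO.new
stream_reduce(StringIO.new(shuffled), reduced)
puts reduced.string
```

In a real job each function would live in its own script and be called with `$stdin` and `$stdout`. The cost is the one the next slides address: every record crosses a process boundary as text.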
CRUD via HTTP
http://virgil/data/{keyspace}/{columnFamily}/{column}/{row}
 PUT : Replaces Content of Row/Column
 GET : Retrieves Value of a Row/Column
 DELETE : Removes Value of a Row/Column
 [Diagram: curl sending requests to the ring (nodes A, S, H)]
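The same three verbs can be issued from Ruby's standard library instead of curl. The sketch below only builds the requests, so it runs without a server. The path order follows the URL template on the slide, the host `virgil` is a placeholder, the helper `cell_uri` is hypothetical, and the plain-string request body is an assumption because the slide does not say what format the server expects.

```ruby
require "net/http"
require "uri"

# Base URL from the slide; "virgil" is a placeholder host.
BASE_URL = "http://virgil/data"

# Build the URL for one cell, in the order given by the slide's template.
def cell_uri(keyspace, column_family, column, row)
  URI("#{BASE_URL}/#{keyspace}/#{column_family}/#{column}/#{row}")
end

uri = cell_uri("BeerGuys", "Users", "firstName", "bonedog")

put = Net::HTTP::Put.new(uri)
put.body = "Brian" # body format is an assumption
get = Net::HTTP::Get.new(uri)
delete = Net::HTTP::Delete.new(uri)

[put, get, delete].each { |request| puts "#{request.method} #{request.path}" }

# To actually send one (requires a reachable server):
# response = Net::HTTP.start(uri.host, uri.port) { |http| http.request(put) }
```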
Map/Reduce over HTTP
wordcount.rb
def map(rowKey, columns)
    result = []
    columns.each do |column_name, value|
        words = value.split
        words.each do |word|
            result << [word, "1"]
        end
    end
    return result
end

def reduce(key, values)
    rows = {}
    total = 0
    columns = {}
    values.each do |value|
        total += value.to_i
    end
    columns["count"] = total.to_s
    rows[key] = columns
    return rows
end
[Diagram: curl, the ring (nodes A, S, H), and the flow CF in -> CF out]
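To see what shape the input and output column families take, the script's contract can be exercised locally. The sketch below uses condensed copies of the two functions under the hypothetical names `wc_map` and `wc_reduce` (same inputs and outputs as the slide's versions) and a tiny driver that stands in for the server. The sample rows and the column name `text` are made up for illustration.

```ruby
# Condensed copies of the slide's functions, same inputs and outputs.
def wc_map(_row_key, columns)
  columns.flat_map { |_name, value| value.split.map { |word| [word, "1"] } }
end

def wc_reduce(key, values)
  { key => { "count" => values.sum(&:to_i).to_s } }
end

# Input column family: row key => { column name => value }
cf_in = {
  "doc1" => { "text" => "boy meets girl" },
  "doc2" => { "text" => "girl likes boy" }
}

# Shuffle: group the mapped values by key.
grouped = Hash.new { |hash, key| hash[key] = [] }
cf_in.each do |row_key, columns|
  wc_map(row_key, columns).each { |word, one| grouped[word] << one }
end

# Output column family: merge the rows returned by each reduce call.
cf_out = {}
grouped.each { |word, values| cf_out.merge!(wc_reduce(word, values)) }

cf_out.each { |row_key, columns| puts "#{row_key}: #{columns["count"]}" }
```

Each reduce call returns rows keyed by word, so the output column family ends up with one row per word and a single `count` column.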
Better?
Use JRuby
 Single Process
 Parse Once / Eval Many

JSR 223
    // Look up the JRuby engine, bind a variable, and evaluate the script.
    ScriptEngine ENGINE = new ScriptEngineManager().getEngineByName("jruby");
    ScriptContext context = new SimpleScriptContext();
    Bindings bindings = context.getBindings(ScriptContext.ENGINE_SCOPE);
    bindings.put("variable", "value");
    ENGINE.eval(script, context);

Redbridge
    // Evaluate the script once, then call methods on the object it returns.
    this.rubyContainer = new ScriptingContainer(LocalContextScope.CONCURRENT);
    this.rubyReceiver = rubyContainer.runScriptlet(script);
    this.rubyContainer.callMethod(rubyReceiver, "foo", "value");
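Seen from the Ruby side, "Parse Once / Eval Many" means the script is evaluated a single time and its last expression becomes the receiver that the host calls repeatedly. The sketch below imitates that in plain Ruby, with `eval` playing the role of the container. The script contents and the `Greeter` class are hypothetical; only the method name `foo` comes from the slide.

```ruby
# The script a host would load; its last expression is the receiver.
script = <<~'RUBY'
  class Greeter
    def foo(value)
      "foo got #{value}"
    end
  end
  Greeter.new
RUBY

# Parse and evaluate once.
receiver = eval(script)

# Call into the already-loaded object many times without re-parsing.
results = %w[a b c].map { |value| receiver.foo(value) }
puts results
```

The script is parsed a single time and each subsequent call is an ordinary method dispatch, which is the saving over spawning a process or re-evaluating the source per record.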
Rails Integration
 [Diagram: a Load Balancer in front of the ring (nodes A, S, H), labeled Data and Processing]
“REST is the new JDBC”
ActiveRecord backed by REST?
Anything more than a proxy?
Ratch Processing
  (Combining Real-time and Batch)


Data Flows as:
 Cascading Map/Reduce jobs
 Storm Topologies?

Can’t we have one framework to rule
them all?
