Buzzwords Berlin HBase Hackathon, June 2012
Apache Flume and HBase
Alexander Alten-Lorenz | Customer Operations Engineer



About Me

    •   COPS Engineer @ Cloudera
    •   Apache Flume Contributor
    •   Working with Hadoop since 2009
    •   Blogger (mapredit.blogspot.com)
    •   Speaker at Conferences / Meetups /
        Tooling Events



©2012 Cloudera, Inc. All Rights Reserved.


Flume 1.x

    • Mass event collector
    • Streams data (events, not files) from clients
      to sinks
    • Clients: files, syslog, avro, seq, exec
    • Sinks: HDFS files, HBase, …
    • Configurable routing / topology




Architecture
    Component   Function

    Agent       The JVM running Flume. One per machine. Runs
                many sources and sinks.
    Client      Produces data in the form of events. Runs in a
                separate thread.
    Sink        Receives events from a channel. Runs in a separate
                thread.
    Channel     Connects sources to sinks (like a queue).
                Implements the reliability semantics.
    Event       A single datum: a log record, an Avro object, etc.
                Normally around 4 KB.
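
The component table above can be pictured as a small producer/consumer pipeline. The following is a toy model only, assuming a bounded in-memory queue stands in for the channel; the names `source`, `sink`, `run_pipeline`, and `STOP` are hypothetical and are not part of the Flume API.

```python
import queue
import threading

STOP = object()  # sentinel telling the sink that the source is finished


def source(channel, records):
    """Toy source: wraps each record in an event and puts it on the channel."""
    for record in records:
        channel.put({"headers": {}, "body": record.encode("utf-8")})
    channel.put(STOP)


def sink(channel, store):
    """Toy sink: drains events from the channel into `store`."""
    while True:
        event = channel.get()
        if event is STOP:
            break
        store.append(event["body"])


def run_pipeline(records):
    """Run one source and one sink, each in its own thread, joined by a queue."""
    channel = queue.Queue(maxsize=100)  # bounded, so a slow sink blocks the source
    store = []
    threads = [
        threading.Thread(target=source, args=(channel, records)),
        threading.Thread(target=sink, args=(channel, store)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return store
```

Calling `run_pipeline(["a", "b"])` returns the event bodies in the order the source produced them. The queue is the only thing the two threads share, which mirrors the role the table gives to the channel.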




Agent

    • Runs many clients and sinks
    • Java properties-based configuration
    • Low overhead (-Xmx20m)
      – adding RAM increases performance
      – setting -Xms avoids just-in-time memory allocation
      – batching increases performance dramatically
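
One way to see why batching helps: if every transaction carries a roughly fixed overhead (an assumption, not a measurement from these slides), larger batches mean fewer transactions for the same number of events. The helpers below are a hypothetical illustration and not Flume code.

```python
import math


def count_transactions(num_events, batch_size):
    """Number of transactions needed to deliver num_events in batches."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return math.ceil(num_events / batch_size)


def batches(events, batch_size):
    """Yield successive lists of at most batch_size events."""
    for start in range(0, len(events), batch_size):
        yield events[start:start + batch_size]
```

Under this model, 10,000 events need 10,000 transactions at a batch size of 1 but only 100 at a batch size of 100.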




Sources

    • Plugin interface
    • Managed by a SourceRunner that controls
      threading and execution model (e.g. polling
      vs. event-based)
    • Included: exec, avro, syslog, seq




HBase sink
    ls -la flume-ng-sinks/flume-ng-hbase-sink/src/main/java/org/apache/flume/sink/hbase/

    HBaseSink.java
    HbaseEventSerializer.java
    SimpleHbaseEventSerializer.java
    SimpleRowKeyGenerator.java




HBaseSink.java


•   Controls flush()
•   Uses the serializer
•   Controls the transaction
•   Controls rollbacks (in case events couldn’t be
    written)
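
The transaction and rollback behaviour listed above can be sketched with a toy channel. This is a simplified model under stated assumptions: `ToyChannel` and `process_batch` are hypothetical names, and the `write` callback stands in for serializing and flushing events to the store. It is not the actual HBaseSink implementation.

```python
from collections import deque


class ToyChannel:
    """In-memory stand-in for a channel with take/commit/rollback."""

    def __init__(self, events):
        self._queue = deque(events)
        self._taken = []  # events handed out in the open transaction

    def take(self):
        if not self._queue:
            return None
        event = self._queue.popleft()
        self._taken.append(event)
        return event

    def commit(self):
        self._taken.clear()  # taken events are gone for good

    def rollback(self):
        # Put taken events back at the head, preserving their original order.
        self._queue.extendleft(reversed(self._taken))
        self._taken.clear()

    def __len__(self):
        return len(self._queue)


def process_batch(channel, write, batch_size):
    """Take up to batch_size events, write them, then commit or roll back."""
    batch = []
    for _ in range(batch_size):
        event = channel.take()
        if event is None:
            break
        batch.append(event)
    try:
        write(batch)
    except Exception:
        channel.rollback()  # write failed: the events stay in the channel
        return False
    channel.commit()
    return True
```

If `write` raises, the channel still holds every event, so a later call can retry the same batch.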




Configuration


    •   Source Seq interface
    •   Listening on a defined port @localhost
    •   Serializer needs some parameters
    •   Column family and column must be known
    •   Valid hbase-site.xml in $CLASSPATH
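
To illustrate why the serializer needs the column family and column up front, here is a minimal sketch of a serializer that maps one event body to a single cell. The function names, the dict layout, and the counter-based row key are assumptions made for illustration and do not reproduce SimpleHbaseEventSerializer or SimpleRowKeyGenerator.

```python
import itertools

_counter = itertools.count(1)  # process-local counter used for toy row keys


def make_row_key(prefix):
    """Hypothetical row-key scheme: a prefix plus an increasing number."""
    return f"{prefix}{next(_counter)}".encode("utf-8")


def serialize(event_body, column_family, payload_column, row_prefix="row"):
    """Turn one event body into a put-like dict describing a single cell."""
    if not column_family or not payload_column:
        raise ValueError("column family and column must be known")
    return {
        "row": make_row_key(row_prefix),
        "family": column_family.encode("utf-8"),
        "qualifier": payload_column.encode("utf-8"),
        "value": event_body,
    }
```

Without a family and a column there is no cell address to write to, which is why the sketch refuses to build a put in that case.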



Configuration Example
host1.sources = src1
host1.sinks = sink1
host1.channels = ch1

host1.sources.src1.type = seq
host1.sources.src1.port = 25001
host1.sources.src1.bind = localhost
host1.sources.src1.channels = ch1
host1.sinks.sink1.type = org.apache.flume.sink.hbase.HBaseSink
host1.sinks.sink1.channel = ch1
host1.sinks.sink1.table = test3
host1.sinks.sink1.columnFamily = testing
host1.sinks.sink1.column = foo
host1.sinks.sink1.serializer = org.apache.flume.sink.hbase.SimpleHbaseEventSerializer
host1.sinks.sink1.serializer.payloadColumn = pcol
host1.sinks.sink1.serializer.incrementColumn = icol
host1.channels.ch1.type = memory
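
A common mistake in a configuration like the one above is a source or sink that points at a channel the agent never declares. The sketch below is a hypothetical helper, not a Flume tool. It parses simple `key = value` lines and checks the wiring for one agent, relying only on the key layout visible in the example (`channels` for sources, `channel` for sinks).

```python
def parse_properties(text):
    """Parse simple 'key = value' lines; skips blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def check_wiring(props, agent):
    """Return a list of problems with source/sink-to-channel wiring."""
    problems = []
    channels = set(props.get(f"{agent}.channels", "").split())
    for source in props.get(f"{agent}.sources", "").split():
        used = props.get(f"{agent}.sources.{source}.channels", "").split()
        if not used:
            problems.append(f"source {source} has no channels")
        for ch in used:
            if ch not in channels:
                problems.append(f"source {source} uses undeclared channel {ch!r}")
    for sink in props.get(f"{agent}.sinks", "").split():
        ch = props.get(f"{agent}.sinks.{sink}.channel", "")
        if ch not in channels:
            problems.append(f"sink {sink} uses undeclared channel {ch!r}")
    return problems


# Trimmed copy of the wiring-related lines from the example configuration.
CONFIG = """\
host1.sources = src1
host1.sinks = sink1
host1.channels = ch1
host1.sources.src1.type = seq
host1.sources.src1.channels = ch1
host1.sinks.sink1.type = org.apache.flume.sink.hbase.HBaseSink
host1.sinks.sink1.channel = ch1
host1.channels.ch1.type = memory
"""
```

`check_wiring(parse_properties(CONFIG), "host1")` returns an empty list for the example, and reports the sink if its `channel` is changed to a name that is not listed under `host1.channels`.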


Take Away


 •   Flume collects events
 •   Source - Channel - Sink concept
 •   HBase sink needs a serializer interface
 •   Column family and column must be known




Thank You

 • Web: https://cwiki.apache.org/FLUME/getting-started.html
 • ML: flume-user@incubator.apache.org

 • Mail: alexander@cloudera.com
 • Blog: mapredit.blogspot.com
 • Twitter: @mapredit

