Kafka Streaming Data

Kafka Technical Overview

In this post, we take a high-level look at the architecture of Apache Kafka, the role
ZooKeeper plays, and more.

by

Sylvester Daniel

Objective

In this article series, we will learn Kafka basics, Kafka delivery semantics and the configuration to
achieve different semantics, Spark Kafka integration, and optimization.

In Part 1 of this series, let's understand Kafka basics. In Part 2, we'll learn more about the
Kafka producer and its configuration.

Problem Statement
The following are some of the problem statements Kafka addresses:

- Many source and target systems need to be integrated. Integrating many systems generally
involves complexities such as dealing with many protocols, message formats, etc.
- Messaging systems must handle high-volume streams.

Use Cases

Some of the use cases include:

- Stream processing
- Tracking user activity, log aggregation, etc.
- Decoupling systems

What Is Kafka?

Kafka is a horizontally scalable, fault-tolerant, and fast messaging system.

It follows a pub-sub model in which various producers and consumers can write and read messages.

It decouples source and target systems.

Some of the key features are:

- Scales to hundreds of nodes.
- Can handle millions of messages per second.
- Real-time processing (~10 ms).

Key Terminologies
Topic, Partitions, and Offsets

A topic is a specific stream of data. It is very similar to a table in a NoSQL database.

Like tables in a NoSQL database, a topic is split into partitions that enable topics to be
distributed across various nodes.

Like primary keys in tables, topics have offsets per partition.

You can uniquely identify a message using its topic, partition, and offset.

Partitions
Partitions enable topics to be distributed across the cluster.

Partitions are a unit of parallelism for horizontal scalability.

One topic can have more than one partition scaling across nodes.

Messages are assigned to partitions based on partition keys;
if there is no partition key, the partition is assigned randomly.
It's important to use the correct key to avoid hotspots.

Each message in a partition is assigned an incremental id called an offset.

Offsets are unique per partition and messages are ordered only within a partition.

Messages written to partitions are immutable.

Kafka Architecture

The diagram below shows the architecture of Kafka.

ZooKeeper

ZooKeeper is a centralized service for managing distributed systems.

It offers a hierarchical key-value store, configuration, synchronization, and name-registry
services to the distributed systems it manages.

ZooKeeper acts as the ensemble layer (it ties things together) and ensures high availability of the
Kafka cluster.

Kafka nodes are also called brokers. It’s important to understand that Kafka cannot work without
ZooKeeper.

From the list of ZooKeeper nodes, one of the nodes is elected as a leader and the rest of the
nodes follow the leader.

In the case of a ZooKeeper node failure, one of the followers is elected as leader.

More than one node is strongly recommended for high availability and more than 7 is not
recommended.

ZooKeeper stores metadata and the current state of the Kafka cluster.
For example, details like topic names, the number of partitions, replication, leader details of
partitions, and consumer group details are stored in ZooKeeper.

You can think of ZooKeeper like a project manager who manages resources in the project and
remembers the state of the project.

Key things to remember:

- Manages the list of brokers.
- Elects broker leaders when a broker goes down.
- Sends notifications on a new broker, new topic, deleted topic, lost brokers, etc.
- From Kafka 0.10 on, consumer offsets are not stored in ZooKeeper; only the metadata of the
cluster is stored in ZooKeeper.
- The leader in ZooKeeper handles all writes, and the follower ZooKeeper nodes handle only reads.

Broker

A broker is a single Kafka node that is managed by ZooKeeper.

A set of brokers form a Kafka cluster.

Topics that are created in Kafka are distributed across brokers based on the partitions, replication,
and other factors.

When a broker node fails, based on the state stored in ZooKeeper, the cluster is
automatically rebalanced, and if a leader partition is lost then
one of the follower partitions is elected as the leader.
You can think of a broker as a team leader who takes care of the assigned tasks.

If a team lead isn’t available then the manager takes care of assigning tasks to other team
members.

Replication
Replication is making a copy of a partition available on another broker.

Replication enables Kafka to be fault tolerant.

When a partition of a topic is available in multiple brokers,
one of the partitions is elected as the leader and the rest of the
replicas of the partition are followers.
Replication enables Kafka to be fault tolerant even when a broker is down.

For example, Topic B partition 0 is stored in both broker 0 and broker 1.

Both producers and consumers are served only by the leader.

In case of a broker failure, the partition from another broker is elected as the leader
and it starts serving the producers and consumer groups.
Replica partitions that are in sync with the leader are flagged as ISR (In-Sync
Replica).

IT Team and Kafka Cluster Analogy

The diagram below depicts an analogy of an IT team and Kafka cluster.

Summary

Below is a summary of the core components in Kafka:

- ZooKeeper manages Kafka brokers and their metadata.
- Brokers are horizontally scalable Kafka nodes that contain topics and their replicas.
- Topics are message streams with one or more partitions.
- Partitions contain messages with unique offsets per partition.
- Replication enables Kafka to be fault tolerant using follower partitions.

Kafka Producer Overview

Published on May 5, 2019 by Sylvester Daniel

This article is a continuation of the Part 1 Kafka technical overview article. In Part 2 of the series,
let's look into the details of how the Kafka producer works and its important configurations.

Producer Role
The primary role of a Kafka producer is to take producer properties
and records as input and write them to an appropriate Kafka broker.
Producers serialize, partition, compress, and load-balance data across brokers based on
partitions.

Properties

Some of the producer properties are bootstrap.servers, acks, batch.size,
linger.ms, key.serializer, value.serializer, and many more.
We will discuss some of these properties later in this article.
Producer record
A message that should be written to Kafka is referred to as a producer record.

A producer record should have the name of the topic it should be written to and the value of the
record.

Other fields like partition, timestamp, and key are optional.
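As a minimal sketch (assuming string keys and values and a hypothetical topic named "myTopic"), a producer record can be created and sent like this:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092,broker2:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);

        // Topic and value are mandatory; the key is optional and drives partition assignment.
        ProducerRecord<String, String> record =
                new ProducerRecord<>("myTopic", "user-42", "some payload");

        producer.send(record);   // asynchronous send; returns a Future<RecordMetadata>
        producer.flush();
        producer.close();
    }
}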


Broker and metadata discovery
Bootstrap server

Any broker in a Kafka cluster can act as a bootstrap server.

Generally, a list of bootstrap servers is passed instead of just one server. At least two bootstrap
servers are recommended.

In order to send a producer record to the appropriate broker, the producer first establishes a
connection to one of the bootstrap servers.

The bootstrap server returns a list of all the brokers available in the cluster and all the
metadata details like topics, partitions, replication factor, and so on.

Based on the list of brokers and the metadata details, the producer identifies the leader
broker that hosts the leader partition of the producer record and writes to that broker.
Workflow
The diagram below shows the workflow of a producer.

The workflow of a producer involves five important steps:

1. Serialize
2. Partition
3. Compress
4. Accumulate records
5. Group by broker and send

Serialize
In this step, the producer record gets serialized based on the serializers passed to the producer.

Both the key and value are serialized based on the serializers passed. Some of the available serializers
are the string serializer, byte-array serializer, and ByteBuffer serializer.

Partition
In this step, the producer decides which partition of the topic the record should get written to.

By default, the murmur2 algorithm is used for partitioning.
The murmur2 algorithm generates a hash code based on the key passed, and the appropriate
partition is decided from that hash.

In case no key is passed, the partitions are chosen in a round-robin fashion.

It's important to understand that by passing the same key to a set of records, Kafka will ensure
that messages are written to the same partition in the order received, for a given number of
partitions.

If you want to retain the order of messages received, it's important to use an appropriate key for
the messages.

A custom partitioner can also be passed to the producer to control which partition a
message should be written to (see the sketch below).
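A minimal sketch of a custom partitioner, assuming a hypothetical rule that routes records whose key starts with "vip-" to partition 0 (this is an illustration, not the default Kafka partitioner):

import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

public class VipPartitioner implements Partitioner {

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        if (numPartitions == 1) {
            return 0;
        }
        // Hypothetical rule: keys starting with "vip-" always land on partition 0.
        if (key != null && key.toString().startsWith("vip-")) {
            return 0;
        }
        // Everything else is spread over the remaining partitions by hash.
        int hash = (key == null ? (value == null ? 0 : value.hashCode()) : key.hashCode());
        return 1 + (hash & Integer.MAX_VALUE) % (numPartitions - 1);
    }

    @Override
    public void close() { }

    @Override
    public void configure(Map<String, ?> configs) { }
}

The partitioner class is registered on the producer via the partitioner.class property.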

Compression
In this step, the producer record is compressed before it's written to the record accumulator. By
default, compression is not enabled in the Kafka producer.

The supported compression types are none, gzip, snappy, and lz4 (and, in newer versions, zstd),
set via the compression.type property.

Compression enables faster transfer, not only from producer to broker but also
during replication.

Compression helps achieve better throughput, lower latency, and better disk utilization.

Refer to http://blog.yaorenjie.com/2017/01/03/Kafka-0-10-Compression-Benchmark/ for
benchmark details.

Record accumulator
In this step, the records are accumulated in a buffer per partition of a topic.

Records are grouped into batches based on the producer batch.size property. Each partition of a topic
gets a separate accumulator/buffer.

Sender thread

In this step, the batches of partitions in the record accumulator are grouped by the broker to which
they are to be sent.

The records in a batch are sent to a broker based on the batch.size and linger.ms properties.

A batch is sent by the producer when either of two conditions is met:
when the defined batch size is reached or when the defined linger time is reached.

Duplicate message detection


Producers may send a duplicate message when a message was committed by Kafka but the
acknowledgment was never received by the producer due to network failure or other
issues.

From Kafka 0.11 on, to avoid duplicate messages in the
scenario stated earlier, Kafka tracks each message
based on its producer ID and sequence number.
When a duplicate message is received for a committed message with the same producer ID and
sequence number, Kafka treats it as a duplicate and does not
commit the message again, but it sends the acknowledgment back to the producer so the
producer can treat the message as sent.
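This duplicate detection is enabled on the producer side through the idempotence setting; a minimal sketch (the property names are real, the broker list is illustrative):

Properties props = new Properties();
props.put("bootstrap.servers", "broker1:9092,broker2:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

// Requires Kafka 0.11+; the broker de-duplicates retries using producer ID + sequence number.
props.put("enable.idempotence", "true");

KafkaProducer<String, String> producer = new KafkaProducer<>(props);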

A few other producer properties:

- buffer.memory – manages the buffer memory allocated to the producer.
- retries – the number of times to retry a message. The default is 0. Retries may cause out-of-order
messages.
- max.in.flight.requests.per.connection – the number of messages that can be sent without any
acknowledgment. The default is 5. Set this to 1 to avoid out-of-order messages due to retries.
- max.request.size – the maximum size of a message. The default is 1 MB.

Summary

Based on the producer workflow and producer properties, tune the configuration to achieve
the desired results. In particular, focus on the properties below (a small sketch follows the list).

- batch.size – batch size (in bytes) per request
- linger.ms – time to wait before sending the current batch
- compression.type – how messages are compressed
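A minimal sketch of these batching and compression settings (the values shown are illustrative, not recommendations):

Properties props = new Properties();
props.put("bootstrap.servers", "broker1:9092,broker2:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

props.put("batch.size", "32768");        // batch up to 32 KB per partition before sending
props.put("linger.ms", "20");            // wait up to 20 ms for a batch to fill
props.put("compression.type", "snappy"); // gzip, lz4 (and zstd on newer brokers) also work

KafkaProducer<String, String> producer = new KafkaProducer<>(props);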

In Part 3 of the series, let's understand Kafka producer delivery semantics and how to tune some
of the producer properties to achieve the desired results.

Kafka Producer Delivery Semantics

Published on May 10, 2019

This article is a continuation of the Part 1 Kafka technical overview and Part 2 Kafka producer
overview articles. Let's look into the different delivery semantics and how to achieve them using
producer and broker properties.

Delivery semantics
Based on broker and producer configuration, all three delivery semantics – "at most once", "at least
once", and "exactly once" – are supported.

At most once
In at-most-once delivery semantics, a message should be delivered at most once. It is
acceptable to lose a message rather than deliver it twice in this semantic.

A few use cases of at most once include metrics collection, log collection, and so on. Applications
adopting at-most-once semantics can easily achieve higher throughput and lower latency.
At least once
In at-least-once delivery semantics, it is acceptable to deliver a message more than once, but no
message should be lost.

The producer ensures that all messages are delivered for sure, even though it may result in
message duplication.

This is the most commonly preferred semantic of all. Applications adopting at-least-once semantics
may have moderate throughput and moderate latency.
Exactly once

In exactly-once delivery semantics, a message must be delivered only once and no message
should be lost.

This is the most difficult delivery semantic of all. Applications adopting exactly-once
semantics may have lower throughput and higher latency compared to the other two semantics.

Delivery Semantics summary

The table below summarizes the behavior of all delivery semantics.


Producer delivery semantics

Different delivery semantics can be achieved in Kafka using the acks property of the producer and the
min.insync.replicas property of the broker (considered only when acks = all).

Acks = 0

When the acks property is set to zero, you get at-most-once delivery semantics. The Kafka producer
sends the record to the broker and doesn't wait for any response.

Messages, once sent, will not be retried in this setting. The producer uses a "send and forget"
approach with acks = 0.

Data loss

In this mode, the chance of data loss is high, as the producer does not confirm that the message was
received by the broker.

The message may not even have reached the broker, or a broker failure soon
after message delivery can result in data loss.
Acks = 1

When this property is set to 1, you can achieve at-least-once delivery semantics.
The Kafka producer sends the record to the broker and waits for a response from the broker. If no
acknowledgment is received for the message sent, then the producer will retry sending the
message based on the retry configuration.

The retries property is 0 by default; make sure it is set to the desired number or to Integer.MAX_VALUE.

Data loss

In this mode, the chance of data loss is moderate, as the producer confirms only that the message was
received by the broker (leader partition).

As the replication to follower partitions happens after the acknowledgment, this may still result
in data loss.

For example, if the broker goes down after sending the acknowledgment but before replication,
this results in data loss, as the producer will not resend the message.
Acks = All
When the acks property is set to all, you can achieve exactly-once delivery semantics.
The Kafka producer sends the record to the broker and waits for a response from the broker.

If no acknowledgment is received for the message sent, then the producer will retry sending the
message based on the retry configuration, up to n times.

The broker sends the acknowledgment only after replication, based on the min.insync.replicas
property.

For example, a topic may have a replication factor of 3 and min.insync.replicas of 2.
In this case, an acknowledgment will be sent only after the second replica is in sync.

In order to achieve exactly-once delivery semantics, the producer also has to be idempotent. Acks = all
should be used in conjunction with min.insync.replicas.

Data loss

In this mode, the chance of data loss is low, as the producer receives confirmation that the message was
received by the brokers (leader and follower partitions) only after replication.
As the replication to follower partitions happens before the acknowledgment, the chances of data loss
are minimal.

For example, if the broker goes down before replication and before sending the acknowledgment, the
producer will not receive the acknowledgment and will send the message again to the newly
elected leader partition.

Exception

When there are not enough nodes to replicate as per the
min.insync.replicas property, the broker returns an
exception instead of an acknowledgment.

Safe producer

In order to create a safe producer that ensures minimal data loss, use the producer and broker
properties below (a sketch of these settings follows the list).

Producer properties

- acks = all (default 1) – ensures replication before acknowledgement
- retries = MAX_INT (default 0) – retry in case of exceptions
- max.in.flight.requests.per.connection = 5 (default) – parallel connections to the broker

Broker properties

- min.insync.replicas = 2 (at least 2) – ensures a minimum number of in-sync replicas (ISR)
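A minimal sketch of the producer side of this configuration (the broker/topic setting min.insync.replicas=2 is applied on the broker or per topic, not here):

Properties props = new Properties();
props.put("bootstrap.servers", "broker1:9092,broker2:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

props.put("acks", "all");                                    // wait for leader + in-sync replicas
props.put("retries", Integer.toString(Integer.MAX_VALUE));   // retry on transient failures
props.put("max.in.flight.requests.per.connection", "5");     // default; set to 1 to preserve ordering on retry

KafkaProducer<String, String> producer = new KafkaProducer<>(props);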


Acks impact

The table below summarizes the impact of the acks property on latency, throughput, and durability.

Summary

Configure the Kafka producer and broker to achieve the desired delivery semantics based on the
following properties:

- acks
- retries
- max.in.flight.requests.per.connection
- min.insync.replicas

In Part 4 of the series, let's understand the Kafka consumer, consumer groups, and how to achieve
different Kafka consumer delivery semantics.

Kafka Consumer Overview

Published on May 29, 2019 by Sylvester Daniel

This article is a continuation of the Part 1 Kafka technical overview, Part 2 Kafka producer overview,
and Part 3 Kafka producer delivery semantics articles. Let's look into the Kafka consumer group,
consumer, and the protocol used, in detail.

Consumer Role

Like the Kafka producer that optimizes writes to Kafka, the consumer is used for optimal consumption
of Kafka data.

The primary role of a Kafka consumer is to take a Kafka connection and consumer properties and
read records from the appropriate Kafka broker.

Complexities of concurrent consumption by multiple applications, offset management, delivery
semantics, and a lot more are taken care of by the Consumer API.

Properties

Some of the consumer properties are bootstrap.servers, fetch.min.bytes,
max.partition.fetch.bytes, fetch.max.bytes, enable.auto.commit, and many more. We will
discuss some of these properties later in the next part of the article series.

Multi-app Consumption

Multiple applications can consume records from the same Kafka topic, as shown in the diagram below.

Each application that consumes data from Kafka gets its own copy and can read at its own
speed.

In other words, the offset consumed by one application could be different from that of another application.

Kafka keeps track of the offsets consumed by each application in an internal
"__consumer_offsets" topic.

Consumer Group and Consumer

Each application consuming data from Kafka is treated as a consumer group.

For example, if two applications are consuming the same topic from Kafka, then
internally Kafka creates two consumer groups.

Each consumer group can have one or more consumers.

If a topic has 3 partitions and a single application consumes it, then one consumer group
is created and one consumer in the consumer group consumes all
partitions of the topic. The diagram below depicts a consumer group with a single
consumer.

When an application wants to increase the speed of processing and
process partitions in parallel, it can add more consumers to the
consumer group.

Kafka takes care of keeping track of offsets consumed per consumer in a
consumer group, rebalancing consumers in the consumer group when a
consumer is added or removed, and a lot more.

When there are multiple consumers in a consumer group, each consumer in the group is assigned
one or more partitions.

Each consumer in the group processes records in parallel from the leader partitions it is assigned on
the brokers. A consumer can read from more than one partition.

It's very important to understand that no single partition
will be assigned to two consumers in the same consumer
group; in other words, the same partition will not be
processed by two consumers, as shown in the diagram below.

When there are more consumers in a consumer group than partitions in a topic, the over-allocated
consumers in the consumer group are left unused.

When you have multiple topics and multiple applications consuming the data, the consumer
groups and consumers of Kafka will look similar to the diagram shown below. A minimal consumer
sketch illustrating the group.id setting follows.
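As a minimal sketch (the topic name and group id are illustrative), each instance of the application below that runs with the same group.id joins the same consumer group, and Kafka splits the topic's partitions among the running instances:

import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupedConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092,broker2:9092");
        props.put("group.id", "order-processing-app");   // all instances share this group id
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        // Each partition of "orders" is assigned to exactly one consumer in the group.
        consumer.subscribe(Arrays.asList("orders"));
        // ... poll loop goes here (see the Poll Method section in the next article)
    }
}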

Coordinator and leader discovery


In order to manage the handshake between Kafka and the application that forms the consumer group and
its consumers, a coordinator on the Kafka side and a leader (one of the consumers in the
consumer group) are elected.

The first consumer that initiates the process is automatically elected
as the leader of the consumer group.

As explained in the diagram below, for a consumer to join a consumer group, the following
handshake processes take place:

1. Find coordinator
2. Join group
3. Sync group
4. Heartbeat
5. Leave group
Coordinator

In order to create or join a group, a consumer first has to find the coordinator on the Kafka
side that manages the consumer group.

The consumer makes a "find coordinator" request to one of the bootstrap servers.

If a coordinator doesn't already exist, one is identified based on a hashing formula and returned as the
response to the "find coordinator" request.

Join Group

Once the coordinator is identified, the consumer makes a "join group" request to the coordinator.

The coordinator returns the consumer group leader and metadata details.

If a leader doesn't already exist, then the first consumer of the group is elected as leader.
The consuming application can also influence the leader elected by the coordinator node.
Sync Group

After the leader details are received for the join group request, the consumer makes a "sync group"
request to the coordinator.

This request triggers the rebalancing process across consumers in the consumer group, as the
partitions assigned to the consumers will change after the "sync group" request.

Rebalance

All consumers in the consumer group receive updated partition assignments that they need
to consume when a consumer is added or removed or a "sync group" request is sent.

Data consumption by all consumers in the consumer group is halted until the rebalance
process is complete.

Heartbeat

Each consumer in the consumer group periodically sends a heartbeat signal to its group
coordinator. In the case of heartbeat timeout, the consumer is considered lost and rebalancing is
initiated by the coordinator.
Leave Group

A consumer can choose to leave the group anytime by sending a “leave group” request. The
coordinator will acknowledge the request and initiate a rebalance. In case the leader node leaves
the group, a new leader is elected from the group and a rebalance is initiated.

Summary

As explained in Part 1 of the article series, partitions are the unit of parallelism. As the consumers in a
consumer group are limited by the partitions in a topic, it's very important to decide your partition count
based on the SLA and to scale your consumers accordingly. Consumer offsets are managed and
stored by Kafka in an internal "__consumer_offsets" topic. Each consumer in a consumer group
follows the find coordinator, join group, sync group, heartbeat, and leave group protocol. Let's
understand Kafka consumer properties and delivery semantics in the next part of the article series.
Kafka Consumer Delivery Semantics

Published on September 1, 2019 by Sylvester Daniel

This article is a continuation of the Part 1 Kafka technical overview, Part 2 Kafka producer
overview, Part 3 Kafka producer delivery semantics, and Part 4 Kafka consumer overview
articles. Let's understand the different consumer configurations and consumer delivery semantics.

Subscribe
To read records from a Kafka topic, create an instance of the Kafka consumer
and subscribe to one or more Kafka topics.
You can also subscribe to a set of topics using a regular expression, for example, "myTopic.*".

Properties props = new Properties();
props.put("bootstrap.servers", "broker1:9092,broker2:9092");
props.put("group.id", "myGroup");   // group id is illustrative
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

KafkaConsumer<String, String> consumer = new KafkaConsumer<String, String>(props);
// Pattern subscription; on older clients, pass a ConsumerRebalanceListener as the second argument.
consumer.subscribe(Pattern.compile("myTopic.*"));
Poll Method
Consumers read data from Kafka by polling for new data.

The poll method takes care of all the coordination, like partition rebalancing, heartbeats, and data
fetching.

When auto-commit is set to true, the poll method not only reads data but also commits the offsets and
then reads the next batch of records as well. A minimal poll loop is sketched below.
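A minimal poll loop sketch (assuming the consumer created above and a recent client that accepts a Duration timeout; older clients use poll(long)):

import java.time.Duration;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;

// ...
while (true) {
    // poll() fetches records and, with auto-commit enabled, commits the previous offsets.
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
    for (ConsumerRecord<String, String> record : records) {
        System.out.printf("topic=%s partition=%d offset=%d value=%s%n",
                record.topic(), record.partition(), record.offset(), record.value());
    }
}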

Consumer Configurations

Kafka consumer behavior is configurable through the following properties. These properties are
passed as key-value pairs when the consumer instance is created.

Enable.auto.commit

Defines how offsets are committed to Kafka; by default, "enable.auto.commit" is set to true.

When this property is set to true, you may also want to set how frequently offsets should be
committed using "auto.commit.interval.ms".

By default, auto.commit.interval.ms is set to 5000 ms (5 seconds).

When "enable.auto.commit" is set to true, the consumer delivery
semantics is "at most once", and commits are asynchronous.

Key points:

- enable.auto.commit = true (default)
- auto.commit.interval.ms = 5000 ms (default)
- at-most-once delivery semantics
- commits are asynchronous when enable.auto.commit is true

Partition.assignment.strategy

In the previous article, Kafka consumer overview, we learned that consumers in a consumer
group are assigned different partitions.

The partitions are assigned to consumers based on the "partition.assignment.strategy" property.

PartitionAssignor is the class that defines the required interface for an assignment strategy.

Kafka comes with the built-in RangeAssignor and RoundRobinAssignor, supporting the range and
round-robin strategies respectively.

Range strategy: In the range strategy, partitions are assigned to consumers in ranges.

For example, if there are 7 partitions in each of 2 topics, consumed by 2 consumers, then the range
strategy assigns the first 4 partitions (0 – 3) of both topics to the first consumer and 3 partitions (4
– 6) of both topics to the second consumer.

The partitions are unevenly assigned, with the first consumer processing 8 partitions and the second
consumer processing only 6 partitions.

By default, "partition.assignment.strategy" is set to "RangeAssignor".

Round-robin strategy: In the round-robin strategy, partitions are assigned to consumers in a
round-robin fashion, resulting in an even distribution of partitions to consumers.

For example, if there are 7 partitions in each of 2 topics consumed by 2 consumers, then the round-
robin strategy assigns 4 partitions (0, 2, 4, 6) of the first topic and 3 partitions (1, 3, 5) of the second
topic to the first consumer, and 3 partitions (1, 3, 5) of the first topic and 4 partitions (0, 2, 4, 6) of the
second topic to the second consumer.

Key points:
- partition.assignment.strategy – decides how partitions are assigned to consumers
- Range strategy (RangeAssignor) is the default.
- Range strategy may result in an uneven assignment.

Fetch.min.bytes

Defines the minimum number of bytes required to send data from Kafka to the consumer. When the
consumer polls for data, if the minimum number of bytes is not reached, then Kafka waits until
the pre-defined size is reached and then sends the data.

The default value is 1 byte.

Increasing fetch.min.bytes reduces the load on both the consumer and the
broker, increasing both latency and throughput.
When there are many small messages, resulting in higher CPU consumption, it is
better to increase the fetch.min.bytes value.

Fetch.max.wait.ms

Defines the maximum time to wait before sending data from Kafka to the consumer. While
fetch.min.bytes controls the minimum bytes required, the minimum may sometimes not be reached
for a long time; to keep a balance on how long Kafka should wait before sending data,
"fetch.max.wait.ms" is used.

The default value of fetch.max.wait.ms is 500 ms (0.5 seconds). Increasing this value increases
latency and throughput of the application, so define both fetch.min.bytes
and fetch.max.wait.ms based on your SLA.

Session.timeout.ms
Defines how long a consumer can be out of contact with the broker.

While heartbeat.interval.ms defines how often a heartbeat should be
sent, session.timeout.ms defines how long a consumer can be
out of contact with the broker.

When the session times out, the consumer is considered lost and a
rebalance is triggered.
To avoid this from happening too often, it's better to set
session.timeout.ms to at least 3 times the
heartbeat.interval.ms value.
By setting a higher value you can avoid unwanted rebalancing and the other overheads associated
with it.

Max.partition.fetch.bytes

Defines the maximum bytes per partition to be sent from the broker to the consumer.

By default, the value is set to 1 MB.

The maximum message size and max.partition.fetch.bytes decide the memory required per
consumer to receive messages.

Max.poll.records

Defines the number of records to be returned by a single poll() call.

It helps control the number of records to be processed per poll method call.

Auto.offset.reset

When reading from the broker for the first time, as Kafka may not have any committed offset
value, this property defines where to start reading from.

You could set "earliest" or "latest"; while "earliest" will read all messages from the beginning,
"latest" will read only new messages written after the consumer has subscribed to the topic.

The default value of "auto.offset.reset" is "latest". A combined configuration sketch is shown below.

Delivery semantics

As stated in the earlier article on Kafka producer delivery semantics, there are three delivery semantics,
namely at most once, at least once, and exactly once.

When data is consumed from Kafka by a consumer group/consumer, only the first two
semantics are supported.

You could still achieve an output similar to exactly once by choosing a suitable data store that
writes by a unique key.
For example, any key-value store, an RDBMS (primary key), Elasticsearch, or any other store that
supports idempotent writes.

At most once

In at-most-once delivery semantics, a message should be delivered at most once.

It is acceptable to lose a message rather than deliver it twice in this semantic.
Applications adopting at-most-once semantics can easily achieve higher throughput and lower latency.

By default, Kafka consumers are set to use "at most once" delivery
semantics, as "enable.auto.commit" is true.

In case the consumer fails after messages are committed as read but before processing them, the
unprocessed messages are lost and will not be read again.
Partition rebalancing will result in another consumer reading messages from the last committed
offset. As shown in the diagram below, messages are read in batches and some or all of the
messages in the batch might be unprocessed but still committed as processed.

At least once

In at-least-once delivery semantics, it is acceptable to deliver a message more than once, but no
message should be lost.

The consumer ensures that all messages are read and processed for sure, even though it may
result in message duplication.

This is the most commonly preferred semantic of all. Applications adopting at-least-once semantics may
have moderate throughput and moderate latency.

By setting the "enable.auto.commit" value to "false", you can manually
commit the offsets after the messages are processed (a sketch follows below).
In case the consumer fails before processing them, the unprocessed messages are not lost, as the
offsets are not committed as read. Partition rebalancing will result in another consumer reading
the same messages again from the last committed offset, resulting in duplicate messages. As shown
in the diagram below, messages are read in batches and some or all of the messages in the batch
might be processed again, but no messages will be lost.
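A minimal at-least-once sketch using manual commits (assuming the consumer setup shown earlier; processRecord() is a hypothetical placeholder for your own processing logic):

props.put("enable.auto.commit", "false");   // take over offset management

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Arrays.asList("orders"));  // hypothetical topic

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
    for (ConsumerRecord<String, String> record : records) {
        processRecord(record);   // hypothetical processing step
    }
    // Commit only after the whole batch is processed; a crash before this line
    // means the batch is re-read (possible duplicates, but no loss).
    consumer.commitSync();
}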

Exactly once

In exactly-once delivery semantics, a message must be delivered only once and no message
should be lost. This is the most difficult delivery semantic of all. Applications adopting exactly-
once semantics may have lower throughput and higher latency compared to the other two semantics. As
stated earlier, you could still achieve an output similar to exactly once by choosing a suitable data
store that writes by a unique key, for example, any key-value store, an RDBMS (primary key),
Elasticsearch, or any other store that supports idempotent writes.

Summary

Configure the Kafka consumer to achieve the desired performance and delivery semantics based
on the following properties:

- enable.auto.commit
- partition.assignment.strategy
- fetch.max.wait.ms
- fetch.min.bytes
- session.timeout.ms
- max.partition.fetch.bytes
- max.poll.records
- auto.offset.reset

The Kafka consumer supports only at-most-once and at-least-once delivery semantics.

Building a Real-Time Application Using Kafka and Spark

Analyzing real-time streaming data with accuracy and storing this lightning-fast data has become
one of the biggest challenges in the world of big data.
One of the best solutions for tackling this problem is building a real-time streaming
application with Kafka and Spark and storing the incoming data in HBase
using Spark.

In this blog, we will discuss how to build a real-time stateful streaming
application using Kafka and Spark and store the results in HBase in real
time.

Before going through this blog, we recommend going through our previous blogs on
Kafka and Spark integration, the Beginners Guide of HBase, and stateful streaming in Spark.

Here is the source code of our streaming application, which runs every 10 seconds and
stores the results back into HBase.

package WordCount

import org.apache.spark.{ SparkConf, SparkContext }
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.{ State, StateSpec }
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.hadoop.hbase.{ HBaseConfiguration, HTableDescriptor, HColumnDescriptor }
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.mapreduce.{ TableInputFormat, TableOutputFormat }
import org.apache.hadoop.hbase.client.{ HBaseAdmin, Put, HTable }

object Kafka_HBase {
  def main(args: Array[String]) {
    val conf = new SparkConf().setMaster("local[2]").setAppName("Kafka_Spark_Hbase")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Defining the Kafka server parameters
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092,localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "use_a_separate_group_id_for_each_stream",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean))

    val topics = Array("acadgild_topic") // topics list
    val kafkaStream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](topics, kafkaParams))

    // Split each record's value into words
    val splits = kafkaStream.map(record => (record.key(), record.value.toString)).flatMap(x => x._2.split(" "))

    // State update function: add the current batch's count to the running total
    val updateFunc = (values: Seq[Int], state: Option[Int]) => {
      val currentCount = values.foldLeft(0)(_ + _)
      val previousCount = state.getOrElse(0)
      val updatedSum = currentCount + previousCount
      Some(updatedSum)
    }

    // Defining a checkpoint directory for performing stateful operations
    ssc.checkpoint("hdfs://localhost:9000/WordCount_checkpoint")

    val cnt = splits.map(x => (x, 1)).reduceByKey(_ + _).updateStateByKey(updateFunc)

    // Write one (word, count) row into the HBase table
    def toHBase(row: (_, _)) {
      val hConf = new HBaseConfiguration()
      hConf.set("hbase.zookeeper.quorum", "localhost:2182")
      val tableName = "Streaming_wordcount"
      val hTable = new HTable(hConf, tableName)
      val tableDescription = new HTableDescriptor(tableName)
      //tableDescription.addFamily(new HColumnDescriptor("Details".getBytes()))
      val thePut = new Put(Bytes.toBytes(row._1.toString()))
      thePut.add(Bytes.toBytes("Word_count"), Bytes.toBytes("Occurances"), Bytes.toBytes(row._2.toString))
      hTable.put(thePut)
    }

    val Hbase_inset = cnt.foreachRDD(rdd => if (!rdd.isEmpty()) rdd.foreach(toHBase(_)))

    ssc.start()
    ssc.awaitTermination()
  }
}

Given below are the Maven dependencies for the application:

<dependencies>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming_2.11
-->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>2.1.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core_2.11 -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.1.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-
kafka-0-10_2.11 -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
<version>2.1.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql_2.11 -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.1.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.kafka/kafka_2.11 -->
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka_2.11</artifactId>
<version>0.10.2.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.hbase/hbase-client -->
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-client</artifactId>
<version>1.3.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.hbase/hbase-common -->
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-common</artifactId>
<version>1.3.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.hbase/hbase-protocol -->
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-protocol</artifactId>
<version>1.3.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.hbase/hbase-hadoop2-compat
-->
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-hadoop2-compat</artifactId>
<version>1.3.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.hbase/hbase-annotations -->
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-annotations</artifactId>
<version>1.3.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.hbase/hbase-server -->
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-server</artifactId>
<version>1.3.1</version>
</dependency>
</dependencies>

Technology stack used:

Spark version 2.1.0
Kafka version 0.10.2
HBase version 1.3.1

Note: In this application, we are performing stateful streaming so the occurrences of the words
will be accumulated right from the starting state of the streaming application.

You can refer to our Spark stateful streaming blog to know more about it.

Now this application will calculate the accumulated word counts of the words and update the
results back in HBase.

Now, let us run this application as a normal Spark streaming application, produce some data
through our Kafka console producer, and check the word count results in HBase.

In the screenshot below, you can see that our streaming application is running.
Let us give some input now.
In the above screenshot, you can see the word count results in HBase; let's give the same input
again and check for the accumulated results.
On top of these results in HBase, we can again build a Hive external table using the Hive-HBase storage
handler and query the results. You can also read our blog post on HBase Write Using Hive to
learn how to build an external table on a table in HBase.

This is how we can build real-time robust streaming applications using Kafka, Spark, and HBase.

Keep visiting our website, www.acadgild.com, for more updates on Big Data and other
technologies.

Enroll for Big Data and Hadoop Training conducted by Acadgild and become a successful big
data developer.
HBase Write Using Hive

Published by kiran on March 27, 2017

HBase is one of the most popular NoSQL databases and runs on top of the Hadoop ecosystem. In
this blog, we will be discussing ways of writing into an HBase table using Hive.

For learning the basics of HBase, you can refer to our blog Beginners Guide of HBase.

Now let us start with creating a reference table in Hive for the table in HBase.

Creating a Table in HBase


First, we will create a table in HBase for storing the employee data, as shown below.

create 'employee','emp_details'

You can see the same in the screenshot below.


We have successfully created a table in HBase. There is no data in the table. Let’s quickly insert
some data into the table.

put 'employee',1,'emp_details:first_name','Debra'

put 'employee',1,'emp_details:last_name','Burke'

put 'employee',1,'emp_details:email','dburke0@unblog.fr'

In the below screenshot, you can see that we have successfully inserted the data into the HBase table.
Creating an External Table in Hive

Let us query this data from Hive. As we have already created a table in HBase, we
need to create an external table in Hive referring to the HBase table.

It can be created as shown below.

CREATE EXTERNAL TABLE employee(
  id string COMMENT 'from deserializer',
  first_name string COMMENT 'from deserializer',
  last_name string COMMENT 'from deserializer',
  email string COMMENT 'from deserializer')
ROW FORMAT SERDE
  'org.apache.hadoop.hive.hbase.HBaseSerDe'
STORED BY
  'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  'hbase.columns.mapping'=':key,emp_details:first_name,emp_details:last_name,emp_details:email',
  'serialization.format'='1')
TBLPROPERTIES (
  'hbase.table.name'='employee'
);

We have successfully created a Hive table which refers to the HBase table. You can see the same in the
screenshot below.
This is how we can create a reference table in Hive for an HBase table. Let us now
insert data into this Hive table, which in turn will get reflected in the HBase table.

Inserting Values into the HBase Table through Hive

For inserting data into the HBase table through Hive, you need to specify the HBase table name in
the Hive shell by setting the below property before running the insert command.

set hbase.mapred.output.outputtable=employee;

If the above property is not set, then you will get an error as shown below.

Job Submission failed with exception 'java.io.IOException
(org.apache.hadoop.hive.ql.metadata.HiveException:
java.lang.IllegalArgumentException: Must specify table name)'

FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.mr.MapRedTask

Let us now insert one record into this Hive table using the below insert statement.

insert into table employee
values('2','Robin','Harvey','rharvey1@constantcontact.com');

In the below screenshot you can see that we have successfully inserted one record.
Let's check for the same in the HBase table.

In the above screenshot, we can see that the record inserted into Hive got
successfully written into the HBase table as well.


This sums up the discussion on loading a single record into HBase through Hive.
Let us now write a complete file (a collection of many records) into the HBase table through Hive.

The Hive table which refers to the HBase table is non-native, so we cannot directly load
data into this table.

If you try to do that, you will get the below exception.

FAILED: SemanticException [Error 10101]: A non-native table cannot be used as
target for LOAD

So, for loading the data in a file, we need to create a normal
Hive table and then use an insert overwrite statement.

This can be done as follows.

Loading the data in a file into HBase using Hive

Let us create a staging table for the employee table as shown below.

CREATE TABLE employee_stg (
  id STRING,
  first_name STRING,
  last_name STRING,
  email STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

In the below screenshot, we can see that we have successfully loaded the data into staging table.
You can download the emp dataset from here.

Now let us copy these contents into the employee table using insert overwrite statement.

Insert overwrite table employee select * from employee_stg;


In the below screenshot, you can see that we have successfully written the data into the employee table.
Let’s check for the data in HBase.
Here in HBase, you can see that the complete data has got populated.

So far, we have seen how to write data into a table which is present in HBase using Hive.

Creating a table in HBase from Hive

Now let's see how to create an HBase table from Hive itself and insert data into that HBase table.

This is fairly simple: earlier we created an external table in Hive, and now we will create a
managed table in Hive.

CREATE TABLE employee1(
  id string COMMENT 'from deserializer',
  first_name string COMMENT 'from deserializer',
  last_name string COMMENT 'from deserializer',
  email string COMMENT 'from deserializer')
ROW FORMAT SERDE
  'org.apache.hadoop.hive.hbase.HBaseSerDe'
STORED BY
  'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  'hbase.columns.mapping'=':key,emp_details:first_name,emp_details:last_name,emp_details:email',
  'serialization.format'='1')
TBLPROPERTIES (
  'hbase.table.name'='employee1'
);

The above query will create a Hive table named employee1 and also an HBase table named
employee1. Let's now insert the data.

insert into table employee1
values('2','Robin','Harvey','rharvey1@constantcontact.com');

In the below screenshot, we can see the same.

Let's check for the employee1 table in HBase and the data in it.

We have successfully created an HBase table from Hive and inserted data into it.

Loading the data in a file into HBase using Hive

To insert the data in a file, we need to follow the same
procedure we used earlier: creating a staging
table and using an insert overwrite statement.

Insert overwrite table employee1 select * from employee_stg;

We have successfully inserted the data; let's check for it.

Let us check for the complete data in the HBase table.

In the above screenshot, we can see the data written successfully into the HBase table as well.
Hive HBase Integration

A brief introduction to Apache Hive:


Apache Hive is a data warehouse software that facilitates querying and managing of large datasets
residing in distributed storage. Apache Hive provides SQL-like language called HiveQL for querying the
data. Hive is considered friendlier and more familiar to users who are used to using SQL for querying data.

Hive is best suited for data warehousing applications where data is stored, mined and reporting is done
based on the processing. Apache Hive bridges the gap between data warehouse applications and Hadoop
as relational database models are the base of most data warehousing applications.

It resides on top of Hadoop to summarize Big Data and makes querying and analyzing easy. HiveQL also
allows traditional map/reduce programmers to be able to plug in their custom mappers and reducers to
do more sophisticated analysis that may not be supported by the built-in capabilities of the language.

Why do we need to integrate Apache Hive with HBase?

Hive can store information on hundreds of millions of users
effortlessly but faces some difficulties when it comes to keeping the
warehouse up to date with the latest information.
Apache Hive uses HDFS as its underlying storage,
which comes with limitations like append-only, block-
oriented storage.
This makes it impossible to directly apply individual updates to warehouse tables.

Up until now, the only practical option to overcome this limitation has been to pull snapshots from
MySQL databases and dump them into new Hive partitions.

This expensive operation of pulling the data from one location to
another location is not frequently practiced
(leading to stale data in the warehouse), and it does not scale well as the data volume continues to
shoot through the roof.

To overcome this problem, Apache HBase is used in place of MySQL, with Hive.

What is HBase?
HBase is a scale-out table store which can support a very high rate of row-level updates over a large
amount of data.

HBase solves Hadoop's append-only constraint by
keeping recently updated data in memory and
incrementally rewriting data to new files, splitting and
merging data intelligently based on data distribution
changes.
Since HBase is based on Hadoop, integrating it with Hive is straightforward, as
HBase tables can be accessed like native Hive tables.

As a result, a single Hive query can now perform complex operations such as join, union, and
aggregation across combinations of HBase and native Hive tables.
Likewise, Hive's INSERT statement can be used to move data
between HBase and native Hive tables, or to reorganize data
within HBase itself.

How is HBase integrated with Hive?

For integrating HBase with Hive, Storage Handlers in Hive are used.

Storage handlers are a combination of InputFormat, OutputFormat, SerDe, and specific code that
Hive uses to identify an external entity as a Hive table.

This allows the user to issue SQL queries seamlessly, whether the table represents a text file stored in
Hadoop or a column family stored in a NoSQL database such as Apache HBase, Apache Cassandra,
or Amazon DynamoDB.

Storage handlers are not limited to NoSQL databases; a storage
handler could be designed for several different kinds of data stores.

Here is an example of connecting Hive with HBase using the HBase storage handler.

Create the HBase table:

create 'employee','personaldetails','deptdetails'

The above statement will create the 'employee' table with two column families, 'personaldetails' and
'deptdetails'.

Insert the data into HBase table:


hbase(main):049:0> put 'employee','eid01','personaldetails:fname','Brundesh'

0 row(s) in 0.1030 seconds

hbase(main):050:0> put 'employee','eid01','personaldetails:Lname','R'

0 row(s) in 0.0160 seconds

hbase(main):051:0> put 'employee','eid01','personaldetails:salary','10000'

0 row(s) in 0.0090 seconds

hbase(main):060:0> put 'employee','eid01','deptdetails:name','R&D'

0 row(s) in 0.0680 seconds

hbase(main):061:0> put 'employee','eid01','deptdetails:location','Banglore'

0 row(s) in 0.0140 seconds

hbase(main):067:0> put 'employee','eid02','personaldetails:fname','Abhay'

0 row(s) in 0.0080 seconds

hbase(main):068:0> put 'employee','eid02','personaldetails:Lname','Kumar'

0 row(s) in 0.0080 seconds

hbase(main):069:0> put 'employee','eid02','personaldetails:salary','100000'

0 row(s) in 0.0090 seconds

Now create the Hive table pointing to the HBase table.

If there are multiple column families in HBase, we can
create one Hive table for each column family.
In this case, we have 2 column families, and hence we are creating two tables, one for
each column family.

Table for the personal details column family:


create external table employee_hbase(Eid String, f_name string, s_name string, salary int)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
with serdeproperties
("hbase.columns.mapping"=":key,personaldetails:fname,personaldetails:Lname,personaldetails:salary")
tblproperties("hbase.table.name"="employee");

STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'

If we are creating a non-native Hive table using a storage handler, then we should specify the
STORED BY clause, as above.

Note: There are different storage handler classes for different databases.

hbase.columns.mapping: It is used to map the Hive columns to the HBase columns. The first
column must be the key column, which is also the same as HBase's row key column.

Now we can query the HBase table with SQL queries in Hive using the below command.

select * from employee_hbase;

We hope going through this blog will help you in integrating Hive and HBase and in building
a useful SQL interface on top of HBase. The above query, fired from the Hive terminal, will yield all the
data from the HBase table.
Data Migration from SQL to HBase Using MapReduce

Data Migration from SQL to NoSQL


Data migration is the process of transferring data from one system to another by changing the
storage, the database, or the application.

In this tutorial, let us learn how to migrate data present in MySQL to HBase, a NoSQL
database, using MapReduce.

MySQL is one of the most widely used relational database systems. But due to the rapid growth of data,
people nowadays are searching for better alternatives to store and process their data.

This is how HBase came into existence: a Hadoop database
capable of storing a huge amount of data in clusters and scaling
massively.

Let us now see how to migrate the data present in MySQL to HBase using
Hadoop's MapReduce.

Here, for reading the data in MySQL, we will be using DBInputFormat. The writable class used with it
is as follows:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.ResultSet;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

public class DBInputWritable implements Writable, DBWritable {

  private int id;
  private String name;

  // readFields(DataInput) is used when reading serialized data from HDFS
  public void readFields(DataInput in) throws IOException { }

  // readFields(ResultSet) is used when reading from the database;
  // the ResultSet object represents the data returned from a SQL statement
  public void readFields(ResultSet rs) throws SQLException {
    id = rs.getInt(1);
    name = rs.getString(2);
  }

  // write(DataOutput) is required when serializing to HDFS
  public void write(DataOutput out) throws IOException { }

  // write(PreparedStatement) is required when saving to the database
  public void write(PreparedStatement ps) throws SQLException {
    ps.setInt(1, id);
    ps.setString(2, name);
  }

  public int getId() {
    return id;
  }

  public String getName() {
    return name;
  }
}

Using DBInputFormat, our MapReduce code will be able to read the data from MySQL.

In the table which we are using in this example, we have two fields, emp_id and emp_name. So we will
take these two fields from the MySQL table and store them in HBase.

Here is our data present in MySQL. In the database Acadgild we have an employee table, and in that
table we have two columns, emp_id and emp_name, as shown in the below screenshot.
Our DBInputFormat will read this data, so this will be the input of our mapper class. To store this data in
HBase, we need to create a table in HBase. You can use the below HBase command to create a table.

create 'employee','emp_info'

In the above screenshot, you can see that the employee table has been created in HBase. emp_info is the
column family which contains the information of the employee.
The mapper class which reads the input from the MySQL table is as follows:

import java.io.IOException;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;

public class DBInputFormatMap extends Mapper<LongWritable, DBInputWritable,
    ImmutableBytesWritable, Text> {

  protected void map(LongWritable id, DBInputWritable value, Context context) {
    try {
      String line = value.getName();
      String cd = value.getId() + "";
      // emit emp_id (as bytes) as the key and emp_name as the value
      context.write(new ImmutableBytesWritable(Bytes.toBytes(cd)), new Text(line));
    } catch (IOException e) {
      e.printStackTrace();
    } catch (InterruptedException e) {
      e.printStackTrace();
    }
  }
}
Above is the mapper class implementation which reads the data from a MySQL table. The output of
this mapper class is emp_id as the key and emp_name as the value. HBase writes data as bytes, so we
need to use ImmutableBytesWritable for the key.

So, from this mapper class, the MySQL table data is read and kept as key and value. The key will be the
same in both MySQL and HBase.

Now the key and the rest of the columns of the MySQL table will be sent to the reducer. For writing data
into HBase, a reducer class called TableReducer is used. Using the TableReducer class, we write
the MySQL data into HBase as follows:

import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;

public class Reduce extends TableReducer<ImmutableBytesWritable, Text, ImmutableBytesWritable> {

  public void reduce(ImmutableBytesWritable key, Iterable<Text> values,
      Context context) throws IOException, InterruptedException {

    String name = null;
    for (Text val : values) {
      name = val.toString();
    }

    // Put to HBase: row key = emp_id, column family "emp_info", qualifier "name"
    Put put = new Put(key.get());
    put.add(Bytes.toBytes("emp_info"), Bytes.toBytes("name"), Bytes.toBytes(name));
    context.write(key, put);
  }
}

This TableReducer receives the key and the remaining columns as values. We iterate over the values with a for-each loop; since this table has only two columns, a single variable name holds emp_name, and the Put class provided by HBase is used to write the data into the HBase column family.

Put put = new Put(key.get());
put.add(Bytes.toBytes("emp_info"), Bytes.toBytes("name"), Bytes.toBytes(name));

The above two lines write the data into our HBase table; emp_info is the column family. Below is the driver class implementation of this program.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.io.Text;

public class RDBMSToHDFS {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        DBConfiguration.configureDB(conf,           // MySQL connection information
            "com.mysql.jdbc.Driver",
            "jdbc:mysql://localhost/Acadgild",      // MySQL database URI
            "root",                                 // MySQL user name
            "root_usr_password");                   // MySQL user password

        Job job = new Job(conf);
        job.getConfiguration().setInt("mapred.map.tasks", 1);
        job.setJarByClass(RDBMSToHDFS.class);
        job.setMapperClass(DBInputFormatMap.class);

        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Text.class);

        // Configures the reducer class and the HBase output table
        TableMapReduceUtil.initTableReducerJob("employee", Reduce.class, job);

        job.setInputFormatClass(DBInputFormat.class);
        job.setOutputFormatClass(TableOutputFormat.class);
        job.setNumReduceTasks(1);

        DBInputFormat.setInput(
            job,
            DBInputWritable.class,
            "employee",                             // input table name
            null,
            null,
            new String[] { "emp_id", "emp_name" }   // table columns
        );

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

We have built this program using Maven and here are the Maven dependencies for this program.

<dependencies>

<!-- https://mvnrepository.com/artifact/org.apache.hbase/hbase-server -->

<dependency>

<groupId>org.apache.hbase</groupId>
<artifactId>hbase-server</artifactId>

<version>1.1.2</version>

</dependency>

<!-- https://mvnrepository.com/artifact/org.apache.hbase/hbase-client -->

<dependency>

<groupId>org.apache.hbase</groupId>

<artifactId>hbase-client</artifactId>

<version>1.1.2</version>

</dependency>

<!-- https://mvnrepository.com/artifact/org.apache.hbase/hbase-common -->

<dependency>

<groupId>org.apache.hbase</groupId>

<artifactId>hbase-common</artifactId>

<version>1.1.2</version>

</dependency>

<dependency>

<groupId>mysql</groupId>

<artifactId>mysql-connector-java</artifactId>

<version>5.1.36</version>

</dependency>

</dependencies>

We can build an executable jar with these dependencies by adding Maven assembly plugin into the
pom.xml file. The plugin is as follows:
<plugin>

<artifactId>maven-assembly-plugin</artifactId>

<configuration>

<archive>

<manifest>

<mainClass>Mysql_to_Hbase.RDBMSToHDFS</mainClass>

</manifest>

</archive>

<descriptorRefs>

<descriptorRef>jar-with-dependencies</descriptorRef>

</descriptorRefs>

</configuration>

</plugin>

Now we can use the command mvn clean compile assembly:single to build an executable jar for this program. After running this command, you should see the build success message shown in the screenshot below.

In your project directory, inside the target folder, you will find the generated jar file.

We can now run this jar as a normal Hadoop jar and then check for the output in the HBase table.
In the above screenshot, you can see that the job completed successfully. Let us now check for the output in our HBase table.
In the above screenshot, you can see the data in the HBase table after running the jar file. We have successfully migrated the data present in the MySQL table to an HBase table using MapReduce.
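To verify a migrated row programmatically instead of from the shell, a minimal sketch using the HBase client API can be used. The row key "1" below is only a hypothetical example; adjust it to an emp_id that actually exists in your source table:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class VerifyEmployeeRow {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "employee");
        // Row keys were written as the string form of emp_id; "1" is a placeholder
        Get get = new Get(Bytes.toBytes("1"));
        Result result = table.get(get);
        byte[] name = result.getValue(Bytes.toBytes("emp_info"), Bytes.toBytes("name"));
        System.out.println("emp_name = " + Bytes.toString(name));
        table.close();
    }
}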
Data Bulk Loading into HBase Table Using MapReduce

In this blog, we will discuss the steps to bulk load file contents from an HDFS path into an HBase table using the Java MapReduce API. Before moving forward, you can follow the blogs linked below to gain more knowledge of HBase and how it works.

Beginners Guide to Apache HBase


Integrating Hive with HBase
Performing CRUD Operations on HBase Using Java API
Introduction to HBase Filters
Read and Write Operations in HBase
How to Import Table from MySQL to HBase

Apache HBase gives us random, real-time read/write access to Big Data, but an equally important question is how we get the data loaded into HBase in the first place.
The HBase Put API can be used to insert data into HBase, but inserting every record individually through the Put API is much slower than bulk loading.
It is therefore better to load complete file contents into the HBase table in bulk using the bulk load feature.

Bulk loading in HBase is the process of preparing HFiles and loading them directly into the region servers.

In our example, we will be using a sample data set hbase_input_emp.txt which is saved in our hdfs
directory hbase_input_dir. You can download this sample data set for practice from the below link.

DATASET
HBase_input_emp.txt
Please refer to the description of the above data set, which contains four columns:
Column 1: Employee id
Column 2: Employee name
Column 3: Employee mail id
Column 4: Employee salary
You can follow the steps below to bulk load the data contents from HDFS into HBase via a MapReduce job.

Extract the data from the source and load it into HDFS.
If the data is in Oracle or MySQL, you need to fetch it using Sqoop or a similar tool that provides a mechanism to import data directly from a database into HDFS. If your raw files, such as .txt, .pst or .xml files, are located on other servers, simply pull them and load them into HDFS. HBase does not prepare HFiles by reading data directly from the source.
In our example, the data is already available in our HDFS path. We can use the cat command to see the content of the input file hbase_input_emp.txt, which is saved in the hbase_input_dir folder of the HDFS path.
hdfs dfs -cat /hbase_input_dir/hbase_input_emp.txt
Transform the data into HFiles via a MapReduce job.
Here we write a MapReduce job that processes our data and creates the HFiles. There is only a Mapper class and no Reducer class of our own; in the code we call HFileOutputFormat.configureIncrementalLoad(), which makes HBase plug in its own Reducer class.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class HBaseBulkLoad {

    public static class BulkLoadMap extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            // Split the record wherever there is a comma; the first column is used as the row key
            String[] parts = line.split(",");
            String rowKey = parts[0];

            // HBase stores its data as bytes, so the row key is converted to bytes
            // and wrapped in an ImmutableBytesWritable
            ImmutableBytesWritable HKey = new ImmutableBytesWritable(Bytes.toBytes(rowKey));

            // Create a Put with the row key and add the remaining fields
            // into the 'id' column family
            Put HPut = new Put(Bytes.toBytes(rowKey));
            HPut.add(Bytes.toBytes("id"), Bytes.toBytes("name"), Bytes.toBytes(parts[1]));
            HPut.add(Bytes.toBytes("id"), Bytes.toBytes("mail_id"), Bytes.toBytes(parts[2]));
            HPut.add(Bytes.toBytes("id"), Bytes.toBytes("sal"), Bytes.toBytes(parts[3]));

            context.write(HKey, HPut);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        String inputPath = args[0];
        HTable table = new HTable(conf, args[2]);
        conf.set("hbase.mapred.outputtable", args[2]);

        Job job = new Job(conf, "HBase_Bulk_loader");
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Put.class);
        job.setSpeculativeExecution(false);
        job.setReduceSpeculativeExecution(false);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(HFileOutputFormat.class);
        job.setJarByClass(HBaseBulkLoad.class);
        job.setMapperClass(HBaseBulkLoad.BulkLoadMap.class);

        FileInputFormat.setInputPaths(job, inputPath);
        TextOutputFormat.setOutputPath(job, new Path(args[1]));

        // configureIncrementalLoad() sets up HBase's own reducer and partitioner
        // so that sorted HFiles are produced per region
        HFileOutputFormat.configureIncrementalLoad(job, table);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
The above step finishes the MapReduce programming part.


Now, we need to create a new table in HBase to hold the contents imported from the HDFS input directory. Follow the steps below to import the contents from the HDFS path into the HBase table.
Enter the HBase shell:
Before entering the HBase shell, you should start the HBase services. Use the command below to start them.
start-hbase.sh

After the HMaster service has started, use the command below to enter the HBase shell.
hbase shell

Create the table:
We can use the create command to create a table in HBase.
create 'Academp','id'

Scan the table:
We can use the scan command to see the contents of a table in HBase.
scan 'Academp'

We can observe from the above image that no contents are available yet in the table Academp.
Export HADOOP_CLASSPATH:
In the next step, we need to load the HBase library files into the Hadoop classpath; this enables the Hadoop client to connect to HBase and get the number of splits.
export HADOOP_CLASSPATH=$HBASE_HOME/lib/*

MapReduce jar execution:

Now, run the MapReduce job with the command below to generate the HFiles.
hadoop jar /home/acadgild/Desktop/BKLoad.jar /hbase_input_dir/hbase_input_emp.txt /hbase_output_dir Academp

Here, the first parameter is the input directory where our input file is saved, the second parameter is the output directory where the HFiles will be saved, and the third parameter is the HBase table name.
Now, let us use the ls command to list the HFiles stored in our output directory 'hbase_output_dir':
hadoop fs -ls /hbase_output_dir

hadoop fs -ls /hbase_output_dir/id

We can use the command below to see the content of the output HFile saved in the sub-directory 'id':
hadoop fs -cat /hbase_output_dir/id/5ed1f7…..
After the data has been prepared using HFileOutputFormat, it is loaded into the cluster using
completebulkload. This command line tool iterates through the prepared data files, and for each one
determines the region the file belongs to. It then contacts the appropriate Region Server which adopts the
HFile, moving it into its storage directory and making the data available to clients.
Now, load the files into HBase by telling the RegionServers where to find them.
HBase hadoop jar execution:
Once the HFiles have been created in the HDFS directory, we can use the command below to load the HFile contents into the HBase table.
hadoop jar /home/acadgild/Downloads/hbase-server-0.98.14-hadoop2.jar completebulkload /hbase_output_dir/ Academp
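The same step can also be triggered from Java through the LoadIncrementalHFiles class, which is what the completebulkload tool runs underneath. A minimal sketch, assuming the output directory and table name used above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class CompleteBulkLoad {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "Academp");
        LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
        // Moves the prepared HFiles under /hbase_output_dir into the regions of 'Academp'
        loader.doBulkLoad(new Path("/hbase_output_dir"), table);
        table.close();
    }
}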

Scan the Academp table:

Now, we can use the scan command on the table Academp to see the contents that were loaded from the HDFS path.
scan 'Academp'
Thus, from the above steps, we can observe that we have successfully imported bulk data into an HBase table using the Java API.
We hope this post has been helpful in understanding importing bulk data into HBase table. In case of any
queries, feel free to comment below and we will get back to you at the earliest.

Stateful Streaming in Apache Spark

kiran May 31, 2017



Apache Spark is a general-purpose processing engine built on top of the Hadoop ecosystem. Spark has a complete setup and a unified framework to process any kind of data. Spark can do batch processing as well as stream processing. It has a powerful SQL engine to run SQL queries on the data, an integrated machine learning library called MLlib, and a graph processing library called GraphX. Because it integrates so many things, we describe Spark as a unified framework rather than just a processing engine.

Now coming to Spark's real-time stream processing engine: Spark does not process data strictly in real time; it does near-real-time processing, handling the data in micro batches with latencies of just a few milliseconds.

Spark's streaming context processes the data in micro batches, but by default this processing is stateless. Say we have defined the streaming context to run every 10 seconds: it will process only the data that arrived within those 10 seconds. To look at previous data we have the windowing concept, but windows cannot give accumulated results from the starting timestamp of the job.

But what if you need to accumulate results from the start of the streaming job? That means you need to check the previous state of the RDD in order to update the new state of the RDD. This is what is known as stateful streaming in Spark.

Spark provides two APIs to perform stateful streaming: updateStateByKey and mapWithState.

Now we will see how to perform a stateful streaming word count using updateStateByKey. updateStateByKey is a function on DStreams in Spark which accepts an update function as its parameter. The update function receives the new values for a key as a Seq and the previous state of the key as an Option.

Let's take a word count program and say that during the first 10 seconds we send the data hello every one from acadgild. The word count result will be

(one,1)

(hello,1)

(from,1)

(acadgild,1)

(every,1)

Without the updateStateByKey function, if we send the same line hello every one from acadgild again in the next 10 seconds, we will get the same result for that batch as well, i.e.,

(one,1)

(hello,1)

(from,1)

(acadgild,1)

(every,1)

Now, what if we need an accumulated word count that includes the previous results as well? This is where stateful streaming comes into play. In stateful streaming, the previous state of your key is preserved and updated with the new results.

Note: For performing stateful operations, you need key-value pairs, because the streaming context remembers the state of your RDD based on the keys.

In our previous blog on Kafka-Spark streaming integration, we discussed how to integrate Apache Spark with Kafka and do real-time processing. We recommend going through that blog to see how to generate the input to the Spark streaming job using the Kafka producer console. You can refer to the link below for the same.
https://acadgild.com/blog/spark-streaming-and-kafka-integration/

Below is the Spark Scala program to perform stateful streaming using Kafka and Spark streaming.

Here are the Spark and Kafka versions we have used to build this application

Spark Version: 2.1.0

Kafka Version: 0.10.2

Here is the source code of the application:

package WordCount

import org.apache.spark.{ SparkConf, SparkContext }
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{ State, StateSpec }
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object stateFulWordCount {
  def main(args: Array[String]) {
    val conf = new SparkConf().setMaster("local[*]").setAppName("KafkaReceiver")
    val ssc = new StreamingContext(conf, Seconds(10))

    /*
     * Defining the Kafka server parameters
     */
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092,localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "use_a_separate_group_id_for_each_stream",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean))

    val topics = Array("acadgild-topic") // topics list

    val kafkaStream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](topics, kafkaParams))

    val splits = kafkaStream.map(record => (record.key(), record.value.toString)).flatMap(x => x._2.split(" "))

    // The update function receives the new values for a key and its previous state
    val updateFunc = (values: Seq[Int], state: Option[Int]) => {
      val currentCount = values.foldLeft(0)(_ + _)
      val previousCount = state.getOrElse(0)
      Some(currentCount + previousCount)
    }

    // Defining a checkpoint directory, which is mandatory for stateful operations
    ssc.checkpoint("hdfs://localhost:9000/WordCount_checkpoint")

    val wordCounts = splits.map(x => (x, 1)).reduceByKey(_ + _).updateStateByKey(updateFunc)

    kafkaStream.print() // prints the stream of data received
    wordCounts.print()  // prints the wordcount result of the stream

    ssc.start()
    ssc.awaitTermination()
  }
}

Here are the Spark Streaming and Kafka dependencies you need to add if you are building your application with SBT.

name := "StateSpark"

version := "0.1"

scalaVersion := "2.11.8"

// https://mvnrepository.com/artifact/org.apache.spark/spark-streaming_2.11
libraryDependencies += "org.apache.spark" % "spark-streaming_2.11" % "2.1.0"

// https://mvnrepository.com/artifact/org.apache.spark/spark-core_2.11
libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.1.0"

// https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka-0-10_2.11
libraryDependencies += "org.apache.spark" % "spark-streaming-kafka-0-10_2.11" % "2.1.0"

// https://mvnrepository.com/artifact/org.apache.kafka/kafka_2.11
libraryDependencies += "org.apache.kafka" % "kafka_2.11" % "0.10.2.0"

// https://mvnrepository.com/artifact/org.apache.kafka/kafka-clients
libraryDependencies += "org.apache.kafka" % "kafka-clients" % "0.10.2.0"

Here are the Spark Streaming and Kafka dependencies you need to add if you are building your application with Maven.

<dependencies>

<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-
streaming_2.11 -->

<dependency>

<groupId>org.apache.spark</groupId>

<artifactId>spark-streaming_2.11</artifactId>

<version>2.1.0</version>

</dependency>

<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core_2.11 -->

<dependency>

<groupId>org.apache.spark</groupId>

<artifactId>spark-core_2.11</artifactId>

<version>2.1.0</version>
</dependency>

<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-
kafka-0-10_2.11 -->

<dependency>

<groupId>org.apache.spark</groupId>

<artifactId>spark-streaming-kafka-0-10_2.11</artifactId>

<version>2.1.0</version>

</dependency>

<!-- https://mvnrepository.com/artifact/org.apache.kafka/kafka_2.11 -->

<dependency>

<groupId>org.apache.kafka</groupId>

<artifactId>kafka_2.11</artifactId>

<version>0.10.2.0</version>

</dependency>

</dependencies>

The major difference here is the addition of the update function and of the updateStateByKey call on the DStream.

val updateFunc = (values: Seq[Int], state: Option[Int]) => {
  val currentCount = values.foldLeft(0)(_ + _)
  val previousCount = state.getOrElse(0)
  val updatedSum = currentCount + previousCount
  Some(updatedSum)
}

val wordCounts = splits.map(x => (x, 1)).reduceByKey(_ + _).updateStateByKey(updateFunc)

The updateFunc runs for every key in the RDD: it takes the last state of the key, looks at the new values that have arrived for that key, applies whatever operation you want to perform on them, and returns the new state wrapped in a Some().

For this update function to work, you must provide a checkpoint directory for your StreamingContext, as in

ssc.checkpoint("hdfs://localhost:9000/WordCount_checkpoint")

The intermediate values are stored in this checkpoint directory for fault tolerance, so it is recommended to place the checkpoint directory in HDFS.

In the above update function, we receive the new values for the key as a Seq[Int] and the previously calculated state of the key as an Option[Int]. Inside the function, we aggregate the new values using foldLeft, add the old state value of the key, and return the updated sum wrapped in Some().

Let’s check for the results now.

First we enter the below line

hello every one from acadgild

In the below screenshot, you can see the result as

(one,1)

(hello,1)

(from,1)

(acadgild,1)

(every,1)
Now let's enter the same text again, 'hello every one from acadgild', and check the accumulated results from the start of our streaming job. We get the result below:

(one,2)

(hello,2)
(from,2)

(acadgild,2)

(every,2)

In the below screenshot, you can see the accumulated result.


Now let’s enter the same line again and check for the accumulated results.

hello every one from acadgild

Now we have got the result as follows:

(one,3)

(hello,3)

(from,3)

(acadgild,3)

(every,3)

We have got the accumulated results of our keys from the start of the job. You can see the same result in the screenshot below.
This is how we can perform stateful streaming using the updateStateByKey function.
Building data pipelines using Kafka Connect and Spark

The Kafka Connect framework comes included with Apache Kafka and helps in integrating Kafka with other systems and data sources. To copy data between Kafka and another system, users mainly opt for these Kafka connectors; many types of source connectors and sink connectors are available for Kafka.

Kafka Connect also enables Change Data Capture (CDC), which is important for analyzing the data inside a database. Kafka Connect continuously monitors your source database and reports the changes that keep happening in the data. You can use this data for real-time analysis using Spark or some other streaming engine.

In this tutorial, we will discuss how to connect Kafka to a file system and stream and analyze the
continuously aggregating data using Spark.

Before going through this blog, we recommend our users to go through our previous blogs on
Kafka (which we have listed below for your convenience) to get a brief understanding of what
Kafka is, how it works, and how to integrate it with Apache Spark.

https://acadgild.com/blog/kafka-producer-consumer/

https://acadgild.com/blog/guide-installing-kafka/

https://acadgild.com/blog/spark-streaming-and-kafka-integration/
We hope you have got your basics sorted out. Next, move into your Kafka installation directory, $KAFKA_HOME/config, and check for the file connect-file-source.properties.

In this file, you need to edit the following properties:

name=local-file-source //name of your file source
connector.class=FileStreamSource //connector class - default for FileStream
tasks.max=1 //number of tasks to run in parallel
file=test.txt //file location - change accordingly
topic=kafka_connect-test //name of the topic
We modified the above properties to these:
name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=/home/kiran/Desktop/kafka_connect_test.txt
topic=kafka_connect_test

Now, you need to check for the Kafka brokers’ port numbers.

By default, the port number is 9092; If you want to change it, you need to set it in the connect-
standalone.properties file.

bootstrap.servers=localhost:9092
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=true
value.converter.schemas.enable=true

With this, we are all set to build our application.

Now, start the Kafka servers, sources, and the zookeeper servers to populate the data into your
file and let it get consumed by a Spark application.

In one of our previous blogs, we had built a stateful streaming application in Spark that helped
calculate the accumulated word count of the data that was streamed in. We will implement the
same word count application here.

(You can refer to stateful streaming in Spark here: https://acadgild.com/blog/stateful-streaming-in-spark/)

In the application, you only need to change the topic’s name to the name you gave in the
connect-file-source.properties file.

Firstly, start the zookeeper server by using the zookeeper properties as shown in the command
below:

zookeeper-server-start.sh kafka_2.11-0.10.2.1/config/zookeeper.properties
Keep the terminal running, open another terminal, and start the Kafka server using the kafka
server.properties as shown in the command below:

kafka-server-start.sh kafka_2.11-0.10.2.1/config/server.properties

Keep the terminal running, open another terminal, and start the source connectors using the
stand-alone properties as shown in the command below:

connect-standalone.sh kafka_2.11-0.10.2.1/config/connect-standalone.properties
kafka_2.11-0.10.2.1/config/connect-file-source.properties

Keep all the three terminals running as shown in the screenshot below:
Now, whatever data you enter into the file will be converted into a string and stored in the topic on the brokers.

You can use the console consumer to check the output, as shown in the screenshot below:
In the above screenshot, you can see that the data is stored in JSON format. As also seen in the standalone properties file, the key.converter and value.converter parameters convert the key and value into JSON, which is the default converter configured in Kafka Connect.
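With schemas enabled, each record from the file source is wrapped in an envelope of the form {"schema": {...}, "payload": "<line from the file>"}. As a side note, here is a minimal Java sketch using Jackson (an assumed extra dependency, not part of this project's build) showing how the payload field can be pulled out of such a record:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class ExtractPayload {
    public static void main(String[] args) throws Exception {
        // A hypothetical record, as produced by the JsonConverter with schemas.enable=true
        String record = "{\"schema\":{\"type\":\"string\",\"optional\":false},\"payload\":\"hello every one from acadgild\"}";
        ObjectMapper mapper = new ObjectMapper();
        JsonNode root = mapper.readTree(record);
        String payload = root.get("payload").asText();
        System.out.println(payload); // prints the original line from the file
    }
}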

Now, using Spark, we need to subscribe to the topic to consume this data. In the JSON object, the actual data is present in the "payload" field.

So, in our Spark application, we need to make a small change to the program in order to pull out the actual data. For parsing the JSON string, we can use Scala's JSON parser available at:
scala.util.parsing.json.JSON.parseFull

And, the final application will be as shown below:

object stateFulWordCount {
def main(args: Array[String]) {
val conf = new SparkConf().setMaster("local").setAppName("KafkaReceiver")
val ssc = new StreamingContext(conf, Seconds(10))
/*
* Defining the Kafka server parameters
*/
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "localhost:9092,localhost:9092",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> "use_a_separate_group_id_for_each_stream",
"auto.offset.reset" -> "latest",
"enable.auto.commit" -> (false: java.lang.Boolean))
val topics = Array("kafka_connect_test") //topics list
val kafkaStream = KafkaUtils.createDirectStream[String, String](
ssc,
PreferConsistent,
Subscribe[String, String](topics, kafkaParams))
val splits = kafkaStream.map(record => (record.key(), record.value.toString)).
flatMap(x =>
scala.util.parsing.json.JSON.parseFull(x._2).get.asInstanceOf[Map[String,
Any]].get("payload"))
val updateFunc = (values: Seq[Int], state: Option[Int]) => {
val currentCount = values.foldLeft(0)(_ + _)
val previousCount = state.getOrElse(0)
val updatedSum = currentCount+previousCount
Some(updatedSum)
}
//Defining a check point directory for performing stateful operations
ssc.checkpoint("hdfs://localhost:9000/WordCount_checkpoint")
val wordCounts = splits.flatMap(x => x.toString.split(" ")).map(x => (x,
1)).reduceByKey(_+_).updateStateByKey(updateFunc)
wordCounts.print() //prints the wordcount result of the stream
ssc.start()
ssc.awaitTermination()
}
}

Now, we will run this application and provide some inputs to the file in real-time and we can see
the word counts results displayed in our Eclipse console.

Now, push that data into the file.


For whatever data that you enter into the file, Kafka Connect will push this data into its topics
(this typically happens whenever an event occurs, which means, whenever a new entry is made
into the file).

The Spark streaming job will continuously run on the subscribed Kafka topics. Here, we have
given the timing as 10 seconds, so whatever data that was entered into the topics in those 10
seconds will be taken and processed in real time and a stateful word count will be performed on
it.

In this case, as shown in the screenshot above, you can see the input given by us and the results
that our Spark streaming job produced in the Eclipse console. We can also store these results in
any Spark-supported data source of our choice.

And this is how we build data pipelines using Kafka Connect and Spark streaming!

We hope this blog helped you in understanding what Kafka Connect is and how to build data
pipelines using Kafka Connect and Spark streaming. Keep visiting our website,
www.acadgild.com, for more updates on big data and other technologies.

Spark Streaming and Kafka Integration

kiran April 26, 2017

Spark streaming and Kafka integration make one of the best combinations for building real-time applications. Spark is an in-memory processing engine on top of the Hadoop ecosystem, and Kafka is a distributed publish-subscribe messaging system. Kafka can stream data continuously from a source and Spark can process this stream of data instantly with its in-memory processing primitives. By integrating Kafka and Spark a lot can be done; we can even build a real-time machine learning application.

Spark streaming and Kafka Integration


Before going with Spark streaming and Kafka Integration, let’s have some basic knowledge
about Kafka by going through our previous blog on Kafka.

Kafka Producers and Consumers

You can install Kafka by going through this blog:

Installing Kafka

Now, let's get started with the integration. First, we need to start the daemons.

Start the zookeeper server in Kafka by navigating into $KAFKA_HOME with the command
given below:

./bin/zookeeper-server-start.sh config/zookeeper.properties

Keep the terminal running, open one new terminal, and start the Kafka broker using the
following command:

./bin/kafka-server-start.sh config/server.properties

After starting, leave both terminals running, open a new terminal, and create a Kafka topic with the following command:

./bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic acadgild-topic

Note down the port number and the topic name here; you need to pass these as parameters in Spark.
After creating the topic, you will get a message that your topic has been created:
Created topic "acadgild-topic"

You can also check the topic list using the following command:

./bin/kafka-topics.sh --list --zookeeper localhost:2181


Now, for sending messages to this topic, you can use the console producer and send messages continuously. You can use the following command to start the console producer.

./bin/kafka-console-producer.sh --broker-list localhost:9092 --topic acadgild-topic

You can see all the 4 consoles in the screenshot below:
You can now send messages using the console producer terminal.
Now in Spark, we will develop an application to consume the data that will do the word count
for us. Our Spark application is as follows:

import org.apache.spark._
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.kafka.KafkaUtils
object WordCount {
def main( args:Array[String] ){
val conf = new
SparkConf().setMaster("local[*]").setAppName("KafkaReceiver")
val ssc = new StreamingContext(conf, Seconds(10))
val kafkaStream = KafkaUtils.createStream(ssc, "localhost:2181","spark-
streaming-consumer-group", Map("acadgild-topic" -> 5))
//need to change the topic name and the port number accordingly
val words = kafkaStream.flatMap(x => x._2.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
kafkaStream.print() //prints the stream of data received
wordCounts.print() //prints the wordcount result of the stream
ssc.start()
ssc.awaitTermination()
}
}

KafkaUtils provides a method called createStream in which we need to provide the input stream details, i.e., the ZooKeeper host and port where the topic is registered, and the topic name.

The parameters of createStream, which returns a ReceiverInputDStream, are as follows:

createStream(StreamingContext ssc, String zkQuorum, String groupId, Map<String, Integer> topics, StorageLevel storageLevel)

Parameters

ssc – StreamingContext object

zkQuorum – Zookeeper quorum (hostname:port,hostname:port,..)

groupId – The group id for this consumer

topics – Map of (topic_name -> numPartitions) to consume. Each partition is consumed in its
own thread

storageLevel – Storage level to use for storing the received objects (default:
StorageLevel.MEMORY_AND_DISK_SER_2)

After receiving the stream of data, you can perform the Spark streaming context operations on
that data.
The above streaming job will run for every 10 seconds and it will do the wordcount for the data
it has received in those 10 seconds.

Here is an example, we are sending a message from the console producer and the Spark job will
do the word count instantly and return the results as shown in the screenshot below:

Here are the Maven dependencies of our project:

Note: In order to convert your Java project into a Maven project, right-click on the project -> Configure -> Convert to Maven Project.

Now, in the pom.xml file, add the following dependency configurations. All the required dependencies will then be downloaded automatically.

<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.11</artifactId>
    <version>1.6.3</version>
  </dependency>
  <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka_2.11 -->
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka_2.11</artifactId>
    <version>1.6.3</version>
  </dependency>
</dependencies>

This is how you can perform Spark streaming and Kafka Integration in a simpler way by creating
the producers, topics, and brokers from the command line and accessing them from the Kafka
create stream method.

We hope this blog helped you in understanding how to build an application having Spark
streaming and Kafka Integration.

Enroll for Apache Spark Training conducted by Acadgild for a successful career growth.

Streaming Twitter Data using Kafka

kiran July 5, 2016



In this post, we will be discussing how to stream Twitter data using Kafka. Before going through
this post, please ensure that you have installed Kafka and Zookeeper services in your system.

You can refer to this post for installing Kafka and this one for installing Zookeeper.

Streaming Twitter data using Hosebird

Twitter provides Hosebird client (hbc), a robust Java HTTP library for consuming
Twitter’s Streaming API.

Hosebird is the server implementation of the Twitter Streaming API. The Streaming API allows clients to
receive Tweets in near real-time. Various resources allow filtered, sampled or full access to some or all
Tweets. Every Twitter account has access to the Streaming API and any developer can build applications
today. Hosebird also powers the recently announced User Streams feature that streams all events
related to a given user to drive desktop Twitter clients.
Let's begin by starting the Kafka and Zookeeper services.
Start the Zookeeper server by moving into the bin folder of the Zookeeper installation directory and using the zkServer.sh start command.

Start the Kafka server by moving into the bin folder of the Kafka installation directory and using the command

./kafka-server-start.sh ../config/server.properties
In Kafka, there are two kinds of clients: producers and consumers. You can refer to them in detail here.

Producer class to stream twitter data

package kafka;
import java.util.*;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;
import com.google.common.collect.Lists;
import com.twitter.hbc.ClientBuilder;
import com.twitter.hbc.core.Client;
import com.twitter.hbc.core.Constants;
import com.twitter.hbc.core.endpoint.StatusesFilterEndpoint;
import com.twitter.hbc.core.processor.StringDelimitedProcessor;
import com.twitter.hbc.httpclient.auth.Authentication;
import com.twitter.hbc.httpclient.auth.OAuth1;
public class TwitterKafkaProducer {
private static final String topic = "hadoop";
public static void run() throws InterruptedException {
Properties properties = new Properties();
properties.put("metadata.broker.list", "localhost:9092");
properties.put("serializer.class", "kafka.serializer.StringEncoder");
properties.put("client.id","camus");
ProducerConfig producerConfig = new ProducerConfig(properties);
kafka.javaapi.producer.Producer<String, String> producer = new
kafka.javaapi.producer.Producer<String, String>(
producerConfig);
BlockingQueue<String> queue = new LinkedBlockingQueue<String>(100000);
StatusesFilterEndpoint endpoint = new StatusesFilterEndpoint();
endpoint.trackTerms(Lists.newArrayList("twitterapi",
"#AAPSweep"));
String consumerKey= TwitterSourceConstant.CONSUMER_KEY_KEY;
String consumerSecret=TwitterSourceConstant.CONSUMER_SECRET_KEY;
String accessToken=TwitterSourceConstant.ACCESS_TOKEN_KEY;
String
accessTokenSecret=TwitterSourceConstant.ACCESS_TOKEN_SECRET_KEY;
Authentication auth = new OAuth1(consumerKey, consumerSecret,
accessToken,
accessTokenSecret);
Client client = new ClientBuilder().hosts(Constants.STREAM_HOST)
.endpoint(endpoint).authentication(auth)
.processor(new StringDelimitedProcessor(queue)).build();
client.connect();
for (int msgRead = 0; msgRead < 1000; msgRead++) {
KeyedMessage<String, String> message = null;
try {
message = new KeyedMessage<String, String>(topic,
queue.take());
} catch (InterruptedException e) {
//e.printStackTrace();
System.out.println("Stream ended");
}
producer.send(message);
}
producer.close();
client.stop();
}
public static void main(String[] args) {
try {
TwitterKafkaProducer.run();
} catch (InterruptedException e) {
System.out.println(e);
}
}
}
Here, Twitter authorization is done through consumerKey, consumerSecret, accessToken, and accessTokenSecret. Hence, we pass them through a class called TwitterSourceConstant.
public class TwitterSourceConstant {
public static final String CONSUMER_KEY_KEY = "xxxxxxxxxxxxxxxxxxxxxxxx";
public static final String CONSUMER_SECRET_KEY = "xxxxxxxxxxxxxxxxxxxxxxxxx";
public static final String ACCESS_TOKEN_KEY = "xxxxxxxxxxxxxxxxxxxxxxxxxxx";
public static final String ACCESS_TOKEN_SECRET_KEY = "xxxxxxxxxxxxxxxxxxxxxx";
}
In the producer class, the line private static final String topic = "hadoop"; sets the topic used to stream the particular data from Twitter. We need to start this Producer class to begin streaming data from Twitter.
Now, we will write a Consumer class to print the streamed tweets. The consumer class is as follows:

Consumer class to stream twitter data

package kafka;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.ConsumerIterator;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;
public class KafkaConsumer {
private ConsumerConnector consumerConnector = null;
private final String topic = "twitter-topic1";
public void initialize() {
Properties props = new Properties();
props.put("zookeeper.connect", "localhost:2181");
props.put("group.id", "testgroup");
props.put("zookeeper.session.timeout.ms", "400");
props.put("zookeeper.sync.time.ms", "300");
props.put("auto.commit.interval.ms", "100");
ConsumerConfig conConfig = new ConsumerConfig(props);
consumerConnector = Consumer.createJavaConsumerConnector(conConfig);
}
public void consume() {
//Key = topic name, Value = No. of threads for topic
Map<String, Integer> topicCount = new HashMap<String, Integer>();
topicCount.put(topic, new Integer(1));
//ConsumerConnector creates the message stream for each topic
Map<String, List<KafkaStream<byte[], byte[]>>> consumerStreams =
consumerConnector.createMessageStreams(topicCount);
// Get Kafka stream for topic 'mytopic'
List<KafkaStream<byte[], byte[]>> kStreamList =
consumerStreams.get(topic);
// Iterate stream using ConsumerIterator
for (final KafkaStream<byte[], byte[]> kStreams : kStreamList) {
ConsumerIterator<byte[], byte[]> consumerIte =
kStreams.iterator();
while (consumerIte.hasNext())
System.out.println("Message consumed from topic[" +
topic + "] : " +
new
String(consumerIte.next().message()));
}
//Shutdown the consumer connector
if (consumerConnector != null) consumerConnector.shutdown();
}
public static void main(String[] args) throws InterruptedException {
KafkaConsumer kafkaConsumer = new KafkaConsumer();
// Configure Kafka consumer
kafkaConsumer.initialize();
// Start consumption
kafkaConsumer.consume();
}
}

When we run the above Consumer class, it will print all the tweets collected at that moment.
We have built this project through Maven, and the pom.xml file is as follows:

pom.xml

<?xml version="1.0"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.twitter</groupId>
<artifactId>hbc-example</artifactId>
<version>2.2.1-SNAPSHOT</version>
<packaging>jar</packaging>
<name>Hosebird Client Examples</name>
<properties>
<git.dir>${project.basedir}/../.git</git.dir>
<!-- this makes maven-tools not bump us to snapshot versions -->
<stabilized>true</stabilized>
<!-- Fill these in via https://dev.twitter.com/apps -->
<consumer.key>TODO</consumer.key>
<consumer.secret>TODO</consumer.secret>
<access.token>TODO</access.token>
<access.token.secret>TODO</access.token.secret>
</properties>
<dependencies>
<dependency>
<groupId>com.twitter</groupId>
<artifactId>hbc-twitter4j</artifactId>
<version>2.2.0</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-simple</artifactId>
<version>1.7.2</version>
</dependency>
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka_2.10</artifactId>
<version>0.8.2.1</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-deploy-plugin</artifactId>
<version>2.7</version>
<configuration>
<skip>true</skip>
</configuration>
</plugin>
<plugin>
<groupId>org.codehaus.mojo</groupId>
<artifactId>exec-maven-plugin</artifactId>
<version>1.2.1</version>
</plugin>
</plugins>
</build>
</project>

We run the Producer and Consumer programs in Eclipse. First, run the Producer to stream the tweets from Twitter. The Eclipse console of the Producer is shown in the screenshot.
Now, let's run the Consumer class. The console of the Consumer with the collected tweets is shown in the screenshot below.
Here, we have collected the tweets related to the hadoop topic, which has been set in the Producer class.
We can also check the topics on which Kafka is currently running, using the command

./kafka-topics.sh --zookeeper localhost:2181 --list

We can check the consumer console simultaneously as well, to see the tweets collected in real time, using the command below:
./kafka-console-consumer.sh --zookeeper localhost:2181 --topic "hadoop" --from-beginning

Below is the screenshot of the Consumer console with the tweets.

So, this is how we collect streaming data from Twitter using Kafka.
We hope this post has been helpful in understanding how to collect streaming data
from Twitter using Kafka. In case of any queries, feel free to comment below and we
will get back to you at the earliest.

For more updates on Big Data and other technologies keep visiting our site
www.acadgild.com

Hadoop Interview Questions Based on Sqoop and Kafka

What will happen if the target directory already exists during a Sqoop import?
Ans: Sqoop runs a map-only job, and if the target directory is already present, it will throw an exception.
What is the use of the warehouse directory in a Sqoop import?
Ans: The warehouse directory is the HDFS parent directory for the table destination. If we specify a target directory, all our files are stored in that location. With a warehouse directory, a child directory named after the table is created inside it, and all the files are stored inside that child directory.
What is the default number of mappers in a Sqoop job?
Ans: 4
How do you bring data directly into Hive using Sqoop?
Ans: To bring data directly into Hive using Sqoop, use the --hive-import option.
We wish to bring data in CSV format into HDFS from an RDBMS source, and a column in the RDBMS table contains ','. How do we import the data correctly in this case?
Ans: You can use the option --optionally-enclosed-by.
How do you import data directly into HBase using Sqoop?
Ans: You need to use --hbase-table to import data into HBase using Sqoop. Sqoop will import data into the table specified as the argument to --hbase-table. Each row of the input table will be transformed into an HBase Put operation on a row of the output table.
What is an incremental load in Sqoop?
Ans: It imports only the records that are new. For this, you should specify the --last-value parameter so that the Sqoop job imports values after the specified value.
What is the benefit of using a Sqoop job?
Ans: In a scenario where you must perform incremental imports multiple times, you can create a Sqoop job for the incremental import and run the job. Whenever you run the Sqoop job, it automatically identifies the last imported value, and the import starts after that value.
Where does a Sqoop job store the last imported value?
Ans: In its metastore.
What is Kafka?
Ans: It is a distributed, partitioned, and replicated publish-subscribe messaging framework.
How is Apache Kafka different from Apache Flume?
Ans: Kafka is a publish-subscribe messaging system, whereas Flume is a system for data collection, aggregation, and movement.
What are the important elements of Kafka?
Ans: Kafka Producer, Consumer, Broker, and Topic.
What role does ZooKeeper play in a Kafka cluster?
Ans: The basic responsibility of ZooKeeper is to coordinate the brokers in a Kafka cluster.
How can a consumer control the offsets it consumes?
Ans: Through automatic commit or manual commit, as shown in the sketch below.
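For illustration, here is a minimal sketch using the newer Java consumer API that disables auto commit and commits offsets manually. The group id and topic name are placeholders, not values from the posts above, and the kafka-clients library must be on the classpath:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ManualCommitConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "manual-commit-group"); // placeholder group id
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("enable.auto.commit", "false"); // turn off automatic offset commits
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("test-topic")); // placeholder topic
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(100);
            for (ConsumerRecord<String, String> record : records) {
                System.out.println(record.offset() + ": " + record.value());
            }
            consumer.commitSync(); // the consumer decides when offsets are committed
        }
    }
}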
We hope the above questions will help you in answering the Hadoop interview questions asked
in the various companies. For more details, enroll for Big data and Hadoop training conducted by
Acadgild.

Exporting Files From HDFS To MySQL Using Sqoop

prateek June 22, 2017


Apache Sqoop is a tool designed to efficiently transfer bulk data between Hadoop and structured
datastores such as relational databases.
In this blog, we will see how to export data from HDFS to MySQL using sqoop, with weblog
entry as an example.

Getting Started

Before you proceed, we recommend you to go through the blogs mentioned below which discuss
importing and exporting data into HDFS using Hadoop shell commands:
HDFS commands for beginners
Integrating MySQL and Sqoop in Hadoop
If you wish to import data from MySQL to HDFS, go through this.
Steps to Export Data from HDFS to MySQL

Follow below steps to transfer data from HDFS to MySQL table:


Step 1:
Create a new database in the MySQL instance.
CREATE DATABASE db1;
NOTE: It is not mandatory to create a new database all the time. You can ‘use’ preexisting
databases as well.
Step 2:
Create a table named acad.
USE db1;
CREATE TABLE acad (
emp_id int(2),
emp_name varchar(10),
emp_sal int(10),
date date);
The image below indicates that the table inside MySQL is empty.

Figure 1
Run the command describe <table name> to show the table's fields and their types.
This will help in comparing with the type of the data present inside HDFS that is ready to be mapped.

Figure 2

The files inside HDFS must have the same format as that of MySQL table, to enable the mapping
of the data.
Refer the screenshot below to see two files which are ready to be mapped inside MySQL.

Figure 3

Step 3:
Export the input.txt and input2.txt files from HDFS to MySQL:
sqoop export --connect jdbc:mysql://localhost/db1 --username sqoop --password root --table acad --export-dir /sqoop_msql/ -m 1
where -m denotes the number of mappers you want to run.
NOTE: The target table must already exist in MySQL.
To obtain a filtered map, we can use the following option:
--input-fields-terminated-by '\t' --mysql-delimiters
where '\t' denotes a tab.
Once the table inside MySQL and the data inside HDFS are ready to be mapped, we can execute the export command. Refer to the screenshot below:

Figure 4
Once you give the export command, the job completion statement should be displayed.

Figure 5

Note that only Map job needs to be completed. Other error messages will be displayed because
of software version compatibility. These errors can be ignored.

How it Works

Sqoop calls the JDBC driver named in the --connect statement from the location where Sqoop is installed. The --username and --password options are used to authenticate the user, and Sqoop internally generates the corresponding command against the MySQL instance.
The --table argument defines the MySQL table name that will receive the data from HDFS. This table must be created prior to running the export command. Sqoop uses the number of columns, their types, and the metadata of the table to validate the data inserted from the HDFS directory.
When the export statement is executed, it initiates and creates INSERT statements in MySQL. For example, the export job will read each line of the input.txt file from HDFS and produce the following intermediate statements:

INSERT INTO acad VALUES (5,"HADOOP",50000,'2011-03-21');
INSERT INTO acad VALUES (6,"SPARK",600000,'2011-03-22');
INSERT INTO acad VALUES (7,"JAVA",700000,'2011-03-23');

By default, Sqoop export creates INSERT statements. If the --update-key argument is given, UPDATE statements are created instead.
The -m argument sets the number of map jobs for reading the file splits from HDFS. Each
mapper will have its own connection to the MySQL Server.
Now, on querying inside MySQL, we see that all the data is mapped inside the table.

Figure 6

Hope this Sqoop export tutorial was useful in understanding the process of exporting data from
HDFS to MySQL. Keep visiting our website Acadgild for more updates on Big Data and other
technologies. Click here to learn Big Data Hadoop Development.

How to Import Data from MySQL to Hive Using Sqoop

Manjunath June 16, 2017

What is Sqoop Import?

Sqoop is a tool from Apache using which bulk data can be imported or exported from a database
like MySQL or Oracle into HDFS.

Now, we will discuss how we can efficiently import data from MySQL to Hive using Sqoop. But
before we move ahead, we recommend you to take a look at some of the blogs that we put out
previously on Sqoop and its functioning.
Beginners Guide for Sqoop
Sqoop Tutorial for Incremental Imports
Export Data from Hive to MongoDB
Importing Data from MySQL to HBase

How do we Use Sqoop?

In this example, we will be using the table Company1 which is already present in the MySQL
database.
We can use the describe command to see the schema of the Company1 table.
Describing the Table Schema
describe Company1;
The DESCRIBE TABLE command lists the following information about each column:

 Column name
 Type schema
 Type name
 Length
 Scale
 Nulls (Yes/No)

Displaying the Table Contents


We can use the following commands to display all the columns present in the table Company1.
select * from Company1;

Granting All Permissions to Root and Flush the Privileges


We can use the following command to grant superuser permissions to root:
grant all on *.* to 'root'@'localhost' with grant option;
flush privileges;
MySQL privileges are critical to the utility of the system as they allow each of their users to
access and utilize only those areas that are needed to perform their work functions. This is meant
to prevent a user from accidentally accessing an area which they should not have access to.
Additionally, this adds to the security of the MySQL server.
Whenever someone connects to a MySQL server, their identities are determined by the host used
to connect them and the user name specified. With this information, the server grants privileges
based upon the identity determined.
The above step finishes the MySQL part.

Now, let us open a new terminal and enter Sqoop commands to import data from MySQL into a Hive table.
I. Sqoop command to transfer selected columns from MySQL to Hive.
Use the following command to import selected columns from the MySQL Company1 table to the Hive Company1Hive table:
sqoop import --connect jdbc:mysql://localhost:3306/db1 --username root --split-by EmpId --columns EmpId,EmpName,City --table company1 --target-dir /myhive --hive-import --create-hive-table --hive-table default.Company1Hive -m 1

The above Sqoop command will create a new table with the name Company1Hive in the Hive
default database and transfer the 3 mentioned column (EmpId, EmpName and City) values from
the MySQL table Company1 to the Hive table Company1Hive.
Displaying the Contents of the Table Company1Hive
Now, let us see the transferred contents in the table Company1Hive.
select * from Company1Hive;

II. Sqoop command for transferring a complete table data from MySQL to Hive.
In the previous example, we transferred only the 3 selected columns from the MySQL table
Company1 to the Hive default database table Company1Hive.
Now, let us go ahead and transfer the complete table Company1 into a new Hive table with the following command:
sqoop import --connect jdbc:mysql://localhost:3306/db1 --username root --table Company1 --target-dir /myhive --hive-import --create-hive-table --hive-table default.Company2Hive -m 1

The above given Sqoop command will create a new table with the name Company2Hive in the
Hive default database and will transfer all this data from the MySQL table Company1 to the
Hive table Company2Hive.
In Hive.
Now, let us see the transferred contents in the table Company2Hive.
select * from Company2Hive;

We can observe from the above screenshot that we have successfully transferred these table
contents from the MySQL to a Hive table using Sqoop.
Next, we will do a vice versa job, i.e, we will export table contents from the Hive table to the
MySQL table.
III. Export command for transferring the selected columns from Hive to MySQL.
In this example we will transfer the selected columns from Hive to MySQL. For this, we need to
create a table before transferring the data from Hive to the MySQL database. We should follow
the command given below to create a new table.
create table Company2(EmpId int, EmpName varchar(20), City varchar(15));

The above command creates a new table named Company2 in the MySQL database with three
columns: EmpId, EmpName, and City.
Let us use the select statement to see the contents of the table Company2.
select * from Company2;
We can observe in the screenshot shown above that the table is empty. Let us use the Sqoop command to load the data from Hive into MySQL.
sqoop export --connect jdbc:mysql://localhost/db1 --username root -P --columns EmpId,EmpName,City --table Company2 --export-dir /user/hive/warehouse/company2hive --input-fields-terminated-by '\001' -m 1

The Sqoop command given above will transfer the values of the three specified columns (EmpId, EmpName, and City) from the Hive table Company2Hive to the MySQL table Company2.
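The --input-fields-terminated-by '\001' option is needed because Hive's default storage uses the Ctrl-A (\001) character as the field delimiter. If you want to check the files Sqoop will read before exporting, you can list and sample the table's warehouse directory; the exact file name below (part-m-00000, typical for a Sqoop-imported table) is an assumption and may differ in your setup.

hadoop fs -ls /user/hive/warehouse/company2hive
hadoop fs -cat /user/hive/warehouse/company2hive/part-m-00000 | head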
Displaying the Contents of the Table Company2
Now, let us see the transferred contents in the table Company2.
select * from Company2;

We can observe from the above image that we have now successfully transferred data from Hive
to MySQL.
IV. Export command for transferring the complete table data from Hive to MySQL.
Now, let us transfer the complete Hive table Company2Hive to a MySQL table. First, create the target table with the command given below:
create table Company2Mysql(EmpId int, EmpName varchar(20), Designation varchar(15), DOJ
varchar(15), City varchar(15), Country varchar(15));

Let us use the select statement to see the contents of the table Company2Mysql.
select * from Company2Mysql;
We can observe in the screenshot given above that the table is empty. Let us use a Sqoop command to load the data from Hive into MySQL.
sqoop export --connect jdbc:mysql://localhost/db1 --username root -P --table Company2Mysql --export-dir /user/hive/warehouse/company2hive --input-fields-terminated-by '\001' -m 1

The above Sqoop command will transfer the complete data from the Hive table Company2Hive to the MySQL table Company2Mysql.
Displaying the Contents of the Table Company2Mysql
Now, let us see the transferred contents in the table Company2Mysql.
select * from Company2Mysql;

We can see in the screenshot how we have successfully exported the table contents from Hive to MySQL using Sqoop. We can follow the above steps to transfer data between Apache Hive and structured databases.
Incremental Import in Sqoop to Load Data from MySQL to HDFS

prateek June 23, 2017


This post covers advanced topics in Sqoop, beginning with ways to import recently updated data from a MySQL table into HDFS. If you are new to Sqoop, you can browse through Installing Mysql and Sqoop and through the Beginners Guide to Sqoop for basic Sqoop commands.

Note: Make sure your Hadoop daemons are up and running. This real-world practice is done on a Cloudera system.

Sqoop supports two types of incremental imports: append and lastmodified.
You can use the --incremental argument to specify the type of incremental import to perform.

You should specify the append mode when importing a table where new rows are continually added with increasing row id values.

You must specify the column containing the row’s id with --check-column.

Sqoop imports rows where the check column has a value greater than the one specified with --last-value.

An alternate table update strategy supported by Sqoop is called lastmodified mode.
This should be used when rows of the source table are updated, and each such update sets the value of a last-modified column to the current timestamp.

Rows where the check column holds a timestamp more recent than the timestamp specified with --last-value are imported.
At the end of an incremental import, the value which should be specified as --last-value for a subsequent import is printed to the screen.

When running a subsequent import, you should specify --last-value in this way to ensure you import only the new or updated data.
This is handled automatically by creating an incremental import as a saved job, which is the preferred mechanism for performing a recurring incremental import.
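
As a rough sketch of these two ideas, the commands below show a lastmodified-mode import and the same import stored as a saved job. The table name orders, the check column last_update, the merge key id, and the timestamp are illustrative assumptions, not values taken from this exercise.

sqoop import --connect jdbc:mysql://localhost/db1 --username root -P \
  --table orders --target-dir /sqoopout_orders \
  --incremental lastmodified --check-column last_update \
  --last-value "2017-06-01 00:00:00" --merge-key id -m 1

sqoop job --create orders_incremental -- import --connect jdbc:mysql://localhost/db1 \
  --username root -P --table orders --target-dir /sqoopout_orders \
  --incremental lastmodified --check-column last_update \
  --last-value "2017-06-01 00:00:00" --merge-key id -m 1
sqoop job --exec orders_incremental

With the saved job, Sqoop records the new last value after each successful run, so repeated executions of sqoop job --exec pick up only rows modified since the previous run.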

Let’s see, with an example, the step-by-step procedure to perform an incremental import from a MySQL table.

Step 1

Start the MySQL service with the below command:


sudo service mysqld start

And enter MySQL shell using the below command:


mysql -u root -pcloudera

Step 2
Command to list database if already existing:
show databases;

Command to create a new database:


create database db1;

Command for using the database:


use db1;

Step 3

Creating a table and inserting values into it is done using the following syntax.
create table <table name>(<column 1 name> <data type>, <column 2 name> <data type>);
insert into <table name> values(<column 1 value 1>, <column 2 value 1>);
insert into <table name> values(<column 1 value 2>, <column 2 value 2>);
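
For example, a minimal version of the acad table used in the next step could be created and populated as follows. The column names and sample values here are assumptions made for illustration; they are not the exact rows shown in the original screenshots.

create table acad(id int, name varchar(20));
insert into acad values(1, 'anil');
insert into acad values(2, 'bala');
insert into acad values(3, 'charan');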

Step 4

Since the data is present in the MySQL table and Sqoop is up and running, we will fetch the data using the following command.
sqoop import --connect jdbc:mysql://localhost/db1 --username root --password cloudera --table acad -m 1 --target-dir /sqoopout
As confirmation of the result, you can see in the image the message "Retrieved 3 records."

Step 5

Let’s check whether any data is stored in HDFS.

This can be done by giving the following command in the terminal.
hadoop dfs -ls /sqoopout/

This shows that the part file has been created in our target directory.

Now, with the following command, we can view the content of the part file.
hadoop dfs -cat /sqoopout/part-m-00000

This confirms that the data from MySQL has arrived in HDFS. But what if the data in MySQL keeps growing and has more rows now than earlier?

The following steps will shed some light on this scenario.


Step 1
Let’s manually insert a few extra rows into the MySQL table acad.

Now, the following command, with a little extra syntax, will help you fetch only the new values from the table acad.

Step 2

The following syntax is used for the incremental options in the Sqoop import command.

--incremental <mode>
--check-column <column name>
--last-value <last check column value>
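
Putting these options together, a complete append-mode command for this example might look like the sketch below. The check column id and the last value 3 are assumptions based on the three rows imported earlier; adjust them to match your own data.

sqoop import --connect jdbc:mysql://localhost/db1 --username root --password cloudera \
  --table acad --target-dir /sqoopout \
  --incremental append --check-column id --last-value 3 -m 1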
As you can see in the image above, 3 more records have been retrieved and the incremental import is now complete.

Along with this, Sqoop prints a message for the next incremental import, telling you to give the last value as 10.
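
For the next incremental run, assuming the same hypothetical table and check column as in the sketch above, the command simply carries the reported last value forward:

sqoop import --connect jdbc:mysql://localhost/db1 --username root --password cloudera \
  --table acad --target-dir /sqoopout \
  --incremental append --check-column id --last-value 10 -m 1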

Step 3
Now let’s check and confirm the new data in HDFS.

This is how an incremental import is done every time, for any number of new rows.

