Kafka Streaming Data
In this post, we take a high-level look at the architecture of Apache Kafka, the role
ZooKeeper plays, and more.
by
Sylvester Daniel
Objective
In this article series, we will learn Kafka basics, Kafka delivery semantics and the configuration needed to achieve different semantics, Spark and Kafka integration, and optimization.
In Part 1 of this series, let's understand Kafka basics. In Part 2 of this series, we'll learn more about the Kafka producer and its configuration.
Problem Statement
The following could be some of the problem statements:
Many sources and target systems to integrate. Generally, the integration of many systems involves complexities like dealing with many protocols, message formats, etc.
Messaging systems must handle high-volume streams.
Use Cases
Streaming processing
Tracking user activity, log aggregation, etc.
De-coupling systems
What Is Kafka?
Kafka is a distributed publish-subscribe (pub-sub) messaging system in which various producers can write messages and various consumers can read them.
Key Terminologies
Topic, Partitions, and Offsets
Like a table in a NoSQL database, a topic is split into partitions that enable topics to be distributed across various nodes.
Partitions
Partitions enable topics to be distributed across the cluster.
One topic can have more than one partition scaling across nodes.
Offsets are unique per partition and messages are ordered only within a partition.
Kafka Architecture
ZooKeeper
ZooKeeper acts as ensemble layer (ties things together) and ensures high availability of the
Kafka cluster.
Kafka nodes are also called brokers. It’s important to understand that Kafka cannot work without
ZooKeeper.
From the list of ZooKeeper nodes, one of the nodes is elected as a leader and the rest of the
nodes follow the leader.
In the case of a ZooKeeper node failure, one of the followers is elected as leader.
More than one node is strongly recommended for high availability and more than 7 is not
recommended.
You can think of ZooKeeper like a project manager who manages resources in the project and
remembers the state of the project.
Broker
Topics that are created in Kafka are distributed across brokers based on the partition count, replication factor, and other factors.
Extending the analogy: if a team lead (the broker leading a partition) isn't available, the manager (ZooKeeper) takes care of assigning that work to other team members (brokers).
Replication
Replication means making a copy of a partition available on another broker.
Summary
This article is a continuation of part 1 Kafka technical overview article. In part 2 of the series
let's look into the details of how Kafka producer works and important configurations.
Producer Role
The primary role of a Kafka producer is to take producer properties and a record as inputs and write the record to an appropriate Kafka broker. Producers serialize, partition, compress, and load-balance data across brokers based on partitions.
Properties
A producer record should have the name of the topic it should be written to and the value of the record.
Generally, a list of bootstrap servers is passed instead of just one server; at least two bootstrap servers are recommended.
In order to send the producer record to an appropriate broker, the producer first establishes a connection to one of the bootstrap servers.
The bootstrap server returns the list of all the brokers available in the cluster and all the metadata details like topics, partitions, replication factor, and so on.
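For illustration, creating a producer with these properties might look like the following minimal Java sketch (the broker addresses, topic name, and serializer choices here are assumptions, not values from the article):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // At least two bootstrap servers are recommended so metadata can still be fetched if one is down.
        props.put("bootstrap.servers", "broker1:9092,broker2:9092");
        // Serializers used in the "serialize" step of the producer workflow.
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        // A producer record carries the topic name, an optional key, and the value.
        producer.send(new ProducerRecord<>("myTopic", "key-1", "hello kafka"));
        producer.close();
    }
}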
1. Serialize
2. Partition
3. Compress
4. Accumulate records
5. Group by broker and send
Serialize
In this step, the producer record gets serialized based on the serializers passed to the producer.
Both key and value are serialized based on the serializer passed. Some of the serializers include
string serializer, byteArray serializer and ByteBuffer serializers.
Partition
In this step, the producer decides which partition of the topic the record should get written to.
If no key is passed, partitions are chosen in a round-robin fashion.
It's important to understand that by passing the same key to a set of records, Kafka will ensure that messages are written to the same partition in the order received, for a given number of partitions.
If you want to retain the order of messages received, it's important to use an appropriate key for the messages.
A custom partitioner can also be passed to the producer to control which partition a message should be written to.
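As a rough illustration of such a custom partitioner (the class name and the routing rule are hypothetical; only the Partitioner interface itself comes from Kafka):

import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.utils.Utils;

// Hypothetical partitioner: keys starting with "vip-" always land on partition 0,
// while every other key is hashed across the remaining partitions.
public class VipFirstPartitioner implements Partitioner {

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        // Records without a key, or topics with a single partition, simply go to partition 0.
        if (keyBytes == null || numPartitions == 1) {
            return 0;
        }
        if (key.toString().startsWith("vip-")) {
            return 0;
        }
        // murmur2 is the same hash the default partitioner uses.
        return 1 + (Utils.toPositive(Utils.murmur2(keyBytes)) % (numPartitions - 1));
    }

    @Override
    public void configure(Map<String, ?> configs) { }

    @Override
    public void close() { }
}

The producer would pick it up through props.put("partitioner.class", VipFirstPartitioner.class.getName()).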
Compression
In this step, the producer record is compressed before it's written to the record accumulator. By default, compression is not enabled in the Kafka producer. Compression helps achieve better throughput, lower latency, and better disk utilization.
Record accumulator
In this step, the records are accumulated in a buffer per partition of a topic.
Records are grouped into batches based on producer batch size property. Each partition in a topic
gets a separate accumulator/buffer.
Sender thread
In this step, the batches of partitions held in the record accumulator are grouped by the broker to which they are to be sent. The records in a batch are sent to the broker based on the batch.size and linger.ms properties.
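Continuing the props object from the earlier sketch, the accumulator and sender behaviour described above map onto producer properties roughly like this (the values are illustrative, not recommendations):

// Batching and compression settings referenced above (illustrative values, not recommendations).
props.put("batch.size", "32768");        // maximum bytes per partition batch in the record accumulator
props.put("linger.ms", "20");            // how long the sender thread waits for a batch to fill up
props.put("compression.type", "snappy"); // compression is off by default; gzip, snappy, and lz4 are supported
props.put("buffer.memory", "33554432");  // total memory available to the record accumulator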
Summary
Based on the producer workflow and producer properties, tune the configuration to achieve
desired results.
In part 3 of the series let’s understand Kafka producer delivery semantics and how to tune some
of the producer properties to achieve desired results.
Kafka producer delivery semantics
Published on May 10, 2019
This article is a continuation of part 1 Kafka technical overview and part 2 Kafka producer
overview articles. Let's look into different delivery semantics and how to achieve those using
producer and broker properties.
Delivery semantics
Based on the broker and producer configuration, all three delivery semantics, "at most once," "at least once," and "exactly once," are supported.
At most once
In at-most-once delivery semantics, a message should be delivered at most once. It's acceptable to lose a message rather than deliver it twice in this semantic.
A few use cases of at most once include metrics collection, log collection, and so on. Applications adopting at-most-once semantics can easily achieve higher throughput and lower latency.
At least once
In at-least-once delivery semantics it is acceptable to deliver a message more than once, but no message should be lost.
The producer ensures that all messages are delivered for sure, even though it may result in message duplication.
This is the most preferred semantic of all. Applications adopting at-least-once semantics may have moderate throughput and moderate latency.
Exactly once
In exactly-once delivery semantics, a message must be delivered only once and no message should be lost.
This is the most difficult delivery semantic of all. Applications adopting exactly-once semantics may have lower throughput and higher latency compared to the other two semantics.
Different delivery semantics can be achieved in Kafka using the acks property of the producer and the min.insync.replicas property of the broker (considered only when acks = all).
Acks = 0
When the acks property is set to zero you get at-most-once delivery semantics. The Kafka producer sends the record to the broker and doesn't wait for any response.
Messages, once sent, will not be retried in this setting. The producer uses a "send and forget" approach with acks = 0.
Data loss
In this mode, the chance of data loss is high, as the producer does not confirm that the message was received by the broker. The message may not even have reached the broker, or a broker failure soon after message delivery can result in data loss.
Acks = 1
When this property is set to 1 you can achieve at-least-once delivery semantics. The Kafka producer sends the record to the broker and waits for a response from the broker. If no acknowledgment is received for the message sent, the producer will retry sending the message based on the retry configuration.
The retries property defaults to 0, so make sure it is set to the desired number or to Integer.MAX_VALUE.
Data loss
In this mode, the chance of data loss is moderate, as the producer confirms that the message was received by the broker (leader partition).
Because replication to the follower partitions happens after the acknowledgment, this may still result in data loss. For example, if the broker goes down after sending the acknowledgment but before replication, data is lost, as the producer will not resend the message.
Acks = All
When the acks property is set to all, you can achieve exactly-once delivery semantics. The Kafka producer sends the record to the broker and waits for a response from the broker. If no acknowledgment is received for the message sent, the producer will retry sending the message based on the retry configuration, n times.
In order to achieve exactly-once delivery semantics, the producer also has to be idempotent (enable.idempotence = true), and acks = all should be used in conjunction with min.insync.replicas.
Data loss
In this mode, the chance of data loss is low, as the producer receives confirmation that the message was received by the broker (leader and follower partitions) only after replication.
Because replication to the follower partitions happens before the acknowledgment, the chance of data loss is minimal. For example, if the broker goes down before replication and before sending the acknowledgment, the producer will not receive the acknowledgment and will send the message again to the newly elected leader partition.
Exception
Safe producer
In order to create a safe producer that ensures minimal data loss, use the producer and broker properties below.
Producer properties
Broker properties
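As a rough sketch, using the property names from the summary below plus enable.idempotence (an addition not listed in the article), a safe producer and its brokers might be configured like this; the values are illustrative assumptions to be tuned against your SLA:

// Producer side of a "safe" producer (illustrative values).
props.put("acks", "all");                                  // wait for the leader and all in-sync replicas
props.put("retries", String.valueOf(Integer.MAX_VALUE));   // retry on transient errors instead of dropping
props.put("max.in.flight.requests.per.connection", "1");   // avoid message reordering on retries
props.put("enable.idempotence", "true");                   // deduplicate retried messages on the broker

// Broker/topic side, considered only when acks=all (set in server.properties or per topic):
// min.insync.replicas=2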
The table below summarizes the impact of acks property on latency, throughput, and durability.
Summary
Configure the Kafka producer and broker to achieve the desired delivery semantics based on the following properties:
acks
retries
max.in.flight.requests.per.connection
min.insync.replicas
In part 4 of the series, let’s understand Kafka consumer, consumer group and how to achieve
different Kafka consumer delivery semantics.
Sylvester Daniel
This article is a continuation of part 1 Kafka technical overview, part 2 Kafka producer overview
and part 3 Kafka producer delivery semantics articles. Let's look into Kafka consumer group,
consumer and protocol used in detail.
Consumer Role
Just as the Kafka producer optimizes writes to Kafka, the Kafka consumer is used for optimal consumption of Kafka data. The primary role of a Kafka consumer is to take a Kafka connection and consumer properties and read records from the appropriate Kafka broker.
Properties
Multi-app Consumption
Multiple applications can consume the same Kafka topic independently; in other words, the offset consumed by one application could be different from that of another application. For example, if two (2) applications are consuming the same topic from Kafka, then internally Kafka creates 2 consumer groups.
Each consumer group can have one or more consumers.
If a topic has 3 partitions and an application consumes it, then a consumer group
would be created and a consumer in the consumer group will consume all
partitions of the topic. The diagram below depicts a consumer group with a single
consumer.
When an application wants to increase the speed of processing and
process partitions in parallel then it can add more consumers to the
consumer group.
Kafka takes care of keeping track of the offsets consumed per consumer in a consumer group, rebalancing consumers in the consumer group when a consumer is added or removed, and a lot more.
When there are multiple consumers in a consumer group, each consumer in the group is assigned
one or more partitions.
1. Find coordinator
2. Join group
3. Sync group
4. Heartbeat
5. Leave group
Coordinator
In order to create or join a group, a consumer first has to find the coordinator on the Kafka side that manages the consumer group.
The consumer makes a "find coordinator" request to one of the bootstrap servers. If a coordinator doesn't already exist, one is identified based on a hashing formula and returned in the response to the "find coordinator" request.
Join Group
Once the coordinator is identified, the consumer makes a “join group” request to the coordinator.
The coordinator returns the consumer group leader and metadata details.
If a leader doesn't already exist, then the first consumer of the group is elected as leader. The consuming application can also influence which member the coordinator elects as leader.
Sync Group
After the leader details are received for the join group request, the consumer makes a "sync group" request to the coordinator. This request triggers the rebalancing process across consumers in the consumer group, as the partitions assigned to the consumers will change after the "sync group" request.
Rebalance
All consumers in the consumer group will receive updated partition assignments that they need
to consume when a consumer is added/removed or “sync group” request is sent.
Data consumption by all consumers in the consumer group will be halted until the rebalance
process is complete.
Heartbeat
Each consumer in the consumer group periodically sends a heartbeat signal to its group
coordinator. In the case of heartbeat timeout, the consumer is considered lost and rebalancing is
initiated by the coordinator.
Leave Group
A consumer can choose to leave the group anytime by sending a “leave group” request. The
coordinator will acknowledge the request and initiate a rebalance. In case the leader node leaves
the group, a new leader is elected from the group and a rebalance is initiated.
Summary
As explained in part 1 of the article series, partitions are the unit of parallelism. As consumers in a consumer group are limited by the partitions in a topic, it's very important to decide your partition count based on the SLA and scale your consumers accordingly. Consumer offsets are managed and stored by Kafka in an internal "__consumer_offsets" topic. Each consumer in a consumer group follows the find coordinator, join group, sync group, heartbeat, and leave group protocol. Let's understand Kafka consumer properties and delivery semantics in the next part of the article.
Kafka Consumer Delivery Semantics
Published on September 1, 2019
Sylvester Daniel
This article is a continuation of part 1 Kafka technical overview, part 2 Kafka producer
overview, part 3 Kafka producer delivery semantics and part 4 Kafka consumer overview
articles. Let's understand different consumer configurations and consumer delivery semantics.
Subscribe
To read records from a Kafka topic, create an instance of a Kafka consumer and subscribe to one or more Kafka topics. You can subscribe to topics using a regular expression, for example, "myTopic.*".
props.put("bootstrap.servers", "broker1:9092,broker2:9092");
String>(props);
consumer.subscribe("myTopic.*");
Poll Method
Consumers read data from Kafka by polling for new data.
The poll method takes care of all coordination like partition rebalancing, heartbeat, and data
fetching.
When auto-commit is set to true, the poll method not only reads data but also commits the offsets and then reads the next batch of records as well.
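A minimal poll loop, continuing the consumer created above, might look like this (the timeout value and the processing step are placeholders):

// Continuously poll the subscribed topics; poll() drives the coordination described above
// (rebalancing, heartbeats) and fetches the next batch of records.
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(100);   // timeout in milliseconds
    for (ConsumerRecord<String, String> record : records) {
        // Placeholder processing: print partition, offset, key, and value.
        System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                record.partition(), record.offset(), record.key(), record.value());
    }
}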
Consumer Configurations
Kafka consumer behavior is configurable through the following properties. These properties are passed as key-value pairs when the consumer instance is created.
Enable.auto.commit
Defines how offsets are committed to Kafka; by default, "enable.auto.commit" is set to true. When this property is set to true, you may also want to control how frequently offsets should be committed using "auto.commit.interval.ms".
Key points:
Partition.assignment.strategy
In the previous article Kafka consumer overview, we learned that consumers in a consumer
group are assigned different partitions.
PartitionAssignor defines the interface for the assignment strategy. Kafka comes with built-in RangeAssignor and RoundRobinAssignor implementations, supporting the range and round-robin strategies respectively.
For example, if there are 7 partitions in each of 2 topics, consumed by 2 consumers, then the range strategy assigns the first 4 partitions (0 – 3) of both topics to the first consumer and 3 partitions (4 – 6) of both topics to the second consumer.
The partitions are unevenly assigned, with the first consumer processing 8 partitions and the second consumer processing only 6 partitions.
For example, if there are 7 partitions in each of 2 topics, consumed by 2 consumers, then the round-robin strategy assigns 4 partitions (0, 2, 4, 6) of the first topic and 3 partitions (1, 3, 5) of the second topic to the first consumer, and 3 partitions (1, 3, 5) of the first topic and 4 partitions (0, 2, 4, 6) of the second topic to the second consumer.
Key points:
partition.assignment.strategy – decides how partitions are assigned to consumers.
Range strategy (RangeAssignor) is the default.
Range strategy may result in an uneven assignment.
Fetch.min.bytes
Defines the minimum number of bytes required to send data from Kafka to the consumer. When the consumer polls for data, if the minimum number of bytes is not reached, Kafka waits until the pre-defined size is reached and then sends the data.
Fetch.max.wait.ms
Defines the maximum time to wait before sending data from Kafka to the consumer. While fetch.min.bytes controls the minimum bytes required, sometimes that minimum may not be reached for a long time; "fetch.max.wait.ms" keeps a bound on how long Kafka waits before sending data.
The default value of fetch.max.wait.ms is 500 ms (0.5 seconds). Increasing this value increases both the latency and the throughput of the application, so define both fetch.min.bytes and fetch.max.wait.ms based on your SLA.
Session.timeout.ms
Defines how long a consumer can be out of contact with the broker.
Max.partition.fetch.bytes
Max.poll.records
Auto.offset.reset
When reading from the broker for the first time, Kafka may not have any committed offset value; this property defines where to start reading from. You can set it to "earliest" or "latest": "earliest" will read all messages from the beginning, while "latest" will read only new messages that arrive after the consumer has subscribed to the topic.
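Putting the properties discussed in this section together, and continuing the props object used when creating the consumer above, a configuration might look like the following sketch (every value shown is illustrative only and should be revisited against your SLA):

// Illustrative consumer tuning; every value here should be revisited against your SLA.
props.put("enable.auto.commit", "true");           // commit offsets automatically from poll()
props.put("auto.commit.interval.ms", "5000");      // how often auto-commit runs
props.put("partition.assignment.strategy", "org.apache.kafka.clients.consumer.RoundRobinAssignor");
props.put("fetch.min.bytes", "1024");              // wait for at least 1 KB before responding
props.put("fetch.max.wait.ms", "500");             // ...but never wait longer than 500 ms
props.put("session.timeout.ms", "10000");          // consumer considered dead after 10 s without contact
props.put("max.partition.fetch.bytes", "1048576"); // cap per-partition fetch size
props.put("max.poll.records", "500");              // cap records returned by a single poll()
props.put("auto.offset.reset", "earliest");        // where to start when no committed offset exists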
Delivery semantics
As stated in the earlier article on Kafka producer delivery semantics, there are three delivery semantics, namely at most once, at least once, and exactly once.
You can still achieve output similar to exactly once by choosing a suitable data store that writes by a unique key, for example, any key-value store, an RDBMS (primary key), Elasticsearch, or any other store that supports idempotent writes.
At most once
In at-most-once delivery semantics, a message should be delivered at most once. It's acceptable to lose a message rather than deliver it twice in this semantic. Applications adopting at-most-once semantics can easily achieve higher throughput and lower latency.
By default, Kafka consumers use "at most once" delivery semantics, as "enable.auto.commit" is true.
In case the consumer fails after messages are committed as read but before processing them, the unprocessed messages are lost and will not be read again. Partition rebalancing will result in another consumer reading messages from the last committed offset. As shown in the diagram below, messages are read in batches, and some or all of the messages in a batch might be unprocessed yet still committed as processed.
At least once
In at-least-once delivery semantics it is acceptable to deliver a message more than once, but no message should be lost. The consumer ensures that all messages are read and processed for sure, even though it may result in message duplication.
This is the most preferred semantic of all. Applications adopting at-least-once semantics may have moderate throughput and moderate latency.
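One common way to get at-least-once behaviour on the consumer side is to disable auto-commit and commit offsets only after processing, roughly as sketched below (process() is a placeholder for your own logic):

props.put("enable.auto.commit", "false");   // take over offset commits manually

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(100);
    for (ConsumerRecord<String, String> record : records) {
        process(record);        // placeholder for your own processing logic
    }
    // Commit only after the whole batch has been processed. A crash before this line means
    // the batch is re-read and possibly re-processed: duplicates are possible, but nothing is lost.
    consumer.commitSync();
}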
Exactly once
In exactly-once delivery semantics, a message must be delivered only once and no message should be lost. This is the most difficult delivery semantic of all. Applications adopting exactly-once semantics may have lower throughput and higher latency compared to the other two semantics. As stated earlier, you can still achieve output similar to exactly once by choosing a suitable data store that writes by a unique key, for example, any key-value store, an RDBMS (primary key), Elasticsearch, or any other store that supports idempotent writes.
Summary
Configure the Kafka consumer to achieve the desired performance and delivery semantics based on the following properties:
enable.auto.commit
partition.assignment.strategy
fetch.max.wait.ms
fetch.min.bytes
session.timeout.ms
max.partition.fetch.bytes
max.poll.records
auto.offset.reset
Analyzing real-time streaming data with accuracy and storing this lightning fast data has become
one of the biggest challenges in the world of big data.
One of the best solutions for tackling this problem is building a real-time streaming
application with Kafka and Spark and storing this incoming data into HBase
using Spark.
Before going through this blog, we recommend our users go through our previous blogs on Kafka, Spark Streaming, and HBase: the Kafka and Spark integration blog, the beginner's guide to HBase, and the stateful streaming blog.
Here is the source code of our streaming application, which runs every 10 seconds and stores the results back to HBase.
package WordCount
import org.apache.spark.{ SparkConf, SparkContext }
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.{ State, StateSpec }
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.hadoop.hbase.{ HBaseConfiguration, HTableDescriptor, HColumnDescriptor }
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.mapreduce.{ TableInputFormat, TableOutputFormat }
import org.apache.hadoop.hbase.client.{ HBaseAdmin, Put, HTable }

object Kafka_HBase {
  def main(args: Array[String]) {
    val conf = new SparkConf().setMaster("local[2]").setAppName("Kafka_Spark_Hbase")
    val ssc = new StreamingContext(conf, Seconds(10))
    /*
     * Defining the Kafka server parameters
     */
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092,localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "use_a_separate_group_id_for_each_stream",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean))
    val topics = Array("acadgild_topic") //topics list
    val kafkaStream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](topics, kafkaParams))
    // Split the value of every incoming record into words
    val splits = kafkaStream.map(record => (record.key(), record.value.toString)).flatMap(x => x._2.split(" "))
    // Update function: add the counts of the current batch to the previously stored state of the key
    val updateFunc = (values: Seq[Int], state: Option[Int]) => {
      val currentCount = values.foldLeft(0)(_ + _)
      val previousCount = state.getOrElse(0)
      val updatedSum = currentCount + previousCount
      Some(updatedSum)
    }
    //Defining a checkpoint directory for performing stateful operations
    ssc.checkpoint("hdfs://localhost:9000/WordCount_checkpoint")
    val cnt = splits.map(x => (x, 1)).reduceByKey(_ + _).updateStateByKey(updateFunc)
    // Write one row per word into the HBase table Streaming_wordcount
    def toHBase(row: (_, _)) {
      val hConf = new HBaseConfiguration()
      hConf.set("hbase.zookeeper.quorum", "localhost:2182")
      val tableName = "Streaming_wordcount"
      val hTable = new HTable(hConf, tableName)
      val tableDescription = new HTableDescriptor(tableName)
      //tableDescription.addFamily(new HColumnDescriptor("Details".getBytes()))
      val thePut = new Put(Bytes.toBytes(row._1.toString()))
      thePut.add(Bytes.toBytes("Word_count"), Bytes.toBytes("Occurances"), Bytes.toBytes(row._2.toString))
      hTable.put(thePut)
    }
    val Hbase_inset = cnt.foreachRDD(rdd => if (!rdd.isEmpty()) rdd.foreach(toHBase(_)))
    ssc.start()
    ssc.awaitTermination()
  }
}
<dependencies>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming_2.11 -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>2.1.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core_2.11 -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.1.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka-0-10_2.11 -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
<version>2.1.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql_2.11 -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.1.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.kafka/kafka_2.11 -->
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka_2.11</artifactId>
<version>0.10.2.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.hbase/hbase-client -->
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-client</artifactId>
<version>1.3.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.hbase/hbase-common -->
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-common</artifactId>
<version>1.3.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.hbase/hbase-protocol -->
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-protocol</artifactId>
<version>1.3.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.hbase/hbase-hadoop2-compat -->
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-hadoop2-compat</artifactId>
<version>1.3.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.hbase/hbase-annotations -->
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-annotations</artifactId>
<version>1.3.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.hbase/hbase-server -->
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-server</artifactId>
<version>1.3.1</version>
</dependency>
</dependencies>
Note: In this application, we are performing stateful streaming, so the occurrences of the words will be accumulated right from the start of the streaming application. You can refer to our stateful streaming with Spark blog to know more about it.
Now this application will calculate the accumulated word counts of the words and update the
results back to HBase.
Now, let us run this application as a normal Spark streaming application and produce some data
through our Kafka console producer and check for the word count results in HBase.
In the screenshot below, you can see that our streaming application is running.
Let us give some input now.
In the above screenshot, you can see the word count results in HBase. Let's give the same input again and check for the accumulated results.
On top of these results in HBase, we can again build a Hive external table using the Hive-HBase storage handler and query the results. You can also read our blog post on HBase Write Using Hive to learn how to build an external table on a table in HBase.
This is how we can build real-time robust streaming applications using Kafka, Spark, and HBase.
HBase Write Using Hive
kiran March 27, 2017
HBase is one of the most popular NoSQL databases, and it runs on top of the Hadoop ecosystem. In this blog, we will be discussing how to write into an HBase table using Hive.
For learning the basics of HBase, you can refer to our blog on Beginners Guide of HBase.
Now let us start with creating a reference table in Hive for the table in
HBase.
create 'employee','emp_details'
put 'employee',1,'emp_details:first_name','Debra'
put 'employee',1,'emp_details:last_name','Burke'
put 'employee',1,'emp_details:email','dburke0@unblog.fr'
In the below screenshot, you can see that we have successfully inserted the data into the HBase table.
Creating an External Table in Hive
Let us query this data from Hive. As we have already created a table in HBase, we need to create an external table in Hive referring to the HBase table.
-- External Hive table over the HBase 'employee' table. The Hive-side table name and column types
-- below are assumed; the column mapping and table properties come from the original statement
-- (the storage handler uses 'org.apache.hadoop.hive.hbase.HBaseSerDe' internally).
CREATE EXTERNAL TABLE employee_hbase (
  id STRING,
  first_name STRING,
  last_name STRING,
  email STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  'hbase.columns.mapping'=':key,emp_details:first_name,emp_details:last_name,emp_details:email',
  'serialization.format'='1')
TBLPROPERTIES (
  'hbase.table.name'='employee');
Before running the insert command, set the below property in the Hive shell:
set hbase.mapred.output.outputtable=employee;
If the above property is not set, then you will get an error as shown below.
Let us now insert one record into this Hive table using the below insert statement.
In the below screenshot you can see that we have successfully inserted one record.
Let’s check for the same in the HBase table.
The Hive table which refers to the HBase table is non-native, so we cannot directly load data into it; instead, we first load the file into a native staging table with the following columns:
id STRING,
first_name STRING,
last_name STRING,
email STRING
In the below screenshot, we can see that we have successfully loaded the data into staging table.
You can download the emp dataset from here.
Now let us copy these contents into the employee table using insert overwrite statement.
-- Creates a Hive-managed table 'employee1' and, through the storage handler, an HBase table 'employee1'.
-- Column types are assumed; the column mapping and table properties come from the original statement.
CREATE TABLE employee1 (
  id STRING,
  first_name STRING,
  last_name STRING,
  email STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  'hbase.columns.mapping'=':key,emp_details:first_name,emp_details:last_name,emp_details:email',
  'serialization.format'='1')
TBLPROPERTIES (
  'hbase.table.name'='employee1');
Here the above query will create a Hive table with the name employee1 and also an HBase table with the name employee1. Let's now insert the data:
insert into table employee1
values('2','Robin','Harvey','rharvey1@constantcontact.com');
We have successfully created a HBase table from Hive and inserted the data into it.
Hive is best suited for data warehousing applications where data is stored, mined and reporting is done
based on the processing. Apache Hive bridges the gap between data warehouse applications and Hadoop
as relational database models are the base of most data warehousing applications.
It resides on top of Hadoop to summarize Big Data and makes querying and analyzing easy. HiveQL also
allows traditional map/reduce programmers to be able to plug in their custom mappers and reducers to
do more sophisticated analysis that may not be supported by the built-in capabilities of the language.
Until now, the only practical option to overcome this limitation has been to pull snapshots from the MySQL databases and dump them into new Hive partitions. This leads to stale data in the warehouse, and it does not scale well as the data volume continues to shoot through the roof.
To overcome this problem, Apache HBase is used in place of MySQL, together with Hive.
What is HBase?
HBase is a scale-out table store, which can support a very high rate of row-level updates over a large
amount of data.
As a result, a single Hive query can now perform complex operations such as join, union, and
aggregation across combinations of HBase and native Hive tables.
Likewise, Hive’s INSERT statement can be used to move data
between HBase and native Hive tables, or to reorganize data
within the HBase itself.
Storage Handlers are a combination of InputFormat, OutputFormat, SerDe, and specific code that
Hive uses to identify an external entity as a Hive table.
This allows the user to issue SQL queries seamlessly, whether the table represents a text file stored in
Hadoop or a column family stored in a NoSQL database such as Apache HBase, Apache Cassandra,
and Amazon DynamoDB.
Here is an example of connecting Hive with HBase using the HBase storage handler.
create 'employee','personaldetails','deptdetails'
The above statement will create an 'employee' table with two column families: 'personaldetails' and 'deptdetails'.
-- Hive table over the HBase 'employee' table created above (the Hive table name and column types
-- are assumed; the column mapping and table name property come from the original statement).
CREATE EXTERNAL TABLE hbase_employee (
  key STRING,
  fname STRING,
  lname STRING,
  salary STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping"=":key,personaldetails:fname,personaldetails:Lname,personaldetails:salary")
TBLPROPERTIES ("hbase.table.name"="employee");
If we are creating a non-native Hive table using a storage handler, then we must specify the STORED BY clause:
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
hbase.columns.mapping: maps the Hive columns to the HBase columns. The first column must be the key column, which is the same as the HBase row key column.
Now we can query the HBase table with SQL queries from the Hive shell. We hope going through this blog will help you integrate Hive and HBase and build a useful SQL interface on top of HBase. A select query fired from the Hive terminal will yield all the data from the HBase table.
Data Migration from SQL to
HBase Using MapReduce
In this tutorial, let us learn how to migrate the data present in MySQL to HBase, a NoSQL database, using MapReduce.
MySQL is one of the most widely used relational database systems, but due to the rapid growth of data nowadays, people are searching for better alternatives to store and process their data.
Let us now see how to migrate the data present in MySQL to HBase using Hadoop MapReduce.
Here, for reading the data in MySQL, we will be using DBInputFormat together with the following DBWritable implementation:
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.ResultSet;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

public class DBInputWritable implements Writable, DBWritable {
    private int id;
    private String name;

    // Hadoop serialization is not needed for this job, so these are left empty.
    public void readFields(DataInput in) throws IOException { }
    public void write(DataOutput out) throws IOException { }

    // Populate the fields from one row of the MySQL result set.
    public void readFields(ResultSet rs) throws SQLException {
        id = rs.getInt(1);
        name = rs.getString(2);
    }

    // Used when writing back to a database (not used in this job).
    public void write(PreparedStatement ps) throws SQLException {
        ps.setInt(1, id);
        ps.setString(2, name);
    }

    public int getId() { return id; }
    public String getName() { return name; }
}
Using DBInputFormat, our MapReduce code will be able to read the data from MySQL. In the table which we are using in this example, we have two fields, emp_id and emp_name, so we will take these two fields from the MySQL table and store them in HBase.
Here is our data present in MySQL: in the database Acadgild we have an employee table, and in that table we have two columns, emp_id and emp_name, as shown in the below screenshot.
Our DBInputFormat will read this data, so this will be the input of our mapper class. To store this data in
Hbase, we need to create a table in Hbase. You can use the below Hbase command to create a table.
create 'employee','emp_info'
In the above screenshot, you can see that employee table has been created in HBase. emp_info is the
column family which contains the information of the employee.
The mapper class which reads the input from the MySQL table is as follows:
import java.io.IOException;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;

public class DBInputFormatMap extends Mapper<LongWritable, DBInputWritable, ImmutableBytesWritable, Text> {
    @Override
    protected void map(LongWritable key, DBInputWritable value, Context context) {
        try {
            // emp_id becomes the HBase row key; emp_name is passed on as the value.
            String line = value.getName();
            String cd = value.getId() + "";
            context.write(new ImmutableBytesWritable(Bytes.toBytes(cd)), new Text(line));
        } catch (IOException e) {
            e.printStackTrace();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
}
Above is the Mapper class implementation, which reads the data from the MySQL table. The output of this Mapper class is the emp_id as key and the emp_name as value. HBase stores its data as bytes, so we take the key as an ImmutableBytesWritable.
So, from this mapper class, the MySQL table data is read and kept as key and value. The key will be the same in both MySQL and HBase.
Now the key and the rest of the columns of the MySQL table are sent to the reducer. For writing data into HBase, a reducer class called TableReducer is used. Using the TableReducer class, we write the MySQL data into HBase as follows:
import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;

public class Reduce extends TableReducer<ImmutableBytesWritable, Text, ImmutableBytesWritable> {
    @Override
    protected void reduce(ImmutableBytesWritable key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        String name = null;
        // Iterate over the values for this key; here there is only the emp_name column.
        for (Text val : values) {
            name = val.toString();
        }
        // Put to HBase: row key = emp_id, column family = emp_info, qualifier = name.
        Put put = new Put(key.get());
        put.add(Bytes.toBytes("emp_info"), Bytes.toBytes("name"), Bytes.toBytes(name));
        context.write(key, put);
    }
}
This TableReducer receives the key and the rest of the columns as values. We write a for-each loop to iterate over the rest of the columns. In this particular table, we have only two columns, so we take a variable name which stores the emp_name, and using the Put class provided by HBase we write the data into the HBase column family.
put.add(Bytes.toBytes("emp_info"), Bytes.toBytes("name"), Bytes.toBytes(name));
The above line writes the data into our HBase table; emp_info is the column family here. Below is the driver class implementation of this program.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.io.Text;

public class RDBMSToHDFS {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // JDBC connection details (the URL, user, and password here are placeholders -- use your own).
        DBConfiguration.configureDB(conf,
                "com.mysql.jdbc.Driver",
                "jdbc:mysql://localhost:3306/Acadgild", "root", "password");
        Job job = Job.getInstance(conf);
        job.getConfiguration().setInt("mapred.map.tasks", 1);
        job.setJarByClass(RDBMSToHDFS.class);
        job.setMapperClass(DBInputFormatMap.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setInputFormatClass(DBInputFormat.class);
        job.setOutputFormatClass(TableOutputFormat.class);
        job.setNumReduceTasks(1);
        // Wire the TableReducer to the HBase 'employee' table.
        TableMapReduceUtil.initTableReducerJob("employee", Reduce.class, job);
        // Read the emp_id and emp_name columns of the MySQL 'employee' table.
        DBInputFormat.setInput(
                job,
                DBInputWritable.class,
                "employee",      // table name
                null,            // conditions
                null,            // order by
                "emp_id", "emp_name");
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
We have built this program using Maven and here are the Maven dependencies for this program.
<dependencies>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-server</artifactId>
<version>1.1.2</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-client</artifactId>
<version>1.1.2</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-common</artifactId>
<version>1.1.2</version>
</dependency>
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>5.1.36</version>
</dependency>
</dependencies>
We can build an executable jar with these dependencies by adding Maven assembly plugin into the
pom.xml file. The plugin is as follows:
<plugin>
<artifactId>maven-assembly-plugin</artifactId>
<configuration>
<archive>
<manifest>
<mainClass>Mysql_to_Hbase.RDBMSToHDFS</mainClass>
</manifest>
</archive>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
</configuration>
</plugin>
Now we can use the command mvn clean compile assembly:single to build an executable jar for this program. After running this command, you should see the success message shown in the below screenshot.
In your project directory, inside target folder, you will be able to see the jar file created.
We will now run this jar as a normal Hadoop jar. After which we can check for the output in HBase
table.
In the above screenshot, you can see that the job has been completed successfully. Let us now check for
the output in our HBase table.
In the above screenshot, you can see the data in HBase table after running the jar file. We have
successfully migrated the data present in MySQL table to HBase table using MapReduce.
Data Bulk Loading into HBase
Table Using MapReduce
In this blog, we will be discussing the steps to bulk load file contents from an HDFS path into an HBase table using the Java MapReduce API. Before moving forward, you can follow the blogs linked below to gain more knowledge of HBase and its working.
Apache HBase gives us random, real-time read/write access to Big Data, but it is just as important how we get the data loaded into HBase.
The HBase Put API can be used to insert data into HBase, but inserting every record through the Put API is a lot slower than bulk loading. Thus, it is better to load a complete file's contents as a bulk into the HBase table using the bulk load function.
Bulk loading in HBase is the process of preparing HFiles and loading them directly into the region servers.
In our example, we will be using a sample data set, hbase_input_emp.txt, which is saved in our HDFS directory hbase_input_dir. You can download this sample data set for practice from the below link.
DATASET
HBase_input_emp.txt
The above data set contains four columns:
Column 1: Employee id
Column 2: Employee name
Column 3: Employee mail id
Column 4: Employee salary
You can follow the below steps to bulk load data contents from HDFS to HBase via a MapReduce job.
Extract the data from the source and load it into HDFS.
If the data is in Oracle or MySQL, you need to fetch it using Sqoop or a similar tool which gives a mechanism to import data directly from a database into HDFS. If your raw files, such as .txt, .pst, or .xml, are located on other servers, then simply pull them and load them into HDFS. HBase doesn't prepare HFiles by reading data directly from the source.
As for our example, our data is already available in our HDFS path. We can use the cat command to see the content of the input file hbase_input_emp.txt, which is saved in the hbase_input_dir folder of the HDFS path.
hdfs dfs -cat /hbase_input_dir/hbase_input_emp.txt
Transform the data into HFiles via a MapReduce job.
Here we write a MapReduce job which will process our data and create HFiles. There is only a Mapper class and no Reducer class: in our code we call HFileOutputFormat.configureIncrementalLoad(), which makes HBase plug in its own reducer.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class HBaseBulkLoad {

    public static class BulkLoadMap
            extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            // The file records are comma separated; the first column (employee id) is the row key.
            String[] parts = line.split(",");
            // The row key is converted to bytes, as HBase stores its data as bytes,
            // and wrapped in an ImmutableBytesWritable.
            ImmutableBytesWritable HKey = new ImmutableBytesWritable(Bytes.toBytes(parts[0]));
            // Create a Put with the first field as the row key and add the mail id into the
            // 'id' column family (the other columns can be added to the same Put in the same way).
            Put HPut = new Put(Bytes.toBytes(parts[0]));
            HPut.add(Bytes.toBytes("id"), Bytes.toBytes("mail_id"), Bytes.toBytes(parts[2]));
            context.write(HKey, HPut);
        }
    }

    public static void main(String[] args) throws Exception {
        // args: <input path> <output path for HFiles> <HBase table name>
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.mapred.outputtable", args[2]);
        Job job = Job.getInstance(conf);
        job.setJarByClass(HBaseBulkLoad.class);
        job.setMapperClass(HBaseBulkLoad.BulkLoadMap.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Put.class);
        job.setSpeculativeExecution(false);
        job.setReduceSpeculativeExecution(false);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(HFileOutputFormat.class);
        Path inputPath = new Path(args[0]);
        FileInputFormat.setInputPaths(job, inputPath);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // configureIncrementalLoad sets up HBase's own reducer, which sorts the Puts into HFiles.
        HTable table = new HTable(conf, args[2]);
        HFileOutputFormat.configureIncrementalLoad(job, table);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
After starting the HMaster service, use the below command to enter the HBase shell.
hbase shell
Create table:
We can use the create command to create a table in HBase.
create 'Academp','id'
Scan table:
We can use the scan command to see a table's contents in HBase.
scan 'Academp'
We can observe from the above image that no contents are available in the table Academp.
Export HADOOP_CLASSPATH:
In the next step, we need to load the HBase library files into the Hadoop classpath; this enables the Hadoop client to connect to HBase and get the number of splits.
export HADOOP_CLASSPATH=$HBASE_HOME/lib/*
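With the classpath exported, the job can be submitted as a normal Hadoop jar. A typical invocation might look like the line below (the jar and main class names are assumptions based on this example):

hadoop jar HBaseBulkLoad.jar HBaseBulkLoad /hbase_input_dir /hbase_output_dir Academp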
Here, the first parameter is the input directory where our input file is saved, the second parameter is the output directory where we will be saving the HFiles, and the third parameter is the HBase table name.
Now, let us use list command to list the HFiles which are stored in our output directory ‘hbase_output_dir’
hadoop fs -ls /hbase_output_dir
Apache Spark is a general processing engine built on top of the Hadoop eco-system. Spark has a
complete setup and a unified framework to process any kind of data. Spark can do batch processing as
well as stream processing. Spark has a powerful SQL engine to run SQL queries on the data; it also has an
integrated Machine Learning library called MlLib and a graph processing library called GraphX. As it can
integrate many things into it, we identify Spark as a unified framework rather than a processing engine.
Now coming to the real-time stream processing engine of Spark: Spark doesn't process data in true real time; it does near-real-time processing, handling the data in micro batches within just a few milliseconds.
Here we have a program where Spark's streaming context will process the data in micro batches, but generally this processing is stateless. Say we have defined the streaming context to run every 10 seconds; it will process the data that arrived within those 10 seconds. To process previous data there is the windows concept, but windows cannot give accumulated results from the starting timestamp.
But what if you need to accumulate the results from the start of the streaming job? That means you need to check the previous state of the RDD in order to update the new state of the RDD. This is what is known as stateful streaming in Spark.
Spark provides two APIs to perform stateful streaming: updateStateByKey and mapWithState. Now we will see how to perform a stateful word count using updateStateByKey.
updateStateByKey is a function on DStreams in Spark which accepts an update function as its parameter. In that update function, you work with the new values for the key (a Seq of values) and the previous state of the key (an Option).
Let's take a word count program. Say for the first 10 seconds we have given this data: "hello every one from acadgild". Now the word count result will be
(one,1)
(hello,1)
(from,1)
(acadgild,1)
(every,1)
Now, without the updateStateByKey function, if you give the same line, "hello every one from acadgild", in the next 10 seconds, you will again get the same result in those 10 seconds, i.e.,
(one,1)
(hello,1)
(from,1)
(acadgild,1)
(every,1)
Now, what if we need an accumulated word count that also counts the previous results? This is where stateful streaming comes into the act. In stateful streaming, your key's previous state is preserved and updated with the new results.
Note: for performing stateful operations, you need a key-value pair, because the streaming context remembers the state of your RDD based on the keys.
In our previous blog on Kafka-Spark Streaming integration, we discussed how to integrate Apache Spark with Kafka and do real-time processing. We recommend our users go through that blog to generate the input for the Spark streaming job using the Kafka producer console. You can refer to the link below for the same.
https://acadgild.com/blog/spark-streaming-and-kafka-integration/
Below is the Spark Scala program to perform stateful streaming using Kafka and Spark Streaming. The Spark and Kafka versions we have used to build this application are the ones listed in the dependencies below (Spark 2.1.0 and Kafka 0.10.2.0).
package WordCount

import org.apache.spark.SparkConf
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object stateFulWordCount {
  def main(args: Array[String]) {
    // Master, app name, broker address, and topic name are placeholders; adjust them to your setup.
    val conf = new SparkConf().setMaster("local[2]").setAppName("StatefulWordCount")
    val ssc = new StreamingContext(conf, Seconds(10))
    /*
     * Defining the Kafka server parameters
     */
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "use_a_separate_group_id_for_each_stream",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean))
    val topics = Array("acadgild-topic") //topic fed by the Kafka console producer
    val kafkaStream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](topics, kafkaParams))
    // Split the value of every incoming record into words
    val splits = kafkaStream.map(record => (record.key(), record.value.toString)).flatMap(x => x._2.split(" "))
    // Update function: add the current batch's count to the previous state of the key
    val updateFunc = (values: Seq[Int], state: Option[Int]) => {
      val currentCount = values.foldLeft(0)(_ + _)
      val previousCount = state.getOrElse(0)
      Some(currentCount + previousCount)
    }
    //Checkpoint directory for performing stateful operations
    ssc.checkpoint("hdfs://localhost:9000/WordCount_checkpoint")
    val wordCounts = splits.map(x => (x, 1)).reduceByKey(_ + _).updateStateByKey(updateFunc)
    wordCounts.print() //prints the word count result of the stream
    ssc.start()
    ssc.awaitTermination()
  }
}
Here are the Spark Streaming and Kafka dependencies which you need to add if you are building your application with SBT (the versions follow the Maven dependencies listed further below):
name := "StateSpark"
version := "0.1"
scalaVersion := "2.11.8"
// https://mvnrepository.com/artifact/org.apache.spark/spark-streaming_2.11
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.1.0"
// https://mvnrepository.com/artifact/org.apache.spark/spark-core_2.11
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.1.0"
// https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka-0-10_2.11
libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.1.0"
// https://mvnrepository.com/artifact/org.apache.kafka/kafka_2.11
libraryDependencies += "org.apache.kafka" %% "kafka" % "0.10.2.0"
// https://mvnrepository.com/artifact/org.apache.kafka/kafka-clients
libraryDependencies += "org.apache.kafka" % "kafka-clients" % "0.10.2.0"
Here are the Spark Streaming and Kafka dependencies which you need to add if you are building your application with Maven.
<dependencies>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming_2.11 -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>2.1.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.1.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka-0-10_2.11 -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
<version>2.1.0</version>
</dependency>
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka_2.11</artifactId>
<version>0.10.2.0</version>
</dependency>
</dependencies>
The major difference here is the addition of the update function and applying updateStateByKey to the DStream.
The updateFunc runs for every key in the RDD; it takes the last state of the key, looks at the new values for that key, performs whatever operation you want on them, and returns the new value as a Some().
For working with this update function, you must provide a checkpoint directory for your streaming context, as in:
ssc.checkpoint("hdfs://localhost:9000/WordCount_checkpoint")
Your intermediate values are stored in this checkpoint directory for fault tolerance, and it is suggested that you put your checkpoint directory in HDFS for better fault tolerance.
In the above update function, we receive the new values of the key as a Seq[Int] and the old value of that key as an Option[Int] (which was already calculated). Inside the function, we aggregate the new values of the key using foldLeft, take the old state value of the key, add the two together, and return the updated sum as a Some().
(one,1)
(hello,1)
(from,1)
(acadgild,1)
(every,1)
Now let’s enter the same text again ‘hello every one from acadgild’ and check for the accumulated
results from the starting of our streaming job. We have got the below result
(one,2)
(hello,2)
(from,2)
(acadgild,2)
(every,2)
Giving the same input a third time accumulates the counts further:
(one,3)
(hello,3)
(from,3)
(acadgild,3)
(every,3)
We have got the accumulated results of our keys from the starting. You can see the same result in the
below screen shot too.
This is how we can perform stateful streaming using updateStateByKey function.
Building data pipelines using Kafka Connect
and Spark
The Kafka Connect framework comes included with Apache Kafka and helps in integrating Kafka with other systems or other data sources. To copy data from a source to a destination file using Kafka, users mainly opt for these Kafka connectors, and many types of source connectors and sink connectors are available for Kafka.
The Kafka Connect also provides Change Data Capture (CDC) which is an important thing to be
noted for analyzing data inside a database. Kafka Connect continuously monitors your source
database and reports the changes that keep happening in the data. You can use this data for real-
time analysis using Spark or some other streaming engine.
In this tutorial, we will discuss how to connect Kafka to a file system and stream and analyze the
continuously aggregating data using Spark.
Before going through this blog, we recommend our users to go through our previous blogs on
Kafka (which we have listed below for your convenience) to get a brief understanding of what
Kafka is, how it works, and how to integrate it with Apache Spark.
https://acadgild.com/blog/kafka-producer-consumer/
https://acadgild.com/blog/guide-installing-kafka/
https://acadgild.com/blog/spark-streaming-and-kafka-integration/
We hope you have got your basics sorted out. Next, move into your Kafka installation directory, $KAFKA_HOME/config, and check for the file connect-file-source.properties.
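A typical connect-file-source.properties looks roughly like the following (the connector name and file path are assumptions; the topic must match what the Spark job subscribes to later):

name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=/tmp/kafka-connect-input.txt
topic=kafka_connect_test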
Now, you need to check for the Kafka brokers’ port numbers.
By default, the port number is 9092; If you want to change it, you need to set it in the connect-
standalone.properties file.
bootstrap.servers=localhost:9092
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=true
value.converter.schemas.enable=true
Now, start the Kafka servers, sources, and the zookeeper servers to populate the data into your
file and let it get consumed by a Spark application.
In one of our previous blogs, we had built a stateful streaming application in Spark that helped
calculate the accumulated word count of the data that was streamed in. We will implement the
same word count application here.
In the application, you only need to change the topic’s name to the name you gave in the
connect-file-source.properties file.
Firstly, start the zookeeper server by using the zookeeper properties as shown in the command
below:
zookeeper-server-start.sh kafka_2.11-0.10.2.1/config/zookeeper.properties
Keep the terminal running, open another terminal, and start the Kafka server using the kafka
server.properties as shown in the command below:
kafka-server-start.sh kafka_2.11-0.10.2.1/config/server.properties
Keep the terminal running, open another terminal, and start the source connectors using the
stand-alone properties as shown in the command below:
connect-standalone.sh kafka_2.11-0.10.2.1/config/connect-standalone.properties
kafka_2.11-0.10.2.1/config/connect-file-source.properties
Keep all the three terminals running as shown in the screenshot below:
Now, whatever data that you enter into the file will be converted into a string and will be stored
in the topics on the brokers.
You can use the console consumer to check the output as shown in the screenshot below:
In the above screenshot, you can see that the data is stored in JSON format. As also seen in the standalone properties file, we have used the key.converter and value.converter parameters to convert the key and value into JSON format, which is the default converter configured in Kafka Connect.
Now using Spark, we need to subscribe to the topics to consume this data. In the JSON object,
the data will be presented in the column for “payload.”
So, in our Spark application, we need to make a change to our program in order to pull out the
actual data. For parsing the JSON string, we can use Scala’s JSON parser present in:
scala.util.parsing.json.JSON.parseFull
object stateFulWordCount {
def main(args: Array[String]) {
val conf = new SparkConf().setMaster("local").setAppName("KafkaReceiver")
val ssc = new StreamingContext(conf, Seconds(10))
/*
* Defining the Kafka server parameters
*/
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "localhost:9092,localhost:9092",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> "use_a_separate_group_id_for_each_stream",
"auto.offset.reset" -> "latest",
"enable.auto.commit" -> (false: java.lang.Boolean))
val topics = Array("kafka_connect_test") //topics list
val kafkaStream = KafkaUtils.createDirectStream[String, String](
ssc,
PreferConsistent,
Subscribe[String, String](topics, kafkaParams))
val splits = kafkaStream.map(record => (record.key(), record.value.toString)).
flatMap(x =>
scala.util.parsing.json.JSON.parseFull(x._2).get.asInstanceOf[Map[String,
Any]].get("payload"))
val updateFunc = (values: Seq[Int], state: Option[Int]) => {
val currentCount = values.foldLeft(0)(_ + _)
val previousCount = state.getOrElse(0)
val updatedSum = currentCount+previousCount
Some(updatedSum)
}
//Defining a check point directory for performing stateful operations
ssc.checkpoint("hdfs://localhost:9000/WordCount_checkpoint")
val wordCounts = splits.flatMap(x => x.toString.split(" ")).map(x => (x,
1)).reduceByKey(_+_).updateStateByKey(updateFunc)
wordCounts.print() //prints the wordcount result of the stream
ssc.start()
ssc.awaitTermination()
}
}
Now, we will run this application, provide some inputs to the file in real time, and see the word
count results displayed in the Eclipse console.
The Spark streaming job will run continuously on the subscribed Kafka topics. Here, we have set
the batch interval to 10 seconds, so whatever data is written to the topic within those 10 seconds
is taken and processed in real time, and a stateful word count is performed on it.
In this case, as shown in the screenshot above, you can see the input given by us and the results
that our Spark streaming job produced in the Eclipse console. We can also store these results in
any Spark-supported data source of our choice.
And this is how we build data pipelines using Kafka Connect and Spark streaming!
We hope this blog helped you in understanding what Kafka Connect is and how to build data
pipelines using Kafka Connect and Spark streaming. Keep visiting our website,
www.acadgild.com, for more updates on big data and other technologies.
Spark Streaming and Kafka integration is one of the best combinations for building real-time
applications. Spark is an in-memory processing engine on top of the Hadoop ecosystem, and Kafka
is a distributed publish-subscribe messaging system. Kafka can stream data continuously from a
source, and Spark can process this stream of data instantly with its in-memory processing
primitives. By integrating Kafka and Spark, a lot can be done; we can even build real-time
machine learning applications.
Installing Kafka
Now, let's get started with the integration. First, we need to start the daemons.
Start the zookeeper server in Kafka by navigating into $KAFKA_HOME with the command
given below:
./bin/zookeeper-server-start.sh config/zookeeper.properties
Keep the terminal running, open one new terminal, and start the Kafka broker using the
following command:
./bin/kafka-server-start.sh config/server.properties
After starting, leave both the terminals running, open a new terminal, and create a Kafka topic
with the following command:
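Assuming a single-broker setup and the topic name used below, a typical command is:
./bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic acadgild-topic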
Note down the port number and the topic name here, you need to pass these as parameters in
Spark.
After creating the topic, you will get a message confirming that your topic has been created:
Created topic "acadgild-topic"
You can also check the topic list using the following command:
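Assuming the same $KAFKA_HOME layout as above, the listing command would typically be:
./bin/kafka-topics.sh --list --zookeeper localhost:2181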
You can see all the four consoles in the screenshot below:
You can now send messages using the console producer terminal.
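A typical console-producer command for this setup, assuming the topic created above, is:
./bin/kafka-console-producer.sh --broker-list localhost:9092 --topic acadgild-topic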
Now in Spark, we will develop an application to consume the data that will do the word count
for us. Our Spark application is as follows:
import org.apache.spark._
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.kafka.KafkaUtils

object WordCount {
  def main(args: Array[String]) {
    val conf = new SparkConf().setMaster("local[*]").setAppName("KafkaReceiver")
    val ssc = new StreamingContext(conf, Seconds(10))
    val kafkaStream = KafkaUtils.createStream(ssc, "localhost:2181",
      "spark-streaming-consumer-group", Map("acadgild-topic" -> 5))
    // need to change the topic name and the port number accordingly
    val words = kafkaStream.flatMap(x => x._2.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    kafkaStream.print() // prints the stream of data received
    wordCounts.print()  // prints the word count result of the stream
    ssc.start()
    ssc.awaitTermination()
  }
}
KafkaUtils provides a method called createStream in which we need to provide the input stream
details, i.e., the ZooKeeper quorum (host:port), the consumer group ID, and a map of topic names
to the number of consumer threads.
Parameters
topics – Map of (topic_name -> numPartitions) to consume. Each partition is consumed in its
own thread
storageLevel – Storage level to use for storing the received objects (default:
StorageLevel.MEMORY_AND_DISK_SER_2)
After receiving the stream of data, you can perform the Spark streaming context operations on
that data.
The above streaming job runs every 10 seconds and performs the word count on the data it has
received in those 10 seconds.
Here is an example: we send a message from the console producer, and the Spark job does the
word count instantly and returns the results, as shown in the screenshot below:
Note: In order to convert your Java project into a Maven project, right-click on the project -->
Configure --> Convert to Maven Project.
Now, in the project's pom.xml file, add the following dependency configurations; all the
required dependencies will then be downloaded automatically.
<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.11</artifactId>
    <version>1.6.3</version>
  </dependency>
  <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka_2.11 -->
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka_2.11</artifactId>
    <version>1.6.3</version>
  </dependency>
</dependencies>
This is how you can perform Spark Streaming and Kafka integration in a simple way: by creating
the producers, topics, and brokers from the command line and consuming them in Spark through
the KafkaUtils createStream method.
We hope this blog helped you in understanding how to build an application having Spark
streaming and Kafka Integration.
Enroll for Apache Spark Training conducted by Acadgild for successful career growth.
In this post, we will be discussing how to stream Twitter data using Kafka. Before going through
this post, please ensure that you have installed Kafka and Zookeeper services in your system.
You can refer to this post for installing Kafka and this one for installing Zookeeper.
Twitter provides Hosebird client (hbc), a robust Java HTTP library for consuming
Twitter’s Streaming API.
Hosebird is the server implementation of the Twitter Streaming API. The Streaming API allows clients to
receive Tweets in near real-time. Various resources allow filtered, sampled or full access to some or all
Tweets. Every Twitter account has access to the Streaming API and any developer can build applications
today. Hosebird also powers the recently announced User Streams feature that streams all events
related to a given user to drive desktop Twitter clients.
Let’s begin by starting Kafka and Zookeeper services.
Start the ZooKeeper server by moving into the bin folder of the ZooKeeper installation directory
and using the zkServer.sh start command.
Start the Kafka server by moving into the bin folder of the Kafka installation directory and using
the command
./kafka-server-start.sh ../config/server.properties.
In Kafka, there are two main client roles – producers and consumers. You can refer to them in
detail here. The following producer class uses the Hosebird client to stream tweets from Twitter
into a Kafka topic:
package kafka;

import java.util.Properties;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

import com.google.common.collect.Lists;
import com.twitter.hbc.ClientBuilder;
import com.twitter.hbc.core.Client;
import com.twitter.hbc.core.Constants;
import com.twitter.hbc.core.endpoint.StatusesFilterEndpoint;
import com.twitter.hbc.core.processor.StringDelimitedProcessor;
import com.twitter.hbc.httpclient.auth.Authentication;
import com.twitter.hbc.httpclient.auth.OAuth1;

public class TwitterKafkaProducer {

    private static final String topic = "hadoop";

    public static void run() throws InterruptedException {
        // Kafka producer configuration (old Scala producer API, Kafka 0.8.x)
        Properties properties = new Properties();
        properties.put("metadata.broker.list", "localhost:9092");
        properties.put("serializer.class", "kafka.serializer.StringEncoder");
        properties.put("client.id", "camus");
        ProducerConfig producerConfig = new ProducerConfig(properties);
        kafka.javaapi.producer.Producer<String, String> producer =
                new kafka.javaapi.producer.Producer<String, String>(producerConfig);

        // Queue into which the Hosebird client writes the raw tweets
        BlockingQueue<String> queue = new LinkedBlockingQueue<String>(100000);
        StatusesFilterEndpoint endpoint = new StatusesFilterEndpoint();
        endpoint.trackTerms(Lists.newArrayList("twitterapi", "#AAPSweep"));

        // Twitter OAuth credentials supplied through TwitterSourceConstant
        String consumerKey = TwitterSourceConstant.CONSUMER_KEY_KEY;
        String consumerSecret = TwitterSourceConstant.CONSUMER_SECRET_KEY;
        String accessToken = TwitterSourceConstant.ACCESS_TOKEN_KEY;
        String accessTokenSecret = TwitterSourceConstant.ACCESS_TOKEN_SECRET_KEY;
        Authentication auth = new OAuth1(consumerKey, consumerSecret, accessToken, accessTokenSecret);

        Client client = new ClientBuilder().hosts(Constants.STREAM_HOST)
                .endpoint(endpoint).authentication(auth)
                .processor(new StringDelimitedProcessor(queue)).build();
        client.connect();

        // Forward the first 1000 tweets from the queue to the Kafka topic
        for (int msgRead = 0; msgRead < 1000; msgRead++) {
            try {
                KeyedMessage<String, String> message =
                        new KeyedMessage<String, String>(topic, queue.take());
                producer.send(message);
            } catch (InterruptedException e) {
                System.out.println("Stream ended");
                break;
            }
        }
        producer.close();
        client.stop();
    }

    public static void main(String[] args) {
        try {
            TwitterKafkaProducer.run();
        } catch (InterruptedException e) {
            System.out.println(e);
        }
    }
}
Here, Twitter authorization is done through the consumerKey, consumerSecret, accessToken, and
accessTokenSecret values. Hence, we pass them through a class called TwitterSourceConstant.
public class TwitterSourceConstant {
public static final String CONSUMER_KEY_KEY = "xxxxxxxxxxxxxxxxxxxxxxxx";
public static final String CONSUMER_SECRET_KEY = "xxxxxxxxxxxxxxxxxxxxxxxxx";
public static final String ACCESS_TOKEN_KEY = "xxxxxxxxxxxxxxxxxxxxxxxxxxx";
public static final String ACCESS_TOKEN_SECRET_KEY = "xxxxxxxxxxxxxxxxxxxxxx";
}
In private static final String topic = "hadoop"; of the producer class, we set the topic used to
stream the particular data from Twitter. So, we need to start this producer class to start
streaming data from Twitter.
Now, we will write a Consumer class to print the streamed tweets. The consumer class is as follows:
package kafka;

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.ConsumerIterator;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;

public class KafkaConsumer {

    private ConsumerConnector consumerConnector = null;
    // Topic to consume from; this should match the topic used by the producer
    private final String topic = "twitter-topic1";

    public void initialize() {
        Properties props = new Properties();
        props.put("zookeeper.connect", "localhost:2181");
        props.put("group.id", "testgroup");
        props.put("zookeeper.session.timeout.ms", "400");
        props.put("zookeeper.sync.time.ms", "300");
        props.put("auto.commit.interval.ms", "100");
        ConsumerConfig conConfig = new ConsumerConfig(props);
        consumerConnector = Consumer.createJavaConsumerConnector(conConfig);
    }

    public void consume() {
        // Key = topic name, Value = number of threads for the topic
        Map<String, Integer> topicCount = new HashMap<String, Integer>();
        topicCount.put(topic, new Integer(1));

        // ConsumerConnector creates the message stream for each topic
        Map<String, List<KafkaStream<byte[], byte[]>>> consumerStreams =
                consumerConnector.createMessageStreams(topicCount);

        // Get the Kafka streams for our topic
        List<KafkaStream<byte[], byte[]>> kStreamList = consumerStreams.get(topic);

        // Iterate over each stream using a ConsumerIterator
        for (final KafkaStream<byte[], byte[]> kStreams : kStreamList) {
            ConsumerIterator<byte[], byte[]> consumerIte = kStreams.iterator();
            while (consumerIte.hasNext())
                System.out.println("Message consumed from topic[" + topic + "] : "
                        + new String(consumerIte.next().message()));
        }

        // Shut down the consumer connector
        if (consumerConnector != null) consumerConnector.shutdown();
    }

    public static void main(String[] args) throws InterruptedException {
        KafkaConsumer kafkaConsumer = new KafkaConsumer();
        // Configure the Kafka consumer
        kafkaConsumer.initialize();
        // Start consumption
        kafkaConsumer.consume();
    }
}
When we run the above consumer class, it will print all the tweets collected at that moment.
We have built this project with Maven, and the pom.xml file is as follows:
pom.xml
<?xml version="1.0"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
                             http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.twitter</groupId>
  <artifactId>hbc-example</artifactId>
  <version>2.2.1-SNAPSHOT</version>
  <packaging>jar</packaging>
  <name>Hosebird Client Examples</name>
  <properties>
    <git.dir>${project.basedir}/../.git</git.dir>
    <!-- this makes maven-tools not bump us to snapshot versions -->
    <stabilized>true</stabilized>
    <!-- Fill these in via https://dev.twitter.com/apps -->
    <consumer.key>TODO</consumer.key>
    <consumer.secret>TODO</consumer.secret>
    <access.token>TODO</access.token>
    <access.token.secret>TODO</access.token.secret>
  </properties>
  <dependencies>
    <dependency>
      <groupId>com.twitter</groupId>
      <artifactId>hbc-twitter4j</artifactId>
      <version>2.2.0</version>
    </dependency>
    <dependency>
      <groupId>org.slf4j</groupId>
      <artifactId>slf4j-simple</artifactId>
      <version>1.7.2</version>
    </dependency>
    <dependency>
      <groupId>org.apache.kafka</groupId>
      <artifactId>kafka_2.10</artifactId>
      <version>0.8.2.1</version>
    </dependency>
  </dependencies>
  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-deploy-plugin</artifactId>
        <version>2.7</version>
        <configuration>
          <skip>true</skip>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.codehaus.mojo</groupId>
        <artifactId>exec-maven-plugin</artifactId>
        <version>1.2.1</version>
      </plugin>
    </plugins>
  </build>
</project>
We need to run the Producer and Consumer programs in Eclipse. Therefore, we need
to run the Producer to stream the tweets from Twitter. The Eclipse console of the
Producer is as shown in the screenshot.
Now, let’s run the Consumer class of Kafka. The console of the Consumer with the
collected tweets is as shown in the below screenshot.
Here, we have collected the tweets related to Hadoop topic, which has been set in the
Producer class.
We can also check the topics on which Kafka is currently running.
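A typical listing command, assuming ZooKeeper on its default port, is:
./kafka-topics.sh --list --zookeeper localhost:2181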
We can also check the consumer console simultaneously, to see the tweets collected in real time,
using the command below:
./kafka-console-consumer.sh --zookeeper localhost:2181 --topic hadoop --from-beginning
So, this is how we collect streaming data from Twitter using Kafka.
We hope this post has been helpful in understanding how to collect streaming data
from Twitter using Kafka. In case of any queries, feel free to comment below and we
will get back to you at the earliest.
For more updates on Big Data and other technologies keep visiting our site
www.acadgild.com
What will happen if target directory already exists during sqoop import?
Ans: Sqoop runs a map-only job and if the target directory is present, it
will throw an exception.
What is the use of the warehouse directory in Sqoop import?
Ans: The warehouse directory is the HDFS parent directory for the table
destination. If we specify --target-dir, all our files are stored in that
location. But with --warehouse-dir, a child directory named after the table
is created inside it, and all the files are stored inside that child
directory.
What is the default number of mappers in a Sqoop job?
Ans: 4
How to bring data directly into Hive using Sqoop?
Ans: To bring data directly into Hive using Sqoop, use the --hive-import
option.
We wish to bring data in CSV format into HDFS from an RDBMS source.
A column in the RDBMS table contains ','. How do we import the data
unambiguously in this case?
Ans: You can use the option --optionally-enclosed-by.
How to import data directly to HBase using Sqoop?
Ans: You need to use --hbase-table to import data into HBase using
Sqoop. Sqoop will import data into the table specified as the argument to
--hbase-table. Each row of the input table is transformed into an HBase
put operation on a row of the output table.
What is incremental load in Sqoop?
Ans: It imports only the records that are new. For this, you should specify
the --last-value parameter so that the Sqoop job imports values after the
specified value.
What is the benefit of using a Sqoop job?
Ans: In a scenario where you must perform incremental imports multiple
times, you can create a Sqoop job for the incremental import and run the
job. Whenever you run the Sqoop job, it automatically identifies the last
imported value, and the import starts after that value.
Where does Sqoop job store the last imported value?
Ans: In its metastore.
What is Kafka?
Ans: It is a distributed, partitioned and replicated publish-subscribe
messaging framework.
How is Apache Kafka different from Apache Flume?
Ans: Kafka is a publish-subscribe messaging system, whereas Flume is a
system for data collection, aggregation, and movement.
What are important elements of Kafka?
Ans: Kafka Producer, Consumer, Broker, and Topic.
What role does ZooKeeper play in a Kafka cluster?
Ans: The basic responsibility of ZooKeeper is to coordinate the brokers in a
Kafka cluster and maintain the cluster state.
How can a consumer control the offsets it consumes?
Ans: Through automatic commit or manual commit.
We hope the above questions will help you in answering the Hadoop interview questions asked
in the various companies. For more details, enroll for Big data and Hadoop training conducted by
Acadgild.
Apache Sqoop is a tool designed to efficiently transfer bulk data between Hadoop and structured
datastores such as relational databases.
In this blog, we will see how to export data from HDFS to MySQL using sqoop, with weblog
entry as an example.
Getting Started
Before you proceed, we recommend you to go through the blogs mentioned below which discuss
importing and exporting data into HDFS using Hadoop shell commands:
HDFS commands for beginners
Integrating MySQL and Sqoop in Hadoop
If you wish to import data from MySQL to HDFS, go through this.
Steps to Export Data from HDFS to MySQL
Figure 1
Run the command describe <table name> to show the table's fields and their types.
This helps in comparing them with the data present inside HDFS that is ready to be mapped.
Figure 2
The files inside HDFS must have the same format as that of MySQL table, to enable the mapping
of the data.
Refer the screenshot below to see two files which are ready to be mapped inside MySQL.
Figure 3
Step 3:
Export the input.txt and input2.txt files from HDFS to MySQL:
sqoop export --connect jdbc:mysql://localhost/db1 --username sqoop --password root --table acad
--export-dir /sqoop_msql/ -m 1
Where -m denotes the number of mappers you want to run.
NOTE: The target table must exist in MySQL.
To specify the delimiters of the input files, we can use the following options:
--input-fields-terminated-by '\t' --mysql-delimiters
Where '\t' denotes a tab.
Once the table inside MySQL and data inside HDFS is ready to be mapped, we can execute the
export command. Refer the screenshot below:
Figure 4
Once you give the export command, the job completion statement should be displayed.
Figure 5
Note that only Map job needs to be completed. Other error messages will be displayed because
of software version compatibility. These errors can be ignored.
How it Works
Sqoop calls the JDBC driver specified in the --connect statement from the location where Sqoop is
installed. The --username and --password options are used to authenticate the user, and Sqoop
internally generates the corresponding commands against the MySQL instance.
The --table argument defines the MySQL table name that will receive the data from HDFS. This
table must be created prior to running the export command. Sqoop uses the number of columns,
their types, and the metadata of the table to validate the data inserted from the HDFS directory.
When the export statement is executed, it initiates and creates INSERT statements in MySQL. For
example, the export job will read each line of the input.txt file from HDFS and produce the
following intermediate statements.
Figure 6
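For illustration, assuming a two-column acad table and hypothetical weblog values, each generated
statement would take a form like:
INSERT INTO acad VALUES ('2015-01-01', '/index.html');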
Hope this Sqoop export tutorial was useful in understanding the process of exporting data from
HDFS to MySQL. Keep visiting our website Acadgild for more updates on Big Data and other
technologies. Click here to learn Big Data Hadoop Development.
What is Sqoop Import?
Sqoop is a tool from Apache with which bulk data can be imported from a database like MySQL or
Oracle into HDFS, or exported from HDFS back to the database.
Now, we will discuss how we can efficiently import data from MySQL to Hive using Sqoop. But
before we move ahead, we recommend you to take a look at some of the blogs that we put out
previously on Sqoop and its functioning.
Beginners Guide for Sqoop
Sqoop Tutorial for Incremental Imports
Export Data from Hive to MongoDB
Importing Data from MySQL to HBase
In this example, we will be using the table Company1 which is already present in the MySQL
database.
We can use the describe command to see the schema of the Company1 table.
Describing the Table Schema
describe Company1;
The DESCRIBE TABLE command lists the following information about each column:
Column name
Type schema
Type name
Length
Scale
Nulls (Yes/No)
Now, let us open a new terminal and enter Sqoop commands to import data from MySQL to
Hive table.
I. A Sqoop command is used to transfer selected columns from MySQL to Hive.
Now, use the following command to import selected columns from the MySQL Company1 table
to the Hive Company1Hive table.
sqoop import --connect jdbc:mysql://localhost:3306/db1 --username root --split-by EmpId
--columns EmpId,EmpName,City --table Company1 --target-dir /myhive --hive-import
--create-hive-table --hive-table default.Company1Hive -m 1
The above Sqoop command will create a new table with the name Company1Hive in the Hive
default database and transfer the 3 mentioned column (EmpId, EmpName and City) values from
the MySQL table Company1 to the Hive table Company1Hive.
Displaying the Contents of the Table Company1Hive
Now, let us see the transferred contents in the table Company1Hive.
select * from Company1Hive;
II. Sqoop command for transferring a complete table data from MySQL to Hive.
In the previous example, we transferred only the 3 selected columns from the MySQL table
Company1 to the Hive default database table Company1Hive.
Now, let us go ahead and transfer the complete table from the table Company1 to a new Hive
table by following the command given here:
sqoop import --connect jdbc:mysql://localhost:3306/db1 --username root --table Company1
--target-dir /myhive --hive-import --create-hive-table --hive-table default.Company2Hive -m 1
The above Sqoop command will create a new table with the name Company2Hive in the
Hive default database and will transfer all the data from the MySQL table Company1 to the
Hive table Company2Hive.
Now, in Hive, let us see the transferred contents in the table Company2Hive.
select * from Company2Hive;
We can observe from the above screenshot that we have successfully transferred these table
contents from the MySQL to a Hive table using Sqoop.
Next, we will do the reverse, i.e., we will export table contents from the Hive table to the
MySQL table.
III. Export command for transferring the selected columns from Hive to MySQL.
In this example we will transfer the selected columns from Hive to MySQL. For this, we need to
create a table before transferring the data from Hive to the MySQL database. We should follow
the command given below to create a new table.
create table Company2(EmpId int, EmpName varchar(20), City varchar(15));
The above command creates a new table named Company2 in the MySQL database with three
columns: EmpId, EmpName, and City.
Let us use the select statement to see the contents of the table Company2.
Select * from Company2;
We can observe that in the screenshot shown above, the table contents are empty. Let us use the
Sqoop command to load this data from Hive to MySQL.
sqoop export --connect jdbc:mysql://localhost/db1 --username root -P --columns
EmpId,EmpName,City --table Company2 --export-dir /user/hive/warehouse/company2hive
--input-fields-terminated-by '\001' -m 1
The Sqoop command given above will transfer the 3 mentioned column (EmpId, EmpName, and
City) values from the Hive table Company2Hive to the MySQL table Company2.
Displaying the Contents of the Table Company2
Now, let us see the transferred contents in the table Company2.
select * from Company2;
We can observe from the above image that we have now successfully transferred data from Hive
to MySQL.
IV. Export command for transferring the complete table data from Hive to MySQL.
Now, let us transfer this complete table from the Hive table Company2Hive to a MySQL table
by following the command given below:
create table Company2Mysql(EmpId int, EmpName varchar(20), Designation varchar(15), DOJ
varchar(15), City varchar(15), Country varchar(15));
Let us use the select statement to see the contents of the table Company2Mysql.
select * from Company2Mysql;
We observe in the screenshot given above that the table contents are empty. Let us use a Sqoop
command to load this data from Hive to MySQL.
sqoop export --connect jdbc:mysql://localhost/db1 --username root -P --table Company2Mysql
--export-dir /user/hive/warehouse/company2hive --input-fields-terminated-by '\001' -m 1
The above given Sqoop command will transfer the complete data from the Hive table
Company2Hive to the MySQL table Company2Mysql.
Displaying the Contents of the Table Company2Mysql
Now, let us see the transferred contents in the table Company2Mysql.
select * from Company2Mysql;
We can see here in the screenshot how we have successfully exported table contents from Hive
to MySQL using Sqoop. We can follow the above steps to transfer this data between Apache
Hive and the structured databases.
Keep visiting our site www.acadgild.com for more updates on Big Data and other technologies.
Enroll for our Big Data and Hadoop Training and kickstart a successful career as a big data
developer.
Incremental Import in Sqoop to Load Data
from MySQL to HDFS
This post covers the advanced topics in Sqoop, beginning with ways to import the recently
updated data in a MySQL table into HDFS. If you are new to Sqoop, you can browse through
Installing Mysql and Sqoop and through Beginners guide to Sqoop for basic Sqoop commands.
Note: Make sure your Hadoop daemons are up and running. This real-world practice is done
in Cloudera system.
You should specify the append mode when importing a table where new rows are continually
added with increasing row id values.
You must specify the column containing the row's id with --check-column.
Sqoop imports rows where the check column has a value greater than the one specified with
--last-value.
Rows where the check column holds a timestamp more recent than the timestamp specified with
--last-value are imported.
At the end of an incremental import, the value that should be specified as --last-value for a
subsequent import is printed to the screen.
Let’s see with an example, step by step procedure to perform incremental import from
MySQL table.
Step 1
Step 2
Command to list database if already existing:
show databases;
Step 3
Creating a table and inserting values into it is done using the following syntax:
create table <table name>(column name1, column name 2);
insert into <table name> values(column1 value1, column2 value1);
insert into <table name> values(column1 value2, column2 value2);
Step 4
Since the data is present in the MySQL table and Sqoop is up and running, we will fetch the
data using the following command:
sqoop import --connect jdbc:mysql://localhost/db1 --username root --password cloudera
--table acad -m 1 --target-dir /sqoopout
As confirmation of the result, you can see in the image the message "Retrieved 3 records."
Step 5
This shows that a part file has been created in our target directory.
Now, with the following command, we can view the content inside the part file:
hadoop dfs -cat /sqoopout/part-m-00000
This confirms that the data inside MySQL has arrived in HDFS. But what if the data inside
MySQL keeps growing and has more rows now than earlier?
The following command, with a little extra syntax, will help you feed only the new values
in the table acad.
Step 2
The following syntax is used for the incremental option in the Sqoop import
command:
--incremental <mode>
--check-column <column name>
--last-value <last check column value>
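Putting these options together for the acad table used above (the check column name id is an
assumption; use your table's row-id column), a typical incremental import command would look like:
sqoop import --connect jdbc:mysql://localhost/db1 --username root --password cloudera
--table acad -m 1 --target-dir /sqoopout --incremental append --check-column id
--last-value <last imported id value>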
As you can see in the above image, 3 more records have been retrieved and the incremental import
is now complete. The output also shows the message for the next incremental import, where you
need to give the last value as 10.
Step 3
Now let’s check and confirm the new data inside HDFS.
This is how incremental import is done every time for any number of new rows.
Keep visiting our website, www.acadgild.com, for more blogs on Big Data, Python, and other
technologies. Click here to learn Big Data Hadoop from our expert mentors.