Lecture Notes: Data Ingestion For Structured/Unstructured Data
In this module, you learnt about data ingestion for structured/unstructured data. You were first
introduced to what data ingestion is, and then you understood Apache Sqoop, a tool for ingesting
structured data, and Apache Flume, a tool for ingesting unstructured data. You got to know how
these tools are used for ingesting data and how insights are generated from the ingested data in
a typical production setup in the industry.
Data ingestion can be referred to as the process of absorbing data for immediate use or
storage. It is a bridge that transfers data from the source to a destination, such as HDFS, where it
can be used efficiently.
But there can be numerous data sources and the data can be in a plethora of formats. So, it
becomes a challenge for businesses across the globe to ingest data efficiently and process it at
a reasonable speed in order to get business insights. Moreover, the velocity at which data is
being generated nowadays is rapid, and the volume of data being generated is humongous.
With such a colossal volume of data being generated at such an accelerated velocity, it becomes
a challenge to ingest data and process it at a proper pace so that businesses can get the
desired analysis at the required time.
Moreover, to generate any business insights, you have to prioritise the sources of data you will
analyse, filter out data you don’t want, and set up a system that allows you to draw conclusions
from this data.
Data ingestion tools, such as Sqoop, Flume, Kafka, Gobblin, etc., which have come up in the
recent past, can help alleviate these challenges and generate business insights.
Along with the above-discussed challenges, there are network-related challenges as well in a
typical production setup for data ingestion.
1. Data collection: When beginning to conduct any analysis, you have to first collect data. This
is where you analyse the sources from where the data has to be imported, based on the
requirement, out of the many sources available.
2. Data validation: Once you prioritise the sources and collect the data, you have to validate
it so that the unwanted data can be filtered out.
3. Data routing: Next, you have to route the validated data to its particular destination such as
HBase, HDFS, Hive, or some other system, where it will be further analysed.
Once data has been successfully imported, validated, and routed to its particular destination,
data processing tools are run on the data to get the desired output. You may also use business
intelligence (BI) tools and business analytics (BA) tools to get meaningful insights.
There are multiple sources of data. We have structured data coming from RDBMS,
multi-structured data coming from social media, streaming data and real-time data coming in
from back-end servers, weblogs, etc.
To ingest data from all such sources, the following commands and tools can be used:
1. File transfer using commands: Use commands such as put to copy files from a local file
system to HDFS, and get to copy files from HDFS back to the local file system (a short sketch is
given after this list). For data transformation, however, you cannot depend on this method.
2. Apache Sqoop: Sqoop is short for SQL to Hadoop. It is used for importing data from RDBMS
to Big Data Ecosystem (Hive, HDFS, HBase, etc.) and exporting the data back to RDBMS after
it gets processed in the Big Data Ecosystem. It was created by Cloudera and was then open
sourced.
3. Apache Flume: Flume is a distributed data collection service for collecting, aggregating, and
transporting large amounts of real-time data from various sources to a centralised place, where
the data can be processed. It was released as an open source service.
4. Apache Kafka: Kafka is a fast, scalable distributed system that can handle a high volume of
data; it enables programmers to pass messages from one point to another. Apache Kafka was
developed by LinkedIn; later, it became open source.
5. Apache Gobblin: Gobblin is an open source data ingestion framework for extracting,
transforming, and loading a large volume of data from different data sources. It supports both
streaming and batch data ecosystems. Gobblin is LinkedIn's Data Ingestion Platform.
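As a quick illustration of the file-transfer approach in point 1 above, here is a minimal sketch; the file and directory paths are only examples:

# copy a local file into HDFS, then copy it back
hadoop fs -put /home/user/sales.csv /user/root/sales.csv
hadoop fs -get /user/root/sales.csv /home/user/sales_copy.csv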
Apart from the tools you have seen till now, tools such as Apache Storm, Apache Chukwa,
Apache Spark, etc. are used for ingesting data the way we want.
Since data can be in any type and choosing a particular tool for data ingestion depends a lot on
the type of data we are going to ingest, let’s have a quick refresher on the types of data.
1. Structured data: It is organised and can be stored in databases, i.e. in tables with rows and
columns. Therefore, it has the advantage of being entered, stored, queried, and analysed
efficiently using Structured Query Language (SQL). In other words, structured data is data that
can be read easily by machines. Examples include Aadhaar data, financial data, the metadata
of files, etc.
We are now familiar with the types of data, but each type of data can be stored in a plethora of
file formats, and each format has its advantages and disadvantages. When it comes to data
ingestion, file formats play a crucial role. A file format represents a way in which the information
is stored or encoded in a computer. Choosing a particular file format is important if you want to
have maximum efficiency in terms of factors such as processing power, network bandwidth,
available storage, etc. A file format directly affects the processing power of the system ingesting
the data, the capacity of the network carrying the data, and the available storage for storing the
ingested data. Following are some of the widely used file formats:
1. Text/CSV: CSV stands for Comma-Separated Values. This is the most commonly used file
format for exchanging large amounts of data between Hadoop and external systems. A CSV file has very
limited support for schema evolution and it does not support block compression. It is not
compact.
2. XML and JSON: XML stands for Extensible Markup Language, which defines a set of rules
using which documents can be encoded in a format that is both machine-readable and
human-readable. JSON stands for JavaScript Object Notation and is an open-standard file
format consisting of key-value pairs. Since both are text files, they don’t support block
compression and are not compact. Splitting is very tricky in these files as Hadoop doesn’t
provide a built-in InputFormat for either. Since splitting is tricky, these files cannot be split easily
to be processed in parallel in Hadoop.
3. Sequence Files: These are binary files that store data as binary key-value pairs and are,
hence, more compact than text files. A binary file is more compact because each of its bytes has
256 possible values, as opposed to pure unextended ASCII, which has only 128, so it is
immediately twice as compact. Sequence files support block compression and can be split and
processed in parallel, due to which they are extensively used in MapReduce jobs.
4. Avro: This is the language-neutral data serialization system developed within Apache’s
Hadoop project. Serialization is the process of turning data structures into a format that can be
used for either storage or transmission over a network. Language-neutral means Avro files can
be easily read later, even from a language different from the one used to write the file. Avro files
are self-describing, compressible, and splittable, and thus suitable for MapReduce jobs as they can
be split and processed in parallel. These are binary files and are, hence, more compact than
text files. Avro also supports schema evolution, which means that the schema used to read the
file doesn’t have to match the schema used to write the file.
Based on your requirement, you may choose the appropriate file format for storing data.
Schema Evolution: To understand schema evolution, suppose you are working on a particular
schema of a database in a software company. Now a requirement comes in from the client to
update the schema. So can you go ahead and directly update it? Well, no! This is because there
will be applications running and fetching data based on the current schema. If you simply update
the schema without considering them, those applications will be affected. So you need to evolve
the schema in such a way that it caters to the new requirements as well as to the existing
applications. This is schema evolution.
Block Compression: Since Hadoop stores large files by splitting them into blocks, it will be
best if the individual blocks can be compressed. Thus block compression is the process of
compressing each individual block.
Apache Sqoop, short for ‘SQL to Hadoop’, is used for ingesting relational data. Sqoop
provides efficient, bi-directional data transfer between Hadoop and relational databases, in
parallel. The data can be imported directly into HDFS or into HBase or Hive tables as per the
use case.
When you use Sqoop to transfer data, the dataset being transferred is split into multiple
partitions, and a map-only job is launched. Individual mappers are now responsible for the
transfer of each slice/partition of the dataset. The metadata of the database is used to handle
each data record in a type-safe manner.
Once Sqoop connects to the database, it uses JDBC to examine the table to be imported by
retrieving a list of all the columns and their SQL data types. The SQL data types (integer,
varchar, etc.) can be mapped to Java data types (Integer, String, etc.). Sqoop has a code
generator which creates a table-specific Java class to hold the records extracted from the table,
using the information given by JDBC about the data types, etc. Then Sqoop connects to
the cluster to submit a MapReduce job using the generated Java class. The dataset being
transferred is split into multiple partitions, and a map-only job is launched. The output of this is a
set of files containing the imported data. Since the import process is performed in parallel, the
output is in multiple files.
The basic ‘import’ command used for importing data from RDBMSs to the HDFS is —
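A minimal sketch of such an import, assuming a MySQL instance on localhost that hosts the ‘sqoopdemo’ database (the host name and credentials are placeholders; use the ones for your setup):

# import the 'Categories' table into HDFS (default location: the user's home directory)
sqoop import \
--connect jdbc:mysql://localhost/sqoopdemo \
--username root \
--password cloudera \
--table Categories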
After firing this command on your terminal, data from the ‘Categories’ table in the MySQL
database is transferred to HDFS. You can see the transferred data by running this command
on your console:
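For instance, assuming the default import location under your HDFS home directory:

hadoop fs -ls Categories
hadoop fs -cat Categories/part-m-00000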
There will be files generated like part-m-00000, part-m-00001, etc. depending on the number of
mappers used by Sqoop during import. Each mapper works on a portion of the data and imports
it to the corresponding file like part-m-00000.
There are two steps involved in the execution of the ‘import’ command:
1. First, Sqoop connects to the database and fetches the table metadata — the number of
columns and the column names and their types. In the ‘Categories’ table, it finds that there are
six columns, namely, ItemCode, ItemName, Category, Stock, LastStockedOn, and MRP with
VARCHAR, VARCHAR, VARCHAR, INT, DATE, and NUMBER as their data types, respectively.
Based on the metadata retrieved, Sqoop internally generates a Java class and compiles it using
the JDK and the Hadoop libraries available on the machine.
2. Next, Sqoop connects to the Hadoop cluster and submits a MapReduce job where each
mapper transfers a slice of the table’s data. As multiple mappers run at the same time, the
transfer of data between the database and the Hadoop cluster takes place in parallel.
Note: For all the tables imported using the given ‘import’ command structure, the primary key is
mandatory. If there’s no primary key, then the ‘--split-by’ parameter has to be specified like so:
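A hedged sketch, using the ItemCode column of the ‘Categories’ table purely as an illustration:

sqoop import \
--connect jdbc:mysql://localhost/sqoopdemo \
--username root \
--password cloudera \
--table Categories \
--split-by ItemCode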
Note: The target directories should be created via the hdfs user, and then the owner has to be
changed to root in order to avoid permission issues in AWS EC2. Wherever required, the
commands for creating the directory and changing the permissions have been given.
Create a /user/root in hdfs and change permissions for it by using the following commands:
[root@ip-10-0-0-163 ~]# su - hdfs
[hdfs@ip-10-0-0-163 ~]$ hadoop fs -mkdir /user/root
[hdfs@ip-10-0-0-163 ~]$ hadoop fs -chown root /user/root
[hdfs@ip-10-0-0-163 ~]$ exit
To list databases using Sqoop, you can run the following command:
To list the tables of a database using Sqoop, you can run the following command:
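Hedged sketches of both commands, assuming the same local MySQL connection details used earlier:

# list all databases visible to the given user
sqoop list-databases \
--connect jdbc:mysql://localhost \
--username root \
--password cloudera

# list the tables inside the 'sqoopdemo' database
sqoop list-tables \
--connect jdbc:mysql://localhost/sqoopdemo \
--username root \
--password cloudera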
The --target-dir parameter can be used to specify the target directory where you want to store
your imported data. The command to do this is —
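A sketch with --target-dir; the directory name is only an example:

sqoop import \
--connect jdbc:mysql://localhost/sqoopdemo \
--username root \
--password cloudera \
--table Categories \
--target-dir /input/data/categories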
The issue with the previous approach is that every time you run the ‘import’ script, you need to
change the target directory. To overcome this issue, you need to import the data to a
warehouse directory. The --warehouse-dir parameter can be used for multiple table imports.
That is, if you need to import the ‘Categories’ table followed by the ‘Products’ table, then you
just need to change the table name in each ‘import’ command like so:
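A sketch of the two imports, assuming /input/data/tables as the warehouse directory (the location used below):

sqoop import \
--connect jdbc:mysql://localhost/sqoopdemo \
--username root \
--password cloudera \
--table Categories \
--warehouse-dir /input/data/tables

sqoop import \
--connect jdbc:mysql://localhost/sqoopdemo \
--username root \
--password cloudera \
--table Products \
--warehouse-dir /input/data/tables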
As you can see, you only need to change the table name without changing the target location. As
a result, the data from both the tables gets imported at the specified location (/input/data/tables),
into folders with the same names as the tables being imported.
Note: To run commands on the Products table, you first need to have it in your ‘sqoopdemo’
database. Create a ‘Products’ table of your choice if you want to run the preceding command.
Create /input/data/ in hdfs and change permissions for it by using the following commands:
[root@ip-10-0-0-163 ~]# su - hdfs
[hdfs@ip-10-0-0-163 ~]$ hadoop fs -mkdir -p /input/data/
[hdfs@ip-10-0-0-163 ~]$ hadoop fs -chown root /input/data/
[hdfs@ip-10-0-0-163 ~]$ exit
Till now, you were focussed on importing just one table. What if you need to import all tables
from the database? You’ll obviously prefer not to write individual Sqoop Import commands for
every table! This is where you will use the ‘import-all-tables’ command. This command retrieves
a list of all the tables from the database and calls the ‘import’ tool to import the data of each
table in a sequential manner to avoid putting excessive load on the database server.
Note: To avoid conflict with the previously created ‘Categories’ directory in segment 3, remove it
using —
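A hedged sketch of the cleanup and of the import of all tables; it assumes the earlier ‘Categories’ import landed in your HDFS home directory:

# remove the old 'Categories' directory
hadoop fs -rm -r Categories

# import every table of the database, one after another
sqoop import-all-tables \
--connect jdbc:mysql://localhost/sqoopdemo \
--username root \
--password cloudera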
You can see the folders created for each table using this command:
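For instance, listing your HDFS home directory, where the per-table folders are created by default:

hadoop fs -ls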
In a typical database, you will have plenty of data, and in most situations, you won't require all of
this data. Instead, you may specifically need some rows that satisfy some properties. You can
do this in SQL by using the 'WHERE' clause. By using the command line parameter --where you
can specify the SQL condition that the imported rows should satisfy. The command is —
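A sketch with an illustrative condition on the ‘Stock’ column of the ‘Categories’ table:

sqoop import \
--connect jdbc:mysql://localhost/sqoopdemo \
--username root \
--password cloudera \
--table Categories \
--where "Stock > 100" \
--target-dir /input/data/categories_instock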
To use SQL queries within Sqoop Import commands, use the --query parameter. The command
looks like this:
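A hedged sketch of a free-form query import. Note that Sqoop requires the literal token $CONDITIONS in the WHERE clause of such queries, along with a --target-dir; the query itself is only illustrative:

sqoop import \
--connect jdbc:mysql://localhost/sqoopdemo \
--username root \
--password cloudera \
--query 'SELECT ItemCode, ItemName, MRP FROM Categories WHERE Stock > 100 AND $CONDITIONS' \
--split-by ItemCode \
--target-dir /input/data/categories_query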
Here, the ‘--query’ parameter holds the query, the ‘--split-by’ parameter indicates the column
that is to be used for slicing the data into parallel tasks (by default, this column is the primary
key of the main table). Remember, Sqoop connects to the Hadoop cluster and submits a
MapReduce job where each mapper transfers a slice of the table’s data. As multiple mappers
run at the same time, the transfer of data between the database and the Hadoop cluster takes
place in parallel.
So far, you have covered use cases that import data as a one-time operation. However, you can
also use Hadoop as an active backup for your database, i.e. you can keep the data in
Hadoop in sync with the relational database. It is in such cases that incremental import comes
into the picture. When the table is getting new rows and no existing rows are changed, you need
to import just the new rows. To achieve this incremental import, you have to use the parameter
--incremental with its value as append. Also, you need a mechanism to indicate how to track
new rows. You can use the primary key of the table to identify the new rows. The parameter
--check-column can be used to tell Sqoop which column should be checked to find whether a row is new.
The --last-value parameter indicates the value of the said column for the row inserted last. In
other words, Sqoop checks the column for rows that have values greater than the last value and
inserts just these rows.
Note: After running the initial ‘import’ command, remember to insert the two rows in the
Morning_Shift table as mentioned in the initial Database Setup document given in segment 2.
The commands for incremental import using the ‘--last-value’ parameter are —
mysql -u root -p
Enter password: cloudera
use sqoopdemo;
exit;
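A sketch of the incremental import itself, assuming the ‘Morning_Shift’ table has an ‘agentid’ column and that the last row imported earlier had agentid 4:

sqoop import \
--connect jdbc:mysql://localhost/sqoopdemo \
--username root \
--password cloudera \
--table Morning_Shift \
--incremental append \
--check-column agentid \
--last-value 4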
So, in this import example, Sqoop checks the ‘agentid’ column for all rows and only those rows
with their ‘agentid’ value greater than 4 are imported.
You can check the output by running the following command:
mysql -u root -p
Enter password: 123
use sqoopdemo;
exit;
Note: What if you want to import the updated rows as well, along with the newly added rows?
Sqoop provides the ‘lastmodified’ mode of the --incremental parameter for this.
To transfer processed or backed-up data from Hadoop back to the database, use sqoop export.
Note: Remember to create the table ‘Consolidated_Stocks‘ as mentioned in the Database
Setup document given in segment 2.
Create the table into which the exported data will be put by logging into MySQL.
mysql -u root -p
Enter password: cloudera
use sqoopdemo;
exit;
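A minimal export sketch, assuming the processed data sits in an HDFS directory such as /input/data/consolidated and is to be loaded into the ‘Consolidated_Stocks’ table created above:

sqoop export \
--connect jdbc:mysql://localhost/sqoopdemo \
--username root \
--password cloudera \
--table Consolidated_Stocks \
--export-dir /input/data/consolidated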
1. Sqoop connects to the database and extracts the metadata information about the table to
which data is to be loaded — the number of columns, data types of columns, and more. This
information is used by Sqoop to create and compile the Java class used in the MapReduce job.
2. Sqoop connects to the cluster and submits the MapReduce job to transfer the data from
Hadoop to the database table in parallel.
Note: What happens when the data is corrupted, say, for instance, when the type of the data in any of
the columns does not match the expected type? In this case, the export will fail. Sqoop does
not skip rows when it encounters an error, so the error must be fixed before running the command
again.
For EC2 Users:
Create a table where exported data will be put by logging into MySQL.
mysql -u root -p
Enter password: 123
use sqoopdemo;
exit;
Now that you are familiar with the import and export commands in Sqoop, it might have
occurred to you that instead of writing and executing them every single time you wanted to carry
out an operation, it would be great if you could have some jobs that were scheduled to run at a
particular time on their own.
To create Sqoop jobs that can be re-run as and when required, the command is —
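A sketch of creating a saved job; the job name ‘categories_import’ is only an example:

sqoop job \
--create categories_import \
-- import \
--connect jdbc:mysql://localhost/sqoopdemo \
--username root \
--password cloudera \
--table Categories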
Note: There is a space before import (-- import) in the above command.
You can use the following command to run this import job:
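For example, assuming the job name used above:

sqoop job --exec categories_import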
The Sqoop metastore, which is a metadata repository, stores all the saved jobs. You can view
the parameters of a saved job using the following command:
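For example (sqoop job --list shows all the saved jobs):

sqoop job --show categories_import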
Scheduling and using Sqoop jobs effectively will be discussed in Module 4: Oozie of this course.
We saw a case study for Sqoop on archiving relational databases in Hadoop using
different parameters available in Sqoop. For this case study, we made use of the Enron Email
Dataset. Enron Corporation was a US energy trading and utility company. When the company
collapsed in 2001, almost 0.5 million (5,00,000) internal emails were made public. This dataset is
available as a SQL dump. We loaded this dataset into MySQL and we saw how to import it to
Hadoop using Sqoop.
Sqoop supports importing data to file formats that do not generally support the NULL value. So,
it is required to encode this missing value or NULL value in the data.
Sqoop encodes it to a string constant 'null' in lowercase. But there is a problem with this
approach. If your data itself contains NULL as a string/regular value rather than a missing value,
this doesn't help. Also, it is possible that further processing steps expect a different substitution
for missing values.
In such cases, you can override the NULL substitution string with the --null-string and
--null-non-string parameters to any required value.
For text-based columns’ missing values, we use the --null-string parameter, and for other
columns, we use the --null-non-string parameter. For example, we have data for three columns
— ID, Name, and Address. Now, ID is an integer column, and Name and Address are of
character array/string data types. If ID has a NULL value, it gets encoded with the value
specified for --null-non-string. If Address has a NULL value, it gets encoded
with the value specified for the --null-string.
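A hedged sketch that substitutes ‘NA’ for missing text values (matching the output described next) and -1 for missing non-text values; the database name, the table name, and the -1 substitution are only illustrative:

sqoop import \
--connect jdbc:mysql://localhost/enron \
--username root \
--password cloudera \
--table Employees \
--null-string 'NA' \
--null-non-string '-1'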
So you can see how NULL was replaced by 'NA' in the files. In the same way, if any non-text
based column had contained NULL, it would have been replaced by the value defined for the
--null-non-string parameter.
Note: If we create directories directly such as enron instead of /enron, we don’t have to change
permissions every time. So in the commands ahead, we have created directories without /. This
is specific to AWS EC2. In the videos, you are shown how to create directories with /, but you can
do so without /, as shown in all the commands ahead. We are not creating the directories inside / to
avoid permission issues in AWS EC2.
Create a /user/root in HDFS and change permissions for it using the following commands:
[root@ip-10-0-0-163 ~]# su - hdfs
[hdfs@ip-10-0-0-163 ~]$ hadoop fs -mkdir /user/root
[hdfs@ip-10-0-0-163 ~]$ hadoop fs -chown root /user/root
[hdfs@ip-10-0-0-163 ~]$ exit
Sqoop uses four map tasks to achieve the import by default, but we can control the number of
mappers by using the --num-mappers parameter. However, increasing the number of mappers
does not necessarily reduce the processing time. It is possible that the database
gets overwhelmed by a large number of mappers and loses time in context switching between
these tasks rather than getting the data transferred. The best method to determine the optimal
number of mappers is to go with trial and error. You can set the number of mappers at a starting
value, increase it, and test until no further improvement is achieved.
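For instance, a sketch that raises the mapper count from the default four to eight; eight is only a starting value to tune from, and the database and table names are illustrative:

sqoop import \
--connect jdbc:mysql://localhost/enron \
--username root \
--password cloudera \
--table Employees \
--num-mappers 8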
We can import data in a binary file format via Sqoop. Binary formats are used to store images,
PDFs, etc. If the text itself contains characters that are used as separators in the text file (CSV),
then it is preferable to use binary formats. To import data in a sequence file format, use the
following command:
Import:
You can see that all the files created have a .avro suffix.
You can see that all files are compressed and, hence, have a .gz suffix in their filename.
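Hedged sketches of the three variants referred to above (a sequence file import, an Avro import, and a compressed gzip import); the database, table, and directory names are only examples:

# store the imported data as sequence files
sqoop import \
--connect jdbc:mysql://localhost/enron \
--username root \
--password cloudera \
--table Employees \
--as-sequencefile \
--target-dir enron_seq

# store the imported data as Avro data files (.avro)
sqoop import \
--connect jdbc:mysql://localhost/enron \
--username root \
--password cloudera \
--table Employees \
--as-avrodatafile \
--target-dir enron_avro

# compress the imported text files with the default gzip codec (.gz)
sqoop import \
--connect jdbc:mysql://localhost/enron \
--username root \
--password cloudera \
--table Employees \
--compress \
--target-dir enron_gz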
Once you get this Enron Email Data into Hadoop, you can run data processing tools on the data
to get the desired output. You may also use business intelligence (BI) tools and business
analytics (BA) tools to get meaningful insights. In a typical production setup in the industry,
Sqoop is used to ingest data such as Enron Email Data from RDBMS into Hadoop, and then,
the analysis is performed on that data.
Note: We created the database inside MySQL and then imported data to Hadoop. But in a
practical scenario, we would already have the enron database on-premise or on the cloud, and Sqoop
would directly connect to it to import data.
Note: Just as you have connected Sqoop to MySQL, you can also connect Sqoop to other
RDBMSs like Oracle, DB2, etc.
Industry use case of Sqoop: There is an RDBMS system that hosts five years of legacy data,
which needs to be imported to Hadoop, and this imported data needs to be processed inside the
Big Data Ecosystem to generate business insights. The RDBMS system is on-premise and Hadoop
is on the cloud, and there is a networking layer involved between the RDBMS system and Hadoop.
4.2 Sqoop In A Typical Production Architecture
You can see from the diagram how Sqoop fits in a typical production architecture and what
factors are involved in building the connection between the RDBMS and Hadoop to transfer data via
Sqoop. There are network challenges involved in the production setup, and resolving these
challenges plays an important role in the successful transfer of the data. Once those
network challenges are taken care of, Sqoop can transfer data from the RDBMS system to Hadoop.
Moreover, we also have network latency issues when data transfer takes place between two
networks in different zones. Here, since the RDBMS system is on-premise and Hadoop is on the
cloud, network latency will also be a factor slowing down the transfer of data.
There is a concept of availability zones using which we can reduce network latency.
You can see from the diagram how Sqoop fits into the entire big data ecosystem in a typical
production setup in the industry to generate business insights.
Apache Flume is a tool for ingesting unstructured data. According to the Flume user
guide, Apache Flume is a distributed, reliable, and available system for efficiently collecting,
aggregating, and moving large amounts of log data from many different sources to a centralised
data store. Flume is also used for ingesting massive quantities of event data such as network
traffic data, social-media-generated data, data from email messages, etc. All of this is
unstructured and there are challenges involved in ingesting unstructured data.
The second challenge involves latency issues. Servers are scattered across multiple geographic
locations, but they all try to write data to a centralised Hadoop system. The servers at far-off
geographic locations will not be able to write data at a fast speed due to network lag, which
is called latency. Hence, you need a system that is extendable to such far-off locations.
Flume is extendable in such scenarios, which means that you can configure Flume once and
extend it to wherever you want with the same configuration, and it will work the same way.
The third challenge is that the data might get lost in the network due to network-related issues.
So you need to ensure you have a system that is fault tolerant (keeps account of the data sent
and received). Flume is fault-tolerant.
A Flume agent is a Java application that generates or receives data and buffers it until it is
written to the next Flume agent or a storage system. A chain of Flume agents can be used to
move data from the data sources to the target HDFS or HBase in a scalable and durable
manner. A Flume agent has three components, namely, the source, the sink, and the
channel. Data is represented as ‘events’. The source receives or produces the data, which
contains events. The sink reads these events and sends them to the next Flume agent or to the
target Hadoop system. The channel acts as a buffer to hold the data written by the source until
the sink successfully writes the data to the next stage.
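To make these components concrete, here is a minimal hedged sketch of an agent configuration (agent ‘a1’ with a netcat source on port 44444, an in-memory channel, and an HDFS sink; all names, ports, and paths are only examples), followed by the command to start the agent:

cat > example.conf <<'EOF'
# one source, one channel, one sink for agent a1
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# source: listens on a TCP port for lines of text
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1

# channel: in-memory buffer between the source and the sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# sink: writes the events to HDFS
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.channel = c1
EOF

flume-ng agent --conf conf --conf-file example.conf --name a1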
An event can be considered equivalent to a data structure holding data. It consists of header(s)
and a body. Headers are key-value pairs used to convey routing information or other structured
information. The body contains the actual data, which is an array of bytes. Data flows only in the
form of these events.
Sources are the components of a Flume agent which receive data from any application that
produces data. They take the data by listening to a port or the file system. Every source is
connected to at least one channel. The source writes the data in the form of events to the
channels. There are different types of sources supported by Flume, such as the Avro source, the
HTTP source, etc.
Sinks are the components of a Flume agent which deliver the data to the final destination. A sink
continuously polls the connected channel to retrieve the events that were written by the source
into the channel. A sink writes events to the next hop (Flume agent) or the final destination. Once
it has successfully written the events to the next destination, it informs the channel to remove the
written events. There are different types of sinks supported by Flume, such as the Avro sink, the
HDFS sink, etc.
Channels are the components of a Flume agent which act as a conduit between the source
and the sink. A channel is a buffer (such as an in-memory queue) which keeps events until the
sink writes them to the next hop or the target destination. Multiple sources can write to the same
channel, and multiple sinks can read from the same channel, but a sink can be connected to only
one channel.
The number of sources, channels, and sinks in a Flume agent is not restricted to one; there can
be multiple of each. In such a scenario, we have other components as well to take into account.
Let's see how the data flows in such a case.
The source uses a channel processor to write events to the channels. The channel processor passes
the events to one or more interceptors. An interceptor reads, modifies, or drops events
as required and then sends them back to the channel processor. The channel processor then
passes the events to the channel selector. The channel selector determines how the events will now
move to the channels. There are mainly two types of channel selectors: replicating and
multiplexing. A replicating channel selector simply sends a copy of each event to all the
connected channels; this is the default channel selector. A multiplexing channel selector writes
events to the channels based on some criteria, such as the header information. Once the data is
available in the channel, it is used by the sinks. There can be more than one sink; in such a
case, we have sink groups. A sink group contains one or more sinks, and each group has a
sink processor which determines which sink carries out event processing. Once a sink is
selected, it writes events to the next destination and removes the written events from the
channel.
Here is a diagram showing how all the components of Flume interact with each other. Have a
look at it to get a better understanding of the components discussed above.
5.3 Flume: Industrial Discussion
Industry use case of Flume: You have banking transactions going on, and in real time or
near-real time, you have to infer which transaction is fraudulent and which is not. You got an
overview of how it is done using Flume and the architecture involved.
However, for things to happen in real time or near-real time, the network latency should be
minimal, and it can be reduced using the concept of availability zones.
Now that you are familiar with both Sqoop and Flume, let’s compare and summarise the
features of Sqoop and Flume. This will help you determine which tool suits which use cases.
Sqoop is used to import data from an RDBMS to HDFS, HBase, or Hive, and to export data from
HDFS back to an RDBMS. It is a tool for ingesting structured data. On the other hand, Flume is a
distributed data collection service for collecting, aggregating and transporting large amounts of
real-time data from various sources to a centralised place where the data can be processed. It
is a tool for ingesting unstructured data.
Sqoop is not event-driven, which means that its functioning is not dependent on events, and it is
thus suitable for moving data to and from RDBMSs such as Oracle, MySQL, etc. On the other hand,
Flume has an event-based architecture, which means it is dependent on events. The events can
be tweets generated on Twitter, log files of a server, etc.
Sqoop has a connector-based architecture, which means that the JDBC connector is primarily
responsible for connecting to the data sources and fetching the data correspondingly, while Flume
has an agent-based architecture, which means that the Flume agent is responsible for the data
transfer.
Sqoop can transfer data in parallel for better performance, while Flume scales horizontally, and
multiple Flume agents can be configured to collect high volumes of data. Flume also has
several recovery and failover mechanisms, due to which it is highly reliable.
This will help you determine when to use which tool. You further got to know about the
companies that use Sqoop and Flume.