BDA Assignment I and II
Q.1 How to import RDBMS table in Hadoop using Sqoop when the table doesn’t have a
primary key column?
Ans: Usually, we import an RDBMS table into Hadoop using Sqoop import when it has a
primary key column. If the table doesn't have a primary key column, the import will fail
with the following error:
ERROR tool.ImportTool: Error during import: No primary key could be found for table
<table_name>. Please specify one with --split-by or perform a sequential import with '-m 1'
Here is the solution for what to do when you don't have a primary key column in the
RDBMS table and you want to import it using Sqoop.
If your table doesn't have a primary key column, you need to specify the -m 1 option for
importing the data, or you have to provide the --split-by argument with some column name.
Here are the scripts which you can use to import an RDBMS table in Hadoop using Sqoop
when you don’t have a primary key column.
sqoop import \
--connect jdbc:mysql://localhost/dbname \
--username root \
--password root \
--table user \
--target-dir /user/root/user_data \
--columns "first_name,last_name,created_date" \
-m 1
or
sqoop import \
--connect jdbc:mysql://localhost/dbname \
--username root \
--password root \
--table user \
--target-dir /user/root/user_data \
--columns "first_name,last_name,created_date" \
--split-by created_date
Q.2 Suppose I have installed Apache Hive on top of my Hadoop cluster using
default metastore configuration. Then, what will happen if we have multiple
clients trying to access Hive at the same time?
The default metastore configuration allows only one Hive session to be opened at a
time for accessing the metastore. Therefore, if multiple clients try to access the
metastore at the same time, they will get an error. One has to use a standalone
metastore, i.e. a local or remote metastore configuration in Apache Hive, for allowing
multiple clients to access the metastore at the same time.
Following are the steps to configure MySQL database as the local metastore in Apache
Hive:
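A minimal sketch of this configuration, assuming MySQL is already installed and the
MySQL JDBC connector jar has been copied into $HIVE_HOME/lib (the database name
metastore_db, user hiveuser, and password hivepassword below are placeholders). Add
the following properties to $HIVE_HOME/conf/hive-site.xml:
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost/metastore_db?createDatabaseIfNotExist=true</value>
  <description>JDBC connection URL for the MySQL metastore database</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
  <description>JDBC driver class for MySQL</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hivepassword</value>
</property>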
Now, after restarting the Hive shell, it will automatically connect to the MySQL
database which is running as a standalone metastore.
Q.3 Suppose I create a table that contains details of all the transactions done by
the customers of year 2016: CREATE TABLE transaction_details (cust_id INT,
amount FLOAT, month STRING, country STRING) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',' ;
Now, after inserting 50,000 tuples in this table, I want to know the total revenue
generated for each month. But, Hive is taking too much time in processing this
query. How will you solve this problem and list the steps that I will be taking in
order to do so?
We can solve this problem of query latency by partitioning the table according to each
month. So, for each month we will be scanning only the partitioned data instead of the
whole table. The steps are: create a new table partitioned by month, enable dynamic
partitioning in Hive, and then transfer the data from the non-partitioned table into the
newly created partitioned table. This significantly reduces the query time. A sketch of
these steps is given below.
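A minimal HiveQL sketch of these steps, assuming the schema from the question (the
table name partitioned_transaction is a placeholder):
1. Create a table partitioned by month:
CREATE TABLE partitioned_transaction (cust_id INT, amount FLOAT, country STRING)
PARTITIONED BY (month STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ;
2. Enable dynamic partitioning:
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
3. Transfer the data from transaction_details into the partitioned table (the partition
column must come last in the SELECT list):
INSERT OVERWRITE TABLE partitioned_transaction PARTITION (month)
SELECT cust_id, amount, country, month FROM transaction_details;
Now the revenue for a particular month can be computed by scanning only that month's
partition, for example:
SELECT SUM(amount) FROM partitioned_transaction WHERE month = 'Jan';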
Q.4 Suppose I have a lot of small CSV files present in the /input directory in HDFS,
and I want to create a single Hive table corresponding to these files. The data in
these files is in the format: {id, name, e-mail, country}. Now, as we know,
Hadoop's performance degrades when we use lots of small files.
So, how will you solve this problem where we want to create a single Hive table
for lots of small files without degrading the performance of the system?
One can use the SequenceFile format which will group these small files together to
form a single sequence file. The steps that will be followed in doing so are as follows
(note that a hyphen is not allowed in a Hive column name, so the e-mail field is stored as email):
CREATE TABLE temp_table (id INT, name STRING, email STRING, country STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
CREATE TABLE sample_seqfile (id INT, name STRING, email STRING, country STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS SEQUENCEFILE;
Transfer the data from the temporary table into the sample_seqfile table:
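A minimal HiveQL sketch, assuming the small CSV files from /input are first loaded into
temp_table:
LOAD DATA INPATH '/input' INTO TABLE temp_table;
INSERT OVERWRITE TABLE sample_seqfile SELECT * FROM temp_table;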
Hence, a single SequenceFile is generated which contains the data present in all of the
input files and therefore, the problem of having lots of small files is finally eliminated.
Assignment II
Q.5 Suppose I have a CSV file 'sample.csv' present in the '/temp' directory with the
following entries:
id first_name last_name email gender ip_address
Q.6 How will you consume this CSV file into the Hive warehouse using a built-in
SerDe?
A SerDe (Serializer/Deserializer) tells Hive how to serialize a record into the bytes it
stores in HDFS and how to deserialize those bytes back into a record that we can process
using Hive. SerDes are implemented using Java. Hive comes with several built-in SerDes,
and many other third-party SerDes are also available.
Hive provides a specific SerDe for working with CSV files. We can use this SerDe for
querying data stored in HDFS for analysis using Hive Query Language (HQL), which is a
SQL-like language that gets translated into MapReduce jobs. Hive performs batch
processing on Hadoop.
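A minimal HiveQL sketch for consuming sample.csv with the built-in OpenCSVSerde
(available in Hive 0.14 and later); the table name csv_table is a placeholder, and all
columns are declared STRING because this SerDe reads every field as a string:
CREATE EXTERNAL TABLE csv_table (id STRING, first_name STRING, last_name STRING,
email STRING, gender STRING, ip_address STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION '/temp';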
Apache HBase is a NoSQL key/value store which runs on top of HDFS. Unlike Hive,
HBase operations run in real-time on its database rather than as MapReduce jobs.
HBase partitions tables horizontally into regions, and each table is further split into
column families.
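As a small illustration (the table and column family names here are hypothetical), a table
with two column families can be created and queried from the HBase shell as follows:
create 'customer', 'personal', 'transactional'
put 'customer', 'row1', 'personal:name', 'John'
get 'customer', 'row1'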
Hive and HBase are two different Hadoop-based technologies: Hive is an SQL-like
engine that runs MapReduce jobs, and HBase is a NoSQL key/value database of
Hadoop. We can use them together. Hive can be used for analytical queries while
HBase for real-time querying. Data can even be read and written from HBase to Hive
and vice-versa.