Lab - Working With SQL Using Big SQL v3
Overview
Apache Hadoop and its map-reduce framework have become very popular for their robust, scalable distributed processing. While Hadoop is very good at crunching big data, developing map-reduce applications is complex and time-consuming. Scripting languages such as Pig try to solve this problem; however, they require mastering yet another new language. Big SQL alleviates both of these problems by allowing users to write their queries in the well-understood SQL language. Under the hood, it takes advantage of Hadoop's scalable distributed processing when necessary. While there are other SQL processors for Hadoop, Big SQL is far superior in terms of functionality and performance.
Password: Use the password you used when you logged on to my.imdemocloud.com. For simplicity, in this document we will refer to this user ID/password as the IMDC userID/psw.
Prerequisite
Ensure you have followed the instructions to set up your lab environment (the first lab in this course). When you gain access to my.imdemocloud.com, enter the appropriate credentials, and start a MindTerm terminal window or a PuTTY session.
The environment variable $BIGSQL_HOME points to the Big SQL installation directory; changing to it (cd $BIGSQL_HOME) should typically take you to /opt/ibm/biginsights/bigsql. Explore the directory structure. It should look like this:
bin       executables to start/stop the bigsql server/client
lib       server jars
conf      server configuration files
jdbc      JDBC driver
odbc      ODBC driver
jsqsh     bigsql command line client
msg       error messages
security  keystore file for SSL encryption between client and server (if SSL is enabled)
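For example, assuming the $BIGSQL_HOME environment variable is set in your session (it is referenced throughout this lab), you can list these directories with:
ls $BIGSQL_HOME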
Note: In this lab environment, most configuration files can be read, but no changes are allowed. Even if you could edit files in those directories, configuration changes would not take effect unless you restarted the Big SQL server, and you do not have the permissions to restart it.
From the BigInsights Web Console you can also stop/start/monitor the Big SQL server process:
From the my.imdemocloud.com dashboard, Cluster tab, start the Web Console:
o Click on the link beside BigInsights Web Console. Your browser may display a warning message indicating the security certificate is not trusted, as shown below. Proceed anyway; we do not purchase SSL certificates for these demonstration systems.
You will be prompted for credentials: User Name and Password. Use the IMDC UserID/psw as described earlier. Click on Login.
Click on the Cluster Status tab, and look for the Big SQL process. It should show it is running. Click on this process, and on the right side you will see more detail. Note that the Start and Stop buttons are grayed out since you do not have admin privileges, and thus are not allowed to stop/start this process.
Server-wide job-conf settings: To pass Hadoop job-conf settings at the Big SQL server level, put them in $BIGSQL_HOME/conf/bigsql-site.xml and restart the bigsql server.
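As an illustration, entries in bigsql-site.xml follow the standard Hadoop configuration format; the property name and value below are only an example (a setting used later in this lab), not a recommendation:
<property>
  <!-- example only: number of reduce tasks for map-reduce jobs -->
  <name>mapred.reduce.tasks</name>
  <value>4</value>
</property>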
Other server settings: Explore $BIGSQL_HOME/conf/bigsql-conf.xml to make changes to other settings, such as:
o Network interface where the bigsql server should listen (default is 0.0.0.0, i.e. all interfaces)
o Port where the bigsql server should listen (default is 7052)
o Whether to authenticate using the web console or some other mechanism
o Whether to turn on SSL encryption, and SSL-specific settings
o List of jars/Jaql modules to load at server startup time
o Etc.
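Since this file is read-only in the lab environment, you can simply page through it, for example:
less $BIGSQL_HOME/conf/bigsql-conf.xml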
Start the jsqsh client by typing jsqsh at the command prompt.
o The first time you invoke jsqsh, you will get a Welcome message indicating that some files have been created for you, as shown below.
o Press Enter to continue. This will start a series of screens that allow you to define one or more connection aliases. Let's go through the screens once to set up one connection alias.
o Enter 1 to choose the Big SQL driver in the screen below.
o Next, enter (S) to save the configuration properties (We will take all defaults). Later in the lab there is a section that explains how to configure your Big SQL server.
o Next, you will be prompted to enter a connection name. Type myconn, then on the next screen enter A to add. This is an alias for a connection. We will explain later how to manually create other aliases and edit existing ones.
o Next, type Q to exit; otherwise you will be prompted again for driver information, and so on, to set up another connection alias.
Exit the jsqsh client (the prompt will be a line number like 1>; however, we use jsqsh> for readability):
jsqsh> quit
Start it again. This time it should not prompt you for anything:
jsqsh
Get help:
jsqsh> \help
Create a connection alias to a database server:
\connect -U<user> -P<password> -S<host> -p<port> -d<driver> -a<alias_for_this_connection>
For example, let's say your user ID is rfchong, the password is passw0rd, and the alias for the connection you want to create is myconn1. Then the statement to use is:
\connect -Urfchong -Ppassw0rd -Slocalhost -p7052 -dbigsql -amyconn1
Note: When you follow the step above, remember to use the IMDC userID/psw indicated in a previous note. The lab environment has been set up so this user ID/password is also used for the BigInsights Web Console authentication, and the Big SQL environment has been set up to use this same authentication method. Ensure you enclose the password in double quotes when it contains special characters like &.
If you make a mistake in the above command, simply execute it again with the correct information and the same alias name; the incorrect information will be overwritten. Alternatively, use the -r flag to remove the connection alias. For more details about the syntax, type:
\help connect
Verify the connection alias was indeed created by listing the connection aliases:
\connect -l
Connect using alias myconn1:
\connect myconn1
Show the currently connected user name:
set -v jaql.system.dataSource.userName;
Note: If for any reason, after you press Enter, you keep getting prompts with line numbers, ensure you add a semicolon (;) at the end of the line. If that doesn't work, type go and press Enter.
Show schemas:
\show schemas
Show tables:
\show tables
Show columns:
\show columns
Describe (get the schema of) a table:
\describe system.dual;
See the contents of the catalog table syscat.columns:
select * from syscat.columns;
If you had admin privileges, you could cancel all applications as follows:
jsqsh> cancel applications all;
Tell the Big SQL server that all objects (tables, etc.) we refer to in our DDL, DML, and queries should use this schema.
Tip: Every time a new Big SQL connection is established, this should be the first statement. Otherwise you will end up creating objects in the default schema.
Set the default schema for the session:
USE <schema_name>;
For example:
jsqsh> USE raulchongbiz;
Check the current default schema:
jsqsh> set -v jaql.sql.defaultSchema;
Create a table. Pay attention to how the delimiter | for this CSV file is specified in the statement. The STORED AS TEXTFILE clause indicates that this will be a text file in HDFS.
Note: The lineitem and supplier tables to be created in this section are the same tables used in the Working with SQL using Hive lab. A few things to keep in mind while going through the steps below:
o When using jsqsh (instead of the Hive CLI), you can copy/paste the entire statements below; jsqsh will know how to paste the lines appropriately. In the Hive CLI, you had to type the entire statement, or copy/paste one line at a time.
o With Big SQL, you have the option to run the queries in local mode or map-reduce mode. In Hive it's only map-reduce. You should notice gains in performance for some of the queries, either in map-reduce or local mode.
o If you did not complete the Working with SQL using Hive lab, or don't remember the performance of the queries in Hive, we suggest you open two terminal windows (if using MindTerm, click on File > Clone Terminal) and run the same queries side by side: one in Hive, one in Big SQL. Tables created in Hive or Big SQL are the same, so you don't need to create a table if it was created already in one or the other.
jsqsh> CREATE TABLE lineitem (
    L_ORDERKEY INT,
    L_PARTKEY INT,
    L_SUPPKEY INT,
    L_LINENUMBER INT,
    L_QUANTITY DOUBLE,
    L_EXTENDEDPRICE DOUBLE,
    L_DISCOUNT DOUBLE,
    L_TAX DOUBLE,
    L_RETURNFLAG STRING,
    L_LINESTATUS STRING,
    L_SHIPDATE STRING,
    L_COMMITDATE STRING,
    L_RECEIPTDATE STRING,
    L_SHIPINSTRUCT STRING,
    L_SHIPMODE STRING,
    L_COMMENT STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;
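For reference, each line in the input file holds one row, with column values separated by |. A row might look like the following (hypothetical values for illustration; not the actual contents of /userdata/lineitem.data):
1|15519|771|1|17|21168.23|0.04|0.02|N|O|1996-03-13|1996-02-12|1996-03-22|DELIVER IN PERSON|TRUCK|some comment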
Load 10,000 rows into the newly created table.
Note: The LOAD command in Big SQL currently passes the command to Hive. At the time of writing, Hive seems to have a problem where even though a user creates a table, they cannot LOAD data into it because they do not have the UPDATE privilege; therefore, you need to go to the Hive shell and issue a GRANT statement to yourself (as yourself), granting the UPDATE privilege, or better yet, granting ALL privileges.
Open another terminal window; let's call it Terminal 2 (if using MindTerm, go to File > Clone Terminal). Start the Hive shell and execute the GRANT statement:
hive
hive> use <IMDC userID>;
hive> grant all on table lineitem to user <IMDC userID>;
hive> quit;
Go back to Terminal 1 and issue:
jsqsh> LOAD HIVE DATA LOCAL INPATH '/userdata/lineitem.data' OVERWRITE INTO TABLE lineitem;
In this environment, for this lab, we have copied the input files to /userdata. Note that this directory and the input files must be readable by biadmin (because the bigsql server, which runs as biadmin, will read the files).
Run a simple query:
jsqsh> SELECT COUNT(*) FROM lineitem;
Note: Queries on large amounts of data typically run faster with map-reduce parallelism. On the other hand, queries on small amounts of data, or full table scans, typically run faster with no map-reduce (i.e. local read), because in that case the overhead introduced by map-reduce parallelism is more than the benefits it offers.
Big SQL uses map-reduce by default in most cases. For some simpler cases, like a SELECT * FROM t1, Big SQL uses local read by default.
You can force local-read mode using the accessmode hint.
jsqsh> SELECT COUNT(*) FROM lineitem /*+ accessmode='local' +*/;
Similarly, you can force map-reduce as follows:
jsqsh> SELECT COUNT(*) FROM lineitem /*+ accessmode='MR' +*/;
Here is how you can set local mode for all queries in the session:
jsqsh> SET FORCE LOCAL ON;
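We assume, though it is not covered in this lab, that the analogous statement reverts the session to the default behavior:
jsqsh> SET FORCE LOCAL OFF; -- assumption: counterpart of SET FORCE LOCAL ON, not verified here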
Verify that there are 10,000 rows in your supplier table:
jsqsh> SELECT COUNT(*) FROM supplier;
In our tests, the above query took 24.723 seconds in Hive vs. 8.31 seconds in Big SQL. Next, run the following queries (also run in the Hive lab):
jsqsh> SELECT SUM(L_EXTENDEDPRICE*L_DISCOUNT) AS REVENUE
FROM lineitem
WHERE L_SHIPDATE >= '1994-01-01'
  AND L_SHIPDATE < '1995-01-01'
  AND L_DISCOUNT >= 0.05
  AND L_DISCOUNT <= 0.07
  AND L_QUANTITY < 24;
In our case, when we executed the above query in Hive, it took 25.868 seconds. In Big SQL it took 9.30 seconds, both using map-reduce. You can also try it in Big SQL using the /*+ accessmode='local' +*/ hint, as shown below, for an almost immediate result:
SELECT SUM(L_EXTENDEDPRICE*L_DISCOUNT) AS REVENUE
FROM lineitem /*+ accessmode='local' +*/
WHERE L_SHIPDATE >= '1994-01-01'
  AND L_SHIPDATE < '1995-01-01'
  AND L_DISCOUNT >= 0.05
  AND L_DISCOUNT <= 0.07
  AND L_QUANTITY < 24;
If you recall, in the Hive lab the query below could not be run as shown because Hive supports only a subset of standard SQL, and the WITH clause was not supported. In Big SQL you can try the query with no modifications required:
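As an illustration of the WITH-clause query shape that Big SQL accepts, here is a sketch of ours against the lineitem table (not necessarily the exact query from the Hive lab):
-- illustrative sketch only
WITH discounted AS (
  SELECT L_ORDERKEY, L_EXTENDEDPRICE * (1 - L_DISCOUNT) AS NET_PRICE
  FROM lineitem
)
SELECT L_ORDERKEY, SUM(NET_PRICE) AS TOTAL
FROM discounted
GROUP BY L_ORDERKEY;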
Create a regular table (not a CTAS yet):
CREATE TABLE orders1 (
    O_ORDERKEY BIGINT,
    O_CUSTKEY INTEGER,
    O_ORDERSTATUS CHAR(1),
    O_TOTALPRICE FLOAT,
    O_ORDERDATE TIMESTAMP,
    O_ORDERPRIORITY CHAR(15),
    O_CLERK CHAR(15),
    O_SHIPPRIORITY INTEGER,
    O_COMMENT VARCHAR(79)
)
row format delimited fields terminated by '|'
stored as textfile;
Load 10k rows. From the other terminal window, Terminal 2, start the Hive shell and execute the GRANT statement:
hive
hive> use <IMDC userID>;
hive> grant all on table orders1 to user <IMDC userID>;
hive> quit;
Return to Terminal 1 and issue:
jsqsh> LOAD HIVE DATA LOCAL INPATH '/userdata/orders.data' OVERWRITE INTO TABLE orders1;
Verify the data was loaded:
jsqsh> SELECT COUNT(*) FROM orders1 /*+ accessmode='local' +*/;
Create a CTAS (create table as select) query to prepare for another query:
CREATE TABLE q4_order_priority_tmp (O_ORDERKEY) AS
SELECT DISTINCT l_orderkey AS O_ORDERKEY
FROM lineitem
WHERE l_commitdate < l_receiptdate;
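Optionally, you can check how many orders qualified, following the same counting pattern used earlier in this lab (our suggestion, not an original lab step; the count depends on the loaded data):
jsqsh> SELECT COUNT(*) FROM q4_order_priority_tmp /*+ accessmode='local' +*/;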
Use a JOIN to join the two previously created tables:
select o_orderpriority, count(1) as order_count
from orders1 o join q4_order_priority_tmp t
  on o.o_orderkey = t.o_orderkey
  and o.o_orderdate >= cast('1993-07-01' as timestamp)
  and o.o_orderdate < cast('1993-10-01' as timestamp)
group by o_orderpriority
order by o_orderpriority;
Tip: If we knew that one table was small, then during the join it could be pulled into memory and we could do a cheaper in-memory join (aka hash join, aka map-side join), e.g.:
FROM T1, T2 /*+ tablesize=small +*/
or
WHERE T1.c1 = T2.c1 /*+ joinmethod=mapsidehash, buildtable=T2 +*/
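As an illustration of the tablesize hint applied to the query above, here is a sketch of ours, assuming q4_order_priority_tmp is the smaller table; it mirrors the FROM T1, T2 form shown in the tip, and we have not verified the hint placement in other join syntaxes:
-- sketch: comma-join form with the tablesize hint after the small table
select o_orderpriority, count(1) as order_count
from orders1 o, q4_order_priority_tmp t /*+ tablesize=small +*/
where o.o_orderkey = t.o_orderkey
  and o.o_orderdate >= cast('1993-07-01' as timestamp)
  and o.o_orderdate < cast('1993-10-01' as timestamp)
group by o_orderpriority
order by o_orderpriority;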
Complex types
Let's create another table that uses complex data types (namely, array and struct). Set the schema for this session:
jsqsh> USE <schema_name>;
Create a table with complex data types. The COLLECTION ITEMS TERMINATED BY clause specifies how struct/array members are separated.
jsqsh> CREATE TABLE employees (
    name STRING,
    phones ARRAY<STRING>,
    address STRUCT<street:STRING, city:STRING, state:STRING, zip:STRING>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
COLLECTION ITEMS TERMINATED BY ':';
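Given these delimiters, a line in the input file would look something like the following (hypothetical sample data, not the actual contents of /userdata/employees.data); fields are separated by |, and the members of the phones array and the address struct are separated by the colon character:
John Doe|555-1111:555-2222|123 Main St:Springfield:IL:62704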
Load some data. From the other terminal window, Terminal 2, start the Hive shell and execute the GRANT statement:
hive
hive> use <IMDC userID>;
hive> grant all on table employees to user <IMDC userID>;
hive> quit;
Back in Terminal 1, issue:
jsqsh> LOAD HIVE DATA LOCAL INPATH '/userdata/employees.data' OVERWRITE INTO TABLE employees;
Let's see how to access complex types. Note that for this query, Big SQL uses local mode by default.
jsqsh> SELECT * FROM EMPLOYEES;
Look at names and the first phone number (in the array of phone numbers):
jsqsh> SELECT name, phones[1] FROM EMPLOYEES;
Look at names and the city (from the address struct):
jsqsh> SELECT name, address.city FROM EMPLOYEES;
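Array and struct accessors can also be combined in a single query; for instance (a sketch of ours, reusing only the accessors shown above):
jsqsh> SELECT name, phones[1], address.city FROM EMPLOYEES;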
If we know that a very large amount of data needs to be processed by the reducers, we can increase the number of reducers:
jsqsh> set mapred.reduce.tasks = 4;
Check back to ensure that the setting is now in effect:
jsqsh> set -v mapred.reduce.tasks;
Print all settings that are applicable to this session:
jsqsh> set -v;
Print the settings that you have manually set in this session:
jsqsh> set;