The document discusses Impala SQL support, including that Impala can access tables defined through Impala or Hive DDL, supports similar DML statements to HiveQL, and provides many built-in functions with the same names and parameters as HiveQL equivalents. It covers Impala's support for data definition languages (DDL) like CREATE TABLE and ALTER TABLE, data manipulation languages (DML) like INSERT and LOAD DATA statements, and Impala-specific functions and languages (ISL).
Report
Share
Report
Share
1 of 30
Download to read offline
More Related Content
Impala SQL Support
1. Impala SQL Support
Yue Chen
http://linkedin.com/in/yuechen2
http://dataera.wordpress.com
2. http://dataera.wordpress.com
http://linkedin.com/in/yuechen2
Cloudera Impala – SQL Supports
Because Impala uses the same metadata store as Hive to record information about table structure and properties, Impala can access tables defined through the native Impala CREATE TABLE command, or tables created using the Hive data definition language (DDL).
Impala supports data manipulation (DML) statements similar to the DML component of HiveQL.
Impala provides many built-in functions with the same names and parameter types as their HiveQL equivalents
4. http://dataera.wordpress.com
http://linkedin.com/in/yuechen2
Cloudera Impala – SQL Supports (DDL)
Data Definition Language
CREATE DATABASE Statement
usage:
create database first LOCATION hdfs_path];
use first;
CREATE TABLE Statement
usage:
CREATE EXTERNAL TABLE tab1 ( id INT, col_1 BOOLEAN, col_2 DOUBLE, col_3 TIMESTAMP ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/user/cloudera/sample_data/tab1';
CREATE TABLE tab3 ( id INT, col_1 BOOLEAN, col_2 DOUBLE, month INT, day INT ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',‘[STORED AS file_format] ; file_format: PARQUET | TEXTFILE | AVRO | SEQUENCEFILE | RCFILE
5. http://dataera.wordpress.com
http://linkedin.com/in/yuechen2
Cloudera Impala – SQL Supports (DDL)
Data Definition Language
CREATE TABLE Statement
usage:
CREATE EXTERNAL TABLE tab1 ( id INT, col_1 BOOLEAN, col_2 DOUBLE, col_3 TIMESTAMP ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/user/cloudera/sample_data/tab1';
CREATE TABLE tab3 ( id INT, col_1 BOOLEAN, col_2 DOUBLE, month INT, day INT ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',‘[STORED AS file_format] ;
file_format: PARQUET | TEXTFILE | AVRO | SEQUENCEFILE | RCFILE
6. http://dataera.wordpress.com
http://linkedin.com/in/yuechen2
Cloudera Impala – SQL Supports (DDL)
Data Definition Language
CREATE VIEW Statement
usage:
create view v1 as select * from t1; create view v2 as select c1, c3, c7 from t1;
create view v3 as select c1, cast(c3 as string) c3, concat(c4,c5) c5, trim(c6) c6, "Constant" c8 from t1;
create view v4 as select t1.c1, t2.c2 from t1 join t2 on t1.id = t2.id; create view some_db.v5 as select * from some_other_db.t1;
7. http://dataera.wordpress.com
http://linkedin.com/in/yuechen2
Cloudera Impala – SQL Supports (DDL)
Data Definition Language
ALTER TABLE Statement
usage:
create database d1; create database d2; create database d3;
use d1;
create table mobile (x int);
use d2; -- Move table from another database to the current one.
alter table d1.mobile rename to mobile; use d1;
alter table d2.mobile rename to d3.mobile; -- Move table from one database to another.
create table p1 (s string) partitioned by (month int, day int);
alter table p1 partition (month=1, day=1) set location ‘/usr/external_data/new_years_day'
8. http://dataera.wordpress.com
http://linkedin.com/in/yuechen2
Cloudera Impala – SQL Supports (DDL)
Data Definition Language
ALTER VIEW Statement
usage:
create table t1 (x int, y int, s string);
create table t2 like t1;
create view v1 as select * from t1;
alter view v1 as select * from t2;
12. http://dataera.wordpress.com
http://linkedin.com/in/yuechen2
Cloudera Impala – SQL Supports (DML)
DML Statements
INSERT Statement
Impala supports inserting into tables and partitions that you create with the Impala CREATE TABLE statement, or pre-defined tables and partitions created through Hive.
13. http://dataera.wordpress.com
http://linkedin.com/in/yuechen2
Cloudera Impala – SQL Supports (DML)
DML Insert Statements
INSERT INTO to append data to a table.
INSERT OVERWRITE to replace the data in a table.
Copy data from another table using SELECT query.
An optional with clause before the INSERT keyword, to define a subquery referenced in the SELECT portion.
Create one or more new rows using constant expressions through VALUES clause.
Specify the names or order of columns to be inserted, different than the columns of the table being queried by the INSERT statement.
14. http://dataera.wordpress.com
http://linkedin.com/in/yuechen2
Cloudera Impala – SQL Supports (DML)
DML Insert Statements
Eg: insert into table text_table select * from default.tab1
Eg: insert overwrite table parquet_table select * from default.tab1
Eg: insert into val_test_1 values (100, 99.9/10, 'abc', true, now());
Eg: insert overwrite val_test_2 values (1, 'a'), (2, 'b'), (-1,'xyzzy');
Eg:CREATE TABLE rc_table ( id INT, col_1 BOOLEAN, col_2 DOUBLE, col_3 TIMESTAMP ) STORED AS RCFILE;
insert into table rc_table select * from default.tab1; Remote error Backend 0:RC_FILE not implemented.
15. http://dataera.wordpress.com
http://linkedin.com/in/yuechen2
Cloudera Impala – SQL Supports (DML)
DML Insert Statements
Eg: insert into table text_table select * from default.tab1
Eg: insert overwrite table parquet_table select * from default.tab1
Eg: insert into val_test_1 values (100, 99.9/10, 'abc', true, now());
Eg: insert overwrite val_test_2 values (1, 'a'), (2, 'b'), (-1,'xyzzy');
Eg:CREATE TABLE rc_table ( id INT, col_1 BOOLEAN, col_2 DOUBLE, col_3 TIMESTAMP ) STORED AS RCFILE;
insert into table rc_table select * from default.tab1; Remote error Backend 0:RC_FILE not implemented.
Note: The above examples show the type of "not implemented" error that you see when attempting to insert data into a table with a file format that Impala currently does not write to
16. http://dataera.wordpress.com
http://linkedin.com/in/yuechen2
Cloudera Impala – SQL Supports (DML)
DML Insert Statements
An optional hint clause immediately before the SELECT keyword, to fine-tune the behavior when doing an INSERT ... SELECT operation into partitioned Parquet tables. The hint keywords are [SHUFFLE] and [NOSHUFFLE], including the square brackets. Inserting into partitioned Parquet tables can be a resource-intensive operation because it potentially involves many files being written to HDFS simultaneously.
17. http://dataera.wordpress.com
http://linkedin.com/in/yuechen2
Cloudera Impala – SQL Supports (DML)
DML Insert Statements
oselect customer.address, state_lookup.state_name from customer join [broadcast] state_lookup on customer.state_id = state_lookup.state_id;
•This query joins a large customer table with a small lookup table of less than 100 rows. The right-hand table can be broadcast efficiently to all nodes involved in the join. Thus, you would use the [broadcast] hint to force a broadcast join strategy.
18. http://dataera.wordpress.com
http://linkedin.com/in/yuechen2
Cloudera Impala – SQL Supports (DML)
DML Insert Statements
oselect weather.wind_velocity, geospatial.altitude from weather join [shuffle] geospatial on weather.lat = geospatial.lat and weather.long = geospatial.long;
This query joins two large tables of unpredictable size. You might benchmark the query with both kinds of hints and find that it is more efficient to transmit portions of each table to other nodes for processing. Thus, you would use the [shuffle] hint to force a partitioned join strategy.
19. http://dataera.wordpress.com
http://linkedin.com/in/yuechen2
Cloudera Impala – SQL Supports (DML)
Loading Statements
The LOAD DATA statement streamlines the ETL process for an internal Impala table by moving a data file or all the data files in a directory from an HDFS location into the Impala data directory for that table.
20. http://dataera.wordpress.com
http://linkedin.com/in/yuechen2
Cloudera Impala – SQL Supports (DML)
Loading Statements
usage:
create table t1 (s string);
load data inpath '/user/cloudera/thousand_strings.txt' into table t1;
load data inpath '/user/cloudera/ten_strings.txt' overwrite into table t1;
23. http://dataera.wordpress.com
http://linkedin.com/in/yuechen2
Cloudera Impala – SQL Supports (DML)
Loading Statements Notes:
The loaded data files are moved, not copied, into the Impala data directory.
You can specify the HDFS path of a single file to be moved, or the HDFS path of a directory to move all the files inside that directory. You cannot specify any sort of wildcard to take only some of the files from a directory. When loading a directory full of data files, keep all the data files at the top level, with no nested directories underneath.
24. http://dataera.wordpress.com
http://linkedin.com/in/yuechen2
Cloudera Impala – SQL Supports (DML)
Loading Statements Notes:
The loaded data files are moved, not copied, into the Impala data directory.
You can specify the HDFS path of a single file to be moved, or the HDFS path of a directory to move all the files inside that directory. You cannot specify any sort of wildcard to take only some of the files from a directory. When loading a directory full of data files, keep all the data files at the top level, with no nested directories underneath.
Currently, the Impala LOAD DATA statement only imports files from HDFS, not from the local file system.
25. http://dataera.wordpress.com
http://linkedin.com/in/yuechen2
Cloudera Impala – SQL Supports (ISL)
Usage:
REFRESH table_name
Notes:
o Use the REFRESH statement to load the latest metastore metadata and block location data for a particular table in these scenarios:
oAfter loading new data files into the HDFS data directory for the table. (Once you have set up an ETL pipeline to bring data into Impala on a regular basis, this is typically the most frequent reason why metadata needs to be refreshed.)
oAfter issuing ALTER TABLE, INSERT, LOAD DATA, or other table-modifying SQL statement in Hive
26. http://dataera.wordpress.com
http://linkedin.com/in/yuechen2
Cloudera Impala – SQL Supports (ISL)
Usage:
INVALIDATE METADATA [table_name]
Notes:
oA metadata update for an impalad instance is required if:
oA metadata change occurs.
oand the change is made from another impalad instance in your cluster, or through Hive.
oand the change is made to a database to which clients such as the Impala shell or ODBC directly connect.
27. http://dataera.wordpress.com
http://linkedin.com/in/yuechen2
Cloudera Impala – SQL Supports (ISL)
Usage:
Show Statements:
SHOW DATABASES [[LIKE] 'pattern']
SHOW SCHEMAS [[LIKE] 'pattern'] - an alias for SHOW DATABASES SHOW TABLES [IN database_name] [[LIKE] 'pattern']
SHOW [AGGREGATE] FUNCTIONS [IN database_name] [[LIKE] 'pattern']
SHOW CREATE TABLE [database_name].table_name
SHOW TABLE STATS [database_name.]table_name
SHOW COLUMN STATS [database_name.]table_name
SHOW PARTITIONS [database_name.]table_name
28. http://dataera.wordpress.com
http://linkedin.com/in/yuechen2
Cloudera Impala – SQL Supports (DML)
Impala does not support the following SQL features
Non-scalar data types such as maps, arrays, structs.
Extensibility mechanisms such as TRANSFORM, custom file formats.
XML and JSON functions.
Certain aggregate functions from HiveQL: variance,var_pop,var_samp,stddev_pop,stddev_samp,covar_pop,covar_samp,corr,percentile,percentile_approx,histogram_numeric,collect_set.
Sampling.
Lateral views.
Multiple DISTINCT clauses per query.
29. http://dataera.wordpress.com
http://linkedin.com/in/yuechen2
Cloudera Impala – SQL Supports (DML)
Impala does not currently support these HiveQL statements:
ANALYZE TABLE (the Impala equivalent is COMPUTE STATS)
DESCRIBE COLUMN
DESCRIBE DATABASE
EXPORT TABLE
IMPORT TABLE
SHOW PARTITIONS
SHOW TABLE EXTENDED
SHOW INDEXES
SHOW COLUMNS