Impala 2.0 Update
Sho Shimauchi, Cloudera
2014/10/31

Today's Topic
• What is Cloudera Impala?
• Impala 1.4 / 2.0 update
• Performance Improvement
• Query Language
• Resource Management and Security
• Others

Who am I?
• Pre-sales Solutions Architect
• joined Cloudera in 2011, the first Japanese employee at Cloudera
• email: sho@cloudera.com
• twitter: @shiumachi

Cloudera Impala

What is Impala?
• MPP SQL query engine for Hadoop environment
• written in native code for maximum hardware efficiency
• open-source!
  • http://impala.io/
• Supported by Cloudera, Amazon, and MapR
• History
  • 2012/10 Public Beta released
  • 2013/04 Impala 1.0 released
  • current version: Impala 2.0

Impala is easy to use
• create tables as virtual views over data stored in HDFS / HBase
  • schema metadata is stored in Metastore
  • shared with Hive, Pig, etc.
• connect via ODBC / JDBC
• authenticate via Kerberos / LDAP
• run standard SQL
  • ANSI SQL-92 based
  • limited to SELECT and bulk INSERT
  • correlated subqueries: not supported before 2.0, available in 2.0
  • UDF / UDAF

Impala 1.4 (2014/07)
• DECIMAL(<precision>, <scale>)
• HDFS caching DDL
• column definition based on Parquet file (CREATE TABLE … LIKE PARQUET)
• ORDER BY without LIMIT
• LDAP connections through TLS
• SHOW PARTITIONS
• YARN integrated resource manager will be production ready
  • Llama HA support
• CREATE TABLE … STORED AS AVRO
• SUMMARY command in impala-shell (provides high-level summary of query plan)
• faster COMPUTE STATS
• Performance improvements for partition pruning
• impala-shell supports UTF-8 characters
• additional built-ins from EDW systems

Impala 2.0 (2014/10)
• hash table can spill to disk
  • join and aggregate tables of arbitrary size
• Subquery enhancements
  • allowed in WHERE clauses
  • EXISTS / NOT EXISTS
  • IN / NOT IN can operate on the result set from a subquery
  • correlated / uncorrelated subqueries
  • scalar subqueries
• SQL 2003 compliant analytic window functions
  • LEAD(), LAG(), RANK(), FIRST_VALUE(), etc.
• New Data Type: VARCHAR, CHAR
• Security Enhancements
  • multiple authentication methods
  • GRANT / REVOKE / CREATE ROLE / DROP ROLE / SHOW ROLES / etc.
• text + gzip / bzip2 / Snappy
• Hint inside views
• QUERY_TIMEOUT_S
• DATE_PART() / EXTRACT()
• Parquet default block size is changed to 256MB (was: 1GB)
• LEFT ANTI JOIN / RIGHT ANTI JOIN
• impala-shell can read settings from $HOME/.impalarc

Performance Improvement

HDFS caching
• When HDFS files are cached in memory, Impala can read the cached data without any disk reads, and without making an additional copy of the data in memory
  • avoids checksumming and data copies
• new HDFS API is available in CDH 5.0
• configure cache with Impala DDL
  • CREATE TABLE tbl_name CACHED IN '<pool>'
  • ALTER TABLE tbl_name ADD PARTITION … CACHED IN '<pool>'

Partition Pruning improvement
• Previously, Impala typically queried tables with up to approximately 3000 partitions. With the performance improvement in partition pruning, Impala can now comfortably handle tables with tens of thousands of partitions.

Spilling to Disk SQL Operation
• writes temporary data to disk when Impala is close to exceeding its memory limit
• In PROFILE, the BlockMgr.BytesWritten counter reports how much data was written to disk during the query

Query Language

Subquery
Scalar subquery: produces a result set with a single row containing a single column
SELECT x FROM t1 WHERE x > (SELECT MAX(y) FROM t2);
Uncorrelated subquery: does not refer to any tables from the outer block of the query
SELECT x FROM t1 WHERE x IN (SELECT y FROM t2);
Correlated subquery: compares one or more values from the outer query block to values referenced in the WHERE clause of the subquery
SELECT employee_name, employee_id FROM employees one WHERE
  salary > (SELECT avg(salary) FROM employees two WHERE one.dept_id = two.dept_id);
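
The three subquery shapes above can be tried without a cluster. The following is a minimal sketch that runs the same statements through Python's built-in `sqlite3` module, used here only as a stand-in SQL engine (it is not Impala, and the table contents are made up for illustration).

```python
import sqlite3

# In-memory SQLite database: a stand-in engine, not Impala.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (x INTEGER);
    CREATE TABLE t2 (y INTEGER);
    INSERT INTO t1 VALUES (1), (2), (3), (4), (5), (6);
    INSERT INTO t2 VALUES (2), (4);
    -- Hypothetical sample rows for the employees example.
    CREATE TABLE employees (employee_name TEXT, employee_id INTEGER,
                            dept_id INTEGER, salary REAL);
    INSERT INTO employees VALUES
        ('alice', 1, 10, 100.0),
        ('bob',   2, 10, 200.0),
        ('carol', 3, 20, 300.0),
        ('dave',  4, 20, 500.0);
""")


def run(sql):
    """Execute a query and return all rows as a list of tuples."""
    return conn.execute(sql).fetchall()


# Scalar subquery: the inner query yields exactly one value, MAX(y).
scalar = run("SELECT x FROM t1 WHERE x > (SELECT MAX(y) FROM t2) ORDER BY x")

# Uncorrelated subquery: the inner query does not reference t1.
uncorrelated = run("SELECT x FROM t1 WHERE x IN (SELECT y FROM t2) ORDER BY x")

# Correlated subquery: the inner query references the outer alias "one".
correlated = run("""
    SELECT employee_name, employee_id FROM employees one
    WHERE salary > (SELECT avg(salary) FROM employees two
                    WHERE one.dept_id = two.dept_id)
    ORDER BY employee_id
""")

print(scalar, uncorrelated, correlated)
```

The ORDER BY clauses are added only to make the output order deterministic.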

Analytic Functions (a.k.a. Window Functions)
• supported in 2.0 and later
• supported functions
  • RANK() / DENSE_RANK()
  • FIRST_VALUE() / LAST_VALUE()
  • LAG() / LEAD()
  • ROW_NUMBER()
• Aggregate functions are already implemented
  • MAX(), MIN(), AVG(), SUM(), etc.

Analytic Functions Example
For each day, the query prints the closing price alongside the previous day's closing price:
select stock_symbol, closing_date, closing_price,
  lag(closing_price,1) over (partition by stock_symbol order by closing_date) as "yesterday closing"
from stock_ticker
order by closing_date;
+--------------+---------------------+---------------+-------------------+
| stock_symbol | closing_date        | closing_price | yesterday closing |
+--------------+---------------------+---------------+-------------------+
| JDR          | 2014-09-13 00:00:00 | 12.86         | NULL              |
| JDR          | 2014-09-14 00:00:00 | 12.89         | 12.86             |
| JDR          | 2014-09-15 00:00:00 | 12.94         | 12.89             |
| JDR          | 2014-09-16 00:00:00 | 12.55         | 12.94             |
| JDR          | 2014-09-17 00:00:00 | 14.03         | 12.55             |
| JDR          | 2014-09-18 00:00:00 | 14.75         | 14.03             |
| JDR          | 2014-09-19 00:00:00 | 13.98         | 14.75             |
+--------------+---------------------+---------------+-------------------+
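
To make the semantics of LAG() over a partition concrete, here is a small sketch in plain Python that reproduces the "yesterday closing" column for the rows shown above. The helper name `lag_by_partition` is hypothetical, and this only illustrates what the function computes, not how Impala evaluates it.

```python
from collections import defaultdict

# (stock_symbol, closing_date, closing_price), copied from the sample output above.
rows = [
    ("JDR", "2014-09-13", 12.86),
    ("JDR", "2014-09-14", 12.89),
    ("JDR", "2014-09-15", 12.94),
    ("JDR", "2014-09-16", 12.55),
    ("JDR", "2014-09-17", 14.03),
    ("JDR", "2014-09-18", 14.75),
    ("JDR", "2014-09-19", 13.98),
]


def lag_by_partition(rows, offset=1):
    """Append the price from `offset` rows earlier within the same symbol."""
    partitions = defaultdict(list)
    for symbol, date, price in rows:
        partitions[symbol].append((date, price))  # PARTITION BY stock_symbol

    result = []
    for symbol, items in partitions.items():
        items.sort()  # ORDER BY closing_date (ISO dates sort chronologically)
        for i, (date, price) in enumerate(items):
            # No earlier row in the partition corresponds to SQL NULL.
            previous = items[i - offset][1] if i >= offset else None
            result.append((symbol, date, price, previous))

    result.sort(key=lambda r: r[1])  # outer ORDER BY closing_date
    return result


lagged = lag_by_partition(rows)
for row in lagged:
    print(row)
```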

Approximation features
• APPX_COUNT_DISTINCT query option
  • rewrites COUNT(DISTINCT) calls to use NDV()
  • speeds up the operation
  • allows multiple COUNT(DISTINCT) in a single query
• APPX_MEDIAN()
  • returns a value that is approximately the median (midpoint) of values in the set of input values

Approx. functions example
[localhost:21000] > select min(x), max(x), avg(x) from million_numbers;
+-------------------+-------------------+-------------------+
| min(x)            | max(x)            | avg(x)            |
+-------------------+-------------------+-------------------+
| 4.725693727250069 | 49994.56852674231 | 24945.38563793553 |
+-------------------+-------------------+-------------------+
[localhost:21000] > select appx_median(x) from million_numbers;
+----------------+
| appx_median(x) |
+----------------+
| 24721.6        |
+----------------+
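
The trade-off behind an approximate median can be sketched in a few lines of Python. The example below estimates the median from a uniform random sample and compares it with the exact value. The sampling approach, the helper name `approx_median`, and the generated data are all assumptions for illustration; the slides do not say which algorithm APPX_MEDIAN() uses.

```python
import random
import statistics


def approx_median(values, sample_size=1000, seed=0):
    """Estimate the median from a uniform random sample (hypothetical helper)."""
    values = list(values)
    if len(values) <= sample_size:
        return statistics.median(values)
    rng = random.Random(seed)
    return statistics.median(rng.sample(values, sample_size))


# Made-up data, loosely resembling the value range in the example above.
data_rng = random.Random(42)
data = [data_rng.uniform(0, 50000) for _ in range(100_000)]

exact = statistics.median(data)
estimate = approx_median(data)
print(f"exact={exact:.1f} estimate={estimate:.1f}")
```

The estimate touches only 1000 of the 100,000 values, which is the kind of saving an approximate function trades against exactness.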

CREATE TABLE … LIKE PARQUET
• CREATE TABLE ... LIKE PARQUET 'hdfs_path_of_parquet_file'
• The column names and data types are automatically configured based on the Parquet data file

ORDER BY without LIMIT
• LIMIT clause is now optional for queries that use the ORDER BY clause
• Impala automatically uses a temporary disk work area to perform the sort if the sort operation would otherwise exceed the Impala memory limit for a particular data node.

DECODE()
SELECT event, DECODE(day_of_week, 1, "Monday", 2, "Tuesday", 3,
  "Wednesday", 4, "Thursday", 5, "Friday", 6, "Saturday", 7,
  "Sunday", "Unknown day")
FROM calendar;
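
The DECODE() call above reads as a chain of search/result pairs followed by an optional default. A rough Python equivalent, written only to illustrate that argument layout (the `decode` helper is hypothetical and is not part of any Impala client library):

```python
def decode(expr, *args):
    """Return the result paired with the first search value equal to expr.

    args is search1, result1, search2, result2, ... with an optional
    trailing default; without a default, None stands in for SQL NULL.
    """
    if len(args) % 2:  # odd count: the last argument is the default
        pairs, default = args[:-1], args[-1]
    else:
        pairs, default = args, None
    for search, result in zip(pairs[0::2], pairs[1::2]):
        if expr == search:
            return result
    return default


days = (1, "Monday", 2, "Tuesday", 3, "Wednesday", 4, "Thursday",
        5, "Friday", 6, "Saturday", 7, "Sunday", "Unknown day")

print(decode(3, *days))
print(decode(9, *days))
```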

ANTI JOIN
LEFT ANTI JOIN / RIGHT ANTI JOIN are supported in Impala 2.0
[localhost:21000] > create table t1 (x int);
[localhost:21000] > insert into t1 values (1), (2), (3), (4), (5), (6);

[localhost:21000] > create table t2 (y int);
[localhost:21000] > insert into t2 values (2), (4), (6);

[localhost:21000] > select x from t1 left anti join t2 on (t1.x = t2.y);
+---+
| x |
+---+
| 1 |
| 3 |
| 5 |
+---+
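
A left anti join keeps the rows of the left table that have no match on the right, which is the same result a NOT EXISTS correlated subquery produces. The sketch below checks that equivalence on the t1 / t2 data above using Python's built-in `sqlite3` module as a stand-in engine. SQLite is not Impala and has no ANTI JOIN keyword, so only the NOT EXISTS form is executed and the anti join is emulated in plain Python.

```python
import sqlite3

# In-memory SQLite database: a stand-in engine, not Impala.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (x INTEGER);
    INSERT INTO t1 VALUES (1), (2), (3), (4), (5), (6);
    CREATE TABLE t2 (y INTEGER);
    INSERT INTO t2 VALUES (2), (4), (6);
""")

# NOT EXISTS form: keep rows of t1 for which no t2 row satisfies t1.x = t2.y.
not_exists = [row[0] for row in conn.execute("""
    SELECT x FROM t1
    WHERE NOT EXISTS (SELECT 1 FROM t2 WHERE t1.x = t2.y)
    ORDER BY x
""")]

# Plain-Python emulation of the left anti join on the same data.
left = [row[0] for row in conn.execute("SELECT x FROM t1 ORDER BY x")]
right = {row[0] for row in conn.execute("SELECT y FROM t2")}
anti_join = [x for x in left if x not in right]

print(not_exists, anti_join)
```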

new data types
• DECIMAL (Impala 1.4)
  • column_name DECIMAL[(precision[,scale])]
  • with no precision or scale values is equivalent to DECIMAL(9,0)
• VARCHAR (Impala 2.0)
  • STRING with a max length
• CHAR (Impala 2.0)
  • STRING with a precise length

new built-in functions
• EXTRACT(): returns one date or time field from a TIMESTAMP value
• TRUNC(): truncates date/time values to year, month, etc.
• ADD_MONTHS(): alias for MONTHS_ADD()
• ROUND(): rounds DECIMAL values
• for computing properties for statistical distributions
  • STDDEV()
  • STDDEV_SAMP() / STDDEV_POP()
  • VARIANCE()
  • VARIANCE_SAMP() / VARIANCE_POP()
• MAX_INT() / MIN_SMALLINT()
• IS_INF() / IS_NAN()

SHOW PARTITIONS
[localhost:21000] > show partitions census;
+-------+-------+--------+------+---------+
| year  | #Rows | #Files | Size | Format  |
+-------+-------+--------+------+---------+
| 2000  | -1    | 0      | 0B   | TEXT    |
| 2004  | -1    | 0      | 0B   | TEXT    |
| 2008  | -1    | 0      | 0B   | TEXT    |
| 2010  | -1    | 0      | 0B   | TEXT    |
| 2011  | 4     | 1      | 22B  | TEXT    |
| 2012  | 4     | 1      | 22B  | TEXT    |
| 2013  | 1     | 1      | 231B | PARQUET |
| Total | 9     | 3      | 275B |         |
+-------+-------+--------+------+---------+

SUMMARY
• impala-shell command
• easy-to-digest overview of the timings for the different phases of execution for a query
[localhost:21000] > select avg(ss_sales_price) from store_sales where ss_coupon_amt = 0;
+---------------------+
| avg(ss_sales_price) |
+---------------------+
| 37.80770926328327   |
+---------------------+
[localhost:21000] > summary;
+--------------+--------+----------+----------+-------+------------+----------+---------------+-----------------+
| Operator     | #Hosts | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem | Est. Peak Mem | Detail          |
+--------------+--------+----------+----------+-------+------------+----------+---------------+-----------------+
| 03:AGGREGATE | 1      | 1.03ms   | 1.03ms   | 1     | 1          | 48.00 KB | -1 B          | MERGE FINALIZE  |
| 02:EXCHANGE  | 1      | 0ns      | 0ns      | 1     | 1          | 0 B      | -1 B          | UNPARTITIONED   |
| 01:AGGREGATE | 1      | 30.79ms  | 30.79ms  | 1     | 1          | 80.00 KB | 10.00 MB      |                 |
| 00:SCAN HDFS | 1      | 5.45s    | 5.45s    | 2.21M | -1         | 64.05 MB | 432.00 MB     | tpc.store_sales |
+--------------+--------+----------+----------+-------+------------+----------+---------------+-----------------+

SET statement
• Before Impala 2.0, SET could be used only in impala-shell
• In Impala 2.0, you can use SET in client applications through the JDBC / ODBC APIs.

Resource Management and Security

Admission Control (Impala 1.3)
• Fast and lightweight resource management mechanism
• avoids oversubscription of resources for concurrent workloads
  • queries are queued when reaching configurable limits
• Runs on every impalad
  • no SPOF

YARN and Llama
• Llama: Low Latency Application MAster
• Subdivides coarse-grain YARN scheduling into finer granularity for low-latency and short-lived queries
• Llama registers one long-lived AM per YARN pool
• Llama caches resources allocated by YARN for a short time, so that they can be quickly re-allocated to Impala queries
  • much faster than waiting for YARN
• Impala 1.4: GA. Llama HA support

Query Timeout
• A new query option, QUERY_TIMEOUT_S, lets you specify a timeout period in seconds for individual queries
• Note: The timeout clock for queries and sessions only starts ticking when the query or session is idle

Security
• Impala 2.0 can accept either kind of authentication request (Kerberos or LDAP)
  • ex) host A with Kerberos, and host B with LDAP
• Security-related statements
  • GRANT
  • REVOKE
  • CREATE ROLE
  • DROP ROLE
  • SHOW ROLES
  • SHOW ROLE GRANT
• --disk_spill_encryption option

Others

Text + gzip, bzip2, and Snappy
• In Impala 2.0 and later, Impala supports using text data files that employ gzip, bzip2, or Snappy compression
• use ROW FORMAT with delimiter and escape character to create table
CREATE TABLE csv_compressed (a STRING, b STRING, c STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ",";

impala-shell
• UTF-8 support (1.4)
• .impalarc file (2.0)
[impala]
verbose=true
default_db=tpc_benchmarking
write_delimited=true
output_delimiter=,
output_file=/home/tester1/benchmark_results.csv
show_profiles=true
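
The .impalarc sample above has the familiar INI layout: one `[impala]` section with `key=value` settings. As a small sketch of how such a file is structured, the snippet below parses the same text with Python's standard `configparser` module. This only illustrates the file format; it is an assumption, not a statement about how impala-shell itself reads the file.

```python
import configparser

# The sample settings from the slide, embedded as a string for the sketch.
IMPALARC = """\
[impala]
verbose=true
default_db=tpc_benchmarking
write_delimited=true
output_delimiter=,
output_file=/home/tester1/benchmark_results.csv
show_profiles=true
"""

parser = configparser.ConfigParser()
parser.read_string(IMPALARC)

settings = dict(parser["impala"])                   # all values as strings
verbose = parser.getboolean("impala", "verbose")    # typed access to a flag

for key, value in settings.items():
    print(f"{key} = {value}")
```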

Documentation
• Cluster Sizing Guidelines for Impala
  • http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/impala_cluster_sizing.html