The document discusses Impala, an SQL query engine for Hadoop. It provides an overview of Impala, details improvements in versions 1.4 and 2.0, and describes new features like subqueries, analytic functions, and data types. Performance optimizations like HDFS caching and partition pruning are also covered.
2. Today's Topic
• What is Cloudera Impala?
• Impala 1.4 / 2.0 update
  • Performance Improvement
  • Query Language
  • Resource Management and Security
  • Others
3. Who am I?
• Pre-sales Solutions Architect
• joined Cloudera in 2011, the first Japanese employee at Cloudera
• email: sho@cloudera.com
• twitter: @shiumachi
5. What is Impala?
• MPP SQL query engine for Hadoop environment
• written in native code for maximum hardware efficiency
• open-source!
  • http://impala.io/
• Supported by Cloudera, Amazon, and MapR
• History
  • 2012/10 Public Beta released
  • 2013/04 Impala 1.0 released
  • current version: Impala 2.0
6. Impala is easy to use
• create tables as virtual views over data stored in HDFS / HBase
  • schema metadata is stored in Metastore
  • shared with Hive, Pig, etc.
• connect via ODBC / JDBC
  • authenticate via Kerberos / LDAP
• run standard SQL
  • ANSI SQL-92 based
  • limited to SELECT and bulk INSERT
  • correlated subqueries available in 2.0 (not supported in earlier versions)
  • UDF / UDAF
7. Impala 1.4 (2014/07)
• DECIMAL(<precision>, <scale>)
• HDFS caching DDL
• column definition based on Parquet file (CREATE TABLE … LIKE PARQUET)
• ORDER BY without LIMIT
• LDAP connections through TLS
• SHOW PARTITIONS
• YARN integrated resource manager is production ready
• Llama HA support
• CREATE TABLE … STORED AS AVRO
• SUMMARY command in impala-shell (provides high-level summary of query plan)
• faster COMPUTE STATS
• Performance improvements for partition pruning
• impala-shell supports UTF-8 characters
• additional built-ins from EDW systems
8. Impala 2.0 (2014/10)
• hash table can spill to disk
  • join and aggregate tables of arbitrary size
• Subquery enhancements
  • allowed in WHERE clauses
  • EXISTS / NOT EXISTS
  • IN / NOT IN can operate on the result set from a subquery
  • correlated / uncorrelated subqueries
  • scalar subqueries
• SQL 2003 compliant analytic window functions
  • LEAD(), LAG(), RANK(), FIRST_VALUE(), etc.
• New Data Type: VARCHAR, CHAR
• Security Enhancements
  • multiple authentication methods
  • GRANT / REVOKE / CREATE ROLE / DROP ROLE / SHOW ROLES / etc.
• text + gzip / bzip2 / Snappy
• Hint inside views
• QUERY_TIMEOUT_S
• DATE_PART() / EXTRACT()
• Parquet default block size is changed to 256MB (was: 1GB)
• LEFT ANTI JOIN / RIGHT ANTI JOIN
• impala-shell can read settings from $HOME/.impalarc
10. HDFS caching
• When HDFS files are cached in memory, Impala can read the cached data without any disk reads, and without making an additional copy of the data in memory
  • avoids checksumming and data copies
  • new HDFS API is available in CDH 5.0
• configure cache with Impala DDL
  • CREATE TABLE tbl_name CACHED IN '<pool>'
  • ALTER TABLE tbl_name ADD PARTITION … CACHED IN '<pool>'
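The caching DDL above could be used roughly as follows. This is a minimal sketch: the table `census`, its columns, and the pool name `pool_name` are made up for illustration, and it assumes the HDFS cache pool already exists (for example, created beforehand with `hdfs cacheadmin -addPool`).

```sql
-- Hypothetical partitioned table; names are illustrative only.
CREATE TABLE census (name STRING, zip STRING)
  PARTITIONED BY (year SMALLINT);

-- Cache the whole table (all existing partitions) in an assumed,
-- pre-existing HDFS cache pool called 'pool_name'.
ALTER TABLE census SET CACHED IN 'pool_name';

-- Cache a newly added partition as it is created.
ALTER TABLE census ADD PARTITION (year = 2014) CACHED IN 'pool_name';

-- Stop caching the table's data when it is no longer hot.
ALTER TABLE census SET UNCACHED;
```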
11. Partition Pruning improvement
• Previously, Impala typically queried tables with up to approximately 3000 partitions. With the performance improvement in partition pruning, now Impala can comfortably handle tables with tens of thousands of partitions.
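To make the idea of pruning concrete, here is a small sketch with a hypothetical table `logs` partitioned by `year` and `month`. When the WHERE clause restricts the partition key columns, Impala only needs to read the matching partitions; the EXPLAIN output for the scan node should indicate how many partitions were selected out of the total.

```sql
-- Hypothetical table; the partition key columns are year and month.
CREATE TABLE logs (host STRING, msg STRING)
  PARTITIONED BY (year SMALLINT, month TINYINT);

-- Only the year=2014/month=10 partition has to be scanned,
-- because both predicates are on partition key columns.
EXPLAIN SELECT COUNT(*) FROM logs WHERE year = 2014 AND month = 10;
```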
12. Spilling to Disk SQL Operation
• write temporary data to disk when Impala is close to exceeding its memory limit
• In PROFILE, BlockMgr.BytesWritten counter reports how much data was written to disk during the query
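One way to observe spilling is sketched below for an impala-shell session. The table `big_table`, the column `c1`, and the 1 GB limit are assumptions chosen only to make a spill likely; whether a given query actually spills depends on the data and the cluster.

```sql
-- Lower the per-node memory limit for this session (illustrative value).
SET MEM_LIMIT=1g;

-- A large aggregation on a hypothetical table; its hash table may
-- spill to disk instead of failing when memory runs short.
SELECT c1, COUNT(*) FROM big_table GROUP BY c1;

-- impala-shell command: print the profile of the most recent query,
-- then look for the BlockMgr.BytesWritten counter in the output.
PROFILE;
```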
14. Subquery
Scalar subquery: produces a result set with a single row containing a single column
SELECT x FROM t1 WHERE x > (SELECT MAX(y) FROM t2);
Uncorrelated subquery: does not refer to any tables from the outer block of the query
SELECT x FROM t1 WHERE x IN (SELECT y FROM t2);
Correlated subquery: compares one or more values from the outer query block to values referenced in the WHERE clause of the subquery
SELECT employee_name, employee_id FROM employees one WHERE
  salary > (SELECT avg(salary) FROM employees two WHERE
  one.dept_id = two.dept_id);
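The EXISTS / NOT EXISTS and NOT IN forms listed among the 2.0 subquery enhancements are not shown on the slide. A brief sketch, reusing the same `t1 (x)` and `t2 (y)` tables as the examples above:

```sql
-- Rows of t1 that have at least one matching value in t2.
SELECT x FROM t1 WHERE EXISTS (SELECT 1 FROM t2 WHERE t2.y = t1.x);

-- Rows of t1 that have no matching value in t2.
SELECT x FROM t1 WHERE NOT EXISTS (SELECT 1 FROM t2 WHERE t2.y = t1.x);

-- NOT IN operating on the result set of an uncorrelated subquery.
SELECT x FROM t1 WHERE x NOT IN (SELECT y FROM t2);
```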
15. Analytic Functions (a.k.a. Window Functions)
• supported in 2.0 and later
• supported functions
  • RANK() / DENSE_RANK()
  • FIRST_VALUE() / LAST_VALUE()
  • LAG() / LEAD()
  • ROW_NUMBER()
• Aggregate functions are already implemented
  • MAX(), MIN(), AVG(), SUM(), etc.
16. Analytic Functions Example
For each day, the query prints the closing price alongside the previous day's closing price:
select stock_symbol, closing_date, closing_price,
  lag(closing_price,1) over (partition by stock_symbol order by closing_date) as "yesterday closing"
from stock_ticker
order by closing_date;
+--------------+---------------------+---------------+-------------------+
| stock_symbol | closing_date        | closing_price | yesterday closing |
+--------------+---------------------+---------------+-------------------+
| JDR          | 2014-09-13 00:00:00 | 12.86         | NULL              |
| JDR          | 2014-09-14 00:00:00 | 12.89         | 12.86             |
| JDR          | 2014-09-15 00:00:00 | 12.94         | 12.89             |
| JDR          | 2014-09-16 00:00:00 | 12.55         | 12.94             |
| JDR          | 2014-09-17 00:00:00 | 14.03         | 12.55             |
| JDR          | 2014-09-18 00:00:00 | 14.75         | 14.03             |
| JDR          | 2014-09-19 00:00:00 | 13.98         | 14.75             |
+--------------+---------------------+---------------+-------------------+
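Other analytic functions follow the same OVER (PARTITION BY … ORDER BY …) pattern as LAG() above. The sketch below reuses the `stock_ticker` table; the output column names `price_rank` and `moving_avg_3d` are made up for illustration. It ranks each day's price within a symbol and uses an aggregate function, AVG(), as an analytic function over a sliding window of three rows.

```sql
SELECT stock_symbol, closing_date, closing_price,
       -- 1 = highest closing price for that symbol.
       RANK() OVER (PARTITION BY stock_symbol
                    ORDER BY closing_price DESC) AS price_rank,
       -- Average of the current row and the two preceding rows.
       AVG(closing_price) OVER (PARTITION BY stock_symbol
                                ORDER BY closing_date
                                ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
         AS moving_avg_3d
FROM stock_ticker
ORDER BY stock_symbol, closing_date;
```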
17. Approximation features
• APPX_COUNT_DISTINCT query option
  • rewrite COUNT(DISTINCT) calls to use NDV()
  • speeds up the operation
  • allows multiple COUNT(DISTINCT) in a single query
• APPX_MEDIAN()
  • returns a value that is approximately the median (midpoint) of values in the set of input values
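A short sketch of these features, reusing the `stock_ticker` table from the analytic function example as assumed sample data. Results from NDV() and APPX_MEDIAN() are estimates, so they may differ slightly from exact answers.

```sql
-- With the option on, COUNT(DISTINCT ...) is rewritten to NDV(),
-- which also permits more than one of them in the same query.
SET APPX_COUNT_DISTINCT=true;
SELECT COUNT(DISTINCT stock_symbol), COUNT(DISTINCT closing_date)
FROM stock_ticker;

-- Calling NDV() directly gives the same kind of estimate.
SELECT NDV(stock_symbol) FROM stock_ticker;

-- Approximate median of the closing prices.
SELECT APPX_MEDIAN(closing_price) FROM stock_ticker;
```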
19. CREATE TABLE … LIKE PARQUET
• CREATE TABLE ... LIKE PARQUET 'hdfs_path_of_parquet_file'
• The column names and data types are automatically configured based on the Parquet data file
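For illustration, a sketch with a made-up table name and HDFS path. The STORED AS PARQUET clause is included on the assumption that the new table should itself hold Parquet data; the LIKE PARQUET clause only supplies the column definitions.

```sql
-- Hypothetical path and table name.
CREATE TABLE sales_from_file
  LIKE PARQUET '/user/etl/sample/sales.parq'
  STORED AS PARQUET;

-- Inspect the column names and types derived from the data file.
DESCRIBE sales_from_file;
```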
20. ORDER BY without LIMIT
• LIMIT clause is now optional for queries that use the ORDER BY clause
• Impala automatically uses a temporary disk work area to perform the sort if the sort operation would otherwise exceed the Impala memory limit for a particular data node.
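As a simple illustration, reusing the `stock_ticker` table from the analytic function example as assumed sample data, a full sort of the table can now be written without any LIMIT clause:

```sql
-- Sorts the entire result set; if the sort does not fit in memory
-- on a node, Impala uses its temporary disk work area.
SELECT stock_symbol, closing_date, closing_price
FROM stock_ticker
ORDER BY closing_price DESC;
```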
22. ANTI JOIN
LEFT ANTI JOIN / RIGHT ANTI JOIN are supported in Impala 2.0
[localhost:21000] > create table t1 (x int);
[localhost:21000] > insert into t1 values (1), (2), (3), (4), (5), (6);

[localhost:21000] > create table t2 (y int);
[localhost:21000] > insert into t2 values (2), (4), (6);

[localhost:21000] > select x from t1 left anti join t2 on (t1.x = t2.y);
+---+
| x |
+---+
| 1 |
| 3 |
| 5 |
+---+
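The slide only demonstrates LEFT ANTI JOIN. A sketch of the RIGHT ANTI JOIN counterpart on the same `t1` and `t2` tables: it keeps rows from the right-hand table that have no match on the left. With the data inserted above, every value in `t2` also appears in `t1`, so both queries below return no rows.

```sql
-- Rows of t2 (right side) with no matching x in t1.
SELECT y FROM t1 RIGHT ANTI JOIN t2 ON (t1.x = t2.y);

-- The same question asked as a LEFT ANTI JOIN with the tables swapped.
SELECT y FROM t2 LEFT ANTI JOIN t1 ON (t1.x = t2.y);
```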
23. new data types
• DECIMAL (Impala 1.4)
  • column_name DECIMAL[(precision[,scale])]
  • DECIMAL with no precision or scale values is equivalent to DECIMAL(9,0)
• VARCHAR (Impala 2.0)
  • STRING with a max length
• CHAR (Impala 2.0)
  • STRING with a fixed length
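A sketch of a table definition using the three new types; the table and column names are hypothetical.

```sql
CREATE TABLE products (
  id           BIGINT,
  price        DECIMAL(10,2),  -- up to 10 digits, 2 after the decimal point
  qty          DECIMAL,        -- no precision/scale: same as DECIMAL(9,0)
  name         VARCHAR(64),    -- string of at most 64 characters
  country_code CHAR(2)         -- string of exactly 2 characters
);
```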
24. new built-in functions
• EXTRACT(): returns one date or time field from a TIMESTAMP value
• TRUNC(): truncates date/time values to year, month, etc.
• ADD_MONTHS(): alias for MONTHS_ADD()
• ROUND(): rounds DECIMAL values
• for computing properties for statistical distributions
  • STDDEV()
  • STDDEV_SAMP() / STDDEV_POP()
  • VARIANCE()
  • VARIANCE_SAMP() / VARIANCE_POP()
• MAX_INT() / MIN_SMALLINT()
• IS_INF() / IS_NAN()
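A few of these built-ins in use, as a sketch. The column aliases are made up, and the second query reuses the `stock_ticker` table from the analytic function example as assumed sample data.

```sql
SELECT EXTRACT(YEAR FROM now())                  AS this_year,
       TRUNC(now(), 'MM')                        AS start_of_month,
       ADD_MONTHS(now(), 1)                      AS one_month_later,
       ROUND(CAST(3.14159 AS DECIMAL(10,5)), 2)  AS rounded,
       MAX_INT()                                 AS largest_int;

-- Spread of closing prices per symbol.
SELECT stock_symbol,
       STDDEV(closing_price)   AS price_stddev,
       VARIANCE(closing_price) AS price_variance
FROM stock_ticker
GROUP BY stock_symbol;
```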
26. SUMMARY
• impala-shell command
• easy-to-digest overview of the timings for the different phases of execution for a query
[localhost:21000] > select avg(ss_sales_price) from store_sales where ss_coupon_amt = 0;
+---------------------+
| avg(ss_sales_price) |
+---------------------+
| 37.80770926328327   |
+---------------------+
[localhost:21000] > summary;
+--------------+--------+----------+----------+-------+------------+----------+---------------+-----------------+
| Operator     | #Hosts | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem | Est. Peak Mem | Detail          |
+--------------+--------+----------+----------+-------+------------+----------+---------------+-----------------+
| 03:AGGREGATE | 1      | 1.03ms   | 1.03ms   | 1     | 1          | 48.00 KB | -1 B          | MERGE FINALIZE  |
| 02:EXCHANGE  | 1      | 0ns      | 0ns      | 1     | 1          | 0 B      | -1 B          | UNPARTITIONED   |
| 01:AGGREGATE | 1      | 30.79ms  | 30.79ms  | 1     | 1          | 80.00 KB | 10.00 MB      |                 |
| 00:SCAN HDFS | 1      | 5.45s    | 5.45s    | 2.21M | -1         | 64.05 MB | 432.00 MB     | tpc.store_sales |
+--------------+--------+----------+----------+-------+------------+----------+---------------+-----------------+
27. SET statement
• Before Impala 2.0, SET could be used only in impala-shell
• In Impala 2.0, you can use SET in client apps through JDBC / ODBC APIs.
29. Admission Control (Impala 1.3)
• Fast and lightweight resource management mechanism
• avoids oversubscription of resources for concurrent workloads
  • queries are queued when reaching configurable limits
• Runs on every impalad
  • no SPOF
30. YARN and Llama
• Llama: Low Latency Application MAster
• Subdivides coarse-grain YARN scheduling into finer granularity for low-latency and short-lived queries
• Llama registers one long-lived AM per YARN pool
• Llama caches resources allocated by YARN for a short time, so that they can be quickly re-allocated to Impala queries
  • much faster than waiting for YARN
• Impala 1.4: GA. Llama HA support
31. Query Timeout
• A new query option, QUERY_TIMEOUT_S, lets you specify a timeout period in seconds for individual queries
• Note: The timeout clock for queries and sessions only starts ticking when the query or session is idle
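A minimal sketch of setting the option for a session. The 30-second value is arbitrary, and the query reuses the `stock_ticker` table from the analytic function example as assumed sample data. Because SET is an ordinary statement in Impala 2.0, the same line could also be sent from a JDBC / ODBC client.

```sql
-- Queries in this session become eligible for cancellation
-- after sitting idle for 30 seconds (illustrative value).
SET QUERY_TIMEOUT_S=30;

SELECT COUNT(*) FROM stock_ticker;
```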
32. Security
• Impala 2.0 can accept either kind of authentication request
  • e.g., host A with Kerberos, and host B with LDAP
• Security related statements
  • GRANT
  • REVOKE
  • CREATE ROLE
  • DROP ROLE
  • SHOW ROLES
  • SHOW ROLE GRANT
• --disk_spill_encryption option
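The statements listed above fit together roughly as follows. This is a sketch that assumes role-based authorization is already enabled on the cluster; the role `analyst`, the group `analysts`, and the table `default.stock_ticker` are hypothetical names.

```sql
-- Create a role and attach it to an existing OS/LDAP group.
CREATE ROLE analyst;
GRANT ROLE analyst TO GROUP analysts;

-- Give the role read access to one table.
GRANT SELECT ON TABLE default.stock_ticker TO ROLE analyst;

-- Inspect what exists.
SHOW ROLES;
SHOW ROLE GRANT GROUP analysts;

-- Undo the changes.
REVOKE SELECT ON TABLE default.stock_ticker FROM ROLE analyst;
DROP ROLE analyst;
```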
34. Text + gzip, bzip2, and Snappy
• In Impala 2.0 and later, Impala supports using text data files that employ gzip, bzip2, or Snappy compression
• use ROW FORMAT with delimiter and escape character to create table
CREATE TABLE csv_compressed (a STRING, b STRING, c STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ",";
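Once the table exists, compressed files can be placed in it like any other text data file. A sketch, assuming a gzip-compressed CSV file already sits at a hypothetical HDFS staging path and keeps its `.gz` extension so the compression can be recognized:

```sql
-- Move the compressed file into the table's data directory.
LOAD DATA INPATH '/user/etl/staging/data.csv.gz' INTO TABLE csv_compressed;

-- Query it as usual; decompression happens while scanning.
SELECT COUNT(*) FROM csv_compressed;
```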