Cloudera Introduction
Important Notice
© 2010-2016 Cloudera, Inc. All rights reserved.
Cloudera, the Cloudera logo, Cloudera Impala, and any other product or service names
or slogans contained in this document are trademarks of Cloudera and its suppliers or
licensors, and may not be copied, imitated or used, in whole or in part, without the prior
written permission of Cloudera or the applicable trademark holder.
Hadoop and the Hadoop elephant logo are trademarks of the Apache Software
Foundation. All other trademarks, registered trademarks, product names and company
names or logos mentioned in this document are the property of their respective owners.
Reference to any products, services, processes or other information, by trade name,
trademark, manufacturer, supplier or otherwise does not constitute or imply
endorsement, sponsorship or recommendation thereof by us.
Complying with all applicable copyright laws is the responsibility of the user. Without
limiting the rights under copyright, no part of this document may be reproduced, stored
in or introduced into a retrieval system, or transmitted in any form or by any means
(electronic, mechanical, photocopying, recording, or otherwise), or for any purpose,
without the express written permission of Cloudera.
Cloudera may have patents, patent applications, trademarks, copyrights, or other
intellectual property rights covering subject matter in this document. Except as expressly
provided in any written license agreement from Cloudera, the furnishing of this document
does not give you any license to these patents, trademarks, copyrights, or other
intellectual property. For information about patents covering Cloudera products, see
http://tiny.cloudera.com/patents.
The information in this document is subject to change without notice. Cloudera shall
not be liable for any damages resulting from technical errors or omissions which may
be present in this document, or from use of this document.
Cloudera, Inc.
1001 Page Mill Road, Bldg 3
Palo Alto, CA 94304
info@cloudera.com
US: 1-888-789-1488
Intl: 1-650-362-0488
www.cloudera.com
Release Information
Version: Cloudera Enterprise 5.8.x
Date: September 7, 2016
Table of Contents
About Cloudera Introduction...................................................................................6
Documentation Overview....................................................................................................................................6
CDH Overview..........................................................................................................8
Apache Impala (incubating) Overview.................................................................................................................8
Impala Benefits......................................................................................................................................................................9
How Impala Works with CDH.................................................................................................................................................9
Primary Impala Features......................................................................................................................................................10
External Documentation....................................................................................................................................34
Impala Availability...............................................................................................................................................................87
Impala Internals...................................................................................................................................................................88
SQL.......................................................................................................................................................................................90
Partitioned Tables................................................................................................................................................................91
HBase...................................................................................................................................................................................92
Getting Support.....................................................................................................96
Cloudera Support...............................................................................................................................................96
Information Required for Logging a Support Case...............................................................................................................96
Community Support...........................................................................................................................................96
Get Announcements about New Releases.........................................................................................................97
Report Issues......................................................................................................................................................97
Documentation Overview
The following guides are included in the Cloudera documentation set:
Overview of Cloudera and the Cloudera Documentation Set: Cloudera provides a scalable, flexible, integrated platform that makes it easy to manage rapidly increasing volumes and varieties of data in your enterprise. Cloudera products and solutions enable you to deploy and manage Apache Hadoop and related projects, manipulate and analyze your data, and keep that data secure and protected.
Cloudera Release Notes: This guide contains release and download information for installers and administrators. It includes release notes as well as information about versions and downloads. The guide also provides a release matrix that shows which major and minor release version of a product is supported with which release version of Cloudera Manager, CDH, and, if applicable, Cloudera Impala.
Cloudera QuickStart: This guide describes how to quickly install Cloudera software and create initial deployments for proof of concept (POC) or development. It describes how to download and use the QuickStart virtual machines, which provide everything you need to start a basic installation. It also shows you how to create a new installation of Cloudera Manager 5, CDH 5, and managed services on a cluster of four hosts. QuickStart installations should be used for demonstrations and POC applications only and are not recommended for production.
Cloudera Administration
Cloudera Operation: This guide shows how to monitor the health of a Cloudera deployment and diagnose issues. You can obtain metrics and usage information and view processing activities. This guide also describes how to examine logs and reports to troubleshoot issues with cluster configuration and operation, as well as monitor compliance.
Cloudera Security: This guide is intended for system administrators who want to secure a cluster using data encryption, user authentication, and authorization techniques. This topic also provides information about Hadoop security programs and shows you how to set up a gateway to restrict access.
Apache Impala (incubating): This guide describes Impala, its features and benefits, and how it works with CDH. This topic introduces Impala concepts, describes how to plan your Impala deployment, and provides tutorials for first-time users as well as more advanced tutorials that describe scenarios and specialized features. You will also find a language reference, performance tuning information, instructions for using the Impala shell, troubleshooting information, and frequently asked questions.
Cloudera Search: This guide explains how to configure and use Cloudera Search. This includes topics such as extracting, transforming, and loading data, establishing high availability, and troubleshooting.
Spark Guide
Cloudera Glossary
CDH Overview
CDH is the most complete, tested, and popular distribution of Apache Hadoop and related projects. CDH delivers the core elements of Hadoop (scalable storage and distributed computing) along with a Web-based user interface and vital enterprise capabilities. CDH is Apache-licensed open source and is the only Hadoop solution to offer unified batch processing, interactive SQL and interactive search, and role-based access controls.
CDH provides:
Flexibility: Store any type of data and manipulate it with a variety of different computation frameworks, including batch processing, interactive SQL, free-text search, machine learning, and statistical computation.
Integration: Get up and running quickly on a complete Hadoop platform that works with a broad range of hardware and software solutions.
Security: Process and control sensitive data.
Scalability: Enable a broad range of applications, and scale and extend them to suit your requirements.
High availability: Perform mission-critical business tasks with confidence.
Compatibility: Leverage your existing IT infrastructure and investment.
For information about CDH components that is beyond the scope of the Cloudera documentation, see the links in External Documentation on page 34.
Apache Impala (incubating) Overview

Note: Impala was accepted into the Apache incubator on December 2, 2015. Where the documentation formerly referred to Cloudera Impala, it now uses the official name, Apache Impala (incubating).
Impala Benefits
Impala provides:
Familiar SQL interface that data scientists and analysts already know.
Ability to query high volumes of data (big data) in Apache Hadoop.
Distributed queries in a cluster environment, for convenient scaling and to make use of cost-effective commodity
hardware.
Ability to share data files between different components with no copy or export/import step; for example, to
write with Pig, transform with Hive and query with Impala. Impala can read from and write to Hive tables, enabling
simple data interchange using Impala for analytics on Hive-produced data.
Single system for big data processing and analytics, so customers can avoid costly modeling and ETL just for
analytics.
How Impala Works with CDH

Queries are processed as follows:
1. User applications send SQL queries to Impala through ODBC or JDBC, which provide standardized querying interfaces. The user application can connect to any impalad in the cluster; that impalad becomes the coordinator for the query.
2. Impala parses the query and analyzes it to determine which tasks need to be performed by impalad instances across the cluster, planning execution for efficiency.
3. Services such as HDFS and HBase are accessed by local impalad instances to provide data.
4. Each impalad returns data to the coordinating impalad, which sends these results to the client.
Cloudera Search Overview

Using Cloudera Search with the CDH infrastructure provides:
Simplified infrastructure
Better production visibility
Quicker insights across various data types
Quicker problem resolution
Simplified interaction and platform access for more users and use cases
Scalability, flexibility, and reliability of search services on the same platform used to run other types of workloads on the same data
Cloudera Search includes the following features:

Unified management and monitoring with Cloudera Manager: Cloudera Manager provides unified and centralized management and monitoring for CDH and Cloudera Search. Cloudera Manager simplifies deployment, configuration, and monitoring of your search services. Many existing search solutions lack management and monitoring capabilities and fail to provide deep insight into utilization, system health, trending, and other supportability aspects.

Index storage in HDFS: Cloudera Search is integrated with HDFS for index storage. Indexes created by Solr/Lucene can be written directly in HDFS with the data, instead of to local disk, thereby providing fault tolerance and redundancy.
Cloudera Search is optimized for fast read and write of indexes in HDFS while
indexes are served and queried through standard Solr mechanisms. Because
data and indexes are co-located, data processing does not require transport
or separately managed storage.
Batch index creation through MapReduce: To facilitate index creation for large data sets, Cloudera Search has built-in MapReduce jobs for indexing data stored in HDFS. As a result, the linear scalability of MapReduce is applied to the indexing pipeline.
Real-time and scalable indexing at data ingest: Cloudera Search provides integration with Flume to support near real-time indexing. As new events pass through a Flume hierarchy and are written to HDFS, those events can be written directly to Cloudera Search indexers. In addition, Flume supports routing events, filtering, and annotation of data passed to CDH. These features work with Cloudera Search for improved index sharding, index separation, and document-level access control.
Easy interaction and data exploration through Hue: A Cloudera Search GUI is provided as a Hue plug-in, enabling users to interactively query data, view result files, and do faceted exploration. Hue can also schedule standing queries and explore index files. This GUI uses the Cloudera Search API, which is based on the standard Solr API.
Simplified data processing for Search workloads: Cloudera Search relies on Apache Tika for parsing and preparation of many of the standard file formats for indexing. Additionally, Cloudera Search supports Avro, Hadoop Sequence, and Snappy file format mappings, as well as Log file formats, JSON, XML, and HTML. Cloudera Search also provides data preprocessing using Morphlines, which simplifies index configuration for these formats. Users can use the configuration for other applications, such as MapReduce jobs.
HBase search: Cloudera Search integrates with HBase through the Lily HBase Indexer Service, described below.

Once the indexes are stored in HDFS, they can be queried using standard Solr mechanisms, as described above for the near-real-time indexing use case.
The Lily HBase Indexer Service is a flexible, scalable, fault tolerant, transactional, near real-time oriented system for
processing a continuous stream of HBase cell updates into live search indexes. Typically, the time between data ingestion
using the Flume sink to that content potentially appearing in search results is measured in seconds, although this
duration is tunable. The Lily HBase Indexer uses Solr to index data stored in HBase. As HBase applies inserts, updates,
and deletes to HBase table cells, the indexer keeps Solr consistent with the HBase table contents, using standard HBase
replication features. The indexer supports flexible custom application-specific rules to extract, transform, and load
HBase data into Solr. Solr search results can contain columnFamily:qualifier links back to the data stored in
HBase. This way applications can use the Search result set to directly access matching raw HBase cells. Indexing and
searching do not affect operational stability or write throughput of HBase because the indexing and searching processes
are separate and asynchronous to HBase.
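For illustration, the following minimal sketch shows how an application might follow such a link from a Search result back to the raw HBase cell, using the standard HBase client API; the table name, row key, column family, and qualifier are hypothetical values that would normally come from the Solr result document:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class FetchCellFromSearchHit {
    public static void main(String[] args) throws Exception {
        // These values would normally come from a Solr result document.
        String rowKey = "row-0001";      // hypothetical row key
        String family = "cf";            // hypothetical column family
        String qualifier = "body";       // hypothetical qualifier

        Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml from the classpath
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("indexed_table"))) {
            Get get = new Get(Bytes.toBytes(rowKey));
            get.addColumn(Bytes.toBytes(family), Bytes.toBytes(qualifier));
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes(family), Bytes.toBytes(qualifier));
            System.out.println(value == null ? "(cell not found)" : Bytes.toString(value));
        }
    }
}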
The following Cloudera components contribute to the Search process; where known, the cases in which each applies (all, many, or some) are noted in parentheses.

HDFS (all cases): Stores source documents. Search indexes source documents to make them searchable. Files that support Cloudera Search, such as Lucene index files and write-ahead logs, are also stored in HDFS. Using HDFS provides simpler provisioning on a larger base, redundancy, and fault tolerance. With HDFS, Cloudera Search servers are essentially stateless, so host failures have minimal consequences. HDFS also provides snapshotting, inter-cluster replication, and disaster recovery.
MapReduce
Flume
Hue (many cases): Search includes a Hue front-end search application that uses standard Solr APIs. The application can interact with data indexed in HDFS. The application provides support for the Solr standard query language, visualization of faceted search functionality, and a typical full-text search GUI.
Morphlines
ZooKeeper (many cases): Coordinates distribution of data and metadata, also known as shards. It provides automatic failover to increase service resiliency.
Spark (some cases): The CrunchIndexerTool can use Spark to move data from HDFS files into Apache Solr, and run the data through a morphline for extraction and transformation.
HBase (some cases): Supports indexing of stored data, extracting columns, column families, and key information as fields. Although HBase does not use secondary indexing, Cloudera Search can complete full-text searches of content in rows and tables in HBase.
Cloudera Manager (some cases): Deploys, configures, manages, and monitors Cloudera Search processes and resource utilization across services on the cluster. Cloudera Manager helps simplify Cloudera Search administration, but it is not required.
Cloudera Navigator (some cases): Cloudera Navigator provides governance for Hadoop systems, including support for auditing Search operations.
Sentry
Oozie
Impala
Hive
Parquet (some cases): Provides a columnar storage format, enabling especially rapid result returns for structured workloads such as Impala or Hive. Morphlines provide an efficient pipeline for extracting data from Parquet.
Avro
Kafka (some cases): Search uses this message broker project to increase throughput and decrease latency for handling real-time data.
Sqoop (some cases): Ingests data in batch and enables data availability for batch indexing.
Mahout
Each Cloudera Search server can handle requests for information. As a result, a client can send requests to index
documents or perform searches to any Search server, and that server routes the request to the correct server.
Each search deployment requires:
ZooKeeper on one host. You can install ZooKeeper, Search, and HDFS on the same host.
HDFS on at least one but as many as all hosts. HDFS is commonly installed on all hosts.
Solr on at least one but as many as all hosts. Solr is commonly installed on all hosts.
Adding more hosts with Solr and HDFS provides the following benefits:
More search host installations doing work.
More search and HDFS collocation, increasing the degree of data locality. More local data provides faster performance and reduces network traffic.
The following graphic illustrates some of the key elements in a typical deployment.
Solr files stored in ZooKeeper. Copies of these files exist on all Solr servers.
solrconfig.xml: Contains the parameters for configuring Solr.
schema.xml: Contains all of the details about which fields your documents can contain, and how those fields
should be dealt with when adding documents to the index, or when querying those fields.
The following files are copied from hadoop-conf in the HDFS configuration to the Solr servers:
core-site.xml
hdfs-site.xml
ssl-client.xml
hadoop-env.sh
topology.map
topology.py
log4j.properties
jaas.conf
solr.keytab
sentry-site.xml
Search can be deployed using parcels or packages. Some files are always installed to the same location and some files
are installed to different locations based on whether the installation is completed using parcels or packages.
Client Files
Client files are always installed to the same location and are required on any host where corresponding services are
installed. In a Cloudera Manager environment, Cloudera Manager manages settings. In an unmanaged deployment,
all files can be manually edited. All files are found in a subdirectory of /etc/. Client configuration file types and their
locations are:
/etc/solr/conf for Solr client settings files
/etc/hadoop/conf for HDFS, MapReduce, and YARN client settings files
/etc/zookeeper/conf for ZooKeeper configuration files
Server Files
Server configuration file locations vary based on how services are installed.
Cloudera Manager environments store all configuration files in /var/run/.
Unmanaged environments store configuration files in /etc/svc/conf. For example:
/etc/solr/conf
/etc/zookeeper/conf
/etc/hadoop/conf
In a typical environment, administrators establish systems for search. For example, HDFS is established to provide
storage; Flume or distcp are established for content ingestion. After administrators establish these services, users
can use ingestion tools such as file copy utilities or Flume sinks.
Indexing
Content must be indexed before it can be searched. Indexing comprises the following steps:
1. Extraction, transformation, and loading (ETL) - Use existing engines or frameworks such as Apache Tika or Cloudera
Morphlines.
a. Content and metadata extraction
b. Schema mapping
2. Create indexes using Lucene.
a. Index creation
b. Index serialization
Indexes are typically stored on a local file system. Lucene supports additional index writers and readers. One HDFS-based
interface implemented as part of Apache Blur is integrated with Cloudera Search and has been optimized for CDH-stored
indexes. All index data in Cloudera Search is stored in and served from HDFS.
You can index content in three ways:
Batch indexing using MapReduce
To use MapReduce to index documents, run a MapReduce job on content in HDFS to produce a Lucene index. The
Lucene index is written to HDFS, and this index is subsequently used by search services to provide query results.
Batch indexing is most often used when bootstrapping a search cluster. The Map component of the MapReduce task
parses input into indexable documents, and the Reduce component contains an embedded Solr server that indexes
the documents produced by the Map. You can also configure a MapReduce-based indexing job to use all assigned
resources on the cluster, utilizing multiple reducing steps for intermediate indexing and merging operations, and then
writing the reduction to the configured set of shard sets for the service. This makes the batch indexing process as
scalable as MapReduce workloads.
Near real-time (NRT) indexing using Flume
Flume events are typically collected and written to HDFS. Although any Flume event can be written, logs are most
common.
Cloudera Search includes a Flume sink that enables you to write events directly to the indexer. This sink provides a
flexible, scalable, fault-tolerant, near real-time (NRT) system for processing continuous streams of records to create
live-searchable, free-text search indexes. Typically, data ingested using the Flume sink appears in search results in
seconds, although you can tune this duration.
The Flume sink meets the needs of identified use cases that rely on NRT availability. Data can flow from multiple sources
through multiple flume hosts. These hosts, which can be spread across a network, route this information to one or
more Flume indexing sinks. Optionally, you can split the data flow, storing the data in HDFS while writing it to be
indexed by Lucene indexers on the cluster. In that scenario, data exists both as data and as indexed data in the same
storage infrastructure. The indexing sink extracts relevant data, transforms the material, and loads the results to live
Solr search servers. These Solr servers are immediately ready to serve queries to end users or search applications.
This flexible, customizable system scales effectively because parsing is moved from the Solr server to the multiple
Flume hosts for ingesting new content.
Search includes parsers for standard data formats including Avro, CSV, Text, HTML, XML, PDF, Word, and Excel. You
can extend the system by adding additional custom parsers for other file or data formats in the form of Tika plug-ins.
Any type of data can be indexed: a record is a byte array of any format, and custom ETL logic can handle any format
variation.
In addition, Cloudera Search includes a simplifying ETL framework called Cloudera Morphlines that can help adapt and
pre-process data for indexing. This eliminates the need for specific parser deployments, replacing them with simple
commands.
Cloudera Search is designed to handle a variety of use cases:
Search supports routing to multiple Solr collections to assign a single set of servers to support multiple user groups
(multi-tenancy).
Search supports routing to multiple shards to improve scalability and reliability.
Index servers can be collocated with live Solr servers serving end-user queries, or they can be deployed on separate
commodity hardware, for improved scalability and reliability.
Indexing load can be spread across a large number of index servers for improved scalability and can be replicated
across multiple index servers for high availability.
This flexible, scalable, highly available system provides low latency data acquisition and low latency querying. Instead
of replacing existing solutions, Search complements use cases based on batch analysis of HDFS data using MapReduce.
In many use cases, data flows from the producer through Flume to both Solr and HDFS. In this system, you can use
NRT ingestion and batch analysis tools.
NRT indexing using some other client that uses the NRT API
Other clients can complete NRT indexing. This is done when the client first writes files directly to HDFS and then triggers
indexing using the Solr REST API. Specifically, the API does the following:
1. Extract content from the document contained in HDFS, where the document is referenced by a URL.
2. Map the content to fields in the search schema.
3. Create or update a Lucene index.
This is useful if you index as part of a larger workflow. For example, you could trigger indexing from an Oozie workflow.
Querying
After data is available as an index, the query API provided by the search service allows direct queries to be completed
or to be facilitated through a command-line tool or graphical interface. Cloudera Search provides a simple UI application
that can be deployed with Hue, or you can create a custom application based on the standard Solr API. Any application
that works with Solr is compatible and runs as a search-serving application for Cloudera Search, because Solr is the
core.
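As a minimal illustration of querying through the standard Solr API, the following sketch issues an HTTP query against the Solr select handler and prints the JSON response; the host name, port, and collection name are assumptions to replace with values from your deployment:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class SimpleSolrQuery {
    public static void main(String[] args) throws Exception {
        // Assumed values: adjust the host, port, and collection for your cluster.
        String solrBase = "http://search-host.example.com:8983/solr";
        String collection = "collection1";
        String query = URLEncoder.encode("*:*", StandardCharsets.UTF_8.name());

        // The /select handler with wt=json is part of the standard Solr API.
        URL url = new URL(solrBase + "/" + collection + "/select?q=" + query + "&wt=json&rows=10");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");

        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);   // raw JSON response containing the matching documents
            }
        } finally {
            conn.disconnect();
        }
    }
}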
Apache Spark Overview

Related Information
Managing Spark
Monitoring Spark Applications
Spark Authentication
Spark Encryption
Cloudera Spark forum
Apache Spark documentation
Using Apache Parquet Data Files with CDH

Note:
Once you create a Parquet table, you can query it or insert into it through other components
such as Impala and Spark.
Set dfs.block.size to 256 MB in hdfs-site.xml.
If the table will be populated with data files generated outside of Impala and Hive, you can create the table as an
external table pointing to the location where the files will be created:
hive> create external table parquet_table_name (x INT, y STRING)
ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
STORED AS
INPUTFORMAT "parquet.hive.DeprecatedParquetInputFormat"
OUTPUTFORMAT "parquet.hive.DeprecatedParquetOutputFormat"
LOCATION '/test-warehouse/tinytable';
To populate the table with an INSERT statement, and to read the table with a SELECT statement, see Using the Parquet
File Format with Impala Tables.
To set the compression type to use when writing data, configure the parquet.compression property:
set parquet.compression=GZIP;
INSERT OVERWRITE TABLE tinytable SELECT * FROM texttable;
Once you create a Parquet table (whether through Hive, as shown above, or through Impala), you can query it or insert into it through either Impala or Hive.
The Parquet format is optimized for working with large data files. In Impala 2.0 and higher, the default size of Parquet
files written by Impala is 256 MB; in lower releases, 1 GB. Avoid using the INSERT ... VALUES syntax, or partitioning
the table at too granular a level, if that would produce a large number of small files that cannot use Parquet optimizations
for large data chunks.
Inserting data into a partitioned Impala table can be a memory-intensive operation, because each data file requires a
memory buffer to hold the data before it is written. Such inserts can also exceed HDFS limits on simultaneous open
files, because each node could potentially write to a separate data file for each partition, all at the same time. Make
sure table and column statistics are in place for any table used as the source for an INSERT ... SELECT operation
into a Parquet table. If capacity problems still occur, consider splitting insert operations into one INSERT statement
per partition.
Impala can query Parquet files that use the PLAIN, PLAIN_DICTIONARY, BIT_PACKED, and RLE encodings. Currently,
Impala does not support RLE_DICTIONARY encoding. When creating files outside of Impala for use by Impala, make
sure to use one of the supported encodings. In particular, for MapReduce jobs, do not set the parquet.writer.version property (especially not to PARQUET_2_0) in the configuration of Parquet MR jobs; use the default version (format 1.0). The default format, 1.0, includes some enhancements that are compatible with older versions. Data written using the 2.0 format might not be consumable by Impala, because of its use of the RLE_DICTIONARY encoding.
If you use Sqoop to convert RDBMS data to Parquet, be careful with interpreting any resulting values from DATE,
DATETIME, or TIMESTAMP columns. The underlying values are represented as the Parquet INT64 type, which is
represented as BIGINT in the Impala table. The Parquet values represent the time in milliseconds, while Impala
interprets BIGINT as the time in seconds. Therefore, if you have a BIGINT column in a Parquet table that was imported
this way from Sqoop, divide the values by 1000 when interpreting as the TIMESTAMP type.
For complete instructions and examples, see Using the Parquet File Format with Impala Tables.
Using Parquet Files in MapReduce
MapReduce requires Thrift in its CLASSPATH and in libjars to access Parquet files. It also requires parquet-format
in libjars. Set up the following before running MapReduce jobs that access Parquet data files:
if [ -e /opt/cloudera/parcels/CDH ] ; then
CDH_BASE=/opt/cloudera/parcels/CDH
else
CDH_BASE=/usr
fi
THRIFTJAR=`ls -l $CDH_BASE/lib/hive/lib/libthrift*jar | awk '{print $9}' | head -1`
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$THRIFTJAR
export LIBJARS=`echo "$CLASSPATH" | awk 'BEGIN { RS = ":" } { print }' | grep
parquet-format | tail -1`
export LIBJARS=$LIBJARS,$THRIFTJAR
hadoop jar my-parquet-mr.jar -libjars $LIBJARS
The following example shows a MapReduce job that reads Parquet files using the Example object model:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

import parquet.Log;
import parquet.example.data.Group;
import parquet.hadoop.example.ExampleInputFormat;
public class TestReadParquet extends Configured
implements Tool {
private static final Log LOG =
Log.getLog(TestReadParquet.class);
/*
* Read a Parquet record
*/
public static class MyMap extends
Mapper<LongWritable, Group, NullWritable, Text> {
@Override
public void map(LongWritable key, Group value, Context context) throws IOException,
InterruptedException {
NullWritable outKey = NullWritable.get();
String outputRecord = "";
// Get the schema and field values of the record
String inputRecord = value.toString();
// Process the value, create an output record
// ...
context.write(outKey, new Text(outputRecord));
}
}
public int run(String[] args) throws Exception {
Job job = new Job(getConf());
job.setJarByClass(getClass());
job.setJobName(getClass().getName());
job.setMapOutputKeyClass(LongWritable.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setMapperClass(MyMap.class);
job.setNumReduceTasks(0);
job.setInputFormatClass(ExampleInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
return 0;
}
public static void main(String[] args) throws Exception {
try {
int res = ToolRunner.run(new Configuration(), new TestReadParquet(), args);
System.exit(res);
} catch (Exception e) {
e.printStackTrace();
System.exit(255);
}
}
}
The corresponding imports for a MapReduce job that writes Parquet files with the Example object model are:

import parquet.Log;
import parquet.example.data.Group;
import parquet.hadoop.example.GroupWriteSupport;
import parquet.hadoop.example.ExampleInputFormat;
import parquet.hadoop.example.ExampleOutputFormat;
import parquet.hadoop.metadata.CompressionCodecName;
import parquet.hadoop.ParquetFileReader;
import parquet.hadoop.metadata.ParquetMetadata;
import parquet.schema.MessageType;
import parquet.schema.MessageTypeParser;
import parquet.schema.Type;

The job's run() method sets the write schema, configures the output format, and ends by submitting the job:

public int run(String[] args) throws Exception {
    ...
    job.submit();
    ...
}
If input files are in Parquet format, the schema can be extracted using the getSchema method:
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.RemoteIterator;
...
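A minimal sketch of that step, assuming the job's first argument is the input directory and that the code runs inside the run() method outlined above, might look like this:

// Illustrative sketch: derive the write schema from an existing Parquet input file.
Configuration conf = getConf();
Path inputPath = new Path(args[0]);
FileSystem fs = inputPath.getFileSystem(conf);
RemoteIterator<LocatedFileStatus> it = fs.listFiles(inputPath, true);
MessageType schema = null;
while (it.hasNext() && schema == null) {
    LocatedFileStatus status = it.next();
    if (status.isFile() && status.getPath().getName().endsWith(".parquet")) {
        ParquetMetadata footer = ParquetFileReader.readFooter(conf, status.getPath());
        schema = footer.getFileMetaData().getSchema();   // getSchema returns the Parquet MessageType
    }
}
// Make the schema available to ExampleOutputFormat / GroupWriteSupport.
GroupWriteSupport.setSchema(schema, conf);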
You can then write records in the mapper by composing a Group value using the Example classes and no key:
protected void map(LongWritable key, Text value,
Mapper<LongWritable, Text, Void, Group>.Context context)
throws java.io.IOException, InterruptedException {
// Extract the desired output values from the input text;
// this sketch assumes a comma-separated line such as "1,2".
String[] fields = value.toString().split(",");
int x = Integer.parseInt(fields[0].trim());
int y = Integer.parseInt(fields[1].trim());
// 'factory' is a parquet.example.data.simple.SimpleGroupFactory field
// created from the write schema.
Group group = factory.newGroup()
.append("x", x)
.append("y", y);
context.write(null, group);
}
}
To set the compression type before submitting the job, invoke the setCompression method:
ExampleOutputFormat.setCompression(job, compression_type);
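Putting these pieces together, a minimal sketch of the write job's run() method might look like the following; the WriteMap class name for the mapper shown above, the inline schema literal, and the input and output paths are assumptions for illustration (the schema could instead be derived from existing Parquet input with getSchema, as shown earlier):

// Illustrative driver sketch for writing Parquet with the Example object model.
public int run(String[] args) throws Exception {
    Configuration conf = getConf();

    // Hypothetical schema with two int32 fields, x and y.
    String writeSchema = "message example {\n"
        + "  required int32 x;\n"
        + "  required int32 y;\n"
        + "}";
    GroupWriteSupport.setSchema(MessageTypeParser.parseMessageType(writeSchema), conf);

    Job job = new Job(conf);
    job.setJarByClass(getClass());
    job.setJobName(getClass().getName());

    job.setMapperClass(WriteMap.class);   // WriteMap is an assumed name for the mapper shown above
    job.setNumReduceTasks(0);

    job.setOutputKeyClass(Void.class);
    job.setOutputValueClass(Group.class);
    job.setOutputFormatClass(ExampleOutputFormat.class);
    ExampleOutputFormat.setCompression(job, CompressionCodecName.SNAPPY);

    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.submit();
    return job.waitForCompletion(true) ? 0 : 1;
}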
Using Parquet Files in Pig
Reading Parquet Files in Pig
If the external table is created and populated, the Pig instruction to read the data is:
grunt> A = LOAD '/test-warehouse/tinytable' USING parquet.pig.ParquetLoader AS (x: int,
y: int);
To set the compression type, configure the parquet.compression property before the first store instruction in a
Pig script:
SET parquet.compression gzip;
The supported compression types are uncompressed, gzip, and snappy (the default).
Using Parquet Files in Spark
See Accessing External Storage and Accessing Parquet Files From Spark SQL Applications.
Parquet File Interoperability
Impala has always included Parquet support, using high-performance code written in C++ to read and write Parquet
files. The Parquet JARs for use with Hive, Pig, and MapReduce are available with CDH 4.5 and higher. Using the Java-based
Parquet implementation on a CDH release lower than CDH 4.5 is not supported.
A Parquet table created by Hive can typically be accessed by Impala 1.1.1 and higher with no changes, and vice versa.
Before Impala 1.1.1, when Hive support for Parquet was not available, Impala wrote a dummy SerDe class name into
each data file. These older Impala data files require a one-time ALTER TABLE statement to update the metadata for
the SerDe class name before they can be used with Hive. See Apache Impala (incubating) Incompatible Changes and
Limitations for details.
A Parquet file written by Hive, Impala, Pig, or MapReduce can be read by any of the others. Different defaults for file
and block sizes, compression and encoding settings, and so on might cause performance differences depending on
which component writes or reads the data files. For example, Impala typically sets the HDFS block size to 256 MB and
divides the data files into 256 MB chunks, so that each I/O request reads an entire data file.
In CDH 5.5 and higher, non-Impala components that write Parquet files include extra padding to ensure that the Parquet
row groups are aligned with HDFS data blocks. The maximum amount of padding is controlled by the
parquet.writer.max-padding setting, specified as a number of bytes. By default, up to 8 MB of padding can be
added to the end of each row group. This alignment helps prevent remote reads during Impala queries. The setting
does not apply to Parquet files written by Impala, because Impala always writes each Parquet file as a single HDFS data
block.
Each release may have limitations. The following are current limitations in CDH:
The TIMESTAMP data type in Parquet files is not supported in Hive, Pig, or MapReduce in CDH 4. Attempting to
read a Parquet table created with Impala that includes a TIMESTAMP column fails.
Parquet has not been tested with HCatalog. Without HCatalog, Pig cannot correctly read dynamically partitioned
tables; this is true for all file formats.
Impala supports table columns using nested data types or complex data types such as map, struct, or array
only in Impala 2.3 (corresponding to CDH 5.5) and higher. Impala 2.2 (corresponding to CDH 5.4) can query only
the scalar columns of Parquet files containing such types. Lower releases of Impala cannot query any columns
from Parquet data files that include such types.
Cloudera supports some but not all of the object models from the upstream Parquet-MR project. Currently
supported object models are:
The Impala and Hive object models built into those components, not available in external libraries. (CDH does
not include the parquet-hive module of the parquet-mr project, because recent versions of Hive have
Parquet support built in.)
You can examine Parquet data files with the parquet-tools command-line utility, which provides subcommands such as the following:
cat: Print a file's contents to standard out. In CDH 5.5 and higher, you can use the -j option to output JSON.
head: Print the first few records of a file to standard output.
schema: Print the Parquet schema for the file.
meta: Print the file footer metadata, including key-value properties (like Avro schema), compression ratios, and other details.
arr_time = 851
crs_arr_time = 846
carrier = US
flight_num = 53
actual_elapsed_time = 63
crs_elapsed_time = 56
arrdelay = 5
depdelay = -2
origin = CMH
dest = IND
distance = 182
cancelled = 0
diverted = 0
year = 1992
month = 1
day = 3
...
crs_arr_time:
INT32 SNAPPY DO:72432398 FPO:72438151 SZ:10908972/12164626/1.12
VC:20636601 ENC:PLAIN_DICTIONARY,RLE,PLAIN
carrier:
BINARY SNAPPY DO:83341427 FPO:83341558 SZ:114916/128611/1.12
VC:20636601 ENC:PLAIN_DICTIONARY,RLE,PLAIN
flight_num:
INT32 SNAPPY DO:83456393 FPO:83488603 SZ:10216514/11474301/1.12
VC:20636601 ENC:PLAIN_DICTIONARY,RLE,PLAIN
...
Using Apache Avro Data Files with CDH

Using Avro Data Files in Hive
The following example demonstrates how to create a Hive table backed by Avro data files:
CREATE TABLE doctors
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.literal'='{
"namespace": "testing.hive.avro.serde",
"name": "doctors",
"type": "record",
"fields": [
{
"name":"number",
"type":"int",
"doc":"Order of playing the role"
},
{
"name":"first_name",
"type":"string",
"doc":"first name of actor playing role"
},
{
"name":"last_name",
"type":"string",
"doc":"last name of actor playing role"
},
{
"name":"extra_field",
"type":"string",
"doc:":"an extra field not in the original file",
"default":"fishfingers and custard"
}
]
}');
LOAD DATA LOCAL INPATH '/usr/share/doc/hive-0.7.1+42.55/examples/files/doctors.avro'
INTO TABLE doctors;
You can also create an Avro backed Hive table by using an Avro schema file:
CREATE TABLE my_avro_table(notused INT)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES (
'avro.schema.url'='file:///tmp/schema.avsc')
STORED as INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat';
avro.schema.url is a URL (here a file:// URL) pointing to an Avro schema file used for reading and writing. It
could also be an hdfs: URL; for example, hdfs://hadoop-namenode-uri/examplefile.
To enable Snappy compression on output files, run the following before writing to the table:
SET hive.exec.compress.output=true;
SET avro.output.codec=snappy;
Haivvreo SerDe has been merged into Hive as AvroSerDe and is no longer supported in its original form. schema.url
and schema.literal have been changed to avro.schema.url and avro.schema.literal as a result of the
merge. If you were using Haivvreo SerDe, you can use the Hive AvroSerDe with tables created with the Haivvreo
SerDe. For example, if you have a table my_avro_table that uses the Haivvreo SerDe, add the following to make the
table use the new AvroSerDe:
ALTER TABLE my_avro_table SET SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe';
ALTER TABLE my_avro_table SET FILEFORMAT
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat';
Using Avro Data Files in MapReduce

Write your program, using the Avro MapReduce javadoc for guidance.
At run time, include the avro and avro-mapred JARs in the HADOOP_CLASSPATH and the avro, avro-mapred and
paranamer JARs in -libjars.
To enable Snappy compression on output, call AvroJob.setOutputCodec(job, "snappy") when configuring the
job. You must also include the snappy-java JAR in -libjars.
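As a minimal sketch of that pattern, the following job reads Avro data files with the org.apache.avro.mapreduce API and writes one text line per record; the reader schema literal and the first_name field are assumptions for illustration:

import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapreduce.AvroJob;
import org.apache.avro.mapreduce.AvroKeyInputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class AvroReadExample {

    // Assumed reader schema: a record with a single string field named "first_name".
    private static final Schema SCHEMA = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"doctors\",\"fields\":"
        + "[{\"name\":\"first_name\",\"type\":\"string\"}]}");

    public static class AvroMap
            extends Mapper<AvroKey<GenericRecord>, NullWritable, Text, NullWritable> {
        @Override
        protected void map(AvroKey<GenericRecord> key, NullWritable value, Context context)
                throws IOException, InterruptedException {
            GenericRecord record = key.datum();
            context.write(new Text(String.valueOf(record.get("first_name"))), NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "avro-read-example");
        job.setJarByClass(AvroReadExample.class);

        job.setInputFormatClass(AvroKeyInputFormat.class);
        AvroJob.setInputKeySchema(job, SCHEMA);   // reader schema for the Avro input

        job.setMapperClass(AvroMap.class);
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}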
Using Avro Data Files in Pig
CDH provides AvroStorage for Avro integration in Pig.
To use it, first register the piggybank JAR file and supporting libraries:
REGISTER piggybank.jar
REGISTER lib/avro-1.7.3.jar
REGISTER lib/json-simple-1.1.jar
REGISTER lib/snappy-java-1.0.4.1.jar
With store, Pig generates an Avro schema from the Pig schema. You can override the Avro schema by specifying it
literally as a parameter to AvroStorage or by using the same schema as an existing Avro data file. See the Pig wiki
for details.
To store two relations in one script, specify an index to each store function. For example:
set1 = load 'input1.txt' using PigStorage() as ( ... );
store set1 into 'set1' using org.apache.pig.piggybank.storage.avro.AvroStorage('index',
'1');
set2 = load 'input2.txt' using PigStorage() as ( ... );
store set2 into 'set2' using org.apache.pig.piggybank.storage.avro.AvroStorage('index',
'2');
For more information, search for "index" in the AvroStorage wiki.
To enable Snappy compression on output files, do the following before issuing the STORE statement:
SET mapred.output.compress true
SET mapred.output.compression.codec org.apache.hadoop.io.compress.SnappyCodec
SET avro.output.codec snappy
For more information, see the Pig wiki. The version numbers of the JAR files to register are different on that page, so
adjust them as shown above.
Importing Avro Data Files in Sqoop 1
On the command line, use the following option to import Avro data files:
--as-avrodatafile
Sqoop 1 automatically generates an Avro schema that corresponds to the database table being exported from.
To enable Snappy compression, add the following option:
--compression-codec snappy
Data Compression
Data compression and compression formats can have a significant impact on performance. Three important places to
consider data compression are in MapReduce and Spark jobs, data stored in HBase, and Impala queries. For the most
part, the principles are similar for each.
You must balance the processing capacity required to compress and uncompress the data, the disk IO required to read
and write the data, and the network bandwidth required to send the data across the network. The correct balance of
these factors depends upon the characteristics of your cluster and your data, as well as your usage patterns.
Compression is not recommended if your data is already compressed (such as images in JPEG format). In fact, the
resulting file can sometimes be larger than the original.
For more information about compression algorithms in Hadoop, see the Compression section of the Hadoop I/O chapter
in Hadoop: The Definitive Guide.
Compression Types
Hadoop supports the following compression types and codecs:
gzip - org.apache.hadoop.io.compress.GzipCodec
bzip2 - org.apache.hadoop.io.compress.BZip2Codec
LZO - com.hadoop.compression.lzo.LzopCodec
Snappy - org.apache.hadoop.io.compress.SnappyCodec
Deflate - org.apache.hadoop.io.compress.DeflateCodec
Different file types and CDH components support different compression types. For details, see Using Apache Avro Data
Files with CDH on page 28 and Using Apache Parquet Data Files with CDH on page 20.
For guidelines on choosing compression types and configuring compression, see Choosing and Configuring Data
Compression.
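As a minimal illustration of how the codec class names listed above are used programmatically, the following sketch loads a codec by class name and wraps a file output stream with it; the output path is a placeholder, and Snappy requires the native Snappy library to be available on the host:

import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class CodecExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Instantiate one of the codec classes listed above by its class name.
        CompressionCodec codec = (CompressionCodec) ReflectionUtils.newInstance(
            conf.getClassByName("org.apache.hadoop.io.compress.SnappyCodec"), conf);

        FileSystem fs = FileSystem.get(conf);
        // Placeholder output path; the codec's default extension (.snappy) is appended.
        Path out = new Path("/tmp/example" + codec.getDefaultExtension());
        try (OutputStream raw = fs.create(out);
             OutputStream compressed = codec.createOutputStream(raw)) {
            compressed.write("hello, compressed world\n".getBytes("UTF-8"));
        }
    }
}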
Snappy Compression
Snappy is a compression/decompression library. It optimizes for very high-speed compression and decompression,
and moderate compression instead of maximum compression or compatibility with other compression libraries.
Snappy is supported for all CDH components. How you specify compression depends on the component.
Using Snappy with HBase
If you install Hadoop and HBase from RPM or Debian packages, Snappy requires no HBase configuration.
Using Snappy with Hive
To enable Snappy compression for Hive output when creating SequenceFile outputs, use the following settings:
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;
Using Snappy with MapReduce

To enable Snappy compression for MapReduce intermediate (map) output, set the following properties.

MRv1
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
YARN
<property>
<name>mapreduce.map.output.compress</name>
<value>true</value>
</property>
<property>
<name>mapred.map.output.compress.codec</name>
<value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
The deprecated MRv1 names for the final (job output) compression properties and their YARN equivalents are:

mapred.output.compress (YARN: mapreduce.output.fileoutputformat.compress)
mapred.output.compression.codec (YARN: mapreduce.output.fileoutputformat.compress.codec)
mapred.output.compression.type (YARN: mapreduce.output.fileoutputformat.compress.type)
Note: The MRv1 property names are also supported (but deprecated) in YARN. You do not need to
update them in this release.
Using Snappy with Pig
Set the same properties for Pig as for MapReduce.
Using Snappy with Spark SQL
To enable Snappy compression for Spark SQL when writing tables, specify the snappy codec in the
spark.sql.parquet.compression.codec configuration:
sqlContext.setConf("spark.sql.parquet.compression.codec","snappy")
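A minimal Java sketch of the same setting in use, assuming Spark 1.x with SQLContext and placeholder input and output paths, might look like this:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class SnappyParquetWrite {
    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf().setAppName("snappy-parquet-write");
        JavaSparkContext sc = new JavaSparkContext(sparkConf);
        SQLContext sqlContext = new SQLContext(sc.sc());

        // Request Snappy compression for Parquet output written through Spark SQL.
        sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy");

        DataFrame df = sqlContext.read().json("/tmp/input.json");   // placeholder input
        df.write().parquet("/tmp/output_parquet");                  // written as Snappy-compressed Parquet
        sc.stop();
    }
}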
Using Snappy with Sqoop

Sqoop 1: When you enable Snappy compression with the --compression-codec option, Cloudera recommends also using the --as-sequencefile option.
Sqoop 2: When you create a job (sqoop:000> create job), choose 7 (SNAPPY) as the compression format.
External Documentation
Cloudera provides documentation for CDH as a whole, whether your CDH cluster is managed by Cloudera Manager or
not. In addition, you may find it useful to refer to documentation for the individual components included in CDH. Where
possible, these links point to the main documentation for a project, in the Cloudera release archive. This ensures that
you are looking at the correct documentation for the version of a project included in CDH. Otherwise, the links may
point to the project's main site.
Apache Avro
Apache Crunch
Apache DataFu
Apache Flume
Apache Hadoop
Apache HBase
Apache Hive
Hue
Kite
Apache Mahout
Apache Oozie
Apache Parquet
Apache Pig
Apache Sentry
Apache Solr
Apache Spark
Apache Sqoop
Apache Sqoop2
Apache Whirr
Apache ZooKeeper
Terminology
To effectively use Cloudera Manager, you should first understand its terminology. The relationship between the terms
is illustrated below and their definitions follow:
Some of the terms, such as cluster and service, are used without further explanation. Others, such as role group, gateway, host template, and parcel, are expanded upon in later sections.
A common point of confusion is the overloading of the terms service and role for both types and instances; Cloudera Manager and this section sometimes use the same term for type and instance. For example, the Cloudera Manager Admin Console Home > Status tab and Clusters > ClusterName menu list service instances. This is similar to the
practice in programming languages where for example the term "string" may indicate either a type (java.lang.String)
or an instance of that type ("hi there"). When it's necessary to distinguish between types and instances, the word
"type" is appended to indicate a type and the word "instance" is appended to explicitly indicate an instance.
deployment
A configuration of Cloudera Manager and all the clusters it manages.
dynamic resource pool
In Cloudera Manager, a named configuration of resources and a policy for scheduling the resources among YARN
applications or Impala queries running in the pool.
The host tcdn501-1 is the "master" host for the cluster, so it has many more role instances, 21, compared with the 7 role instances running on the other hosts. In addition to the CDH "master" role instances, tcdn501-1 also has Cloudera Management Service roles.
Architecture
As depicted below, the heart of Cloudera Manager is the Cloudera Manager Server. The Server hosts the Admin Console
Web Server and the application logic, and is responsible for installing software, configuring, starting, and stopping
services, and managing the cluster on which the services run.
State Management
The Cloudera Manager Server maintains the state of the cluster. This state can be divided into two categories: "model"
and "runtime", both of which are stored in the Cloudera Manager Server database.
Cloudera Manager models CDH and managed services: their roles, configurations, and inter-dependencies. Model state
captures what is supposed to run where, and with what configurations. For example, model state captures the fact
that a cluster contains 17 hosts, each of which is supposed to run a DataNode. You interact with the model through
the Cloudera Manager Admin Console configuration screens and API, and through operations such as "Add Service".
Runtime state is what processes are running where, and what commands (for example, rebalance HDFS or run a
Backup/Disaster Recovery schedule or rolling restart or stop) are currently running. The runtime state includes the
exact configuration files needed to run a process. When you select Start in the Cloudera Manager Admin Console, the
server gathers up all the configuration for the relevant services and roles, validates it, generates the configuration files,
and stores them in the database.
When you update a configuration (for example, the Hue Server web port), you have updated the model state. However,
if Hue is running while you do this, it is still using the old port. When this kind of mismatch occurs, the role is marked
as having an "outdated configuration". To resynchronize, you restart the role (which triggers the configuration
re-generation and process restart).
While Cloudera Manager models all of the reasonable configurations, some cases inevitably require special handling.
To allow you to work around, for example, a bug, or to explore unsupported options, Cloudera Manager supports an
"advanced configuration snippet" mechanism that lets you add properties directly to the configuration files.
Configuration Management
Cloudera Manager defines configuration at several levels:
The service level may define configurations that apply to the entire service instance, such as an HDFS service's
default replication factor (dfs.replication).
The role group level may define configurations that apply to the member roles, such as the DataNodes' handler
count (dfs.datanode.handler.count). This can be set differently for different groups of DataNodes. For
example, DataNodes running on more capable hardware may have more handlers.
The role instance level may override configurations that it inherits from its role group. This should be used sparingly,
because it easily leads to configuration divergence within the role group. One example usage is to temporarily
enable debug logging in a specific role instance to troubleshoot an issue.
Hosts have configurations related to monitoring, software management, and resource management.
Cloudera Manager itself has configurations related to its own administrative operations.
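For example, a minimal sketch of reading a service's configuration through the Cloudera Manager REST API might look like the following; the host, port, API version, cluster name, service name, and credentials are assumptions to replace with values from your deployment:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class ReadServiceConfig {
    public static void main(String[] args) throws Exception {
        // Assumed deployment details; replace with real values.
        String cmHost = "cm-host.example.com";
        String clusterName = "Cluster%201";     // URL-encoded cluster name
        String serviceName = "hdfs";
        String apiVersion = "v13";              // assumed API version for CM 5.8; older servers use lower versions

        URL url = new URL("http://" + cmHost + ":7180/api/" + apiVersion
            + "/clusters/" + clusterName + "/services/" + serviceName + "/config");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        String auth = Base64.getEncoder()
            .encodeToString("admin:admin".getBytes(StandardCharsets.UTF_8));  // default credentials; change in production
        conn.setRequestProperty("Authorization", "Basic " + auth);

        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);   // JSON listing of the service's non-default configuration values
            }
        }
    }
}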
Role Groups
You can set configuration at the service instance (for example, HDFS) or role instance (for example, the DataNode on
host17). An individual role inherits the configurations set at the service level. Configurations made at the role level
override those inherited from the service level. While this approach offers flexibility, configuring a set of role instances
in the same way can be tedious.
In addition to making it easy to manage the configuration of subsets of roles, role groups also make it possible to
maintain different configurations for experimentation or managing shared clusters for different users or workloads.
Host Templates
In typical environments, sets of hosts have the same hardware and the same set of services running on them. A host
template defines a set of role groups (at most one of each type) in a cluster and provides two main benefits:
Adding new hosts to clusters easily - multiple hosts can have roles from different services created, configured,
and started in a single operation.
Altering the configuration of roles from different services on a set of hosts easily - which is useful for quickly
switching the configuration of an entire cluster to accommodate different workloads or users.
Server and Client Configuration
Administrators are sometimes surprised that modifying /etc/hadoop/conf and then restarting HDFS has no effect.
That is because service instances started by Cloudera Manager do not read configurations from the default locations.
To use HDFS as an example, when not managed by Cloudera Manager, there would usually be one HDFS configuration
per host, located at /etc/hadoop/conf/hdfs-site.xml. Server-side daemons and clients running on the same
host would all use that same configuration.
Cloudera Manager distinguishes between server and client configuration. In the case of HDFS, the file
/etc/hadoop/conf/hdfs-site.xml contains only configuration relevant to an HDFS client. That is, by default, if
you run a program that needs to communicate with Hadoop, it will get the addresses of the NameNode and JobTracker,
and other important configurations, from that directory. A similar approach is taken for /etc/hbase/conf and
/etc/hive/conf.
In contrast, the HDFS role instances (for example, NameNode and DataNode) obtain their configurations from a private
per-process directory, under /var/run/cloudera-scm-agent/process/unique-process-name. Giving each process
its own private execution and configuration environment allows Cloudera Manager to control each process
independently. For example, here are the contents of an example 879-hdfs-NAMENODE process directory:
$ tree -a /var/run/cloudera-scm-agent/process/879-hdfs-NAMENODE/
/var/run/cloudera-scm-agent/process/879-hdfs-NAMENODE/
cloudera_manager_agent_fencer.py
cloudera_manager_agent_fencer_secret_key.txt
cloudera-monitor.properties
core-site.xml
dfs_hosts_allow.txt
dfs_hosts_exclude.txt
event-filter-rules.json
hadoop-metrics2.properties
hdfs.keytab
hdfs-site.xml
log4j.properties
logs
stderr.log
stdout.log
topology.map
topology.py
Process Management
In a non-Cloudera Manager managed cluster, you most likely start a role instance process using an init script, for
example, service hadoop-hdfs-datanode start. Cloudera Manager does not use init scripts for the daemons
it manages; in a Cloudera Manager managed cluster, starting and stopping services using init scripts will not work.
In a Cloudera Manager managed cluster you can only start or stop role instance processes using Cloudera Manager.
Cloudera Manager uses an open source process management tool called supervisord, which starts processes, takes
care of redirecting log files, notifying of process failure, setting the effective user ID of the calling process to the right
user, and so on. Cloudera Manager supports automatically restarting a crashed process. It will also flag a role instance
with a bad health flag if its process crashes repeatedly right after start up.
Stopping the Cloudera Manager Server and the Cloudera Manager Agents will not bring down your services; any running
role instances keep running.
The Agent is started by init.d at start-up. It, in turn, contacts the Cloudera Manager Server and determines which
processes should be running. The Agent is monitored as part of Cloudera Manager's host monitoring: if the Agent stops
heartbeating, the host is marked as having bad health.
One of the Agent's main responsibilities is to start and stop processes. When the Agent detects a new process from
the Server heartbeat, the Agent creates a directory for it in /var/run/cloudera-scm-agent and unpacks the
configuration. It then contacts supervisord, which starts the process.
These actions reinforce an important point: a Cloudera Manager process never travels alone. In other words, a process
is more than just the arguments to exec(); it also includes configuration files, directories that need to be created,
and other information.
Host Management
Cloudera Manager provides several features to manage the hosts in your Hadoop clusters. The first time you run the
Cloudera Manager Admin Console, you can search for hosts to add to the cluster; once the hosts are selected, you
can map the assignment of CDH roles to hosts. Cloudera Manager automatically deploys to the hosts all software required
to participate as a managed host in a cluster: JDK, Cloudera Manager Agent, CDH, Impala, Solr, and so on.
Once the services are deployed and running, the Hosts area within the Admin Console shows the overall status of the
managed hosts in your cluster. The information provided includes the version of CDH running on the host, the cluster
to which the host belongs, and the number of roles running on the host. Cloudera Manager provides operations to
manage the lifecycle of the participating hosts and to add and delete hosts. The Cloudera Management Service Host
Monitor role performs health tests and collects host metrics to allow you to monitor the health and performance of
the hosts.
Resource Management
Resource management helps ensure predictable behavior by defining the impact of different services on cluster
resources. Use resource management to govern how cluster resources are allocated among the services and users that share them.
You can dynamically apportion resources that are statically allocated to YARN and Impala by using dynamic resource
pools.
Depending on the version of CDH you are using, dynamic resource pools in Cloudera Manager support the following
scenarios:
YARN (CDH 5) - YARN manages the virtual cores, memory, running applications, and scheduling policy for each
pool. In the preceding diagram, three dynamic resource pools (Dev, Product, and Mktg, with weights 3, 2, and 1
respectively) are defined for YARN. If an application starts and is assigned to the Product pool, and other
applications are using the Dev and Mktg pools, the Product resource pool receives 30% x 2/6 (or 10%) of the total
cluster resources. If no applications are using the Dev and Mktg pools, the YARN Product pool is allocated 30% of
the cluster resources.
Impala (CDH 5 and CDH 4) - Impala manages memory for pools running queries and limits the number of running
and queued queries in each pool.
User Management
Access to Cloudera Manager features is controlled by user accounts. A user account identifies how a user is authenticated
and determines what privileges are granted to the user.
Cloudera Manager provides several mechanisms for authenticating users. You can configure Cloudera Manager to
authenticate users against the Cloudera Manager database or against an external authentication service. The external
authentication service can be an LDAP server (Active Directory or an OpenLDAP compatible directory), or you can
specify another external service. Cloudera Manager also supports using the Security Assertion Markup Language (SAML)
to enable single sign-on.
For information about the privileges associated with each of the Cloudera Manager user roles, see Cloudera Manager
User Roles.
Security Management
Cloudera Manager strives to consolidate security configurations across several projects.
Authentication
The purpose of authentication in Hadoop, as in other systems, is simply to prove that a user or service is who he or
she claims to be.
Typically, authentication in enterprises is managed through a single distributed system, such as a Lightweight Directory
Access Protocol (LDAP) directory. LDAP authentication consists of straightforward username/password services backed
by a variety of storage systems, ranging from file to database.
A common enterprise-grade authentication system is Kerberos. Kerberos provides strong security benefits including
capabilities that render intercepted authentication packets unusable by an attacker. It virtually eliminates the threat
of impersonation by never sending a user's credentials in cleartext over the network.
Several components of the Hadoop ecosystem are converging to use Kerberos authentication with the option to manage
and store credentials in LDAP or AD. For example, Microsoft's Active Directory (AD) is an LDAP directory that also
provides Kerberos authentication for added security.
Authorization
Authorization is concerned with who or what has access or control over a given resource or service. Because Hadoop
merges the capabilities of multiple, varied, and previously separate IT systems into an enterprise data hub that
stores and works on all data within an organization, it requires multiple authorization controls with varying granularities.
In such cases, Hadoop management tools simplify setup and maintenance by:
Tying all users to groups, which can be specified in existing LDAP or AD directories.
Providing role-based access control for similar interaction methods, like batch and interactive SQL queries. For
example, Apache Sentry permissions apply to Hive (HiveServer2) and Impala.
CDH currently provides the following forms of access control:
Traditional POSIX-style permissions for directories and files, where each directory and file is assigned a single
owner and group. Each assignment has a basic set of permissions available; file permissions are simply read, write,
and execute, and directories have an additional permission to determine access to child directories.
Extended Access Control Lists (ACLs) for HDFS that provide fine-grained control of permissions for HDFS files by
allowing you to set different permissions for specific named users or named groups.
Apache HBase uses ACLs to authorize various operations (READ, WRITE, CREATE, ADMIN) by column, column
family, and column family qualifier. HBase ACLs are granted to and revoked from both users and groups.
Role-based access control with Apache Sentry.
View the status and other details of a service instance or the role instances associated with the service
Make configuration changes to a service instance, a role, or a specific role instance
Add and delete a service or role
Stop, start, or restart a service or role.
View the commands that have been run for a service or a role
View an audit event history
Deploy and download client configurations
Decommission and recommission role instances
Enter or exit maintenance mode
Perform actions unique to a specific type of service. For example:
Enable HDFS high availability or NameNode federation
Run the HDFS Balancer
Create HBase, Hive, and Sqoop directories
View the status and a variety of detail metrics about individual hosts
Make configuration changes for host monitoring
View all the processes running on a host
Run the Host Inspector
Add and delete hosts
Create and manage host templates
Manage parcels
Decommission and recommission hosts
Make rack assignments
Run the host upgrade wizard
Diagnostics - Review logs, events, and alerts to diagnose problems. The subpages are:
Events - Search for and display events and alerts that have occurred.
Logs - Search logs by service, role, host, and search phrase as well as log level (severity).
Server Log - Display the Cloudera Manager Server log.
Audits - Query and filter audit events, including logins, across clusters.
Charts - Query for metrics of interest, display them as charts, and display personalized chart dashboards.
Backup - Manage replication schedules and snapshot policies.
Administration - Administer Cloudera Manager. The subpages are:
Help
Installation Guide
API Documentation
Release Notes
About - Version number and build details of Cloudera Manager and the current date and time stamp of the
Cloudera Manager server.
Logged-in User Menu - The currently logged-in user. The subcommands are:
Change Password - Change the password of the currently logged in user.
Logout
You can also go to the Home > Status tab by clicking the Cloudera Manager logo in the top navigation bar.
Status
The Status tab contains:
Clusters - The clusters being managed by Cloudera Manager. Each cluster is displayed either in summary form or
in full form depending on the configuration of the Administration > Settings > Other > Maximum Cluster Count
Shown In Full property. When the number of clusters exceeds the value of the property, only cluster summary
information displays.
Summary Form - A list of links to cluster status pages. Click Customize to jump to the Administration >
Settings > Other > Maximum Cluster Count Shown In Full property.
Full Form - A separate section for each cluster containing a link to the cluster status page and a table containing
links to the Hosts page and the status pages of the services running in the cluster.
Each service row in the table has a menu of actions that you select by clicking the actions menu icon,
and can contain one or more of the following indicators:
The indicators, their meanings, and their descriptions are as follows:

Health issue - Indicates that the service has at least one health issue. The indicator shows the number of health issues at the highest severity level. If there are Bad health test results, the indicator is red. If there are no Bad health test results, but Concerning test results exist, then the indicator is yellow. No indicator is shown if there are no Bad or Concerning health test results.
Important: If there is one Bad health test result and two Concerning health results, there will be three health issues, but the number shown will be one.
Click the indicator to display the Health Issues pop-up dialog box. By default, only Bad health test results are shown in the dialog box. To display Concerning health test results, click the Also show n concerning issue(s) link. Click a link to display the Status page with details about the health test result.

Configuration issue - Indicates that the service has at least one configuration issue. The indicator shows the number of configuration issues at the highest severity level. If there are configuration errors, the indicator is red. If there are no errors but configuration warnings exist, then the indicator is yellow. No indicator is shown if there are no configuration notifications.
Important: If there is one configuration error and two configuration warnings, there will be three configuration issues, but the number shown will be one.
Click the indicator to display the Configuration Issues pop-up dialog box. By default, only notifications at the Error severity level, grouped by service name, are shown in the dialog box. To display Warning notifications, click the Also show n warning(s) link. Click the message associated with an error or warning to go to the configuration property for which the notification was issued, where you can address the issue. See Managing Services.

Restart Needed - Configuration modified.
Refresh Needed - Client configuration redeployment required.
Cloudera Management Service - A table containing a link to the Cloudera Management Service. The Cloudera
Management Service has a menu of actions that you select by clicking the actions menu icon.
Charts - A set of charts (dashboard) that summarize resource utilization (IO, CPU usage) and processing
metrics.
Automatic Logout
For security purposes, Cloudera Manager automatically logs out a user session after 30 minutes. You can change this
session logout period.
You configure the timeout period in the Cloudera Manager Admin Console settings. When the timeout is one minute
from triggering, the user sees a warning message. If the user does not click the mouse or press a key, the user is
logged out of the session and a message indicating that the session has ended appears.
You cannot install a DSSD D5 cluster using a Cloudera Manager instance that is already managing a cluster.
You set a single property to enable DSSD Mode.
You set several DSSD D5-specific properties.
When installing CDH and other services from Cloudera Manager, only parcel installations are supported. Package
installations are not supported. See Managing Software Installation Using Cloudera Manager.
See Installation with the EMC DSSD D5 for complete installation instructions.
Quick Start
Cloudera Manager API tutorial
Cloudera Manager API documentation
Python client
Using the Cloudera Manager Java API for Cluster Automation on page 58
For example:
http://cm_server_host:7180/api/v13/clusters/Cluster%201/services/OOZIE-1/roles/
OOZIE-1-OOZIE_SERVER-e121641328fcb107999f2b5fd856880d/process/configFiles/oozie-site.xml
Search the results for the display name of the desired property. For example, a search for the display name HDFS
Service Environment Advanced Configuration Snippet (Safety Valve) shows that the corresponding property name
is hdfs_service_env_safety_valve:
{
  "name" : "hdfs_service_env_safety_valve",
  "required" : false,
  "displayName" : "HDFS Service Environment Advanced Configuration Snippet (Safety Valve)",
  "description" : "For advanced use only, key/value pairs (one on each line) to be inserted into a role's environment. Applies to configurations of all roles in this service except client configuration.",
  "relatedName" : "",
  "validationState" : "OK"
}
Similar to finding service properties, you can also find host properties. First, get the host IDs for a cluster with the URL:
http://cm_server_host:7180/api/v13/hosts
Then obtain the host properties by including one of the returned host IDs in the URL:
http://cm_server_host:7180/api/v13/hosts/2c2e951c-adf2-4780-a69f-0382181f1821?view=FULL
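A representative way to export the Cloudera Manager configuration is a curl call against the API deployment endpoint; this sketch assumes the v13 API and the /cm/deployment resource:
curl -u admin_uname:admin_pass "http://cm_server_host:7180/api/v13/cm/deployment" > path_to_file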
Where:
admin_uname is a username with either the Full Administrator or Cluster Administrator role.
admin_pass is the password for the admin_uname username.
cm_server_host is the hostname of the Cloudera Manager server.
path_to_file is the path to the file where you want to save the configuration.
For example:
export CMF_JAVA_OPTS="-Xmx2G -Dcom.cloudera.api.redaction=true"
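A representative way to restore a saved configuration is to upload the JSON file back to the same API resource; this sketch assumes the v13 API, the /cm/deployment resource, and its deleteCurrentDeployment option:
curl --upload-file path_to_file -u admin_uname:admin_pass "http://cm_server_host:7180/api/v13/cm/deployment?deleteCurrentDeployment=true"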
Where:
admin_uname is a username with either the Full Administrator or Cluster Administrator role.
admin_pass is the password for the admin_uname username.
cm_server_host is the hostname of the Cloudera Manager server.
path_to_file is the path to the JSON configuration file.
For example, you can use the API to retrieve logs from HDFS, HBase, or any other service, without knowing the log
locations. You can also stop any service with no additional steps.
Use scenarios for the Cloudera Manager API for cluster automation might include:
</repositories>
<dependencies>
<dependency>
<groupId>com.cloudera.api</groupId>
<artifactId>cloudera-manager-api</artifactId>
<version>4.6.2</version>
<!-- Set to the version of Cloudera Manager you use -->
</dependency>
</dependencies>
...
</project>
The Java client works like a proxy. It hides from the caller any details about REST, HTTP, and JSON. The entry point is
a handle to the root of the API:
RootResourceV13 apiRoot = new ClouderaManagerClientBuilder().withHost("cm.cloudera.com")
    .withUsernamePassword("admin", "admin").build().getRootV13();
From the root, you can traverse down to all other resources. (It's called "v13" because that is the current Cloudera
Manager API version, but the same builder will also return a root from an earlier version of the API.) The tree view
shows some key resources and supported operations:
RootResourceV13
  ClustersResourceV13 - host membership, start cluster
    ServicesResourceV13 - configuration, get metrics, HA, service commands
      RolesResource - add roles, get metrics, logs
      RoleConfigGroupsResource - configuration
    ParcelsResource - parcel management
  HostsResource - host management, get metrics
  UsersResource - user management
For more information, see the Javadoc.
The following example lists and starts a cluster:
// List of clusters
ApiClusterList clusters = apiRoot.getClustersResource().readClusters(DataView.SUMMARY);
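The listing above only retrieves the cluster list; a continuation that starts the first cluster might look like the following sketch, which assumes the startCommand call on the clusters resource and is illustrative rather than part of the original listing:
// Start the first cluster in the list. startCommand returns an asynchronous
// ApiCommand handle whose progress can be tracked through the commands resource.
ApiCluster cluster = clusters.getClusters().get(0);
ApiCommand startCmd = apiRoot.getClustersResource().startCommand(cluster.getName());
System.out.println("Issued start command " + startCmd.getId() + " for cluster " + cluster.getName());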
To see a full example of cluster deployment using the Java client, see whirr-cm. Go to CmServerImpl#configure to
see the relevant code.
Navigator auditing, metadata, lineage, policies, and analytics all support multi-cluster deployments that are managed
by a single Cloudera Manager instance. So if you have five clusters, all centrally managed by a single Cloudera Manager,
you'll see all this information within a single Navigator data management UI. In the metadata portion of the UI, Navigator
also tracks the specific cluster the data comes from with the Cluster technical metadata property.
Starting and Logging into the Cloudera Navigator Data Management UI
1. Do one of the following:
Enter the URL of the Navigator UI in a browser: http://Navigator_Metadata_Server_host:port/,
where Navigator_Metadata_Server_host is the name of the host on which you are running the Navigator
Metadata Server role and port is the port configured for the role. The default port of the Navigator Metadata
Server is 7187. To change the port, follow the instructions in Configuring the Navigator Metadata Server Port.
Do one of the following:
Select Clusters > Cloudera Management Service > Cloudera Navigator.
Navigate from the Navigator Metadata Server role:
1. Do one of the following:
Select Clusters > Cloudera Management Service > Cloudera Management Service.
On the Home > Status tab, in Cloudera Management Service table, click the Cloudera
Management Service link.
2. Click the Instances tab.
3. Click the Navigator Metadata Server role.
4. Click the Cloudera Navigator link.
2. Log into Cloudera Navigator UI using the credentials assigned by your administrator.
The API documentation displays in a new window. The API is structured into resource categories. Click a category to display the resource
endpoints.
To view an API tutorial, click the Tutorial link at the top of the API documentation or go to
Navigator_Metadata_Server_host:port/api-console/tutorial.html
High-performance transparent data encryption for files, databases, and applications running on Linux
Separation of cryptographic keys from encrypted data
Centralized management of cryptographic keys
Integration with hardware security modules (HSMs) from Thales and SafeNet
Support for Intel AES-NI cryptographic accelerator for enhanced performance in the encryption and decryption
process
Process-Based Access Controls
Cloudera Navigator encryption can be deployed to protect different assets, including (but not limited to):
Databases
Log files
Temporary files
Spill files
HDFS data
For planning and deployment purposes, this can be simplified to two types of data that Cloudera Navigator encryption
can secure:
1. HDFS data
2. Local filesystem data
The following table outlines some common use cases and identifies the services required.
Table 2: Encrypting Data at Rest
The table columns are Data Type, Data Location, Key Management, and Additional Services Required. The rows cover:

HDFS data: stored in HDFS.
Metadata databases, including the Hive Metastore, Cloudera Manager, Cloudera Navigator Data Management, and Sentry: stored on the local filesystem and secured with Navigator Encrypt.
Temp/spill files for CDH components with native encryption (Impala, YARN, MapReduce, Flume, HBase, and Accumulo): stored on the local filesystem.
Temp/spill files for Sqoop2 and HiveServer2: stored on the local filesystem and secured with Navigator Encrypt.
Log files: stored on the local filesystem and secured with Navigator Encrypt and Log Redaction.
For instructions on using Navigator Encrypt to secure local filesystem data, see Cloudera Navigator Encrypt.
Key Trustee clients include Navigator Encrypt and Key Trustee KMS. Encryption keys are created by the client and
stored in Key Trustee Server.
For more details on the individual components of Cloudera Navigator encryption, continue reading:
The most common Key Trustee Server clients are Navigator Encrypt and Key Trustee KMS.
When a Key Trustee client registers with Key Trustee Server, it generates a unique fingerprint. All client interactions
with the Key Trustee Server are authenticated with this fingerprint. You must ensure that the file containing this
fingerprint is secured with appropriate Linux file permissions. The file containing the fingerprint is
/etc/navencrypt/keytrustee/ztrustee.conf for Navigator Encrypt clients, and
/var/lib/kms-keytrustee/keytrustee/.keytrustee/keytrustee.conf for Key Trustee KMS.
Many clients can use the same Key Trustee Server to manage security objects. For example, you can have several
Navigator Encrypt clients using a Key Trustee Server, and also use the same Key Trustee Server as the backing store
for Key Trustee KMS (used in HDFS encryption).
1. A Key Trustee client (for example, Navigator Encrypt or Key Trustee KMS) sends an encrypted secret to Key Trustee
Server.
2. Key Trustee Server forwards the encrypted secret to Key HSM.
3. Key HSM generates a symmetric encryption key and sends it to the HSM over an encrypted channel.
4. The HSM generates a new key pair and encrypts the symmetric key and returns the encrypted symmetric key to
Key HSM.
For instructions on installing Navigator Key HSM, see Installing Cloudera Navigator Key HSM. For instructions on
configuring Navigator Key HSM, see Initializing Navigator Key HSM.
Databases
Temporary files (YARN containers, spill files, and so on)
Log files
Data directories
Configuration files
Navigator Encrypt uses dmcrypt for its underlying cryptographic operations. Navigator Encrypt uses several different
encryption keys:
Master Key: The master key can be a single passphrase, dual passphrase, or RSA key file. The master key is stored
in Key Trustee Server and cached locally. This key is used when registering with a Key Trustee Server and when
performing administrative functions on Navigator Encrypt clients.
Mount Encryption Key (MEK): This key is generated by Navigator Encrypt using openssl rand by default, but it
can alternatively use /dev/urandom. This key is generated when preparing a new mount point. Each mount point
has its own MEK. This key is uploaded to Key Trustee Server.
dmcrypt Device Encryption Key (DEK): This key is not managed by Navigator Encrypt or Key Trustee Server. It is
managed locally by dmcrypt and stored in the header of the device.
Process-Based Access Control List
The access control list (ACL) controls access to specified data. The ACL uses a process fingerprint, which is the SHA256
hash of the process binary, for authentication. You can create rules to allow a process to access specific files or
directories. The ACL file is encrypted with the client master key and stored locally for quick access and updates.
Here is an example rule:
"ALLOW @mydata * /usr/bin/myapp"
This rule allows the /usr/bin/myapp process to access any encrypted path (*) that was encrypted under the category
@mydata.
Navigator Encrypt uses a kernel module that intercepts any input/output (I/O) sent to an encrypted and managed path.
The Linux module filename is navencryptfs.ko and it resides in the kernel stack, injecting filesystem hooks. It also
authenticates and authorizes processes and caches authentication results for increased performance.
Because the kernel module intercepts and does not modify I/O, it supports any filesystem (ext3, ext4, xfs, and so
on).
When /usr/bin/myapp sends an open() call, it is intercepted by navencrypt-kernel-module as an open hook.
The kernel module calculates the process fingerprint. If the authentication cache already has the fingerprint, the process
is allowed to access the data. If the fingerprint is not in the cache, the fingerprint is checked against the ACL. If the ACL
grants access, the fingerprint is added to the authentication cache, and the process is permitted to access the data.
When you add an ACL rule, you are prompted for the master key. If the rule is accepted, the ACL rules file is updated
as well as the navencrypt-kernel-module ACL cache.
The following example illustrates different aspects of Navigator Encrypt:
The user adds a rule to allow /usr/bin/myapp to access the encrypted data in the category @mylogs, and adds
another rule to allow /usr/bin/myapp to access encrypted data in the category @mydata. These two rules are loaded
into the navencrypt-kernel-module cache after restarting the kernel module.
The /mydata directory is encrypted under the @mydata category and /mylogs is encrypted under the @mylogs
category using dmcrypt (block device encryption).
When myapp tries to issue I/O to an encrypted directory, the kernel module calculates the fingerprint of the process
(/usr/bin/myapp) and compares it with the list of authorized fingerprints in the cache.
The master key is encrypted with a local GPG key. Before being stored in the Key Trustee Server database, it is encrypted
again with the Key Trustee Server GPG key. When the master key is needed to perform a Navigator Encrypt operation,
Key Trustee Server decrypts the stored key with its server GPG key and sends it back to the client (in this case, Navigator
Encrypt), which decrypts the deposit with the local GPG key.
All communication occurs over TLS-encrypted connections.
Cloudera Express and Cloudera Enterprise are compared across the following categories and features:

Cluster Management
Number of hosts supported: unlimited in both Cloudera Express and Cloudera Enterprise
Host inspector for determining CDH readiness
Multi-cluster management
Centralized view of all running commands
Resource management
Global time control for historical diagnosis
Cluster-wide configuration
Cluster-wide event management
Cluster-wide log search
Aggregate UI

Deployment
Support for CDH 4 and CDH 5
Automated deployment and readiness checks
Installation from local repositories
Rolling upgrade of CDH

Service and Configuration Management
Manage Accumulo, Flume, HBase, HDFS, Hive, Hue, Impala, Isilon, Kafka, Kudu, MapReduce, Oozie, Sentry, Solr, Spark, Sqoop, YARN, and ZooKeeper services
Manage Key Trustee and Cloudera Navigator
Manage add-on services
Rolling restart of services
High availability (HA) support: CDH 4 - HDFS and MapReduce JobTracker (CDH 4.2)

Alert by email
Alert by SNMP
User-defined triggers
Custom alert publish scripts

Advanced Management Features
Automated backup and disaster recovery
File browsing, searching, and disk quota management
HBase, MapReduce, Impala, and YARN usage reports
Support integration
Operational reports

Cloudera Navigator Data Management
Metadata management and augmentation
Ingest policies
Analytics
Auditing
Lineage
General Questions
What are the new features of Cloudera Manager 5?
For a list of new features in Cloudera Manager 5, see New Features and Changes in Cloudera Manager 5.
What operating systems are supported?
See Supported Operating Systems for more detailed information on which operating systems are supported.
What databases are supported?
See Supported Databases for more detailed information on which database systems are supported.
What version of CDH is supported for Cloudera Manager 5?
See Supported CDH and Managed Service Versions for detailed information.
Trying Impala
How do I try Impala out?
To look at the core features and functionality of Impala, the easiest way to try out Impala is to download the Cloudera
QuickStart VM and start the Impala service through Cloudera Manager, then use impala-shell in a terminal window
or the Impala Query UI in the Hue web interface.
New features
Known and fixed issues
Incompatible changes
Installing Impala
Upgrading Impala
Configuring Impala
Starting Impala
Security for Impala
CDH Version and Packaging Information
Information about the latest CDH 4-compatible Impala release remains at the Impala for CDH 4 Documentation page.
Where can I get more information about Impala?
More product information is available here:
O'Reilly introductory e-book: Cloudera Impala: Bringing the SQL and Hadoop Worlds Together
O'Reilly getting started guide for developers: Getting Started with Impala: Interactive SQL for Apache Hadoop
Blog: Cloudera Impala: Real-Time Queries in Apache Hadoop, For Real
Webinar: Introduction to Impala
Product website page: Cloudera Enterprise RTQ
To see the latest release announcements for Impala, see the Cloudera Announcements forum.
How can I ask questions and provide feedback about Impala?
Join the Impala discussion forum and the Impala mailing list to ask questions and provide feedback.
Use the Impala Jira project to log bug reports and requests for features.
Where can I get sample data to try?
You can get scripts that produce data files and set up an environment for TPC-DS style benchmark tests from this Github
repository. In addition to being useful for experimenting with performance, the tables are suited to experimenting
with many aspects of SQL on Impala: they contain a good mixture of data types, data distributions, partitioning, and
relational data suitable for join queries.
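For example, suppose you execute a query that retrieves the top 1000 rows, such as this illustrative statement:
select * from giant_table order by some_column limit 1000;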
If your cluster has 50 nodes, then each of those 50 nodes will transmit a maximum of 1000 rows back to the
coordinator node. The coordinator node needs enough memory to sort (LIMIT * cluster_size) rows, although in
the end the final result set is at most LIMIT rows, 1000 in this case.
Likewise, if you execute the query:
select * from giant_table where test_val > 100 order by some_column;
then each node filters out a set of rows matching the WHERE conditions, sorts the results (with no size limit), and
sends the sorted intermediate rows back to the coordinator node. The coordinator node might need substantial
memory to sort the final result set, and so might use a temporary disk work area for that final phase of the query.
Whether the query contains any join clauses, GROUP BY clauses, analytic functions, or DISTINCT operators. These
operations all require some in-memory work areas that vary depending on the volume and distribution of data.
In Impala 2.0 and later, these kinds of operations utilize temporary disk work areas if memory usage grows too
large to handle. See SQL Operations that Spill to Disk for details.
The size of the result set. When intermediate results are being passed around between nodes, the amount of data
depends on the number of columns returned by the query. For example, it is more memory-efficient to query
only the columns that are actually needed in the result set rather than always issuing SELECT *.
How do I?
How do I prevent users from seeing the text of SQL queries?
For instructions on making the Impala log files unreadable by unprivileged users, see Securing Impala Data and Log
Files.
For instructions on password-protecting the web interface to the Impala log files and other internal server information,
see Securing the Impala Web User Interface.
In Impala 2.2 / CDH 5.4 and higher, you can use the log redaction feature to obfuscate sensitive information in Impala
log files. See Sensitive Data Redaction for details.
How do I know how many Impala nodes are in my cluster?
The Impala statestore keeps track of how many impalad nodes are currently available. You can see this information
through the statestore web interface. For example, at the URL http://statestore_host:25010/metrics you
might see lines like the following:
statestore.live-backends:3
statestore.live-backends.list:[host1:22000, host1:26000, host2:22000]
The number of impalad nodes is the number of list items referring to port 22000, in this case two. (Typically, this
number is one less than the number reported by the statestore.live-backends line.) If an impalad node became
unavailable or came back after an outage, the information reported on this page would change appropriately.
Impala Performance
Are results returned as they become available, or all at once when a query completes?
Impala streams results whenever they are available, when possible. Certain SQL operations (aggregation or ORDER
BY) require all of the input to be ready before Impala can return results.
Why does my query run slowly?
There are many possible reasons why a given query could be slow. Use the following checklist to diagnose performance
issues with existing queries, and to avoid such issues when writing new queries, setting up new nodes, creating new
tables, or loading data.
If BytesReadLocal is lower than BytesRead, something in your cluster is misconfigured, such as the impalad
daemon not running on all the data nodes. If BytesReadShortCircuit is lower than BytesRead, short-circuit
reads are not enabled properly on that node; see Post-Installation Configuration for Impala for instructions.
If the table was just created, or this is the first query that accessed the table after an INVALIDATE METADATA
statement or after the impalad daemon was restarted, there might be a one-time delay while the metadata for
the table is loaded and cached. Check whether the slowdown disappears when the query is run again. When doing
performance comparisons, consider issuing a DESCRIBE table_name statement for each table first, to make
sure any timings only measure the actual query time and not the one-time wait to load the table metadata.
Is the table data in uncompressed text format? Check by issuing a DESCRIBE FORMATTED table_name statement.
A text table is indicated by the line:
InputFormat: org.apache.hadoop.mapred.TextInputFormat
Although uncompressed text is the default format for a CREATE TABLE statement with no STORED AS clauses,
it is also the bulkiest format for disk storage and consequently usually the slowest format for queries. For data
where query performance is crucial, particularly for tables that are frequently queried, consider starting with or
converting to a compact binary file format such as Parquet, Avro, RCFile, or SequenceFile. For details, see How
Impala Works with Hadoop File Formats.
If your table has many columns, but the query refers to only a few columns, consider using the Parquet file format.
Its data files are organized with a column-oriented layout that lets queries minimize the amount of I/O needed
to retrieve, filter, and aggregate the values for specific columns. See Using the Parquet File Format with Impala
Tables for details.
If your query involves any joins, are the tables in the query ordered so that the tables or subqueries are ordered
with the one returning the largest number of rows on the left, followed by the smallest (most selective), the second
smallest, and so on? That ordering allows Impala to optimize the way work is distributed among the nodes and
how intermediate results are routed from one node to another. For example, all other things being equal, the
following join order results in an efficient query:
select some_col from
huge_table join big_table join small_table join medium_table
where
huge_table.id = big_table.id
and big_table.id = medium_table.id
and medium_table.id = small_table.id;
See Performance Considerations for Join Queries for performance tips for join queries.
Also for join queries, do you have table statistics for the table, and column statistics for the columns used in the
join clauses? Column statistics let Impala better choose how to distribute the work for the various pieces of a join
query. See Table and Column Statistics for details about gathering statistics.
Does your table consist of many small data files? Impala works most efficiently with data files in the multi-megabyte
range; Parquet, a format optimized for data warehouse-style queries, uses large files (originally 1 GB, now 256
MB in Impala 2.0 and higher) with a block size matching the file size. Use the DESCRIBE FORMATTED table_name
statement in impala-shell to see where the data for a table is located, and use the hadoop fs -ls or hdfs
dfs -ls Unix commands to see the files and their sizes. If you have thousands of small data files, that is a signal
that you should consolidate into a smaller number of large files. Use an INSERT ... SELECT statement to copy
the data to a new table, reorganizing into new data files as part of the process. Prefer to construct large data files
and import them in bulk through the LOAD DATA or CREATE EXTERNAL TABLE statements, rather than issuing many INSERT ... VALUES statements, each of which creates a separate tiny data file (a sketch of the consolidation follows).
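A minimal sketch of that consolidation, using hypothetical table names:
-- Create a new Parquet table and copy everything into it in one bulk operation,
-- producing a small number of large data files instead of thousands of tiny ones.
create table sales_compact stored as parquet as select * from sales_raw;
-- Or, if the compact table already exists, rewrite it with a bulk INSERT ... SELECT:
insert overwrite table sales_compact select * from sales_raw;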
Impala Availability
Is Impala production ready?
Impala has finished its beta release cycle, and the 1.0, 1.1, and 1.2 GA releases are production ready. The 1.1.x series
includes additional security features for authorization, an important requirement for production use in many
organizations. The 1.2.x series includes important performance features, particularly for large join queries. Some
Cloudera customers are already using Impala for large workloads.
The Impala 1.3.0 and higher releases are bundled with corresponding levels of CDH 5. The number of new features
grows with each release. See What's New in Apache Impala (incubating) for a full list.
How do I configure Hadoop high availability (HA) for Impala?
You can set up a proxy server to relay requests back and forth to the Impala servers, for load balancing and high
availability. See Using Impala through a Proxy for High Availability for details.
You can enable HDFS HA for the Hive metastore. See the CDH5 High Availability Guide or the CDH4 High Availability
Guide for details.
What happens if there is an error in Impala?
There is not a single point of failure in Impala. All Impala daemons are fully able to handle incoming queries. If a machine
fails, however, all queries with fragments running on that machine will fail. Because queries are expected to return
quickly, you can just rerun the query if there is a failure. See Impala Concepts and Architecture for details about the
Impala architecture.
The longer answer: Impala must be able to connect to the Hive metastore. Impala aggressively caches metadata so
the metastore host should have minimal load. Impala relies on the HDFS NameNode, and, in CDH4, you can configure
HA for HDFS. Impala also has centralized services, known as the statestore and catalog services, that run on one host
only. Impala continues to execute queries if the statestore host is down, but it will not get state updates. For example,
if a host is added to the cluster while the statestore host is down, the existing instances of impalad running on the
other hosts will not find out about this new host. Once the statestore process is restarted, all the information it serves
is automatically reconstructed from all running Impala daemons.
What is the maximum number of rows in a table?
There is no defined maximum. Some customers have used Impala to query a table with over a trillion rows.
Impala Internals
On which hosts does Impala run?
Cloudera strongly recommends running the impalad daemon on each DataNode for good performance. Although this
topology is not a hard requirement, if there are data blocks with no Impala daemons running on any of the hosts
containing replicas of those blocks, queries involving that data could be very inefficient. In that case, the data must be
transmitted from one host to another for processing by remote reads, a condition Impala normally tries to avoid.
See Impala Concepts and Architecture for details about the Impala architecture. Impala schedules query fragments on
all hosts holding data relevant to the query, if possible.
In cases where some hosts in the cluster have much greater CPU and memory capacity than others, or where some
hosts have extra CPU capacity because some CPU-intensive phases are single-threaded, some users have run multiple
impalad daemons on a single host to take advantage of the extra CPU capacity. This configuration is only practical
for specific workloads that rely heavily on aggregation, and the physical hosts must have sufficient memory to
accommodate the requirements for multiple impalad instances.
How are joins performed in Impala?
By default, Impala automatically determines the most efficient order in which to join tables using a cost-based method,
based on their overall size and number of rows. (This is a new feature in Impala 1.2.2 and higher.) The COMPUTE STATS
statement gathers information about each table that is crucial for efficient join performance. Impala chooses between
two techniques for join queries, known as broadcast joins and partitioned joins. See Joins in Impala SELECT
Statements for syntax details and Performance Considerations for Join Queries for performance considerations.
How does Impala process join queries for large tables?
Impala utilizes multiple strategies to allow joins between tables and result sets of various sizes. When joining a large
table with a small one, the data from the small table is transmitted to each node for intermediate processing. When
joining two large tables, the data from one of the tables is divided into pieces, and each node processes only selected
pieces. See Joins in Impala SELECT Statements for details about join processing, Performance Considerations for Join
Queries for performance considerations, and Query Hints in Impala SELECT Statements for how to fine-tune the join
strategy.
What is Impala's aggregation strategy?
Impala currently only supports in-memory hash aggregation. In Impala 2.0 and higher, if the memory requirements
for a join or aggregation operation exceed the memory limit for a particular host, Impala uses a temporary work area
on disk to help the query complete successfully.
How is Impala metadata managed?
Impala uses two pieces of metadata: the catalog information from the Hive metastore and the file metadata from the
NameNode. Currently, this metadata is lazily populated and cached when an impalad needs it to plan a query.
The REFRESH statement updates the metadata for a particular table after loading new data through Hive. The INVALIDATE
METADATA statement refreshes all metadata, so that Impala recognizes new tables or other DDL and DML
changes performed through Hive.
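For example (the table names here are illustrative):
refresh sales_data;              -- pick up new data files added through Hive or HDFS to an existing table
invalidate metadata new_lookup;  -- make a table newly created through Hive visible to Impala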
In Impala 1.2 and higher, a dedicated catalogd daemon broadcasts metadata changes due to Impala DDL or DML
statements to all nodes, reducing or eliminating the need to use the REFRESH and INVALIDATE METADATA statements.
SQL
Is there an UPDATE statement?
Impala does not currently have an UPDATE statement, which would typically be used to change a single row, a small
group of rows, or a specific column. The HDFS-based files used by typical Impala queries are optimized for bulk operations
across many megabytes of data at a time, making traditional UPDATE operations inefficient or impractical.
You can use the following techniques to achieve the same goals as the familiar UPDATE statement, in a way that
preserves efficient file layouts for subsequent queries:
Replace the entire contents of a table or partition with updated data that you have already staged in a different
location, either using INSERT OVERWRITE, LOAD DATA, or manual HDFS file operations followed by a REFRESH
statement for the table. Optionally, you can use built-in functions and expressions in the INSERT statement to
transform the copied data in the same way you would normally do in an UPDATE statement, for example to turn
a mixed-case string into all uppercase or all lowercase (a sketch of this technique follows the list).
To update a single row, use an HBase table, and issue an INSERT ... VALUES statement using the same key as
the original row. Because HBase handles duplicate keys by only returning the latest row with a particular key
value, the newly inserted row effectively hides the previous one.
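A minimal sketch of the first technique, rewriting a partition while transforming a column; the table, partition, and column names are hypothetical:
-- Replace one partition with staged data, uppercasing a column along the way.
insert overwrite table customers partition (region = 'us')
  select id, upper(name), signup_date
  from customers_staging
  where region = 'us';
When the statement finishes, the rewritten data files replace the old ones for that partition, achieving the same end result as an UPDATE over those rows.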
Can Impala do user-defined functions (UDFs)?
Impala 1.2 and higher does support UDFs and UDAs. You can either write native Impala UDFs and UDAs in C++, or reuse
UDFs (but not UDAs) originally written in Java for use with Hive. See Impala User-Defined Functions (UDFs) for details.
Why do I have to use REFRESH and INVALIDATE METADATA, what do they do?
In Impala 1.2 and higher, there is much less need to use the REFRESH and INVALIDATE METADATA statements:
The new impala-catalog service, represented by the catalogd daemon, broadcasts the results of Impala DDL
statements to all Impala nodes. Thus, if you do a CREATE TABLE statement in Impala while connected to one
node, you do not need to do INVALIDATE METADATA before issuing queries through a different node.
The catalog service only recognizes changes made through Impala, so you must still issue a REFRESH statement
if you load data through Hive or by manipulating files in HDFS, and you must issue an INVALIDATE METADATA
statement if you create a table, alter a table, add or drop partitions, or do other DDL statements in Hive.
Because the catalog service broadcasts the results of REFRESH and INVALIDATE METADATA statements to all
nodes, in the cases where you do still need to issue those statements, you can do that on a single node rather
than on every node, and the changes will be automatically recognized across the cluster, making it more convenient
to load balance by issuing queries through arbitrary Impala nodes rather than always using the same coordinator
node.
Partitioned Tables
How do I load a big CSV file into a partitioned table?
To load a data file into a partitioned table, when the data file includes fields like year, month, and so on that correspond
to the partition key columns, use a two-stage process. First, use the LOAD DATA or CREATE EXTERNAL TABLE
statement to bring the data into an unpartitioned text table. Then use an INSERT ... SELECT statement to copy
the data from the unpartitioned table to a partitioned one. Include a PARTITION clause in the INSERT statement to
specify the partition key columns. The INSERT operation splits up the data into separate data files for each partition.
For examples, see Partitioning for Impala Tables. For details about loading data into partitioned Parquet tables, a
popular choice for high-volume data, see Loading Data into Parquet Tables.
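A sketch of the two-stage approach, with hypothetical table, column, and path names:
-- Stage 1: land the CSV data in an unpartitioned text table.
create table sales_staging (id int, amount double, year int, month int)
  row format delimited fields terminated by ',';
load data inpath '/user/etl/sales.csv' into table sales_staging;
-- Stage 2: copy into the partitioned table; the partition key columns go last in the select list.
create table sales (id int, amount double)
  partitioned by (year int, month int)
  stored as parquet;
insert into sales partition (year, month)
  select id, amount, year, month from sales_staging;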
Can I do INSERT ... SELECT * into a partitioned table?
When you use the INSERT ... SELECT * syntax to copy data into a partitioned table, the columns corresponding
to the partition key columns must appear last in the columns returned by the SELECT *. You can create the table with
the partition key columns defined last. Or, you can use the CREATE VIEW statement to create a view that reorders
the columns: put the partition key columns last, then do the INSERT ... SELECT * from the view.
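For example (hypothetical names), if the staging table stores the partition key columns first, a view can put them last so that INSERT ... SELECT * works:
create view sales_staging_reordered as
  select id, amount, year, month from sales_staging_raw;
insert into sales partition (year, month)
  select * from sales_staging_reordered;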
General
The following are general questions about Cloudera Search and the answers to those questions.
What is Cloudera Search?
Cloudera Search is Apache Solr integrated with CDH, including Apache Lucene, Apache SolrCloud, Apache Flume,
Apache Tika, and Apache Hadoop MapReduce and HDFS. Cloudera Search also includes valuable integrations that make
searching more scalable, easy to use, and optimized for both near-real-time and batch-oriented indexing. These
integrations include Cloudera Morphlines, a customizable transformation chain that simplifies loading any type of data
into Cloudera Search.
What is the difference between Lucene and Solr?
Lucene is a low-level search library that is accessed by a Java API. Solr is a search server that runs in a servlet container
and provides structure and convenience around the underlying Lucene library.
What is Apache Tika?
The Apache Tika toolkit detects and extracts metadata and structured text content from various documents using
existing parser libraries. Using the solrCell morphline command, the output from Apache Tika can be mapped to a
Solr schema and indexed.
How does Cloudera Search relate to web search?
Traditional web search engines crawl web pages on the Internet for content to index. Cloudera Search indexes files
and data that are stored in HDFS and HBase. To make web data available through Cloudera Search, it needs to be
downloaded and stored in Cloudera Enterprise.
Why do I get the error "no field name specified in query and no default specified via 'df' param" when I query a Schemaless
collection?
Schemaless collections initially have no default or df setting. As a result, simple searches that might succeed on
non-Schemaless collections may fail on Schemaless collections.
When a user submits a search, it must be clear which field Cloudera Search should query. A default field, or df, is often
specified in solrconfig.xml, and when this is the case, users can submit queries that do not specify fields. In such
situations, Solr uses the df value.
When a new collection is created in Schemaless mode, there are initially no fields defined, so no field can be chosen
as the df field. As a result, when query request handlers do not specify a df, errors can result. This issue can be
addressed in several ways:
Queries can specify any valid field name on which to search. In such a case, no df is required.
Queries can specify a default field using the df parameter. In such a case, the df is specified in the query (see the example below).
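For example, with a hypothetical collection and field name, the same search succeeds either way:
http://search_host:8983/solr/myCollection/select?q=title:ipod
http://search_host:8983/solr/myCollection/select?q=ipod&df=title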
Schema Management
The following are questions about schema management in Cloudera Search and the answers to those questions.
Supportability
The following are questions about supportability in Cloudera Search and the answers to those questions.
Does Cloudera Search support multiple languages?
Cloudera Search supports approximately 30 languages, including most Western European languages, as well as Chinese,
Japanese, and Korean.
Which file formats does Cloudera Search support for indexing? Does it support searching images?
Cloudera Search uses the Apache Tika library for indexing many standard document formats. In addition, Cloudera
Search supports indexing and searching Avro files and a wide variety of other file types such as log files, Hadoop
Sequence Files, and CSV files. You can add support for indexing custom file formats using a morphline command plug-in.
Getting Support
This section describes how to get support.
Cloudera Support
Cloudera can help you install, configure, optimize, tune, and run CDH for large scale data processing and analysis.
Cloudera supports CDH whether you run it on servers in your own datacenter, or on hosted infrastructure services
such as Amazon EC2, Rackspace, SoftLayer, and VMware vCloud.
If you are a Cloudera customer, you can:
Register for an account to create a support ticket at the support site.
Visit the Cloudera Knowledge Base.
If you are not a Cloudera customer, learn how Cloudera can help you.
Community Support
There are several vehicles for community support. You can:
Register for the Cloudera forums.
If you have any questions or comments about CDH, you can visit the Using the Platform forum.
If you have any questions or comments about Cloudera Manager, you can:
Visit the Cloudera Manager forum.
Cloudera Express users can access the Cloudera Manager support mailing list from within the Cloudera
Manager Admin Console by selecting Support > Mailing List.
Cloudera Enterprise customers can access the Cloudera Support Portal from within the Cloudera Manager
Admin Console, by selecting Support > Cloudera Support Portal. From there you can register for a support
account, create a support ticket, and access the Cloudera Knowledge Base.
If you have any questions or comments about Cloudera Navigator, you can visit the Cloudera Navigator forum.
Find more documentation for specific components by referring to External Documentation on page 34.
Report Issues
Your input is appreciated, but before filing a request:
Search the Cloudera issue tracker, where Cloudera tracks software and documentation bugs and enhancement
requests for CDH.
Search the CDH Manual Installation, Using the Platform, and Cloudera Manager forums.