SAP HANA Vora Installation Developer Guide
1 Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1 SAP HANA Vora and Apache Hadoop. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Related Information. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Installation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1 SAP HANA Vora Components. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 SAP HANA Vora Packages. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 Cluster Node Overview. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.4 Installation Prerequisites. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.5 Collect Hadoop Cluster Information. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.6 Install the SAP HANA Vora Engine. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Install the SAP HANA Vora Engine Using Ambari. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Install the SAP HANA Vora Engine Using Cloudera. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.7 Install the SAP HANA Vora Spark Extension Library. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.8 Install the SAP HANA Vora Zeppelin Interpreter. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.9 Connect SAP HANA Spark Controller to SAP HANA Vora. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.10 Connect SAP Lumira to SAP HANA Vora. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .26
2.11 Update SAP HANA Vora. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
Update the SAP HANA Vora Engine Using Ambari. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
Update the SAP HANA Vora Engine Using Cloudera. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
Update the SAP HANA Vora Spark Extension Library. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.12 SAP HANA Vora Default Ports. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3 Development. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.1 Getting Started. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2 Using the SAP HANA Vora Data Source. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
Querying Data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
Code Examples. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
Loading Data from Amazon S3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
Preventing Data Type Overflow. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .44
SAP HANA Vora Data Source API. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.3 Working with Hierarchies. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Representing Hierarchies as Adjacency Lists. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Creating Hierarchies. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .51
Joining Hierarchies with Other Tables. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
Using Hierarchies with Views. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Hierarchy UDFs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4 Administration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.1 Configure Proxy Settings. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.2 Start and Stop the SAP HANA Vora Service. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.3 Start the SAP HANA Vora Spark Thrift Server. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.4 SAP HANA Vora Service: Configuration Settings. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.5 Best Practices: Administration and Operations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
HDFS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Choosing a Cluster Manager. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Example Cluster Configuration Including a Client Machine (Jump Box). . . . . . . . . . . . . . . . . . . . 72
5 Security. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.1 Technical System Landscape. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .75
5.2 Other Security-relevant Information. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
SAP HANA Vora provides an in-memory processing engine that is integrated into the Hadoop ecosystem and
Spark execution framework. Able to scale to thousands of nodes, it is designed for use in large distributed
clusters and for handling big data.
The SAP HANA Vora processing engine holds data in memory and boosts the execution performance of Spark.
Supporting just-in-time code compilation, it translates incoming SQL queries into machine-level code on the
fly using an LLVM compiler, enabling them to be executed quickly and efficiently.
Data Analytics
SAP HANA Vora makes available OLAP-style capabilities for data on Hadoop, in particular, a hierarchy
implementation that allows hierarchical data structures to be defined and complex computations performed
on different levels of data. Extensions to Spark SQL also include enhancements to the data source API to
enable Spark SQL queries or parts of the queries to be pushed down to the SAP HANA Vora processing engine.
Data processing between the SAP HANA and Hadoop environments allows data in SAP HANA to be combined
with big data stored in Hadoop systems and processed in Spark or SAP HANA applications.
The SAP HANA Vora solution is built on the Hadoop ecosystem, an open-source project providing a collection
of components that support distributed processing of large data sets across a cluster of machines.
The main components used in this environment are described below:
● Hadoop: The core open-source framework for distributed storage and processing. See Apache Hadoop.
● YARN: Hadoop's resource manager and job scheduler. See Apache Hadoop YARN.
● Spark SQL: A module for structured and semi-structured data processing. See Spark SQL and DataFrame Guide.
● MLlib: A machine learning library that runs on Spark. See Machine Learning Library (MLlib) Guide.
To install SAP HANA Vora, first familiarize yourself with the components it contains and the installation
packages you require. Review the installation prerequisites to ensure a properly configured cluster and then
download and install the SAP HANA Vora packages.
● Understand what components make up the SAP HANA Vora system: SAP HANA Vora Components [page 8]
● Find out what packages are required to install SAP HANA Vora and where they are available: SAP HANA Vora Packages [page 9]
● Check the overview of the different node types and see which components are typically deployed where: Cluster Node Overview [page 9]
● Ensure your Hadoop cluster is correctly set up and meets the installation requirements for SAP HANA Vora: Installation Prerequisites [page 10]
● Collect and document essential information about your Hadoop cluster: Collect Hadoop Cluster Information [page 14]
● Download and install the package containing the SAP HANA Vora engine: Install the SAP HANA Vora Engine [page 14]
● Download and install the package containing the SAP HANA Vora Spark extension library: Install the SAP HANA Vora Spark Extension Library [page 18]
● Optionally enable the Zeppelin interpreter if you want to use Zeppelin (an interactive data analytics tool): Install the SAP HANA Vora Zeppelin Interpreter [page 20]
● Set up the Spark Controller if you want to query tables accessible through Spark from SAP HANA: Connect SAP HANA Spark Controller to SAP HANA Vora [page 24]
● Connect SAP Lumira if you want to visualize SAP HANA Vora data in SAP Lumira: Connect SAP Lumira to SAP HANA Vora [page 26]
● Update your SAP HANA Vora installation with the latest versions of the installation packages: Update SAP HANA Vora [page 30]
Related Information
The SAP HANA Vora system consists of two main components: the SAP HANA Vora engine, which needs to be
installed on all compute nodes in the cluster, and the SAP HANA Vora Spark extension library, which provides
access to the SAP HANA Vora engine and its functional features.
The SAP HANA Vora SQL engine is a service that you add to your existing Hadoop installation. SAP HANA Vora
instances hold data in memory and boost the performance of out-of-the box Spark. To increase execution
performance on the node level, you add an SAP HANA Vora instance to each compute node so that it contains
the following:
● A Spark worker
● An SAP HANA Vora engine
The integration of the SAP HANA Vora engine with Spark is shown in the overview below:
The SAP HANA Vora extension library allows SAP HANA Vora to be accessed through Spark. It also makes
available additional functionality, such as a hierarchy implementation, which allows you to build hierarchies
and run hierarchical queries.
To use the extension library, you need to install the extension package on the cluster on the nodes on which
Spark is installed.
To install the SAP HANA Vora system, you require two packages, one which contains the SAP HANA Vora
engine and the other the SAP HANA Vora Spark extension library. Both packages are available for download
from the SAP Software Download Center .
● VORA_AM<version>.TGZ and VORA_CL<version>.TGZ: The SAP HANA Vora engine for Ambari and Cloudera. These packages allow the SAP HANA Vora engine to be deployed on the compute nodes using the respective provisioning tool. The packages can be downloaded from the SAP Software Download Center at https://support.sap.com/swdc.
● VORA_SE<version>.TGZ: The SAP HANA Vora Spark extension library. This library allows the SAP HANA Vora engine and its functional features to be accessed using Spark. The package contains the JAR with all dependencies and a number of shell scripts to use the SAP HANA Vora extension through Spark. The package can be downloaded from the SAP Software Download Center at https://support.sap.com/swdc.
You need to choose appropriate nodes when you deploy the SAP HANA Vora packages on the cluster. An
overview of the different node types is given below.
Node Types
For the purposes of setting up a cluster, four different types of cluster nodes are distinguished:
● Management node: Contains the cluster provisioning tool, for example, Ambari or Cloudera.
● Master nodes: Contain central cluster components, such as the NameNode or ZooKeeper servers.
● Worker nodes: These are the compute nodes of the cluster. They contain components such as DataNodes or NodeManagers.
● Jump boxes: Contain only client components, such as the HDFS client, and serve as an entry point for users to start compute jobs using Spark.
If you use Yarn’s Resource Manager as the cluster manager, you should install and deploy the SAP HANA Vora
components in the following way:
A Hadoop cluster is a prerequisite for installing SAP HANA Vora. Review the installation requirements to
ensure that the cluster you use is correctly set up.
Note
Only certain combinations of operating system, cluster provisioning tool, and Hadoop distribution are
supported. These are listed under Supported Platforms.
Hadoop Distributions
SAP HANA Vora can only be used with selected Hadoop distributions:
The cluster must be managed by one of the following cluster provisioning tools:
Operating Systems
● SUSE Linux Enterprise Server (SLES) 11 SP3 (see compatibility pack details below)
● Red Hat Enterprise Linux (RHEL) 6.6 (see compatibility pack details below) and 7.1
SLES 11 SP3 You need to install the RPM packages libgcc_s1 and libstdc++6.
Ensure that the versions are not earlier than the following (earlier versions cause problems
during runtime due to improper exception handling):
● libgcc_s1-4.7.2_20130108-0.17.2
● libstdc++6-4.7.2_20130108-0.17.2
Install the RPM packages as follows, if they are not already installed by default:
RHEL 6.6 To run SAP HANA Vora on RHEL 6.6, an additional runtime environment for GCC 4.7 is required,
which you can add by installing the RPM package compat-sap-c++ (see also SAP Note 2001528).
To be able to access the library, you need a subscription for "Red Hat Enterprise Linux Server
for SAP HANA". This allows you to subscribe your server to the "RHEL Server SAP HANA"
channel on the Red Hat Customer Portal or your local Satellite server. After you have subscribed
your server to the channel, the output of yum repolist should contain the following:
You can then install the GCC 4.7 libstdc++ library with the following command:
For an up-to-date list of supported operating systems, see SAP Note 2203837 .
Supported Platforms
The following combinations of operating system, cluster provisioning tool, and Hadoop distribution are
supported:
To enable efficient cluster computation using the SAP HANA Vora extension, the cluster nodes should have at
least the following:
● 4 cores
● 8 GB of RAM
● 20 GB of free disk space for HDFS data
Required Components
Zeppelin v0.5.0 or v0.5.6 Optional – allows you to use the Zeppelin integration. Note that Zeppelin is still in
the incubation phase: https://zeppelin.incubator.apache.org/
Validation
To ensure that the components have been correctly installed, run a sample Spark application on the cluster,
such as SparkPi, which calculates the approximate value of Pi.
Sample Code
Pi is roughly 3.140292
Before proceeding with the installation, collect and document the following information about your Hadoop
cluster. You will need to have this information at hand during the installation.
Procedure
The SAP HANA Vora engine is contained in the VORA_AM<version>.TGZ and VORA_CL<version>.TGZ
packages, which are provided specifically for the Ambari and Cloudera provisioning tools so that they can be
used to install the SAP HANA Vora engine instances on the cluster.
Note
If your Hadoop cluster requires an HTTP(S) proxy to access content through the HTTP(S) protocol, make
sure that the proxy is configured before starting SAP HANA Vora. For more information, see Configure
Proxy Settings [page 64].
Procedure
● Install the SAP HANA Vora Engine Using Ambari [page 15]
Use the Ambari provisioning tool to install the SAP HANA Vora engine on your cluster.
Procedure
$ ambari-server restart
Depending on your cluster configuration, you may need to be the root user or a user with administrator
rights to do so.
6. Wait until the Ambari Administration Interface is up and running.
Ambari is now able to provision the SAP HANA Vora engine as a service on the Hadoop cluster.
Note
We recommend that you add the SAP HANA Vora service to each node that acts as a Spark worker
node.
You can now use Ambari to control the SAP HANA Vora instances in the cluster. An example of how this
looks is shown below:
Note
You can confirm that the SAP HANA Vora engine has been successfully deployed on the cluster nodes
by verifying that the v2server process is running on them.
Use the Cloudera provisioning tool to install the SAP HANA Vora engine on your cluster.
Procedure
Note
We recommend that you add the SAP HANA Vora service to each node that acts as a Spark worker
node.
18. Wait until the SAP HANA Vora service is up and running.
19. Choose Continue and then Finish.
20. Customize the service.
On the Home screen, click the SAP HANA Vora service and then choose the Configuration tab.
Modify the SAP HANA Vora service configuration, if needed. This includes, in particular, the following:
○ User and group under which the SAP HANA Vora engine runs:
○ Default user: vora
○ Default group: vora
○ File system location of the SAP HANA Vora engine logs:
Default directory: /var/log/vora
○ Level of logging information:
Default: INFO
Note
We recommend that you use the default values.
Results
You can now use the Cloudera Manager to control the SAP HANA Vora instances in the cluster.
To use the SAP HANA Vora engine in Spark, you need to install the SAP HANA Vora Spark extension library.
Prerequisites
● You have already successfully deployed the SAP HANA Vora SQL engine to the compute nodes of the
cluster and the instances are running.
● You have already installed Spark.
Procedure
1. SSH to the jump box as the user who runs the Spark jobs and create a vora directory in the home folder.
2. Download the library package VORA_SE<version>.TGZ from the SAP Software Download Center
(https://support.sap.com/swdc ) to the vora directory.
3. Unpack the archive into the vora directory.
The vora directory now contains the following folders:
○ lib/: Contains a JAR file with all necessary dependencies (excluding Spark).
○ bin/: Contains scripts for ease of use.
○ META-INF/: Contains the pom.properties and pom.xml files.
4. Make sure that the SAP HANA Vora extension has been successfully installed by creating a table and
loading data into it from a file stored in HDFS:
a. Create a file in HDFS. Note that in this example the test file, test.csv, is stored in a directory set up
for the user "vora" (/user/vora):
Sample Code
d. Enter the following statements in the Spark shell to create a table and check that it has been
successfully created:
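A minimal sketch of such statements, assuming the test.csv file created above in /user/vora and placeholder ZooKeeper and NameNode host names (the column names are illustrative and not taken from this guide):

import org.apache.spark.sql._

// Create the extended SQL context provided by the SAP HANA Vora Spark extension
val sqlc = new SapSQLContext(sc)

// Create a table backed by the test file in HDFS (adjust hosts and paths to your cluster)
sqlc.sql(
  """CREATE TABLE testTable (column1 string, column2 integer)
     USING com.sap.spark.vora
     OPTIONS (
       tablename "testTable",
       paths "/user/vora/test.csv",
       zkurls "zookeeper.host1.com:2181",
       namenodeurl "namenode.host.com:8020")""".stripMargin)

// Query the table to confirm that it was created and loaded successfully
sqlc.sql("SELECT * FROM testTable").show()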
Note
SAP HANA Vora uses a catalog component to keep track of the hosts in the system that run a SAP
HANA Vora engine. There are two ways in which you can add hosts to the catalog:
It is good practice to configure the list of known hosts in the spark-defaults.conf file. This
applies equally to the ZooKeeper and NameNode URLs:
The port is required for namenodeurl (default: 8020) and zkurls (default: 2181) and is optional
for hosts.
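As a sketch, the same values can also be set programmatically on the Spark configuration. The property names assume the spark.vora prefix described in the data source API section, and the host names are placeholders:

import org.apache.spark.SparkConf

// Equivalent settings to the spark-defaults.conf entries; in the Spark shell these are normally
// taken from spark-defaults.conf, since the SparkContext already exists when the shell starts.
val conf = new SparkConf()
  .set("spark.vora.hosts", "vora.host1.com:2202,vora.host2.com:2202") // port optional for hosts
  .set("spark.vora.zkurls", "zookeeper.host1.com:2181")               // port required (default 2181)
  .set("spark.vora.namenodeurl", "namenode.host.com:8020")            // port required (default 8020)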
Remember
The SAP HANA Vora catalog only knows those hosts that were either listed in the spark-
defaults.conf file or used at least once in a CREATE TABLE statement. Data is loaded on those
hosts only. This means that if there are hosts in your cluster on which SAP HANA Vora was installed
but which were never used in a CREATE TABLE statement or listed in the spark-defaults.conf
file, they will simply be ignored when data is distributed and processed across the cluster.
Results
You have now successfully installed the SAP HANA Vora extension and can use it as follows:
● Alternatively, the shell scripts in the bin folder can be used to run a Spark shell and a Thrift server with the
SAP HANA Vora extension library. To do so, the SPARK_HOME environment variable needs to point to the
Spark folder on the jump box.
You can then start the Spark shell or the Thrift server in YARN client mode; for example, to start the
Thrift server:
$ ./start-sapthriftserver.sh
Related Information
Zeppelin is a graphical user interface that allows you, as a data scientist, to interact easily with a cluster. The
SAP HANA Vora Spark extension provides an interpreter for the Zeppelin user interface.
Prerequisites
You require Zeppelin installed on one of the cluster nodes (most likely the jump box):
After the build process has completed, you should have a tar.gz package in the following directory:
./zeppelin-distribution/target
Context
The SAP HANA Vora extension library has its own SQL context class. A modified Zeppelin interpreter is
therefore required to allow Zeppelin to run in the modified context. To enable the interpreter, you need to
register it with Zeppelin.
Restriction
Zeppelin is still in the incubation stage. The steps below are provided for guidance only.
Procedure
$ cp ~/vora/lib/spark-sap-datasources-<VERSION>-assembly.jar \
<ZEPPELIN_HOME>/interpreter/spark/spark-sap-datasources-assembly.jar
Note
<ZEPPELIN_HOME> refers to the directory to which the Zeppelin binaries have been extracted.
2. SAP HANA Vora 1.1 Patch 1 only: Combine the Zeppelin Spark interpreter JAR with the spark-sap-
datasources-assembly JAR, replacing the versions as appropriate:
$ cd <ZEPPELIN_HOME>/interpreter/spark
$ mkdir tmp
$ (cd tmp; jar -xf ../spark-sap-datasources-<VERSION>-assembly.jar)
$ (cd tmp; jar -xf ../zeppelin-spark-<VERSION>-incubating.jar)
$ jar -cvf zeppelin-spark-sap-combined.jar -C tmp .
# remove the old jars
$ rm spark-sap-datasources-<VERSION>-assembly.jar
$ rm zeppelin-spark-<VERSION>-incubating.jar
Variables
Example
1. cp $ZEPPELIN_HOME/conf/zeppelin-env.sh.template $ZEPPELIN_HOME/conf/zeppelin-env.sh
2. chmod 0755 $ZEPPELIN_HOME/conf/zeppelin-env.sh
3. vi $ZEPPELIN_HOME/conf/zeppelin-env.sh
4. Insert the variables shown above and save your changes.
Note
Zeppelin also requires the environment variables SPARK_HOME and HADOOP_CONF_DIR to be set. If
these are not already set, you can add them to the zeppelin-env.sh file as well.
...
<property>
  <name>zeppelin.interpreters</name>
  <value>INTERPRETER_1,...,INTERPRETER_N,org.apache.spark.sql.SapSqlInterpreter</value>
  <description>Comma separated interpreter configurations.
  First interpreter becomes the default</description>
</property>
...
Example
1. cp $ZEPPELIN_HOME/conf/zeppelin-site.xml.template $ZEPPELIN_HOME/conf/zeppelin-site.xml
2. chmod 0755 $ZEPPELIN_HOME/conf/zeppelin-site.xml
3. Enter the following as one command (make sure there are no spaces after the trailing backslash
characters):
sed -i "s/FlinkInterpreter<\/value>/FlinkInterpreter,\
org.apache.spark.sql.SapSqlInterpreter<\/value>/" \
$ZEPPELIN_HOME/conf/zeppelin-site.xml
5. For HDP with Ambari only: Update the YARN configuration as follows:
a. Check the installed HDP version (<HDP_VERSION>), for example, from the following directory
name: /usr/hdp/<HDP_VERSION>
b. On the Ambari administration interface, select the YARN service and choose the Configs tab. Scroll
down to the Custom yarn-site section and choose Add Property.
c. Add a property with the key hdp.version and value <HDP_VERSION>.
$ <ZEPPELIN_HOME>/bin/zeppelin-daemon.sh start
You should see an additional interpreter prefix called %vora in the interpreter list.
9. Test that the Zeppelin interpreter has been successfully installed.
The execution of the first snippet might take some time (1-3 minutes), since a Spark application needs to
be started on the server. Once the application is running, subsequent calls will be much faster (depending
on the actual query).
Example output:
Note
Log files are available as follows:
○ <ZEPPELIN_HOME>/logs/zeppelin-*-.log: Contains the Web-UI related output.
○ <ZEPPELIN_HOME>/logs/zeppelin-interpreter-*-.log: Contains the output you would see
in a Spark shell.
Prerequisites
● The Spark Controller has been installed and configured. For more information, see Set up SAP HANA
Spark Controller in the SAP HANA Administration Guide.
● When installing the Spark Controller as described in Set up SAP HANA Spark Controller, the following
steps are not necessary:
○ Install Spark Assembly Files and Dependent Libraries
The three datanucleus artifacts listed in this section are not needed when you run the Spark
Controller with SAP HANA Vora:
○ datanucleus-rdbms
○ datanucleus-api-jdo
○ datanucleus-core
Do not download and copy these artifacts to HDFS.
○ Configure Hive Metastore
You do not need to copy the hive-site.xml when you run the Spark Controller with SAP HANA
Vora.
If you do copy the datanucleus* artifacts and hive-site.xml, you might encounter issues unless you
have a valid Hive installation that is appropriately configured and your Hive metastore is running properly.
Procedure
1. Make the SAP HANA Vora data sources package available to the Spark Controller.
Make sure that you copy the same version that you are using to create tables. Compatibility between
different packages is not always guaranteed.
2. Configure the Spark Controller.
The Spark Controller needs to be made aware of the metadata storage location of the SAP HANA Vora
tables and the hosts that run SAP HANA Vora.
In both cases, include the TCP port as well and do not use whitespaces in the values.
<property>
<name>spark.vora.hosts</name>
<value>IP_ADDRESSES_AND_PORTS_OF_VORA_HOSTS</value>
<final>true</final>
</property>
<property>
<name>spark.vora.zkurls</name>
<value>IP_ADDRESS_AND_PORT_OF_ZOOKEEPER</value>
<final>true</final>
</property>
Example
<property>
<name>spark.vora.hosts</name>
<value>10.0.0.1:2202,10.0.0.2:2202,10.0.0.3:2202</value>
<final>true</final>
</property>
<property>
<name>spark.vora.zkurls</name>
<value>10.0.0.1:2181</value>
<final>true</final>
</property>
For the configuration changes to take effect, restart the Spark Controller, for example, using the following
commands:
$ cd /usr/sap/spark/controller/bin
$ ./hanaes stop
$ ./hanaes start
To verify whether the configuration changes were successful, check the Spark Controller log
file: /var/log/hanaes/hana_controller.log
After initialization, the file should contain the following line at the end:
Results
After successful configuration, you can see the tables stored in SAP HANA Vora in SAP HANA Studio, and you
can add virtual tables and submit queries, as described in the SAP HANA Spark Controller documentation.
Prerequisites
Context
To use SAP Lumira with SAP HANA Vora, you need to install the relevant drivers in SAP Lumira to be able to
connect from SAP Lumira using JDBC. You can then create a connection to SAP HANA Vora using the SAP
HANA Vora Thrift server.
Procedure
1. Install the JDBC driver. You need to use the Spark drivers.
Option Description
./start-sapthriftserver.sh
./start-sapthriftserverd.sh
c. Select Generic JDBC datasource – JDBC Drivers and choose Next. Note that the green tick indicates
that the drivers are installed.
Field Value
e. Choose Connect.
You should now see the CATALOG_VIEW, where you can select tables and enter SQL queries.
./beeline
b. Execute the following statement to connect to the Thrift server, replacing the host name and port as
needed:
c. When prompted for a user name and password, enter lumira in both cases.
d. Register the tables by running the following command:
Note
Table definitions are stored on the ZooKeeper server. This allows you to register or re-register
tables when you start or restart the Thrift server. The tables are persisted as long as the Thrift
server is connected.
Update your SAP HANA Vora installation by downloading and installing the latest versions of the installation
packages.
Related Information
Update the SAP HANA Vora Engine Using Ambari [page 31]
Update the SAP HANA Vora Engine Using Cloudera [page 32]
Update the SAP HANA Vora Spark Extension Library [page 33]
Use the Ambari provisioning tool to install the latest version of the SAP HANA Vora engine on your cluster.
Context
If you want to check which version of SAP HANA Vora is currently installed, you can do this from the Ambari
dashboard. In the Services panel on the left, choose Add Service from the Actions dropdown menu. Then
locate SAP HANA Vora in the services list to see which version you are using.
Tip
You can also find this information in the metainfo.xml file in the directory
/var/lib/ambari-server/resources/stacks/HDP/<HDP_version>/services/VORA.
Procedure
Run the following command from any machine where curl is available, for example, the management node
of the cluster, replacing the placeholders with appropriate values:
$ ambari-server restart
Depending on your cluster configuration, you may need to be the root user or a user with administrator
rights to do so.
Related Information
Install the SAP HANA Vora Engine Using Ambari [page 15]
Use the Cloudera provisioning tool to install the latest version of the SAP HANA Vora engine on your cluster.
Procedure
Related Information
Install the SAP HANA Vora Engine Using Cloudera [page 16]
Download and install the latest version of the SAP HANA Vora Spark extension library.
Procedure
1. SSH to the jump box as the user who runs the Spark jobs.
2. Remove the directory (for example, vora) in which you previously unpacked the
VORA_SE<version>.TGZ package.
3. Create a new vora directory in the home folder.
4. Download the latest version of the VORA_SE<version>.TGZ package from the SAP Software Download
Center at https://support.sap.com/swdc to the vora directory.
5. Unpack the archive.
6. Make sure the SAP HANA Vora extension has been successfully installed by completing step 4 of the
installation procedure. See Install the SAP HANA Vora Spark Extension Library.
7. If Zeppelin has been configured to support the SAP HANA Vora Spark extension library, update the library
as follows:
a. Shut down the Zeppelin server:
$ <ZEPPELIN_HOME>/bin/zeppelin-daemon.sh stop
$ cp ~/vora/lib/spark-sap-datasources-<VERSION>-assembly.jar \
<ZEPPELIN_HOME>/interpreter/spark/spark-sap-datasources-assembly.jar
Caution
If you do not use a unified name for the JAR file:
○ Make sure that there is only one spark-sap-datasources-<VERSION>-assembly.jar file
in the folder.
○ If you did not use a wildcard in the ADD_JARS variable, remember that you need to update this
variable in the <ZEPPELIN_HOME>/conf/zeppelin-env.sh file.
$ <ZEPPELIN_HOME>/bin/zeppelin-daemon.sh start
Related Information
Install the SAP HANA Vora Spark Extension Library [page 18]
By default, SAP HANA Vora is configured to use the port numbers given below.
Zeppelin 9099
Ambari 8080
SAP HANA Vora allows you to develop applications from a Spark-based environment using its provided data
sources and Spark extensions.
● Getting Started [page 35]: Learn how to access the provided data sources from Spark
● Using the SAP HANA Vora Data Source [page 37]: Use SAP HANA Vora as an in-memory database in your Spark programming environment
● Working with Hierarchies [page 49]: Build hierarchical data structures and query hierarchical data
● Using the SAP HANA Data Source [page 55]: Access SAP HANA data from a Spark-based environment
● Extended Data Sources API [page 59]: Leverage the full integration between Spark and SAP HANA Vora using advanced data source features
● System Architecture [page 61]: Familiarize yourself with the components involved in the SAP HANA Vora Spark programming environment
Related Information
To develop applications from a Spark-based environment using the provided data sources and Spark
extensions, follow the preparatory steps outlined below. These demonstrate, in particular, how to access the
SAP HANA Vora and SAP HANA data sources from Spark.
You can add the data source package to Spark using the --jars command line option. For example, to
include it when you start the Spark shell, use the following:
You need to link your Scala project to the modules you are using and to the core module contained in the
extensions package. This is necessary because both the SAP HANA Vora and SAP HANA data source modules
depend on the core module. To use the resulting program in a cluster environment, we recommend that you
build a shaded JAR. We also recommend that you use IntelliJ the first time you load the project, since it will
automatically load the dependencies in the pom.xml file as well.
You can use the SAP HANA Vora data source or SAP HANA data source in Spark by creating a table:
1. Create a SapSQLContext
Before you can create the table, you need to instantiate a SapSQLContext. A SapSQLContext is based on
a SparkContext object.
2. Create a table
You can register a table in Spark using the Spark SQL command CREATE TABLE. You need to provide a
table name and the fully qualified name of the source package (USING <data_source>):
○ com.sap.spark.vora for the SAP HANA Vora data source
○ com.sap.spark.hana for the SAP HANA data source
You also need to provide a set of options required by the data source.
The example below shows how a table is created using the SAP HANA Vora data source, based on a file in
HDFS. It is assumed that the sample code is executed in a Spark shell:
Sample Code
import org.apache.spark.sql._
val sqlc = new SapSQLContext(sc)
sqlc.sql(
s"""CREATE TABLE testTableName (column1 string, column2 integer)
USING com.sap.spark.vora
OPTIONS (
tablename "testTableName",
paths "/path/to/file.csv",
zkurls "zookeeper.host1.com:2181",
namenodeurl "namenode.host.com:8020")""".stripMargin)
Note that all options shown above can be set in the SparkConf or in the SQLContext configuration. The
specified NameNode URL indicates that the file is contained in HDFS.
Executing Queries
You can execute queries in the same way as in Spark SQL. You can also join tables regardless of their origin,
but you should bear in mind that there may be differences in performance. For example, if you join a table from
a SAP HANA data source with one from a SAP HANA Vora data source, this might require data to be offloaded
to Spark.
The SAP HANA Vora data source allows you to improve Spark performance by using SAP HANA Vora as an in-
memory database. It supports an enhanced data source API implementation that enables you to create Spark
DataFrames from files stored in a local or distributed file system.
Related Information
You can execute queries in the same way as in Spark SQL. The examples below show the syntax of some basic
queries you can use to work with the SAP HANA Vora data source.
Creating Tables
A CREATE TABLE statement registers the table with the Spark sqlContext and creates a table in the SAP
HANA Vora engine.
SQL
Sample Code
Programmatically (Scala)
Sample Code
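A minimal sketch of both variants, reusing the example table, file path, and placeholder hosts from the Getting Started section:

import org.apache.spark.sql._
import org.apache.spark.sql.types._

val sqlc = new SapSQLContext(sc)

// SQL variant: registers the table in Spark and creates it in the SAP HANA Vora engine
sqlc.sql(
  """CREATE TABLE testTable (name string, age integer)
     USING com.sap.spark.vora
     OPTIONS (
       tablename "testTable",
       paths "/path/to/file.csv",
       zkurls "zookeeper.host1.com:2181",
       namenodeurl "namenode.host.com:8020")""".stripMargin)

// Programmatic (Scala) variant: the same table defined through the DataFrame reader API
val schema = StructType(
  StructField("name", StringType, nullable = true) ::
  StructField("age", IntegerType, nullable = true) :: Nil)

val voraTable = sqlc.read.format("com.sap.spark.vora")
  .schema(schema)
  .options(Map(
    "tablename" -> "testTable",
    "paths" -> "/path/to/file.csv",
    "zkurls" -> "zookeeper.host1.com:2181",
    "namenodeurl" -> "namenode.host.com:8020"))
  .load()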
To allow a table that has already been created to be registered with the Spark sqlContext, a CREATE TABLE
statement will succeed if the provided SCHEMA and the provided metadata (hosts, paths, csv delimiter, csv
quote, csv null value, format) are the same as those of the existing table, or if no SCHEMA information is
provided at all.
Note
If the paths option is not provided in the specified metadata, it is assumed to be the same as that
of the given table.
All tables that are persisted in the SAP HANA Vora in-memory database can be listed using the SHOW
DATASOURCETABLES statement, as shown in the example below:
Sample Code
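A minimal sketch, assuming the SapSQLContext sqlc created above; the exact clause syntax of the statement is an assumption and may differ in your release:

// List all tables persisted in the SAP HANA Vora in-memory database
// (USING/OPTIONS clauses are assumptions; check the reference for your version)
sqlc.sql("""SHOW DATASOURCETABLES
            USING com.sap.spark.vora
            OPTIONS (zkurls "zookeeper.host1.com:2181")""").show()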
Tables created in Spark exist only for the lifetime of a particular Spark sqlContext.
REGISTER TABLE
You can use the REGISTER TABLE <table_name> USING … OPTIONS … <IGNORING CONFLICTS>
statement to register a table in the Spark context. This corresponds to a CREATE TABLE statement where the
table already exists. However, no additional metadata or schema information is needed to perform the
registration:
An error is thrown if the table already exists in Spark, but you can enforce the action by using the IGNORING
CONFLICTS clause.
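A minimal sketch of the statement described above, assuming the SapSQLContext sqlc and the testTable and hosts used in the earlier examples:

// Register an existing SAP HANA Vora table in the Spark context; no schema is required.
// IGNORING CONFLICTS suppresses the error if the table is already registered in Spark.
sqlc.sql(
  """REGISTER TABLE testTable
     USING com.sap.spark.vora
     OPTIONS (
       zkurls "zookeeper.host1.com:2181",
       namenodeurl "namenode.host.com:8020")
     IGNORING CONFLICTS""".stripMargin)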
All tables already created with the SAP HANA Vora data source can be registered in the Spark context using
the following statement:
Sample Code
You can add more data to tables as shown in the example below:
Sample Code
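As a hedged sketch, appending additional files to an existing table might look as follows; the APPEND TABLE keyword is an assumption based on the description above and the commands listed in the System Architecture section:

// Append additional CSV files to an existing SAP HANA Vora table
// (only the paths and eagerload options are honored by this command)
sqlc.sql(
  """APPEND TABLE testTable
     OPTIONS (
       paths "/path/to/more_data.csv",
       eagerload "true")""".stripMargin)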
Note that the only options you can specify in this command are paths and eagerload. Any other options are
ignored.
The DROP TABLE command drops the specified table in the Spark context and also deletes the corresponding
in-memory SAP HANA Vora table. You can therefore use it to free cluster memory:
Sample Code
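A minimal sketch, assuming the SapSQLContext sqlc and the testTable from the earlier examples:

// Drop the table from the Spark context and delete the corresponding
// in-memory SAP HANA Vora table, freeing cluster memory
sqlc.sql("DROP TABLE testTable")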
If the specified table is referenced more than once in the ZooKeeper catalog, the drop table action will fail. This
could happen if, for example, the table is used in a number of views.
Add the CASCADE suffix to the DROP TABLE statement to drop both the table and every entry in the catalog
that references that table.
The ClusterUtils object contains a method named clearZooKeeperCatalog(), which allows you to
wipe out all metadata in ZooKeeper. This method is very useful for advanced users who want to make their
scenarios and tests idempotent.
Sample Code
import com.sap.spark.vora.client._
ClusterUtils.clearZooKeeperCatalog("zookeeper.host1.com:2181")
Related Information
The following code examples show how a table can be created and queried in Spark using the SAP HANA Vora
data source.
Sample Code
import org.apache.spark.sql._
/*
Csv file that just contains:
John,10
Jane,20
John,20
Jane,40
*/
val sqlc = new SapSQLContext(sc)
/* Table name used to register the relation into the Spark schema */
val tableName = "testTable"
/* Source package needed to use the Vora source */
val source = "com.sap.spark.vora"
/* Creating the new table */
sqlc.sql(
s"""CREATE TABLE $tableName (name string, age integer)
USING $source
OPTIONS (
tablename "$tableName",
paths "/path/to/file.csv",
zkurls "zookeeper.host1.com:2181",
namenodeurl "namenode.host.com:8020"
)""".stripMargin)
/*
Jane,20
John,20
Jane,40
*/
val queryResult = sqlc.sql("SELECT name, age from testTable where age > 10")
queryResult.collect().foreach(println)
/*
John,15.0
Jane,30.0
*/
val aggregationResult = sqlc.sql("SELECT name, AVG(age) AS avgAge from testTable GROUP BY name")
aggregationResult.collect().foreach(println)
This example uses Spark DataFrames. For information about programming with DataFrames, see the Spark
SQL and DataFrame Guide.
import org.apache.spark.sql._
import org.apache.spark.sql.types._
/*
Csv file that just contains:
John,10
Jane,20
John,20
Jane,40
*/
val stds1 = "..." // Some path to a csv file
val sqlc = new SapSQLContext(sc)
/* Source package needed to use the Vora source */
val source = "com.sap.spark.vora"
/* Table schema */
val schema = StructType(
StructField("name", StringType, nullable = true) ::
StructField("age", IntegerType, nullable = true) :: Nil
)
val options = Map(
/* Table name used in Vora nodes */
"tablename" -> "voraTable",
/* Comma-separated CSV file paths */
"paths" -> stds1,
"zkurls" -> "zookeeper.host1.com:2181",
"namenodeurl" -> "namenode.host.com:8020"
)
/* Creating the new table */
val voraTable = sqlc.read.format(source).schema(schema).options(options).load()
/*
Jane,20
John,20
Jane,40
*/
val queryResult = voraTable.select("name", "age").where(voraTable("age") > 10)
queryResult.collect().foreach(println)
/* We need to import this to use the different sql functions like MAX, MIN or
AVG */
import org.apache.spark.sql.functions._
/*
John,15.0
Jane,30.0
*/
val aggregationResult = voraTable.select("name", "age").groupBy("name").agg(avg("age").as("avgAge"))
aggregationResult.collect().foreach(println)
Related Information
You can use SAP HANA Vora to load and distribute files stored in Amazon S3 (Simple Storage Service) on all
available nodes in your cluster. This allows you to run distributed queries on that data.
Prerequisites
If your cluster runs behind a proxy, your proxy settings need to be set up correctly. Otherwise the SAP HANA
Vora engine or Spark might not be able to read files from Amazon S3 due to missing proxy information. For
more information, see Configure Proxy Settings [page 64].
To load a data file from Amazon S3 into a table in SAP HANA Vora, create a table by running the CREATE
TABLE statement in the Spark shell. You need to specify the key ID and secret key ID of your Amazon S3
account, as well as the Amazon S3 region and endpoint to be contacted:
Sample Code
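A minimal sketch of such a statement, assuming placeholder credentials, bucket, region, and hosts (see the option descriptions below):

// Create a table from a CSV file stored in Amazon S3 (all values are placeholders)
sqlc.sql(
  """CREATE TABLE s3Table (name string, age integer)
     USING com.sap.spark.vora
     OPTIONS (
       tablename "s3Table",
       paths "my-bucket/path/to/file.csv",
       storagebackend "s3",
       s3accesskeyid "MY_ACCESS_KEY_ID",
       s3secretaccesskey "MY_SECRET_ACCESS_KEY",
       s3region "eu-central-1",
       s3endpoint "s3.eu-central-1.amazonaws.com",
       zkurls "zookeeper.host1.com:2181")""".stripMargin)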
● paths: Fully qualified names of the files to be uploaded to SAP HANA Vora. SAP HANA Vora accepts Amazon S3 file names in the following format: <bucket_name>/<file_path>
● storagebackend: Storage backend ("s3", "hdfs", or "local"). Set to "s3" to load files from Amazon S3.
● s3accesskeyid: Amazon S3 access key ID. You can get the key ID and secret access key from the Amazon console.
● s3secretaccesskey: Amazon S3 secret access key.
● s3region: Amazon S3 region. You can find information about your data center region and endpoint at: http://docs.aws.amazon.com/general/latest/gr/rande.html#s3_region
● s3endpoint: Amazon S3 endpoint (see the link above for the endpoint of your region).
Note
As with other parameters, such as the SAP HANA Vora hosts and ZooKeeper and NameNode URLs, you can
also configure the Amazon S3 parameters in the spark-defaults.conf file.
For security reasons, we recommend that you configure the Amazon S3 secret key in the spark-
defaults.conf file to avoid having to enter it in the Spark shell.
Restriction
Data files can currently be loaded in one direction only, from Amazon S3 into SAP HANA Vora.
Related Information
The behavior of SAP HANA Vora in overflow situations is based on that of Apache Spark. That means, in
particular, that the data contained in ORC, Parquet, or CSV files must not exceed the size allowed by the data
types specified in the table schema.
DECIMAL (precision, scale) Arbitrary precision signed decimal numbers. Precision and
scale have to be 32-bit numbers.
The resulting data type of any binary operation is determined by the data type of the larger of the two input
parameters (that is, the higher data type). For example, the multiplication of an INTEGER and a BIGINT results
in the data type BIGINT.
Up Casting
If your expression might overflow, you can prevent errors by explicitly casting the data types to higher data
types. You can do this by using the cast operator as follows: cast(expression as type)
Example
Assume a and b are two integer columns with numbers that might lead to an overflow during multiplication.
To avoid an overflow, you apply the cast function to the select statement. The query should look something
like this:
Sample Code
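A minimal sketch, assuming a hypothetical table t with integer columns a and b and the SapSQLContext sqlc from the earlier examples:

// Cast both operands to BIGINT so the multiplication is performed in the larger type
val result = sqlc.sql("SELECT cast(a as bigint) * cast(b as bigint) AS product FROM t")
result.collect().foreach(println)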
The SAP HANA Vora data source API provides several configurable options.
● tablename: Table name on the SAP HANA Vora query engine. Default: none. Example: testTable
● schema: Table schema used to create the SAP HANA Vora table. This parameter is only recommended if you specifically want to use special SAP HANA Vora data types that are not directly supported in Spark. Default: none. Example: name varchar(*), age integer
● null: Default value for parsing NULL fields in CSV files. Default: NULL. Example: null
● partitionsize: Optional preferred size of a file partition in MB. This parameter should be a multiple of the HDFS block size (if it is not, it will be rounded down to the closest multiple). The minimum parameter value is the HDFS block size. Default: 256. Example: 128, 256, 512
● loadstrategy: Specifies how the table partitions (subsets of each table) are distributed among the hosts during load time. The load strategy option is currently only used for files loaded from HDFS. Default: relaxedlocal. Example: byterange
● local: Flag used to execute the data source in local mode. It uses a non-persisted local catalog and a non-distributed locking system. It is used only for test purposes. Default: false. Type: Boolean
● eagerload: If true, the table is loaded into memory by the CREATE TABLE statement. If false, the table is loaded in the first query execution that uses the table. Default: true. Type: Boolean
● port: Port used by default in all SAP HANA Vora connections. Default: 2202. Example: 2000
● memorymaximum: If memory consumption exceeds this limit, SAP HANA Vora exits with an out-of-memory exception. Default: none. Example: 10G
● format: Format used to read the data. SAP HANA Vora supports "csv", "orc", and "parquet". Default: csv. Example: orc
● storagebackend: Optional parameter specifying the storage backend. It can be either "s3" (Amazon S3), "hdfs" (Hadoop Distributed File System), or "local" (local file system). Default: "local" if defined by the user, otherwise "hdfs". Example: "s3"
You can set all properties globally in the spark-defaults.conf file by adding the prefix spark.vora. For
more information about how to configure the variables in a convenient manner for users, see the best practice
topic Example Cluster Configuration.
Related Information
Example Cluster Configuration Including a Client Machine (Jump Box) [page 72]
Hierarchical data structures define a parent-child relationship between different data items, providing an
abstraction that makes it possible to perform complex computations on different levels of data.
An organization, for example, is basically a hierarchy where the connections between nodes (for example,
manager and developer) are determined by the reporting lines that are defined by that organization.
Since it is very difficult to use standard SQL to work with and perform analysis on hierarchical data, Spark SQL
has been enhanced to provide missing hierarchy functionality. Extensions to Spark SQL support hierarchical
queries that make it possible to define a hierarchical DataFrame and perform custom hierarchical UDFs on it.
This allows you, for example, to define an organization’s hierarchy and perform complex aggregations, such as
calculating the average age of all second-level managers or the aggregate salaries of different departments.
● The parser has been extended with hierarchy syntax that allows you to define a hierarchy. In addition, a list
of UDFs is available for performing calculations on hierarchy tables. For example, IS_ROOT returns true if
a row in a hierarchy table represents a root node.
● Two strategies have been made available for generating hierarchical DataFrames, where one uses self
joins and the other broadcasts the hierarchy structure.
● Since the SAP HANA Vora execution engine supports hierarchies, support has been added for pushing
down hierarchical queries to SAP HANA Vora using the data source implementation.
Related Information
You build a hierarchy using an adjacency list that defines the edges between hierarchy nodes. The adjacency
list is read from a source table where each row of the source table becomes a node in the hierarchy.
The hierarchy SQL syntax allows you to define the adjacency list and tweak it. It also allows you to determine
how the hierarchy is created by controlling the order of the children of each node and by explicitly determining
the roots of the hierarchy.
A source table representing the predecessors and successors of the hierarchy is a prerequisite for creating
hierarchies with the SAP HANA Vora Spark hierarchy extension.
The table h_src is used to represent the hierarchy shown above. It defines a basic hierarchy between
predecessors and successors:
h_src
name                 pred   succ   ord
Project Manager      1      2      1
Sales Manager        1      3      2
Project Coordinator  2      4      1
Architect            2      5      2
Programmer           4      6      1
Designer             4      7      2
You generally have another table as well, referred to as a fact table, which contains other data associated with
the hierarchy. In this example, the fact table contains addresses:
addresses
name address
Related Information
You create a hierarchy using an SQL statement. To create a hierarchy, you require a table that specifies the
relations between the predecessors and successors of the hierarchy.
For example, you have a table called h_src that contains two columns, pred and succ, showing predecessors
and successors respectively. You can then use the following statement to create and query the hierarchy:
Sample Code
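A hedged sketch of such a statement, assuming the SapSQLContext sqlc, the ord column shown in the h_src table, and that roots are marked by a NULL pred value; the HIERARCHY, START WHERE, and SET clause names are assumptions, while JOIN PARENT and SEARCH BY are described below:

// Build a hierarchy from the adjacency list in h_src and query it
// (clause names other than JOIN PARENT and SEARCH BY are assumptions)
val hierarchy = sqlc.sql(
  """SELECT name, node FROM HIERARCHY (
       USING h_src AS child
       JOIN PARENT parent ON child.pred = parent.succ
       SEARCH BY ord ASC
       START WHERE pred IS NULL
       SET node
     ) AS H""".stripMargin)
hierarchy.registerTempTable("h")

// Example hierarchy UDF: list the root rows
sqlc.sql("SELECT name FROM h WHERE IS_ROOT(node)").collect().foreach(println)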
● JOIN PARENT <alias> <equality_expression>: Defines how the adjacency list is constructed. In
this example it states that any two rows in h_src have an edge between them in the hierarchy if the child
row's pred column is equal to the parent row's succ column. This clause is mandatory.
● SEARCH BY <order_by_expression>: Determines the order of the children when the hierarchy is
constructed. The order is relevant for some UDFs, such as IS_PRECEDING and IS_FOLLOWING.
Related Information
Since a hierarchy is simply a Spark DataFrame, this means that it can be used in any valid SQL statement. This
includes statements for creating joins between it and other tables.
In the examples below, which show how to perform inner joins and left joins, the hierarchy h_src is joined with
the addresses table in order to output the name, address, and level of the employee in the tree. This can be
achieved as follows:
● Inner join: returns only the employees that have a matching entry in the addresses table.
● Left outer join: returns all employees; those without an address, such as Architect (level 3) and Programmer (level 4), appear with a null address.
Note that right joins and full outer joins can also be done in a similar manner. A sketch of the left outer join is given below.
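A sketch of the left outer join, assuming the hierarchy DataFrame from the previous topic is registered as a temporary table h, the addresses table is registered in Spark, and a LEVEL UDF returns the depth of a node (the UDF name is an assumption):

// Left outer join: every employee appears, with a null address where none exists
sqlc.sql(
  """SELECT h.name, a.address, LEVEL(h.node) AS level
     FROM h LEFT OUTER JOIN addresses a ON h.name = a.name""")
  .collect().foreach(println)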
Related Information
You have the option of using a view to create a hierarchy. Once created, the view can be used to perform SQL
queries with hierarchy UDFs (user-defined functions).
The statement below can be used to create a hierarchy view. It uses the example hierarchy table h_src:
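A hedged sketch of such a statement, reusing the hierarchy definition sketched in Creating Hierarchies; the HIERARCHY, START WHERE, and SET clause names are assumptions:

// Wrap the hierarchy in a view named HV so it can be queried with hierarchy UDFs
sqlc.sql(
  """CREATE VIEW HV AS SELECT name, node FROM HIERARCHY (
       USING h_src AS child
       JOIN PARENT parent ON child.pred = parent.succ
       SEARCH BY ord ASC
       START WHERE pred IS NULL
       SET node
     ) AS H""".stripMargin)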
The above command creates a view named HV that wraps a hierarchy. From now on, the view name can be
used in a SQL query and will be replaced with the underlying hierarchy.
In the following example, the hierarchy view is joined with itself in order to get the names of the children of the
root:
SELECT Children.name
FROM HV Children, HV Parents
WHERE IS_ROOT(Parents.Node) AND IS_PARENT(Parents.Node, Children.Node)
To select the addresses of all the descendants of the second-level employees, a two-level join is needed:
This list shows the user-defined functions (UDFs) that can be used with hierarchies.
UDF Description
The SAP HANA data source provides a pluggable mechanism for accessing data stored in SAP HANA from a
Spark-based environment through Spark SQL. It includes an enhanced data source API implementation that
supports predicate pushdown for all predicates that SAP HANA can process.
Related Information
You can execute queries in the same way as in Spark SQL. The examples below show the syntax of some basic
queries you can use to work with the SAP HANA data source.
Creating Tables
A CREATE TABLE statement registers the table with the Spark sqlContext and creates a table in the SAP
HANA database.
Sample Code
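A minimal sketch, using the option names listed in the SAP HANA data source API section; all values are placeholders to be replaced with your own system details:

import org.apache.spark.sql._
val sqlc = new SapSQLContext(sc)

// Register the table in Spark and create it in the SAP HANA database if it does not exist
sqlc.sql(
  """CREATE TABLE people_test (name string, age integer)
     USING com.sap.spark.hana
     OPTIONS (
       path "PEOPLE_TEST",
       dbschema "dbschema",
       host "hana.host1.com",
       instance "02",
       user "myuser",
       passwd "secret")""".stripMargin)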
To allow a table that has already been created to be registered with the Spark sqlContext, a CREATE TABLE
statement will succeed if the provided SCHEMA is the same as that of the existing table, or if no SCHEMA
information is provided at all.
Note
If a table is created that does not yet exist, it will only be persisted if data is inserted into it.
Data can be loaded into a table in the SAP HANA database from a DataFrame in Spark as follows:
Sample Code
dataFrame.write.format("com.sap.spark.hana").mode(SaveMode.Append).options(tableConf).save()
In general for all save modes, if the table does not yet exist in SAP HANA, a new table is created in the SAP
HANA database with the DataFrame’s schema and data is inserted into that table. If the table already exists in
SAP HANA, the behavior is as follows:
SaveMode.Overwrite Data of the current table is dropped and new data is inserted.
SaveMode.ErrorIfExists The statement fails and no changes are made to the existing table.
SaveMode.Ignore The statement doesn’t fail and no changes are made to the existing table.
Dropping Tables
The DROP TABLE command drops the specified table in the Spark context and also deletes the corresponding
in-memory SAP HANA table (provided it exists and the SAP HANA user is allowed to perform the action):
Sample Code
The following code example shows how a table can be created and queried in Spark using the SAP HANA data
source.
Sample Code
import org.apache.spark.sql._
import org.apache.spark.sql.types._
val nameNodeHostAndPort = "name.node.mycompany.corp:8020"
val pathToCsvFile = "/path/to/file.csv"
val sqlc = new SapSQLContext(sc)
// this table name holds for HANA as well as for the Vora instance
val tableName = "people_test"
//Database Schema Name
val dbSchema = "dbschema"
// HANA Host instance
val host = "hana.host1.com"
The SAP HANA data source allows you to use UDFs that are implemented solely in SAP HANA (that is, they do
not exist in Spark). You can do this by using the "$" prefix.
The following example shows how to push down a unit of measure conversion:
Sample Code
import org.apache.spark.sql._
val sqlc = new SapSQLContext(sc)
lazy val configuration = Map(("host"->"hana.host1.com"),
("instance"->"02"),
("user"->"myuser"),
("passwd"->"secret"))
lazy val sampleInputConf =
  configuration + ("path" -> "SAMPLE_INPUT") + ("dbschema" -> "SAPCCH")
val sampleInputRelation =
  sqlc.read.format("com.sap.spark.hana").options(sampleInputConf).load()
sampleInputRelation.registerTempTable("SAMPLE_INPUT")
val queryResult = sqlc.sql("Select $CONVERT_UNIT" +
  "(QUANT,SOURCE_UNIT,'SAPCCH',TARGET_UNIT,'000') as converted " +
  "FROM SAMPLE_INPUT")
queryResult.collect().foreach(println)
The SAP HANA data source API provides several configurable options.
● dbschema: SAP HANA database schema of the table specified in the parameter above. Default: SYSTEM. Example: mySchema
● passwd: Password for the SAP HANA database user specified in the parameter above. Default: none. Example: passwd
SapSQLContext provides an extended data sources API that is needed to leverage the full integration between
Spark and SAP HANA Vora. Note that although the SAP HANA Vora and SAP HANA data sources work with both
SQLContext and HiveContext, they will not use the additional performance features unless used with
SapSQLContext.
The extended data sources API provides additional traits that data sources can implement to signal support
for advanced features. These traits are PrunedFilteredAggregatedScan, CatalystSource, ExpressionSupport,
DropRelation, AppendRelation and SqlLikeRelation.
PrunedFilteredAggregatedScan
This trait is based on the standard PrunedFilteredScan, which provides a data source that can perform a query
with a number of selected columns and a limited set of filters (WHERE clauses).
ExpressionSupport
PrunedFilteredExpressionsScan extends the standard PrunedFilteredScan further to provide support for any
expression in the SELECT or WHERE clause.
CatalystSource
CatalystSource allows potentially every query to be completely pushed down to the data source. For a given
query or sub-query, it checks if the query can be executed by the data source and, if supported, pushes it
down completely.
CatalystSource provides a very tight integration between the data source and Spark’s query optimizer,
Catalyst. While this allows the maximum degree of flexibility, it also requires deep knowledge of Catalyst
internals to implement it properly.
The SAP HANA Vora and SAP HANA data sources use CatalystSource to convert a Catalyst logical plan back
to a full SQL query that can be sent to the SAP HANA Vora processing engine.
The main components used in the SAP HANA Vora Spark development environment are shown in the figure
below.
The query server is based on the Hive Thrift server. It creates a SapSQLContext, which is an extension of the
HiveContext. Any client implementing or using a library that implements the Hive Thrift server protocol can
execute any compatible query on it.
The query server is in charge of creating and handling the Spark context. Any Spark job that is executed and
requires a connection to the system must use the same SapSQLContext.
Zeppelin
Zeppelin is a web-based console that allows you to execute queries on a Spark cluster. The SAP HANA Vora
integration package allows the Spark Vora features to be used from Zeppelin.
The SAP HANA Vora Spark component is an extension of Spark SQL. It keeps the standard features available
in Spark and adds new functionality customized for SAP HANA Vora. There are four main extensions:
● DDL/SQL parsers: These provide the SAP HANA Vora commands, such as APPEND and DROP, and also
extend the SQL grammar to handle hierarchy commands.
● Analyzer: Handles hierarchy analysis.
● Planner: Provides the pushdown strategies for aggregations, functions, hierarchies, and so on.
● Function registration: The new functions handled by SAP HANA Vora have been included in the ones
supported by Spark.
Discovery Service
The discovery service keeps track of the running status of each node, that is, failed or healthy. This status is
required in order to apply recovery policies and implement failover strategies. When a failing node is detected,
its status is updated so that the node can be recovered.
Catalog
It is necessary for the SAP HANA Vora Spark component to store metadata information relevant to the
program workflow. The catalog provides a store for storing and retrieving generic hierarchical and versioned
key values, which are required to synchronize parallel updates.
The catalog also acts as a proxy to other metadata stores, such as HDFS NameNode, and caches their
metadata locally for better performance. It also determines the preferred locations of a given file stored on
HDFS based on the locations of its blocks.
Lock Manager
The lock manager provides distributed read-write locks backed by ZooKeeper. This ensures that both the catalog and the instances of the query engine keep a consistent state at all times. A write lock is taken whenever data is loaded, while a read lock can optionally be taken when data is queried.
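The lock manager itself is internal to SAP HANA Vora, but the underlying pattern, a distributed read-write lock backed by ZooKeeper, can be illustrated with Apache Curator. This is only a sketch of the pattern, not the SAP HANA Vora implementation; the connection string and lock path are placeholders.
Sample Code
import org.apache.curator.framework.CuratorFrameworkFactory
import org.apache.curator.framework.recipes.locks.InterProcessReadWriteLock
import org.apache.curator.retry.ExponentialBackoffRetry

// Connect to ZooKeeper (placeholder connection string)
val client = CuratorFrameworkFactory.newClient(
  "zkserv1.mycompany.corp:2181", new ExponentialBackoffRetry(1000, 3))
client.start()

val lock = new InterProcessReadWriteLock(client, "/locks/catalog")

// A writer (for example, a data load) takes the write lock
lock.writeLock().acquire()
try {
  // perform the update that must not interleave with other writers or readers
} finally {
  lock.writeLock().release()
}

client.close()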
SAP HANA Vora Client
This library is used from two types of location: the driver and the workers. When it is called from a worker, it is simply used to execute a query on a specific SAP HANA Vora node using the JDBC driver.
Its duties are more extensive when it is called from the driver. Besides updating the SAP HANA Vora catalog with the given data and marking a node as failed or healthy when detected, the SAP HANA Vora client is also responsible for handling data loading on SAP HANA Vora nodes.
In order to get the data loaded, the SAP HANA Vora client performs a sequence of steps against the SAP HANA Vora nodes.
Note that the SAP HANA Vora client can call SAP HANA Vora nodes directly from the driver without passing
through Spark workers.
There are some standard administration tasks you need to perform and best practices for the ongoing
operation of your SAP HANA Vora service and Hadoop cluster.
● Configure Proxy Settings [page 64]: If your cluster runs behind a proxy, set up your proxy settings.
● Start and Stop the SAP HANA Vora Service [page 65]: Start, stop, and restart the SAP HANA Vora instances on your cluster.
● Start the SAP HANA Vora Spark Thrift Server [page 67]: Start the Thrift server to enable JDBC access to the SAP HANA Vora Spark component.
● SAP HANA Vora Service: Configuration Settings [page 69]: Configuration options for the SAP HANA Vora engine.
● Best Practices: Administration and Operations [page 70]: Achieve higher performance on your cluster by observing some basic best practices.
Related Information
If your cluster runs behind a proxy, you need to set up your proxy settings correctly so that the SAP HANA
Vora engine and Spark are able to access external services, such as Amazon S3.
Procedure
1. Make sure that the following environment variables have been configured with the appropriate URLs in
the /etc/environment file:
http_proxy
HTTP_PROXY
https_proxy
HTTPS_PROXY
FTP_PROXY
ftp_proxy
no_proxy
Sample Code
export http_proxy=http://proxy.example.com:8080
export HTTP_PROXY=http://proxy.example.com:8080
export https_proxy=https://proxy.example.com:8080
export HTTPS_PROXY=https://proxy.example.com:8080
If any of the variables are not set up properly, make the necessary corrections and then restart the SAP
HANA Vora service using the cluster provisioning tool (for example, Ambari or Cloudera Manager).
2. Make sure that the following variables are passed to the JVM running the Spark driver:
http.proxyHost
http.proxyPort
https.proxyHost
https.proxyPort
You can do this by setting the extraJavaOptions property in the spark-defaults.conf file.
○ If you are running Spark in YARN client mode, you can set the property as follows:
spark.yarn.am.extraJavaOptions -Dhttp.proxyHost=<HTTP_HOST> -Dhttp.proxyPort=<HTTP_PORT> -Dhttps.proxyHost=<HTTPS_HOST> -Dhttps.proxyPort=<HTTPS_PORT>
○ If you are running Spark in YARN cluster mode, you can set the property as follows:
spark.driver.extraJavaOptions -Dhttp.proxyHost=<HTTP_HOST> -Dhttp.proxyPort=<HTTP_PORT> -Dhttps.proxyHost=<HTTPS_HOST> -Dhttps.proxyPort=<HTTPS_PORT>
Use the cluster provisioning tool to start, stop, and restart the SAP HANA Vora instances on your cluster.
Context
SAP HANA Vora instances hold data in memory and boost the performance of the compute nodes. When you
stop or restart an instance, the data is removed completely from memory. If SAP HANA Vora is needed to
provide acceleration for a specific query again, the fraction of data a certain instance was responsible for has
to be reloaded from disk.
Note that in the procedure below Ambari is used to manage the SAP HANA Vora instances.
1. On the Ambari dashboard, select SAP HANA Vora in the Services panel.
The Services summary tab shows how many SAP HANA Vora instances are running.
○ To start, stop, or restart all SAP HANA Vora instances, choose the appropriate option in the Service
Actions dropdown menu:
○ Restart All: Stops and then starts the SAP HANA Vora service on all hosts.
○ Restart SAP HANA Voras: Performs a rolling restart of the SAP HANA Vora service across all hosts. You can specify the following:
○ The number of instances to be started at a time
○ How long to wait between batches
○ The number of allowed restart failures
○ Whether to only restart instances with stale configuration
○ Whether to activate maintenance mode
○ Turn On Maintenance Mode: Suppresses alerts generated by the SAP HANA Vora service.
Next Steps
After restarting the SAP HANA Vora service, the tables no longer exist in the SAP HANA Vora in-memory
database. However, the associated metadata has been retained. To make the SAP HANA Vora instances
reload the data, you can use the markAllHostsAsFailed() function in the ClusterUtils object, as
follows:
com.sap.spark.vora.client.ClusterUtils.markAllHostsAsFailed()
As a result, Spark will assume that the SAP HANA Vora instances are empty and reload the data according to
the metadata information.
The Thrift server enables applications to access the SAP HANA Vora Spark component on the cluster using
JDBC. The Thrift server runs as a Spark program.
Prerequisites
To use the shell scripts, you need to have set the SPARK_HOME environment variable.
The delivery package contains shell scripts for starting the server based on the spark-submit command,
which automatically uses the configuration options specified for your Spark installation. Once the server has
started, client applications can connect to the SAP HANA Vora Spark component using JDBC, as shown in the
figure below:
Note that applications still need to register tables in Spark to incorporate SAP HANA Vora as a data source.
The tables will be persisted as long as the server process runs.
The Thrift server is generally started from the command line in one of the following ways using the appropriate
shell script:
● As a Spark program
● As a daemon
Note
It is also possible to start the Thrift server manually as a Spark program using the spark-submit script,
but this is less convenient than the options above.
Procedure
Start the Thrift server using the appropriate shell script:
○ As a Spark program: ./start-sapthriftserver.sh
○ As a daemon: ./start-sapthriftserverd.sh
If you have started the Thrift server as a Spark program (rather than as a daemon), you will see an information message after initialization indicating that the Thrift server is up and running and listening for incoming connections.
Caution
Do not close the terminal screen until you have finished using the Thrift server. The Thrift server is not a
daemon and by closing the terminal screen you will close the server.
You can connect to the Thrift server using any client that supports the jdbc:hive2 protocol. You can use the
following JDBC connection string:
jdbc:hive2://[machine_ip]:[port_number]
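As an illustration, a JDBC client could connect as follows. The host name and port are placeholders, and the Hive JDBC driver is assumed to be on the classpath.
Sample Code
import java.sql.DriverManager

// Register the Hive JDBC driver and connect to the Thrift server
// (host and port are placeholders)
Class.forName("org.apache.hive.jdbc.HiveDriver")
val connection = DriverManager.getConnection(
  "jdbc:hive2://thriftserver.mycompany.corp:10000", "sparkuser", "")

val statement = connection.createStatement()
val resultSet = statement.executeQuery("SHOW TABLES")
while (resultSet.next()) {
  println(resultSet.getString(1))
}
connection.close()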
To stop the Thrift server when it is running as a Spark program, close the terminal screen you opened or press CTRL + C in the terminal screen where it is running. If the Thrift server is running as a daemon, stop it with the following script:
./stop-sapthriftserverd.sh
Configuration options for the SAP HANA Vora engine, using the Ambari or Cloudera cluster provisioning tool.
You can change the configuration settings for the SAP HANA Vora service, if necessary, as follows:
● Ambari: On the SAP HANA Vora service Configs tab, in the Advanced vora-config section.
● Cloudera: On the SAP HANA Vora service Configuration tab.
● OS user: The operating system user under which the SAP HANA Vora engine runs. Default: vora
● OS group: The operating system group under which the SAP HANA Vora engine runs. Default: vora
Note that OS users and groups are created during the installation of the SAP HANA Vora engine if they do not
yet exist.
Note
Kerberos is supported as of SAP HANA Vora 1.1 Patch 1. Make sure that your Kerberos configuration
settings include the following:
● The user that starts SAP HANA Vora (the v2server process) needs to have a valid Kerberos ticket in the
credential cache.
You can examine the Kerberos tickets in the credential cache by running the klist command. You can obtain or renew a ticket by running the kinit command, either specifying a keytab file or entering the password for the principal (see the sample commands below). The principal for this user needs to be created in the Kerberos Key Distribution Center (KDC), otherwise authentication to HDFS will fail.
● The Spark user is part of the OS group under which the SAP HANA Vora service is running.
This is needed to access the HDFS files that are created by Spark.
Restriction
Kerberos support is currently restricted to the Hortonworks Hadoop distribution (Ambari) on Red Hat
Enterprise Linux (RHEL) 6.6.
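For example, the ticket for the SAP HANA Vora OS user could be obtained and checked as follows. The keytab path and principal are placeholders and depend on your Kerberos setup.
Sample Code
# Obtain a Kerberos ticket from a keytab (path and principal are examples only)
kinit -kt /etc/security/keytabs/vora.keytab vora@EXAMPLE.COM
# List the tickets currently held in the credential cache
klist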
Logs
● Log directory: The file system location for the SAP HANA Vora engine logs. Default: /var/log/vora
By observing some basic best practices, you can achieve higher performance on your Hadoop cluster.
A Hadoop cluster typically involves a very large number of relatively similar computers. In general, a good way to install a cluster is by distinguishing between four types of machines:
● A management node that runs the cluster provisioning tool (for example, the Ambari server)
● Master nodes that host central services such as the Resource Manager, the NameNodes, and the ZooKeeper servers
● Worker/compute nodes
● A jump box containing the client components
Note that if you have a very specific setup where you have, for example, divided compute nodes and HDFS data nodes, this might not be the best choice.
Related Information
4.5.1 HDFS
By default HDFS stores three replicas of each data block on different machines. Besides the necessary fault
tolerance, this also increases data locality.
Be aware of the following, since this might affect the performance of the cluster when it is used in combination
with SAP HANA Vora:
● If the data that is used for SQL processing is not evenly distributed, this might lead to longer loading times for tables. This might be the case if you delete a large amount of data (leaving it unbalanced) or if you also use HDFS for data that is not used for processing with SAP HANA Vora.
● Using a lot of small files (that is, smaller than the block size of HDFS) will waste a lot of space.
Remember
It is important to keep the data that you use in SAP HANA Vora/Spark as evenly distributed as possible on
HDFS to increase speed. There are a number of HDFS tools available to re-balance the data.
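For example, the HDFS balancer redistributes blocks until each DataNode is within a given threshold of the average cluster utilization (the threshold value here is only an example):
Sample Code
# Rebalance HDFS so that no DataNode deviates by more than 10% from the
# average cluster utilization
hdfs balancer -threshold 10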
The cluster manager is responsible for distributing tasks throughout the compute nodes of the cluster. Each
node that assumes computation tasks is managed by a cluster manager.
In order to run, an application requests resources from the cluster manager. If this is successful, the cluster
manager transfers the actual application to the nodes in question and starts it.
The cluster manager therefore serves as an abstraction layer for the application, allowing it to be developed
independently of the cluster setup. This means that Spark, as well as all its extensions for SAP HANA Vora, can
be installed on a single node and will then be automatically transferred to the compute nodes.
The system provided by SAP HANA Vora is completely independent of the cluster manager. If you are
deploying a test and development environment with a small number of nodes, we recommend that you choose
Spark’s standalone cluster manager. For information about how to install it, see the Spark manual.
Your Hadoop distribution usually comes with a built-in cluster manager. In most cases, this is Yarn. Yarn
distinguishes between Node Managers, which are responsible for a compute node, and the Resource Manager,
which keeps track of the overall workload of the cluster and distributes tasks to the Node Managers.
Note
If your cluster manager has central components, such as the Resource Manager, you should put them on separate machines that do not run compute jobs.
Related Information
This example shows how a small Hadoop system consisting of 60 nodes in total can be configured.
Each node is quite small and contains 32 GB of RAM. Yarn is used as the cluster manager. The nodes are
configured as follows:
● 1 Ambari server
● 2 master nodes (Resource Manager, NameNodes, and ZooKeeper server)
● 56 worker/compute nodes
● 1 jump box containing client components
All components are provisioned by Ambari with the standard settings. Particularly noteworthy is the way the
jump box is configured to enable a user to easily deploy applications and use the platform.
Each user is assigned a separate Linux user, including a home directory that contains the Spark binaries as well as a shaded JAR of all the components and dependencies provided by SAP.
For convenience, the environment variables are configured as follows in the .profile file:
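The exact variables depend on the installation; a minimal sketch, assuming the Spark binaries are located in the user's home directory and the Hadoop client configuration is in its usual location, might look as follows:
Sample Code
# Spark binaries shipped in the user's home directory (paths are examples only)
export SPARK_HOME=$HOME/spark
export PATH=$SPARK_HOME/bin:$PATH
# Hadoop client configuration used by Spark on the jump box
export HADOOP_CONF_DIR=/etc/hadoop/conf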
To use the SAP HANA Vora Spark integration component, several system-specific variables need to be
configured in Spark. See the developer manual for more details. For convenience, these are configured in the
spark-defaults.conf file so that all system-specific variables are located in one place:
spark.driver.extraJavaOptions -XX:MaxPermSize=256m
spark.vora.namenodeurl name.node.mycompany.corp:8020
spark.vora.zkurls zkserv1.mycompany.corp:2181,zkserv2.mycompany.corp:2181
spark.vora.hosts host1.mycompany.corp,host2.mycompany.corp
# Uncomment the following line and enter your Amazon S3 secret access key, if
# you have one
# spark.vora.s3secretaccesskeyid <S3 secret access key>
Based on this configuration, users can easily start a shell or deploy an application with the following
commands:
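For example (the JAR, class, and application names are placeholders for the shaded SAP JAR and the user's own application mentioned above):
Sample Code
# Start an interactive Spark shell with the SAP components on the classpath
spark-shell --jars $HOME/lib/spark-sap-datasources-assembly.jar

# Deploy an application in the same way
spark-submit --class com.mycompany.MyVoraApp \
  --jars $HOME/lib/spark-sap-datasources-assembly.jar \
  myvoraapp.jar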
When using a distributed system, you need to be sure that your data and processes support your business
needs without allowing unauthorized access to critical information. User errors, negligence, or attempted
manipulation of your system should not result in loss of information or processing time.
Security Guides
SAP HANA Vora functions as an execution engine within a Spark/Hadoop landscape. Therefore, the following
security guides outline all applicable security considerations:
Related Information
SAP HANA Vora integrates into the Hadoop ecosystem, as shown below.
When installed on nodes in an Ambari cluster, SAP HANA Vora becomes an available service that can be added
through the Ambari administration interface provided by the management node, in parallel with existing
services.
Related Information
The Ambari installation procedure includes a step for installing SAP HANA Vora. This step requires elevated
user access (as do several other steps in the installation process) and installs SAP HANA Vora so that it is
accessible from all accounts on that image. No state, including permissions and data, is made visible as a
result of this.
SAP HANA Vora stores no persistent state locally. All state is stored in the Hadoop landscape, using HDFS,
and with the specified security measures for that instance. It is transferred into SAP HANA Vora only during
the execution of queries.
Some Spark components provide custom SSL or HTTPS connectors. SAP HANA Vora does not provide HTTP
or HTTPS connectivity, and because it acts as a local service to a Spark node, providing an SSL connector is
not a consideration at this time.
Coding Samples
Any software coding and/or code lines / strings ("Code") included in this documentation are only examples and are not intended to be used in a productive system
environment. The Code is only intended to better explain and visualize the syntax and phrasing rules of certain coding. SAP does not warrant the correctness and
completeness of the Code given herein, and SAP shall not be liable for errors or damages caused by the usage of the Code, unless damages were caused by SAP
intentionally or by SAP's gross negligence.
Accessibility
The information contained in the SAP documentation represents SAP's current view of accessibility criteria as of the date of publication; it is in no way intended to be
a binding guideline on how to ensure accessibility of software products. SAP in particular disclaims any liability in relation to this document. This disclaimer, however,
does not apply in cases of wilful misconduct or gross negligence of SAP. Furthermore, this document does not result in any direct or indirect contractual obligations of
SAP.
Gender-Neutral Language
As far as possible, SAP documentation is gender neutral. Depending on the context, the reader is addressed directly with "you", or a gender-neutral noun (such as
"sales person" or "working days") is used. If when referring to members of both sexes, however, the third-person singular cannot be avoided or a gender-neutral noun
does not exist, SAP reserves the right to use the masculine form of the noun and pronoun. This is to ensure that the documentation remains comprehensible.
Internet Hyperlinks
The SAP documentation may contain hyperlinks to the Internet. These hyperlinks are intended to serve as a hint about where to find related information. SAP does
not warrant the availability and correctness of this related information or the ability of this information to serve a particular purpose. SAP shall not be liable for any
damages caused by the use of related information unless damages have been caused by SAP's gross negligence or willful misconduct. All links are categorized for
transparency (see: http://help.sap.com/disclaimer).