
Exploring Hadoop Ecosystem with Simple Linux Commands

Overview: This assignment is intended to get you more familiar with the Hadoop
Ecosystem using simple Linux commands

Prerequisites:
1. Google account or Gmail account
• Before proceeding, the user should have a Google account or Gmail account at hand
• If not, the user should create one first

2. Access Google Cloud Platform (GCP) console


• The user should be able to access his/her Google Cloud Platform (GCP) console

3. An existing project to host the Hadoop-Spark cluster


• The user has an existing project under this account to host the cluster to be
created

4. GCP storage bucket ready for use


• The user has created a GCP storage bucket and has it ready for use

5. A GCP Hadoop and Spark cluster has been created with 1 Manager Node and 2 Worker
Nodes. The nodes must be turned on for this assignment
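
If the bucket or cluster still needs to be created, both can also be set up from
the command line. This is a minimal sketch; the bucket name, region, and machine
types below are placeholders, so substitute your own values:

    # Create a storage bucket (name and location are placeholders)
    gsutil mb -l us-central1 gs://my-hadoop-bucket

    # Create a Dataproc cluster with 1 Manager Node and 2 Worker Nodes
    # (cluster name, region, and machine types are illustrative)
    gcloud dataproc clusters create hadoop-spark-2-cluster \
        --region=us-central1 \
        --num-workers=2 \
        --master-machine-type=n1-standard-4 \
        --worker-machine-type=n1-standard-4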

NOTES: Please see the following documents if you need a refresher


• How to Setup a GCP Account with Free Credit
• How to Create Projects in GCP
• How to Create New Storage Buckets in GCP
• How to Create a Hadoop and Spark Cluster in GCP

VERY IMPORTANT: Be sure all nodes are running in GCP

Step One: Start all 3 nodes in the cluster you have already created
• Then click on the chevron next to the SSH button
• Click on “Open in browser window”
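
The nodes can also be checked and started from Cloud Shell. This sketch assumes
the instance names that Dataproc generates by default (the cluster name with -m
and -w suffixes) and a placeholder zone:

    # List the cluster VMs and whether they are RUNNING
    gcloud compute instances list --filter="name~hadoop-spark-2-cluster"

    # Start a stopped node (instance name and zone are illustrative)
    gcloud compute instances start hadoop-spark-2-cluster-m --zone=us-central1-a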
Step Two: Explore the Cluster in Hadoop
• Open a terminal via SSH in GCP
• See all the Hadoop services running in our cluster
• Use the commands
o whoami
o pwd
o These commands show your username and your home directory (see the sample
session after this list)

• Try other commands from the lecture notes, but be careful when you are deleting
something.
• Enter the command
o ps -ef | grep -i hadoop
o This will list all the processes currently running
o Remember, all of these services were set up when we created the Hadoop and
Spark cluster with Dataproc
• To move up and down in the terminal window
o Click on the Settings icon in the upper right-hand corner of the terminal
o Click on “Show Scrollbar” to see the scrollbar

• You can scroll the terminal using your mouse wheel or trackpad. Alternatively, the
Ctrl+Shift+PageUp/Ctrl+Shift+PageDn keyboard shortcuts scroll the terminal on
Windows and Linux, and Fn+Shift+Up/Fn+Shift+Down scroll the terminal on macOS
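
A minimal sample session is shown below. The username zorhan comes from the
terminal output later in this section; yours will differ:

    zorhan@hadoop-spark-2-cluster-m:~$ whoami
    zorhan
    zorhan@hadoop-spark-2-cluster-m:~$ pwd
    /home/zorhan

The ps -ef | grep -i hadoop command then produces output like the following: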
zorhan@hadoop-spark-2-cluster-m:~$ ps -ef | grep -i hadoop
hive 769 1 5 00:28 ? 00:00:16 /usr/lib/jvm/temurin-8-jdk-
amd64/bin/java -Xmx256m -Dhive.log.dir=/var/log/hive -Dhive.log.file=hive-
metastore.log -Dhive.log.threshold=INFO -
Dhadoop.log.dir=/usr/lib/hadoop/logs -Dhadoop.log.file=hadoop.log -
Dhadoop.home.dir=/usr/lib/hadoop -Dhadoop.id.str= -
Dhadoop.root.logger=INFO,console -
Djava.library.path=/usr/lib/hadoop/lib/native -Dhadoop.policy.file=hadoop-
policy.xml -Xmx8027m -Dproc_metastore -Dlog4j2.formatMsgNoLookups=true -
Dlog4j.configurationFile=hive-log4j2.properties -
Djava.util.logging.config.file=/usr/lib/hive/conf/parquet-
logging.properties -Dhadoop.security.logger=INFO,NullAppender
org.apache.hadoop.util.RunJar /usr/lib/hive/lib/hive-metastore-2.3.7.jar
org.apache.hadoop.hive.metastore.HiveMetaStore
hive 771 1 5 00:28 ? 00:00:18 /usr/lib/jvm/temurin-8-jdk-
amd64/bin/java -Xmx256m -Dhive.log.dir=/var/log/hive -Dhive.log.file=hive-
server2.log -Dhive.log.threshold=INFO -
Dhadoop.log.dir=/usr/lib/hadoop/logs -Dhadoop.log.file=hadoop.log -
Dhadoop.home.dir=/usr/lib/hadoop -Dhadoop.id.str= -
Dhadoop.root.logger=INFO,console -
Djava.library.path=/usr/lib/hadoop/lib/native -Dhadoop.policy.file=hadoop-
policy.xml -Xmx8027m -Dproc_hiveserver2 -Dlog4j2.formatMsgNoLookups=true -
XX:+UseConcMarkSweepGC -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -
XX:+PrintGCDetails -Dlog4j.configurationFile=hive-log4j2.properties -
Djava.util.logging.config.file=/usr/lib/hive/conf/parquet-
logging.properties -Djline.terminal=jline.UnsupportedTerminal -
Dhadoop.security.logger=INFO,NullAppender org.apache.hadoop.util.RunJar
/usr/lib/hive/lib/hive-service-2.3.7.jar
org.apache.hive.service.server.HiveServer2
mapred 885 1 7 00:28 ? 00:00:22 /usr/lib/jvm/temurin-8-jdk-
amd64/bin/java -Dproc_historyserver -Xmx4000m -
Dhadoop.log.dir=/usr/lib/hadoop/logs -Dhadoop.log.file=hadoop.log -
Dhadoop.home.dir=/usr/lib/hadoop -Dhadoop.id.str= -
Dhadoop.root.logger=INFO,console -
Djava.library.path=/usr/lib/hadoop/lib/native -Dhadoop.policy.file=hadoop-
policy.xml -Dhadoop.log.dir=/var/log/hadoop-mapreduce -
Dhadoop.log.file=hadoop.log -Dhadoop.root.logger=INFO,console -
Dhadoop.id.str=mapred -Dhadoop.log.dir=/usr/lib/hadoop/logs -
Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/usr/lib/hadoop -
Dhadoop.id.str= -Dhadoop.root.logger=INFO,console -
Djava.library.path=/usr/lib/hadoop/lib/native -Dhadoop.policy.file=hadoop-
policy.xml -Dhadoop.log.dir=/var/log/hadoop-mapreduce -
Dhadoop.log.file=mapred-mapred-historyserver-hadoop-spark-2-cluster-m.log
-Dhadoop.root.logger=INFO,RFA -Dmapred.jobsummary.logger=INFO,JSA -
XX:+UseConcMarkSweepGC -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -
XX:+PrintGCDetails -Dhadoop.security.logger=INFO,NullAppender
org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer
hdfs 897 1 3 00:28 ? 00:00:09 /usr/lib/jvm/temurin-8-jdk-
amd64/bin/java -Dproc_secondarynamenode -Xmx1000m -
Dhadoop.log.dir=/var/log/hadoop-hdfs -Dhadoop.log.file=hadoop-hdfs-
secondarynamenode-hadoop-spark-2-cluster-m.log -
Dhadoop.home.dir=/usr/lib/hadoop -Dhadoop.id.str=hdfs -
Dhadoop.root.logger=INFO,RFA -
Djava.library.path=/usr/lib/hadoop/lib/native -Dhadoop.policy.file=hadoop-
policy.xml -Xmx6422m -XX:+UseConcMarkSweepGC -XX:+PrintGCTimeStamps -
XX:+PrintGCDateStamps -XX:+PrintGCDetails -
Dhadoop.security.logger=INFO,RFAS
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode
yarn 899 1 6 00:28 ? 00:00:20 /usr/lib/jvm/temurin-8-jdk-
amd64/bin/java -Dproc_resourcemanager -Xmx4000m -
Dhadoop.log.dir=/var/log/hadoop-yarn -Dyarn.log.dir=/var/log/hadoop-yarn -
Dhadoop.log.file=yarn-yarn-resourcemanager-hadoop-spark-2-cluster-m.log -
Dyarn.log.file=yarn-yarn-resourcemanager-hadoop-spark-2-cluster-m.log -
Dyarn.home.dir= -Dyarn.id.str=yarn -Dhadoop.root.logger=INFO,RFA -
Dyarn.root.logger=INFO,RFA -Djava.library.path=/usr/lib/hadoop/lib/native
-Dyarn.policy.file=hadoop-policy.xml -Xmx12844m -
Dhadoop.log.dir=/var/log/hadoop-yarn -Dyarn.log.dir=/var/log/hadoop-yarn -
Dhadoop.log.file=yarn-yarn-resourcemanager-hadoop-spark-2-cluster-m.log -
Dyarn.log.file=yarn-yarn-resourcemanager-hadoop-spark-2-cluster-m.log -
Dyarn.home.dir=/usr/lib/hadoop-yarn -Dhadoop.home.dir=/usr/lib/hadoop -
Dhadoop.root.logger=INFO,RFA -Dyarn.root.logger=INFO,RFA -
Djava.library.path=/usr/lib/hadoop/lib/native -classpath
/etc/hadoop/conf:/etc/hadoop/conf:/etc/hadoop/conf:/usr/lib/hadoop/lib/*:/
usr/lib/hadoop/.//*:/usr/lib/hadoop-hdfs/./:/usr/lib/hadoop-
hdfs/lib/*:/usr/lib/hadoop-hdfs/.//*:/usr/lib/hadoop-
yarn/lib/*:/usr/lib/hadoop-yarn/.//*:/usr/lib/hadoop-
mapreduce/lib/*:/usr/lib/hadoop-
mapreduce/.//*:/usr/lib/spark/yarn/*::/usr/local/share/google/dataproc/lib
/*:/usr/local/share/google/dataproc/lib/*:/usr/local/share/google/dataproc
/lib/*:/usr/lib/hadoop-yarn/.//*:/usr/lib/hadoop-
yarn/lib/*:/etc/hadoop/conf/rm-config/log4j.properties:/usr/lib/hadoop-
yarn/.//timelineservice/*:/usr/lib/hadoop-yarn/.//timelineservice/lib/*
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager
hdfs 901 1 5 00:28 ? 00:00:15 /usr/lib/jvm/temurin-8-jdk-
amd64/bin/java -Dproc_namenode -Xmx1000m -Dhadoop.log.dir=/var/log/hadoop-
hdfs -Dhadoop.log.file=hadoop-hdfs-namenode-hadoop-spark-2-cluster-m.log -
Dhadoop.home.dir=/usr/lib/hadoop -Dhadoop.id.str=hdfs -
Dhadoop.root.logger=INFO,RFA -
Djava.library.path=/usr/lib/hadoop/lib/native -Dhadoop.policy.file=hadoop-
policy.xml -Xmx6422m -XX:+UseConcMarkSweepGC -XX:+PrintGCTimeStamps -
XX:+PrintGCDateStamps -XX:+PrintGCDetails -
Dhadoop.security.logger=INFO,RFAS
org.apache.hadoop.hdfs.server.namenode.NameNode
yarn 903 1 4 00:28 ? 00:00:14 /usr/lib/jvm/temurin-8-jdk-
amd64/bin/java -Dproc_timelineserver -Xmx4000m -
Dhadoop.log.dir=/var/log/hadoop-yarn -Dyarn.log.dir=/var/log/hadoop-yarn -
Dhadoop.log.file=yarn-yarn-timelineserver-hadoop-spark-2-cluster-m.log -
Dyarn.log.file=yarn-yarn-timelineserver-hadoop-spark-2-cluster-m.log -
Dyarn.home.dir= -Dyarn.id.str=yarn -Dhadoop.root.logger=INFO,RFA -
Dyarn.root.logger=INFO,RFA -Djava.library.path=/usr/lib/hadoop/lib/native
-Dyarn.policy.file=hadoop-policy.xml -XX:+UseConcMarkSweepGC -
XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintGCDetails -
XX:+UseConcMarkSweepGC -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -
XX:+PrintGCDetails -Djava.util.logging.config.file=/etc/hadoop/conf/yarn-
timelineserver.logging.properties -
Djava.util.logging.config.file=/etc/hadoop/conf/yarn-
timelineserver.logging.properties -Dhadoop.log.dir=/var/log/hadoop-yarn -
Dyarn.log.dir=/var/log/hadoop-yarn -Dhadoop.log.file=yarn-yarn-
timelineserver-hadoop-spark-2-cluster-m.log -Dyarn.log.file=yarn-yarn-
timelineserver-hadoop-spark-2-cluster-m.log -
Dyarn.home.dir=/usr/lib/hadoop-yarn -Dhadoop.home.dir=/usr/lib/hadoop -
Dhadoop.root.logger=INFO,RFA -Dyarn.root.logger=INFO,RFA -
Djava.library.path=/usr/lib/hadoop/lib/native -classpath
/etc/hadoop/conf:/etc/hadoop/conf:/etc/hadoop/conf:/usr/lib/hadoop/lib/*:/
usr/lib/hadoop/.//*:/usr/lib/hadoop-hdfs/./:/usr/lib/hadoop-
hdfs/lib/*:/usr/lib/hadoop-hdfs/.//*:/usr/lib/hadoop-
yarn/lib/*:/usr/lib/hadoop-yarn/.//*:/usr/lib/hadoop-
mapreduce/lib/*:/usr/lib/hadoop-
mapreduce/.//*:/usr/lib/spark/yarn/*::/usr/local/share/google/dataproc/lib
/*:/usr/local/share/google/dataproc/lib/*:/usr/local/share/google/dataproc
/lib/*:/usr/lib/hadoop-yarn/.//*:/usr/lib/hadoop-
yarn/lib/*:/etc/hadoop/conf/timelineserver-config/log4j.properties
org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistory
Server
root 1328 1 5 00:28 ? 00:00:17 /usr/bin/java -
XX:+AlwaysPreTouch -Xms1605m -Xmx1605m -XX:+CrashOnOutOfMemoryError -
XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/crash/google-
dataproc-agent.hprof -Djava.util.logging.config.file=/etc/google-
dataproc/logging.properties -cp /usr/local/share/google/dataproc/dataproc-
agent.jar:/etc/hadoop/conf:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/.//*:/usr
/lib/hadoop-hdfs/./:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-
hdfs/.//*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-
yarn/.//*:/usr/lib/hadoop-mapreduce/lib/*:/usr/lib/hadoop-
mapreduce/.//*:/usr/local/share/google/dataproc/lib/*
com.google.cloud.hadoop.services.agent.AgentMain
/usr/local/share/google/dataproc/startup-script.sh
/usr/local/share/google/dataproc/post-hdfs-startup-script.sh
spark 1560 1 3 00:28 ? 00:00:11 /usr/lib/jvm/temurin-8-jdk-
amd64/bin/java -cp
/usr/lib/spark/conf/:/usr/lib/spark/jars/*:/etc/hadoop/conf/:/etc/hive/con
f/:/usr/local/share/google/dataproc/lib/*:/usr/share/java/mysql.jar -
Xmx4000m org.apache.spark.deploy.history.HistoryServer
zorhan 2737 2513 0 00:33 pts/0 00:00:00 grep -i hadoop

• So what does this all mean? These are all the components of the Hadoop Ecosystem.
o You see root with process number 1328. This is the Dataproc agent
(com.google.cloud.hadoop.services.agent.AgentMain in the output above). The
process number (PID) is an ID for that program. It is very important in the
Ecosystem because you can shut down a process with a command that uses the
process ID number (see the example after this list)
o You also see mapred with process number 885, which is running the JobHistoryServer
o You see yarn with process number 899, which is running the ResourceManager
o See yarn with process number 903, which is running the ApplicationHistoryServer
o See hive with process number 769, which is running the Hive Metastore (see
below)
o See hdfs with process number 901, which is running the NameNode
o There is also hdfs with process number 897, which is running the SecondaryNameNode
o Another hive with process number 771, which is running HiveServer2
o Lastly you see spark with process number 1560, which is running the Spark
HistoryServer
o Is this sounding familiar?
o Take note of each service, the process ID of each service, and what each is running
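
As an illustration only, a process can be inspected or stopped by its PID. The
PID 885 below is taken from the listing above and will differ on your machine;
do not actually kill services on your cluster:

    # Show just that one process by PID
    ps -p 885 -o pid,user,cmd

    # Stop it (requires appropriate privileges; do NOT run this on your cluster)
    sudo kill 885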

• Let’s look at Hive (details will come next week)

o The Metastore and Hive Server (engine) are critical to run Hive


• Let’s look at the YARN Architecture
o The ResourceManager (on the Manager Node) is a major component of YARN
o It works with the Application Master and the Node Managers on the worker
nodes
• Let’s once again look at the HDFS Architecture (a few commands for exploring
YARN and HDFS follow below)
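
To connect these architecture diagrams to the running cluster, a few standard
YARN and HDFS commands can be tried from the Manager Node. These are stock
Hadoop CLI tools; the exact output depends on your cluster, and dfsadmin may
need to be run as the hdfs user on some setups:

    # List the Worker Nodes registered with the YARN ResourceManager
    yarn node -list

    # List the HDFS root directory
    hdfs dfs -ls /

    # Report HDFS capacity and the DataNodes running on the Worker Nodes
    hdfs dfsadmin -report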
