This document discusses profiling a Hadoop cluster to determine infrastructure needs. It describes instrumenting a cluster running a 10TB TeraSort workload using the SAR tool to collect CPU, memory, I/O, and network metrics from each node. The results show the I/O subsystem was underutilized at 10% while CPU utilization was high, indicating the workload was not I/O bound. Memory metrics showed a high percentage of cached data, meaning the CPUs were not waiting on memory. Profiling workloads in this way helps right-size Hadoop infrastructure.
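As a rough illustration of the post-processing behind those findings, the sketch below averages per-node sar CPU and disk figures to judge whether a run was CPU- or I/O-bound. The file names (sar_cpu_<host>.txt, sar_disk_<host>.txt) and the column positions are assumptions based on a typical sysstat text layout, not the deck's actual tooling.

#!/usr/bin/env python3
"""Summarize per-node sar reports to see whether a run was CPU- or I/O-bound.

Assumes each slave's reports were saved as sar_cpu_<host>.txt (`sar -u`) and
sar_disk_<host>.txt (`sar -d`). File names and column positions are
assumptions, so adjust them to your own sysstat version and harness.
"""
import glob
import statistics


def cpu_busy(path):
    """Yield per-interval CPU busy % (100 - %idle) from a `sar -u` report."""
    for line in open(path):
        fields = line.split()
        # Data rows report the aggregate "all" CPU; skip header rows and the
        # trailing "Average:" summary line.
        if "all" in fields and "%idle" not in fields and not line.startswith("Average"):
            yield 100.0 - float(fields[-1])          # %idle is the last column


def disk_util(path):
    """Yield per-interval %util values from a `sar -d` report."""
    for line in open(path):
        fields = line.split()
        if (len(fields) > 5 and "DEV" not in fields
                and not line.startswith("Average")
                and fields[-1].replace(".", "", 1).isdigit()):
            yield float(fields[-1])                  # %util is the last column


for cpu_file in sorted(glob.glob("sar_cpu_*.txt")):
    host = cpu_file[len("sar_cpu_"):-len(".txt")]
    cpu = list(cpu_busy(cpu_file))
    disk = list(disk_util(f"sar_disk_{host}.txt"))
    if cpu and disk:
        print(f"{host}: avg CPU busy {statistics.mean(cpu):.1f}%, "
              f"avg disk util {statistics.mean(disk):.1f}%")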
ApacheCon 2012: Taking the Guesswork Out of Your Hadoop Infrastructure
The average enterprise customer runs one or two racks in production. In that scenario you have redundant switches in each rack, you RAID-mirror the OS and Hadoop runtime drives, and you configure the data drives as JBOD, because losing racks and servers is costly. That is very different from the web-scale perspective.
Hadoop slave server configurations are about balancing the following: performance (keeping the CPU as busy as possible), storage capacity (important for HDFS), and price (managing the above at a price you can afford). Commodity servers do not mean cheap servers. At scale you want to fully optimize performance and storage capacity for your workloads to keep costs down.
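To put the storage-capacity side of that balance in concrete terms, here is a back-of-the-envelope sizing sketch. The replication factor of 3 is the HDFS default; every other figure (drive count, drive size, scratch-space overhead, data volume) is an illustrative assumption, not a recommendation from the deck.

import math

REPLICATION = 3        # HDFS default block replication
TEMP_OVERHEAD = 0.25   # scratch space reserved for intermediate MapReduce output (assumed)
RAW_DATA_TB = 100      # data you need to land on the cluster (assumed)

drives_per_node = 12   # JBOD data drives per slave (assumed)
drive_size_tb = 2      # per-drive capacity in TB (assumed)

raw_per_node = drives_per_node * drive_size_tb
usable_per_node = raw_per_node * (1 - TEMP_OVERHEAD) / REPLICATION

nodes_needed = math.ceil(RAW_DATA_TB / usable_per_node)
print(f"Usable HDFS capacity per slave: {usable_per_node:.1f} TB")
print(f"Slaves needed for {RAW_DATA_TB} TB of source data: {nodes_needed}")

With these assumed figures, a slave with 12 x 2 TB JBOD drives yields about 6 TB of usable HDFS capacity, so roughly 17 slaves would cover 100 TB of source data.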
So, as any good infrastructure designer will tell you… you begin by profiling your Hadoop applications.
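As a rough sketch of what that profiling pass might look like in practice, the snippet below starts sar collection on each slave for the duration of a benchmark run. The host names, output path, and sampling interval are illustrative assumptions rather than anything prescribed by the deck.

#!/usr/bin/env python3
"""Start sar collection on every slave for the duration of a benchmark run.

Host names, file paths, and the sampling interval are illustrative
assumptions; the deck does not prescribe a particular harness.
"""
import subprocess

SLAVES = ["slave01", "slave02", "slave03"]   # hypothetical node list
INTERVAL_SECS = 30                           # sample every 30 seconds
SAMPLES = 720                                # roughly 6 hours of samples

for host in SLAVES:
    # -u CPU, -r memory, -b I/O, -n DEV network; -o writes a binary file that
    # can be pulled back and post-processed with sar/sadf after the job ends.
    remote_cmd = (
        f"nohup sar -u -r -b -n DEV -o /tmp/sar_{host}.bin "
        f"{INTERVAL_SECS} {SAMPLES} >/dev/null 2>&1 &"
    )
    subprocess.run(["ssh", host, remote_cmd], check=True)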
But generally speaking, you design a pretty decent cluster infrastructure baseline that you can later optimize, provided you get these things right.
In reality, you don’t want Hadoop clusters popping up all over the place in your company; that quickly becomes untenable to manage. You really want a single large cluster managed as a service. If that’s the case, then you don’t want to optimize your cluster for just one workload; you want to optimize it for multiple concurrent workloads (dependent on the scheduler) and have a balanced cluster.
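One hedged way to sanity-check whether a candidate slave spec is "balanced" across such a workload mix is to compare each profiled job's per-task demands against what a slave provides and see which resource saturates first. Every number below is a placeholder for whatever your own profiling produced.

# Sanity-check a candidate slave spec against several profiled workloads.
node = {"cores": 16, "memory_gb": 96, "disk_mb_per_s": 1200}   # candidate slave spec (assumed)

# Approximate per-task demand observed for each workload (assumed values).
workloads = {
    "terasort":    {"cores": 1, "memory_gb": 2, "disk_mb_per_s": 60},
    "etl_join":    {"cores": 1, "memory_gb": 4, "disk_mb_per_s": 30},
    "ml_training": {"cores": 2, "memory_gb": 6, "disk_mb_per_s": 10},
}

for name, task in workloads.items():
    # How many concurrent tasks of this type fit before each resource saturates?
    limits = {res: node[res] // task[res] for res in node}
    bottleneck = min(limits, key=limits.get)
    print(f"{name:12s} fits {limits[bottleneck]:3d} tasks per slave; "
          f"first bottleneck: {bottleneck}")

If one resource turns out to be the first bottleneck for every workload in the mix, the spec is probably skewed and worth rebalancing before you scale it out.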