Assignmeng 3
Assignmeng 3
Assignmeng 3
assumes framework could adjust hardware failures automatically. Hadoop separates files into large
blocks, distributes them among the nodes in cluster, and then transfers the code with package into
nodes for data processing.
The most advantage of Hadoop is it ensures the data with locality because the data can be only
reached by nodes with accessibility in local.
What’s more, it’s an efficient and much faster way to process data due to the file system is operated
parallelly at the same time via the highly speed network.
1)Apache Hadoop is an open-source distribution offers software framework for distributed storage, it
uses computer network to address big data and processing. As the crucial part of Apache Hadoop,
Hadoop Distributed File System (HDFS) is a storage system which stores data on machines at the
same time offers superior bandwidth among the cluster.
Apache Hadoop is one of the best data processing platforms due to it costless feature, fast speed
network and scalable processing capabilities overall. It provides the effective and economical
services with low expense costs in most of the commercial area with the best and the efficient big
data processing and analytics solutions.
In addition to Apache, there are some other Hadoop implementations offered with some additional
features and functions that overcomes some shortages of Apache meanwhile provides more
additional advantages and functions with easy access assistance and professional solution
guidance. All these vendors provide convenience by innovation technology for commercial users for
big data managing and analyzing.
However, with all the vendors in the same market, it raised more competition with the open source
platform. Choice of the implementations from the commercial users depending on the different
features, advantages, using capabilities, technical supports, processing speed and expense
consideration on all these various distributions.
Below is some selected examples for some other Hadoop distributions available to be chosen in the
market and their different advantages and features.
2)MapR Hadoop:
MapR is one of the best chose implementations to manage large data programming processing, it
was recognized as one of the best Hadoop Distributions among all implementations in the market. It
superiors and exceeds all other competitors by its exceptional map and reduce methods or
functions. It overcomes some barriers from other implementations regarding to data security,
operational accessibility and commercial executing reliability.
As in the banking industry, I would recommend choosing MapR as the implementation for the
company according to the company’s large datasets in amount, as well as the data accuracy,
privacy and highly security demands.
Firstly, with a superior framework feature of distributed processing, map and reduction operates via
a parallel system on the cluster, large amounts of nodes and map can process with the program.
MapR is extremely useful and suitable for large datasets inputs, processing and analyzing.
This would be the most important thing with a primary demand from the bank because the company
inputs, operates, maintains and analyzes extremely large amount of highly confidential datasets on a
daily basis.
Secondly, MapR provides its highly competitive features and functions such as filtering, sorting, and
summary method. For instance, it can count the number of customers in every single line and
generate the frequency of the names. What’s more, it can sort customers by first name/last name
into line with every single name in one line.
Dealing/researching customer and branches data with name, location, gender and other frequently
used information such as products, items and times of usage, then sorting, filtering and summary
these datasets would be a highly useful demands and essential needs for the bank’s data inputs,
maintenance and analytics.
Thirdly, another advantage of MapR is its fault tolerance and scalability features. By transmitting
data to the distributed server, MapR framework managing data transition among several different
parts and procedures in the system, while operating multiple tasks, if the server/storage fails during
the operation, data is still available, and the initial work could be recovered and rescheduled with
chance, thus increasing the security, reliability and fault tolerance, at the same time, decrease cost
of network communication.
This would be the crucial important thing in the banking industry since the company positions data
security, instance and accuracy as the primary priority. With millions of highly confidential datasets
on a daily base, the company commits all customers data including clients’ personal information
such as name, SSN as well as clients’ assets and the bank assets not only confidential but also time
and accuracy sensitive. The fault tolerance framework of MapR would be a strong technical support
with reliable guarantee.
Last but not the least, reduce cost and increase efficiency is the crucial key for every business and
company. By MapR parallel system, every map operates separately and independently from each
other, thus reduce the network costs while improve efficiency.
Since the bank operating large amounts of nodes and computers in branches on the local network,
we input datasets with millions of customers and distributed branches to the cluster with multiple
thread implementations on multiple processor hardware, we could take advantage of parallelly
processing of mapR in the local area of stored datasets, hence decrease the cost of network
communication, and meanwhile increase efficiency.
3) Hortonworks Hadoop:
Hortonworks in one of the top Hadoop implementations, it’s a solely company with open source
distributions especially in the IT industry. Hortonworks is recognized and continuing working on its
new vision of highly speed data process capability via open source platform with the easy and high
volume executing by all companies and organizations.
Partner with RedHat, Microsoft, and SAP, Hortonworks was used by companies specially in the IT
industry. Companies currently using Hortonworks including Samsung, Spotify, eBay and Bloomberg.
4) Cloudera Hadoop:
Cloudera is one of the top big data Hadoop platforms in the business world and commercial market.
It was dveloped by engineers in Google, Yahoo and Facebook. Partnered with Oracle, IBM and HP
and used by US army and states, Cloudera attracts plenty of customers and keep increasing its big
data analytics users with thousands of nodes on cluster.
The most advantage of Cloudera is offering users with prepared solutions and extra technical
supports via services and coaching provided to companies and organization. Done a great job with
Hadoop focusing and aggressive innovation, Cloudera exceeds its users’ expectations by meeting
various demands from clients and providing multiple professional solutions to the market to meet
different needs. So far, the market share of Cloudera has been higher and exceeding its competitors
compares to MapR and Hortoonworks.
5) IBM Infosphere BigInsights Hadoop:
IBM Infosphere BigInsights combines Hadoop with unique users’ needs with demands. It offers extra
additional services with big sheets, big insights by its cloud enterprise structure.
The most advantage of IBM Infosphere BigInsights is its highly speed processing capability with
efficiency. Users can input and transmit data to the cluster easily and efficiently within just a half
hour, the processing rate of data is as less as 60 cents/per cluster/per hour. Thus, customers can
efficiently access the most superior big data analytics supports due to the efficient feature of IBM
Infosphere BigInsights.