

Module – 2: Big Data and Analytics

Module 2: Big Data Technology

Syllabus
Hadoop's Parallel World – Data Discovery – Open-Source Technology for Big Data Analytics – The Cloud and Big Data – Predictive Analytics – Mobile Business Intelligence and Big Data – Crowdsourcing Analytics – Inter- and Trans-Firewall Analytics.

Many Big Data technologies have shaped the new technology stacks for handling Big Data, but Apache Hadoop is the one that dominates the Big Data conversation.

2.1 Hadoop's Parallel World


• Hadoop is an open-source platform for storage and processing of diverse data types
that enables data-driven enterprises to rapidly derive the complete value from all their
data.
• The original creators of Hadoop are Doug Cutting and Mike Cafarella.
• Doug and Mike were building a project called “Nutch” with the goal of creating a
large Web index.
• They saw the MapReduce and GFS (Google File System) research papers from Google, which were clearly relevant to the problem the Nutch project was trying to solve.
• They integrated the concepts from MapReduce and GFS into Nutch; then later these
two components were pulled out to form the initial version of the Hadoop project.
• Yahoo! hired Doug and invested significant resources into growing the Hadoop project, initially to store and index the Web for the purpose of Yahoo! Search.

2.1.2 What Hadoop Can Do


• Moving beyond rigid legacy frameworks, Hadoop gives organizations the flexibility
to ask questions across their structured and unstructured data that were previously
impossible to ask or solve.
• The scale and variety of data have permanently overwhelmed the ability to cost-
effectively extract value using traditional platforms.
• The scalability and elasticity of free, open-source Hadoop running on standard
hardware allow organizations to hold onto more data than ever before.


• Hadoop excels at supporting complex analyses, including detailed, special-purpose computation, across large collections of data.
• Hadoop handles a variety of workloads, including search, log processing,
recommendation systems, data warehousing, and video/image analysis.
• Hadoop stores terabytes, and even petabytes, of data inexpensively. It is robust and
reliable and handles hardware and system failures automatically, without losing data
or interrupting data analyses.
• Hadoop runs on clusters of commodity servers and each of those servers has local
CPUs and disk storage that can be leveraged by the system.

2.2 Two critical components of Hadoop

Figure 2.1: Core Components of Hadoop System

2.2.1 Hadoop Distributed File System


• HDFS is the storage system for a Hadoop cluster.
• When data lands in the cluster, HDFS breaks it into pieces and distributes those pieces
among the different servers participating in the cluster.
• Each server stores just a small fragment of the complete data set, and each piece of data is replicated on more than one server.
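The splitting-and-replication idea above can be sketched in a few lines of Python. This is only a toy illustration, not real HDFS: the block size, replication factor, and server names are invented, and real HDFS places replicas with rack awareness rather than simple round-robin.

```python
import itertools

BLOCK_SIZE = 4    # bytes per block here; real HDFS defaults to 128 MB blocks
REPLICATION = 3   # copies of each block (the common HDFS default)
SERVERS = ["node1", "node2", "node3", "node4"]

def place_blocks(data: bytes):
    """Split data into fixed-size blocks and assign each block to several servers."""
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    ring = itertools.cycle(SERVERS)  # naive round-robin placement
    return {idx: {"data": block,
                  "replicas": [next(ring) for _ in range(REPLICATION)]}
            for idx, block in enumerate(blocks)}

layout = place_blocks(b"hello big data world")
# Each server ends up holding only fragments of the file, and every
# fragment exists on more than one server.
```

Losing any single server still leaves copies of every block elsewhere, which is what lets HDFS restore data from a known good replica.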

2.2.2 MapReduce


• Because Hadoop stores the entire dataset in small pieces across a collection of servers,
analytical jobs can be distributed, in parallel, to each of the servers storing part of the
data.


• Each server evaluates the question against its local fragment simultaneously and reports
its results back for collation into a comprehensive answer.
• MapReduce is the agent that distributes the work and collects the results.
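The two phases can be sketched in plain Python. This is a minimal word-count illustration of the programming model, not the Hadoop API: `map_phase` runs over each server's local fragment, the shuffle groups pairs by key, and `reduce_phase` collates the partial results into a comprehensive answer.

```python
from collections import defaultdict

def map_phase(fragment: str):
    # Each server runs this over its local fragment of the data,
    # emitting (key, value) pairs.
    for word in fragment.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Group all emitted values by key (the "shuffle" step).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Collate each group into a final result.
    return {key: sum(values) for key, values in groups.items()}

# Two fragments stand in for data held on two different servers.
fragments = ["Hadoop stores data", "Hadoop processes data in parallel"]
pairs = [pair for frag in fragments for pair in map_phase(frag)]
counts = reduce_phase(shuffle(pairs))
print(counts)  # e.g. {'hadoop': 2, 'stores': 1, 'data': 2, ...}
```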

2.2.3 HDFS and MapReduce Working Together

• Both HDFS and MapReduce are designed to continue to work in the face of system
failures.
• HDFS continually monitors the data stored on the cluster. If a server becomes
unavailable, a disk drive fails, or data is damaged, whether due to hardware or software
problems, HDFS automatically restores the data from one of the known good replicas
stored elsewhere on the cluster.
• Likewise, when an analysis job is running, MapReduce monitors progress of each of
the servers participating in the job.
• If one of them is slow in returning an answer or fails before completing its work,
MapReduce automatically starts another instance of that task on another server that has
a copy of the data. Because of the way that HDFS and MapReduce work, Hadoop
provides scalable, reliable, and fault-tolerant services for data storage and analysis at
very low cost.
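A toy sketch of that recovery behavior, assuming a simple replica table (the task and node names are invented): if a server holding a task's data has failed, the scheduler simply retries on another server that has a copy.

```python
# Which servers hold the data needed by each task.
replica_map = {"task0": ["node1", "node2", "node3"]}
failed = {"node1"}  # pretend node1 just died mid-job

def run_task(task, replica_map, failed):
    """Try each server holding a replica until one succeeds."""
    for server in replica_map[task]:
        if server not in failed:
            return f"{task} completed on {server}"
    raise RuntimeError("all replicas lost")

print(run_task("task0", replica_map, failed))  # task0 completed on node2
```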

Figure 2.2 : Working of HDFS and MapReduce

2.3 Data Discovery

• There is a lot of buzz in the industry about data discovery, the term used to describe the new wave of business intelligence that enables users to explore data, make discoveries, and uncover insights in a dynamic and intuitive way, rather than through predefined queries and preconfigured drill-down dashboards.

• In fact, there are two software companies that stand out in the crowd by growing
their businesses at unprecedented rates in this space: Tableau Software and
QlikTech International.

• Both companies approach the market quite differently from the traditional BI software vendor. They grew through a sales model that many refer to as "land and expand."

• A product that is "easy to use" also enables a "self-service" approach: analytics and reporting are produced by the people using the results. IT provides the infrastructure, but business people create their own reports and dashboards.

• There is a simple example of the powerful visualization the Tableau team refers to: a company uses an interactive dashboard to track the critical metrics driving its business. Every day, the CEO and other executives plug in, in real time, to see how their markets are performing in terms of sales and profit, what the service quality scores look like against advertising investments, and how products are performing in terms of revenue and profit.

• Interactivity is key: a click on any filter lets the executive look into specific markets
or products.

• The QlikTech CTO notes that business intelligence needs to work the way people's minds work. Users need to navigate and interact with data any way they want to, asking and answering questions on their own and in big groups or teams.

• QlikTech has designed a way for users to leverage both direct and indirect search. With QlikView search, users type relevant words or phrases in any order and get instant, associative results. With a global search bar, users can search across the entire data set.
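The "click a filter" interaction behind data discovery tools can be pictured as successively narrowing a dataset. This is only a toy illustration of the idea, not how Tableau or QlikView is implemented; the records and field names are invented.

```python
# A tiny in-memory "dashboard" dataset.
sales = [
    {"market": "APAC", "product": "X", "revenue": 120},
    {"market": "EMEA", "product": "X", "revenue": 90},
    {"market": "APAC", "product": "Y", "revenue": 60},
]

def apply_filter(rows, **criteria):
    """Keep only rows matching every selected filter value."""
    return [r for r in rows if all(r[k] == v for k, v in criteria.items())]

# An executive "clicks" the APAC filter to look into one market:
apac = apply_filter(sales, market="APAC")
print(sum(r["revenue"] for r in apac))  # 180
```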

Pa ge |4
Module – 2: Big Data and Analytics

2.4 Open-Source Technology for Big Data Analytics

• Open-source software is computer software that is available in source-code form under an open-source license that permits users to study, change, and improve the software, and at times also to distribute it.

• The most prominent and popular example is the GNU General Public License (GPL),
which “allows free distribution under the condition that further developments and
applications are put under the same license.”

• One of the key attributes of the open-source analytics stack is that it's not constrained by someone else's predetermined ideas or vision.

• The advantages of the open-source stack are flexibility, extensibility, and lower cost.

• The open-source stack reflects the new realities of the networked global economy,
which is increasingly dependent on big data.

• For example, Teradata Aster designed SQL-H, a seamless way to execute SQL and SQL-MapReduce on Apache Hadoop data.

Some of the open-source technologies for Big Data Analytics are:

Ambari: A web-based tool for monitoring Apache Hadoop clusters.

Cassandra: A scalable multi-master database with no single points of failure.

Chukwa: A data collection system for managing large distributed systems.

HBase: A distributed database that supports structured data storage for large tables.

Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.

Mahout: A scalable machine learning and data mining library.

Pig: A high-level data-flow language and execution framework for parallel computation.

Spark: A fast and general compute engine for Hadoop data. Spark provides a simple and
expressive programming model that supports a wide range of applications, including ETL,
machine learning, stream processing, and graph computation.

ZooKeeper: A high-performance coordination service for distributed applications.


2.5 The Cloud and Big Data

• Cloud models are a basic necessity for every industry; it is just a matter of time before an industry shifts to the cloud model.

• Many startups do not have unlimited capital to invest in infrastructure; in such cases the cloud can provide Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS).

• Data is exploding, both structured and unstructured, so storage devices are very costly and hardware failures sometimes cause problems. The cloud therefore provides Storage Space as a Service (SPaaS); examples include Google Drive, OneDrive, FileHippo, 4Shared, etc.

• Prices of cloud products come down, and risks are managed, thanks to services such as free storage space and high security.

• Many IT experts say:

• Stop saying cloud: the true value lies in delivering software, data, and/or analytics, so rather than "cloud computing model" it is more appropriate to say the "as a service" model.

• Acknowledge the business: regulation of matters around information privacy, security, access, and delivery under the cloud model has led to new business opportunities for many IT companies. Examples: AWS, Microsoft Azure, IBM Bluemix, etc.

• Fix the core business-technical gap: the ability to run analytics at scale in a virtual environment, and to ensure the authenticity of information processing and analytics, provided solutions for many core business problems. This reduced the gap between business and technology.

2.6 Predictive Analytics

• Enterprises will move from business intelligence to forward-leaning positions (predictive analytics).

• Using all the data available—traditional internal data sources combined with new rich
external data sources—will make the predictions more accurate and meaningful.


• Because the analytics are contextual, enterprises can build confidence in the analytics
and the trust will result in using analytic insights to trigger business events.

• Some leading trends that are making their way to the forefront of businesses today:

Recommendation engines, similar to those used by Netflix and Amazon, that use past purchases and buying behavior to recommend new purchases.
Risk engines for a wide variety of business areas, including market and credit risk, catastrophic risk, and portfolio risk.
Innovation engines for new product innovation, drug discovery, and consumer and fashion trends, to predict potential new product formulations and discoveries.
Customer insight engines that integrate a wide variety of customer-related information, including sentiment, behavior, and even emotions. Customer insight engines will be the backbone of online and set-top-box advertisement targeting, customer loyalty programs that maximize customer lifetime value, marketing campaigns optimized for revenue lift, and targeting individuals or companies at the right time to maximize their spend.
Optimization engines that optimize complex, interrelated operations and decisions that are too overwhelming for people to handle systematically at scale, such as when, where, and how to seek natural resources to maximize output while reducing operational costs, or what competitive strategies to use in a global business, taking into account the various political, economic, and competitive pressures along with both internal and external operational capabilities.
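As a concrete illustration of the first item, here is a toy co-purchase recommender in Python. It is not the actual Netflix or Amazon algorithm; the purchase data is invented, and the scoring simply counts how often an item appears in the baskets of customers who share a purchase with the target user.

```python
from collections import Counter

# Invented purchase histories.
purchases = {
    "alice": {"book", "lamp"},
    "bob":   {"book", "pen"},
    "carol": {"lamp", "pen", "book"},
}

def recommend(user, purchases):
    """Rank items bought by customers whose baskets overlap with the user's."""
    mine = purchases[user]
    scores = Counter()
    for other, items in purchases.items():
        if other == user:
            continue
        if mine & items:               # shares at least one purchase
            for item in items - mine:  # count items the user lacks
                scores[item] += 1
    return [item for item, _ in scores.most_common()]

print(recommend("alice", purchases))  # ['pen'] — both bob and carol bought it
```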

2.7 Mobile Business Intelligence and Big Data

Business intelligence (BI) refers to IT-based techniques used in spotting, digging out, and analyzing business data, such as sales revenue by products as well as associated costs and incomes.

Mobile BI is a system comprising both technical and organizational elements that present
historical and/or real-time information to its users for analysis on mobile devices such as
smartphones and tablets, to enable effective decision-making and management support, for the
overall purpose of increasing firm performance.


Although the concept of mobile computing has been prevalent for over a decade, mobile BI has shown momentum only very recently. This change has been partly driven by the shift from the wired world to a wireless world, with the advantage of smartphones leading to a new era of mobile computing, especially in the field of BI.

Figure 2.3: Business Intelligence for Mobile Workforce


Mobile apps drive straight to the heart of simplicity and ease of use, clearing a major barrier to BI adoption very quickly. Figure 2.3 depicts dashboards that drive business decisions quickly and efficiently.

According to statistics, most people spend only 16% of their smartphone time making phone calls. The other 84% is spent on activities such as email, web browsing, social apps, gaming, TV, watching videos, and much more. Because of these statistics, mobile data analytics has completely changed what we see and hear. All of that big data is compiled and used to target consumers with precision focus. It is also used to help them in many positive ways.

By way of mobile applications, people connect with each other and use their smartphones for just about everything. With each click and each second spent on mobile devices, the volume of information gathered and of big data grows every day. That compilation of big data in the mobile industry is being used not only to make recommendations but also to save lives in medical emergencies and risky scenarios.

2.7.1 Ease of Mobile Application Development

Another inflection point for the industry is the development and deployment of mobile
applications. In the past, that was controlled by the relationship with the carrier.
The three elements that have impacted the viability of mobile BI:
Location: the GPS component indicates where the customer spends more time, as well as how the customer travels.


It is not just about pushing data; we can even transact from a smartphone based on the information we get. Example: Google Maps live traffic updates.
Multimedia functionality allows the visualization pieces to really come into play.
The three challenges with mobile BI include:
Managing standards for rolling out these devices
Managing security
Managing "bring your own device," where devices owned by the company and devices owned by the individual both contribute to productivity.

2.8 Crowdsourcing Analytics


Crowdsourcing is the process of getting work or funding, usually online, from a crowd
of people. The word is a combination of the words 'crowd' and 'outsourcing'. The idea
is to take work and outsource it to a crowd of workers.

Crowdsourcing & Quality: The principle of crowdsourcing is that more heads are better
than one. By canvassing a large crowd of people for ideas, skills, or participation, the
quality of content and idea generation will be superior.

Famous example: Wikipedia. Instead of creating an encyclopedia on its own, hiring writers and editors, Wikipedia gave a crowd the ability to create the information. The result? The most comprehensive encyclopedia this world has ever seen.

Different Types of Crowdsourcing


• Crowdsource Design

• Crowdfunding

• Microtasks

• Open Innovation

2.8.1 Crowdsource Design

If we’re looking for a logo design, we can tell a crowd of designers what we want, how much
we will pay, and the deadline. All interested designers will create a finished design specifically
for us. We’ll receive 50-300+ different finished logo designs, and we can keep whichever


design we like the best. By doing design this way, crowdsourcing actually increases the quality
& decreases the price, compared to online freelancing.

Crowdsourcing can also be used to get designs for furniture, fashion, advertisements, video, &
product design. Just about anything that can be designed can be crowdsourced.

2.8.2 Crowdfunding

Crowdfunding involves asking a crowd of people to donate money to your project. For
example, if you want to raise $10,000 to pay for studio time to record a new CD, crowdfunding
can help you raise that money. You find a crowdfunding platform, set the goal amount,
deadline, and any rewards offered to donors. You must raise 100% of your goal before the
deadline, or all the donations are returned to the donors. Deadlines are typically less than 60
days.

Crowdfunding is mostly used by artists, charities, & start-ups to raise money for projects such
as filming a documentary, manufacturing an iPod watch, cancer research, or seed money.

2.8.3 Microtasks

Microtasking involves breaking work up into tiny tasks and sending the work to a crowd of
people. If you have 1,000 photos on your website that need captions, you can ask 1,000
individual people to each add a caption to one photo. Break up the work and decide the payment for each completed task (typically $0.01–$0.10 per task). With microtasking, you can expect to
see results within minutes. Microtasking can involve tasks such as scanning images,
proofreading, database correction and transcribing audio files.

Work is done faster, cheaper, and usually with fewer errors (when validation systems are in place). Additionally, microtasks can often be performed by people in less fortunate countries, including those with SMS capabilities but without computers.
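The microtasking arithmetic is worth making explicit. A small sketch, working in whole cents to avoid float rounding; the task count and per-task rates are illustrative ones, not quotes from any real platform.

```python
def microtask_cost_cents(num_tasks: int, cents_per_task: int) -> int:
    """Total payout in cents for a batch of identical microtasks."""
    return num_tasks * cents_per_task

# 1,000 photo captions at one cent and at ten cents per task:
low = microtask_cost_cents(1000, 1)    # 1,000 cents  = $10
high = microtask_cost_cents(1000, 10)  # 10,000 cents = $100
print(low, high)
```

Even at the high end, captioning a thousand photos costs on the order of a hundred dollars, which is why microtasking makes large-scale annotation affordable.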

2.8.4 Open Innovation

If you are unsure of where to begin with an idea for a business opportunity, whether it's product design or perhaps marketing, crowdsourcing can help through open innovation. Open innovation allows people from all aspects of business, such as investors, designers, inventors, and marketers, to collaborate and turn an idea into a functional, profit-making reality. This can be done either


through a dedicated web platform to gain outside perspective, or internally with only the company's own employees.

Open innovation brings together people from different parts of the world and different sectors
of business to work together on a project. This is effectively a collection of different fields and
levels of expertise that would not otherwise be available to any budding entrepreneur. It also invites previously uninvolved parties, such as investors, to roll up their sleeves and impart their knowledge, making them more than just a source of cash.

Crowdsourcing usage in data analysis: The data analysis process is divided into two parts:

Datafication: data collection, data cleansing, annotation, …

Analysis: data modeling, visualization, evaluation, …

Figure 2.4: Crowdsourcing data analysis

Crowdsourcing for datafication includes microtasks for data annotation and cleaning, mainly suited to simple human-intelligence tasks such as image labeling and speech recognition. The advantage is that a small payment buys a small piece of work, so large-scale datasets can be processed at reasonable cost.
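A common way to turn such microtask output into clean data is majority voting over redundant labels: several workers annotate each item, and the most frequent answer wins. A minimal sketch, with invented worker annotations:

```python
from collections import Counter

# Three workers label each image (invented data).
annotations = {
    "img1": ["cat", "cat", "dog"],
    "img2": ["dog", "dog", "dog"],
}

def majority_vote(labels):
    """Return the most frequent label among the workers' answers."""
    return Counter(labels).most_common(1)[0][0]

final = {img: majority_vote(lbls) for img, lbls in annotations.items()}
print(final)  # {'img1': 'cat', 'img2': 'dog'}
```

Redundancy plus aggregation is one of the "validation systems" that keeps crowdsourced data quality high despite individual worker errors.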

Crowdsourcing for analysis is based on competitions for data modeling, which are best suited to creative or complex tasks such as design, software development, and analysis, with winner-takes-all payment. The major advantage is that a few high-quality results can be obtained. Peer review for assessing results and their interpretation can be the ultimate product of the analytical process.


2.9 Inter- and Trans-Firewall Analytics

Over the last 100 years, the supply chain has evolved to connect multiple companies and enable them to collaborate to create enormous value for the end consumer, through concepts like CPFR (collaborative planning, forecasting, and replenishment), a collection of business practices that leverage the Internet and electronic data interchange to reduce inventories and expenses while improving customer service.

VMI (vendor-managed inventory) is a technique in which manufacturers receive sales data from their customers in order to forecast consumer demand more accurately.
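The forecasting step behind VMI can be as simple as a moving average over the sales data the manufacturer receives. A toy sketch with invented weekly figures; real demand forecasting uses far richer models.

```python
def moving_average_forecast(sales, window=3):
    """Forecast next period's demand as the mean of the last `window` periods."""
    recent = sales[-window:]
    return sum(recent) / len(recent)

# Weekly unit sales shared by the customer (invented numbers):
weekly_sales = [100, 120, 110, 130, 125]
print(moving_average_forecast(weekly_sales))  # (110 + 130 + 125) / 3 ≈ 121.67
```

The point of the shared data is exactly this: without the customer's sales figures crossing the firewall, the manufacturer has nothing to average.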

In the healthcare industry, rich consumer insights can be generated by collaborating on data
and insights from the health insurance provider, pharmacy delivering the drugs and the drug
manufacturer. Figure 2.3 depicts the Value Chain for Inter-Firewall and Trans-Firewall
Analytics. In fact, this is not necessarily limited to companies within the traditional demand-supply chain. Disruptive value and efficiencies can be extracted by cooperating and exploring outside the boundaries of the firewall.

Figure 2.3: Value Chain for Inter-Firewall and Trans-Firewall Analytics


Some of the more progressive companies will take this a step further and work on leveraging
the large volumes of data outside the firewall such as social data, location data, etc. In other
words, it will not be long before internal data and insights from within the enterprise firewall


are no longer a differentiator. We call this trend the move from intra- to inter- and trans-firewall analytics.

There are instances where a retailer and a social media company can come together to share insights on consumer behaviour that will benefit both.

Figure 2.4: Inter-Firewall and Trans-Firewall Analytics


In all the previous years, companies were doing functional, silo-based analytics. Silo analytics is a type of analysis that provides rich information through statistical reporting. Today, companies are doing intra-firewall analytics with data within the firewall. In the future they may collaborate on insights with other companies to do inter-firewall analytics, as well as leverage public domain spaces to do trans-firewall analytics (see Figure 2.4).

2.9.1 Challenges of Inter-Firewall and Trans-Firewall Analysis

• First, as one moves outside the firewall, the information-to-noise ratio decreases, putting additional demands on analytical methods and technology.


• Further, organizations are often limited by a fear of collaboration and an overreliance on proprietary information.

• The fear of collaboration is mostly driven by competitive fears, data privacy concerns, and proprietary orientations that limit opportunities for cross-organizational learning and innovation.

********

