BDA Module-2 Notes PDF
Syllabus
Hadoop’s Parallel World – Data discovery – Open source technology for Big Data Analytics – Cloud and
Big Data – Predictive Analytics – Mobile Business Intelligence and Big Data – Crowd Sourcing Analytics
– Inter and Trans-Firewall Analytics.
Many Big Data technologies have been making an impact on the new technology
stacks for handling Big Data, but Apache Hadoop is the one technology that has received the
most attention in Big Data discussions.
Module – 2: Big Data and Analytics
• Each server evaluates the question against its local fragment simultaneously and reports
its results back for collation into a comprehensive answer.
• MapReduce is the agent that distributes the work and collects the results.
• Both HDFS and MapReduce are designed to continue to work in the face of system
failures.
• HDFS continually monitors the data stored on the cluster. If a server becomes
unavailable, a disk drive fails, or data is damaged, whether due to hardware or software
problems, HDFS automatically restores the data from one of the known good replicas
stored elsewhere on the cluster.
• Likewise, when an analysis job is running, MapReduce monitors progress of each of
the servers participating in the job.
• If one of them is slow in returning an answer or fails before completing its work,
MapReduce automatically starts another instance of that task on another server that has
a copy of the data. Because of the way that HDFS and MapReduce work, Hadoop
provides scalable, reliable, and fault-tolerant services for data storage and analysis at
very low cost.
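The behaviour described in the bullets above can be illustrated with a minimal sketch. It is not Hadoop code: the fragment contents, server names, and the functions `map_task` and `run_job` are hypothetical, and the tasks run one after another here, whereas a real cluster evaluates them in parallel. The sketch only shows the idea that each fragment is processed where a healthy replica lives and that the partial results are collated into one answer.

```python
from collections import Counter

# Hypothetical cluster: each fragment of the data set is replicated on two servers.
FRAGMENTS = {
    "frag-0": "big data needs big storage",
    "frag-1": "hadoop stores data in hdfs",
    "frag-2": "mapreduce processes data in parallel",
}
REPLICAS = {
    "frag-0": ["server-a", "server-b"],
    "frag-1": ["server-b", "server-c"],
    "frag-2": ["server-c", "server-a"],
}


def map_task(text):
    """Evaluate the question (here: count words) against one local fragment."""
    return Counter(text.split())


def run_job(failed_servers=frozenset()):
    """Run one task per fragment on a healthy replica and collate the results."""
    total = Counter()
    placement = {}
    for frag_id, text in FRAGMENTS.items():
        # Choose the first server that holds a copy and is still available.
        healthy = [s for s in REPLICAS[frag_id] if s not in failed_servers]
        if not healthy:
            raise RuntimeError(f"no live replica for {frag_id}")
        placement[frag_id] = healthy[0]
        total += map_task(text)  # collation of partial results
    return total, placement


counts, placement = run_job(failed_servers={"server-a"})
print(counts["data"], placement)
```

With `server-a` marked as failed, the tasks for `frag-0` and `frag-2` are placed on the other server that holds a copy, and the collated word counts are the same as in a run without failures.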
• There is a lot of buzz in the industry about data discovery, the term used to describe
the new wave of business intelligence that enables users to explore data, make
discoveries, and uncover insights in a dynamic and intuitive way versus predefined
queries and preconfigured drill-down dashboards.
• In fact, there are two software companies that stand out in the crowd by growing
their businesses at unprecedented rates in this space: Tableau Software and
QlikTech International.
• Both companies’ approach to the market differs markedly from that of the traditional BI
software vendor. They grew through a sales model that many refer to as “land and
expand.”
• A product that is “easy to use” also makes a “self-service” approach
possible.
• Analytics and reporting are produced by the people using the results. IT provides
the infrastructure, but business people create their own reports and dashboards.
• Interactivity is key: a click on any filter lets the executive look into specific markets
or products.
• According to QlikTech’s CTO, business intelligence needs to work the way people’s
minds work. Users need to navigate and interact with data any way they want to,
asking and answering questions on their own and in big groups or teams.
• QlikTech has designed a way for users to leverage direct and indirect search.
With QlikView search, users type relevant words or phrases in any order and get
instant, associative results. With a global search bar, users can search across the
entire data set.
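The idea of typing words "in any order" and searching across the entire data set can be sketched as follows. This is only an illustration and not QlikView's actual search algorithm: the records, field names, and the `search` function are hypothetical.

```python
# Hypothetical records; field names and values are invented for illustration.
RECORDS = [
    {"region": "Europe", "product": "Laptop", "year": "2012"},
    {"region": "Asia", "product": "Laptop", "year": "2013"},
    {"region": "Europe", "product": "Tablet", "year": "2013"},
]


def search(query, records=RECORDS):
    """Return records that contain every query term in some field, in any order."""
    terms = query.lower().split()
    hits = []
    for record in records:
        values = [str(v).lower() for v in record.values()]
        # A record matches when each term appears in at least one of its fields.
        if all(any(term in value for value in values) for term in terms):
            hits.append(record)
    return hits


print(search("laptop europe"))
print(search("europe laptop"))
```

Both queries return the same single record, because the match does not depend on the order of the terms or on which field contains them.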
• Open-source software is computer software that is available in source code form under
an open-source license that permits users to study, change, and improve and at times
also to distribute the software.
• The most prominent and popular example is the GNU General Public License (GPL),
which “allows free distribution under the condition that further developments and
applications are put under the same license.”
• One of the key attributes of the open-source analytics stack is that it’s not constrained
by someone else’s predetermined ideas or vision.
• The open-source stack reflects the new realities of the networked global economy,
which is increasingly dependent on big data.
• For example, Teradata Aster designed SQL-H, which is a seamless way to execute SQL
and SQL-MapReduce on Apache Hadoop data.
Some of the open source technologies for Big Data Analytics are:
Ambari: A
HBase: A Distributed database that supports structured data storage for large tables.
Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.
Pig: A high-level data-flow language and execution framework for parallel computation.
Spark: A fast and general compute engine for Hadoop data. Spark provides a simple and
expressive programming model that supports a wide range of applications, including ETL,
machine learning, stream processing, and graph computation.
• Cloud models are a basic necessity for every industry, and it is just a matter
of when an industry will shift to the cloud model.
• Many startups do not have unlimited capital to invest in infrastructure. In such
cases, a cloud can provide Infrastructure as a Service (IaaS), Platform as a Service (PaaS),
and Software as a Service (SaaS).
• Data, both structured and unstructured, is exploding, so storage devices are
very costly and sometimes cause problems through hardware failure. The cloud therefore provides
Storage Space as a Service (SPaaS), for example Google Drive, OneDrive, FileHipo, and
4Shared.
• The price of cloud application products and the risk of managing them have come down thanks to
free storage space and high security.
• Stop saying “cloud”: the true value lies in delivering software, data, and/or
analytics. So instead of calling it the cloud computing model, it is more appropriate
to say the “as a service” model.
• Fix the core business-technical gap: issues ranging from the ability to run analytics at scale in
a virtual environment to ensuring the authenticity of information processing and
analytics underlie many core business problems and
have to be fixed. Fixing them reduces the gap between business and technology.
• Using all the data available—traditional internal data sources combined with new rich
external data sources—will make the predictions more accurate and meaningful.
• Because the analytics are contextual, enterprises can build confidence in the analytics
and the trust will result in using analytic insights to trigger business events.
• Some leading trends that are making their way to the forefront of businesses today:
Business intelligence (BI) refers to IT-based techniques used in spotting, digging out, and
analyzing business data, such as sales revenue by product as well as associated costs and
incomes.
Mobile BI is a system comprising both technical and organizational elements that present
historical and/or real-time information to its users for analysis on mobile devices such as
smartphones and tablets, to enable effective decision-making and management support, for the
overall purpose of increasing firm performance.
Although the concept of mobile computing has been prevalent for over a decade, Mobile BI
has gained momentum only very recently. This change has been partly encouraged
by the shift from the ‘wired world’ to a wireless world with the advantage of smartphones,
which has led to a new era of mobile computing, especially in the field of BI.
According to statistics, most people spend only 16% of their smartphone time making
phone calls. The other 84% is spent on other activities such as email
communication, web browsing, using social apps, gaming, TV, watching videos and much
more. As a result, mobile data analytics has changed what we see and
hear: all of that big data is compiled and used to target consumers with precision. It is
also used to help them in many positive ways.
Through mobile applications, people connect with each other and use their smartphones for
just about everything. With each click and second spent on their mobile devices, the amount of
information gathered grows each day. That compilation of big data in the
mobile industry is being used not only to make recommendations but also to save lives in
medical emergencies and other risky scenarios.
Another inflection point for the industry is the development and deployment of mobile
applications. In the past, that was controlled by the relationship with the carrier.
The three elements that have impacted the viability of mobile BI:
Location - the GPS component and location data indicate where the customer spends the most
time as well as where the customer travels.
It is not just about pushing data; we can also transact with smartphones based on the
information we get. Example: Google Maps live traffic updates.
Multimedia functionality allows the visualization pieces to really come into play.
The three challenges with mobile BI include:
Managing standards for rolling out these devices
Managing security
Managing “bring your own device,” where you have devices owned by the
company and devices owned by the individual, both contributing to productivity.
Crowdsourcing & Quality: The principle of crowdsourcing is that more heads are better
than one. By canvassing a large crowd of people for ideas, skills, or participation, the
quality of content and idea generation will be superior.
• Crowdfunding
• Microtasks
• Open Innovation
If we’re looking for a logo design, we can tell a crowd of designers what we want, how much
we will pay, and the deadline. All interested designers will create a finished design specifically
for us. We’ll receive 50-300+ different finished logo designs, and we can keep whichever
design we like the best. By doing design this way, crowdsourcing actually increases the quality
& decreases the price, compared to online freelancing.
Crowdsourcing can also be used to get designs for furniture, fashion, advertisements, video, &
product design. Just about anything that can be designed can be crowdsourced.
2.8.2 Crowdfunding
Crowdfunding involves asking a crowd of people to donate money to your project. For
example, if you want to raise $10,000 to pay for studio time to record a new CD, crowdfunding
can help you raise that money. You find a crowdfunding platform, set the goal amount,
deadline, and any rewards offered to donors. You must raise 100% of your goal before the
deadline, or all the donations are returned to the donors. Deadlines are typically less than 60
days.
Crowdfunding is mostly used by artists, charities, & start-ups to raise money for projects such
as filming a documentary, manufacturing an iPod watch, cancer research, or seed money.
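The all-or-nothing rule described above can be sketched in a few lines. The function `settle_campaign`, the pledge amounts, and the dates are hypothetical, and real crowdfunding platforms may apply different rules or fees.

```python
from datetime import date


def settle_campaign(goal, pledges, deadline, today):
    """Apply the all-or-nothing rule once the deadline has passed.

    Returns (status, amount_paid_out, amount_refunded).
    """
    raised = sum(pledges)
    if today <= deadline:
        return ("open", 0, 0)  # campaign is still collecting pledges
    if raised >= goal:
        return ("funded", raised, 0)  # 100% of the goal was reached
    return ("refunded", 0, raised)  # goal missed: donations go back to donors


deadline = date(2024, 3, 1)
after_deadline = date(2024, 3, 2)
print(settle_campaign(10_000, [4_000, 3_500, 3_000], deadline, after_deadline))
print(settle_campaign(10_000, [4_000, 3_500], deadline, after_deadline))
```

The first campaign reaches the $10,000 goal and is paid out; the second raises only $7,500, so every donation is returned.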
2.8.3 Microtasks
Microtasking involves breaking work up into tiny tasks and sending the work to a crowd of
people. If you have 1,000 photos on your website that need captions, you can ask 1,000
individual people to each add a caption to one photo. Break up the work and decide the payment
for each completed task (typically .01¢ – .10¢ per task). With microtasking, you can expect to
see results within minutes. Microtasking can involve tasks such as scanning images,
proofreading, database correction and transcribing audio files.
Work is done faster, cheaper, and usually with fewer errors (when validation systems are in
place). Additionally, microtasks can often be performed by people in less fortunate countries,
including those with SMS capabilities but without computers.
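As a rough sketch of the photo-caption example, the work can be split into one task per photo and the total payout estimated. The function `make_microtasks`, the file names, and the payment of 5 cents per task are assumptions chosen only for illustration; amounts are kept in whole cents to avoid rounding issues.

```python
def make_microtasks(items, payment_cents):
    """Break a job into one small task per item, each with a fixed payment."""
    return [
        {"task_id": i, "item": item, "payment_cents": payment_cents}
        for i, item in enumerate(items)
    ]


# 1,000 photos that each need one caption (hypothetical file names).
photos = [f"photo_{n:04d}.jpg" for n in range(1000)]
tasks = make_microtasks(photos, payment_cents=5)

total_cents = sum(task["payment_cents"] for task in tasks)
print(len(tasks), "tasks, total payout:", total_cents / 100, "dollars")
```

Each task is independent of the others, which is what allows many workers to complete them at the same time.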
If you are unsure of where to begin with an idea for a business opportunity, whether it’s product
design or perhaps a marketing firm, crowdsourcing can help through open innovation. Open
innovation allows people from all aspects of business, such as investors, designers, inventors,
and marketers, to collaborate in turning an idea into a functional, profit-making reality. This can be done either
through a dedicated web platform to gain outside perspective, or used with only internal
employees.
Open innovation brings together people from different parts of the world and different sectors
of business to work together on a project. This is effectively a collection of different fields and
levels of expertise that would not otherwise be available to any budding entrepreneur. It also
encourages parties previously considered uninvolved, such as investors, to roll up their sleeves
and impart their knowledge, becoming more than just a source of cash.
Crowdsourcing usage in data analysis: The data analysis process is divided into two
parts:
Crowdsourcing for datafication includes microtasks for data annotation and cleaning, and is mainly
suited to simple human-intelligence tasks such as image labeling and speech recognition. The
advantage is a small payment for a small piece of work, so large-scale datasets can be processed
at reasonable cost.
Crowdsourcing for analysis is based on competitions for data modeling and is best
suited to creative or complex tasks such as design, software development, and analysis, with
winner-takes-all payment. The major advantage is that a few high-quality results can be obtained.
Peer review for assessing results and interpretation can be the final product of the analytical
process.
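For annotation microtasks such as image labeling, one common way to validate crowd answers is to ask several workers to label the same item and keep the majority answer. The notes do not prescribe this method; the sketch below, with its hypothetical `aggregate_labels` function and invented labels, is only one possible validation step.

```python
from collections import Counter


def aggregate_labels(annotations):
    """Pick the most frequent label per item and report the agreement rate.

    annotations maps an item id to the list of labels given by different workers.
    On a tie, the label that was submitted first wins.
    """
    result = {}
    for item, labels in annotations.items():
        label, votes = Counter(labels).most_common(1)[0]
        result[item] = (label, votes / len(labels))
    return result


# Hypothetical answers from three workers per image.
crowd_answers = {
    "img_1": ["cat", "cat", "dog"],
    "img_2": ["dog", "dog", "dog"],
}
consensus = aggregate_labels(crowd_answers)
print(consensus)
```

Items with a low agreement rate can be sent out again or reviewed by hand, which keeps the cost per item small while limiting labeling errors.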
Over the last 100 years, the supply chain has evolved to connect multiple companies and enable
them to collaborate to create enormous value for the end consumer through concepts like CPFR
(collaborative planning, forecasting and replenishment), a collection of business practices that
leverage the Internet and electronic data interchange to reduce inventories and expenses while
improving customer service.
In the healthcare industry, rich consumer insights can be generated by collaborating on data
and insights from the health insurance provider, pharmacy delivering the drugs and the drug
manufacturer. Figure 2.3 depicts the Value Chain for Inter-Firewall and Trans-Firewall
Analytics. In fact, this is not necessarily limited to companies within the traditional
demand-supply chain. Disruptive value and efficiencies can be extracted by cooperating and exploring
outside the boundaries of the firewall.
There are instances where a retailer and a social media company can come together to share
insights on consumer behaviour that will benefit both concerns. Some of the more progressive
companies will take this a step further and work on leveraging the large volumes of data outside
the firewall such as social data, location data, etc.
It will not be long before internal data and insights from within the enterprise firewall are no
longer a differentiator. We call this trend the move from intra- to inter- and trans-firewall
analytics.
In previous years, companies did functional, silo-based analytics. Silo analytics
is a type of analysis that provides rich information through statistical reporting. Today,
companies are doing intra-firewall analytics with data within the firewall. In the future,
they may collaborate on insights with other companies to do inter-firewall analytics, as
well as leverage public domain spaces to do trans-firewall analytics (see Figure 2.4).
• First, as one moves outside the firewall, the noise-to-information ratio increases,
putting additional requirements on analytical methods and
technology.