Big Data Analytics Report
1 ABSTRACT
Big data is a broad term for data sets so large or complex that
traditional data processing applications are inadequate.
Challenges include analysis, capture, data curation, search,
sharing, storage, transfer, visualization, and information privacy.
The term often refers simply to the use of predictive analytics or
certain other advanced methods to extract value from data, and
seldom to a particular size of data set. Accuracy in big data may
lead to more confident decision making. And better decisions can
mean greater operational efficiency, cost reductions and reduced
risk.
Analysis of data sets can find new correlations, to "spot business
trends, prevent diseases, combat crime and so on." Scientists,
practitioners of media and advertising and governments alike
regularly meet difficulties with large data sets in areas including
Internet search, finance and business informatics. Scientists
encounter limitations in e-Science work, including meteorology,
genomics, connectomics, complex physics simulations, and
biological and environmental research.
Data sets grow in size in part because they are increasingly being
gathered by cheap and numerous information-sensing mobile
devices, aerial (remote sensing), software logs, cameras,
microphones, radio-frequency identification (RFID) readers, and
wireless sensor networks. The world's technological per-capita
capacity to store information has roughly doubled every 40
months since the 1980s; as of 2012, every day 2.5 exabytes
(2.5×10^18 bytes) of data were created. The challenge for large
enterprises is determining who should own big data initiatives
that straddle the entire organization.
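
To make the "doubling every 40 months" figure quoted above more concrete, the short calculation below converts it into an annual and a ten-year growth factor. It is only a sketch of the arithmetic: the function name growth_factor is chosen here for illustration, and the only input taken from the text is the 40-month doubling period.

```python
DOUBLING_PERIOD_MONTHS = 40  # doubling period quoted in the text


def growth_factor(months, doubling_period=DOUBLING_PERIOD_MONTHS):
    """Growth multiplier over `months` for a quantity that doubles every `doubling_period` months."""
    return 2.0 ** (months / doubling_period)


annual = growth_factor(12)    # growth over one year
decade = growth_factor(120)   # growth over ten years (three doublings)

print(f"annual growth factor: {annual:.3f}")   # about 1.231, i.e. roughly 23% per year
print(f"growth over 10 years: {decade:.1f}x")  # 8.0x
```

In other words, a 40-month doubling period corresponds to storage capacity growing by roughly 23 percent per year.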
Work with big data is necessarily uncommon; most analysis is of
"PC size" data, on a desktop PC or notebook that can handle the
available data set. Relational database management systems and
desktop statistics and visualization packages often have difficulty
handling big data. The work instead requires "massively parallel
software running on tens, hundreds, or even thousands of
servers". What is considered "big data" varies depending on the
capabilities of the users and their tools, and expanding
Big Data and Business Analytics
1. INTRODUCTION
NOTE: Big data is data with any attribute, size being one of them, that
challenges the constraints of a system's capabilities or a business need.
But why should this be the case now? Haven't data always been
part of the impact of information and communication technology?
Yes, but research suggests that the scale and scope of changes
that big data are bringing about are at an inflection point, set to
expand greatly, as a series of technology trends accelerate and
converge. We are already seeing visible changes in the economic
landscape as a result of this convergence.
2.1 CHALLENGES
Challenges include the need to ensure that the right infrastructure
is in place and that incentives and competition are in place to
encourage continued innovation; that the economic benefits to
users, organizations, and the economy are properly understood;
and that safeguards are in place to address public concerns about
big data.
To understand these challenges one has to understand the state
of digital data, how different domains can use large datasets to
create value, the potential value across stakeholders, and the
implications for the leaders of private sector companies and
public sector organizations, as well as for policy makers.
2.3 LIMITS TO CONSUME AND UNDERSTAND BIG DATA
The generation of big data may be growing exponentially and
advancing technology may allow the global economy to store and
process ever greater quantities of data, but there may be limits to
our innate human ability, that is, our sensory and cognitive faculties, to
process this data torrent. It is said that the mind can handle about
seven pieces of information in its short-term memory.1 Roger
Bohn and James Short at the University of California at San Diego
discovered that the rate of growth in data consumed by
consumers, through various types of media, was a relatively
modest 2.8 percent in bytes per hour between 1980 and 2008.
We should note that one of the reasons for this slow growth was
the relatively fixed number of bytes delivered through television
before the widespread adoption of high-definition digital video.2
The topic of information overload has been widely studied by
academics from neuroscientists to economists. Economist Herbert
Simon once said, "A wealth of information creates a poverty of
attention and a need to allocate that attention efficiently among
the overabundance of information sources that might consume
it."3 Despite these apparent limits, there are ways to help
organizations and individuals to process, visualize, and synthesize
meaning from big data. For instance, more sophisticated
visualization techniques and algorithms, including automated
algorithms, can enable people to see patterns in large amounts of
data and help them to unearth the most pertinent insights (see
chapter 2 for examples of visualization). Advancing collaboration
technology also allows a large number of individuals, each of
whom may possess understanding of a special area of
information, to come together in order to create a whole picture
to tackle interdisciplinary problems.
3. SEVEN KEY INSIGHTS
1. DATA HAVE SWEPT INTO EVERY INDUSTRY AND
BUSINESS FUNCTION AND ARE NOW AN IMPORTANT
FACTOR OF PRODUCTION
MGI estimates that enterprises globally stored more than 7
exabytes of new data on disk drives in 2010, while consumers
stored more than 6 exabytes of new data on devices such as PCs
4. THE IMPORTANCE OF BIG DATA
were over 120 open source key-value databases for acquiring and
storing big data, with Hadoop emerging as the primary system for
organizing big data and relational databases expanding their
reach into less structured data sets to analyze big data. These
new systems have created a divided solutions spectrum
comprised of:
Not Only SQL (NoSQL) solutions: developer-centric specialized
systems.
SQL solutions: the world typically equated with the
manageability, security and trusted nature of relational
database management systems (RDBMS)
NoSQL systems are designed to capture all data without
categorizing and parsing it upon entry into the system, and
therefore the data is highly varied. SQL systems, on the other
hand, typically place data in well-defined structures and impose
metadata on the data captured to ensure consistency and
validate data types.
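
The contrast described above can be sketched in a few lines. This is an illustration only, under stated assumptions: SQLite (from the Python standard library) stands in for a relational system, a plain dictionary of JSON strings stands in for a NoSQL document store, and the table name orders and the sample records are hypothetical.

```python
import json
import sqlite3

# Hypothetical sample records; the second one does not fit a fixed schema.
records = [
    {"id": 1, "amount": 19.99},
    {"id": 2, "amount": "n/a", "note": "free-form field"},
]

# SQL style: structure and data types are enforced when the data is written.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    " id INTEGER PRIMARY KEY,"
    " amount REAL NOT NULL CHECK (typeof(amount) = 'real'))"
)
sql_rejected = []
for rec in records:
    try:
        conn.execute(
            "INSERT INTO orders (id, amount) VALUES (?, ?)",
            (rec["id"], rec["amount"]),
        )
    except sqlite3.IntegrityError:
        # The CHECK constraint rejects values that are not numeric.
        sql_rejected.append(rec["id"])

# NoSQL style: every record is captured as-is; interpretation is deferred to read time.
document_store = {rec["id"]: json.dumps(rec) for rec in records}

sql_rows = conn.execute("SELECT id, amount FROM orders ORDER BY id").fetchall()
print("rows accepted by the SQL table:", sql_rows)
print("ids rejected by the SQL table:", sql_rejected)
print("documents kept by the document store:", len(document_store))
```

The relational table keeps only the record that matches its declared structure, while the document store keeps both records and leaves it to the reader of the data to deal with the variety.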
5. HADOOP
Hadoop is a rapidly evolving ecosystem of components for
implementing the Google MapReduce algorithms in a scalable
fashion on commodity hardware. Hadoop enables users to store
and process large volumes of data and analyze it in ways not
previously possible with less scalable solutions or standard SQL-
based approaches. As an evolving technology solution, Hadoop
design considerations are new to most users and not common
knowledge. As part of the Dell | Hadoop solution, Dell has
developed a series of best practices and architectural
considerations to use when designing and implementing Hadoop
solutions. Hadoop is a highly scalable compute and storage
platform.
While most users will not initially deploy servers numbered in the
hundreds or thousands, Dell recommends following the design
principles that drive large, hyper-scale deployments. This ensures
that as you start with a small Hadoop environment, you can easily
scale that environment without rework to existing servers,
software, deployment strategies, and network connectivity. The
Apache Hadoop project develops open-source software for
reliable, scalable, distributed computing. The Apache Hadoop software library is a
framework that allows for the distributed processing of large data
sets across clusters of computers using simple programming
models.
It is designed to scale up from single servers to thousands of
machines, each offering local computation and storage. Rather
than rely on hardware to deliver high-availability, the library itself
is designed to detect and handle failures at the application layer,
so delivering a highly available service on top of a cluster of computers, each of
which may be prone to failures.
The project includes these modules:
Hadoop Distributed File System (HDFS): A distributed file system
that provides high-throughput access to application data.
Hadoop MapReduce: A YARN-based system for parallel
processing of large datasets.
5.1 A BRIEF HISTORY OF HADOOP
Hadoop was created by Doug Cutting, the creator of Apache
Lucene, the widely used text search library. Hadoop has its origins
in Apache Nutch, an open source web search engine, itself a part
of the Lucene project.
STEPS:
The MapReduce steps are as follows:
Input step: Loads the data into HDFS by splitting the data
into blocks and distributing them to the data nodes of the cluster. The
blocks are replicated for availability in case of failures. The
NameNode keeps track of the blocks and the data nodes.
Job step: Submits the MapReduce job and its details to the
JobTracker.
Job init step: The JobTracker interacts with the TaskTracker on
each data node to schedule MapReduce tasks.
Map step: Mappers process the data blocks and generate a
list of key-value pairs.
Sort step: Mappers sort the list of key-value pairs.
Shuffle step: Transfers the mapped output to the reducers in
a sorted fashion.
Reduce step: Reducers merge the list of key-value pairs to
generate the final result.
Finally, the results are stored in HDFS and replicated as per the
configuration. Clients then read the results from HDFS.
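
The data flow of the steps above can be sketched as a small word count that runs in a single process. This is a simulation for illustration, not Hadoop code: it does not use the Hadoop API, the function names (split_into_blocks, mapper, shuffle, reducer, run_word_count) are hypothetical, and block replication, the NameNode, and JobTracker/TaskTracker scheduling are left out.

```python
from itertools import groupby
from operator import itemgetter


def split_into_blocks(lines, block_size):
    """Input step: split the data set into fixed-size blocks."""
    return [lines[i:i + block_size] for i in range(0, len(lines), block_size)]


def mapper(block):
    """Map step: emit a (word, 1) pair for every word in the block."""
    return [(word.lower(), 1) for line in block for word in line.split()]


def partition_for(key, num_reducers):
    """Pick the reducer for a key with a simple, deterministic hash."""
    return sum(key.encode("utf-8")) % num_reducers


def shuffle(mapped_outputs, num_reducers):
    """Shuffle step: route every pair to its reducer, sorted by key."""
    partitions = [[] for _ in range(num_reducers)]
    for output in mapped_outputs:
        for key, value in output:
            partitions[partition_for(key, num_reducers)].append((key, value))
    return [sorted(part, key=itemgetter(0)) for part in partitions]


def reducer(sorted_pairs):
    """Reduce step: merge all values that share a key into one total."""
    return {
        key: sum(value for _, value in group)
        for key, group in groupby(sorted_pairs, key=itemgetter(0))
    }


def run_word_count(lines, block_size=2, num_reducers=2):
    blocks = split_into_blocks(lines, block_size)
    mapped = [sorted(mapper(block)) for block in blocks]  # map step, then sort step
    result = {}
    for partition in shuffle(mapped, num_reducers):
        result.update(reducer(partition))
    return result


lines = ["big data is big", "data about data", "hadoop stores big data"]
counts = run_word_count(lines)
print(counts["big"], counts["data"], counts["hadoop"])  # prints: 3 4 1
```

Because each key is always routed to the same reducer, the final counts do not depend on how many blocks or reducers are used, which is what allows the real system to spread the work across many machines.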
How should we utilize such big data? Business intelligence (BI) has developed
along with visualization in the business environment; however, to utilize big data,
visualization is not enough. Incorporating business analytics (BA), which includes
prediction and optimization, is the key to success.
There are three major types of business analytics.
Type 1 is to find the relationship and regularity between data sets. For example,
good customers can be determined based on an analysis of the causal relationship
between their attributes and purchasing history.
Type 2 is to find an optimal solution under a specified set of constraints. This type
is valid for problems where limited resources are used effectively, for instance,
when optimizing order quantity or scheduling shift workers.
Type 3 is to anticipate future trends by understanding customer behaviors.
Attentive services and functions that are ahead of the curve can be offered by this
type.
Examples include financial services detecting fraud or anomalies, and offering
recommendations. In order to realize BA, many IT service providers already offer
solutions for large-scale distributed processing (Hadoop), and streaming data
processing (CEP: Complex Event Processing). Some providers have even
established specialized teams for analytics in-company, and continue to advance in
these activities.
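
As an illustration of Type 2 analytics (finding an optimal solution under constraints), the sketch below chooses an order quantity that minimizes expected cost while respecting a storage limit. Everything in it is an assumption made for the example: the demand scenarios, the per-unit shortage and excess costs, the capacity limit, and the function names expected_cost and best_order_quantity. A production system would typically use a dedicated optimization solver rather than this exhaustive search.

```python
def expected_cost(order_qty, demand_scenarios, shortage_cost, excess_cost):
    """Expected cost of an order quantity over (demand, probability) scenarios."""
    total = 0.0
    for demand, probability in demand_scenarios:
        shortage = max(demand - order_qty, 0)  # lost sales
        excess = max(order_qty - demand, 0)    # unsold inventory
        total += probability * (shortage * shortage_cost + excess * excess_cost)
    return total


def best_order_quantity(demand_scenarios, shortage_cost, excess_cost, max_capacity):
    """Search every feasible quantity (0..max_capacity) for the lowest expected cost."""
    return min(
        range(max_capacity + 1),
        key=lambda qty: expected_cost(qty, demand_scenarios, shortage_cost, excess_cost),
    )


# Hypothetical inputs: three demand scenarios with their probabilities.
scenarios = [(80, 0.25), (100, 0.5), (120, 0.25)]

best = best_order_quantity(scenarios, shortage_cost=4.0, excess_cost=1.0, max_capacity=150)
limited = best_order_quantity(scenarios, shortage_cost=4.0, excess_cost=1.0, max_capacity=110)

print(best, expected_cost(best, scenarios, 4.0, 1.0))  # prints: 120 20.0
print(limited)                                         # prints: 110
```

With these numbers a shortage costs four times as much as an unsold unit, so the best choice is to order for the highest demand scenario; when storage is capped at 110 units, the constraint becomes binding and the best feasible order is the cap itself.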
Use and analysis of large amounts of numerical data from sensors continue to
progress. Going forward, the growing diversity of big data, including unstructured
data, will lead toward data fusion, where data from multiple sources is combined.
For example, in the field of transportation, there is currently an effort to
integrate traffic information as text expressions, and weather information as
graphical expressions to analyze traffic congestion. The important technical point
is how to supplement and overlay data which differ in spatial granularity and
acquired timing. From an analytical aspect, analytical technology for diverse data
continues to progress, leading toward even more accurate future predictions and
control of the real world. For example, a retailer anticipates demand based on
sales and automatically calculates the appropriate number of products to order,
which then informs the actual order amount. For retailers, the need to
optimize the order amount is great because the opportunity loss resulting from
inventory shortages, and the cost of excess inventory, are risks too large to be
ignored. Five to ten years down the line, with an integration of these technologies,
comprehensive decision making, which is currently only possible for humans, will
partially be done by machines. A predecessor to this is the Open Source Indicators
project at the Intelligence Advanced Research Projects Activity (IARPA), part of the
U.S. Office of the Director of National Intelligence. This initiative uses Twitter posts, search engine queries, and street
corner surveillance webcams, integrating all kinds of data for automatic analysis to
assist with identifying revolutionary changes and other important social incidents.
Although it is currently in the research stage, it may help prevent terrorism and
large-scale crimes in the near future.
8.4 BIG DATA ANALYTICS APPROACH
We should perceive the extensive use of big data as an extension of business
intelligence. Traditional business intelligence was based on aggregate analysis, and
stopped at the point of visualization. However, visualization alone has limits to
how high a degree of knowledge can be derived from data. We can perceive a
wider application of business intelligence including visualization and business
analytics, and categorize Business Intelligence into four categories based on our
data analysis consulting experience in various fields of business. Aggregate
Analysis Business Intelligence immediately aggregates and analyzes all data.
Discovery Business Intelligence analyzes variations to match data granularity and
discovers rules. WHAT-IF Business Intelligence uses simulations and searches
for optimal solutions to optimize business operations. Finally, Proactive Business
Intelligence analyzes data in real time to offer future services that are ahead of the
curve. Of these categories, Discovery BI, WHAT-IF BI, and
Proactive BI are forms of business analytics, and they correspond to the
previously described Type 1, Type 2, and Type 3, respectively. The data analysis
methodology BICLAVIS (developed by NTT DATA) has been built around
the axis of the previously mentioned four classes. At the core of this lies analysis
scenarios classified into analysis patterns by objective. Based on these scenarios,
BICLAVIS uses a template for efficient analysis. At the Data Warehouses, initial
process assistance is offered in the form of support in proof-of-concept design and
selecting tools for core products, and development of demo systems tailored to
customer requirements.
BPO (Business Process Outsourcing) involves outsourcing all but core aspects of a
business, radically revising business processes and resources. For this initiative, we
execute shift scheduling for offices that process multiple types of duties, along
with estimating work volume of each task, considering the necessary time limit for
completing a task, as well as the personnel skill needed, and other real constraints
in the workplace. With this system, we can automatically generate an optimum
schedule, maximizing BPO effectiveness.
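
The kind of constraints mentioned above (work volume per task, time limits, and required skills) can be illustrated with a small scheduling sketch. It is not the system described in the text: it uses a simple greedy rule (earliest deadline first, assigned to the qualified worker who becomes free soonest), which is a heuristic and is not guaranteed to produce an optimum schedule. The task data, worker names, and the function schedule_tasks are all hypothetical.

```python
def schedule_tasks(tasks, workers):
    """Greedy earliest-deadline-first assignment; a heuristic, not guaranteed optimal.

    tasks:   list of dicts with "name", "hours", "deadline" (in hours), and "skill"
    workers: dict mapping a worker name to the set of skills that worker has
    Returns a list of (task, worker, start, finish, on_time) tuples.
    """
    free_at = {name: 0 for name in workers}  # hour at which each worker is next free
    schedule = []
    for task in sorted(tasks, key=lambda t: t["deadline"]):
        qualified = [w for w, skills in workers.items() if task["skill"] in skills]
        if not qualified:
            raise ValueError(f"no worker has the skill for task {task['name']}")
        worker = min(qualified, key=lambda w: free_at[w])
        start = free_at[worker]
        finish = start + task["hours"]
        free_at[worker] = finish
        schedule.append((task["name"], worker, start, finish, finish <= task["deadline"]))
    return schedule


# Hypothetical workers and tasks.
workers = {"Aiko": {"data_entry", "review"}, "Ben": {"data_entry"}}
tasks = [
    {"name": "entry_a", "hours": 3, "deadline": 4, "skill": "data_entry"},
    {"name": "review_a", "hours": 2, "deadline": 5, "skill": "review"},
    {"name": "entry_b", "hours": 4, "deadline": 8, "skill": "data_entry"},
]

schedule = schedule_tasks(tasks, workers)
for row in schedule:
    print(row)
```

In this small case every task finishes before its deadline. Finding a provably best schedule for larger offices generally calls for an optimization method such as integer programming or constraint solving rather than a greedy rule.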
(4) Medical cost reduction policy for health insurance organizations
[Targeting]
The increase in insured persons who become seriously ill due to lifestyle diseases
has driven up costs for health insurance organizations. From
insured persons' current state of health, we identified those who are at high risk for
serious diseases and gave health counseling at an early stage to prevent such
cost increases. By looking at past insurance claims, data mining can help identify
patterns that lead toward lifestyle diseases or complications, allowing organizations
to make a list of high-risk patients and to provide health counseling. The result can
lead to prevention of health risks in insured persons, and curtailing extra costs for
the health insurance organization overall.
2. Kissmetrics
Looking to increase your marketing ROI? Kissmetrics, a popular customer
intelligence and web analytics platform, could be your best friend. The
4. Google Analytics
You don't need fancy, expensive software to begin gathering data. It can
start from an asset you already have: your website. Google Analytics,
Google's free Web-traffic-monitoring tool, provides all types of data about
website visitors, using a multitude of metrics and traffic sources.
With Google Analytics, you can extract long-term data to reveal trends and
other valuable information, so you can make wise, data-driven decisions.
For instance, by tracking and analyzing visitor behavior, such as where
traffic is coming from, how audiences engage, how long visitors stay on
a website, and how often they leave after a single page (known as the bounce rate), you can make better decisions when
7. Tranzlogic
It's no secret that credit card transactions are chock full of invaluable data.
Although access was once limited to companies with significant resources,
customer intelligence company Tranzlogic makes this information available
to small businesses without the big business budget. Tranzlogic works
with merchants and payment systems to extract and analyze proprietary
data from credit card purchases. This information can then be used to
measure sales performance, evaluate customers and customer segments,
improve promotions and loyalty programs, launch more-effective marketing
campaigns, write better business plans, and perform other tasks that lead
to smarter business decisions. Moreover, Tranzlogic requires no tech
smarts to get started: it is a turnkey program, meaning there is no
installation or programming required. Simply log in to access your merchant
portal.
8. Qualtrics
If you don't currently have any rich sources for data, conducting research
may be the answer. Qualtrics lets businesses conduct a wide range of
studies and surveys to gain quality insights to guide data-driven decision
making. Qualtrics offers three types of real-time insights: customer, market
and employee insights. To gain customer insight, use Qualtrics' survey
software for customer satisfaction, customer experience and website
feedback surveys. To study the market, Qualtrics also offers advertising
testing, concept testing and market research programs. And when it comes
to your team, Qualtrics can help conduct employee surveys, exit interviews
and reviews. Other options include online samples, academic research and
mobile surveys.
9. CONCLUSION
Big data is a disruptive force that will affect organizations across industries, sectors
and economies. Not only will enterprise IT architectures need to change to
accommodate it, but almost every department within a company will undergo
adjustments to allow big data to inform and reveal. Data analysis will change,
becoming part of a business process instead of a distinct function performed only
by trained specialists. Big data productivity will come as a result of giving users
across the organization the power to work with diverse data sets through self-
service tools.
And that's just the beginning. Once companies begin leveraging big data for
insight, the action they take based on that insight has the potential to revamp
business as it is known today. If a marketing department can gain immediate
feedback on a new branding campaign by analyzing blog comments and social
networking conversations, do focus groups and customer surveys become
obsolete? Nimble new companies that understand the value of big data will not
only challenge existing competitors, but may also begin defining the way business
is done in their industries. Customer relationships will undergo transformation as
companies strive to quickly understand concepts that previously couldn't be
captured, such as sentiment and brand perception.
Achieving the vast potential of big data calls for a thoughtful, holistic approach to
data management, analysis and information intelligence. Across industries,
organizations that get ahead of big data will create new operational efficiencies,
new revenue streams, differentiated competitive advantage and entirely new
business models. Business leaders should begin thinking strategically about
how to prepare their organizations for big data, and big opportunities.