Big Data Analytics: A Literature Review Paper: Abstract. in The Information Era, Enormous Amounts of Data Have Become
1 Introduction
Imagine a world without data storage; a place where every detail about a person or
organization, every transaction performed, or every aspect which can be documented
is lost directly after use. Organizations would thus lose the ability to extract valuable
information and knowledge, perform detailed analyses, as well as provide new oppor-
tunities and advantages. Anything ranging from customer names and addresses, to
products available, to purchases made, to employees hired, etc. has become essential
for day-to-day continuity. Data is the building block upon which any organization
thrives.
Now think of the extent of details and the surge of data and information provided
nowadays through the advancements in technologies and the internet. With the in-
crease in storage capabilities and methods of data collection, huge amounts of data
have become easily available. Every second, more and more data is being created and
needs to be stored and analyzed in order to extract value. Furthermore, data has be-
come cheaper to store, so organizations need to get as much value as possible from
the huge amounts of stored data.
The size, variety, and rapid change of such data require a new type of big data ana-
lytics, as well as different storage and analysis methods. Such sheer amounts of big
data need to be properly analyzed, and pertinent information should be extracted.
how enormous it is. Velocity refers to the rate with which data is changing, or how
often it is created. Finally, variety includes the different formats and types of data, as
well as the different kinds of uses and ways of analyzing the data [9].
Data volume is the primary attribute of big data. Big data can be quantified by size
in TBs or PBs, or even by the number of records, transactions, tables, or files.
Additionally, one of the things that make big data really big is that it’s coming from a
greater variety of sources than ever before, including logs, clickstreams, and social
media. Using these sources for analytics means that common structured data is now
joined by unstructured data, such as text and human language, and semi-structured
data, such as eXtensible Markup Language (XML) or Rich Site Summary (RSS)
feeds. There’s also data that is hard to categorize, since it comes from audio, video,
and other devices. Furthermore, multi-dimensional data can be drawn from a data
warehouse to add historic context to big data. Thus, with big data, variety is just as
big as volume.
Moreover, big data can be described by its velocity or speed. This is basically the
frequency of data generation or the frequency of data delivery. The leading edge of
big data is streaming data, which is collected in real-time from websites [17].
Some researchers and organizations have discussed the addition of a fourth V, or
veracity. Veracity focuses on the quality of the data. This characterizes big data quali-
ty as good, bad, or undefined due to data inconsistency, incompleteness, ambiguity,
latency, deception, and approximations [22].
cluster nodes. Additionally, the data is protected among the nodes by a replication
mechanism, which ensures availability and reliability despite any node failures [3].
There are two types of HDFS nodes: the Data Nodes and the Name Nodes. Data is
stored in replicated file blocks across the multiple Data Nodes, and the Name Node
acts as a regulator between the client and the Data Node, directing the client to the
particular Data Node which contains the requested data [3].
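The division of labor between the Name Node and the Data Nodes can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions: the class `ToyNameNode`, the round-robin placement, and the replication factor of 3 are hypothetical choices made for illustration, and do not reproduce the actual HDFS placement policy or API.

```python
from collections import defaultdict


class ToyNameNode:
    """Toy metadata service that tracks which Data Nodes hold each block.

    Hypothetical model for illustration only; not the HDFS API.
    """

    def __init__(self, data_nodes, replication=3):
        if replication > len(data_nodes):
            raise ValueError("replication factor exceeds number of Data Nodes")
        self.data_nodes = list(data_nodes)
        self.replication = replication
        self.block_map = defaultdict(list)  # block id -> Data Nodes holding a replica
        self._next = 0  # round-robin cursor (an assumption, not the real policy)

    def place_block(self, block_id):
        """Assign `replication` distinct Data Nodes to a new block."""
        n = len(self.data_nodes)
        chosen = [self.data_nodes[(self._next + i) % n] for i in range(self.replication)]
        self._next = (self._next + 1) % n
        self.block_map[block_id] = chosen
        return chosen

    def locate(self, block_id, failed=()):
        """Direct a client to the live Data Nodes that hold the requested block."""
        return [node for node in self.block_map[block_id] if node not in failed]


name_node = ToyNameNode(["dn1", "dn2", "dn3", "dn4"], replication=3)
name_node.place_block("blk_0")
name_node.place_block("blk_1")

# Even if dn1 fails, blk_0 is still readable from its remaining replicas.
print(name_node.locate("blk_0", failed={"dn1"}))
```

Because every block lives on several Data Nodes, the loss of a single node leaves at least one replica reachable, which is the availability property described above.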
Figure 1 shows how the MapReduce nodes and the HDFS work together. At step 1,
there is a very large dataset including log files, sensor data, or anything of the sorts.
The HDFS stores replicas of the data, represented by the blue, yellow, beige, and pink
icons, across the Data Nodes. In step 2, the client defines and executes a map job and
a reduce job on a particular data set, and sends them both to the Job Tracker. The Job
Tracker then distributes the jobs across the Task Trackers in step 3. The Task Tracker
runs the mapper, and the mapper produces output that is then stored in the HDFS file
system. Finally, in step 4, the reduce job runs across the mapped data in order to
produce the result.
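The map and reduce steps can be sketched in a single process. The following is a minimal illustration, assuming a hypothetical job that counts log lines per severity level; the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample log lines are invented for this example and are not part of Hadoop. In a real cluster the Job Tracker would distribute the map and reduce tasks across Task Trackers, whereas here they simply run in sequence.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (log_level, 1) pair for one log line."""
    level = line.split(" ", 1)[0]
    yield level, 1


def shuffle(pairs):
    """Group mapper output by key, as the framework does between the two phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: sum the counts collected for one key."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce in sequence on an in-memory data set."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


logs = ["ERROR disk full", "INFO job started", "ERROR timeout", "WARN slow node"]
counts = run_job(logs)
print(counts)
```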
Hadoop is a MAD system, thus making it popular for big data analytics by loading
data as files into the distributed file system, and running parallel MapReduce
computations on the data. Hadoop gets its magnetism and agility from the fact that data is
loaded into Hadoop simply by copying files into the distributed file system, and
MapReduce interprets the data at processing time rather than loading time [11]. Thus, it
is capable of attracting all data sources, as well as adapting its engines to any
evolutions that may occur in such big data sources [6].
After big data is stored, managed, and processed, decision makers need to extract
useful insights by performing big data analyses. In the subsections below, various big
data analyses will be discussed, starting with selected traditional advanced data
analytics methods, and followed by examples of some of the additional, applicable big
data analyses.
research and technologies, since decision makers have become more and more inter-
ested in learning from previous data, thus gaining competitive advantage [21].
Along with some of the most common advanced data analytics methods, such as
association rules, clustering, classification and decision trees, and regression, some
additional analyses have become common with big data.
For example, social media has recently become important for social networking
and content sharing. Yet, the content that is generated from social media websites is
enormous and remains largely unexploited. However, social media analytics can be
used to analyze such data and extract useful information and predictions [2]. Social
media analytics is based on developing and evaluating informatics frameworks and
tools in order to collect, monitor, summarize, analyze, as well as visualize social me-
dia data. Furthermore, social media analytics facilitates understanding the reactions
and conversations between people in online communities, as well as extracting useful
patterns and intelligence from their interactions, in addition to what they share on
social media websites [24].
On the other hand, Social Network Analysis (SNA) focuses on the relationships
among social entities, as well as the patterns and implications of such relationships
[23]. An SNA maps and measures both formal and informal relationships in order to
comprehend what facilitates the flow of knowledge between interacting parties, such
as who knows whom, and who shares what knowledge or information with whom and
by what means [19].
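One simple way to measure such relationships is degree centrality, which scores each person by the share of possible ties they actually hold. The sketch below is an illustrative assumption rather than a method taken from the cited works; the function `degree_centrality` and the sample names are hypothetical.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Return each person's share of possible ties in an undirected network."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    n = len(neighbours)
    if n < 2:
        return {}
    return {person: len(ties) / (n - 1) for person, ties in neighbours.items()}


# Hypothetical "who knows whom" relationships.
ties = [("Ana", "Ben"), ("Ana", "Cal"), ("Ana", "Dee"), ("Ben", "Cal")]
centrality = degree_centrality(ties)
most_connected = max(centrality, key=centrality.get)
print(most_connected, centrality[most_connected])
```

A person with a high score is a likely conduit for the flow of knowledge between the other parties.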
However, SNA differs from social media analysis, in that SNA tries to capture the
social relationships and patterns between networks of people. On the other hand, so-
cial media analysis aims to analyze what social media users are saying in order to
uncover useful patterns, information about the users, and sentiments. This is tradition-
ally done using text mining or sentiment analysis, which are discussed below.
Text mining, in turn, is used to analyze a document or set of documents
in order to understand the content within and the meaning of the information
contained. Text mining has become very important nowadays since most of the in-
formation stored, not including audio, video, and images, consists of text. While data
mining deals with structured data, text presents special characteristics which basically
follow a non-relational form [18].
Moreover, sentiment analysis, or opinion mining, is becoming more and more im-
portant as online opinion data, such as blogs, product reviews, forums, and social data
from social media sites like Twitter and Facebook, grow tremendously. Sentiment
analysis focuses on analyzing and understanding emotions from subjective text pat-
terns, and is enabled through text mining. It identifies opinions and attitudes of indi-
viduals towards certain topics, and is useful in classifying viewpoints as positive or
negative. Sentiment analysis uses natural language processing and text analytics in
order to identify and extract information by finding words that are indicative of a
sentiment, as well as relationships between words, so that sentiments can be accurate-
ly identified [15].
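The idea of finding sentiment-bearing words and the relationships between them can be sketched with a small lexicon-based classifier. This is a deliberately simplified illustration under stated assumptions: the word lists, the single-word negation rule, and the additional "neutral" outcome are hypothetical choices for this example, and practical systems rely on far richer natural language processing.

```python
import re

# Tiny hand-made lexicons (assumptions for illustration only).
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "poor", "hate", "terrible", "slow"}
NEGATORS = {"not", "never", "no"}


def sentiment(text):
    """Classify text as positive, negative, or neutral from word polarity."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    for i, token in enumerate(tokens):
        polarity = (token in POSITIVE) - (token in NEGATIVE)
        # A negator directly before a sentiment word flips its polarity,
        # a crude stand-in for relationships between words.
        if polarity and i > 0 and tokens[i - 1] in NEGATORS:
            polarity = -polarity
        score += polarity
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(sentiment("I love this phone, the camera is great"))
print(sentiment("The battery is not good and support is terrible"))
```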
Finally, among big data analytics options, Advanced Data Visualization (ADV) and
visual discovery show some of the strongest growth potential [17]. Presenting
information so that people can consume it effectively is a key challenge that needs to be met,
in order for decision makers to be able to properly analyze data in a way that leads to
concrete actions [14].
ADV has emerged as a powerful technique to discover knowledge from data. ADV
combines data analysis methods with interactive visualization to enable comprehen-
sive data exploration. It is a data driven exploratory approach that fits well in situa-
tions where analysts have little knowledge about the data [20]. With the generation of
more and more data of high volume and complexity, an increasing demand has arisen
for ADV solutions from many application domains [25]. Additionally, such visualiza-
tion analyses take advantage of human perceptual and reasoning abilities, which
enables them to thoroughly analyze data at both the overview and the detailed levels.
Along with the size and complexity of big data, intuitive visual representation and
interaction is needed to facilitate the analyst’s perception and reasoning [20].
ADV can enable faster analysis, better decision making, and more effective presen-
tation and comprehension of results by providing interactive statistical graphics and a
point-and-click interface [4]. Furthermore, ADV is a natural fit for big data since it
can scale its visualizations to represent thousands or millions of data points, unlike
standard pie, bar, and line charts. Moreover, it can handle diverse data types, as well
as present analytic data structures that aren’t easily flattened onto a computer screen,
such as hierarchies and neural nets. Additionally, most ADV tools and functions can
support interfaces to all the leading data sources, thus enabling business analysts to
explore data widely across a variety of sources in search of the right analytics dataset,
usually in real-time [17].
and the data needs to be gathered from different sources, processed, stored, and mi-
grated to the end user. Such big data needs to be treated accordingly, so after the data
sources and types of data required for the analysis are defined, the chosen data is ac-
quired and stored in any of the big data storage and management tools previously
discussed. After the big data is acquired and stored, it is then organized, prepared, and
processed. This is achieved across a high-speed network using ETL/ELT or big data
processing tools, which have been covered in the previous sections.
The next phase in the decision making process is the design phase, where possible
courses of action are developed and analyzed through a conceptualization, or a
representative model of the problem. The framework divides this phase into three steps:
model planning, data analytics, and analyzing. Here, a model for data analytics, such
as those previously discussed, is selected and planned, and then applied, and finally
analyzed.
Consequently, the following phase in the decision making process is the choice
phase, where methods are used to evaluate the impacts of the proposed solutions, or
courses of action, from the design phase. Finally, the last phase in the decision mak-
ing process is the implementation phase, where the proposed solution from the pre-
vious phase is implemented [8].
As the amount of big data continues to exponentially grow, organizations through-
out the different sectors are becoming more interested in how to manage and analyze
such data. Thus, they are rushing to seize the opportunities offered by big data, and
gain the most benefit and insight possible, consequently adopting big data analytics in
order to unlock economic value and make better and faster decisions. Therefore, or-
ganizations are turning towards big data analytics in order to analyze huge amounts of
data faster, and reveal previously unseen patterns, sentiments, and customer intelli-
gence. This section focuses on some of the different applications, both proposed and
implemented, of big data analytics, and how these applications can aid organizations
across different sectors to gain valuable insights and enhance decision making.
According to Manyika et al.’s research, big data can enable companies to create
new products and services, enhance existing ones, as well as invent entirely new busi-
ness models. Such benefits can be gained by applying big data analytics in different
areas, such as customer intelligence, supply chain intelligence, performance, quality
and risk management, and fraud detection [14]. Furthermore, Cebr’s study highlighted
the main industries that can benefit from big data analytics, such as the manufactur-
ing, retail, central government, healthcare, telecom, and banking industries [4].
well as what they don’t like. By performing sentiment analysis on this data, firms can
be alerted beforehand when customers are turning against them or shifting to different
products, and accordingly take action [7].
Additionally, using SNAs to monitor customer sentiments towards brands, and
identify influential individuals, can help organizations react to trends and perform
direct marketing. Big data analytics can also enable the construction of predictive
models for customer behavior and purchase patterns, therefore raising overall profita-
bility [4]. Even organizations which have used segmentation for many years are
beginning to deploy more sophisticated big data techniques, such as real-time micro-
segmentation of customers, in order to target promotions and advertising [14].
Consequently, big data analytics can benefit organizations by enabling better targeted
social influencer marketing, defining and predicting trends from market sentiments, as
well as analyzing and understanding churn and other customer behaviors [17].
significant expenditures [4]. Additionally, big data analytics can result in manufactur-
ing lead improvements [17]. Furthermore, real-time data analyses and monitoring of
machine logs can enable managers to make swifter decisions for quality management.
Also, big data analytics can allow for the real-time monitoring of network demand, in
addition to the forecasting of bandwidth in response to customer behavior.
Moreover, healthcare IT systems can improve the efficiency and quality of care, by
communicating and integrating patient data across different departments and institu-
tions, while retaining privacy controls [4]. Analyzing electronic health records can
improve the continuity of care for individuals, as well as create a massive dataset
through which treatments and outcomes can be predicted and compared. Therefore,
with the increasing use of electronic health records, along with the advancements in
analytics tools, there arises an opportunity to mine the available de-identified patient
information for assessing the quality of healthcare, as well as managing diseases and
health services [22].
Additionally, the quality of citizens’ lives can be improved through the utilization
of big data. For healthcare, sensors can be used in hospitals and homes to provide the
continuous monitoring of patients, and perform real-time analyses on the patient data
streaming in. This can be used to alert individuals and their health care providers if
any health anomalies are detected in the analysis, requiring the patient to seek medical
help [22]. Patients can also be monitored remotely to analyze their adherence to their
prescriptions, and improve drug and treatment options [14].
Moreover, by analyzing information from distributed sensors on handheld devices,
roads, and vehicles, which provide real-time traffic information, transportation can be
transformed and improved. Traffic jams can be predicted and prevented, and drivers
can operate more safely and with less disruption to the traffic flow. Such a new type
of traffic ecosystem, with “intelligent” connected cars, can potentially renovate trans-
portation and how roadways are used [22]. Accordingly, big data applications can
provide smart routing, according to real-time traffic information based on personal
location data. Furthermore, such applications can automatically call for help when
trouble is detected by the sensors, and inform users about accidents, scheduled road-
work, and congested areas in real-time [14].
Furthermore, big data can be used for better understanding changes in the location,
frequency, and intensity of weather and climate. This can benefit citizens and busi-
nesses that rely upon weather, such as farmers, as well as tourism and transportation
companies. Also, with new sensors and analysis techniques for developing long term
climate models and nearer weather forecasts, weather related natural disasters can be
predicted, and preventive or adaptive measures can be taken beforehand [22].
be used to integrate the risk profiles managed in isolation across separate departments,
into enterprise wide risk profiles. This can aid in risk mitigation, since a comprehen-
sive view of the different risk types and their interrelations is provided to decision
makers [4].
Furthermore, new big data tools and technologies can provide for managing the
exponential growth in network produced data, as well as reduce database performance
problems by increasing the ability to scale and capture the required data. Along with
the enhancement in cyber analytics and data intensive computing solutions, organiza-
tions can incorporate multiple streams of data and automated analyses to protect
themselves against cyber and network attacks [22].
As for fraud detection, especially in the government, banking, and insurance indus-
tries, big data analytics can be used to detect and prevent fraud [17]. Analytics are
already commonly used in automated fraud detection, but organizations and sectors
are looking towards harnessing the potentials of big data in order to improve their
systems. Big data can allow them to match electronic data across several sources,
between both public and private sectors, and perform faster analytics [4].
In addition, customer intelligence can be used to model normal customer behavior,
and detect suspicious or divergent activities through the accurate flagging of outlier
occurrences. Furthermore, providing systems with big data about prevailing fraud
patterns can allow these systems to learn the new types of frauds and act accordingly,
as the fraudsters adapt to the old systems designed to detect them. Also, SNAs can be
used to identify the networks of collaborating fraudsters, as well as discover evidence
of fraudulent insurance or benefits claims, which will lead to less fraudulent activity
going undiscovered [4]. Thus, big data tools, techniques, and governance processes
can increase the prevention and recovery of fraudulent transactions by dramatically
increasing the speed of identification and detection of compliance patterns within all
available data sets [22].
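Modeling normal customer behavior and flagging outlier occurrences can be illustrated with a basic statistical rule. The sketch below assumes a hypothetical customer whose past transaction amounts define "normal" behavior, and flags new amounts that lie more than a chosen number of standard deviations from the historical mean. The function `flag_outliers`, the threshold of 3.0, and the sample amounts are illustrative assumptions, not a method prescribed by the cited reports.

```python
from statistics import mean, stdev


def flag_outliers(history, new_amounts, threshold=3.0):
    """Flag amounts far from the historical mean, measured in standard deviations."""
    mu = mean(history)
    sigma = stdev(history)  # sample standard deviation; needs at least two points
    if sigma == 0:
        # With no historical variation, any different amount is unusual.
        return [amount for amount in new_amounts if amount != mu]
    return [amount for amount in new_amounts if abs(amount - mu) / sigma > threshold]


# Hypothetical past transaction amounts for one customer.
history = [42.0, 38.5, 45.0, 40.0, 39.5, 44.0, 41.0]
flagged = flag_outliers(history, [43.0, 400.0, 37.0])
print(flagged)
```

A fixed rule of this kind does not adapt as fraudsters change their behavior, which is why the text above emphasizes systems that keep learning from data about prevailing fraud patterns.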
4 Conclusion
In this research, we have examined the innovative topic of big data, which has
recently gained considerable interest due to its perceived unprecedented opportunities and
benefits. In the information era we are currently living in, voluminous varieties of high
velocity data are being produced daily, and within them lie intrinsic details and
patterns of hidden knowledge which should be extracted and utilized. Hence, big data
analytics can be applied to leverage business change and enhance decision making, by
applying advanced analytic techniques on big data, and revealing hidden insights and
valuable knowledge.
Accordingly, the literature was reviewed in order to provide an analysis of the big
data analytics concepts which are being researched, as well as their importance to
decision making. Consequently, big data was discussed, as well as its characteristics
and importance. Moreover, some of the big data analytics tools and methods in par-
ticular were examined. Thus, big data storage and management, as well as big data
analytics processing were detailed. In addition, some of the different advanced data
analytics techniques were further discussed.
226 N. Elgendy and A. Elragal
By applying such analytics to big data, valuable information can be extracted and
exploited to enhance decision making and support informed decisions. Consequently,
some of the different areas where big data analytics can support and aid in decision
making were examined. It was found that big data analytics can provide vast horizons
of opportunities in various applications and areas, such as customer intelligence, fraud
detection, and supply chain management. Additionally, its benefits can serve different
sectors and industries, such as healthcare, retail, telecom, manufacturing, etc.
Accordingly, this research has provided people and organizations with examples
of the various big data tools, methods, and technologies which can be applied.
This gives users an idea of the necessary technologies required, as well as developers
an idea of what they can do to provide more enhanced solutions for big data analytics
in support of decision making. Thus, the support of big data analytics to decision
making was depicted.
Finally, any new technology, if applied correctly, can bring with it several potential
benefits and innovations, and big data in particular is a remarkable field with a bright
future, if approached correctly. However, big data is very difficult to deal with. It
requires proper storage, management, integration, federation, cleansing, processing,
analyzing, etc. With all the problems faced with traditional data management, big data
exponentially increases these difficulties due to additional volumes, velocities, and
varieties of data and sources which have to be dealt with. Therefore, future research
can focus on providing a roadmap or framework for big data management which can
encompass the previously stated difficulties.
We believe that big data analytics is of great significance in this era of data over-
flow, and can provide unforeseen insights and benefits to decision makers in various
areas. If properly exploited and applied, big data analytics has the potential to provide
a basis for advancements, on the scientific, technological, and humanitarian levels.
References
1. Adams, M.N.: Perspectives on Data Mining. International Journal of Market Research
52(1), 11–19 (2010)
2. Asur, S., Huberman, B.A.: Predicting the Future with Social Media. In: ACM International
Conference on Web Intelligence and Intelligent Agent Technology, vol. 1, pp. 492–499
(2010)
3. Bakshi, K.: Considerations for Big Data: Architecture and Approaches. In: Proceedings of
the IEEE Aerospace Conference, pp. 1–7 (2012)
4. Cebr: Data equity, Unlocking the value of big data. In: SAS Reports, pp. 1–44 (2012)
5. Cohen, J., Dolan, B., Dunlap, M., Hellerstein, J.M., Welton, C.: MAD Skills: New Analy-
sis Practices for Big Data. Proceedings of the ACM VLDB Endowment 2(2), 1481–1492
(2009)
6. Cuzzocrea, A., Song, I., Davis, K.C.: Analytics over Large-Scale Multidimensional Data:
The Big Data Revolution! In: Proceedings of the ACM International Workshop on Data
Warehousing and OLAP, pp. 101–104 (2011)
7. Economist Intelligence Unit: The Deciding Factor: Big Data & Decision Making. In:
Capgemini Reports, pp. 1–24 (2012)
8. Elgendy, N.: Big Data Analytics in Support of the Decision Making Process. MSc Thesis,
German University in Cairo, p. 164 (2013)
9. EMC: Data Science and Big Data Analytics. In: EMC Education Services, pp. 1–508
(2012)
10. He, Y., Lee, R., Huai, Y., Shao, Z., Jain, N., Zhang, X., Xu, Z.: RCFile: A Fast and Space-
efficient Data Placement Structure in MapReduce-based Warehouse Systems. In: IEEE
International Conference on Data Engineering (ICDE), pp. 1199–1208 (2011)
11. Herodotou, H., Lim, H., Luo, G., Borisov, N., Dong, L., Cetin, F.B., Babu, S.: Starfish: A
Self-tuning System for Big Data Analytics. In: Proceedings of the Conference on Innova-
tive Data Systems Research, pp. 261–272 (2011)
12. Kubick, W.R.: Big Data, Information and Meaning. In: Clinical Trial Insights, pp. 26–28
(2012)
13. Lee, R., Luo, T., Huai, Y., Wang, F., He, Y., Zhang, X.: Ysmart: Yet Another SQL-to-
MapReduce Translator. In: IEEE International Conference on Distributed Computing
Systems (ICDCS), pp. 25–36 (2011)
14. Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., Byers, A.H.: Big
Data: The Next Frontier for Innovation, Competition, and Productivity. In: McKinsey
Global Institute Reports, pp. 1–156 (2011)
15. Mouthami, K., Devi, K.N., Bhaskaran, V.M.: Sentiment Analysis and Classification Based
on Textual Reviews. In: International Conference on Information Communication and
Embedded Systems (ICICES), pp. 271–276 (2013)
16. Plattner, H., Zeier, A.: In-Memory Data Management: An Inflection Point for Enterprise
Applications. Springer, Heidelberg (2011)
17. Russom, P.: Big Data Analytics. In: TDWI Best Practices Report, pp. 1–40 (2011)
18. Sanchez, D., Martin-Bautista, M.J., Blanco, I., Torre, C.: Text Knowledge Mining: An Al-
ternative to Text Data Mining. In: IEEE International Conference on Data Mining Work-
shops, pp. 664–672 (2008)
19. Serrat, O.: Social Network Analysis. Knowledge Network Solutions 28, 1–4 (2009)
20. Shen, Z., Wei, J., Sundaresan, N., Ma, K.L.: Visual Analysis of Massive Web Session Da-
ta. In: Large Data Analysis and Visualization (LDAV), pp. 65–72 (2012)
21. Song, Z., Kusiak, A.: Optimizing Product Configurations with a Data Mining Approach.
International Journal of Production Research 47(7), 1733–1751 (2009)
22. TechAmerica: Demystifying Big Data: A Practical Guide to Transforming the Business of
Government. In: TechAmerica Reports, pp. 1–40 (2012)
23. Van der Valk, T., Gijsbers, G.: The Use of Social Network Analysis in Innovation Studies:
Mapping Actors and Technologies. Innovation: Management, Policy & Practice 12(1),
5–17 (2010)
24. Zeng, D., Hsinchun, C., Lusch, R., Li, S.H.: Social Media Analytics and Intelligence.
IEEE Intelligent Systems 25(6), 13–16 (2010)
25. Zhang, L., Stoffel, A., Behrisch, M., Mittelstadt, S., Schreck, T., Pompl, R., Weber, S.,
Last, H., Keim, D.: Visual Analytics for the Big Data Era—A Comparative Review of
State-of-the-Art Commercial Systems. In: IEEE Conference on Visual Analytics Science
and Technology (VAST), pp. 173–182 (2012)