Big Data Paper
ABSTRACT
The age of big data has arrived, bringing voluminous, complex, noisy, and rapidly growing data from
multiple technology-driven sources. Big data spans domains such as healthcare, biomedicine,
government, research, and commerce. Researchers from diverse fields have turned to big data to obtain
quality knowledge that contributes to their fields, and they report using many modern tools and
techniques specific to big data to solve big data decision-making tasks. This paper focuses on (i) the concepts
and features associated with big data, (ii) the state-of-the-art techniques and tools for big data decision making, and
(iii) the challenges pertaining to big data, so that future researchers can take direction from them.
With the arrival of Hadoop in 2006, the first generation of big data processing began. Hadoop uses
MapReduce as its processing engine. The second generation of big data processing was started by S4 (a Yahoo
product of 2010). S4 dealt with both static and big data. Hybrid processing can bring us to the third
generation. However, sufficient development in this area is yet to happen before we can enter the third generation.
Table 1 summarizes the three generations of technologies by listing the processing technologies under each
paradigm.
Table 1. Three Generations of Big Data Technologies
Paradigm             Technology
Batch Processing     MapReduce, Hadoop, Flume, Scribe, Dryad, Apache Mahout, Jaspersoft BI Suite, Pentaho,
                     Skytree Server, Cascading, Spark, Tableau, Karmasphere, Pig, Sqoop
Stream Processing    Kafka, Flume, Kestrel, Storm, S4, SQLstream, Splunk, SAP Hana, Spark Streaming
Hybrid Processing    Lambdoop, SummingBird
Batch processing handles data that is already held in storage. Its advantages are scalability and
reliability. Scalability is achieved through parallel implementations such as
MapReduce. Stream processing, on the other hand, processes big data in real time. This paradigm takes a diskless
processing approach to achieve low latency. Hybrid processing combines batch and stream
processing, based on the Lambda architecture [48].
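The three paradigms can be illustrated with a minimal, in-memory Python sketch. It is not the implementation of Hadoop, S4, or any tool in Table 1. The function and class names (map_phase, shuffle, reduce_phase, batch_view, SpeedLayer, query) are hypothetical, and a simple word count stands in for a real analysis. The batch view is computed with a MapReduce-style map, shuffle, and reduce over stored records, the speed layer updates counts in memory as each event arrives, and a query merges both views in the spirit of the Lambda architecture.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(record: str) -> List[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all emitted values by their key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce step: sum the grouped values for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def batch_view(stored_records: Iterable[str]) -> Dict[str, int]:
    """Batch processing: recompute the view from all stored records."""
    pairs = [pair for record in stored_records for pair in map_phase(record)]
    return reduce_phase(shuffle(pairs))


class SpeedLayer:
    """Stream processing: update counts in memory as each event arrives."""

    def __init__(self) -> None:
        self.realtime_view: Counter = Counter()

    def on_event(self, record: str) -> None:
        for word, count in map_phase(record):
            self.realtime_view[word] += count


def query(word: str, batch: Dict[str, int], speed: SpeedLayer) -> int:
    """Hybrid processing: merge the batch view with the real-time view."""
    return batch.get(word, 0) + speed.realtime_view[word]


if __name__ == "__main__":
    # Assumed toy data: two stored records and one newly arrived event.
    batch = batch_view(["big data hadoop", "hadoop mapreduce"])
    speed = SpeedLayer()
    speed.on_event("hadoop storm")
    print(query("hadoop", batch, speed))
```

In a real deployment the map and reduce steps would run in parallel across a cluster, which is where the scalability of batch processing comes from; here they run sequentially only to keep the sketch self-contained.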
III. CHALLENGES PERTAINING TO BIG DATA
The ultimate target is to develop big data solutions for decision making that were never available
before. In this section, we discuss the challenges in big data decision making and possible future solutions to
them. Many factors influence the decision-making process for big data. Reports and
studies in the literature show that these factors and their impacts change over time. Big data analysis for decision making
is an ad hoc process in which organizations are frequently altered to obtain quality output.
Agreements are changed to obtain big data, new staff are hired, and new departments are formed so as to gain
an advantage by discovering features from big data and making subsequent decisions in a short span of time. The factors
that affect decision making from big data are listed in Table 2.
The main challenges discovered in decision making relate to velocity, validity, and veracity, and
are connected with the following:
1. Processing: The velocity of data sometimes lets the application obtain and deal with only a part of the data, leaving
another part behind. This degrades decision making because the complete picture of the dataset becomes unclear.
For example, a part of the data that shows fraudulent behavior goes unnoticed if that part of the data
is unavailable.
2. Noise: The presence of noise distorts the perception of the data and makes it difficult to obtain key
insights.
3. Error: In many cases only the source has information on the context of the data, and those performing the data
analytics have no idea of the data context. For example, data may have been collected two years ago and reflect the scenario
of the last two years, but be wrongly communicated as data from the last year. This is an error of data context,
and decisions made from this data would be wrong.
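The processing challenge in item 1 can be made concrete with a small Python sketch. It is only an illustration under stated assumptions: events arrive in bursts, the application can handle a fixed number of events per burst (the hypothetical capacity parameter), and a simple amount threshold stands in for a real fraud-detection rule. All names and figures are invented for the example.

```python
from typing import Dict, List, Tuple

Event = Dict[str, float]


def process_burst(burst: List[Event], capacity: int) -> Tuple[List[Event], List[Event]]:
    """Split one burst into the events handled in time and those left behind."""
    return burst[:capacity], burst[capacity:]


def flag_suspicious(events: List[Event], threshold: float) -> List[Event]:
    """Assumed stand-in rule: an event is suspicious if its amount exceeds the threshold."""
    return [event for event in events if event["amount"] > threshold]


def analyse(bursts: List[List[Event]], capacity: int,
            threshold: float) -> Tuple[List[Event], List[Event]]:
    """Return (suspicious events seen, suspicious events missed because they were dropped)."""
    seen: List[Event] = []
    missed: List[Event] = []
    for burst in bursts:
        processed, dropped = process_burst(burst, capacity)
        seen.extend(flag_suspicious(processed, threshold))
        missed.extend(flag_suspicious(dropped, threshold))
    return seen, missed


if __name__ == "__main__":
    # Hypothetical transaction bursts; each burst holds one more event than the capacity.
    bursts = [
        [{"id": 1, "amount": 20}, {"id": 2, "amount": 9500}, {"id": 3, "amount": 30}],
        [{"id": 4, "amount": 15}, {"id": 5, "amount": 10}, {"id": 6, "amount": 8700}],
    ]
    seen, missed = analyse(bursts, capacity=2, threshold=5000)
    print("flagged:", [event["id"] for event in seen])
    print("never examined:", [event["id"] for event in missed])
```

The decision maker only ever sees the first list; the suspicious event in the dropped part of the second burst never reaches the analysis, which is the incomplete picture described above.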
IV. CONCLUSION
Big data is a popular domain that is still developing at an enormous speed. Big data decision
making is a new sub-domain of big data analysis that encompasses the techniques and technologies of many
other domains. This paper presents the techniques and technologies of big data through textual
details along with a few figures and tables. It then describes the challenges associated with big data
processing and identifies, from the literature, the factors that influence big data decision making.
REFERENCES
[1] Hilbert, Martin, and Priscila López. "The world’s technological capacity to store, communicate, and compute information." Science 332,
no. 6025 (2011): 60-65.
[2] Chen, CL Philip, and Chun-Yang Zhang. "Data-intensive applications, challenges, techniques and technologies: A survey on Big
Data." Information sciences 275 (2014): 314-347.
[3] Tien, James M. "Big data: Unleashing information." Journal of Systems Science and Systems Engineering 22, no. 2 (2013): 127-151.
[4] R. Casado et al. "Emerging trends and technologies in big data processing." Concurr. Comp-Pract. E. (2015).
[5] Miller, H. Gilbert, and Peter Mork. "From data to decisions: a value chain for big data." IT Professional 15, no. 1 (2013): 57-59.
[6] Manyika, James, Michael Chui, Brad Brown, Jacques Bughin, Richard Dobbs, Charles Roxburgh, and Angela Hung Byers. Big data: The
next frontier for innovation, competition, and productivity. McKinsey Global Institute, 2011.
[7] Mahani, Alireza S., and Mansour TA Sharabiani. "SIMD parallel MCMC sampling with applications for big-data Bayesian
analytics." Computational Statistics & Data Analysis 88 (2015): 75-99.
[8] Wilkinson, Leland. "The future of statistical computing." Technometrics 50, no. 4 (2008): 418-435.
[9] Hastie, Trevor, Robert Tibshirani, and Jerome H. Friedman. The elements of statistical learning: data mining,
inference, and prediction. Vol. 2. New York: Springer, 2009.
[10] Yan, Jun, Ning Liu, Shuicheng Yan, Qiang Yang, Weiguo Fan, Wei Wei, and Zheng Chen. "Trace-oriented feature analysis for large-
scale text data dimension reduction." IEEE Transactions on Knowledge and Data Engineering 23, no. 7 (2010): 1103-1117.
[11] Sahimi, Muhammad, and Hossein Hamzehpour. "Efficient computational strategies for solving global optimization
problems." Computing in Science & Engineering 12, no. 04 (2010): 74-83.
[12] Li, Xiaodong, and Xin Yao. "Cooperatively coevolving particle swarms for large scale optimization." IEEE Transactions on Evolutionary
Computation 16, no. 2 (2011): 210-224.
[13] Sardar, Tanvir Habib, and Zahid Ansari. "An analysis of MapReduce efficiency in document clustering using parallel K-means
algorithm." Future Computing and Informatics Journal 3, no. 2 (2018): 200-209.
[14] Sardar, Tanvir Habib, and Zahid Ansari. "Partition based clustering of large datasets using MapReduce framework: An analysis of recent
themes and directions." Future Computing and Informatics Journal 3, no. 2 (2018): 247-261.
[15] Sardar, Tanvir Habib, and Zahid Ansari. "Detection and confirmation of web robot requests for cleaning the voluminous web log data."
In 2014 International Conference on the IMpact of E-Technology on US (IMPETUS), pp. 13-19. IEEE, 2014.
[16] Sardar, Tanvir Habib, and Zahid Ansari. "An analysis of distributed document clustering using MapReduce based K-means
algorithm." Journal of The Institution of Engineers (India): Series B 101, no. 6 (2020): 641-650.
[17] Ansari, Zahid, Asif Afzal, and Tanvir Habib Sardar. "Data categorization using hadoop MapReduce-based parallel K-means
clustering." Journal of The Institution of Engineers (India): Series B 100, no. 2 (2019): 95-103.
[18] Wen, Zhenshu, Wanwei Zhang, Tao Zeng, and Luonan Chen. "MCentridFS: a tool for identifying module biomarkers for multi-
phenotypes from high-throughput data." Molecular BioSystems 10, no. 11 (2014): 2870-2875.
[19] Wang, Yi, Xinli Jiang, Rongyu Cao, and Xiyang Wang. "Robust indoor human activity recognition using wireless signals." Sensors 15,
no. 7 (2015): 17195-17208.
[20] Nedjah, Nadia, Felipe P. da Silva, Alan O. de Sá, Luiza M. Mourelle, and Diana A. Bonilla. "A massively parallel pipelined reconfigurable
design for M-PLN based neural networks for efficient image classification." Neurocomputing 183 (2016): 39-55.
[21] Arel, Itamar, Derek C. Rose, and Thomas P. Karnowski. "Deep machine learning-a new frontier in artificial intelligence research [research
frontier]." IEEE computational intelligence magazine 5, no. 4 (2010): 13-18.
[22] Preuer, Kristina, Richard PI Lewis, Sepp Hochreiter, Andreas Bender, Krishna C. Bulusu, and Günter Klambauer. "DeepSynergy:
predicting anti-cancer drug synergy with Deep Learning." Bioinformatics 34, no. 9 (2018): 1538-1546.
[23] Sardar, Tanvir Habib, Ahmed Rimaz Faizabadi, and Zahid Ansari. "An evaluation of MapReduce framework in cluster analysis." In 2017
International Conference on Intelligent Computing, Instrumentation and Control Technologies (ICICICT), pp. 110-114. IEEE, 2017.
[24] Sardar, Tanvir Habib, Zahid Ansari, and Amina Khatun. "An evaluation of Hadoop cluster efficiency in document clustering using parallel
K-means." In 2017 IEEE International Conference on Circuits and Systems (ICCS), pp. 17-20. IEEE, 2017.
[25] Sardar, Tanvir Habib, Amina Khatun, and Sahanowaj Khan. "Design of energy aware collection tree protocol in wireless sensor network."
In 2017 IEEE International Conference on Circuits and Systems (ICCS), pp. 12-17. IEEE, 2017.
[26] Ansari, Zahid, Tanvir Habib Sardar, Moksud Alam Mallik, and Naveen D. Chandavarkar. "Data mining in soft computing framework: a
survey." (2002).
[27] Sardar, Tanvir Habib, and Ahmed Rimaz Faizabadi. "Parallelization and analysis of selected numerical algorithms using OpenMP and
Pluto on symmetric multiprocessing machine." Data Technologies and Applications (2019).
[28] Siddiqa, Noor, and Tanvir Habib Sardar. "Multi-Layered Security System Using Cryptography and Steganography." (2019).
[29] Sardar, Tanvir Habib, Zahid Ansari, Naveen D. Chandavarkar, and Amjad Khan. "A Methodology for Detecting Web Robot Requests."
[30] Sardar, Tanvir Habib. "A Methodology in Mobile Networks for Global Roaming." Oriental Journal of Computer Science and
Technology 6, no. 4 (2013): 391-396.
[31] Sardar, T. Habib, Zahid Ansari, and Amjad Khan. "A Methodology for Wireless Intrusion Detection System." International Journal of
Computer Applications 975: 8887.
[33] Bennett, Janine Camille, David Thompson, Joshua Levine, Peer-Timo Bremer, Attila Gyulassy, Valerio Pascucci, and Philippe Pierre
Pebay. Analysis of Large-Scale Scalar Data Using Hixels. No. SAND2011-8450C. Sandia National Lab.(SNL-CA), Livermore, CA (United
States), 2011.
[34] Ahrens, James, Kristi Brislawn, Ken Martin, Berk Geveci, C. Charles Law, and Michael Papka. "Large-scale data visualization using
parallel data streaming." IEEE Computer graphics and Applications 21, no. 4 (2001): 34-41.
[35] Childs, Hank, Berk Geveci, Will Schroeder, Jeremy Meredith, Kenneth Moreland, Christopher Sewell, Torsten Kuhlen, and E. Wes
Bethel. "Research challenges for visualization software." Computer 46, no. 5 (2013): 34-42.
[36] Assunção, Marcos D., Rodrigo N. Calheiros, Silvia Bianchi, Marco AS Netto, and Rajkumar Buyya. "Big Data computing and clouds:
Trends and future directions." Journal of parallel and distributed computing 79 (2015): 3-15.
[37] Morente-Molinera, Juan Antonio, Ignacio J. Pérez, M. Raquel Ureña, and Enrique Herrera-Viedma. "Creating knowledge databases for
storing and sharing people knowledge automatically using group decision making and fuzzy ontologies." Information Sciences 328 (2016):
418-434.
[38] Lin, Chun‐Wei, and Tzung‐Pei Hong. "A survey of fuzzy web mining." Wiley Interdisciplinary Reviews: Data Mining and Knowledge
Discovery 3, no. 3 (2013): 190-199.
[39] Kaburlasos, Vassilis G., and George A. Papakostas. "Learning distributions of image features by interactive fuzzy lattice reasoning in
pattern recognition applications." IEEE Computational Intelligence Magazine 10, no. 3 (2015): 42-51.
[40] Iglesias, José Antonio, Alexandra Tiemblo, Agapito Ledezma, and Araceli Sanchis. "Web news mining in an evolving
framework." Information Fusion 28 (2016): 90-98.
[41] Chang, Hsien-Tsung, Nilamadhab Mishra, and Chung-Chih Lin. "IoT big-data centred knowledge granule analytic and cluster framework
for BI applications: a case base analysis." PloS one 10, no. 11 (2015): e0141980.
[42] López, Victoria, Sara Del Río, José Manuel Benítez, and Francisco Herrera. "Cost-sensitive linguistic fuzzy rule based classification
systems under the MapReduce framework for imbalanced big data." Fuzzy Sets and Systems 258 (2015): 5-38.
[43] Lu, Hua-pu, Zhi-yuan Sun, and Wen-cong Qu. "Big data-driven based real-time traffic flow state identification and prediction." Discrete
Dynamics in Nature and Society 2015 (2015).
[44] Ramachandramurthy, Sivaraman, Srinivasan Subramaniam, and Chandrasekeran Ramasamy. "Distilling big data: refining quality
information in the era of yottabytes." The Scientific World Journal 2015 (2015).
[45] Wang, Hai, Zeshui Xu, Hamido Fujita, and Shousheng Liu. "Towards felicitous decision making: An overview on challenges and trends
of Big Data." Information Sciences 367 (2016): 747-765.
[46] Azar, Ahmad Taher, and Aboul Ella Hassanien. "Dimensionality reduction of medical big data using neural-fuzzy classifier." Soft
computing 19, no. 4 (2015): 1115-1127.
[47] Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: Simplified data processing on large clusters." (2004).
[48] Janssen, Marijn, Haiko van der Voort, and Agung Wahyudi. "Factors influencing big data decision-making quality." Journal of business
research 70 (2017): 338-345.