
BIG DATA

Abstract—A huge repository of terabytes of data is generated each day from modern information systems and digital technologies such as the Internet of Things and cloud computing. Analysing these massive data to extract knowledge for decision making requires considerable effort at multiple levels. Big data analysis is therefore a current area of research and development. The basic objective of this paper is to explore the potential impact of big data, its challenges, and the various tools associated with it. To that end, this article provides a platform to explore big data at its numerous stages.

I. INTRODUCTION

The world is experiencing a huge data revolution. The explosion of data is directly connected to the arrival of the digital age. The term "Big Data" refers to vast amounts of data that traditional data processing procedures and tools cannot handle. The term emerged in the 1990s, gained momentum in the early 2000s, and has been variously defined and operationalized. Clearly, size often comes to mind when referring to big data: it is commonly defined as the astonishing amount of structured and unstructured data being generated, captured, and stored at an amazing speed. Data comes from literally everywhere, at all times, and from all sorts of devices. It is often produced and accessible in real time, and it arises from the merging of different sources. Organizations have been relying on these sources of data to describe, interpret, and forecast economic and business activities and to decide on their next direction. Today, various efficient and intelligent techniques are available to help organizations interpret this large volume of data from heterogeneous sources, so that it can be processed, analysed, and presented in an understandable, visual, and decent manner suited to the business language and stakeholders' objectives.

In the digital world, data are generated from various sources, and the fast transition of digital technologies has led to the growth of big data. It provides evolutionary breakthroughs in many fields through the collection of large datasets. In general, big data refers to the collection of large and complex datasets which are difficult to process using traditional database management tools or data processing applications. These are available in structured, semi-structured, and unstructured formats, in petabytes and beyond. Formally, big data is defined in terms of 3Vs to 5Vs. The 3Vs are volume, velocity, and variety. Volume refers to the huge amount of data that are being generated every day, whereas velocity is the rate of growth and how fast the data are gathered for analysis. Variety provides information about the types of data, such as structured, unstructured, and semi-structured. The fourth V refers to veracity, which includes availability and accountability. The fifth V refers to value. The prime objective of big data analysis is to process data of high volume, velocity, variety, veracity, and value using various traditional and computational intelligence techniques.

II. BIG DATA: AN OVERVIEW

Big Data is the ocean of information we swim in every day – vast zettabytes of data flowing from our computers, mobile devices, and machine sensors. Organizations use this data to drive decisions, improve processes and policies, and create customer-centric products, services, and experiences. Big Data is defined as "big" not just because of its volume, but also due to the variety and complexity of its nature. Typically, it exceeds the capacity of traditional databases to capture, manage, and process it. And Big Data can come from anywhere or anything on earth that we're able to monitor digitally. Weather satellites, Internet of Things (IoT) devices, traffic cameras, social media trends – these are just a few of the data sources being mined and analyzed to make businesses more resilient and competitive. In short, big data is a collection of data from many different sources: data so large, fast, or complex that it is difficult or impossible to process using traditional methods.

A. Types of Big data

Data sets are typically categorized into three types based on their structure and how straightforward (or not) they are to index.

1. Structured data:
This kind of data is the simplest to organize and search. It can include things like financial data, machine logs, and demographic details. An Excel spreadsheet, with its layout of pre-defined columns and rows, is a good way to envision structured data. Its components are easily categorized, allowing database designers and administrators to define simple algorithms for search and analysis. Even when structured data exists in enormous volume, it doesn't necessarily qualify as Big Data, because structured data on its own is relatively simple to manage and therefore doesn't meet the defining criteria of Big Data. Traditionally, databases have used a programming language called Structured Query Language (SQL) to manage structured data. SQL was developed by IBM in the 1970s to allow developers to build and manage relational (spreadsheet-style) databases, which were beginning to take off at that time.
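To make the SQL picture concrete, here is a minimal sketch using Python's built-in sqlite3 module; the table and the records in it are made up for illustration:

```python
import sqlite3

# In-memory relational database: rows and pre-defined columns,
# matching the spreadsheet picture of structured data above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (id INTEGER, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [(1, "alice", 120.50), (2, "bob", 75.00), (3, "alice", 42.25)],
)

# SQL expresses a simple, declarative search/analysis step.
for row in conn.execute(
    "SELECT customer, SUM(amount) FROM transactions GROUP BY customer"
):
    print(row)  # e.g. ('alice', 162.75)
```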

2. Unstructured data:
This category of data can include things like
social media posts, audio files, images, and open-
ended customer comments. This kind of data cannot
be easily captured in standard row-column relational
databases. Traditionally, companies that wanted to
search, manage, or analyze large amounts of
unstructured data had to use laborious manual
processes. There was never any question as to the
potential value of analyzing and understanding such
data, but the cost of doing so was often too high
to make it worthwhile. Considering the time it took,
results were often obsolete before they were even
delivered. Instead of spreadsheets or relational
databases, unstructured data is usually stored in data
lakes, data warehouses, and NoSQL databases.
3. Semi-structured data:

As it sounds, semi-structured data is a hybrid of structured and unstructured data. E-mails are a good example, as they include unstructured data in the body of the message as well as more organizational properties such as sender, recipient, subject, and date. Devices that use geo-tagging, time stamps, or semantic tags can also deliver structured data alongside unstructured content. An unidentified smartphone image, for instance, can still tell you that it is a selfie and the time and place where it was taken. A modern database running AI technology can not only instantly identify different types of data, it can also generate algorithms in real time to effectively manage and analyze the disparate data sets involved.
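The e-mail example can be sketched with Python's standard email module: the header fields are the structured part, while the body stays unstructured free text (the message itself is invented):

```python
from email import message_from_string

raw = """From: alice@example.com
To: bob@example.com
Subject: Quarterly numbers
Date: Mon, 01 Jan 2024 09:00:00 +0000

Hi Bob, the attached figures look better than expected...
"""

msg = message_from_string(raw)

# Structured, easily indexed properties:
print(msg["From"], msg["To"], msg["Subject"], msg["Date"])

# Unstructured free text that needs heavier analysis:
print(msg.get_payload())
```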

B. Characteristics of Big Data Management

As one of the current trend terms in the world today, there is no exact way to define big data. The term is often used with related concepts such as Business Intelligence (BI) and data mining. The five major characteristics that define big data are volume, variety, velocity, veracity, and value.

1. Volume:

While volume is by no means the only component that makes Big Data "big," it is certainly a primary feature. To fully manage and utilize Big Data, advanced algorithms and AI-driven analytics are required. But before any of that can happen, there needs to be a secure and reliable means of storing, organizing, and retrieving the many terabytes of data that are held by large companies.

2. Velocity:


In the past, any data that was generated had to later be entered into a traditional database system – often manually – before it could be analyzed or retrieved. Today, Big Data technology allows databases to process, analyze, and configure data while it is being generated – sometimes within milliseconds. For businesses, that means real-time data can be used to capture financial opportunities, respond to customer needs, thwart fraud, and address any other activity where speed is critical.

3. Variety:

Data sets that are comprised solely of structured data are not necessarily Big Data, regardless of how voluminous they are. Big Data is typically comprised of combinations of structured, unstructured, and semi-structured data. Traditional databases and data management solutions lack the flexibility and scope to manage the complex, disparate data sets that make up Big Data.
4. Veracity:

While modern database technology makes it possible for companies to amass and make sense of staggering amounts and types of Big Data, it's only valuable if it is accurate, relevant, and timely. For traditional databases that were populated only with structured data, syntactical errors and typos were the usual culprits when it came to data accuracy. With unstructured data, there is a whole new set of veracity challenges. Human bias, social noise, and data provenance issues can all have an impact upon the quality of data.

5. Value:

Without question, the results that come from Big Data analysis are often fascinating and unexpected. But for businesses, Big Data analytics must deliver insights that help them become more competitive and resilient – and better serve their customers. Modern Big Data technologies open up the capacity for collecting and retrieving data that can provide measurable benefit to both bottom lines and operational resilience.
C. Challenges in Big data

In recent years big data has accumulated in several domains like health care, public administration, retail, bio-chemistry, and other interdisciplinary scientific research. Web-based applications encounter big data frequently, for example in social computing, internet text and documents, and internet search indexing. Social computing includes social network analysis, online communities, recommender systems, reputation systems, and prediction markets, whereas internet search indexing includes ISI, IEEE Xplore, Scopus, Thomson Reuters, etc. Considering these advantages, big data provides new opportunities in knowledge processing tasks for upcoming researchers. However, opportunities always come with challenges. To handle the challenges, we need to understand various computational complexities, information security issues, and computational methods for analyzing big data. For example, many statistical methods that perform well for small data sizes do not scale to voluminous data. Similarly, many computational techniques that perform well for small data face significant challenges in analyzing big data. The challenges that the health sector, among others, faces have been researched by many researchers. Here the challenges of big data analytics are classified into four broad categories, namely data storage and analysis; knowledge discovery and computational complexities; scalability and visualization of data; and information security.
Data Storage and Analysis

In recent years the size of data has grown exponentially by various means such as mobile devices, aerial sensory technologies, remote sensing, and radio frequency identification readers. Storing these data costs much, yet they are ultimately ignored or deleted because there is not enough space to store them. Therefore, the first challenge for big data analysis concerns storage mediums and higher input/output speed. In such cases, data accessibility must be the top priority for knowledge discovery and representation, the prime reason being that data must be accessed easily and promptly for further analysis. In past decades, analysts used hard disk drives to store data, but these offer slower random input/output performance than sequential input/output. To overcome this limitation, the concepts of solid-state drives (SSD) and phase change memory (PCM) were introduced. However, the available storage technologies cannot deliver the performance required for processing big data.

Another challenge with Big Data analysis is attributed to the diversity of data. With ever-growing datasets, data mining tasks have significantly increased. Additionally, data reduction, data selection, and feature selection are essential tasks, especially when dealing with large datasets (a small sketch follows at the end of this subsection). This presents an unprecedented challenge for researchers, because existing algorithms may not always respond in adequate time when dealing with such high-dimensional data. Automating this process and developing new machine learning algorithms to ensure consistency has been a major challenge in recent years. In addition, clustering of large datasets to help analyze big data is of prime concern. Recent technologies such as Hadoop and MapReduce make it possible to collect large amounts of semi-structured and unstructured data in a reasonable amount of time. The key engineering challenge is how to effectively analyze these data to obtain better knowledge. A standard process to this end is to transform the semi-structured or unstructured data into structured data, and then apply data mining algorithms to extract knowledge. A framework to analyze data was discussed by Das and Kumar; similarly, a detailed explanation of data analysis for public tweets was discussed by Das et al. in their paper. The major challenge in this case is to pay more attention to designing storage systems and to developing efficient data analysis tools that provide guarantees on the output when the data come from different sources. Furthermore, designing machine learning algorithms to analyze data is essential for improving efficiency and scalability.
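As a toy illustration of the feature-selection step mentioned above, this sketch (plain NumPy, with an arbitrary variance threshold chosen for the example) drops near-constant features from a data matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
X[:, 2] = 1.0  # a near-constant column carries little information

# Keep only features whose variance exceeds a chosen threshold.
threshold = 0.1
variances = X.var(axis=0)
selected = variances > threshold
X_reduced = X[:, selected]

print("kept features:", np.flatnonzero(selected))  # column 2 is dropped
print("shape:", X.shape, "->", X_reduced.shape)
```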
Knowledge Discovery and Computational Complexities

Knowledge discovery and representation is a prime issue in big data. It includes a number of sub-fields such as authentication, archiving, management, preservation, information retrieval, and representation. There are several tools for knowledge discovery and representation, such as fuzzy sets, rough sets, soft sets, near sets, formal concept analysis, and principal component analysis, to name a few. Additionally, many hybridized techniques have been developed to address real-life problems. All these techniques are problem dependent. Further, some of these techniques may not be suitable for large datasets on a sequential computer, while some of them have good scalability properties on a parallel computer. Since the size of big data keeps increasing exponentially, the available tools may not be efficient enough to process these data and obtain meaningful information. The most popular approach for managing large datasets is data warehouses and data marts: a data warehouse is mainly responsible for storing data that are sourced from operational systems, whereas a data mart is based on a data warehouse and facilitates analysis.

Analysis of large datasets involves greater computational complexity. The major issue is to handle the inconsistencies and uncertainty present in the datasets. In general, systematic modeling of the computational complexity is used. It may be difficult to establish a comprehensive mathematical system that is broadly applicable to Big Data, but domain-specific data analytics can be done easily by understanding the particular complexities, and a series of such developments could stimulate big data analytics for different areas. Much research and many surveys have been carried out in this direction using machine learning techniques with the least memory requirements. The basic objective of this research is to minimize computational cost and complexity. However, current big data analysis tools have poor performance in handling computational complexities, uncertainty, and inconsistencies. This makes it a great challenge to develop techniques and technologies that can deal with computational complexity, uncertainty, and inconsistencies in an effective manner.
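Among the tools listed above, principal component analysis is easy to sketch. This toy version (NumPy only; the data are random and stand in for a real dataset) projects samples onto their top two principal components:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 10))          # 500 samples, 10 features

Xc = X - X.mean(axis=0)                 # center the data
cov = np.cov(Xc, rowvar=False)          # 10x10 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order

top2 = eigvecs[:, -2:]                  # directions of largest variance
X_proj = Xc @ top2                      # reduced representation

print(X_proj.shape)                     # (500, 2)
```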
Scalability and Visualization of Data

The most important challenge for big data analysis techniques is their scalability and security. In the last decades researchers have paid attention to accelerating data analysis and to the speed-up of processors following Moore's Law. For the former, it is necessary to develop sampling, on-line, and multiresolution analysis techniques. Incremental techniques have good scalability properties for big data analysis. As data size scales much faster than CPU speed, there has been a dramatic shift in processor technology toward increasing numbers of cores, and this shift has led to the development of parallel computing. Real-time applications like navigation, social networks, finance, and internet search require parallel computing for timeliness. The objective of visualizing data is to present it more adequately using techniques from graph theory; graphical visualization provides the link between the data and its proper interpretation. However, online marketplaces like Flipkart, Amazon, and eBay have millions of users and billions of goods sold each month, which generates a lot of data. To this end, some companies use the tool Tableau for big data visualization; it has the capability to transform large and complex data into intuitive pictures, which helps a company's employees visualize search relevance, monitor the latest customer feedback, and perform sentiment analysis. However, current big data visualization tools mostly have poor performance in functionality, scalability, and response time. We can observe that big data has produced many challenges for the development of hardware and software, which has led to parallel computing, cloud computing, distributed computing, visualization processes, and scalability. To overcome these issues, we need to correlate more mathematical models with computer science.

Information Security

In big data analysis, massive amounts of data are correlated, analyzed, and mined for meaningful patterns. Every organization has its own policies to safeguard its sensitive information. Preserving sensitive information is a major issue in big data analysis, and there is a huge security risk associated with big data, so information security is itself becoming a big data analytics problem. The security of big data can be enhanced by techniques of authentication, authorization, and encryption. Security issues that big data applications face include the scale of the network, the variety of different devices, real-time security monitoring, and the lack of intrusion systems. The security challenge posed by big data has attracted the attention of the information security community, and attention has to be given to developing a multi-level security policy model and prevention system. Although much research has been carried out to secure big data, much improvement is still required. The major challenge is to develop a multi-level security, privacy-preserving data model for big data.
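As a minimal sketch of one of the techniques just named, the following uses Python's standard hmac module for message authentication; the key and the record are placeholders, not a production security scheme:

```python
import hmac, hashlib

secret_key = b"shared-secret"             # placeholder key
record = b'{"patient_id": 42, "reading": 98.6}'

# Sender attaches a keyed digest to each record.
tag = hmac.new(secret_key, record, hashlib.sha256).hexdigest()

# Receiver recomputes the digest; a mismatch means tampering.
def verify(data: bytes, received_tag: str) -> bool:
    expected = hmac.new(secret_key, data, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, received_tag)

print(verify(record, tag))                # True
print(verify(b"tampered data", tag))      # False
```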
D. Open research issues in Big data

Big data analytics and data science are becoming the research focal point in industry and academia. Data science aims at researching big data and knowledge extraction from data. Applications of big data and data science include information science, uncertainty modeling, uncertain data analysis, machine learning, statistical learning, pattern recognition, data warehousing, and signal processing. Effective integration of these technologies and analyses will result in predicting the future drift of events. The main focus of this section is to discuss open research issues in big data analytics. The research issues pertaining to big data analysis are classified into four broad categories, namely the Internet of Things (IoT), cloud computing, bio-inspired computing, and quantum computing. However, research is not limited to these issues; more research issues related to health care big data can be found in Husing Kuo et al.
IoT for Big Data Analytics

The internet has restructured global interrelations, the art of business, cultural revolutions, and an unbelievable number of personal characteristics. Currently, machines are getting in on the act, controlling innumerable autonomous gadgets via the internet and creating the Internet of Things (IoT). Thus, appliances are becoming users of the internet, just like humans with their web browsers. The Internet of Things is attracting the attention of researchers for its most promising opportunities and challenges, and it has an imperative economic and societal impact for the future construction of information, network, and communication technology. The new rule of the future will be that eventually everything will be connected and intelligently controlled. The concept of IoT is becoming more pertinent to the realistic world due to the development of mobile devices, embedded and ubiquitous communication technologies, cloud computing, and data analytics. Moreover, IoT presents challenges in combinations of volume, velocity, and variety. In a broader sense, just like the internet, the Internet of Things enables devices to exist in a myriad of places and facilitates applications ranging from the trivial to the crucial. Conversely, it is still mystifying to understand IoT well, including its definitions, content, and differences from other similar concepts. Several diversified technologies, such as computational intelligence and big data, can be incorporated together to improve the data management and knowledge discovery of large-scale automation applications; much research in this direction has been carried out by Mishra, Lin and Chang. Knowledge acquisition from IoT data is the biggest challenge that big data professionals are facing. Therefore, it is essential to develop an infrastructure to analyze IoT data. An IoT device generates continuous streams of data, and researchers can develop tools to extract meaningful information from these data using machine learning techniques. Understanding these streams of data generated from IoT devices, and analysing them to obtain meaningful information, is a challenging issue that leads to big data analytics. Machine learning algorithms and computational intelligence techniques are the only solution to handle big data from the IoT perspective. Key technologies associated with IoT are also discussed in many research papers.
Knowledge exploration systems have originated from theories of human information processing such as frames, rules, tagging, and semantic networks. In general, such a system consists of four segments: knowledge acquisition, knowledge base, knowledge dissemination, and knowledge application. In the knowledge acquisition phase, knowledge is discovered using various traditional and computational intelligence techniques. The discovered knowledge is stored in knowledge bases, and expert systems are generally designed based on the discovered knowledge. Knowledge dissemination is important for obtaining meaningful information from the knowledge base; knowledge extraction is a process that searches documents, knowledge within documents, and knowledge bases. The final phase is to apply the discovered knowledge in various applications, which is the ultimate goal of knowledge discovery. The knowledge exploration system is necessarily iterative with respect to the judgement of knowledge application. There are many issues, discussions, and research efforts in this area of knowledge exploration, but they are beyond the scope of this survey paper.

Cloud Computing for Big Data Analytics
The development of virtualization technologies has made supercomputing more accessible and affordable. Computing infrastructures hidden in virtualization software make systems behave like a true computer, but with the flexibility of specifying details such as the number of processors, disk space, memory, and operating system. The use of these virtual computers is known as cloud computing, which has been one of the most robust big data techniques. Big Data and cloud computing technologies are developed with the aim of providing scalable, on-demand availability of resources and data. Cloud computing harmonizes massive data through on-demand access to configurable computing resources via virtualization techniques. The benefits of utilizing cloud computing include offering resources when there is demand and paying only for the resources needed to develop the product; simultaneously, it improves availability and cost reduction. Open challenges and research issues of big data and cloud computing are discussed in detail by many researchers, highlighting the challenges in data management, data variety and velocity, data storage, data processing, and resource management. Thus, cloud computing helps in developing a business model for all varieties of applications with infrastructure and tools.

Big data applications using cloud computing should support data analytics and development. The cloud environment should provide tools that allow data scientists and business analysts to interactively and collaboratively explore knowledge acquisition data for further processing and to extract fruitful results. This can help to solve large problems that may arise in various domains. In addition to this, cloud computing should also enable the scaling of tools from virtual technologies to new technologies like Spark, R, and other types of big data processing techniques. Big data forms a framework for discussing cloud computing options. Depending on their specific needs, users can go to the marketplace and buy infrastructure services from cloud service providers such as Google, Amazon, and IBM, or software as a service (SaaS) from a whole crew of companies such as NetSuite, Cloud9, Job Science, etc. Another advantage of cloud computing is cloud storage, which provides a possible way of storing big data. An obvious issue is the time and cost needed to upload and download big data in the cloud environment; beyond that, it becomes difficult to control the distribution of computation and the underlying hardware. But the major issues are privacy concerns relating to the hosting of data on public servers and the storage of data from human studies. All these issues will take big data and cloud computing to a high level of development.
Bio-inspired Computing for Big Data

Bio-inspired computing is a technique inspired by nature to address complex real-world problems. Biological systems are self-organized, without central control; a bio-inspired cost minimization mechanism searches for and finds the optimal data service solution considering the cost of data management and service maintenance. Such techniques are also being developed with biological molecules such as DNA and proteins to conduct computational calculations involving the storing, retrieving, and processing of data. A significant feature of such computing is that it integrates biologically derived materials to perform computational functions and achieve intelligent performance. These systems are well suited to big data applications.

Huge amounts of data have been generated from a variety of resources across the web since digitization. Analyzing these data and categorizing them into text, image, video, and so on will require a lot of intelligent analytics from data scientists and big data professionals. Technologies like big data, IoT, cloud computing, and bio-inspired computing are proliferating, but an equilibrium can only be reached by selecting the right platform to analyze large data and furnish cost-effective results. Bio-inspired computing techniques play a key role in intelligent data analysis and its application to big data. These algorithms help in performing data mining on large datasets because of their application to optimization. Their main advantages are their simplicity and their rapid convergence to optimal solutions while solving service provision problems. Some applications of bio-inspired computing to this end were discussed in detail by Cheng et al. From those discussions, we can observe that bio-inspired computing models provide smarter interactions, cope with inevitable data losses, and help in handling ambiguities. Hence, it is believed that in the future bio-inspired computing may help in handling big data to a large extent.
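As a toy example of the bio-inspired optimization idea, here is a minimal (1+1) evolution strategy minimizing a hypothetical data-service cost function; everything in it (the cost function, mutation scale, iteration count) is an illustrative assumption, not a method from the works discussed above:

```python
import random

def service_cost(x):
    # Hypothetical cost of a data-service configuration x.
    return (x - 3.7) ** 2 + 1.0

# (1+1) evolution strategy: keep a parent, mutate it, keep the fitter.
parent = random.uniform(-10, 10)
for _ in range(1000):
    child = parent + random.gauss(0, 0.5)   # mutation
    if service_cost(child) < service_cost(parent):
        parent = child                      # selection

print(f"best x = {parent:.2f}, cost = {service_cost(parent):.3f}")
```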
Quantum Computing for Big Data Analysis

A quantum computer has a memory that is exponentially larger than its physical size and can manipulate an exponential set of inputs simultaneously, so an exponential improvement over classical computer systems might be possible. If a real quantum computer were available now, it could solve problems that are exceptionally difficult on current computers, including today's big data problems. Overcoming the main technical difficulties in building a quantum computer could soon be possible. Quantum computing provides a way of merging quantum mechanics with information processing. In a traditional computer, information is represented by long strings of bits which encode either a zero or a one; a quantum computer, on the other hand, uses quantum bits, or qubits. The difference between a qubit and a bit is that a qubit is a quantum system that encodes the zero and the one into two distinguishable quantum states. It can therefore capitalize on the phenomena of superposition and entanglement, because qubits behave quantumly. For example, 100 qubits in a quantum system would require 2^100 complex values to be stored in a classical computer system. This means that many big data problems could be solved much faster by large-scale quantum computers than by classical computers. Hence it is a challenge for this generation to build a quantum computer and facilitate quantum computing to solve big data problems.
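The 2^100 figure can be made concrete with a short calculation: simulating n qubits on a classical machine needs 2^n complex amplitudes, so the required memory grows exponentially (the sketch assumes 16 bytes per complex value):

```python
# Memory needed to hold the state vector of an n-qubit system
# on a classical machine: 2**n complex amplitudes.
BYTES_PER_AMPLITUDE = 16  # one double-precision complex value

for n in (10, 20, 30, 40, 100):
    amplitudes = 2 ** n
    size = amplitudes * BYTES_PER_AMPLITUDE
    print(f"{n:3d} qubits -> 2**{n} amplitudes, {size:.3e} bytes")
```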
E. Tools for Big data processing

A large number of tools are available to process big data. In this section, we discuss some current techniques for analyzing big data, with emphasis on three important emerging tools: MapReduce, Apache Spark, and Storm. Most of the available tools concentrate on batch processing, stream processing, or interactive analysis. Most batch processing tools are based on the Apache Hadoop infrastructure, such as Mahout and Dryad. Stream data applications are mostly used for real-time analytics; examples of large-scale streaming platforms are Storm and Splunk. Interactive analysis allows users to directly interact with the data in real time for their own analysis.
Apache Hadoop and MapReduce

The most established software platform for big data analysis is Apache Hadoop and MapReduce. It consists of the Hadoop kernel, MapReduce, the Hadoop distributed file system (HDFS), Apache Hive, and so on. MapReduce is a programming model for processing large datasets based on the divide and conquer method, which is implemented in two steps: a Map step and a Reduce step. Hadoop works with two kinds of nodes: a master node and worker nodes. In the Map step, the master node divides the input into smaller sub-problems and distributes them to worker nodes; in the Reduce step, the master node combines the outputs of all the sub-problems. Hadoop and MapReduce work as a powerful software framework for solving big data problems, and are also helpful for fault-tolerant storage and high-throughput data processing.
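The classic illustration of this divide-and-conquer model is word counting. The sketch below simulates the Map and Reduce steps in plain Python on a single machine, standing in for what Hadoop distributes across worker nodes:

```python
from collections import defaultdict
from itertools import chain

documents = ["big data is big", "data analysis is hard"]

# Map step: each "worker" turns its chunk into (key, value) pairs.
def map_phase(doc):
    return [(word, 1) for word in doc.split()]

mapped = chain.from_iterable(map_phase(d) for d in documents)

# Shuffle/Reduce step: the "master" groups pairs by key and sums them.
counts = defaultdict(int)
for word, n in mapped:
    counts[word] += n

print(dict(counts))  # {'big': 2, 'data': 2, 'is': 2, 'analysis': 1, 'hard': 1}
```

Because each map call touches only its own chunk, the map phase parallelizes trivially, which is the point of the model.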
Apache Mahout

Apache Mahout aims to provide scalable, commercially friendly machine learning techniques for large-scale and intelligent data analysis applications. Core algorithms of Mahout, including clustering, classification, pattern mining, regression, dimensionality reduction, evolutionary algorithms, and batch-based collaborative filtering, run on top of the Hadoop platform through the MapReduce framework. The goal of Mahout is to build a vibrant, responsive, and diverse community to facilitate discussions on the project and potential use cases. The basic objective of Apache Mahout is to provide a tool for alleviating big data challenges. Companies that have implemented scalable machine learning algorithms include Google, IBM, Amazon, Yahoo, Twitter, and Facebook.
Apache Spark

Apache Spark is an open-source big data processing framework built for fast processing and sophisticated analytics. It is easy to use; it was originally developed in 2009 in UC Berkeley's AMPLab and was open-sourced in 2010 as an Apache project. Spark lets you quickly write applications in Java, Scala, or Python. In addition to MapReduce operations, it supports SQL queries, streaming data, machine learning, and graph data processing. Spark runs on top of the existing Hadoop distributed file system (HDFS) infrastructure to provide enhanced and additional functionality. Spark consists of several components: a driver program, a cluster manager, and worker nodes. The driver program serves as the starting point of execution of an application on the Spark cluster; the cluster manager allocates the resources, and the worker nodes do the data processing in the form of tasks. Each application has a set of processes, called executors, that are responsible for executing the tasks. The major advantage is that Spark supports deploying applications in an existing Hadoop cluster.

• The prime focus of Spark is resilient distributed datasets (RDDs), which store data in memory and provide fault tolerance without replication. RDDs support iterative computation and improve speed and resource utilization.

• The foremost advantage is that in addition to MapReduce, Spark also supports streaming data, machine learning, and graph algorithms.

• Another advantage is that a user can run the application program in different languages such as Java, R, Python, or Scala. This is possible because Spark comes with higher-level libraries for advanced analytics. These standard libraries increase developer productivity and can be seamlessly combined to create complex workflows.

• Spark can run an application in a Hadoop cluster up to 100 times faster in memory, and 10 times faster when running on disk. This is possible because of the reduction in the number of read/write operations to disk.

• Spark is written in the Scala programming language and runs in a Java virtual machine (JVM) environment. Additionally, it supports Java, Python, and R for developing applications.
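A minimal PySpark sketch of the RDD model described above (it assumes a local Spark installation; the input lines are made up):

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "WordCount")

lines = sc.parallelize(["big data is big", "spark keeps data in memory"])

# RDD transformations are lazy and can be recomputed on failure,
# which is how Spark provides fault tolerance without replication.
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

print(counts.collect())   # the action triggers the actual computation
sc.stop()
```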
Dryad

Dryad is another popular programming model for implementing parallel and distributed programs that handle large contexts, based on dataflow graphs. A Dryad cluster consists of a number of computing nodes, and a user employs the resources of the cluster to run programs in a distributed way; indeed, a Dryad user may use thousands of machines, each with multiple processors or cores. The major advantage is that users do not need to know anything about concurrent programming. A Dryad application runs a computational directed graph composed of computational vertices and communication channels. Dryad therefore provides a large amount of functionality, including generating the job graph, scheduling the machines for the available processes, handling transition failures in the cluster, collecting performance metrics, visualizing the job, invoking user-defined policies, and dynamically updating the job graph in response to these policy decisions, all without knowing the semantics of the vertices.
Storm

Storm is a distributed, fault-tolerant, real-time computation system for processing large streaming data. It is specially designed for real-time processing, in contrast to Hadoop, which is designed for batch processing. It is also easy to set up and operate, scalable, and fault-tolerant, offering competitive performance. A Storm cluster is superficially similar to a Hadoop cluster: on a Storm cluster, users run different topologies for different Storm tasks, whereas the Hadoop platform runs MapReduce jobs for the corresponding applications. There are a number of differences between MapReduce jobs and topologies; the basic one is that a MapReduce job eventually finishes, whereas a topology processes messages all the time, or until the user terminates it. A Storm cluster consists of two kinds of nodes: a master node and worker nodes. The master node and worker nodes implement two kinds of roles, nimbus and supervisor respectively, whose functions correspond to those of the jobtracker and tasktracker of the MapReduce framework. Nimbus is in charge of distributing code across the Storm cluster, scheduling and assigning tasks to worker nodes, and monitoring the whole system. The supervisor executes tasks as assigned to it by nimbus, and starts and terminates processes as necessary based on nimbus's instructions. The whole computation is partitioned and distributed over a number of worker processes, and each worker process implements a part of the topology.

Apache Drill

Apache Drill is another distributed system for the interactive analysis of big data. It has more flexibility in supporting many types of query languages, data formats, and data sources, and it is specially designed to exploit nested data. It also has the objective of scaling up to 10,000 servers or more, reaching the capability to process petabytes of data and trillions of records in seconds. Drill uses HDFS for storage and MapReduce to perform batch analysis.

Jaspersoft

The Jaspersoft package is open-source software that produces reports from database columns. It is a scalable big data analytical platform with the capability of fast data visualization on popular storage platforms, including MongoDB, Cassandra, and Redis. One important property of Jaspersoft is that it can quickly explore big data without extraction, transformation, and loading (ETL). In addition, it also has the ability to build powerful hypertext markup language (HTML) reports and dashboards interactively and directly from big data stores, without any ETL requirement. These generated reports can be shared with anyone inside or outside the user's organization.

Splunk

In recent years, a lot of machine data has been generated by business industries. Splunk is a real-time, intelligent platform developed for exploiting this machine-generated big data. It combines up-to-the-moment cloud technologies and big data, and in turn helps users search, monitor, and analyze their machine-generated data through a web interface. The results are exhibited in intuitive ways, such as graphs, reports, and alerts. Splunk differs from other stream processing tools: its peculiarities include indexing structured and unstructured machine-generated data, real-time searching, reporting analytical results, and dashboards. Splunk's most important objectives are to provide metrics for many applications, to diagnose problems in systems and information technology infrastructures, and to give intelligent support for business operations.
III. Suggestions for future work

The amount of data collected from various applications all over the world, across a wide variety of fields, is today expected to double every two years. These data have no utility unless they are analyzed to obtain useful information, which necessitates the development of techniques that facilitate big data analysis. The development of powerful computers is a boon for implementing these techniques, leading to automated systems. The transformation of data into knowledge is by no means an easy task for high-performance, large-scale data processing, including exploiting the parallelism of current and upcoming computer architectures for data mining. Moreover, these data may involve uncertainty in many different forms. Many different models, like fuzzy sets, rough sets, soft sets, neural networks, their generalizations, and hybrid models obtained by combining two or more of these models, have been found fruitful in representing data; these models are also very fruitful for analysis. More often than not, big data are reduced to include only the important characteristics necessary from a particular study's point of view or depending upon the application area, so reduction techniques have been developed. Often the data collected have missing values: these values need to be generated, or the tuples having them have to be eliminated from the data set before analysis; the latter approach sometimes leads to a loss of information and hence is not preferred.
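Both treatments of missing values mentioned above are one-liners in pandas; the toy data frame below is invented, and imputing with the column mean is just one possible choice for generating the values:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"age": [25, np.nan, 31], "income": [50.0, 62.0, np.nan]})

dropped = df.dropna()           # eliminate tuples with missing values
imputed = df.fillna(df.mean())  # or generate (impute) the missing values

print(dropped)
print(imputed)
```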
More importantly, these new challenges may compromise, and sometimes even deteriorate, the performance, efficiency, and scalability of dedicated data-intensive computing systems. This brings up many research issues in industry and the research community regarding capturing and accessing data effectively. In addition, fast processing, while achieving high performance and high throughput, and storing data efficiently for future use are further issues. Furthermore, programming for big data analysis is an important and challenging issue: expressing the data access requirements of applications and designing programming language abstractions to exploit parallelism are immediate needs. Additionally, machine learning concepts and tools are gaining popularity among researchers as a means of producing meaningful results. Research in the area of machine learning for big data has focused on data processing, algorithm implementation, and optimization. Many machine learning tools for big data have been started only recently and need drastic changes to be widely adopted. We argue that while each of the tools has its advantages and limitations, more efficient tools can be developed for dealing with problems inherent to big data. The efficient tools to be developed must have provisions for handling noisy and imbalanced data, uncertainty and inconsistency, and missing values.

IV. CONCLUSION

In recent years data have been generated at a dramatic pace, and analyzing these data is challenging for an ordinary user. To this end, in this paper we surveyed the various research issues, characteristics, types, challenges, and tools used to analyze big data. From this survey, it is understood that every big data platform has its individual focus: some are designed for batch processing, whereas others are good at real-time analytics, and each big data platform also has specific functionality. The different techniques used for the analysis include statistical analysis, machine learning, data mining, intelligent analysis, cloud computing, quantum computing, and data stream processing. We believe that in the future researchers will pay more attention to these techniques to solve problems of big data effectively and efficiently.

REFERENCES

[1] M. K. Kakhani, S. Kakhani and S. R. Biradar, Research issues in big data analytics, International Journal of Application or Innovation in Engineering & Management, 2(8) (2015), pp. 228-232.

[2] A. Gandomi and M. Haider, Beyond the hype: Big data concepts, methods, and analytics, International Journal of Information Management, 35(2) (2015), pp. 137-144.

[3] C. Lynch, Big data: How do your data grow?, Nature, 455 (2008), pp. 28-29.

[4] X. Jin, B. W. Wah, X. Cheng and Y. Wang, Significance and challenges of big data research, Big Data Research, 2(2) (2015), pp. 59-64.

[5] R. Kitchin, Big Data, new epistemologies and paradigm shifts, Big Data & Society, 1(1) (2014), pp. 1-12.

[6] C. L. Philip, Q. Chen and C. Y. Zhang, Data-intensive applications, challenges, techniques and technologies: A survey on big data, Information Sciences, 275 (2014), pp. 314-347.

[7] K. Kambatla, G. Kollias, V. Kumar and A. Grama, Trends in big data analytics, Journal of Parallel and Distributed Computing, 74(7) (2014), pp. 2561-2573.