I.J. Information Technology and Computer Science, 2017, 4, 31-38
Published Online April 2017 in MECS (http://www.mecs-press.org/)
DOI: 10.5815/ijitcs.2017.04.05
The Obstacles in Big Data Process
Rasim M. Alguliyev
Institute of Information Technology of Azerbaijan National Academy of Sciences 9,
B. Vahabzade str., Baku, AZ1141, Azerbaijan
E-mail: r.alguliev@gmail.com
Rena T. Gasimova, Rahim N. Abbaslı
Institute of Information Technology of Azerbaijan National Academy of Sciences 9,
B. Vahabzade str., Baku, AZ1141, Azerbaijan, GoEasy LTD, Canada, Mississauga L5B2N5
E-mail: {renakasumova@gmail.com, rahim.abbasli@gmail.com}
Abstract—The increasing amount of data and a need to
analyze the given data in a timely manner for multiple
purposes have created a serious barrier in the big data
analysis process. This article describes the challenges that
big data creates at each step of the analysis process. These
problems include typical analytical problems as well as
less common challenges that are specific to big data. The
article breaks the problems down by step of the big data
analysis process and discusses them separately at each
stage. It also offers some simple ways to solve these
problems.
Index Terms—Big data, big data analytics, database,
management, NoSQL, MapReduce, Hadoop, cloud, data
scientists.
I. INTRODUCTION
Advanced technologies have allowed companies to
collect data from multiple sources and create a big data
stream from which valuable information can be extracted
to manage the business. The big data flood allows
companies to build conceptual models that help them
adjust to new market trends and understand customer
behavior. These models are used to differentiate the
products offered by the company to match consumers’
expectations.
The information is also used to explore niche markets
before competitors enter them. The leverage and power
that big data offers have attracted many companies and
scientists. However, big data has also created challenges
that must be solved to eliminate the gaps in the big data
analysis process [1, 2].
The problems start right away during data acquisition,
when the data tsunami requires us to make decisions,
currently in an ad hoc manner, about what data to keep
and what to discard, and how to store what we keep
reliably with the right metadata. Much data today is not
natively in structured format; for example, tweets and
blogs are weakly structured pieces of text, while images
and video are structured for storage and display, but not
for semantic content and search: transforming such
content into a structured format for later analysis is a
Copyright © 2017 MECS
major challenge. The value of data increases when it can
be linked with other data, thus data integration is a major
creator of value. Since most data is directly generated in
digital format today, we have the opportunity and the
challenge both to influence the creation to facilitate later
linkage and to automatically link previously created data.
Data analysis, organization, retrieval, and modeling are
other foundational challenges. Data analysis is a clear
bottleneck in many applications, both due to lack of
scalability of the underlying algorithms and due to the
complexity of the data that needs to be analyzed. Finally,
presentation of the results and its interpretation by nontechnical experts is crucial to extracting actionable
knowledge.
The many novel challenges and opportunities
associated with Big Data necessitate rethinking many
aspects of existing data management platforms, while
retaining other desirable aspects. Appropriate
investment in Big Data will lead to a new wave of
fundamental technological advances that will be
embodied in the next generations of Big Data
management and analysis platforms, products, and
systems. A major investment in Big Data, properly
directed, can not only result in major scientific advances
but also lay the foundation for the next generation of
advances in science, medicine, and business [3, 4].
Big data has created sophisticated analytical
bottlenecks that cannot be solved with the common tools
and practices used in the industry today. Many
counterintuitive approaches are taken to reduce the clutter
in the process and gain the competitive advantage that
big data analysis offers. Hence it has become a
prerequisite to build scientific approaches and
theoretical models to tackle these problems.
II. THE BIG DATA ANALYTICAL PROBLEMS
Through better analysis of the large volumes of data
that are becoming available, there is the potential for
making faster advances in many scientific disciplines and
improving the profitability and success of many
enterprises. The challenges include not just the obvious
issues of scale, but also heterogeneity, lack of structure,
error-handling, privacy, timeliness, provenance, and
visualization, at all stages of the analysis pipeline from
data acquisition to result interpretation. These technical
challenges are common across application domains and
therefore not cost-effective to address in the context of
one domain alone. Furthermore, these challenges will
require transformative solutions, and will not be
addressed naturally by the next generation of industrial
products. We therefore need to encourage basic research
aimed at solving these technical problems in order to
achieve the promised benefits of big data.
The technologies used for big data analysis include
MPP (Massively Parallel Processing) analytical
platforms, cloud services, Hadoop and MapReduce, and
NoSQL database management systems. Hadoop, a project
of the Apache Software Foundation, is one of the most
common technologies used to analyze immense amounts
of data in distributed file systems. Hadoop consists of two
main components: Hadoop MapReduce and the Hadoop
Distributed File System (HDFS). The MapReduce
component is used for parallel computation, whereas
HDFS manages the distribution of files within the
system [5].
The programming model used in Hadoop is
MapReduce, which was proposed by Dean and Ghemawat
at Google. MapReduce is the basic data processing
scheme used in Hadoop; it breaks the entire task into two
parts, known as mappers and reducers. At a high level,
mappers read the data from HDFS, process it and pass
intermediate results to the reducers. Reducers aggregate
the intermediate results to generate the final output, which
is again written to HDFS. A typical Hadoop job involves
running several mappers and reducers across different
nodes in the cluster.
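The mapper and reducer roles described above can be illustrated with a minimal single-process sketch. It is not Hadoop code: the function names (`mapper`, `shuffle`, `reducer`, `word_count`) are assumptions chosen for illustration, the input is an in-memory list instead of HDFS, and the grouping step that a real framework performs between the two phases is simulated with a dictionary.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def mapper(line: str) -> Iterator[tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Group intermediate values by key (done by the framework in a real job)."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce step: aggregate all intermediate values that share one key."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> dict[str, int]:
    """Run the map, shuffle and reduce steps in sequence on in-memory lines."""
    intermediate = (pair for line in lines for pair in mapper(line))
    return dict(reducer(key, values) for key, values in shuffle(intermediate).items())
```

In a cluster, many mapper and reducer instances would run in parallel on different nodes; the sketch only shows how the work is divided between the two roles.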
A set of wrappers is currently being developed for
MapReduce. These wrappers provide better control over
the MapReduce code and aid source code development.
The following wrappers are widely used in combination
with MapReduce.
Apache Pig is a SQL-like environment developed at
Yahoo and used by many organizations such as Yahoo,
Twitter, AOL and LinkedIn. Hive is another MapReduce
wrapper, developed by Facebook. These two wrappers
provide a better environment and make code
development simpler, since programmers do not have
to deal with the complexities of MapReduce coding. In
addition to these wrappers, some researchers have also
developed scalable machine learning libraries, such as
Mahout, using the MapReduce paradigm [6, 7].
The most advanced technologies are used in order to
find a better way to extract the information from the big
data. The process of analyzing the big data and extracting
the essential information can be divided into four simple
steps regardless of the purpose of the analysis:
data collection;
integration;
analysis;
real world application.
A. Data Collection
Collecting data from multiple sources is the first step
of the general big data schema (Figure 1). Challenges
arise when the data sources are complex and
sophisticated. The main source of data for the Big Data
stream is rapidly shifting from manual data entry to data
collected from sensors, social networks, automatic data
collectors that are triggered when a particular event
happens, geographic information systems, and automatic
page scanners that extract particular data characteristics
from emails and online pages [8, 9].
Heterogeneity of the data sources is the most important
problem in the data collection step. Heterogeneous data
problems arise from the variety, representation and
semantics of the data sources. Most of the data created
nowadays differs fundamentally from the data types that
the original systems were designed for [10, 11].
Semantic problems emerge from differences in how two
parties define the collected data. For example, the system
designers might program the bank system to include some
fees in total earnings, whereas the data analysts would
assume these fees to be reported separately. An analyst
who uses total earnings in calculations might not be aware
of this definition and hence introduce errors into reports.
[Figure 1: data from several data sources flows through data
collection, integration and analysis to real world applications.]
Fig.1. The Big Data Analytical Process Steps Problems
The data representation problems are similar in nature
to the semantics problems. The data misrepresentation
might be caused by different types of data that is used to
show the same information. In the similar example we
used above, even if both parties agree on the earnings
definition, the data collector might capture the data as
floating-point numbers whereas the other party requires
integers in order to merge the data into a bigger dataset.
Common representation mistakes are caused by date
formats and character fields. Database designers might
try to join datasets using the name and surname of the
customer to extract some essential information. Since
comparisons of character fields are case sensitive, even a
small misrepresentation, such as different capitalization,
will cause searches and joins to miss matching records.
Given how common the problem is, widely used data
tools such as SAS provide functions like “UPCASE” to
capitalize all letters before making the comparison. SAS
also provides the “DATEPART” function, which extracts
the date portion of a datetime value, to bring dates into a
single form before trying to match the observations.
An easy way to solve the problem is to use ontologies
agreed on by both parties beforehand [12-13].
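The same normalization idea can be sketched outside SAS. The example below is a minimal Python illustration, not a SAS equivalent: the helper names (`normalize_name`, `normalize_date`, `join_key`) and the list of accepted date layouts are assumptions, and a real integration would take the layouts from an agreed specification between the two parties.

```python
from datetime import datetime

# Hypothetical date layouts that the two data sources are assumed to use.
KNOWN_DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")


def normalize_name(name: str) -> str:
    """Collapse repeated whitespace and upper-case the text before comparing."""
    return " ".join(name.split()).upper()


def normalize_date(text: str) -> str:
    """Return the date in ISO form (YYYY-MM-DD), trying each known layout."""
    for fmt in KNOWN_DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")


def join_key(name: str, date_text: str) -> tuple[str, str]:
    """Build a comparison key that ignores case and date-layout differences."""
    return normalize_name(name), normalize_date(date_text)
```

With these helpers, two records that differ only in capitalization and date layout produce the same key and can be matched in a join.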
One of the most important problems in data collection
is collecting the data required for the purpose. Acquiring
an immense amount of data requires instant decisions
about what data to keep and what to discard. Due to the
size of big data, this process usually takes enormous
effort and resources. Note that at this stage it is also
important to delete or ignore outliers. Since most of the
data collected nowadays is in digital format, it is easy to
link the data to each other or to previously collected data.
Such linking can create false correlations, which are
discussed in the analysis step of the big data analysis
process.
Outliers may also appear in a dataset due to human
or technical errors. Since the data collection process is
separated from the data analysis, the systems and data
collectors are not necessarily familiar with the purpose of
the data that they are collecting. This detachment from
the end result makes it hard for data collectors to pinpoint
outliers or data errors right away.
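One simple way to flag suspicious values at collection time, without knowing the purpose of the data, is a robust rule based on the median. The sketch below is an illustration only: the function name `flag_outliers` and the threshold of 5 median absolute deviations are assumptions, not a method proposed in this article.

```python
from statistics import median


def flag_outliers(values: list[float], threshold: float = 5.0) -> list[float]:
    """Return the values lying more than `threshold` MADs from the median.

    MAD is the median absolute deviation; the threshold is an assumed default.
    """
    center = median(values)
    mad = median(abs(value - center) for value in values)
    if mad == 0:
        # No spread around the median: this rule cannot rank deviations.
        return []
    return [value for value in values if abs(value - center) / mad > threshold]
```

A flagged value is only a candidate for review; as discussed later, an extreme observation may still be genuine.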
Another big problem is transferring the collected data.
Due to the size of the data, the transfer speed may be a
bottleneck in the process. Researchers are working on
high-speed fiber optics that can transfer big data fast.
Using a new type of optical fiber, researchers at the
Technical University of Denmark were able to transfer
data over a single fiber at 43 Tbps. This was the highest
speed reported at the time of writing, surpassing the
32 Tbps achieved earlier by scientists at the Karlsruhe
Institute of Technology [14].
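To put the two quoted rates in perspective, the ideal transfer time is simply the data size in bits divided by the link rate. The short calculation below assumes decimal units (1 PB = 10^15 bytes, 1 Tbps = 10^12 bits per second) and ignores protocol overhead and storage speed; the function and constant names are illustrative.

```python
PETABYTE = 10**15  # bytes, decimal definition (assumption)
TBPS = 10**12      # bits per second


def transfer_seconds(size_bytes: float, link_bits_per_second: float) -> float:
    """Ideal time to move `size_bytes` over a link, with no overhead."""
    return size_bytes * 8 / link_bits_per_second


seconds_at_43 = transfer_seconds(PETABYTE, 43 * TBPS)
seconds_at_32 = transfer_seconds(PETABYTE, 32 * TBPS)
```

Under these assumptions one petabyte needs roughly three minutes at 43 Tbps and a little over four minutes at 32 Tbps, before any receiving or storage limits are taken into account.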
Note that even though fiber optics have been developed
to achieve high speeds in data transfer, there have been
few successful attempts to create devices that can receive
and store the data at the same speed. At this stage of the
big data process, data protection and security problems
must also be taken into consideration. Losing highly
sensitive data during transmission is one of the most
common data security problems [15].
Many secure transmissions require some form of
encryption agreed on beforehand by both parties. Banks
use multi-layered encryption to
transfer sensitive customer information from one
source to the other. Despite these security measures, data
losses still happen. A common practice is to create secure
channels or SSL (Secure Sockets Layer) portals between
parties to add an additional layer of protection [16, 17].
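As an illustration of such a secure channel, the sketch below configures a client-side context with Python's standard `ssl` module. Note that SSL has been superseded by TLS in current software, so the sketch requires TLS 1.2 or newer; that minimum version and the function name `make_client_context` are assumptions made for this example, not requirements stated in this article.

```python
import ssl


def make_client_context() -> ssl.SSLContext:
    """Build a client context that verifies the server certificate and host name."""
    context = ssl.create_default_context()
    # Refuse protocol versions older than TLS 1.2 (assumed policy).
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context
```

A client would then wrap an ordinary TCP socket with `context.wrap_socket(sock, server_hostname=...)` before sending any customer data, so that the payload is encrypted in transit.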
B. Integration
The data that has been transferred must be stored in
some form. Every day we create so much data that it
costs companies a fortune to store it in order to improve
their business. The demand for storing big data has
increased so much and at such a fast pace that new
companies such as Switch have been created solely to
help companies resolve their data storage problems.
According to SuperNap, one of the biggest data centers in
the world, located in Las Vegas, US, Switch’s server
facility, the size of seven football fields, helps Google,
Morgan Stanley and other big companies store the data
required for their business. The data storage market has
grown to 70 billion dollars a year. According to Google’s
fourth-quarter spending results, Google spent 7.3 billion
dollars in 2013 on its infrastructure and data storage
facilities in order to decrease its dependency on
companies that focus on storing data [18, 19].
Storing such a huge amount of data requires an
enormous amount of energy and resources. One of the
problems of Big Data is finding the best-located servers
to store the data. The server locations must also be energy
efficient and scalable. The location is important because
it affects the speed at which the stored data can be
transferred for analysis. The problem of storage location
and transfer speed is more severe for companies that need
to make instant decisions based on market fluctuations.
Since most mathematical algorithms are designed to start
calculating market variables at given market prices, most
brokers and stock market participants are vulnerable to
differences in information transfer speed. Most
companies have started to solve the problem using cloud
computing.
Cloud computing is a successful computational
paradigm for managing and processing big data
repositories, mainly because of its innovative service
models known as “Database as a Service” (DaaS) and
“Infrastructure as a Service” (IaaS). DaaS defines a set of
tools that provide end users with seamless mechanisms
for creating, storing, accessing and managing their own
databases on remote (data) servers. Due to the inherent
features of big data, DaaS is the most appropriate
computational data framework to implement big data
repositories.
IaaS is a provision model in which organizations
outsource the infrastructure used to support ICT
operations. Due to the specific requirements of
applications running over big data repositories, IaaS is
the most appropriate computational service framework to
implement big data applications.
Even though cloud computing has multiple
advantages, it requires most small companies to “rent” the
storage places from other companies. This creates
business dependencies that only the big companies have
resources to avoid.
Most data generated today is not generated with
metadata or is not transferred with metadata attached.
Omitted metadata creates problems during the integration
of large amounts of data. The process requires attaching
a meaning to the data and assigning it to a field [20, 21].
Consider a case where customer information such as
the latest balance is transferred from one branch to
another without indicating what the data actually
represents. This will obviously create confusion when
attaching the data to the customer’s profile. When
metadata does not exist or is not created during the data
collection process, the stored data is basically useless for
further analysis. Integrating the data requires IT
technicians to thoroughly understand the transferred data
in order to store it in a meaningful fashion. It is not
uncommon to make assumptions about the metadata
when it does not exist.
However, this approach might introduce false
assumptions and give wrong results in further analysis.
The assumptions made about the metadata might also
increase the scale of the data by creating cross products
during join operations. Cross products result in “useless”
repetition of observations in the dataset, which should be
avoided at any cost given the size of big data. Integrating
multiple datasets into a data store is therefore the hardest
process at this step [22].
The data storage for Big Data must have certain
properties in order to serve long-term needs and work
efficiently (Figure 2).
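To make the cross-product risk described above concrete, the sketch below joins two tiny made-up tables twice: once on a correctly identified unique key and once on a wrongly assumed, non-unique key. The table contents, the column names and the `inner_join` helper are all hypothetical and serve only to show how the row count inflates.

```python
def inner_join(left: list[dict], right: list[dict], key: str) -> list[dict]:
    """Naive inner join of two lists of records on one shared column."""
    index: dict = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    # Every left row is paired with every right row that has the same key value.
    return [{**l_row, **r_row} for l_row in left for r_row in index.get(l_row[key], [])]


customers = [
    {"customer_id": 1, "branch": "A"},
    {"customer_id": 2, "branch": "A"},
    {"customer_id": 3, "branch": "B"},
]
balances = [
    {"customer_id": 1, "branch": "A", "balance": 100},
    {"customer_id": 2, "branch": "A", "balance": 250},
    {"customer_id": 3, "branch": "B", "balance": 75},
]

joined_on_id = inner_join(customers, balances, "customer_id")  # one row per customer
joined_on_branch = inner_join(customers, balances, "branch")   # rows repeat per branch
```

Joining on the unique identifier keeps three rows, while joining on the assumed key yields five, because both customers of branch A are paired with both balances of branch A. On real big data volumes the same mistake multiplies rows by the size of each key group.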
[Figure 2: data storages should be scalable, dynamic and well
structured.]
Fig.2. The Data Storages Characteristics
First of all, the data storage must be well structured.
A data structure is a specialized format for organizing and
storing data. The data structure must be easy to
understand, easy to extract information from and easy to
change when required. The structure of the data must be
consistent [23]. Some data storage systems, such as
Oracle, have both physical and logical structures. Logical
structures are not known to the operating system. The
data analyst or application developer might be aware of
the logical structure, which helps the analysis. Problems
are harder to solve when the database administrator is not
aware of the link between the physical structure and the
logical structure.
When dealing with Big Data the structure of the data
storage might get too complicated. In that case many
database administrators will try to create the data
structure control files in order to find the required fields
or understand the impact of changes to the main
structure. At this point it is also important to create
rollback code. When the integration process fails and the
data storage is left in an inconsistent state after changes
have been made to the structure, rollback files help to
undo the damage done to the database. Oracle also
provides redo files that can reapply selected changes to
the database [24].
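The rollback idea can be illustrated with a small transactional load. The sketch below uses SQLite from the Python standard library rather than Oracle, so it shows the general all-or-nothing principle and not Oracle's rollback or redo files; the `balances` table and the function names are hypothetical.

```python
import sqlite3


def make_store() -> sqlite3.Connection:
    """Create an in-memory database with one hypothetical target table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE balances (customer_id INTEGER PRIMARY KEY, balance REAL NOT NULL)"
    )
    conn.commit()
    return conn


def integrate(conn: sqlite3.Connection, rows: list[tuple]) -> bool:
    """Insert all rows or none: a failing row rolls the whole batch back."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.executemany(
                "INSERT INTO balances (customer_id, balance) VALUES (?, ?)", rows
            )
        return True
    except sqlite3.IntegrityError:
        return False


def row_count(conn: sqlite3.Connection) -> int:
    """Number of rows currently stored in the target table."""
    return conn.execute("SELECT COUNT(*) FROM balances").fetchone()[0]
```

If one record in a batch violates a constraint, for example a missing balance, the rows inserted before it are undone as well, so a failed integration leaves the store as it was.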
Since big data variables and data types can change
rapidly, the data storage structure must also be dynamic
enough to accommodate changes in the data received.
The fast-changing nature of big data will also require the
data storage to change its scale rapidly. Logical structure
files are meant to help data administrators review
changes to the structure and adjust the data storage to
eliminate any inconsistencies. One of the advantages of
using clouds for storing data is their dynamic and
scalable nature. However, most companies have raised
privacy concerns about using clouds for storing data.
Most business units do not want to lose control over their
data. By using the stored data for marketing and
advertising purposes, companies offering cloud
computing have lost credibility. Weak data protection and
poor data backup policies have also caused losses for
cloud computing clients through failure to protect
sensitive customer data. These problems have created
serious concerns over the reliability and continuity of the
services provided by companies offering cloud storage.
Data protection laws have been passed in many countries
to address the above problems. However, it is important
to understand that cloud storage providers and the
companies using them might be operating in different
countries under different laws. Note that this type of
storage requires a high-speed connection for transferring
the extracted data, as discussed above [25-27].
Since storing big data can be expensive, some
companies have tried to store only part of the data
collected. This approach eliminates the need to outsource
data storage and decreases the cost of storing the data.
This simple technique might be helpful for small
companies that are focused on only one aspect of
customer service. However, for most companies,
deciding which data needs to be stored can be more
challenging than storing all of the data acquired. A
complex analysis may require ad hoc information that can
be hard to add to the dataset if the data storage has not
been designed to store that information [28].
C. Analysis
The problems that data analysts face when dealing with
an average-sized data set emerge in a more severe form
for big data. Most business decisions need to be made in
a timely manner. Companies that cannot adapt their
behavior to changes in the market in a timely manner will
likely face severe problems in the future. The decisions
made to adjust the company’s behavior must be based on
market realities. The challenge with big data is
that the data extracted for decisions may skew those
realities by more than most companies can afford. False
correlations and unknown data links create challenges in
interpreting data extraction results. When the data
scientist or senior management does not have a clear idea
of what the data should look like, such biased results are
hard to pinpoint [29].
A false temporary correlation will most likely result in
wrong decisions that can damage the company in the long
term. Such correlations are hard to dismiss unless they
are extremely implausible. For example, data analysis
might show that oil prices are highly correlated with the
demand for a medicine to fight alcoholism. At this stage
it is always good to keep in mind that correlation does not
imply causation. Hence the data analyst might want to
ignore this trend because the result is so unlikely or
counterintuitive. However, even in such a controversial
situation, market adjustments might create data that
seems extreme but is real. Alcohol prices are indeed
highly correlated with oil prices due to changes in
transportation cost, which affect the alcohol price and
hence consumption, which increases the demand for that
medicine. It is indeed hard to find false correlations and
links within big data.
Links within big data may be created by business
practices or by false data. Consider a new product
introduced by the company. A product such as a credit
monitoring subscription can be bought with a new loan or
transferred from the customer’s previous loan. During
creation of the metadata or database this link might be
lost. Hence a data analyst who wants to understand the
penetration rate or profitability of that product may
overlook the transferred subscriptions. The result will
most likely create a false view that can lead to a wrong
decision, such as cancellation of the product.
Management might try to eliminate the problem by using
sanity checks before making big decisions, requiring an
alternative analysis that supports the decision. However,
this takes time that most companies cannot afford.
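One inexpensive sanity check for a correlation such as the oil price example above is to compare the correlation of the raw series with the correlation of their period-to-period changes: two series that merely share an upward trend correlate strongly in levels but not in changes. The sketch below uses made-up numbers, and the variable names and the differencing check are illustrative assumptions, not a procedure taken from this article. A low correlation of changes does not prove the link is false; it only signals that the trend alone could explain it.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def differences(xs: list[float]) -> list[float]:
    """Period-to-period changes of a series."""
    return [later - earlier for earlier, later in zip(xs, xs[1:])]


# Made-up series: both rise over time, with unrelated short-term wiggles.
oil_price = [t + (1 if t % 2 == 0 else -1) for t in range(20)]
medicine_demand = [t + (1 if t % 4 < 2 else -1) for t in range(20)]

level_correlation = pearson(oil_price, medicine_demand)
change_correlation = pearson(differences(oil_price), differences(medicine_demand))
```

For these series the level correlation is above 0.9 while the correlation of changes is close to zero, which suggests that the shared trend, rather than a direct link, drives the apparent relationship.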
False results are not the only problem in big data
analytics. The main problem is the velocity of the
analysis. Velocity is one of the three V’s of big data,
which most authors define as Velocity, Variety and
Volume. Variety and Volume have already been
discussed in the data collection and integration steps. The
velocity of big data refers not only to the flow of data
from sources to the database but also to the flow of data
from the database to the end result of the analysis. The
speed of data extraction and data analysis is the most
important advantage that a company can gain over
competitors. Fast and correct strategic decisions will
increase the return on investment and the market share of
the company. However, this requires analyzing dynamic
big data extremely fast [30, 31].
Consider stock price fluctuations. An investor might
want to short sell a stock if the implied volatility falls
below a certain level. The implied volatility might be
calculated using the Black-Scholes model and then be
adjusted using the average volatility of all bonds in
that industry. The investor expects the calculations to be
done instantly so that the trade can be executed at a given
point in time. The investor will lose millions of dollars if
such simple calculations take more than a couple of
seconds. This process, which might seem simple at first
glance, requires extracting and analyzing an immense
amount of data. The problem can be partially solved
using high-end technologies. However, such
technologically advanced products cost companies
fortunes. Taking into account the abrupt changes in
technological trends and the pace of technological
innovation, the long-term benefit from such programs is
extremely low. This makes it challenging for small
companies or start-ups to adjust their businesses swiftly
and catch up with their competitors. Cost-based
optimization programs have been developed by big
companies to increase the efficiency of the use of such
technologies. For small companies such a swift shift to
the new trend is nearly impossible. Most of these
companies would prefer to adjust their capacity and enter
niche markets to survive. New features in technology
have created new ways to spread information from one
end to another. This has created market interdependencies
that make niche markets interconnected with bigger
market events. Hence scraping all the information from
big data has become a reality and a necessity for
survival [32].
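The implied-volatility calculation mentioned in the stock example above can be sketched for a single option as follows. This is a textbook illustration under simplifying assumptions (a European call, no dividends, a constant interest rate) and not the investor's actual procedure; the function names and the bisection bounds are choices made for this example, and the industry-wide adjustment described in the text is not included.

```python
from math import erf, exp, log, sqrt


def norm_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def bs_call(spot: float, strike: float, years: float, rate: float, vol: float) -> float:
    """Black-Scholes price of a European call without dividends."""
    d1 = (log(spot / strike) + (rate + 0.5 * vol**2) * years) / (vol * sqrt(years))
    d2 = d1 - vol * sqrt(years)
    return spot * norm_cdf(d1) - strike * exp(-rate * years) * norm_cdf(d2)


def implied_vol(price: float, spot: float, strike: float, years: float,
                rate: float, low: float = 1e-4, high: float = 5.0) -> float:
    """Find the volatility that reproduces `price`, by bisection.

    The call price rises with volatility, so the search interval can be halved
    each step; the bounds `low` and `high` are assumed to bracket the answer.
    """
    for _ in range(200):
        mid = (low + high) / 2.0
        if bs_call(spot, strike, years, rate, mid) > price:
            high = mid
        else:
            low = mid
        if high - low < 1e-8:
            break
    return (low + high) / 2.0
```

One such inversion is cheap; the velocity problem arises because it has to be repeated for every instrument and every price update, on top of the data extraction needed for the industry-wide adjustment.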
It is also important to consider the models that are
used to extract information from big data. Due to the size
of big data, common practice for most financial
institutions is to create the model using a development
sample and validate the results on a validation sample.
One of the problems is choosing the development and
validation samples. Given the size of big data and the
available computing capacity, the test sample might be
less than 25% of the whole data.
The test sample is then divided into the development
and validation samples, which further decreases the
modelling dataset. It must be noted that even though the
test statistics may indicate somewhat predictive models,
the samples chosen are not large enough to be fully
confident in the decisions made. Of course, even at this
stage companies must choose between how fast they can
get the result and how predictive they want the end result
to be. Note that the big data collected might include
thousands and thousands of variables. This makes it
impractical to build predictive models using all of these
variables.
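The two-stage sampling described above can be sketched as follows. The 25% sample fraction follows the text, while the 70/30 division between development and validation records, the fixed random seed and the function name `split_sample` are assumptions made for illustration.

```python
import random


def split_sample(record_ids, sample_fraction: float = 0.25,
                 development_share: float = 0.7, seed: int = 0):
    """Draw a random test sample, then divide it into development and validation parts."""
    rng = random.Random(seed)  # fixed seed so the split can be reproduced
    ids = list(record_ids)
    sample_size = round(len(ids) * sample_fraction)
    sample = rng.sample(ids, sample_size)  # sampling without replacement
    cut = round(sample_size * development_share)
    return sample[:cut], sample[cut:]
```

With 1,000 records and these defaults, only 175 records are used to build the model and 75 to validate it, which shows how quickly the modelling dataset shrinks.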
Companies bin the development datasets and use
logistic regression on the development dataset to find a
handful of variables that can be used as predictors. It
is extremely important to understand how these
regression models work. Commonly used logistic
regression procedures choose one predictive variable and
eliminate the variables that are correlated with it and have
lower predictive power in comparison.
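The elimination behavior described above can be imitated with a simple greedy correlation filter. The sketch below is a stand-in for the selection step of a logistic regression procedure, not an implementation of one: the predictive-power scores are supplied by the caller, and the 0.8 correlation cut-off, the variable names and the data are hypothetical.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def select_variables(columns: dict[str, list[float]], power: dict[str, float],
                     max_correlation: float = 0.8) -> list[str]:
    """Keep the strongest variables; drop weaker ones correlated with a kept one."""
    kept: list[str] = []
    for name in sorted(columns, key=lambda n: power[n], reverse=True):
        if all(abs(pearson(columns[name], columns[k])) <= max_correlation for k in kept):
            kept.append(name)
    return kept


# Made-up development data and made-up predictive-power scores.
columns = {
    "income": [1, 2, 3, 4, 5, 6],
    "spending": [2, 4, 6, 8, 10, 13],
    "age": [3, 1, 2, 3, 1, 2],
}
power = {"income": 0.6, "spending": 0.5, "age": 0.3}
selected = select_variables(columns, power)
```

Here `spending` is dropped only because it moves together with `income` in this sample. If that co-movement is a false correlation, a genuinely predictive variable has been lost.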
However, we have already underlined how common
false correlations are in big data. Hence false correlation
is not only an analytical problem that arises at the final
step of big data analytics; it also causes problems
during the model building process. The variables
eliminated from the model due to high correlation could
have been predictive ones that merely showed a false
correlation and false linkage to the selected variable.
Even after the modelling is complete, most senior
managers would choose to eliminate some of the
predictive variables in order to keep the model simple
for operational purposes. This further weakens the model
built on big data. Hence, even though big data captures
most of the variation in customer behavior and the models
built on this data should be more predictive, the end
result is not necessarily accurate [33-35].
companies. The companies also hope to attract the better
data analysts buy offering them high salary. This might
create new incentives for data analysts and scientists to
implement the new ideas that might solve some of the
most challenging problems in big data analytics.
Big data has already attracted a lot of attention and
many work on solving fundamental problems that can
change the way we perceive the reality right now. The
technologies that understand and process huge amount of
data to interact with humans are more and more of the
reality and attract customers. Markets has already created
the demand for such technologies. It is now up to the
companies to find ways to satisfy that demand.
D. Real World Application Problems
REFERENCES
The big data creates challenges at every step of the
analytical process for data scientists and management.
The companies are having bigger challenges in finding
qualified data scientist that can work with the big data
rather than the problems in the analytical process itself.
Companies spend millions of dollars every year to train
their staff to work with the big data. There are very few
analysts in the job market who can work with the big data.
However, there are even fewer people in the market
who can understand the data and the meaning underlying
the numbers. Most analysts have hard time to understand
and see the false data results. Data scientist must have
ability to descend patterns where others do not see any. A
study by McKinsey projects that “by 2018, the U.S. alone
may face a 50 percent to 60 percent gap between supply
and requisite demand of deep analytic talent” [36].
The shortage is already being felt across a broad
spectrum of industries, including aerospace, insurance,
pharmaceuticals, and finance. The negative trend has also
been noted buy high ranking universities which now offer
exclusive programs to train such analysts. The lack of
data analyst will also create more problems for the
companies that want to do ad hoc analysis on big data and
implement the new models in the market. The problems
discussed in this article will result in most companies
losing their customers as a result of skew in the data and
misinterpretation [37- 41].
III. CONCLUSION
[2]
[3]
[4]
[5]
[6]
[7]
[8]
[9]
Big data creates challenges at every step of
extraction and analysis. Despite the challenges described
in this article, companies will not stop using big data for
their business purposes, and companies and scientists will
not stop trying to solve these problems using various
analytical and scientific tools. The reality is that the
future depends on big data.
Companies that want to survive and operate in the
future will have to learn to work with big data and
solve these problems. Markets have already shifted in
reaction to big data trends. Even though the cost of big
data analytics leaves less hope for small companies, most
believe that small entrepreneurs and start-ups have a
better chance of adjusting their businesses due to the lack of a
complicated hierarchical structure in those small
REFERENCES
[1] L. Clifford, “Big data: How do your data grow?”, Nature,
vol.455, 2008, pp.28–29.
[2] The digital universe in 2020: Big Data, Bigger Digital
Shadows, and Biggest Growth in the Far East. Study
report, IDC, December 2012.
http://www.emc.com/collateral/analyst-reports/idc-thedigital-universe-in-2020.pdf
[3] V. Gopalkrishnan, D. Steier, H. Lewis, J. Guszcza, “Big
data, big business: bridging the gap”, in Proceedings of the
1st International Workshop on Big Data, Streams and
Heterogeneous Source Mining: Algorithms, Systems,
Programming Models and Applications (BigMine '12),
NY, USA, 2012, pp. 7–11.
[4] S. Madden, “From Databases to Big Data”, IEEE Internet
Computing, vol.16, no.3, 2012, pp. 4–6.
[5] K-H. Lee, Y-J. Lee, H. Choi, Y.D. Chung, B. Moon,
“Parallel data processing with MapReduce: a survey”,
ACM SIGMOD Record, vol.40, no.4, 2011, pp.11–20.
[6] K.H. Lee, Y.J. Lee, H. Choi, Y.D. Chung, B. Moon,
“Parallel data processing with MapReduce: a survey”,
ACM SIGMOD Record, 2012, vol. 40, no. 4, pp. 11–20.
[7] Y. Chen, S. Alspaugh, R.H. Katz, “Interactive analytical
processing in big data systems: A cross-industry study of
mapreduce workloads”, Proceedings of the VLDB
Endowment (PVLDB), 2012, vol. 5, no. 12, pp. 1802–
1813.
[8] W. Shang, Z.M. Jiang, H. Hemmati, B. Adams, A.E.
Hassan, P. Martin, “Assisting developers of big data
analytics applications when deploying on hadoop clouds”,
in Proceedings of the 2013 International Conference on
Software Engineering (ICSE '13), NJ, USA, 2013,
pp.402–411.
[9] C.L.P. Chen, C.-Y. Zhang, “Data-intensive applications,
challenges, techniques and technologies: A survey on Big
Data”, Information Sciences, vol. 275, no. 10, 2014, pp.
314–347.
[10] C. Statchuk, M. Iles, F. Thomas, “Big data and analytics”,
in Proceedings of the 2013 Conference of the Center for
Advanced Studies on Collaborative Research (CASCON
'13), USA, 2013, pp. 341–343.
[11] V. Mayer-Schonberger, K. Cukier, “Big Data: A
Revolution That Will Transform How We Live, Work
and Think”, Pub.: John Murray, 2013, p. 256.
[12] R. Birke, M. Björkqvist, L. Y. Chen, E. Smirni, T.
Engbersen, “(Big)data in a virtualized world: volume,
velocity, and variety in cloud datacenters”, in Proceedings
of the 12th USENIX conference on File and Storage
Technologies (FAST'14), USENIX Association Berkeley,
CA, USA, 2014, pp.177–189.
[13] Big Data - What Is It? 2013,
http://www.sas.com/big-data/what-is-big-data.html
[14] SAS 9.2 Language Reference: Dictionary 4th Edition,
Publisher SAS Institute Inc, Cary, NC, USA, 2011, p.
2356.
https://support.sas.com/documentation/cdl/en/lrdict/64316/PDF/default/lrdict.pdf
[15] K. Munir, M. Odeh, R. McClatchey, S. Khan, I. Habib,
“Semantic Information Retrieval from Distributed
Heterogeneous Data Sources”, Presented at the 4th
International Workshop on Frontiers of Information
Technology (FIT 2006), Islamabad, Pakistan, 2006, pp. 1–
6. http://arxiv.org/ftp/arxiv/papers/0707/0707.0745.pdf.
[16] O. Leif Katsuo, H. Hao, “World data transfer record back
in Danish hands”, Technical University of Denmark
(DTU), 2014, online resource,
http://www.dtu.dk/english/News/2014/07/Verdensrekordi-dataoverfoersel-paa-danske-haender-igen?id=bed76c33c9da-4214-91f3-c9ed3f8a0e24
[17] A. Cuzzocrea, “Privacy and Security of Big Data: Current
Challenges and Future Research Perspectives”, in
Proceedings of the First International Workshop on
Privacy and Security of Big Data (PSBD '14), NY, USA,
2014, pp. 45–47.
[18] R.T. Gasimova, “Security of global domain infrastructure
in the Internet”, Journal Problems of İnformation
Technology, "İnformasiya Texnologiyaları" Publishing
house, 2015, no. 2, pp. 61–67.
http://jpit.az/storage/files/article/71c96379ecf1714a60247e0206a0ba4b.pdf
[19] DigiCert is a U.S.-based Certificate Authority. It provides
SSL Certificates and SSL management tools, online
resource, https://www.digicert.com/ssl.htm
[20] S.V. Stacey, “Big Data creates big industry for storing
data”, online resource,
http://www.marketplace.org/topics/business/big-datacreates-big-industry-storing-data
[21] Google Inc. Announces Fourth Quarter and Fiscal Year
2013 Results,
http://investor.google.com/pdf/2013Q4_google_earnings_release.pdf
[22] T. Mastelic, A. Oleksiak, H. Claussen , I. Brandic, J-M.
Pierson, V.A. Vasilakos, “Cloud Computing: Survey on
Energy Efficiency”, Journal ACM Computing Surveys
(CSUR), NY, USA, vol. 47, no.2, 2015, pp. 1–36.
[23] K. Smith, L. Seligman, A. Rosenthal, C. Kurcz, M. Greer,
C. Macheret, M. Sexton, A. Eckstein, “"Big Metadata":
The Need for Principled Metadata Management in Big
Data Ecosystems”, in Proceedings of Workshop on Data
analytics in the Cloud (DanaC'14), NY, USA, 2014, pp.
1–4.
[24] R.M. Alguliev, R.T. Gasimova, “Identification of
Categorical Registration Data of Domain Names in Data
Warehouse Construction Task” Intelligent Control and
Automation, vol.4, no.2, 2013, pp. 227–234.
[25] M. L. Haas, “The Power Behind the Throne: Information
Integration in the Age of Data-Driven Discovery”, in
Proceedings of the 2015 ACM SIGMOD International
Conference on Management of Data (SIGMOD '15), NY,
USA, 2015, p. 661.
[26] Oracle Database Online Documentation 11 g Release 2
(11.2), E10897-10, 2012, Primary Author: Bert Rich.
http://docs.oracle.com/cd/E11882_01/server.112/e10897.pdf
[27] D. Lin, A. Squicciarini, “Data protection models for
service provisioning in the cloud”, in Proceedings of the
15th ACM symposium on Access control models and
technologies (SACMAT '10), NY, USA, 2010, pp.183–
192.
[28] M. L. Kaufman, “Data Security in the World of Cloud
Computing”, Journal IEEE Security and Privacy, vol.7, no.
4, 2009, pp. 61–64.
[29] C. Marinescu Dan. Cloud Computing: Theory and
Practice. Publisher: Morgan Kaufmann, 1 edition, San
Francisco, CA, USA, 2013, p. 416.
[30] D. Assunção Marcos, N. Rodrigo, Bianchi Silvia, A.S.
Netto Marco, Buyya Rajkumar, “Big Data computing and
clouds: Trends and Future Directions” Journal of Parallel
and Distributed Computing, vol.79, 2015, p. 3–15.
[31] B. Marr, “Big Data: Using Smart Big Data, Analytics and
Metrics to Make Better Decisions and Improve
Performance”, Pub. John Wiley & Sons, Ltd.; 1 edition,
2015, p. 258.
[32] D-H. Tran, M.M. Gaber, K-U. Sattler, “Change detection
in streaming data in the era of big data: models and
issues”, ACM SIGKDD Explorations Newsletter - Special
issue on big data, vol. 16, no. 1, 2014, NY, USA, pp. 30–38.
[33] L. Doug, “3D Data Management: Controlling Data
Volume, Velocity and Variety”, Technical report, META
Group, Inc (now Gartner, Inc.), February 2001, pp.1–3.
http://blogs.gartner.com/doug-laney/files/2012/01/ad9493D-Data-Management-Controlling-Data-VolumeVelocity-and-Variety.pdf
[34] K. Slagter, C-H. Hsu, Y-C. Chung Zhang Daqiang, “An
improved partitioning mechanism for optimizing massive
data analysis using MapReduce” Journal of
Supercomputing, vol.66, no.1, 2013, pp.539–555.
[35] A. Ashraf, B. Shivnath, “Workload management for big
data analytics”, in Proceedings of the 2013 ACM
SIGMOD International Conference on Management of
Data (SIGMOD '13), NY, USA, 2013, pp. 929–932.
[36] J. Manyika, M. Chui, B. Brown, J. Bughin, R. Dobbs, C.
Roxburgh, A. Hung Byers, “Big data: The next frontier
for innovation, competition, and productivity”, Analyst
report, McKinsey Global Institute, May 2011, online
resource,
http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation
[37] H. Chen, R.H.L. Chiang, V.C. Storey, “Business
intelligence and analytics: from big data to big impact”,
Journal Management Information Systems Quarterly,
vol.36, no.4, 2012, pp.1165–1188.
[38] R. Ramasamy, “Towards big data analytics framework:
ICT professionals salary profile compilation perspective”,
in Proceedings of the 8th International Conference on
Theory and Practice of Electronic Governance (ICEGOV
'14), NY, USA, 2014, pp. 450–451.
[39] A. Labrinidis, H. V. Jagadish, “Challenges and
Opportunities with Big Data”, Proceedings of the VLDB
Endowment, vol. 5, no.12, 2012, pp. 2032–2033.
[40] A. Baaziz, L. Quoniam, “How to use Big Data
technologies to optimize operations in Upstream
Petroleum Industry”, International Journal of Innovation,
2013, vol. 1, no. 1, pp. 19–29.
[41] K. Karthik, G. Kollias, V. Kumar, A. Grama, “Trends in
Big Data analytics” Journal of Parallel and Distributed
Computing, 2014, vol. 74, no. 7, pp. 2561–2573.
Authors’ Profiles
Rasim M. Alguliyev. He is director of the
Institute of Information Technology of
Azerbaijan National Academy of Sciences
(ANAS) and academician-secretary of
ANAS. He is professor and full member of
ANAS. His research interests include: Information
Security, E-government, Information Society, Social
Network Mining and Analysis,
Cloud Computing, Evolutionary and Swarm Optimization, Data
Mining, Text Mining, Web Mining, Social Network Analysis,
Big Data Analytics, Scientometrics and Bibliometrics.
Rena T. Gasimova. She is head of sector at
the Institute of Information Technology of
ANAS. Her research interests include: Data
Mining, Big Data Analytics, Domain name
system, Decision Support Systems, Data
Warehouse.
Rahim N. Abbasli. He is Senior Risk
Analyst at one of the leading Canadian
organizations specializing in subprime credits.
He is developing credit models, analyzing big
data, developing algorithms and adjusting
underwriting rules according to the results.
How to cite this paper: Rasim M. Alguliyev, Rena T.
Gasimova, Rahim N. Abbaslı, "The Obstacles in Big Data
Process", International Journal of Information Technology and
Computer Science (IJITCS), Vol.9, No.4, pp.31-38, 2017. DOI:
10.5815/ijitcs.2017.04.05
Copyright © 2017 MECS