Optimization of The Search Graph Using Hadoop and Linux Operating System
Abstract—The evolution of social networking sites has posed many challenges for technology firms and researchers [1]. Social networking sites are gaining popularity among users across the globe, and the networking of individuals is growing very rapidly. People search on social networking sites to find old friends and other interesting people; this search operation runs in the background of the social network and needs to be optimized to enhance the user's experience. In this paper a MapReduce-based search algorithm was designed and processed on a selected Hadoop cluster. The search operation on the Hadoop framework can be affected by system parameters such as memory, I/O and CPU performance. Here one memory-related setting, swappiness (which controls the use of swap space), is varied while all other parameters are kept constant in order to optimize the search operation. The search operation is optimized by processing BigData of different sizes while varying the swappiness, which governs the system's use of virtual memory.

Keywords—Social Network; Search Graph algorithms; BigData; Hadoop; MapReduce; Swappiness; Linux; Operating System.

I. INTRODUCTION

With the advent of the twenty-first century, several advancements in computing and networking technologies have changed the way computing is done in business. Many of these trends are expressed through managerial computing, such as Cloud computing and Distributed computing. The social network, which started at the grassroots level, has been growing quickly in several sectors [1]; it is a combination of many computing technologies, including Cloud computing, Distributed computing and Parallel computing. Social networks lead to real business models and provide an efficient platform for new emerging businesses. Some examples of social network platforms are MySpace, Bebo, FaceBook, and LinkedIn [1].

Internet users are empowered to connect and access any information through social networking, thanks to the rapid development and integration of various computing concepts. The number of users is increasing at a fast pace, and at the same time the users themselves are evolving rapidly. The development and emergence of social networks have posed various research problems for researchers, and hence systematic and swift solutions are warranted in today's scenario.

Humongous amounts of data are stored across the globe in social networking sites; this data is interacting, assimilating and coordinating [10]. Such huge data may pose issues related to fragmentation of the social graph on the networks, as well as data security and ethical issues in the networked globe [9]. The data available on social networks can be used for various business solutions such as marketing, data management and other fields.

One of the main backbone technologies in social networking is the graph search algorithm. It runs in the background of the social network for searching and networking operations. Today, the main problems faced by the graph search algorithm are inefficiency of the system, the unstructured-graph problem, poor locality, and BigData access for computation [2].

In this research, two of these problem areas are targeted: inefficiency of the system and BigData access for computation [5]. The BigData computation is optimized by selecting an appropriate configuration of the system. The system is configured as a Hadoop cluster, and the BigData computation is processed by the designed graph search algorithm [6]. The sample BigData sizes are 2GB, 4GB, 8GB and 10GB, and the system configuration is kept constant while processing all data sizes. The processing of the BigData is optimized, in terms of time complexity, by selecting an optimum value of swappiness (a form of virtual-memory control). The search algorithm is designed to process the BigData on the Hadoop cluster, and optimized performance of the algorithm is achieved for the designed cluster using this operating-system memory functionality [16].

The remainder of the paper is organized as follows. Section I has motivated the choice of social networks and the graph search algorithm as a research area and discussed the problems faced. Section II focuses on the concepts involved in the research and includes the designed graph search algorithm. Section III discusses the further extension of the algorithm to optimize its performance through a new function (swappiness). Section IV presents the experimental results justifying the optimization of the algorithm, and Section V concludes the paper, followed by the acknowledgment and references.
II. DEVELOPMENT OF THE SEARCH ALGORITHM FOR OPTIMIZATION ON THE HADOOP FRAMEWORK

Section I discussed the problems in detail; this section focuses on solutions to the defined problems. The process of finding a solution can be divided into several steps: developing an algorithm, executing it on the Hadoop framework, finding the parameters affecting the processing, selecting the parameters for optimization, and finally concluding on an efficient and effective method of optimizing the developed algorithm.

Before going further, it is necessary to know the basic terminology related to the developed algorithm.

A. What is a Distributed System?

The data is distributed across different systems, and the data search operations are carried out by connecting the various systems (defined as nodes). The advantage of this arrangement is that data can be retrieved from and stored anywhere on the network.

"A distributed system is a collection of various computers, connected through a network with the help of software called middleware. The system helps to share the resources of all the computers and is perceived as a single system with good, fast and integrated computing facility" [14].

B. What is Parallel Processing?

The concept focuses on processing huge data simultaneously on the same network, and it is widely applicable to BigData processing in social networks.

"Parallel processing solves BigData problems by fragmenting them into smaller ones and solving them at the same time. Parallel processing was considered mainly for solving data-intensive problems encountered in computing problems related to the fields of engineering and science" [4].

E. What is Hadoop?

Hadoop is based on the Google paper on MapReduce published in 2004, and its development started in 2005. At the time, Hadoop was developed to support the open-source web search engine project called Nutch. Today Hadoop is the best-known MapReduce framework in the market [8]. It is a Java-based program, but it also supports Python and Ruby for the development of algorithms [3].

F. Development of the Algorithm

Social networking sites have become a prominent tool to connect with people across the globe. Users keep searching for people to expand their network, and this search operation happens after a lot of processing of the data. The search operation on social networking sites needs to be fast to enhance the user's experience; therefore all technology companies working on Cloud computing are striving to optimize the search operation.

In this research, the search algorithm is developed on the MapReduce concept and processed on the Hadoop framework, as the Hadoop framework is open source and is a combination of distributed systems and parallel processing. It configures a cluster and can divide the tasks to work independently and in parallel, which helps in swift and efficient processing of the BigData [11].

Algorithm:

I. Mapper:
Step-1: Enter the keyword to search.
Step-2: The keyword will be searched on the networking sites.
Step-3: The searched keyword data is stored in a variable.
Step-4: Restore all URL and text data in a new variable.
Sr. No. | Swappiness    | 2GB Data  | 4GB Data | 8GB Data | 10GB Data
1.      | 0             | 1m28.45s* | 2m32.77s | 4m23.08s | 14m5.75s
5.      | 40            | 1m30.33s  | 2m33.10s | 4m22.94s | 12m39.60s
6.      | 50            | 1m26.26s  | 2m28.65s | 4m15.93s | 10m59.60s
7.      | 60 (Default)  | 1m30.16s  | 2m31.65s | 4m24.90s | 11m12.18s

Fig 4.1- Processing time for 2GB Data against swappiness
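Each row above comes from timing the same MapReduce job after changing the kernel's vm.swappiness value (on Linux this is done, with root privileges, via `sysctl vm.swappiness=<value>` or by writing to /proc/sys/vm/swappiness). A small sketch of the experiment's bookkeeping, with helpers to parse the `XmY.YYs` timings reported above and compute the percentage improvement; the helper names are ours, not from the paper:

```python
# Helpers for the swappiness-experiment bookkeeping (names are illustrative).
# The swappiness itself would be changed per run, e.g.:
#   sudo sysctl vm.swappiness=50
import re


def parse_time(stamp):
    """Convert a 'XmY.YYs' timing such as '1m26.26s' into seconds."""
    m = re.fullmatch(r"(\d+)m([\d.]+)s", stamp)
    if m is None:
        raise ValueError(f"unrecognized timing: {stamp!r}")
    return int(m.group(1)) * 60 + float(m.group(2))


def improvement_pct(default_time, tuned_time):
    """Percentage reduction in processing time relative to the default run."""
    return (default_time - tuned_time) / default_time * 100.0


# Example with the 2GB column: default swappiness 60 vs. the optimal 50.
default_2gb = parse_time("1m30.16s")  # 90.16 s
tuned_2gb = parse_time("1m26.26s")    # 86.26 s
print(improvement_pct(default_2gb, tuned_2gb))  # roughly the paper's 4.3 %
```

Running the same computation over the 4GB, 8GB and 10GB columns reproduces the improvement percentages reported later in Table 4.2.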
The second set of the experiment is also processed on the same Hadoop cluster; the swappiness is changed in a similar fashion as in the first experiment. Fig. 4.2 shows the results of the experiment.

The third set of the experiment processes the 8GB BigData on the same Hadoop cluster. Fig. 4.3 shows the results for the 8GB data.

Fig. 4.4- Processing time for 10GB Data against swappiness

Table 4.2 presents the comparison of the processing time for BigData of 2GB, 4GB, 8GB and 10GB size at the default swappiness of the Hadoop cluster and at the optimal value of swappiness. The task of optimization (reduction in the processing time of the BigData) is achieved in all the sets of the experiment, and the optimal range of swappiness for the experimental setup is 45-50%. The processing-time improvement for the 2GB, 4GB, 8GB and 10GB data is 4.32%, 1.98%, 3.38% and 1.87% respectively at the optimal swappiness range (45-50%) as compared to the default system value of swappiness (60%).

TABLE-4.2 COMPARISON OF PROCESSING TIME FOR SWAPPINESS AT DEFAULT & MODIFIED VALUE FOR 2GB, 4GB, 8GB AND 10GB BIGDATA FOR HADOOP CLUSTER.

V. CONCLUSION

All the experiments run with the designed MapReduce search algorithm on the configured Hadoop cluster show that optimization of BigData processing can be achieved by setting the virtual-memory parameter (swappiness) within the optimal range.

These results can be extrapolated to the search operation of social networking sites, where the searching time can be optimized to enhance the user's experience. The results also show that the improvement in processing time varies between 1.5% and 4.5% for the given sizes of BigData when the swappiness is fixed in the optimal range.

The sample data taken for processing varied from 2 to 10 GB; this is small compared to the actual BigData problems of the real world, but due to resource constraints larger data sizes could not be taken for the experiment. The processing of the 10 GB data alone required a Hadoop cluster of 5 systems with 2GB RAM, a 2-core processor and a 100 Mbps network connection. Applying the optimal swappiness range to bigger data sizes would fetch even more fruitful results in terms of optimization of the search operation on social networking sites.

Extremely low (below 30%) and extremely high (above 70%) swappiness values gave a reduction in processing time in some rare cases, but they are not considered, since low swappiness hinders multitasking and high swappiness slows down the BigData processing.

This behaviour of low and high swappiness percentages during BigData processing can be ascertained by going deeper into the memory and processor functionalities of the operating system. For future work, researchers may take other functions of the system memory and processor for optimization of BigData processing.

VI. ACKNOWLEDGMENT

I would like to extend my deepest gratitude to my guide Dr. Sunita Choudhry for her valuable guidance, suggestions and motivation. Mr. Vimal Daga also helped during the implementation phase of the project and deserves a big thank you and special mention in this work.
Sr. No. | Data Size | Time taken (sec) to process Data at Default Swappiness (60%) | Time taken (sec) to process Data at Swappiness range (45-50%) | % Improvement in processing time (Optimization)
1       | 2 GB      | 90.16   | 86.269  | 4.32%
2       | 4 GB      | 151.658 | 148.654 | 1.98%
3       | 8 GB      | 264.903 | 255.938 | 3.38%
4       | 10 GB     | 672.188 | 659.609 | 1.87%

VII. REFERENCES

[1] Manoj Parameswaran and Andrew B. Whinston, "Research Issues in Social Computing", Journal of the Association for Information Systems, Vol. 8, Issue 6, 2007, p. 22.
[2] Preeti Narooka and Sunita Chaodhary, "Graph Search Process in Social Networks and its Challenges", IJCSET, Vol. 6, Issue 6, June 2016, pp. 228-232.
[3] Firat Tekiner and John A. Keane, "Big Data Framework", International Conference on Systems, Man and Cybernetics, IEEE, 2013, pp. 1494-1499.
[4] Preeti Narooka and Sunita Chaodhary, "Paradigm Shift of Big-Data Application in Cloud Computing", International Journal of Advanced Research in Computer and Communication Engineering, Vol. 5, Issue 5, May 2016, pp. 515-521.
[5] Bo Li and Raj Jain, "Survey of Recent Research Progress and Issues in Big Data", 2013.
2017 International Conference on Nascent Technologies in the Engineering Field (ICNTE-2017)