
2017 International Conference on Nascent Technologies in the Engineering Field (ICNTE-2017)

Optimization of the Search Graph Using Hadoop and Linux Operating System

Ms. Preeti Narooka
Ph.D. Student, Computer Science, Banasthali University, Jaipur, India
preeti.narooka@gmail.com

Dr. Sunita Choudhary
Former Associate Professor, Computer Science, Banasthali University, Jaipur, India
sunitaburdak@yahoo.co.in

Abstract—The evolution of Social networking sites has posed a lot of challenges for technology firms and researchers [1]. Social networking sites are gaining popularity amongst users across the globe and the networking of individuals is increasing very rapidly. People search on Social networking sites to find old friends and other interesting people; this search operation runs in the background of the Social network. The search operation needs to be optimized to enhance the user's experience. In this research paper a MapReduce-based search algorithm was designed and processed on a selected Hadoop cluster. The search operation on the Hadoop framework can be affected by system parameters like memory, I/O and CPU performance. In this research paper one functionality of memory, swappiness (which controls the use of swap space), is changed while all the other parameters are kept constant to achieve the optimization of the search operation. The search operation is optimized by processing BigData of different sizes and varying the swappiness, which behaves as the virtual memory of the system.

Keywords—Social Network; Search Graph algorithms; BigData; Hadoop; MapReduce; Swappiness; Linux; Operating System.

I. INTRODUCTION

With the advent of the twenty-first century, several advancements in computing and networking technologies have changed the way of computing in business. Many of these changes and trends are manifested through managerial computing, such as Cloud computing, Distributed computing, and many more. The Social network, initiated at the grassroots level, has been growing quickly in several sectors [1], and it is a combination of many computing technologies like Cloud computing, Distributed computing and Parallel computing. Social networks are leading to real business models and provide an efficient platform for new emerging businesses. Some examples of Social network platforms are MySpace, Bebo, FaceBook, and LinkedIn [1].

Internet users are empowered to connect and access any information through social networking by the rapid development and integration of various computing concepts. The number of users is increasing at a fast pace and, at the same time, the involvement of users is increasing rapidly. The development & emergence of social networks has posed various research problems and issues for researchers, and hence systematic and swift solutions are warranted in today's scenario.

Humongous data is stored across the globe in the social networking sites; this data is interacting, assimilating and coordinating [10]. Hence this huge data may pose issues related to fragmentation of the social graph on the networks, and data security & ethical issues in the networked globe [9]. The data available on social networks could be used for various business solutions like marketing, data management and other fields.

In Social networking, one of the main backbone technologies is the graph search algorithm. It runs in the background of the Social network for searching and networking operations. Today, the main problems faced by the graph search algorithm are inefficiency of the system, the unstructured graph problem, poor locality, BigData access to computation, etc. [2].

In this research, two problem areas, inefficiency of the system and BigData access to computation, are targeted [5]. The BigData computation is optimized by selecting an appropriate configuration of the system. The system is configured in the form of a Hadoop cluster and the computation of BigData is processed by the designed graph search algorithm [6]. The sample BigData sizes are 2GB, 4GB, 8GB & 10GB, and the system configuration is kept constant for processing all sizes of data. The processing of BigData is optimized, in terms of time complexity, by selecting an optimum value of swappiness (a form of virtual memory). The search algorithm is designed for processing the BigData on the Hadoop cluster and is able to achieve optimized performance on the designed cluster using the operating system memory functionality [16].

The paper is divided into six sections. Section I motivates taking the Social network and the search graph algorithm as a research area, and also discusses the problems faced in it. Section II focuses on the concepts involved in the research and also includes the designed graph search algorithm. Section III discusses the further expansion of the algorithm to optimize the performance using a new function (swappiness). Section IV shows the experimental results justifying the optimization of the algorithm, and Sections V and VI are the conclusion and references of the paper.


II. DEVELOPMENT OF THE SEARCH ALGORITHM FOR OPTIMIZATION ON THE HADOOP FRAMEWORK

In section I, the problems were discussed in detail; in this section the focus is on solutions for the defined problems. The process of finding a solution can be divided into various steps, like developing an algorithm, executing it on the Hadoop framework, finding the parameters affecting the processing, selecting the parameters for optimization & finally concluding on the efficient & effective method of optimizing the developed algorithm.

Before going further, it is necessary to know the basic terminologies related to the developed algorithm.

A. What is a Distributed System?

The data is distributed across different systems and the data search operations are carried out by connecting various systems (defined as nodes). The advantage of this system is that data can be retrieved & stored anywhere on the network. "A distributed system is a collection of various computers, connected through a network with the help of software called middleware. The system helps to share the various resources of all the computers and is perceived as a single system with good, fast and integrated computing facility" [14].

B. What is Parallel Processing?

The concept is focused on the processing of huge data simultaneously on the same network. This concept is widely applicable in BigData processing in social networks. "Parallel processing solves BigData problems by fragmenting them into smaller ones and solving them at the same time. Parallel processing was considered mainly for solving data-intensive problems encountered in computing problems related to the fields of engineering and science" [4].

C. What is Cloud Computing?

The principle of Cloud computing is that of a bigger platform of Distributed systems on the internet. The nodes can be connected on a large scale; this concept has a vital usability in the Social networking sites [9]. "Cloud computing is based on a principle of sharing computing resources on the internet; the cloud computing concept promotes a cost-effective solution for businesses to set up a virtual office with the convenience of accessing the system at any place at the desired time."

D. What is MapReduce?

Map/Reduce was first implemented by the Google search engine; the technology helps in indexing and analyzing BigData. Map/Reduce distributes the data across a large cluster and can divide the tasks to work independently & in parallel. This helps in swift & efficient processing of the BigData [11].

E. What is Hadoop?

Hadoop is based on the Google paper on MapReduce published in 2004, and its development started in 2005. At the time, Hadoop was developed to support the open-source web search engine project called Nutch. Today Hadoop is the best-known MapReduce framework in the market [8]. It is a Java-based program, but it also supports Python and Ruby programming for the development of algorithms [3].

F. Development of the Algorithm

Social networking sites have become a prominent tool to connect with people across the globe. Users keep on searching for people to expand their network, and this search operation happens after a lot of processing of the data. The search operation on Social networking sites needs to be fast to enhance the user's experience; therefore all technology companies working on Cloud computing are striving to optimize the search operation.

In this research, the search algorithm is developed on the MapReduce concept & processed on the Hadoop framework. As the Hadoop framework is open source and is a combination of Distributed systems and Parallel processing (the MapReduce concept), the algorithm is processed on it. The Social networking site/webpage data will be stored on the Distributed system; the stored data is further used for optimization of the search operation.

The following steps are undertaken to develop the algorithm; it is divided into two stages, Mapper & Reducer:

Algorithm:

I. Mapper:
Step-1: Enter the keyword to search.
Step-2: The keyword will be searched on the networking sites.
Step-3: Searched keyword data is stored in a variable.
Step-4: Store all URL and text data in a new variable.

II. Reducer:
Step-5: Calculate the number of occurrences of that particular keyword on the webpage/networking site.

The developed algorithm was processed in a Hadoop cluster, and parameters like the system memory, the input-output system & the CPU were monitored to optimize the algorithm.

[Figure: flow diagram. Distributed System/Computing and Parallel Processing lead to MapReduce Processing; Hadoop can run on the Cloud (Cloud Computing); the Hadoop Framework processes jobs in the Hadoop Cluster, which runs the Graph Search Algorithm.]

Fig 2.1: Concept of the emerging technologies as a flow diagram

Figure 2.1 is the flow diagram to understand the concept of all the technologies used in the research. It was observed that the system memory is the key parameter affecting the processing time of BigData; hence, in the next step, the system memory was microscopically focused on for the optimization of the algorithm [15]. Further study shows that the swappiness of the system memory could help in understanding the advanced-level optimization of the designed search algorithm.

III. OPTIMIZATION OF THE SEARCH ALGORITHM & EFFECT OF SWAPPINESS ON THE HADOOP FRAMEWORK

As mentioned in section II, understanding swappiness is very important to set up the next level of experiments for search algorithm optimization.

A. Swappiness

Swappiness is an important parameter of the Linux kernel and its value lies between "0" and "100". The swap itself is basically a devoted space on the hard drive, generally two times the capacity of the random access memory (RAM) of the system [15]. The Linux kernel uses this swap space by swapping chunks from the RAM to the swap; this frees RAM for other active and important processes.

The value of swappiness determines how much and how frequently the Linux kernel will copy RAM contents to the swap memory. The system-defined default value of swappiness is "60" and it can vary between "0" and "100" [17]. A higher value of swappiness denotes that the kernel will be more aggressive in unmapping mapped pages, and on the counter side a lower value of swappiness shows that the kernel will not tend to unmap the mapped pages [18], [16].
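As a concrete illustration of the parameter itself (not taken from the paper), the current swappiness value can be read and changed through the /proc interface that the Linux references cited above describe. The snippet below is a minimal sketch; it assumes a Linux machine and root privileges for the write, and the value 50 is used only because it lies in the range the later experiments converge on.

```python
# Minimal sketch: read and set vm.swappiness via /proc (Linux only).
# Writing requires root; the change is not persistent across reboots.
from pathlib import Path

SWAPPINESS = Path("/proc/sys/vm/swappiness")

def get_swappiness() -> int:
    """Return the currently configured vm.swappiness value."""
    return int(SWAPPINESS.read_text().strip())

def set_swappiness(value: int) -> None:
    """Set vm.swappiness for the running kernel."""
    if not 0 <= value <= 100:
        raise ValueError("swappiness must be between 0 and 100")
    SWAPPINESS.write_text(str(value))

if __name__ == "__main__":
    print("current vm.swappiness:", get_swappiness())
    # set_swappiness(50)  # uncomment when running as root
```

The same change is commonly made with `sysctl vm.swappiness=50` or persisted in /etc/sysctl.conf; the /proc path above is the interface those tools write to.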
B. Why Change Swappiness?

The default value cannot be a perfect-fit solution for all individual cases; it depends upon the hardware specifications and user needs. Hence, it is very important to understand the functionality of swappiness; this function drives the efficiency of the operating system in BigData processing. Having said that, it is required to focus on the swappiness function for different system configurations to optimize the user experience on the social networks [17], [12].

Let us elaborate the applicability of swappiness with an example.

Example: BigData of 10 GB is taken for processing; the system configuration has 4GB RAM and 10 GB of swap memory. The 10GB data is executed with the default swappiness value of "60"; it was observed that once about 40% of the RAM is used, the processing will be handled by the swap memory [3]. In the second iteration the swappiness is changed from the default to "100"; this resulted in slow processing of the data. The basic reason for the slow processing is that the swap memory acts as virtual memory, but it actually takes its space from the hard disk only [13]. Similarly, when the swappiness value is "10", the system will consume 90% of the memory and only 10% of the swap space. This means processing will be faster than with "vm.swappiness=100" [7].

As discussed in section II of the paper, the advanced stage of optimization of the developed MapReduce algorithm is achieved by changing the swappiness of the system. Therefore an experiment is set up by varying the value of swappiness from "0" to "100" on the Hadoop cluster. The optimum value of swappiness is observed while running the algorithm, and the results of all the swappiness experiments are discussed in section IV.

IV. OPTIMIZED ALGORITHM'S RESULTS BASED ON SWAPPINESS IN THE HADOOP CLUSTER

To study the effect of swappiness on the optimized performance of the algorithm, a cluster of five systems was created; three out of the five performed the task of mapper/reducer (Task Trackers), one is the job tracker (controller of the mappers/reducers) and one is the name node (Hadoop HDFS).

The configuration of each of the three mapper/reducer systems, the job tracker and the name node is 2GB RAM, a 2-core processor & a 100 Mbps network. BigData of different sizes, viz. 2GB, 4GB, 8GB & 10GB, is processed on the above-defined Hadoop cluster. The swappiness is varied from "0" to "100" and results are obtained in terms of the time (to a precision of milliseconds) taken to process the BigData.

Table 4.1 presents the variation in swappiness and the processing time for the various sizes of BigData. The table presents the consolidated results of all the iterations, and the details of each iteration are represented in graphical format in the next few paragraphs. It can be seen from the table that the minimum processing time for the BigData of different sizes falls in the swappiness range of 45-50. Each of the iterations is executed on the Hadoop cluster 3 times and the average BigData processing time is taken.
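The measurement loop described above (set a swappiness value, run the same MapReduce job, record the wall-clock time, average over three runs) could be scripted roughly as follows. This is only a sketch: the job command line, the HDFS paths and the step of 10 are illustrative assumptions, since the exact commands the authors used are not given in the paper.

```python
# Sketch of the measurement loop: vary vm.swappiness, run the same
# Hadoop job each time, and record the elapsed wall-clock time.
# The job command line and HDFS paths below are assumptions.
import subprocess
import time
from pathlib import Path

JOB = ["hadoop", "jar", "search-job.jar", "/data/pages", "/out/run"]

def run_once(swappiness: int) -> float:
    Path("/proc/sys/vm/swappiness").write_text(str(swappiness))
    # Clear the previous output directory so the job can rerun.
    subprocess.run(["hdfs", "dfs", "-rm", "-r", "-f", "/out/run"], check=True)
    start = time.monotonic()
    subprocess.run(JOB, check=True)
    return time.monotonic() - start

if __name__ == "__main__":
    for value in range(0, 101, 10):
        # Average of three runs, as done for Table 4.1.
        times = [run_once(value) for _ in range(3)]
        print(f"swappiness={value:3d}  avg={sum(times) / 3:.2f}s")
```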

TABLE 4.1: THREE TASK TRACKERS WITH 2GB RAM; DATA SIZES OF 2GB, 4GB, 8GB AND 10GB

Sr. No.   Swappiness      2GB Data     4GB Data     8GB Data     10GB Data
1         0               1m28.45s*    2m32.77s     4m23.08s     14m5.75s
2         10              1m30.59s     2m31.98s     4m25.13s     11m32.31s
3         20              1m26.23s     2m32.49s     4m25.93s     12m17.16s
4         30              1m31.55s     2m32.82s     4m23.28s     11m43.57s
5         40              1m30.33s     2m33.10s     4m22.94s     12m39.60s
6         50              1m26.26s     2m28.65s     4m15.93s     10m59.60s
7         60 (Default)    1m30.16s     2m31.65s     4m24.90s     11m12.18s
8         70              1m30.41s     2m31.62s     4m24.99s     11m17.37s
9         80              1m30.69s     2m31.55s     4m20.98s     13m3.28s
10        90              1m26.61s     2m33.69s     4m23.97s     10m47.51s
11        100             1m30.58s     2m36.81s     4m23.97s     11m0.26s

*m = minutes, s = seconds
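The improvement figures quoted later in the paper follow directly from this table: converting the minute/second entries to seconds and comparing the time in the 45-50 band with the time at the default swappiness of 60. The small sketch below reproduces that arithmetic for the 2GB column; the two times are copied from Table 4.1, while the parsing helper is our own illustration.

```python
# Reproduce the percentage-improvement arithmetic from Table 4.1
# (2GB column): compare the time at swappiness 50 with the default (60).
import re

def to_seconds(stamp: str) -> float:
    """Convert an 'XmY.YYs' table entry into seconds."""
    minutes, seconds = re.fullmatch(r"(\d+)m([\d.]+)s", stamp).groups()
    return int(minutes) * 60 + float(seconds)

default = to_seconds("1m30.16s")   # swappiness 60 (default), 2GB data
optimal = to_seconds("1m26.26s")   # swappiness 50, 2GB data

improvement = (default - optimal) / default * 100
print(f"default {default:.2f}s, optimal {optimal:.2f}s, "
      f"improvement {improvement:.2f}%")   # about 4.3%
```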

Fig. 4.1: Processing time for 2GB Data against swappiness

In the first set of experiments, 2GB of BigData is processed on the Hadoop cluster of five systems (as described above) and the swappiness is changed to understand its impact on the processing time of the data. It can be seen from Fig. 4.1 that in the swappiness range of 45-50 (data point shown in green) the processing time is reduced, while the time taken at the default value of swappiness is higher and is shown in blue in the graph. The graph also shows some dips in the processing time at 20% and 80% swappiness, but these values are rejected as extreme lower & higher values of swappiness: a low swappiness value may cause problems during multitasking of the system due to the limited memory, and high swappiness values may slow down the process since swap memory use is at its peak.

Fig. 4.2: Processing time for 4GB Data against swappiness

The second set of experiments is also processed on the same Hadoop cluster and the swappiness is changed in a similar fashion as in the first experiment; Fig. 4.2 shows the results of the experiment.

Fig. 4.3: Processing time for 8GB Data against swappiness

The third set of experiments is on 8GB BigData processing on the same Hadoop cluster. Fig. 4.3 shows the results for the 8GB data.

Fig. 4.4: Processing time for 10GB Data against swappiness

The fourth set of experiments is on 10GB BigData processing on the same Hadoop cluster. Fig. 4.4 shows the results for the 10GB data. The processing time, plotted on the y-axis against the % virtual memory (swappiness) in Fig. 4.1 to 4.4, shows that the processing time of all four BigData sizes, viz. 2GB, 4GB, 8GB & 10GB, is reduced on the configured Hadoop cluster taken for the experiment.
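Figures 4.1 to 4.4 themselves are simple processing-time-versus-swappiness curves, so plots of the same shape can be regenerated from the Table 4.1 data. The sketch below does this for the 2GB column; the plotting tool used by the authors is not stated, and matplotlib is used here purely for illustration.

```python
# Sketch: regenerate a Fig 4.x style plot from the Table 4.1 numbers
# (2GB column, converted to seconds). Matplotlib is an assumption here;
# the paper does not say how the original figures were produced.
import matplotlib.pyplot as plt

swappiness = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
seconds_2gb = [88.45, 90.59, 86.23, 91.55, 90.33, 86.26,
               90.16, 90.41, 90.69, 86.61, 90.58]

plt.plot(swappiness, seconds_2gb, marker="o")
plt.axvline(60, linestyle="--", label="default (60)")
plt.xlabel("vm.swappiness")
plt.ylabel("processing time (s)")
plt.title("2GB data: processing time vs swappiness")
plt.legend()
plt.show()
```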
Table 4.2 presents the comparison of the processing time of the 2GB, 4GB, 8GB & 10GB BigData at the default swappiness of the Hadoop cluster and at the optimal value of swappiness. The task of optimization (reduction in the processing time of the BigData) is achieved in all the sets of the experiment, and the optimal range of swappiness for the experimental set-up is 45-50. The processing time improvement for the 2GB, 4GB, 8GB & 10GB data is 4.32%, 1.98%, 3.38% & 1.87% respectively in the optimal swappiness range of 45-50 as compared to the default system value of swappiness (60).

TABLE 4.2: COMPARISON OF PROCESSING TIME AT THE DEFAULT & MODIFIED SWAPPINESS FOR 2GB, 4GB, 8GB AND 10GB BIGDATA ON THE HADOOP CLUSTER

Sr. No.   Data Size   Time (sec) at default     Time (sec) at swappiness   % improvement in
                      swappiness (60)           range 45-50                processing time
1         2 GB        86.785                    86.269                     4.32%
2         4 GB        151.658                   148.654                    1.98%
3         8 GB        264.903                   255.938                    3.38%
4         10 GB       672.188                   659.609                    1.87%

V. CONCLUSION

All the experiments set up to run the designed MapReduce search algorithm on the configured Hadoop cluster show that optimization of BigData processing can be achieved by changing the virtual memory (swappiness) within the optimal range. These results could be extrapolated to the search operation of the Social networking sites, and the searching time can be optimized to enhance the user's experience. The results also show that the improvement in processing time varies between 1.5% and 4.5% for the given sizes of BigData when the swappiness is fixed in the optimal range.

The sample data taken for processing varied from 2 to 10 GB; this data size is small compared to the actual BigData problems of the real world, but due to resource constraints bigger data sizes could not be taken for the experiment. The processing of only 10 GB of data required a Hadoop cluster of 5 systems with 2GB RAM, a 2-core processor and a 100 Mbps network connection. Applying the optimal swappiness range to bigger data sizes would fetch more fruitful results in terms of optimization of the search operation on the Social networking sites.

The extreme lower (below 30%) & higher (above 70%) swappiness values have given a reduction in processing time in some rare cases, but they are not considered, since low swappiness creates hindrance in multitasking and high swappiness will slow down the BigData processing.

The above behaviour of the lower and higher swappiness values for BigData processing can be ascertained by going into the depth of the memory & processor functionalities of the operating system. For future work, researchers may take other functions of the system memory and the processor for optimization of the BigData processing.

VI. ACKNOWLEDGMENT

I would like to extend my deepest gratitude to my guide, Dr. Sunita Choudhary, for her valuable guidance, suggestions and motivation. Mr. Vimal Daga also helped during the implementation phase of the project and deserves a big thank you and special mention in this work.

VII. REFERENCES

[1] Manoj Parameswaran and Andrew B. Whinston, "Research Issues in Social Computing", Journal of the Association for Information Systems 8.6, 2007, p. 22.
[2] Preeti Narooka and Sunita Choudhary, "Graph Search Process in Social Networks and its Challenges", IJCSET, Vol. 6, Issue 6, June 2016, pp. 228-232.
[3] Firat Tekiner and John A. Keane, "Big Data Framework", in: International Conference on Systems, Man and Cybernetics, IEEE, 2013, pp. 1494-1499.
[4] Preeti Narooka and Sunita Choudhary, "Paradigm Shift of Big-Data Application in Cloud Computing", International Journal of Advanced Research in Computer and Communication Engineering, Vol. 5, Issue 5, May 2016, pp. 515-521.
[5] Bo Li and Raj Jain, "Survey of Recent Research Progress and Issues in Big Data", 2013.
[6] Harshawardhan S. Bhosale and Devendra P. Gadekar, "A Review Paper on Big Data and Hadoop", International Journal of Scientific and Research Publications, Volume 4, Issue 10, October 2014, pp. 1-7.
[7] Book: Daniel Bovet and Marco Cesati, Understanding the Linux Kernel, O'Reilly Media Inc., Sebastopol, CA, 2006.
[8] Book: Sameer Wadkar and Madhu Siddalingaiah, "Hadoop Concepts", in Pro Apache Hadoop, 2014.
[9] Book: Tom White, Hadoop: The Definitive Guide, O'Reilly Media and Yahoo Press, 2009.
[10] Whitepaper: nextMEDIA, CSA, "Social Networks Overview: Current Trends and Research Challenges", 2012.
[11] Whitepaper: Jean Yan, "Big Data, Bigger Opportunities". Available: http://www.meritalk.com/pdfs/bdx/bdx-whitepaper-090413.pdf, 2013.
[12] Whitepaper: Novell Technical Whitepaper, "Determining the Correct Usage of Swap in Linux 2.6 Kernels", 2007.
[13] Whitepaper: Bob Matthews and Norm Murray, "Virtual Memory Behavior in Red Hat Linux A.S. 2.1", Red Hat whitepaper, Raleigh, NC, 2001.
[14] Article: International Telecommunication Union, "Distributed Computing: Utilities, Grids & Clouds", ITU-T Technology Watch Report, March 2009.
[15] Website: Bhavin Turakhia, "Understanding and Optimizing Memory Utilization", http://careers.directi.com/display/tu/Understanding+and+optimizing+Memory+utilization, 2013.
[16] Website: http://unix.stackexchange.com/questions/265713/how-to-configure-swappiness-in-linux-memory-management
[17] Website: https://www.howtoforge.com/tutorial/linux-swappiness/
[18] Website: http://www.linuxquestions.org/questions/linux-general-1/how-to-change-the-swappiness-of-your-linux-system-4175546212/
