Synopsis On: (Development of Automatic Text Summarization Algorithm)
Submitted by
(Mansi Bhardwaj)
M.Tech
External Guide
(Mrs. Pooja Gupta, Assistant Professor, Banasthali Vidyapith University)
Table of Contents (For Research Project)
S.No. Title
1. Introduction
   Organization
   Problem Definition
3. Proposed Study
   Aims and Objectives
4. Research Methodology
   Methodology
   Work Plan
   Proposed Contents of the Thesis
   Tools and Techniques
5. References
CHAPTER 1
INTRODUCTION
1.1 About The Organization
Banasthali Vidyapith University is India’s largest residential university for women. The
source of inspiration for the university is Shanta Bai, the daughter of our founder, the
freedom fighter and educationist Shri Hiralal Shastri. To complete the unfinished task of
his daughter, the Shri Shantabai Shiksha Kutir was started in 1935. The name “Banasthali
Vidyapith” was adopted only in 1943, which was also the year when undergraduate
courses were first introduced. The UGC committee which recommended the conferment of
University status on the institution kept the following points in mind:
(i) Vidyapith’s definite and viable programme for restructuring courses at the
undergraduate level and its eagerness to carry out various measures to make
education more meaningful and practical.
(iii) Vidyapith’s initiative to inculcate spiritual and moral values in the students
through various activities, emphasizing character-building and simplicity.
The campus is a sprawling 850 acres, located about 80 kilometres from the capital
city of Jaipur, in the Tonk district of Rajasthan, India. The campus has been
broadly divided into the school division, the University division and the
residential blocks. The residential blocks feature 29 hostels each with the capacity
of housing up to 438 students.
Banasthali offers five-fold activities so that the girls can also grow in fields
such as dancing, singing, sports, self-defense and fitness, and become well-rounded
women who do not depend on others for their education, safety and lifestyle.
Sports include basketball, cricket, football, tennis and many others. In dancing,
cultural forms such as Kathak, Manipuri and many other dance forms are taught.
Fitness activities include yoga, aerobics, Zumba and swimming classes and a gym.
Manual text summarization is a time-consuming and costly task that involves many steps.
For example, the following steps are carried out to manually summarize a single document
(Takeuchi, 2002):
3) Trying to compose a summary that satisfies the following requirements (Lloret et al.,
2017):
Because manually summarizing the huge amount of textual content on the Internet and in
various archives is so difficult, automatic text summarization (ATS) systems have emerged
as the main technology for addressing this pressing issue.
CHAPTER 2
LITERATURE REVIEW
Considerable effort has gone into achieving effective text summarization. Nagwani and
Verma [1] proposed a frequent-term-based text summarization algorithm that first processes
the document to be summarized by eliminating stop words and applying stemmers.
Next, term-frequency data is calculated from the document and the frequent terms are
selected, and semantically equivalent terms are generated for these selected words.
Finally, all sentences in the document that contain the identified frequent terms or their
semantic equivalents are filtered for summarization.
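The frequent-term idea described above can be illustrated with a minimal Python sketch. It is not the algorithm of [1]: the stop-word list, the function names and the `top_k` parameter are assumptions made for illustration, and stemming and the generation of semantic equivalents are left out for brevity.

```python
import re
from collections import Counter

# Illustrative stop-word list (an assumption; real systems use much longer lists).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "on", "and", "to", "all"}

def content_terms(sentence):
    """Lowercase the sentence, split it into words, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOP_WORDS]

def frequent_term_summary(sentences, top_k=3):
    """Keep the sentences that contain at least one of the top_k most frequent terms."""
    counts = Counter(t for s in sentences for t in content_terms(s))
    frequent = {term for term, _ in counts.most_common(top_k)}
    return [s for s in sentences if frequent.intersection(content_terms(s))]
```

With a larger `top_k` more sentences pass the filter, so this value controls how aggressively the document is shortened.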
Aksoy et al. [3] proposed using Semantic Role Labeling (SRL) for generic
Multi-Document Summarization (MDS). Sentences are scored according to frequent
semantic phrases and the summary is formed from the top-scored sentences. The method
uses a term-based sentence scoring approach to investigate the effects of using semantic
units instead of single words for sentence scoring. The scoring metric is then integrated as
an auxiliary feature in order to examine its effect on performance.
Rushdi et al. [4] put forth a technique for summarizing domain-specific text from a
single web document using statistical and linguistic analysis of the text in a reference
corpus and in the web document. The proposed summarizer combines Sentence Weight and
Subject Weight to determine the rank of a sentence. It uses the number of terms and the
number of words in a sentence, together with term frequency in the corpus, and about 30%
of the ranked sentences are taken as the summary of the web document. Three web
document summaries generated with the proposed technique were compared with
summaries written manually by 16 different human subjects.
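The selection step, keeping roughly 30% of the ranked sentences, can be sketched as follows. The scores are placeholders supplied by the caller; how Sentence Weight and Subject Weight are actually combined in [4] is not specified here, and the function name, the rounding up and the return in document order are assumptions.

```python
import math

def top_fraction(sentences, scores, fraction=0.3):
    """Return the highest-scoring `fraction` of sentences, in document order."""
    if len(sentences) != len(scores):
        raise ValueError("one score per sentence is required")
    if not sentences:
        return []
    # Round up so that short documents still yield at least one sentence.
    keep = max(1, math.ceil(fraction * len(sentences)))
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:keep])]
```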
Foong and Oxley [5] developed a hybrid Harmony Particle Swarm Optimization (PSO)
framework for an Extractive Text Summarizer to overcome high processing load. Their
objective was to find out whether the proposed PSO model could condense original
electronic documents into shorter summarized texts more efficiently and accurately than
the alternative models. Their empirical results showed that the proposed hybrid PSO
model improved the efficiency and accuracy of composing summarized text.
Already Implemented Systems: The following systems have already been implemented, each
using different algorithms.
MEAD: MEAD [11] is the most elaborate publicly available platform for multi-lingual
summarization and evaluation. The platform implements a battery of summarization
algorithms, including baselines (lead-based and random) as well as position-based,
centroid-based, longest-common-subsequence, keyword-based and query-based methods.
The methods for evaluating the quality of the summaries are both intrinsic and extrinsic.
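To make two of these method families concrete, the sketch below shows a lead-based baseline and a simplified centroid-style score. It does not use MEAD or reproduce its implementation: the function names are hypothetical, and the centroid here is simply the raw word counts of the whole text, which is an assumption made to keep the example short.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Count lowercase word occurrences in a piece of text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def lead_baseline(sentences, n=1):
    """Lead-based baseline: take the first n sentences as the summary."""
    return sentences[:n]

def centroid_scores(sentences):
    """Score each sentence by its similarity to the word counts of the whole text."""
    centroid = Counter()
    for s in sentences:
        centroid.update(bag_of_words(s))
    return [cosine(bag_of_words(s), centroid) for s in sentences]
```

A sentence that shares many words with the rest of the text scores high, while an off-topic sentence scores low.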
• A neural network was used by S. P. Yong [6], whose system combines keyword extraction
and summary production to generate a summary.
• RST (Rhetorical Structure Theory) was used by Li Chengcheng [7] to analyze sentences
and discover rhetorical relations in order to generate a summary.
• In 2000, Hongyan Jing [8] worked with closely related sentences, building on the concept
of human abstraction.
• In 2011, Nitin Agarwal [9] used an unsupervised query-oriented approach based on a
clustering method.
• In 2004, Jun’ichi Fukumoto [10] used TF-IDF for single-document and multi-document
abstract generation.
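Since TF-IDF appears in several of these systems, a minimal sketch of the weighting may help. It uses the common form tf × log(N / df); the exact variant used in [10] is not stated here, and the function names are assumptions.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z]+", text.lower())

def tf_idf(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(t for tokens in tokenized for t in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append(
            {t: (c / len(tokens)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return weights
```

A term that occurs in every document gets weight zero, so common words contribute nothing, while terms specific to one document receive the highest weights.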
Allahyari et al. [12] give a survey of text summarization techniques, which is very
helpful for gaining an overview of the field.
CHAPTER 3
PROPOSED STUDY
3.1 Aims and Objectives:
The objective of this research is to develop an algorithm that differs from
existing ones, is easy for readers to understand, and gives accurate results.
The main objective of an ATS system is to generate a summary that conveys the
important ideas or sentences of the input document in less space and with
minimal repetition.
An ATS system helps users get the main ideas of the input document without
reading the entire document, which can save a lot of time and effort.
The summary should be short compared with the input text document, so that the
user can easily understand the concept and aim of the whole document without
reading it in full.
An ATS system should be able to work on web sources such as social media, news,
blogs or research papers and summarize the contents by headings or paragraphs,
according to the classification of the ATS.
These are the aims and objectives of the Automatic Text Summarization system.
The ATS will work towards these objectives so that it can easily fulfil the
requirements of users, because these are generally the things users expect when
they first hear about ATS systems.
In this research we will develop an algorithm that readers can easily
understand. The complexity of the developed algorithm may be high, but the
essential requirement is that it gives accurate results.
CHAPTER 4
RESEARCH METHODOLOGY
4.1 Methodology
This section explains the methodology that the research project will follow and briefly
describes each step. The steps are:
Data Collection
Pre Processing
Processing
Post Processing
Data Collection: In this step, data is collected from websites or social media networks,
for example user reviews on a particular topic, so that we can summarize what users
want. A corpus (collection of data) can also be used. The data is collected with the help
of data collection tools. This is the first and most basic step in developing an algorithm
for Automatic Text Summarization.
Processing: In this step we work on the actual filtered textual data, using one of the
text summarization approaches and applying one or more techniques to convert the input
text document into a summary. The text summarization approaches are the Extractive,
Abstractive and Hybrid Text Summarization Approaches. The techniques are text
summarization operations and statistical and linguistic features. The building blocks
used are text representation models, linguistic analysis and processing techniques, and
soft computing techniques. All of these techniques and approaches are used in developing
an algorithm for ATS.
Post-Processing: In this step, problems introduced in the previous step are resolved.
This includes solving problems in the generated summary sentences, such as anaphora
resolution, and reordering the selected sentences before generating the final summary.
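The reordering part of post-processing can be sketched in a few lines. Restoring the original document order is only one possible ordering strategy and is an assumption here, as is the function name; anaphora resolution is not covered by this sketch.

```python
def reorder_selected(sentences, selected_indices):
    """Return the selected sentences in original document order, without repeats."""
    ordered = sorted(set(selected_indices))
    return [sentences[i] for i in ordered]
```

For example, if the processing step picked the sentences at positions 3, 0 and 2 (in order of score), the final summary would present them as sentences 0, 2 and 3.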
Workflow: This defines the flow of work in which the whole algorithmic development
of ATS proceeds:
4.2 Work Plan: This section discusses the work plan, that is, approximately how
much time each phase of the research project will take.
For First Phase (Data Collection): Time required for this phase is one and a half
months.
In this phase, data is collected from web sources and then converted into
DataFrames using pandas (a Python library).
For Second Phase (Pre-Processing): Time required for this phase is two months.
In this phase, the data is filtered by applying linguistic techniques such as
stop-word removal and stemming.
For Third Phase (Processing): Time required for this phase is two and a half
months.
In this phase, the main task is to select the approach and apply one or more
techniques with it.
For Fourth Phase (Post-Processing): Time required for this phase is one and a half
months.
In this phase, the sentences of the summarized text are ranked and other
remaining problems are solved.
These are the four phases of the project, and each phase consists of many smaller
tasks. In each phase the Python language will be used to achieve the desired
output.
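As an illustration of the pre-processing phase, the sketch below removes stop words and applies a very rough suffix-stripping step in plain Python. The stop-word list and the suffix rules are assumptions standing in for a real stop-word list and a real stemmer, which the project would more likely take from an existing library.

```python
import re

# Illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "on", "and", "to"}

def strip_suffix(word):
    """Crude suffix stripping that stands in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters of the word remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, tokenize, remove stop words, and strip common suffixes."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [strip_suffix(t) for t in tokens if t not in STOP_WORDS]
```

Stems produced this way, such as "summariz", need not be real words; they only have to map related word forms to the same token.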
4.4 Tools and Techniques: This research project will use tools and techniques such as
pandas for Python, machine learning, the Naïve Bayes classifier and the N-gram
algorithm.
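As a small illustration of the N-gram technique, the sketch below extracts all contiguous n-token sequences from a token list. How n-grams will be used in the final algorithm is not fixed in this synopsis, so the function name and interface are assumptions.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token tuples from a list of tokens."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```

For the tokens of "automatic text summarization system", `ngrams(tokens, 2)` yields the three bigrams (automatic, text), (text, summarization) and (summarization, system).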
CHAPTER 5
REFERENCES
1. Kumar Nagwani, Naresh, and Shrish Verma. "A Frequent Term and Semantic
Similarity based Single Document Text Summarization Algorithm." International
Journal of Computer Applications, 2011.
5. Foong, Oi-Mean, and Alan Oxley. "A hybrid PSO model in Extractive Text
Summarizer." In Computers & Informatics (ISCI), 2011 IEEE Symposium on, pp.
130-134. IEEE, 2011.
9. Nitin Agarwal, Gvr Kiran, Ravi Shankar Reddy, and Carolyn Penstein Rosé.
"Towards Multi-Document Summarization of Scientific Articles: Making
Interesting Comparisons with SciSumm." In Proceedings of the Workshop on
Automatic Summarization for Different Genres, Media, and Languages, Portland,
Oregon, pp. 8–15, 2011.
10. Jun'ichi Fukumoto. "Multi-Document Summarization Using Document Set Type
Classification." In Proceedings of NTCIR-4, Tokyo, pp. 412-416, 2004.
12. Mehdi Allahyari, Saeid Safaei, Krys Kochut, Seyedamin Pouriyeh, Mehdi Assefi,
Elizabeth D. Trippe, and Juan B. Gutierrez. "Text Summarization Techniques: A Brief
Survey." arXiv:1707.02268v3 [cs.CL], 28 July 2017.