Project
Abstract
An abstract is a brief summary of a research article, thesis, review, conference
proceeding or any in-depth analysis of a particular subject or discipline, and is often
used to help the reader quickly ascertain the paper's purpose. When used, an abstract
always appears at the beginning of a manuscript, acting as the point-of-entry for any
given scientific paper or patent application. Abstracting and indexing services, which
compile a body of literature for a particular subject, are available for a number of
academic disciplines.
Automatic Summarization
Automatic summarization is the creation of a shortened version of a text by a
computer program. The product of this procedure still contains the most important
points of the original text.
Information overload has made access to coherent, well-constructed summaries vital.
As access to data has increased, so has interest in automatic summarization. Search
engines such as Google are one example of summarization technology in use.
To produce a useful, coherent summary of any kind of text, a summarizer needs to
take into account several variables such as length, writing style and syntax.
Extraction and abstraction
Broadly, one distinguishes two approaches: extraction and abstraction. Extraction
techniques merely copy the information deemed most important by the system to the
summary (for example, key clauses, sentences or paragraphs), while abstraction
involves paraphrasing sections of the source document. In general, abstraction can
condense a text more strongly than extraction, but the programs that can do this are
harder to develop as they require the use of natural language generation technology,
which itself is a growing field.
Types of summaries
There are different types of summaries, depending on what the summarization
program focuses on when summarizing the text: for example, generic summaries or
query-relevant summaries (sometimes called query-biased summaries).
Summarization systems can create either kind, depending on what the user needs.
Summarization of multimedia documents, e.g. pictures or movies, is also possible.
Some systems will generate a summary based on a single source document, while
others can use multiple source documents (for example, a cluster of news stories on
the same topic). These systems are known as multi-document summarization
systems.
Aided summarization
Machine learning techniques from closely related fields such as information retrieval
or text mining have been successfully adapted to help automatic summarization.
Apart from Fully Automated Summarizers (FAS), there are systems that aid users
with the task of summarization (MAHS = Machine Aided Human Summarization),
for example by highlighting candidate passages to be included in the summary, and
there are systems that depend on post-processing by a human (HAMS = Human
Aided Machine Summarization).
Evaluation
An ongoing issue in this field is that of evaluation. Human judgment often has wide
variance on what is considered a "good" summary, which means that making the
evaluation process automatic is particularly difficult. Manual evaluation can be used,
but this is both time and labor intensive as it requires humans to read not only the
summaries but also the source documents. Other issues are those concerning
coherence and coverage.
One of the metrics used in NIST's annual Document Understanding Conferences, in
which research groups submit their systems for both summarization and translation
tasks, is the ROUGE metric (Recall-Oriented Understudy for Gisting Evaluation). It
essentially calculates n-gram overlaps between automatically generated summaries
and previously-written human summaries. A high level of overlap should indicate a
high level of shared concepts between the two summaries. Note that overlap metrics
like this are unable to provide any feedback on a summary's coherence. Anaphor
resolution remains another problem yet to be fully solved.
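The n-gram overlap idea behind ROUGE can be sketched in a few lines. The block below is a minimal illustration of ROUGE-N recall only, not the official ROUGE toolkit; the class name RougeSketch, the simple tokenizer and the example sentences are assumptions made for this sketch.

```csharp
using System;
using System.Collections.Generic;

// Illustrative sketch of ROUGE-N recall; not the official ROUGE implementation.
static class RougeSketch
{
    // Splits text into lowercase word tokens on whitespace and basic punctuation.
    static string[] Tokenize(string text)
    {
        char[] separators = { ' ', '\t', '\n', '\r', '.', ',', ';', ':', '!', '?' };
        return text.ToLowerInvariant().Split(separators, StringSplitOptions.RemoveEmptyEntries);
    }

    // Counts how often each n-gram occurs in the token sequence.
    static Dictionary<string, int> CountNGrams(string[] tokens, int n)
    {
        var counts = new Dictionary<string, int>();
        for (int i = 0; i + n <= tokens.Length; i++)
        {
            string gram = string.Join(" ", tokens, i, n);
            int seen;
            counts.TryGetValue(gram, out seen);
            counts[gram] = seen + 1;
        }
        return counts;
    }

    // Recall = n-grams shared with the reference (clipped per n-gram)
    // divided by the number of n-grams in the human-written reference.
    public static double RougeNRecall(string candidate, string reference, int n)
    {
        var candidateCounts = CountNGrams(Tokenize(candidate), n);
        var referenceCounts = CountNGrams(Tokenize(reference), n);
        int overlap = 0, total = 0;
        foreach (var pair in referenceCounts)
        {
            total += pair.Value;
            int inCandidate;
            if (candidateCounts.TryGetValue(pair.Key, out inCandidate))
                overlap += Math.Min(pair.Value, inCandidate);
        }
        return total == 0 ? 0.0 : (double)overlap / total;
    }

    static void Main()
    {
        // Unigram recall of a machine summary against a human summary.
        Console.WriteLine(RougeNRecall("the cat sat on the mat", "the cat lay on the mat", 1));
    }
}
```

Because only word sequences are compared, two summaries with the same words in an incoherent order can still score well, which is the limitation noted above.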
Information Retrieval
An information retrieval process begins when a user enters a query into the system.
Queries are formal statements of information needs, for example search strings in
web search engines. In information retrieval a query does not uniquely identify a
single object in the collection. Instead, several objects may match the query, perhaps
with different degrees of relevancy.
An object is an entity which keeps or stores information in a database. User queries
are matched to objects stored in the database. Depending on the application the data
objects may be, for example, text documents, images or videos. Often the documents
themselves are not kept or stored directly in the IR system, but are instead
represented in the system by document surrogates.
Most IR systems compute a numeric score on how well each object in the database
matches the query, and rank the objects according to this value. The top ranking
objects are then shown to the user. The process may then be iterated if the user
wishes to refine the query.
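The score-and-rank step can be illustrated with a deliberately simple scoring function. In the sketch below the score is just the number of query-term occurrences in each document; real IR systems use more refined weighting, and the names RankingSketch, Score and Rank are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative sketch: rank documents by how many query-term occurrences they contain.
static class RankingSketch
{
    // Score = number of words in the document that are query terms.
    public static int Score(string document, string[] queryTerms)
    {
        string[] words = document.ToLowerInvariant()
            .Split(new[] { ' ', '.', ',', ';', ':' }, StringSplitOptions.RemoveEmptyEntries);
        return words.Count(w => queryTerms.Contains(w));
    }

    // Returns the indices of the matching documents, best match first.
    // Documents with a score of zero are not returned.
    public static List<int> Rank(string[] documents, string query)
    {
        string[] terms = query.ToLowerInvariant()
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        return Enumerable.Range(0, documents.Length)
            .Select(i => new { Index = i, Hits = Score(documents[i], terms) })
            .Where(x => x.Hits > 0)
            .OrderByDescending(x => x.Hits)
            .Select(x => x.Index)
            .ToList();
    }

    static void Main()
    {
        string[] documents =
        {
            "Automatic summarization of text documents.",
            "Image retrieval systems rank images.",
            "Text summarization and text mining."
        };
        foreach (int index in Rank(documents, "text summarization"))
            Console.WriteLine(documents[index]);
    }
}
```

Several documents match the query with different scores, which mirrors the point above that a query does not identify a single object.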
Text Analytics
The term text analytics describes a set of linguistic, lexical, pattern recognition,
extraction, tagging/structuring, visualization, and predictive techniques. The term
also describes processes that apply these techniques, whether independently or in
conjunction with query and analysis of fielded, numerical data, to solve business
problems. These techniques and processes discover and present knowledge – facts,
business rules, and relationships – that is otherwise locked in textual form,
impenetrable to automated processing.
A typical application is to scan a set of documents written in a natural language and
either model the document set for predictive classification purposes or populate a
database or search index with the information extracted. Current approaches to text
analytics use natural language processing techniques that focus on specialized
domains.
Chapter 2: Project Features
• Using this tool, users can obtain a summary of their document tailored to
their requirements.
• Input file size is restricted to 20-40 pages for the best possible summary.
• Supported input formats are plain text, rich text, MS Word (Office 2007
supported) and HTML only.
• Uses keywords and related words supplied by the user to summarize the text,
allowing for greater flexibility in the summary.
• The user can get different summaries from the same document depending
on what is supplied as “keyword” and “related words”.
• The primary intended use when designing the program was to aid in
creating notes for exams or lectures.
Chapter 4: Specification and Requirement
• Input file size restricted to 20-40 pages for best possible summary.
• Uses keywords supplied by the user to summarize the text, allowing for
greater flexibility in summary
4.2 Requirements
HARDWARE REQUIREMENTS:
SOFTWARE REQUIREMENTS:
• Windows XP/Vista with
• MS Office
Chapter 5: Design
6.1.1 Interface:
This is the graphical user interface, the interactive medium between the
application and the user. Through it the user provides the input file, the
keyword, the related words and the summary name.
The keyword is the most important word in the document, as chosen by
the user, while the related words are words from the document
that have some relation to the keyword. The user supplies five related
words.
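How a keyword and its related words might translate into a sentence score can be sketched as follows. The weights (1.0 for the keyword, 0.75 for each related word) are the ones used in the Chapter 8 listing; the class name KeywordScoreSketch and the plain substring matching are assumptions of this sketch, whereas the listing itself matches with regular expressions.

```csharp
using System;

// Illustrative sketch of keyword-based line scoring.
static class KeywordScoreSketch
{
    const double KeywordWeight = 1.0;   // weight taken from the Chapter 8 listing
    const double RelatedWeight = 0.75;  // weight taken from the Chapter 8 listing

    // Case-insensitive substring test; empty words never match (an assumption of this sketch).
    static bool Contains(string line, string word)
    {
        return word.Length > 0 &&
               line.IndexOf(word, StringComparison.OrdinalIgnoreCase) >= 0;
    }

    // Adds 1.0 if the keyword occurs in the line and 0.75 for each related word that occurs.
    public static double Score(string line, string keyword, string[] relatedWords)
    {
        double score = Contains(line, keyword) ? KeywordWeight : 0.0;
        foreach (string word in relatedWords)
        {
            if (Contains(line, word))
                score += RelatedWeight;
        }
        return score;
    }

    static void Main()
    {
        string[] related = { "extraction", "sentences", "paraphrase", "abstract", "clause" };
        Console.WriteLine(Score("Extraction copies key sentences into the summary.", "summary", related));
    }
}
```

A line that mentions none of the six words scores zero, so changing the keyword or the related words changes which lines survive into the summary.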
Buttons:
Open File: This button opens an “open file” dialog box from which the
user can select the file to summarize.
Summarize: This button starts the summarization process. First the
input fields are checked for blank or incorrect entries; if any are
found, suitable error messages are shown. For correct input, the
text is converted to plain text. Keywords are then taken from the
interface or, if that field is empty, from the database. Details of the
summarization process are covered under the individual modules.
Redo: This button is initially hidden and becomes visible once
summarization has been done. Clicking it repeats the summarization
to generate an even smaller summary.
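The input check performed by the Summarize button could look roughly like the sketch below for the related-words field. It assumes the comma-separated format shown in the tool's error message (related_word1,related_word2,...) and the count of five stated above; the class and method names are hypothetical.

```csharp
using System;

// Illustrative sketch: validate the comma-separated related-words field.
static class RelatedWordsSketch
{
    public const int ExpectedCount = 5;

    // Returns true and the trimmed words when the input holds exactly five
    // non-empty, comma-separated entries; otherwise returns false.
    public static bool TryParse(string input, out string[] words)
    {
        words = new string[0];
        if (string.IsNullOrEmpty(input))
            return false;

        string[] parts = input.Split(',');
        if (parts.Length != ExpectedCount)
            return false;

        for (int i = 0; i < parts.Length; i++)
        {
            parts[i] = parts[i].Trim();
            if (parts[i].Length == 0)
                return false;
        }
        words = parts;
        return true;
    }

    static void Main()
    {
        string[] words;
        bool ok = TryParse("extraction, abstraction, sentence, score, keyword", out words);
        Console.WriteLine(ok ? string.Join("|", words) : "Invalid format for related words.");
    }
}
```

Validating before scoring lets the interface report a bad field directly instead of relying on an exception raised later in the pipeline.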
Fig 1(a): User Interface
Fig 1(d): Re-summary option
6.1.2 Text Converter:
This module converts the input document to plain text for processing
by the rest of the system. The input supplied by the user may be in
plain text, MS Word or HTML format, and the summary is generated
from the plain-text version only. The tool does not support PDF
format; support could be added in a future version by including a
PDF-to-plain-text converter.
Conversion is done in steps. First, a new Microsoft Office Word
application object is created. The target document is then opened in this
Word application in read-only mode, and its entire text is selected and copied.
Finally, the data on the clipboard is assigned to a string variable, which is
written to a plain-text file named convertedtext.txt.
The above process requires the Microsoft Office Word Interop 12.0
assembly to be present on the target system. This file is bundled with the
project, so it is copied to the target system when the tool is installed.
Because the interop assembly is only a wrapper around Word's COM
automation interface, Microsoft Word itself must still be installed on the
target machine for the conversion of Word documents to work.
Fig 2: screenshot for Converted Text
6.1.3 Text Formatter:
Counter:
This class takes the input file path and counts the number of lines in the
text file. This is then returned as an int value.
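A line counter of this kind can be sketched as below. The class and method names are hypothetical, and the temporary file in Main exists only to make the example self-contained.

```csharp
using System;
using System.IO;

// Illustrative sketch: count the lines of a text file.
static class LineCountSketch
{
    // Reads the file line by line so that large files are not loaded into memory at once.
    public static int CountLines(string path)
    {
        int count = 0;
        using (StreamReader reader = new StreamReader(path))
        {
            while (reader.ReadLine() != null)
                count++;
        }
        return count;
    }

    static void Main()
    {
        string path = Path.GetTempFileName();
        File.WriteAllLines(path, new[] { "first line", "second line", "third line" });
        Console.WriteLine(CountLines(path));
        File.Delete(path);
    }
}
```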
Fig 3: screenshot for Formatted text
This module ranks the sentences according to the scores assigned by the scoring
module. It is contained within the Summary_Generator class, which has the
following methods that cooperate to generate the summary:
Min_score_calculator: Calculates the minimum score of any line in the
text.
Max_score_calculator: Finds the maximum score in the text.
Min_score_eliminator: Calls the above two methods and then calculates
the threshold score. All sentences with a score below the threshold are
eliminated.
Summary_Write: Writes the remaining sentences to the temporary
summary file.
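The elimination step can be sketched on in-memory data. The report does not state how the threshold is derived from the minimum and maximum scores, so the midpoint used below is purely an assumption for illustration, as are the names ThresholdSketch, Threshold and Eliminate.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative sketch of threshold-based sentence elimination.
static class ThresholdSketch
{
    // Assumption: the threshold is the midpoint between the lowest and highest score.
    public static double Threshold(IList<double> scores)
    {
        return (scores.Min() + scores.Max()) / 2.0;
    }

    // Keeps the sentences whose score is at or above the threshold, in original order.
    public static List<string> Eliminate(IList<string> sentences, IList<double> scores)
    {
        var kept = new List<string>();
        if (sentences.Count == 0)
            return kept;

        double threshold = Threshold(scores);
        for (int i = 0; i < sentences.Count; i++)
        {
            if (scores[i] >= threshold)
                kept.Add(sentences[i]);
        }
        return kept;
    }

    static void Main()
    {
        var sentences = new List<string> { "A", "B", "C", "D" };
        var scores = new List<double> { 0.75, 2.5, 1.0, 1.75 };
        Console.WriteLine(string.Join(", ", Eliminate(sentences, scores).ToArray()));
    }
}
```

Running the elimination again on its own output raises the minimum score and therefore the threshold, which is one way a repeated pass could yield a shorter summary.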
Fig 5: screenshot for Ranking algorithm
Based on the keywords and related words supplied by the user, sentences are
picked from the ranked list and concatenated. The resulting summary file is
stored under the name the user provided in the interface.
This module is implemented by the Writer class. First the scores are removed
from the individual sentences, and then the descored sentences are written to
the final summary.
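Removing the scores can be sketched as stripping a numeric prefix from each line. The sketch assumes the "score, space, sentence" layout produced by the scoring code in Chapter 8 and scores written with a decimal point; the class and method names are hypothetical.

```csharp
using System;
using System.Globalization;

// Illustrative sketch: strip the leading score from a scored line.
static class DescoreSketch
{
    // Removes a leading "<score> " prefix; lines without a numeric prefix are returned unchanged.
    public static string RemoveScore(string scoredLine)
    {
        int space = scoredLine.IndexOf(' ');
        if (space <= 0)
            return scoredLine;

        double ignored;
        bool isScore = double.TryParse(scoredLine.Substring(0, space),
            NumberStyles.Float, CultureInfo.InvariantCulture, out ignored);
        return isScore ? scoredLine.Substring(space + 1) : scoredLine;
    }

    static void Main()
    {
        Console.WriteLine(RemoveScore("1.75 Extraction copies key sentences into the summary."));
    }
}
```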
Chapter 7: Testing
Test ID | Test case | Input | Expected result | Actual result | Status
X1002 | Click Open File button | Button click | Opens a new window of i/p files | A new window with i/p is opened | PASS
X1003 | Click OPEN button | Button click | Window closes & i/p file path is written in the path box | Expected result | PASS
Chapter 8: Coding
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Text;
using System.Windows.Forms;
using System.IO;
using Word = Microsoft.Office.Interop.Word;
namespace Project2009
{
public partial class Summarizer : Form
{
Double i;
public Summarizer()
{
InitializeComponent();
}
}
private void label3_Click(object sender, EventArgs e)
{
if (outdir.Length == 0)
{
try
{
String related_word =
Database_Retriever.ConnectAndQuery(keyword);
if (related_word == null)
{
richTextBox1.Text = "Keyword not in database. Please enter the keyword along with related words";
}
else if (related_word == "DNE")
{
related_word = textBox3.Text;
Writer.Write(keyword, related_word);
//Application.Exit();
}
else
{
Scoring_Module.score(keyword, related_word);
}
}
catch
{
richTextBox1.Text = "Invalid format for related words. Please enter related words as shown: related_word1,related_word2,related_word3,related_word4,related_word5.";
goto Error;
}
}
else
{
try
{
Scoring_Module.score(keyword, outdir);
String related_word =
Database_Retriever.ConnectAndQuery(keyword);
if (related_word == "DNE")
Writer.Write(keyword, outdir);
}
catch
{
richTextBox1.Text = "Invalid format for related words. Please enter related words as shown: related_word1,related_word2,related_word3,related_word4,related_word5.";
goto Error;
}
}
Double initial_count = Counter.line_count(path);
Double level_of_summarization =
Summary_Generator.Summary_Write(initial_count,0);
Writer.final_summary(summary_name);
String disp_text = @"C:\Program Files\Sonu\Summarizer\" +
summary_name + ".txt";
richTextBox1.Text = File.ReadAllText(disp_text);
goto Msg;
Error:
{
MessageBox.Show("Sorry! There was an exception during processing. Please try again.");
Application.Exit();
goto X;
}
Error2:
{
MessageBox.Show("Unrecognised file format. Please input a plain-text or MS Word file.");
Application.Restart();
goto X;
}
Exit_error:
{
Application.Exit();
Application.Restart();
goto X;
}
Msg:
{
MessageBox.Show("Summarization is complete!");
button3.Visible = true;
}
X:
{
String s = "k";
}
}
return (t_rel_words);
}
}
else
{
return ("DNE");
}
}
else
{
return (0);
}
}
}
internal class Scoring_Module
{
internal static void score(String keyword, String related_word)
{
String noutdir = @"C:\Program Files\Sonu\Summarizer\temp.txt";
String noutdir2 = @"C:\Program Files\Sonu\Summarizer\temp1.txt";
String k = keyword, k1, k2, k3, k4, k5;
String list = related_word;
char[] delimiterChars = { ',' };
string[] words = list.Split(delimiterChars);
k1 = words[0];
k2 = words[1];
k3 = words[2];
k4 = words[3];
k5 = words[4];
using (StreamReader sr = new StreamReader(@noutdir))
using (StreamWriter sw = new StreamWriter(@noutdir2))
{
String line;
while ((line = sr.ReadLine()) != null)
{
double score = scorer(line, k, k1, k2, k3, k4, k5);
String scoredtext = Convert.ToString(score) + " " + line;
if (score == 0)
scoredtext = null;
sw.WriteLine(scoredtext);
}
}
}
internal static double scorer(string line, String k, String k1, String k2, String k3,
String k4, String k5)
{
// Each related word found in the line adds 0.75; the keyword adds 1.0.
// Regex.Escape makes every word match literally, even if it contains
// regex metacharacters such as '.' or '+'.
String[] related = { k1, k2, k3, k4, k5 };
double score = 0.0;
foreach (String word in related)
{
if (System.Text.RegularExpressions.Regex.IsMatch(line,
System.Text.RegularExpressions.Regex.Escape(word),
System.Text.RegularExpressions.RegexOptions.IgnoreCase))
{
score = score + 0.75;
}
}
if (System.Text.RegularExpressions.Regex.IsMatch(line,
System.Text.RegularExpressions.Regex.Escape(k),
System.Text.RegularExpressions.RegexOptions.IgnoreCase))
{
score = score + 1.0;
}
return (score);
}
}
internal class Summary_Generator
{
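internal static double min_score_calc(string path)
{
using (StreamReader sr = new StreamReader(@path))
{
// Start above any real score so the first line read always replaces it.
double min_score = Double.MaxValue, temp_score;
String line;
while ((line = sr.ReadLine()) != null)
{
temp_score = summa(line);
// Track the lowest score seen so far.
if (temp_score < min_score)
min_score = temp_score;
}
// Blank (unscored) lines parse as 0 and an empty file leaves MaxValue;
// in both cases fall back to 0.75, the smallest non-zero score.
if (min_score == 0 || min_score == Double.MaxValue)
min_score = 0.75;
return (min_score);
}
}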
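internal static double summa(string line)
{
// Blank (unscored) lines count as 0.
if (line.Length == 0)
return (0);
// The score is the text before the first space, e.g. "1.75 Some sentence".
// Reading the whole token avoids truncating "0.75" to "0.7" and avoids an
// out-of-range Substring call on lines shorter than three characters.
int space = line.IndexOf(' ');
String s = (space > 0) ? line.Substring(0, space) : line;
double j;
if (!Double.TryParse(s, out j))
j = 0;
return (j);
}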
score = summa(line);
}
internal static Double max_score_calc(String path)
{
using (StreamReader sr = new StreamReader(@path))
{
double max_score = 0, temp_score;
String line;
while ((line = sr.ReadLine()) != null)
{
temp_score = summa(line);
// Track the highest score seen so far.
if (temp_score > max_score)
max_score = temp_score;
}
return (max_score);
}
}
String line;
String text_write;
while ((line = sr.ReadLine()) != null)
{
text_write = summ_wri(line);
sw.WriteLine(text_write);
}
}
}
}
internal class Word_to_Text_Converter
{
internal static void converter(String path)
{
object fileName = @path ;
object oMissing = System.Reflection.Missing.Value;
object oEndOfDoc = "\\endofdoc"; /* \endofdoc is a predefined bookmark */
//Start Word and create a new document.
Word._Application oWord;
Word._Document oDoc;
oWord = new Word.Application();
oWord.Visible = false;
// Open the document read-only (third argument), as described in section 6.1.2.
object readOnly = true;
oDoc = oWord.Documents.Open(ref fileName, ref oMissing, ref readOnly,
ref oMissing, ref oMissing, ref oMissing, ref oMissing,
ref oMissing, ref oMissing, ref oMissing, ref oMissing,
ref oMissing, ref oMissing, ref oMissing, ref oMissing,
ref oMissing);
oDoc.ActiveWindow.Selection.WholeStory();
oDoc.ActiveWindow.Selection.Copy();
IDataObject data = Clipboard.GetDataObject();
String text = data.GetData(DataFormats.Text).ToString();
oDoc.Close(ref oMissing, ref oMissing, ref oMissing);
oWord.Quit(ref oMissing, ref oMissing, ref oMissing);
using (StreamWriter sw = new StreamWriter(@"C:\Program Files\Sonu\Summarizer\convertedtext.txt"))
sw.WriteLine(text);
}
}
}
Chapter 9: Conclusion and Enhancements
9.1 Conclusion:
9.2 Enhancements:
The user can be given a facility to print the document directly from the
interface.
Font and font size options can be added to the application to meet the
different needs of different users.
A Save As option can be added to the application so that the user can save the
summary in different formats.