Project

Chapter 1: Introduction

Abstract
An abstract is a brief summary of a research article, thesis, review, conference
proceeding or any in-depth analysis of a particular subject or discipline, and is often
used to help the reader quickly ascertain the paper's purpose. When used, an abstract
always appears at the beginning of a manuscript, acting as the point-of-entry for any
given scientific paper or patent application. Abstracting and indexing services are
available for a number of academic disciplines, aimed at compiling a body of
literature for that particular subject.

Purpose and Limitations


Academic literature uses the abstract to succinctly communicate complex research.
An abstract may act as a stand-alone entity instead of a full paper. As such, an
abstract is used by many organizations as the basis for selecting research that is
proposed for presentation in the form of a poster, platform/oral presentation or
workshop presentation at an academic conference. Most literature database search
engines index only abstracts rather than providing the entire text of the paper. Full
texts of scientific papers must often be purchased because of copyright and/or
publisher fees and therefore the abstract is a significant selling point for the reprint
or electronic version of the full text.
Abstracts are protected under copyright law just as any other form of written work
is protected. However, publishers of scientific articles usually make abstracts
publicly available, even when the article itself is protected by a toll barrier. For
example, articles in the biomedical literature are available publicly from MEDLINE
which is accessible through PubMed. It is a common misconception that the
abstracts in MEDLINE provide sufficient information for medical practitioners,
students, scholars and patients. The abstract can convey the main results and
conclusions of a scientific article but the full text article must be consulted for details
of the methodology, the full experimental results, and a critical discussion of the
interpretations and conclusions. Consulting the abstract alone is inadequate for
scholarship and may lead to inappropriate medical decisions.
An abstract lets a researcher sift through a large number of papers and decide which
ones are likely to be relevant to their research. Papers chosen on the basis of the
abstract must still be read carefully to evaluate their relevance. It is generally
accepted that reference citations should be based on the merits of the entire paper,
not on the abstract alone.

Automatic Summarization
Automatic summarization is the creation of a shortened version of a text by a
computer program. The product of this procedure still contains the most important
points of the original text.
The phenomenon of information overload has meant that access to coherent and
correctly-developed summaries is vital. As access to data has increased so has
interest in automatic summarization. An example of the use of summarization
technology is search engines such as Google.
To produce a coherent and useful summary of any kind of text, a summarization
technology needs to take into account several variables, such as length, writing
style and syntax.
Extraction and abstraction
Broadly, one distinguishes two approaches: extraction and abstraction. Extraction
techniques merely copy the information deemed most important by the system to the
summary (for example, key clauses, sentences or paragraphs), while abstraction
involves paraphrasing sections of the source document. In general, abstraction can
condense a text more strongly than extraction, but the programs that can do this are
harder to develop as they require the use of natural language generation technology,
which itself is a growing field.
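The extraction approach can be illustrated with a minimal Python sketch. It assumes a simple word-frequency heuristic for deciding which sentences are "most important"; that heuristic, the function name `extract_summary` and the naive sentence splitting are illustrative assumptions, not a method prescribed above.

```python
import re
from collections import Counter


def extract_summary(text: str, max_sentences: int = 2) -> str:
    """Return the highest-scoring sentences, copied verbatim, in original order."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Word frequencies over the whole text act as a crude importance signal.
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence: str) -> int:
        # A sentence scores the sum of the frequencies of its words.
        return sum(freq[w] for w in re.findall(r"[a-z']+", sentence.lower()))

    # Pick the top-scoring sentence indices, then restore document order.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in chosen)
```

Because the selected sentences are copied unchanged, the output can never be shorter than its shortest selected sentence; an abstraction system would instead have to generate new wording.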
Types of summaries
There are different types of summaries, depending on what the summarization
program focuses on: for example, generic summaries or query-relevant summaries
(sometimes called query-biased summaries). Summarization systems are able to create
both query-relevant text summaries and generic machine-generated summaries, depending
on what the user needs. Summarization of multimedia documents, e.g. pictures or
movies, is also possible.
Some systems will generate a summary based on a single source document, while
others can use multiple source documents (for example, a cluster of news stories on
the same topic). These systems are known as multi-document summarization
systems.

Aided summarization
Machine learning techniques from closely related fields such as information retrieval
or text mining have been successfully adapted to help automatic summarization.
Apart from Fully Automated Summarizers (FAS), there are systems that aid users
with the task of summarization (MAHS = Machine Aided Human Summarization),
for example by highlighting candidate passages to be included in the summary, and
there are systems that depend on post-processing by a human (HAMS = Human
Aided Machine Summarization).

Evaluation
An ongoing issue in this field is that of evaluation. Human judgment often has wide
variance on what is considered a "good" summary, which means that making the
evaluation process automatic is particularly difficult. Manual evaluation can be used,
but this is both time and labor intensive as it requires humans to read not only the
summaries but also the source documents. Other issues are those concerning
coherence and coverage.
One of the metrics used in NIST's annual Document Understanding Conferences, in
which research groups submit their systems for both summarization and translation
tasks, is the ROUGE metric (Recall-Oriented Understudy for Gisting Evaluation). It
essentially calculates n-gram overlaps between automatically generated summaries
and previously written human summaries. A high level of overlap should indicate a
high level of shared concepts between the two summaries. Note that overlap metrics
like this are unable to provide any feedback on a summary's coherence. Anaphor
resolution remains another problem yet to be fully solved.
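The n-gram overlap idea can be sketched in a few lines of Python. This is a simplified recall computation against a single reference summary, assuming whitespace tokenization; the function names are illustrative, and the official ROUGE toolkit does considerably more (multiple references, stemming, several variants).

```python
from collections import Counter


def ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate: str, reference: str, n: int = 1) -> float:
    """Fraction of the reference's n-grams that also occur in the candidate."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Clip each match by the number of times the n-gram occurs in the candidate.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total
```

For the reference "the cat sat on the mat" and the candidate "the cat lay on the mat", five of the six reference unigrams are matched, but only three of the five bigrams. Neither number says anything about whether the candidate reads coherently.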

Information Retrieval
An information retrieval process begins when a user enters a query into the system.
Queries are formal statements of information needs, for example search strings in
web search engines. In information retrieval a query does not uniquely identify a
single object in the collection. Instead, several objects may match the query, perhaps
with different degrees of relevancy.
An object is an entity which keeps or stores information in a database. User queries
are matched to objects stored in the database. Depending on the application the data
objects may be, for example, text documents, images or videos. Often the documents
themselves are not kept or stored directly in the IR system, but are instead
represented in the system by document surrogates.
Most IR systems compute a numeric score on how well each object in the database
matches the query, and rank the objects according to this value. The top ranking
objects are then shown to the user. The process may then be iterated if the user
wishes to refine the query.
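The score-and-rank step can be sketched as follows. The scoring function here (the fraction of query terms that occur in a document) is an assumption chosen for brevity; real IR systems use more refined weighting. The name `rank_documents` and the dictionary of documents are hypothetical.

```python
def rank_documents(query: str, documents: dict, top_k: int = 3) -> list:
    """Return up to top_k (doc_id, score) pairs, best match first."""
    terms = set(query.lower().split())
    scored = []
    for doc_id, text in documents.items():
        words = set(text.lower().split())
        # Score: fraction of the query terms that occur in the document.
        matched = sum(1 for term in terms if term in words)
        score = matched / len(terms) if terms else 0.0
        if score > 0:
            scored.append((doc_id, score))
    # Highest score first; ties are broken by document id for stable output.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_k]
```

Several documents can match the same query with different scores, which is exactly why the result is a ranked list rather than a single object.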

Text Analytics
The term text analytics describes a set of linguistic, lexical, pattern recognition,
extraction, tagging/structuring, visualization, and predictive techniques. The term
also describes processes that apply these techniques, whether independently or in
conjunction with query and analysis of fielded, numerical data, to solve business
problems. These techniques and processes discover and present knowledge – facts,
business rules, and relationships – that is otherwise locked in textual form,
impenetrable to automated processing.
A typical application is to scan a set of documents written in a natural language and
either model the document set for predictive classification purposes or populate a
database or search index with the information extracted. Current approaches to text
analytics use natural language processing techniques that focus on specialized
domains.
Chapter 2: Project Features

• Auto Summarization is an application tool built on .NET Framework 2.0.

• Using this tool, users can obtain a summary of their document that suits their
requirements.

• The size of the summary is user defined (the user can redo the summarization if
not satisfied with the result).

• Input file size is restricted to 20-40 pages for the best possible summary.

• The tool supports only plain text, rich text, MS Word (Office 2007 supported) and
HTML input files.

• The tool uses a keyword and related words supplied by the user to summarize the
text, allowing for greater flexibility in the summary.

• The user can provide a name for the summary.


Chapter 3: Proposed Implementation

• The user can get different summaries from the same document depending
on what is given as the “keyword” and “related words”.

• The user can reduce the size of the summary if unsatisfied with the
existing one.

• The summary is produced by counting the keywords and related words given
by the user in each line (this is the score of the line), ranking the lines
according to their scores, and building the summary from the top-ranked lines.

• Intended users are students, staff and institutions interested in summaries
of technical texts related to Computer Science/IT.

• The primary use intended while designing the program was to aid in
creating notes for exams or lectures.
Chapter 4: Specification and Requirement

4.1 User Requirement

• Size of summary: user defined (the user can redo the summarization if
not satisfied).

• Input file size restricted to 20-40 pages for the best possible summary.

• Supported input files: plain text, rich text, MS Word (Office 2007
supported) and HTML only.

• Uses keywords supplied by the user to summarize the text, allowing for
greater flexibility in the summary.
4.2 Requirements

HARDWARE REQUIREMENTS:

• PC with 2 GB hard disk and 256 MB RAM

SOFTWARE REQUIREMENTS:

• Windows XP/Vista

• MS Office

• .NET Framework 2.0

Chapter 5: Design

5.1 UML Diagram :

5.1.1 use case diagram


Fig : Use Case diagram

5.1.2 Class diagram:


Fig :Class diagram
Fig : Detailed class diagram
5.1.3 Activity diagram:

Fig : Activity diagram


5.2 Module Block diagram:

Fig : Module diagram


5.3 Data Flow diagram:

Fig : Data flow diagram


Chapter 6: Implementation

6.1 Module details:

6.1.1 Interface:
This is the graphical user interface, the interactive medium between the
application and the user. Through it the user provides the input
file, the keyword, the related words and the summary name.
The keyword is the most important word in the document provided by
the user as input, while related words are words from the document
that have some relation to the keyword. Up to five related words are
taken from the user.

The interface uses three buttons and four fields.


Input Fields:
• Keyword: The single most important word of the text, representing its
central idea. E.g. the keyword for this documentation could be
“summarizer”. This field can never be left blank; if it is, the tool
generates an error message prompting the user to provide a keyword.
• Related Words: Words related to the keyword. E.g. for
“database”, related words can be “data”, “schema” etc. The tool uses
five related words, though it is possible to provide fewer than five;
the limit of five may not be exceeded. To provide fewer
than five words, leave a blank space in place of each missing word. The
words are entered as a single string, separated by commas (,),
e.g. “word1,word2,word3,word4,word5”. Alternatively, entering
“word1, ,word2, ,word3” provides only three words.
If this field is blank, the tool searches the data dictionary for the
keyword and its related words. If they exist in the data dictionary, they
are taken from there; otherwise an error is generated. Furthermore, if the
data dictionary does not contain the keyword and this field is
not empty, the keyword and its related words are added to the data dictionary.
• Filename: This field merely shows the name of the file that is being
opened.
• Summary Name: The summary generated by the tool is saved under
this name.
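The rules for the Related Words field can be summarized in a short Python sketch (the project itself is written in C#). The data dictionary is modeled here as an in-memory `dict`, standing in for the tool's dictionary file; the function names and the exception types are illustrative assumptions.

```python
MAX_RELATED_WORDS = 5


def parse_related_words(field: str) -> list:
    """Split the comma-separated field, drop blank slots, enforce the limit."""
    if not field.strip():
        return []
    slots = field.split(",")
    if len(slots) > MAX_RELATED_WORDS:
        raise ValueError("at most five related words are allowed")
    return [slot.strip() for slot in slots if slot.strip()]


def resolve_related_words(keyword: str, field: str, dictionary: dict) -> list:
    """Apply the field rules: fall back to the dictionary, or extend it."""
    if not keyword.strip():
        raise ValueError("a keyword is required")
    words = parse_related_words(field)
    if not words:
        # Blank field: the related words must come from the data dictionary.
        if keyword not in dictionary:
            raise KeyError("keyword not in data dictionary; supply related words")
        return dictionary[keyword]
    # Field filled in: remember the words if the keyword is new.
    dictionary.setdefault(keyword, words)
    return words
```

With this sketch, the input “word1, ,word2, ,word3” yields three related words, and a later run with the same keyword and a blank field retrieves them from the dictionary.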

Buttons:
• Open File: Opens an “open file” dialog box from which the
user can select the file to summarize.
• Summarize: Starts the summarization process. First the
fields are checked for blank or incorrect input, and suitable error
messages are generated if any is found. For correct input, the
text is converted to plain text. The related words are then taken
from the interface or, if that field is empty, from the data dictionary.
Details of the summarization process are covered under the individual modules.
• Redo: This button is initially not visible and becomes visible after the
summarization has been done. It can be clicked to redo the
summarization and generate an even smaller summary.
Fig1(a): User Interface
Fig 1(d): Re-summary option
6.1.2 Text Convertor:

This module works on the input documents and converts them to
plain text for processing by the rest of the system. The input provided by the
user may be in plain text, MS Word or HTML format; documents in
these formats are converted to plain text, and the summary is generated
from the plain text only. The tool does not support PDF
format. Support for PDF can be implemented in future versions by
adding a PDF-to-plain-text converter.
Conversion is done in steps. First, a new Microsoft
Office Word application object is created. The target document is then opened in
this Word application in read-only mode, and the entire text is selected and copied.
The data from the clipboard is assigned to a string variable, which is finally
written to a plain-text file named convertedtext.txt.
This process requires the Microsoft Office Word interop assembly (version
12.0) to be present on the target system. The assembly is bundled with the
project, so it is copied to the target system when the tool is installed. Note
that the interop assembly is only a managed wrapper around Word's automation
interface: Microsoft Office Word must itself be installed on the target
machine (MS Office is listed under the software requirements in Chapter 4)
for the conversion to work.
Fig 2: screenshot for Converted Text
6.1.3 Text Formatter:

This module converts the document to a format that is easier to process.
All the full stops in the text are replaced with newline characters, so that each
sentence can be read directly as one line with the ReadLine() method of C#. The
formatted text is written to a separate text file (Temp.txt in the Chapter 8
listing) in the tool's working directory.

Counter:

This class takes the input file path and counts the number of lines in the
text file. The count is returned as an integer value.
Fig 3: screenshot for Formatted text
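The formatting and counting steps can be sketched in Python as follows. The function names are hypothetical, and unlike the project's listing this sketch drops empty fragments and collapses line breaks inside a sentence.

```python
def format_sentences(text: str) -> list:
    """Split on full stops; return one cleaned-up entry per sentence."""
    sentences = []
    for part in text.split("."):
        # Collapse internal whitespace (including line breaks) to single spaces.
        cleaned = " ".join(part.split())
        if cleaned:
            sentences.append(cleaned)
    return sentences


def write_formatted(text: str, out_path: str) -> int:
    """Write one sentence per line and return the number of lines written."""
    lines = format_sentences(text)
    with open(out_path, "w", encoding="utf-8") as handle:
        for line in lines:
            handle.write(line + "\n")
    return len(lines)
```

Splitting on every full stop is simple but also breaks text at abbreviations and version numbers such as "e.g." or "2.0", which is worth keeping in mind when reading the generated summaries.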

6.1.4 Scoring module:


This module takes the following inputs:
• Keyword
• Related word
• Outdir

“Related_word” is a value containing the five related words for the keyword,
separated by commas (,). The commas are removed and the five
related words are retrieved. Then each sentence is scored and the
score is prepended to it, so that every scored sentence starts with a
double-precision score.
Scoring rules are as follows:
• The keyword and related words, taken from the interface or the data
dictionary, are searched for in the input text.
• If the sentence contains the keyword, the score is increased
by 1.0.
• For each related word occurring within the sentence, 0.75
is added to the score.
• The minimum score is 0.0 and the maximum score is
4.75 (the keyword plus all five related words).
• All contributions are added up to get the total score of the sentence.
• Based on the size of the summary, the top-ranking sentences are
selected and the rest eliminated.
Fig 4: screenshot for scoring module
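The scoring rules above can be written down compactly. The following Python sketch mirrors the weights stated in the rules (1.0 for the keyword, 0.75 per related word); the function names are illustrative, and the case-insensitive substring matching and the skipping of blank related-word slots are assumptions of this sketch.

```python
import re

KEYWORD_WEIGHT = 1.0
RELATED_WEIGHT = 0.75


def contains(sentence: str, word: str) -> bool:
    """Case-insensitive substring search; blank words never match."""
    word = word.strip()
    if not word:
        return False
    return re.search(re.escape(word), sentence, re.IGNORECASE) is not None


def score_sentence(sentence: str, keyword: str, related_words: list) -> float:
    """1.0 if the keyword occurs, plus 0.75 for each related word that occurs."""
    score = KEYWORD_WEIGHT if contains(sentence, keyword) else 0.0
    # Only the first five related words are considered, as in the tool.
    score += RELATED_WEIGHT * sum(1 for w in related_words[:5] if contains(sentence, w))
    return score


def score_lines(lines: list, keyword: str, related_words: list) -> list:
    """Pair every line with its score, keeping the original order."""
    return [(score_sentence(line, keyword, related_words), line) for line in lines]
```

With one keyword and five related words, the highest reachable score is 1.0 + 5 × 0.75 = 4.75, matching the maximum given in the rules.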
6.1.5 Ranking module:

This module ranks the sentences according to the scores given by the scoring
module. It is contained within the Summary_Generator class and has
the following methods, which cooperate to generate the summary:
• Min_score_calculator: As the name suggests, this calculates the
minimum score of any line in the text.
• Max_score_calculator: This method finds the maximum score in the
text.
• Min_score_eliminator: This method uses the results of the above two methods
to obtain the threshold score. All sentences whose score does not exceed the
threshold are eliminated.
• Summary_Write: This method writes the remaining sentences to the
temporary summary file.
Fig 5: screenshot for Ranking algorithm
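A minimal Python sketch of the thresholding step follows. The threshold formula, (maximum - minimum) / 2 plus an offset that grows by 0.75 with every Redo, and the replacement of a zero minimum by 0.75 are taken from the Chapter 8 listing; the function names and the list-of-pairs data layout are assumptions of this sketch.

```python
def summary_threshold(scores: list, redo_offset: float = 0.0) -> float:
    """Midpoint-style threshold between the lowest and highest score."""
    highest = max(scores, default=0.0)
    lowest = min(scores, default=0.0)
    # A minimum of 0 is replaced by 0.75, the lowest non-zero score.
    if lowest == 0:
        lowest = 0.75
    return (highest - lowest) / 2 + redo_offset


def eliminate_low_scores(scored_lines: list, redo_offset: float = 0.0) -> list:
    """Keep only the (score, line) pairs whose score exceeds the threshold."""
    threshold = summary_threshold([score for score, _ in scored_lines], redo_offset)
    return [(score, line) for score, line in scored_lines if score > threshold]
```

Raising `redo_offset` raises the threshold, so each Redo can only keep the same sentences or fewer, which is how the tool produces an even smaller summary.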

6.1.6 Summarization module:

Based on the keyword and related words supplied by the user, the sentences are
picked from the ranked list and concatenated. The resulting summary file is
stored under the name the user provided in the interface.
This module is implemented by the Writer class. First the scores are removed
from the individual sentences, and then the de-scored sentences are written to
the final summary.
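The de-scoring step can be sketched in Python as below. It assumes each scored line has the form "<score> <sentence>" with a single space after the score; the function names are hypothetical, and dropping blank lines from the result is a choice of this sketch.

```python
def strip_score(line: str) -> str:
    """Remove a leading '<score> ' prefix from a scored line."""
    prefix, separator, rest = line.partition(" ")
    if not separator:
        # Blank line or a single token: nothing to keep.
        return ""
    try:
        float(prefix)
    except ValueError:
        # No numeric prefix: return the line unchanged.
        return line
    return rest


def build_summary(scored_lines: list) -> str:
    """De-score every line and join the non-empty sentences."""
    sentences = [strip_score(line).strip() for line in scored_lines]
    return "\n".join(sentence for sentence in sentences if sentence)
```

Splitting at the first space rather than at a fixed character position keeps the sentence intact whether the score is written as "1", "1.5" or "0.75".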
Chapter 7: Testing

Testing is the process of uncovering errors or flaws in a program.
The basic philosophy behind testing is that it can only show the
presence of errors; it cannot prove their absence. The testing
included two independent phases, unit testing and
integration testing. Thorough testing has been done so that a user can
use the system effectively. Alerts are shown for likely input
errors, and tooltips guide the user as to the
function of each element: hold the cursor over an element for a
second and its function is shown on the screen.
Fig : Alert for an error
Fig : Different alert for an error
Test Cases:

Project Name : Auto Summarization

Module : User Interface

Test Case ID | Steps | Data | Expected Result | Actual Result | Report
X1001 | Take cursor on open file icon | Button Click | Gives msg “select your i/p file from here” | select your i/p file from here | PASS
X1002 | Click Open File button | Button Click | Opens a new window of i/p files | A new window with i/p is opened | PASS
X1003 | Click OPEN button | Button Click | Window closes & i/p file path is written in the path box | As expected | PASS
X1004 | Click CANCEL button | Button Click | New window shuts down; i/p file path box is blank | As expected | PASS
X1005 | Keyword textbox | Enter the keyword | Keyword saved in file keyword.txt | As expected | PASS
X1006 | Related word textbox | Enter the related word | Related word saved in file keyword.txt | As expected | PASS
X1007 | Summary name | Enter the name of o/p summary | Summary saved as .txt file under given name | As expected | PASS
X1008 | Take cursor on summary icon | Button Click | Gives msg “Click here to get summary” | As expected | PASS
X1009 | Click Summary button | Button Click | Summary is generated with a msg prompt | As expected | PASS
X1010 | Keyword textbox | No data entered | Sys prompts an error asking user to enter keyword | As expected | PASS
X1011 | Related word textbox | No data entered | Sys prompts an error msg | As expected | PASS
X1012 | Take cursor to redo button | Button Click | Gives a msg | As expected | PASS
X1013 | Click Redo button | Button Click | Summary of the summary is obtained | As expected | PASS
X1014 | Take cursor to “Keyword” text | Enter the keyword | Message will be displayed | As expected | PASS
X1015 | Take cursor to “Related Words” text | Enter the related word | Message will be displayed | As expected | PASS
X1016 | Double click summarizer icon | New window opens | Names of developers along with guide are displayed | As expected | PASS

Sample Input and Output
Fig : Input from User
Fig : Summary of the input file
Fig : Re-Summary option

Chapter 8: Coding
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Text;
using System.Windows.Forms;
using System.IO;
using Word = Microsoft.Office.Interop.Word;

namespace Project2009
{
public partial class Summarizer : Form
{
Double i;
public Summarizer()
{
InitializeComponent();
}

private void label1_Click(object sender, EventArgs e)


{

}
private void label3_Click(object sender, EventArgs e)
{
}

private void Form1_Load(object sender, EventArgs e)


{
button3.Visible = false;
MessageBox.Show("This program was developed as Final Year Project by " +
"Nitish Raj, Parantap Das and Nishant of Dr. MGR University, CSE Dept, " +
"2005-2009 batch under the guidance of Mrs. Sumathi Eswaran.");
}

private void button2_Click(object sender, EventArgs e)


{
String path = textBox1.Text;
String keyword = textBox2.Text;
String outdir = textBox3.Text;
String summary_name = textBox4.Text;
if (keyword.Length == 0)
{
MessageBox.Show("Oops! It seems you forgot to provide a keyword. " +
"Please provide a keyword.");
goto Exit_error;
}
try
{
// Convert the input to plain text, then split it into one sentence per line.
Word_to_Text_Converter.converter(path);
Formatting_Module.Formatter(
@"C:\Program Files\Sonu\Summarizer\convertedtext.txt");
path = @"C:\Program Files\Sonu\Summarizer\convertedtext.txt";
}
catch
{
goto Error2;
}

if (outdir.Length == 0)
{
try
{

String related_word =
Database_Retriever.ConnectAndQuery(keyword);

if (related_word == null || related_word == "DNE")
{
// No related words were typed in and the data dictionary has no
// entry for this keyword, so there is nothing to score with.
richTextBox1.Text = "Keyword not in database. Please enter the " +
"keyword along with related words";
goto X;
}
else
{
Scoring_Module.score(keyword, related_word);
}
}
catch
{
richTextBox1.Text = "Invalid format for related words. Please enter " +
"related words as shown: related_word1,related_word2," +
"related_word3,related_word4,related_word5.";
goto Error;
}
}

else
{
try
{
Scoring_Module.score(keyword, outdir);
String related_word =
Database_Retriever.ConnectAndQuery(keyword);
if (related_word == "DNE")
Writer.Write(keyword, outdir);
}
catch
{
richTextBox1.Text = "Invalid format for related words. Please enter " +
"related words as shown: related_word1,related_word2," +
"related_word3,related_word4,related_word5.";
goto Error;
}
}
Double initial_count = Counter.line_count(path);
Double level_of_summarization =
Summary_Generator.Summary_Write(initial_count,0);
Writer.final_summary(summary_name);
String disp_text = @"C:\Program Files\Sonu\Summarizer\" +
summary_name + ".txt";
richTextBox1.Text = File.ReadAllText(disp_text);
goto Msg;
Error:
{
MessageBox.Show("Sorry! There was an exception during the " +
"processing. Please try again.");
Application.Exit();
goto X;
}
Error2:
{
MessageBox.Show("Unrecognised file format. Please input a plaintext " +
"or MS Word file.");
Application.Restart();
goto X;
}
Exit_error:
{
Application.Exit();
Application.Restart();
goto X;
}
Msg:
{
MessageBox.Show("Summarization is complete!");
button3.Visible = true;
}
X:
{
// Common exit point for all branches; nothing left to do.
}
}

private void button1_Click(object sender, EventArgs e)


{
OpenFileDialog dig = new OpenFileDialog();
dig.ShowDialog();
String str = dig.FileName;
textBox1.Text = str;
}
private void label5_Click(object sender, EventArgs e)
{
}

private void textBox4_TextChanged(object sender, EventArgs e)
{
}

private void button3_Click(object sender, EventArgs e)


{
i += 0.75;
//int i = 0;
String path = textBox1.Text;
String summary_name = textBox4.Text;
Double initial_count = Counter.line_count(path);
Double level_of_summarization =
Summary_Generator.Summary_Write(initial_count, i);
Writer.final_summary(summary_name);
String disp_text = @"C:\Program Files\Sonu\Summarizer\" +
summary_name + ".txt";
richTextBox1.Text = File.ReadAllText(disp_text);
MessageBox.Show("Done");
}
}
internal class Formatting_Module
{
internal static void Formatter(String path)
{
String npath = path;
String noutdir = @"C:\Program Files\Sonu\Summarizer\Temp.txt";
String line = File.ReadAllText(@npath);
using (StreamWriter sw = new StreamWriter(@noutdir))
{
char[] delimiterChars = { '.' };
string[] words = line.Split(delimiterChars);
foreach (string s in words)
{
if (s.Length != 0)
sw.WriteLine(s);
else
sw.WriteLine("");
}
}
}
}

internal class Database_Retriever
{
    // Looks up the keyword in the data dictionary (keyword.txt). Returns its
    // related words, "DNE" if the keyword does not occur in the file at all,
    // or null if no dictionary line belongs to this keyword.
    static internal String ConnectAndQuery(String keyword)
    {
        String static_path = @"C:\Program Files\Sonu\Summarizer\keyword.txt";
        String line;
        String contents = File.ReadAllText(static_path);
        if (!contents.Contains(keyword))
            return ("DNE");
        using (StreamReader sr = new StreamReader(static_path))
        {
            while ((line = sr.ReadLine()) != null)
            {
                String t_rel_words = Search(keyword, line);
                // Stop at the first matching entry; otherwise later lines
                // would overwrite the result with null.
                if (t_rel_words != null)
                    return (t_rel_words);
            }
        }
        return (null);
    }

    // Returns the text that follows the keyword on a dictionary line of the
    // form "keyword rel1,rel2,...", or null if the line is for another keyword.
    static internal String Search(String keyword, String line)
    {
        if (line.StartsWith(keyword + " ", StringComparison.OrdinalIgnoreCase))
            return (line.Substring(keyword.Length).Trim());
        return (null);
    }
}

internal class Counter


{
internal static long line_count(String path)
{
long number = LineCount2(@path, true);
return (number);
}
public static long LineCount2(string source, bool isFileName)
{
if (source != null)
{
string text = source;
long numOfLines = 0;
if (isFileName)
{
FileStream FS = new FileStream(source, FileMode.Open,
FileAccess.Read, FileShare.Read);
StreamReader SR = new StreamReader(FS);
while (text != null)
{
text = SR.ReadLine();
if (text != null)
{
++numOfLines;
}
}
SR.Close();
FS.Close();
return (numOfLines);
}
else
{
System.Text.RegularExpressions.Regex RE = new
System.Text.RegularExpressions.Regex("\n",
System.Text.RegularExpressions.RegexOptions.Multiline);
System.Text.RegularExpressions.MatchCollection theMatches =
RE.Matches(text);
return (theMatches.Count + 1);
}
}

else
{
return (0);
}
}
}
internal class Scoring_Module
{
internal static void score(String keyword, String related_word)
{
String noutdir = @"C:\Program Files\Sonu\Summarizer\temp.txt";
String noutdir2 = @"C:\Program Files\Sonu\Summarizer\temp1.txt";
String k = keyword, k1, k2, k3, k4, k5;
String list = related_word;
char[] delimiterChars = { ',' };
string[] words = list.Split(delimiterChars);
k1 = words[0];
k2 = words[1];
k3 = words[2];
k4 = words[3];
k5 = words[4];
using (StreamReader sr = new StreamReader(@noutdir))
using (StreamWriter sw = new StreamWriter(@noutdir2))
{
String line;
while ((line = sr.ReadLine()) != null)
{
double score = scorer(line, k, k1, k2, k3, k4, k5);
String scoredtext = Convert.ToString(score) + " " + line;
if (score == 0)
scoredtext = null;
sw.WriteLine(scoredtext);
}
}
}
internal static double scorer(string line, String k, String k1, String k2, String k3,
String k4, String k5)
{
    double score = 0.0;
    // Each related word found in the line adds 0.75.
    foreach (String rk in new String[] { k1, k2, k3, k4, k5 })
    {
        if (contains_word(line, rk))
            score = score + 0.75;
    }
    // The keyword itself adds 1.0.
    if (contains_word(line, k))
        score = score + 1.0;
    return (score);
}

// Case-insensitive search for a literal word. Blank entries (unused
// related-word slots) never match; as a raw pattern they would match every line.
internal static bool contains_word(string line, String word)
{
    if (word == null || word.Trim().Length == 0)
        return (false);
    return (System.Text.RegularExpressions.Regex.IsMatch(line,
        System.Text.RegularExpressions.Regex.Escape(word.Trim()),
        System.Text.RegularExpressions.RegexOptions.IgnoreCase));
}
}
internal class Summary_Generator
{
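internal static double min_score_calc(string path)
{
    using (StreamReader sr = new StreamReader(@path))
    {
        // Start above any reachable score (the maximum is 4.75) so that
        // the comparison below can actually lower the value.
        double min_score = Double.MaxValue, temp_score;
        String line;
        while ((line = sr.ReadLine()) != null)
        {
            temp_score = summa(line);
            if (temp_score < min_score)
                min_score = temp_score;
        }
        // Empty file, or unscored lines present: fall back to 0.75,
        // the lowest non-zero score a line can have.
        if (min_score == Double.MaxValue || min_score == 0)
            min_score = 0.75;
        return (min_score);
    }
}

// Reads the score from a line of the form "<score> <sentence>".
// Blank or unscored lines count as 0.
internal static double summa(string line)
{
    int space = line.IndexOf(' ');
    if (space <= 0)
        return (0);
    double j;
    // Parse the whole score token; a fixed three-character prefix
    // would read "0.75" as 0.7.
    if (!Double.TryParse(line.Substring(0, space), out j))
        j = 0;
    return (j);
}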

internal static void min_score_eliminator(string inpath, double min_score)


{
string read_name = inpath;
string write_name =
@"C:\Program Files\Sonu\Summarizer\Temp_summary.txt";
using (StreamReader sr = new StreamReader(@read_name))
using (StreamWriter sw = new StreamWriter(@write_name))
{
String line;
double score;
while ((line = sr.ReadLine()) != null)
{

score = summa(line);

if (score <= min_score)


line = null;
sw.WriteLine(line);
}
}
}
internal static String swap(String a, String b)
{
String temp = a;
a = b;
b = temp;
return (a);
}
internal static Double Summary_Write(Double initial_count,Double new_var)
{
String in_path = @"C:\Program Files\Sonu\Summarizer\temp1.txt";
String pathname_of_temp_file =
@"C:\Program Files\Sonu\Summarizer\Temp_summary.txt";
Double score_max = max_score_calc(in_path);
Double score_min = min_score_calc(in_path);
Double score_temp = (score_max - score_min) / 2 + new_var;
min_score_eliminator(in_path, score_temp);
Double count_temp = Counter.line_count(pathname_of_temp_file);
Double level_of_summarization = (count_temp / initial_count) * 100;
return (level_of_summarization);

}
internal static Double max_score_calc(String path)
{
using (StreamReader sr = new StreamReader(@path))
{
double max_score = 0, temp_score;
String line;
while ((line = sr.ReadLine()) != null)
{

temp_score = summa(line);

if (temp_score > max_score)


max_score = temp_score;

}
return (max_score);
}
}
}

internal class Writer


{
internal static void Write(String keyword, String rel_words)
{
String path = @"C:\Program Files\Sonu\Summarizer\keyword.txt";
String file_text = File.ReadAllText(path);
file_text = file_text + keyword + " " + rel_words;
using (StreamWriter sw = File.CreateText(path))
sw.WriteLine(file_text);
}
internal static void final_summary(string path)
{
    String read_path = @"C:\Program Files\Sonu\Summarizer\Temp_summary.txt";
    String write_path = @"C:\Program Files\Sonu\Summarizer\" + path + ".txt";
    using (StreamReader sr = new StreamReader(read_path))
    using (StreamWriter sw = new StreamWriter(write_path))
    {
        String line;
        while ((line = sr.ReadLine()) != null)
        {
            // Strip the leading score before writing the sentence out.
            sw.WriteLine(summ_wri(line));
        }
    }
}

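// Removes the leading "<score> " prefix from a scored line.
internal static String summ_wri(string line)
{
    // Scores vary in width ("1", "1.5", "0.75"), so cut at the first
    // space instead of after a fixed number of characters.
    int space = line.IndexOf(' ');
    if (space < 0)
        return ("");
    return (line.Substring(space + 1));
}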
}
internal class Word_to_Text_Converter
{
internal static void converter(String path)
{
object fileName = @path ;
object oMissing = System.Reflection.Missing.Value;
object oEndOfDoc = "\\endofdoc"; /* \endofdoc is a predefined bookmark
*/
//Start Word and create a new document.
Word._Application oWord;
Word._Document oDoc;
oWord = new Word.Application();
oWord.Visible = false;
oDoc = oWord.Documents.Open(ref fileName, ref oMissing, ref oMissing,
ref oMissing, ref oMissing, ref oMissing, ref oMissing,
ref oMissing, ref oMissing, ref oMissing, ref oMissing,
ref oMissing, ref oMissing, ref oMissing, ref oMissing,
ref oMissing);
oDoc.ActiveWindow.Selection.WholeStory();
oDoc.ActiveWindow.Selection.Copy();
IDataObject data = Clipboard.GetDataObject();
String text = data.GetData(DataFormats.Text).ToString();
oDoc.Close(ref oMissing, ref oMissing, ref oMissing);
oWord.Quit(ref oMissing, ref oMissing, ref oMissing);
using (StreamWriter sw = new StreamWriter(
@"C:\Program Files\Sonu\Summarizer\convertedtext.txt"))
sw.WriteLine(text);
}
}
}
Chapter 9: Conclusion and Enhancements

9.1 Conclusion:
9.2 Enhancements:

• Support for PDF files can be added to the project.

• The user can be given a facility to print the document directly from the
interface.

• A limit on the re-summary option may be added for shorter documents.

• Video/audio help can be added to the project.

• The extra line gaps obtained in the summary can be removed.

• The tool can be made compatible with different search engines.

• Font and font size options can be added to the application to meet the
different needs of different users.

• A Save As option can be added to the application so that the user can save
the summary in a different format.

• An email option may be added.


APPENDIX I

References

The following resources have been very useful during the development of this
application:

• http://msdn.microsoft.com/en-us/library/default.aspx

• http://www.ics.mq.edu.au/~swan/summarization/

• http://www.ics.mq.edu.au/~swan/readingroom/summarisation/index.htm
Summarization resources website maintained by Stephan Wan.

• http://www1.cs.columbia.edu/~hjing/sumDemo/
Summarization projects at Columbia University.

• http://complingone.georgetown.edu/~linguist/summarizer.html
Online text summarization tool.

• http://mskw.cipher-sys.com
