Formula (3) evaluates ambiguous concepts against indirect unambiguous concepts: oc2 is an indirect unambiguous concept and oc is the ambiguous concept being evaluated. The formula is calculated for all the indirect unambiguous concepts and the results are summed.
mark = Σ_{oc2} oc2.counter · (1 / oc2.distance) · (1.5 / find_distance(oc, oc2)) · rout_number(oc, oc2)    (3)
The results of formulas (2) and (3) are added; if the sum exceeds a threshold, the ambiguity of the concept is resolved, otherwise the target concept is deleted. If the ratio of disambiguated concepts to all ambiguous concepts in the paragraph is less than a specified threshold, the unambiguous concepts of the previous and next paragraphs are also examined to disambiguate the ambiguous concepts of the current paragraph. When the current paragraph is poor in unambiguous concepts, this examination ensures that concepts are evaluated more carefully and are not removed without cause.
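To make the flow concrete, here is a minimal Python sketch of the scoring and threshold test. The Concept fields, the placeholder helpers, and both numeric thresholds are assumptions for illustration; the paper does not publish code or concrete threshold values.

```python
from dataclasses import dataclass

@dataclass
class Concept:
    name: str
    counter: int      # number of references to the concept
    distance: int     # ontology distance to the ambiguous concept

def find_distance(oc: str, oc2: Concept) -> int:
    return oc2.distance          # placeholder: ontology path length

def rout_number(oc: str, oc2: Concept) -> int:
    return 1                     # placeholder: number of routes between them

def indirect_mark(oc: str, unambiguous: list[Concept]) -> float:
    """Formula (3): sum evidence over all indirect unambiguous concepts oc2."""
    return sum(oc2.counter * (1.0 / oc2.distance)
               * (1.5 / find_distance(oc, oc2))
               * rout_number(oc, oc2)
               for oc2 in unambiguous)

# A concept survives if the combined marks of formulas (2) and (3) exceed
# a threshold (value assumed here; 0.8 stands in for the formula (2) mark).
THRESHOLD = 1.0
oc2s = [Concept("hotel", counter=3, distance=1), Concept("suite", 1, 2)]
keep = (0.8 + indirect_mark("resort", oc2s)) > THRESHOLD
```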
D. Extracting the Conceptual Hierarchy Structure
After performing the above steps, each paragraph has a set of direct and indirect concepts together with information about them. The final stage is weighting the concepts and drawing the document graph schema. To weight a document concept, the number of references to the concept is divided by the total number of references to that concept in the whole document. Because direct concepts matter most to document content, direct concepts are scaled by a coefficient of 0.9 and the other concepts in the paragraphs by 0.7. After calculating the concept weights, direct concepts and instances are selected, and their relationships are extracted using the class-class matrix and the class-instance matrix. The graph nodes and their associated weights have thereby been created.
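As a rough sketch of this weighting step, under one plausible reading of the normalization described above (paragraph references divided by whole-document references, then scaled by 0.9 or 0.7), the function and the hotel/pool data below are purely illustrative:

```python
from collections import Counter

def concept_weights(paragraph_refs: Counter, document_refs: Counter,
                    direct: set[str]) -> dict[str, float]:
    """Normalize each concept's references, then scale by its coefficient."""
    weights = {}
    for concept, count in paragraph_refs.items():
        base = count / document_refs[concept]
        weights[concept] = base * (0.9 if concept in direct else 0.7)
    return weights

# Example: "hotel" is a direct concept referenced twice here, four times overall.
w = concept_weights(Counter({"hotel": 2, "pool": 1}),
                    Counter({"hotel": 4, "pool": 2}),
                    direct={"hotel"})
# w == {"hotel": 0.45, "pool": 0.35}
```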
In order to draw the graph edges and calculate their weights, all concepts of type indirect-2 are treated as direct concepts. Indirect-1 concepts that are homonymous with direct concepts are also examined. If such an indirect homonymous concept is selected as a child, the edge is directed from the direct concept that is its father to the target direct concept; if it is selected as a parent, the edges are directed from the direct children of this concept to the relevant concept. The relationship matrix is used to calculate the edge weights: the weight of the indirect homonymous concept is multiplied by the number of relationships between the two desired concepts and divided by the number of desired nodes associated with its parents (or children). A coefficient proportional to the distance of the indirect homonymous concept is also applied. Formula (4) covers the case of parents at distance one. If i is a direct concept that is homonymous with an indirect parent at distance one, then for each of its children j the weight of the edge from j to i is calculated as in (4), where sum_child_i is the number of children of concept i at distance one, concept_weight_i is the weight of the indirect concept homonymous with the main concept, and matrix[i, j] is the number of relationships between i and j. The coefficient W is set with respect to the distance of the indirect concept. Formula (4) is written the same way for children.
weight_{j->i} = concept_weight_i · W · matrix[i, j] / sum_child_i    (4)
Finally, after calculating the cases of children and fathers at different distances, a directed graph is produced whose nodes are the direct concepts; the concept weights and edge weights are calculated through the description above and Formula (4). The graph created for each document is stored in the database as a matrix, so that it can be retrieved easily when computing the similarity matrix between documents and mining them.
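The sketch below illustrates Formula (4) and the adjacency-matrix storage. The names edge_weight, W, matrix, and sum_child mirror the formula's terms rather than any published code, and the relationship counts and weights are invented for the example:

```python
import numpy as np

def edge_weight(concept_weight_i: float, W: float,
                matrix_ij: int, sum_child_i: int) -> float:
    """Formula (4): weight of the edge from child j to direct concept i."""
    return concept_weight_i * W * matrix_ij / sum_child_i

n = 3                                      # direct concepts in the document
graph = np.zeros((n, n))                   # per-document adjacency matrix
rel = [[0, 2, 1], [2, 0, 0], [1, 0, 0]]    # class-class relationship counts

i, children = 0, [1, 2]                    # concept 0 has two children
for j in children:
    graph[j, i] = edge_weight(concept_weight_i=0.45, W=1.0,
                              matrix_ij=rel[i][j], sum_child_i=len(children))
# graph can now be persisted and reloaded when building the similarity matrix.
```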
As an example, Figure 2 shows a sample document in the hotel domain (using the hotel ontology), and Figure 3 shows the ontological graph generated for it with the proposed method. The graph includes the concepts, their weights, and their edges. Judging by the concept weights, the document content is about the hotel and luxury hotel concepts, and from these weights general, significant, and detailed concepts can be identified easily. In other words, the weights can be interpreted as fuzzy membership degrees: the degree to which the document belongs to each concept. The same interpretation applies to the weights and directions of the edges. In effect, given the current ontology, a document's ontological schema is a subset of the domain ontology. In the domain ontology the weights of nodes and edges are uniform (1.00), but in documents the values of these weights differ according to content and context.
IV. SIMILARITY MEASURE RELATED TO ONTOLOGICAL REPRESENTATION
The most important steps for improving document mining procedures are the conceptual representation and a similarity measure suited to that representation. The better the similarity measure approximates the degrees of difference and similarity between documents, the more appropriate and practical it is.
The proposed ontological method has four meaningful parts that are used to determine similarities and differences between documents: the concepts and the weights corresponding to each concept, and the edges and the weights assigned to each edge. The proposed criterion is calculated separately for the concepts and for the edges, producing distinct similarity matrices for each. In the next steps, mining results are improved based on the calculated similarity matrices, a fuzzy inference system with fuzzy rules, and a document clustering algorithm.
The proposed criterion considers the membership degree, priority, and importance of each concept, and approximates the similarity between two documents based on their common concepts (and common edges). For every common concept of the two documents, two weights, w_1 and w_2, are calculated; the similarity of the two documents is then approximated from these weights. Formulas (5) and (6) express the calculation of w_1 and w_2 respectively, where w_1 relates to the difference in priority and importance of the concepts and w_2 relates to the difference in the weights of the common concepts in the two documents. In Formula (5), concept_i denotes the common concept concept as it appears in document i, and order(concept_i) defines the priority of concept in document i. x and y are the two documents whose similarity is to be calculated, and max_length(x, y) is defined as the maximum difference in importance of the common concepts between the two documents. In Formula (6), weight(concept_x) is the weight of the common concept concept in document x.
Figure 2. A sample document in the hotel domain.
Figure 3. Ontological graph of the document in Figure 2.
w_1 = (max_length(x, y) − |order(concept_x) − order(concept_y)|) / max_length(x, y)    (5)

w_2 = 1 − |weight(concept_x) − weight(concept_y)|    (6)
Formula (7) represents the similarity criterion for measuring the similarity between the two documents x and y. In this formula, m is the number of common concepts in both documents, the symbol |·| indicates collection size (number of concepts), and |x ∪ y| = |x| + |y| − |x ∩ y|.
sim(x, y) = ( Σ_{i=1}^{m} w_1 · w_2 ) / |x ∪ y|    (7)
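A compact sketch of formulas (5)-(7), assuming each document is represented as a map from concepts to (order, weight) pairs; the sample documents and their values are invented:

```python
def similarity(x: dict, y: dict) -> float:
    """Formulas (5)-(7): approximate similarity from common concepts."""
    common = set(x) & set(y)
    if not common:
        return 0.0
    max_length = max(abs(x[c][0] - y[c][0]) for c in common) or 1
    total = 0.0
    for c in common:
        w1 = (max_length - abs(x[c][0] - y[c][0])) / max_length   # formula (5)
        w2 = 1 - abs(x[c][1] - y[c][1])                           # formula (6)
        total += w1 * w2
    union = len(set(x) | set(y))           # |x ∪ y| = |x| + |y| − |x ∩ y|
    return total / union                   # formula (7)

doc_x = {"hotel": (1, 0.45), "pool": (2, 0.35)}   # concept -> (order, weight)
doc_y = {"hotel": (1, 0.40), "spa": (2, 0.30)}
print(similarity(doc_x, doc_y))   # one common concept -> modest similarity
```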
V. FUZZY INFERENCE SYSTEM AND CLUSTERING
The similarity between each pair of documents can be computed with the similarity measure offered in (7). To this end, the main concepts, detailed concepts, and main edges of each document's graph are first identified. The final similarity level between two documents is then approximated by applying a fuzzy inference system. Section A describes the fuzzy inference system in detail, and Section B deals with document clustering.
A. Fuzzy Inference System
A fuzzy inference system consists of three sections: a fuzzifier, a fuzzy inference engine, and a defuzzifier. In the fuzzifier section, a crisp variable is converted to a linguistic variable through defined membership functions. In the second section, the fuzzy output value is produced through fuzzy rules (if-then rules). The defuzzifier section converts the fuzzy output value back to a crisp value through defined membership functions. The process is presented in Figure 4. The inference system designed here has three inputs: the similarity level of main concepts, the similarity level of detailed concepts, and the similarity level of main edges in the documents' graph schemas. The detailed and main concepts of each document are determined relative to the maximum weight: first, the existing maximum weight is identified; then detailed and general concepts are specified for each document through Formula (8).
Max refers to the maximum weight value and co.weight specifies the weight of the concept being classified.
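A sketch of this main/detail split, under the reading of Formula (8) given below (threshold max/2 + 0.1·max); since the formula is recovered from damaged text, treat the exact cutoffs as an assumption:

```python
def classify(weights: dict[str, float]) -> dict[str, str]:
    """Label each concept main or detail relative to the maximum weight."""
    mx = max(weights.values())
    cut = mx / 2 + 0.1 * mx          # assumed reading of formula (8)
    labels = {}
    for concept, w in weights.items():
        if w > cut and w > 0.1:
            labels[concept] = "main"
        elif w < cut and w > 0.05:
            labels[concept] = "detail"
    return labels

print(classify({"hotel": 0.45, "pool": 0.2, "tv": 0.04}))
# {'hotel': 'main', 'pool': 'detail'}  ("tv" falls below both cutoffs)
```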
The similarity levels of main concepts, detailed concepts, and main edges are computed for each pair of documents using the extracted concepts and the similarity measure (Formula 7).
Eventually, three similarity matrices S_1, S_2, and S_3 are produced, each with dimensions n × n, where n is the number of documents. s_i^pq is the similarity level of documents p and q in matrix S_i. As shown below, S_i is a symmetric matrix and the similarity of each document with itself is one.
S_i = [ 1        s_i^12   ...   s_i^1n ]
      [ s_i^21   1        ...   ...    ]
      [ ...      ...      ...   ...    ]
      [ s_i^n1   ...      ...   1      ]
According to Figures 5-7, three membership functions, High, Medium, and Low, have been defined for each inference system input. The horizontal axis shows the similarity level between documents and the vertical axis shows the membership degree.
A Mamdani fuzzy inference engine is used on the fuzzified input values; the Mamdani fuzzy inference model uses the min-min-max operator [24]. Figure 8 is an example of a Mamdani fuzzy inference system.
if co.weight > max/2 + (max · 0.1) and co.weight > 0.1 : main concept
if co.weight < max/2 + (max · 0.1) and co.weight > 0.05 : detail concept    (8)
Figure 4. Fuzzy inference system.
Table 1. Fuzzy rules for the fuzzy inference engine

No | Main_Concept | Detailed_Concept | Main_Edge | Similarity
1  | High   | High   | High   | High
2  | High   | High   | Medium | High
3  | High   | High   | Low    | High
4  | High   | Medium | High   | Medium
5  | High   | Medium | Medium | High
6  | High   | Medium | Low    | High
7  | High   | Low    | High   | Medium
8  | High   | Low    | Medium | Medium
9  | High   | Low    | Low    | High
10 | Low    | High   | High   | Low
11 | Low    | High   | Medium | Medium
12 | Low    | High   | Low    | Medium
13 | Low    | Medium | High   | Low
14 | Low    | Medium | Medium | Low
15 | Low    | Medium | Low    | Medium
16 | Low    | Low    | High   | Low
17 | Low    | Low    | Medium | Low
18 | Low    | Low    | Low    | Low
19 | Medium | High   | High   | Medium
20 | Medium | High   | Medium | High
21 | Medium | High   | Low    | High
22 | Medium | Medium | High   | Medium
23 | Medium | Medium | Medium | Medium
24 | Medium | Medium | Low    | Medium
25 | Medium | Low    | High   | Low
26 | Medium | Low    | Medium | Medium
27 | Medium | Low    | Low    | Medium
The designed inference engine uses the 27 fuzzy rules expressed in Table 1. Each row of the table is interpreted in this way (Rule 1):
if main_concept is high and detail_concept is high
and main_edge is high then similarity is high
The fuzzy system output has three membership functions with similarity values of High, Medium, and Low. Finally, the final similarity value between two documents is estimated through defuzzification. The process for computing the similarity between documents can therefore be stated as follows:
- Compute the similarities between the main concepts, detailed concepts, and main edges of the documents
- Use the fuzzy inference system to produce the fuzzy output
- Defuzzify the output and compute the final similarities between documents
- Cluster documents based on the final similarity matrix

Figure 5. Low similarity membership function.
Figure 6. Medium similarity membership function.
Figure 7. High similarity membership function.
Figure 8. Example of a Mamdani fuzzy inference system.
Defuzzification of the output similarity values takes two steps: first, specify the similarity level among the documents; second, defuzzify this similarity value. Following the Mamdani system, the max-finding method is used to specify the similarity level among documents.
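A compact sketch of such a Mamdani pipeline: triangular membership functions, min for rule firing, max for aggregation, and a max-finding defuzzifier. The membership breakpoints are illustrative stand-ins for Figures 5-7, and only two of the 27 rules are wired in:

```python
def tri(x, a, b, c):
    """Triangular membership with peak b; flat shoulders when a == b or b == c."""
    left = 1.0 if b == a else (x - a) / (b - a)
    right = 1.0 if c == b else (c - x) / (c - b)
    return max(0.0, min(left, right))

MFS = {"Low": (0.0, 0.0, 0.5), "Medium": (0.2, 0.5, 0.8), "High": (0.5, 1.0, 1.0)}

def fuzzify(x):
    return {label: tri(x, *abc) for label, abc in MFS.items()}

def infer(main_c, detail_c, main_e, rules):
    """rules: (main, detail, edge, similarity) label tuples, as in Table 1."""
    mc, dc, me = fuzzify(main_c), fuzzify(detail_c), fuzzify(main_e)
    out = {"Low": 0.0, "Medium": 0.0, "High": 0.0}
    for a, b, c, sim in rules:
        fire = min(mc[a], dc[b], me[c])   # min for the AND of the antecedents
        out[sim] = max(out[sim], fire)    # max aggregation across rules
    winner = max(out, key=out.get)        # max-finding defuzzification
    return MFS[winner][1]                 # crisp value: peak of the winning set

rules = [("High", "High", "High", "High"),   # rule 1 of Table 1
         ("Low", "Low", "Low", "Low")]       # rule 18, for illustration
print(infer(0.9, 0.8, 0.85, rules))          # -> 1.0 (High)
```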
B. Document Clustering
After computing the final similarity matrix among documents, clustering is done with a bottom-up hierarchical clustering algorithm, which proceeds through the following steps [22]:
- Find the maximum value S_ij in the final similarity matrix and group documents i and j into a new cluster.
- Calculate the relationship between the new cluster and the other documents.
- Go to step (1) until only one cluster is left.
Figure 9 shows an example of bottom-up hierarchical clustering.
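A minimal sketch of this loop, assuming average linkage when re-scoring a merged cluster against the rest (the paper defers the exact re-scoring rule to [22]); the 3 × 3 similarity matrix is invented:

```python
import numpy as np

def agglomerate(S: np.ndarray):
    """Repeatedly merge the most similar pair until one cluster remains."""
    clusters = [[i] for i in range(len(S))]
    merges = []
    while len(clusters) > 1:
        best, pair = -1.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # average-linkage similarity between clusters a and b
                s = np.mean([S[i, j] for i in clusters[a] for j in clusters[b]])
                if s > best:
                    best, pair = s, (a, b)
        a, b = pair
        merges.append((clusters[a], clusters[b], best))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges

S = np.array([[1.0, 0.9, 0.2], [0.9, 1.0, 0.3], [0.2, 0.3, 1.0]])
print(agglomerate(S))   # documents 0 and 1 merge first (similarity 0.9)
```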
VI. EVALUATION OF THE PROPOSED METHOD
The experimental document collection and results of [25] have been used to evaluate the suggested method. In [25], a framework for ontology-based document clustering was created, and clustering was performed by combining the vector space method with the concepts in the ontology. Computer, wine, and pizza ontologies are used. Two hundred and fifty documents were selected from the three domains: 106 documents belong to the computer domain, 64 to the pizza domain, and 80 to the wine domain. Precision, recall, F-measure, accuracy, and error criteria have been used for evaluation. The method suggested in this research is compared with the method of [25] and the Naïve Bayes method.
If FC is the number of documents which do not belong to category C_i but have been clustered into it by mistake, TC is the number of documents that belong to class C_i and have also been clustered into it, MC is the number of documents that belong to category C_i but have been clustered into other classes by mistake, and MM is the number of documents which do not belong to category C_i and have been clustered into other categories, then formulas (9)-(13) state each of the above criteria respectively:
Figure 9. Bottom-up hierarchical clustering.

Precision = TC / (TC + FC)    (9)
Recall = TC / (TC + MC)    (10)
F_1-measure = 2 · Precision · Recall / (Precision + Recall)    (11)
Accuracy = (TC + MM) / (TC + FC + MC + MM)    (12)
Error = (FC + MC) / (TC + FC + MC + MM)    (13)

Precision and recall express the results from different aspects: precision evaluates the accuracy of the clustering, while recall reviews the clustering's integrity. To consider both criteria simultaneously, the F_1-measure is used [25].
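These five formulas translate directly into code; the counts below are invented for illustration:

```python
def metrics(TC: int, FC: int, MC: int, MM: int) -> dict[str, float]:
    """Formulas (9)-(13) with TC, FC, MC, MM as defined above."""
    precision = TC / (TC + FC)                          # (9)
    recall = TC / (TC + MC)                             # (10)
    f1 = 2 * precision * recall / (precision + recall)  # (11)
    total = TC + FC + MC + MM
    return {"precision": precision, "recall": recall, "f1": f1,
            "accuracy": (TC + MM) / total,              # (12)
            "error": (FC + MC) / total}                 # (13)

print(metrics(TC=80, FC=10, MC=16, MM=144))
```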
Figures 10-14 show the evaluation results of the three clustering methods and compare them on the criteria indicated above. Considering the diagrams and comparing the results, the suggested method achieves higher values than the other methods on these criteria, and its error rate is lower.
VII. CONCLUSION AND FUTURE WORK
This paper offers a new framework for document clustering. Within this framework, a new method of ontological representation of documents, and a similarity measure appropriate to that representation, have been suggested. A fuzzy inference system with three inputs and one output is used to estimate the similarity level among documents. Evaluation results show the higher efficiency of this method. Fuzzy clustering of documents, and improving sentence-level ontological representation through conceptual analysis of sentences, can be studied in the future.
REFERENCES
[1] Kh. Shaban, A Semantic Graph Model for Text Representation and Matching in Document Mining, Doctoral thesis, University of Waterloo, Ontario, Canada, 2006.
[2] Aas, K., and Eikvil, L., Text categorisation: A survey, Technical
Report 941, Norwegian Computing Center, 1999.
[3] Berry, M. W., Dumais, S. T., and O'Brien, G. W., Using Linear Algebra for Intelligent Information Retrieval, SIAM Review, 37(4), pp. 573-595, 1995.
Figure 10. Evaluation with the precision criterion.
Figure 11. Evaluation with the recall criterion.
Figure 12. Evaluation with the F-measure criterion.
Figure 13. Evaluation with the accuracy criterion.
Figure 14. Evaluation with the error criterion.
[4] Salton, G., and Mcgill, M. J., Introduction to Modern
Information Retrieval, McGraw-Hill, 1984.
[5] Salton, G., Wong, A., and Yang, C., A vector space model for
automatic indexing, Communications of the ACM, 18(11),
pp.613-620, 1975.
[6] Yang, Y., and Pedersen, J., A Comparative Study on Feature
Selection in Text Categorization, In Proceeding of the 14th
International Conference on Machine Learning, ICML, pp. 412-
420, Nashville, TN, 1997.
[7] Miller, G. A., Beckwith, R., Fellbaum, C., Gross, D., and Miller,
K., Introduction to WordNet: An On-line Lexical Database,
Cognitive Science Laboratory, Princeton University, 1993.
[8] Porter, M. F., An algorithm for suffix stripping, Program, 14(3), pp. 130-137, 1980.
[9] Suen, C., N-gram statistics for natural language understanding
and text processing, IEEE Transactions on Pattern Analysis and
Machine Intelligence, 1(2), pp.164-172, 1979.
[10] Martinez, A. R., and Wegman, E. J., Text Stream
Transformation for Semantic-Based Clustering, Computing
Science and Statistics, 34, 2002 Proceedings.
[11] Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer T. K.,
and Harshman, R., Indexing by Latent Semantic Analysis,
Journal of the American Society for Information Science, 1990.
[12] Hasan, M., Matsumoto, Y. Document Clustering: Before and
After the Singular Value Decomposition, Sapporo, Japan,
Information Processing Society of Japan (IPSJ-TR: 99-NL-134.)
pp. 47-55, 1999.
[13] Ljungstrand, P., and Johansson, H., Intranet indexing using semantic document clustering, Master Thesis, Department of Informatics, Göteborg University, 1997.
[14] Wong, YukWah and Raymond Mooney, Learning synchronous
grammars for semantic parsing with lambda calculus, In
Proceedings of the 45th Annual Meeting of the Association for
Computational Linguistics, 2007.
[15] Zettlemoyer, Luke S. and Michael Collins, Learning to map
sentences to logical form: Structured classification with
probabilistic categorial grammars, In Proceedings of UAI-05,
2005.
[16] He, Yulan and Steve Young, Spoken language understanding
using the hidden vector state model, Speech Communication
Special Issue on Spoken Language Understanding in
Conversational Systems, 48(3-4), 2006.
[17] B. André Solheim, K. Vågsnes, Ontological Representation of Texts and its Applications in Text Analysis, Master Thesis, Agder University College, 2003.
[18] Nirenburg, Sergei and Victor Raskin, Ontological Semantics,
MIT Press, 2004.
[19] Hovy, Eduard, Mitchell Marcus, Martha Palmer, Lance
Ramshaw, and Ralph Weischedel, Ontonotes: The 90% solution,
In Proceedings of HLT-NAACL 2006.
[20] S. Muresan, Learning to Map Text to Graph-based Meaning
Representations via Grammar Induction, Creative Commons
Attribution-Noncommercial-Share, 2008.
[21] Hammouda, K., and Kamel, M. Phrase-based document
similarity based on an index graph model, In Proceedings of the
2002 IEEE Int'l Conf. on Data Mining (ICDM'02), 2002.
[22] J. C. Trappey, Charles V. Trappey, Fu-Chiang Hsu, and David W. Hsiao, A Fuzzy Ontological Knowledge Document Clustering Methodology, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 39, no. 3, June 2009.
[23] http://www.onelook.com/reverse-dictionary.shtml
[24] E. H. Mamdani, Application of fuzzy algorithms for control of simple dynamic plant, Proc. Inst. Elect. Eng., vol. 121, no. 12, pp. 1585-1588, 1974.
[25] Yang, X.-Q., Sun, N., Zhang, Y., and Kong, D.-R., General Framework for Text Classification based on Domain Ontology, In Proceedings of the 2008 Third International Workshop on Semantic Media Adaptation and Personalization (SMAP '08), IEEE Computer Society, Washington DC, USA, pp. 147-152, 2008.