1. Introduction
The most significant improvement within the Semantic Web research area pertains to reasoning and searching abilities. In this perspective, semantic similarity reasoning, which relies on the knowledge coded in a reference ontology [
1], is a different technique with respect to the well-known deductive reasoning used in expert systems. In [
2], we proposed SemSim, a semantic search method based on a
Weighted Reference Ontology (
WRO). In
SemSim, both the resources in the search space and the requests of users are represented by means of an
Ontology Feature Vector (
), which is a set of concepts from the
WRO. We distinguish the user request, also denoted as
Request Vector (
), from the description of a resource, also referred to as
Annotation Vector, indicated by
. In the search process,
SemSim contrasts the
against each
, and the result is a ranking of the resources that exhibit the highest similarity degree with respect to the request defined by the user.
In [
2], we analyzed two different approaches in order to weigh the reference ontology, namely the frequency-based and the uniform probabilistic approaches. In the experiment described in that paper, we show that
SemSim by the
frequency-based approach outperforms the
SemSim by the
uniform probabilistic approach, as well as the most representative similarity methods from the literature.
In this work, we present a new method, referred to as
SemSime. It relies on the frequency-based approach and revises
SemSim along two directions. According to the first direction, in contrasting the
with the
,
SemSime takes into consideration the cardinality of the set of the concepts (features) in the user request rather than the maximal cardinality of the compared
. This choice allows us to give more relevance to the features which are requested by the user rather than the extra features contained in the annotation vectors available in the search space. Along the second direction,
SemSim has been enhanced with the rating scores
(
H),
(
M), and
(
L) in the
, with regard to both the request and the search space resources. Within the request, rating scores denote the preferences given by the user to the concepts of the WRO used to specify the query whereas, within the annotation vectors, rating scores represent the levels of quality associated with the concepts when they describe the resources. Consider an example rooted in the tourism domain, where the user is searching for a vacation package by specifying the following features:
(
H),
(
M),
(
H), and
(
L). On the basis of the given rating scores, he/she gives a high preference to resorts which are international hotels offering cultural activities, and less priority to the remaining features, in particular to the entertainments. Analogously, a holiday package annotated with
(
H),
(
M), and
(
L) is characterized by a high quality level with regard to the horse riding service, rather than the facilities in visiting museums or having Thai meals at lunch or dinner. Note that, in [
3], a proposal concerning rating scores was given, where the concepts of the
WRO are weighted according to the
uniform probabilistic approach [
4], rather than the
frequency-based one. Furthermore, in our approach we assumed that, given a facility (for instance,
) included in a tourist package, the higher the user’s priority about that facility, the higher the expectancy about the quality of the same facility and, therefore, the greater the availability of the user for considering more expensive solutions.
In this paper, we have experimented
SemSime in the domain of tourism and we have compared it to the
SemSim method defined in [
2] and a further evolution of
SemSim, referred to as
SemSim. Essentially,
SemSim is the original
SemSim method where, in line with the first direction adopted in
SemSime illustrated above, more priority has been given to the features indicated by the user in his/her request. The results of the experiment show that
SemSime outperforms both these methods.
The organization of the rest of the paper is the following:
Section 2 is about the related work;
Section 3 shows the
SemSime method and recalls
SemSim;
Section 4 illustrates the experimental results;
Section 5 presents the conclusion and future activities.
2. Related Work
In the literature, several proposals regarding semantic similarity reasoning have been defined, as for instance [
5,
6,
7,
8,
9]. In general, the matchmaking between vectors of concepts is computed on the basis of their intersection as performed in Dice and Jaccard methods [
10], without considering neither the information content of the concepts in the ontology nor the hierarchical relationship among them. In [
11], the Weighted Sum method has been proposed where the similarity of hierarchically related concepts is evaluated, although by using a fixed value (i.e., 0.5). In [
12], the similarity between sets of ontology concepts is computed on the basis of the shortest path in a graph by applying an extended version of the Dijkstra algorithm [
13]. Finally, other proposals, such as [
14], consider the inverse document frequency (IDF) method, and combine it with the term frequency (TF) approach.
With respect to the mentioned papers, in this work the semantic matchmaking method is performed in accordance with the
information content approach defined by Lin in [
15], which is an evolution of [
16]. The Lin’s method results in a better correlation with human judgment when compared with other approaches, such as the edge-counting [
17,
18,
19]. Furthermore, concerning the evaluation of the similarity between vectors of concepts, we borrowed the Hungarian algorithm for solving the
maximum weighted matching problem in bipartite graphs [
20].
In [
21], the representation of users’ preferences relies on the definition of users’ profiles, which are built according to an ontology-based model. The ontology has been exploited in order to establish relationships among user profiles and features. Then, the level of interest of the user is modeled by taking into account both features and historical data. Reference [
22] copes with the evaluation of the similarity between users’ profiles. In particular, it proposes a collaborative filtering method in order to identify similar users rating in a given set of items, and to analyze the ratings associated with these users. In [
23], the traditional feature-based measures for computing similarity are used, and the authors present an ontology-based approach for evaluating similarity between classes where attributes replace features. The underpinning of this approach is that the more attributes two classes have in common, the higher similarity they have.
Finally, it turns out that SemSime is a novel approach because, as opposed to the above mentioned proposals, it introduces rating scores for enriching the semantic annotations of both user requests and resources.
3. The Semantic Similarity Method
The
SemSime method is an extension of
SemSim proposed in [
2,
24], which is based on the information content approach applied to ontologies [
15]. According to this approach, concepts of the ontology are associated with weights such that, along the hierarchy, as weights of concepts decrease their information content increase. A
Weighted Reference Ontology (
WRO) is a pair:
and, in particular:
Ont = <C, H>, where C is a set of , also referred to as , and H is the set of pairs of concepts of C that are related according to the hierarchy (specialization hierarchy);
w is a function, referred to as weight, such that, given a concept c in the ontology, w(c) is a number in [0,…,1].
In
Figure 1, a
related to the tourism domain is shown, whose weights are defined by using the frequency-based approach.
Given a
WRO and a concept
c , the information content
ic(
c) is defined as [
25]:
The
Resource Space represents all the searchable and available digital resources. To define the semantic content of a given resource, a structure gathering a set of concepts from the
is associated with it. This structure will be referred to as
Ontology Feature Vector (
OFV) and is represented as follows:
Similarly, an describes a user request. As mentioned in the Introduction, in the following, (Annotation Vector), and (Request Vector) will denote the semantics of a resource and a user request, respectively.
In order to search for the resources in the Resource Space, the
SemSim method has been defined. This method evaluates the semantic similarity between an
AV and a
RV by using the
semsim function which relies on the
consim function. The latter allows the evaluation of the similarity between concepts, say
,
, as defined below:
where
lca (
lowest common ancestor) is the least abstract concept in the hierarchy subsuming
c and
c. Given an instance of
, say
, and an instance of
, say
,
semsim computes
consim for each pair of concepts belonging to the Cartesian product of
and
. Indeed, we aim at identifying the set of pairs of concepts from
and
that maximizes the sum of
consim according to the Hungarian algorithm for the
maximum weighted matching problem in bipartite graphs [
20]. In particular, given:
where {
c,…,
c} ∪ {
c,…,
c} ⊆
C, let
S be defined as the following Cartesian product:
and let
be defined as:
Thus,
semsim(
rv,
av) is defined as follows:
SemSime
The SemSime method, which is based on SemSim, has been conceived in order to manage a more refined representation of an OFV, where concepts are associated with scores. In SemSime, an OFV is indicated as OFVe and, similar to the previous section, RVe and AVe as well.
In our approach, the presence of the scores in the
AVe and
RVe represents a refined annotation and a refined request vector, respectively. In the case of the
AVe, each score, denoted by
, stands for the quality level of the feature characterizing the resource, while in the case of
RVe, denoted by
, it indicates the priority degree assigned by the user to the requested feature. As mentioned above, these scores are:
H (
),
M (
), or
L (
). Therefore, a given
OFVe, say
ofve, is represented as follows:
where
s is
or
,
is a
or
,
.
The
semsime function evaluates the similarity between an
AVe and a
RVe, represented as
ave and a
rve, respectively, by coupling their elements. The value calculated by
semsime is achieved by multiplying the
consim value by the
score matching value, namely
sm, illustrated in
Table 1, on the basis of both
and
scores. Formally, given:
is defined as follows:
where (
,
) ∈
, (
,
) ∈
, and
is utilized to evaluate the
semsim value between
rv and
av, without considering the rating scores of both
and
.
As mentioned above, in the experimentation presented in the next section, we also address an evolution of
, referred to as
, where in contrasting the
with the
, analogously to
SemSime, the number of features defined in the user request are considered rather than the maximal cardinality of the compared
. Formally, the
function is defined as the
semsim(
rv,
av) (see formula (2)), where
max{n,m} is replaced by the cardinality of the request vector
m, i.e.,
Note that
in Equation (
2) has been replaced in both Equations (
3) and (
4) by
m, which is the cardinality of the request vector, in order to give relevance to the user request and, in particular, to the features searched by the user.
4. SemSime Evaluation
In the experimentation, we asked 15 colleagues at work to identify their privileged tourist packages (request vector,
) by choosing 4 or (at most) 5 concepts of the ontology illustrated in
Figure 1. In particular, assume the user desires to reside in an international hotel, in a location connected by flight and local transportation services, and the opportunity of enjoying cultural activities and entertainments. The related request vector is given below:
By associating a score among
H,
M, and
L with each concept in the
, the user can indicate the level of his/her priority for that concept. Accordingly, the user can express his/her priority in the above
by the following:
where
InternationalHotel,
CulturalActivity and
Flight received a higher preference with respect to
LocalTransportation and
Entertainment, the latter with the lowest score.
Then, we requested our colleagues to examine the 10 packages reported in
Table 2, where each concept has been associated with a score among
H,
M, and
L, specifying, similar to
, the quality level of the offered services. Our colleagues were also asked to choose 5 packages, out of 10, more similar to their
, and to assign a score to each chosen one representing the degree of similarity between the package and their
(Human Judgment—HJ). The request vectors defined by the colleagues are shown in
Table 3.
For the purpose of illustration, let us consider the request vector defined by the user . This user desires to get to his/her destination by flight, reside in an international hotel, commute by public transportation, enjoy international meals and local attractions. As we observe, in the request vector of all the features have been refined by the score H, except which has been refined by M.
Table 4 shows the experimental results related to the user
. The first column contains the packages chosen by the user, that are
,
,
,
,
, and, respectively, the related similarity values 0.95, 0.8, 0.5, 0.4, 0.2. The other columns illustrate the 10 packages ranked, respectively, according to
,
and
, with the associated similarity values with
.
In
Table 5 ,
, and
correlations with
are given. We observe that
has higher values in 73% of cases.
Table 6 shows the precision and recall for all the users, where “-” stands for undefined because the set of retrieved resources is empty. Note that with respect to
the precision of
increases in 5 cases and decreases in 1 case, while the recall increases in most of the cases. Regarding to
the precision of
is the same while the recall increases in 4 cases and is the same for the remaining cases.
Note that, due to the presence of the rating scores, the similarity values computed according to
never increase with respect to
and
. For this reason, the evaluation of the precision and the recall has been performed on the basis of two different thresholds: 0.5 with regard to
,
, and
, as indicated in
Table 4, and 0.4 concerning
. This distinction allows us to balance the decrease of the similarity values obtained by
due to the presence of rating scores. In particular, considering
in
Table 4, the precision remains invariant (i.e., 1), while the recall increases (0.33 vs. 0.66 vs. 1.00). Both the precision and the recall are equal to 1 because
retrieves all and only the resources relevant to the user whereas, for instance, the recall of
is equal to 0.66 because, among
,
, and
,
is not retrieved.
Overall, in selecting resources more similar to users’ requests, the experimental results show that improves the performances of both and . This improvement is achieved because, with respect to and , relies on the additional knowledge carried by the preference and quality rating scores.
5. Conclusions and Future Work
In this work a semantic similarity method, referred to as SemSime, has been introduced. It relies on a previous proposal of the authors, namely SemSim. In particular, SemSime has been endowed with the rating scores (H), (M), (L) in the for representing both the preferences of the user and the quality of the resources in the search space. An experimentation of SemSime has been performed in the tourism domain by comparing it with the original method SemSim, and a variant of it, referred to as . The experiment reveals that SemSime outperforms both SemSim and , with respect to precision, recall and correlation.
Currently, we are planning to apply the proposed approach on a large dataset with regard to the annotated resources, and also by considering a large participation in the human judgment activity. In particular, we are setting up an experiment by focusing on the Digital Library of the Association for Computing Machinery (ACM). In this context, the reference ontology is represented by the ACM Computing Classification System (ACM-CCS), and the semantic annotations of the resources (more than one thousand papers) are the sets of keywords selected by the authors according to the ACM-CCS.
As a future investigation, SemSime will be applied to the automotive domain, in the framework of a collaboration with an Italian SME working in this sector, in order to improve the decision making process from both the dealer and the customer sides. In this domain, the idea of giving more relevance to the offer rather than the request will be investigated in order to enable the customer to be aware about the new features from car companies.