Optimized Prediction of Hard Keyword Queries Over Databases
Volume: 2, Issue: 12 | ISSN: 2321-8169 | pp. 4243-4247
Prof. P. D. Lambhate
Department of Computer Engineering
JSPM's JSCOE, Pune, India
Email: lambhatepoonam9@gmail.com
Abstract: Keyword query interfaces on databases give easy access to data, but they suffer from low ranking quality, i.e., low precision and/or recall. It would be useful to recognize queries that are likely to have low ranking quality in order to improve user satisfaction. For example, the system may suggest alternative queries to the user for such difficult queries. The goal of this paper is to study the characteristics of hard queries and to propose a novel framework that measures the degree of difficulty of a keyword query over a database, taking into account the structure and the content of the database as well as the query results. Query difficulty prediction models exist, but results indicate that even with structured data, finding the desired answers to keyword queries is still a hard task.
Further, we use linguistic features, such as morphological, syntactical, and semantic features, for effective prediction of difficult keyword queries over databases. As a result, the time required to predict difficult keywords over a large dataset is reduced, and the process becomes more robust and accurate.
Keywords: Query performance, query effectiveness, keyword query, robustness, prediction, database
__________________________________________________*****_________________________________________________
I. INTRODUCTION

Keyword query interfaces (KQIs) for databases have received much attention in the last decade due to their ease of use in searching and exploring data [2], [4]. Keyword queries typically have many possible answers. A KQI must recognize the information need behind a keyword query and rank the answers so that the desired answers appear at the top of the list. Databases contain entities, attributes, and values. Some of the difficulties in answering a query arise because users do not specify the preferred schema element(s) for each query term. For example, the keyword Godfather on a movie database does not state whether the user is interested in the title or the distributor company. So, a KQI must find the desired attributes associated with each term in the query. Users also do not give enough information about their desired entities; for example, a keyword may return movies, actors, or producers.

Recently, joint efforts have been made to provide standard benchmarks and evaluation platforms for keyword search methods over databases. One effort is the data-centric track of the INEX Workshop, where KQIs are evaluated over the well-known IMDB data set, which contains structured information about movies and people [9]. Another effort is the series of Semantic Search Challenges at the Semantic Search Workshop, where the data set is the Billion Triple Challenge [10], which is extracted from Wikipedia, and the queries come from a Yahoo! keyword query log. Users have provided relevance judgments for both benchmarks. These results indicate that even with structured data, finding the desired answers to keyword queries is still a hard task. Examining the ranking quality of the methods used in both workshops, we observed that they perform very poorly on a subset of queries. For example, consider the query ancient Rome era over the IMDB data set. Users would like to see information about movies that are about ancient Rome. For this query, the XML search methods that we implemented return rankings of considerably lower quality than their average ranking quality over all queries. Therefore, some queries are more difficult than others. Furthermore, no matter which ranking method is used, we cannot deliver a reasonable ranking for these queries. It is important for a KQI to recognize such queries and warn the user, or to employ alternative techniques such as query reformulation or query suggestion. It may also use techniques such as query result diversification. There has not been any work on predicting or analyzing the difficulty of queries over databases. Researchers have proposed some methods to detect difficult queries over plain text document collections, but these techniques are not applicable to our problem since they ignore the schema of the database. In particular, as mentioned earlier, a KQI must assign each query term to a schema element(s) in the database. In this paper, we analyze the characteristics of difficult queries over databases and propose a novel method to identify such queries, which is likely to improve user satisfaction.

II. LITERATURE SURVEY

Several research studies have presented methods to predict hard queries over unstructured documents or plain text collections. They can be classified into two groups: pre-retrieval and post-retrieval methods. Pre-retrieval methods predict the difficulty of a query without computing its results. These methods usually use the statistical properties of the terms in the query to measure the specificity, ambiguity, or term relatedness of the query and so predict its difficulty. Examples are the average inverse document frequency of the query terms or the number of documents that contain at least one query term. These methods generally assume that the more discriminative the query terms are, the easier the query will be. Post-retrieval methods make use of the results of a query to forecast its
difficulty and generally fall into one of the following
categories.
Clarity-score-based: These methods are based on the concept of the clarity score and assume that the users of an easy query are interested in very few topics, so its results are sufficiently distinguishable from the other documents in the collection. They are more effective than pre-retrieval methods for text documents. Some systems compute the distinguishability of a query's results from the documents in the collection by comparing the probability distribution of terms in the results with the probability distribution of terms in the whole collection. If these probability distributions are relatively similar, the query results contain information about almost as many topics as the whole collection and, thus, the query is considered difficult. Several successors propose methods to improve the efficiency and effectiveness of the clarity score. However, one requires domain knowledge about the data set to extend the idea of clarity score to queries over databases: each topic in a database contains the entities that are about a similar subject.
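As an illustration of the clarity-score idea for plain text collections, the following is a minimal sketch under our own assumptions (whitespace tokenization, no smoothing, and illustrative function names): it compares the unigram language model of the result documents against that of the whole collection using KL divergence, and a low score is read as a sign of a difficult query.

import math
from collections import Counter

def unigram_model(docs):
    """Unigram term distribution (term -> probability) over a list of documents."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.lower().split())
    total = sum(counts.values()) or 1
    return {term: c / total for term, c in counts.items()}

def clarity_score(result_docs, collection_docs):
    """KL divergence between the result language model and the collection model.
    A low clarity score means the results look like the collection as a whole,
    which the clarity-score family treats as a sign of a difficult query."""
    p_results = unigram_model(result_docs)
    p_collection = unigram_model(collection_docs)
    score = 0.0
    for term, p in p_results.items():
        q = p_collection.get(term)
        if q:  # terms absent from the collection model are skipped in this sketch
            score += p * math.log(p / q, 2)
    return score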
Ranking-score-based: The ranking score that a retrieval system assigns to a document for an input query estimates the similarity of the query and the document. Some recent methods measure the difficulty of a query based on the score distribution of its results.
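A minimal sketch of one such score-distribution statistic follows; the choice of the standard deviation of the top-k scores and the function name are our own assumptions, used only to make the idea concrete.

import statistics

def score_spread(top_k_scores):
    """Spread of the top-k retrieval scores. A small spread (a flat score
    distribution) is commonly read as a sign of a harder query in
    score-distribution-based predictors."""
    if len(top_k_scores) < 2:
        return 0.0
    return statistics.stdev(top_k_scores)

# Example: score_spread([12.4, 12.3, 12.3, 12.2]) is much smaller than
# score_spread([25.0, 13.1, 9.8, 4.2]), suggesting the first query is harder.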
Robustness-based: Another group of post-retrieval methods argues that the results of an easy query are relatively stable under perturbation of the query [3], the documents [11], or the ranking algorithm. Our proposed query difficulty prediction model falls into this category. Some methods use machine learning techniques to learn the properties of difficult queries and predict their hardness. They have similar limitations to the other approaches when applied to structured data. Moreover, their success depends on the amount and quality of the available training data, and enough high-quality training data is not available for many databases. Some researchers propose frameworks that theoretically explain existing predictors and combine them to achieve higher prediction accuracy.
Keyword queries over structured databases are notoriously ambiguous. No single interpretation of a keyword query can satisfy all users, and multiple interpretations may yield overlapping results. One proposed scheme balances the relevance and novelty of keyword search results over structured databases: it first presents a probabilistic model that effectively ranks the possible interpretations of a keyword query over structured data [12]. Another line of work forecasts query difficulty based on linguistic features, using TreeTagger and other natural language processing tools. These features include morphological features (number of words, proportion of proper nouns, and average number of numeral values), syntactical features, and semantic features.
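To make these features concrete, the following is a minimal sketch that approximates the morphological features with simple heuristics instead of TreeTagger; the capitalization and digit tests and the feature names are our own assumptions, not the cited system's exact definitions.

def morphological_features(query):
    """Rough morphological features of a keyword query: number of words,
    share of capitalized tokens (a crude proxy for proper nouns), and
    share of tokens containing digits (a crude proxy for numerals)."""
    tokens = query.split()
    n = len(tokens)
    if n == 0:
        return {"num_words": 0, "proper_noun_ratio": 0.0, "numeral_ratio": 0.0}
    proper = sum(1 for t in tokens if t[:1].isupper())
    numerals = sum(1 for t in tokens if any(ch.isdigit() for ch in t))
    return {
        "num_words": n,
        "proper_noun_ratio": proper / n,
        "numeral_ratio": numerals / n,
    }

# Example: morphological_features("Godfather 1972 Paramount")
# -> {"num_words": 3, "proper_noun_ratio": 0.67, "numeral_ratio": 0.33} (approximately)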
[Table (recovered only in part): a comparison of query performance prediction approaches with columns "Techniques used" and "Limitations". Recoverable cell text includes: "A Unified Framework for Post-Retrieval Query Performance Prediction"; "Machine learning"; "Post-retrieval (Clarity)"; "understanding users preferences"; "Post-retrieval"; "It outperforms on large dataset".]
2) One requires domain knowledge about the data sets to extend the idea of clarity score to queries over databases.
3) Each topic in a database contains the entities that are about a similar subject.
4) The success of some post-retrieval methods depends entirely on the amount and quality of the available training data.
The above problems are mitigated by a recently presented efficient method for predicting difficult keyword queries over databases. This method solves the problem of predicting the effectiveness of keyword queries over databases more efficiently than existing methods and with a high level of accuracy; it takes less time and has relatively low prediction error. However, it has limitations: it has not been evaluated on large datasets, and string approximation is not taken into consideration.
B) Proposed Method:
In this paper, our main aim is to present a new, improved method for difficult keyword prediction that overcomes the limitations of scalability, dataset flexibility, and string approximation. As a result, the time required to predict difficult keywords over a large dataset is reduced, and the process becomes more robust and accurate. In addition, a spatial approximate string query is presented: we use the edit distance as the similarity measurement for the string predicate and focus on range queries as the spatial predicate. We also use linguistic features, such as morphological, syntactical, and semantic features, for effective prediction of difficult keyword queries over the database.
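As a sketch of how the string predicate could be evaluated, the following computes the Levenshtein edit distance and applies an assumed threshold tau; the function names and the threshold value are illustrative only.

def edit_distance(s, t):
    """Levenshtein edit distance: minimum number of single-character
    insertions, deletions, and substitutions needed to turn s into t."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete from s
                            curr[j - 1] + 1,      # insert into s
                            prev[j - 1] + cost))  # substitute
        prev = curr
    return prev[-1]

def string_predicate(keyword, value, tau=2):
    """Approximate string match: accept the attribute value if it is within
    edit distance tau of the query keyword (tau is an assumed threshold)."""
    return edit_distance(keyword.lower(), value.lower()) <= tau

# Example: string_predicate("godfater", "Godfather") is True with tau=2.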
C) Mathematical Model:
Structured Robustness: Let V be the number of distinct terms in database DB. Each attribute value A_a \in A, 1 \le a \le |A|, can be modeled using a V-dimensional multivariate distribution X_a = (X_{a,1}, \ldots, X_{a,V}), where X_{a,j} is a random variable that represents the frequency of term w_j in A_a. The probability mass function of X_a is:

f_{X_a}(\vec{x}_a) = \Pr(X_{a,1} = x_{a,1}, \ldots, X_{a,V} = x_{a,V})

where \vec{x}_a = (x_{a,1}, \ldots, x_{a,V}) and the x_{a,j} are non-negative integers. Let \tilde{X}_a denote the corrupted (noisy) version of X_a; we have:

f_{\tilde{X}_a}(\vec{\tilde{x}}_a) = \Pr(\tilde{X}_{a,1} = \tilde{x}_{a,1}, \ldots, \tilde{X}_{a,V} = \tilde{x}_{a,V})

and, for each term w_j,

f_{\tilde{X}_{a,j}}(x) = \gamma_A f_A(x) + \gamma_T f_T(x) + \gamma_S f_S(x)

where 0 \le \gamma_A, \gamma_T, \gamma_S \le 1 and \gamma_A + \gamma_T + \gamma_S = 1. Here f_A, f_T, and f_S model the noise at the attribute value, attribute, and entity set levels, respectively. The parameters \gamma_A, \gamma_T, and \gamma_S have the same values for all terms w_j \in V and are set empirically. Since each attribute value is a small document, we model f_A as a Poisson distribution:

f_A(x) = \frac{\lambda_{a,j}^{x} e^{-\lambda_{a,j}}}{x!}

where \lambda_{a,j} is the expected frequency of term w_j in attribute value A_a.
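A minimal sketch of how one corrupted term frequency could be sampled from this three-level mixture is given below; the gamma values, the use of a Poisson model at every level, and the function names are assumptions of the sketch rather than the exact corruption procedure of the model above.

import math
import random

def sample_poisson(lam):
    """Draw one sample from a Poisson distribution with mean lam (Knuth's method)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= random.random()
        if p <= threshold:
            return k
        k += 1

def corrupt_term_frequency(freq_attr_value, freq_attr, freq_entity_set,
                           gamma_a=0.6, gamma_t=0.3, gamma_s=0.1):
    """Sample a noisy frequency of one term from the mixture
    gamma_a*f_A + gamma_t*f_T + gamma_s*f_S: first pick a noise level with
    probability gamma_a, gamma_t, or gamma_s, then draw a Poisson variate
    whose mean is the term frequency observed at that level."""
    r = random.random()
    if r < gamma_a:
        lam = freq_attr_value
    elif r < gamma_a + gamma_t:
        lam = freq_attr
    else:
        lam = freq_entity_set
    return sample_poisson(lam)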
D) Implementation Methodology:
Proposed algorithm:
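As an illustration of a robustness-style predictor in the spirit of this work (not necessarily the exact proposed algorithm), the following sketch corrupts the top-k results several times, re-ranks the corrupted copies, and averages the Spearman rank correlation with the original ranking. Here rank_fn and corrupt_fn are assumed callables supplied by the surrounding system: rank_fn(query, results) returns a list of result ids ordered by score, and corrupt_fn(results) returns a copy of the results with noisy attribute values (for example, using corrupt_term_frequency above) but the same ids.

def spearman_correlation(ranking_a, ranking_b):
    """Spearman rank correlation between two rankings of the same result ids
    (no ties); 1.0 means the orders are identical."""
    n = len(ranking_a)
    if n < 2:
        return 1.0
    position_b = {doc_id: pos for pos, doc_id in enumerate(ranking_b)}
    d_squared = sum((pos - position_b[doc_id]) ** 2
                    for pos, doc_id in enumerate(ranking_a))
    return 1.0 - (6.0 * d_squared) / (n * (n * n - 1))

def robustness_score(query, top_k_results, rank_fn, corrupt_fn, trials=50):
    """Average rank correlation between the original ranking of the top-k
    results and the rankings of their corrupted copies. Lower values mean the
    ranking is unstable under noise, i.e., the query is predicted to be hard."""
    original = rank_fn(query, top_k_results)
    total = 0.0
    for _ in range(trials):
        corrupted = corrupt_fn(top_k_results)
        total += spearman_correlation(original, rank_fn(query, corrupted))
    return total / trials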