Vast collections of documents available in image format need to be indexed for information retrieval purposes. In this framework, word spotting is an alternative solution to optical character recognition (OCR), which is rather inefficient for recognizing text of degraded quality and unknown fonts usually appearing in printed text, or writing style variations in handwritten documents. Over the past decade there has been a growing interest in addressing document indexing using word spotting, which is reflected in the continuously increasing number of approaches. However, very few comprehensive studies analyze the various aspects of a word spotting system. This work aims to review the recent approaches and to fill the gaps left by related works on several topics. The nature of texts and the inherent challenges addressed by word spotting methods are thoroughly examined. After presenting the core steps that compose a word spotting system, we investigate the use of retrieval enhancement techniques based on relevance feedback, which improve the retrieved results. Finally, we present the datasets that are widely used for word spotting, describe the evaluation standards and measures applied for performance assessment, and discuss the results achieved by the state of the art.
With the growth of online businesses, consumers need easy access to the products they want. This access is usually provided through search features, which associate lists of keywords with the available products, or by browsing through the different categories. Using Information Retrieval techniques such as indexing and searching, this paper shows how to create wordlists from the collections of documents sold by an online publisher and how to compare the lists of associated keywords with the resulting indexes in order to evaluate their completeness. Where new keywords are obtained, they are proposed as additions to the existing lists. This is particularly useful for consumers, whose access to the documents is simplified, and for the business itself, which gains in customer satisfaction.
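The comparison of wordlists against existing keyword lists can be illustrated with a minimal sketch. It is not the paper's implementation: the tokenizer, the stopword set, the `min_count` threshold, and the function names `build_wordlist` and `propose_keywords` are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "for", "is", "on", "with"}


def build_wordlist(documents, min_count=2):
    """Return the set of non-stopword terms that occur at least min_count times."""
    counts = Counter()
    for text in documents:
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return {word for word, count in counts.items() if count >= min_count}


def propose_keywords(documents, existing_keywords, min_count=2):
    """Return indexed terms that are missing from the existing keyword list."""
    wordlist = build_wordlist(documents, min_count)
    existing = {k.lower() for k in existing_keywords}
    return sorted(wordlist - existing)


docs = [
    "Indexing and retrieval of digital documents",
    "Retrieval models for documents in a digital library",
]
proposed = propose_keywords(docs, ["indexing", "digital"])
print(proposed)
```

Here the terms that recur across the collection but are absent from the existing keyword list are returned as candidates; whether to accept them remains an editorial decision.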
The contribution of this article is twofold. First, we present Indexing by latent Dirichlet allocation (LDI), an automatic document indexing method. Many ad hoc applications, or their variants with smoothing techniques suggested in LDA-based language modeling, can result in unsatisfactory performance as the document representations do not accurately reflect concept space. To improve document retrieval performance, we introduce a new definition of document probability vectors in the context of LDA and present a novel scheme for automatic document indexing based on LDA. Second, we propose an Ensemble Model (EnM) for document retrieval. EnM combines basic indexing models by assigning different weights and attempts to uncover the optimal weights to maximize the mean average precision. To solve the optimization problem, we propose an algorithm, which is derived based on the boosting method. The results of our computational experiments on benchmark data sets indicate that both the proposed approaches are viable options for document retrieval.
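The idea of weighting several indexing models to maximize mean average precision can be sketched as follows. This is only an illustration under stated assumptions: the abstract says the weights are found with a boosting-derived algorithm, which is not reproduced here; a plain grid search over weights summing to 1 is swapped in instead. The function names and the toy scores are hypothetical.

```python
from itertools import product


def average_precision(ranking, relevant):
    """Average precision of one ranked list against a set of relevant documents."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def ensemble_ranking(model_scores, weights):
    """Rank documents by the weighted sum of per-model scores (ties broken by id)."""
    docs = model_scores[0].keys()
    combined = {d: sum(w * s[d] for w, s in zip(weights, model_scores)) for d in docs}
    return sorted(combined, key=lambda d: (-combined[d], d))


def mean_average_precision(queries, weights):
    """queries is a list of (model_scores, relevant_set) pairs, one per query."""
    aps = [average_precision(ensemble_ranking(scores, weights), rel)
           for scores, rel in queries]
    return sum(aps) / len(aps)


def grid_search_weights(queries, n_models, steps=5):
    """Assumed stand-in for the paper's optimizer: exhaustive search on a weight grid."""
    grid = [i / steps for i in range(steps + 1)]
    best_weights, best_map = None, -1.0
    for weights in product(grid, repeat=n_models):
        if abs(sum(weights) - 1.0) > 1e-9:
            continue  # keep only weight vectors that sum to 1
        score = mean_average_precision(queries, weights)
        if score > best_map:
            best_weights, best_map = weights, score
    return best_weights, best_map


# Toy data: two models scoring three documents for a single query.
model_a = {"d1": 0.9, "d2": 0.1, "d3": 0.5}
model_b = {"d1": 0.2, "d2": 0.8, "d3": 0.3}
queries = [([model_a, model_b], {"d2"})]
best_weights, best_map = grid_search_weights(queries, n_models=2)
print(best_weights, best_map)
```

Grid search scales poorly with the number of models, which is one reason a dedicated optimization algorithm such as the one the authors propose would be preferable in practice.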
In this article, we propose a new approach for indexing biomedical documents based on a possibilistic network that carries out partial matching between documents and biomedical vocabulary. The main contribution of our approach is to deal with the imprecision and uncertainty of the indexing task using possibility theory. We enhance estimation of the similarity between a document and a given concept using the two measures of possibility and necessity. Possibility estimates the extent to which a document is not similar to the concept. The second measure can provide confirmation that the document is similar to the concept. Our contribution also reduces the limitation of partial matching. Although the latter allows extracting from the document other variants of terms than those in dictionaries, it also generates irrelevant information. Our objective is to filter the index using the knowledge provided by the Unified Medical Language System®. Experiments were carried out on different corpora, showing encouraging results (the improvement rate is +26.37% in terms of mean average precision when compared with the baseline).
An automatic document indexing method with a probabilistic concept search is presented. The proposed method utilizes Latent Dirichlet Allocation (LDA), a generative model for document modeling and classification. Ad hoc applications of LDA to document indexing, or their variants with smoothing techniques as suggested by previous studies in LDA-based language modeling, can result in unsatisfactory performance because the terms in documents may not properly reflect concept space.
In this study, we introduce a new definition of document probability vectors in the context of LDA and present a scheme for automatic document indexing based on it. The results of our computational experiment on a benchmark data set indicate that the proposed approach is a viable option for document indexing. A small illustrative example is also included.
Internet searching is one of the easiest and most useful ways to use the Internet. There are many reasons why Internet searching is so helpful. People use common search engines (such as Google and Yahoo) to find web pages, images, books, currency conversions, definitions, file types, news, local information, movies, and much more. Searching the Internet has become an almost reflexive act. Each day, tens of millions of global citizens fire up their personal computers and handheld devices to troll for information on products or services, news, jobs, social life, and so on, and to glean insights that will help guide business decisions.
The amount of information, in terms of documents, made available to users by the information retrieval process for resolving decision problems is a major factor in determining whether economically viable decisions are made. Various works in the literature have addressed the challenges of representing documents with key terms (generated from the document) as well as the variations in the meaning of each key term. In this work, a document representation scheme based on the key terms generated from the documents and on their usage was developed. To realize this document representation scheme, a computational model for capturing document usage was designed using the attribute-value pair technique of document annotation. The document usage model was applied in a Competitive Intelligence based Document Usage Creation and Exploration system that is currently under development. A preliminary evaluation of the document usage model, based on the cosine similarity function between the user query and the document set, was carried out. The results show that representing documents in terms of their usage can enhance the quality of information search results, as documents that would hitherto be considered not relevant to the user query are ranked as very relevant based on previous usages. General Terms: Content Analysis and Indexing Methods.
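The effect described above, where usage annotations make a document match a query that its key terms alone would miss, can be sketched with cosine similarity over simple term-count vectors. This is an assumption-laden illustration, not the authors' model: the `represent` function, the way attribute-value pairs are folded into the vector, and the example annotation are all hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def represent(key_terms, usage_annotations=()):
    """Build a term vector from key terms plus the values of usage annotations.

    usage_annotations is a sequence of (attribute, value) pairs; only the value
    is added as a term here, which is a simplifying assumption.
    """
    vector = Counter(key_terms)
    for _attribute, value in usage_annotations:
        vector[value] += 1
    return vector


query = represent(["pricing"])
key_terms = ["market", "report"]
usage = [("decision_problem", "pricing")]  # hypothetical record of a previous usage

score_terms_only = cosine_similarity(query, represent(key_terms))
score_with_usage = cosine_similarity(query, represent(key_terms, usage))
print(score_terms_only, score_with_usage)
```

With key terms alone the document shares no term with the query, so its similarity is zero; once the usage annotation is included, the document receives a non-zero score and can be ranked.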
In this contribution we introduce a new method for global segmentation of color documents with a structure based on text frames and pictures. It is based on an extensive analysis of the expected shape of clusters in RGB-color space. The method provides an improved segmentation, and gives a proper basis for indexing and layout analysis. Results are very promising.