This paper summarises the scientific work presented at the 28th European Conference on Information Retrieval and demonstrates that the field has not only significantly progressed over the last year but has also continued to make inroads into areas such as Genomics, Multimedia, Peer-to-Peer and XML retrieval.
The retrieval of similar documents from the Web using documents as input, instead of key-term queries, is not currently supported by traditional Web search engines. One approach to the problem consists of fingerprinting the document's content into a set of queries that are submitted to a list of Web search engines. The results are then merged, their URLs are fetched, and their content is compared with the given document using text comparison algorithms. However, requesting results from multiple Web servers can take a significant amount of time and effort. In this work, a similarity function between the given document and the retrieved results is estimated. The function uses as variables features drawn from the information provided by search engine result records, such as rankings, titles and snippets, thereby avoiding the bottleneck of requesting external Web servers. We created a collection of around 10,000 search engine results by generating queries from 2,000 crawled Web documents. We then fitted the similarity function using the cosine similarity between the input document and the result contents as the target variable, and compared the execution time of the exact and approximate solutions. Our approximate solution reduced computation time by 86% at an acceptable level of precision with respect to the exact solution of the Web document retrieval problem.
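A minimal sketch of how such an approximation could be set up (the concrete feature set of reciprocal rank, title overlap and snippet overlap, and the least-squares fit, are assumptions for illustration; the paper does not prescribe them):

```python
# Sketch: approximate the cosine similarity between an input document and a
# search-engine result using only result-record features (rank, title, snippet),
# so the result page itself never has to be fetched. Feature choices are assumed.
import numpy as np

def jaccard(a, b):
    """Token-set overlap between two texts (a cheap lexical feature)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def features(document, result):
    """Features derived only from the SERP record; rank is counted from 1."""
    return np.array([
        1.0 / result["rank"],                 # reciprocal rank
        jaccard(document, result["title"]),   # title overlap with the document
        jaccard(document, result["snippet"]), # snippet overlap with the document
        1.0,                                  # bias term
    ])

def fit(documents, results_per_doc, cosine_targets):
    """Least-squares fit of feature weights against true cosine similarities,
    computed once offline from the fetched result contents."""
    X = np.array([features(d, r)
                  for d, rs in zip(documents, results_per_doc) for r in rs])
    y = np.array([t for ts in cosine_targets for t in ts])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def estimate(document, result, w):
    """Predicted similarity from SERP features alone -- no page download needed."""
    return float(features(document, result) @ w)
```

Once the weights are fitted offline, ranking candidate results for a new input document requires only the search engine result records rather than the result pages themselves.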
The User-over-Ranking hypothesis states that the user herself, rather than a Web search engine's ranking algorithm, can improve retrieval performance; the means are longer queries that provide additional keywords. Readers who take this hypothesis for granted should recall that virtually no user and none of the search index providers consider its implications. For readers who remain unconvinced by the claim, our paper provides empirical evidence.
To reconstruct a stemma or do any other kind of statistical analysis of a text tradition, one needs accurate data on the variants occurring at each location in each witness. These data are usually obtained from computer collation programs. Existing programs either collate every witness against a base text or divide all texts into segments as long as the longest variant phrase at each point. Neither method gives ideal data for stemma reconstruction. We describe a better collation algorithm (progressive multiple alignment) that collates all witnesses word by word without a base text, adding groups of witnesses one at a time, starting with the most closely related pair.
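As a rough illustration of the building block involved, the sketch below aligns two witnesses word by word with a standard Needleman-Wunsch dynamic program (the scoring values and example sentences are assumptions; the progressive step of merging witness groups, most closely related pair first, is only indicated in the comments):

```python
# Word-level pairwise alignment, the building block of progressive multiple
# alignment. The full method (not shown) would repeatedly merge the most closely
# related witnesses/groups using alignments like this one, with "-" marking gaps.
def align_words(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align two word sequences; returns two gapped lists of equal length."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # Traceback from the bottom-right corner.
    out_a, out_b, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
                match if a[i - 1] == b[j - 1] else mismatch):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return out_a[::-1], out_b[::-1]

# Example: two witnesses of the same sentence (invented for illustration).
w1 = "in the beginning was the word".split()
w2 = "in beginning was that word".split()
for x, y in zip(*align_words(w1, w2)):
    print(f"{x:12} {y}")
```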
Many applications use sequences of n consecutive symbols (n-grams). Hashing these n-grams can be a performance bottleneck. For more speed, recursive hash families compute hash values by updating previous values. We prove that recursive hash families cannot be more than pairwise independent. While hashing by irreducible polynomials is pairwise independent, our implementations either run in time O(n) or use an exponential amount of memory. As a more scalable alternative, we make hashing by cyclic polynomials pairwise independent by ignoring n-1 bits. Experimentally, we show that hashing by cyclic polynomials is twice as fast as hashing by irreducible polynomials. We also show that randomized Karp-Rabin hash families are not pairwise independent.
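The recursive update at the heart of hashing by cyclic polynomials can be sketched as follows (the word size, seed and byte alphabet are illustrative choices; the pairwise-independence fix of discarding n-1 bits is mentioned in a comment but not implemented):

```python
# Recursive (rolling) n-gram hashing by cyclic polynomials: each symbol maps to a
# random L-bit word, and the n-gram hash is updated in O(1) by a circular shift
# and two XORs. Discarding n-1 bits of the result (not shown) restores pairwise
# independence, as discussed in the paper.
import random

L = 32                      # word size in bits (illustrative)
MASK = (1 << L) - 1
random.seed(0)
H1 = [random.getrandbits(L) for _ in range(256)]  # random hash of single bytes

def rotl(x, k):
    """Circular left shift of an L-bit word by k bits."""
    k %= L
    return ((x << k) | (x >> (L - k))) & MASK

def hash_ngram(ngram):
    """Direct (non-recursive) cyclic-polynomial hash of an n-gram of bytes."""
    h = 0
    for c in ngram:
        h = rotl(h, 1) ^ H1[c]
    return h

def roll(h, n, out_byte, in_byte):
    """O(1) update: remove the outgoing byte, append the incoming byte."""
    return rotl(h, 1) ^ rotl(H1[out_byte], n) ^ H1[in_byte]

# Rolling over a byte string agrees with hashing each window from scratch.
data, n = b"abracadabra", 4
h = hash_ngram(data[:n])
for i in range(len(data) - n):
    assert h == hash_ngram(data[i:i + n])
    h = roll(h, n, data[i], data[i + n])
```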
We present a new approach to managing redundancy in sequence databanks such as GenBank. We store clusters of near-identical sequences as a representative union-sequence and a set of corresponding edits to that sequence. During search, the query is compared to only the union-sequences representing each cluster; cluster members are then only reconstructed and aligned if the union-sequence achieves a sufficiently high score. Using this approach in BLAST results in a 27% reduction in collection size and a corresponding 22% decrease in search time with no significant change in accuracy. We also describe our method for clustering that uses fingerprinting, an approach that has been successfully applied to collections of text and web documents in Information Retrieval. Our clustering approach is ten times faster on the GenBank nonredundant protein database than the fastest existing approach, CD-HIT. We have integrated our approach into FSA-BLAST, our new Open Source version of BLAST, available from http://www.fsa-blast.org/. As a result, FSA-BLAST is twice as fast as NCBI-BLAST with no significant change in accuracy.
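The sketch below illustrates the cluster representation and two-stage search in simplified form (the substitution-only edit model, the shared k-mer score standing in for BLAST, and the threshold are assumptions for illustration, not the paper's actual data structures or scoring):

```python
# Sketch of the union-sequence idea: each cluster keeps one representative sequence
# plus per-member edits; the query is first scored against the representative only,
# and members are reconstructed and scored individually just for clusters that pass.
from dataclasses import dataclass, field

@dataclass
class Cluster:
    union_seq: str                       # representative sequence for the cluster
    # member id -> list of (position, substituted residue) edits vs. union_seq
    edits: dict = field(default_factory=dict)

    def reconstruct(self, member_id):
        seq = list(self.union_seq)
        for pos, residue in self.edits[member_id]:
            seq[pos] = residue
        return "".join(seq)

def kmer_score(query, target, k=3):
    """Crude alignment-free score: count of shared k-mers (stand-in for BLAST)."""
    kmers = {query[i:i + k] for i in range(len(query) - k + 1)}
    return sum(target[i:i + k] in kmers for i in range(len(target) - k + 1))

def search(query, clusters, threshold=5):
    hits = []
    for cluster in clusters:
        # Stage 1: compare against the representative union-sequence only.
        if kmer_score(query, cluster.union_seq) < threshold:
            continue
        # Stage 2: reconstruct and score each member of the surviving cluster.
        for member_id in cluster.edits:
            member = cluster.reconstruct(member_id)
            hits.append((member_id, kmer_score(query, member)))
    return sorted(hits, key=lambda h: -h[1])

# Tiny invented example: two near-identical members stored as one cluster.
clusters = [Cluster("MKTAYIAKQR", {"seq1": [], "seq2": [(3, "G")]})]
print(search("MKTAYIAK", clusters, threshold=3))
```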
A malware mutation engine is able to transform a malicious program to create a different version of the program. Such mutation engines are used at distribution sites or in self-propagating malware in order to create variation in the distributed programs. Program normalization is a way to remove variety introduced by mutation engines, and can thus simplify the problem of detecting variant strains. This paper introduces the "normalizer construction problem" (NCP), and formalizes a restricted form of the problem called "NCP=", which assumes a model of the engine is already known in the form of a term rewriting system. It is shown that even this restricted version of the problem is undecidable. A procedure is provided that can, in certain cases, automatically solve NCP= from the model of the engine. This procedure is analyzed in conjunction with term rewriting theory to create a list of distinct classes of normalizer construction problems. These classes yield a list of possible attack vectors. Three strategies are defined for approximate solutions of NCP=, and an analysis is provided of the risks they entail. A case study using the W32.Evol virus suggests the approximations may be effective in practice for countering mutated malware.
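A toy sketch of normalization by term rewriting in the spirit of NCP= (the instruction-level rules and the input program are invented for illustration and are not taken from the W32.Evol model; termination within a step bound is simply assumed):

```python
# Toy normalizer: the mutation engine's behaviour is modelled as rewrite rules over
# instruction sequences, and the normalizer applies them in the reducing direction
# until no rule matches (a normal form). Rules below are illustrative only.
RULES = [
    # (variant pattern, normalized form) over instruction tuples
    (("push eax", "pop eax"), ()),                    # push/pop of same reg is a no-op
    (("mov eax, 0",), ("xor eax, eax",)),             # canonical zeroing idiom
    (("add eax, 1", "add eax, 1"), ("add eax, 2",)),  # fold split constants
]

def rewrite_once(program):
    """Apply the first matching rule at the leftmost position, or return None."""
    for lhs, rhs in RULES:
        n = len(lhs)
        for i in range(len(program) - n + 1):
            if tuple(program[i:i + n]) == lhs:
                return program[:i] + list(rhs) + program[i + n:]
    return None

def normalize(program, max_steps=10_000):
    """Rewrite to a normal form; termination and confluence are assumed here."""
    for _ in range(max_steps):
        step = rewrite_once(program)
        if step is None:
            return program
        program = step
    raise RuntimeError("rewriting did not terminate within max_steps")

variant = ["push eax", "pop eax", "mov eax, 0", "add eax, 1", "add eax, 1"]
print(normalize(variant))   # ['xor eax, eax', 'add eax, 2']
```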