This paper summarises the scientific work presented at the 28th European Conference on Information Retrieval and demonstrates that the field has not only significantly progressed over the last year but has also continued to make inroads into areas such as Genomics, Multimedia, Peer-to-Peer and XML retrieval.
The first operand of the . operator shall have a qualified or unqualified structure or union type, and the second operand shall name a member of that type.
The retrieval of similar documents from the Web using documents as input, instead of key-term queries, is not currently supported by traditional Web search engines. One approach to the problem consists of fingerprinting the document's content into a set of queries that are submitted to a list of Web search engines. The results are then merged, their URLs are fetched, and their content is compared with the given document using text comparison algorithms. However, requesting results from multiple web servers can take a significant amount of time and effort. In this work, a similarity function between the given document and the retrieved results is estimated. The function uses as variables features drawn from the search engine result records themselves, such as rankings, titles and snippets, thereby avoiding the bottleneck of requesting documents from external web servers. We created a collection of around 10,000 search engine results by generating queries from 2,000 crawled Web documents. We then fitted the similarity function using the cosine similarity between the input and the result content as the target variable. The execution time of the exact and approximated solutions was compared. Our approximated solution reduced computational time by 86% at an acceptable level of precision with respect to the exact solution of the web document retrieval problem.
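A minimal sketch of the general idea (not the authors' actual model): predict the document-to-result cosine similarity from features available directly in the result records, so no result page ever has to be fetched at query time. The specific features and the linear least-squares fit are assumptions for illustration only.

```python
import numpy as np

def result_features(doc_terms, rank, title, snippet):
    """Features taken only from the search-engine result record (no page fetch).
    The feature set here is an illustrative assumption, not the paper's exact one."""
    doc = set(doc_terms)
    title_terms = set(title.lower().split())
    snippet_terms = set(snippet.lower().split())
    return [
        1.0 / rank,                                            # reciprocal rank
        len(doc & title_terms) / (len(title_terms) or 1),      # title term overlap
        len(doc & snippet_terms) / (len(snippet_terms) or 1),  # snippet term overlap
    ]

def fit_similarity_model(X, y):
    """Least-squares fit: X holds record features per result, y holds the exact
    cosine similarities computed once, offline, as the training target."""
    X = np.hstack([np.ones((len(X), 1)), np.asarray(X, dtype=float)])  # intercept
    coef, *_ = np.linalg.lstsq(X, np.asarray(y, dtype=float), rcond=None)
    return coef

def estimate_similarity(coef, features):
    """Cheap approximation used at query time instead of fetching the result URL."""
    return float(coef[0] + np.dot(coef[1:], features))
```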
The User-over-Ranking hypothesis states that the user herself, rather than a web search engine's ranking algorithm, can help to improve retrieval performance. The means are longer queries that provide additional keywords. Readers who take this hypothesis for granted should recall that virtually no user and none of the search index providers consider its implications. For readers who are unconvinced by the claim, our paper gives empirical evidence.
Providing fair assessment with timely feedback for students is a difficult task with science laboratory classes containing large numbers of students. Throughout our Faculty, such classes are assessed by short-answer questions (SAQs) centred on principles encountered in the laboratory. We have shown recently that computer-assisted assessment (CAA) has several advantages and is well received by students. However, student evaluation has shown that this system does not provide suitable feedback. We thus introduced peer assessment (PA) as a complementary procedure. In October 2006, 457 students registered for a first-year practical unit in the Faculty of Life Sciences, University of Manchester. This unit consists of ten compulsory biology practical classes. The first four practicals were assessed using PA; the remaining six practicals were assessed by CAA and marked by staff or postgraduate student demonstrators. The reliability and validity of PA were determined by comparing duplicate scripts and by staff moderation of selected scripts. Student opinions were sought via questionnaires. We show that both assessments are valid, reliable, easy to administer and are accepted by students. PA increases direct feedback to students, although the initial concerns of student groups such as mature and EU/International students need to be addressed using pre-PA training.
For implementing content management solutions and enabling new applications associated with data retention, regulatory compliance, and litigation issues, enterprises need to develop advanced analytics to uncover relationships among the documents, e.g., content similarity, provenance, and clustering. In this paper, we evaluate the performance of four syntactic similarity algorithms. Three algorithms are based on Broder's "shingling" technique while the fourth algorithm employs a more recent approach, "content-based chunking". For our experiments, we use a specially designed corpus of documents that includes a set of "similar" documents with a controlled number of modifications. Our performance study reveals that the similarity metric of all four algorithms is highly sensitive to settings of the algorithms' parameters: sliding window size and fingerprint sampling frequency. We identify a useful range of these parameters for achieving good practical results, and compare the performance of the four algorithms in a controlled environment. We validate our results by applying these algorithms to finding near-duplicates in two large collections of HP technical support documents.
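For context, a hedged sketch of the shingling family that the study evaluates: slide a window of w tokens over the document, hash each shingle, keep a sampled subset of fingerprints (here, hashes divisible by a modulus, an assumed sampling rule), and compare documents by the Jaccard similarity of their fingerprint sets. The window size and sampling frequency are exactly the parameters whose sensitivity the paper measures; the values below are arbitrary defaults.

```python
import hashlib

def shingle_fingerprints(text, window=4, mod=8):
    """w-token shingles, hashed; keep only hashes that are 0 mod `mod`
    (a simple sampling rule; `window` and `mod` are the tunable parameters)."""
    tokens = text.lower().split()
    prints = set()
    for i in range(max(len(tokens) - window + 1, 1)):
        shingle = " ".join(tokens[i:i + window])
        h = int.from_bytes(hashlib.sha1(shingle.encode()).digest()[:8], "big")
        if h % mod == 0:
            prints.add(h)
    return prints

def jaccard(a, b):
    """Resemblance of two documents via their sampled fingerprint sets."""
    return len(a & b) / len(a | b) if a | b else 1.0
```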
An Information Systems Design Theory (ISDT) is a prescriptive theory that offers theory-based principles that can guide practitioners in the design of effective Information Systems and set an agenda for on-going research. This paper introduces the origins of the ISDT concept and describes one ISDT for Web-based Education (WBE). The paper shows how this ISDT has, over the last seven years, produced a WBE Information System that is more inclusive, more flexible, and more closely integrated with the needs of its host organization.
For some time now, researchers have been seeking to place software measurement on a firmer footing by establishing a theoretical basis for software comparison. Although there has been some work on employing information-theoretic concepts for the quantification of code documents, particularly entropy and entropy-like measurements, we propose that employing the Similarity Metric of Li, Vitányi, and coworkers for the comparison of software documents will lead to a theoretically justifiable means of comparing and evaluating software artifacts. In this paper, we review previous work on software measurement with a particular emphasis on information-theoretic aspects, we examine the body of work on Kolmogorov complexity (upon which the Similarity Metric is based), and we report on some experiments that lend credence to our proposals. Finally, we discuss the potential advantages derived from the application of this theory to areas in the field of software engineering.
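The Similarity Metric itself is defined in terms of Kolmogorov complexity and is uncomputable. A common practical stand-in, and plausibly the kind of measurement meant here (that is an assumption), is the Normalized Compression Distance, which replaces Kolmogorov complexity with the length of a real compressor's output:

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance: approximate the Similarity Metric using
    zlib-compressed lengths. Values near 0 indicate very similar artifacts."""
    cx = len(zlib.compress(x))
    cy = len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

# Example usage: two source files compared directly as byte strings.
# ncd(open("a.c", "rb").read(), open("b.c", "rb").read())
```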
Many individual instructors, and in some cases entire universities, are gravitating towards the use of comprehensive learning management systems (LMSs), such as Blackboard and Moodle, for managing courses and enhancing student learning. As useful as LMSs are, they are short on features that meet certain needs specific to computer science education. On the other hand, computer science educators have developed, and continue to develop, computer-based software tools that aid in management, teaching, and/or learning in computer science courses. In this report we provide an overview of current CS-specific online learning resources and guidance on how one might best go about extending an LMS to include such tools and resources. We refer to an LMS that is extended specifically for computer science education as a Computing Augmented Learning Management System, or CALMS. We also discuss sound pedagogical practices and some practical and technical principles for building a CALMS. However, we do not go into the details of creating a plug-in for a specific LMS. Further, the report does not favor one LMS over another as the foundation for a CALMS.
In this paper, we provide several alternatives to the classical Bag-of-Words model for automatic authorship attribution. To this end, we consider linguistic and writing-style information, such as grammatical structures, to construct different document representations. Furthermore, we describe two techniques to combine the obtained representations: combination vectors and ensemble-based meta-classification. Our experiments show the viability of our approach.
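As an illustration of the combination-vector idea (a sketch that assumes each representation is simply a feature dictionary; the paper's actual features differ): build a bag-of-words vector and a separate style vector, then concatenate them under namespaced keys so that a single classifier can consume both.

```python
from collections import Counter

def bag_of_words(text):
    """Classical token-count representation."""
    return Counter(text.lower().split())

def style_features(text):
    """Crude stand-ins for writing-style information (assumed, for illustration)."""
    words = text.split()
    return {
        "avg_word_len": sum(map(len, words)) / (len(words) or 1),
        "comma_rate": text.count(",") / (len(words) or 1),
    }

def combination_vector(text):
    """Concatenate the two representations by namespacing their feature keys."""
    vec = {f"bow::{w}": c for w, c in bag_of_words(text).items()}
    vec.update({f"style::{k}": v for k, v in style_features(text).items()})
    return vec
```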
Automatically generated content is ubiquitous on the web: dynamic sites built using the three-tier paradigm are good examples (e.g., commercial sites, blogs and other sites edited using web authoring software), as are less legitimate spamdexing attempts (e.g., link farms, faked directories).
To reconstruct a stemma or do any other kind of statistical analysis of a text tradition, one needs accurate data on the variants occurring at each location in each witness. These data are usually obtained from computer collation programs. Existing programs either collate every witness against a base text or divide all texts up into segments as long as the longest variant phrase at each point. These methods do not give ideal data for stemma reconstruction. We describe a better collation algorithm (progressive multiple alignment) that collates all witnesses word by word without a base text, adding groups of witnesses one at a time, starting with the most closely related pair.
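A minimal sketch of the pairwise step that progressive multiple alignment builds on: align two witnesses word by word and record where they agree and where they vary. Python's difflib stands in here for the paper's own alignment method, which it is not.

```python
from difflib import SequenceMatcher

def collate_pair(witness_a, witness_b):
    """Align two witnesses word by word and report agreements and variants."""
    a, b = witness_a.split(), witness_b.split()
    rows = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, a, b).get_opcodes():
        rows.append((tag, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return rows

# collate_pair("in the beginning was the word",
#              "in the begynnynge was that word")
# -> [('equal', 'in the', 'in the'),
#     ('replace', 'beginning', 'begynnynge'), ...]
```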
Many database applications, such as sequence comparison, sequence searching, and sequence matching, process large database sequences. We introduce a novel and efficient technique to improve the performance of database applications by using a hybrid GPU/CPU platform. In particular, our technique solves the problem of the low efficiency resulting from running short-length sequences in a database on a GPU. To verify our technique, we applied it to the widely used Smith-Waterman algorithm. The experimental results show that our hybrid GPU/CPU technique improves the average performance by a factor of 2.2, and improves the peak performance by a factor of 2.8 when compared to earlier implementations.
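For reference, a plain CPU-side Smith-Waterman scoring sketch; the GPU kernel and the hybrid dispatch are platform code and not shown, and the scoring parameters below are illustrative assumptions rather than the paper's settings.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local-alignment score between sequences a and b, O(len(a)*len(b))."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            score = max(
                0,
                prev[j - 1] + (match if ca == cb else mismatch),  # diagonal
                prev[j] + gap,                                    # gap in b
                curr[j - 1] + gap,                                # gap in a
            )
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best

# A hybrid scheme might route database sequences by length, e.g. short ones to
# this CPU path and long ones to a GPU kernel (the threshold is an assumption).
```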
Many applications use sequences of n consecutive symbols (n-grams). Hashing these n-grams can be a performance bottleneck. For more speed, recursive hash families compute hash values by updating previous values. We prove that recursive hash families cannot be more than pairwise independent. While hashing by irreducible polynomials is pairwise independent, our implementations either run in time O(n) or use an exponential amount of memory. As a more scalable alternative, we make hashing by cyclic polynomials pairwise independent by ignoring n-1 bits. Experimentally, we show that hashing by cyclic polynomials is twice as fast as hashing by irreducible polynomials. We also show that randomized Karp-Rabin hash families are not pairwise independent.
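A sketch of the recursive (rolling) idea with cyclic polynomials: each per-symbol random word is combined with bit rotations, so shifting the window by one symbol needs only a rotation and two XORs instead of rehashing the whole n-gram. The 64-bit width and random table are illustrative choices, not the paper's exact construction.

```python
import random

BITS = 64
MASK = (1 << BITS) - 1
random.seed(42)
TABLE = [random.getrandbits(BITS) for _ in range(256)]  # per-symbol random words

def rotl(x, r):
    """Rotate a BITS-wide word left by r positions."""
    return ((x << r) | (x >> (BITS - r))) & MASK

def hash_ngram(ngram: bytes) -> int:
    """Direct (non-recursive) cyclic-polynomial hash of one n-gram."""
    h = 0
    for byte in ngram:
        h = rotl(h, 1) ^ TABLE[byte]
    return h

def roll(h: int, n: int, outgoing: int, incoming: int) -> int:
    """Recursive update: drop `outgoing`, append `incoming`, in O(1)."""
    return rotl(h, 1) ^ rotl(TABLE[outgoing], n) ^ TABLE[incoming]

# Sanity check: rolling from "abcd" to "bcde" matches hashing "bcde" directly.
data, n = b"abcde", 4
assert roll(hash_ngram(data[:4]), n, data[0], data[4]) == hash_ngram(data[1:5])
```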
We present a new approach to managing redundancy in sequence databanks such as GenBank. We store clusters of near-identical sequences as a representative union-sequence and a set of corresponding edits to that sequence. During search, the query is compared only to the union-sequences representing each cluster; cluster members are reconstructed and aligned only if the union-sequence achieves a sufficiently high score. Using this approach in BLAST results in a 27% reduction in collection size and a corresponding 22% decrease in search time with no significant change in accuracy. We also describe our clustering method, which uses fingerprinting, an approach that has been successfully applied to collections of text and web documents in Information Retrieval. Our clustering approach is ten times faster on the GenBank non-redundant protein database than the fastest existing approach, CD-HIT. We have integrated our approach into FSA-BLAST, our new Open Source version of BLAST, available from http://www.fsa-blast.org/. As a result, FSA-BLAST is twice as fast as NCBI-BLAST with no significant change in accuracy.
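A toy sketch of the storage idea (the data layout and edit model are assumptions; the actual FSA-BLAST encoding differs): keep one union-sequence per cluster plus a small edit list per member, and reconstruct members only when the cluster's union-sequence scores highly enough against the query.

```python
def reconstruct(union_seq: str, edits):
    """Rebuild a cluster member from the union-sequence and its edit list.
    Each edit is (position, replacement_char); a toy edit model for illustration."""
    seq = list(union_seq)
    for pos, ch in edits:
        seq[pos] = ch
    return "".join(seq)

cluster = {
    "union": "MKTAYIAKQR",
    "members": {
        "seq1": [],                   # identical to the union-sequence
        "seq2": [(3, "S")],           # one substitution
        "seq3": [(3, "S"), (8, "K")],
    },
}

def search_cluster(score_fn, query, cluster, threshold):
    """Align the query against the union-sequence first; expand members only on a hit."""
    if score_fn(query, cluster["union"]) < threshold:
        return {}
    return {name: score_fn(query, reconstruct(cluster["union"], edits))
            for name, edits in cluster["members"].items()}
```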
A malware mutation engine is able to transform a malicious program to create a different version of the program. Such mutation engines are used at distribution sites or in self-propagating malware in order to create variation in the distributed programs. Program normalization is a way to remove variety introduced by mutation engines, and can thus simplify the problem of detecting variant strains. This paper introduces the "normalizer construction problem" (NCP), and formalizes a restricted form of the problem called "NCP=", which assumes a model of the engine is already known in the form of a term rewriting system. It is shown that even this restricted version of the problem is undecidable. A procedure is provided that can, in certain cases, automatically solve NCP= from the model of the engine. This procedure is analyzed in conjunction with term rewriting theory to create a list of distinct classes of normalizer construction problems. These classes yield a list of possible attack vectors. Three strategies are defined for approximate solutions of NCP=, and an analysis is provided of the risks they entail. A case study using the W32.Evol virus suggests the approximations may be effective in practice for countering mutated malware.
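A tiny sketch of normalization by rewriting, at the string level rather than on program terms, so it is only an analogy to NCP=: apply the oriented rewrite rules until no rule matches, yielding a normal form on which two mutated variants can be compared. The rule set is invented for illustration, and termination and confluence are simply assumed here, which is exactly the kind of property the paper's analysis is about.

```python
def normalize(program: str, rules, max_steps=10_000):
    """Apply rewrite rules until a fixed point (normal form) is reached.
    Termination/confluence are assumed for this toy rule set."""
    for _ in range(max_steps):
        for lhs, rhs in rules:
            if lhs in program:
                program = program.replace(lhs, rhs, 1)
                break
        else:
            return program  # no rule applied: normal form reached
    raise RuntimeError("rewriting did not terminate within max_steps")

# Toy rules undoing typical mutation-engine padding (illustrative only):
RULES = [
    ("push eax\npop eax\n", ""),   # cancel a do-nothing push/pop pair
    ("nop\n", ""),                 # strip inserted no-ops
]

variant = "nop\npush eax\npop eax\nmov ebx, 1\nnop\n"
assert normalize(variant, RULES) == "mov ebx, 1\n"
```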
Crawlers harvest the web by iteratively downloading the documents referenced by URLs. Different URLs frequently refer to the same document, leading crawlers to download duplicates. Hence, web archives built through incremental crawls waste space storing these documents. In this paper, we study the existence of duplicates within a web archive and discuss strategies to eliminate them at the storage level during the crawl. We present a storage system architecture that addresses the requirements of web archives and detail its implementation and evaluation. The system now supports an archive of the Portuguese web, replacing the previous NFS-based storage servers. Experimental results showed that the elimination of duplicates can improve storage throughput: the web storage system outperformed NFS-based storage by 68% in read operations and by 50% in write operations.
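A minimal sketch of storage-level duplicate elimination, as a generic content-addressable scheme rather than the paper's specific architecture: key each stored body by a digest of its content, so a document fetched under many URLs is written once while every URL simply maps to the same digest.

```python
import hashlib

class DedupStore:
    """Content-addressable store: identical bodies share one stored copy."""

    def __init__(self):
        self.blobs = {}   # digest -> content
        self.index = {}   # url -> digest

    def put(self, url: str, content: bytes) -> bool:
        digest = hashlib.sha256(content).hexdigest()
        new = digest not in self.blobs
        if new:
            self.blobs[digest] = content   # stored only once per unique body
        self.index[url] = digest
        return new                         # False: a duplicate was skipped

    def get(self, url: str) -> bytes:
        return self.blobs[self.index[url]]

store = DedupStore()
store.put("http://example.org/a", b"<html>same page</html>")              # stored
store.put("http://example.org/a?session=42", b"<html>same page</html>")   # deduplicated
```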
Current web search engines focus on searching only the most recent snapshot of the web. In some cases, however, it would be desirable to search over collections that include many different crawls and versions of each page. One important example of such a collection is the Internet Archive, though there are many others. Since the data size of such an archive is multiple times that of a single snapshot, this presents us with significant performance challenges. Current engines use various techniques for index compression and optimized query execution, but these techniques do not exploit the significant similarities between different versions of a page, or between different pages.
Although coordination of concurrent objects is a fundamental aspect of object-oriented concurrent programming, there is little support for its specification and abstraction at the language level. This is a problem because coordination is often buried in the code of the coordinated objects, leading to a lack of abstraction and reuse. Here we present CoLaS, a coordination model and its implementation based on the notion of Coordination Groups. By clearly identifying and separating the coordination from the coordinated objects, CoLaS provides better abstraction and reuse of both the coordination and the coordinated objects. Moreover, CoLaS's high degree of dynamicity provides better support for the coordination of active objects.