Senior Editorial Director: Kristin Klinger
Director of Book Publications: Julia Mosemann
Editorial Director: Lindsay Johnston
Acquisitions Editor: Erika Carter
Development Editor: Michael Killian
Production Coordinator: Jamie Snavely
Typesetters: Michael Brehm and Milan Vracarich Jr.
Cover Design: Nick Newcomer
Published in the United States of America by
Information Science Reference (an imprint of IGI Global)
701 E. Chocolate Avenue
Hershey PA 17033
Tel: 717-533-8845
Fax: 717-533-8661
E-mail: cust@igi-global.com
Web site: http://www.igi-global.com/reference

Copyright © 2011 by IGI Global. All rights reserved. No part of this publication may be reproduced, stored or distributed in any form or by any means, electronic or mechanical, including photocopying, without written permission from the publisher. Product or company names used in this set are for identification purposes only. Inclusion of the names of the products or companies does not indicate a claim of ownership by IGI Global of the trademark or registered trademark.

Library of Congress Cataloging-in-Publication Data

Advanced database query systems : techniques, applications and technologies / Li Yan and Zongmin Ma, editors.
p. cm.
Includes bibliographical references and index.
Summary: "This book focuses on technologies and methodologies of database queries, XML and metadata queries, and applications of database query systems, aiming at providing a single account of technologies and practices in advanced database query systems"--Provided by publisher.
ISBN 978-1-60960-475-2 (hardcover) -- ISBN 978-1-60960-476-9 (ebook)
1. Databases. 2. Query languages (Computer science) 3. Querying (Computer science) I. Yan, Li, 1964- II. Ma, Zongmin, 1965-
QA76.9.D32 A39 2011
005.74--dc22
2010054423
British Cataloguing in Publication Data
A Cataloguing in Publication record for this book is available from the British Library.

All work contributed to this book is new, previously unpublished material. The views expressed in this book are those of the authors, but not necessarily of the publisher.
Table of Contents
Preface .......... xii
Acknowledgment .......... xvii

Section 1

Chapter 1
Automatic Categorization of Web Database Query Results .......... 1
Xiangfu Meng, Liaoning Technical University, China
Li Yan, Northeastern University, China
Z. M. Ma, Northeastern University, China

Chapter 2
Practical Approaches to the Many-Answer Problem .......... 28
Mounir Bechchi, LINA-University of Nantes, France
Guillaume Raschia, LINA-University of Nantes, France
Noureddine Mouaddib, LINA-University of Nantes, Morocco

Chapter 3
Concept-Oriented Query Language for Data Modeling and Analysis .......... 85
Alexandr Savinov, SAP Research Center Dresden, Germany

Chapter 4
Evaluating Top-k Skyline Queries Efficiently .......... 102
Marlene Goncalves, Universidad Simón Bolívar, Venezuela
María Esther Vidal, Universidad Simón Bolívar, Venezuela

Chapter 5
Remarks on a Fuzzy Approach to Flexible Database Querying, its Extension and Relation to Data Mining and Summarization .......... 118
Janusz Kacprzyk, Polish Academy of Sciences, Poland
Guy De Tré, Ghent University, Belgium
Sławomir Zadrożny, Polish Academy of Sciences, Poland

Chapter 6
Flexible Querying of Imperfect Temporal Metadata in Spatial Data Infrastructures .......... 140
Gloria Bordogna, CNR-IDPA, Italy
Francesco Bucci, CNR-IREA, Italy
Paola Carrara, CNR-IREA, Italy
Monica Pepe, CNR-IREA, Italy
Anna Rampini, CNR-IREA, Italy

Chapter 7
Fuzzy Querying Capability at Core of a RDBMS .......... 160
Ana Aguilera, Universidad de Carabobo, Venezuela
José Tomás Cadenas, Universidad Simón Bolívar, Venezuela
Leonid Tineo, Universidad Simón Bolívar, Venezuela

Chapter 8
An Extended Relational Model & SQL for Fuzzy Multidatabases .......... 185
Awadhesh Kumar Sharma, M.M.M. Engg College, India
A. Goswami, IIT Kharagpur, India
D. K. Gupta, IIT Kharagpur, India

Section 2

Chapter 9
Pattern-Based Schema Mapping and Query Answering in Peer-to-Peer XML Data Integration System .......... 221
Tadeusz Pankowski, Poznan University of Technology, Poland

Chapter 10
Deciding Query Entailment in Fuzzy OWL Lite Ontologies .......... 247
Jingwei Cheng, Northeastern University, China
Z. M. Ma, Northeastern University, China
Li Yan, Northeastern University, China

Chapter 11
Relational Techniques for Storing and Querying RDF Data: An Overview .......... 269
Sherif Sakr, University of New South Wales, Australia
Ghazi Al-Naymat, University of New South Wales, Australia

Section 3

Chapter 12
Making Query Coding in SQL Easier by Implementing the SQL Divide Keyword: An Experimental Query Rewriter in Java .......... 287
Eric Draken, University of Calgary, Canada
Shang Gao, University of Calgary, Canada
Reda Alhajj, University of Calgary, Canada & Global University, Lebanon

Chapter 13
Querying Graph Databases: An Overview .......... 304
Sherif Sakr, University of New South Wales, Australia
Ghazi Al-Naymat, University of New South Wales, Australia

Chapter 14
Querying Multimedia Data by Similarity in Relational DBMS .......... 323
Maria Camila Nardini Barioni, Federal University of ABC, Brazil
Daniel dos Santos Kaster, University of Londrina, Brazil
Humberto Luiz Razente, Federal University of ABC, Brazil
Agma Juci Machado Traina, University of São Paulo at São Carlos, Brazil
Caetano Traina Júnior, University of São Paulo at São Carlos, Brazil

Compilation of References .......... 360
About the Contributors .......... 378
Index .......... 386
Detailed Table of Contents

Preface .......... xii
Acknowledgment .......... xvii

Section 1

Chapter 1
Automatic Categorization of Web Database Query Results .......... 1
Xiangfu Meng, Liaoning Technical University, China
Li Yan, Northeastern University, China
Z. M. Ma, Northeastern University, China

This chapter proposes a novel categorization approach which consists of two steps. The first step analyzes the query history of all users in the system offline and generates a set of clusters over the tuples, where each cluster represents one type of user preference. When a user issues a query, the second step presents to the user a category tree over the clusters generated in the first step such that the user can easily select the subset of query results matching his needs. The chapter develops heuristic algorithms to compute the min-cost categorization. The efficiency and effectiveness of the proposed approach are demonstrated by experimental results.

Chapter 2
Practical Approaches to the Many-Answer Problem .......... 28
Mounir Bechchi, LINA-University of Nantes, France
Guillaume Raschia, LINA-University of Nantes, France
Noureddine Mouaddib, LINA-University of Nantes, Morocco

This chapter reviews and discusses several research efforts that have attempted to provide users with effective and efficient ways to access databases. The focus is on a simple but useful strategy for retrieving relevant answers accurately and quickly without being distracted by irrelevant ones. The chapter presents a very recent but promising approach to quickly provide users with structured and approximate representations of users' query results, a must-have for decision support systems. The underlying algorithm operates on pre-computed knowledge-based summaries of the queried data, instead of the raw data themselves.

Chapter 3
Concept-Oriented Query Language for Data Modeling and Analysis .......... 85
Alexandr Savinov, SAP Research Center Dresden, Germany

This chapter describes a novel query language, called the concept-oriented query language, and demonstrates how it can be used for data modeling and analysis. The query language is based on a novel construct, called concept, and two relations between concepts, inclusion and partial order. Concepts generalize conventional classes and are used for describing domain-specific identities. The inclusion relation generalizes inheritance and is used for describing hierarchical address spaces. Partial order among concepts is used to define two main operations: projection and de-projection. The chapter demonstrates how these constructs are used to solve typical tasks in data modeling and analysis such as logical navigation, multidimensional analysis, and inference.

Chapter 4
Evaluating Top-k Skyline Queries Efficiently .......... 102
Marlene Goncalves, Universidad Simón Bolívar, Venezuela
María Esther Vidal, Universidad Simón Bolívar, Venezuela

This chapter describes existing solutions and proposes to use the TKSI algorithm for the Top-k Skyline problem. TKSI reduces the search space by computing only a subset of the Skyline that is required to produce the top-k objects. In addition, the Skyline Frequency Metric is implemented to discriminate among the Skyline objects those that best meet the multidimensional criteria. The chapter empirically studies the quality of TKSI, and the experimental results show that TKSI may be able to speed up the computation of the Top-k Skyline by at least 50% with respect to the state-of-the-art solutions.

Chapter 5
Remarks on a Fuzzy Approach to Flexible Database Querying, its Extension and Relation to Data Mining and Summarization .......... 118
Janusz Kacprzyk, Polish Academy of Sciences, Poland
Guy De Tré, Ghent University, Belgium
Sławomir Zadrożny, Polish Academy of Sciences, Poland

This chapter is meant to revive the line of research in flexible querying languages based on the use of fuzzy logic. Details of a basic technique of flexible fuzzy querying are recalled and some of the newest developments in this area are discussed. Moreover, it is shown how other relevant tasks may be implemented in the framework of such a query interface. In particular, the chapter considers fuzzy queries with linguistic quantifiers and shows their intrinsic relation with linguistic data summarization. Moreover, so-called bipolar queries are mentioned and advocated as the next relevant breakthrough in flexible querying based on fuzzy logic and possibility theory.

Chapter 6
Flexible Querying of Imperfect Temporal Metadata in Spatial Data Infrastructures .......... 140
Gloria Bordogna, CNR-IDPA, Italy
Francesco Bucci, CNR-IREA, Italy
Paola Carrara, CNR-IREA, Italy
Monica Pepe, CNR-IREA, Italy
Anna Rampini, CNR-IREA, Italy

This chapter discusses the limitations of current temporal metadata in discovery services of spatial data infrastructures (SDIs) and proposes some solutions. A formal and operational method is presented to represent imperfect temporal metadata values and allow users to express flexible search conditions, i.e. tolerant to under-satisfaction. In doing so, discovery services can apply partial matching mechanisms between the desired metadata, expressed by the user, and the archived metadata: this would allow retrieving geodata in decreasing order of relevance to the user's needs, as usually occurs on the Web when using search engines. Finally, the chapter illustrates the proposal with an example.

Chapter 7
Fuzzy Querying Capability at Core of a RDBMS .......... 160
Ana Aguilera, Universidad de Carabobo, Venezuela
José Tomás Cadenas, Universidad Simón Bolívar, Venezuela
Leonid Tineo, Universidad Simón Bolívar, Venezuela

This chapter concentrates on incorporating fuzzy capabilities into an open source relational database management system (RDBMS). The fuzzy capabilities include connectors, modifiers, comparators, quantifiers, and queries. The extensions provide more flexible DDL and DML languages. The aim is to show the design and implementation details in the RDBMS PostgreSQL. For this, a fuzzy query processor and a fuzzy access mechanism are designed and implemented. The physical fuzzy relational operators are also defined and implemented. The flow of a fuzzy query through the different modules (parser, planner, optimizer, and executor) is shown. The chapter includes some experimental results to demonstrate the performance of the proposed solution. These results show that the extensions do not decrease the performance of the RDBMS.

Chapter 8
An Extended Relational Model & SQL for Fuzzy Multidatabases .......... 185
Awadhesh Kumar Sharma, M.M.M. Engg College, India
A. Goswami, IIT Kharagpur, India
D. K. Gupta, IIT Kharagpur, India

This chapter investigates the problems in the integration of fuzzy relational databases and extends the relational data model to support fuzzy multidatabases of type-2 that contain integrated fuzzy relational databases. The extended model, named the fuzzy tuple source (FTS) relational data model, is provided with a set of FTS relational operations to manipulate the global relations, called FTS relations, from such fuzzy multidatabases. The chapter proposes and implements a full set of FTS relational algebraic operations capable of manipulating an extensive set of fuzzy relational multidatabases of type-2 that include fuzzy data values in their instances. To facilitate the formulation of global fuzzy queries over FTS relations in such fuzzy multidatabases, an appropriate extension to SQL is made so as to obtain the fuzzy tuple source structured query language (FTS-SQL).

Section 2

Chapter 9
Pattern-Based Schema Mapping and Query Answering in Peer-to-Peer XML Data Integration System .......... 221
Tadeusz Pankowski, Poznan University of Technology, Poland

This chapter discusses a method for schema mapping and query reformulation in a P2P XML data integration system. The discussed formal approach enables us to specify schemas, schema constraints, schema mappings, and queries in a uniform and precise way. Based on this approach, the chapter defines some basic operations used for query reformulation and data merging, and proposes algorithms for the automatic generation of XQuery programs performing these operations. Some issues concerning query propagation strategies and merging modes are discussed, for when missing data is to be discovered in the P2P integration processes. The approach is implemented in the 6P2P system. Its general architecture is presented, and the way queries and answers are sent across the P2P environment is sketched.

Chapter 10
Deciding Query Entailment in Fuzzy OWL Lite Ontologies .......... 247
Jingwei Cheng, Northeastern University, China
Z. M. Ma, Northeastern University, China
Li Yan, Northeastern University, China

This chapter focuses on fuzzy (threshold) conjunctive queries over knowledge bases encoded in the fuzzy DL SHIF(D), the logical counterpart of the fuzzy OWL Lite language. The decidability of fuzzy query entailment in this setting is shown by providing a corresponding tableau-based algorithm. It is also shown that the data complexity for answering fuzzy conjunctive queries in fuzzy SHIF(D) is in coNP, as long as only simple roles occur in the query. Regarding combined complexity, the chapter proves a co3NExpTime upper bound in the size of the knowledge base and the query.

Chapter 11
Relational Techniques for Storing and Querying RDF Data: An Overview .......... 269
Sherif Sakr, University of New South Wales, Australia
Ghazi Al-Naymat, University of New South Wales, Australia

This chapter concentrates on using relational query processors to store and query RDF data. An overview of the different approaches is given, and these approaches are classified according to their storage and query evaluation strategies.

Section 3

Chapter 12
Making Query Coding in SQL Easier by Implementing the SQL Divide Keyword: An Experimental Query Rewriter in Java .......... 287
Eric Draken, University of Calgary, Canada
Shang Gao, University of Calgary, Canada
Reda Alhajj, University of Calgary, Canada & Global University, Lebanon

This chapter intends to provide a SQL expression equivalent to explicit relational algebra division (with a static divisor). The goal is to implement a SQL query rewriter in Java which takes as input a divide grammar and rewrites it to an efficient query using current SQL keywords.

Chapter 13
Querying Graph Databases: An Overview .......... 304
Sherif Sakr, University of New South Wales, Australia
Ghazi Al-Naymat, University of New South Wales, Australia

This chapter provides an overview of different techniques for indexing and querying graph databases. An overview of several proposals for graph query languages is also given, and a set of guidelines for future research directions is provided.

Chapter 14
Querying Multimedia Data by Similarity in Relational DBMS .......... 323
Maria Camila Nardini Barioni, Federal University of ABC, Brazil
Daniel dos Santos Kaster, University of Londrina, Brazil
Humberto Luiz Razente, Federal University of ABC, Brazil
Agma Juci Machado Traina, University of São Paulo at São Carlos, Brazil
Caetano Traina Júnior, University of São Paulo at São Carlos, Brazil

This chapter presents an already validated strategy that adds similarity queries to SQL, supporting a powerful set of similarity operators. The chapter also describes techniques to store and retrieve multimedia objects in an efficient way and shows existing DBMS alternatives to execute similarity queries over multimedia data.

Compilation of References .......... 360
About the Contributors .......... 378
Index .......... 386
Preface
Databases are designed to support the data storage, processing, and retrieval activities related to data management. The wide usage of databases in various applications has resulted in an enormous wealth of data, which populates various types of databases around the world. One can find many types of database systems, for example, relational databases, object-oriented databases, object-relational databases, deductive databases, parallel databases, distributed databases, multidatabase systems, Web databases, XML databases, multimedia databases, temporal/spatial databases, spatiotemporal databases, and uncertain databases. As a result, databases have become the repositories of large volumes of data.

Database query is closely related to data management. Database query processing is the procedure by which database management systems (DBMSs) obtain the information needed by the users from the databases according to the users' requirements, organize it, and then provide it to the users. It is critical to deal with this enormity and retrieve the worthwhile information for effective problem solving and decision making. This is especially true when a variety of database types, data types, and user requirements, as well as large volumes of data, are involved. The techniques of database queries are challenging today's database systems and promoting their evolution. There is no doubt that database query systems play an important role in data management, and data management requires database query support.

The research and development of information queries over a variety of databases are receiving increasing attention. By means of query technology, large volumes of information in databases can be retrieved, and Information Systems are thereby built on databases to support various problem solving and decision making activities. Database queries are thus a field that must be investigated by academic researchers together with developers and users from both the database and industry areas.

This book focuses on the following issues of advanced database query systems: the technologies and methodologies of database queries, XML and metadata queries, and applications of database query systems, aiming at providing a single account of technologies and practices in advanced database query systems. The objective of the book is to provide state-of-the-art information to academics, researchers, and industry practitioners who are involved or interested in the study, use, design, and development of advanced and emerging database queries, with the ultimate aim of empowering individuals and organizations to build competencies for exploiting the opportunities of the data and knowledge society. This book presents the latest research and application results in advanced database query systems. The different chapters in the book have been contributed by different authors and provide possible solutions for the different types of technological problems concerning database queries.

This book, which consists of fourteen chapters, is organized into three major sections. The first section discusses the technologies and methodologies of database queries, over the first eight chapters. The
next three chapters, covering XML and metadata queries, comprise the second section. The third section, containing the final three chapters, focuses on the design and applications of database query systems.

First of all, we take a look at the technologies and methodologies of database queries.

Web database queries are often exploratory. Users often find that their queries return too many answers, and many of them may be irrelevant. Based on different kinds of user preferences, Xiangfu Meng, Li Yan, and Z. M. Ma propose a novel categorization approach which consists of two steps. The first step analyzes the query history of all users in the system offline and generates a set of clusters over the tuples, where each cluster represents one type of user preference. When a user issues a query, the second step presents to the user a category tree over the clusters generated in the first step such that the user can easily select the subset of query results matching his needs. The problem of constructing a category tree is a cost optimization problem, and the authors develop heuristic algorithms to compute the min-cost categorization. The efficiency and effectiveness of their approach are demonstrated by experimental results.

Database systems are increasingly used for interactive and exploratory data retrieval. In such retrievals, user queries often result in too many answers, so users waste significant time and effort sifting and sorting through these answers to find the relevant ones. Mounir Bechchi, Guillaume Raschia, and Noureddine Mouaddib first review and discuss several research efforts that have attempted to provide users with effective and efficient ways to access databases. Then, they focus on a simple but useful strategy for retrieving relevant answers accurately and quickly without being distracted by irrelevant ones. They present a very recent but promising approach to quickly provide users with structured and approximate representations of users' query results, a must-have for decision support systems. The underlying algorithm operates on pre-computed knowledge-based summaries of the queried data, instead of the raw data themselves. Thus, this first-citizen data structure is also presented.

Alexandr Savinov describes a novel query language, called the concept-oriented query language (COQL), and demonstrates how it can be used for data modeling and analysis. The query language is based on a novel construct, called concept, and two relations between concepts, inclusion and partial order. Concepts generalize conventional classes and are used for describing domain-specific identities. The inclusion relation generalizes inheritance and is used for describing hierarchical address spaces. Partial order among concepts is used to define two main operations: projection and de-projection. Savinov demonstrates how these constructs are used to solve typical tasks in data modeling and analysis such as logical navigation, multidimensional analysis, and inference.

Criteria that induce a Skyline naturally represent users' preference conditions, useful to discard irrelevant data in large datasets. However, in the presence of high-dimensional Skyline spaces, the size of the Skyline can still be very large. To identify the best k points among the Skyline, the Top-k Skyline approach has been proposed. Marlene Goncalves and María Esther Vidal describe existing solutions and propose to use the TKSI algorithm for the Top-k Skyline problem. TKSI reduces the search space by computing only a subset of the Skyline that is required to produce the top-k objects. In addition, the Skyline Frequency Metric is implemented to discriminate among the Skyline objects those that best meet the multidimensional criteria. They empirically study the quality of TKSI, and their experimental results show that TKSI may be able to speed up the computation of the Top-k Skyline by at least 50% with respect to the state-of-the-art solutions.

Janusz Kacprzyk, Guy De Tré, and Sławomir Zadrożny briefly present the concept of, a rationale for, and various approaches to the use of fuzzy logic in flexible querying. They first discuss some historical developments, and then the main issues related to fuzzy querying. Next, they concentrate on fuzzy queries
with linguistic quantifiers, and discuss in more detail their FQUERY for Access fuzzy querying system. They indicate not only the straightforward power of that fuzzy querying system but also its great potential as a tool to implement linguistic data summaries that may provide an ultimately human-consistent way of data mining and data summarization. They also briefly mention the concept of bipolar queries, which may reflect positive and negative preferences of the user, and may be a breakthrough in fuzzy querying. In the context of fuzzy querying and linguistic summarization, they mention the considerable potential of their recent proposals to explicitly use in linguistic data summarization some elements of natural language generation (NLG), and some related elements of Halliday's systemic functional linguistics (SFL). They argue that this may be a promising direction for future research.

Gloria Bordogna et al. discuss the limitations of current temporal metadata in discovery services of Spatial Data Infrastructures (SDIs) and propose some solutions. They present their proposal of a formal and operational method to represent imperfect temporal metadata values and allow users to express flexible search conditions, i.e. tolerant to under-satisfaction. In doing so, discovery services can apply partial matching mechanisms between the desired metadata, expressed by the user, and the archived metadata: this would allow retrieving geodata in decreasing order of relevance to the user's needs, as usually occurs on the Web when using search engines. The proposal is finally illustrated with an example.

Ana Aguilera, José Tomás Cadenas, and Leonid Tineo concentrate on incorporating fuzzy capabilities into an open source relational database management system (RDBMS). The fuzzy capabilities include connectors, modifiers, comparators, quantifiers, and queries. The extensions provide more flexible DDL and DML languages. The aim is to show the design and implementation details in the RDBMS PostgreSQL. For this, they design and implement a fuzzy query processor and a fuzzy access mechanism. They also define and implement the physical fuzzy relational operators. They show the flow of a fuzzy query through the different modules (parser, planner, optimizer, and executor). They include some experimental results to demonstrate the performance of the proposed solution. These results show that the extensions do not decrease the performance of the RDBMS.

Awadhesh Kumar Sharma, A. Goswami, and D. K. Gupta investigate the problems in the integration of fuzzy relational databases and extend the relational data model to support fuzzy multidatabases of type-2 that contain integrated fuzzy relational databases. The extended model is given the name fuzzy tuple source (FTS) relational data model, and it is provided with a set of FTS relational operations to manipulate the global relations, called FTS relations, from such fuzzy multidatabases. They propose and implement a full set of FTS relational algebraic operations capable of manipulating an extensive set of fuzzy relational multidatabases of type-2 that include fuzzy data values in their instances. To facilitate the formulation of global fuzzy queries over FTS relations in such fuzzy multidatabases, an appropriate extension to SQL is made so as to obtain the fuzzy tuple source structured query language (FTS-SQL).

The second section deals with the issues of XML and metadata queries.

Tadeusz Pankowski addresses the problem of data integration in a P2P environment, where each peer stores the schema of its local data, mappings between the schemas, and some schema constraints. The goal of the integration is to answer queries formulated against a chosen peer. The answer must consist of data stored in the queried peer as well as data of its direct and indirect partners. Pankowski focuses on defining and using mappings, schema constraints, query propagation across the P2P system, and query answering in such a scenario. Schemas, mappings, constraints (functional dependencies), and queries are all expressed using a unified approach based on tree-pattern formulas. He discusses how functional dependencies can be exploited to increase the information content of answers (by discovering missing values)
and to control merging operations and propagation strategies. He proposes algorithms for translating high-level specifications of mappings and queries into XQuery programs, and shows how the discussed method has been implemented in the SixP2P (or 6P2P) system.

Significant research efforts in the Semantic Web community have recently been directed toward the representation of, and reasoning with, fuzzy ontologies. Description logics (DLs) are the logical foundations of standard Web ontology languages. Conjunctive queries are deemed an expressive reasoning service for DLs. Jingwei Cheng, Z. M. Ma, and Li Yan focus on fuzzy (threshold) conjunctive queries over knowledge bases encoded in the fuzzy DL SHIF(D), the logical counterpart of the fuzzy OWL Lite language. They show the decidability of fuzzy query entailment in this setting by providing a corresponding tableau-based algorithm. They also show that the data complexity for answering fuzzy conjunctive queries in fuzzy SHIF(D) is in coNP, as long as only simple roles occur in the query. Regarding combined complexity, they prove a co3NExpTime upper bound in the size of the knowledge base and the query.

The Resource Description Framework (RDF) is a flexible model for representing information about resources in the Web. With the increasing amount of RDF data becoming available, efficient and scalable management of RDF data has become a fundamental challenge to achieving the Semantic Web vision. The RDF model has attracted attention in the database community, and many researchers have proposed different solutions to store and query RDF data efficiently. Sherif Sakr and Ghazi Al-Naymat concentrate on using relational query processors to store and query RDF data. They give an overview of the different approaches and classify these approaches according to their storage and query evaluation strategies.

In the third section, we see the design and application aspects of database query systems.

Relational Algebra (RA) and the Structured Query Language (SQL) are supposed to have a bijective relationship by having the same expressive power. That is, each operation in SQL can be mapped to one RA equivalent and vice versa. RA has an explicit relational division symbol (÷), whereas SQL does not have a corresponding explicit division keyword. Division is implemented using a combination of four core operations, namely cross product, difference, selection, and projection. The work described by Eric Draken, Shang Gao, and Reda Alhajj is intended to provide a SQL expression equivalent to explicit relational algebra division (with a static divisor). The goal is to implement a SQL query rewriter in Java which takes as input a divide grammar and rewrites it to an efficient query using current SQL keywords. The developed approach could be adapted as a front-end or as a wrapper to an existing SQL query system.

Recently, there has been a lot of interest in the application of graphs in different domains. Graphs have been widely used for data modeling in different application domains such as chemical compounds, protein networks, social networks, and the Semantic Web. Given a query graph, the task of retrieving related graphs as a result of the query from a large graph database is a key issue in any graph-based application. This has raised a crucial need for efficient graph indexing and querying techniques. Sherif Sakr and Ghazi Al-Naymat provide an overview of different techniques for indexing and querying graph databases. They also give an overview of several proposals for graph query languages. Finally, they provide a set of guidelines for future research directions.

Multimedia objects, such as images, audio, and video, do not present a total ordering relationship, so the relational operators are not suitable to compare them. Therefore, similarity queries are the most useful, and often the only, types of queries adequate to search multimedia objects stored in a database. Unfortunately, the ubiquitous query language SQL, the most widely employed language in Database Management Systems (DBMSs), does not provide effective support for similarity queries. Maria Camila
Nardini Barioni et al. present an already validated strategy that adds similarity queries to SQL, supporting a powerful set of similarity operators. They also describe techniques to store and retrieve multimedia objects in an efficient way and show existing DBMS alternatives to execute similarity queries over multimedia data.

Li Yan
Northeastern University, China

Zongmin Ma
Northeastern University, China
Acknowledgment
The editors wish to thank all of the authors for their insights and excellent contributions to this book, and would like to acknowledge the help of all involved in the collation and review process of the book, without whose support the project could not have been satisfactorily completed. Most of the authors of chapters included in this book also served as referees for chapters written by other authors. Thanks go to all those who provided constructive and comprehensive reviews.

A further special note of thanks goes to all the staff at IGI Global, whose contributions throughout the whole process, from inception of the initial idea to final publication, have been invaluable. Special thanks also go to the publishing team at IGI Global. This book would not have been possible without the ongoing professional support from IGI Global.

The idea of editing this volume stems from the initial research work that the editors did in the past several years. The research work of the editors was supported by the National Natural Science Foundation of China (60873010 and 61073139), the Fundamental Research Funds for the Central Universities (N090504005, N100604017 and N090604012), and the Program for New Century Excellent Talents in University (NCET-05-0288).

Li Yan
Northeastern University, China

Zongmin Ma
Northeastern University, China

June 2010
Section 1
Chapter 1
Automatic Categorization of Web Database Query Results

Xiangfu Meng, Liaoning Technical University, China
Li Yan, Northeastern University, China
Z. M. Ma, Northeastern University, China
ABSTRACT
Web database queries are often exploratory. Users often find that their queries return too many answers, many of which may be irrelevant. Based on different kinds of user preferences, this chapter proposes a novel categorization approach which consists of two steps. The first step analyzes the query history of all users in the system offline and generates a set of clusters over the tuples, where each cluster represents one type of user preference. When a user issues a query, the second step presents to the user a category tree over the clusters generated in the first step such that the user can easily select the subset of query results matching his needs. The problem of constructing a category tree is a cost optimization problem, and heuristic algorithms are developed to compute the min-cost categorization. The efficiency and effectiveness of our approach are demonstrated by experimental results.
INTRODUCTION
As the Internet becomes ubiquitous, many people search for their favorite cars, houses, stocks, etc. over Web databases. However, Web database queries are often exploratory. Users often find that their queries return too many answers, a situation commonly referred to as information overload. For
DOI: 10.4018/978-1-60960-475-2.ch001
example, when a user submits a query to the MSN House&Home Web site to search for a house located in Seattle with a price between $200,000 and $300,000, 1,256 tuples are returned. Information overload makes it hard for the user to separate the interesting items from the uninteresting ones, and thereby leads to a huge waste of users' time and effort. In such a situation, the user would pose a broad query in the beginning to avoid excluding potentially interesting results, and then iteratively refine the query until a few answers matching his preferences are returned. However, this iterative procedure is time-consuming, and many users give up before they reach the final stage.

In order to resolve the problem of information overload, two types of solutions have been proposed. The first type categorizes the query results into a category tree (Chakrabarti, Chaudhuri & Hwang, 2004; Chen & Li, 2007), and the second type ranks the results (Agrawal, Chaudhuri, Das & Gionis, 2003; Agrawal, Rantzau & Terzi, 2006; Bruno, Gravano & Marian, 2002; Chaudhuri, Das, Hristidis & Weikum, 2004; Das, Hristidis, Kapoor & Sudarshan, 2006). The success of both approaches depends on the utilization of user preferences. However, these approaches assume that all users have the same preferences, whereas in real life different users often have different preferences. Let us look at the following example.

Example 1. Consider a real estate searching Web site. Figure 1 and Figure 2 respectively show a fraction of the category trees generated by the Greedy method (Chakrabarti, Chaudhuri & Hwang, 2004) and the C4.5-Categorization method (Chen & Li, 2007) over 214 houses returned by a query with the condition "Price between 250,000 and 350,000 AND City = Seattle". Each tree node specifies a range or equality condition on an attribute, and the number in the parentheses is the number of tuples satisfying all conditions from the root to the current node. Users can use this tree to select the houses they are interested in. Consider three users U1, U2, and U3. Assume that U1 prefers houses with large square footage, U2 prefers houses with water views, and U3 prefers both water views and the Burien living area. The Greedy method assumes that all users have the same preferences. As a result, the attributes Livingarea and Schooldistrict are placed at the first two levels of the tree because more users are concerned with Livingarea and Schooldistrict than with other attributes. However, there may be some users (such as U2 and U3) who want to first visit the large square footage and water view houses. They then have to visit many nodes if they follow the tree built in Figure 1. Considering the diversity of user preferences and the cost of visiting both intermediate nodes and leaf nodes, the C4.5-Categorization method takes advantage of the C4.5 algorithm to create the navigational tree. But the created category tree (Figure 2) has two drawbacks: (i) the tuples under the intermediate nodes cannot be explored by the users, i.e., users can only access the tuples under the leaf nodes but cannot examine the tuples in the intermediate nodes; (ii) the cost of visiting the tuples of an intermediate node is not considered if the user chooses to explore the tuples of that node.

User preferences are often difficult to obtain because users do not want to spend extra effort to specify their preferences. Thus there are two major challenges in addressing the diversity issue of user preferences: (i) how to summarize different kinds of user preferences from the behavior of all users already in the system, and (ii) how to categorize or rank the query results according to the specific user's preferences. Query history has been widely applied to infer the preferences of all users in the system (Agrawal, Chaudhuri, Das & Gionis, 2003; Chaudhuri, Das, Hristidis & Weikum, 2004; Chakrabarti, Chaudhuri & Hwang, 2004; Das, Hristidis, Kapoor & Sudarshan, 2006).

In this chapter, we present techniques to automatically categorize the results of user queries on Web databases in order to reduce information overload. We propose a two-step approach to address both challenges for the categorization case. The first step analyzes the query history of all users already in the system offline and then generates a set of clusters over the data. Each cluster corresponds to one type of user preference and is associated with a probability that users may be interested in the cluster. We assume that an individual user's preference can be represented as a subset of these clusters. When a specific user submits a query, the second step first computes the similarity between the query and the representative queries in the query clusters, and then the data clusters the user may be interested in can be inferred from the query. Next, the set of data clusters generated in the first step is intersected with the query answers, and a labeled hierarchical category structure is generated automatically based on the contents of the tuples in the answer set. Consequently, a category tree is automatically constructed over these intersected clusters on the fly. This tree is finally presented to the user.

This chapter presents a domain-independent approach to addressing the information overload problem. The contributions are summarized as follows:
• We propose a clustering approach to cluster queries and summarize the preferences of all users in the system using the query history. This approach uses query pruning, pre-computation, and query clustering to deal with large query histories and large data sets.
• We propose a cost-based algorithm to construct a category tree over the clusters pre-formulated in the offline processing phase. Unlike the existing categorization and decision tree construction approaches, our approach shows tuples for intermediate nodes and considers the cost for users to visit both intermediate nodes and leaves.
The rest of this chapter is organized as follows. Section 2 reviews related work. Section 3 formally defines some notions. Section 4 describes the query and tuple clustering method. Section 5 proposes the algorithm for category tree construction. Section 6 shows the experimental results. The chapter is concluded in Section 7.
RELATED WORK
Two kinds of automatic categorization approaches have been proposed, by Chakrabarti et al. (2004) and Chen and Li (2007), respectively. Chakrabarti et al. proposed a greedy algorithm to construct a category tree. This algorithm uses the query history of all users in the system to infer an overall user preference as the probabilities that users are interested in each attribute. Taking advantage of the C4.5 decision tree construction algorithm, Chen and Li (2007) proposed a two-step solution which first clusters the user query history and then constructs the navigation tree for resolving the user's personalized query. We make use of some of these ideas, but enhance the category tree with the feature of showing tuples in the intermediate nodes, and focus on how the clusters of the query history and the cost of visiting both intermediate nodes and leaves impact the categorization.

For providing query personalization, several approaches have been proposed to define a user profile for each user and use the profile to decide his preferences (Koutrika & Ioannidis, 2004; Kießling, 2002). As Chen and Li (2007) pointed out, however, in real life user profiles may not be available because users do not want to or cannot specify their preferences (if they could, they could form the appropriate query, and there would be no need for either ranking or categorizing). The profile may be derived from the query history of a certain user, but this method does not work if the user is new to the system, which is exactly when the user needs help.

There has been a rich body of work on categorizing text documents (Dhillon, Mallela & Kumar, 2002; Joachims, 1998; Koller & Sahami, 1997) and Web search results (Liu, Yu & Meng, 2002; Zeng, He, Chen, Ma & Ma, 2004). But categorizing relational data presents unique challenges and opportunities. First, relational data contains numerical values, while text categorization methods treat documents as bags of words. This chapter tries to minimize the overhead for users to navigate the generated tree (it will be defined in Section 3), which is not considered in the existing text categorization methods.

There has also been a rich body of work on information visualization techniques (Card, MacKinlay & Shneiderman, 1999). Two popular techniques are the dynamic query slider (Ahlberg & Shneiderman, 1994) and the brushing histogram (Tweedie, Spence, Williams & Bhogal, 1994). The former allows users to visualize dynamic query results by using sliders to represent range search conditions, and the latter employs interactive histograms to represent each attribute and helps users explore correlations between attributes. Note that they do not take query history into account. Furthermore, information visualization techniques require users to specify what information to visualize (e.g., by setting the slider or selecting histogram buckets). Since our approach generates the information to visualize, i.e., the category tree, it is complementary to visualization techniques. If a leaf of the tree still contains many tuples, for example, a query slider or a brushing histogram can be used to further narrow down the scope.

Concerning ranked retrieval from databases, user relevance feedback (Rui, Huang & Mehrotra, 1997; Wu, Faloutsos, Sycara & Payne, 2000) is employed to learn the similarity between a result tuple and the query, which is used to rank the query results in relational multimedia databases. The SQL query language has been extended to allow the user to specify the ranking function according to their preference for the attributes (Kießling, 2002; Roussos, Stavrakas & Pavlaki, 2005). Also, the importance scores of result tuples (Agrawal, Chaudhuri, Das & Gionis, 2003; Chaudhuri, Das, Hristidis & Weikum, 2004; Geerts, Mannila & Terzi, 2004) can be extracted automatically by analyzing past workloads, which reveal what users are looking for and what they consider important. According to these scores, the tuples can be ranked. Ranking is complementary to categorization. We can use ranking in addition to our techniques (e.g., to rank the tuples stored in the intermediate nodes and leaves). However, most existing work does not consider the diversity issue of user preferences. In contrast, we focus on addressing the diversity issue of user preferences for the categorization approach.

There has also been a lot of work on information retrieval (Card, MacKinlay & Shneiderman, 1999; Shen, Tan & Zhai, 2005; Finkelstein & Gabrilovich, 2001; Joachims, 2006; Sugiyama & Hatano, 2004) using query history or other implicit feedback. However, this work focuses on searching text documents, while this chapter focuses on searching relational data. In addition, these studies typically rank query results, while this chapter categorizes them.

Of course, one could use existing hierarchical clustering techniques (Mitchell, 1997) to create the category tree. But the generated trees are not easy for users to navigate. For example, how do we describe the tuples contained in a node? We could use a representative tuple, but such a tuple may contain many attributes and is difficult for users to read. On the contrary, the category tree used in this chapter is easy to understand because each node just uses one attribute.
BASICS OF CATEGORIZATION
This section first introduces the query history, and then defines the category tree and the category cost. The categorical space and exploration model are finally described.
Query History
Consider a database relation D with n tuples, D = {t1, ..., tn}, with schema R(A1, ..., Am). Let Dom(Ai) represent the active domain of attribute Ai. Let H be a query history {(Q1, U1, F1), ..., (Qk, Uk, Fk)} in chronological order, where Qi is a query, Ui is a session ID (a session starts when a user connects to the database and ends when the user disconnects), and Fi is the importance weight of the query, evaluated by the frequency of the query in H. We assume that the queries in the same session are asked by the same user, which will be used later to prune queries. The query history can be collected using the query logs of commercial database systems.

We assume that all queries contain only point or range conditions, and that a query has the form Q = ∧_{i=1..m} (Ai θi ai), where ai ∈ Dom(Ai) and θi ∈ {>, <, =, ≤, ≥, between, in}. Note that if θi is the operator "between", then Ai θi ai has the form "Ai between ai1 and ai2", i.e., ai1 ≤ Ai ≤ ai2, where ai1, ai2 ∈ Dom(Ai).

D can be partitioned into a set of disjoint preference-based clusters C = {C1, ..., Cq}, where each cluster Cj corresponds to one type of user preference. Each Cj is associated with a probability Pj that users are interested in Cj. This set of clusters over D is inferred from the query history. We assume that the dataset D is fixed; in practice, however, D may be modified from time to time. For the purposes of this chapter, we assume that the clusters are regenerated periodically (e.g., once a month) as the set of queries evolves and the database is updated.
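To make the query model concrete, here is a minimal sketch of a conjunctive point/range query and a tuple-matching test. It is illustrative only; the Condition and Query names and the matches helper are ours, not the chapter's code.

```python
from dataclasses import dataclass
from typing import Any, Dict, List

@dataclass
class Condition:
    """One atomic predicate Ai θi ai of a point or range query."""
    attribute: str
    op: str      # one of '>', '<', '=', '<=', '>=', 'between', 'in'
    value: Any   # a scalar, a (low, high) pair for 'between', or a set for 'in'

@dataclass
class Query:
    """A conjunctive query Q = AND_i (Ai θi ai)."""
    conditions: List[Condition]

    def matches(self, tuple_: Dict[str, Any]) -> bool:
        for c in self.conditions:
            v = tuple_[c.attribute]
            if c.op == '=' and v != c.value:
                return False
            if c.op == '>' and not v > c.value:
                return False
            if c.op == '<' and not v < c.value:
                return False
            if c.op == '>=' and not v >= c.value:
                return False
            if c.op == '<=' and not v <= c.value:
                return False
            if c.op == 'between' and not (c.value[0] <= v <= c.value[1]):
                return False
            if c.op == 'in' and v not in c.value:
                return False
        return True

# Example from the introduction: Price between 200,000 and 300,000 AND City = Seattle.
q = Query([Condition('Price', 'between', (200_000, 300_000)),
           Condition('City', '=', 'Seattle')])
```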
Exploration Model
Given a category tree T over the query results, the user starts the exploration by exploring the root node. Suppose that she has decided to explore a node v. If v is an intermediate node, she non-deterministically (i.e., not known in advance) chooses one of two options:

• Option ShowTuples: Browse through the tuples in N(v). Note that the user needs to examine all tuples in N(v) to make sure that she finds every tuple relevant to her.
• Option ShowCat: Examine the labels of all the n subcategories of v, exploring the ones relevant to her and ignoring the rest. More specifically, she examines the label of each subcategory vi of v, starting from the first subcategory, and non-deterministically chooses to either explore it or ignore it. If she chooses to ignore vi, she simply proceeds and examines the next label (of vi+1). If she chooses to explore vi, she does so recursively based on the same exploration model, i.e., by choosing either ShowTuples or ShowCat if it is an intermediate node, or by choosing ShowTuples if it is a leaf node. After she finishes the exploration of vi, she goes ahead and examines the label of the next subcategory of v (of vi+1). When the user reaches the end of the subcategory list, she is done.

Note that we assume that the user examines the subcategories in the order they appear under v; it can be from top to bottom or from left to right depending on how the tree is rendered by the user interface.
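As a rough illustration of this exploration model, the sketch below counts how many items (subcategory labels plus tuples) a user examines. The Node structure and the two choice callbacks, which stand in for the user's non-deterministic decisions, are our own assumptions, not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Node:
    label: str
    tuples: List[dict]                        # N(v): the tuples contained in v
    children: List["Node"] = field(default_factory=list)

def explore(v: Node,
            choose_show_tuples: Callable[[Node], bool],
            choose_explore: Callable[[Node], bool]) -> int:
    """Items examined when exploring v. A leaf only offers ShowTuples; an
    intermediate node offers ShowTuples (browse all of N(v)) or ShowCat
    (scan every child label, recursing into the children the user picks)."""
    if not v.children or choose_show_tuples(v):
        return len(v.tuples)
    examined = len(v.children)                # ShowCat: one label per subcategory
    for child in v.children:
        if choose_explore(child):             # the user's non-deterministic choice
            examined += explore(child, choose_show_tuples, choose_explore)
    return examined

# E.g., a user who never browses intermediate tuples and only opens categories
# whose label mentions "Waterfront":
# cost = explore(root, lambda v: False, lambda v: "Waterfront" in v.label)
```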
Category Cost
We assume that a user visits T in a top-down fashion, and stops at a node (intermediate node or leaf node) that contains the tuples that she is interested in. Let v be a node (intermediate node or leaf node) of T with N(v) tuples, and let Cj be a cluster in C. Cj ⊆ v denotes that v contains tuples in Cj. Anc(v) denotes the set of ancestors of v, including v itself but excluding the root. Sib(v) denotes the set of nodes at the same level as the node v, including v itself. Let K1 and K2 represent the weights of visiting a tuple in a node and visiting an intermediate tree node, respectively. Let Pj be the probability that users will be interested in cluster Cj, and let Pst be the probability that the user goes for option ShowTuples for an intermediate node v given that she explores v. The category cost is defined as follows.

Definition 2 (category cost). The category cost is

    Cost(T, C) = Σ_{v∈Node(T)} Σ_{Cj⊆v} Pj · ( K1·|N(v)| + K2 · Σ_{vi∈Anc(v)} ( |Sib(vi)| + Σ_{j=1..|Sib(vi)|} Pst(N(vj)) ) )    (1)

The category cost of a leaf node v consists of three terms: the cost of visiting the tuples in the leaf node v, the cost of visiting intermediate nodes, and the cost of visiting the tuples of intermediate nodes if the user chooses to explore them. Users need to examine the labels of all sibling nodes to select a node on the path from the root to v; thus users have to visit Σ_{vi∈Anc(v)} |Sib(vi)| intermediate tree nodes. Users may also like to examine the tuples of some sibling nodes on the path from the root to v; thus users have to visit Σ_{vi∈Anc(v)} Σ_{j=1..|Sib(vi)|} Pst(N(vj)) tuples of intermediate tree nodes. When users reach the node v which they would like to explore, they have to look at the N(v) tuples in v. Pst is the probability that the user explores v using ShowTuples: Pst = N(Av)/N, where N(Av) denotes the number of queries in the query history that contain a selection condition on the attribute A of node v, and N is the total number of queries in the query history. Definition 2 computes the expected cost over all clusters and nodes.
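As a rough executable reading of Equation (1) as reconstructed above, the sketch below walks a category tree and accumulates the cluster-weighted cost. The Node fields, the helper names, and the reading of Sib(vi) as the children of vi's parent are our assumptions for illustration, not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set

@dataclass
class Node:
    attr: str                                          # attribute labeling this node
    n_tuples: int                                      # |N(v)|
    clusters: Set[int] = field(default_factory=set)    # j such that Cj ⊆ v
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)

def p_st(v: Node, attr_counts: Dict[str, int], n_queries: int) -> float:
    """Pst = N(Av)/N: fraction of history queries conditioning on v's attribute."""
    return attr_counts.get(v.attr, 0) / n_queries

def category_cost(root: Node, P: Dict[int, float],
                  attr_counts: Dict[str, int], n_queries: int,
                  k1: float = 1.0, k2: float = 1.0) -> float:
    def nodes(v):
        yield v
        for c in v.children:
            yield from nodes(c)
    total = 0.0
    for v in nodes(root):
        # Walk Anc(v): v and its ancestors, excluding the root itself.
        path_cost, a = 0.0, v
        while a is not root:
            sibs = a.parent.children           # Sib(a), read here as a's parent's children
            path_cost += len(sibs) + sum(p_st(s, attr_counts, n_queries) for s in sibs)
            a = a.parent
        for j in v.clusters:                   # sum over clusters Cj contained in v
            total += P[j] * (k1 * v.n_tuples + k2 * path_cost)
    return total
```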
Data Clustering
We generate preference-based clusters as follows. We first define a binary relation R over tuples such that (ri, rj) ∈ R if and only if the two tuples ri and rj appear in the results of exactly the same set of queries in H. If (ri, rj) ∈ R, then according to the query history, ri and rj are not distinguishable, because each user that requests ri also requests rj and vice versa. Clearly, R is reflexive, symmetric, and transitive. Thus R is an equivalence relation, and it partitions D into equivalence classes {C1, ..., Cq}, where tuples equivalent to each other are put into the same class. Those tuples not selected by any query also form a cluster, associated with zero probability (since no users are interested in them). Thus, we can define the data clustering problem as follows.

Problem 1. Given a database D and a query history H, find a set of disjoint clusters C = {C1, ..., Cq} such that for any tuples ri and rj ∈ Cl, 1 ≤ l ≤ q, (ri, rj) ∈ R, and for any tuples ri and rj not in the same cluster, (ri, rj) ∉ R.

Since the query history H may contain many queries, we need to cluster the queries in H first, and then cluster the tuples based on the clusters of the query history. We propose the algorithm for query history and data clustering in the next section.
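As an illustration of the equivalence-class construction itself (the function name and the toy predicates below are our own, not the chapter's code), each tuple can be keyed on the exact set of history queries that select it:

```python
from collections import defaultdict
from typing import Callable, Dict, FrozenSet, List

def preference_clusters(tuples: List[dict],
                        history: List[Callable[[dict], bool]]
                        ) -> Dict[FrozenSet[int], List[dict]]:
    """Partition D into the equivalence classes of R: two tuples are related
    iff they appear in the results of exactly the same set of history queries."""
    groups: Dict[FrozenSet[int], List[dict]] = defaultdict(list)
    for t in tuples:
        signature = frozenset(i for i, q in enumerate(history) if q(t))
        groups[signature].append(t)
    return groups

# Tuples selected by no query end up under frozenset(), the cluster whose
# interest probability is zero.
houses = [{"Price": 250_000, "City": "Seattle"},
          {"Price": 280_000, "City": "Seattle"},
          {"Price": 400_000, "City": "Burien"}]
history = [lambda t: t["City"] == "Seattle",
           lambda t: t["Price"] < 300_000]
clusters = preference_clusters(houses, history)   # yields two clusters here
```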
Query Clustering
Since the query history may contain a great number of queries, we cluster similar queries into the same group and find representative queries.
Now, we can define the similarity between Q1 and Q2 using their vector representations VQ1 and VQ2 as follows:

Sim(Q_1, Q_2) = \cos(V_{Q_1}, V_{Q_2}) = \frac{V_{Q_1} \cdot V_{Q_2}}{|V_{Q_1}|\,|V_{Q_2}|}    (2)
In order to quantify how well a query Q1 is represented by another query Q2, we need to define a distance measure between two queries. Based on the similarity mentioned above, the distance between Q1 and Q2 can be defined as

d(Q_1, Q_2) = 1 - Sim(Q_1, Q_2)    (3)
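Assuming each query has already been translated into a numeric vector over its ⟨attribute, value⟩ pairs, Equations (2) and (3) amount to the following sketch:

    import math

    def sim(v1, v2):
        # Equation (2): cosine of the angle between the two query vectors.
        dot = sum(a * b for a, b in zip(v1, v2))
        norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
        return dot / norm if norm else 0.0

    def dist(v1, v2):
        # Equation (3): d(Q1, Q2) = 1 - Sim(Q1, Q2).
        return 1.0 - sim(v1, v2)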
Based on the definitions above, the query clustering problem can be defined. Let H be the set of m queries in the query history, H = {Q1, …, Qm}. Then we need to find a set of k queries H_k = \{\hat{Q}_1, \ldots, \hat{Q}_k\} (k < m) such that

cost(H_k) = \sum_{Q \in H} d(Q, H_k)    (4)

is minimized. The distance of a query Qi from a set of queries H is defined as

d(Q_i, H) = \min_{Q_j \in H} d(Q_i, Q_j)    (5)
We call the queries in the set Hk representative queries and associate with each representative query \hat{Q}_j a set of queries QC_j = \{Q_i \mid \hat{Q}_j = \arg\min_{j'} d(Q_i, \hat{Q}_{j'})\}.
c_{xF} = \min_{f \in F} c_{xf} for x ∈ X. The objective is to find a k-element set F ⊆ X that minimizes cost(F) (Chrobak, Kenyon & Young, 2005). Obviously, the query clustering problem can be treated as the k-median problem, and it is therefore also NP-hard. Thus, we have to resort to approximation algorithms for solving it.
Algorithmic Solution
For clustering queries, we propose a novel approach that can discover a near-globally optimal solution and has low time complexity as well. The approach is described as follows. Observing the solution of the query clustering, we can see that every representative query connects to some other queries of H, and these connections form star structures. We therefore call such a connection a Star. We can then restate the query clustering problem as follows. Let U be the set of all Stars, i.e., U = {⟨Qi, QCi⟩ | Qi ∈ H, QCi ⊆ H}. The cost of each Star s = ⟨Qi, QCi⟩ ∈ U is c_s = \sum_{Q_j \in QC_i} d(Q_i, Q_j). Let r_s = c_s / |QC_i| be the performance-price ratio. Our objective is to find a set of Stars S ⊆ U that minimizes the cost, such that there are k representative queries in S and every original query Qj ∈ H appears at least once in some Star s ∈ S.

To solve this problem, we propose an approach that consists of two parts: a pre-processing part and a processing part. In the pre-processing part, we build a sequential permutation πi = {Qi1, Qi2, …, Qim} over H for each query Qi ∈ H, where {Qi1, Qi2, …, Qim} = H and the queries in πi are arranged in non-decreasing order of their distance to Qi, that is, d(Qi, Qi1) ≤ d(Qi, Qi2) ≤ … ≤ d(Qi, Qim). Such permutations allow us to consider only the first l queries in πi, rather than all of them, when we build the Star for Qi. Note that the number l should be chosen appropriately. The complexity of the pre-processing part is O(|H|² log |H|), where |H| denotes the number of queries in H.

The task of the processing part is to cluster the queries using the Greedy-Refine algorithm (Algorithm 3), based on the Stars formed in the pre-processing part. The input is the set of all Stars formed in pre-processing. For each Qi ∈ H, the algorithm picks the Star si with the minimal rs in Ui (the set of all Stars in U corresponding to Qi, Ui ⊆ U) and puts it in a set B. From the set B, the algorithm chooses the Star s with the minimal rs and adds it to the objective set Hk; it then removes Qi and QCi from H. The algorithm stops when the set Hk has k elements. The output is a set of k pairs of the form ⟨\hat{Q}_i, QCi⟩, where \hat{Q}_i is a representative query (i.e., the center of cluster QCi) and corresponds to the query cluster QCi. The time complexity of the processing part is O(|H|k), and thus the algorithm runs in polynomial time (Meng & Ma, 2008).
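The following sketch condenses the two phases into one routine; it is a simplified, single-pass variant of Algorithm 3, and the truncation parameter l and the tie-breaking are assumptions:

    def greedy_refine(H, d, k, l):
        """H: list of query ids; d(qi, qj): the distance of Equation (3).
        Returns up to k (representative, cluster) pairs."""
        # Pre-processing: for each query, all queries sorted by distance to it.
        perm = {qi: sorted(H, key=lambda qj: d(qi, qj)) for qi in H}
        remaining = set(H)
        stars = []
        while len(stars) < k and remaining:
            best = None
            for qi in remaining:
                # Candidate Star: the l nearest still-unclustered queries.
                qc = [qj for qj in perm[qi] if qj in remaining][:l]
                cost = sum(d(qi, qj) for qj in qc)
                ratio = cost / len(qc)            # performance-price ratio r_s
                if best is None or ratio < best[0]:
                    best = (ratio, qi, qc)
            _, rep, qc = best
            stars.append((rep, qc))
            remaining -= set(qc) | {rep}
        return stars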
The probability Pj that users are interested in cluster Cj is estimated from the query history as

P_j = \frac{\sum_{Q_i \in S_j} F_i}{\sum_{Q_p \in H} F_p}

where Fi is the weight (frequency) of query Qi and Sj is the set of queries associated with cluster Cj.
Algorithm Overview
A category tree is very similar to a decision tree. There are many well-known decision tree construction algorithms, such as ID3 (Quinlan, 1986), C4.5 (Quinlan, 1993), and CART (Breiman, Friedman & Stone, 1984). However, existing decision tree construction algorithms aim at minimizing the impurity of the data (Quinlan, 1993) (represented by information gain, etc.). Our goal is to minimize the category cost, which includes both the cost of visiting intermediate tree nodes (plus the cost of visiting tuples in the intermediate nodes if the user explores them) and the cost of visiting tuples stored in leaf nodes. To build the category tree, we make use of some ideas from the solution presented by Chen and Li (Chen & Li, 2007) and propose improved algorithms. The problems our algorithm has to resolve include (i) eliminating a subset of relatively unattractive attributes without considering any of their partitions, and (ii) for every attribute selected, obtaining a good partition efficiently instead of enumerating all possible partitions. Finally, we construct the category tree by choosing the attribute and partition with the least cost.
maximal one as g(Ai, Ti).
10. End If
11. End For
12. Choose the attribute Aj with the maximal g(Aj, Tj); remove Aj from AH if Aj is categorical.
13. If g(Aj, Tj) > δ then
14. Replace r with the subcategory Tj; for each leaf nk in Tj with tuples in DQk, call BuildTree(AH, DQk, C, δ).
15. End If
is the gain of the attribute. If the gain-ratio of the attribute with the maximal gain-ratio exceeds a predefined threshold δ, the tree is expanded by adding the selected subcategory to the current root.
Algorithm Solution
Based on the solutions mentioned above, we can now describe how we construct a category tree. Since the problem of finding a tree with minimal category cost is NP-hard, we propose an approximate algorithm (see Algorithm 6). After building the category tree, the user can follow the branches of the tree to find the interesting answers. As mentioned above, the user can explore the category tree using two modes, i.e., showing tuples (option ShowTuples) and showing categories (option ShowCat). When the user chooses the option ShowTuples on a node (an intermediate node or a leaf node), the system provides the items satisfying all conditions from the root to the current node. The category tree accessing algorithm is shown in Algorithm 7.
Partition Criteria
Existing decision tree construction algorithms such as C4.5 compute an information gain to measure how well an attribute classifies the data. Consider a decision tree T with N tuples and n classes, where each class Ci in T has Ni tuples. The entropy is defined as follows:
E(T) = -\sum_{i=1}^{n} \frac{N_i}{N} \log \frac{N_i}{N}    (6)
In real applications, there may be several distinct values in the domain of an attribute A. For the i-th distinct value of A, let Ti be the set of tuples taking that value; the conditional entropy of A is then the weighted average

E_A(T) = \sum_{i=1}^{s} \frac{|T_i|}{|T|} E(T_i)    (7)

where s is the number of distinct values of A and E(Ti) is computed by Equation (6) over the class distribution within Ti. The information gain of attribute A can then be computed as

g(A) = E(T) - E_A(T)    (8)
For example, consider a fraction of the results (shown in Table 1) returned by the MSN House&Home Web database for a query with the condition Price between 250000 and 350000 and City = Seattle. We use it to illustrate how to obtain the best partition attribute using the formulas defined above. Here, we assume the decision attributes are View, Schooldistrict, Livingarea, and SqFt. We first compute the entropy of the tree T (using base-10 logarithms):

E(T) = E(C_1, C_2, C_3) = -\Big(\frac{5}{15}\log\frac{5}{15} + \frac{6}{15}\log\frac{6}{15} + \frac{4}{15}\log\frac{4}{15}\Big) = 0.471293.

We then compute the conditional entropy of each decision attribute. The attribute View has four distinct values, Water, Mountain, GreenBelt, and Street; the entropy of each value is
E_View(Water) = -(5/6) log (5/6) - (1/6) log (1/6) = 0.195676,
E_View(Mountain) = -(0/3) log (0/3) - (2/3) log (2/3) - (1/3) log (1/3) = 0.276434,
E_View(GreenBelt) = -(0/3) log (0/3) - (2/3) log (2/3) - (1/3) log (1/3) = 0.276434,
E_View(Street) = -(0/3) log (0/3) - (2/3) log (2/3) - (1/3) log (1/3) = 0.276434,

with the convention 0 log 0 = 0. Next, E(View) = 6/15 × 0.195676 + 3/15 × 0.276434 + 3/15 × 0.276434 + 3/15 × 0.276434 = 0.2441308. Thus, the gain g(View) = E(T) − E(View) = 0.2271622. Analogously, g(Schooldistrict) = 0.2927512, g(Livingarea) = 0.1100572, and g(SqFt) = 0.251855, where for SqFt we choose the value 987 as the partition value. Hence the attribute Schooldistrict would be selected as the first-level partition attribute of the decision tree T by the C4.5 algorithm.

However, the main difference between our algorithm and existing decision tree construction algorithms is how the gain of a partition is computed. Our approach aims to reduce the category cost of visiting intermediate nodes (including the tuples in them if the user chooses to explore them) and the cost of visiting tuples in the leaves. The following analysis shows that information gain ignores the cost of visiting tuples, that the category tree construction algorithm proposed by Chakrabarti et al. (Chakrabarti, Chaudhuri & Hwang, 2004) ignores the cost of visiting intermediate nodes generated by future partitions, and that the category tree construction algorithm proposed by Chen et al. (Chen & Li, 2007) ignores the cost of visiting tuples in the intermediate nodes.
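The figures above use base-10 logarithms and can be reproduced by the following sketch; the per-value class counts are read off Table 1:

    import math

    def entropy(counts):
        # Base-10 entropy of a class distribution; 0 log 0 is taken as 0.
        n = sum(counts)
        return -sum(c / n * math.log10(c / n) for c in counts if c > 0)

    e_t = entropy([5, 6, 4])                       # E(T), approx. 0.471293

    # Class counts (C1, C2, C3) per value of View, weighted by value frequency.
    view = {"Water": [5, 1, 0], "Mountain": [0, 2, 1],
            "GreenBelt": [0, 2, 1], "Street": [0, 2, 1]}
    e_view = sum(sum(c) / 15 * entropy(c) for c in view.values())

    print(e_t - e_view)                            # g(View), approx. 0.2272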
Cost Estimation
Cost of Visiting Leaves

Let t be the node to be partitioned and N(t) the number of tuples in t. Let t1 and t2 be the children generated by a partition, and let Pi be the probability that users are interested in cluster Ci. The gain equals the reduction of category cost when t is partitioned into t1 and t2. Based on the category cost defined in Definition 2, the reduction of the cost of visiting tuples due to partitioning t into t1 and t2 equals

N(t) \sum_{C_l \subseteq t} P_l - \sum_{j=1,2} N(t_j) \Big( \sum_{C_i \subseteq t_j} P_i \Big)    (9)
The decision tree construction algorithms do not consider the cost of visiting leaf tuples. For example, consider a partition that generates two nodes containing tuples with labels (C1, C2, C1) and (C2), and a partition that generates two nodes containing tuples with labels (C2, C1, C2) and (C1). According to the discussion in Section 5.4.1, these two partitions have the same information gain. However, if P1 = 0.5 and P2 = 0, the category cost of the first partition is smaller: the cost is 1.5 for the first partition and 2 for the second.

Cost of Visiting Intermediate Nodes

To estimate the cost of visiting intermediate nodes, we adopt the method proposed by Chen et al. (Chen & Li, 2007). According to the definition in (Chen & Li, 2007), a perfect tree is one whose leaves contain tuples of only one class and cannot be partitioned further; in fact, a perfect tree is a decision tree. Given a perfect tree T with N tuples and k classes, where each class Ci in T has Ni tuples, the entropy

E(T) = -\sum_{i=1}^{k} \frac{N_i}{N} \log \frac{N_i}{N}

approximates the average length of root-to-leaf paths for all tuples in T. Since T is a perfect tree, its leaves contain only one class per node. Each such leaf Li, containing Ni tuples of class Ci, can be further expanded into a small subtree Ti rooted at Li whose leaves each contain exactly one record of Ci. Each such small subtree Ti contains Ni leaves. All these subtrees together with T compose a big tree Tb that contains \sum_{1 \le i \le k} N_i = N leaves. We further assume that each Ti and the big tree Tb are balanced; thus the height of Ti is log Ni and the height of Tb is log N. Note that for the i-th leaf Li in T, the length of the path from the root to Li equals the height of the big tree Tb minus the height of the small tree Ti. There are Ni tuples in Li, all with the same path from the root. Thus the average length of root-to-leaf paths for tuples is

\sum_{1 \le i \le k} \frac{N_i}{N} (\log N - \log N_i) = -\sum_{i=1}^{k} \frac{N_i}{N} \log \frac{N_i}{N}    (10)
This is exactly the entropy E(T). Note that most existing decision tree algorithms choose the partition that maximizes information gain. Information gain is the reduction of entropy due to a partition and is given by the following formula:

IGain(t, t_1, t_2) = E(t) - \frac{N_1}{N} E(t_1) - \frac{N_2}{N} E(t_2)    (11)
Thus a partition with high information gain generates a tree with low entropy, and this tree has short root-to-leaf paths as well. Since the cost of visiting intermediate nodes equals the product of path lengths and fan-out in Definition 2, if we assume that the average fan-out is about the same for all trees, then the cost of visiting intermediate nodes is proportional to the length of root-to-leaf paths. Therefore, information gain can be used to estimate the cost reduction of visiting intermediate nodes.
Cost of Visiting Tuples in Intermediate Nodes

Since the user may choose to examine the tuples in an intermediate node, we need to consider the cost of visiting these tuples. Let Pst be the probability that the user chooses option ShowTuples for an intermediate node t given that she explores t, and let N(t) be the number of tuples in t. The cost of visiting the tuples in an intermediate node then equals Pst·N(t).

Combining Costs

The remaining problem is how to combine the three types of costs. Here we take a normalization approach, which uses the following formula to estimate the gain of partitioning t into t1 and t2:
gain(t, t_1, t_2) = \frac{IGain(t, t_1, t_2)\, /\, E(t)}{\Big( \sum_{j=1,2} N(t_j) \sum_{C_i \subseteq t_j} P_i \Big/ N(t) \sum_{C_l \subseteq t} P_l \Big) \cdot P_{st}\, N(t)}    (12)
The denominator is the cost of visiting leaf tuples after the partition, normalized by the cost before the partition, multiplied by the cost of visiting the tuples in t. A partition always reduces the cost of visiting tuples (the proof is straightforward), so the normalized leaf cost lies in (0, 1]. The numerator is the information gain normalized by the entropy of t. We compute a ratio of these two terms rather than the sum of the numerator and (1 − denominator) because in practice the numerator (information gain) is often quite small; the ratio is thus more sensitive to the numerator when the denominators are similar.

Complexity Analysis

Let n be the number of tuples in the query results, m the number of attributes, and k the number of classes. The gain in Formula (12) can be computed in O(k) time. C4.5 also uses several optimizations, such as computing the gains for all partition points of an attribute in one pass, sorting all tuples on different attribute values beforehand, and reusing the sort order. The cost of sorting tuples on different attribute values is O(mn log n), and the cost of computing gains for all possible partitions at one node is O(mnk), because there are at most m partition attributes and n possible partition points, and each gain can be computed in O(k) time. If we assume the generated tree has O(log n) levels, the total time is O(mnk log n).
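Under the reading of Equation (12) given above (the exact algebraic form is an assumption; only its verbal description survives in the text), the combined gain can be sketched as:

    def combined_gain(igain, e_t, leaf_cost_before, leaf_cost_after, p_st, n_t):
        """Ratio of Equation (12), as described in the text: the numerator is
        the information gain normalized by E(t); the denominator combines the
        normalized leaf-visiting cost with the cost P_st * N(t) of visiting
        the tuples of t once it becomes an intermediate node."""
        numerator = igain / e_t
        denominator = (leaf_cost_after / leaf_cost_before) * (p_st * n_t)
        return numerator / denominator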
EXPERIMENTAL EVALUATION
In this section, we describe our experiments, report the experimental results and compare our approach with several existing approaches.
Experimental Setup
We used the Microsoft SQL Server 2005 RDBMS on a P4 3.2-GHz PC with 1 GB of RAM for our experiments. Dataset: For our evaluation, we set up a real estate database HouseDB (Price, SqFt, Bedrooms, Bathrooms, Livingarea, Schooldistrict, View, Neighborhood, Boat, Garage, Buildyear) containing 1,700,000 tuples extracted from the MSN House&Home Web site. There are 27 attributes, 10 numerical and 17 categorical. The total data size is 20 MB.
Query history: In our experiments, we asked 40 subjects to behave as different kinds of house buyers, such as rich people, clerks, workers, women, young couples, etc., and to pose queries against the database. We collected 2,000 queries for the database, and these queries are used as the query history. Each subject was asked to submit 15 queries for HouseDB; each query had 2~6 conditions and 4.2 specified attributes on average. We assume each query has equal weight. We did observe that users started with a general query that returned many answers and then gradually refined it until it returned a small number of answers. Algorithm: We implemented all algorithms in C# and connected to the RDBMS through ADO. The clusters are stored by adding a column to the data table that holds the class label of each tuple. The stopping threshold δ in the tree-building algorithm is set to 0.002. We developed an interface that allows users to navigate query results using the generated trees. Comparison: We compare our tree construction algorithm (henceforth referred to as the Cost-based algorithm) with the algorithm proposed by Chakrabarti et al. (Chakrabarti, Chaudhuri & Hwang, 2004) (henceforth referred to as the Greedy algorithm). It differs from our algorithm in two respects: (i) it does not consider different user preferences, and (ii) it does not consider the cost of intermediate nodes generated by future partitions. We also compare against the algorithm proposed by Chen et al. (Chen & Li, 2007) (henceforth referred to as the C4.5-Categorization algorithm), which first uses a query-merging step to generate data clusters and corresponding labels and then uses a modified C4.5 to create the navigational tree. It differs from our algorithm in two respects: (i) it needs to execute queries on the dataset to evaluate query similarity and then merge similar queries, and (ii) it cannot expand intermediate nodes to show tuples and thus does not consider the cost of visiting tuples of intermediate nodes. Setup of user study: We conducted an empirical study by asking 5 subjects (with no overlap with the 40 users submitting the query history) to use this interface. The subjects were randomly selected colleagues, students, etc. Each subject was given a tutorial about how to use the interface. Next, each subject was given the results of the 5 queries listed in Table 2, which do not appear in the query history. For each such query, the subject was asked to navigate the trees generated by the three algorithms mentioned above and to select 5-10 houses that he would like to buy.
ActCost = K_1 N(v) + K_2 \sum_{v_i \in Anc(v)} \Big( |Sib(v_i)| + \sum_{j=1}^{|Sib(v_i)|} P_{st}(N(v_j)) \Big)    (13)
Unlike the category cost in Definition 2, this cost is the actual count of intermediate nodes (including siblings) and tuples visited by a subject. We assume the weights for visiting intermediate nodes and visiting tuples are equal, i.e., K1 = K2 = 1. In general, the lower the total category cost, the better the categorization method. Figure 3 shows the total actual cost, averaged over all subjects, for the Cost-based, C4.5-Categorization, and Greedy algorithms. Figure 4 reports the average number of houses selected by each subject. Figure 5 reports the average category cost per selected house for these algorithms. The results show that the category trees generated by the Cost-based algorithm have the lowest actual cost and the lowest average cost per selected house (the number of query clusters k was set to 30). Users also found more houses worth considering to buy using our algorithm than with the other two algorithms, suggesting that our method makes it easier for users to find interesting houses. The tree generated by the Greedy algorithm has the worst results. This is expected, because the Greedy algorithm ignores different user preferences and does not consider future partitions when generating category trees. The C4.5-Categorization algorithm also has a higher cost than our method. The reason is that our algorithm uses a partitioning criterion that considers the cost of visiting the tuples in intermediate nodes, while the C4.5-Categorization algorithm does not. Moreover, our algorithm can use a few clusters to represent a large number of tuples without losing accuracy (this is tested in the next experiment). The results show that using our approach, on average a subject needs to visit no more than 8 tuples or intermediate nodes for queries Q1, Q2, Q3, and Q4 to find the first relevant tuple, and about 18 tuples or intermediate nodes for Q5. The total navigational cost for our algorithm is less than 45 for the former four queries and less than 80 for Q5. At the end of the study, we asked the subjects which categorization algorithm worked best for them across all the queries they tried. The results of that survey are reported in Table 3 and show that a majority of subjects considered our algorithm the best.
is the number of vector elements in the ⟨attribute, value⟩ pair set (note that each query in the query history is translated into a vector representation), m is the number of input queries, and l is the number of true underlying clusters. We set 0s and 1s at random on the n elements, and we then generate l random queries by sampling at random the space of all possible permutations of the n elements. These initial queries form the centers around which we build each one of the clusters. The task of the algorithms is to rediscover the clustering model used for the data generation. Given a cluster center, each query of the same cluster is generated by adding to the center a specified amount of noise of a specific type. We consider two types of noise: swaps and shifts. A swap means that a 1-element and a 0-element from the initial order are picked and their positions in the order are exchanged. For a shift we pick a random element and move it to a new position, either earlier or later in the order; all elements between the new and the old positions of the element are shifted one position down (or up). The amount of noise is the number of swaps or shifts we make. We experiment with datasets generated for the following parameters: n = 300, m = 600, l = {4, 8, 16, 32}, noise = {2, 4, 8, …, 128} for swaps. Figure 6 shows the performance of the algorithms as a function of the amount of noise. The y axis is the ratio F(A)/F(INP), for A = {Greedy, Greedy-Refine}, where the Greedy-Refine algorithm is the one proposed in this chapter (Algorithm 3), while the Greedy algorithm is proposed in [2]. We compare them here since they both aim at solving the same (clustering) problem. F(A) is the total cost of the solution provided by algorithm A when the distance in Equation (3) is used as the distance measure between queries. F(INP) corresponds to the cost of the clustering structure (Equation (4)) used in the data generation process. From Figure 6 we can see that the Greedy-Refine algorithm performs considerably better than the Greedy algorithm. The reason is that Greedy-Refine is executed on queries that were arranged according to their distance in the pre-processing phase and makes two greedy selections in the processing phase, so that it can obtain a near-globally optimal solution.
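The two noise operators can be sketched as follows on the binary-vector representation of a query (parameter names are illustrative):

    import random

    def swap(vec, amount, rng=random):
        """Pick a 1-element and a 0-element and exchange their positions,
        `amount` times."""
        v = list(vec)
        for _ in range(amount):
            ones = [i for i, x in enumerate(v) if x == 1]
            zeros = [i for i, x in enumerate(v) if x == 0]
            if not ones or not zeros:
                break
            i, j = rng.choice(ones), rng.choice(zeros)
            v[i], v[j] = v[j], v[i]
        return v

    def shift(vec, amount, rng=random):
        """Move a random element earlier or later; the elements in between
        slide one position down (or up)."""
        v = list(vec)
        for _ in range(amount):
            i = rng.randrange(len(v))
            x = v.pop(i)
            v.insert(rng.randrange(len(v) + 1), x)
        return v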
Performance Report
Figure 7 reports the tree construction time of our algorithm for the 5 test queries (since the execution time of Q5 is much longer than that of the first 4 queries, we do not show its histogram in the figure). Our algorithm took no more than 2.4 seconds for the first 4 queries, which returned several hundred results each. It took about 4 seconds for the 5th query, which returned 16,213 tuples. Thus our algorithm can be used in an interactive environment.
CONCLUSION
This chapter proposed a categorization approach for addressing diverse user preferences, which can help users navigate through many query results. The approach first summarizes the preferences of all users in the system by clustering the query history, and then divides tuples into clusters according to the different kinds of user preferences. When a specific user issues a query, our approach creates a category tree over the clusters appearing in the results of the query to help the user navigate these results. Our approach differs from several existing approaches in two respects: (i) it does not require a user profile or a meaningful query to decide the preferences of a specific user, and (ii) the category tree construction algorithm proposed in this chapter considers both the cost of visiting intermediate nodes (including the cost of visiting the tuples in intermediate nodes) and the cost of visiting the tuples in leaf nodes. In the future, we will investigate how to accommodate the dynamic nature of user preferences and how to integrate the ranking approach into our approach.
REFERENCES
Agrawal, R., Rantzau, R., & Terzi, E. (2006). Context-sensitive ranking. Proceedings of the ACM SIGMOD International Conference on Management of Data, (pp. 383-394).
Agrawal, S., Chaudhuri, S., Das, G., & Gionis, A. (2003). Automated ranking of database query results. ACM Transactions on Database Systems, 28(2), 140-174.
Ahlberg, C., & Shneiderman, B. (1994). Visual information seeking: Tight coupling of dynamic query filters with starfield displays. Proceedings of the Conference on Human Factors in Computing Systems, (pp. 313-317).
Breiman, L., Friedman, J., Stone, C. J., & Olshen, R. (1984). Classification and regression trees. Boca Raton, FL: CRC Press.
Bruno, N., Gravano, L., & Marian, A. (2002). Evaluating top-k queries over Web-accessible databases. Proceedings of the 18th International Conference on Data Engineering, (pp. 369-380).
Card, S., MacKinlay, J., & Shneiderman, B. (1999). Readings in information visualization: Using vision to think. Morgan Kaufmann.
Chakrabarti, K., Chaudhuri, S., & Hwang, S. (2004). Automatic categorization of query results. Proceedings of the ACM SIGMOD International Conference on Management of Data, (pp. 755-766).
Chaudhuri, S., Das, G., Hristidis, V., & Weikum, G. (2004). Probabilistic ranking of database query results. Proceedings of the 30th International Conference on Very Large Data Bases, (pp. 888-899).
Chen, Z. Y., & Li, T. (2007). Addressing diverse user preferences in SQL-query-result navigation. Proceedings of the ACM SIGMOD International Conference on Management of Data, (pp. 641-652).
Chrobak, M., Kenyon, C., & Young, N. (2005). The reverse greedy algorithm for the metric k-median problem. Information Processing Letters, 97, 68-72. doi:10.1016/j.ipl.2005.09.009
Das, G., Hristidis, V., Kapoor, N., & Sudarshan, S. (2006). Ordering the attributes of query results. Proceedings of the ACM SIGMOD International Conference on Management of Data, (pp. 395-406).
Dhillon, I. S., Mallela, S., & Kumar, R. (2002). Enhanced word clustering for hierarchical text classification. Proceedings of the 8th ACM SIGKDD International Conference, (pp. 191-200).
Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., et al. (2001). Placing search in context: The concept revisited. Proceedings of the 10th International World Wide Web Conference, (pp. 406-414).
Geerts, F., Mannila, H., & Terzi, E. (2004). Relational link-based ranking. Proceedings of the 30th International Conference on Very Large Data Bases, (pp. 552-563).
Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. Proceedings of the European Conference on Machine Learning, (pp. 137-142).
Joachims, T. (2002). Optimizing search engines using clickthrough data. Proceedings of the ACM Conference on Knowledge Discovery and Data Mining, (pp. 133-142).
Kießling, W. (2002). Foundations of preferences in database systems. Proceedings of the 28th International Conference on Very Large Data Bases, (pp. 311-322).
Koller, D., & Sahami, M. (1997). Hierarchically classifying documents using very few words. Proceedings of the 14th International Conference on Machine Learning, (pp. 170-178).
Koutrika, G., & Ioannidis, Y. (2004). Personalization of queries in database systems. Proceedings of the 20th International Conference on Data Engineering, (pp. 597-608).
Liu, F., Yu, C., & Meng, W. (2002). Personalized Web search by mapping user queries to categories. Proceedings of the ACM International Conference on Information and Knowledge Management, (pp. 558-565).
Meng, X. F., & Ma, Z. M. (2008). A context-sensitive approach for Web database query results ranking. Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, (pp. 836-839).
Mitchell, T. (1997). Machine learning. McGraw Hill.
Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1), 81-106. doi:10.1007/BF00116251
Quinlan, J. R. (1993). C4.5: Programs for machine learning. San Francisco: Morgan Kaufmann Publishers Inc.
Roussos, Y., Stavrakas, Y., & Pavlaki, V. (2005). Towards a context-aware relational model. Proceedings of the International Workshop on Context Representation and Reasoning, Paris, (pp. 101-106).
Rui, Y., Huang, T. S., & Mehrotra, S. (1997). Content-based image retrieval with relevance feedback in MARS. Proceedings of the IEEE International Conference on Image Processing, (pp. 815-818).
Shen, X., Tan, B., & Zhai, C. (2005). Context-sensitive information retrieval using implicit feedback. Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, (pp. 43-50).
Sugiyama, K., Hatano, K., & Yoshikawa, M. (2004). Adaptive Web search based on user profile constructed without any effort from users. Proceedings of the 13th International World Wide Web Conference, (pp. 975-990).
Tweedie, L., Spence, R., Williams, D., & Bhogal, R. S. (1994). The attribute explorer. Proceedings of the International Conference on Human Factors in Computing Systems, (pp. 435-436).
Wu, L., Faloutsos, C., Sycara, K., & Payne, T. (2000). FALCON: Feedback adaptive loop for content-based retrieval. Proceedings of the 26th International Conference on Very Large Data Bases, (pp. 297-306).
Zeng, H. J., He, Q. C., Chen, Z., Ma, W. Y., & Ma, J. (2004). Learning to cluster Web search results. Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, (pp. 210-217).
Chapter 2

DOI: 10.4018/978-1-60960-475-2.ch002

ABSTRACT

Database systems are increasingly used for interactive and exploratory data retrieval. In such retrieval, users' queries often result in too many answers, so users waste significant time and effort sifting and sorting through these answers to find the relevant ones. This chapter first reviews and discusses several research efforts that have attempted to provide users with effective and efficient ways to access databases. It then focuses on a simple but useful strategy for retrieving relevant answers accurately and quickly without being distracted by irrelevant ones. Generally speaking, the chapter presents a very recent but promising approach to quickly provide users with structured and approximate representations of their query results, a must-have for decision support systems. The underlying algorithm operates on pre-computed, knowledge-based summaries of the queried data instead of the raw data themselves. This first-class data structure is therefore also presented in this chapter.

1. INTRODUCTION

With the rapid development of the World Wide Web, more and more databases are accessible online; a July 2000 study (Bergman, 2001) estimated that 96,000 relational databases were available online, and this number had increased seven-fold by 2004 (Chang, He, Li, Patel & Zhang, 2004). The increased visibility of these structured data repositories has made them accessible to a large number of lay users, who typically lack a clear view of their content and, moreover, often do not even have a particular item in mind. Rather,
they are attempting to discover potentially useful items. In such a situation, user queries are often very broad, resulting in too many answers. Not all the retrieved items are relevant to the user, yet she/he often needs to examine all or most of them to find the interesting ones. This too-many-answers phenomenon is commonly referred to as information overload: a state in which the amount of information that merits attention exceeds an individual's ability to process it (Schultz & Vandenbosch, 1998). Information overload often happens when the user is not certain of what she/he is looking for, i.e., she/he has a vague and poorly defined information need or retrieval goal. Thus, she/he generally poses a broad query in the beginning to avoid excluding potentially interesting results, and then starts browsing the answers looking for something interesting. Information overload makes it hard for the user to separate the interesting items from the uninteresting ones, thereby leading to potential decision paralysis and waste of time and effort. The dangers of information overload are not to be underestimated and are well illustrated by buzzwords such as Infoglut (Allen, 1992), Information Fatigue Syndrome (Lewis, 1996), TechnoStress (Weil & Rosen, 1997), Data Smog (Shenk, 1997), Data Asphyxiation (Winkle, 1998), and Information Pollution (Nielsen, 2003).

In the context of relational databases, automated ranking and clustering of query results are used to reduce information overload. Automated ranking-based techniques first seek to clarify or approximate the user's retrieval goal. Then, they assign a score to each answer, representing the extent to which it is relevant to the approximated retrieval goal. Finally, the user is provided with a list ranked in descending order of relevance, containing either all query results or only a top-k subset. In contrast, clustering-based techniques assist the user in clarifying or refining the retrieval goal instead of trying to learn it. They consist in dividing the query result set into dissimilar groups (or clusters) of similar items, allowing users to select and explore the groups that are of interest to them while ignoring the rest. However, both of these techniques present two major problems. The first is related to relevance. With regard to automated ranking-based techniques, the relevance of the results highly depends on their ability to accurately capture the user's retrieval goal, which is not an obvious task. Furthermore, such techniques also bring the disadvantage of match homogeneity, i.e., the user is often required to go through a large number of similar results before finding the next different result. With regard to clustering-based techniques, there is no guarantee that the resulting clusters will match the meaningful groups that a user may expect; in fact, most clustering techniques seek only to maximize some statistical properties of the clusters (such as the size and compactness of each cluster and the separation of clusters relative to each other). The second problem is related to scalability. Both ranking and clustering are performed on query results and consequently occur at query time. Thus, the overhead time cost is an open critical issue for such a posteriori tasks.
To go one step beyond an overview of these well-established techniques, we investigate a simple but useful strategy to alleviate the two problems above. Specifically, we present an efficient and effective algorithm coined the Explore-Select-Rearrange Algorithm (ESRA) that provides users with hierarchical clustering schemas of their query results. ESRA operates on pre-computed, knowledge-based summaries of the data instead of the raw data themselves. The underlying summarization technique used in this work is the SAINTETIQ model (Raschia & Mouaddib, 2002; Saint-Paul, Raschia, & Mouaddib, 2005), a domain knowledge-based approach that enables summarization and classification of structured data stored in a database. Each node (or summary) of the hierarchy provided by ESRA describes a subset of
the result set in a user-friendly form based on domain knowledge. The user then navigates through this hierarchical structure in a top-down fashion, exploring the summaries of interest while ignoring the rest. The remainder of this chapter is organized as follows. In the first part of the chapter, we survey techniques that have been proposed in the literature to provide users with effective and efficient ways to access relational databases, and we propose a categorization of these techniques based on the problem they are supposed to address. In the second part of the chapter, we present the ESRA algorithm and the query answering system that supports ESRA-based summary hierarchies.
This problem, known as the empty-answer problem, happens when the user submits a very restrictive query. A simple way to remedy it is to retry the query repeatedly with alternative values for certain conditions until satisfactory answers are obtained from the database. This solution, however, can be applied only if the user is aware of close alternatives; otherwise it is infeasible (especially for users who lack knowledge about the contents of the database they wish to access). Many techniques have been proposed to overcome this problem, namely query relaxation (Section 2.1.1) and similarity-based search (Section 2.1.2).
conditions in the query. As an example (cf. Motro, 1986), consider the following query q over the Employees relation EmpDB:

SELECT * FROM EmpDB
WHERE Age <= 30 AND Gender = 'F' AND Salary >= 40K.

Figure 1 shows a portion of the lattice of relaxations of q generated by SEAVE, where nodes indicate generalizations (or presuppositions) and arcs indicate generalization relationships. (x, y, z) denotes a query that returns the employees whose age is under x, whose sex is y, and whose yearly salary is at least z. The symbol * indicates any value; once it appears in a query, that condition cannot be generalized any further. Assume in this example that q fails and that all its relaxed queries succeed except those marked in Figure 1 (i.e., q1, q3 and q8). Thus, the failed relaxed queries q1 and q3 are XGFs, whereas the successful relaxed queries q2, q4, q6 and q15 are MGSs. The queries q2, q4, q6 and q15 produce alternative answers to the original query q: (q2) all employees under 30 who earn at least 40K; (q4) all female employees under 32 who earn at least 40K; (q6) all female employees under 31 who earn at least 39K; and (q15) all female employees under 30 who earn at least 37K. These answers can be delivered by SEAVE to the user as the best it could do to satisfy the query q. The main drawback of SEAVE is its high computational cost, which comes from computing and testing a large number of generalizations (i.e., various combinations of attribute values) to identify the MGSs/XGFs.

CoBase

The CoBase (Chu, Yang, Chiang, Minock, Chow & Larson, 1996) system augments the database with Type Abstraction Hierarchies (TAHs) to control the query generalization process. A TAH represents attribute values at different levels of granularity: the higher levels of the hierarchy provide a more abstract data representation than the lower levels (or attribute values). Figure 2 shows an example of a TAH for
attribute Salary, in which unique salary values are replaced by the qualitative ranges high, medium, and low. To relax a failing query, CoBase uses TAH-based operators such as generalization (moving up the TAH) and specialization (moving down the TAH). For example, based on the type abstraction hierarchy given in Figure 2, the condition Salary = 20K could be generalized (i.e., the move-up operator) to Salary = medium. CoBase considerably reduces the number of generalizations to be tested. Note that how close the results are to the user's initial expectations depends on the TAHs used.

Godfrey's System

In (Godfrey, 1997), Godfrey proposed to generalize the user's failed query by simply removing some of its conditions. Thus, instead of searching for all MGSs/XGFs, the proposed system looks for all maXimal Succeeding and Minimal Failing Sub-queries (XSSs and MFSs, respectively) of the failed query. The author also proves that this problem is NP-hard. Indeed, the size of the search space grows exponentially with the number of attributes used in the failed query: if a query involves m attributes, there exist (2^m − 2) sub-queries that have to be examined, disregarding the query itself and the empty query ∅. For instance, the lattice of the possible sub-queries of q is illustrated in Figure 3, using the same notation as in Figure 1.

Figure 3. Lattice of possible sub-queries of q

Hence, some heuristics (Muslea, 2004; Muslea & Lee, 2005; Nambiar & Kambhampati, 2004) have recently been proposed to prune the search space. In (Muslea, 2004) and (Muslea & Lee, 2005), Muslea et al. used decision tree and Bayesian network learning techniques, respectively, on a randomly chosen subset of the target database to identify potential relaxations (or sub-queries) of the failing query to be tested. Then, they use nearest-neighbor techniques to find the relaxed query that is most similar to the failing query. Nambiar et al. (Nambiar & Kambhampati, 2004) employed approximate
functional dependencies to derive the importance degree of the schema attributes in a database, according to which the order of the relaxed attributes is specified. The data samples used to compute the importance degrees are also chosen randomly. Note that both Muslea et al.'s and Nambiar et al.'s approaches reduce the search space considerably, but the output relaxed queries, while they succeed, are not necessarily XSSs.
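To see where the 2^m − 2 figure comes from, the proper non-empty sub-queries of a failing conjunctive query can be enumerated as follows (brute force, feasible only for small m; the condition strings are illustrative):

    from itertools import combinations

    def subqueries(conditions):
        # Every non-empty proper subset of the m conditions: 2^m - 2 candidates.
        m = len(conditions)
        for size in range(1, m):
            yield from combinations(conditions, size)

    # Example: the 2^3 - 2 = 6 sub-queries of q from Figure 3.
    for sq in subqueries(("Age <= 30", "Gender = 'F'", "Salary >= 40K")):
        print(" AND ".join(sq))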
accepted distance value. ARES then accesses the dissimilarity relations to produce a boolean query that is processed by a conventional database system. For example, the vague condition A ≈ v is transformed into the boolean condition A ∈ {x ∈ D_A | (v, x, dist) ∈ DR_A ∧ dist ≤ ε}, where ε is the maximum allowed distance given by the user on D_A. In other words, x and v are considered somewhat close as long as dist ≤ ε. The produced query then selects acceptable tuples for which a global distance is calculated by summing up the elementary distances tied to each vague condition in the query. Finally, the tuples are sorted in ascending order of their global distance (dissimilarity) values, and the system outputs as many tuples as possible within the limit specified by the user. The main drawback of ARES is the high storage and maintenance cost of the dissimilarity relations: each dissimilarity relation needs m² entries for m different attribute values in the corresponding conventional relation, and when a new attribute value is added, 2m + 1 additional entries are necessary for the corresponding dissimilarity relation. Moreover, ARES does not allow defining dissimilarity between attribute values over infinite domains, because the dissimilarities can only be defined by means of tables.

VAGUE

VAGUE (Motro, 1988) is a system that resembles ARES in its overall goals. It extends the relational data model with data metrics and the SQL language with a similar-to comparator ≈. Each attribute domain D is endowed with a metric M_D to define the distance (dissimilarity) between its values. M_D is a mapping from the cartesian product D × D to the set of non-negative reals which is: reflexive, i.e., M_D(x, x) = 0 for every value x in D; symmetric, i.e., M_D(x, y) = M_D(y, x) for all values x and y in D; and transitive, i.e., M_D(x, y) ≤ M_D(x, z) + M_D(z, y) for all values x, y and z in D.
Furthermore, M_D is provided with a radius r. This notion is very similar to the maximum dissimilarity allowed in ARES: two values v1 and v2 in D are considered to be similar if M_D(v1, v2) ≤ r. During query processing, each vague condition expressed in the query is translated (in a similar way to ARES) into a boolean one using the appropriate metric, and the resulting query is used to select tuples. Then an ordering process takes place, relying on the calculation of distances (by means of the associated metrics) for the elementary vague conditions. The global distance attached to a selected tuple in the case of a disjunctive query is the smallest of the distances related to each vague condition. For conjunctive queries, the global distance is obtained as the root of the sum of the squares (i.e., the Euclidean distance) of the distances tied to each vague condition. Note that in VAGUE, users cannot provide their own similarity thresholds for each vague condition; instead, when a vague query does not match any data, VAGUE doubles all searching radii simultaneously, which can considerably deteriorate search performance.

IQE

Another system that supports similarity-based search over relational databases is the IQE system (Nambiar & Kambhampati, 2003). IQE converts an imprecise query (i.e., one whose conditions involve the similarity operator ≈) into equivalent precise queries that appear in an existing query workload. Such queries are then used to answer the user-given imprecise query. More precisely, given the workload, the main idea of IQE is to map the user's imprecise query qi to a precise query qp by tightening the operator in the query condition; for example, tightening the operator similar-to (≈) to equal-to (=) in the
imprecise query Salary ≈ 40K gives the precise query Salary = 40K. IQE then computes the similarity of qp to all queries in the workload. To estimate the similarity between two queries, IQE uses the Jaccard similarity metric over the answer sets of the two queries. A minimal similarity threshold is used to prune the number of queries similar to qp. Finally, the answer to qi is the union of the answers of the precise queries similar to qp, with each tuple in the union inheriting the similarity of its generating precise query. Although IQE is a useful system, it requires a workload of past user queries, which is unavailable for new online databases.

Nearest Neighbors

In the approach known as nearest neighbors (Roussopoulos, Kelley, & Vincent, 2003), database records and queries are viewed as points (i.e., feature vectors) in a multidimensional space S with a metric M_S (e.g., the Euclidean distance). Here a typical query is given by an example (Mosh, 1975) and its result set corresponds to the set of database records that are close to it according to M_S. For instance, in image databases, the user may pose a query asking for the images most similar to a given image. This type of query is known as a nearest neighbor query, and it has been extensively studied in the past (Kevin, Jonathan, Raghu & Uri, 1999). The two most important types of nearest neighbor queries (NNQ) in databases are:

ε-Range Query. The user specifies a query object q ∈ S and a query radius ε. The system retrieves all objects from the database DB ⊆ S whose distance from q does not exceed ε (Figure 4-(a)). More formally, the result set R_ε^q is defined as follows: R_ε^q = {t ∈ DB | M_S(q, t) ≤ ε}.

k-Nearest Neighbor Query. The user specifies a query object q and the cardinality k of the result set. The system retrieves the k objects from the database DB ⊆ S that have the least distance from q (Figure 4-(b)). More formally, the result set NN_k^q is defined as follows:

∀t ∈ NN_k^q, ∀t′ ∈ DB − NN_k^q : M_S(q, t) ≤ M_S(q, t′)
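Both query types admit the naive linear-scan evaluation discussed next; a minimal sketch, with the Euclidean distance playing the role of M_S:

    import math

    def euclidean(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    def range_query(db, q, eps, dist=euclidean):
        # All objects within distance eps of the query object q.
        return [t for t in db if dist(q, t) <= eps]

    def knn_query(db, q, k, dist=euclidean):
        # The k objects of db closest to q, via a full sort of the database.
        return sorted(db, key=lambda t: dist(q, t))[:k]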
A naive solution for answering a given NNQ is to scan the entire database and test, for each object, whether it belongs to the result. Obviously, this solution is very expensive and not feasible for a very large set of objects. Several multidimensional index structures, which make it possible to prune large parts of the search space, have been proposed; the most popular are the R-Tree and its variants R*-tree, X-Tree, SS-Tree, etc. For a more detailed elaboration on multidimensional access methods and on the corresponding query processing techniques, we refer the interested reader to (Volker & Oliver, 1998) and (Christian, Stefan & Daniel, 2001).

While the approaches described in the two previous subsections differ in their implementation details, their overall goal is the same: allowing the database system to return answers related to the failed query, which is more convenient than returning nothing. In the following section, we review some works addressing the many-answers problem, i.e., the situation where the original query results in overabundant answers.
to each feature in an overall ranking. As one can observe, a fundamental requirement of top-k based approaches is a scoring function that specifies which results from a large set of potential answers to a query are most relevant to the user. Meeting this requirement is generally not straightforward, especially when users do not really know what might be useful or relevant to them. Automatic methods for ranking the answers of database (DB) queries have recently been investigated to overcome this problem, most of them being adaptations of methods employed in Information Retrieval (IR). Before discussing these methods in detail, we first review existing Information Retrieval ranking techniques; then we discuss related work on top-k query processing techniques in relational database systems.
score(d) = \frac{P(Rel \mid d)}{P(\overline{Rel} \mid d)} = \frac{P(d \mid Rel)\,P(Rel)}{P(d \mid \overline{Rel})\,P(\overline{Rel})} \propto \frac{P(d \mid Rel)}{P(d \mid \overline{Rel})}
where Rel denotes the set of relevant documents, \overline{Rel} = (D − Rel) the set of irrelevant ones, and P(Rel | d) (resp., P(\overline{Rel} | d)) the probability of relevance (resp., non-relevance) of document d w.r.t. the query q. The higher the ratio of the probability of relevance to non-relevance for a document d, the more likely document d is to be relevant to the user query q. In the above formula, the second equality is obtained using the Bayes formula, whereas the final simplification follows from the fact that P(Rel) and P(\overline{Rel}) are the same for every document d and thus are mere constants that do not influence the ranking of documents. Note that Rel and \overline{Rel} are unknown at query time and consequently are estimated as accurately as possible on the basis of whatever data is available to the system for this purpose. The usual techniques in Information Retrieval make some simplifying assumptions, such as estimating Rel through user feedback, approximating \overline{Rel} as D (since Rel is usually small compared to D), and assuming some form of independence between query terms (e.g., the Binary Independence Model, the Linked Dependence Model, or the Tree Dependence Model). INQUERY (Callan, Croft & Harding, 1992) is an example of this model.

Link-Based Ranking Methods

Link-based ranking methods are based on how pages on the Internet link to each other (Brin & Page, 1998; Bharat & Henzinger, 1998; Kleinberg, 1999; Borodin, Roberts, Rosenthal & Tsaparas, 2001; Ng, Zheng & Jordan, 2001). Indeed, the relevance of a page is decided not only by its content but also by the linkage among pages. An example of a ranking algorithm based on link analysis is the PageRank algorithm (Brin & Page, 1998) introduced by Google. PageRank computes Web page scores by exploiting the graph inferred from the link structure of the Web. Its underlying motivation is that pages with many backlinks are more important than pages with only a few backlinks. So basically, a page's rank in Google's search results is higher if many, preferably important, pages link to that page; the higher the PageRank, the more relevant the page (according to Google). Another example of a ranking algorithm using link analysis is the HITS algorithm (Kleinberg, 1999). HITS suggests that each page should have a separate authority rating (based on the links going to the page) and a hub rating (based on the links going from the page). The intuition behind the algorithm is that important hubs link to important authorities, and important authorities are linked to by important hubs. For more details and more general sources on Information Retrieval ranking methods, please refer to (Manning, Raghavan & Schütze, 2008).
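As an illustration of the link-analysis idea, here is a minimal power-iteration sketch of PageRank (the damping factor 0.85 and the convergence test are conventional choices, not taken from the chapter; every linked page is assumed to appear as a key of `links`):

    def pagerank(links, d=0.85, tol=1e-8, max_iter=100):
        """links: page -> list of pages it links to. Returns page -> score."""
        pages = list(links)
        n = len(pages)
        rank = {p: 1.0 / n for p in pages}
        for _ in range(max_iter):
            new = {p: (1 - d) / n for p in pages}
            for q in pages:
                out = links[q]
                for p in out:
                    # A backlink from q passes on a share of q's own rank.
                    new[p] += d * rank[q] / len(out)
            if sum(abs(new[p] - rank[p]) for p in pages) < tol:
                return new
            rank = new
        return rank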
(Wu, Faloutsos, Sycara & Payne, 2000) and (MacArthur, Brodley, Ka & Broderick, 2002) use relevance feedback from the user.

Chaudhuri et al.'s System

In (Chaudhuri & Das, 2003), the authors proposed a ranking function (QFW) that leverages workload information to rank the answers to a database query. It is based on the frequency of occurrence of the values of unspecified attributes. For example, consider a home-buyer searching for houses in HouseDB. A query with a not very selective condition such as City = Paris AND #Bedrooms = 2 may result in too many tuples in the answer, since there are many houses with two bedrooms in Paris. The proposed system uses workload information and examines attributes other than City and #Bedrooms (i.e., attributes that are not specified in the query) to rank the result set. Thus, if the workload contains many more queries for houses in Paris's 15th arrondissement (precinct) than for houses in Paris's 18th arrondissement, the system ranks two-bedroom houses in the 15th arrondissement higher than two-bedroom houses in the 18th arrondissement. The intuition is that if the 15th arrondissement is a wonderful location, the workload will contain many more queries for houses in the 15th than for houses in the 18th. More formally, consider one relation R. Each tuple in R has N attributes A1, …, AN. Further, let q be a user's query with some of the attributes specified (e.g., A1, A2, …, Ai with i < N) and the rest unspecified (e.g., Ai+1, …, AN). The relevance score of an answer t is defined as follows:

QFW(t) = \sum_{k=i+1}^{N} \frac{F(t.A_k)}{F_{max}}
where F(t.Ak) is the frequency of occurrence of value t.Ak of attribute Ak in the workload W, and Fmax is the frequency of the most frequently occurring value in W.

PIR

In the PIR system (Chaudhuri, Das, Hristidis & Weikum, 2004), the authors adapted and applied principles of probabilistic models from Information Retrieval to structured data. Given a query, the proposed ranking function depends on two factors: (a) a global score that captures the global importance of unspecified attribute values, and (b) a conditional score that captures the strengths of dependencies (or correlations) between specified and unspecified attribute values. For example, for the query City = Marseille AND View = Waterfront, a house with SchoolDistrict = Excellent gets a high rank because good school districts are globally desirable. A house with also BoatDock = Yes gets a high rank because people desiring a waterfront are likely to want a boat dock. These scores are estimated using past workloads as well as data analysis; e.g., a past workload may reveal that a large fraction of users seeking houses with a waterfront view have also requested boat docks. More precisely, under the same notations and hypotheses as for the previous approach, the relevance score of a tuple t is computed as follows:

PIR(t) = \frac{P(Rel \mid t)}{P(\overline{Rel} \mid t)} \approx \prod_{k=i+1}^{N} \frac{P(t.A_k \mid W)}{P(t.A_k \mid R)} \cdot \prod_{k=i+1}^{N} \prod_{l=1}^{i} \frac{P(t.A_l \mid t.A_k, W)}{P(t.A_l \mid t.A_k, R)}
where the quantities P(t.Ak | W) and P(t.Ak | R) are simply the relative frequencies of each distinct value t.Ak in the workload W and in the relation R, respectively, while the quantities P(t.Al | t.Ak, W) and P(t.Al | t.Ak, R) are estimated by computing the confidences of pair-wise association rules (Agrawal, Mannila, Srikant, Toivonen & Verkamo, 1996) in W and R, respectively. Note that the score in the above formula is composed of two large factors: the first factor is the global part of the score, while the second is the conditional part. This approach is implemented in STAR (Kapoor, Das, Hristidis, Sudarshan & Weikum, 2007). Note that in both Chaudhuri's system (Chaudhuri & Das, 2003) and the PIR system (Chaudhuri, Das, Hristidis & Weikum, 2004), the atomic quantities F(x), P(x|W), P(x|R), P(y|x,W) and P(y|x,R) are pre-computed and stored in special auxiliary tables for all distinct values x and y in the workload and the database. At query time, both approaches first select the tuples that satisfy the query condition, then scan and compute the score for each such tuple using the information in the auxiliary tables, and finally return the top-k tuples. The main drawback of both (Chaudhuri & Das, 2003) and (Chaudhuri, Das, Hristidis & Weikum, 2004) is the high storage and maintenance cost of the auxiliary tables. Moreover, they require a workload of past user queries as input, which is not always available (e.g., for new online databases).

QRRE

In the QRRE system (Su, Wang, Huang & Lochovsky, 2006), the authors proposed an automatic ranking method that can rank the query results from an E-commerce Web database R using only data analysis techniques. Consider a tuple t = ⟨t.A1, …, t.AN⟩ in the result set Tq of a query q submitted by a buyer. QRRE assigns a weight wi to each attribute Ai that reflects its importance to the user. wi is evaluated by the difference (e.g., the Kullback-Leibler divergence (Duda, Hart & Stork, 2000)) between the distribution (histogram) of Ai's values over the result set Tq and their distribution (histogram) over the whole database R. The bigger the divergence, the more important Ai is for the buyer. For instance, suppose the database HouseDB contains houses for sale in France, and consider the query q with the condition View = Waterfront. Intuitively, the Price values of the tuples in the result set Tq are distributed in a small and dense range with a relatively high average, while the Price values of the tuples in HouseDB are distributed over a large range with a relatively low average. The distribution difference shows a close correlation between the unspecified attribute, namely Price, and the query View = Waterfront. In contrast, the attribute Size is less important for the user, since its distribution in houses with a waterfront view may be similar to its distribution in the entire database HouseDB. Besides the attribute weight, QRRE also assigns a preference score pi to each attribute value t.Ai. pi is computed based on two assumptions: first, a product with a lower price is always more desirable to buyers than a product with a higher price if the other attributes of the two products have the same values (for example, between two houses that differ only in their price, the cheaper one is preferred; hence, QRRE assigns a small preference score to a high Price value and a large preference score to a low Price value); second, a non-Price attribute value with higher desirableness for the user corresponds to a higher price (for example, a large house, which most buyers prefer, is usually more expensive than a small one).
Thus, in the case of a non-Price attribute Ai, QRRE first converts its value t.Ai to a Price value pv, defined as the average price of the products with Ai = t.Ai in the database R. Then, QRRE assigns a large preference score to t.Ai if pv is large.
Finally, the attribute weight and the value preference score are combined to calculate the ranking score of each tuple t ∈ Tq, as follows:

QRRE(t) = Σ_{i=1..N} wi · pi
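For intuition, here is a minimal sketch of this scoring pipeline (the helper names are hypothetical; QRRE itself builds histograms over binned numeric domains rather than raw value frequencies, and derives the preference function from average prices as described above):

    import math
    from collections import Counter

    def relative_freqs(values):
        """Relative frequency of each distinct value (a crude histogram)."""
        counts = Counter(values)
        n = len(values)
        return {v: c / n for v, c in counts.items()}

    def kl_divergence(p, q, eps=1e-9):
        """Kullback-Leibler divergence KL(p || q); eps smooths missing buckets."""
        return sum(pv * math.log(pv / q.get(v, eps)) for v, pv in p.items())

    def attribute_weights(result_set, database, attributes):
        """QRRE-style weight w_i: divergence between A_i's distribution over
        the query result and its distribution over the whole database."""
        return {a: kl_divergence(relative_freqs([t[a] for t in result_set]),
                                 relative_freqs([t[a] for t in database]))
                for a in attributes}

    def qrre_score(t, weights, pref):
        """Ranking score: sum over attributes of weight times preference."""
        return sum(w * pref(a, t[a]) for a, w in weights.items())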
The tuples' ranking scores are sorted, and the top-k tuples with the largest ranking scores are presented to the user first. QRRE is a useful automated ranking approach for the many-answers problem: it is domain-independent and requires no workload. However, this approach may incur high response times, especially in the case of low-selectivity queries, since several histograms must be constructed over the result set (i.e., at query time).

Feedback-Based Systems

Another approach to ranking query results, different from those discussed above, is to prompt the user for feedback on the retrieval results and then use this feedback on subsequent retrievals to effectively infer which tuples in the database are of interest to the user. Relevance feedback techniques were studied extensively in the context of image retrieval (Wu, Faloutsos, Sycara & Payne, 2000; MacArthur, Brodley, Ka & Broderick, 2002) and were usually paired with the query-by-example approach (Zloof, 1975). The basic procedure of these approaches is as follows:
1. the user issues a query;
2. the system returns an initial set of results;
3. the user marks some returned tuples as relevant or non-relevant;
4. the system constructs a new query that is supposed to be close to the relevant results and far from the non-relevant ones (one classical construction is sketched after this list); and
5. the system displays the results that are most similar to the new query.
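The chapter does not prescribe a particular construction for step 4; one classical instantiation from IR is Rocchio's formula, sketched here over numeric feature vectors (α, β and γ are tuning weights, and all names are assumptions):

    def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
        """Move the query vector toward the relevant examples and away from
        the non-relevant ones (all arguments are equal-length numeric vectors)."""
        def centroid(vectors):
            n = len(vectors)
            return [sum(col) / n for col in zip(*vectors)] if n else [0.0] * len(query)
        r, nr = centroid(relevant), centroid(non_relevant)
        return [alpha * q + beta * ri - gamma * nri
                for q, ri, nri in zip(query, r, nr)]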
This procedure can be conducted iteratively until the user is satisfied with the query results. Relevance feedback-based approaches provide an effective method for reducing the number of query results. However, they are not necessarily popular with users: users are often reluctant to provide explicit feedback, or simply do not wish to prolong the search interaction. Furthermore, it is often hard to understand why a particular tuple is retrieved after the relevance feedback algorithm has been applied. Once the scoring function is defined, the database ranking techniques discussed in this subsection adapt and use the available top-k query processing algorithms (Ilyas, Beskales & Soliman, 2008) in order to quickly provide the user with the k most relevant results of a given query. In the following subsection, we briefly review top-k query processing methods in relational database systems.
the top-k problem, Tq can alternatively be seen as a set of N sorted lists Li of |Tq| (the number of tuples in Tq) pairs (t, si), t ∈ Tq. Hence, for each attribute Ai there is a sorted list Li in which all the results in Tq are ranked in descending order of score. Entries in the lists can be accessed randomly, through the tuple identifier, or sequentially, following the sorted scores. The main issue in top-k query processing is then to obtain the k tuples with the highest overall scores, computed according to a given aggregation function agg(s1, ..., sN) of the attribute scores si. The aggregation function agg used to combine the ranking criteria has to be monotone; that is, agg must satisfy the following property: agg(s1, ..., sN) ≤ agg(s′1, ..., s′N) if si ≤ s′i for every i. The naive algorithm consists in looking at every entry (t, si) in each of the sorted lists Li, computing the overall grade of every object t, and returning the top-k answers. Obviously, this approach is unnecessarily expensive, as it does not take advantage of the fact that only the k best answers are part of the query answer and the remaining answers do not need to be fully processed. Several query answering algorithms have been proposed in the literature to process top-k queries efficiently. The most popular is the Threshold Algorithm (TA), independently proposed by several groups (Fagin, Lotem & Naor, 2001; Nepal & Ramakrishna, 1999; Güntzer, Balke & Kießling, 2000). The TA algorithm works as follows:
1. do sorted access in parallel to each of the N sorted lists. As a tuple t is seen under sorted access in some list, do random access to the other lists to find the score of t in every list and compute the overall score of t. Maintain in a set TOP the k tuples whose overall scores are the highest among all tuples seen so far;
2. for each list Li, let si be the last score seen under sorted access in Li. Define the threshold to be τ = agg(s1, ..., sN). If TOP contains k tuples whose overall scores are greater than or equal to τ, then stop doing sorted access to the lists; otherwise, go to step 1;
3. return TOP.
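A compact sketch of TA over in-memory lists (assuming, for simplicity, that every tuple has a score in every list; the Table 2 walkthrough below follows exactly this loop):

    def threshold_algorithm(lists, k, agg=sum):
        """Threshold Algorithm (TA) sketch.
        lists: N lists of (tuple_id, score) pairs, sorted by descending score."""
        index = [dict(lst) for lst in lists]   # random access: tuple_id -> score
        seen = {}                              # tuple_id -> overall score
        for depth in range(max(len(lst) for lst in lists)):
            for lst in lists:
                if depth < len(lst):
                    tid = lst[depth][0]        # sorted access
                    if tid not in seen:        # random accesses to all lists
                        seen[tid] = agg(other[tid] for other in index)
            # threshold: aggregate the last scores seen under sorted access
            threshold = agg(lst[min(depth, len(lst) - 1)][1] for lst in lists)
            top = sorted(seen.items(), key=lambda kv: -kv[1])[:k]
            if len(top) == k and all(score >= threshold for _, score in top):
                return top                     # stopping condition (step 2)
        return sorted(seen.items(), key=lambda kv: -kv[1])[:k]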
Table 2 shows an example with three lists L1, L2 and L3. Assume that the top-k query requests the top-2 tuples and that the aggregation function agg is the summation function SUM. TA first scans the first tuple in each list, namely t5, t4 and t3; the threshold value at this point is τ = 21 + 34 + 30 = 85. TA then calculates the aggregated score of each tuple seen so far by random accesses to the three lists: SUM(t5) = 21 + 9 + 7 = 37, SUM(t4) = 11 + 34 + 14 = 59 and SUM(t3) = 11 + 26 + 30 = 67. TA maintains the top-2 tuples seen so far, which are t3 and t4. As neither of them has an aggregated score greater than or equal to the current threshold value τ = 85, TA continues with the tuples at the second position of each list. At this point, the threshold value is recomputed as τ = 17 + 29 + 14 = 60. The new tuples seen are t1 and t2; their aggregated scores are retrieved and calculated as SUM(t1) = 0 + 29 + 0 = 29 and SUM(t2) = 17 + 0 + 1 = 18. TA still keeps the tuples t3 and t4, since their aggregated scores are higher than those of both t1 and t2. Since only t3 has an aggregated score greater than the current threshold value τ = 60, the TA algorithm continues with the tuples at the third position. Now the threshold value is τ = 11 + 29 + 9 = 49 and the new tuple seen is t0, whose aggregated score is 38. t3 and t4 still hold the two highest aggregated scores, which are now both greater than the current threshold value τ = 49. TA therefore terminates at this point and returns t3 and t4 as the top-2 tuples. Note that, in this example, TA avoids accessing the tuples t6, t7, t8 and t9. For more details on top-k processing techniques in relational databases, we refer the interested reader to (Ilyas, Beskales & Soliman, 2008).
keywords) match the upcoming query. A hierarchical clustering technique is typically used in these approaches, and different strategies for matching the query against the document hierarchy have been proposed, most notably top-down and bottom-up search and their variants (Jardine & Van Rijsbergen, 1971; Van Rijsbergen & Croft, 1975; Croft, 1980; Voorhees, 1985). Similarly, Web search engines (e.g., Yahoo) and product catalog search facilities (e.g., eBay) use a category structure created in advance and then group search results into separate categories. In all these approaches, if a query does not match the cluster representation of one of the pre-defined clusters or categories, it fails to match any documents, even if the document collection contains relevant results. It is worth noticing that this problem is not intrinsic to clustering, but is due to the fact that the keyword representation of clusters is often insufficient to capture the meaning of the documents in a cluster. An alternative way of using the cluster hypothesis is in the presentation of the retrieval results, that is, by presenting, in a clustered form, only the documents that have been retrieved in response to the query. This idea was first introduced in the Scatter/Gather system (Hearst & Pedersen, 1996), which is based on a variant of the classical k-means algorithm (Hartigan & Wong, 1979). Since then, several classes of algorithms have been proposed, such as STC (Zamir & Etzioni, 1999), SHOC (Zeng, He, Chen, Ma & Ma, 2004), EigenCluster (Cheng, Kannan, Vempala & Wang, 2006) and SnakeT (Ferragina & Gulli, 2005). Note that such algorithms introduce a noticeable time overhead into query processing, due to the large number of results returned by the search engine. The reader is referred to (Manning, Raghavan & Schütze, 2008) for more details on IR clustering techniques. All the above approaches attest that there is a significant potential benefit in providing additional structure over large answer sets.
The order in which the attributes appear in the tree, and the values used to split the domain of each attribute, are inferred by analyzing the aggregate knowledge of previous user behavior captured in the workload. Indeed, the attributes that appear most frequently in the workload are presented to the user earlier (i.e., at the highest levels of the tree). The intuition behind this approach is that the presence of a selection condition on an attribute in the workload reflects the user's interest in that attribute. Furthermore, for each attribute Ai, one of the following two methods is used to partition the set of tuples tset(C) contained in a category C, depending on whether Ai is categorical or numeric:
• if Ai is a categorical attribute with discrete values {v1, ..., vk}, the proposed algorithm simply partitions tset(C) into k categories, one category Cj per value vj. It then presents them in decreasing order of occ(Ai = vj), i.e., the number of queries in the workload whose selection condition on Ai overlaps with Ai = vj;
• otherwise, assume the domain of attribute Ai is the interval [vmin, vmax]. If a significant number of query ranges (corresponding to selection conditions on Ai) in the workload begin or end at some v ∈ [vmin, vmax], then v is considered a good point at which to split [vmin, vmax]. The intuition here is that most users would be interested in just one bucket, i.e., either in the bucket Ai ≤ v or in the bucket Ai > v, but not both.
This approach provides the user with navigational facilities to browse the query results. However, it requires a workload containing past user queries as input, which is not always available. Furthermore, the hierarchical category structure is built at query time, so the user may have to wait a long time before the results can be displayed.

OSQR

In Bamba, Roy & Mohania (2005), the authors proposed OSQR, an approach for clustering database query results based on the agglomerative single-link approach (Jain, Murty & Flynn, 1999). Given an SQL query as input, OSQR explores its result set and identifies a set of terms (called the query's context) that are the most relevant to the query; each term in this set is also associated with a score quantifying
its relevance. Next, OSQR exploits the term scores and the association of the rows in the query result with the respective terms to define a similarity measure between the terms. This similarity measure is then used to group multiple terms together; this grouping, in turn, induces a clustering of the query result rows. More precisely, consider a query q on a table R, and let Tq denote the result of q. OSQR works as follows:
1. scan Tq and assign a score sx to each attribute value x (or term) in Tq. The term scores (similar to the tf*idf scores used in information retrieval) are defined in such a way that higher scores indicate attribute values that are popular in the query result Tq and rare in R − Tq;
2. compute the context of q as the set qcontext of terms with scores exceeding a certain threshold (a system parameter);
3. associate with each term x in qcontext the cluster Cx, i.e., the set of tuples of Tq in which the attribute value x appears. The tuples in Tq that are not associated with any term x ∈ qcontext are termed outliers; these rows are not processed any further;
4. iteratively merge the two most similar clusters until a stopping condition is met. The similarity sim(Cx, Cy) between each pair of clusters Cx and Cy (x, y ∈ qcontext) is defined as follows:
where |·| denotes the cardinality of a set. OSQR's output is a dendrogram that can be browsed from its root node down to its leaves, where each leaf represents a single term x in qcontext together with its associated tuple set Cx. The above approach has many desirable features: it generates overlapping clusters, associates a descriptive context with each generated cluster, and does not require a query workload. However, note that some query results (the tuples that are not associated with any term in qcontext) are ignored and therefore not included in the output result. Moreover, this approach may incur high response times, especially in the case of low-selectivity queries, since both the scoring and the clustering of terms are done on the fly.

Li et al.'s System

In a recent work (Li, Wang, Lim, Wang & Chang, 2007), the authors generalized the SQL group-by operator to enable grouping of database query results based on the proximity of attribute values. Consider a relation R with attributes A1, ..., AN and a user's query q over R with a group-by clause on a subset X of R's numeric attributes. The proposed algorithm first divides the domain of each attribute Ai ∈ X into pi disjoint intervals (or bins) to form a grid of buckets (or cells). Next, this approach identifies the set of buckets C = {b1, ..., bm} that holds the results of q, and associates with each bucket bi a virtual point vi located at the center of that bucket. Finally, a k-means algorithm is performed on these virtual points (i.e., {v1, ..., vm}) to obtain exactly k clusters of q's results. The parameter k is given by the end-user. For example, consider a user's query that returns 10 tuples t1, ..., t10, where the user needs to partition these tuples into 2 clusters using two attributes A1 and A2. Figure 6 shows an example of a grid over t1, ..., t10 obtained by partitioning attributes A1 and A2. The bins on A1 and A2 are {[0, 3), [3, 6), [6, 9)} and {[0, 10), [10, 20), [20, 30)}, respectively. The two clusters C1 (i.e., A1 ∈ [0, 3] ∧ A2 ∈ [10, 30]) and C2 (i.e., A1 ∈ [6, 9] ∧ A2 ∈ [0, 10]) are returned to the user. C1 contains the 6 tuples t1, t3, t4, t6, t9 and t10, whereas C2 contains the 4 tuples t2, t5, t7 and t8. This approach is efficient: it relies on bucket-level clustering, which is much cheaper than tuple-level clustering, since the number of buckets is much smaller than the number of tuples. However, the proposed algorithm requires the user to specify the number of clusters k, which is difficult to know in advance yet has a crucial impact on the clustering result. Further, this approach generates a flat clustering of the query results, and some clusters may contain a very large number of results, although this is exactly the kind of outcome the technique should avoid.
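A minimal sketch of this bucket-level grouping (tuples are assumed to be numeric sequences and bin edges are given per attribute; the published algorithm may differ in details such as weighting virtual points by bucket population):

    import random

    def bin_index(value, edges):
        """Index of the half-open bin [edges[j], edges[j+1]) containing value."""
        for j in range(len(edges) - 1):
            if edges[j] <= value < edges[j + 1]:
                return j
        return len(edges) - 2  # clamp values on the upper boundary

    def virtual_points(tuples, edges_per_attr):
        """Centers of the occupied grid cells (one virtual point per cell)."""
        cells = set()
        for t in tuples:
            cells.add(tuple(bin_index(t[i], e)
                            for i, e in enumerate(edges_per_attr)))
        return [tuple((e[j] + e[j + 1]) / 2 for j, e in zip(c, edges_per_attr))
                for c in cells]

    def kmeans(points, k, iters=20):
        """Plain k-means over the (few) virtual points; requires len(points) >= k."""
        centroids = random.sample(points, k)
        for _ in range(iters):
            groups = [[] for _ in range(k)]
            for p in points:
                i = min(range(k), key=lambda i: sum((a - b) ** 2
                        for a, b in zip(p, centroids[i])))
                groups[i].append(p)
            centroids = [tuple(sum(x) / len(g) for x in zip(*g)) if g
                         else centroids[i] for i, g in enumerate(groups)]
        return groups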
tuples of R satisfying Q ∧ P1 ∧ P2; if the result is empty, pick the tuples satisfying Q ∧ P1 ∧ ¬P2; if the result is still empty, pick the tuples satisfying Q ∧ ¬P1 ∧ P2. In other words, Q is a hard constraint, whereas P1 and P2 are preferences, or soft constraints. Since then, extensive investigation has been conducted, and two main types of approaches for dealing with user preferences have been distinguished in the literature, namely the quantitative and the qualitative approach (Chomicki, 2003).
composing a user profile and proposed a generic model that can be instantiated and adapted to each specific application.
(d, dt, w, wt) ≻ (d′, dt′, w′, wt′) ⇔ (d = d′ ∧ dt = fish ∧ wt = white ∧ dt′ = fish ∧ wt′ = red) ∨ (d = d′ ∧ dt = meat ∧ wt = red ∧ dt′ = meat ∧ wt′ = white)
Another example of a qualitative preference representation, over a relation CarDB with attributes Make, Year, Price and Miles, is the following preference for cheap cars manufactured by Benz, prior to 2005 but not before 2003, expressed in PreferenceSQL (Kießling, 2002):

SELECT * FROM CarDB
WHERE Make = 'Benz'
PREFERRING (LOWEST(Price) AND Year BETWEEN 2003, 2005)

Note that the qualitative approach is more general than the quantitative one, since one can define preference relations in terms of scoring functions, whereas not every preference relation can be captured by scoring functions. For example, consider the relation BookDB(ISBN, Vendor, Price) and its instance shown in Table 3. The preference "if the ISBN is the same, prefer the lower price to the higher one" makes b2 preferred to b1, and b1 preferred to b3. There is no preference between the first three books (i.e., b1, b2 and b3) and the fourth one (i.e., b4). Thus, the score of the fourth tuple should be equal to the scores of each of the first three tuples. But this implies that the scores of the first three tuples are all the same, which is impossible, since the second tuple is preferred to the first one, which in turn is preferred to the third one.
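Qualitative preferences are typically evaluated with a winnow-style operator that returns the undominated tuples. A minimal sketch over the BookDB example (the instance values are hypothetical, since Table 3 is not reproduced here):

    def winnow(tuples, prefers):
        """Best matches only: keep the tuples not dominated under the strict
        preference relation prefers(u, t), read as 'u is preferred to t'."""
        return [t for t in tuples if not any(prefers(u, t) for u in tuples)]

    def same_isbn_lower_price(u, t):
        """The BookDB preference: for the same ISBN, prefer the lower price."""
        return u["ISBN"] == t["ISBN"] and u["Price"] < t["Price"]

    # Hypothetical instance in the spirit of Table 3:
    books = [
        {"Id": "b1", "ISBN": "1-55860-622", "Vendor": "A", "Price": 35.0},
        {"Id": "b2", "ISBN": "1-55860-622", "Vendor": "B", "Price": 30.0},
        {"Id": "b3", "ISBN": "1-55860-622", "Vendor": "C", "Price": 40.0},
        {"Id": "b4", "ISBN": "0-201-53771", "Vendor": "A", "Price": 50.0},
    ]
    print([b["Id"] for b in winnow(books, same_isbn_lower_price)])  # ['b2', 'b4']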
where μF(u), for each u ∈ U, denotes the degree of membership of u in the fuzzy set F. An element u ∈ U is said to be in the fuzzy set F if and only if μF(u) > 0, and to be a full member if and only if μF(u) = 1. We call the support and the kernel of the fuzzy set F, respectively, the sets support(F) = {u ∈ U | μF(u) > 0} and kernel(F) = {u ∈ U | μF(u) = 1}. Furthermore, if m fuzzy sets F1, F2, ..., Fm are defined over U such that ∀i ∈ {1, ..., m}, Fi ≠ ∅, Fi ⊆ U, and ∀u ∈ U, Σ_{i=1..m} μFi(u) = 1, the set {F1, ..., Fm} is called a fuzzy partition (Ruspini, 1969) of U. For example, consider the attribute Salary with domain DSalary = [0, 110]K. A typical fuzzy partition of the universe of discourse DSalary (i.e., the employees' salaries) is shown in Figure 8, where the fuzzy sets (values) none, miserable, modest, reasonable, comfortable, enormous and outrageous are defined. Here, the crisp value 60K has a membership grade of 0.5 in both the reasonable and the comfortable fuzzy sets, i.e., μreasonable(60K) = μcomfortable(60K) = 0.5.
Tahani's Approach

Tahani (Tahani, 1977) was the first to propose a formal approach and architecture for handling simple fuzzy queries over crisp relational databases. More specifically, the author proposed to use fuzzy values instead of crisp ones in the query condition. An example of a fuzzy query is "get employees who are young and have a reasonable salary". This query contains two fuzzy predicates, Age = young and Salary = reasonable, where young and reasonable are words of natural language that express or identify fuzzy sets (Figure 9). Tahani's approach takes a relation R and a fuzzy query q over R as inputs and produces a fuzzy relation Rq, that is, an ordinary relation in which each tuple t is associated with a matching degree μq(t) in the interval [0, 1]. The value μq(t) indicates the extent to which the tuple t satisfies the fuzzy predicates involved in the query q. The matching degree μq(t) of each particular tuple t is calculated as follows. For a tuple t and a fuzzy query q with a simple fuzzy predicate A = l, where A is an attribute and l is a fuzzy set defined on the domain of A, μA=l is defined by μA=l(t) = μl(t.A), where t.A is the value of tuple t on attribute A and μl is the membership function of the fuzzy set l. For instance, consider the relation EmpDB in Table 4. The fuzzy relation corresponding to the fuzzy predicate Age = young (resp., Salary = reasonable) is shown in Table 5-(a) (resp., Table 5-(b)). Note that when μq(t) = 0, the tuple t no longer belongs to the fuzzy relation Rq (for instance, tuple #1 in Table 5-(a)). The matching function for a complex fuzzy query with multiple fuzzy predicates is obtained by applying the semantics of the fuzzy logical connectives, that is:

μ_{p1∧p2}(t) = min(μ_{p1}(t), μ_{p2}(t))
μ_{p1∨p2}(t) = max(μ_{p1}(t), μ_{p2}(t))
μ_{¬p1}(t) = 1 − μ_{p1}(t)
where p1 and p2 are fuzzy predicates. Table 6 shows the fuzzy relation corresponding to the query Age = young AND Salary = reasonable. Note that the min and max operators may be replaced by any t-norm and t-conorm operators (Klement, Mesiar & Pap, 2000) to model the conjunction and disjunction connectives, respectively.

FQUERY

In (Zadrozny & Kacprzyk, 1996), the authors proposed FQUERY, an extension of the Microsoft Access SQL language with the capability of manipulating fuzzy terms. More specifically, they proposed to take into account the following types of fuzzy terms: fuzzy values, fuzzy comparison operators and fuzzy quantifiers. Given a query involving fuzzy terms, the matching degree of the relevant answers is calculated according to the semantics of the fuzzy terms (Zadeh, 1999). In addition to the syntax and semantics of the extended SQL, the authors also proposed a scheme for the elicitation and manipulation of the fuzzy terms to be used in queries. FQUERY has been one of the first implementations demonstrating the usefulness of fuzzy querying features for a traditional database.

SQLf

In contrast to both Tahani's approach and FQUERY, which concentrated on the fuzzification of the conditions appearing in the WHERE clause of SQL's SELECT statement, the query language SQLf (Bosc & Pivert, 1992; Bosc & Pivert, 1995; Bosc & Pivert, 1997) allows the introduction of fuzzy terms into SQL wherever they make sense. Indeed, all the operations of relational algebra (implicitly or explicitly used in SQL's SELECT instruction) are redefined in such a way that the equivalences that hold in crisp SQL are preserved. Thus, the projection, selection, join, union, intersection, Cartesian product and
set difference operations are considered. Special attention is also paid to the division operation, which may be interpreted in different ways due to the many possible versions of the implication available in fuzzy logic (Bosc, Pivert & Rocacher, 2007). Other operations typical of SQL are also redefined, including the GROUP BY clause, the HAVING clause and the IN and NOT IN operators used along with sub-queries. A query in the SQLf language has the following syntax:

SELECT [n | t | n, t] <set of attributes>
FROM <set of relations>
WHERE <set of fuzzy predicates>

where the parameters n and t of the SELECT block limit the number of answers by using a quantitative condition (the best n answers) or a qualitative condition (the answers that satisfy the fuzzy predicates to a degree higher than t). For more details and more general sources on fuzzy querying, please refer to (Galindo, 2008).
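To make the preceding semantics concrete, the following sketch (not from the SQLf implementation; the trapezoidal label shapes and attribute names are assumptions, as Figures 8 and 9 define the chapter's actual ones) evaluates a Tahani-style fuzzy condition over crisp tuples and applies SQLf's two calibration modes:

    def trapezoid(a, b, c, d):
        """Trapezoidal membership function: 0 outside [a, d], 1 on [b, c]."""
        def mu(x):
            if x < a or x > d:
                return 0.0
            if b <= x <= c:
                return 1.0
            return (x - a) / (b - a) if x < b else (d - x) / (d - c)
        return mu

    young = trapezoid(15, 20, 30, 40)        # assumed shape
    reasonable = trapezoid(35, 45, 55, 65)   # salaries in K, assumed shape

    def matching_degree(t):
        """'Age = young AND Salary = reasonable', with min as the conjunction."""
        return min(young(t["Age"]), reasonable(t["Salary"]))

    emp_db = [{"Name": "#1", "Age": 26, "Salary": 50},
              {"Name": "#2", "Age": 38, "Salary": 62},
              {"Name": "#3", "Age": 55, "Salary": 45}]

    scored = sorted(((matching_degree(t), t) for t in emp_db),
                    key=lambda s: -s[0])
    best_n = scored[:2]                               # SELECT 2 ... (best n)
    above_t = [(d, t) for d, t in scored if d > 0.5]  # SELECT 0.5 ... (degree > t)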
AND Author.name = Papakonstantinou Yannis)
Obviously, this model of search is too complicated for ordinary users. Several methods aim at reducing this complexity by providing keyword search functionality over relational databases. With such functionality, a user can avoid writing an SQL query and instead submit a simple keyword query, "Hristidis Vagelis and Papakonstantinou Yannis", to the DBLP database. Examples include BANKS (Bhalotia, Hulgeri, Nakhe, Chakrabarti & Sudarshan, 2002), DBXplorer (Agrawal, Chaudhuri & Das, 2002) and DISCOVER (Hristidis & Papakonstantinou, 2002). The first system (BANKS) models the database as a graph and retrieves results by means of graph traversal, whereas the latter two (DBXplorer and DISCOVER) exploit the database schema to compute the results.

BANKS

BANKS (Bhalotia, Hulgeri, Nakhe, Chakrabarti & Sudarshan, 2002) views the database as a directed weighted graph, where each node represents a tuple and edges connect tuples that can be joined (e.g., according to primary-foreign key relationships). Node weights are inspired by prestige rankings such as PageRank: nodes with a large degree get a higher prestige. Edge weights reflect the importance of the relationship between two tuples or nodes; lower edge weights correspond to greater proximity or a stronger relationship between the involved tuples. At query time, BANKS employs a backward search strategy to search for results containing all keywords. A result is a tree of tuples (called a tuple tree), that is, a set of tuples which are joined through their primary-foreign key relationships and together contain all the keywords of the query. Figure 11 shows two tuple trees for the query q = "Hristidis Vagelis and Papakonstantinou Yannis" on the example database of Figure 10. More precisely, BANKS constructs paths starting from
Figure 11. Tuple trees for query q = Hristidis Vagelis and Papakonstantinou Yannis
each node (tuple) containing a query keyword (e.g., a2 and a4 in Figure 11) and executes Dijkstra's single-source shortest path algorithm from each of them. The idea is to find a common vertex (e.g., p2 and p4 in Figure 11) from which a forward path exists to at least one tuple corresponding to each different keyword in the query. Such paths define a rooted directed tree with the common vertex as the root and the keyword nodes as leaves, which constitutes one possible answer (tuple tree) for the given query. The answer trees are then ranked and displayed to the user. The ranking strategy of BANKS is to combine the node and edge weights of a tuple tree into a single score.

DBXplorer

DBXplorer (Agrawal, Chaudhuri & Das, 2002) models the relational schema as a graph, in which nodes map to database relations and edges represent relationships such as primary-foreign key dependencies. Given a query consisting of a set of keywords, DBXplorer first searches the symbol table to find the relations of the database that contain the query keywords. The symbol table serves as an inverted list; it is built by preprocessing the whole database content before any search. Then, DBXplorer uses the schema graph to find join trees that interconnect these relations. A join tree is a subtree of the schema graph that satisfies two conditions: first, the relation corresponding to each leaf node contains at least one query keyword; second, every query keyword is contained in a relation corresponding to some leaf node. Thus, if all the relations in a join tree are joined, the result may contain rows having all the keywords. For each join tree, a corresponding SQL query is created and executed. Finally, the results are ranked and displayed to the user. The scoring function that DBXplorer uses to rank results is very simple: the score of a result is the number of joins involved. The rationale behind this simple relevance-ranking scheme is that the more joins are needed to create a row containing the query keywords, the less clear it becomes whether the result is meaningful or helpful.

DISCOVER

DISCOVER (Hristidis & Papakonstantinou, 2002) also exploits the relational schema graph. It uses the concept of a candidate network to refer to the schema of a possible answer, i.e., a tree interconnecting the set of tuples that contain all the keywords, as in DBXplorer. The candidate network generation algorithm is also similar. However, DISCOVER can be regarded as an improvement over DBXplorer: it stores some temporary data to avoid re-executing joins that are common among candidate networks. Like DBXplorer, DISCOVER ranks results based on the number of joins of the corresponding candidate network.
Note that all the afore-mentioned approaches (BANKS, DBXplorer and DISCOVER) are useful to users who do not know SQL or are unfamiliar with the database schema. However, they present a semantic challenge: the metadata of the attributes and relations that appear in an SQL statement are lacking in keyword search. Furthermore, because a keyword may appear in any attribute of any relation, the result set may be large and include many answers users do not need.
2.4 Discussion
Database search basically involves two steps, namely query formulation and query evaluation (Figure 12). In the query formulation step, the user formulates her/his information need (or retrieval goal) in terms of an SQL query. The query evaluation step runs this query against the database and returns the data that match it exactly. Thus, formulating a query that accurately captures the user's retrieval goal is crucial for obtaining satisfactory results (i.e., results that are both useful and of manageable size for human analysis) from a database. However, this is challenging to achieve, for several reasons:
• users may have ill-defined retrieval goals, i.e., they do not really know what might be useful for them, merely an expectation that interesting data may be encountered if they use the database system (e.g., "I can't say what I want, but I will recognize it when I see it");
• even if users have well-defined retrieval goals, they may not know how to turn them into regular SQL queries (e.g., "I know what I want, but I don't know how to get it"). This may be because they are not familiar with the SQL language or with the database schema. It may also be due to an expressiveness limit of SQL: an information need is a mental image of the information the user wants to retrieve, and it is difficult to capture it with an unnatural, exact language such as SQL.
Note that even when users have well-defined retrieval goals and know how to formulate them in terms of SQL queries (e.g., "I know what I want and I know how to get it"), they may still not obtain what they need (e.g., "I am not satisfied"). This occurs if the database they wish to access contains no data that satisfy their retrieval goals. One can clearly see that the Many-Answers and the Empty-Answer problems are immediate consequences of the above problems. In fact, if the user has an ill-defined retrieval goal, her/his queries are often very broad, resulting in too many answers; otherwise, they are very specific and often return no answers.
In Figure 13, we propose a classification of the techniques presented and discussed in this chapter. This classification is based on which of the situations or problems mentioned above each technique is supposed to address. The first category (Group A in Figure 13) contains the query relaxation (Section 2.1.1) and similarity-based search (Section 2.1.2) techniques. These techniques address situations in which a user approaches the database with a query that exactly captures her/his retrieval goal but is delivered an empty result set; they allow the database system to retrieve results that closely (though not completely) match the user's retrieval goal. The second category (Group B in Figure 13) contains the preference-based (Section 2.3.1), fuzzy-based (Section 2.3.2) and keyword-based search (Section 2.3.3) techniques. These techniques provide human-oriented interfaces which allow users to formulate their retrieval goals in a more natural or intuitive manner. Note, however, that these techniques, while useful, do not help users clarify or refine their retrieval goals; they are therefore not designed for the problem of ill-defined retrieval goals. Finally, the third category (Group C in Figure 13) contains the automated ranking and clustering-based techniques. These techniques address situations in which a database user has an ill-defined retrieval goal. Automated ranking-based techniques (Section 2.2.1) first seek to clarify or approximate the retrieval goal. For this purpose, they use either the past behavior of the user (derived from available workloads) or relevance feedback from the user. Then, they compute a score for each answer, by means of a similarity measure, representing the extent to which the answer is relevant to the approximated retrieval goal. Finally, the user is provided with a ranked list, in descending order of relevance, of either all query results or only a top-k subset. The effectiveness of these approaches highly depends on their ability to accurately capture the user's retrieval goal, which is a tedious and time-consuming task. Note that such approaches also bring the disadvantage of match homogeneity, i.e., the user is often required to go through

Figure 13. A classification of advanced database query processing techniques
a large number of similar results before finding the next different result. In contrast, clustering-based techniques (Section 2.2.2) assist the user in clarifying or refining the retrieval goal instead of trying to learn it. They consist in dividing the query result set into homogeneous groups, allowing the user to select and explore the groups that are of interest to her/him. However, such techniques seek only to maximize some statistical property of the resulting clusters (such as the size and compactness of each cluster and the separation of clusters relative to each other), so there is no guarantee that the resulting clusters will match the meaningful groups a user may expect. Furthermore, these approaches are performed on the query results and consequently run at query time; the overhead time cost is thus a critical open issue for such a posteriori tasks. In the second part of this chapter, we focus on the Many-Answers problem, which is critical for very large databases and decision-support systems, and investigate a simple but useful strategy to handle it.
Now suppose that the user poses her/his query to a real estate agent. The estate agent often provides the user with better results than those obtained using traditional database systems, as she/he has a large amount of knowledge in the field of house buying. Let us briefly discuss two important features of the estate agent that contribute to her/his ability to meet the user's information requirements:
1. the first is her/his ability to organize, abstract, store and index her/his knowledge for future use. In fact, besides organizing her/his knowledge into groups (Miller, 1962; Mandler, 1967), the estate agent stores (in her/his memory) these groups as a knowledge representation, and such groups often become an automatic response for her/him. This interesting statement comes from the central tenet of semantic network theory (Quillian, 1968; Collins & Quillian, 1969), which argues that information is stored in human memory as a network of linked concepts (e.g., the concept house is related by the word is to the concept home). Moreover, psychological experiments (Miller, 1956; Simon, 1974; Halford, Baker, McCredden & Bain, 2005) show that humans can deal with a large amount of information, exceeding their memory limitations, when such information is supplemented with additional features such as a relationship to a larger group (or concept). Hence, cognitive theories assume that humans arrange their knowledge in a hierarchical structure that describes groups at varying levels of specificity. Furthermore, Ashcraft (Ashcraft, 1994) found that humans assign meaningful words from natural language to groups and retrieve information by those words (i.e., group representatives) rather than by blindly traversing all information;
2. the second feature is her/his ability to assist the user in refining and clarifying her/his information need as well as in making a decision. In fact, the estate agent establishes a dialog with the user during which she/he asks pertinent questions. Then, for each of the user's responses (i.e., each new information need), the estate agent uses her/his knowledge to provide the user with concise and comprehensive information. Such information is retrieved by matching the user's query words with the group representatives (Ashcraft, 1994) stored in her/his memory.
From these cognitive and technical observations, we propose a simple but useful strategy that, to some extent, emulates the interaction a user might have with a real estate agent, in order to alleviate the two previously mentioned problems (i.e., relevance and scalability). It can be summarized as follows (see Figure 14):
In a pre-processing step, we compute knowledge-based summaries of the queried data. The underlying summarization technique used in this chapter is the SAINTETIQ model (Raschia & Mouaddib, 2002; Saint-Paul, Raschia, & Mouaddib, 2005), a domain knowledge-based approach that enables summarization and classification of structured data stored in a database. SAINTETIQ first transforms the raw data into high-level representations (summaries) that fit the user's perception of the domain, by means of linguistic labels (e.g., cheap, reasonable, expensive, very expensive) defined over the data attribute domains and provided by a domain expert or even an end-user. It then applies a hierarchical clustering algorithm to these summaries to provide multi-resolution summaries (i.e., a summary hierarchy) that represent the database content at different abstraction levels. The summary hierarchy can be seen as an analogue of the estate agent's knowledge representation. At query time, we use the summary hierarchy of the data, instead of the data itself, to quickly provide the user with concise, useful and structured answers as a starting point for an online analysis. This goal is achieved thanks to the Explore-Select Algorithm (ESA), which extracts query-relevant entries from the summary hierarchy. Each answer item describes a subset of the result set in a human-readable form using linguistic labels. Moreover, the answers to a given query are nodes of the summary hierarchy, and every subtree rooted at an answer offers a guided tour of a data subset to the user. The user then navigates this tree in a top-down fashion, exploring the summaries of interest while ignoring the rest. Note that the database is accessed only when the user requests to download (Upload) the original data that a potentially relevant summary describes. Hence, this framework is intended to help the user iteratively refine her/his information need in the same way as the estate agent does.
However, since the summary hierarchy is independent of the query, the set of starting-point answers could be large, and consequently the dissimilarity between items is susceptible to skew. This occurs when the summary hierarchy is not perfectly adapted to the user query. To tackle this problem, we first propose a straightforward approach (ESA+SEQ) that uses the clustering algorithm of SAINTETIQ to optimize the high-level answers. The optimization requires post-processing and therefore incurs an overhead time cost. Thus, we finally develop an efficient and effective algorithm (ESRA, the Explore-Select-Rearrange Algorithm) that rearranges answers based on the hierarchical structure of the pre-computed summary hierarchy, such that no post-processing task (apart from the query evaluation itself) has to be performed at query time. The rest of this section is organized as follows. First, we present the SAINTETIQ model and its properties, and we illustrate the process with a toy example. Then, in Section 3.2, we detail the use of SAINTETIQ outputs in query processing and describe the formulation of queries and the retrieval of clusters. Thereafter, we discuss in Section 3.3 how such results help in facing the many-answers problem. The algorithm that addresses the problem of dissimilarity (discrimination) between the starting-point answers by rearranging them is presented in Section 3.4. Section 3.5 discusses an extension of the above process that allows every user to use her/his own vocabulary when querying the database. An experimental study using real data is presented in Section 3.6.
For the sake of simplicity, we have only reported the linguistic labels (intent) and the row IDs (extent) that point to the tuples described by those linguistic labels.
where L is the number of cells of the output hierarchy and d its average width. In the above formula, the coefficient kSEQ corresponds to the set of operations performed to find the best learning operator (create, merge or split) to apply at each level of the hierarchy, whereas log L estimates the average depth of this hierarchy, traversed for each of the L incorporated cells. The SAINTETIQ model, besides being a grid-based clustering method, has many other advantages that are relevant to the objective of this chapter. First, SAINTETIQ uses prior domain knowledge (the Knowledge Base) to guide the clustering process and to provide clusters that fit the user's perception of the domain. This distinctive feature differentiates it from other grid-based clustering techniques, which attempt only to maximize some statistical property of the clusters, so that there is no guarantee that the resulting clusters will match the meaningful groups a user may expect. Second, the flexibility in the vocabulary definition of the KB leads to clustering schemas with two useful properties: (1) the clusters have soft boundaries, in the sense that each record belongs to each cluster to some degree, thus avoiding the undesirable threshold effects usually produced by crisp (non-fuzzy) boundaries; (2) the clusters are presented in a user-friendly language (i.e., linguistic labels), so the user can determine at a glance whether a cluster's content is of interest. Finally, SAINTETIQ applies a conceptual clustering algorithm that partitions the incoming data in an incremental and dynamic way. Thus, changes in the database are reflected through an incremental maintenance of the complete hierarchy (Saint-Paul, Raschia, & Mouaddib, 2005). Of course, for a new application, the end-user or the expert has to be consulted to create the linguistic labels as well as the fuzzy membership functions. However, it is worth noticing that, once such a knowledge base is defined, the system does not require any further setting. Furthermore, the issue of estimating fuzzy membership functions has been intensively studied in the fuzzy set literature (Galindo, 2008), and various methods based on data distribution and statistics exist to assist the user in designing trapezoidal fuzzy membership functions.
suburb (su.)} are sets of linguistic labels defined on the attributes Price, Size and Location, respectively. Figure 17 shows the summary hierarchy HR produced by SAINTETIQ when run on R.
c. no decision can be made (i.e., ∃A ∈ X such that z.A − QA ≠ ∅). There is at least one attribute A for which z exhibits one or more linguistic labels besides those strictly required (i.e., those in QA). The presence of the required features in each attribute of z suggests, but does not guarantee, that results may be found in the subtree rooted at z. Exploring the subtree is necessary to retrieve the possible results: each branch eventually ends up in a situation categorized by case a or case b. Thus, at worst at the leaf level, an exploration leads to accepting or rejecting summaries; the indecision is always resolved.

The situations stated above reflect a global view of the matching of a summary z with a query Q.
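A minimal sketch of this three-way matching test (the exact-match and no-correspondence cases, a and b, are inferred from the ESA walkthrough below, and the attribute-to-label-set dictionaries are an assumed encoding):

    def corr(z_intent, query):
        """Match a summary intent against a query (cases a, b, c in the text).
        z_intent and query map each attribute in X to a set of linguistic labels."""
        decision = "exact"
        for attr, required in query.items():
            present = z_intent.get(attr, set())
            if not (present & required):
                return "no correspondence"   # case a: prune this subtree
            if present - required:
                decision = "indecisive"      # case c: extra labels, must explore
        return decision                      # case b: exact match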
The ESA algorithm (see Algorithm 1) is based on a depth-first search and relies on a property of the hierarchy: the generalization step in the SAINTETIQ model guarantees that any label that exists in a node of the tree also exists in each of its parent nodes. Conversely, a label is absent from a summary's intent if and only if it is absent from all the subnodes of this summary. This property of the hierarchy permits cutting a branch as soon as it is known that no result will be found there. Depending on the query, only part of the hierarchy is explored. Thus, the results of Q1 (Example 3.1), when querying the hierarchy shown in Figure 17, look like Table 9. In this case, the ESA algorithm first confronts z (the root) with Q1. Since no decision can be made, Q1 is then confronted with z0 and z1, the children of z. The subtree rooted at z0 is ignored because there is no correspondence between Q1 and z0. Finally, z1 is returned because it exactly matches Q1. The process thus tests only 30% of the whole hierarchy. Note that the set of records Rz1 summarized by z1 can be returned if the user requests it (SHOWTUPLES option). This is done by simply transforming the intent of z1 into a query qz1 and sending it as a usual query to the database system. The WHERE clause of qz1 is generated by transforming the linguistic labels (fuzzy sets) contained in the intent of z1 into crisp ones. In other words, each linguistic label l on an attribute A is replaced by its support, i.e., the set of all values in the domain DA of A that belong to l with non-zero membership. The crisp criteria obtained on each of the summary's attributes are then connected with the OR operator, and the summary's attributes are connected with the AND operator, to generate the WHERE clause. Performing a SHOWTUPLES operation thus takes advantage of the optimization mechanisms that exist in the database system. Furthermore, the tuples covered by z1 can also be sorted according to their satisfaction degrees with respect to the user's query, using an overall satisfaction degree. We assign to each tuple the degree to which it satisfies the fuzzy criteria of the query.
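A minimal sketch of this WHERE-clause generation (the label supports are given here as hand-written crisp conditions; the names and values are hypothetical):

    def where_clause(intent, supports):
        """Build the crisp WHERE clause of q_z from a summary's intent.
        intent: attribute -> list of linguistic labels;
        supports: (attribute, label) -> crisp SQL condition (the label's support)."""
        per_attr = []
        for attr, labels in intent.items():
            ors = " OR ".join(supports[(attr, l)] for l in labels)
            per_attr.append("(" + ors + ")")
        return " AND ".join(per_attr)

    # Hypothetical supports for part of z1's intent over HouseDB-like data:
    supports = {("Price", "cheap"): "Price < 120",
                ("Price", "reasonable"): "Price BETWEEN 100 AND 250",
                ("Size", "medium"): "Size BETWEEN 80 AND 150"}
    print(where_clause({"Price": ["cheap", "reasonable"],
                        "Size": ["medium"]}, supports))
    # (Price < 120 OR Price BETWEEN 100 AND 250) AND (Size BETWEEN 80 AND 150)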
Algorithm 1.
Function Explore-Select(z, Q)
  Lres ← ∅
  if Corr(z, Q) = indecisive then
    for all child nodes zchild of z do
      Lres ← Lres + Explore-Select(zchild, Q)
    end for
  else if Corr(z, Q) = exact then
    Add(z, Lres)
  end if
  return Lres
Usually, the minimum and the maximum functions stand for the conjunctive and disjunctive connectives, though there are many proposals in the literature for defining aggregation connectives (Klement, Mesiar & Pap, 2000). The ESA algorithm is particularly efficient. In the worst case (a complete exploration of the hierarchy), its time complexity is given by:

TESA = ε · (L − 1) / (d − 1) ∈ O(L)

where L is the number of leaves (cells) of the queried summary hierarchy, d its average width, and the coefficient ε corresponds to the time required for matching one summary of the hierarchy against the query; (L − 1)/(d − 1) gives an estimate of the number of nodes in the summary hierarchy. In the following subsection, we discuss how ESA's answers help in facing the many-answers problem.
Table 9. Q1 results
Id_Sum   Price                          Size            Location
z1       cheap, reasonable, expensive   medium, large   suburb
Price, and consequently a new broad query with less selective conditions may be submitted, or the task may be abandoned (we denote this by the IGNORE option);
2. summary z exactly fits the user's need. This means that for each A ∈ Y, all the linguistic labels in z.A are relevant to the user. Assume that the user is interested in cheap, reasonable as well as expensive houses: then all the tuples contained in Rz1 are relevant to her/him, and she/he uses the SHOWTUPLES option to access the tuples stored in Rz1;
3. summary z partially fits the user's need. In this case, there is at least one attribute A ∈ Y for which z.A exhibits too many linguistic labels w.r.t. the user's requirement. For instance, the set Rz1 only partially matches the needs of a user who is looking for cheap as well as reasonable houses, because Rz1 also contains tuples that are mapped to expensive on the attribute Price. In this case, a new query with more selective conditions (e.g., Price IN {cheap OR reasonable}) may be submitted, or a new clustering schema of the set Rz1, one that allows the dataset to be examined more precisely, is required. Since z roots a subtree of the summary hierarchy, we present the children of z to the user (SHOWCAT option). Each child of z represents only a portion of the tuples in Rz1 and gives a more precise representation of the tuples it contains. For example, {z10, z11} is a partitioning of Rz1 into two subsets Rz10 and Rz11; z10 exactly fits the user's needs. Since the entire tree is pre-computed, no clustering at all has to be performed at feedback time.

More generally, a set of summaries or clusters S = {z1, ..., zm} is presented to the user as a clustering schema of the query result tset(Q). The three options IGNORE (case 1), SHOWTUPLES (case 2) and SHOWCAT (case 3) give the user the ability to browse the S structure (generally a set of rooted subtrees), exploring different datasets within the query results and looking for potentially interesting pieces of information. Indeed, the user may navigate through S using the basic exploration model given below:
i. start the exploration by examining the intensional description of zi ∈ S (initially i = 0);
ii. in case 1, ignore zi and examine the next cluster in S, i.e., zi+1;
iii. in case 2, navigate through the tuples of Rzi to extract every relevant tuple and, thereafter, go ahead and examine zi+1;
iv. in case 3, navigate through the children of zi, i.e., repeat from step (i) with S now being the set of children of zi. More precisely, examine the intensional description of each child of zi, starting from the first one, and recursively decide to ignore it or to examine it (SHOWTUPLES to extract relevant tuples, or SHOWCAT for further expansion). At the end of the exploration of the children of zi, go ahead and examine zi+1.

For instance, suppose a user is looking for medium, large as well as expensive houses in the suburb but issues the broad query Q1 (Example 3.1): find medium or large houses in the suburb. The set of summaries S presented to that user is {z1}, where z1 is a subtree (Figure 18) of the pre-computed summary hierarchy shown in Figure 17. In this situation, the user can explore the subtree rooted at z1 as follows to reach the relevant tuples: analyze the intent of z1 and explore it using the SHOWCAT option; analyze the intent of z10 and ignore it; analyze the intent of z11 and use the SHOWTUPLES option to navigate through the tuples in Rz11 (i.e., t25-t30) and identify each relevant tuple. Note that when the set S = {z} is a singleton, i.e., z is a single node of the pre-computed clustering tree, its exploration is straightforward. Indeed, given a summary of the tree rooted at z that the user wishes to examine more closely (SHOWCAT option), its children are well separated, since SAINTETIQ is designed to discover summaries (clusters) that locally optimize the objective function U. Furthermore, the number of clusters presented to the user at any time is small; at most it equals the maximum width of the pre-computed tree. However, since the summary hierarchy is independent of the query, the set of starting-point answers S could be large, and consequently the dissimilarity between summaries is susceptible to skew. This occurs when the summary hierarchy is not perfectly adapted to the user query. In this situation, it is hard for the user to separate the interesting summaries from the uninteresting ones, leading to potential decision paralysis and a waste of time and effort. In the next subsection, we propose an original algorithm that rearranges query results to tackle this problem.
of S. Then, we present to the user the top-level partition of the result tree. We will refer to this approach as ESA+SEQ (i.e., a search step followed by a summarization step) in the remainder of this chapter. Size reduction and discrimination between the items of S are clearly achieved, but at the expense of an overhead computational cost. Indeed, the search step time complexity TESA is in O(L), where L is the number of leaves of the queried summary hierarchy (see Section 3.2.3), while the summarization step time complexity TSEQ is in O(L′ log L′), where L′ is the number of cells populated by answer records (see Section 3.1.2). Therefore, the global time complexity TESA+SEQ of the ESA+SEQ approach is in O(L′ log L′), with L′ close to L since the query is expected to be broad. Thus, ESA+SEQ does not fit the querying process requirement (see the experimental results in Section 3.6), namely, to quickly provide the user with concise and structured answers. To tackle this problem, we propose an algorithm coined the Explore-Select-Rearrange Algorithm (ESRA) that rearranges answers, based on the hierarchical structure of the queried summary hierarchy, before returning them to the user. The main idea of this approach is rather simple. It starts from the summary partition S (a clustering schema of the query results) and produces a sequence of clustering schemas with a decreasing number of clusters at each step. The clustering schema produced at each step results from the previous one by merging the closest clusters into a single one. Similar clusters are identified thanks to the hierarchical structure of the pre-computed summary hierarchy: intuitively, summaries which are closely related have a common ancestor low in the hierarchy, whereas the common ancestor of unrelated summaries is near the root. The process stops when it reaches a single hyperrectangle (the root z*). Then, we present to the user the top-level partition (i.e., the children of z*) of the obtained tree instead of S. For instance, when this process is performed on the set of summaries S = {z00, z01, z1000, z101, z11} shown in Figure 19, the sequence of clustering schemas in Table 10 is produced. The hierarchy H obtained from the set of query results S is shown in Figure 19 (right). Thus, the two-cluster partition {z00+z01, z1000+z101+z11} is presented to the user instead of S. This partition has a small size and defines well-separated clusters. Indeed, all agglomerative methods, including the above rearranging process, enjoy a monotonicity property (Hastie, Tibshirani & Friedman, 2001): the dissimilarity between the merged clusters is monotonically increasing with the level. In the above example, this means that the dissimilarity value of the partition {z00+z01, z1000+z101+z11} is greater than the dissimilarity value of the partition {z00, z01, z1000, z101, z11}.
Table 10.
Step   Clustering schema
1      z00, z01, z1000, z101, z11
2      z00+z01, z1000+z101, z11
3      z00+z01, z1000+z101+z11
4      z00+z01+z1000+z101+z11
Algorithm 2 describes ESRA. It is a modified version of ESA (Algorithm 1) with the following new assumptions:
• it returns a single summary (z*) rather than a list of summaries;
• the function AddChild appends a node to the caller's children;
• the function NumberOfChildren returns the number of the caller's children;
• the function UniqueChild returns the unique child of the caller;
• the function BuildIntent builds the caller's intent (hyperrectangle) from the intents of its children;
• Zres and Z are local variables of type summary.
The cost of ESRA is only a small constant factor larger than that of ESA. In fact, the rearranging process is done while the queried summary hierarchy is being scanned, which means that no post-processing task (apart from the query evaluation itself) has to be performed at query time. More precisely, the time complexity of the ESRA algorithm is of the same order of magnitude (i.e., O(L)) as that of the ESA algorithm:

TESRA = (ε + ε′) · (L − 1) / (d − 1) ∈ O(L)

where the coefficient ε′ is the time cost of the additional operations (AddChild, UniqueChild and BuildIntent), L is the number of leaves (cells) of the queried summary hierarchy, and d its average width.
Algorithm 2.
Function Explore-Select-Rearrange(z, Q)
  Zres ← Null
  Z ← Null
  if Corr(z, Q) = indecisive then
    for all child nodes zchild of z do
      Z ← Explore-Select-Rearrange(zchild, Q)
      if Z ≠ Null then
        Zres.AddChild(Z)
      end if
    end for
    if Zres.NumberOfChildren() > 1 then
      Zres.BuildIntent()
    else
      Zres ← Zres.UniqueChild()
    end if
  else if Corr(z, Q) = exact then
    Zres ← z
  end if
  return Zres
groups that share the same (or a similar) vocabulary as theirs. In addition, they have to be familiar with their group's linguistic labels before using them accurately. As a result, this option is not much more convenient than using ad-hoc linguistic labels predetermined by a domain expert; moreover, it only transposes the problem of maintaining user-specific summaries to group-specific ones. The second alternative, investigated in the following, consists in building only one SAINTETIQ summary hierarchy using an ad-hoc vocabulary, and querying it with user-specific linguistic labels. Since the vocabulary of the user's query Q is different from the one used in the summaries, we first use a fuzzy set-based mapping operation to translate the predicates of Q from the user-specific vocabulary to the summary language. It consists in defining an accepting similarity threshold δ to decide whether the mapping of two fuzzy labels is valid. In other words, the user's label lu is rewritten as a summary label ls if and only if the similarity of lu to ls, sim(lu, ls), is greater than or equal to δ. There are many proposals in the literature for defining sim(lu, ls) (e.g., the degree of satisfiability (Bouchon-Meunier, Rifqi & Bothorel, 1996)). Then, the ESRA algorithm is performed on the rewritten version of Q. Finally, the results are sorted and filtered according to their similarity degrees to the initial user query Q.
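A minimal sketch of this mapping step (the discretized membership vectors and the sup-min similarity are assumptions; the chapter cites the degree of satisfiability as one concrete choice):

    def satisfiability(mu_u, mu_s):
        """One possible similarity: sup-min overlap of two fuzzy sets, each
        given as a list of membership degrees over a shared domain grid."""
        return max(min(a, b) for a, b in zip(mu_u, mu_s))

    def rewrite_query(user_labels, summary_labels, sim=satisfiability, delta=0.7):
        """Rewrite each user-vocabulary label l_u with the summary-vocabulary
        labels l_s whose similarity sim(l_u, l_s) reaches the threshold delta.
        Both arguments are dicts: label_name -> membership vector."""
        return {lu: [ls for ls, mu_s in summary_labels.items()
                     if sim(mu_u, mu_s) >= delta]
                for lu, mu_u in user_labels.items()}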
today in the context of relational databases. Nonetheless, in this subsection we discuss the efficiency and the effectiveness of the ESA, ESA+SEQ and ESRA algorithms on a real database.
3.6.2 Results
All experiments reported in this section were conducted on a workload composed of 150 queries with a random number of selection predicates over all attributes (i.e., each query has between 1 and 3 required features on 1, 2, 3 or 4 attributes).

Quantitative Analysis

The CIC dataset is summarized by the SAINTETIQ system as described in Section 3.1. The dataset, consisting of 33735 records, yields a summary tree with 13263 nodes, 6701 leaves or cells, a maximum depth of 16, an average depth of 10.177, a maximum width of 14 and an average width of 2.921. The data distribution in the summary tree reveals a 0.6% (6701/1036800) occupation rate. From the analysis of the theoretical complexities, we claim that ESA and ESRA are much faster than the post-clustering approach ESA+SEQ. That is the main result of Figure 20, which shows how performance evolves with the number of cells populated by query answer records. Furthermore, we plot the number of summary nodes visited (#VisitedNodes) per query (right scale) and, finally, the normalized ESRA time cost (tN.ESRA), which evaluates the performance of ESRA regardless of how well the query fits the pre-clustering summary hierarchy. tN.ESRA is computed as follows:

tN.ESRA = tESRA × (#TreeNodes / #VisitedNodes)
As one can observe, Figure 20 verifies experimentally that ESA+SEQ is quasi-linear (O(L log L)) in the number of cells L, whereas ESA, ESRA and N.ESRA are linear (O(L)). Besides, the time cost incurred by rearranging query results (i.e., tESRA − tESA) is insignificant compared to the search cost (i.e., tESA). For instance, for L = 1006, tESA and tESRA are 0.235 sec and 0.287 sec, respectively. Thus, the ESRA algorithm is able to drastically reduce the time cost of clustering query results.

Qualitative Analysis

Due to the difficulty of conducting a large-scale real-life user study (endnote xix), we discuss the effectiveness of the ESRA algorithm based on structural properties of the results provided to the end-user. It is worth noticing
that the end-user benefit is proportional to the number of items (clusters or tuples) the user needs to examine at any time, as well as to the dissimilarity between these items.

We define the Compression Rate (CR) as the ratio of the number of clusters (summaries) returned as a starting point for an online exploration over the total number of cells covered by such summaries. Note that CR = 1 means no compression at all, whereas smaller values represent higher compression. As expected, Figure 21 shows that the CR values of ESRA and ESA+SEQ are quite similar and much smaller than that of ESA. Thus, size reduction is clearly achieved by the ESRA algorithm. We can also see that the dissimilarity (Figure 22) of the first partitioning that ESRA presents to the end-user is greater than that of ESA and is of the same order of magnitude as that provided by ESA+SEQ. It means that ESRA significantly improves discrimination between items when compared against ESA and is as effective as the post-clustering approach ESA+SEQ. Furthermore, the dissimilarity of the ESA result is quite similar to that of the most specialized partitioning, i.e., the set of cells. Thus the rearranging process is highly required to provide the end-user with well-founded clusters.

Now assume that the user decides to explore a summary z returned by ESRA. We want to examine the number of hops NoH (i.e., the number of SHOWTUPLES/SHOWCAT operations) the user might employ to reach a relevant tuple. NoH ranges from 1 up to dz + 1, where dz is the height of the tree rooted by z (i.e., subtree Hz). The best case (NoH = 1) occurs when z exactly fits the user's need, whereas the worst case (NoH = dz + 1) occurs when the relevant information is reached by following a path of maximal length in Hz. Note that these two scenarios are at opposite extremes of the spectrum of possible situations: the general case (1 ≤ NoH ≤ dz + 1) is that the user's need is successfully served by a node at height h such that 0 ≤ h ≤ dz.

In the following experiment (Figure 23), one considers a user query q with selectivity λ, i.e., the number of tuples returned by q divided by the total number of tuples in the database (33735). We look for all the possible summaries (subtrees) in the pre-computed summary hierarchy that can be returned
as the result of q and, for each one, we compute its maximum depth, its average depth and the average length of all paths emanating from that node (summary). Then, we pick out the highest (maximum) value observed for each of these measures. The three values obtained, for each value of λ (λ ∈ {0.2%, 0.4%, …, 2%}), evaluate respectively:

1. the worst number of hops required to reach the deepest leaf node (Worst NoH2C) containing relevant data;
2. the average number of hops needed to reach any leaf node (Average NoH2C) containing relevant data (i.e., not necessarily the deepest one);
3. the average number of hops required to reach any node (Average NoH2N) containing relevant data (i.e., not necessarily a leaf node).
Figure 23 shows that the Worst NoH2C and the Average NoH2C are relatively high, but bounded respectively by the maximum (16) and the average (10.177) depth of the pre-computed tree. It is worth noticing that, in a real-life situation, the user often finds out that her/his need has been successfully served by an inner node of the tree rooted by z. Thus, the Average NoH2N is better adapted to evaluating the effectiveness of our approach (ESRA). As one can observe, the Average NoH2N is quite small given the number of tuples in the result set of q. For instance, NoH2N = 3.68 is the number of hops the user takes to reach relevant information within a set of 674 tuples (λ = 2%). These experimental results validate the claim of this work, that is to say, the ESRA algorithm is very efficient (Figure 20), provides useful clusters of query results (Figure 21 and Figure 22) and, consequently, makes the exploration process more effective (Figure 23).
4. CONCLUSION
Interactive and exploratory data retrieval capabilities are increasingly desirable in database systems. Indeed, regular blind queries often retrieve too many answers, and users then need to spend time sifting and sorting through this information to find relevant data. In this chapter, we proposed an efficient and effective algorithm, coined the Explore-Select-Rearrange Algorithm (ESRA), that uses SAINTETIQ database summaries to quickly provide users with concise, useful and structured representations of their query results. Given a user query, ESRA (i) explores the summary hierarchy (computed offline using SAINTETIQ) of the whole data stored in the database;
(ii) selects the most relevant summaries to that query; (iii) rearranges them in a hierarchical structure based on the structure of the pre-computed summary hierarchy; and (iv) returns the resulting hierarchy to the user. Each node (or summary) of the resulting hierarchy describes a subset of the result set in a user-friendly form using linguistic labels. The user then navigates through this hierarchy in a top-down fashion, exploring the summaries of interest while ignoring the rest. Experimental results showed that the ESRA algorithm is efficient and provides well-formed (tight and clearly separated) and well-organized clusters of query results. Thus, it is very helpful to users who have vague and poorly defined retrieval goals or who are interested in browsing through a set of items to explore what choices are available.
REFERENCES
Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., & Verkamo, A. I. (1996). Fast discovery of association rules. Advances in Knowledge Discovery and Data Mining, 307–328.
Agrawal, R., & Wimmers, E. L. (2000). A framework for expressing and combining preferences. Proceedings of the ACM SIGMOD International Conference on Management of Data.
Agrawal, S., Chaudhuri, S., & Das, G. (2002). DBXplorer: Enabling keyword search over relational databases. Proceedings of the ACM SIGMOD International Conference on Management of Data, (p. 627).
Allen, D. (1992). Managing infoglut. BYTE Magazine, 17(6), 16.
Ashcraft, M. (1994). Human memory and cognition. Addison-Wesley.
Bamba, B., Roy, P., & Mohania, M. (2005). OSQR: Overlapping clustering of query results. Proceedings of the 14th ACM International Conference on Information and Knowledge Management, (pp. 239–240).
Bergman, M. K. (2001). The deep Web: Surfacing hidden value. Journal of Electronic Publishing, 7(1). doi:10.3998/3336451.0007.104
Berkhin, P. (2006). A survey of clustering data mining techniques. In Kogan, J., Nicholas, C., & Teboulle, M. (Eds.), Grouping multidimensional data: Recent advances in clustering (pp. 25–71). doi:10.1007/3-540-28349-8_2
Bhalotia, G., Hulgeri, A., Nakhe, C., Chakrabarti, S., & Sudarshan, S. (2002). Keyword searching and browsing in databases using BANKS. Proceedings of the International Conference on Data Engineering, (pp. 431–440).
Bharat, K., & Henzinger, M. R. (1998). Improved algorithms for topic distillation in a hyperlinked environment. Proceedings of the International Conference on Research and Development in Information Retrieval, (pp. 104–111).
Borodin, A., Roberts, G. O., Rosenthal, J. S., & Tsaparas, P. (2001). Finding authorities and hubs from link structures on the World Wide Web. Proceedings of the International Conference on World Wide Web, (pp. 415–429).
Bosc, P., & Pivert, O. (1992). Fuzzy querying in conventional databases. In Fuzzy logic for the management of uncertainty (pp. 645–671).
Bosc, P., & Pivert, O. (1995). SQLf: A relational database language for fuzzy querying. IEEE Transactions on Fuzzy Systems, 3(1), 1–17. doi:10.1109/91.366566
Bosc, P., & Pivert, O. (1997). Fuzzy queries against regular and fuzzy databases. In Flexible query answering systems (pp. 187–208).
Bosc, P., Pivert, O., & Rocacher, D. (2007). About quotient and division of crisp and fuzzy relations.
Bouchon-Meunier, B., Rifqi, M., & Bothorel, S. (1996). Towards general measures of comparison of objects. Fuzzy Sets and Systems, 84(2), 143–153. doi:10.1016/0165-0114(96)00067-X
Böhm, C., Berchtold, S., & Keim, D. A. (2001). Searching in high-dimensional spaces: Index structures for improving the performance of multimedia databases. ACM Computing Surveys, 33(3), 322–373. doi:10.1145/502807.502809
Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1-7), 107–117.
Callan, J. P., Croft, W. B., & Harding, S. M. (1992). The INQUERY retrieval system. Proceedings of the Third International Conference on Database and Expert Systems Applications, (pp. 78–83).
Chakrabarti, K., Chaudhuri, S., & Hwang, S. W. (2004). Automatic categorization of query results. Proceedings of the ACM SIGMOD International Conference on Management of Data, (pp. 755–766).
Chang, K. C. C., He, B., Li, C., Patel, M., & Zhang, Z. (2004). Structured databases on the Web: Observations and implications. SIGMOD Record, 33(3), 61–70.
Chaudhuri, S., & Das, G. (2003). Automated ranking of database query results. CIDR.
Chaudhuri, S., Das, G., Hristidis, V., & Weikum, G. (2004). Probabilistic ranking of database query results. Proceedings of the Thirtieth International Conference on Very Large Data Bases, (pp. 888–899).
Cheng, D., Kannan, R., Vempala, S., & Wang, G. (2006). A divide-and-merge methodology for clustering. ACM Transactions on Database Systems, 31(4), 1499–1525. doi:10.1145/1189769.1189779
Chomicki, J. (2002). Querying with intrinsic preferences. Proceedings of the 8th International Conference on Extending Database Technology, (pp. 34–51).
Chomicki, J. (2003). Preference formulas in relational queries. ACM Transactions on Database Systems, 28(4), 427–466. doi:10.1145/958942.958946
Chu, W. W., Yang, H., Chiang, K., Minock, M., Chow, G., & Larson, L. (1996). CoBase: A scalable and extensible cooperative information system. Journal of Intelligent Information Systems, 6(2-3), 223–259. doi:10.1007/BF00122129
Collins, A., & Quillian, M. (1969). Retrieval time from semantic memory. Journal of Verbal Learning and Verbal Behavior, 8(2), 240–247.
Croft, W. B. (1980). A model of cluster searching based on classification. Information Systems, 5, 189–195.
Duda, R. O., Hart, P. E., & Stork, D. G. (2000). Pattern classification (2nd ed.). Wiley-Interscience.
Fagin, R., Lotem, A., & Naor, M. (2001). Optimal aggregation algorithms for middleware. Proceedings of the Twentieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, (pp. 102–113).
Ferragina, P., & Gulli, A. (2005). A personalized search engine based on Web-snippet hierarchical clustering. Special Interest Tracks and Posters of the 14th International Conference on World Wide Web, (pp. 801–810).
Fisher, D. H. (1987). Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2(2), 139–172. doi:10.1007/BF00114265
Gaasterland, T., Godfrey, P., & Minker, J. (1994). An overview of cooperative answering. In Nonstandard queries and nonstandard answers: Studies in logic and computation (pp. 1–40). Oxford, UK: Oxford University Press.
Galindo, J. (2008). Handbook of research on fuzzy information processing in databases. Hershey, PA: Information Science Reference.
Godfrey, P. (1997). Minimization in cooperative response to failing database queries. International Journal of Cooperative Information Systems, 6(2), 95–149.
Güntzer, U., Balke, W.-T., & Kießling, W. (2000). Optimizing multi-feature queries for image databases. Proceedings of the 26th International Conference on Very Large Data Bases, (pp. 419–428).
Halford, G. S., Baker, R., McCredden, J. E., & Bain, J. D. (2005). How many variables can humans process? Psychological Science, 16, 70–76. doi:10.1111/j.0956-7976.2005.00782.x
Hartigan, J. A., & Wong, M. A. (1979). A K-means clustering algorithm. Applied Statistics, 28, 100–108. doi:10.2307/2346830
Hastie, T., Tibshirani, R., & Friedman, J. H. (2001). The elements of statistical learning. Springer.
Hearst, M. A., & Pedersen, J. O. (1996). Reexamining the cluster hypothesis: Scatter/Gather on retrieval results. Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, (pp. 76–84).
Hristidis, V., & Papakonstantinou, Y. (2002). DISCOVER: Keyword search in relational databases. Proceedings of the International Conference on Very Large Data Bases, (pp. 670–681).
Ichikawa, T., & Hirakawa, M. (1986). ARES: A relational database with the capability of performing flexible interpretation of queries. IEEE Transactions on Software Engineering, 12(5), 624–634.
Ilyas, I. F., Beskales, G., & Soliman, M. A. (2008). A survey of top-k query processing techniques in relational database systems. ACM Computing Surveys, 40(4), 1–58. doi:10.1145/1391729.1391730
Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: A review. ACM Computing Surveys, 31(3), 264–323. doi:10.1145/331499.331504
Jardine, N., & van Rijsbergen, C. J. (1971). The use of hierarchical clustering in information retrieval. Information Storage and Retrieval, 7(5), 217–240.
Kaplan, S. J. (1983). Cooperative responses from a portable natural language database query system. In Brady, M., & Berwick, R. C. (Eds.), Computational models of discourse (pp. 167–208). Cambridge, MA: MIT Press.
Kapoor, N., Das, G., Hristidis, V., Sudarshan, S., & Weikum, G. (2007). STAR: A system for tuple and attribute ranking of query answers. Proceedings of the International Conference on Data Engineering, (pp. 1483–1484).
Beyer, K. S., Goldstein, J., Ramakrishnan, R., & Shaft, U. (1999). When is "nearest neighbor" meaningful? Proceedings of the International Conference on Database Theory, (pp. 217–235).
Kießling, W. (2002). Foundations of preferences in database systems. Proceedings of the 28th International Conference on Very Large Data Bases, (pp. 311–322).
Kleinberg, J. M. (1999). Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5), 604–632. doi:10.1145/324133.324140
Klement, E. P., Mesiar, R., & Pap, E. (2000). Triangular norms. Kluwer Academic Publishers.
Kostadinov, D., Bouzeghoub, M., & Lopes, S. (2007). Query rewriting based on user's profile knowledge. Bases de Données Avancées (BDA).
Koutrika, G., & Ioannidis, Y. (2005). A unified user profile framework for query disambiguation and personalization. Proceedings of the Workshop on New Technologies for Personalized Information Access, (pp. 44–53).
Lacroix, M., & Lavency, P. (1987). Preferences: Putting more knowledge into queries. Proceedings of the 13th International Conference on Very Large Data Bases, (pp. 217–225).
Lewis, D. (1996). Dying for information: An investigation into the effects of information overload in the USA and worldwide. Reuters Limited.
Li, C., Wang, M., Lim, L., Wang, H., & Chang, K. C. C. (2007). Supporting ranking and clustering as generalized order-by and group-by. Proceedings of the ACM SIGMOD International Conference on Management of Data, (pp. 127–138).
MacArthur, S. D., Brodley, C. E., Kak, A. C., & Broderick, L. S. (2002). Interactive content-based image retrieval using relevance feedback. Computer Vision and Image Understanding, 88(2), 55–75. doi:10.1006/cviu.2002.0977
Maimon, O., & Rokach, L. (2005). The data mining and knowledge discovery handbook. Springer.
Mandler, G. (1967). Organization in memory. In Spence, K. W., & Spence, J. T. (Eds.), The psychology of learning and motivation (pp. 327–372). Academic Press.
Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval. Cambridge University Press.
Miller, G. A. (1956). The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological Review, 63, 81–97. doi:10.1037/h0043158
Miller, G. A. (1962). Information input overload. In Yovits, M. C., Jacobi, G. T., & Goldstein, G. D. (Eds.), Conference on Self-Organizing Systems. Spartan Books.
Zloof, M. M. (1975). Query-by-Example: The invocation and definition of tables and forms. Proceedings of the International Conference on Very Large Data Bases, (pp. 1–24).
Motro, A. (1986). SEAVE: A mechanism for verifying user presuppositions in query systems. ACM Transactions on Information Systems, 4(4), 312–330. doi:10.1145/9760.9762
Motro, A. (1988). VAGUE: A user interface to relational databases that permits vague queries. ACM Transactions on Information Systems, 6(3), 187–214. doi:10.1145/45945.48027
Muslea, I. (2004). Machine learning for online query relaxation. Proceedings of the International Conference on Knowledge Discovery and Data Mining, (pp. 246–255).
Muslea, I., & Lee, T. (2005). Online query relaxation via Bayesian causal structures discovery. Proceedings of the National Conference on Artificial Intelligence, (pp. 831–836).
Nambiar, U. (2005). Answering imprecise queries over autonomous databases. Unpublished doctoral dissertation, University of Arizona, USA.
Nambiar, U., & Kambhampati, S. (2003). Answering imprecise database queries: A novel approach. Proceedings of the International Workshop on Web Information and Data Management, (pp. 126–133).
Nambiar, U., & Kambhampati, S. (2004). Mining approximate functional dependencies and concept similarities to answer imprecise queries. Proceedings of the International Workshop on the Web and Databases, (pp. 73–78).
Nepal, S., & Ramakrishna, M. V. (1999). Query processing issues in image (multimedia) databases. Proceedings of the 15th International Conference on Data Engineering, (pp. 23–26).
Ng, A. Y., Zheng, A. X., & Jordan, M. I. (2001). Stable algorithms for link analysis. Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, (pp. 258–266).
Nielsen, J. (2003). Curmudgeon: IM, not IP (information pollution). ACM Queue, 1(8), 75–76. doi:10.1145/966712.966731
Quillian, M. R. (1968). Semantic memory. In Minsky, M. (Ed.), Semantic information processing (pp. 227–270). The MIT Press.
Raschia, G., & Mouaddib, N. (2002). SAINTETIQ: A fuzzy set-based approach to database summarization. Fuzzy Sets and Systems, 129(2), 137–162. doi:10.1016/S0165-0114(01)00197-X
Robertson, S. E. (1997). The probability ranking principle in IR. Readings in Information Retrieval, 281–286.
Roussopoulos, N., Kelley, S., & Vincent, F. (1995). Nearest neighbor queries. Proceedings of the ACM SIGMOD International Conference on Management of Data, (pp. 71–79).
Ruspini, E. (1969). A new approach to clustering. Information and Control, 15, 22–32. doi:10.1016/S0019-9958(69)90591-9
Saint-Paul, R., Raschia, G., & Mouaddib, N. (2005). General purpose database summarization. Proceedings of the International Conference on Very Large Data Bases, (pp. 733–744).
Salton, G., & McGill, M. J. (1986). Introduction to modern information retrieval. McGraw-Hill.
Schikuta, E., & Erhart, M. (1998). BANG-clustering: A novel grid-clustering algorithm for huge data sets. Proceedings of the Joint IAPR International Workshops on Advances in Pattern Recognition, (pp. 867–874).
Schultz, U., & Vandenbosch, B. (1998). Information overload in a groupware environment: Now you see it, now you don't. Journal of Organizational Computing and Electronic Commerce, 8(2), 127–148. doi:10.1207/s15327744joce0802_3
Sheikholeslami, G., Chatterjee, S., & Zhang, A. (2000). WaveCluster: A wavelet-based clustering approach for spatial data in very large databases. The VLDB Journal, 8(3-4), 289–304. doi:10.1007/s007780050009
Shenk, D. (1997). Data smog: Surviving the information glut. HarperCollins Publishers.
Simon, H. A. (1974). How big is a chunk? Science, 183, 482–488. doi:10.1126/science.183.4124.482
Su, W., Wang, J., Huang, Q., & Lochovsky, F. (2006). Query result ranking over e-commerce Web databases. Proceedings of the International Conference on Information and Knowledge Management, (pp. 575–584).
Tahani, V. (1977). A conceptual framework for fuzzy query processing: A step toward very intelligent database systems. Information Processing & Management, 13, 289–303. doi:10.1016/0306-4573(77)90018-8
Van Rijsbergen, C. J., & Croft, W. B. (1975). Document clustering: An evaluation of some experiments with the Cranfield 1400 collection. Information Processing & Management, 11(5-7), 171–182. doi:10.1016/0306-4573(75)90006-0
Gaede, V., & Günther, O. (1998). Multidimensional access methods. ACM Computing Surveys, 30(2), 170–231. doi:10.1145/280277.280279
Voorhees, E. M. (1985). The cluster hypothesis revisited. Proceedings of the 8th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, (pp. 188–196).
Wang, W., Yang, J., & Muntz, R. R. (1997). STING: A statistical information grid approach to spatial data mining. Proceedings of the International Conference on Very Large Data Bases, (pp. 186–195).
Weil, M. M., & Rosen, L. D. (1997). TechnoStress: Coping with technology @work @home @play. John Wiley and Sons.
Winkle, W. V. (1998). Information overload: Fighting data asphyxiation is difficult but possible. Computer Bits Magazine, 8(2).
Wu, L., Faloutsos, C., Sycara, K. P., & Payne, T. R. (2000). FALCON: Feedback adaptive loop for content-based retrieval. Proceedings of the International Conference on Very Large Data Bases, (pp. 297–306).
Zadeh, L. A. (1965). Fuzzy sets. Information and Control, 8, 338–353. doi:10.1016/S0019-9958(65)90241-X
Zadeh, L. A. (1975). The concept of a linguistic variable and its application to approximate reasoning. Information Sciences, 8, 199–249.
Zadeh, L. A. (1999). Fuzzy sets as a basis for a theory of possibility. Fuzzy Sets and Systems, 100, 9–34. doi:10.1016/S0165-0114(99)80004-9
Zadrozny, S., & Kacprzyk, J. (1996). FQUERY for Access: Towards human consistent querying user interface. Proceedings of the ACM Symposium on Applied Computing, (pp. 532–536).
Zamir, O., & Etzioni, O. (1999). Grouper: A dynamic clustering interface to Web search results. Proceedings of the Eighth International Conference on World Wide Web, (pp. 1361–1374).
Zeng, H. J., He, Q. C., Chen, Z., Ma, W. Y., & Ma, J. (2004). Learning to cluster Web search results. Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, (pp. 210–217).
ENDNOTES
i. A COOPerative Query System.
ii. Supposition Extraction And VErification.
iii. A Cooperative DataBase System.
iv. Associative Information REtrieval System.
v. A User Interface to Relational Databases that Permits VAGUE Queries.
vi. Imprecise Query Engine.
vii. http://www.google.com
viii. Hyperlink-Induced Topic Search.
ix. In the case of information retrieval, ranking functions are often based on the frequency of occurrence of query values in documents (term frequency, or tf). However, in the database context, tf is irrelevant as tuples either contain or do not contain a query value. Hence ranking functions need to consider values of unspecified attributes.
x. Probabilistic Information Retrieval.
xi. Query Result Ranking for E-commerce.
xii. In information retrieval systems, the information need is what the user desires to know from the stored data, to satisfy some intended objective (e.g., data analysis, decision making). However, the query is what the user submits to the system in an attempt to fulfill that information need.
xiii. Overlapping cluStering of Query Results.
xiv. Fuzzy sets can be defined in either discrete or continuous universes.
xv. Browsing ANd Keyword Searching.
xvi. A degree of a node is the number of edges incident to this node. The intuition is that a node that has many links with others has a relatively small possibility of having a close relationship to any of them, and thus edges incident on it have large weights.
xviii. http://trec.nist.gov
xix. A potential problem with real-user evaluation techniques is that users' opinions are very subjective. Hence, even if we obtain positive feedback from a small set of test users, we still cannot draw strong conclusions about the effectiveness of our approach.
Chapter 3
ABSTRACT
This chapter describes a novel query language, called the concept-oriented query language (COQL), and demonstrates how it can be used for data modeling and analysis. The query language is based on a novel construct, called concept, and two relations between concepts, inclusion and partial order. Concepts generalize conventional classes and are used for describing domain-specific identities. Inclusion relation generalizes inheritance and is used for describing hierarchical address spaces. Partial order among concepts is used to define two main operations: projection and de-projection. This chapter demonstrates how these constructs are used to solve typical tasks in data modeling and analysis such as logical navigation, multidimensional analysis, and inference.
INTRODUCTION
A model is a mathematical description of some aspect of the world, and a data model provides means for organizing data in the form of structural principles. These structural principles are used to break all elements into smaller groups, making access to and manipulation of data more efficient for end-users and applications. The concept-oriented model (COM) is a novel general-purpose approach to data
modeling (Savinov, 2009a), which is intended to solve a wide spectrum of problems by reducing them to the following three structural principles distinguishing it from other data models:

• Duality principle answers the question how elements exist by assuming that any element is a couple of one identity and one entity (also called reference and object, respectively).
• Inclusion principle answers the question where elements exist by postulating that any element is included in some domain (also called scope or context).
• Order principle answers the question what an element is, that is, how it is defined and what its meaning is, by assuming that all elements are partially ordered so that any element has a number of greater and lesser elements.
Formally, the concept-oriented model is described using the formalism of nested partially ordered sets. The syntactic embodiment of this model is the concept-oriented query language (COQL). This language reflects the principles of COM by introducing a novel data modeling construct, called concept (hence the name of the approach), and two relations among concepts, inclusion and partial order. Concepts are intended to generalize conventional classes, and inclusion generalizes inheritance. Concepts and inclusion are also used in a novel approach to programming, called concept-oriented programming (COP) (Savinov, 2008, 2009b). The partial order relation among concepts is intended to represent data semantics and is used for complex analytical tasks and reasoning about data.

The concept-oriented model and query language are aimed at solving several general problems which are difficult to solve using traditional approaches. In particular, the following factors motivated this work:

• Domain-specific identities. In most existing data models, elements are represented either by platform-specific references like oids or by weak identities based on entity properties like primary keys. These approaches do not provide a mechanism for defining strong domain-specific identities with arbitrary structure. Concepts solve this problem by making it possible to describe both identities and entities using only one common construct. This produces a nice symmetry between two branches: identity modeling and entity modeling.
• Hierarchical address spaces. Elements cannot exist outside of any space, domain or context, but existing data models do not support this abstraction as a core notion of the model. A typical solution consists in modeling spaces and containment like any other domain-specific relationship. The principled solution proposed in COM is that all elements are supposed to exist within a hierarchy where a parent is a space, context, scope or domain for its child elements. Thus the inclusion relation between concepts turns an element into a set of its child elements. Since identities of internal elements are defined relative to the space they are in, we simultaneously get a hierarchical address space for the elements. Each element within this hierarchy is identified by a domain-specific hierarchical address, like a conventional postal address.
• Multidimensionality. Dimension is one of the fundamental constructs used to represent information in various areas of human knowledge. There exist numerous approaches to multidimensional modeling which are intended for analytical processing. The problem is that there exist different models for analytical and transactional processing which rely on different assumptions and techniques. The goal of COM in this context is to rethink dimensions as a first-class construct of the data model which plays a primary role in describing both transactional and analytical aspects. Data should be represented as originally existing in a multidimensional space, and dimensions should be used in most operations on data.
• Semantics. The lack of semantic description in traditional approaches to data modeling is a strong limiting factor for the effective use of data, and this significantly decreases its value, including the possibility of information exchange, integration, consistency, interoperability and many other functions. Semantics in databases should enable them to respond to queries and other transactions in a more intelligent manner (Codd, 1979). Currently, semantics is supposed to exist at the conceptual level, while logical data models have rather limited possibilities for representing semantic relationships. In this context, the goal of COM is to make semantics an integral part of the logical data model so that the database can be directly used for reasoning about data. To reach this goal, COM makes a principled assumption that a database is a partially ordered set (as opposed to a set without any structure). Partial order is supposed to be represented by references, which are used as an elementary semantic construct in COM.
CONCEPT-ORIENTED MODEL
The smallest unit of data in COM is a primitive value like an integer or a text string. These are normally provided by the DBMS and therefore their structure is not part of the model. More complex elements are produced by means of tuples, which are treated in their common mathematical sense as a combination of primitive values or other tuples. Each member of a tuple has its unique position, which is referred to as a dimension. If e=(x=a, y=b, z=c, …) is a tuple then x, y and z are dimensions while a, b and c are members of this tuple. According to the duality principle, an element in COM is defined as a couple of two tuples: an identity tuple and an entity tuple. Identity tuples are values which are passed by-copy while entity tuples are passed by-reference. Identity tuples are used as locations or addresses for entity tuples. If tuple a is a member within tuple b then only the identity part of a is included by-value in b. It is assumed that it is always possible to access an entity tuple if its identity tuple is available. In terms of conventional computer memory, an identity tuple has the structure of a memory address and an entity tuple has the structure of memory cells. The inclusion principle in COM is implemented via the extension operator, denoted by a colon. If a and b are two elements then e=a:b is a new element where a is a base and b is an extension (so b is said to extend a). It is analogous to object-oriented extension, with the difference that this operation is applied to couples of identity-entity tuples rather than to individual tuples. If the entity tuple is empty then this operation can be used to extend values and model value domains. If the identity tuple is empty then it is used to extend objects or records. With the help of extension, any element can be represented as a sequence of couples starting from some common root element and ending with the last extension. The extension operator induces a strict inclusion relation among elements by assuming that an extended element is included in its base element, that is, if e=a:b then e ⊑ a. All elements from a set R are then represented as a nested set (R, ⊑) where parents are referred to as super-elements and children (extended elements) are referred to as sub-elements. The inclusion relation among elements is analogous to the nested structure of elements in XML. According to the order principle, elements in COM are (strictly) partially ordered by assuming that a tuple is less than any of its members, that is, if e=(…, a, …) is a tuple then e<a. All elements of set R are then represented as a partially ordered set (R, <). If a<b then a is said to be a lesser element and b is referred to as a greater element. In particular, this assumption means that a tuple cannot have itself as a member directly or indirectly, that is, cycles are prohibited. According to the duality principle,
tuple members are represented by storing identities, which correspond to conventional object references. Hence this principle means that object references represent greater elements. Inclusion and partial order relations are connected using the following type constraint:

e ⊑ b ⟹ e.x ⊑ b.x

where e.x denotes the member of element e along dimension x. In terms of the extension operator it can be equivalently written as follows:

b:e < (b.x):(e.x)

This condition means that if an element sets some value for its dimension then its extensions are permitted to extend this value using this same dimension, but they cannot set an arbitrary value for it. A concept-oriented database is formally defined as a nested partially ordered set (R, ⊑, <) where elements are identity-entity couples and the strict inclusion and strict partial order relations satisfy the type constraint. This structure can be produced from a partially ordered set if we assume that an element can itself be a partially ordered set. Alternatively, it can be produced from a nested set if we assume that its elements are partially ordered. An example of a nested partially ordered set is shown in Figure 1. The nested structure spreads horizontally while the partially ordered structure spreads vertically. For example, element a consists of elements b and c which are its extensions. Simultaneously, element a has two greater elements f and d.
MODELING IDENTITIES
Object identity is an essential part of any programming and data model. Although the role of identities has never been underestimated (Khoshafian et al, 1986; Wieringa et al, 1995; Abiteboul et al, 1998; Kent, 1991; Eliassen et al, 1991), there exists a very strong bias towards modeling entities. There is a

Figure 1. Nested partially ordered set
very old and very strong belief that it is the entity that should be in the focus of a programming and data model while identities simply serve entities. As a consequence, there is a strong asymmetry between entities and identities in traditional approaches to data modeling: entities have domain-specific structure while identities have platform-specific structure. Accordingly, classes are used to model the domain-specific structure and behavior of entities, while identities are provided by the underlying environment with some built-in platform-specific structure. The concept-oriented model treats entities and identities symmetrically, so that both of them may have arbitrary domain-specific structure and behavior. More specifically, it is assumed that any element is a couple consisting of one identity and one entity. To model such identity-entity couples, a new construct is introduced, called concept. A concept is defined as a couple of two classes: one identity class and one entity class. Concept fields are referred to as dimensions in COM. This produces a nice yin-yang style of balance and symmetry between two orthogonal branches: identity modeling and entity modeling. Whereas traditional approaches use classes to model entities, COQL uses concepts to model identity-entity couples. The shift of paradigm is that things (data elements) are couples and identities are made an explicit part of the model. Informally, elements in COM can be thought of as complex numbers in mathematics, which also have two constituents manipulated as one whole. Identity is an observable part which is manipulated directly in its original form. It is passed by-value (by-copy) and does not have its own separate representative. Entity can be viewed as a thing-in-itself, a reality which is radically unknowable and not observable in its original form. The only way to do something with an entity consists in using its identity as an intermediate. This means that there is no other way to access an entity than by using its reference. Entities are persistent objects while identities are transient values which are used to represent and access objects. In most cases identities are immutable and cannot be changed over the whole lifetime of the represented entity, while entities are supposed to be mutable so that their properties reflect the current state of the problem domain. The entity itself can be considered a kind of point in space while its reference is thought of as a coordinate. It is important to note that references in COM are abstract addresses from a virtual address space. Therefore the represented object can actually reside anywhere in the world (not even necessarily in a computer system). References in COM (just as values) are domain-specific because they are designed taking into account the properties and requirements of the application being created. In contrast, references in most existing models are platform-specific because they are provided by the compiler taking into account the available run-time environment. If the identity class of a concept is empty then this concept is equivalent to a conventional class which is used for describing objects. In this case entities are supposed to be represented by some platform-specific reference. If the entity class is empty then the concept describes a type of values (a value domain). Values are a very important notion in data modeling because they are considered terminal elements for any model. Essentially, values are what any model or computation is about.
Values are characterized by the following properties: they are immutable, they are passed by-copy, and they do not have their own reference. Values are used in modeling because they have some meaning which can be passed to other elements, stored, retrieved or computed. An example of a primitive value is an integer number which could represent a city population, or a double number which could represent the quantity of some resource. A complex system requires complex value types which are made of simpler types, and this is precisely where concepts (with an empty entity class) are intended to be used. For example, the following concept describes an amount in some currency (like USD or EUR) as a value:
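CONCEPT Amount
  IDENTITY
    DOUBLE quantity  // the numeric amount (field names reconstructed for illustration)
    CHAR(3) currency // e.g., "USD" or "EUR"
  ENTITY
    // Empty: Amount is a pure value type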
If both constituents of a concept are non-empty then the identity class describes references for representing its objects. For example, let us assume that we need to model bank accounts. In this problem domain any bank account is identified by an account number and characterized by its current balance and owner. The existence of domain-specific identity and entity is easily described by means of the following concept:
CONCEPT Account
  IDENTITY
    CHAR(10) accNo
  ENTITY
    Amount balance
    Person owner
Each instance of this concept will be a pair of an account reference identifying this element and an account object consisting of two fields. Variables of this concept will store account numbers which identify the corresponding account objects. Account owners can be modeled using the following concept:
CONCEPT Person
  IDENTITY
    CHAR(11) ssn
  ENTITY
    CHAR(11) name
    Date dob
The difference from the relational model (Codd, 1970) is that concept dimensions contain the whole identity of the referenced element treated as one value. The difference from object data models (Dittrich, 1986; Bancilhon, 1996) is that identities may have an arbitrary domain-specific structure. And the distinguishing feature of this approach is that both identities and entities are modeled together within one data modeling construct.
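The duality principle can be mimicked in ordinary code. The following Python sketch (with hypothetical helper types, not part of COQL) models each concept as a pair of an immutable identity record and a mutable entity record, where a dimension such as owner stores the whole identity by value:

from dataclasses import dataclass

@dataclass(frozen=True)           # identities are immutable and passed by copy
class PersonId:
    ssn: str                      # CHAR(11) identity segment

@dataclass                        # entities are mutable, reached via identity
class PersonEntity:
    name: str
    dob: str

@dataclass(frozen=True)
class AccountId:
    acc_no: str                   # CHAR(10) identity segment

@dataclass
class AccountEntity:
    balance: float
    owner: PersonId               # whole identity embedded by value

persons = {PersonId("123-45-6789"): PersonEntity("Smith", "1980-01-01")}
accounts = {AccountId("0123456789"): AccountEntity(1500.0, PersonId("123-45-6789"))}

# Dereferencing: the only way to reach an entity is through its identity.
owner_entity = persons[accounts[AccountId("0123456789")].owner]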
MODELING HIERARCHIES
Any element must have some identity which manifests the fact of its existence. But if something exists then there has to be some space to which it belongs, that is, elements are not able to exist outside of any space. In COM, existence within space means that the element is identified relative to this space. Space is a normal element of the model and all elements exist within a hierarchy where a child is said to be included in its parent interpreted as a space, scope, context or domain. Parents in the inclusion
hierarchy are said to be super-elements while their children are called sub-elements. Parent element is also referred to as a domain, scope or context for its children. To model a hierarchy of elements, COQL introduces a special relation among concepts, called inclusion. For example, assume that concept Savings describing savings accounts is included in concept Account:
CONCEPT Savings IN Account
  IDENTITY
    CHAR(2) savAccNo
  ENTITY
    Amount balance
This declaration means that instances of the Savings concept will be identified by two digits within an instance of the Account concept. Any reference to a savings account consists of two segments: main account number and relative savings account number. An account in this case is a set of savings accounts which are all distinguished within this main context by means of their relative identifier. The most important difference from inheritance is that one parent may have many children. In contrast, if Savings were a class inheriting from class Account then any new instance of Savings would get its own account. Thus the advantage of concepts is that they allow us to describe a hierarchy where one account may contain many sub-accounts which in turn may have their own child objects. This hierarchy is analogous to conventional postal addresses where one country has many cities and one city has many streets so that any destination has a unique hierarchical address. Inclusion is also analogous to element nesting in XML where any element may have many child elements. An important property of inclusion is that it is equivalent to inheritance under certain simplifying conditions and this is why we say that inclusion generalizes inheritance. Namely, inclusion relation is reduced to conventional inheritance if identity class of the child concept is empty. In this case only one child can be created and this child shares identity with its parent. For example, if concept Special describing accounts with special privileges has no identity and is included in concept Savings then effectively it extends the parent concept and behaves like a normal class:
CONCEPT Special IN Savings
  IDENTITY
    // Empty
  ENTITY
    INT privileges
This compatibility allows us to make a smooth transition from classical inheritance to the more powerful inclusion, which is a novel, more general treatment of what inheritance is. In concept-oriented terms, to inherit something means to be included into it and to have it as a domain, scope or context. Inheritance is one of the cornerstones of object data models but it has one problem: classes exist as a hierarchy while their instances exist in a flat space. Indeed, parent classes are shared parts of the child classes, while parent objects are allocated for each new child and all objects have the same type of identity. Concepts and inclusion eliminate this asymmetry so that both concepts and their instances exist within a hierarchy. Having an object hierarchy is very important in data modeling because it is a way in which one large set of objects can be broken into smaller groups. Probably, it is the solid support of set-oriented operations
that is the main reason why the relational model (Codd, 1970) has dominated other data models. And insufficient support of the set-oriented view on data is why the object-oriented paradigm is much less popular in data modeling than in programming. Concepts and inclusion allow us to turn object data models into a set-based approach where any element is inherently a set and the notion of set is supported at the very basic level. Also, the use of the inclusion relation makes COM similar to the hierarchical model (Tsichritzis et al, 1976). Inclusion in the case of an empty entity class can be used to extend values by adding additional fields and describing a hierarchy of value types. For example, if we need to mark an amount with the date of the transaction then it can be described as follows:
CONCEPT DatedAmount IN Amount
  IDENTITY
    Date date
  ENTITY
    // Empty
Of course, the same can be done using conventional classes, but then we would need to have two kinds of classes, for values and for objects, while concepts and inclusion describe both value and object types. In relational terms, this allows us to model two hierarchies, of domains and of relations, using only one construct (concept) and one relation (inclusion). In this sense, it is a step in the direction of unifying object-oriented and relational models by uniting two orthogonal branches: domain modeling and relational modeling.
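As a rough illustration of inclusion in code, the Python sketch below (illustrative types only) shows how a Savings identity embeds the identity of its Account, so that a full reference is a hierarchical address made of segments:

from dataclasses import dataclass

@dataclass(frozen=True)
class AccountId:
    acc_no: str                   # CHAR(10) segment

@dataclass(frozen=True)
class SavingsId:
    parent: AccountId             # the space this element exists in
    sav_acc_no: str               # CHAR(2) segment, relative to the parent

sid = SavingsId(AccountId("0123456789"), "01")
# The full address is the concatenation of segments, like a postal address:
full_address = (sid.parent.acc_no, sid.sav_acc_no)   # ('0123456789', '01')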
Here concept Account has a dimension which represents its owner and whose type is Person. Accordingly, concept Person is a greater concept and concept Account is a lesser concept. The number of greater concepts is equal to the number of dimensions (concept fields), and the number of lesser concepts is equal to the number of uses of this concept as a dimension type. In diagrams, greater concepts are positioned over lesser concepts. Concept instances are also partially ordered using the following principle: a referenced element is greater than the referencing element. Thus a reference always points to a greater element in the database. For example, a bank account element (instance of concept Account) is less than the referenced account
owner (instance of concept Person). Thus references are not simply a navigational tool but rather allow us to represent a partial order, which is a formal basis for representing data semantics. Two operations, projection and de-projection, are defined in COM by using the partially ordered structure of elements. Projection is applied to a set of elements and returns all their greater elements along the specified dimension. Since greater elements are represented by references, this definition can be formulated as follows: the projection of a set is the set of all elements referenced by some dimension. For example (Figure 2, left), in order to find all owners of the bank accounts we can project these accounts up to the collection of owners:
(Accounts) -> owner -> (Persons)
The set of accounts and owners can be restricted so that the selected accounts are projected to the selected persons:
(Accounts | balance > 1000) -> owner -> (Persons | age < 30)
This query returns all young owners of the accounts with a large balance. De-projection is applied to a set of elements and returns all their lesser elements along the specified dimension. In terms of references, de-projection is defined as follows: the de-projection of a set is the set of all elements referencing it by some dimension. For example (Figure 2, right), in order to find all accounts of the persons we can de-project these persons down to the collection of accounts:
(Persons) <- owner <- (Accounts)
The owners and accounts can be restricted and then this query returns all accounts with large balance belonging to young owners:
(Persons | age < 30) <- owner <- (Accounts | balance > 1000)
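The two operations are easy to emulate on toy data. In the following Python sketch, dictionaries stand in for collections (this is not COQL's actual evaluator): projection follows a dimension to the referenced greater elements, while de-projection collects the referencing lesser elements.

persons  = {"p1": {"age": 25}, "p2": {"age": 40}}
accounts = {"a1": {"balance": 2000, "owner": "p1"},
            "a2": {"balance": 500,  "owner": "p1"},
            "a3": {"balance": 3000, "owner": "p2"}}

def project(ids, coll, dim):
    """(ids) -> dim: the greater elements referenced via dimension dim."""
    return {coll[i][dim] for i in ids}

def deproject(ids, coll, dim):
    """(ids) <- dim: the lesser elements in coll referencing ids via dim."""
    return {k for k, row in coll.items() if row[dim] in ids}

# (Accounts | balance > 1000) -> owner -> (Persons | age < 30)
rich = {k for k, a in accounts.items() if a["balance"] > 1000}
young_rich_owners = {p for p in project(rich, accounts, "owner")
                     if persons[p]["age"] < 30}                   # {'p1'}

# (Persons | age < 30) <- owner <- (Accounts | balance > 1000)
young = {k for k, p in persons.items() if p["age"] < 30}
accounts_of_young = deproject(young, accounts, "owner") & rich    # {'a1'}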
In the general case, operations of projection and de-projection can be combined so that the access path has a zigzag form and consists of upward (projection) and downward (de-projection) segments in the partially ordered structure of collections (Savinov, 2005a, 2005b).
Inference
Logical navigation using projection and de-projection can be used for constraint propagation along explicitly specified dimension paths. In this section we describe an inference procedure which can automatically propagate source constraints through the model to the target without a complete specification of the propagation path (Savinov, 2006b). This inference procedure is able to choose a natural propagation path itself. This means that if something is said about one set of elements then the system can infer something specific about other elements in the model. The main problem is how exactly the source constraints are propagated through the model. In COM, the inference procedure consists of two steps:

1. de-projecting the source constraints down to the chosen fact collection;
2. projecting the obtained fact elements up to the target collection.
Let us consider the concept-oriented database schema shown in Figure 3, where it is assumed that each book is published by one publisher and there is a many-to-many relationship between writers and books. If we want to find all writers of one publisher then this can be done by specifying a concrete access path for propagating publishers to writers:
(Publishers | name = XYZ) <- published <- (Books) <- book <- (BooksWriters) -> writer -> (Writers)
Here we select a publisher, de-project it down to the Books collection, then de-project the result further down to the BooksWriters collection with fact elements, and finally project the facts up to the Writers collection.
If we apply the inference procedure then the same query can be written in a more compact form:
(Publishers | name = XYZ) <-* (BooksWriters) *-> (Writers)
This query consists of two parts which correspond to the two-step inference procedure. In the first step we de-project the source collection down to the BooksWriters fact collection using the <-* operator. In the second step we project the facts up to the target Writers collection using the *-> operator. Note that we use stars in the projection and de-projection operators to denote an arbitrary dimension path. We can write this query in an even more concise form if the de-projection and projection steps are united in one inference operator, denoted <-*->:
(Publishers | name = XYZ) <-*-> (Writers)
The system will de-project the source collection down to the most specific collection and then project it up to the target collection. Note how simple and natural this query is. We essentially specify what we have and what we need. The system then is able to propagate these constraints to the target and return the result.
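For illustration, the following Python sketch reproduces this two-step inference on the Publishers-Books-Writers schema using toy dictionaries; in COQL, the fact collection and the dimension paths are discovered automatically.

publishers    = {"pub1": {"name": "XYZ"}, "pub2": {"name": "ABC"}}
books         = {"b1": {"published": "pub1"}, "b2": {"published": "pub2"}}
books_writers = {"f1": {"book": "b1", "writer": "w1"},   # fact collection
                 "f2": {"book": "b1", "writer": "w2"},
                 "f3": {"book": "b2", "writer": "w2"}}

def project(ids, coll, dim):
    return {coll[k][dim] for k in ids}

def deproject(ids, coll, dim):
    return {k for k, row in coll.items() if row[dim] in ids}

# (Publishers | name = XYZ) <-* (BooksWriters) *-> (Writers)
source = {k for k, p in publishers.items() if p["name"] == "XYZ"}
facts = deproject(deproject(source, books, "published"),   # step 1: down to facts
                  books_writers, "book")
writers = project(facts, books_writers, "writer")          # step 2: up to target
# writers == {'w1', 'w2'}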
Multidimensional Analysis
Projection and de-projection allow us to navigate through the partially ordered structure of the database following only one dimension path. For multidimensional analysis it is necessary to have an operation which can take several collections and produce a multidimensional cube from them (Savinov, 2006a). In COQL, this operation is prefixed with the CUBE keyword followed by a sequence of source collections in round brackets. For example (Figure 4), given two source collections with countries and product categories we can produce a 2-dimensional cube where one element is a combination of one country and one product category:
ResultCube = CUBE (Countries, Categories)
By default, all input dimensions will be included in the result collection, that is, a cube in our example will have two dimensions: a country and a product category. An arbitrary structure of the result collection is specified by means of the RETURN keyword. For example, we might want to return only country and category names:
ResultCube = CUBE(Countries co, Categories ca)
RETURN co.name, ca.name
The main application of the CUBE operator is multidimensional analysis, where data is represented in aggregated form over a multidimensional space with a user-defined structure. The analyzed parameter, the measure, is normally computed by aggregating data for each cell of the cube. The measure can be computed within the query body, denoted by the BODY keyword. For example, if we need to show sales over countries and categories then it can be done as follows:
ResultCube = CUBE(Countries co, Categories ca)
BODY {
    cell = co <-* Sales AND ca <-* Sales
    measure = SUM(cell.price)
}
RETURN co.name, ca.name, measure
In this query it is assumed that the Sales collection is a common lesser collection for Countries and Categories. The result is a combination of country and category names along with the total sales for them. To compute the total sales figure, we first find all facts belonging to one cell (a combination of the current country and category). This is done by de-projecting the country and the category down to the Sales fact collection and then finding the intersection of the two results (denoted by AND). Then we sum up all sales within one cell of the cube, using one numeric dimension for aggregation, and store this result in the measure variable. Finally, the value computed for this country and this category is returned in the result.
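The effect of this CUBE query can be reproduced on toy data as in the Python sketch below, where simple filters stand in for the de-projections co <-* Sales and ca <-* Sales.

from itertools import product

countries  = ["DE", "FR"]
categories = ["books", "music"]
sales = [                                   # common lesser (fact) collection
    {"country": "DE", "category": "books", "price": 10.0},
    {"country": "DE", "category": "music", "price": 7.5},
    {"country": "FR", "category": "books", "price": 12.0},
]

result_cube = []
for co, ca in product(countries, categories):       # CUBE(Countries, Categories)
    cell = [s for s in sales
            if s["country"] == co and s["category"] == ca]
    measure = sum(s["price"] for s in cell)          # measure = SUM(cell.price)
    result_cube.append((co, ca, measure))            # RETURN co, ca, measure
# [('DE','books',10.0), ('DE','music',7.5), ('FR','books',12.0), ('FR','music',0)]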
RELATED WORK
Although the concept-oriented model is based on only three principles, it is able to simulate many widespread mechanisms and data modeling patterns provided in other models. For example, the existence of hierarchies in COM makes it very similar to the classical hierarchical data model (HDM) (Tsichritzis et al, 1976). In both models data exist within a hierarchy where any element has a unique position. Therefore, COM can be viewed as a reincarnation of the hierarchical model on a new cycle of the development spiral. The main distinguishing feature of COM is that it proposes to use domain-specific identities by providing means for modeling a hierarchical address space for each problem domain. Another novelty is
that the inclusion relation simultaneously generalizes inheritance and can be used for type modeling as is done in object data models. Both the relational model and COM are tuple-based and set-based. The main difference of COM is that it introduces two types of tuples: identity tuples and entity tuples. Another difference is that instead of using conventional sets, COM considers partially ordered sets. In other words, COM assumes that in data modeling it is important to consider the structure of the set, and partial order is assumed to be an intrinsic and primary property of data while other properties are derived from it. One of the main achievements of the relational model was independence from physical representation, which was reached by removing physical (platform-specific) identities from the model. In this sense COM reverses the situation by recognizing the importance of identity modeling. COM makes a clear statement that identities are at least as important as entities and introduces special means for modeling them using concepts. If we assume that surrogates are used for identifying rows in the relational model and that these surrogates may have arbitrary structure, then we get a mechanism of identity modeling similar to that used in COM. The idea that partial order can be laid at the foundation of data management was also developed by Raymond (1996), where a partial order database is simply a partial order. However, this approach assumes that partial order underlies type hierarchies, while COM proposes to use a separate inclusion relation for modeling types. It also focuses on manipulating different partial orders and relies more on formal logic, while COM focuses on manipulating elements within one nested poset with a strong focus on dimensional modeling, constraint propagation and inference. The notion of directed acyclic graph (DAG) has frequently been used in data modeling as a constraint imposed on a graph structure. When used informally, DAGs and partial orders can easily be confused although they are two different mathematical constructs created for different purposes. DAGs are more appropriate for graph-based models to impose additional constraints on their relationships (edges of the graph). COM is not a graph-based model and its main emphasis is on dimensional modeling and analytical functions, where the order-theoretic formalism is much more appropriate. For example, in graphs (including DAGs) we still rely on a navigational approach for data access, while in COM we rely on projection and de-projection operations along dimensions which change the level of detail. Hierarchies in the concept-oriented model are as important as they are in object data models (Dittrich, 1986; Bancilhon, 1996). However, object hierarchies are interpreted as inheritance and one of their main purposes consists in the re-use of parent data and behavior. The COM inclusion relation generalizes inheritance, which means that inclusion can be used as inheritance. However, the inclusion hierarchy is simultaneously a means for identity modeling, so that elements get their unique addresses within one hierarchical container. Essentially, establishing the fact that inheritance is actually a particular case of inclusion is one of the major contributions of the concept-oriented approach. From this point of view, hierarchy in the hierarchical model and inheritance in object models are particular cases of the inclusion hierarchy in COM.
The treatment of inclusion in COM is very similar to how inheritance is implemented in prototype-based programming (Lieberman, 1986; Stein, 1987; Chambers et al., 1991), because in both approaches parent elements are shared parts of children. The use of partial order in COM makes it similar to the multidimensional models (Pedersen et al., 2001) widely used in OLAP and data warehousing. In most multidimensional models (Li et al., 1996; Agrawal et al., 1997; Gyssens et al., 1997), each dimension type is defined as a partially ordered set of category types (levels) and two special categories ⊤ (top) and ⊥ (bottom). The main difference is that COM proposes to partially order the whole model, without assigning special roles to dimensions, cube, fact table and measures as is done in multidimensional models. Thus, instead of defining dimensions as several
partially ordered sets of levels, COM unites all levels into one model. The assignment of such roles as dimension, cube, fact table and level is done later, during each concrete analysis, which is defined in terms of set-based projection and de-projection operations. This extension of partial order to the whole model (rather than considering it within the scope of each individual dimension) allows us to treat it as a model of data rather than a model of analysis (an OLAP model).

One of the main characteristics of any semantic data model is its ability to represent complex relationships among elements and then use them for automating complex tasks like reasoning about data. There has been a large body of research on semantic data models (Hull et al., 1987; Peckham et al., 1988), but most of them propose a conceptual model which needs to be mapped to some logical model. COM proposes a new approach to representing and manipulating data semantics where different abstractions of conventional semantic models, such as aggregation, generalization and classification, are expressed in terms of a partially ordered set. From the point of view of aggregation, greater elements are constituents of this aggregate. From the point of view of generalization, greater elements are more general elements. One important feature of COM is that references change their role from a navigational tool to an elementary semantic construct. Another unique property of COM is that it uses two orthogonal structures: inclusion and partial order. A similar approach is used in (Smith et al., 1977), where data belong to two structures simultaneously: aggregation and generalization.

The functional data model (FDM) is based upon sets and functions (Sibley et al., 1977; Shipman, 1981). COM is similar to this model because dimensions can be expressed as functions which return a super-element. However, COM restricts them to single-valued functions, while set-valued functions are represented by inverted dimensions, which are expressed via the de-projection operator. In addition, COM imposes a strong constraint on the structure of functions by its order principle, which means that a sequence of functions cannot return a previous element.
CONCLUSION
In this paper we described the concept-oriented model and query language, which propose to treat elements as identity-entity couples structured using two relations: inclusion and partial order. The main distinguishing features of this novel approach are as follows:

Concepts instead of classes. COQL introduces a novel construct, called concept, which generalizes classes. While classes have only one constituent, concepts are made up of two constituents: an identity class and an entity class. Data modeling is then broken into two orthogonal branches: identity modeling and entity modeling. This creates a nice yin-yang style of symmetry between the two sides of one model. Informally speaking, it can be compared to manipulating complex numbers in mathematics, which also have two constituents: real and imaginary parts. In practice, this generalization allows us to model domain-specific identities instead of having only platform-specific ones.

Inclusion instead of inheritance. Classical inheritance is not very effective in data modeling because class instances exist in a flat space although classes exist in a hierarchy. The inclusion relation introduced in COM permits objects to exist in a hierarchy where they are identified by hierarchical addresses. Data modeling is then reduced to describing such a hierarchical address space where data elements are supposed to exist. Importantly, inclusion retains all properties of classical inheritance. This use of inclusion turns objects into sets (of their sub-objects) and makes the whole approach intrinsically set-based rather than instance-based.

Partial order instead of graph. COM proposes to partially order all data elements by assuming that references represent greater elements and dimension types of concepts represent greater concepts. Data modeling is then reduced to ordering elements, so that other properties and mechanisms are derived from this relation. Note that partial order also allows us to treat elements as sets of their lesser elements.
These principles are rather general and support many mechanisms and patterns of thought currently used in data modeling. In particular, we demonstrated how this approach can be used for logical navigation using the operations of projection and de-projection, for inference where the constraint propagation path is chosen automatically, and for multidimensional analysis where cubes and measures are easily constructed using the partially ordered structure of the model. Taking into account their simplicity and generality, COM and COQL seem a rather promising direction for further research and development activities in the area of data modeling.
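To make the projection and de-projection operations concrete, the following minimal Python sketch models elements whose dimensions reference greater elements. The names Element, project and deproject are illustrative assumptions for this chapter's discussion, not part of any published COQL implementation.

```python
# Minimal sketch of COM-style projection and de-projection over a partially
# ordered set of elements: each dimension of an element references a greater
# element; projection moves up along a dimension, de-projection moves down.

class Element:
    def __init__(self, name, **supers):
        self.name = name
        self.supers = supers  # dimension name -> greater (super-) element

def project(elements, dimension):
    """Set of greater elements reached along 'dimension'."""
    return {e.supers[dimension] for e in elements if dimension in e.supers}

def deproject(elements, universe, dimension):
    """All elements of 'universe' whose greater element along 'dimension' is in 'elements'."""
    return {e for e in universe if e.supers.get(dimension) in elements}

# Example: orders reference a customer, the customer references a country.
usa = Element("USA")
alice = Element("Alice", country=usa)
o1, o2 = Element("o1", customer=alice), Element("o2", customer=alice)
universe = {usa, alice, o1, o2}

print({e.name for e in project({o1, o2}, "customer")})            # {'Alice'}
print({e.name for e in deproject({alice}, universe, "customer")}) # {'o1', 'o2'}
```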
REFERENCES
Abiteboul, S., & Kanellakis, P. C. (1998). Object identity as a query language primitive. Journal of the ACM, 45(5), 798–842. doi:10.1145/290179.290182

Agrawal, R., Gupta, A., & Sarawagi, S. (1997). Modeling multidimensional databases. In 13th International Conference on Data Engineering (ICDE'97), (pp. 232–243).

Bancilhon, F. (1996). Object databases. ACM Computing Surveys, 28(1), 137–140. doi:10.1145/234313.234373

Chambers, C., Ungar, D., Chang, B., & Hölzle, U. (1991). Parents are shared parts of objects: Inheritance and encapsulation in Self. Lisp and Symbolic Computation, 4(3), 207–222. doi:10.1007/BF01806106

Codd, E. F. (1970). A relational model of data for large shared data banks. Communications of the ACM, 13(6), 377–387. doi:10.1145/362384.362685

Codd, E. F. (1979). Extending the database relational model to capture more meaning. ACM Transactions on Database Systems, 4(4), 397–434. doi:10.1145/320107.320109

Dittrich, K. R. (1986). Object-oriented database systems: The notions and the issues. In Proceedings of the International Workshop on Object-Oriented Database Systems, (pp. 2–4).

Eliassen, F., & Karlsen, R. (1991). Interoperability and object identity. SIGMOD Record, 20(4), 25–29. doi:10.1145/141356.141362

Gyssens, M., & Lakshmanan, L. V. S. (1997). A foundation for multi-dimensional databases. In Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB'97), (pp. 106–115).

Hull, R., & King, R. (1987). Semantic database modeling: Survey, applications, and research issues. ACM Computing Surveys, 19(3), 201–260. doi:10.1145/45072.45073
Kent, W. (1991). A rigorous model of object references, identity and existence. Journal of Object-Oriented Programming, 4(3), 28–38.

Khoshafian, S. N., & Copeland, G. P. (1986). Object identity. In Proceedings of OOPSLA'86, ACM SIGPLAN Notices, 21(11), 406–416.

Li, C., & Wang, X. S. (1996). A data model for supporting on-line analytical processing. In Proceedings of the Conference on Information and Knowledge Management, Baltimore, MD, (pp. 81–88).

Lieberman, H. (1986). Using prototypical objects to implement shared behavior in object-oriented systems. In Proceedings of OOPSLA'86, ACM SIGPLAN Notices, 21(11), 214–223.

Peckham, J., & Maryanski, F. (1988). Semantic data models. ACM Computing Surveys, 20(3), 153–189. doi:10.1145/62061.62062

Pedersen, T. B., & Jensen, C. S. (2001). Multidimensional database technology. IEEE Computer, 34(12), 40–46.

Raymond, D. (1996). Partial order databases. Unpublished doctoral thesis, University of Waterloo, Canada.

Savinov, A. (2005a). Hierarchical multidimensional modeling in the concept-oriented data model. In 3rd International Conference on Concept Lattices and Their Applications (CLA'05), Olomouc, Czech Republic, (pp. 123–134).

Savinov, A. (2005b). Logical navigation in the concept-oriented data model. Journal of Conceptual Modeling, 36.

Savinov, A. (2006a). Grouping and aggregation in the concept-oriented data model. In Proceedings of the 21st Annual ACM Symposium on Applied Computing (SAC'06), Dijon, France, (pp. 482–486).

Savinov, A. (2006b). Query by constraint propagation in the concept-oriented data model. Computer Science Journal of Moldova, 14(2), 219–238.

Savinov, A. (2008). Concepts and concept-oriented programming. Journal of Object Technology, 7(3), 91–106. doi:10.5381/jot.2008.7.3.a2

Savinov, A. (2009a). Concept-oriented model. In Ferraggine, V. E., Doorn, J. H., & Rivero, L. C. (Eds.), Handbook of research on innovations in database technologies and applications: Current and future trends (pp. 171–180). Hershey, PA: IGI Global.

Savinov, A. (2009b). Concept-oriented programming. In Khosrow-Pour, M. (Ed.), Encyclopedia of Information Science and Technology (2nd ed., pp. 672–680). Hershey, PA: IGI Global.

Shipman, D. W. (1981). The functional data model and the data language DAPLEX. ACM Transactions on Database Systems, 6(1), 140–173. doi:10.1145/319540.319561

Sibley, E. H., & Kerschberg, L. (1977). Data architecture and data model considerations. In Proceedings of the AFIPS Joint Computer Conferences, (pp. 85–96).

Smith, J. M., & Smith, D. C. P. (1977). Database abstractions: Aggregation and generalization. ACM Transactions on Database Systems, 2(2), 105–133. doi:10.1145/320544.320546
Stein, L. A. (1987). Delegation is inheritance. In Proceedings of OOPSLA'87, ACM SIGPLAN Notices, 22(12), 138–146.

Tsichritzis, D. C., & Lochovsky, F. H. (1976). Hierarchical data-base management: A survey. ACM Computing Surveys, 8(1), 105–123. doi:10.1145/356662.356667

Wieringa, R., & de Jonge, W. (1995). Object identifiers, keys, and surrogates: Object identifiers revisited. Theory and Practice of Object Systems, 1(2), 101–114.
Chapter 4
ABSTRACT
Criteria that induce a Skyline naturally represent users' preference conditions, useful to discard irrelevant data in large datasets. However, in the presence of high-dimensional Skyline spaces, the size of the Skyline can still be very large. To identify the best k points among the Skyline, the Top-k Skyline approach has been proposed. This chapter describes existing solutions and proposes to use the TKSI algorithm for the Top-k Skyline problem. TKSI reduces the search space by computing only the subset of the Skyline that is required to produce the top-k objects. In addition, the Skyline Frequency Metric is implemented to discriminate among the Skyline objects those that best meet the multidimensional criteria. The chapter's authors have empirically studied the quality of TKSI, and their experimental results show that TKSI may speed up the computation of the Top-k Skyline by at least 50% with regard to state-of-the-art solutions.
INTRODUCTION
Emerging technologies such as the Semantic Web, Grid, Semantic Search, Linked Data, Cloud and Peer-to-Peer computing have made available very large datasets. For example, at the time this chapter was written, at least 21.59 billion pages were indexed on the Web (De Kunder, 2010) and the Cloud of Linked Data contained at least 13,112,409,691 triples (W3C, 2010). The enormous growth in the size of data has a direct impact on the performance of tasks that are required to process very large datasets and
whose complexity depends on the size of the database. In particular, the task of evaluating queries based on user preferences may be considerably affected by this situation. Skyline (Börzsönyi et al., 2001) approaches have been successfully used to naturally express user preference conditions useful to characterize relevant data in large datasets. Even though Skyline may be a good choice for huge datasets, its cardinality may become very large as the number of criteria or dimensions increases. The estimated cardinality of the Skyline is O(ln^(d−1) n) when the dimensions are independent, where n is the size of the input data and d the number of dimensions (Bentley et al., 1978). Consider Table 1, which shows estimates of the Skyline cardinality when the number of dimensions ranges from 2 to 10 in a database comprised of 1,000,000 tuples. We may observe in Table 1 that the Skyline cardinality rapidly increases, making it unfeasible for users to process the whole Skyline set. In consequence, users may have to discard useless data manually and consider just a small subset, or a subset of the Skyline that best meets the multidimensional criteria. To identify these points, the Top-k Skyline has been proposed (Goncalves and Vidal, 2009; Chan et al., 2006b; Lin et al., 2007). Top-k Skyline uses a score function to induce a total order of the Skyline points, and recognizes the top-k objects based on these criteria. Several algorithms have been defined to compute the Top-k Skyline, but they may be very costly (Goncalves and Vidal, 2009; Chan et al., 2006b; Lin et al., 2007; Vlachou and Vazirgiannis, 2007). First, they require the computation of the whole Skyline; second, they execute probes of the multidimensional function over the whole set of Skyline points. Thus, if k is much smaller than the cardinality of the Skyline, these solutions may be very inefficient because a large number of non-necessary probes may be evaluated, i.e., at least Skyline size minus k of the performed probes will be non-necessary. Top-k Skyline has become necessary in many real-world situations (Vlachou and Vazirgiannis, 2007), and a wide range of ranking metrics to measure the interestingness of each Skyline tuple has been proposed. Examples of these ranking metrics are skyline frequency (Chan et al., 2006b), k-dominant skyline (Chan et al., 2006a), and k representative skyline (Lin et al., 2007). Skyline frequency ranks Skyline points in terms of the number of times a Skyline tuple belongs to the Skyline of a non-empty subset or subspace of the multi-dimensional criteria. The k-dominant skyline metric identifies Skyline points in k ≤ d dimensions of the multi-dimensional criteria. Finally, the k representative skyline metric produces the k Skyline points that have the maximal number of dominated objects.
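As a rough illustration of this growth, the sketch below evaluates the commonly cited closed-form approximation (ln n)^(d−1)/(d−1)! of the expected Skyline size under independent dimensions (Bentley et al., 1978). The exact figures in the chapter's Table 1 may differ; this only shows the trend.

```python
# Estimated skyline cardinality for n = 1,000,000 tuples as the number of
# dimensions d grows, using the (ln n)^(d-1) / (d-1)! approximation for
# independent dimensions (Bentley et al., 1978).
import math

n = 1_000_000
for d in range(2, 11):
    estimate = math.log(n) ** (d - 1) / math.factorial(d - 1)
    print(f"d={d:2d}  estimated skyline size ~ {estimate:,.0f}")
```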
Skyline frequency is one of the most significant metrics for measuring the interestingness of each Skyline point in the answer. Intuitively, a high Skyline frequency value indicates that the point is dominated in only a few subsets of the multidimensional criteria and, therefore, it is considered a very interesting point because it may dominate in many of the subsets. In (Chan et al., 2006b), the authors proposed an efficient approximate algorithm to estimate the skyline frequency values. (Yuan et al., 2005) define two algorithms to efficiently calculate the Skycube, or the union of the Skylines of the non-empty subsets of the multidimensional criteria. Both algorithms make use of Skyline point properties to speed up the Skycube computation. However, they compute the Skycube completely. To overcome these limitations, we propose an algorithm that takes advantage of the properties of the skyline frequency metric and identifies the subset of the Skyline points that is needed to compute the top-k ones in the answer. In this chapter, we will address the problem of computing Top-k Skyline queries efficiently (Goncalves and Vidal, 2009; Chan et al., 2006b) in a way that the number of probes of the multidimensional function or score function is minimized. This chapter comprises five sections in addition to Section 1, which motivates the problem. Section 2 presents the background required to understand the Top-k Skyline problem. Section 3 will present our Top-k Skyline approach. We will describe an algorithm that is able to compute only the subset of the Skyline that will be required to produce the top-k objects. In Section 4, the quality and performance of the proposed technique will be empirically evaluated against state-of-the-art solutions. Finally, the conclusions of our work will be pointed out in Section 5.
BACKGROUND
In this section, we present a motivating example and preliminary definitions, and we summarize existing approaches to compute the Skyline and Top-k points. Then, we will outline the advantages and limitations of each approach, and we will consider existing solutions defined to calculate the Top-k Skyline. Finally, we will present some metrics proposed to rank the Skyline, which allow scoring the importance of Skyline points without the need to define a score function.
Motivating Example
To motivate the problem of computing Top-k Skyline queries efficiently, consider the DBLP Computer Science Bibliography database (Ley, 2010) and a research institute which offers a travel fellowship to the best three researchers based on their number of publications in the four main database conferences: EDBT, ICDE, VLDB and SIGMOD. The DBLP database provides information on researchers' performance, which includes the number of papers in each conference. The summarized information is organized in the Researcher relational table, where the candidates are described by an identifier, author name, total number of journal papers in SIGMOD Record, VLDBJ, TODS, TKDE, and DKE, and number of papers in EDBT, ICDE, VLDB and SIGMOD. According to the research institute policy, all criteria are equally important and relevant; hence, neither a weight nor a score function can be assigned. A candidate can be chosen for an award if, and only if, there is no other candidate with more papers in EDBT, ICDE, VLDB and SIGMOD. To nominate a candidate, the research institute must identify the set of all the candidates that are non-dominated by any
other candidate in terms of these criteria. Thus, tuples in the Researcher table must be selected in terms of the values of EDBT, ICDE, VLDB and SIGMOD. Following these criteria, the nominees are computed and presented in Table 2; the Skyline frequency metric (SFM) for each researcher is also reported. In total, the DBLP database contains information on at least 1.4 million publications (Ley, 2010). Since the research institute can only grant three awards, it has to select the top-3 researchers among the five nominees. Thus, criteria to discriminate the top-3 researchers among the nominees are needed. The number of journal papers may be used as a score function; in that case, three candidates are the new nominees: 19660, 8846 and 5932 (or 23870). On the other hand, several metrics have been proposed in the literature to distinguish the top-k elements in a set of incomparable researchers. For example, consider the skyline frequency metric (SFM), which measures the number of times a researcher belongs to a Skyline set when different subsets of the conditions in the multi-dimensional criteria are considered. To compute the SFM, the algorithms presented in (Yuan et al., 2005) may be applied. Both algorithms build the non-empty subsets of the multidimensional criteria as shown in Table 3. However, the Skyline may be huge, and it will be completely built by these algorithms (Goncalves and Vidal, 2009). Therefore, to calculate the skyline frequency values, a large number of non-necessary points in all subsets of the multidimensional criteria may be computed. Based on the values of the SFM, the three researchers 8846, 23870 and 19660 are the winners of the research institute fellowship. Intuitively, to select the awarded researchers, queries based on user preferences have been posed against the Researcher table. Skyline (Börzsönyi et al., 2001) and Top-k (Carey and Kossmann, 1997) are two user preference languages that could be used to identify some of the granted researchers. However, neither of them will provide the complete set, and post-processing will be needed to identify the top-3 researchers (Goncalves and Vidal, 2009). To overcome the limitations of existing approaches, we propose a query evaluation algorithm that minimizes the number of non-necessary probes, i.e., this algorithm is able to identify the top-k objects in the Skyline, for which there are not k better Skyline objects in terms of the SFM.
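The nomination step is an ordinary Skyline computation. The sketch below applies a naive block-nested-loop Skyline to the conference counts of the five nominees (values taken from the indices in Table 4); the helper names dominates and skyline are illustrative, not the chapter's algorithms. All criteria are maximized.

```python
# Naive skyline computation over the motivating example: a researcher is a
# nominee iff no other researcher has at least as many papers everywhere and
# strictly more papers in some conference.

def dominates(a, b, dims):
    """a dominates b iff a >= b on every dimension and a > b on at least one."""
    return all(a[d] >= b[d] for d in dims) and any(a[d] > b[d] for d in dims)

def skyline(objects, dims):
    return [o for o in objects
            if not any(dominates(p, o, dims) for p in objects)]

researchers = [
    {"id": 8846,  "EDBT": 10, "ICDE": 23, "VLDB": 35, "SIGMOD": 28},
    {"id": 19660, "EDBT": 9,  "ICDE": 49, "VLDB": 21, "SIGMOD": 18},
    {"id": 5932,  "EDBT": 6,  "ICDE": 37, "VLDB": 24, "SIGMOD": 32},
    {"id": 23870, "EDBT": 3,  "ICDE": 27, "VLDB": 26, "SIGMOD": 39},
    {"id": 20259, "EDBT": 2,  "ICDE": 16, "VLDB": 30, "SIGMOD": 30},
]
# All five researchers are incomparable, so all five are nominees.
print([o["id"] for o in skyline(researchers, ["EDBT", "ICDE", "VLDB", "SIGMOD"])])
```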
Preliminaries
Given a set DO = {o1, …, on} of database objects, where each object oi is characterized by p attributes (A1, …, Ap); r different score functions s1, …, sq, …, sr defined over some of the p attributes, where each si : DO → [0, 1], 1 ≤ i ≤ r; a score function f defined on some scores si, which induces a total order of the objects in DO; and a multicriteria function m defined over a subspace S of the score functions s1, …, sq, which induces a partial order of the objects in DO. For simplicity, we suppose that the scores related to the multicriteria function need to be maximized, and that the score functions s1, …, sq, …, sr respect a natural ordering over the p attributes.

We define the Skyline SKY_S on a space S according to a multicriteria function m as follows:

SKY_S = {oi | oi ∈ DO ∧ ¬(∃ oj | oj ∈ DO : s1(oj) ≥ s1(oi) ∧ … ∧ sq(oj) ≥ sq(oi) ∧ (∃ x | 1 ≤ x ≤ q : sx(oj) > sx(oi)))}

The conditions to be satisfied by the answers of a Top-k Skyline query with respect to the functions m and f are described as follows:

Ω⟨f,m,k⟩ = {oi | oi ∈ SKY_S ∧ (∃≤k−1 oj | oj ∈ SKY_S : f(oj) > f(oi))}

where ∃≤t means that there exist at most t elements in the set. Additionally, the Skyline Frequency may be used as a score function to rank the Skyline:

Ω⟨sf,m,k⟩ = {oi | oi ∈ SKY_S ∧ (∃≤k−1 oj | oj ∈ SKY_S : sf(oj) > sf(oi))}

where the Skyline Frequency of an object o ∈ SKY_S, denoted by sf(o), is the number of subspaces S′ of S in which o is a Skyline object, that is:

sf(o) = (# S′ | S′ ⊆ S ∧ o ∈ SKY_S′ : 1)

The Skycube, or lattice, is the set of all Skylines for any subspace S′ of S, defined as:

SkyCube = {SKY_S′ | S′ ⊆ S}
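The definition of sf(o) translates directly into a brute-force computation: count the non-empty subspaces in which o is non-dominated. The sketch below does exactly that (the dominance test is restated for self-containment); it is exponential in q and meant for illustration only. Applied to the researchers list from the previous sketch, it reproduces the SFM values of Table 4, e.g., sf = 12 for object 8846 and sf = 4 for object 20259.

```python
# Brute-force Skyline Frequency: enumerate the 2^q - 1 non-empty subspaces
# of the multidimensional criteria and count those where obj is a Skyline
# object (i.e., dominated by nobody in that subspace).
from itertools import combinations

def dominates(a, b, dims):
    return all(a[d] >= b[d] for d in dims) and any(a[d] > b[d] for d in dims)

def skyline_frequency(obj, objects, dims):
    return sum(
        1
        for r in range(1, len(dims) + 1)
        for sub in combinations(dims, r)
        if not any(dominates(p, obj, sub) for p in objects)
    )

dims = ["EDBT", "ICDE", "VLDB", "SIGMOD"]
print(skyline_frequency(researchers[0], researchers, dims))   # 12 for 8846
print(skyline_frequency(researchers[-1], researchers, dims))  # 4 for 20259
```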
Finally, the probes of the functions m and sf required to identify the top-k objects in the Skyline correspond to necessary probes, i.e., a probe p of the functions m or sf is necessary if and only if p is performed on an object o ∈ Ω⟨sf,m,k⟩. In this work, we define an algorithm that minimizes the number of non-necessary probes while computing the Top-k Skyline objects with respect to the functions m and sf.
Related Work
Skyline (Börzsönyi et al., 2001) and Top-k (Carey and Kossmann, 1997) approaches have been defined in the context of databases to distinguish the best points that satisfy a given ranking condition. A Skyline-based technique identifies a partially ordered set of points whose order is induced by criteria comprised of conditions on equally important parameters. Top-k approaches select the top-k elements based on a score function or discriminatory criteria that induce a total order of the input set. (Bentley et al., 1978) proposed the first Skyline algorithm, referred to as the maximum vector problem, based on the divide & conquer principle. Recently, progress has been made on how to compute such queries efficiently in a relational system and over large datasets. Block-Nested-Loops (BNL) (Börzsönyi et al., 2001), Sort-Filter-Skyline (SFS) (Godfrey et al., 2005) and LESS (Linear Elimination Sort for Skyline) (Godfrey et al., 2005) are three algorithms that identify the Skyline by scanning the whole dataset. On the other hand, progressive (or online) algorithms for computing the Skyline have been introduced: Tan et al.'s algorithm, NN (Nearest Neighbor) and BBS (Branch-and-Bound Skyline) (Kossmann et al., 2002; Papadias et al., 2003; Tan et al., 2001). A progressive algorithm returns the first results without having to read the entire input and produces more results during execution time. Although these strategies could be used to implement our approach, they may be inefficient because they may perform a number of non-necessary probes or require index structures which are not accessible in Web data sources. In order to process Skyline queries against Web data sources, efficient algorithms have been designed considering sequential and random accesses. Each data source contains object identifiers and their scores. A sequential access retrieves an object from a sorted data source, while a random access returns the score for a given object identifier. The Basic Distributed Skyline (BDS) algorithm defined by (Balke et al., 2004) is one of the algorithms to solve this kind of Skyline query. BDS is a twofold solution which builds a Skyline superset in a first phase and then discards the dominated points in a second phase. A second algorithm, known as Basic Multi-Objective Retrieval (BMOR), is presented by (Balke and Güntzer, 2004); in contrast to BDS, BMOR compares all the seen objects until a seen object that dominates the virtual object is found. The virtual object is constantly updated and is comprised of all the highest values seen so far. Both algorithms avoid scanning the whole dataset and minimize the number of probes. A new hybrid approach that combines the benefits of Skyline and Top-k has been proposed and is known as Top-k Skyline (Goncalves and Vidal, 2009). Top-k Skyline identifies the top-k objects using discriminatory criteria that induce a total order of the objects that compose the Skyline of points satisfying a given multi-dimensional criteria. Top-k Skyline has become necessary in many real-world situations (Vlachou and Vazirgiannis, 2007), and a variety of ranking metrics have been proposed to discriminate among the points in the Skyline, e.g., Skyline Frequency (Chan et al., 2006b), k-dominant skyline (Chan et al., 2006a), and k representative skyline (Lin et al., 2007).
The Skyline Frequency Metric is one of the most significant metrics; it ranks Skyline points in terms of how many times a Skyline point belongs to the Skylines induced by the subsets of the multidimensional criteria, and it measures how well a Skyline point satisfies the different parameters in the multidimensional criteria. Intuitively, a high Skyline Frequency value indicates that a point may be dominated in only smaller subsets of the multidimensional criteria, and it can be considered a very good point because it may dominate in many of the other subsets; in contrast, a Skyline point with a low Skyline Frequency value shows that other Skyline points dominate it in many subsets of the multidimensional criteria. Approaches in (Pei et al., 2006; Yuan et al., 2005) propose two algorithms to compute Skyline Frequency values by building the Skycube, or the union of the Skylines of the non-empty subsets of the multidimensional criteria. The Bottom-Up Skycube algorithm (BUS) (Yuan et al., 2005) identifies the Skycube of d dimensions in a bottom-up fashion. BUS sorts the dataset on each dimension of the multidimensional criteria in a list, and it calculates the Skyline points from one to d dimensions. BUS makes use of Skyline point properties to speed up the Skycube computation. On the other hand, the Top-Down Skycube algorithm (TDS) (Pei et al., 2006) computes the Skycube in a top-down manner based on a Divide and Conquer (DC) Skyline algorithm (Börzsönyi et al., 2001). TDS computes a minimal set of paths in the lattice structure of the Skycube and then identifies the Skylines in these paths. Thus, multiple related Skylines are built simultaneously. BUS and TDS can be used to compute the Top-k Skyline. However, some overhead may have to be paid, because both algorithms compute the Skycube completely. (Goncalves and Vidal, 2009) propose an index-based technique called TKSI to compute the Top-k Skyline points by probing just the minimal subset of incomparable points and using a given score function to distinguish the best points.
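For contrast with TKSI later in the chapter, the following sketch shows the full-materialization baseline just described: build the whole Skyline, score every Skyline point with the Skyline Frequency, sort, and keep the top k. It reuses skyline(), skyline_frequency() and the researchers data from the earlier sketches, and is an illustration of the baseline's cost structure rather than the published BUS algorithm itself.

```python
# BUS/TDS-style baseline for Top-k Skyline: compute everything, then rank.
dims = ["EDBT", "ICDE", "VLDB", "SIGMOD"]
sky = skyline(researchers, dims)                       # the whole Skyline
ranked = sorted(sky,
                key=lambda o: skyline_frequency(o, researchers, dims),
                reverse=True)
print([o["id"] for o in ranked[:3]])  # [8846, 23870, 19660], as in the example
```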
TOP-K SKYLINE
The Top-k Skyline approach identifies the objects that best meet the multi-dimensional criteria based on the Skyline Frequency Metric (SFM). The problem of efficient implementation of Top-k Skyline (EITKS) is defined as the problem of building the set Ω⟨sf,m,k⟩ while minimizing the number of non-necessary probes; a probe p on the multidimensional criteria function m and the Skyline Frequency metric sf is necessary if and only if p is performed on an object o ∈ Ω⟨sf,m,k⟩. The Skyline frequency of an object o returns the number of times o belongs to the Skyline of a non-empty subset of the multidimensional criteria. There are 2^q − 1 non-empty subspaces for a multidimensional criteria function m with q dimensions. In Figure 1, we illustrate that the lattice structure for our example contains 2^4 − 1 subspaces. Moreover, there exists a containment relationship property between the Skylines of subspaces when data are non-duplicated: given two subspaces U, V where U ⊆ V, then SKY_U ⊆ SKY_V. Since o ∈ SKY_U, no object may be better than o in all criteria of V (Chan et al., 2006b). For example, the object 8846 ∈ SKY_{VLDB} also belongs to SKY_{VLDB,SIGMOD} because of the Skyline definition formula, i.e., an object o in SKY_{VLDB,SIGMOD} may dominate 8846 in {SIGMOD} but not in {VLDB, SIGMOD}. BUS is based on the containment relationship property in order to save probes among subspaces. Instead of constructing the Skylines of each subspace individually, BUS builds the lattice in a bottom-up fashion, sharing the results of the Skylines in order to minimize probes of the multidimensional criteria. To illustrate the behavior of BUS with an example, consider the following query: the top-1 candidates with maximum number of papers in EDBT, ICDE, VLDB, and SIGMOD. To answer this query, BUS calculates the Skyline for each 1-dimensional subspace EDBT, ICDE, VLDB, SIGMOD; it then shares results for the Skylines of the 2-dimensional subspaces {EDBT,ICDE}, {EDBT,VLDB}, {EDBT,SIGMOD}, {ICDE,VLDB}, {ICDE,SIGMOD}, {VLDB,SIGMOD} using the containment relationship property; and so on until all 2^4 − 1 Skylines for the subspaces of the multidimensional criteria are built. The Skylines of each subspace are shown in Table 3.
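The containment property can be checked directly on the running example, where values are distinct per dimension as the property requires. The snippet below reuses skyline() and the researchers list from the earlier sketches.

```python
# Verify SKY_U ⊆ SKY_V for U = {VLDB} ⊆ V = {VLDB, SIGMOD}.
u = skyline(researchers, ["VLDB"])
v = skyline(researchers, ["VLDB", "SIGMOD"])
assert {o["id"] for o in u} <= {o["id"] for o in v}
print([o["id"] for o in u], "is contained in", [o["id"] for o in v])
# [8846] is contained in [8846, 23870, 20259]
```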
BUS may be adapted to build the Top-k Skyline. First, BUS computes the lattice including the whole Skyline; second, it calculates the SFM values for each Skyline object; and finally, it sorts the Skyline by the SFM and returns the top-k objects. However, the time complexity for Skyline queries is high, and it depends on the size of the input dataset and the number of probes performed. In general, the problem of identifying the Skyline is O(n^2) (Godfrey et al., 2005); this is because all the input objects need to be compared against each other to probe the multidimensional criteria. Our goal is to minimize non-necessary probes by building the Skyline partially until the top-k objects are produced. BUS requires the computation of the entire Skyline and executes probes of the multidimensional function over the whole set of Skyline objects. Thus, we propose the use of the Top-k Skyline Index (TKSI) to efficiently solve this problem based on the Skyline Frequency Metric (SFM). Consider Table 4, which shows a set of sorted indices I1, …, I5 on each attribute of the multidimensional criteria and on the SFM values, and the first object, 8846, characterized by the highest SFM value. I1, …, I5 contain the objects sorted in descending order. We may observe that no object appears above 8846 in all of the indices I1, …, I4; therefore, 8846 is not dominated by any object and it is a Skyline object. Since 8846 is a Skyline object and has the highest SFM value, 8846 is the Top-1 Skyline. Thus, it is not necessary to completely build the Skyline, whose size is five, in order to produce the answer for our Top-1 Skyline query. Next, we introduce the following property:
Table 4. Indices
I1 (EDBT):   8846: 10,  19660: 9,   5932: 6,   23870: 3,  20259: 2
I2 (ICDE):   19660: 49, 5932: 37,   23870: 27, 8846: 23,  20259: 16
I3 (VLDB):   8846: 35,  20259: 30,  23870: 26, 5932: 24,  19660: 21
I4 (SIGMOD): 23870: 39, 5932: 32,   20259: 30, 8846: 28,  19660: 18
I5 (SFM):    8846: 12,  23870: 10,  19660: 8,  5932: 7,   20259: 4

(Each index lists Id: value pairs, sorted in descending order of the indexed attribute.)
Property 1. Given a set of sorted indices I1, …, Iq on each attribute of the multidimensional criteria m; an index Iq+1 defined on the values of the SFM sf; the Skyline SKY_S; an integer k such that 0 < k ≤ |SKY_S|; and an object o indexed by Iq+1. Then, o is a Top-k Skyline object if no object appears above o in all of the indices I1, …, Iq and there exist at most k − 1 Skyline objects with a higher SFM value than o.

TKSI focuses on first performing sorted accesses on the SFM index Iq+1, and then verifying whether each accessed object is a Top-k Skyline object using the indices on the multidimensional criteria I1, …, Iq. Basically, TKSI receives a set of indices on each attribute of the multidimensional criteria m and the Skyline Frequency metric sf, and an integer k; it builds the Top-k Skyline using the indices (Table 4). Since the indices on the multidimensional criteria are sorted, TKSI does not have to scan the entire index or build the whole Skyline when k is smaller than the Skyline size. Following our example, TKSI accesses the objects from I5 sequentially until the top-1 object is produced. For each object o accessed from I5, TKSI verifies whether o is a Top-k Skyline object. Because the objects are sorted, it is very likely that an object with the highest values in some index of the function m dominates the next objects in the indices. For this reason, TKSI must select one of the indices I1, I2, I3, or I4 in order to minimize the necessary probes over the multicriteria function m. The objects could be accessed in a round-robin fashion. However, in order to speed up the computation, TKSI determines which index has the lowest distance with respect to o, i.e., the index that will avoid accessing more non-necessary objects. To do this, TKSI computes the distance D1 as the difference between the last seen value from I1 and the EDBT value of o (min1 − s1(o)), D2 as the difference between the last seen value from I2 and the ICDE value of o (min2 − s2(o)), D3 as the difference between the last seen value from I3 and the VLDB value of o (min3 − s3(o)), and D4 as the difference between the last seen value from I4 and the SIGMOD value of o (min4 − s4(o)). Next, TKSI selects the minimum value among D1, D2, D3, and D4. Initially, TKSI accesses the first object, 8846, from I5, together with its values for EDBT, ICDE, VLDB and SIGMOD, randomly. Because the objects from I1, I2, I3, and I4 have not been seen yet, TKSI assumes the last seen value is the maximum possible value for each attribute. Therefore, the best distance among D1 = 10 − 10 = 0, D2 = 49 − 23 = 26, D3 = 35 − 35 = 0, and D4 = 39 − 28 = 11 is determined. In this case, I1 and I3 have the minimum distance. Note that 8846 is placed in indices I1 and I3 in a lower (i.e., closer to the top) position than the same object in I2 and I4. The objects of I1 are accessed until the object 19660, with a lower EDBT value, is found. All these objects are compared against 8846 to verify whether some of them dominate it. Since none of the objects dominates 8846, the object 8846 is a Top-k Skyline object. If some object indexed by I1 had dominated 8846, a new object from I5 would have been accessed. The algorithm stops scanning I1 at this point because the objects behind 19660 have worse EDBT values than 8846, and thus they cannot dominate 8846. The detailed TKSI execution is shown in Table 5.
The TKSI algorithm is presented in Figure 2. In the first step, TKSI initializes the set of Top-k Skyline objects, and the variable cont registers the number of top objects produced by the algorithm. In step 2, TKSI identifies the Top-k objects in terms of the SFM. In steps 2a-b), the next best object ot according to the SFM metric is completely accessed. This object is a Top-k Skyline candidate because it has the next best skyline frequency value. However, TKSI must verify whether ot is incomparable. TKSI may select sorted indices in a round-robin way in order to check whether an object is incomparable. Nevertheless, based on the properties of the SFM, we have implemented a heuristic that guides TKSI in the space of possibly good objects and avoids excessive accesses of non-necessary objects. For simplicity, we suppose that the attributes of the multidimensional criteria are maximized. Since 2^q − 1 subspaces are calculated to compute the skyline frequency metric, TKSI computes a monotonic function Σ_{a∈m} s(a) / (2^q − 1) with respect to the last object seen in each source, in order to select the best one. Intuitively, when this function is close to 1.0, it indicates that the object belongs to a large number of Skylines. We are interested in this kind of object because they may quickly discard dominated objects. Because the objects are sorted, it is very likely that an object with the highest values in each index dominates the next objects in the other indices. Thus, the sources with the maximum value of the monotonic function will be selected and scanned to minimize the number of accesses. Finally, if ot is dominated by some seen intermediate object in the selected index, then in step 2d) the object ot is discarded. If ot is non-dominated with respect to the seen objects, then in step 2e) the object ot is a Top-k Skyline Frequency object and it is inserted into the answer set Ω. The algorithm continues in this way until k objects are found. Finally, Theorem 1 gives a lower bound for the TKSI algorithm.
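A simplified, in-memory sketch of the TKSI loop follows: candidates are drawn in SFM order, and each one is verified against the single multicriteria index with the smallest gap (the D_i distances of the walkthrough above). Plain sorted lists stand in for database indices, the index-selection heuristic is reduced to the distance rule, and dominates() is reused from the earlier sketch; this is an illustration, not the authors' implementation.

```python
# TKSI-style loop: verify SFM-ordered candidates against one sorted index,
# stopping the scan as soon as values in that index drop below the candidate.

def tksi(objects, dims, sfm, k):
    by_sfm = sorted(objects, key=lambda o: sfm[o["id"]], reverse=True)
    index = {d: sorted(objects, key=lambda o: o[d], reverse=True) for d in dims}
    answer = []
    for cand in by_sfm:
        if len(answer) == k:
            break
        # Index where cand sits closest to the top: smallest max-value gap.
        d = min(dims, key=lambda a: index[a][0][a] - cand[a])
        dominated = False
        for other in index[d]:
            if other[d] < cand[d]:
                break  # objects further down are worse in d, so cannot dominate
            if other is not cand and dominates(other, cand, dims):
                dominated = True
                break
        if not dominated:
            answer.append(cand)
    return answer

sfm = {8846: 12, 23870: 10, 19660: 8, 5932: 7, 20259: 4}  # SFM values of Table 4
top1 = tksi(researchers, ["EDBT", "ICDE", "VLDB", "SIGMOD"], sfm, k=1)
print([o["id"] for o in top1])  # [8846], without building the whole Skyline
```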
Theorem 1. Given a set of sorted indices I1, …, Iq on each attribute of the multidimensional criteria m; an index Iq+1 defined on the values of the SFM sf; the Skyline SKY_S; and an integer k such that 0 < k ≤ |SKY_S|. Then, a lower bound on the number of probes performed by TKSI is 2k. The best case for the TKSI algorithm is when each object o accessed through the index Iq+1 is compared against only one object o′ of some index I1, …, Iq, and each object o is in the answer. Thus, 2k probes are necessary, because k objects are compared and, for each object in Iq+1, TKSI verifies whether o′ dominates o and whether o dominates o′.
EXPERIMENTAL STUDY
Dataset and Query Benchmark: We shredded the downloaded DBLP file (Ley, 2010) into a relational database; the DBLP features are shown in Table 6. We randomly generated 25 queries by restricting the numbers of papers by each author in the DBLP dataset; the queries are characterized by the following properties: (a) only one table in the FROM clause; (b) the attributes in the multicriteria function were selected following a uniform distribution; (c) the directives for each attribute of the multicriteria function were selected considering only maximization; (d) the number of attributes of the multicriteria function is five, six or seven; and (e) k is 3. Evaluation Metrics: We report the Number of Probes (NP), the ratio of the Skyline size, and the Normalized Skyline Frequency value (NSF). NP is the number of probes of the multidimensional criteria and Skyline Frequency Metric evaluations performed by the algorithm. NSF is a quality metric that represents the percentage of non-empty subspaces of the multidimensional criteria; it indicates how good a Skyline object is. NSF is computed as follows: SFM / (2^q − 1). Implementations: The TKSI and BUS algorithms were developed in Java (64-bit JDK version 1.5.0_12) on top of Oracle 9i. A set of sorted queries is executed for each criterion of the multicriteria and score functions, and the resultsets are stored in indices. The resultsets are sorted in descending order according to the MAX criteria of the multicriteria function. Each resultset is accessed on demand.
Furthermore, a set of hash maps is built, one for each index. These hash maps are comprised of the objects accessed by each index. Also, a variable for each index is updated with the last value seen in that index. Initially, these variables are set to the best possible values. Later, they are updated according to the last object accessed by each index. Thus, TKSI accesses the first object o from the index over the score function. It selects the index Ii with the lowest gap with respect to o. The resultset of the selected index Ii is scanned until some object from Ii dominates o, or until none of the objects better than o in attribute i can dominate o. If o is incomparable, then o is a Top-k Skyline object and it is added to the set of answers. This process continues until the top-k objects are computed. The DBLP data were stored in relational tables on Oracle 9i, and sorted based on each dimension. The experiments were evaluated on a SunFire V440 machine equipped with 2 Sparcv9 processors at 1,281 MHz, 16 GB of memory and 4 Ultra320 SCSI disks of 73 GB, running SunOS 5.10 (Solaris 10).
CONCLUSION
Skyline size can be very large in the presence of high-dimensional Skyline spaces, making it unfeasible for users to process this set of points. Top-k Skyline has been proposed in order to identify the top k points among the Skyline. Top-k Skyline uses discriminatory criteria to induce a total order of the points that comprise the Skyline, and recognizes the best k objects based on these criteria. Different algorithms have been defined to compute the top-k objects among the Skyline; while existing solutions are able to produce the Top-k Skyline, they may be very costly. First, state-of-the-art Top-k Skyline solutions require the computation of the whole Skyline; second, they execute probes of the multicriteria function over the whole set of Skyline points. Thus, if k is much smaller than the cardinality of the Skyline, these solutions may be very inefficient because a large number of non-necessary probes may be evaluated. In this chapter, we presented the problem of identifying the top-k objects that best meet multidimensional criteria and adapted TKSI, an efficient solution for the Top-k Skyline that overcomes the drawbacks of existing solutions. TKSI is an index-based algorithm that is able to compute only the subset of the Skyline that is required to produce the top-k objects; thus, TKSI is able to minimize the number of non-necessary probes. TKSI was empirically compared to an extension of the state-of-the-art algorithm BUS. BUS relies on the computation of the whole Skyline to identify the Top-k Skyline, while TKSI builds the Skyline only until it has computed the k objects. Initial experimental results show that TKSI computes the Top-k Skyline performing a smaller number of probes, and that our approach is able to identify good-quality objects and outperform state-of-the-art solutions.
REFERENCES
W3C Semantic Web discussion list. (2010). KIT releases 14 billion triples to the linked open data cloud. Retrieved from http://permalink.gmane.org/gmane.org.w3c.semantic-web/12889

Balke, W., & Güntzer, U. (2004). Multi-objective query processing for database systems. In Proceedings of the International Conference on Very Large Data Bases (VLDB), (pp. 936–947).

Balke, W., Güntzer, U., & Zheng, J. (2004). Efficient distributed skylining for Web Information Systems. In Proceedings of the International Conference on Extending Database Technology (EDBT), (pp. 256–273).

Bentley, J., Kung, H. T., Schkolnick, M., & Thompson, C. D. (1978). On the average number of maxima in a set of vectors and applications. Journal of the ACM (JACM).

Börzsönyi, S., Kossmann, D., & Stocker, K. (2001). The skyline operator. In Proceedings of the 17th International Conference on Data Engineering, (pp. 421–430). Washington, DC: IEEE Computer Society.

Carey, M. J., & Kossmann, D. (1997). On saying "Enough already!" in SQL. SIGMOD Record, 26(2), 219–230. doi:10.1145/253262.253302

Chan, C.-Y., Jagadish, H. V., Tan, K.-L., Tung, A. K. H., & Zhang, Z. (2006a). Finding k-dominant skylines in high dimensional space. In SIGMOD '06: Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, (pp. 503–514). New York: ACM.

Chan, C.-Y., Jagadish, H. V., Tan, K.-L., Tung, A. K. H., & Zhang, Z. (2006b). On high dimensional skylines. In Proceedings of the International Conference on Extending Database Technology (EDBT), (pp. 478–495).

De Kunder, M. (2010). The size of the World Wide Web. Retrieved from http://www.worldwidewebsize.com

Godfrey, P., Shipley, R., & Gryz, J. (2005). Maximal vector computation in large data sets. In VLDB '05: Proceedings of the 31st International Conference on Very Large Data Bases (VLDB), (pp. 229–240).
Goncalves, M., & Vidal, M.-E. (2009). Reaching the top of the Skyline: An efficient indexed algorithm for Top-k Skyline queries. In Proceedings of the International Conference on Database and Expert Systems Applications (DEXA), (pp. 471–485).

Kossmann, D., Ramsak, F., & Rost, S. (2002). Shooting stars in the sky: An online algorithm for skyline queries. In Proceedings of the 28th International Conference on Very Large Data Bases (VLDB), (pp. 275–286).

Ley, M. (2010). The DBLP computer science bibliography. Retrieved from http://www.informatik.uni-trier.de/~ley/db

Lin, X., Yuan, Y., Zhang, Q., & Zhang, Y. (2007). Selecting stars: The k most representative skyline operator. In Proceedings of the International Conference on Data Engineering (ICDE), (pp. 86–95).

Papadias, D., Tao, Y., Fu, G., & Seeger, B. (2003). An optimal and progressive algorithm for skyline queries. In SIGMOD '03: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, (pp. 467–478). New York: ACM Press.

Pei, J., Yuan, Y., Lin, X., Jin, W., Ester, M., & Wang, Q. L. W. (2006). Towards multidimensional subspace skyline analysis. ACM Transactions on Database Systems, 31(4), 1335–1381. doi:10.1145/1189769.1189774

Tan, K., Eng, P., & Ooi, B. (2001). Efficient progressive skyline computation. In Proceedings of the 27th International Conference on Very Large Data Bases (VLDB), (pp. 301–310).

Vlachou, A., & Vazirgiannis, M. (2007). Link-based ranking of skyline result sets. In Proceedings of the 3rd Multidisciplinary Workshop on Advances in Preference Handling (M-Pref).

Yuan, Y., Lin, X., Liu, Q., Wang, W., Yu, J. X., & Zhang, Q. (2005). Efficient computation of the skyline cube. In VLDB '05: Proceedings of the 31st International Conference on Very Large Data Bases (VLDB), (pp. 241–252). VLDB Endowment.
KEY TERMS AND DEFINITIONS

Top-k: The set of the best k objects based on a score function. Top-k uses a score function to induce a total order of the input dataset.

Top-k Skyline: The top-k objects among the Skyline. Top-k Skyline uses a score function to induce a total order of the Skyline points, and recognizes the top-k objects based on these criteria.

Skycube: The set of the Skylines of all the subsets of the multidimensional criteria.
Remarks on a Fuzzy Approach to Flexible Database Querying, Its Extension and Relation to Data Mining and Summarization
Janusz Kacprzyk, Polish Academy of Sciences, Poland
Guy De Tré, Ghent University, Belgium
Sławomir Zadrożny, Polish Academy of Sciences, Poland
Chapter 5
ABSTRACT
For an effective and efficient information search in databases, various issues should be solved. A very important one, though still usually neglected by traditional database management systems, is related to a proper representation of user preferences and intentions, and then their representation in querying languages. In many scenarios, they are not clear-cut, and often have their original form deeply rooted in natural language, implying a need for flexible querying. Although research on introducing elements of natural language into database querying languages dates back to the late 1970s, practical commercial solutions are still not widely available. This chapter is meant to revive the line of research in flexible querying languages based on the use of fuzzy logic. The chapter recalls the details of a basic technique of flexible fuzzy querying, discusses some of the newest developments in this area and, moreover, shows how other relevant tasks may be implemented in the framework of such a query interface. In particular, it considers fuzzy queries with linguistic quantifiers and shows their intrinsic relation with linguistic data summarization. Moreover, the chapter discusses so-called bipolar queries and advocates them as the next relevant breakthrough in flexible querying based on fuzzy logic and possibility theory.
INTRODUCTION
Databases are a crucial element of all kinds of information systems, which are in turn the backbone of virtually all kinds of nontrivial human activities. The growing power and falling prices of computer hardware and software, including those that have a direct impact on database technology, have implied an avalanche-like growth of the volume of data stored all over the world. That huge volume makes an effective and efficient use of the information resources in databases difficult. On the other hand, the use of databases is no longer an area where only database professionals are active; in fact, nowadays most users are novices. This implies a need for a proper human-computer (database) interaction which would adapt to the specifics of the human being, mainly, in our context, to the fact that for the human user the only fully natural means of articulation and communication is natural language, with its inherent imprecision. The aspects mentioned above, the importance of which has been growing over the last decades, have triggered many research efforts, notably related to what is generally termed flexible querying, and to some human-consistent approaches to data mining and knowledge discovery, including the use of natural language, for instance in linguistic data summarization.

Basically, the construction of a database query consists in spelling out conditions that should be met by the data sought. Very often, the meaning of these conditions is deeply rooted in natural language, i.e., their original formulation is available in the form of natural language utterances. It is then, often with difficulty, translated into the mathematical formulas requested by traditional query languages. For example, looking for a suitable house in a real estate agency database, one may prefer a cheap one. In order to pose a query, the concept of "cheap" has to be expressed by an interval of prices. The bounds of such an interval will usually be rather difficult to assess. Thus, a tool to somehow define the notion of being cheap may essentially ease the construction of a query. The same definition may then be used in other queries referring to this concept, also in the context of other words, such as "very". Words of this kind, interpreted as so-called modifiers, modify the meaning of the original concept in a way that may be assumed context-independent and expressed by a strict mathematical formula. It seems obvious that a condition referring to such terms as "cheap", "large", etc. should be considered, in general, to be satisfied to a degree, rather than as satisfied or not satisfied, as is assumed in the classical approach to database querying. Thus, the notion of the matching degree is one of the characteristic features of flexible fuzzy queries. Moreover, a query usually comprises more than just one condition. In such a case, the user may require various combinations of conditions to be met. Classically, one may directly require only the satisfaction of all conditions, or the satisfaction of any one condition. However, these are in fact only extreme cases of conceivable aggregation requirements. For instance, a user may be completely satisfied with data satisfying most of his or her conditions. The study of the modeling of such natural language terms as "cheap", "very" or "most" for the purposes of database querying is the most important part of the agenda of flexible fuzzy querying research.
In this paper we will present a focused overview of the main research results on the development of flexible querying techniques that are based on fuzzy set theory (Zadeh, 1965). The scope of the chapter is further limited to an overview of those techniques that aim to enhance database querying by introducing various forms of user specified fuzzy preferences (Bosc, Kraft & Petry, 2005). We will not consider other techniques that are relevant in this area, exemplified by self-correcting, navigational, cooperative, etc. querying systems.
For our purposes, we will view a fuzzy query as a combination of a number of imprecisely specified (fuzzy) conditions on attribute values to be met. Fuzzy preferences in queries are introduced inside query conditions and between query conditions. For the former, fuzzy preferences are introduced inside query conditions via flexible search criteria, which make it possible to indicate a graded desirability of particular values. For the latter, fuzzy preferences between query conditions are given via grades of importance of particular query conditions. The research on fuzzy querying already has a long history, starting with the seminal works of Zadeh during his stay at the IBM Almaden Research Center in the late 1970s, and the first attempt to use fuzzy logic in database querying by Zadeh's doctoral student Tahani (1977). The area soon enjoyed great popularity, with many articles in the early period related both to database querying per se and to the relevant area of textual information retrieval [cf. (Bookstein, 1980; Bosc & Pivert, 1992a; Kacprzyk & Ziółkowski, 1986; Kacprzyk, Zadrożny & Ziółkowski, 1989), etc.], and books, cf. (Zemankova & Kandel, 1984). Later, the field became an area of huge research efforts. For an early account of the main issues and perspectives, we can refer the reader to Zemankova & Kacprzyk (1993), while for recent, comprehensive state-of-the-art presentations, to Rosado, Ribeiro, Zadrożny & Kacprzyk (2006), Zadrożny, De Tré & De Caluwe (2008), etc. Some novel and practically relevant developments in broadly perceived data mining and data warehousing have greatly increased interest in fuzzy querying. Notable examples here are works on the combination of fuzzy querying and data mining interfaces for an effective and efficient linguistic summarization of data [cf. (Kacprzyk & Zadrożny, 2000a; Kacprzyk & Zadrożny, 2000b)] or on fuzzy logic and the OLAP (Online Analytical Processing) technology (Laurent, 2003). The purpose of this chapter is to present those developments of, and related to, fuzzy querying in a focused way, to show their essence and applicability. We will start with a general introduction to fuzzy querying in (numeric) relational databases, adding some remarks on the use of the object-oriented paradigm. Then, we will mention some attempts to add additional user-specified preference information via so-called bipolar queries which, in their particular form, make it possible to include mandatory and optional requirements. Then, we will show the usefulness of fuzzy queries as a vehicle for an effective and efficient generation of linguistic summaries.
BACKGROUND
A relational database is meant as a collection of relations, characterized by sets of attributes and populated with tuples, which are represented by tables comprising rows and columns. In what follows, we will freely use both notions, of relations and of tables, interchangeably, which should not lead to any misunderstanding. Each relation R is defined via the relation schema:

R(A1 : Dom(A1), A2 : Dom(A2), …, An : Dom(An))    (1)
where the Ai's are the names of attributes (columns) and the Dom(Ai)'s are their associated domains. To retrieve data, a user forms a query specifying some conditions (criteria). The retrieval process may be meant, in our context of fuzzy querying, as the calculation of a matching degree for each tuple of the relevant relation(s), usually from [0,1], as opposed to {0,1} as in traditional querying.
Basically, one can follow two general formal approaches to querying: the relational algebra and the relational calculus. However, for our purposes the exact form of queries is not that important, as we focus on the condition part of queries. A fuzzy set F in the universe U is characterized by a membership function

μF : U → [0, 1]    (2)
where, for each element x ∈ U, μF(x) denotes the membership grade, or extent to which x belongs to F. Fuzzy sets make it possible to represent vague concepts, like "tall man", by reflecting the graduality of such a concept.
The matching degree g(A = l, t) of an atomic fuzzy condition A = l against a tuple t is then computed as

g(A = l, t) = μl(x)

where x is t[A], i.e., the value of tuple t for attribute A, and μl is the membership function of the fuzzy set representing the linguistic term l. The degree g for complex conditions, such as age = YOUNG AND (salary = HIGH OR empyear = RECENT), is obtained using the fuzzy logical connectives, i.e.,

g(p ∧ q, t) = min(g(p, t), g(q, t))
g(p ∨ q, t) = max(g(p, t), g(q, t))
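The following minimal sketch evaluates exactly this kind of compound fuzzy condition. The attribute names come from the example above; the shoulder-shaped membership functions and their breakpoints are illustrative assumptions.

```python
# Fuzzy condition evaluation with the min/max connectives: each linguistic
# term is a membership function; AND is min, OR is max.

def shoulder_up(a, b):
    """0 below a, linear rise on [a, b], 1 above b."""
    return lambda x: 0.0 if x <= a else 1.0 if x >= b else (x - a) / (b - a)

def shoulder_down(a, b):
    """1 below a, linear fall on [a, b], 0 above b."""
    return lambda x: 1.0 if x <= a else 0.0 if x >= b else (b - x) / (b - a)

mu_young  = shoulder_down(25, 35)     # fully YOUNG up to 25, not at all past 35
mu_high   = shoulder_up(3000, 5000)   # HIGH salary rises between 3000 and 5000
mu_recent = shoulder_up(2005, 2010)   # RECENT employment year

t = {"age": 30, "salary": 4000, "empyear": 2008}
# g(age = YOUNG AND (salary = HIGH OR empyear = RECENT), t)
g = min(mu_young(t["age"]), max(mu_high(t["salary"]), mu_recent(t["empyear"])))
print(g)  # min(0.5, max(0.5, 0.6)) = 0.5
```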
where p, q are conditions. The minimum and maximum may be replaced by, e.g., a t-norm and a t-conorm (Klement, Mesiar & Pap, 2000) to model the conjunction and disjunction connectives, respectively. Among the earlier contributions using the relational calculus instead of the relational algebra is Takahashi (1995), where he proposes FQL (Fuzzy Query Language), meant as a fuzzy extension of the domain relational calculus (DRC). A more complete approach has been proposed by Buckles, Petry & Sachar (1989) in the more general context of fuzzy databases, which is, however, also applicable to the crisp relational databases considered here. Zadrożny & Kacprzyk (2002) proposed to interpret elements of the DRC in terms of a variant of fuzzy logic. This approach also makes it possible to account for preferences between query conditions in a uniform way.
In order to be meaningful, weights should satisfy some natural conditions [cf. (Dubois & Prade, 1997; Dubois, Fargier & Prade, 1997)]. An interesting distinction is between static and dynamic weights. Basically, for the static weights, which are used in most approaches, Dubois and Prade (1997) propose the following framework. Assume that a query condition p is a conjunction (or disjunction) of weighted elementary query conditions pi, and denote by g(pi, t) the matching degree of pi for a tuple t without any importance weight assigned. Then, the matching degree g(pi*, t) of an elementary condition pi with an importance weight wi ∈ [0,1] assigned is:

g(pi*, t) = wi ⇒ g(pi, t)   (7)

where ⇒ is a fuzzy implication connective. The overall matching degree of the whole query composed of the conjunction of conditions pi is calculated using the standard min operator. Depending on the type of the fuzzy implication operator used, we get various interpretations of the importance weights. For example, using the Dienes implication we obtain from (7):

g(pi*, t) = max(g(pi, t), 1 - wi)   (8)

and for the Gödel implication:

g(pi*, t) = 1 if g(pi, t) ≥ wi, and g(pi, t) otherwise   (9)
while for the Goguen implication:

g(pi*, t) = 1 if g(pi, t) ≥ wi, and g(pi, t)/wi otherwise   (10)
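A small Python sketch of the static weighting scheme (7)-(9) may help; the matching degrees and weights below are made up for illustration:

def dienes(w, g):
    # (8): weighting via the Dienes implication w => g
    return max(g, 1.0 - w)

def goedel(w, g):
    # (9): weighting via the Goedel implication w => g
    return 1.0 if g >= w else g

def weighted_match(pairs, implication=dienes):
    # Conjunction of weighted conditions, aggregated with min
    return min(implication(w, g) for g, w in pairs)

# (matching degree g(p_i, t), importance weight w_i) for one tuple t
pairs = [(0.9, 1.0), (0.4, 0.5), (0.7, 0.2)]
print(weighted_match(pairs))                      # min(0.9, 0.5, 0.8) = 0.5
print(weighted_match(pairs, implication=goedel))  # min(0.9, 0.4, 1.0) = 0.4

Note how a low weight (here 0.2) effectively neutralizes its condition: under Dienes it lifts the degree towards 1, and under Goedel it makes the condition fully satisfied as soon as its degree exceeds the weight.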
In the case of dynamic weights, Dubois & Prade (1997) deal with a variable importance wi ∈ [0,1] depending on the matching degree of the associated elementary condition. Basically, when using dynamic weights and dynamic weight assignments, neither the weights nor the associations between weights and criteria are known in advance. Both the weights and their assignments then depend on the attribute values of the record(s) on which the query criteria act; for example, the condition "high salary" may be unimportant unless the salary value is extremely high.

Other Flexible Aggregation Schemes

The use of other flexible aggregation schemes is also a subject of intensive research in flexible, fuzzy logic based querying. In (Kacprzyk & Ziółkowski, 1986) and (Kacprzyk, Zadrożny & Ziółkowski, 1989) the aggregation of partial queries (conditions) driven by a linguistic quantifier was first described, by considering conditions:

p = Q out of {p1, ..., pk}   (11)
where Q is a linguistic (fuzzy) quantifier and the pi are elementary conditions to be aggregated. For example, in the context of a US-based company, one may classify an order as troublesome if it meets most of the following conditions: it comes from outside the USA, its total value is low, its shipping costs are high, the employee responsible for it is known to be not completely reliable, the amount of the ordered goods in stock is not much greater than the amount ordered, etc. The overall matching degree may be computed using any of the approaches used to model linguistic quantifier driven aggregation. In (Kacprzyk & Ziółkowski, 1986; Kacprzyk, Zadrożny & Ziółkowski, 1989) first the linguistic quantifiers in the sense of Zadeh (1983) and later the OWA operators (Yager, 1988) are used (cf. Kacprzyk & Zadrożny, 1997; Zadrożny & Kacprzyk, 2009b). In Zadeh's approach (1983), a linguistically quantified proposition, exemplified by "Most conditions are satisfied", is written as:

Qy's are F   (13)

where Q is a linguistic quantifier (e.g., most), Y = {y} is a set of objects (e.g., conditions), and F is a property (e.g., satisfied). Importance B may be added, yielding:

QBy's are F   (14)
e.g., "Most (Q) of the important (B) conditions (y's) are satisfied (F)."   (15)
The problem is to find truth(Qy's are F) or truth(QBy's are F), respectively, knowing truth(y is F) for each y ∈ Y. To this end, the property F and the importance B are represented by fuzzy sets in Y, and a (proportional, nondecreasing) linguistic quantifier Q is assumed to be a fuzzy set in [0,1] as, e.g.,

μQ(x) = 1 for x ≥ 0.8; 2x - 0.6 for 0.3 < x < 0.8; 0 for x ≤ 0.3   (16)

Then, due to Zadeh (1983):

truth(Qy's are F) = μQ((1/n) Σ(i=1..n) μF(yi))   (17)

truth(QBy's are F) = μQ(Σ(i=1..n) (μB(yi) ∧ μF(yi)) / Σ(i=1..n) μB(yi))   (18)
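The calculus (16)-(18) is straightforward to implement; in the following Python sketch the degrees of satisfaction and importance of the particular conditions are illustrative assumptions:

def mu_most(x):
    # The quantifier "most" as defined in (16)
    if x >= 0.8:
        return 1.0
    if x <= 0.3:
        return 0.0
    return 2.0 * x - 0.6

def truth_q(f):
    # (17): truth(Q y's are F)
    return mu_most(sum(f) / len(f))

def truth_qb(f, b):
    # (18): truth(QB y's are F), with min as the conjunction
    return mu_most(sum(min(fi, bi) for fi, bi in zip(f, b)) / sum(b))

f = [1.0, 0.8, 0.6, 0.9, 0.2]   # satisfaction degrees of the conditions
b = [1.0, 0.5, 1.0, 0.3, 0.9]   # their importance degrees
print(truth_q(f))       # mu_most(0.7) = 0.8
print(truth_qb(f, b))   # mu_most(2.6 / 3.7) ~ 0.81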
There is a large body of work on this topic, studying various possible interpretations of linguistic quantifiers for flexible querying purposes; cf., e.g., (Bosc, Pivert & Lietard, 2001; Bosc, Lietard & Pivert, 2003; Galindo, Urrutia & Piattini, 2006; Vila, Cubero, Medina & Pons, 1997). The linguistic quantifier guided aggregation is also relevant for our further considerations concerning the data mining related extensions of flexible fuzzy querying discussed in what follows.
2. terms corresponding to the non-standard aggregation operators, including:
   a. fuzzy (linguistic) quantifiers (e.g., most),
   b. importance coefficients (e.g., important to a degree 0.8 or very important, etc.)
The query languages of classic DBMSs do not provide any means for representing such linguistic terms, and some extensions of the classic query language SQL are needed. These extensions concern both the syntax of the language and a proper interpretation of the particular new linguistic constructs, accompanied by some scheme for their representation, elicitation and manipulation. Here we discuss these issues using the example of FQUERY for Access.
The Syntax
We will focus our attention on the well-known SELECT...FROM...WHERE command of the SQL, and will deal only with its WHERE clause. Starting with its simplified version, e.g., excluding subqueries, we propose the following additions to the usual syntax of this clause providing for the direct use of linguistic terms:
<WHERE-clause> ::= WHERE <condition>
<condition> ::= <linguistic quantifier> <sequence of subconditions>
<sequence of subconditions> ::= <subcondition> | <subcondition> OR <sequence of subconditions>
<subcondition> ::= <importance coefficient> <linguistic quantifier> <sequence of atomic conditions>
<sequence of atomic conditions> ::= <atomic condition> | <atomic condition> AND <sequence of atomic conditions>
<atomic condition> ::= <attribute> = <modifier> <fuzzy value> |
    <attribute> <fuzzy relation> <attribute> |
    <attribute> <fuzzy relation> <number> |
    <single-valued-attribute> IN <fuzzy-set constant> |
    <multi-valued-attribute> <compatibility operator> <fuzzy-set constant>
<attribute> ::= <numeric field>
<linguistic quantifier> ::= <OWA-tag> <quantifier name>
<OWA-tag> ::= OWA | <empty>
<modifier> ::= VERY | MORE OR LESS | RATHER | NOT | <empty>
<compatibility operator> ::= possible matching | necessary matching | Jaccard compatibility
Now, let us discuss the particular categories of linguistic terms listed above. In what follows, we mainly use examples referring to a hypothetical database of a real estate agency. Particular houses are characterized by: price, land area, location (region), number of bedrooms and bathrooms, and other life quality indicators, such as an overall assessment of the environment, transportation infrastructure or shopping facilities.
Atomic condition. The basic building block of a query condition is the atomic condition. Basically, it contains a name of an attribute and a constraint imposed on the value of this attribute. Such a constraint may be a traditional, crisp one, as, e.g., in Price <= 200,000. It may also employ one of the linguistic terms, as, e.g.:

1. Price = low (numeric fuzzy value)
2. Land area = very large (numeric fuzzy value + modifier)
3. Price is not much greater than 250,000 (fuzzy relation)
4. Location belongs to favorite regions (fuzzy set constant)
5. Life quality indicators are compatible with the high quality of life pattern (multi-valued attribute + fuzzy set constant)
Numeric fuzzy values are to be used in connection with numeric fields, e.g., with the field Price. The meaning of such a linguistic term is intuitively obvious, although rather subjective. Thus, it should be possible for each user to define his or her own meaning of the linguistic term low. On the other hand, it would be advantageous to make it possible to reuse an already defined term, like low, for various fields. Numeric fuzzy values may be accompanied by modifiers, e.g., very, that directly correspond to similar structures of natural language.

Fuzzy relations make it possible to soften the rigidness of crisp relations. In the third example given above, the atomic condition employs the much greater than fuzzy relation accompanied by the negation operator treated as a modifier. Thus, such a condition will accept a price of, e.g., 255,000, which seems much more practical than treating 250,000 as a sharp limit.

The examples discussed so far employed linguistic terms to be used along with numeric data. The fourth example introduces a fuzzy set constant, which is similar to a numeric fuzzy value but meant to be used with scalar data. In this example, the favorite regions constant represents the user's preferences as to the location of the house sought. The concept of favorite regions will quite often turn out to be fuzzy, i.e., some regions will be perceived by the user as the best location, some will be completely rejected, and the rest will be acceptable to a degree. Obviously, such a concept is highly subjective. Finally, the fifth example presents the concept of a multi-valued attribute, but for simplicity we will not discuss this case here, referring the reader to (Zadrożny & Kacprzyk, 1996).

The sequence of atomic conditions is just a conjunction of atomic conditions. Due to the fact that particular atomic conditions may be satisfied to a degree, we need to employ some generalization of the classical AND logical connective, notably a t-norm, in particular the min operator. In order to achieve the flexibility of aggregation postulated earlier, linguistic quantifiers are implemented in FQUERY for Access. Each sequence of atomic conditions may additionally be assigned an importance coefficient. That way, the user may vary the degree to which a given sequence contributes to the overall satisfaction degree of the whole query. Finally, the sequence of atomic conditions, possibly accompanied by a linguistic quantifier and an importance coefficient, is called a subcondition. The sequence of subconditions is the disjunction of subconditions. This structuring of the various elements of the condition adheres to the scheme assumed in Microsoft Access. As in the case of the AND connective, the OR connective is replaced by the corresponding max operator of fuzzy logic. Again, it may be further replaced by a linguistic quantifier, and importance coefficients may be assigned to subconditions.
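To connect the grammar with this semantics, here is a minimal Python sketch evaluating a hypothetical query "WHERE (Price = low AND Location IN favorite regions) OR (Land area = VERY large)". The data, fuzzy values and the modeling of "very" as squaring (a common concentration convention) are illustrative assumptions, not FQUERY's actual definitions:

def mu_low_price(price):   # assumed meaning of "low" (USD)
    return max(0.0, min(1.0, (300_000 - price) / 100_000))

def mu_large_area(area):   # assumed meaning of "large" (square meters)
    return max(0.0, min(1.0, (area - 500) / 500))

favorite_regions = {"North": 1.0, "Center": 0.6}   # a fuzzy set constant

def match(house):
    # subcondition 1: Price = low AND Location IN favorite regions (AND -> min)
    sub1 = min(mu_low_price(house["price"]),
               favorite_regions.get(house["region"], 0.0))
    # subcondition 2: Land area = VERY large ("very" as squaring)
    sub2 = mu_large_area(house["area"]) ** 2
    # the sequence of subconditions is a disjunction (OR -> max)
    return max(sub1, sub2)

print(match({"price": 240_000, "area": 800, "region": "Center"}))  # 0.6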
(19)
where md stands for the overall matching degree and the mdi's denote the partial matching degrees computed for the n component conditions.
The concept of bipolar queries goes back to the work of Lacroix and Lavency (1987). Their approach was quickly followed by Bosc & Pivert (1992c, 1993), and then extended and discussed in detail by Dubois & Prade (2002, 2008) and in our papers (Matthé & De Tré, 2009; Zadrożny, 2005; Zadrożny & Kacprzyk, 2006, 2007, 2009a; De Tré et al., 2009). In the most general setting relevant for our considerations, bipolarity is understood as follows. The user expressing his or her preferences concerning the data sought is assumed to consider both positive and negative aspects. Both may be considered more or less independently and may be aggregated (or not) by the user in many different ways. Thus, our aim should be to provide the user with means making the expression of such bipolar preferences as convenient as possible.

Bipolarity may be modeled using two basic models (Grabisch, Greco & Pirlot, 2008): bipolar univariate and unipolar bivariate. The former assumes one scale with three main levels of, respectively, negative, neutral and positive evaluation, gradually changing from one end of the scale to the other, giving rise to some intermediate levels. The latter model of bipolarity assumes two independent scales which separately account for the positive and negative evaluations. In the first case the negative and positive assessments are somehow combined by the user and only an aggregated overall assessment is expressed as one number, usually from the [-1, 1] interval. Intuitively, the negative numbers express an overall negative assessment, 0 expresses a neutral assessment, and the positive numbers express an overall positive assessment. In the case of the unipolar bivariate scale, the positive and negative assessments are expressed separately on two unipolar scales, usually by two numbers from the [0,1] interval.

Now, we will briefly discuss various aspects of the concept of bipolarity in the context of flexible fuzzy queries, because we think it is important to distinguish the various interpretations of this term used in the literature. First, in the classic fuzzy approach a unipolar univariate scale is tacitly assumed: we have a degree to which a given attribute value is compatible with the meaning of a given linguistic term l and, thus, the degree to which this value satisfies a query condition. There is no way to distinguish between negative (rejected, bad) and positive (accepted, good) values. Bipolarity may manifest itself at the level of each attribute domain or at the level of the comprehensive evaluation of the whole tuple. In the former case, the user may see particular elements of the domain as negative, positive or neutral, to a degree. This classification should, of course, influence the matching degree of a tuple having a particular element of the domain as the value of the attribute under consideration. In the latter case the user is expected to express some conditions, involving possibly many attributes, which when satisfied by a tuple make it negative or positive.

In the case of the unipolar bivariate model we can distinguish a special interpretation which is further discussed in this paper. Namely, the negative and positive assessments are treated as corresponding to conditions which are required and preferred to be satisfied, respectively. Thus, the former condition has to be satisfied necessarily and the latter only if possible. The negative assessment in this interpretation is identified with the degree to which the required condition is not satisfied.
For example, if a person sought has to be young (the required condition), then his or her negative assessment corresponds to the degree to which he or she is not young, i.e., to which he or she satisfies the negation of the required condition. The preferred condition, on the other hand, characterizes those tuples (persons) which are really desired, with the understanding that the violation of such a condition by a tuple does not necessarily cause its rejection. This special interpretation of bipolarity is in fact predominant in the literature. Lacroix and Lavency (1987) first introduced a query comprising two categories of conditions: one which is mandatory (C) and another which expresses just mere preferences (desires) (P). The bipolarity of these conditions becomes evident when one adopts the following interpretation. The former condition C expresses a negative preference: the tuples which do not satisfy it definitely do not match the whole query, while the latter condition P expresses a positive preference. These conditions will be referred to as the negative and positive condition, for short, and the whole query will be denoted as (C,P). We will identify the negative and positive conditions of a bipolar query with the predicates that represent them and denote them as C and P, respectively. Let us denote the set of all tuples under consideration by T. For a tuple t ∈ T, C(t) and P(t) will denote that the tuple t satisfies the respective condition. The bipolar query in this approach may be expressed in natural language as follows: "Find tuples t satisfying C and possibly satisfying P", exemplified by: "Find a house cheaper than USD 250,000 and possibly located not more than two blocks from a railway station". Such a query may be formally written as

C and possibly P   (20)
The key problem, which we consider here, is a proper modeling of the aggregation of both types of conditions, which is expressed here with the use of the "and possibly" operator. Thus, we are mainly concerned with how to combine the negative and positive evaluations (assessments) in order to come up with a standard evaluation on a unipolar univariate scale, which provides an obvious ordering of the tuples in the answer to the query. An alternative is not to aggregate, and to order the tuples with respect to their matching of the required and preferred conditions taken separately; this way is adopted, e.g., by Dubois and Prade (2002). However, the interpretation of the "and possibly" operator is quite intuitive and possesses some interesting properties [cf. also (Bordogna & Pasi, 1995)]. According to the original (crisp) approach of Lacroix & Lavency (1987), such an operator has an important property: the aggregation result depends not only on the explicit arguments, i.e., C(t) and P(t), but also on the content of the database. If there are no tuples meeting both conditions, then the result of the aggregation is determined by the negative condition C alone. Otherwise the aggregation becomes a regular conjunction of both conditions. This dependence is best expressed by the following logical formula (Lacroix & Lavency, 1987):

C(t) and possibly P(t) ⇔ C(t) ∧ (∃s (C(s) ∧ P(s)) ⇒ P(t))   (21)
If the conditions C and P are crisp, then this characteristic property is preserved if the "first select using C and then order using P" interpretation of (20) is adopted, i.e., when first the tuples satisfying C are selected and then ordered according to P. However, if both conditions C and P are fuzzy, then it is no longer clear what it should mean that a tuple satisfies the condition C, as the satisfaction of this condition is now a matter of degree. In our approach we start with the formula (21) and, using the standard fuzzy counterparts of the classical logical connectives, interpret it in terms of fuzzy logic, obtaining the membership function of the fuzzy answer set to a bipolar query (C, P) with respect to a set of tuples T, ans(C,P,T), as:

ans(C,P,T)(t) = min(C(t), max(1 - max(s∈T) min(C(s), P(s)), P(t)))   (22)
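A direct Python transcription of (22), with made-up degrees for C and P over a handful of tuples, illustrates the content-dependence of the "and possibly" operator:

def bipolar_answer(C, P, T):
    # Membership degrees (22) of the fuzzy answer set to a bipolar
    # query (C, P); C and P map a tuple to its matching degree.
    best_joint = max(min(C(s), P(s)) for s in T)   # some tuple satisfies C and P
    return {t: min(C(t), max(1.0 - best_joint, P(t))) for t in T}

# Hypothetical degrees: price-based condition C, location-based preference P.
houses = {"h1": (0.9, 0.2), "h2": (0.8, 0.9), "h3": (0.3, 1.0)}
C = lambda t: houses[t][0]
P = lambda t: houses[t][1]

for t, md in sorted(bipolar_answer(C, P, houses).items(), key=lambda kv: -kv[1]):
    print(t, round(md, 2))   # h2 0.8, h3 0.3, h1 0.2

Note that h1, although almost perfectly satisfying C, is pushed down: other tuples in the database show that C and P can be satisfied together, so P effectively becomes obligatory.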
The matching degree of a tuple against a bipolar query (C, P) is thus meant here as the truth value of (21), computed in the framework of fuzzy (multivalued) logic using the right-hand side of (22). Thus, the evaluation of a bipolar query in this approach produces a fuzzy set of tuples, where the membership function value for a tuple t corresponds to the matching degree of this tuple against the query. The answer to a bipolar query is then a list of the tuples, non-increasingly ordered according to their membership degrees. In (22), the min, max and 1-x operators are used to model the connectives of conjunction, disjunction and negation, respectively. Moreover, the implication connective is modeled by the Kleene-Dienes implication operator and the existential quantifier is modeled via the maximum operator. As there are many other alternatives, the issue arises of how to appropriately model the logical connectives in (22). For more information on this issue, as well as on other issues related to bipolar queries, we refer the reader to our works (Zadrożny & Kacprzyk, 2007, 2009a; Matthé & De Tré, 2009; De Tré et al., 2009). Concluding this section, it has to be stressed that the research on bipolarity in the framework of database querying is still in its infancy. Despite some very advanced theoretical treatments [cf. (Dubois & Prade, 2008)], a vast area of possible interpretations is not yet covered and further research is definitely needed.
with T(most of the recently hired employees are young) = 0.7. The truth T may be meant in a more general sense, e.g., as validity or, even more generally, as some quality or goodness of a linguistic summary. The quantity in agreement, Q, is an indication of the extent to which the data satisfy the summary, and two types of linguistic quantity in agreement can be used: absolute, e.g., about 5, more or less 100, several, and relative, e.g., a few, more or less a half, most, almost all, etc.
Notice that the above linguistic expressions are again the fuzzy linguistic quantifiers whose use we advocate in the framework of fuzzy flexible querying. Thus, they may be modeled and processed using Zadeh's (1983) approach, and this way we obtain the truth T of a linguistic summary. The basic validity criterion, i.e., the truth of a linguistically quantified statement given by (13) and (14), is certainly the most natural and important, but it does not grasp all aspects of a linguistic summary, and hence some other criteria have been proposed, notably by Kacprzyk & Yager (2001), and Kacprzyk, Yager & Zadrożny (2000). These include the degrees of imprecision, covering, appropriateness, the length of a summary, etc. (cf. Kacprzyk & Zadrożny, 2010). The problem is to find a best summary, i.e., one with the highest value of some weighted average of the satisfaction of the assumed criteria. One can clearly notice that a fully automatic determination of a best linguistic summary may be infeasible in practice due to the high number of possible summaries obtained via a combination of all possible linguistic quantifiers, summarizers and qualifiers. In (Kacprzyk & Zadrożny, 2001a) an interactive approach was proposed, with user assistance in the selection of summarizers, qualifiers and linguistic quantifiers. Basically, given a set of data D, we can hypothesize any appropriate summarizer S, qualifier R and quantity in agreement Q, and the assumed measure of truth will indicate the quality of the summary. In our interactive approach it is assumed that such hypothetical summaries are proposed by the user via a flexible fuzzy querying interface, such as that provided by FQUERY for Access. It may easily be noticed that the components of linguistic summaries are also components of fuzzy queries as implemented in FQUERY for Access. In particular, atomic conditions with fuzzy values are perfect simple summarizers and qualifiers, which may be further combined to obtain more sophisticated summaries. Linguistic quantifiers are used in fuzzy queries to aggregate partial matching degrees, but exactly the same computations are required in order to obtain the truth of a linguistic summary. Therefore, the derivation of a linguistic summary may proceed in an interactive (user assisted) way as follows: the user formulates a set of linguistic summaries of interest (relevance) using the fuzzy querying add-in interface; the system retrieves records from the database and calculates the validity of each summary adopted; and a best (most appropriate) linguistic summary is chosen.
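As a rough Python sketch of this interactive loop (the candidate fuzzy values, the data and the quantifier are illustrative assumptions, not FQUERY's actual internals), one may iterate over user-proposed summarizers, compute the truth of each summary by (17), and retain the best:

def mu_most(x):
    # the quantifier "most", as in (16)
    return max(0.0, min(1.0, 2.0 * x - 0.6))

# User-proposed candidate summarizers for a personnel database, e.g.
# instantiations of the placeholder in "most employees are age = ?".
candidates = {
    "young":       lambda e: max(0.0, min(1.0, (40 - e["age"]) / 10)),
    "middle-aged": lambda e: max(0.0, 1.0 - abs(e["age"] - 45) / 15),
    "old":         lambda e: max(0.0, min(1.0, (e["age"] - 50) / 10)),
}

employees = [{"age": a} for a in (29, 31, 33, 35, 38, 52)]

def truth(summarizer):
    # (17): truth("most employees are <summarizer>")
    return mu_most(sum(summarizer(e) for e in employees) / len(employees))

best = max(candidates, key=lambda name: truth(candidates[name]))
print("Most employees are", best)   # "young" wins for this data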
Table 2. Classification of linguistic summaries (Sstructure denotes that the attributes and their connection in a summary are known, while Svalue denotes a non-instantiated part of a protoform (a summarizer sought))

Type  Given            Sought   Remarks
1     S                Q        Simple summaries through ad-hoc queries
2     S B              Q        Conditional summaries through ad-hoc queries
3     Q Sstructure     Svalue   Simple value oriented summaries
4     Q Sstructure B   Svalue   Conditional value oriented summaries
5     Nothing          S B Q    General fuzzy rules
Kacprzyk & Zadrożny (2005, 2009) proposed to use the concept of a protoform in the sense of Zadeh (2006) as a template underlying both the internal representation of linguistic summaries and their formation in a dialogue with the user. A protoform is defined as an abstract prototype of a linguistically quantified proposition, and its most abstract form is given by (14). Less abstract protoforms are obtained by instantiating particular elements of (14), i.e., for example, by replacing F with a condition/property "price is cheap". A more subtle instantiation is also possible where, e.g., only an attribute, price, is specified and its (fuzzy) value is left unspecified. Thus, the user constructs a more or less abstract protoform, and the role of the system is to complete it with all missing elements (e.g., referring to our previous example of the protoform, all possible fuzzy values representing the price) and check the truth value (or other quality indicator) of each linguistic summary thus obtained. Of course, this is fairly easy for a fully instantiated protoform, such as (23), but much more difficult, if possible at all, for the fully abstract protoform (14). In Table 2 we show a classification of linguistic summaries into 5 basic types corresponding to protoforms of an increasingly abstract form. Type 1 and 2 summaries may easily be produced by a simple extension of a fuzzy querying interface such as that provided by FQUERY for Access. Basically, the user has to construct a query, i.e., a candidate summary, and it has to be determined what fraction of the rows match this query and what linguistic quantifier best denotes this fraction. Type 3 summaries require much more effort. Their primary goal is to determine typical (exceptional) values of an attribute. So, the query S consists of only one simple condition built of the attribute whose typical (exceptional) value is sought, the = relational operator, and a placeholder for the value sought. The latter corresponds to the non-instantiated part of the underlying protoform. For example, using the following summary in the context of a personnel database: Q = most and S = "age = ?" (here "?" denotes the placeholder mentioned above), we look for a typical value of age. A Type 4 summary may produce typical (exceptional) values for some, possibly fuzzy, subset of rows. From the computational point of view, Type 5 summaries, corresponding to the most abstract protoform (14), represent fuzzy rules describing dependencies between specific values of particular attributes. The summaries of Type 1 and 3 have actually been implemented in the framework of FQUERY for Access. As for possible future directions, we can mention the new proposals to explicitly base linguistic data summarization in the sense considered here, i.e., founded on the concept of Zadeh's computing with words, on some developments in computational linguistics. First, Kacprzyk & Zadrożny (2010a) have proposed to consider linguistic summarization in the context of natural language generation (NLG). Second, Kacprzyk & Zadrożny (2010b) suggested the use of some natural language generation (NLG)
related elements of Halliday's systemic functional linguistics (SFL). We think that these new directions of research will play a considerable role in finding better tools and techniques for linguistic data summarization that better take into account the intrinsic imprecision of natural language.
REFERENCES
Bookstein, A. (1980). Fuzzy requests: An approach to weighted Boolean searches. Journal of the American Society for Information Science, 31, 240-247. doi:10.1002/asi.4630310403

Bordogna, G., & Pasi, G. (1995). Linguistic aggregation operators of selection criteria in fuzzy information retrieval. International Journal of Intelligent Systems, 10(2), 233-248. doi:10.1002/int.4550100205

Bosc, P. (1999). Fuzzy databases. In Bezdek, J. (Ed.), Fuzzy sets in approximate reasoning and Information Systems (pp. 403-468). Boston: Kluwer Academic Publishers.

Bosc, P., Kraft, D., & Petry, F. E. (2005). Fuzzy sets in database and Information Systems: Status and opportunities. Fuzzy Sets and Systems, 153(3), 418-426. doi:10.1016/j.fss.2005.05.039

Bosc, P., Lietard, L., & Pivert, O. (2003). Sugeno fuzzy integral as a basis for the interpretation of flexible queries involving monotonic aggregates. Information Processing & Management, 39(2), 287-306. doi:10.1016/S0306-4573(02)00053-5
Bosc, P., & Pivert, O. (1992a). Some approaches for relational databases flexible querying. International Journal on Intelligent Information Systems, 1, 323-354. doi:10.1007/BF00962923

Bosc, P., & Pivert, O. (1992b). Fuzzy querying in conventional databases. In Zadeh, L. A., & Kacprzyk, J. (Eds.), Fuzzy logic for the management of uncertainty (pp. 645-671). New York: Wiley.

Bosc, P., & Pivert, O. (1992c). Discriminated answers and databases: Fuzzy sets as a unifying expression means. In Proceedings of the IEEE International Conference on Fuzzy Systems (FUZZ-IEEE) (pp. 745-752). San Diego, USA.

Bosc, P., & Pivert, O. (1993). An approach for a hierarchical aggregation of fuzzy predicates. In Proceedings of the 2nd IEEE International Conference on Fuzzy Systems (FUZZ-IEEE'93) (pp. 1231-1236). San Francisco, USA.

Bosc, P., & Pivert, O. (1995). SQLf: A relational database language for fuzzy querying. IEEE Transactions on Fuzzy Systems, 3, 1-17. doi:10.1109/91.366566

Bosc, P., Pivert, O., & Lietard, L. (2001). Aggregate operators in database flexible querying. In Proceedings of the IEEE International Conference on Fuzzy Systems (FUZZ-IEEE 2001) (pp. 1231-1234). Melbourne, Australia.

Buckles, B. P., Petry, F. E., & Sachar, H. S. (1989). A domain calculus for fuzzy relational databases. Fuzzy Sets and Systems, 29, 327-340. doi:10.1016/0165-0114(89)90044-4

De Tré, G., De Caluwe, R., Tourné, K., & Matthé, T. (2003). Theoretical considerations ensuing from experiments with flexible querying. In T. Bilgiç, B. De Baets & O. Kaynak (Eds.), Proceedings of the IFSA 2003 World Congress (pp. 388-391). (LNCS 2715). Springer.

De Tré, G., Verstraete, J., Hallez, A., Matthé, T., & De Caluwe, R. (2006). The handling of select-project-join operations in a relational framework supported by possibilistic logic. In Proceedings of the 11th International Conference on Information Processing and Management of Uncertainty in Knowledge-based Systems (IPMU) (pp. 2181-2188). Paris, France.

De Tré, G., Zadrożny, S., Matthé, T., Kacprzyk, J., & Bronselaer, A. (2009). Dealing with positive and negative query criteria in fuzzy database querying. (LNCS 5822), (pp. 593-604).

Dubois, D., & Prade, H. (1997). Using fuzzy sets in flexible querying: Why and how? In Andreasen, T., Christiansen, H., & Larsen, H. L. (Eds.), Flexible query answering systems. Dordrecht: Kluwer Academic Publishers.

Dubois, D., & Prade, H. (2002). Bipolarity in flexible querying. (LNAI 2522), (pp. 174-182).

Dubois, D., & Prade, H. (2008). Handling bipolar queries in fuzzy information processing. In Galindo, J. (Ed.), Handbook of research on fuzzy information processing in databases (pp. 97-114). New York: Information Science Reference.

Galindo, J., Medina, J. M., Pons, O., & Cubero, J. C. (1998). A server for Fuzzy SQL queries. In T. Andreasen, H. Christiansen & H. L. Larsen (Eds.), Proceedings of the Third International Conference on Flexible Query Answering Systems (pp. 164-174). (LNAI 1495). London: Springer-Verlag.
Galindo, J., Urrutia, A., & Piattini, M. (2006). Fuzzy databases: Modeling, design and implementation. Hershey, PA: Idea Group Publishing.

Grabisch, M., Greco, S., & Pirlot, M. (2008). Bipolar and bivariate models in multicriteria decision analysis: Descriptive and constructive approaches. International Journal of Intelligent Systems, 23, 930-969. doi:10.1002/int.20301

Kacprzyk, J., & Yager, R. R. (2001). Linguistic summaries of data using fuzzy logic. International Journal of General Systems, 30, 133-154. doi:10.1080/03081070108960702

Kacprzyk, J., Yager, R. R., & Zadrożny, S. (2000). A fuzzy logic based approach to linguistic summaries of databases. International Journal of Applied Mathematics and Computer Science, 10, 813-834.

Kacprzyk, J., & Zadrożny, S. (1995). FQUERY for Access: Fuzzy querying for Windows-based DBMS. In Bosc, P., & Kacprzyk, J. (Eds.), Fuzziness in database management systems (pp. 415-433). Heidelberg, Germany: Physica-Verlag.

Kacprzyk, J., & Zadrożny, S. (1997). Implementation of OWA operators in fuzzy querying for Microsoft Access. In Yager, R. R., & Kacprzyk, J. (Eds.), The ordered weighted averaging operators: Theory and applications (pp. 293-306). Boston: Kluwer Academic Publishers.

Kacprzyk, J., & Zadrożny, S. (2000a). On a fuzzy querying and data mining interface. Kybernetika, 36, 657-670.

Kacprzyk, J., & Zadrożny, S. (2000b). On combining intelligent querying and data mining using fuzzy logic concepts. In Bordogna, G., & Pasi, G. (Eds.), Recent research issues on fuzzy databases (pp. 67-81). Heidelberg: Physica-Verlag.

Kacprzyk, J., & Zadrożny, S. (2001a). Data mining via linguistic summaries of databases: An interactive approach. In Ding, L. (Ed.), A new paradigm of knowledge engineering by soft computing (pp. 325-345). Singapore: World Scientific. doi:10.1142/9789812794604_0015

Kacprzyk, J., & Zadrożny, S. (2001b). Computing with words in intelligent database querying: Standalone and Internet-based applications. Information Sciences, 134, 71-109. doi:10.1016/S0020-0255(01)00093-7

Kacprzyk, J., & Zadrożny, S. (2005). Linguistic database summaries and their protoforms: Towards natural language based knowledge discovery tools. Information Sciences, 173, 281-304. doi:10.1016/j.ins.2005.03.002

Kacprzyk, J., & Zadrożny, S. (2009). Protoforms of linguistic database summaries as a human consistent tool for using natural language in data mining. International Journal of Software Science and Computational Intelligence, 1(1), 100-111.

Kacprzyk, J., & Zadrożny, S. (2010). Computing with words and systemic functional linguistics: Linguistic data summaries and natural language generation. In Huynh, V.-N., Nakamori, Y., Lawry, J., & Inuiguchi, M. (Eds.), Integrated uncertainty management and applications (pp. 23-36). Heidelberg: Springer-Verlag. doi:10.1007/978-3-642-11960-6_3

Kacprzyk, J., & Zadrożny, S. (in press). Computing with words is an implementable paradigm: Fuzzy queries, linguistic data summaries and natural language generation. IEEE Transactions on Fuzzy Systems.
Kacprzyk, J., Zadrożny, S., & Ziółkowski, A. (1989). FQUERY III+: A human-consistent database querying system based on fuzzy logic with linguistic quantifiers. Information Systems, 14, 443-453. doi:10.1016/0306-4379(89)90012-4

Kacprzyk, J., & Ziółkowski, A. (1986). Database queries with fuzzy linguistic quantifiers. IEEE Transactions on Systems, Man, and Cybernetics, 16, 474-479. doi:10.1109/TSMC.1986.4308982

Klement, E. P., Mesiar, R., & Pap, E. (Eds.). (2000). Triangular norms. Dordrecht, Boston, London: Kluwer Academic Publishers.

Lacroix, M., & Lavency, P. (1987). Preferences: Putting more knowledge into queries. In Proceedings of the 13th International Conference on Very Large Databases (pp. 217-225). Brighton, UK.

Laurent, A. (2003). Querying fuzzy multidimensional databases: Unary operators and their properties. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 11, 31-46. doi:10.1142/S0218488503002259

Matthé, T., & De Tré, G. (2009). Bipolar query satisfaction using satisfaction and dissatisfaction degrees: Bipolar satisfaction degrees. In S. Y. Shin & S. Ossowski (Eds.), Proceedings of the SAC Conference (pp. 1699-1703). ACM.

Rosado, A., Ribeiro, R., Zadrożny, S., & Kacprzyk, J. (2006). Flexible query languages for relational databases: An overview. In Bordogna, G., & Psaila, G. (Eds.), Flexible databases supporting imprecision and uncertainty (pp. 3-53). Berlin, Heidelberg: Springer Verlag. doi:10.1007/3-540-33289-8_1

Tahani, V. (1977). A conceptual framework for fuzzy query processing: A step toward very intelligent database systems. Information Processing & Management, 13, 289-303. doi:10.1016/0306-4573(77)90018-8

Takahashi, Y. (1995). A fuzzy query language for relational databases. In Bosc, P., & Kacprzyk, J. (Eds.), Fuzziness in database management systems (pp. 365-384). Heidelberg, Germany: Physica-Verlag.

Umano, M., & Fukami, S. (1994). Fuzzy relational algebra for possibility-distribution-fuzzy relational model of fuzzy data. Journal of Intelligent Information Systems, 3, 7-27. doi:10.1007/BF01014018

Vila, M. A., Cubero, J.-C., Medina, J.-M., & Pons, O. (1997). Using OWA operator in flexible query processing. In Yager, R. R., & Kacprzyk, J. (Eds.), The ordered weighted averaging operators: Theory and applications (pp. 258-274). Boston: Kluwer Academic Publishers.

Yager, R. R. (1982). A new approach to the summarization of data. Information Sciences, 28, 69-86. doi:10.1016/0020-0255(82)90033-0

Yager, R. R. (1988). On ordered weighted averaging aggregation operators in multi-criteria decision making. IEEE Transactions on Systems, Man, and Cybernetics, 18, 183-190. doi:10.1109/21.87068

Yager, R. R., & Kacprzyk, J. (1997). The ordered weighted averaging operators: Theory and applications. Boston: Kluwer.

Zadeh, L. A. (1965). Fuzzy sets. Information and Control, 8(3), 338-353. doi:10.1016/S0019-9958(65)90241-X
Zadeh, L. A. (1983). A computational approach to fuzzy quantifiers in natural languages. Computers & Mathematics with Applications (Oxford, England), 9, 149-184. doi:10.1016/0898-1221(83)90013-5

Zadeh, L. A. (2006). From search engines to question answering systems: The problems of world knowledge, relevance, deduction and precisiation. In Sanchez, E. (Ed.), Fuzzy logic and the Semantic Web (pp. 163-210). Amsterdam: Elsevier.

Zadrożny, S. (2005). Bipolar queries revisited. In V. Torra, Y. Narukawa & S. Miyamoto (Eds.), Modelling decisions for artificial intelligence (MDAI 2005) (pp. 387-398). (LNAI 3558). Berlin, Heidelberg: Springer-Verlag.

Zadrożny, S., De Tré, G., De Caluwe, R., & Kacprzyk, J. (2008). An overview of fuzzy approaches to flexible database querying. In Galindo, J. (Ed.), Handbook of research on fuzzy information processing in databases (pp. 34-54). Hershey, PA / New York: Idea Group, Inc.

Zadrożny, S., & Kacprzyk, J. (1996). Multi-valued fields and values in fuzzy querying via FQUERY for Access. In Proceedings of FUZZ-IEEE'96, Fifth International Conference on Fuzzy Systems, New Orleans, USA (pp. 1351-1357).

Zadrożny, S., & Kacprzyk, J. (2002). Fuzzy querying of relational databases: A fuzzy logic view. In Proceedings of the EUROFUSE Workshop on Information Systems (pp. 153-158). Varenna, Italy.

Zadrożny, S., & Kacprzyk, J. (2006). Bipolar queries and queries with preferences. In Proceedings of the 17th International Conference on Database and Expert Systems Applications (DEXA'06), Krakow, Poland (pp. 415-419). IEEE Computer Society.

Zadrożny, S., & Kacprzyk, J. (2007). Bipolar queries using various interpretations of logical connectives (pp. 181-190).

Zadrożny, S., & Kacprzyk, J. (2009a). Bipolar queries: An approach and its various interpretations. In J. P. Carvalho, D. Dubois, U. Kaymak, & J. M. da Costa Sousa (Eds.), Proceedings of the IFSA/EUSFLAT Conference (pp. 1288-1293).

Zadrożny, S., & Kacprzyk, J. (2009b). Issues in the practical use of the OWA operators in fuzzy querying. Journal of Intelligent Information Systems, 33(3), 307-325. doi:10.1007/s10844-008-0068-1

Zemankova, M., & Kacprzyk, J. (1993). The roles of fuzzy logic and management of uncertainty in building intelligent Information Systems. Journal of Intelligent Information Systems, 2, 311-317. doi:10.1007/BF00961658

Zemankova-Leech, M., & Kandel, A. (1984). Fuzzy relational databases: A key to expert systems. Cologne, Germany: Verlag TÜV Rheinland.
KEY TERMS AND DEFINITIONS

Database: A collection of persistent data. In a database, data are modeled in accordance with a database model. This model defines the structure of the data, the constraints for integrity and security, and the behavior of the data.

Fuzzy Query: A database query which involves imprecisely specified search conditions. These conditions are often expressed using terms of a natural language which are modeled using fuzzy logic.

Linguistic Quantifier: A natural language expression, such as most or around 5, which expresses an imprecise proportion or quantity.

Linguistic Summary of Data: A linguistic summarization (by a sentence or a small number of sentences in a natural language) of a set of data.

Protoform: An abstract prototype of a linguistically quantified proposition, i.e., of an expression such as "Most employees are young".

Relational Database: A database that is modeled in accordance with the relational database model. In the relational database model, the data are structured in relations that are represented by tables. The behavior of the data is defined in terms of the relational algebra, which originally consists of eight operators (union, intersection, difference, cross product, join, selection, projection and division), or in terms of the relational calculus, which is of a declarative nature.
Chapter 6
ABSTRACT
Spatial Data Infrastructures (SDI) allow users connected to the Internet to share and access remote and distributed heterogeneous geodata that are managed by their providers at their own Web sites. In SDIs, available geodata can be found via standard discovery geo-services that make available the query facilities of a metadata catalog. By expressing precise selection conditions on the values of the metadata collected in the catalog, the user can discover interesting and relevant geodata and then access them by means of the services of the SDI. An important dimension of geodata that often concerns such users' requests is the temporal information, which can have multiple semantics. Current practice in performing geodata discovery in SDIs is inadequate for several reasons. First of all, with respect to the temporal characterization, the available recommendations for metadata specification, for example, the INSPIRE Directive of the European Community, do not consider the multiple semantics of the temporal metadata. To this aim, this chapter proposes to enrich the current temporal metadata with the possibility to indicate temporal metadata related to the observations, i.e., the geodata, the observed event, i.e., the objects in the geodata, and the temporal resolution of the observations, i.e., their timestamps. The chapter also introduces a proposal to manage temporal series of geodata observed at different dates. Moreover, in order to represent the uncertain and incomplete knowledge of the time information on the available geodata, the chapter proposes a representation for imperfect temporal metadata within the fuzzy set framework. Another issue that is faced in this chapter is the inadequacy of current discovery service query facilities: in order to obtain a list of geodata results, the corresponding values of metadata must exactly match the query conditions. To allow more flexibility, the chapter proposes to adopt the framework of fuzzy databases to allow expressing soft selection conditions, i.e., conditions tolerant to under-satisfaction, so as to retrieve geodata in decreasing order of relevance to the user needs. The chapter illustrates this proposal by an example.

DOI: 10.4018/978-1-60960-475-2.ch006
INTRODUCTION
Infrastructures are complex systems in which a network of interconnected but autonomous components is used for the exchange and mobility of goods, persons, and information. Their successful exploitation requires technologies, policies, investments in money and personnel, common standards and harmonized rules. Typical examples of infrastructures which are critical for society are transportation and water supply. In Information Technology, the term infrastructure can be related to communication channels through which information can be located, exchanged, accessed, and possibly elaborated.

The importance of Spatial Data Infrastructures (SDIs) has been recognized since the United Nations Conference on Environment and Development in Rio de Janeiro in 1992. Geographic information is vital to making sound decisions at the local, regional, and global levels. Crime management, business development, flood mitigation, environmental restoration, community land use assessments and disaster recovery are just a few examples of areas in which decision-makers can benefit from geographic information, together with the associated Spatial Data Infrastructure (SDI) that supports information discovery, access, and use of this information in the decision-making process. In time, the role of discovery services for data with a geographic reference (geodata) has become a main concern of governments and institutions, and central to many activities in our society. In order to take political and socio-economic decisions, administrators must analyze data with a geographic reference; for example, governments define funding strategies on the basis of the CO2 pollution distribution. Even in everyday life, people need to consider data regarding the area in which they live, move, work and act; for example, consider a family wishing to reach the mountains for a skiing holiday and looking for meteorological data. In order to be useful, the data they are looking for should fit the area and period of time of their interest; they should be able to trust the quality of the data; and, if possible, they should obtain what they need with simple searching operations, in a way that allows evaluating the fitness of the data with respect to their needs and purposes.

In 2007, the INSPIRE Directive of the European Parliament and of the Council entered into force (INSPIRE Directive, 2007) to trigger the creation of a European Spatial Data Infrastructure (ESDI) that delivers integrated spatial information services to users. These services should allow users to discover and possibly access spatial or geographical information from a wide range of sources, from the
local to the global level, in an interoperable way, for a variety of uses. Discovery is performed through services that should follow the INSPIRE standards and can be implemented through some products (either proprietary or not) that declare their compliance. Nevertheless, the current technologies adopted in SDIs, and consequently the actual practice for searching geographic information, do not comply with the way users express their needs and search for information, and hamper the ability and practices of geodata providers.

One main problem is due to the characteristics of the information on the available geodata, i.e., the metadata. Metadata is an essential requirement for locating and evaluating available geodata, and metadata standards can increase and facilitate geodata sharing through time and space. For this reason, considerable efforts have been spent to define standard and minimum core metadata for the geodata to be used in SDIs. INSPIRE is nowadays a directive of the European Community that comprises metadata specifications (European Commission, 2009). Nevertheless, such specifications are still incomplete and inadequate, for they do not allow specifying all the necessary information on geodata as far as the temporal dimension is concerned, and they force providers to generate precise metadata values, which are missing in many real cases (Dekkers, 2008; Bordogna et al., 2009). This chapter analyses the utility of temporal metadata on geodata and proposes a framework to represent their imprecise and uncertain values as well as to express their possible multiple semantics.

Another problem with the current information search practice in SDIs is that the user is forced to express precise conditions on the metadata values, which must be perfectly satisfied in order to obtain a result. Further, the results are unordered, with no indication of relevance, so the user must access the remote geodata to become aware of their actual adequacy to his/her needs. Even when a Web service is invoked, this may cause useless network overloading. The framework that this chapter proposes for optimizing this search practice is to allow users to express flexible queries, with tolerant (soft) conditions on the metadata values, so as to retrieve results ranked in decreasing order of their satisfaction of the query. To this aim, the fuzzy database framework can provide an effective way to model the discovery service of SDIs, since it allows both representing and flexibly querying imperfect metadata (Bosc and Pivert, 2000).

In the next paragraph, this chapter will discuss the limitations of the current temporal metadata in the discovery services of SDIs and propose some solutions. Then, the proposal of a formal and operational method to represent imperfect temporal metadata values and to allow users to express soft search conditions, i.e., conditions tolerant to under-satisfaction, is presented. In doing so, discovery services can apply partial matching mechanisms between the desired metadata, expressed by the user, and the archived metadata: this would allow retrieving geodata in decreasing order of relevance to the user needs, as usually occurs on the Web when using search engines. In the last paragraph, the proposal is illustrated with an example, while the concluding paragraph describes the context of this research work.
The object of our attention is the data discovery service, which is in charge of both providing the discovery services and managing the metadata. Such metadata are defined to support distinct purposes, namely:

Geodata discovery: What geodata hold the characteristics I am interested in? This enables users to know what geodata the organizations have and make available.
Exploration activity: Do the identified geodata contain sufficient information for my purposes? This is documentation on the geodata that must be provided to ensure that others can use the geodata correctly.
Exploitation activity: What is the process of obtaining and using the required geodata? This helps end users and provider organisations to effectively store, reuse, maintain and archive their data holdings.

Two kinds of metadata sets are managed by the discovery service: the data set metadata, which describe the summary information and characteristics of the available geodata (these include human-generated textual descriptions of geodata and machine-generated data), and the service metadata, which are descriptions of the service characteristics made available by the SDI, for example services to discover geodata.
The discovery service works by comparing the users' requests for information, coming from the Application and Geoportal Layer, with the metadata that describe the available resources, following a well known and widely developed retrieval model. Figure 2 illustrates the three main components of a retrieval model, i.e., the data representation, the query representation, and the matching mechanism. In order to improve the effectiveness of the discovery, the representations must be comparable and the matching must be adequate to them (Cater and Kraft, 1989; Salton, 1989; Bordogna, Carrara, Pasi, 1991). The matching mechanism is in charge of identifying all the metadata whose content satisfies (at least partially) the user's selection conditions. In order to allow automatic matching,
the selection conditions and the resource descriptions must be expressed in the same formal representation at some stage of the matching process. In particular, in the discovery of geodata sets, the query selection conditions expressed by a user should clearly specify what is of interest through content keywords (e.g., landslides), where the interesting features should be located (e.g., in a bounding box surrounding the Alps), when these features should have been observed (e.g., the date(s) of the observation), and why they are searched (e.g., to detect the recurrence of landslides). On the provider side, the metadata should describe the georesources following the same four aspects.
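For illustration, the four aspects could be carried by a structured request such as the following Python sketch; the field names and values are our own assumptions, not part of the INSPIRE specification:

# A hypothetical internal form of a user's discovery request, organized
# along the what/where/when/why aspects discussed above.
request = {
    "what":  ["landslides"],                           # content keywords
    "where": {"bbox": (5.9, 45.7, 10.5, 47.9)},        # lon/lat box around the Alps
    "when":  {"observed": ("1986-01-01", "1986-12-31")},
    "why":   "detect recurrence of landslides",        # intended use
}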
Temporal reference
Temporal extent
of time values to be adopted either in expressing dates or time periods and to suggest how to preserve precision in exchange or conversion processes. However, these recommendations are still limited, with respect to both the requirements of time stamping by metadata providers, and the representation of the temporal search conditions necessary to geodata users.
The current metadata specification allows the creation of a unique metadata item for a whole set of maps, assigning the value "series" to the metadata element ResourceType (see section 2.2.3, European Commission, 2009). This unique metadata item is used to describe the whole set, since all the satellite products share all their attributes (e.g., geographic bounding box, lineage, provider, etc.) but time. In particular, the temporal extent element of the metadata describing the series should address the whole temporal range targeted by the maps: this goal can be achieved by specifying either a time period starting with the first image date and ending with the date of the last image, or a sequence of time instants, each of them corresponding to an image date (see Table 2). A combination of the two specifications could be adopted too. From the metadata provider's point of view, managing the map collection as a time series avoids a long and error-prone metadata creation activity for each map; from the user's point of view, the discovery service presents to her/him a unique resource, instead of a huge amount of similar results. However, in terms of discovery, it is not possible to filter a single image in the set on the basis of its own timestamp. The problem of temporal filtering can nevertheless be solved by filling the metadata element Resource locator (see section 2.2.4, European Commission, 2009), which can be used to link the metadata of the series to a Web service providing the maps. In this way, a proper client application could allow users to specify dates or periods that are automatically combined with the service URL, and submit a correctly formatted request to the service that provides the related maps.
As regards events and observations, Dekkers (2008) has reported that, looking at the use of temporal information for discovery, users may be interested in "[...] a particular date or time period during which an event took place which is described in the resource". The supporting example regards the statistics of rainfall in a particular time period. However, though in the example there is a semantic agreement between the event and its observations (which constitute the content of the resource), in many other cases this is not applicable. For example, if a provider has to describe the metadata of remote sensing products acquired and processed in order to monitor a landslide that occurred in 1986, the current specification of the temporal extent just allows indicating the time of the observations (the acquisitions of the satellite images) and not the time of the event, i.e., of the landslide occurrence. Nevertheless, this can be very important information for the user of a discovery service, because she/he can be interested in comparing the status of the landslide in different periods of time, as it appears in distinct products. She/he must be sure that the compared images refer to the same event, and not to distinct ones.
While the description of the event associated with the different resources can be included in the resource abstract, there is no way to specify the temporal characterization of the event. In summary, this chapter proposes to introduce the possibility to include in the metadata one or more events/processes/phenomena of reference for the geodata: a satellite image can be an observation of fires in a region; a set of meteorological records can be measures of a rainfall; some thematic maps can be subsequent representations of urban growth, etc. A temporal extent element should, of course, be defined also for the reference event(s). Although further extensions are possible, this chapter proposes to include the following temporal metadata:

Instant in which the observation/event occurred: date, time (e.g., the landslide occurred on 11-07-2008 at 8:00:00)
Period of validity, or duration, of an observation/event: period (interval of dates, times) (e.g., the duration of a fire was from 11-07-2008 to 13-07-2008)
Sequence of instants of occurrences/events: multiple instants of time or dates, periodic or aperiodic (e.g., the dates of the distinct fires were 11-07-2008 and 12-07-2008)
Sequence of periods of validity or duration of occurrences/events: multiple periods or durations, periodic or aperiodic (e.g., the times of the distinct fires were from 11-07-2008 to 13-07-2008 and from 20-08-2008 to 23-08-2008)
Moreover, the following paragraph introduces a framework to allow the definition of imperfect temporal values in the temporal extent metadata elements regarding both geodata and related event(s).
The use of TimeML is motivated by the fact that it is a textual meta-language - thus easy to read and to index by common Information Retrieval techniques - and can be employed in a discovery service context in order to represent the metadata contents for subsequent search and discovery. It is flexible enough to allow the annotation (description) of the kinds of events and observations, and of their temporal information, possibly imprecise and approximate. In the following, we first describe the representation of time expressions within fuzzy set theory and possibility theory; then introduce TimeML and, specifically, the adopted tags; and finally present the proposed partial matching mechanism.
of the domain: in the previous example, the time point 4 has the maximum membership degree 1, whose meaning is that 4 is fully possible as the value of the defined approximate time point, while 3 and 5 have membership degrees 0.8 and 0.5, respectively, indicating that they are also possible values of the approximate time point, but to a lower extent than 4.

A duration in time, i.e., a time span, is a pair [t, G] and can be denoted by either a set or a range of time points. A fuzzy time span example is [t = {0.8/3, 1./4, 0.7/5}, year], which means a duration of about 4 years. A temporal distance from the origin, i.e., a time distance, is defined as a pair [d, G] in which d is a positive or negative value indicating the distance in time granules on G from the origin. In this case [d = 2, day] means two days after the origin. Like t, d can also be a fuzzy set, indicating a fuzzy time distance. A time interval is a triple [t, d, G]; in the crisp case, [t = 1991, d = 3, year] means a range of 3 years from 1991. A composite span is a union of spans [ti, Gi], not necessarily adjacent and on the same basic domain G. An aperiodic time element is a union of time intervals [ti, di, Gi]. The crisp example [t = 1-11-2008, d = 28, day] ∪ [t = 30-11-2008, d = 31, day] means 28 days from 1-11-2008 and 31 days from 30-11-2008. Finally, a periodic time element is a union of time intervals separated by time distances: [ti, di, Gi], [dk, Gk]. For example, [t = 1-8-2000, d = 31, day], [d = 1, year] means every August from year 2000. An example of an approximate periodic time element is [t = 1-2000, d = {0.2/1, 0.8/2, 1./3, 0.8/4}, week], [d = 1, year], which means around the third week of every January from year 2000.
Since in the context of metadata compilation we may have time series that are related to finite repetitions of observations or events, a finite periodic time element is defined as composed of a periodic time element and a time point: [ti, di, Gi], [dk, Gk], [t, G], in which the time point t specifies the end of the repetition. An example of a finite periodic time element is every autumn from 2000 to 2005, which is formally expressed as: [21-09-2000, 91, day], [1, year], [21-12-2005, day].
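As a concrete illustration of these definitions, the following C sketch (our own, not taken from the chapter; all type and field names are illustrative assumptions) shows one possible in-memory encoding of fuzzy time points, spans, intervals and finite periodic elements:

typedef enum { G_DAY, G_WEEK, G_MONTH, G_SEASON, G_YEAR } Granularity;

/* A discrete fuzzy set of values: pairs value/degree, e.g. {0.8/3, 1.0/4, 0.7/5}.
 * A crisp value is the special case n = 1 with degree 1. */
typedef struct {
    int     n;
    long   *values;   /* time points or distances, counted in granules of G */
    double *mu;       /* membership degrees in [0, 1] */
} FuzzyValue;

/* Time interval [t, d, G]: a (possibly fuzzy) start point and span. */
typedef struct {
    FuzzyValue  t;    /* start, as a distance from the origin of G */
    FuzzyValue  d;    /* duration, in granules of G */
    Granularity g;
} TimeInterval;

/* Finite periodic element [ti, di, Gi], [dk, Gk], [t, G]: a repeating
 * interval, the repetition distance, and the end of the repetition. */
typedef struct {
    TimeInterval base;    /* first occurrence, e.g. [21-09-2000, 91, day] */
    FuzzyValue   step;    /* repetition distance, e.g. [1, year] */
    Granularity  step_g;
    long         end;     /* end of the repetition, e.g. 21-12-2005 */
    Granularity  end_g;
} FinitePeriodicElement;

An aperiodic time element would then simply be a list of TimeInterval values, matching the union of intervals defined above.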
4. Reasoning about the persistence of events (how long an event or the outcome of an event lasts).
The tags of TimeML1 adopted to extend the INSPIRE metadata, and modelled within the fuzzy framework previously described, are listed and illustrated in Table 3. For example, the expression every beginning of autumn from 2000 to 2005 can be formulated in TimeML as in Table 4. Finally, in TimeML it is possible to mark confidence values to be assigned to any tag and to any attribute of any tag. The confidence value associated with the value attribute of TIMEX3 expresses the uncertainty that the metadata provider has in assigning the temporal indication to an event or observation. For example, if we are unsure whether the observation was performed on the first or the second of January 2000, we can add the confidence annotation to TIMEX3 as in Table 5.
Table 5. TimeML expression for more likely 1 January 2000 than 2 January 2000
<TIMEX3 tid="t1" type="DATE" value="2000-01-01"> On January 1st, 2000 </TIMEX3>
<CONFIDENCE tagType="TIMEX3" tagID="t1" confidenceValue="1.0"/>
<TIMEX3 tid="t2" type="DATE" value="2000-01-02"> On January 2nd, 2000 </TIMEX3>
<CONFIDENCE tagType="TIMEX3" tagID="t2" confidenceValue="0.80"/>
The metadata provider should define the time indications of events and observations by means of a metadata editor. A running example of the metadata of two thematic maps representing subsequent observations of the same event, i.e., the melting of the Lys Glacier (a glacier of the Italian Alps), during Summer 2007, is discussed. In this example, the temporal specification of the first occurrence of the observation is known precisely, while the second date of the observation is affected by some uncertainty. Following the proposed extension, in the metadata of the first map we have fields such as:
Metadata 1
Event = Lys Glacier melting
Occurrence = observation of Lys Glacier melting
Time Position of Occurrence = 1.7.2007
The metadata items are translated into TimeML sentences as in Table 6. To allow partial matching with respect to soft selection conditions specified by a user, a parser translates the external TimeML definitions of the temporal metadata elements into their internal fuzzy set representations. In this phase we can obtain fuzzy sets defined on distinct domains (G) having different time granularities. On the other side, the user specifies her/his temporal selection conditions Q within a discovery service client interface. The expression of temporal selection conditions can be supported by
a graphical user interface that allows depicting the semantics of the (possibly) soft temporal conditions on the timeline, as in Figure 3. The soft temporal conditions are defined as soft constraints, i.e., with a desired membership function μQ defined on a timeline with a given granularity, chosen among the available ones (Zadeh, 1978). An example of a user selection condition for the Lys Glacier melting example reported in Table 6 could be: Search for observations of glacier melting events, occurring close to late Summer 2007. The soft constraint of this example can correspond to a fuzzy time interval, defined as follows: [t = 1-8-2007, d = {0, 15, 43, 17}, day], where d specifies the fuzzy time span in days from the date 1-8-2007. This definition corresponds to a membership function with a trapezoidal shape such as the one depicted in Figure 4, [1-8-2007, 15-8-2007, 23-9-2007, 10-10-2007] (Bosc and Pivert, 2000). The matching is performed by first filtering the observations that are related to the Lys Glacier. This is done by matching the strings in the metadata EVENT and MAKEINSTANCE fields. Then, if an instance is found, in order to obtain two homogeneous temporal representations, we have to convert either the temporal metadata values in TIMEX3 or the soft query condition to which they are matched into a temporal indication defined on a common basic domain with the same granularity. This can be done as explained in the following paragraph.
Figure 3. Examples of two soft temporal constraints defined on two timelines with distinct granularities (months and years, respectively). The constraint every autumn from 2000 to 2005 is defined as a fuzzy periodic time element, while after the Second World War is defined as a fuzzy time interval
Figure 4. Trapezoidal membership function corresponding to a soft constraint representing a fuzzy time interval
converting the current coarser granule of the i-th node in terms of an aggregation of finer granules of the j-th node (Figure 5 is a simplification to illustrate how the concept works). The function Fi,j: G → G′, where G ≠ G′, with the i-th node defined on domain G and the j-th node defined on G′, defines the mapping of a granule g ∈ G into a fuzzy set Fi,j(g) ⊆ G′ of granules defined on G′. Notice that a precise temporal expression can be converted into a fuzzy temporal expression: for example, a month is defined in Figure 5 by the following fuzzy set of days: month = {0.3/30 days, 0.08/28 days, 0.58/31 days}, meaning that it can consist of 30 days with confidence 0.3, 28 days with confidence 0.08, or 31 days with confidence 0.58. A temporal indication t defined on a coarser domain G (e.g., year) can be converted into another, finer domain G′ (e.g., day) at a distance greater than 1 on the graph by repeatedly applying the mapping
Figure 5. Simplified example of the graph representing the relationships between time granules: on each edge the fuzzy set defines the conversion function F of granules
functions Fi,j associated with the edges of the path from node G to node G′, as proposed in (De Caluwe et al., 2000). Nevertheless, since there can be more than a single path P1, ..., Pk connecting two nodes in the hierarchy (e.g., in Figure 5 we have two paths connecting year to day: P1 = year, season, day and P2 = year, month, day), by applying a transformation using the distinct paths one can obtain, depending on the followed path, multiple definitions tP1, ..., tPk of the same temporal indication t. In order to reconcile these definitions we select the shortest path connecting the two nodes. If several paths with the same length exist, we generate the maximum bound of their definitions: tP1∪P2∪...∪Pk(g) = max(tP1(g), ..., tPk(g)). The reason for this choice is that the maximum bound comprehends all possible definitions of the temporal indication on the domain G′. When matching a temporal metadata item with a temporal query condition, we transform the one defined on the coarser granularity domain into the finer domain, so that we obtain two homogeneous temporal indications expressed with the same granularity. At this point the temporal metadata and the user's temporal query specification are converted into their internal fuzzy set representations μt and μQ, respectively, and can be matched as described in the following paragraph.
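The reconciliation step alone can be made concrete with a minimal C sketch (ours, not the chapter's code; array indices stand for the granules of the finer domain G′):

/* Pointwise maximum bound of the definitions obtained along two paths
 * of equal length, e.g. year->season->day and year->month->day. */
void reconcile_paths(const double *t_p1, const double *t_p2,
                     double *t_out, int n_granules)
{
    for (int g = 0; g < n_granules; g++)
        t_out[g] = (t_p1[g] > t_p2[g]) ? t_p1[g] : t_p2[g];
}

With more than two paths of the same length, the same pointwise maximum is simply folded over all of the definitions tP1, ..., tPk.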
wanted to adopt a general framework that could also be used for other kinds of metadata elements, such as those for spatial descriptions. In the representation-based approach, both the metadata values t, possibly uncertain, and the soft query condition Q are interpreted as soft constraints, and one can match them to obtain a degree of satisfaction, the Retrieval Status Value RSV(t, Q) ∈ [0, 1], by computing several measures between the two fuzzy sets Q and t, such as a measure of similarity or a fuzzy inclusion measure (see Figure 6). When one is interested in selecting observations of an event taken on a date close to another date, or in matching events that took place close to a desired date, a similarity matching function can be used, defined for example as follows:

RSV(t, Q) = max_i min(μt(i), μQ(i))
in which μt(i) and μQ(i) are the membership degrees of a date i of the temporal domain in the fuzzy temporal metadata value t and in the query constraint Q, respectively. By this definition, if the two fuzzy sets have at least one time point in common, i.e., some overlap, the metadata item is retrieved. Another case is when one wants to select observations or events that occurred within a period: this corresponds to selecting a matching function defined as a fuzzy inclusion:
RSV(t, Q) = min_i max(1 − μt(i), μQ(i))
Further, other matching functions could be defined corresponding to other temporal relations, such as close before, close after and recent, defined based on a generalization of Allen's temporal relations (Allen, 1983).
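As an illustration (not the chapter's code), the following C sketch computes the two measures in the forms given above, assuming the metadata value and the query constraint have already been converted to membership arrays μt and μQ over a common domain of n granules:

/* Overlap: a sup-min (possibility) measure; > 0 iff t and Q share at
 * least one time point, as required for the "close" semantics. */
double rsv_overlap(const double *mu_t, const double *mu_q, int n)
{
    double rsv = 0.0;
    for (int i = 0; i < n; i++) {
        double m = mu_t[i] < mu_q[i] ? mu_t[i] : mu_q[i];  /* min */
        if (m > rsv) rsv = m;                              /* sup */
    }
    return rsv;
}

/* Inclusion: degree to which t lies within the period Q, via the
 * implication max(1 - a, b) evaluated pointwise and then min-ed. */
double rsv_inclusion(const double *mu_t, const double *mu_q, int n)
{
    double rsv = 1.0;
    for (int i = 0; i < n; i++) {
        double impl = 1.0 - mu_t[i];
        if (mu_q[i] > impl) impl = mu_q[i];
        if (impl < rsv) rsv = impl;                        /* inf */
    }
    return rsv;
}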
Figure 6. Schema of the partial matching mechanism between a soft query condition and an imperfect temporal metadata item
The retrieved metadata, based on the RSVs evaluated by the chosen matching function, can be ranked in decreasing order with respect to the user query, thus avoiding empty answers and suggesting an access order to the geodata. In the example proposed, by using the close matching function, both metadata are retrieved: Metadata 2 has a membership degree 1, being situated in late Summer, while Metadata 1 is also retrieved since it partially satisfies the query condition close, meaning that it is associated with an observation of glacier melting that is in the proximity of the required temporal range late Summer.
CONCLUSION
The proposal in this chapter originated within the context of a recent experience in European projects, where we had to cope with the actual job of both filling in the metadata of our georesources and training other colleagues to create their own, in order to build a discovery service for a European SDI. In carrying out this activity we encountered several difficulties, deriving mainly from the constraints on metadata types and formats imposed by the current INSPIRE implementation rules. In fact, it happens very often that information on the geodata is not well known or not completely trusted, or is even lacking. In some cases, the available geodata is produced manually by distinct experts in charge of performing land surveys. Only later is this information transformed into electronic format, so that there can be a substantial temporal gap between the time of geodata creation and that of their metadata compilation, a time in which pieces of information can get lost. Such imperfect information in metadata calls for new data models capable of representing and managing the imprecision and uncertainty of metadata values. This aspect is particularly evident for the temporal metadata used to describe geodata, and this is the aspect that has been analyzed in this chapter. We observed in particular that temporal resolution, time series, and imperfect and/or linguistic temporal values are missing from the current INSPIRE Implementing Rules, so we proposed distinct metadata elements to represent them and a fuzzy database framework to allow the representation of imperfect metadata and their flexible querying in catalog services.
ACKNOWLEDGMENT
Two out of five authors of this paper carried out this research under temporary contracts. We wish to acknowledge the European Commission funding that made this possible, without which our work could not have been performed.
REFERENCES
Allen, J. F. (1983). Maintaining knowledge about temporal intervals. Communications of the ACM, 26(11), 832–843. doi:10.1145/182.358434
Bordogna, G., Carrara, P., Pagani, M., Pepe, M., & Rampini, A. (2009). Extending INSPIRE metadata to imperfect temporal descriptions. In Proceedings of the Global Spatial Data Infrastructures Conference (GSDI 11), June 15-19, 2009, Rotterdam (NL), CD Proceedings, Ref. 235.
Bordogna, G., Carrara, P., & Pasi, G. (1991). Query term weights as constraints in fuzzy information retrieval. Information Processing & Management, 27(1), 15–26. doi:10.1016/0306-4573(91)90028-K
Bordogna, G., & Pasi, G. (2007). A flexible approach to evaluating soft conditions with unequal preferences in fuzzy databases. International Journal of Intelligent Systems, 22(7), 665–689. doi:10.1002/int.20223
Bosc, P., & Pivert, O. (2000). On the specification of representation-based conditions in a context of incomplete databases. In Proceedings of Database and Expert Systems Applications, 10th International Conference (DEXA 99), August 30 - September 3, 1999, Florence (IT), (pp. 594-603).
Cater, S. C., & Kraft, D. H. (1989). A generalization and clarification of the Waller-Kraft wish-list. Information Processing & Management, 25, 15–25. doi:10.1016/0306-4573(89)90088-5
INSPIRE Cross Drafting Teams. (2007). INSPIRE technical architecture overview. INSPIRE cross drafting teams report. Retrieved February 17, 2009, from http://inspire.jrc.ec.europa.eu/reports.cfm
De Caluwe, R., De Tré, G., Van der Cruyssen, B., Devos, F., & Maesfranckx, P. (2000). Time management in fuzzy and uncertain object-oriented databases. In O. Pons, A. Vila & J. Kacprzyk (Eds.), Knowledge management in fuzzy databases (pp. 67-88). Heidelberg: Physica-Verlag.
De Caluwe, R., Devos, F., Maesfranckx, P., De Tré, G., & Van der Cruyssen, B. (1999). Semantics and modelling of flexible time indication. In Zadeh, L. A., & Kacprzyk, J. (Eds.), Computing with words in Information/Intelligent Systems (pp. 229–256). Physica-Verlag.
Dekkers, M. (2008). Temporal metadata for discovery: A review of current practice. M. Craglia (Ed.), (EUR 23209 EN, JRC Scientific and Technical Report).
Deng, L., Cai, Y., Wang, C., & Jiang, Y. (2009). Fuzzy temporal logic on fuzzy temporal constraint networks. In Proceedings of the Sixth International Conference on Fuzzy Systems and Knowledge Discovery, 6, (pp. 272-276).
INSPIRE. (2007). Directive 2007/2/EC of the European Parliament and of the Council of 14 March 2007. Retrieved February 17, 2009, from www.ecgis.org/inspire/directive/l_10820070425en00010014.pdf
European Commission. (2007). Draft implementing rules for metadata (v. 3). INSPIRE Metadata Report. Retrieved February 17, 2009, from http://inspire.jrc.ec.europa.eu/reports.cfm
European Commission. (2009). INSPIRE metadata implementing rules: Technical guidelines based on EN ISO 19115 and EN ISO 19119. INSPIRE Metadata Report. Retrieved February 18, 2009, from http://inspire.jrc.ec.europa.eu/reports.cfm
ISO 8601. (2004). Data elements and interchange formats - Information interchange - Representation of dates and times. (Ref: ISO 8601).
Maiocchi, R., Pernici, B., & Barbic, F. (1992). Automatic deduction of temporal indications. ACM Transactions on Database Systems, 17(4), 647–668. doi:10.1145/146931.146934
Salton, G. (1989). Automatic text processing: The transformation, analysis and retrieval of information by computer. Addison-Wesley.
Tsotras, V. J., & Kumar, A. (1996). Temporal database bibliography update. SIGMOD Record, 25(1), 41–51.
Vila, L. (1994). A survey on temporal reasoning in artificial intelligence. AI Communications, 7(1), 4–28.
Vila, L., & Godo, L. (1995). Query answering in fuzzy temporal constraint networks. In Proceedings of FUZZ-IEEE/IFES'95, Yokohama, Japan. IEEE Press.
Wang, X. S., Bettini, C., Brodsky, A., & Jajodia, S. (1997). Logical design for temporal databases with multiple granularities. ACM Transactions on Database Systems, 22(2), 115–170. doi:10.1145/249978.249979
Zadeh, L. A. (1978). Fuzzy sets as a basis for a theory of possibility. Fuzzy Sets and Systems, 1, 3–28. doi:10.1016/0165-0114(78)90029-5
ENDNOTE
1
Chapter 7
ABSTRACT
This chapter focuses on incorporating fuzzy capabilities into an open source relational database management system (RDBMS). The fuzzy capabilities include connectors, modifiers, comparators, quantifiers, and queries. The extensions provide more flexible DDL and DML languages. The aim is to show the design and implementation details in the RDBMS PostgreSQL. For this, a fuzzy query processor and fuzzy access mechanisms have been designed and implemented. The physical fuzzy relational operators have also been defined and implemented. The flow of a fuzzy query through the different modules (parser, planner, optimizer, and executor) is shown. Some experimental results have been included to demonstrate the performance of the proposed solution. These results show that the extensions have not decreased the performance of the RDBMS.
INTRODUCTION
The language used by human beings is imprecise, but for the computer the opposite holds true. Scientists are continually investigating how to create an intermediate point to enhance the communication between these different worlds. Moreover, the capacity to store large quantities of data is growing exponentially;
we now have interconnected technology able to hold many terabytes or petabytes of data; consequently, we need systems that can manipulate this data efficiently. The problem resides in the fact that human reasoning is very different from the logic involved in software. In order to narrow this gap, there exists a branch of the scientific community that studies different techniques to model the human mind. This branch is known as soft computing (Zadeh, 1994). The aim of this work is to describe, in a general form, how to make the constraints of queries on a database more flexible, therefore making a Relational Database Management System (RDBMS) flexible without considerably affecting its performance in manipulating large quantities of data.

Natural language is imprecise, vague and inexact; for example, when we say Juan is a young person or Milton is very tall, we are expressing imprecise terms like young and very tall; how does the computer interpret these terms? Normally, we have to employ precise terms: a person is young if she/he is 30 years old or less; but what about the person who is 31 years old? One day I am young (I was 30 years old); the day after my birthday, am I no longer young? Furthermore, Juan is 50 years old and he considers himself to be young. That is true for him but false for others. If Milton is 1.80 meters, do you think that he is very tall? A lot of people will say no, but what if Milton is being selected to ride horses as a jockey? Then he is very tall; truthfulness or falseness is relative and depends on many factors. People have different points of view (preferences), and the role of context is also important.

In a computer system, objects are clearly divided into white and black, but in the real world there is an infinite grey area between the two, and most things normally fall in the grey area. Morin (1999) states that "Reality is not easily legible. Ideas and theories are not a reflection of reality, they are translations, and sometimes mistranslations. Our reality is nothing more than our idea of reality" (p. 44). Morin, a French philosopher and sociologist, recognizes ambiguity and uncertainty as the hallmark of science and of human experience. Morin's approach, complex thought, is in harmony with a culture of uncertainty. Morin (1999) asserts "But in life, unlike crossword puzzles, there are boxes without definitions, boxes with false definitions, and no neat framework to define the limits" (p. 45). How can we manipulate these boxes in the computer? An example is the evaluation of young ladies as top models. There are many requirements to be fulfilled and nobody satisfies all of them. However, Lolita was disqualified because she is 1.79 meters tall and there exists a clear (crisp) condition stipulating a minimum height of 1.80 meters. But what about the other prerequisites she met? What happens if Lolita was the best candidate based on the rest of them? This system may be very unfair, due to the rigidity of precise conditions.

We have proposed the enhancement of an RDBMS to support flexible queries and to help users define preferences depending on the context by using fuzzy conditions in their SQL queries. The contribution of this chapter is to describe the design and implementation of flexible queries (using fuzzy logic) in a Relational Database Management System (RDBMS) without considerably affecting its performance in manipulating large quantities of data.
The approach is Tightly Coupling, with modifications to the source code of the query processor engine in order to process fuzzy queries.
BACKGROUND
Traditional RDBMS suffer from rigidity: data are considered to be perfectly known and query languages do not allow natural expression of user preferences. In order to provide flexible queries over databases,
several efforts have been made, such as: RankSQL (Li, Chen-chuan, Ihab, Ilyas & Song, 2005), retrieving the top-k ranked answers; SKYLINE (Börzsönyi, Kossmann, & Stocker, 2001), selecting the best rows, i.e., all non-dominated ones, based on a crisp multi-criteria comparison; SQLf (Bosc & Pivert, 1995), allowing fuzzy conditions anywhere SQL expects Boolean ones; Soft-SQL (Bordogna & Psaila, 2008), allowing customizable fuzzy term definitions for querying; FSQL (Galindo, 2005), using Fuzzy Sets for imperfect data representation; MayBMS (Koch, 2009), MystiQ (Boulos, Dalvi, Mandhani, Mathur, Re & Suciu, 2009) and Trio (Widom, 2009), which are proposals to implement probabilistic uncertain databases; there are also proposals for combining fuzzy queries over multiple data sources (Fagin, 2002). The Fuzzy Set based approach is the most general one for solving the problem of database rigidity (Bosc & Pivert, 2007; Goncalves & Tineo, 2008). Nevertheless, Fuzzy Set handling adds extra processing costs to database systems that must be controlled (Bosc & Pivert, 2000; Tineo, 2006). We need efficient evaluation mechanisms in order to make the use of fuzzy queries possible in real world applications (Lopez & Tineo, 2006). Moreover, it would be desirable to enhance existing RDBMS with native fuzzy query capability in order to improve performance and scalability; this is the focus and contribution of the present chapter.
Fuzzy Sets
Zadeh (1965) introduced Fuzzy Sets in order to model fuzzy classes in control systems; since then, Fuzzy Sets have been infiltrating many branches of pure and applied mathematics that are set theory based. A Fuzzy Set is defined as a subset F of a domain X characterized by a membership function μF taking values in the real interval [0,1]. Some correspondence operators between a Fuzzy Set F and regular ones are defined: support(F) is the set of elements with μF(x) > 0; core(F) is the set of elements with μF(x) = 1; border(F) is the set of elements with μF(x) ∉ {0,1}; α-cut(F) is the set of elements with μF(x) ≥ α. In the numeric domain, trapezoidal shape functions are often used; they are described by F = (x1, x2, x3, x4), where the range [x2, x3] is the core, the interval ]x1, x4[ is the support, the interval ]x1, x2[ is the increasing part of the border, where the membership function is given by the line segment from (x1, 0) to (x2, 1), and the interval ]x3, x4[ is the decreasing side of the border, characterized by the segment from (x3, 1) to (x4, 0). A trapezoidal shape Fuzzy Set is said to be monotonous if it has only an increasing or a decreasing side but not both (x1 = x2 or x3 = x4). We say that it is unimodal when it has both increasing and decreasing sides (x1 ≠ x2 and x3 ≠ x4). Fuzzy Sets give meaning to linguistic terms (predicates, modifiers, comparators, connectors and quantifiers), giving rise to a fuzzy logic where a sentence S has a truth value μ(S) in [0,1], 0 being completely false and 1 completely true. Conjunction and disjunction are extended by means of t-norm and t-conorm (s-norm) operators respectively, satisfying the properties: boundary in [0,1], monotonicity, commutativity, associativity and neutral element (1 and 0 respectively). The most commonly used t-norm and t-conorm couple are the minimum and maximum operators.
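A minimal C sketch may make the trapezoidal case concrete (our own illustration, for bounded trapezoids only; the names are assumptions):

typedef struct { double x1, x2, x3, x4; } Trapezoid;

/* Membership of x in F = (x1, x2, x3, x4): 0 outside the support
 * ]x1, x4[, 1 on the core [x2, x3], linear on the two borders. */
double membership(const Trapezoid *f, double x)
{
    if (x <= f->x1 || x >= f->x4) return 0.0;
    if (x >= f->x2 && x <= f->x3) return 1.0;
    if (x < f->x2)  return (x - f->x1) / (f->x2 - f->x1);  /* increasing side */
    return (f->x4 - x) / (f->x4 - f->x3);                  /* decreasing side */
}

/* x belongs to the alpha-cut of F iff its membership reaches alpha. */
int in_alpha_cut(const Trapezoid *f, double x, double alpha)
{
    return membership(f, x) >= alpha;
}

Monotonous shapes (x1 = x2 or x3 = x4) fall out of the same function, since the corresponding border branch is then never reached.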
SQLf was conceived for fuzzy querying of relational databases. Its basic query structure is: SELECT <attributes> FROM <relations> WHERE <fuzzy condition> WITH CALIBRATION k | α | k, α. The result of this query is a fuzzy relation with the attributes of the SELECT clause projected from the Cartesian product of the relations in the FROM clause that satisfy the fuzzy condition in the WHERE clause. The optional WITH CALIBRATION clause, proposed by Tineo (2006) to maintain the orthogonality of the original SQL, indicates the best-rows choice in two senses: quantitative, retrieving the top k answers according to satisfaction degree; qualitative, obtaining rows with membership greater than or equal to a threshold α (alpha cut). Goncalves and Tineo (2008) have previously worked towards a real implementation of a flexible querying system based on SQLf and its extensions. The result of such work is the flexible querying system named SQLfi. This system implements fuzzy querying capabilities on top of an existing RDBMS by means of a processing strategy known as the Derivation Principle, which we briefly describe hereafter. At present, SQLfi is compatible with the most popular RDBMS (Oracle, PostgreSQL, MySQL, IBM/DB2, Firebird and SQL/Server). Nevertheless, SQLfi uses a Loose Coupling strategy (Timarán, 2001) that has scalability problems. In this chapter we provide another way of implementing SQLf that surpasses such problems thanks to a Tightly Coupling implementation strategy at the core of an RDBMS.
Derivation Principle
Fuzzy querying implies an extra processing cost when compared to crisp querying. For SQLf processing, in order to minimize this added cost, the Derivation Principle has been conceived (Bosc & Pivert, 2000; Tineo, 2006). It takes advantage of the support and α-cut concepts. Given an SQLf query Φ, it is possible to derive a crisp SQL query DQ(Φ) retrieving the relevant rows for Φ, i.e., support(result(Φ)) ⊆ result(DQ(Φ)). Then Φ is processed on result(DQ(Φ)): membership degrees are computed and rows are filtered according to the desired query calibration. When we find DQ(Φ) such that support(result(Φ)) = result(DQ(Φ)), the derivation is said to be strong and the processing does not perform unsuccessful computation. Otherwise, the derivation is said to be weak. Fuzzy queries with boolean connectors (AND, OR, NOT) allow strong derivation. With the Derivation Principle, SQLf processing is done on top of the RDBMS. This kind of implementation strategy is known as a Loose Coupling architecture (Timarán, 2001). Previous works (Bosc & Pivert, 2000; Lopez & Tineo, 2006) have proved the Derivation Principle based processing strategy to be the best in performance with respect to existing ones. Nevertheless, it has some overhead because the rows in result(DQ(Φ)) are rescanned by the fuzzy query processing to compute the membership degrees. This is one reason for the scalability problem of Loose Coupling. We would like to provide a fuzzy query processing mechanism with the advantage of the Derivation Principle but without the problems of Loose Coupling. To do so, we must extend the functionality of the RDBMS inner modules. In this chapter we propose the needed extensions. Moreover, we present a real implementation in the PostgreSQL source code. Experimental evidence of the feasibility and benefits of this strategy is also given in this chapter.
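A sketch of the idea for a single trapezoidal predicate (ours, not the chapter's code; the table and column names are hypothetical): the derived crisp query DQ(Φ) only has to keep the rows inside the support ]x1, x4[, where the membership degree can be non-zero, and the degrees are then computed on this smaller result set.

#include <stdio.h>

typedef struct { double x1, x2, x3, x4; } Trapezoid;  /* as sketched earlier */

/* Print the derived crisp query for 'column = <predicate f>'. */
void print_derived_query(const char *table, const char *column,
                         const Trapezoid *f)
{
    printf("SELECT * FROM %s WHERE %s > %g AND %s < %g;\n",
           table, column, f->x1, column, f->x4);
}

For instance, for the predicate middle = (60, 70, 80, 90) used later in this chapter, the derived condition over Grade is Grade > 60 AND Grade < 90, and support(result(Φ)) = result(DQ(Φ)) holds, making the derivation strong.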
SCOPE DELIMITATION
SQLf is the most complete fuzzy extension to SQL due to the diversity of fuzzy queries it allows: the extension of all SQL constructions with Fuzzy Sets. The SQLf Data Definition Language (SQLf-DDL) allows the following fuzzy terms inside its syntactic structure:
- Atomic predicates interpreted by Fuzzy Sets (which we call fuzzy predicates).
- Modifiers that build predicates by Fuzzy Set transformations.
- Comparators as fuzzy binary relations.
- Connectors as operations over membership values.
- Quantifiers represented as Fuzzy Sets over natural numbers.
Fuzzy terms allow building fuzzy conditions that can be used in SQLf anywhere standard SQL allows a boolean logic condition. Thus, SQLf is a rather complex language. The processing model that we propose in this chapter may be applied to all SQLf querying features. Nevertheless, due to the complexity of this language, we delimit the scope of our actual implementation to the following characteristics.
Box 1.
Atomic Predicates with the syntax:
  CREATE FUZZY PREDICATE <name> ON <dom> AS <fset>
where
  <name> is a string of characters,
  <dom> is the domain (possible values of the linguistic variable),
  <fset> is the specification of a Fuzzy Set with one of the following forms:
    - A trapezoidal function with four parameters (<support1>, <core1>, <core2>, <support2>)
    - A Fuzzy Set by extension, i.e. {<value1>/<μ1>, ..., <valuen>/<μn>}
    - An arithmetic expression with the variable <x> that indicates the predicate's argument.

Fuzzy modifiers with the syntax:
  CREATE MODIFIER <name> AS POWER <n>
  or CREATE MODIFIER <name> AS <exp> POWER <n>
  or CREATE MODIFIER <name> AS TRANSLATION <d>
where
  <name> is a string of characters,
  <n> is a power applied to the membership degree,
  <exp> is an arithmetic expression with the variables <x> and <y> indicating the first and second term respectively,
  <d> is a value for the translation of the original predicate.

Fuzzy comparators with the syntax:
  CREATE COMPARATOR <name> ON <dom> AS <exp>
where
  <name> is a string of characters,
  <dom> is the domain (possible values of the linguistic variable),
  <exp> is an expression to calculate the value of the comparator over two elements. It may be a Fuzzy Set like:
    - A trapezoidal function with four parameters (<exp1>, <exp2>, <exp3>, <exp4>) that are arithmetic expressions with the variables <x> and <y> indicating the first and second term respectively,
    - {<(value11, value12)>/<μ1>, ..., <(valuen1, valuenm)>/<μn>}, where the valueij (i between 1 and n, j between 1 and m) are pairs of values in the domain with their respective membership degrees.

Fuzzy connectors with the syntax:
  CREATE CONNECTOR <name> AS <exp>
where
  <name> is a string of characters,
Box 1. Continued
  <exp> is an expression that computes the value of the compound predicate; the variables <x> and <y> indicate the first and second term respectively.

Fuzzy quantifiers with the syntax:
  CREATE [ABSOLUTE/RELATIVE] QUANTIFIER <name> AS <fset>
where
  <name> is a string of characters,
  <fset> is the specification of a Fuzzy Set with one of the following forms:
    - A trapezoidal function with four parameters (<support1>, <core1>, <core2>, <support2>)
    - An arithmetic expression with the variable <x> that indicates the quantifier's argument.

Checks of fuzzy conditions with the syntax: CHECK(<fuzzy condition>)
Views based on a fuzzy subquery: CREATE VIEW <name> AS <fuzzy subquery>
Statement Processing
The general algorithm for term definition is:
1. The parser accepts the creation of a fuzzy term (i.e., create fuzzy predicate, create fuzzy modifier, create fuzzy quantifier, ...).
2. If the fuzzy term does not exist, record it with its parameters in the fuzzy catalog; else report to the user that the fuzzy term is already in the fuzzy catalog.
Box 2.
Fuzzy queries with the syntax:
  SELECT <attributes> FROM <tables> WHERE <fuzzy condition>

Queries with fuzzy conditions in partitioning:
  SELECT <attributes> FROM <tables> WHERE <condition>
  GROUP BY <attributes> HAVING <quantified fuzzy condition>

Updates with a fuzzy condition:
  UPDATE <table> SET <attribute> = <value> WHERE <fuzzy condition>

Queries with a fuzzy subquery in the FROM clause:
  SELECT <attributes> FROM <fuzzy subquery> AS <alias> WHERE <fuzzy condition>

Fuzzy Set operations:
  (<Q1> INTERSECT/UNION/EXCEPT <Q2>)
where at least one of the queries (Q1 or Q2) is a fuzzy subquery.

A <fuzzy condition> is a fuzzy logic expression of the form:
  <exp> = <pred> or (<fuzzy condition>) or NOT <fuzzy condition>
  or <fuzzy condition> <conn> <fuzzy condition>
where
  <exp> is a traditional SQL value expression,
  <pred> is a fuzzy predicate term that may be:
    - a user defined fuzzy predicate identifier <name>,
    - a combination <mod><pred>, <mod> being the name of a user defined modifier or a built-in modifier ANT/NOT,
  <conn> is a fuzzy logic connector that may be:
    - a built-in fuzzy logic operator AND/OR,
    - a <name> identifying a user defined fuzzy connector.

A <quantified fuzzy condition> is a fuzzy logic expression of the form:
  <quant> ARE <fuzzy condition>
where <quant> is a <name> identifying a user defined fuzzy quantifier.
The main thing is that the fuzzy catalog is composed of system tables, and we can maintain or view them through sentences of standard SQL, with the constraints of the RDBMS and system privileges. The general algorithm to process an SQLf-DML statement with a fuzzy term would be:
1. The parser module, after having verified that a term is not standard, searches the fuzzy catalog.
2. If the term is in the fuzzy catalog then
   a. Create a new fuzzy node (memory structure) with the parameters of the fuzzy term, i.e., an A_fuzzy_predicate node.
   b. Insert the fuzzy node into the parser tree.
   c. Set on a boolean fuzzy variable in the parser state, i.e., set on has_fuzzy_predicate;
   else report an error to the user and finish.
3. The fuzzy parser tree is converted into a Fuzzy Query Tree; the query condition has the fuzzy term, and the fuzzy node is held in the query tree as a list of fuzzy terms. The boolean fuzzy variable is held in the query state.
4. The analyzer module applies the Derivation Principle to transform the fuzzy query condition into a classical query condition; this Fuzzy Query Tree then holds a classical SQL query but with the information about the fuzzy term.
5. The optimizer module applies the usual query optimization algorithm to obtain a fuzzy execution plan. It has annotated the fuzzy leaf (base table with a linguistic label over a linguistic variable).
6. The executor module applies the extended fuzzy access mechanism (i.e., fuzzy sequential scan) to the fuzzy leaf; the calculated membership degree is then propagated bottom-up using the extended physical fuzzy relational algebra operators (i.e., fuzzy hash join).
7. Each row is shown with a membership degree (fuzzy row).
This algorithm is easily extended for more fuzzy terms. In particular, we implement a recursive algorithm to derive a fuzzy condition into a classical condition and, furthermore, put all the fuzzy nodes in a chained list; finally, we extended the parser, query and plan trees with other parameters according to each particular fuzzy condition: quantifier, comparator, modifier or fuzzy partition.
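One possible shape for such a recursive derivation (our own sketch, not the PostgreSQLf source; the node types are hypothetical, and negation is omitted since deriving NOT requires the complement of the core rather than the support) is:

#include <stdio.h>

typedef enum { N_FUZZY_PRED, N_AND, N_OR } NodeKind;

typedef struct Cond {
    NodeKind kind;
    struct Cond *left, *right;   /* subconditions for AND/OR */
    const char *column;          /* fuzzy predicate leaves only */
    double x1, x4;               /* support bounds of the predicate */
} Cond;

/* Each fuzzy leaf becomes a crisp range test over its support; the
 * AND/OR connectors are kept as they are (a strong derivation). */
void derive(const Cond *c, FILE *out)
{
    if (c->kind == N_FUZZY_PRED) {
        fprintf(out, "(%s > %g AND %s < %g)",
                c->column, c->x1, c->column, c->x4);
    } else {
        fprintf(out, "(");
        derive(c->left, out);
        fprintf(out, c->kind == N_AND ? " AND " : " OR ");
        derive(c->right, out);
        fprintf(out, ")");
    }
}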
Access Methods
The access methods (like sequential scan and index scan) used on classical relations were extended in order to process fuzzy conditions over classical relations and obtain fuzzy relations (rows with a membership degree); furthermore, the physical relational operators (like nested loop join, hash join and merge join) were extended to propagate the membership degree until the fuzzy row is shown. The main extension of these mechanisms is to compute the membership degree (through the Fuzzy Access Methods), recording it in the resulting temporal tables and propagating it (through the Fuzzy Physical Relational Operators) until the result set is shown. The innovation consists in the fact that we calculate the membership degree while the execution plan is running; we thus avoid calculating it later, after the result set has been obtained, which is the job undertaken by the preceding approaches on top of the RDBMS. The access methods are entrusted to calculate the membership degree on base tables and choose the resulting rows according to the support and core of the membership function; thus we extended the classical access mechanisms (based on sequential and index scan). In the following sections, we specify the fuzzy access mechanisms used when applying the selection algebra operator (because these are tightly related to choosing the rows of the tables) and we explain the implementation of their execution as a fuzzy access mechanism.
Physical Operators
When applying fuzzy queries, the Fuzzy Query Tree is converted into a fuzzy relational tree, where the physical operators of Selection, Projection and Join must use fuzzy relational algebra theory. We obtain a fuzzy relation as the result of applying a fuzzy physical operator to classical relations; that is why we have to record the membership degree of each resulting row.
Fuzzy Selection
For the classical selection, there are various algorithms denominated file scans, because we have to scan all the records and keep only the rows that satisfy the condition of the query. If the scan algorithm involves an index, we call it an index scan. The most frequently used algorithms for implementing an ordinary condition are: linear search (naive), binary search (if the file is sorted), using a primary index, using a cluster index, or using a secondary index (B+ tree) over an equality condition. In this case we extended the file scan, index scan and bitmap heap scan to calculate the membership degree of each row when the base table is annotated with a fuzzy term (there is a linguistic label over a linguistic variable); furthermore, these physical operators record the membership degree of each fuzzy row in order to propagate it bottom-up through the execution plan. The membership is computed by applying the theoretical frame of Fuzzy Sets: when we have a membership function (i.e., trapezoidal) we take the value of the linguistic variable and apply the corresponding membership function. When we have boolean connectors in a fuzzy condition we use the minimum (AND), the maximum (OR) or the complement (NOT, as 1 − x).
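A simplified sketch of such a fuzzy scan (our illustration; fetch_next_row() and the Row layout are hypothetical stand-ins for the actual PostgreSQL heap access routines):

typedef struct { double x1, x2, x3, x4; } Trapezoid;   /* as sketched earlier */
typedef struct { double age; double gr_memb; } Row;    /* hypothetical row */

extern int fetch_next_row(Row *row);                   /* heap access stand-in */
extern double membership(const Trapezoid *f, double x);

/* Return the next row whose degree is non-zero, recording the degree
 * so the upper operators can propagate it bottom-up. */
int fuzzy_seq_scan_next(Row *row, const Trapezoid *pred)
{
    while (fetch_next_row(row)) {
        double mu = membership(pred, row->age);
        if (mu > 0.0) {           /* the row lies in the predicate's support */
            row->gr_memb = mu;    /* projected as the Gr_memb attribute */
            return 1;
        }
    }
    return 0;                     /* scan exhausted */
}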
Fuzzy Projection
This operator only has to delete the attributes that are not projected and record the membership degree of each fuzzy row. When duplicates need to be deleted (DISTINCT clause) we have to apply a partition access mechanism to the projected attributes and, furthermore, calculate the maximum (the most used t-conorm) membership degree when we have equal rows. The membership degree is propagated bottom-up through the execution plan.
Fuzzy Join
We extended this operator for when at least one involved relation is fuzzy, that is, when it arises from the application of a fuzzy access mechanism or from another fuzzy relational operator. If any of the relations is classical, we assume that its membership degree is one (1), in accordance with the theory of Fuzzy Sets (each row belongs completely to the Fuzzy Set). Under these conditions, we only have to compute the minimum (the most used t-norm) for each pair of rows and record it, to be propagated bottom-up through the execution plan. We thus extended the nested loop join, hash join and merge join.
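The degree bookkeeping of these operators reduces to the t-norm/t-conorm couple named above; a minimal sketch (ours):

static double t_norm(double a, double b)   { return a < b ? a : b; }  /* min */
static double t_conorm(double a, double b) { return a > b ? a : b; }  /* max */

/* Fuzzy join: each joined pair of rows carries the minimum of the two
 * input degrees (a classical input is treated as having degree 1). */
double join_degree(double mu_left, double mu_right)
{
    return t_norm(mu_left, mu_right);
}

/* Fuzzy projection with DISTINCT: equal projected rows are merged and
 * keep the maximum of their degrees. */
double distinct_degree(double mu_kept, double mu_duplicate)
{
    return t_conorm(mu_kept, mu_duplicate);
}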
Optimizer
The fuzzy querying engine optimizer module takes the Fuzzy Query Tree arising from the parser module and uses the same algorithm as for optimizing a classical query (i.e., the query optimizer from System R). This
is because, from the fuzzy condition, we derive a classical condition and annotate the fuzzy terms (fuzzy algebra tree). The output of the optimizer is a fuzzy execution plan, that is, a relational algebra tree with the physical operators and the fuzzy conditions for each base table (linguistic labels over the linguistic variables); we annotate the fuzzy algebra operators, the fuzzy terms and a new projected attribute to record the membership degree when the executor calculates it.
Executor
The executor must apply the fuzzy access mechanisms and Fuzzy Physical Relational Operators, recording the membership degree of each resulting row using a bottom-up approach (from the leaves to the top of the tree), as we have to sort in descending order on the membership degree. Previous works compute the membership degree starting from the top down, causing overhead due to the post-processing applied at the top for processing SQLf.
Processing Example
Let's illustrate fuzzy query processing by means of a simple example. We assume the following simple database schema (primary keys are underlined):

STUDENT(SID, SName, Age, RecordS) with an index on SName
COURSE(CNO, CName, Credits, AverageC) with an index on CName
ENROLL(SID, CNO, Period, Section, Grade)

Further, consider the following SQLf query:

SELECT SName, Period
FROM STUDENT S, ENROLL E
WHERE Grade = middle AND Age = old AND S.SID = E.SID;

In this query middle and old are fuzzy predicates defined by trapezoidal membership functions according to users' preference statements like:

CREATE FUZZY PREDICATE middle ON 0..100 AS (60,70,80,90);
CREATE FUZZY PREDICATE old ON 0..120 AS (65,100,INFINITE,INFINITE);

The syntax of this type of statement was defined in previous works and used in applications with SQLfi (Goncalves & Tineo, 2008). The first and last parameters represent the support, and the second and third the core of the membership function. The fuzzy predicate old is a monotonous increasing function; for this reason the two last parameters are infinite (unbounded). Also, middle is a linguistic label that refers to the linguistic variable grade, and old is another that refers to age. We propose fuzzy algebra operators that can calculate the membership degree in any node or leaf of the query tree; i.e., Figure 1 shows a fuzzy algebra tree for the SQLf query given before. This fuzzy relational algebra query tree cannot be processed by the classical query optimizer; we propose to apply the Derivation Principle and extend the classical query annotated with fuzzy conditions as shown in Figure 2.
This derived Fuzzy Query Tree with boolean conditions can be processed by the classical query optimizer; we then propose to extend the access mechanisms (sequential scan, index scan) and the physical algebra operators (projection, selection, join) such that the query evaluator can apply a fuzzy sequential scan, fuzzy index scan, fuzzy projection, fuzzy selection or fuzzy join. The fuzzy operators compute the membership degree and send it to the next fuzzy algebra operator; additionally, if the query optimizer pushes the projection down, this operator must also project the linguistic variable (age and grade in this case), as shown in Figure 3. We consider that the fuzzy access mechanisms and algebra operators may have a classical relation as input but produce a fuzzy relation (a classical relation with a membership degree) as output. In the case of a join with two fuzzy relations as input, the output fuzzy relation will have the minimum of the membership degrees.
DATA STRUCTURES
This section presents the main data structures that we must implement in order to support fuzzy query processing at the core of an RDBMS. These data structures are given here according to the actual PostgreSQL structures and code. We have extended the PostgreSQL querying engine for processing fuzzy queries in SQLf. We name our extension PostgreSQLf.
Box 3.
#define RelationFuzzyPredId 2859
CATALOG(pg_fuzzypred,2859) BKI_BOOTSTRAP BKI_WITHOUT_OIDS
{
    NameData predname;           // predicate name
    int2     predbegd;           // domain range begin
    int2     predendd;           // domain range end
    int2     predminfp;          // support's left bound
    int2     predcore1;          // core's left bound (included)
    int2     predcore2;          // core's right bound (included)
    int2     predmaxfp;          // support's right bound
    int2     predtypefp;         // shape 1(trapz.) 2(increases) 3(decreases)
    NameData preddisd;           // discrete domain
    text     predcompfplist[1];  // fuzzy predicate's compare list name
    NameData predexprfp;         // expression
} FormData_pg_fuzzypred;
Box 4.
#define RelationFuzzyModId 2879
CATALOG(pg_fuzzymod,2879) BKI_BOOTSTRAP BKI_WITHOUT_OIDS
{
    NameData modname;      // modifier name
    int2     modtype;      // modifier type
    int2     modpower;     // modifier power
    NameData modnorms;     // t-norms and t-conorms name
    NameData modfirstarg;  // t-norms and t-conorms left arg
    NameData modsecarg;    // t-norms and t-conorms right arg
} FormData_pg_fuzzymod;
Box 5.
#define RelationFuzzyCompId 2857
CATALOG(pg_fuzzycomp,2857) BKI_BOOTSTRAP BKI_WITHOUT_OIDS
{
    NameData compname;     // comparator name
    int2     compbegd;     // domain range begin
    int2     compendd;     // domain range end
    NameData compmin;      // support's left bound
    NameData compcore1;    // core's left bound (included)
    NameData compcore2;    // core's right bound (included)
    NameData compmax;      // support's right bound
    int2     comptype;     // comparator type
    NameData compdisd;     // discrete domain
    text     complist[1];  // compare list name
} FormData_pg_fuzzycomp;
Box 6.
#define RelationFuzzyConnId 2880
CATALOG(pg_fuzzyconn,2880) BKI_BOOTSTRAP BKI_WITHOUT_OIDS
{
    NameData connname;  // connector name
    NameData connexpr;  // defining expression
} FormData_pg_fuzzyconn;
Box 7.
#define RelationFuzzyQuanId 2878
CATALOG(pg_fuzzyquan,2878) BKI_BOOTSTRAP BKI_WITHOUT_OIDS
{
    NameData quanname;    // quantifier name
    NameData quanminfp;   // support's left bound
    NameData quancore1;   // core's left bound (included)
    NameData quancore2;   // core's right bound (included)
    NameData quanmaxfp;   // support's right bound
    int2     quantypefp;  // shape 1(trapz.) 2(increases) 3(decreases)
    int2     quantypefq;  // nature 1 (absolute) 2 (proportional)
} FormData_pg_fuzzyquan;
fuzzy condition, the generated structure is a Fuzzy Parse Tree. It is characterized by the presence of fuzzy term nodes as defined in the previous section. Let's illustrate this abstract structure by means of the following query: SELECT FirstName, LastName FROM STUDENT WHERE age = young
Box 8.
typedef struct A_FuzzyPred {
    NodeTag type;
    char   *pred;            // predicate name
    int     minfp;           // support's left bound
    int     modminfp;
    int     core1;           // core's left bound (included)
    int     modcore1;
    int     core2;           // core's right bound (included)
    int     modcore2;
    int     maxfp;           // support's right bound
    int     modmaxfp;
    int     typefp;          // shape 1(trapz.) 2(increases) 3(decreases)
    int     modtypefp;
    unsigned int vno;        // table number
    int     vattno;          // relative attribute number
    Oid     rorigtab;
    Oid     rorigcol;
    List   *compfplist;      // compare fuzzy predicate list
    List   *modcompfplist;
    char   *exprfp;          // fuzzy expression
    char   *disd;            // discrete domain
    bool    hasfm;           // has fuzzy modificator
    char   *fuzzymod;        // fuzzy modificator
    int     Mtype;           // modificator type
    int     Mpower;          // modificator power
    int     normType;
    int     vtype;
} A_FuzzyPred;
As we can see in Figure 4, the parser module generates a parse tree composed of: a root SelectStmt node; a TargetList (projected attributes) with, for each attribute contained in the SELECT clause, a ResTarget node that has a pointer to an Attr node, which contains the table name and a pointer to a Value node where the name of the attribute is indicated. The SelectStmt node also has a fromClause list with a RangeVar node for each input of the FROM clause and a pointer to a RelExpr node with the table name; moreover, the whereClause list has an A_Expr node with the name of the operation, which is a subtree with two leaves: lexpr (the left term, in this case the attribute Age) and rexpr (the right term, in this case the A_FuzzyPred node young).
Box 9.
typedef struct A_FuzzyQuan {
    NodeTag type;
    char   *pred;        // quantifier name
    char   *minfp;       // support's left bound
    char   *core1;       // core's left bound (included)
    char   *core2;       // core's right bound (included)
    char   *maxfp;       // support's right bound
    int     typefp;      // shape 1(trapz.) 2(increases) 3(decreases)
    int     typefq;      // nature 1 (absolute) 2 (proportional)
    List   *args;        // the arguments (list of exprs)
    unsigned int vno;    // table number
    int     vattno;      // relative attribute number
} A_FuzzyQuan;
Box 10.
typedef struct A_FuzzyComp {
    NodeTag type;
    char   *pred;          // comparator name
    char   *minfp;         // support's left bound
    int     modminfp;
    char   *core1;         // core's left bound (included)
    int     modcore1;
    char   *core2;         // core's right bound (included)
    int     modcore2;
    char   *maxfp;         // support's right bound
    int     modmaxfp;
    int     typefp;        // shape 1(trapz.) 2(increases) 3(decreases)
    int     modtypefp;
    unsigned int vno;      // table number
    int     vattno;        // relative attribute number
    Oid     rorigtab;
    Oid     rorigcol;
    List   *compfclist;    // compare fuzzy predicate list
    char   *disd;          // discrete domain
    bool    hasfm;         // has fuzzy modifier
    char   *fuzzymod;      // modifier name
    int     Mtype;         // modifier type
    int     Mpower;        // modifier power
    int     normType;
    A_Const *secarg;
    int     vtype;
} A_FuzzyComp;
The Planner module takes the Fuzzy Query Tree, transforming the fuzzy condition into a boolean condition by applying the Derivation Principle. We have indicators such as a boolean variable named hasFuzzPred, which indicates whether the query tree is fuzzy, and we store a list named fuzzypred with the fuzzy terms, as shown in Figure 6.
The Executor module's physical access mechanisms (sequential scan, index scan and bitmap heap scan) were extended as follows: if the node has a fuzzy predicate, then the mechanism computes the membership degree of each row, taking into account the node's fuzzy predicate attached to the FuzzyPred list, and the degree is projected in the Gr_memb attribute attached to the Target Entry List (tlist), thus generating fuzzy rows. The nested loop, hash and merge physical join operators were also extended so that, once they receive a fuzzy row, they propagate the membership degree bottom-up to the next operator through the Gr_memb attribute. Additional tasks are carried out if we find a fuzzy subquery in the FROM clause: we identify it in the rtable structure and process the fuzzy subquery. Furthermore, the Executor module computes the membership degree according to the fuzzy term (predicate, comparator, quantifier, connector, modifier or partition). Finally, if we find Fuzzy Set operations, we determine their type (INTERSECT, UNION or EXCEPT) in order to apply the corresponding Fuzzy Set operation (the classical operation extended).
EXPERIMENTAL RESULTS
We have designed and tested a set of queries on PostgreSQLf in order to verify functionality and performance. In general, the results obtained are good. The functionality tests and the validation of results were passed with a 100% level of effectiveness. The performance results are very good because the time does not increase much, and in some instances the mean time is very similar to that of the classical case. The tests were run over the TPC Benchmark H (TPC-H), available at http://www.tpc.org/tpch/. For these tests several fuzzy terms were created; some of them are shown in Box 11. We designed 28 fuzzy queries with different querying structures and fuzzy conditions. One of these queries is in Box 12. We compare the performance of processing fuzzy queries in PostgreSQLf against regular query processing in PostgreSQL. For each designed fuzzy query we run a boolean version obtained as the corresponding derived query according to the Derivation Principle. This comparison shows the extra cost added by fuzzy query processing with the mechanism that has been proposed and implemented in this work. We generated two volumes of data (low, 1 gigabyte, and high, 5 gigabytes) and obtained the execution times of the fuzzy queries and their corresponding classical queries; of course, in the first case we additionally obtain the membership degree.
We ran 56 queries in total: 28 for each data volume and, within each level of data volume, 14 classical and 14 fuzzy queries. With these data we performed a statistical study using a multifactorial analysis of variance (ANOVA), taking into account the factors: type of query (boolean or fuzzy) and volume of data (low or high). Statistically, we obtain that the observed processing times are completely explained by the volume of data (Pr(>F) = 2.361e-05) and that there is no significant difference between the times for fuzzy and boolean queries (Pr(>F) = 0.8554). We can see in Figure 8 that the observed times are similar for fuzzy and boolean queries. This behavior is due to the fuzzy query processing mechanism proposed in this chapter.
Box 11.
CREATE FUZZY PREDICATE LOWAVAIL ON 0 .. 10000 AS ( INFINITE, INFINITE, 1000, 1500 );
CREATE RELATIVE QUANTIFIER most_of AS ( 0.5, 0.75, 1.0, 1.0 );
CREATE COMPARATOR ~ ON 1 .. 100 AS ( y-5, y, y, y+5 );
CREATE COMPARATOR << ON 1 .. 100 AS ( INFINITE, INFINITE, y/3, y );
CREATE COMPARATOR >> ON 1 .. 100 AS ( y, y*3, INFINITE, INFINITE );
CREATE COMPARATOR cerca ON nation AS (
  (ALGERIA,ETHIOPIA)/0.1, (ALGERIA,KENYA)/0.2, (ALGERIA,MOROCCO)/0.3,
  (ALGERIA,MOZAMBIQUE)/0.4, (ARGENTINA,BRAZIL)/0.9, (ARGENTINA,CANADA)/0.1,
  (ARGENTINA,PERU)/0.5, (FRANCE,GERMANY)/0.7, (FRANCE,ROMANIA)/0.5,
  (FRANCE,RUSSIA)/0.1 );
Box 12.
SELECT p_name, s_name
FROM part, partsupp, supplier
WHERE (p_partkey = ps_partkey AND ps_suppkey = s_suppkey)
  AND ps_availqty << LOWAVAIL
INTERSECT
SELECT p_name, s_name
FROM part, partsupp, supplier
WHERE (p_partkey = ps_partkey AND ps_suppkey = s_suppkey)
  AND ps_supplycost ~ affordable;
The experimental results demonstrate the feasibility of the proposed design. We used the principal characteristics of SQLf according to the SQL2 and SQL3 standards. According to the benchmark tests using TPC Benchmark H, the proposed design was validated through functionality tests. With a statistical analysis over 56 queries, the implementation proved its scalability and performance compared to classical queries. The main result is the fact that fuzzy querying with our proposed strategy does not have a significant impact on query processing time. This behavior is due to the fuzzy query processing mechanism proposed in this chapter.
There are two key advantages in this mechanism. The first is the derivation of boolean conditions from fuzzy ones at the optimizer level: it avoids the superfluous computation of unnecessary satisfaction degrees for fuzzy conditions. The second is the computation of satisfaction degrees inside the access methods and physical operators: it avoids rescanning the result set to compute the satisfaction degrees. This has been possible because we have proposed fuzzy query processing at the core of the RDBMS engine. With this approach we avoid the overhead due to the post-processing applied at the top for processing SQLf in previous works. The proposed extension for fuzzy query processing, with the reported benefits, should lead to a wider acceptance and use of fuzzy querying systems in real applications. The development of PostgreSQLf is still open. There are features of SQLf that have not yet been implemented at the core of this RDBMS; they may be the matter of future work. Up to the present time we have assumed that the cost model of the optimizer is not significantly affected, and we have shown this in practice; nevertheless, it would be very interesting to carry out a formal study of the cost model for this kind of queries. It is also possible to think about more powerful fuzzy-algebra-specific physical operators to be implemented and considered by the planner/optimizer. Additionally, we are designing various benchmark tests to compare our approach with other implementations of flexible queries.
ACKNOWLEDGMENT
We acknowledge the financial help of Venezuela's FONACIT Project G-2005000278 and France's IRISA/ENSSAT Project Pilgrim. I will lift up mine eyes unto the hills, from whence cometh my help. My help cometh from the LORD, which made heaven and earth (Psalms 121:1-2).
REFERENCES
Bordogna, G., & Psaila, G. (2008). Customizable flexible querying in classic relational databases. In Galindo, J. (Ed.), Handbook of research on fuzzy information processing in databases (pp. 191–217). Hershey, PA: Information Science Reference.
Börzsönyi, S., Kossmann, D., & Stocker, K. (2001). The Skyline operator. In Proceedings of the 17th International Conference on Data Engineering, (pp. 421-430).
Bosc, P., & Pivert, O. (1995). SQLf: A relational database language for fuzzy querying. IEEE Transactions on Fuzzy Systems, 3(1). doi:10.1109/91.366566
Bosc, P., & Pivert, O. (2000). SQLf query functionality on top of a regular relational RDBMS. In Knowledge management in fuzzy databases (pp. 171-190). Heidelberg: Physica-Verlag.
Boulos, J., Dalvi, N., Mandhani, B., Mathur, S., Re, C., & Suciu, D. (2005). MystiQ: A system for finding more answers by using probabilities. System demo in 2005 ACM SIGMOD International Conference on Management of Data. Retrieved October 18, 2009, from http://www.cs.washington.edu/homes/suciu/demo.pdf
Connolly, T., & Begg, C. (2005). Database systems: A practical approach to design, implementation, and management. United Kingdom: Pearson Education Limited.
Fagin, R. (2002). Combining fuzzy information: An overview. SIGMOD Record, 31(2), 109–118. doi:10.1145/565117.565143
Galindo, J. (2005). New characteristics in FSQL, a fuzzy SQL for fuzzy databases. WSEAS Transactions on Information Science and Applications, 2(2), 161–169.
Goncalves, M., & Tineo, L. (2001a). SQLf: Flexible querying language extension by means of the norm SQL2. In The 10th IEEE International Conference on Fuzzy Systems, (pp. 473-476).
Goncalves, M., & Tineo, L. (2001b). SQLf3: An extension of SQLf with SQL3 features. In The 10th IEEE International Conference on Fuzzy Systems, (pp. 477-480).
Goncalves, M., & Tineo, L. (2008). SQLfi y sus aplicaciones. Avances en Sistemas e Informática, 5(2), 33–40. Medellín, Colombia.
Koch, C. (2009). MayBMS: A database management system for uncertain and probabilistic data. In Aggarwal, C. (Ed.), Managing and mining uncertain data (pp. 149–184). Springer. doi:10.1007/978-0-387-09690-2_6
Li, C., Chen-chuan, K., Ihab, C., Ilyas, F., & Song, S. (2005). RankSQL: Query algebra and optimization for relational top-k queries. In Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, (pp. 131-142). ACM Press.
López, Y., & Tineo, L. (2006). About the performance of SQLf evaluation mechanisms. CLEI Electronic Journal, 9(2), 8. Retrieved October 10, 2009, from http://www.clei.cl/cleiej/papers/v9i2p8.pdf
Morin, E. (1999). Seven complex lessons in education for the future. United Nations Educational, Scientific and Cultural Organization. Retrieved October 18, 2009, from http://www.unesco.org/education/tlsf/TLSF/theme_a/mod03/img/sevenlessons.pdf

Timarán, R. (2001). Arquitecturas de integración del proceso de descubrimiento de conocimiento con sistemas de gestión de bases de datos: Un estado del arte. [Universidad del Valle, Colombia]. Ingeniería y Competitividad, 3(2), 44-51.

Tineo, L. (2006). A contribution to database flexible querying: Fuzzy quantified queries evaluation. Unpublished doctoral dissertation, Universidad Simón Bolívar, Caracas, Venezuela.

Widom, J. (2009). Trio: A system for integrated management of data, uncertainty, and lineage. In Aggarwal, C. (Ed.), Managing and mining uncertain data (pp. 113-148). Springer. doi:10.1007/978-0-387-09690-2_5

Zadeh, L. (1994). Soft computing and fuzzy logic. IEEE Software, 11(6), 48-56. doi:10.1109/52.329401

Zadeh, L. A. (1965). Fuzzy sets. Information and Control, 8, 338-353. doi:10.1016/S0019-9958(65)90241-X
Chapter 8
ABSTRACT
This chapter investigates the problems in the integration of fuzzy relational databases and extends the relational data model to support fuzzy multidatabases of type-2 that contain integrated fuzzy relational databases. The extended model is given the name fuzzy tuple source (FTS) relational data model, and it is provided with a set of FTS relational operations to manipulate the global relations, called FTS relations, from such fuzzy multidatabases. The chapter proposes and implements a full set of FTS relational algebraic operations capable of manipulating an extensive set of fuzzy relational multidatabases of type-2 that include fuzzy data values in their instances. To facilitate the formulation of global fuzzy queries over FTS relations in such fuzzy multidatabases, an appropriate extension to SQL can be made so as to obtain the fuzzy tuple source structured query language (FTS-SQL). Many real-world problems involve imprecise and ambiguous information rather than crisp information. Recent trends in the database paradigm are to incorporate fuzzy sets to tackle the imprecise and ambiguous information of real-world problems. Query processing in multidatabases has been extensively studied; however, the same has rarely been addressed for fuzzy multidatabases. This chapter attempts to extend SQL to formulate a global fuzzy query on a fuzzy multidatabase under the FTS relational model discussed earlier. The chapter provides an architecture for distributed fuzzy query processing with a strategy for fuzzy query decomposition and optimization. Proofs of the consistency of the global fuzzy operations and some of the algebraic properties of the FTS relational model are also supplemented.
DOI: 10.4018/978-1-60960-475-2.ch008
INTRODUCTION
Databases hold data that represent properties of real-world objects. Ideally, a set of real-world objects can be described by the constructs of a single data model and stored in one and only one database. Nevertheless, in reality, one can usually find two or more databases storing information about the same real-world objects. There are several reasons for such overlapping representations:
1. Different roles are played by the same real-world objects in different applications. For example, a company can be the customer as well as the supplier of a firm; hence, the company's information can be found in both the customer database and the supplier database.
2. For performance reasons, a piece of information may be fully or partially duplicated and stored in databases at different geographical locations. For example, customer information may be stored at both the branches and the headquarters.
3. Different ownership of information can also lead to information being stored in different databases. For example, the information on a raw material item may be stored in different production databases because each production line wants to own a copy of the information and to exercise control over it.
When two or more databases represent overlapping sets of real-world objects, there is a strong need to integrate these databases in order to support applications of cross-functional information systems. It is therefore important to examine strategies for database integration. An important aspect of database integration is the definition of a global schema that captures the description of the combined (or integrated) database. Here, we define schema integration to be the process of merging the schemas of databases, and instance integration to be the process of integrating the database instances. Schema integration is a problem well studied by database researchers (Batini, Lenzerini, and Navade, 1986; Hayne and Ram, 1990; Kaul, Drosten, and Neuhold, 1990; Larson, Navade and Elmasari, 1989; Spaccapietra, Parent and Dupont, 1992). The solution approaches identify the correspondences between schema constructs (e.g. entity types, attributes, etc.) from different databases and resolve their differences. The end result is a global schema which describes the integrated database. In contrast, instance integration focuses on merging the actual values found in instances from different databases. There are two major problems in instance integration: (a) entity identification; and (b) attribute value conflict resolution.
The entity identification problem involves matching data instances that represent the same real-world objects. The attribute value conflict resolution problem involves merging the values of matching data instances. These two problems have been studied in (Chatterjee and Segev, 1991; Lim, Srivastava, Prabhakar and Richardson, 1993; Wang and Madnick, 1989) and (DeMichiel 1989; Lim, Srivastava, Prabhakar and Richardson, 1993; Lim, Srivastava and Shekhar, 1994; Tasi and Chen, 1993) respectively. It is not possible to have attribute value conflicts resolved without entity identification because attribute value conflict resolution can only be done for matching data instances. In defining the integrated database, one has to choose a global data model so that the global schema can be described by the constructs provided by the data model. The queries that can be formulated against the integrated database also depend on
the global data model. The selection of the global data model depends on a number of factors, including the semantic richness of the local databases (Saltor, Castellanos and Garcia-Solaco, 1991; Seth and Larson, 1990) and the global application requirements. Nevertheless, the impact of instance integration on the global data model has not been well studied so far. In this chapter, we study this impact in the context of the fuzzy relational data model. In this research, we assume that the schema integration process has been carried out to the extent that a global schema has been obtained from a collection of existing (local) fuzzy relational databases. Hence, global users or applications will formulate their queries based on the global schema. Moreover, export schemas that are compatible with the global schema have been defined upon the local fuzzy databases. We classify instance integration into three distinct levels according to the extent to which instance integration is carried out:
Level-0: Neither entity identification nor attribute value conflict resolution is performed. Since no instance integration is involved, the integrated database is defined merely by collecting the instances from different local databases into relations specified by the global schema.
Level-1: Entity identification is performed but not attribute value conflict resolution. Hence, local database instances which correspond to the same real-world objects are matched and combined in the global relations. However, the attributes of these matching database instances are not merged.
Level-2 (complete integration): Both entity identification and attribute value conflicts are resolved. In this case, the local database instances are completely integrated.
Earlier research on database integration indicates that complete integration of instances is the only ideal solution for database integration. Nevertheless, we argue that there are several reasons advocating different levels of instance integration:
1. Firstly, it may not be possible to acquire sufficient knowledge to perform complete instance integration.
2. Secondly, the data quality of the local databases may be low, and it may not be worthwhile to perform complete instance integration.
3. Thirdly, performing instance integration may be costly, especially in the case of virtual database integration, in which instance integration is performed for every global query. For many organizations, the benefits of complete instance integration may not outweigh the costs associated with the integration.
4. Lastly, in some cases, the global users or applications may not require complete instance integration.
Apart from level-2 instance integration, which represents complete integration, levels 0 and 1 impose some constraints upon the global data model. Due to incomplete instance integration, the integrated database is expected to accommodate some remaining instance-level heterogeneities. It is the responsibility of global applications to resolve the remaining instance-level conflicts when the need arises. On the other hand, there exists the possibility that level-0 and level-1 integrated databases may need to be fully integrated later, with human involvement combined with additional domain knowledge. In order to achieve the complete integration requirement, a global data model must preserve source information for partially integrated databases. An extended global data model associated with source information requires a new set of data manipulation operations. On one hand, these operations allow us to query the integrated database; on the other hand, one can make use of these operations to achieve complete database integration.
When complete instance integration has not been performed on multiple databases, it is necessary to augment the global data model with source information in order to identify where the instances in the integrated database come from. The source information allows us to:
i. provide the context information needed to better interpret the non-fully integrated instances;
ii. support meaningful and flexible query formulation on the partially integrated databases; and
iii. perform entity identification and attribute value conflict resolution within queries or applications if the need arises.
A number of different data models have been proposed for multidatabase systems (MDBSs). They can be broadly classified into three main categories according to the degree of integration:
Type-1: These MDBSs choose not to handle any semantic heterogeneity, e.g. MSQL (Litwin, Abdellatif, Zeroual and Nicolas, 1989; Lakshman, Saderi and Subramanian, 1996; Wang and Madnick, 1990). In other words, they do not provide global integrated schemas over the pre-existing databases.
Type-2: These MDBSs may support global integrated schemas but not integrated instances. In these MDBSs, the pre-existing database instances representing the same real-world objects are not entirely integrated together (Agrawal, Keller, Wiederhold and Saraswat, 1995; Liu, Pu and Lee, 1996).
Type-3: These MDBSs integrate both the pre-existing database schemas and instances (Clements, Ganesh, Hwang, Lim, Mediratta, Srivastava, Stenoein and Yang, 1994).
In (Agrawal, Keller, Wiederhold and Saraswat, 1995), a multidatabase is defined to be a set of flexible relations in which local instances that represent the same real-world entities are stored together as groups of tuples. Hence, some implicit grouping of tuples in a flexible relation is required. Flexible relations also capture the source, consistency and selection information of their tuples. A corresponding set of flexible relational operations has been developed to manipulate the flexible relations. Nevertheless, the flexible relational model is not a natural extension of the relational model. Furthermore, the join between flexible relations has not been defined. A universal relation approach to model and query multidatabases is proposed in (Zhao, Segev and Chatterjee, 1995). In this approach, a multidatabase is a universal relation instead of a set of relations. Queries on the universal relation are translated into multiple local queries against the local relations. The final query results are formed by unioning the local query results. Source information is attached to the tuples in the final query results to indicate where the tuples come from. However, the source attribute is included in neither the universal relation nor its query specification. Joins and other operations that involve multiple component databases are not allowed in this model.
While considering real-world objects, another very important consideration that needs to be taken into account is the inherent fuzziness in the data instances. Often the data we have to manage are far from being precise and certain. Indeed, the attribute value of an item may be completely unknown or only partially known (for example, a probability distribution may be known over the possible values of the attribute). Besides, an attribute may be irrelevant for some of the considered items; moreover, we may not know whether a value does not exist or is simply unknown. In such circumstances fuzzy relations are incorporated in the database. The integration of fuzziness in a database provides a means of representing, storing, and manipulating imprecise and uncertain information. Since our knowledge of the real world is often imperfect, one's ability to create databases of integrity poses a great challenge. To maintain the integrity of a database in situations where knowledge of the real world is imperfect, one may restrict the model of the database to the portion about which only perfect information is available; this leads to the loss of valuable information, leaving relevant data unexplored, queries unanswered and user requests unsatisfied, and results in a degraded quality of information delivery. To overcome these hazards, formalisms have been suggested that allow the representation, storage, retrieval and manipulation of uncertain information. In this research work the term FUZZY is used as a generalized term implying imprecision, uncertainty, partial knowledge, vagueness and ambiguity. Fuzzy relations have been treated by Kaufman (1975) and Zadeh (1965). A considerable body of work on solving the equality problem among fuzzy data values exists in the literature. Buckles and Petry (1983), and Prade and Testemale (1984), introduced the concept of a similarity measure to test two domain values for the equality of fuzzy data. Rundensteiner (1989) introduced a new equality measure termed the resemblance relation. The concepts behind the resemblance relation and the proximity relation are somewhat similar; the latter has been exploited by Raju and Majumdar (1986). A fuzzy probabilistic relational data model was proposed by Zhang, Laun and Meng (1997) to integrate local fuzzy relational databases into a fuzzy multidatabase system by identifying and resolving new types of conflicts in local fuzzy database schemas. Another approach (Ma, Zhang and Ma, 2000) addressed fuzzy multidatabase systems by identifying and resolving the conflicts involved in their schema integration.
Let $U^* = U_1 \times U_2 \times \cdots \times U_n$ denote the Cartesian product of the universes of discourse $U_i$, with $u_i \in U_i$, $i = 1, 2, \ldots, n$. An n-ary fuzzy relation R in $U^*$ is a relation characterized by an n-variate membership function ranging over $U^*$, that is, $\mu_R : U^* \to [0, 1]$.
At level-0 instance integration, export fuzzy database instances are not integrated at all, although the mapping from the export fuzzy schemas to the global fuzzy schema has been identified. It is necessary to attach source information to the export instances when they appear in the global fuzzy relations. For example, consider the two export fuzzy relations Emp and Faculty as given in Table 1. It has already been shown in (Sharma, Goswami and Gupta, 2004) that the set of fuzzy inclusion dependencies between the two export fuzzy relations Emp and Faculty is not empty. This establishes the fact that the two export fuzzy relations Emp and Faculty are related and hence can be integrated. While integrating two databases (at level-0), only the related relations are merged, as illustrated in Table 2. As shown in Table 2, we have assigned the export fuzzy database identifier DB1 to instances that come from the Emp relation in FRDB1, and DB2 to instances that come from Faculty in FRDB2. Since we only have one export database for each local database, the export database identifier can be treated as the local database identifier. In this way, we have extended the fuzzy relational data model with an additional source attribute, and we call the relational data model with such an extension the Fuzzy Tuple Source (FTS) relational data model. Note that even when the fuzzy schemas of our two export fuzzy relational database examples are compatible with the global schema, there may still be
global database attributes that cannot be found in all export databases. In that case, we assume Null values for the missing attributes in the export instances. At first glance, one may want to treat the additional source attribute just like another normal attribute. While this may be correct at the data storage level, we advocate that the source attribute deserves special treatment from both the data modeling and the query processing perspectives. The source attribute, unlike other normal attributes, must be present in every FTS relation and has a special meaning: it not only relates the instances to the local fuzzy databases they come from, but also identifies the context of the data instances. Furthermore, it should be manipulated differently from the other normal attributes in query processing. For the FTS relational model, the values of the source attributes are used purely for implementation purposes. They do not provide any semantics regarding the local fuzzy relational databases. In order to maintain and provide source (context) semantics in a fuzzy relational multidatabase, we can establish a source table with at least two attributes. The first attribute stores the local fuzzy relational database identifiers, whereas the other attributes store information about the local fuzzy relational databases. Such information could include the application domains, the names of geographical locations, the types of database management systems, the persons in charge (e.g., the DBA), and even an assessment of the data quality level of each local fuzzy relational database. This source table is employed to retain the context semantics, which can be used to interpret the global fuzzy query results of a fuzzy relational multidatabase with level-0 instance integration, or used by other data analysis tools. In addition, this table contains useful information for level-1 instance integration. For example, the context semantics of our example can be stored in the source table as given in Table 3. Being an extension to the traditional fuzzy relational data model, the FTS relational data model can represent relations which do not require source information by assigning * values to the source attributes. The standard fuzzy relational operations can still operate on FTS relations by ignoring the source attributes. Note that the resultant fuzzy relations may no longer retain the values of the source attributes. With the special meaning attached to the source attribute, we design manipulation operations that involve the source attributes, which are called fuzzy tuple source (FTS) relational algebraic operators.

Figure 1. Membership functions and mappings for databases in Table 1
An attribute value in a fuzzy relation may take one of the following forms:
1. a single number (e.g. Age = 22);
2. a set of scalars (e.g. Aptitude = {average, good});
3. a set of numbers (e.g. {20, 21, 25});
4. a possibilistic distribution of scalar domain values (e.g. Aptitude = {0.4/average, 0.7/good});
5. a possibilistic distribution of numeric domain values (e.g. Age = {0.4/23, 1.0/24, 0.8/25});
6. a real number from [0,1] (e.g. Heavy = 0.9);
7. a designated null value (e.g. Age = unknown).
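The following is a minimal sketch, not taken from the chapter, of how the attribute value forms listed above can be represented uniformly as possibility distributions; the class name FuzzyValue and its helpers are illustrative assumptions.

from typing import Dict, Union

Scalar = Union[str, int, float]

class FuzzyValue:
    """A fuzzy attribute value as a possibility distribution {value: degree}."""

    def __init__(self, dist: Dict[Scalar, float]):
        # Every possibility degree must lie in [0, 1].
        assert all(0.0 <= d <= 1.0 for d in dist.values())
        self.dist = dist

    @classmethod
    def crisp(cls, v: Scalar) -> "FuzzyValue":
        # A single number or scalar is the degenerate distribution {v: 1.0}.
        return cls({v: 1.0})

    @classmethod
    def from_set(cls, vs) -> "FuzzyValue":
        # A set of scalars/numbers: every member is fully possible.
        return cls({v: 1.0 for v in vs})

age_crisp = FuzzyValue.crisp(22)                       # Age = 22
aptitude = FuzzyValue.from_set({"average", "good"})    # {average, good}
age_dist = FuzzyValue({23: 0.4, 24: 1.0, 25: 0.8})     # {0.4/23, 1.0/24, 0.8/25}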
Arithmetic Operations
Arithmetic operations on the different fuzzy data types have already been discussed in Sharma, Goswami and Gupta (2008).
Lemma: Let EQ be a resemblance relation on a set U. For all $\alpha$ with $0 < \alpha \leq 1$, the $\alpha$-level sets $EQ_\alpha$ are tolerance relations on U.

The concept of an $\alpha$-resemblance was introduced by Rundensteiner et al. (1989).

Definition: Given a set U with a resemblance relation EQ as previously defined, $(U, EQ)$ is called a resemblance space. An $\alpha$-level set $EQ_\alpha$ induced by EQ is termed an $\alpha$-resemblance set. Two values $x, y \in U$ that resemble each other with a degree larger than or equal to $\alpha$ (i.e. $\mu_{EQ}(x, y) \geq \alpha$) are said to be $\alpha$-resemblant, for which we propose the notation $x\,EQ_\alpha\,y$. A set $P \subseteq U$ is called an $\alpha$-preclass on $(U, EQ)$ if $\forall x, y \in P$, x and y are $\alpha$-resemblant (i.e. $x\,EQ_\alpha\,y$ holds).

To define the fuzzy relations GREATER THAN (GT) and LESS THAN (LT), let us consider a proximity relation P defined as given below:

Definition: A proximity relation P over a universe of discourse U is a reflexive and symmetric fuzzy relation with $\mu_P(u_1, u_2) \in [0, 1]$, where $u_1, u_2 \in U$ (Kandel, 1986).

Definition: Let $P_1$ be a proximity relation defined over U. The fuzzy relational operator GT is defined to be a fuzzy subset of $U \times U$, where GT satisfies the following properties $\forall u_1, u_2 \in U$:
$$\mu_{GT}(u_1, u_2) = \begin{cases} 0 & \text{if } u_1 \leq u_2 \\ 1 - \mu_{P_1}(u_1, u_2) & \text{otherwise.} \end{cases}$$

Definition: Let $P_2$ be a proximity relation defined over a universe of discourse U. The fuzzy relational operator LT is defined to be a fuzzy subset of $U \times U$, where LT satisfies the following properties $\forall u_1, u_2 \in U$:

$$\mu_{LT}(u_1, u_2) = \begin{cases} 0 & \text{if } u_1 \geq u_2 \\ 1 - \mu_{P_2}(u_1, u_2) & \text{otherwise.} \end{cases}$$

The membership functions of the fuzzy relations NOT EQUAL (NEQ), GREATER THAN OR EQUAL (GOE) and LESS THAN OR EQUAL (LOE) can be defined in terms of those of EQ, GT and LT as follows:

$$\mu_{NEQ}(u_1, u_2) = 1 - \mu_{EQ}(u_1, u_2)$$
$$\mu_{GOE}(u_1, u_2) = \max[\mu_{GT}(u_1, u_2), \mu_{EQ}(u_1, u_2)]$$
$$\mu_{LOE}(u_1, u_2) = \max[\mu_{LT}(u_1, u_2), \mu_{EQ}(u_1, u_2)]$$
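A minimal sketch of the comparison operators just defined follows; the resemblance relation EQ and the proximity relations P1, P2 are passed in as callables, and the distance-based shape assumed for them is invented for illustration only.

def mu_gt(u1, u2, p1):
    # GT is 0 unless u1 > u2; otherwise it is the complement of proximity,
    # so values far apart are "greater" to a high degree.
    return 0.0 if u1 <= u2 else 1.0 - p1(u1, u2)

def mu_lt(u1, u2, p2):
    return 0.0 if u1 >= u2 else 1.0 - p2(u1, u2)

def mu_neq(u1, u2, eq):
    return 1.0 - eq(u1, u2)

def mu_goe(u1, u2, eq, p1):
    # "Greater than or equal" is the disjunction of GT and EQ: take the max.
    return max(mu_gt(u1, u2, p1), eq(u1, u2))

def mu_loe(u1, u2, eq, p2):
    return max(mu_lt(u1, u2, p2), eq(u1, u2))

# Assumed proximity/resemblance: degree decays linearly with distance.
prox = lambda u1, u2: max(0.0, 1.0 - abs(u1 - u2) / 10.0)
print(mu_goe(25, 23, eq=prox, p1=prox))   # -> 0.8 (dominated by resemblance)
print(mu_gt(30, 10, prox))                # -> 1.0 (clearly greater)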
Each export fuzzy relation $L_{ij}$ is a component of the global fuzzy relation $G_j$, which is derived from $L_{1j}, \ldots, L_{nj}$ by the merge operation.
Table 4. Set of export fuzzy relations from databases DB1 & DB2

[DB1]: Emp1
Name | Age    | Hall | r
Jaya | .5/old | MT   | .50
Apu  | .5/mid | JCB  | .50

[DB2]: Emp2
Name | Age    | r
Jaya | .5/old | .50
Maya | .5/mid | .50

[DB1]: Dept1
Dname | HoD  | Fund    | r
Chem. | Jaya | .63/low | .63
Eco   | Maya | .63/mod | .63

[DB2]: Dept2
Dname | Staff | HoD  | Fund    | r
Eco   | 10    | Maya | .6/mod  | .60
Chem. | 15    | Jaya | .63/mod | .63
In some ways the merge operation is similar to an outer-union, except that an additional source attribute is added to the operand relations before they are outer-unioned. An integrated global fuzzy relation thus obtained by the use of the merge operation is given the name FTS relation. A set of such FTS relations is called a fuzzy multidatabase.
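The following is a minimal sketch of the merge operation just described, not the chapter's implementation: an outer-union of export relations after tagging each tuple with its source DBid. Tuples are modelled as plain dicts and Null as None; all names are illustrative assumptions.

def merge(*tagged_relations):
    """tagged_relations: (dbid, list-of-tuple-dicts) pairs -> one FTS relation."""
    # The global scheme is the union of all export schemes plus 'source'.
    attrs = set()
    for _, rel in tagged_relations:
        for t in rel:
            attrs |= t.keys()
    fts = []
    for dbid, rel in tagged_relations:
        for t in rel:
            row = {a: t.get(a) for a in attrs}  # missing attributes -> Null
            row["source"] = dbid
            fts.append(row)
    return fts

emp1 = [{"Name": "Jaya", "Age": ".5/old", "Hall": "MT", "r": 0.50}]
emp2 = [{"Name": "Maya", "Age": ".5/mid", "r": 0.50}]
emp = merge(("DB1", emp1), ("DB2", emp2))   # Emp of Table 5, in miniature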
Let $a_R$ and $a_S$ denote fuzzy attribute values of the FTS relations R and S, let $\mu_R(a_R)$ and $\mu_S(a_S)$ denote their membership grades to R and S respectively, and let $s_{a_R}$ and $s_{a_S}$ denote their source export fuzzy database identifiers (DBids). Let T be the FTS relation obtained as a result of applying an FTS relational operation over R and/or S. FTS relational algebraic operators are marked with a flag fs (viz. $\sigma^{fs}$, $\pi^{fs}$, $\bowtie^{fs}$, $\cup^{fs}$, $\cap^{fs}$, $-^{fs}$) to distinguish them from the corresponding fuzzy relational operators.

Now, we introduce the formal definitions of the FTS relational operators.

Definition (FTSselect):

$$T = \sigma^{fs}_{p,\alpha,\beta}(R) = \left\{ \left(a_R, \mu_R(a_R), s_{a_R}\right) \mid p_\alpha(a_R, s_{a_R}) \wedge \mu_R(a_R) \geq \beta \right\}$$

where $p_\alpha(a_R, s_{a_R})$ is a fuzzy predicate that uses fuzzy comparison operators. The value of $\alpha$ is used while deciding the $\alpha$-resemblance of two fuzzy values in the predicate. The predicate may involve both fuzzy and source attributes.

Table 5. FTS relations: Level-0 integration of DB1 & DB2

Emp = merge(Emp1, Emp2)
Name | Age    | Hall | r   | source
Jaya | .5/old | MT   | .50 | DB1
Apu  | .5/mid | JCB  | .50 | DB1
Jaya | .5/old | Null | .50 | DB2
Maya | .5/mid | Null | .50 | DB2

Dept = merge(Dept1, Dept2)
Dname | Staff | HoD  | Fund    | r   | source
Chem. | Null  | Jaya | .63/low | .63 | DB1
Eco   | Null  | Maya | .63/mod | .63 | DB1
Eco   | 10    | Maya | .6/mod  | .60 | DB2
Chem. | 15    | Jaya | .63/mod | .63 | DB2
Table 6. Result of the FTS select operation $\sigma^{fs}_{HoD = Maya,\,.5}(Dept)$ on the FTS relation Dept from Table 5

Dname | Staff | HoD  | Fund    | r   | source
Eco   | Null  | Maya | .63/mod | .63 | DB1
Eco   | 10    | Maya | .6/mod  | .60 | DB2
For the source attribute, $s \in \{DBset, *\}$, where DBset is the set of DBids of the local fuzzy databases. The predicate is defined only on those tuples that qualify the threshold of tuple membership given by $\beta \in [0, 1]$; this gives another level of precision while applying the FTS relational operations (see Table 6).
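A sketch of FTSselect over the dict-based FTS tuples used in the merge sketch above follows; the fuzzy predicate (with $\alpha$ folded in) is passed as a callable, and plain equality stands in for the EQ$_\alpha$ test. The names are assumptions for illustration.

def fts_select(rel, predicate, beta=1.0):
    # Keep tuples that satisfy the fuzzy predicate and whose tuple
    # membership r reaches the threshold beta.
    return [t for t in rel if t["r"] >= beta and predicate(t)]

dept = [
    {"Dname": "Eco", "HoD": "Maya", "Fund": ".63/mod", "r": 0.63, "source": "DB1"},
    {"Dname": "Eco", "HoD": "Maya", "Fund": ".6/mod", "r": 0.60, "source": "DB2"},
    {"Dname": "Chem.", "HoD": "Jaya", "Fund": ".63/low", "r": 0.63, "source": "DB1"},
]
resembles = lambda x, y: x == y     # stand-in for the EQ_alpha test
print(fts_select(dept, lambda t: resembles(t["HoD"], "Maya"), beta=0.5))
# -> the two Maya tuples, mirroring Table 6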
Definition (FTSproject):

$$T_1 = \pi^{fs\,sameDB}_{A,\alpha,\beta}(R) = \left\{ \left(t.A, \mu_{T_1}(t.A), t.s\right) \mid t \in R \wedge \mu_{T_1}(t.A) \geq \beta \wedge t.s \neq * \right\}$$

$$T_2 = \pi^{fs\,anyDB}_{A,\alpha,\beta}(R) = \left\{ \left(t.A, \mu_{T_2}(t.A), smerge\left(s\left(\sigma_{EQ_\alpha(A,\,t.A)} R\right)\right)\right) \mid t \in R \wedge \mu_{T_2}(t.A) \geq \beta \right\}$$

where A is a subset of the fuzzy attributes in R, and

$$smerge(S) = \begin{cases} s & \text{if } \forall s_1, s_2 \in S,\ s_1 = s_2 = s, \\ * & \text{otherwise,} \end{cases}$$

where S is a set of source DBids.

Here the equality of tuples has a special meaning. Two tuples from FTS relations are said to be equal iff each of their attribute values (both crisp and fuzzy) are $\alpha$-resemblant (i.e., for the case of the projected relation $T_1$, if $t_1, t_2 \in R$ and $EQ_\alpha(t_1.A, t_2.A)$, then $(t_1.A, \mu_{T_1}(t_1.A), t_1.s) \in T_1$ if $\mu_{T_1}(t_1.A) \geq \mu_{T_1}(t_2.A)$, and $(t_2.A, \mu_{T_1}(t_2.A), t_2.s) \in T_1$ otherwise).
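A sketch of smerge and the two projection variants defined above follows; grouping by plain value equality stands in for EQ$_\alpha$ grouping, and all names are assumptions for illustration.

def smerge(sources):
    s = set(sources)
    return s.pop() if len(s) == 1 else "*"

def fts_project(rel, attrs, mode="sameDB", beta=0.0):
    groups = {}  # projected attribute values -> (max membership, sources)
    for t in rel:
        if t["r"] < beta:
            continue
        key = tuple(t[a] for a in attrs)      # stand-in for EQ_alpha grouping
        if mode == "sameDB":
            key = key + (t["source"],)        # keep per-source tuples apart
        mu, srcs = groups.get(key, (0.0, []))
        groups[key] = (max(mu, t["r"]), srcs + [t["source"]])
    return [dict(zip(attrs, key[:len(attrs)]), r=mu, source=smerge(srcs))
            for key, (mu, srcs) in groups.items()]

dept = [{"Dname": "Eco", "Fund": ".63/mod", "r": 0.63, "source": "DB1"},
        {"Dname": "Eco", "Fund": ".63/mod", "r": 0.60, "source": "DB2"}]
print(fts_project(dept, ["Dname", "Fund"], mode="anyDB"))  # one merged '*' row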
Table 7. Results of the FTS project operations on the FTS relation Dept from Table 5

$\pi^{fs\,sameDB}_{Dname,Fund,.8,.6}(Dept)$
Dname | Fund    | r   | source
Chem. | .63/low | .63 | DB1
Eco   | .63/mod | .63 | DB1
Eco   | .6/mod  | .60 | DB2
Chem. | .63/mod | .63 | DB2

$\pi^{fs\,anyDB}_{Dname,Fund,.8,.6}(Dept)$
Dname | Fund    | r   | source
Chem. | .63/low | .63 | DB1
Eco   | .63/mod | .63 | *
Chem. | .63/mod | .63 | DB2
Table 8. Results of the FTS join operations $Emp \bowtie^{fs\,sameDB}_{p,\,.5} Dept$ and $Emp \bowtie^{fs\,anyDB}_{p,\,.5} Dept$ on the FTS relations Emp and Dept from Table 5
A flag (sameDB or anyDB) is attached to $\pi^{fs}$ to indicate whether projected attributes from different export fuzzy databases sharing the same fuzzy attribute values should be merged or not. smerge() produces * for resultant tuples that have multiple sources. The original source values are not maintained because (1) the source attribute values should be atomic, and (2) by maintaining a set of values for the source information, it would still not be possible to tell the exact source of each individual attribute value of a given FTS tuple (see Table 7).

Definition (FTSjoin):

$$T = R \bowtie^{fs}_{p,\alpha,\beta} S = \left\{ \left( (t_R.A, t_S.A),\ \min(\mu_R(t_R.A), \mu_S(t_S.A)),\ smerge(\{t_R.s, t_S.s\}) \right) \mid p_\alpha(t_R.A, t_S.A) \wedge \min(\mu_R(t_R.A), \mu_S(t_S.A)) \geq \beta \right\}$$

where p is a conjunction of fuzzy predicates which may include source-related predicates. It can be observed that the operator $\bowtie^{fs\,anyDB}_{p,\alpha,\beta}$ joins two fuzzy tuples irrespective of their sources, whereas the operator $\bowtie^{fs\,sameDB}_{p,\alpha,\beta}$ joins only fuzzy tuples with identical source values (see Table 8).
Definition (FTS union):

$$T_1 = R \cup^{fs\,sameDB}_{\alpha,\beta} S = \left\{ (t.A, \mu_{T_1}(t.A), t.s) \mid \left( EQ_\alpha(t.A, t_R.A) \wedge \mu_{T_1}(t.A) = \max(\mu_{T_1}(t.A), \mu_R(t_R.A)) \geq \beta \wedge t.s = t_R.s \right) \vee \left( EQ_\alpha(t.A, t_S.A) \wedge \mu_{T_1}(t.A) = \max(\mu_{T_1}(t.A), \mu_S(t_S.A)) \geq \beta \wedge t.s = t_S.s \right) \right\}$$

$$T_2 = R \cup^{fs\,anyDB}_{\alpha,\beta} S = \left\{ \left(t.A, \mu_{T_1}(t.A), smerge\left(s\left(\sigma_{EQ_\alpha(A,\,t.A)}(R \cup S)\right)\right)\right) \mid t.A \in \left((\pi_{A,\alpha,\beta} R) \cup (\pi_{A,\alpha,\beta} S)\right) \right\}$$

Definition (FTS intersection):

$$T_1 = R \cap^{fs\,sameDB}_{\alpha,\beta} S = \left\{ (t.A, \mu_{T_1}(t.A), t.s) \mid \left( EQ_\alpha(t.A, t_R.A) \wedge \mu_{T_1}(t.A) = \max(\mu_{T_1}(t.A), \mu_R(t_R.A)) \geq \beta \wedge t.s = t_R.s \right) \wedge \left( EQ_\alpha(t.A, t_S.A) \wedge \mu_{T_1}(t.A) = \max(\mu_{T_1}(t.A), \mu_S(t_S.A)) \geq \beta \wedge t.s = t_S.s \right) \right\}$$

$$T_2 = R \cap^{fs\,anyDB}_{\alpha,\beta} S = \left\{ \left(t.A, \mu_{T_1}(t.A), smerge\left(s\left(\sigma_{EQ_\alpha(A,\,t.A)}(R \cap S)\right)\right)\right) \mid t.A \in \left((\pi_{A,\alpha,\beta} R) \cap (\pi_{A,\alpha,\beta} S)\right) \right\}$$

Definition (FTS minus):

$$T_1 = R -^{fs\,sameDB}_{\alpha,\beta} S = \left\{ (t.A, \mu_{T_1}(t.A), t.s) \mid \left( EQ_\alpha(t.A, t_R.A) \wedge \mu_{T_1}(t.A) = \max(\mu_{T_1}(t.A), \mu_R(t_R.A)) \geq \beta \wedge t.s = t_R.s \right) \wedge \left( EQ_\alpha(t.A, t_S.A) < \alpha \vee \max(\mu_{T_1}(t.A), \mu_S(t_S.A)) < \beta \vee t.s \neq t_S.s \right) \right\}$$

$$T_2 = R -^{fs\,anyDB}_{\alpha,\beta} S = \left\{ \left(t.A, \mu_{T_1}(t.A), smerge\left(s\left(\sigma_{EQ_\alpha(A,\,t.A)} R\right)\right)\right) \mid t.A \in \left((\pi_{A,\alpha,\beta} R) - (\pi_{A,\alpha,\beta} S)\right) \right\}$$

Remark: In the definitions of union, intersection and minus, A is the set of fuzzy attributes that are common to R and S. The default value of $\alpha$ and $\beta$ is 1.
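A sketch of the FTS union in its two variants follows, treating EQ$_\alpha$ as plain equality on the common attributes A and taking the max of the parents' memberships, per the definitions above; all names are assumptions for illustration.

def fts_union(r, s, attrs, mode="sameDB", beta=0.0):
    out = {}
    for t in list(r) + list(s):
        key = tuple(t[a] for a in attrs)
        if mode == "sameDB":
            key = key + (t["source"],)   # tuples only merge within one source
        prev = out.get(key)
        if prev is None:
            out[key] = dict(t)
        else:
            prev["r"] = max(prev["r"], t["r"])   # max membership
            if prev["source"] != t["source"]:
                prev["source"] = "*"             # smerge under anyDB
    return [t for t in out.values() if t["r"] >= beta]

r = [{"Name": "Datta", "Subject": "OOPS", "r": 0.65, "source": "DB1"}]
s = [{"Name": "Datta", "Subject": "OOPS", "r": 0.63, "source": "DB2"}]
print(fts_union(r, s, ["Name", "Subject"], mode="anyDB", beta=0.5))
# -> one merged tuple (Datta, OOPS, 0.65, '*'), mirroring Table 9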
Table 9. Results of the FTS union, intersection and minus operations on the FTS relations R and S

FTS relation R
Name  | Subject | Grade | r   | source
Gupta | DBMS    | .5/C  | .50 | DB1
Raja  | OOPS    | B     | 1.0 | DB2
Datta | OOPS    | .65/A | .65 | DB1

FTS relation S
Name  | Subject | Grade | r   | source
Datta | OOPS    | .63/A | .63 | DB2
Sonu  | Graph   | .73/A | .73 | DB2
gupta | DBMS    | .5/C  | .50 | DB1

$R \cup^{fs\,sameDB}_{.8,.5} S$
Name  | Subject | Grade | r   | source
Gupta | DBMS    | .5/C  | .55 | DB1
Raja  | OOPS    | B     | 1.0 | DB2
Datta | OOPS    | .65/A | .65 | DB1
Datta | OOPS    | .63/A | .63 | DB2

$R \cup^{fs\,anyDB}_{.8,.5} S$
Name  | Subject | Grade | r   | source
Gupta | DBMS    | .5/C  | .50 | DB1
Raja  | OOPS    | B     | 1.0 | DB2
Datta | OOPS    | .65/A | .65 | *

$R \cap^{fs\,sameDB}_{.8,.5} S$
Name  | Subject | Grade | r   | source
Gupta | DBMS    | .5/C  | .50 | DB1

$R \cap^{fs\,anyDB}_{.8,.5} S$
Name  | Subject | Grade | r   | source
Gupta | DBMS    | .5/C  | .50 | DB1
Datta | OOPS    | .65/A | .65 | *

$R -^{fs\,sameDB}_{.8,.5} S$
Name  | Subject | Grade | r   | source
Raja  | OOPS    | B     | 1.0 | DB2
Datta | OOPS    | .65/A | .65 | DB1

$R -^{fs\,anyDB}_{.8,.5} S$
Name | Subject | Grade | r   | source
Raja | OOPS    | B     | 1.0 | DB2
Many real-world problems involve imprecise and ambiguous information rather than crisp information. Recent trends in the database paradigm are to incorporate fuzzy sets to tackle the imprecise and ambiguous information of real-world problems. Query processing in multidatabases has been extensively studied; however, the same has rarely been addressed for fuzzy multidatabases. In this chapter we have made an attempt to extend SQL to formulate a global fuzzy query on a fuzzy multidatabase under the FTS relational model discussed earlier. We have also provided an architecture for distributed fuzzy query processing with a strategy for fuzzy query decomposition and optimization.
Assumptions
Our system configuration basically consists of one global site and a number of local sites. One of the local sites can be the global site, in which case the data communication cost between that site and the global site is saved. Each local site maintains its own data management system that supports FSQL (Galindo, Medina, Cubero and García, 2001; Galindo, Medina, Pons and Cubero, 1998) and is independent of the other sites. A user communicates only with the global site. A global fuzzy query using FTS-SQL is entered
at the global site, and results are received from the global site. The global site maintains information about each local fuzzy database structure, such as fuzzy schema definitions and which kind of fuzzy relations are stored in which local fuzzy database. This allows the global site to efficiently schedule a global fuzzy query processing plan. Each local fuzzy database can accommodate any number of fuzzy relations and optimally process each local fuzzy query given by the global site.
Consider, for example, a global fuzzy query that applies a source predicate $p_s$ to an FTS join; such a selection can be pushed down to the operand relations:

$$\sigma^{fs}_{p_s}\left(R \bowtie^{fs}_{p,\alpha,\beta} S\right) = \left(\left(\sigma^{fs}_{p_s} R\right) \bowtie^{fs}_{p,\alpha,\beta} \left(\sigma^{fs}_{p_s} S\right)\right)$$

This translation is allowed because both $\bowtie^{fs\,sameDB}_{p,\alpha,\beta}$ and $\bowtie^{fs\,anyDB}_{p,\alpha,\beta}$ are commutative and associative; however, the source predicate must be evaluated before any FTS join is carried out, since the attributes of the operand relations are merged during the join. An FTS-SQL fuzzy query involving the union, intersection or subtraction of two or more FTS relations can be written as given below:

SELECT < target attributes > [anyDB]/[sameDB] WITH SOURCE CONTEXT {optional}
FROM < FTS relation >
WHERE < selection/join conditions > [anyDB]/[sameDB]
HAVING < values of $\alpha$, $\beta$ >
Union/Intersection/Minus [anyDB]/[sameDB]
SELECT < target attributes > [anyDB]/[sameDB] WITH SOURCE CONTEXT {optional}
FROM < FTS relation >
WHERE < selection/join conditions > [anyDB]/[sameDB]
HAVING < values of $\alpha$, $\beta$ >

In summary, the main features offered by FTS-SQL include:
1. FTS-SQL satisfies the closure property: given FTS relations, queries produce FTS relations as results.
2. FTS-SQL allows the source options [anyDB]/[sameDB] to be specified on the SELECT and WHERE clauses, as well as on the union, intersection and minus FTS operations. The source option on the SELECT clause determines whether tuples from different local databases can be combined during projection; the source option on the WHERE clause determines whether tuples from different local databases can be combined during a join.
3. Source predicates on the operand FTS relations allow queries to be directed to a specific local database. A source predicate is represented by <relation name>.source in <set of local DBids>.
4. Queries can be formulated on both crisp and fuzzy attributes.
5. By specifying different values of $\alpha$ and $\beta$ in the HAVING clause, the precision of the fuzzy query formulation can be adjusted. $\alpha \in [0,1]$ is used for the $\alpha$-resemblance of fuzzy values in fuzzy predicates, whereas $\beta \in [0,1]$ imposes a constraint while selecting a fuzzy tuple. The default value for both of them is 1, which corresponds to crisp values.

It can be shown that some FTS-SQL queries involving the anyDB option cannot be performed, either directly or indirectly, by normal SQL. Even when some FTS-SQL queries can be computed by FSQL expressions, we believe that FTS-SQL will greatly reduce the effort of fuzzy query formulation on a fuzzy relational multidatabase of type-2. Using the clause WITH SOURCE CONTEXT, tuples in the fuzzy query results can be joined with their source-related information available in the source relation table.
Figure 2.
In the following, we show a number of simple global fuzzy queries formulated using FTS-SQL over the FTS relations given in Table 5, and explain their semantics. In every example we have assumed $\alpha = 0.8$ and $\beta = 0.6$ to indicate the precision of the fuzzy query. With the source option [sameDB] assigned to the SELECT clause, Q1 requires the projection of Dept to include the source attribute (see Figure 2). Hence fuzzy tuples with identical projected attribute values but different source values remain separate in the fuzzy query result, e.g. the information about the Eco department: the two fuzzy values .6/mod and .63/mod are $\alpha$-resemblant, but the tuples related to the Eco department are not merged and remain separate in the fuzzy query result. If the source of the tuples is not important during the projection, the source option [anyDB] can be assigned to the SELECT clause, as shown in the next fuzzy query example. As shown in the result of the fuzzy query Q2, fuzzy tuples that are $\alpha$-resemblant are merged using fuzzy union, and the source value of the merged tuple is indicated by * if the parent tuples come from different sources. If it is required to view the source context together with the result tuples, the WITH clause is used, as shown in the next example, Q3. It can be observed that the addition of the WITH clause causes the source relation given in Table 3 to be joined with the FTS relation(s) in the WHERE clause using the source attribute. When a tuple has * as its source value, its source context will carry Null values for the context attributes.
DISTRIBUTED FUZZY QUERY PROCESSING ARCHITECTURE

Fuzzy Query Mediator and Fuzzy Query Agents
The proposed distributed fuzzy query processor has a fuzzy query mediator and, for each local database, one fuzzy query agent. The responsibilities of the fuzzy query mediator are:
1. to take the global queries given by multidatabase applications as input and decompose them into multiple sub-queries to be evaluated by the fuzzy query agents of the respective local databases. For this decomposition process it has to refer to the Global Fuzzy Schema to Export Fuzzy Schema Mapping information. This unique information is supposed to be stored in the FMDBS.
2. to forward the decomposed queries to the respective local fuzzy query agents.
Figure 3.
Figure 4.
3. to assemble the sub-fuzzy query results returned by the fuzzy query agents and further process the assembled results in order to compute the final fuzzy query result.
4. to transform the format of the final fuzzy query result back into a format that is acceptable to multidatabase applications. Here again it refers to the Global Fuzzy Schema to Export Fuzzy Schema Mapping information.

The responsibilities of the fuzzy query agents are:
1. to transform sub-queries into local queries that can be directly processed by the local database systems. This transformation process refers to the Export Schema and Export to Local Schema Mapping information, which is supposed to be stored in the respective local databases.
2. to transform the local fuzzy query results back (using the Export Schema and Export to Local Schema Mapping information) into a format that is acceptable to the fuzzy query mediator, and to forward the formatted results to the fuzzy query mediator.
Sub-queries are independent; hence they may be processed in parallel at the respective local databases, which reduces the fuzzy query response time. Fuzzy query agents hide the heterogeneous fuzzy query interfaces of the local database systems from the fuzzy query mediator. The distributed fuzzy query processing steps designed for global FTS-SQL queries can be described briefly as follows (a skeleton sketch follows the steps below):
1. Global FTS-SQL queries are parsed to ensure that they are syntactically correct. Based on the parse trees constructed, the queries are validated against the global schema to ensure that all relations and attributes in the queries exist and are properly used.
2. Given a global FTS-SQL fuzzy query, the fuzzy query mediator decomposes it into sub-queries to be evaluated by the fuzzy query agents. Here the local databases involved in the global FTS-SQL fuzzy query are determined. Some fuzzy query optimization heuristics are introduced to reduce the processing overhead; similar strategies have been adopted for optimizing queries in other multidatabase systems (Evrendilek, Dogac, Nural and Ozcan, 1997; Finance, Smahi and Fessy, 1995).
3. The decomposed sub-queries are disseminated to the appropriate fuzzy query agents for execution. Fuzzy query agents further translate the sub-queries into local database queries and return the sub-fuzzy query results to the fuzzy query mediator.
4. The fuzzy query mediator assembles the sub-fuzzy query results and computes the final fuzzy query result if there exist some sub-fuzzy query operations that could not be performed by the fuzzy query agents.
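The following skeleton is an assumed structure for illustration only, not the chapter's implementation, of the mediator/agent flow just described: agents are modelled as callables per local database, sub-queries run in parallel, and the mediator applies the residue when assembling.

from concurrent.futures import ThreadPoolExecutor

def process_global_query(query, agents, decompose, assemble):
    """agents: {dbid: callable(subquery) -> local result}."""
    subqueries = decompose(query)          # one sub-query per relevant DB
    with ThreadPoolExecutor() as pool:     # sub-queries execute in parallel
        futures = {db: pool.submit(agents[db], sq)
                   for db, sq in subqueries.items()}
        partials = {db: f.result() for db, f in futures.items()}
    return assemble(query, partials)       # residue evaluated at the mediator

# Usage with trivial stubs: every agent just labels its answer.
agents = {"DB1": lambda q: [("DB1", q)], "DB2": lambda q: [("DB2", q)]}
decompose = lambda q: {db: q for db in agents}        # broadcast the sub-query
assemble = lambda q, parts: sum(parts.values(), [])   # plain union of results
print(process_global_query("SELECT Dname FROM Dept", agents, decompose, assemble))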
The WHERE clause with the source option sameDB allows the join of tuples from the same local database only, and with the source option anyDB it allows the join of tuples from any local database. Based on this join definition, we derive the decomposition strategies for the following two categories of FTS-SQL queries.

FTS-SQL queries with WHERE < >[sameDB]

As per this strategy, we decompose a global fuzzy query into a sub-fuzzy query template and a global fuzzy query residue. The sub-fuzzy query template is the sub-fuzzy query generated based on the global schema; it has to be further translated into sub-queries on the export schemas of the local fuzzy databases relevant to the global fuzzy query. The global fuzzy query residue represents the remaining global fuzzy query operations that have to be handled by the fuzzy query mediator. Since sameDB is the source option of the WHERE clause, all selection and join predicates on the global FTS relation(s) can be performed by the fuzzy query agents together with their local database systems. The sub-fuzzy query template and the global fuzzy query residue are derived from a global fuzzy query as follows (a toy sketch follows this list):
1. The SELECT clause of the sub-fuzzy query template is assigned the list of attributes that appear in the SELECT clause of the global fuzzy query, including those which appear in aggregate functions.
2. The FROM clause of the sub-fuzzy query template is assigned the global FTS relations that appear in the FROM clause of the global fuzzy query.
3. The selection and join predicates in the WHERE clause of the global fuzzy query are moved to the WHERE clause of the sub-fuzzy query template.
4. The global fuzzy query residue inherits the SELECT clause of the original global fuzzy query. Its FROM clause is defined by the union of the sub-fuzzy query results. In other words, the only operations to be performed by the global fuzzy query residue are projections. The WITH clause is retained in the global fuzzy query residue and performed in the last phase of the fuzzy query processing.
5. The values of $\alpha$ and $\beta$ in the HAVING clause of the global fuzzy query are assigned to the HAVING clauses of all the sub-queries.
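To make the five steps concrete, the toy sketch below uses an assumed dictionary representation of queries (not the chapter's implementation) to split a global fuzzy query into the shared sub-fuzzy query template and the projection-only residue.

def decompose_samedb(global_query):
    template = {
        "select": global_query["select"],        # step 1
        "from":   global_query["from"],          # step 2
        "where":  global_query["where"],         # step 3: predicates pushed down
        "having": global_query["having"],        # step 5: alpha, beta propagated
    }
    residue = {
        "select": global_query["select"],        # step 4: projection only,
        "from":   "union of sub-query results",  # applied over the union
        "with":   global_query.get("with"),      # WITH handled last
    }
    return template, residue

q = {"select": ["Dname"], "from": ["Dept"],
     "where": "Fund EQ .66/mod [sameDB]", "having": {"alpha": 0.8, "beta": 0.6}}
template, residue = decompose_samedb(q)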
Figure 5.
Example: Consider the global fuzzy database given in Table 5, whose component local fuzzy databases are given in Table 4. In the fuzzy query Qa (see Figure 5), we show that the join predicate ($T_1.Name\ EQ_\alpha\ T_2.HoD$) and the selection predicate ($T_2.Fund\ EQ_\alpha\ .66/mod$) have been propagated to the sub-fuzzy query template during the decomposition of a global fuzzy query formulated using FTS-SQL. Having performed a union of the sub-fuzzy query results returned by the fuzzy query agents, a final projection operation on the union result is required, as specified in the global fuzzy query residue.

FTS-SQL queries with WHERE < >[anyDB]

This strategy generates one global fuzzy query residue and multiple sub-fuzzy query templates, one for each global relation involved in the global fuzzy query. In other words, a global fuzzy query with n relations in its FROM clause will be decomposed into n sub-fuzzy query templates. This is necessary because the join predicates in the global fuzzy query cannot be propagated to the sub-queries.

Figure 6.

The sequential steps to derive the sub-fuzzy query templates and the global fuzzy query residue from a global fuzzy query are given below (a companion sketch follows this list):
1. For each global FTS relation R involved in the FROM clause, we generate its corresponding sub-queries as follows:
   a. The SELECT clause of the sub-fuzzy query template is assigned the list of R's attributes that appear in the SELECT clause or join predicates of the global fuzzy query, including those which appear in aggregate functions.
   b. The selection and join predicates using R's attributes in the global fuzzy query are propagated to the sub-fuzzy query template.
   c. The FROM clause of the sub-fuzzy query template is assigned R.
2. For the inter-global-relation join predicates in the WHERE clause of the global fuzzy query, the projections are retained in the WHERE clause of the global fuzzy query residue. The clause WITH SOURCE CONTEXT is also retained, but processed last.
3. The values of $\alpha$ and $\beta$ in the HAVING clause of the global fuzzy query are assigned to the HAVING clauses of all the sub-queries.
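A companion toy sketch for the anyDB strategy follows, again using an assumed dictionary representation rather than the chapter's implementation: one sub-query template is produced per global relation, and the join is deferred to the mediator's residue.

def decompose_anydb(global_query):
    templates = {}
    for rel, preds in global_query["per_relation"].items():    # step 1
        templates[rel] = {
            "select": preds["attrs"],          # 1a: attrs used in SELECT or joins
            "where":  preds["local_where"],    # 1b: single-relation predicates only
            "from":   rel,                     # 1c
            "having": global_query["having"],  # step 3: alpha, beta propagated
        }
    residue = {"where": global_query["join_where"],  # step 2: joins at mediator
               "select": global_query["select"]}
    return templates, residue

q = {
    "select": ["Dname", "Name"],
    "per_relation": {
        "Dept": {"attrs": ["Dname", "HoD"], "local_where": "Fund EQ .66/mod"},
        "Emp":  {"attrs": ["Name"],         "local_where": None},
    },
    "join_where": "Emp.Name EQ Dept.HoD",
    "having": {"alpha": 0.8, "beta": 0.6},
}
templates, residue = decompose_anydb(q)   # two templates, one join residue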
Example: Consider again a fuzzy query similar to that in the above example, but with the anyDB option attached to the WHERE clause. Thus a new global fuzzy query Qb has been formulated, which is decomposed as shown in Figure 6. It can be shown that the join predicate $T_1.Name\ EQ_\alpha\ T_2.HoD$ cannot be evaluated before the two global FTS relations Dept and Emp are derived. Nevertheless, the selection predicate $T_2.Fund\ EQ_\alpha\ .66/mod$ can still be propagated to the sub-queries for the local relations corresponding to Dept. Having performed unions of the sub-fuzzy query results to construct the global relations Dept and Emp, a final join and projection of the global relations is required, as specified in the global fuzzy query residue.
A global fuzzy relational operation $op_G$ (assume it is m-ary) is consistent with respect to the export fuzzy relational operations if

$$op_G(G_1, \ldots, G_m) = merge\left(op_1(L_{11}, \ldots, L_{1m}), \ldots, op_n(L_{n1}, \ldots, L_{nm})\right)$$

where $G_i = merge(L_{1i}, \ldots, L_{ni})$ (i.e., the global fuzzy relation $G_i$ is derived by combining the export fuzzy relations $L_{1i}, \ldots, L_{ni}$ from $DB_1, \ldots, DB_n$ respectively). We assume that a $DB_k$ that has no export fuzzy relation for deriving $G_i$ has $L_{ki} = \emptyset$, and that $op_\emptyset$ is an operation that returns an empty relation for any given input export fuzzy relation. The mapping function opMap from global fuzzy relational operations to export fuzzy relational operations is shown below:

$opMap(\sigma^{fs}_{p \wedge (s \in DBset), \alpha, \beta}) = (op_1, \ldots, op_n)$, where $op_i = \sigma_{p,\alpha,\beta}$ if $DB_i \in DBset$ and $op_i = op_\emptyset$ otherwise;
$opMap(\pi^{fs\,sameDB}_{A,\alpha,\beta}) = (\pi_{A,\alpha,\beta}, \ldots, \pi_{A,\alpha,\beta})$;
$opMap(\bowtie^{fs\,sameDB}_{p,\alpha,\beta}) = (\bowtie_{p,\alpha,\beta}, \ldots, \bowtie_{p,\alpha,\beta})$;
$opMap(\cup^{fs\,sameDB}_{\alpha,\beta}) = (\cup_{\alpha,\beta}, \ldots, \cup_{\alpha,\beta})$;
$opMap(\cap^{fs\,sameDB}_{\alpha,\beta}) = (\cap_{\alpha,\beta}, \ldots, \cap_{\alpha,\beta})$;
$opMap(-^{fs\,sameDB}_{\alpha,\beta}) = (-_{\alpha,\beta}, \ldots, -_{\alpha,\beta})$.

All these mapping functions shall be used to prove the consistency of the FTS relational operations with the sameDB option, as given below.

Lemma 1.1: $\forall L_{1j}, \ldots, L_{nj}$, $G_j = merge(L_{1j}, \ldots, L_{nj})$:

$$\sigma^{fs}_{p \wedge (s \in \{DB_i\}), \alpha, \beta}(G_j) = merge\left(op_\emptyset(L_{1j}), \ldots, \sigma_{p,\alpha,\beta}(L_{ij}), \ldots, op_\emptyset(L_{nj})\right).$$

Proof: $g \in \sigma^{fs}_{p \wedge (s \in \{DB_i\}), \alpha, \beta}(G_j) \Leftrightarrow g.A \in \sigma_{p,\alpha,\beta}(L_{ij}) \wedge g.s = DB_i \Leftrightarrow g \in merge(op_\emptyset(L_{1j}), \ldots, \sigma_{p,\alpha,\beta}(L_{ij}), \ldots, op_\emptyset(L_{nj}))$.
In the above proof, the source predicate constrains the result of the global fuzzy selection to contain only tuples with $DB_i$ as their source value. This implies that the tuple attributes come from an export fuzzy relation in $DB_i$. Hence a global fuzzy selection operation produces a result identical to that produced by first performing a selection on the export fuzzy relation from $DB_i$ and then making the fuzzy tuples global using the merge operation.
Lemma 1.2: $\forall L_{1j}, \ldots, L_{nj}$, $G_j = merge(L_{1j}, \ldots, L_{nj})$:

$$\sigma^{fs}_{p,\alpha,\beta}(G_j) = merge\left(\sigma_{p,\alpha,\beta}(L_{1j}), \ldots, \sigma_{p,\alpha,\beta}(L_{nj})\right).$$

Proof: $g \in \sigma^{fs}_{p,\alpha,\beta}(G_j) \Leftrightarrow \exists i \in \{1, \ldots, n\},\ g.A \in \sigma_{p,\alpha,\beta}(L_{ij})$, where $A = Attr(G_j) = Attr(L_{ij})$ $\Leftrightarrow g \in merge(\sigma_{p,\alpha,\beta}(L_{1j}), \ldots, \sigma_{p,\alpha,\beta}(L_{nj}))$.

A global projection operation with the sameDB option is equivalent to first projecting the required attributes from the export relations and then integrating the projected export relations. Hence the global FTS project operation is consistent, as given below.

Lemma 1.4: $\forall L_{1j}, \ldots, L_{nj}$, $G_j = merge(L_{1j}, \ldots, L_{nj})$:

$$\pi^{fs\,sameDB}_{A,\alpha,\beta}(G_j) = merge\left(\pi_{A,\alpha,\beta}(L_{1j}), \ldots, \pi_{A,\alpha,\beta}(L_{nj})\right).$$

Proof: Let $G = \pi^{fs\,sameDB}_{A,\alpha,\beta}(G_j)$ and $L = \pi_{A,\alpha,\beta}(L_{ij})$. Then $g \in G \wedge g.s = DB_i$ for some $i \in \{1, \ldots, n\}$ $\Leftrightarrow g.A \in \pi_{A,\alpha,\beta}(L_{ij})$ for some $i$ $\Leftrightarrow g = (g.A, \mu_L(g.A), DB_i) \wedge \mu_L(g.A) \geq \beta$ for some $i$ $\Leftrightarrow g \in merge(\pi_{A,\alpha,\beta}(L_{1j}), \ldots, \pi_{A,\alpha,\beta}(L_{nj}))$.

Using similar proof techniques, we have proved that the global FTS join, union, intersection, and minus operations are also consistent with the respective fuzzy relational operations on the export fuzzy relations, as given below.
Lemma 1.5: $\forall L_{1j}, \ldots, L_{nj}, L_{1k}, \ldots, L_{nk}$, and $G_j = merge(L_{1j}, \ldots, L_{nj})$, $G_k = merge(L_{1k}, \ldots, L_{nk})$:

$$G_j \bowtie^{fs\,sameDB}_{p,\alpha,\beta} G_k = merge\left(L_{1j} \bowtie_{p,\alpha,\beta} L_{1k}, \ldots, L_{nj} \bowtie_{p,\alpha,\beta} L_{nk}\right).$$

Proof: $g \in (G_j \bowtie^{fs\,sameDB}_{p,\alpha,\beta} G_k) \wedge g.s = DB_i$ for some $i \in \{1, \ldots, n\}$
$\Leftrightarrow ((g.A_j, \mu_{G_j}(g.A_j), DB_i) \in G_j) \wedge ((g.A_k, \mu_{G_k}(g.A_k), DB_i) \in G_k) \wedge p_\alpha(g.A_j, g.A_k)$
$\Leftrightarrow (g.A_j \in L_{ij} \wedge \mu_{L_{ij}}(g.A_j) \geq \beta) \wedge (g.A_k \in L_{ik} \wedge \mu_{L_{ik}}(g.A_k) \geq \beta) \wedge p_\alpha(g.A_j, g.A_k)$ for some $i$
$\Leftrightarrow ((g.A_j, \mu_{L_{ij}}(g.A_j)), (g.A_k, \mu_{L_{ik}}(g.A_k))) \in (L_{ij} \bowtie_{p,\alpha,\beta} L_{ik})$ for some $i$
$\Leftrightarrow g = ((g.A_j, g.A_k), \min(\mu_{L_{ij}}(g.A_j), \mu_{L_{ik}}(g.A_k)), DB_i) \in merge(L_{1j} \bowtie_{p,\alpha,\beta} L_{1k}, \ldots, L_{nj} \bowtie_{p,\alpha,\beta} L_{nk})$.

Lemma 1.6: Under the same premises, $G_j \cup^{fs\,sameDB}_{\alpha,\beta} G_k = merge(L_{1j} \cup_{\alpha,\beta} L_{1k}, \ldots, L_{nj} \cup_{\alpha,\beta} L_{nk})$.

Proof: Let $G = (G_j \cup^{fs\,sameDB}_{\alpha,\beta} G_k)$ and $L = (L_{ij} \cup_{\alpha,\beta} L_{ik})$. Then $g \in G \wedge g.s = DB_i$ for some $i \in \{1, \ldots, n\}$
$\Leftrightarrow \left( (EQ_\alpha(g.A, g_{G_j}.A) \wedge \mu_G(g.A) = \max(\mu_G(g.A), \mu_{G_j}(g_{G_j}.A)) \geq \beta) \vee (EQ_\alpha(g.A, g_{G_k}.A) \wedge \mu_G(g.A) = \max(\mu_G(g.A), \mu_{G_k}(g_{G_k}.A)) \geq \beta) \right) \wedge g.s = DB_i$
$\Leftrightarrow (EQ_\alpha(g.A, g_{L_{ij}}.A) \wedge \mu_L(g.A) = \max(\mu_L(g.A), \mu_{L_{ij}}(g_{L_{ij}}.A)) \geq \beta) \vee (EQ_\alpha(g.A, g_{L_{ik}}.A) \wedge \mu_L(g.A) = \max(\mu_L(g.A), \mu_{L_{ik}}(g_{L_{ik}}.A)) \geq \beta)$ for some $i$
$\Leftrightarrow g.A \in (L_{ij} \cup_{\alpha,\beta} L_{ik})$ for some $i$ $\Leftrightarrow g \in merge(L_{1j} \cup_{\alpha,\beta} L_{1k}, \ldots, L_{nj} \cup_{\alpha,\beta} L_{nk})$.

Lemma 1.7: Under the same premises, $G_j \cap^{fs\,sameDB}_{\alpha,\beta} G_k = merge(L_{1j} \cap_{\alpha,\beta} L_{1k}, \ldots, L_{nj} \cap_{\alpha,\beta} L_{nk})$.

Proof: Analogous to the proof of Lemma 1.6, with the two resemblance conditions conjoined instead of disjoined.

Lemma 1.8: Under the same premises, $G_j -^{fs\,sameDB}_{\alpha,\beta} G_k = merge(L_{1j} -_{\alpha,\beta} L_{1k}, \ldots, L_{nj} -_{\alpha,\beta} L_{nk})$.

Proof: Analogous to the proof of Lemma 1.6, with the condition on $G_k$ (respectively $L_{ik}$) negated: $EQ_\alpha(g.A, g_{G_k}.A) < \alpha$ or $\max(\mu_G(g.A), \mu_{G_k}(g_{G_k}.A)) < \beta$.
Having proved the consistency of the above FTS relational operations, the following corollary becomes apparent.

Corollary: The set of FTS relational operations $\{\sigma^{fs}, \pi^{fs\,sameDB}, \bowtie^{fs\,sameDB}, \cup^{fs\,sameDB}, \cap^{fs\,sameDB}, -^{fs\,sameDB}\}$ is consistent with respect to the export fuzzy relational operations.
Theorem: $\cup^{fs\,sameDB}_{\alpha,\beta}$ is associative, i.e., $(R \cup^{fs\,sameDB}_{\alpha,\beta} S) \cup^{fs\,sameDB}_{\alpha,\beta} T = R \cup^{fs\,sameDB}_{\alpha,\beta} (S \cup^{fs\,sameDB}_{\alpha,\beta} T)$.

Proof: Let $X = (R \cup^{fs\,sameDB}_{\alpha,\beta} S)$, $Y = (S \cup^{fs\,sameDB}_{\alpha,\beta} T)$, and $Z = (X \cup^{fs\,sameDB}_{\alpha,\beta} T)$. Hence,
$(t.A, \mu_Z(t.A), t.s) \in Z \Leftrightarrow (t.A, \mu_Z(t.A), t.s) \in (X \cup^{fs\,sameDB}_{\alpha,\beta} T)$
$\Leftrightarrow (EQ_\alpha(t.A, t_X.A) \wedge \mu_Z(t.A) = \max(\mu_Z(t.A), \mu_X(t_X.A)) \wedge t.s = t_X.s) \vee (EQ_\alpha(t.A, t_T.A) \wedge \mu_Z(t.A) = \max(\mu_Z(t.A), \mu_T(t_T.A)) \wedge t.s = t_T.s)$
$\Leftrightarrow (EQ_\alpha(t.A, t_R.A) \wedge \mu_Z(t.A) = \max(\mu_Z(t.A), \mu_R(t_R.A)) \wedge t.s = t_R.s) \vee (EQ_\alpha(t.A, t_S.A) \wedge \mu_Z(t.A) = \max(\mu_Z(t.A), \mu_S(t_S.A)) \wedge t.s = t_S.s) \vee (EQ_\alpha(t.A, t_T.A) \wedge \mu_Z(t.A) = \max(\mu_Z(t.A), \mu_T(t_T.A)) \wedge t.s = t_T.s)$
$\Leftrightarrow (EQ_\alpha(t.A, t_R.A) \wedge \mu_Z(t.A) = \max(\mu_Z(t.A), \mu_R(t_R.A)) \wedge t.s = t_R.s) \vee (EQ_\alpha(t.A, t_Y.A) \wedge \mu_Z(t.A) = \max(\mu_Z(t.A), \mu_Y(t_Y.A)) \wedge t.s = t_Y.s)$
$\Leftrightarrow (t.A, \mu_Z(t.A), t.s) \in (R \cup^{fs\,sameDB}_{\alpha,\beta} Y) \Leftrightarrow (t.A, \mu_Z(t.A), t.s) \in (R \cup^{fs\,sameDB}_{\alpha,\beta} (S \cup^{fs\,sameDB}_{\alpha,\beta} T))$.

Theorem: $\cup^{fs\,anyDB}_{\alpha,\beta}$ is associative, i.e., $Z = (R \cup^{fs\,anyDB}_{\alpha,\beta} S) \cup^{fs\,anyDB}_{\alpha,\beta} T = R \cup^{fs\,anyDB}_{\alpha,\beta} (S \cup^{fs\,anyDB}_{\alpha,\beta} T)$.

Proof: A tuple in the result of $\cup^{fs\,anyDB}_{\alpha,\beta}$ can have either a non-'*' or a '*' source value.

Case 1 (resultant tuple with a non-'*' source value, say s): It can be shown, using a procedure similar to that of the previous proof, that $(t.A, \mu_Z(t.A), s) \in (R \cup^{fs\,anyDB}_{\alpha,\beta} S) \cup^{fs\,anyDB}_{\alpha,\beta} T \Leftrightarrow (t.A, \mu_Z(t.A), s) \in R \cup^{fs\,anyDB}_{\alpha,\beta} (S \cup^{fs\,anyDB}_{\alpha,\beta} T)$.

Case 2 (resultant tuple with '*' as the value of its source attribute): Let R, S and T be FTS relations of the same arity, where the domains of the i-th attributes of R, S and T coincide. Let $X = (R \cup^{fs\,anyDB}_{\alpha,\beta} S)$, $Y = (S \cup^{fs\,anyDB}_{\alpha,\beta} T)$, and $Z = (X \cup^{fs\,anyDB}_{\alpha,\beta} T)$. Hence,
$(t.A, \mu_Z(t.A), *) \in Z \Leftrightarrow t.A \in \left((\pi_{A,\alpha,\beta} R) \cup (\pi_{A,\alpha,\beta} S) \cup (\pi_{A,\alpha,\beta} T)\right) \wedge \mu_Z(t.A) \geq \beta \wedge \neg(t_R.s = t_S.s = t_T.s)$
$\Leftrightarrow t.A \in \left((\pi_{A,\alpha,\beta} R) \cup (\pi_{A,\alpha,\beta} Y)\right) \wedge \mu_Z(t.A) \geq \beta \wedge \neg(t_R.s = t_Y.s)$
$\Leftrightarrow (t.A, \mu_Z(t.A), *) \in R \cup^{fs\,anyDB}_{\alpha,\beta} (S \cup^{fs\,anyDB}_{\alpha,\beta} T)$.

Theorem: $(R \cup^{fs\,sameDB}_{\alpha,\beta} S) \cup^{fs\,anyDB}_{\alpha,\beta} T \neq R \cup^{fs\,sameDB}_{\alpha,\beta} (S \cup^{fs\,anyDB}_{\alpha,\beta} T)$.

Proof: This can be proved using a counterexample, given in Figure 7. The properties related to the other FTS relational operations can be proved similarly.
CONCLUSION
In real-life multidatabases it is not always desirable to perform complete instance integration, as there are global users/applications that require a global schema only to query the local databases. For them it is important to identify the source of instances in order to make decisions. Hence, in this work, we have described an extended relational model in the context of fuzzy multidatabases, with level-0 instance integration performed on export fuzzy relations. While integrating instances, semantic conflicts are resolved suitably and information about the source database identity is attached to each of the resulting fuzzy tuples. A set of such fuzzy tuples with source information attached to them is called a Fuzzy Tuple Source (FTS) relation. A set of such FTS relations forms a fuzzy multidatabase of type-2 under the FTS relational model as per our proposal. We have proposed and implemented a full set of FTS relational algebraic operations capable of manipulating an extensive set of fuzzy relational multidatabases of type-2 that include fuzzy data values in their instances. In this chapter we have also proposed a fuzzy query language, FTS-SQL, to formulate a global fuzzy query on a fuzzy relational multidatabase of type-2 under the FTS relational model. FTS relational operations operate on FTS relations to produce a resultant FTS relation. We have also provided an architecture for distributed fuzzy query processing with a strategy for fuzzy query decomposition and optimization. Sub-queries obtained as a result of fuzzy query decomposition can be processed in parallel at the respective local fuzzy databases, which effectively reduces the fuzzy query processing time. We have proved the correctness of the FTS relational data model by showing that the FTS relational operations are consistent with the fuzzy relational operations on the export fuzzy relations of the component local fuzzy relational databases. Finally, we have described some algebraic properties of the FTS relational model that may help the fuzzy query processor transform a relational expression into an algebraically equivalent expression that requires the least evaluation cost in terms of disk space and communication overhead.
REFERENCES
Agrawal, S., Keller, A. M., Wiederhold, G., & Saraswat, K. (1995). Flexible relation: An approach for integrating data from multiple, possibly inconsistent databases. In Proc. Intl. Conf. on Data Engineering, (pp. 495-504).

Batini, C., Lenzerini, M., & Navade, S. B. (1986). A comparative analysis of methodologies for database schema integration. ACM Computing Surveys, 18(4), 323-364.
Buckles, B. P., & Petry, F. E. (1983). Information-theoretical characterization of fuzzy relational databases. IEEE Transactions on Systems, Man, and Cybernetics, SMC-13(1), 74-77.

Chiang, R. H. L., Barron, T. M., & Storey, V. C. (1994). Reverse engineering of relational databases: Extraction of an EER model from a relational database. Data and Knowledge Engineering, 12(2), 107-142.

Clements, D., Ganesh, M., Hwang, S.-Y., Lim, E.-P., Mediratta, K., Srivastava, J., Stenoein, J., & Yang, H. (1994). Myriad: Design and implementation of a federated database prototype. In Proc. ACM SIGMOD Conf., (p. 518).

Chen, P. (1976). The entity-relationship model - Toward a unified view of data. ACM Transactions on Database Systems, 1(1), 9-36.

Codd, E. F. (1970). A relational model of data for large shared data banks. Communications of the ACM, 13(6), 377-387.

Codd, E. F. (1971). Further normalization of the database relational model. In R. Rustin (Ed.), Database Systems, Courant Computer Science Symposia 6 (pp. 65-98). Englewood Cliffs, NJ: Prentice Hall.

Codd, E. F. (1972). Relational completeness of database sublanguages. In R. Rustin (Ed.), Database Systems. New Jersey: Prentice-Hall.

Chatterjee, A., & Segev, A. (1991). Data manipulation in heterogeneous databases. SIGMOD Record, 20(4), 64-68.

DeMichiel, L. G. (1989). Resolving database incompatibility: An approach to performing relational operations over mismatched domains. IEEE Transactions on Knowledge and Data Engineering, 1(4), 485-493. doi:10.1109/69.43423

Evrendilek, C., Dogac, A., Nural, S., & Ozcan, F. (1997). Query optimization in multidatabase systems. Journal of Distributed and Parallel Databases, 5(1), 77-114.

Fahrner, C., & Vossen, G. (1995). A survey of database design transformations based on the entity-relationship model. Data & Knowledge Engineering, 15(3), 213-250.

Finance, B., Smahi, V., & Fessy, J. (1995). Query processing in IRO-DB. In Proc. 4th Intl. Conf. on Deductive and Object-Oriented Databases (DOOD'95), Singapore, (pp. 299-318).

Galindo, J., Medina, J. M., Pons, O., & Cubero, J. C. (1998). A server for fuzzy SQL queries. In T. Andreasen, H. Christiansen, & H. L. Larsen (Eds.), Flexible Query Answering Systems, Lecture Notes in Artificial Intelligence (LNAI) 1495 (pp. 164-174). Springer.

Galindo, J., Medina, J. M., Cubero, J. C., & García, M. T. (2001). Relaxing the universal quantifier of the division in fuzzy relational databases. International Journal of Intelligent Systems, 16(6), 713-742.

Hayne, S., & Ram, S. (1990). Multi-user view integration system (MUVIS): An expert system for view integration. In Proc. Intl. Conf. on Data Engineering, (pp. 402-409).
Grant, J., Litwin, W., Roussopoulos, N., & Sellis, T. (1993). Query languages for relational multidatabases. The VLDB Journal, 2(2), 153-172.

Lavariega, J. C., & Urban, S. D. (2002). An object algebra approach to multidatabase query decomposition in Donají. Distributed and Parallel Databases, 12(1), 27-71.

Kandel, A. (1986). Fuzzy mathematical techniques with applications. California: Addison-Wesley.

Kaufman, A. (1975). Introduction to the theory of fuzzy subsets, Vol. I. New York: Academic Press.

Kaul, M., Drosten, K., & Neuhold, E. J. (1990). Integrating heterogeneous information bases by object-oriented views. In Proc. Intl. Conf. on Data Engineering, (pp. 2-10).

Litwin, W., & Abdellatif, A. (1986). Multidatabase interoperability. IEEE Computer, 19(12), 10-18.

Lakshman, L. V. S., Saderi, F., & Subramanian, L. N. (1996). SchemaSQL - A language for interoperability in relational multidatabase systems. In Proc. Intl. Conf. on Very Large Databases, (pp. 239-250).

Meier, A., et al. (1994). Hierarchical to relational database migration. IEEE Software, 21-27.

Larson, J. A., Navade, S. B., & Elmasari, R. (1989). A theory of attribute equivalence in databases with application to schema integration. IEEE Transactions on Software Engineering, 15(4), 449-463.

Liu, L., Pu, C., & Lee, Y. (1996). An adaptive approach to query mediation across heterogeneous databases. In Proc. Intl. Conf. on Cooperative Information Systems, (pp. 144-156).

Lim, E.-P., Srivastava, J., & Shekhar, S. (1994). Resolving attribute incompatibility in database integration: An evidential reasoning approach. In Proc. Intl. Conf. on Data Engineering, (pp. 154-163).

Litwin, W., et al. (1982). SIRIUS systems for distributed data management. In H.-J. Schneider (Ed.), Distributed Databases. New York: North-Holland.

Lim, E.-P., Chiang, R. H. L., & Cao, Y. (1999). Tuple source relational model: A source-aware data model for multidatabases. Data and Knowledge Engineering, 29, 83-114.

Lim, E.-P., Srivastava, J., Prabhakar, S., & Richardson, J. (1993). Entity identification problem in database integration. In Proc. Intl. Conf. on Data Engineering, (pp. 294-301).

Litwin, W., Abdellatif, A., Zeroual, A., & Nicolas, B. (1989). MSQL: A multidatabase language. Information Sciences, 49, 59-101.

Lee, J.-O., & Baik, D.-K. (1999). SemQL: A semantic query language for multidatabase systems. In Proceedings of the Eighth International Conference on Information and Knowledge Management, United States, (pp. 259-266).

Ma, Z. M., Zhang, W., & Ma, W. (2000). Semantic conflicts and solutions in fuzzy multidatabase systems. DNIS, (pp. 80-90).
216
Prade, H., Testemale, C. (1984). Generalizing Database Relational Algebra for the Treatment of Incomplete and Uncertain Information and Vague Queries. Information Science, 115143. Rundensteiner, E. A., Hawkes, L. W., & Bandler, W. (1989). On Nearness Measures in Fuzzy Relational Data Models, International Journal of Approximate Reasoning, (3), 267-298. Raju, K.V.S.V.N., Majumdar, A.K. (1986). Fuzzy functional dependencies in fuzzy relations, In Proc. Second Intl. Conf. on Data Engg., Los Angeles, California, pp 312-319. Saltor, F., Castellanos, M., & Garcia-Solaco, M. (1991). Suitability of data models as canonical models for federated databases, SIGMOD Record, 20(4), pp 44-48. Seth, A.P., Larson, J.A. (1990). Federated database systems for managing distributed heterogeneous and autonomous databases. ACM Computing Surveys, 22(3), 183236. Sharma, A. K., Goswami, A., & Gupta, D. K. (2004). Fuzzy Inclusion Dependencies in Fuzzy Relational Databases, In Proceedings of International Conference on Information Technology: Coding and Computing (ITCC 2004), Las Vegas, USA, IEEE Computer Society Press, USA, Volum-1, pp 507-510. Sharma, A. K., Goswami, A., & Gupta, D. K. (2008). Fuzzy Inclusion Dependencies in Fuzzy Databases. In Galindo, J. (Ed.), Handbook of Research on Fuzzy Information Processing in Databases (pp. 657683). Hershey, PA, USA: Information Science Reference. Spaccapietra, S., Parent, C., & Dupont, Y. (1992). Model independent assertions for integration of heterogeneous schemas, Very Large Database Journal, l(l), pp 81-126. Tasi, P.S.M., Chen, A.L.P. (1993), Querying uncertain data in heterogeneous databases, In: Proc. RIDEIMS Conf., pp 161-168. Wang, Y. R., & Madnick, S. E. (1989), The inter-database instance identication problem in integrating autonomous systems, In: Proc. Intl. Conf. on Data Engineering, pp 46-55. Wang, Y.R., Madnick, S.E., R. Wang, S. Madnick. (1990), A polygen model for heterogeneous database systems: The source tagging perspective, In: Proc. Intl. Conf. on Very Large Data Bases, pp 519-538. Zadeh, L. A. (1965). Fuzzy Sets, Information and Control, 8, pp 338-353. Zaniolo C. (1979). Design of relational views over network schemas, In: Proc. ACM SIGMOD Intl. Conf. on Management of Data, pp 179-190. Zemankova, M., & Kandel, A. (1984). Fuzzy Relational Database-A Key to Expert Systems, Verlag, TUV Rheinland, Cologne. Zemankova, M., Kandel, A. (1985). Implementing Imprecision in Information Systems. Information Sciences, 37(3), 107141. doi:10.1016/0020-0255(85)90008-8 Zhang, W., Laun, E., & Meng, W. (1997). A methodology of integrating fuzzy relational database in multidatabase systems, In: Proc, Intl. Conf. on Database Systems for Advance Applications. Zhao, J.L., Segev, A., Chatterjee, A. (1995). A universal relation approach to federated database management, In: Proc, Intl. Conf. on Data Engineering, pp 261-270.
From the definition of the Cartesian product of fuzzy sets, dom(A1) × dom(A2) × ... × dom(An) is a fuzzy subset of U* = U1 × U2 × ... × Un. Hence a type-1 fuzzy relation r is also a fuzzy subset of U*, with membership function μr. Fuzzy Relational Database of Type-2: A type-2 fuzzy relation r is a fuzzy subset of D, where μr: D → [0, 1] must satisfy the condition μr(t) ≤ max_{(u1, u2, ..., un) ∈ U*} min{μ_a1(u1), μ_a2(u2), ..., μ_an(un)}, where t = (a1, a2, ..., an) ∈ D. Fuzzy Value Equivalent (FVEQ): Let A and B be two fuzzy sets with membership functions μA and μB, respectively. A fuzzy value a ∈ A is said to be equivalent to some other fuzzy value b ∈ B iff b ≥ μB(x), for some x ∈ S, where S is the set of crisp values that are returned by μA^{-1}(a),
GT: Let P1 be a proximity relation defined over U. The fuzzy relational operator GT is defined to be a fuzzy subset of U × U, where μGT satisfies the following property for all u1, u2 ∈ U:
μGT(u1, u2) = 0 if u1 ≤ u2, and μGT(u1, u2) = μP1(u1, u2) otherwise.
LT: Let P2 be a proximity relation defined over a universe of discourse U. The fuzzy relational operator LT is defined to be a fuzzy subset of U × U, where μLT satisfies the following property for all u1, u2 ∈ U:
μLT(u1, u2) = 0 if u1 ≥ u2, and μLT(u1, u2) = μP2(u1, u2) otherwise.
NEQ, GOE, LOE: Membership functions of the fuzzy relations 'NOT EQUAL' (NEQ), 'GREATER THAN OR EQUAL' (GOE) and 'LESS THAN OR EQUAL' (LOE) can be defined based on those of EQ, GT and LT as follows:
μNEQ(u1, u2) = 1 − μEQ(u1, u2),
μGOE(u1, u2) = max[μGT(u1, u2), μEQ(u1, u2)],
μLOE(u1, u2) = max[μLT(u1, u2), μEQ(u1, u2)].
α-Cut: Given a fuzzy set A defined on U and any number α ∈ [0, 1], the α-cut αA and the strong α-cut α+A are the crisp sets αA = {u | A(u) ≥ α} and α+A = {u | A(u) > α}.
α-Resemblance: Given a set U with a resemblance relation EQ as previously defined, ⟨U, EQ⟩ is called a resemblance space. The α-level set EQα induced by EQ is termed an α-resemblance set. Two values x, y ∈ U that resemble each other with a degree larger than or equal to α (i.e., μEQ(x, y) ≥ α) are said to be α-resemblant, written x EQα y. A set P ⊆ U is called an α-preclass on ⟨U, EQ⟩ if, for all x, y ∈ P, x and y are α-resemblant (i.e., x EQα y holds).
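As a small worked illustration of the α-cut definitions (the fuzzy set used here is invented for the example): for A = {0.2/a, 0.7/b, 1.0/c} on U = {a, b, c} and α = 0.7, the α-cut is αA = {b, c} (membership at least 0.7), while the strong α-cut is α+A = {c} (membership strictly greater than 0.7).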
Section 2
Pattern-Based Schema Mapping and Query Answering in Peer-to-Peer XML Data Integration System
Tadeusz Pankowski Poznan University of Technology, Poland
Chapter 9
ABSTRACT
This chapter addresses the problem of data integration in a P2P environment, where each peer stores the schema of its local data, mappings between the schemas, and some schema constraints. The goal of the integration is to answer queries formulated against a chosen peer. The answer must consist of data stored in the queried peer as well as data of its direct and indirect partners. The chapter focuses on defining and using mappings, schema constraints, query propagation across the P2P system, and query answering in such a scenario. Schemas, mappings, constraints (functional dependencies) and queries are all expressed using a unified approach based on tree-pattern formulas. The chapter discusses how functional dependencies can be exploited to increase the information content of answers (by discovering missing values) and to control merging operations and propagation strategies. The chapter proposes algorithms for translating high-level specifications of mappings and queries into XQuery programs, and it shows how the discussed method has been implemented in the SixP2P (or 6P2P) system.
DOI: 10.4018/978-1-60960-475-2.ch009
INTRODUCTION
The goal of data integration is to enable rapid development of new applications requiring information from multiple sources (Haas, 2007). Data integration consists in combining data from different sources into a unified format (Bernstein & Haas, 2008). There are a number of research fields relevant to data integration. Among them we can distinguish: identification of the best data sources to use, cleansing and standardizing data coming from these sources, dealing with uncertainty and tracing data provenance, and querying diverse sources while optimizing queries and execution plans. Integration activities cover any form of data reuse, such as exchanging data between different applications' databases, translating data for business-to-business e-commerce, and providing access to structured data and documents via a Web portal (Bernstein & Haas, 2008). A variety of architectural approaches can be used to deal with the problem of data integration. The most popular is materialized integration, realized by means of a data warehouse that consolidates data from multiple sources. Other approaches use the paradigm of virtual integration. While warehouses materialize the integrated data, virtual data integration offers a mediated schema against which users can pose queries. The query is translated into queries on the data sources, and the results of those queries are merged so that they appear to have come from a single integrated database (Miller et al., 2000; Pankowski & Hunt, 2005). In peer-to-peer (P2P) data integration, the schema of any peer database can play the role of the mediated schema. The user then issues a query against an arbitrarily chosen peer and expects that the answer will include relevant data stored in all P2P-connected data sources. The data sources are related by means of XML schema mappings. A query must be propagated to all peers in the system along semantic paths of mappings and reformulated accordingly. The partial answers must be merged and sent back to the user's peer (Madhavan & Halevy, 2003; Pankowski, 2008c; Tatarinov & Halevy, 2004). Much work has been done on data integration systems, both with a mediated (global) schema and in the P2P architecture, where the schema of any peer can play the role of the mediated schema (Arenas & Libkin, 2005; Madhavan & Halevy, 2003; Melnik et al., 2005; Yu & Popa, 2004). There are also a number of systems built in the P2P data integration paradigm (Koloniari & Pitoura, 2005), notably Piazza (Tatarinov et al., 2003) and PeerDB (Ooi et al., 2003). In these works the focus was on overcoming syntactic heterogeneity, and schema mappings were used to specify how data structured under one schema (the source schema) can be transformed into data structured under another schema (the target schema) (Fagin et al., 2004; Fuxman et al., 2006). Some attention has been paid to the question of how schema constraints influence query propagation. This chapter describes formal foundations and some algorithms used for XML data integration in a P2P system. Schemas of XML data are described by means of a class of tree-pattern formulas, as in (Arenas & Libkin, 2005). These formulas are used to define both schema mappings and queries. In contrast to (Arenas & Libkin, 2005), we use tree-pattern formulas not only for schemas but also to specify constraints (functional dependencies) over schemas. Schemas, mappings, queries and constraints are thus specified in a uniform way as a class of tree-pattern formulas.
Thanks to this, we are able to translate high-level specifications into XQuery programs. We also discuss the problem of query propagation between peers. We show how mutual relationships between schema constraints and queries can influence both the propagation of queries and the merging of answers. Taking such interrelationships into account may improve both the efficiency of the system and the information content of answers. We show in brief how the issues under consideration have been implemented in the 6P2P system (SixP2P, Semantic Integration of XML Data in a P2P Environment).
In this chapter, we assume that all XML documents are valid, so we will not be interested in checking their validity against DTDs. Instead, we are interested in a formal description of the structure of XML documents that is convenient for defining mappings and transforming data in XML data integration processes. Since every XML document is a finite tree, its structure can be specified by a tree pattern (Xu & Ozsoyoglu, 2005). For documents having recursive DTDs, their tree patterns are restricted to finite depths. The notion of tree patterns can be used to define tree-pattern formulas (Arenas & Libkin, 2005). A tree-pattern formula arises from a tree pattern by assigning variables to terminal labels (i.e., to the paths starting in the root and ending with a terminal label). In Figure 1 the tree-pattern formulas S1 and Sa
Figure 1. Graphical representation of DTDs (D1 and Da) and tree-pattern formulas (schemas) S1 and Sa for documents conforming to those DTDs; Sa is restricted to a depth of 3
are represented in the form of trees whose leaves are labeled with text-valued variables. We will consider two classes of tree-pattern formulas: schemas and functional dependencies. Definition 2. (tree pattern) A tree pattern over D = (top, L, Ter, ρ) is an expression defined recursively by the following grammar:
TP ::= /E | E
E ::= l | l[E1, ..., Ek] | E/E′,
where first(Ei) ∈ ρ(l), for i = 1, ..., k; and first(E′) ∈ ρ(last(E)); and
first(l) = last(l) = l, first(l[E1, ..., Ek]) = last(l[E1, ..., Ek]) = l, first(E/E′) = first(E), last(E/E′) = last(E′).
Example 2. For L = {pubs, pub, title, year, author, name, university}, the following expressions are tree patterns over L:
TP1 = /pubs[pub[title, year, author[name, university]]], TP2 = pubs/pub[title, author]/year, TP3 = pub/author, TPa = /parts[part[pid, part[pid, part[pid]]]].
By Path(TP) we will denote the set of all paths defined by the tree pattern TP. This set is defined recursively as follows: Path(/E) = {/p | p ∈ Path(E)}, Path(l) = {l}, Path(E/E′) = {p/p′ | p ∈ Path(E), p′ ∈ Path(E′)}, Path(l[E1, ..., Ek]) = {l/pi | pi ∈ Path(Ei), i ∈ [1, ..., k]}.
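For instance, applying these rules to TP1 from Example 2 gives Path(TP1) = {/pubs/pub/title, /pubs/pub/year, /pubs/pub/author/name, /pubs/pub/author/university}: each branch of a subexpression l[E1, ..., Ek] is prefixed with l, and the leading /pubs comes from the rule for /E.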
Definition 3. (schema pattern) A tree pattern TP over D = (top, L, Ter, ρ) is called a schema pattern over D if the following two conditions hold: TP is of the form /top[E], and each path in Path(TP) ends with a terminal label. The tree patterns TP1 and TPa from Example 2 are tree-pattern schemas over D1 and Da, respectively. Definition 4. (arity of tree patterns) A tree pattern TP over D = (top, L, Ter, ρ) is of arity m if there are m paths in Path(TP) ending with a terminal label from Ter. To indicate these terminal labels (possibly with more than one occurrence) and their ordering, we use the notation TP(l1, ..., lm), meaning that the occurrence of the terminal label l1 precedes the occurrence of l2, and so on. Note that a label can have many occurrences in a tree pattern (e.g., for Sa in Example 1 we have Sa(pid, pid, pid)). Definition 5. (tree-pattern formula, schema) Let TP be a tree pattern of arity m over D = (top, L, Ter, ρ), let (l1, ..., lm) be the list of (not necessarily distinct) terminal labels occurring in TP, and let x = (x1, ..., xm) be a tuple of distinct text-valued variables. Then the formula TP(x), created from TP by replacing the i-th terminal label li with the equality li = xi, is called a tree-pattern formula. If TP is a schema pattern, then TP(x) will be referred to as a schema. Schemas will be denoted by S(x), or by S if variable names are not important or are clear from the context. Note that a schema S(x) is an XPath expression (XPath, 2006), where the first slash, /, denotes the root of the corresponding XML document. Any variable x occurring in the schema has as its type the sequence of labels (the path) leading from the root to the leaf the variable x is assigned to. Definition 6. (variable types) Let S be a schema over x and let an atom l = x occur in S. Then the path p starting with top and ending with l is called the type of the variable x, denoted typeS(x) = p. A tree pattern has as its type the type of the elements returned by its evaluation according to the XPath semantics. This type is a path determined as follows: type(/E) = /type(E), type(l) = l, type(E/E′) = type(E)/type(E′),
type(E[E′]) = type(E).
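For example, for the schema S1(x1, x2, x3, x4) = /pubs[pub[title = x1, year = x2, author[name = x3, university = x4]]] used later in Example 4, Definition 6 gives typeS1(x1) = /pubs/pub/title and typeS1(x4) = /pubs/pub/author/university.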
It will be useful to perceive an XML tree I with a schema S(x) as a pair (S(x), Ω) (called the instance description), where Ω is a set of valuations of the variables in x. Definition 8. (variable valuation) Let Str ∪ {⊥} be the set of values of text nodes. Let x be a set of variable names. A valuation ω for the variables in x is a function ω: x → Str ∪ {⊥} assigning values in Str ∪ {⊥} to the variables in x. An XML tree I satisfies a description (S, Ω), denoted I ⊨ (S, Ω), if I satisfies (S, ω) for every ω ∈ Ω, where this satisfaction is defined as follows. Definition 9. (schema satisfaction) Let S be a schema, x the set of variables in S, and ω a valuation of the variables in x. An XML tree I = (r, Ne, Nt, child, λ, ν) satisfies S by ω, denoted I ⊨ (S, ω), if the root r of I satisfies S by ω, denoted (I, r) ⊨ (S, ω), where:
Figure 2. Two XML trees as equivalent instances of S1; J2 is a canonical instance, whereas J1 is not
(I, r) ⊨ (/top[E], ω) iff ∃n ∈ Ne (child(r, n) ∧ (I, n) ⊨ (top[E], ω));
(I, n) ⊨ (l[E1, ..., Ek], ω) iff λ(n) = l and ∃n1, ..., nk ∈ Ne (child(n, n1) ∧ (I, n1) ⊨ (E1, ω) ∧ ... ∧ child(n, nk) ∧ (I, nk) ⊨ (Ek, ω));
(I, n) ⊨ (l/E, ω) iff λ(n) = l and ∃n′ ∈ Ne (child(n, n′) ∧ (I, n′) ⊨ (E, ω));
(I, n) ⊨ (l = x, ω) iff λ(n) = l and ∃n′ ∈ Nt (child(n, n′) ∧ ν(n′) = ω(x)).
In fact, a description (S, Ω) represents a class of instances of S with the same set of valuations Ω, since elements in instance trees can be grouped and nested in different ways. For example, both XML trees J1 and J2 in Figure 2 conform to the schema S1 from Example 1 and satisfy the description (S1, {(XML, 2005, Ann, LA), (XML, 2005, John, NY)}), although they are organized in different ways. By a canonical instance we will understand the instance with the maximal width, i.e., the instance in which the subtrees corresponding to valuations are pairwise disjoint. For example, the instance J2 in Figure 2 is a canonical instance, whereas J1 is not, since two authors are nested under one publication.
SCHEMA MAPPINGS
Further on in this chapter, we will refer to the running example depicted in Figure 3. There are three XML schema trees, S1, S2, S3, along with their instances I1, I2, and I3, respectively. S1 is the same as S1 in Figure 1, and its instance I1 is empty. The schemas and their instances reside on peers P1, P2, and P3, respectively. The key issue in data integration is that of schema mapping. A schema mapping is a specification defining how data structured under one schema (the source schema) is to be transformed into data structured under another schema (the target schema). In the theory of relational data exchange, source-to-
Figure 3. XML schemas, S1, S2, S3, and their instances I1, I2 and I3, located in peers P1, P2, and P3
target dependencies (STDs) (Abiteboul et al., 1995) are usually used to express schema mappings (Fagin et al., 2004). We adopt the approach proposed in (Fagin et al., 2004), but instead of relational schemas we deal with XML schemas (Arenas & Libkin, 2005; Pankowski et al., 2007). An XML schema mapping specifies the semantic relationship between a source XML schema and a target XML schema. Definition 10. (schema mapping) A mapping from a source schema S(x) to a target schema T(x′, y), where x′ ⊆ x and y ∩ x = ∅, is a formula of the form mS→T := ∀x (S(x) → ∃y T(x′, y)). In other words, a mapping mS→T states that, for any valuation of the variables in x, if this valuation satisfies the tree-pattern formula S(x), then there is such a valuation of y that T(x′, y) is also satisfied. Variable names in a mapping are used to indicate correspondences between text values of the paths bound to the variables. In practice, a correspondence can also involve functions that transform values of source and target variables. These functions are irrelevant to our discussion, so they will be omitted. The result of a mapping is a canonical instance of the right-hand side schema, where each variable y ∈ y has the null value (⊥). The target instance can be obtained using the chase procedure (Abiteboul et al., 1995; Fagin et al., 2005; Pankowski, 2008d). In this work, however, we propose an algorithm (Algorithm 1) translating the high-level specification of a mapping into an XQuery program (XQuery, 2002; Pankowski, 2008c) producing the target instance from a given source instance. Example 4. The mapping mS3→S1 from S3 to S1 (Figure 3) is specified as:
mS3→S1 := ∀x1, x2, x3 (S3(x1, x2, x3) → ∃x4 S1(x2, x3, x1, x4))
= ∀x1, x2, x3 (/authors[author[name = x1, paper[title = x2, year = x3]]] → ∃x4 /pubs[pub[title = x2, year = x3, author[name = x1, university = x4]]]).
Algorithm 1 translates a mapping into an appropriate XQuery program. By x, y, v (possibly with subscripts) we denote variables, while $x, $y, $v are the corresponding XQuery variables. Algorithm 1. (translating a mapping mS→T to an XQuery program)
Input: A mapping ∀x (S(x) → ∃y T(x′, y)), where S := /l[E], T := /l′[E′], and y = (y1, ..., ym).
Output: A query in XQuery over S transforming an instance of S into the corresponding canonical instance of T.
mappingToXQuery(∀x (/l[E] → ∃y1, ..., ym /l′[E′])) =
<l′>{
let $y1 := null, ..., $ym := null
for $v in doc(...)/l, Φ($v, E)
return Ψ(E′)
}</l′>
where:
1. Φ($v, l = x) = $x in if ($v[l]) then $v/l/text() else null,
2. Φ($v, l[E]) = $v′ in if ($v[l]) then $v/l else /, Φ($v′, E),
3. Φ($v, (E1, ..., Ek)) = Φ($v, E1), ..., Φ($v, Ek),
and:
4. Ψ(l = x) = <l>{$x}</l>,
5. Ψ(l[E]) = <l>Ψ(E)</l>,
6. Ψ((E1, ..., Ek)) = Ψ(E1) ... Ψ(Ek).
For the mapping from Example 4, Algorithm 1 generates the following XQuery program:
Query 1:
<pubs>{
let $x4 := null
for $_v in doc("I3.xml")/authors,
    $_v1 in if ($_v[author]) then $_v/author else /,
    $x1 in if ($_v1[name]) then $_v1/name/text() else null,
    $_v2 in if ($_v1[paper]) then $_v1/paper else /,
    $x2 in if ($_v2[title]) then $_v2/title/text() else null,
    $x3 in if ($_v2[year]) then $_v2/year/text() else null
return
<pub>
<title>{$x2}</title>
<year>{$x3}</year>
<author>
<name>{$x1}</name>
<university>{$x4}</university>
</author>
</pub>
}</pubs>
The program creates a canonical instance of S1, i.e., elements are not grouped and all missing values are replaced with nulls (⊥).
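For concreteness, suppose I3.xml is the following instance of S3 (an assumed file; its single valuation (Ann, XML, 2005) matches the data of peer P3 used later in the running example):

<authors>
<author>
<name>Ann</name>
<paper>
<title>XML</title>
<year>2005</year>
</paper>
</author>
</authors>

Query 1 then returns the following canonical instance of S1, in which the university, unknown at the source, appears as the null marker (written ⊥):

<pubs>
<pub>
<title>XML</title>
<year>2005</year>
<author>
<name>Ann</name>
<university>⊥</university>
</author>
</pub>
</pubs>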
SCHEMA CONSTRAINTS
Over a schema, we can define XML functional dependencies (XFDs), which constrain dependencies between values, or between values and nodes, in instances of the schema. Definition 11. (TP-XFD) A tree-pattern formula F(x1, ..., xk) over a schema S(x) defines an XML functional dependency p1, ..., pk → p (Arenas & Libkin, 2004) if typeS(xi) = pi for i = 1, ..., k, and type(F) = p.
Such TPFs will be referred to as tree-pattern XML functional dependencies (TP-XFDs); p is then the dependent path, and p1, ..., pk are the determining paths. An XFD p1, ..., pk → p denotes that a tuple of text values corresponding to the left-hand side uniquely determines the value of the right-hand side. It is assumed that each path pi on the left-hand side ends with a terminal label, whereas the path p on the right-hand side can end with a terminal or a non-terminal label. In general, there can also be a context, determined by a path q, in which the XFD is defined; the XFD then has the following form (Arenas & Libkin, 2004): q; {p1, ..., pk} → p.
Example 5. For example, to specify that in S1 (Figure 3) the university is determined by the author's name, we can write:
XFD: pubs.pub.author.name → pubs.pub.author.university, or TP-XFD: /pubs/pub/author[name = x]/university.
To express that the constraint is valid in a subtree denoted by the path pubs.pub (the context), we write:
XFD: pubs.pub; pubs.pub.author.name → pubs.pub.author.university, or TP-XFD: /pubs/pub[title = x1]/author[name = x2]/university.
In the last TP-XFD, a key for the context subtree must be given (we assume that the context subtree is uniquely determined by the title of the publication) (see Buneman et al., 2003). Note that TP-XFDs are XPath expressions, so their semantics is precisely defined. Moreover, they can be easily incorporated into XQuery-based procedures exploiting these constraints. Definition 12. We say that a TP-XFD F(x′) is defined over a schema S(x) if x′ ⊆ x and typeS(x) = typeF(x) for each x ∈ x′. Definition 13. (satisfaction of TP-XFD) An instance I = (S(x), Ω) satisfies a TP-XFD F(x′) defined over S(x) if, for any two valuations ω1, ω2 ∈ Ω, the following implication holds: ω1(x′) = ω2(x′) ⇒ F(ω1(x′)) = F(ω2(x′)), (1)
where F(ω1(x′)) and F(ω2(x′)) denote the results of computing the XPath expression F(x′) under the valuations ω1 and ω2, respectively. Example 6. Over the schema S3 (Figure 3) we can specify the following TP-XFD: F1(x2) := /authors/author/paper[title = x2]/year (the title of a paper determines the year of its publication), whereas over S2 (Figure 3) one of the following two TP-XFDs can be specified: either F2(x1) := /pubs/pub[title = x1], or F2(x1, x2) := /pubs/pub[title = x1, author[name = x2]]. Any instance satisfying F2(x1) must have at most one subtree of the type /pubs/pub for any distinct value of title (see J1 in Figure 2), whereas any instance satisfying F2(x1, x2) must have at most one subtree of the type /pubs/pub for any distinct pair of values (title, author/name) (see J2 in Figure 2). Further on, we will use TP-XFDs to discover some missing values; thus, we will restrict ourselves to TP-XFDs determining text values. Definition 14. (text-valued TP-XFDs) We say that a TP-XFD F(x′) over a schema S(x) determines text values (or is text-valued) if there is an x ∈ x such that typeS(x) = type(F(x′)). Such a TP-XFD will be denoted by (F(x′), x). Proposition 1. (discovering missing values) Let (F(x′), x) be a text-valued TP-XFD over S(x), and let I = (S(x), Ω) be an instance of S(x) satisfying (F(x′), x). Let ω1, ω2 ∈ Ω be valuations such that ω1(x′) = ω2(x′), ω1(x) = ⊥, and ω2(x) ≠ ⊥.
Then we say that the missing value of x for the valuation ω1 is discovered as ω2(x). The following algorithm generates an XQuery program for a given schema and a set of TP-XFDs over this schema. The program discovers all possible missing values with respect to the given set of TP-XFDs; we say that in this way the instance is being repaired.
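For instance, take the text-valued TP-XFD (F1(x2), x3) over S3 from Example 6, where F1(x2) := /authors/author/paper[title = x2]/year and x3 has the type /authors/author/paper/year. For the valuations ω1 = (John, XML, ⊥) and ω2 = (Ann, XML, 2005), which appear later in the running example, we have ω1(x2) = ω2(x2) = XML, ω1(x3) = ⊥ and ω2(x3) = 2005, so the missing year of John's publication is discovered as 2005.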
Algorithm 2. (generation of an XQuery program discovering missing values)
Input: A schema S(x) = /top[E] and a set F of text-valued TP-XFDs over S(x).
Output: A program in XQuery over instances of S(x) returning a repaired version of a given instance of S(x).
Method: xfdToXQuery(/top[E]) is identical to the translation function mappingToXQuery(∀x (/top[E] → /top[E])) from Algorithm 1, except that rule (4) is replaced with the rule:
4. Ψ(l = x) = <l>{ if ($x = null) then F(x′)[text() != null]/text() else $x }</l>, where (F(x′), x) ∈ F.
Example 7. Discovering missing values in an instance of S1 (Figure 3) can be done using the XQuery program (Query 2) generated for the schema S1, where the TP-XFD constraints are F1(x1) := /pubs/pub[title = x1]/year and F2(x3) := /pubs/pub/author[name = x3]/university. The corresponding XQuery program is similar to that of Query 1; however, the expressions defining the elements year and university attempt to discover missing values when the current values of $x2 or $x4 are null:
Query 2:
<pubs>{
for $_v in doc("I1.xml")/pubs,
    $_v1 in if ($_v[pub]) then $_v/pub else /,
    $x1 in if ($_v1[title]) then $_v1/title/text() else null,
    $x2 in if ($_v1[year]) then $_v1/year/text() else null,
    $_v2 in if ($_v1[author]) then $_v1/author else /,
    $x3 in if ($_v2[name]) then $_v2/name/text() else null,
    $x4 in if ($_v2[university]) then $_v2/university/text() else null
return
<pub>
<title>{$x1}</title>
<year>{if ($x2 = null)
  then doc("I1.xml")/pubs/pub[title = $x1]/year[text() != null]/text()
  else $x2}</year>
<author>
<name>{$x3}</name>
<university>{if ($x4 = null)
  then doc("I1.xml")/pubs/pub/author[name = $x3]/university[text() != null]/text()
  else $x4}</university>
</author>
</pub>
}</pubs>
As a result of the repairing, every null that violates a TP-XFD is replaced with the corresponding non-null value.
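To see the repair at work, consider the following assumed instance I1.xml (it is not given in the chapter; ⊥ stands for the null marker), in which the second pub lacks a year and a university:

<pubs>
<pub>
<title>XML</title>
<year>2005</year>
<author><name>Ann</name><university>LA</university></author>
</pub>
<pub>
<title>XML</title>
<year>⊥</year>
<author><name>Ann</name><university>⊥</university></author>
</pub>
</pubs>

Running Query 2 repairs the second pub: by F1 the title XML determines the year 2005, and by F2 the name Ann determines the university LA, so both nulls are replaced.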
Figure 4. Reformulation of a query (mT(z)→T(z), φ(z)) into a query (mS(x)→T(x′,y), φ′(x)) using the mapping ∀x (S(x) → ∃y T(x′, y))
Definition 16. (answer to a query) Let q = (mS→T, φ) be a query from S(x) to T(x′, y), and let I = (S, Ω) be an instance of S. The answer q(I) is the instance J = (T, Ω′) such that Ω′ = {ω.restrict(x′) ∪ null(y) | ω ∈ Ω, φ(ω(x)) = true}, where ω.restrict(x′) is the restriction of the valuation ω to the variables in x′, and null(y) is a valuation assigning nulls to all variables in y. Example 8. The query q12 = (mS1(x1,x2,x3,x4)→S2(x1,x3,x4), x3 = John ∧ x2 = 2005) filters an instance of the source schema S1 according to the qualifier and produces an instance of the schema S2. For instance, applied to the instance description (S1, {(XML, 2005, Ann, LA), (XML, 2005, John, NY)}) from Figure 2, it returns (S2, {(XML, John, NY)}), since only the second valuation satisfies the qualifier. A query is issued by the user against an arbitrarily chosen peer schema (the target schema). The user perceives a target schema T(z) and defines a qualifier φ(z), so initially the query is from T to T and has the form q = (mT(z)→T(z), φ(z)). When the query is propagated to a source peer with the schema S(x), it must be reformulated into a query from S to T, i.e., into q′ = (mS(x)→T(x′,y), φ′(x)). The query reformulation concerns the left-hand side of the query and consists in an appropriate renaming of variables. The reformulation is performed as follows (Figure 4): 1. We want to determine the qualifier φ′(x) over the source schema S(x). To do this we use the mapping mS(x)→T(x′,y).
2. The qualifier φ′(x) is obtained as the result of the following rewriting of the qualifier φ(z): φ′(x) := φ(z).rewrite(T(z), T(x′, y)). The rewriting consists in an appropriate replacement of variable names: a variable z ∈ z occurring in φ(z) is replaced by such a variable x ∈ x′ that the type of z in T(z) is equal to the type of x in T(x′, y). If no such x ∈ x′ exists, the query is not rewritable. Example 9. For the query q11 = (mS1(x1,x2,x3,x4)→S1(x1,x2,x3,x4), x3 = John), we have the following reformulation for its propagation to S2
The possibility of discovering missing values during the process of merging constitutes the criterion for selecting one of these two modes. To make the decision, the relationships between the TP-XFD constraints specified for the peer's schema and the query qualifier must be analyzed. Further on in this section, we formulate a theorem (Theorem 1) stating a sufficient condition under which applying the full merge is pointless, because no missing value can be discovered (Pankowski, 2008a).
Ans^a_11 = q11(I1) = {(x1: ⊥, x2: ⊥, x3: ⊥, x4: ⊥)},
Ans^a_21 = q21(I2) = {(x1: XML, x3: John, x4: NY)},
Ans^a_31 = q31(I3) = {(x3: ⊥, x1: ⊥, x2: ⊥)},
Ans_a = {(x1: XML, x2: ⊥, x3: John, x4: NY)}. Strategy (b). It differs from strategy (a) in that P2, after receiving the query, propagates it to P3 and waits for the answer q32(I3). It is obvious that the result Ans_b is equal to Ans_a:
Ans_b = merge({Ans^b_11, Ans^b_21, Ans^b_31}) = {(x1: XML, x2: ⊥, x3: John, x4: NY)}.
Strategy (c). In contrast to strategy (b), the peer P3 propagates the query to P2 and waits for the answer. Next, the peer P3 decides to merge the obtained answer q23(I2) with its whole instance I3. The decision follows from the fact that the functional dependency /authors/author/paper[title = x1]/year is defined over the local schema of P3, and satisfies the necessary condition for discovering missing values of the variable x2 of the type /authors/author/paper/year (Theorem 1). So we have:
Ans_c = merge({Ans^c_11, Ans^c_23, Ans^c_31}),
Ans^c_23 = {(x1: XML, x2: ⊥, x3: John)},
Ans^c_31 = q31(merge({I3, Ans^c_23})) = {(x1: XML, x2: 2005, x3: John)},
Ans_c = {(x1: XML, x2: 2005, x3: John, x4: NY)}. While computing merge({I3, Ans^c_23}), the missing value of x2 is discovered. Thus, the answer Ans_c provides more information than the answers in strategies (a) and (b).
The discussion above shows that it is useful to analyze the relationships between the query and the functional dependencies defined over the peer's schema. The analysis can influence the decision about the propagation and merging modes.
To illustrate the application of the above theorem, let us consider a query about John's data in peers P2 and P3 in Figure 3. 1. Let q be a query with the qualifier φ2(x2) := x2 = John in the peer P2. There is also a TP-XFD F2(x2) := /pubs/pub/author[name = x2]/university specified over S2(x1, x2, x3). By Theorem 1, there is no chance to discover any missing value of John's university. Indeed, if we obtain an answer with university = ⊥, then the real value is either in the local answer q(I2) or it does not occur in I2 at all. So, in P2 the partial merge should be performed; performing the full merge in this case is pointless. 2. Let q be a query with the qualifier φ3(x1) := x1 = John issued against the peer P3. There is a TP-XFD F3(x2) := /authors/author/paper[title = x2]/year specified over S3(x1, x2, x3). The assumptions of
Theorem 1 are not satisfied, so there is a chance to discover the missing value of year using the full merge. Indeed, from P2 we obtain the answer q(I2) := (S3, {(John, XML, ⊥)}). The local answer q(I3) is empty, but performing the full merge and using F3(x2) we obtain: q(merge((S3, {(John, XML, ⊥)}), (S3, {(Ann, XML, 2005)}))) = (S3, {(John, XML, 2005)}). Thus, the year of John's publication has been discovered, and the use of the full merge is justified. The consequences of Theorem 1 also impact the way of query propagation. The P2P propagation (i.e., propagation to all partners with the P2P propagation mode) may be rejected in order to avoid cycles. However, when the analysis of the query qualifier and the TP-XFDs shows that there is a chance to discover missing values, the peer can decide to propagate the query with the local propagation mode (i.e., the peer expects only the local answer from a partner, without further propagation) instead of rejecting it. Such an action can take place in peer P3 in case (2) discussed above.
information about partners, schema constraints, mappings, and answers). Using the query interface (QI), a user formulates a query. The query execution module (QE) controls the process of query reformulation, query propagation to partners, merging of partial answers, discovering missing values, and returning partial answers (Brzykcy et al., 2008; Pankowski, 2008b; Pankowski, 2008c; Pankowski et al., 2007; Pankowski, 2008d). Communication between peers (QAP) is realized by means of Web Services technology.
A propagation is a relationship between a peer (the target peer) and another peer (the source peer) to which the query has been sent (propagated). While propagating queries, the following three objectives are taken into account: (1) avoiding cycles, (2) deciding about propagation modes (P2P or local), and (3) deciding about merging modes (full or partial).
6P2P Database
A peer's database consists of five tables: Peer, Constraints, Partners (Figure 7), Queries, and Propagations (Figure 8).
1. Peer(myPeer, myPatt, myData, xfdXQuery, keyXQuery) has exactly one row, where: myPeer is the URL of the peer owning the database; myPatt is the schema (tree-pattern formula) of the data source; myData is the peer's data source, i.e., an XML document or an XML view over some data repositories; xfdXQuery and keyXQuery are XQuery programs obtained by the translation of constraint specifications, TP-XFDs and keys, respectively.
2. Constraints(constrId, constrType, constrExp) stores information about the local data constraints (in this chapter we discuss only TP-XFDs).
3. Partners(partPeer, partPatt, mapXQuery) stores information about all the peer's partners (acquaintances), where: partPeer is the URL of the partner; partPatt is the right-hand side of the schema mapping to the partner (variable names reflect correspondences between paths in the source and in the target schema); mapXQuery is an XQuery program obtained by translation (by means of Algorithm 1) of the mapping determined by Peer.myPatt (left-hand side of the mapping) and Partners.partPatt (right-hand side of the mapping).
4. Queries and Propagations (Figure 8) maintain information about queries, qryId, and their threads, qryThreadId, managed in the 6P2P system. The user specifies a qualifier of the query, myQualif, as well as propagation (propagMode) and merging (mergeMode) modes.
Figure 8. Queries and Propagations tables in 6P2P. Sample data illustrates instances of the tables when a query is propagated from a peer @P to a peer @P′
leadsToCycle(q1, @P′) is true if the propagation of q1 to @P′ would lead to a cycle, i.e., @P′ occurs in q1.qryTrace. discoveryMayBeDone(q1, @P′) is true if the hypothesis of Theorem 1 does not hold. acceptedPropagation(q1, @P′) is true if @P′ accepts the propagation of q1 with the given parameters. Algorithm 3. (query propagation)
Input: the current peer @P; the tables @P:Peer, @P:Partners, @P:Queries, @P:Propagations in the database of peer @P.
Output: New states of the tables @P:Queries and @P:Propagations, if a partner peer @P′ accepts the propagation.
Method:
q := @P:Queries; // a row describing the query thread to be propagated
if q.propagMode = P2P {
q1 := new propagationParametersType; // used to prepare propagations to all partners
q1.propagId := new propagId;
q1.qryThreadId := q.qryThreadId;
q1.qryId := q.qryId;
q1.qryTrace := q.qryTrace + @P; // the sequence of visited peers, used to avoid cycles
q1.myPeer := @P; // the peer where the answer should be returned
q1.myQualif := q.myQualif; // the query qualifier
q1.propagMode := P2P;
q1.mergeMode := q.mergeMode;
q1.myPatt := @P:Peer.myPatt; // the schema of @P
foreach @P′ in @P:Partners.partPeer { // attempt to propagate the query to all partners
if leadsToCycle(q1, @P′) and not discoveryMayBeDone(q1, @P′) then next;
if leadsToCycle(q1, @P′) then q1.propagMode := local;
if acceptedPropagation(q1, @P′) then
insert into @P:Propagations values (q1.propagId, q1.qryThreadId, @P′, null, Waiting)
}
If the source peer @P′ accepts the propagation q1, then it creates the following tuple q2 and inserts it into the @P′:Queries table:
q2.qryThreadId := new qryThreadId;
q2.qryId := q1.qryId;
q2.qryTrace := q1.qryTrace;
q2.tgtPeer := q1.myPeer;
q2.myQualif := (q1.myQualif).rewrite(q1.myPatt, @P′:Partners.partPatt where Partners.partPeer = q1.myPeer);
q2.propagMode := q1.propagMode;
q2.mergeMode := q1.mergeMode;
q2.tgtPropagId := q1.propagId;
q2.tgtThreadId := q1.qryThreadId;
q2.tgtAnswer := null;
q2.status := Waiting;
q2.myXQuery := the XQuery program obtained by automatic translation of the query into XQuery.
CONCLUSION
The chapter discusses a method for schema mapping and query reformulation in a P2P XML data integration system. The discussed formal approach enables us to specify schemas, schema constraints, schema mappings, and queries in a uniform and precise way. Based on this approach, we define some basic operations used for query reformulation and data merging, and propose algorithms for the automatic generation of XQuery programs that actually perform these operations. We discussed some issues concerning query propagation strategies and merging modes when missing data is to be discovered in P2P integration processes. The approach is implemented in the 6P2P system. We presented its general architecture and sketched the way queries and answers are sent across the P2P environment.
ACKNOWLEDGMENT
The work was supported in part by the Polish Ministry of Science and Higher Education under Grant 3695/B/T02/2009/36.
REFERENCES
Abiteboul, S., Benjelloun, O., Manolescu, I., Milo, T., & Weber, R. (2002). Active XML: Peer-to-peer data and Web services integration. In Proceedings of the 28th International Conference on Very Large Data Bases (pp. 1087-1090). August 20-23, 2002, Hong Kong, China. Morgan Kaufmann.
Abiteboul, S., Hull, R., & Vianu, V. (1995). Foundations of databases. Reading, MA: Addison-Wesley.
Arenas, M., & Libkin, L. (2004). A normal form for XML documents. ACM Transactions on Database Systems, 29(1), 195-232. doi:10.1145/974750.974757
Arenas, M., & Libkin, L. (2005). XML data exchange: Consistency and query answering. In L. Chen (Ed.), Proceedings of the 24th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (pp. 13-24). June 13-15, 2005, Baltimore, Maryland, USA. ACM.
Bernstein, P. A., & Haas, L. M. (2008). Information integration in the enterprise. Communications of the ACM, 51(9), 72-79. doi:10.1145/1378727.1378745
Brzykcy, G., Bartoszek, J., & Pankowski, T. (2008). Schema mappings and agents' actions in P2P data integration system. Journal of Universal Computer Science, 14(7), 1048-1060.
Buneman, P., Davidson, S. B., Fan, W., Hara, C. S., & Tan, W. C. (2003). Reasoning about keys for XML. Information Systems, 28(8), 1037-1063. doi:10.1016/S0306-4379(03)00028-0
Fagin, R., Kolaitis, P. G., & Popa, L. (2005). Data exchange: Getting to the core. ACM Transactions on Database Systems, 30(1), 174-210. doi:10.1145/1061318.1061323
Fagin, R., Kolaitis, P. G., Popa, L., & Tan, W. C. (2004). Composing schema mappings: Second-order dependencies to the rescue. In A. Deutsch (Ed.), Proceedings of the 23rd ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (pp. 83-94). June 14-16, 2004, Paris, France. ACM.
Fuxman, A., Kolaitis, P. G., Miller, R. J., & Tan, W. C. (2006). Peer data exchange. ACM Transactions on Database Systems, 31(4), 1454-1498. doi:10.1145/1189769.1189778
Haas, L. M. (2007). Beauty and the beast: The theory and practice of information integration. In Schwentick, T., & Suciu, D. (Eds.), Database theory (LNCS 4353) (pp. 28-43). Springer.
Koloniari, G., & Pitoura, E. (2005). Peer-to-peer management of XML data: Issues and research challenges. SIGMOD Record, 34(2), 6-17. doi:10.1145/1083784.1083788
Madhavan, J., & Halevy, A. Y. (2003). Composing mappings among data sources. In J. Ch. Freytag, et al. (Eds.), VLDB 2003, Proceedings of the 29th International Conference on Very Large Data Bases (pp. 572-583). September 9-12, 2003, Berlin, Germany. Morgan Kaufmann.
Martens, W., Neven, F., & Schwentick, T. (2007). Simple off the shelf abstractions for XML schema. SIGMOD Record, 36(3), 15-22. doi:10.1145/1324185.1324188
Melnik, S., Bernstein, P. A., Halevy, A. Y., & Rahm, E. (2005). Supporting executable mappings in model management. In F. Özcan (Ed.), Proceedings of the 24th ACM SIGMOD International Conference on Management of Data (pp. 167-178). Baltimore, Maryland, USA, June 14-16. ACM.
Miller, R. J., Haas, L. M., & Hernandez, M. A. (2000). Schema mapping as query discovery. In A. E. Abbadi, et al. (Eds.), VLDB 2000, Proceedings of the 26th International Conference on Very Large Data Bases (pp. 77-88). September 10-14, 2000, Cairo, Egypt. Morgan Kaufmann.
Milo, T., Abiteboul, S., Amann, B., Benjelloun, O., & Ngoc, F. D. (2005). Exchanging intensional XML data. ACM Transactions on Database Systems, 30(1), 1-40. doi:10.1145/1061318.1061319
Ooi, B. C., Shu, Y., & Tan, K.-L. (2003). Relational data sharing in peer-based data management systems. SIGMOD Record, 32(3), 59-64. doi:10.1145/945721.945734
Pankowski, T. (2008a). Query propagation in a P2P data integration system in the presence of schema constraints. In A. Hameurlain (Ed.), Data management in Grid and peer-to-peer systems (LNCS 5187) (pp. 46-57). Springer.
Pankowski, T. (2008b). Reconciling inconsistent data in probabilistic XML data integration. In Gray, W. A., Jeffery, K. G., & Shao, J. (Eds.), Sharing data, information and knowledge (LNCS 5071) (pp. 75-86). Springer. doi:10.1007/978-3-540-70504-8_8
Pankowski, T. (2008c). XML data integration in SixP2P: A theoretical framework. In A. Doucet, S. Gançarski, & E. Pacitti (Eds.), Proceedings of the 2008 International Workshop on Data Management in Peer-to-Peer Systems, DaMaP 2008 (pp. 11-18). Nantes, France, March 25, 2008. ACM International Conference Proceeding Series.
Pankowski, T. (2008d). XML schema mappings using schema constraints and Skolem functions. In Cotta, C., Reich, S., Schaefer, R., & Ligęza, A. (Eds.), Knowledge engineering and intelligent computations, knowledge-driven computing (pp. 199-216). Springer.
Pankowski, T., Cybulka, J., & Meissner, A. (2007). XML schema mappings in the presence of key constraints and value dependencies. In M. Arenas & J. Hidders (Eds.), Proceedings of the 1st Workshop on Emerging Research Opportunities for Web Data Management (EROW 2007), collocated with the 11th International Conference on Database Theory (ICDT 2007) (pp. 1-15). Barcelona, Spain, January 13, 2007.
Pankowski, T., & Hunt, E. (2005). Data merging in life science data integration systems. In Kłopotek, M. A., Wierzchoń, S. T., & Trojanowski, K. (Eds.), Intelligent Information Systems. New trends in intelligent information processing and Web mining, advances in soft computing (pp. 279-288). Berlin, Heidelberg: Springer. doi:10.1007/3-540-32392-9_29
XML Schema. (2009). W3C XML schema definition language (XSD) 1.1 part 2: Datatypes. Retrieved from www.w3.org/TR/xmlschema11-2
Tatarinov, I., & Halevy, A. Y. (2004). Efficient query reformulation in peer-data management systems. In G. Weikum, A. C. König, & S. Deßloch (Eds.), Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 539-550). Paris, France, June 13-18, 2004. ACM.
Tatarinov, I., & Ives, Z. G. (2003). The Piazza peer data management project. SIGMOD Record, 32(3), 47-52. doi:10.1145/945721.945732
XPath. (2006). XML path language 2.0. Retrieved from www.w3.org/TR/xpath20
XQuery. (2002). XQuery 1.0: An XML query language. W3C Working Draft. Retrieved from www.w3.org/TR/xquery
Xu, W., & Ozsoyoglu, Z. M. (2005). Rewriting XPath queries using materialized views. In K. Böhm, et al. (Eds.), Proceedings of the 31st International Conference on Very Large Data Bases (pp. 121-132). Trondheim, Norway, August 30 - September 2, 2005. ACM.
Yu, C., & Popa, L. (2004). Constraint-based XML query rewriting for data integration. In G. Weikum, A. C. König, & S. Deßloch (Eds.), Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 371-382). Paris, France, June 13-18, 2004. ACM.
values of a given set of paths. In contrast to its relational counterpart, an XFD can have a non-text-valued path on its right-hand side, and an XFD can be considered within an XML subtree denoted by a context path. XFDs can be specified by a class of tree-pattern formulas. Query Reformulation: Query reformulation is a process in which a query (Q) formulated against a target schema (T) is rewritten to a form conforming to a source schema (S), using a mapping (m) from S to T. Query reformulation implements virtual data integration, as opposed to the data exchange performed in materialized data integration. However, in both cases the answer to Q must be the same, i.e., Q(m(IS)) = refm(Q)(IS), where m(IS) transforms an instance of schema S into an instance of schema T, and refm(Q) reformulates the query Q into a query over schema S. Query Propagation: Query propagation is a process of sending a query across a network of peer-to-peer connected nodes along semantic paths determined by schema mappings between the schemas of peer databases. A peer receives a query from one of its partners (acquaintances), reformulates it, and sends it forward (propagates it) to all its other partners. The answer obtained from the peer's database is then merged with the answers obtained from the partners to whom the query was propagated. The result is next sent back to the partner who delivered the query.
Chapter 10
ABSTRACT
Significant research efforts in the Semantic Web community have recently been directed toward representing and reasoning with fuzzy ontologies. Description logics (DLs) are the logical foundations of the standard Web ontology languages. Conjunctive queries are deemed an expressive reasoning service for DLs. This chapter focuses on fuzzy (threshold) conjunctive queries over knowledge bases encoded in the fuzzy DL f-SHIF(D), the logical counterpart of the fuzzy OWL Lite language. It shows decidability of fuzzy query entailment in this setting by providing a corresponding tableau-based algorithm. The chapter shows that the data complexity of answering fuzzy conjunctive queries in fuzzy SHIF(D) is in coNP, as long as only simple roles occur in the query. Regarding combined complexity, this research proves a co3NExpTime upper bound in the size of the knowledge base and the query.
INTRODUCTION
The Semantic Web is an extension of the current Web in which Web information can be given well-defined semantics, thus enabling better cooperation between people and computers. In order to represent and reason with structured knowledge in the Semantic Web, the W3C has developed and recommended the Web Ontology Language (OWL) (Bechhofer et al., 2004), which comprises three sublanguages of increasing expressive power: OWL Lite, OWL DL and OWL Full. Description logics (DLs) (Baader, Calvanese, McGuinness,
DOI: 10.4018/978-1-60960-475-2.ch010
Nardi, & Patel-Schneider, 2003), as the logical foundation of the standard Web ontology languages, support knowledge representation and reasoning by means of concepts and roles. The logical counterparts of OWL Lite and OWL DL are the DLs SHIF(D) and SHOIN(D), respectively. The most prominent feature of DLs is their built-in reasoning mechanism, through which implicit knowledge is discovered from the explicit information stored in a DL knowledge base (KB). In the real world there exists a great deal of uncertainty and imprecision, which is more likely the rule than the exception. The problem that emerges is thus how to represent such non-crisp knowledge within ontologies and DLs. Based on Zadeh's fuzzy set theory (Zadeh, 1965), a substantial amount of work has been carried out on fuzzy extensions of DLs (Straccia, 2001; Stoilos, Stamou, Pan, Tzouvaras, & Horrocks, 2007), and fuzzy ontologies (Stoilos, Simou, Stamou, & Kollias, 2006) have thus been established. For a comprehensive review of fuzzy ontologies and fuzzy DLs, the reader can refer to (Lukasiewicz & Straccia, 2008). Fuzzy DL reasoners (Bobillo & Straccia, 2008; Stoilos et al., 2006) implement most of the standard fuzzy inference services (Straccia, 2001), including checking of fuzzy concept satisfiability, fuzzy concept subsumption, and ABox consistency. In addition, some fuzzy DL reasoners support different kinds of simple queries over a KB for obtaining assertional knowledge, such as retrieval: given a fuzzy KB K, a fuzzy concept C, and n ∈ (0, 1], retrieve all instances o occurring in the ABox such that K entails C(o) ≥ n, written K ⊨ C(o) ≥ n. In fact, fuzzy DL reasoners deal with these queries by transforming them into standard inference tasks. For example, the retrieval problem K ⊨ C(o) ≥ n can be reduced to the (un)satisfiability problem of the KB K ∪ {C(o) < n}, while the latter is a standard inference problem. With the emergence of a good number of large-scale domain ontologies encoded in the OWL languages, it is of particular importance to provide users with an expressive querying service. Conjunctive queries (CQs) originated from research in relational databases and have, more recently, also been identified as a desirable form of querying DL knowledge bases. Conjunctive queries provide an expressive query language with capabilities that go beyond standard instance retrieval. For example, consider a user query "find me hotels that are very close to the conference venue (with membership degree at least 0.9) and offer inexpensive (with membership degree at least 0.7) rooms", which can be formalized as Hotel(x) ≥ 1 ∧ closeTo(x, venue) ≥ 0.9 ∧ hasRoom(x, y) ≥ 1 ∧ hasPrice(y, z) ≥ 1 ∧ ¬Expensive(z) ≥ 0.7. Existing DL reasoners are limited to providing basic reasoning services. There is, however, no support for queries that ask for n-tuples of related individuals, or for the use of variables to formulate a query, just as conjunctive queries do. The reason for this lies in the fact that a fuzzy conjunctive query is not expressible as a part of a fuzzy DL knowledge base; thus a fuzzy conjunctive query entailment problem cannot be reduced to a basic reasoning problem so as to be dealt with by existing fuzzy DL reasoners. There is also the need for sufficient expressive power of fuzzy DLs to support reasoning in a full fuzzy extension of the OWL Web ontology language (Stoilos et al., 2006).
In this study, we thus deal with fuzzy conjunctive query entailment for the expressive fuzzy DL f-SHIF(D), the logical counterpart of the fuzzy OWL Lite language.
We present a novel tableau-based algorithm for checking query entailment over f-SHIF(D) KBs. We generalize the mapping conditions from a fuzzy query into a completion forest, reducing the number of times mappings must be checked across different completion forests. We close the open problem of the complexity of answering fuzzy conjunctive queries in expressive fuzzy DLs by establishing two complexity bounds: for data complexity, we prove a coNP upper bound, as long as only simple roles occur in the query; regarding combined complexity, we prove a co3NExpTime upper bound in the size of the knowledge base and the query.
A role S is called simple w.r.t. an RBox if, for each role R such that R ⊑* S, R ∉ Trans. The subscript of ⊑* and Trans is dropped if it is clear from the context. f-SHIF(D) complex concepts (or simply concepts) are defined from concept names according to the following abstract syntax:
C ::= ⊤ | ⊥ | A | C1 ⊓ C2 | C1 ⊔ C2 | ¬C | ∃R.C | ∀R.C | ≤1S | ≥2S | ∃T.D | ∀T.D,
D ::= d | ¬d. For decidability reasons, roles in functional restrictions of the form ≤1S and their negation ≥2S are restricted to be simple abstract roles.
A fuzzy TBox is a finite set of fuzzy concept axioms. Fuzzy concept axioms of the form A ≡ C are called fuzzy concept definitions, fuzzy concept axioms of the form A ⊑ C are called fuzzy concept specializations, and fuzzy concept axioms of the form C ⊑ D are called general concept inclusion (GCI) axioms. A fuzzy ABox consists of fuzzy assertions of the form ⟨C(o) ⋈ n⟩ (fuzzy concept assertions), ⟨R(o, o′) ⋈ n⟩ (fuzzy abstract role assertions), ⟨T(o, v) ⋈ n⟩ (fuzzy data type role assertions), or o ≠ o′ (inequality assertions), where o, o′ ∈ I, v ∈ I_c, and ⋈ stands for any type of inequality, i.e., ⋈ ∈ {≥, >, ≤, <}. We use ▷ to denote ≥ or >, and ◁ to denote ≤ or <. We call ABox assertions defined by ▷ positive assertions, and those defined by ◁ negative assertions. Note that we consider only positive fuzzy role assertions, since negative role assertions would imply the existence of role negation, which would lead to undecidability (Mailis, Stoilos, & Stamou, 2007).
An f-SHIF(D) knowledge base K is a triple ⟨T, R, A⟩ with T a TBox, R an RBox and A an ABox. For a fuzzy concept D, we denote by sub(D) the set that contains D and is closed under sub-concepts of D, and define sub(K) as the set of all the sub-concepts of the concepts occurring in K. We abuse the notation sub(D) to also denote the set of all the data type predicates occurring in a knowledge base.
The semantics of f-SHIF(D) is provided by a fuzzy interpretation, which is a pair I = (Δ^I, ·^I). Here Δ^I is a non-empty set of objects, called the domain of interpretation, disjoint from Δ_D, and ·^I is an interpretation function that coincides with ·^D on every data value and fuzzy data type predicate, and maps (i) different individual names into different elements of Δ^I, (ii) a concept name A into a membership function A^I : Δ^I → [0,1], (iii) an abstract role name R into a membership function R^I : Δ^I × Δ^I → [0,1], and (iv) a data type role T into a membership function T^I : Δ^I × Δ_D → [0,1]. The semantics of f-SHIF(D) concepts and roles is given as follows:

⊤^I(o) = 1
⊥^I(o) = 0
(¬C)^I(o) = 1 − C^I(o)
(C ⊓ D)^I(o) = min{C^I(o), D^I(o)}
(C ⊔ D)^I(o) = max{C^I(o), D^I(o)}
(∃R.C)^I(o) = sup_{o′ ∈ Δ^I} min{R^I(o, o′), C^I(o′)}
(∀R.C)^I(o) = inf_{o′ ∈ Δ^I} max{1 − R^I(o, o′), C^I(o′)}
(≥2S)^I(o) = sup_{o1, o2 ∈ Δ^I, o1 ≠ o2} min{S^I(o, o1), S^I(o, o2)}
(≤1S)^I(o) = inf_{o1, o2 ∈ Δ^I, o1 ≠ o2} max{1 − S^I(o, o1), 1 − S^I(o, o2)}
(∃T.d)^I(o) = sup_{v ∈ Δ_D} min{T^I(o, v), d^D(v)}
(R⁻)^I(o, o′) = R^I(o′, o)

A fuzzy interpretation I satisfies a fuzzy concept specialization A ⊑ C if A^I(o) ≤ C^I(o) for every o ∈ Δ^I, written I ⊨ A ⊑ C. Similarly, I ⊨ A ≡ C if A^I(o) = C^I(o) for every o ∈ Δ^I, and I ⊨ C ⊑ D if C^I(o) ≤ D^I(o) for every o ∈ Δ^I. For ABox assertions, I ⊨ ⟨C(o) ⋈ n⟩ (resp. I ⊨ ⟨R(o, o′) ⋈ n⟩) iff C^I(o^I) ⋈ n (resp. R^I(o^I, o′^I) ⋈ n), and I ⊨ o ≠ o′ iff o^I ≠ o′^I. If an interpretation satisfies all the axioms and assertions in a KB K, we call it a model of K. A KB K is satisfiable iff it has at least one model. A KB K entails (logically implies) a fuzzy assertion φ iff all the models of K are also models of φ, written K ⊨ φ.
Given a KB K, we can w.l.o.g. assume that
1. all concepts are in their negation normal forms (NNFs), i.e., negation occurs only in front of concept names. Through the de Morgan laws, the duality between existential restrictions (∃R.C) and universal restrictions (∀R.C), and the duality between functional restrictions (≤1S) and their negations (≥2S), each concept can be transformed into its equivalent NNF by pushing negation inwards.
2. all fuzzy concept assertions are in their positive inequality normal forms (PINFs). A negative concept assertion can be transformed into its equivalent PINF by applying the fuzzy complement operation on it. For example, ⟨C(o) < n⟩ is converted to ⟨¬C(o) > 1 − n⟩.
3. all fuzzy assertions are in their normalized forms (NFs). By introducing a positive, infinitely small value ε, a fuzzy assertion of the form ⟨C(o) > n⟩ can be normalized to ⟨C(o) ≥ n + ε⟩. The model equivalence of a KB and its normalized form was shown to justify this assumption (Stoilos, Straccia, Stamou, & Pan, 2006).
4. there are only fuzzy GCIs in the TBox. A fuzzy concept specialization A ⊑ C can be replaced by a fuzzy concept definition A ≡ A* ⊓ C (Stoilos et al., 2007), where A* is a new concept name which stands for the qualities that distinguish the elements of A from the other elements of C. A fuzzy concept definition axiom A ≡ C can be eliminated by replacing every occurrence of A with C. The elimination is also known as knowledge base expansion. Note that the size of the expansion can be exponential in the size of the TBox, but if we follow the principle that expansion is done on demand (Baader & Nutt, 2003), the expansion has no impact on the complexity of the algorithm for deciding fuzzy query entailment.
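To make the min/max semantics above concrete, here is a small worked instance (our own illustration; the membership degrees are arbitrary). Suppose C^I(o) = 0.8 and D^I(o) = 0.5 for some o ∈ Δ^I. Then

\[ (\neg C)^{I}(o) = 1 - 0.8 = 0.2, \qquad (C \sqcap D)^{I}(o) = \min\{0.8, 0.5\} = 0.5, \]
\[ (C \sqcup D)^{I}(o) = \max\{0.8, 0.5\} = 0.8, \qquad (C \sqcap \neg C)^{I}(o) = \min\{0.8, 0.2\} = 0.2 . \]

The last value illustrates that, in contrast to the crisp case, C ⊓ ¬C need not have degree 0 under the min/max semantics used here.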
Example 1. As a running example, we use the f-SHIF(D) KB K = ⟨T, R, A⟩ with T = {C ⊑ ∃R.C, ⊤ ⊑ ∃T.d}, R = ∅, and A = {⟨C(o) ≥ 0.8⟩}.
A term t is either an individual name from I or I_c, or a variable name from V. A fuzzy query atom is an expression of the form ⟨C(t) ≥ n⟩, ⟨R(t, t′) ≥ n⟩, or ⟨T(t, t′) ≥ n⟩, with C a concept, R a simple abstract role, T a data type role, and t, t′ terms. As with fuzzy assertions, we refer to these three different types of atoms as fuzzy concept atoms, fuzzy abstract role atoms, and fuzzy data type role atoms, respectively. The fuzzy abstract role atoms and the fuzzy data type role atoms are collectively referred to as fuzzy role atoms.
Definition 1. (Fuzzy Boolean Conjunctive Queries) A fuzzy boolean conjunctive query q is a non-empty set of fuzzy query atoms of the form q = {⟨at1 ≥ n1⟩, …, ⟨atk ≥ nk⟩}. For every fuzzy query atom in q, we write ⟨ati ≥ ni⟩ ∈ q.
We use Vars(q) to denote the set of variables occurring in q, AInds(q) and CInds(q) to denote the sets of abstract and concrete individual names occurring in q, Inds(q) to denote the union of AInds(q) and CInds(q), and Terms(q) for the set of terms in q, i.e., Terms(q) = Vars(q) ∪ Inds(q). The semantics of a fuzzy query is given in the same way as for the related fuzzy DL, by means of a fuzzy interpretation consisting of an interpretation domain and a fuzzy interpretation function.
Definition 2. (Models of Fuzzy Queries) Let I = (Δ^I, ·^I) be a fuzzy interpretation of an f-SHIF(D) KB, q a fuzzy boolean conjunctive query, and t, t′ terms in q. We say I is a model of q if there exists a mapping π : Terms(q) → Δ^I ∪ Δ_D such that π(a) = a^I for each a ∈ Inds(q), C^I(π(t)) ≥ n for each fuzzy concept atom ⟨C(t) ≥ n⟩ ∈ q, and R^I(π(t), π(t′)) ≥ n (resp. T^I(π(t), π(t′)) ≥ n) for each fuzzy role atom ⟨R(t, t′) ≥ n⟩ (resp. ⟨T(t, t′) ≥ n⟩) ∈ q. If I ⊨π at for every atom at ∈ q, we write I ⊨π q. If there is a π such that I ⊨π q, we say I satisfies q, written I ⊨ q, and we call such a π a match of q in I. If I ⊨ q for each model I of a KB K, then we say K entails q, written K ⊨ q. The query entailment problem is defined as follows: given a knowledge base K and a query q, decide whether K ⊨ q.
Example 2. Consider the following fuzzy boolean CQ: q = {⟨R(x, y) ≥ 0.6⟩, ⟨R(y, z) ≥ 0.8⟩, ⟨T(y, yc) ≥ 1⟩, ⟨C(y) ≥ 0.6⟩}. We observe that K ⊨ q. Given the GCI C ⊑ ∃R.C, we have that, for each model I of K, (∃R.C)^I(o^I) ≥ C^I(o^I) ≥ 0.8 > 0.6 holds. By the definition of fuzzy interpretation, there exists some element b in Δ^I such that R^I(o^I, b) ≥ 0.8 > 0.6 and C^I(b) ≥ 0.8 > 0.6 hold. Similarly, there is some element c in Δ^I such that R^I(b, c) ≥ 0.8 and C^I(c) ≥ 0.8 hold. Since ⊤ ⊑ ∃T.d, there is some
element v in Δ_D such that T^I(b, v) ≥ 1 and d^D(v) ≥ 1 hold. By constructing a mapping π with π(x) = o^I, π(y) = b, π(z) = c, and π(yc) = v, we have I ⊨π q.
RELATED WORK
The first conjunctive query algorithm over DLs (Calvanese, De Giacomo, & Lenzerini, 1998) was actually specified for the purpose of deciding conjunctive query containment for DLR_reg. Recently, query entailment and answering have been extensively studied for tractable DLs, i.e., DLs that have reasoning problems of at most polynomial complexity. For example, the constructors provided by the DL-Lite family (Calvanese, De Giacomo, Lembo, Lenzerini, & Rosati, 2007) are elaborately chosen such that the basic reasoning tasks are PTime-complete and query entailment is in LogSpace with respect to data complexity. Moreover, in the DL-Lite family, as TBox reasoning can usually be done independently of the ABox, ABox storage can be transformed into database storage, so that knowledge base users can achieve efficient queries by means of well-established DBMS query engines. Another tractable DL is EL, with PTime-complete reasoning complexity. It was shown that entailment of unions of conjunctive queries (UCQs) in EL and in its extensions with role hierarchies is NP-complete with regard to combined complexity (Rosati, 2007b). The data complexity of UCQ entailment in EL is PTime-complete (Rosati, 2007a). Additionally allowing role composition in the logic, as in EL++, leads to undecidability (Krötzsch, Rudolph, & Hitzler, 2007). Query answering algorithms for expressive DLs are being pursued with equal intensity. The CARIN system (Levy & Rousset, 1998), the first framework for combining a description logic knowledge base with rules, provided a decision procedure for conjunctive query entailment in the description logic ALCNR, where R stands for role conjunction. Decision procedures for more expressive DLs, i.e., the whole SH family, were presented in (Ortiz, Calvanese, & Eiter, 2006; Ortiz, Calvanese, & Eiter, 2008), where coNP-complete data complexity was proved for a whole range of sublogics of SHOIQ, as long as only simple roles occur in the query. Algorithms for answering CQs with transitive roles over SHIQ (Glimm, Lutz, Horrocks, & Sattler, 2008) and SHOQ (Glimm, Horrocks, & Sattler, 2007) KBs have been provided, and a coNP upper bound was also established. Following current research developments in crisp DLs, there have also been efforts towards answering CQs over fuzzy DL knowledge bases. In particular, a fuzzy extension of DL-Lite was proposed in (Straccia, 2006), along with an algorithm for answering conjunctive queries over fuzzy DL-Lite KBs; since the query language for fuzzy DL-Lite has the same syntax as that of crisp DLs, a technique for efficiently computing the top-k answers of a conjunctive query was also shown. In (Pan, Stamou, Stoilos, Taylor, & Thomas, 2008), a general framework of fuzzy query languages was proposed, covering all the existing query languages for fuzzy ontologies as well as some new ones that can be customized by users. The algorithms for these queries were implemented in the ONTOSEARCH2 system, and evaluation showed that such queries can still be answered in a very efficient way. Clearly, threshold queries give users more flexibility in that users can specify different thresholds for different atoms. Mailis et al. (2007) proposed a fuzzy extension of the CARIN system, called fuzzy CARIN, which provides the ability to answer unions of conjunctive queries. However, there is still no report on query answering over fuzzy DLs with data types. We tackle this issue in the next section.
Completion Forests
Definition 3. (Completion Forest) A completion tree T for an f-SHIF(D) KB is a tree all of whose nodes are generated by the expansion rules, except for the root node, which might correspond to an abstract individual name in I. A completion forest F for an f-SHIF(D) KB consists of a set of completion trees whose root nodes correspond to abstract individual names occurring in the ABox, together with an equivalence relation ≈ and an inequivalence relation ≉ among the nodes. The nodes in a completion forest F, denoted Nodes(F), can be divided into abstract nodes (denoted ANodes(F)) and concrete nodes (or data type nodes, denoted CNodes(F)). Each abstract node o in a completion forest is labeled with a set ℒ(o) = {⟨C, ≥, n⟩}, where C ∈ sub(K) and n ∈ (0,1]. The concrete nodes v can only serve as leaf nodes, each of which is labeled with a set ℒ(v) = {⟨d, ≥, n⟩}, where d ∈ sub(D) and n ∈ (0,1]. Similarly, the edges in a completion forest can be divided into abstract edges and concrete edges. Each abstract edge ⟨o, o′⟩ is labeled with a set ℒ(⟨o, o′⟩) = {⟨R, ≥, n⟩}, where R ∈ R_K. Each concrete edge ⟨o, v⟩ is labeled with a set ℒ(⟨o, v⟩) = {⟨T, ≥, n⟩}, where T ∈ R_c. If ⟨o, o′⟩ is an edge in a completion forest with ⟨R, ≥, n⟩ ∈ ℒ(⟨o, o′⟩), then o′ is called an ⟨R, ≥, n⟩-successor of o and o is called an ⟨R, ≥, n⟩-predecessor of o′. Ignoring the inequality and membership degree, we also call o′ an R-successor of o and o an R-predecessor of o′. Ancestor and descendant are the transitive closures of predecessor and successor, respectively. The union of the successor and the predecessor relation is the neighbor relation. The distance between two nodes o, o′ in a completion forest is the length of the shortest path between them.
Starting with an f-SHIF(D) KB K = ⟨T, R, A⟩, the completion forest F_K is initialized such that it contains (i) a root node o, with ℒ(o) = {⟨C, ≥, n⟩ | ⟨C(o) ≥ n⟩ ∈ A}, for each abstract individual name o occurring in A, (ii) a leaf node v, with ℒ(v) = {⟨d, ≥, n⟩ | ⟨d(v) ≥ n⟩ ∈ A}, for each concrete individual name v occurring in A, (iii) an abstract edge ⟨o, o′⟩ with ℒ(⟨o, o′⟩) = {⟨R, ≥, n⟩ | ⟨R(o, o′) ≥ n⟩ ∈ A}, for each pair ⟨o, o′⟩ of individual names for which the set {R | ⟨R(o, o′) ≥ n⟩ ∈ A} is non-empty, and (iv) a concrete edge ⟨o, v⟩ with ℒ(⟨o, v⟩) = {⟨T, ≥, n⟩ | ⟨T(o, v) ≥ n⟩ ∈ A}. We initialize the relation ≉ as {⟨o, o′⟩ | o ≠ o′ ∈ A}, and the relation ≈ to be empty.
Example 3. In our running example, F_K contains only one node o, labelled with ℒ(o) = {⟨C, ≥, 0.8⟩}.
Now we can formally define a new blocking condition, called k-blocking, for fuzzy query entailment, depending on a depth parameter k > 0.
Definition 4. (k-tree equivalence) The k-tree of a node v in T, denoted T_v^k, is the subtree of T rooted at v with all the descendants of v within distance k. We use Nodes(T_v^k) to denote the set of nodes in T_v^k. Two nodes v and w in T are said to be k-tree equivalent in T if T_v^k and T_w^k are isomorphic, i.e., there exists a bijection ψ : Nodes(T_v^k) → Nodes(T_w^k) such that (i) ψ(v) = w, (ii) for every node o ∈ Nodes(T_v^k), ℒ(o) = ℒ(ψ(o)), and (iii) for every edge connecting two nodes o and o′ in T_v^k, ℒ(⟨o, o′⟩) = ℒ(⟨ψ(o), ψ(o′)⟩).
Definition 5. (k-witness) A node w is a k-witness of a node v if v and w are k-tree equivalent in T, w is an ancestor of v in T, and v is not in T_w^k. Furthermore, we say that T_w^k tree-blocks T_v^k and that each node o in T_v^k is tree-blocked by the corresponding node ψ(o) in T_w^k.
Definition 6. (k-blocking) A node o is k-blocked in a completion forest iff it is not a root node and it is either directly or indirectly k-blocked. A node o is directly k-blocked iff none of its ancestors is k-blocked and o is a leaf of a tree-blocked k-tree. A node o is indirectly k-blocked iff one of its ancestors is k-blocked, or it is a successor of a node o′ with ℒ(⟨o′, o⟩) = ∅.
An initial completion forest is expanded according to a set of expansion rules that reflect the constructors of f-SHIF(D). The expansion rules, which syntactically decompose the concepts in node labels, either infer new constraints for a given node or extend the tree according to these constraints (see Table 1). Termination is guaranteed by k-blocking. We denote by F_K the set of all completion forests obtained this way.
For a node o, ℒ(o) is said to contain a clash if it contains one of the following: (i) a pair of triples ⟨C, ≥, n⟩ and ⟨¬C, ≥, m⟩ with n + m > 1; (ii) one of the triples ⟨⊥, ≥, n⟩ with n > 0, or ⟨C, ≥, n⟩ with n > 1; (iii) some triple ⟨≤1S, ≥, n⟩, such that o has two ⟨S, ≥, n′⟩-neighbors o1, o2 with n + n′ > 1 and o1 ≉ o2.
Definition 7. (k-complete and clash-free completion forest) A completion forest is called k-complete and clash-free if under k-blocking no rule can be applied to it and none of its nodes and edges contains a clash. We denote by ccf_k(K) the set of k-complete and clash-free completion forests in F_K.
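As a concrete instance of clash condition (i) (our own numeric illustration), the triples ⟨C, ≥, 0.7⟩ and ⟨¬C, ≥, 0.4⟩ together form a clash, since

\[ 0.7 + 0.4 = 1.1 > 1 , \]

i.e., no fuzzy interpretation can satisfy both C^I(o) ≥ 0.7 and 1 − C^I(o) ≥ 0.4, whereas ⟨C, ≥, 0.7⟩ together with ⟨¬C, ≥, 0.3⟩ is jointly satisfiable by any interpretation with C^I(o) = 0.7.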
Two representative expansion rules from Table 1, which handle functional restrictions, are the ≤-rule and the ≤r-rule:
≤-rule: if 1. ⟨≤1S, ≥, n⟩ ∈ ℒ(x) and x is not indirectly k-blocked, 2. #{xi | ⟨S, ≥, 1 − n + ε⟩ ∈ ℒ(⟨x, xi⟩)} > 1, 3. there exist xl and xk with no xl ≉ xk, and 4. xl is neither a root node nor an ancestor of xk,
then (i) ℒ(xk) → ℒ(xk) ∪ ℒ(xl), (ii) ℒ(⟨x, xk⟩) → ℒ(⟨x, xk⟩) ∪ ℒ(⟨x, xl⟩), (iii) ℒ(⟨x, xl⟩) → ∅ and ℒ(xl) → ∅, and (iv) set xi ≉ xk for all xi with xi ≉ xl.
≤r-rule: if 1. ⟨≤1S, ≥, n⟩ ∈ ℒ(x), 2. #{xi | ⟨S, ≥, 1 − n + ε⟩ ∈ ℒ(⟨x, xi⟩)} > 1, and 3. there exist xl and xk, both root nodes, with no xl ≉ xk,
then 1. ℒ(xk) → ℒ(xk) ∪ ℒ(xl); 2. for all edges ⟨xl, x′⟩: (i) if the edge ⟨xk, x′⟩ does not exist, create it with ℒ(⟨xk, x′⟩) = ∅, and (ii) ℒ(⟨xk, x′⟩) → ℒ(⟨xk, x′⟩) ∪ ℒ(⟨xl, x′⟩); 3. for all edges ⟨x′, xl⟩: (i) if the edge ⟨x′, xk⟩ does not exist, create it with ℒ(⟨x′, xk⟩) = ∅, and (ii) ℒ(⟨x′, xk⟩) → ℒ(⟨x′, xk⟩) ∪ ℒ(⟨x′, xl⟩); 4. set ℒ(xl) = ∅ and remove all edges to/from xl; 5. set x′ ≉ xk for all x′ with x′ ≉ xl, and set xl ≈ xk.
Example 4. Figure 1 shows a 2-complete and clash-free completion forest F1 for K, i.e., F1 ∈ ccf2(K), where each abstract node oi carries the label {⟨C, ≥, 0.8⟩, ⟨∃R.C, ≥, 0.8⟩}, each concrete node the label {⟨d, ≥, 1⟩}, each abstract edge the label {⟨R, ≥, 0.8⟩}, and each concrete edge the label {⟨T, ≥, 1⟩}. In F1, o1 and o4 are 2-tree equivalent, and o1 is a 2-witness of o4. T_{o1}^2 tree-blocks T_{o4}^2, and o1 tree-blocks o5. The node o6 in T_{o4}^2 is directly blocked by o3 in T_{o1}^2, as indicated by the dashed line.
Lemma 1. For each fuzzy interpretation I, I ⊨ K iff I ⊨ F_K.
Proof. The only if direction follows from Definition 8. For the if direction, we need to show that, for all nodes v, w in F_K, Properties (i)–(v) in Definition 8 hold. By Definition 3, each node in F_K corresponds to an individual name in K. For each abstract individual o in I, the label of node o in F_K is ℒ(o) = {⟨C, ≥, n⟩ | ⟨C(o) ≥ n⟩ ∈ A}. Since I ⊨ K, we have C^I(o^I) ≥ n, and Property (i) thus holds. Properties (ii)–(v) can be proved in a similar way to (i). We then show that, each time an expansion rule is applied, all models are preserved in some resulting forest.
Lemma 2. Let F be a completion forest in F_K, r a rule in Table 1, and F1, …, Fm the completion forests obtained from F by applying r. Then for each model I of F, there exist some Fj and an extension I′ of I such that I′ ⊨ Fj.
Proof. ∃-rule: Since ⟨∃R.C, ≥, n⟩ ∈ ℒ(x) and I ⊨ F, there exists some o ∈ Δ^I such that R^I(x^I, o) ≥ n and C^I(o) ≥ n hold. In the completion forest F′ obtained from F by applying the ∃-rule,
a new node y is generated such that ⟨R, ≥, n⟩ ∈ ℒ(⟨x, y⟩) and ⟨C, ≥, n⟩ ∈ ℒ(y). By setting y^{I′} = o, we obtain an extension I′ of I, and thus I′ ⊨ F′. The case of the ∀-rule is analogous to the ∃-rule, and the proofs for the other rules proceed in a similar way.
Since the set of k-complete and clash-free completion forests for K semantically captures K (modulo new individuals), we can transfer query entailment K ⊨ q to logical consequence of q from completion forests as follows. For any completion forest F and any CQ q, let F ⊨ q denote that I ⊨ q for every model I of F.
Theorem 1. Let k ≥ 0 be arbitrary. Then K ⊨ q iff F ⊨ q for each F ∈ ccf_k(K).
Proof. By Lemma 1 and Lemma 2, for each model I of K, there exist some F ∈ F_K and an extension I′ of I such that I′ ⊨ F. Assume F ∉ ccf_k(K); then there are still rules applicable to F. We thus obtain an expansion F′ of F and an extension I″ of I′ such that I″ ⊨ F′, and so forth, until no rule is applicable. Now we either obtain a complete and clash-free completion forest, or encounter a clash. The former contradicts the assumption, and the latter contradicts the fact that K has models.
Definition 9. (Query mapping) A fuzzy query q can be mapped into F, denoted q ↪ F, if there is a mapping μ : Terms(q) → Nodes(F) such that (i) μ(a) = a if a ∈ Inds(q); (ii) for each fuzzy concept atom ⟨C(x) ≥ n⟩ (resp. ⟨d(x) ≥ n⟩) in q, ⟨C, ≥, m⟩ ∈ ℒ(μ(x)) (resp. ⟨d, ≥, m⟩ ∈ ℒ(μ(x))) with m ≥ n; (iii) for each fuzzy role atom ⟨R(x, y) ≥ n⟩ (resp. ⟨T(x, y) ≥ n⟩) in q, μ(y) is an ⟨R, ≥, m⟩-neighbor (resp. a ⟨T, ≥, m⟩-neighbor) of μ(x) with m ≥ n.
Example 5. By setting μ(x) = o1, μ(y) = o2, μ(z) = o3, and μ(yc) = v1, we can construct a mapping μ of q into F1.
Lemma 3. Let F ∈ ccf_k(K) and q a fuzzy conjunctive query. If q ↪ F, then F ⊨ q.
Proof. If q ↪ F, then there is a mapping μ : Terms(q) → Nodes(F) satisfying Definition 9. Any model I = (Δ^I, ·^I) of F satisfies Definition 8. We construct a mapping π : Terms(q) → Δ^I ∪ Δ_D such that, for each term x ∈ Terms(q), π(x) = (μ(x))^I. It satisfies C^I(π(x)) = C^I((μ(x))^I) ≥ m ≥ n for each fuzzy concept atom ⟨C(x) ≥ n⟩ ∈ q. The proof for fuzzy role atoms can be shown in a similar way. Hence I ⊨ q for every model I of F, i.e., F ⊨ q.
Lemma 3 shows the soundness of our algorithm. We prove, in the next subsection, that the converse (the completeness) also holds: provided the completion forest has been sufficiently expanded, a mapping from q to F can be constructed from a single canonical model I_F. The domain of I_F consists of a set of (maybe infinite) paths. The reason is that a KB may have infinite models, whereas the canonical model I_F is constructed from F, which is a finite representation produced by the termination of the algorithm. This requires some nodes in F to represent several elements of Δ^{I_F}; paths are chosen to distinguish the different elements represented in F by the same node. The definition of fuzzy tableau below is based on the one in (Stoilos et al., 2006, 2007).
Definition 10. (Fuzzy tableau) Let K = ⟨T, R, A⟩ be an f-SHIF(D) KB, R_K and R_D the sets of abstract and concrete roles occurring in K, and I_K the set of individual names occurring in K. T = ⟨S, H, Ea, Ec, V⟩ is a fuzzy tableau of K if (i) S is a non-empty set; (ii) H : S × sub(K) → [0,1] maps each element of S and each concept in sub(K) to the membership degree with which the element belongs to the concept; (iii) Ea : S × S × R_K → [0,1] maps each pair of elements of S and each role in R_K to the membership degree with which the pair belongs to the role; (iv) Ec : S × Δ_D × R_D → [0,1] maps each pair of an element and a concrete value and each concrete role in R_D to the membership degree with which the pair belongs to the role; and (v) V : I_K → S maps each individual in I_K to an element of S. Additionally, for each s, t ∈ S, C, D ∈ sub(K), R ∈ R_K, and n ∈ [0,1], T satisfies:
1. for each s ∈ S, H(s, ⊥) = 0 and H(s, ⊤) = 1;
2. if H(s, C ⊓ D) ≥ n, then H(s, C) ≥ n and H(s, D) ≥ n;
3. if H(s, C ⊔ D) ≥ n, then H(s, C) ≥ n or H(s, D) ≥ n;
4. if H(s, ∀R.C) ≥ n, then for all t ∈ S, Ea(s, t, R) ≤ 1 − n or H(t, C) ≥ n;
5. if H(s, ∃R.C) ≥ n, then there exists t ∈ S such that Ea(s, t, R) ≥ n and H(t, C) ≥ n;
6. if H(s, ∀R.C) ≥ n and Trans(R), then Ea(s, t, R) ≤ 1 − n or H(t, ∀R.C) ≥ n;
7. if H(s, ≥2S) ≥ n, then #{t ∈ S | Ea(s, t, S) ≥ n} ≥ 2;
8. if H(s, ∀S.C) ≥ n, Trans(R), and R ⊑* S, then Ea(s, t, R) ≤ 1 − n or H(t, ∀R.C) ≥ n;
9. if H(s, ≤1S) ≥ n, then #{t ∈ S | Ea(s, t, S) ≥ 1 − n + ε} ≤ 1;
10. if Ea(s, t, R) ≥ n and R ⊑* S, then Ea(s, t, S) ≥ n;
11. Ea(s, t, R) ≥ n iff Ea(t, s, Inv(R)) ≥ n;
12. if C ⊑ D ∈ T, then for all s ∈ S and n ∈ N^K, H(s, C) ≤ 1 − n + ε or H(s, D) ≥ n;
13. if ⟨C(o) ≥ n⟩ ∈ A, then H(V(o), C) ≥ n;
14. if ⟨R(o, o′) ≥ n⟩ ∈ A, then Ea(V(o), V(o′), R) ≥ n;
15. if o ≠ o′ ∈ A, then V(o) ≠ V(o′);
16. if ⟨T(o, v) ≥ n⟩ ∈ A, then Ec(V(o), v, T) ≥ n;
17. if H(s, ∃T.d) ≥ n, then there exists t ∈ Δ_D such that Ec(s, t, T) ≥ n and d^D(t) ≥ n.
The process of inducing a fuzzy tableau from a completion forest is as follows. Each element of S corresponds to a path in F. We can view a blocked node as a loop, so as to define infinite paths. To be more precise, a path p = [v0/v0′, …, vn/vn′] is a sequence of node pairs in F. We define Tail(p) = vn and Tail′(p) = vn′. We denote by [p | v_{n+1}/v_{n+1}′] the path [v0/v0′, …, vn/vn′, v_{n+1}/v_{n+1}′], and use [p | v_{n+1}/v_{n+1}′, v_{n+2}/v_{n+2}′] as an abbreviation of [[p | v_{n+1}/v_{n+1}′] | v_{n+2}/v_{n+2}′]. The set Paths(F) of paths in F is inductively defined as follows: if v is a root node in F, then [v/v] ∈ Paths(F); if p ∈ Paths(F) and w ∈ Nodes(F) is an R-successor of Tail(p) that is not k-blocked, then [p | w/w] ∈ Paths(F); if there exists w′ ∈ Nodes(F) that is an R-successor of Tail(p) and is directly k-blocked by w, then [p | w/w′] ∈ Paths(F).
Definition 11. (Induced fuzzy tableau) The fuzzy tableau T_F = ⟨S, H, Ea, Ec, V⟩ induced by F is defined as follows:
S = Paths(F);
H(p, C) = sup{ni | ⟨C, ≥, ni⟩ ∈ ℒ(Tail(p))};
Ea(p, [p | w/w′], R) = sup{ni | ⟨R, ≥, ni⟩ ∈ ℒ(⟨Tail(p), w′⟩)};
Ea([v/v], [w/w], R) = sup{ni | ⟨R*, ≥, ni⟩ ∈ ℒ(⟨v, w⟩)}, where v, w are root nodes, v is an R*-neighbour of w, and R* denotes R or Inv(R);
Ec(p, [p | vc/vc], T) = n with ⟨T, ≥, n⟩ ∈ ℒ(⟨Tail(p), vc⟩), where vc is a concrete node;
V(ai) = [ai/ai] if ai is a root node and ℒ(ai) ≠ ∅, and V(ai) = [aj/aj] if ai is a root node and ℒ(ai) = ∅, with ai ≈ aj and ℒ(aj) ≠ ∅.
From the fuzzy tableau of a fuzzy KB K, we can obtain the canonical model of K.
Definition 12. (Canonical model) Let T = ⟨S, H, Ea, Ec, V⟩ be a fuzzy tableau of K. The canonical model of T, I_T = (Δ^{I_T}, ·^{I_T}), is defined as follows:
Δ^{I_T} = S;
o^{I_T} = V(o), for each individual name o ∈ I_K;
A^{I_T}(s) = H(s, A), for each concept name A and each s ∈ S;
for each ⟨s, t⟩ ∈ S × S,
R^{I_T}(s, t) = R_E⁺(s, t) if Trans(R), and R^{I_T}(s, t) = max(R_E(s, t), sup_{S′ ⊑* R, S′ ≠ R} S′^{I_T}(s, t)) otherwise,
where R_E(s, t) = Ea(s, t, R) and R_E⁺(s, t) is the sup-min transitive closure of R_E(s, t);
for each ⟨s, t⟩ ∈ S × Δ_D, T^{I_T}(s, t) = Ec(s, t, T).
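For reference, the sup-min transitive closure used above is the standard one from fuzzy relation theory (we spell it out here):

\[ R_{E}^{+}(s,t) \;=\; \sup_{n \geq 1} \; \sup_{s = s_0, s_1, \ldots, s_n = t} \; \min_{0 \leq i < n} R_{E}(s_i, s_{i+1}) , \]

i.e., the best degree achievable along any finite R_E-chain from s to t.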
Lemma 4. Let T be a fuzzy tableau of an f-SHIF(D) KB K = ⟨T, R, A⟩. Then the canonical model I_T of T is a model of K.
Proof. Property 12 in Definition 10 ensures that I_T is a model of T; for a detailed proof, see Proposition 3 in (Stoilos et al., 2006). Properties 1–11 and 13–17 in Definition 10 ensure that I_T is a model of R and A; for a detailed proof, see Lemmas 5.2 and 6.5 in (Stoilos et al., 2007).
Example 6. By unraveling F1 in Figure 1, we obtain a model I_F which has as domain the infinite set of paths from o to each oi. Note that a path actually comprises a sequence of pairs of nodes, in order to witness the loops introduced by the blocked variables. When a node is not blocked, like o1, the pair o1/o1 is added to the path. Since T_{o1}^2 tree-blocks T_{o4}^2, each time a path reaches o6, which is a leaf node of a blocked tree, we add o3/o6 to the path and loop back to the successors of o3. This set of paths constitutes the domain Δ^{I_F}. For each concept name A, we have A^{I_F}(pi) ≥ n if ⟨A, ≥, n⟩ occurs in the label of the last node in pi. For each role R, R^{I_F}(pi, pj) ≥ n if pj is an ⟨R, ≥, n⟩-successor of pi. If R ∈ Trans, the extension of R is expanded according to the sup-min transitive semantics. Therefore, C^{I_F}(pi) ≥ 0.8 for i ≥ 0, and R^{I_F}(pi, pj) ≥ 0.8 for 0 ≤ i < j.
From a complete and clash-free completion forest F, we can obtain a fuzzy tableau T_F through Definition 11.
Lemma 5. Let F ∈ ccf_k(K) with k ≥ 1. Then the canonical model I_F induced by F is a model of K.
Proof. It follows from Lemmas 5.9 and 6.10 in (Stoilos et al., 2007) and Proposition 5 in (Stoilos et al., 2006) that the tableau induced in Definition 11 satisfies Properties 1–15 in Definition 10. By Lemma 4, the canonical model I_F constructed from T_F is a model of K.
Now we illustrate how to construct a mapping of q into F from a mapping of q into I_F.
Definition 13. (Mapping graph) Let F ∈ ccf_k(K) with k ≥ 0, and let q be a fuzzy query such that I_F ⊨π q, where π is a mapping as in Definition 2. The mapping graph Gπ = ⟨V, E⟩ is defined as:
V(Gπ) = {π(x) ∈ Δ^{I_F} ∪ Δ_D | x ∈ Terms(q)},
E(Gπ) = {⟨π(x), π(y)⟩ | ⟨R(x, y) ≥ n⟩ ∈ q} ∪ {⟨π(x), π(y)⟩ ∈ Δ^{I_F} × Δ_D | ⟨T(x, y) ≥ n⟩ ∈ q}.
V(Gπ) is divided into Vr(Gπ) and Vn(Gπ), i.e., V(Gπ) = Vr(Gπ) ∪ Vn(Gπ) and Vr(Gπ) ∩ Vn(Gπ) = ∅, where Vr(Gπ) = {[v/v] | v is a root node in F}.
Definition 14. (Maximal q-distance) For any x, y ∈ Terms(q), if π(x), π(y) ∈ Vn(Gπ), we use dπ(x, y) to denote the length of the shortest path between π(x) and π(y) in Gπ. If π(x) and π(y) are in two different connected components, then dπ(x, y) = 1. We define the maximal q-distance as d_q = max_{x,y ∈ Terms(q)} {dπ(x, y)}.
Example 7. Consider a mapping π such that π(x) = p6, π(y) = p7, π(z) = p8, and π(yc) = v5. The mapping graph Gπ contains the nodes p6, p7, p8 and v5, where Vr(Gπ) = ∅, Vn(Gπ) = {p6, p7, p8, v5}, and E(Gπ) = {⟨p6, p7⟩, ⟨p7, p8⟩, ⟨p7, v5⟩}. Moreover, dπ(p6, p7) = 1, dπ(p7, p8) = 1, dπ(p6, p8) = 2, dπ(p6, v5) = 2, and dπ(v5, p8) = 2; thus d_q = 2.
We use n_q to denote the number of fuzzy role atoms in a fuzzy query q. We only consider fuzzy role atoms ⟨R(x, y) ≥ n⟩ with R a simple role, so dπ(x, y) = 1 and d_q ≤ n_q. We show that, provided the k in the k-blocking condition is greater than or equal to n_q, it suffices to find a mapping from q to F.
Lemma 6. Let F ∈ ccf_k(K) with k ≥ n_q, and let I_F be the canonical model of F. If I_F ⊨ q, then q ↪ F.
Proof. Since I_F ⊨ q, there exists a mapping π : Terms(q) → Δ^{I_F} ∪ Δ_D such that π(a) = a^{I_F} for each a ∈ Inds(q), C^{I_F}(π(t)) ≥ n for each fuzzy concept atom ⟨C(t) ≥ n⟩ ∈ q, and R^{I_F}(π(t), π(t′)) ≥ n (resp. T^{I_F}(π(t), π(t′)) ≥ n) for each fuzzy role atom ⟨R(t, t′) ≥ n⟩ (resp. ⟨T(t, t′) ≥ n⟩) ∈ q. We construct the mapping μ : Terms(q) → Nodes(F) from the mapping π. First, we consider Gπ′, a subgraph of Gπ, which is obtained by eliminating the vertices of the form [a/a] with a an individual name, together with the arcs that enter or leave these vertices. Gπ′ consists of a set of connected components, written G1, …, Gm. We define Blocked(Gi) as the set of all vertices p such that Tail(p) ≠ Tail′(p); then, for every ancestor p′ of p in Gi, Tail(p′) = Tail′(p′). We use AfterBlocked(Gi) to denote the set of the descendants of the vertices in Blocked(Gi). Recalling the definition of Paths(F), since F is k-blocked, if there are two node pairs v/v′ and w/w′ in a path p with v ≠ v′ and w ≠ w′, then the distance between these two node pairs must be greater than k. A path p always begins with v/v; if it contains a node pair w/w′ with w ≠ w′, then the distance between v/v and w/w′ must be greater than k. We prove two properties of Gi as follows.
(1) If π(x) ∉ AfterBlocked(Gi), then Tail(π(x)) = Tail′(π(x)). If π(x) ∈ AfterBlocked(Gi), then there exists some y ∈ Vars(q) such that π(y) ∈ Blocked(Gi), i.e., Tail(π(y)) ≠ Tail′(π(y)), and π(x) is a descendant of π(y). Then π(x) is of the form [p | v0/v0′, …, vm/vm′], where Tail(p) ≠ Tail′(p). Assume that vm ≠ vm′; then the length of the path π(x) is larger than k, which contradicts the fact that dπ(x, y) ≤ d_q ≤ n_q ≤ k.
(2) If π(x) ∈ V(Gi) for some Gi with AfterBlocked(Gi) ≠ ∅ and π(x) ∈ AfterBlocked(Gi), then Tail′(π(x)) is tree-blocked by ψ(Tail′(π(x))). Indeed, if AfterBlocked(Gi) ≠ ∅, then there exists some y ∈ Vars(q) such that π(y) ∈ Nodes(Gi) has some proper sub-path p with Tail(p) ≠ Tail′(p). Since π(x) and π(y) are in the same Gi, either π(x) is an ancestor of π(y), or there is some z ∈ Terms(q) such that π(z) is a common ancestor of π(x) and π(y) in Nodes(Gi). In the first case, if Tail′(π(x)) were not tree-blocked, we would have dπ(x, y) > n_q, which is a contradiction. In the second case, if Tail′(π(x)) were not tree-blocked, then Tail′(π(z)) would not be tree-blocked either, and thus we also derive a contradiction, since dπ(z, y) > n_q.
We thus construct the mapping μ : Terms(q) → Nodes(F) as follows. For each a ∈ Inds(q), μ(a) = Tail(π(a)) = a. For each x ∈ Vars(q) with π(x) ∈ V(Gi): if AfterBlocked(Gi) = ∅, then μ(x) = Tail(π(x)); if AfterBlocked(Gi) ≠ ∅, then μ(x) = Tail′(π(x)) if π(x) ∉ AfterBlocked(Gi), and μ(x) = ψ(Tail′(π(x))) otherwise.
We now prove that μ satisfies Properties (i)–(iii) in Definition 9. Property (i) follows from the construction of μ. Property (ii): for each fuzzy concept atom ⟨C(x) ≥ n⟩ ∈ q, since I_F ⊨ q, H(π(x), C) ≥ n holds. It follows that ⟨C, ≥, n⟩ ∈ ℒ(Tail′(π(x))) or ⟨C, ≥, n⟩ ∈ ℒ(ψ(Tail′(π(x)))); in either case we have ⟨C, ≥, n⟩ ∈ ℒ(μ(x)). Property (iii): for each fuzzy role atom ⟨R(x, y) ≥ n⟩ ∈ q, Ea(π(x), π(y), R) ≥ n holds. Then either (1) Tail′(π(y)) is an ⟨R, ≥, n⟩-successor of Tail(π(x)), or (2) Tail′(π(x)) is an ⟨Inv(R), ≥, n⟩-successor of Tail(π(y)).
Case (1): For each connected component Gi with AfterBlocked(Gi) = ∅, we have Tail(π(x)) = Tail′(π(x)) for each term x such that π(x) ∈ Gi. If Tail′(π(y)) is an ⟨R, ≥, n⟩-successor of Tail(π(x)), then μ(y) = Tail′(π(y)) is an ⟨R, ≥, n⟩-successor of μ(x) = Tail(π(x)). If AfterBlocked(Gi) ≠ ∅, we make a case study as follows. (a) If π(x), π(y) ∉ AfterBlocked(Gi), then μ(x) = Tail′(π(x)) and μ(y) = Tail′(π(y)). If Tail′(π(y)) is an ⟨R, ≥, n⟩-successor of Tail(π(x)), then μ(y) = Tail′(π(y)) is an ⟨R, ≥, n⟩-successor of μ(x) = Tail′(π(x)) = Tail(π(x)). (b) If π(x), π(y) ∈ AfterBlocked(Gi), then, by Property (1), Tail′(π(x)) = Tail(π(x)). By Property (2), Tail′(π(x)) is tree-blocked by ψ(Tail′(π(x))), and
Tail′(π(y)) is tree-blocked by ψ(Tail′(π(y))). If Tail′(π(y)) is an ⟨R, ≥, n⟩-successor of Tail(π(x)), then, since tree-blocking preserves node and edge labels, μ(y) = ψ(Tail′(π(y))) is an ⟨R, ≥, n⟩-successor of μ(x) = ψ(Tail′(π(x))). (c) If π(x) ∈ AfterBlocked(Gi) and π(y) ∉ AfterBlocked(Gi), then Tail(π(x)) is tree-blocked by ψ(Tail′(π(x))); if Tail′(π(y)) is an ⟨R, ≥, n⟩-successor of Tail(π(x)), then μ(y) = Tail′(π(y)) is an ⟨R, ≥, n⟩-successor of μ(x) = ψ(Tail′(π(x))). The proof for case (2) proceeds in a similar way to case (1). Since μ has Properties (i)–(iii), q ↪ F holds.
Example 8. Since Vr(Gπ) = ∅ and Gπ is connected, the only connected component of Gπ is Gπ itself. We have Blocked(Gπ) = {p6} and AfterBlocked(Gπ) = {p7, p8, v5}. We obtain the mapping μ1 from π by setting μ1(x) = ψ(Tail′(p6)) = o3, μ1(y) = Tail′(p7) = o4, μ1(z) = Tail′(p8) = o5, and μ1(yc) = v5. By Definition 9, q ↪ F1.
Theorem 2. Let K be an f-SHIF(D) KB and q a fuzzy conjunctive query, and let k ≥ n_q. Then K ⊨ q iff q ↪ F for each F ∈ ccf_k(K).
Proof. Assume q ↪ F for each F ∈ ccf_k(K). Then, by Lemma 3, F ⊨ q for each F ∈ ccf_k(K), and, by Theorem 1, K ⊨ q. For the converse side, assume K ⊨ q; by Theorem 1, F ⊨ q for each F ∈ ccf_k(K), and in particular I_F ⊨ q. By Lemma 6, q ↪ F.
From the only if direction of Theorem 2, we can establish our key result, which reduces query entailment K ⊨ q to finding a mapping of q into every F in ccf_k(K).
Lemma 7. In a completion forest F of K, the maximal number of non-isomorphic k-trees is T_k = 2^{p(c,d,r)^{k+1}}, where p(c, d, r) is some polynomial in c, d, and r.
Proof. Consider a node v of a k-complete and clash-free completion forest of K. Each successor of such a node v may be the root node of a (k−1)-tree. If a node label of a k-tree contains some triples of the form ⟨∃R.C, ≥, n⟩ or ⟨≥2S, ≥, n⟩, the ∃-rule or the ≥-rule is triggered and new nodes are added. The generating rules can be applied to each node at most c times. Each time such a rule is applied, it generates at most two R-successors (if the label of the node contains ⟨≥2S, ≥, n⟩) for each role R. This gives a bound of 2c R-successors for each role, and a bound of 2cr successors for each node. Since ℒ(⟨x, y⟩) ⊆ R_K × {≥} × (N^A ∪ N_q), there are at most 2^{rd} different edge labels, and thus a node can be linked to one of its successors in 2^{rd} ways. Thus, the upper bound on the number of non-isomorphic k-trees satisfies T_k = 2^c · d · (2^{rd} · T_{k−1})^{2cr}. Let y = 2cr and x = c + ry; then

T_k = 2^x d^{1+y} (T_{k−1})^y = 2^x d^{1+y} (2^x d^{1+y} (T_{k−2})^y)^y = … = (2^x d^{1+y})^{1+y+…+y^{k−1}} (T_0)^{y^k}.

Since T_0 = 2^c d, we have, for y ≥ 2 (it also holds for y = 1), that

T_k ≤ (2^{c+2cr} d^{1+2cr} 2^c)^{(2cr)^k} ≤ 2^{(2c² + 4c²r³ + 4c²r²d)^{k+1}} = 2^{p(c,d,r)^{k+1}},

where p(c, d, r) = 2c² + 4c²r³ + 4c²r²d.
Lemma 8. The upper bound on the number of nodes in F is O(|I_K| · (2cr)^{D+1}), where D = (T_k + 1)k.
Proof. By Lemma 7, there are at most T_k non-isomorphic k-trees. If there existed a path from v to v′ with length greater than (T_k + 1)k, then v′ would appear after a sequence of T_k + 1 non-overlapping k-trees, one of them would have been blocked, and v′ would not have been generated. The number of nodes in a k-tree is bounded by (2cr)^{k+1}, so the number of nodes in F is bounded by |I_K| · (2cr)^{D+1}.
Corollary 1. If k is linear in the size of K and q, then the number of nodes in F is at most triple exponential in the size of K and q; if the size of q, the TBox T, and the RBox R is fixed and k is a constant, then the number of nodes in F is polynomial in the size of A.
Theorem 3. (Termination) The expansion of F_K into a complete and clash-free completion forest F ∈ ccf_k(K) terminates in triple exponential time w.r.t. the size of K and q, and in polynomial time w.r.t. the size of A if the size of q, the TBox T, and the RBox R is fixed and k is a constant.
Proof. Let M be the bound on the number of nodes in F given by Lemma 8. We make a case study of the numbers of applications of the rules for expanding F_K into F. For each node, the ⊓-rule and the ⊔-rule may be applied at most O(c) times, and the ∃-, ∀-, ∀+-, ≥- and ≤-rules may be applied at most O(cr) times; the number of applications of these rules is thus at most O(Mcr). The number of applications of the ≤r-rule is at most O(|I_K|). For each node, the number of applications of the ⊑-rule (for GCIs) is at most O(d|T|), and is thus at most O(Md|T|) for M nodes. To sum up, the total number of rule applications is at most O(Mcr + Md|T|).
Theorem 4. Let K be an f-SHIF(D) KB and q a fuzzy conjunctive query in which all the roles are simple. Deciding whether K ⊨ q is in co3NExpTime w.r.t. combined complexity, and in coNP w.r.t. data complexity.
Proof. If K ⊭ q, there must exist an F ∈ ccf_k(K) such that q ↪ F does not hold. Due to the existence of the nondeterministic rules, and by Theorem 3, the construction of F can be done in nondeterministic triple exponential time in the size of K and q. By Corollary 1, the number of nodes in F is at most triple exponential in the size of K and q. For a fuzzy CQ q with k_q variables, checking for a mapping takes M^{k_q} steps, which is triple exponential in the size of K and q. Hence, deciding K ⊭ q is in 3NExpTime, and deciding K ⊨ q is in co3NExpTime. Similarly, the data complexity of deciding K ⊨ q is in coNP.
CONCLUSION
Fuzzy Description Logic-based knowledge bases are envisioned to be useful in the Semantic Web. Existing fuzzy DL reasoners either are not capable of answering complex queries (mainly conjunctive queries), or apply only to DLs with less expressivity. We have therefore presented an algorithm for answering expressive fuzzy conjunctive queries over a relatively expressive fuzzy DL, namely fuzzy SHIF(D). The algorithm we suggest here can easily be adapted to existing (and future) DL implementations. Future directions concern applying the proposed technique to more expressive fuzzy query languages, e.g., those in (Pan et al., 2008).
REFERENCES
Baader, F., Calvanese, D., McGuinness, D. L., Nardi, D., & Patel-Schneider, P. F. (Eds.). (2003). The description logic handbook: Theory, implementation, and applications. Cambridge University Press. Baader, F., & Nutt, W. (2003). Basic description logics. In The description logic handbook (pp. 43–95). Cambridge University Press. Bechhofer, S., Van Harmelen, F., Hendler, J., Horrocks, I., McGuinness, D., Patel-Schneider, P., et al. (2004). OWL Web ontology language reference. W3C recommendation. Bobillo, F., & Straccia, U. (2008). fuzzyDL: An expressive fuzzy description logic reasoner. Proceedings of the 2008 IEEE International Conference on Fuzzy Systems, (pp. 923–930).
Calvanese, D., De Giacomo, G., Lembo, D., Lenzerini, M., & Rosati, R. (2007). Tractable reasoning and efficient query answering in description logics: The DL-Lite family. Journal of Automated Reasoning, 39(3), 385–429. doi:10.1007/s10817-007-9078-x Calvanese, D., De Giacomo, G., & Lenzerini, M. (1998). On the decidability of query containment under constraints. Proceedings of the 17th ACM SIGACT SIGMOD SIGART Symposium on Principles of Database Systems (PODS'98), (pp. 149–158). Glimm, B., Horrocks, I., & Sattler, U. (2007). Conjunctive query entailment for SHOQ. Proceedings of the 2007 International Workshop on Description Logic (DL 2007). CEUR Electronic Workshop Proceedings. Glimm, B., Lutz, C., Horrocks, I., & Sattler, U. (2008). Conjunctive query answering for the description logic SHIQ. Journal of Artificial Intelligence Research (JAIR), 31, 157–204. Krötzsch, M., Rudolph, S., & Hitzler, P. (2007). Conjunctive queries for a tractable fragment of OWL 1.1. Proceedings of the 6th International Semantic Web Conference (ISWC 2007), (pp. 310–323). Levy, A. Y., & Rousset, M.-C. (1998). Combining Horn rules and description logics in CARIN. Artificial Intelligence, 104(1-2), 165–209. doi:10.1016/S0004-3702(98)00048-4 Lukasiewicz, T., & Straccia, U. (2008). Managing uncertainty and vagueness in description logics for the Semantic Web. Journal of Web Semantics, 6(4), 291–308. doi:10.1016/j.websem.2008.04.001 Mailis, T. P., Stoilos, G., & Stamou, G. B. (2007). Expressive reasoning with Horn rules and fuzzy description logics. Proceedings of the 1st International Conference on Web Reasoning and Rule Systems (RR 2007). Ortiz, M., Calvanese, D., & Eiter, T. (2006). Data complexity of answering unions of conjunctive queries in SHIQ. Proceedings of the 2006 International Workshop on Description Logic. CEUR Electronic Workshop Proceedings. Ortiz, M., Calvanese, D., & Eiter, T. (2008). Data complexity of query answering in expressive description logics via tableaux. Journal of Automated Reasoning, 41(1), 61–98. doi:10.1007/s10817-008-9102-9 Pan, J. Z., Stamou, G. B., Stoilos, G., Taylor, S., & Thomas, E. (2008). Scalable querying services over fuzzy ontologies. Proceedings of the 17th International World Wide Web Conference (WWW2008), (pp. 575–584). Rosati, R. (2007a). The limits of querying ontologies. Proceedings of the 11th International Conference on Database Theory (ICDT 2007), (pp. 164–178). Rosati, R. (2007b). On conjunctive query answering in EL. Proceedings of the 2007 International Workshop on Description Logic (DL 2007). CEUR Electronic Workshop Proceedings. Stoilos, G., Simou, N., Stamou, G. B., & Kollias, S. D. (2006). Uncertainty and the Semantic Web. IEEE Intelligent Systems, 21(5), 84–87. doi:10.1109/MIS.2006.105 Stoilos, G., Stamou, G. B., Pan, J. Z., Tzouvaras, V., & Horrocks, I. (2007). Reasoning with very expressive fuzzy description logics. Journal of Artificial Intelligence Research (JAIR), 30, 273–320.
Stoilos, G., Straccia, U., Stamou, G. B., & Pan, J. Z. (2006). General concept inclusions in fuzzy description logics. Proceedings of the 17th European Conference on Artificial Intelligence (ECAI 2006), (pp. 457–461). Straccia, U. (2001). Reasoning within fuzzy description logics. Journal of Artificial Intelligence Research (JAIR), 14, 137–166. Straccia, U. (2006). Answering vague queries in fuzzy DL-Lite. Proceedings of the 11th International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems (IPMU-06), (pp. 2238–2245). Zadeh, L. A. (1965). Fuzzy sets. Information and Control, 8(3), 338–353. doi:10.1016/S0019-9958(65)90241-X
Chapter 11
ABSTRACT
The Resource Description Framework (RDF) is a flexible model for representing information about resources in the Web. With the increasing amount of RDF data that is becoming available, efficient and scalable management of RDF data has become a fundamental challenge to achieving the Semantic Web vision. The RDF model has attracted attention in the database community, and many researchers have proposed different solutions to store and query RDF data efficiently. This chapter focuses on using relational query processors to store and query RDF data. It gives an overview of the different approaches and classifies them according to their storage and query evaluation strategies.
INTRODUCTION
The term Semantic Web was coined by W3C founder Tim Berners-Lee in a Scientific American article describing the future of the Web (Berners-Lee et al., 2001). The main purpose of the Semantic Web vision is to provide a common framework for data sharing across applications, enterprises, and communities. By giving data semantic meaning (through metadata), this framework allows machines to consume, understand, and reason about the structure and purpose of the data. The core of the Semantic Web is built on the Resource Description Framework (RDF) data model (Manola & Miller, 2004). The RDF model is designed to be simple, with a formal semantics and provable inference, and with an extensible URI-based vocabulary that allows anyone to make statements about any
DOI: 10.4018/978-1-60960-475-2.ch011
resource. Hence, in the RDF model, the universe is modeled as a set of resources, where a resource is defined as anything that can have a universal resource identifier (URI). RDF describes a particular resource using a set of RDF statements of the form (subject, predicate, object), also known as (subject, property, value) triples. The subject is the resource, the predicate is the characteristic being described, and the object is the value for that characteristic. Efficient and scalable management of RDF data is a fundamental challenge at the core of the Semantic Web. Several research efforts have been proposed to address these challenges (Abadi et al., 2009; Alexaki et al., 2001; Broekstra et al., 2002; Harth & Decker, 2005; Ma et al., 2004; Weiss et al., 2008). Relational database management systems (RDBMSs) have repeatedly shown that they are very efficient, scalable and successful in hosting types of data which had formerly not been anticipated to be stored inside relational databases, such as complex objects (Turker & Gertz, 2001), spatio-temporal data (Botea et al., 2008) and XML data (Grust et al., 2004). RDBMSs derive much of their performance from sophisticated optimizer components which make use of physical properties that are specific to the relational model, such as sortedness, proper join ordering and powerful indexing mechanisms. This chapter focuses on using relational query processors to store and query RDF data. We give an overview of the different approaches and classify them according to their storage and indexing strategies. The rest of the chapter is organized as follows. Section (RDF-SPARQL Preliminaries) introduces preliminaries of the RDF data model and the W3C standard RDF query language, SPARQL. It also introduces the main alternative relational approaches for storing and querying RDF. Sections (Vertical (Triple) Stores, Property Table Stores, and Horizontal Stores) provide the details of the different techniques in each of the alternative relational approaches. Section (Experimental Evaluation) presents an experimental comparison between representatives of the different approaches. Finally, Section (Concluding Remarks) concludes the chapter and provides some suggestions for possible future research directions on the subject.
RDF-SPARQL PRELIMINARIES
The Resource Description Framework (RDF) is a W3C recommendation that has rapidly gained popularity as a means of expressing and exchanging semantic metadata, i.e., data that specifies semantic information about data. RDF was originally designed for the representation and processing of metadata about remote information sources, and it defines a model for describing relationships among resources in terms of uniquely identified attributes and values. The basic building block in RDF is a simple tuple model, (subject, predicate, object), used to express different types of knowledge in the form of fact statements. The interpretation of each statement is that subject S has property P with value O, where S and P are resource URIs and O is either a URI or a literal value. Any object from one triple can play the role of a subject in another triple, which amounts to chaining two labeled edges in a graph-based structure. In addition, RDF allows a form of reification in which any RDF statement itself can be the subject or object of a triple. One of the clear advantages of the RDF data model is its schema-free structure, in comparison to the entity-relationship model, where the entities, their attributes and their relationships to other entities are strictly defined. RDF is not a syntax (i.e., a data format). There exist various RDF syntaxes (e.g., the Notation 3 (N3) language, Turtle, XML), and depending on the application space one syntax may be more appropriate than another. In RDF, the schema may evolve over time, which fits well with the modern notion of data management, dataspaces, and its pay-as-you-go philosophy (Jeffery et al., 2008). Figure 1 illustrates a sample RDF graph.
The SPARQL query language is the official W3C standard for querying and extracting information from RDF graphs (Prud'hommeaux & Seaborne, 2008). Since RDF is a directed labeled graph data format, SPARQL is essentially a graph-matching query language; it represents the counterpart to select-project-join queries in the relational model. It is based on a powerful graph-matching facility, allows binding variables to components in the input RDF graph, and supports conjunctions and disjunctions of triple patterns. In addition, operators akin to relational joins, unions, left outer joins, selections, and projections can be combined to build more expressive queries. A basic SPARQL query has the form:
select ?variable1 ?variable2... where { pattern1. pattern2.... }
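To make the relational connection concrete, the following sketch (our own illustration; the triples table layout, column names, and example URIs are hypothetical) shows how a two-pattern SPARQL query can be translated into a self-join over a triple table. The systems surveyed below differ in details such as dictionary encoding of URIs and literals.

-- Hypothetical triple table: triples(subject, predicate, object).
-- SPARQL: select ?person ?name
--         where { ?person rdf:type ex:Professor . ?person ex:name ?name . }
SELECT t1.subject AS person,
       t2.object  AS name
FROM   triples t1
JOIN   triples t2
       ON t2.subject = t1.subject          -- shared variable ?person
WHERE  t1.predicate = 'rdf:type'
  AND  t1.object    = 'ex:Professor'
  AND  t2.predicate = 'ex:name';

Each additional triple pattern in the SPARQL graph pattern adds one more self-join, which is exactly the cost that the storage schemes discussed in this chapter try to reduce.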
Figure 3 illustrates a general classification of RDF triple stores based on their storage models (Ma et al., 2008). In principle, RDF stores can be divided into two major categories: native stores and database-based stores. Native stores are built directly on the file system, whereas database-based repositories use relational or object-relational databases as the backend store. Representative native stores include OWLIM (Kiryakov et al., 2005), HStar (Ma et al., 2008), AllegroGraph (AllegroGraph RDFStore, 2009) and YARS (Harth & Decker, 2005). Representative ontology-dependent stores include DLDB (Pan & Heflin, 2003; Pan et al., 2008) and Sesame (Broekstra et al., 2002). The main focus of this chapter is to give an overview of the generic relational approaches for processing RDF data. In general, relational database management systems (RDBMSs) have repeatedly shown that they are very efficient, scalable and successful in hosting types of data which had formerly not been anticipated to be stored inside relational databases. In addition, RDBMSs have shown their ability to handle
vast amounts of data very efficiently using their powerful indexing mechanisms. In principle, RDBMSs derive much of their performance from sophisticated optimizer components which make use of physical properties that are specific to the relational model, such as sortedness, proper join ordering and powerful indexing mechanisms. In fact, a main advantage of the relational-based approach to processing RDF data is that it can make use of the large and robust body of work on query planning and optimization available in the infrastructure of relational query engines to implement efficient and scalable SPARQL query processors. For example, Cyganiak (2005) presented an approach for compiling SPARQL queries into standard relational algebraic plans. The relational RDF stores can be mainly classified into the following categories (a minimal DDL sketch of each follows the list):
1. Vertical (triple) table stores: where each RDF triple is stored directly in a three-column table (subject, predicate, object).
2. Property (n-ary) table stores: where multiple RDF properties are modeled as n-ary table columns for the same subject.
3. Horizontal (binary) table stores: where RDF triples are modeled as one horizontal table or as a set of vertically partitioned binary tables (one table for each RDF property).
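The following DDL sketch illustrates the three layouts (our own illustration with hypothetical names; real systems typically replace the string columns with integer identifiers drawn from a dictionary table):

-- 1. Vertical (triple) table store: one table holds every statement.
CREATE TABLE triples (
  subject   VARCHAR(255),
  predicate VARCHAR(255),
  object    VARCHAR(255)
);

-- 2. Property (n-ary) table store: related properties of a subject share a row.
CREATE TABLE person (
  subject VARCHAR(255),
  name    VARCHAR(255),
  age     INT,
  email   VARCHAR(255)
);

-- 3. Horizontal (binary) table store: one two-column table per property.
CREATE TABLE name (subject VARCHAR(255), object VARCHAR(255));
CREATE TABLE age  (subject VARCHAR(255), object INT);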
Figures 4, 5 and 6 illustrate examples of the three alternative relational representations of the sample RDF graph (Figure 1) and their associated SQL queries for evaluating the sample SPARQL query (Figure 2).
Therefore, several approaches have been proposed to deal with this limitation, either by using an extensive set of indexes or by using selectivity estimation information to optimize the join ordering. Harris & Gibbins (2003) have described the 3store RDF storage system. The storage system of 3store is based on a central triple table which holds the hashes for the subject, predicate, object and graph identifier, where the graph identifier is equal to zero if the triple resides in the anonymous background graph. A symbols table is used to allow reverse lookups from the hash to the hashed value, for example, to return results. Furthermore, it allows SQL operations to be performed on pre-computed values in the data types of the columns without the use of casts. For evaluating SPARQL queries, the triples table is joined once for each triple in the graph pattern, where variables are bound to their values when they first encounter the slot in which the variable appears. Subsequent occurrences of variables in the graph pattern are used to constrain any appropriate joins with their initial binding. To produce the intermediate results table, the hashes of any SPARQL variables required to be returned in the result set are projected, and the hashes from the intermediate results table are joined to the symbols table to provide the textual representation of the results. Neumann & Weikum (2008) have presented the RDF-3X (RDF Triple eXpress) query engine, which tries to overcome the criticism that triple stores incur too many expensive self-joins by creating an exhaustive set of indexes and relying on fast processing of merge joins. The physical design of RDF-3X is workload-independent and eliminates the need for physical-design tuning by building indexes over all six permutations of the three dimensions that constitute an RDF triple, together with indexes over count-aggregated variants of all three two-dimensional and all three one-dimensional projections. The query processor follows a RISC-style design philosophy (Chaudhuri & Weikum, 2000) by using the full set of indexes on the triple tables and relying mostly on merge joins over sorted index lists. The query optimizer relies on its cost model to find the lowest-cost execution plan and mostly focuses on join order and the generation of execution plans. In principle, selectivity estimation has a huge impact on plan generation. While this is a standard problem in database systems, the schema-free nature of RDF data makes the problem more challenging. RDF-3X employs dynamic programming for plan enumeration, with a cost model based on RDF-specific statistical synopses. It relies on two kinds of statistics: (1) specialized histograms, which are generic and can handle any kind of triple patterns and joins, but have the disadvantage of assuming independence between predicates; and (2) frequent join paths in the data, which give more accurate estimations. During query optimization, the query optimizer uses the join-path selectivity information when available, and otherwise assumes independence and uses the histogram information. Neumann & Weikum (2009) have extended the work further by introducing a runtime technique for accelerating query execution. It uses a light-weight, RDF-specific technique for sideways information passing across different joins and index scans within the query execution plans.
They have also enhanced the selectivity estimator of the query optimizer by using very fast index lookups on specifically designed aggregation indexes, rather than relying on the usual kinds of coarse-grained histograms. This provides much more accurate estimates at compile time, at a fairly small cost that is easily amortized by providing better directives for the join-order optimization. Weiss et al. (2008) have presented the Hexastore RDF storage scheme, with a main focus on scalability and generality in its data storage, processing and representation. Hexastore is based on the idea of indexing the RDF data in a multiple-indexing scheme (Harth & Decker, 2005). It does not discriminate against any RDF element and treats subjects, properties and objects equally: each RDF element type has its own special index structures built around it. Moreover, every possible ordering of the importance or precedence of the three elements in an indexing scheme is materialized. Each index structure in a Hexastore centers
around one RDF element and defines a prioritization between the other two elements. Two vectors are associated with each RDF element (e.g., subject), one for each of the other two RDF elements (e.g., property and object). In addition, lists of the third RDF element are appended to the elements in these vectors. In total, six distinct indices are used for indexing the RDF data. These indices materialize all possible orders of precedence of the three RDF elements. A clear disadvantage of this approach is that Hexastore features a worst-case fivefold storage increase in comparison to a conventional triples table.
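The exhaustive indexing used by RDF-3X, and mirrored by Hexastore's six index structures, can be approximated in a plain RDBMS as follows (a sketch only: RDF-3X actually stores the six orderings as compressed clustered B+-trees over dictionary-encoded triples rather than as secondary indexes):

-- Six permutations of (subject, predicate, object), so that every
-- triple pattern can be answered by a range scan on some sorted order.
CREATE INDEX idx_spo ON triples (subject, predicate, object);
CREATE INDEX idx_sop ON triples (subject, object, predicate);
CREATE INDEX idx_pso ON triples (predicate, subject, object);
CREATE INDEX idx_pos ON triples (predicate, object, subject);
CREATE INDEX idx_osp ON triples (object, subject, predicate);
CREATE INDEX idx_ops ON triples (object, predicate, subject);

Because every access path delivers triples in sorted order, joins between patterns can usually be executed as merge joins without an explicit sort step.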
Using knowledge of the frequent access patterns to construct the property tables and influence the underlying database storage structures can provide a performance benefit and reduce the number of join operations during the query evaluation process. Chong et al. (2005) have introduced an Oracle-based SQL table function, RDFMATCH, to query RDF data. The results of the RDFMATCH table function can be further processed by SQL's rich querying capabilities and seamlessly combined with queries on traditional relational data. The core implementation of an RDFMATCH query translates into a self-join query on a triple-based RDF table store. The resulting query is executed efficiently by making use of B-tree indexes, as well as materialized join views created for specialized subject-property patterns. Subject-property matrix materialized join views are used to minimize the query processing overheads that are inherent in the canonical triple-based representation of RDF. The materialized join views are incrementally maintained based on user demand and query workloads, and a special module is provided to analyze the table of RDF triples and estimate the size of various materialized views, based on which a user can define a subset of the materialized views. For a group of subjects, the system defines a set of single-valued properties that occur together. These can be direct properties of these subjects or nested properties. A property p1 is a direct property of subject x1 if there is a triple (x1, p1, x2). A property pm is a nested property of subject x1 if there is a set of triples such as (x1, p1, x2), ..., (xm, pm, xm+1), where m > 1. For example, if there is a set of triples (John, address, addr1), (addr1, zip, 03062), then zip is a nested property of John. Levandoski & Mokbel (2009) have presented another property table approach for storing RDF data without any assumption about the query workload statistics. The main goals of this approach are: (1) reducing the number of join operations required during the RDF query evaluation process by storing related RDF properties together, and (2) reducing the need to process extra data by tuning null storage to fall below a given threshold. The approach provides a tailored schema for each RDF data set which represents a balance between property tables and binary tables, and is based on two main parameters: (1) the support threshold, which represents a value measuring the strength of correlation between properties in the RDF data, and (2) the null threshold, which represents the percentage of null storage tolerated for each table in the schema. The approach involves two phases: clustering and partitioning. The clustering phase scans the RDF data to automatically discover groups of related properties (i.e., properties that always exist together for a large number of subjects). Based on the support threshold, each set of n properties which are grouped together in the same cluster is a good candidate to constitute a single n-ary table, and the properties which are not grouped in any cluster are good candidates for storage in binary tables. The partitioning phase goes over the formed clusters and balances the tradeoff between storing as many RDF properties in clusters as possible and keeping null storage to a minimum, based on the null threshold. The partitioning phase also has two further concerns: ensuring that the clusters do not overlap, so that each property exists in a single cluster, and reducing the number of table accesses and unions necessary in query processing. Matono et al.
(2005) have proposed a path-based relational RDF database. The main focus of this approach is to improve the performance of path queries by extracting all reachable path expressions for each resource and storing them. Thus, there is no need to perform join operations, unlike in the flat triple stores or the property tables approach. In this approach, the RDF graph is divided into subgraphs, and each subgraph is then stored by applicable techniques in distinct relational tables. More precisely, all classes and properties are extracted from the RDF schema data, and all resources are also extracted from the RDF data. Each extracted item is assigned an identifier and a path expression and is stored in the corresponding relational table.
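Returning to the subject-property matrix materialized views described above, the following is a sketch of the idea (ours, with hypothetical names; Oracle's actual views are defined over its internal triple storage) for the direct property address and the nested property zip from the example:

-- One row per subject, pre-joining the nested property so that queries
-- touching address and zip need no self-join at query time.
CREATE MATERIALIZED VIEW subject_property_matrix AS
SELECT t1.subject,
       t1.object AS address,
       t2.object AS zip
FROM   triples t1
LEFT JOIN triples t2
       ON  t2.subject   = t1.object      -- addr1 becomes the subject of zip
       AND t2.predicate = 'zip'
WHERE  t1.predicate = 'address';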
HORIZONTAL STORES
Abadi et al. (2009) have presented SW-Store, a new DBMS that stores RDF data using a fully decomposed storage model (DSM) (Copeland & Khoshafian, 1985). In this approach, the triples table is rewritten into n two-column tables, where n is the number of unique properties in the data. In each of these tables, the first column contains the subjects that define that property and the second column contains the object values for those subjects, while the subjects that do not define a particular property are simply omitted from the table for that property. Each table is sorted by subject, so that particular subjects can be located quickly and fast merge joins can be used to reconstruct information about multiple properties for subsets of subjects. For a multi-valued attribute, each distinct value is listed in a successive row in the table for that property. One advantage of this approach is that, while property tables need to be carefully constructed so that they are wide enough, but not too wide, to independently answer queries, the algorithm for creating tables in the vertically partitioned approach is straightforward and need not change over time. Moreover, in the property-class schema approach, queries that do not restrict on class tend to have many union clauses, while in the vertically partitioned approach all data for a particular property is located in the same table, so union clauses in queries are less common. The implementation of SW-Store relies on a column-oriented DBMS, C-Store (Stonebraker et al., 2005), to store tables as collections of columns rather than as collections of rows. In standard row-oriented databases (e.g., Oracle, DB2, SQL Server, Postgres, etc.), entire tuples are stored consecutively. The problem with this is that if only a few attributes are accessed per query, entire rows need to be read into memory from disk before the projection can occur. By storing data in columns rather than rows, projection comes for free: only those columns relevant to a query need to be read. Beckmann et al. (2006) and Chu et al. (2007) have argued that storing a sparse data set (like RDF) in multiple tables can cause problems. They suggested storing a sparse data set in a single table, where the complexities of sparse data management can be handled inside an RDBMS with the addition of an interpreted storage format. The proposed format starts with a header which contains fields such as a relation-id, a tuple-id, and a tuple length. When a tuple has a value for an attribute, the attribute identifier, a length field (if the type is of variable length), and the value appear in the tuple. The attribute identifier is the id of the attribute in the system catalog, while the attributes that appear in the system catalog but not in the tuple are null for that tuple. Since the interpreted format stores nothing for null attributes, sparse data sets in a horizontal schema can in general be stored much more compactly in this format. While the interpreted format has storage benefits for sparse data, retrieving the values of attributes from tuples is more complex. In fact, the format is called interpreted because the storage system must discover the attributes and values of a tuple at tuple-access time, rather than using precompiled position information from a catalog, as the positional format allows.
Beckmann et al. (2006) and Chu et al. (2007) have argued that storing a sparse data set (like RDF) in multiple tables can cause problems. They suggest storing a sparse data set in a single table, while the complexities of sparse data management are handled inside the RDBMS through the addition of an interpreted storage format. The proposed format starts with a header which contains fields such as a relation-id, a tuple-id, and a tuple length. When a tuple has a value for an attribute, the attribute identifier, a length field (if the type is of variable length), and the value appear in the tuple. The attribute identifier is the id of the attribute in the system catalog, while the attributes that appear in the system catalog but not in the tuple are null for that tuple. Since the interpreted format stores nothing for null attributes, sparse data sets in a horizontal schema can in general be stored much more compactly in this format. While the interpreted format has storage benefits for sparse data, retrieving the values of attributes from tuples is more complex. In fact, the format is called interpreted because the storage system must discover the attributes and values of a tuple at tuple-access time, rather than using precompiled position information from a catalog, as the positional format allows.
To tackle this problem, a new operator (called the EXTRACT operator) is introduced into query plans to precede any reference to attributes stored in the interpreted format; it returns the offsets of the referenced interpreted attribute values, which are then used to retrieve the values. Value extraction from an interpreted record is a potentially expensive operation whose cost depends on the number of attributes stored in a row, or the length of the tuple. Moreover, if a query evaluation plan fetches each attribute individually and uses an EXTRACT call per attribute, the record will be scanned once per attribute, which is very slow. Thus, a batch EXTRACT technique is used that requires only one scan of the present values and saves time. Table 1 provides a summary of the relational techniques for processing RDF queries in terms of their representative approaches and their specific query optimization techniques.
EXPERIMENTAL EVALUATION
In this section, we present an experimental evaluation of the different approaches that rely on the relational infrastructure to provide scalable engines to store and query RDF data (MahmoudiNasab & Sakr, 2010).
Experimental Settings
Our experimental evaluation of the alternative relational RDF storage techniques is conducted using the IBM DB2 DBMS running on a PC with 3.2 GHz Intel Xeon processors, 4 GB of main memory, and 250 GB of SCSI secondary storage. We used the SP2Bench data generator to produce four testing datasets containing 500K, 1M, 2M, and 4M triples, respectively. In our evaluation, we consider the following four alternative relational storage schemes:
1. Triple Stores (TS): where a single relational table is used to store the whole set of RDF triples (subject, predicate, object). We follow RDF-3X (Neumann & Weikum, 2008) and build indexes over all six permutations of the three fields of each RDF triple.
2. Binary Table Stores (BS): for each unique predicate in the RDF data, we create a binary table (ID, Value), and two indexes over the permutations of the two fields are built (a DDL sketch of the TS and BS schemes follows this list).
3. Traditional Relational Stores (RS): in this scheme, we use the entity-relationship model of the DBLP dataset and follow the traditional way of designing a normalized relational schema, where we build a separate table for each entity (with its associated descriptive attributes) and use foreign keys to represent the relationships between the different objects. We build specific partitioned B-tree indexes (Graefe, 2003) for each table based on the referenced attributes in the benchmark queries.
4. Property Table Stores (PS): where we use the schema of RS and decompose each entity with four or more attributes into two subject-property tables. The decomposition is done blindly, based on the order of the attributes, without considering the benchmark queries (workload independent).
Performance Metrics
We measure and compare the performance of the alternative relational RDF storage techniques using the following metrics:
1. Loading Time: represents the period of time for shredding the RDF dataset into the relational tables of the storage scheme.
2. Storage Cost: depicts the size of the disk storage space consumed by the relational storage scheme for storing the RDF dataset.
3. Query Performance: represents the execution times for the different SQL translations of the SPARQL queries of SP2Bench over the alternative relational storage schemes.
All reported numbers for the query performance metric are the average of five executions, with the highest and the lowest values removed. The rationale behind this is that the first execution of each query is always expensive and inconsistent with the other readings. This is because the relational database uses buffer pools as a caching mechanism. The initial period, during which the database spends its time loading pages into the buffer pools, is known as the warm-up period. During this period, the response time of the database is degraded with respect to the normal response time. For all metrics, the lower the metric value, the better the approach.
Experimental Results
Table 3 summarizes the loading times for shredding the different datasets into the alternative relational representations. The RS scheme is the fastest due to the smaller number of required tuple-insert operations. Similarly, TS requires less loading time than BS, since the number of inserted tuples and updated tables is smaller for each triple. Table 4 summarizes the storage cost of the alternative relational representations. The RS scheme represents the cheapest approach because of the normalized design and the absence of any data redundancy. Due to the limited sparsity of the DBLP dataset, the PS scheme does not introduce any additional storage cost except a little overhead due to the redundancy of the object identification attributes in the decomposed property tables. The BS scheme represents the most expensive approach due to the redundancy of the ID attributes in each binary table. It should also be noted that the storage costs of TS and BS are affected by the additional sizes of their associated indexes. Table 5 summarizes the query performance of the SP2Bench benchmark queries over the alternative relational representations using the different sizes of the dataset. Remarks about the results of this experiment are given as follows:
1. There is no clear winner between the triple store (TS) and the binary table (BS) encoding schemes. The triple store (TS), with its simple storage and the huge number of tuples in the encoding relation, is still very competitive with the binary tables encoding scheme because of the full set of B-tree physical indexes over the permutations of the three encoding fields (subject, predicate, object).
2. The query performance of the (BS) encoding scheme is badly affected by an increase in the number of predicates in the input query. It is also affected by the subject-object or object-object types of joins, where no index information is available for utilization. Such a problem could be solved by building materialized views over the columns of the most frequently referenced pairs of attributes (a sketch of such a view follows this list).
Table 3. A comparison between the alternative relational RDF storage techniques in terms of their loading times
Loading Time (in seconds)
Dataset   Triple Stores   Binary Tables   Traditional Relational   Property Tables
500K      282             306             212                      252
1M        577             586             402                      521
2M        1242            1393            931                      1176
4M        2881            2936            1845                     2406
3. Despite their generality, there is still a clear gap between the query performance of the (TS) and (BS) encoding schemes and that of the tailored relational encoding scheme (RS) of the RDF data. However, designing a tailored relational schema requires detailed information about the structure of the represented objects in the RDF dataset. Such information is not always available, and designing a tailored relational schema limits the schema-free advantage of the RDF data, because any new object with a variant schema will require a change in the schema of the underlying relational structure. Hence, we believe that further efforts are still required to improve the performance of these generic relational RDF storage schemes and to reduce the query performance gap with the tailored relational encoding schemes.
4. The property tables encoding schemes (PS) try to fill the gap between the generic encoding schemes (TS and BS) and the tailored encoding schemes (RS). The results of our experiments show that the (PS) encoding scheme can achieve a query performance comparable to the (RS) encoding scheme. However, designing the schema of the property tables requires either explicit or implicit information about the characteristics of the objects in the RDF dataset. Such explicit information is not always available, and the process of inferring the implicit information introduces the additional cost of a pre-processing phase. Such challenges call for new techniques for flexible designs of the property tables encoding schemes.
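As suggested in item 2 above, the unindexed subject-object joins over the binary tables can be precomputed by materializing the most frequently joined column pairs. A minimal sketch in DB2 materialized-query-table syntax (the table and column names are hypothetical) could look as follows:

-- Materialized query table joining two hypothetical binary tables on the
-- subject-object relationship, so the join is precomputed instead of being
-- evaluated without index support at query time.
CREATE TABLE mv_author_name AS
  (SELECT a.id AS paper, n.value AS author_name
   FROM author a JOIN name n ON a.value = n.id)
  DATA INITIALLY DEFERRED REFRESH DEFERRED;

REFRESH TABLE mv_author_name;  -- populate the materialized table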
Table 4. A comparison between the alternative relational RDF storage techniques in terms of their storage cost
Storage Cost (in KB)
Dataset   Triple Stores   Binary Tables   Traditional Relational   Property Tables
500K      24721           32120           8175                     10225
1M        48142           64214           17820                    21200
2M        96251           128634          36125                    43450
4M        192842          257412          73500                    86200
Table 5. A comparison between the alternative relational RDF storage techniques in terms of their query performance (in milliseconds)
1M dataset:
Query   TS      BS      RS      PS
Q1      1031    1292    606     701
Q2      1672    1511    776     1109
Q3a     982     1106    61      116
Q3b     754     883     46      76
Q3c     1106    1224    97      118
Q4      21402   21292   11876   14116
Q5      1452    1292    798     932
Q6      2042    1998    1889    2109
Q7      592     30445   412     773
Q8      9013    8651    1683    1918
Q9      2502    15311   654     887
Q10     383     596     284     387
Q11     762     514     306     398

2M dataset:
Query   TS      BS      RS      PS
Q1      1982    2208    1008    1262
Q2      2982    3012    1606    1987
Q3a     1683    1873    102     198
Q3b     1343    1408    87      132
Q3c     1918    2109    209     275
Q4      38951   37642   20192   25019
Q5      2754    2598    1504    1786
Q6      3981    3966    3786    4407
Q7      1102    58556   776     1546
Q8      15932   13006   3409    3902
Q9      4894    26113   1309    1461
Q10     714     1117    554     708
Q11     1209    961     614     765

4M dataset:
Query   TS      BS      RS      PS
Q1      3651    3807    1988    2108
Q2      5402    5601    2308    3783
Q3a     3022    3342    191     354
Q3b     2063    2203    176     218
Q3c     3602    3874    448     684
Q4      66354   64119   39964   48116
Q5      5011    4806    3116    35612
Q6      7011    6986    6685    8209
Q7      2004    116432  1393    2665
Q8      27611   24412   8012    8609
Q9      9311    37511   2204    2671
Q10     1306    2013    1109    1507
Q11     2111    1704    1079    1461
CONCLUDING REMARKS
RDF is a main foundation for processing semantic information stored on the Web. It is the data model behind the Semantic Web vision, whose goal is to enable the integration and sharing of data across different applications and organizations. The naive way to store a set of RDF statements is to use a relational database with a single table including columns for subject, property, and object. While simple, this schema quickly hits scalability limitations. Therefore, several approaches have been proposed to deal with this limitation by using an extensive set of indexes or by using selectivity estimation information to optimize the join ordering (Neumann & Weikum, 2008; Weiss et al., 2008). Another approach to reduce the self-join problem is to create separate tables (property tables) for subjects that tend to have common properties defined (Chong et al., 2005; Levandoski & Mokbel, 2009). Since Semantic Web data is often semi-structured, storing this data in a row-store can result in very sparse tables as more subjects or properties are added. Hence, this normalization technique is typically limited to resources that contain a similar set of properties, and many small tables are usually created. The problem is that this may result in union and join clauses in queries, since information about a particular subject may be located in many different property tables. This may complicate the plan generator and query optimizer and can degrade performance. Abadi et al. (2009) have explored the trade-off between triple-based stores and binary tables-based stores of RDF data. The main advantages of binary tables are:
1. Improved bandwidth utilization: in a column store, only those attributes that are accessed by a query need to be read off disk. In a row-store, surrounding attributes also need to be read, since an attribute is generally smaller than the smallest granularity in which data can be accessed.
2. Improved data compression: storing data from the same attribute domain together increases locality and thus the data compression ratio. Hence, bandwidth requirements are further reduced when transferring compressed data.
On the other hand, binary tables have the following main disadvantages:
1. Increased cost of inserts: column-stores perform poorly for insert queries, since multiple distinct locations on disk have to be updated for each inserted tuple (one for each attribute).
2. Increased tuple reconstruction costs: in order for column-stores to offer a standards-compliant relational database interface (e.g., ODBC, JDBC, etc.), they must at some point in a query plan stitch values from multiple columns together into a row-store style tuple to be output from the database.
Abadi et al. (2009) reported that the performance of binary tables is superior to that of clustered property tables, while Sidirourgos et al. (2008) reported that, even in a column-store database, the performance of
binary tables is not always better than that of clustered property tables and depends on the characteristics of the data set. Moreover, the experiments of Abadi et al. (2009) showed that storing RDF data in a column-store database is better than storing it in a row-store database, while the experiments of Sidirourgos et al. (2008) have shown that the performance gain of a column-store database depends on the number of predicates in a data set. Our experimental evaluation, in addition to other independent benchmarking projects (Bizer & Schultz, 2008; Schmidt et al., 2008, 2009), has shown that no approach is dominant for all queries and that none of these approaches can compete with a purely relational model. Therefore, it is clear that there is still room for optimization in the proposed generic relational RDF storage schemes, and thus new techniques for storing and querying RDF data are still required to bring forward the Semantic Web vision.
REFERENCES
Abadi, D. J., Marcus, A., Madden, S., & Hollenbach, K. (2009). SW-Store: A vertically partitioned DBMS for Semantic Web data management. The VLDB Journal, 18(2), 385–406. doi:10.1007/s00778-008-0125-y
Alexaki, S., Christophides, V., Karvounarakis, G., Plexousakis, D., & Tolle, K. (2001). The ICS-FORTH RDFSuite: Managing voluminous RDF description bases. In Proceedings of the 2nd International Workshop on the Semantic Web (SemWeb).
AllegroGraph RDFStore. (2009). AllegroGraph. Retrieved from http://www.franz.com/agraph/allegrograph/
Beckmann, J. L., Halverson, A., Krishnamurthy, R., & Naughton, J. F. (2006). Extending RDBMSs to support sparse datasets using an interpreted attribute storage format. In Proceedings of the 22nd International Conference on Data Engineering (p. 58).
Berners-Lee, T., Hendler, J., & Lassila, O. (2001). The Semantic Web: A new form of Web content that is meaningful to computers will unleash a revolution of new possibilities. Scientific American.
Bizer, C., & Schultz, A. (2008). Benchmarking the performance of storage systems that expose SPARQL endpoints. In Proceedings of the 4th International Workshop on Scalable Semantic Web Knowledge Base Systems.
Botea, V., Mallett, D., Nascimento, M. A., & Sander, J. (2008). PIST: An efficient and practical indexing technique for historical spatio-temporal point data. GeoInformatica, 12(2), 143–168. doi:10.1007/s10707-007-0030-3
Broekstra, J., Kampman, A., & van Harmelen, F. (2002). Sesame: A generic architecture for storing and querying RDF and RDF schema. In Proceedings of the First International Semantic Web Conference (pp. 54-68).
Chaudhuri, S., & Weikum, G. (2000). Rethinking database system architecture: Towards a self-tuning RISC-style database system. In Proceedings of the 26th International Conference on Very Large Data Bases (pp. 1-10).
Chong, E. I., Das, S., Eadon, G., & Srinivasan, J. (2005). An efficient SQL-based RDF querying scheme. In Proceedings of the 31st International Conference on Very Large Data Bases (pp. 1216-1227).
Chu, E., Beckmann, J. L., & Naughton, J. F. (2007). The case for a wide-table approach to manage sparse relational data sets. In Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 821-832).
Copeland, G. P., & Khoshafian, S. (1985). A decomposition storage model. In Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 268-279).
Cyganiak, R. (2005). A relational algebra for SPARQL (Tech. Rep. No. HPL-2005-170). HP Labs.
DBLP XML Records. (2009). Home page information. Retrieved from http://dblp.uni-trier.de/xml/
Graefe, G. (2003). Sorting and indexing with partitioned B-trees. In Proceedings of the 1st International Conference on Data Systems Research.
Grust, T., Sakr, S., & Teubner, J. (2004). XQuery on SQL hosts. In Proceedings of the Thirtieth International Conference on Very Large Data Bases (pp. 252-263).
Harris, S., & Gibbins, N. (2003). 3store: Efficient bulk RDF storage. In Proceedings of the First International Workshop on Practical and Scalable Semantic Systems.
Harth, A., & Decker, S. (2005). Optimized index structures for querying RDF from the Web. In Proceedings of the Third Latin American Web Congress (pp. 71-80).
Jeffery, S. R., Franklin, M. J., & Halevy, A. Y. (2008). Pay-as-you-go user feedback for dataspace systems. In Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 847-860).
Kiryakov, A., Ognyanov, D., & Manov, D. (2005). OWLIM: A pragmatic semantic repository for OWL. In Proceedings of the Web Information Systems Engineering Workshops (pp. 182-192).
Levandoski, J. J., & Mokbel, M. F. (2009). RDF data-centric storage. In Proceedings of the IEEE International Conference on Web Services.
Ma, L., Su, Z., Pan, Y., Zhang, L., & Liu, T. (2004). RStar: An RDF storage and query system for enterprise resource management. In Proceedings of the ACM International Conference on Information and Knowledge Management (pp. 484-491).
Ma, L., Wang, C., Lu, J., Cao, F., Pan, Y., & Yu, Y. (2008). Effective and efficient Semantic Web data management over DB2. In Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 1183-1194).
MahmoudiNasab, H., & Sakr, S. (2010). An experimental evaluation of relational RDF storage and querying techniques. In Proceedings of the 2nd International Workshop on Benchmarking of XML and Semantic Web Applications.
Manola, F., & Miller, E. (2004). RDF primer. W3C recommendation. Retrieved from http://www.w3.org/TR/REC-rdf-syntax/
Matono, A., Amagasa, T., Yoshikawa, M., & Uemura, S. (2005). A path-based relational RDF database. In Proceedings of the 16th Australasian Database Conference (pp. 95-103).
McBride, B. (2002). Jena: A Semantic Web toolkit. IEEE Internet Computing, 6(6), 55–59. doi:10.1109/MIC.2002.1067737
Neumann, T., & Weikum, G. (2008). RDF-3X: A RISC-style engine for RDF. Proceedings of the VLDB Endowment (PVLDB), 1(1), 647–659.
Neumann, T., & Weikum, G. (2009). Scalable join processing on very large RDF graphs. In Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 627-640).
Pan, Z., & Heflin, J. (2003). DLDB: Extending relational databases to support Semantic Web queries. In Proceedings of the First International Workshop on Practical and Scalable Semantic Systems.
Pan, Z., Zhang, X., & Heflin, J. (2008). DLDB2: A scalable multi-perspective Semantic Web repository. In Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence (pp. 489-495).
Prud'hommeaux, E., & Seaborne, A. (2008). SPARQL query language for RDF. W3C recommendation. Retrieved from http://www.w3.org/TR/rdf-sparql-query/
Schmidt, M., Hornung, T., Küchlin, N., Lausen, G., & Pinkel, C. (2008). An experimental comparison of RDF data management approaches in a SPARQL benchmark scenario. In Proceedings of the 7th International Semantic Web Conference (pp. 82-97).
Schmidt, M., Hornung, T., Lausen, G., & Pinkel, C. (2009). SP2Bench: A SPARQL performance benchmark. In Proceedings of the 25th International Conference on Data Engineering (pp. 222-233).
Sidirourgos, L., Goncalves, R., Kersten, M. L., Nes, N., & Manegold, S. (2008). Column-store support for RDF data management: Not all swans are white. Proceedings of the VLDB Endowment (PVLDB), 1(2), 1553–1563.
Stonebraker, M., Abadi, D. J., Batkin, A., Chen, X., Cherniack, M., Ferreira, M., et al. (2005). C-Store: A column-oriented DBMS. In Proceedings of the 31st International Conference on Very Large Data Bases (pp. 553-564).
Türker, C., & Gertz, M. (2001). Semantic integrity support in SQL:1999 and commercial object-relational database management systems. The VLDB Journal, 10(4), 241–269. doi:10.1007/s007780100050
Weiss, C., Karras, P., & Bernstein, A. (2008). Hexastore: Sextuple indexing for Semantic Web data management. Proceedings of the VLDB Endowment (PVLDB), 1(1), 1008–1019.
Section 3
Chapter 12
Making Query Coding in SQL Easier by Implementing the SQL Divide Keyword
ABSTRACT
Relational Algebra (RA) and the Structured Query Language (SQL) are supposed to have a bijective relationship by having the same expressive power. That is, each operation in SQL can be mapped to one RA equivalent and vice versa. Actually, this is an essential fact, because in commercial database management systems every SQL query is translated into an equivalent RA expression, which is optimized and executed to produce the required output. However, RA has an explicit relational division symbol (÷), whereas SQL does not have a corresponding explicit division keyword. Division is implemented using a combination of four core operations, namely cross product, difference, selection, and projection. In fact, implementing relational division in SQL requires convoluted queries with multiple nested select statements and set operations. Explicit division in relational algebra is possible when the divisor is static; however, a dynamic divisor forces the coding of the query to follow the explicit expression using the four core operators. On the other hand, SQL does not provide any flexibility for expressing division even when the divisor is static. Thus, the work described in this chapter is intended to provide an SQL expression equivalent to
explicit relational algebra division (with a static divisor). In other words, the goal is to implement an SQL query rewriter in Java which takes as input a divide grammar and rewrites it into an efficient query using current SQL keywords. The developed approach could be adopted as a front-end or wrapper to an existing SQL query system. Users will be able to express explicit division in SQL, which will be translated into an equivalent expression that involves only the standard SQL keywords and structure. This will make SQL more attractive for specifying queries involving explicit division.
INTRODUCTION
Since its development as the standard query language for relational databases, the Structured Query Language (SQL) has witnessed a number of developments and extensions. Different research groups have worked on different aspects of SQL (Brantner et al., 2007; Chong et al., 2005; Harris & Shadbolt, 2005; Hung et al., 2005; Karvounarakis et al., 2002; Pan & Heflin, 2003; Prud'hommeaux, 2005). We realized the gap between relational algebra and SQL as far as the division operation is concerned. While relational algebra has an operator for explicit division, SQL does not provide any keyword for explicit division. Hence, from the user perspective, the translation between SQL and relational algebra is not one-to-one. A user who is able to express explicit division in relational algebra does not find the same flexibility with SQL. The work described in this chapter is intended to cover this gap. Given two tables S and T such that the schema of S is a subset of the schema of T, a common type of database query requires finding all tuples of a table or view that are related to each and every one of the tuples of a second table or group; this is called relational division. For instance, it is very common to code queries for finding employees working on all projects, students who have completed a certain set of courses, etc. Such kinds of queries require a division operator, which is normally expressed internally as a combination of four core relational algebra operators, namely selection, projection, difference, and cross-product. The research community realized the importance of the division process and hence defined a standalone operator for explicitly expressing division as in the above two examples. However, the explicit division operator is not applicable when the divisor is dynamic. For instance, queries such as finding students who completed all first-year courses in their departments, or finding persons who ate at every restaurant in their neighborhood, are not doable using explicit division; these queries qualify as requiring implicit division because the divisor changes for different instances of the dividend. Implicit division could be coded by explicitly using the four core operators, though expressing implicit division is not an easy task. Having no divide keyword for expressing explicit division in SQL, as many as six query types in different SQL dialects have been identified, using the nested select, exists, not exists, except, contains, and count keywords (Elmasri & Navathe, 2006; McCann, 2003). We argue that it is necessary to provide for explicit division in SQL. The division operator was first instantiated from the perspective of relational algebra (Dadashzadeh, 1989) and further extended to relational databases in terms of flexible implementation (Bosc et al., 1997). Further discussions shift the focus to advanced computation, such as fuzzy systems (Galindo et al., 2001; Bosc & Pivert, 2006). Once integrated into SQL, explicit division will allow users to express their queries more easily, and hence one of the main targets of SQL (ease of use by end users) will be better satisfied. Realizing this
gap, the goal of this work is twofold: to devise a minimal division grammar that, when encountered, will be rewritten into two contemporary SQL division queries; and to implement such a query rewriter in Java, execute both rewritten queries on a sample database, and compare their returned sets for consistency. It is not intended to expand the existing SQL syntax or the already accepted SQL standard; rather, we have developed a wrapper that could be integrated on top of existing SQL. This way, end users will be able to express explicit division using DIVIDE as if it were one of the agreed-upon keywords of SQL. Then, the developed rewriter translates the SQL query with DIVIDE into an equivalent SQL query where the division is expressed in the traditional SQL way, and hence our approach does not affect the underlying query system. The rest of this chapter is organized as follows. The next section describes the division operator, followed by a discussion of the proposed explicit division in SQL. We then cover the division grammar parser and report the results of the testing process. A special divide-by-zero case is discussed before our conclusion of findings.
RELATIONAL DIVISION
Relational division is one of the eight basic operations in Codd's relational algebra (Darwen & Date, 1992). Though it can be coded using four of the five core relational algebra operators, it has been defined as a standalone operator to make the coding of queries involving explicit division easier. The concept is that a divisor relation is used to partition a dividend relation and produce a quotient or results table. The quotient table is made up of those values of one group of columns (or a single column) for which a second group of columns (or column) had all of the values in the divisor.
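As a small worked example (the tables and data are hypothetical and chosen to match the T1/T2 templates used in D0 and D1 below), let Enrolled(student, course) play the role of the dividend T1(A, B) and Required(course) play the role of the divisor T2(B):

Enrolled (dividend, T1)      Required (divisor, T2)
student   course             course
Ann       DB101              DB101
Ann       DB102              DB102
Bob       DB101

Enrolled ÷ Required = {Ann}: Ann is the only student related to every course in the divisor. The templates D0 and D1 below compute exactly this quotient over T1 and T2.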
D0: SELECT DISTINCT x.A
    FROM T1 AS x
    WHERE NOT EXISTS
      (SELECT * FROM T2 AS y
       WHERE NOT EXISTS
         (SELECT * FROM T1 AS z
          WHERE (z.A = x.A) AND (z.B = y.B)));
D1 is an alternate formulation which is simpler to implement (Brantner et al., 2007). It uses the membership test, group-by, counting, and having SQL constructs, yet computationally it is much faster (less table scanning) and semantically closer to the division in relational algebra.
D1: SELECT A
    FROM T1
    WHERE B IN (SELECT B FROM T2)
    GROUP BY A
    HAVING COUNT(*) = (SELECT COUNT(*) FROM T2);
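Instantiated on the hypothetical student/course tables above, D1 reads as follows; a student qualifies exactly when the number of his or her required-course enrollments equals the size of the divisor:

-- D1 applied to the worked example: Enrolled(student, course) is the
-- dividend T1, Required(course) is the divisor T2. This assumes that
-- (student, course) pairs are distinct, so COUNT(*) counts courses.
SELECT student
FROM Enrolled
WHERE course IN (SELECT course FROM Required)  -- keep only required-course rows
GROUP BY student
HAVING COUNT(*) = (SELECT COUNT(*) FROM Required);  -- must cover all of them
-- Returns 'Ann' for the sample data: she is enrolled in both required courses.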
This divide grammar is a departure from the one posited by Rantzau (Rantzau & Mangold, 2006), because this chapter assumes that a complete SQL query may be contained within each of the dividend and divisor clauses. This is possible as long as the projection of the final dividend relation contains an ordered set of attributes which is a superset of the projection of the final divisor relation's attributes.
The following symbols denote their corresponding query entity in bold. These will be used in the rewrite templates in the next sections.
By replacing the entities in square brackets with their appropriate values, a single divide grammar can be rewritten in a straightforward manner.
R1: SELECT DISTINCT x.[A-B]
    FROM [T1] AS x
    WHERE NOT EXISTS
      (SELECT [A∩B] FROM [T2] AS y
       WHERE NOT EXISTS
         (SELECT [A∩B] FROM [T1] AS z
          WHERE (z.[A-B] = x.[A-B]) AND (z.[A∩B] = y.[A∩B])));
However, it was discovered early on in this research project that R1 suffers from two important flaws: first, if T1 or T2 or both consist of more than one table, this template fails; second, the where-clauses of T1 and T2 collide with the template's where-not-exists clauses. That is why R2 is used instead.
R2: SELECT DISTINCT x.[A-B]
    FROM (SELECT [A-B] FROM [T1] WHERE [W1]) AS x
    WHERE NOT EXISTS
      (SELECT * FROM (SELECT [A∩B] FROM [T2] WHERE [W2]) AS y
       WHERE NOT EXISTS
         (SELECT * FROM (SELECT [A∩B] FROM [T1] WHERE [W1]) AS z
          WHERE (z.[A-B] = x.[A-B]) AND (z.[A∩B] = y.[A∩B])));
In R2, both the dividend and the divisor are each encased within a from statement which uses an alias to maintain the x-y-z relationship. Complete SQL statements may be retained in the dividend and divisor clauses. This is a highly impractical solution to the query rewrite problem because there are now six select statements. However, the classic rewrite is not intended as a solution for this project; rather, it serves as the comparison query against R0.
Divide-On Grammar
Consider a scenario where the dividend is known and the divisor is not entirely known. Such a divisor might be a stored query, a view, or a table where SELECT * is used but division is not desired across all of its attributes. Even more common might be that the SELECT * is desired, but the attributes are not in a compatible order with those of the dividend. A way to refine or limit which attributes to divide on is desired. The second form of the proposed divide grammar is
G2:  (table1) DIVIDE (table2) ON {attributes}
G2a: (SELECT i..j, r..s, t..v FROM table1 WHERE ...) DIVIDE (SELECT * FROM table2 WHERE ...) ON {r..s}
Virtually all conceivable forms of G2 can be written as G1; an example is given below. That is, the on keyword is not needed, but it is included as a convenience. In fact, when implemented in Java, if the on keyword is present, the B attributes will simply be replaced with the listed attributes. This is different from Rantzau's proposed grammar (Rantzau & Mangold, 2006), in which the on keyword is necessary.
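To illustrate the G2-to-G1 equivalence with a hypothetical pair of tables: a divide-on query can always be rephrased so that the divisor projects exactly the division attributes, at which point the on clause becomes redundant:

G2 form: (SELECT a, b FROM T1) DIVIDE (SELECT * FROM T2) ON {b}
G1 form: (SELECT a, b FROM T1) DIVIDE (SELECT b FROM T2)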
Step 1: Are the brackets matched? Yes. This step checks that the number of left brackets equals the number of right brackets.
Step 2: Is the divide keyword present? Yes. If it is missing, no rewrite is performed. A trivial regular expression checks for this keyword in any case.
Step 3: Locate the position in the query where the first divide keyword occurs. Actually, the divide keyword and its immediately surrounding brackets must be matched. This is to prevent a rogue table with the word divide in it from damaging the rewrite.
(SELECT a, b FROM T1) DIVIDE (SELECT b FROM T2)
Step 4: Scan leftward until the number of left brackets matches the number of right brackets from the beginning of the divide keyword.
(SELECT a, b FROM T1) DIVIDE (SELECT b FROM T2)
Step 5: Scan rightward until the number of right brackets matches the number of left brackets from the end of the divide keyword.
Step 6: Find the ON keyword and attributes, if any. None found.
Steps 7 and 8: Is another divide keyword present in either the dividend or the divisor? None found.
Step 9: Rewrite the query. A general SQL parser called Zql (Gibello, 2010) is used to extract from the dividend and divisor the select attributes, the from tables, and any where qualifiers. The final rewritten query is shown in Example 1. An analogous process exists for the R2 template rewrite.
Dividend: SELECT a, b, c
FROM T1, T2 WHERE T1.a = T2.a AND T1.b > 0
Divisor: SELECT b, c FROM T3, T4 WHERE T3.d = T4.d
On: b
Replacing (SELECT a, b, c FROM T1, T2 WHERE T1.a = T2.a AND T1.b > 0) Divide (SELECT b, c FROM T3, T4 WHERE T3.d = T4.d) ON {b}
with SELECT a, c FROM T1, T2 WHERE ((T1.a = T2.a) AND (T1.b > 0)) AND b IN (SELECT b, c FROM T3, T4 WHERE (T3.d = T4.d)) GROUP BY a, c HAVING COUNT(*) = (SELECT COUNT(*) FROM T3, T4 WHERE (T3.d = T4.d)).
The original query now looks like:
SELECT a, c FROM T1, T2 WHERE ((T1.a = T2.a) AND (T1.b > 0)) AND b IN (SELECT b, c FROM T3, T4 WHERE (T3.d = T4.d)) GROUP BY a, c HAVING COUNT(*) = (SELECT COUNT(*) FROM T3, T4 WHERE (T3.d = T4.d))) DIVIDE ((SELECT a, d FROM T1, T2 WHERE T1.a = T2.a AND T1.b < 0) Divide (SELECT d, c
FROM T3, T4 WHERE T3.d = T4.d) ON {d}) ON {a}
A divide grammar was found in the dividend:
(SELECT a, c FROM T1, T2 WHERE ((T1.a = T2.a) AND (T1.b > 0)) AND b IN (SELECT b, c FROM T3, T4 WHERE (T3.d = T4.d)) GROUP BY a, c HAVING COUNT(*) = (SELECT COUNT(*) FROM T3, T4 WHERE (T3.d = T4.d))) DIVIDE ((SELECT a, d FROM T1, T2 WHERE T1.a = T2.a AND T1.b < 0) Divide (SELECT d, c FROM T3, T4 WHERE T3.d = T4.d) ON {d}) ON {a}
Dividend: SELECT a, c FROM T1, T2 WHERE ((T1.a = T2.a) AND (T1.b > 0)) AND b IN (SELECT b, c FROM T3, T4 WHERE (T3.d = T4.d)) GROUP BY a, c HAVING COUNT(*) = (SELECT COUNT(*) FROM T3, T4 WHERE (T3.d = T4.d))
Divisor: (SELECT a, d FROM T1, T2 WHERE T1.a = T2.a AND T1.b < 0) Divide (SELECT d, c FROM T3, T4 WHERE T3.d = T4.d) ON {d}
On: a
Another divide grammar was found in the divisor:
(SELECT a, d FROM T1, T2
WHERE T1.a = T2.a AND T1.b < 0) Divide (SELECT d, c FROM T3, T4 WHERE T3.d = T4.d) ON {d}
Dividend: SELECT a, d FROM T1, T2 WHERE T1.a = T2.a AND T1.b < 0
Divisor: SELECT d, c FROM T3, T4 WHERE T3.d = T4.d
On: d
Replacing (SELECT a, d FROM T1, T2 WHERE T1.a = T2.a AND T1.b < 0) Divide (SELECT d, c FROM T3, T4 WHERE T3.d = T4.d) ON {d}
with SELECT a FROM T1, T2 WHERE ((T1.a = T2.a) AND (T1.b < 0)) AND d IN (SELECT d, c FROM T3, T4 WHERE (T3.d = T4.d)) GROUP BY a HAVING COUNT(a) = (SELECT COUNT(*) FROM T3, T4 WHERE (T3.d = T4.d)).
The original query now looks like:
(SELECT a, c FROM T1, T2 WHERE ((T1.a = T2.a) AND (T1.b > 0)) AND b IN (SELECT b, c FROM T3, T4 WHERE (T3.d = T4.d)) GROUP BY a, c HAVING COUNT(*) = (SELECT COUNT(*) FROM T3, T4 WHERE (T3.d = T4.d)))
DIVIDE (SELECT a FROM T1, T2 WHERE ((T1.a = T2.a) AND (T1.b < 0)) AND d IN (SELECT d, c FROM T3, T4 WHERE (T3.d = T4.d)) GROUP BY a HAVING COUNT(a) = (SELECT COUNT(*) FROM T3, T4 WHERE (T3.d = T4.d))) ON {a}
The final rewrite would then proceed exactly as before for the last divide keyword.
Limitations
The logic of the rewrite algorithm is sound. The experimental parser and rewriter, however, have some limitations. Notably, the target database's metadata is not polled to make sure that tables exist, attributes exist, and attributes are on domains that make sense. As this is a pure character-string rewriter, it is not expected to do this. These checks have been left out because the DBMS, of course, will check the final rewritten query for correctness, so these limitations are relegated to the DBMS. Actually, this is not really a limitation, because our target is to make the coding of queries that involve explicit division an easy task. The target is achieved once a query with explicit division is translated into an equivalent SQL query. It is then the duty of the SQL parser within the DBMS to check the syntax of the query for correctness. The Zql parser used in this experiment cannot properly parse complicated select statements. A new lightweight general SQL parser is needed for exhaustive query handling. Also, extracting aggregate functions and having and group by clauses inside the dividend or divisor was not included in this experiment, so they would not interfere with the count-method template. This limits the nested query rewriting abilities.
EXPERIMENTS
A suite of JUnit tests was created to test varying complexities of divide queries. For each division query, a count-method rewrite and a not-exists/not-exists-method rewrite were performed. As long as the dividend and divisor were each valid, with select attributes that could be extracted with Zql, the rewrites succeeded. The implemented front end has been integrated into a complete system that communicates with MySQL. The integrated system takes any SQL statement, rewrites the query if it contains explicit division (expressed using DIVIDE), and then executes the query to return the results. All the translated queries were successfully run. Finally, the verbatim translation process has been tested by coding queries that do not refer to existing tables/attributes. The queries were correctly translated by the front end into
corresponding SQL queries that involve the same tables/attributes. However, the MySQL parser reported them as erroneous queries because they refer to non-existing tables/attributes.
Performance
Consistently, the count-method division returned results more than an order of magnitude faster than the not-exists/except division method, while returning the same result set.1 Leinders and den Bussche (2005) show an interesting theoretical result about the divide operation. Algorithms implementing the divide operation based on sorting and counting, as in R0, can achieve a time complexity of O(n log n). In contrast, they show that any expression of the divide operator in the relational algebra with union, difference, projection, selection, and equi-joins must produce intermediate results of quadratic size. The resources and time needed to rewrite a character string as outlined in this chapter are negligible. The implementation was tested on Windows XP Professional (3.39 GHz, 1 GB of RAM). We randomly sampled DIVIDE queries from a pool of different cases. The performance chart is shown in Figure 1, where the data labels are the average time (ms) per query. We observe that the parsing and rewrite execute efficiently on average, including for the large sampled data sets.
DIVISION BY ZERO
An interesting question arises when the divisor table is empty: what should the result be? Using the count-method rewrite, the empty set is returned. The not-exists/except-method rewrite, on the other hand, enumerates all the rows of the dividend table. In the authors' opinion, it makes the most sense to return an empty set: just as algebraic division by zero produces a nonsense result, so does returning all the rows of the dividend table.
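The behavior of the count-method rewrite on an empty divisor can be read directly off the D1 template; a minimal sketch:

-- With T2 empty, the subquery (SELECT COUNT(*) FROM T2) yields 0, while the
-- IN filter eliminates every dividend row, so no group is formed at all and
-- the result is the empty set.
SELECT A FROM T1
WHERE B IN (SELECT B FROM T2)                 -- empty T2: no row survives
GROUP BY A
HAVING COUNT(*) = (SELECT COUNT(*) FROM T2);  -- would require COUNT(*) = 0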
CONCLUSION
As a proof of concept, this project succeeded. The divide grammars outlined in this chapter can be rewritten into more than one contemporary SQL division method. Therefore, it is possible to add the divide keyword to the standard SQL language. It should be up to each vendor how they would like to implement a division algorithm, but they should agree that the division grammar proposed in this chapter is very straightforward. It has been successfully integrated into MySQL, and the testing has been very encouraging. Currently, we are developing a user-friendly interface for the integrated system.
REFERENCES
Bosc, P., Dubois, D., Pivert, O., & Prade, H. (1997). Flexible queries in relational databases: The example of the division operator. Theoretical Computer Science, 171(1/2), 281–302. doi:10.1016/S0304-3975(96)00132-6
Bosc, P., & Pivert, O. (2006). About approximate inclusion and its axiomatization. Fuzzy Sets and Systems, 157, 1438–1454. doi:10.1016/j.fss.2005.11.011
Brantner, M., May, N., & Moerkotte, G. (2007). Unnesting scalar SQL queries in the presence of disjunction. In Proceedings of IEEE ICDE.
Chong, E. I., Das, S., Eadon, G., & Srinivasan, J. (2005). An efficient SQL-based RDF querying scheme. In Proceedings of VLDB.
Codd, E. (1972). Relational completeness of database sub-languages. In R. Rustin (Ed.), Courant Computer Science Symposium 6: Database Systems (pp. 65-98). Prentice-Hall.
Dadashzadeh, M. (1989). An improved division operator for relational algebra. Information Systems, 14(5), 431–437. doi:10.1016/0306-4379(89)90007-0
Darwen, H., & Date, C. (1992). Into the great divide. In Date, C., & Darwen, H. (Eds.), Relational database: Writings 1989-1991 (pp. 155–168). Reading, MA: Addison-Wesley.
Date, C. (1995). An introduction to database systems (6th ed.).
Elmasri, R., & Navathe, S. B. (2006). Fundamentals of database systems (5th ed.). Addison Wesley.
Galindo, J., Medina, J. M., & Aranda-Garrido, M. C. (2001). … Fuzzy Sets and Systems, 121(3), 471–490. doi:10.1016/S0165-0114(99)00156-6
Gibello, P. (2010). Zql: A Java SQL parser. Retrieved June 2010, from http://www.gibello.com/code/zql/
Harris, S., & Shadbolt, N. (2005). SPARQL query processing with conventional relational database systems. In Proceedings of SSWS.
Hung, E., Deng, Y., & Subrahmanian, V. S. (2005). RDF aggregate queries and views. In Proceedings of IEEE ICDE.
Karvounarakis, G., Alexaki, S., Christophides, V., Plexousakis, D., & Scholl, M. (2002). RQL: A declarative query language for RDF. In Proceedings of WWW.
Leinders, D., & den Bussche, J. V. (2005). On the complexity of division and set joins in the relational algebra. In Proceedings of ACM PODS, Baltimore, MD, USA.
Maier, D. (1983). The theory of relational databases. Computer Science Press.
Matos, V. M., & Grasser, R. (2002). A simpler (and better) SQL approach to relational division. Journal of Information Systems Education, 13(2).
McCann, L. (2003). On making relational division comprehensible. In Proceedings of the ASEE/IEEE Frontiers in Education Conference.
Pan, Z., & Heflin, J. (2003). DLDB: Extending relational databases to support Semantic Web queries. In Proceedings of PSSS.
Prud'hommeaux, E. (2005). Notes on adding SPARQL to MySQL. Retrieved from http://www.w3.org/2005/05/22-SPARQL-MySQL/
Rantzau, R., & Mangold, C. (2006). Laws for rewriting queries containing division operators. In Proceedings of IEEE ICDE.
XAMPP. (2010). An Apache distribution containing MySQL. Retrieved June 2010, from http://www.apachefriends.org/en/xampp.html
ENDNOTE
1
Chapter 13
ABSTRACT
Recently, there has been a lot of interest in the application of graphs in different domains. Graphs have been widely used for data modeling in different application domains such as chemical compounds, protein networks, social networks, and the Semantic Web. Given a query graph, the task of retrieving related graphs as a result of the query from a large graph database is a key issue in any graph-based application. This has raised a crucial need for efficient graph indexing and querying techniques. This chapter provides an overview of different techniques for indexing and querying graph databases. An overview of several proposals for graph query languages is also given. Finally, the chapter provides a set of guidelines for future research directions.
INTRODUCTION
The field of graph databases and graph query processing has received a lot of attention due to the constantly increasing usage of graph data structures for representing data in different domains such as chemical compounds (Klinger & Austin, 2005), multimedia databases (Lee et al., 2005), social networks (Cai et al., 2005), protein networks (Huan et al., 2004), and the Semantic Web (Manola & Miller, 2004). To effectively understand and utilize any collection of graphs, a graph database that efficiently supports elementary querying mechanisms is crucially required. Hence, determining the graph database members which constitute the answer set of a graph query q from a large graph database is a key performance issue
in all graph-based applications. A primary challenge in computing the answers of graph queries is that pair-wise comparisons of graphs are usually computationally hard problems. For example, subgraph isomorphism is known to be NP-complete (Garey & Johnson, 1979). A naive approach to compute the answer set of a graph query q is to perform a sequential scan on the graph database and to check whether each graph database member satisfies the conditions of q or not. However, the graph database can be very large, which makes the sequential scan over the database impracticable. Thus, finding an efficient search technique is immensely important due to the combined costs of pair-wise comparisons and the increasing size of modern graph databases. It is apparent that the success of any graph database application is directly dependent on the efficiency of the graph indexing and query processing mechanisms. Recently, many techniques have been proposed to tackle these problems. This chapter gives an overview of different techniques for indexing and querying graph databases and classifies them according to their target graph query types and their indexing strategy. The rest of the chapter is organized as follows. The Preliminaries section introduces the preliminaries of graph databases and graph query processing. In Section (Subgraph Query Processing), a classification of the approaches to the subgraph query processing problem and their index structures is given, while Section (Supergraph Query Processing) focuses on the approaches for resolving the supergraph query processing problem. Section (Graph Similarity Queries) discusses the approaches to approximate graph matching queries. Section (Graph Query Languages) gives an overview of several proposals for graph query languages. Finally, Section (Discussion and Conclusions) concludes the chapter and provides some suggestions for possible future research directions on the subject.
PRELIMINARIES
In this section, we introduce the basic terminologies used in this chapter and give the formal definition of graph querying problems.
In principle, there are two main types of graph databases. The first type consists of a small number of very large graphs, such as the Web graph and social networks (non-transactional graph databases). The
second type consists of a large set of small graphs, such as chemical compounds and biological pathways (transactional graph databases). The main focus of this chapter is on giving an overview of the efficient indexing and querying mechanisms for the second type of graph databases.
Graph Queries
In principle, queries in transactional graph databases can be broadly classified into the following main categories:
1. Subgraph queries: this category searches for a specific pattern in the graph database. The pattern can be either a small graph or a graph where some parts of it are uncertain, e.g., vertices with wildcard labels. Therefore, given a graph database D = {g1, g2, ..., gn} and a subgraph query q, the query answer set A = {gi | q ⊆ gi, gi ∈ D}. A graph q is described as a subgraph of another graph database member gi if the set of vertices and edges of q form a subset of the vertices and edges of gi. To be more formal, let us assume that we have two graphs g1(V1, E1, Lv1, Le1, Fv1, Fe1) and g2(V2, E2, Lv2, Le2, Fv2, Fe2). g1 is defined as a subgraph of g2 if and only if: for every distinct vertex x ∈ V1 with a label vl ∈ Lv1, there is a distinct vertex y ∈ V2 with the label vl ∈ Lv2; and for every distinct edge ab ∈ E1 with a label el ∈ Le1, there is a distinct edge ab ∈ E2 with the label el ∈ Le2. Figure 1(a) illustrates the subgraph search problem. Figure 2(a) shows an example of a graph database. Figure 2(b) illustrates examples of graph queries (q1 and q2). Let us assume that these queries are subgraph queries. If we evaluate these queries over the sample graph database (Figure 2(a)), then the answer set of q1 will consist of the graph database members g1 and g2, while the answer set of q2 will be empty. The more general type of the subgraph search problem is the subgraph isomorphism search problem, which is defined as follows. Let g1 = (V1, E1, Lv1, Le1, Fv1, Fe1) and g2 = (V2, E2, Lv2, Le2, Fv2, Fe2) be two graphs; g1 is defined as graph isomorphic to g2 if and only if there exists at least one bijective function f: V1 → V2 such that: 1) for any edge uv ∈ E1, there is an edge f(u)f(v) ∈ E2; 2) Fv1(u) = Fv2(f(u)) and Fv1(v) = Fv2(f(v)); 3) Fe1(uv) = Fe2(f(u)f(v)).
2. Supergraph queries: this category searches for the graph database members whose whole structures are contained in the input query. Therefore, given a graph database D = {g1, g2, ..., gn} and a supergraph query q, the query answer set A = {gi | q ⊇ gi, gi ∈ D}. Figure 1(b) illustrates the supergraph search problem. Let us assume that the graph queries of Figure 2(b) are supergraph queries. If we evaluate these queries over the sample graph database (Figure 2(a)), then the answer set of q1 will be empty, while the answer set of q2 will contain the graph database member g3.
3. Similarity (Approximate Matching) queries: this category finds graphs which are similar, but not necessarily isomorphic, to a given query graph. Given a graph database D = {g1, g2, ..., gn} and a query graph q, similarity search is to discover all graphs that are approximately similar to the graph query q. A key question in graph similarity queries is how to measure the similarity between a target graph member of the database and the query graph. In fact, it is difficult to give a precise definition of graph similarity. Different approaches have proposed different similarity metrics for graph data structures (Bunke & Shearer, 1998; Fernandez & Valiente, 2001; Raymond et al., 2002). Discussing these different similarity metrics is out of the scope of this chapter. We refer the interested reader to a detailed survey in (Gao et al., 2009).
SUBGRAPH QUERY PROCESSING
In general, these filter-and-verify approaches process a subgraph query q in two steps:
1. Identifying the set of features of the subgraph query q.
2. Using the inverted index to retrieve all graphs that contain the same features as q.
The rationale behind this type of query processing technique is that if some features of graph q do not exist in a data graph G, then G cannot contain q as its subgraph (inclusion logic). Formally, if q is a subgraph of G, then every feature of q must also be a feature of G; hence, any database graph that misses at least one feature of q can be safely pruned. Clearly, the effectiveness of these filtering methods depends on the quality of the mining techniques to effectively identify the set of features. Therefore, important decisions need to be made about the indexing feature and the number and size of indexing features. These decisions crucially affect the cost of the mining process and the pruning power of the indexing mechanism. A main limitation of these approaches is that the quality of the selected features may degrade over time after lots of insertions and deletions. In this case, the set of features of the whole updated graph database needs to be re-identified, and the index needs to be rebuilt from scratch. It should be noted that achieving these tasks is quite time consuming.
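Assuming a relational realization of the inverted index (the table and column names below are ours), the inclusion-logic filtering step can be phrased as a counting query, much like the relational division pattern:

-- Inverted index: one row per (feature, graph) pair.
CREATE TABLE feature_index (feature_id INTEGER, graph_id INTEGER);

-- Filtering for a query q with features {1, 2, 3}: a graph survives only
-- if it contains all three features; survivors still require verification
-- by subgraph isomorphism testing.
SELECT graph_id
FROM feature_index
WHERE feature_id IN (1, 2, 3)
GROUP BY graph_id
HAVING COUNT(DISTINCT feature_id) = 3;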
length, and records the number of occurrences of each path. Hence, in this index table, each row stands for a path and each column stands for a graph. Each entry in the table is the number of occurrences of the path in the graph. In query processing, the path index is used to find a set of candidate graphs which contain the paths in the query structure and to check whether the counts of such paths are beyond the thresholds specified in the query. In the verification step, each candidate graph is examined by subgraph isomorphism to obtain the final results. The main strength of this approach is that the indexing process for paths of limited length is usually fast. However, the size of the indexed paths can drastically increase with the size of the graph database. In addition, the filtering power of the path data structure is limited. Therefore, the verification cost can be very high due to the large size of the candidate set.
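A minimal relational sketch of such a path index (all names are hypothetical) stores one occurrence count per path and graph, and the filtering step checks the counts of the query's paths:

-- Path index: occurrences of each label path (up to the length limit) per graph.
CREATE TABLE path_index (path VARCHAR(200), graph_id INTEGER, occurrences INTEGER);

-- Candidate graphs for a query containing path 'C-C-O' at least twice
-- and path 'C-N' at least once:
SELECT p1.graph_id
FROM path_index p1 JOIN path_index p2 ON p1.graph_id = p2.graph_id
WHERE p1.path = 'C-C-O' AND p1.occurrences >= 2
  AND p2.path = 'C-N'   AND p2.occurrences >= 1;
-- Each candidate is then verified by a subgraph isomorphism test.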
GDIndex
Williams et al. (2007) have presented an approach for graph database indexing using a structured graph decomposition, named GDIndex. In this approach, all connected and induced subgraphs of a given graph are enumerated. Therefore, a graph of size n is decomposed into at most 2^n subgraphs when each of the vertices has a unique label. However, due to isomorphism between enumerated graphs, a complete graph with multiple occurrences of the same label may decompose into fewer subgraphs. If all labels are identical, a complete graph of size n is decomposed into just n + 1 subgraphs. A directed acyclic graph (DAG) is constructed to model the decomposed graphs and the containment relationships between them. In this DAG, there is always one node that represents the whole graph G and one node that represents the null graph. The children of a node P are all graphs Q where there is a directed link in the DAG between P and Q. Moreover, the descendants of a node P are all nodes that are reachable from P in the DAG. Figure 3 depicts a sample graph decomposition using the GDIndex approach. A hash table is used to index the subgraphs enumerated during the decomposition process. The hash key of each subgraph is determined from the string value given by the canonical code of the subgraph. This canonical code is computed from its adjacency matrix. In this way, all isomorphic graphs produce the same hash key. Since all entries in the hash table are in canonical form, only one entry is made for each unique canonical code. This hash table enables the search function to quickly locate any node in the decomposition DAG which is isomorphic to a query graph, if it exists. Therefore, in the query-processing step, the hash key of the graph query q is computed from the query's canonical code. This
computed hash value of the graph query is then used to identify and verify the set of graphs that match the canonical code of the graph query. A clear advantage of the GDIndex approach is that no candidate verification is required. However, the index is designed for databases that consist of relatively small graphs and do not have a large number of distinct graphs.
GString
The GString approach (Jiang et al., 2007) considers the semantics of the graph structures in the database. It focuses on modeling graph objects in the context of organic chemistry using basic structures (Line, Star, and Cycle) that have semantic meaning, and uses them as indexing features. A Line structure denotes a structure consisting of a series of vertices connected end to end, a Cycle structure denotes a structure consisting of a series of vertices that form a closed loop, and a Star structure denotes a structure where a core vertex directly connects to several vertices. For a graph g, GString first extracts all Cycle structures, then it extracts all Star structures, and finally it identifies the remaining structures as Line structures. Figure 4 represents a sample graph representation using the GString basic structures. GString represents both graphs and queries on graphs as string sequences and transforms the subgraph search problem into the subsequence string-matching domain. A suffix-tree-based index structure over the string representations is then created to support an efficient string-matching process. Given a basic structure, its GString has three components: a type, a size, and a set of annotations (edits). For a Line or Cycle, the size is the number of vertices in the structure. For a Star, the size indicates the fanout of the central vertex. For a query graph q, GString derives its summary string representation, which is then matched against the suffix tree of the graph database. An element of a summary string matches a node in the suffix tree if their types match, their sizes are equal or the size in the query is no more than the size in the node, and the counts of the corresponding types of edits in the query are no larger than those in the node. A key disadvantage of the GString approach is that converting subgraph search queries into a string-matching problem can be an inefficient approach, especially if the size of the graph database or the subgraph query is large. Additionally, GString focuses on decomposing chemical compounds into basic structures that have semantic meaning in the context of organic chemistry, and it is not trivial to extend this approach to other application domains.
GraphREL
Sakr (2009) has presented a purely relational framework for processing graph queries named GraphREL. In this approach, the graph data set is encoded using an intuitive Vertex-Edge relational mapping scheme (Figure 5), and the graph query is translated into a sequence of SQL evaluation steps over the defined storage scheme. An obvious problem in the relational evaluation of graph queries is the huge cost that may result from the large number of join operations that must be performed between the encoding relations. Several relational query optimization techniques have been exploited to speed up the search in the context of graph queries. The main optimization technique of GraphREL is based on the observation that the size of the intermediate results dramatically affects the overall evaluation performance of SQL scripts (Teubner et al., 2008). Therefore, GraphREL keeps statistical information about the less frequently occurring nodes and edges in the graph database in the form of simple Markov tables. For a graph query q, the maintained statistical information is used to identify the highest pruning points on its structure (the nodes or edges with very low frequency), so as to filter out, as early and as completely as possible, the false-positive graphs that are guaranteed not to appear in the final results, before passing the candidate result set to an optional verification process. This statistical information is also used to influence the decisions of the relational query optimizer, through selectivity annotations of the translated query predicates, so that it selects the most efficient join order and the cheapest execution plan (Bruno et al., 2009). Based on the fact that the number of distinct vertex and edge labels is usually far smaller than the number of vertices and edges in graph databases, GraphREL utilizes the partitioned B-tree indexing mechanism of relational databases to reduce the secondary-storage access costs to a minimum. GraphREL applies an optional verification process only if more than one vertex of the set of query vertices has the same label. For large graph queries, GraphREL applies a decomposition mechanism to divide the large and complex SQL translation query into a sequence of intermediate queries (using temporary tables) before evaluating the final results. This decomposition mechanism reuses the
statistical summary information in an effective selectivity-aware decomposition process and reduces the size of the intermediate result of each step.
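The following minimal sketch illustrates a Vertex-Edge encoding and the join-based SQL translation of a tiny query pattern in the spirit of GraphREL; the table layout and names are assumptions for illustration, not the paper's exact schema.

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.executescript("""
    CREATE TABLE vertex (gid INT, vid INT, label TEXT);
    CREATE TABLE edge   (gid INT, src INT, dst INT, label TEXT);
""")
# One database graph: A --x--> B --y--> C
con.executemany("INSERT INTO vertex VALUES (?, ?, ?)",
                [(1, 1, 'A'), (1, 2, 'B'), (1, 3, 'C')])
con.executemany("INSERT INTO edge VALUES (?, ?, ?, ?)",
                [(1, 1, 2, 'x'), (1, 2, 3, 'y')])

# Query pattern ?u(A) --x--> ?v(B): one vertex-table alias per query vertex
# and one edge-table alias per query edge, all joined on the graph id.
rows = con.execute("""
    SELECT u.gid, u.vid, v.vid
    FROM vertex u, vertex v, edge e
    WHERE u.gid = v.gid AND u.gid = e.gid
      AND u.label = 'A' AND v.label = 'B'
      AND e.src = u.vid AND e.dst = v.vid AND e.label = 'x'
""").fetchall()
print(rows)   # [(1, 1, 2)]
```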
Tree-Based Mining
Zhang et al. (2007) have presented an approach that uses frequent subtrees as the indexing unit for graph structures, named TreePi. The main idea of this approach is based on two observations: 1) trees are more complex patterns than paths, yet they can preserve almost the same amount of structural information as arbitrary subgraph patterns; 2) frequent subtree mining is considerably easier than general frequent subgraph mining. Therefore, TreePi starts by mining the frequent trees of the graph database and then selects a set of frequent trees as index patterns. During query processing, for a query graph q, the frequent subtrees in q are identified and then matched against the set of indexing features to obtain a candidate set. In the verification phase, the location information partially stored with the feature trees is exploited to devise efficient subgraph isomorphism tests. As the canonical form of any tree can be calculated in polynomial time, the indexing
and searching operations can be effectively improved. Moreover, operations on trees, such as isomorphism testing and normalization, are asymptotically simpler than their counterparts on general graphs, which are usually NP-complete (Fortin, 1996). Zhao et al. (2007) have extended the ideas of TreePi (Zhang et al., 2007) to achieve better pruning ability by adding a small number of discriminative graphs (Δ) to the frequent tree-features in the index structure. They propose a new graph indexing mechanism, named Tree+Δ, which first selects frequent tree-features as the basis of a graph index and then, on demand, selects a small number of discriminative graph-features that can prune graphs more effectively than the selected tree-features, without conducting costly graph mining beforehand.
Grafil
Yan et al. (2005) have proposed a feature-based structural filtering algorithm named Grafil (Graph Similarity Filtering) to perform substructure similarity search in graph databases. Grafil models each query graph as a set of features and transforms edge deletions into feature misses in the query graph. With an upper bound on the maximum allowed number of feature misses, Grafil can filter many graphs directly without performing pairwise similarity computation. It uses two data structures: a feature-graph matrix and an edge-feature matrix. The feature-graph matrix is used to compute the difference in the number of features between a query graph and the graphs in the database; each column corresponds to a target graph in the graph database, each row corresponds to a feature being indexed, and each entry records the number of embeddings of a specific feature in a target graph. The edge-feature matrix is used to compute a bound on the maximum allowed feature misses based on a query relaxation ratio; each row represents an edge while each column represents an embedding of a feature. Grafil uses a multi-filter composition strategy, where each filter uses a distinct and complementary subset of the features. The filters are constructed by a hierarchical, one-dimensional clustering algorithm that groups features with similar selectivity into a feature set. During query processing, the feature-graph matrix is used to calculate the difference in the number of features between each graph database member gi and the query q. If the difference is greater than a user-defined parameter dmax, the graph is discarded; the remaining graphs constitute a candidate answer set. The substructure similarity is then calculated for each candidate to prune the false-positive candidates. A loop of query relaxation steps can be applied if the user needs more matches than those returned for the current value of dmax.
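A minimal sketch of this filtering step follows, assuming the feature-graph matrix is kept as per-graph embedding counts and the bound dmax has already been derived from the relaxation ratio; the data layout is an assumption for illustration.

```python
def filter_candidates(db_counts, query_counts, d_max):
    """db_counts: {graph_id: {feature: embedding count}} (the feature-graph
    matrix, column by column); query_counts: {feature: count in the query}.
    A graph misses max(0, q - g) embeddings of each feature; keep graphs
    whose total miss does not exceed d_max."""
    candidates = []
    for gid, counts in db_counts.items():
        miss = sum(max(0, qc - counts.get(f, 0))
                   for f, qc in query_counts.items())
        if miss <= d_max:
            candidates.append(gid)
    return candidates

db = {'g1': {'f1': 3, 'f2': 1}, 'g2': {'f1': 1}}
print(filter_candidates(db, {'f1': 2, 'f2': 1}, d_max=1))   # ['g1']
```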
Closure Tree
He & Singh (2006) have proposed a tree-based index structure named CTree (Closure-Tree). The CTree index is very similar to the R-tree indexing mechanism (Guttman, 1984), but extended to support graph-matching queries. In this index structure, each node in the tree contains discriminative information about its descendants in order to facilitate effective pruning. The closure of a set of vertices is defined as a generalized vertex whose attribute is the union of the attribute values of the vertices. Similarly, the closure of a set of edges is defined as a generalized edge whose attribute is the union of the attribute values of the edges. The closure of two graphs g1 and g2 under a mapping M is defined as a generalized graph (V, E) where V is the set of vertex closures of the corresponding vertices and E is the set of edge closures of the corresponding edges (Figure 6). Hence, a graph closure has the same characteristics as a graph; the only difference is that a database graph has singleton labels on its vertices and edges while a graph closure can have multiple labels. In a closure tree, each node is the graph closure of its children, where the children of an internal node are nodes and the children of a leaf node are database graphs. A subgraph query is processed in two phases. The first phase traverses the CTree, pruning nodes based on a pseudo subgraph isomorphism, and returns a candidate answer set. The second phase verifies each candidate answer for exact subgraph isomorphism and returns the answers. In addition to pruning based on pseudo subgraph isomorphism, a lightweight histogram-based pruning is also employed; the histogram of a graph is a vector that counts the number of each distinct attribute of the vertices and edges. For similarity queries,
CTree defines graph similarity based on edit distance and computes it using heuristic graph mapping methods. It conceptually approximates subgraph isomorphism by sub-isomorphism using adjacent subgraphs, and then approximates the latter by sub-isomorphism using adjacent subtrees.
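The following minimal sketch computes the closure of two labeled graphs under a given vertex mapping; for simplicity, edges present in only one of the graphs keep just that label, whereas a fuller implementation would record a marker for the missing edge.

```python
def graph_closure(v1, e1, v2, e2, mapping):
    """v1, v2: {vertex: label}; e1, e2: {(u, w): label} with u < w;
    mapping: vertex of g1 -> corresponding vertex of g2."""
    # Vertex closure: union of the labels of corresponding vertices.
    vc = {a: {v1[a], v2[mapping[a]]} for a in v1}
    # Edge closure: union of the labels of corresponding edges.
    ec = {}
    for (a, b), lab in e1.items():
        m = tuple(sorted((mapping[a], mapping[b])))
        ec[(a, b)] = {lab} | ({e2[m]} if m in e2 else set())
    return vc, ec

vc, ec = graph_closure({1: 'C', 2: 'N'}, {(1, 2): 'single'},
                       {1: 'C', 2: 'O'}, {(1, 2): 'double'}, {1: 1, 2: 2})
print(vc, ec)   # {1: {'C'}, 2: {'N', 'O'}} {(1, 2): {'single', 'double'}}
```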
SAGA
Tian et al. (2007) have presented an approach for approximate graph matching named SAGA (Substructure Index-based Approximate Graph Alignment). SAGA measures graph similarity by a distance value such that graphs that are more similar have a smaller distance. The distance model contains three components:
1. The StructDist component measures the structural differences for the matching node pairs in the two graphs.
2. The NodeMismatches component is the penalty associated with matching two nodes with different labels.
3. The NodeGaps component measures the penalty for the gap nodes in the query graph.
SAGA's index is built on small substructures of the graphs in the database (the Fragment Index). Each fragment is a set of k nodes from a database graph, where k is a user-defined parameter. The index does not enumerate all possible k-node sets: a second user-defined parameter distmax prevents indexing any pair of nodes in a fragment whose distance measure exceeds distmax. The fragments in SAGA do not always correspond to connected subgraphs; the reason is to allow node gaps in the matching process. To efficiently evaluate the subgraph distance between a query graph and a database graph, an additional index called the DistanceIndex is also maintained; it is used to look up the pre-computed distance between any pair of nodes in a graph. The graph matching process goes through the following three steps:
1. The query is broken into small fragments, and the Fragment Index is probed to find database fragments that are similar to the query fragments.
2. The hits from the index probes are combined to produce larger candidate matches. A hit-compatible graph is built for each matching graph. Each node in the hit-compatible graph corresponds to a pair of matching query and database fragments. An edge is drawn between two nodes in the hit-compatible graph if and only if the two query fragments share zero or more nodes and the corresponding database fragments also share the same corresponding nodes. An edge between two nodes indicates that the corresponding two hits can be merged to form a larger match, which then becomes a candidate match.
3. Each candidate is examined to produce the query results. For each candidate, the percentage of gap nodes is checked. If it exceeds a user-defined threshold Pg, the candidate match is discarded; otherwise, the DistanceIndex is probed to calculate the real subgraph matching distance. If two matches have the same matching distance and one is a submatch of the other, only the supermatch is considered.
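To make the distance model concrete, the following minimal sketch combines the three components for a given node mapping, using a weighted sum; the weights, the data layout, and the use of precomputed node-pair distances as a stand-in for the DistanceIndex are illustrative assumptions, not SAGA's exact formulation.

```python
import itertools

def match_distance(q_labels, t_labels, mapping, q_dist, t_dist,
                   w_struct=1.0, w_mismatch=1.0, w_gap=1.0):
    """q_labels/t_labels: {node: label}; mapping: query node -> target node,
    with None marking a gap node; q_dist/t_dist: precomputed node-pair
    distances keyed by frozenset pairs (a DistanceIndex stand-in)."""
    matched = [q for q, t in mapping.items() if t is not None]
    gaps = len(mapping) - len(matched)                      # NodeGaps
    mismatches = sum(1 for q in matched                     # NodeMismatches
                     if q_labels[q] != t_labels[mapping[q]])
    struct = sum(abs(q_dist[frozenset((a, b))]              # StructDist
                     - t_dist[frozenset((mapping[a], mapping[b]))])
                 for a, b in itertools.combinations(matched, 2))
    return w_struct * struct + w_mismatch * mismatches + w_gap * gaps
```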
Table 1 provides a comparison between the different graph indexing techniques in terms of their supported query types, indexing unit and indexing strategy.
During query evaluation, node variables are bound to nodes of the database graph such that all node conditions in the WHERE clause evaluate to TRUE. The query result is constructed from these variable bindings according to the subgraph specification of the SELECT clause. Query evaluation considers each node variable in the FROM clause. For each of these variables, all possible assignments of the variable to nodes of the graph are determined for which the conditions of the WHERE clause mentioning only this variable evaluate to TRUE. Node variables are equally assigned to molecules and interactions. Once all possible bindings are computed for each node variable, the Cartesian product of these sets is computed. From this set, all instances are removed for which the entire WHERE clause evaluates to FALSE, and all distinct assignments from the remaining elements of the Cartesian product are combined to form the match graph. In general, an SQL query operates on tables and produces a table (a set of rows) in the same manner as a PQL query operates on a graph and produces a graph (a set of nodes), not a set of graphs. In the result of an SQL query, the rows from the Cartesian product are preserved, while columns might be removed, added, or changed in value. In PQL, by contrast, the concrete combinations of bindings of different node variables that together fulfill the WHERE clause are not preserved in the match graph: the match graph simply collapses all bindings present in the filtered Cartesian product into a flat, duplicate-free list of nodes.

He and Singh (2008) have proposed a general graph query and manipulation language called GraphQL. In this language, graphs are considered the basic unit of information and each query manipulates one or more collections of graphs. It also targets graph databases that support arbitrary attributes on nodes, edges, and graphs. The core of GraphQL is its data model and its graph algebra. In the GraphQL data model, a graph pattern is represented as a graph structure and a predicate on attributes of the graph. Each node, edge, or graph can have arbitrary attributes; a tuple with a list of name-value pairs is used to denote these attributes. In the GraphQL algebra, graphs are the basic unit of information. Each operator takes one or more collections of graphs as input and generates a collection of graphs as output. For example, the selection operator takes a collection of graphs as input and produces, as output, a collection of graphs that match the graph pattern. A graph pattern can match a specific graph database member many times; therefore, an exhaustive option is used to specify whether it should return one or all possible matches. A Cartesian product operator takes two collections of graphs C and D and produces a collection of graphs as output, where each output graph is composed of a graph from C and another from D. The join operator is defined as a Cartesian product operator followed by a selection operator. The composition operator generates new graphs by combining information from matched graphs based on graph templates that specify the output structure of the graphs.

Consens & Mendelzon (1990) have proposed a graphical query language for graph databases called GraphLog. GraphLog queries ask for patterns that must be present or absent in the database graph. In GraphLog, the query graph can define a set of new edges that are added to the graph whenever the search pattern is found.
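Returning to the binding-based evaluation described for PQL above, the following minimal sketch computes the match graph from per-variable node conditions and a WHERE predicate over whole bindings; the function names are illustrative, not PQL's actual syntax.

```python
from itertools import product

def evaluate(nodes, var_conds, where):
    """nodes: list of graph nodes; var_conds: {variable: unary predicate};
    where: predicate over a complete {variable: node} binding.
    Returns the match graph as a flat, duplicate-free list of nodes."""
    nodes = list(nodes)
    domains = {v: [n for n in nodes if cond(n)]
               for v, cond in var_conds.items()}
    variables = list(domains)
    match = []
    for combo in product(*(domains[v] for v in variables)):
        if where(dict(zip(variables, combo))):
            for n in combo:
                if n not in match:      # keep the node list duplicate-free
                    match.append(n)
    return match

graph = [('m1', 'molecule'), ('m2', 'molecule'), ('i1', 'interaction')]
print(evaluate(graph,
               {'x': lambda n: n[1] == 'molecule',
                'y': lambda n: n[1] == 'interaction'},
               lambda b: b['x'][0] != 'm2'))
# [('m1', 'molecule'), ('i1', 'interaction')]
```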
Awad (2007) has followed an approach similar to GraphLog, presenting a visual query language for business process definitions called BPMN-Q. BPMN-Q allows expressing structural queries and specifies a procedure for determining whether a given process model graph is structurally similar to a query graph. BPMN-Q relies on the notation of the BPMN language as its concrete syntax and provides a set of new visual constructs that can be seen as abstractions over the existing modeling constructs. For example, a Path construct connecting two nodes in a query represents an abstraction over
whatever nodes could lie between them in the matching process model, while a Negative Path construct is used to express that two nodes must have no connection between them. Several other graph query languages have been proposed in the literature, such as GOQL (Sheng et al., 1999), GOOD (Gyssens et al., 1994) and SPARQL (Prudhommeaux & Seaborne, 2008). For example, GOQL and GOOD are designed as extensions of OQL (the Object-Oriented Query Language) and rely on an object-oriented graph data model. The SPARQL query language is a W3C recommendation for querying RDF graph data. It describes a directed labeled graph by a set of triples, each of which describes an (attribute, value) pair or an interconnection between two nodes. The SPARQL query language works primarily through primitive triple-pattern matching with simple constraints on query nodes and edges. Table 2 provides a comparison between the graph query languages in terms of their target domain, query unit and query style.
usually larger. Hence, reducing the size of the candidate answer set by removing as many of the false-positive graphs as possible is the main criterion for evaluating the effectiveness of any filtering technique.

There is a clear imbalance between the number of techniques developed for processing supergraph queries and those for the other types of graph queries. The reason is that the supergraph query type can be considered relatively new; therefore, many technical aspects remain unexplored.

A clear gap in the research efforts in the domain of graph databases is the absence of a standard graph query language that plays the same role as SQL for the relational data model or XPath and XQuery for the XML hierarchical model. Although a number of query languages have been proposed in the literature (Consens & Mendelzon, 1990; Gyssens et al., 1994; Sheng et al., 1999), none of them has been universally accepted, as they are designed to deal with different representations of the graph data model. A standard definition of a general-purpose graph query language with more powerful and flexible constructs is essential, and a concrete algebra behind this expected query language is also quite important.

The proposed graph query processing techniques concentrate on the retrieval speed of their indexing structures, in addition to their compact size. Further management of these indexes is rarely taken into account. Although efficient query processing is an important objective, efficient update maintenance is also an important concern. In the case of dynamic graph databases, it is quite important that indexing techniques avoid the costly recomputation of the whole index and provide more efficient mechanisms to update the underlying index structures with minimum effort. Therefore, efficient mechanisms to handle dynamic graph databases are necessary.

Finally, query processing usually involves a cost-based optimization phase in which query optimizers rely on cost models to choose an optimal query plan from among several alternatives. A key issue of any cost model is the cardinality estimation of the intermediate and final query results. Although an initial effort by Stocker et al. (2008) addresses the selectivity estimation of basic graph patterns, there is still a clear need for summarization and estimation frameworks for graph databases. These frameworks need to provide accurate selectivity estimates for more complex graph patterns, which can be utilized to accelerate the processing of different types of graph queries.
REFERENCES
Angles, R., & Gutierrez, C. (2008). Survey of graph database models. ACM Computing Surveys, 40(1), 1-39. doi:10.1145/1322432.1322433
Awad, A. (2007). BPMN-Q: A language to query business processes. In Proceedings of the 2nd International Workshop on Enterprise Modelling and Information Systems Architectures, (pp. 115-128).
Bruno, N., Chaudhuri, S., & Ramamurthy, R. (2009). Power hints for query optimization. In Proceedings of the 25th International Conference on Data Engineering, (pp. 469-480).
Bunke, H., & Shearer, K. (1998). A graph distance metric based on the maximal common subgraph. Pattern Recognition Letters, 19(3-4), 255-259. doi:10.1016/S0167-8655(97)00179-7
Cai, D., Shao, Z., He, X., Yan, X., & Han, J. (2005). Community mining from multi-relational networks. In Proceedings of the 9th European Conference on Principles and Practice of Knowledge Discovery in Databases, (pp. 445-452).
Chen, C., Yan, X., Yu, P. S., Han, J., Zhang, D.-Q., & Gu, X. (2007). Towards graph containment search and indexing. In Proceedings of the 33rd International Conference on Very Large Data Bases, (pp. 926-937).
Cheng, J., Ke, Y., Ng, W., & Lu, A. (2007). FG-Index: Towards verification-free query processing on graph databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data, (pp. 857-872).
Consens, M. P., & Mendelzon, A. O. (1990). GraphLog: A visual formalism for real life recursion. In Proceedings of the Ninth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, (pp. 404-416).
Fernandez, M.-L., & Valiente, G. (2001). A graph distance metric combining maximum common subgraph and minimum common supergraph. Pattern Recognition Letters, 22(6/7), 753-758.
Fortin, S. (1996). The graph isomorphism problem. (Technical Report). Department of Computing Science, University of Alberta.
Gao, X., Xiao, B., Tao, D., & Li, X. (2009). A survey of graph edit distance. Pattern Analysis & Applications.
Garey, M. R., & Johnson, D. S. (1979). Computers and intractability: A guide to the theory of NP-completeness. W. H. Freeman.
Giugno, R., & Shasha, D. (2002). GraphGrep: A fast and universal method for querying graphs. In Proceedings of the IEEE International Conference on Pattern Recognition, (pp. 112-115).
Guttman, A. (1984). R-Trees: A dynamic index structure for spatial searching. In Proceedings of the ACM SIGMOD International Conference on Management of Data, (pp. 47-57).
Gyssens, M., Paredaens, J., den Bussche, J. V., & Gucht, D. V. (1994). A graph-oriented object database model. IEEE Transactions on Knowledge and Data Engineering, 6(4), 572-586. doi:10.1109/69.298174
He, H., & Singh, A. K. (2006). Closure-Tree: An index structure for graph queries. In Proceedings of the 22nd International Conference on Data Engineering, (pp. 38-52).
He, H., & Singh, A. K. (2008). Graphs-at-a-time: Query language and access methods for graph databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data, (pp. 405-418).
Huan, J., Wang, W., Bandyopadhyay, D., Snoeyink, J., Prins, J., & Tropsha, A. (2004). Mining protein family specific residue packing patterns from protein structure graphs. In Proceedings of the Eighth Annual International Conference on Computational Molecular Biology, (pp. 308-315).
Jiang, H., Wang, H., Yu, P. S., & Zhou, S. (2007). GString: A novel approach for efficient search in graph databases. In Proceedings of the 23rd International Conference on Data Engineering, (pp. 566-575).
Klinger, S., & Austin, J. (2005). Chemical similarity searching using a neural graph matcher. In Proceedings of the 13th European Symposium on Artificial Neural Networks, (pp. 479-484).
Kuramochi, M., & Karypis, G. (2001). Frequent subgraph discovery. In Proceedings of the IEEE International Conference on Data Mining, (pp. 313-320).
Kuramochi, M., & Karypis, G. (2004). GREW: A scalable frequent subgraph discovery algorithm. In Proceedings of the IEEE International Conference on Data Mining, (pp. 439-442).
Lee, J., Oh, J.-H., & Hwang, S. (2005). STRG-index: Spatio-temporal region graph indexing for large video databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data, (pp. 718-729).
Leser, U. (2005). A query language for biological networks. In Proceedings of the Fourth European Conference on Computational Biology/Sixth Meeting of the Spanish Bioinformatics Network, (p. 39).
Manola, F., & Miller, E. (2004). RDF primer: World Wide Web Consortium proposed recommendation. Retrieved from http://www.w3.org/TR/rdf-primer/
Prudhommeaux, E., & Seaborne, A. (2008). SPARQL query language for RDF. World Wide Web Consortium proposed recommendation. Retrieved from http://www.w3.org/TR/rdf-sparql-query/
Raymond, J. W., Gardiner, E. J., & Willett, P. (2002). RASCAL: Calculation of graph similarity using maximum common edge subgraphs. The Computer Journal, 45(6), 631-644. doi:10.1093/comjnl/45.6.631
Sakr, S. (2009). GraphREL: A decomposition-based and selectivity-aware relational framework for processing sub-graph queries. In Proceedings of the 14th International Conference on Database Systems for Advanced Applications, (pp. 123-137).
Sheng, L., Ozsoyoglu, Z. M., & Ozsoyoglu, G. (1999). A graph query language and its query processing. In Proceedings of the 15th International Conference on Data Engineering, (pp. 572-581).
Stocker, M., Seaborne, A., Bernstein, A., Kiefer, C., & Reynolds, D. (2008). SPARQL basic graph pattern optimization using selectivity estimation. In Proceedings of the 17th International Conference on World Wide Web, (pp. 595-604).
Teubner, J., Grust, T., Maneth, S., & Sakr, S. (2008). Dependable cardinality forecasts for XQuery. Proceedings of the VLDB Endowment, 1(1), 463-477.
Tian, Y., McEachin, R. C., Santos, C., States, D. J., & Patel, J. M. (2007). SAGA: A subgraph matching tool for biological graphs. Bioinformatics, 23(2), 232-239. doi:10.1093/bioinformatics/btl571
Wang, C., Wang, W., Pei, J., Zhu, Y., & Shi, B. (2004). Scalable mining of large disk-based graph databases. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, (pp. 316-325).
Washio, T., & Motoda, H. (2003). State of the art of graph-based data mining. SIGKDD Explorations, 5(1), 59-68. doi:10.1145/959242.959249
Williams, D. W., Huan, J., & Wang, W. (2007). Graph database indexing using structured graph decomposition. In Proceedings of the 23rd International Conference on Data Engineering, (pp. 976-985).
Yan, X., & Han, J. (2002). gSpan: Graph-based substructure pattern mining. In Proceedings of the IEEE International Conference on Data Mining, (pp. 721-724).
Yan, X., & Han, J. (2003). CloseGraph: Mining closed frequent graph patterns. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, (pp. 286-295).
Yan, X., Yu, P. S., & Han, J. (2004). Graph indexing: A frequent structure-based approach. In Proceedings of the ACM SIGMOD International Conference on Management of Data, (pp. 335-346).
Yan, X., Yu, P. S., & Han, J. (2005). Substructure similarity search in graph databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data, (pp. 766-777).
Zhang, S., Hu, M., & Yang, J. (2007). TreePi: A novel graph indexing method. In Proceedings of the 23rd International Conference on Data Engineering, (pp. 966-975).
Zhang, S., Li, J., Gao, H., & Zou, Z. (2009). A novel approach for efficient supergraph query processing on graph databases. In Proceedings of the 12th International Conference on Extending Database Technology, (pp. 204-215).
Zhao, P., Yu, J. X., & Yu, P. S. (2007). Graph indexing: Tree + delta >= Graph. In Proceedings of the 33rd International Conference on Very Large Data Bases, (pp. 938-949).
Chapter 14
ABSTRACT
Multimedia objects such as images, audio, and video do not present a total ordering relationship, so the relational operators (<, ≤, ≥, >) are not suitable to compare them. Therefore, similarity queries are the most useful, and often the only, types of queries adequate to search multimedia objects stored in a database. Unfortunately, the ubiquitous query language SQL, the most widely employed language in Database Management Systems (DBMS), does not provide effective support for similarity queries. This chapter presents an already validated strategy that adds similarity queries to SQL, supporting a powerful set of similarity operators. The chapter also describes techniques to store and retrieve multimedia objects in an efficient way and shows existing DBMS alternatives to execute similarity queries over multimedia data.
DOI: 10.4018/978-1-60960-475-2.ch014
INTRODUCTION
With the increasing availability and capacity of recording equipment, managing the huge amount of multimedia data being generated has become more and more challenging. Without a proper retrieval mechanism, such data is usually forgotten on a storage device, and most of it is never touched again. As the information embedded into multimedia data is intrinsically complex and rich, retrieval approaches for such data usually rely on its content. However, Multimedia Objects (MO) are seldom compared directly, because their binary representation is of little help in understanding their content. Rather, a set of predefined features is extracted from the MO, which is thereafter used in place of the original object to perform the retrieval. For example, in Content-Based Image Retrieval (CBIR), images are preprocessed by specific feature extraction algorithms to retrieve their color or texture histograms, polygonal contours of the pictured objects, etc. The features are employed to define a mathematical signature that represents the content of the image regarding specific criteria, and this signature is then employed in the search process.

Although much progress has been achieved in recent years in handling multimedia content, the development of large-scale applications has been facing problems because existing Database Management Systems (DBMS) lack support for such data. The operators usually employed to compare numbers and small texts in traditional DBMS are not useful to compare MO. Moreover, MO demand specific indexing structures and other advanced resources, for example, maintaining the query context during a user interaction with a multimedia database. The most promising approach to overcome these issues is to add support for similarity-based data management inside the DBMS. Similarity can be defined through a function that compares pairs of MO and returns a value stating how similar (close) they are. As is shown later in this chapter, employing similarity as the basis of the retrieval process allows writing very elaborate queries using a reduced set of operators and developing a consistent and efficient query execution mechanism.

Although a number of works have been reported in the literature describing the basic algorithms to execute similarity retrieval operations on multimedia and other complex object datasets (Roussopoulos et al., 1995, Hjaltason and Samet, 2003, Bohm et al., 2001), there are few works on how to integrate similarity queries into the DBMS core. Some DBMS provide proprietary modules to handle multimedia data and perform a limited set of similarity queries (IBM Corp., 2003, Oracle Corp., 2005, Informix Corp., 1999). However, such approaches are generalist and do not allow including domain-specific resources, which prevents many applications from using them. Moreover, it is worth noting the importance of supporting similarity queries in SQL as native predicates, to allow representing queries that mix traditional and similarity-based predicates and executing them efficiently in a Relational DBMS (RDBMS) (Barioni et al., 2008). This chapter presents the key foundations toward supporting similarity queries as a native resource in RDBMS, addressing the fundamental aspects related to the representation of similarity queries in SQL. It also describes case studies showing how it is possible to perform similarity queries within existing DBMS (Barioni et al., 2006, Kaster et al., 2009).
In the following sections, we describe related work and fundamental concepts, including the general strategy usually adopted to represent and to compare MO, the kinds of similarity queries that can be employed to query multimedia data and some adequate indexing methods. We also discuss issues regarding the support of similarity queries in relational DBMS, presenting the current alternatives and also an already validated approach to seamlessly integrate similarity queries in SQL. There is also a description
of case studies for the enhancement of existing DBMS with appropriate techniques to store multimedia data and algorithms to efficiently execute similarity queries over them. Finally, we conclude the chapter and give future research directions on multimedia retrieval support in DBMS.
Each of these queries requires particular processing of the data content to be answered. The point is: how can the query be represented so that a general-purpose retrieval system can answer it? By general-purpose we mean a system that provides sets of primitives that can be reused and specialized by distinct applications sharing a common group of retrieval requirements. In the following subsections, two approaches to address these problems are discussed.
Although this approach is suitable for specific applications, the number of operators needed to represent queries can be excessively high. The operators employed to identify objects must be specialized for each application domain to acquire higher semantics, yielding a variety of functions to represent each (class of) object(s). Moreover, although there are works aimed at retrieving images by the spatial similarity between segmented regions, such as (Huang et al., 2008, Yeh and Chang, 2008), queries are usually represented using a sketch or an example of the desired spatial relationships, because representing the association among various objects can be very difficult. The representation of more complex classes of queries is even harder using content-based operators. Queries Q3, Q4 and Q5 would require defining, respectively: what determines a video to be copyrighted, what visual pattern represents the act of playing, and what a drum solo is. It is easy to notice that this approach would lead to an explosion in the number of operators needed to cover every situation and the variety of relationships among complex elements and among their components.
Similarity Evaluation
Similarity evaluation requires a sequence of computational tasks that results in a value quantifying how close two multimedia objects are. In the philosophical and psychological literature, the notion of similarity space states that the mind embodies a representational hyperspace. In this space, dimensions represent ways in which objects can be distinguished, points represent possible objects, and distances between points are inversely proportional to the perceived similarity between those objects (Gauker, 2007). When dealing with multimedia data, the rules governing the similarity space that corresponds to human perception are still largely unknown. Different users can have different perceptions of the same image or movie, depending on their intent and background. Moreover, the similarity between two objects does not always follow the same measurement: the perceived distances may vary according to the user's intent at the moment.

The similarity evaluation process is usually quite complex and specialized, so a great variety of approaches is found in the literature. We organize the building blocks of this process around an abstraction of the similarity space concept. It is assumed that a multimedia object can be modeled through a set of features summarizing its content, usually called the data's feature vector, although the features do not need to compose a vector space (e.g. complex objects of the same dataset can have feature vectors of varying sizes). The features, along with the distance function, constitute a descriptor, which determines the similarity value (Torres et al., 2009). If we interpret each pair <feature vector, distance function> as a similarity space instance, the whole set of possible instances forms our abstraction of the human similarity space, in which the instances are switched according to the user perception at each instant. Therefore, at the base level, it is only necessary to have the feature vector and the distance function to compute the similarity. Although having a manner to compute the similarity suffices to answer similarity queries (see the Similarity Queries section), we discuss herein the internals of similarity evaluation.

In some domains, choosing the right combination is not a problem. For instance, geoprocessing applications usually treat similarity as the distance among objects on the Earth's surface (e.g. "return the restaurants that are at most 1 mile from me", "return the closest hospital to a given kindergarten"). In this case, the (spatial) features of the elements are well-defined (their geographical coordinates) and the function used to compute the proximity is mainly restricted to the Euclidean distance, or a shortest-path algorithm if the routes are modeled as a graph. On the other hand, for multimedia data it is mandatory to identify which pair <feature vector, distance function> forms the space that most closely represents the user interpretation in each situation. Choosing the best combination between them improves the query precision (Bugatti et al., 2008). The challenge is to identify the similarity space instance that best fits the user expectation. This ideal instance is usually called the semantic (similarity) space (He et al., 2002). There are many fields and techniques employed to pursue this challenge, as shown in Figure 1. This illustration is not intended to be exhaustive, but to include the most usual concepts regarding similarity.
We highlight in this figure that several concepts directly affect the components of a similarity space instance, such as feature extraction, feature selection and transformation, and feature/partial-distance weighting. Every modification promoted by any of these elements produces a new alternative instance. At a higher level are the techniques, algorithms and external information employed to define how each of these elements is addressed to accomplish the user needs. At this level are data-processing algorithms, knowledge-discovery techniques, relevance feedback, machine learning and others.
Figure 1. The basis of a similarity space instance (a feature vector and a distance function) and the fields and techniques employed to drive its semantics
The abstraction levels present increasing complexity and a tendency to narrow the domain. The lack of coincidence between the low-level features automatically extracted from the multimedia data (Level 1) and the high-level human interpretation (based on features of Levels 2 and 3) is known in the literature as the semantic gap (Smeulders et al., 2000). Many research efforts have been directed at extracting higher-semantic features in order to bridge this gap. Feature extraction is the first step, in which the multimedia data is processed to generate a feature vector. The algorithms that identify the features intended to infer the data content are called feature extractors. Many feature extractors for multimedia data have been proposed in the literature. Images are usually represented through their color, texture and shape patterns, extracted either globally in the image
or locally in regions identified by segmentation methods. Examples of color feature extractors are the normalized histograms (Long et al., 2003) and the descriptors being considered for inclusion in the MPEG-7 standard (Manjunath et al., 2001), which also include texture descriptors. Regarding texture, one can cite Haralick's descriptors (Haralick, 1979), features obtained using the wavelet transform (Santini and Gupta, 2001) and the Gabor filters (Jain and Farrokhnia, 1991). Shape extractors include the contour-based Fourier descriptors (Zahn and Roskies, 1972), the Curvature Scale Space (CSS) descriptors (Mokhtarian and Mackworth, 1986) and the region-based Zernike moments (Khotanzad and Hong, 1990).

With respect to videos, their representation is much more complex due to their intrinsic temporal information. However, they borrow the techniques developed for content-based image retrieval, because digital videos are essentially sequences of frames, which are still images. As sequential frames have only slight differences, the extraction algorithms are usually based on shots, which are sets of contiguous frames that show an action. Therefore, shot boundary detection is a fundamental process applied to video content, and there are many approaches developed to address it (Lienhart, 2001). After the shots have been detected, a key frame is usually selected to represent each shot and is submitted to image feature extractors. Other features available in videos include the sound track, text appearing in the scenes, subtitle data and closed captions.

Feature extractors for audio also segment the files into frames. Frames whose sound energy is below a predefined threshold are considered silence and ignored. From the non-silence frames, features are extracted regarding the time, frequency and coefficient domains. Examples of audio extractors are the Mel-Frequency Cepstral Coefficients (MFCC), features based on the Short Time Fourier Transform (STFT) (Tzanetakis and Cook, 2002) and approaches based on audio fingerprinting (Baluja and Covell, 2008).

Other concepts that directly affect the feature vector component are feature selection and feature transformation. Both aim at reducing the dimensionality of the feature vector, simplifying the representation by eliminating redundant information and requiring less memory and time. Feature selection consists in obtaining a subset of the feature vector that includes the most relevant features to discriminate the objects; it is based on statistics and machine learning algorithms (refer to (Guyon and Elisseeff, 2003) for an overview). Feature transformation creates new features by combining and transforming the original features. The most widely adopted transformation methods are the Principal Component Analysis (PCA) and the Linear Discriminant Analysis (LDA) (Blanken et al., 2007). Many works use feature transformation to allow indexing multimedia datasets, which usually are high-dimensional, in index structures tailored to low-dimensional data, such as the R-tree.
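As a concrete example of the simplest kind of extractor mentioned above, the following minimal sketch computes a normalized color histogram, assuming an image given as a list of (r, g, b) pixel values.

```python
def color_histogram(pixels, bins=4):
    """Quantize each channel into 'bins' levels and count pixels per cell."""
    hist = [0] * (bins ** 3)
    for r, g, b in pixels:
        idx = ((r * bins // 256) * bins + (g * bins // 256)) * bins \
              + (b * bins // 256)
        hist[idx] += 1
    total = float(len(pixels))
    return [c / total for c in hist]   # normalize so images are comparable

print(sum(color_histogram([(255, 0, 0), (0, 255, 0), (10, 10, 10)])))  # 1.0
```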
The Role of the Distance Function Component

The distance function is the other component of a similarity space instance. The most commonly employed distance functions for vector feature spaces are those of the Minkowski family:

$L_p(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$,

where n is the dimension of the embedded vector space and p is an integer. When p is 1, 2 and ∞, we have respectively the distances L1 (Manhattan), L2 (Euclidean) and L∞ (Chebychev). Distance functions that are able to define metric spaces are particularly interesting, because metric spaces allow handling multi-dimensional feature datasets as well as feature datasets for which the
concept of dimensions does not apply (as, for example, shapes defined by polygons with distinct numbers of vertices). Formally, a metric space is a pair ⟨D, d⟩, where D is the set of all objects complying with the properties of the domain and d is a distance function that complies with the following three properties: symmetry: d(s1, s2) = d(s2, s1); non-negativity: 0 < d(s1, s2) < ∞ if s1 ≠ s2, and d(s1, s1) = 0; and triangular inequality: d(s1, s2) ≤ d(s1, s3) + d(s3, s2), for all s1, s2, s3 ∈ D. A function that satisfies these properties is called a metric. The Minkowski distances with p ≥ 1 are metrics; therefore, vector spaces ruled by any such function are special cases of metric spaces. Another important property of metric spaces is that they allow developing fast indexing structures (see the Indexing Methods for Multimedia section). Other examples of metrics are the Canberra distance (Kokare et al., 2003) and the Weak Attribute Interaction Distance (WAID), which allows users to define the influence between features according to their perception (Felipe et al., 2009).

Distance functions can be affected by weighting techniques, producing distinct similarity space instances and tuning the evaluation. These techniques can be classified into feature weighting and partial distance weighting. Feature weighting has the goal of establishing the ideal balance among the relevances of the features, so that the resulting similarity best satisfies the user needs. The trivial strategy for weighting features is based on exhaustive experimental evaluation. Nonetheless, there is an increasing number of approaches dynamically guided by information provided in the query formulation and/or in relevance feedback cycles (Liu et al., 2007, Wan and Liu, 2006, Lee and Street, 2002). Partial distance weighting is employed when an object is represented by many feature vectors: the similarity evaluation between two objects first computes the (partial) distance between each pair of corresponding feature vectors, usually employing distinct distance functions, and then uses another function to aggregate these values into the final distance. The automatic partial distance weighting methods can be classified into supervised (e.g. (Bustos et al., 2004)) and unsupervised (e.g. (Bueno et al., 2009)). Now that we know how to represent and compare multimedia objects by similarity, it is time to learn how to query these data. The several types of similarity queries that can be employed to query multimedia data are discussed in the next section.
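The Minkowski family discussed above is straightforward to implement; in the following sketch, p = 1 gives the Manhattan distance, p = 2 the Euclidean distance, and p = infinity (handled as a special case) the Chebychev distance.

```python
def minkowski(x, y, p=2):
    """Minkowski distance L_p between two feature vectors x and y."""
    if p == float('inf'):                       # Chebychev: limit of L_p
        return max(abs(a - b) for a, b in zip(x, y))
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

print(minkowski((0, 0), (3, 4), 1))             # 7.0  (Manhattan)
print(minkowski((0, 0), (3, 4), 2))             # 5.0  (Euclidean)
print(minkowski((0, 0), (3, 4), float('inf')))  # 4    (Chebychev)
```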
Similarity Queries
Let us recall a few fundamental concepts of the relational model to provide a proper definition of similarity queries following the database theory. It is worth stressing that every traditional concept of the relational model remains valid when retrieving multimedia objects by the similarity of their contents. Suppose R is a relation with n attributes described by a relational schema R = (S1, ..., Sn), composed of a set of m tuples ti, such that R = {t1, ..., tm}. Each attribute Sj, 1 ≤ j ≤ n, indicates a role for a domain Dj, that is, Sj assumes values from Dj. Therefore, when Dj is a multimedia domain from a metric space, the attribute Sj stores multimedia values. Each tuple of the relation stores one value for each attribute Sj, where each value si, 1 ≤ i ≤ m, assigned to Sj is an element taken from the domain Dj, and the dataset Sj is composed of the set of elements si that are assigned to the attribute Sj in at least one tuple of the stored relation. Notice that more than one attribute Sj, Sk from R can share the same domain, that is, it is possible to have Dj = Dk. Regarding a multimedia domain from a metric space, the elements must be compared by similarity, using a distance function d defined over the respective domain. Elements can be compared using the properties of the domain, regardless of the attributes that store them. Therefore, every pair
of elements from one attribute or from distinct attributes sharing the same domain can be compared. The symbols employed in this chapter are summarized in Table 1. There are several types of similarity queries that can be employed to compare multimedia data. A similarity query is expressed using a similarity predicate. There are basically two families of similarity predicates: those limiting the answer based on a given similarity threshold ξ, and those limiting the answer based on the number k of elements that should be retrieved. Moreover, the operators involved in similarity queries can be either unary or binary.
Similarity Selections
Similarity selections are performed by unary operators that compare the elements of Sj with one or more reference elements sq ∈ Dj given as part of the predicate. A similarity selection over an attribute Sj of the relation R can be represented as σ̈(<selection predicate>) R, where the selection predicate has the form Sj θ Q and θ denotes a similarity operator. This predicate expresses a similarity comparison between the set of values of attribute Sj, drawn from Dj, and a set of constant values Q ⊆ Dj, called the reference or query centers set, taken from the domain Dj and given as part of the query predicate. The answer of a similarity selection is the subset of tuples ti from R whose values si from attribute Sj meet the selection predicate. It is important to note that similarity selections exhibit properties distinct from those of the traditional selections (for example, they do not possess the commutative property; see (Ferreira, 2009)), so we use the σ̈ symbol instead of the traditional σ.
When Q is a unitary set, the similarity selection corresponds to the traditional range and k-nearest neighbor queries. These two types of similarity selections can be represented in the following manner:

Range selection (Rq): given a maximum query distance ξ, the query σ̈(Sj Rq[d(), ξ] {sq}) R retrieves every tuple ti from R whose value si from attribute Sj satisfies d(si, sq) ≤ ξ. Considering R to be a set of images, an example is: "Select the images that are similar to the image P by up to 5 units", represented as σ̈(image Rq[d(), 5] {P}) Images;

k-Nearest Neighbor selection (kNNq): given an integer value k ≥ 1, the query σ̈(Sj kNNq[d(), k] {sq}) R retrieves the k tuples ti from R whose values si from attribute Sj have the smallest distance from the query element sq, according to the distance function d. An example is: "Select the 3 images most similar to the image P", represented as σ̈(image kNNq[d(), 3] {P}) Images.

Figures 2 (a) and (b) illustrate the range and k-nearest neighbor selections in a 2-dimensional Euclidean space. When the query centers set Q has more than one object, the similarity selection corresponds to aggregate similarity queries. In order to perform the comparison, these queries require the definition of a similarity aggregation function dg, which evaluates the aggregate similarity of each element si ∈ Sj regarding its similarity, measured by the metric d, to every element sq ∈ Q. The aggregate range and the aggregate k-nearest neighbor selections can be represented as follows (Razente et al., 2008b):

Aggregate Range selection (ARq): given a maximum aggregate query distance ξ, a similarity aggregation function dg and a set of query centers Q, the query ARq retrieves every tuple ti from R whose value si from attribute Sj satisfies dg(si, Q) ≤ ξ. An aggregate range selection can be expressed as σ̈(Sj ARq[dg(), ξ] Q) R. An example is: "Select the images whose aggregate distance to the set of images Q is at most ξ", represented as σ̈(image ARq[dg(), ξ] Q) Images;

Aggregate k-Nearest Neighbor selection (kANNq): given an integer value k ≥ 1, the query kANNq retrieves the k tuples ti from R whose values si from attribute Sj minimize the similarity aggregation function dg regarding the query centers Q. An aggregate k-nearest neighbor selection can be expressed as σ̈(Sj kANNq[dg(), k] Q) R. An example is: "Select the 3 images most similar to the set of images Q", represented as σ̈(image kANNq[dg(), 3] Q) Images.
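For illustration, the following minimal sketch evaluates range and k-NN selections over an in-memory relation, assuming each tuple stores a feature vector under a given attribute and dist is the metric of the attribute's domain; the names are illustrative.

```python
import heapq

def range_select(relation, attr, s_q, xi, dist):
    """Tuples whose value of 'attr' lies within distance xi of center s_q."""
    return [t for t in relation if dist(t[attr], s_q) <= xi]

def knn_select(relation, attr, s_q, k, dist):
    """The k tuples whose values of 'attr' are closest to center s_q."""
    return heapq.nsmallest(k, relation, key=lambda t: dist(t[attr], s_q))

images = [{'id': 1, 'feat': (0.1, 0.9)}, {'id': 2, 'feat': (0.8, 0.2)}]
dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
print(range_select(images, 'feat', (0.0, 1.0), 0.5, dist))   # only id 1
print(knn_select(images, 'feat', (0.0, 1.0), 1, dist))       # only id 1
```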
The predicate of aggregate similarity selections uses the value of the similarity aggregation function to rank the elements in Sj with respect to Q. There are several ways to define the similarity aggregation function dg. Herein, we consider the class of functions analyzed in (Razente et al., 2008b), generated by:

$d_g(Q, s_i) = \sqrt[g]{\sum_{s_q \in Q} d(s_q, s_i)^g}$    (1)
Figure 2. Examples of similarity queries considering the Euclidean distance. (a) Range selection. (b) k-Nearest-Neighbor selection considering k=4. (c) Aggregate Range selection. (d) k-Aggregate Nearest Neighbor selection considering k=1 and g=2. (e) Range join. (f) k-Nearest Neighbors join considering k=2. (g) k-Closest Neighbors join considering k=3
where d is a distance function over Dj, Q is the set of query centers, si is a dataset element, and the power g is a non-zero real value that we call the grip factor of the similarity aggregation function. Considering Equation 1, the aggregate range and the aggregate k-nearest neighbor queries applied over a unitary set Q correspond to the traditional range and k-NN queries, respectively. It is important to note that different values for the grip factor g can provide interesting different interpretations. For example, g = 1 defines the minimization of the sum of the distances, g = 2 defines the minimization of the mean square distance (see Figures 2(c) and (d)), and g = ∞ defines the minimization
Figure 3. Illustration of the effect of the grip factor g in a 2-dimensional Euclidean space, considering |Q| = 3
of the maximum distance. Figure 3 helps to convey the intuition behind these grip factors by presenting the effect of g in a 2-dimensional Euclidean space, considering Q composed of the three query centers shown. Each curve represents the geometric locus where Equation 1 has the same value. Therefore, each curve is an isoline representing a different covering radius, thus defining both range and k-limited queries. There are several applications that can benefit from queries employing this type of similarity predicate. For instance, multimedia data (such as images, audio or video) require extracting features that are used in place of the data element when performing the comparisons in a similarity query. The features are usually the result of mathematical algorithms, resulting in low-level features. Considering an image domain, the features are usually based on color, texture and shape. However, there exists a semantic gap between the low-level features and the subjective human interpretation. To deal with the semantic gap, relevance feedback techniques have been developed. In these techniques, positive and/or negative examples are informed by the user to allow the system to derive a more precise representation of the user intent (Zhou and Huang, 2003). The new representation of the user intent can be achieved by query point movement or by multiple point movement techniques, in which the system learns from the user's feedback on the search results and takes advantage of this information to adapt ranking functions. One way to tell the system what the user's intention is consists in specifying, in the same query, other elements besides the query center, which are positive or negative examples of the intended answer. This representation is based on multiple query centers.
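Equation 1 translates directly into code; the following sketch assumes dist is the underlying metric and handles the limiting grip factor g = infinity as a special case.

```python
def d_g(Q, s_i, dist, g=2):
    """Similarity aggregation function of Equation 1: g=1 sums the distances,
    g=2 penalizes large distances quadratically, and g -> infinity approaches
    the maximum distance to the query centers."""
    if g == float('inf'):
        return max(dist(s_q, s_i) for s_q in Q)
    return sum(dist(s_q, s_i) ** g for s_q in Q) ** (1.0 / g)

centers = [(0, 0), (2, 0), (1, 2)]
euclid = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
print(d_g(centers, (1, 1), euclid, g=1))            # sum of distances
print(d_g(centers, (1, 1), euclid, g=float('inf'))) # maximum distance
```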
Similarity Joins
Binary operators correspond to similarity joins. A similarity join over an attribute Sj in relation R1 and an attribute Sk in relation R2 can be represented as R1 ⋈(<join predicate>) R2, where the join predicate has the form Sj θ Sk and expresses a similarity comparison between the set of values of attribute Sj from the relation R1 and the set of values of attribute Sk from the relation R2, taken from the domains of attributes Sj and Sk (that is, Dj = Dk) and given as part of the query predicate. The answer of a similarity join is composed of the concatenation of tuples from R1 and R2 whose values si from attribute Sj and sm from attribute Sk meet the join predicate. There are basically three types of similarity joins, as follows.

Range join: given a maximum query distance ξ, the query R1 ⋈(Sj Rq[d(), ξ] Sk) R2 retrieves the pairs of tuples <ti, tm> from R1 and R2 whose values si from attribute Sj and sm from attribute Sk satisfy d(si, sm) ≤ ξ. An example is: "Select the European landscapes that differ from American landscapes by at most 5 units", represented as EuropeanLandscapes ⋈(europeanimage Rq[d(), 5] americanimage) AmericanLandscapes, considering that both American and European landscapes are elements from the images domain.

k-Closest Neighbors join: given an integer value k ≥ 1, the query R1 ⋈(Sj kCNq[d(), k] Sk) R2 retrieves the k closest pairs of tuples <ti, tm> from R1 and R2, according to the distance function d. An example is: "Select the 20 most similar pairs of European and American landscapes", represented as EuropeanLandscapes ⋈(europeanimage kCNq[d(), 20] americanimage) AmericanLandscapes;

k-Nearest Neighbors join: given an integer value k ≥ 1, the query R1 ⋈(Sj kNNq[d(), k] Sk) R2 retrieves pairs of tuples <ti, tm> from R1 and R2 such that, for each value si from attribute Sj, there are k pairs joining it with its nearest values sm from attribute Sk, according to the distance function d. An example is: "Select the 10 European landscapes that are the most similar to each American landscape", represented as EuropeanLandscapes ⋈(europeanimage kNNq[d(), 10] americanimage) AmericanLandscapes. The k-nearest neighbors join is not commutative.

Figures 2 (e), (f) and (g) present an illustration of the three types of similarity joins described previously. In these figures, the white circles represent elements of the attribute Sj and the gray circles represent elements of the attribute Sk. Every similarity operator allows a number of variations, such as retrieving the most dissimilar elements instead of the most similar, and taking into account occurrences of ties in k-limited predicates. Predicates can also be limited by both k and ξ, in which case the most restrictive condition is the one that applies.
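For illustration, the following minimal sketch evaluates the three similarity joins by nested loops over in-memory relations; real systems replace these quadratic scans with index-supported algorithms, and the names here are illustrative.

```python
import heapq

def range_join(r1, a1, r2, a2, xi, dist):
    """All pairs of tuples within distance xi of each other."""
    return [(t, u) for t in r1 for u in r2 if dist(t[a1], u[a2]) <= xi]

def k_closest_join(r1, a1, r2, a2, k, dist):
    """The k globally closest pairs of tuples."""
    pairs = ((t, u) for t in r1 for u in r2)
    return heapq.nsmallest(k, pairs, key=lambda p: dist(p[0][a1], p[1][a2]))

def knn_join(r1, a1, r2, a2, k, dist):
    """For each tuple of r1, its k nearest tuples of r2 (not commutative)."""
    return [(t, u) for t in r1
            for u in heapq.nsmallest(k, r2,
                                     key=lambda v: dist(t[a1], v[a2]))]
```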
Indexing Methods for Multimedia

Another important issue to consider when querying multimedia data is the use of indexing methods specially tailored to efficiently answer similarity queries. Indexing methods such as the B-tree and its variations, and hashing structures (a description of these methods can be found in (Garcia-Molina et al., 2002)), are normally provided by DBMS. However, although these indexing methods are enough to meet the needs of traditional applications, they are not suitable for systems that require similarity searches, which usually deal with data that present high dimensionality and do not present the total ordering property. In order to answer similarity queries in generic metric spaces, the most suitable indexing structures are the Metric Access Methods (MAM). A MAM is a distance-based indexing method that employs only metrics (such as those described in The Role of the Distance Function Component section) to organize the objects in the database.

There are several MAM proposals in the literature. Among the first proposed were the so-called BK-trees (Burkhard-Keller trees). The main idea of these structures consists in choosing an arbitrary central object and applying a distance function to split the remaining objects into several subsets; the indexing structure is built recursively, executing this procedure for each non-empty subset. A good overview of these and other indexing structures widely mentioned in the literature, such as the VP-tree (Vantage Point tree) (Yianilos, 1993), the MVP-tree (Multi-Vantage Point tree) (Bozkaya and Ozsoyoglu, 1997), the GNAT (Geometric Near-neighbor Access Tree) (Brin, 1995), the M-tree (Ciaccia et al., 1997) and the Slim-tree (Traina-Jr et al., 2002), can be found in (Hjaltason and Samet, 2003).

Many of the methods mentioned above were developed to improve single-center similarity search operations (such as k-nearest neighbor and range queries), using the triangle inequality property to prune branches of the trees. Considering k-NN queries, for example, several approaches have been proposed to improve their performance, such as branch-and-bound (Roussopoulos et al., 1995, Ciaccia et al., 1997, Hjaltason and Samet, 2003), incremental (Hjaltason and Samet, 1995, Hjaltason and Samet, 1999) and multi-step algorithms (Korn et al., 1996, Seidl and Kriegel, 1998). Other approaches estimate a final limiting range for the query and perform a sequence of "start small and grow" steps (Tasan and Ozsoyoglu, 2004). All of these works refer to algorithms dealing with just one query center.

Regarding the case of multiple query centers, the proposals found in the literature are more recent. The aggregate range query was first proposed in (Wu et al., 2000) and was used as a relevance feedback mechanism for content-based image retrieval in a system named Falcon. Considering the aggregate k-nearest neighbor queries, the first approach appeared in 2005 with the work presented in (Papadias et al., 2005), which deals only with spatial data. The first strategies considering the case of metric spaces appeared more recently, with the works presented in (Razente et al., 2008b, Razente et al., 2008a). Moreover, there are also works on improving similarity join algorithms. The first strategies were designed primarily for data in a vector space (Brinkhoff et al., 1993), but others were developed considering data lying in metric spaces (Dohnal et al., 2003a, Dohnal et al., 2003b, Jacox and Samet, 2008).
The use of this type of query for improving data mining processes has also been explored in works such as (Bohm and Krebs, 2002, Bohm and Krebs, 2004).
Others were built on top of a DBMS, but with retrieval mechanisms detached from its query processor. This approach, however, prevents the use of several optimization alternatives when executing complex queries, especially when the search conditions also contain operators over simple data, which DBMSs handle efficiently. In order to add support for similarity queries in an RDBMS it is necessary: (i) to create a representation for multimedia data; (ii) to define how the similarity evaluation is carried out; (iii) to state how a query involving similarity operations is written; and (iv) to provide mechanisms to execute the queries efficiently. The last requirement has been fulfilled by several works reporting basic algorithms to execute similarity retrieval operations on data sets of multimedia objects (as described in the Indexing Methods for Multimedia section). The first two requirements have also been extensively addressed in the literature, as discussed in the Similarity Evaluation section. The third requirement is directly related to the most widely employed language in DBMSs, SQL. Although a few works have focused on languages to represent similarity queries (Carey and Kossmann, 1997, Carey and Kossmann, 1998, Melton and Eisenberg, 2001, Gao et al., 2004), none of them provides production-strength support seamlessly integrated with the other features of the language. This section presents existing DBMS solutions aimed at similarity retrieval and their limitations, and introduces an extension to SQL that allows managing similarity data inside a DBMS in a consistent way.
The Landscapes relation includes attributes of traditional data types (e.g. Id, Place and Photographer) and one attribute of a multimedia data type: Picture. As Picture stores images and each vendor uses a proprietary data type for this purpose, the example employs the fictitious type IMAGETYPE to represent them. Once the table is created and the data loaded, the features representing each image are extracted and stored, so they can later be employed in content-based queries. These tasks require several instructions in all the aforementioned modules and are omitted in the example. Thereafter, suppose a user wants to retrieve the images that differ at most 1.5 similarity units from a query example ('example.jpg'), regarding color and texture features. Such a range query can be written as follows, respectively using the DB2 QbScoreFromStr (a) or the Oracle ORDImageSignature.evaluateScore (b)1 similarity functions (see Algorithm 2; an illustrative sketch of variant (b) appears after this paragraph). Although these modules allow querying images by similarity, they have some drawbacks. The first one is that their source code is not available to make improvements or to include domain-specific knowledge. The built-in functions provided are general purpose and usually yield poor results when applied over specific domains, precluding many applications from using these modules. Another major shortcoming is that the data definition and manipulation instructions are very verbose and sometimes error-prone, requiring the user to write statements that perform intermediary steps which should be transparent. To circumvent those drawbacks, it would be interesting to provide architectures allowing the development of multimedia systems capable of adapting themselves to the particular needs of each application domain. Moreover, to address the latter drawback, it would be interesting to develop a simple language with a syntax close to the widely adopted SQL standard. The next subsection presents a seamless way to extend SQL to include similarity-related constructions.
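For illustration only, the sketch below approximates variant (b) of Algorithm 2. It assumes the stored signatures reside in a Picture_Features column (see Endnote 1) and that the signature of 'example.jpg' has already been extracted into the bind variable :example_sig; the weight string format and the bind variable are assumptions, not the exact listing.

-- Sketch of Algorithm 2(b): range query through Oracle interMedia.
SELECT Id, Place, Photographer
FROM Landscapes
WHERE ORDSYS.ORDImageSignature.evaluateScore(
        Picture_Features,  -- signature stored for each image
        :example_sig,      -- signature extracted from 'example.jpg'
        'color="0.5" texture="0.5" shape="0.0" location="0.0"'
      ) <= 1.5;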
For simplicity, the presentation considers the complex data type STILLIMAGE and all the requirements needed to process it. However, other complex domains, such as video and audio, can be manipulated following the same approach. Other important issues are related to the need of incorporating constructions in SQL that allow:

In the Data Definition Language (DDL):
- To define similarity measures that specify the distance function to be employed and the structure that represents the data to be compared by similarity. Each data domain demands structures and similarity functions tailored to the inherent features of the underlying data;
- To specify multimedia data types when defining a table;
- To associate multimedia domain attributes with the available similarity measures;
- To define indexes for multimedia domain attributes.

In the Data Manipulation Language (DML):
- To insert and/or update the data in a multimedia database;
- To allow specifying similarity queries in an integrated manner with the other resources of SQL, including operations such as selection and join.
In the following subsections, we describe a strategy that addresses all these issues, extending the syntax of the DDL and DML commands. The definition of these extensions requires both the description of the new constructs (i.e., the language syntax) and their meaning (i.e., the language semantics). To specify the syntax of the SQL extension we employ the BNF (Backus-Naur Form), a widely adopted notation for the specification of programming languages. For the meaning of the new constructs, we use suggestive examples and informal descriptions. The first extended DDL command, CREATE METRIC, is presented in Algorithm 3.
Algorithm 3.
CREATE METRIC <metric_name> USING <distance_function>
FOR <complex_data_type> (
    <extractor_name> (<parameter_name> AS <parameter_alias> [weight], ...)
    [, <extractor_name> (<parameter_name> AS <parameter_alias> [weight], ...), ...]
);
Suppose there is an extractor called LargestObjectEXT, which returns the features Area (in number of pixels) and the center pixel XYCenter of the largest continuous area of the same color in the image. If one wants to define a metric that evaluates the similarity of two images considering their histogram and the position of the largest continuous area (but not its area), with the histogram weighting twice the center, the command in Algorithm 4 can be used, where the Area parameter of LargestObjectEXT is not provided, indicating that this feature should not be returned by the extractor. Once a metric has been created, it can be associated with one or several image attributes defined in any relation. METRICs are associated with complex attributes as constraints, following either of the two usual ways to define constraints in table definition commands: column constraint or table constraint. In the following example (Algorithm 5), the metric Histo&Center is associated with the Picture attribute of the table Landscapes. When defining a METRIC constraint using the column constraint syntax, it is only necessary to provide the USING <metric_name> clause in the attribute definition. Note that we do not claim that comparing two landscapes using the Histo&Center metric guarantees obtaining the most similar images by human standards (a much better comparison metric should be developed); it is used here only as an example of the proposed syntax.

Algorithm 4.
CREATE METRIC Histo&Center USING LP1 FOR STILLIMAGE (
    HistogramEXT (Histogram AS Histo 2),
    LargestObjectEXT (XYCenter AS XYLargestObj)
);
Algorithm 5.
ALTER TABLE Landscapes ADD METRIC (Picture) USING (Histo&Center);
With regard to multimedia retrieval, it is expected that users gradually enhance the similarity evaluation algorithms, adding new feature extractors and distance functions and experimenting with several combinations. In this sense, a single multimedia attribute can be employed in queries considering different metrics. Therefore, the presented SQL extension allows associating several metrics with the same complex attribute, letting the user choose which one to employ in each query formulation. When more than one metric is associated with the same attribute, the DEFAULT keyword must follow the name of the metric to be used as the default one, i.e., the metric that is employed if none is explicitly provided in a query. Another extended SQL command is CREATE INDEX. As the indexes that support similarity searching over multimedia data depend on the employed metric (see the Indexing Methods for Multimedia section), this information must be provided so that the indexes can be created. This requirement is met in the SQL extension by adding the USING <metric_name> clause to the CREATE INDEX syntax. Note that this command implicitly adds a corresponding METRIC constraint to the referred attribute. Regarding our example, considering that a metric called Texture was defined for the STILLIMAGE type, the following command creates an index to execute queries using this metric over the Picture attribute and sets Texture as its default metric (see Algorithm 6, sketched below).
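Algorithm 6 can be sketched from the CREATE INDEX syntax just described; the index name is illustrative and the exact listing may differ:

CREATE INDEX Landscapes_Picture_idx
ON Landscapes (Picture)
USING Texture DEFAULT;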
Algorithm 7.
<complex_attr> NEAR [ANY] {<value> | <complex_attr2>}
    [STOP AFTER <k>] [RANGE <ξ>]
Algorithm 8.
SELECT * FROM Landscapes
WHERE Picture NEAR 'example.jpg' STOP AFTER 5;
The second approach obtains the value from the database. In order to answer the same query regarding a landscape stored in the database, the command in Algorithm 9 can be used. It is worth noting that the inner SELECT of this command can return more than one tuple when a non-key attribute is used in the WHERE clause. In this case, the inner SELECT can potentially return a set with several query centers for the similarity predicate. The command then turns into an aggregate similarity query, and a grip factor must be selected according to Equation 1, presented in the Similarity Queries section. To specify which grip factor must be used, one of the following keywords can be inserted after the keyword NEAR: SUM to ask for g = 1, ALL to ask for g = 2, and MAX to ask for g = ∞. For example, to retrieve the three pictures whose landscapes look most similar to those of Paris, regarding a grip factor of 2, the command in Algorithm 10 can be issued (a MAX variant is sketched after it).

Algorithm 9.
SELECT * FROM Landscapes
WHERE Picture NEAR (SELECT Picture FROM Landscapes WHERE Id = 123)
STOP AFTER 5;
Algorithm 10.
SELECT * FROM Landscapes
WHERE Picture NEAR ALL (SELECT Picture FROM Landscapes WHERE Place = 'Paris')
STOP AFTER 3;
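Following the grammar just described, replacing ALL with another grip-factor keyword changes only the aggregation. The sketch below is a variation of Algorithm 10 (not a listing from the chapter) using MAX, which minimizes the largest distance to the Paris pictures:

SELECT * FROM Landscapes
WHERE Picture NEAR MAX (SELECT Picture FROM Landscapes WHERE Place = 'Paris')
STOP AFTER 3;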
The construction to express a similarity join compares a complex attribute from the left table R to a compatible attribute from the right table S, in the format R.attr1 NEAR S.attr2 <...>. The similarity join syntaxes are expressed as regular joins, either in the FROM or in the WHERE clause. Two complex attributes are compatible if they are associated with the same metric (and consequently are of the same type). The construction R.attr1 NEAR S.attr2 RANGE ξ expresses a range join, the construction R.attr1 NEAR S.attr2 STOP AFTER k expresses a nearest join, and the construction R.attr1 NEAR ANY S.attr2 STOP AFTER k expresses a closest join. For example, the command in Algorithm 11 retrieves the five pairs of landscapes whose pictures are most similar to each other (sketches of the other two join constructions follow Algorithm 11). Variations on the basic command can be expressed with modifiers. If one wants to retrieve the most dissimilar elements instead of the most similar, the keyword NEAR is replaced by the keyword FAR (see the sketch after Algorithm 13). If more than one metric was defined, the default one is used unless the clause BY <metric_name> is declared. Concerning the running example, to select up to fifty landscapes that are among the fifty pictures most similar to a given query image considering the Histo&Center metric and the fifty pictures most similar to another query image regarding the Texture metric, the command in Algorithm 12 can be stated. Queries with predicates limited to k neighbors (either selections or joins) can take the occurrence of ties into account. The default behavior of a k-limited query is to retrieve k elements without ties, as in most of the works reported in the literature. However, the SQL extension allows specifying WITH TIE LIST following the STOP AFTER specification, to ask for every element tied at the same distance as the k-th nearest neighbor to the query element (also sketched after Algorithm 13). Both STOP AFTER and RANGE can be specified in the same query. In this case, the answer is limited to at most k elements, none farther (or nearer) than a distance ξ from the query center. For example, the command in Algorithm 13 retrieves the images that are among the 5 most similar to the query image and are not farther than 0.03 units from it. On the other hand, if neither STOP AFTER nor RANGE is specified, then RANGE 0 (zero) is assumed.

Algorithm 11.
SELECT * FROM Landscapes L1, Landscapes L2
WHERE L1.Picture NEAR L2.Picture STOP AFTER 5;
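The other two join constructions described above can be written analogously; the following sketches (not listings from the chapter) use the same running example:

-- Range join: pairs of pictures within 0.03 units of each other
SELECT * FROM Landscapes L1, Landscapes L2
WHERE L1.Picture NEAR L2.Picture RANGE 0.03;

-- Closest join: for each picture of L1, its nearest picture in L2
SELECT * FROM Landscapes L1, Landscapes L2
WHERE L1.Picture NEAR ANY L2.Picture STOP AFTER 1;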
Algorithm 12.
SELECT * FROM Landscapes
WHERE Picture NEAR 'example1.jpg' BY Histo&Center STOP AFTER 50
  AND Picture NEAR 'example2.jpg' BY Texture STOP AFTER 50;
Algorithm 13.
SELECT * FROM Landscapes
WHERE Picture NEAR 'example.jpg' STOP AFTER 5 RANGE 0.03;
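The FAR and WITH TIE LIST modifiers mentioned earlier follow directly from the same grammar; the sketches below are illustrative rather than listings from the chapter:

-- The 5 pictures most dissimilar to the query image
SELECT * FROM Landscapes
WHERE Picture FAR 'example.jpg' STOP AFTER 5;

-- The 5 nearest pictures, plus any picture tied at the distance of the 5th
SELECT * FROM Landscapes
WHERE Picture NEAR 'example.jpg' STOP AFTER 5 WITH TIE LIST;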
When the user retrieves data from the tables, SIREN joins the system tables and the user tables, removing the feature attributes; thus the user never sees the table split nor the features. This is the same approach used for BLOB data in DBMSs, which are stored apart from the original table in system-controlled areas, with only references to them stored in the table. When the user poses queries involving similarity predicates, SIREN uses the extracted features to execute the similarity operators.

The current version of SIREN has three types of feature extractors for the STILLIMAGE data type: a texture extractor (TEXTUREEXT), a shape extractor based on Zernike moments (ZERNIKEEXT) and a color extractor based on the normalized color histogram (HISTOGRAMEXT). For sound objects storing music, there is a sound-texture extractor (SOUNDTEXTUREEXT), which extracts the Mel-Frequency Cepstral Coefficients (MFCC) and features based on the Short Time Fourier Transform (STFT). The A Look on the Feature Vector Component section presented references for these extractors. The similarity operators implemented consist of the similarity selections for single query centers, the similarity selections for multiple query centers and the similarity joins, as presented in the Similarity Queries section. The traditional single-center similarity selections, that is, the similarity range query and the k-nearest neighbor query operators, are available in several MAMs presented in the literature. However, regarding the multiple-center similarity selections, there are operators available only for the Slim-tree MAM. Therefore, the Slim-tree MAM (Traina-Jr et al., 2002) is employed to index the multimedia attributes. The Slim-tree is implemented in a C++ access method library, which is called from the SIREN Indexer subsystem to execute the similarity-related operations. Unfortunately, no procedure has yet been published to execute similarity joins in this MAM; therefore, in SIREN they are always executed using sequential scan implementations.

In the following, the previous example is used to illustrate how SIREN executes a query asking for the five landscapes closest to a given picture (see Algorithm 14). This command is analyzed and rewritten by SIREN following the steps shown in Figure 4. Initially, the application program submits the SQL command. The command interpreter analyzes the original command and detects that it contains a similarity predicate referring to a query image that is not stored in the DBMS (Step 1). The interpreter also identifies the type of similarity operation that needs to be executed (a kNN query in this case), the multimedia attribute involved (Picture) and the parameters of the predicate (the query center sq = 'example.jpg', the number of neighbors k = 5, and the metric, which is the attribute's default). Thereafter, it queries the SIREN data dictionary (Step 2). The data dictionary is searched to obtain information regarding the complex attribute Picture: the attribute's default metric (Texture in the example), the feature extractors employed by the metric (the TextureEXT extractor), the distance function to be used, and the index structure Ix to be employed (Step 3). The query image sq is then submitted to the required feature extractors (Step 4).

Algorithm 14.
SELECT * FROM Landscapes
WHERE Picture NEAR 'example.jpg' STOP AFTER 5;
Algorithm 15.
SELECT Id, Place, Photographer, IPV$Landscapes_Picture.Image AS Picture
FROM Landscapes JOIN IPV$Landscapes_Picture
  ON Landscapes.Picture = IPV$Landscapes_Picture.Image_id
WHERE Picture IN (6969, 6968, 6975, 8769, 9721);
The extracted feature vector V is returned (Step 5). The interpreter sends the following parameters to the indexer: the feature vector V, the similarity operation (kNN) and its respective parameters (sq, k), and the index structure Ix (Step 6). The indexer returns the set of image identifiers Soid that answers the command (Step 7). The command interpreter uses the identifiers Soid to rewrite the original command submitted by the application program and resubmits it to the underlying DBMS (Step 8). The command rewritten by SIREN for the current example is presented in Algorithm 15, where (6969, 6968, 6975, 8769, 9721) are examples of returned identifiers and IPV$Landscapes_Picture is the SIREN-controlled table that stores the actual images and the feature vectors. Finally, the DBMS answers the query, obtaining the images and the traditional attributes requested (Step 9), and the result data is returned to the application program. It is worth noting that if a query image stored in the DBMS had been specified in the original command, it would not be sent to the feature extractors. In this case, Steps 4 and 5 are replaced by a query to the DBMS to retrieve the feature vector of the stored image, which is much more efficient.
Algorithm 16.
SELECT * FROM Landscapes
WHERE Place = 'Paris'
  AND Picture NEAR 'example.jpg' RANGE 0.03
  AND Picture NEAR 'example.jpg' STOP AFTER 5;
SIREN allows executing simple queries, as well as any combination of similarity operations among themselves and with traditional selections and joins, providing a powerful way to execute similarity queries in multimedia databases. For instance, the query in Algorithm 16 mixes two similarity-based selection conditions and one regular condition. As SIREN intercepts the SQL commands, it can detect that the two similarity conditions consider the same query center, so they can be executed in only one pass using the kAndRange algorithm (Traina-Jr et al., 2006). On the other hand, it always executes the similarity operations first and then the regular conditions, even if the regular condition is more selective. In addition to the example provided herein, others considering real datasets and several query statements can be found in (Barioni et al., 2006, Barioni et al., 2008).
FMI-SiRO relies on the same access method library used in SIREN to provide the distance functions and the MAMs. As FMI-SiRO follows the Oracle Extensibility Framework, it is integrated into the query processor and its operations are all under the DBMS transactional control, except the index updates, whose concurrency control is the responsibility of the indexing library. Multimedia data and their feature vectors are stored in BLOB columns in FMI-SiRO. However, unlike SIREN, the attributes that store the feature vectors must be explicitly defined and handled by users. As a consequence, if there are several feature vectors representing complementary characteristics of the same data, they are stored in a set of attributes to be employed in queries either individually or combined in multiple ways. With regard to the chapter's running example, the attributes Picture_Histo&Center and Picture_Texture must be added to the Landscapes table by the user, to store respectively the image's histogram together with the center pixel of the largest continuous area, and the image's texture features.

Feature extraction is performed through the function generateSignature, which submits the complex data to the respective algorithms and populates the feature vector attribute with the returned features. The SQL block in Algorithm 17 creates the image feature vectors used in the example. The first two parameters of generateSignature are, respectively, the attribute that holds the image and the attribute that receives the feature vector, and the last parameter is the name of the extractor. In this example, it is assumed that the Histo&Center extractor internally codes the linear combination of the Histogram and LargestObject extractors cited in the A SQL Extension for Similarity Data Management section. Note that the attributes storing the feature vectors are IN/OUT parameters. Hence, it is necessary to select the data using an exclusive lock, expressed in the example by the FOR UPDATE clause.

In FMI-SiRO, the association between the feature vector and the distance function to form the desired similarity space does not need to be stated beforehand, as it is done at query time. This approach allows high flexibility, although it is prone to misuse. FMI-SiRO provides several distance functions, such as Manhattan_distance, Euclidean_distance and Canberra_distance, and new distance functions can be included according to application requirements. These functions can be employed to formulate similarity queries executed using sequential scans. For example, the two instructions in Algorithm 18 execute, respectively, a range query and a 5-NN query, where Example is a variable containing the BLOB feature vector of the query center, 0.03 is the maximum query distance for the range query and 5 is the number of neighbors for the k-NN query. Using standard SQL requires numbering the rows according to the ordering criterion (the distance, in similarity queries), which is used to filter the k neighbors. This is done through SQL-99 window functions, as shown in the example.
Algorithm 17.
SELECT Picture, Picture_Histo&Center, Picture_Texture
  INTO Pic, Pic_Histo&Center, Pic_Texture
  FROM Landscapes FOR UPDATE;
generateSignature(Pic, Pic_Histo&Center, 'Histo&Center');
generateSignature(Pic, Pic_Texture, 'Texture');
Algorithm 18.
SELECT * FROM Landscapes
WHERE Manhattan_distance(Picture_Histo&Center, Example) <= 0.03;

SELECT *
FROM (SELECT Id, Place, Photographer,
             ROW_NUMBER() OVER (ORDER BY Manhattan_distance(Picture_Histo&Center, Example)) AS rn
      FROM Landscapes)
WHERE rn <= 5;
Algorithm 19.
CREATE OPERATOR Manhattan_dist
BINDING (BLOB, BLOB) RETURN FLOAT
USING Manhattan_distance;
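Algorithm 19 above creates the range-query operator; the Manhattan_kNN operator used later in Algorithms 21 and 22 could be declared analogously. In the sketch below, the name of the implementing function (Manhattan_knn_func) is hypothetical:

CREATE OPERATOR Manhattan_kNN
BINDING (BLOB, BLOB) RETURN FLOAT
USING Manhattan_knn_func;  -- hypothetical k-NN implementation function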
The execution of queries in FMI-SiRO is illustrated in Figure 5. Before posing a query, it is necessary to load the feature vector of the query center into a database variable. If the feature vector is already stored in the database, it is queried and stored in the variable. Otherwise, the application calls the generateSignature function, providing the query image (Step 1). The DBMS submits the image to the FMI-SiRO feature extractor, which generates the feature vector that is in turn stored in the variable (Steps 2 to 4). Thereafter, the query is submitted to the DBMS (Step 5), which forwards the distance calculation to the FMI-SiRO access method library (Step 6a), obtaining the distances (Step 7a). Based on the distances, the DBMS executes the query and returns the result data to the application (Step 8).

FMI-SiRO also provides new metric index structures to speed up similarity queries. They are built on the associated access method library and are accessible to the query processor through SQL operators, which are defined in the Oracle DBMS as links between user-defined functions and index types. FMI-SiRO defines operators to enable the indexed execution of both range and k-NN queries. For instance, the instruction in Algorithm 19 creates an operator for range queries employing the Manhattan distance. To create a new index type it is necessary to have a data type implementing the methods required by the Oracle Extensible Indexing Interface. Such data types in FMI-SiRO have the header shown in Algorithm 20, where each of these methods maps to an external C++ function that executes the respective action on the index using the access method library. The new index types associate the operators and the implementation types, as exemplified in Algorithm 21. The FMI-SiRO indexes are managed in the same way as the built-in ones. To execute indexed similarity queries, the user must create an index for the feature vector and write the queries using the operators associated with the respective index type. For example, the instructions in Algorithm 22 create an index and execute the same queries as before, but now using the index. The value provided in the PARAMETERS clause of the index creation instruction is the index page size. Note that indexed k-NN queries in FMI-SiRO do not require explicitly ordering the rows according to their distances to the query center, because the k-NN condition is tested during the index search.
Algorithm 20.
CREATE OR REPLACE TYPE index_im_type AS OBJECT (
    scanctx RAW(4),
    STATIC FUNCTION ODCIIndexCreate(),
    STATIC FUNCTION ODCIIndexDrop(),
    STATIC FUNCTION ODCIIndexInsert(),
    STATIC FUNCTION ODCIIndexDelete(),
    STATIC FUNCTION ODCIIndexUpdate(),
    STATIC FUNCTION ODCIIndexStart(),
    MEMBER FUNCTION ODCIIndexFetch(),
    MEMBER FUNCTION ODCIIndexClose()
);
Algorithm 21.
CREATE INDEXTYPE Slim_Manhattan
FOR Manhattan_dist(BLOB, BLOB), Manhattan_kNN(BLOB, BLOB)
USING index_im_type;
Algorithm 22.
CREATE INDEX new_index ON Landscapes(Picture_Histo&Center)
INDEXTYPE IS Slim_Manhattan PARAMETERS ('8192');

SELECT * FROM Landscapes
WHERE Manhattan_dist(Picture_Histo&Center, Example) <= 0.03;

SELECT * FROM Landscapes
WHERE Manhattan_kNN(Picture_Histo&Center, Example) <= 5;
The execution of indexed queries is similar to the execution of sequential ones, except that the DBMS query processor requests an index scan from FMI-SiRO (Step 6b of Figure 5), which returns the physical row identifiers (RowIds) satisfying the predicate (Step 7b of Figure 5). To retrieve the most dissimilar elements instead of the most similar, the comparison operator <= should be reversed (e.g. Manhattan_dist(...) > 1.5 returns the elements outside this range from the query center). Similarity joins and aggregated selections can also be expressed in FMI-SiRO. When employed in join operations, the distance function takes as parameters the joining attributes of the involved tables. For instance, the query in Algorithm 23 is a similarity range join. Aggregated selections employ functions that accept multiple query centers as parameters and compute the aggregated similarity. Such queries are written in the same way as the regular similarity selections, but using the aggregated similarity functions in place of the distance functions (see the sketch after Algorithm 24). Complex queries interleaving similarity and regular selection predicates can also be posed in FMI-SiRO; the command in Algorithm 24 states a query equivalent to the last example of the previous section.
Algorithm 23.
SELECT * FROM Landscapes L1, Landscapes L2
WHERE Manhattan_dist(L1.Picture_Texture, L2.Picture_Texture) <= 0.03;
Algorithm 24.
SELECT * FROM Landscapes
WHERE Place = 'Paris'
  AND Manhattan_dist(Picture_Texture, Example) <= 0.03
  AND Manhattan_kNN(Picture_Texture, Example) <= 5;
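An aggregated selection would follow the same pattern, swapping the distance function for an aggregated similarity function. In the sketch below, the function name Sum_Manhattan_dist and the variable ExampleSet (holding the multiple query centers) are hypothetical:

SELECT * FROM Landscapes
WHERE Sum_Manhattan_dist(Picture_Texture, ExampleSet) <= 0.06;  -- hypothetical aggregated function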
FMI-SiRO allows the query processor to choose the best execution strategy regarding all query conditions. Therefore, the most selective condition can be used to filter the results, or intersections can be employed between the individual index results before accessing the data blocks. However, the optimizer cannot detect special similarity constructions, such as a range and a k-NN query sharing the same query center, because only the built-in rewriting rules of the query processor are evaluated. Overcoming this issue would require modifying the DBMS query processor. Another alternative would be to combine the two approaches described in this section: a blade that intercepts SQL instructions and rewrites them into the best query plan, using the standard DBMS resources as well as those provided by a module for similarity searching. However, such a solution can only be implemented by changing the code of the underlying DBMS, and is therefore specific to a given product.
CONCLUSION
Most of the systems available to query multimedia data by similarity were developed considering a specific data domain and present a closed architecture. This approach does not allow applications over traditional data to be extended to also deal with other data types, such as images and audio. This can be a serious issue, for example, when extending typical applications of a medical information system, such as the electronic patient record system, so that they also support image retrieval by similarity to aid decision making. Commercial products, such as Oracle interMedia and the IBM DB2 AIV Extenders, support the management of several types of multimedia data through user-defined functions and types. Although this approach can use existing, highly optimized algorithms for each specific similarity operation, it allows neither optimizations among the operators nor their integration with the other operators used in a query. Therefore, it is fundamental that DBMSs support the management of multimedia data by similarity through native predicates in SQL, built into an architecture that can be easily adjusted to the particular needs of each application domain. This chapter contributes to supporting similarity queries as a built-in resource in relational DBMSs, addressing the fundamental aspects related to the representation of similarity queries in SQL.
Having this goal in mind, the solutions for similarity query representation and execution presented herein have several interesting characteristics. First, the SQL extension presented enables representing similarity queries as just one more type of predicate, leading to the integration of similarity operations into relational algebra. This characteristic enables extending the optimizers of relational DBMSs to treat and optimize similarity queries as well. Second, the presented retrieval engines show how to benefit from improvements in data retrieval techniques aimed at similarity, such as the techniques involved in similarity evaluation and the index structures that support similarity operators. Third, the presented solutions can act as a hub for the development of algorithms that perform broadly employed similarity operations in data analysis. For example, data mining processes often require performing similarity operations, and having them integrated in the database server, possibly optimized by a MAM, may become feasible in the future.
REFERENCES
Baluja, S., & Covell, M. (2008). Waveprint: Efficient wavelet-based audio fingerprinting. Pattern Recognition, 41(11), 3467-3480. doi:10.1016/j.patcog.2008.05.006
Barioni, M. C. N., Razente, H., Traina, A. J. M., & Traina-Jr, C. (2006). SIREN: A similarity retrieval engine for complex data. In International Conference on Very Large Databases (VLDB), (pp. 1155-1158). Seoul, Korea.

Barioni, M. C. N., Razente, H., Traina, A. J. M., & Traina-Jr, C. (2008). Seamlessly integrating similarity queries in SQL. Software, Practice & Experience, 39, 355-384. doi:10.1002/spe.898

Blanken, H. M., de Vries, A. P., Blok, H. E., & Feng, L. (Eds.). (2007). Multimedia retrieval (Data-centric systems and applications). Springer.

Böhm, C., & Krebs, F. (2002). High performance data mining using the nearest neighbor join. In International Conference on Data Mining (ICDM), (pp. 43-50). Maebashi City, Japan.

Böhm, C., & Krebs, F. (2004). The k-nearest neighbour join: Turbo charging the KDD process. Knowledge and Information Systems, 6(6), 728-749. doi:10.1007/s10115-003-0122-9

Böhm, C., Stefan, B., & Keim, D. A. (2001). Searching in high-dimensional spaces: Index structures for improving the performance of multimedia databases. ACM Computing Surveys, 33(3), 322-373. doi:10.1145/502807.502809

Bozkaya, T., & Ozsoyoglu, M. (1997). Distance-based indexing for high-dimensional metric spaces. In ACM International Conference on Management of Data (SIGMOD), (pp. 357-368). Tucson: ACM Press.

Brin, S. (1995). Near neighbor search in large metric spaces. In International Conference on Very Large Databases (VLDB), (pp. 574-584). Zurich, Switzerland: Morgan Kaufmann.

Brinkhoff, T., Kriegel, H.-P., & Seeger, B. (1993). Efficient processing of spatial joins using R-trees. In ACM International Conference on Management of Data (SIGMOD), (pp. 237-246). New York: ACM.

Bueno, R., Kaster, D. S., Paterlini, A. A., Traina, A. J. M., & Traina-Jr, C. (2009). Unsupervised scaling of multi-descriptor similarity functions for medical image datasets. In IEEE International Symposium on Computer-Based Medical Systems (CBMS), (pp. 1-8). Albuquerque, NM: IEEE.

Bugatti, P. H., Traina, A. J. M., & Traina-Jr, C. (2008). Assessing the best integration between distance-function and image-feature to answer similarity queries. In ACM Symposium on Applied Computing (SAC), (pp. 1225-1230). Fortaleza, CE, Brazil: ACM.

Bustos, B., Keim, D., Saupe, D., Schreck, T., & Vranic, D. (2004). Automatic selection and combination of descriptors for effective 3D similarity search. In IEEE International Symposium on Multimedia Software Engineering, (pp. 514-521). Miami: IEEE.

Carey, M. J., & Kossmann, D. (1997). On saying "enough already!" in SQL. In ACM International Conference on Management of Data (SIGMOD), (pp. 219-230).

Carey, M. J., & Kossmann, D. (1998). Reducing the braking distance of an SQL query engine. In International Conference on Very Large Databases (VLDB), (pp. 158-169). New York.

Ciaccia, P., Patella, M., & Zezula, P. (1997). M-tree: An efficient access method for similarity search in metric spaces. In International Conference on Very Large Databases (VLDB), (pp. 426-435). Athens, Greece: Morgan Kaufmann.
IBM Corp. (2003). Image, audio, and video extenders administration and programming guide. DB2 universal database version 8.

Deserno, T. M., Antani, S., & Long, R. (2009). Ontology of gaps in content-based image retrieval. Journal of Digital Imaging, 22(2), 1-14.

Dohnal, V., Gennaro, C., Savino, P., & Zezula, P. (2003a). Similarity join in metric spaces. In European Conference on Information Retrieval Research (ECIR), (pp. 452-467). Pisa, Italy.

Dohnal, V., Gennaro, C., & Zezula, P. (2003b). Similarity join in metric spaces using eD-index. In 14th International Conference on Database and Expert Systems Applications (DEXA), (pp. 484-493). Prague, Czech Republic.

Eakins, J., & Graham, M. (1999). Content-based image retrieval. (Technical Report 39), University of Northumbria at Newcastle.

Felipe, J. C., Traina-Jr, C., & Traina, A. J. M. (2009). A new family of distance functions for perceptual similarity retrieval of medical images. Journal of Digital Imaging, 22(2), 183-201. doi:10.1007/s10278-007-9084-x

Ferreira, M. R. P., Traina, A. J. M., Dias, I., Chbeir, R., & Traina-Jr, C. (2009). Identifying algebraic properties to support optimization of unary similarity queries. 3rd Alberto Mendelzon International Workshop on Foundations of Data Management, Arequipa, Peru, (pp. 1-10).

Flickner, M., Sawhney, H. S., Ashley, J., Huang, Q., Dom, B., & Gorkani, M. (1995). Query by image and video content: The QBIC system. IEEE Computer, 28(9), 23-32.

Gao, L., Wang, M., Sean Wang, X., & Padmanabhan, S. (2004). Expressing and optimizing similarity-based queries in SQL. (Technical Report CS-04-06), University of Vermont. Retrieved from http://www.cs.uvm.edu/csdb/techreport.shtml

Garcia-Molina, H., Ullman, J. D., & Widom, J. (2002). Database systems: The complete book. Upper Saddle River, NJ: Prentice Hall.

Gauker, C. (2007). A critique of the similarity space theory of concepts. Mind & Language, 22(4), 317-345. doi:10.1111/j.1468-0017.2007.00311.x

Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3, 1157-1182. doi:10.1162/153244303322753616

Haralick, R. M. (1979). Statistical and structural approaches to texture. Proceedings of the IEEE, 67, 786-804.

He, X., Ma, W.-Y., King, O., Li, M., & Zhang, H. (2002). Learning and inferring a semantic space from user's relevance feedback for image retrieval. In ACM International Conference on Multimedia (MULTIMEDIA), (pp. 343-346). New York: ACM.

Hjaltason, G. R., & Samet, H. (1995). Ranking in spatial databases. In International Symposium on Advances in Spatial Databases (SSD), (pp. 83-95). Portland, Maine.

Hjaltason, G. R., & Samet, H. (1999). Distance browsing in spatial databases. ACM Transactions on Database Systems, 24(2), 265-318. doi:10.1145/320248.320255
Hjaltason, G. R., & Samet, H. (2003). Index-driven similarity search in metric spaces. ACM Transactions on Database Systems, 28(4), 517-580. doi:10.1145/958942.958948

Huang, P.-W., Hsu, L., Su, Y.-W., & Lin, P.-L. (2008). Spatial inference and similarity retrieval of an intelligent image database system based on objects' spanning representation. Journal of Visual Languages and Computing, 19(6), 637-651. doi:10.1016/j.jvlc.2007.09.001

Informix Corp. (1999). Excalibur image DataBlade module user's guide. Informix Press.

Jacox, E. H., & Samet, H. (2008). Metric space similarity joins. ACM Transactions on Database Systems, 33(2), 1-38. doi:10.1145/1366102.1366104

Jain, A. K., & Farrokhnia, F. (1991). Unsupervised texture segmentation using Gabor filters. Pattern Recognition, 24(12), 1167-1186. doi:10.1016/0031-3203(91)90143-S

Kaster, D. S., Bugatti, P. H., Traina, A. J. M., & Traina-Jr, C. (2009). Incorporating metric access methods for similarity searching on Oracle database. In Brazilian Symposium on Databases (SBBD), (pp. 196-210). Fortaleza, Brazil.

Khotanzad, A., & Hong, Y. H. (1990). Invariant image recognition by Zernike moments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(5), 489-497. doi:10.1109/34.55109

Kokare, M., Chatterji, B., & Biswas, P. (2003). Comparison of similarity metrics for texture image retrieval. In Conference on Convergent Technologies for Asia-Pacific Region, (pp. 571-575).

Korn, F., Sidiropoulos, N., Faloutsos, C., Siegel, E., & Protopapas, Z. (1996). Fast nearest neighbor search in medical image databases. In International Conference on Very Large Databases (VLDB), (pp. 215-226). San Francisco.

Lee, K.-M., & Street, W. N. (2002). Incremental feature weight learning and its application to a shape-based query system. Pattern Recognition Letters, 23(7), 865-874. doi:10.1016/S0167-8655(01)00161-1

Lienhart, R. (2001). Reliable transition detection in videos: A survey and practitioner's guide. International Journal of Image and Graphics, 1, 469-486. doi:10.1142/S021946780100027X

Liu, Y., Zhang, D., Lu, G., & Ma, W.-Y. (2007). A survey of content-based image retrieval with high-level semantics. Pattern Recognition Letters, 40, 262-282.

Long, F., Zhang, H., & Feng, D. D. (2003). Fundamentals of content-based image retrieval (Multimedia information retrieval and management-technological fundamentals and applications). Springer.

Manjunath, B. S., Ohm, J.-R., Vasudevan, V. V., & Yamada, A. (2001). Color and texture descriptors. IEEE Transactions on Circuits and Systems for Video Technology, 11(6), 703-715. doi:10.1109/76.927424

Melton, J., & Eisenberg, A. (2001). SQL multimedia and application packages (SQL/MM). SIGMOD Record, 30(4), 97-102. doi:10.1145/604264.604280

Mokhtarian, F., & Mackworth, A. (1986). Scale-based description and recognition of planar curves and two-dimensional objects. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8(1), 34-43. doi:10.1109/TPAMI.1986.4767750
Oracle Corp. (2005). Oracle interMedia user's guide, (10.2). Oracle.

Papadias, D., Tao, Y., Mouratidis, K., & Hui, C. K. (2005). Aggregate nearest neighbor queries in spatial databases. ACM Transactions on Database Systems, 30(2), 529-576. doi:10.1145/1071610.1071616

Razente, H., Barioni, M. C. N., Traina, A. J. M., & Traina-Jr, C. (2008a). Aggregate similarity queries in relevance feedback methods for content-based image retrieval. In ACM Symposium on Applied Computing (SAC), (pp. 869-874). Fortaleza, Brazil.

Razente, H., Barioni, M. C. N., Traina, A. J. M., & Traina-Jr, C. (2008b). A novel optimization approach to efficiently process aggregate similarity queries in metric access methods. In ACM International Conference on Information and Knowledge Management, (pp. 193-202). Napa, CA.

Roussopoulos, N., Kelley, S., & Vincent, F. (1995). Nearest neighbor queries. In ACM International Conference on Management of Data (SIGMOD), (pp. 71-79).

Santini, S., & Gupta, A. (2001). A wavelet data model for image databases. In IEEE International Conference on Multimedia and Expo (ICME), Tokyo, Japan: IEEE Computer Society.

Seidl, T., & Kriegel, H.-P. (1998). Optimal multi-step k-nearest neighbor search. In ACM International Conference on Management of Data (SIGMOD), (pp. 154-165). Seattle, Washington.

Smeulders, A. W. M., Worring, M., Santini, S., Gupta, A., & Jain, R. (2000). Content-based image retrieval at the end of the early years. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 22(12), 1349-1380. doi:10.1109/34.895972

Tasan, M., & Ozsoyoglu, Z. M. (2004). Improvements in distance-based indexing. In International Conference on Scientific and Statistical Database Management (SSDBM), (p. 161). Washington, DC: IEEE Computer Society.

Torres, R. S., Falcão, A. X., Gonçalves, M. A., Papa, J. P., Zhang, P., & Fan, W. (2009). A genetic programming framework for content-based image retrieval. Pattern Recognition, 42(2), 283-292. doi:10.1016/j.patcog.2008.04.010

Traina-Jr, C., Traina, A. J. M., Faloutsos, C., & Seeger, B. (2002). Fast indexing and visualization of metric datasets using Slim-trees. IEEE Transactions on Knowledge and Data Engineering (TKDE), 14(2), 244-260. doi:10.1109/69.991715

Traina-Jr, C., Traina, A. J. M., Vieira, M. R., Arantes, A. S., & Faloutsos, C. (2006). Efficient processing of complex similarity queries in RDBMS through query rewriting. In International Conference on Information and Knowledge Management (CIKM), (pp. 4-13). Arlington, VA.

Tzanetakis, G., & Cook, P. R. (2002). Musical genre classification of audio signals. IEEE Transactions on Speech and Audio Processing, 10(5), 293-302. doi:10.1109/TSA.2002.800560

Wan, C., & Liu, M. (2006). Content-based audio retrieval with relevance feedback. Pattern Recognition Letters, 27(2), 85-92. doi:10.1016/j.patrec.2005.07.005

Wilson, D. R., & Martinez, T. R. (1997). Improved heterogeneous distance functions. Journal of Artificial Intelligence Research, 6, 1-34.
Wu, L., Faloutsos, C., Sycara, K., & Payne, T. R. (2000). Falcon: Feedback adaptive loop for content-based retrieval. In International Conference on Very Large Databases (VLDB), (pp. 297-306). Cairo, Egypt.

Yeh, W.-H., & Chang, Y.-I. (2008). An efficient iconic indexing strategy for image rotation and reflection in image databases. Journal of Systems and Software (JSS), 81(7), 1184-1195. doi:10.1016/j.jss.2007.08.019

Yianilos, P. N. (1993). Data structures and algorithms for nearest neighbor search in general metric spaces. In ACM/SIGACT-SIAM Symposium on Discrete Algorithms (SODA), (pp. 311-321). Austin, TX, USA: Society for Industrial and Applied Mathematics.

Zahn, C. T., & Roskies, R. Z. (1972). Fourier descriptors for plane closed curves. IEEE Transactions on Computers, 21(3), 269-281. doi:10.1109/TC.1972.5008949

Zhou, X. S., & Huang, T. S. (2003). Relevance feedback in image retrieval: A comprehensive review. Multimedia Systems, 8(6), 536-544. doi:10.1007/s00530-002-0070-3
ADDITIONAL READING
Antani, S., Kasturi, R., & Jain, R. (2002). A survey on the use of pattern recognition methods for abstraction, indexing and retrieval of images and video. Pattern Recognition, 35(4), 945-965. doi:10.1016/S0031-3203(01)00086-3

Böhm, C., Stefan, B., & Keim, D. A. (2001). Searching in high-dimensional spaces: Index structures for improving the performance of multimedia databases. ACM Computing Surveys, 33(3), 322-373. doi:10.1145/502807.502809

Cano, P. (2009). Content-based audio search: From audio fingerprinting to semantic audio retrieval. VDM Verlag.

Chávez, E., Navarro, G., Baeza-Yates, R., & Marroquín, J. L. (2001). Searching in metric spaces. ACM Computing Surveys, 33(3), 273-321. doi:10.1145/502807.502808

Datta, R., Joshi, D., Li, J., & Wang, J. Z. (2008). Image retrieval: Ideas, influences, and trends of the new age. ACM Computing Surveys, 40(2), 1-60. doi:10.1145/1348246.1348248

Deb, S. (2003). Multimedia systems and content-based image retrieval. Information Science Publishing.

Deserno, T. M., Antani, S., & Long, R. (2009). Ontology of gaps in content-based image retrieval. Journal of Digital Imaging, 22(2), 202-215. doi:10.1007/s10278-007-9092-x

Doulamis, N., & Doulamis, A. (2006). Evaluation of relevance feedback schemes in content-based retrieval systems. Signal Processing Image Communication, 21(4), 334-357. doi:10.1016/j.image.2005.11.006

Geetha, P., & Narayanan, V. (2008). A survey of content-based video retrieval. Journal of Computer Science, 4(6), 474-486. doi:10.3844/jcssp.2008.474.486
Gibbon, D. C., & Liu, Z. (2008). Introduction to video search engines (1st ed.). Springer.

Groff, J. R., & Weinberg, P. N. (2002). SQL: The complete reference (2nd ed.). McGraw-Hill/Osborne Media.

Kosch, H. (2003). Multimedia database management systems: Indexing, access, and MPEG-7. Boca Raton, FL: CRC Press.

Lew, M. S., Sebe, N., Djeraba, C., & Jain, R. (2006). Content-based multimedia information retrieval: State of the art and challenges. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMCCAP), 2(1), 1-19. doi:10.1145/1126004.1126005

Melton, J. (2001). SQL:1999 - Understanding relational language components. The Morgan Kaufmann Series in Data Management Systems (1st ed.). Morgan Kaufmann.

Mostefaoui, A. (2006). A modular and adaptive framework for large scale video indexing and content-based retrieval: The SIRSALE system. Software, Practice & Experience, 36(8), 871-890. doi:10.1002/spe.722

Müller, H. (2008). Medical multimedia retrieval 2.0. IMIA Yearbook of Medical Informatics, 2008, 55-63.

Samet, H. (2006). Foundations of multidimensional and metric data structures. San Francisco, CA: Morgan Kaufmann.

Vasconcelos, N. (2008). From pixels to semantic spaces: Advances in content-based image retrieval. IEEE Computer, 40(7), 20-26.

Vaswani, V. (2004). The future of SQL. In MySQL: The complete reference. McGraw-Hill/Osborne.

Zezula, P., Amato, G., Dohnal, V., & Batko, M. (2006). Similarity search: The metric space approach (Advances in Database Systems, vol. 32). Springer.

Zhang, Z., & Zhang, R. (2008). Multimedia data mining: A systematic introduction to concepts and theory (Chapman & Hall/CRC Data Mining and Knowledge Discovery Series). Chapman & Hall.

Zhou, S. X., & Huang, T. S. (2003). Relevance feedback in image retrieval: A comprehensive review. Multimedia Systems, 8(6), 536-544. doi:10.1007/s00530-002-0070-3
KEY TERMS AND DEFINITIONS

Query-By-Example: A query in which the user gives an example of what he or she wants to retrieve.

Semantic Gap: The gap between what the user wants and the results obtained based on features extracted automatically from multimedia data.
ENDNOTE
1. In Oracle, feature vectors are stored explicitly in other attributes of the table containing the image. Thus, the Oracle example considers that the table Landscapes also has another attribute holding the features (Picture_Features).
363
Compilation of References
Chaudhuri, S., & Weikum, G. (2000). Rethinking database system architecture: Towards a self-tuning RISC-style database system. In Proceedings of 26th International Conference on Very Large Data Bases, (p. 1-10). Chaudhuri, S., Das, G., Hristidis, V., & Weikum, G. (2004). Probabilistic ranking of database query results. Proceedings of the 30th International Conference on Very Large Data Base, (pp. 888899). Chen, C., Yan, X., Yu, P. S., Han, J., Zhang, D.-Q., & Gu, X. (2007). Towards graph containment search and indexing. In Proceedings of the 33rd International Conference on Very Large Data Bases, (pp. 926-937). Chen, Z. Y., & Li, T. (2007). Addressing diverse user preferences in SQL-Query-Result navigation. Proceedings of the ACM SIGMOD International Conference on Management of Data, (pp. 641-652). Cheng, J., Ke, Y., Ng, W., & Lu, A. (2007). FG-Index: Towards verification-free query processing on graph databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data, (pp. 857-872). Chong, E. I., Das, S., Eadon, G., & Srinivasan, J. (2005). An efficient SQL-based RDF querying scheme. In Proceedings of the 31st International Conference on Very Large Data Bases, (pp. 1216-1227). Chrobak, M., Keynon, C., & Young, N. (2005). The reverse greedy algorithm for the metric k-median problem. Information Processing Letters, 97, 6872. doi:10.1016/j. ipl.2005.09.009 Chu, E., Beckmann, J. L., & Naughton, J. F. (2007). The case for a wide-table approach to manage sparse relational data sets. In Proceedings of the ACM SIGMOD International Conference on Management of Data, (pp. 821-832). Ciaccia, P., Patella, M., & Zezula, P. (1997). M-tree: An efficient access method for similarity search in metric spaces. In International Conference on Very Large Databases (VLDB), (pp. 426435). Athens, Greece. Morgan Kaufmann. Codd, E. F. (1979). Extending the database relational model to capture more meaning. [TODS]. ACM Transactions on Database Systems, 4(4), 397434. doi:10.1145/320107.320109
Connolly, T., & Begg, C. (2005). Database systems-a practical approach to design, implementation, and management. United Kingdom: Pearson Education Limited. Consens, M. P., & Mendelzon, A. O. (1990). GraphLog: A visual formalism for real life recursion. In Proceedings of the Ninth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, (pp. 404-416). Copeland, G. P., & Khoshafian, S. (1985). A decomposition storage model. In Proceedings of the ACM SIGMOD International Conference on Management of Data, (pp. 268-279). Cross Drafting Teams, I. N. S. P. I. R. E. (2007). INSPIRE technical architecture overview, INSPIRE cross drafting teams report. Retrieved on February 17, 2009, from http:// inspire.jrc.ec.europa.eu/reports.cfm Cyganiak, R. (2005). A relational algebra for SPARQL. (Tech. Rep. No. HPL-2005-170). HP Labs. Dadashzadeh, M. (1989). An improved division operator for relational algebra. Information Systems, 14(5), 431437. doi:10.1016/0306-4379(89)90007-0 Darwen, H., & Date, C. (1992). Into the great divide. In Date, C., & Darwen, H. (Eds.), Relational database: Writings 1989-1991 (pp. 155168). Reading, MA: Addison-Wesley. Das, G., Hristidis, V., Kapoor, N., & Sudarshan, S. (2006). Ordering the attributes of query results. Proceedings of the ACM SIGMOD International Conference on Management of Data, (pp. 395-406). DBLP XML Records. (2009). Home page information. Retrieved from http://dblp.uni-trier.de/xml/ De Caluwe, R., Devis, F., Maesfranckx, P., De Tr, G., & Van der Cruyssen, B. (1999). Semantics and modelling of flexible time indication. In Zadeh, L. A., & Kacprzyk, J. (Eds.), Computing with words in Information/Intelligent Systems (pp. 229256). Physica Verlag. De Caluwe, R., & De Tr, G. G., Van der Cruyssen, B., Devos, F. & Maesfranckx, P. (2000). Time management in fuzzy and uncertain object-oriented databases. In O. Pons, A. Vila & J. Kacprzyk (Eds.), Knowledge management in fuzzy databases. (pp. 67-88). Heidelberg: Physica-Verlag.
364
Compilation of References
De Kunder, M. (2010). The size of the World Wide Web. Retrieved from http://www.worldwidewebsize.com De Tr, G., De Caluwe, R., Tourn, K., & Matth, T. (2003). Theoretical considerations ensuing from experiments with flexible querying. In T. Bilgi, B. De Baets & O. Kaynak (Eds.), Proceedings of the IFSA 2003 World Congress, (pp. 388391). (LNCS 2715). Springer. De Tr, G., Verstraete, J., Hallez, A., Matth, T., & De Caluwe, R. (2006). The handling of select-project-join operations in a relational framework supported by possibilistic logic. In Proceedings of the 11th International Conference on Information Processing and Management of Uncertainty in Knowledge-based Systems (IPMU), (pp. 21812188). Paris, France. De Tr, G., Zadrony, S., Matthe, T., Kacprzyk, J., & Bronselaer, A. (2009). Dealing with positive and negative query criteria in fuzzy database querying. (LNCS 5822), (pp. 593-604). Dekkers, M. (2008). Temporal metadata for discovery-a review of current practice. M. Craglia (Ed.), (EUR 23209 EN, JRC Scientific and Technical Report). DeMichiel, L. G. (1989). Resolving database incompatibility: an approach to performing relational operations over mismatched domains. IEEE Transactions on Knowledge and Data Engineering, 1(4), 485493. doi:10.1109/69.43423 Deng, L., Cai, Y., Wang, C., & Jiang, Y. (2009). Fuzzy temporal logic on fuzzy temporal constraint metworks. In the Proceedings of the Sixth International Conference on Fuzzy Systems and Knowledge Discovery, 6, (pp. 272-276). Deserno, T. M., Antani, S., & Long, R. (2009). Ontology of gaps in content-based image retrieval. Journal of Digital Imaging, 2(22), 114. Dhillon, I. S., Mallela, S., & Kumar, R. (2002). Enhanced word clustering for hierarchical text classification. Proceedings of the 8th ACM SIGKDD International Conference, (pp. 191200). Directive, I. N. S. P. I. R. E. 2007/2/EC of the European Parliament and of the Council of 14. (2007). INSPIRE. Retrieved on February 17, 2009, from www.ecgis.org/ inspire/directive/l_10820070425en00010014.pdf
Dittrich, K. R. (1986). Object-oriented database systems: The notions and the issues. In Proceedings of the International Workshop on Object-Oriented Database Systems, (pp. 24). Dubois, D., & Prade, H. (1997). Using fuzzy sets in flexible querying: Why and how? In Andreasen, T., Christiansen, H., & Larsen, H. L. (Eds.), Flexible query answering systems. Dordrecht: Kluwer Academic Publishers. Dubois, D., & Prade, P. (2008). Handling bipolar queries in fuzzy information processing. In Galindo, J. (Ed.), Handbook of research on fuzzy information processing in databases (pp. 97114). New York: Information Science Reference. Dubois, D., & Prade, H. (2002). Bipolarity in flexible querying. (LNAI 2522), (pp. 174-182). Eakins, J., & Graham, M. (1999). Content-based image retrieval. (Technical Report 39), University of Northumbria at Newcastle. Eliassen, F., & Karlsen, R. (1991). Interoperability and object identity. SIGMOD Record, 20(4), 2529. doi:10.1145/141356.141362 Elmasri, R., & Navathe, S. R. (2006). Fundamentals of database systems (5th ed.). Addison Wesley. European Commission. (2007). Draft implementing rules for metadata (v. 3). INSPIRE Metadata Report. Retrieved on February 17, 2009, from http://inspire.jrc.ec.europa. eu/reports.cfm European Commission. (2009). INSPIRE metadata implementing rules: Technical guidelines based on EN ISO 19115 and EN ISO 19119. INSPIRE Metadata Report. Retrieved on February 18, 2009, from http://inspire.jrc. ec.europa.eu/reports.cfm Fagin, R. (2002). Combining fuzzy information: An overview. SIGMOD Record, 31(2), 109118. doi:10.1145/565117.565143 Fagin, R., Kolaitis, P. G., & Popa, L. (2005). Data exchange: Getting to the core. ACM Transactions on Database Systems, 30(1), 174210. doi:10.1145/1061318.1061323
365
Compilation of References
Fagin, R., Kolaitis, P. G., Popa, L., & Tan, W. C. (2004). Composing schema mappings: Second-order dependencies to the rescue. In: A. Deutsch (Ed.), Proceedings of the 23rd ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, (pp. 83-94). June 14-16, 2004, Paris, France, ACM. Felipe, J. C., Traina-Jr, C., & Traina, A. J. M. (2009). A new family of distance functions for perceptual similarity retrieval of medical images. Journal of Digital Imaging, 22(2), 183201. doi:10.1007/s10278-007-9084-x Ferreira, M. R. P., Traina, A. J. M., Dias, I., Chbeir, R., & Traina-Jr, C. (2009). Identifying algebraic properties to support optimization of unary similarity queries. 3rd Alberto Mendelzon International Workshop on Foundations of Data Management, Arequipa, Peru, (pp. 110). Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., et al. (2001). Placing search in context: The concept revisited. Proceedings of the 9th International World Wide Web Conference, (pp. 406414). Flickner, M., Sawhney, H. S., Ashley, J., Huang, Q., Dom, B., & Gorkani, M. (1995). Query by image and video content: The QBIC system. IEEE Computer, 28(9), 2332. Fortin, S. (1996). The graph isomorphism problem. (Technical Report). Department of Computing Science, University of Alberta. Fuxman, A., Kolaitis, P. G., Miller, R. J., & Tan, W. C. (2006). Peer data exchange. ACM Transactions on Database Systems, 31(4), 14541498. doi:10.1145/1189769.1189778 Galindo, J., Urrutia, A., & Piattini, M. (2006). Fuzzy databases: Modeling, design and implementation. Hershey, PA: Idea Group Publishing. Galindo, J. (2005). New characteristics in FSQL, a fuzzy SQL for fuzzy databases. WSEAS Transactions on Information Science and Applications, 2(2), 161169. Galindo, J., Medina, J. M., Pons, O., & Cubero, J. C. (1998). A server for Fuzzy SQL queries. In T. Andreasen, H. Christiansen & H.L. Larsen (Eds.), Proceedings of the Third International Conference on Flexible Query Answering Systems, (pp. 164-174). (LNAI 1495). London: Springer-Verlag.
Gao, L., Wang, M., Sean Wang, X., & Padmanabhan, S. (2004). Uexpressing and optimizing similarity-based queries in SQLs. (Technical Report CS-04-06), University of Vermont. Retrieved from http:// www.cs.uvm.edu / csdb/ techreport.shtml Gao, X., Xiao, B., Tao, D. & Li, X. (2009). A survey of graph edit distance. Pattern Analysis & Applications. Garcia-Molina, H., Ullman, J. D., & Widom, J. (2002). Database systems: The complete book. Upper Saddle River, NJ: Prentice Hall. Garey, M. R., & Johnson, D. S. (1979). Computers and intractability: A guide to the theory of NP-completeness. W.H. Freeman. Gauker, C. (2007). A critique of the similarity space theory of concepts. Mind & Language, 22(4), 317345. doi:10.1111/j.1468-0017.2007.00311.x Geerts, F., Mannila, H., & Terzim, E. (2004). Relational link-based ranking. Proceedings of the 30th International Conference on Very Large Data Base, (pp. 552-563). Gibello, P. (2010). Zql: A Java SQL parser. Retrieved June 2010, from http://www.gibello.com/code/zql/ Giugno, R., & Shasha, D. (2002). GraphGrep: A fast and universal method for querying graphs. In IEEE International Conference in Pattern Recognition, (pp. 112-115). Glimm, B., Lutz, C., Horrocks, I., & Sattler, U. (2008). Conjunctive query answering for the description logic shiq. [JAIR]. Journal of Artificial Intelligence Research, 31, 157204. Glimm, B., Horrocks, I., & Sattler, U. (2007). Conjunctive query entailment for shoq. Proceedings of the 2007 International Workshop on Description Logic (DL 2007). CEUR Electronic Workshop Proceedings. Godfrey, P., Shipley, R., & Gryz, J. (2005). Maximal vector computation in large data sets. In VLDB 05: Proceedings of the 31st International Conference on Very Large Data Bases (VLDB), (pp. 229-240). Goncalves, M., & Tineo, L. (2008). SQLfi y sus aplicaciones. [Medelln, Colombia]. Avances en Sistemas e Informtica, 5(2), 3340.
366
Compilation of References
Goncalves, M., & Vidal, M.-E. (2009). Reaching the top of the Skyline: An efficient indexed algorithm for Top-k Skyline queries. In Proceedings of International Conference on Database and Expert Systems Applications (DEXA), (pp. 471-485). Grabisch, M., Greco, S., & Pirlot, M. (2008). Bipolar and bivariate models in multicriteria decision analysis: Descriptive and constructive approaches. International Journal of Intelligent Systems, 23, 930969. doi:10.1002/ int.20301 Graefe, G. (2003). Sorting and indexing with partitioned B-trees. In Proceedings of the 1st International Conference on Data Systems Research. Grust, T., Sakr, S., & Teubner, J. (2004). XQuery on SQL hosts. In Proceedings of the Thirtieth International Conference on Very Large Data Bases, (pp. 252-263). Guttman, A. (1984). R-Trees: A dynamic index structure for spatial searching. In Proceedings of the ACM SIGMOD International Conference on Management of Data, (pp. 47-57). Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3, 11571182. doi:10.1162/153244303322753616 Gyssens, M., Paredaens, J., den Bussche, J. V., & Gucht, D. V. (1994). A graph-oriented object database model. [TKDE]. IEEE Transactions on Knowledge and Data Engineering, 6(4), 572586. doi:10.1109/69.298174 Gyssens, M., & Lakshmanan, L. V. S. (1997). A foundation for multi-dimensional databases. In Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB97), (pp. 106115). Haas, L. M. (2007). Beauty and the beast: The theory and practice of information integration. In Schwentick, T., & Suciu, D. (Eds.), Database theory. (LNCS 4353) (pp. 2843). Springer. Haralick, R. M. (1979). Statistical and structural approaches to texture. IEEE, 67, 786804. Harris, S., & Gibbins, N. (2003). 3store: Efficient bulk RDF storage. In Proceedings of the First International Workshop on Practical and Scalable Semantic Systems.
Harris, S., & Shadbolt, N. (2005). SPARQL query processing with conventional relational database systems. In Proceedings of SSWS. Harth, A., & Decker, S. (2005). Optimized index structures for querying RDF from the Web. In Proceedings of the Third Latin American Web Congress, (pp. 71-80). He, H., & Singh, A. K. (2006). Closure-Tree: An index structure for graph queries. In Proceedings of the 22nd International Conference on Data Engineering, (pp. 38-52). He, H., & Singh, A. K. (2008). Graphs-at-a-time: Query language and access methods for graph databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data, (pp. 405-418). He, X., Ma, W.-Y., King, O., Li, M., & Zhang, H. (2002). Learning and inferring a semantic space from users relevance feedback for image retrieval. In ACM International Conference on Multimedia (MULTIMEDIA), (pp. 343346). New York: ACM. Hjaltason, G. R., & Samet, H. (1999). Distance browsing in spatial databases. ACM Transactions on Database Systems, 24(2), 265318. doi:10.1145/320248.320255 Hjaltason, G. R., & Samet, H. (2003). Index-driven similarity search in metric spaces. ACM Transactions on Database Systems, 28(4), 517580. doi:10.1145/958942.958948 Hjaltason, G. R., & Samet, H. (1995). Ranking in spatial databases. In International Symposium on Advances in Spatial Databases (SSD), (pp. 8395). Portland, Maine. Huan, J., Wang, W., Bandyopadhyay, D., Snoeyink, J., Prins, J., & Tropsha, A. (2004). Mining protein family specific residue packing patterns from protein structure graphs. In Proceedings of the Eighth Annual International Conference on Computational Molecular Biology, (pp. 308-315). Huang, P.-W., Hsu, L., Su, Y.-W., & Lin, P.-L. (2008). Spatial inference and similarity retrieval of an intelligent image database system based on objects spanning representation. Journal of Visual Languages and Computing, 19(6), 637651. doi:10.1016/j.jvlc.2007.09.001 Hull, R., & King, R. (1987). Semantic database modeling: Survey, applications, and research issues. [CSUR]. ACM Computing Surveys, 19(3), 201260. doi:10.1145/45072.45073
367
Compilation of References
Hung, E., Deng, Y., & Subrahmanian, V. S. (2005). RDF aggregate queries and views. In Proceedings of IEEE ICDE. IBM Corp. (2003). Image, audio, and video extenders administration and programming guide. DB2 universal database version 8. Informix Corp. (1999). Excalibur image DataBlade module users guide. Informix Press. ISO8601. (2004). Data elements and interchange formatsinformation interchange- Representation of dates and times. (Ref: ISO 8601). Jacox, E. H., & Samet, H. (2008). Metric space similarity joins. ACM Transactions on Database Systems, 33(2), 138. doi:10.1145/1366102.1366104 Jain, A. K., & Farrokhnia, F. (1991). Unsupervised texture segmentation using gabor filters. Pattern Recognition, 24(12), 11671186. doi:10.1016/0031-3203(91)90143-S Jeffery, S. R., Franklin, M. J., & Halevy, A. Y. (2008). Pay-as-you-go user feedback for dataspace systems. In Proceedings of the ACM SIGMOD International Conference on Management of Data, (pp. 847-860). Jiang, H., Wang, H., Yu, P. S., & Zhou, S. (2007). GString: A novel approach for efficient search in graph databases. In Proceedings of the 23rd International conference on Data Engineering, (pp. 566-575). Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. Proceedings of the European Conference on Machine Learning, (pp. 137142). Joachims, T. (2002). Optimizing search engines using clickthrough data. Proceedings of the ACM Conference on Knowledge Discovery and Data Mining, (pp. 133142). Kacprzyk, J., & Yager, R. R. (2001). Linguistic summaries of data using fuzzy logic. International Journal of General Systems, 30, 133154. doi:10.1080/03081070108960702 Kacprzyk, J., Yager, R. R., & Zadrony, S. (2000). A fuzzy logic based approach to linguistic summaries of databases. International Journal of Applied Mathematics and Computer Science, 10, 813834.
Kacprzyk, J., & Zadrony, S. (2005). Linguistic database summaries and their protoforms: Towards natural language based knowledge discovery tools. Information Sciences, 173, 281304. doi:10.1016/j.ins.2005.03.002 Kacprzyk, J., & Zadrony, S. (2009). Protoforms of linguistic database summaries as a human consistent tool for using natural language in data mining. International Journal of Software Science and Computational Intelligence, 1(1), 100111. Kacprzyk, J., Zadrony, S., & Zikowski, A. (1989). FQUERY III+: A human-consistent database querying system based on fuzzy logic with linguistic quantifiers. Information Systems, 14, 443453. doi:10.1016/03064379(89)90012-4 Kacprzyk, J., & Zikowski, A. (1986). Database queries with fuzzy linguistic quantifiers. IEEE Transactions on Systems, Man, and Cybernetics, 16, 474479. doi:10.1109/ TSMC.1986.4308982 Kacprzyk, J., & Zadrony, S. (1995). FQUERY for Access: Fuzzy querying for windows-based DBMS. In Bosc, P., & Kacprzyk, J. (Eds.), Fuzziness in database management systems (pp. 415433). Heidelberg, Germany: Physica-Verlag. Kacprzyk, J., & Zadrozny, S. (2010). Computing with words and systemic functional linguistics: Linguistic data summaries and natural language generation. In Huynh, V.-N., Nakamori, Y., Lawry, J., & Inuiguchi, M. (Eds.), Integrated uncertainty management and applications (pp. 2336). Heidelberg: Springer-Verlag. doi:10.1007/9783-642-11960-6_3 Kacprzyk, J., & Zadrony, S. (1997). Implementation of OWA operators in fuzzy querying for Microsoft Access. In Yager, R. R., & Kacprzyk, J. (Eds.), The ordered weighted averaging operators: Theory and applications (pp. 293306). Boston: Kluwer Academic Publishers. Kandel, A. (1986). Fuzzy Mathematical Techniques with Applications, Addison Wesley Publishing Co., California. Kaufman, A. (1975). Inroduction to the Theory of Fuzzy Subsets, Vol-I, Academic Press. New York: Sanfrancisco. Karvounarakis, G., Alexaki, S., Christophides, V., Plexousakis, D., & Scholl, M. (2002). RQL: A declarative query language for RDF. In Proceedings of WWW.
368
Compilation of References
Kaster, D. S., Bugatti, P. H., Traina, A. J. M., & Traina-Jr, C. (2009). Incorporating metric access methods for similarity searching on Oracle database. In Brazilian Symposium on Databases (SBBD), (pp. 196210). Fortaleza, Brazil. Kaul, M., Drosten, K., & Neuhold, E. J. (1990). Integrating heterogeneous information bases by object-oriented views, In: Proc. Intl. Conf. on Data Engineering, pp 2-10. Litwin W. and Abdellatif A. (1986). Multidatabase Interoperabilty, IEEE. The Computer Journal, 12(19), 1018. Kent, W. (1991). A rigorous model of object references, identity and existence. Journal of Object-Oriented Programming, 4(3), 2838. Khoshafian, S. N., & Copeland, G. P. (1986). Object identity. Proceedings of OOPSLA86, ACM SIGPLAN Notices, 21(11), 406416. Khotanzad, A., & Hong, Y. H. (1990). Invariant image recognition by zernike moments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(5), 489497. doi:10.1109/34.55109 Kieling, W. (2002). Foundations of preferences in database systems. Proceedings of the 28th International Conference on Very Large Data Bases, (pp. 311-322). Kiryakov, A., Ognyanov, D., & Manov, D. (2005). Owlima pragmatic semantic repository for owl. In Proceedings of the Web Information Systems Engineering Workshops, (pp. 182-192). Klement, E. P., Mesiar, R., & Pap, E. (Eds.). (2000). Triangular norms. Dordrecht, Boston, London: Kluwer Academic Publishers. Klinger, S., & Austin, J. (2005). Chemical similarity searching using a neural graph matcher. In Proceedings of the 13th European Symposium on Artificial Neural Networks, (p. 479-484). Koch, C. (2009). MayBMS: A database management system for uncertain and probabilistic data. In Aggarwal, C. (Ed.), Managing and mining uncertain data (pp. 149184). Springer. doi:10.1007/978-0-387-09690-2_6 Kokare, M., Chatterji, B., & Biswas, P. (2003). Comparison of similarity metrics for texture image retrieval. In Conference on Convergent Technologies for Asia-Pacific Region, (pp. 571575).
Koller, D., & Sahami, M. (1997). Hierarchically classifying documents using very few words. Proceedings of the 14th International Conference on Machine Learning, (pp. 170178). Koloniari, G., & Pitoura, E. (2005). Peer-to-peer management of XML data: Issues and research challenges. SIGMOD Record, 34(2), 617. doi:10.1145/1083784.1083788 Korn, F., Sidiropoulos, N., Faloutsos, C., Siegel, E., & Protopapas, Z. (1996). Fast nearest neighbor search in medical image databases. In International Conference on Very Large Databases (VLDB), pp(. 215226). San Francisco. Kossmann, D., Ramsak, F., & Rost, S. (2002). Shooting stars in the sky: An online algorithm for skyline queries. In Proceedings of the 28th International Conference on Very Large Data Bases (VLDB), (pp. 275-286). Koutrika, G., & Ioannidis, Y. (2004). Personalization of queries in database systems. Proceedings of the 20th International Conference on Database Engineering, (pp. 597-608). Krtzsch, M., Rudolph, S., & Hitzler, P. (2007). Conjunctive queries for a tractable fragment of owl 1.1. Proceedings of the 6th International Semantic Web Conference (ISWC 2007), 310323. Kuramochi, M., & Karypis, G. (2001). Frequent subgraph discovery. In Proceedings of the IEEE International Conference on Data Mining, (pp. 313-320). Kuramochi, M., & Karypis, G. (2004). GREW-a scalable frequent subgraph discovery algorithm. In Proceedings of the IEEE International Conference on Data Mining, (pp. 439-442). Lacroix, M., & Lavency, P. (1987). Preferences: Putting more knowledge into queries. In Proceedings of the 13 International Conference on Very Large Databases, (pp. 217-225). Brighton, UK. Laurent, A. (2003). Querying fuzzy multidimensional databases: Unary operators and their properties. International Journal of Uncertainty. Fuzziness and Knowledge-Based Systems, 11, 3146. doi:10.1142/S0218488503002259
369
Compilation of References
Lee, K.-M., & Street, W. N. (2002). Incremental feature weight learning and its application to a shape-based query system. Pattern Recognition Letters, 23(7), 865874. doi:10.1016/S0167-8655(01)00161-1 Lee, J., Oh, J.-H., & Hwang, S. (2005). STRG-index: Spatio-temporal region graph indexing for large video databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data, (pp. 718-729). Leinders, D., & den Bussche, J. V. (2005). On the complexity of division and set joins in the relational algebra. In Proceedings of ACM PODS, Baltimore, MD USA. Leser, U. (2005). A query language for biological networks. In Proceedings of the Fourth European Conference on Computational Biology/Sixth Meeting of the Spanish Bioinformatics Network, (p. 39). Levandoski, J. J., & Mokbel, M. F. (2009). RDF datacentric storage. In Proceedings of the IEEE International Conference on Web Services. Levy, A. Y., & Rousset, M.-C. (1998). Combining horn rules and description logics in carin. Artificial Intelligence, 104(1-2), 165209. doi:10.1016/S0004-3702(98)00048-4 Ley, M. (2010). The dblp computer science bibliography. Retrieved from http://www.informatik.uni-trier. de/~ley/db Li, C., & Wang, X. S. (1996). A data model for supporting on-line analytical processing. In Proceedings of the Conference on Information and Knowledge Management, Baltimore, MD, (pp. 8188). Li, C., Chen-chuan, K., Ihab, C., Ilyas, F., & Song, S. (2005). RankSQL: Query algebra and optimization for relational top-k queries. In Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, (pp. 131-142). ACM Press. Lieberman, H. (1986). Using prototypical objects to implement shared behavior in object-oriented systems. In Proceedings of OOPSLA86, ACM SIGPLAN Notices, 21(11), 214223. Lienhart, R. (2001). Reliable transition detection in videos: A survey and practitioners guide. International Journal of Image and Graphics, 1, 469486. doi:10.1142/ S021946780100027X
Lin, X., Yuan, Y., Zhang, Q., & Zhang, Y. (2007). Selecting Stars: The k Most Represen-tative Skyline Operator. In Proceedings of International Conference on Database Theory (ICDE), pp. 86-95. Liu, Y., Zhang, D., Lu, G., & Ma, W.-Y. (2007). A survey of content-based image retrieval with high-level semantics. Pattern Recognition Letters, 40, 262282. Liu, F., Yu, C., & Meng, W. (2002). Personalized Web search by mapping user queries to categories. Proceedings of the ACM International Conference on Information and Knowledge Management, (pp. 558-565). Long, F., Zhang, H., & Feng, D. D. (2003). Fundamentals of content-based image retrieval (Multimedia information retrieval and management-technological fundamentals and applications). Springer. Lpez, Y., & Tineo, L. (2006). About the performance of SQLf evaluation mechanisms. CLEI Electronic Journal, 9(2), 8. Retrieved October 10, 2009, from http://www. clei.cl/cleiej/papers/v9i2p8.pdf Lukasiewicz, T., & Straccia, U. (2008). Managing uncertainty and vagueness in description logics for the semantic Web. Journal of Web Semantics, 6(4), 291308. doi:10.1016/j.websem.2008.04.001 Ma, L., Su, Z., Pan, Y., Zhang, L., & Liu, T. (2004). RStar: An RDF storage and query system for enterprise resource management. In Proceedings of the ACM International Conference on Information and Knowledge Management, (pp. 484-491). Ma, L., Wang, C., Lu, J., Cao, F., Pan, Y., & Yu, Y. (2008). Effective and efficient Semantic Web data management over DB2. In Proceedings of the ACM SIGMOD International Conference on Management of Data, (pp. 1183-1194). Madhavan, J., & Halevy, A. Y. (2003). Composing mappings among data sources. In J. Ch., Freytag, et al. (Eds.), VLDB 2003, Proceedings of 29th International Conference on Very Large Data Bases, (pp. 572-583). September 9-12, 2003, Berlin, Germany. Morgan Kaufmann. Mahmoudi Nasab, H., & Sakr, S. (2010). An experimental evaluation of relational RDF sorage and querying techniques. In Proceedings of the 2nd International Workshop on Benchmarking of XML and Semantic Web Applications.
370
Compilation of References
Maier, D. (1983). The theory of relational databases. Computer Science Press. Mailis, T. P., Stoilos, G., & Stamou, G. B. (2007). Expressive reasoning with horn rules and fuzzy description logics. Proceedings of 2nd International Conference on Web Reasoning and Rule Systems (RR08). Maiocchi, R., Pernici, B., & Barbic, F. (1992). Automatic deduction of temporal indications. ACM Transactions on Database Systems, 17(4), 647668. doi:10.1145/146931.146934 Manjunath, B. S., Ohm, J.-R., Vasudevan, V. V., & Yamada, A. (2001). Color and texture descriptors. IEEE Transactions on Circuits and Systems for Video Technology, 11(6), 703715. doi:10.1109/76.927424 Manola, F., & Miller, E. (2004). RDF primer. W3C recommendation. Retrieved from http://www.w3.org/ TR/REC-rdf-syntax/ Martens, W., Neven, F., & Schwentick, T. (2007). Simple off the shelf abstractions for XML schema. SIGMOD Record, 36(3), 1522. doi:10.1145/1324185.1324188 Matono, A., Amagasa, T., Yoshikawa, M., & Uemura, S. (2005). A path-based relational RDF database. In Proceedings of the 16th Australasian Database Conference, (pp. 95-103). Matos, V. M., & Grasser, R. (2002). A simpler (and better) SQL approach to relational division. Journal of Information Systems Education, 13(2). Matth, T., & De Tr, G. (2009). Bipolar query satisfaction using satisfaction and dissatisfaction degrees: Bipolar satisfaction degrees. In S.Y. Shin & S. Ossowski (Eds.), Proceedings of the SAC Conference, (pp. 1699-1703). ACM. McBride, B. (2002). Jena: A Semantic Web toolkit. IEEE Internet Computing, 6(6), 5559. doi:10.1109/ MIC.2002.1067737 McCann, L. (2003). On making relational division comprehensible. In Proceedings of ASEE/IEEE Frontiers in Education Conference.
Melnik, S., Bernstein, P. A., Halevy, A. Y., & Rahm, E. (2005). Supporting executable mappings in model management. In F. zcan (Ed.), Proceedings of the 24th ACM SIGMOD International Conference on Management of Data, (pp. 167-178). Baltimore, Maryland, USA, June 14-16, ACM. Melton, J., & Eisenberg, A. (2001). SQL multimedia and application packages (SQL/MM). SIGMOD Record, 30(4), 97102. doi:10.1145/604264.604280 Meng, X. F., & Ma, Z. M. (2008). A context-sensitive approach for Web database query results ranking. Proceedings of IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, (pp. 836-839). Miller, R. J., Haas, L. M., & Hernandez, M. A. (2000). Schema mapping as query discovery. In: A.E. Abbadi, et al. (Eds.), VLDB 2000, Proceedings of 26th International Conference on Very Large Data Bases, (pp. 77-88). September 10-14, 2000, Cairo, Egypt. Morgan Kaufmann Milo, T., Abiteboul, S., Amann, B., Benjelloun, O., & Ngoc, F. D. (2005). Exchanging intensional XML data. ACM Transactions on Database Systems, 30(1), 140. doi:10.1145/1061318.1061319 Mokhtarian, F., & Mackworth, A. (1986). Scale-based description and recognition of planar curves and twodimensional objects. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8(1), 3443. doi:10.1109/ TPAMI.1986.4767750 Morin, E. (1999). Seven complex lessons in education for the future. United Nations Educational, Scientific and Cultural Organization. Retrieved October 18, 2009, from http://www.unesco.org/education/tlsf/TLSF/theme_a/ mod03/img/sevenlessons.pdf Neumann, T., & Weikum, G. (2008). RDF-3X: A RISCstyle engine for RDF. [PVLDB]. Proceedings of the VLDB Endownment, 1(1), 647659. Neumann, T., & Weikum, G. (2009). Scalable join processing on very large RDF graphs. In Proceedings of the ACM SIGMOD International Conference on Management of Data, (pp. 627-640).
371
Compilation of References
Ooi, B. C., Shu, Y., & Tan, K.-L. (2003). Relational data sharing in peer-based data management systems. SIGMOD Record, 32(3), 5964. doi:10.1145/945721.945734 Ortiz, M., Calvanese, D., & Eiter, T. (2008). Data complexity of query answering in expressive description logics via tableaux. Journal of Automated Reasoning, 41(1), 6198. doi:10.1007/s10817-008-9102-9 Ortiz, M., Calvanese, D., & Eiter, T. (2006). Data complexity of answering unions of conjunctive queries in shiq. Proceedings of the 2006 International Workshop on Description Logic. CEUR Electronic Workshop Proceedings. Pan, J. Z., Stamou, G. B., Stoilos, G., Taylor, S., & Thomas, E. (2008). Scalable querying services over fuzzy ontologies. Proceedings of the 17th International World Wide Web Conference (WWW2008), (pp. 575584). Pan, Z., & Heflin, J. (2003). DLDB: Extending relational databases to support Semantic Web queries. In Proceedings of the First International Workshop on Practical and Scalable Semantic Systems. Pan, Z., & Hein, J. (2003). DLDB: Extending relational databases to support SemanticWeb queries. In Proceedings of PSSS. Pan, Z., Zhang, X., & Heflin, J. (2008). DLDB2: A scalable multi-perspective Semantic Web repository. In Proceedings of the IEEE/WIC /ACM International Conference on Web Intelligence, (pp. 489-495). Pankowski, T., & Hunt, E. (2005). Data merging in life science data integration systems. In Klopotek, M. A., Wierzchon, S. T., & Trojanowski, K. (Eds.), Intelligent Information Systems. New trends in intelligent information processing and Web mining, advances in soft computing (pp. 279288). Berlin, Heidelberg: Springer. doi:10.1007/3-540-32392-9_29 Pankowski, T., Cybulka, J., & Meissner, A. (2007). XML schema mappings in the presence of key constraints and value dependencies. In M. Arenas & J. Hidders (Eds.), Proceedings of the 1st Workshop on Emerging Research Opportunities for Web Data Management (EROW 2007) Collocated with the 11th International Conference on Database Theory (ICDT 2007), (pp. 1-15). Barcelona, Spain, January 13, 2007.
Papadias, D., Tao, Y., Mouratidis, K., & Hui, C. K. (2005). Aggregate nearest neighbor queries in spatial databases. ACM Transactions on Database Systems, 30(2), 529576. doi:10.1145/1071610.1071616 Papadias, D., Tao, Y., Fu, G., & Seeger, B. (2003). An optimal and progressive algorithm for skyline queries. In SIGMOD 03: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, (pp. 467-478). New York: ACM Press. Peckham, J., & Maryanski, F. (1988). Semantic data models. [CSUR]. ACM Computing Surveys, 20(3), 153189. doi:10.1145/62061.62062 Pedersen, T. B., & Jensen, C. S. (2001). Multidimensional database technology. IEEE Computers, 34(12), 4046. Pei, J., Yuan, Y., Lin, X., Jin, W., Ester, M., & Wang, Q. L. W. (2006). Towards multidimensional subspace skyline analysis. ACM Transactions on Database Systems, 31(4), 13351381. doi:10.1145/1189769.1189774 Prudhommeaux, E., & Seaborne, A. (2008). SPARQL query language for RDF. W3C recommendation. Retrieved from http://www.w3.org/TR/rdf-sparql-query/ Prudhommeaux, E. (2005). Notes on adding SPARQL to MySQL. Retrieved from http://www.w3.org/2005/05/22SPARQL-MySQL/ Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1), 81106. doi:10.1007/BF00116251 Quinlan, J. R. (1993). C4.5: Programs for machine learning. San Francisco: Morgan Kaufmann Publishers Inc. Rantzau, R., & Mangold, C. (2006). Laws for rewriting queries containing division operators. In the Proceedings of IEEE ICDE. Raymond, J. W., Gardiner, E. J., & Willett, P. (2002). RASCAL: Calculation of graph similarity using maximum common edge subgraphs. The Computer Journal, 45(6), 631644. doi:10.1093/comjnl/45.6.631 Raymond, D. (1996). Partial order databases. Unpublished doctoral thesis, University of Waterloo, Canada
372
Compilation of References
Rosado, A., Ribeiro, R., Zadrony, S., & Kacprzyk, J. (2006). Flexible query languages for relational databases: An overview. In Bordogna, G., & Psaila, G. (Eds.), Flexible databases supporting imprecision and uncertainty (pp. 353). Berlin, Heidelberg: Springer Verlag. doi:10.1007/3540-33289-8_1 Roussopoulos, N., Kelley, S., & Vincent, F. (1995). Nearest neighbor queries. In ACM International Conference on Management of Data (SIGMOD), (pp. 7179). Roussos, Y., Stavrakas, Y., & Pavlaki, V. (2005). Towards a context-aware relational model. Proceedings of the International Workshop on Context Representation and Reasoning, Paris, (pp. 101-106). Rui, Y., Huang, T. S., & Merhotra, S. (1997). Contentbased image retrieval with relevance feedback in MARS. Proceedings of the IEEE International Conference on Image Processing, (pp. 815-818). Sakr, S. (2009). GraphREL: A decomposition-based and selectivity-aware relational framework for processing sub-graph queries. In Proceedings of the 14th International Conference on Database Systems for Advanced Applications, (pp. 123-137). Salton, G. (1989). Automatic text processing: The transformation, analysis and retrieval of information by computer. Addison Wesley. Santini, S., & Gupta, A. (2001). A wavelet data model for image databases. In IEEE International Conference on Multimedia and Expo (ICME), Tokyo, Japan. IEEE Computer Society. Savinov, A. (2008). Concepts and concept-oriented programming. Journal of Object Technology, 7(3), 91106. doi:10.5381/jot.2008.7.3.a2 Schema, X. M. L. (2009). W3C XML schema definition language (XSD) 1.1 part 2: Datatypes. Retrieved from www.w3.org/TR/xmlschema11-2 Schmidt, M., Hornung, T., Lausen, G., & Pinkel, C. (2009). SP2Bench: A SPARQL performance benchmark. In Proceedings of the 25th International Conference on Data Engineering, (pp. 222-233).
Seidl, T., & Kriegel, H.-P. (1998). Optimal multi-step k-nearest neighbor search. In ACM International Conference on Management of Data (SIGMOD), (pp. 154165). Seattle, Washington. Sharma, A. K., Goswami, A., & Gupta, D. K. (2008). Fuzzy Inclusion Dependencies in Fuzzy Databases. In Galindo, J. (Ed.), Handbook of Research on Fuzzy Information Processing in Databases (pp. 657683). Hershey, PA, USA: Information Science Reference. Sharma, A. K., Goswami, A., & Gupta, D. K. (2004). Fuzzy Inclusion Dependencies in Fuzzy Relational Databases, In Proceedings of International Conference on Information Technology: Coding and Computing (ITCC 2004), Las Vegas, USA, IEEE Computer Society Press, USA, Volum-1, pp 507-510. Shen, X., Tan, B., & Zhai, C. (2005). Context-sensitive information retrieval using implicit feedback. Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, (pp. 4350). Sheng, L., Ozsoyoglu, Z. M., & Ozsoyoglu, G. (1999). A graph query language and its query processing. In Proceedings of the 15th International Conference on Data Engineering, (pp. 572-581). Shipman, D. W. (1981). The functional data model and the data language DAPLEX. [TODS]. ACM Transactions on Database Systems, 6(1), 140173. doi:10.1145/319540.319561 Sibley, E. H., & Kerschberg, L. (1977). Data architecture and data model considerations. In Proceedings of the AFIPS Joint Computer Conferences, (pp. 85-96). Sidirourgos, L., Goncalves, R., Kersten, M. L., Nes, N., & Manegold, S. (2008). Column-store support for RDF data management: Not all swans are white. [PVLDB]. Proceedings of the VLDB Endownment, 1(2), 15531563. Smeulders, A. W. M., Worring, M., Santini, S., Gupta, A., & Jain, R. (2000). Content-based image retrieval at the end of the early years. [TPAMI]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(12), 13491380. doi:10.1109/34.895972
373
Compilation of References
Smith, J. M., & Smith, D. C. P. (1977). Database abstractions: Aggregation and generalization. [TODS]. ACM Transactions on Database Systems, 2(2), 105133. doi:10.1145/320544.320546 Stein, L. A. (1987). Delegation is inheritance. In Proceedings of OOPSLA87, ACM SIGPLAN Notices, 22(12), 138146. Stocker, M., Seaborne, A., Bernstein, A., Kiefer, C., & Reynolds, D. (2008). SPARQL basic graph pattern optimization using selectivity estimation. In Proceedings of the 17th International Conference on World Wide Web, (pp. 595-604). Stoilos, G., Simou, N., Stamou, G. B., & Kollias, S. D. (2006). Uncertainty and the semantic Web. IEEE Intelligent Systems, 21(5), 8487. doi:10.1109/MIS.2006.105 Stoilos, G., Stamou, G. B., Pan, J. Z., Tzouvaras, V., & Horrocks, I. (2007). Reasoning with very expressive fuzzy description logics. [JAIR]. Journal of Artificial Intelligence Research, 30, 273320. Stoilos, G., Straccia, U., Stamou, G. B., & Pan, J. Z. (2006). General concept inclusions in fuzzy description logics. Proceedings of the 17th European Conference on Artificial Intelligence (ECAI 2006), (pp. 457461). Stonebraker, M., Abadi, D. J., Batkin, A., Chen, X., Cherniack, M., Ferreira, M., et al. (2005). C-Store: A columnoriented DBMS. In Proceedings of the 31st International Conference on Very Large Data Bases, (pp. 553-564). Straccia, U. (2001). Reasoning within fuzzy description logics. [JAIR]. Journal of Artificial Intelligence Research, 14, 137166. Straccia, U. (2006). Answering vague queries in fuzzy dl-lite. Proceedings of the 11th International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems (IPMU-06), (pp. 22382245). Sugiyama, K., Hatano, K., & Yoshikawa, M. (2004). Adaptive Web search based on user profile constructed without any effort from users. Proceedings of the 13th International World Wide Web Conference, (pp. 975-990).
Turker, C. & Gertz, M. (2001). Semantic integrity support in SQL: 1999 and commercial object-relational database management systems. The VLDB Journal, 10(4), 241269. doi:10.1007/s007780100050 Tahani, V. (1977). A conceptual framework for fuzzy query processing: A step toward very intelligent database systems. Information Processing & Management, 13, 289303. doi:10.1016/0306-4573(77)90018-8 Takahashi, Y. (1995). A fuzzy query language for relational databases. In Bosc, P., & Kacprzyk, J. (Eds.), Fuzziness in database management systems (pp. 365384). Heidelberg, Germany: Physica-Verlag. Tan, K., Eng, P., & Ooi, B. (2001). Efficicient progressive skyline computation. In Proceedings of the 28th International Conference on Very Large Data Bases (VLDB), (pp. 301-310). Tasan, M., & Ozsoyoglu, Z. M. (2004). Improvements in distance-based indexing. In International Conference on Scientific and Statistical Database Management (SSDBM), (p. 161). Washington, DC: IEEE Computer Society. Tatarinov, I., & Ives, Z. G. (2003). The Piazza peer data management project. SIGMOD Record, 32(3), 4752. doi:10.1145/945721.945732 Tatarinov, I., & Halevy, A. Y. (2004). Efficient query reformulation in peer-data management systems. In G. Weikum, A.C. Knig & S. Deloch (Eds.), Proceedings of the ACM SIGMOD International Conference on Management of Data, (pp. 539-550). Paris, France, June 13-18, 2004. ACM. Teubner, J., Grust, T., Maneth, S., & Sakr, S. (2008). Dependable cardinality forecasts for XQuery. [PVLDB]. Proceedings of the VLDB Endowment, 1(1), 463477. Tian, Y., McEachin, R. C., Santos, C., States, D. J., & Patel, J. M. (2007). SAGA: A subgraph matching tool for biological graphs. Bioinformatics (Oxford, England), 23(2), 232239. doi:10.1093/bioinformatics/btl571 Timarn, R. (2001). Arquitecturas de integracin del proceso de descubrimiento de conocimiento con sistemas de gestin de bases de datos: Un estado del arte. [Universidad del Valle, Colombia.]. Ingeniera y Competitividad, 3(2), 4451.
374
Compilation of References
Tineo, L. (2006) A contribution to database flexible querying: Fuzzy quantified queries evaluation. Unpublished doctoral dissertation, Universidad Simn Bolvar, Caracas, Venezuela. Torres, R. S., Falco, A. X., Gonalves, M. A., Papa, J. P., Zhang, P., & Fan, W. (2009). A genetic programming framework for content-based image retrieval. Pattern Recognition, 42(2), 283292. doi:10.1016/j. patcog.2008.04.010 Traina-Jr, C., Traina, A. J. M., Faloutsos, C., & Seeger, B. (2002). Fast indexing and visualization of metric datasets using slim-trees. [TKDE]. IEEE Transactions on Knowledge and Data Engineering, 14(2), 244260. doi:10.1109/69.991715 Traina-Jr, C., Traina, A. J. M., Vieira, M. R., Arantes, A. S., & Faloutsos, C. (2006). Efficient processing of complex similarity queries in RDBMS through query rewriting. In International Conference on Information and Knowledge Management (CIKM), (pp.413). Arlington, VA. Tsichritzis, D. C., & Lochovsky, F. H. (1976). Hierarchical data-base management: A survey. [CSUR]. ACM Computing Surveys, 8(1), 105123. doi:10.1145/356662.356667 Tsotras, V. J., & Kumar, A. (1996). Temporal database bibliography update. SIGMOD Record, 25(1), 4151. Tweedie, L., Spence, R., Williams, D., & Bhogal, R. S. (1994). The attribute explorer. Proceedings of the International Conference on Human Factors in Computing Systems, (pp. 435436). Tzanetakis, G., & Cook, P. R. (2002). Musical genre classification of audio signals. IEEE Transactions on Speech and Audio Processing, 10(5), 293302. doi:10.1109/ TSA.2002.800560 Umano, M., & Fukami, S. (1994). Fuzzy relational algebra for possibility-distribution-fuzzy relational model of fuzzy data. Journal of Intelligent Information Systems, 3, 727. doi:10.1007/BF01014018 Vila, L. (1994). A survey on temporal reasoning in artificial intelligence. AI Communications, 7(1), 428.
Vila, M. A., Cubero, J.-C., Medina, J.-M., & Pons, O. (1997). Using OWA operator in flexible query processing. In Yager, R. R., & Kacprzyk, J. (Eds.), The ordered weighted averaging operators: Theory and applications (pp. 258274). Boston: Kluwer Academic Publishers. Vila, L., & Godo, L. (1995). Query answering in fuzzy temporal constraint networks. In the Proceedings of FUZZ-IEEE/IFES95, Yokohama, Japan. IEEE Press. Vlachou, A., & Vazirgiannis, M. (2007). Link-based ranking of skyline result sets. In Proceedings of the 3rd Multidisciplinary Workshop on Advances in Preference Handling (M-Pref). W3C Semantic Web discussion list. (2010). Kit releases 14 billion triples to the linked open data cloud. Retrieved from http://permalink.gmane.org/gmane.org.w3c. semantic-web/12889 Wan, C., & Liu, M. (2006). Content-based audio retrieval with relevance feedback. Pattern Recognition Letters, 27(2), 8592. doi:10.1016/j.patrec.2005.07.005 Wang, X. S., Bettini, C., Brodsky, A., & Jajodia, S. (1997). Logical design for temporal databases with multiple granularities. ACM Transactions on Database Systems, 22(2), 115170. doi:10.1145/249978.249979 Wang, C., Wang, W., Pei, J., Zhu, Y., & Shi, B. (2004). Scalable mining of large disk-based graph databases. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, (pp. 316-325). Washio, T., & Motoda, H. (2003). State of the art of graph-based data mining. SIGKDD Explorations, 5(1), 5968. doi:10.1145/959242.959249 Weiss, C., Karras, P., & Bernstein, A. (2008). Hexastore: Sextuple indexing for Semantic Web data management. [PVLDB]. Proceedings of the VLDB Endownment, 1(1), 10081019. Widom, J. (2009). Trio: A system for integrated management of data, uncertainty, and lineage. In Aggarwal, C. (Ed.), Managing and mining uncertain data (pp. 113148). Springer. doi:10.1007/978-0-387-09690-2_5 Wieringa, R., & de Jonge, W. (1995). Object identifiers, keys, and surrogates-object identifiers revisited. Theory and Practice of Object Systems, 1(2), 101114.
375
Compilation of References
Williams, D. W., Huan, J., & Wang, W. (2007). Graph database indexing using structured graph decomposition. In Proceedings of the 23rd International Conference on Data Engineering, (pp. 976-985). Wilson, D. R., & Martinez, T. R. (1997). Improved heterogeneous distance functions. Journal of Artificial Intelligence Research, 6, 134. Wu, L., Faloutsos, C., Sycara, K., & Payne, T. (2000). FALCON: Feedback adaptive loop for content-based retrieval. Proceedings of the 26th International Conference on Very Large Data Bases, (pp. 297-306). Wu, L., Faloutsos, C., Sycara, K., & Payne, T. R. (2000). Falcon: Feedback adaptive loop for content-based retrieval. In International Conference on Very Large Databases (VLDB), (pp. 297306). Cairo, Egypt. XAMPP. (2010). An apache distribution containing MySQL. Retrieved June 2010, from http://www.apachefriends.org/en/xampp.html XPath. (2006). XML path language 2.0. Retrieved from www.w3.org/TR/xpath20 XQuery. (2002). XQuery 1.0: An XML query language. W3C Working Draft. Retrieved from www.w3.org/TR/ xquery Xu, W., & Ozsoyoglu, Z. M. (2005). Rewriting XPath queries using materialized views. In K. Bhm, et al. (Eds.), Proceedigns of the 31st International Conference on Very Large Data Bases, (pp. 121-132). Trondheim, Norway, August 30 - September 2, 2005, ACM. Yager, R. R. (1982). A new approach to the summarization of data. Information Sciences, 28, 6986. doi:10.1016/0020-0255(82)90033-0 Yager, R. R. (1988). On ordered weighted averaging aggregation operators in multi-criteria decision making. IEEE Transactions on Systems, Man, and Cybernetics, 18, 183190. doi:10.1109/21.87068 Yager, R. R., & Kacprzyk, J. (1997). The ordered weighted averaging operators: Theory and applications. Boston: Kluwer. Yan, X., & Han, J. (2002). gSpan: Graph-based substructure pattern mining. In Proceedings of the IEEE International Conference on Data Mining, (pp. 721-724).
About the Contributors
Li Yan received her Ph.D. degree from Northeastern University, China. She is currently an Associate Professor of the School of Software at Northeastern University, China. Her research interests include database modeling, XML data management, and imprecise and uncertain data processing. She has published papers in several journals, such as Data and Knowledge Engineering, Information and Software Technology, and International Journal of Intelligent Systems, and in conferences such as WWW and CIKM.

Zongmin Ma (Z. M. Ma) received the Ph.D. degree from the City University of Hong Kong and is currently a Full Professor in the College of Information Science and Engineering at Northeastern University, China. His current research interests include intelligent database systems, knowledge representation and reasoning, the Semantic Web and XML, knowledge-based systems, and semantic image retrieval. He has published over 100 papers in international journals, conferences, and books in these areas since 1999. He has also authored and edited several scholarly books published by Springer-Verlag and IGI Global, respectively. He has served as a member of the international program committees of several international conferences and as a reviewer for several journals. Dr. Ma is a senior member of the IEEE.

***

Ana Aguilera received her PhD in Computer Systems from the University of Rennes I, France, in 2008 with a PhD thesis award très honorable, her MSc in Computer Science from Universidad Simón Bolívar in 1998, and her Engineering degree in Computer Systems from UCLA in 1994 with great praise. She is Associate Professor and staff member of the University of Carabobo (since 1997). Her distinctions include the Orden José Félix Ribas, third class (1993), membership in the research promotion program PPI (1998-2000), and Outstanding Teacher of UC (1997). She has been coordinator of the Research Group in Databases since 2004. In the areas of Fuzzy Databases and Medical Computer Science, she has published more than twenty articles in indexed journals and international refereed conferences and has supervised more than ten works leading to academic degrees. She is responsible for the project Creation and Application of Fuzzy Databases Management Systems, supported by FONACIT (since 2009).

Reda Alhajj received his B.Sc. degree in Computer Engineering in 1988 from Middle East Technical University, Ankara, Turkey. After he completed his B.Sc. with distinction from METU, he was offered a full scholarship to join the graduate program in Computer Engineering and Information Sciences at
Bilkent University in Ankara, where he received his M.Sc. and Ph.D. degrees in 1990 and 1993, respectively. Currently, he is Professor in the Department of Computer Science at the University of Calgary, Alberta, Canada. He has published over 275 papers in refereed international journals and conferences. He has served on the program committees of several international conferences, including IEEE ICDE, IEEE ICDM, IEEE IAT, and SIAM DM, and as program chair of IEEE IRI 2008, OSIWM 2008, SONAM 2009, and IEEE IRI 2009. He is editor-in-chief of the International Journal of Social Networks Analysis and Mining, associate editor of IEEE SMC - Part C, and a member of the editorial board of the Journal of Information Assurance and Security; he has been guest editor for a number of special issues and has edited a number of conference proceedings. He recently received the Grad Studies Outstanding Achievement in Supervision Award. Dr. Alhajj's primary work and research interests are in the areas of biocomputing and biodata analysis, data mining, multiagent systems, schema integration and re-engineering, social networks, and XML. He currently leads a research group of 10 PhD and 8 MSc candidates.

Ghazi Al-Naymat received his PhD degree in May 2009 from the School of Information Technologies at The University of Sydney, Australia. He is a Postdoctoral Fellow at the School of Computer Science and Engineering at The University of New South Wales, Australia. His research focuses on developing novel data mining techniques for different applications and datasets, such as graph, spatial, spatiotemporal, and time series databases. The Australian Research Council (ARC) and The University of New South Wales support Dr. Al-Naymat's current research. He has published a number of papers in excellent international journals and conferences.

Maria Camila N. Barioni received the B.Sc. degree in Computer Science from the Federal University of Uberlândia, Brazil, in 2000, and the M.Sc. and Ph.D. in Computer Science in 2002 and 2006 from the University of São Paulo at São Carlos, Brazil. She is currently an assistant professor with the Mathematics, Computing and Cognition Center of the Federal University of ABC and the undergraduate students officer for the Computer Science course. Her research interests include multimedia databases, multimedia data mining, indexing methods for multidimensional data, and information visualization.

Mounir Bechchi studied Computer Science at the ENSIAS engineering school (Rabat, Morocco). He then worked as a research engineer at INRIA-Rocquencourt until December 2005. He did his Ph.D. in Computer Science under the supervision of Prof. N. Mouaddib at the University of Nantes from January 2006 to September 2009. The subject of his study was Clustering-based Approximate Answering of Query Result in Large and Distributed Databases. He currently works as a database administrator at Bimedia, La Roche-sur-Yon, France. His areas of expertise include database design, administration, performance tuning and optimization, database recovery, and data warehousing architecture.

Gloria Bordogna is a senior researcher of the National Research Council (CNR) and contract professor at the Faculty of Engineering of Bergamo University, where she teaches IR and GIS. She graduated in Physics at the University of Milano. Her research interests concern soft computing techniques in the areas of information retrieval, flexible query languages, and Geographic Information Systems.
She was involved in several European projects, such as Ecourt, PENG, and IDE-Univers; edited three volumes and a special issue of JASIST in her research area; and has served on the program committees of several conferences, such as FUZZ-IEEE, ACM SIGIR, ECIR, FQAS, CIKM, IEEE-ACM WI/IAT, and WWW.
Francesco Bucci received the Laurea Specialistica in Computer Science Engineering from Bergamo University in 2007 and worked at the Italian National Research Council within the IDE-Univers project during 2008, for which he developed the discovery service of the IREA CNR SDI.

José Tomás Cadenas is a PhD student in Computer Science and Information Technology at Granada University, Spain (2009 - present). He received his MSc in Industrial Engineering from Carabobo University, Valencia, Venezuela, in 2010, his MSc in Computer Science from Simón Bolívar University, Caracas, Venezuela, in 2008, and his degree in Computer Engineering from Simón Bolívar University, Caracas, Venezuela, in 1985. He is an Assistant Professor (April 2008 - present) in the Computer and I.T. Department of Simón Bolívar University, Caracas, Venezuela, and was an Academic Assistant (April 2006 - April 2008) in the same department. From 1999 through 2006 he taught databases, software engineering, and programming at the undergraduate and postgraduate levels in the informatics departments of other higher education institutes in Venezuela (IUETLV - La Victoria and CULTCA - Los Teques). He was co-responsible for the project Creating and Applying Fuzzy Database Management Systems, supported by FONACIT (January 2009 - December 2009), and was an Associate Investigator (February 2008 - December 2008). He also has 20 years of experience in different public and private companies in Venezuela. His areas of interest include database systems, software engineering, and Information and Communication Technologies in Education.

Paola Carrara graduated in Physics at the University of Milan, Italy. She has been a researcher of the National Research Council (CNR) of Italy since 1986. Her scientific activity concerns designing and managing Information Systems. Her main interests are in (fuzzy) Information Retrieval; spatio-temporal archives of images; and architectures, technologies, and standards for geographic information on the Internet, in particular Spatial Data Infrastructures and the Sensor Web initiative. She was responsible for the Italian branch of the project IDE-Univers, which created the first European Spatial Data Infrastructure in the research field. With Gloria Bordogna, she promoted and organized the special session Management of Uncertain Information in the Digital Earth at IPMU 2010.

Jingwei Cheng received his BSc from Jilin University in 1995. He is currently a PhD candidate in the College of Information Science and Engineering at Northeastern University under the supervision of Prof. Z. M. Ma. He has published in the conference proceedings of DEXA, WI/IAT, ASWC, and FUZZ-IEEE. His current research interests include description logics, RDF, SPARQL, and the Semantic Web.

Guy De Tré is Professor at the Faculty of Engineering of Ghent University, where he leads the Database, Document and Content Management research group. His research interests include the handling of imperfect information, (fuzzy) database modeling, flexible querying, information retrieval, and content-based retrieval of multimedia. His publications comprise three books (one as author and two as co-editor), 17 chapters in various books, and 30 papers in international journals. He was guest editor of special issues in Fuzzy Sets and Systems, International Journal of Intelligent Systems, and Control and Cybernetics, and is a reviewer for several journals and conferences in the areas of databases and fuzzy information processing.
He also co-coordinates SCDMIR, the EUSFLAT Working Group on Soft Computing in Database Management and Information Retrieval.
Eric Draken is an undergraduate student in the Department of Computer Science at the University of Calgary with a focus on relational database development and software design. He is active in designing and optimizing Web applications in which any given Web page may assemble data from multiple resources asynchronously using XML-based protocols.

Shang Gao received his BSc in Computer Science from the University of Waterloo in 2006 and his MSc in Computer Science from the University of Calgary in 2009. He is currently a PhD candidate in the Department of Computer Science at the University of Calgary under the supervision of Prof. Reda Alhajj. He has received a number of prestigious awards and scholarships, including an iCore graduate studies scholarship, a Department of Computer Science research award, and a University of Calgary Queen Elizabeth II Scholarship. He has published over 10 papers in fully refereed conferences and journals. His research interests cover data mining, financial data analysis, social networks, bioinformatics, and XML.

Marlene Goncalves Da Silva received her Bachelor's degree in Computer Science in 1998 from the Central University of Venezuela, and her Master's in Computer Science in 2002 and PhD in Computer Science in 2009 from the Universidad Simón Bolívar, Caracas, Venezuela. She is an Associate Professor of the Computer Science department at the Universidad Simón Bolívar and was a Visiting Scholar at Youngstown State University (2009-2010). She has reported her research at DEXA and OTM. Her current research interest is preference-based queries. Her home page is http://www.ldc.usb.ve/~mgoncalves

A. Goswami obtained his Ph.D. degree from Jadavpur University, Kolkata, India, and joined IIT as a regular faculty member in 1992. He has published several papers at the national and international level and has guided many M.Tech. and Ph.D. theses. His research areas are theoretical computer science and operations research. He is a member of the editorial boards of the International Journal of Fuzzy Systems and Rough Systems and the International Journal of Mathematics in Operational Research (IJMOR). At present he also holds the post of Chairman of the institute's Hall Management Centre (HMC) as an additional responsibility.

D. K. Gupta obtained his Ph.D. degree from IIT Kharagpur, India, and joined the institute as a regular faculty member in 1985. He has published several papers at the national and international level and has guided many M.Tech. and Ph.D. theses. His research areas are computer science, constraint satisfaction problems, and numerical and interval analysis. He is a member of The National Academy of Sciences and a life member of ISTAM. He is Co-Principal Investigator of the projects (i) FIST Program, Department of Mathematics, and (ii) Multi-objective and Multi-level Decision Making Model with an Application to Environment and Regional Planning, both sponsored by DST, New Delhi.

Janusz Kacprzyk is Professor at the Systems Research Institute, Polish Academy of Sciences, and Honorary Professor at Yli Normal University, Shanxi, China. He is an Academician (Member of the Polish Academy of Sciences). His research interests include soft computing, fuzzy logic, decisions, database querying, and information retrieval. His publication record comprises 5 books, 30 volumes, and 300 papers. He is a Fellow of the IEEE and of IFSA. He received the 2005 IEEE CIS Fuzzy Pioneer Award and the Sixth Kaufmann Prize and Gold Medal for pioneering works on uncertainty. He is editor-in-chief of
three Springer book series, serves on the editorial boards of ca. 25 journals, and has been a member of the IPC of ca. 200 conferences.

Daniel S. Kaster received the B.Sc. degree in Computer Science from the University of Londrina, Brazil, in 1998 and the M.Sc. degree in Computer Science from the University of Campinas, Brazil, in 2001. He is currently a Lecturer with the Computer Science Department of the University of Londrina, Brazil, and a Ph.D. candidate in Computer Science at the University of São Paulo at São Carlos, Brazil. His research interests include searching complex data and multimedia databases.

Xiangfu Meng received his Ph.D. degree from Northeastern University, China. He is currently a Lecturer in the College of Electronic and Information Engineering at Liaoning Technical University, China. His research interests include flexible querying of Web databases, query result ranking and categorization, XML data management, and the Semantic Web. He teaches computer networks, database systems, and software architecture.

Noureddine Mouaddib received his Ph.D. degree and Habilitation in Computer Science from the University Poincaré (Nancy I) in 1989 and 1995, respectively. Since 1996, he has been a full Professor at the Polytechnic School of the University of Nantes in France. He is the founder and President of the International University of Rabat (www.uir.ma). He was the founder and head of the Atlas-GRIM team of the LINA Laboratory, pursuing research in databases, particularly in summarization of large databases, flexible querying, and fuzzy databases. He has authored and co-authored over 100 technical papers in international conferences and journals. He has been a member of several program committees and executive chair of international conferences.

Tadeusz Pankowski, Ph.D., D.Sc., is a professor of computer science at Poznan University of Technology, Poland. His research interests are in the foundations of database models, data languages, and information integration from heterogeneous databases (relational and XML). His research concerning semantic data integration in peer-to-peer systems has been supported by the Polish Ministry of Science and Higher Education. He is the author or coauthor of two recognized books, Foundations of Databases and Security of Data in Information Systems (both in Polish). He teaches courses on database systems, data mining, information integration, and software engineering. He serves as a member of the program committees of numerous international conferences, such as XSym (in conjunction with VLDB) and IEEE ICIIC. He is a member of the ACM, the IEEE, and PTI (Polish Society of Computer Science).

Monica Pepe received the master's degree in Geology in 1993 and the PhD in Physical Geography from the University of Pavia, Italy. She has been with the National Research Council (CNR) of Italy since 1994. Her research activity concerns the use of remote sensing for environmental studies and for pattern recognition tasks. She has worked on automatic interpretation methods of multisource data for thematic and environmental mapping, on the basis of the combined use of remote sensing image analysis and domain knowledge representation. In the last few years she has been interested in Spatial Data Infrastructures (SDI) and OpenGIS Web Services (OWS) issues, in order to make geographic information derived from her research activity retrievable, accessible, and exploitable in an interoperable framework, with particular focus on the INSPIRE Directive.
Guillaume Raschia studied Computer Science at the Polytech'Nantes engineering school, where he graduated (Engineer and Master) in 1998. He did his Ph.D. under the supervision of Prof. N. Mouaddib at the University of Nantes from 1999 to 2001, studying the database summarization paradigm with a fuzzy set theoretic background. He held a lecturer position in 2002 and obtained an assistant professor position in September 2003 in the Computer Science department of Polytech'Nantes. Since then, he has been affiliated with the LINA laboratory and is also a member of the INRIA-Atlas research group. Guillaume Raschia's main research topics are database indexing and data reduction techniques, flexible querying, and approximate answering systems.

Anna Rampini has worked at the Italian National Research Council since 1984. Her research interests are in image processing, pattern recognition, remote sensing image interpretation, and GIS. Her main research activities concern the definition and development of knowledge-based systems for the automatic interpretation of remote sensing images, aimed at producing thematic maps on the basis of the combined use of remote sensing data analysis and domain knowledge representation, and at supporting experts in the evaluation and prevention of environmental risks. She has been involved in several national and international projects and coordinated the European projects FIREMEN (Fire Risk Evaluation in Mediterranean Environment) and AWARE (A tool for monitoring and forecasting Available WAter REsource in mountain environment).

Humberto L. Razente received the B.Sc. degree in Computer Science from the Federal University of Mato Grosso, Brazil, in 2000, and the M.Sc. and Ph.D. in Computer Science in 2004 and 2009 from the University of São Paulo at São Carlos, Brazil. He is currently an assistant professor with the Mathematics, Computing and Cognition Center of the Federal University of ABC. His research interests include access methods for complex data, similarity searching, multimedia databases, and information visualization.

Sherif Sakr received his PhD degree in computer science from Konstanz University, Germany, in 2007. He received his BSc and MSc degrees in computer science from the Faculty of Computers and Information, Cairo University, Egypt, in 2000 and 2003, respectively. He is a senior research associate/lecturer in the Service Oriented Computing (SOC) research group at the School of Computer Science and Engineering (CSE), University of New South Wales (UNSW), Australia. Prior to taking up his current position, he worked as a postdoctoral research fellow at National ICT Australia (NICTA). His research interest is data and information management in general, particularly in the areas of indexing techniques, query processing and optimization techniques, graph data management, social networks, and data management in cloud computing. His work has been published in international journals and conferences such as Proceedings of the VLDB Endowment (PVLDB), Journal of Database Management (JDM), International Journal of Web Information Systems (IJWIS), Journal of Computer and System Sciences (JCSS), VLDB, SIGMOD, WWW, DASFAA, and DEXA. One of his papers was awarded the 2009 Outstanding Paper Excellence Award of the Emerald Literati Network.

Alexandr Savinov received his PhD from the Technical University of Moldova in 1993 and his MS degree from the Moscow Institute of Physics and Technology (MIPT) in 1989. He is currently a researcher at the SAP Research Center Dresden, Germany.
His primary research interests include data modeling, programming, and knowledge management methodologies, with applications to database systems, Grid and cloud computing, distributed systems, peer-to-peer technologies, the Semantic Web, and other areas. He is
the author of two novel methodologies in computer science: the concept-oriented model (COM) and concept-oriented programming (COP). His previous research interests include fuzzy expert systems and data mining. In particular, he developed a novel matrix-based approach to fuzzy knowledge representation and inference, implemented in the expert system shell EDIP. In data mining, he proposed an original algorithm for mining dependence rules and developed a data mining system with a component architecture, SPIN!

Awadhesh Kumar Sharma obtained his M.Tech. and Ph.D. degrees from IIT Kharagpur, India, and joined the college as a regular faculty member in January 1988. He has borne the responsibility of Head of the Department in addition to teaching at the UG and PG levels. He is FIE, FIETE, MISTE, MCSI, MIAENG, and Expert-DOEACC, Government of India. He has published several research papers at the national and international level. His research area is database systems. He is a member of the editorial boards and review committees of some international journals and conferences.

Leonid Tineo received his PhD in Computing from Universidad Simón Bolívar (USB), Venezuela, in 2006, his MSc in Computer Science from USB in 1992, and his Eng. in Computing from USB in 1990. He is Titular Professor (since 2007), Staff Member of USB (since 1991), Level I Accredited Researcher of the Venezuelan Researcher Promotion Program (since 2003), Outstanding Professor CONABA (2002), and recipient of the Outstanding Educational Work distinction of USB (1999). He was the Coordinator of the USB Database Research Group (2002-2008) and held the post of Information and Integration Coordinator of the Research and Development Deanship at USB (2002-2007). In the area of Fuzzy Databases, he has more than twenty articles in extenso in refereed proceedings, more than fifteen published brief notes, eight papers in indexed journals, two book chapters, and more than fifteen supervisions of works leading to academic degrees. Tineo is responsible for the project Creation and Application of Fuzzy Databases Management Systems, supported by the Venezuelan National Foundation for Science, Innovation and Technology (FONACIT), Grant G-2005000278 (2006-2008).

Agma J. M. Traina received the B.Sc. and M.Sc. degrees in Computer Science from the University of São Paulo, Brazil, in 1983 and 1987, respectively, and the Ph.D. in Computational Physics in 1991. She is currently a full Professor with the Computer Science Department of the University of São Paulo at São Carlos, Brazil, and the graduate students officer for the Computer Science program. Her research interests include image databases, image mining, indexing methods for multidimensional data, information visualization, and image processing for medical applications. She has supervised more than 20 graduate students.

Caetano Traina Jr. received the B.Sc. degree in Electrical Engineering, the M.Sc. in Computer Science, and the Ph.D. in Computational Physics from the University of São Paulo, Brazil, in 1978, 1982, and 1987, respectively. He is currently a full professor with the Computer Science Department of the University of São Paulo at São Carlos, Brazil. His research interests include indexing and access methods for complex data, data mining, similarity searching, query rewriting, and multimedia databases. He has supervised more than 30 graduate students.

María-Esther Vidal received her Bachelor's in Computer Engineering in 1987, Master's in Computer Science in 1991, and PhD in Computer Science in 2000 from the Universidad Simón Bolívar, Caracas, Venezuela.
She is a Full Professor of the Computer Science department at the Universidad Simón Bolívar
and has been an Assistant Researcher at the Institute for Advanced Computer Studies at the University of Maryland (UMIACS) (1995-1999), a Visiting Professor at UMIACS (2000-2009), and a Visiting Professor at the Universidad Politécnica de Catalunya (2003). She has reported her research in AAAI, IJCAI, SIGMOD, CoopIS, WIDM, WebDB, ICDE, DILS, DEXA, ALPWS, ACM SAC, CAiSE, OTM, EDBT, SIGMOD Record, and the TPLP Journal. Her current research interests are query rewriting and optimization in emerging infrastructures. Prof. Vidal is a member of SIGMOD. Her home page is http://www.ldc.usb.ve/~mvidal.

Sławomir Zadrożny is Associate Professor (Ph.D. 1994, D.Sc. 2006) at the Systems Research Institute, Polish Academy of Sciences. His current scientific interests include applications of fuzzy logic in database management systems, information retrieval, decision support, and data analysis. He is the author or co-author of approximately 150 journal and conference papers and has been involved in the design and implementation of several prototype software packages. He is also a teacher at the Warsaw School of Information Technology in Warsaw, Poland, and at the Technical University of Radom, Poland.
Index
Symbols
-Range Query 36
A
abstract roles 249, 250
agglomerative single-link approach 46
Algorithmic Solution 11
ARES system 34, 35
automated ranking 29, 37, 42, 58
AWARE project 145
B
Basic Distributed Skyline (BDS) 107
Basic Multi-Objective Retrieval (BMOR) 107
Berners-Lee, Tim 269, 283
bipolar queries 118, 120, 128, 130, 131, 134, 135
Block-Nested-Loops (BNL) 107
Bottom-Up Skycube algorithm (BUS) 108, 109, 112, 113, 115
Branch-and-Bound Skyline (BBS) 107
brushing histogram 4, 5
C
C4.5-Categorization 2, 3, 20, 21, 22
C4.5 decision tree constructing algorithm 4
Cartesian product, symmetric 35
categorization approach 1, 5, 20, 24
categorization case 3
Categorization Cost Experiment 20
category tree 1, 2, 3, 4, 5, 6, 7, 8, 13, 14, 15, 17, 21, 22, 24, 27
Category Tree Construction 8
Chakrabarti et al.'s System 45
Chaudhuri's system 40, 41
cIndex indexing structure 313
cluster-based retrieval 44
Cluster Hypothesis 59
clustering 4, 5, 7, 8, 10, 11, 12, 14, 22, 23, 24, 25
clustering approach 4
clustering-based techniques 29, 58, 59
clustering of query results 29, 48, 77
Clustering Problem 10
cluster queries 4, 11
CoBase 31, 32, 33
completion forests 249, 254, 255, 256, 257, 258, 259, 260, 264, 265
Complexity Analysis 19
concept assertions (ABox) 248, 250, 251, 253, 254, 264
concept axioms (TBox) 250, 251, 253, 264, 265
concept-oriented model (COM) 85, 86, 87, 89, 90, 92, 93, 94, 96, 97, 98, 99, 101
concept-oriented programming (COP) 86
concept-oriented query language (COQL) 85, 86, 89, 91, 92, 95, 98, 99
conjunctive queries (CQs) 247, 248, 253, 267, 268
content-based image retrieval (CBIR) 324, 357, 358
content-based operators approach 325, 326
D
Data Asphyxiation 29
database 323, 324, 330, 331, 335, 336, 337, 339, 342, 344, 348, 349, 354, 355
database integration 186, 187, 188, 189, 216
Database Integration Process 189
database management systems (DBMS) 289, 299, 300, 303, 323, 324, 325, 335, 336, 337, 339, 340, 344, 345, 346, 347, 348, 349, 350, 351, 352
database migration 189, 216
database query processing models 30
databases 270, 271, 273, 274, 275, 276, 277, 280, 282, 283, 284, 285
Database systems 28, 30, 49, 59
Data Clustering 8
data clusters 3, 12, 20, 27
data definition language (DDL) 325, 339
data integration 221, 222, 223, 227, 239, 242, 243, 244, 245, 246
data integration, materialized 222
data integration, virtual 222
data manipulation language (DML) 325, 339, 341
data manipulation operations 188
data modeling 85, 86, 87, 89, 90, 91, 92, 96, 97, 98, 99, 101
data provenance 222
Data Smog 29
dataspaces 270
data strings 249
DATA TUPLES 8
data warehouse 189
data warehouses 222
DB Clustering Techniques 45
DBXplorer 56
decision support systems 28, 59
decomposed storage model (DSM) 277
description logics (DL) 247, 248, 249, 252, 253, 266, 267
dimension 86, 87, 88, 92, 93, 94, 95, 96, 97, 98, 99, 101
directed acyclic graph (DAG) 97
Divide and Conquer (DC) 108, 115
divide keyword 288, 294, 295, 299, 301
dividend relation 289, 290, 303
divide operator 303
divide query 293, 295
DIVIDE system 289, 290, 293, 294, 295, 296, 297, 299, 300
division 287, 288, 289, 290, 291, 292, 299, 300, 301, 302, 303
divisor relation 289, 290, 303
DL knowledge base (KB) 248, 251, 252, 254, 255, 257, 258, 259, 260, 261, 264, 266
document type definition (DTD) 223
domain-specific identities 85, 86, 96, 98
domain-specific structure 89, 90
Duality principle 86
dynamic query slider 4
E
edges 305, 306, 311, 314, 317, 318
efficient implementation of Top-k Skyline (EITKS) 108
Efficient Top-k Query Processing 42
e-merchant 30
EmpDB 31, 32, 34, 52, 53, 54
empty-answer problem 30, 31
European Spatial Data Infrastructure (ESDI) 141
Executor 164, 170, 178
Exploitation activity 143
Exploration activity 143
exploration model 5, 7
exploratory data retrieval 28, 59, 61, 76
Explore-Select algorithm (ESA) 61, 66, 67, 69, 70, 71, 73, 74
Explore-Select-Rearrange Algorithm (ESRA) 29, 30, 61, 70, 71, 72, 73, 74, 76, 77
extensible markup language (XML) 221, 222, 223, 225, 226, 227, 228, 230, 238, 239, 240, 242, 243, 244, 245, 246, 270, 278, 284
extension operator 87, 88
EXTRACT operator 277
F
feature extraction algorithms 358
feature weighting 330
feature weighting technique 330
Feedback-Based Systems 42
flexible fuzzy queries 119, 121, 129
flexible fuzzy querying 118, 119, 124, 132
flexible querying 118, 119, 124, 128, 132, 134, 135
flexible querying languages 118
flexible query processing 50
flexible relations 188
functional dependencies 221, 222, 224, 230, 237, 239
functional roles 249
fuzziness 190
fuzzy comparison operators 51, 53
fuzzy concept assertions 250, 251
fuzzy concept axioms 250
fuzzy concrete domains 249
fuzzy conjunctive queries 247, 249, 266
fuzzy database framework 142, 155, 157
fuzzy database identifiers (DBids) 196, 197
fuzzy database instances 191
fuzzy databases 122, 136, 140, 141, 158, 187, 192, 197, 198, 205, 206, 208, 214
fuzzy data types 249
fuzzy data values 185, 190, 214
fuzzy extension 122, 124
fuzzy logic 118, 120, 121, 122, 123, 127, 128, 130, 134, 136, 138, 167, 184, 247, 248, 249, 250, 251, 252, 253, 254, 255, 258, 259, 260, 261, 262, 263, 264, 266, 267, 268, 288, 302
fuzzy modifiers 51
fuzzy multidatabases 185, 200, 214
fuzzy multidatabase system 190
fuzzy ontologies 247, 248, 253, 267
fuzzy OWL Lite 247, 248
Fuzzy Plan Tree 177, 180, 184
fuzzy probabilistic relational data model 190
fuzzy quantifiers 51, 53
fuzzy queries 50, 118, 119, 120, 121, 128, 129, 132, 134, 137, 162, 163, 164, 166, 168, 169, 170, 172, 173, 175, 178, 179, 180, 181, 182, 184
fuzzy query 185, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 212, 214, 216, 218
FuzzyQUERY (FQUERY) 124, 125, 126, 127, 131, 132, 133, 134, 136, 138
fuzzy querying 118, 119, 120, 124, 132, 133, 134, 135, 136, 138
fuzzy querying systems 124
Fuzzy Query Language (FQL) 122
fuzzy query processor 160
Fuzzy Query Tree 168, 169, 170, 171, 172, 175, 177, 178, 179, 184
fuzzy relation 190, 194, 195, 196, 201, 208, 209, 210, 218, 219
fuzzy relational databases 185, 187, 190, 192, 195, 208, 214
fuzzy relational data model 187, 192
fuzzy relations 190, 191, 193, 194, 195, 201, 208, 209, 210, 211, 214, 215, 218, 219
fuzzy set framework 141
fuzzy sets 50, 51, 66, 67, 119, 121, 124, 126, 127, 128, 131, 149, 152, 156, 162, 164, 165, 166, 167, 169, 172, 173, 178, 184, 185, 190, 200, 218, 219
fuzzy set theory 50, 51, 119, 248, 249
fuzzy systems 288
fuzzy temporal indications 148
fuzzy terms 50, 51, 53, 162, 164, 166, 168, 170, 174, 177, 179
fuzzy tuple source (FTS) 185, 191, 192, 193, 194, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 212, 213, 214, 218
fuzzy tuple source structured query language (FTS-SQL) 185, 199, 200, 201, 202, 203, 204, 205, 206, 212, 214, 218
fuzzy values 51, 52, 53, 148
G
GDIndex index structure 309, 310, 316
generated trees 5, 20
geodata 140, 141, 142, 143, 144, 145, 148, 157, 159
Geodata Discovery 143
geoprocessing applications 327
GIndex index structure 312, 316
global data model 186, 187, 188
Godfrey's System 33
GPTree index structure 313, 316
grammar 290, 292, 303
graph databases 304, 305, 306, 311, 312, 313, 314, 317, 319, 320, 321, 322
graph data structure 304, 305, 316
GraphGrep index structure 308, 316, 320
graph indexing techniques 304, 305, 307, 313, 316, 318, 319, 322
graph indexing techniques, mining-based 307, 308, 312
graph query languages 304, 316, 319, 321
graph query processing 304, 305, 318, 319
GraphREL index structure 311, 316, 321
graphs 304, 305, 306, 308, 309, 310, 311, 312, 313, 314, 315, 317, 319, 320, 321
graphs, directed-labeled 305
graphs, undirected-labeled 305
greedy algorithm 4, 25
Greedy categorization 2, 11, 20, 21, 22, 23
Greedy-Refine 11, 23
Greedy-Refine algorithm 23
GString index structure 310, 320, 321
H

Healy's division 289
Hexastore RDF storage scheme 274, 278, 285
Hierarchical address spaces 86
hierarchical data model (HDM) 96
high-dimensional Skyline spaces 102, 114
HITS algorithm 39

I

IDE-Univers 145
imperfect temporal metadata 140, 141, 142, 156
Imperfect Time Indications 149, 150
Inclusion principle 86, 87
index scan 164, 168, 169, 171, 178
Infoglut 29
Information Fatigue Syndrome 29
information overload 1, 2, 3, 29, 37, 59, 80, 82
Information Pollution 29
Information Retrieval (IR) 38, 44, 45, 81
Information Technology 141
INSPIRE 140, 141, 142, 143, 144, 145, 146, 151, 157, 158
instance integration process 189
integers 249
interactive approach 132, 136
inverse roles 249
inverted indexes 307, 308
IQE approach 34, 35, 36

J

Java programming language 287, 288, 289, 293, 300, 302
Jena1 schema 275
Jena2 schema 275
join query 293
JUnit tests 293, 299

K

keywords 287, 288, 293, 294, 295, 299, 301
k-median problem 10, 25
k-Nearest Neighbor Query 36
Koutrika et al.'s approach 49

L

Linear Elimination Sort for Skyline (LESS) 107
link-based ranking methods 38, 39
local database 187, 189, 191, 202, 203, 204, 205
local fuzzy databases 187, 192, 197, 205, 206, 214
M
many-answers problem 30, 37, 42, 61, 67
mappings 221, 222, 223, 228, 239, 242, 243, 244, 246
maXimal Generalizations that Fail (XGFs) 31, 32, 33
mediated schemas 222
metadata 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 155, 156, 157, 158, 159, 269, 270, 299
Minimal Generalizations that Succeed (MGSs) 31, 32, 33
multidatabase system (MDBS) 188, 189, 190, 199, 204, 215, 216, 217
Multidimensional Analysis 95
multidimensional index structures 36
Multidimensionality 86
multidimensional models 97
multimedia objects, audio 323, 329, 334, 339, 351, 352, 354, 356
multimedia objects, images 323, 324, 325, 326, 327, 328, 329, 332, 334, 335, 336, 337, 338, 339, 340, 341, 343, 344, 345, 346, 348, 349, 351, 353, 354, 355, 356, 357, 358, 359
multimedia objects (MO) 323, 324
multimedia objects, video 323, 326, 327, 329, 334, 338, 339, 354, 357, 358
MySQL database management system 299, 300, 301, 302, 303
N

nearest neighbor (NN) algorithm 36, 107
Normalized Skyline Frequency value (NSF) 112, 113

O

Online Analytical Processing (OLAP) model 98, 120
ontologies 247, 248, 253, 267
optimizer module 168, 169, 177
Order principle 86

P

parser 164, 289, 293, 295, 299, 300, 302, 303
partial distance weighting 330
partial distance weighting technique 327, 330
partially ordered set 87, 88, 97, 98
partition point 14
pattern-mining 312
PeerDB system 222
peers 221, 222, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246
peer to peer (P2P) environments 221, 222, 235, 238, 240, 241, 242, 243, 244, 245
physical fuzzy relational operators 160
Piazza system 222, 245
PIR system 40, 41
Planner Module 177
Planner-Optimizer 164
platform-specific references 86
platform-specific structure 89
possibility theory 118
primitive value 87, 89
probabilistic-based approaches 38
probabilistic model 38
problem-dependent 326
progressive algorithm 107, 116
property-class schema 277
proportional linguistic quantifiers 127
protoform 132, 133, 134

Q

qualitative approach 50
query-by-example 359
Query Clustering 9
query clusters 3, 11, 12, 21
query entailment 247, 248, 249, 251, 252, 253, 254, 255, 258, 264, 267
query graphs 304, 306, 309, 310, 312, 313, 314, 315, 317, 318
query history 1, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 19, 20, 22, 23
query model 37
query personalization 4
query propagation 221, 222, 238, 239, 241, 243, 244, 246
Query Prune 9
query pruning algorithm 9
query reformulation 246
Query Relaxation 31
query results 28, 29, 30, 37, 38, 39, 41, 42, 44, 45, 46, 47, 58, 59, 68, 69, 70, 73, 76, 77, 78
query tree 164, 168, 170, 177

R

RDF, horizontal tables stores for 270, 272, 278
RDF, property tables stores for 270, 279
RDF triple stores 271, 272, 273, 274, 275, 276, 282, 283
RDF, vertical tables stores for 270, 272, 273, 274, 278
reflexive 35
relational algebra (RA) 121, 122, 137, 139, 287, 288, 289, 290, 300, 302, 303
relational algebra tree 184
relational calculus 121, 122, 139
relational database 120, 135, 139
relational database management systems (RDBMS) 160, 161, 162, 163, 164, 166, 167, 168, 172, 180, 182, 183, 270, 271, 272, 277, 283, 303
relational databases 248
relational division 288, 289, 303
relational fuzzy data model 191
relational model 185, 188, 192, 194, 200, 208, 212, 214, 216, 218
relational multimedia databases 5
relational query processors 269, 270
relation schema 120
relevance 29, 37, 38, 39, 40, 42, 44, 47, 54, 56, 58, 59, 60, 80
resource description framework (RDF) 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285
resources 270, 275, 276, 284
result tuples 5
retrieved tuples 37
rewrite algorithm 299
rewriter 288, 289, 299
role hierarchy (RBox) 249, 250, 264, 265
role names, non-transitive 249
role names, transitive 249
R*-tree 36
R-Tree 36
S
SAINTETIQ model 29, 61, 62, 63, 64, 65, 66, 67, 69, 72, 73, 76, 81
SAINTETIQ System 61
scalability 29, 59, 60
schema constraints 221, 222, 242, 244
Schema Integration Process 189
schema mapping 227, 244, 245
schema patterns 225
schemas 221, 222, 223, 224, 225, 227, 228, 237, 239, 245, 246
SEAVEii system 31
SEAVE system 31, 32
selection predicates 331, 341
semantic gap 328, 334, 359
semantic metadata 270
Semantics 87
Semantic Search 102
semantic space 327
Semantic Web 102, 115, 247, 266, 267, 268, 269, 270, 275, 282, 283, 284, 285
sequential permutation 11
sequential scan 164, 168, 171, 178
SHIF(D) algorithm 247, 248, 249, 250, 251, 254, 255, 259, 261, 264, 266
similarity evaluation 326, 327, 358
similarity operators approach 326
similarity predicates 331, 334, 341, 342, 345
similarity queries 323, 324, 325, 327, 330, 331, 332, 333, 335, 336, 337, 338, 339, 341, 342, 344, 345, 346, 347, 348, 349, 351, 352, 353, 354, 356, 358
Similarity Search 34
similarity selections 331
similarity space 327, 328, 330, 339, 340, 354
simple query language (SQL) 323, 324, 325, 337, 338, 339, 341, 343, 344, 345, 347, 348, 349, 351, 352, 353, 355, 358
SixP2P system 221, 222, 244
Skycube 104, 106, 108, 117
Skycube computation 104, 108
skyline frequency metric (SFM) 105, 108, 109, 110, 111, 112
Skyline system 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 162
Sort-Filter-Skyline (SFS) 107
SPARQL query language 270, 271, 272, 273, 274, 278, 279, 283, 285
Spatial Data Infrastructures (SDI) 140, 141, 142, 143, 144, 145, 151, 157
SQL-based query 30
SQLf system 163, 164, 166, 167, 168, 170, 171, 172, 173, 180, 181, 182, 183, 184
SQL query language 5
SS-Tree 36
stores, database-based 271
stores, native 271
strict inclusion relation 87
structured data repositories 28
structured query language (SQL) 273, 275, 276, 279, 280, 283, 285, 287, 288, 289, 290, 292, 293, 295, 299, 300, 301, 302, 303
sub-elements 87, 91
subgraph isomorphism 305, 306, 309, 312, 313, 314, 315
subgraph query processing 305, 307
super-elements 87, 91
supergraph queries 306, 316, 319
supergraph query processing 305, 313, 322
systemic functional linguistics (SFL) 133, 134
T

tableaus 247, 249, 254, 259, 260, 261, 268
Tahani's Approach 52
Target Entry List (tlist) 178
TechnoStress 29, 82
terminal elements 223
terminal labels 223, 225
text categorization methods 4
text type 223
texture extractor 326, 345
Threshold Algorithm (TA) 43, 44
TKSI algorithm 102, 108, 109, 110, 111, 112, 113, 115
Top-Down Skycube algorithm (TDS) 108
top-k queries 37, 43
Top-k Skyline approach 102, 104, 108
Top-k Skyline problem 102, 104
top-k subset 29, 37, 58
traditional database management systems 118
transitive 35
transitive role axioms 249
trapezoidal fuzzy membership functions 64
tree-pattern formulas 221, 222, 223, 224, 225, 228, 230, 239, 240, 245, 246
tree patterns 223, 224, 225, 245
tree patterns, finite 223
TreePI index structure 312, 313, 316, 322
tuples 1, 2, 3, 4, 5, 6, 7, 8, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 24, 27, 87, 97, 120, 129, 130, 131
type constraint 88

U

uniform resource identifier (URI) 269, 270, 275
union query 293, 300
United Nations Conference on Environment and Development 141
user assistance 132
V
VAGUE approach 34, 35, 83
Vector space model 38
vertices 305, 306, 309, 310, 311, 314
W
Web database queries 1
Web ontology language (OWL) 247, 248, 266
World Wide Web Consortium (W3C) 269, 270, 271, 284, 285
X
XAMPP Apache distribution 300, 303
XML functional dependency (XFD) 230, 231, 232, 235, 237, 238, 239, 245, 246
XML schema mappings 222, 244
XQuery programs 221, 222, 239, 240, 242
X-Tree 36
Z
Zadeh, Lotfi 119, 120, 123, 124, 127, 128, 132, 133, 134, 137, 248, 268
Zql lexical parser 295, 299, 300, 302, 303