XML keyword search is a user-friendly way to query XML data using only keywords. In XML keyword search, to achieve high precision without sacrificing recall, it is important to remove spurious results not intended by the user. Efforts to eliminate spurious results have enjoyed some success using the concept of the LCA or its variants, SLCA and MLCA. However, existing methods could still find many spurious results. The fundamental cause of spurious results is that existing methods try to eliminate them locally, without a global examination of all the query results; accordingly, some spurious results are not consistently eliminated. In this paper, we propose a novel keyword search method that removes spurious results consistently by exploiting the new concept of structural consistency. We define structural consistency as a property that is preserved if no query result has an ancestor-descendant relationship at the schema level with any other query result. A naive solution for obtaining structural consistency would be to compute all the LCAs (or variants) and then remove spurious results according to structural consistency. Obviously, this approach would always be slower than existing LCA-based ones. To speed up structural consistency checking, we must be able to examine the query results at the schema level without generating all the LCAs. However, this is a challenging problem since the schema-level query results do not homomorphically map to the instance-level query results, causing serious false dismissal. We present a comprehensive and practical solution to this problem and formally prove that it preserves structural consistency at the schema level without incurring false dismissal. We also propose a relevance-feedback-based solution for the case where our method has low recall, which occurs when it is not the user's intention to find more specific results. This solution has been prototyped in Odysseus, a full-fledged object-relational DBMS developed at KAIST. Experimental results using real and synthetic data sets show that, compared with the state-of-the-art methods, our solution significantly (1) improves precision while providing comparable recall for most queries and (2) enhances query performance by removing spurious results early.
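As a concrete illustration of structural consistency, the following minimal Python sketch filters a candidate result set so that no surviving result's schema-level label path is a proper ancestor of another's. The `Result` representation and all names are our assumptions for illustration; the paper's actual method checks consistency at the schema level without first materializing all candidates, which this naive filter does not capture.

```python
from collections import namedtuple

# Hypothetical representation: each query result is a matched node plus its
# schema-level label path (the root-to-node sequence of element names).
Result = namedtuple("Result", ["node_id", "label_path"])

def is_schema_ancestor(p, q):
    # p is a proper ancestor of q at the schema level iff p is a proper prefix of q.
    return len(p) < len(q) and q[:len(p)] == p

def enforce_structural_consistency(results):
    """Drop every result whose label path is a proper schema-level ancestor of
    some other result's label path; the survivors contain no
    ancestor-descendant pair at the schema level."""
    paths = {r.label_path for r in results}
    return [r for r in results
            if not any(is_schema_ancestor(r.label_path, q) for q in paths)]

# Example: the root-level hit is a schema-level ancestor of the title-level
# hit, so it is filtered out as spurious.
hits = [Result(7, ("bib", "article", "title")), Result(3, ("bib",))]
print(enforce_structural_consistency(hits))  # keeps only node_id 7
```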
Prefetching is an effective method for minimizing the number of round-trips between the client and the server in database management systems. In this paper, we propose the new notions of type-level access locality and the type-level access pattern. We also formally define ...
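The abstract is truncated, but the idea of type-level access patterns can be pictured as follows: rather than tracking which individual objects a client fetches, track which reference attributes of each object *type* are habitually traversed, and prefetch along those attributes in the same round-trip. The sketch below is our own hedged illustration (the class and method names are hypothetical), not the paper's formal definitions.

```python
from collections import defaultdict

class TypeLevelPrefetcher:
    """Illustration only: learn, per object type, which reference attributes
    tend to be traversed after an access, and predict them for prefetching."""
    def __init__(self, threshold=0.8):
        self.accesses = defaultdict(int)    # type name -> number of accesses
        self.traversals = defaultdict(int)  # (type name, attribute) -> count
        self.threshold = threshold

    def record_access(self, type_name):
        self.accesses[type_name] += 1

    def record_traversal(self, type_name, attribute):
        self.traversals[(type_name, attribute)] += 1

    def attributes_to_prefetch(self, type_name):
        # Attributes traversed in at least `threshold` of this type's accesses
        # are predicted to be traversed again, so the server could ship the
        # referenced objects in the same round-trip.
        n = self.accesses[type_name]
        if n == 0:
            return []
        return [attr for (t, attr), c in self.traversals.items()
                if t == type_name and c / n >= self.threshold]
```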
Conventional object-relational database management system (ORDBMS) vendors provide extension mechanisms for adding user-defined types and functions to their own DBMSs. Here, the extension mechanisms are implemented using a high-level (typically, SQL-level) interface. We call this mechanism loose-coupling. The advantage of loose-coupling is that it is easy to implement. However, it is not preferable for implementing new data types and operations in large databases when high performance is required. We have earlier proposed the tight-coupling architecture (Whang et al. 2002, 2005) to satisfy this requirement. In tight-coupling, new data types and operations are integrated into the core of the DBMS engine through its extensible type layer. Thus, they are supported in a consistent manner with high performance. This tight-coupling architecture is being used to incorporate information retrieval features and spatial database features into the Odysseus ORDBMS, which has been under development at KAIST/AITrc for 19 years. In this paper, we introduce the tightly-coupled spatial database features of Odysseus/OpenGIS. By taking advantage of tight-coupling, Odysseus/OpenGIS provides excellent performance in processing spatial queries as well as flexible concurrency control and recovery on spatial data. We demonstrate this performance through extensive experiments. Finally, we present sample applications of a geographical information system (GIS) implemented using Odysseus/OpenGIS.
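To make the loose- versus tight-coupling contrast concrete, here is a toy Python sketch: the loosely-coupled path crosses a high-level interface boundary (modeled here as serialization) on every call, while the tightly-coupled path registers the operation in an engine-level type layer and invokes it directly on in-engine representations. This is purely illustrative and assumes nothing about Odysseus's actual extensible type layer.

```python
import json, math

# Loose coupling: the engine sees only an opaque high-level function and pays
# an interface-boundary cost (modeled as (de)serialization) on every call.
def loose_udf_call(udf, serialized_args):
    args = json.loads(serialized_args)   # cross the boundary going in
    return json.dumps(udf(*args))        # cross the boundary going out

# Tight coupling: the type and its operations live in an engine-level type
# layer, so the executor calls them directly, with no boundary crossing.
TYPE_LAYER = {}

def register_type(name, operations):
    TYPE_LAYER[name] = operations

register_type("POINT", {
    "distance": lambda p, q: math.hypot(p[0] - q[0], p[1] - q[1]),
})

def tight_call(type_name, operation, *args):
    return TYPE_LAYER[type_name][operation](*args)

print(loose_udf_call(lambda x, y: x + y, "[1, 2]"))      # "3"
print(tight_call("POINT", "distance", (0, 0), (3, 4)))   # 5.0
```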
Recently, access control on XML data has become an important research topic. Previous research on access control mechanisms for XML data has focused on increasing the efficiency of access control itself, but has not addressed the issue of integrating access control with query processing. In this paper, we propose an efficient access control mechanism tightly integrated with query processing for XML databases. We present the novel concept of the dynamic predicate (DP), which represents a dynamically constructed condition during query execution. A DP is derived from instance-level authorizations and constrains the accessibility of elements. The DP allows us to effectively integrate authorization checking into the query plan so that unauthorized elements are excluded in the process of query execution. Experimental results show that the proposed access control mechanism significantly reduces query processing time compared with state-of-the-art access control mechanisms. We conclude that the DP is highly effective in efficiently checking instance-level authorizations in databases with hierarchical structures.
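A dynamic predicate can be pictured as a filter built at execution time from instance-level authorizations and pushed into the scan, so unauthorized elements never reach later operators. The Python sketch below is a hedged illustration under that reading; `make_dynamic_predicate` and the id-set authorization model are our assumptions, not the paper's actual mechanism.

```python
def make_dynamic_predicate(authorized_ids):
    """Build a DP-like filter at execution time from instance-level
    authorizations (here simply a set of accessible element ids)."""
    allowed = frozenset(authorized_ids)
    return lambda element: element["id"] in allowed

def scan_with_dp(elements, query_predicate, dp):
    # Authorization checking is fused into the scan, so unauthorized
    # elements are excluded during query execution rather than afterwards.
    for e in elements:
        if dp(e) and query_predicate(e):
            yield e

elements = [{"id": 1, "tag": "salary"}, {"id": 2, "tag": "salary"}]
dp = make_dynamic_predicate({2})
print(list(scan_with_dp(elements, lambda e: e["tag"] == "salary", dp)))
# -> only the element with id 2 survives the authorized scan
```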
Since recent applications such as XML applications, Geographical Information Systems (GIS), and CAD/CAM systems require highly efficient data management, they are built on Object-Relational DBMSs (ORDBMSs). These applications are called navigational applications: they navigate the composite objects connected via reference and collection attributes in the ORDBMS. When a navigational application accesses an object, it first checks whether ...
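The abstract breaks off mid-sentence, so the following Python sketch only illustrates the general navigational pattern it describes: following reference attributes from object to object, with a client-side check (plausibly against a local object cache, though that completion of the sentence is our assumption) before incurring a server round-trip. All class and method names are hypothetical.

```python
class InMemoryServer:
    """Stand-in for the DBMS server side of the sketch."""
    def __init__(self, objects):
        self.objects = objects           # oid -> object (a dict of attributes)

    def fetch(self, oid):
        return self.objects[oid]

class NavigationalClient:
    """Toy navigational client: follows reference attributes between
    composite objects, consulting a client-side cache first (assumed)."""
    def __init__(self, server):
        self.server = server
        self.cache = {}                  # oid -> object

    def get(self, oid):
        if oid not in self.cache:        # "...it first checks whether" the
            self.cache[oid] = self.server.fetch(oid)   # object is cached
        return self.cache[oid]

    def navigate(self, obj, reference_attribute):
        # Follow a reference attribute to the referenced composite object.
        return self.get(obj[reference_attribute])

server = InMemoryServer({1: {"name": "engine", "part_of": 2},
                         2: {"name": "car"}})
client = NavigationalClient(server)
engine = client.get(1)
print(client.navigate(engine, "part_of"))   # fetches oid 2, then caches it
```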
Ranked subsequence matching finds the top-k subsequences most similar to a given query sequence from data sequences. Recently, Han et al. [12] proposed a solution (referred to here as HLMJ) to this problem using the concept of the minimum distance matching window pair (MDMWP) and a global priority queue. By using the MDMWP, HLMJ can prune many unnecessary accesses to data subsequences using a lower-bound distance. However, we notice that HLMJ may incur serious performance overhead for important types of queries. In this paper, we propose a novel systematic framework to solve this problem by viewing ranked subsequence matching as a ranked union. Specifically, we propose the notion of the matching subsequence equivalence class (MSEQ) and a novel lower bound called the MSEQ-distance. To completely eliminate the performance problem of HLMJ, we also propose a cost-aware density-based scheduling technique that considers both the density and the cost of the priority queue. Extensive experimental results with many real datasets show that the proposed algorithm outperforms HLMJ and the adapted PSM [22], a state-of-the-art index-based merge algorithm supporting non-monotonic distance functions, by up to two and three orders of magnitude, respectively.
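The "ranked union" view can be sketched as a global priority queue that merges candidate streams ordered by lower-bound distance and emits a candidate once its exact distance is the smallest key in the queue. The Python below is a generic top-k ranked-union skeleton under that framing; it is not the paper's MSEQ/MSEQ-distance algorithm or its scheduling technique, and all names are ours.

```python
import heapq

def ranked_union(sources, k, exact_distance):
    """Top-k ranked union. Each source is an iterator yielding
    (lower_bound, candidate) in nondecreasing lower-bound order. A popped
    lower-bound entry is refined to its exact distance and re-queued; a
    popped exact entry is final, since nothing remaining can beat it."""
    heap, tie = [], 0  # entries: (key, needs_refine, tie_breaker, cand, src)
    for src in sources:
        first = next(src, None)
        if first is not None:
            heap.append((first[0], 1, tie, first[1], src)); tie += 1
    heapq.heapify(heap)
    out = []
    while heap and len(out) < k:
        key, needs_refine, _, cand, src = heapq.heappop(heap)
        if needs_refine:
            nxt = next(src, None)            # keep this source's stream flowing
            if nxt is not None:
                heapq.heappush(heap, (nxt[0], 1, tie, nxt[1], src)); tie += 1
            d = exact_distance(cand)         # refine lower bound to exact distance
            heapq.heappush(heap, (d, 0, tie, cand, None)); tie += 1
        else:
            out.append((key, cand))          # exact and minimal: emit
    return out

src1 = iter([(0.5, "A"), (2.0, "B")])
src2 = iter([(1.0, "C")])
dist = {"A": 0.9, "B": 2.5, "C": 1.0}
print(ranked_union([src1, src2], 2, exact_distance=dist.get))
# -> [(0.9, 'A'), (1.0, 'C')]
```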
Intelligent pipeline inspection gauges (PIGs) are inspection vehicles that move along within a gas (or oil) pipeline and acquire signals from their surrounding rings of sensors. By analyzing the signals captured by intelligent PIGs, we can detect pipeline defects, such as holes, curvatures, and other potential causes of gas explosions. We note that the size of the data collected by a PIG is usually in the GB range, so the analysis software must handle data at this scale and provide various kinds of visualization tools so that analysts can easily detect any defects in the pipeline. In this paper, we propose a scalable pipeline data processing framework using database and visualization techniques. First, we analyze the requirements for our system and present its overall architecture. Second, we describe several important subsystems: a scalable pipeline data store, integrated multiple visualizations, and an automatic summary report generator. Third, through experiments with GB-range real data, we show that our system scales to handle such large pipeline data. Experimental results show that our system outperforms a relational database management system (RDBMS) based repository by up to 31.9 times.
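One way to picture the scalable pipeline data store is as chunked storage of the ring-sensor signal matrix, so a visualization window touches only the blocks it overlaps rather than the whole GB-range run. The NumPy sketch below illustrates that idea under our own assumptions; it is not the system's actual storage design.

```python
import numpy as np

class PipelineSignalStore:
    """Toy chunked store: rows are sampling positions along the pipeline,
    columns are the ring of sensors. Assumes contiguous chunks; window
    queries load only the chunks they overlap (illustrative only)."""
    def __init__(self, chunk_rows=100_000):
        self.chunk_rows = chunk_rows
        self.chunks = {}                      # chunk index -> 2-D ndarray

    def append_chunk(self, chunk_index, signals):
        self.chunks[chunk_index] = np.asarray(signals)

    def window(self, start_row, end_row):
        # Concatenate only the chunks overlapping [start_row, end_row).
        first = start_row // self.chunk_rows
        last = (end_row - 1) // self.chunk_rows
        parts = [self.chunks[i] for i in range(first, last + 1) if i in self.chunks]
        if not parts:
            return np.empty((0, 0))
        data = np.concatenate(parts)
        offset = first * self.chunk_rows
        return data[start_row - offset : end_row - offset]

store = PipelineSignalStore(chunk_rows=4)
store.append_chunk(0, np.zeros((4, 8)))      # 4 positions x 8 ring sensors
store.append_chunk(1, np.ones((4, 8)))
print(store.window(2, 6).shape)              # (4, 8): spans both chunks
```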