Fast parallel similarity search in multimedia databases
Most similarity search techniques map the data objects into some high-dimensional feature space. The similarity search then corresponds to a nearest-neighbor search in the feature space, which is computationally very intensive. In this paper, we present ...
Similarity-based queries for time series data
We study a set of linear transformations on the Fourier series representation of a sequence that can be used as the basis for similarity queries on time-series data. We show that our set of transformations is rich enough to formulate operations such as ...
Meaningful change detection in structured data
Detecting changes by comparing data snapshots is an important requirement for difference queries, active databases, and version and configuration management. In this paper we focus on detecting meaningful changes in hierarchically structured data, such ...
Improved query performance with variant indexes
The read-mostly environment of data warehousing makes it possible to use more complex indexes to speed up queries than in situations where concurrent updates are present. The current paper presents a short review of current indexing technology, ...
Highly concurrent cache consistency for indices in client-server database systems
In this paper, we present four approaches to providing highly concurrent B+-tree indices in the context of a data-shipping, client-server OODBMS architecture. The first performs all index operations at the server, while the other approaches support ...
Concurrency and recovery in generalized search trees
This paper presents general algorithms for concurrency control in tree-based access methods as well as a recovery protocol and a mechanism for ensuring repeatable read. The algorithms are developed in the context of the Generalized Search Tree (GiST) ...
Range queries in OLAP data cubes
A range query applies an aggregation operation over all selected cells of an OLAP data cube where the selection is specified by providing ranges of values for numeric dimensions. We present fast algorithms for range queries for two types of aggregation ...
Cubetree: organization of and bulk incremental updates on the data cube
The data cube is an aggregate operator which has been shown to be very powerful for On Line Analytical Processing (OLAP) in the context of data warehousing. It is, however, very expensive to compute, access, and maintain. In this paper we define the “...
Maintenance of data cubes and summary tables in a warehouse
Data warehouses contain large amounts of information, often collected from a variety of independent sources. Decision-support functions in a warehouse, such as on-line analytical processing (OLAP), involve hundreds of complex aggregate queries over ...
Database buffer size investigation for OLTP workloads
It is generally accepted that On-Line Transaction Processing (OLTP) systems benefit from large database memory buffers. As enterprise database systems become larger and more complex, hardware vendors are building increasingly large systems capable of ...
Database performance in the real world: TPC-D and SAP R/3
Traditionally, database systems have been evaluated in isolation on the basis of standardized benchmarks (e.g., Wisconsin, TPC-C, TPC-D). We argue that very often such a performance analysis does not reflect the actual use of the DBMSs in the “real ...
The BUCKY object-relational benchmark
- Michael J. Carey
- David J. DeWitt
- Jeffrey F. Naughton
- Mohammad Asgarian
- Paul Brown
- Johannes E. Gehrke
- Dhaval N. Shah
According to various trade journals and corporate marketing machines, we are now on the verge of a revolution—the object-relational database revolution. Since we believe that no one should face a revolution without appropriate armaments, this paper ...
The STRIP rule system for efficiently maintaining derived data
Derived data is maintained in a database system to correlate and summarize base data which records real world facts. As base data changes, derived data needs to be recomputed. This is often implemented by writing active rules that are triggered by ...
An array-based algorithm for simultaneous multidimensional aggregates
Computing multiple related group-bys and aggregates is one of the core operations of On-Line Analytical Processing (OLAP) applications. Recently, Gray et al. [GBLP95] proposed the “Cube” operator, which computes group-by aggregations over all possible ...
Online aggregation
Aggregation in traditional database systems is performed in batch mode: a query is submitted, the system processes a large volume of data over a long period of time, and, eventually, the final answer is returned. This archaic approach is frustrating to ...
Balancing push and pull for data broadcast
The increasing ability to interconnect computers through internetworking, wireless networks, high-bandwidth satellite, and cable networks has spawned a new class of information-centered applications based on data dissemination. These applications ...
InfoSleuth: agent-based semantic integration of information in open and dynamic environments
- R. J. Bayardo
- W. Bohrer
- R. Brice
- A. Cichocki
- J. Fowler
- A. Helal
- V. Kashyap
- T. Ksiezyk
- G. Martin
- M. Nodine
- M. Rashid
- M. Rusinkiewicz
- R. Shea
- C. Unnikrishnan
- A. Unruh
- D. Woelk
The goal of the InfoSleuth project at MCC is to exploit and synthesize new technologies into a unified system that retrieves and processes information in an ever-changing network of information sources. InfoSleuth has its roots in the Carnot project at ...
STARTS: Stanford proposal for Internet meta-searching
Document sources are available everywhere, both within the internal networks of organizations and on the Internet. Even individual organizations use search engines from different vendors to index their internal document collections. These search engines ...
On saying “Enough already!” in SQL
In this paper, we study a simple SQL extension that enables query writers to explicitly limit the cardinality of a query result. We examine its impact on the query optimization and run-time execution components of a relational DBMS, presenting two ...
A framework for implementing hypothetical queries
Previous approaches to supporting hypothetical queries have been “eager”: some representation of the hypothetical state (or the corresponding delta) is materialized, and query evaluation is filtered through that representation. This paper develops a ...
High-performance sorting on networks of workstations
- Andrea C. Arpaci-Dusseau
- Remzi H. Arpaci-Dusseau
- David E. Culler
- Joseph M. Hellerstein
- David A. Patterson
We report the performance of NOW-Sort, a collection of sorting implementations on a Network of Workstations (NOW). We find that parallel sorting on a NOW is competitive with sorting on the large-scale SMPs that have traditionally held the performance ...
Dynamic itemset counting and implication rules for market basket data
We consider the problem of analyzing market-basket data and present several important contributions. First, we present a new algorithm for finding large itemsets which uses fewer passes over the data than classic algorithms, and yet uses fewer candidate ...
Beyond market baskets: generalizing association rules to correlations
One of the most well-studied problems in data mining is mining for association rules in market basket data. Association rules, whose significance is measured via support and confidence, are intended to identify rules of the type, “A customer purchasing ...
Scalable parallel data mining for association rules
One of the important problems in data mining is discovering association rules from databases of transactions where each transaction consists of a set of items. The most time consuming operation in this discovery process is the computation of the ...
Efficiently supporting ad hoc queries in large datasets of time sequences
Ad hoc querying is difficult on very large datasets, since it is usually not possible to have the entire dataset on disk. While compression can be used to decrease the size of the dataset, compressed data is notoriously difficult to index or access. In ...
DEVise: integrated querying and visual exploration of large datasets
DEVise is a data exploration system that allows users to easily develop, browse, and share visual presentations of large tabular datasets (possibly containing or referencing multimedia objects) from several sources. The DEVise framework is being ...
Partitioned garbage collection of a large object store
We present new techniques for efficient garbage collection in a large persistent object store. The store is divided into partitions that are collected independently using information about inter-partition references. This information is maintained on ...
Size separation spatial join
We introduce a new algorithm to compute the spatial join of two or more spatial data sets, when indexes are not available on them. Size Separation Spatial Join (S3J) imposes a hierarchical decomposition of the data space and, in contrast with previous ...
Building a scaleable geo-spatial DBMS: technology, implementation, and evaluation
- Jignesh Patel
- JieBing Yu
- Navin Kabra
- Kristin Tufte
- Biswadeep Nag
- Josef Burger
- Nancy Hall
- Karthikeyan Ramasamy
- Roger Lueder
- Curt Ellmann
- Jim Kupsch
- Shelly Guo
- Johan Larson
- David DeWitt
- Jeffrey Naughton
This paper presents a number of new techniques for parallelizing geo-spatial database systems and discusses their implementation in the Paradise object-relational database system. The effectiveness of these techniques is demonstrated using a variety of ...
A toolkit for negotiation support interfaces to multi-dimensional data
CoDecide is an experimental user interface toolkit that offers an extension to spreadsheet concepts specifically geared towards support for cooperative analysis of the kinds of multi-dimensional data encountered in data warehousing. It is distinguished ...