    Philip Yu

    Abstract A parallel hash join algorithm based on the concept of hierarchical hashing is proposed to address the problem of data skew. The proposed algorithm adds an extra scheduling phase to the usual hash and join phases. During the scheduling phase, a heuristic optimization algorithm, using the output of the hash phase, attempts to balance the load across the multiple processors in the subsequent join phase.
    Abstract Analyzing the executions of a buggy software program is essentially a data mining process. Although many interesting methods have been developed to trace crashing bugs (such as memory violation and core dumps), it is still difficult to analyze noncrashing bugs (such as logical errors). In this paper, we develop a novel method to classify the structured traces of program executions using software behavior graphs.
    Abstract Due to rapid growth of the Internet technology and new scientific/technological advances, the number of applications that model data as graphs increases, because graphs have high expressive power to model complicated structures. The dominance of graphs in real-world applications asks for new graph data management so that users can access graph data effectively and efficiently. In this paper, we study a graph pattern matching problem over a large data graph.
    Abstract There are numerous applications that need to deal with a large graph and need to query reachability between nodes in the graph. A 2-hop cover can compactly represent the whole edge transitive closure of a graph in O(|V|·|E|^{1/2}) space, and can be used to answer reachability queries efficiently. However, it is challenging to compute a 2-hop cover. The existing approaches suffer from either large resource consumption or low compression rate.
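As a rough illustration of how a 2-hop cover answers a reachability query, the sketch below uses hand-built labels on a toy graph. The graph, the label sets, and the function names are all assumptions made for this example; it does not show how the paper computes a cover, only the standard query rule that u reaches v when the out-label of u and the in-label of v share a center.

```python
from collections import deque

# Toy directed graph (hypothetical example, not taken from the paper).
EDGES = {"a": ["c"], "b": ["c"], "c": ["d", "e"], "d": [], "e": []}

# Hand-built 2-hop labels with "c" as the only shared center.
# Each vertex also lists itself so that trivial and direct cases work.
L_OUT = {"a": {"a", "c"}, "b": {"b", "c"}, "c": {"c"}, "d": {"d"}, "e": {"e"}}
L_IN = {"a": {"a"}, "b": {"b"}, "c": {"c"}, "d": {"c", "d"}, "e": {"c", "e"}}


def reaches(u, v):
    """u reaches v iff some center lies in both L_OUT[u] and L_IN[v]."""
    return not L_OUT[u].isdisjoint(L_IN[v])


def reaches_bfs(u, v):
    """Reference answer by breadth-first search, used to check the labels."""
    seen, queue = {u}, deque([u])
    while queue:
        node = queue.popleft()
        if node == v:
            return True
        for nxt in EDGES[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False
```

The query touches only the two label sets, never the graph itself, which is why the size of the labels is the quantity worth minimizing.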
    Abstract In this paper, we devise efficient algorithms for mining association rules with adjustable accuracy. It is noted that several applications require mining the transaction data to capture the customer behavior frequently. In those applications, the efficiency of data mining could be a more important factor than the requirement for complete accuracy of the mining results. Allowing imprecise results can significantly improve the data mining efficiency.
    Abstract Mining data streams of changing class distributions is important for real-time business decision support. The stream classifier must evolve to reflect the current class distribution. This poses a serious challenge. On the one hand, relying on historical data may increase the chances of learning obsolete models. On the other hand, learning only from the latest data may lead to biased classifiers, as the latest data is often an unrepresentative sample of the current class distribution.
    Abstract In recent years, the World Wide Web has shown enormous growth in size. Vast repositories of information are available on practically every possible topic. In such cases, it is valuable to perform topical resource discovery effectively. Consequently, several new ideas have been proposed in recent years; among them a key technique is focused crawling which is able to crawl particular topical portions of the World Wide Web quickly, without having to explore all web pages.
    This paper discusses a framework and provides an overview of general methods for optimizing the management of advertisements on web servers. We discuss the major issues which arise in web advertisement management, and describe basic mathematical techniques which can be employed to handle such problems. These include a number of statistical, optimization and scheduling models.
    Abstract Data stream values are often associated with multiple aspects. For example, each value from environmental sensors may have an associated type (e.g., temperature, humidity, etc.) as well as location. Aside from timestamp, type and location are the two additional aspects. How to model such streams? How to simultaneously find patterns within and across the multiple aspects? How to do it incrementally in a streaming fashion?
    Abstract We present data representations, distance measures and organizational structures for fast and efficient retrieval of similar shapes in image databases. Using the Hough Transform we extract shape signatures that correspond to important features of an image. The new shape descriptor is robust against line discontinuities and takes into consideration not only the shape boundaries, but also the content inside the object perimeter.
    Abstract Unlike traditional clustering methods that focus on grouping objects with similar values on a set of dimensions, clustering by pattern similarity finds objects that exhibit a coherent pattern of rise and fall in subspaces. Pattern-based clustering extends the concept of traditional clustering and benefits a wide range of applications, including large scale scientific data analysis, target marketing, Web usage analysis, etc.
    A clear trend of the Web is that a variety of new consumer devices with diverse processing powers, display capabilities, and network connections is gaining access to the Internet. Tailoring Web content to match the device characteristics requires functionalities for content transformation, namely transcoding, that are typically carried out by the content provider or by some proxy server at the edge.
    Abstract DNA microarray technology is about to bring an explosion of gene expression data that may dwarf even the human sequencing projects. Researchers are motivated to identify genes whose expression levels rise and fall coherently under a set of experimental perturbations, that is, they exhibit fluctuation of a similar shape when conditions change.
    Technology breakthroughs are needed to:
      - manage and analyze continuous streams for knowledge extraction
      - adapt system management rapidly based on changes of the data and the environment
      - make numerous real-time decisions about priorities of what inputs to examine, what analyses to execute, etc.
      - operate over physically distributed sites
      - be highly secure and to support protection of private information
      - be scalable in many dimensions
    Abstract. In many classification and data-mining applications the user does not know a priori which distance measure is the most appropriate for the task at hand without examining the produced results. Also, in several cases, different distance functions can provide diverse but equally intuitive results (according to the specific focus of each measure).
    Abstract Many patterns have been discovered to explain and analyze how people make friends. Among them is the triadic closure, supported by the principle of the transitivity of friendship, which means that for an individual, the friends of her friend are more likely to become her new friends. However, people's motivations under this principle haven't been well studied, and it is still unknown how this principle works in diverse situations.
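To make the triadic closure idea concrete, here is a minimal sketch that lists the friends-of-friends who are not yet friends of a given person, together with the number of shared friends. The toy network, the names, and the function are assumptions for illustration only; the abstract does not describe this procedure, and the count is just one simple way to rank candidate links.

```python
# Hypothetical undirected friendship network (adjacency sets).
FRIENDS = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "dan"},
    "cat": {"ann", "dan", "eve"},
    "dan": {"bob", "cat"},
    "eve": {"cat"},
}


def closure_candidates(friends, person):
    """Map each friend-of-a-friend (not already a friend) to the number
    of friends shared with `person`, i.e. the open triads it would close."""
    counts = {}
    for friend in friends[person]:
        for fof in friends[friend]:
            if fof != person and fof not in friends[person]:
                counts[fof] = counts.get(fof, 0) + 1
    return counts
```

In this toy network, "dan" shares two friends with "ann" and "eve" shares one, so a transitivity-based heuristic would rank "dan" first.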
    Abstract In recent years, a number of indirect data collection methodologies have led to the proliferation of uncertain data. Such data points are often represented in the form of a probabilistic function, since the corresponding deterministic value is not known. This increases the challenge of mining and managing uncertain data, since the precise behavior of the underlying data is no longer known. In this paper, we provide a survey of uncertain data mining and management applications.
    Abstract Monitoring continual queries or subscriptions is to determine the subset of all queries or subscriptions whose predicates match a given event. Predicates contain not only equality but also non-equality clauses. Event matching is usually accomplished by first identifying a "small" candidate set of subscriptions for an event and then determining the matched subscriptions from the candidate set. Prior work has focused on using equality clauses to identify the candidate set.
    This paper presents distributed divergence control algorithms for epsilon serializability for both homogeneous and heterogeneous distributed databases. Epsilon serializability allows for more concurrency by permitting non-serializable interleavings of database operations among epsilon transactions.
    Abstract There has been a good deal of progress made recently toward the efficient parallelization of individual phases of single queries in multiprocessor database systems.
    Abstract Text classification is a major data mining task. An advanced text classification technique is known as partially supervised text classification, which can build a text classifier using a small set of positive examples only. This leads to our curiosity whether it is possible to find a set of features that can be used to describe the positive examples. Therefore, users do not even need to specify a set of positive examples.
    Abstract With ever-increasing amounts of graph data from disparate sources, there has been a strong need for exploiting significant graph patterns with user-specified objective functions. Most objective functions are not antimonotonic, which could cause all frequency-centric graph mining algorithms to fail. In this paper, we give the first comprehensive study of a general mining method that aims to find the most significant patterns directly.
    Abstract Graph reachability is fundamental to a wide range of applications, including XML indexing, geographic navigation, Internet routing, ontology queries based on RDF/OWL, etc. Many applications involve huge graphs and require fast answering of reachability queries. Several reachability labeling methods have been proposed for this purpose. They assign labels to the vertices, such that the reachability between any two vertices may be decided using their labels only.
    Abstract Pattern-based clustering is important in many applications, such as DNA micro-array data analysis, automatic recommendation systems and target marketing systems. However, pattern-based clustering in large databases is challenging. On the one hand, there can be a huge number of clusters and many of them can be redundant and thus make the pattern-based clustering ineffective. On the other hand, the previously proposed methods may not be efficient or scalable in mining large databases.
    Abstract Traditionally, text classifiers are built from labeled training examples. Labeling is usually done manually by human experts (or the users), which is a labor intensive and time consuming process. In the past few years, researchers investigated various forms of semi-supervised learning to reduce the burden of manual labeling. In this paper, we propose a different approach. Instead of labeling a set of documents, the proposed method labels a set of representative words for each class.
    Nowadays, sharing data among organizations is often required during the business collaboration. Data mining technology has enabled efficient extraction of knowledge from large databases. This, however, increases risks of disclosing the sensitive knowledge when the database is released to other parties. To address this privacy issue, one may sanitize the original database so that the sensitive knowledge is hidden.
    Abstract Predicting the values of a continuous variable as a function of several independent variables is one of the most important problems for data mining. A very large number of regression methods, both parametric and nonparametric, have been proposed in the past.
    Abstract We present a new approach to community discovery. Community discovery usually partitions the graph into communities or clusters. Focused community discovery allows the searcher to specify start points of interest, and find the community of those points. Focused search allows for a much more scalable algorithm in which the time depends only on the size of the community, and not on the number of nodes in the graph, and so is scalable to arbitrarily large graphs.
    Abstract This paper studies controlled local replication for hash routing, such as CARP, among a collection of loosely-coupled proxy web cache servers. Hash routing partitions the entire URL space among the shared web caches, creating a single logical cache. Each partition is assigned to a cache server. Duplication of cache contents is eliminated and total incoming traffic to the shared web caches is minimized. Client requests for non-assigned-partition objects are forwarded to sibling caches.
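The basic hash routing behaviour described above (before any controlled replication) can be sketched as follows. This is an assumption-laden illustration: the server names are hypothetical, and a plain modulo over a SHA-256 digest stands in for the hash function, which is not the function CARP itself specifies.

```python
import hashlib

# Hypothetical proxy cache servers sharing one logical cache.
CACHES = ["cache-0", "cache-1", "cache-2"]


def owner(url, caches=CACHES):
    """Map a URL to the single cache server that owns its partition.
    Assumption: simple modulo hashing, not CARP's actual hash function."""
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    return caches[int.from_bytes(digest[:8], "big") % len(caches)]


def route(url, receiving_cache, caches=CACHES):
    """Return (serving_cache, forwarded). A request that arrives at a
    cache which does not own the URL is forwarded to the sibling owner."""
    home = owner(url, caches)
    return home, home != receiving_cache
```

Because every URL has exactly one home server, no object is cached twice, but any request that lands on a non-owner pays for a forward to a sibling cache.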
    Abstract Mining high utility itemsets from a transactional database refers to the discovery of itemsets with high utility like profits. Although a number of relevant approaches have been proposed in recent years, they incur the problem of producing a large number of candidate itemsets for high utility itemsets. Such a large number of candidate itemsets degrades the mining performance in terms of execution time and space requirement.
    Abstract Dynamic finite versioning (DFV) schemes are an effective approach to concurrent transaction and query processing, where a finite number of consistent, but maybe slightly out-of-date, logical snapshots of the database can be dynamically derived for query access. In DFV, the storage overhead for keeping additional versions of changed data to support the logical snapshots and the amount of obsolescence faced by queries are two major performance issues.
    High-performance stream processing is critical in many sense-and-respond application domains—from environmental monitoring to algorithmic trading. In this paper, we focus on language and runtime support for improving the performance of sense-and-respond applications in processing data from high-rate live streams. The central tenets of this work are the programming model, the workload splitting mechanisms, the code generation framework, and the underlying System S middleware and Spade programming model.
    Abstract Name ambiguity has long been viewed as a challenging problem in many applications, such as scientific literature management, people search, and social network analysis. When we search for a person's name in these systems, many documents (e.g., papers, web pages) containing that person's name may be returned. It is hard to determine which documents are about the person we care about.
    Abstract Summary form only given. In the last decade, data mining has emerged as one of the most dynamic and lively areas in information technology. Although many algorithms and techniques for data mining have been proposed, they either focus on domain independent techniques or on very specific domain problems.
    Abstract One fundamental task in near-neighbor search as well as other similarity matching efforts is to find a distance function that can efficiently quantify the similarity between two objects in a meaningful way. In DNA microarray analysis, the expression levels of two closely related genes may rise and fall synchronously in response to a set of experimental stimuli. Although the magnitude of their expression levels may not be close, the patterns they exhibit can be very similar.
    Databases and data warehouse systems have been evolving from handling normalized spreadsheets stored in relational databases to managing and analyzing diverse application-oriented data with complex interconnecting structures. Responding to this emerging trend, information networks have been growing rapidly and showing their critical importance in many applications, such as the analysis of XML, social networks, Web, biological data, multimedia data, and spatiotemporal data.
    Abstract With the increasing popularity of the world wide web, it may be desirable to store copies of popular documents in proxy caches and thus diminish the delay times for URL requests. Web documents have to be treated as indivisible objects for the purpose of caching. In this paper we study the problem of caching web documents on disks. In web applications, the objects are of non-homogeneous size, and this leads to a problem in physical placement in any scheme which tries to emulate LRU.
    Abstract Similarity search in text has proven to be an interesting problem from the qualitative perspective because of inherent redundancies and ambiguities in textual descriptions. The methods used in search engines in order to retrieve documents most similar to user-defined sets of keywords are not applicable to targets which are medium to large size documents, because of even greater noise effects, stemming from the presence of a large number of words unrelated to the overall topic in the document.
    Abstract The large itemset model has been proposed in the literature for finding associations in a large database of sales transactions. A different method for evaluating and finding itemsets referred to as strongly collective itemsets is proposed. We propose a criterion stressing the importance of the actual correlation of the items with one another rather than their absolute level of presence. Previous techniques for finding correlated itemsets are not necessarily applicable to very large databases.
    Abstract In this paper, we study the problem of how to build an index structure for large string databases to efficiently support various types of string matching without the necessity of mapping the substrings to a numerical space (e.g., string B-tree and MRS-index) or the restriction of in-memory practice (e.g., suffix tree and suffix array).
    Abstract In this paper we examine a content-based method to download/record digital video from networks to client stations and home VCRs. The method examined is an alternative to the conventional time-based method used for recording analogue video. Various approaches to probing the video content and to triggering the VCR operations are considered, including frame signature matching, program barcode matching, preloaded pattern search, and annotation signal search in a hypermedia environment.
    Abstract Collective classification approaches exploit the dependencies of a group of linked objects whose class labels are correlated and need to be predicted simultaneously. In this paper, we focus on studying the collective classification problem in heterogeneous networks, which involves multiple types of data objects interconnected by multiple types of links. Intuitively, two objects are correlated if they are linked by many paths in the network.
    Abstract The growing size of text repositories on the world wide web has created a need for efficient indexing and retrieval methods for text collections. Almost all of the text retrieval and indexing methods have been designed for the case of simple keyword search, in which a few keywords are specified, and the text is retrieved on the basis of matches to these keywords.
    Abstract Collaborative mining of distributed data streams in a mobile computing environment is referred to as Pocket Data Mining (PDM). Hoeffding tree techniques have been experimentally and analytically validated for data stream classification. In this paper, we have proposed, developed and evaluated the adoption of distributed Hoeffding trees for classifying streaming data in PDM applications.
    Abstract Previous work on structural joins mostly focuses on maintaining offline indexes on disks. Most of them also require the elements in both sets to be sorted. In this paper, we study an on-the-fly, in-memory indexing approach to structural joins. There is no need to sort the elements or maintain indexes on disks. We identify the similarity between the structural join problem and the stabbing query problem, and extend a main memory-based indexing technique for stabbing queries to structural joins.
    Abstract Industry companies frequently outsource datasets to mining firms, and academic institutions create repositories and share datasets in the interest of promoting research collaboration. Still, many practitioners feel reserved about sharing or outsourcing datasets, primarily because of the fear of losing the principal rights over the dataset.
    Abstract Real-time street parking availability information is important in urban areas, and if available could reduce congestion, pollution, and gas consumption. In this paper, an advanced street parking system called PhonePark is presented. Using the GPS, accelerometer, and Bluetooth sensors on a traveler's mobile phone, in conjunction with geospatial data, we can automatically detect when and where the traveler parked her car, and when she released a parking slot.
    Abstract The problem of graph classification has attracted great interest in the last decade. Current research on graph classification assumes the existence of large amounts of labeled training graphs. However, in many applications, the labels of graph data are very expensive or difficult to obtain, while there are often copious amounts of unlabeled graph data available.
    Abstract Hash routing partitions the entire URL space among a collection of cooperating proxy caches. Each partition is assigned to a cache server. Duplication of cache contents is eliminated. Client requests to a cache server for non-assigned partition objects are forwarded to proper sibling caches. As a result, the load level of the cache servers can be quite unbalanced.
    Abstract We explore the method of combining the replication and parity approaches to tolerate multiple disk failures in a disk array. In addition to the conventional mirrored and chained declustering methods, a method based on the hybrid of mirrored-and-chained declustering is explored. A performance study that explores the effect of combining replication and parity approaches is conducted.
