Motivation: A maximal match between two genomes is a contiguous non-extendable sub-sequence common to the two genomes. DNA bases mutate very often from the genome of one individual to another. When a mutation occurs in a maximal match, it breaks the maximal match into shorter match segments. The cost of coding these broken segments in reference-based genome compression is much higher than that of coding a maximal match which is allowed to contain mutations. Results: We present memRGC, a novel reference-based genome compression algorithm that leverages mutation-containing matches (MCMs) for genome encoding. MemRGC detects maximal matches between two genomes using a coprime double-window k-mer sampling search scheme; the method then extends these matches to cover mismatches (mutations) and their neighbouring maximal matches to form long MCMs. Experiments reveal that memRGC boosts the compression performance by an average of 27% in reference-based genome compression. MemRGC is a...
High-throughput protein interaction data, with ever-increasing volume, are becoming the foundation of many biological discoveries. However, such data are often associated with high false positive and false negative rates. It is desirable to develop scalable methods to identify these errors. In this paper, we develop a computational method to identify spurious interactions and missing interactions from high-throughput protein interaction data. Our method uses both local and global topological information of protein pairs, and it assigns a local interacting score and a global interacting score to every protein pair. The local interacting score is calculated based on the common neighbors of the protein pair. The global interacting score is computed using globally interacting protein group pairs. The two scores are then combined to obtain a final score, called LGTweight, that indicates how likely two proteins are to interact. We tested our method on the DIP yeast interaction dataset. The experimental results show that the interactions ranked top by our method have higher functional homogeneity and localization coherence than those ranked top by existing methods, and that our method also achieves higher sensitivity and precision under 5-fold cross validation than existing methods.
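The idea of scoring a protein pair by its common neighbors can be sketched as follows. This is a minimal illustration only: it uses a plain Jaccard index over neighbor sets as a stand-in, which is an assumption and not the paper's local interacting score or LGTweight formula. The function names `build_adjacency` and `common_neighbor_score` and the toy network are hypothetical.

```python
def build_adjacency(edges):
    """Build an undirected adjacency map {protein: set(neighbors)} from edge pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def common_neighbor_score(adj, a, b):
    """Jaccard index of the neighbor sets of a and b (illustrative stand-in score).

    The two proteins themselves are excluded so that an existing edge a-b
    does not influence the score.
    """
    na = adj.get(a, set()) - {b}
    nb = adj.get(b, set()) - {a}
    union = na | nb
    return len(na & nb) / len(union) if union else 0.0


# Toy interaction network (hypothetical proteins).
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "D"), ("B", "D"), ("B", "E")]
adj = build_adjacency(edges)
score_existing = common_neighbor_score(adj, "A", "B")   # observed interaction
score_candidate = common_neighbor_score(adj, "C", "E")  # unobserved pair
```

A low score for an observed pair would flag a possibly spurious interaction, while a high score for an unobserved pair would suggest a possibly missing one.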
There are different types of correlation patterns between the variables of a time course data set, such as positive correlations, negative correlations, time-lagged correlations, and correlations containing small interrupted gaps. Usually, these correlations are maintained only on a subset of time points rather than on the whole span of time points, as is traditionally required in the definition of correlation. As these types of patterns underlie different trends of data movement, mining all of them is an important step to gain a broad insight into the dependencies of the variables. In this work, we prove that these diverse types of correlation patterns can all be represented by a generalized form of positive correlation patterns. We also prove a correspondence between positive correlation patterns and sequential patterns. We then present an efficient single-scan algorithm for mining all of these types of correlations. This "pan-correlation" mining algorithm is evaluated on synthetic time course data sets, as well as on yeast cell cycle gene expression data sets. The results indicate that: (i) the running time of our mining algorithm increases linearly with the number of variables; (ii) negative correlation patterns are abundant in real-world data sets; and (iii) correlation patterns with time lags and gaps are also abundant. Existing methods have only discovered incomplete forms of many of these patterns, and have missed some important patterns completely.
Motivation: The 2009 H1N1 influenza A virus struck populations worldwide and caused substantial mortality. The epitope conservation in the HA1 protein allows antibodies to cross-neutralize the 1918 and 2009 H1N1 viruses. However, few works have thoroughly studied the binding hot spots in the two antigen-antibody interfaces which are responsible for the antibody cross-neutralization. Results: We apply predictive methods to identify binding hot spots at the epitope sites of the HA1 proteins and at the paratope sites of the 2D1 antibody. We find that the six mutations at the HA1’s epitopes from 1918 to 2009 do not contribute greatly to the 2D1 antibody binding. However, the change of binding free energy on the whole exhibits an increased tendency after these mutations, making the binding stronger. This is consistent with the observation that the 1918 H1N1 neutralizing antibody can cross-react with 2009 H1N1. We have identified five distinguished hot spot residues, including Lys166, which are common to the two epitopes. These common hot spots can also be used to explain why the 2D1 antibody can cross-react. We believe that these hot spot residues are mutation candidates which may help H1N1 viruses evade the immune system. We have also identified seven residues at the paratope site of the 2D1 antibody, four from the heavy chain and three from the light chain. All of them are predicted to be energetically important in HA1 recognition. The identification of these hot spot residues and their structural analysis are potentially beneficial to future drug design against H1N1 viruses. Contact: dcslij@nus.edu.sg
Motivation: Infection with strains of different subtypes and the subsequent crossover reading between the two strands of genomic RNAs by host cells' reverse transcriptase are the main causes of the vast HIV-1 sequence diversity. Such inter-subtype genomic recombinants can become circulating recombinant forms (CRFs) after widespread transmission in a population. Complete prediction of all the subtype sources of a CRF strain is a complicated machine learning problem. It is also difficult to determine whether a strain is an emerging new subtype and, if so, how to accurately identify the new components of the genetic source. Results: We introduce a multi-label learning algorithm for the complete prediction of multiple sources of a CRF sequence as well as the prediction of its chronological number. The prediction is strengthened by voting among various multi-label learning methods to avoid biased decisions. In our method, both frequency and position features of the sequences are extracted to capture signature patterns of pure subtypes and CRFs. The method was applied to 7185 HIV-1 sequences, comprising 5530 pure subtype sequences and 1655 CRF sequences. The results demonstrate that the method can achieve very high accuracy (reaching 99%) in the prediction of the complete set of labels of HIV-1 recombinant forms. The few wrong predictions are actually incomplete predictions, very close to the complete set of genuine labels.
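The voting step mentioned in the abstract above can be pictured with a small sketch. It assumes a simple per-label strict-majority rule across several multi-label predictors; the abstract does not specify the voting rule, so this rule, the function name `vote_labels`, and the example label sets are all hypothetical.

```python
from collections import Counter


def vote_labels(predictions):
    """Combine multi-label predictions by per-label strict majority.

    predictions: list of label collections, one per learning method.
    Returns the set of labels predicted by more than half of the methods.
    """
    counts = Counter(label for pred in predictions for label in set(pred))
    threshold = len(predictions) / 2
    return {label for label, n in counts.items() if n > threshold}


# Three hypothetical methods predicting the subtype sources of one sequence.
combined = vote_labels([{"A", "G"}, {"A", "G"}, {"A"}])
disputed = vote_labels([{"B"}, {"B", "F"}, {"A"}])
```

Under this rule a label survives only if most methods agree on it, which is one way an ensemble can damp the bias of any single method; it can also yield an incomplete label set when the methods disagree.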
Motivation: A maximal match between two genomes is a contiguous non-extendable sub-sequence common to the two genomes. DNA bases mutate very often from the genome of one individual to another. When a mutation occurs in a maximal match, it breaks the maximal match into shorter match segments. The cost of coding these broken segments in reference-based genome compression is much higher than that of coding a maximal match which is allowed to contain mutations. Results: We present memRGC, a novel reference-based genome compression algorithm that leverages mutation-containing matches for genome encoding. MemRGC detects maximal matches between two genomes using a coprime double-window k-mer sampling search scheme; the method then extends these matches to cover mismatches (mutations) and their neighboring maximal matches to form long mutation-containing matches. Experiments reveal that memRGC boosts the compression performance by an average of 27% in reference-based genome compression. MemRGC is also better than the best state-of-the-art methods on all of the benchmark data sets, sometimes by as much as 50%. Moreover, memRGC uses much less memory and fewer decompression resources, while providing comparable compression speed. These advantages are of significant benefit to genome data storage and transmission.
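The notion of a mutation-containing match can be illustrated with a naive sketch. It starts from an exact match and repeatedly jumps over a single mismatching base whenever the neighboring exact segment is long enough. This is not memRGC's implementation: the k-mer sampling search is omitted, only substitutions are handled, and the names `exact_match_len`, `extend_over_mismatches`, `decode` and the `min_seg` parameter are assumptions made for illustration.

```python
def exact_match_len(ref, tgt, i, j):
    """Length of the longest exact match starting at ref[i] and tgt[j]."""
    n = 0
    while i + n < len(ref) and j + n < len(tgt) and ref[i + n] == tgt[j + n]:
        n += 1
    return n


def extend_over_mismatches(ref, tgt, i, j, min_seg=3):
    """Grow a match from (i, j), absorbing single-base substitutions.

    Returns (length, mismatches); mismatches lists (offset, target_base)
    for every substitution inside the extended match.
    """
    length = exact_match_len(ref, tgt, i, j)
    mismatches = []
    while i + length < len(ref) and j + length < len(tgt):
        # The base at offset `length` differs; look at the segment after it.
        seg = exact_match_len(ref, tgt, i + length + 1, j + length + 1)
        if seg < min_seg:
            break  # neighboring match too short to be worth absorbing
        mismatches.append((length, tgt[j + length]))
        length += 1 + seg
    return length, mismatches


def decode(ref, i, length, mismatches):
    """Rebuild the target segment from the reference and the substitutions."""
    seg = list(ref[i:i + length])
    for offset, base in mismatches:
        seg[offset] = base
    return "".join(seg)


ref = "ACGTACGTAC"
tgt = "ACGTTCGTAC"  # one substitution at offset 4
length, subs = extend_over_mismatches(ref, tgt, 0, 0)
```

In this toy case one record (start, length, one substitution) describes the whole target, whereas exact matching alone would need two separate segments plus a literal base.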
Novel technologies and growing interest have resulted in a large increase in the amount of data available for genomics and transcriptomics studies, both in terms of volume and contents. Biology is relying more and more on computational methods to process, investigate and extract knowledge from this huge amount of data. In this work, we present the TICA web server (available at http://www.gmql.eu/tica/), a fast and compact tool developed to support data-driven knowledge discovery in the realm of transcription factor interaction prediction. To infer physically interacting transcription factors, TICA leverages both the GenoMetric Query Language, a novel query tool (based on the Apache Hadoop and Spark technologies) specialized in the integration and management of heterogeneous, large genomic datasets, and a statistical method for robust detection of co-locations across interval-based data. Notably, TICA allows investigators to upload and analyse their own ChIP-seq experiment datasets, comparing them either against ENCODE data or among themselves, with computation time that increases linearly with dataset size and density. Using ENCODE data from three well-studied cell lines as reference, we show that TICA predictions are supported by existing biological knowledge, making the web server a reliable and efficient tool for interaction screening and data-driven hypothesis generation.
10.1109/ICDMW.2013.177. Proceedings - IEEE 13th International Conference on Data Mining Workshops, ICDMW 2013, xxxvi
Genome informatics. Workshop on Genome Informatics, 1998
BLASTP gives a good overall indication of what function a protein might have. However, analysis of BLASTP reports to discover various domain features in the protein is still tedious. We address this problem by using the modern data integration system, Kleisli, to bring out annotated features of BLASTP results. We further strengthen our solution by incorporating additional information from SEG, ClustalW, hmmPfam, etc. It is also noteworthy that the code of our implementation is sufficiently short to be presented in its entirety.
Time-course correlation patterns can be positive or negative, and time-lagged with gaps. Mining all these correlation patterns helps to gain broad insights into variable dependencies. Here, we prove that diverse types of correlation patterns can be represented by a generalized form of positive correlation patterns. We prove a correspondence between positive correlation patterns and sequential patterns, and present an efficient single-scan algorithm for mining the correlations. Evaluations on synthetic time course data sets and yeast cell cycle gene expression data sets indicate that: (i) the running time of the algorithm increases linearly with the number of variables; (ii) negative correlation patterns are abundant in real-world data sets; and (iii) correlation patterns with time lags and gaps are abundant. Existing methods have only discovered incomplete forms of many of these patterns, and have missed some important patterns completely. Keywords: pan-correlation pattern • time-course data • positive correlation patterns • negative correlation patterns • time-lagged positive correlation patterns • time-lagged negative correlation patterns
Computational Prediction of Protein Complexes from Protein Interaction Networks, 2017
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopy, recording, or any other), except for brief quotations in printed reviews, without the prior permission of the publisher. Designations used by companies to distinguish their products are often claimed as trademarks or registered trademarks. In all instances in which Morgan & Claypool is aware of a claim, the product names appear in initial capital or all capital letters. Readers, however, should contact the appropriate companies for more complete information regarding trademarks and registration. Contents: Preface; Chapter 1, Introduction to Protein Complex Prediction; 1.1 From Protein Interactions to Protein Complexes; 1.2 Databases for Protein Complexes; 1.3 Organization of the Rest of the Book
IEEE/ACM transactions on computational biology and bioinformatics / IEEE, ACM
Protein-protein interfaces defined through atomic contact or solvent accessibility change are widely adopted in structural biology studies. However, these definitions cannot precisely capture energetically important regions at protein interfaces. The burial depth of an atom in a protein is related to the atom's energy. This work investigates how closely the change in burial level of an atom/residue upon complexation is related to the binding. Burial level change is different from burial level itself. An atom deeply buried in a monomer, with a high burial level, may show little or no change in burial level after an interaction. We hypothesize that an interface is a region of residues all undergoing burial level changes after interaction. By this definition, an interface can be decomposed into an onion-like structure according to the extent of burial level change. We found that our defined interfaces cover energetically important residues more precisely, and t...
Data of interest to biomedical researchers associated with the Human Genome Project (HGP) is stored all over the world in a variety of electronic data formats and accessible through a variety of interfaces and retrieval languages. These data sources include conventional relational databases with SQL interfaces, formatted text files on top of which indexing is provided for efficient retrieval (ASN.1), and binary files that can be interpreted textually or graphically via special purpose interfaces (ACeDB); there are also image databases of molecular and chemical structures. Researchers within the HGP want to combine data from these different data sources, add value through sophisticated data analysis techniques (such as the biosequence comparison software BLAST and FASTA), and view it using special purpose scientific visualization tools. However, currently there are no commercial tools for enabling such an integrated digital library, and a fundamental barrier to developing such tools appears to be one of language design and optimization. For example, while tools exist for interoperating between heterogeneous relational databases, the data formats and software packages found throughout the HGP contain a number of data types not easily available in conventional databases, such as lists, variants and arrays; furthermore, these types may be deeply nested. We present in this paper a language for querying and transforming data from heterogeneous sources, discuss its implementation in a system called BioKleisli and illustrate its use in accessing data sources critical to the HGP.
This chapter surveys the maintenance of frequent patterns in transaction datasets. It is written to be accessible to researchers familiar with the field of frequent pattern mining. The frequent pattern maintenance problem is summarized with a study on how the space of frequent patterns evolves in response to data updates. This chapter focuses on incremental and decremental maintenance. Four major types of maintenance algorithms are studied: Apriori-based, partition-based, prefix-tree-based, and concise-representation-based algorithms. The authors study the advantages and limitations of these algorithms from both the theoretical and experimental perspectives. Possible solutions to certain limitations are also proposed. In addition, some potential research opportunities and emerging trends in frequent pattern maintenance are discussed.
Papers by Limsoon Wong