Motivation: A maximal match between two genomes is a contiguous non-extendable sub-sequence common in the two genomes. DNA bases mutate very often from the genome of one individual to another. When a mutation occurs in a maximal match, it breaks the maximal match into shorter match segments. The coding cost using these broken segments for reference-based genome compression is much higher than that of using the maximal match which is allowed to contain mutations. Results: We present memRGC, a novel reference-based genome compression algorithm that leverages mutation-containing matches (MCMs) for genome encoding. MemRGC detects maximal matches between two genomes using a coprime double-window k-mer sampling search scheme. The method then extends these matches to cover mismatches (mutations) and their neighbouring maximal matches to form long MCMs. Experiments reveal that memRGC boosts the compression performance by an average of 27% in reference-based genome compression. MemRGC is a...
High-throughput protein interaction data, with ever-increasing volume, are becoming the foundation of many biological discoveries. However, high-throughput protein interaction data are often associated with high false positive and false negative rates. It is desirable to develop scalable methods to identify these errors. In this paper, we develop a computational method to identify spurious interactions and missing interactions from high-throughput protein interaction data. Our method uses both local and global topological information of protein pairs, and it assigns a local interacting score and a global interacting score to every protein pair. The local interacting score is calculated based on the common neighbors of the protein pairs. The global interacting score is computed using globally interacting protein group pairs. The two scores are then combined to obtain a final score called LGTweight to indicate the interacting possibility of two proteins. We tested our method on the DIP yeast interaction dataset. The experimental results show that the interactions ranked top by our method have higher functional homogeneity and localization coherence than those ranked top by existing methods, and our method also achieves higher sensitivity and precision under 5-fold cross validation than existing methods.
Motivation: The 2009 H1N1 influenza A virus affected populations worldwide and caused substantial mortality. The epitope conservation in the HA1 protein allows antibodies to cross-neutralize the 1918 and 2009 H1N1 viruses. However, few works have thoroughly studied the binding hot spots in the two antigen-antibody interfaces which are responsible for the antibody cross-neutralization. Results: We apply predictive methods to identify binding hot spots at the epitope sites of the HA1 proteins and at the paratope sites of the 2D1 antibody. We find that the six mutations at the HA1 epitopes from 1918 to 2009 do not contribute greatly to the 2D1 antibody binding. However, the change of binding free energy on the whole exhibits an increased tendency after these mutations, making the binding stronger. This is consistent with the observation that the 1918 H1N1 neutralizing antibody can cross-react with 2009 H1N1. We have identified five distinguished hot spot residues, including Lys166, which are common between the two epitopes. These common hot spots can also be used to explain why the 2D1 antibody can cross-react. We believe that these hot spot residues are mutation candidates which may help H1N1 viruses evade the immune system. We have also identified seven residues at the paratope site of the 2D1 antibody, four from the heavy chain and three from the light chain. All of them are predicted to be energetically important in HA1 recognition. The identification of these hot spot residues and their structural analysis are potentially beneficial to future drug design against H1N1 viruses. Contact: dcslij@nus.edu.sg
DOI: 10.1109/ICDMW.2013.177. Proceedings - IEEE 13th International Conference on Data Mining Workshops, ICDMW 2013, xxxvi
Genome informatics. Workshop on Genome Informatics, 1998
BLASTP gives a good overall indication of what function a protein might have. However, analysis of BLASTP reports to discover various domain features in the protein is still tedious. We address this problem by using the modern data integration system, Kleisli, to bring out annotated features of BLASTP results. We further strengthen our solution by incorporating additional information from SEG, ClustalW, hmmPfam, etc. It is also noteworthy that the code of our implementation is sufficiently short to be presented in its entirety.
IEEE/ACM transactions on computational biology and bioinformatics / IEEE, ACM
Protein-protein interfaces defined through atomic contact or solvent accessibility change are widely adopted in structural biology studies. However, these definitions cannot precisely capture energetically important regions at protein interfaces. The burial depth of an atom in a protein is related to the atom's energy. This work investigates how closely the change in burial level of an atom/residue upon complexation is related to the binding. Burial level change is different from burial level itself. An atom deeply buried in a monomer, with a high burial level, may not change its burial level after an interaction, and so may have little burial level change. We hypothesize that an interface is a region of residues all undergoing burial level changes after interaction. By this definition, an interface can be decomposed into an onion-like structure according to the extent of burial level change. We found that our defined interfaces cover energetically important residues more precisely, and t...
This chapter surveys the maintenance of frequent patterns in transaction datasets. It is written to be accessible to researchers familiar with the field of frequent pattern mining. The frequent pattern maintenance problem is summarized with a study on how the space of frequent patterns evolves in response to data updates. This chapter focuses on incremental and decremental maintenance. Four major types of maintenance algorithms are studied: Apriori-based, partition-based, prefix-tree-based, and concise-representation-based algorithms. The authors study the advantages and limitations of these algorithms from both the theoretical and experimental perspectives. Possible solutions to certain limitations are also proposed. In addition, some potential research opportunities and emerging trends in frequent pattern maintenance are discussed.
Papers by Limsoon Wong