Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
10.1145/565196.565200acmconferencesArticle/Chapter ViewAbstractPublication PagesrecombConference Proceedingsconference-collections
Article

Monotony of surprise and large-scale quest for unusual words

Published: 18 April 2002 Publication History

Abstract

The problem of characterizing and detecting recurrent sequence patterns such as substrings or motifs and related associations or rules is variously pursued in order to compress data, unveil structure, infer succinct descriptions, extract and classify features, etc. In Molecular Biology, exceptionally frequent or rare words in bio-sequences have been implicated in various facets of biological function and structure. The discovery, particularly on a massive scale, of such patterns poses interesting methodological and algorithmic problems, and often exposes scenarios in which tables and synopses grow faster and bigger than the raw sequences they are meant to encapsulate. In previous study, the ability to succinctly compute, store, and display unusual substrings has been linked to a subtle interplay between the combinatorics of the subwords of a word and local monotonicities of some scores used to measure the departure from expectation. In this paper, we carry out an extensive analysis of such monotonicities for a broader variety of scores. This supports the construction of data structures and algorithms capable of performing global detection of unusual substrings in time and space linear in the subject sequences, under various probabilistic models.

References

[1]
Apostolico, A. Of maps bigger than the empire, Keynote, Proceedings of the 8th International Colloquium on String Processing and Information Retrieval, Laguna de San Rafael, Chile, November 2001, IEEE Computer Society Press, 2--10 (2001).
[2]
Apostolico, A., Bock, M. E., Lonardi, S., and Xu, X. Efficient detection of unusual words. J. Comput. Bio. 7, 1/2 (Jan. 2000), 71--94.
[3]
Apostolico, A., Bock, M. E., and Xu, X. Annotated statistical indices for sequence analysis. Keynote, Proceedings of Complexity and Compression of SEQUENCES97, Positano, Italy, June 1997, IEEE Computer Society Press, 215--229 (1998).
[4]
Apostolico, A., and Galil, Z., Eds. Pattern matching algorithms. Oxford University Press, New York, NY, 1997.
[5]
Blumer, A., Blumer, J., Ehrenfeucht, A., Haussler, D., and McConnel, R. Complete inverted files for efficient text retrieval and analysis. J. Assoc. Comput. Mach. 34, 3 (1987), 578--595.
[6]
Borges, J. L. A Universal History of Infamy. Penguin Books, London, 1975.
[7]
Clift, B., Haussler, D., McConnell, R., Schneider, T. D., and Stormo, G. D. Sequence landscapes. Nucleic Acids Res. 14 (1986), 141--158.
[8]
Gentleman, J. The distribution of the frequency of subsequences in alphabetic sequences, as exemplified by deoxyribonucleic acid. Appl. Statist. 43 (1994), 404--414.
[9]
Kleffe, J., and Borodovsky, M. First and second moment of counts of words in random texts generated by Markov chains. Comput. Appl. Biosci. 8 (1992), 433--441.
[10]
Leung, M. Y., Marsh, G. M., and Speed, T. P. Over and underrepresentation of short DNA words in herpesvirus genomes. J. Comput. Bio. 3 (1996), 345--360.
[11]
Lonardi, S. Global Detectors of Unusual Words: Design, Implementation, and Applications to Pattern Discovery in Biosequences. PhD thesis, Purdue University, 2001.
[12]
Lundstrom, R. Stochastic models and statistical methods for DNA sequence data. PhD thesis, University of Utah, 1990.
[13]
Pevzner, P. A., Borodovsky, M. Y., and Mironov, A. A. Linguistics of nucleotides sequences I: The significance of deviations from mean statistical characteristics and prediction of the frequencies of occurrence of words. J. Biomol. Struct. Dynamics 6 (1989), 1013--1026.
[14]
Régnier, M., and Szpankowski, W. On pattern frequency occurrences in a Markovian sequence. Algorithmica 22 (1998), 631--649.
[15]
Reinert, G., Schbath, S., and Waterman, M. S. Probabilistic and statistical properties of words: An overview. J. Comput. Bio. 7 (2000), 1--46.
[16]
Stückle, E., Emmrich, C., Grob, U., and Nielsen, P. Statistical analysis of nucleotide sequences. Nucleic Acids Res. 18, 22 (1990), 6641--6647.
[17]
Waterman, M. S. Introduction to Computational Biology. Chapman & Hall, 1995.

Cited By

View all
  • (2019)Local Recurrence Rates with Automatic Time Windows for Discord Search in Multivariate Time SeriesProcedia Manufacturing10.1016/j.promfg.2020.01.26139(1783-1792)Online publication date: 2019
  • (2018)Streaming data analysis framework for cyber-physical system of metal machining processes2018 IEEE Industrial Cyber-Physical Systems (ICPS)10.1109/ICPHYS.2018.8390764(546-551)Online publication date: May-2018
  • (2017)Adjacent Mean Difference (AMD) method for dynamic segmentation in time series anomaly detection2017 IEEE/SICE International Symposium on System Integration (SII)10.1109/SII.2017.8279219(241-246)Online publication date: Dec-2017
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences
RECOMB '02: Proceedings of the sixth annual international conference on Computational biology
April 2002
341 pages
ISBN:1581134983
DOI:10.1145/565196
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 18 April 2002

Permissions

Request permissions for this article.

Check for updates

Qualifiers

  • Article

Conference

RECOMB02
Sponsor:

Acceptance Rates

RECOMB '02 Paper Acceptance Rate 35 of 118 submissions, 30%;
Overall Acceptance Rate 148 of 538 submissions, 28%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)5
  • Downloads (Last 6 weeks)2
Reflects downloads up to 04 Oct 2024

Other Metrics

Citations

Cited By

View all
  • (2019)Local Recurrence Rates with Automatic Time Windows for Discord Search in Multivariate Time SeriesProcedia Manufacturing10.1016/j.promfg.2020.01.26139(1783-1792)Online publication date: 2019
  • (2018)Streaming data analysis framework for cyber-physical system of metal machining processes2018 IEEE Industrial Cyber-Physical Systems (ICPS)10.1109/ICPHYS.2018.8390764(546-551)Online publication date: May-2018
  • (2017)Adjacent Mean Difference (AMD) method for dynamic segmentation in time series anomaly detection2017 IEEE/SICE International Symposium on System Integration (SII)10.1109/SII.2017.8279219(241-246)Online publication date: Dec-2017
  • (2016)Respiration Disorders Classification With Informative Features for m-Health ApplicationsIEEE Journal of Biomedical and Health Informatics10.1109/JBHI.2015.245896520:3(733-747)Online publication date: May-2016
  • (2015)Design and Evaluation of an Intelligent Remote Tidal Volume Variability Monitoring System in E-Health ApplicationsIEEE Journal of Biomedical and Health Informatics10.1109/JBHI.2015.244578319:5(1532-1548)Online publication date: Sep-2015
  • (2013)Versatile Succinct Representations of the Bidirectional Burrows-Wheeler TransformAlgorithms – ESA 201310.1007/978-3-642-40450-4_12(133-144)Online publication date: 2013
  • (2013)On Perturbation Measure of Clusters: ApplicationArtificial Intelligence and Soft Computing10.1007/978-3-642-38610-7_17(176-183)Online publication date: 2013
  • (2012)Faster variance computation for patterns with gapsProceedings of the First Mediterranean conference on Design and Analysis of Algorithms10.1007/978-3-642-34862-4_10(134-147)Online publication date: 3-Dec-2012
  • (2012)A clustering algorithm based on distinguishability for nominal attributesProceedings of the 11th international conference on Artificial Intelligence and Soft Computing - Volume Part II10.1007/978-3-642-29350-4_14(120-127)Online publication date: 29-Apr-2012
  • (2010)Approximate variable-length time series motif discovery using grammar inferenceProceedings of the Tenth International Workshop on Multimedia Data Mining10.1145/1814245.1814255(1-9)Online publication date: 25-Jul-2010
  • Show More Cited By

View Options

Get Access

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media