Article

Monotony of surprise and large-scale quest for unusual words

Authors:

Alberto Apostolico,

Mary Ellen Bock,

Stefano Lonardi

RECOMB '02: Proceedings of the sixth annual international conference on Computational biology

Pages 22 - 31

https://doi.org/10.1145/565196.565200

Published: 18 April 2002

Abstract

The problem of characterizing and detecting recurrent sequence patterns such as substrings or motifs and related associations or rules is variously pursued in order to compress data, unveil structure, infer succinct descriptions, extract and classify features, etc. In Molecular Biology, exceptionally frequent or rare words in bio-sequences have been implicated in various facets of biological function and structure. The discovery, particularly on a massive scale, of such patterns poses interesting methodological and algorithmic problems, and often exposes scenarios in which tables and synopses grow faster and bigger than the raw sequences they are meant to encapsulate. In previous work, the ability to succinctly compute, store, and display unusual substrings has been linked to a subtle interplay between the combinatorics of the subwords of a word and local monotonicities of some scores used to measure the departure from expectation. In this paper, we carry out an extensive analysis of such monotonicities for a broader variety of scores. This supports the construction of data structures and algorithms capable of performing global detection of unusual substrings in time and space linear in the subject sequences, under various probabilistic models.
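As a rough illustration of what a score measuring "departure from expectation" can look like, the sketch below counts the occurrences of every substring up to a given length and compares each count with its expectation under an i.i.d. model whose symbol probabilities are estimated from the sequence itself. The function names, the particular score (f - E) / sqrt(E), and the brute-force enumeration are all assumptions made for illustration. This is not the linear-time construction the paper describes.

```python
from collections import Counter
from math import prod, sqrt


def count_occurrences(text, word):
    """Count (possibly overlapping) occurrences of word in text."""
    return sum(1 for i in range(len(text) - len(word) + 1)
               if text.startswith(word, i))


def expected_count(text, word, probs):
    """Expected number of occurrences of word under an i.i.d. model."""
    positions = len(text) - len(word) + 1
    return positions * prod(probs[c] for c in word)


def surprise_scores(text, max_len):
    """Score every substring of length <= max_len (hypothetical helper).

    The score is (observed - expected) / sqrt(expected), an illustrative
    choice; the paper studies a broader family of scores.
    """
    # Symbol probabilities estimated from the text itself (an assumption).
    probs = {c: n / len(text) for c, n in Counter(text).items()}
    scores = {}
    for m in range(1, max_len + 1):
        for i in range(len(text) - m + 1):
            word = text[i:i + m]
            if word in scores:
                continue  # already scored this substring
            f = count_occurrences(text, word)
            e = expected_count(text, word, probs)
            scores[word] = (f - e) / sqrt(e)
    return scores


if __name__ == "__main__":
    for word, z in sorted(surprise_scores("abababab", 3).items()):
        print(word, round(z, 3))
```

On the toy input "abababab", the words "ba" and "bab" both occur three times, and the longer word receives the higher score because its expectation is smaller. This is the flavor of local monotonicity the abstract alludes to; the precise conditions and scores are those given in the paper, not in this sketch.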

Index Terms

Monotony of surprise and large-scale quest for unusual words
1. Applied computing
  1. Life and medical sciences
2. Mathematics of computing
  1. Probability and statistics

Published In

RECOMB '02: Proceedings of the sixth annual international conference on Computational biology

April 2002

341 pages

ISBN:1581134983

DOI:10.1145/565196

Editors:
Gene Myers (Celera, USA),
Sridhar Hannenhalli (Celera, USA),
David Sankoff (University of Montréal, Canada),
Sorin Istrail (Celera, USA),
Pavel Pevzner (University of California at San Diego, USA),
Michael Waterman (University of California, USA)

Copyright © 2002 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

RECOMB02: 6th Annual International Conference on Computational Molecular Biology

April 18 - 21, 2002

Washington, DC, USA

Acceptance Rates

RECOMB '02 Paper Acceptance Rate 35 of 118 submissions, 30%;

Overall Acceptance Rate 148 of 538 submissions, 28%

Cited By

Yang C, Darwin F, Sutrisno H (2019). Local Recurrence Rates with Automatic Time Windows for Discord Search in Multivariate Time Series. Procedia Manufacturing 39, 1783-1792. Online publication date: 2019.
https://doi.org/10.1016/j.promfg.2020.01.261
Yang C, Sutrisno H, Lo N, Chen Z, Wei C, Zhang H, Lin C, Wei C, Hsieh S (2018). Streaming data analysis framework for cyber-physical system of metal machining processes. 2018 IEEE Industrial Cyber-Physical Systems (ICPS), 546-551. Online publication date: May 2018.
https://doi.org/10.1109/ICPHYS.2018.8390764
Yang C, Liao W (2017). Adjacent Mean Difference (AMD) method for dynamic segmentation in time series anomaly detection. 2017 IEEE/SICE International Symposium on System Integration (SII), 241-246. Online publication date: Dec 2017.
https://doi.org/10.1109/SII.2017.8279219
Fekr A, Janidarmian M, Radecka K, Zilic Z (2016). Respiration Disorders Classification With Informative Features for m-Health Applications. IEEE Journal of Biomedical and Health Informatics 20:3, 733-747. Online publication date: May 2016.
https://doi.org/10.1109/JBHI.2015.2458965
Fekr A, Radecka K, Zilic Z (2015). Design and Evaluation of an Intelligent Remote Tidal Volume Variability Monitoring System in E-Health Applications. IEEE Journal of Biomedical and Health Informatics 19:5, 1532-1548. Online publication date: Sep 2015.
https://doi.org/10.1109/JBHI.2015.2445783
Belazzougui D, Cunial F, Kärkkäinen J, Mäkinen V (2013). Versatile Succinct Representations of the Bidirectional Burrows-Wheeler Transform. Algorithms – ESA 2013, 133-144. Online publication date: 2013.
https://doi.org/10.1007/978-3-642-40450-4_12
Krawczak M, Szkatuła G (2013). On Perturbation Measure of Clusters: Application. Artificial Intelligence and Soft Computing, 176-183. Online publication date: 2013.
https://doi.org/10.1007/978-3-642-38610-7_17
Cunial F (2012). Faster variance computation for patterns with gaps. Proceedings of the First Mediterranean conference on Design and Analysis of Algorithms, 134-147. Online publication date: 3 Dec 2012.
https://dl.acm.org/doi/10.1007/978-3-642-34862-4_10
Krawczak M, Szkatuła G (2012). A clustering algorithm based on distinguishability for nominal attributes. Proceedings of the 11th international conference on Artificial Intelligence and Soft Computing - Volume Part II, 120-127. Online publication date: 29 Apr 2012.
https://dl.acm.org/doi/10.1007/978-3-642-29350-4_14
Li Y, Lin J (2010). Approximate variable-length time series motif discovery using grammar inference. Proceedings of the Tenth International Workshop on Multimedia Data Mining, 1-9. Online publication date: 25 Jul 2010.
https://dl.acm.org/doi/10.1145/1814245.1814255
