DOI: 10.1145/2213836.2213868

Online windowed subsequence matching over probabilistic sequences

Published: 20 May 2012

Abstract

Windowed subsequence matching over deterministic strings has been studied in previous work in the contexts of knowledge discovery, data mining, and molecular biology. However, in these applications, as well as in data stream monitoring, complex event processing, and time series processing, where streams can be mapped to strings, the strings are often noisy and probabilistic. We study this problem in the online setting, where efficiency is paramount. We first formulate the query semantics and propose an exact algorithm. We then propose a randomized approximation algorithm that is faster and, at the same time, provably accurate. Moreover, we devise a filtering algorithm that further improves efficiency, with an optimization technique that adapts to the content of the sequence stream. Finally, we propose algorithms for patterns with negations. To verify the algorithms, we conduct a systematic empirical study using three real datasets and several synthetic datasets.
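
To make the problem statement concrete, the following is a minimal sketch and not the paper's algorithms, which the abstract does not spell out. It assumes a simplified model in which each stream position carries an independent distribution over symbols, and it assumes the query asks for the probability that a pattern occurs as a subsequence inside a window. All function names are hypothetical. The first function computes that probability exactly with a small dynamic program; the second estimates it by sampling, in the spirit of a randomized approximation.

```python
import random


def window_match_probability(window, pattern):
    """Exact probability that `pattern` is a subsequence of the window.

    Assumption (not from the paper): `window` is a list of dicts mapping
    symbol -> probability, and positions are independent of each other.
    dist[j] is the probability that greedy matching has consumed exactly
    the first j pattern symbols; dist[m] is absorbing.
    """
    m = len(pattern)
    dist = [0.0] * (m + 1)
    dist[0] = 1.0
    for symbol_probs in window:
        nxt = [0.0] * (m + 1)
        nxt[m] = dist[m]  # already matched: stays matched
        for j in range(m):
            hit = symbol_probs.get(pattern[j], 0.0)
            nxt[j + 1] += dist[j] * hit          # next pattern symbol seen
            nxt[j] += dist[j] * (1.0 - hit)      # any other symbol: no progress
        dist = nxt
    return dist[m]


def sliding_match_probabilities(stream, pattern, w):
    """Match probability for every window of length w (recomputed naively)."""
    return [window_match_probability(stream[i:i + w], pattern)
            for i in range(len(stream) - w + 1)]


def sampled_match_probability(window, pattern, trials=2000, seed=0):
    """Monte Carlo estimate: sample possible worlds and match greedily."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        j = 0
        for symbol_probs in window:
            symbols = list(symbol_probs)
            weights = list(symbol_probs.values())
            c = rng.choices(symbols, weights=weights)[0]
            if j < len(pattern) and c == pattern[j]:
                j += 1
        hits += (j == len(pattern))
    return hits / trials
```

For example, with the window `[{"a": 0.5, "b": 0.5}, {"b": 1.0}]` and the pattern `"ab"`, the exact function returns 0.5, since the pattern matches only when the first position is `a`. The naive sliding version recomputes every window from scratch; an online algorithm would presumably reuse work across overlapping windows, which this sketch does not attempt.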

Published In

SIGMOD '12: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
May 2012
886 pages
ISBN:9781450312479
DOI:10.1145/2213836
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. online
  2. subsequence matching
  3. uncertain sequence

Qualifiers

  • Research-article

Conference

SIGMOD/PODS '12

Acceptance Rates

SIGMOD '12 Paper Acceptance Rate 48 of 289 submissions, 17%;
Overall Acceptance Rate 785 of 4,003 submissions, 20%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (last 12 months): 7
  • Downloads (last 6 weeks): 1
Reflects downloads up to 20 Feb 2025.

Cited By

  • (2017) Modeling and Formal Analysis of Probabilistic Complex Event Processing (CEP) Applications. Modelling Foundations and Applications, pp. 248-263. DOI: 10.1007/978-3-319-61482-3_15. Online publication date: 20-Jun-2017
  • (2016) Effective Privacy Preservation over Composite Events with Markov Correlations. 2016 13th Web Information Systems and Applications Conference (WISA), pp. 215-220. DOI: 10.1109/WISA.2016.50. Online publication date: Sep-2016
  • (2013) RCSI. Proceedings of the VLDB Endowment, 6:13, pp. 1534-1545. DOI: 10.14778/2536258.2536265. Online publication date: 1-Aug-2013
  • (2013) ε-Matching. Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, pp. 601-612. DOI: 10.1145/2463676.2463715. Online publication date: 22-Jun-2013
  • (2013) Top-K oracle. Proceedings of the 2013 IEEE International Conference on Data Engineering (ICDE 2013), pp. 146-157. DOI: 10.1109/ICDE.2013.6544821. Online publication date: 8-Apr-2013
  • (2013) Similarity Search on Uncertain Spatio-temporal Data. Proceedings of the 6th International Conference on Similarity Search and Applications - Volume 8199, pp. 43-49. DOI: 10.1007/978-3-642-41062-8_5. Online publication date: 2-Oct-2013