Identifying hierarchical structure in sequences: a linear-time algorithm

Published: 01 September 1997

Abstract

SEQUITUR is an algorithm that infers a hierarchical structure from a sequence of discrete symbols by replacing repeated phrases with a grammatical rule that generates the phrase, and continuing this process recursively. The result is a hierarchical representation of the original sequence, which offers insights into its lexical structure. The algorithm is driven by two constraints that reduce the size of the grammar, and produce structure as a by-product. SEQUITUR breaks new ground by operating incrementally. Moreover, the method's simple structure permits a proof that it operates in space and time that is linear in the size of the input. Our implementation can process 50,000 symbols per second and has been applied to an extensive range of real world sequences.
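The core idea in the abstract, replacing a repeated phrase with a rule and repeating the process on the result, can be illustrated with a small sketch. This is not SEQUITUR itself: it is a simplified offline variant that repeatedly replaces the most frequent repeated pair of adjacent symbols, so it is neither incremental nor linear-time, and it does not enforce the paper's two constraints. The names `infer_grammar`, `expand`, and the use of integers as rule identifiers are assumptions made for illustration only.

```python
from collections import Counter


def count_digrams(symbols):
    """Count non-overlapping occurrences of each adjacent symbol pair."""
    counts = Counter()
    last_pos = {}
    for i in range(len(symbols) - 1):
        pair = (symbols[i], symbols[i + 1])
        # Skip an occurrence that overlaps the previously counted one,
        # which can only happen for pairs such as ("a", "a") in "aaa".
        if last_pos.get(pair) == i - 1:
            continue
        counts[pair] += 1
        last_pos[pair] = i
    return counts


def replace_digram(symbols, pair, name):
    """Replace each left-to-right occurrence of `pair` with `name`."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(name)
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def infer_grammar(sequence):
    """Offline greedy digram replacement (a sketch, not SEQUITUR).

    Assumes terminal symbols are strings; rules are named 1, 2, 3, ...
    Returns the top-level body and a dict mapping rule id -> symbol pair.
    """
    body = list(sequence)
    rules = {}
    next_id = 1
    while True:
        counts = count_digrams(body)
        if not counts:
            break
        pair, count = counts.most_common(1)[0]
        if count < 2:
            break  # no phrase repeats any more
        rules[next_id] = pair
        body = replace_digram(body, pair, next_id)
        next_id += 1
    return body, rules


def expand(symbols, rules):
    """Recursively expand rule ids back into the original terminals."""
    out = []
    for s in symbols:
        if s in rules:
            out.extend(expand(rules[s], rules))
        else:
            out.append(s)
    return out


body, rules = infer_grammar("abcdbcabcd")
print(body, rules)
print("".join(expand(body, rules)))  # reproduces the input string
```

Because each rule may itself contain earlier rules, the resulting grammar is hierarchical, and expanding the top-level body reproduces the input exactly. The sketch rescans the whole sequence on every pass, which is the cost that an incremental, linear-time method avoids.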

Published In

Journal of Artificial Intelligence Research  Volume 7, Issue 1
July 1997
314 pages

Publisher

AI Access Foundation

El Segundo, CA, United States

Publication History

Published: 01 September 1997
Received: 01 May 1997
Published in JAIR Volume 7, Issue 1

Qualifiers

  • Article

Cited By

  • (2024) Musical phrase segmentation via grammatical induction. Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, 7726-7734. doi:10.24963/ijcai.2024/855. Online publication date: 3-Aug-2024
  • (2024) Improving Graph Compression for Efficient Resource-Constrained Graph Analytics. Proceedings of the VLDB Endowment, 17:9, 2212-2226. doi:10.14778/3665844.3665852. Online publication date: 1-May-2024
  • (2024) Grammar-Based Anomaly Detection of Microservice Systems Execution Traces. Companion of the 15th ACM/SPEC International Conference on Performance Engineering, 77-81. doi:10.1145/3629527.3651844. Online publication date: 7-May-2024
  • (2023) Runahead A*. Proceedings of the Thirty-Third International Conference on Automated Planning and Scheduling, 31-41. doi:10.1609/icaps.v33i1.27176. Online publication date: 8-Jul-2023
  • (2023) Homomorphic Compression: Making Text Processing on Compression Unlimited. Proceedings of the ACM on Management of Data, 1:4, 1-28. doi:10.1145/3626765. Online publication date: 12-Dec-2023
  • (2023) CompressGraph: Efficient Parallel Graph Analytics with Rule-Based Compression. Proceedings of the ACM on Management of Data, 1:1, 1-31. doi:10.1145/3588684. Online publication date: 30-May-2023
  • (2023) Graph-Based Mutations for Music Generation. Proceedings of the Companion Conference on Genetic and Evolutionary Computation, 1916-1919. doi:10.1145/3583133.3596318. Online publication date: 15-Jul-2023
  • (2022) Document Spanners - A Brief Overview of Concepts, Results, and Recent Developments. Proceedings of the 41st ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, 139-150. doi:10.1145/3517804.3526069. Online publication date: 12-Jun-2022
  • (2022) Query Evaluation over SLP-Represented Document Databases with Complex Document Editing. Proceedings of the 41st ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, 79-89. doi:10.1145/3517804.3524158. Online publication date: 12-Jun-2022
  • (2022) CompressDB: Enabling Efficient Compressed Data Direct Processing for Various Databases. Proceedings of the 2022 International Conference on Management of Data, 1655-1669. doi:10.1145/3514221.3526130. Online publication date: 10-Jun-2022
