Semantics and algorithms for data-dependent grammars

Published: 17 January 2010

Abstract

We present the design and theory of a new parsing engine, YAKKER, capable of satisfying the many needs of modern programmers and modern data processing applications. In particular, our new parsing engine handles (1) full scannerless context-free grammars with (2) regular expressions as right-hand sides for defining nonterminals. YAKKER also includes (3) facilities for binding variables to intermediate parse results and (4) using such bindings within arbitrary constraints to control parsing. These facilities allow the kind of data-dependent parsing commonly needed in systems applications, particularly those that operate over binary data. In addition, (5) nonterminals may be parameterized by arbitrary values, which gives the system good modularity and abstraction properties in the presence of data-dependent parsing. Finally, (6) legacy parsing libraries, such as sophisticated libraries for dates and times, may be directly incorporated into parser specifications. We illustrate the importance and utility of this rich collection of features by presenting its use on examples ranging from difficult programming language grammars to web server logs to binary data specification. We also show that our grammars have important compositionality properties and explain why such properties are important in modern applications such as automatic grammar induction.
In terms of technical contributions, we provide a traditional high-level semantics for our new grammar formalization and show how to compile grammars into nondeterministic automata. These automata are stack-based, somewhat like conventional push-down automata, but are also equipped with environments to track data-dependent parsing state. We prove the correctness of our translation of data-dependent grammars into these new automata and then show how to implement the automata efficiently using a variation of Earley's parsing algorithm.
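
To make the idea of data-dependent parsing concrete, the following is a minimal hand-written sketch in Python. It is not YAKKER's grammar notation, its automata, or its Earley-based algorithm. It assumes a hypothetical length-prefixed field of the form `{n}` followed by CRLF and then exactly `n` payload bytes. The function names `parse_number` and `parse_literal` and the `max_len` parameter are illustrative assumptions. The sketch shows the two facilities the abstract describes: a value is bound during parsing (`n`), and that binding is then used in a constraint and to control how the rest of the input is parsed.

```python
from typing import Optional, Tuple


def parse_number(data: bytes, pos: int) -> Optional[Tuple[int, int]]:
    """Parse one or more ASCII digits; return (value, next position) or None."""
    end = pos
    while end < len(data) and 48 <= data[end] <= 57:  # byte values of b"0".."9"
        end += 1
    if end == pos:
        return None  # no digits found
    return int(data[pos:end]), end


def parse_literal(
    data: bytes, pos: int = 0, max_len: int = 1024
) -> Optional[Tuple[bytes, int]]:
    """Parse '{' n '}' CRLF followed by exactly n payload bytes.

    Hypothetical format, used only for illustration. The value n is bound
    while parsing the header, then used as a constraint (n <= max_len) and
    to decide how many bytes the payload consumes.
    """
    if data[pos:pos + 1] != b"{":
        return None
    bound = parse_number(data, pos + 1)
    if bound is None:
        return None
    n, pos = bound                      # bind n to an intermediate parse result
    if n > max_len:                     # constraint over the bound value
        return None
    if data[pos:pos + 3] != b"}\r\n":
        return None
    pos += 3
    if pos + n > len(data):             # payload length depends on n
        return None
    return data[pos:pos + n], pos + n
```

For example, `parse_literal(b"{5}\r\nhello rest")` returns the five payload bytes together with the position just past them, while an input whose payload is shorter than the declared length is rejected. A context-free grammar alone cannot express this dependency between the header value and the payload length, which is the gap that bindings and constraints are meant to close.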

Published In

ACM SIGPLAN Notices, Volume 45, Issue 1
POPL '10
January 2010
500 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/1707801

POPL '10: Proceedings of the 37th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages
January 2010
520 pages
ISBN: 9781605584799
DOI: 10.1145/1706299
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. ambiguous grammars
  2. automata
  3. context-sensitive grammars
  4. data-dependent grammars
  5. Earley parsing
  6. EBNF
  7. L-attributed grammars
  8. regular expressions
  9. regular right-sides
  10. scannerless parsing
  11. semantic predicates
  12. transducers

Qualifiers

  • Research-article

Article Metrics

  • Downloads (last 12 months): 12
  • Downloads (last 6 weeks): 0
Reflects downloads up to 30 Aug 2024

Cited By

  • (2023) On Parsing Programming Languages with Turing-Complete Parser. Mathematics 11:7 (1594). DOI: 10.3390/math11071594. Online publication date: 25-Mar-2023
  • (2023) Interval Parsing Grammars for File Format Parsing. Proceedings of the ACM on Programming Languages 7:PLDI (1073-1095). DOI: 10.1145/3591264. Online publication date: 6-Jun-2023
  • (2022) A Model and Declarative Language for Specifying Binary Data Formats. Programming and Computer Software 48:7 (469-483). DOI: 10.1134/S0361768822070040. Online publication date: 29-Nov-2022
  • (2022) Context-sensitive parsing for programming languages. Journal of Computer Languages 73 (101172). DOI: 10.1016/j.cola.2022.101172. Online publication date: Dec-2022
  • (2016) Taming context-sensitive languages with principled stateful parsing. Proceedings of the 2016 ACM SIGPLAN International Conference on Software Language Engineering (15-27). DOI: 10.1145/2997364.2997370. Online publication date: 20-Oct-2016
  • (2014) The formalization and implementation of Adaptable Parsing Expression Grammars. Science of Computer Programming 96:P2 (191-210). DOI: 10.1016/j.scico.2014.02.020. Online publication date: 15-Dec-2014
  • (2013) Layout-Sensitive Generalized Parsing. Software Language Engineering (244-263). DOI: 10.1007/978-3-642-36089-3_14. Online publication date: 2013
  • (2012) Adaptable parsing expression grammars. Proceedings of the 16th Brazilian conference on Programming Languages (72-86). DOI: 10.1007/978-3-642-33182-4_7. Online publication date: 23-Sep-2012
  • (2011) Parsing reflective grammars. Proceedings of the Eleventh Workshop on Language Descriptions, Tools and Applications (1-7). DOI: 10.1145/1988783.1988793. Online publication date: 26-Mar-2011
  • (2023) Automated Ambiguity Detection in Layout-Sensitive Grammars. Proceedings of the ACM on Programming Languages 7:OOPSLA2 (1150-1175). DOI: 10.1145/3622838. Online publication date: 16-Oct-2023
