Type inference for unique pattern matching

Published: 01 May 2006

Abstract

Regular expression patterns provide a natural, declarative way to express constraints on semistructured data and to extract relevant information from it. Indeed, they are a core feature of the programming language Perl, surface in various UNIX tools such as sed and awk, and have recently been proposed in the context of the XML programming language XDuce. Since regular expressions can be ambiguous in general, different disambiguation policies have been proposed to obtain a unique matching strategy. We formally define the matching semantics under both (1) the POSIX and (2) the first and longest match disambiguation strategies. We show that the generally accepted method of defining the longest match in terms of the first match and recursion does not conform to the natural notion of longest match. We continue by solving the type inference problem for both disambiguation strategies, which consists of calculating the set of all subparts of input values that a subexpression can match under the given policy.
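To make the role of a disambiguation policy concrete, the following Python sketch binds the three subexpressions of the ambiguous pattern `(a|ab)(c|bcd)(d*)` under two policies. Everything in it is an illustrative assumption and not the article's formal semantics or algorithm: the pattern, the helper names `first_match`, `longest_per_group` and `sample_bindings`, and the simplified rule "each group, left to right, takes the longest piece that still lets the whole input match". The last helper only collects bindings over a finite sample of inputs, which hints at what type inference asks for without solving it.

```python
import re

# Hypothetical example pattern, given as its three subexpressions.
PARTS = ["a|ab", "c|bcd", "d*"]


def first_match(parts, text):
    """Bindings picked by Python's backtracking engine (Perl-style first match)."""
    pattern = "".join(f"({p})" for p in parts)
    m = re.fullmatch(pattern, text)
    return m.groups() if m else None


def all_splits(parts, text):
    """Every way to cut text into consecutive pieces, piece i matching parts[i]."""
    if not parts:
        return [()] if text == "" else []
    results = []
    for cut in range(len(text) + 1):
        if re.fullmatch(f"(?:{parts[0]})", text[:cut]):
            for rest in all_splits(parts[1:], text[cut:]):
                results.append((text[:cut],) + rest)
    return results


def longest_per_group(parts, text):
    """Simplified longest-match rule (an assumption): earlier groups take the
    longest piece that still allows the whole input to match."""
    splits = all_splits(parts, text)
    if not splits:
        return None
    return max(splits, key=lambda s: tuple(len(piece) for piece in s))


def sample_bindings(parts, inputs, policy):
    """For each group, the set of pieces it binds over a finite input sample."""
    bound = [set() for _ in parts]
    for text in inputs:
        groups = policy(parts, text)
        if groups is not None:
            for i, piece in enumerate(groups):
                bound[i].add(piece)
    return bound


if __name__ == "__main__":
    print(first_match(PARTS, "abcd"))        # ('a', 'bcd', '')
    print(longest_per_group(PARTS, "abcd"))  # ('ab', 'c', 'd')
    sample = ["ac", "abc", "abcd"]
    print(sample_bindings(PARTS, sample, first_match))
    print(sample_bindings(PARTS, sample, longest_per_group))
```

On this sample, the second subexpression binds `bcd` under the first-match policy but only `c` under the simplified longest rule, so the set of values a subexpression can match depends on the policy in force.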


Published In

ACM Transactions on Programming Languages and Systems, Volume 28, Issue 3
May 2006
187 pages
ISSN: 0164-0925
EISSN: 1558-4593
DOI: 10.1145/1133651

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 May 2006
Published in TOPLAS Volume 28, Issue 3

Author Tags

  1. Pattern matching
  2. XML
  3. disambiguation policies
  4. programming languages

