Learn PyParsing Library
John W. Shipman
2013-03-05 12:52
Abstract
A quick reference guide for pyparsing, a recursive descent parser framework for the Python programming language. This publication is available in Web form [1] and also as a PDF document [2]. Please forward any comments to tcc-doc@nmt.edu.

[1] http://www.nmt.edu/tcc/help/pubs/pyparsing/
[2] http://www.nmt.edu/tcc/help/pubs/pyparsing/pyparsing.pdf

Table of Contents

1. pyparsing: A tool for extracting information from text (p. 3)
2. Structuring your application (p. 4)
3. A small, complete example (p. 5)
4. How to structure the returned ParseResults (p. 7)
    4.1. Use pp.Group() to divide and conquer (p. 8)
    4.2. Structuring with results names (p. 9)
5. Classes (p. 10)
    5.1. ParserElement: The basic parser building block (p. 11)
    5.2. And: Sequence (p. 16)
    5.3. CaselessKeyword: Case-insensitive keyword match (p. 16)
    5.4. CaselessLiteral: Case-insensitive string match (p. 16)
    5.5. CharsNotIn: Match characters not in a given set (p. 16)
    5.6. Combine: Fuse components together (p. 17)
    5.7. Dict: A scanner for tables (p. 18)
    5.8. Each: Require components in any order (p. 18)
    5.9. Empty: Match empty content (p. 18)
    5.10. FollowedBy: Adding lookahead constraints (p. 19)
    5.11. Forward: The parser placeholder (p. 19)
    5.12. GoToColumn: Advance to a specified position in the line (p. 22)
    5.13. Group: Group repeated items into a list (p. 23)
    5.14. Keyword: Match a literal string not adjacent to specified context (p. 23)
    5.15. LineEnd: Match end of line (p. 24)
    5.16. LineStart: Match start of line (p. 24)
    5.17. Literal: Match a specific string (p. 25)
    5.18. MatchFirst: Try multiple matches in a given order (p. 25)
    5.19. NoMatch: A parser that never matches (p. 25)
    5.20. NotAny: General lookahead condition (p. 26)
    5.21. OneOrMore: Repeat a pattern one or more times (p. 26)
    5.22. Optional: Match an optional pattern (p. 26)
    5.23. Or: Parse one of a set of alternatives (p. 27)
    5.24. ParseException (p. 27)
    5.25. ParseFatalException: Get me out of here! (p. 28)
    5.26. ParseResults: Result returned from a match (p. 28)
    5.27. QuotedString: Match a delimited string (p. 29)
    5.28. Regex: Match a regular expression (p. 30)
    5.29. SkipTo: Search ahead for a pattern (p. 31)
    5.30. StringEnd: Match the end of the text (p. 31)
    5.31. StringStart: Match the start of the text (p. 32)
    5.32. Suppress: Omit matched text from the result (p. 32)
    5.33. Upcase: Uppercase the result (p. 32)
    5.34. White: Match whitespace (p. 33)
    5.35. Word: Match characters from a specified set (p. 33)
    5.36. WordEnd: Match only at the end of a word (p. 34)
    5.37. WordStart: Match only at the start of a word (p. 34)
    5.38. ZeroOrMore: Match any number of repetitions including none (p. 35)
6. Functions (p. 35)
    6.1. col(): Convert a position to a column number (p. 35)
    6.2. countedArray: Parse N followed by N things (p. 36)
    6.3. delimitedList(): Create a parser for a delimited list (p. 36)
    6.4. dictOf(): Build a dictionary from key/value pairs (p. 37)
    6.5. downcaseTokens(): Lowercasing parse action (p. 38)
    6.6. getTokensEndLoc(): Find the end of the tokens (p. 38)
    6.7. line(): In what line does a location occur? (p. 38)
    6.8. lineno(): Convert a position to a line number (p. 39)
    6.9. matchOnlyAtCol(): Parse action to limit matches to a specific column (p. 39)
    6.10. matchPreviousExpr(): Match the text that the preceding expression matched (p. 39)
    6.11. matchPreviousLiteral(): Match the literal text that the preceding expression matched (p. 40)
    6.12. nestedExpr(): Parser for nested lists (p. 40)
    6.13. oneOf(): Check for multiple literals, longest first (p. 41)
    6.14. srange(): Specify ranges of characters (p. 41)
    6.15. removeQuotes(): Strip leading trailing quotes (p. 42)
    6.16. replaceWith(): Substitute a constant value for the matched text (p. 42)
    6.17. traceParseAction(): Decorate a parse action with trace output (p. 42)
    6.18. upcaseTokens(): Uppercasing parse action (p. 43)
7. Variables (p. 43)
    7.1. alphanums: The alphanumeric characters (p. 43)
    7.2. alphas: The letters (p. 43)
    7.3. alphas8bit: Supplement Unicode letters (p. 43)
    7.4. cStyleComment: Match a C-language comment (p. 43)
    7.5. commaSeparatedList: Parser for a comma-separated list (p. 43)
    7.6. cppStyleComment: Parser for C++ comments (p. 44)
    7.7. dblQuotedString: String enclosed in "..." (p. 44)
    7.8. dblSlashComment: Parser for a comment that starts with // (p. 44)
    7.9. empty: Match empty content (p. 44)
    7.10. hexnums: All hex digits (p. 44)
    7.11. javaStyleComment: Comments in Java syntax (p. 44)
    7.12. lineEnd: An instance of LineEnd (p. 45)
    7.13. lineStart: An instance of LineStart (p. 45)
    7.14. nums: The decimal digits (p. 45)
    7.15. printables: All the printable non-whitespace characters (p. 46)
    7.16. punc8bit: Some Unicode punctuation marks (p. 46)
    7.17. pythonStyleComment: Comments in the style of the Python language (p. 46)
    7.18. quotedString: Parser for a default quoted string (p. 46)
    7.19. restOfLine: Match the rest of the current line (p. 47)
    7.20. sglQuotedString: String enclosed in '...' (p. 47)
    7.21. stringEnd: Matches the end of the string (p. 47)
    7.22. unicodeString: Match a Python-style Unicode string (p. 48)
In terms of power, this module is more powerful than regular expressions [4], as embodied in the Python re module [5], but not as general as a full-blown compiler.

In order to find information within structured text, we must be able to describe that structure. The pyparsing module builds on the fundamental syntax description technology embodied in Backus-Naur Form [6], or BNF. Some familiarity with the various syntax notations based on BNF will be most helpful to you in using this package.

The way that the pyparsing module works is to match patterns in the input text using a recursive descent parser [7]: we write BNF-like syntax productions, and pyparsing provides a machine that matches the input text against those productions.

The pyparsing module works best when you can describe the exact syntactic structure of the text you are analyzing. A common application of pyparsing is the analysis of log files. Log file entries generally have a predictable structure, including such fields as dates and IP addresses. Possible applications of the module to natural language work are not addressed here.

Useful online references include:

• The pyparsing homepage [8].
• Complete online reference documentation [9] for each class, function, and variable.
• See the tutorial [10] at ONLamp.com.
• The package author's 2004 tutorial [11] is slightly dated but still useful.
• For a small example (about ten syntax productions) of an application that uses this package, see icalparse: A pyparsing parser for .calendar files [12].
• For a modest example, abaraw: A shorthand notation for bird records [13] describes a file format with about thirty syntax productions. The actual implementation appears in a separate document, abaraw internal maintenance specification [14], which is basically a pyparsing core with a bit of application logic that converts it to XML to be passed to later processing steps.

[3] http://www.python.org/
[4] http://en.wikipedia.org/wiki/Regular_expression
[5] http://docs.python.org/2/library/re.html
[6] http://en.wikipedia.org/wiki/Backus-Naur_Form
[7] http://en.wikipedia.org/wiki/Recursive_descent
[8] http://pyparsing.wikispaces.com/
[9] http://packages.python.org/pyparsing/
[10] http://onlamp.com/lpt/a/6435
[11] http://www.ptmcg.com/geo/python/howtousepyparsing.html
[12] http://www.nmt.edu/tcc/help/lang/python/examples/icalparse/
[13] http://www.nmt.edu/~shipman/aba/raw/doc/
[14] http://www.nmt.edu/~shipman/aba/raw/doc/ims/
Note
Not every feature is covered here; this document is an attempt to cover the features most people will use most of the time. See the reference documentation for all the grisly details.

In particular, the author feels strongly that pyparsing is not the right tool for parsing XML and HTML, so numerous related features are not covered here. For a much better XML/HTML tool, see Python XML processing with lxml [15].

[15] http://www.nmt.edu/tcc/help/pubs/pylxml/
[16] http://pyparsing.wikispaces.com/
Note
White space (non-printing) characters such as space and tab are normally skipped between tokens, although this behavior can be changed. This greatly simplifies many applications, because you don't have to clutter the syntax with a lot of patterns that specify where white space can be skipped.

7. Extract your application's information from the returned ParseResults instance. The exact structure of this instance depends on how you built the parser.

To see how this all fits together:

• Section 3, A small, complete example (p. 5).
• Section 4, How to structure the returned ParseResults (p. 7).
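The whitespace skipping described in the note above can be seen in a few lines. This is a minimal sketch, not part of the original guide: it is written in Python 3 syntax, the names entry, strict, and matches are made up for illustration, and it uses the .leaveWhitespace() method as one way of changing the default behavior.

```python
import pyparsing as pp

# A word of letters followed by a word of digits.
entry = pp.Word(pp.alphas) + pp.Word(pp.nums)

# Whitespace before and between tokens is skipped by default, so these
# three inputs all produce the same two tokens.
loose_results = [list(entry.parseString(text))
                 for text in ("abc 123", "abc\t  123", "  abc123")]

# leaveWhitespace() switches skipping off for one element: here the
# digits must follow the letters immediately.
strict = pp.Word(pp.alphas) + pp.Word(pp.nums).leaveWhitespace()

def matches(parser, text):
    """Return True if parser matches at the start of text."""
    try:
        parser.parseString(text)
    except pp.ParseException:
        return False
    return True
```

With these definitions, matches(strict, "abc123") succeeds while matches(strict, "abc 123") does not, because the second element no longer skips the intervening space.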
That last production can be read as: an identifier consists of one first followed by zero or more rest. Here is a script that implements that syntax and then tests it against a number of strings.
trivex

    #!/usr/bin/env python
    #================================================================
    # trivex:  Trivial example
    #----------------------------------------------------------------

    # - - - - -   I m p o r t s

    import sys

The next line imports the pyparsing module and renames it as pp.

trivex

    import pyparsing as pp

    # - - - - -   M a n i f e s t   c o n s t a n t s
In the next line, the pp.alphas variable is a string containing all lowercase and uppercase letters. The pp.Word() class produces a parser that matches a string of letters defined by its first argument; the exact=1 keyword argument tells that parser to accept exactly one character from that string. So first is a parser (that is, a ParserElement instance) that matches exactly one letter or an underbar.
trivex

    first = pp.Word(pp.alphas+"_", exact=1)

The pp.alphanums variable is a string containing all the letters and all the digits. So the rest pattern matches one or more letters, digits, or underbar characters.

trivex

    rest = pp.Word(pp.alphanums+"_")

The Python + operator is overloaded for instances of the pp.ParserElement class to mean sequence: that is, the identifier parser matches what the first parser matches, followed optionally by what the rest parser matches.
trivex

    identifier = first+pp.Optional(rest)

    testList = [    # List of test strings
        # Valid identifiers
        "a", "foo", "_", "Z04", "_bride_of_mothra",
        # Not valid
        "", "1", "$*", "a_#" ]

    # - - - - -   m a i n

    def main():
        '''Run each test string through the identifier parser.
        '''
        for s in testList:
            test(s)

    def test(s):
        '''See if s matches identifier.
        '''
        print "---Test for '{0}'".format(s)

When you call the .parseString() method on an instance of the pp.ParserElement class, either it returns a list of the matched elements or raises a pp.ParseException.
trivex

        try:
            result = identifier.parseString(s)
            print "  Matches: {0}".format(result)
        except pp.ParseException as x:
            print "  No match: {0}".format(str(x))

    # - - - - -   E p i l o g u e

    if __name__ == "__main__":
        main()

Here is the output from this script:

    ---Test for 'a'
      Matches: ['a']
    ---Test for 'foo'
      Matches: ['f', 'oo']
    ---Test for '_'
      Matches: ['_']
    ---Test for 'Z04'
      Matches: ['Z', '04']
    ---Test for '_bride_of_mothra'
      Matches: ['_', 'bride_of_mothra']
    ---Test for ''
      No match: Expected W:(abcd...) (at char 0), (line:1, col:1)
    ---Test for '1'
      No match: Expected W:(abcd...) (at char 0), (line:1, col:1)
    ---Test for '$*'
      No match: Expected W:(abcd...) (at char 0), (line:1, col:1)
    ---Test for 'a_#'
      Matches: ['a', '_']

The return value is an instance of the pp.ParseResults class; when printed, it appears as a list of the matched strings. You will note that for single-letter test strings, the resulting list has only a single element, while for multi-letter strings, the list has two elements: the first character (the part that matched first) followed by the remaining characters that matched the rest parser.

If we want the resulting list to have only one element, we can change one line to get this effect:

    identifier = pp.Combine(first+pp.Optional(rest))

The pp.Combine() class tells pyparsing to combine all the matching pieces in its argument list into a single result. Here is an example of two output lines from the revised script:

    ---Test for '_bride_of_mothra'
      Matches: ['_bride_of_mothra']
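In the output above, the test string 'a_#' is listed as not valid, yet it still reports a match: by default .parseString() is satisfied once a leading portion of the input matches. The sketch below, written in Python 3 syntax and not taken from the original script, shows one way to insist that the whole string be an identifier by passing parseAll=True. The helper name is_identifier is hypothetical.

```python
import pyparsing as pp

first = pp.Word(pp.alphas + "_", exact=1)
rest = pp.Word(pp.alphanums + "_")
identifier = pp.Combine(first + pp.Optional(rest))

def is_identifier(text):
    """Return True only if the whole of text is a single identifier."""
    try:
        identifier.parseString(text, parseAll=True)
    except pp.ParseException:
        return False
    return True

# Without parseAll, only the leading part of 'a_#' is consumed.
prefix = identifier.parseString("a_#")[0]
```

Here prefix holds the partial match 'a_', while is_identifier("a_#") reports failure because the trailing '#' is left over.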
    >>> result[0]
    '17'
    >>> list(result)
    ['17']
    >>> numberList = pp.OneOrMore(number)
    >>> print numberList.parseString('17 33 88')
    ['17', '33', '88']

As a dictionary. You can attach a results name s to a parser by calling its .setResultsName(s) method (see Section 5.1, ParserElement: The basic parser building block (p. 11)). Once you have done that, you can extract the matched string from the ParseResults instance r as r[s].

    >>> number = pp.Word(pp.nums).setResultsName('nVache')
    >>> result = number.parseString('17')
    >>> print result
    ['17']
    >>> result['nVache']
    '17'

Here are some general principles for structuring your parser's ParseResults instance.
http://en.wikipedia.org/wiki/Stepwise_refinement
For example, suppose your program is disassembling a sequence of words, and you want to treat the first word one way and the rest of the words another way. Here's our first attempt.

    >>> ungrouped = word + phrase
    >>> result = ungrouped.parseString('imaginary farcical aquatic ceremony')
    >>> print result
    ['imaginary', 'farcical', 'aquatic', 'ceremony']

That result doesn't really match our concept that the parser is a sequence of two things: a single word, followed by a sequence of words. By applying pp.Group() like this, we get a parser that will return a sequence of two items that match our concept.

    >>> grouped = word + pp.Group(phrase)
    >>> result = grouped.parseString('imaginary farcical aquatic ceremony')
    >>> print result
    ['imaginary', ['farcical', 'aquatic', 'ceremony']]
    >>> print result[1]
    ['farcical', 'aquatic', 'ceremony']
    >>> type(result[1])
    <class 'pyparsing.ParseResults'>
    >>> result[1][0]
    'farcical'
    >>> type(result[1][0])
    <type 'str'>

1. The grouped parser has two components: a word and a pp.Group. Hence, the result returned acts like a two-element list.
2. The first element is an actual string, 'imaginary'.
3. The second part is another pp.ParseResults instance that acts like a list of strings.

So for larger grammars, the pp.ParseResults instance, which the top-level parser returns when it matches, will typically be a many-layered structure containing this kind of mixture of ordinary strings and other instances of pp.ParseResults. The next section will give you some suggestions on managing the structure of these beasts.
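The session above uses the names word and phrase without showing their definitions. The following self-contained sketch, in Python 3 syntax, assumes that word matches a run of letters and phrase matches one or more such words; those two definitions are guesses made for illustration, not the author's originals.

```python
import pyparsing as pp

# Assumed definitions: the excerpt above does not show these.
word = pp.Word(pp.alphas)
phrase = pp.OneOrMore(word)

text = 'imaginary farcical aquatic ceremony'

# Without Group, every token lands in one flat list.
flat_list = (word + phrase).parseString(text).asList()

# With Group, the phrase becomes a nested sub-list.
nested_list = (word + pp.Group(phrase)).parseString(text).asList()
```

The .asList() method converts a ParseResults instance into ordinary nested Python lists, which makes the difference in structure easy to inspect.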
Here's an example showing what happens when you mix positional and named access at the same level: in bull-riding, the total score is a combination of the rider's score and the bull's score.

    >>> rider = pp.Word(pp.alphas).setResultsName('Rider')
    >>> bull = pp.Word(pp.alphas).setResultsName('Bull')
    >>> score = pp.Word(pp.nums+'.')
    >>> line = rider + score + bull + score
    >>> result = line.parseString('Mauney 46.5 Asteroid 46')
    >>> print result
    ['Mauney', '46.5', 'Asteroid', '46']

In the four-element list shown above, you can access the first and third elements by name, but the second and fourth would be accessible only by position.

A more sensible way to structure this parser would be to write a parser for the combination of a name and a score, and then combine two of those for the overall parser.

    >>> name = pp.Word(pp.alphas).setResultsName('name')
    >>> score = pp.Word(pp.nums+'.').setResultsName('score')
    >>> nameScore = pp.Group(name + score)
    >>> line = nameScore.setResultsName('Rider') + nameScore.setResultsName('Bull')
    >>> result = line.parseString('Mauney 46.5 Asteroid 46')
    >>> result['Rider']['name']
    'Mauney'
    >>> result['Bull']['score']
    '46'

Don't use a results name for a repeated element. If you do, only the last one will be accessible by results name in the ParseResults.

    >>> catName = pp.Word(pp.alphas).setResultsName('catName')
    >>> catList = pp.OneOrMore(catName)
    >>> result = catList.parseString('Sandy Mocha Bits')
    >>> result['catName']
    'Bits'
    >>> list(result)
    ['Sandy', 'Mocha', 'Bits']

A better approach is to wrap the entire repeated pattern in a pp.Group() and then apply the results name to that.

    >>> owner = pp.Word(pp.alphas).setResultsName('owner')
    >>> catList = pp.Group(pp.OneOrMore(catName)).setResultsName('cats')
    >>> line = owner + catList
    >>> result = line.parseString('Carol Sandy Mocha Bits')
    >>> result['owner']
    'Carol'
    >>> print result['cats']
    ['Sandy', 'Mocha', 'Bits']
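Since the bull-riding example is about combining two scores, here is a hedged sketch of how the named, grouped structure might be used to compute the total. It is written in Python 3 syntax; the helper to_float and the names entry and total are illustrative additions, and the parse action that converts each score to a float is an assumption about how an application might want the data.

```python
import pyparsing as pp

def to_float(toks):
    """Parse action: replace the matched text with its float value."""
    return [float(toks[0])]

name = pp.Word(pp.alphas).setResultsName('name')
score = (pp.Word(pp.nums + '.')
         .setParseAction(to_float)
         .setResultsName('score'))

# One group per competitor, each carrying its own 'name' and 'score'.
entry = pp.Group(name + score)
line = entry.setResultsName('Rider') + entry.setResultsName('Bull')

result = line.parseString('Mauney 46.5 Asteroid 46')
total = result['Rider']['score'] + result['Bull']['score']
```

Because each competitor has its own group, the same inner names ('name' and 'score') can be reused without one overwriting the other.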
5. Classes
Here are the classes defined in the pyparsing module.
Note
To save space, in subsequent examples we will omit all the Traceback lines except the last.

    >>> print nwn.parseString('23xy 47')
    pyparsing.ParseException: Expected W:(0123...) (at char 4), (line:1, col:5)

You will note that even though the num parser does not skip whitespace, whitespace is still disallowed for the string ' 47' because the wn parser disabled automatic whitespace skipping.

p.parseFile(f, parseAll=False)
    Try to match the contents of a file against parser p. The argument f may be either the name of a file or a file-like object. If the entire contents of the file do not match p, it is not considered an error unless you pass the argument parseAll=True.

p.parseString(s, parseAll=False)
    Try to match string s against parser p. If there is a match, it returns an instance of Section 5.26, ParseResults: Result returned from a match (p. 28). If there is no match, it will raise a pp.ParseException.

    By default, if the entirety of s does not match p, it is not considered an error. If you want to ensure that all of s matched p, pass the keyword argument parseAll=True.

p.scanString(s)
    Search through string s to find regions that match p. This method is an iterator that generates a sequence of tuples (r, start, end), where r is a pp.ParseResults instance that represents the matched part, and start and end are the beginning and ending offsets within s that bracket the position of the matched text.

        >>> name = pp.Word(pp.alphas)
        >>> text = "**** Farcical aquatic ceremony"
        >>> for result, start, end in name.scanString(text):
        ...     print "Found {0} at [{1}:{2}]".format(result, start, end)
        ...
        Found ['Farcical'] at [5:13]
        Found ['aquatic'] at [14:21]
        Found ['ceremony'] at [22:30]

p.setBreak()
    When this parser is about to be used, call up the Python debugger pdb.

p.setFailAction(f)
    This method modifies p so that it will call function f if it fails to parse. The method returns p.

    Here is the calling sequence for a fail action:

        f(s, loc, expr, err)

    s
        The input string.
    loc
        The location in the input where the parse failed, as an offset counting from 0.
    expr
        The name of the parser that failed.
    err
        The exception instance that the parser raised.

    Here is an example.

        >>> def oops(s, loc, expr, err):
        ...     print ("s={0!r} loc={1!r} expr={2!r}\nerr={3!r}".format(
        ...         s, loc, expr, err))
        ...
        >>> fail = pp.NoMatch().setName('fail-parser').setFailAction(oops)
        >>> r = fail.parseString("None shall pass!")
        s='None shall pass!' loc=0 expr=fail-parser
        err=Expected fail-parser (at char 0), (line:1, col:1)
        pyparsing.ParseException: Expected fail-parser (at char 0), (line:1, col:1)

p.setName(name)
    Attaches a name to this parser for debugging purposes. The argument is a string. The method returns p.

        >>> print pp.Word(pp.nums)
        W:(0123...)
        >>> count = pp.Word(pp.nums).setName('count-parser')
        >>> print count
        count-parser
        >>> count.parseString('FAIL')
        pyparsing.ParseException: Expected count-parser (at char 0), (line:1, col:1)

    In the above example, if you convert a parser to a string, you get a generic description of it: the string W:(0123...) tells you it is a Word parser and shows you the first few characters in the set. Once you have attached a name to it, the string form of the parser is that name. Note that when the parse fails, the error message identifies what it expected by naming the failed parser.

p.setParseAction(f1, f2, ...)
    This method attaches one or more parse actions to p and returns p itself, not a copy. When the parser matches the input, it then calls each function fi in the order specified.

    The calling sequence for a parse action can be any of these four prototypes:

        f()
        f(toks)
        f(loc, toks)
        f(s, loc, toks)

    These are the arguments your function will receive, depending on how many arguments it accepts:

    s
        The string being parsed. If your string contains tab characters, see the reference documentation [18] for notes about tab expansion and its effect on column positions.
    loc
        The location of the matching substring as an offset (index, counting from 0).
    toks
        A pp.ParseResults instance containing the results of the match.

    A parse action can modify the result (the toks argument) by returning the modified list. If it returns None, the result is not changed. Here is an example parser with two parse actions.

        >>> name = pp.Word(pp.alphas)
        >>> def a1():
        ...     print "In a1"
        ...
        >>> def a2(s, loc, toks):
        ...     print "In a2: s={0!r} loc={1!r} toks={2!r}".format(
        ...         s, loc, toks)
        ...     return ['CENSORED']
        ...
        >>> newName = name.setParseAction(a1, a2)
        >>> r = newName.parseString('Gambolputty')
        In a1
        In a2: s='Gambolputty' loc=0 toks=(['Gambolputty'], {})
        >>> print r
        ['CENSORED']

p.setResultsName(name)
    For parsers that deposit the matched text into the ParseResults instance returned by .parseString(), you can use this method to attach a name to that matched text. Once you do this, you can retrieve the matched text from the ParseResults instance by using that instance as if it were a Python dict.

        >>> count = pp.Word(pp.nums)
        >>> beanCounter = count.setResultsName('beanCount')
        >>> r = beanCounter.parseString('7388')
        >>> r.keys()
        ['beanCount']
        >>> r['beanCount']
        '7388'

    The result of this method is a copy of p. Hence, if you have defined a useful parser, you can create several instances, each with a different results name. Continuing the above example, if we then use the count parser, we find that it does not have the results name that is attached to its copy beanCounter.

        >>> r2 = count.parseString('8873')
        >>> r2.keys()
        []
        >>> print r2
        ['8873']

p.setWhitespaceChars(s)
    For parser p, change its definition of whitespace to the characters in string s.

[18] http://packages.python.org/pyparsing/
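Returning to parse actions as described under .setParseAction() above: a common use is converting matched text into a more useful Python type. The following is a minimal sketch in Python 3 syntax, using the one-argument f(toks) prototype. The names to_int, integer, numbers, values, and total are illustrative and do not come from the original guide.

```python
import pyparsing as pp

def to_int(toks):
    """Parse action, one-argument form: receives only the tokens.

    Returning a list replaces the matched text with that list.
    """
    return [int(toks[0])]

integer = pp.Word(pp.nums).setParseAction(to_int)
numbers = pp.OneOrMore(integer)

# Each matched number arrives in the result already converted to int.
values = numbers.parseString('17 33 88').asList()
total = sum(values)
```

Doing the conversion inside the parser keeps the later application logic free of string handling: it can work with values directly as a list of integers.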
p.suppress()
    This method returns a copy of p modified so that it does not add the matched text to the ParseResults. This is useful for omitting punctuation. See also Section 5.32, Suppress: Omit matched text from the result (p. 32).

        >>> name = pp.Word(pp.alphas)
        >>> lb = pp.Literal('[')
        >>> rb = pp.Literal(']')
        >>> pat1 = lb + name + rb
        >>> print pat1.parseString('[hosepipe]')
        ['[', 'hosepipe', ']']
        >>> pat2 = lb.suppress() + name + rb.suppress()
        >>> print pat2.parseString('[hosepipe]')
        ['hosepipe']

Additionally, these ordinary Python operators are overloaded to work with ParserElement instances.

p1 + p2
    Equivalent to pp.And([p1, p2]).

p * n
    For a parser p and an integer n, the result is a parser that matches n repetitions of p. You can give the operands in either order: for example, 3 * p is the same as p * 3.

        >>> threeWords = pp.Word(pp.alphas) * 3
        >>> text = "Lady of the Lake"
        >>> print threeWords.parseString(text)
        ['Lady', 'of', 'the']
        >>> print threeWords.parseString(text, parseAll=True)
        pyparsing.ParseException: Expected end of text (at char 12), (line:1, col:13)

p1 | p2
    Equivalent to pp.MatchFirst([p1, p2]).

p1 ^ p2
    Equivalent to pp.Or([p1, p2]).

p1 & p2
    Equivalent to pp.Each([p1, p2]).

~ p
    Equivalent to pp.NotAny(p).

Class pp.ParserElement also supports one static method:

pp.ParserElement.setDefaultWhitespaceChars(s)
    This static method changes the definition of whitespace to be the characters in string s. Calling this method has this effect on all subsequent instantiations of any pp.ParserElement subclass.

        >>> blanks = ' \t-=*#^'
        >>> pp.ParserElement.setDefaultWhitespaceChars(blanks)
        >>> text = ' \t-=*#^silly ##*=---\t walks--'
        >>> nameList = pp.OneOrMore(pp.Word(pp.alphas))
        >>> print nameList.parseString(text)
        ['silly', 'walks']
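The difference between the | and ^ operators listed above is easiest to see when one alternative is a prefix of another. This is a small sketch in Python 3 syntax, not part of the original guide; the names short, longer, first_wins, longest_wins, and any_order are made up for the illustration.

```python
import pyparsing as pp

short = pp.Literal('a')
longer = pp.Literal('ab')

# MatchFirst (|): alternatives are tried left to right, and the first
# one that matches is used, even if a later one would match more text.
first_wins = (short | longer).parseString('ab').asList()

# Or (^): every alternative is tried, and the longest match is used.
longest_wins = (short ^ longer).parseString('ab').asList()

# Each (&): both components are required, in either order.
any_order = (pp.Literal('x') & pp.Literal('y')).parseString('y x').asList()
```

When ordering alternatives by hand for |, the usual remedy for this prefix problem is to list the longer alternative first.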
    >>> nonDigits = pp.CharsNotIn(pp.nums)
    >>> print nonDigits.parseString('zoot86')
    ['zoot']
    >>> fourNonDigits = pp.CharsNotIn(pp.nums, exact=4)
    >>> print fourNonDigits.parseString('a$_/#')
    ['a$_/']
    #!/usr/bin/env python
    #================================================================
    # dicter:  Example of pyparsing.Dict pattern
    #----------------------------------------------------------------
    import pyparsing as pp

    data = "cat Sandy Mocha Java|bird finch verdin siskin"
    rowPat = pp.OneOrMore(pp.Word(pp.alphas))
    bigPat = pp.Dict(pp.delimitedList(pp.Group(rowPat), "|"))
    result = bigPat.parseString(data)
    for rowKey in result.keys():
        print "result['{0}']={1}".format(rowKey, result[rowKey])

Here is the output from that script:

    result['bird']=['finch', 'verdin', 'siskin']
    result['cat']=['Sandy', 'Mocha', 'Java']
The constructor returns a parser that always matches, and consumes no input. It can be used as a placeholder where a parser is required but you don't want it to consume any of the input.

    >>> e=pp.Empty()
    >>> print e.parseString('')
    []
    >>> print e.parseString('shrubber')
    []
    >>> print e.parseString('shrubber', parseAll=True)
    pyparsing.ParseException: Expected end of text (at char 0), (line:1, col:1)
Let's look at a complete script that demonstrates use of the Forward pattern. For this example, we will reach back to ancient computing history for a feature of the early versions of the FORTRAN programming language: Hollerith string constants.

A Hollerith constant is a way to represent a string of characters. It consists of a count, followed by the letter 'H', followed by the number of characters specified by the count. Here are two examples, with their Python equivalents:

    1HX              'X'
    10H0123456789    '0123456789'

We'll write our pattern so that the 'H' can be either uppercase or lowercase. Here's the complete script. We start with the usual preliminaries: imports, some test strings, the main, and a function to run each test.
hollerith

    #!/usr/bin/env python
    #================================================================
    # hollerith:  Demonstrate Forward class
    #----------------------------------------------------------------
    import sys
    import pyparsing as pp

    # - - - - -   M a n i f e s t   c o n s t a n t s

    def test(pat, text):
        '''Test to see if text matches parser (pat).
        '''
        print "--- Test for '{0}'".format(text)
        try:
            result = pat.parseString(text)
            print "  Matches: '{0}'".format(result[0])
        except pp.ParseException as x:
            print "  No match: '{0}'".format(str(x))

Next we'll define the function hollerith() that returns a parser for a Hollerith string.
hollerith
# - - -
h o l l e r i t h
def hollerith():
20
'''Returns a parser for a FORTRAN Hollerith character constant. ''' First we define a parser intExpr that matches the character count. It has a parse action that converts the number from character form to a Python int. The lambda expression defines a nameless function that takes a list of tokens and converts the first token to an int.
hollerith
    #--
    # Define a recognizer for the character count.
    #--
    intExpr = pp.Word(pp.nums).setParseAction(lambda t: int(t[0]))

Next we create an empty Forward parser as a placeholder for the logic that matches the 'H' and the following characters.
hollerith
    #--
    # Allocate a placeholder for the rest of the parsing logic.
    #--
    stringExpr = pp.Forward()

Next we define a closure that will be added to intExpr as a second parse action. Notice that we are defining a function within a function. The countedParseAction function will retain access to an external name (stringExpr, which is defined in the outer function's scope) after the function is defined.
hollerith
    #--
    # Define a closure that transfers the character count from
    # the intExpr to the stringExpr.
    #--
    def countedParseAction(toks):
        '''Closure to define the content of stringExpr.
        '''

The argument is the list of tokens that was recognized by intExpr; because of its parse action, this list contains the count as a single int.
hollerith
        n = toks[0]

The contents parser will match exactly n characters. We'll use the CharsNotIn class (see Section 5.5, CharsNotIn: Match characters not in a given set (p. 16)) to do this match, specifying the excluded characters as an empty string so that any character will be included. Incidentally, this does not work for n==0, but '0H' is not a valid Hollerith literal. A more robust implementation would raise a pp.ParseException in this case.
hollerith
        #--
        # Create a parser for any (n) characters.
        #--
        contents = pp.CharsNotIn('', exact=n)

This next line inserts the final pattern into the placeholder parser: an 'H' in either case followed by the contents pattern. The '<<' operator is overloaded in the Forward class to perform this operation: for any Forward recognizer F and any parser p, the expression F << p modifies F so that it matches pattern p.
19
http://en.wikipedia.org/wiki/Closure_(computer_science)
hollerith
        #--
        # Store a recognizer for 'H' + contents into stringExpr.
        #--
        stringExpr << (pp.Suppress(pp.CaselessLiteral('H')) + contents)

Parse actions may elect to modify the recognized tokens, but we don't need to do that, so we return None to signify that the tokens remain unchanged.
hollerith
        return None

That is the end of the countedParseAction closure. We are now back in the scope of hollerith(). The next line adds the closure as the second parse action for the intExpr parser.
hollerith
    #--
    # Add the above closure as a parse action for intExpr.
    #--
    intExpr.addParseAction(countedParseAction)

Now we are ready to return the completed hollerith parser: intExpr recognizes the count and stringExpr recognizes the 'H' and string contents. When we return it, stringExpr is still just an empty Forward, but it will be filled in before it is asked to parse.
hollerith
if __name__ == "__main__":
    main()

Here is the output of the script. Note that the last test fails because the '999H' is not followed by 999 more characters.

--- Test for '1HX'
 Matches: 'X'
--- Test for '2h$#'
 Matches: '$#'
--- Test for '10H0123456789'
 Matches: '0123456789'
--- Test for '999Hoops'
 No match: 'Expected !W:() (at char 8), (line:1, col:9)'
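For readers who want to try the technique in one piece, the fragments above can be assembled roughly as follows. This is a sketch in Python 3 syntax, assuming a pyparsing release that still accepts the camelCase method names. The final return expression and the helper parse_hollerith are assumptions made for this illustration; the script's own versions of those parts are not shown above.

```python
import pyparsing as pp

def hollerith():
    '''Return a parser for a FORTRAN Hollerith constant such as 3Habc.'''
    # The count; its first parse action converts the digits to an int.
    intExpr = pp.Word(pp.nums).setParseAction(lambda t: int(t[0]))
    # Placeholder that is filled in once the count is known.
    stringExpr = pp.Forward()

    def countedParseAction(toks):
        '''Closure: build the rest of the pattern from the count.'''
        n = toks[0]
        # Exactly n characters of any kind (nothing is excluded).
        contents = pp.CharsNotIn('', exact=n)
        # Plain "<<" is used here; "<<=" would rebind the name locally.
        stringExpr << (pp.Suppress(pp.CaselessLiteral('H')) + contents)
        return None

    intExpr.addParseAction(countedParseAction)
    # Assumption: the count is suppressed so only the text is returned.
    return pp.Suppress(intExpr) + stringExpr

def parse_hollerith(text):
    '''Hypothetical helper: return the string, or None if no match.'''
    try:
        return hollerith().parseString(text)[0]
    except pp.ParseException:
        return None

for sample in ('1HX', '2h$#', '10H0123456789', '999Hoops'):
    print(repr(sample), '->', repr(parse_hollerith(sample)))
```

Building a fresh parser on every call keeps each parse independent, at the cost of a little setup time per call.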
>>> pat = pp.Word(pp.alphas, max=4)+pp.GoToColumn(5)+pp.Word(pp.nums)
>>> print pat.parseString('ab@@123')
['ab', '@@', '123']
>>> print pat.parseString('wxyz987')
['wxyz', '', '987']
>>> print pat.parseString('ab 123')
['ab', '', '123']

In this example, pat is a parser with three parts. The first part matches one to four letters. The second part skips to column 5. The third part matches one or more digits.

In the first test, the GoToColumn parser returns '@@' because that was the text between the letters and column 5. In the second test, that parser returns the empty string because there are no characters between 'wxyz' and '987'. In the third example, the part matched by the GoToColumn is empty because white space is ignored between tokens.
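The first two tests above can also be run as a script. The following is a sketch in Python 3 syntax, assuming a pyparsing release that still provides GoToColumn under that name; the variable names with_filler and no_gap are made up for this illustration.

```python
import pyparsing as pp

# One to four letters, then skip to column 5, then digits.
pat = pp.Word(pp.alphas, max=4) + pp.GoToColumn(5) + pp.Word(pp.nums)

# Text between the letters and column 5 is returned as a token.
with_filler = pat.parseString('ab@@123').asList()
print(with_filler)

# The letters already end at column 5, so nothing lies in between.
no_gap = pat.parseString('wxyz987').asList()
print(no_gap)
```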
>>> print key.parseString('Sirrah')
pyparsing.ParseException: Expected "Sir" (at char 0), (line:1, col:1)
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/site-packages/pyparsing.py", line 1032, in parseString
    raise exc
pyparsing.ParseException: Expected start of line (at char 4), (line:1, col:5)

For more examples, see Section 7.13, lineStart: An instance of LineStart (p. 45).
>>> fail = pp.Literal('Go') + pp.NoMatch()
>>> fail.parseString('Go')
pyparsing.ParseException: Unmatchable token (at char 2), (line:1, col:3)
>>> print chapterNo.parseString('23')
['23']
>>> print chapterNo.parseString('23c')
['23', 'c']
>>> chapterX = number + pp.Optional(letter, default='*')
>>> print chapterX.parseString('23')
['23', '*']
5.24. ParseException
This is the exception thrown when the parse fails. These attributes are available on an instance:

.lineno
    The line number where the parse failed, counting from 1.
.col
    The column number where the parse failed, counting from 1.
.line
    The text of the line in which the parse failed.

>>> fail = pp.NoMatch()
>>> try:
...     print fail.parseString('Is that an ocarina?')
... except pp.ParseException as x:
...     print "Line {e.lineno}, column {e.col}:\n'{e.line}'".format(e=x)
...
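To see the three attributes on input that spans more than one line, here is a small sketch in Python 3 syntax, assuming a pyparsing release that still accepts the camelCase method names. The parser two_words and the sample text are invented for this illustration.

```python
import pyparsing as pp

# Two alphabetic words are required; the second line holds digits instead.
two_words = pp.Word(pp.alphas) + pp.Word(pp.alphas)

try:
    two_words.parseString('abc\n123')
    failure = None
except pp.ParseException as exc:
    # Collect the position attributes described above.
    failure = (exc.lineno, exc.col, exc.line)

print(failure)
```

The failure is reported where the second word was expected, which is on the second line because white space (including the newline) is skipped between tokens.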
R.asDict()
    This method returns the named items of R as a normal Python dict. Continuing the example above:

    >>> r.asDict()
    {'last': 'Piranha', 'first': 'Doug'}

R.asList()
    This method returns R as a normal Python list.

    >>> r.asList()
    ['Doug', 'Piranha']

R.copy()
    Returns a copy of R.

R.get(key, defaultValue=None)
    Works like the .get() method on the standard Python dict type: if the ParseResults instance has no component named key, the defaultValue is returned.

    >>> r.get('first', 'Unknown')
    'Doug'
    >>> r.get('middle', 'Unknown')
    'Unknown'

R.insert(where, what)
    Like the .insert() method of the Python list type, this method will insert the value of the string what before position where in the list of strings.

    >>> r.insert(1, 'Bubbles')
    >>> print r
    ['Doug', 'Bubbles', 'Piranha']

R.items()
    This method works like the .items() method of Python's dict type, returning a list of tuples (key, value).

    >>> r.items()
    [('last', 'Piranha'), ('first', 'Doug')]

R.keys()
    Returns a list of the keys of named results. Continuing the Piranha example:

    >>> r.keys()
    ['last', 'first']
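The methods above can be exercised together in a short script. This is a sketch in Python 3 syntax, assuming a pyparsing release that still accepts the camelCase method names. The parsers first and last are assumptions that reproduce the kind of result used in the Piranha example; the keys are sorted because their order is not something to rely on.

```python
import pyparsing as pp

# Assumed setup: two words, each given a results name.
first = pp.Word(pp.alphas)('first')
last = pp.Word(pp.alphas)('last')
r = (first + last).parseString('Doug Piranha')

as_dict = r.asDict()                  # named results as a plain dict
as_list = r.asList()                  # all tokens as a plain list
middle = r.get('middle', 'Unknown')   # default for a missing name
key_names = sorted(r.keys())          # names of the named results

print(as_dict, as_list, middle, key_names)
```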
escChar
    Strings may not normally include the closing quote character inside the string. To allow closing quote characters inside the string, pass an argument escChar=c, where c is an escape character that signifies that the following character is to be treated as text and not as a delimiter.
multiline
    By default, a string may not include newline characters. If you want to allow the parser to match quoted strings that extend over multiple lines, pass an argument multiline=True.

>>> qs = pp.QuotedString('"')
>>> print qs.parseString('"semprini"')
['semprini']
>>> cc = pp.QuotedString('/*', endQuoteChar='*/')
>>> print cc.parseString("/* Attila the Bun */")
[' Attila the Bun ']
>>> pat = pp.QuotedString('"', escChar='\\')
>>> print pat.parseString(r'"abc\"def"')
['abc"def']
>>> text = """'Ken
... Obvious'"""
>>> print text
'Ken
Obvious'
>>> pat = pp.QuotedString("'")
>>> print pat.parseString(text)
pyparsing.ParseException: Expected quoted string, starting with ' ending with ' (at char 0), (line:1, col:1)
>>> pat = pp.QuotedString("'", multiline=True)
>>> print pat.parseString(text)
['Ken\nObvious']
>>> pat = pp.QuotedString('|')
>>> print pat.parseString('|clever sheep|')
['clever sheep']
>>> pat = pp.QuotedString('|', unquoteResults=False)
>>> print pat.parseString('|clever sheep|')
['|clever sheep|']
http://docs.python.org/2/library/re.html
>>> pat2 = pp.Regex(re.compile(r1))
>>> print pat2.parseString('dcbaee', parseAll=True)
['dcbaee']
>>> vowels = r'[aeiou]+'
>>> pat1 = pp.Regex(vowels)
>>> print pat1.parseString('eauoouEAUUO')
['eauoou']
>>> pat2 = pp.Regex(vowels, flags=re.IGNORECASE)
>>> print pat2.parseString('eauoouEAUUO')
['eauoouEAUUO']
excludeChars
    If supplied, this argument specifies characters not to be considered to match, even if those characters are otherwise considered to match.

>>> name = pp.Word('abcdef')
>>> print name.parseString('fadedglory')
['faded']
>>> pyName = pp.Word(pp.alphas+'_', bodyChars=pp.alphanums+'_')
>>> print pyName.parseString('_crunchyFrog13')
['_crunchyFrog13']
>>> name4 = pp.Word(pp.alphas, exact=4)
>>> print name4.parseString('Whizzo')
['Whiz']
>>> noXY = pp.Word(pp.alphas, excludeChars='xy')
>>> print noXY.parseString('Sussex')
['Susse']
['1234', 'abcd']
>>> print pat.parseString('123zabcd')
pyparsing.ParseException: Not at the start of a word (at char 4), (line:1, col:5)
>>> firstName = pp.WordStart() + pp.Word(pp.alphas)
>>> print firstName.parseString('Lambert')
['Lambert']
>>> badNews = pp.Word(pp.alphas) + firstName
>>> print badNews.parseString('MrLambert')
pyparsing.ParseException: Not at the start of a word (at char 9), (line:1, col:10)
>>> print badNews.parseString('Mr Lambert')
['Mr', 'Lambert']
6. Functions
These functions are available in the pyparsing module.
['Arthur, Bedevere, Launcelot, Galahad, Robin']
>>> badExample = pp.delimitedList(name, combine=True)
>>> print badExample.parseString(text)
['Arthur']

The last example only matches one name because the Combine class suppresses the skipping of whitespace within its internal pieces.
>>> noteNames = notePat.parseString(text, parseAll=True)
pyparsing.ParseException: Expected end of text (at char 43), (line:1, col:44)

It's easy enough to fix the definition of the text, but instead let's fix the parser so that it defines value as ending either with a semicolon or with the end of the string:

>>> value = pp.Word(pp.alphas) + (pp.StringEnd() | pp.Suppress(';'))
>>> notePat = pp.dictOf(key, value)
>>> noteNames = notePat.parseString(text)
>>> noteNames.keys()
['1', '3', '2', '5', '4', '7', '6']
>>> noteNames['7']
'ti'
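The same fix can be tried as a self-contained script. This is a sketch in Python 3 syntax, assuming a pyparsing release that still provides dictOf under that name. The definition of key and the shortened sample text are assumptions made for this illustration.

```python
import pyparsing as pp

# Assumed key pattern: a number followed by a suppressed '='.
key = pp.Word(pp.nums) + pp.Suppress('=')
# A value ends either at the end of the string or at a semicolon.
value = pp.Word(pp.alphas) + (pp.StringEnd() | pp.Suppress(';'))

notePat = pp.dictOf(key, value)
# Note that the last entry has no trailing semicolon.
noteNames = notePat.parseString('1=do;2=re;3=mi')

print(noteNames['3'])
print(noteNames.asDict())
```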
0 1 2 3 4 5 6 7 8
6.10. matchPreviousExpr(): Match the text that the preceding expression matched
pp.matchPreviousExpr(parser)

This function returns a new parser that matches the same pattern as the given parser, with the additional requirement that the text it matches must equal the value most recently matched by parser.

>>> name = pp.Word(pp.alphas)
>>> name2 = pp.matchPreviousExpr(name)
>>> dash2 = name + pp.Literal('-') + name2
>>> print dash2.parseString('aye-aye')
['aye', '-', 'aye']
>>> print dash2.parseString('aye-nay')
pyparsing.ParseException: (at char 0), (line:1, col:1)
>>> print dash2.parseString('no-now')
pyparsing.ParseException: (at char 0), (line:1, col:1)

The last example above failed because, even though the string "no" occurred both before and after the hyphen, the name2 parser matched the entire string "now" before it tested to see if it matched the previous occurrence "no". Compare the behavior of Section 6.11, matchPreviousLiteral(): Match the literal text that the preceding expression matched (p. 40).
6.11. matchPreviousLiteral(): Match the literal text that the preceding expression matched
pp.matchPreviousLiteral(parser)

This function works like the one described in Section 6.10, matchPreviousExpr(): Match the text that the preceding expression matched (p. 39), except that the returned parser matches the exact characters that parser matched, without regard for any following context. Compare the example below with the one in Section 6.10, matchPreviousExpr(): Match the text that the preceding expression matched (p. 39).

>>> name = pp.Word(pp.alphas)
>>> name2 = pp.matchPreviousLiteral(name)
>>> dash2 = pp.Combine(name + pp.Literal('-') + name2)
>>> print dash2.parseString('foo-foofaraw')
['foo-foo']
>>> print dash2.parseString('foo-foofaraw', parseAll=True)
pyparsing.ParseException: Expected end of text (at char 7), (line:1, col:8)
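The difference between the two functions shows up most clearly when both are given the same input. The following is a sketch in Python 3 syntax, assuming a pyparsing release that still provides both functions under their camelCase names. The helper try_parse and the names name_a, name_b, by_expr and by_literal are invented for this comparison. Separate name parsers are used because each function attaches a parse action to the parser it is given.

```python
import pyparsing as pp

def try_parse(parser, text):
    '''Hypothetical helper: return the token list, or None on failure.'''
    try:
        return parser.parseString(text).asList()
    except pp.ParseException:
        return None

# Repeat must match the same pattern AND yield the same tokens.
name_a = pp.Word(pp.alphas)
by_expr = name_a + pp.Literal('-') + pp.matchPreviousExpr(name_a)

# Repeat must merely start with the same characters.
name_b = pp.Word(pp.alphas)
by_literal = name_b + pp.Literal('-') + pp.matchPreviousLiteral(name_b)

print(try_parse(by_expr, 'aye-aye'))
print(try_parse(by_expr, 'no-now'))
print(try_parse(by_literal, 'no-now'))
```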
>>> text = '''(define (factorial n)
...   (fact-iter 1 1 n))'''
>>> print pp.nestedExpr().parseString(text)
[['define', ['factorial', 'n'], ['fact-iter', '1', '1', 'n']]]
>>> print ident.parseString('N23')
['N23']
>>> print ident.parseString('0xy')
pyparsing.ParseException: Expected W:(_abc...) (at char 0), (line:1, col:1)
7. Variables
These variables are defined in the pyparsing module.
21
http://www.nmt.edu/tcc/help/pubs/docbook43/iso9573/
>>> print pp.javaStyleComment.parseString('''/*
... multiline comment */''')
['/*\nmultiline comment */']
>>> print pp.javaStyleComment.parseString('''// This comment
... intentionally left almost blank\n''')
['// This comment']
>>> justTheGuts = pp.quotedString.addParseAction(pp.removeQuotes)
>>> print justTheGuts.parseString("'Kevin Phillips Bong'")
['Kevin Phillips Bong']