CookBook/Parsing.thy
changeset 58 f3794c231898
parent 56 126646f2aa88
child 59 b5914f3c643c
equal deleted inserted replaced
57:065f472c09ab 58:f3794c231898
    44 
    44 
    45   @{ML_response [display] "($$ \"h\") (explode \"hello\")" "(\"h\", [\"e\", \"l\", \"l\", \"o\"])"}
    45   @{ML_response [display] "($$ \"h\") (explode \"hello\")" "(\"h\", [\"e\", \"l\", \"l\", \"o\"])"}
    46   @{ML_response [display] "($$ \"w\") (explode \"world\")" "(\"w\", [\"o\", \"r\", \"l\", \"d\"])"}
    46   @{ML_response [display] "($$ \"w\") (explode \"world\")" "(\"w\", [\"o\", \"r\", \"l\", \"d\"])"}
    47 
    47 
    48   This function will either succeed (as in the two examples above) or raise the exception 
    48   This function will either succeed (as in the two examples above) or raise the exception 
    49   @{ML_text "FAIL"} if no string can be consumed. For example trying to parse
    49   @{text "FAIL"} if no string can be consumed. For example trying to parse
    50 
    50 
    51   @{ML_response_fake [display] "($$ \"x\") (explode \"world\")" 
    51   @{ML_response_fake [display] "($$ \"x\") (explode \"world\")" 
    52                                "Exception FAIL raised"}
    52                                "Exception FAIL raised"}
    53   
    53   
    54   will raise the exception @{ML_text "FAIL"}.
    54   will raise the exception @{text "FAIL"}.
    55   There are three exceptions used in the parsing combinators:
    55   There are three exceptions used in the parsing combinators:
    56 
    56 
    57   \begin{itemize}
    57   \begin{itemize}
    58   \item @{ML_text "FAIL"} is used to indicate that alternative routes of parsing 
    58   \item @{text "FAIL"} is used to indicate that alternative routes of parsing 
    59   might be explored. 
    59   might be explored. 
    60   \item @{ML_text "MORE"} indicates that there is not enough input for the parser. For example 
    60   \item @{text "MORE"} indicates that there is not enough input for the parser. For example 
    61   in @{ML_text "($$ \"h\") []"}.
    61   in @{text "($$ \"h\") []"}.
    62   \item @{ML_text "ABORT"} is the exception which is raised when a dead end is reached. 
    62   \item @{text "ABORT"} is the exception which is raised when a dead end is reached. 
    63   It is used for example in the function @{ML "(op !!)"} (see below).
    63   It is used for example in the function @{ML "(op !!)"} (see below).
    64   \end{itemize}
    64   \end{itemize}
    65 
    65 
    66   However, note that these exceptions are private to the parser and cannot be accessed
    66   However, note that these exceptions are private to the parser and cannot be accessed
    67   by the programmer (for example to handle them).
    67   by the programmer (for example to handle them).
    68   
    68   
    69   Slightly more general than the parser @{ML "(op $$)"} is the function @{ML
    69   Slightly more general than the parser @{ML "(op $$)"} is the function @{ML
    70   Scan.one}, in that it takes a predicate as argument and then parses exactly
    70   Scan.one}, in that it takes a predicate as argument and then parses exactly
    71   one item from the input list satisfying this predicate. For example the
    71   one item from the input list satisfying this predicate. For example the
    72   following parser either consumes an @{ML_text [quotes] "h"} or a @{ML_text
    72   following parser either consumes an @{text [quotes] "h"} or a @{text
    73   [quotes] "w"}:
    73   [quotes] "w"}:
    74 
    74 
    75 
    75 
    76 @{ML_response [display] 
    76 @{ML_response [display] 
    77 "let 
    77 "let 
    82     (hw input1, hw input2)
    82     (hw input1, hw input2)
    83 end"
    83 end"
    84     "((\"h\", [\"e\", \"l\", \"l\", \"o\"]),(\"w\", [\"o\", \"r\", \"l\", \"d\"]))"}
    84     "((\"h\", [\"e\", \"l\", \"l\", \"o\"]),(\"w\", [\"o\", \"r\", \"l\", \"d\"]))"}
    85 
    85 
    86   Two parser can be connected in sequence by using the function @{ML "(op --)"}. 
    86   Two parser can be connected in sequence by using the function @{ML "(op --)"}. 
    87   For example parsing @{ML_text "h"}, @{ML_text "e"} and @{ML_text "l"} in this 
    87   For example parsing @{text "h"}, @{text "e"} and @{text "l"} in this 
    88   sequence can be achieved by
    88   sequence can be achieved by
    89 
    89 
    90   @{ML_response [display] "(($$ \"h\") -- ($$ \"e\") -- ($$ \"l\")) (explode \"hello\")"
    90   @{ML_response [display] "(($$ \"h\") -- ($$ \"e\") -- ($$ \"l\")) (explode \"hello\")"
    91                           "(((\"h\", \"e\"), \"l\"), [\"l\", \"o\"])"}
    91                           "(((\"h\", \"e\"), \"l\"), [\"l\", \"o\"])"}
    92 
    92 
    98   @{ML_response [display] "Scan.this_string \"hel\" (explode \"hello\")"
    98   @{ML_response [display] "Scan.this_string \"hel\" (explode \"hello\")"
    99                           "(\"hel\", [\"l\", \"o\"])"}
    99                           "(\"hel\", [\"l\", \"o\"])"}
   100 
   100 
   101   Parsers that explore alternatives can be constructed using the function @{ML
   101   Parsers that explore alternatives can be constructed using the function @{ML
   102   "(op ||)"}. For example, the parser @{ML "(p || q)" for p q} returns the
   102   "(op ||)"}. For example, the parser @{ML "(p || q)" for p q} returns the
   103   result of @{ML_text "p"}, in case it succeeds, otherwise it returns the
   103   result of @{text "p"}, in case it succeeds, otherwise it returns the
   104   result of @{ML_text "q"}. For example
   104   result of @{text "q"}. For example
   105 
   105 
   106 
   106 
   107 @{ML_response [display] 
   107 @{ML_response [display] 
   108 "let 
   108 "let 
   109   val hw = ($$ \"h\") || ($$ \"w\")
   109   val hw = ($$ \"h\") || ($$ \"w\")
   127   (just_e input, just_h input)
   127   (just_e input, just_h input)
   128 end"
   128 end"
   129   "((\"e\", [\"l\", \"l\", \"o\"]),(\"h\", [\"l\", \"l\", \"o\"]))"}
   129   "((\"e\", [\"l\", \"l\", \"o\"]),(\"h\", [\"l\", \"l\", \"o\"]))"}
   130 
   130 
   131   The parser @{ML "Scan.optional p x" for p x} returns the result of the parser 
   131   The parser @{ML "Scan.optional p x" for p x} returns the result of the parser 
   132   @{ML_text "p"}, if it succeeds; otherwise it returns 
   132   @{text "p"}, if it succeeds; otherwise it returns 
   133   the default value @{ML_text "x"}. For example
   133   the default value @{text "x"}. For example
   134 
   134 
   135 @{ML_response [display]
   135 @{ML_response [display]
   136 "let 
   136 "let 
   137   val p = Scan.optional ($$ \"h\") \"x\"
   137   val p = Scan.optional ($$ \"h\") \"x\"
   138   val input1 = (explode \"hello\")
   138   val input1 = (explode \"hello\")
   153 in 
   153 in 
   154   (p input1, p input2)
   154   (p input1, p input2)
   155 end" "((SOME \"h\", [\"e\", \"l\", \"l\", \"o\"]), (NONE, [\"w\", \"o\", \"r\", \"l\", \"d\"]))"}
   155 end" "((SOME \"h\", [\"e\", \"l\", \"l\", \"o\"]), (NONE, [\"w\", \"o\", \"r\", \"l\", \"d\"]))"}
   156 
   156 
   157   The function @{ML "(op !!)"} helps to produce appropriate error messages
   157   The function @{ML "(op !!)"} helps to produce appropriate error messages
   158   during parsing. For example if one wants to parse that @{ML_text p} is immediately 
   158   during parsing. For example if one wants to parse that @{text p} is immediately 
   159   followed by @{ML_text q}, or start a completely different parser @{ML_text r},
   159   followed by @{text q}, or start a completely different parser @{text r},
   160   one might write
   160   one might write
   161 
   161 
   162   @{ML [display] "(p -- q) || r" for p q r}
   162   @{ML [display] "(p -- q) || r" for p q r}
   163 
   163 
   164   However, this parser is problematic for producing an appropriate error message, in case
   164   However, this parser is problematic for producing an appropriate error message, in case
   165   the parsing of @{ML "(p -- q)" for p q} fails. Because in that case one loses with the parser
   165   the parsing of @{ML "(p -- q)" for p q} fails. Because in that case one loses with the parser
   166   above the information 
   166   above the information 
   167   that @{ML_text p} should be followed by @{ML_text q}. To see this consider the case in
   167   that @{text p} should be followed by @{text q}. To see this consider the case in
   168   which @{ML_text p} 
   168   which @{text p} 
   169   is present in the input, but not @{ML_text q}. That means @{ML "(p -- q)" for p q} will fail 
   169   is present in the input, but not @{text q}. That means @{ML "(p -- q)" for p q} will fail 
   170   and the 
   170   and the 
   171   alternative parser @{ML_text r} will be tried. However in many circumstance this will be the wrong
   171   alternative parser @{text r} will be tried. However in many circumstance this will be the wrong
   172   parser for the input ``p-followed-by-q'' and therefore will also fail. The error message is then 
   172   parser for the input ``p-followed-by-q'' and therefore will also fail. The error message is then 
   173   caused by the
   173   caused by the
   174   failure of @{ML_text r}, not by the absence of @{ML_text q} in the input. This kind of situation
   174   failure of @{text r}, not by the absence of @{text q} in the input. This kind of situation
   175   can be avoided by using the function @{ML "(op !!)"}. This function aborts the whole process of
   175   can be avoided by using the function @{ML "(op !!)"}. This function aborts the whole process of
   176   parsing in case of a failure and invokes an error message. For example if we invoke the parser
   176   parsing in case of a failure and invokes an error message. For example if we invoke the parser
   177   
   177   
   178   @{ML [display] "(!! (fn _ => \"foo\") ($$ \"h\"))"}
   178   @{ML [display] "(!! (fn _ => \"foo\") ($$ \"h\"))"}
   179 
   179 
   180   on @{ML_text [quotes] "hello"}, the parsing succeeds
   180   on @{text [quotes] "hello"}, the parsing succeeds
   181 
   181 
   182   @{ML_response [display] 
   182   @{ML_response [display] 
   183                 "(!! (fn _ => \"foo\") ($$ \"h\")) (explode \"hello\")" 
   183                 "(!! (fn _ => \"foo\") ($$ \"h\")) (explode \"hello\")" 
   184                 "(\"h\", [\"e\", \"l\", \"l\", \"o\"])"}
   184                 "(\"h\", [\"e\", \"l\", \"l\", \"o\"])"}
   185 
   185 
   186   but if we invoke it on @{ML_text [quotes] "world"}
   186   but if we invoke it on @{text [quotes] "world"}
   187   
   187   
   188   @{ML_response_fake [display] "(!! (fn _ => \"foo\") ($$ \"h\")) (explode \"world\")"
   188   @{ML_response_fake [display] "(!! (fn _ => \"foo\") ($$ \"h\")) (explode \"world\")"
   189                                "Exception ABORT raised"}
   189                                "Exception ABORT raised"}
   190 
   190 
   191   the parsing aborts and the error message @{ML_text "foo"} is printed out. In order to
   191   the parsing aborts and the error message @{text "foo"} is printed out. In order to
   192   see the error message properly, we need to prefix the parser with the function 
   192   see the error message properly, we need to prefix the parser with the function 
   193   @{ML "Scan.error"}. For example
   193   @{ML "Scan.error"}. For example
   194 
   194 
   195   @{ML_response_fake [display] "Scan.error ((!! (fn _ => \"foo\") ($$ \"h\")))"
   195   @{ML_response_fake [display] "Scan.error ((!! (fn _ => \"foo\") ($$ \"h\")))"
   196                                "Exception Error \"foo\" raised"}
   196                                "Exception Error \"foo\" raised"}
   224   @{ML_response [display] "Scan.error (p_followed_by_q \"h\" \"e\" \"w\") (explode \"wworld\")"
   224   @{ML_response [display] "Scan.error (p_followed_by_q \"h\" \"e\" \"w\") (explode \"wworld\")"
   225                           "((\"w\", \"w\"), [\"o\", \"r\", \"l\", \"d\"])"}
   225                           "((\"w\", \"w\"), [\"o\", \"r\", \"l\", \"d\"])"}
   226   
   226   
   227   yields the expected parsing. 
   227   yields the expected parsing. 
   228 
   228 
   229   The function @{ML "Scan.repeat p" for p} will apply a parser @{ML_text p} as 
   229   The function @{ML "Scan.repeat p" for p} will apply a parser @{text p} as 
   230   often as it succeeds. For example
   230   often as it succeeds. For example
   231   
   231   
   232   @{ML_response [display] "Scan.repeat ($$ \"h\") (explode \"hhhhello\")" 
   232   @{ML_response [display] "Scan.repeat ($$ \"h\") (explode \"hhhhello\")" 
   233                 "([\"h\", \"h\", \"h\", \"h\"], [\"e\", \"l\", \"l\", \"o\"])"}
   233                 "([\"h\", \"h\", \"h\", \"h\"], [\"e\", \"l\", \"l\", \"o\"])"}
   234   
   234   
   235   Note that @{ML "Scan.repeat"} stores the parsed items in a list. The function
   235   Note that @{ML "Scan.repeat"} stores the parsed items in a list. The function
   236   @{ML "Scan.repeat1"} is similar, but requires that the parser @{ML_text "p"} 
   236   @{ML "Scan.repeat1"} is similar, but requires that the parser @{text "p"} 
   237   succeeds at least once.
   237   succeeds at least once.
   238 
   238 
   239   Also note that the parser would have aborted with the exception @{ML_text MORE}, if
   239   Also note that the parser would have aborted with the exception @{text MORE}, if
   240   we had run it only on just @{ML_text [quotes] "hhhh"}. This can be avoided using
   240   we had run it only on just @{text [quotes] "hhhh"}. This can be avoided using
   241   the wrapper @{ML Scan.finite} and the ``stopper-token'' @{ML Symbol.stopper}. With
   241   the wrapper @{ML Scan.finite} and the ``stopper-token'' @{ML Symbol.stopper}. With
   242   them we can write
   242   them we can write
   243   
   243   
   244   @{ML_response [display] "Scan.finite Symbol.stopper (Scan.repeat ($$ \"h\")) (explode \"hhhh\")" 
   244   @{ML_response [display] "Scan.finite Symbol.stopper (Scan.repeat ($$ \"h\")) (explode \"hhhh\")" 
   245                 "([\"h\", \"h\", \"h\", \"h\"], [])"}
   245                 "([\"h\", \"h\", \"h\", \"h\"], [])"}
   275 
   275 
   276   succeeds. 
   276   succeeds. 
   277 
   277 
   278   The functions @{ML Scan.repeat} and @{ML Scan.unless} can be combined to read any
   278   The functions @{ML Scan.repeat} and @{ML Scan.unless} can be combined to read any
   279   input until a certain marker symbol is reached. In the example below the marker
   279   input until a certain marker symbol is reached. In the example below the marker
   280   symbol is a @{text [quotes] "*"}:
   280   symbol is a @{text [quotes] "*"} which causes the parser to stop:
   281 
   281 
   282   @{ML_response [display]
   282   @{ML_response [display]
   283 "let 
   283 "let 
   284   val p = Scan.repeat (Scan.unless ($$ \"*\") (Scan.one Symbol.not_eof))
   284   val p = Scan.repeat (Scan.unless ($$ \"*\") (Scan.one Symbol.not_eof))
   285   val input1 = (explode \"fooooo\")
   285   val input1 = (explode \"fooooo\")
   291 "(([\"f\", \"o\", \"o\", \"o\", \"o\", \"o\"], []),
   291 "(([\"f\", \"o\", \"o\", \"o\", \"o\", \"o\"], []),
   292  ([\"f\", \"o\", \"o\"], [\"*\", \"o\", \"o\", \"o\"]))"}
   292  ([\"f\", \"o\", \"o\"], [\"*\", \"o\", \"o\", \"o\"]))"}
   293 
   293 
   294   After parsing succeeded, one nearly always wants to apply a function on the parsed 
   294   After parsing succeeded, one nearly always wants to apply a function on the parsed 
   295   items. This is done using the function @{ML "(p >> f)" for p f} which runs 
   295   items. This is done using the function @{ML "(p >> f)" for p f} which runs 
   296   first the parser @{ML_text p} and upon successful completion applies the 
   296   first the parser @{text p} and upon successful completion applies the 
   297   function @{ML_text f} to the result. For example
   297   function @{text f} to the result. For example
   298 
   298 
   299 @{ML_response [display]
   299 @{ML_response [display]
   300 "let 
   300 "let 
   301   fun double (x,y) = (x^x,y^y) 
   301   fun double (x,y) = (x^x,y^y) 
   302 in
   302 in
   305 "((\"hh\", \"ee\"), [\"l\", \"l\", \"o\"])"}
   305 "((\"hh\", \"ee\"), [\"l\", \"l\", \"o\"])"}
   306 
   306 
   307   doubles the two parsed input strings.
   307   doubles the two parsed input strings.
   308  
   308  
   309   \begin{exercise}\label{ex:scancmts}
   309   \begin{exercise}\label{ex:scancmts}
   310   Write a parser parses an input string so that any comment enclosed inside
   310   Write a parser that parses an input string so that any comment enclosed
   311   @{text "(*\<dots>*)"} is enclosed inside @{text "(**\<dots>**)"} in the output
   311   inside @{text "(*\<dots>*)"} is replaced by a the same comment but enclosed inside
   312   string. To enclose a string, you can use the function @{ML "enclose s1 s2 s"
   312   @{text "(**\<dots>**)"} in the output string. To enclose a string, you can use the
   313   for s1 s2 s} which produces the string @{ML "s1^s^s2" for s1 s2 s}.
   313   function @{ML "enclose s1 s2 s" for s1 s2 s} which produces the string @{ML
       
   314   "s1^s^s2" for s1 s2 s}.
   314   \end{exercise}
   315   \end{exercise}
   315 
   316 
   316 
   317 
   317   The function @{ML Scan.lift} takes a parser and a pair as arguments. This function applies
   318   The function @{ML Scan.lift} takes a parser and a pair as arguments. This function applies
   318   the given parser to the second component of the pair and leaves the  first component 
   319   the given parser to the second component of the pair and leaves the  first component 
   356 "[Token (\<dots>,(Ident, \"hello\"),\<dots>), 
   357 "[Token (\<dots>,(Ident, \"hello\"),\<dots>), 
   357  Token (\<dots>,(Space, \" \"),\<dots>), 
   358  Token (\<dots>,(Space, \" \"),\<dots>), 
   358  Token (\<dots>,(Ident, \"world\"),\<dots>)]"}
   359  Token (\<dots>,(Ident, \"world\"),\<dots>)]"}
   359 
   360 
   360   produces three tokens where the first and the last are identifiers, since
   361   produces three tokens where the first and the last are identifiers, since
   361   @{ML_text [quotes] "hello"} and @{ML_text [quotes] "world"} do not match any
   362   @{text [quotes] "hello"} and @{text [quotes] "world"} do not match any
   362   other syntactic category.\footnote{Note that because of a possible a bug in
   363   other syntactic category.\footnote{Note that because of a possible a bug in
   363   the PolyML runtime system the result is printed as @{text "?"}, instead of
   364   the PolyML runtime system the result is printed as @{text "?"}, instead of
   364   the token.} The second indicates a space.
   365   the token.} The second indicates a space.
   365 
   366 
   366   Many parsing functions later on will require spaces, comments and the like
   367   Many parsing functions later on will require spaces, comments and the like
   428   (OuterParse.$$$ \"|\" -- OuterParse.$$$ \"in\") input
   429   (OuterParse.$$$ \"|\" -- OuterParse.$$$ \"in\") input
   429 end"
   430 end"
   430 "((\"|\",\"in\"),[])"}
   431 "((\"|\",\"in\"),[])"}
   431 
   432 
   432   The parser @{ML "OuterParse.enum s p" for s p} parses a possibly empty 
   433   The parser @{ML "OuterParse.enum s p" for s p} parses a possibly empty 
   433   list of items recognised by the parser @{ML_text p}, where the items being parsed
   434   list of items recognised by the parser @{text p}, where the items being parsed
   434   are separated by the string @{ML_text s}. For example
   435   are separated by the string @{text s}. For example
   435 
   436 
   436 @{ML_response [display]
   437 @{ML_response [display]
   437 "let 
   438 "let 
   438   val input = filtered_input \"in | in | in foo\"
   439   val input = filtered_input \"in | in | in foo\"
   439 in 
   440 in 
   440   (OuterParse.enum \"|\" (OuterParse.$$$ \"in\")) input
   441   (OuterParse.enum \"|\" (OuterParse.$$$ \"in\")) input
   441 end" 
   442 end" 
   442 "([\"in\",\"in\",\"in\"],[\<dots>])"}
   443 "([\"in\",\"in\",\"in\"],[\<dots>])"}
   443 
   444 
   444   @{ML "OuterParse.enum1"} works similarly, except that the parsed list must
   445   @{ML "OuterParse.enum1"} works similarly, except that the parsed list must
   445   be non-empty. Note that we had to add an @{ML_text [quotes] "foo"} at the end
   446   be non-empty. Note that we had to add an @{text [quotes] "foo"} at the end
   446   of the parsed string, otherwise the parser would have consumed all tokens
   447   of the parsed string, otherwise the parser would have consumed all tokens
   447   and then failed with the exception @{ML_text "MORE"}. Like in the previous
   448   and then failed with the exception @{text "MORE"}. Like in the previous
   448   section, we can avoid this exception using the wrapper @{ML
   449   section, we can avoid this exception using the wrapper @{ML
   449   Scan.finite}. This time, however, we have to use the ``stopper-token'' @{ML
   450   Scan.finite}. This time, however, we have to use the ``stopper-token'' @{ML
   450   OuterLex.stopper}. We can write
   451   OuterLex.stopper}. We can write
   451 
   452 
   452 @{ML_response [display]
   453 @{ML_response [display]