CookBook/Parsing.thy
changeset 53 0c3580c831a4
parent 52 a04bdee4fb1e
child 54 1783211b3494
equal deleted inserted replaced
52:a04bdee4fb1e 53:0c3580c831a4
    92 
    92 
    93   Note how the result of consumed strings builds up on the left as nested pairs.  
    93   Note how the result of consumed strings builds up on the left as nested pairs.  
    94 
    94 
    95   Parsers that explore 
    95   Parsers that explore 
    96   alternatives can be constructed using the function @{ML "(op ||)"}. For example, the 
    96   alternatives can be constructed using the function @{ML "(op ||)"}. For example, the 
    97   parser @{ML_open "(p || q)" for p q} returns the result of @{ML_text "p"}, in case it succeeds, 
    97   parser @{ML "(p || q)" for p q} returns the result of @{ML_text "p"}, in case it succeeds, 
    98   otherwise it returns the result of @{ML_text "q"}. For example
    98   otherwise it returns the result of @{ML_text "q"}. For example
    99 
    99 
   100 @{ML_response [display] 
   100 @{ML_response [display] 
   101 "let 
   101 "let 
   102   val hw = ($$ \"h\") || ($$ \"w\")
   102   val hw = ($$ \"h\") || ($$ \"w\")
   119 in 
   119 in 
   120   (just_e input, just_h input)
   120   (just_e input, just_h input)
   121 end"
   121 end"
   122   "((\"e\", [\"l\", \"l\", \"o\"]),(\"h\", [\"l\", \"l\", \"o\"]))"}
   122   "((\"e\", [\"l\", \"l\", \"o\"]),(\"h\", [\"l\", \"l\", \"o\"]))"}
   123 
   123 
   124   The parser @{ML_open "Scan.optional p x" for p x} returns the result of the parser 
   124   The parser @{ML "Scan.optional p x" for p x} returns the result of the parser 
   125   @{ML_text "p"}, if it succeeds; otherwise it returns 
   125   @{ML_text "p"}, if it succeeds; otherwise it returns 
   126   the default value @{ML_text "x"}. For example
   126   the default value @{ML_text "x"}. For example
   127 
   127 
   128 @{ML_response [display]
   128 @{ML_response [display]
   129 "let 
   129 "let 
   150   The function @{ML "(op !!)"} helps to produce appropriate error messages
   150   The function @{ML "(op !!)"} helps to produce appropriate error messages
   151   during parsing. For example if one wants to parse that @{ML_text p} is immediately 
   151   during parsing. For example if one wants to parse that @{ML_text p} is immediately 
   152   followed by @{ML_text q}, or start a completely different parser @{ML_text r},
   152   followed by @{ML_text q}, or start a completely different parser @{ML_text r},
   153   one might write
   153   one might write
   154 
   154 
   155   @{ML_open [display] "(p -- q) || r" for p q r}
   155   @{ML [display] "(p -- q) || r" for p q r}
   156 
   156 
   157   However, this parser is problematic for producing an appropriate error message, in case
   157   However, this parser is problematic for producing an appropriate error message, in case
   158   the parsing of @{ML_open "(p -- q)" for p q} fails. Because in that case one loses with the parser
   158   the parsing of @{ML "(p -- q)" for p q} fails. Because in that case one loses with the parser
   159   above the information 
   159   above the information 
   160   that @{ML_text p} should be followed by @{ML_text q}. To see this consider the case in
   160   that @{ML_text p} should be followed by @{ML_text q}. To see this consider the case in
   161   which @{ML_text p} 
   161   which @{ML_text p} 
   162   is present in the input, but not @{ML_text q}. That means @{ML_open "(p -- q)" for p q} will fail 
   162   is present in the input, but not @{ML_text q}. That means @{ML "(p -- q)" for p q} will fail 
   163   and the 
   163   and the 
   164   alternative parser @{ML_text r} will be tried. However in many circumstance this will be the wrong
   164   alternative parser @{ML_text r} will be tried. However in many circumstance this will be the wrong
   165   parser for the input ``p-followed-by-q'' and therefore will also fail. The error message is then 
   165   parser for the input ``p-followed-by-q'' and therefore will also fail. The error message is then 
   166   caused by the
   166   caused by the
   167   failure of @{ML_text r}, not by the absence of @{ML_text q} in the input. This kind of situation
   167   failure of @{ML_text r}, not by the absence of @{ML_text q} in the input. This kind of situation
   189                                "Exception Error \"foo\" raised"}
   189                                "Exception Error \"foo\" raised"}
   190 
   190 
   191   This ``prefixing'' is usually done by wrappers such as @{ML "OuterSyntax.command"} 
   191   This ``prefixing'' is usually done by wrappers such as @{ML "OuterSyntax.command"} 
   192   (FIXME: give reference to later place). 
   192   (FIXME: give reference to later place). 
   193   
   193   
   194   Returning to our example of parsing @{ML_open "(p -- q) || r" for p q r}. If we want
   194   Returning to our example of parsing @{ML "(p -- q) || r" for p q r}. If we want
   195   to generate the correct error message for p-followed-by-q, then
   195   to generate the correct error message for p-followed-by-q, then
   196   we have to write:
   196   we have to write:
   197 *}
   197 *}
   198 
   198 
   199 ML {* 
   199 ML {* 
   217   @{ML_response [display] "Scan.error (p_followed_by_q \"h\" \"e\" \"w\") (explode \"wworld\")"
   217   @{ML_response [display] "Scan.error (p_followed_by_q \"h\" \"e\" \"w\") (explode \"wworld\")"
   218                           "((\"w\", \"w\"), [\"o\", \"r\", \"l\", \"d\"])"}
   218                           "((\"w\", \"w\"), [\"o\", \"r\", \"l\", \"d\"])"}
   219   
   219   
   220   yields the expected parsing. 
   220   yields the expected parsing. 
   221 
   221 
   222   The function @{ML_open "Scan.repeat p" for p} will apply a parser @{ML_text p} as 
   222   The function @{ML "Scan.repeat p" for p} will apply a parser @{ML_text p} as 
   223   often as it succeeds. For example
   223   often as it succeeds. For example
   224   
   224   
   225   @{ML_response [display] "Scan.repeat ($$ \"h\") (explode \"hhhhello\")" 
   225   @{ML_response [display] "Scan.repeat ($$ \"h\") (explode \"hhhhello\")" 
   226                 "([\"h\", \"h\", \"h\", \"h\"], [\"e\", \"l\", \"l\", \"o\"])"}
   226                 "([\"h\", \"h\", \"h\", \"h\"], [\"e\", \"l\", \"l\", \"o\"])"}
   227   
   227   
   239 
   239 
   240   However, this kind of manually wrapping needs to be done only very rarely
   240   However, this kind of manually wrapping needs to be done only very rarely
   241   in practise, because it is already done by the infrastructure for you. 
   241   in practise, because it is already done by the infrastructure for you. 
   242 
   242 
   243   After parsing succeeded, one nearly always wants to apply a function on the parsed 
   243   After parsing succeeded, one nearly always wants to apply a function on the parsed 
   244   items. This is done using the function @{ML_open "(p >> f)" for p f} which runs 
   244   items. This is done using the function @{ML "(p >> f)" for p f} which runs 
   245   first the parser @{ML_text p} and upon successful completion applies the 
   245   first the parser @{ML_text p} and upon successful completion applies the 
   246   function @{ML_text f} to the result. For example
   246   function @{ML_text f} to the result. For example
   247 
   247 
   248 @{ML_response [display]
   248 @{ML_response [display]
   249 "let 
   249 "let 
   279   @{ML_struct OuterParse} defined in the file @{ML_file  "Pure/Isar/outer_parse.ML"}.
   279   @{ML_struct OuterParse} defined in the file @{ML_file  "Pure/Isar/outer_parse.ML"}.
   280   The definition for tokens is in the file @{ML_file "Pure/Isar/outer_lex.ML"}.
   280   The definition for tokens is in the file @{ML_file "Pure/Isar/outer_lex.ML"}.
   281   \end{readmore}
   281   \end{readmore}
   282 
   282 
   283   The structure @{ML_struct OuterLex} defines several kinds of token (for example 
   283   The structure @{ML_struct OuterLex} defines several kinds of token (for example 
   284   @{ML "OuterLex.Ident"} for identifiers, @{ML "OuterLex.Keyword"} for keywords and
   284   @{ML "Ident" in OuterLex} for identifiers, @{ML "Keyword" in OuterLex} for keywords and
   285   @{ML "OuterLex.Command"} for commands). Some token parsers take into account the 
   285   @{ML "Command" in OuterLex} for commands). Some token parsers take into account the 
   286   kind of token.
   286   kind of token.
   287   
   287 *}  
   288   We can generate a token list out of a string using the function @{ML
   288 
   289   "OuterSyntax.scan"}, which we give below @{ML "Position.none"} as argument
   289 text {*
   290   since, at the moment, we are not interested in generating precise error
   290   For the examples below, we can generate a token list out of a string using
   291   messages. The following
   291   the function @{ML "OuterSyntax.scan"}, which we give below @{ML
       
   292   "Position.none"} as argument since, at the moment, we are not interested in
       
   293   generating precise error messages. The following
       
   294 
   292 
   295 
   293 @{ML_response_fake [display] "OuterSyntax.scan Position.none \"hello world\"" 
   296 @{ML_response_fake [display] "OuterSyntax.scan Position.none \"hello world\"" 
   294 "[Token (\<dots>,(Ident, \"hello\"),\<dots>), 
   297 "[Token (\<dots>,(Ident, \"hello\"),\<dots>), 
   295  Token (\<dots>,(Space, \" \"),\<dots>), 
   298  Token (\<dots>,(Space, \" \"),\<dots>), 
   296  Token (\<dots>,(Ident, \"world\"),\<dots>)]"}
   299  Token (\<dots>,(Ident, \"world\"),\<dots>)]"}
   300   other syntactic category.\footnote{Note that because of a possible a bug in
   303   other syntactic category.\footnote{Note that because of a possible a bug in
   301   the PolyML runtime system the result is printed as @{text "?"}, instead of
   304   the PolyML runtime system the result is printed as @{text "?"}, instead of
   302   the token.} The second indicates a space.
   305   the token.} The second indicates a space.
   303 
   306 
   304   Many parsing functions later on will require spaces, comments and the like
   307   Many parsing functions later on will require spaces, comments and the like
   305   to have been filtered out.  In what follows below, we are going to use the 
   308   to have already been filtered out.  So from now on we are going to use the 
   306   functions @{ML filter} and @{ML OuterLex.is_proper} do this. For example
   309   functions @{ML filter} and @{ML OuterLex.is_proper} do this. For example
   307 
   310 
   308 
   311 
   309 @{ML_response_fake [display]
   312 @{ML_response_fake [display]
   310 "let
   313 "let
   365 in 
   368 in 
   366   (OuterParse.$$$ \"|\" -- OuterParse.$$$ \"in\") input
   369   (OuterParse.$$$ \"|\" -- OuterParse.$$$ \"in\") input
   367 end"
   370 end"
   368 "((\"|\",\"in\"),[])"}
   371 "((\"|\",\"in\"),[])"}
   369 
   372 
   370   The parser @{ML_open "OuterParse.enum s p" for s p} parses a possibly empty 
   373   The parser @{ML "OuterParse.enum s p" for s p} parses a possibly empty 
   371   list of items recognised by the parser @{ML_text p}, where the items are 
   374   list of items recognised by the parser @{ML_text p}, where the items being parsed
   372   separated by the string @{ML_text s}. For example
   375   are separated by the string @{ML_text s}. For example
   373 
   376 
   374 @{ML_response [display]
   377 @{ML_response [display]
   375 "let 
   378 "let 
   376   val input = filtered_input \"in | in | in end\"
   379   val input = filtered_input \"in | in | in foo\"
   377 in 
   380 in 
   378   (OuterParse.enum \"|\" (OuterParse.$$$ \"in\")) input
   381   (OuterParse.enum \"|\" (OuterParse.$$$ \"in\")) input
   379 end" 
   382 end" 
   380 "([\"in\",\"in\",\"in\"],[\<dots>])"}
   383 "([\"in\",\"in\",\"in\"],[\<dots>])"}
   381 
   384 
   382   @{ML "OuterParse.enum1"} works similarly, except that the parsed list must
   385   @{ML "OuterParse.enum1"} works similarly, except that the parsed list must
   383   be non-empty. Note that we had to add a @{ML_text [quotes] "end"} at the end
   386   be non-empty. Note that we had to add an @{ML_text [quotes] "foo"} at the end
   384   of the parsed string, otherwise the parser would have consumed all tokens
   387   of the parsed string, otherwise the parser would have consumed all tokens
   385   and then failed with the exception @{ML_text "MORE"}. Like in the previous
   388   and then failed with the exception @{ML_text "MORE"}. Like in the previous
   386   section, we can avoid this exception using the wrapper @{ML
   389   section, we can avoid this exception using the wrapper @{ML
   387   Scan.finite}. This time, however, we have to use the ``stopper-token'' @{ML
   390   Scan.finite}. This time, however, we have to use the ``stopper-token'' @{ML
   388   OuterLex.stopper}. We can write
   391   OuterLex.stopper}. We can write
   389 
   392 
   390 @{ML_response [display]
   393 @{ML_response [display]
   391 "let 
   394 "let 
   392   val input = filtered_input \"in | in | in\"
   395   val input = filtered_input \"in | in | in\"
   393 in 
   396 in 
   394   Scan.finite OuterLex.stopper (OuterParse.enum \"|\" (OuterParse.$$$ \"in\")) input
   397   Scan.finite OuterLex.stopper 
       
   398          (OuterParse.enum \"|\" (OuterParse.$$$ \"in\")) input
   395 end" 
   399 end" 
   396 "([\"in\",\"in\",\"in\"],[])"}
   400 "([\"in\",\"in\",\"in\"],[])"}
       
   401 
       
   402   The following function will help us to run examples
       
   403 
       
   404 *}
       
   405 
       
   406 ML {*
       
   407 fun parse p input = Scan.finite OuterLex.stopper (Scan.error p) input
       
   408 *}
       
   409 
       
   410 text {*
   397 
   411 
   398   The function @{ML "OuterParse.!!!"} can be used to force termination of the
   412   The function @{ML "OuterParse.!!!"} can be used to force termination of the
   399   parser in case of a dead end, just like @{ML "Scan.!!"} (see previous section), 
   413   parser in case of a dead end, just like @{ML "Scan.!!"} (see previous section), 
   400   except that the error message is fixed to be @{text [quotes] "Outer syntax error"}
   414   except that the error message is fixed to be @{text [quotes] "Outer syntax error"}
   401   with a relatively precise description of the failure. For example:
   415   with a relatively precise description of the failure. For example:
   403 @{ML_response_fake [display]
   417 @{ML_response_fake [display]
   404 "let 
   418 "let 
   405   val input = filtered_input \"in |\"
   419   val input = filtered_input \"in |\"
   406   val parse_bar_then_in = OuterParse.$$$ \"|\" -- OuterParse.$$$ \"in\"
   420   val parse_bar_then_in = OuterParse.$$$ \"|\" -- OuterParse.$$$ \"in\"
   407 in 
   421 in 
   408   Scan.error (OuterParse.!!! parse_bar_then_in) input 
   422   parse (OuterParse.!!! parse_bar_then_in) input 
   409 end"
   423 end"
   410 "Exception ERROR \"Outer syntax error: keyword \"|\" expected, 
   424 "Exception ERROR \"Outer syntax error: keyword \"|\" expected, 
   411 but keyword in was found\" raised"
   425 but keyword in was found\" raised"
   412 }
   426 }
   413 
   427 
   414 *}
   428   \begin{exercise}
       
   429   A type-identifier, for example @{typ "'a"}, is a token of 
       
   430   kind @{ML "Keyword" in OuterLex} can be parsed 
       
   431 
       
   432   \end{exercise}
       
   433 
       
   434 
       
   435 *}
       
   436 
   415 
   437 
   416 section {* Positional Information *}
   438 section {* Positional Information *}
   417 
   439 
   418 text {*
   440 text {*
   419 
   441