CookBook/Parsing.thy
changeset 50 3d4b49921cdb
parent 49 a0edabf14457
child 52 a04bdee4fb1e
equal deleted inserted replaced
49:a0edabf14457 50:3d4b49921cdb
    19   Isabelle developers are usually concerned with writing these outer syntax parsers, 
    19   Isabelle developers are usually concerned with writing these outer syntax parsers, 
    20   either for new definitional packages or for calling tactics with specific arguments. 
    20   either for new definitional packages or for calling tactics with specific arguments. 
    21 
    21 
    22   \begin{readmore}
    22   \begin{readmore}
    23   The library 
    23   The library 
    24   for writing parser combinators can be split up, roughly, into two parts. 
    24   for writing parser combinators is split up, roughly, into two parts. 
    25   The first part consists of a collection of generic parser combinators defined
    25   The first part consists of a collection of generic parser combinators defined
    26   in the structure @{ML_struct Scan} in the file 
    26   in the structure @{ML_struct Scan} in the file 
    27   @{ML_file "Pure/General/scan.ML"}. The second part of the library consists of 
    27   @{ML_file "Pure/General/scan.ML"}. The second part of the library consists of 
    28   combinators for dealing with specific token types, which are defined in the 
    28   combinators for dealing with specific token types, which are defined in the 
    29   structure @{ML_struct OuterParse} in the file 
    29   structure @{ML_struct OuterParse} in the file 
    35 section {* Building Generic Parsers *}
    35 section {* Building Generic Parsers *}
    36 
    36 
    37 text {*
    37 text {*
    38 
    38 
    39   Let us first have a look at parsing strings using generic parsing combinators. 
    39   Let us first have a look at parsing strings using generic parsing combinators. 
    40   The function @{ML "(op $$)"} takes a string and will ``consume'' this string from 
    40   The function @{ML "(op $$)"} takes a string as argument and will ``consume'' this string from 
    41   a given input list of strings. ``Consume'' in this context means that it will 
    41   a given input list of strings. ``Consume'' in this context means that it will 
    42   return a pair consisting of this string and the rest of the input list. 
    42   return a pair consisting of this string and the rest of the input list. 
    43   For example:
    43   For example:
    44 
    44 
    45   @{ML_response [display] "($$ \"h\") (explode \"hello\")" "(\"h\", [\"e\", \"l\", \"l\", \"o\"])"}
    45   @{ML_response [display] "($$ \"h\") (explode \"hello\")" "(\"h\", [\"e\", \"l\", \"l\", \"o\"])"}
    61   in @{ML_text "($$ \"h\") []"}.
    61   in @{ML_text "($$ \"h\") []"}.
    62   \item @{ML_text "ABORT"} is the exception which is raised when a dead end is reached. 
    62   \item @{ML_text "ABORT"} is the exception which is raised when a dead end is reached. 
    63   It is used for example in the function @{ML "(op !!)"} (see below).
    63   It is used for example in the function @{ML "(op !!)"} (see below).
    64   \end{itemize}
    64   \end{itemize}
    65 
    65 
    66   However, note that these exception private to the parser and cannot be accessed
    66   However, note that these exceptions are private to the parser and cannot be accessed
    67   by the programmer (for example to handle them).
    67   by the programmer (for example to handle them).
    68   
    68   
    69   Slightly more general than the parser @{ML "(op $$)"} is the function @{ML
    69   Slightly more general than the parser @{ML "(op $$)"} is the function @{ML
    70   Scan.one}, in that it takes a predicate as argument and then parses exactly
    70   Scan.one}, in that it takes a predicate as argument and then parses exactly
    71   one item from the input list satisfying this prediate. For example the
    71   one item from the input list satisfying this prediate. For example the
   106   (hw input1, hw input2)
   106   (hw input1, hw input2)
   107 end"
   107 end"
   108   "((\"h\", [\"e\", \"l\", \"l\", \"o\"]), (\"w\", [\"o\", \"r\", \"l\", \"d\"]))"}
   108   "((\"h\", [\"e\", \"l\", \"l\", \"o\"]), (\"w\", [\"o\", \"r\", \"l\", \"d\"]))"}
   109 
   109 
   110   The functions @{ML "(op |--)"} and @{ML "(op --|)"} work like the sequencing funtion 
   110   The functions @{ML "(op |--)"} and @{ML "(op --|)"} work like the sequencing funtion 
   111   for parsers, except that they discard the item parsed by the first (respectively second)
   111   for parsers, except that they discard the item being parsed by the first (respectively second)
   112   parser. For example
   112   parser. For example
   113   
   113   
   114 @{ML_response [display]
   114 @{ML_response [display]
   115 "let 
   115 "let 
   116   val just_e = ($$ \"h\") |-- ($$ \"e\") 
   116   val just_e = ($$ \"h\") |-- ($$ \"e\") 
   134   (p input1, p input2)
   134   (p input1, p input2)
   135 end" 
   135 end" 
   136  "((\"h\", [\"e\", \"l\", \"l\", \"o\"]), (\"x\", [\"w\", \"o\", \"r\", \"l\", \"d\"]))"}
   136  "((\"h\", [\"e\", \"l\", \"l\", \"o\"]), (\"x\", [\"w\", \"o\", \"r\", \"l\", \"d\"]))"}
   137 
   137 
   138   The function @{ML Scan.option} works similarly, except no default value can
   138   The function @{ML Scan.option} works similarly, except no default value can
   139   be given and in the failure case this parser does nothing.
   139   be given. Instead, the result is wrapped as an @{text "option"}-type. For example:
       
   140 
       
   141 @{ML_response [display]
       
   142 "let 
       
   143   val p = Scan.option ($$ \"h\")
       
   144   val input1 = (explode \"hello\")
       
   145   val input2 = (explode \"world\")  
       
   146 in 
       
   147   (p input1, p input2)
       
   148 end" "((SOME \"h\", [\"e\", \"l\", \"l\", \"o\"]), (NONE, [\"w\", \"o\", \"r\", \"l\", \"d\"]))"}
   140 
   149 
   141   The function @{ML "(op !!)"} helps to produce appropriate error messages
   150   The function @{ML "(op !!)"} helps to produce appropriate error messages
   142   during parsing. For example if one wants to parse that @{ML_text p} is immediately 
   151   during parsing. For example if one wants to parse that @{ML_text p} is immediately 
   143   followed by @{ML_text q}, or start a completely different parser @{ML_text r},
   152   followed by @{ML_text q}, or start a completely different parser @{ML_text r},
   144   one might write
   153   one might write
   155   alternative parser @{ML_text r} will be tried. However in many circumstanes this will be the wrong
   164   alternative parser @{ML_text r} will be tried. However in many circumstanes this will be the wrong
   156   parser for the input ``p-followed-by-q'' and therefore will also fail. The error message is then 
   165   parser for the input ``p-followed-by-q'' and therefore will also fail. The error message is then 
   157   caused by the
   166   caused by the
   158   failure of @{ML_text r}, not by the absense of @{ML_text q} in the input. This kind of situation
   167   failure of @{ML_text r}, not by the absense of @{ML_text q} in the input. This kind of situation
   159   can be avoided by using the funtion @{ML "(op !!)"}. This function aborts the whole process of
   168   can be avoided by using the funtion @{ML "(op !!)"}. This function aborts the whole process of
   160   parsing in case of failure and invokes an error message. For example if we invoke the parser
   169   parsing in case of a failure and invokes an error message. For example if we invoke the parser
   161   
   170   
   162   @{ML [display] "(!! (fn _ => \"foo\") ($$ \"h\"))"}
   171   @{ML [display] "(!! (fn _ => \"foo\") ($$ \"h\"))"}
   163 
   172 
   164   on @{ML_text [quotes] "hello"}, the parsing succeeds
   173   on @{ML_text [quotes] "hello"}, the parsing succeeds
   165 
   174 
   182   This ``prefixing'' is usually done by wrappers such as @{ML "OuterSyntax.command"} 
   191   This ``prefixing'' is usually done by wrappers such as @{ML "OuterSyntax.command"} 
   183   (FIXME: give reference to later place). 
   192   (FIXME: give reference to later place). 
   184   
   193   
   185   Returning to our example of parsing @{ML_open "(p -- q) || r" for p q r}. If we want
   194   Returning to our example of parsing @{ML_open "(p -- q) || r" for p q r}. If we want
   186   to generate the correct error message for p-followed-by-q, then
   195   to generate the correct error message for p-followed-by-q, then
   187   we have to write, for example
   196   we have to write:
   188 *}
   197 *}
   189 
   198 
   190 ML {* 
   199 ML {* 
   191 fun p_followed_by_q p q r =
   200 fun p_followed_by_q p q r =
   192   let 
   201   let 
   219   Note that @{ML "Scan.repeat"} stores the parsed items in a list. The function
   228   Note that @{ML "Scan.repeat"} stores the parsed items in a list. The function
   220   @{ML "Scan.repeat1"} is similar, but requires that the parser @{ML_text "p"} 
   229   @{ML "Scan.repeat1"} is similar, but requires that the parser @{ML_text "p"} 
   221   succeeds at least once.
   230   succeeds at least once.
   222 
   231 
   223   Also note that the parser would have aborted with the exception @{ML_text MORE}, if
   232   Also note that the parser would have aborted with the exception @{ML_text MORE}, if
   224   we had run it only on just @{ML_text [quotes] "hhhh"}. This can be awoided using
   233   we had run it only on just @{ML_text [quotes] "hhhh"}. This can be avoided using
   225   the wrapper @{ML Scan.finite} and the ``stopper-token'' @{ML Symbol.stopper}. With
   234   the wrapper @{ML Scan.finite} and the ``stopper-token'' @{ML Symbol.stopper}. With
   226   them we can write
   235   them we can write
   227   
   236   
   228   @{ML_response [display] "Scan.finite Symbol.stopper (Scan.repeat ($$ \"h\")) (explode \"hhhh\")" 
   237   @{ML_response [display] "Scan.finite Symbol.stopper (Scan.repeat ($$ \"h\")) (explode \"hhhh\")" 
   229                 "([\"h\", \"h\", \"h\", \"h\"], [])"}
   238                 "([\"h\", \"h\", \"h\", \"h\"], [])"}
   230 
   239 
   231   However, the Isabelle-develloper rarely needs to do this kind of wrapping manually. 
   240   However, this kind of manually wrapping needs to be done only very rarely
       
   241   in practise, because it is already done by the infrastructure for you. 
   232 
   242 
   233   After parsing succeeded, one nearly always wants to apply a function on the parsed 
   243   After parsing succeeded, one nearly always wants to apply a function on the parsed 
   234   items. This is done using the function @{ML_open "(p >> f)" for p f} which runs 
   244   items. This is done using the function @{ML_open "(p >> f)" for p f} which runs 
   235   first the parser @{ML_text p} and upon successful completion applies the 
   245   first the parser @{ML_text p} and upon successful completion applies the 
   236   function @{ML_text f} to the result. For example
   246   function @{ML_text f} to the result. For example
   270   The definition for tokens is in the file @{ML_file "Pure/Isar/outer_lex.ML"}.
   280   The definition for tokens is in the file @{ML_file "Pure/Isar/outer_lex.ML"}.
   271   \end{readmore}
   281   \end{readmore}
   272 
   282 
   273   The structure @{ML_struct OuterLex} defines several kinds of token (for example 
   283   The structure @{ML_struct OuterLex} defines several kinds of token (for example 
   274   @{ML "OuterLex.Ident"} for identifiers, @{ML "OuterLex.Keyword"} for keywords and
   284   @{ML "OuterLex.Ident"} for identifiers, @{ML "OuterLex.Keyword"} for keywords and
   275   @{ML "OuterLex.Command"} for commands). Some parsers take into account the 
   285   @{ML "OuterLex.Command"} for commands). Some token parsers take into account the 
   276   kind of token.
   286   kind of token.
   277   
   287   
   278   We can generate a token list using the function @{ML "OuterSyntax.scan"}, which we give
   288   We can generate a token list out of a string using the function @{ML
   279   below @{ML "Position.none"} as argument since, at the moment, we are not interested in
   289   "OuterSyntax.scan"}, which we give below @{ML "Position.none"} as argument
   280   generating precise error messages. The following\footnote{There is something funny 
   290   since, at the moment, we are not interested in generating precise error
   281   going on with the pretty printing of the result token list.}
   291   messages. The following
   282 
   292 
   283 @{ML_response_fake [display] "OuterSyntax.scan Position.none \"hello world\"" 
   293 @{ML_response_fake [display] "OuterSyntax.scan Position.none \"hello world\"" 
   284 "[Token (\<dots>,(OuterLex.Ident, \"hello\"),\<dots>), 
   294 "[Token (\<dots>,(Ident, \"hello\"),\<dots>), 
   285  Token (\<dots>,(OuterLex.Space, \" \"),\<dots>), 
   295  Token (\<dots>,(Space, \" \"),\<dots>), 
   286  Token (\<dots>,(OuterLex.Ident, \"world\"),\<dots>)]"}
   296  Token (\<dots>,(Ident, \"world\"),\<dots>)]"}
   287 
   297 
   288   produces three tokens where the first and the last are identifiers, since 
   298   produces three tokens where the first and the last are identifiers, since
   289   @{ML_text [quotes] "hello"} and @{ML_text [quotes] "world"} do not match 
   299   @{ML_text [quotes] "hello"} and @{ML_text [quotes] "world"} do not match any
   290   any other category. The second indicates a space. Many parsing functions 
   300   other syntactic category.\footnote{Note that because of a possible a bug in
   291   later on will require spaces, comments and the like to have been filtered out.
   301   the PolyML runtime system the result is printed as @{text "?"}, instead of
       
   302   the token.} The second indicates a space.
       
   303 
       
   304   Many parsing functions later on will require spaces, comments and the like
       
   305   to have been filtered out.  In what follows below, we are going to use the 
       
   306   functions @{ML filter} and @{ML OuterLex.is_proper} do this. For example
       
   307 
       
   308 
       
   309 @{ML_response_fake [display]
       
   310 "let
       
   311    val input = OuterSyntax.scan Position.none \"hello world\"
       
   312 in
       
   313    filter OuterLex.is_proper input
       
   314 end" 
       
   315 "[Token (\<dots>,(Ident, \"hello\"), \<dots>), Token (\<dots>,(Ident, \"world\"), \<dots>)]"}
       
   316 
       
   317   For convenience we are going to use the function
       
   318 
       
   319 *}
       
   320 
       
   321 ML {* 
       
   322 fun filtered_input str = 
       
   323        filter OuterLex.is_proper (OuterSyntax.scan Position.none str) 
       
   324 *}
       
   325 
       
   326 text {*
       
   327 
   292   If we parse
   328   If we parse
   293 
   329 
   294 @{ML_response_fake [display] "OuterSyntax.scan Position.none \"inductive|for\"" 
   330 @{ML_response_fake [display] 
   295 "[Token (\<dots>,(OuterLex.Command, \"inductive\"),\<dots>), 
   331 "filtered_input \"inductive | for\"" 
   296  Token (\<dots>,(OuterLex.Keyword, \"|\"),\<dots>), 
   332 "[Token (\<dots>,(Command, \"inductive\"),\<dots>), 
   297  Token (\<dots>,(OuterLex.Keyword, \"for\"),\<dots>)]"}
   333  Token (\<dots>,(Keyword, \"|\"),\<dots>), 
   298 
   334  Token (\<dots>,(Keyword, \"for\"),\<dots>)]"}
   299   we obtain a list consiting of only command and keyword tokens.
   335 
       
   336   we obtain a list consiting of only a command and two keyword tokens.
   300   If you want to see which keywords and commands are currently known, use
   337   If you want to see which keywords and commands are currently known, use
   301   the following (you might have to adjust the @{ML print_depth} in order to
   338   the following (you might have to adjust the @{ML print_depth} in order to
   302   see the complete list):
   339   see the complete list):
   303 
   340 
   304 @{ML_response_fake [display] 
   341 @{ML_response_fake [display] 
   308   (Scan.dest_lexicon commands, Scan.dest_lexicon keywords)
   345   (Scan.dest_lexicon commands, Scan.dest_lexicon keywords)
   309 end" 
   346 end" 
   310 "([\"}\",\"{\",\<dots>],[\"\<rightleftharpoons>\",\"\<leftharpoondown>\",\<dots>])"}
   347 "([\"}\",\"{\",\<dots>],[\"\<rightleftharpoons>\",\"\<leftharpoondown>\",\<dots>])"}
   311 
   348 
   312   Now the parser @{ML "OuterParse.$$$"} parses a single keyword. For example
   349   Now the parser @{ML "OuterParse.$$$"} parses a single keyword. For example
   313  
   350 
   314 @{ML_response [display]
   351 @{ML_response [display]
   315 "let 
   352 "let 
   316   val input1 = OuterSyntax.scan Position.none \"where for\"
   353   val input1 = filtered_input \"where for\"
   317   val input2 = OuterSyntax.scan Position.none \"|in\"
   354   val input2 = filtered_input \"| in\"
   318 in 
   355 in 
   319   (OuterParse.$$$ \"where\" input1, OuterParse.$$$ \"|\" input2)
   356   (OuterParse.$$$ \"where\" input1, OuterParse.$$$ \"|\" input2)
   320 end"
   357 end"
   321 "((\"where\",\<dots>),(\"|\",\<dots>))"}
   358 "((\"where\",\<dots>),(\"|\",\<dots>))"}
   322 
   359 
   323   Like before, we can sequentially connect parsers with @{ML "(op --)"}. For example 
   360   Like before, we can sequentially connect parsers with @{ML "(op --)"}. For example 
   324 
   361 
   325 @{ML_response [display]
   362 @{ML_response [display]
   326 "let 
   363 "let 
   327   val input = OuterSyntax.scan Position.none \"|in\"
   364   val input = filtered_input \"| in\"
   328 in 
   365 in 
   329   (OuterParse.$$$ \"|\" -- OuterParse.$$$ \"in\") input
   366   (OuterParse.$$$ \"|\" -- OuterParse.$$$ \"in\") input
   330 end"
   367 end"
   331 "((\"|\",\"in\"),[])"}
   368 "((\"|\",\"in\"),[])"}
   332 
   369 
   334   list of items recognised by the parser @{ML_text p}, where the items are 
   371   list of items recognised by the parser @{ML_text p}, where the items are 
   335   separated by the string @{ML_text s}. For example
   372   separated by the string @{ML_text s}. For example
   336 
   373 
   337 @{ML_response [display]
   374 @{ML_response [display]
   338 "let 
   375 "let 
   339   val input = OuterSyntax.scan Position.none \"in|in|in\\n\"
   376   val input = filtered_input \"in | in | in end\"
   340 in 
   377 in 
   341   (OuterParse.enum \"|\" (OuterParse.$$$ \"in\")) input
   378   (OuterParse.enum \"|\" (OuterParse.$$$ \"in\")) input
   342 end" 
   379 end" 
   343 "([\"in\",\"in\",\"in\"],[\<dots>])"}
   380 "([\"in\",\"in\",\"in\"],[\<dots>])"}
   344 
   381 
   345   @{ML "OuterParse.enum1"} works similarly,  except that the parsed list must 
   382   @{ML "OuterParse.enum1"} works similarly, except that the parsed list must
   346   be non-empty.
   383   be non-empty. Note that we had to add a @{ML_text [quotes] "end"} at the end
   347 
   384   of the parsed string, otherwise the parser would have consumed all tokens
   348   Note that we had to add a @{ML_text [quotes] "\\n"} at the end of the parsed
   385   and then failed with the exception @{ML_text "MORE"}. Like in the previous
   349   string, otherwise the parser would have consumed all tokens and then failed with
   386   section, we can avoid this exception using the wrapper @{ML
   350   the exception @{ML_text "MORE"}. Like in the previous section, we can avoid
   387   Scan.finite}. This time, however, we have to use the ``stopper-token'' @{ML
   351   this exception using the wrapper @{ML Scan.finite}. This time, however, we
   388   OuterLex.stopper}. We can write
   352   have to use the ``stopper-token'' @{ML OuterLex.stopper}. We can write
       
   353 
   389 
   354 @{ML_response [display]
   390 @{ML_response [display]
   355 "let 
   391 "let 
   356   val input = OuterSyntax.scan Position.none \"in|in|in\"
   392   val input = filtered_input \"in | in | in\"
   357 in 
   393 in 
   358   Scan.finite OuterLex.stopper (OuterParse.enum \"|\" (OuterParse.$$$ \"in\")) input
   394   Scan.finite OuterLex.stopper (OuterParse.enum \"|\" (OuterParse.$$$ \"in\")) input
   359 end" 
   395 end" 
   360 "([\"in\",\"in\",\"in\"],[])"}
   396 "([\"in\",\"in\",\"in\"],[])"}
   361 
   397 
   364   except that the error message is fixed to be @{text [quotes] "Outer syntax error"}
   400   except that the error message is fixed to be @{text [quotes] "Outer syntax error"}
   365   with a relatively precise description of the failure. For example:
   401   with a relatively precise description of the failure. For example:
   366 
   402 
   367 @{ML_response_fake [display]
   403 @{ML_response_fake [display]
   368 "let 
   404 "let 
   369   val input = OuterSyntax.scan Position.none \"in|\"
   405   val input = filtered_input \"in |\"
   370   val parse_bar_then_in = OuterParse.$$$ \"|\" -- OuterParse.$$$ \"in\"
   406   val parse_bar_then_in = OuterParse.$$$ \"|\" -- OuterParse.$$$ \"in\"
   371 in 
   407 in 
   372   Scan.error (OuterParse.!!! parse_bar_then_in) input 
   408   Scan.error (OuterParse.!!! parse_bar_then_in) input 
   373 end"
   409 end"
   374 "Exception ERROR \"Outer syntax error: keyword \"|\" expected, 
   410 "Exception ERROR \"Outer syntax error: keyword \"|\" expected, 
   375 but keyword in was found\" raised"
   411 but keyword in was found\" raised"
   376 }
   412 }
   377 
   413 
   378 *}
   414 *}
   379 
   415 
   380 
       
   381 ML {*
       
   382 let 
       
   383   val input = filter OuterLex.is_proper (OuterSyntax.scan Position.none "(in foo)")
       
   384 in 
       
   385   OuterParse.target input
       
   386 end
       
   387 *}
       
   388 
       
   389 section {* Positional Information *}
   416 section {* Positional Information *}
   390 
   417 
   391 text {*
   418 text {*
   392 
   419 
   393   @{ML OuterParse.position}
   420   @{ML OuterParse.position}
   405 let 
   432 let 
   406   val input = OuterSyntax.scan Position.none "0"
   433   val input = OuterSyntax.scan Position.none "0"
   407 in 
   434 in 
   408   OuterParse.prop input
   435   OuterParse.prop input
   409 end 
   436 end 
       
   437 
       
   438 *}
       
   439 
       
   440 ML {*
       
   441   OuterParse.opt_target
       
   442 *}
       
   443 
       
   444 ML {*
       
   445   OuterParse.opt_target --
       
   446   OuterParse.fixes -- 
       
   447   OuterParse.for_fixes --
       
   448   Scan.optional (OuterParse.$$$ "where" |--
       
   449     OuterParse.!!! (OuterParse.enum1 "|" (SpecParse.opt_thm_name ":" -- OuterParse.prop))) []
   410 
   450 
   411 *}
   451 *}
   412 
   452 
   413 text {* (FIXME funny output for a proposition) *}
   453 text {* (FIXME funny output for a proposition) *}
   414 
   454