CookBook/Parsing.thy
changeset 42 cd612b489504
parent 41 b11653b11bd3
child 43 02f76f1b6e7b
equal deleted inserted replaced
41:b11653b11bd3 42:cd612b489504
     1 theory Parsing
     1 theory Parsing
     2 imports Base
     2 imports Base
     3 
     3 
     4 begin
     4 begin
       
     5 
     5 
     6 
     6 chapter {* Parsing *}
     7 chapter {* Parsing *}
     7 
     8 
     8 text {*
     9 text {*
     9 
    10 
    13   as terms, types and so on, belong to the inner syntax. For parsing inner syntax, 
    14   as terms, types and so on, belong to the inner syntax. For parsing inner syntax, 
    14   Isabelle uses a rather general and sophisticated algorithm due to Earley, which 
    15   Isabelle uses a rather general and sophisticated algorithm due to Earley, which 
    15   is driven by priority grammars. Parsers for outer syntax are built up by functional
    16   is driven by priority grammars. Parsers for outer syntax are built up by functional
    16   parsing combinators. These combinators are a well-established technique for parsing, 
    17   parsing combinators. These combinators are a well-established technique for parsing, 
    17   which has, for example, been described in Paulson's classic book \cite{paulson-ml2}.
    18   which has, for example, been described in Paulson's classic book \cite{paulson-ml2}.
    18   
    19   Isabelle developers are usually concerned with writing these outer syntax parsers, 
    19   Isabelle developers are usually concerned with writing parsers for outer
    20   either for new definitional packages or for calling tactics with specific arguments. 
    20   syntax, either for new definitional packages or for calling tactics. The library 
    21 
    21   for writing such parsers in can roughly be split up into two parts. 
    22   \begin{readmore}
       
    23   The library 
       
    24   for writing parser combinators can be split up, roughly, into two parts. 
    22   The first part consists of a collection of generic parser combinators defined
    25   The first part consists of a collection of generic parser combinators defined
    23   in the structure @{ML_struct Scan} in the file 
    26   in the structure @{ML_struct Scan} in the file 
    24   @{ML_file "Pure/General/scan.ML"}. The second part of the library consists of 
    27   @{ML_file "Pure/General/scan.ML"}. The second part of the library consists of 
    25   combinators for dealing with specific token types, which are defined in the 
    28   combinators for dealing with specific token types, which are defined in the 
    26   structure @{ML_struct OuterParse} in the file 
    29   structure @{ML_struct OuterParse} in the file 
    27   @{ML_file "Pure/Isar/outer_parse.ML"}.
    30   @{ML_file "Pure/Isar/outer_parse.ML"}.
       
    31   \end{readmore}
    28 
    32 
    29 *}
    33 *}
    30 
    34 
    31 section {* Building Up Generic Parsers *}
    35 section {* Building Up Generic Parsers *}
    32 
    36 
    47   @{ML_response_fake [display] "($$ \"x\") (explode \"world\")" 
    51   @{ML_response_fake [display] "($$ \"x\") (explode \"world\")" 
    48                                "Exception FAIL raised"}
    52                                "Exception FAIL raised"}
    49   
    53   
    50   will raise the exception @{ML_text "FAIL"}.
    54   will raise the exception @{ML_text "FAIL"}.
    51   There are three exceptions used in the parsing combinators:
    55   There are three exceptions used in the parsing combinators:
       
    56 
       
    57   (FIXME: do the exception need to be explained, because the user cannot use them from ``outside''?)
    52 
    58 
    53   \begin{itemize}
    59   \begin{itemize}
    54   \item @{ML_text "FAIL"} is used to indicate that alternative routes of parsing 
    60   \item @{ML_text "FAIL"} is used to indicate that alternative routes of parsing 
    55   might be explored. 
    61   might be explored. 
    56   \item @{ML_text "MORE"} indicates that there is not enough input for the parser. For example 
    62   \item @{ML_text "MORE"} indicates that there is not enough input for the parser. For example 
    84 
    90 
    85   Note how the result of consumed strings builds up on the left as nested pairs.  
    91   Note how the result of consumed strings builds up on the left as nested pairs.  
    86 
    92 
    87   Parsers that explore 
    93   Parsers that explore 
    88   alternatives can be constructed using the function @{ML "(op ||)"}. For example, the 
    94   alternatives can be constructed using the function @{ML "(op ||)"}. For example, the 
    89   parser @{ML_open "(p || q)" for p q} returns the result of @{ML_text "p"}, if it succeeds, 
    95   parser @{ML_open "(p || q)" for p q} returns the result of @{ML_text "p"}, in case it succeeds, 
    90   otherwise it returns the result of @{ML_text "q"}. For example
    96   otherwise it returns the result of @{ML_text "q"}. For example
    91 
    97 
    92 @{ML_response [display] 
    98 @{ML_response [display] 
    93 "let 
    99 "let 
    94   val hw = ($$ \"h\") || ($$ \"w\")
   100   val hw = ($$ \"h\") || ($$ \"w\")
   133   one might write
   139   one might write
   134 
   140 
   135   @{ML_open [display] "(p -- q) || r" for p q r}
   141   @{ML_open [display] "(p -- q) || r" for p q r}
   136 
   142 
   137   However, this parser is problematic for producing an appropriate error message, in case
   143   However, this parser is problematic for producing an appropriate error message, in case
   138   the parsing of @{ML_open "(p -- q)" for p q} fails. Because one loses the information 
   144   the parsing of @{ML_open "(p -- q)" for p q} fails. Because in that case one loses with the parser
   139   that @{ML_text p} should be followed by @{ML_text q}. To see this consider the case that @{ML_text p} 
   145   above the information 
   140   is present in the input, but not @{ML_text q}. So @{ML_open "(p -- q)" for p q} will fail and the 
   146   that @{ML_text p} should be followed by @{ML_text q}. To see this consider the case in
       
   147   which @{ML_text p} 
       
   148   is present in the input, but not @{ML_text q}. That means @{ML_open "(p -- q)" for p q} will fail 
       
   149   and the 
   141   alternative parser @{ML_text r} will be tried. However in many circumstanes this will be the wrong
   150   alternative parser @{ML_text r} will be tried. However in many circumstanes this will be the wrong
   142   parser for the input and therefore will fail. The error message is then caused by the
   151   parser for the input ``p-followed-by-q'' and therefore will fail. The error message is then caused by the
   143   failure of @{ML_text r}, not by the absense of @{ML_text p} in the input. These situations
   152   failure of @{ML_text r}, not by the absense of @{ML_text q} in the input. This kind of situation
   144   can be avoided using the funtion @{ML "(op !!)"}. This function aborts the whole process of
   153   can be avoided by using the funtion @{ML "(op !!)"}. This function aborts the whole process of
   145   parsing in case of failure and invokes an error message. For example if we invoke the parser
   154   parsing in case of failure and invokes an error message. For example if we invoke the parser
   146   
   155   
   147   @{ML [display] "(!! (fn _ => \"foo\") ($$ \"h\"))"}
   156   @{ML [display] "(!! (fn _ => \"foo\") ($$ \"h\"))"}
   148 
   157 
   149   on @{ML_text "hello"}, the parsing succeeds
   158   on @{ML_text [quotes] "hello"}, the parsing succeeds
   150 
   159 
   151   @{ML_response [display] 
   160   @{ML_response [display] 
   152                 "(!! (fn _ => \"foo\") ($$ \"h\")) (explode \"hello\")" 
   161                 "(!! (fn _ => \"foo\") ($$ \"h\")) (explode \"hello\")" 
   153                 "(\"h\", [\"e\", \"l\", \"l\", \"o\"])"}
   162                 "(\"h\", [\"e\", \"l\", \"l\", \"o\"])"}
   154 
   163 
   155   but if we invoke it on @{ML_text "world"}
   164   but if we invoke it on @{ML_text [quotes] "world"}
   156   
   165   
   157   @{ML_response_fake [display] "(!! (fn _ => \"foo\") ($$ \"h\")) (explode \"world\")"
   166   @{ML_response_fake [display] "(!! (fn _ => \"foo\") ($$ \"h\")) (explode \"world\")"
   158                                "Exception ABORT raised"}
   167                                "Exception ABORT raised"}
   159 
   168 
   160   the parsing aborts and the error message @{ML_text "foo"} is printed out. In order to
   169   the parsing aborts and the error message @{ML_text "foo"} is printed out. In order to
   163 
   172 
   164   @{ML_response_fake [display] "Scan.error ((!! (fn _ => \"foo\") ($$ \"h\")))"
   173   @{ML_response_fake [display] "Scan.error ((!! (fn _ => \"foo\") ($$ \"h\")))"
   165                                "Exception Error \"foo\" raised"}
   174                                "Exception Error \"foo\" raised"}
   166 
   175 
   167   This ``prefixing'' is usually done by wrappers such as @{ML "OuterSyntax.command"} 
   176   This ``prefixing'' is usually done by wrappers such as @{ML "OuterSyntax.command"} 
   168   (FIXME: give reference to late). 
   177   (FIXME: give reference to later place). 
   169   
   178   
   170   Returning to our example of parsing @{ML_open "(p -- q) || r" for p q r}. If we want
   179   Returning to our example of parsing @{ML_open "(p -- q) || r" for p q r}. If we want
   171   to generate the correct error message for @{ML_text q} not following @{ML_text p}, then
   180   to generate the correct error message for p-followed-by-q, then
   172   we have to write
   181   we have to write
   173 *}
   182 *}
   174 
   183 
   175 ML {*
   184 ML {* 
   176   fun p_followed_by_q p q r =
   185   fun p_followed_by_q p q r =
   177   let 
   186   let 
   178      val err = (fn _ => p ^ " is not followed by " ^ q)
   187      val err = (fn _ => p ^ " is not followed by " ^ q)
   179   in
   188   in
   180     (($$ p) -- (!! err ($$ q))) || (($$ r) -- ($$ r))
   189     (($$ p) -- (!! err ($$ q))) || (($$ r) -- ($$ r))
   238 text {*
   247 text {*
   239   Most of the time, however, Isabelle developers have to deal with parsing tokens, not strings.
   248   Most of the time, however, Isabelle developers have to deal with parsing tokens, not strings.
   240   This is because the parsers for the theory syntax, as well as the parsers for the 
   249   This is because the parsers for the theory syntax, as well as the parsers for the 
   241   argument syntax of proof methods and attributes, use the type 
   250   argument syntax of proof methods and attributes, use the type 
   242   @{ML_type OuterParse.token} (which is identical to the type @{ML_type OuterLex.token}).
   251   @{ML_type OuterParse.token} (which is identical to the type @{ML_type OuterLex.token}).
       
   252 
       
   253   \begin{readmore}
   243   The parser functions for the theory syntax are contained in the structure
   254   The parser functions for the theory syntax are contained in the structure
   244   @{ML_struct OuterParse} defined in the file\linebreak @{ML_file  "Pure/Isar/outer_parse.ML"}.
   255   @{ML_struct OuterParse} defined in the file @{ML_file  "Pure/Isar/outer_parse.ML"}.
       
   256   The definition for tokens is in the file @{ML_file "Pure/Isar/outer_lex.ML"}.
       
   257   \end{readmore}
       
   258 *}
       
   259 
       
   260 ML {* OuterLex.mk_text "hello" *}
       
   261 
       
   262 ML {* print_depth 50 *}
       
   263 
       
   264 ML {* 
       
   265   let open OuterLex in 
       
   266   OuterSyntax.scan Position.none "for" 
       
   267   end;
       
   268 
   245 *}
   269 *}
   246 
   270 
   247 ML {* map OuterLex.mk_text (explode "hello") *} 
   271 ML {* map OuterLex.mk_text (explode "hello") *} 
       
   272 
       
   273 ML {*
       
   274 
       
   275 fun my_scan lex pos str =
       
   276    Source.of_string str
       
   277    |> Symbol.source {do_recover = false}
       
   278    |> OuterLex.source {do_recover = SOME false}
       
   279         (fn () => pairself (Scan.make_lexicon o map Symbol.explode) lex) pos
       
   280    |> Source.exhaust;
       
   281 
       
   282 *}
       
   283 
       
   284 ML {*
       
   285 let val toks = my_scan (["hello"], []) Position.none "hello"
       
   286 in (OuterParse.$$$ "hello") toks end
       
   287 *}
   248 
   288 
   249 text {* 
   289 text {* 
   250 
   290 
   251   @{ML_response_fake [display] "map OuterLex.mk_text (explode \"hello\")"  
   291   @{ML_response_fake [display] "map OuterLex.mk_text (explode \"hello\")"  
   252        "[Token (\<dots>), Token (\<dots>), Token (\<dots>), Token (\<dots>), Token (\<dots>)]"} 
   292        "[Token (\<dots>), Token (\<dots>), Token (\<dots>), Token (\<dots>), Token (\<dots>)]"}