ProgTutorial/Parsing.thy
changeset 236 7b6d81ff9d9a
parent 230 8def50824320
child 240 d111f5988e49
equal deleted inserted replaced
235:dc955603d813 236:7b6d81ff9d9a
     5 chapter {* Parsing *}
     5 chapter {* Parsing *}
     6 
     6 
     7 text {*
     7 text {*
     8 
     8 
     9   Isabelle distinguishes between \emph{outer} and \emph{inner} syntax. 
     9   Isabelle distinguishes between \emph{outer} and \emph{inner} syntax. 
    10   Theory commands, such as \isacommand{definition}, \isacommand{inductive} and so
    10   Commands, such as \isacommand{definition}, \isacommand{inductive} and so
    11   on, belong to the outer syntax, whereas items inside double quotation marks, such 
    11   on, belong to the outer syntax, whereas terms, types and so on belong to the 
    12   as terms, types and so on, belong to the inner syntax. For parsing inner syntax, 
    12   inner syntax. For parsing inner syntax, 
    13   Isabelle uses a rather general and sophisticated algorithm, which 
    13   Isabelle uses a rather general and sophisticated algorithm, which 
    14   is driven by priority grammars. Parsers for outer syntax are built up by functional
    14   is driven by priority grammars. Parsers for outer syntax are built up by functional
    15   parsing combinators. These combinators are a well-established technique for parsing, 
    15   parsing combinators. These combinators are a well-established technique for parsing, 
    16   which has, for example, been described in Paulson's classic ML-book \cite{paulson-ml2}.
    16   which has, for example, been described in Paulson's classic ML-book \cite{paulson-ml2}.
    17   Isabelle developers are usually concerned with writing these outer syntax parsers, 
    17   Isabelle developers are usually concerned with writing these outer syntax parsers, 
    18   either for new definitional packages or for calling methods with specific arguments. 
    18   either for new definitional packages or for calling methods with specific arguments. 
    19 
    19 
    20   \begin{readmore}
    20   \begin{readmore}
    21   The library 
    21   The library for writing parser combinators is split up, roughly, into two
    22   for writing parser combinators is split up, roughly, into two parts. 
    22   parts. The first part consists of a collection of generic parser combinators
    23   The first part consists of a collection of generic parser combinators defined
    23   defined in the structure @{ML_struct Scan} in the file @{ML_file
    24   in the structure @{ML_struct Scan} in the file 
    24   "Pure/General/scan.ML"}. The second part of the library consists of
    25   @{ML_file "Pure/General/scan.ML"}. The second part of the library consists of 
    25   combinators for dealing with specific token types, which are defined in the
    26   combinators for dealing with specific token types, which are defined in the 
    26   structure @{ML_struct OuterParse} in the file @{ML_file
    27   structure @{ML_struct OuterParse} in the file 
    27   "Pure/Isar/outer_parse.ML"}. Specific parsers for packages are defined
    28   @{ML_file "Pure/Isar/outer_parse.ML"}.
    28   in @{ML_file "Pure/Isar/spec_parse.ML"}. Parsers for method arguments
       
    29   are defined in @{ML_file "Pure/Isar/args.ML"}.
    29   \end{readmore}
    30   \end{readmore}
    30 
    31 
    31 *}
    32 *}
    32 
    33 
    33 section {* Building Generic Parsers *}
    34 section {* Building Generic Parsers *}
    73 
    74 
    74 
    75 
    75 @{ML_response [display,gray] 
    76 @{ML_response [display,gray] 
    76 "let 
    77 "let 
    77   val hw = Scan.one (fn x => x = \"h\" orelse x = \"w\")
    78   val hw = Scan.one (fn x => x = \"h\" orelse x = \"w\")
    78   val input1 = (explode \"hello\")
    79   val input1 = explode \"hello\"
    79   val input2 = (explode \"world\")
    80   val input2 = explode \"world\"
    80 in
    81 in
    81     (hw input1, hw input2)
    82   (hw input1, hw input2)
    82 end"
    83 end"
    83     "((\"h\", [\"e\", \"l\", \"l\", \"o\"]),(\"w\", [\"o\", \"r\", \"l\", \"d\"]))"}
    84     "((\"h\", [\"e\", \"l\", \"l\", \"o\"]),(\"w\", [\"o\", \"r\", \"l\", \"d\"]))"}
    84 
    85 
    85   Two parsers can be connected in sequence by using the function @{ML "--"}. 
    86   Two parsers can be connected in sequence by using the function @{ML "--"}. 
    86   For example parsing @{text "h"}, @{text "e"} and @{text "l"} (in this 
    87   For example parsing @{text "h"}, @{text "e"} and @{text "l"} (in this 
    87   order) you can achieve by:
    88   order) you can achieve by:
    88 
    89 
    89   @{ML_response [display,gray] "(($$ \"h\") -- ($$ \"e\") -- ($$ \"l\")) (explode \"hello\")"
    90   @{ML_response [display,gray] 
    90                           "(((\"h\", \"e\"), \"l\"), [\"l\", \"o\"])"}
    91   "($$ \"h\" -- $$ \"e\" -- $$ \"l\") (explode \"hello\")"
       
    92   "(((\"h\", \"e\"), \"l\"), [\"l\", \"o\"])"}
    91 
    93 
    92   Note how the result of consumed strings builds up on the left as nested pairs.  
    94   Note how the result of consumed strings builds up on the left as nested pairs.  
    93 
    95 
    94   If, as in the previous example, you want to parse a particular string, 
    96   If, as in the previous example, you want to parse a particular string, 
    95   then you should use the function @{ML Scan.this_string}:
    97   then you should use the function @{ML Scan.this_string}:
    96 
    98 
    97   @{ML_response [display,gray] "Scan.this_string \"hell\" (explode \"hello\")"
    99   @{ML_response [display,gray] 
    98                           "(\"hell\", [\"o\"])"}
   100   "Scan.this_string \"hell\" (explode \"hello\")"
       
   101   "(\"hell\", [\"o\"])"}
    99 
   102 
   100   Parsers that explore alternatives can be constructed using the function @{ML
   103   Parsers that explore alternatives can be constructed using the function @{ML
   101   "||"}. The parser @{ML "(p || q)" for p q} returns the
   104   "||"}. The parser @{ML "(p || q)" for p q} returns the
   102   result of @{text "p"}, in case it succeeds, otherwise it returns the
   105   result of @{text "p"}, in case it succeeds, otherwise it returns the
   103   result of @{text "q"}. For example:
   106   result of @{text "q"}. For example:
   104 
   107 
   105 
   108 
   106 @{ML_response [display,gray] 
   109 @{ML_response [display,gray] 
   107 "let 
   110 "let 
   108   val hw = ($$ \"h\") || ($$ \"w\")
   111   val hw = $$ \"h\" || $$ \"w\"
   109   val input1 = (explode \"hello\")
   112   val input1 = explode \"hello\"
   110   val input2 = (explode \"world\")
   113   val input2 = explode \"world\"
   111 in
   114 in
   112   (hw input1, hw input2)
   115   (hw input1, hw input2)
   113 end"
   116 end"
   114   "((\"h\", [\"e\", \"l\", \"l\", \"o\"]), (\"w\", [\"o\", \"r\", \"l\", \"d\"]))"}
   117   "((\"h\", [\"e\", \"l\", \"l\", \"o\"]), (\"w\", [\"o\", \"r\", \"l\", \"d\"]))"}
   115 
   118 
   117   for parsers, except that they discard the item being parsed by the first (respectively second)
   120   for parsers, except that they discard the item being parsed by the first (respectively second)
   118   parser. For example:
   121   parser. For example:
   119   
   122   
   120 @{ML_response [display,gray]
   123 @{ML_response [display,gray]
   121 "let 
   124 "let 
   122   val just_e = ($$ \"h\") |-- ($$ \"e\") 
   125   val just_e = $$ \"h\" |-- $$ \"e\" 
   123   val just_h = ($$ \"h\") --| ($$ \"e\") 
   126   val just_h = $$ \"h\" --| $$ \"e\" 
   124   val input = (explode \"hello\")  
   127   val input = explode \"hello\"  
   125 in 
   128 in 
   126   (just_e input, just_h input)
   129   (just_e input, just_h input)
   127 end"
   130 end"
   128   "((\"e\", [\"l\", \"l\", \"o\"]),(\"h\", [\"l\", \"l\", \"o\"]))"}
   131   "((\"e\", [\"l\", \"l\", \"o\"]),(\"h\", [\"l\", \"l\", \"o\"]))"}
   129 
   132 
   132   the default value @{text "x"}. For example:
   135   the default value @{text "x"}. For example:
   133 
   136 
   134 @{ML_response [display,gray]
   137 @{ML_response [display,gray]
   135 "let 
   138 "let 
   136   val p = Scan.optional ($$ \"h\") \"x\"
   139   val p = Scan.optional ($$ \"h\") \"x\"
   137   val input1 = (explode \"hello\")
   140   val input1 = explode \"hello\"
   138   val input2 = (explode \"world\")  
   141   val input2 = explode \"world\"  
   139 in 
   142 in 
   140   (p input1, p input2)
   143   (p input1, p input2)
   141 end" 
   144 end" 
   142  "((\"h\", [\"e\", \"l\", \"l\", \"o\"]), (\"x\", [\"w\", \"o\", \"r\", \"l\", \"d\"]))"}
   145  "((\"h\", [\"e\", \"l\", \"l\", \"o\"]), (\"x\", [\"w\", \"o\", \"r\", \"l\", \"d\"]))"}
   143 
   146 
   145   be given. Instead, the result is wrapped as an @{text "option"}-type. For example:
   148   be given. Instead, the result is wrapped as an @{text "option"}-type. For example:
   146 
   149 
   147 @{ML_response [display,gray]
   150 @{ML_response [display,gray]
   148 "let 
   151 "let 
   149   val p = Scan.option ($$ \"h\")
   152   val p = Scan.option ($$ \"h\")
   150   val input1 = (explode \"hello\")
   153   val input1 = explode \"hello\"
   151   val input2 = (explode \"world\")  
   154   val input2 = explode \"world\"  
   152 in 
   155 in 
   153   (p input1, p input2)
   156   (p input1, p input2)
   154 end" "((SOME \"h\", [\"e\", \"l\", \"l\", \"o\"]), (NONE, [\"w\", \"o\", \"r\", \"l\", \"d\"]))"}
   157 end" "((SOME \"h\", [\"e\", \"l\", \"l\", \"o\"]), (NONE, [\"w\", \"o\", \"r\", \"l\", \"d\"]))"}
   155 
   158 
   156   The function @{ML "!!"} helps to produce appropriate error messages
   159   The function @{ML "!!"} helps to produce appropriate error messages
   164   message, if the parsing of @{ML "(p -- q)" for p q} fails. Because in that
   167   message, if the parsing of @{ML "(p -- q)" for p q} fails. Because in that
   165   case you lose the information that @{text p} should be followed by @{text q}.
   168   case you lose the information that @{text p} should be followed by @{text q}.
   166   To see this assume that @{text p} is present in the input, but it is not
   169   To see this assume that @{text p} is present in the input, but it is not
   167   followed by @{text q}. That means @{ML "(p -- q)" for p q} will fail and
   170   followed by @{text q}. That means @{ML "(p -- q)" for p q} will fail and
   168   hence the alternative parser @{text r} will be tried. However, in many
   171   hence the alternative parser @{text r} will be tried. However, in many
   169   circumstances this will be the wrong parser for the input ``p-followed-by-something''
   172   circumstances this will be the wrong parser for the input ``@{text "p"}-followed-by-something''
   170   and therefore will also fail. The error message is then caused by the failure
   173   and therefore will also fail. The error message is then caused by the failure
   171   of @{text r}, not by the absence of @{text q} in the input. This kind of
   174   of @{text r}, not by the absence of @{text q} in the input. This kind of
   172   situation can be avoided when using the function @{ML "!!"}.  This function
   175   situation can be avoided when using the function @{ML "!!"}.  This function
   173   aborts the whole process of parsing in case of a failure and prints an error
   176   aborts the whole process of parsing in case of a failure and prints an error
   174   message. For example if you invoke the parser
   177   message. For example if you invoke the parser
   175 
   178 
   176   
   179   
   177   @{ML [display,gray] "(!! (fn _ => \"foo\") ($$ \"h\"))"}
   180   @{ML [display,gray] "!! (fn _ => \"foo\") ($$ \"h\")"}
   178 
   181 
   179   on @{text [quotes] "hello"}, the parsing succeeds
   182   on @{text [quotes] "hello"}, the parsing succeeds
   180 
   183 
   181   @{ML_response [display,gray] 
   184   @{ML_response [display,gray] 
   182                 "(!! (fn _ => \"foo\") ($$ \"h\")) (explode \"hello\")" 
   185   "(!! (fn _ => \"foo\") ($$ \"h\")) (explode \"hello\")" 
   183                 "(\"h\", [\"e\", \"l\", \"l\", \"o\"])"}
   186   "(\"h\", [\"e\", \"l\", \"l\", \"o\"])"}
   184 
   187 
   185   but if you invoke it on @{text [quotes] "world"}
   188   but if you invoke it on @{text [quotes] "world"}
   186   
   189   
   187   @{ML_response_fake [display,gray] "(!! (fn _ => \"foo\") ($$ \"h\")) (explode \"world\")"
   190   @{ML_response_fake [display,gray] "(!! (fn _ => \"foo\") ($$ \"h\")) (explode \"world\")"
   188                                "Exception ABORT raised"}
   191                                "Exception ABORT raised"}
   189 
   192 
   190   then the parsing aborts and the error message @{text "foo"} is printed. In order to
   193   then the parsing aborts and the error message @{text "foo"} is printed. In order to
   191   see the error message properly, you need to prefix the parser with the function 
   194   see the error message properly, you need to prefix the parser with the function 
   192   @{ML "Scan.error"}. For example:
   195   @{ML "Scan.error"}. For example:
   193 
   196 
   194   @{ML_response_fake [display,gray] "Scan.error (!! (fn _ => \"foo\") ($$ \"h\"))"
   197   @{ML_response_fake [display,gray] 
   195                                "Exception Error \"foo\" raised"}
   198   "Scan.error (!! (fn _ => \"foo\") ($$ \"h\"))"
       
   199   "Exception Error \"foo\" raised"}
   196 
   200 
   197   This ``prefixing'' is usually done by wrappers such as @{ML "OuterSyntax.local_theory"} 
   201   This ``prefixing'' is usually done by wrappers such as @{ML "OuterSyntax.local_theory"} 
   198   (see Section~\ref{sec:newcommand} which explains this function in more detail). 
   202   (see Section~\ref{sec:newcommand} which explains this function in more detail). 
   199   
   203   
   200   Let us now return to our example of parsing @{ML "(p -- q) || r" for p q
   204   Let us now return to our example of parsing @{ML "(p -- q) || r" for p q
   201   r}. If you want to generate the correct error message for p-followed-by-q,
   205   r}. If you want to generate the correct error message for 
   202   then you have to write:
   206   @{text "p"}-followed-by-@{text "q"}, then you have to write:
   203 *}
   207 *}
   204 
   208 
   205 ML{*fun p_followed_by_q p q r =
   209 ML{*fun p_followed_by_q p q r =
   206 let 
   210 let 
   207   val err_msg = (fn _ => p ^ " is not followed by " ^ q)
   211   val err_msg = fn _ => p ^ " is not followed by " ^ q
   208 in
   212 in
   209   ($$ p -- (!! err_msg ($$ q))) || ($$ r -- $$ r)
   213   ($$ p -- (!! err_msg ($$ q))) || ($$ r -- $$ r)
   210 end *}
   214 end *}
   211 
   215 
   212 
   216 
   251   string as in
   255   string as in
   252 
   256 
   253   @{ML_response [display,gray] 
   257   @{ML_response [display,gray] 
   254 "let 
   258 "let 
   255    val p = Scan.repeat (Scan.one Symbol.not_eof)
   259    val p = Scan.repeat (Scan.one Symbol.not_eof)
   256    val input = (explode \"foo bar foo\") 
   260    val input = explode \"foo bar foo\"
   257 in
   261 in
   258    Scan.finite Symbol.stopper p input
   262    Scan.finite Symbol.stopper p input
   259 end" 
   263 end" 
   260 "([\"f\", \"o\", \"o\", \" \", \"b\", \"a\", \"r\", \" \", \"f\", \"o\", \"o\"], [])"}
   264 "([\"f\", \"o\", \"o\", \" \", \"b\", \"a\", \"r\", \" \", \"f\", \"o\", \"o\"], [])"}
   261 
   265 
   280   symbol is a @{text [quotes] "*"}.
   284   symbol is a @{text [quotes] "*"}.
   281 
   285 
   282   @{ML_response [display,gray]
   286   @{ML_response [display,gray]
   283 "let 
   287 "let 
   284   val p = Scan.repeat (Scan.unless ($$ \"*\") (Scan.one Symbol.not_eof))
   288   val p = Scan.repeat (Scan.unless ($$ \"*\") (Scan.one Symbol.not_eof))
   285   val input1 = (explode \"fooooo\")
   289   val input1 = explode \"fooooo\"
   286   val input2 = (explode \"foo*ooo\")
   290   val input2 = explode \"foo*ooo\"
   287 in
   291 in
   288   (Scan.finite Symbol.stopper p input1, 
   292   (Scan.finite Symbol.stopper p input1, 
   289    Scan.finite Symbol.stopper p input2)
   293    Scan.finite Symbol.stopper p input2)
   290 end"
   294 end"
   291 "(([\"f\", \"o\", \"o\", \"o\", \"o\", \"o\"], []),
   295 "(([\"f\", \"o\", \"o\", \"o\", \"o\", \"o\"], []),
   307   doubles the two parsed input strings; or
   311   doubles the two parsed input strings; or
   308 
   312 
   309   @{ML_response [display,gray] 
   313   @{ML_response [display,gray] 
   310 "let 
   314 "let 
   311    val p = Scan.repeat (Scan.one Symbol.not_eof)
   315    val p = Scan.repeat (Scan.one Symbol.not_eof)
   312    val input = (explode \"foo bar foo\") 
   316    val input = explode \"foo bar foo\" 
   313 in
   317 in
   314    Scan.finite Symbol.stopper (p >> implode) input
   318    Scan.finite Symbol.stopper (p >> implode) input
   315 end" 
   319 end" 
   316 "(\"foo bar foo\",[])"}
   320 "(\"foo bar foo\",[])"}
   317 
   321 
   330   The function @{ML Scan.lift} takes a parser and a pair as arguments. This function applies
   334   The function @{ML Scan.lift} takes a parser and a pair as arguments. This function applies
   331   the given parser to the second component of the pair and leaves the  first component 
   335   the given parser to the second component of the pair and leaves the  first component 
   332   untouched. For example
   336   untouched. For example
   333 
   337 
   334 @{ML_response [display,gray]
   338 @{ML_response [display,gray]
   335 "Scan.lift (($$ \"h\") -- ($$ \"e\")) (1,(explode \"hello\"))"
   339 "Scan.lift ($$ \"h\" -- $$ \"e\") (1, explode \"hello\")"
   336 "((\"h\", \"e\"), (1, [\"l\", \"l\", \"o\"]))"}
   340 "((\"h\", \"e\"), (1, [\"l\", \"l\", \"o\"]))"}
   337 
   341 
   338   (FIXME: In which situations is this useful? Give examples.) 
   342   (FIXME: In which situations is this useful? Give examples.) 
   339 
   343 
   340   \begin{exercise}\label{ex:scancmts}
   344   \begin{exercise}\label{ex:scancmts}
   341   Write a parser that parses an input string so that any comment enclosed
   345   Write a parser that parses an input string so that any comment enclosed
   342   within @{text "(*\<dots>*)"} is replaced by the same comment but enclosed within
   346   within @{text "(*\<dots>*)"} is replaced by the same comment but enclosed within
   343   @{text "(**\<dots>**)"} in the output string. To enclose a string, you can use the
   347   @{text "(**\<dots>**)"} in the output string. To enclose a string, you can use the
   344   function @{ML "enclose s1 s2 s" for s1 s2 s} which produces the string @{ML
   348   function @{ML "enclose s1 s2 s" for s1 s2 s} which produces the string @{ML
   345   "s1 ^ s ^ s2" for s1 s2 s}.
   349   "s1 ^ s ^ s2" for s1 s2 s}. Hint: To simplify the task ignore the proper 
       
   350   nesting of comments.
   346   \end{exercise}
   351   \end{exercise}
   347 *}
   352 *}
   348 
   353 
   349 section {* Parsing Theory Syntax *}
   354 section {* Parsing Theory Syntax *}
   350 
   355