thys/Journal/Paper.thy
changeset 287 95b3880d428f
parent 280 c840a99a3e05
child 288 9ab8609c66c5
equal deleted inserted replaced
286:804fbb227568 287:95b3880d428f
     2 theory Paper
     2 theory Paper
     3 imports 
     3 imports 
     4    "../Lexer"
     4    "../Lexer"
     5    "../Simplifying" 
     5    "../Simplifying" 
     6    "../Positions"
     6    "../Positions"
       
     7    "../Sulzmann"
     7    "~~/src/HOL/Library/LaTeXsugar"
     8    "~~/src/HOL/Library/LaTeXsugar"
     8 begin
     9 begin
     9 
    10 
    10 lemma Suc_0_fold:
    11 lemma Suc_0_fold:
    11   "Suc 0 = 1"
    12   "Suc 0 = 1"
    44   ZERO ("\<^bold>0" 81) and 
    45   ZERO ("\<^bold>0" 81) and 
    45   ONE ("\<^bold>1" 81) and 
    46   ONE ("\<^bold>1" 81) and 
    46   CHAR ("_" [1000] 80) and
    47   CHAR ("_" [1000] 80) and
    47   ALT ("_ + _" [77,77] 78) and
    48   ALT ("_ + _" [77,77] 78) and
    48   SEQ ("_ \<cdot> _" [77,77] 78) and
    49   SEQ ("_ \<cdot> _" [77,77] 78) and
    49   STAR ("_\<^sup>\<star>" [78] 78) and
    50   STAR ("_\<^sup>\<star>" [79] 78) and
    50   
    51   
    51   val.Void ("Empty" 78) and
    52   val.Void ("Empty" 78) and
    52   val.Char ("Char _" [1000] 78) and
    53   val.Char ("Char _" [1000] 78) and
    53   val.Left ("Left _" [79] 78) and
    54   val.Left ("Left _" [79] 78) and
    54   val.Right ("Right _" [1000] 78) and
    55   val.Right ("Right _" [1000] 78) and
   120 
   121 
   121 
   122 
   122 text {*
   123 text {*
   123 
   124 
   124 Brzozowski \cite{Brzozowski1964} introduced the notion of the {\em
   125 Brzozowski \cite{Brzozowski1964} introduced the notion of the {\em
   125 derivative} @{term "der c r"} of a regular expression @{text r} w.r.t.\ a
   126 derivative} @{term "der c r"} of a regular expression @{text r} w.r.t.\
   126 character~@{text c}, and showed that it gave a simple solution to the
   127 a character~@{text c}, and showed that it gave a simple solution to the
   127 problem of matching a string @{term s} with a regular expression @{term r}:
   128 problem of matching a string @{term s} with a regular expression @{term
   128 if the derivative of @{term r} w.r.t.\ (in succession) all the characters of
   129 r}: if the derivative of @{term r} w.r.t.\ (in succession) all the
   129 the string matches the empty string, then @{term r} matches @{term s} (and
   130 characters of the string matches the empty string, then @{term r}
   130 {\em vice versa}). The derivative has the property (which may almost be
   131 matches @{term s} (and {\em vice versa}). The derivative has the
   131 regarded as its specification) that, for every string @{term s} and regular
   132 property (which may almost be regarded as its specification) that, for
   132 expression @{term r} and character @{term c}, one has @{term "cs \<in> L(r)"} if
   133 every string @{term s} and regular expression @{term r} and character
   133 and only if \mbox{@{term "s \<in> L(der c r)"}}. The beauty of Brzozowski's
   134 @{term c}, one has @{term "cs \<in> L(r)"} if and only if \mbox{@{term "s \<in> L(der c r)"}}. 
   134 derivatives is that they are neatly expressible in any functional language,
   135 The beauty of Brzozowski's derivatives is that
   135 and easily definable and reasoned about in theorem provers---the definitions
   136 they are neatly expressible in any functional language, and easily
   136 just consist of inductive datatypes and simple recursive functions. A
   137 definable and reasoned about in theorem provers---the definitions just
       
   138 consist of inductive datatypes and simple recursive functions. A
   137 mechanised correctness proof of Brzozowski's matcher in for example HOL4
   139 mechanised correctness proof of Brzozowski's matcher in for example HOL4
   138 has been mentioned by Owens and Slind~\cite{Owens2008}. Another one in Isabelle/HOL is part
   140 has been mentioned by Owens and Slind~\cite{Owens2008}. Another one in
   139 of the work by Krauss and Nipkow \cite{Krauss2011}. And another one in Coq is given
   141 Isabelle/HOL is part of the work by Krauss and Nipkow \cite{Krauss2011}.
   140 by Coquand and Siles \cite{Coquand2012}.
   142 And another one in Coq is given by Coquand and Siles \cite{Coquand2012}.
   141 
   143 
   142 If a regular expression matches a string, then in general there is more than
   144 If a regular expression matches a string, then in general there is more
   143 one way of how the string is matched. There are two commonly used
   145 than one way of how the string is matched. There are two commonly used
   144 disambiguation strategies to generate a unique answer: one is called GREEDY
   146 disambiguation strategies to generate a unique answer: one is called
   145 matching \cite{Frisch2004} and the other is POSIX
   147 GREEDY matching \cite{Frisch2004} and the other is POSIX
   146 matching~\cite{POSIX,Kuklewicz,OkuiSuzuki2010,Sulzmann2014,Vansummeren2006}. For example consider
   148 matching~\cite{POSIX,Kuklewicz,OkuiSuzuki2010,Sulzmann2014,Vansummeren2006}.
   147 the string @{term xy} and the regular expression \mbox{@{term "STAR (ALT
   149 For example consider the string @{term xy} and the regular expression
   148 (ALT x y) xy)"}}. Either the string can be matched in two `iterations' by
   150 \mbox{@{term "STAR (ALT (ALT x y) xy)"}}. Either the string can be
   149 the single letter-regular expressions @{term x} and @{term y}, or directly
   151 matched in two `iterations' by the single letter-regular expressions
   150 in one iteration by @{term xy}. The first case corresponds to GREEDY
   152 @{term x} and @{term y}, or directly in one iteration by @{term xy}. The
   151 matching, which first matches with the left-most symbol and only matches the
   153 first case corresponds to GREEDY matching, which first matches with the
   152 next symbol in case of a mismatch (this is greedy in the sense of preferring
   154 left-most symbol and only matches the next symbol in case of a mismatch
   153 instant gratification to delayed repletion). The second case is POSIX
   155 (this is greedy in the sense of preferring instant gratification to
   154 matching, which prefers the longest match.
   156 delayed repletion). The second case is POSIX matching, which prefers the
       
   157 longest match.
   155 
   158 
   156 In the context of lexing, where an input string needs to be split up
   159 In the context of lexing, where an input string needs to be split up
   157 into a sequence of tokens, POSIX is the more natural disambiguation
   160 into a sequence of tokens, POSIX is the more natural disambiguation
   158 strategy for what programmers consider basic syntactic building blocks
   161 strategy for what programmers consider basic syntactic building blocks
   159 in their programs.  These building blocks are often specified by some
   162 in their programs.  These building blocks are often specified by some
   160 regular expressions, say @{text "r\<^bsub>key\<^esub>"} and @{text
   163 regular expressions, say @{text "r\<^bsub>key\<^esub>"} and @{text
   161 "r\<^bsub>id\<^esub>"} for recognising keywords and identifiers,
   164 "r\<^bsub>id\<^esub>"} for recognising keywords and identifiers,
   162 respectively. There are a few underlying (informal) rules behind
   165 respectively. There are a few underlying (informal) rules behind
   163 tokenising a string in a POSIX \cite{POSIX} fashion according to a
   166 tokenising a string in a POSIX \cite{POSIX} fashion:
   164 collection of regular expressions:
       
   165 
   167 
   166 \begin{itemize} 
   168 \begin{itemize} 
   167 \item[$\bullet$] \emph{The Longest Match Rule} (or \emph{``{M}aximal {M}unch {R}ule''}):
   169 \item[$\bullet$] \emph{The Longest Match Rule} (or \emph{``{M}aximal {M}unch {R}ule''}):
   168 The longest initial substring matched by any regular expression is taken as
   170 The longest initial substring matched by any regular expression is taken as
   169 next token.\smallskip
   171 next token.\smallskip
   224 Char} are constructors for values. @{text Stars} records how many
   226 Char} are constructors for values. @{text Stars} records how many
   225 iterations were used; @{text Left}, respectively @{text Right}, which
   227 iterations were used; @{text Left}, respectively @{text Right}, which
   226 alternative is used. This `tree view' leads naturally to the idea that
   228 alternative is used. This `tree view' leads naturally to the idea that
   227 regular expressions act as types and values as inhabiting those types
   229 regular expressions act as types and values as inhabiting those types
   228 (see, for example, \cite{HosoyaVouillonPierce2005}).  The value for
   230 (see, for example, \cite{HosoyaVouillonPierce2005}).  The value for
   229 the single `iteration', i.e.~the POSIX value, would look as follows
   231 matching @{text "xy"} in a single `iteration', i.e.~the POSIX value,
       
   232 would look as follows
   230 
   233 
   231 \begin{center}
   234 \begin{center}
   232 @{term "Stars [Seq (Char x) (Char y)]"}
   235 @{term "Stars [Seq (Char x) (Char y)]"}
   233 \end{center}
   236 \end{center}
   234 
   237 
   235 \noindent where @{const Stars} has only a single-element list for the
   238 \noindent where @{const Stars} has only a single-element list for the
   236 single iteration and @{const Seq} indicates that @{term xy} is matched 
   239 single iteration and @{const Seq} indicates that @{term xy} is matched 
   237 by a sequence regular expression, which we will in what follows 
   240 by a sequence regular expression.
   238 write more formally as @{term "SEQ x y"}.
   241 
       
   242 %, which we will in what follows 
       
   243 %write more formally as @{term "SEQ x y"}.
   239 
   244 
   240 
   245 
   241 Sulzmann and Lu give a simple algorithm to calculate a value that
   246 Sulzmann and Lu give a simple algorithm to calculate a value that
   242 appears to be the value associated with POSIX matching.  The challenge
   247 appears to be the value associated with POSIX matching.  The challenge
   243 then is to specify that value, in an algorithm-independent fashion,
   248 then is to specify that value, in an algorithm-independent fashion,
   269 \it{}``The POSIX strategy is more complicated than the greedy because of 
   274 \it{}``The POSIX strategy is more complicated than the greedy because of 
   270 the dependence on information about the length of matched strings in the 
   275 the dependence on information about the length of matched strings in the 
   271 various subexpressions.''
   276 various subexpressions.''
   272 \end{quote}
   277 \end{quote}
   273 
   278 
   274 %\footnote{The relation @{text "\<ge>\<^bsub>r\<^esub>"} defined by Sulzmann and Lu \cite{Sulzmann2014} 
       
   275 %is a relation on the
       
   276 %values for the regular expression @{term r}; but it only holds between
       
   277 %@{term "v\<^sub>1"} and @{term "v\<^sub>2"} in cases where @{term "v\<^sub>1"} and @{term "v\<^sub>2"} have
       
   278 %the same flattening (underlying string). So a counterexample to totality is
       
   279 %given by taking two values @{term "v\<^sub>1"} and @{term "v\<^sub>2"} for @{term r} that
       
   280 %have different flattenings (see Section~\ref{posixsec}). A different
       
   281 %relation @{text "\<ge>\<^bsub>r,s\<^esub>"} on the set of values for @{term r}
       
   282 %with flattening @{term s} is definable by the same approach, and is indeed
       
   283 %total; but that is not what Proposition 1 of \cite{Sulzmann2014} does.}
       
   284 
   279 
   285 
   280 
   286 \noindent {\bf Contributions:} We have implemented in Isabelle/HOL the
   281 \noindent {\bf Contributions:} We have implemented in Isabelle/HOL the
   287 derivative-based regular expression matching algorithm of
   282 derivative-based regular expression matching algorithm of
   288 Sulzmann and Lu \cite{Sulzmann2014}. We have proved the correctness of this
   283 Sulzmann and Lu \cite{Sulzmann2014}. We have proved the correctness of this
   295 informal proof contains gaps, and possible fixes are not fully worked out.}
   290 informal proof contains gaps, and possible fixes are not fully worked out.}
   296 Our specification of a POSIX value consists of a simple inductive definition
   291 Our specification of a POSIX value consists of a simple inductive definition
   297 that given a string and a regular expression uniquely determines this value.
   292 that given a string and a regular expression uniquely determines this value.
   298 We also show that our definition is equivalent to an ordering 
   293 We also show that our definition is equivalent to an ordering 
   299 of values based on positions by Okui and Suzuki \cite{OkuiSuzuki2010}.
   294 of values based on positions by Okui and Suzuki \cite{OkuiSuzuki2010}.
   300 Derivatives as calculated by Brzozowski's method are usually more complex
   295 
   301 regular expressions than the initial one; various optimisations are
   296 %Derivatives as calculated by Brzozowski's method are usually more complex
   302 possible. We prove the correctness when simplifications of @{term "ALT ZERO
   297 %regular expressions than the initial one; various optimisations are
   303 r"}, @{term "ALT r ZERO"}, @{term "SEQ ONE r"} and @{term "SEQ r ONE"} to
   298 %possible. We prove the correctness when simplifications of @{term "ALT ZERO r"}, 
   304 @{term r} are applied. We extend our results to ???
   299 %@{term "ALT r ZERO"}, @{term "SEQ ONE r"} and @{term "SEQ r ONE"} to
       
   300 %@{term r} are applied. 
       
   301 
       
   302 We extend our results to ???
   305 
   303 
   306 *}
   304 *}
   307 
   305 
   308 section {* Preliminaries *}
   306 section {* Preliminaries *}
   309 
   307