diff -r 804fbb227568 -r 95b3880d428f thys/Journal/Paper.thy --- a/thys/Journal/Paper.thy Wed Aug 15 13:48:57 2018 +0100 +++ b/thys/Journal/Paper.thy Thu Aug 16 01:12:00 2018 +0100 @@ -4,6 +4,7 @@ "../Lexer" "../Simplifying" "../Positions" + "../Sulzmann" "~~/src/HOL/Library/LaTeXsugar" begin @@ -46,7 +47,7 @@ CHAR ("_" [1000] 80) and ALT ("_ + _" [77,77] 78) and SEQ ("_ \ _" [77,77] 78) and - STAR ("_\<^sup>\" [78] 78) and + STAR ("_\<^sup>\" [79] 78) and val.Void ("Empty" 78) and val.Char ("Char _" [1000] 78) and @@ -122,36 +123,38 @@ text {* Brzozowski \cite{Brzozowski1964} introduced the notion of the {\em -derivative} @{term "der c r"} of a regular expression @{text r} w.r.t.\ a -character~@{text c}, and showed that it gave a simple solution to the -problem of matching a string @{term s} with a regular expression @{term r}: -if the derivative of @{term r} w.r.t.\ (in succession) all the characters of -the string matches the empty string, then @{term r} matches @{term s} (and -{\em vice versa}). The derivative has the property (which may almost be -regarded as its specification) that, for every string @{term s} and regular -expression @{term r} and character @{term c}, one has @{term "cs \ L(r)"} if -and only if \mbox{@{term "s \ L(der c r)"}}. The beauty of Brzozowski's -derivatives is that they are neatly expressible in any functional language, -and easily definable and reasoned about in theorem provers---the definitions -just consist of inductive datatypes and simple recursive functions. A +derivative} @{term "der c r"} of a regular expression @{text r} w.r.t.\ +a character~@{text c}, and showed that it gave a simple solution to the +problem of matching a string @{term s} with a regular expression @{term +r}: if the derivative of @{term r} w.r.t.\ (in succession) all the +characters of the string matches the empty string, then @{term r} +matches @{term s} (and {\em vice versa}). The derivative has the +property (which may almost be regarded as its specification) that, for +every string @{term s} and regular expression @{term r} and character +@{term c}, one has @{term "cs \ L(r)"} if and only if \mbox{@{term "s \ L(der c r)"}}. +The beauty of Brzozowski's derivatives is that +they are neatly expressible in any functional language, and easily +definable and reasoned about in theorem provers---the definitions just +consist of inductive datatypes and simple recursive functions. A mechanised correctness proof of Brzozowski's matcher in for example HOL4 -has been mentioned by Owens and Slind~\cite{Owens2008}. Another one in Isabelle/HOL is part -of the work by Krauss and Nipkow \cite{Krauss2011}. And another one in Coq is given -by Coquand and Siles \cite{Coquand2012}. +has been mentioned by Owens and Slind~\cite{Owens2008}. Another one in +Isabelle/HOL is part of the work by Krauss and Nipkow \cite{Krauss2011}. +And another one in Coq is given by Coquand and Siles \cite{Coquand2012}. -If a regular expression matches a string, then in general there is more than -one way of how the string is matched. There are two commonly used -disambiguation strategies to generate a unique answer: one is called GREEDY -matching \cite{Frisch2004} and the other is POSIX -matching~\cite{POSIX,Kuklewicz,OkuiSuzuki2010,Sulzmann2014,Vansummeren2006}. For example consider -the string @{term xy} and the regular expression \mbox{@{term "STAR (ALT -(ALT x y) xy)"}}. Either the string can be matched in two `iterations' by -the single letter-regular expressions @{term x} and @{term y}, or directly -in one iteration by @{term xy}. The first case corresponds to GREEDY -matching, which first matches with the left-most symbol and only matches the -next symbol in case of a mismatch (this is greedy in the sense of preferring -instant gratification to delayed repletion). The second case is POSIX -matching, which prefers the longest match. +If a regular expression matches a string, then in general there is more +than one way of how the string is matched. There are two commonly used +disambiguation strategies to generate a unique answer: one is called +GREEDY matching \cite{Frisch2004} and the other is POSIX +matching~\cite{POSIX,Kuklewicz,OkuiSuzuki2010,Sulzmann2014,Vansummeren2006}. +For example consider the string @{term xy} and the regular expression +\mbox{@{term "STAR (ALT (ALT x y) xy)"}}. Either the string can be +matched in two `iterations' by the single letter-regular expressions +@{term x} and @{term y}, or directly in one iteration by @{term xy}. The +first case corresponds to GREEDY matching, which first matches with the +left-most symbol and only matches the next symbol in case of a mismatch +(this is greedy in the sense of preferring instant gratification to +delayed repletion). The second case is POSIX matching, which prefers the +longest match. In the context of lexing, where an input string needs to be split up into a sequence of tokens, POSIX is the more natural disambiguation @@ -160,8 +163,7 @@ regular expressions, say @{text "r\<^bsub>key\<^esub>"} and @{text "r\<^bsub>id\<^esub>"} for recognising keywords and identifiers, respectively. There are a few underlying (informal) rules behind -tokenising a string in a POSIX \cite{POSIX} fashion according to a -collection of regular expressions: +tokenising a string in a POSIX \cite{POSIX} fashion: \begin{itemize} \item[$\bullet$] \emph{The Longest Match Rule} (or \emph{``{M}aximal {M}unch {R}ule''}): @@ -226,7 +228,8 @@ alternative is used. This `tree view' leads naturally to the idea that regular expressions act as types and values as inhabiting those types (see, for example, \cite{HosoyaVouillonPierce2005}). The value for -the single `iteration', i.e.~the POSIX value, would look as follows +matching @{text "xy"} in a single `iteration', i.e.~the POSIX value, +would look as follows \begin{center} @{term "Stars [Seq (Char x) (Char y)]"} @@ -234,8 +237,10 @@ \noindent where @{const Stars} has only a single-element list for the single iteration and @{const Seq} indicates that @{term xy} is matched -by a sequence regular expression, which we will in what follows -write more formally as @{term "SEQ x y"}. +by a sequence regular expression. + +%, which we will in what follows +%write more formally as @{term "SEQ x y"}. Sulzmann and Lu give a simple algorithm to calculate a value that @@ -271,16 +276,6 @@ various subexpressions.'' \end{quote} -%\footnote{The relation @{text "\\<^bsub>r\<^esub>"} defined by Sulzmann and Lu \cite{Sulzmann2014} -%is a relation on the -%values for the regular expression @{term r}; but it only holds between -%@{term "v\<^sub>1"} and @{term "v\<^sub>2"} in cases where @{term "v\<^sub>1"} and @{term "v\<^sub>2"} have -%the same flattening (underlying string). So a counterexample to totality is -%given by taking two values @{term "v\<^sub>1"} and @{term "v\<^sub>2"} for @{term r} that -%have different flattenings (see Section~\ref{posixsec}). A different -%relation @{text "\\<^bsub>r,s\<^esub>"} on the set of values for @{term r} -%with flattening @{term s} is definable by the same approach, and is indeed -%total; but that is not what Proposition 1 of \cite{Sulzmann2014} does.} \noindent {\bf Contributions:} We have implemented in Isabelle/HOL the @@ -297,11 +292,14 @@ that given a string and a regular expression uniquely determines this value. We also show that our definition is equivalent to an ordering of values based on positions by Okui and Suzuki \cite{OkuiSuzuki2010}. -Derivatives as calculated by Brzozowski's method are usually more complex -regular expressions than the initial one; various optimisations are -possible. We prove the correctness when simplifications of @{term "ALT ZERO -r"}, @{term "ALT r ZERO"}, @{term "SEQ ONE r"} and @{term "SEQ r ONE"} to -@{term r} are applied. We extend our results to ??? + +%Derivatives as calculated by Brzozowski's method are usually more complex +%regular expressions than the initial one; various optimisations are +%possible. We prove the correctness when simplifications of @{term "ALT ZERO r"}, +%@{term "ALT r ZERO"}, @{term "SEQ ONE r"} and @{term "SEQ r ONE"} to +%@{term r} are applied. + +We extend our results to ??? *}