--- a/thys/Journal/Paper.thy Wed Aug 15 13:48:57 2018 +0100
+++ b/thys/Journal/Paper.thy Thu Aug 16 01:12:00 2018 +0100
@@ -4,6 +4,7 @@
"../Lexer"
"../Simplifying"
"../Positions"
+ "../Sulzmann"
"~~/src/HOL/Library/LaTeXsugar"
begin
@@ -46,7 +47,7 @@
CHAR ("_" [1000] 80) and
ALT ("_ + _" [77,77] 78) and
SEQ ("_ \<cdot> _" [77,77] 78) and
- STAR ("_\<^sup>\<star>" [78] 78) and
+ STAR ("_\<^sup>\<star>" [79] 78) and
val.Void ("Empty" 78) and
val.Char ("Char _" [1000] 78) and
@@ -122,36 +123,38 @@
text {*
Brzozowski \cite{Brzozowski1964} introduced the notion of the {\em
-derivative} @{term "der c r"} of a regular expression @{text r} w.r.t.\ a
-character~@{text c}, and showed that it gave a simple solution to the
-problem of matching a string @{term s} with a regular expression @{term r}:
-if the derivative of @{term r} w.r.t.\ (in succession) all the characters of
-the string matches the empty string, then @{term r} matches @{term s} (and
-{\em vice versa}). The derivative has the property (which may almost be
-regarded as its specification) that, for every string @{term s} and regular
-expression @{term r} and character @{term c}, one has @{term "cs \<in> L(r)"} if
-and only if \mbox{@{term "s \<in> L(der c r)"}}. The beauty of Brzozowski's
-derivatives is that they are neatly expressible in any functional language,
-and easily definable and reasoned about in theorem provers---the definitions
-just consist of inductive datatypes and simple recursive functions. A
+derivative} @{term "der c r"} of a regular expression @{text r} w.r.t.\
+a character~@{text c}, and showed that it gave a simple solution to the
+problem of matching a string @{term s} with a regular expression @{term
+r}: if the derivative of @{term r} w.r.t.\ (in succession) all the
+characters of the string matches the empty string, then @{term r}
+matches @{term s} (and {\em vice versa}). The derivative has the
+property (which may almost be regarded as its specification) that, for
+every string @{term s} and regular expression @{term r} and character
+@{term c}, one has @{term "cs \<in> L(r)"} if and only if \mbox{@{term "s \<in> L(der c r)"}}.
+The beauty of Brzozowski's derivatives is that
+they are neatly expressible in any functional language, and easily
+definable and reasoned about in theorem provers---the definitions just
+consist of inductive datatypes and simple recursive functions. A
mechanised correctness proof of Brzozowski's matcher in for example HOL4
-has been mentioned by Owens and Slind~\cite{Owens2008}. Another one in Isabelle/HOL is part
-of the work by Krauss and Nipkow \cite{Krauss2011}. And another one in Coq is given
-by Coquand and Siles \cite{Coquand2012}.
+has been mentioned by Owens and Slind~\cite{Owens2008}. Another one in
+Isabelle/HOL is part of the work by Krauss and Nipkow \cite{Krauss2011}.
+And another one in Coq is given by Coquand and Siles \cite{Coquand2012}.
-If a regular expression matches a string, then in general there is more than
-one way of how the string is matched. There are two commonly used
-disambiguation strategies to generate a unique answer: one is called GREEDY
-matching \cite{Frisch2004} and the other is POSIX
-matching~\cite{POSIX,Kuklewicz,OkuiSuzuki2010,Sulzmann2014,Vansummeren2006}. For example consider
-the string @{term xy} and the regular expression \mbox{@{term "STAR (ALT
-(ALT x y) xy)"}}. Either the string can be matched in two `iterations' by
-the single letter-regular expressions @{term x} and @{term y}, or directly
-in one iteration by @{term xy}. The first case corresponds to GREEDY
-matching, which first matches with the left-most symbol and only matches the
-next symbol in case of a mismatch (this is greedy in the sense of preferring
-instant gratification to delayed repletion). The second case is POSIX
-matching, which prefers the longest match.
+If a regular expression matches a string, then in general there is more
+than one way of how the string is matched. There are two commonly used
+disambiguation strategies to generate a unique answer: one is called
+GREEDY matching \cite{Frisch2004} and the other is POSIX
+matching~\cite{POSIX,Kuklewicz,OkuiSuzuki2010,Sulzmann2014,Vansummeren2006}.
+For example consider the string @{term xy} and the regular expression
+\mbox{@{term "STAR (ALT (ALT x y) xy)"}}. Either the string can be
+matched in two `iterations' by the single letter-regular expressions
+@{term x} and @{term y}, or directly in one iteration by @{term xy}. The
+first case corresponds to GREEDY matching, which first matches with the
+left-most symbol and only matches the next symbol in case of a mismatch
+(this is greedy in the sense of preferring instant gratification to
+delayed repletion). The second case is POSIX matching, which prefers the
+longest match.
In the context of lexing, where an input string needs to be split up
into a sequence of tokens, POSIX is the more natural disambiguation
@@ -160,8 +163,7 @@
regular expressions, say @{text "r\<^bsub>key\<^esub>"} and @{text
"r\<^bsub>id\<^esub>"} for recognising keywords and identifiers,
respectively. There are a few underlying (informal) rules behind
-tokenising a string in a POSIX \cite{POSIX} fashion according to a
-collection of regular expressions:
+tokenising a string in a POSIX \cite{POSIX} fashion:
\begin{itemize}
\item[$\bullet$] \emph{The Longest Match Rule} (or \emph{``{M}aximal {M}unch {R}ule''}):
@@ -226,7 +228,8 @@
alternative is used. This `tree view' leads naturally to the idea that
regular expressions act as types and values as inhabiting those types
(see, for example, \cite{HosoyaVouillonPierce2005}). The value for
-the single `iteration', i.e.~the POSIX value, would look as follows
+matching @{text "xy"} in a single `iteration', i.e.~the POSIX value,
+would look as follows
\begin{center}
@{term "Stars [Seq (Char x) (Char y)]"}
@@ -234,8 +237,10 @@
\noindent where @{const Stars} has only a single-element list for the
single iteration and @{const Seq} indicates that @{term xy} is matched
-by a sequence regular expression, which we will in what follows
-write more formally as @{term "SEQ x y"}.
+by a sequence regular expression.
+
+%, which we will in what follows
+%write more formally as @{term "SEQ x y"}.
Sulzmann and Lu give a simple algorithm to calculate a value that
@@ -271,16 +276,6 @@
various subexpressions.''
\end{quote}
-%\footnote{The relation @{text "\<ge>\<^bsub>r\<^esub>"} defined by Sulzmann and Lu \cite{Sulzmann2014}
-%is a relation on the
-%values for the regular expression @{term r}; but it only holds between
-%@{term "v\<^sub>1"} and @{term "v\<^sub>2"} in cases where @{term "v\<^sub>1"} and @{term "v\<^sub>2"} have
-%the same flattening (underlying string). So a counterexample to totality is
-%given by taking two values @{term "v\<^sub>1"} and @{term "v\<^sub>2"} for @{term r} that
-%have different flattenings (see Section~\ref{posixsec}). A different
-%relation @{text "\<ge>\<^bsub>r,s\<^esub>"} on the set of values for @{term r}
-%with flattening @{term s} is definable by the same approach, and is indeed
-%total; but that is not what Proposition 1 of \cite{Sulzmann2014} does.}
\noindent {\bf Contributions:} We have implemented in Isabelle/HOL the
@@ -297,11 +292,14 @@
that given a string and a regular expression uniquely determines this value.
We also show that our definition is equivalent to an ordering
of values based on positions by Okui and Suzuki \cite{OkuiSuzuki2010}.
-Derivatives as calculated by Brzozowski's method are usually more complex
-regular expressions than the initial one; various optimisations are
-possible. We prove the correctness when simplifications of @{term "ALT ZERO
-r"}, @{term "ALT r ZERO"}, @{term "SEQ ONE r"} and @{term "SEQ r ONE"} to
-@{term r} are applied. We extend our results to ???
+
+%Derivatives as calculated by Brzozowski's method are usually more complex
+%regular expressions than the initial one; various optimisations are
+%possible. We prove the correctness when simplifications of @{term "ALT ZERO r"},
+%@{term "ALT r ZERO"}, @{term "SEQ ONE r"} and @{term "SEQ r ONE"} to
+%@{term r} are applied.
+
+We extend our results to ???
*}