lexing: comparison thys/Journal/Paper.thy

equal deleted inserted replaced

-:804fbb227568
+:95b3880d428f
 theory Paper
 imports
 "../Lexer"
 "../Simplifying"
 "../Positions"
+"../Sulzmann"
 "~~/src/HOL/Library/LaTeXsugar"
 begin
 lemma Suc_0_fold:
 "Suc 0 = 1"
 ZERO ("\<^bold>0" 81) and
 ONE ("\<^bold>1" 81) and
 CHAR ("_" [1000] 80) and
 ALT ("_ + _" [77,77] 78) and
 SEQ ("_ \<cdot> _" [77,77] 78) and
-STAR ("_\<^sup>\<star>" [78] 78) and
+STAR ("_\<^sup>\<star>" [79] 78) and
 val.Void ("Empty" 78) and
 val.Char ("Char _" [1000] 78) and
 val.Left ("Left _" [79] 78) and
 val.Right ("Right _" [1000] 78) and
 text {*
 Brzozowski \cite{Brzozowski1964} introduced the notion of the {\em
-derivative} @{term "der c r"} of a regular expression @{text r} w.r.t.\ a
+derivative} @{term "der c r"} of a regular expression @{text r} w.r.t.\
-character~@{text c}, and showed that it gave a simple solution to the
+a character~@{text c}, and showed that it gave a simple solution to the
-problem of matching a string @{term s} with a regular expression @{term r}:
+problem of matching a string @{term s} with a regular expression @{term
-if the derivative of @{term r} w.r.t.\ (in succession) all the characters of
+r}: if the derivative of @{term r} w.r.t.\ (in succession) all the
-the string matches the empty string, then @{term r} matches @{term s} (and
+characters of the string matches the empty string, then @{term r}
-{\em vice versa}). The derivative has the property (which may almost be
+matches @{term s} (and {\em vice versa}). The derivative has the
-regarded as its specification) that, for every string @{term s} and regular
+property (which may almost be regarded as its specification) that, for
-expression @{term r} and character @{term c}, one has @{term "cs \<in> L(r)"} if
+every string @{term s} and regular expression @{term r} and character
-and only if \mbox{@{term "s \<in> L(der c r)"}}. The beauty of Brzozowski's
+@{term c}, one has @{term "cs \<in> L(r)"} if and only if \mbox{@{term "s \<in> L(der c r)"}}.
-derivatives is that they are neatly expressible in any functional language,
+The beauty of Brzozowski's derivatives is that
-and easily definable and reasoned about in theorem provers---the definitions
+they are neatly expressible in any functional language, and easily
-just consist of inductive datatypes and simple recursive functions. A
+definable and reasoned about in theorem provers---the definitions just
+consist of inductive datatypes and simple recursive functions. A
 mechanised correctness proof of Brzozowski's matcher in for example HOL4
-has been mentioned by Owens and Slind~\cite{Owens2008}. Another one in Isabelle/HOL is part
+has been mentioned by Owens and Slind~\cite{Owens2008}. Another one in
-of the work by Krauss and Nipkow \cite{Krauss2011}. And another one in Coq is given
+Isabelle/HOL is part of the work by Krauss and Nipkow \cite{Krauss2011}.
-by Coquand and Siles \cite{Coquand2012}.
+And another one in Coq is given by Coquand and Siles \cite{Coquand2012}.
-If a regular expression matches a string, then in general there is more than
+If a regular expression matches a string, then in general there is more
-one way of how the string is matched. There are two commonly used
+than one way of how the string is matched. There are two commonly used
-disambiguation strategies to generate a unique answer: one is called GREEDY
+disambiguation strategies to generate a unique answer: one is called
-matching \cite{Frisch2004} and the other is POSIX
+GREEDY matching \cite{Frisch2004} and the other is POSIX
-matching~\cite{POSIX,Kuklewicz,OkuiSuzuki2010,Sulzmann2014,Vansummeren2006}. For example consider
+matching~\cite{POSIX,Kuklewicz,OkuiSuzuki2010,Sulzmann2014,Vansummeren2006}.
-the string @{term xy} and the regular expression \mbox{@{term "STAR (ALT
+For example consider the string @{term xy} and the regular expression
-(ALT x y) xy)"}}. Either the string can be matched in two `iterations' by
+\mbox{@{term "STAR (ALT (ALT x y) xy)"}}. Either the string can be
-the single letter-regular expressions @{term x} and @{term y}, or directly
+matched in two `iterations' by the single letter-regular expressions
-in one iteration by @{term xy}. The first case corresponds to GREEDY
+@{term x} and @{term y}, or directly in one iteration by @{term xy}. The
-matching, which first matches with the left-most symbol and only matches the
+first case corresponds to GREEDY matching, which first matches with the
-next symbol in case of a mismatch (this is greedy in the sense of preferring
+left-most symbol and only matches the next symbol in case of a mismatch
-instant gratification to delayed repletion). The second case is POSIX
+(this is greedy in the sense of preferring instant gratification to
-matching, which prefers the longest match.
+delayed repletion). The second case is POSIX matching, which prefers the
+longest match.
 In the context of lexing, where an input string needs to be split up
 into a sequence of tokens, POSIX is the more natural disambiguation
 strategy for what programmers consider basic syntactic building blocks
 in their programs.  These building blocks are often specified by some
 regular expressions, say @{text "r\<^bsub>key\<^esub>"} and @{text
 "r\<^bsub>id\<^esub>"} for recognising keywords and identifiers,
 respectively. There are a few underlying (informal) rules behind
-tokenising a string in a POSIX \cite{POSIX} fashion according to a
+tokenising a string in a POSIX \cite{POSIX} fashion:
-collection of regular expressions:
 \begin{itemize}
 \item[$\bullet$] \emph{The Longest Match Rule} (or \emph{``{M}aximal {M}unch {R}ule''}):
 The longest initial substring matched by any regular expression is taken as
 next token.\smallskip
 Char} are constructors for values. @{text Stars} records how many
 iterations were used; @{text Left}, respectively @{text Right}, which
 alternative is used. This `tree view' leads naturally to the idea that
 regular expressions act as types and values as inhabiting those types
 (see, for example, \cite{HosoyaVouillonPierce2005}).  The value for
-the single `iteration', i.e.~the POSIX value, would look as follows
+matching @{text "xy"} in a single `iteration', i.e.~the POSIX value,
+would look as follows
 \begin{center}
 @{term "Stars [Seq (Char x) (Char y)]"}
 \end{center}
 \noindent where @{const Stars} has only a single-element list for the
 single iteration and @{const Seq} indicates that @{term xy} is matched
-by a sequence regular expression, which we will in what follows
+by a sequence regular expression.
-write more formally as @{term "SEQ x y"}.
+%, which we will in what follows
+%write more formally as @{term "SEQ x y"}.
 Sulzmann and Lu give a simple algorithm to calculate a value that
 appears to be the value associated with POSIX matching.  The challenge
 then is to specify that value, in an algorithm-independent fashion,
 \it{}``The POSIX strategy is more complicated than the greedy because of
 the dependence on information about the length of matched strings in the
 various subexpressions.''
 \end{quote}
-%\footnote{The relation @{text "\<ge>\<^bsub>r\<^esub>"} defined by Sulzmann and Lu \cite{Sulzmann2014}
-%is a relation on the
-%values for the regular expression @{term r}; but it only holds between
-%@{term "v\<^sub>1"} and @{term "v\<^sub>2"} in cases where @{term "v\<^sub>1"} and @{term "v\<^sub>2"} have
-%the same flattening (underlying string). So a counterexample to totality is
-%given by taking two values @{term "v\<^sub>1"} and @{term "v\<^sub>2"} for @{term r} that
-%have different flattenings (see Section~\ref{posixsec}). A different
-%relation @{text "\<ge>\<^bsub>r,s\<^esub>"} on the set of values for @{term r}
-%with flattening @{term s} is definable by the same approach, and is indeed
-%total; but that is not what Proposition 1 of \cite{Sulzmann2014} does.}
 \noindent {\bf Contributions:} We have implemented in Isabelle/HOL the
 derivative-based regular expression matching algorithm of
 Sulzmann and Lu \cite{Sulzmann2014}. We have proved the correctness of this
 informal proof contains gaps, and possible fixes are not fully worked out.}
 Our specification of a POSIX value consists of a simple inductive definition
 that given a string and a regular expression uniquely determines this value.
 We also show that our definition is equivalent to an ordering
 of values based on positions by Okui and Suzuki \cite{OkuiSuzuki2010}.
-Derivatives as calculated by Brzozowski's method are usually more complex
-regular expressions than the initial one; various optimisations are
+%Derivatives as calculated by Brzozowski's method are usually more complex
-possible. We prove the correctness when simplifications of @{term "ALT ZERO
+%regular expressions than the initial one; various optimisations are
-r"}, @{term "ALT r ZERO"}, @{term "SEQ ONE r"} and @{term "SEQ r ONE"} to
+%possible. We prove the correctness when simplifications of @{term "ALT ZERO r"},
-@{term r} are applied. We extend our results to ???
+%@{term "ALT r ZERO"}, @{term "SEQ ONE r"} and @{term "SEQ r ONE"} to
+%@{term r} are applied.
+We extend our results to ???
 *}
 section {* Preliminaries *}

changeset 287	95b3880d428f
parent 280	c840a99a3e05
child 288	9ab8609c66c5