lexing: comparison thys/Paper/Paper.thy

equal deleted inserted replaced

-:5a1196466a9c
+:e3eb82ea2244
 and only if \mbox{@{term "s \<in> L(der c r)"}}. The beauty of Brzozowski's
 derivatives is that they are neatly expressible in any functional language,
 and easily definable and reasoned about in theorem provers---the definitions
 just consist of inductive datatypes and simple recursive functions. A
 completely formalised correctness proof of this matcher in for example HOL4
-has been mentioned in~\cite{Owens2008}. Another one in Isabelle/HOL is part
+has been mentioned by Owens and Slind~\cite{Owens2008}. Another one in Isabelle/HOL is part
-of the work in \cite{Krauss2011}.
+of the work by Krauss and Nipkow \cite{Krauss2011}.
 One limitation of Brzozowski's matcher is that it only generates a YES/NO
 answer for whether a string is being matched by a regular expression.
 Sulzmann and Lu \cite{Sulzmann2014} extended this matcher to allow
 generation not just of a YES/NO answer but of an actual matching, called a
 proof idea is inspired by work of Frisch and Cardelli \cite{Frisch2004} on a
 GREEDY regular expression matching algorithm. Beginning with our
 observations that, without evidence that it is transitive, it cannot be
 called an ``order relation'', and that the relation is called a ``total
 order'' despite being evidently not total\footnote{The relation @{text
-"\<ge>\<^bsub>r\<^esub>"} defined in \cite{Sulzmann2014} is a relation on the
+"\<ge>\<^bsub>r\<^esub>"} defined by Sulzmann and Lu \cite{Sulzmann2014} is a relation on the
 values for the regular expression @{term r}; but it only holds between
-@{term v} and @{term "v'"} in cases where @{term v} and @{term "v'"} have
+@{term "v\<^sub>1"} and @{term "v\<^sub>2"} in cases where @{term "v\<^sub>1"} and @{term "v\<^sub>2"} have
 the same flattening (underlying string). So a counterexample to totality is
-given by taking two values @{term v} and @{term "v'"} for @{term r} that
+given by taking two values @{term "v\<^sub>1"} and @{term "v\<^sub>2"} for @{term r} that
 have different flattenings (see Section~\ref{posixsec}). A different
 relation @{text "\<ge>\<^bsub>r,s\<^esub>"} on the set of values for @{term r}
 with flattening @{term s} is definable by the same approach, and is indeed
 total; but that is not what Proposition 1 of \cite{Sulzmann2014} does.}, we
 identify problems with this approach (of which some of the proofs are not
 being a {\em POSIX value} for a regular expression @{term r} and a string
 @{term s}; we show that the algorithm computes such a value and that such a
 value is unique. Proofs are both done by hand and checked in Isabelle/HOL.
 The experience of doing our proofs has been that this mechanical checking
 was absolutely essential: this subject area has hidden snares. This was also
-noted by Kuklewitz \cite{Kuklewicz} who found that nearly all POSIX matching
+noted by Kuklewicz \cite{Kuklewicz} who found that nearly all POSIX matching
 implementations are ``buggy'' \cite[Page 203]{Sulzmann2014}.
 If a regular expression matches a string, then in general there is more than
 one way of how the string is matched. There are two commonly used
 disambiguation strategies to generate a unique answer: one is called GREEDY
 identifiers, respectively. There are two underlying (informal) rules behind
 tokenising a string in a POSIX fashion according to a collection of regular
 expressions:
 \begin{itemize}
-\item[$\bullet$] \underline{The Longest Match Rule (or ``maximal munch rule''):}
+\item[$\bullet$] \underline{The Longest Match Rule (or ``maximal munch rule''):}\smallskip
 The longest initial substring matched by any regular expression is taken as
 next token.\smallskip
-\item[$\bullet$] \underline{Priority Rule:}
+\item[$\bullet$] \underline{Priority Rule:}\smallskip
 For a particular longest initial substring, the first regular expression
 that can match determines the token.
 \end{itemize}
 section {* POSIX Regular Expression Matching\label{posixsec} *}
 text {*
-The clever idea in \cite{Sulzmann2014} is to introduce values for encoding
+The clever idea by Sulzmann and Lu \cite{Sulzmann2014} is to introduce values for encoding
 \emph{how} a regular expression matches a string and then define a
 function on values that mirrors (but inverts) the construction of the
 derivative on regular expressions. \emph{Values} are defined as the
 inductive datatype
 @{term "Right v"} $\mid$
 @{term "Seq v\<^sub>1 v\<^sub>2"} $\mid$
 @{term "Stars vs"}
 \end{center}
-\noindent where we use @{term vs} to stand for a list of values. (This is
+\noindent where we use @{term vs} to stand for a list of
-similar to the approach taken by Frisch and Cardelli for GREEDY matching
+values. (This is similar to the approach taken by Frisch and
-\cite{Frisch2004}, and Sulzmann and Lu \cite{Sulzmann2014} for POSIX
+Cardelli for GREEDY matching \cite{Frisch2004}, and Sulzmann and Lu
-matching). The string underlying a value can be calculated by the @{const
+for POSIX matching \cite{Sulzmann2014}). The string underlying a
-flat} function, written @{term "flat DUMMY"} and defined as:
+value can be calculated by the @{const flat} function, written
+@{term "flat DUMMY"} and defined as:
 \begin{center}
 \begin{tabular}{lcl}
 @{thm (lhs) flat.simps(1)} & $\dn$ & @{thm (rhs) flat.simps(1)}\\
 @{thm (lhs) flat.simps(2)} & $\dn$ & @{thm (rhs) flat.simps(2)}\\
 without giving evidence that it is transitive.} between values and argue
 that there is a maximum value, as given by the derivative-based algorithm.
 In contrast, we shall introduce a simple inductive definition that
 specifies directly what a \emph{POSIX value} is, incorporating the
 POSIX-specific choices into the side-conditions of our rules. Our
-definition is inspired by the matching relation given in
+definition is inspired by the matching relation given by Vansummeren
 \cite{Vansummeren2006}. The relation we define is ternary and written as
 \mbox{@{term "s \<in> r \<rightarrow> v"}}, relating strings, regular expressions and
 values.
 \begin{center}
 section {* Conclusion *}
 text {*
-We have implemented the POSIX value calculation algorithm introduced in
+We have implemented the POSIX value calculation algorithm introduced by
+Sulzmann and Lu
 \cite{Sulzmann2014}. Our implementation is nearly identical to the
 original and all modifications we introduced are harmless (like our char-clause for
 @{text inj}). We have proved this algorithm to be correct, but correct
 according to our own specification of what POSIX values are. Our
-specification (inspired from work in \cite{Vansummeren2006}) appears to be
+specification (inspired from work by Vansummeren \cite{Vansummeren2006}) appears to be
 much simpler than in \cite{Sulzmann2014} and our proofs are nearly always
 straightforward. We have attempted to formalise the original proof
 by Sulzmann and Lu \cite{Sulzmann2014}, but we believe it contains
 unfillable gaps. In the online version of \cite{Sulzmann2014}, the authors
 already acknowledge some small problems, but our experience suggests
 Nipkow \cite{Nipkow98}. This lexer also splits up strings into longest
 initial substrings, but Nipkow's algorithm is not completely
 computational. The algorithm by Sulzmann and Lu, in contrast, can be
 implemented with easy in any functional language. A bespoke lexer for the
 Imp-language is formalised in Coq as part of the Software Foundations book
-\cite{Pierce2015}. The disadvantage of such bespoke lexers is that they
+by Pierce et al\cite{Pierce2015}. The disadvantage of such bespoke lexers is that they
 do not generalise easily to more advanced features.
 Our formalisation is available from
-\url{http://www.inf.kcl.ac.uk/staff/urbanc/lex}.
+\url{http://www.inf.kcl.ac.uk/staff/urbanc/lex}.\medskip
-%\noindent
+\noindent
-%{\bf Acknowledgements:}
+{\bf Acknowledgements:}
-%We are grateful for the comments we received from anonymous
+We are very grateful to Martin Sulzmann for his comments on our work and
-%referees.
+also patiently explaining to us the details in \cite{Sulzmann2014}.
 \small
 \bibliographystyle{plain}
 \bibliography{root}
 *}

changeset 152	e3eb82ea2244
parent 151	5a1196466a9c
child 154	2de3cf684ba0